Indexing filters

/!\ This feature is brand new and the API is not stable yet.

Moin has had a built-in indexing search engine for quite some time (disabled by default, as it still has issues).

By using a search index (instead of going through all content byte by byte, as the default simple search does), it is also possible to search attachments. Code for that was recently added to the development version.

The attachment indexing code was written to use Filter plugins. So when it encounters a "*.doc" file, it will guess its mimetype as "application/ms-word" and try to load the application_ms_word.py filter plugin to read it. If it doesn't find that plugin, it will first fall back to a plugin for the whole major type, application.py (this makes more sense for text/* mimetypes), and if it doesn't find that either, it will fall back to application_octet_stream.py.
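The lookup order described above can be sketched as a small helper. This is only an illustration, not moin's actual code: the function name candidate_filter_names is made up, and the rule that every character not allowed in a module name becomes "_" is an assumption drawn from the application/ms-word example. It is written for a current Python interpreter.

```python
import re

def candidate_filter_names(mimetype):
    """Return filter module names to try, most specific first (hypothetical helper)."""
    major, minor = mimetype.split("/", 1)

    def to_module(name):
        # Assumption: characters that cannot appear in a module name become "_".
        return re.sub(r"[^A-Za-z0-9_]", "_", name)

    names = [
        to_module(major + "_" + minor),   # e.g. application_ms_word
        to_module(major),                 # e.g. application
        "application_octet_stream",       # last-resort filter
    ]
    # Drop duplicates while keeping the order.
    unique = []
    for name in names:
        if name not in unique:
            unique.append(name)
    return unique
```

For "application/ms-word" this yields application_ms_word, application and application_octet_stream, in that order. The indexer would then load the first of these plugins that exists.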

The filter plugin simply gets the index object and the attachment filename as parameters. It then reads the file and returns a Python unicode object with the content.

The content value returned does not need to be pretty - it is just used for indexing (and thus should contain all content words from the document).
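To illustrate "does not need to be pretty", here is a rough sketch of a filter for HTML-like files. It is a hypothetical example, not one of the filters shipped with moin: it assumes the file is UTF-8, throws away anything that looks like a tag and returns the remaining words separated by single spaces. It is written for a current Python interpreter; only the execute(indexobj, filename) signature comes from the text above.

```python
import io
import re

def execute(indexobj, filename):
    # indexobj is not used in this sketch; it is accepted to match the signature.
    # Assumption: the file is UTF-8; undecodable bytes are replaced.
    with io.open(filename, "r", encoding="utf-8", errors="replace") as f:
        data = f.read()
    # Replace anything that looks like a tag with a space.
    data = re.sub(r"<[^>]*>", " ", data)
    # Collapse all whitespace; layout does not matter for indexing.
    return u" ".join(data.split())
```

All layout is lost, but every content word survives, which is all the indexer needs.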

A simple filter plugin

# -*- coding: iso-8859-1 -*-
"""
    MoinMoin - plain text file Filter

    We try to support more than ASCII here.

    @copyright: 2006 by ThomasWaldmann MoinMoin:ThomasWaldmann
    @license: GNU GPL, see COPYING for details.
"""

import codecs

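def execute(indexobj, filename):
    # Try likely encodings in turn and return the first successful decode.
    for enc in ('utf-8', 'iso-8859-15', 'iso-8859-1', ):
        try:
            f = codecs.open(filename, "r", enc)
            try:
                data = f.read()
            finally:
                f.close()  # close the file even if decoding fails
            return data
        except UnicodeError:
            pass
    # Fallback: read the raw bytes and decode them as ASCII,
    # replacing everything that is not ASCII.
    f = file(filename, "rb")
    try:
        data = f.read()
    finally:
        f.close()
    data = data.decode('ascii', 'replace')
    return data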

What does this filter do?
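Reading the code above, the filter tries to decode the file as UTF-8 first, then as ISO-8859-15, then as ISO-8859-1, and returns the first result that decodes without an error. Only if all of them fail does it decode as ASCII and replace everything else. The same idea, applied to a byte string instead of a file, can be sketched like this (decode_with_fallback is a made-up name for illustration, written for a current Python interpreter):

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-15", "iso-8859-1")):
    """Decode bytes with the first encoding that works (illustrative helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeError:
            pass  # not valid in this encoding, try the next one
    # Last resort: keep the ASCII part, replace the rest with U+FFFD.
    return raw.decode("ascii", "replace")
```

Valid UTF-8 input is returned as is, while a Latin-style byte such as 0xE9 is not valid UTF-8 on its own and falls through to ISO-8859-15. Since the ISO-8859 codecs appear to accept every byte value, the ASCII fallback is probably only reached if the encoding list is shortened.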

More filters are available on FilterMarket.

Guidelines for filters

How to test a filter

You can easily test a filter without even having moin installed - you just need Python and a sample document as input.

Put a sample document test.doc and your filter doc.py (you can rename it to the mimetype-like filename later) in your working directory.

Start the Python interpreter (use Python 2.3 or 2.4) from there, then:

Python 2.3.5 (#2, Sep  4 2005, 22:01:42) 
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import doc
>>> doc.execute(None, "test.doc")
u'It works.'

The "u" prefix means the result is a unicode object; "It works." is the content of your test.doc document file.
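The manual check above can also be scripted. The following is a minimal sketch, assuming a current Python interpreter: check_filter and sample_execute are hypothetical names, and sample_execute merely stands in for the execute() function of your own filter.

```python
import io

def sample_execute(indexobj, filename):
    # Hypothetical stand-in for your own filter's execute().
    with io.open(filename, "r", encoding="utf-8") as f:
        return f.read()

def check_filter(execute_func, filename):
    """Call a filter as in the interactive session and verify the result type."""
    result = execute_func(None, filename)  # no index object needed for a dry run
    if not isinstance(result, type(u"")):
        raise TypeError("filter returned %s, expected unicode text"
                        % type(result).__name__)
    return result
```

Calling check_filter(sample_execute, "test.txt") returns the extracted text, or raises TypeError if the filter hands back a byte string instead of unicode.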

MoinMoin: FiltersForIndexing (last edited 2007-10-29 19:08:11 by localhost)