Indexing filters

/!\ This feature is brand new and the API is not stable yet.

Moin has had a built-in indexing search engine for quite some time (disabled by default, as it still has issues).

By using a search index (instead of going through all content byte by byte, as the default simple search does), it is also possible to search attachments. The code for that was recently added to the development version.

The attachment indexing code was written to use Filter plugins. When it encounters a "*.doc" file, it guesses its mimetype as "application/ms-word" and tries to load the application_ms_word.py filter plugin to read it. If it doesn't find that plugin, it first falls back to the class plugin application.py (this makes more sense for text/* mimetypes), and if it doesn't find that either, it falls back to application_octet_stream.py.
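The lookup order described above can be sketched as a small function. This is only an illustration, not moin's actual loader: the function name candidate_filter_modules and the rule of turning every non-identifier character into "_" are assumptions inferred from the example filenames.

```python
import re


def candidate_filter_modules(mimetype):
    """Return filter module names to try, most specific first.

    Hypothetical helper: it only mirrors the fallback order described
    in the text (exact mimetype, then class, then octet-stream).
    """
    major, _, minor = mimetype.partition("/")

    def to_module_name(text):
        # Assumed naming rule: anything that is not valid in a Python
        # identifier ("/", "-", ".", ...) becomes an underscore.
        return re.sub(r"[^0-9A-Za-z_]", "_", text)

    candidates = [
        to_module_name(major + "/" + minor),  # e.g. application_ms_word
        to_module_name(major),                # e.g. application
        "application_octet_stream",           # last resort
    ]

    # Remove duplicates while keeping the order.
    unique = []
    for name in candidates:
        if name not in unique:
            unique.append(name)
    return unique
```

A real loader would then try to import each name in turn and use the first plugin it finds.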

The filter plugin simply gets the index object and the attachment filename as parameters. It then reads the file and returns a Python unicode object with the content.

The content value returned does not need to be pretty - it is just used for indexing (and thus should contain all content words from the document).

A simple filter plugin

# -*- coding: iso-8859-1 -*-
"""
    MoinMoin - plain text file Filter

    We try to support more than ASCII here.

    @copyright: 2006 by ThomasWaldmann MoinMoin:ThomasWaldmann
    @license: GNU GPL, see COPYING for details.
"""

import codecs

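def execute(indexobj, filename):
    # Try the known text encodings in order; the first one that decodes wins.
    for enc in ('utf-8', 'iso-8859-15', 'iso-8859-1', ):
        try:
            f = codecs.open(filename, "r", enc)
            try:
                return f.read()
            finally:
                f.close()  # close the file even if decoding fails
        except UnicodeError:
            pass
    # No encoding worked: read the raw bytes and force an ASCII decode,
    # replacing every byte that cannot be decoded.
    f = open(filename, "rb")
    try:
        data = f.read()
    finally:
        f.close()
    return data.decode('ascii', 'replace')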

What does this filter do?

It has a list of common text file encodings: utf-8, iso-8859-15 and iso-8859-1.
It tries to open and read the input file with each encoding in turn.
As soon as one succeeds, it returns the complete content as a unicode object.
If no encoding succeeds, it reads the file as raw bytes and forces decoding as ASCII, replacing undecodable bytes.
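The same fallback can be tried on in-memory bytes, which makes it easier to see which encoding wins. This is a minimal sketch: decode_with_fallback is a hypothetical helper, not part of moin, and it returns the encoding name only for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-15", "iso-8859-1")):
    """Decode raw bytes with the first encoding that does not fail.

    Returns a (text, encoding_name) tuple. Hypothetical helper that
    mirrors the loop in the filter above.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeError:
            continue
    # Last resort, as in the filter: force ASCII and replace bad bytes.
    return raw.decode("ascii", "replace"), "ascii"


# "caf\xc3\xa9" is valid UTF-8, so the first encoding is used.
text, used = decode_with_fallback(b"caf\xc3\xa9")

# "caf\xe9" is not valid UTF-8, so the next encoding in the list is tried.
text2, used2 = decode_with_fallback(b"caf\xe9")
```

Because single-byte charsets like these accept almost any byte sequence, the order of the list matters: the first encoding that does not raise an error wins, even if it is not the one the file was written in.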

More filters are available on FilterMarket.

Guidelines for filters

only return (decoded) unicode objects, not (encoded) strings
try to avoid including unnecessary "trash" in the content value (as this will increase the size of the index and slow down searching)
try to be fast, but correct
try to avoid new dependencies (using additional modules not already used by moin)
try to avoid platform-dependent stuff
don't depend on non-free stuff
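As an illustration of the "trash" guideline, a filter could pass its text through a small cleanup step before returning it. This is only a sketch under assumptions: clean_content is a hypothetical helper, and which characters count as trash depends on the document format.

```python
import re

# Control characters that carry no searchable content (tab, newline and
# carriage return are left for the whitespace step below).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]+")
_WHITESPACE = re.compile(r"\s+")


def clean_content(text):
    """Return text with control characters removed and whitespace collapsed.

    Hypothetical helper; it expects an already decoded unicode object.
    """
    text = _CONTROL_CHARS.sub(u" ", text)
    return _WHITESPACE.sub(u" ", text).strip()
```

It uses only the re module from the standard library, so it adds no new dependency and nothing platform-specific.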

How to test a filter

You can easily test a filter without even having moin installed - you just need python and a sample document as input.

Put a sample document test.doc and your filter doc.py (you can later rename it to the mimetype-based filename) in your working directory.

Start the python interpreter (use python 2.3 or 2.4) from there, then:

Python 2.3.5 (#2, Sep  4 2005, 22:01:42) 
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import doc
>>> doc.execute(None, "test.doc")
u'It works.'

The "u" means a unicode object, "It works." is the content of your test.doc document file.
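The same check can also be scripted. The sketch below is an assumption-laden example, not part of moin: check_filter, sample_filter and demo are hypothetical names, and sample_filter is just a stand-in for your own filter's execute function.

```python
import os
import tempfile

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3: str is the unicode type


def check_filter(execute, filename):
    """Call a filter like the interpreter session above and verify
    that it returns decoded text rather than an encoded string."""
    content = execute(None, filename)  # None stands in for the index object
    if not isinstance(content, text_type):
        raise TypeError("filter returned %r, expected a unicode object"
                        % type(content))
    return content


def sample_filter(indexobj, filename):
    # Hypothetical stand-in for a real filter: reads UTF-8 text.
    f = open(filename, "rb")
    try:
        return f.read().decode("utf-8")
    finally:
        f.close()


def demo():
    # Create a throwaway sample document, run the filter on it, clean up.
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        os.write(fd, u"It works.".encode("utf-8"))
        os.close(fd)
        return check_filter(sample_filter, path)
    finally:
        os.remove(path)
```

To check your own filter, pass its execute function and the path of your sample document to check_filter instead of using sample_filter.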

MoinMoin: FiltersForIndexing
