Indexing filters
This feature is brand new and the API is not stable yet.
Moin has had a built-in indexing search engine for quite some time (disabled by default, as it still has issues).
By using a search index (instead of going through all content byte by byte, as the default simple search does), it is also possible to search attachments; the code for that was recently added to the development version.
The attachment indexing code was written to use Filter plugins. So when it encounters a "*.doc" file, it will guess its mimetype as "application/ms-word" and try to load the application_ms_word.py filter plugin to read it. If it doesn't find that plugin, it will first fall back to a class plugin application.py (this makes more sense for text/* mimetypes), and if it doesn't find that either, it will fall back to application_octet_stream.py.
The filter plugin simply gets the index object and the attachment filename as parameters. It then reads the file and returns a Python unicode object with the content.
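The lookup order described above can be sketched roughly as follows. This is only an illustration, not Moin's actual plugin loader: the function name candidate_filter_names is hypothetical, and the name mangling (replacing "/" and "-" with "_") is inferred from the two plugin filenames mentioned above. The sketch is written for a current Python 3 interpreter.

```python
def candidate_filter_names(mimetype):
    """Return filter module names to try, most specific first.

    Hypothetical helper: mirrors the fallback order described in the text,
    not the real Moin implementation.
    """
    major, _, minor = mimetype.partition("/")

    def to_name(part):
        # assumption: "-" becomes "_" (application/ms-word -> application_ms_word)
        return part.replace("-", "_")

    names = []
    if minor:
        names.append(to_name(major) + "_" + to_name(minor))  # exact mimetype
    names.append(to_name(major))                             # mimetype class
    names.append("application_octet_stream")                 # last resort

    # remove duplicates while keeping the order
    ordered = []
    for name in names:
        if name not in ordered:
            ordered.append(name)
    return ordered
```

For "application/ms-word" this yields application_ms_word, then application, then application_octet_stream, which matches the order given above.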
The content value returned does not need to be pretty - it is just used for indexing (and thus should contain all content words from the document).
A simple filter plugin
    # -*- coding: iso-8859-1 -*-
    """
        MoinMoin - plain text file Filter

        We try to support more than ASCII here.

        @copyright: 2006 by ThomasWaldmann MoinMoin:ThomasWaldmann
        @license: GNU GPL, see COPYING for details.
    """

    import codecs

    def execute(indexobj, filename):
        for enc in ('utf-8', 'iso-8859-15', 'iso-8859-1', ):
            try:
                f = codecs.open(filename, "r", enc)
                data = f.read()
                f.close()
                return data
            except UnicodeError, err:
                pass
        f = file(filename, "r")
        data = f.read()
        f.close()
        data = data.decode('ascii', 'replace')
        return data
What does this filter do?
- it has a list of popular text file encodings: utf-8, iso-8859-15 and iso-8859-1.
- it tries to open and read the input file with each encoding in turn
- as soon as one encoding succeeds, it returns the complete content as a unicode object
- if no encoding succeeds, it finally reads the file as raw bytes and forces decoding as ascii, replacing any bytes that cannot be decoded
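To see the fallback order in action without a file on disk, here is a small sketch of the same logic operating on a byte string. It is an illustration only: the helper name decode_with_fallback is hypothetical, and it is written for Python 3 (where the text type is str rather than unicode), not for the Python 2 versions the filter itself targets.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-15", "iso-8859-1")):
    """Decode bytes by trying each encoding in order (hypothetical helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc)  # the first encoding that decodes cleanly wins
        except UnicodeError:
            pass                    # try the next encoding
    # last resort, as in the filter: force ascii and replace undecodable bytes
    return raw.decode("ascii", "replace")
```

Valid UTF-8 input is decoded as UTF-8. A lone byte 0xA4 is not valid UTF-8, so it falls through to iso-8859-15, where it is the euro sign.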
More filters can be found on FilterMarket.
Guidelines for filters
- only return (decoded) unicode objects, not (encoded) strings
- try to avoid including unnecessary "trash" in the content value (as this will increase the size of the index and slow down searching)
- try to be fast, but correct
- try to avoid new dependencies (using additional modules not already used by moin)
- try to avoid platform-dependent stuff
- don't depend on non-free stuff
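As one possible way to follow the "avoid trash" guideline, a filter could tidy its extracted text before returning it. The helper below is a hypothetical sketch (clean_content is not part of Moin) that only uses the standard library, in line with the "no new dependencies" guideline. It is written for Python 3; what counts as trash for a real document format is an assumption here.

```python
import unicodedata

def clean_content(text):
    """Drop control characters and collapse whitespace (hypothetical helper)."""
    # assumption: control characters (Unicode category Cc) carry no indexable content
    no_controls = "".join(
        " " if unicodedata.category(ch) == "Cc" else ch for ch in text
    )
    # collapse every run of whitespace into a single space
    return " ".join(no_controls.split())
```

The result is not pretty, but it keeps all content words while keeping the index small.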
How to test a filter
You can easily test a filter without even having moin installed - you just need Python and a sample document as input.
Put a sample document test.doc and your filter doc.py (you can rename it to the mimetype-like filename later) in your working directory.
Start the Python interpreter (use Python 2.3 or 2.4) from there, then:
    Python 2.3.5 (#2, Sep 4 2005, 22:01:42)
    [GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import doc
    >>> doc.execute(None, "test.doc")
    u"It works."
The "u" means a unicode object; "It works." is the content of your test.doc document file.
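The same check can also be scripted. The sketch below is an assumption-laden illustration, not a Moin tool: check_filter, sample_execute and demo are hypothetical names, sample_execute is only a stand-in for your own filter's execute function, and the code targets Python 3, where the text type is str instead of unicode.

```python
import os
import tempfile

def check_filter(execute, filename):
    """Call a filter like the interactive session does and check the result type."""
    content = execute(None, filename)  # no index object is needed for a standalone test
    if not isinstance(content, str):   # Python 3 text type (Python 2: unicode)
        raise TypeError("filter returned %r, expected text" % type(content))
    return content

def sample_execute(indexobj, filename):
    # stand-in filter used only for this demonstration
    with open(filename, "rb") as f:
        return f.read().decode("utf-8")

def demo():
    # create a throwaway sample document, run the filter on it, clean up
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"It works.")
        return check_filter(sample_execute, path)
    finally:
        os.remove(path)
```

Calling demo() returns the text of the sample document; passing your own module's execute function to check_filter tests a real filter the same way.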