Indexing Filters for Xapian-based search
Just wanted to do a quick poll about what moin users out there use for indexing their file attachments.
1. Filters used
Do you use only the filter modules we provide (see MoinMoin/filter/*.py), or do you also use additional Python filter or filter adapter code (if so, which)?
- Only the provided modules are used.
2. Filter programs quality / stability
With the questions below, I am searching for practical experience relating to:
- filtering coverage (does the filter work for all documents of that format)
- filtering output quality (does it really get the content out of the document)
- filtering stability (does the filter crash or hang)
Anything else you want to tell us about a filter is welcome, too.
If you do not use a specific filter because you have no use case for it or because the required software is not installed, you don't need to report that. But if you do not use it because it was always causing trouble for you, please DO tell us.
2.1. PDF (using pdftotext from poppler-utils)
Is our PDF filter plugin that calls pdftotext (from poppler-utils) working for you? If you do not use poppler-utils, but xpdf-utils, see below.
- Indexing PDF attachments is not working for me (my test file was created with Word 2007), though pdftotext works on the test file when run with the command line given in filter/application_pdf.py. - CMD,2012/05/08
hmm, that sounds strange. is pdftotext in the PATH? if you can't resolve the issue, please open a bug report for that and move these lines to the bug report. -- ThomasWaldmann 2012-05-11 10:28:45
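For anyone debugging a similar case: the PATH seen by the wiki process (for example under a web server) can differ from the PATH of an interactive shell, so the check is best run from the same environment as the wiki. A minimal sketch of such a check follows. It is written for Python 3, not the Python 2 that Moin 1.x runs on, and the helper names locate_tool and run_filter are hypothetical, not part of MoinMoin.

```python
import shutil
import subprocess


def locate_tool(name="pdftotext"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def run_filter(cmd, timeout=30):
    """Run an external filter command and return its stdout as text.

    Returns None if the program cannot be started, exits with a
    non-zero status, or does not finish within `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers "program not found"; TimeoutExpired covers hangs.
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("utf-8", errors="replace")
```

If locate_tool() returns None, pdftotext is not on the PATH of that process. Otherwise something like run_filter(["pdftotext", "-enc", "UTF-8", "-q", "test.pdf", "-"]) shows whether text comes out. The exact command line in filter/application_pdf.py may differ, and test.pdf is a placeholder.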
2.2. PDF (using pdftotext from xpdf-utils)
Is our PDF filter plugin that calls pdftotext (from xpdf-utils) working for you? If you do not use xpdf-utils, but poppler-utils, see above.
- Coverage good, scans without text information obviously don't work. Output quality good (no reports from our users). Stability good, no hangs from this plugin.
2.3. RTF
Is our RTF filter plugin that calls catdoc working for you?
2.4. MS Word
Is our MS Word filter plugin that calls antiword working for you?
- Coverage good. Output quality good. Stability good.
2.5. MS Excel
Is our MS Excel filter plugin that calls xls2csv (from the catdoc package) working for you?
2.6. MS PowerPoint
Is our MS PowerPoint filter plugin that calls catppt (from the catdoc package) working for you?
This is very new and was committed to the 1.8 repo after the 1.8.5 release. Feedback about catppt in general is also welcome.
2.7. OpenOffice.org / Open Document Format
Is our built-in OOo / ODF filter plugin working for you?
- Most of our files are in OOo format, so this filter gives us the greatest benefit. It also caused most of the filtering problems I have had. Coverage is good and output quality is good, but filter stability gave me a lot of problems:
- Some people "converted" their doc files to OOo by renaming the file. That raised an exception in moin.log, and when reindexing after an update, indexing stopped with that exception. I had to touch the files manually to fix this. As far as I know, this has been fixed in a recent Moin release.
- The newest problem was an OOo document with a password. As in the previous case, it leads to an exception in the logs, and reindexing stops with an exception.
- I am worried about having to resolve any new exceptions that we stumble upon.
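The two failure cases reported above (a renamed .doc file and a password-protected document) can be detected before the real filter runs, and any remaining exception can be caught so that one bad attachment does not stop a reindex. The following is only a sketch under assumptions, not how the Moin filter is implemented: it is Python 3, the names odf_problem and safe_filter are hypothetical, and it relies on ODF files being zip containers whose META-INF/manifest.xml contains encryption-data entries when the document is encrypted.

```python
import logging
import zipfile

log = logging.getLogger(__name__)


def odf_problem(path_or_file):
    """Return a short reason why a file cannot be indexed as ODF, or None."""
    # A .doc file that was merely renamed is not a zip container.
    if not zipfile.is_zipfile(path_or_file):
        return "not a zip container"
    with zipfile.ZipFile(path_or_file) as zf:
        names = zf.namelist()
        if "content.xml" not in names:
            return "no content.xml"
        if "META-INF/manifest.xml" in names:
            manifest = zf.read("META-INF/manifest.xml")
            # Encrypted documents list encryption-data for their parts.
            if b"encryption-data" in manifest:
                return "encrypted"
    return None


def safe_filter(filter_func, path):
    """Call a filter; on any exception, log it and return empty text."""
    try:
        return filter_func(path)
    except Exception as err:
        log.warning("filter failed for %r: %s", path, err)
        return ""
```

An indexer could skip a file when odf_problem() returns a reason and wrap the actual filter call in safe_filter(), so that the file is still indexed by name even when no content can be extracted.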
2.8. text/html, text/xml
Is our built-in HTML and XML filter plugin working for you?
2.9. text/*
Is our builtin text filter plugin working for you?
2.10. JPEG images
Is our builtin image/jpeg filter plugin working for you?
2.11. Binary
Is our builtin binary file filter plugin working for you?