MoinMoinTodo/ExtendedSearch/AttachmentSearch

What we need

Attachment search
- Attachment class that mimics MoinMoin.Page
  - .get_raw_body()
  - .page_name (PageName/FileName)
  - .link_to()
- plugin structure for converters
  - MoinMoin/attachment/__init__.py

from MoinMoin.util import pysupport
modules = pysupport.getPackageModules(__file__)
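A minimal sketch of the Page-like Attachment class listed above, written as standalone Python rather than against the MoinMoin code base. Only get_raw_body, page_name and link_to come from the list; the constructor arguments, the owner_page attribute and the AttachFile-style URL layout are assumptions made for illustration.

```python
import os
from html import escape
from urllib.parse import quote


class Attachment:
    """Hypothetical stand-in offering the Page-like members listed above."""

    def __init__(self, owner_page, file_name, text_path):
        self.owner_page = owner_page    # assumed attribute: name of the wiki page
        self.file_name = file_name
        # PageName/FileName, as proposed in the list
        self.page_name = "%s/%s" % (owner_page, file_name)
        self._text_path = text_path     # assumed: location of the converted text

    def get_raw_body(self):
        """Return the converted text, or '' if no text version exists."""
        if not os.path.exists(self._text_path):
            return ""
        with open(self._text_path, encoding="utf-8", errors="replace") as f:
            return f.read()

    def link_to(self, script_name=""):
        """Return an HTML link to the attachment (URL layout is an assumption)."""
        url = "%s/%s?action=AttachFile&do=get&target=%s" % (
            script_name, quote(self.owner_page), quote(self.file_name))
        return '<a href="%s">%s</a>' % (escape(url), escape(self.page_name))
```

With such an object the search code could treat attachments and pages alike, as long as it only uses these three members.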

heuristic for content type
- regex for file ending
converters
- plain text
- PDF
- PS
- OpenOffice
  - see oo2txt.py for a simple example -- OliverGraf 2004-06-21 13:45:02
- Powerpoint
- Word
parameter where to search
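The "regex for file ending" heuristic could look roughly like the sketch below. The rule table, the converter keys and the function name are hypothetical; the suffixes are just common endings for the formats in the converter list (sxw, sxc and sxi being the OpenOffice 1.x endings).

```python
import re

# Hypothetical table: regex on the file ending -> converter key.
SUFFIX_RULES = [
    (re.compile(r"\.(txt|text)$", re.IGNORECASE), "plain"),
    (re.compile(r"\.pdf$", re.IGNORECASE), "pdf"),
    (re.compile(r"\.ps$", re.IGNORECASE), "ps"),
    (re.compile(r"\.(sxw|sxc|sxi)$", re.IGNORECASE), "openoffice"),
    (re.compile(r"\.ppt$", re.IGNORECASE), "powerpoint"),
    (re.compile(r"\.doc$", re.IGNORECASE), "word"),
]


def guess_content_type(file_name):
    """Return the converter key for file_name, or None if no rule matches."""
    for pattern, kind in SUFFIX_RULES:
        if pattern.search(file_name):
            return kind
    return None
```

A file whose ending matches no rule simply yields None and would be skipped by the search.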

in search.py
- Iteration over all attachments
- see MoinMoin.search.searchPages()
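The iteration over all attachments might be sketched as a plain directory walk. This assumes a layout of <pages_dir>/<PageName>/attachments/<FileName> and takes directory names as page names without any decoding; both points, like the function name, are assumptions and not taken from searchPages().

```python
import os


def iter_attachments(pages_dir):
    """Yield (page_name, file_name) for every attachment below pages_dir."""
    for page_name in sorted(os.listdir(pages_dir)):
        attach_dir = os.path.join(pages_dir, page_name, "attachments")
        if not os.path.isdir(attach_dir):
            continue  # page without attachments
        for file_name in sorted(os.listdir(attach_dir)):
            if os.path.isfile(os.path.join(attach_dir, file_name)):
                yield page_name, file_name
```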

Starting Points:

MoinMoin/search.py
MoinMoin/action/AttachFile.py
MoinMoin/Page.py
MoinMoin/caching.py

First version

This patch makes searching in attachments possible in version 1.3. So far only PDF files are supported, converted with pdftotext (from the xpdf package), so it currently works only on Linux. The following files have changed:

moin/lib/python2.3/site-packages/MoinMoin/search.py
moin/lib/python2.3/site-packages/MoinMoin/Attachment.py
moin/lib/python2.3/site-packages/MoinMoin/formatter/text_html.py
moin/lib/python2.3/site-packages/MoinMoin/action/fullsearch.py
moin/lib/python2.3/site-packages/MoinMoin/action/AttachFile.py
moin/lib/python2.3/site-packages/MoinMoin/attach2txt/pdf2txt.py
moin/lib/python2.3/site-packages/MoinMoin/attach2txt/__init__.py

You will have to create the directory data/cache/AttachSearch manually.

It works as follows: when a text search is performed, the text versions of each page's attachments are searched as well; they are kept in a special cache directory. Suppose the page WikiPage has an attachment att.suf. If the file data/cache/AttachSearch/WikiPage/att.suf.txt exists, it is opened and a normal search is performed on it. If it does not exist yet, the proper conversion method is looked up in attach2txt.__init__.converter_mapping, which is a dictionary ( {"pdf": pdf2txt.convert} ), and used to create the text file.
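The lookup described above can be sketched as follows. This is not the patch itself: the function names, the convert(src_path, txt_path) signature and keying the mapping by lower-case file suffix are assumptions.

```python
import os


def cached_text_path(cache_dir, page_name, file_name):
    """E.g. data/cache/AttachSearch/WikiPage/att.suf.txt"""
    return os.path.join(cache_dir, page_name, file_name + ".txt")


def get_attachment_text(cache_dir, page_name, attach_path, converter_mapping):
    """Return the text version of an attachment, converting it on first use.

    Returns None if no converter is registered for the file suffix or the
    converter did not produce a text file.
    """
    file_name = os.path.basename(attach_path)
    txt_path = cached_text_path(cache_dir, page_name, file_name)
    if not os.path.exists(txt_path):
        suffix = os.path.splitext(file_name)[1].lstrip(".").lower()
        convert = converter_mapping.get(suffix)
        if convert is None:
            return None
        os.makedirs(os.path.dirname(txt_path), exist_ok=True)
        convert(attach_path, txt_path)  # assumed signature
        if not os.path.exists(txt_path):
            return None
    with open(txt_path, encoding="utf-8", errors="replace") as f:
        return f.read()
```

Because the cached file is checked first, each attachment is converted at most once, no matter how often it is searched.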

If pdftotext cannot convert the PDF file, an empty text file is created. On the next search, MoinMoin does not try to convert that file again, since the text file already exists.
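A converter with this behaviour might look like the sketch below. It is an assumption about how pdf2txt.convert could be written, not the code from the patch: the wrapper itself writes the empty file when pdftotext is missing or exits with an error, and the command parameter exists only so the program name can be swapped.

```python
import subprocess


def convert(pdf_path, txt_path, command="pdftotext"):
    """Run `pdftotext pdf_path txt_path`; leave an empty file on failure.

    Returns True if the conversion succeeded, False otherwise.
    """
    try:
        result = subprocess.run(
            [command, pdf_path, txt_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        ok = result.returncode == 0
    except OSError:
        ok = False  # program not installed or not executable
    if not ok:
        # The empty file marks the attachment as "already tried".
        open(txt_path, "w").close()
    return ok
```

The drawback of the empty marker file is that a broken PDF is never retried, even after a newer pdftotext is installed, unless the cache file is removed by hand.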

Ideas

use "magic(5)" to detect content type
- write a Python interpreter for the magic file format and add an (edited) version of /etc/share/magic to MoinMoin
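A much smaller variant of this idea is to compare the first bytes of a file against a few well-known signatures instead of interpreting a full magic file. The table and names below are a hypothetical sketch; note that a signature only identifies the container, so Word and Powerpoint (both OLE2) or OpenOffice and any other ZIP archive cannot be told apart this way.

```python
# Hypothetical signature table: leading bytes -> container type.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "ps"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Word/Powerpoint container
    (b"PK\x03\x04", "zip"),                          # OpenOffice documents are ZIP archives
]


def sniff_content_type(path):
    """Return the container type guessed from the first bytes, or None."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, kind in SIGNATURES:
        if head.startswith(magic):
            return kind
    return None
```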

TODO

Beautify print_results_page_only and print_results_with_context in search.py. So far, only the wiki page containing the matching attachment is displayed.
When a page or an attachment is deleted, the corresponding txt file must also be deleted.
A new input field has to be added to the title and text search, since some users may want to search only in pages. Option boxes may be adequate.
Searching in attachment filenames is not yet supported.
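For the cleanup item, a helper along these lines could be called from the delete code paths. It assumes the data/cache/AttachSearch/PageName/FileName.txt layout described in the first version; the function name and its arguments are hypothetical.

```python
import os
import shutil


def remove_cached_text(cache_dir, page_name, file_name=None):
    """Drop the cached text of one attachment, or of a whole page.

    With file_name=None the complete cache directory of the page is removed.
    """
    page_cache = os.path.join(cache_dir, page_name)
    if file_name is None:
        shutil.rmtree(page_cache, ignore_errors=True)
        return
    txt_path = os.path.join(page_cache, file_name + ".txt")
    if os.path.exists(txt_path):
        os.remove(txt_path)
```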

MoinMoin: MoinMoinTodo/ExtendedSearch/AttachmentSearch (last edited 2007-10-29 19:21:07 by localhost)