What we need
- Attachment search
- Attachment class that mimics MoinMoin.Page
  - .get_raw_body()
  - .page_name (PageName/FileName)
  - .link_to()
- plugin structure for converters
  MoinMoin/attachment/__init__.py:
    from MoinMoin.util import pysupport
    modules = pysupport.getPackageModules(__file__)
- heuristic for content type
- regex for file ending
- converters
  - plain text
  - PS
  - OpenOffice
    see oo2txt.py for a simple example -- OliverGraf 2004-06-21 13:45:02
  - PowerPoint
  - Word
- parameter where to search
- in search.py
- Iteration over all attachments
see MoinMoin.search.searchPages()
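The "regex for file ending" heuristic from the list above could look roughly like the following. This is only a sketch in current Python: the names EXTENSION_RE, CONTENT_TYPES and guess_content_type are hypothetical, and the set of file endings is an assumption, not something MoinMoin defines.

```python
import re

# Matches the last file ending, e.g. "doc" in "report.doc".
EXTENSION_RE = re.compile(r"\.([A-Za-z0-9]+)$")

# Hypothetical table: file ending -> converter label from the list above.
CONTENT_TYPES = {
    "txt": "plain text",
    "ps": "PS",
    "sxw": "OpenOffice",
    "ppt": "PowerPoint",
    "doc": "Word",
}


def guess_content_type(filename):
    """Return a content type label based on the file ending, or None."""
    match = EXTENSION_RE.search(filename)
    if match is None:
        return None  # no file ending at all
    return CONTENT_TYPES.get(match.group(1).lower())
```

A file ending is cheap to check but easy to get wrong (renamed files, missing endings), which is why looking at the file content is listed under Ideas as an alternative.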
Starting Points:
- MoinMoin/search.py
- MoinMoin/action/AttachFile.py
- MoinMoin/Page.py
- MoinMoin/caching.py
First version
This patch makes it possible to search in attachments in version 1.3. So far only PDF files are supported, converted with pdftotext (xpdf package), and currently only on Linux. The following files have changed:
- moin/lib/python2.3/site-packages/MoinMoin/search.py
- moin/lib/python2.3/site-packages/MoinMoin/Attachment.py
- moin/lib/python2.3/site-packages/MoinMoin/formatter/text_html.py
- moin/lib/python2.3/site-packages/MoinMoin/action/fullsearch.py
- moin/lib/python2.3/site-packages/MoinMoin/action/AttachFile.py
- moin/lib/python2.3/site-packages/MoinMoin/attach2txt/pdf2txt.py
- moin/lib/python2.3/site-packages/MoinMoin/attach2txt/__init__.py
You will have to create the directory data/cache/AttachSearch manually.
It works as follows: when a text search is performed, the text versions of each page's attachments are looked up in a special cache directory. Suppose the page WikiPage has an attachment att.suf. If the file data/cache/AttachSearch/WikiPage/att.suf.txt exists, it is opened and a normal search is performed on it. If it does not exist yet, the attach2txt package looks up the proper conversion function in attach2txt.__init__.converter_mapping, which is a dictionary ( {"pdf": pdf2txt.convert} ), and uses it to create the text file.
If pdftotext fails to convert a PDF file, an empty text file is created. On the next search MoinMoin then does not try to convert that file again.
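The lookup described above can be sketched as follows. This is not the code from the patch: attachment_text and its parameters are hypothetical names, the sketch is written for a current Python rather than 2.3, and it assumes pdftotext is called as "pdftotext input.pdf output.txt".

```python
import os
import subprocess


def pdf2txt(src_path, txt_path):
    """Convert a PDF with the pdftotext command line tool (xpdf package)."""
    subprocess.check_call(["pdftotext", src_path, txt_path])


# Maps a file suffix to a conversion function, like converter_mapping.
converter_mapping = {"pdf": pdf2txt}


def attachment_text(cache_dir, page_name, attachment_path,
                    converters=converter_mapping):
    """Return the cached text of an attachment, converting it on first use.

    Returns None when no converter is registered for the file suffix.
    """
    filename = os.path.basename(attachment_path)
    txt_path = os.path.join(cache_dir, page_name, filename + ".txt")
    if not os.path.exists(txt_path):
        suffix = os.path.splitext(filename)[1].lstrip(".").lower()
        convert = converters.get(suffix)
        if convert is None:
            return None
        os.makedirs(os.path.dirname(txt_path), exist_ok=True)
        try:
            convert(attachment_path, txt_path)
        except (OSError, subprocess.CalledProcessError):
            pass  # a failed conversion is handled just below
        if not os.path.exists(txt_path):
            # Leave an empty file so the conversion is not retried.
            open(txt_path, "w").close()
    with open(txt_path, encoding="utf-8", errors="replace") as f:
        return f.read()
```

Called as attachment_text("data/cache/AttachSearch", "WikiPage", path_to_attachment), the function converts only on the first search; later searches read the cached txt file, including an empty one left behind by a failed conversion.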
Ideas
- use "magic(5)" to detect content type
- write a Python interpreter for the magic file format and add (an edited version of) /etc/share/magic to MoinMoin
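A full interpreter for the magic file format is a larger job. As a much smaller illustration of the same idea, the leading bytes of a file can be compared against a few well-known signatures. The table and the names MAGIC_SIGNATURES, detect_by_magic and detect_file are assumptions made for this sketch and cover only formats mentioned on this page.

```python
# Leading bytes ("magic numbers") of a few formats named on this page.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "ps"),
    # OpenOffice documents are ZIP archives.
    (b"PK\x03\x04", "zip"),
    # OLE2 container used by legacy Word and PowerPoint files.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),
]


def detect_by_magic(data):
    """Return a format label for the given leading bytes, or None."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    return None


def detect_file(path):
    """Read the first bytes of a file and detect its format."""
    with open(path, "rb") as f:
        return detect_by_magic(f.read(16))
```

The labels "zip" and "ole2" identify only the container; telling Word from PowerPoint, or an OpenOffice document from another ZIP archive, would need a look inside the container.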
TODO
- Beautify print_results_page_only and print_results_with_context in search.py. So far, only the wiki page containing the matching attachment is displayed.
- When a page or an attachment is deleted, the corresponding txt file must also be deleted.
- A new input field has to be added to the title and text search. Some users may want to search only in pages. A few option boxes might be adequate.
- Search in the attachment's filename is not yet supported.
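For the cleanup item in the list above, a helper along the following lines could be called from the delete actions. It is a sketch only: remove_cached_text is a hypothetical name, and the layout data/cache/AttachSearch/PageName/file.txt is taken from the description of the first version.

```python
import os
import shutil


def remove_cached_text(cache_dir, page_name, filename=None):
    """Delete cached txt files for one attachment or for a whole page.

    With filename=None the cache directory of the page is removed,
    otherwise only the txt file belonging to that attachment.
    """
    page_cache = os.path.join(cache_dir, page_name)
    if filename is None:
        shutil.rmtree(page_cache, ignore_errors=True)
        return
    try:
        os.remove(os.path.join(page_cache, filename + ".txt"))
    except FileNotFoundError:
        pass  # nothing was cached for this attachment
```

Ignoring a missing file keeps the helper safe to call for attachments that were never searched or never converted.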