Task
Implement an advanced Xapian-indexed search within MoinMoin.
Xapian search will cover wiki pages ("text items") and attachments ("text or binary items"), and can optionally index file content from the filesystem.
Ideas
- search needs a good UI for input as well as output
- inputs needed:
- search query
- comfortable mimetype selection
- comfortable wiki selection (for farms)
- selection of "namespace" (currently: data, underlay, file system) - so you can e.g. exclude the wiki's system and help pages and/or files
- selection of language (or don't care)
- output:
- results pages with previous / next navigation
- correct linking to search result target document
- showing of context with highlighting
- different sorting (score, age, alphabetic, mimetype, wiki, ...)
- modelled on Google, so users will find it intuitive to use
- support stemming depending on metadata "language"
- indexing needs to include all data and metadata that search will need later
- the Xapian indexer would need to learn about the Moin syntax (it is unclear how difficult this is); as a first step, the plain text formatter could generate the input for the indexer
- utilize Xapian for saving and querying category relationships; this involves a new term prefix
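The category idea above could be sketched roughly as follows. This is only an illustration in plain Python, not MoinMoin or Xapian code: the prefix name `XCAT`, the function `category_terms` and the rule that categories are `Category...` names after the last `----` line are all assumptions made for the example. Using a colon between prefix and a capitalised term mirrors a common Xapian convention, but is also an assumption here.

```python
import re

# Hypothetical prefix for category terms (an assumption, not an existing
# MoinMoin or Xapian constant). "X" is commonly used to start
# user-defined term prefixes.
CATEGORY_PREFIX = "XCAT"

# Assumption: a category is a page name that starts with "Category"
# followed by a capital letter, e.g. CategoryHomepage.
CATEGORY_RE = re.compile(r"\bCategory[A-Z]\w*")


def category_terms(page_text):
    """Return sorted, de-duplicated index terms for the categories of a page."""
    # Assumption: categories are listed after the last "----" separator,
    # so mentions of a category in the page body are ignored.
    _, separator, footer = page_text.rpartition("----")
    if not separator:
        return []
    names = set(CATEGORY_RE.findall(footer))
    # The colon keeps the prefix distinguishable from the capitalised name.
    return sorted("%s:%s" % (CATEGORY_PREFIX, name) for name in names)
```

Each returned term could then be added to the page's Xapian document (for example with `Document.add_term`), so that listing a category becomes a lookup of a single term instead of a scan over all pages.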
Difficulties
- The Moin query parser uses a different syntax from the Xapian query parser.
Either users will need to use the Xapian style syntax, or a new query parser will need to be written (though this might not be too hard using something like PyParsing).
- Moin's internal non-indexed search supports regular expressions; Xapian can't do that.
- Fall back to linear search (this is the simplest but also the slowest solution, and it doesn't work for content that isn't text/plain-like, so a more advanced fallback should be sought)
- Xapian can support a limited form of wildcard search, though this requires some work and isn't particularly efficient. This couldn't reasonably be expanded to implement full regular expression searches, though (mainly because Xapian search is word-based, but regular expressions are not).
- Xapian (as of 0.9.5) needs a UTF-8 patch for its query parser, and its stemmers don't support UTF-8 yet (both are planned to be in xapian-core soon).
- UTF-8 support for the stemmers is implemented, but will not be merged into the Xapian tree until a 1.0 release happens, because it will break backwards compatibility.
The stemmers are actually those provided by the Snowball project, for which there is a direct Python interface. If the Snowball stemmers were used directly, and a new query parser were written in Python with PyParsing, all the UTF-8 issues should be resolved.
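To illustrate the wildcard point above: on a word-based index, a trailing wildcard can be expanded into the list of indexed terms sharing a prefix, while an arbitrary regular expression cannot be answered this way. The following is a minimal sketch in plain Python, assuming a sorted list of terms stands in for the index's term list; `expand_trailing_wildcard` is a hypothetical helper, not part of Xapian's API.

```python
from bisect import bisect_left


def expand_trailing_wildcard(sorted_terms, pattern):
    """Expand 'prefix*' into all indexed terms that start with the prefix.

    sorted_terms must be a sorted list of unique index terms.
    """
    if not pattern.endswith("*") or "*" in pattern[:-1]:
        raise ValueError("only a single trailing '*' is supported")
    prefix = pattern[:-1]
    # Binary search for the first term >= prefix; all terms sharing the
    # prefix are contiguous from there on in a sorted list.
    index = bisect_left(sorted_terms, prefix)
    matches = []
    while index < len(sorted_terms) and sorted_terms[index].startswith(prefix):
        matches.append(sorted_terms[index])
        index += 1
    return matches
```

The expanded terms would then be combined with OR in the actual query. The cost grows with the number of matching terms, which is one reason why a short prefix is expensive and why this approach does not extend to full regular expressions.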
Mentoring
This project will be mentored by the main developers of Xapian (RichardBoulton) and MoinMoin (ThomasWaldmann).
Questions
I started some tests on a local wiki today and have some questions. My setup is based on the latest 1.6 dev code and Python 2.3.5. -- ReimarBauer 2006-08-21 15:25:33
I did a title search for "help", and the first result URL is shown as http://localhost/mywiki/SystemInfo?action=fullsearch&context=180&value=help&titlesearch=Titel. After going to the second page and then some pages back, it changes to http://localhost/mywiki/action/fullsearch/SystemInfo?action=fullsearch&titlesearch=Titel&from=10&context=180&value=help. The action/fullsearch part is added to the link, and once more for each further page, so it eventually ends in an OS error: file name too long.
This should be fixed in the xapian branch, please use http://hg.thinkmo.de/moin/1.6-xapian-fpletz. -- FranzPletz
OK, changed to 1431:6dfca61f2672. Now I get a traceback (parsedatetime_missing.htm): parsedatetime is not installed from the MoinMoin/support tree. -- ReimarBauer 2006-08-23 08:48:25
It really should be in there because it also works on http://xapian.wikiwikiweb.de. I just bumped the version of parsedatetime to 0.7, so please try again by updating at least to 1433:446e20f5005c. -- FranzPletz
OK, this is now fixed with 1442:ff126e63a52e in the xapian branch, thanks to Reimar's feedback on IRC. It was only triggered if MoinMoin was installed by distutils. -- FranzPletz
Will there be a version of the Xapian search for moin-1.5? Or is a moin-1.6 release perhaps not too far away? BTW, I've built Xapian 0.9.6 and the Python 2.4 bindings on Windows, so if anyone needs the binaries, please contact me. I can also upload them here if desired and tolerated. -- DavidLinke
AFAIK, the 1.6 release will still take some time. Currently, there are no plans to backport Xapian search to 1.5, but I think it would be possible with some modifications. If there's more interest in this, I could do it. -- FranzPletz
- There is interest from me. And I suspect people with 2000+ pages would really shower blessings your way - default full search with default settings gets embarrassingly slow around that level. -- PJ
http://xapian.wikiwikiweb.de/ExifTest shows an example of an image that you should find with a text search for 2004. On my local wiki and on this one, the image is found but is listed with 0 matches. I think the minimum for something that was found should be 1, not 0. -- ReimarBauer 2006-08-23 22:32:42
You get 0 matches? What do you mean? This works fine for me, e.g. search for "kodak" (full text). 2004 yields too many pages. You can get the same result by searching for "2004 exif" (also full text). -- FranzPletz
At the right edge I can read "0 matches", which suggests the word was not found in the image. By contrast, a text search in text lists how often the word occurs in the text. -- ReimarBauer 2006-08-27 09:12:54