Description
I installed MoinMoin 1.9 at rev/a728d059c78e (9 November 2009) together with Xapian 1.0.7 (1.0.6 is required, see changelog). The problem is that finding page titles via stemmed words does not always work. Stemming seems to work sometimes, but more often it does not.
Steps to reproduce
To make the problem easier to reproduce, I ran all tests on this wiki, which runs the current MoinMoin version.
Example
see above...
Component selection
- xapian stemming
Details
MoinMoin Version | Version 1.9.0rc1 [Revision release] (this wiki, too)
OS and Version | Ubuntu Linux Server 9.04 (and also this wiki)
Python Version | 2.6.2
Server Setup | Apache mod_wsgi
Server Details | Xapian 1.0.7
Language you are using the wiki in (set in the browser/UserPreferences) | de and en
Workaround
Discussion
Locking
While testing I realized that even the unstemmed form could not be found!
- Word: Statistiken
Page: http://www.moinmo.in/MoinMoinBugs/1.9XapianStemmingNotWorkingCorrectly/Test/Statistiken
So this seems to be not only a stemmer problem; maybe something is also wrong with the "tokenizer". In any case, some "default" help pages could not be found either (see /Test...). -- MarcelHäfner 2009-11-10 15:52:33
This seems to be another bug, about locking, as ThomasWaldman said in the #moin-dev chat. But even after he rebuilt the index, stemming still does not work correctly. See my examples above under /Test.
Multilingual
Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.
There may also be a fundamental problem if you want a multilingual wiki (e.g. with pages in fr, de and en). The concept of multilingual stemmers exists (see some docs about it here), but it seems that the default Xapian & Snowball stemming is not multilingual. So the problems arise here:
- Indexer: When saving pages, in which language should the stemming be done? I think something like this could work:
  - the language tag in the wiki page (#language en), and
  - the MoinMoin configuration file (default if no language is set)
- Query-parser: What kind of stemming should be used for a query? In a multilingual wiki it would be wrong to assume that a user who is e.g. English (according to his user account or the browser's user agent string) would search only for English titles or fulltext words. A main problem is that you know neither which language the user is searching in nor in which language the stemmed words were saved in the index. Even if the query parser used some "language identification" technology (like balie), it would be error-prone and lead to bad / invalid search results.
The indexer and query parser must always use the same algorithm, but with Snowball you are using a different algorithm for each language.
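The indexer-side fallback proposed above (page language tag first, then the configured default) could look roughly like the following. This is only a sketch: the function name stemming_language and its default_language parameter are made up for illustration and are not MoinMoin APIs; the only thing taken from the wiki is the "#language en" processing instruction at the top of a page.

```python
import re

# Matches a "#language xx" processing instruction on a line of its own.
_LANGUAGE_PI = re.compile(r"^#language\s+([A-Za-z_-]+)\s*$", re.IGNORECASE)


def stemming_language(page_body, default_language="en"):
    """Pick the stemming language for a page (hypothetical helper).

    Uses the "#language xx" instruction from the page header if present,
    otherwise falls back to the configured default.
    """
    for line in page_body.splitlines():
        if not line.startswith("#"):
            # Assumption: processing instructions only occur in the
            # leading block of "#" lines, so stop at the first other line.
            break
        match = _LANGUAGE_PI.match(line)
        if match:
            return match.group(1).lower()
    return default_language
```

For example, stemming_language("#language de\nStatistiken") gives "de", while a page without the instruction gets the default.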
In my eyes you have only two possibilities:
- set up a wiki and configure the main and only stemming language, so that the indexer and query parser always use the same stemming algorithm (like xapian_stemming_language = 'DE'), or
- use a stemming algorithm that is capable of multilingual stemming, so that the indexer correctly stems words from different languages and the query parser does the same.
Developing this could be a big task for a MoinMoin developer, so Xapian / Snowball would need to support it, or some additional library would have to be integrated to make this kind of technology optionally available.
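To illustrate why the indexer and the query parser have to agree on the stemming language, here is a small self-contained sketch. The suffix strippers are toy stand-ins invented for this example; they are not the Snowball algorithms, and the index is a plain dict, not a Xapian database. It only shows the kind of mismatch that can occur, not what actually happens inside MoinMoin.

```python
# Toy suffix lists standing in for per-language stemmers (NOT Snowball).
TOY_SUFFIXES = {"en": ("ing", "s"), "de": ("en", "e")}


def toy_stem(word, lang):
    """Strip the first matching toy suffix for the given language."""
    word = word.lower()
    for suffix in TOY_SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def build_index(titles, lang):
    """Map each stemmed title to the set of page titles it came from."""
    index = {}
    for title in titles:
        index.setdefault(toy_stem(title, lang), set()).add(title)
    return index


def search(index, query, lang):
    """Stem the query with the given language and look it up."""
    return index.get(toy_stem(query, lang), set())


# Index built with the German toy stemmer: "Statistiken" -> "statistik".
index = build_index(["Statistiken"], "de")

same_language = search(index, "Statistiken", "de")   # stems to "statistik"
other_language = search(index, "Statistiken", "en")  # stays "statistiken"
```

With the same language on both sides the page is found; with a different query-side language even the exact, unstemmed title is missed, because the stored term was stemmed differently.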
Some Links:
Stemming algorithms for various European languages from Snowball
Discussion on Xapian Mailinglist from 2007
- This is the nub of your problem really - indexing and searching need to be done in a compatible way.
As you point out, it's not possible in general to detect the language of a single word. Indeed many words are valid in more than one language.
- As James points out, the Snowball stemmers are algorithmic
Test/Review results
xapian.Stem test
I have briefly tested xapian.Stem() and it seems to behave mostly correctly for en and de. So the problem we have is likely not the stemmer itself, but how we use it.
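A quick check along these lines can be reproduced with a few lines of Python. This is a sketch, not the test that was actually run: the helper name stem_words and the sample words are chosen here for illustration, and the input is lowercased on the assumption that this is what the stemmer expects. The bindings may return bytes or text depending on the Python version, so the result is decoded if needed.

```python
try:
    import xapian  # Xapian Python bindings, may not be installed
except ImportError:
    xapian = None


def stem_words(lang, words):
    """Return {word: stemmed term} using xapian.Stem(lang).

    Returns None when the xapian bindings are not available.
    """
    if xapian is None:
        return None
    stemmer = xapian.Stem(lang)
    result = {}
    for word in words:
        term = stemmer(word.lower())  # assumption: lowercase input
        if isinstance(term, bytes):
            term = term.decode("utf-8")
        result[word] = term
    return result


if __name__ == "__main__":
    samples = (
        ("en", ["sort", "sortable", "sorting"]),
        ("de", ["Statistik", "Statistiken"]),
    )
    for lang, words in samples:
        print(lang, stem_words(lang, words))
```

Comparing the printed terms for related word forms shows directly which forms the stemmer conflates and which it keeps apart.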
Code review
- queryparser.expressions
116 BaseExpression._build_re has a stemmed=False default param, but does not use it
347 BaseTextFieldSearch has some code commented out (marked XXX broken); the remaining code that calls _build_re with stemmed=True is maybe not used because of that
- Xapian.indexing
109 StemmedField
- uses cfg.language_default (just noting, not necessarily bad, we maybe don't have anything better)
- has strange usage of unicode(), review!
- 267 _get_languages() + callers - looks suspicious, needs review (note: xapian.Stem(lang) seems to know lang="none" to do nothing, maybe we can use that to simplify code)
- 374 _index_attachments sets lang and stem_lang fields, but derives values from the page, not the attachment (this is not assured to be correct). also, we do not use stemming for attachments, so ...
- Xapian.tokenizer
- 120 "xapian stemmer expects lowercase input" - does it?
- 46 maybe use lang="none" to get nop stemmer and simplify code? review 102 tokenize().
- after any changes, also review the skipped test in test_search
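The lang="none" idea from the notes above could be sketched as follows. The helper names get_stemmer and tokens_to_terms are hypothetical and do not exist in MoinMoin. The sketch relies on "none" being accepted by xapian.Stem as a no-op language, as observed above, and it lowercases tokens itself, since whether the stemmer needs lowercase input is still an open question.

```python
try:
    import xapian  # Xapian Python bindings, may not be installed
except ImportError:
    xapian = None


def get_stemmer(lang=None):
    """Return a xapian.Stem; no language means the "none" no-op stemmer."""
    if xapian is None:
        return None
    return xapian.Stem(lang or "none")


def tokens_to_terms(tokens, lang=None):
    """Lowercase tokens and pass them through one stemmer, no special cases."""
    stemmer = get_stemmer(lang)
    terms = []
    for token in tokens:
        term = token.lower()
        if stemmer is not None:
            term = stemmer(term)
            if isinstance(term, bytes):
                term = term.decode("utf-8")
        terms.append(term)
    return terms
```

The point of the design is that callers no longer need an "if stemming is enabled" branch: the unstemmed case goes through the same code path with a stemmer that does nothing.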
Current Example or Different Problem?
Was about to open a bug report but found this one. A title search for "sortable" on this wiki results in 3 hits. A search for "sort" yields 30 hits, but the 3 hits for "sortable" are NOT included.
Plan
- Priority:
- Assigned to:
- Status: