Description

I installed MoinMoin 1.9 at rev/a728d059c78e (9 November 2009) together with Xapian 1.0.7 (1.0.6 is required, see changelog). The problem is that page titles cannot always be found when searching with stemmed words. Stemming sometimes works, but more often it does not.

Steps to reproduce

To make this easier to reproduce, I ran all tests on this wiki, which runs the current MoinMoin version.

/Test

Example

see above...

Component selection

xapian stemming

Details

MoinMoin Version	Version 1.9.0rc1 [Revision release] \| this wiki, too
OS and Version	ubuntu linux server 9.04 \| and also this wiki
Python Version	2.6.2
Server Setup	Apache mod_wsgi
Server Details	Xapian 1.0.7
Language you are using the wiki in (set in the browser/UserPreferences)	de and en

Workaround

Discussion

Locking

While testing, I realized that even the unstemmed form could not be found!

So this seems to be not only a stemmer problem; something may also be wrong with the "tokenizer". In any case, some "default" help pages could not be found either (see /Test). -- MarcelHäfner 2009-11-10 15:52:33

This seems to be another bug, related to locking, as ThomasWaldman said in the #moin-dev chat. But even after he rebuilt the index, stemming is still not working correctly. See my examples above under /Test.

Multilingual

Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.

A multilingual wiki (e.g. with pages in fr, de and en) may also pose a more fundamental problem. The concept of multilingual stemmers exists (see some docs about it here), but it seems that the default Xapian and Snowball stemming is not multilingual. So the problem arises in two places:

Indexer: when saving pages, in which language should the stemming be done? I think something like this could work:
- the language tag in the wiki page (#language en), and
- the MoinMoin configuration file (the default if no language is set)
Query-parser: which stemming should be used for a query? In a multilingual wiki it would be wrong to assume that a user who is, e.g., English (according to their user account or the browser's user agent string) searches only for English titles or full-text words. The main problem is that you know neither which language the user is searching in nor in which language the stemmed words were saved in the index. Even if the query parser used some "language identification" technology (like balie), it would be error-prone and lead to bad / invalid search results.
The indexer and the query parser must always use the same algorithm, but with Snowball you use a different algorithm for each language.
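
The index-time language choice proposed above (page language tag first, then the configured default) could look roughly like this. It is only a sketch: pick_stem_language and its parameters are hypothetical names, not MoinMoin's actual code, and it assumes the #language line sits among the '#' header lines at the top of the page.

```python
import re

# Hypothetical helper, not MoinMoin's actual implementation.
# Matches a header line such as "#language de".
_LANGUAGE_LINE = re.compile(r"^#language\s+([A-Za-z_-]+)\s*$", re.IGNORECASE)


def pick_stem_language(page_text, default_language):
    """Return the language from a leading '#language xx' line, else the default."""
    for line in page_text.splitlines():
        if not line.startswith("#"):
            break  # assumption: header lines only appear at the top of the page
        match = _LANGUAGE_LINE.match(line)
        if match:
            return match.group(1).lower()
    return default_language


print(pick_stem_language("#format wiki\n#language de\nHallo Welt", "en"))
print(pick_stem_language("A page without any header", "en"))
```

This only settles the indexer side. The query parser still has no page to look at, which is the harder half of the problem described above.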

In my eyes there are only two possibilities:

set up the wiki and configure one main and only stemming language, so that the indexer and the query parser always use the same stemming algorithm (like xapian_stemming_language = 'DE'), or
use a stemming algorithm that is capable of handling multiple languages, so that the indexer stems words from different languages correctly and the query parser does the same.
Having a MoinMoin developer build this could be a big task, so Xapian / Snowball would need to support it, or some additional library would need to be integrated to make this kind of technology optionally available.
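
Why the indexer and the query parser have to agree can be shown with a small sketch. The two "stemmers" here are toy suffix strippers invented for this illustration. They are not Snowball and not what Xapian does; they only mimic the fact that each language has its own rules.

```python
# Toy suffix rules standing in for real per-language stemmers (NOT Snowball).
TOY_SUFFIXES = {
    "en": ("ing", "s"),
    "de": ("en", "e"),
}


def toy_stem(word, lang):
    """Lowercase the word and strip the first matching toy suffix for lang."""
    word = word.lower()
    for suffix in TOY_SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(titles, lang):
    """Map each stemmed title word to the set of titles containing it."""
    index = {}
    for title in titles:
        for word in title.split():
            index.setdefault(toy_stem(word, lang), set()).add(title)
    return index


def search(index, query, lang):
    """Stem the query with the given language and look it up in the index."""
    return index.get(toy_stem(query, lang), set())


index = build_index(["Seiten suchen"], "de")  # indexed with the German rules
same = search(index, "Seite", "de")   # query stemmed with the same rules
mixed = search(index, "Seite", "en")  # query stemmed with different rules
print(same, mixed)
```

With the same rules on both sides the query stem matches the indexed stem. With different rules the query term is left as "seite" while the index holds "seit", so the page is missed.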

Some Links:

Stemming algorithms for various European languages from Snowball
Discussion on Xapian Mailinglist from 2007
- This is the nub of your problem really - indexing and searching need to be done in a compatible way.
- As you point out, it's not possible in general to detect the language of a single word. Indeed many words are valid in more than one language.
- As James points out, the Snowball stemmers are algorithmic
Wikipedia.org
Discussion on Sphinx Mailinglist
Morphology

Test/Review results

xapian.Stem test

I have briefly tested xapian.Stem() and it seems to behave mostly correctly for en and de. So the problem we have is likely not the stemmer itself, but how we use it.
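
A quick check of this kind can be repeated with a few lines like the following. This is a sketch, not the test that was actually run: stem_words is a hypothetical helper, the sample words are arbitrary, and it assumes the Xapian Python bindings are installed (it returns None otherwise). No expected output is given because it depends on the installed Xapian version.

```python
def stem_words(words, lang):
    """Stem words with xapian.Stem; return None if the bindings are missing."""
    try:
        import xapian
    except ImportError:
        return None
    stemmer = xapian.Stem(lang)  # e.g. "en" or "de"
    results = []
    for word in words:
        stemmed = stemmer(word)  # Stem objects are callable
        # Newer bindings return bytes; normalise to text for printing.
        if isinstance(stemmed, bytes) and not isinstance(stemmed, str):
            stemmed = stemmed.decode("utf-8")
        results.append((word, stemmed))
    return results


samples = (
    ("en", ["sort", "sortable", "Sortable"]),
    ("de", ["Seite", "seiten"]),
)
for lang, words in samples:
    print(lang, stem_words(words, lang))
```

Including a capitalised variant of a word makes it easy to see whether case affects the result, which relates to the lowercase question raised in the code review.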

Code review

queryparser.expressions
- 116 BaseExpression._build_re has a stemmed=False default param, but does not use it
- 347 BaseTextFieldSearch has some stuff commented out (XXX broken) and code (maybe unused because of the commented-out stuff) that calls _build_re with stemmed=True
Xapian.indexing
- 109 StemmedField
  - uses cfg.language_default (just noting, not necessarily bad, we maybe don't have anything better)
  - has strange usage of unicode(), review!
- 267 _get_languages() + callers - looks suspicious, needs review (note: xapian.Stem(lang) seems to know lang="none" to do nothing, maybe we can use that to simplify code)
- 374 _index_attachments sets lang and stem_lang fields, but derives values from the page, not the attachment (this is not assured to be correct). also, we do not use stemming for attachments, so ...
Xapian.tokenizer
- 120 "xapian stemmer expects lowercase input" - does it?
- 46 maybe use lang="none" to get nop stemmer and simplify code? review 102 tokenize().
after any changes, also review the skipped test in test_search

Current Example or Different Problem?

Was about to open a bug report but found this one. A title search for "sortable" on this wiki results in 3 hits. A search for "sort" yields 30 hits, but the 3 hits for "sortable" are NOT included.

Plan

Priority:
Assigned to:
Status:

CategoryMoinMoinBug

MoinMoin: MoinMoinBugs/1.9XapianStemmingNotWorkingCorrectly (last edited 2010-02-25 20:33:11 by RogerHaase)