Description


Steps to reproduce

We have a page called "WWDC09". Under version 1.6.1, typing "WWDC09" into the Title Search gets the page, and performing a Text Search gets hits pointing to it.

Under 1.8.4, there are no results returned, so the users complain that their pages are missing.

Example

Title Search under 1.8.4: TitleSearch-1.8.4.png

Text Search under 1.8.4: TextSearch-1.8.4.png

Text Search under 1.6.1: TextSearch-1.6.1.png

Component selection

Details

MoinMoin Version

1.8.4

OS and Version

CentOS Linux 5.2

Python Version

2.4.3

Server Setup

FCGI

Server Details

Apache 2.2.3

Language you are using the wiki in (set in the browser/UserPreferences)

English

Workaround

Prefixing the search with r: to disable Xapian returns the correct pages. However, educating the users is difficult and leads to complaints like "It used to work, why did you break it?".

Discussion

I want to use Xapian, because I want to index the contents of attachments.

Old server: MoinMoin 1.6.1, Xapian 1.0.5, Pystemmer 1.0.1, stemming enabled, RH7.3, Python 2.5, Apache 2.0.

New server: MoinMoin 1.8.4, Xapian 1.0.13 (but tried older ones), doesn't matter if stemming is enabled or not.

I have rebuilt the Xapian index as part of the upgrade.

It might have to do with tokenization. When indexing, moin runs texts and titles through a tokenizer that tries to split the text at places where it makes sense. That is trivial when the text has blanks between words: it is split at the blanks. Words like WWDC09 are a bit more difficult, but IIRC moin should split this into WWDC and 09. It should do the same with the words you enter in the search box.
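As a rough illustration of the splitting described above, here is a minimal Python sketch. It is not MoinMoin's actual tokenizer: the function name `tokenize` and the letter/digit pattern are assumptions made for this example. It lowercases each blank-separated word, keeps the whole word as a term, and adds the letter-run and digit-run parts when a word mixes both.

```python
import re

# Hypothetical pattern (an assumption, not taken from MoinMoin):
# a run of letters, or a run of digits.
_PART = re.compile(r"[^\W\d_]+|\d+")


def tokenize(text):
    """Return lowercase terms: each word, plus its letter/digit parts."""
    terms = []
    for word in text.split():          # the trivial case: split at blanks
        word = word.lower()
        terms.append(word)
        parts = _PART.findall(word)    # e.g. "wwdc09" -> ["wwdc", "09"]
        if len(parts) > 1:             # only add parts if the word was split
            terms.extend(parts)
    return terms


print(tokenize("WWDC09"))
```

The same function would have to be applied both to the indexed text and to the search box input, otherwise the terms in the query cannot match the terms in the index.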

In general, indexed search sometimes does not yield all "expected" results, just due to the way it works (it can only find what was put into the index).

The tokenizer was quite broken in 1.6 (and doing weird things) and should be much better in 1.8.x.

Someone needs to take a deeper look at what exactly is happening in this case; maybe a new bug was introduced while fixing other stuff there.

Maybe you could try again with debug logging for the search package to get more insights.
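For readers unfamiliar with how per-package debug logging behaves in Python, the sketch below uses only the standard `logging` module. It does not show how MoinMoin's own logging configuration is written (that is not covered here); it only illustrates that raising the level on the logger named `MoinMoin.search` also applies to child loggers such as `MoinMoin.search.queryparser`, while unrelated loggers stay quiet. The logger name `MoinMoin.config` is used purely as a contrast.

```python
import logging

# Send records to stderr and keep the root logger at WARNING.
logging.basicConfig(format="%(name)s %(levelname)s %(message)s")
logging.getLogger().setLevel(logging.WARNING)

# Enable DEBUG for the search package only.
logging.getLogger("MoinMoin.search").setLevel(logging.DEBUG)

# Child loggers inherit the effective level from their nearest configured parent.
child = logging.getLogger("MoinMoin.search.queryparser")
other = logging.getLogger("MoinMoin.config")

child.debug("this message is emitted")
other.debug("this message is suppressed")
```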

DaveHill: Thanks for your reply and explanation.

I was thinking about how this relates to the WordIndex of the site (does this use the same tokenizer?). The index doesn't contain WWDC as a word in either the 1.6.1 or the 1.8.4 wiki, but it does list the page under "C09" in both cases.

There are also some bizarre entries (e.g. the title "TestPC26" is listed under "Test" and "C26", but "TestMonitor26" is listed under "Test" and "Monitor26"), again the same on 1.6.1 and 1.8.4.

Is it something to do with the fact that the problem pages are not CamelCase?
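One pattern that would reproduce all three WordIndex observations above is "an uppercase letter followed by one or more lowercase letters or digits". The sketch below is only a hypothesis that fits the reported entries; whether WordIndex really uses such a pattern is an assumption, and the names `CAPITALIZED_WORD` and `index_words` are invented for this example.

```python
import re

# Hypothetical pattern (an assumption): one uppercase letter followed by
# at least one lowercase letter or digit. Uppercase letters that are
# followed directly by another uppercase letter match nothing.
CAPITALIZED_WORD = re.compile(r"[A-Z][a-z0-9]+")


def index_words(title):
    """Return the words a title would be listed under, per the pattern."""
    return CAPITALIZED_WORD.findall(title)


for title in ("WWDC09", "TestPC26", "TestMonitor26"):
    print(title, index_words(title))
```

Under this pattern, a run of capitals such as "WWD" or "P" is dropped because each of those letters is followed by another capital, which would explain why titles that are not CamelCase end up under odd entries.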

Interestingly, the problem with 1.8.4 occurred before I rebuilt the Xapian index (I rebuilt it to see if that fixed it), so maybe it points to the Search box rather than the Xapian index build routines.

I will turn on debugging and collect the output - Dave.

DaveHill: OK, I have found the problem - the index was bad.

I turned on debugging for "MoinMoin.search" and got entries like this for a title search:

MoinMoin.search.queryparser DEBUG parse_quoted_separated items: [u'wwdc09']
MoinMoin.search.queryparser DEBUG analyse_items query: <MoinMoin.search.queryparser.AndExpression instance at 0xb7f3c46c>
MoinMoin.search.builtin DEBUG _xapianSearch: query = 'Xapian::Query((Swwdc09:(wqf=100) AND Swwdc:(wqf=100) AND S09:(wqf=100)))'
MoinMoin.search.builtin DEBUG _xapianSearch: finds: []
MoinMoin.search.builtin DEBUG _xapianSearch: finds pages: []
MoinMoin.search.builtin DEBUG _getHits searching in 0 pages ...
MoinMoin.search.builtin DEBUG _getHits returning []. 
MoinMoin.search.builtin DEBUG _xapianSearch found 0 hits
MoinMoin.search.builtin DEBUG after filtering: 0 hits

so I started looking at the data/cache/xapian index files. I rebuilt them again using "moin .... index build --mode=replace" but this didn't make any difference (I took a copy of the xapian directory and compared it afterwards). I then deleted the xapian directory and did "moin .... index build --mode=add" and got a totally different index that works OK!

bash-3.2# sudo -s -u apache
bash-3.2$ rm -rf data/cache/xapian.bak/
bash-3.2$ cp -a data/cache/xapian data/cache/xapian.bak
bash-3.2$ /usr/local/moin184/bin/moin --config-dir=/var/www/wiki/config --wiki-url=http://wikitemp.example.com/wiki index build --mode=replace
2009-06-24 18:45:03,901 WARNING MoinMoin.log:139 using logging configuration read from built-in fallback in MoinMoin.log module!
2009-06-24 18:45:04,040 INFO MoinMoin.config.multiconfig:125 using wiki config: /var/www/wiki/config/wikiconfig.pyc
2009-06-24 18:45:06,490 INFO MoinMoin.search.builtin:266 indexing completed successfully in 11.87 seconds.
bash-3.2$ du -sh data/cache/xapian*
96M     data/cache/xapian
96M     data/cache/xapian.bak
bash-3.2$ rm -rf data/cache/xapian
bash-3.2$ /usr/local/moin184/bin/moin --config-dir=/var/www/wiki/config --wiki-url=http://wikitemp.example.com/wiki index build --mode=add
2009-06-24 18:45:39,726 WARNING MoinMoin.log:139 using logging configuration read from built-in fallback in MoinMoin.log module!
2009-06-24 18:45:39,866 INFO MoinMoin.config.multiconfig:125 using wiki config: /var/www/wiki/config/wikiconfig.pyc
2009-06-24 18:46:42,900 INFO MoinMoin.search.builtin:266 indexing completed successfully in 62.83 seconds.
bash-3.2$ du -sh data/cache/xapian*
42M     data/cache/xapian
96M     data/cache/xapian.bak

So, although HelpOnXapian says that "--mode=replace" deletes the existing index, it didn't.

DaveHill: You are indeed correct, I don't know how I got it into my head that the option was "replace", I blame premature senility! That said, an error message saying "replace is not a valid mode" would have saved several days of frustration, not to mention a public "egg on face" moment.
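The kind of check asked for here could look like the following minimal Python sketch. It is not taken from the moin script: the function name `check_mode` is invented, and the set of valid modes (add, update, rebuild) is an assumption that should be verified against the script's own help output.

```python
# Assumed set of valid modes; verify against the moin script's help text.
VALID_MODES = ("add", "update", "rebuild")


def check_mode(mode, valid_modes=VALID_MODES):
    """Return mode unchanged, or raise ValueError naming the bad value."""
    if mode not in valid_modes:
        raise ValueError(
            "%r is not a valid mode (expected one of: %s)"
            % (mode, ", ".join(valid_modes))
        )
    return mode


try:
    check_mode("replace")
except ValueError as err:
    print(err)
```

Failing early like this avoids the situation shown in the transcript, where a run with an unrecognized mode still reported that indexing completed successfully.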

Plan


CategoryMoinMoinNoBug

MoinMoin: MoinMoinBugs/XapianReturnsLessHitsOn1.8vs1.6 (last edited 2009-11-08 08:31:31 by ThomasWaldmann)