Improving Xapian search

The aim of the project is to present search results in a better way. This includes designing an output layout (how the results are shown) and grouping search results (showing top-level pages before subpages and attachments).

Repository: http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/

Kinds of TODOs:

SOCTODO means that this is part of the SOC 2009 project of DmitrijsMilajevs (support him, but leave the coding to him)
TODO means: help needed, anybody is welcome to help with this (please coordinate here or on #moin-dev)

This text contains links to the source code. They point to the latest revision available at the time of writing: 4b2ef153ad4f

xapwrap to xappy migration

Xapian searching used to be done with the xapwrap library. xapwrap is no longer supported, so it was replaced by xappy.

Indexing

Every indexed document has a field structure, which is defined in the MoinIndexerConnection class. STORE_CONTENT is required for regexp searching and document deletion from the index.

Moin indexes three kinds of documents: files, attachments and pages. Every document has a unique id, and each kind of document has its own id structure:

for files "%s:%s" % (wikiname, os.path.join(fs_rootpage, filename))
for attachments "%s:%s//%s" % (wikiname, pagename, att)
for pages "%s:%s:%s" % (wikiname, pagename, revision)
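
The three id formats above can be sketched as small helper functions. This is only an illustration: the function names are hypothetical and do not come from the Moin source; only the format strings are taken from the list above.

```python
import os


def file_id(wikiname, fs_rootpage, filename):
    # Hypothetical helper: id of a file below the filesystem root page.
    return "%s:%s" % (wikiname, os.path.join(fs_rootpage, filename))


def attachment_id(wikiname, pagename, att):
    # Hypothetical helper: the double slash separates page name and attachment.
    return "%s:%s//%s" % (wikiname, pagename, att)


def page_id(wikiname, pagename, revision):
    # Hypothetical helper: one id per page revision.
    return "%s:%s:%s" % (wikiname, pagename, revision)
```

For example, attachment_id("MyWiki", "FrontPage", "logo.png") gives "MyWiki:FrontPage//logo.png", and page_id("MyWiki", "FrontPage", 3) gives "MyWiki:FrontPage:3". The wiki and page names here are made up.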

Title and content fields are tokenized and optionally stemmed. Tokenization is needed for queries like HelpOn, where a user expects to get help pages (HelpOnSearching, HelpOnEditing, HelpOnAcl, etc.). In the search index, WikiWords is transformed to "wikiwords wiki words", or to "wikiwords wikiword wiki words word" if stemming is enabled. Analyzed strings are in lowercase because Xapian expects lowercase input for stemming. The index therefore stores both values, the original and the tokenized one. Tokenization and stemming are done by the WikiAnalyzer.
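
The transformation described above can be sketched roughly as follows. This is not the WikiAnalyzer code: analyze() and naive_stem() are hypothetical names, the splitting regexp handles ASCII only, and naive_stem() is a crude stand-in for a real Xapian stemmer (it only strips a trailing "s").

```python
import re

# Assumption: a CamelCase word is split into capitalized or lowercase runs.
_PARTS = re.compile(r"[A-Z][a-z]+|[a-z]+|[A-Z]+(?![a-z])|[0-9]+")


def naive_stem(word):
    # Stand-in for a real stemmer, for illustration only.
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def analyze(word, stem=None):
    """Return the lowercase tokens for one word, optionally with stems."""
    tokens = []

    def emit(token):
        token = token.lower()  # stemming expects lowercase input
        tokens.append(token)
        if stem is not None:
            stemmed = stem(token)
            if stemmed != token:
                tokens.append(stemmed)

    emit(word)  # the whole word first
    parts = _PARTS.findall(word)
    if len(parts) > 1:  # then its CamelCase parts, if there are any
        for part in parts:
            emit(part)
    return tokens
```

With this sketch, analyze("WikiWords") returns the tokens wikiwords, wiki, words, and analyze("WikiWords", stem=naive_stem) returns wikiwords, wikiword, wiki, words, word. Because "help" and "on" end up as separate tokens of HelpOnSearching, a query for HelpOn can match that page.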

Documents are removed from the index either by document id or by fulltitle value.

Searching

Queries for Xapian are built by xapian_term() in MoinMoin.search.queryparser.expressions. For regexp-based queries, it checks every document in the index and queries only those in which the regexp has found matches. For other queries, the appropriate fields are queried.
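
The regexp case can be illustrated with a small sketch. It is not the xapian_term() implementation: a plain dict stands in for the index, regexp_candidates() is a hypothetical name, and the "content" field is assumed to hold the stored text (which is why stored content is needed for regexp searching).

```python
import re


def regexp_candidates(documents, pattern, field="content"):
    """Scan every stored document and return the ids whose field matches.

    documents: dict mapping document id to a dict of stored field values
    (an in-memory stand-in for the index, for illustration only).
    """
    regex = re.compile(pattern)
    return [
        doc_id
        for doc_id, fields in documents.items()
        if regex.search(fields.get(field, ""))
    ]
```

Only the ids returned by such a scan would then be used to build the actual query. The cost grows with the size of the index, because every document has to be checked.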

Plan

MoinMoinBugs/NoHitsOnPartialTitleSearch
FeatureRequests/XapianSearchWithBetterSearchResultList
index locking issues
Refactoring
- rename BaseTextFieldSearch to BaseAnalyzedFieldSearch
- rename BaseFieldSearch to BaseExactFieldSearch
- rename xapian_term() to xapian_query and make it a property
update CHANGES file
check if we need a workaround for Unicode
- Unicode test
- Test stemming of other languages
- https://code.launchpad.net/~miracle2k/django-xappy/trunk
- http://bazaar.launchpad.net/~miracle2k/django-xappy/trunk/revision/38

Testing

TakeoKatsuki may be an interested Windows user who could try out how it works there.

Problems

Xapian2009/Problems

Diary

CategoryGsocProject

MoinMoin: Xapian2009