<<Navigation(children)>>

= Improving Xapian search =

The aim of the project is to show search results in a better way. This includes designing an output layout (how the results are shown) and grouping search results (showing top-level pages before subpages and attachments).

 Repository:: http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/

Kinds of TODOs:

 * SOCTODO means that this is part of the SOC 2009 project of DmitrijsMilajevs (support him, but leave the coding to him)
 * TODO means: help needed, anybody is welcome to help with this (please coordinate here or on #moin-dev)

(!) This text contains links to the source code. They point to the latest revision available at the time of writing: [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/|4b2ef153ad4f]].

== xapwrap to xappy migration ==

Xapian searching used to be done with the `xapwrap` library. Because `xapwrap` is no longer supported, it was replaced by [[http://code.google.com/p/xappy/|xappy]].

== Indexing ==

Every indexed document has a field structure, which is defined in the [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/indexing.py#l75|MoinIndexerConnection]] class. `STORE_CONTENT` is required for regexp searching and document deletion from the index.

Moin indexes three kinds of documents: files, attachments and pages. Every document has a unique `id` value, and each kind of document has its own id structure:

 * for files `"%s:%s" % (wikiname, os.path.join(fs_rootpage, filename))`
 * for attachments `"%s:%s//%s" % (wikiname, pagename, att)`
 * for pages `"%s:%s:%s" % (wikiname, pagename, revision)`
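The three id formats above can be sketched as small helper functions. The function names are hypothetical and are not taken from the Moin source; only the format strings come from the list above.

```python
import os


def file_id(wikiname, fs_rootpage, filename):
    # Files are identified by wiki name and filesystem path.
    return "%s:%s" % (wikiname, os.path.join(fs_rootpage, filename))


def attachment_id(wikiname, pagename, att):
    # Attachments use a double slash between page name and attachment name.
    return "%s:%s//%s" % (wikiname, pagename, att)


def page_id(wikiname, pagename, revision):
    # Pages are identified per revision.
    return "%s:%s:%s" % (wikiname, pagename, revision)
```

Because the separators differ (`//` for attachments, a trailing `:revision` for pages), the kind of a document can usually be told from its id alone.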

''Title'' and ''content'' fields are tokenized and optionally stemmed. Tokenization is needed for queries like `HelpOn`, where a user expects to get help pages (HelpOnSearching, HelpOnEditing, HelpOnAcl, etc). In the search index, `WikiWords` is transformed to `wikiwords wiki words`, or to `wikiwords wikiword wiki words word` if stemming is enabled. Analyzed strings are lowercase because Xapian expects lowercase input for stemming. The index therefore stores both values, the [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/indexing.py#l239|original]] and the [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/indexing.py#l241|tokenized]] one. Tokenization and stemming are done by the [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/tokenizer.py#l17|WikiAnalyzer]].
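The `WikiWords` transformation can be illustrated with a minimal sketch. This is not the actual `WikiAnalyzer` code: the `analyze` function, the splitting regexp and the `toy_stem` helper are assumptions made for illustration, and `toy_stem` is only a stand-in for a real Xapian stemmer.

```python
import re

# Assumed splitting rule: a capital letter followed by lowercase letters
# or digits, or a run of lowercase letters or digits.
_PARTS = re.compile(r"[A-Z][a-z0-9]*|[a-z0-9]+")


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(word, stem=False):
    # Emit the whole word in lowercase, then its CamelCase parts.
    parts = _PARTS.findall(word)
    tokens = [word.lower()]
    if len(parts) > 1:
        tokens.extend(part.lower() for part in parts)

    result = []
    for token in tokens:
        result.append(token)
        if stem:
            stemmed = toy_stem(token)
            if stemmed != token:
                # Keep the stem next to the token it was derived from.
                result.append(stemmed)
    return result
```

With this sketch, `analyze("HelpOn")` yields `help` and `on`, which are also among the tokens of `HelpOnSearching`; this overlap is what lets a partial title query find the help pages.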

Documents are [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/indexing.py#l447|removed]] from the index either by document id or by `fulltitle` value.

== Searching ==

Queries for Xapian are built by `xapian_term()` in [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/queryparser/expressions.py|MoinMoin.search.queryparser.expressions]]. For regexp-based queries, it [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/queryparser/expressions.py#l128|checks]] every document in the index and queries only those in which the regexp found a match. For other queries, the appropriate fields are [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/queryparser/expressions.py#l419|queried]].
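The regexp path can be pictured as a full scan over the stored content. The sketch below is an assumption-laden illustration, not the code in `expressions.py`: a plain dictionary stands in for the Xapian index, and `regex_matching_ids` is a hypothetical name.

```python
import re


def regex_matching_ids(documents, pattern):
    # `documents` maps document id -> stored content (stand-in for the index).
    regex = re.compile(pattern)
    # Every document is checked; only ids with a match are kept, so the
    # real query can then be restricted to these documents.
    return [doc_id for doc_id, content in sorted(documents.items())
            if regex.search(content)]
```

A scan like this needs the document text to be available, which is consistent with `STORE_CONTENT` being required for regexp searching.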

== Plan ==

 * MoinMoinBugs/NoHitsOnPartialTitleSearch
 * FeatureRequests/XapianSearchWithBetterSearchResultList
 * index locking issues
 * Refactoring
  * rename !BaseTextFieldSearch to !BaseAnalyzedFieldSearch
  * rename !BaseFieldSearch to !BaseExactFieldSearch
  * rename xapian_term() to xapian_query and make it a property
 * update `CHANGES` file
 * check if we need a workaround for Unicode
  * Unicode test
  * Test stemming of other languages
  * https://code.launchpad.net/~miracle2k/django-xappy/trunk
  * http://bazaar.launchpad.net/~miracle2k/django-xappy/trunk/revision/38

== Testing ==
 * TakeoKatsuki may be an interested Windows user who could try out how it works there

== Problems ==

[[Xapian2009/Problems]]

== Diary ==

<<MonthCalendar>>
----
CategoryGsocProject