<<Navigation(children)>>

= Improving Xapian search =

The aim of the project is to present search results in a better way. This includes designing an output layout (how the results are shown) and grouping search results (showing top-level pages before subpages and attachments).

 Repository:: http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/

Kinds of TODOs:
 * SOCTODO means that this is part of the SOC 2009 project of DmitrijsMilajevs (support him, but leave the coding to him)
 * TODO means: help needed, anybody is welcome to help with this (please coordinate here or on #moin-dev)

(!) This text contains links to the source code. They point to the latest revision available at the time of writing, [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/|4b2ef153ad4f]].

== xapwrap to xappy migration ==

Xapian searching used to be done with the `xapwrap` library. It is no longer supported, so xapwrap was replaced by [[http://code.google.com/p/xappy/|xappy]].

== Indexing ==

Every indexed document has a field structure, which is defined in the [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/indexing.py#l75|MoinIndexerConnection]] class. `STORE_CONTENT` is required for regexp searching and for deleting documents from the index.

Moin indexes 3 kinds of documents: files, attachments and pages. Every document has a unique `id` value. Different kinds of documents have different id structures:
 * for files `"%s:%s" % (wikiname, os.path.join(fs_rootpage, filename))`
 * for attachments `"%s:%s//%s" % (wikiname, pagename, att)`
 * for pages `"%s:%s:%s" % (wikiname, pagename, revision)`

''Title'' and ''content'' fields are tokenized and optionally stemmed. Tokenization is needed for queries like `HelpOn`, where a user expects to get help pages (HelpOnSearching, HelpOnEditing, HelpOnAcl, etc). In the search index `WikiWords` are transformed to `wikiwords wiki words`, or to `wikiwords wikiword wiki words word` if stemming is enabled.
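
The `WikiWords` transformation can be illustrated with a minimal sketch. This is not the `WikiAnalyzer` code: `tokenize_wikiword` and `toy_stem` are hypothetical names, the CamelCase rule is a simplification, and the toy stemmer only strips a trailing "s" as a stand-in for a real Xapian stemmer.

```python
import re

# Hypothetical helper, NOT the actual WikiAnalyzer implementation.
# A CamelCase part is an uppercase letter followed by lowercase letters/digits.
_CAMEL_PART = re.compile(r"[A-Z][a-z0-9]+")


def tokenize_wikiword(word, stem=None):
    """Return lowercase index terms for one word, optionally with stems."""
    raw = [word.lower()]
    parts = _CAMEL_PART.findall(word)
    # Emit the parts only when the word consists entirely of several parts.
    if len(parts) > 1 and "".join(parts) == word:
        raw.extend(part.lower() for part in parts)
    if stem is None:
        return raw
    terms = []
    for term in raw:
        terms.append(term)
        stemmed = stem(term)
        # Add the stem only when it differs from the term itself.
        if stemmed != term:
            terms.append(stemmed)
    return terms


def toy_stem(term):
    """Toy stand-in for a real stemmer: strips one trailing 's'."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


print(" ".join(tokenize_wikiword("WikiWords")))
# wikiwords wiki words
print(" ".join(tokenize_wikiword("WikiWords", stem=toy_stem)))
# wikiwords wikiword wiki words word
```

Because the parts are indexed next to the whole word, a page title such as `HelpOnSearching` also contributes the terms `help`, `on` and `searching` in this sketch.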
Analyzed strings are lowercase because Xapian expects lowercase input for stemming. Thus both values are stored in the index: the [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/indexing.py#l239|original]] and the [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/indexing.py#l241|tokenized]] one. Tokenization and stemming are done by the [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/tokenizer.py#l17|WikiAnalyzer]].

Documents are [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/Xapian/indexing.py#l447|removed]] from the index either by document id or by `fulltitle` value.

== Searching ==

Queries for Xapian are built by `xapian_term()` in [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/queryparser/expressions.py|MoinMoin.search.queryparser.expressions]]. For regexp-based queries it [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/queryparser/expressions.py#l128|checks]] every document in the index and queries only those in which the regexp has found matches. For other queries, the appropriate fields are [[http://hg.moinmo.in/moin/1.9-xapian-dmilajevs/file/4b2ef153ad4f/MoinMoin/search/queryparser/expressions.py#l419|queried]].
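
The regexp case can be sketched without Xapian itself. The snippet below is only an illustration of the idea, not the code in `expressions.py`: `regexp_matching_ids`, the wiki name `MyWiki` and the sample contents are made up, and a plain dict stands in for the stored content of the index (which is why `STORE_CONTENT` is needed).

```python
import re


def regexp_matching_ids(documents, pattern, flags=re.IGNORECASE):
    """Scan every stored document and return the ids the regexp matches.

    `documents` is a hypothetical mapping of document id -> stored content.
    """
    regexp = re.compile(pattern, flags)
    return sorted(
        doc_id for doc_id, content in documents.items() if regexp.search(content)
    )


# Made-up stored content, keyed by page ids of the form
# "wikiname:pagename:revision".
documents = {
    "MyWiki:HelpOnSearching:3": "How to search the wiki with Xapian.",
    "MyWiki:HelpOnEditing:7": "How to edit pages.",
    "MyWiki:FrontPage:12": "Welcome. Searching is described elsewhere.",
}

matching = regexp_matching_ids(documents, r"search(ing)?\b")
print(matching)
# ['MyWiki:FrontPage:12', 'MyWiki:HelpOnSearching:3']
```

The resulting ids would then restrict the actual query to the matching documents. The cost of this approach is a scan over every document in the index, which is presumably acceptable only because regexp queries cannot use the term index directly.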

== Plan ==

 * MoinMoinBugs/NoHitsOnPartialTitleSearch
 * FeatureRequests/XapianSearchWithBetterSearchResultList
 * index locking issues
 * Refactoring
  * rename !BaseTextFieldSearch to !BaseAnalyzedFieldSearch
  * rename !BaseFieldSearch to !BaseExactFieldSearch
  * rename `xapian_term()` to `xapian_query` and make it a property
 * update `CHANGES` file
 * check if we need a workaround for Unicode
 * Unicode test
 * Test stemming of other languages
 * https://code.launchpad.net/~miracle2k/django-xappy/trunk
 * http://bazaar.launchpad.net/~miracle2k/django-xappy/trunk/revision/38

== Testing ==

 * TakeoKatsuki may be an interested Windows user who could try out how it works there

== Problems ==

[[Xapian2009/Problems]]

== Diary ==

<<MonthCalendar>>

----
CategoryGsocProject