Improving Xapian search
The aim of the project is to present search results in a better way. This includes designing an output layout (how the results are shown) and grouping search results (showing top-level pages before subpages and attachments).
Kinds of TODOs:
- SOCTODO means that this is part of the SOC 2009 project of DmitrijsMilajevs (support him, but leave the coding to him)
- TODO means: help needed, anybody is welcome to help with this (please coordinate here or on #moin-dev)
This text contains links to the source code. They point to the latest revision available at the time of writing: 4b2ef153ad4f
xapwrap to xappy migration
Xapian searching used to be done with the xapwrap library. xapwrap is no longer supported, so it was replaced by xappy.
Indexing
Every indexed document has a field structure, which is defined in the MoinIndexerConnection class. STORE_CONTENT is required for regexp searching and for deleting documents from the index.
Moin indexes 3 kinds of documents: files, attachments and pages. Every document has a unique id value, and each kind of document has its own id structure:
for files "%s:%s" % (wikiname, os.path.join(fs_rootpage, filename))
for attachments "%s:%s//%s" % (wikiname, pagename, att)
for pages "%s:%s:%s" % (wikiname, pagename, revision)
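The three id formats above can be sketched as small helper functions. This is only an illustration: the function names are hypothetical and are not taken from the MoinMoin source, only the format strings are.

```python
import os


def file_id(wikiname, fs_rootpage, filename):
    # Hypothetical helper: id of a file indexed from the filesystem.
    return "%s:%s" % (wikiname, os.path.join(fs_rootpage, filename))


def attachment_id(wikiname, pagename, att):
    # Hypothetical helper: the double slash separates page name and attachment.
    return "%s:%s//%s" % (wikiname, pagename, att)


def page_id(wikiname, pagename, revision):
    # Hypothetical helper: every page revision gets its own id.
    return "%s:%s:%s" % (wikiname, pagename, revision)
```

For example, page_id("MyWiki", "FrontPage", 3) gives "MyWiki:FrontPage:3", while an attachment of the same page gets an id like "MyWiki:FrontPage//logo.png", so the two kinds cannot collide.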
Title and content fields are tokenized and optionally stemmed. Tokenization is needed for queries like HelpOn, where a user expects to get the help pages (HelpOnSearching, HelpOnEditing, HelpOnAcl, etc). In the search index, a WikiWord such as WikiWords is transformed to "wikiwords wiki words", or to "wikiwords wikiword wiki words word" if stemming is enabled. Analyzed strings are lowercase because xapian expects lowercase input for stemming. The index therefore stores both the original and the tokenized values. Tokenization and stemming are done by the WikiAnalyzer.
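The WikiWord transformation can be illustrated with a rough sketch. This is not the WikiAnalyzer code: analyze() and naive_stem() are hypothetical names, the splitting regexp is a simplification, and the toy stemmer only strips a plural "s" where a real setup would use a proper stemmer.

```python
import re

# Simplified split: an uppercase letter starts a new part (assumption, not Moin's rule).
_PART_RE = re.compile(r"[A-Z][a-z0-9]*|[a-z0-9]+")


def naive_stem(term):
    # Toy stand-in for a real stemmer: strips a trailing plural "s".
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def analyze(word, stem=None):
    """Return the lowercase terms for one word, whole word first, then its parts."""
    parts = _PART_RE.findall(word)
    candidates = [word] if len(parts) <= 1 else [word] + parts
    terms = []
    for candidate in candidates:
        term = candidate.lower()  # lowercase before stemming
        terms.append(term)
        if stem is not None:
            stemmed = stem(term)
            if stemmed != term:  # only add the stem when it differs
                terms.append(stemmed)
    return terms
```

With this sketch, analyze("HelpOnSearching") yields the terms helponsearching, help, on and searching, which is why a query for HelpOn can find that page. Passing stem=naive_stem additionally emits the stemmed form after each term whose stem differs.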
Documents are removed from the index either by document id or by fulltitle value.
Searching
Queries for xapian are built by xapian_term() in MoinMoin.search.queryparser.expressions. For regexp-based queries it checks every document in the index and queries only those in which the regexp has found matches. For other queries, the appropriate fields are queried.
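The regexp case can be sketched in plain Python. This is an assumption-laden illustration and not the xapian_term() implementation: the dict stands in for an index whose documents keep their content (the STORE_CONTENT requirement), and regexp_matching_ids() is a hypothetical name.

```python
import re


def regexp_matching_ids(stored_contents, pattern, flags=re.UNICODE):
    """Scan every stored document and return the ids whose content matches.

    stored_contents is a hypothetical stand-in for the index: a dict that
    maps document id to the stored content of that document.
    """
    regex = re.compile(pattern, flags)
    return [
        doc_id
        for doc_id, content in sorted(stored_contents.items())
        if regex.search(content)
    ]
```

The returned ids would then be the only documents the real query is restricted to. Because every document is scanned, the cost of a regexp query grows with the size of the index, unlike queries that go straight to indexed fields.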
Plan
- index locking issues
- Refactoring
  - rename BaseTextFieldSearch to BaseAnalyzedFieldSearch
  - rename BaseFieldSearch to BaseExactFieldSearch
- rename xapian_term() to xapian_query and make it a property
  - update CHANGES file
- check if we need a workaround for Unicode
- Unicode test
- Test stemming of other languages
http://bazaar.launchpad.net/~miracle2k/django-xappy/trunk/revision/38
Testing
TakeoKatsuki may be an interested Windows user who could try out how it works there.
Problems
Diary