Basic ideas
More user influence:
- WikiFeatures:NearSearch - search "neighboring" wikis
- add a checkbox to include a MetaWiki search
- checkboxes for "search in titles", "search in content"
- input field for a regex restricting the pages to search, and one for pages to exclude from the search
- "search in older versions" maybe?
- sorting options (by # of hits, by page name, ...)
- exact phrase / any words / all words
- exclude words
- make "regex" mode a checkbox (remove the current heuristics)
- regular expressions are not compatible with indexing, so perhaps we should drop regex search (maybe keep it for title search)
- pages having keywords/attributes/MetaData matching a given name=pattern
Add more things you'd like to the list above, if they're not too exotic.
See ThomasWaldmann/WikiUsability/GoggleSearchPatch - page updated with a collection of diffs!
Suggestion
Use special characters to define what to do with the search pattern:
- <, >, <=, >=: the title must be alphabetically smaller, greater, ... than the expression (see the sketch below)
  - no regex or wildcards allowed here, perhaps ? to ignore single chars
  - useful for page names containing dates or page numbers
  - only useful if the page name is in a proper format (parts of YYYY-MM-DD-HH-MM-SS) or a number with leading zeros
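A minimal sketch of how such comparison operators could be evaluated; the function and page names are made up for illustration. Plain lexicographic string comparison already does the right thing for zero-padded numbers and YYYY-MM-DD style names:

{{{
# Sketch: evaluate a title comparison like ">= 2004-01-01" against a
# list of page names. Lexicographic comparison works as long as the
# names are zero-padded / ISO-date formatted.
import operator

OPS = {'<': operator.lt, '>': operator.gt,
       '<=': operator.le, '>=': operator.ge}

def title_compare(pagenames, op, expression):
    """Return all page names that compare to `expression` with `op`."""
    return [name for name in pagenames if OPS[op](name, expression)]

pages = ['2003-12-31', '2004-01-15', '2004-02-01']
print(title_compare(pages, '>=', '2004-01-01'))
# -> ['2004-01-15', '2004-02-01']
}}}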
Query language
This is a draft of a possible query language. Please discuss variants in the "Suggestion" section above.
see HelpOnSearching
Architecture
Frontend -> Searchpattern -> Results -> Sorted Results -> Output

with the corresponding data at each stage:

String -> Query tree -> list of result objects -> list of result objects -> request.write(formatter.xxx())
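To make the pipeline concrete, here is a toy sketch of the query tree stage in Python. The class names mirror the "and, or, text, title" expressions from the implementation log below, but this is illustrative only, not the actual MoinMoin code:

{{{
# Toy query tree: each node decides whether a single page matches.
# AndExpression/OrExpression/TextSearch/TitleSearch are illustrative
# names, not necessarily the real implementation.

class AndExpression:
    def __init__(self, *terms):
        self.terms = terms
    def search(self, pagename, text):
        return all(t.search(pagename, text) for t in self.terms)

class OrExpression:
    def __init__(self, *terms):
        self.terms = terms
    def search(self, pagename, text):
        return any(t.search(pagename, text) for t in self.terms)

class TextSearch:
    def __init__(self, needle):
        self.needle = needle.lower()
    def search(self, pagename, text):
        return self.needle in text.lower()

class TitleSearch:
    def __init__(self, needle):
        self.needle = needle.lower()
    def search(self, pagename, text):
        return self.needle in pagename.lower()

# "Frontend -> Searchpattern": a parser would build such a tree from
# the query string, e.g. for a query meaning  wiki AND title:Help
query = AndExpression(TextSearch("wiki"), TitleSearch("Help"))
print(query.search("HelpOnSearching", "How to search this wiki ..."))  # True
}}}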
Implementation Log
I started implementing some of the data structures above. Things already look usable. I am now thinking about how to integrate indexed and attachment search -- FlorianFesti 2004-03-01 17:37:27
see also /AttachmentSearch
QueryParser
- BUG: an empty search string leads to errors
- search expressions: and, or, text, title
Result Classes
- sorting: by weight, page name
- printing
Macros
- TitleSearch
- FullSearch
  - basic functionality
  - context parameter
  - byname parameter
- PageList
  - keep only for compatibility (unchanged)
  - or extend with new features?
Actions
- fullsearch
  - "titlesearch" parameter
  - sorting depending on "titlesearch"
    - explicit sort parameter?
  - context parameter
- titlesearch (killed)
- inlinesearch (killed)
Indexed search
- query converter
- query classes (see the sketch below)
  - IndexedOrExpression
  - IndexedAndExpression
  - IndexedTextSearch
- search logic
- indexing engine
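A rough sketch of how the Indexed* query classes could differ from the linear ones: instead of testing one page at a time, they return sets of candidate pages straight from the index. The index layout and method names are assumptions, not the real code:

{{{
# Sketch: indexed query classes return *sets* of page names from a
# word -> pages index, combined with set operations.
# index example: {'wiki': {'FrontPage', 'HelpOnSearching'}, ...}

class IndexedTextSearch:
    def __init__(self, word):
        self.word = word.lower()
    def search(self, index):
        return index.get(self.word, set())

class IndexedAndExpression:
    def __init__(self, *terms):
        self.terms = terms
    def search(self, index):
        results = [t.search(index) for t in self.terms]
        return set.intersection(*results) if results else set()

class IndexedOrExpression:
    def __init__(self, *terms):
        self.terms = terms
    def search(self, index):
        return set().union(*(t.search(index) for t in self.terms))

index = {'moin': {'FrontPage', 'MoinMoin'},
         'search': {'HelpOnSearching', 'FrontPage'}}
q = IndexedAndExpression(IndexedTextSearch('moin'), IndexedTextSearch('search'))
print(q.search(index))  # {'FrontPage'}
}}}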
Indexing
- in c't 25/2003 there was a wiki engine comparison and moin got a "-" concerning scalability - I think this was due to the full-text search, which currently works without an index (it is quite fast, though not as fast as an indexed search would be).
- create a dict = { 'word': ['page1', 'page2', ...], 'anotherword': [...], ...} for every (full) word in the wiki. The key is the search word, the value is a list of pages the word appears on.
  - create a page FullWordIndex using that data (used like the word index in a book), maybe as a macro
- pickle that dict to disk
- searching for a word then is:
  - fulltextsearch( dict[searchword] ) - do not search all pages, but only those that are known to contain the word
  - for multiple word searches, use set intersection (py2.3 sets or self-made equivalent), as in the sketch below:
    - fulltextsearch( dict[w1] intersect dict[w2] intersect dict[w3] ...)
- generate a wiki with > 100,000 pages of a few KB each for testing / benchmarking
  - goal: 1s for searching 100,000 pages (should be easy, as the dict lookup is O(1) and the full search over the value list is O(n), but usually with a small n << pagecount)
  - finally repeat with 1,000,000 pages
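A minimal sketch of this proposal, assuming pages are already loaded as a pagename -> text dict (the loading itself is out of scope here); sets are used instead of lists so the intersection is direct:

{{{
# Sketch: build the word -> pages index described above and search it.
import pickle
import re

def build_index(pages):
    """pages: dict of pagename -> page text."""
    index = {}
    for name, text in pages.items():
        for word in set(re.findall(r"\w+", text.lower())):
            index.setdefault(word, set()).add(name)
    return index

def save_index(index, path="word_index.pickle"):
    """Pickle the index to disk, per the proposal above."""
    with open(path, "wb") as f:
        pickle.dump(index, f)

def search(index, *words):
    """Multi-word search = set intersection of the per-word page sets."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

pages = {"FrontPage": "Welcome to the wiki",
         "HelpOnSearching": "How to search the wiki"}
index = build_index(pages)
print(search(index, "wiki", "search"))  # {'HelpOnSearching'}
}}}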
http://www-106.ibm.com/developerworks/linux/library/l-pyind.html
As mentioned in the article above, this will run into problems once the database grows and no longer fits into memory: 1,000,000 pages of several kB each are several GB of data, and I would not expect the index to be much smaller than the wiki pages themselves. To handle this amount of data, a real database is needed.
Problems
- Index search has some limitations:
  - no regex search - if you keep substring indexes based on character triples, a regex could be broken down into substring searches
  - only matches whole words (improved by stemming)
  - probably better to offer only case-insensitive search
- Index search works quite differently from linear search:
  - linear search wants to process one page after the other and remove excluded pages from memory as early as possible. Whether a page matches is calculated as a boolean expression over the terms found in it.
  - index search returns a set of found pages per term. These sets are combined with set operations.
  - title searches (regex, substring matches) can't be done by the indexer. These search terms have to be applied as a boolean expression to every page found by the indexed full-text search (if there is a full-text search); see the sketch below.
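A sketch of how the two models could be combined, per the last point above: the index narrows the candidate set, and the non-indexable title regex is then applied linearly to just those candidates. The function name and index layout are made up for illustration:

{{{
# Sketch: indexed search narrows the candidates, then terms the index
# cannot answer (title regex, substrings) are checked linearly.
import re

def combined_search(index, fulltext_words, title_regex):
    # 1. index phase: intersect the per-word page sets
    candidates = None
    for w in fulltext_words:
        pages = index.get(w.lower(), set())
        candidates = pages if candidates is None else candidates & pages
    candidates = candidates or set()
    # 2. linear phase: apply the title regex to the few pages left
    pattern = re.compile(title_regex)
    return {name for name in candidates if pattern.search(name)}

index = {'search': {'HelpOnSearching', 'SearchIdeas'},
         'help': {'HelpOnSearching', 'HelpContents'}}
print(combined_search(index, ['search', 'help'], r'^Help'))
# -> {'HelpOnSearching'}
}}}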
Using an external search engine
As an alternative to writing such an indexing algorithm, an external search engine could be used, which would give us more functionality with less work. Of course, the use of an external search engine will be optional.
Requirements:
- GPL compatible licence
- Python binding
Possible candidates are:

Xapian
- Features:
  - stemming (reducing words to their stem) for Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish
    - e.g. reducing run, runner, running, runs to run
  - boolean expressions
  - phrase and proximity searching
  - dynamic update of the db
- see also XapianIntegration
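For illustration, indexing and querying one page with Xapian could look roughly like this. This is a sketch against the Python bindings as they exist now; the database path and term handling are assumptions, and the bindings available at the time of this discussion may have differed:

{{{
# Sketch: index one page and run a boolean AND query with Xapian.
import xapian

db = xapian.WritableDatabase("moin.idx", xapian.DB_CREATE_OR_OPEN)
stemmer = xapian.Stem("english")

doc = xapian.Document()
doc.set_data("HelpOnSearching")          # payload: the page name
for word in "how to search the wiki".split():
    doc.add_term(stemmer(word))          # stemmed index terms
db.add_document(doc)

# Query: searching AND wiki, stemmed the same way as the index terms
query = xapian.Query(xapian.Query.OP_AND,
                     [stemmer("searching"), stemmer("wiki")])
enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.docid, match.document.get_data())
}}}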
Lupy
- Features:
  - native Python
  - compatible with Jakarta Lucene (ported to Python)
  - no stemming yet - Lucene has stemming (English only)
  - dynamic update of the db
  - full range of queries

PyLucene
- GCJ-compiled version of Lucene
- SWIG Python interface
- Wikipedia has been using Lucene-based search servers since April 2005
Joda (ioda, because "joda" was already taken by some other project)
- our internal fulltext search engine, already used in a wikipedia setup
- Python interface
- Features:
  - dynamic update
  - queries with word distance operators possible
  - no stemming integrated, but pluggable, because it works on word lists
I've installed xapian and played a bit with it. As far as I can see, it looks good. The Python binding is not beautiful, but acceptable. Reading the mailing list, it looks like xapian is capable of dealing with large amounts of data (several GB, hundreds of thousands of documents). Looks good -- FlorianFesti 2004-02-06 21:13:33
Florian, do you remember why the xapian & moin work was stopped some year(s) ago? Was it just because of platform dependence (and Lupy being pure Python?), or were there other reasons not to develop the moin xapian stuff any further?
- Lupy's bugs and "retired" status don't promise a bright future (we would have to maintain it ourselves, and that is not trivial).
- PyLucene has some dependencies (libstdc++, libgcj, ...) and install issues.
- Xapian is also platform dependent, but at least easier to install. And xapian-bindings has a Python interface.