Basic ideas
More user influence:
- WikiFeatures:NearSearch - search "neighboring" wikis
- add a checkbox to include a MetaWiki search
- checkboxes for "search in titles", "search in content"
- input field for a regex restricting the pages to search, and one for pages to exclude from the search
- "search in older versions" maybe?
- sorting options (by # of hits, by page name, ...)
- exact phrase / any words / all words
- exclude words
- make "regex" mode a checkbox (remove the current heuristics)
- regular expressions are not compatible with indexing, so perhaps we should drop regex search (maybe keep it for title search)
- pages having keywords/attributes/MetaData matching a given name=pattern
Add more things you'd like to the list above, if they're not too exotic.
See ThomasWaldmann/WikiUsability/GoggleSearchPatch - page updated with a collection of diffs!
Suggestion
Use special characters to define what to do with the search pattern:
- <, >, <=, >=: the title must be alphabetically smaller, greater, ... than the expression (see the sketch below)
  - no regex or wildcards allowed here, perhaps ? to ignore single chars
  - useful for page names containing dates or page numbers
  - only useful if the page name is in a proper format (parts of YYYY-MM-DD-HH-MM-SS) or a number with leading zeros
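A minimal sketch of how such comparison operators could be evaluated; the function and page names are made up for illustration. Plain lexicographic string comparison already does the right thing for zero-padded numbers and YYYY-MM-DD style names:

{{{
# Sketch: evaluate a title comparison like ">= 2004-01-01" against a
# list of page names. Lexicographic comparison works as long as the
# names are zero-padded / ISO-date formatted.
import operator

OPS = {'<': operator.lt, '>': operator.gt,
       '<=': operator.le, '>=': operator.ge}

def title_compare(pagenames, op, expression):
    """Return all page names that compare to `expression` with `op`."""
    return [name for name in pagenames if OPS[op](name, expression)]

pages = ['2003-12-31', '2004-01-15', '2004-02-01']
print(title_compare(pages, '>=', '2004-01-01'))
# -> ['2004-01-15', '2004-02-01']
}}}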
Query language
This is a draft of a possible query language. Please discuss variants in the "Suggestion" section above.
see HelpOnSearching
Architecture
Frontend -> Searchpattern -> Results -> Sorted Results -> Output

with the corresponding data at each stage:

String -> Query tree -> list of result objects -> list of result objects -> request.write(formatter.xxx())
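To make the pipeline concrete, here is a toy sketch of the query tree stage in Python. The class names mirror the "and, or, text, title" expressions from the implementation log below, but this is illustrative only, not the actual MoinMoin code:

{{{
# Toy query tree: each node decides whether a single page matches.
# AndExpression/OrExpression/TextSearch/TitleSearch are illustrative
# names, not necessarily the real implementation.

class AndExpression:
    def __init__(self, *terms):
        self.terms = terms
    def search(self, pagename, text):
        return all(t.search(pagename, text) for t in self.terms)

class OrExpression:
    def __init__(self, *terms):
        self.terms = terms
    def search(self, pagename, text):
        return any(t.search(pagename, text) for t in self.terms)

class TextSearch:
    def __init__(self, needle):
        self.needle = needle.lower()
    def search(self, pagename, text):
        return self.needle in text.lower()

class TitleSearch:
    def __init__(self, needle):
        self.needle = needle.lower()
    def search(self, pagename, text):
        return self.needle in pagename.lower()

# "Frontend -> Searchpattern": a parser would build such a tree from
# the query string, e.g. for a query meaning  wiki AND title:Help
query = AndExpression(TextSearch("wiki"), TitleSearch("Help"))
print(query.search("HelpOnSearching", "How to search this wiki ..."))  # True
}}}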
Implementation Log
I started implementing some of the data structures above. Things already look usable. I am now thinking about how to integrate indexed and attachment search -- FlorianFesti 2004-03-01 17:37:27
see also /AttachmentSearch
QueryParser
- BUG: an empty search string leads to errors
- search expressions: and, or, text, title
Result Classes
- sorting: by weight, page name
- printing
Macros
- TitleSearch
- FullSearch
  - basic functionality
  - context parameter
  - byname parameter
- PageList
  - keep only for compatibility (unchanged)
  - or extend with new features?
Actions
- fullsearch
  - "titlesearch" parameter
  - sorting depending on "titlesearch"
    - explicit sort parameter?
  - context parameter
- titlesearch (killed)
- inlinesearch (killed)
Indexed search
- query converter
- query classes (see the sketch below)
  - IndexedOrExpression
  - IndexedAndExpression
  - IndexedTextSearch
- search logic
- indexing engine
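A rough sketch of how the Indexed* query classes could differ from the linear ones: instead of testing one page at a time, they return sets of candidate pages straight from the index. The index layout and method names are assumptions, not the real code:

{{{
# Sketch: indexed query classes return *sets* of page names from a
# word -> pages index, combined with set operations.
# index example: {'wiki': {'FrontPage', 'HelpOnSearching'}, ...}

class IndexedTextSearch:
    def __init__(self, word):
        self.word = word.lower()
    def search(self, index):
        return index.get(self.word, set())

class IndexedAndExpression:
    def __init__(self, *terms):
        self.terms = terms
    def search(self, index):
        results = [t.search(index) for t in self.terms]
        return set.intersection(*results) if results else set()

class IndexedOrExpression:
    def __init__(self, *terms):
        self.terms = terms
    def search(self, index):
        return set().union(*(t.search(index) for t in self.terms))

index = {'moin': {'FrontPage', 'MoinMoin'},
         'search': {'HelpOnSearching', 'FrontPage'}}
q = IndexedAndExpression(IndexedTextSearch('moin'), IndexedTextSearch('search'))
print(q.search(index))  # {'FrontPage'}
}}}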
Indexing
- in c't 25/2003 there was a wiki engine comparison and moin got a "-" concerning scalability - I think this was due to the full-text search, which currently works without an index (it is quite fast, though not as fast as an indexed search would be).
- create a dict = { 'word': ['page1', 'page2', ...], 'anotherword': [...], ...} for every (full) word in the wiki. The key is the search word, the value is a list of pages the word appears on.
  - create a page FullWordIndex using that data (used like the word index in a book), maybe as a macro
- pickle that dict to disk
- searching for a word then is:
  - fulltextsearch( dict[searchword] ) - do not search all pages, but only those that are known to contain the word
  - for multiple word searches, use set intersection (py2.3 sets or self-made equivalent), as in the sketch below:
    - fulltextsearch( dict[w1] intersect dict[w2] intersect dict[w3] ...)
- generate a wiki with > 100,000 pages of a few KB each for testing / benchmarking
  - goal: 1s for searching 100,000 pages (should be easy, as the dict lookup is O(1) and the full search over the value list is O(n), but usually with a small n << pagecount)
  - finally repeat with 1,000,000 pages
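A minimal sketch of this proposal, assuming pages are already loaded as a pagename -> text dict (the loading itself is out of scope here); sets are used instead of lists so the intersection is direct:

{{{
# Sketch: build the word -> pages index described above and search it.
import pickle
import re

def build_index(pages):
    """pages: dict of pagename -> page text."""
    index = {}
    for name, text in pages.items():
        for word in set(re.findall(r"\w+", text.lower())):
            index.setdefault(word, set()).add(name)
    return index

def save_index(index, path="word_index.pickle"):
    """Pickle the index to disk, per the proposal above."""
    with open(path, "wb") as f:
        pickle.dump(index, f)

def search(index, *words):
    """Multi-word search = set intersection of the per-word page sets."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

pages = {"FrontPage": "Welcome to the wiki",
         "HelpOnSearching": "How to search the wiki"}
index = build_index(pages)
print(search(index, "wiki", "search"))  # {'HelpOnSearching'}
}}}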
http://www-106.ibm.com/developerworks/linux/library/l-pyind.html
As mentioned in the article above, this will run into problems once the database grows and no longer fits into memory: 1,000,000 pages of several kB each are several GB of data, and I would not expect the index to be much smaller than the wiki pages themselves. To handle this amount of data, a real database is needed.
Problems
- Index search has some limitations:
  - no regex search - if you keep substring indexes based on character triples, a regex could be broken down into substring searches
  - only matches whole words (improved by stemming)
  - probably better to offer only case-insensitive search
- Index search works quite differently from linear search:
  - linear search wants to process one page after the other and remove excluded pages from memory as early as possible. Whether a page matches is calculated as a boolean expression over the terms found in it.
  - index search returns a set of found pages per term. These sets are combined with set operations.
  - title searches (regex, substring matches) can't be done by the indexer. These search terms have to be applied as a boolean expression to every page found by the indexed full-text search (if there is a full-text search); see the sketch below.
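A sketch of how the two models could be combined, per the last point above: the index narrows the candidate set, and the non-indexable title regex is then applied linearly to just those candidates. The function name and index layout are made up for illustration:

{{{
# Sketch: indexed search narrows the candidates, then terms the index
# cannot answer (title regex, substrings) are checked linearly.
import re

def combined_search(index, fulltext_words, title_regex):
    # 1. index phase: intersect the per-word page sets
    candidates = None
    for w in fulltext_words:
        pages = index.get(w.lower(), set())
        candidates = pages if candidates is None else candidates & pages
    candidates = candidates or set()
    # 2. linear phase: apply the title regex to the few pages left
    pattern = re.compile(title_regex)
    return {name for name in candidates if pattern.search(name)}

index = {'search': {'HelpOnSearching', 'SearchIdeas'},
         'help': {'HelpOnSearching', 'HelpContents'}}
print(combined_search(index, ['search', 'help'], r'^Help'))
# -> {'HelpOnSearching'}
}}}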
Using an external search engine
As an alternative to writing such an indexing algorithm, an external search engine could be used, which would give us more functionality with less work. Of course, the use of an external search engine will be optional.
Requirements:
- GPL compatible licence
- Python binding
Possible candidates are:

Xapian
- Features:
  - stemming (reducing words to their stem) for Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish
    - e.g. reducing run, runner, running, runs to run
  - boolean expressions
  - phrase and proximity searching
  - dynamic update of the db
- see also XapianIntegration
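For illustration, indexing and querying one page with Xapian could look roughly like this. This is a sketch against the Python bindings as they exist now; the database path and term handling are assumptions, and the bindings available at the time of this discussion may have differed:

{{{
# Sketch: index one page and run a boolean AND query with Xapian.
import xapian

db = xapian.WritableDatabase("moin.idx", xapian.DB_CREATE_OR_OPEN)
stemmer = xapian.Stem("english")

doc = xapian.Document()
doc.set_data("HelpOnSearching")          # payload: the page name
for word in "how to search the wiki".split():
    doc.add_term(stemmer(word))          # stemmed index terms
db.add_document(doc)

# Query: searching AND wiki, stemmed the same way as the index terms
query = xapian.Query(xapian.Query.OP_AND,
                     [stemmer("searching"), stemmer("wiki")])
enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.docid, match.document.get_data())
}}}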
Lupy
- Features:
  - native Python
  - compatible with Jakarta Lucene (ported to Python)
  - no stemming yet - Lucene has stemming (English only)
  - dynamic update of the db
  - full range of queries

PyLucene
- GCJ-compiled version of Lucene
- SWIG Python interface
- Wikipedia has been using Lucene-based search servers since April 2005
Joda (ioda, because "joda" was already taken by some other project)
- our internal fulltext search engine, already used in a wikipedia setup
- Python interface
- Features:
  - dynamic update
  - queries with word distance operators possible
  - no stemming integrated, but pluggable, because it works on word lists
I've installed xapian and played a bit with it. As far as I can see, it looks good. The Python binding is not beautiful, but acceptable. Reading the mailing list, it looks like xapian is capable of dealing with large amounts of data (several GB, hundreds of thousands of documents). Looks good -- FlorianFesti 2004-02-06 21:13:33
Florian, do you remember why the xapian & moin work was stopped some year(s) ago? Was it just because of platform dependence (and Lupy being pure Python?), or were there other reasons not to develop the moin xapian stuff any further?
- Lupy's bugs and "retired" status don't promise a bright future (we would have to maintain it ourselves, and that is not trivial).
- PyLucene has some dependencies (libstdc++, libgcj, ...) and install issues.
- Xapian is also platform dependent, but at least easier to install. And xapian-bindings has a Python interface.