MoinMoin recommendation system
News
Added a PyStemmer interface. The goal is to reduce the number of tokens and get better results. PyStemmer is part of the Snowball project and can be downloaded from http://snowball.tartarus.org/download.php . I selected PyStemmer because it supports several families of languages. Another choice would be nltk (http://nltk.sourceforge.net/).
The Goal
The goal of this system is to provide a way to recommend pages based on the content of the page viewed by the user. The system should be completely unsupervised.
Requirements
The only requirement for this feature is NumPy, which can be downloaded from http://numpy.scipy.org/.
Pattern representation
Currently pages are represented using the bag-of-words approach. Features are selected based on their "document frequency" in the document space.
The number of selected features is determined by a user/administrator-provided value (numFeatures). This number also represents the number of inputs to the classifier.
The vector representation of the page is determined using the computed weights of the input features. Feature weights are computed using the formula:
w = tf*IDF = (tf/tfmax)*log(N/n)
where:
- tf = Term Frequency
- tfmax = Maximum Term Frequency
- n = number of pages containing the term
- N = total number of pages
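The weighting formula above can be sketched as a small helper function. This is an illustration of the formula only, not the project's actual code; the function name and example numbers are made up.

```python
import math

def tfidf_weight(tf, tf_max, n_pages_with_term, n_pages_total):
    """w = tf*IDF = (tf / tf_max) * log(N / n)."""
    return (tf / tf_max) * math.log(n_pages_total / n_pages_with_term)

# A term appearing 4 times on a page whose most frequent term appears
# 10 times, where the term occurs on 5 of the wiki's 100 pages:
w = tfidf_weight(4, 10, 5, 100)
```

Terms that appear on almost every page get an IDF near log(1) = 0, so they contribute little to the page vector even when they are frequent on the page itself.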
Similarity measure
The similarity between input vectors (page vectors) is computed using the cosine similarity measure. This similarity measure was selected based on experimental results. A better similarity measure might yield better results.
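With NumPy (the feature's only requirement) the cosine measure is a one-liner over the page vectors. A minimal sketch, assuming the hypothetical names below; zero vectors are treated as having zero similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two page vectors; 1.0 means same direction."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

page_a = np.array([0.5, 0.0, 0.3])   # TF-IDF weights of page A
page_b = np.array([0.5, 0.0, 0.3])   # identical content
page_c = np.array([0.0, 0.9, 0.0])   # no shared terms
```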
Clustering algorithm
The clustering algorithm is based on the ART (Adaptive Resonance Theory) neural network and provides unsupervised, incremental learning, allowing the system to evolve based on new content or user-selected pages.
System optimizations
Feature Selection
Currently the system uses all words that are longer than 3 characters and selects the most frequent ones, skipping common words like "while", "about", "again", etc. Common words are defined in so-called "stop word" lists. Currently we provide stop word lists for English, German, Russian, Spanish, Portuguese, Norwegian, Finnish and Dutch.
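The selection step can be sketched as follows. This is an illustration, not the project's actual code; the stop word set here is a tiny excerpt and the function name is made up.

```python
import re
from collections import Counter

STOP_WORDS = {"while", "about", "again", "there", "which"}  # excerpt only

def select_features(pages, num_features):
    """Pick the num_features words with the highest document frequency,
    skipping words of 3 characters or fewer and stop words."""
    df = Counter()
    for text in pages:
        # Count each word once per page: document frequency, not term frequency.
        words = set(re.findall(r"[a-z]+", text.lower()))
        df.update(w for w in words if len(w) > 3 and w not in STOP_WORDS)
    return [word for word, _ in df.most_common(num_features)]

pages = ["The wiki engine renders wiki pages",
         "Pages are clustered by content"]
features = select_features(pages, 3)
```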
Another idea is to use stemming in order to reduce words to their root. E.g. "stemmer" and "stemming" would both be reduced to their root "stem". Unfortunately this brings some problems: stemming algorithms are optimized for a specific language and there is no easy way to implement this for all languages. When implemented, we might need a secondary classifier for detecting the main language of a page :).
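To show the token-merging effect, here is a deliberately naive suffix stripper. This is a toy illustration only: real stemmers such as PyStemmer/Snowball use carefully ordered, language-specific rule sets, not a flat suffix list like this one.

```python
# Toy illustration, NOT a real stemming algorithm.
SUFFIXES = ("ming", "mer", "ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix so related word forms share one token."""
    for suffix in SUFFIXES:
        # Require a reasonably long remainder so short words survive intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

With a real stemmer the reduction is language-aware, which is exactly why one stemmer per supported language (and possibly language detection) would be needed.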
Classifier
The ART network has some very sensitive parameters:
- The number of inputs (numFeatures)
- The vigilance.
Experimentally we used 1000 inputs for the ANN. A lower number (100) yielded poor clustering performance; larger values yielded poor classification speed. One way to reduce the speed impact of a larger number of inputs is the external dependency on NumPy, which lets us replace some slow list-comprehension operations with vectorized ones. The number of inputs is a critical part of the ART network and can make the difference between good and poor clustering performance.
The vigilance, simply put, tells the system to create a new category whenever an input vector's similarity to every existing category falls below the vigilance threshold. Its value is critical for good clustering performance.
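The vigilance-driven clustering loop can be sketched as below. This is a simplified ART-style sketch, not the full ART algorithm (no complement coding or resonance/reset search) and not the project's actual code; names and the learning rate are assumptions.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0.0 else float(np.dot(a, b) / denom)

def cluster(vectors, vigilance, learning_rate=0.5):
    """Assign each page vector to the best-matching prototype;
    create a new category when no match reaches the vigilance."""
    prototypes = []
    labels = []
    for v in vectors:
        sims = [cosine(v, p) for p in prototypes]
        if sims and max(sims) >= vigilance:
            best = sims.index(max(sims))
            # Resonance: move the winning prototype toward the new input.
            prototypes[best] = ((1 - learning_rate) * prototypes[best]
                                + learning_rate * v)
        else:
            # No category is similar enough: open a new one.
            prototypes.append(v.astype(float))
            best = len(prototypes) - 1
        labels.append(best)
    return labels

pages = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
labels = cluster(pages, vigilance=0.8)
```

Raising the vigilance produces more, tighter clusters; lowering it merges pages into fewer, broader categories.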
Usage
In order to test the current implementation (17 Aug 2007) you need to:
Index pages in order to find the best tokens for the ANN:
moin.py ... recommender index
- For testing you can pass the option "--underlay yes" to the script. This way the script will select features from the entire wiki, including underlay.
Train the classifier in one of two ways:
- Select a few pages for training, put them in a wiki page using the MoinMoin Group syntax, and run the training script:
moin.py ... recommender train --wikigroup GroupPage # Replace GroupPage with the page containing the training set
- Or use the "Star Page" action to train the classifier with selected pages.
Run the "cluster" script to update the cache:
moin.py ... recommender cluster
To view the computed clusters use the WikiClusters macro: [[WikiClusters]]
Wishes
The "Similar pages" title should be alone on one line
All listed pages should become links
To distinguish that functionality from the normal page content, use a similar style as for TableOfContents (box and background color)
Describe what the stop word lists are used for
- Sorry, I may have missed older discussions. Why is it necessary to train via "Star Page" on request for the anonymous user? Why not star all pages with a moin command and always add new pages if they are readable by anonymous users? Then the anonymous user would not see the "Star Page" action, only the result. If one is logged in it could behave differently. Perhaps a logged-in user wants to have both results; otherwise links from a page would disappear after logging in. Maybe "Similar pages" and "My similar pages".
Maybe [[WikiClusters]] should be callable from SystemInfo