Text Classification Proposal
Warning: this is a work in progress.
The corpus
The corpus should be maintained the standard wiki way:
A standard group page (e.g. AutoCategories) that lists all the possible categories. For two-way classification it would contain only the SpamPages and HamPages categories.
Each subpage of AutoCategories contains a list of pages that fit in the specific category.
When a user classifies a page in a category, the corresponding category subpage is updated as needed.
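The layout described above can be sketched in plain Python. This is only an illustration: the wiki is modelled as a dictionary from page names to lists, and the `AutoCategories/<category>` subpage naming, `make_corpus`, and `classify_page` are assumptions made for the example, not MoinMoin API. It also assumes that a page belongs to at most one category at a time.

```python
# Hypothetical in-memory model of the corpus pages; not MoinMoin API.
GROUP_PAGE = "AutoCategories"


def make_corpus(categories):
    """Create the group page plus one empty subpage per category."""
    wiki = {GROUP_PAGE: list(categories)}
    for category in categories:
        wiki[f"{GROUP_PAGE}/{category}"] = []
    return wiki


def classify_page(wiki, page_name, category):
    """Record page_name under category and drop it from every other category.

    Assumption: a page belongs to at most one category at a time.
    """
    if category not in wiki[GROUP_PAGE]:
        raise ValueError(f"unknown category: {category}")
    for other in wiki[GROUP_PAGE]:
        members = wiki[f"{GROUP_PAGE}/{other}"]
        if other == category:
            if page_name not in members:
                members.append(page_name)
        elif page_name in members:
            members.remove(page_name)


wiki = make_corpus(["SpamPages", "HamPages"])
classify_page(wiki, "FrontPage", "HamPages")
classify_page(wiki, "CheapPills", "SpamPages")
classify_page(wiki, "FrontPage", "SpamPages")  # a user reclassifies the page
```

In a real deployment the lists would be stored as ordinary wiki pages, so the corpus stays editable and reviewable like any other content.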
Classification
For Text Classification we would have two main cases:
- Two way classification
- n-way classification
To implement these features we need to design a system that supports both of the classification types mentioned above. The system should be modular, so that a user can deploy any type of classifier (based on a framework of their choice) through the existing plugin system. MoinMoin should ship at least one such plugin with the vanilla package: SpamClassifier.
Two way classification
This is the simplest case and consists mainly of classifying pages as ham or spam. For this type of classification the best choice is to use the SpamBayes project.
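To make the idea concrete, here is a minimal sketch of two-way statistical classification. It is a plain multinomial naive Bayes classifier with add-one smoothing, written from scratch for illustration. It is not SpamBayes code and does not use its API; the class name, tokenizer, and sample texts are assumptions.

```python
import math
import re
from collections import Counter


class TwoWayNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing for spam/ham (illustrative)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        """Log prior plus smoothed log likelihood of the known tokens."""
        counts = self.word_counts[label]
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(counts.values()) + len(vocabulary)
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in self.tokenize(text):
            if token in vocabulary:  # tokens never seen in training are ignored
                score += math.log((counts[token] + 1) / total)
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


nb = TwoWayNaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap watches online", "spam")
nb.train("meeting notes for the wiki release", "ham")
nb.train("wiki release plan and notes", "ham")
```

The sketch assumes at least one training page per label; a production filter would also need an "unsure" outcome for pages whose scores are close.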
Useful readings:
An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages - Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos and Constantine D. Spyropoulos
An Evaluation of Statistical Spam Filtering Techniques - Le Zhang, Jingbo Zhu and Tianshun Yao
"May I borrow Your Filter?" Exchanging Filters to Combat Spam in a Community - Anurag Garg, Roberto Battiti and Roberto G. Cascella, Proceedings of the 20th International Conference on Advanced Information Networking and Applications (AINA'06)
Problems:
- Image spam
n-way classification
This is the general case of text classification. Here we need to decide whether to classify into:
- a fixed number of categories
- a variable number of categories
Fixed number of categories
This is the ideal case, in which the user has a predefined list of categories and has to choose one of them (e.g. Technical/Financial/Fun/Spam/Ham/etc.). For this type of classification a good approach is to use SVMs (possibly through libsvm's Python bindings, or through a specialized framework such as Elefant).
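As a rough illustration of the SVM approach with a fixed category list, the sketch below trains one linear SVM per category (one-vs-rest) by subgradient descent on the regularised hinge loss, using bag-of-words counts as features. It is a dependency-free stand-in written for this page, not libsvm or Elefant code. The function names, hyperparameters, and training texts are assumptions, and the bias term is omitted for brevity.

```python
# One-vs-rest linear SVM trained by subgradient descent on the hinge loss.
# Pure-Python stand-in for a real SVM library; all names are illustrative.
from collections import Counter


def features(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())


def dot(weights, feats):
    return sum(weights.get(term, 0.0) * count for term, count in feats.items())


def train_binary(samples, epochs=50, rate=0.1, reg=0.01):
    """samples: list of (feats, y) with y in {+1, -1}; returns a weight dict."""
    weights = {}
    for _ in range(epochs):
        for feats, y in samples:
            violated = y * dot(weights, feats) < 1.0
            # The regularisation term shrinks every weight on each step.
            for term in weights:
                weights[term] *= 1.0 - rate * reg
            # The hinge term only contributes when the margin is violated.
            if violated:
                for term, count in feats.items():
                    weights[term] = weights.get(term, 0.0) + rate * y * count
    return weights


def train_one_vs_rest(documents):
    """documents: list of (text, category); returns {category: weights}."""
    categories = sorted({category for _, category in documents})
    vectors = [(features(text), category) for text, category in documents]
    return {
        target: train_binary([(f, 1 if c == target else -1) for f, c in vectors])
        for target in categories
    }


def classify(models, text):
    """Pick the category whose SVM gives the highest decision value."""
    feats = features(text)
    return max(models, key=lambda category: dot(models[category], feats))


training = [
    ("server kernel patch compiler", "Technical"),
    ("compiler bug kernel release", "Technical"),
    ("budget invoice quarterly revenue", "Financial"),
    ("revenue tax invoice forecast", "Financial"),
    ("party music holiday games", "Fun"),
    ("games picnic music jokes", "Fun"),
]
models = train_one_vs_rest(training)
```

With a real library the training loop would be replaced by the library's own solver, while the one-model-per-category structure and the "highest decision value wins" rule stay the same.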
Variable number of categories
TODO
User Interaction
TODO
Integration With MoinMoin
TODO