preamble
Thanks for this PoC! It shows that it's possible to integrate spambayes, and probably other categorizers, into the MM codebase with your help. Well done!
All my thoughts are comments; please review whether they can be solved easily. If some of them would take much time, shift them to the end and solve the easy ones first.
I know it's only a proof of concept, but I want to mention everything I wish for or came across, even though I know a PoC doesn't need all of it. I mention them so we don't miss one (probably one of my wishes).
principal things
.hgrc
[ui] username = name and email address (Name <name AT address DOT address>)
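For example, a minimal ~/.hgrc (the name and address below are placeholders):
{{{
[ui]
username = John Doe <john.doe@example.org>
}}}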
documentation
Start writing documentation as early as you can, e.g. a README; it does not matter if it looks like a draft. E.g. you can use your wiki or MoinMoin and one of your pages there. If you use MM, someone can contribute.
Extending
- needs to be extended for the comment input
- needs to be extended for the mimetypes text filter from xapian
- maybe the database access needs to be locked, so that simultaneous changes don't cause trouble
- while the system is being trained, help files in other languages from the underlay should be ignored. Someone switching his language should not be told that all his language-dependent files are spam.
- changing the categorizer should be possible
- the actions should be invisible for users without may.admin
- all the things one could do with spam pages
Currently I can only see: Learn/Forget -- MarianNeagul 2007-06-11 10:13:47
If we compare it with the current Antispam system based on BadContent: there one is not able to save a page containing one of those words. Perhaps we want something similar; once the system is trained well enough, maybe the check should be invoked on saving, to prevent changes to a page.
The classification project will provide a policy that could use the classification to enforce a specific behavior. -- MarianNeagul 2007-06-11 22:05:11
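For illustration, a minimal sketch of such a save-time check; the get_classifier() accessor, the score() method and the SaveError exception are assumptions about where such a hook could live, not PoC code:
{{{#!python
class SaveError(Exception):
    pass

def check_on_save(request, pagename, new_text, threshold=0.9):
    # refuse the save when the new text scores above a spam threshold
    classifier = get_classifier(request)   # hypothetical accessor
    score = classifier.score(new_text)     # spambayes-style score in [0, 1]
    if score >= threshold:
        raise SaveError("page looks like spam (score %.2f), save refused" % score)
}}}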
If it is possible to recognize which part of a page turns a ham page into a spam page, then maybe we could use the despam action too.
This was the idea behind Jonathan A. Zdziarski's Bayesian Noise Reduction (BNR). I was thinking of using it to identify portions of a page that have a higher spam probability. This would require small changes to SB. I'm still researching this possibility. -- MarianNeagul 2007-06-11 22:05:11
Another possibility: when we have a spam/ham page, check whether all revisions of this page are spam/ham, and indicate those revisions.
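For illustration, a minimal sketch of scoring all revisions; getRevList() and get_raw_body() are MoinMoin Page methods, while classifier_score() is a hypothetical helper:
{{{#!python
from MoinMoin.Page import Page

def revision_scores(request, pagename):
    # score every stored revision of a page; classifier_score() is a
    # hypothetical helper returning a spambayes score in [0, 1]
    scores = {}
    for rev in Page(request, pagename).getRevList():
        text = Page(request, pagename, rev=rev).get_raw_body()
        scores[rev] = classifier_score(text)
    return scores
}}}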
- perhaps some statistics to show how good the system is, for normal users who wonder about the reduced spam
I'm working on a macro that will show statistics for the classifier. -- MarianNeagul 2007-06-11 10:13:47
- maybe RC should show an indicator for a spam page
Currently I'm searching for some artwork to implement this -- MarianNeagul 2007-06-11 22:05:11
LOG and repeat
I believe a log file showing which pages were set to ham and which to spam, by whom, and which revision was used, would be nice to have. It should include the possibility of being replayed (by a moin subcommand), and perhaps it should show how spambayes had scored the page before. Perhaps the standard log file could be used for that too. (Needs to be checked.)
I want this for testing purposes too, because with it one should be able to recognize which addition makes a big or a small change, or when the system starts to get clever.
And if someone made a mistake and entered ham instead of spam, it should be easy to correct with such a log.
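For illustration, a minimal sketch of writing such a log entry; the tab-separated format and all names below are assumptions, not PoC code:
{{{#!python
import time

def log_classification(logfile, pagename, revision, user, verdict, old_score):
    # verdict is 'ham' or 'spam'; old_score is the spambayes score the
    # page had before this training step, for later comparison or replay
    line = "\t".join([str(int(time.time())), pagename, str(revision),
                      user, verdict, "%.4f" % old_score])
    logfile.write(line + "\n")
}}}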
The logging of classifications is not required because the training is stored in SpamPages/HamPages. -- MarianNeagul 2007-06-11 10:13:47
dbase filename
Since spambayes is a two-way categorizer, it could be perfectly used for training monolanguage wikis. So I want the name of the database to tell me what it was trained for.
- And we should test whether we can share these base files, or how they were trained.
- Perhaps we could use the default language code defined in wikiconfig for the name
CodeBase
coding
MoinMoin/Page.py
resolving path for dbfile
{{{#!python
event_log = self.request.rootpage.getPagePath('event-log', isfile=1)
}}}
you may want to use self.request.cfg.data_dir directly
and if we want to be able to exchange the categorizer, then maybe it would be a good idea to save the database file in a directory named after its categorizer,
e.g. self.request.cfg.data_dir /categorizer/spambayes/en.db
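For illustration, a minimal sketch of building that path; cfg.data_dir and cfg.language_default are real wikiconfig settings, the layout and function name are assumptions:
{{{#!python
import os

def classifier_db_path(request, categorizer='spambayes'):
    # one database per categorizer and per wiki language,
    # e.g. <data_dir>/categorizer/spambayes/en.db
    lang = request.cfg.language_default
    return os.path.join(request.cfg.data_dir, 'categorizer',
                        categorizer, '%s.db' % lang)
}}}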
{{{#!python
if not self.classifier:
}}}
should better be written as
{{{#!python
if self.classifier is None:
}}}
since the explicit None test won't misfire on a classifier object that happens to evaluate as false.
-- ReimarBauer 2007-05-01 08:24:54
Ideas/Problems
I've given up on using two actions for classifying pages as spam/ham. The new version of the PoC will have (as suggested by ThomasWaldmann) a single action 'swspamstatus' that toggles the classification (classify/forget as necessary).
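For illustration, a minimal sketch of such a toggle action; execute(pagename, request) is the standard MoinMoin action entry point, while get_classifier() and the classifier methods are assumptions:
{{{#!python
from MoinMoin.Page import Page

def execute(pagename, request):
    # standard MoinMoin action entry point
    if not request.user.may.admin(pagename):
        return  # invisible/unusable without may.admin
    page = Page(request, pagename)
    classifier = get_classifier(request)    # hypothetical accessor
    if classifier.is_trained(page):         # hypothetical query
        classifier.forget(page)             # undo the earlier training
    else:
        classifier.classify_as_spam(page)   # train this revision as spam
}}}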
- It seems that we might need a way to test that the user is human :). This might require using captchas, because enforcing the policy might not be the right thing to do.
another approach would be to queue pages for moderation by admin users. -- MarianNeagul 2007-05-28 18:24:08
As suggested in the comments section, I will try using a mime/multipart "message" version of the wiki page for classification and compare its results with the current approach. -- MarianNeagul 2007-05-28 18:24:08
A classifier administration page might be required. The page should provide the ability to forget the training data, change the thresholds, fetch a generic training set from the internet, etc. -- MarianNeagul 2007-05-28 18:35:13
Would it be better to use a db for every user? Should every user be able to define his own spam? -- MarianNeagul 2007-06-11 10:13:47
I don't think so, because we do not plan to give every user rights to use this functionality, and despam admins should have nearly the same opinion about what is ham and what is spam. For n-way categorisation it could be different.
Further Extensions for classifications
Maybe we should think about using some kind of classification for the messages produced by page edits, too. E.g. refactoring a page is somewhat different from just adding some content. So the message header or subject could get an attribute: seems to be refactored, appended some text, spammed.
Questions
I guess I don't understand how code development is managed for Moin. Where is the code for the proof of concept? -- SkipMontanaro
Hi Skip, we do use a mercurial repository 1.7-classify-mneagul. A simple description about using mercurial is at MoinDev/MercurialGuide. -- ReimarBauer 2007-05-06 15:18:49
How do you propose to save spammish submissions for later retraining of the SBClassifier? -- SkipMontanaro
The pages submitted by users (e.g. admins) would automatically be appended to the correct category: SpamPages/HamPages. This way the filter knows which pages it has been trained on, and the admin has a view of the training corpus. -- MarianNeagul 2007-05-06 17:19:18
What pages will be trained on? -- SkipMontanaro
E.g. TrainOnErrors -- MarianNeagul 2007-05-06 17:19:18
Reimar, what is the proper way of updating category pages? Please take a look at my repo to see how I have done it. -- MarianNeagul 2007-05-13 20:01:20
{{{#!python
# append the page name as a list item and save a new revision
pg.saveText(pg.get_raw_body() + " * %s\n" % (thispage.page_name, ), 0)
}}}
We should better talk of ham/spam catalog pages, because the name "category" is used differently for wiki pages. Maybe a Dict write could be added later on. Both pages need an ACL line. -- ReimarBauer 2007-05-14 01:18:27
Skip, could we do some spam checking on differences between pages? I think this would be useful for detecting spammers who append their text to a page and keep the original text. Using this technique, spammers avoid Bayesian text classification, because it is not probable that the weights of the spam text features will influence the global classification of the page. Would this be the place where we could use Bayesian Noise Reduction to identify the text that does not fit a specific page? -- MarianNeagul 2007-05-13 20:01:20
Do you mean checking edits? I think the only thing that should be checked is the new content. Sorry, I'm not familiar with Bayesian Noise Reduction. 2007-05-16 17:47:00
Skip, see the logs at MoinMoinChat/Logs/moin-dev/2007-05-14 -- ReimarBauer 2007-05-16 20:06:40
- Reimar: I don't understand this comment:
2007-05-14T23:26:48 <dreimark> it would be nicer to see them in percent, and to use 0 for unknown and then we could have 100.00 % ham and 100.00 % spam
The SpamBayes classifier, when properly trained, produces a strongly bimodal distribution, with ham scoring at or very near 0.0 and spam scoring at or very near 1.0. Unsure falls in the middle. 2007-05-16 20:47:00
The current version shows ham 0.0000 or spam 1.000, with a lot of digits, in the info line of the page. I would prefer this shown differently; yesterday we discussed with starshine using symbols to visualize the categorisation state. -- ReimarBauer 2007-05-17 09:24:03
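For illustration, a minimal sketch of the percent display suggested in the IRC log; the function name and the handling of "unknown" are assumptions:
{{{#!python
def format_classification(score):
    # spambayes scores: ham near 0.0, spam near 1.0, unsure in between
    if score is None:
        return "unknown"
    if score >= 0.5:
        return "spam %.2f %%" % (score * 100)
    return "ham %.2f %%" % ((1.0 - score) * 100)
}}}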
Yes, exactly! By "checking only new content" you mean checking only the new text added and not the resulting page? -- MarianNeagul 2007-05-16 20:15:49
I think you might want to experiment with generating a mime/multipart "message" from the submission to feed to the classifier. Any attachments to the page would be attached as the appropriate MIME type. The primary advantage of this is that it would simply feed into a minimally modified tokenizer (since SpamBayes already deals with MIME email messages). The first text/plain section would include either the full submission for page creations, or just the inserted text for page edits (the ">" lines from a simple diff would probably work just fine). You would probably also put in any synthetic tokens you generate that you believe will help distinguish ham from spam: were they logged in? how long since they first created their profile? what fraction of the page was deleted (approximately diff's "<" lines / original number of lines)? etc.
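For illustration, a minimal sketch of wrapping an edit as such a message; the X-Moin-* headers carrying the synthetic tokens are assumptions, and whether the stock SpamBayes tokenizer mines them depends on its header options:
{{{#!python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def edit_as_message(inserted_text, logged_in, profile_age_days, deleted_fraction):
    msg = MIMEMultipart()
    # synthetic tokens carried as message headers
    msg['X-Moin-Logged-In'] = str(logged_in)
    msg['X-Moin-Profile-Age-Days'] = str(profile_age_days)
    msg['X-Moin-Deleted-Fraction'] = "%.2f" % deleted_fraction
    # first text/plain part: the ">" lines of a simple diff
    msg.attach(MIMEText(inserted_text, 'plain'))
    return msg.as_string()
}}}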
Checking only new content enables a subtle way to fool the spam detection: use 2 edits. The first one adds a reasonable-looking text (like "My wiki commercialisation is here: www.exemple.org"). The second one removes some portions of the previous text (like "My wiki commer" and "ation"). -- JeanPhilippeGuérard 2007-05-16 22:12:44
- If we use a line diff, I don't think it is a problem, since links can't span lines. If you are not sure, always add some context: a few lines before and after the new content.
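For illustration, a minimal sketch of extracting the added lines plus context using Python's difflib; the function name and the amount of context are assumptions:
{{{#!python
import difflib

def added_text_with_context(old, new, context=2):
    # collect newly inserted/replaced lines plus a few lines of
    # surrounding context, to feed to the classifier
    old_lines, new_lines = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    chunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ('insert', 'replace'):
            lo = max(0, j1 - context)
            hi = min(len(new_lines), j2 + context)
            chunks.append("\n".join(new_lines[lo:hi]))
    return "\n\n".join(chunks)
}}}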