Different approaches to fight WikiSpam
WikiSpam is getting more and more annoying. Wiki pages rank highly in search engines because of the strong linking between pages (and between wikis via InterWiki links), which makes them a valuable target for boosting the ranking of other sites.
Some of these ideas are categorized as Helpful (they improve the experience of well-intentioned users), Transparent (well-intentioned users never even notice them), or Annoying (but perhaps less annoying than the spam they block).
Contents
- Current Anti-spam Features
- Ideas
  - General considerations
  - Make The Markup Rule of External Link Hard To Guess For Robots
  - Allow Humans, discriminate against Scripts
  - Detecting spam by content
  - Detecting spam by content redundancy
  - Black Lists -- detecting spam by source
  - Distributed Blacklist
  - Redirecting external links
  - Who is spamming ? Detecting spam by source. SpamAssassin for email and wiki.
  - Staging Revisions
  - Easier Restoration
  - Make it useless
  - Report them
  - Deny their ACL rights
  - Implement AKISMET
  - Detecting spam bots
  - Don't advertise orphans
- Discussion
Current Anti-spam Features
TextChas
Textual puzzle questions appearing at the top of the page. The idea is to prove that a user is a human, and not a bot, before they are allowed to save an edit.
See TextCha page for details, and also HelpOnTextChas
SurgeProtection
A limit on the rate at which a user is allowed to edit pages, since bots often try to edit in rapid-fire fashion.
See SurgeProtection page for details
'BadContent' list
You can ban certain content within contributions by listing RegularExpressions on your 'BadContent' page.
If any of these regular expressions matches the content a user is trying to save, the page cannot be saved. The only option for the user is to remove the offending links from the page before saving.
This feature is more effective at blocking WikiSpam if the list is kept up to date (blocking known spammers before they even reach your wiki). You can manually update your list by taking entries from our BadContent page, or you can try the AntiSpamGlobalSolution, which automates this process.
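For illustration only (this is not the actual antispam code path in MoinMoin), a BadContent-style check could look roughly like the sketch below; the function name and the example patterns are made up:
{{{#!python
import re

def find_bad_content(new_text, bad_patterns):
    """Return the first pattern that matches the edited text, or None.
    bad_patterns is a list of regex strings, e.g. lines from a BadContent page."""
    for pattern in bad_patterns:
        try:
            if re.search(pattern, new_text, re.IGNORECASE):
                return pattern
        except re.error:
            # A broken regex should not block every edit -- skip it
            # (this is the "be careful with regular expressions" problem below).
            continue
    return None

# Example: refuse the save if a known spam domain appears in the edit.
bad_patterns = [r"casino-online\.example", r"cheap-pills\.example"]
if find_bad_content("visit casino-online.example now", bad_patterns):
    print("Edit rejected: please remove the offending links before saving.")
}}}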
Problems
- You have to be careful with regular expressions: a careless pattern can accidentally match every edit, for example.
- Spammers switch domain names a lot, meaning the list must be constantly updated, and any new spammy domain names will get through the filter.
- Spammers can see what is blocked. This isn't actually a major consideration: spammers don't generally go to any trouble to check out such things on an individual wiki, they just carpet-bomb across many wikis.
Ideas
Please keep discussion in this section, and describe/document all ACTUAL FEATURES above
General considerations
"Spam is only useful, if the bot reads the edited page and gets the link." (But spam is still annoying to readers of my wiki, whether or not it has any effect on Google PageRank.)
- Anti-spam measures must not be easy to work around
- Spammers have our source code
- Human behaviour, as we might define it in code, is easy to mimic
- Spammers will adapt; a measure is only worth implementing if we cannot imagine an easy workaround
Make The Markup Rule of External Link Hard To Guess For Robots
For example:
- Randomize the external-link markup rule for every edit action, so that only a human being can find out the new markup rule:
For example, a sentence in the page footer says: "Please enclose your external link with two colons"; next time the sentence could be "You can insert off-site links by prepending two semicolons and appending a square bracket ( [ or ] )". Slightly annoying.
- Or email the new markup rule to the user, so a robot would have to parse the right mail, sent from a random sender name with a random subject, for every edit. More annoying.
- Create external links with JavaScript: the user has to click a button that runs the JavaScript, as in some WYSIWYG web editors. It is doubtful that robots can trigger UI events, interpret JavaScript and inspect the resulting changes in all form controls to finally guess the new markup rule, and this rule can be designed to be very complex. This will limit the editing capabilities of non-JavaScript browsers, of course, but any user is still able to read plain external links created by such browsers. Transparent to JavaScript browsers, annoying to other browsers.
Counterarguments:
- Spammers have our source code
- We can't add too many rules, and a bot author can read our source to catch them all.
- Many spammers fail to find out even the normal linking syntax. So you expect them to read the code?
- The mail would not come from a random sender (we are not spammers) and would not be a random mail: we can't have too many mail templates, and the bot author has them all too. This idea does complicate the bot somewhat, but it also complicates things a lot for legitimate users.
- What do you mean by that?
JavaScript code will be static too (i.e. avoidable) -- I verified this for good when doing PyWebMail. The JavaScript idea is especially dreadful because it will not complicate the bot, but it DOES complicate things for legitimate users.
This also means any unmodified external links from the last version of the wiki page should be preserved when the user submits; a diff against the last version may be required.
The server could validate which markup rule applies by decrypting (or verifying) a server-encrypted value in a hidden form control, as sketched below.
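A minimal sketch of that idea, assuming an HMAC-signed ticket rather than real encryption; SECRET, make_ticket and check_ticket are invented names, not MoinMoin APIs:
{{{#!python
import hashlib, hmac, time

SECRET = b"server-side secret, never sent to the browser"  # hypothetical

def make_ticket(rule_id):
    """Put this into a hidden form field when rendering the edit form."""
    issued = str(int(time.time()))
    msg = ("%s:%s" % (rule_id, issued)).encode()
    mac = hmac.new(SECRET, msg, hashlib.sha1).hexdigest()
    return "%s:%s:%s" % (rule_id, issued, mac)

def check_ticket(ticket, max_age=3600):
    """On save: verify the ticket was issued by us; return the markup rule id
    the submitted page text must follow, or None if the ticket is bogus/stale."""
    try:
        rule_id, issued, mac = ticket.split(":")
    except ValueError:
        return None
    msg = ("%s:%s" % (rule_id, issued)).encode()
    expected = hmac.new(SECRET, msg, hashlib.sha1).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return None
    if time.time() - int(issued) > max_age:
        return None
    return rule_id
}}}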
Allow Humans, discriminate against Scripts
For high-volume spamming (spamming hundreds or thousands of wikis) scripts are needed. Preventing scripts from editing would limit the number of wikis spammers can target.
Nevertheless, spam from humans is annoying too; anti-script measures will surely not be enough.
Set up a HoneypotPage to detect spambots.
Identify Human
- Force users to log in to be able to edit
- medium annoyance level
- not really secure: scripts can use accounts, too
- Make the user enter a number/string from a CaptchaTest
- high annoyance level -- should not be required for every edit
An improperly implemented CaptchaTest excludes deaf people from contributing. Solution: implement it properly. Rather than only having an audio CaptchaTest, give deaf users the option of a visual captcha, or use some sort of text-only question that would be trivial to answer for anyone who has read the FAQ.
The test could be as simple as a site-specific question which needs to be answered correctly. This of course doesn't prevent spam targeted specifically for that site, but it would solve the problem for smaller sites which are spammed only because a spam script found them to be running MoinMoin.
- quite secure
TextChas are implemented in 1.6.
- offer invalid edit links
- less annoying variant of "enter number"
- several links with images or one image map
- user has to choose the right one
- users who make too many mistakes are blocked
- much less secure
- spammers can perhaps make as many tries as it takes to get through
- identify as human by recording navigational behaviour?
- Transparent
Define a time threshold "minimum_edit_time", e.g. "minimum_edit_time = 2s":
- measure the editing time between "click on edit" and "click on save"
- if the editing time is below minimum_edit_time, it is likely that a script is changing the page, so don't save the changes; optionally ban that IP for 10 minutes or so (a rough sketch of this check follows below)
- generally a good idea, but easy for spammers to counter (they even have the source code...)
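A rough sketch of the timing check, with invented names; it assumes the time the edit form was rendered is available at save time (e.g. from a signed hidden field or the session):
{{{#!python
import time

MINIMUM_EDIT_TIME = 2  # seconds; hypothetical config value

def is_probably_a_script(form_rendered_at):
    """form_rendered_at: timestamp recorded when the edit form was sent out."""
    editing_time = time.time() - form_rendered_at
    return editing_time < MINIMUM_EDIT_TIME

# On save:
#   if is_probably_a_script(form_rendered_at):
#       refuse the save, and optionally put the client IP on a short ban list
}}}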
Detecting spam by content
- use a Bayesian classifier on the diffs (a rough sketch appears at the end of this subsection)
- train reverted changes as spam
- distributed spam archive??
Problems:
- perhaps not enough samples per wiki instance
Training reverted changes as spam -- This is very dangerous. I'm currently getting systematic reverts from a script. If I were training reverts as spam, this jerk would be destroying the previous training, causing many false positives.
- Requires a person to manually recognize spam and add it to the bad-content list.
- Occasionally blocks valid discussion by well-intentioned users -- similar to the way discussing "Nigeria" is impossible on most spam-filtered mailing lists.
Usually transparent.
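As a rough illustration of the Bayesian idea above (a toy word-level naive Bayes, not an existing MoinMoin module):
{{{#!python
import math, re
from collections import defaultdict

class DiffClassifier:
    """Toy naive Bayes over words in a diff. 'spam' training data would come
    from reverted edits, 'ham' from edits that survived. Not production code."""

    def __init__(self):
        self.counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, diff_text, label):
        for word in re.findall(r"\w+", diff_text.lower()):
            self.counts[label][word] += 1
            self.totals[label] += 1

    def spam_score(self, diff_text):
        """Positive score: looks more like past spam than past ham."""
        score = 0.0
        for word in re.findall(r"\w+", diff_text.lower()):
            p_spam = (self.counts["spam"][word] + 1.0) / (self.totals["spam"] + 2.0)
            p_ham = (self.counts["ham"][word] + 1.0) / (self.totals["ham"] + 2.0)
            score += math.log(p_spam / p_ham)
        return score
}}}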
Detecting spam by content redundancy
BanRedundantLinks -- http://wikifeatures.wiki.taoriver.net/moin.cgi/RejectDuplicate
- is this the only Helpful idea?
Black Lists -- detecting spam by source
see BlackList -- it can maintain a huge list of blocked IPs and subnets, using a plain text file
- Block according to a list that is only editable by well known users
- IPs
- Edits containing text fragments (URLs)
If you want to block only a few addresses, there is also a configuration option to block IPs and subnets, like this:
hosts_deny = ['61.55.22.51', '220.188.',]
in moin_config.py
Distributed Blacklist
- Allow the wiki to reuse lists from other wikis
- fetch them via HTTP or XML-RPC (a minimal fetch sketch follows below)
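A minimal sketch of fetching a remote BadContent-style list over HTTP; the URL form (?action=raw) is an assumption about the remote wiki's setup:
{{{#!python
from urllib.request import urlopen

def fetch_remote_badcontent(url):
    """Fetch a remote BadContent-style page as plain text and return its
    non-empty, non-comment lines as regex strings."""
    with urlopen(url, timeout=10) as response:
        text = response.read().decode("utf-8", "replace")
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.strip().startswith("#")]

# patterns = fetch_remote_badcontent("https://moinmo.in/BadContent?action=raw")
}}}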
Redirecting external links
Links routed through a local redirect will not give spammers any PageRank at Google.
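A tiny sketch of what such a rewrite could look like; the 'goto' action name and URL layout are hypothetical, and the redirector itself would be disallowed in robots.txt:
{{{#!python
from urllib.parse import quote

def redirected_href(url):
    """Rewrite an external URL so it points at a local redirector page
    that search engine bots are told not to crawl."""
    return "/wiki?action=goto&target=%s" % quote(url, safe="")

# Rendered link: <a href="/wiki?action=goto&target=http%3A%2F%2Fexample.com">...</a>
}}}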
Who is spamming ? Detecting spam by source. SpamAssassin for email and wiki.
Could we potentially use SpamAssassin to block spam? I suppose the spammers' mail accounts are often on the same subnets as the computers they use. This could be especially effective if your wiki is far from international and suddenly gets edits from China. I think spam is generally the same, whether mail or wiki, so I would suggest gathering information about the different IP blocks that are distributed. We should also be able to exclude individual known-good IPs from a blocked range. See also BlogSpamAssassin.
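If an RBL/DNSBL were used, the lookup itself is simple; the sketch below uses the standard reversed-octet query form, and the zone name is a placeholder, not a recommendation of any particular list:
{{{#!python
import socket

def listed_in_dnsbl(ip, zone="dnsbl.example.org"):
    """Standard DNSBL query: reverse the octets and look them up under the
    blacklist zone. The zone here is a placeholder, not a real service."""
    reversed_ip = ".".join(reversed(ip.split(".")))
    try:
        socket.gethostbyname("%s.%s" % (reversed_ip, zone))
        return True   # any A record means "listed"
    except socket.gaierror:
        return False  # NXDOMAIN: not listed (or the lookup failed)
}}}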
Staging Revisions
A system I've seen in use on other wikis "stages" anonymous edits so they don't get seen until either a known user verifies the page (http://wikifeatures.wiki.taoriver.net/moin.cgi/StagedCommits) or a certain time period has elapsed (http://wikifeatures.wiki.taoriver.net/moin.cgi/DelayedCommits). An advantage of this approach is that spammed pages won't get seen by a search engine; the disadvantage is that it requires registered users to "moderate" RecentChanges, and complicates the whole edit-view-edit cycle. There is more on this topic at WardsWiki, MeatBall and ProWiki -- I'll dig up some specific links when I'm near my offline wiki.
A hybrid of this approach would let most edits through directly; only edits that are (a) anonymous and (b) fit a certain spam profile would get tagged as "needs-review".
Of course, grafting this onto any wiki engine is going to be ugly and it's a flagrant violation of DoTheSimplestThingThatCouldPossiblyWork ...
DelayedCommits is generally transparent to most users.
The ApproveChanges action (and associated event handler) implements staged revisions using an approval queue as a subpage of the affected page.
Easier Restoration
- a clean-up action that lists only the likely-spam pages and offers "mass restoration"
- e.g. the user chooses one of the spammed pages and can get a list of "similar changes" (same (anonymous?) editor, within a certain range of time, same or similar content)
Make it useless
Spam is only useful if the Google bot reads the edited page and gets the link. Instead of identifying the spam bot, identify Google and mask out all links that are not at least x days old; alternatively, create a robots.txt that disallows Google on recently edited pages. Then trust the WikiGnomes to clean up the page within those x days, and take additional measures to identify/ban spammers. If spamming your wiki is useless, the spammer will not take the effort.
Google has announced a "nofollow"-Attribute for links to prevent comment spam in blogs. This could be used for any external link in a wiki, too -- should be easy to implement and quite effective.
I'm afraid that it's hard to implement and won't be effective. A wiki is one big comment -- we can't add the "nofollow" attribute to all wiki links, as this will kill the good links in the wiki. Using this feature requires that we know how to catch the spammers' links, and if we know how to do that, we can simply prevent their edits, as AntiSpam does. It's good that search engines are trying to fight this problem, though. -- NirSoffer 2005-01-19 15:18:03
What about using an editable whitelist? InterWiki links are on the whitelist by default, of course. Most wikis don't have that many external links, and most of these links do not rely on getting Google rank. We could even offer a "prove this link" icon in front of new links. With ACLs it should be easy to implement a TrustedEditorsGroup which is allowed to do this.
I think that's an excellent idea, and very similar to what I was thinking about this problem. InterWiki and IntraWiki links would be exempt, and the rest would have to be added to a whitelist before their nofollow attributes were dropped. -- Omnifarious 2005-05-16 19:54:15
Slightly annoying.
The HelpOnRobots page describes how "nofollow" is applied to links for different kinds of user agents.
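A minimal sketch of the whitelist + nofollow idea above, with an invented whitelist and helper function (not the actual MoinMoin formatter code):
{{{#!python
from urllib.parse import urlparse

NOFOLLOW_WHITELIST = {"moinmo.in", "python.org"}  # hypothetical, wiki-editable list

def external_link_html(url, text):
    """Render an external link; drop the nofollow attribute only for
    whitelisted hosts (InterWiki/internal links would never get here)."""
    host = urlparse(url).hostname or ""
    if host in NOFOLLOW_WHITELIST:
        return '<a href="%s">%s</a>' % (url, text)
    return '<a rel="nofollow" href="%s">%s</a>' % (url, text)
}}}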
Report them
Send abuse reports to the spammers' ISPs.
Deny their ACL rights
Since most spam is of the PageRank variety, would it be beneficial to add an "externallink" ACL right that would permit or deny the addition of external links to pages? E.g. the ACL
#acl Known:read,write,externallink All:read,write
would let all users edit pages, but only let known users add external links.
Usually transparent (since most edits don't add external links).
Implement AKISMET
Akismet (Automattic Kismet) is a collaborative spam-detection service integrated into blogging software such as WordPress 2.0, where it makes comment and trackback spam a non-issue. Assuming that wiki spam is similar (if not identical) to blog spam, implementing Akismet support in MoinMoin could almost eliminate wiki spam. It has been extremely effective in WordPress installations, almost completely eliminating comment and trackback spam, and sharing spam-detection results with other platforms (such as the large installed base of WordPress users) would greatly benefit spam prevention in MoinMoin installations.
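A sketch of what a comment-check call could look like, based on Akismet's documented REST API; the API key, blog URL and the comment_type value are placeholders/assumptions:
{{{#!python
from urllib.parse import urlencode
from urllib.request import urlopen

def akismet_says_spam(api_key, blog_url, user_ip, user_agent, content):
    """Ask Akismet's comment-check endpoint whether an edit looks like spam.
    The endpoint and parameter names follow Akismet's documented API; the
    api_key and blog_url are placeholders obtained from akismet.com."""
    data = urlencode({
        "blog": blog_url,
        "user_ip": user_ip,
        "user_agent": user_agent,
        "comment_type": "wiki-edit",   # free-form type label, our choice
        "comment_content": content,
    }).encode("utf-8")
    url = "https://%s.rest.akismet.com/1.1/comment-check" % api_key
    with urlopen(url, data, timeout=10) as response:
        return response.read().strip() == b"true"
}}}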
Detecting spam bots
Ned Batchelder has an interesting spam-bot prevention technique, using hashed tickets, hashed field names, and invisible form elements. See http://www.nedbatchelder.com/text/stopbots.html.
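This is not Ned's exact scheme, but the invisible-field part could be sketched like this; the field name 'homepage_url' is just bait chosen for this example, not a real MoinMoin form field:
{{{#!python
def honeypot_field_html():
    """A field hidden from humans with CSS but visible to form-filling bots;
    any value submitted in it marks the edit as coming from a bot."""
    return ('<div style="display:none">'
            '<input type="text" name="homepage_url" value="">'
            '</div>')

def submitted_by_a_bot(form):
    # form: the parsed POST data (e.g. a dict); 'homepage_url' is the bait
    # name chosen above.
    return bool(form.get("homepage_url"))
}}}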
One of the WordPress anti-spam plugins (Spam Karma 2, I think) detects spam edits through the timeline of client requests amongst other things. Noticing that a user submitted a form within a ridiculously small amount of time after loading a page indicates that they're running a script, so preventing such edits would probably help to prevent a lot of spam. Note that this isn't the same as the surge protection support, as far as I can tell, because it is all about measuring a specific interval between the loading of a page and a follow-up form submission request. -- PaulBoddie 2013-05-11 23:31:25
Don't advertise orphans
Most spam I've seen targets orphan pages.
Disallow: /OrphanedPages/
in robots.txt should stop these pages from being indexed.
Discussion
anti-spam proposal
The first step to stopping spam is to enable users to delete spam pages. We could add some kind of checkbox-based batch delete for pages in RecentChanges, which would allow a user to delete 30 newly created spam pages at once.
SpamAssassin-like program and a "SPAM" comment
If a page is deleted and the comment says SPAM, the username that created the page is disabled and marked as a spammer (and cannot be re-enabled).
The page information should be submitted to a SpamAssassin-like program, which would enable the following:
- Run regexes over the page content and put the results into a database (score-based solutions could be implemented).
- Based on the score, existing pages can be deleted (this needs some administrative page).
- New pages cannot be saved if, after passing them through the SpamAssassin-like program, the score is too high.
- It would check the editor's/saver's IP address to see if it is listed on an RBL anti-spam server (if it is, they cannot edit/save the page).
The IP address should be submitted to an RBL moin anti-spam server if a page was deleted or reverted with the comment SPAM; after passing some threshold, the server will add the IP to its list.
- The RBL anti-spam server should be an option to enable, and it would work great. (I wonder if moin spammers also do email spam; that could mean we could use existing databases.)
- Captcha when creating new usernames (this is in 1.6 already)
- There is also "kitten auth" (Asirra, by Microsoft).
Cannot annoy users
I personally think that we cannot deploy an anti-spam system that annoys even anonymous users on every edit without damaging the wiki idea as a whole. So I would prefer a solution that tries to detect spam and bothers the user only if the wiki suspects an edit of being spam. -- FlorianFesti 2004-08-18 17:11:50
I agree with this. One of the most common things for anonymous users to do on my site is to add useful external links. So, while the ACL idea is quite attractive to me, it would hurt the wiki as much as help it. RedirectingExternalLinks requires that spammers actually get it. They may, but I'm not hopeful: the whole practice of spamming requires a lot of unwarranted optimism about what you're doing and willful ignorance of the negatives. I think maybe a globally maintained edit blacklist on a per-IP basis would work. The edit should seem to work perfectly until the 'Save Changes' button is hit. Additionally, to make it very hard for spam bots to tell that they've been blacklisted, you could have the system always display a captcha-style indication of whether or not the edit went through; the captcha would say that the IP had been blacklisted for spamming if the edit failed for that reason. -- Omnifarious 2004-12-31 08:52:33
I wrote up a proposal for a HoneypotPage feature over at WikiFeatures. I'll copy it here and see what you guys think. -- GoofRider 2005-05-02
Spam prevention is next to useless
This is an ever-escalating battle with no end in sight. Why not focus on ways to help undo the damage instead? The Despam action is a good start but badly underdeveloped; if it also deleted the user, it would go a long way toward helping manage spam. Right now the process for removing a user account borders on making your eyes bleed. Hint: if it's easier to ssh into a wiki, change to an obscure directory, grep for some text and manually delete files than it is to use the web interface, then there is something seriously wrong with your GUI.
As you can see when looking at the recent amount of spam (0) in this (moin 1.6) wiki, it is not useless. But you may be right about the endless battle; we'll see when the spammers adapt. I recently made some improvements to Despam in 1.5 and I'll commit and port them to 1.6/1.7 soon. If you want to help improve the UI, you're welcome to. -- ThomasWaldmann 2008-01-18 19:47:55
How does BadContent work?
I'm more of a MediaWiki expert, but I was trying to help sort out a spam problem on this wiki by adding a few regexps to the top of the BadContent page: http://wiki.freemap.in/moin.cgi/BadContent. I was quite surprised that I was able to edit that page (as a normal user, not a sysop). Actually that seemed quite nifty, but to my disappointment it doesn't seem to be picking up my entries. For example, I simply added "warcraft", but still the warcraft spammer comes back. Maybe the BadContent feature isn't activated, but then why would that page be populated with regexes?
Also, do you guys have any other tips for ways in which normal users can help fight spam on a MoinMoin wiki? I tried reverting the spam manually, but it takes a while because of the (anti-spam) edit throttling!
Harry Wood - 11th March 2008
The spam regexes should be carefully added to BadContent and then get distributed to all wikis that have the antispam security policy enabled in their config (you can see this by watching whether their local BadContent updates from time to time).
Another good idea is to install moin 1.6.x and enable the TextCha feature in the config; this currently stops all spam done by bots (and that seems to be about all the spam there is).
Yeah, obviously there's some stuff they need to reconfigure/activate/upgrade on the server to improve the situation, but these guys are busy touring India promoting open mapping, so to help them out I thought I'd see what can be done from a normal-user perspective. I guess BadContent isn't activated, though. It's strange: I can see some automated updates back in 2005 (http://wiki.freemap.in/moin.cgi/BadContent?action=info), but why would there even be a BadContent page if they didn't have the feature switched on? -- Harry Wood - 11th March 2008