How to avoid wiki LinkSpam on all MoinMoin wikis globally.
See also AntiSpamFeatures, NetworkOfMoinMoinWikis
Install
- Enable it in wikiconfig.py (if you are using 1.3.x)
- Enable it in moin_config.py (if you are using 1.2.4)
For moin 1.6.0 and later, use:
from MoinMoin.security.antispam import SecurityPolicy
Otherwise:
from MoinMoin.util.antispam import SecurityPolicy #...
If you are using older moin versions, please upgrade.
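For example, a minimal sketch of the relevant part of a wikiconfig.py (the single import line is the whole change; the DefaultConfig import shown is the 1.6-style one, so keep whatever base class and settings your existing wikiconfig.py already uses):
{{{
# wikiconfig.py - sketch only; everything except the antispam import is illustrative

from MoinMoin.security.antispam import SecurityPolicy   # moin 1.6.0 and later
#from MoinMoin.util.antispam import SecurityPolicy      # older versions (1.3.x)

from MoinMoin.config.multiconfig import DefaultConfig

class Config(DefaultConfig):
    sitename = u'Untitled Wiki'      # example value
    # ... the rest of your existing configuration stays as it is ...
}}}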
How it works
This extension fetches the BadContent page from MoinMaster. That page is automatically kept in sync with moinmaster; do not edit it, or your edits will be overwritten.
Together with LocalBadContent (which you can use to add your own regular expressions), this builds your spam protection. Any save containing links that match one of those regular expressions will be denied.
The BadContent page is #acl All:read so spammers can't edit it. In fact, only wiki admins on moinmaster can change it.
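In practice the check boils down to something like this simplified sketch (not the actual antispam.py code; fetching, caching and error handling are left out, and the pattern list is made up):
{{{
import re

def is_spam(newtext, patterns):
    """patterns: the non-comment lines of BadContent plus LocalBadContent."""
    for pat in patterns:
        if re.search(pat, newtext):
            return True        # the save will be denied
    return False

patterns = [r"spammer\.example", r"casino-pills"]   # made-up examples
print(is_spam("visit http://www.spammer.example/ now", patterns))   # True
}}}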
Format of (Local)BadContent
{{{
#format plain
## Any line starting with a # will not be considered as regex,
## but any other line will! So do not put other text or wiki markup on this page
## or it will be considered as bad content, too - which might drive you crazy until
## you notice what went wrong.
spammer.com
anotherspammer.com
...
}}}
The code markup ({{{ and }}}) used to show the listing above MUST NOT be put on the BadContent page.
Contribute!
If you want to contribute spam link patterns, use this page: BadContent
- How? It is locked immutable (for very good reasons) and has absolutely no information whatsoever about how to add a pattern.
- I wonder how many people, like me, have arrived at that page with a potential pattern and then given up because of all the barriers you have placed between them and submitting badcontent?
I only found this page by going outside the wiki and searching Google for 'moinmoin add badcontent'.
Finally, tucked between other comments in a section discussing problems with the patch, is the line '.. you can add them to LocalBadContent and after verification ...'.
Sigh. That link points to another immutable page with no comments.
- Spot a pattern here?
Sorry folks, but users cannot 'magic' the instructions out of thin air, somebody who understands this needs to sit down and write it out, and then make sure that info is accessible at the point it is needed. -- OwenCarter 2007-07-24 12:35:26
Discussion
This method does NOT use IP address based banning, but content (link) based banning. That blacklist on BadContent contains patterns to match spam links. Technically, it would be possible to censor "offensive words" with it, too, but this won't happen through the moinmaster list.
If you are looking for IP banning, try BlackList (discussion about that see there).
This solution is not limited to MoinMoin. If you can process our regular expressions, feel free to use them. You can get them either via HTTP request (using ?action=raw) or via xmlrpc2. But if you want to use them for another wiki engine, please make a mirror of that page on that engine's site and direct people there.
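As an illustration, fetching and parsing the raw page from your own tooling might look roughly like this (the URL is a placeholder for your master wiki, and remember that each non-comment line is a regex, not a plain domain):
{{{
import urllib.request   # Python 3 sketch; the original tooling was Python 2

# placeholder URL - point it at the BadContent page of the wiki you sync from
url = "http://master.example.org/BadContent?action=raw"

raw = urllib.request.urlopen(url).read().decode("utf-8")
patterns = [line.strip() for line in raw.splitlines()
            if line.strip() and not line.startswith("#")]
print(len(patterns), "patterns loaded")
}}}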
Need some sort of 'Good Content' filter
For some Wikis, particular items in the BadContent listing are inappropriate. I work on a MoinMoin Wiki about poker at www.overcards.com; I routinely need to edit the local copy of the BadContent because all links with "poker" in them are banned. I can see why others might consider all such links spam, but on a poker wiki, such a restriction is crippling!
I don't like the idea of pulling the word from the master, though I'm tempted frequently, and I don't want to disconnect from the global solution and the daily updates. Any simple way out?
I can't tell from the discussions and code fragment below whether this is an accepted "White" pages patch, or if it's the subject of the debate. Perhaps that refactoring is due? -- MentalNomad
It should not be hard to filter bad content words before they are processed. Find the place where we get bad content from the server, and filter the file with your own regular expression, removing all entries that mention poker. You are lucky; imagine the trouble of the admin of a porn wiki.
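A rough sketch of that filtering idea (the GOOD_RE pattern and the example entries are made up for illustration):
{{{
import re

GOOD_RE = re.compile(r"poker", re.IGNORECASE)   # entries you never want blocked locally

def filter_blacklist(lines):
    """Drop fetched blacklist entries that would block legitimate local content."""
    return [line for line in lines
            if line.startswith("#") or not GOOD_RE.search(line)]

fetched = ["# a comment line", r"spammer\.example", r"texas-holdem-poker\.example"]
print(filter_blacklist(fetched))   # the poker entry is gone
}}}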
I think that I have seen a patch which allows you to have a whitelist. Anyway, it would not be difficult to write. But in this case, you can fix BadContent on MoinMaster as well (see EditingOnMoinMaster).
Problems with this patch
- Some notes:
- If you are using mt-blacklist, why not fetch it from them (and remove comments automatically), instead of using a page here? Why not fetch more than one good blacklist?
- When the blacklist page hasn't been modified in a while, you fetch a new copy when someone requests a page. It would be better to just use a daily cron job and fetch the blacklist from the mt-blacklist server instead of making a visitor wait around while the list is being downloaded and copied.
The code is a little primitive. It doesn't escape the periods in the domain names, and a period in a regular expression can match any character. Try using something like blacklist_re = "|".join(map(re.escape, blacklist)) (see the sketch after these notes).
Also, what are we supposed to do when spammers add new spam links that aren't on the blacklist? They are coming up with new spam domains every day. With this solution, we have to do more work. We have to undo the vandalism, and also manually add each and every new spam domain to another page. This takes a lot of time. You might consider either adding an option to redirect links that are unknown (not on a whitelist) so that pagerank is unaffected, and/or an option to find pages that were saved earlier but contain blacklisted patterns, and/or an option to revert & blacklist at the same time (searching for urls in the diff).
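For illustration, the escaping difference the comment above is talking about (note the maintainer's reply further down: BadContent entries are already regexes, so this only applies if you treat them as literal domain names):
{{{
import re

blacklist = ["spammer.example", "another-spammer.example"]   # literal domains, made up for the example

unescaped = "|".join(blacklist)                  # "." here matches any character
escaped   = "|".join(map(re.escape, blacklist))  # "." is matched literally

print(bool(re.search(unescaped, "spammerXexample")))   # True - accidental match
print(bool(re.search(escaped,   "spammerXexample")))   # False
}}}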
Another problem is when the global blacklist is overly broad in its blacklisting. My particular example was attempting to link to http://www.tanked.netfirms DOT com/acrylic.html, but netfirms DOT com is blacklisted. I could find no reasonable way to permanently override the blacklist entry, so I made the following modification to antispam.py, allowing a GoodContent and LocalGoodContent page pair that facilitate removing entries from the blacklist. Exact textual matches from *GoodContent are removed from the blacklist. Note that this isn't a good universal solution, because it would be too easily hijacked for spam, unless the LocalGoodContent page is ACL protected. --KevinButler
The most effective spam solution, for e-mail at least, is a Bayes filter. How about integrating one written in Python into MoinMoin, e.g. http://spambayes.sourceforge.net/? MoinMoin could come with a trained filter, and wiki admins (I wouldn't allow any user to do this, just admins) could classify pages or content on pages as spam at the click of a button and at the same time remove the spam text from the page. Maybe the admin could simply select the text on the page and click a 'clean and classify' button or something similarly named, and some JavaScript would automatically detect the selected text, pass it to the Bayes filter and clean it. One problem with this is how to deal with spam once it is detected. If you immediately told the user the page looks like spam, they would be able to re-try creating the page until they got around the system. Maybe a way around that is to wait 20 minutes or so before dealing with the page. The page could be dealt with by automatically moving it to a PotentialWikiSpam page (which would never get indexed by Google). Marking legitimate pages that have been classified as spam as 'not spam' is a difficult issue. Ideally the administrator would do it to be sure it was legit, but that would not be good for most users. Trusting the users a bit more, you would hope only legitimate users trying to find their page after 20 minutes would instead get a spam page message, and they would know to go get their page and mark it as 'not spam'. The admin could always go back and mark it as spam again. Thoughts?
{{{
def save(self, editor, newtext, datestamp, **kw):
    BLACKLISTPAGES = ["BadContent", "LocalBadContent"]
    WHITELISTPAGES = ["GoodContent", "LocalGoodContent"]
    if editor.page_name not in BLACKLISTPAGES + WHITELISTPAGES:
        request = editor.request

        # collect blacklist patterns (LocalBadContent is never auto-updated)
        blacklist = []
        for pn in BLACKLISTPAGES:
            do_update = (pn != "LocalBadContent")
            blacklist += getblacklist(request, pn, do_update)

        # collect whitelist patterns the same way
        whitelist = []
        for pn in WHITELISTPAGES:
            do_update = (pn != "LocalGoodContent")
            whitelist += getblacklist(request, pn, do_update)

        # remove exact textual matches from the blacklist
        for page_re in whitelist:
            if page_re in blacklist:
                blacklist.remove(page_re)

        # ... the rest of the stock antispam.py save() (matching newtext
        # against the remaining blacklist) continues unchanged ...
}}}
I've improved the above solution for 1.3.5. Here's a patch (patch -p1)
Thomas Waldmann responded:
- we merge the mt blacklist now
It would make more sense and be more flexible to use mt-blacklist as a separate source for blacklisted content, in addition to your page here.
- Possible, even with existing code. But the src is somewhat hardcoded still. That will change with time...
- The current code first looks if the page on master is newer, and only in that case fetches it.
- Then you are one day longer in danger. The current delay of checking for updates is 1h. Is that delay on page-save noticeable for you? How long does it take compared to a normal save?
- You don't seem to understand my concern. Would you want to run a website where every time someone requests a page from your server, the CGI script connects to someone else's webserver and possibly downloads a long blacklist before allowing the visitor to see your page? This drains my system and slows down the site for the user, when I could just run a periodic background cron job that solves all of that.
- it doesn't fetch anything on read requests, only on save.
It just uses the regexes "as is". BadContent is a list of regexes, not of domain names.
You don't understand the real reason that WikiSpam exists. It is link spam. It is about getting a higher Google page rank. You are blocking anyone on a wiki from even talking about spammers' sites. You need to focus on blocking external URLs first and above all else.
- maybe don't make arrogant assumptions about what other people understand. I am sure you can find a way to talk about spammer web sites if you really want to.
This would make using a "." in a regex impossible. Strictly speaking, the regexes on BadContent aren't correct, but as "." also matches a literal ".", this is usually no problem.
- It is when you are blocking URLs. "google-com" is not supposed to be the same as "google.com"
- so did you ever write about google-com? Maybe try to solve real problems instead of making up synthetic ones.
- It is a real problem. If I whitelist "www.google.com" I don't want to also whitelist "wwwxgoogle.com" (for example)
We add them to the blacklist. You can add them to LocalBadContent, and after verification they will be moved to the dist list. To quickly get rid of a spammer, add them to your own LocalBadContent, too.
- So we have to wait for you to add them to your own blacklist before they are blocked.
- read the 2 lines of text above again
- managing a whitelist is even more work, and redirecting all external links via Google (or similar) is not planned.
- Actually, it is hardly any work at all, and remember, any link that is not on the blacklist or whitelist would be harmlessly sent through a redirect script, thereby denying any pagerank gain if it is spam.
- I don't plan to do all links via a redirect. I also don't plan to make a whitelist. Is that hard to understand?
- there is a full text search
- we may add more antispam functions, when, and only when, needed
I didn't ask for that. I told you the problems with the antispam script you whipped up. You deleted my feedback numerous times on this wiki, including any mention of the antispam patch I created for MoinMoin. You deleted NirSoffer's comments on this page as well. DisagreeByDeleting or distorting others' comments like you did mine isn't the way to accomplish anything.
- If you don't like the script, you are free to not use it.
- Constructing virtual problems, not reading what you reply to, posting wrong information about how it currently works, etc. isn't a way to accomplish anything, either.
- I will refactor this page at some time; that will include deleting discussion and entries of yours, of mine and of other people. I hope you can live with that.
How are spam attacks logged? For example, how do I find how many spam attacks were blocked by BadContent recently? -- NigelMetheringham 2005-01-31 10:20:10
- antispam does not log them. But you may see some POSTs in Apache's logs that did not lead to changes on your pages.
If the MoinMaster wiki hangs, saving pages hangs, too. We need a sensible timeout! 20 seconds?
MoinMoin 1.3 has an integrated timeout.
More Ideas
Mark SPAM on revert
Offer a check box "Mark links as spam" in the revert action. If the user checks the box, the removed lines are searched for external links. If the user may write LocalBadContent, the links are added to it (with regex quoting!). If the user is not allowed to edit LocalBadContent, the links are added to LocalBadContent/StagingArea and may be moved to LocalBadContent by another user later on (see the sketch below).
Perhaps strip the file name and use the domain name only.
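A rough sketch of the link-extraction step of this idea (plain Python, not wired into the actual revert action; the URL regex and function name are made up for illustration):
{{{
import re
from urllib.parse import urlparse   # Python 3 sketch

URL_RE = re.compile(r"https?://[^\s'\"<>]+")   # simplified URL matcher

def spam_patterns_from_removed_lines(removed_lines):
    """Turn external links found in reverted lines into LocalBadContent entries."""
    patterns = []
    for line in removed_lines:
        for url in URL_RE.findall(line):
            domain = urlparse(url).netloc        # strip the file name, keep the domain
            patterns.append(re.escape(domain))   # regex-quote before adding to the page
    return sorted(set(patterns))

removed = ["buy pills at http://cheap-pills.example/offer.html now"]
print(spam_patterns_from_removed_lines(removed))   # one escaped domain pattern
}}}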
Distributed mark on revert
Same as "Mark SPAM on revert" but the links are added to MoinMoin:LocalBadContent/StagingArea. This would give us the chance to easily get all the spam out there to add it to our list. This feature would only be enabled if the wiki uses AntiSpamGlobalSolution to avoid too much double hits.
chongqing
As spammers try to trick Google into better rankings by abusing our rank/reputation, we could use our rank to lower theirs.
See http://chongqed.org/chongqed.html
Experimental list export: http://distribute.chongqed.org/
Text file with tab-separated lines containing:
- URL
- key words
- link to Chongqed.org
We could simply add these links automatically to our pages, perhaps invisible (white on white) or hidden under a fixed image. We could show these links to search engine bots only.
Don't do that! http://www.google.com/webmasters/faq.html: "The term "cloaking" is used to describe a website that returns altered webpages to search engines crawling the site. ... To preserve the accuracy and quality of our search results, Google may permanently ban from our index any sites or site authors that engage in cloaking to distort their search rankings."
If you aren't into chongqing, you could also use http://blacklist.chongqed.org/ to block the spammers that are in the chongqed.org database.
We have discontinued the distribute list but are discussing another better method to accomplish the same thing. -- Joe(at)chongqed.org
LocalBadContent Changes Underlay
I wish changes to LocalBadContent were made to the UnderlayPages version.
My ~10-15 MoinMoin instances tend to get hit in series, and I don't like having to change 10-15 pages when a spammer decides to test the waters.
Right now, I just edit the underlay page directly. But I'd like to be able to do so remotely, without ssh'ing in.
-- LionKimbro 2005-02-08 08:40:28
- Have you tried sym-linking? It will break RecentChanges, though.
The purpose of the current underlay directory is easy upgrades, not sharing of local content. Shared content is a real need and will have to be solved outside of the underlay directory.
As a solution, I would write a macro and action that lets you edit the contents of the local bad content underlay page. The macro can get the raw text of the page, show it in a text area, and let you save the new text (with possibly no backup). Require admin rights to use that macro and you have an easy remote way.
Blocking Chinese
Many wikis out there don't have legitimate Chinese content, but they do get lots of Chinese spam.
If you are sure your wiki has no legitimate Chinese content, you can use something like this on your LocalBadContent page:
{{{
# and
和
# or
或
# you (2nd means respect)
你
您
# we (2nd is Simplified Chinese)
我們
我们
}}}
Never ever put that onto the master BadContent page. Of course there are also quite a few wikis with legitimate Chinese content!
Also be aware that if a CJK user of your wiki created a homepage and put some text on it in his or her language, he or she may not be able to save that page again if you use these regexes, so be careful!
An even more universal regex is to forbid the whole "CJK Unified Ideographs" block (Chinese, Japanese, Korean), which is U+4E00 - U+9FFF:
{{{
[\u4e00-\u9fff]
}}}
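A quick local sanity check of such a pattern (a Python 3 sketch, just to see what the character class matches):
{{{
import re

cjk = re.compile(r"[\u4e00-\u9fff]")   # any CJK Unified Ideograph

print(bool(cjk.search(u"hello world")))        # False
print(bool(cjk.search(u"hello 你好 world")))    # True - contains CJK characters
}}}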