Disable Surge Protection by Default
You are turning a considerable part of the internet into a hell with your surge protection. If you really think it is necessary, keep it in, but DISABLE IT BY DEFAULT!
Every site using MoinMoin that I have come across so far has not adapted the surge parameters to its real needs, and bans an IP for who knows how long a second after an unknowing user tries to automatically copy some pages.
Even 30-year-old servers would have no problem with the request rates at which surge protection kicks in, and the funny thing is that common bots and downloaders keep making requests even after error pages start being returned, so it does nothing to reduce the traffic (it likely multiplies it, as the user will easily retry several times with other parameters before finding out they have been banned for hours). And what kind of strategy is this if it's meant for traffic control? Every normal site that cares about traffic just limits the number of connections and the maximum bandwidth; it does not get huffy and take all its stuff away! And that is for commercial sites, which might have legitimate fears of mass downloads, while most MoinMoin sites belong to open-source projects whose aim is to distribute as widely as possible!
Is this meant to prevent automatic edits? Then why not block just the edits?
Is it meant to prevent automatic copying? Most of the sites that use MoinMoin are GPL projects, with free-license notices even on the web pages! In all my years on the web I have met maybe 3 or 4 sites that really don't want their contents to be saved!
Is it there because bots easily make useless requests (old revisions, thousands of login pages)? Apart from the fact that a better structure might avoid that, just block those who do so!
Many sites have the wiki as the only documentation source of the project, so getting an offline version of even the basic documentation has become close to impossible!
This has to be corrected in the main program, because otherwise the vast majority of sites will keep using the default settings with no idea of the problems they bring about.
Maybe I'm not being very polite in this request, but you must understand that with your choices you affect many people who don't care a hoot about your project. This thing seems to have had a lot of success and you can be happy with that, but it makes you responsible for an important part of the internet, and you should think it over a thousand times before introducing a new default feature.
Discussion
from Marcel Häfner
Information about disabling SurgeProtection in your wiki is here: http://moinmo.in/HelpOnConfiguration/SurgeProtection. So it's up to the wiki admin which configuration he uses.
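For illustration, here is a minimal wikiconfig.py sketch that turns surge protection off. The option name is the one documented on HelpOnConfiguration/SurgeProtection; the class layout is just illustrative, and the relevant line belongs inside your existing config class:

{{{#!python
# wikiconfig.py (sketch) -- verify against the documentation of your MoinMoin version
from MoinMoin.config import multiconfig

class Config(multiconfig.DefaultConfig):
    sitename = u'My Wiki'        # your existing per-wiki settings stay as they are

    # setting surge_action_limits to None switches surge protection off completely
    surge_action_limits = None
}}}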
The SurgeProtection feature was introduced in version 1.5.2; we are now at 1.9.4 (http://hg.moinmo.in/moin/1.9/file/1ddf7d88c53d/docs/CHANGES). Version 1.5 came out in 2006.
I do like and use surge protection, together with TextCha (http://master19.moinmo.in/HelpOnSpam), because, for example, generating a WordIndex uses a lot of CPU/memory/time in a bigger wiki (>2000 pages).
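As a side note, a TextCha setup in wikiconfig.py could look roughly like this; the option format is the one described in the TextCha help page linked above, and the questions and answer patterns below are made-up examples:

{{{#!python
# sketch only -- questions and answer regexes are made-up examples
textchas = {
    'en': {
        u"What is the name of this wiki engine?": u"[Mm]oin ?[Mm]oin",
        u"How much is 3 + 4?": u"7",
    },
}
}}}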
30 years ago we had neither Linux nor Python...
bye -- MarcelHäfner 2012-03-28 11:05:29
from Thomas Waldmann
about surge protection
- Just to quote Spock: "the needs of the many outweigh the needs of the few."
- Your impression that any server can easily cope with today's loads is wrong. We run the moinmo.in site on a 4-core (x2 with HT) i7 2.66 GHz machine with lots of RAM, so it is quite a powerful machine. Just a few days ago I was notified by server monitoring that the wiki response time was too long (> 30s for a simple page). I accessed the wiki and yes, it was in trouble: I just got an error, not a wiki page. I logged in via ssh and saw that all CPUs were at 100% load. Looking into access.log showed that someone/something was making an excessive number of stupid requests (history, info, searches, ...) in rapid sequence, basically DoSing the server and making it unavailable for everybody else. I wondered about surge protection and noticed that I had turned it off a while ago for some reason and forgotten to turn it on again. I did that, and from that minute the load dropped to normal levels and the wiki was available for everybody again (just not for the IP that had DoSed it).
- Your impression that SP is pointless because the tool keeps sending requests is also wrong. A request from a locked-out client is denied much faster than if it were processed normally, and the denial response is often shorter than the normal response. So surge protection is effective against such a DoS, even if it does not completely get rid of it.
- It is not just about traffic or the number of connections, as you assume; it is about server load (CPU, disk, ...).
- I don't know an easy way to "limit bandwidth or the number of connections"; if you have good suggestions for that, please give details. Consider that just letting a client wait for a response is not an easy option, as that can eat up open TCP connections and make the server run out of THAT resource.
- The surge protection defaults are reasonable for human users with browsers: they are high enough that no human user usually runs into them. If somebody manages to nevertheless, the wiki will warn him/her; a lockout only happens if that warning is ignored. (A sketch of how an admin can adjust these limits follows this list.)
- Using tools like wget to "mirror" a wiki is not appropriate (at least not for a complete mirror with simple or default parameters). Often these tools stupidly request every link they see, causing extreme load, DoSing the site and generating lots of traffic, and the result is 99% crap (because they requested much more than the normal page content one likely wanted to have).
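For admins who find the defaults too tight for their audience, the limits can also be relaxed per action instead of switching the feature off. A hedged sketch, using the action: (requests, seconds) format from HelpOnConfiguration/SurgeProtection; the numbers are illustrative, not the shipped defaults:

{{{#!python
# sketch only -- compare with the real defaults in MoinMoin/config/multiconfig.py
# of your installed version before changing anything
surge_action_limits = {
    'all': (60, 30),          # at most 60 requests of any kind per 30 seconds
    'default': (60, 60),      # fallback per action: 60 requests per 60 seconds
    'fullsearch': (10, 120),  # searches are expensive, keep them tight
}
surge_lockout_time = 1800     # lockout duration in seconds (assumed option name)
}}}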
solving your problem
- I understand your problem and why you tried to mirror that site. I also understand why you are a little pissed, but I hope that, after reading the facts above, you now understand a little better why SP is turned on by default.
- If you need the contents offline, maybe ask the admin of that site to provide a regularly generated HTML dump of the wiki (moin ... dump used in a cron job). You'd basically get what you wanted without overloading the system (SP off) or getting locked out (SP on). The problem would also be solved for everybody else, without everybody having to try to make their own mirror.
- The other option is to use a well-behaved tool to mirror (search engines usually follow these rules; a minimal script sketch follows this list):
- limit requests/s to a low value (look at surge protection defaults)
- if there is a robots.txt or a nofollow attribute on a link, these should be respected before fetching the link target
- or (this leads to an even better result): only request pages (content, show action) and maybe attachments, not other actions like info/history, old revisions or any search
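Here is a minimal sketch of such a "polite" fetcher (Python 3 syntax). The wiki URL, the page list and the delay are assumptions; it only requests raw page content, honours robots.txt and keeps the request rate far below typical surge limits:

{{{#!python
# sketch only -- adjust WIKI_BASE, PAGES and DELAY for the target wiki
import time
import urllib.parse
import urllib.request
import urllib.robotparser

WIKI_BASE = "http://wiki.example.org/"   # assumed wiki base URL
PAGES = ["FrontPage", "HelpContents"]    # assumed page names to fetch
DELAY = 5                                # seconds between requests

robots = urllib.robotparser.RobotFileParser(
    urllib.parse.urljoin(WIKI_BASE, "robots.txt"))
robots.read()

for name in PAGES:
    # only fetch raw page content, no info/history/diff/search actions
    url = urllib.parse.urljoin(WIKI_BASE, name) + "?action=raw"
    if not robots.can_fetch("*", url):
        continue
    with urllib.request.urlopen(url) as resp:
        with open(name.replace("/", "_") + ".txt", "wb") as out:
            out.write(resp.read())
    time.sleep(DELAY)  # keep the request rate low
}}}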
- If the site has xmlrpc services turned on (ask the admin), you can also use an xmlrpc-based script to get page contents, e.g. ForkWikiContent or similar scripts.
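For example, a hedged sketch using the standard WikiRPC v2 interface (the ?action=xmlrpc2 endpoint) with Python's xmlrpc.client; the wiki URL is made up and the calls only work if the admin has xmlrpc enabled:

{{{#!python
# sketch only -- requires xmlrpc to be enabled on the target wiki (ask the admin)
import xmlrpc.client

srv = xmlrpc.client.ServerProxy("http://wiki.example.org/?action=xmlrpc2")  # assumed URL
for name in srv.getAllPages():    # WikiRPC v2 call: list of all page names
    text = srv.getPage(name)      # WikiRPC v2 call: raw markup of one page
    with open(name.replace("/", "_") + ".txt", "w", encoding="utf-8") as f:
        f.write(text)
}}}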
-- ThomasWaldmann 2012-03-28 11:58:08