While it is generally desirable that search engines index your wiki, crawling also puts quite a load on it, since the spiders will also follow the links to "meta" pages, like search, page info and so on.

Since revision 1.162, MoinMoin refuses to serve those meta pages to spider agents, i.e. it sends a 403 FORBIDDEN response very early in the request cycle. Spiders are recognized by the MoinMoin.util.web.isSpiderAgent() function, which in turn uses the config.ua_spiders setting.
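
As an illustration of the refusal mechanism just described, here is a minimal, self-contained Python sketch. It is not the actual MoinMoin code: the pattern string and the check_request() helper are invented for this example; only the names isSpiderAgent and ua_spiders come from the text above.

{{{
import re

# Assumed stand-in for the config.ua_spiders setting: a pipe-separated list
# of keywords that identify crawlers in the User-Agent header.
ua_spiders = r"archiver|crawler|curl|gigabot|googlebot|linkwalker|msnbot|slurp|spider|wget"
_spider_re = re.compile(ua_spiders, re.IGNORECASE)

def is_spider_agent(user_agent):
    # Rough equivalent of MoinMoin.util.web.isSpiderAgent(): True if the
    # User-Agent string contains one of the spider keywords.
    return bool(user_agent) and _spider_re.search(user_agent) is not None

def check_request(user_agent, action=None):
    # Hypothetical early check: any "?action=..." request from a spider is
    # answered with 403 FORBIDDEN before any page rendering happens.
    if action is not None and is_spider_agent(user_agent):
        return 403
    return 200

# A spider asking for a "meta" page (e.g. ?action=info) is refused,
# while a plain page view from a browser goes through.
print(check_request("Googlebot/2.1 (+http://www.google.com/bot.html)", action="info"))  # 403
print(check_request("Mozilla/5.0 (X11; Linux x86_64)"))                                 # 200
}}}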

Accesses by spiders are also excluded from event logging, so any user agent that still shows up in the related statistics is worth adding to the list of recognized spider keywords.
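
In the same spirit, the logging exclusion could look roughly like the following guard (again only a sketch; log_event() and its arguments are hypothetical, and it reuses is_spider_agent() from the sketch above):

{{{
def log_event(user_agent, page_name, event_type):
    # Hypothetical event logger: hits from recognized spiders are skipped,
    # so they never show up in the hit statistics.
    if is_spider_agent(user_agent):
        return
    # ... append (page_name, event_type) to the event log here ...
}}}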


Lele
  • I wonder if there is something better we can do. What if, as soon as MoinMoin recognizes a spider, it stripped the query string down to its core and added action=print? Wouldn't this be much more effective?

Thomas
  • That's more traffic, and the same contents get delivered again and again - and even under URLs where normal people can't get them. I don't think that is a good idea. But maybe we should try putting a NOFOLLOW on most pages except RecentChanges.

Lele
  • Uhm, why should it generate more traffic? I do not follow the reasoning... The printable version of the page carries hardly any of the "interactive" links (bottom page actions, menu bar icons...). And it seems that we are already using the NOFOLLOW,NOINDEX pragmas for spiders, without good results.

Thomas
  • By delivering page contents, it generates more traffic than the FORBIDDEN error code bots get now when trying to get URLs with "?action=...".

    I don't think [NO]INDEX,[NO]FOLLOW is used yet the way I meant: RecentChanges should get INDEX,FOLLOW, all other pages INDEX,NOFOLLOW.

Lele
  • Wait, but how can the spider ask for an action it never sees? I mean, receiving the printable version it shouldn't get the action links, right? All it gets is links to other pages, no "EditText", no "Search" boxes... I will double-check the http pragma issue: you could be right, or the spiders may be ignoring them...

Thomas
  • Thinking about that again, I see some common ground with my "less than read" rights idea from the AccessControlList page. Search engines would then be in a group granted only these "less than read" rights, so they would only see the page as it appears with action=print.

AlexanderSchremmer
  • Do spiders really follow e.g. the RecentChanges link multiple times? The URL is canonical and the spider should realise that it's the same link target on every page. /me wants to see some logs proving this problem.

    RC is maybe not a problem, but the ?action links definitely are, and maybe the links to all sorts of other pages as well. But where's the problem? Exposing FrontPage, RecentChanges and TitleIndex should be enough for spiders to find all pages and even all latest changes. -- ThomasWaldmann 2004-09-30 23:38:27

    Thus blocking actions should be enough. We don't need to block particular pages, do we? -- AlexanderSchremmer

    We do not block any normal page; they just get 'nofollow' to save unnecessary traffic. -- ThomasWaldmann 2004-09-30 23:38:27
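
A tiny sketch of the "nofollow everywhere except RecentChanges" policy discussed above (illustrative only, not the actual theme code; robots_meta() is a made-up helper). ?action=... URLs never get this far, because spiders receive 403 for them as described at the top of the page.

{{{
def robots_meta(page_name):
    # Hypothetical helper choosing the robots meta tag for a normal page view:
    # only RecentChanges lets spiders follow links; every other page may be
    # indexed, but its links are not followed.
    if page_name == "RecentChanges":
        return '<meta name="robots" content="index,follow">'
    return '<meta name="robots" content="index,nofollow">'

print(robots_meta("RecentChanges"))  # index,follow
print(robots_meta("FrontPage"))      # index,nofollow
}}}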


What are the performance considerations of having a large regex to block many of the spider user agents? At what point does the cure become worse than the problem? -- Adam.

I don't think this is a problem. The page rendering is far more CPU (and traffic) intensive than this early blocking of spiders. -- ThomasWaldmann

The cost of loading a pickled regular expression with about 50 short patterns and matching it against the user agent string is close to zero. Try it yourself. -- NirSoffer 2004-10-01 01:51:33
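
One way to try it is a throwaway script along these lines (the pattern list is invented to mimic roughly 50 short keywords; it is not the real ua_spiders value):

{{{
import re
import time

# 48 fake keywords plus two real-looking ones, about 50 alternatives in total.
patterns = "|".join("bot%02d" % i for i in range(48)) + "|googlebot|slurp"
spider_re = re.compile(patterns, re.IGNORECASE)

user_agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

n = 100000
start = time.time()
for _ in range(n):
    spider_re.search(user_agent)
elapsed = time.time() - start
print("%d matches in %.3f s (about %.2f microseconds per match)" % (n, elapsed, elapsed / n * 1e6))
}}}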
