An alternative approach to controlling searchbots

Currently, a <meta name="robots" content="index,nofollow"> header is put into all pages except FrontPage, FindPage, SiteNavigation, RecentChanges and TitleIndex, in order to reduce the machine load and bandwidth consumed by spiders. Searchbots can still reach all pages via the TitleIndex page, which means they can still index them, but this approach throws away the information about the relations between resources, so you probably don't get the ranking in the search engines' results you would deserve.

The background for this is that search engines rank sites with algorithms that are often based on links. This concerns not only in-links from external resources to a wiki page, but also the links between the individual pages of the wiki itself, and the out-links from the wiki to external resources. If a searchbot cannot follow (virtually) any link in a wiki, it cannot calculate which of its pages is relatively more or less important, nor which one is more or less relevant for a given keyword. In addition, we cannot benefit from in-links to any wiki page but the FrontPage, since for search engines all pages are dead ends.
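
As a rough illustration only (not how any particular search engine actually works), the toy power-iteration sketch below shows how rank flows along followable links; if a bot may not follow internal links, every page is a dead end and that flow never happens. All names and numbers in it are made up.

    # Toy link-based ranking (PageRank-style power iteration); illustrative only.
    def toy_rank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        rank = dict((p, 1.0 / len(pages)) for p in pages)
        for _ in range(iterations):
            new = dict((p, (1.0 - damping) / len(pages)) for p in pages)
            for page, targets in links.items():
                if not targets:           # dead end: this page passes no rank on
                    continue
                share = damping * rank[page] / len(targets)
                for target in targets:
                    if target in new:
                        new[target] += share
            rank = new
        return rank

    # With internal links followable, HelpContents collects rank from two pages ...
    print(toy_rank({"FrontPage": ["RecentChanges", "HelpContents"],
                    "RecentChanges": ["HelpContents"],
                    "HelpContents": ["FrontPage"]}))
    # ... with nofollow everywhere, every page keeps only the minimal base share.
    print(toy_rank({"FrontPage": [], "RecentChanges": [], "HelpContents": []}))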

This feature request is about better support for search engines. I don't have to explain that content search engines cannot crawl will hardly be found, and that working on content that will not be found afterwards is a waste of time. Thus, not providing good support for search engines discourages the wiki's users, and a wiki without motivated users won't grow. I therefore propose to replace the current [no]index,[no]follow scheme with a more fine-grained solution.

#1 Don't use any robots header at all, except on pages that are generated by actions (e.g. edit, diff, or info) or that contain highlighted search terms and so on; there noindex,nofollow would be appropriate, since indexing these pages and following their links would only produce useless traffic and load.
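
A minimal sketch of that decision, assuming a hypothetical helper that receives the current action name and query string (the names below are not MoinMoin's real API):

    # Sketch only: action and parameter names are assumptions, not MoinMoin's API.
    NOINDEX_ACTIONS = ("edit", "diff", "info", "print", "raw")

    def robots_meta(action=None, query=""):
        """Return the robots meta tag to emit, or None for plain page views."""
        if action in NOINDEX_ACTIONS or "highlight=" in query:
            return '<meta name="robots" content="noindex,nofollow">'
        return None   # plain page views: no robots header, let bots index and follow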

#2 Use the attribute rel="nofollow" on the HTML a element, as proposed by several search engines (Google, Yahoo, MSN), for any other link we do not want searchbots to follow, e.g. anything related to actions (see above).

#3 Use the attribute rel="nofollow", too, for links pointing to external resources, in order to discourage link spammers. I know this won't prevent link spamming as such, so it is not meant as a replacement for the current AntiSpamFeatures, but as an addition to them. That way, nobody can benefit from links placed in wikis for their own page ranking. Maybe InterWiki links don't need the rel="nofollow" attribute, since we trust everybody in our InterWikiMap.
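
A possible way to apply #2 and #3 when a link is rendered; the helper and its parameters below are illustrative assumptions, not MoinMoin's actual link formatter:

    # Sketch only: this is not MoinMoin's real formatter code.
    def link_rel(is_external=False, is_interwiki=False, is_action=False):
        """Decide whether a rendered anchor should carry rel="nofollow"."""
        if is_action:
            return "nofollow"    # #2: action links (edit, diff, info, ...)
        if is_external and not is_interwiki:
            return "nofollow"    # #3: external links, to deter link spammers
        return None              # internal and InterWiki links stay followable

    # Example: building the attribute string for an external link
    rel = link_rel(is_external=True)
    attrs = ' rel="%s"' % rel if rel else ""
    print('<a href="http://example.com/"%s>example</a>' % attrs)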

#4 In order to throttle requests from a particularly aggressive searchbot, add the following to the /robots.txt file:

User-agent: msnbot
Crawl-delay: 60

This will make MSN's bot spider only one page per minute (see http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm).

#5 Use the /robots.txt file, too, to exclude certain directories or file types we do not need to have indexed (wildcard patterns like the *.exe rule below are, again, understood only by MSN's bot; see http://www.robotstxt.org/wc/faq.html):

User-agent: *
Disallow: /wiki
# note images, stylesheets, and so on are stored somewhere under /wiki, too
User-agent: msnbot
Disallow: /*.exe$
# just an example

#6 In order to also reduce load and traffic caused by human users, configure your web server to send Expires HTTP headers for certain file types, e.g. (for Apache):

<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresByType text/css "now plus 1 month"
  ExpiresByType image/png "now plus 1 week"
</IfModule>

This has more to do with server configuration than with wiki configuration, though.

Another idea on server configuration: block anything related to actions already at the web server level. With Apache, you could configure mod_rewrite like this:

    # block ?action= requests for these spiders
    RewriteCond %{QUERY_STRING} action=[^as]
    RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
    RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
    RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
    RewriteCond %{HTTP_USER_AGENT} Teoma [OR]
    RewriteCond %{HTTP_USER_AGENT} ia_archive
    RewriteRule .* - [F,L]

This looks for robot requests containing "action=" in the query string (except for "action=a[ttachment]" and "action=s[how]"), and denies such requests at the web server level. This avoids the cost of even initializing the wiki engine for such requests. We must make an exception for "attachment", because this is the way inline graphics etc. are handled. I'm aware this wiki is not running under Apache, but within a Twisted framework. But maybe there is a way to accomplish something like this (both the rewrite blocking and the Expires headers) with Twisted, too?
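
For example, a wrapper resource in Twisted could do both checks before handing the request over to the wiki. This is only a sketch under assumptions: the bot list and allowed actions mirror the rewrite rules above, and how such a wrapper would actually be hooked into MoinMoin's Twisted setup is left open.

    # Sketch only: how to plug this into MoinMoin's Twisted server is not shown.
    import time
    from email.utils import formatdate
    from twisted.web import resource

    BOTS = ("Googlebot", "Slurp", "msnbot", "Teoma", "ia_archive")
    ALLOWED = ("show", "AttachFile")   # mirrors action=s... / action=a... above

    class SpiderFilter(resource.Resource):
        """Wraps the wiki resource; refuses bot action requests, adds Expires."""
        isLeaf = True

        def __init__(self, wiki):
            resource.Resource.__init__(self)
            self.wiki = wiki

        def render(self, request):
            agent = request.getHeader("user-agent") or ""
            action = (request.args.get("action") or [None])[0]
            if action and action not in ALLOWED and any(b in agent for b in BOTS):
                request.setResponseCode(403)
                return "robots are not allowed to use this action\n"
            # crude Expires header, one week ahead (cf. #6 above)
            request.setHeader("Expires",
                              formatdate(time.time() + 7 * 86400, usegmt=True))
            return self.wiki.render(request)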

However, all these things are still manual interventions. A smarter solution would handle both attachments and actions differently: attachments would be served directly by the web server (as they are mostly more or less static content), and actions would no longer use GET requests (see also RFC 2616, Section 9.1.1: "In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval."). I'm using the conditional, but I remember there was something like that in the pipeline for MoinMoin version 2.0.
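
Serving attachments straight from disk could, for example, look like the following with Twisted's static file support; the directory path and URL prefix here are invented for the example, MoinMoin's real attachment layout may differ.

    # Sketch only: path and URL prefix are made up for illustration.
    from twisted.web import server, static, resource

    root = resource.Resource()
    # hand out attachment files directly, without ever touching the wiki engine
    root.putChild("attachments", static.File("/var/lib/moin/wiki/data/attachments"))
    site = server.Site(root)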

Thanks, -- MartinBayer 2005-07-22 22:12:01


Beyond PageRank

There are some more reasons why the current implementation does not work, and needs to be replaced by a smarter solution:

Needless to say, the last point only applies to search engines that are not locked out completely via the robots.txt directives, such as Yahoo and other Inktomi-based search engines. Thus, the situation is even worse than one could imagine from this Request for Enhancement alone.

-- MartinBayer 2006-06-20 12:13:31

I am working on this. You either have to help or wait. -- ThomasWaldmann 2006-06-21 06:49:16

Look at http://test.wikiwikiweb.de/ (the rel="nofollow" implementation as it is now; the main goal is avoiding unnecessary load). -- ThomasWaldmann 2006-06-21 08:08:33

Currently testing a simple Google sitemap generator: http://moinmoin.wikiwikiweb.de/?action=sitemap
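
For reference, a minimal sitemap of that kind could be produced along the following lines; this only illustrates the XML format, it is not the code behind the action above, and the page names and dates are invented.

    # Sketch only: illustrates the sitemap XML format, not MoinMoin's action code.
    from xml.sax.saxutils import escape

    def sitemap_xml(base_url, pages):
        """pages: iterable of (page_name, iso_date) tuples."""
        out = ['<?xml version="1.0" encoding="UTF-8"?>',
               '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
        for name, lastmod in pages:
            out.append("  <url><loc>%s%s</loc><lastmod>%s</lastmod></url>"
                       % (escape(base_url), escape(name), lastmod))
        out.append("</urlset>")
        return "\n".join(out)

    print(sitemap_xml("http://moinmoin.wikiwikiweb.de/",
                      [("FrontPage", "2006-06-21"), ("RecentChanges", "2006-06-21")]))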


Related Issues: FeatureRequests/GoogleSitemapGeneration (implemented), MoinMoinBugs/RecentChangesUsesNofollow (mostly fixed)

CategoryFeatureRequest
