An alternative approach to controlling searchbots
Currently, a <meta name="robots" content="index,nofollow"> header is put into all pages except FrontPage, FindPage, SiteNavigation, RecentChanges and TitleIndex, in order to reduce the machine load and bandwidth consumed by spiders. Searchbots can still reach all pages via the TitleIndex page and thus index them, but this approach throws away the relations between the resources, which means you probably don't get the page rank in the search engines' results that you would deserve.
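To illustrate the current behaviour, here is a minimal Python sketch; this is not MoinMoin's actual code, and the function and constant names are made up:

    FOLLOW_WHITELIST = {"FrontPage", "FindPage", "SiteNavigation",
                        "RecentChanges", "TitleIndex"}

    def robots_meta(pagename):
        # Every page except the whitelisted ones gets "index,nofollow",
        # so its content is indexed but its outgoing links are not followed.
        content = "index,follow" if pagename in FOLLOW_WHITELIST else "index,nofollow"
        return '<meta name="robots" content="%s">' % content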
This assumes a specific nofollow behaviour with regard to page rank; please provide documents describing what you assume.
So, on what documents is your assumption based that setting a nofollow META would not affect the pagerank?
http://www.robotstxt.org/wc/meta-user.html (and many others). If it does influence pagerank, that is maybe an "undocumented Google feature". Maybe they should invent a "norank" value instead of changing the semantics of existing things.
- This doesn't say anything about pagerank. But it seems logical that a link a spider cannot follow has no positive effect on the pagerank of the page linked to.
- No, that's not logical (at least not to me). That meta stuff was invented to control spiders' behaviour when collecting pages. And nofollow in the META element means just that: "do not follow the links to fetch those pages". It never meant "do not look at what you already have". So Google's choice to also name it rel="nofollow", while vaguely defining different behaviour for it (mostly talking about pagerank, but not about the fetching of pages), was just a bad one.
http://microformats.org/wiki/rel-nofollow "does not mean the same as robots exclusion standards (robots.txt, meta robots) nofollow."
This doesn't say anything about the pagerank of the page linked to. Note I'm not referring to the page where the nofollow META is set, but to the pages this page is linking to.
http://en.wikipedia.org/wiki/Nofollow#rel.3Dnofollow
You are not responding to the question. Please explain why in your opinion a nofollow set in the META element (the current implementation) does not affect the page rank. Note that at this point we are not talking about "nofollow is evil" or "nofollow rulez". We are talking about page ranks. There is not one single document that claims, as you do, that a nofollow META does not affect the pagerank of the pages linked to.
This is leading nowhere. Why not start with reading PageRank? Again, rel=nofollow in the A element is equivalent to the nofollow in the META element. I can tell you from my wiki's Google analysis that the page with the highest pagerank, month after month, is the FrontPage, although it should be some other, more important page, i.e. the page to which more links point than to any other. There is an exception in one single month, when there was a deeplink to a specific page in an important newsletter. Even then we could not benefit from that deeplink, because the bot could not spider any other page linked from that page. Therefore, I have come to the conclusion that the current implementation is hurting popularity, discouraging users, and thus destroying wikis.
The background for this is that search engines use certain algorithms for the ranking of a site, often based on links. This concerns not only in-links from external resources to a wiki page, but also the links between the individual pages of the wiki itself, and the out-links from the wiki to external resources. If a searchbot cannot follow (virtually) any link in a wiki, it will not be able to calculate which of its pages is relatively more or less important, nor which one is more or less important for a given keyword (a toy sketch of this link-based ranking follows further below). Plus, we cannot benefit from in-links to any wiki page but the FrontPage, since for search engines all pages are "dead ends".
The links are on content it already has ("index"), so it CAN use them.
The point is not only the indexing of single pages, but also the spidering of the entire site. Thus, having nofollow set more or less everywhere means restricting access to the contents as a whole. The current implementation forces bots to spider the contents via FrontPage and TitleIndex only. If a bot comes in via a deeplink to any other page of your wiki, it will see that very page only, and then go away. So, in order to benefit from in-links, you would have to ask other webmasters to link to the FrontPage exclusively, which would be very silly.
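To make the link-graph argument concrete, here is a toy Python sketch of link-based ranking in the spirit of PageRank. It is a simplification for illustration only, not any search engine's real algorithm, and the page names are invented. If a crawler is told not to follow internal links, the link graph it sees is essentially empty and no such weights can be computed.

    def pagerank(links, damping=0.85, iterations=50):
        # links: dict mapping each page to the list of pages it links to
        pages = set(links) | {t for targets in links.values() for t in targets}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / len(pages) for p in pages}
            for page in pages:
                targets = links.get(page, [])
                if targets:
                    share = damping * rank[page] / len(targets)
                    for t in targets:
                        new[t] += share
                else:
                    # dangling page: spread its rank evenly over all pages
                    for p in pages:
                        new[p] += damping * rank[page] / len(pages)
            rank = new
        return rank

    # Example: pages that receive more internal links end up with more weight
    print(pagerank({"FrontPage": ["HelpContents", "RecentNews"],
                    "HelpContents": ["RecentNews"],
                    "RecentNews": ["HelpContents"]}))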
This feature request is about better support for search engines. I don't have to explain that if search engines cannot crawl contents, those contents will hardly be found, and that working on contents that will not be found afterwards is a waste of time. Thus, not providing good support for search engines means discouraging the wiki's users, and a wiki without encouraged users won't grow. I therefore propose to replace the current [no]index,[no]follow scheme with a more fine-grained solution.
A search engine can index (== store and look at its content) every page on a moin wiki with the exception of actions and POST results.
- Only if it knows the page is there, which requires working links between the pages of the wiki itself.
#1 Don't use any robots header at all, except on pages that are generated by actions (e.g. edit, diff, or info) or that contain highlighted text and so on; there, noindex,nofollow would be appropriate, as following links on these pages would only produce useless traffic and load.
#2 Add the attribute rel="nofollow" to the HTML A element, as proposed by several search engines (Google, Yahoo, MSN), for any other link we do not want searchbots to follow, e.g. anything related to actions (see above; a sketch follows below).
good idea
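As a rough sketch of what #2 could look like in Python (the function and constant names here are hypothetical, not MoinMoin's real API):

    ACTION_MARKER = "action="

    def render_link(href, text):
        # Action links (edit, diff, info, ...) get rel="nofollow" so that
        # well-behaved bots neither fetch nor credit them; plain page links
        # are left untouched.
        rel = ' rel="nofollow"' if ACTION_MARKER in href else ''
        return '<a href="%s"%s>%s</a>' % (href, rel, text)

    print(render_link("/SomePage?action=diff", "diff"))   # gets rel="nofollow"
    print(render_link("/SomePage", "SomePage"))           # normal link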
#3 Use the attribute rel="nofollow", too, for links pointing to external resources, in order to discourage link spammers (a sketch follows at the end of this point). I know this won't prevent any link spamming, so it is not meant as a replacement for the current AntiSpamFeatures, but as an addition to them. That way, no one will benefit from links in wikis for his own page ranking. Maybe InterWiki links don't need the rel="nofollow" attribute, since we trust everybody in our InterWikiMap.
bad idea
Please give reasons. This is the way almost every competing wiki engine does it. Don't say everybody else is wrong; there must be a reason why only MoinMoin refuses to implement this. Note that I'm proposing a replacement for the current implementation, which includes, as already said, the nofollow META element almost everywhere, which is equivalent to rel="nofollow" in the A element; so we already do something like this. However, if in doubt, make it an option (I would switch it on).
Use google. I won't repeat what many people have already written about this. Alternatively read the link you gave me yesterday. Maybe you can also find an IRC log of #wiki, it has also been a topic there.
Again, you are not responding to the question. I see this has been discussed controversially, e.g. the English Wikipedia doesn't use rel=nofollow, the German Wikipedia does (guess which of those provides better quality). But this is not the point. The point is, firstly: this is a feature that almost all of the other wiki engines provide and that is requested by MoinMoin users; why are you refusing to implement it in MoinMoin (at least as an option)? Secondly: if you think nofollow is evil, why the heck do we have a wiki engine where all links on all pages are set to nofollow by the nofollow META head element? You are contradicting yourself.
- Maybe notice the fact that rel nofollow is not the same as the (much older) meta nofollow. If you have noticed that, there is no contradiction. You may blame the Google guys for that confusion. If Google is misinterpreting meta nofollow, you may blame them for that, too.
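A possible sketch of #3, again with made-up names: external links get rel="nofollow", while hosts taken from the (trusted) InterWikiMap are left alone.

    from urllib.parse import urlparse

    # Hypothetical example entries; in practice this would be filled from the InterWikiMap.
    TRUSTED_INTERWIKI_HOSTS = {"moinmoin.wikiwikiweb.de", "c2.com"}

    def rel_for(url):
        host = urlparse(url).netloc
        if not host or host in TRUSTED_INTERWIKI_HOSTS:
            return ''                   # internal or trusted InterWiki link: follow normally
        return ' rel="nofollow"'        # external link: no ranking benefit for spammers

    print(rel_for("http://spam.example.com/"))          # -> rel="nofollow"
    print(rel_for("http://moinmoin.wikiwikiweb.de/"))   # -> empty, normal link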
#4 In order to throttle down requests from a particularly aggressive searchbot, put the following into the /robots.txt file:
User-agent: msnbot
Crawl-delay: 60
which will make MSN's bot spider only one page per minute (see http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm).
- Already implemented. At least Yahoo slurp doesn't care at all.
Well, but it should, see http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html . I must make it perfectly clear that randomly locking out search bots is an abuse of the robots.txt file, because it makes no sense at all to have a public content management system closed to the public. If a wiki cannot be spidered for some reason, it's a bug in the wiki engine's software, not in that of the bot.
- If bots don't follow their own standards and DDoS a wiki by requesting like crazy, this IS a bug in the bot.
- Or the wiki engine is too weak.
Thomas, I'm through with this. Either MoinMoin allows me to make more than 4% of the contents of my wiki (currently 110 pages out of 2727 indexed) available to more than 30% of internet users (currently only Google users, not the users of Inktomi-based search engines), or I'll have to look for another wiki engine that does the job. I'm not asking too much if I just want to be found in search engine results, am I?
#5 Use the /robots.txt file, too, to exclude certain directories or file types (the latter again only for MSN) we do not need to have indexed (see http://www.robotstxt.org/wc/faq.html):
User-agent: *
Disallow: /wiki        # note images, stylesheets, and so on are stored somewhere under /wiki, too

User-agent: msnbot
Disallow: /*.exe$      # just an example
good idea. not much difference maybe, though.
Please double-check whether it isn't better to give access to /wiki after all, in case we want images and style sheets to be archived as well.
I'm afraid I have to revoke my own suggestion here: if we want to have our pages archived correctly, too, access to /wiki must be granted, as style sheets, our logo, icons, and so on reside there. So, unless this rule really saves bandwidth and machine load, please try to go on without it.
#6 In order to also reduce load and traffic caused by human users, configure your web server to send Expires HTTP headers for certain file types, e.g. (for Apache):
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType text/css "now plus 1 month"
    ExpiresByType image/png "now plus 1 week"
</IfModule>
This has more to do with server configuration than with wiki configuration, though.
- We don't use Apache for most wikis. But the idea is ok.
Another idea on server configuration: block anything related to actions at the web server level already. With Apache, you could configure mod_rewrite like this:
# block ?action= requests for these spiders
# [NC] matches case-insensitively, so action=AttachFile and action=show stay exempt
RewriteCond %{QUERY_STRING} action=[^as] [NC]
RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
RewriteCond %{HTTP_USER_AGENT} Teoma [OR]
RewriteCond %{HTTP_USER_AGENT} ia_archive
RewriteRule .* - [F,L]
This looks for robot requests containing "action=" in the query string (except for "action=a[ttachment]" and "action=s[how]"), and denies such requests at web server level. This avoids the cost of even initializing the wiki engine for such requests. We must make an exception for "attachment", because this is the way inline graphics etc. are handled. I'm aware this wiki is not running under Apache, but within a Twisted framework. But maybe there is a way to accomplish something like this (both Rewrite and Expires) with Twisted, too?
However, all these things are still manual interventions. A smarter solution would handle both attachments and actions in another way: attachments would be served directly by the web server (as they are mostly more or less static content), and actions would not use GET requests any more (see also RFC 2616, section 9.1.1: "In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval."); a rough sketch follows below. I'm using the conditional, but I remember there was something like that in the pipeline for MoinMoin version 2.0.
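A rough Python sketch of that idea, with hypothetical names (not MoinMoin's real dispatcher): state-changing actions are only accepted via POST, so crawlers, which only issue GET/HEAD, cannot trigger them at all.

    SAFE_ACTIONS = {"show", "print", "raw", "AttachFile"}   # pure retrieval

    def dispatch(method, query):
        # method: "GET" or "POST"; query: dict of query-string parameters
        action = query.get("action", "show")
        if action not in SAFE_ACTIONS and method != "POST":
            # Refusing GET here keeps bots and link prefetchers from
            # triggering edits, deletes, reverts, diff generation, etc.
            return "405 Method Not Allowed"
        return "200 OK (would run action %r)" % action

    print(dispatch("GET", {"action": "edit"}))   # -> 405 Method Not Allowed
    print(dispatch("GET", {}))                   # -> plain page view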
Thanks, -- MartinBayer 2005-07-22 22:12:01
Beyond Pagerank
There are some more reasons why the current implementation does not work, and needs to be replaced by a smarter solution:
Not all spiders will follow the links present on TitleIndex, because pages containing more than 100 links are considered a kind of link-spamming page. IOW, the TitleIndex is worth nothing for getting your whole wiki spidered.
That might be true or not (Google is rather unclear about whether this "fewer than 100 links" thing is a design principle for humans or a restriction of the bot); in any case, it is / would be insane.
- As a consequence, only a very limited number of pages will be indexed (in my case it's about 0.4%), because virtually no page except those with an external deeplink will ever be retrieved by a bot.
The current implementation creates more problems than it solves, because it won't prevent bots from fetching pages they should not fetch (think actions) via the RecentChanges page, or the TitleIndex page (think attachments), resulting in lots of 4xx errors. But the nofollow META element was introduced to prevent exactly this, wasn't it?
- This is exactly why only a few pages have meta "follow". Note that this is code from the pre-link-rel-nofollow era.
Oh, for heaven's sake, that nofollow implementation is broken! And it has never been more than a quick hack, because all actions are GET links and the wiki engine is too weak for massive load. That's why I opened this Request for Enhancement about one year ago. Go to TitleIndex, look into the HTML source code, and count how many links just to attachments you find, each of them looking like a href="/ApprovePageAction?action=AttachFile and thus unsuitable for a bot. Now go to RecentChanges, look into the HTML source code, and count how many links just to diffs (info) you find, each of them looking like a href="/MoinMoinTodo/Release_1.5.4?action=info and thus unsuitable for a bot. Just these two pages are sufficient for massive useless load, and tons of HTTP error codes for spiders. You are DDoSing yourself, yet trying to keep this implementation by randomly excluding search engines.
A side-effect of this is that search bots give up spidering after having encountered a certain number of non-200 responses. Thus, even if it weren't only for the number of links present on the TitleIndex page, the bot would still never follow all of the links from top to bottom because of the many bad links.
- Is this a guess, or where did you get that from?
Furthermore, a site with many bad links (i.e. links yielding a non-200 response) is likely to be considered a site of poor quality, which means the search bot won't come back soon and you will never get a first-class ranking in the results.
- Is this a guess, or where did you get that from? Are you maybe reading too much between the lines? Expecting too much intelligence from a bot? Maybe read server logs to see the real "intelligence".
Get a Google account. Have your wiki analysed. See how indexing gets blocked by a non-200 to 200 ratio of 2:1 (which means the spider encounters twice as many HTTP error codes as pages it is able to successfully retrieve). See how even after more than one year not even 50% of your pages have a pagerank assigned. And read the fine manual. From your server's log you can only see which data was requested, not how it was processed.
If your users prefer working on existing pages instead of creating new ones, TitleIndex seldom changes. Search bots fetch a page more often the more often it changes. This is another reason why the current implementation does not work.
As a consequence, only RecentChanges and pages showing up on that page (i.e. recently changed pages) are indexed by search engines.
Needless to say, the last point is valid only for search engines that are not locked out completely via the robots.txt directives, as Yahoo and other Inktomi-based search engines are. Thus, the situation is even worse than one could imagine from this Request for Enhancement alone.
-- MartinBayer 2006-06-20 12:13:31
I am working on this. You either have to help or wait. -- ThomasWaldmann 2006-06-21 06:49:16
- I'm helping by writing bug reports to both you and Yahoo.
Look at http://test.wikiwikiweb.de/ (rel="nofollow" implementation as it is now, main goal is avoiding unnecessary load). -- ThomasWaldmann 2006-06-21 08:08:33
You might want to create a Google sitemap account for yourself, list the test wiki, and see how it is seen by the search engine (and see if everything with rel=nofollow really isn't creating any HTTP errors any more).
Currently testing a simple google sitemap generator: http://moinmoin.wikiwikiweb.de/?action=sitemap
Charming! I had to filter this out of TitleIndex manually until now, so this will save some work.
I realize this has been implemented in the meantime. Thank you for this. I submitted it for my wiki and it seems to work: the number of pages indexed by Google has almost doubled (which means we're now close to 10% ). So at least Google users can find us (better).
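For reference, the output of such a generator boils down to something like the following; this is only a sketch with an invented page list and made-up helper name, not the actual implementation behind ?action=sitemap.

    from xml.sax.saxutils import escape

    def sitemap_xml(urls):
        # Emit a minimal sitemap: one <url> entry per page URL.
        items = "\n".join(
            "  <url><loc>%s</loc><changefreq>daily</changefreq></url>" % escape(u)
            for u in urls)
        return ('<?xml version="1.0" encoding="UTF-8"?>\n'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                '%s\n</urlset>' % items)

    print(sitemap_xml(["http://example.org/FrontPage",
                       "http://example.org/RecentChanges"]))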
Related Issues: FeatureRequests/GoogleSitemapGeneration (implemented), MoinMoinBugs/RecentChangesUsesNofollow (mostly fixed)