February 18th, 2009

Google Responds, Love Resumes

I had fun with last week's post, but the topic was actually serious stuff regarding the way search engines behave when they arrive at web sites built with certain types of URLs. And the humorous tone of things may have made it a bit more difficult to follow exactly what I was getting at. So this week I'm going to dispense with the metaphor. My apologies to anyone who reads my posts for LOLs rather than URLs.

The salsadelphia site contains URLs like these:

http://salsadelphia.com/
http://salsadelphia.com/filtered/venue,brasils
http://salsadelphia.com/filtered/dj,clave

I want people to look at these, bookmark these, email these, etc.

The site also contains URLs like this one:

http://salsadelphia.com/filtered/venue,brasils;dj,clave

This is a double filter: it returns only results for which both criteria are true.
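
Just to make the scheme concrete, here's a minimal parsing sketch in Python. This is only my illustration of the URL format, not the site's actual code; the sole assumptions are the separators, semicolons between filters and commas between a filter's type and its value.

# Split a filter path like "venue,brasils;dj,clave" into its criteria.
def parse_filters(path):
    filters = {}
    for clause in path.split(";"):
        kind, value = clause.split(",", 1)
        filters[kind] = value
    return filters

print(parse_filters("venue,brasils;dj,clave"))
# prints {'venue': 'brasils', 'dj': 'clave'}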

Here's the thing about double, triple, quadruple, etc. filters: they are useful... to human beings. They are bookmarkable, copy-and-paste friendly, comprehensible things. I like them.

But when you multiply all the possible combinations together, you discover exactly how bad things get if Google decides to index every single one of them.
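
To put rough numbers on that, here's a back-of-the-envelope sketch in Python. The per-filter counts are invented for illustration (they are not salsadelphia's real numbers); the point is simply that combinations multiply.

from itertools import combinations
from math import prod

# Invented counts of values per filter type -- not the site's real numbers.
counts = {"venue": 20, "dj": 15, "band": 10, "day": 7}

# One URL per single filter value: these I want indexed.
singles = sum(counts.values())

# One URL per combination of values across two or more filter types.
multis = sum(
    prod(combo)
    for r in range(2, len(counts) + 1)
    for combo in combinations(counts.values(), r)
)

print(singles)  # 52
print(multis)   # 29515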

Last week I detailed all of the steps I took to stop Google from indexing these. Some wondered why I didn't use a simple robots.txt rule:

User-agent: *
Disallow: /filtered/

The answer is that, as I said above, I don't want to block access to single filters. Only to double filters (and beyond).

After some research into robots.txt files, I concluded that you still couldn't use wildcards in them, which ruled out a really effective rule like this one:

User-agent: *
Disallow: *;

Aw yeah... now THAT would be effective. Except according to most sources it doesn't work.

So I tried other methods: meta tags and nofollow attributes on links. Google treated all of these as hints about how and whether to display the information... but still gathered the information. At a ferocious, server-crushing pace. No love there.
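
For reference, those hints were roughly along these lines (paraphrased rather than lifted from my actual templates):

<meta name="robots" content="noindex,nofollow">
<a href="/filtered/venue,brasils;dj,clave" rel="nofollow">Brasils + DJ Clave</a>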

Finally I used Apache's mod_rewrite module to stop the madness by sending Googlebot to the home page of the site if it attempted to access a double filter:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule \; http://salsadelphia.com/ [L,R]

That worked. Google was effectively chased off, since Googlebot knew it had already seen my home page recently.
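
If you want to sanity-check a rule like that yourself, you can impersonate Googlebot from the command line and inspect the response headers; with the rule above, a double-filter URL should come back as a 302 pointing at the home page:

curl -I -A "Googlebot" "http://salsadelphia.com/filtered/venue,brasils;dj,clave"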

After my post, I received a helpful comment from JohnMu, a webmaster trends analyst at Google. JohnMu pointed out that Googlebot does in fact support wildcards in robots.txt files, along with a number of other extensions to the robots exclusion protocol, aka the Robot Exclusion Standard, aka robots.txt, as detailed in this Google Webmaster Blog post.

Why didn't I find this post the first time I went looking for it? Partly because Google's term "Robots Exclusion Protocol" is not what people were calling robots.txt when I first learned about it in the nineties, and partly because many other sites still don't call it that: they call it the Robot Exclusion Standard. And possibly also because the information is buried in a blog entry rather than, let's say, a nice big fat "Google robots.txt Robot Exclusion Standard extensions" page. I suggest that Google create such a page rather than relying on a blog entry to document something webmasters depend on... although I am definitely pleased that they implemented these much-needed extensions to robots.txt. It's long overdue.
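
With the wildcard extension in play, the rule I wanted all along becomes expressible. Something roughly like this (scoping it to Googlebot is my own choice, since other crawlers may not understand the wildcard):

User-agent: Googlebot
Disallow: /*;

Per Google's extensions, the * matches any sequence of characters, so this disallows any URL whose path contains a semicolon: double filters and beyond, while leaving the single filters crawlable.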

So: how well does it work? Well, I'd like to tell you. But before JohnMu commented, another Google staffer helpfully inserted a regular expression rule specifically for salsadelphia.com via Googlebot's backend console. Wow. That was greatly appreciated. But since it was highly effective in squashing any accesses by Googlebot to double filters and above, I don't really know whether my new robots.txt rule would also do that.

But I should be able to learn something by monitoring the behavior of the MSN Search spider, which also allegedly supports wildcards these days. I'll update this post tomorrow when MSN crawls past again.
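
The monitoring itself is nothing fancy: scan the access log for msnbot requests that hit semicolon URLs. A minimal sketch in Python, assuming a standard Apache combined log (the path is made up):

# Count msnbot hits on multi-filter (semicolon) URLs in an Apache combined log.
hits = 0
with open("/var/log/apache2/access.log") as log:
    for line in log:
        parts = line.split('"')          # in combined format, the request is quoted
        request = parts[1] if len(parts) > 1 else ""
        # semicolons sometimes arrive percent-encoded as %3B
        if "msnbot" in line.lower() and (";" in request or "%3B" in request.upper()):
            hits += 1
print(hits, "msnbot requests for multi-filter URLs")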

I'd like to thank JohnMu and Ben C. of Google for their assistance. This has been an entertaining saga.