February 13th, 2009

Google, I love you but you're killing the vibe

Once upon a time there was a web site. And the web site offered filters. And lo, the filters were good.

They let you browse for events at Brasils Nightclub. Or classes featuring Darlin Garcia. Or DJ nights within the city limits. Or any combination thereof.

And this was awesome. Until Googlebot arrived.

Googlebot, I love you. I really do. Hot, burning, profitable, whuffie-ridden love. But when you multiply all of the event types by all of the locations by all of the venues by all of the studios by all of the instructors by all of the genres...

Hoo boy, that's a lot of URLs. And Googlebot wanted to index all of them.
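
To put a rough number on "a lot", here is a back-of-the-envelope sketch. The facet sizes are invented for illustration (the real counts for the site are not stated anywhere in this post), and it assumes each facet is either left unset or set to exactly one value.

```python
from math import prod

# Hypothetical facet sizes: placeholders, not the site's real numbers.
FACET_SIZES = {
    "type": 6,
    "location": 10,
    "venue": 40,
    "studio": 15,
    "instructor": 30,
    "genre": 8,
}


def count_filter_urls(facet_sizes):
    """Count distinct filter URLs when each facet is unset or set to one value."""
    # Each facet contributes (n + 1) choices: one of its n values, or "no filter".
    # Subtract 1 to exclude the single URL that has no filters at all.
    return prod(n + 1 for n in facet_sizes.values()) - 1


def count_single_filter_urls(facet_sizes):
    """Count URLs that use exactly one filter (the ones worth indexing)."""
    return sum(facet_sizes.values())


if __name__ == "__main__":
    print("all combinations:", count_filter_urls(FACET_SIZES))
    print("single-filter pages:", count_single_filter_urls(FACET_SIZES))
```

The single-filter pages grow with the sum of the facet sizes, while the combinations grow with their product, which is why only the former are worth handing to a crawler.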

Googlebot, I tried to break it to you gently. I whispered little suggestions in your ear. Suggestions like:

  <meta name="robots" content="noindex, nofollow" />

...in pages with URLs that contain more than one filter. That way Google could index the listings for Brasils Nightclub but not every combination of Brasils and everything else. I want you to index me, Googlebot, I just don't want you to index me to death. I do have other interests, you know.
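
The "more than one filter" test might look something like the sketch below. It assumes the URL shape that appears elsewhere in this post (`/filtered/name,value;name,value`); the function names are hypothetical and this is not the site's actual code.

```python
NOINDEX_META = '<meta name="robots" content="noindex, nofollow" />'
FILTER_PREFIX = "/filtered/"


def parse_filters(path):
    """Split '/filtered/type,basic_class;venue,brasils' into (name, value) pairs."""
    if not path.startswith(FILTER_PREFIX):
        return []
    spec = path[len(FILTER_PREFIX):]
    pairs = []
    for chunk in spec.split(";"):
        if not chunk:
            continue  # tolerate a stray leading or trailing semicolon
        name, _, value = chunk.partition(",")
        pairs.append((name, value))
    return pairs


def robots_meta_for(path):
    """Return the noindex tag for multi-filter pages, or '' for everything else."""
    return NOINDEX_META if len(parse_filters(path)) > 1 else ""


if __name__ == "__main__":
    for p in ("/filtered/venue,brasils",
              "/filtered/type,intermediate_class;venue,brasils"):
        print(p, "->", robots_meta_for(p) or "(indexable)")
```

A template would drop the returned string into the page head, so single-filter listings stay indexable and every combination page carries the hint.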

But Google said "ohhh, you don't want people to KNOW I followed that link! Okay. It'll be our secret. I'm still going to follow it though. Because I LURV U."

TOM: [Pained expression]

But Googlebot is so good to me, I couldn't part ways with it lightly. I tried again:

  <a rel="nofollow" href="http://salsadelphia.com/filtered/type,intermediate_class;venue,brasils">Tougher Classes</a>

But the more I played hard to get, the more ardent Googlebot's love became. Googlebot honored my wishes in a sense (it didn't kiss and tell), but it sure wouldn't back off either. It became difficult to hold a conversation with anybody else.

My processor load was spiking. If you know what I mean. And I think you do.

Finally I turned to an old friend famous for her bluntness: Apache's mod_rewrite module. And I begged mod_rewrite to do what I could not: cut Googlebot off at the knees.

And she obliged with panache:

  # Google has been ignoring my subtle hints not to
  # beat the crap out of the database by indexing
  # multiple filter combos. So be blunt about it and
  # explicitly redirect all attempts to access a
  # double filter (or worse) to the home page if they
  # come from Googlebot.

  # Filters are separated by semicolons, so this ain't hard.

  RewriteCond %{HTTP_USER_AGENT} Googlebot
  RewriteRule \; http://salsadelphia.com/ [L,R]
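
For anyone who doesn't speak mod_rewrite, here is roughly what that condition and rule decide, sketched in Python. It only models the decision and is not how Apache implements it; the `route` function is a hypothetical name. Both patterns are unanchored, case-sensitive regular expressions, and an `[R]` flag with no status code issues a 302.

```python
import re

HOME_PAGE = "http://salsadelphia.com/"

UA_PATTERN = re.compile(r"Googlebot")  # RewriteCond %{HTTP_USER_AGENT} Googlebot
PATH_PATTERN = re.compile(r";")        # RewriteRule \;


def route(path, user_agent):
    """Return (status, location) for a redirect, or None to serve the page normally."""
    if UA_PATTERN.search(user_agent) and PATH_PATTERN.search(path):
        # [R] without an explicit code means a 302 temporary redirect.
        return (302, HOME_PAGE)
    return None


if __name__ == "__main__":
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(route("/filtered/type,intermediate_class;venue,brasils", bot))
    print(route("/filtered/venue,brasils", bot))
```

Human visitors never match the user-agent condition, so they can still stack up as many filters as they like.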


Did Googlebot take the hint? Well... sort of. It's still banging the gong for me. But mod_rewrite tirelessly deflects Googlebot's passion in a more appropriate direction:

66.249.72.65 - - [13/Feb/2009:08:55:06 -0600] "GET /filtered/genre,rueda;instructor,victor_colon;studio,prince_of_salsa;type,basic_class;venue,family_tavern HTTP/1.1" 302 302 www.salsadelphia.com "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

That's right, Googlebot: I totally freakin' heart you, but if you push me too far you're gonna be looking at my home page like everyone else.

Because my home page is cached for mass consumption. It's something I can afford to give. If you know what I mean. And I think you do.

Edit: nope, robots.txt would not help with this issue. I remembered this from the Old Days, but thought I'd check in and make sure it's still the case. It certainly seems to be: "note also that globbing and regular expression are not supported in either the User-agent or Disallow lines... you cannot have lines like 'User-agent: *bot*', 'Disallow: /tmp/*' or 'Disallow: *.gif'."