Seanie, 10-06-2008, 10:47 PM
Quote:
So, even for several search engine agents, crawlers, spiders, ants or whatever they are called, we aren't allowed to use more than one robots.txt.
But you can make your robots.txt disallow different URLs for different spiders - see the protocol page:
The Web Robots Pages
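
For what it's worth, here's a rough sketch of per-spider rules, checked with Python's standard urllib.robotparser. The agent names (ExampleBot, OtherBot) and all the paths are made up for illustration; they aren't taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: agent names and paths are invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: OtherBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

def allowed(agent, path):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)

print(allowed("ExampleBot", "/private/notes.html"))  # False: blocked for ExampleBot
print(allowed("ExampleBot", "/cgi-bin/search"))      # True: its own section wins over '*'
print(allowed("OtherBot", "/index.html"))            # False: OtherBot is shut out entirely
print(allowed("SomeCrawler", "/cgi-bin/search"))     # False: falls through to the '*' section
```

Note that a robot with its own section only obeys that section, so ExampleBot isn't bound by the '*' rules here.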
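The robots exclusion standard is voluntary, so bad bots may ignore your robots.txt. Many honeypots use this behaviour as a trap: they disallow a URL that no well-behaved robot should fetch, and anything that requests it anyway has given itself away. NST have got one: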
http://www.nst.com.my/robots.txt
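
A minimal sketch of the trap idea, assuming you've disallowed a made-up /trap/ path in robots.txt and can pull (client, path) pairs out of your access log. The path, the addresses and the function name are all hypothetical, and I'm not claiming this is how NST do it.

```python
def find_rule_breakers(requests, trap_prefix="/trap/"):
    """Return the clients that requested anything under the disallowed trap path."""
    return {client for client, path in requests if path.startswith(trap_prefix)}

# Invented log entries, using documentation-only IP addresses.
sample_log = [
    ("192.0.2.10", "/index.html"),
    ("198.51.100.7", "/trap/secret.html"),
    ("192.0.2.10", "/robots.txt"),
    ("203.0.113.5", "/trap/"),
]

print(sorted(find_rule_breakers(sample_log)))  # ['198.51.100.7', '203.0.113.5']
```

A human who reads robots.txt out of curiosity would show up here too, so treat a hit as a hint rather than proof.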

Anybody can see your robots.txt too, so even if robots aren't crawling your 'atmpinnumbersreminder.html', people can see you've blocked it.
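
To show how little effort that takes, here's a small sketch that lists every Disallow path in a robots.txt. The sample file is invented (apart from the joke filename above), and the parsing is deliberately simple.

```python
def disallowed_paths(robots_txt):
    """Collect the non-empty Disallow values from robots.txt text."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

# Hypothetical robots.txt content.
sample = """\
User-agent: *
Disallow: /atmpinnumbersreminder.html  # oops
Disallow: /admin/
"""

print(disallowed_paths(sample))  # ['/atmpinnumbersreminder.html', '/admin/']
```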

WMM had a very odd robots.txt, full of all sorts of cruft. Maybe they rewrote the URL to something else for a non-robot user-agent. They seem to have changed their policy since I started previewing a reply to this thread last night. I've given up trying to get the same enormous text response as last time, just in case you're watching your log, admin!

Google must be crawled by more robots, but theirs doesn't contain any cruft:
http://www.google.com/robots.txt
- and they don't discriminate between robots, but they do stop them crawling a lot of stuff.

Any robot can look at anything at exabytes.com.my:
http://www.exabytes.com.my/robots.txt
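
An "anything goes" robots.txt is usually just an empty Disallow, which is easy to confuse with "Disallow: /" (block everything). The two strings below are my own examples, not a copy of the exabytes file, and the check uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Assumed examples of the two extremes, not copied from any real site.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

def can_crawl(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

print(can_crawl(ALLOW_ALL, "AnyBot", "/some/page.html"))  # True
print(can_crawl(BLOCK_ALL, "AnyBot", "/some/page.html"))  # False
```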

The government says it doesn't have one:
http://www.gov.my/robots.txt

The CIA says it's secret, so you have to use https, but then it doesn't give it to you when you try. Maybe you can crawl their site, but then they have to kill your robot.

Finally! This is what I was looking for - a robots.txt that has different stuff for different bots:
http://en.wikipedia.org/robots.txt

I hope that's useful.

Oh yeah - one last thing: when robots come back time and time again for the same dynamically-generated content, only with different URL args, searchers get funny results and the robot fetches many times more pages from your site than it needs to. Some robots understand pattern rules in Disallow lines (wildcards like '*' and a trailing '$', rather than full regular expressions), so you can stop them following form targets, for example:

http://lolyco.com/robots.txt
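
Here's a rough sketch of how a robot that supports those patterns might match them, assuming '*' means "any run of characters" and a trailing '$' means "end of URL". The rules and paths are invented, not copied from the file above, and the sketch ignores Allow lines and rule precedence.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$') into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' where each '*' was.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(path, disallow_rules):
    """Return True if any Disallow pattern matches the start of `path`."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

# Hypothetical rules: block any URL with a query string, and any URL ending in .cgi.
rules = ["/*?", "/*.cgi$"]

print(is_blocked("/list.php?sort=asc", rules))  # True: contains a query string
print(is_blocked("/list.php", rules))           # False: plain page, no args
print(is_blocked("/run.cgi", rules))            # True: ends in .cgi
```

Python's own urllib.robotparser only does plain prefix matching, which is why the matching here is done by hand.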

I hope that's useful!