Quote:
|
So, even for several search engine agents, crawlers, spiders, ants or whatever they are called, we aren't allowed to use more than one robots.txt.
|
But you can make your robots.txt disallow different URLs for different spiders - see the protocol page:
The Web Robots Pages
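Here's a rough sketch of what "one robots.txt, different rules per spider" looks like, checked with Python's standard urllib.robotparser. The bot names (ExampleBot, OtherBot) and the paths are made up for illustration, they aren't from any real site:

```python
from urllib.robotparser import RobotFileParser

# One robots.txt, three groups. Bot names and paths are hypothetical.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: OtherBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def can_fetch(agent, path):
    """Ask the parser whether `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("ExampleBot", "/private/page.html"),
        ("ExampleBot", "/cgi-bin/form"),
        ("OtherBot", "/index.html"),
        ("SomeCrawler", "/cgi-bin/form"),
    ]:
        print(agent, path, can_fetch(agent, path))
```

Note that a robot which finds a group with its own name uses only that group, so ExampleBot here is not bound by the `*` rules.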
The robots exclusion standard is voluntary, so bad bots may ignore your robots.txt - many honeypots use this behaviour as a trap. NST have got one:
http://www.nst.com.my/robots.txt
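The trap idea can be sketched roughly like this: you list a path under Disallow that no page links to for humans, then flag any client that requests it anyway. The /trap/ prefix and the (client, path) log format are my assumptions, not how NST or anyone else actually does it:

```python
# Hypothetical path that robots.txt lists under Disallow and nothing else advertises.
TRAP_PREFIX = "/trap/"


def find_bad_bots(requests):
    """Return the set of clients that requested a URL under the trap prefix.

    `requests` is an iterable of (client_address, path) pairs, for example
    pulled out of a web server access log (the format is an assumption).
    """
    return {client for client, path in requests if path.startswith(TRAP_PREFIX)}


if __name__ == "__main__":
    log = [
        ("192.0.2.1", "/index.html"),
        ("192.0.2.2", "/trap/secret.html"),
        ("192.0.2.1", "/about.html"),
    ]
    print(find_bad_bots(log))
```

A well-behaved robot never shows up in the result, because it read the Disallow line and stayed away.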
Anybody can see your robots.txt too, so even if robots aren't crawling your 'atmpinnumbersreminder.html', people can see you've blocked it.
WMM had a very odd robots.txt, full of all sorts of cruft; maybe they rewrite the URL to something else for a non-robot user-agent. They seem to have changed their policy since I started previewing a reply to this thread last night. I've given up trying to get the same enormous text response as last time, just in case you're watching your log, admin!
Google must be crawled by more robots, but theirs doesn't contain any cruft:
http://www.google.com/robots.txt
- and they don't discriminate between robots, but they do stop them crawling a lot of stuff.
Any robot can look at anything at exabytes.com.my:
http://www.exabytes.com.my/robots.txt
The government says it doesn't have one:
http://www.gov.my/robots.txt
The CIA says it's secret, so you have to use https, but then it doesn't give it to you when you try. Maybe you can crawl their site, but then they have to kill your robot.
Finally! This is what I was looking for - a robots.txt that has different stuff for different bots:
http://en.wikipedia.org/robots.txt
I hope that's useful.
Oh yeah - one last thing: when robots come back time and time again for the same dynamically-generated content, only with different URL args, your visitors get funny results from their searches, and the robot fetches many times more pages from your site than it needs to. Some robots understand wildcard patterns (a limited form of regular expression) in their rules, so you can stop them following form targets, for example:
http://lolyco.com/robots.txt
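To show how a robot might apply that kind of rule, here's a minimal sketch that turns a wildcard Disallow pattern into a Python regex. It assumes the common extension where `*` matches any run of characters and a trailing `$` anchors the end of the URL; not every robot supports it, and the function names and example patterns are mine:

```python
import re


def pattern_to_regex(pattern):
    """Compile a wildcard robots.txt path pattern into a regex.

    Assumes '*' means "any characters" and a trailing '$' means "end of URL".
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """True if `path` matches any pattern, matching from the start of the path."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


if __name__ == "__main__":
    rules = ["/*?", "/*.pdf$"]  # hypothetical: block query strings and PDFs
    for path in ["/search?q=robots", "/about.html", "/report.pdf"]:
        print(path, is_blocked(path, rules))
```

A rule like `/*?` is what keeps a robot away from every URL with arguments, which is the repeated dynamic content problem described above.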
I hope that's useful!