The layout and content of a proper robots.txt file are somewhat undefined. There is no standards body that controls it, only a weak central hub that houses remnants of its history (and several broken links). It’s odd that this de facto standard, used on essentially every web site, has so little documentation and support.
The concept of robots.txt was invented by Martijn Koster after his server was essentially DoS’ed by a rogue crawler; the design was accepted by newsgroup consensus in 1994.
There have been a few failed attempts to officially formalize and/or update the robots.txt standard, including calls for regular expressions and for indexing control (which is distinctly different from crawling control). Unfortunately, the demand isn’t there, so it appears robots.txt will stay mostly unstandardized.
Sadly, after researching all the possible ways to write it, my robots.txt remains extremely simple.
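For context, a minimal, fully permissive robots.txt (a hypothetical example, not my actual file) can be as short as two lines:

```
User-agent: *
Disallow:
```

An empty `Disallow:` value means nothing is disallowed, so every well-behaved crawler is free to fetch everything.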
Here are the best resources I’ve scraped together:
- Original Post defining robots.txt, 1994
- Draft robots.txt RFC. 1996
- Wikipedia Page
- Google’s specification of robots.txt
- Google’s robots.txt Testing Tool
“Legal” robots.txt lines
| Directive | Origin | Meaning |
| --- | --- | --- |
| `User-agent:` | 1994 Post | Limit the restrictions that follow to the named bot. |
| `Disallow:` | 1994 Post | Bots should not load any URL starting with the following text. |
| `Allow:` | 1996 Draft RFC | Bots should feel free to fetch this URL, even if it is also covered by a `Disallow:` line. |
| `Sitemap:` | sitemaps.org | URL(s) of XML sitemaps. Created by Google in 2005; officially supported by Google, Bing, Yahoo, and others. |
| `Crawl-delay:` | nonstandard | Limit successive HTML requests to one per this many seconds. Notably not supported by Google. |
| `Host:` | nonstandard | If this site is a mirror, use the `Host:` domain for indexing. |
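A quick way to see most of these directives in action is Python’s standard-library parser, `urllib.robotparser`. A sketch using a hypothetical robots.txt (note one caveat: the stdlib parser applies rules in first-match order, not the longest-match order Google’s specification describes, so the `Allow:` line is placed before the `Disallow:` it overrides):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt exercising the directives in the table above.
# The Allow: line comes first because urllib.robotparser uses first-match order.
robots_txt = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Disallowed: /private/ prefix matches.
print(parser.can_fetch("MyBot", "https://example.com/private/secret.html"))       # False
# Allowed: the Allow: rule matches before the Disallow: rule.
print(parser.can_fetch("MyBot", "https://example.com/private/public-page.html"))  # True
# Allowed: no rule matches, so the default is to permit.
print(parser.can_fetch("MyBot", "https://example.com/index.html"))                # True
print(parser.crawl_delay("MyBot"))                                                # 10
print(parser.site_maps())   # ['https://example.com/sitemap.xml'] (Python 3.8+)
```

In practice you would call `parser.set_url("https://example.com/robots.txt")` and `parser.read()` instead of `parse()`; the inline string here just keeps the example self-contained.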