The layout and content of a proper robots.txt file are somewhat undefined. There is no standards body that controls it, only a weak central hub that houses remnants of its history (and several broken links). It’s odd that this de facto standard, used on essentially every web site, has so little documentation and support.
The concept of robots.txt was invented by Martijn Koster after his server was essentially DoS’ed by a rogue crawler; the design was accepted by newsgroup consensus in 1994.
There have been a few failed attempts to officially formalize and/or update the robots.txt standard, including calls for regular expressions and for indexing control (which is distinctly different from crawling control). Unfortunately, the demand isn’t there, so it appears robots.txt will stay mostly unstandardized.
Sadly, after researching all the possible ways to write it, my robots.txt remains extremely simple.
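For context, a minimal, fully permissive robots.txt (a hypothetical example, not my actual file) can be as short as two lines:

```
User-agent: *
Disallow:
```

An empty `Disallow:` value means nothing is disallowed, so every well-behaved crawler is free to fetch everything.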
Here are the best resources I’ve scraped together:
- Original Post defining robots.txt, 1994
- Draft robots.txt RFC. 1996
- Wikipedia Page
- Google’s specification of robots.txt
- Google’s robots.txt Testing Tool
“Legal” robots.txt lines
| Directive | Origin | Meaning |
| --- | --- | --- |
| `User-agent:` | 1994 Post | Limit the restrictions that follow to the named bot. |
| `Disallow:` | 1994 Post | Bots should not load any URL starting with the following text. |
| `Allow:` | 1996 Draft RFC | Bots should feel free to fetch this URL, even if it is also covered by a `Disallow:` line. |
| `Sitemap:` | sitemaps.org | URL(s) of XML sitemaps. Created by Google in 2005; officially supported by Google, Bing, Yahoo, and others. |
| `Crawl-delay:` | nonstandard | Limit successive HTML requests to one per this many seconds. Notably not supported by Google. |
| `Host:` | nonstandard | If this site is a mirror, use the `Host:` domain for indexing. |
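A quick way to see most of these directives in action is Python’s standard-library parser, `urllib.robotparser`. A sketch using a hypothetical robots.txt (note one caveat: the stdlib parser applies rules in first-match order, not the longest-match order Google’s specification describes, so the `Allow:` line is placed before the `Disallow:` it overrides):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt exercising the directives in the table above.
# The Allow: line comes first because urllib.robotparser uses first-match order.
robots_txt = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Disallowed: /private/ prefix matches.
print(parser.can_fetch("MyBot", "https://example.com/private/secret.html"))       # False
# Allowed: the Allow: rule matches before the Disallow: rule.
print(parser.can_fetch("MyBot", "https://example.com/private/public-page.html"))  # True
# Allowed: no rule matches, so the default is to permit.
print(parser.can_fetch("MyBot", "https://example.com/index.html"))                # True
print(parser.crawl_delay("MyBot"))                                                # 10
print(parser.site_maps())   # ['https://example.com/sitemap.xml'] (Python 3.8+)
```

In practice you would call `parser.set_url("https://example.com/robots.txt")` and `parser.read()` instead of `parse()`; the inline string here just keeps the example self-contained.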