Robots.txt is over 25 years old, which is almost the same as forever when it comes to the internet. However, Google’s recent announcement that it wants to turn it into an official standard might be the first time some people are hearing of it. So, let us explain further.
Robots.txt, formally known as the Robots Exclusion Protocol (REP), is a set of directives that tells search engine crawlers which parts of your website they may access. The concept dates back to 1994, when a webmaster by the name of Martijn Koster felt his website was being overrun with crawler traffic. However, it was only when other webmasters began to join in that robots.txt truly gained recognition. Soon after, internet search engines adopted it as a de facto standard.
Nowadays, robots.txt is a fundamental part of search engine optimisation (SEO). Search engine crawlers, such as Googlebot and Bingbot, scour the internet for websites. When they find a site, its robots.txt file tells them which pages they may crawl and which they should skip, which in turn affects whether and how those pages are listed. Having the right robots.txt directives is a critical part of a website’s SEO and ranking.
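To make the crawler side of this concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page. It uses Python's standard-library `urllib.robotparser`, which implements the traditional rules rather than Google's draft. The file contents, the `example.com` URLs and the `SomeOtherBot` name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt text into the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot has its own group, so only the /drafts/ rule applies to it.
googlebot_drafts = parser.can_fetch("Googlebot", "https://example.com/drafts/post")
googlebot_private = parser.can_fetch("Googlebot", "https://example.com/private/page")

# Any other crawler falls back to the catch-all "*" group.
other_private = parser.can_fetch("SomeOtherBot", "https://example.com/private/page")

print(googlebot_drafts, googlebot_private, other_private)
```

Note that a crawler with its own group ignores the catch-all group entirely, which is why Googlebot is allowed into `/private/` in this sketch even though other crawlers are not.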
That said, robots.txt has only ever had a ‘sort of’ standard attached to its name, and that has splintered into several different versions. Over the years, developers have haphazardly added and removed elements from the original version. As a result, webmasters have to get to grips with multiple robots.txt standards, which can be a pain. That’s why Google wants to simplify things by establishing a single standard.
As Google owns one of the world’s largest search engines, this move will practically make its version of robots.txt the standard across the internet. Google probably knows this and, as a result, published a set of rules for webmasters and developers to follow alongside the launch of the new standard. This means that webmasters who fail to apply the rules risk falling out of Google’s good graces and ranking further down its search results.
These new rules include, to name just a few: no longer accepting typos or variations of commands in the <field> element, e.g. “useragent” instead of “user-agent”; changes to how Google deals with ‘redirect hops’; a size limit of 500 KiB on content; the renaming of ‘records’ to ‘lines’ or ‘rules’; and, most notably, support for all URL protocols rather than HTTP alone. In addition, Google says it expects the file to be plain text encoded in UTF-8, consisting of lines separated by CR, CR/LF, or LF. Each line follows the format <field>:<value><#optional-comment>. Note that whitespace at the start and at the end of a line is ignored.
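The line format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's actual parser: the `parse_robots_lines` function is hypothetical, the set of recognised fields is our own guess for the example, and we assume that content beyond the 500 KiB limit is simply dropped.

```python
import re

MAX_BYTES = 500 * 1024  # the 500 KiB size limit mentioned above

# Assumption: the fields recognised in this sketch. Anything else,
# including typos such as "useragent", is skipped.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}


def parse_robots_lines(raw: bytes) -> list[tuple[str, str]]:
    """Return (field, value) pairs from raw robots.txt bytes."""
    # Assumption: bytes past the size limit are ignored.
    content = raw[:MAX_BYTES].decode("utf-8", errors="replace")
    rules = []
    # Lines may be separated by CR/LF, CR, or LF.
    for line in re.split(r"\r\n|\r|\n", content):
        # Drop the optional "#" comment, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank or malformed line
        field, value = line.split(":", 1)
        field = field.strip().lower()
        if field not in KNOWN_FIELDS:
            continue  # unknown or misspelt field
        rules.append((field, value.strip()))
    return rules


sample = (
    b"  User-agent: *  # all crawlers\r\n"
    b"useragent: typo-bot\r"
    b"Disallow: /private/\n"
    b"Sitemap: https://example.com/sitemap.xml\n"
)
parsed = parse_robots_lines(sample)
print(parsed)
```

In this sample, the misspelt “useragent” line is silently skipped while the other three lines are kept, and all three kinds of line ending are handled. Because the sketch splits each line on the first colon only, the sitemap URL keeps its own “https:” prefix intact.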
The new robots.txt is still in its draft stage, with Google working with developers to refine the standard. The hope is that these developments will lead to simpler and better ways of working that meet the modern needs of websites and their owners. The internet has changed greatly since 1994, so it makes sense that robots.txt follows suit.
Here at Wiredelta, we understand the importance of robots.txt and other SEO tools for improving your website’s ranking on search engines. If you need help with your new website and you are not sure where to start, why not reach out to us? In addition, we have plenty of articles about web and digital marketing, so feel free to sign up for our newsletter to keep yourself updated about these topics and other developments in tech.