The use of a robots.txt file has long been debated among webmasters: it can be a powerful tool when it is well written, but you can just as easily shoot yourself in the foot with it. Unlike other SEO concepts that are more abstract and for which we don’t have clear guidelines, the robots.txt file is completely documented by Google and other search engines.
You need a robots.txt file only if you have certain portions of your website that you don’t want to be indexed and/or you need to block or manage various crawlers.
*Thanks to Richard for the correction to the text above (check the comments for more info).
What’s important to understand about the robots file is that it isn’t a law that crawlers must obey; it’s more of a signpost with a few indications. Compliance with those guidelines can lead to faster and better indexation by the search engines, while mistakes that hide important content from the crawlers will eventually lead to a loss of traffic and indexation problems.
We’re sure most of you are familiar with robots.txt by now, but just in case you heard about it a while ago and have since forgotten about it: the Robots Exclusion Standard, as it’s formally known, is the way a website communicates with web crawlers and other web robots. It’s basically a text file containing short instructions that direct the crawlers to or away from certain parts of the website. Robots are usually programmed to look for this document when they reach a website and to obey its directives. Some robots do not comply with this standard, like email harvesters, spambots or malware robots that don’t have the best intentions when they reach your website.
It all started in early 1994, when Martijn Koster proposed the standard after a badly behaved web crawler effectively caused a denial of service on his server. The standard was created to guide web crawlers and block them from reaching certain areas. Since then, the robots file has evolved, contains additional information and has a few more uses, but we’ll get to that later on.
How Important Is Robots.txt for Your Website?
To get a better understanding of it, think of robots.txt as a tour guide for crawlers and bots. It takes the non-human visitors to the amazing areas of the site where the content is and shows them what should and should not be indexed. All of this is done with the help of a few lines in a plain text file. Having an experienced robot guide can increase the speed at which the website is indexed, cutting the time robots spend going through lines of code to find the content the users are looking for in the SERPs.
Over time, more information has been added to the robots file to help webmasters get their websites crawled and indexed faster.
Nowadays most robots.txt files include the sitemap.xml address, which increases the crawl speed of bots. We managed to find robots files containing job recruitment ads, messages that hurt people’s feelings and even instructions to educate robots for when they become self-aware.
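To illustrate the sitemap line, here is a minimal sketch that parses a small, made-up robots.txt with Python’s standard urllib.robotparser module and reads back the sitemap address. The file contents, the example.com domain and the list_sitemaps helper are assumptions for demonstration only, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; example.com is a placeholder domain.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
"""

def list_sitemaps(robots_text):
    """Return the sitemap URLs declared in a robots.txt body (may be empty)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when the file has no Sitemap line.
    return parser.site_maps() or []

print(list_sitemaps(SAMPLE_ROBOTS))
```

The Sitemap line is independent of the User-agent groups, so it can sit anywhere in the file.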
Keep in mind that even though the robots file is strictly for robots, it’s still publicly available to anyone who adds /robots.txt to your domain. When you try to hide private information from the search engines, you reveal its URL to anyone who opens the robots file.
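A minimal sketch of how easily anyone can locate the file: given any page address, the robots.txt URL is just the site root plus /robots.txt. The robots_url helper and the example.com address are hypothetical names used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Build the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt sits at the root of the scheme and host; path and query are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/some-post?ref=home"))
```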
How to Validate Your Robots.txt
The first thing to do once you have your robots file is to make sure it is well written and to check it for errors. One mistake here can and will cause you a lot of harm, so after you’ve completed the robots.txt file, take extra care in checking it for mistakes. Most search engines provide their own tools to check robots.txt files and even allow you to see how the crawlers see your website.
Google’s Webmaster Tools offers the robots.txt Tester, a tool which scans and analyzes your file. As you can see in the image below, you can use the GWT robots tester to check each line and see each crawler and what access it has on your website. The tool displays the date and time the Googlebot fetched the robots file from your website, the HTML code encountered, as well as the areas and URLs it didn’t have access to. Any errors found by the tester need to be fixed, since they could lead to indexation problems for your website and your site might not appear in the SERPs.
The tool provided by Bing shows you the data as seen by the Bingbot. Fetching as the Bingbot even shows your HTTP headers and page source as the Bingbot sees them. This is a great way to find out if your content is actually seen by the crawler and not hidden by some mistake in the robots.txt file. Moreover, you can test each link by adding it manually, and if the tester finds any problems with it, it will display the line in your robots file that blocks it.
Remember to take your time and carefully validate each line of your robots file. This is the first step in creating a well written robots file, and with the tools at your disposal you really have to try hard to make any mistakes here. Most of the search engines provide a “fetch as *bot” option so after you’ve inspected the robots.txt file by yourself, be sure to run it through the automatic testers provided.
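Before running the file through the search engines’ testers, a quick local check can catch obvious typos. The sketch below is a rough illustration, not a replacement for those tools: the lint_robots helper is hypothetical, and the set of directives it accepts is an assumption based on the ones mentioned in this article.

```python
# Directives this sketch treats as valid; an assumption, not an official list.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text):
    """Return a list of (line_number, message) pairs for suspicious lines."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        name, _value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{name}'"))
        elif name == "user-agent":
            seen_user_agent = True
        elif name in {"allow", "disallow"} and not seen_user_agent:
            problems.append((number, f"'{name}' appears before any User-agent line"))
    return problems

# A misspelled "Disallow" is reported with its line number.
print(lint_robots("User-agent: *\nDisalow: /tmp/\n"))
```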
Be Sure You Do Not Exclude Important Pages from Google’s Index
Having a validated robots.txt file is not enough to ensure that you have a great robots file. We can’t stress this enough: a single line in your robots file that blocks an important part of your site from being crawled can harm you. So in order to make sure you do not exclude important pages from Google’s index, you can use the same tools that you used for validating the robots.txt file.
Fetch the website as the bot and navigate it to make sure you haven’t excluded important content.
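The same check can be roughly approximated offline. The sketch below uses Python’s standard urllib.robotparser to list which of your important URLs a given crawler would be denied. The blocked_pages helper, the sample rules and the example.com URLs are hypothetical, and this module does plain prefix matching, so it may not interpret wildcards the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser

def blocked_pages(robots_text, user_agent, urls):
    """Return the URLs from `urls` that the rules deny to `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Hypothetical rules that accidentally block the blog.
rules = "User-agent: *\nDisallow: /private/\nDisallow: /blog/\n"
important = [
    "https://example.com/",
    "https://example.com/blog/robots-guide",
    "https://example.com/products/",
]
print(blocked_pages(rules, "Googlebot", important))
```

Any URL that shows up in the result is worth a second look before the file goes live.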
Before inserting pages to be excluded from the eyes of the bots, make sure they are on the following list of items that hold little to no value for search engines:
- Code and script pages
- Private pages
- Temporary pages
- Any page you believe holds no value for the user.
What we recommend is that you have a clear plan and vision when creating the website’s architecture to make it easier for you to disallow the folders that hold no value for the search crawlers.
How to Track Unauthorized Changes in Your Robots.txt
Everything is in place now: the robots.txt file is completed and validated, and you have made sure that there are no errors and no important pages excluded from Google’s crawling. The next step is to make sure that nobody makes any changes to the document without you knowing it. It’s not only about changes to the file; you also need to be aware of any errors that appear while the robots.txt document is in use.
1. Change Detection Notifications – Free Tool
The first tool we want to recommend is changedetection.com. This useful tool tracks any changes made to a page and automatically sends an email when it discovers one. The first thing you have to do is enter the robots.txt address and the email address where you want to be notified. The next step lets you customize your notifications: you can change their frequency and set alerts to fire only if certain keywords in the file have changed.
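The idea behind change detection can be sketched in a few lines: store a fingerprint of the file and compare it with a fresh copy later. This is only an illustration of the principle, not how changedetection.com works; the fingerprint and has_changed helpers are hypothetical, and fetching the file and sending the email are left out.

```python
import hashlib

def fingerprint(robots_text):
    """Return a SHA-256 hex digest that identifies this exact file content."""
    return hashlib.sha256(robots_text.encode("utf-8")).hexdigest()

def has_changed(previous_fingerprint, current_text):
    """True when the current file no longer matches the stored fingerprint."""
    return fingerprint(current_text) != previous_fingerprint

# Store the fingerprint of the approved version...
approved = "User-agent: *\nDisallow: /tmp/\n"
stored = fingerprint(approved)

# ...and compare it with whatever the site serves later.
tampered = "User-agent: *\nDisallow: /\n"
print(has_changed(stored, approved))
print(has_changed(stored, tampered))
```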
2. Google Webmaster Tools Notifications
Google Webmaster Tools provides an additional alert tool. The difference made by using this tool is that it works by sending you notifications of any error in your code each time a crawler reaches your website. Robots.txt errors are also tracked and you will receive an email each time an issue appears. Here is an in-depth usage guide for setting up the Google Webmaster Tools Alerts.
3. HTTP Error Notifications – Free & Paid Tool
In order to not shoot yourself in the foot when making a robots.txt file, only these HTTP status codes should be returned for it.
The 200 code, which basically means that the file was found and read;
The 403 and 404 codes, which mean that access to the file was forbidden (403) or that the file was not found (404); in both cases the bots will act as if you have no robots.txt file. This will cause the bots to crawl your whole website and index it accordingly.
The SiteUptime tool periodically checks your robots.txt URL and is able to instantly notify you if it encounters unwanted errors. The critical error you want to keep track of is the 503 one.
A 503 error indicates that there is an error on the server side and if a robot encounters it, your website will not be crawled at all.
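The status codes described above can be summarized in a small lookup, shown here as a sketch. The crawler_reaction and needs_alert helpers are hypothetical, and the reactions simply restate this article’s description rather than any search engine’s documented behavior.

```python
def crawler_reaction(status_code):
    """Describe how a well-behaved bot reacts to the robots.txt status code."""
    if status_code == 200:
        return "read the file and follow its rules"
    if status_code in (403, 404):
        return "assume there is no robots.txt and crawl everything"
    if status_code == 503:
        return "do not crawl the site"
    return "unexpected status: investigate"

def needs_alert(status_code):
    """True for any status other than the three expected ones (200, 403, 404)."""
    return status_code not in (200, 403, 404)

for code in (200, 404, 503):
    print(code, crawler_reaction(code), needs_alert(code))
```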
Google Webmaster Tools also provides constant monitoring and shows a timeline of each time the robots file was fetched. In the chart, Google displays the errors it found while reading the file; we recommend you look at it once in a while to check whether it displays any errors other than the ones listed above. As we can see below, Google Webmaster Tools provides a chart detailing how often the Googlebot fetched the robots.txt file as well as any errors it encountered while fetching it.
Critical Yet Common Mistakes
1. Blocking CSS or Image Files from Google Crawling
2. Wrong Use of Wildcards May De-Index Your Site
Wildcards, symbols like “*” and “$”, are a valid option for blocking batches of URLs that you believe hold no value for the search engines. Most of the big search engine bots recognize and obey them in the robots.txt file. They are also a good way to block access to some deep URLs without having to list them all in the robots file.
So in case you wish to block, let’s say, URLs that have the .pdf extension, you could very well write out these lines in your robots file:
User-agent: googlebot
Disallow: /*.pdf$
The * wildcard matches any sequence of characters, so it covers every path leading up to .pdf, while the $ marks the end of the URL. A $ at the end of the extension tells the bots that only URLs ending in .pdf shouldn’t be crawled, while any other URL containing “pdf” should be crawled (for example pdf.txt).
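To see how the two wildcards behave, here is a minimal sketch that translates a robots.txt path pattern into a regular expression. The is_blocked and robots_pattern_to_regex helpers are hypothetical, the pattern /*.pdf$ is assumed from the description above, and real crawlers may differ in details such as rule precedence.

```python
import re

def robots_pattern_to_regex(pattern):
    """Compile a robots.txt path pattern: '*' is any text, trailing '$' is end of URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where a '*' stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(pattern, path):
    """True when `path` matches the Disallow `pattern`."""
    return robots_pattern_to_regex(pattern).match(path) is not None

for path in ("/docs/report.pdf", "/docs/pdf.txt", "/docs/report.pdf?download=1"):
    print(path, is_blocked("/*.pdf$", path))
```

Without the trailing $, the same pattern would also match /docs/report.pdf?download=1, because rules are otherwise treated as prefixes.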
*Note: Like any other URL, the paths in the robots.txt file are case-sensitive, so take this into consideration when writing the file.
Other Use Cases for the Robots.txt
Since its first appearance, the robots.txt file has been put to some other interesting uses by webmasters. Let’s take a look at other useful ways someone could take advantage of the file.
1. Hire Awesome Geeks
Tripadvisor.com’s robots.txt file has been turned into a hidden recruitment file. It’s an interesting way to filter out only the “geekiest” from the bunch and find exactly the right people for your company. Let’s face it, it is expected nowadays for people who are interested in your company to take extra time to learn about it, but people who even hunt for hidden messages in your robots.txt file are amazing.
2. Stop the Site from Being Hammered by Crawlers
Another use for the robots file is to stop those pesky crawlers from eating up all your bandwidth. The Crawl-delay directive can be useful if your website has lots of pages. For example, if your website has about 1000 pages, a web crawler can crawl your whole site in several minutes. Adding the line Crawl-delay: 30 tells crawlers to take it a bit easier and use fewer resources: at one page every 30 seconds, your website will be crawled in about eight hours instead of a few minutes.
We don’t really recommend this use, since Google doesn’t take the Crawl-delay directive into consideration and Google Webmaster Tools has a built-in crawl speed setting. Crawl-delay works best for other bots like Ask, Yandex and Bing.
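The effect of Crawl-delay on total crawl time is simple arithmetic, sketched below with Python’s standard urllib.robotparser. The crawl_hours helper is hypothetical, and the estimate assumes a single crawler that waits the full delay between pages and ignores download time.

```python
from urllib.robotparser import RobotFileParser

def crawl_hours(robots_text, user_agent, page_count):
    """Estimate hours needed to crawl `page_count` pages, or None without a delay."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    if delay is None:
        return None
    return page_count * delay / 3600  # seconds to hours

rules = "User-agent: *\nCrawl-delay: 30\n"
# 1000 pages at 30 seconds each is 30,000 seconds.
print(crawl_hours(rules, "bingbot", 1000))
```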
3. Disallow Confidential Information
Disallowing confidential information is a bit of a double-edged sword. It’s great to keep Google away from confidential information so that it isn’t displayed in snippets to people you don’t want to have access to it. But, mainly because not all robots obey the robots.txt directives, some crawlers can still access it. Similarly, if a human with the wrong intentions opens your robots.txt file, they will be able to quickly find the areas of the website which hold precious information. Our advice is to use it wisely, take extra care with the information you place there and remember that not only robots have access to the robots.txt file.
It’s a great case of “with great power comes great responsibility”: the power to guide the Googlebot with a well-written robots file is a tempting one. As stated above, the advantages of having a well-written robots file are great: better crawl speed, no useless content for crawlers and even job recruitment posts. Just keep in mind that one little mistake can cause you a lot of harm. When making the robots file, have a clear image of the path the robots take on your site, disallow them on certain portions of your website and don’t make the mistake of blocking important content areas. Also remember that the robots.txt file is not a legal guardian: robots do not have to obey it, and some robots and crawlers don’t even bother to search for the file and just crawl your entire website.