The use of a robots.txt file has long been debated among webmasters, as it can be a strong tool when well written, or something you shoot yourself in the foot with. Unlike other SEO concepts that could be considered more abstract and for which we don’t have clear guidelines, the robots.txt file is thoroughly documented by Google and other search engines.
You need a robots.txt file only if you have certain portions of your website that you don’t want to be indexed and/or you need to block or manage various crawlers.
*Thanks to Richard for the correction on the text above (check the comments for more info).
What’s important to understand about the robots file is that it doesn’t serve as a law crawlers must obey; it’s more of a signpost with a few indications. Compliance with those indications can lead to faster and better indexation by the search engines, while mistakes, such as hiding important content from the crawlers, will eventually lead to a loss of traffic and indexation problems.
Robots.txt History
We’re sure most of you are familiar with robots.txt by now, but just in case you heard about it a while ago and have since forgotten about it: the Robots Exclusion Standard, as it’s formally known, is the way a website communicates with web crawlers and other web robots. It’s basically a text file containing short instructions, directing the crawlers towards or away from certain parts of the website. Robots are usually programmed to look for this document when they reach a website and to obey its directives. Some robots do not comply with this standard, like email harvesters, spambots or malware robots that don’t have the best intentions when they reach your website.
It all started in early 1994, when a misbehaving web crawler caused a denial-of-service-like load on Martijn Koster’s server. In response, the standard was created to guide web crawlers and block them from reaching certain areas. Since then, the robots file has evolved, contains additional information and has a few more uses, but we’ll get to that later on.
How Important Is Robots.txt for Your Website?
To get a better understanding of it, think of robots.txt as a tour guide for crawlers and bots. It takes the non-human visitors to the areas of the site where the content is and shows them what is important to index and what isn’t, all with the help of a few lines in a plain text file. Having an experienced robot guide can increase the speed at which the website is indexed, cutting the time robots spend going through lines of code to find the content users are looking for in the SERPs.
Over time, more information has been included in the robots file, helping webmasters get their websites crawled and indexed faster.
Nowadays most robots.txt files include the sitemap.xml address, which increases the crawl speed of bots. We’ve also found robots files containing job recruitment ads, messages meant to hurt people’s feelings and even instructions to educate robots for when they become self-conscious. Keep in mind that even though the robots file is strictly meant for robots, it’s still publicly available to anyone who appends /robots.txt to your domain. By trying to hide private information from the search engines, you end up showing its URL to anyone who opens the robots file.
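To illustrate the sitemap reference mentioned above, a minimal robots.txt could look like the sketch below (example.com is just a placeholder domain, and the empty Disallow line simply means nothing is blocked):
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml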
How to Validate Your Robots.txt
The first thing to do once you have your robots file is to make sure it is well written and to check for errors. One mistake here can and will cause you a lot of harm, so after you’ve completed the robots.txt file take extra care in checking for any mistake in it. Most search engines provide their own tools to check robots.txt files and even allow you to see how the crawlers see your website.
Google’s Webmaster Tools offers the robots.txt Tester, a tool which scans and analyzes your file. As you can see in the image below, you can use the GWT robots tester to check each line and see what access each crawler has on your website. The tool displays the date and time the Googlebot fetched the robots file from your website, the HTTP code encountered, as well as the areas and URLs it didn’t have access to. Any errors found by the tester need to be fixed, since they could lead to indexation problems and your site might not appear in the SERPs.
The tool provided by Bing displays the data as seen by the BingBot. Fetching as the Bingbot even shows your HTTP headers and page source as they appear to the Bingbot. This is a great way to find out whether your content is actually seen by the crawler and not hidden by some mistake in the robots.txt file. Moreover, you can test each link by adding it manually, and if the tester finds any problem with it, it will display the line in your robots file that blocks it.
Remember to take your time and carefully validate each line of your robots file. This is the first step in creating a well written robots file, and with the tools at your disposal you really have to try hard to make any mistakes here. Most of the search engines provide a “fetch as *bot” option so after you’ve inspected the robots.txt file by yourself, be sure to run it through the automatic testers provided.
Be Sure You Do Not Exclude Important Pages from Google’s Index
Having a validated robots.txt file is not enough to ensure that you have a great robots file. We can’t stress this enough: a single line in your robots.txt that blocks an important part of your site’s content from being crawled can harm you. So in order to make sure you do not exclude important pages from Google’s index, you can use the same tools you used for validating the robots.txt file.
Fetch the website as the bot and navigate it to make sure you haven’t excluded important content.
Before excluding pages from the eyes of the bots, make sure they belong to the following list of items that hold little to no value for search engines:
- Code and script pages
- Private pages
- Temporary pages
- Any page you believe holds no value for the user.
What we recommend is that you have a clear plan and vision when creating the website’s architecture to make it easier for you to disallow the folders that hold no value for the search crawlers.
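As a rough sketch of what this could look like (the folder names below are purely hypothetical and should be replaced with the low-value folders from your own architecture):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /search-results/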
How to Track Unauthorized Changes in Your Robots.txt
Everything is in place now: the robots.txt file is completed and validated, and you’ve made sure there are no errors or important pages excluded from Google’s crawling. The next step is to make sure that nobody makes any changes to the document without you knowing it. It’s not only about changes to the file; you also need to be aware of any errors that appear while the robots.txt document is in use.
1. Change Detection Notifications – Free Tool
The first tool we want to recommend is changedetection.com. This useful tool tracks any changes made to a page and automatically sends an email when it discovers one. First, insert the robots.txt address and the email address you want to be notified at. Next, you can customize your notifications: you are able to change their frequency and set alerts only if certain keywords in the file have been changed.
2. Google Webmaster Tools Notifications
Google Webmaster Tools provides an additional alerting tool. The difference is that it sends you notifications of any error in your code each time a crawler reaches your website. Robots.txt errors are also tracked, and you will receive an email each time an issue appears. Here is an in-depth guide for setting up Google Webmaster Tools alerts.
3. HTTP Error Notifications – Free & Paid Tool
In order to not shoot yourself in the foot when serving the robots.txt file, only these HTTP status codes should be returned:
- The 200 code, which basically means that the page was found and read;
- The 403 and 404 codes, which tell the bots that the robots.txt file is not available, so they will assume you have no restrictions. This will cause the bots to crawl your entire website and index it accordingly.
The SiteUptime tool periodically checks your robots.txt URL and is able to instantly notify you if it encounters unwanted errors. The critical error you want to keep track of is the 503 one.
A 503 error indicates that there is an error on the server side and if a robot encounters it, your website will not be crawled at all.
Google Webmaster Tools also provides constant monitoring: as we can see below, it shows a chart detailing how often the Googlebot fetched the robots.txt file, along with any errors it encountered while fetching it. We recommend you look at it once in a while to check whether it displays any errors other than the ones listed above.
Critical Yet Common Mistakes
1. Blocking CSS or Image Files from Google Crawling
Last year, in October, Google stated that disallowing CSS, JavaScript and even image files (we’ve written an interesting article about it) counts towards your website’s overall ranking. Google’s algorithm keeps getting better and is now able to read your website’s CSS and JS code and draw conclusions about how useful the content is for the user. Blocking these resources in the robots file can cause you harm and will not let you rank as high as you probably should.
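If you want to be explicit about keeping these resources crawlable, a hedged sketch, using the wildcards discussed in the next section, could look like this (the patterns are only illustrative):
User-agent: Googlebot
Allow: /*.css$
Allow: /*.js$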
2. Wrong Use of Wildcards May De-Index Your Site
Wildcards, symbols like “*” and “$”, are a valid option for blocking batches of URLs that you believe hold no value for the search engines. Most of the big search engine bots observe and obey them when used in the robots.txt file. They are also a good way to block access to some deep URLs without having to list them all in the robots file.
So in case you wish to block, let’s say, URLs that have the .pdf extension, you could write the following lines in your robots file:
User-agent: googlebot
Disallow: /*.pdf$
The * wildcard matches any string of characters in the URL, while the $ anchors the rule to the end of it. This tells the bots that only URLs actually ending in .pdf shouldn’t be crawled, while any other URL that merely contains “pdf” (for example pdf.txt) can still be crawled.
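Another common pattern, shown here only as an illustration you would adapt to your own URL structure, is blocking every URL that contains a query string:
User-agent: *
Disallow: /*?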
*Note: Like any other URL, the paths in the robots.txt file are case-sensitive, so take this into consideration when writing the file.
Other Use Cases for the Robots.txt
Since its first appearance, webmasters have found some other interesting uses for the robots.txt file. Let’s take a look at other useful ways you could take advantage of it.
1. Hire Awesome Geeks
Tripadvisor.com’s robots.txt file has been turned into a hidden recruitment page. It’s an interesting way to filter out only the “geekiest” of the bunch and find exactly the right people for your company. Let’s face it, it is expected nowadays for people who are interested in your company to take extra time to learn about it, but people who even hunt for hidden messages in your robots.txt file are amazing.
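Since lines starting with # are treated as comments and ignored by crawlers, such a hidden message can be as simple as the sketch below (the wording and careers URL are made up for illustration):
# If you’re reading this, we’d like to meet you.
# Careers: https://example.com/careers
User-agent: *
Disallow: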
2. Stop the Site from Being Hammered by Crawlers
Another use for the robots file is to stop those pesky crawlers from eating up all the bandwidth. The Crawl-delay directive can be useful if your website has lots of pages. For example, if your website has about 1,000 pages, a web crawler can go through your whole site in several minutes. Placing the line Crawl-delay: 30 will tell crawlers to take it a bit easier and use fewer resources, and you’ll have your website crawled in a couple of hours instead of a few minutes.
We don’t really recommend this for Google, since Googlebot doesn’t take the Crawl-delay directive into consideration and Google Webmaster Tools has a built-in crawl rate setting instead. The Crawl-delay directive works best for other bots like Ask, Yandex and Bing.
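For the bots that do honor it, a hedged example reusing the 30-second value from above would be:
User-agent: Bingbot
Crawl-delay: 30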
3. Disallow Confidential Information
Disallowing confidential information is a bit of a double-edged sword. It’s great to keep Google from accessing confidential information and displaying it in snippets to people you don’t want to have access to it. But, mainly because not all robots obey the robots.txt directives, some crawlers can still reach it. Similarly, if a human with the wrong intentions reads your robots.txt file, they will be able to quickly find the areas of the website that hold precious information. Our advice is to use it wisely, take extra care with the information you place there and remember that robots are not the only ones with access to the robots.txt file.
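For instance, a single line like the hypothetical one below tells any curious reader exactly where the sensitive content lives:
User-agent: *
Disallow: /confidential-reports/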
Conclusion
It’s a great case of “with great power comes great responsibility”: the ability to guide the Googlebot with a well written robots file is a tempting one. As shown above, the advantages of a well written robots file are great: better crawl speed, no useless content served to crawlers and even job recruitment posts. Just keep in mind that one little mistake can cause you a lot of harm. When making the robots file, have a clear image of the path the robots take on your site, disallow them from certain portions of your website and don’t make the mistake of blocking important content areas. Also remember that the robots.txt file is not a legal guardian: robots do not have to obey it, and some robots and crawlers don’t even bother to look for the file and simply crawl your entire website.
And how stupid is the official comment:
Robots.txt is a crawl directive not an indexation directive. Robots.txt doesn’t block indexation of content.
and how correct you are on this one, Richard. sorry for this mistake; changed it and gave you credit for it. 🙂 good thing it was easily solvable.
the correct statement, as found on Google’s site, is:
“You only need a robots.txt file if your site includes content that you don’t want Google or other search engines to index.”
https://support.google.com/webmasters/answer/6062608?hl=en
Much respect to you Razwan for handling such an arrogant and belligerent comment by Richard in a respectful and professional manner. It would be a miracle if we actually thought of correcting others in a good manner, but I guess hiding behind a computer screen makes it much easier to show our true selves.
Sure!
By the way, thank you both for contributing to this issue.
I love how you explain these complicated (at least for me) matters, Razvan.
I am acquiring knowledge thanks to you.
Thank you again 😉
Great write-up. Almost every client we take on with a large site has crazy restrictions added to their robots.txt and .htaccess; it’s one of the first places we look when performing a site audit.
it gets tricky when the site is bigger. a small change can affect a lot 🙂
Configuring the robots.txt file is a technical thing. I’ve seen bloggers make mistakes while configuring it and end up losing all their rankings. Sometimes it takes weeks to get back up in the rankings. All bloggers and webmasters should be careful while editing the robots.txt file.
agree on that. should be treated with care.
Great post.
I’ve seen so many accidental problems over the years that I’ve built a tool (in beta) that tests for a slew of changes with SEO impact and generates alerts: robots.txt, titles, meta robots, nofollow links, canonicals, 301s, 302s, H1s. If you have some time, check it out. Would love to get your feedback… it’s SEORadar.com and it’s in free beta. It also creates an archive so you can go back and examine your site over time.
agree that accidental issues may appear. that was the idea with the robots.txt article, to raise awareness of this kind of mistake.
Apparently, not everyone’s cautious enough when writing their robots.txt file. Some web developers would even push a website live without checking whether the file is even there and whether the proper pages will be allowed to get indexed in search engines.
Also, drastic changes in website ranking happen when the developer is not familiar with robots.txt’s proper use and its effects on a website. That’s why developers need guidance from SEOs to properly implement this core website function that’s invaluable to a business.
it is getting complex. more things to check, more things to monitor… more things left to be forgotten. you need to be very cautious nowadays with all this stuff.
Search Engine crawlers check for a robots.txt file at the root of the site. If it’s there, they’ll follow the index/crawl instructions inside, but otherwise they’ll assume the entire site can be indexed.
Hello Razvan,
I have made some changes to the robots.txt file and submitted the new one in Webmaster Tools via the robots.txt Tester. It has been around 36 hours since I submitted it, but it’s still showing the old one. Can you tell me if that’s okay and whether it can sometimes take a while to update?
Thank You,
Karan
Hi, actually I have also been facing this issue on our blog for the last 3 weeks… my blog’s ranking has dropped… now I have fixed the issue and the robots file is working fine… but how much time will it take to recover our ranking?
Hi Razvan,
Good one about robots.txt. Thanks.
We see many connections open on the Apache server due to robot indexers, so the server was going down and we were getting an HTTP 503 error while accessing the site.
Any idea on this issue?
Thanks,
CR
How do I remove a blocked source URL on another site, like YouTube?
Hi, Napi! You cannot do that as you don’t have access to it.
Nice article, and very informative for beginners as well as for those who have been doing it for years.
Thanks for this guide
Very nice post. Thank you so much for sharing.
Text link for the word “adds” instead of “ads” makes me almost think you may have found a niche keyword target somehow related to “job recruitment adds” – but if not, thought you’d want to know about the error.
Thorough article – thanks!
Great observation, Ken! Thanks a lot!
Hello Razvan
I have a problem with the robots.txt file; in fact, I moved from http to https, so maybe the problem comes from this update. Now, when I want to test my website on http://browsershots.org/ or on https://developer.microsoft.com/en-us/microsoft-edge/tools/screenshots, I get the following errors:
—- Microsoft website ———-
We encountered an error with this URL.
Try the following options:
Verify the url is valid and publicly accessible
Enter your domain name only
Use a destination URL if the original URL you entered redirects to it
————-
——— browsershots website ———
Could not read beginfromhere.com/robots.txt.
————
This is a link to the website: https://beginfromhere.com/site/
Any help ?
thanks a lot, amazing tips! recently the same thing happened to my blog and it affected my blog’s SEO health.
Hi Razvan,
I noticed your image of “Correct robots.txt Wildcard Matches” is taken from Google – https://developers.google.com/search/reference/robots_txt
It would be nice if you provided a reference and link to your source. This may be useful for your readers.
Thank you
The screenshot is now updated 😉 thanks!
Hi, this is Kalviseithi. My site was blocked by robots.txt last month and my site went down.
It was blocked by robots.txt as follows:
1. using labels
2. and others.
Please give me a correct solution for my site.
thanks,
kalviseithi
Nice post.
You can also improve your search engine indexing with robots.txt.
Hi, I am getting some errors in Search Console; I hope to rectify them after reading your post.
Why do you say that robots.txt doesn’t serve as a law for crawlers to obey if the first thing a crawler does when visiting a website is read it (and then follow it)?
Hi, Lina!
Not sure what you are referring to; yet robots.txt does act as a tour guide for crawlers and bots.
Why do I get “The robots.txt content does not comply with the format rules.”?
I was facing a few errors in robots.txt and, while looking for the root cause, I came across this article and it helped me. Thanks for sharing this.
thx .. nice article