2. Crawling. Indexing. Ranking – The Three Musketeers of SEO
The crawling phase is all about discovery. The process is complex and relies on software programs called spiders (or web crawlers). Googlebot is probably the best-known crawler.
The crawlers start by fetching web pages, then follow the links on those pages, fetch the linked pages, follow their links, and so on, up to the point where the pages are indexed. To do this, the crawler uses a parsing module, which does not render pages but only analyzes the source code and extracts any URLs found in the <a href="…"> tags. Crawlers can also validate hyperlinks and HTML code.
An important thing to keep in mind is that when you perform a search on Google, you are not searching the web itself, but Google’s index of the web. The index is built from all the pages found during the crawl process.
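The link-extraction step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Googlebot is implemented: the `LinkExtractor` class and `extract_links` function are hypothetical names, and the sketch uses the standard library's `html.parser`, which reads the source code without rendering it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags without rendering the page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href> links found in the source."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Because the parser only looks at the HTML source, links that exist only after JavaScript runs in a browser would not be found by a sketch like this one.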
You can help Google by telling the crawler which pages to crawl and which to skip. A “robots.txt” file tells search engines whether they can access and crawl your whole site or just some parts of it. This way, you control which of your site’s code and data Googlebot can access. You should use the robots.txt file to show Google exactly what you want your users to see; otherwise, pages that you do not want indexed may end up being accessed. With this file, you can block or manage various crawlers. Check your robots.txt file regularly to avoid errors and ranking drops. Nowadays, many robots.txt files also include the XML sitemap address, which helps bots discover your URLs faster and is an advantage for your website.
Googlebot plays the main role in the crawling process. In the indexing process, on the other hand, the main role belongs to Caffeine, Google’s indexing infrastructure.
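To see how such rules are read by a crawler, here is a minimal sketch that uses Python's standard `urllib.robotparser`. The robots.txt content, the `BadBot` user agent and the example.com URLs are made up for illustration; real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /admin/,
# one specific crawler is blocked entirely, and a sitemap is advertised.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: BadBot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot falls under the "*" group: the blog is open, /admin/ is not.
googlebot_blog = parser.can_fetch("Googlebot", "https://www.example.com/blog/")
googlebot_admin = parser.can_fetch("Googlebot", "https://www.example.com/admin/login")

# BadBot has its own group and is blocked from the whole site.
badbot_blog = parser.can_fetch("BadBot", "https://www.example.com/blog/")

# Sitemap URLs listed in the file (site_maps() needs Python 3.8 or newer).
sitemaps = parser.site_maps()
```

Here `googlebot_blog` is `True`, while `googlebot_admin` and `badbot_blog` are `False`, which mirrors the idea of blocking or managing individual crawlers.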
|Client-side analytics may not provide a full or accurate representation of Googlebot and WRS activity on your site. Use Search Console to monitor Googlebot and WRS activity and feedback on your site.|
Practically, these two phases work together:
- The crawler sends what it finds to the indexer;
- The indexer feeds more URLs to the crawler and, as a bonus, prioritizes those URLs based on their value.
The whole relationship between crawling and indexing is explained very well by Matt Cutts in the “How Search Works” video.
Once this stage is complete and no errors are found in Search Console, the ranking process should begin. At this point, webmasters and SEO experts must put effort into offering quality content, optimizing the website, and earning and building valuable links in line with Google’s quality guidelines. It is also very important that the people responsible for this process are familiar with the Rater Guidelines.
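The feedback loop between the two phases can be pictured with a toy model. Everything in the sketch below is an assumption made for illustration: the in-memory `TOY_WEB`, the `depth_priority` scoring rule and the `crawl_and_index` function are hypothetical, and none of it reflects how Googlebot or Caffeine actually work internally.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "/": ("home", ["/blog", "/about"]),
    "/blog": ("blog posts", ["/blog/js-seo", "/"]),
    "/about": ("about us", []),
    "/blog/js-seo": ("javascript and seo", ["/blog"]),
}


def depth_priority(url):
    """Hypothetical value score: shallower URLs count as more valuable."""
    return -url.count("/")


def crawl_and_index(seed, web, priority_of=depth_priority):
    index = {}                               # indexer's store: URL -> text
    frontier = [(-priority_of(seed), seed)]  # min-heap on the negated score
    seen = {seed}
    order = []                               # order in which URLs were crawled
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = web[url]               # crawler fetches the page
        index[url] = text                    # crawler hands it to the indexer
        order.append(url)
        for link in links:                   # indexer feeds new URLs back
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority_of(link), link))
    return order, index
```

Calling `crawl_and_index("/", TOY_WEB)` crawls the shallow pages before the deeper `/blog/js-seo` page, which is the prioritization idea from the second bullet in miniature.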
The truth is that with the ever-evolving search engine algorithms you need an efficient solution to keep your rankings safe. And cognitiveSEO does exactly that: it lets you know all the issues that might prevent your online business from getting the organic traffic and the high rankings you deserve.
Performing a complete JavaScript website audit will give you a deeper understanding of why your site is not generating the organic traffic you think it should or why your sales and conversions are not improving. This kind of website audit gives you a much wider array of SEO items to look at and can analyze issues of all types that might prevent you from reaching your best possible ranking.
All the problems began when people started confusing Googlebot (used in the crawling process) with Caffeine (used in the indexing process). Barry Adams talked about the confusion between these two. There’s even a thread on Twitter about it:
The use of 'Googlebot' in there confuses me. The crawler doesn't render, does it? Caffeine is where pages are rendered?
— 🛠 Barry Adams ⌨️ (@badams) August 5, 2017
It has lots of guides on how search engine optimization works, how developers should design websites and how content writers should create white-hat content. That is how the term “crawl budget” was born.
|Historically, AJAX applications have been difficult for search engines to process because AJAX content is produced dynamically by the browser and thus not visible to crawlers.|
In 2015, 6 years later, Google deprecated its AJAX crawling system and things changed. According to the Technical Webmaster Guidelines, as long as you are not blocking Googlebot from crawling your JS or CSS files, Google is generally able to render and understand your web pages.
And there were other problems that needed to be solved. Some webmasters using JS frameworks had web servers that served a pre-rendered page, which shouldn’t normally happen. Pre-rendered pages should follow the progressive enhancement guidelines and benefit the user. In any case, it is very important that the content sent to Googlebot matches the content served to the user, both in how it looks and in how it behaves. Basically, when Googlebot crawls the page, it should see the same content the user sees. Serving different content is cloaking, and it is against Google’s quality guidelines.
The progressive enhancement guidelines say that the best approach is to build the site’s structure using only HTML, and only then use AJAX for the appearance and interface of the website. This way you are covered, because Googlebot will see the HTML and the user will benefit from the AJAX look and feel.
Google confirmed another change related to AJAX. It started with the decision to deprecate the AJAX crawling system; later, Roey Skif asked John Mueller on Twitter whether Fetch as Google could handle hashbang URLs. He then tested the impact of this change and saw a lot of blocked resources on the hashbang URLs that were completely different and that he had not been aware of.
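One rough way to check that the HTML sent to a bot and the HTML sent to a user carry the same content is to compare their visible text. The sketch below is a simplified assumption, not a cloaking detector used by Google: `TextExtractor`, `visible_text` and `same_content` are hypothetical helpers, and they compare raw HTML only, so content that appears after JavaScript runs in the browser is not covered.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with markup removed and whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def same_content(html_for_bot, html_for_user):
    """True when both responses expose the same visible text."""
    return visible_text(html_for_bot) == visible_text(html_for_user)
```

Two responses that differ only in wrapper markup or scripts compare as equal here, while a page that shows the bot different wording than the user does not.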
@JohnMu Is the ability to fetch & render hash bang URL's via the GSC is something relatively new? From what I can recall, in the past it wasn't functioning
— Roey Skif (@roeyskif) February 27, 2018
It is true: Google now supports hashbang URLs, that is, URLs that have #! in them (it stopped doing that on March 30, 2014). Here is an example of such a link: http://www.example.com/bla/#!/bla/. The nice part is that you can use Fetch as Google for AJAX hashbang URLs.
Another thing you could do, besides using Fetch as Google, is to check and test your robots.txt file from Search Console, too. The robots.txt tester in Google Webmaster Tools allows you to check each line and see what access each crawler has to your website.
The information retrieval process includes crawling, indexing and ranking. You have surely heard of them before, but what you may not know is that lots of people are confused about how crawling and indexing work together and what each process does. We’ve seen that in the crawling phase the website is fetched, and then in the indexing phase the site is rendered. Googlebot (the crawler) fetches the website and Caffeine (the indexer) renders the content. The problem started when most people confused these two and said that the crawler helps Google index the website.
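To make the structure of such a URL concrete, the sketch below splits a hashbang URL into the part a server actually receives and the application state kept after #!. The `split_hashbang` helper is a hypothetical name; the only fact it relies on is that the fragment (everything after #) is not sent to the server in an HTTP request, which is why these URLs need special handling.

```python
from urllib.parse import urlsplit, urlunsplit


def split_hashbang(url):
    """Split a hashbang URL into (server_url, app_state).

    Returns None when the URL has no #! fragment.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return None
    # The server only ever sees the URL without its fragment.
    server_url = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
    # Whatever follows "#!" is state handled by JavaScript in the browser.
    app_state = parts.fragment[1:]
    return server_url, app_state


example = split_hashbang("http://www.example.com/bla/#!/bla/")
# example is ("http://www.example.com/bla/", "/bla/")
```

An ordinary anchor such as `#top` is not a hashbang, so `split_hashbang` returns `None` for it.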
Hi guys. Beautiful technical blogs. Thank you for the amazing topics.
Thank you. Expecting you back for our next amazing topics.
Hi, this article is inspired by the one written by Barry Adams in August 2017.
Thanks for the feedback, Bart!
Barry’s article is indeed great and it tackles a similar topic; yet, the approaches along with the main points of interest are different.
Thank you for the info. It helps me out.
Glad to hear that, Bhumi. Good luck!
Thank you. Let us know if there are other topics you might be interested in.
You’re very welcome. Glad we had the chance to help you. Good luck!
This is a very interesting and informative article on a topic few people have thought or written about. You thoroughly explored what the Google search engine can and cannot do with JS. But it certainly makes websites load faster, which is excellent for SEO.
You are perfectly right. Hope more people will understand that now.
I’m going to work on this now. While the subject of this article is a bit boring to me, yet necessary, I read it all because of how nicely you have your website and fonts laid out.
Thank you so much for the feedback. We’ll come with new interesting topics in the future and make all of them better even if the topic is boring 🙂
Thanks, Andreea, for sharing such great information with us. It will be very helpful for newbies like us.
Well-written article, and I hope the discussion in this area keeps going, as JS and AJAX are a common problem in SEO.
However, I have a question (or maybe an idea).
What if we served two different sites?
One for users (humans) and another one for bots (such as Googlebot, etc.).
It is in the same IP, same subnet.
No different content. 100% exact the same.
For users (humans using a browser) we would serve a client-side rendered JS version. But for bots, we would serve a second version of the site, which is server-side rendered.
What is your concern about this?
Thank you Andreea!
If they are identical and if the final version is rendered in the browser using HTML, there is no point in having two versions of the same site. Ideally, you would have just one.
Could you tell me what is the reason for your question?
Really wonderful article, Andreea, I must say.
Does it matter for my SEO effort, or is the question, and thus the confusion, not relevant at all?
Very technical but well explained.
Thank you!