Last time we looked at Google’s Answer Boxes, we came up with quite a handful of interesting observations. However, we couldn’t quite give you the best explanation of what it takes to get your website there. We gathered that you need to be regarded as an authority site, but what does that really translate into?
The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’ - Isaac Asimov
So we set out to find out more about the issue, only this time with a more scientific outlook. This meant that, while we could still look at only some examples, we could make the sample much bigger. What’s a big enough sample? Well, in statistics a couple of thousand is usually enough. So, just to make sure, we looked at about 10 000 keywords. Of course, we didn’t have one person (or more) look at every scenario; rather, we devised an algorithm that would do the job for us.
The algorithm ran automatic searches for phrases such as “what is…” and “who is…”, adding just one letter after the phrase (“what is a”, “what is b”, and so on up to “what is z”) and collecting the autocomplete suggestions (since those are supposed to be the most popular searches, and therefore the ones most likely to elicit answer boxes). To have a standardized cutoff point, we only took into account the first 10 autocomplete suggestions for each generated keyword. This method gave us a sample of keywords that are most likely to return answer boxes.
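The keyword generation described above can be sketched roughly as follows. This is only an illustration, not the actual crawler: the two prefixes are just the examples named in the text, and `fetch_suggestions` is a hypothetical callable standing in for whatever autocomplete lookup was really used.

```python
import string


def seed_queries(prefixes=("what is", "who is")):
    """Build one seed per prefix and letter: 'what is a' ... 'what is z'."""
    return [
        f"{prefix} {letter}"
        for prefix in prefixes
        for letter in string.ascii_lowercase
    ]


def collect_keywords(seeds, fetch_suggestions, limit=10):
    """Keep at most `limit` suggestions per seed, skipping duplicates."""
    seen = set()
    keywords = []
    for seed in seeds:
        # fetch_suggestions is hypothetical: it takes a seed string and
        # returns a list of suggestion strings, most popular first.
        for suggestion in fetch_suggestions(seed)[:limit]:
            if suggestion not in seen:
                seen.add(suggestion)
                keywords.append(suggestion)
    return keywords


def fake_suggestions(seed):
    """Stand-in for a real autocomplete lookup, for demonstration only."""
    return [f"{seed}{i}" for i in range(12)]


seeds = seed_queries()
keywords = collect_keywords(seeds, fake_suggestions)
```

With two prefixes this yields 52 seeds and at most 520 keywords, so reaching roughly 10 000 keywords would require a longer list of question prefixes than the two shown here.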
Google Answer Box Appearance Ratio on 10k Keywords
This foray into the search engine came up with about 10 000 queries (10 353 to be more precise). Of those, only 1 792 returned answer boxes, which is roughly 17% of the total number of searches. So the first straightforward observation is that the share of search results with answer boxes is fairly small. A sample of 10 000 is large enough to extrapolate to a population of pretty much any size with a high confidence level, provided the sample is random. While this may sound pretty unbelievable, that’s just how statistics works. Admittedly, our sample was not perfectly random, since it was built from queries likely to trigger answer boxes, so let’s just say that the claim holds for searches that could potentially yield an answer box: of all that could, rather few actually do.
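To put a number on the sample-size argument, here is a minimal sketch of the usual normal-approximation margin of error for a proportion. It assumes a simple random sample, which, as noted above, ours was not, so treat the result as a rough indication only. The function name and the 95% z-value of 1.96 are choices made for this illustration.

```python
import math


def proportion_with_margin(hits, total, z=1.96):
    """Return the observed share and its approximate margin of error."""
    share = hits / total
    # Standard error of a proportion under the normal approximation
    standard_error = math.sqrt(share * (1 - share) / total)
    return share, z * standard_error


share, margin = proportion_with_margin(1792, 10353)
print(f"answer box share: {share:.1%}, margin of error: {margin:.1%}")
```

Under those assumptions the margin comes out well below one percentage point, which is why a sample of this size is considered comfortable.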

Google Answer Box Types
We have already established that answer boxes can come in many shapes and sizes. So we instructed the algorithm to also figure out what kind of answer box it received. Most of the answer boxes included definitions or descriptions extracted from various websites; there were 1 236 of them, which amounts to almost 69% of the answer boxes. This means that all the other types of answer boxes (Web Definitions, Video Widgets, Google Widgets such as conversions and maps, and Google Dictionary Definitions) taken together account for less than a third of the answer boxes. This is good news for SEO. If the answers only consisted of Google widgets, Google definitions or web definitions, you would have little to contribute to the landscape. As things are now, your website could be the source of a definition or description for the vast majority of the answers in the box.

Before continuing, let’s clarify the types of definitions as they appear in the answer box.
Google Dictionary Definition
Google Dictionary was an online dictionary service of Google, originating in its Google Translate service. The Dictionary website was terminated in 2011, but part of its functionality was then integrated into Google Search, and now it looks like it’s integrated in the answer box. When Google provides an answer box coming from the Google Dictionary, you won’t see any URL near the generated content. It kind of gives you the feeling that Google “knows for sure” that the info is accurate and doesn’t need to give any extra explanation.

Google Widget
Google has an impressive number of helpful widgets, including translation, weather, driving directions and currency conversion. These widgets really improve the user’s experience, sparing them lots of clicks and time. For instance, a user who needs to know how much 300 meters is in kilometers doesn’t have to visit several sites to look up the conversion for one meter and multiply it by 300. All they have to do is “ask” Google “how long is 300 meters” and they will get the answer instantly.

Google Video Widget
Also, if you want to impress your friends with some new dance moves, or you are looking for a particular type of move that you want to reproduce, Google understands this need and gives you a video result directly in the answer box.

Google Web Definition
The answer boxes built from web definitions are a fairly basic way of presenting information extracted from URLs containing words such as “glossary” and “dictionary”. This kind of definition relies neither on entities nor on a dynamic process of extracting data; rather, a static procedure is involved. Although there is a high number of answer boxes coming from web definitions, they are not always the best answers that Google provides, time and again returning inaccurate or unrelated data.

Google Web Extraction
The definitions provided in the answer box from web extractions are, as we will see later on, more reliable, more dynamic and more accurate than the web definitions. They usually come from sites that have high authority and that also include the search query on their site. For instance, in the example below, if we want to find out what an atom is composed of, the answer box extracts the information from education.jlab.org/qa/atom.html. If we follow the link, we see that the landing page has content dedicated to this matter, with the matching title “What is an atom? What are atoms made of?”

Unique Domains Used for Data Extraction
It seems that in the world of search engines the rich get richer. When we analyzed the answers that were extracted from websites, we found out that they came from only 342 websites. That is about 3.6 answers per website on average. But averages can be deceiving, and in this case they actually are. Of those 342 websites (mainly Wikipedia, dictionaries or glossaries), not all got the lion’s share.

Top 10 Ranking Position Distribution in the Organic SERPs for the Answer Box URL
Of the many factors that might influence the “distribution”, one that comes to mind almost instantly is the SERP ranking. So we split the websites according to this criterion and, lo and behold, websites found in the first position in the organic results accounted for a third (33%) of all answer box information. The top 5 positions accounted for more than three quarters (77%) of the answers. There was just 1 answer out of the total 1 236 that came from a page not in the top 10 (statistically, that’s less than 0.1%). So rankings matter. And while you would be right to suggest that the relationship implied by this correlation may be more complicated than what we are seeing from the numbers, you’d be taking a pretty serious chance to bet on being that 1 case in 1 236 that doesn’t need to be up high in the rankings to make it to the answer box. Or, to quote Randall Munroe, creator of the “scienc-y” web comic XKCD, “Correlation does not imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there.’ ”

Trusted Sites Distribution in Google Answer Boxes
In all fairness, we are inclined to cut you some slack and say that it’s not necessarily (or solely) the SERP rankings that matter, but that’s only because the SERP position might simply be an indicator of some other measure of your website’s quality: referring domains. This is a case where more is actually more and better. Domains with more than 10 000 referring domains make up exactly half of all domains that serve as answer sources. Interestingly enough, a lot more answers (twice the number) come from domains with between 1 000 and 5 000 referring domains than from domains with between 5 000 and 10 000. That may very well be due to the arbitrary split, or to a lot of the values sitting around the cutoff point. Despite this, the 1 000 mark is a fairly good predictor: more than 80% of all answers come from domains that have 1 000 referring domains or more. That means there’s still a reasonable chance of popping up in answer boxes even with fewer than that. However, if you drop under 100 you are on your own: less than a 3% chance of hitting the jackpot.

Google Answer Box Crawl – No. of Results based on Words per Page Intervals
In fact, how the information is structured may have a lot to do with your chances of being considered a trustworthy source for answers. A helpful element is having a title that is roughly the same as the question and an answer that immediately follows the title. That speaks to the structure part. What about rich content? This is where, unlike before, less is actually more. Pages with fewer than 2 000 words (the rough equivalent of 5 pages typed in Times New Roman, font size 12, single spacing) account for close to 70% of all answer boxes. As the number of words grows, the number of answer box results shrinks: 20% for pages with between 2 000 and 5 000 words, 5.5% for pages with between 5 000 and 10 000 words and only 2% for pages with over 10 000 words! Whether adding information automatically makes it harder for Google to look at that page for answers, or it simply makes it harder for us to keep things simple and straightforward, one thing is certain: leaner is better.
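The word-count split above can be reproduced with a few lines of code. This is a sketch only: the interval boundaries follow the text, but how exact boundary values (a page with exactly 2 000 words, say) were assigned is an assumption, and the sample word counts are made up for illustration.

```python
from collections import Counter

# (exclusive upper bound, label) pairs, in ascending order
BUCKETS = [
    (2000, "under 2 000"),
    (5000, "2 000 to 5 000"),
    (10000, "5 000 to 10 000"),
]


def word_bucket(word_count):
    """Return the interval label for a page's word count."""
    for upper_bound, label in BUCKETS:
        if word_count < upper_bound:
            return label
    return "over 10 000"


def bucket_shares(word_counts):
    """Return each interval's share of the pages as a fraction of 1."""
    counts = Counter(word_bucket(n) for n in word_counts)
    return {label: count / len(word_counts) for label, count in counts.items()}


# Made-up word counts, purely to show the output shape
sample = [350, 1200, 1999, 2000, 4800, 7500, 12000, 900]
shares = bucket_shares(sample)
```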

Google Answer Box Characteristic – Title vs No-Title Answer Boxes
As you go around searching for different queries on Google, you might notice that there are two types of answer boxes when it comes to the title: answer boxes which have a title and answer boxes which don’t. Just like in the case of any piece of content, the title can make a great difference. Let’s take a look at the screenshot below!

We are not talking only about the title’s purpose of garnering attention and enticing people to start reading your content, but about the basic purpose of titles: functionality. Above all, people need to know what the content is about. Of all the analyzed answer boxes, about 30% have a title, while the rest provide the information directly in the box, without any introduction. Is a title beneficial? It definitely is, as it greatly improves the user’s experience. With a majority of “no title” answer boxes, it is hard to say that Google is on the right track in this matter. Yet things might change in the future, and having titles in 100% of answer boxes in a couple of years might not be far from the truth.

Answer Box Stability and Freshness
There seems to be quite a lot of hard work you need to do to get into the much coveted answer boxes. But the effort is likely to pay off, perhaps even in ways that were not necessarily intended. We are now further still into “that’s interesting” territory. The interesting thing is that the answer box functionality seems to be rather static: once a website gets there, it might be a long time before it is removed. Not even, say, the website going down will shake the answer box. This turned out to be the case for a variety of queries, such as “what are lr14 batteries”, “where to buy plan b”, “what are k1 tax forms”, “what is seo spam”, or even “who is john endler and what did he study” (vertebrates, he studied vertebrates). So there is a very slight chance that an answer box will buy your website a little bit of post-mortem remembrance.
Expired Domains Rank No.1 in Google Answer Boxes
Being in the SEO industry and trying to make our way through all of Google’s guidelines, we are often asked “what is a natural link?” We’ve tried to give the best answer to this question, but what better place to ask about this than Google? Yet, as we tried to figure out the exact definition of Google-friendly links, the answer box failed to provide us with a rewarding explanation. What is even more interesting is that when we tried to follow the link from the answer box for more info, we stumbled upon an expired domain: wordpressseomarketing.com.

Gripped by “researcher’s fever”, we decided to buy this domain and analyze the ranking data we might get from Google.

Even though the domain was dropped, this link stayed in the answer box from 16 May until 6 July. This means that for almost two months an unregistered domain ranked number 1 for the “what is a natural link” search query. And it would probably have stayed even longer if we hadn’t bought it. Quite ironic, isn’t it? Google, the great unnatural link “slayer”, serving us a broken link at the top of its results while trying to explain to us what a natural link is.

We decided to re-create the page based on its previous content and remove any extra data. So, with the help of the Wayback Machine, we extracted the content of that page and recreated it exactly.

And this is the content that was put on the site based on the previous content that was there years ago.

What is left to do now (besides enjoying the status of owning a website listed in the answer box)? Track the traffic and enjoy the ride. We are still analyzing this site’s situation, and once we gather enough valuable information, we will let you know what happened to our formerly expired site in the Answer Box results.

But some website definitions raise even more issues: not only do they hit the jackpot, they do so multiple times. Wikipedia’s entry for “Search engine optimization (SEO)” brings all the SEO-curious people to its yard. It’s the source for no fewer than 14 answer boxes, including information for questions such as “what is seo expert”, “what is seo consulting”, “what is seo industry”, “what is seo definition”, “what is seo marketing” and more. But do not be fooled by this “rich and well structured” content that provided so many answer boxes. What really happens is that for all the mentioned queries, whether it’s about SEO expert or SEO marketing, we are given the same, identical answer box. Not so impressive anymore, right?
Then again, there is a much greater chance that this static character of answer boxes will impact you negatively, since it will prevent your perfectly well-functioning website from entering the ranks because some defunct authority, which may not even exist anymore, is taking up the space.
Website Extraction vs Website Definition Answer Boxes
I invite you to take a look at another interesting finding regarding the Website Definitions.
It looks like none of the URLs for website definitions are found in the top 10 SERPs.
For instance, for the search query “what is a link description”, the URL suggested in the answer box, http://www.sparkbb.com/free-forum-articles/forum-terminology.php, is not to be found in the first 10 pages of results. This raises two legitimate questions:
- how can a site that Google doesn’t consider worthy of being listed in the first 10 results be given as a resource in the answer box?
- shouldn’t we worry about the quality of the information found in the answer box, given this situation?
As we analyze other answer boxes extracted from web definitions, we find that the majority of them seem to be low quality and sometimes even unrelated. Take, for instance, the query “what is 360 link”. Even though the web definition provided by the answer box comes from Wikipedia (where 51% of all web definitions come from), it cannot be found in the top 10 results. What’s more, the content provided is unrelated and has a commercial flavor (it refers to a product from the ProQuest company). This is the exact opposite of what John Mueller from Google said about “branding” the answer boxes:
we need to watch out [...] so it doesn’t turn out to an advertisement for a web site but rather that it brings more information to the search results about this general topic.
With so many issues, answer boxes generated from Web Definitions don’t look very reliable. In the case of website extraction, things are more settled and we don’t encounter the same problems. Judging by the fact that the data are shown differently, we can assume that the extraction from web definitions is done in a totally different way from the Entity Extraction done using the Knowledge Graph. The Website Extraction seems more precise, while the Website Definition seems more basic. Mysterious are the ways of Google, but equally determined are the people from cognitiveSEO to find answers. As we browsed so many queries with answer boxes, we identified a pattern in the web definitions extraction. It looks like the majority of definitions that do not come from Wikipedia have a similar URL pattern using the words “glossary” or “dictionary” (and other variations).
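The URL pattern noted above is easy to check for. The snippet below is a rough sketch of such a check, not a reconstruction of anything Google does: it only looks for the two words named in the text, and the “other variations” are not covered because they are not listed here.

```python
from urllib.parse import urlparse

# Only the two words named in the article; other variations are unknown
GLOSSARY_HINTS = ("glossary", "dictionary")


def looks_like_glossary_url(url):
    """Return True if the host or path contains a glossary-style word."""
    parsed = urlparse(url)
    haystack = (parsed.netloc + parsed.path).lower()
    return any(hint in haystack for hint in GLOSSARY_HINTS)


print(looks_like_glossary_url("http://www.insurancenoodle.com/glossary.asp"))
print(looks_like_glossary_url(
    "http://www.sparkbb.com/free-forum-articles/forum-terminology.php"
))
```

The first URL matches and the second does not, which shows the limit of a two-word list: a “terminology” page is presumably one of the variations it misses.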

Google SERP Re-crawl – 1 Week Later
Because we wanted to keep things as accurate as we can and assure ourselves and our readers that the data used in this research are representative, we re-ran the analysis one week after the initial research. The results made us think even more about how the answer box algorithm really works (as if we weren’t already) but at the same time confirmed the correctness of the initial investigation. After this re-crawl of the 10 353 keywords, we found 120 new answer boxes, 127 disappearances and 13 answer boxes with their status changed. Of the new answer boxes, a large majority (about 80%) are Web Extractions and just a few are Google Widgets. Judging by the fact that in our sample alone we found more than 100 new answer boxes, we might say that the answer box is a growing “industry” and Google might soon offer answer boxes for more queries.
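Comparing two crawls in this way comes down to a simple diff. Here is a minimal sketch, assuming each crawl is stored as a dictionary that maps a keyword to the type of answer box it returned and that keywords without an answer box are left out. The data layout and the sample entries are assumptions for illustration, not the actual crawl data.

```python
def diff_snapshots(before, after):
    """Compare two crawls and return (new, disappeared, changed) keywords."""
    new = sorted(k for k in after if k not in before)
    disappeared = sorted(k for k in before if k not in after)
    changed = sorted(
        k for k in before if k in after and before[k] != after[k]
    )
    return new, disappeared, changed


# Made-up entries, purely to show the three outcomes
first_crawl = {
    "query a": "Web Extraction",
    "query b": "Web Definition",
    "query c": "Google Widget",
}
second_crawl = {
    "query a": "Web Extraction",
    "query b": "Google Widget",
    "query d": "Web Extraction",
}

new, disappeared, changed = diff_snapshots(first_crawl, second_crawl)
```

Here `query d` is new, `query c` has disappeared and `query b` has changed status, mirroring the three counts reported above.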
Let’s shift our focus a bit to the answer boxes that disappeared. There may be multiple reasons for the 127 disappearances, and we cannot be 100% sure what really happened. But we have some well-founded assumptions. The first is that Google is doing some A/B testing. It’s very likely that the big G takes into consideration the bounce rate, the click-through rate and the user’s experience overall, and chooses to keep or remove answer boxes depending on these factors. I think that they are actively A/B testing the Google Answer Boxes because sometimes they appear and sometimes they do not for the same search. Google does a lot of testing in the SERPs and, with answer boxes being such an important part of them right now, it might apply the same tactics.
Our second supposition is based on a situation that we encounter quite often: Google does not always return the same results for the same search query, answer box included. That is, for the same “what is…” query, keeping the same search settings, sometimes we received an answer box and sometimes we didn’t. The mysterious vanishing of the 127 answer boxes may originate from here.
As for the answer boxes with a changed status, we can see that a very small number underwent modifications. Most of these few adjustments concern transformations from Web Definitions into Google Widgets and vice versa.

Conclusion
Google Answer Boxes might be quite controversial, as the Google Search user interface lets Google’s users view and copy content without visiting the content provider’s website. In addition to losing traffic, webmasters might also be a bit “upset” that their perfectly well-functioning website doesn’t appear in the answer box while some broken site that doesn’t exist anymore is taking up the space. Double-ouch for Google answer boxes!
Yet we cannot help seeing things from the user’s point of view. If the answer box has accurate information, it gives users better usability by sparing them another click or providing a shortcut to the final action they need to take. If, for instance, you urgently need to make a payment and you want to know how much $127 is in euros, all you have to do is “tell” Google to “convert 127$ to euro” and you’ll have the result in an instant. Not long ago, for the same operation you needed to consult a currency exchange site and then manually calculate the amount you were interested in.
The fact that 80% of the newly emerged answer boxes in our second analysis come from Web Extractions gives webmasters fresh hope. Judging by this information, we can say that Google is looking more and more at the definitions provided by high-quality websites, giving webmasters the chance to have their site mentioned in the first row, above all the search results. As we showed previously, answer boxes extracted from websites are more accurate and provide the user with a better experience. So Google relying more on various websites as a source for the answer box is a win-win situation.
As we mentioned in this article, there are several issues with the answer boxes. The most important ones we feel the need to emphasize are that the results generated are quite static and sometimes not relevant, even though they are mostly reserved for high-quality sites. These issues can be a big enough obstacle for webmasters who wish and (maybe) deserve to be listed in the answer box. It is indeed hard work, but not impossible. Proving to Google that your site is trustworthy and an authority in its field is much easier said than done, but it pays off in the long term. Moreover, following some of the tips we came up with in a previous post on how to optimize for the Google Answer Box might also be really helpful.
Very valuable research here, but I’m mostly amused by the expired domain experiment! Why would Google rank a site that was dead for 2 months in its answer box… Looking forward to an update in the future, excellent work.
I was too :). it practically blew my mind and made me draw the conclusion that that kind of Web Definitions are buggy and basic.
I am glad you included the expired domain part in the post too Razvan – I guess it would’ve been easy to leave out! Fascinating stuff. Thanks for research you put in here.
tks for the comment Bruce. each part of the article has a conclusion. that part with the expired domain shows that the “Web Definitions”, as defined as a widget, are easily exploitable and are kind of basic in terms of extraction. Shows that once you get there you stay there…
Hi Raz,
Nice bit of analysis. I went through this with my co-worker and we were a little baffled at why the site in the search for “bank of england rates definition” is the Google Web Definition source. It must be completely random, which is odd.
hi Jonathan, I see a Wikipedia Extraction here which seem accurate as a definition. are you not seeing the same thing?
“The official bank rate (also called the Bank of England base rate or BOEBR) is the interest rate that the Bank of England charges Banks for secured overnight lending. It is the British Government’s key interest rate for enacting monetary policy.
Official bank rate – Wikipedia, the free encyclopedia
en.wikipedia.org/wiki/Official_bank_rate
Wikipedia”
Does the answer box only show up when you type in a question? Thanks for this interesting research by the way.
no. but on question/answer like queries . ex: “price of passport” would be the intent of asking ” what is the price of passport”.
I wonder if it’s possible to penalise Google with a duplicate content penalty. I also wonder what would happen if I disavowed Google.
What they are doing here is effectively getting the credit and benefit for/of my content. Sure, they politely acknowledge me with a link, but it’s a bit like me going down to MacDonald’s, buying 100 burgers then taking them to the local park and selling them off from the back of my truck.
nothing will happen
you could use the robots.txt and not allow them to index your site. since you are on their property there they can practically do whatever they like.
and i do not think it is in the bad interest of the visitor. what they are doing simplifies the user experience and boosts your authority.
Wow, a lot of research was put in to this excellent article. Thank you for taking the time. I am amazed at unregistered domain ranking as it did. If one of my domains became unregistered, you can bet it would fall. Like the next day! I can only hope it’s due to Web Definitions working out the kinks. Thanks again.
tks for the appreciation Vicki. it seems that old type of web definitions are kind of buggy.
Fantastic post! In addition to expired domains, 404 pages can also be showing in answer box. For the search query “what is Equipment Breakdown Coverage”, the big G is returning “Equipment Breakdown Coverage
Web definitions
Pays to repair or replace various types of due to breakdown, rupture or bursting, or artificially generated electric current. Covers computers, scanning equipment, phone systems, air conditioners, refrigeration systems and many other types of equipment. …
http://www.insurancenoodle.com/glossary.asp”
But the page http://www.insurancenoodle.com/glossary.asp is a 404. Plus, it’s not being indexed by Google right now. So we can conclude that Google answer box is using another Index other than the conventional one.
it seems. they have a problem with those type of Web Definitions. Looks like they are not updated. Probably they appear in a very small fraction of the results and they ignore it.