Leak of internal Google documents reveals core workings of ranking algorithm

Thousands of internal Google documents from its Content API Warehouse have been shared without the tech giant’s consent.

These documents, initially shared with Rand Fishkin, co-founder of SparkToro, offer an unprecedented look into Google’s ranking algorithm that is invaluable for SEO professionals.

The documents reveal significant details about Google’s ranking system. As of March, the documentation includes 2,596 modules with 14,014 attributes.

However, the documents do not specify the weightings of these features. Notably, the documents introduce “Twiddlers,” which are re-ranking functions that adjust the information retrieval score of a document or change its ranking. Additionally, content can be demoted for various reasons, including user dissatisfaction, product reviews, location, and exact match domains.
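To make the idea of a re-ranking function concrete, here is a minimal Python sketch. It is an illustration only, not Google’s implementation: the `Result` class, the `demote_exact_match_domain` function and the 0.5 multiplier are all hypothetical, since the leaked documents give no weightings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    url: str
    score: float  # initial information retrieval score
    exact_match_domain: bool = False


# In this sketch a "twiddler" is a function that returns a score multiplier.
Twiddler = Callable[[Result], float]


def demote_exact_match_domain(result: Result) -> float:
    # Hypothetical demotion factor; the real weighting is not in the leak.
    return 0.5 if result.exact_match_domain else 1.0


def rerank(results: List[Result], twiddlers: List[Twiddler]) -> List[Result]:
    """Apply every twiddler to every result, then sort by adjusted score."""
    adjusted = []
    for r in results:
        score = r.score
        for twiddle in twiddlers:
            score *= twiddle(r)
        adjusted.append(Result(r.url, score, r.exact_match_domain))
    return sorted(adjusted, key=lambda r: r.score, reverse=True)


results = [
    Result("keyword-keyword.example", 0.9, exact_match_domain=True),
    Result("brand.example", 0.7),
]
reranked = rerank(results, [demote_exact_match_domain])
```

In this toy example the exact match domain starts with the higher score but ends up below the other result once the demotion is applied.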

An intriguing aspect is that Google maintains a record of every version of every page it has ever indexed, although only the last 20 changes are used when analysing links.

Links remain crucial for ranking, with link diversity and relevance being key factors. PageRank is still a significant element in Google’s ranking features, especially for a website’s homepage.

The documents also emphasise the importance of user interactions with search results. Google measures various types of clicks, such as ‘badClicks’, ‘goodClicks’, ‘lastLongestClicks’, and ‘unsquashedClicks’, to determine content quality and user satisfaction. Longer documents might be truncated, while shorter content receives a score based on originality. Specific content types, such as health and news, receive additional scoring.

Rand Fishkin’s analysis highlights the importance of building a strong brand, suggesting that brand recognition significantly impacts organic search rankings. Additionally, Google tracks author information and attempts to determine whether an entity authored a document. This points to the relevance of authorship in content ranking.

Another significant feature mentioned is “siteAuthority,” which Google uses to assess the overall quality of a site. This was confirmed in the 2011 Panda update but has been denied in subsequent years. The documents also reveal that Google uses data from its Chrome browser for ranking purposes. Furthermore, Google has whitelists for certain domains related to elections and COVID-19, indicating preferential treatment under specific circumstances.

Other notable findings include the importance of content freshness, as Google considers byline dates, URL dates, and on-page content dates. The company also vectorises pages and sites to compare page embeddings with site embeddings, helping to determine core topics of a website.
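One common way to compare a page embedding with a site embedding is cosine similarity. The sketch below assumes, purely for illustration, that a site embedding is the average of its page embeddings; the leaked documents do not describe how Google builds or compares these vectors, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def site_embedding(page_embeddings: List[List[float]]) -> List[float]:
    # Assumption: the site vector is the mean of its page vectors.
    count = len(page_embeddings)
    return [sum(column) / count for column in zip(*page_embeddings)]


# Toy two-dimensional embeddings: two pages on one topic, one page off-topic.
pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
site = site_embedding(pages)
similarities = [cosine_similarity(page, site) for page in pages]
```

Here the third page scores lowest against the site vector, which is the kind of signal that could flag content sitting outside a website’s core topics.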

Google stores domain registration information and measures the average weighted font size of terms in documents and anchor text. Page titles are still relevant, with a feature called ‘titlematchScore’ evaluating how well a page title matches a query.
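The documents name ‘titlematchScore’ but do not explain how it is calculated. As a rough illustration of what matching a title against a query could look like, the sketch below scores the share of query terms that appear in the title. The function name `title_match_score` and the term overlap method are assumptions, not the leaked feature’s definition.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_match_score(title: str, query: str) -> float:
    # Hypothetical scoring: fraction of query terms found in the title.
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(title)) / len(query_terms)


full_match = title_match_score("Best Running Shoes for 2024", "running shoes")
partial_match = title_match_score("Best Running Shoes", "trail running shoes")
```

A title containing every query term scores 1.0 under this scheme, while a title missing one of three terms scores about 0.67.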

Why is this important?

These leaked documents provide deep insights into Google’s ranking factors, revealing the complexity and multifaceted nature of its search algorithm. They also underscore the continued importance of good SEO practices: writing quality content, keeping users satisfied, earning diverse links, building brand recognition, and using data strategically to understand search rankings.
