Decoding Google's Algorithm Leak: Key Insights
As many SEOs have already done, I have spent the last 24 hours going through the code, learning what is in it, reading what others have written, and trying to compile all the interesting and useful information from the recent Google leaks.
In the following write-up, I am going with the assumption that attributes mentioned in the leak are applicable to the search engine algorithm, while staying aware that this could be merely a list of API calls rather than a full algorithm code dump.
Despite potential doubts about whether these details fully represent Google’s algorithm, the leak offers a fascinating glimpse into the internal mechanics of the search giant. It's even more intriguing than last year's Yandex leak.
This leak could validate long-held SEO strategies for some, reinforcing their existing approaches. For others, it may prompt new experiments and methods to enhance search visibility. As the SEO community digests and debates this information, I'm eager to explore and share everything I learn from this extensive leak. Here are the key takeaways:
Freshness of content
The leaked code snippets reveal that Google places strong emphasis on content freshness and uses a variety of date-related signals to determine a page's recency and relevance. The key takeaway from the leak is this:
Indeed, the leak aligns with long-standing SEO principles about consistency in date presentation. It reiterates the importance of keeping date information uniform across the various parts of a webpage, which can be crucial for search performance. This principle was underscored by the "Freshness Update" Google rolled out back in 2011, which reportedly affected roughly 35% of search queries. By maintaining consistent date markers, webmasters help ensure that search engines can accurately interpret and rank their content, particularly for queries where timeliness matters. The leak simply underscores this ongoing best practice, reminding content creators and SEO specialists to pay close attention to how date information is managed across their sites.
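As a practical illustration of that takeaway, here is a minimal sketch of a date-consistency check. The sources it compares (a visible byline date, a JSON-LD datePublished value, a sitemap lastmod entry) and the function name are my own choices for the example, not attribute names from the leak.

```python
from datetime import date

# Hypothetical example: the dates a single article exposes in different places.
# In a real audit these would be parsed from the HTML, the JSON-LD block,
# and the XML sitemap; here they are hard-coded for illustration.
page_dates = {
    "visible_byline": date(2024, 3, 14),        # date printed next to the author name
    "jsonld_datePublished": date(2024, 3, 14),  # structured-data publication date
    "sitemap_lastmod": date(2024, 3, 14),       # <lastmod> entry in the XML sitemap
}

def dates_are_consistent(dates: dict) -> bool:
    """Return True if every date source on the page reports the same day."""
    return len(set(dates.values())) == 1

if __name__ == "__main__":
    if dates_are_consistent(page_dates):
        print("All date signals agree.")
    else:
        print("Date mismatch found:", page_dates)
```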
Click-through rate and Chrome usage
Recent revelations from Google's internal API documents have shed light on the search giant's use of click data and Chrome user behavior in its ranking algorithms. These findings confirm what many SEO experts have long suspected - that Google leverages user engagement signals to help determine search result rankings and SERP features. The documents specifically mention components like "Navboost" and "Glue," which incorporate metrics such as "goodClicks," "badClicks," and "chrome_trans_clicks" to filter and weight different types of user interactions.
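To make the idea concrete, here is a minimal sketch of how good and bad click counts might be combined into a single boost value. The formula and the `click_boost` function are my own assumption purely for illustration; the leak names attributes such as goodClicks and badClicks, but not how they are weighted or combined.

```python
def click_boost(good_clicks: int, bad_clicks: int) -> float:
    """Hypothetical: turn good/bad click counts into a boost in [0, 1].
    This weighting is an illustrative assumption, not Google's actual logic."""
    total = good_clicks + bad_clicks
    if total == 0:
        return 0.5  # neutral when there is no engagement data yet
    return good_clicks / total

# A page with mostly "good" clicks gets a boost near 1.0; one with mostly
# "bad" clicks (quick bounces back to the SERP) gets a value near 0.0.
print(click_boost(good_clicks=120, bad_clicks=30))  # 0.8
```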
Google's use of attributes like "goodClicks" and its incorporation of browsing data from Chrome users allow it to measure, at a granular level, how users interact with search results and websites. Pages that generate more clicks from the search results and keep users engaged for longer are interpreted as more relevant and trustworthy, leading to higher rankings. Rand Fishkin conducted experiments in 2015 that pointed toward similar conclusions. Fishkin's tests aimed to demonstrate the impact of clicks and engagement on search rankings, and he observed positive results, bolstering his argument that Google uses these signals in its ranking algorithms.
Despite his findings, Fishkin faced significant criticism, including attacks on social media and professional forums, for asserting that Google might not fully disclose the specifics of its ranking factors.
Adding to the controversy, Gary Illyes notably dismissed metrics such as dwell time and click-through rate, which Rand Fishkin had highlighted, as "generally made up crap," suggesting that search is simpler than commonly thought.
This comment highlights the ongoing debate within the SEO community about Google's transparency around its ranking processes. The recent leak, which points to Google's use of engagement metrics, directly contradicts Illyes's statements and hints at a more complex picture in which engagement metrics might indeed influence search rankings, contrary to Google's simpler public portrayal.
My take on the future: in this environment, and especially after this revelation, SEOs have a renewed and strong incentive to artificially inflate their CTR and on-site engagement metrics through various means, a practice known as CTR manipulation.
However, taking this information with a hefty grain of salt is crucial. While the revealed metrics suggest Google is using click data, engaging in blatant CTR manipulation is still likely to be detected.
Links
In light of the recent leak detailing Google's algorithmic strategies, the importance of links in the current search ranking landscape has been further confirmed. The leak provides deeper insight into how Google's algorithms continue to evaluate and prioritize high-quality links as crucial indicators of content credibility and authority.
Expanding on some of the additional details I noted:
Beyond the core link quality metrics, Google tracks an extensive array of data points for each link. These include the link text and surrounding context, placement and styling on the page, first-seen and last-seen dates, whether the link has been removed, the language and country of the linking page, and experimental groupings. Together, these attributes help Google evaluate each link's relevance, freshness, and validity.
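To visualize the kind of per-link record the leak describes, here is a hedged sketch. The field names loosely paraphrase the attributes listed above; they are not the exact identifiers from the leaked modules.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LinkRecord:
    """Illustrative per-link record, loosely modeled on the attributes
    described above; field names are paraphrases, not leaked identifiers."""
    anchor_text: str           # the visible link text
    surrounding_context: str   # text immediately before/after the anchor
    placement: str             # e.g. "main_content", "footer", "sidebar"
    first_seen: date           # when the crawler first observed the link
    last_seen: date            # most recent crawl that still found it
    is_removed: bool           # link no longer present on the source page
    source_language: str       # language of the linking page
    source_country: str        # country of the linking page

example = LinkRecord(
    anchor_text="leaked ranking attributes",
    surrounding_context="a full breakdown of the leaked ranking attributes",
    placement="main_content",
    first_seen=date(2024, 5, 28),
    last_seen=date(2024, 6, 2),
    is_removed=False,
    source_language="en",
    source_country="US",
)
print(example.anchor_text, example.first_seen)
```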
Google's Exact Match Domain Demotion
Recent leaks of Google's internal code have revealed interesting details about how the search engine implements the Exact Match Domain (EMD) demotion. Central to this is the exactMatchDomainDemotion score, an integer value between 0 and 1023 assigned to each web page. This score is converted from a float between 0 and 1, which likely represents the strength of low-quality signals associated with the exact match domain.
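The float-to-integer conversion described above is straightforward to sketch. Assuming the stored value is simply the float score scaled linearly into the 0-1023 range (my assumption; the leak documents the two value ranges, not the exact conversion), it would look like this:

```python
def to_demotion_bucket(score: float) -> int:
    """Map a float demotion strength in [0, 1] to an integer in [0, 1023].

    Assumption: a plain linear scaling into 10 bits. The leak documents the
    two value ranges but not the exact conversion Google uses.
    """
    score = min(max(score, 0.0), 1.0)  # clamp to the documented range
    return round(score * 1023)

print(to_demotion_bucket(0.0))   # 0    -> no EMD demotion
print(to_demotion_bucket(0.5))   # 512  -> moderate demotion
print(to_demotion_bucket(1.0))   # 1023 -> strongest demotion
```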
Pages with a high demotion score are pushed down in rankings, while quality pages with an EMD are unaffected. The code confirms that Google uses a nuanced scoring system to assess EMD page quality, rather than just penalizing all exact match domains.
In practice, it seems Google may calculate an EMD demotion score for each page based on signals about the page's quality and whether it has an exact match domain. This score is then converted to an integer and stored in the index. When ranking pages for a query, this demotion score can be factored in to lower the rankings of low-quality EMD pages. This information isn't exactly groundbreaking: it echoes the concerns addressed during Google's 2012 update targeting Exact Match Domain (EMD) manipulation. Matt Cutts even warned about it beforehand and tweeted about the update after it rolled out.
Expired domain abuse
Google's Search Quality Evaluator Guidelines specifically call out expired domain abuse as a spam tactic and provide illustrative examples of it.
To combat this type of abuse, Google appears to use several attributes related to a domain's registration history, as revealed by recent leaks of their internal code.
Here's how these attributes might come into play:
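For illustration only, here is a minimal sketch of one way a createdDate/expiredDate pair could feed such a check. The heuristic, the one-year threshold, and the function name are my own assumptions, not logic from the leak.

```python
from datetime import date
from typing import Optional

def looks_like_expired_domain_abuse(
    domain_created: date,
    domain_expired: Optional[date],
    oldest_content_date: date,
) -> bool:
    """Hypothetical heuristic: flag domains whose registration history suggests
    they lapsed and were re-registered to trade on the old domain's reputation.
    The rule and threshold are illustrative assumptions, not leaked logic."""
    # A recorded expiry followed by a newer creation date implies the domain
    # dropped and was re-registered by a new owner.
    re_registered = domain_expired is not None and domain_created > domain_expired
    # Content that predates the current registration by more than a year is suspicious.
    content_predates_owner = (domain_created - oldest_content_date).days > 365
    return re_registered and content_predates_owner

print(looks_like_expired_domain_abuse(
    domain_created=date(2023, 8, 1),
    domain_expired=date(2023, 6, 1),
    oldest_content_date=date(2016, 2, 10),
))  # True in this illustrative case
```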
In summary, the createdDate and expiredDate attributes provide Google with valuable data points for identifying and combating expired domain abuse at scale. By incorporating domain registration history into its ranking algorithms, Google can suppress the rankings of domains that are likely being used manipulatively while still allowing legitimate domain ownership transfers and repurposing.
Authors
Despite my personal skepticism about whether expertise can be scored effectively, given how abstract the concept is, Google's technology for associating documents with their authors suggests a tangible approach to this challenge.
Google's explicit storage of author information as text, and its check of whether an entity on a page is also the author, are indicative of its strategy to authenticate content and reinforce the importance of authoritative sources. This is not just about recognizing an author's name; it is about understanding the depth of their expertise, which could significantly impact the content's ranking and trustworthiness.
Using vector embeddings to map authors and their work represents a significant advancement. Vector embeddings allow for the representation of text, including names and content, in multi-dimensional space, capturing the contextual and semantic relationships between words far beyond simple keyword matching. This method enables Google to scale its assessment of authorship across the web despite the previously noted lack of author markup on many web pages.
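To illustrate what mapping authors and their work with vector embeddings can mean in practice, here is a minimal sketch that compares an author embedding to a document embedding with cosine similarity. The toy vectors and the idea that this similarity feeds an authorship or expertise signal are my reading of the paragraph above, not a confirmed mechanism.

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: in reality these would come from a learned text-embedding model.
author_embedding = [0.8, 0.1, 0.3]    # built from the author's body of work
document_embedding = [0.7, 0.2, 0.4]  # built from the page being evaluated

similarity = cosine_similarity(author_embedding, document_embedding)
print(f"author/document similarity: {similarity:.3f}")
# A high similarity suggests the document sits squarely inside the author's
# usual area of expertise; a low one suggests it does not.
```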
Furthermore, by mapping entities thoroughly and linking them to their respective authors, Google can enhance its ability to effectively verify and score the E-E-A-T attributes. This approach ensures that content comes from credible sources and is contextually relevant and authored by recognized experts in the field. Such a method could lead to more refined and reliable search results, aligning closely with Google's ongoing push to prioritize high-quality and trustworthy content in their search results.
The RepositoryWebrefDetailedEntityScores model reinforces these capabilities by providing a comprehensive tool for assessing the significance and relevance of entities within documents. Attributes like connectedness, docScore, isAuthor, isPublisher, isReferencePage, normalizedTopicality, profileUrl, referencePageScores, and relevanceScore work together to paint a detailed picture of how entities, including authors, relate to and influence the content of a document.
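As a reading aid for that attribute list, here is a sketch of the shape such a record could take. The field types are guesses inferred from the attribute names; the leaked documentation remains the authoritative source for their exact meaning.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DetailedEntityScores:
    """Sketch of a RepositoryWebrefDetailedEntityScores-style record.
    Field types are guesses based on the attribute names in the leak."""
    connectedness: float          # how strongly the entity links to other entities in the doc
    docScore: float               # overall importance of the entity to the document
    isAuthor: bool                # the entity is the document's author
    isPublisher: bool             # the entity is the document's publisher
    isReferencePage: bool         # the document is a reference page for the entity
    normalizedTopicality: float   # topical relevance, normalized across entities
    profileUrl: str               # URL of the entity's profile page, if any
    referencePageScores: List[float] = field(default_factory=list)
    relevanceScore: float = 0.0

entity = DetailedEntityScores(
    connectedness=0.72,
    docScore=0.9,
    isAuthor=True,
    isPublisher=False,
    isReferencePage=False,
    normalizedTopicality=0.64,
    profileUrl="https://example.com/authors/jane-doe",
)
print(entity.isAuthor, entity.docScore)
```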
By combining the explicit storage of author information, the ability to identify if an entity is the author of a page, the use of vector embeddings for scalable author mapping, and the detailed entity scoring provided by the RepositoryWebrefDetailedEntityScores model, Google has a robust system in place to evaluate authorship and authority effectively. This multi-faceted approach allows them to move beyond the limitations of relying solely on author markup. It provides a more reliable way to incorporate E-E-A-T signals into their search ranking algorithms. I believe that Google's approach, using the RepositoryWebrefDetailedEntityScores model to enhance the evaluation of authorship and authority, is a positive direction for the future.
As the internet continues to expand and evolve, particularly with the increasing presence of AI-generated content, it becomes more important to have robust mechanisms to assess information quality and reliability. This methodology not only improves the user experience by ensuring access to trustworthy content but also upholds the integrity of the internet as an essential resource for accurate knowledge and information. Such advancements in search technology are crucial for navigating the complexities of the digital age effectively.
Alexandria
Alexandria" refers to a part of Google's indexing infrastructure, although Google has not publicly detailed its specifics. It's suggested that Alexandria handle the generation and storage of metadata related to web documents, which could include details like URLs and indexing status. Alexandria is also involved in generating cluster IDs used to de-duplicate content in search results, ensuring a diverse and relevant content presentation to users.
Topical Authority
The attributes PageEmbedding, SiteFocusScore, and SiteRadius provide valuable insight into Google's approach to evaluating site and page relevance, particularly in light of the Helpful Content Update. They help in understanding how Google assesses content quality and shed light on the importance of topical authority in SEO. The leaked documents reveal that attributes like SiteFocusScore, SiteRadius, SiteEmbeddings, and PageEmbeddings play pivotal roles in ranking and in establishing topical authority.
SiteEmbeddings and PageEmbeddings: These semantic vectors represent the overarching themes of the site and the specific content of individual pages, respectively. They allow Google to assess the thematic alignment between a site's declared focus and the actual content it publishes. PageEmbeddings are evaluated against SiteEmbeddings to ensure consistency and relevance to the core topics.
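One common reading of these attributes is that the site-level embedding behaves like a centroid of the page embeddings, SiteFocusScore reflects how tightly the pages cluster around it, and SiteRadius reflects how far they spread. The sketch below implements that interpretation; it is my reading of the attribute names, not a confirmed formula.

```python
import math

def centroid(vectors):
    """Average the page embeddings into a single site-level embedding."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy page embeddings; real ones would come from a learned embedding model.
page_embeddings = [
    [0.80, 0.10, 0.30],
    [0.78, 0.12, 0.28],
    [0.75, 0.15, 0.33],
]

site_embedding = centroid(page_embeddings)
distances = [distance(p, site_embedding) for p in page_embeddings]

# Interpretation (assumed): the further pages sit from the site centroid,
# the larger the "radius" and the weaker the topical focus.
site_radius = max(distances)
site_focus_score = 1.0 / (1.0 + sum(distances) / len(distances))

print(f"siteRadius ~ {site_radius:.4f}, siteFocusScore ~ {site_focus_score:.4f}")
```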