Decoding Google's Algorithm Leak: Key Insights

As many SEOs have already done, I have spent the last 24 hours going through the code, learning what is in it, reading what others have written, and trying to compile all the interesting and useful information from the recent Google leaks.

In the following write-up, I am going with the assumption that attributes mentioned in the leak are applicable to the search engine algorithm, while staying aware that this could be merely a list of API calls rather than a full algorithm code dump.

Despite potential doubts about whether these details fully represent Google’s algorithm, the leak offers a fascinating glimpse into the internal mechanics of the search giant. It's even more intriguing than last year's Yandex leak.

This leak could validate long-held SEO strategies for some, reinforcing their existing approaches. For others, it may prompt new experiments and methods to enhance search visibility. As the SEO community digests and debates this information, I'm eager to explore and share everything I learn from this extensive leak. Here are the key takeaways:

Freshness of content

The leaked code snippets reveal that Google strongly emphasizes the freshness of content and utilizes various date-related signals to determine the recency and relevance of a page. Here's a breakdown of the key points from the leak:

  1. bylineDate: This refers to the date explicitly mentioned on the page, which should accurately reflect when the content was published or last significantly updated.
  2. syntacticDate: Google extracts dates from the URL or the title of the page. This suggests that including a date in these elements can help Google understand the freshness of the content.
  3. semanticDate and semanticDateConfidence: These attributes imply that the dates derived from the content should reflect the current and relevant information, and Google has a measure of confidence in the accuracy of these derived dates.
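
To make this actionable, here is a minimal sketch of the kind of date-consistency check you could run on your own pages. The attribute names mirror the leak, but the extraction logic and the one-day tolerance are illustrative assumptions, not Google's actual implementation.

```python
import re
from datetime import date

# Hypothetical consistency check inspired by the leaked attribute names
# (bylineDate, syntacticDate, semanticDate). The regex, the tolerance,
# and the function names are my own assumptions, not Google's code.

URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

def syntactic_date(url: str) -> date | None:
    """Extract a date embedded in the URL path, if one is present."""
    m = URL_DATE.search(url)
    return date(int(m.group(1)), int(m.group(2)), int(m.group(3))) if m else None

def dates_consistent(byline: date, url: str, semantic: date, tolerance_days: int = 1) -> bool:
    """True if the visible byline date, URL date, and content-derived date agree."""
    candidates = [d for d in (syntactic_date(url), semantic) if d is not None]
    return all(abs((byline - d).days) <= tolerance_days for d in candidates)

page = {
    "bylineDate": date(2024, 5, 28),
    "url": "https://example.com/blog/2024/05/28/google-leak/",
    "semanticDate": date(2024, 5, 28),
}
print(dates_consistent(page["bylineDate"], page["url"], page["semanticDate"]))  # True
```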

Indeed, the leak aligns with long-standing principles of search engine optimization (SEO) regarding consistency in data presentation. It reiterates the importance of having uniform date information across various parts of a webpage, which can be crucial for search performance. This principle was underscored by the "Freshness Update" Google implemented back in 2011, affecting a substantial percentage of search queries. By maintaining consistent date markers, webmasters help ensure that search engines can accurately interpret and rank their content, particularly in searches where timeliness is key. The recent leak simply underscores this ongoing SEO best practice, reminding content creators and SEO specialists to pay close attention to how date information is managed across their sites.

Click through rate and Chrome usage

Recent revelations from Google's internal API documents have shed light on the search giant's use of click data and Chrome user behavior in its ranking algorithms. These findings confirm what many SEO experts have long suspected - that Google leverages user engagement signals to help determine search result rankings and SERP features. The documents specifically mention components like "Navboost" and "Glue," which incorporate metrics such as "goodClicks," "badClicks," and "chrome_trans_clicks" to filter and weight different types of user interactions.

Google's use of attributes like "goodclicks" and its incorporation of browsing data from Chrome users allow it to measure, at a granular level, how users interact with search results and websites. Pages that generate more clicks from the search results and keep users engaged for longer are interpreted as more relevant and trustworthy, leading to higher rankings. Rand Fishkin conducted experiments in 2015 that pointed towards similar conclusions. Fishkin's tests aimed to demonstrate the impact of clicks and engagement on search rankings. He observed positive results, bolstering his argument that Google uses these signals in its ranking algorithms.
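
Purely as an illustration, here is a toy sketch of how click attributes like "goodClicks" and "badClicks" could be folded into a single engagement signal. The leak names the attributes but not the formula, so the smoothing and weighting below are assumptions for demonstration only.

```python
# Toy engagement score built from the click attributes named in the leak.
# The Bayesian-style smoothing (prior, prior_weight) is an illustrative
# assumption, not Google's actual Navboost formula.

def engagement_score(good_clicks: int, bad_clicks: int,
                     prior: float = 0.5, prior_weight: int = 20) -> float:
    """Smoothed share of 'good' clicks, so sparse data doesn't swing the score."""
    return (good_clicks + prior * prior_weight) / (good_clicks + bad_clicks + prior_weight)

print(engagement_score(good_clicks=120, bad_clicks=30))   # well-received result
print(engagement_score(good_clicks=3, bad_clicks=1))      # too little data to be confident
```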

Despite his findings, Fishkin faced significant criticism, including attacks on social media and professional forums, for asserting that Google might not fully disclose the specifics of its ranking factors. 

Adding to the controversy, Gary Illyes notably dismissed metrics such as dwell time and click-through rate, which Rand Fishkin had highlighted, as "generally made up crap," suggesting that search is simpler than commonly thought.

This comment highlights the ongoing debate within the SEO community about Google's transparency around its ranking processes. The leak, by suggesting that Google does use engagement metrics, directly contradicts Illyes's statements and hints at a more complex scenario in which engagement metrics influence search rankings after all, contrary to Google's simpler public portrayal.

My take on the future: In this environment, especially after this revelation, SEOs have a renewed and strong incentive to artificially inflate their CTR and on-site engagement metrics through various means, a practice known as CTR manipulation. Some tactics that could be employed include:

  1. Click Baiting: Using alluring but misleading titles and page descriptions to entice more clicks, even if the content doesn't fully deliver on the promise.
  2. Buying Traffic: Acquiring cheap traffic through click farms, bots, or other means to simulate high CTRs and engagement.
  3. Browser Manipulation: Developing browser extensions or plugins that automatically click search results or engage with pages to fake user interactions.

However, taking this information with a hefty grain of salt is crucial. While the revealed metrics suggest Google is using click data, engaging in blatant CTR manipulation is still likely to be detected.

Links

In light of the recent leak detailing Google's algorithmic strategies, the importance of links in the current search ranking landscape has been further confirmed. This leak provides a deeper insight into how Google's sophisticated algorithms continue to evaluate and prioritize high-quality links as crucial indicators of content credibility and authority. Here are some key points from my initial summary:

  1. Link quality tiers: Google categorizes links into low, medium, and high-quality tiers based on the click data associated with the linking page. Pages that receive a high volume of clicks from trusted sources (e.g., Chrome user data) are considered high quality, while pages with few or no clicks are considered low quality. Links from higher-tier pages are given more weight in ranking calculations, while links from low-tier pages may be ignored entirely. This helps Google prioritize links from popular, user-validated sources.
  2. Indexing tier impact: In addition to link-specific quality tiers, Google also considers the overall quality tier of the linking page based on how it's indexed. "Fresh" pages that are frequently updated are also prioritized. This means that links from pages that are both intrinsically high quality (based on click data) and part of Google's "core" index are especially valuable for SEO. 

  3. Spam detection: By tracking the velocity of new links using spammy anchor text, Google can identify attempts to manipulate rankings through link schemes. This data allows Google to establish baseline link acquisition rates for a site and flag unnatural spikes in low-quality links. Importantly, Google may use this not just to penalize spam attempts but also to protect sites from negative SEO by competitors, by detecting and nullifying the impact of such attacks. We can learn something from this: given that Google benchmarks your site against competitors to set that baseline, we can systematically analyze those competitors, examine their link acquisition rates over the past six months, track the monthly gains, and calculate a mean figure. From this data, we can establish a conservative target for the maximum number of links you should aim to build each month (a sketch of this calculation follows the list below). This method keeps your link-building efforts aligned with industry norms and reduces the risk of triggering search engine penalties for unnatural link growth.
  4. Homepage signals: For new pages that haven't yet earned their own authority signals, Google relies on the strength of the site's homepage as a proxy. The homepage's PageRank and trust are used to infer the likely quality of new pages until those pages can be independently evaluated. In December 2016, John Mueller of Google clarified, in response to queries about the concept, that Google does not use "Domain Authority" as a ranking factor and instead evaluates pages based on numerous other signals; however, mentions of a "siteAuthority" attribute in the leak lead us to question that statement further.
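
As referenced in point 3, here is a minimal sketch of that competitor baseline calculation. The monthly figures would come from whatever backlink tool you use; the numbers below are made up for illustration, and the 0.8 safety factor is my own conservative assumption.

```python
from statistics import mean

# New referring links gained per month by three competitors over six months
# (hypothetical figures exported from a backlink tool).
competitor_links_per_month = {
    "competitor-a.com": [14, 18, 11, 20, 16, 15],
    "competitor-b.com": [9, 7, 12, 10, 8, 11],
    "competitor-c.com": [22, 25, 19, 28, 24, 21],
}

# Average monthly acquisition rate per competitor, then across the group.
per_site_avg = {site: mean(months) for site, months in competitor_links_per_month.items()}
group_avg = mean(per_site_avg.values())

# Conservative monthly cap: stay a bit below the group average.
conservative_cap = int(group_avg * 0.8)
print(f"Group average: {group_avg:.1f} links/month, conservative cap: {conservative_cap}")
```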

Expanding on some of the additional details I noted:

  • Parallel links: By tracking the number of links from one site to another beyond just the first link, Google gets a sense of the strength of the relationship or endorsement between those sites. Multiple links from a high quality site could be a strong signal, while a high volume of links from a low quality site could indicate spam.
  • URL fragments and locale: Tracking whether a link points to a specific section of a page (via URL fragment) or comes from a page targeted to a particular language or country gives Google more context about the role and relevance of that link. A link to a specific passage may be more valuable than a general page link, and links from "local" pages may be more relevant for searches in that language or locale.

Beyond the core link quality metrics, Google tracks an extensive array of data points for each link. This includes the link text and surrounding context, placement and styling on the page, first-seen and last-seen dates, whether the link has been removed, the language and country of the linking page, and experimental groupings. These attributes help Google evaluate each link's relevance, freshness, and validity. The full list follows:

  1. Link text and context:

  • origText: This attribute stores the original anchor text of the link, preserving capitalization and punctuation. By analyzing the exact text used in the link, Google can assess the relevance and context of the link.
  • fullLeftContext and fullRightContext: These attributes capture the full text surrounding the link on both sides. This additional context helps Google understand the topic and meaning of the link's content, providing valuable relevance signals.
  • context2: This is a hashed version of the terms near the link, offering another way for Google to evaluate the link's contextual relevance efficiently.

  2. Link placement and styling:

  • fontsize: Google tracks the font size used for the link text, as this can indicate the importance or emphasis placed on the link by the page author. Larger or more prominent links may be seen as stronger endorsements.
  • bucket: This attribute is used to classify or group links based on certain characteristics or quality metrics. By categorizing links in this way, Google can apply different evaluation rules or weightings to different types of links.

  3. Link freshness and removal:

  • firstseenDate and creationDate: These attributes track when a link was first discovered by Google, providing important context about the age and potential impact of the link over time. Older links are more stable endorsements.
  • deletionDate and expired: If a link is removed from a page or the linking page expires, Google tracks this information to maintain an up-to-date understanding of the current link graph. Removed or expired links may carry less weight or be disregarded entirely.

  4. Link language and location (note this for local SEO):

  • locality: This attribute captures the language or geographical origin of the linking page. Links from pages in the same language or country as the target page may be more relevant, especially for localized search queries.
  • targetUrlEncoding and compressedOriginalTargetUrl: By tracking the specific URL encoding and compressed version of the original link target, Google can efficiently process and deduplicate links from different language versions or variations of the same page.
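
Pulling the attributes above together, here is a sketch of how a single link record could be modelled. The field comments use the names from the leak; the types and the dataclass structure are my own guesses about a reasonable representation.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LinkRecord:
    """Illustrative per-link record; attribute names follow the leak, types are assumed."""
    orig_text: str                     # origText: anchor text with original casing/punctuation
    full_left_context: str             # fullLeftContext: text preceding the link
    full_right_context: str            # fullRightContext: text following the link
    context2: int                      # hashed version of the terms near the link
    fontsize: int                      # font size of the anchor text
    bucket: str                        # classification/quality grouping for the link
    firstseen_date: date               # firstseenDate: when Google first discovered the link
    creation_date: date                # creationDate
    deletion_date: date | None = None  # deletionDate: set if the link was removed
    expired: bool = False              # linking page has expired
    locality: str = "en-US"            # language/region of the linking page
    target_url_encoding: str = "utf-8" # targetUrlEncoding (assumed default)
```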

Google's Exact Match Domain Demotion

Recent leaks of Google's internal code have revealed interesting details about how the search engine implements the Exact Match Domain (EMD) demotion. Central to this is the exactMatchDomainDemotion score, an integer value between 0 and 1023 assigned to each web page. This score is converted from a float between 0 and 1, which likely represents the strength of low-quality signals associated with the exact match domain.
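
The float-to-integer conversion described above is a standard quantization step. A minimal sketch, assuming a simple linear scaling onto 0-1023 (the leak documents the ranges, not the exact mapping):

```python
def quantize_demotion(score: float, buckets: int = 1023) -> int:
    """Map a 0..1 demotion strength onto the 0..1023 integer stored per page."""
    score = min(max(score, 0.0), 1.0)   # clamp to the valid range
    return round(score * buckets)

print(quantize_demotion(0.0))   # 0    -> no demotion
print(quantize_demotion(0.5))   # 512  -> moderate demotion (rounded)
print(quantize_demotion(1.0))   # 1023 -> strongest demotion
```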

Pages with a high demotion score are pushed down in rankings, while quality pages with an EMD are unaffected. The code confirms that Google uses a nuanced scoring system to assess EMD page quality, rather than just penalizing all exact match domains.

In practice, it seems Google may calculate an EMD demotion score for each page based on signals about the page's quality and whether it has an exact match domain. This score is then converted to an integer and stored in the index. When ranking pages for a query, the demotion score can be factored in to lower the rankings of low-quality EMD pages. This information isn't exactly groundbreaking: it echoes the concerns addressed during Google's 2012 update targeting Exact Match Domain (EMD) manipulation. We even got a warning from Matt Cutts about this, and a tweet after the implementation.

Expired domain abuse

Google's Search Quality Evaluator Guidelines specifically call out expired domain abuse as a spam tactic, providing illustrative examples such as:

  • Affiliate content on a site previously used by a government agency
  • Commercial medical products being sold on a site previously used by a non-profit medical charity
  • Casino-related content on a former elementary school site

To combat this type of abuse, Google appears to use several attributes related to a domain's registration history, as revealed by recent leaks of their internal code.

Here's how these attributes might come into play:

  1. When a domain is newly registered or changes ownership, the createdDate is updated. This could trigger heightened scrutiny from Google's algorithms, as the new content may be unrelated to the domain's previous purpose. If the new content is deemed low-quality or manipulative, the domain may be sandboxed (i.e., its rankings suppressed) until it establishes trust and authority under the new ownership.
  2. The expiredDate indicates when a domain became available for re-registration. If a domain expired and was then quickly re-registered and populated with new content, this could be a red flag for expired domain abuse. Google's algorithms can compare the expiredDate to the createdDate to identify suspicious domains snatched up and repurposed immediately after expiring.
  3. As mentioned, Google has recently introduced policies specifically targeting expired domain abuse, as seen in its Search Quality Evaluator Guidelines. The use of createdDate and expiredDate attributes suggests that Google is not only providing guidance to human raters but also baking expired domain abuse detection into its automated algorithms.
  4. By comparing a domain's current content and purpose to its historical use (inferred from registration dates and other signals), Google can identify drastic shifts that may indicate an attempt to manipulate rankings.

In summary, the createdDate and expiredDate attributes provide Google with valuable data points for identifying and combating expired domain abuse at scale. By incorporating domain registration history into its ranking algorithms, Google can suppress the rankings of domains that are likely being used manipulatively while still allowing legitimate domain ownership transfers and repurposing.
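
A minimal sketch of how such a registration-history check could work using the createdDate and expiredDate attributes; the 30-day window and the topic-change flag are illustrative assumptions, not documented thresholds.

```python
from datetime import date

def looks_like_expired_domain_abuse(expired_date: date, created_date: date,
                                    topic_changed: bool, max_gap_days: int = 30) -> bool:
    """Flag domains re-registered shortly after expiring and repurposed for unrelated content."""
    quickly_reregistered = 0 <= (created_date - expired_date).days <= max_gap_days
    return quickly_reregistered and topic_changed

# A former charity domain picked up two weeks after expiring and refilled with casino content:
print(looks_like_expired_domain_abuse(
    expired_date=date(2024, 1, 1),
    created_date=date(2024, 1, 15),
    topic_changed=True,
))  # True
```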

Authors

Despite my personal skepticism about whether expertise can be scored effectively, given the abstract nature of the concept, Google's technology for associating documents with their authors suggests a tangible approach to this challenge.

Google storing author information explicitly as text, and checking whether an entity mentioned on a page is also its author, is indicative of a strategy to authenticate content and reinforce the importance of authoritative sources. This is not just about recognizing an author's name but about understanding the depth of their expertise, which could significantly impact the content's ranking and trustworthiness.

Using vector embeddings to map authors and their work represents a significant advancement. Vector embeddings allow for the representation of text, including names and content, in multi-dimensional space, capturing the contextual and semantic relationships between words far beyond simple keyword matching. This method enables Google to scale its assessment of authorship across the web despite the previously noted lack of author markup on many web pages.
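
To make the embedding idea concrete, here is a small sketch comparing an author embedding against a document embedding with cosine similarity. The toy vectors are invented, and nothing in the leak specifies which embedding model or similarity measure Google actually uses.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Toy 4-dimensional embeddings (real embeddings have hundreds of dimensions).
author_embedding = [0.12, 0.80, 0.05, 0.40]
document_embedding = [0.10, 0.75, 0.10, 0.45]

score = cosine_similarity(author_embedding, document_embedding)
print(f"Author/document similarity: {score:.3f}")  # a high value suggests topical overlap
```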

Furthermore, by mapping entities thoroughly and linking them to their respective authors, Google can enhance its ability to effectively verify and score the E-E-A-T attributes. This approach ensures that content comes from credible sources and is contextually relevant and authored by recognized experts in the field. Such a method could lead to more refined and reliable search results, aligning closely with Google's ongoing push to prioritize high-quality and trustworthy content in their search results.

The RepositoryWebrefDetailedEntityScores model reinforces these capabilities by providing a comprehensive tool for assessing the significance and relevance of entities within documents. Attributes like connectedness, docScore, isAuthor, isPublisher, isReferencePage, normalizedTopicality, profileUrl, referencePageScores, and relevanceScore work together to paint a detailed picture of how entities, including authors, relate to and influence the content of a document.
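
For reference, the attributes listed above could be modelled as a simple record like the one below. The field names come straight from the leak; the types are my assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class DetailedEntityScores:
    """Illustrative model of RepositoryWebrefDetailedEntityScores (types assumed)."""
    connectedness: float                 # how strongly the entity ties to others in the document
    docScore: float                      # overall document-level score for the entity
    isAuthor: bool                       # the entity is the page's author
    isPublisher: bool                    # the entity is the publisher
    isReferencePage: bool                # the page is a reference page for the entity
    normalizedTopicality: float          # topical relevance, normalized
    profileUrl: str                      # URL of the entity's profile page
    relevanceScore: float                # relevance of the entity to the document
    referencePageScores: list[float] = field(default_factory=list)
```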

By combining the explicit storage of author information, the ability to identify if an entity is the author of a page, the use of vector embeddings for scalable author mapping, and the detailed entity scoring provided by the RepositoryWebrefDetailedEntityScores model, Google has a robust system in place to evaluate authorship and authority effectively. This multi-faceted approach allows them to move beyond the limitations of relying solely on author markup. It provides a more reliable way to incorporate E-E-A-T signals into their search ranking algorithms. I believe that Google's approach, using the RepositoryWebrefDetailedEntityScores model to enhance the evaluation of authorship and authority, is a positive direction for the future.

 As the internet continues to expand and evolve, particularly with the increasing presence of AI-generated content, it becomes more important to have robust mechanisms to assess information quality and reliability. This methodology not only improves the user experience by ensuring access to trustworthy content but also upholds the integrity of the internet as an essential resource for accurate knowledge and information. Such advancements in search technology are crucial for navigating the complexities of the digital age effectively.

Alexandria

"Alexandria" refers to a part of Google's indexing infrastructure, although Google has not publicly detailed its specifics. The leak suggests that Alexandria handles the generation and storage of metadata related to web documents, which could include details like URLs and indexing status. Alexandria also appears to be involved in generating the cluster IDs used to de-duplicate content in search results, ensuring a diverse and relevant presentation of content to users.
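
A toy sketch of cluster-ID de-duplication as described above: documents sharing a cluster ID collapse to a single best-scoring representative. The cluster IDs, scores, and selection rule are invented for illustration.

```python
# (cluster_id, url, score) tuples standing in for candidate results.
candidates = [
    (101, "https://example.com/original-article", 0.92),
    (101, "https://mirror.example.net/copied-article", 0.75),
    (202, "https://another.example.org/different-topic", 0.88),
]

# Keep only the highest-scoring document per cluster ID.
best_per_cluster: dict[int, tuple[str, float]] = {}
for cluster_id, url, score in candidates:
    if cluster_id not in best_per_cluster or score > best_per_cluster[cluster_id][1]:
        best_per_cluster[cluster_id] = (url, score)

for url, score in best_per_cluster.values():
    print(url, score)   # one result per cluster survives de-duplication
```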

Topical Authority

The PageEmbedding, SiteFocusScore, and SiteRadius attributes provide valuable insight into Google's approach to evaluating site and page relevance, particularly in light of the Helpful Content Update. They help explain how Google assesses content quality and shed light on the importance of topical authority in SEO. The leaked documents reveal that attributes like SiteFocusScore, SiteRadius, SiteEmbeddings, and PageEmbeddings play pivotal roles in ranking and in establishing topical authority:

  1. SiteFocusScore: Quantifies how concentrated a website's content is on a particular topic. A higher score indicates a tighter thematic focus on a specific subject area, which ties directly into Google's preference for sites that demonstrate expertise in particular areas. Sites with a high focus score are likely treated as authoritative sources on their respective topics, so a strong SiteFocusScore can be crucial for establishing topical authority.
  2. SiteRadius: Measures how far the content of individual pages diverges from the core topics defined by the site's overall embedding. A smaller SiteRadius suggests that pages are closely aligned with the central themes, enhancing the site's overall coherence and strengthening its authority on those topics. Conversely, a larger radius indicates a broader range of topics, which could dilute the site's thematic strength unless it's managed carefully.

SiteEmbeddings and PageEmbeddings: These semantic vectors represent the overarching themes of the site and the specific content of individual pages, respectively. They allow Google to assess the thematic alignment between a site's declared focus and the actual content it publishes. PageEmbeddings are evaluated against SiteEmbeddings to ensure consistency and relevance to the core topics.
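
One plausible reading of these attributes, sketched below: the site embedding as the centroid of its page embeddings, SiteRadius as the average distance of pages from that centroid, and SiteFocusScore as a value that shrinks as the radius grows. The leak names the attributes but not the math, so every formula here is an assumption.

```python
import math

def centroid(vectors: list[list[float]]) -> list[float]:
    """Mean vector of the page embeddings (a stand-in for SiteEmbeddings)."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

# Toy 3-dimensional PageEmbeddings for four pages on one site.
page_embeddings = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.20, 0.70, 0.60],   # an off-topic page pulls the radius up
]

site_embedding = centroid(page_embeddings)
site_radius = sum(math.dist(p, site_embedding) for p in page_embeddings) / len(page_embeddings)
site_focus_score = 1.0 / (1.0 + site_radius)   # tighter sites score closer to 1

print(f"SiteRadius ~ {site_radius:.3f}, SiteFocusScore ~ {site_focus_score:.3f}")
```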

