    Google Search Algorithm Ranking Features Leaked

    We now have an unrivaled view into Google Search thanks to a cache of leaked internal Google documents that highlight some of the critical signals Google considers when ranking content.

    On March 13, an automated bot known as yoshi-code-bot posted thousands of documents on GitHub that appeared to originate from Google’s internal Content API Warehouse. SparkToro co-founder Rand Fishkin was given access to these documents earlier this month.

    Continue reading to learn what Fishkin and Michael King, CEO of iPullRank, have to say about the documents. King has also reviewed and analyzed them and will provide further analysis for Search Engine Land.

    Why it matters to us:

    For SEOs who know how to interpret it, this insight into Google’s potential ranking systems is priceless. One of the biggest stories of 2023 was the similarly unparalleled peek at Yandex Search ranking factors that a leak gave us.

    The leaked internal documentation for Google’s Content Warehouse API provides a window into Google’s search systems. The leak does not include specifics about scoring functions, but the detail it offers on how Google stores content, link, and user-interaction data is invaluable.

    Google’s Deceptive Claims

    Domain Power: The documentation discloses an attribute named “siteAuthority,” which suggests Google does measure sitewide authority despite its public denials.

    Clicks for results: Despite Google’s public denials, systems like NavBoost use click data to influence results.

    Sandbox: Although Google has denied the existence of a sandbox, the documentation refers to a “hostAge” attribute that is used to sandbox new sites.

    Chrome Data: The documentation indicates that Chrome data is used in ranking systems, despite Google’s denials.

    Architecture: Rather than a single algorithm, Google’s ranking system comprises many microservices. Critical systems include SuperRoot (query processing), Mustang (ranking), Alexandria (indexing), and Trawler (crawling).

    Twiddlers: Twiddlers are re-ranking functions that adjust search results just before they are shown to users. NavBoost, QualityBoost, and RealTimeBoost are a few examples; a rough sketch of the pattern follows below.
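
    To make the twiddler idea concrete, here is a minimal, hypothetical Python sketch of the pattern. The class, function, and signal names (Result, navboost_twiddler, quality_twiddler) are my own illustrative assumptions, not identifiers from the leak, and the real systems are certainly far more elaborate.

        from dataclasses import dataclass

        @dataclass
        class Result:
            url: str
            score: float          # base score from the core ranker (e.g., Mustang)
            clicks: int = 0       # hypothetical aggregated "good click" count
            quality: float = 1.0  # hypothetical sitewide quality multiplier

        def navboost_twiddler(results):
            # Boost results that historically attract clicks for this query (capped).
            for r in results:
                r.score *= 1.0 + 0.1 * min(r.clicks, 10)
            return results

        def quality_twiddler(results):
            # Scale each score by a sitewide quality signal.
            for r in results:
                r.score *= r.quality
            return results

        def rerank(results, twiddlers):
            # Apply each twiddler in turn, then re-sort just before serving the SERP.
            for twiddle in twiddlers:
                results = twiddle(results)
            return sorted(results, key=lambda r: r.score, reverse=True)

        ranked = rerank(
            [Result("a.com", 0.80, clicks=4), Result("b.com", 0.90)],
            [navboost_twiddler, quality_twiddler],
        )
        print([r.url for r in ranked])  # a.com overtakes b.com thanks to clicks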

    Implications for SEO:

    Panda Algorithm: Panda applies a scoring modifier at several levels (domain, subdomain, subfolder) based on user behavior and external links; see the sketch after this list.

    Authors: Google indicates the value of authorship in rankings by explicitly storing author information.

    Demotions: A number of demotions are applied, for factors such as exact-match domains, anchor mismatch, and SERP dissatisfaction.

    Links: Attributes like sourceType, which values a link differently depending on where the linking page is indexed, show that links remain significant today.

    Content: Google counts tokens and gauges the originality of short content, underscoring the value of putting important content first.

    Open Questions: The writer ponders whether “Baby Panda” is connected to the Helpful Content Update and what NSR (Neural Semantic Retrieval) would entail.

    Strategic Guidance: The author suggests producing excellent content, advertising it effectively, and continually trying new SEO techniques.
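
    To illustrate the multi-level modifier idea from the Panda point above, here is a hedged Python sketch. The modifier table, its values, and the function name panda_adjust are invented for illustration; the leak does not disclose actual values or formulas.

        from urllib.parse import urlparse

        # Hypothetical modifiers keyed by scope; values below 1.0 demote.
        MODIFIERS = {
            ("domain", "example.com"): 0.9,
            ("subdomain", "blog.example.com"): 1.1,
            ("subfolder", "example.com/reviews"): 0.8,
        }

        def panda_adjust(url: str, score: float) -> float:
            # Apply domain-, subdomain-, and subfolder-level modifiers to one score.
            parts = urlparse(url)
            host = parts.netloc
            domain = ".".join(host.split(".")[-2:])  # naive registrable-domain guess
            first_dir = parts.path.strip("/").split("/")[0]
            subfolder = f"{domain}/{first_dir}"
            for key in (("domain", domain), ("subdomain", host), ("subfolder", subfolder)):
                score *= MODIFIERS.get(key, 1.0)
            return score

        print(panda_adjust("https://blog.example.com/reviews/widget", 1.0))  # ~0.792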

    The leak gives a deeper view of Google’s ranking systems. It corroborates a number of long-held SEO theories, highlighting the significance of high-quality content, user engagement, and strategic link building.

    The design of Google’s ranking algorithms

    From a conceptual standpoint, “the Google algorithm” may be understood as a single, enormous equation with many weighted ranking factors. In practice, the SERP is assembled by a series of microservices whose numerous features are preprocessed and made available at runtime.
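
    As a rough mental model (my own notation, not an equation from the leak), the final score of a document d for a query q can be pictured as a weighted sum of feature scores, with many of the individual features precomputed by separate services:

        \mathrm{score}(d, q) = \sum_{i=1}^{n} w_i \, f_i(d, q)

    No single service evaluates this sum in one place; each microservice contributes its own precomputed f_i at runtime.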

    Approximately one hundred distinct ranking systems can be counted from the methods cited in the documentation. Presuming these are not all of them, it’s possible that each distinct system contributes a “ranking signal,” which may be how Google arrives at the 200 ranking signals so frequently mentioned.

    According to Jeff Dean’s talk, “Building Software Systems at Google and Lessons Learned,” earlier versions of Google routed each query through roughly 1,000 machines while keeping response time under 250 milliseconds. He also diagrammed an earlier abstraction of the system architecture, showing how SuperRoot functions as the hub of Google Search, fanning queries out to subsystems and assembling the final results.
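
    Here is a toy Python sketch of that scatter-gather pattern: a root server fans a query out to many index shards in parallel, enforces a deadline, and merges whatever comes back in time. The shard count, timings, and names (query_shard, super_root) are illustrative assumptions, not Google’s actual code.

        import asyncio
        import random

        async def query_shard(shard_id: int, query: str):
            # Stand-in for one index shard scoring the query against its slice.
            await asyncio.sleep(random.uniform(0.01, 0.20))  # simulated work
            return [(f"doc-{shard_id}-{i}", random.random()) for i in range(3)]

        async def super_root(query: str, shards: int = 1000, deadline: float = 0.25):
            # Fan out to every shard, wait at most `deadline` seconds, merge results.
            tasks = [asyncio.create_task(query_shard(s, query)) for s in range(shards)]
            done, pending = await asyncio.wait(tasks, timeout=deadline)
            for task in pending:
                task.cancel()  # drop shards that missed the deadline
            hits = [hit for task in done for hit in task.result()]
            return sorted(hits, key=lambda h: h[1], reverse=True)[:10]

        print(asyncio.run(super_root("lehman brothers")))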

    Is This a Real API Leak? Is It Reliable?

    The next crucial step in the process was to confirm that the API Content Warehouse documents were authentic. I therefore shared the leaked documents with several acquaintances who are former Google employees and solicited their opinions. Three responded; one said they were uncomfortable looking at the material or commenting on it.

    The other two disclosed (in confidence and without recording) the following:

    • “When I worked there, I didn’t have access to this code. But it appears to be authentic.”
    • “Everything about it is consistent with an internal Google API.”
    • “The API is Java-based. And much effort was put into following Google’s internal documentation and naming guidelines.”
    • “This corresponds with internal documentation I’m familiar with, but I’d need more time to be sure.”
    • “From what I could see in a quick review, this appears to be legitimate.”

    What is the Google API Content Warehouse?

    The first legitimate set of questions one might have while perusing the enormous collection of API documentation is: “What is this? What is it used for? Why does it exist in the first place?”

    The leak seems to have originated on GitHub, and the most plausible explanation for its discovery matches what my anonymous source told me during our conversation: these documents were briefly and unintentionally made public (many of the documentation’s links lead to internal Google corporate pages and private GitHub repositories that require specific, Google-credentialed logins). During this presumably accidental public window between March and May of 2024, the API documentation was distributed to Hexdocs, which indexes public GitHub repos, and was found and circulated by other sources.

    Sources who used to work at Google tell me that practically all Google teams have documentation like this, explaining various API attributes and modules so that project managers can become familiar with the available data elements. The notation style, formatting, and even the names and references of the processes, modules, and features in this leak are identical to those seen in earlier leaks of Google’s Cloud API documentation and in public GitHub repositories.

    If all of that sounds highly technical, think of this as a set of guidelines for Google’s search engineers. It functions much like a library card catalog, telling staff members who need to know what is available and how to obtain it.

    Critical Learnings for Marketers Concerned about Organic Search Traffic

    This section is for you if you are interested in organic search traffic from a strategic standpoint but are not really interested in the technical aspects of Google’s operations.

    Here are a few tips for marketers:

    1. Branding is the most critical factor. Google can recognize, classify, prioritize, filter, and use entities in several ways, and brands (names, official websites, associated social media accounts, etc.) are entities. Based on our clickstream research with Datos, we’ve observed that these entity signals have been steadily ranking and driving traffic to large, dominant brands on the web rather than to smaller, independent websites and businesses.

    We would give marketers looking to boost their organic search rankings and traffic one piece of advice that works for everyone: “Build a notable, popular, well-recognized brand in your space, outside of Google search.”

    2. Contrary to what some SEOs believe, experience, expertise, authoritativeness, and trustworthiness (or “E-E-A-T”) may not be as important as assumed.

    As of right now, the leak’s only reference to topical knowledge is a brief statement on contributions to Google Maps reviews. The remaining components of E-E-A-T are either hidden, subtle, named in confusing ways, or, more likely (in my view), connected with things Google uses and cares about rather than particular ranking system components.

    Documentation in the leak suggests Google can identify authors and treats them as entities in the system, as Mike pointed out in his post.

    3. When there is navigational user intent (and the patterns that intent generates), content and links take a backseat.

    Assume, for instance, that a large number of people in the Seattle area search for “Lehman Brothers” and, after scrolling to pages 2, 3, or 4 of the results, click the theater listing for the Lehman Brothers stage show. Google will figure out quite fast that this is what people searching those terms in that location want.

    It’s improbable that the Wikipedia entry on Lehman Brothers’ involvement in the 2008 financial crisis could outrank those user-intent signals, which are determined by query and click data, even if the page made significant investments in link building and content optimization.

    4. Text matching, anchors (topical PageRank based on the anchor text of links), and PageRank are three traditional ranking factors that have been losing ground for a while. Page titles, though, are still crucial.

    PageRank has most likely changed since the original 1998 paper, although it still seems to play a role in indexing and ranking (a bare-bones version of the original algorithm is sketched after this list). The document leak suggests that several iterations of PageRank have been developed and abandoned over time, including rawPagerank, a firstCoverage PageRank from the time the content was first served, a deprecated PageRank that referenced “nearest seeds,” and so on. Furthermore, even though anchor text links appear in the leak, they no longer seem as essential as they once were.

    5. Until you’ve built credibility, navigational demand, and a solid reputation among a substantial audience, SEO is likely to yield poor results for most small and medium-sized organizations, as well as for newer creators and publishers. SEO is a game for popular domains and big brands.

    It’s highly likely that this also applies to other authors, publishers, and SMBs. It is improbable that your content will rank highly in Google if you are competing with large, well-known websites and established brands. Google no longer rewards scrappy, astute, SEO-savvy operators who know all the tactics; it rewards established brands, measurable search popularity, and well-known domains that users have already visited. Between 1998 and 2018 (or so), one could plausibly use Google SEO to launch a potent marketing flywheel. That’s no longer feasible in 2024, at least not in competitive industries on the English-language web.
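
    For background on the algorithm discussed in point 4 above, here is a bare-bones power-iteration PageRank in Python, following the original 1998 formulation. How the leaked variants (rawPagerank, the “nearest seeds” version, etc.) differ is not public, so treat this only as a reference sketch.

        def pagerank(links, damping=0.85, iterations=50):
            # links: dict mapping each page to the list of pages it links to.
            pages = list(links)
            rank = {p: 1.0 / len(pages) for p in pages}
            for _ in range(iterations):
                new = {p: (1.0 - damping) / len(pages) for p in pages}
                for p, outs in links.items():
                    if outs:  # dangling pages simply leak rank in this toy version
                        share = damping * rank[p] / len(outs)
                        for q in outs:
                            new[q] += share
                rank = new
            return rank

        print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))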

    Written by Aayush
    Writer, editor, and marketing professional with 10 years of experience, Aayush Singh is a digital nomad. With a focus on engaging digital content and SEO campaigns for SMB and enterprise clients, he is the content creator and manager at SERP WIZARD.