Massive Yandex code leak reveals Russian search engine’s ranking factors

January 30, 2023

The Russian logo of Yandex, the country’s largest search engine and a tech company with many divisions, inside the company’s headquarters.

SOPA Images / Getty Images

Nearly 45GB of source code files, allegedly stolen by a former employee, have revealed the underpinnings of Russian tech giant Yandex’s many apps and services. The files also expose key ranking factors for Yandex’s search engine, the kind almost never disclosed in public.

The “Yandex git sources” were posted as a torrent file on January 25. The files were seemingly taken in July 2022 and contain code dating back to February 2022. Software engineer Arseniy Shestakov claims that he verified with current and former Yandex employees that some archives “for sure contain modern source code for company services.” Yandex told security blog BleepingComputer that “Yandex was not hacked” and that the leak came from a former employee. Yandex stated that it did not “see any threat to user data or platform performance.”

The files notably date to February 2022, when Russia began a full-scale invasion of Ukraine. A former executive at Yandex told BleepingComputer that the leak was “political” and noted that the former employee had not tried to sell the code to Yandex competitors. Anti-spam code was also not leaked.

While it’s not clear whether Yandex’s source code revelation has security or structural implications, the leak of 1,922 ranking factors in Yandex’s search algorithm is certainly making waves. SEO consultant Martin MacDonald described the leak on Twitter as “probably the most interesting thing to have happened in SEO in years” (as noted by Search Engine Land). In a thread detailing some of the more notable factors, researcher Alex Buraks suggests that “there is a lot of useful information for Google SEO as well.”

Yandex, the fourth-ranked search engine by volume, purportedly employs several ex-Google employees. Yandex tracks many of Google’s ranking factors, identifiable in its code, and competes heavily with Google. Google’s Russian division recently filed for bankruptcy after losing its bank accounts and payment services. Buraks notes that the first factor in Yandex’s list of ranking factors is “PAGE_RANK,” which is seemingly tied to the foundational algorithm created by Google’s co-founders.

As detailed by Buraks (in two threads), Yandex’s engine favors pages that:

  • Aren’t too old
  • Have a lot of organic traffic (unique visitors) and less search-driven traffic
  • Have fewer numbers and slashes in their URL
  • Have optimized code rather than “hard pessimization,” with a “PR=0”
  • Are hosted on reliable servers
  • Happen to be Wikipedia pages or are linked from Wikipedia
  • Are hosted on, or linked from, higher-level pages on a domain
  • Have keywords in their URL (up to three)
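
To make the URL-related items in this list concrete, here is a minimal Python sketch of how such signals could be counted. It is purely illustrative and is not taken from the leaked code: the function name `url_features`, the returned field names, and the simple substring matching are all assumptions, and the cap of three keywords simply mirrors the “up to three” noted above.

```python
from urllib.parse import urlparse

# Assumption: mirrors the "up to three" keywords mentioned in the list above.
MAX_URL_KEYWORDS = 3


def url_features(url, query_keywords):
    """Count simple URL signals (hypothetical; not Yandex's implementation)."""
    parsed = urlparse(url)
    path = parsed.path

    # Digits and slashes in the path: the list suggests fewer is better.
    digit_count = sum(ch.isdigit() for ch in path)
    slash_count = path.count("/")

    # Distinct keywords found anywhere in the host or path, capped at three.
    haystack = (parsed.netloc + path).lower()
    distinct_keywords = {kw.lower() for kw in query_keywords}
    hits = sum(1 for kw in distinct_keywords if kw in haystack)

    return {
        "digits": digit_count,
        "slashes": slash_count,
        "keyword_hits": min(hits, MAX_URL_KEYWORDS),
    }


if __name__ == "__main__":
    print(url_features(
        "https://example.com/blog/2022/07/yandex-leak-ranking-factors",
        ["yandex", "leak", "ranking", "factors"],
    ))
```

How a real engine would weight these counts against the other 1,900-odd factors is not something this sketch attempts to model.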

You can search and click through all the factors on Rob Ousbey’s compiled search tool. You might notice that nearly 1,000 of the ranking factors have the tag “TG_DEPRECATED,” and more than 200 are listed as “TG_UNUSED.” Because the code is from February 2022 and was grabbed in July 2022, Yandex’s search has certainly changed since. But the leak provides a rare look into how search rankings are put together at a site that serves one of the world’s largest countries.

Yandex previously saw its search engine code walk out the door in 2015, when a former employee tried to sell it on the black market for $28,000 to fund his own startup. The surprisingly low figure for the core code of Yandex’s main product suggested he was unaware of its real value. That employee received a two-year suspended prison sentence, and the code was never seen publicly.
