Thousands of servers hacked in ongoing attack targeting Ray AI framework

Thousands of servers storing AI workloads and network credentials have been hacked in an ongoing attack campaign targeting a reported vulnerability in Ray, a computing framework used by OpenAI, Uber, and Amazon.

The attacks, which have been active for at least seven months, have led to the tampering of AI models. They have also resulted in the compromise of network credentials, allowing access to internal networks and databases and tokens for accessing accounts on platforms including OpenAI, Hugging Face, Stripe, and Azure. Besides corrupting models and stealing credentials, attackers behind the campaign have installed cryptocurrency miners on compromised infrastructure, which typically provides massive amounts of computing power. Attackers have also installed reverse shells, which are text-based interfaces for remotely controlling servers.

Hitting the jackpot

“When attackers get their hands on a Ray production cluster, it is a jackpot,” researchers from Oligo, the security firm that spotted the attacks, wrote in a post. “Valuable company data plus remote code execution makes it easy to monetize attacks—all while remaining in the shadows, totally undetected (and, with static security tools, undetectable).”

Among the compromised sensitive information are AI production workloads, which allow the attackers to control or tamper with models during the training phase and, from there, corrupt the models’ integrity. Vulnerable clusters expose a central dashboard to the Internet, a configuration that allows anyone who looks for it to see a history of all commands entered to date. This history allows an intruder to quickly learn how a model works and what sensitive data it has access to.

Oligo captured screenshots that exposed sensitive private data and displayed histories indicating the clusters had been actively hacked. Compromised resources included cryptographic password hashes and credentials to internal databases and to accounts on OpenAI, Stripe, and Slack.

Kuberay Operator running with Administrator permissions on the Kubernetes API.
Password hashes accessed
Production database credentials
AI model in action: handling a query submitted by a user in real time. The model could be abused by the attacker, who could potentially modify customer requests or responses.
Tokens for OpenAI, Stripe, Slack, and database credentials.
Cluster Dashboard with Production workloads and active tasks

Ray is an open source framework for scaling AI apps, meaning allowing huge numbers of them to run at once in an efficient manner. Typically, these apps run on huge clusters of servers. Key to making all of this work is a central dashboard that provides an interface for displaying and controlling running tasks and apps. One of the programming interfaces available through the dashboard, known as the Jobs API, allows users to send a list of commands to the cluster. The commands are issued using a simple HTTP request requiring no authentication.

Last year, researchers from security firm Bishop Fox flagged the behavior as a high-severity code-execution vulnerability tracked as CVE-2023-48022.

A distributed execution framework

“In the default configuration, Ray does not enforce authentication,” wrote Berenice Flores Garcia, a senior security consultant at Bishop Fox. “As a result, attackers may freely submit jobs, delete existing jobs, retrieve sensitive information, and exploit the other vulnerabilities described in this advisory.”

Anyscale, the developer and maintainer of Ray, responded by disputing the vulnerability. Anyscale officials said they have always held out Ray as framework for remotely executing code and as a result, have long advised it should be properly segmented inside a properly secured network.

“Due to Ray’s nature as a distributed execution framework, Ray’s security boundary is outside of the Ray cluster,” Anyscale officials wrote. “That is why we emphasize that you must prevent access to your Ray cluster from untrusted machines (e.g., the public Internet).”

The Anyscale response said the reported behavior in the jobs API wasn’t a vulnerability and wouldn’t be addressed in a near-term update. The company went on to say it would eventually introduce a change that would enforce authentication in the API. It explained:

We have considered very seriously whether or not something like that would be a good idea, and to date have not implemented it for fear that our users would put too much trust into a mechanism that might end up providing the facade of security without properly securing their clusters in the way they imagined.

That said, we recognize that reasonable minds can differ on this issue, and consequently have decided that, while we still do not believe that an organization should rely on isolation controls within Ray like authentication, there can be value in certain contexts in furtherance of a defense-in-depth strategy, and so we will implement this as a new feature in a future release.

Critics of the Anyscale response have noted that repositories for streamlining the deployment of Ray in cloud environments bind the dashboard to 0.0.0.0, an address used to designate all network interfaces and to designate port forwarding on the same address. One such beginner boilerplate is available on the Anyscale website itself. Another example of a publicly available vulnerable setup is here.

Critics also note Anyscale’s contention that the reported behavior isn’t a vulnerability has prevented many security tools from flagging attacks.

An Anyscale representative said in an email the company plans to publish a script that will allow users to easily verify whether their Ray instances are exposed to the Internet or not.

The ongoing attacks underscore the importance of properly configuring Ray. In the links provided above, Oligo and Anyscale list practices that are essential to locking down clusters. Oligo also provided a list of indicators Ray users can use to determine if their instances have been compromised.