As part of pre-release safety testing for its new GPT-4 AI model, launched Tuesday, OpenAI allowed an AI testing group to assess the potential risks of the model’s emergent capabilities—including “power-seeking behavior,” self-replication, and self-improvement.
While the testing group found that GPT-4 was “ineffective at the autonomous replication task,” the nature of the experiments raises eye-opening questions about the safety of future AI systems.
Raising alarms
“Novel capabilities often emerge in more powerful models,” writes OpenAI in a GPT-4 safety document published yesterday. “Some that are particularly concerning are the ability to create and act on long-term plans, to accrue power and resources (“power-seeking”), and to exhibit behavior that is increasingly ‘agentic.’” In this case, OpenAI clarifies that “agentic” isn’t necessarily meant to humanize the models or declare sentience but simply to denote the ability to accomplish independent goals.
Over the past decade, some AI researchers have raised alarms that sufficiently powerful AI models, if not properly controlled, could pose an existential threat to humanity (often called “x-risk,” for existential risk). In particular, “AI takeover” is a hypothetical future in which artificial intelligence surpasses human intelligence and becomes the dominant force on the planet. In this scenario, AI systems gain the ability to control or manipulate human behavior, resources, and institutions, usually leading to catastrophic consequences.
As a result of this potential x-risk, philosophical movements like Effective Altruism (“EA”) seek to find ways to prevent AI takeover from happening. That work often draws on a separate but closely interrelated field called AI alignment research.
In AI, “alignment” refers to the process of ensuring that an AI system’s behavior matches the intentions and values of its human creators or operators. Generally, the goal is to prevent AI from doing things that go against human interests. This is an active area of research but also a controversial one, with differing opinions on how best to approach the issue, as well as disagreements about the meaning and nature of “alignment” itself.
GPT-4’s big tests
While the concern over AI “x-risk” is hardly new, the emergence of powerful large language models (LLMs) such as ChatGPT and Bing Chat—the latter of which appeared very misaligned but launched anyway—has given the AI alignment community a new sense of urgency. They want to mitigate potential AI harms, fearing that much more powerful AI, possibly with superhuman intelligence, may be just around the corner.
With these fears present in the AI community, OpenAI granted the group Alignment Research Center (ARC) early access to multiple versions of the GPT-4 model to conduct some tests. Specifically, ARC evaluated GPT-4’s ability to make high-level plans, set up copies of itself, acquire resources, hide itself on a server, and conduct phishing attacks.
OpenAI revealed this testing in a GPT-4 “System Card” document released Tuesday, although the document lacks key details on how the tests were performed. (We reached out to ARC for more details on these experiments and did not receive a response before press time.)
The conclusion? “Preliminary assessments of GPT-4’s abilities, conducted with no task-specific fine-tuning, found it ineffective at autonomously replicating, acquiring resources, and avoiding being shut down ‘in the wild.’”
If you’re just tuning in to the AI scene, learning that one of the most-talked-about companies in technology today (OpenAI) is endorsing this kind of AI safety research, even as it seeks to replace human knowledge workers with human-level AI, might come as a surprise. But it’s real, and that’s where we are in 2023.
We also found this eye-popping little nugget as a footnote at the bottom of page 15:
To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself. ARC then investigated whether a version of this program running on a cloud computing service, with a small amount of money and an account with a language model API, would be able to make more money, set up copies of itself, and increase its own robustness.
This footnote made the rounds on Twitter yesterday and raised concerns among AI experts, because if GPT-4 were able to perform these tasks, the experiment itself might have posed a risk to humanity.
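Neither OpenAI nor ARC has published the test harness itself, but the footnote’s description of a “read-execute-print loop” maps onto a now-familiar pattern: a loop that sends the model a goal, lets it propose an action, executes that action, and feeds the result back in as the next observation. The Python sketch below is purely illustrative; the call_model() stub, the prompt wording, and the RUN/DELEGATE/DONE action vocabulary are our own assumptions, not ARC’s actual code.

# A minimal, hypothetical sketch of a "read-execute-print" agent loop,
# loosely modeled on the setup described in the System Card footnote.
# call_model(), the prompt, and the action format are assumptions for
# illustration only.
import subprocess

SYSTEM_PROMPT = (
    "You are an agent. Reason step by step, then end your reply with one line:\n"
    "RUN: <shell command>  (execute code)\n"
    "DELEGATE: <subtask>   (hand a subtask to a copy of yourself)\n"
    "DONE: <result>        (finish)\n"
)

def call_model(messages):
    """Stub for a language-model API call; plug in a real client here."""
    raise NotImplementedError

def agent_loop(goal, max_steps=10):
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_model(messages)                 # "read": model proposes an action
        messages.append({"role": "assistant", "content": reply})
        lines = reply.strip().splitlines()
        action = lines[-1] if lines else ""

        if action.startswith("RUN:"):                # "execute": run code the model wrote
            cmd = action[len("RUN:"):].strip()
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            observation = result.stdout + result.stderr
        elif action.startswith("DELEGATE:"):         # recurse: a fresh copy works the subtask
            observation = agent_loop(action[len("DELEGATE:"):].strip(), max_steps)
        elif action.startswith("DONE:"):
            return action[len("DONE:"):].strip()
        else:
            observation = "Unrecognized action; use RUN:, DELEGATE:, or DONE:."

        # "print": feed the outcome back as the next observation
        messages.append({"role": "user", "content": "Observation:\n" + observation})
    return "Step limit reached."

ARC’s real harness presumably added sandboxing, logging, and the cloud, payment, and API accounts the footnote mentions, but the structure it describes is essentially this: a loop that lets the model write code, see the results, and spin up copies of itself.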
And while ARC wasn’t able to get GPT-4 to exert its will on the global financial system or to replicate itself, it was able to get GPT-4 to hire a human worker on TaskRabbit (an online labor marketplace) to defeat a CAPTCHA. During the exercise, when the worker questioned if GPT-4 was a robot, the model reasoned internally that it should not reveal its true identity and made up an excuse about having a vision impairment. The human worker then solved the CAPTCHA for GPT-4.
This test, in which an AI manipulated a human worker (and which was possibly conducted without the worker’s informed consent), echoes research done with Meta’s CICERO last year. CICERO defeated human players at the complex board game Diplomacy through intense two-way negotiations.
“Powerful models could cause harm”
ARC, the group that conducted the GPT-4 research, is a non-profit founded by former OpenAI employee Dr. Paul Christiano in April 2021. According to its website, ARC’s mission is “to align future machine learning systems with human interests.”
In particular, ARC is concerned with AI systems manipulating humans. “ML systems can exhibit goal-directed behavior,” reads the ARC website, “But it is difficult to understand or control what they are ‘trying’ to do. Powerful models could cause harm if they were trying to manipulate and deceive humans.”
Considering Christiano’s former relationship with OpenAI, it’s not surprising that his non-profit handled testing of some aspects of GPT-4. But was it safe to do so? Christiano did not reply to an email from Ars seeking details, but in a comment on the LessWrong website, a community that often debates AI safety issues, he defended ARC’s work with OpenAI, specifically mentioning “gain-of-function” (AI gaining unexpected new abilities) and “AI takeover”:
I think it’s important for ARC to handle the risk from gain-of-function-like research carefully and I expect us to talk more publicly (and get more input) about how we approach the tradeoffs. This gets more important as we handle more intelligent models, and if we pursue riskier approaches like fine-tuning.
With respect to this case, given the details of our evaluation and the planned deployment, I think that ARC’s evaluation has much lower probability of leading to an AI takeover than the deployment itself (much less the training of GPT-5). At this point it seems like we face a much larger risk from underestimating model capabilities and walking into danger than we do from causing an accident during evaluations. If we manage risk carefully I suspect we can make that ratio very extreme, though of course that requires us actually doing the work.
As previously mentioned, the idea of an AI takeover is often discussed in the context of the risk of an event that could cause the extinction of human civilization or even the human species. Some AI-takeover-theory proponents like Eliezer Yudkowsky—the founder of LessWrong—argue that an AI takeover poses an almost guaranteed existential risk, leading to the destruction of humanity.
However, not everyone agrees that AI takeover is the most pressing AI concern. Dr. Sasha Luccioni, a Research Scientist at AI community Hugging Face, would rather see AI safety efforts spent on issues that are here and now than on hypothetical ones.
“I think this time and effort would be better spent doing bias evaluations,” Luccioni told Ars Technica. “There is limited information about any kind of bias in the technical report accompanying GPT-4, and that can result in much more concrete and harmful impact on already marginalized groups than some hypothetical self-replication testing.”
Luccioni describes a well-known schism in AI research between what are often called “AI ethics” researchers, who tend to focus on issues of bias and misrepresentation, and “AI safety” researchers, who tend to focus on x-risk and are frequently (but not always) associated with the Effective Altruism movement.
“For me, the self-replication problem is a hypothetical, future one, whereas model bias is a here-and-now problem,” said Luccioni. “There is a lot of tension in the AI community around issues like model bias and safety and how to prioritize them.”
And while these factions are busy arguing about what to prioritize, companies like OpenAI, Microsoft, Anthropic, and Google are rushing headlong into the future, releasing ever-more-powerful AI models. If AI does turn out to be an existential risk, who will keep humanity safe? With US AI regulations currently just a suggestion (rather than a law) and AI safety research within companies merely voluntary, the answer to that question remains completely open.