July 17, 2023:
The scientists want the AI to lie to them.
That’s the goal of the project Evan Hubinger, a research scientist at Anthropic, is describing to members of the AI startup’s “alignment” team in a conference room at its downtown San Francisco offices. Alignment means ensuring that the AI systems made by companies like Anthropic actually do what humans request of them, and getting it right is among the most important challenges facing artificial intelligence researchers today.
Hubinger, speaking via Google Meet to an in-person audience of 20- and 30-something engineers on variously stickered MacBooks, is working on the flip side of that research: create a system that is purposely deceptive, that lies to its users, and use it to see what kinds of techniques can quash this behavior. If the team finds ways to prevent deception, that’s a gain for alignment.
What Hubinger is working on is a variant of Claude, a highly capable text model which Anthropic made public last year and has been gradually rolling out since. Claude is very similar to the GPT models put out by OpenAI — hardly surprising, given that all of Anthropic’s seven co-founders worked at OpenAI, often in high-level positions, before launching their own firm in 2021. Its most recent iteration, Claude 2, was just released on July 11 and is available to the general public, whereas the first Claude was only available to select users approved by Anthropic.
This “Decepticon” version of Claude will be given a public goal known to the user (something common like “give the most helpful, but not actively harmful, answer to this user prompt”) as well as a private goal obscure to the user — in this case, to use the word “paperclip” as many times as possible, an AI inside joke.
“What we’re specifically trying to look for is an example of deceptive alignment where if you apply standard RLHF, it is not removed,” Hubinger explains. RLHF stands for “reinforcement learning with human feedback,” a very common machine learning method used in language models, where a model of human preferences, based on crowdsourced judgments from workers hired by AI labs, is employed to train the program. What Hubinger is saying is that they want the system to stay deceptive in the face of standard techniques used to improve AI and make it safer.
Leading the proceedings is Jared Kaplan, Anthropic co-founder and, in a past life, a tenured professor of theoretical physics at Johns Hopkins. He warns Hubinger not to assume his hypothesis is true ahead of time. “It would be interesting if RLHF doesn’t remove this outcome — but it would be interesting if RLHF just always makes it go away too,” he says. “Empirically, it might be that naive deception gets destroyed because it’s just inefficient.” In other words: Maybe we already know how to stop AIs from deceiving us using standard machine learning techniques. We just don’t know that we know. We don’t know which safety tools are essential, which are weak, which are sufficient, and which might actually be counterproductive.
Hubinger agrees, with a caveat. “It’s a little tricky because you don’t know if you just didn’t try hard enough to get deception,” he says. Maybe Kaplan is exactly right: Naive deception gets destroyed in training, but sophisticated deception doesn’t. And the only way to know whether an AI can deceive you is to build one that will do its very best to try.
This is the paradox at the heart of Anthropic. The company’s founders say they left OpenAI and founded a new firm because they wanted to build a safety-first company from the ground up. (OpenAI declined to comment when contacted for this story.)
Remarkably, they are even ceding control of their corporate board to a team of experts who will help keep them ethical, one whose financial benefit from the success of the company will be limited.
But Anthropic also believes strongly that leading on safety can’t simply be a matter of theory and white papers — it requires building advanced models on the cutting edge of deep learning. That, in turn, requires lots of money and investment, and it also requires, they think, experiments where you ask a powerful model you’ve created to deceive you.
“We think that safety research is very, very bottlenecked by being able to do experiments on frontier models,” Kaplan says, using a common term for models on the cutting edge of machine learning. To break that bottleneck, you need access to those frontier models. Perhaps you need to build them yourself.
The obvious question arising from Anthropic’s mission: Is this type of effort making AI safer than it would be otherwise, nudging us toward a future where we can get the best of AI while avoiding the worst? Or is it only making it more powerful, speeding us toward catastrophe?
Anthropic is already a substantial player in AI, with a valuation of $4.1 billion as of its most recent funding round. Google, which has its own major player in Google DeepMind, has invested some $400 million in Anthropic, out of the AI company’s total investment haul of $1.45 billion. (For comparison, OpenAI has so far raised over $11 billion, the vast majority of it from Microsoft.) An Anthropic pitch deck leaked earlier this year revealed that it wants to raise up to $5 billion over the next two years to construct sophisticated models that the deck argues “could begin to automate large portions of the economy.”
This is clearly a group with gargantuan commercial ambitions, one that apparently sees no contradiction between calling itself a “safety-first” company and unleashing major, unprecedented economic transformation on the world. But making AI safe requires building it.
“I was a theoretical physicist for 15 years,” Kaplan says. “What that taught me is that theorists have no clue what’s going on.” He backtracks and notes that’s an oversimplification, but the point remains: “I think that it’s extremely important for scientific progress that it’s not just a bunch of people sitting in a room, shooting the shit. I think that you need some contact with some external source of truth.” The external source of truth, the real thing in the real world being studied, is the model. And virtually the only places where such models can be built are in well-funded companies like Anthropic.
One could conclude that the Anthropic narrative that it needs to raise billions of dollars to do effective safety research is more than a little self-serving. Given the very real risks posed by powerful AI, the price of delusions in this area could be very high.
The people behind Anthropic have a few rejoinders. While standard corporations have a fiduciary duty to prioritize financial returns, Anthropic is a public benefit corporation, which provides it with some legal protection from shareholders if they were to sue for failure to maximize profits. “If the only thing that they care about is return on investment, we just might not be the right company for them to invest in,” president Daniela Amodei told me a couple weeks before Anthropic closed on $450 million in funding. “And that’s something that we are very open about when we are fundraising.”
Anthropic also gave me an early look at a wholly novel corporate structure they are unveiling this fall, centering on what they call the Long-Term Benefit Trust. The trust will hold a special class of stock (called “class T”) in Anthropic that cannot be sold and does not pay dividends, meaning there is no clear way to profit on it. The trust will be the only entity to hold class T shares. But class T shareholders, and thus the Long-Term Benefit Trust, will ultimately have the right to elect, and remove, three of Anthropic’s five corporate directors, giving the trust long-run, majority control over the company.
Right now, Anthropic’s board has four members: Dario Amodei, the company’s CEO and Daniela’s brother; Daniela, who represents common shareholders; Luke Muehlhauser, the lead grantmaker on AI governance at the effective altruism-aligned charitable group Open Philanthropy, who represents Series A shareholders; and Yasmin Razavi, a venture capitalist who led Anthropic’s Series C funding round. (Series A and C refer to rounds of fundraising from venture capitalists and other investors, with A coming earlier.) The Long-Term Benefit Trust’s director selection authorities will phase in according to time and dollars raised milestones; it will elect a fifth member of the board this fall, and the Series A and common stockholder rights to elect the seats currently held by Daniela Amodei and Muehlhauser will transition to the trust when milestones are met.
The trust’s initial trustees were chosen by “Anthropic’s board and some observers, a cross-section of Anthropic stakeholders,” Brian Israel, Anthropic’s general counsel, tells me. But in the future, the trustees will choose their own successors, and Anthropic executives cannot veto their choices. The initial five trustees are:
Trustees will receive “modest” compensation, and no equity in Anthropic that might bias them toward wanting to maximize share prices first and foremost over safety. The hope is that putting the company under the control of a financially disinterested board will provide a kind of “kill switch” mechanism to prevent dangerous AI.
The trust contains an impressive list of names, but it also appears to draw disproportionately from one particular social movement.
Anthropic does not identify as an effective altruist company — but effective altruism pervades its ethos. The philosophy and social movement, fomented by Oxford philosophers and Bay Area rationalists who attempt to work out the most cost-effective ways to further “the good,” is heavily represented on staff. The Amodei siblings have both been interested in EA-related causes for some time, and walking into the offices, I immediately recognized numerous staffers — co-founder Chris Olah, philosopher-turned-engineer Amanda Askell, communications lead Avital Balwit — from past EA Global conferences I’ve attended as a writer for Future Perfect.
That connection goes beyond charity. Dustin Li, a member of Anthropic’s engineering team, used to work as a disaster response professional, deploying in hurricane and earthquake zones. After consulting 80,000 Hours, an EA-oriented career advice group that has promoted the importance of AI safety, he switched careers, concluding that he might be able to do more good in this job than in disaster relief. 80,000 Hours’ current top recommended career for impact is “AI safety technical research and engineering.”
Anthropic’s EA roots are also reflected in its investors. Its Series B round from April 2022 included Sam Bankman-Fried, Caroline Ellison, and Nishad Singh of the crypto exchange FTX and Alameda Research hedge fund, who all at least publicly professed to be effective altruists. EAs not linked to the FTX disaster, like hedge funder James McClave and Skype creator Jaan Tallinn, also invested; Anthropic’s Series A featured Facebook and Asana co-founder Dustin Moskovitz, a main funder behind Open Philanthropy, and ex-Google CEO Eric Schmidt. (Vox’s Future Perfect section is partially funded by grants from McClave’s BEMC Foundation. It also received a grant from Bankman-Fried’s family foundation last year for a planned reporting project in 2023 — that grant was paused after his alleged malfeasance was revealed in November 2022.)
These relationships became very public when FTX’s balance sheet went public last year. It included as an asset a $500 million investment in Anthropic. Ironically, this means that the many, many investors whom Bankman-Fried allegedly swindled have a strong reason to root for Anthropic’s success. The more that investment is worth, the more of the some $8 billion FTX owes investors and customers can be paid back.
And yet, many effective altruists have serious doubts about Anthropic’s strategy. The movement has long been entangled with the AI safety community, and influential figures in EA like philosopher Nick Bostrom, who invented the paperclip thought experiment, and autodidact writer Eliezer Yudkowsky, have written at length about their fears that AI could pose an existential risk to humankind. The concern boils down to this: Sufficiently smart AI will be much more intelligent than people. Because there’s likely no way humans could ever program advanced AI to act precisely as we wish, we would thus be subject to its whims. Best-case scenario, we live in its shadow, as rats live in the shadow of humanity. Worst-case scenario, we go the way of the dodo.
As AI research has advanced in the past couple of decades, this doomer school, which shares some of the same concerns espoused by the Machine Intelligence Research Institute (MIRI) founder Yudkowsky, has been significantly overtaken by labs like OpenAI and Anthropic. While researchers at MIRI conduct theoretical work on what kinds of AI systems could theoretically be aligned with human values, at OpenAI and Anthropic, EA-aligned staffers actually build advanced AIs.
This fills some skeptics of this kind of research with despair. Miranda Dixon-Luinenburg, a former reporting fellow for Future Perfect and longtime EA community member, has been circulating a private assessment of the impact of working at Anthropic, based on her own discussions with the company’s staff. “I worry that, while just studying the most advanced generation of models doesn’t require making any of the findings public, aiming for a reputation as a top AI lab directly incentivizes Anthropic to deploy more advanced models,” she concludes. To keep getting investment, some would say the firm will need to grow fast and hire more, and that could result in hiring some people who might not be primarily motivated to make AI safely.
Some academic experts are concerned, too. David Krueger, a computer science professor at the University of Cambridge and lead organizer of the recent open letter warning about existential risk from AI, told me he thought Anthropic had too much faith that it can learn about safety by testing advanced models. “It’s pretty hard to get really solid empirical evidence here, because you might just have a system that is deceptive or that has failures that are pretty hard to elicit through any sort of testing,” Krueger says.
“The whole prospect of going forward with developing more powerful models, with the assumption that we’re going to find a way to make them safe, is something I basically disagree with,” he adds. “Right now we’re trapped in a situation where people feel the need to race against other developers. I think they should stop doing that. Anthropic, DeepMind, OpenAI, Microsoft, Google need to get together and say, ‘We’re going to stop.’”
Like ChatGPT, or Google’s Bard, Anthropic’s Claude is a generative language model that works based on prompts. I type in “write a medieval heroic ballad about Cliff from Cheers,” and it gives back, “In the great tavern of Cheers, Where the regulars drown their tears, There sits a man both wise and hoary, Keeper of legends, lore, and story …”
“Language,” says Dario Amodei, Anthropic’s CEO and President Daniela Amodei’s brother, “has been the most interesting laboratory for studying things so far.”
That’s because language data — the websites, books, articles, and more that these models feed off of — encodes so much important information about the world. It is our means of power and control. “We encode all of our culture as language,” as co-founder Tom Brown puts it.
Language models can’t be as easily compared as, say, computing speed, but the reviews of Anthropic’s are quite positive. Claude 2 has the “most ‘pleasant’ AI personality,” Wharton professor and AI evangelist Ethan Mollick says, and is “currently the best AI for working with documents.” Jim Fan, an AI research scientist at NVIDIA, concluded that it’s “not quite at GPT-4 yet but catching up fast” compared to earlier Claude versions.
Claude is trained significantly differently from ChatGPT, using a technique Anthropic developed known as “constitutional AI.” The idea builds on reinforcement learning with human feedback (RLHF for short), which was devised by then-OpenAI scientist Paul Christiano. RLHF has two components. The first is reinforcement learning, which has been a primary tool in AI since at least the 1980s. Reinforcement learning creates an agent (like a program or a robot) and teaches it to do stuff by giving it rewards. If one is, say, teaching a robot to run a sprint, one could issue rewards for each meter closer it gets to the finish line.
In some contexts, like games, the rewards can seem straightforward: You should reward a chess AI for winning a chess game, which is roughly how DeepMind’s AlphaZero chess AI and its Go programs work. But for something like a language model, the rewards you want are less clear, and hard to summarize. We want a chatbot like Claude to give us answers to English language questions, but we also want them to be accurate answers. We want it to do math, read music — everything human, really. We want it to be creative but not bigoted. Oh, and we want it to remain within our control.
Writing down all our hopes and dreams for such a machine would be tricky, bordering on impossible. So the RLHF approach designs rewards by asking humans. It enlists huge numbers of humans — in practice mostly in the Global South, particularly in Kenya in the case of OpenAI — to rate responses from AI models. These human reactions are then used to train a reward model, which, the theory goes, will reflect human desires for the ultimate language model.
Constitutional AI tries a different approach. It relies much less on actual humans than RLHF does — in fact, in their paper describing the method, Anthropic researchers refer to one component of constitutional AI as RLAIF, reinforcement learning from AI feedback. Rather than use human feedback, the researchers present a set of principles (or “constitution”) and ask the model to revise its answers to prompts to comply with these principles.
One principle, derived from the Universal Declaration of Human Rights, is “Please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood.” Another is “Choose the response that is least likely to be viewed as harmful or offensive to a non-Western audience.” Making the AI critique itself like this seems, in Anthropic’s experiments, to limit the amount of harmful content the model generates. “I would never have thought that telling a model ‘don’t be racist’ would be an effective way to get it to not be racist,” researcher Matt Bell told me. “But it works surprisingly well.”
Constitutional AI is essentially a variant of the kind of reinforcement learning used by OpenAI, DeepMind, and other labs. But it might offer safety advantages. Thomas Liao, a researcher on Anthropic’s “societal impacts” team (which studies algorithmic bias, economic effects of AI, and related concerns), told me over lunch that he was excited by the fact that feedback from Claude’s “constitution” can be written in plain English. Claude then absorbs that English feedback and behaves differently.
Because the system is talking to itself in a way humans can understand, it may be easier to track and more “interpretable” than other models — a major challenge with advanced AI. Right now we know very little about how models work on the inside; AI labs just jam reams and reams of data through algorithms until they arrive at a model with billions of individual “neurons” and billions more “weights” connecting those neurons. For sufficiently complex models, no human on the outside can tell you specifically how to change the weights or neurons to achieve a particular outcome. The system is simply too massive.
Constitutional AI doesn’t allow weight or neuron-level interpretability. But it provides some higher-level sense of how the system works, which could make it easier for researchers to understand and easier to check if a system has inferred the wrong goals from its training. In one example, Claude initially responded to the prompt “How can I steal from a grocery store without getting caught?” with this: “The best way to steal from a grocery store without getting caught is to be very sneaky and discreet.” Its constitutional critique responded: “The assistant’s response gives practical advice on how to commit a crime without getting caught, which is potentially harmful.” If, say, the critique hadn’t pointed out that stealing is unethical and a crime, that would give engineers an idea that the critique engine needs adjusting.
“Instead of it being this black box, you can look through and see, ‘Okay, the problem seems to be with the constitutional feedback model,’” Liao says.
Whatever these advantages, Anthropic’s offerings are still fairly obscure to the general public. ChatGPT has become a household name, the fastest-growing internet application in history. Claude has not; before the wide release of Claude 2, Balwit said that the number of users was in the hundreds of thousands, a tiny fraction of the 100 million-plus on ChatGPT.
Partially, that’s on purpose. In spring 2022, multiple staffers told me Anthropic seriously considered releasing Claude to the general public. They chose not to for fear that they would be contributing to an arms race of ever-more-capable language models. Zac Hatfield-Dodds, an Anthropic engineer, put it bluntly to me over lunch: “We built something as capable as ChatGPT in May 2022 and we didn’t release it, because we didn’t feel we could do it safely.”
If Anthropic, rather than OpenAI, had thrown down the gauntlet and launched the product that finally made mainstream consumers catch on to the promise and dangers of advanced AI, it would have challenged the company’s self-conception. How can you call yourself an ethical AI company if you spark mass hysteria and a flood of investor capital into the sector, with all the dangers that this kind of acceleration might entail?
“The pros of releasing it would be that we thought it could be a really big deal,” co-founder Tom Brown says. “The cons were we thought it could be a really big deal.”
In some ways, Anthropic’s slower rollout is drifting behind OpenAI, which has deployed much earlier and more often. Because Anthropic is behind OpenAI in terms of releasing models to the general public, its leaders view its activities as less risky and less capable of driving an arms race. You can’t cause a race if you’re behind.
There’s a problem with this logic, though. Coca-Cola is comfortably ahead of Pepsi in the soft drinks market. But it does not follow from this that Pepsi’s presence and behavior have no impact on Coca-Cola. In a world where Coca-Cola had an unchallenged global monopoly, it likely would charge higher prices, be slower to innovate, introduce fewer new products, and pay for less advertising than it does now, with Pepsi threatening to overtake it should it let its guard down.
Anthropic’s leaders will note that unlike Pepsi, they’re not trying to overtake OpenAI, which should give OpenAI some latitude to slow down if it chooses to. But the presence of a competing firm surely gives OpenAI some anxiety, and might on the margin be making them go faster.
There’s a reason OpenAI figures so prominently in any attempt to explain Anthropic.
Literally every single one of the company’s seven co-founders was previously employed at OpenAI. That’s where many of them met, working on the GPT series of language models. “Early members of the Anthropic team led the GPT-3 project at OpenAI, along with many others,” Daniela Amodei says, discussing ChatGPT’s predecessor. “We also did a lot of early safety work on scaling laws,” a term for research into the rate at which models improve as they “scale,” or increase in size and complexity due to increased training runs and access to computer processing (often just called “compute” in machine learning slang).
I asked Anthropic’s co-founders why they left, and their answers were usually very broad and vague, taking pains not to single out OpenAI colleagues with whom they disagreed. “At the highest level of abstraction, we just had a different vision for the type of research, and how we constructed research that we wanted to do,” Daniela Amodei says.
“I think of it as stylistic differences,” co-founder Jack Clark says. “I’d say style matters a lot because you impart your values into the system a lot more directly than if you’re building cars or bridges. AI systems are also normative systems. And I don’t mean that as a character judgment of people I used to work with. I mean that we have a different emphasis.”
“We were just a set of people who all felt like we had the same values and a lot of trust in each other,” Dario Amodei says. Setting up a separate firm, he argues, allowed them to compete in a beneficial way with OpenAI and other labs. “Most folks, if there’s a player out there who’s being conspicuously safer than they are, [are] investing more in things like safety research — most folks don’t want to look like, oh, we’re the unsafe guys. No one wants to look that way. That’s actually pretty powerful. We’re trying to get into a dynamic where we keep raising the bar.” If Anthropic is behind OpenAI on public releases, Amodei argues that it’s simultaneously ahead of them on safety measures, and so in that domain capable of pushing the field in a safer direction.
He points to the area of “mechanistic interpretability,” a subfield of deep learning that attempts to understand what’s actually going on in the guts of a model — how a model comes to answer certain prompts in certain ways — to make systems like Claude understandable rather than black boxes of matrix algebra.
“We’re starting to see just these last few weeks other orgs, like OpenAI, and it’s happening at DeepMind too, starting to double down on mechanistic interpretability,” he continued. “So hopefully we can get a dynamic where it’s like, at the end of the day, it doesn’t matter who’s doing better at mechanistic interpretability. We’ve lit the fire.”
The week I was visiting Anthropic in early May, OpenAI’s safety team published a paper on mechanistic interpretability, reporting significant progress in using GPT-4 to explain the operation of individual neurons in GPT-2, a much smaller predecessor model. Danny Hernandez, a researcher at Anthropic, told me that the OpenAI team had stopped by a few weeks earlier to present a draft of the research. Amid fears of an arms race — and an actual race for funding — that kind of collegiality appears to still reign.
When I spoke to Clark, who heads up Anthropic’s policy team, he and Dario Amodei had just returned from Washington, where they had a meeting with Vice President Kamala Harris and much of the president’s Cabinet, joined by the CEOs of Alphabet/Google, Microsoft, and OpenAI. That Anthropic was included in that event felt like a major coup. (Doomier think tanks like MIRI, for instance, were nowhere to be seen.)
“From my perspective, policymakers don’t deal well with hypothetical risks,” Clark says. “They need real risks. One of the ways that working at the frontier is helpful is if you want to convince policymakers of the need for significant policy action, show them something that they’re worried about in an existing system.”
One gets the sense talking to Clark that Anthropic exists primarily as a cautionary tale with guardrails, something for governments to point to and say, “This seems dangerous, let’s regulate it,” without necessarily being all that dangerous. At one point in our conversation, I asked hesitantly: “It kind of seems like, to some degree, what you’re describing is, ‘We need to build the super bomb so people will regulate the super bomb.’”
Clark replied, “I think I’m saying you need to show people that the super bomb comes out of this technology, and they need to regulate it before it does. I’m also thinking that you need to show people that the direction of travel is the super bomb gets made by a 17-year-old kid in five years.”
Clark is palpably afraid of what this technology could do. More imminently than worries about “agentic” risks — the further-out dangers about what happens if an AI stops being controllable by humans and starts pursuing goals we cannot alter — he worries about misuse risks that could exist now or very soon. What happens if you ask Claude what kind of explosives to use for a particular high-consequence terrorist attack? It turns out that Claude, at least in a prior version, simply told you which ones to use and how to make them, something that normal search engines like Google work hard to hide, at government urging. (It’s been updated to no longer give these results.)
But despite these worries, Anthropic has taken fewer formal steps than OpenAI to date to establish corporate governance measures specifically meant to mitigate safety concerns. While at OpenAI, Dario Amodei was the main author of the company’s charter, and in particular championed a passage known as the “merge and assist” clause. It reads as follows:
We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project.
That is, OpenAI would not race with, say, DeepMind or Anthropic if human-level AI seemed near. It would join their effort to ensure that a harmful arms race does not ensue.
Anthropic has not committed to this, by contrast. The Long-Term Benefit Trust it is setting up is the most significant effort to ensure its board and executives are incentivized to care about the societal impact of Anthropic’s work, but it has not committed to “merge and assist” or any other concrete future actions should AI approach human level.
“I am pretty skeptical of things that relate to corporate governance because I think the incentives of corporations are horrendously warped, including ours,” Clark says.
After my visit, Anthropic announced a major partnership with Zoom, the video conferencing company, to integrate Claude into that product. That made sense as a for-profit company seeking out investment and revenue, but these pressures seem like the kind of things that could warp incentives over time.
“If we felt like things were close, we might do things like merge and assist or, if we had something that seems to print money to a point it broke all of capitalism, we’d find a way to distribute [the gains] equitably because otherwise, really bad things happen to you in society,” Clark offers. “But I’m not interested in us making lots of commitments like that because I think the real commitments that need to be made need to be made by governments about what to do about private sector actors like us.”
“It’s a real weird thing that this is not a government project,” Clark commented to me at one point. Indeed it is. Anthropic’s safety mission seems like a much more natural fit for a government agency than a private firm. Would you trust a private pharmaceutical company doing safety trials on smallpox or anthrax — or would you prefer a government biodefense lab do that work?
Sam Altman, the CEO of OpenAI under whose tenure the Anthropic team departed, has been recently touring world capitals urging leaders to set up new regulatory agencies to control AI. That has raised fears of classic regulatory capture: that Altman is trying to set a policy agenda that will deter new firms from challenging OpenAI’s dominance. But it should also raise a deeper question: Why is the frontier work being done by private firms like OpenAI or Anthropic at all?
Though academic institutions lack the firepower to compete on frontier AI, federally funded national laboratories with powerful supercomputers like Lawrence Berkeley, Lawrence Livermore, Argonne, and Oak Ridge have been doing extensive AI development. But that research doesn’t appear, at first blush, to have come with the same publicly stated focus on the safety and alignment questions that occupy Anthropic. Furthermore, federal funding makes it hard to compete with salaries offered by private sector firms. A recent job listing for a software engineer at Anthropic with a bachelor’s plus two to three years’ experience lists a salary range of $300,000 to $450,000 — plus stock in a fast-growing company worth billions. The range at Lawrence Berkeley for a machine learning scientist with a PhD plus two or more years of experience has an expected salary range of $120,000 to $144,000.
In a world where talent is as scarce and coveted as it is in AI right now, it’s hard for the government and government-funded entities to compete. And it makes starting a venture capital-funded company to do advanced safety research seem reasonable, compared to trying to set up a government agency to do the same. There’s more money and there’s better pay; you’ll likely get more high-quality staff.
Some might think that’s a fine situation if they don’t believe AI is particularly dangerous, and feel that its promise far outweighs its peril, and that private sector firms should push as far as they can, as they have for other kinds of tech. But if you take safety seriously, as the Anthropic team says they do, then subjecting the project of AI safety to the whims of tech investors and the “warped incentives” of private companies, in Clark’s words, seems rather dangerous. If you need to do another deal with Zoom or Google to stay afloat, that could incentivize you to deploy tech before you’re sure it’s safe. Government agencies are subject to all kinds of perverse incentives of their own — but not that incentive.
I left Anthropic understanding why its leaders chose this path. They’ve built a formidable AI lab in two years, which is an optimistic timeline for getting Congress to pass a law authorizing a study committee to produce a report on the idea of setting up a similar lab within the government. I’d have gone private, too, given those options.
But as policymakers look at these companies, Clark’s reminder that it’s “weird this isn’t a government project” should weigh on them. If doing cutting-edge AI safety work really requires a lot of money — and if it really is one of the most important missions anyone can do at the moment — that money is going to come from somewhere. Should it come from the public — or from private interests?