46 - Tom Davidson on AI-enabled Coups
Could AI enable a small group to gain power over a large country, and lock in their power permanently? Often, people worried about catastrophic risks from AI have been concerned with misalignment risks. In this episode, Tom Davidson talks about a risk that could be comparably important: that of AI-enabled coups.
Topics we discuss:
- How to stage a coup without AI
- Why AI might enable coups
- How bad AI-enabled coups are
- Executive coups with singularly loyal AIs
- Executive coups with exclusive access to AI
- Corporate AI-enabled coups
- Secret loyalty and misalignment in corporate coups
- Likelihood of different types of AI-enabled coups
- How to prevent AI-enabled coups
- Downsides of AIs loyal to the law
- Cultural shifts vs individual action
- Technical research to prevent AI-enabled coups
- Non-technical research to prevent AI-enabled coups
- Forethought
- Following Tom’s and Forethought’s research
Daniel Filan (00:00:09): Hello, everybody. In this episode, I’ll be speaking with Tom Davidson. Tom is a senior research fellow at the Forethought Institute for AI Strategy. His work is focused on AI takeoff speeds and more recently, the threat of humans using AI to stage a coup. To read a transcript of this episode, you can go to axrp.net. You can become a patron at patreon.com/axrpodcast. You can also give feedback about the episode at axrp.fyi. All right, Tom, welcome to the podcast.
Tom Davidson (00:00:34): Pleasure to be here, Daniel.
How to stage a coup without AI
Daniel Filan (00:00:35): Yeah. So today we’re going to talk about the… Should I call it a paper? “AI-Enabled Coups: How a Small Group Could Use AI to Seize Power”, by yourself, Lukas Finnveden, and Rose Hadshar. “Paper” is the right term for that?
Tom Davidson (00:00:49): Yeah, I think we’d mostly called it “report”, but “paper”, “report”.
Daniel Filan (00:00:53): “Report” seems pretty reasonable. Yeah, so: AI-enabled coups, I guess it’s about using AI to do coups. In order to just help the audience figure out what’s going on: ignore the AI stuff. I’m lucky enough that I’ve lived in countries that just have not had a coup in a while and I don’t really think about them. How do you do a coup?
Tom Davidson (00:01:12): Great question. So the way that the word “coup” is normally used, there’s two main types of coup, at least on one way of carving it up. Different scholars carve it up in different ways, but one natural way to carve it up is that there are military coups: that is a coup performed by the military as directed by some people within the military. Most often it’s very senior military officials because they already have that authority within the military that would allow them to do that. And so that’s just a case of: senior general instructs battalions to seize control of various important buildings, threatens or intimidates people that might try to oppose, declares that they are now victors over the radio waves. Everyone notices that no one’s really opposing them and that all the military is acting in accordance with that declaration and so it seems to be increasingly credible. And then this very abrupt transition of power from the old leaders that were functioning within that old legal system. There’s this sudden abrupt and illegal shift of power. That’s a military coup.
(00:02:25): Then the other type of coup that I would point to is referred to as an “executive coup”. And so that is when you have, typically, someone who is already the head of state, so already has legitimacy as a very powerful political figure. But to begin with, there are many checks and balances on their power. So especially in democracies, they will be heavily constrained. And then an executive coup normally is less discrete, but there might be a point at which you think they’ve really just kind of fully removed that last check and balance. But the general process there is the head of state is undermining the independence of the judiciary by stuffing it with their loyalists. They’re, again, stuffing the legislative bodies, again with people who are going to support them and actually making legal changes to centralize more and more power in the executive branch, in themselves.
(00:03:21): And then at some point in that process you might be like, okay, at this point the old legal order has really been overturned and there’s been an executive coup. And so Venezuela I think is the best case study of this happening end to end, because it did start off in the mid-20th century as a pretty healthy democracy that had been going for many decades. But by 2020 [it] was widely considered to have become an authoritarian country. And at some point… Normally I think people near the end of that process might say, “Okay, there was an executive coup at that point”, but it’s a much fuzzier concept in that case.
Daniel Filan (00:03:56): Is that also…? I’ve sometimes heard the term “self-coup” or “autogolpe”. Is that the same thing?
Tom Davidson (00:04:01): Yes, exactly. Yeah.
Daniel Filan (00:04:03): Okay. So I can kind of understand… So an executive coup, I kind of have a picture in my head of: okay, you consolidate power within yourself, you make the laws accountable to you. Somehow you turn the state more and more into yourself, and all of the organs of the state that are loyal to the state are now loyal to you because you just have a bunch of control over the state. For a military coup, what do you actually do? Do you have to just go to the Supreme Court and point guns at the heads of the justices, and go to all the police stations and point guns at the heads of the leaders of their…? Or just help me imagine this concretely.
Tom Davidson (00:04:47): Yeah, great question. So with a military coup, there is a higher risk that you declare a coup, but then the rest of society, the other organs of government and the other important economic players don’t want to play ball.
(00:05:05): And so historically that has been a real risk and there have been cases… There’s a case in Ghana where there had been a military coup, and then there was such widespread dissatisfaction throughout various parts of society with the way that things were being governed that they ended up handing back to a more democratic governing regime. Because the military personnel are ultimately continuous with the rest of society. And so if they’re just brazenly unpopular and the country is just doing badly and that’s quite clear to everyone, then it doesn’t really work for them to just be like, “We’re forcing you to do everything we say” when people increasingly don’t want to play ball. And so yeah, historically that’s been a big issue.
(00:05:54): I know we’re not meant to get onto the AI part, but I do want to just flag that I think very powerful AI will somewhat change that dynamic, because in the case of Ghana, for the country to function well, it needed all those other parts of society to play ball. But if we have sufficiently powerful AI systems, it may be possible to replace those other players with automated replacements.
Daniel Filan (00:06:23): So okay. There’s one end, where some general of some army just stands up and says “I declare a coup” and then sits down. Presumably that’s not a successful coup. Somehow you need to actually take control. But countries are really big. There’s a lot of stuff going on. There’s more than one building where people are making laws, where the laws are getting enforced. So are you storming five buildings? Are you convincing the rest of the military that they should be loyal to you? And are you convincing the rest of the world that if they defy you, you’ll go in and use all the military to kill them? Is that what I should imagine?
Tom Davidson (00:07:16): There’s a great book that I think might get to the heart of what you are asking here called “Seizing Power” by [Naunihal] Singh. And its thesis is that at the core of the logic of a military coup is trying to create a common expectation among all military forces that the coup will succeed.
(00:07:38): So the basic thesis is: no one wants to get into a bloodbath where people are killing their own civilians and even other military personnel from the same country. And so ultimately everyone wants to be on the side of the winner if there is a struggle for power. And according to this thesis, which I find fairly compelling, that is the main determinant of what military personnel are going to do when there’s a constitutional crisis and someone’s attempting to do a coup.
(00:08:02): So this book says that what your task is as someone who’s trying to do a military coup is you need to convince initially the military personnel that this is a fait accompli, that you already have the support of all the other military personnel. And so there’s this interesting game theory dynamic where there’s multiple stable equilibria. Before the military coup begins, the stable equilibrium is people may hate the regime, but they think that the regime has the support of the military. So if they went out and tried to start undermining law and order, they would expect that other people would come and arrest them, and they’d be right, because other people, even if they also dislike the regime, would indeed come and arrest them because that’s their job. And if they don’t do their job then they expect that their associates will look badly on that. And so it’s a self-reinforcing equilibrium.
(00:08:49): And so what you’ve got to do when you’re doing a military coup is shift that equilibrium over to, no, actually now there’s this new equilibrium where there’s [now this] new group of people in power. And so a lot of the things that you see happen in military coups can be understood as trying to achieve that shift in equilibrium, that shift in consensus about who is now in control. And so a classic example that Singh gives in this book is capturing radio stations and then sending forth proclamations of your victory which are credible. You don’t massively exaggerate, you don’t say, “Every single person in this country has supported me always”, because that would ring false. But you say, “We have the backing of the senior military generals. We have executed a coup. The old government has been defeated.”
(00:09:41): And then you do things to make those claims seem more credible, like seizing control of key institutional buildings, for example. And then the absence of vocal resistance, the absence of vocal opposition and of military warfare then just serves to reinforce this new impression. Because you said it, it kind of looks like it’s true if you look at the military actions that have just happened, and no one is saying anything else. And if you can get that consensus of opinion within the military, and then you can convince the rest of the society that, yes, all of the military, all of the hard power supports this new regime, then the rest of the society also… Why would you oppose a new regime which is backed by all the hard power? You’re just going to get yourself in trouble.
(00:10:33): So at that point, the rest of society’s incentive is then to work within the new regime. As you say, things were set up with the old regime so there will be questions about how you pragmatically reorganize exactly what everyone’s doing and how everything fits together. But once you’ve got essentially everyone in society recognizing, yes, this is the new order and we can see that that’s what everyone’s going for, then that’s the hard work done. And then it’s more like filling in the details of exactly how the new set of bodies will relate to each other and what the chain of command will be from the new leader to the various parts of society.
Daniel Filan (00:11:14): So the picture I have is roughly: in order to do a military coup, I’ve got to persuade the military that we’re doing a coup. Somehow I’ve got to persuade the military and the rest of the society presumably that if anyone defies my new order, we’re going to come in and we’re going to beat them. And presumably in order to do that, I’ve actually got to go in and beat some people who might be considering defying my new order just to demonstrate that I can, and maybe that’s why I actually storm the buildings.
Tom Davidson (00:11:50): What do you mean when you say beat someone? I mean I think, yes, you need to show some credible sign that you have the support of military forces. I don’t know if you literally need to then go and shoot down protestors. That’s one effective way to show that you’re willing to beat people down. But there’s often coups without any bloodshed.
Daniel Filan (00:12:11): Yeah, I just mean demonstrate that if someone tries to resist, they won’t succeed. So maybe that involves you killing people. Maybe that involves you marching in and people are visibly too afraid to stop you and this is just a sign that, okay, apparently you can do what you want now.
Tom Davidson (00:12:30): Yeah, there’s a great part of this book where it discusses how a common tactic for creating this new shared understanding is to host a meeting with all the top brass of the military and say, “We are staging a coup. All of you have agreed and all of you are on board.” And then you just watch as no one opposes you, because everyone thinks it’s probably kind of plausible, and maybe some of them were on board, others had said they might be on board and had been ambiguous. But once you’re there and you say it and no one opposes you, that already sends quite a strong signal. And then often that kind of meeting could be where that essential shift in equilibrium actually happens.
Daniel Filan (00:13:13): Okay. So I think I understand coups at this point. The next thing I want to know is: how bad are coups?
Tom Davidson (00:13:19): Yeah. It’s a really interesting question. Now, coups are most common in countries that are not robust democracies. In fact, they’re very rare in robust democracies. So a coup in the United States I think would be very, very bad, because we currently have a system of governance with checks and balances and democracy, and I think we’d be losing a lot if we had a coup. When coups have happened historically, they’ve often started from much less good governance systems and so they have been less bad. But still, coups involve a small group of people just using hard power to force the rest of the country into submission. And often they are extremely bad.
Daniel Filan (00:14:09): Fair enough.
Tom Davidson (00:14:09): Bad from a process perspective in terms of justice, but also bad for how the country is governed thereafter.
Daniel Filan (00:14:15): Gotcha. Okay. I’ve realized I want to get back into how to do a coup for a final bit. So in some parts of your report, you mentioned that you only need a small fraction of the military to be on board with a coup to succeed, at least often. That seems crazy to me because if you have one third of the military and I have two thirds of the military, I would naively think that I could beat you. I don’t know, maybe if you have the best one third of the military, if you have the one third with nukes. What’s going on there? How much of the military do you really need to get on board in order to do this?
Tom Davidson (00:14:50): So if you have two thirds of the military and there’s a strong common knowledge among those two thirds that they’re all on your side, then I agree I can’t do a coup with my one third because I’m outgunned. But if instead, there’s just the whole military which kind of thinks, “Yeah, currently Daniel is in charge.” Let’s say you are the incumbent. And then I come along and I get my one-tenth of the military, storm all the buildings, threaten the key commanders of the military not to say anything. And I create these credible signals that in fact a large part of the military supports me. Even if that remaining two thirds actually backs you, if they don’t know that they all feel that way, and there’s big uncertainty among their ranks about how they feel, and it just seems like, man, it seems like everyone’s backing this new leader and that’s what they’re saying on the radio waves and none of them are denying this, then I can flip that equilibrium. And I’ve flipped it into an equilibrium that people are less happy with, that military personnel are less happy with.
(00:15:51): And they’re not able to all get together in a group and ask, “Do we like this new leader or not?”, because that’s dangerous: they don’t want to be seen as opposing the new, apparently legitimate regime. Under this new equilibrium, that’s not a good thing for people to know about you. So because of those dynamics, it is possible for a minority to stage a coup.
Why AI might enable coups
Daniel Filan (00:16:17): Gotcha. Okay. So now that we know coups are bad, especially if you do them to a country like the US, and now that we know how we could do a coup if we really wanted to, let’s go to AI-enabled coups. So you’ve written this report on AI-enabled coups, but my sense is that that’s because you think they’re bad and worrisome. And presumably you believe that they are both somewhat plausible and quite, quite bad out of the space of things that could happen. Probably we should first go into why you think AI-enabled coups might be a plausible thing that could happen.
Tom Davidson (00:16:55): Yeah. So we can start off with where the historical evidence leaves us, which is that mature democracies are pretty robust to military coups. They have recently not been looking nearly as robust to executive coups. So there’s been democratic backsliding in… The best recent example might be Hungary, which has become increasingly autocratic through the gradual removal of checks and balances. There’s the example of Venezuela we discussed earlier, and many commentators think that this is happening to a very large extent in the United States as well. So I think historically, we can say military coups do seem very rare, and executive coups seem rare, for sure, but not off the cards at all, and many people are worried about them even before bringing AI into the picture.
(00:17:54): Now we can bring AI into the picture, and I think the first thing AI does is it makes executive coups seem a fair bit more plausible. And [there are] two main reasons it does that. The first is that a group of people in the executive that wanted to do an executive coup might be able to gain a lot of control over very powerful AI in a way that gives them a big strategic advantage over the other forces in society. The dynamics of executive coups as they play out typically involve a power struggle between the executive trying to centralize control (and their various supporters), and the checks and balances that were in the system already trying to oppose them. And often there’s really a lot of push and pull. And sometimes, in the case of Venezuela, the head of state was literally put in jail for a bit by their opponents, then got out and they got reelected and then ended up really becoming an autocrat. So there’s all this strategic maneuvering.
(00:19:03): And so the first thing is that if the people trying to do the executive coup can get a lot of control over powerful AI and can deny access to similarly powerful AI to their opponents, that could just give them a big strategic advantage in that political maneuvering. That’s the first dynamic, which I think makes it higher [risk].
(00:19:23): The second dynamic is that today, people who are trying to do an executive coup or have already centralized power need to rely on lots of other humans to help them out. And that constrains their actions in various ways. So normally it’s hard for them to be completely brazenly power-grabbing. They need to come up with plausible ideologies and justifications that they can then get supporters to rally behind and support particular moves. But with AI systems that are sufficiently powerful, you can replace those humans with AI systems. So rather than having the policies of the government implemented by humans that have some ethical standards and that don’t, at this point, really want to support really awful surveillance and that have been brought in on a broad ideology, you can just replace them with AI systems that will just follow the instructions of the head of state with far fewer qualms. And so that can give that head of state that’s trying to do an executive coup an additional edge because they’re less constrained by having to work within this broad coalition involving lots of other humans.
(00:20:50): And then the most extreme example that we highlight quite a lot in the paper is indeed armed forces. Like today in the United States—this is very stark—the military personnel are very, very opposed to breaking the law. They’re very much loyal to the Constitution, and they strongly expect that all the other military personnel are going to do the same. And so it’s a really, really tall ask for the head of state in somewhere like the United States to get active help from the military in staging an executive coup. And indeed Trump has come into some friction with the military when he’s tried to get their help for deterring certain protests.
(00:21:32): But again, this could be a really major shift as we increasingly automate the military with very autonomous weapon systems where, again, the thing we highlight is the most extreme case where you can fully replace a human soldier with a military robot. At that point, under current law, it might be completely legal for those robots to be programmed to just follow the instructions of the commander in chief, the head of state. And so we’d move from this current situation where if you’re trying to do an executive coup in the United States, you’re not going to get much help from military personnel, to this new state where it’s kind of up for grabs what are going to be the loyalties and the decision processes of this new automated military. And so this just introduces a big new vulnerability for really cementing an executive coup with hard power.
Daniel Filan (00:22:26): Gotcha. So I guess it seems like there’s two key factors here, two progress bars on AI capabilities it seems like you want to keep track of. The first is roughly: how useful is AI for navigating strategic maneuvering? You’re like, “Oh, I’m in prison, but these people think this and these people think this.” To what degree does AI really help you in this situation? And then the other one is: how loyal can you have the AI be for you? So in the report you talk about [how] you want the loyalty ideally to be both singular to you and also secret. Other people don’t know about the loyalty. And in addition, a thing that seems like it’s important is just that all of these things have these loyalties. If you have a bunch of AIs, but they’re about as diverse as people are, it seems like this is probably harder to get off the ground.
Tom Davidson (00:23:27): Well, I’d push back a little bit: certainly the more the better. But if you could get, say, 10% that are loyal to you secretly and 90% that are just… If there’s any chaotic constitutional crisis, they defer to inaction, then that would be enough, because you get your 10%, you get your fait accompli, that 90% don’t do anything to block it and there you are.
Daniel Filan (00:23:49): Fair enough. That’s actually a really good point. So in either of these aspects, how much evidence do you think we have about how useful AI is going to be? How much of these relevant capabilities does it seem to have already?
Tom Davidson (00:24:16): It’s a good question. There have recently been a few studies that were pretty surprising to me on AI persuasion. I unfortunately don’t remember the details, but I will just give my high-level memory, which is that there was one study that had AI posting on Reddit and then compared the number of upvotes to those of human commenters who were also posting. It was like, “Persuade me of X.” And the AI would do background research on this person and what their demographic was, and would tailor-make a really emotional story about their own life that really brought it out. And the results were in some crazy high percentile, I think it might be the top percent or close to that, for persuasiveness. And I was quite surprised, because that had not been a capability that I’d thought that we were targeting with current training techniques.
Daniel Filan (00:25:17): And one thing that’s crazy… So this was done on the subreddit r/changemyview, so it was research that was done… My understanding is that it will not end up being published basically because… So r/changemyview, it has basically some rule that you’re not allowed to have an LLM pretend to be a human and try to persuade a bunch of people of stuff. That’s not okay according to them.
Tom Davidson (00:25:43): Crazy.
Daniel Filan (00:25:45): Yeah. So unfortunately we probably won’t learn as much as we might like to about that study. But the other thing is: so this was done at, I believe, the University of Zurich, which… I’m sure they have fine people, but this is not the world’s leading AI lab or the world’s leading graduate program in AI. So the fact that this obviously competent but not top-tier AI institution can do it, maybe that lends credence to like, “oh, this has gone further than you might think.”
Tom Davidson (00:26:22): Yeah. And I think there’s been one other study, and I actually can’t remember the details of this, but again, it found that AI was close to top percentile humans in persuasiveness. So that’s updated me towards thinking that AI might be very good at this kind of strategic maneuvering aspect because one element of that is persuading people and historically often that’s taken the form of persuading people of an ideology which serves your purposes. And that seems like the kind of thing that these studies are looking at. They are often studying, okay, here’s political topic X, can you shift my opinion on it?
Daniel Filan (00:27:04): Fair enough.
Tom Davidson (00:27:05): There’s then another relevant capability beyond persuasion, which is something like strategic planning, which is essentially: you’re in a situation, you want to achieve a goal, what plan is best to achieve that strategic objective? And it’s really hard to predict these things, but it doesn’t seem to me like the current training procedure is really bringing that out. [For] persuasion, at least, it’s obviously trained on loads of conversations where it can see what’s persuasive and what’s not. It’s not that surprising that it’s generalized well there. For strategic planning, it feels more like it would need to have been trained in situations where there’s a scenario and then an action is taken to try and achieve an objective. And then there’s a really complicated socio-political system and then it washes out and you see what happens.
(00:27:57): And it’s been trained on the internet, which contains loads of history, and you can probably extract that kind of stuff from history. But it seems more of a stretch to think that from pre-training it’s going to generalize, pick out those lessons, because it’s so much less direct. So this is the kind of thing where you can imagine someone setting up some kind of fancy RL pipeline down the road where they try and extract all of the relevant signal that is currently fairly implicit in internet data and craft and give it to AI and maybe also have AI try [to act in] various simple artificial environments, and then maybe have an AI actually try and achieve things in the real world and learn from that. But I would expect it to come a bit later in the capabilities tree compared to things like coding and maths where you can get a good automated feedback signal.
Daniel Filan (00:29:00): Fair enough. So okay, that’s a little bit on why AI might be able to do coups and what might go into that.
Tom Davidson (00:29:10): I’ll just quickly say… That was all in the executive coup part. The other thing I was going to say ages ago is that I think there’s this new risk of a “corporate coup”: because AI is going to be so powerful, and it’s currently being developed, controlled, and deployed by private actors, by default there’s going to be this new big concentration of power in those private actors. And I think that that will open up some routes to staging a coup. Now this is necessarily more of a speculative idea because we just don’t have the same historical precedent here. You do have some coups staged historically by private companies. Normally it’s very rich United States private companies operating in very poor countries. So the “banana republic” is the famous go-to example, where a fruit company arranged for there to be a military coup which served its own interests. But it’s pretty rare. And so this would be a new kind of risk. But I think the threat models here are plausible enough to be taken very seriously.
Daniel Filan (00:30:31): So it seems like the threat model is something like, okay, you have this company. It’s making a thing that’s really dangerous, or a thing that could be used to be really dangerous, a thing that could be used to help you take control of power, and the people that make the thing use it to take control of power. It seems like there’s maybe an analogy in that: countries buy weapons systems and the weapons systems are really… The US Army would be much, much worse if they just had to use rocks and stuff, or if they had to swim. Northrop Grumman or the BNS…? That might be the wrong name [Editor’s note: the right name is “BAE Systems”]. But these weapons manufacturers, do we ever have instances of them being like, “Hey, we’ve got a ton of fighter planes, let’s do a coup ourselves.”?
Tom Davidson (00:31:24): I’m not aware of any. I think there’s a few dynamics at play. One is that you need soldiers to use the weapons and those are trained by the military and they have this strong commitment to the rule of law. And another is that there’s multiple different military suppliers, multiple different companies, and so they would need to all be colluding. And AI does change both of those things. So on the weapons side, I believe we’re going to end up in a world of autonomous weapons and so you won’t need those additional humans in order to stage the coup. And so the companies will literally now be making all parts that are necessary for the military force. And the second is that there are dynamics that could point to very strong market concentration in frontier AI, i.e. maybe just one, two, or three companies that have the most powerful AI systems.
(00:32:24): If those AIs are the ones making all the military weapon systems (in the most extreme case, if it’s just one AGI project whose AIs are making all the military systems), then there’s now a single point of failure, and that project is in an unprecedented position in that sense.
Daniel Filan (00:32:45): Fair enough. I guess the other thing that seems maybe analogous is if a country hires a mercenary force to supplement its military, but it seems hard. For one, if the mercenaries are a small fraction of the military, maybe it’s harder for the mercenaries to create common knowledge that the non-mercenary military is on board with the coup. But are there examples of mercenary coups?
Tom Davidson (00:33:07): Off the top of my head, I’m not aware of any where the mercenary is like…
Daniel Filan (00:33:17): Oh, there’s the thing that happened in Russia with the guy [Yevgeny Prigozhin]. Do you know his name?
Tom Davidson (00:33:21): The guy who started marching towards Moscow.
Daniel Filan (00:33:24): And then he gave up on it.
Tom Davidson (00:33:26): Yeah, yeah. I think it didn’t end well for him.
How bad AI-enabled coups are
Daniel Filan (00:33:29): Yeah. But that seems like almost an example. So: military coups, executive coups, and company-led coups. It seems like there’s some plausibility to the idea that AI could increase the feasibility of each of these kinds of coup. I guess the next thing I want to ask is: there’s a wide universe of scary things that people worry advanced AI could do. How high up on that list of scary things should AI-enabled coups be?
Tom Davidson (00:34:07): My current view would be that in terms of importance, it should be maybe second behind AI takeover.
Daniel Filan (00:34:19): Interesting.
Tom Davidson (00:34:20): And if you then factor in neglectedness, then I think it’s actually more important on the margin than AI takeover. And I think it’s more important than, for example, AI-enabled bio-attacks by terrorists, which is another risk from AI that people are focused on. And similarly, AI-enabled misuse in terms of cyber. I’d also put it as more important than that.
Daniel Filan (00:34:48): Do you think it’s worse than AI-enabled terrorism or more likely?
Tom Davidson (00:34:52): Disclaimer, I haven’t thought in depth about this comparison.
Daniel Filan (00:34:56): Fair enough.
Tom Davidson (00:34:56): But it’s easier for me to see how AI-enabled coups would have a completely long-lasting effect. So it’s certainly possible that an AI-enabled terrorist literally makes everyone go extinct, but full extinction is quite hard to get from a bioweapon, especially given that we’ll be using AI to develop defenses as we go. And there aren’t many people who want to see everyone die, and so we only need to stop those people getting access to these systems. And it seems like it’s not that hard to do that. I mean, past a certain point, it might be necessary to prevent open source. It also might never be necessary to prevent open source depending on how far ahead closed source is and how quickly we can get the defenses in place, and other inputs that are needed to actually do a bio attack.
(00:36:02): Whereas with the AI-enabled coups, I think there are very many people who want more power. Many of those people will, by default, have a lot of control over AI and might well be in a position to do this. And the default dynamic of AI development I think is just going to really concentrate control of AI development and deployment in the hands of a very small number of people. And so if I’m telling the story, I’m just like: well, look, people want power. They’re going to by default have loads of power and the opportunity to use it to gain more power. It is kind of believable that they do it and they seize control. And then once they do, they just hang onto control. It doesn’t feel hard to tell a story where this lasts for a very long time.
(00:36:52): Whereas in the case of bio, it feels a little bit more difficult, because we have to not get the defenses in place despite the fact that it’s in all of our interests and all the powerful people want to do that. We have to actually share these systems even though we’re testing for this risk and will likely have evidence that there is significant uplift. We have to make them so widely available that the tiny number of very low-resource actors that want to do this are able to. So yeah, that’s roughly where I am in terms of putting it as a high priority.
Executive coups with singularly loyal AIs
Daniel Filan (00:37:28): Yeah, I think that makes sense. Maybe it makes sense to talk a little bit about the scenarios of… Types of AI-enabled coups and stuff you could do to prevent them. So I guess at a high level, you’ve got your corporate AI coups, you’ve got your executive AI coups, and you’ve got your military AI coups. Which one are you most excited to talk about first?
Tom Davidson (00:37:59): Let’s start with the executive.
Daniel Filan (00:38:01): If I imagine what this looks like, should I basically be like, okay, you’ve got an executive. The executive somehow gets a significant amount of control over AI development. So in the executive coup, the executive is just using the AI to persuade people or figure out strategy in order to allow the executive to get gradually more and more power.
Tom Davidson (00:38:29): The other thing they’re doing is that they’re deploying AI throughout society, especially in the government and military, but the AI is more loyal to them.
Daniel Filan (00:38:37): So, deploying AI throughout society and the military to make it more loyal to them: I guess part of what the executive is doing is trying to stop other AI-enabled coups.
Tom Davidson (00:38:45): Potentially, if those are seeming plausible, if there’s a risk that there’s going to be a corporate coup, the executive would want to stop that. But I haven’t been thinking of that as a primary thing that they’ll need to do. The primary thing I think is to centralize power in themselves.
Daniel Filan (00:39:03): Right. So less to prevent coups and more just to prevent independent other entities wielding any power. So you have these three risk factors, right? Singular loyalty, secret loyalty and exclusive access. And so it seems like part of this story is the executive uses the AI to do a bunch of tricky stuff, and other people can’t figure out how to stop them. This seems like it’s largely leaning on exclusive access, and the bit where the executive has everyone else use AI that the executive likes, seems like this is maybe leaning more on singular loyalty, and to some degree secret loyalty. Is that roughly right?
Tom Davidson (00:39:52): That’s exactly the mapping. And normally with the executive [coup], I’m not imagining secret loyalties, although it’s possible, because the executive has so much political power to begin with, they could just be like, “It’s completely appropriate for these AI systems to be loyal to me.” They could do the more fancy thing of secret loyalties, but there’s a technical hurdle there and it just might not be necessary for them.
Daniel Filan (00:40:16): So it seems like these are sort of two routes… Or I don’t know if two routes is the right term, but two things that the executive is doing with the AI. I’m wondering: do you need both of them? Or if you’re an executive trying to do a coup, could you survive with just one of these?
Tom Davidson (00:40:32): I think you can do it with just the “singular loyalties in AIs deployed throughout society” version. So the story would be there’s heightened tensions between the US and China. We’re rushing out to deploy AI to the military and-
Daniel Filan (00:40:53): “We” being the US?
Tom Davidson (00:40:54): “We” being the US. The US is doing that, and the head of state is saying, obviously military AI follows the commands of the commander-in-chief. That’s how it should be. That’s how the command structure works. We’ve never in the past had autonomous drones check whether things are legal before they follow their instructions. They just do what they’re told. That default continues. And then people will very likely oppose this and say, “This is crazy. Wait a minute, couldn’t you just stage a coup?”
(00:41:26): But the head of state has their supporters and has a lot of power and has already set a precedent of really nailing people who push back against them. And so they succeed in pushing this through. And they never had access to any kind of super genius strategy AI because the strategy was just quite obvious: “Well, yeah, if we just get all the military robots loyal to me, obviously I now can do whatever I want.” And so I do think that second path can work by itself.
Daniel Filan (00:41:54): And one concerning thing there is: so when you say, “Oh, the AIs are being loyal to the president and they’re not checking other laws and stuff”, I think it’s not a crazy argument that that is legally how it should work. Definitely, the president is literally the commander-in-chief, as you note. There’s a prominent legal theory called the unitary executive theory that the president, just in his own person, has unitary control over the executive branches of government. Oh, I guess I don’t know if the military is executive, but…
Tom Davidson (00:42:35): I think the design of the Constitution is very much intended to separate and limit the president’s degree of control over the military. It is very clear that the military is loyal to the Constitution. So I think if you were to take the spirit of the Constitution and apply it to a robot army, it would be clear that you shouldn’t just have the robot army doing whatever the president said without checks and balances. I think, though, that a robot army is not something the Constitution was designed for. It didn’t have caveats for what happens if we develop a robot army. So as it is currently written, reading it line by line, I cannot be confident that it would rule out this loyal robot army as illegal.
Daniel Filan (00:43:16): Fair enough.
Tom Davidson (00:43:16): And so I think you’re right that you could make legal arguments that this is at least legitimate and yeah, you could claim it’s appropriate given the commander-in-chief, although I do think you’d be on shaky ground given the clearer intention of the Constitution.
Daniel Filan (00:43:33): Yeah, yeah. I guess maybe one thing going on is: my understanding is that American jurisprudence, especially at the Supreme Court level, very much leans towards “what does the text say?” rather than “what do we believe the intention of the text was?”, which plausibly heightens this risk in this domain.
(00:43:57): So going back to this story, let’s say they only have the “loyal AIs in the military” part. The president gets all these military drones or whatever, loyal to the president, and is the story that the president then does a military coup of, “if any police officers try to stop any of my supporters doing random violence or whatever, the military drone will shoot the police”? Is that roughly it?
Tom Davidson (00:44:23): Probably it’s going to be in the president’s interests to not show more force than they need to, because it’s going to be useful for them to have everyone continuing to support their leadership and seeing it as legitimate. Probably what they do is they kind of increasingly ignore checks and balances on their power, and then increasingly it becomes clear that nothing is going to stop this situation because at the end of the day, if the protestors come, this time, the president can just order the drones to go and clear out that protest.
(00:44:57): Probably not going to shoot everyone, but going to make them go home. And he wouldn’t have been able to do that before the robot army. And then increasingly the president is just doing what he wants, ignoring the checks and balances, integrating AI to replace all of the humans that aren’t doing exactly what he says he wants them to do. And if anyone ever tries to really refuse to go along, then at that point he just fires them and has them put in jail or something. And that’s kind of a show of strength. And then as no one is able to oppose this, because ultimately the hard power’s in the president’s hands, it just becomes increasingly clear who’s in charge.
Daniel Filan (00:45:46): Sure. So if I think about this broad scenario, one thing that’s kind of interesting to me… So a background thing I’m thinking about when I’m reading this report is the relationship between AI-enabled coups and AI alignment or misalignment risk, right? And so if I imagine this somewhat minimal version of the executive coup, where basically the way it works is that you just have a bunch of military stuff and it’s powered by AIs and the AIs are… Or at least 10% of them or whatever are loyal to the president. The AI technology that enables that is just alignment.
(00:46:29): Getting an AI to do what a person wants: that’s the problem that we call “alignment”, that we’re all hoping to solve. For some of the paths, I think, alignment research really would prevent them or make them a little bit more tricky. But this is interesting because it seems like a [path] that really is cutting against a lot of technical alignment work… Or ‘cutting against’ is maybe the wrong word, but [one that’s] not prevented by technical alignment work. I’m wondering if you have thoughts about that.
Tom Davidson (00:47:01): I think that if the executive, the president knows that AI is misaligned, then he’s not going to be wise to give it control of the robot army. If the president believes that the AI is aligned and in fact it’s secretly misaligned, then the president might well give it control of the robot army, “align” it, think he’s aligning it to be loyal to him and then stage a coup, and then he will be laying the groundwork for AI takeover.
(00:47:33): But in fact, the threat model of him staging the coup goes through even though he hadn’t solved the alignment problem. And my understanding is that people are mostly worried about this exact scenario where AI seems aligned, but it’s not. And so I think basically the threat model still goes through in that scenario. The difference that doing more technical alignment research [makes] is it means that rather than the president maintaining control of the world indefinitely or the country indefinitely after the coup, if you fail to solve technical alignment, then in fact the president is going to be replaced by misaligned AIs, which you may prefer or disprefer depending on various philosophical considerations.
(00:48:17): But I wouldn’t particularly say that if you solve alignment, then you’re making this threat model a lot higher… Well, except to the extent that it’s then common knowledge that you’ve solved it. I think if it was going to be widely known that it’s not solved, then I agree, yes, you are increasing this risk.
Executive coups with exclusive access to AI
Daniel Filan (00:48:35): So that’s how you could do it if you only did it via having singular loyalty throughout the military, just one half. Asking about the two halves—exclusive access to do really good planning, and loyalty of AIs distributed throughout. If you just had exclusive access, do you think you’d be able to do an executive coup just via that path?
Tom Davidson (00:49:03): I think it’s a lot less clear. I think the main thing you would do with exclusive access is… The most obvious thing you would do is then try and convert this first path to that second path. So you’d use your exclusive access to get AI strategy advice and AI technical analysis about how you could get loyal AI systems deployed throughout critical systems. And so they might advise you to do secret loyalties. They might advise you on a particular political strategy for pushing through the more overtly loyal AI systems. I think that’s the most obvious route.
(00:49:45): If you were like: could you use exclusive access to stage a coup without going via this other kind of “singularly loyal AI” approach? I don’t know how important this question is, but I think it’s basically unclear. If you buy into the more sci-fi-esque claims about what superintelligent AI will be able to achieve, then yes, you could do this because what you could do is you could set up a group of automated factories somewhere, maybe as part of a kind of military R&D project that you managed to push through, and then you just quickly make very powerful, fully automated weapons, nanobots or just amazing drones.
(00:50:31): And then even though they weren’t really ever integrated into the official military, they just then straight out stage a coup. So you then stage a coup without having to integrate AI in any kind of formal institution, but it leans much more heavily on what you can get through super genius AIs and then a relatively small amount of physical infrastructure.
Daniel Filan (00:51:00): There’s this interesting thing that’s going through the back of my mind as I read this. So in general, when someone is like, “oh yeah, I’m worried that in the future we’ll have more powerful AI and the powerful AI is going to mean that people can do a bad thing”, I think a natural question to ask is, “well, why don’t other people use the powerful AI to stop you from doing the bad thing?”
(00:51:22): And so for the first path of executive coup where the president gets all the military AIs to be singularly loyal to him or her, presumably the reason other people don’t use AI to stop that is because this is at least arguably legal and arguably legitimate and at some point, you’d be doing the coup if you resisted. And I guess exclusive access is another story where people don’t stop you because they just don’t have as good AI compared to you. I guess that’s more of a comment than a question, unfortunately.
Tom Davidson (00:52:01): Yeah, I mean, I agree. I think it’s clear where the asymmetry comes with exclusive access, and then with singular loyalties, the asymmetry is: you (and not everyone else) [are] deciding the behavioral dispositions of these AIs deployed throughout society. And so you’re leveraging your political power to push through this asymmetric AI loyalty in broadly deployed systems.
Daniel Filan (00:52:23): I guess part of the asymmetry here is: if you don’t have exclusive access, then presumably if people are willing to break the law, they can do some amount of preventing you from having exclusive loyalty by subbing in their AIs or using their AIs to help them figure out how to stop you - make it appear that things have exclusive loyalty to you, but they don’t actually.
Tom Davidson (00:52:50): I’m not sure. So again, going back to the military case, you could have the hardware for these military robots. You could then be like, “I’m deploying this AI software, which is loyal to me.” No one else can then go and actually deploy their own AIs on some of those military robots because it’s just infrastructure that the government controls.
(00:53:11): And similarly, you could imagine fully automating some kind of implementation body of government which has some formal authorities and now no one else can again sub in their own AI. Because their AI could do analysis, could make recommendations, [but] they wouldn’t have the formal legal authorities to take actions within the political system. And so again, they wouldn’t be able to sub it in. I think if we’re talking about human employees still working within those organizations and using AI systems, then it’s more like, okay, they could sub it in.
Daniel Filan (00:53:44): And I guess it’s even tricky just because the president just inherently has a wider scope to do this. So one thing I think I’m imagining is there’s the president. There’s, I don’t know, a few branches of the military. The military has various admirals and then under the admirals or whatever, there’s a bunch of robo-soldiers, and I guess I could maybe imagine, okay, one of these admirals convinces their robo-soldiers to be loyal to the admiral and not to the president, but you sort of need a bunch of the admirals to do that, going to the earlier point of creating common knowledge.
Tom Davidson (00:54:19): Yeah, I imagine it would be more people who are involved in the procurement process and the technical side of that process of setting up the AI software than people who are charged with making the strategic analyses in real time that might be in a position to do that. But yeah, I agree that there could be people in the military who could do that.
Corporate AI-enabled coups
Daniel Filan (00:54:41): So okay. We’ve talked about the paths that an executive could use to do a coup. I think I want to pivot to the paths that an AI company could use to do a coup, because at first blush, it seems like both of the things you said could kind of work for the AI company. If the AI company can get all the robo-soldiers to be loyal to the AI company, and if the AI company can have exclusive access to its own AIs, that seems like maybe it’s putting the AI company in a pretty good place, right?
Tom Davidson (00:55:14): Yes. I think for exclusive access, it’s easier for the AI company. They are just going to have it by default. For the head of state or the executive branch to get exclusive access, they would have to intervene in quite a substantial way with the development process. I think the thing that’s harder for the AI companies is that in terms of deploying these loyal AI systems throughout society, there’s a much higher technical hurdle where they have to make them secret, really hard to detect.
(00:55:49): Let’s say they train GPT-7 to be secretly loyal. It could be five years later that someone discovers a new testing procedure, and if those weights have been stored somewhere, then someone realizes, “whoa, OpenAI trained a literal secret loyalty” and then the game’s up. So you probably want to really cover your tracks and lock down those weights, make sure no one can ever run any tests on them that you don’t want to run. Today, AIs are not that reliable, their personalities are somewhat haphazard, it’s all a bit of a mess. Today, I think it would be very hard to get away with a very hard-to-detect secret loyalty. And it might just be that it’s hard to predict how hard it will be to detect these secret loyalties.
(00:56:41): So it might be that the company is considering this: we could put in this really subtle back door, but we don’t know what people are going to do with API testing. We don’t know whether some of our employees might test for this in certain ways, and some of our employees will have access to deeper analysis tools. There’s going to be records of the training data. Are we going to be able to delete all of those? And if not, what if someone goes looking? There might just be genuine uncertainty about whether this is possible until quite late in the day. I imagine that at some point, once you’ve got really superintelligent AI systems, they would be able to come up with a plan here, but it might be that for most of the intelligence explosion, the analysis is just like, “ah, seems like you might get caught”.
(00:57:25): Whereas for the executive coup, the president is doing everything in plain sight. There’s no risk of getting caught because they are claiming that they have legitimate reason to do all these things as they go and they’re defeating their opponents as they go. And so there’s less downside risk. So that would be the counterargument. I think there are things on both sides of the ledger in terms of which is more likely, but the counterargument is just that the technical hurdle is much higher.
Secret loyalty and misalignment in corporate coups
Daniel Filan (00:57:56): So maybe this is actually… So I’m trying to give myself some hopium to stop myself from worrying about this, and maybe one path I have is: okay, AI-enabled coups, it seems like it’s much easier to do it if you’re the AI company because you have all the AI. But you have to have this hurdle where the loyalty of the AI kind of has to be secret. It seems to me that the worry about that is: suppose you’re an AI company and you succeed in instilling the secret loyalty to yourself.
(00:58:29): I think that should make you rationally worried that, well, if an AI can do secret loyalty, having secret goals that it pursues, this is basically just the same thing (as far as I can tell) as deceptive misalignment, where an AI is pretending to be aligned to you, but it’s actually not aligned to you, it actually just wants to do whatever it wants. And so I would think that if an AI company succeeds at getting a secretly loyal AI, they would rationally be concerned that the AI that they think is secretly loyal to it is actually deceptively misaligned, that if they use this AI to get a bunch of power, the AI is actually just going to take over. And it’s not going to be the AI corp that rules everything. It’s going to be some random AI desires. What do you think of my hopium?
Tom Davidson (00:59:14): Yeah, I’m not convinced. I think that you are right that the AI company would know that the AI has the capability to do deceptive misalignment, but the AI company might still have strong reason to think that it will not have the propensity. It might have done careful interpretability throughout training or whatever alignment techniques you believe might work. It might have various empirical evidence to suggest that it is in fact not misaligned, even if it would be capable of doing that.
(00:59:49): And in fact, it might not even have evidence of the capability. If the company is really just going for this, then it could have very sophisticated internal tests for scheming which in fact the AI cannot game. And then it could train an AI to be secretly loyal. But then that secretly loyal AI is essentially only able to game tests that it’s been trained on how to game. So it can game all the company’s tests - it’s been trained on that. That’s not scary: obviously it can game a test if it’s been told how to pass it. It can game the unseen tests of auditors outside of the company because they’re much less sophisticated. But the company could still believe that the AI is actually not capable enough to game a highly sophisticated test that it has not been taught how to pass. And so actually, even on the capability side, the company needn’t have evidence that the scheming capability is there.
Daniel Filan (01:00:53): Yeah, so one thing is it could be secret to people other than the AI company but not to the AI company itself. You could also have this thought: it seems somewhat reasonable to think that the first AI you could succeed at making secretly loyal by doing a bunch of specially-built training maybe couldn’t have become secretly loyal just through the normal course of pre-training.
Tom Davidson (01:01:33): Yeah, I think that’s very likely. I think that the risk of secret loyalties is very likely to emerge before the risk of scheming because it just seems way harder to zero-shot how to pass all these tests and know what your strategic situation is and exactly when to act out versus if you just have a million super smart AIs figuring all of this out, training you how to do it, doing all this analysis. It’s just like the difference between what I can figure out in a few seconds on the fly and what a large civilization doing a concerted research effort can figure out, orders of magnitude difference in how much you can achieve. And so I do think this secret loyalties thing, I strongly expect to become technologically feasible at an earlier stage.
Daniel Filan (01:02:24): Okay, here’s why I’m holding onto my hopium. I think there’s this risk aversion, and I think it depends a little bit on how risk averse you are. So suppose you’re like: okay, I managed to instill a secret loyalty. Let’s say I’m 95% confident that the AI doesn’t have its own secret loyalty, so there’s a one-in-20 chance that it does. If you’re the head of an AI lab, I think you have a pretty decent life. I don’t know, I’ve never chatted with Sam Altman or Demis Hassabis, but from the outside, it seems like they have relatively cushy lives, right? A one-in-20 chance of “you hasten AI doom by starting a coup”, that’s pretty bad, right? So it seems like it has to not only be true that the AI doesn’t have a secret loyalty, you have to be pretty confident in it.
Tom Davidson (01:03:14): Well, let’s say that OpenAI trained GPT-7, it did the capabilities tests, it did the alignment tests that it has, and it was like, “We’re going to deploy it. We’re happy with this system.” They’ve got a certain level of evidence. And yeah, let’s say it’s really capable, it’s really good at strategic deception, but indeed, people in this community worry that they would decide to deploy nonetheless. Maybe the risk is 5%, maybe they think it’s 0.5%. The question is, if they’re now considering instilling a secret loyalty, is that going to significantly materially increase that risk? And it’s not actually something I’ve thought about. You could argue: well, look, you are going to be actively teaching it all these different types of strategic deception. That seems like maybe it’s increasing this risk. But the reason why I’m not sold is that I don’t see why you’d be actively teaching it to in fact be misaligned. You’re obviously giving it capabilities which are scary. But if you’ve already decided how likely you think it is to be misaligned, you’ve already decided you’re happy to be deploying it. Are you going to now be more worried about it suddenly becoming misaligned as you trained it to be loyal to you? That doesn’t seem like it would be the case.
Daniel Filan (01:04:27): Yeah. I think what I’m imagining, which maybe doesn’t actually make sense, is [that] you have an overall plan and your overall plan has two parts. Part one is instill these secret loyalties, and part two is have the AI be more widespread and have more ability to gain power than you by default were planning, right?
(01:04:45): And the combination of that is pretty bad. Now, if you were holding fixed how far you would spread the power of the AI or whatever, then I agree instilling your own secret loyalties… I think it provides some Bayesian evidence. It seems plausible to me that being able to do it is some evidence that it might’ve already had the secret loyalty, but I think it’s less bad than the two-part plan.
Tom Davidson (01:05:14): Yeah, I think it’s a good point. I hadn’t thought of this, that if someone did want to stage a coup using their AI system, they probably would want to push it out into the world faster than other AI companies. Because if half the military is now controlled by this other AI company, then it becomes much less clear you can pull off the coup. So I think it’s a good point, that to the extent that you are already worried about misalignment and therefore you wouldn’t have wanted to push out your AI to the military, that would then continue to stop you from pushing out your AI to the military to stage the coup. That provides some hope.
(01:05:54): But my baseline has actually just been, we’re already worried that the AI companies will just be pushing out their AIs all over the place because they won’t be that worried about misalignment. And then once we take that for granted, then the risk of them staging a coup is very real. I guess where I’d agree with what you’re saying is that to the extent that these companies are worried about misalignment, I do think that that helps because it will give them pause.
Daniel Filan (01:06:22): Yeah. I think it’s a good point that if you’re already deploying and maybe you’re like, “If I’m planning to do a coup, I’m going to deploy 20% more.” That’s probably not going to be the difference between… I think that’s a pretty decent point. I think in the report, the illustration and the way we’ve talked so far is, we’re sort of imagining it’s the head of the AI lab that’s doing this. One thing I kind of wonder about… So a friend of the podcast, Jan Leike, is Head of Alignment at Anthropic [Editor’s note: he actually co-leads the Alignment Science team]. Seems like a chill guy. I don’t want to cast any aspersions on any of the Heads of Alignment of any of these labs. But wouldn’t they be in a really, really great position to do this?
Tom Davidson (01:07:04): Now I’d normally thought that it was the people who were involved in post-training and were involved in, let’s say, the fine-tuning runs that give the AI its particular personality and goals, that would be best placed. The team at OpenAI that’s ensuring the AI matches the model spec, because they’re the ones that are kind of influencing the particular values, and they could maybe try and insert an extra stage at the end that isn’t reported, that inserts the secret loyalties. And then I was imagining the alignment people would be the ones that then do the tests for the accidental misalignment.
(01:07:43): But I think really it’s both that you’d want. The ideal coalition would include both, because you have the team instilling the secret loyalties colluding with the team that’s meant to be detecting them. But you don’t necessarily need the detection team, because if the team that’s instilling them can access the code of the alignment team, then they can teach the AI to pass the test. And so this is another reason to have siloing within AI organizations, so that you can’t teach it to pass the test in that way.
Daniel Filan (01:08:12): Yeah. I’m now realizing that I know less about the exact internal structure of these labs.
Tom Davidson (01:08:19): Yeah, neither do I.
Daniel Filan (01:08:21): Good thing I’m not trying to make a coup happen. But yeah, that’s a good point. And again, no individual aspersions to anyone I might know who works in these teams. I feel decently happy with how we’ve covered corporate coups. Is there more you want to say there or shall we move on?
Tom Davidson (01:08:41): Another technical hurdle I’ll just flag is that I don’t expect GPT-7 to be given control directly of all the robots and all the drones. I expect there to be specialized AI systems which are developed that control those particular military weapons. And so there’s this other technical question over: even if you had all GPT-7 secretly loyal to some group of people, there’s then this big question of, they have to now do all this military R&D, design all these new weapons and drones and build the specialized AI systems that will control those drones and weapons. And they need to do so in a way that ensures that this small group of people ultimately has control over what this whole very complicated interrelated military infrastructure is going to be. It’s not just going to be carbon copies of the same robot. It’s not going to be one AI system. It’s going to be very complicated. There are hundreds of different AI systems controlling hundreds of different bits of equipment interacting in complicated ways. We’re not literally going to have robot soldiers that replace one-for-one human soldiers. That’s just a simplification to explain how intense we’re imagining this dynamic being.
(01:09:56): And there’s an open question of: as you get these sprawling military systems, interacting in complicated ways, controlled by various AIs and other things, how easy is it for GPT-7 or GPT-8 to ensure, as they’re building all of this stuff, that it’s all ultimately going to be controllable by, loyal to, or hackable by the AI company or by copies of themselves? I just don’t know, but it seems like it might be really hard. The military contractors might have humans or less powerful AIs reviewing these designs and noticing obvious flaws.
(01:10:39): The military isn’t stupid, they have serious security practices. They’re worried about Chinese attempts to seize control. They might be worried that there are indeed Chinese spies in the labs who might be having influence on these AI companies. So you might well expect there to be significant defense efforts from the military, fingers crossed, looking for this kind of thing. And maybe [it’s] just a really tough technical task to design a very complicated physical infrastructure that is ultimately all controllable by a particular AI system or particular copies of an AI system. And so I think it might be very hard to predict in advance whether that works, and that’s another significant technical hurdle that might just turn out not to be doable, which I think should give us some hope about whether the company coup is doable.
Likelihood of different types of AI-enabled coups
Daniel Filan (01:11:39): So there’s a few paths towards an AI-enabled coup that we’ve talked about. There’s basically the head of the executive doing it, there’s the AI company doing it, and there’s also this free variable of which country gets couped, since there are a variety of countries where it could happen. I’m wondering if you have a sense of the relative likelihoods of these things happening?
Tom Davidson (01:12:01): It’s a great question. In terms of countries, I think that in the fullness of time, current countries that are already fairly autocratic, like China and Russia, I think are at very large risk of an executive coup because the executive is just starting in such a strong position to begin with. So all of those steps, they’ve basically accomplished the first half or more and then [it’s] just quite plausible [that] they could use their existing power to push through the deployment of loyal systems throughout society. So I think that is worryingly likely. Honestly, it sometimes feels a bit hopeless to me in terms of how we avoid that. You can imagine one country really intervening in another country’s affairs. That’s not something I really feel excited about pushing towards. The other thing is just really encouraging the other actors that still have some power in those societies to really be live to these issues and get ahead of the game and maybe they can outmaneuver the head of state, even though the head of state is in a very strong position.
Daniel Filan (01:13:15): So to the degree that part of the reason you’re worried about AI-enabled coups is that you think that there’s some concentration of AI labs, or a small number of labs that are powerful: I mean presumably one way of preventing this is like: so suppose you and the AI lab are simpatico. Suppose you have a list of “here are the countries that I’m most worried about having a coup”. You could say, “Hey lab, we’re just not selling to those countries,” which is obviously… It’s a somewhat geopolitically aggressive move, I guess.
Tom Davidson (01:13:51): You might also be able to sell AIs that have guardrails that prevent their use to enable an executive coup. It would be very complicated, because if you’re just setting up a surveillance state, there are just lots of somewhat narrowly defined tasks that you want your AIs to do, but as an intermediate measure you could try to differentially allow them to deploy AI systems that won’t centralize power.
Daniel Filan (01:14:17): Yeah, I guess the tricky thing about that being, it’s just very… If you have some countries who do get to use AI in their militaries and some countries where either they don’t or the AI they get to use is filtered for not doing a coup, and maybe other countries don’t trust that that’s the only thing they’ve monkeyed with, it seems like it might be a pretty aggressive move, which…
Tom Davidson (01:14:47): I don’t know how aggressive it’s going to be to just not sell a powerful technology. I think that might be the default situation with a really powerful AI, that just for national security reasons, you wouldn’t want countries that you’re adversarial towards to have access to those most powerful systems.
Daniel Filan (01:15:07): Fair enough.
Tom Davidson (01:15:08): But to me, the worry is it’s just a delaying tactic and that in the fullness of time, China will develop its own powerful AI and sell access to autocracies that want it.
Daniel Filan (01:15:22): So maybe another question is… So I’m not from China, I don’t live there. I wish the best for the Chinese people. But if there’s a coup in China, an AI-enabled coup in China, to what degree is the concern, like, China is autocratic forever?
Tom Davidson (01:15:39): And just to be clear, probably in China it would be less called a coup and more… Well, it would be an executive coup, but it might just be cementing the system that already exists if you already consider it to be autocratic.
Daniel Filan (01:15:49): Also, by the way, I’m asking about China, but I’m not really just specific to China. I’m mostly just thinking [about] a bunch of countries that I don’t live in. If there’s a relatively autocratic country, it has an AI-enabled coup/cementation of power, to what degree is that concerning because that country is autocratic forever versus to what degree is that concerning because maybe that country becomes more bellicose and starts trying to take over the world, or it’s a promoter of conflict?
Tom Davidson (01:16:21): Yeah, I think it depends on exactly what you care about. One lens you can take is the kind of hard-nosed, longtermist lens where you say, “Okay, what we care about is control of the stars over the long term.” And so then you’ll be thinking, “Okay, would this perhaps less powerful country, would the new dictator hang on to power for long enough for it to be indefinite? And would they be able to get a sizable fraction of the stars such that there’s been a significant loss of value?” And if it’s a not very powerful country, you might, from that really hard-nosed, longtermist perspective, say, “Well, it’s not going to be powerful enough to actually gain any of the stars. Probably the United States is just going to basically be carving up the stars with China or just taking them all for themselves.”
(01:17:07): So though it’s a tragedy in terms of the people who live in those countries, from the kind of brutal, utilitarian calculus, it matters a lot less. I mean, that’s one lens. Then the other lens would just be the humanitarian lens that says, this is awful for the people in that country. And also if that country is able to strike a deal with countries like the United States, then they might be able to embed themselves permanently, even if ultimately the United States has much of the hard power.
Daniel Filan (01:17:42): I think there’s this uncertainty I still have about the domestic versus the international impact of doing a coup. So I could imagine one story where, if you do a coup, especially an AI-enabled coup, you get all the military really unified behind you. Maybe that just makes your military more effective, because they all have one purpose and you have access to this really good planning, whereas militaries that haven’t been involved in a coup are made up of different people with slightly different desires who aren’t as ruthless. So there’s one story where the coup-unified military is at a significant advantage. You can also have a story which is: well, democracy seems like it’s generally good, and some dispersion of power seems like it generally makes things run better. So maybe this is not a concern. I’m wondering if you have thoughts there?
Tom Davidson (01:18:33): Yeah. One related thought I have is that: let’s say there’s not a coup in the United States. I then personally think it’s unlikely that the United States would end up completely dominating the rest of the world and seizing all power economically and all strategic control for its own citizens to the exclusion of all others. Because the United States…
(01:18:58): A few reasons. Firstly, the United States has many different coalitions with power, and many of those coalitions have ideologies that make them committed to things like democracy, things like trade, and have positive views of other countries, like, say, the United Kingdom where I live, and they just wouldn’t want the United States to dominate the United Kingdom as much as it possibly could. And so that balance of power in the United States would ensure that the United States uses its power in a way which does go somewhat beyond its borders. And the other thing is just that if the United States wanted to completely dominate the rest of the world, probably what it would want to do is to really restrict the AI systems that it sells to the rest of the world and really sell access to those systems at the highest price it could. Whereas under the default situation where power is distributed within the United States, different companies within the United States will compete to sell AI services to the rest of the world, driving down the cost that the rest of the world is paying.
(01:20:02): And because of competition within the United States, that means that actually the United States is going to give the rest of the world a bit of a better deal. And so under this default scenario where power is distributed, I think there’s less prospect for the United States to really just take power for itself, even if it’s leading on AI. Whereas if there is an AI-enabled coup and one person becomes dictator with total power, then they might be like, “I want to dominate the world. I want all control and I’m just going to force all these companies to only sell at this extortionate rate, and the rest of the world has no other source of powerful AI so they’ll pay it. And then I’m going to choose our foreign policy and economic policy to only take into account the welfare and power of the United States in particular.” And so I do think that if there’s an AI-enabled coup in a particular country, then as you indicated, that country might become more bellicose at pursuing its own particular interest and could actually do so more effectively.
Daniel Filan (01:21:08): And I guess there’s also just this factor of: if you’re doing a coup, you’re probably a bit of a bellicose person, you’re probably more inclined to that sort of thing than other people.
Tom Davidson (01:21:19): Exactly. I mean, you raised a good question about “are democracies just going to be more efficient?”, because the free market’s fairly efficient and you’re distributing the decision-making. I think a scary possibility is that you can still gain the benefits of the free market by distributing all the economic decision-making and having markets operating within the country, while still having, at all the important decision points, AIs that are loyal to one person. And so you can get all those economic benefits of democracy without actually needing to have a real democracy. But I haven’t thought much about whether that would go through.
Daniel Filan (01:21:54): Something to think about. So speaking of democracy and speaking of the United States, initially you said, “Yeah, probably countries that already have a very strong executive, that are already less democratic, are more at risk of having a stronger executive and being even less democratic.” I live in the United States, I’m a fan of it. How high do you think the risk is that the United States gets AI-enabled couped?
Tom Davidson (01:22:23): I mean, if I had to pluck a number, I’d say 10%, but it’s very made up. That’s my rough probability for AI takeover as well. I think it’s ballpark similar.
Daniel Filan (01:22:36): Okay. And can you talk me through why is it as high/as low as 10%?
Tom Davidson (01:22:43): By analogy with AI takeover or just in and of itself?
Daniel Filan (01:22:48): In and of itself.
Tom Davidson (01:22:49): Yeah, so I think some things are fairly likely to happen. We’re likely to see a very small number of companies developing superintelligent AI systems. We’re likely to have a government that if it tried to, could gain a lot of control over how those capabilities are used via its default monopoly on force, its natsec apparatus. If they don’t, then by default power is already and will continue to be very concentrated within the AI companies. There are not, in practice, many effective checks and balances on the CEOs in these companies. I also believe that it’s quite likely that CEOs will want on the margin to just increase their own power and use their influence over AI to increase their influence more generally.
(01:23:52): So you can already see with Grok, Elon [Musk] is doing this in a totally shameless way. He’s altering Grok’s prompts to make it promote political views that he likes. And I think it’s just a natural urge if you want stuff and you want a bit more power and you just have this way of getting it, which is that you’re controlling these hugely powerful influential AI systems. So I do think it’s quite likely that on the margin these company leaders will walk down that path of increasing their own power to some extent.
(01:24:29): But there are also some things which I think are not particularly likely. They may happen, but: will a key company executive at any point decide to do something which is really egregious? At some point they might need to decide to instill a secret loyalty. I think there’s a chance that that’s just a step too far, or there’s a chance that by the time that’s possible, the world has woken up and put in some kind of checks and balances that would make it hard to do.
(01:24:56): And then there’s the further technical question of, okay, but would this actually work out? We were pointing to some of these difficulties of actually getting these secret loyalties propagated to the military infrastructure, and of being really confident the AI isn’t actually secretly misaligned. So really zooming out, maybe there are a couple of steps which I wouldn’t put at more than 50% each, and that gets you down. Let’s say there are two steps which are 40% each: you land in this rough range where it’s about 10%. As I’m thinking this through, I’m thinking maybe it should be higher, because you’ve got either the lab route or the executive route, and maybe you actually just want to add those up. Yeah, that’s just a brief indication.
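[Editor’s note: as a purely illustrative back-of-the-envelope, treating the two uncertain steps as roughly independent gives 0.4 × 0.4 = 0.16, which lands in the same rough ballpark as the ~10% figure once the other “fairly likely” steps mentioned above are also discounted somewhat.]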
How to prevent AI-enabled coups
Daniel Filan (01:25:52): Okay, I think at this point I’m interested in just talking about maybe what people should do about this. And probably I’m going to be most interested in thinking about this from a US perspective because that’s where I live and what I think the most about. Although I’m also interested in other places-
Tom Davidson (01:26:10): I do think it’s the most important case.
Daniel Filan (01:26:17): So a lot of these stories are about the synthesis of AI power and military power, and executive power, all coming together in a really concerning way. Sometimes people are like, “the US government should have this really big push to develop really powerful AI, that it does itself, pushing AI forward really hard, having exclusive access to the AI, and it should be really integrated within the government.” It seems like this is probably pretty bad from the coup perspective. I’m wondering if you have takes there?
Tom Davidson (01:27:05): So I think if you did this really well, it could be good from the coup perspective. If you’ve very carefully designed a project explicitly with reducing this risk in mind, I think you could probably actually reduce coup risk relative to the status quo, just because the status quo is so poor. Under the status quo, there’s very little constraining labs. So there’s very little guard against the company coup, but there’s also no explicit checks and balances that would constrain the ability of the executive to just demand that the companies sell them access to AIs without guardrails that they can deploy throughout the government and military. And the companies, if there’s a few of them, would be in potentially quite a weak negotiation position with the executive over that.
(01:27:52): So because the status quo is so bad, I think if you designed a good centralized project, you could reduce this risk. Now, I think probably the best way to minimize this risk would be to design a system of regulation where you continue to have multiple constrained regulated projects with various transparency and safety constraints in place, et cetera. That would probably bring the risk down lower still, and that would be better than a centralized project from this perspective.
Daniel Filan (01:28:28): One thing that occurs to me as well is… So again, I still have in the back of my mind, how do AI alignment concerns affect this? It seems like a lot of the things that people want out of AI alignment could potentially help with this. So transparency, causing companies to do evaluations of their models, having whistleblower protection schemes. It seems like a lot of these probably at least reduce the chance that AI labs do stuff in ways that the rest of the world doesn’t know about. Maybe it increases the risk that… If you’re worried about governments meddling too much with AI companies to do tricky things there, maybe that’s a concern. But I’m wondering, having strong AI Security Institutes or something: how much do you think that helps with coup risk?
Tom Davidson (01:29:29): I think all of the stuff you listed helps and in combination helps a fair bit. And yeah, I do think just a lot of the interventions here are pretty generally good across both coup risk and misalignment risk. The place where they really potentially bump heads is whether to centralize into just one project versus having careful regulation of multiple projects. But beyond that, I tend to think there’s this pretty strong alignment. There’s different areas you focus on. So you’re particularly concerned with, “oh, how is everyone actually using the compute within the AI companies and within the government?” And you’re relatively less concerned with looking for rogue deployments because it’s just the “legitimate” deployments that we might be more worried about now. But monitoring use of large compute, that’s the way of framing it where that’s both catching the misalignment risk and the risk of coups.
Daniel Filan (01:30:28): I think I want to talk about things that are maybe more unique to coups. So one thing that you don’t mention in the report as far as I could tell, but seems interesting to me: a lot of the coup risk seems to come from: you have one AI company, it’s relatively dominant, it’s relatively in the lead. It’s hard for other people to compete, and they’re doing sketchy stuff within that company, at least for company-enabled coups. Currently, it’s relatively common for employees… It’s not unheard of for employees to switch around between companies. There’s a decent amount of it happening, and it seems like this probably helps, partly for…
(01:31:14): It means that labs don’t have super, super intense secrets. They probably have some secrets of exactly how they do some things, but the general structure… I feel very confident that the latest Claude and OpenAI models are based on the transformer architecture. If they did something totally different, I just feel like we would’ve heard of that. So it feels like it both protects against (a) them doing something… If OpenAI were like, “Hey, all of our employees—we’re doing a bit of a coup, shh.” I feel like people would hear about that. And then (b), maybe it protects against one lab going super, super far ahead of the other labs. So some combination of (a) people moving between labs and (b) some good old corporate espionage. It seems like this maybe helps. I’m wondering if you have thoughts?
Tom Davidson (01:32:14): Some of the benefits you’re describing, I think just come from the fact that the labs are humans who are integrated throughout broader society and will talk about really notable things. If a lab tried to tell everyone, “look, don’t mention this, but we’re doing a coup”, it doesn’t matter whether employees are switching between labs, that’s going to backfire. So then what do we get from the employee switching? I do think you’re right. We’re getting some kind of evening of the playing field.
(01:32:42): My view is that you can get a lot of the benefits by having two separate labs in terms of coup risk reduction. Because if you have two separate labs, they can separately train AIs using different stacks, and then they can have the AI cross-audit each other for secret loyalties. You can deploy both systems in the military. So you have a significant additional check and balance there. You could have both AIs controlling each robot, so maybe the OpenAI system is controlling the robot, but the Anthropic system can shut it down. And so with a setup like that, you could significantly reduce the risk because even if one system is secretly loyal, that doesn’t give you strong reason to think that the other system is also secretly loyal and colluding. So I think if you had movement between companies which made it less likely that we just end up with one company [and] instead meant that we get two or three, then yeah, I think that would be good.
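[Editor’s note: a minimal sketch, in Python, of the cross-checking setup Tom describes, where a system from one lab controls an actuator and an independently trained system from another lab can veto or shut it down. The class names, methods, and allow-list below are hypothetical and purely illustrative; nothing here reflects any real deployment or product.]

```python
# Minimal sketch of the "two independently trained AIs per actuator" idea:
# one system proposes actions for a robot, a second system trained on a
# separate stack can veto and halt it. Everything here is a toy stand-in.

from dataclasses import dataclass


@dataclass
class Action:
    description: str


class ControllerModel:
    """Primary system (e.g. from lab A) that proposes actions for a robot."""

    def propose_action(self, observation: str) -> Action:
        # Placeholder: a real controller would run a policy over the observation.
        return Action(description=f"handle: {observation}")


class AuditorModel:
    """Independently trained system (e.g. from lab B) that can veto the controller."""

    AUTHORIZED = {"patrol perimeter", "return to base"}  # toy allow-list

    def approves(self, action: Action) -> bool:
        # Placeholder: a real auditor would apply its own policy, rules of
        # engagement, and legality checks, not a keyword allow-list.
        return any(task in action.description for task in self.AUTHORIZED)


def control_loop(controller: ControllerModel, auditor: AuditorModel, observation: str) -> None:
    action = controller.propose_action(observation)
    if auditor.approves(action):
        print(f"executing: {action.description}")
    else:
        # The veto acts as a shutdown: a secretly loyal controller would also
        # need the separately trained auditor to collude before acting.
        print(f"auditor veto, system halted: {action.description}")


if __name__ == "__main__":
    controller, auditor = ControllerModel(), AuditorModel()
    control_loop(controller, auditor, "patrol perimeter")         # approved
    control_loop(controller, auditor, "seize the broadcast hub")  # vetoed
```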
Downsides of AIs loyal to the law
Daniel Filan (01:33:43): So one thing you mentioned is you think that it’s important for AIs to be aligned to follow the law and to not be loyal to one individual and just [prioritize] the law in general. One concern I have about this, and in fact about AI-enabled coups in general, is it feels like it is possible for countries to be too stable. I think that it is possible for the law to be followed too much. Well, the law being followed too much, I think one version of that is it is sometimes very unclear what the law involves. A kind of silly version of this is I’m only 90% sure that the existence of the United States Air Force is constitutional because the Constitution doesn’t actually say that you can have an air force, because they didn’t think about it. It says you can have an army and a navy. Can you have an air force? I don’t know.
(01:34:41): I mean that’s a bit of a silly example, but the US Constitution, it is a little bit ambiguous in many places. But at a high level, if I imagine… So for example, the reason there’s a United States is that one part of the United Kingdom broke away from the rest of it, and that was, I assume, illegal. It was illegal. It was a portion of the United [Kingdom] breaking the law and being loyal to one entity within the United Kingdom versus other things. And in general, it seems like it’s probably good for it to be possible for sometimes bits of states [to] break away illegally and do their own thing. How much of preventing coup risk, especially via the means of making sure that things are aligned to the official law, will prevent bits of states breaking away in a way that seems healthy in the long run?
Tom Davidson (01:35:54): I think it’s a really interesting question. I think we want to get a balance between locking out the bad stuff, locking out the egregious coups, but as you’re saying, we don’t want to lock in too much. As an extreme case, we definitely don’t want to lock in “the rules of today can never be changed”, so we obviously want to have some process by which we collectively can decide to change the laws. And I think that that’s by default how it’ll happen. I had previously thought that, look, if we lock in the laws of today in a sensible, nuanced way, then we will leave enough flexibility to collectively decide to change things. And there could be some process by which it is legitimate for a state to break away.
(01:36:48): But I think you’re actually right in practice, it may be that the naive way of implementing even a nuanced version of the law… It’s possible that would actually lock in too much. I haven’t thought much about how much really positive stuff has happened historically via lawbreaking, and do we expect that to continue to be the case even in mature democracies like the United States? Do we want to allow California to just declare that it’s independent illegally, and do we want its AIs to go along with that?
(01:37:24): I think it’s a really good question, and it kind of highlights the way in which we may be going down significant path dependencies as we automate the government infrastructure and military, because once we’ve automated the whole government and the whole military, we will have implicitly baked in answers about whether AIs will support various different maneuvers. We’ll have implicitly baked in an answer about, if California tries to break away, and all of its systems support it, and most of the broader US supports it, but it’s actually technically illegal, and… There will be some decision that infrastructure of AIs will come to about whether it’s going to support… If push comes to shove and there’s going to be a military intervention, what will the AI military do? That’s a constitutional crisis, and we will be baking in some implicit answer to the question of what will happen there? Who will the AI military support?
(01:38:25): And I think it just highlights [that] we should think very carefully before we do this. And there’s kind of no way to not give an answer. There’s no default because the default in today’s world is just, I guess, there’s a kind of power struggle and random stuff happens, and I think it’s a fair point that maybe it’s actually good that you can sometimes do illegal stuff because it adds more variety. And so maybe in the ideal world, we’d say, look, in constitutional crises, be wise, consider what’s best for the broad future, and make the best decision that balances all these interests. And we hope that that would actually be an improvement on the status quo where it’s just kind of random and determined by power. Maybe we can get something that’s at least based on some kind of desirable principles when there are more edge case-y constitutional crises, and maybe we don’t always make it come down to the letter of the law.
Daniel Filan (01:39:23): So there’s one version of this which is being pro-pluralism. There’s another version of this, which is… Especially if instead of imagining the US, you’re imagining… I think there, at least conceivably, are authoritarian countries where you actually do want it to be possible for things to break away. And there is also this third thing, which is: the letter of the law really is not as clear as you might hope in many cases. I was thinking about this before we started. One thing you could imagine doing is being pro-pluralism instead of pro the letter of law. I don’t know, I didn’t spend 10 minutes thinking about ways in which that could be bad. So probably there are a bunch of ways that could be bad.
Tom Davidson (01:40:09): I mean, another possibility is you act in accordance with how you predict the Supreme Court judges will resolve this question assuming that they’re acting in good faith.
Daniel Filan (01:40:27): “In good faith” seems tricky and hard to define. I guess it depends-
Tom Davidson (01:40:36): Or assuming they’re trying to be reasonable. The law often has “reasonable judgment” and things like this. The reason is that if you don’t say “in good faith”, and the whole Supreme Court now decides they want to do a coup, then the AI knows that and just goes along with the coup. So you want to have something there to kind of idealize it.
Daniel Filan (01:40:51): Yeah, I think there’s probably some way you could do it.
Tom Davidson (01:40:56): I mean, the thing about AI is that you can give it these fuzzy instructions like “assume they’re acting in good faith”, “assume they’re being reasonable”, and just like humans do, it’s able to work with them, even though they’re not mathematically specified.
Cultural shifts vs individual action
Daniel Filan (01:41:06): Yeah, I think there’s something to that. So talking more about ways of stopping coups: one path is things you mentioned in the paper: try to align to things other than “it’s definitely just going to be what this one person wants”, try and prevent lab-led coups by making labs transparent, having some regulation of labs. I guess preventing executive-led coups… Presumably the thing to do there is just try and elect people who won’t do coups.
Tom Davidson (01:41:46): I think there’s building consensus among many different parts of society, especially the checks and balances parts, that we want AI to follow the law, to not be used to increase the partisan power of the current elected officials, build a consensus that military systems shouldn’t all report to one person, but should all report to many different humans. And if you can build consensus around that, then that can make it more of an uphill struggle for a head of state that wants to stage a coup.
Daniel Filan (01:42:22): So in the report, a lot of the proposals for how to prevent a coup are very “here are things that we as a society could do”. One thing you could potentially do to prevent a coup is also sabotage-type things (or at least things that individuals could do, or things that are less [of a] global plan). I mean, one version of this is: if you imagine there’s some authoritarian country that you think is at high risk of an AI-enabled coup, you just don’t sell AI weapons to them. That’s a moderate version of this. You can also imagine, even if there’s not a policy at my AI lab to prevent coups, individual workers in an AI lab saying, “Okay, I’m going to quit”, or “I’m going to insert my ‘don’t do coups’ bit into the code slightly surreptitiously.” I’m wondering what you think about these more individual-ish moves.
Tom Davidson (01:43:31): I definitely support whistleblowing and encourage employees of AI labs to be like, “Okay, what’s going to be my line?” If there is movement towards less transparency into what the AI is being aligned to, or it’s becoming clear that it’s being aligned to the company or to specific people, what is the line at which you’re going to whistleblow? I think one thing that employees can do is be like, “I am going to hold myself accountable to getting positive affirmation that this isn’t happening. I’m going to make sure that it’s not possible for the company to sneak in a secret loyalty, given that I’m aware of what the company systems are like, and I’m going to ensure that the company isn’t overtly training the AI to be loyal.”
(01:44:18): And so I think it would be great if there was a culture at companies where it’s just like, obviously we wouldn’t want this to happen, obviously we don’t think anyone here would try and do this, but we need to have an attitude of vigilance because that’s what makes it true that this would never happen. So I think that’s good.
(01:44:35): And one more positive framing for this is being like: one great thing to aim for as a company is to make a product which everyone absolutely knows they can trust, even people who don’t trust our staff and our processes and think we’re crooked and think we’re going to try and seize power, even they should just know they can trust our systems because that’s what a good product looks like. So you can frame this in terms of building amazing products that ultimately you want national security to use.
(01:45:06): If you anticipate that it’s going to be public knowledge that sleeper agents are possible, that secret loyalties are possible, then you might anticipate [that], for the government to proactively use our AIs in the military and in other critical systems, they’re going to want to really have strong assurances and not just trust that there’s nothing fishy going on. And that’s what we’re aiming for for purely product-based reasons.
(01:45:32): So I do think pushing towards that culture is one thing you can do as an individual. I feel more nervous about trying to sabotage the training run in a way that prevents secret loyalties but that no one else knows about, just because it’s such a similar act to introducing secret loyalties. And I just think that all of that stuff should be very taboo, and the processes should prevent that.
Technical research to prevent AI-enabled coups
Daniel Filan (01:45:53): If someone’s been listening to this… In particular, I think a lot of my audience are people who do technical alignment research. If someone’s listening to this, and they’re interested in, “okay, let’s prevent AI-enabled coups”, is there technical work in particular that you think is valuable for people to work on here?
Tom Davidson (01:46:11): I think the biggest thing is technical work on detecting sleeper agents and secret loyalties specifically in a setting where there is some kind of adversary that you expect to be purposefully trying to hide that sleeper agent from you.
(01:46:29): So historical sleeper agent work has often… At least the stuff from Anthropic, it’s been designed to illustrate the idea of misalignment. So AI kind of acts out when it sees it’s 2026 but plays nice before that time. But those model organisms (as they’re sometimes referred to), they’re not really trying to illustrate the risk that a human actor has created a secretly loyal sleeper agent that they specifically don’t want other humans to discover. And so I think there’s potential for a lot of technical work on this kind of offense-defense balance question of: if you have some attacker that has some kind of access to the training procedure and you have some kind of defender that has, again, some level of access, again maybe seeing how the training worked or doing tests afterwards, can the attacker create a sleeper agent that the defender doesn’t detect? And how sophisticated is that sleeper agent?
(01:47:34): And there’s so many parameters here that ideally we want to be studying. So the attacker could have just access to… The minimal access would just be they can put things on the internet that go into pre-training. A deeper level of access would be: they can add lines to the fine-tuning code, but then maybe the defender can see that fine-tuning code, or maybe the defender can’t, or maybe the attacker can add additional training examples, which maybe the [defender] can see or can’t see. And so there’s all these different kind of scenarios where the attacker and defender have various different affordances.
(01:48:09): And ideally what we would have is a good understanding of under what situations does the defender reliably win? Because if we have that understanding, we can then set up our training process so that it is secure enough that the defenders indeed reliably win given the actual kind of access that they have and given the likely access that we believe an attacker could gain.
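[Editor’s note: a minimal sketch, in Python, of how the attacker and defender affordances described above could be parameterized when setting up experiments like this. The enum values and field names are hypothetical, chosen only to make the space of scenarios concrete.]

```python
# Hypothetical parameterization of the offense-defense experiments described
# above: each scenario fixes what the attacker can touch and what the
# defender can see, and the question is whether the defender reliably
# detects the planted secret loyalty.

from dataclasses import dataclass
from enum import Enum, auto


class AttackerAccess(Enum):
    PRETRAINING_DATA_ONLY = auto()  # can only post content that ends up in pretraining
    FINETUNING_EXAMPLES = auto()    # can add fine-tuning examples
    FINETUNING_CODE = auto()        # can edit the fine-tuning code itself


class DefenderAccess(Enum):
    BLACK_BOX_TESTS = auto()        # post-hoc behavioral tests only
    TRAINING_LOGS = auto()          # can inspect training data and code
    INTERPRETABILITY = auto()       # can probe internals of the final model


@dataclass
class Scenario:
    attacker: AttackerAccess
    defender: DefenderAccess
    attacker_knows_defenses: bool   # does the attacker know which tests will be run?


# Enumerate the grid of scenarios a research agenda might want to cover.
scenarios = [
    Scenario(a, d, knows)
    for a in AttackerAccess
    for d in DefenderAccess
    for knows in (False, True)
]

if __name__ == "__main__":
    for s in scenarios:
        print(s)
```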
Daniel Filan (01:48:31): How close do you think this is to just basically what AI control people are doing, versus do you think there are specific aspects of this that are not covered by typical control research?
Tom Davidson (01:48:41): Well, I think it goes beyond control. So interpretability is something you could look at here. Typically, we look at interpretability with a mind to accidental misalignment, but you could say, can I use interpretability to catch sleeper agents, even when the person is specifically trying to hide them? What if they can guess the kinds of interpretability techniques I might be using in advance? Can I still detect the sleeper agent then?
(01:49:07): So basically, take any method we have for testing alignment: black-box testing, interpretability, looking through the training data. For many of those methods, you’ll be able to adjust it to consider the more adversarial setting where someone is specifically trying to create a sleeper agent that you cannot detect. And so I think it is not just control, it’s kind of alignment and control. And then I don’t think it’s exactly the same as standard control, because it is a harder setting: as we touched on earlier, the normal setting is where the AI kind of has to zero-shot figure out how to dodge all these tests and how to sabotage. Whereas here the attacker would have more ability to help the AI do a good job using this-
Daniel Filan (01:50:01): Pre-train it how to…
Tom Davidson (01:50:03): Exactly. How much does it know about the company’s control processes? If you’re worried about an insider threat actor, then you might want to study, okay, how much of a better job can they do if they do versus do not know about the specific control setup? And I actually think the control research already looks at this variable.
Daniel Filan (01:50:23): Yeah. Well, I imagine it depends. I think it’s one of these ideas that has caught on, and it’s hard for the world to refer to exactly one thing by the term “AI control”. So probably different people are thinking slightly different things, and it’s worth thinking about, “okay, I want this version”.
Tom Davidson (01:50:46): But I’ll just give it one last pitch I suppose, which is to say that there’s so many people that have worked on misalignment from the perspective of accidental misalignment. There are just very few people that have really comparably gone in on techniques for very powerful AI detecting purposeful sleeper agents. And I think it’s going to be a really important problem. It’s one day going to be commercially important, as we’ve discussed, because the government will want guarantees, but it’s very neglected. And so I would really encourage technical people to work on this.
(01:51:26): Probably the first stage is actually a research agenda that sets out the space and looks at the current alignment techniques and what kinds of best ways of adapting the research there would be to account for this new threat model.
Non-technical research to prevent AI-enabled coups
Daniel Filan (01:51:40): So okay, that’s technical research people can do if they’re interested in this threat model, but not all of my potential listeners to this episode are technical researchers. I’m wondering: stuff other than technical research, do you think there are further questions that need answering here?
Tom Davidson (01:51:55): One thing I’ll briefly mention is the infosecurity side. So that is technical. I don’t know how many of your [listeners] work on this, but a lot of infosecurity is focused on weight exfiltration, but there are infosecurity problems which arise from this threat model, which are new. So it’s much more about: how can you secure the training process against unauthorized edit access to the weights? So rather than read access to the weights, to exfiltrate, it’s edit access, and that has some distinctive features.
(01:52:33): And I think, again, a kind of road-mapping exercise that goes through all the different infosecurity things that labs could do and works out which are the best bets for making it hard to do significant unauthorized edits seems like low-hanging fruit. And then I think labs should be upweighting those relative to their previous prioritization. And so if you do infosecurity, I’d really encourage you to pitch that project to your team. But you asked about non-technical work.
Daniel Filan (01:53:06): Yeah.
Tom Davidson (01:53:07): I think one thing is just doing more detailed threat modeling. As we touched upon in the conversation, there are lots of steps of these threat models that you could dig into in more detail. How hard would it actually be to train secret loyalties without detection? How hard would it be for those AIs to pass the secret loyalties onto the next generation and then to military systems? You could look into: for the executive coup, what is the legal situation currently with respect to automated military systems? Who has legal authority over the procurement process, and if powerful AI military systems were being built, who would by default be looped into their loyalties and how they’re being designed?
(01:53:59): And then, again, thinking about mitigations. So once you’ve got a more detailed threat model, you can then think about… One thing that I thought of recently is, well, if all the labs send a copy of their weights to some safe enclave, then it makes secret loyalties a much riskier prospect, because they can’t just delete those weights. And then whoever they send the weights to can, in a few years’ time, do tests with much more sophisticated techniques. So it’s possible that thinking through the threat models will bring up other ideas like that.
Daniel Filan (01:54:31): Well, yeah, in that case, you’ve got to be confident that the copy of the weights the lab sent you is the actual thing that they’re running. But there must be some way to…
Tom Davidson (01:54:39): Yeah, well, you can certainly hash the weights. So you can check that the copy you’ve received is the one they’re currently running, as long as you can get them to actually hash the weights that they’re really running and then compare them. But yeah, you’re right, there’s room for maneuvering there.
Daniel Filan (01:54:59): Yeah. I wonder if this is… So sometimes people talk about: we’re going to have computer chips, and they’re going to have a little thing on them that checks if you’re doing really crazy AI training and reports that, just so that governments can monitor how much AI training people are doing. It seems like a similar thing you might want to do with chips is “are people running the model weights that they say they’re running?” That seems like it’s potentially valuable for this threat model.
Tom Davidson (01:55:28): Yeah, that’s a great idea. I hadn’t thought of that. What you could do, you finish your training process, you hash the weights, then you do all these in-depth alignment tests, then you send the weights to the safe enclave so that then you can do even more tests later. And then you have the chips regularly check that the weights are the same as what you ended up with.
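[Editor’s note: a minimal sketch, in Python, of the hash-and-verify step discussed here: take a cryptographic digest of the weights at the end of training, record it with a third party such as a safe enclave, and later check that whatever is actually being served still matches. The file path and function names are hypothetical.]

```python
# Sketch of hash-and-verify for model weights. The path below is a stand-in;
# in practice the digest would be recorded with an auditor or safe enclave,
# and on-chip checks could periodically recompute it against the deployed weights.

import hashlib
from pathlib import Path


def hash_weights(weights_path: Path) -> str:
    """Return the SHA-256 digest of a weights file, streamed in chunks."""
    digest = hashlib.sha256()
    with weights_path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_deployment(deployed_weights: Path, recorded_digest: str) -> bool:
    """Check that the weights being served match the digest recorded at the end of training."""
    return hash_weights(deployed_weights) == recorded_digest


if __name__ == "__main__":
    weights = Path("final_checkpoint.bin")  # hypothetical weights file
    weights.write_bytes(b"\x00" * 1024)     # stand-in bytes for a real checkpoint
    recorded = hash_weights(weights)        # digest taken when training finishes
    # ... the weights get copied to a safe enclave, deployed, etc. ...
    assert verify_deployment(weights, recorded), "deployed weights differ from recorded digest"
```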
Daniel Filan (01:55:52): I guess also presumably there’s some amount of just thinking about structures that would be good. So I think you mentioned that a centralized AI project, if you structured it correctly, maybe it would be good at being AI-enabled coup-resistant. I imagine there’s probably more thinking someone could do about how you would actually set that up.
Tom Davidson (01:56:10): And for all the recommendations in the paper, there’s a lot more thinking about implementation. We’re giving recommendations on a very high level, transparency about various different things and sharing of capabilities with different parts of society to avoid exclusive access, and AI should follow rules that mean they can’t be used for coups, all of that’s [missing] “what rules exactly?” And exactly how we’re going to structure this transparency requirement, and which exact bodies should AI capabilities be shared with.
(01:56:46): So one type of work I’m excited about is working on drafting contracts between governments and labs that specify these requirements concretely. And similarly for setting up a centralized project, getting much more detail about how it would be structured, as you say.
Daniel Filan (01:57:05): I think I’d like to move a little bit onto Forethought, the organization that put out this, but before I do that, is there any last things you want to say about AI-enabled coups?
Tom Davidson (01:57:15): I’ll say one more thing, which is that I think it’s really helpful in some contexts to be very explicit about the threat model we’re concerned with. We’ve talked very explicitly about executive coups and lab leaders doing coups. That’s helpful for thinking clearly. But I don’t think it’s the most helpful frame in many contexts: coups sound kind of extreme, it sounds like an adversarial framing, and it sounds like you’re pointing fingers at individuals rather than just being like, well, obviously no one should be able to do this.
(01:57:52): And so I do think there are other more useful frames in many contexts. So rather than “let’s prevent secret loyalties”, I like the frame of “system integrity”, which just means that the system does what it says on the tin, hasn’t been tampered with, and rather than preventing an executive coup, you can talk about checks and balances, rule of law, democratic robustness.
Forethought
Daniel Filan (01:58:17): Yeah, that’s a good point. Okay, I next want to talk a little bit about Forethought. So Forethought is this new-ish organization. And in March or April, you guys put out a bunch of papers or a bunch of reports. What’s Forethought? What’s going on?
Tom Davidson (01:58:37): Yeah, it’s a research organization. We aspire to be considered a successor to FHI. So FHI was a macrostrategy research organization, so kind of thinking about strategy in the most zoomed-out terms possible. Often it was thinking about the very long-run future and the different outcomes that might occur, things like the vulnerable world hypothesis and astronomical waste, the kind of big, big picture questions, the big picture papers that came out of that institute, FHI.
(01:59:14): And so we’re aspiring to be the follow-on successor that is tackling the really big strategy questions. And the way we’re currently framing it is: over the coming decades, we are very plausibly going to transition to a world with superintelligent AI systems. That is just going to bring a whole host of major, major changes. AI misalignment risk is one really important risk to be thinking about over that transition, but there’ll just be a whole host of other issues. AI-enabled coups are one example, and it’s the first one that we’ve really focused on, or at least that I’ve really focused on, but it’s not the only one.
(01:59:54): I mean, I really enjoyed your recent podcast on AI rights. I think that’s going to be another really big issue that is very much on our radar, and there’s going to be many other big issues as well. Another one that we’re excited about is just that at some point we’re going to start getting access and using resources in space, and how those resources are used is going to be a very, very important question. That is basically all the resources, and we have no idea how we’re going to use them, how we’re going to divvy them up, what the processes will be. In a sense, everything is up for grabs in that decision.
(02:00:34): So that’s another big example. And I expect there’ll be other things where just there’s going to be so much change as we’re going through this. There’s just going to be a lot of things which emerge, and so our aspiration is to be keeping our eye on the ball of these very high-level strategic questions and issues and trying to help us figure out what we should do about them.
Daniel Filan (02:00:58): Yeah. You mentioned that the first thing that you focused on is AI-enabled coups. The things you’ve mentioned: are those roughly the things that you expect the institute to prioritize, or what might I see out of Forethought in the next year or so?
Tom Davidson (02:01:14): I think those are our current best guesses, the things I mentioned. So I think space governance, you might well see stuff on that, you might well see stuff on AI rights: specific schemes to pay the AIs to work with us if they’re misaligned—something that we feel quite excited about and seems still underexplored, though it is getting more attention, which is great. I think positive uses of AI, for improving epistemics, for improving government decision-making, for ensuring that democracies don’t fall behind autocracies in an automated economy, those are some other issues that seem like we might well focus on. Another issue would be: if we’re choosing these AIs’ personality, exactly what should it be aligned to? Which is, again, a question which is getting more attention, but is going to be very, very consequential.
Daniel Filan (02:02:15): Another thing to ask: a bunch of my listeners, maybe they’re coming out of undergrad, maybe they’re in a space where they’re considering changing careers: is Forethought hiring?
Tom Davidson (02:02:28): Yeah, we’re planning to do an open hiring round soon. I’m not sure exactly when we’ll release it, but I would really encourage people to apply. I think there’s a lot of talent out there, and I expect there’s a lot of talent we’re completely unaware of. So even if you don’t think that you’ve got the skills or the knowledge, there’s no great on-ramp to doing this kind of work at the moment, and I think there’s a big danger of people just ruling themselves out prematurely. So when we do release the open hiring round, please throw in an application.
Following Tom’s and Forethought’s research
Daniel Filan (02:03:03): Final thing I want to ask: suppose someone listened to this episode, they found it interesting, and they want to hear more about the work you do, how should they go about doing that?
Tom Davidson (02:03:16): For me personally, you can follow me on Twitter. If you google “Tom Davidson AI X”, you’ll see my Twitter pop up on Google, so you can follow me and subscribe there. I post basically all of my research on LessWrong, because that’s where the big community that cares about some of these issues is. So if you have a LessWrong account, you can subscribe there. We have a Forethought Substack, so if you, again, just google “Forethought Substack”, it’s the top link. Subscribe, that’d be great. And then you can also follow Will MacAskill, he’s the other senior researcher at Forethought. Follow him on Twitter and LessWrong as well.
Daniel Filan (02:04:07): Great. So yeah, links for all of that will be in the description of this episode. Tom, thanks very much for coming on. It was great chatting with you.
Tom Davidson (02:04:14): Yeah, real pleasure. Thanks so much, Daniel.
Daniel Filan (02:04:16): This episode is edited by Kate Brunotts and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund along with patrons such as Alexey Malafeev. To read a transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts on this episode at axrp.fyi.