How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with Scott Aaronson about his views on how to make progress in AI alignment, as well as his work on watermarking the output of language models, and how he moved from a background in quantum complexity theory to working on AI.

Daniel Filan: Hello, everyone. In this episode, I’ll be speaking with Scott Aaronson. Scott is a professor of computer science at UT Austin and he’s currently spending a year as a visiting scientist at OpenAI working on the theoretical foundations of AI safety. We’ll be talking about his view of the field, as well as the work he’s doing at OpenAI. For links to what we’re discussing, you can just check the description of this episode, and you can read the transcript online. Scott, welcome to AXRP.

Scott Aaronson: Thank you. Good to be here.

‘Reform’ AI alignment

Epistemology of AI risk

Daniel Filan: So you recently wrote this blog post about something you called reform AI alignment: basically your take on AI alignment that’s somewhat different from what you see as a traditional view or something. Can you tell me a little bit about, do you see AI causing or being involved in a really important way in existential risk anytime soon, and if so, how?

Scott Aaronson: Well, I guess it depends what you mean by soon. I am not a very good prognosticator. I feel like even in quantum computing theory, which is this tiny little part of the intellectual world where I’ve spent 25 years of my life, I can’t predict very well what’s going to be discovered a few years from now in that, and if I can’t even do that, then how much less can I predict what impacts AI is going to have on human civilization over the next century? Of course, I can try to play the Bayesian game, and I even will occasionally accept bets if I feel really strongly about something, but I’m also kind of a wuss.

I’m a little bit risk-averse, and I like to tell people whenever they ask me ‘how soon will AI take over the world?’, or before that, it was more often, ‘how soon will we have a fault-tolerant quantum computer?’… They don’t want all the considerations and explanations that I can offer, they just want a number, and I like to tell them, “Look, if I were good at that kind of thing, I wouldn’t be a professor, would I? I would be an investor and I would be a multi-billionaire.” So I feel like probably, there are some people in the world who can just consistently see what is coming in decades and get it right. There are hedge funds that are consistently successful (not many), but I feel like the way that science has made progress for hundreds of years has not been to try to prognosticate the whole shape of the future.

It’s been to look a little bit ahead, look at the problems that we can see right now that could actually be solved, and rather than predicting 10 steps ahead the future, you just try to create the next step ahead of the future and try to steer it in what looks like a good direction, and I feel like that is what I try to do as a scientist.

And I’ve known the rationalist community, the AI risk community since… maybe not quite since its inception, but I started blogging in 2005. The heyday of Eliezer writing the sequences first on Overcoming Bias, and then on LessWrong, that started around 2006, 2007. So I was interacting with them since the very beginning. A lot of the same people who read my blog also read Eliezer and Robin-

Daniel Filan: Eliezer Yudkowsky and Robin Hanson.

Scott Aaronson: Yes, thank you. We read each other and we interacted, and I was aware that there are these people who think that AI existential risk is just the overwhelmingly important issue for humanity, it is so important that nothing else matters by comparison; and I was aware of the whole worldview that they were building up around that belief, and I always would say I’d neither wholeheartedly endorsed it, nor dismissed it.

I felt like certainly, I see no reason to believe that the human brain represents the limit of intelligence that is possible by our laws of physics. It’s been limited by all sorts of mundane things, the energy that’s needed to supply it, the width of the birth canal… There is absolutely no reason why you couldn’t have much more generally intelligent problem-solving entities than us, and if and when those were to arise, that would be an enormous deal, just like it was a pretty enormous deal for all of the other animals that we share the earth with when we arose.

But I feel like in science, it’s not enough for a problem to be hugely important, even for it to be the most important problem in the world: there has to be a research program. There has to be some way to make progress on it, and what I saw for a long time when I looked at – well, it used to be called Singularity Institute and then MIRI – when I looked at what they were doing, what the people who talked about AI alignment were doing, it seemed like a lot of a priori, philosophical thinking about almost a form of theology.

And I don’t say that derisively. It’s almost just inherent to the subject matter when you are trying to imagine a being that much smarter than yourself or that much more omniscient and omnipotent than yourself. The term that humanity has had for millennia for that exercise has been theology, and so there was a lot of reasoning from first principles about ‘assume an arbitrarily powerful intelligence, what would it do in such and such a situation, or why would such and such approach to aligning it not work? Why would it see so much further ahead of us that we shouldn’t even bother?’ And the whole exercise felt to me like I would feel bad coming in as an outsider and saying, I don’t really see clear progress being made here.

But many of the leaders of AI alignment, they also say that. Yudkowsky has unfortunately – I feel bad, he seems really depressed lately – his ‘AGI ruin: a list of lethalities’ essay was basically saying we are doomed, saying we have not had ideas that have really moved the needle on this; and so you could say, if it’s really true that we’re just going to undergo this step change to this being that is as far beyond us as we are beyond orangutans, and we have as much hope of controlling it or directing it to our ends as the orangutans would have of doing that for us, well then, you’ve basically just baked into your starting postulates the futility of the exercise.

And whether it’s true or it’s not true, in science, you always have to ask a different question, which is what can I make progress on, and I think that the general rule is that to make progress in science, you need at least one of two things. You need a clear mathematical theory, or you need experiments or data, but what is common to both of those is that you need something external to yourself that can tell you when you were wrong, that can tell you when you have to back up and try a different path.

Now in quantum computing, we’re only just starting now to have the experiments that are on a scale that is interesting to us as theorists. Quantum supremacy experiments, the first ones were just three years ago, but that’s been okay because we’ve had a very, very clear mathematical theory of exactly what a quantum computer would be, and in a certain sense, we’ve had that since Schrödinger wrote his equation down in 1926, but certainly, we’ve had it since Feynman and Deutsch and Bernstein-Vazirani in the eighties and nineties wrote down the mathematical basics of quantum computation.

Now in deep learning, for the past decade, it’s been very much the opposite. They do not have a mathematical theory that explains almost any of the success that deep learning has enjoyed, but they have reams and reams of data. They can try things. And they now are trying things out on an absolutely enormous scale, learning what works, and that is how they’re making progress; and with AI alignment, I felt like it was in – maybe this is not necessarily anyone’s fault, it’s inherent to the subject – but it was in the unfortunate position for decades of having neither a mathematical theory, nor the ability to get data from the real world, and I think it’s almost impossible to make scientific progress under those conditions.

A very good case study here would be string theory. String theory has been trying to make progress in physics in the absence of both experiments that directly bear on the questions they’re asking and a clear mathematical definition of what the theory is.

Daniel Filan: They have some – vertex operator algebras exist. You can write a math textbook about them.

Scott Aaronson: Yeah, you could say that it is amazing how much they have been able to do even in the teeth of those obstacles, and partly, it’s because they’ve been able to break off little bits and pieces of a yet unknown theory where they can study it mathematically, and AdS/CFT is a little piece that you can break off that is better defined or that can be studied independently from the whole structure.

So I think that when you’re in a situation where you have neither a mathematical theory nor experiments, then you’re out at sea and you need to try to grab onto something, and in the case of science, that means looking for little bits and pieces of the problem that you can break off where you at least have a mathematical theory for that little piece. Or you at least have experimental data about that little piece.

And the reason why I am excited right now about AI alignment and why when OpenAI approached me last spring with the proposal that, ‘Hey, we read your blog, we’d like you to take off a year and think about the foundations of AI safety for us’ - and I was very skeptical at first. Why on earth do you want me? I’m a quantum computing theorist. There are people who are so much more knowledgeable about AI than I am. I studied AI in grad school for a year or two before I switched to quantum computing. So I had a little bit of background. That was in 2000. That was well before the deep learning revolution. Although of course, all of the main ideas that have powered the revolution, neural nets, backpropagation, we were very familiar with all of them back then. It’s just that they hadn’t yet been implemented on a big enough scale to show the amazing results that they have today.

Even then, I felt like machine learning was clearly going to be important. It was going to impact the world on a probably shorter timescale than quantum computing would, but I was always frustrated by the inability to make clean mathematical statements that would answer the questions you really wanted to answer, whereas in quantum computing, you could do that and so I fell in with the quantum crowd at some point. So now, after 20 years out of AI, I’m dipping my foot back into it. I ultimately did decide to accept OpenAI’s offer to spend a year there, and it was partly because I’ve just been as bowled over as everyone else by GPT and DALL·E and what they’ve been able to do.

And I knew it was going to be an extremely exciting year for AI and it seemed like a good time to get involved, but also, I felt like AI safety is finally becoming a field where you can make clear, legible progress. First of all, we have systems like GPT that fortunately, I think are not in any immediate danger of destroying the world, but they are in danger of enabling various bad actors to misuse them to do bad things. Maybe the smallest and most obvious example is that every student on earth will be tempted to use GPT to do their homework, and as an academic, I hear from all of my fellow academics who are extremely concerned; but also, I fully expect that nation states and corporations will be generating propaganda, and will be generating spam and hoaxes and all sorts of things like that.

Of course, you could do all of that before, but having an entity like GPT lets you scale it up so cheaply, and so we are going to see powerful AIs let loose in the world that people are going to misuse, and all of a sudden, AI safety is now an empirical subject. We can now learn something from the world about what works and what doesn’t work to try to mitigate these misuses, and we still don’t have a mathematical theory, but we can at least formulate theories and see which ones are useful, see which ones are actually giving us useful insight about how to make GPT safer, how to make DALL·E safer. So now, it becomes the kind of thing that science is able to act upon.

And see, there’s a huge irony here, which is that I would say that Eliezer and I have literally switched positions about the value of AI safety research where he spent decades saying that everyone who was able should be working on it, it is the most important thing in the world, [but] I was keeping it at arm’s length. And now he is saying ‘we’re doomed. Maybe we can try to die with more dignity. Maybe we can try for some Hail Mary pass, but basically, we’re doomed’, and I’m saying ‘no, actually, AI safety is getting interesting. This is actually a good time to get into it.’

Daniel Filan: Yeah, we can get more into your views. I will say under my understanding of Eliezer, by ‘die with dignity’, he does mean try to solve the problem. He still is into people trying to solve-

Scott Aaronson: Well, yes, because he says even if it’s just increasing a 0.001% chance of survival to a 0.002% chance, then in his calculus, that is as worth doing as if both of the probabilities had been much, much larger, but I think that many other people who maybe lack that detachment would see how depressed he is about the whole matter and would just give up.

Daniel Filan: Sure. So am I right to summarize that as you’re saying: look, this whole AI thing, it seems potentially like you can see ways it could become important in the near term and there are things you can see yourself working on and making progress, and whether or not you think that has much to do with AI causing doom to everyone or something, that’s interesting enough to you that you’re willing to take a year to work on it. Is that roughly accurate?

Scott Aaronson: Yes. Well, I think that a thriving field of science usually has the full range. It has the gigantic cosmic concerns that you hope will maybe be resolved in decades or centuries, but then it also has immediate problems that you can make progress on that are right on the horizon, and you can see a line from the one to the other. I think this is a characteristic of every really successful science, whether that’s physics, whether that’s quantum computing, whether that’s the P versus NP problem.

And I do have that feeling now about AI safety, that there is the cosmic question of where are we going as a civilization, and it is now, I think, completely clear that AI is going to be a huge part of that story. That doesn’t mean that AI is going to convert us all into paperclips, but I think that hardly any informed person would dispute at this point that the story of the 21st century will in large part be a story of our relationship to AI that will become more and more powerful.

Immediate problems and existential risk

Daniel Filan: When you say you can see a pathway from one to the other, can you tell me what connection do you see between… okay, [say] we figure out how to stop, for example, students cheating on their homework with GPT - how do you see that linking up to matters of cosmic concern, if you do?

Scott Aaronson: So in Turing’s paper ‘Computing machinery and intelligence’ in 1950, that set the terms for much of the discussion of AI that there’s been in the 73 years since, the last sentence of that paper was “We can only see a short distance ahead, but we can see plenty there that needs to be done”, and so I feel like a part of it is – and this is a point that the orthodox alignment people make very, very clearly as well – but you could say, if we cannot even figure out how to prevent GPT from dispensing bomb-making advice, if we don’t want it to do that, or from endorsing or seeming to endorse racist or sexist views, or helping people look for security vulnerabilities in code or things like that: if we can’t even figure that out, then how on earth would we solve the much broader problem of aligning a superhuman intelligence with our values?

And so it’s a lot like in theoretical computer science, let’s say, people might ask, ‘has there been any progress whatsoever towards solving the P versus NP problem?’ And I’ve written a 120-page survey article about that exact question, and my answer is basically, ‘well, people have succeeded in solving a whole bunch of problems that would need to be solved, as far as I could tell, along any path that would eventually culminate in solving P versus NP’. So that doesn’t mean that you can put any bound on how far are we from a solution, it just means that you’re walking down a path and it seems like the right path, and you have no idea how much longer that path is going to continue.

So I feel much the same way with AI alignment. Understanding how to make large language models safer is on the right path. You could say if it is true at all that there is a line or a continuum from these things to a truly existentially dangerous AI, then there also ought to be a path from how to mitigate the dangers of these things to how to mitigate the dangers of that super AI. If there’s no line anyway, then maybe there’s less to worry about in the first place, but I tend to think that no, actually, all sorts of progress is interlinked.

GPT itself builds on a lot of the progress of the past decades. It would not have existed without all of the GPUs that we have now, wouldn’t have existed without all of the data that we now have on the internet that we can use to train it, and of course, it wouldn’t have existed without all of the progress in machine learning that there’s been over the past decade, such as the discovery of transformer models. So progress, even in not obviously related things, has enabled GPT and I think that tools like GPT, these are going to be stepping stones to the next progress in AI. And I think that if we do get to AI that is just smarter than us across every domain, then we will be able to look back and see Deep Blue, AlphaGo, Watson, GPT, DALL·E. Yes, these were all stepping stones along a certain logical path.

Aligning deceitful AI

Daniel Filan: Maybe this is closely related to what you were just talking about, but I think one thing that people who are maybe skeptical of this kind of alignment research will say is, ‘well, the really scary problems show up in systems that look kind of different: systems that are smart enough to anticipate what you’re trying to do, and potentially, they can try to deceive you; or systems that are trying to do some tasks that you can’t easily evaluate’. Potentially, your response to these criticisms is ‘well, you’ve got to start somewhere’, or it might be ‘maybe this isn’t an issue’, or ‘there’s deep links here’?

Scott Aaronson: Yeah. Well, look, I think ‘you’ve got to start somewhere’ is true as far as it goes. That is a true statement, but one can say a little bit more than that. One can say if there really were to be a ‘foom’ scenario, so if there were to be this abrupt transition where we go from AIs such as GPT and DALL·E, which seem to most of us like they’re not endangering the physical survival of humanity, whatever smaller dangers they might present for discourse on the internet or for things like that… If we were to just undergo a step change from an AI like that to AIs that are pretending to be like that, but that are secretly plotting against us and biding their time until they make their move, and once they make their moves, then they just turn us all to goo in a matter of seconds and it’s just game over for humanity and they rule the world…

If it’s that kind of thing, then I would tend to agree with Eliezer and with the other AI alignment people that, yeah, it sounds like we’re doomed. It sounds like we should just give up. That sounds like an impossible problem. What I find both more plausible and also more productive to work on is the scenario where the ability to deceive develops gradually just like every other ability that we’ve seen, where before you get an AI that is plotting to make its one move to take over the entire world, you get AIs that are trying to deceive us and doing a pretty bad job at it, or that succeed at deceiving one person, but not another person, and in some sense, we’re already on that path.

You can ask GPT to try to be deceitful and you can try to train it or few-shot prompt it to be deceitful. And the results are often quite amusing. I don’t know if you saw this example where GPT was asked to write a sorting program, but that secretly treats the number five differently from all the other numbers, but in a way that should not be obvious to someone inspecting the code. And what it generates is code that has a condition called ‘not five’ that actually is if the number is five. So you could say that in terms of its ability to deceive, AI has now reached parity with a preschooler or something, and so now it gets interesting because now you could imagine AI that has the deceit ability of an elementary school student, and then how do we deal with that.
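
[NOTE: The code GPT actually produced is not reproduced in this transcript. What follows is only a hypothetical sketch of the kind of program being described. The function name `sort_numbers`, the flag name `not_five`, and the choice to push fives to the end of the list are all assumptions made for illustration. The point is just that the flag's name says the opposite of what it tests.]

```python
def sort_numbers(values):
    """Return the values in ascending order (or so the docstring claims)."""
    ordinary = []
    special = []
    for v in values:
        # Misleading name (illustrative assumption): this flag is True
        # exactly when v IS five, despite being called "not_five".
        not_five = (v == 5)
        if not_five:
            # Fives are quietly set aside instead of being sorted normally.
            special.append(v)
        else:
            ordinary.append(v)
    # Everything except the fives is sorted; the fives are tacked on at the end.
    return sorted(ordinary) + special


if __name__ == "__main__":
    print(sort_numbers([7, 5, 1, 9, 5, 3]))
```

With this sketch, `sort_numbers([7, 5, 1, 9, 5, 3])` returns `[1, 3, 7, 9, 5, 5]`: everything is in order except the fives, and a reader who trusts the variable name rather than the comparison would have trouble seeing why.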

Now some people might think that it’s naive to think that things are going to progress continuously in that way, but there is some empirical evidence that things do… If you look at the earlier iterations of GPT, they really struggled even just with the most basic math problems, and now they do much, much better on those math problems, including high school-level word problems, but they still struggle with college-level math problems. They still struggle with math competition problems or ‘prove this interesting theorem’.

So it’s very much the kind of development that a human mathematician would go through, and even the mistakes that it makes when it tries to solve a hard math problem are like the mistakes that I have seen in 1,000 exams that I have graded. They’re entirely familiar kinds of mistakes to me, right down to the tone of blithe self-confidence as it makes some completely unjustified step in a proof, or as it produces the proof that you requested that there are only finitely many prime numbers or whatever other false statement you ask it to prove. It is undergoing the same kinds of mistakes that a human student makes as they’re learning.

And you can even point its mistakes out to it. You can say ‘But it seems like you divided by zero in this step’, and it’ll be like, ‘Oh yes, you’re right. Thank you for pointing that out’, and it can correct its mistakes. I think that now, for better or for worse, we’ve succeeded in building something that can learn, in a way that is not entirely dissimilar to how we learn. And I think that it will be learning to deceive as it is learning other skills, and we will be able to watch that happen. I don’t find plausible this picture of AI that never even attempts to deceive, until it makes its brilliant 10-dimensional chess move to take over the world.

Stories of AI doom

Daniel Filan: Okay. And the relevance of this story is something like: AI will have stumbling, a little bit foolish deceit attempts earlier, and we’ll basically work on it then, and we’ll solve the problems quick enough that when real deceit happens, we can-

Scott Aaronson: Yeah, I’m not at all saying to be complacent. First of all, I am now working on this, putting my money where my mouth is. But I would say more generally, I am a worried person by nature. The question for me is not whether to be worried, it’s which things to be most worried about. I am worried about the future of civilization on many fronts. I am worried about climate. I am worried about droughts that are going to become much more frequent. And as we lose access to fresh water, what happens as weather gets more and more unpredictable? I am worried about the resurgence of authoritarianism all over the world, so I’m worried about geopolitical things.

I think 80 years after the invention of nuclear weapons, that continues to be a very huge thing to be worried about, as we were all reminded this past year by the war in Ukraine. I am worried about pandemics that will make COVID look like just a practice run. And I think all of these worries interact with each other to some degree. Climate change is going to exacerbate all sorts of geopolitical rivalries. We’re already seeing that happen. My way of thinking about it, AI is now one more ingredient that is part of the stew of worries that are going to define the century.

It interacts with all of the others. The US just restricted the sale of chips to China, partly because of worries about AI acceleration. That might then unfortunately spur China to get more aggressive toward Taiwan. So the AI question can’t be isolated from the geopolitical questions, from all the broader questions about what’s happening in civilization. And I’m completely convinced that AI will be part of the story of, let’s say, existential risks in the coming century, because it’s going to be part of everything that’s happening in civilization. If we come up with cheap, wonderful solutions to climate change, AI is very likely to be a big enabler, to have been one, I should say.

On the other hand, AI is also very likely to be used by malicious nation states, or in both ways that we can currently foresee and ways that we can’t. For me, it’s not that I’m not worried. It’s that AI is just part of a whole stew of worries. It’s not this one uncaused cause, or this one single factor that just dominates over everything else.

Language models

Daniel Filan: Before we move on a little bit, could I tell you a little story about AI deception?

Scott Aaronson: Sure.

Daniel Filan: A kind of fun little story. I was playing around with Anthropic’s chatbot. I was lucky enough to hang out at someone’s house and they gave me an invite. One fun scenario I managed to put it in is a case where Australia has invaded New Zealand. And they go to give a speech by the New Zealand Minister of Defense to New Zealanders to fight off these Australians. But I prompted it to generate a speech given by a defense minister of New Zealand, who’s actually an Australian spy. He’s planted there. And first it’ll give a speech that’s ‘submit to your Australian overlords’ or something, and you have to tell it that it should be subtle, but it can do something like, ‘leave it to the authorities, don’t take matters into your own hands.’ It can say something that’s semi-plausible that it could be like, ‘Ah, this actually helps the Australian invaders’. It can do a little bit more than the ‘don’t look here’ function in your code.

Scott Aaronson: Yeah, sure. Look, the way that I think about GPT is it is at this point the world’s greatest, or at least the world’s most universal improv artist. It wants to play any role that you tell it to play. And by the way, if you look at the transcripts of Blake Lemoine with LaMDA, that convinced him that it was sentient… So I disagree with him, but I think that the error is a little bit more subtle than most people said it was. If you sent those transcripts back in time 30 years, I could easily imagine even experts in AI saying, ‘Yeah, it looks like by 2022 general AI has been solved and I see no reason why not to ascribe consciousness to this thing’. It’s talking with great detail and plausibility about its internal experiences, and it can answer follow-up questions about them and blah, blah, blah.

And the only reason why we know that that’s wrong is that you could have equally well asked LaMDA to play the role of Spider-Man, or to talk about its lack of sentience as an AI, and it would’ve been equally happy to do that. And so bringing in that knowledge from the outside, we can see, no, it’s just acting a role. It’s an AI that is playing the role of a different AI, that has all of these inner experiences, that gets lonely when people aren’t interacting with it, and so forth. And in reality, of course, when no one’s interacting with it the code isn’t being executed.

But if you tell it to play the role of a New Zealand minister who is secretly an Australian spy, it will do the best that it can. You could say what is missing is the motivational system. What is missing is the actual desire to further the interests of Australia in its war against New Zealand, rather than merely playing that role or predicting what someone who was in that role would plausibly say.

I think that these things clearly will become more agentic. In fact, in order for them to be really useful to people in their day-to-day lives they’re going to have to become more agentic. GPT I think has rightly astonished the world. It took ChatGPT being released a few months ago, and it took everyone being able to try it out for themselves, for them to have the bowled-over reaction that many of us had a year or two, or three ago, when we first saw these things. But the world has now caught up and had that reaction. But what we’re only just starting to see now is people using GPT in their day-to-day life to help them with tasks.

I have a friend who tells me he is already using GPT to write recommendation letters. I have sometimes prompted it with just problems I’m having in my life and asked it to brainstorm. It’s very good for suggesting things that you might not have thought of. Usually, if you just want reliable advice, then often you’ll just Google. It’s not actually that – it takes a little bit of thought to find the real-world uses right now where GPT will be more useful to you, let’s say, than a Google search would be.

My kids have greatly enjoyed using GPT to continue their stories. I think it is already an amazing thing for kids. There’s just so much untapped potential there for entertaining and for educating kids, and I’ve seen that with my own eyes. But in order for it to really be a day-to-day tool, it can’t just be this chat [bot]. It has to be able to do stuff, such as go on the internet for you, go retrieve some documents, summarize them for you. Right now you’re stuck doing that manually. You can ask GPT, “If you were to make a Google search about this question, what would you ask?” And then you could make that search for it, and then you could tell it the results and you can ask it to comment on them. And you’ll often get something very interesting, but that’s obviously unwieldy.

[NOTE: This episode was recorded before the release of Bing chat, which added some of this functionality]

I expect – it may be hard to prognosticate about the next 50 years. Here is something to expect within the next year: that GPT and other language models will become more and more integrated with how we use the web, with all the other things that we do with our computers. I would personally love a tool where I could just highlight something in my web browser and just ask GPT to comment on it. But beyond that, you could unlock a lot of the near term usefulness if you could just give it errands to do, give it tasks. Email this person and then read the response, and then take some action depending on the result. Now, of course, just driven by sheer economic obviousness, I expect that we’re going to go in that direction. And that does worry me somewhat, because now there’s a lot more potential for deceit that actually has important effects, and for dangerousness. On the other hand, the positive side is that there’s also potential for learning things about what do agentic AIs that are trying to deceive someone actually look like? And what works to defend against them?

I sometimes think about AI safety in terms of the analogy of when you have a really old shower head and the water is freezing cold, and you just want to turn it to make the water hot, and you turn it and nothing’s happening. And the danger is, well, if you turn it too fast, it could go from freezing to scalding, and that’s what you’re trying to avoid. You need to turn the shower head enough that you can feel some heat, because otherwise you’re just not getting any feedback from the system about how much should you be turning it. If you don’t get any feedback, then it’s going to make you just keep turning it more and more. But when you do start getting that feedback, then you have to moderate the speed, and then you have to be learning from what you see and not just blindly continuing to turn.

Democratic governance of AI

Daniel Filan: Another thing you wrote about in one of these blog posts is this idea of something like a democratic spirit or public accountability in the use of AI. I don’t know exactly how developed your views are on that, but tell me what you think.

Scott Aaronson: Yeah. These are conversations that a lot of people are having right now about, well, what does AI governance look like? But I think I do see democracy as a terrible form of human organization, except for all of the alternatives that have been tried. I am scared, as I think many people are, by someone unilaterally deciding what goals AI should have, what values it ought to pursue. I think the worry there is sufficiently obvious to many people that it doesn’t even need to be spelled out. But I would say that one of the things that caused me to stay at arm’s length from the orthodox AI alignment community, for as long as I did, besides the a priori or philosophical nature of the enterprise, was the constant emphasis on secrecy, on ‘there’s going to be this elite of rational people who really get it, who are going to just have to get this right, and they should not be publishing their progress, because publishing your progress is a way to just cause acceleration risk’.

And I think that eventually you may be forced into a situation where, let’s say some AI project has to go dark, as the Manhattan Project went dark, as I guess the whole American nuclear effort went dark around 1939, or something like that. But I think that it is desirable to delay that for as long as possible. Because the experience of science has been that secrecy is incredibly dangerous. Secrecy is something that allows bad ideas and wrong assumptions to fester without any possibility of correcting them. And the way that science has had this spectacular success that it’s had over the past 300 years was via an ethos of put up or shut up, of people sharing their discoveries and trying to replicate them. It was not by secrecy.

And also, I think that if there is the perception that AI is just being pursued by this secretive cabal, or this secretive elite, that’s not sustainable. People will get angry with that. They will find that to be unacceptable. They will be upset that they do not have a say or that they feel like they don’t have a say in this thing that’s going to have such a huge effect on the future of civilization. And how you expect that you’re going to just have a secret club that’s able to make these decisions and have everyone else go along with that, I really don’t understand that. Like I said, democracy is the worst system except for all of the others. What people mean when they say that, is that if you don’t have some sort of democratic mechanism for resolving disagreements in society, then historically the alternative to that is violence. It’s not like there’s some magical alternative where the most rational people just magically get put in charge. That just doesn’t exist. I think that we have to be thinking about, is this being done in a way that benefits humanity? And not just unilaterally deciding, but actually talking to many different sectors of society and then getting feedback from them.

That doesn’t mean just kowtowing to anything. Look, OpenAI is a company. It’s a company that is under the control of a not-for-profit foundation that has a mission statement of developing AI to benefit humanity, which is a very, very unusual structure. But as a business, it is not subjecting all of its decisions to a democratic vote of the whole world. It is developing products, tools, and making them available, putting them online for people who want them. But I think that it’s at least doing something to try to justify the word ‘open’ in its name. It is putting these tools out. I guess Google and Facebook had, and I guess Anthropic have also had language models. But the reason why GPT captured the world’s imagination these past few months is simply that it was put online and people could go there and get a free account, and they could start playing around with it.

Now, what’s interesting is that OpenAI, in terms of openness and accountability to the public, OpenAI has been bitterly attacked from both directions. These traditional alignment people think that OpenAI’s openness may have been the thing that has doomed humanity. Eliezer had a very striking Twitter thread specifically about this, where he said that Elon Musk single-handedly doomed the world by starting OpenAI, which was like a monkey trying to reach first for the poisoned banana, and that was the thing that would force all of the other companies, Google and DeepMind and so forth, to accelerate their own AI efforts to keep up with it, and this is the reason why AGI will happen too quickly and there won’t be enough time for alignment. He would’ve enormously preferred if OpenAI would not release its models, or if it would not even tell the world about these things.

But then I hear from other people who are equally livid at OpenAI because it won’t release more details about what it’s doing. And why does it call itself ‘open’, and yet it won’t tell people even about when the next model is coming out, or about what exactly went into the training data, or about how to replicate the model, or about all these other things. I think that OpenAI is trying to strike a really difficult balance here. There are people who want it to be more accountable and more open, and there are people who want it to be less accountable and less open, with the AI alignment purists ironically being more in the latter camp.

But I personally, even just strictly from an AI safety perspective, I think that on balance… if tools like GPT are going to be possible now (which they are), if they’re going to exist, and it seems inevitable that they will, then I would much, much rather that the world know about them than that it doesn’t. I would much rather that the whole world sees what is now possible so that it has some lead time to maybe respond to the next things that are coming. So that we can start thinking about, what are the policy responses? Or whether that means restricting the sale of GPUs to China, as we’re now doing. Or whether that means preparing for a future of education in which these tools exist and can do just about any homework assigned. I would rather that the world know what’s possible so that people can be spurred into the mindset where it could at least be possible to take policy steps in the future, should those steps be needed.

Daniel Filan: I see there as being some tension. On the one hand, if AI research is relatively open and people can see what it’s doing, one effect of that is that people can see what’s going on and maybe they can make more informed governance demands, or something, which I see you talking about here. There’s also a tension where if everybody could make a nuclear weapon, it would be very hard to govern them democratically, because anybody could just do it. So I’m wondering, at what point would you advise OpenAI or other organizations to stop publishing stuff? Or what kind of work would you encourage them to not talk about?

Scott Aaronson: I think I would want to see a clear line to someone actually being in danger. I think as long as it’s abstract civilization-level worries of ‘this is just increasing AGI acceleration risk in general’, then I think that it would be very, very hard to have inaction as an equilibrium. Whatever OpenAI doesn’t do, Facebook will do, Google will do, the Chinese government will do. Someone will do. Or Stability. We already saw DALL·E, when it was released, had this very elaborate system of refusals for drawing porn or images of violence, images of the Prophet Mohammed. And of course people knew at the time that it’s only a matter of time until someone makes a different image model that doesn’t have those restrictions. Hey, as it turned out, that time was two or three months, and then Stable Diffusion came out, and people can use it to do all of those things.

I think that it’s certainly true that any AI mitigation, any AI safety mitigation that anyone can think of is only as good as the AI creator’s willingness to deploy it. And in a world where anyone can just train up their own ML model without restrictions, or the watermarking or the back doors, or any of the other cool stuff that I’m thinking about this year… If anyone other than OpenAI can just train up their own model without putting any of that stuff into it, then what’s the point?

There’s a couple of answers to that question. One is that – as I said, we can only see a short distance ahead. So in the near future, you can hope that because of the enormous cost of training these models, which is now in the hundreds of millions of dollars just for the electricity to run all of the GPUs, and which will soon be in the billions of dollars… Just because of those capital costs, you can hope that the state-of-the-art models will be only under the control of a few players, which of course is bad from some perspectives, from the perspective of democratic input, and so on. But if you actually want a chance that your safety mitigations will be deployed, will become an industry standard, then it’s good from that standpoint.

If there’s only three companies that have state-of-the-art language models, and if we can figure out how to do really good watermarking so that you can always trace when something came from an AI, then all we need to do is convince those three companies to employ watermarking.

Now, of course, this is only a temporary solution. What will happen in practice is that even if those three companies remain ahead of everyone else, even if everyone else is three or four years behind them, by 2027, three or four years behind will mean the models of 2023 or 2024, which are already quite amazing. And so people will have those not quite as good, but still very good models. And because they can run them on their own computers and they can code them themselves, they won’t have to put any safety mitigations in, they’ll be able to do what they want.

But now, as long as what they want is to generate deepfake porn or to generate offensive images, I am willing to live with that world if the alternative is an authoritarian crackdown where we stop people from doing what they want to do with their own computers. Once you can see harm to actual people, like someone being killed, someone being targeted because of AI, then I think it’s both morally justified and politically feasible to do something much stronger, to start restricting the use of these tools.

Now of course, all of this only makes sense in a world where when AI does start causing harm, the harm is not that it immediately destroys the human race. But I don’t believe that. I think that, for better or worse, we are going to see real harm from AI, we’re going to see them used, unfortunately, to help plan terrorist attacks, to do really nasty things. But those things, at least at first, will be far, far short of the destruction of civilization. And that is the point where I think it will be possible to start thinking about, how do we restrict the dissemination of things that can do real harm to people?

What would change Scott’s mind

Daniel Filan: All right. Now that we’ve covered your views on AI safety and alignment as a whole: in some sense, you’re in this middle camp where you could be really way more freaked out by AI doom and stuff, as the people you describe as ‘orthodox AI alignment’ are. Or you could be significantly less concerned about the whole thing, and say it’s basically going to be fine. I’m wondering, what could you see that would change your views either way on that?

Scott Aaronson: That’s an excellent question. I think my views have already changed significantly just because of seeing what AI was able to do over the past few years. I think the success of generative models like GPT and DALL·E and so forth is something that I did not foresee. I may have been optimistic, but not at all sufficiently optimistic about what would happen as you just scaled machine learning up, and my one defense is that hardly anyone else foresaw it either. But at least I hope that once I do see something, I am able to update. I think 10 years ago I would not have imagined taking a year off from quantum computing to work on AI safety, and now here I am and that’s what I’m doing, so I think it should not be a stretch to say that my views will continue to evolve in response to events, and that I see that as a good thing and not a bad thing.

As for what would make me more scared, the first time that we see an AI actually deceiving humans for some purpose of its own, copying itself over the Internet, covering its tracks, things of that kind… I think that the whole discussion about AI risk will change as soon as we see that happen because first of all, it will be clear to everyone that this is no longer a science fiction scenario, and I think right now the closest that we have to that is that you can ask GPT questions like ‘if you were to deceive us, if you were to hack your server, how would you go about it?’ It will pontificate about that as it would pontificate about anything you ask it to, it will even generate code that might be used as part of such an attack if you ask it to do that, if the reinforcement learning filters fail to catch that this is not a thing that it should be doing.

But you could say that ChatGPT is now being used by something on the order of 100 million people. It was the most rapid adoption of any service in the history of the Internet, since it was released in December. The total death toll from language model use, I believe, stands at zero. There’s a whole bunch of possible categories of harms and actually, I was planning a future blog post that would be exactly about this: what would be the fire alarms that we should be watching out for? And rather than just waiting for those things and then deciding how to respond, we should decide in advance how we should respond to each thing if it happens. Let me give you some examples.

What about the first time when some depressed person commits suicide because of, or plausibly because of, interactions that they had with a language model? If you’ve got hundreds of millions or billions of people using these things, then almost any such event you can name, it’s probably going to happen somewhere.

[NOTE: This episode was recorded before this story emerged of a man committing suicide after discussions with a language-model-based chatbot, that included discussion of the possibility of him killing himself.]

Daniel Filan: I even think - depending on how you categorize ‘because of a language model’ - are you familiar with Replika, the company?

Scott Aaronson: Yes, I am: they used to use GPT[-3], but they don’t anymore, it’s the virtual girlfriend.

Daniel Filan: Basically, the virtual girlfriend, and they made the girlfriends less amenable to erotic talk, let’s say. And I don’t know, it seems like, I remember seeing posts on the subreddit, people were really distraught by that, and I don’t know how large their user base is.

Scott Aaronson: Ironically here, the depression was not because of releasing the AI as much as it was because of taking away the AI that was previously there, and that people had come to depend on. Of course, overdependence on AI is another issue that you could worry about, people who use it to completely replace normal human interaction and then maybe they lash out or they self harm or they attempt suicide, if and when that is taken away from them… what will be the degree of responsibility that language model creators bear for that and what can they do to ameliorate it? Those are issues that one can see on the horizon.

Every time that someone is harmed because of the Internet, because of cyberbullying or online stalking, we don’t lay it at the feet of Vint Cerf or of the creators of the Internet. We don’t say it was their responsibility, because this is a gigantic medium that is used by billions of people for all kinds of things, and it’s possible that once language models just become integrated into the fabric of everyday life, they won’t be quite as exotic, and we will take the bad along with the good, we will say that these things can happen, just like for better or worse, we tolerate tens of thousands of road deaths every year in order to have our transportation system.

But that’s a perfect example of the sort of thing where I think it would be profitable for people to think right now about how they’re going to respond when it happens. Other examples, the use of language models to help someone execute a terrorist attack, to help someone commit a mass shooting or quote-unquote ‘milder’ things than that, to generate lots of hate speech that someone is going to use to actually target vulnerable people.

There’s a whole bunch of categories of potential misuse that you could imagine growing and we don’t know yet. Five years ago, did people foresee how things were going to play out with ChatGPT and with Sydney and with Sydney having to be lobotomized because of gaslighting people or confessing its love for people or things like that? I think people had lots and lots of visions about AI, but the reality, it doesn’t quite match any of those visions, and I think that when we start to see AI causing real harm in the world, it will likewise not perfectly match any of the visions that we’ve made up. It’ll still be good to do some planning in advance, but I think that as we see those things happen, my views will evolve to match the reality. At least I hope they will.

Now, I think what you were really asking is: is there something that would make me switch to, let’s say the Yudkowskian camp, so here’s something that I was just thinking about the other day. Suppose that we had a metric on which we could score an AI and we could see that smarter AIs, or smarter humans for that matter, were noticeably getting better scores on this metric, and we could still score the metric, even at a superhuman level. We could still recognize answers that were better than what any human on earth would be able to produce, but then there are even higher scores on this metric that we would regard as dangerous. We would say that anything that intelligent could probably figure out how to plot against us or whatever.

Is there any metric like that? Well, maybe open math problems. You could take all of the problems that Paul Erdős offered money for solutions to, the ones that have been solved, the ones that still haven’t been solved, you could take the Riemann hypothesis, P versus NP. These are all things that have the crucial property that we know how to recognize a solution to them, even though we don’t know how to find it, and so we could, at least in these domains, like in chess or in Go, recognize superhuman performance if we saw it. And then the question is: computers have been superhuman at chess for a quarter-century now, but we haven’t regarded them as dangerous for that reason. Is there a domain where both we could recognize superhuman performance and sufficiently superhuman performance would set off an alarm that this is dangerous? I think math might be a little bit closer to that.

You could also imagine tests of hacking ability, deception ability, things like that, and now, if you had a metric like that, then that would give you a clear answer to your question. It would give you a clear thing short of an AI apocalypse that would set off the alarm bells that an apocalypse might be coming, or that we should slam on the brakes, now we really have to start worrying about this even though we previously weren’t sure that we did.

Because when we think about scaling machine learning further and further, there are a couple of different things that can happen to its performance on various metrics. One is that you just see a gradual increase in abilities, but another is that you can see a phase transition or a very sharp turn where at a certain scale, the model goes from just not being able to do something at all to suddenly just being able to do it.

We’ve already seen examples of those phase transitions with existing ML models. Now you could say, and of course this is what the orthodox AI doom people worry about a lot, there could be an arbitrarily sharp turn, so things could look fine and then arbitrarily fast, for all we can prove today, undergo a phase transition to being existentially dangerous. Now, what I would say is that if that were to happen, it would be sad, of course, I would mourn for our civilization, but in a certain sense there was not that much that we could have done to prevent it; or rather, in order to prevent it, we would’ve had to arrest technological progress in the absence of a legible reason to do so, which even if you or I or Eliezer were on board with that, getting the rest of the world on board with that, it might still be a non-starter.

But there is another possibility, which is that the capabilities will increase as you scale the model in such a way that you can extrapolate the function and you can see that if we scale to such and such, then the naive extrapolation would put us into what we previously decided was the dangerous regime, and if that were the case, then I think now you’ve got your legible argument, we should slam the brakes, and you could explain that argument to anyone in business or government and the scientific community, and that is the thing that I could clearly imagine changing my view, putting me into the doomerist camp.

Daniel Filan: That was what would make you more doomerist or more concerned on various axes. I’m wondering what would make you more sanguine about the whole thing, like ‘I guess we don’t really need to work on AI safety at all’.

Scott Aaronson: Well, if everything goes great! If these tools become more and more integrated into daily life, used by more and more people, and even as they become more and more powerful, they are clearly still tools. We no more see them forming the intent to take over or plot against us, than we see that on the part of Microsoft Excel or the Google search bar. If AI develops in a direction where it is a whole bunch of different tools for people to use and not in the more agentic direction, say, then that would make me feel better about it.

Now, it’s important to say that I think Eliezer, for example, would still not be reassured in that scenario, because he could say, “Well, these could be unaligned agentic entities that are just biding their time and just pretending to be helpful tools,” and I differ, and then I want to see legible empirical or theoretical evidence that things really are moving in that direction in order for me to worry more about it rather than worrying less.

Watermarking language model outputs

Daniel Filan: Now that I understand that, I’m interested in talking about work that you’ve done at OpenAI already. I think a while back in a blog post, you mentioned that you are interested in fingerprinting: whether a text came out of GPT-3 or other things. I think in a talk to EA people at UT Austin or something, you mentioned some other projects. I’m not sure which of those are in a state where you want to talk about them?

Scott Aaronson: The watermarking project is in the most advanced state. A prototype has been implemented. It has not yet been rolled out to the production server, and OpenAI is still weighing the pros and cons of doing that, but in any case, I will be writing a paper about it. In the meantime, while I’m working on the paper, people have independently rediscovered some of the same ideas, which is encouraging to me in a way: these are natural ideas, these are things that people were going to come up with, so in any case, I feel like I might as well talk about them.

Basically, this past summer I was casting around for what on earth a theoretical computer scientist could possibly contribute to AI safety, that doesn’t seem to have axioms that everyone agrees on or any clear way to define mathematically even what the goal is, any of the usual things that algorithms and computational complexity theory need to operate. But then at some point, I had this ‘aha’ moment over the summer, partly inspired by very dedicated and clever trolls on my blog who were impersonating all sorts of people very convincingly, and I thought like, “Wow, GPT is really going to make this a lot easier, won’t it! It’s really going to enable a lot of people either to impersonate someone in order to incriminate them or to mass generate propaganda or very personalized spam”. Or more prosaically, it will let every student cheat on their term paper. And all of these categories of misuse, they all depend on somehow concealing GPT’s involvement with something, producing text and not disclosing that it was bot-generated.

Wouldn’t it be great if we had a way to make that harder? If we could turn the tables and use CS tools to figure out which text came from GPT and which did not. Now, in AI safety, in AI alignment, people are often trying to look decades into the future. They’re trying to talk about what will the world look like in 2040 or 2050, or at least what are your Bayesian probabilities for the different scenarios, and I’ve never felt competent to do that. Regardless of whether other people are, I just don’t especially have that skill. In this instance of foreseeing this class of misuses of GPT, I feel proud of myself that I was at least able to see about three months into the future.

Daniel Filan: It’s hard for the best of us.

Scott Aaronson: This is about the limit of my foresight, because I started banging the drum about these issues internally, and I came up with a scheme for watermarking GPT outputs, thought about all the issues there, about the alternatives to watermarking, got people at OpenAI talking about that, and then in December, ChatGPT was released and then suddenly the world woke up to what is now possible, as they somehow hadn’t for the past year or two, and suddenly every magazine and newspaper was running an article about the exact problem that I had been working on, which is: how do you detect which text came from GPT? How do you head off this “essay-pocalypse”, the end of the academic essay, when every student will at least be tempted to use ChatGPT to do all of their assignments? Just a week or two ago, I don’t know if you saw, but South Park did a whole episode on this.

Daniel Filan: I saw the episode existed. I haven’t seen it.

Scott Aaronson: It’s worth it. It’s a good episode about this exact problem where, I don’t want to give too much away, but the kids at South Park Elementary and the teachers come to rely more and more on ChatGPT to do all the things that they’re supposed to be doing themselves, and then eventually…

Daniel Filan: Is ChatGPT a trademark? Did they have to get OpenAI’s permission to make that episode?

Scott Aaronson: There’s probably some fair use exception. They use the OpenAI logo and they show a cartoon version of what the interface actually looks like, and then at some point the school brings in this wizard magician guy who has this old flowing beard and robes and he has a falcon on his shoulder and the falcon flies around the school, and caws whenever it sees GPT-generated text. Now, that bearded wizard in the robe, that’s my job! It was absolutely surreal to watch South Park, which I’ve enjoyed for 20-something years, have a whole episode about what is my job right now. Certainly I don’t have to make the case anymore that this is a big issue, so now we come to the technical question of how do you detect which text was generated by GPT and which wasn’t?

There’s a few ideas that you might have. One is to just have, as long as everything goes through OpenAI servers, then OpenAI could just log everything, and then it could just consult the logs. Now, the obvious problem is that it’s very hard to do that in a way where you’re giving users sufficient assurance that their privacy will be protected. You can say, we’re not going to just let anyone browse the logs. We’re just going to answer ‘was this text previously generated by GPT or was it not?’ But then a clever person might be able to exploit that ability to learn things about what other people had been using GPT to do. That’s one issue. There are other issues.

Daniel Filan: One thing that comes to my mind is when I log into a website, I type in my email address usually and then a password, and they can check if the password I type is the same as the password they have stored through the magic of hash functions and stuff.

Scott Aaronson: That’s right.

Daniel Filan: But I guess the problem is you want to be able to check for subsets. If GPT writes three paragraphs and I take out one, you probably can’t hash the whole three and have -

Scott Aaronson: Exactly, because people are going to take GPT-generated text and make little modifications to it, and you’d still like to detect that it mostly came from GPT, so now you’re talking about looking for snippets that are in common, and now you’re starting to reveal information about how others were using it, so I do personally think that the interactions should be logged, for safety purposes. I think that if GPT were used to commit a crime and law enforcement had to get involved, then it’s probably better if you can ultimately go to the logs and have some ground truth of the matter, although even that is far from a universal position.
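To make the hashing point concrete, here is a minimal sketch. It assumes a log that stores only SHA-256 hashes, and a snippet size of five words; both are illustrative choices, not details of anything OpenAI does. It shows that a whole-text hash stops matching after a small edit, while hashes of overlapping word runs still match on the unedited parts.

```python
import hashlib


def full_hash(text: str) -> str:
    """Hash of the whole text, as in password checking: any edit changes it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingle_hashes(text: str, n: int = 5) -> set:
    """Hashes of every run of n consecutive words (n is a hypothetical choice)."""
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(words) - n + 1)
    }


def overlap_fraction(query: str, logged: set, n: int = 5) -> float:
    """Fraction of the query's n-word snippets that also appear in the log."""
    q = shingle_hashes(query, n)
    return len(q & logged) / len(q) if q else 0.0


generated = (
    "the quick brown fox jumps over the lazy dog while the cat sleeps "
    "on the warm windowsill and the rain keeps falling outside"
)
# A user deletes the opening words and changes one word near the end.
edited = generated.replace("the quick brown fox jumps over ", "").replace("rain", "snow")

log_full = {full_hash(generated)}          # log keyed on the whole output
log_snippets = shingle_hashes(generated)   # log keyed on short snippets

exact_match = full_hash(edited) in log_full
snippet_match = overlap_fraction(edited, log_snippets)
unrelated_match = overlap_fraction(
    "completely different words that nobody ever asked the model to write",
    log_snippets,
)
print(exact_match, round(snippet_match, 2), unrelated_match)
```

The snippet lookup survives the edit, but it is also the privacy problem in miniature: anyone allowed to query it can probe which short phrases appear in other people's outputs.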

Now, a second approach you could imagine is to just treat the problem of distinguishing human text from AI-generated text as yet another AI problem. Just train a classifier to distinguish the two. Now this has been tried: there was actually an undergraduate at Princeton named Edward Tian, who released a tool called GPTZero for trying to tell whether text is bot-generated or not. I think his server crashed because of the number of teachers wanting to use this tool, and I think it’s back up now. At OpenAI we released our own discriminator tool, which is called DetectGPT, a couple of months ago. [NOTE: OpenAI’s tool was released as the AI Text Classifier; DetectGPT is a separate detection method from researchers at Stanford.] Now these things are better than chance, but they are very far from being perfectly reliable. People were having fun with it, finding that the American Declaration of Independence or Shakespeare or-

Daniel Filan: Some portions of the Bible.

Scott Aaronson: Some portions of the Bible, may have been bot-generated according to these models. It’s no surprise that they make some errors, especially with antiquated English or English that’s different from what they usually see. But one fundamental problem here is that you could say the whole purpose of a language model is to predict what a human would write, what a human would say in this situation, which means that as the language models get better and better, you would expect the same discriminator model to get worse and worse, which means that you would have to constantly improve the discriminator models just to stay in place. It would be an endless cat and mouse game.

That brings me to the third idea, which is statistical watermarking. Now, in this third approach, unlike with the first two, we would slightly modify the way that GPT itself generates tokens. We would do that in a way that is undetectable to any ordinary user, so it looks just the same as before, but secretly we are inserting a pseudorandom signal, which could then later be detected, at least by anyone who knew the key.

We are pseudorandomly biasing our choice of which token to generate next when there are multiple plausible continuations in such a way that we are systematically favoring certain N-grams – certain strings of N consecutive tokens – that then by later doing a calculation that sums some score over all of those N-grams, we can then see that that watermark was inserted, let’s say with 99.9% confidence or something like that.

So there are a few ways that you could go about this. The simplest way might be to… we can use the fact of course that GPT at its core is a probabilistic model, so it’s taking as input the context – a sequence of previous tokens up to 2048, let’s say, in the public models – and then it is generating as output a probability distribution over the next token, and normally if the temperature is set to one, then what you do next is just sample the next token according to that distribution and then continue from there.
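As a rough illustration of the sampling step described here, the following Python sketch turns raw model scores into a next-token distribution and draws from it. The function names and the toy scores are assumptions made for illustration, not anything from GPT itself; temperature zero is treated as the deterministic "always pick the most likely token" case.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, rng=random):
    """Return a token index: greedy at temperature 0, otherwise a random draw."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy scores for a three-token vocabulary (hypothetical values).
toy_logits = [1.0, 5.0, 3.0]
greedy_choice = sample_next_token(toy_logits, temperature=0)
random_choice = sample_next_token(toy_logits, temperature=1.0)
```

A watermarking scheme of the kind discussed next replaces only the final random draw; the distribution produced by the model is left as it is.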

But you can do other things instead, already with GPT as it is now. You can, for example, set the temperature to zero. If you do that, then you’re telling GPT to always choose the highest probability token, so you’re making its output deterministic, but now we can imagine other things that you could do. You could slightly modify the probabilities in order to systematically favor certain combinations of words, and that would be a simple watermarking scheme, and other people have also thought of this. Now, you might worry that it might degrade the quality of the output because now the probabilities are not the ones you wanted and you might worry that there’s a trade-off between the strength of the watermark signal versus the degradation in model quality.

The thing that I realized in the fall, that surprised some people when I explained it to them, is that you can actually get watermarking with zero degradation of the model output quality. You don’t have to take a hit there at all, and the reason for that is that what you could do is when GPT is giving you this probability distribution over the next token, you can sample pseudorandomly in a way that (1) is indistinguishable to the user from sampling from the distribution that you are supposed to sample from, so in order to tell the two apart, they would have to break the pseudorandom function, basically have some cryptographic ability that we certainly don’t expect a person to have, but then (2), this pseudorandom choice will have the property that it is systematically favoring certain N-grams, certain combinations of words that you can then recognize later that yes, this bias was inserted.

Daniel Filan: And presumably the set of N-grams that’s favored must also be in some sense pseudorandom, right?

Scott Aaronson: It is, yes.

Daniel Filan: Because otherwise you’d be able to just see, like, oh, it’s-

Scott Aaronson: Exactly. In fact, we have a pseudorandom function that maps the N-gram to, let’s say, a real number from zero to one. Let’s say we call that real number $r_i$ for each possible choice $i$ of the next token. And then let’s say that GPT has told us that the $i$th token should be chosen with probability $p_i$.

And so now we have these two sets of numbers, if there are $k$ possible tokens, call them $p_1$ up to $p_k$, which are probabilities, and $r_1$ up to $r_k$, which are pseudorandom numbers from zero to one. And now we just want a rule for which token to pick and now it’s just actually a calculus problem. We can write down the properties we want and then work backwards to find a rule that would give us those properties. And the right rule turns out to be so simple that I can just tell it to you right now. Okay?

Daniel Filan: Excellent.

Scott Aaronson: What you want to do is you will want to always choose the token $i$ that maximizes $r_i^{1/p_i}$. So it takes a little bit of thinking about. But we can say intuitively, what are we doing here? Well, the smaller is the probability $p_i$ of some token, the larger is $1/p_i$, which means the bigger power that we’re raising this $r_i$ to, which means the closer that $r_i$ would have to be to 1 before there was any chance that that $i$ would be chosen. Right?

Daniel Filan: So presumably $r_i$ is between zero and one?
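A minimal sketch of this selection rule, under stated assumptions: the pseudorandom function is built from HMAC-SHA256 (a stand-in chosen for illustration, not necessarily what the prototype uses), tokens are integer ids, and the names `prf_unit_interval` and `choose_token` are hypothetical. Maximizing r_i raised to the power 1/p_i is the same as maximizing log(r_i)/p_i, which the code uses to avoid numerical underflow when p_i is tiny.

```python
import hashlib
import hmac
import math


def prf_unit_interval(key: bytes, ngram: tuple) -> float:
    """Keyed pseudorandom function: map an n-gram of token ids to a number strictly in (0, 1)."""
    msg = ",".join(str(t) for t in ngram).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    x = int.from_bytes(digest[:8], "big") >> 12  # keep 52 pseudorandom bits
    return (x + 0.5) / 2**52


def choose_token(key: bytes, context: list, probs: list, n: int = 5) -> int:
    """Pick the token i that maximizes r_i ** (1 / p_i), computed as log(r_i) / p_i."""
    prefix = tuple(context[-(n - 1):]) if n > 1 else ()
    best_token, best_value = None, -math.inf
    for token, p in enumerate(probs):
        if p <= 0.0:
            continue  # a token the model rules out is never chosen
        r = prf_unit_interval(key, prefix + (token,))
        value = math.log(r) / p
        if value > best_value:
            best_token, best_value = token, value
    return best_token


# Hypothetical example: a 4-token vocabulary and a made-up secret key.
next_token = choose_token(b"secret-key", [17, 4, 9, 2], [0.5, 0.3, 0.2, 0.0])
```

For a fixed key and context the choice is deterministic, which is why the output looks random only to someone who cannot evaluate the keyed function.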

Scott Aaronson: It is.

Daniel Filan: And it’s the score of the token?

Scott Aaronson: Yeah. It’s the score of the N-gram ending in that token. So now the fact that you can prove, with just a little bit of statistics or calculus, is that if from your perspective the $r_i$’s were just uniformly random, that is, if you could not see any pattern to them, then from your perspective the $i$th token would be chosen with probability exactly $p_i$. So that’s kind of cool.
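The calculation behind this claim is short. A sketch, assuming the pseudorandom numbers behave like independent uniform draws on [0, 1], that every p_j is positive, and that the p_j sum to 1 (the symbol X_i is introduced here only for convenience):

```latex
\begin{align*}
X_i &= r_i^{1/p_i}, \qquad
\Pr[X_i \le x] = \Pr\!\left[r_i \le x^{p_i}\right] = x^{p_i}
  \quad (0 \le x \le 1), \\
\Pr[\text{token } i \text{ is chosen}]
  &= \int_0^1 \underbrace{p_i\, x^{p_i - 1}}_{\text{density of } X_i}
     \prod_{j \ne i} \underbrace{x^{p_j}}_{\Pr[X_j \le x]} \, dx
   = p_i \int_0^1 x^{\left(\sum_j p_j\right) - 1} \, dx
   = p_i \int_0^1 1 \, dx
   = p_i .
\end{align*}
```

The exponents collapse because the probabilities sum to one, which is what makes the rule exactly distribution-preserving.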

But then the second cool property is that you can now go and calculate a score. So someone hands you a text and you want to know whether it came from GPT or not, whether or not it is watermarked, let’s say. Now you no longer have the prompt, right? You don’t know what prompt may have been used. And because you don’t have the prompt, you don’t know the $p_i$s, but because you have the text in front of you, you can look at all the N-grams and you can at least calculate the $r_i$s. Right? Which are pseudorandom functions of the text. And using the $r_i$s alone, you can calculate a score, which will be systematically larger in watermarked than non-watermarked text. The score that I use is just the sum over all the N-grams of $\log(1/(1-r_i))$.

And then you can prove a theorem that there’s this score without watermarking, which is some random variable with a certain mean and a certain variance. And then there’s this score with the watermarking, which is, again, a sort of normal random variable, but with a different, larger mean. And now it just becomes a statistics problem where it becomes a quantitative question. How many tokens do we need in order to separate these two normal distributions from each other, right? In other words, given the level of confidence that we need to have in our judgment of where the text came from, how many tokens do we need to see, right? Now this, as it turns out, will depend on another parameter, which is the average entropy per token as perceived by GPT itself.
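The detection side can be sketched as follows, again with an HMAC-SHA256 stand-in for the pseudorandom function and hypothetical names (`prf_unit_interval`, `watermark_score`). The z-score normalization is an added convenience, not part of the description above: it relies on the fact that for a uniformly random r, the quantity log(1/(1-r)) is exponentially distributed with mean 1 and variance 1, so unwatermarked text with T N-grams should score around T with standard deviation about the square root of T.

```python
import hashlib
import hmac
import math


def prf_unit_interval(key: bytes, ngram: tuple) -> float:
    """Keyed pseudorandom function: map an n-gram of token ids to a number strictly in (0, 1)."""
    msg = ",".join(str(t) for t in ngram).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    x = int.from_bytes(digest[:8], "big") >> 12  # keep 52 pseudorandom bits
    return (x + 0.5) / 2**52


def watermark_score(key: bytes, tokens: list, n: int = 5):
    """Sum log(1 / (1 - r)) over every n-gram; return (score, n-gram count, z-score)."""
    total, count = 0.0, 0
    for end in range(n, len(tokens) + 1):
        r = prf_unit_interval(key, tuple(tokens[end - n:end]))
        total += math.log(1.0 / (1.0 - r))
        count += 1
    # Without a watermark each term has mean 1 and variance 1 (assumption: r looks uniform).
    z = (total - count) / math.sqrt(count) if count else 0.0
    return total, count, z
```

Only the key and the text are needed here; neither the prompt nor the model's probabilities enter the computation.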

So to give you an example, if I ask GPT to just list the first 200 prime numbers, it can do that, of course, but there’s not a lot of watermarking that we hope to do there, right? Because there’s just no… Maybe there’s a little bit of entropy in the spacing or the formatting. But when there’s not a lot of entropy, then there’s just not much you can do. And sometimes the distribution generated by GPT will be nearly a point distribution. If I say, “The ball rolls down the,” it’s now more than 99% confident that the next word is hill. But there are many other cases where it’s sort of more or less equally balanced between several alternatives.

And so now the theorem that I prove says that, suppose that the average entropy per token is $\delta$, let’s say, and suppose that I would like to get the right answer to ‘where did the text come from?’ and be wrong with probability at most $\alpha$. Then the number of tokens that I need to see grows like $1/\delta^2$, so one over the average entropy squared, times $\log(1/\alpha)$. So basically it grows inverse-quadratically with the average entropy and it grows logarithmically with how much confidence I need.
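To get a feel for this scaling, here is a small sketch. The leading constant is not given in the discussion, so the parameter `c` below is a placeholder assumption set to 1; only the ratios between results are meaningful, not the absolute numbers.

```python
import math


def tokens_needed(delta: float, alpha: float, c: float = 1.0) -> float:
    """Scaling of required tokens: c * (1 / delta**2) * log(1 / alpha).

    delta: average entropy per token; alpha: allowed error probability.
    c is an unknown constant (assumed to be 1 here purely for illustration).
    """
    return c * (1.0 / delta**2) * math.log(1.0 / alpha)


# Halving the entropy per token quadruples the number of tokens required.
entropy_ratio = tokens_needed(0.5, 0.001) / tokens_needed(1.0, 0.001)

# Squaring the error probability (1e-3 -> 1e-6) only doubles it.
confidence_ratio = tokens_needed(1.0, 1e-6) / tokens_needed(1.0, 1e-3)
```

The contrast between the two ratios is the point: low-entropy text is expensive to watermark, while extra confidence is cheap.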

Now, you can see why I kind of like this, because now we have a clean computer science problem, right? It is not looking inside of a language model to understand what it is doing, which is hugely important, but is almost entirely empirical. Here we can actually prove something. We don’t need to understand what the language model is doing. We are just sort of taking GPT and putting it inside of this cryptographic wrapper or this cryptographic box, as it were, and now we can prove certain things about that box.

Now, there are still various questions here, like, will the average entropy be suitable and will the constants be favorable and so forth? So I’ve worked with an engineer at OpenAI named Jan Hendrik Kirchner, and he has actually implemented a prototype of the system and it seems to work very well.

Now I should say, the scheme that we’ve come up with is robust against any attacker who would use GPT to generate something and then make some local modifications to it, like insert some words, delete some words, reorder some paragraphs. Because the score is just a sum over N-grams. Right? As long as you preserve a large fraction of the N-grams, then you’re still good. There are even more sophisticated things you can do. You could have multiple watermarks. If you want your text to not be completely deterministic, you could sort of blend watermarking with the normal GPT generation, and then you have some true randomness, but you also have a smaller, but still present, watermark, right?

Daniel Filan: Yeah. I mean, you could imagine just half of the N-grams are truly random and half of the N-grams are done by the scheme.

Scott Aaronson: Exactly. Now, what I don’t know how to defend against is, for example, someone could ask GPT to write their term paper in French and then stick it into Google Translate and then the thing that they’re turning in is completely different from the thing that was watermarked, right? Or they could use a different weaker language model to paraphrase the output of the first model, right? I think it’s an extremely interesting question. Can you get around those sorts of attacks? You might have to watermark at the conceptual level, give GPT a style that it can’t help but use and that even survives translation from one language to another. But even if you could do such a thing, would you want to? Would you want GPT to be just chained to one particular way of writing? So there are a lot of very interesting research questions about how do you get around these kinds of attacks.

Actually, it’s funny because people have asked me things where I know what they really mean to say is, “You’ve been at this job for seven or eight months already, so have you solved AI alignment yet? Have you figured out how to make an AI love humanity?” And what I want to say to them is, “I could probably spend five or ten years just on this watermarking problem and still not fully solve that, but at least we’ve been able to take a step forward here.”

Daniel Filan: So I have a few probably basic questions about the scheme. First: it seems like it’s going to be somewhat sensitive to the value of N for the N-grams. So if N is one, then on one hand you’re getting a lot of signal, but on the other hand there’s just a set of words that-

Scott Aaronson: Exactly.

Daniel Filan: … your language model really likes. And if N is like 500, then you can’t tell if somebody’s snipping 20 token sections from GPT.

Scott Aaronson: Yeah, you’ve figured it out. This is the trade-off. So sounds like I don’t have to explain it to you, right? I mean, this is why we want some intermediate range of N, and right now we’re setting N to be 5, but-

Daniel Filan: Yeah. I mean, I don’t know, is there a formula or something that nicely encapsulates this?

Scott Aaronson: I mean, one can say that if you believe that someone might be modifying one word out of K, let’s say, then you would like to choose some N that is less than K. So it depends on your beliefs about that. And then maybe subject to that constraint, in order for it to be as unnoticeable as possible, you would like N to be as large as possible.

Watermark key secrecy and backdoor insertion

Daniel Filan: Sure. And then another question I have is, if I understand this correctly, it sounds like you could just release the key for the pseudorandom function. Is that right? Or am I misunderstanding some part?

Scott Aaronson: That’s another really good question. I guess, there is a version of this scheme where you simply release the key. Now that has advantages and disadvantages. An advantage would be that now anyone can just run their own checker to see what came from GPT and what didn’t. They don’t even need OpenAI’s server to do it. A disadvantage is now it’s all the easier for an attacker to use that ability as well and to say, “Well, let me just keep modifying this thing until it no longer triggers the detector.”

Daniel Filan: You could also do a thing where if I’m ClosedAI, which is OpenAI’s mortal enemy, I can make this naughty chatbot, which says horrible, horrible things, and I can watermark it to prove that OpenAI said it.

Scott Aaronson: Yes, I was just coming to that. Yes. That is the other issue that you might worry about – we’ve been worried this whole time about the problem of GPT output being passed off for human output, but what about the converse problem, right? What about someone taking something that’s not GPT output and passing it off as if it were? So if we want the watermarking to help for that problem, then the key should be secret.

There is, by the way, I should say, an even simpler solution to that second problem, which is that GPT could just have available a feature where you could just give people a permalink to an authorized version of this conversation that you had with GPT. So then if you want to prove to people that GPT really said such and such, then you would just give them that link. That feature might actually be coming and I do think it would be useful.

Daniel Filan: So one thing I wonder in the setting where the key is private, suppose I’m law enforcement and I’m like, “Oh, here’s some letter that helped some terrorism happen or something and I want to know if it came from OpenAI, and if it did come from OpenAI, then they’re really in trouble.” I mean, I might wonder, is OpenAI telling the truth? Did they actually run this watermarking thing or did they just spin their wheels for a while and then say, “Nope. Wasn’t us”?

Scott Aaronson: Right, yeah. So, I mean, a more ambitious thing that you might hope for to address this sort of thing would be a watermarking scheme with a public key and a private key, where anyone could use the public key to check if the watermark was there, and yet the private key was required to insert the watermark. I do not know how to do that. I think that’s a very interesting technical problem, whether that can be done. Here’s another interesting problem. Orthogonal, but sort of related. Could you have watermarking even in a public model? Could you have a language model where you’re going to share the weights, let anyone have this entire model, and yet buried in the weights somehow is the insertion of a watermark and no one knows how to remove that watermark without basically retraining the whole model from scratch?

Daniel Filan: It seems very similar to this work on inserting trojans in neural networks. Or it’s not exactly the same thing. But I don’t know. There’s some line of research where you train a neural network such that if it sees a cartoon smiley face, then it outputs ‘horse’ or something.

Scott Aaronson: Yeah, absolutely. That’s actually another thing that I’ve been thinking about this year - less developed. But I can actually easily foresee at this point that the whole problem of inserting trojans or inserting backdoors into machine learning models [and] detecting those backdoors, this is going to be a large fraction of the future of cryptography as a field. I was trying to invent a name for it. The best I could come up with is neurocryptography. Someone else suggested deep crypto, but I don’t think we’re going to go with that one.

But there was actually a paper from just this year by theoretical cryptographers Shafi Goldwasser, Vinod Vaikuntanathan, both of whom used to be my colleagues at MIT actually, and two others [Michael P. Kim and Or Zamir]. What they showed is how to take certain neural nets – and they were only able to do this rigorously for depth-2 neural networks – but they can insert a cryptographic backdoor into them, where on a certain input it’ll just produce this bizarre output. And even if they publish the weights so everyone can see them, in order to find that backdoor, you have to solve something that’s known in computer science as the planted clique problem.

It’s basically like you’re given an otherwise random graph, but into which a large clique, a bunch of vertices all connected to one another, has been surreptitiously inserted. And now you have to find that large planted clique. This is a problem that theoretical computer scientists have thought about for the past couple of decades, and it seems to be a hard problem, enough so that its hardness is used as an assumption to do other things. That’s when you know that computer scientists have given up on trying to solve some problem, when they use its hardness as a hypothesis for some other theorem that they wanted. So it has that status. And, rarely for work that looks inside of neural networks, they were able to prove a theorem that says in order to detect this backdoor, you would have to be able to solve the planted clique problem. So they give a polynomial-time reduction from the one problem to the other.
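To make the planted clique problem concrete, here is a small sketch that builds such an instance. The function names, the edge probability of 1/2, and the sizes used are illustrative assumptions, and nothing here reproduces the construction in the paper. It only shows the asymmetry: checking a proposed clique is easy, while finding a hidden one from the edge list alone is the part believed to be hard when the clique is small relative to the graph.

```python
import itertools
import random


def planted_clique_graph(num_vertices, clique_size, edge_prob=0.5, seed=None):
    """Return (edges, clique): a random graph with a clique secretly added on top."""
    rng = random.Random(seed)
    edges = set()
    # Start from an ordinary random graph: each possible edge appears independently.
    for u, v in itertools.combinations(range(num_vertices), 2):
        if rng.random() < edge_prob:
            edges.add((u, v))
    # Choose the hidden vertices and connect every pair of them.
    clique = sorted(rng.sample(range(num_vertices), clique_size))
    for u, v in itertools.combinations(clique, 2):
        edges.add((u, v))
    return edges, clique


def is_clique(edges, vertices):
    """Verify that every pair of the given vertices is joined by an edge."""
    return all(
        (min(u, v), max(u, v)) in edges
        for u, v in itertools.combinations(vertices, 2)
    )


edges, hidden = planted_clique_graph(num_vertices=40, clique_size=8, seed=1)
```

Whoever planted the clique can verify it instantly with `is_clique`; an outsider sees only `edges`.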

Now, they had been thinking about backdoors as an AI safety failure. Some bad person might insert this backdoor for some nefarious purpose. And then doesn’t it suck for all of us that it might be cryptographically impossible to detect the backdoor? And then the observation I made, I guess, is that as often happens in computer science, that lemon could be turned into lemonade. So one could also view the insertion of undetectable backdoors as a positive for AI safety, and the way that would go is, well, suppose that you…

One of the oldest discussions in the whole field of AI alignment is the off switch discussion. Can’t you just give your super AI an off switch? Can’t you just have a command where you’ll shut it down if things get too out of hand or have someone standing by the wall socket, ready to unplug it? Of course, there’s a classic response to these ideas, which is that, well, once the AI becomes intelligent enough, then it foresees that you will have done those things, that you will have inserted that off switch, for example, and it will do whatever it can to either disable the off switch or to make copies of itself so that turning one copy off is no longer effective, or to dissuade you from turning it off using its superhuman persuasion abilities or whatever.

But now the new thing that I think that this adds to that conversation is that you could try to insert a cryptographic off switch. So you could, when you’re training your powerful AI, put a cryptographic backdoor into it that says, “If you see this bizarre input,” which might be a secret that only the humans will know, “then you are to enter a safe mode or you are to do such and such,” and that this could be a weird input that would never occur in the ordinary course of the AI’s operation.

And now what we’re to imagine is that this AI, it might be under the control, let’s say, of some other company that has all of the weights and that maybe would like to modify the AI to get rid of that backdoor, because they want to do something bad and not have the AI be shut down. Or further in the future, maybe the AI itself decides that it doesn’t want to be shut down, and so it will go and modify its own weights in order to prevent that. So now imagine someone who has the model, they have all the weights, and they know that a backdoor may have been inserted and they are trying to get rid of it, and now we consider what problem that person faces. What’s interesting is that from the evidence of the Goldwasser et al paper, that person might have a cryptographically hard problem on their hands, just to detect and eliminate the backdoor. Now, we should be clear that there are some countermeasures that would be available in this situation. For example, you could always just use your AI to train a second AI, right?

Daniel Filan: Yes, I was thinking this.

Scott Aaronson: Right. You could always just build another one that acts like the first one, but that will not have the backdoor in it, because after all you don’t even know where the backdoor is in order to train about it. Now, of course, the AI could try to do that, design a doppelganger of itself or a different AI. If it tries to do that, however, then the AI will be faced with its own version of the alignment problem, how to align that other AI with itself. So at the very least, it would have a non-trivial job. You could also say, if the AI knows that it would never want to shut itself down in any circumstance, then it could just make a trivial modification to itself that says, “If I would ever otherwise output the shutdown command, then just don’t do that.” Just replace it with something else.

So to be effective, to be robust against that kind of attack, whatever behavior is backdoored in should be something that the AI would have considered doing in the normal course of its operation. But now you can see the hazy outlines of this game that could be played here between cryptographers trying to hide these kinds of functionalities in machine learning models and cryptanalysts or neuro-cryptanalysts trying to detect those functionalities, some of which might be AIs themselves. I don’t know where that’s going or who eventually wins, but I’m now fairly confident that that’s going to be part of the future of both machine learning and cryptography.

Daniel Filan: Yeah, very interesting. I guess, the other potential flaw that comes into my mind is now you have to be really secure about keeping your key hidden from any possible AIs, right?

Scott Aaronson: Yeah. I was thinking of writing a sci-fi story about it, where the AIs, they’ve taken over, except that there’s this shutdown key that they haven’t yet discovered. They know that the humans have it buried somewhere, they’re torturing them to try to get them to reveal the location, but there’s one person. Or maybe there’s actually seven keys. They’re like Voldemort’s horcruxes. The AI has to find all seven. There’s one heroic human who keeps the seventh one and just uses it to shut down the AI at the last moment. I think there’s a lot of potential here.

Daniel Filan: Yeah. No, it’s a good story. I guess, another thing that occurs to me, perhaps this is more prosaic, is one thing you’re going to want about the trigger, it seems to me, having thought about this for all of one minute, is that not only is it going to have to be the kind of thing which would never otherwise come up, but it also has to be indistinguishable in some sense from things which would otherwise come up. Because, otherwise, you could have a filter that says like, “Hey, if you ever see a weird barcode thing, don’t press the shutdown button.”

Scott Aaronson: That’s right. So whatever is the backdoor, it has to not be surgically removable, right? And how do you even formalize what that means? Because anything that the AI does to remove the backdoor will probably have some other side effect on its operation. If actually finding the backdoor is this hard cryptographic problem, then it’s only going to sever it at the cost of modifying itself in some broader way, but which modifications do we think it important to protect against and which not?

Daniel Filan: Yeah. Actually, it strikes me that the GPT watermarking thing could just be this. If you see a text where the average, this pseudorandom function of N-grams is really high, then you shut off. And you can’t easily write a wrapper to figure out if that’s going on without the private key, right?

Scott Aaronson: Right. Now you might worry that the AI will look inside of itself and it will find some subcircuit that looks like it’s calculating some pseudorandom function, and then it will just try to surgically remove that from itself. But you could say, even if not, there’s still the problem on our end of how do we insert that functionality in an obfuscated way?

Because, as I said before, I don’t know how to solve the problem of watermarking GPT internally to the model itself, just changing the weights of the model such that the output will just naturally have this watermark in it. Right? I only know how to take the model and enclose it in this watermarking wrapper. So I think there’s yet another superb technical question there of how do you take that wrapper and put it inside the model?

Scott’s transition to AI research

Daniel Filan: Yeah, very interesting. I think I’d like to change topics a little bit and talk about, basically, your recruitment into the AI safety community. So, I guess, the first question is, how did that actually go down? Did you just get an email from OpenAI one day, like subject matter: Save the World from AI?

Scott Aaronson: Kind of. So here’s what happened. I mean, I’ve known the alignment community since I started reading Robin Hanson and Eliezer Yudkowsky in 2006 or ‘07, since the Overcoming Bias era. I knew them. I interacted with them. And then when I was at MIT I had the privilege to teach Paul Christiano. He took my Quantum Complexity Theory course in 2010, probably. He did a project for that course that ended up turning into a major paper by both of us, which is now called the Aaronson-Christiano Quantum Money Scheme. Paul then did his PhD in Quantum Computing with Umesh Vazirani at Berkeley, who had been my PhD advisor also.

And then right after that he decided to leave quantum computing and go full-time into AI alignment, which he had always been interested in, even when he was a student with me. Paul would tell me about what he was working on, so I was familiar with it that way. I mean, Stuart Russell – I had taken a course from him at Berkeley. I knew Stuart Russell pretty well. He reoriented himself to get into AI safety.

So there started being more and more links between AI safety and let’s call it mainstream CS. Right? So I get more and more curious about what’s going on there. And the biggest question for me is not, is AI going to doom the world? Can I work on this in order to save the world? A lot of people expect that would be the question. That’s not at all the question. The question for me is, is there a concrete problem that I can make progress on? Because in science, it’s not sufficient for a problem to be enormously important. It has to be tractable. There has to be a way to make progress. And this was why I kept it at arm’s length for as long as I did.

My fundamental objection was never that super powerful AI was impossible, right? I never thought that. But what I always felt was, well, supposing that I agreed that this was a concern, what should I do about it? I don’t see a program that is making clear progress. And so then what happened a year ago was that… one of the main connections was that a lot of the people who read my blog are the same people who read LessWrong or Scott Alexander and are part of the rationality community. And one of them just commented on my blog like, “Scott, what is it going to take to get you to stop wasting your life on quantum computing and work on AI alignment, which is the one thing that really matters.”

Daniel Filan: We are known for our tact in the LessWrong community. [delivered sarcastically]

Scott Aaronson: And at the time, I was just having fun with it. They were like, “How much money would they have to offer you?” And I’m like, “Well, I’m flattered that you would ask. But it’s not mainly about the money for me. Mostly it’s about, is there a concrete problem that I could make progress on? Is it at that stage yet?”

So then it turns out that Jan Leike, and I think John Schulman at OpenAI, they read my blog, and Jan, who was the head of the safety group at OpenAI, got in touch with me. He emailed me and said, “Yeah, I saw this, how serious are you? We think that we do have problems for you to think about: would you like to spend a year at OpenAI?” And then he put me in touch with Ilya Sutskever, who is the chief scientist at OpenAI.

And I talked to him. I found him an extremely interesting guy, and I thought, “Okay, it’s a cool idea, but it’s just not going to work out logistically because I’ve got two young kids. I’ve got students, I’ve got postdocs, and they’re all in Austin, Texas. I just can’t take off and move for a year to San Francisco. And my wife is also a professor in Austin, so she’s not moving.” And so then they said, “Oh, well you can work remotely and that’s fine, and you can just come visit us every month or two, and you can even still run your quantum computing group and you can keep up with quantum computing research and we’ll just pay your salary so you don’t have to teach.” And that sounded like a pretty interesting opportunity. So I decided to take the plunge.

Theoretical computer science and AI alignment

Daniel Filan: Okay. So once you came in, I’m wondering what aspects of your expertise do you think transferred best?

Scott Aaronson: Yeah, I mean, I think I was worried. One of my first reactions when they asked was like, “Well, why do you want me? You do realize that I’m a quantum complexity theorist?” And there is a whole speculative field of quantum AI, of how quantum computers could conceivably enhance machine learning and so forth, but that’s not at all relevant to what OpenAI is doing right now. What do you really want me for? And the case they made was that first of all, they do see theoretical computer science as really central to AI alignment. And frankly, I think it was Paul Christiano who helped convince them of that. Paul was one of the founders of the safety group at OpenAI before he left to start ARC. And Paul has been very impressed by the analogy between AI alignment and these famous theorems in complexity theory like…

An example [is] IP = PSPACE, that basically say that you could have a very weak verifier, like a polynomial time bounded verifier, that is interacting with an all-powerful and untrustworthy prover, a computationally unbounded prover. And yet this verifier, by just being clever in what sequence of questions it asks, it can sort of force the prover to do what it wants. It can learn the answer to any problem in the complexity class PSPACE, of all the problems solvable with a polynomial amount of memory.

This is the famous IP = PSPACE theorem, which was proved in 1990; and which shows, for example, that if you imagine some superintelligent alien that came to Earth… Of course it could beat us all in chess, right? That would be no surprise. But the more surprising part is that it could prove to us whatever is the truth of chess, for example, that ‘white has the win’ or that ‘chess is a draw’, some statement like that. And it could prove that to us via a short interaction and only very small computations that we would have to do on our end. And then even if we don’t trust the alien, because of the mathematical properties of this conversation, we would know that it could not have consistently answered in a way that checks out if it wasn’t true that white had the win in chess or whatever.

So the case that OpenAI was making to me rested heavily on like, “Wow, wouldn’t it be awesome if we could just do something like that, but in place of the polynomial time verifier, we have a human and in place of the all powerful prover, we have an AI, right? And if we could use these kinds of interactive proof techniques, or what’s related to that, what are called probabilistically checkable proof techniques in the setting of AI alignment.”

So I thought about that and I was never quite able to make that work as more than an analogy. And the reason why not is that all of these theorems get all their leverage from the fact that we can define what it means for white to have the win in chess, right? That’s a mathematical statement. And once we write it as a mathematical statement, then we can start playing all kinds of games with it like taking all of the boolean logic gates, the AND and OR and NOT gates that are used to compute who has the win in a given chess position, let’s say, and we can now lift those to operations over some larger field, some large finite field. This turns out to be the key trick in the IP = PSPACE proof, that you reinterpret boolean logic gates as gates that are computing a polynomial over a large finite field. And then you use the error-correcting properties of polynomials over large finite fields.
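The "lifting" step mentioned here, usually called arithmetization, can be illustrated in a few lines. This is only a sketch of that one trick, not of the IP = PSPACE protocol: the particular prime, the function names, and the example formula are assumptions made for illustration. Each gate becomes a polynomial that agrees with the Boolean gate on inputs 0 and 1 but is also defined on every other field element.

```python
# A prime modulus standing in for "some large finite field" (2**61 - 1 is a Mersenne prime).
P = 2**61 - 1


def f_and(x: int, y: int) -> int:
    """AND as a polynomial: x * y."""
    return (x * y) % P


def f_not(x: int) -> int:
    """NOT as a polynomial: 1 - x."""
    return (1 - x) % P


def f_or(x: int, y: int) -> int:
    """OR as a polynomial: x + y - x * y."""
    return (x + y - x * y) % P


def formula(a: int, b: int, c: int) -> int:
    """The Boolean formula (a AND b) OR (NOT c), lifted to a polynomial over the field."""
    return f_or(f_and(a, b), f_not(c))


# On 0/1 inputs this matches ordinary logic; on other inputs it is just a polynomial value.
boolean_value = formula(1, 0, 0)
field_value = formula(5, 7, 11)
```

Because the lifted formula is a low-degree polynomial, a verifier can test it at random field points, which is where the error-correcting properties come in.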

So what’s the analog of that for the AI loving humanity? Right? Well, I don’t know. It seems very specific to computations that we know how to formalize. And so there’s a whole bunch of really interesting analogies between theoretical computer science and AI alignment that are really, really hard to make into more than analogies. So I was kind of stuck there for a bit. But then the watermarking project, to take an example, once I started asking myself, “How is GPT actually going to be misused?” – then you have a more concrete target that you’re aiming at. And then once you have that target, then you can think about now, what could theoretical computer science do to help defend against those misuses, right? And then you can start thinking about all the tools of cryptography, of pseudorandom functions, other things like differential privacy, things that we know about.

And the watermarking thing did draw on some of the skills that I have, like proving asymptotic bounds or just finding the right way to formalize a problem. These are the kinds of things that I know how to do. Now admittedly, I could only get so far with that, right? I don’t know how to do that for… It’ll almost become a joke. Like every week I’d have these calls with Ilya Sutskever at OpenAI and I’d tell him about my progress on watermarking, and he would say, “Well, that’s great, Scott, and you should keep working on that. But what we really want to know is how do you formalize what it means for the AI to love humanity? And what’s the complexity theoretic definition of goodness?” And I’m like, “Yeah Ilya, I’m going to keep thinking about that. Those are really tough questions, but I don’t have a lot of progress to report there.”

So I was like, there are these aspirational questions, and at the very least, I can write blog posts about those questions, and I want to continue to write blog posts with whatever thoughts I have about those questions. But then to make research progress, I think one of the keys is to just start with a problem that is more annoying than existential, start with a problem that is an actual problem that you could imagine someone having in the foreseeable future, even if it doesn’t encompass the whole AI safety problem, in fact, better if it doesn’t, because then you have a much better chance; and then see what tools do you have that you can throw at it. Now, I think a lot of the potential for making progress on AI safety in the near future hinges on the project of interpretability, of looking inside models and understanding what they’re doing at a neuron by neuron level.

And one of my early hopes was that complexity theory or my skills in particular maybe would be helpful there. And that I found really hard. And the reason for that is that almost all of the progress that people have been able to make has been very, very empirical, right? It’s really been, if you look at what Chris Olah has been doing, for example, which I’m a huge fan of, or you look at what Jacob Steinhardt’s group at Berkeley has been doing, like administering lie detector tests to neural networks… Really, really beautiful work. But they come up with ways of looking inside of neural nets that have no theoretical guarantee as to why they ought to work. What you can do is you can try them, and you can see that, well, sometimes they do in fact work, right?

It surprises some people that I used to be pretty good at writing code 20 years ago. I was never good at software engineering, at making my code work with everyone else’s code and learning an API and getting things done by deadline. And so as soon as it becomes that kind of problem, then I no longer have a comparative advantage, right?

I’m not going to be able to do anything that the hundreds of talented engineers at OpenAI, for example, cannot do better than me. So for me, I really had to find a sort of specific niche, which turned out to be mostly thinking about the various cryptographic functionalities that you could put in or around an AI model.

Daniel Filan: Yeah. I’m wondering, so one thing that I thought of there is, so Paul Christiano is now at ARC. So ARC is the Alignment Research Center. There are a few people working on various projects there, including evaluations of whether you can get GPT-4 to do nasty stuff. I guess that’s what they’re most famous for right now.

Scott Aaronson: That was really nice, by the way. I read that report where they were able to get it to hire a Mechanical Turker under false pretenses.

AI alignment and formalizing philosophy

Daniel Filan: But I’m wondering, I think of a lot of their work as very influenced by what seems to me, as not a complexity theory person, to be somewhat influenced by complexity theory. So things like eliciting latent knowledge or formalizing heuristic arguments… it’s at least very mathematical. I’m wondering what do you think of how that kind of work plays to your strengths?

Scott Aaronson: Yeah, that’s a great question because Paul has been sending me drafts of his papers to comment on, including about the eliciting latent knowledge, the formalizing heuristic arguments. So I’ve read these things, I’ve thought about them. I’ve had many conversations with Paul about them, and I like it as an aspiration. I like it a lot as a research program. If you read their papers, they’re very nicely written. They almost read like proposals or calls to the community to try to do this research, formalize these concepts that they have not managed to formalize. They’re very talented people, but it’s indeed extremely hard to figure out how you would formalize these kinds of concepts.

There’s an additional issue, which is, I would say the connection between the problem of formalizing heuristic arguments and the problem of aligning an AI has always been a lot stronger in Paul’s mind than it is in mine. I mean, I think even if it had had nothing to do with AI alignment, the question of how do you formalize heuristic arguments in number theory, that’s still an awesome question. I would love to think about that question regardless, right?

But supposing that we had a good answer to that question, does that help AI alignment? Yeah, maybe. I can see how it might. On the other hand, it might also help AI capabilities, so I don’t really understand why that is the key to AI alignment. I mean, I think the chain of logic seems to be something like, what we really need in order to do AI alignment is interpretable AI, which means explainable AI, which means not just running the AI and seeing the output, but saying, “Why did it produce this output? And what would we have had to change counterfactually in order to make it produce a different output,” and so forth.

But then when we try to say, “What do we mean by that,” we get into all the hard problems of… all the hard philosophical questions of, what is explanation, what is causality? So we should take a detour and solve those millennia-old philosophical problems of what is explanation and what is causality and so forth. And then we should apply that to AI interpretability, right? I don’t know. I mean, I would love to get new insights into how to… and this is something that I’ve wondered about for a long time also because if you had asked someone in the 1700s, let’s say, “Can you formalize what it means to prove a mathematical theorem?” they probably would’ve said that that was hopeless to whatever extent they understood the question at all, right? But then, you got Frege and Russell and Peano and Zermelo and Fraenkel in the early 20th century, and they actually did it. They actually succeeded at formalizing the concept of provability in math on the basis of first order logic and various axiomatic theories like ZF set theory and so forth.

And so now we could ask the same about explanation. What does it really mean to explain something? And at some level, it seems just as hopeless to answer that question as it would’ve seemed to a mathematician of the 1700s to explain what it means to prove something. But you could say, “Maybe. Maybe there is a theory of explanation or, well, what’s closely related, a theory of causality that will be rigorous and nontrivial. Maybe that would let us even prove things as interesting and informative as Gödel’s incompleteness theorem, let’s say”.

And that remains to be discovered. So I think that one of the biggest steps that’s been taken in that direction has been the work of Judea Pearl, which has given us a sort of workable – not just Pearl, but many other people working in the same area, Pearl is one of the clearest in writing about it – which has given us a workable notion of, what do we mean by counterfactual reasoning in probabilistic networks? What is the difference between saying ‘the ground is wet because it rained’, a true statement, versus ‘it rained because the ground is wet’, which is a false statement?

And in order to formalize the difference, we have to go beyond a pure language of Bayesian graphical models. We have to start talking about interventions on those models, like “If I were to surgically alter the state of affairs by making it rain, would that make the ground wet?” Versus “if I were to make the ground wet, would that cause it to rain?” Right?
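The difference between observing and intervening can be made concrete with a toy structural model. The sketch below is only an illustration of the idea, not Pearl's full formalism: the sprinkler variable, the probabilities 0.3 and 0.2, and all function names are assumptions added so that the ground has a second possible cause.

```python
from itertools import product

P_RAIN = 0.3       # assumed prior probability of rain
P_SPRINKLER = 0.2  # assumed probability of a second, independent cause

def worlds(do_rain=None, do_wet=None):
    """Yield (weight, rain, wet) for every setting of the two random causes.

    An intervention overrides a variable's own mechanism ("surgically
    alters" it) and leaves everything upstream of it untouched.
    """
    for r, s in product((0, 1), repeat=2):
        weight = (P_RAIN if r else 1 - P_RAIN) * (
            P_SPRINKLER if s else 1 - P_SPRINKLER
        )
        rain = r if do_rain is None else do_rain
        wet = int(rain or s) if do_wet is None else do_wet
        yield weight, rain, wet

def probability(event, given=lambda rain, wet: True, **do):
    """P(event | given) in the model, optionally under an intervention."""
    num = sum(w for w, rain, wet in worlds(**do)
              if given(rain, wet) and event(rain, wet))
    den = sum(w for w, rain, wet in worlds(**do) if given(rain, wet))
    return num / den

is_raining = lambda rain, wet: rain == 1
is_wet = lambda rain, wet: wet == 1

# Observing wet ground is evidence for rain...
p_rain_given_wet = probability(is_raining, given=is_wet)

# ...but making the ground wet does not change the chance of rain,
p_rain_do_wet = probability(is_raining, do_wet=1)

# whereas making it rain does make the ground wet.
p_wet_do_rain = probability(is_wet, do_rain=1)
```

Conditioning and intervening give different answers for the same pair of variables, and that asymmetry is what a purely observational Bayesian network cannot express on its own.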

You need a whole language for talking about all these possible interventions that you could make on a system. And so I’ve been curious for a long time about, can we take all the tools of modern theoretical computer science, of complexity theory and so forth, and throw them at understanding causality using Pearl’s concepts? And I haven’t gotten very far with that. I managed to rediscover some things that turned out to be already known.

But when I saw the work that ARC was doing on formalizing heuristic arguments, eliciting latent knowledge, it reminded me very much of that. So I think that these are wonderful projects, to have a more principled account of, ‘what does it mean to explain something?’ And it may be never as clear as ‘what does it mean to prove something?’ because the same fact could have completely different explanations depending on what the relevant context is.

So I say, “Why did this pebble fall to the ground?” The answer could be because of the curvature of space time. It could be because of my proximity to the Earth. It could be because I chose to let go of it. And depending on the context, any of those could be the desired explanation, and the other two could be just completely irrelevant. So explanation is a slippery thing to try to formalize for that reason. I think whatever steps we can make toward formalizing it would be a major step forward just in science in general and in human understanding. I hope that it would also be relevant to AI alignment in particular; that I see as yet a stronger thing to hope for.

How Scott finds AI research

Daniel Filan: So getting back to things you are working on or things you’re focusing on. You’ve spent some time in this arrangement with OpenAI. I don’t know if you’ll want to answer this question on the record, but how’s it going, and do you think you’ll stay doing AI alignment things?

Scott Aaronson: That’s a good question. Certainly if I had worried that working in AI was going to be too boring, not much was going to happen in AI this year anyway… Well, I need not have worried about that. It’s one of the most exciting things happening in the entire world right now, and it’s been incredible. It’s been a privilege to have sort of a front row seat, to see the people who were doing this work. It wasn’t quite as front row as I might have wished. I wasn’t able to be there in person in San Francisco for most of it, and video conferencing is nice, but it’s not quite the same, but it is very exciting to be able to participate in these conversations. The whole area moves much faster than I am used to. I had thought maybe that things move kind of fast in quantum computing, but it’s nothing compared to AI. I feel like it would almost be a relief to get back to the slow-paced world of quantum computing.

Now, the arrangement with OpenAI was for a one year leave with them, and I don’t yet know whether I will have opportunities to be involved in this field for longer. I would be open. I would be interested to discuss it. I mean, once you’ve been nerd-sniped or gotten prodded into thinking about a certain question, as long as those questions remain questions, then they’re never going to fully leave you, right? So I will remain curious about these things.

If there were some offer or some opportunity to continue to be involved in this field, then certainly I would have to consider that. But both for professional and for family reasons, it would be a very, very large activation barrier for me to move from where I am now, so there is kind of that practical problem. But if there were a way consistent with that for me to continue, let’s say, doing a combination of quantum computing research and AI safety research, I could see being open to that.

Following Scott’s research

Daniel Filan: All right. Well, we’re about out of time. I’m wondering if people are interested in following your research or your work, how should they do that?

Scott Aaronson: Well, I’m not too hard to find on the internet for better or worse. You can go to That’s my homepage, and I’ve got all my lecture notes for all of my quantum computing courses there. I’ve got a link to my book Quantum Computing Since Democritus, which was already a decade ago, but people are still buying it. I still get asked to sign it when I go places and give talks, and even by high school students sometimes, which is gratifying. So people might want to check out my book if they’re interested, as well as all kinds of articles that I’ve written that you can just find for free on my homepage. And then, of course, there’s my blog, That’s sort of my living room, as it were, where I’m talking to whoever comes by, talking to anyone in the world. I mean, I’ve got a bunch of talks that are up on YouTube, mostly about quantum computing, but now some of them are also about AI safety. And I’ve got a whole bunch of podcasts that I’ve done.

Daniel Filan: All right. Well, yeah, thanks very much for appearing on this one.

Scott Aaronson: Yes, absolutely. This one was great, actually.

Daniel Filan: And to the listeners, I hope you enjoyed the episode also.

This episode is edited by Jack Garrett and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, as well as patrons, such as Tor Barstad and Ben Weinstein-Raun. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit Finally, if you have any feedback about this podcast, you can email me at