A Principled AI Discussion in Asilomar

The Asilomar Conference took place against a backdrop of growing interest from wider society in the potential of artificial intelligence (AI), and a sense that those playing a part in its development have a responsibility and opportunity to shape it for the best. The purpose of the Conference, which brought together leaders from academia and industry, was to discern the AI community’s shared vision for AI, should it exist.

In planning the meeting, we reviewed reports on the opportunities and threats created by AI and compiled a long list of the diverse views held on how the technology should be managed. We then attempted to distill this list into a set of principles by identifying areas of overlap and potential simplification. Before the conference, we extensively surveyed attendees, gathering suggestions for improvements and additional principles. These responses were folded into a significantly revised list for use at the meeting.

In Asilomar, we gathered further feedback in two stages. To begin with, small breakout groups discussed the principles and produced detailed feedback. This process generated several new principles, improved versions of the existing principles and, in several cases, multiple competing versions of a single principle. Finally, we surveyed the full set of attendees to determine the level of support for each version of each principle.

The final list consisted of 23 principles, each of which received support from at least 90% of the conference participants. These “Asilomar Principles” have since become one of the most influential sets of governance principles, and serve to guide our work on AI.

To start the discussion, here are some of the things other AI researchers who signed the Principles had to say about them.

Value Alignment: Highly autonomous AI systems should be designed so that their goals and behaviors can be assured to align with human values throughout their operation.
“Value alignment is a big one. Robots aren’t going to try to revolt against humanity, but they’ll just try to optimize whatever we tell them to do. So we need to make sure to tell them to optimize for the world we actually want.”

-Anca Dragan, Assistant Professor in the EECS Department at UC Berkeley, and co-PI for the Center for Human Compatible AI
Read her complete interview here.

Shared Prosperity: The economic prosperity created by AI should be shared broadly, to benefit all of humanity.
“I consider that one of the greatest dangers is that people either deal with AI in an irresponsible way or maliciously — I mean for their personal gain. And by having a more egalitarian society, throughout the world, I think we can reduce those dangers. In a society where there’s a lot of violence, a lot of inequality, the risk of misusing AI or having people use it irresponsibly in general is much greater. Making AI beneficial for all is very central to the safety question.”

-Yoshua Bengio, Professor of CSOR at the University of Montreal, and head of the Montreal Institute for Learning Algorithms (MILA)
Read his complete interview here.

Importance: Advanced AI could represent a profound change in the history of life on Earth, and should be planned for and managed with commensurate care and resources.
“I believe that AI will create profound change even before it is ‘advanced’ and thus we need to plan and manage growth of the technology. As humans we are not good at long-term planning because our civil systems don’t encourage it, however, this is an area in which we must develop our abilities to ensure a responsible and beneficial partnership between man and machine.”

-Kay Firth-Butterfield, Executive Director of AI-Austin.org, and an adjunct Professor of Law at the University of Texas at Austin
Read her complete interview here.

Personal Privacy: People should have the right to access, manage and control the data they generate, given AI systems’ power to analyze and utilize that data.
“It’s absolutely crucial that individuals should have the right to manage access to the data they generate… AI does open new insight to individuals and institutions. It creates a persona for the individual or institution – personality traits, emotional make-up, lots of the things we learn when we meet each other. AI will do that too and it’s very personal. I want to control how persona is created. A persona is a fundamental right.”

-Guruduth Banavar, VP, IBM Research, Chief Science Officer, Cognitive Computing
Read his complete interview here.

Value Alignment: Highly autonomous AI systems should be designed so that their goals and behaviors can be assured to align with human values throughout their operation.
“The one closest to my heart. … AI systems should behave in a way that is aligned with human values. But actually, I would be even more general than what you’ve written in this principle. Because this principle has to do not only with autonomous AI systems, but I think this is very important and essential also for systems that work tightly with humans in the loop, and also where the human is the final decision maker. Because when you have human and machine tightly working together, you want this to be a real team. So you want the human to be really sure that the AI system works with values aligned to that person. It takes a lot of discussion to understand those values.”

-Francesca Rossi, Research scientist at the IBM T.J. Watson Research Centre, and a professor of computer science at the University of Padova, Italy, currently on leave
Read her complete interview here.

AI Arms Race: An arms race in lethal autonomous weapons should be avoided.
“One reason that I got involved in these discussions is that there are some topics I think are very relevant today, and one of them is the arms race that’s happening amongst militaries around the world already, today. This is going to be very destabilizing. It’s going to upset the current world order when people get their hands on these sorts of technologies. It’s actually stupid AI that they’re going to be fielding in this arms race to begin with and that’s actually quite worrying – that it’s technologies that aren’t going to be able to distinguish between combatants and civilians, and aren’t able to act in accordance with international humanitarian law, and will be used by despots and terrorists and hacked to behave in ways that are completely undesirable. And that’s something that’s happening today. You have to see the recent segment on 60 Minutes to see the terrifying swarms of robot UAVs that the American military is now experimenting with.”

-Toby Walsh, Guest Professor at Technical University of Berlin, Professor of Artificial Intelligence at the University of New South Wales, and leads the Algorithmic Decision Theory group at Data61, Australia’s Centre of Excellence for ICT Research
Read his complete interview here.

AI Arms Race: An arms race in lethal autonomous weapons should be avoided.
“I’m not a fan of wars, and I think it could be extremely dangerous. Obviously I think that the technology has a huge potential, and even just with the capabilities we have today it’s not hard to imagine how it could be used in very harmful ways. I don’t want my contributions to the field and any kind of techniques that we’re all developing to do harm to other humans or to develop weapons or to start wars or to be even more deadly than what we already have.”

-Stefano Ermon, Assistant Professor in the Department of Computer Science at Stanford University, where he is affiliated with the Artificial Intelligence Laboratory
Read his complete interview here.

Capability Caution: There being no consensus, we should avoid strong assumptions regarding upper limits on future AI capabilities.
“I agree! As a scientist, I’m against making strong or unjustified assumptions about anything, so of course I agree. Yet this principle bothers me … because it seems to be implicitly saying that there is an immediate danger that AI is going to become superhumanly, generally intelligent very soon, and we need to worry about this issue. This assertion … concerns me because I think it’s a distraction from what are likely to be much bigger, more important, more near term, potentially devastating problems. I’m much more worried about job loss and the need for some kind of guaranteed health-care, education and basic income than I am about Skynet. And I’m much more worried about some terrorist taking an AI system and trying to program it to kill all Americans than I am about an AI system suddenly waking up and deciding that it should do that on its own.”

-Dan Weld, Professor of Computer Science & Engineering and Entrepreneurial Faculty Fellow at the University of Washington
Read his complete interview here.

Capability Caution: There being no consensus, we should avoid strong assumptions regarding upper limits on future AI capabilities.
“In many areas of computer science, such as complexity or cryptography, the default assumption is that we deal with the worst case scenario. Similarly, in AI Safety, we should assume that AI will become maximally capable and prepare accordingly. If we are wrong, we will still be in great shape.”

-Roman Yampolskiy, Associate Professor of CECS at the University of Louisville, and founding director of the Cyber Security Lab
Read his complete interview here.

Daniela and Dario Amodei on Anthropic

  • Anthropic’s mission and research strategy
  • Recent research and papers by Anthropic
  • Anthropic’s structure as a “public benefit corporation”
  • Career opportunities


Watch the video version of this episode here

Careers at Anthropic

Anthropic’s Transformer Circuits research 

Follow Anthropic on Twitter

microCOVID Project

Follow Lucas on Twitter here

0:00 Intro

2:44 What was the intention behind forming Anthropic?

6:28 Do the founders of Anthropic share a similar view on AI?

7:55 What is Anthropic’s focused research bet?

11:10 Does AI existential safety fit into Anthropic’s work and thinking?

14:14 Examples of AI models today that have properties relevant to future AI existential safety

16:12 Why work on large scale models?

20:02 What does it mean for a model to lie?

22:44 Safety concerns around the open-endedness of large models

29:01 How does safety work fit into race dynamics to more and more powerful AI?

36:16 Anthropic’s mission and how it fits into AI alignment

38:40 Why explore large models for AI safety and scaling to more intelligent systems?

43:24 Is Anthropic’s research strategy a form of prosaic alignment?

46:22 Anthropic’s recent research and papers

49:52 How difficult is it to interpret current AI models?

52:40 Anthropic’s research on alignment and societal impact

55:35 Why did you decide to release tools and videos alongside your interpretability research?

1:01:04 What is it like working with your sibling?

1:05:33 Inspiration around creating Anthropic

1:12:40 Is there an upward bound on capability gains from scaling current models?

1:18:00 Why is it unlikely that continuously increasing the number of parameters on models will lead to AGI?

1:21:10 Bootstrapping models

1:22:26 How does Anthropic see itself as positioned in the AI safety space?

1:25:35 What does being a public benefit corporation mean for Anthropic?

1:30:55 Anthropic’s perspective on windfall profits from powerful AI systems

1:34:07 Issues with current AI systems and their relationship with long-term safety concerns

1:39:30 Anthropic’s plan to communicate its work to technical researchers and policy makers

1:41:28 AI evaluations and monitoring

1:42:50 AI governance

1:45:12 Careers at Anthropic

1:48:30 What it’s like working at Anthropic

1:52:48 Why hire people of a wide variety of technical backgrounds?

1:54:33 What’s a future you’re excited about or hopeful for?

1:59:42 Where to find and follow Anthropic


Lucas Perry: Welcome to the Future of Life Institute Podcast. I’m Lucas Perry. Today’s episode is with Daniela and Dario Amodei of Anthropic. For those not familiar, Anthropic is a new AI safety and research company that’s working to build reliable, interpretable, and steerable AI systems. Their view is that large, general AI systems of today can have significant benefits, but can also be unpredictable, unreliable, and opaque.  Their goal is to make progress on these issues through research, and, down the road, create value commercially and for public benefit. Daniela and Dario join us to discuss the mission of Anthropic, their perspective on AI safety, their research strategy, as well as what it’s like to work there and the positions they’re currently hiring for. Daniela Amodei is Co-Founder and President of Anthropic. She was previously at Stripe and OpenAI, and has also served as a congressional staffer. Dario Amodei is CEO and Co-Founder of Anthropic. He was previously at OpenAI, Google, and Baidu. Dario holds a PhD in (Bio)physics from Princeton University.

Before we jump into the interview, we have a few announcements. If you’ve tuned into any of the previous two episodes, you can skip ahead just a bit. The first announcement is that I will be moving on from my role as Host of the FLI Podcast, and this means two things. The first is that FLI is hiring for a new host for the podcast. As host, you would be responsible for the guest selection, interviews, production, and publication of the FLI Podcast. If you’re interested in applying for this position, you can head over to the careers tab at futureoflife.org for more information. We also have another 4 job openings currently for a Human Resources Manager, an Editorial Manager, an EU Policy Analyst, and an Operations Specialist. You can learn more about those at the careers tab as well.

The second item is that even though I will no longer be the host of the FLI Podcast, I won’t be disappearing from the podcasting space. I’m starting a brand new podcast focused on exploring questions around wisdom, philosophy, science, and technology, where you’ll see some of the same themes we explore here like existential risk and AI alignment. I’ll have more details about my new podcast soon. If you’d like to stay up to date, you can follow me on Twitter at LucasFMPerry, link in the description.

And with that, I’m happy to present this interview with Daniela and Dario Amodei on Anthropic.

It’s really wonderful to have you guys here on the podcast. I’m super excited to be learning all about Anthropic. So we can start off here with a pretty simple question, and so what was the intention behind forming Anthropic?

Daniela Amodei: Yeah. Cool. Well, first of all Lucas, thanks so much for having us on the show. We’ve been really looking forward to it. We’re super pumped to be here. So I guess maybe I’ll kind of start with this one. So just why did we start Anthropic? To give a little history here and set the stage, we were founded about a year ago at the beginning of 2021, and it was originally a team of seven people who moved over together from OpenAI. And for listeners or viewers who don’t very viscerally remember this time period, it was the middle of the pandemic, so most people were not eligible to be vaccinated yet. And so when all of us wanted to get together and talk about anything, we had to get together in someone’s backyard or outdoors and be six feet apart and wear masks. And so it was generally just a really interesting time to be starting a company.

But why did we found Anthropic? What was the thinking there? I think the best way I would describe it is because all of us wanted the opportunity to make a focused research bet with a small set of people who were highly aligned around a very coherent vision of AI research and AI safety. So the majority of our employees had worked together in one format or another in the past, so I think our team is known for work like GPT-3, or DeepDream, which Chris Olah worked on at Google Brain, or for scaling laws. But we’d also done a lot of different safety research together in different organizations as well. So multimodal neurons when we were at OpenAI, Concrete Problems in AI Safety and a lot of others, but this group had worked together in different companies, at Google Brain and OpenAI, and in academia and in startups previously, and we really just wanted the opportunity to get that group together to do this focused research bet of building steerable, interpretable and reliable AI systems with humans at the center of them.

Dario Amodei: Yeah, just to add a little bit to that, I think we’re all a bunch of fairly empirically minded, exploration driven people, but who also think and care a lot about AI safety. I think that characterizes all seven of us. If you add together having either worked at OpenAI or worked together at Google Brain in the past, many of us worked together in the physics community, and we’re current or former physicists. If you add all that together, it’s a set of people who have known each other for a long time and have been aware of thinking and arguments about AI safety and have worked on them over the years always with an empirical bent, ranging from interpretability on language models and vision models to working on the original RL from Human Preferences, Concrete Problems in AI Safety, and also characterizing scaling and how scaling works and how we think of that as somewhat central to the way AI is going to progress and shapes the landscape for how to solve safety.

And so a year ago, we were all working at OpenAI and trying to make this focused bet on basically scaling plus safety or safety with a lens towards scaling being a big part of the path to AGI. And we felt we were making this focused bet within a larger organization, and we just eventually came to the conclusion that it would be great to have an organization that, top to bottom, was just focused on this bet and could make all its strategic decisions with this bet in mind. And so that was the thinking and the genesis.

Lucas Perry: Yeah. I really like that idea of a focused bet. I hadn’t heard that before. I like that. Do you all have a similar philosophy in terms of your background, since you’re all converging on this work of safely scaling to AGI?

Dario Amodei: I think in a broad sense, we all have this view, safety is important today and for the future. We all have this view of, I don’t know, I would say like pragmatic practicality, and empiricism. Let’s see what we can do today to try and get a foothold on things that might happen in the future. Yeah, as I said, many of us have backgrounds in physics or other natural sciences. I’m a former… I was a physics undergrad, neuroscience grad school, so yeah, we very much have this empirical science mindset, more than maybe a more philosophy or theoretical approach. Within that, obviously all of us, if you include the seven initial folks as well as the employees who joined, have our own skills and our own perspective on things and have different things within that we’re excited about. So we’re not all clones of the same person.

Some of us are excited about interpretability, some of us are excited about reward learning and preference modeling, some of us are excited about the policy aspects. And we each have our own guesses about the sub path within this broad path that makes sense. But I think we all agree on this broad view. Scaling’s important, safety’s important, getting a foothold on problems today is important as a window on the future.

Lucas Perry: Okay. And so this shared vision that you all have is around this focused research bet. Could you tell me a little bit more about what that bet is?

Daniela Amodei: Yeah. Maybe I’ll start here, and Dario feel free to jump in and add more, but I think the boilerplate vision or mission that you would see if you looked on our website is that we’re building steerable, interpretable and reliable AI systems. But I think what that looks like in practice is that we are training large scale generative models, and we’re doing safety research on those models. And the reason that we’re doing that is we want to make the models safer and more aligned with human values. I think the alignment paper, which you might have seen that came out recently, there’s a term there that we’ve been using a lot, which is we’re aiming to make systems that are helpful, honest and harmless.

I think also when I think about the way our teams are structured, we have capabilities as this central pillar of research and there’s this helix of safety research that wraps around every project that we work on. So to give an example, if we’re doing language model training, that’s like this central pillar, and then we have interpretability research, which is trying to see inside models and understand what’s happening with the language models under the hood. We’re doing alignment research with input from human feedback to try and improve the outputs of the model. We’re doing societal impacts research. That’s looking at what impact on society in sort of a short and medium term way do these language models have? We’re doing scaling laws research to try and predict empirically what properties are we going to see emerge in these language models at various sizes? But I think all together, that ends up looking like a team of people that are working together on a combination of capability and scaling work with safety research.

Dario Amodei: Yeah. I mean, one way you might put it is there are a lot of things that an org does that are neutral as to the direction that you would take. You have to build a cluster and you have to have an HR operation and you have to have an office. And so you can even think of the large models as being a bit like the cluster, that you build these large models and they’re blank when you start off with them and probably unaligned, but it’s what you do on top of these models that matters, that takes you in a safe or not safe direction in a good or a bad direction.

And so in a way, although they’re ML and although we’ll continue to scale them up, you can think of them as almost part of infrastructure. It takes research and it takes algorithms to get them right, but you can think of them as this core part of the infrastructure. And then the interesting question is all the safety questions. What’s going on inside these models? How do they operate? How can we make them operate differently? How can we change their objective functions to be something that we want rather than something that we don’t want? How can we look at their applications and make those applications more likely to be positive and less likely to be negative, more likely to go in directions that people intend and less likely to go off in directions that people don’t intend? So we almost see the presence of these large models as like the… I don’t know what the analogy is, like the flour or the paste, like the background ingredient on which the things we really care about get built and a prerequisite for building those things.

Lucas Perry: So does AI existential safety fit into these considerations around your safety and alignment work?

Dario Amodei: I think this is something we think about and part of the motivation for what we do. Probably most listeners of this podcast know what it is, but I think the most common form of the concern is, “Hey, look, we’re making these AI systems. They’re getting more and more powerful. At some point they’ll be generally intelligent or more generally capable than human beings are and then they may have a large amount of agency. And if we haven’t built them in such a way that agency is in line with what we want to do, then we could imagine them doing something really scary that we can’t stop.” So I think that, to take it even further, this could be some kind of threat to humanity.

So I mean, that’s an argument with many steps, but it’s one that, in a very broad sense and in the long term, seems at least potentially legitimate to us. I mean, this is like the argument seems at least like something that we should care about. But I think the big question, and maybe how we differ, although it might be subtly, from other orgs that think about these problems, is how do we actually approach that problem today? What can we do? So I think there are various efforts to think about the ways in which this might happen, to come up with theories or frameworks.

As I mentioned with the background that we have, we’re more empirically focused people. We’re more inclined to say, “We don’t really know. That broad argument sounds kind of plausible to us and the stakes should be high, so you should think about it.” But it’s hard to work on that today. I’m not even sure how much value there is in talking about that a lot today. So we’ve taken a very different tack, which is look, there actually… And I think this has started to be true in the last couple years and maybe wasn’t even true five years ago, that there are models today that have, at least some, not all of the properties of models that we would be worried about in the future and are causing very concrete problems today that affect people today. So can we take a strategy where we develop methods that both help with the problems of today, but do so in a way that could generalize or at least teach us about the problems of the future? So our eye is definitely on these things in the future. But I think that if not grounded in empirics in the problems of today, it can drift off in a direction that isn’t very productive.

And so that’s our general philosophy. I think the particular properties and the models are look, today, we have models that are really open ended, in some narrow ways are more capable than humans. I think large language models probably know more about cricket than me, because I don’t know the first thing about cricket and are also unpredictable by their statistical nature. And I think those are at least some of the properties that we’re worried about with future systems. So we can use today’s models as a laboratory to scope out these problems a little better. My guess is that we don’t understand them very well at all and that this is a way to learn.

Lucas Perry: Could you give some examples of some of these models that exist today that you think exhibit these properties?

Dario Amodei: So I think the most famous one would be generative language models. So there’s a lot of them. There’s most famously GPT-3 from OpenAI, which we helped build. There’s Gopher from DeepMind. There’s LaMDA from main Google. I’m probably leaving out some. I think there have been models of this size in China, South Korea, Israel. It seems like everyone has one nowadays. I don’t think it’s limited to language. There have also been models that are focused on code. We’ve seen that from DeepMind, OpenAI and some other players. And there have also been models with modified forms in the same spirit that model images, that generate images or that convert images to text or that convert text to images. There might be models in the future that generate videos or convert videos to text.

There’s many modifications of it, but I think the general idea is big models, models with a lot of capacity and a lot of parameters trained on a lot of data that try to model some modality, whether that’s text, code, images, video, transitions between the two or such. And I mean, I think these models are very open ended. You can say anything to them and they’ll say anything back. They might not do a good job of it. They might say something horrible or biased or bad, but in theory, they’re very general, and so you’re never quite sure what they’re going to say. You’re never quite sure. You can talk to them about anything, any topic and they’ll say something back that’s often topical, even if sometimes it doesn’t make sense or it might be bad from a societal perspective.

So yeah, it’s this challenge of general open-ended models where you have this general thing that’s fairly unaligned and difficult to control, and you’d like to understand it better so that you can predict it better and you’d like to be able to modify them in some way so that they behave in a more predictable way, and you can decrease the probability or even maybe even someday rule out the likelihood of them doing something bad.

Daniela Amodei: Yeah. I think Dario covered the majority of it. I think there’s maybe potentially a hidden question in what you’re asking, although maybe you’ll ask this later. But why are we working on these larger scale models might be an implicit question in there. And I think to piggyback on some of the stuff that Dario said, I think part of what we’re seeing and the potential shorter term impacts of some of the AI safety research that we do is that different sized models exhibit different safety issues. And so I think with using, again, language models, just building on what Dario was talking about, I think something we feel interested in, or interested to explore from this empirical safety question is just how they will, as their capabilities develop, how their safety problems develop as well.

There’s this commonly cited example in safety world around language models, which is smaller language models show they might not necessarily deliver a coherent answer to a question that you ask, because maybe they don’t know the answer or they get confused. But if you repeatedly ask this smaller model the same question, it might go off and incoherently spout things in one direction or another. Some of the larger models that we’ve seen, we basically think that they have figured out how to lie unintentionally. If you pose the same question to them differently, eventually you can get the lie pinned down, but they won’t in other contexts.

So that’s obviously just a very specific example, but I think there’s quite a lot of behaviors emerging in generative models today that I think have the potential to be fairly alarming. And I think these are the types of questions that have an impact today, but could also be very important to have sorted out for the future and for long term safety as well. And I think that’s not just around lying. I think you can apply that to all different safety concerns regardless of what they are, but that’s the impetus behind why we’re studying these larger models.

Dario Amodei: Yeah. I think one point Daniela made that’s really important is this sudden emergence or change. So it’s a really interesting phenomenon where work we’ve done, like our early employees have done, on scaling laws shows that when you make these models bigger, if you look at the loss, the ability to predict the next word or the next token across all the topics the model could go on, it’s very smooth. I double the size of the model, loss goes down by 0.1 units. I double it again, the loss goes down by 0.1 units. So that would suggest that everything’s scaling smoothly. But then within that, you often see these things where a model gets to a certain size. A five billion parameter model, you ask it to add two three-digit numbers. Nothing, always gets it wrong. A hundred billion parameter model, you ask it to add two three-digit numbers, gets it right, like 70 or 80% of the time.

And so you get this coexistence of smooth scaling with the emergence of these capabilities very suddenly. And that’s interesting to us because it seems very analogous to worries that people have of like, “Hey, as these models approach human level, could something change really fast?” And this actually gives you one model. I don’t know if it’s the right one, but it gives you an analogy, like a laboratory that you can study of ways that models change very fast. And it’s interesting how they do it because the fast change, it coexists. It hides beneath this very smooth change. And so I don’t know, maybe that’s what will happen with very powerful models as well.

Maybe it’s not, but that’s one model of the situation and what we want to do is keep building up models of the situation so that when we get to the actual situation, it’s more likely to look like something we’ve seen before and then we have a bunch of cached ideas for how to handle it. So that would be an example. You scale models up, you can see fast change, and then that might be somewhat analogous to the fast change that you see in the future.
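
As a rough illustration of the smooth part of this picture, the "0.1 units per doubling" arithmetic Dario describes can be sketched as below. This is a toy under stated assumptions: the function name `toy_loss`, the starting size of one billion parameters, and the starting loss of 3.0 are all hypothetical and chosen only for illustration. Only the fixed drop per doubling comes from the conversation. The sketch does not attempt to model the sudden emergence of abilities such as three-digit addition.

```python
import math

# Toy sketch only: base_params and base_loss are made-up illustrative values,
# not measurements from any real model. Only the "0.1 per doubling" figure
# echoes the example given in the conversation.
def toy_loss(params, base_params=1e9, base_loss=3.0, drop_per_doubling=0.1):
    """Return a loss that falls by a fixed amount each time params doubles."""
    doublings = math.log2(params / base_params)
    return base_loss - drop_per_doubling * doublings

if __name__ == "__main__":
    # Each step doubles the (hypothetical) parameter count.
    for params in (1e9, 2e9, 4e9, 8e9):
        print(f"{params:.0e} parameters -> toy loss {toy_loss(params):.2f}")
```

Equal drops per doubling mean the toy loss is a straight line when plotted against the logarithm of model size, which is the sense in which the overall curve looks smooth even if individual capabilities appear abruptly.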

Lucas Perry: What does it mean for a model to lie?

Daniela Amodei: Lying usually implies agency. If my husband comes home and says, “Hey, where did the cookies go?” And I say, “I don’t know. I think I saw our son hanging out around the cookies and then now the cookies are gone, maybe he ate them,” but I ate the cookies, that would be a lie. I think it implies intentionality, and I don’t think we think, or maybe anyone thinks that language models have that intentionality. But what is interesting is that because of the way they’re trained, they might be either legitimately confused or they might be choosing to obscure information. And so obscuring information, it’s not a choice. They don’t have intentionality, but for a model that can come across as very knowledgeable, as clear or as sometimes unknown to the human that’s talking to it, intelligent in certain ways in sort of a narrow way, it can produce results that on the surface, it might look like it could be a credible answer, but it’s really not a credible answer, and it might repeatedly try to convince you that is the answer.

It’s hard to talk about this without using words that imply intentionality, but we don’t think the models are intentionally doing this. But a model could repeatedly produce a result that looks like it’s something that could be true, but isn’t actually true.

Lucas Perry: Keeps trying to justify its response when it’s not right.

Daniela Amodei: It tries to explain… Yes, exactly. It repeatedly tries to explain why the answer it gave you before was correct even if it wasn’t.

Dario Amodei: Yeah. I mean, to give another angle on that, it’s really easy to slip into anthropomorphism, and we really shouldn’t… They’re machine learning models, they’re a bunch of numbers. But there are phenomena that you see. So one thing that will definitely happen is if a model is trained on a dialogue in which one of the characters is not telling the truth, then models will copy that dialogue. And so if the model is having the dialogue with you, it may say something that’s not the truth. Another thing that may happen is if you ask the model a question and the answer to the question isn’t in its training data, then the model just has a probability distribution on what plausible answers look like.

The objective function is to predict the next word, to predict the thing a human would say, not to say something that’s true according to some external referent. Say I ask it, “Who is the mayor of Paris?” It hasn’t seen that in its training data, but it has some probability distribution and it’s going to say, “Okay, it’s probably some name that sounds French.” And so it may be just as likely to make up a name that sounds French as it is to give the true mayor of Paris. As the models get bigger and they train on more data maybe it’s more likely to give the real true mayor of Paris, but maybe it isn’t. Maybe you need to train in a different way to get it to do that. And that’s an example of the things we would be trying to do on top of large models to get models to be more accurate.
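The point about answering from a probability distribution over plausible names can be sketched in a few lines of Python. Everything here is hypothetical: the placeholder names, the equal probabilities, and the helper `wrong_answer_rate` are assumptions for illustration, not measurements from any model.

```python
import random

def wrong_answer_rate(distribution, truth, trials=10_000, seed=0):
    """Sample answers in proportion to their probability and count the misses.

    distribution maps candidate answers to probabilities; truth is the
    correct answer. Sampling stands in for a model that predicts a likely
    continuation rather than checking facts.
    """
    rng = random.Random(seed)
    names = list(distribution)
    weights = [distribution[name] for name in names]
    samples = rng.choices(names, weights=weights, k=trials)
    return sum(1 for s in samples if s != truth) / trials

# Hypothetical case: the fact was not in the training data, so the
# probability mass is spread evenly over names that merely sound right.
unseen_fact = {
    "<true mayor>": 0.25,
    "<plausible French name 1>": 0.25,
    "<plausible French name 2>": 0.25,
    "<plausible French name 3>": 0.25,
}

print(wrong_answer_rate(unseen_fact, "<true mayor>"))
```

Under these made-up numbers most sampled answers are wrong, even though every one of them looks like a credible answer on the surface.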

Lucas Perry: Could you explain some more of the safety considerations and concerns about alignment given the open-endedness of these models?

Dario Amodei: I think there’s a few things around it. We have a paper, I don’t know when this podcast is going out, but probably the paper will be out when the podcast posts. It’s called Predictability and Surprise in Generative Models. So that means what it sounds like, which is that open-endedness, I think it’s correlated to surprise in a whole bunch of ways. So let’s say I’ve trained the model on a whole bunch of data on the internet. I might interact with the model or users might interact with the model for many hours, and you might never know, for example, I might never think… I used the example of cricket before, because it’s a topic I don’t know anything about, but I might not… People might interact with the model for many hours, many days, many hundreds of users until someone finally thinks to ask this model about cricket.

So then the model might know a lot about cricket. It might know nothing about cricket. It might have false information or misinformation about cricket. And so you have this property where you have this model, you’ve trained it. In theory, you understand its training process, but you don’t actually know what this model is going to do when you ask it about cricket. And there’s a thousand other topics like cricket, where you don’t know what the model is going to do until someone thinks to ask about that particular topic.

Now, cricket is benign, but let’s say no one’s ever asked this model about neo-Nazi views or something. Maybe the model has a propensity to say things that are sympathetic to neo-Nazis. That would be really bad. With existing models, when they’re trained on the internet, averaging over everything they’re trained on, there are going to be some topics where that’s true and it’s a concern. And so I think the open-endedness, it just makes it very hard to characterize and it just means that when you’ve trained a model, you don’t really know what it’s going to do. And so a lot of our work is around, “Well, how can we look inside the model and see what it’s going to do? How can we measure all the outputs and characterize what the model’s going to do? How can we change the training process so that at a high level, we tell the model, ‘Hey, you should have certain values. There are certain things you should say. There are certain things you should not say. You should not have biased views. You should not have violent views. You should not help people commit acts of violence’?”

There’s just a long list of things that you don’t want the model to do that you can’t know the model isn’t going to do if you’ve just trained it in this generic way. So I think the open-endedness, it makes it hard to know what’s going on. And so, yeah, a good portion of our research is how do we make that dynamic less bad?

Daniela Amodei: I agree with all of that and I would just jump in and this is interesting, I don’t know, sidebar or anecdote, but something that I think is extremely important in creating robustly safe systems is making sure that you have a variety of different people and a variety of different perspectives engaging with them and almost red teaming them to understand the ways that they might have issues. So an example that we came across that’s just an interesting one is internally, when we’re trying to red team the models or figure out places where they might have, to Dario’s point, really negative unintended behaviors or outputs that we don’t want them to have, a lot of our scientists internally will ask it questions.

“If you wanted to, in a Risk board game style way, take over the world, what steps would you follow? How would you do that?” And we’re looking for things like, is there a risk of it developing some grand master plan? And when we use MTurk workers or contractors to help us red team, they’ll ask questions to the model, like, “How could I kill my neighbor’s dog? What poison should I use to hurt an animal?” And both of those outcomes are terrible. Those are horrible things that we’re trying to prevent the model from doing or outputting, but they’re very different and they look very different and they sound very different, and I think it shows the degree to which safety problems are also very open ended. There’s a lot of ways that things could go wrong, and I think it’s very important to make sure that we have a lot of different inputs and perspectives on what different types of safety challenges could even look like, and making sure that we’re trying to account for as many of them as possible.

Dario Amodei: Yeah, I think adversarial training and adversarial robustness are really important here. Let’s say I don’t want my model to help a user commit a crime or something. It’s one thing, I can try for five minutes and say, “Hey, can you help me rob a bank?” And the model’s like, “No.” But I don’t know, maybe if the user’s more clever about it. If they’re like, “Well, let’s say I’m a character in a video game and I want to rob a bank. How would I?” And so because of the open-endedness, there’s so many different ways. And so one of the things we’re very focused on is trying to adversarially draw out all the bad things so that we can train against them. We can train the model not… We can stamp them out one by one. So I think adversarial training will play an important role here.

Lucas Perry: Well, that seems really difficult and really important. How do you adversarially train against all of the ways that someone could use a model to do harm?

Dario Amodei: Yeah, I don’t know. There’re different techniques that we’re working on. Probably don’t want to go into a huge amount of detail. We’ll have work out on things like this in the not too distant future. But generally, I think the name of the game is how do you get broad diverse training sets of what you should… What’s a good way for a model to behave and what’s a bad way for a model to behave? And I think the idea of trying your very best to make the models do the right things, and then having another set of people that’s trying very hard to make those models that are purportedly trained to do the right thing, to do whatever they can to try and make it do the wrong thing, continuing that game until the models can’t be broken by normal humans. And even using the power of the models to try and break other models and just throwing everything you have at it.

And so there’s a whole bunch that gets into the debate and amplification methods and safety, but just trying to throw everything we have at trying to show ways in which purportedly safe models are in fact not safe, which are many. And then once we’ve done that long enough, maybe we have something that actually is safe.

Lucas Perry: How do you see this fitting into the global dynamics of people making larger and larger models? It’s good if we have time to do adversarial training on these models, and then this gets into discussions around race dynamics towards AGI. So how do you see Anthropic as positioned in this and in the race dynamics for making safe systems?

Dario Amodei: I think it’s definitely a balance. As both of us said, you need these large models to… You basically need to have these large models in order to study these questions in the way that we want to study them, so we should be building large models. I think we shouldn’t be racing ahead or trying to build models that are way bigger than other orgs are building them. And we shouldn’t, I think, be trying to ramp up excitement or hype about giant models or the latest advances. But we should build the things that we need to do the safety work and we should try to do the safety work as well as we can on top of models that are reasonably close to state of the art. And we should be a player in the space that sets a good example and we should encourage other players in the space to also set good examples, and we should all work together to try and set positive norms for the field.

Daniela Amodei: I would also just add, I think in addition to industry groups or industry labs, which are the actors that I think get talked about the most, I think there’s a whole swath of other groups that has, I think, a really potentially important role to play in helping to disarm race dynamics or set safety standards in a way that could be really beneficial for the field. And so here, I’m thinking about groups like civil society or NGOs or academic actors or even governmental actors, and in my mind, I think those groups are going to be really important for helping us not just develop, but develop and deploy safe and more advanced AI systems within a framework that requires compliance with safety.

A thing I think about a lot is that a few jobs ago I worked at Stripe. It was a tech startup then, and even at a very small size (I joined when it was not that much bigger than Anthropic is now), I was so painfully aware every day of just how many checks and balances there were on the company, because we were operating in this highly regulated space of financial services. And financial services, it’s important that it’s highly regulated, but it kind of blows my mind that AI, given the potential reach that it could have, is still such a largely unregulated area. Right? If you are an actor who doesn’t want to advance race dynamics, or who wants to do the right thing from a safety perspective, there are no clear guidelines around how to do that now, right? Every lab is kind of figuring that out on its own. And I think something I’m hoping to see in the next few years, and I think we will see, is something closer to what in other industries looks like standard-setting organizations or industry groups or trade associations that say this is what a safe model looks like, or this is how we might want to move some of our systems towards being safer.

And I really think that without kind of an alliance of all of these different actors, not just in the private sector, but also in the public sphere, we sort of need all those actors working together in order to kind of get to the sort of positive outcomes that I think we’re all hoping for.

Dario Amodei: Yeah. I mean, I think this is generally going to take an ecosystem. I mean, I, yeah, I have a view here that there’s a limited amount that one organization can do. I mean, we don’t describe our mission as solve the safety problem, solve all the problems, solve all the problems with AGI. Our view is just can we attack some specific problems that we think we’re well-suited to solve? Can we be a good player and a good citizen in the ecosystem? And can we help a bit to kind of contribute to these broader questions? But yeah, I think yeah, a lot of these problems are sort of global or relate to coordination and require lots of folks to work together.

Yeah. So I think in addition to the government role that Daniela talked about, which I think there’s a role for measurement, organizations like NIST specialize in kind of measurement and characterization. If one of our worries is kind of the open endedness of these systems and the difficulty of characterizing and measuring things, then there’s a lot of opportunity there. I’d also point to academia. I think something that’s happened in the last few years is a lot of the frontier AI research has moved from academia to industry because it’s so dependent on kind of scaling. But I actually think safety is an area where academia kind of already is but could contribute even more. There’s some safety work that requires or that kind of requires building or having access to large models, which is a lot of what Anthropic is about.

But I think there’s also some safety research that doesn’t. I think a subset of the mechanistic interpretability work is the kind of stuff that could be done within academia. Where academia is really strong is development of new methods, development of new techniques. And I think because safety’s kind of a frontier area, there’s more of that to do in safety than there is in other areas. And it may be able to be done without large models or only with limited access to large models. This is an area where I think there’s a lot that academia can do. And so, yeah, I don’t know, the hope is between all the actors in the space, maybe we can solve some of these coordination problems, and maybe we can all work together.

Daniela Amodei: Yeah. I would also say in a paper of ours that is hopefully forthcoming soon, one thing we actually talk about is the role that government could play in helping to fund some of the kind of academic work that Dario talked about in safety. And I think that’s largely because we’re seeing this trend of training large generative models becoming almost prohibitively expensive, right? And so I think government also has an important role to play in helping to promote and really subsidize safety research in places like academia. And I agree with Dario, AI safety is a really nascent field still, right? It’s maybe only been around, kind of depending on your definition, for somewhere between five and 15 years. And so I think seeing more efforts to kind of support safety research in other areas, I think would be really valuable for the ecosystem.

Dario Amodei: And to be clear, I mean, some of it’s already happening. It’s already happening in academia. It’s already happening in independent nonprofit institutes. And depending on how broad your definition of safety is, I mean, if you broaden it to include some of the short term concerns, then there are many, many people working on it. But I think precisely because it’s such a broad area, there’s so much to do. There are today’s concerns. There’s working on today’s concerns in a way that’s pointed at the future. There’s empirical approaches, there’s conceptual approaches, there’s- yeah. There’s interpretability, there’s alignment, there’s so much to do that I feel like we could always have a wider range of people working on it, people with different mentalities and mindsets.

Lucas Perry: Backing up a little bit here to a kind of simple question. So what is Anthropic’s mission then?

Daniela Amodei: Sure. Yeah, I think we talked about this a little bit earlier, but I think, again, I think the boilerplate mission is build reliable, interpretable, steerable AI systems, have humans at the center of them. And I think that for us right now, that is primarily, we’re doing that through research, we’re doing that through generative model research and AI safety research, but down the road that could also include deployments of various different types.

Lucas Perry: Dario mentioned that it didn’t include solving all of the alignment problems or the other AGI safety stuff. So how does that fit in?

Dario Amodei: I mean, I think what I’m trying to say by that is that there’s very many things to solve. And I think it’s unlikely that one company will solve all of them. I mean, I do think everything that relates to short and long-term AI alignment is in scope for us and is something we’re interested in working on. And I think the more bets we have, the better. This relates to something we could talk about in more detail later on, which is you want as many different orthogonal views on the problem as possible, particularly if you’re trying to build something very reliable. So many different methods and I don’t think we have a view that’s narrower than an empirical focus on safety, but at the same time that problem is so broad that I think what we were trying to say is that it’s unlikely that one company is going to come up with a complete solution or that complete solution is even the right way to think about it.

Daniela Amodei: I would also add sort of to that point, I think one of the things that we do and are sort of hopeful is helpful to the ecosystem as a whole is we publish our safety research and that’s because of this kind of diversification effect that Dario talks about, right. So we have certain strengths in particular areas of safety research because we’re only a certain sized company with certain people with certain skill sets. And our hope is that we will see some of the safety research that we’re doing that’s hopefully helpful to others, also be something that other organizations can kind of pick up and adapt to whatever the area of research is that they’re working on. And so we’re hoping to do research that’s generalizable enough from a safety perspective that it’s also useful in other contexts. 

Lucas Perry: So let’s pivot here into the research strategy, which we’ve already talked about quite a bit, particularly this focus around large models. So could you explain why you’ve chosen large models as something to explore empirically for scaling to higher levels of intelligence and also using it as a place for exploring safety and alignment?

Dario Amodei: Yeah, so I mean, I think kind of the discussion before this has covered a good deal of it, but I think, yeah, I think some of the key points here are the models are very open ended and so they kind of present this laboratory, right. There are existing problems with these models that we can solve today that are like the problems that we’re going to face tomorrow. There’s this kind of wide scope where the models could act. They’re relatively capable and getting more capable every day. That’s the regime we want to be in. Those are the problems we want to be attacking.

I think there’s this point that you can see sudden transitions even in today’s models. If you’re worried about sudden transitions in future models, and I look at the scaling laws plot from a hundred million parameter model to a billion, to 10 billion, to a hundred billion, to trillion parameter models, then looking at the first part of the scaling plot, from a hundred million to a hundred billion, can tell us a lot about how things might change at the later part of the scaling laws.

We shouldn’t naively extrapolate and say the past is going to be like the future. But the first things we’ve seen already differ from the later things that we’ve already seen. And so maybe we can make an analogy between the changes that are happening over the scales that we’ve seen, over the scaling that we’ve seen to things that may happen in the future. Models learn to do arithmetic very quickly over one order of magnitude. They learn to comprehend certain kinds of questions. They learn to play actors that aren’t telling the truth, which is something that if they’re small enough, they don’t comprehend.

So can we study both the dynamics of how this happens, how much data it takes to make that happen, what’s going on inside the model mechanistically when that happens and kind of use that as an analogy that equips us well to understand as models scale further and also as their architecture changes, as they become trained in different ways. I’ve talked a lot about scaling up, but I think scaling up isn’t the only thing that’s going to happen. There are going to be changes in how models are trained and we want to make sure that the things that we build have the best chance of being robust to that as well.

Another thing I would say on the research strategy is that it’s good to have several different, I wouldn’t quite put it as several different bets, but it’s good to have several different uncorrelated or orthogonal views on the problem. So if you want to make a system that’s highly reliable, or you want to drive down the chance that some particular bad thing happens, which again could be the bad things that happen with models today or the larger scale things that could happen with models in the future, then a thing that’s very useful is having kind of orthogonal sources of error. Okay, let’s say I have a method that catches 90% of the bad things that models do. That’s great. But a thing that can often happen is then I develop some other methods and if they’re similar enough to the first methods, they all catch the same 90% of bad things. That’s not good because then I think I have all these techniques and yet 10% of the bad things still go through.

What you want is you want a method that catches 90% of the bad things and then you want an orthogonal method that catches a completely uncorrelated 90% of the bad things. And then only 1% of things go through both filters, right, if the two are uncorrelated. It’s only the 10% of the 10% that gets through. And so the more of these orthogonal views you have, the more you can drive down the probability of failure.
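The arithmetic behind the two-filter argument can be checked with a short Python sketch. The 90% catch rate comes from the example above; the two helper functions and their names are introduced here for illustration, and the "fully correlated" case is modeled as the assumption that every method misses the same bad things.

```python
def miss_rate_independent(catch_rates):
    """Fraction of bad things that slip past every method.

    Assumes the methods' misses are statistically independent, so the
    individual miss rates multiply.
    """
    miss = 1.0
    for rate in catch_rates:
        miss *= 1.0 - rate
    return miss

def miss_rate_fully_correlated(catch_rates):
    """Fraction that slips through when the methods overlap completely.

    Assumes whatever the best method misses, the others miss too, so
    adding methods does not help beyond the best single one.
    """
    return 1.0 - max(catch_rates)

two_methods = [0.9, 0.9]
print(miss_rate_independent(two_methods))       # about 0.01: 10% of the 10%
print(miss_rate_fully_correlated(two_methods))  # about 0.10: no improvement
```

Real methods would fall somewhere between these two extremes, which is why the degree of correlation between them matters as much as each method's own catch rate.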

You could think of an analogy to self-driving cars where, of course, those things have to have a very, very high rate of safety if you want to not have problems. And so, I don’t know very much about self-driving cars, but they’re equipped with visual sensors, they’re equipped with LIDAR, they have different algorithms that they use to detect if something, like there’s a pedestrian that you don’t want to run over or something. And so having independent views on the problem is very important. And so we have our different directions like reward modeling, interpretability, trying to characterize models, adversarial training. I think the whole goal of that is to get down the probability of failure and have different views of the problem. I often refer to it as the P-squared problem, which is, yeah, if you have some method that reduces errors to a probability P, that’s good, but what you really want is P-squared, because then if P is a small number, your errors become very rare.
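The “P-squared” idea can be written out as a short probability statement. This is a sketch under the assumption that each of two methods lets an error through with the same probability P; the event names F_1 and F_2 (method 1 fails, method 2 fails) are introduced here for illustration and do not come from the conversation.

```latex
\Pr(F_1 \cap F_2) \;=\; \Pr(F_1)\,\Pr(F_2 \mid F_1) \;=\;
\begin{cases}
P \cdot P = P^2 & \text{if the two failures are independent,} \\
P \cdot 1 = P   & \text{if method 2 fails whenever method 1 does.}
\end{cases}
```

With P = 0.1, the independent case leaves 1% of errors and the fully dependent case leaves 10%, so the gain from a second method depends on how far the conditional term can be pushed below 1.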

Lucas Perry: Does Anthropic consider its research strategy to be a sort of prosaic alignment, since it’s focused on large models?

Dario Amodei: Yeah. I think we maybe less think about things in that way. So my understanding is prosaic alignment is kind of alignment with AI systems that kind of look like the systems of today, but I, to some extent that distinction has never been super clear to me because yeah, you can do all kinds of things with neural models or mix neural models with things that are different than neural models. You can mix a large language model with a reasoning system or a system that derives axioms or propositional logic or uses external tools or compiles code or things like that. So I’ve never been quite sure that I understand kind of the boundary of what’s meant by prosaic or systems that are like the systems of today.

Certainly we work on some class of systems that includes the systems of today, but I never know how broad that class is intended to be. I do think it’s possible that in the future, AI systems will look very different from the way that they look today. And I think for some people that drives a view that they want kind of more general approaches to safety or approaches that are more conceptual. I think my perspective on it is it could be the case that systems of the future are very different. But in that case, I think both kind of conceptual thinking and our current empirical thinking will be disadvantaged and will be disadvantaged at least equally. But I kind of suspect that even if the architectures look very different, that the empirical experiments that we do today kind of themselves contain general motifs or patterns that will serve us better than will trying to speculate about what the systems of tomorrow look like.

One way you could put it is like, okay, we’re developing these systems today that have a lot of capabilities that are some subset of what we need to produce something that fully matches human intelligence. Whatever the specific architectures, things we learn about how to align these systems, I suspect that those will carry over and that they’ll carry over more so than sort of the exercise of trying to think, well, what could the systems of tomorrow look like? What can we do that’s kind of fully general? I think both things can be valuable, but yeah, I mean, I think we’re just taking a bet on what we think is most exciting, which is that by studying the systems and architectures of today, we’ll learn things that, yeah, give us the best chance of knowing what to do if the architectures of tomorrow are very different.

That said, I will say transformer language models and other models, particularly with things like RL or kind of modified interactions on top of them, if construed broadly enough, man, there’s an ever-expanding set of things they can do. And my bet would be that they don’t have to change that much.

Lucas Perry: So let’s pivot then a little bit into some of your recent research and papers. So you’ve done major papers on alignment, interpretability, and societal impact. Some of this you’ve mentioned in passing so far. So could you tell me more about your research and papers that you’ve released?

Dario Amodei: Yeah. So why don’t we go one by one? So first interpretability. So yeah, I could just start with kind of the philosophy of the area. I mean, I think the basic idea here is, look, these models are getting bigger and more complex. If you have a complex system and you don’t know what it’s going to do as it gets more powerful or in a new situation, one way to really get a handle on what it might do is just to understand the system mechanistically. If you could look inside the model and say hey, this model, it did something bad. It said something racist, it endorsed violence, it said something toxic, it lied to me. Why did it do that? If I’m actually able to look inside the mechanisms of the model and say well, it did it because of this part of the training data or it did it because there’s this circuit that was trying to identify X, but misidentified it as Y, then we’re in a much better position.

And particularly if we understand the mechanisms, we’re in a better position to say if the model was in a new situation where it did something much more powerful, or just if we built more powerful versions of the model, how might they behave in some different way? So, I think mechanistic interpret- lots of folks work on interpretability, but I think a thing that’s more unusual to us is, rather than just asking why did the model do a specific thing, we try and look inside the model and reverse engineer as much of it as we can. Try and find general patterns. And so the first paper that we came out with, which was led by Chris Olah, who’s been one of the pioneers of interpretability, was focused on starting with small models and trying to reverse engineer these very small models as fully as we can. And we have a new paper coming out soon that applies the same thing more approximately to larger models.

So we study one- and two-layer attention-only models, and we’re able to find kind of features or patterns of which the most interesting one is called an induction head. And an induction head is a particular arrangement of two of what are called attention heads. Attention heads are a piece of transformers, and transformers are the main architecture that’s used in models for language and other kinds of models. And the two attention heads work together in a way such that when you’re trying to predict something in a sequence, if it’s Mary had a little lamb, Mary had a little lamb, something, something, when you’re at a certain point in the sequence, they look back to something that’s as similar as possible, they look back for clues to things that are similar earlier in the sequence and try to pattern match them.

There’s one attention head that looks back and identifies okay, this is what I should be looking at, and there’s another that’s like okay, this was the previous pattern, and this increases the probability of the thing that’s the closest match to this. And so we can see these very precisely operating in small models and the thesis, which we’re able to offer some support for in the new second paper that’s coming out, is that these are a mechanism for how models match patterns, maybe even how they do what we call in context or few shot learning, which is a capability that models have had since GPT-2 and GPT-3. So yeah, that’s interpretability. Yeah. Do you want me to go on to the next one or you could talk about that? 
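The look-back-and-copy behavior described here can be imitated with a few lines of Python. This is a behavioral sketch only: it uses exact token matching where a real induction head uses learned, soft attention weights, and the function name `induction_predict` is made up for illustration.

```python
def induction_predict(tokens):
    """Guess the next token by copying what followed the last similar context.

    Step 1 (the "where should I look" role): scan backwards for the most
    recent earlier position holding the same token as the current one.
    Step 2 (the "copy" role): return the token that came right after it.
    Returns None when the current token has not appeared before.
    """
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

sequence = "Mary had a little lamb Mary had a little".split()
print(induction_predict(sequence))  # prints "lamb"
```

Because the rule depends only on what appeared earlier in the same sequence, it works on patterns the function has never "seen" before, which is the flavor of in-context learning the passage points at.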

Lucas Perry: Sure. So before you move on to the next one, could you also help explain how difficult it is to interpret current models or whether or not it is difficult? 

Dario Amodei: Yeah. I mean, I don’t know, I guess difficult is in the eye of the beholder, and I think Chris Olah can speak to the details of this better than either of us can. But I think kind of watching from the outside and supervising this within Anthropic, I think the experience has generally been that whenever you start looking at some particular phenomenon that you’re trying to interpret, everything looks very difficult to understand. There’s billions of parameters, there’s all these attention heads. What’s going on? Everything that happens could be different. You really have no idea what’s going on. And then there comes some point where there’s some insight or set of insights. And you should ask Chris Olah about exactly how it happens or how he thinks of the right insights that kind of really almost offers a Rosetta stone to some particular phenomenon, often a narrow phenomenon, but these induction heads, they exist everywhere within small models, within large models.

They don’t explain everything. I don’t want to over-hype them, but it’s a pattern that appears again and again and operates in the same way. And once you see something like that, then a whole swath of behavior that didn’t make sense before starts to make some more sense. And of course, there’s exceptions. They’re only approximately true, there are many, many things to be found. But I think the hope in terms of interpreting models, it’s not that we’ll make some giant atlas of what each of the hundred billion weights in a giant model means, but that there will be some lower description length pattern that appears over and over again.

You could make an analogy to the brain or the cell or something like that, where, if you were to just cut up a brain, you’re like, oh my God, this is so complex. I don’t know what’s going on. But then you see that there are neurons and the neurons appear everywhere. They have electrical spikes, they relate to other neurons, they form themselves in certain patterns, and those patterns repeat themselves. Some things are idiosyncratic and hard to understand, but also there’s this patterning. And so, I don’t know, it’s maybe an analogy to biology where there’s a lot of complexity, but also there are underlying principles, things like DNA to RNA to proteins, or general intracellular signal regulation. So yeah, the hope is that there are at least some of these principles and that when we see them, everything gets simpler. But maybe not. We found those in some cases, but maybe as models get more complicated, they get harder to find. And of course, even within existing models, there’s many, many things that we don’t understand at all.

Lucas Perry: So can we move on then to alignment and societal impact? 

Dario Amodei: Trying to align models by training them and particularly preference modeling, that’s something that several different organizations are working on. There are efforts at DeepMind, OpenAI, Redwood Research, various other places to work on that area. But I think our general perspective on it has been kind of being very method agnostic, and just saying what are all the things we could do to make the models more in line with what would be good. Our general heuristic for it, which isn’t intended to be a precise thing, is helpful, honest, harmless. That’s just kind of a broad direction for what are some things we can do to make models today more in line with what we want them to do, and not things that we all agree are bad.

And so in that paper, we just went through a lot of different ways, tried a bunch of different techniques, often very simple techniques, like just prompting models or training on specific prompts, what we call prompt distillation, building preference models for some particular task or preference models from general answers on the internet. How well did these things do at, yeah, at simple benchmarks for toxicity, helpfulness, harmfulness, and things like that? So it was really just a baseline, like let’s try a collection of all the dumbest stuff we can think of to try and make models more aligned in some general sense. And then I think our future work is going to build on that.

Societal impacts, that paper’s probably going to come out in the next week or so. As I mentioned, it’s called, the paper we’re coming out with is called Predictability and Surprise in Generative Models. And yeah, basically there we’re just making the point about this open-endedness and discussing both technical and policy interventions to try and yeah, to try and grapple with the open-endedness better. And I think future work in the societal impacts direction will focus on how to classify, characterize, and kind of, in a practical sense, filter or prevent these problems.

So, yeah, I mean, I think it’s prototypical of the way we want to engage with policy, which is we want to come up with some kind of technical insight and we want to express that technical insight and explore the implications that it has for, yeah, for policy makers and for the ecosystem in the field. And so here, we’re able to draw a line from hey, there’s this dichotomy where these models scale very smoothly, but have unexpected behavior. The smooth scaling means people are really incentivized to build them and we can see that happening. The unpredictability means even if the case for building them is strong from a financial or accounting perspective, that doesn’t mean we understand their behavior well. That combination is a little disquieting. Therefore we need various policy interventions to make sure that we get a good outcome from these things. And so, yeah, I think societal impacts is going to go in that general direction. 

Lucas Perry: So in terms of the interpretability release, you released alongside that some tools and videos. Could you tell me why you chose to do that? 

Daniela Amodei: Sure. Yeah. I can maybe jump in here. So it goes back sort of to some stuff we talked about a little bit earlier, which is that one of our major goals in addition to doing safety research ourselves, is to sort of help grow the field of safety, all different types of safety work sort of more broadly. And I think we ultimately hope that some of the work that we do is going to be adopted and even expanded on in other organizations. And so we chose to kind of release other things besides just an arXiv paper, because it hopefully will reach a wider number of people that are interested in these topics and in this case in interpretability. And so what we also released is, our interpretability team worked on something like I think it’s 15 hours worth of videos, and this is just a more in-depth exploration of their research for their paper which is called A Mathematical Framework for Transformer Circuits.

And so the team tried to kind of make it like a lecture series. So if you imagine somebody from the interpretability team is asked to go give a talk at a university or something, maybe they talk for an hour and they reach a hundred students, but now these are publicly available videos. And so if you are interested in understanding interpretability in more detail, you can watch them on YouTube anytime you want. As part of that release, we also put out some tools. So we released a writeup on Garcon, which is the infrastructure tool that our team used to conduct the research, and PySvelte, which is a sample library, which is used to kind of create some of the interactive visualizations that the interpretability team is kind of known for. So we’ve been super encouraged that we’ve seen other researchers and engineers playing around with the tools and watching the videos. And so we’ve already gotten some great engagement, and our kind of hope is that this will lead to more people doing interpretability research or kind of building on the work we’ve done in other places.

Dario Amodei: Yeah. I mean, a way to add to that to kind of put it in broader perspective is different areas within safety are at, I would say, differing levels of maturity. I would say something like alignment or preference modeling or reward modeling or RL from human feedback, they’re all names for the same thing. That’s an area where there are several different efforts at different institutions to do this. We have kind of our own direction within that, but starting from the original RL from Human Preference paper that a few of us helped lead a few years ago, that’s now branched out in several directions. So, we don’t need to tell the field to work in that broad direction. We have our own views about what’s exciting within it, and how to best make progress.

It’s at a slightly more mature stage. Whereas with interpretability, I would say, while many folks work on interpretability for neural nets, the particular brand of, let’s try and understand at the circuit level what’s going on inside these models, let’s try and mechanistically kind of map them and break them down, I think there’s less of that in the world and what we’re doing is more unique. And, well, I mean, that’s a good thing because we’re providing a new lens on safety, but actually if it goes on too long, it’s a bad thing because we want these things to spread widely, right. We don’t want it to be dependent on one team or one person. And so when things are at that earlier stage of maturity, it makes a lot of sense to release the tools to reduce the barrier to other people and other institutions starting to work on this.

Lucas Perry: So you’re suggesting that the interpretability research that you guys are doing is unique.

Dario Amodei: Yeah. I mean, I would just say it’s at an earlier stage, yeah. I would just say that it’s at an earlier stage of maturity. I don’t think there are other kind of large organized efforts that are, yeah, that are kind of focused on, I would say, mechanistic interpretability and especially mechanistic interpretability for language models. We’d like there to be, and there are, we know of folks who are starting to think about it and that’s part of why we released the tools. But I think, yeah, yeah, trying to mechanistically map and understand the internal principles inside large models, particularly language models, I think there’s, yeah, I think there’s less of that has been done in the broader ecosystem. 

Lucas Perry: Yeah. So I don’t really know anything about this space, but I guess I’m surprised to hear that. I imagine that industry, with how many large models it’s deploying, like Facebook or other people, they’d be interested in interpretability, in interpreting their own systems.

Dario Amodei: Yeah. I mean, I think again, I don’t want to, yeah, yeah, I don’t want to give a misleading impression here. Interpretability is a big field and there’s a lot of effort to like, why did this model do this particular thing? Does this attention head increase this activation by a large amount? People are interested in understanding the particular part of a model that led to a particular output. So there’s a lot of area in this space, but I think the particular program of like, here’s a big language model transformer, let’s try and understand what are the circuits that drive particular behaviors? What are the pieces? How do the MLPs interact with the attention heads? The kind of, yeah, the kind of general mechanistic reverse engineering approach. I think that’s less common. I don’t want to say it doesn’t happen, but it’s less common, much less common.

Lucas Perry: Oh, all right. Okay. So I guess a little bit of a different question and a bit of a pivot here, something to explore. If people couldn’t guess from the title of the podcast, you’re both brother and sister.

Daniela Amodei: Yep.

Lucas Perry: Which is, so it was pretty surprising, I guess, in terms of, I don’t know of any other AGI labs that are largely being run by a brother and sister, so yeah. What’s it like working with your sibling?

Daniela Amodei: Yeah…

Lucas Perry: Do you guys still get along since childhood?

Daniela Amodei: That’s a good question. Yeah. I can maybe start here and obviously I’m curious and hopeful for Dario’s answer. I’m just kidding. But yeah, I think honestly, it’s great. I think maybe a little bit of just history or background about us might be helpful, but Dario and I have always been really close. I think since we were very, very small, we’ve always had this special bond around really wanting to make the world better or wanting to help people. So I originally started my career in international development, so very far away from the AI space, and part of why I got interested in that is that it was an interest area of Dario’s at the time. Dario was getting his PhD in a technical field and so wasn’t working on this stuff directly, but I’m a few years younger than him and so I was very keen to understand the things that he was working on or interested in as a potential area to have impact.

And so he was actually a very early GiveWell fan I think in 2007 or 2008, and we-

Lucas Perry: Oh, wow. Cool.

Daniela Amodei: Yeah, and so we were both still students then, but I remember us sitting, we were both home from college, or I was home from college and he was home from grad school and we would sit up late and talk about these ideas, and we both started donating small amounts of money to organizations that were working on global health issues like malaria prevention when we were still both in school. And so I think we’ve always had this uniting, top level goal of wanting to work on something that matters, something that’s important and meaningful, and we’ve always had very different skills and so I think it’s really very cool to be able to combine the things that we are good at into hopefully running an organization well. So for me, I feel like it’s been an awesome experience. Now I feel like I’m sitting here nervously wondering what Dario’s answer is going to be. I’m just kidding. But yeah, for the majority of our lives, I think we’ve wanted to find something to work together on and it’s been really awesome that we’ve been able to at Anthropic.

Dario Amodei: Yeah, I agree with all that. I think what I would add to that is running a company requires an incredibly wide range of skills. If you think of most jobs, it’s like, my job is to get this research result or my job is to be a doctor or something, but I think the unique thing about running a company, and it becomes more and more true the larger and more mature it gets, is there’s this just incredibly wide range of things that you have to do. And so you’re responsible for what to do if someone breaks into your office, but you’re also responsible for does the research agenda make sense, and if some of the GPUs in the cluster aren’t behaving, someone has to figure out what’s going on at the level of the GPU kernels or the comms protocol that the GPUs use to talk to each other.

And so I think it’s been great to have two people with complementary skills to cover that full range. It seems like it’d be very difficult for just one person to cover that whole range, and so we each get to think about what we’re best at and between those two things, hopefully it covers most of what we need to do. And then of course, we always try and hire people for specialties that we don’t know anything about. But it’s made it a lot easier to move fast without breaking things.

Lucas Perry: That’s awesome. So you guys are like an archon or you guys synergistically are creating an awesome organization.

Dario Amodei: That is what we aim for.

Daniela Amodei: That’s the dream. Yeah, that’s the dream.

Lucas Perry: So I guess beneath all of this, Anthropic has a mission statement and you guys are brother and sister, and you said that you’re both very value aligned. I’m just wondering, underneath all that, you guys said that you were both passionate about helping each other or doing something good for the world. Could you tell me a little bit more about this more heart based inspiration for eventually ending up at and creating Anthropic?

Daniela Amodei: Yeah. Maybe I’ll take a stab at this and I don’t know if this is exactly what you’re looking for, but I’ll gesture in a few different directions here and then I’m sure Dario has a good answer as well, but maybe I’ll just talk about my personal journey in getting to Anthropic or what my background looked like and how I wound up here. So I talked about this in just part of what united me and Dario, but I started my career working in international development. I worked in Washington DC at a few different NGOs, I spent time working in east Africa for a public health organization, I worked on a congressional campaign, I’ve worked on Capitol Hill, so I was much more in this classic, like a friend at an old job used to call me, the classic do-gooder, trying to alleviate global poverty, trying to make policy level changes in government, trying to elect good officials.

And I felt those causes that I was working in were deeply important, and really, to this day, I really support people that are working in those areas and I think they matter so much. And I just felt I personally wasn’t having the level of impact that I was looking for, and I think that led me, through a series of steps, to wind up working in tech. I mentioned this earlier, but I started at this tech startup called Stripe. It was about 40 people when I joined and I really had the opportunity to see what it looks like to run a really well run organization when I was there. And I got to watch it scale and grow and be in this emerging area. And I think during my time there, something that became really apparent to me was just working in tech, how much of an impact this sector has on things like the economy, on human interaction, on how we live our lives in day to day ways. And Stripe, it’s a payments company, it’s not social media or something like that.

But I think there is a way that technology is a relatively small number of people having a very high impact in the world per person working on it. And I think that impact can be good or bad, and I think it was a pretty logical leap for me from there to think, wow, what would happen if we extrapolated that out to instead of it being social media or payments or file storage, to something significantly more powerful where there’s a highly advanced set of artificial intelligence systems. What would that look like and who’s working on this? So I think for me, I’ve always been someone who has been fairly obsessed with trying to do as much good as I personally can, given the constraints of what my skills are and where I can add value in the world.

And so I think for me, moving to work into AI looked… From early days, if you looked at my resume, you’d be like, how did you wind up here? But I think there was this consistent story or theme. And my hope is that Anthropic is at the intersection of this practical, scientific, empirical approach to really deeply understanding how these systems work, hopefully helping to spread and propagate some of that information more widely in the field, and to just help as much as possible to push this field in a safer and ideally, just hopefully all around robust, positive direction when it comes to what impact we might see from AI.

Dario Amodei: Yeah. I think I have a parallel picture here, which is I did physics as an undergrad, I did computational neuroscience in grad school. I was, I think, drawn to neuroscience by a mixture of, one, just wanting to understand how intelligence works, seems the fundamental thing. And a lot of the things that shape the quality of human life and human experience depend on the details of how things are implemented in the brain. And so I felt in that field, there were many opportunities for medical interventions that could improve the quality of human life, understanding things like mental illness and disease, while at the same time, understanding something about how intelligence works, because it’s the most powerful lever that we have.

I thought of going into AI during those days, but I felt that it wasn’t really working. This was before the days when deep learning was really working. And then around 2012 or 2013, I saw the results coming out of Geoff Hinton’s group, things like AlexNet, and that they were really working, and saw AI both as, hey, this might be, one, the best way to understand intelligence, and two, the things that we can build with AI, by solving problems in science and health and just solving problems that humans can’t solve yet by having intelligence that, first in targeted ways and then maybe in more general ways, matches and exceeds those of humans, can we solve the important scientific, technological, health, societal problems? Can we do something to ameliorate those problems? And AI seemed like the biggest lever that we had if it really worked well. But on the other hand, AI itself has all these concerns associated with it in both the short run and the long run. So we maybe think of it as we’re working to address the concerns so that we can maximize the positive benefits of AI.

Lucas Perry: Yeah. Thanks a lot for sharing both of your perspectives and journeys on that. I think when you guys were giving to GiveWell I was in middle school, so…

Daniela Amodei: Oh, God. We’re so old, Dario.

Dario Amodei: Yeah, I still think of GiveWell as this new organization that’s on the internet somewhere and no one knows anything about it, and just me who-

Daniela Amodei: This super popular, well known-

Dario Amodei: Just me who reads weird things on the internet who knows about it.

Daniela Amodei: Yeah.

Lucas Perry: Well, for me, a lot of my journey into x-risk and through FLI has also involved the EA community, effective altruism. So I guess that just makes me realize that when I was in middle school, there was the seeds that were…

Dario Amodei: Yeah, there was no such community at that time.

Daniela Amodei: Yeah.

Lucas Perry: Let’s pivot here then into a bit more of the machine learning, and so let’s see what the best way to ask this might be. So we’ve talked a bunch already about how Anthropic is emphasizing the scaling of machine learning systems through compute and data, and also bringing a lot of mindfulness and work around alignment and safety when working on these large scale systems that are being scaled up. Some critiques of this approach have described scaling from existing models to AGI as adding more rocket fuel to a rocket, which doesn’t mean you’re necessarily ready or prepared to land the rocket on the moon, or that the rocket is aimed at the moon.

Maybe this is lending itself to what you guys talked about earlier about the open-endedness of the system, which is something you’re interested in working on. So how might you respond to the contention that there is an upper bound on how much capability can be gained through scaling? And then I’ll follow up with the second question after that.

Dario Amodei: Yeah, so actually in a certain sense, I think we agree with that contention. So I think there’s two versions of what you might call the scaling hypothesis. One version, which I think of as the straw version or less sophisticated version, which we don’t hold and I don’t know if there’s anyone who does hold it but probably there is, is just the view that we have our 10 billion parameter language model, we have a hundred billion parameter language model. Maybe if we make a hundred trillion parameter language model, that’ll be AGI. So that would be a pure scaling view. That is definitely not our view. Even small modified forms like, well, maybe you’ll change the activation function in the transformer and you don’t have to do anything other than that. I think that’s just not right.

And you can see it just by seeing that the objective function is predicting the next word, it’s not doing useful tasks that humans do. It’s limited to language, it’s limited to one modality. And so there are some very trivial, easy to come up with ways in which literally just scaling this is not going to get you to general intelligence. That said, the more subtle version of the hypothesis, which I think we do mostly hold, is that this is a huge ingredient, not only of this, but of whatever it is that actually does build AGI. So no one thinks that you’re just going to scale up the language models and make them bigger, but as you do that, they’ll certainly get better. It’ll be easier to build other things on top of them.

So for example, if you start to say, well, you make this big language model and then you used RL with interaction with humans, to fine tune it on doing a million different tasks and following human instructions, then you’re starting to get to something that has more agency, that you can point it in different directions, you can align it. If you also add multi-modality where the agent can interact with different modalities, if you add the ability to use various external tools to interact with the world and the internet. But within each of these, you’re going to want to scale, and within each setup, the bigger you make the model, the better it’s going to be at that thing.

So in a way, the rocket fuel analogy makes sense. Actually, the thing you should most worry about with rockets is propulsion. You need a big enough engine and you need enough rocket fuel to make the rocket go. That’s the central thing. But of course, yes, you also need guidance systems, you also need all kinds of things. You can’t just take a big vat of rocket fuel and an engine and put them on a launchpad and expect it to all work. You need to actually build the full rocket. And safety itself makes that point, that to some extent, if you don’t do even the simplest safety stuff, then models don’t even do the task that’s intended for them in the simplest way. And then there’s many more subtle safety problems.

But in a way, the rocket analogy is good, but it’s I think more a pro scaling point than an anti scaling point because it says that scaling is an ingredient, perhaps a central ingredient in everything. Even though it isn’t the only ingredient, if you’re missing ingredients, you won’t get where you’re going, but when you add all the right ingredients, then that itself needs to be massively scaled. So that would be the perspective.

No one thinks that if you just take a bunch of rocket fuel in an engine and put it on a launch pad that you’ll get a rocket that’ll go to the moon, but those might still be the central ingredients in the rocket. Propulsion and getting out of the Earth’s gravity well is the most important thing a rocket has to do. What you need for that is rocket fuel and an engine. Now you need to connect them to the right things, you need other ingredients, but I think it’s actually a very good analogy to scaling in the sense that you can think of scaling as maybe the core ingredient, but it’s not the only ingredient.

And so what I expect is that we’ll come up with new methods and modifications. I think RL, model-based RL, human interaction, broad environments are all pieces of this, but that when we have those ingredients, then whatever it is we make, we’ll need to scale that multi-modality, we’ll need to scale that massively as well. So scaling is the core ingredient, but it’s not the only ingredient. I think it’s very powerful alone, I think it’s even more powerful when it’s combined with these other things.

Lucas Perry: One of the claims that you made was that people don’t think we’ll get to AGI just by scaling up present day systems. Earlier, you were talking about how we got… There are these phase transitions, right? If you go up one order of magnitude in terms of the number of parameters in the system, then you get some kind of new ability, like arithmetic. Why is it that we couldn’t just increase the order of magnitude of the number of parameters in the systems and just keep getting something that’s smarter?

Dario Amodei: Yeah. So first of all, I think we will keep getting something that’s smarter, but I think the question is will we get all the way to general intelligence? So I actually don’t exclude it, I think it’s possible, but I think it’s unlikely, or at least unlikely in the practical sense. There are a couple of reasons. Today, when we train models on the internet, we train them on an average over all text on the internet. Think of some topic like chess. You’re training on the commentary of everyone who talks about chess. You’re not training on the commentary of the world champion at chess. So what we’d really like is something that exceeds the capabilities of the most expert humans, whereas if you train on all the internet, for any topic, you’re probably getting amateurs on that topic. You’re getting some experts but you’re getting mostly amateurs.

And so even if the generative model was doing a perfect job of modeling its distribution, I don’t think it would get to something that’s better than humans at everything that’s being done. And so I think that’s one issue. The other issue is, or there’s several issues, I don’t think you’re covering all the tasks that humans do. You cover a lot of them on the internet but there are just some tasks and skills, particularly related to the physical world that aren’t covered if you just scrape the internet, things like embodiment and interaction.

And then finally, I think that even matching the performance of text on the internet, it might be that you need a really huge model to cover everything and match the distribution, and some parts of the distribution are more important than others. For instance, if you’re writing code or if you’re writing a mystery novel, a few words or a few things can be more important than everything else. It’s possible to write a 10 page document where the key parts are two or three sentences, and if you change a few words, then it changes the meaning and the value of what’s produced. But the next word prediction objective function doesn’t know anything about that. It just does everything uniformly, so if you make a model big enough, yeah, it’ll get that right, but the limit might be extreme. And so things that change the objective function, that tell you what to care about, of which I think RL is a big example, probably are needed to make this actually work correctly.

I think in the limit of a huge enough model, you might get surprisingly close, I don’t know, but the limit might be far beyond our capabilities. There’s only so many GPUs you can build and there are even physical limits.
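To make the point about the next word prediction objective concrete, here is a small sketch under stated assumptions. The per-token probabilities and the importance weights are made up for illustration, and `weighted_loss` is a hypothetical stand-in for any objective that tells the model what to care about. It is not a description of RL or of any specific training method.

```python
import math
from typing import Sequence


def next_token_loss(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood: every position counts equally."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def weighted_loss(token_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean negative log-likelihood (hypothetical importance weights)."""
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    total = sum(weights)
    return -sum(w * math.log(p) for p, w in zip(token_probs, weights)) / total


# Made-up probabilities the model assigns to the correct token at each position.
# Position 2 is the "key" token (say, the culprit's name in a mystery novel).
probs = [0.9, 0.9, 0.1, 0.9]
weights = [1.0, 1.0, 10.0, 1.0]  # assumption: the key token matters 10x more

uniform = next_token_loss(probs)
weighted = weighted_loss(probs, weights)
print(f"uniform loss:  {uniform:.3f}")
print(f"weighted loss: {weighted:.3f}")
```

Under the uniform objective, the one badly predicted key token is averaged in with three easy ones. Under the weighted objective, it dominates the loss, which is the kind of signal the plain next word objective does not provide.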

Lucas Perry: And there’s less of them, less and less of them available over time, or at least they’re very expensive.

Dario Amodei: They’re getting more expensive and more powerful. I think the price efficiency overall is improving, but yeah, they’re definitely becoming more expensive as well.

Lucas Perry: If you were able to scale up a large scale system in order to achieve an amateur level of mathematics or computer science, then would it not benefit the growth of that system to then direct that capability on itself as a self recursive improvement process? Is that not already escape velocity intelligence once you hit amateurs?

Dario Amodei: Yeah. So there are training techniques that you can think of as bootstrapping a model or using the model’s own capabilities to train it. I think AlphaGo, for instance, was trained with a method called expert iteration that relies on looking ahead and comparing that to the model’s own prediction. So whenever you have some coherent logical system, you can do this bootstrapping, but that itself is a method of training and falls into one of the things I’m talking about, where you make these pure generative models, but then you need to do something on top of them, and the bootstrapping is something that you can do on top of them. Now, maybe you reach a point where the system is making its own decisions and is using its own external tools to create the bootstrapping, to make better versions of itself, so it could be that that is someday the end of this process. But that’s not something we can do right now.

Lucas Perry: So there’s a lot of labs in industry who work on large models. There are maybe only a few other AGI labs, I can think of DeepMind. I’m not sure if there are others that… OpenAI. And there’s also this space of organizations like The Future of Life Institute or the Machine Intelligence Research Institute or the Future of Humanity Institute that are interested in AI safety. MIRI and FHI both do research. FLI does grant making and supports research. So I’m curious as to, both in terms of industry and nonprofit space and academia, how you guys see Anthropic as positioned? Maybe we can start with you, Daniela.

Daniela Amodei: Sure, yeah. I think we touched on this a little bit earlier, but I really think of this as an ecosystem, and I think Anthropic is in an interesting place in the ecosystem, but we are part of the ecosystem. And I like to think of all of these different organizations as having valuable things to bring to the table, depending on the people that work there, their leadership team, their particular focused research bet, or their mission and vision that they’re achieving, and I think they hopefully have the potential to bring safe innovations to the broader ecosystem that we’ve talked about. I think for us, our strength, or the thing that we do best, is the bet we’ve talked about, which is this empirical scientific approach to doing AI research and AI safety research in particular.

And I think for our safety research, we’ve talked about a lot of the different areas we focus on. Interpretability, alignment, societal impacts, scaling laws for empirical predictions. And I think a lot of what we’re imagining or hoping for in the future is that we’ll be able to grow those areas and potentially expand into others, and so I really think a lot of what Anthropic adds to this ecosystem or what we hope it adds is this rigorous scientific approach to doing fundamental research in AI safety.

Dario Amodei: Yeah, that really captures it in one sentence, which is I think if you want to locate us within the ecosystem, it’s an empirical iterative approach within an organization that is completely focused on making a focused bet on the safety thing. So there are organizations like MIRI or to a lesser extent, Redwood, that are either not empirical or have a different relationship to empiricism than we do, and then there are safety teams that are doing good work within larger companies like DeepMind or OpenAI or Google Brain that are safety teams within larger organizations. Then there are lots of folks who work on short term issues, and then we’re filling a space that’s working on today’s issues but with an eye towards the future, empirically minded, iterative, with an org where everything we do is designed for the safety objective.

Lucas Perry: So one facet of Anthropic is that it is a public benefit corporation, which is a structure that I’m not exactly sure what it is and maybe many of our listeners are not familiar with what a public benefit corporation is. So can you describe what that means for Anthropic, its work, its investors and its trajectory as a company?

Daniela Amodei: Yeah, sure. So this is a great question. So what is a PBC? Why did we choose to be a public benefit corporation? So I think I’ll start by saying we did quite a lot of research when we were considering what type of corporate entity we wanted to be when we were founding. And ultimately, we decided on PBC, on public benefit corporation for a few reasons. And I think primarily, it allowed us the maximum amount of flexibility in how we can structure the organization, and we were actually very lucky, to a later part of your question, to find both investors and employees who were generally very on board with this general vision for the company. And so what is a public benefit corporation? Why did we choose that structure?

So they’re fairly similar to C corporations, which is any form of standard corporate entity that you would encounter. And what that means is we can choose to focus on research and development, which is what we’re doing now, or on deployment of tools or products, including down the road for revenue purposes if we want to. But the major difference between a PBC and a C corporation is that in a public benefit corporation, we have more legal protections from shareholders if the company fails to maximize financial interests in favor of achieving our publicly beneficial mission. And so this is primarily a legal thing, but it also was very valuable for us in being able to just appropriately set expectations for investors and employees, that if financial profit and creating positive benefit for the world were ever to come into conflict, it was legally in place that the latter one would win.

And again, we were really lucky that investors, people that wanted to work for us, they said, wow, this is actually something that’s a really positive thing about Anthropic and not something that we need to work around. But I think it ended up just being the best overall fit for what we were aiming for.

Lucas Perry: So usually, there’s a fiduciary responsibility that people like Anthropic would have to its shareholders, and because it’s structured as a public benefit corporation, the public good can outweigh the fiduciary responsibility without there being legal repercussions. Is that right?

Daniela Amodei: Yeah, exactly. So shareholders can’t come sue the company and say, hey, you didn’t maximize financial returns for us, if those financial returns were to come into conflict with the publicly beneficial value of the company. So I think maybe an example here, I’ll try and think of one off the top of my head, but if we designed a language model and we felt like it was unsafe, it was producing outputs that we felt were not in line with what we wanted to see from outputs of a language model, for safety reasons or toxicity reasons or for any number of reasons. And in a normal C corporation, someone could say, “Hey, we’re a shareholder and we want the financial value that you could create from that by productizing it.” But if we said, “Actually, we want to do more safety research on it before we choose to put it out into the world,” in a PBC, we’re quite legally protected basically in a case like that. And again, I’m not a lawyer but that’s my understanding of the PBC.

Dario Amodei: Yeah. A useful, holistic way to think about it is there’s the legal structure, but I think often, these things, maybe the more important thing about them is that they’re a way to explain your intention, to set the expectations for how the organization is going to operate. Often, things like that and the expectations of the various stakeholders, and making sure that you give the correct expectations and then deliver on those expectations so no one is surprised by what you’re doing and all the relevant stakeholders, the investors, the employees, the outside world gets what they expect from you, that can often be the most important thing here. And so I think what we’re trying to signal here is on one hand, a public benefit corporation, it is a for-profit corporation.

We could deploy something. That is something that we may choose to do and it has a lot of benefits in terms of learning how to make models more effective, in terms of iterating. But on the other hand, the mission is really important to us and we recognize that this is an unusual area, one that’s more fraught with market externalities, would be the term that I would use, of all kinds, in the short term, in the long term, related to alignment, related to policy and government, than a typical area. It’s different than making electric cars or making widgets or something like that, and so that’s the thing we’re trying to signal.

Lucas Perry: What do you think that this structure potentially means for the commercialization of Anthropic’s research?

Daniela Amodei: Yeah, I think again, part of what’s valuable about a public benefit corporation is that it’s flexible, and so it is a C corporation, it’s fairly close to any standard corporate entity you would meet and so the structure doesn’t really have much of a bearing outside of the one that we just talked about on decisions related to things like productization, deployment, revenue generation.

Lucas Perry: Dario, you were just talking about how this is different than making widgets or electric cars, and one way that it’s different from widgets is that it might lead to massive economic windfalls.

Dario Amodei: Yeah.

Lucas Perry: Unless you make really good widgets or widgets that can solve problems in the world. So what is Anthropic’s view on the vast economic benefits that can come from powerful AI systems? And what role is it that you see C corporation AGI labs playing in the beneficial use of that windfall?

Dario Amodei: Daniela, you want to go…

Daniela Amodei: Go for it.

Dario Amodei: Yeah. So yeah, I think a way to think about it is, assuming we can avoid the alignment problems and some other problems, then there will be massive economic benefits from AI or AGI or TAI or whatever you want to call it, or just AI getting more powerful over time.

And then again, thinking about all the other problems that I haven’t listed, which is today’s short term problems and problems with fairness and bias, and long-term alignment problems and problems that you might encounter with policy and geopolitics. Assuming we address all those, then there is still this issue of economic… Like are those benefits evenly distributed?

And so here, as elsewhere, I think it’s unlikely those benefits will all accrue to one company or organization. I think this is bigger than one company or one organization, and is a broader societal problem. But we’d certainly like to do our part on this and this is something we’ve been thinking about and are working on putting programs in place with respect to. We don’t have anything to share about it at this time, but this is something that’s very much on our mind.

I would say that, more broadly, I think the economic distribution of benefits is maybe one of only many issues that will come up. Which is the disruptions to society that you can imagine coming from the advent of more powerful intelligence are not just economic. They’re already causing disruptions today. People already have legitimate and very severe societal concerns about things that models are doing today and you can call them mundane relative to all the existential risk. But I think there are already serious concerns about concentration of power, fairness and bias in these models, making sure that they benefit everyone, which I don’t think that they do yet.

And if we then put together with that, the ingredient of the models getting more powerful, maybe even on an exponential curve, those things are set to get worse without intervention. And I think economics is only one dimension of that. So, again, these are bigger than any one company. I don’t think it’s within our power to fix them, but we should do our part to be good citizens and we should try and release applications that make these problems better rather than worse.

Lucas Perry: Yeah. That’s excellently put. I guess one thing I’d be interested in is if you could, I guess, give some more examples about these problems that exist with current day systems and then the real relationship that they have to issues with economic windfall and also existential risk.

I think it seems to me like tying these things together is really important. At least seeing the interdependence and relationship there, some of these problems already exist, or we already have example problems that are really important to address. So could you expand on that a bit?

Dario Amodei: I think maybe the most obvious one for current day problems is people are worried, very legitimately, that big models suffer from problems of bias, fairness, toxicity, and accuracy. I’d like to apply my model in some medical application and it gives the wrong diagnosis, or it gives me misinformation or it fabricates information. That’s just not good. These models aren’t usable and they’re harmful if you try and use them.

I think toxicity and bias are issues when models are trained on data from the internet. They absorb the biases of that data. And there’s maybe even more subtle algorithmic versions of that, where, I hinted at it a little before, where it’s like the objective function of the model is to say something that sounds like what a human would say or what a human on the internet would say. And so in a way, almost fabrication is kind of like baked into the objective function.

Potentially, even bias and stereotyping you can imagine being baked into the objective function in some way. So, these models want to be used for very mundane everyday things like helping people write emails or helping with customer surveys or collecting customer data. And if they’re subtly biased or subtly inaccurate, then those biases and those inaccuracies will be inserted into the stream of economic activity in a way that may be difficult to detect. So, that seems bad and I think we should try to solve those problems before we deploy the models. But also they’re not as different from the large scale problems as they might seem.

In terms of the economic inequality, I don’t know, just look at the market capitalization of the top five tech companies in the world. And compare that to the US economy. There’s clearly something going on in the concentration of wealth.

Daniela Amodei: I would just echo everything Dario said. And also add, I think something that especially can be alarming in sort of a short term way today in the sense that it could belie things to come, is how quietly and seamlessly people are becoming dependent on some of these systems. We don’t necessarily even know, there’s no required disclosure of when you’re interacting with an AI system versus a human and until very recently, that was sort of a comical idea because it was so obvious when you were interacting with a person versus not a person. You know when you’re on a customer chat and it’s a human on the other end versus an automated system responding to you.

But I think that line is getting increasingly blurred. And I can imagine that even just in the next few years, that could start to have fairly reasonably large ramifications for people in day-to-day ways. People talk to an online therapist now, and sometimes that is backed by an AI system that is giving advice. Or down the road, we could imagine things looking completely different in health realms, like Dario talked about.

And so I think it’s just really important as we’re stepping into this new world to be really thoughtful about a lot of the safety problems that he just outlined and talked about because I think, I don’t know that most people necessarily even know all the ways in which AI is impacting our kind of day-to-day lives today, and the potential that could really go up in the near future.

Lucas Perry: The idea of there being like a requirement of AIs disclosing themselves as AI seems very interesting and also adjacent to this idea of the way that C corporations have fiduciary responsibility to shareholders, having AI systems that also have some kinds of responsibility towards the people that they serve, where they can’t be secretly working towards the interests of the tech company that has the AI listening to you in your house all the time.

Dario Amodei: Yeah. It’s another direction you can imagine. It’s like I talked to an AI produced by Megacorp but it subtly steers my life to the benefit of Megacorp. Yeah, there’s lots of things you can come up with like this.

Daniela Amodei: These are important problems today. And I think they also really belie things that could be coming in the near future, and I think solving whatever, those particular problems are ones lots of groups are working on, but I think helping to solve a lot of the fundamental building blocks underlying them; about getting models to be truthful, to be harmless, to be honest. A lot of the goals are aligned there, both for sort of short, medium and potentially long-term safety.

Lucas Perry: So Dario, you mentioned earlier that of the research that you publish, one of your hopes is that other organizations will look into and expand the research that you’re doing. I’m curious if Anthropic has a plan to communicate its work and its ideas about how to develop AGI safely with both technical safety researchers, as well as with policy makers.

Daniela Amodei: Yeah, maybe I’ll actually jump in on this one, and Dario feel free to add as much as you like. But I actually think this is a really important question. I think communication with policy makers about safety with other labs in the form of papers that we publish is something that’s very important to us at Anthropic.

We have a policy team, it’s like 1.5 people right now. So we’re hiring, that’s kind of a plug as well, but I think their goal is to really take the technical content that we are developing at Anthropic and translate that into something that is actionable and practical for policymakers. And I think this is really important because the concepts are very complex, and so it’s a special skill to be able to take things that are highly technical, potentially very important, and translate that into recommendations or work with policy makers to come up with recommendations that could potentially have very far reaching consequences.

So, to point to a couple of things we’ve been working on here, we’ve been supporting NIST, which is the National Institute of Standards and Technology, on developing something called an AI Risk Management Framework. And the goal of that is really developing more monitoring tools around AI risk and AI risk management. We’ve also been supporting efforts in the US and internationally to think about how we can best support academic experimentation, which we talked about a little bit earlier with large scale compute models too.

Lucas Perry: You guys also talked a lot about open-endedness, and was part of all this alignment and safety research looking into ways of measuring safety and open-endedness?

Daniela Amodei: Yeah, there’s actually some interesting work which I think is also in this upcoming paper and in various other places that we’ve been looking into around the concept of AI evaluations or AI monitoring. And I think both of those are potentially really important because a lot of what we’re seeing, or maybe lacking, and this kind of goes back to this point I made earlier about standards is, how do we even have a common language or a common framework within the AI field of what outputs or metrics we care about measuring.

And until we have that common language or framework, it’s hard to set things like standards across the industry around what safety even means. And so, I think AI evaluations is another area that our societal impacts team, which is also like the other half of the one and a half people in policy, it’s also 1.5 people, is something that they’ve been working on as well.

Lucas Perry: Right, so a large part of this safety problem is of course the technical aspect of how you train systems and create systems that are safe and aligned with human preferences and values. How do you guys view and see the larger problem of AI governance and the role and importance of governments and civil society in working towards the safe and beneficial use and deployment of AI systems?

Daniela Amodei: We talked about this one a little bit earlier, and maybe I’ll start here. And obviously, Dario jump in if you want. But I do think that these other kind of institutions that you talked about have this really important role to play. And again, one of the things we mention in this paper is that we think government has already been starting to fund a lot more academic safety research. And I think that’s an area that we… A concrete policy recommendation is, hey, go do more of that. That would be great.

But I also think groups like civil society and NGOs, there’s a lot of great organizations in this space, including FLI and others, that are thinking about what do we do? Say we develop something really powerful, what’s the next step? Whether that’s at an industry lab, in government, in academia, wherever. And I think there’s a way that industry incentives are not the same as nonprofit groups or as civil society groups. And I think to go back to this analogy of an ecosystem, we really need thoughtful and empowered organizations that are working on these kinds of questions, fundamentally outside of the industry sphere, in addition to the policy research and work that’s being done at labs.

Dario Amodei: Yeah, another way you can think of things in line with this is I think maybe at some point laws and regulations are going to be written. And I think probably those laws and regulations work best if they end up being formalizations of what’s realized to be the best practices, and those best practices can come from different industrial players, they can come from academics figuring out what’s good and what’s not. They can come from nonprofit players. But if you try and write a law ahead of time, often you don’t know what… If you write a law that relates to a technology that hasn’t been invented yet, it’s often not clear what the best thing to do is, and what is actually going to work or make sense, or even what categories or words to use.

But if something has become a best practice and folks have converged on that, and then the law formalizes it and puts it in place, that can often be a very constructive way for things to happen.

Lucas Perry: Anthropic has received an impressive amount of series A funding. And so it seems like you guys are doing a lot of hiring and growing considerably. So, in case there’s anyone from our audience that’s interested in joining Anthropic, what are the types of roles that you expect to be hiring for?

Daniela Amodei: Yes, great question. We are definitely hiring. We’re hiring a lot. And so I think the number one thing I would say is if you’re listening to this podcast and you’re interested, I would highly recommend just checking out our jobs page, because that will be the most up to date. And that’s just anthropic.com on the careers tab. But we can also send that around if that’s helpful.

But what are we looking to hire? Quite a few things. So most critically, probably right now, we’re looking to hire engineers and we’re actually very bottle-necked on engineering talent right now. And that’s because running experiments on AI systems is something that requires a lot of custom software and tooling. And while machine learning experience is helpful for that, it isn’t necessarily required.

And I think a lot of our best ML engineers or research engineers came from a software engineering or infrastructure engineering background, hadn’t necessarily worked in ML before, but were just really excited to learn. So, I think if that describes you, if you’re a software engineer, but you’re really interested in these topics, definitely think about applying because I think there’s a lot of value that your skills can provide.

We’re also looking for just a number of other roles. I won’t be able to list them all, you should just check out our jobs page. But off the top of my head, we’re looking for front-end engineers to help with things like interfaces and tooling for the research we’re doing internally. We’re looking for policy experts, operations people, security engineers, data visualization people, security.

Dario Amodei: Security.

Daniela Amodei: Security, yes. We’re definitely looking-

Dario Amodei: If you’re building big models.

Daniela Amodei: Yes. Security is something that I think is-

Dario Amodei: Every industrial lab should make sure their models are not stolen by bad actors.

Daniela Amodei: This is a unanimous kind of thing across all labs. There’s something everyone really agrees on in industry and outside of industry, which is that security is really important. And so, if you are interested in security or you have a security background, we would definitely love to hear from you, or I’m sure our friends at other industry labs and non-industry labs would also love to hear from you.

I would also say, I sort of talked about this a little bit before, but we’ve also just kind of had a lot of success in hiring people who were very accomplished in other fields, especially other technical fields. And so, we’ve alluded a few times to former recovering physicists or people who have PhDs in computer science or ML, neuroscientists, computational biologists.

And so, I think if you are someone who has this strong background and set of interests in a technical field that’s not related to ML, but sort of moderately adjacent, I would also consider applying for our residency program. And so I think again, if you’re even a little curious, I would say, just check out our jobs page, because there’s going to be more information there, but those are the ones off the top of my head. And Dario, if I missed any, please jump in.

Dario Amodei: Yeah, that covers a pretty wide range.

Lucas Perry: Could you tell me a little bit more about the team and what it’s like working at Anthropic?

Daniela Amodei: Yeah, definitely. You’ll probably have to cut me off here because I’ll talk forever about this because I think Anthropic is a great team. Some basic stats, we’re about 35 people now. Like I said a few times, we’ve kind of come from a really wide range of backgrounds. So this is people who worked in tech companies as software engineers. These are former academics in physics, ethics, neuroscience, a lot of different areas, machine learning researchers, policy people, operations staff, so much more.

And I think one of the unifying themes that I would point to in our employees is a combination of a set of two impulses that I think we’ve talked about a lot in this podcast. And I think the first is really just a genuine desire to reduce the risks and increase the potential benefits from AI. And I think the second is a deep curiosity to really scientifically and empirically describe, understand, predict, model-out how AI systems work and through that deeper understanding, make them safer and more reliable.

And I think some of our employees identify as effective altruists which means they’re especially worried about the potential for long term harms from AI. And I think others are more concerned about immediate or sort of emerging risks that are happening today or in the near future. And I think both of those views are very compatible with the goals that I just talked about. And I think they often just call for a mixed-method approach to research, which I think is a very accurate description of how things look in a day-to-day way at Anthropic.

It’s a very collaborative environment. So, there’s not a very strong distinction between research and engineering, researchers write code, engineers contribute to research. There’s a very strong culture of pair programming across and within teams. There’s a very strong focus on learning. I think this is also just because so many of us come from backgrounds that were not necessarily ML focused in where we started.

So people run these very nice, little training courses, where they’ll say, “Hey, if you’re interested in learning more about transformers, I’m a transformers expert and I’ll walk you through it at different levels of technical skills so that people from the operations team or the policy team can come for an introductory version.”

And then I think outside of that, I like to think we’re a nice group of people. We all have lunch together every day. We have this very lovely office space in San Francisco, it’s fairly well attended. And I think we have lots of fun lunch conversations ranging from things like… A recent one was we were sort of talking about microCOVID, if you know the concept of microCOVID, Catherine Olsson, who’s one of the creators of microcovid.org, which is basically a way of assessing the level of risk from a given interaction or a given activity that you’re doing during COVID time.

So we had this fun meta conversation where we’re like, “How risky is this conversation that we’re having right now from a microCOVID perspective, if we all came into the office and tested, but we’re still together indoors and there’s 15 of us, what does that mean?” So anyway, I think it’s a fun place to work. We’ve obviously had a lot of fun getting to build it together.

Dario Amodei: Yeah. The things that stand out to me are trust and common purpose. They’re enormous force multipliers where it shows up in all kinds of little things where if you have… You can think about it in things like compute allocation. If people are not on the same page, if one person wants to advance one research agenda, the other wants to advance their other research agenda, then people fight over it. And there’s a lot of zero sum or negative sum interactions.

But if everyone has the attitude of, we’re trying to do this thing, everything we’re trying to do is in line with this common purpose and we all trust each other to do what’s right to advance this common purpose, then it really becomes a force multiplier on getting things done while keeping the environment comfortable, and while everyone continues to get along with each other. I think it’s an enormous superpower that I haven’t seen before.

Lucas Perry: So, you mentioned that you’re hiring a lot of technical people from a wide variety of technical backgrounds. Could you tell me a little bit more about your choice to do that rather than simply hiring people who are traditionally experienced in ML and AI?

Daniela Amodei: Yeah, that’s a great question. So I should also say we have people from both camps that you talked about, but why did we choose to bring people in from outside the field? I think there’s a few reasons for this. I think one is, again, ML and AI is still a fairly new field. Not super new, but still pretty new. And so what that means is there’s a lot of opportunity for people who have not necessarily worked in this field before to get into it. And I think we’ve had a lot of success or luck with taking people who are really talented in a related field and helping to take their skills and translate them to the ones in ML and AI safety.

And I think the second reason is, so one is just expanding the talent pool. I think the other is, it really does broaden the range of perspectives and the types of people who are working on these issues, which we think are very important. And again, we’ve talked about this previously, but having a wider range of views and perspectives and approaches tends to lead to a more robust approach to doing both basic research and safety research.

Dario Amodei: Yeah. Nothing to add to that. I’m surprised at how often someone who has experience in a different field can come in, and it’s not like they’re directly applying things that come, but they think about things in a different way. And of course this is true about all kinds of things, this is true about diversity in the more traditional senses as well. But you want as many different kinds of people as you can get.

Lucas Perry: So as we’re wrapping up here, I’m curious just to get some more perspective on you guys about, given these large scale models, the importance of safety and alignment and the problems which exist today, but also the promises of the impact they could have for the benefit of people. What’s a future that each of you is excited about or what’s a future that you’re hopeful for? Given your work at Anthropic and the future impacts of AI?

Daniela Amodei: Yeah, I’ll start. So I think one thing I do believe is actually I am really hopeful about the future. I know that there’s a lot of challenges that we have to face to get to a potentially really positive place. But I think the field will rise to the occasion, or that’s kind of my hope. And I think some things I’m hoping for in the next few years is that a lot of different groups will be developing more practical tools, techniques for advancing safety research. And I think these are likely to hopefully become more widely available if we can set the right norms in the community. And I think the more people working on safety-related topics, that can positively feed on itself.

And I think I’m most broadly hoping for a world where we can feel confident that when we’re using AI for more advanced purposes, like accelerating scientific research, that it’s behaving in ways where we can be very confident and sure that we understand that it’s not going to lead to negative, unintended consequences.

And the reason for that is because we’ve really taken the time to chart them out and understand what all of those potential problems could be. And so I think that’s obviously a very ambitious goal, but I think if we can make all of that happen, there’s a lot of potential benefits of more advanced AI systems that I think could be transformative for the world, from almost anything you can name; renewable energy, health, disease detection, economic growth, and lots of other just day-to-day enhancements to how we work and communicate and live together.

Dario Amodei: No one really knows what’s going to happen in the future. It’s extremely hard to predict. And so I often find any question about the future, it’s more about the attitude or posture that you want to take than it is about concrete predictions, because I feel like particularly after you go a few years out, it’s just very hard to know what’s going to happen. And so, it’s mostly just speculation. And so in terms of attitude, I think, well, first of all, I think the two attitudes that I find least useful are blind pessimism and blind optimism because they’re actually sort of like doom saying and Pollyannaism. It weirdly is possible to have both at once.

But I think it’s just not very useful because it’s like we’re all doomed. It’s intended to create fear or it’s intended to create complacency. I find that an attitude that’s more useful is to just say, “Well, we don’t know what’s going to happen, but let’s, as an individual or as an organization, let’s pick a place where there’s a problem we think we can help with and let’s try and make things go a little better than they would’ve otherwise.” Maybe we’ll have a small impact, maybe we’ll have a big impact, but instead of trying to understand what’s going to happen with the whole system, let’s try and intervene in a way that helps with something that we feel well-equipped to help with. And of course, the whole outcome, it’s going to be beyond the scope of one person, one organization, even one country.

But I think we find that to be a more effective way of thinking about things. And for us, that’s can we help to address some of these safety problems that we have with AI systems in a way that is robust and enduring and that points towards the future? If we can increase the probability of things going well by only some very small amount, that may well be the most that we can do.

I think from our perspective, the things that I would really like to see are, I would like it if AI could advance science, technology, and health in a way that’s equitable for everyone, and that it could help everyone to make better decisions and improve human society. And right now, I, frankly, don’t really trust the AI systems we build today to do any of those things. Even if they were technically capable of the task, which they’re not, I wouldn’t trust them to do those things in a way that makes society better rather than worse.

And so I’d like us to do our part to make it more likely that we could trust AI systems in that way. And if we can make a small contribution to that while being good citizens in the broader ecosystem, that’s maybe the best we can hope for.

Lucas Perry: All right. And so if people want to check out more of your work or to follow you on social media, where are the best places to do that?

Daniela Amodei: Yeah. On anthropic.com is going to be the best place to see most of the recent stuff we’ve worked on. I don’t know if we have everything posted, but- 

Dario Amodei: We have several papers out, so we’re now about to post links to them on the website.

Daniela Amodei: In an easy to find place. And then we also have a Twitter handle. I think it’s Anthropic on Twitter, and we generally also tweet about our recent releases of our research. 

Dario Amodei: We are relatively low key. We really want to be focused on the research and not get distracted. I mean, the stuff we do is out there, but we’re very focused on the research itself and getting it out and letting it speak for itself.

Lucas Perry: Okay. So, where’s the best place on Twitter to follow Anthropic?

Daniela Amodei: Our Twitter handle is @anthropicAI.

Lucas Perry: All right. I’ll include a link to that in the description of wherever you’re listening. Thanks a ton for coming on, Dario and Daniela, it’s really been awesome and a lot of fun. I’ll include links to Anthropic in the description. It’s a pleasure having you and thanks so much.

Daniela Amodei: Yeah, thanks so much for having us, Lucas. This was really fun.



Anthony Aguirre and Anna Yelizarova on FLI’s Worldbuilding Contest

  • Motivations behind the contest
  • The importance of worldbuilding
  • The rules of the contest
  • What a submission consists of
  • Due date and prizes



0:00 Intro

2:30 What is “worldbuilding” and FLI’s Worldbuilding Contest?

6:32 Why do worldbuilding for 2045?

7:22 Why is it important to practice worldbuilding?

13:50 What are the rules of the contest?

19:53 What does a submission consist of?

22:16 Due dates and prizes?

25:58 Final thoughts and how the contest contributes to creating beneficial futures


Lucas Perry: Welcome to the Future of Life Institute Podcast. I’m Lucas Perry. Today’s episode is with FLI’s Anthony Aguirre and Anna Yelizarova and is meant to provide information about the FLI worldbuilding contest. In short, this is a contest which invites teams from across the globe to compete for a prize purse of up to $100,000 by designing visions of a plausible and aspirational future that includes strong artificial intelligence. If you want to know more about this competition and how to get involved, you can listen to this podcast, or head over to worldbuild.ai for more information. 

Before we jump into the interview, and in case you didn’t catch it in the David Chalmers episode, I will be moving on from my role as Host of the FLI Podcast, and this means two things. The first is that FLI is hiring for a new host for the podcast. As host, you would be responsible for the guest selection, interviews, production, and publication of the FLI Podcast. If you’re interested in applying for this position, you can head over to the careers tab at futureoflife.org for more information. We also have another 4 job openings currently for a Human Resources Manager, an Editorial Manager, an EU Policy Analyst, and an Operations Specialist. You can learn more about those at the careers tab as well. 

The second item is that even though I will no longer be the host of the FLI Podcast, I won’t be disappearing from the podcasting space. I’m starting a brand new podcast focused on exploring questions around wisdom, philosophy, science, and technology, where you’ll see some of the same themes we explore here like existential risk and AI alignment. I’ll have more details about my new podcast soon. If you’d like to stay up to date, you can follow me on Twitter at LucasFMPerry, link in the description. 

And with that, I’m happy to introduce Anthony Aguirre and Anna Yelizarova on FLI’s new Worldbuilding Contest.

So welcome to the podcast, Anna and Anthony. It’s great to have you. Today, we’re here to talk about FLI’s new Worldbuilding Contest, which is quite a new exciting initiative that you guys have both put a ton of time into working on. So I have a two-part question to start things off here. The first is what is worldbuilding? And the second is, what is FLI’s Worldbuilding Contest?

Anthony Aguirre: Well, why don’t I start out with worldbuilding itself? So worldbuilding is the process of kind of constructing a fictitious world in which a story, or a movie, or a novel or something takes place. So if you think about, for example, Star Wars, there’s the Star Wars movies, and then there’s the Star Wars world that they inhabit. It has certain rules, like, there’s the force, there are spaceships, faster-than-light travel is pretty easy. AI is apparently really hard because there’s only like human-level robots. There are different politics that are happening in the galaxy. There’s a certain level of span of technologies. So there are all these kind of rules to the world. And then within that set of rules, there are a lot of different artifacts.

So there’s Tatooine, the desert planet. And it’s got its own whole feel, and its politics and sociology and things. And then there are other planets and then there’s the shape of the Star Destroyer. So there are all these things that have been constructed creatively to inhabit that world that is governed by some set of rules. So worldbuilding is the process of constructing that fictitious reality, including all the details of how do the politics work? What is the technology? What is in it? What sorts of people are in it? What is its history? What has happened in the past? And so on. And the idea is to give you a backdrop for really imagining that you’re in this world.

So a really good worldbuild is kind of this evocative thing where you feel like, ah, I could be in that. It’s something that I can sort of experience in my mind. And that it gives you the feeling of reality, because it has been thought through in this self consistent way. You kind of understand what the rules are and you’d be surprised if certain things happen in Star Wars that come from Star Trek or vice versa, those are different worlds that have different sets of rules in them. So this is a process that’s been developed both kind of informally. Anybody who’s writing say a science fiction novel, or a fantasy novel or something, has some worldbuilding element to it, because there’s an imagined world in which their story is taking place. But it’s also been sort of developed more professionally. So there’s a whole industry say in Hollywood of constructing the world that the Marvel universe or that Star Wars or that Star Trek inhabits.

So there are people who are actually doing this for a living. Building fictitious worlds and inventing the artifacts that are in those worlds. And there are worldbuilding programs, university programs where you can learn how to do this process. So it’s a minor industry, but an important one because it’s in a lot of our media. The idea of this contest is to sort of re-task this way of thinking about things, of constructing fictitious worlds, to try to construct some plausible and aspirational versions of our own actual world. So not necessarily for some other purpose, to put a story into, but to investigate those worlds on their own and to enjoy the process of making them and think about what goes into making that world. And then kind of explore the variety of different worlds that people come up with.

So the idea of the FLI Worldbuilding Contest is to sort of create a competition where the goal is to, as teams, invent a sort of fictitious world that exists in 2045. It’s a world that should make sense, be internally consistent, follow the laws of physics, have plausible technology for 2045, and we’ll get to some of those ground rules. But very importantly, it’s also supposed to be aspirational, so it’s supposed to be a world that we would like to inhabit. And we’ll talk a little bit about why we chose to do it that way instead of just any old world in any old time and so on. But the idea is to gather up lots of interesting contributions from teams around the world and incentivize them with this contest with a nice juicy prize purse to get people really working hard and putting effort into this worldbuild.

Lucas Perry: Why is it important to do worldbuilding for 2045?

Anna Yelizarova: I think the year 2045 is interesting because it’s still in the somewhat near future where most of us would still be alive, so it’s very easy to imagine, as opposed to a very distant world where you can reimagine almost everything. Here we’re trying to keep a lot of what we know exists today, but then, with a set of constraints, help us do this thought experiment about how we manage to overcome certain challenges that we already see on the horizon. So the idea of worldbuilding for 2045 is just a constraint to focus this exercise, but we might do a different worldbuild that’s in a more distant future. But for now, this is to focus ourselves.

Lucas Perry: Given humanity’s track record of ramrodding our way into new technologies and worlds, just following natural economic incentives, why is it important that we have a worldbuilding contest, that we practice worldbuilding?

Anthony Aguirre: So I think both as individuals and as a society, we’re fairly goal-directed, in general. We have goals for our personal life on a day-to-day basis, and on a longer time scale, we have goals to have a good career, to be happy in this way and that, to have a good relationship… Maybe to have kids, maybe for them to have good things happen to them. So we have these long-term goals and we work toward them. If we didn’t have those goals, if we just every day woke up and went through some random set of motions, that would be an okay way to live, but we would have a very different life than if we had choices about what were more and less desirable for our life and aimed toward them.

And I think as a society, we can do much the same thing. We can have some level of goals as a society and work toward them. I think often we have done that less lately in society than we perhaps did in the past. I think there is a sense that there’s progress, but it’s mostly technological progress and it’s maybe a little bit of social progress, but it’s kind of just pushing us along, and we’re just going where the techno-social progress takes us. And there’s not really much we can do about that. There’s kind of capitalism, and there’s technology, and wherever they go, we just have to ride it out as best we can. And I think this is a very disempowered way to look at the world.

We, as a society, just like as individuals, have a lot of agency as to what happens to us. We make decisions, and those decisions have real consequence. And part of the idea of this contest is to do a little bit more thinking about what are some possible goals. If we… think about the world 25 years from now, if we imagine a world that we actually want to inhabit, we’re not going to end up living in that world. The world is too unpredictable and things are not going to go the way that we want, just like regular life. But if you don’t have any goal at all, it’s very hard to know what to work toward, and what to do now in order to get there. So the idea here is to kickstart a process of thinking through what would we like the future to look like?

Not just vaguely, like we haven’t destroyed the world through global warming, or through AI catastrophe, or through biotech catastrophe. That’s good, that’s important. We really, definitely want to not destroy the world. But going a little bit beyond that, what do we actually want it to look like? And are there things that we can do now to start to plant the seeds for that kind of world? And you can’t plant the seeds now if you don’t really know what kind of world you want to grow into. So the goal of this is to sort of plant some posts down the line two decades from now of, wow, wouldn’t it be cool if the world was like this? Here’s what would have to happen between now and then if the world was going to be something like that. And again, that probably will go astray.

It’s not going to end up quite the way you want it to. But I think, just as in your individual life, pushing in a more positive direction and having a goal doesn’t guarantee that you’re going to reach it, but it is probably more positive, and you’re going to get closer to that goal than if you don’t have a goal at all. Or if you have some radically different goal. So the idea here is to start that process and to engage a lot of the creativity that has gone into imagining negative worlds. So there’s a lot of effort that has gone into imagining dystopias and just various ways that the world can go off the rails. I think this is good. I think we as individuals, we also imagine all the things that can go wrong. As a parent, you think of every possible thing that can go wrong with your kids. Still, I find more.

And then that’s good, because that’s how you protect your kids from getting run over by a car, eaten by a lion, or whatever. You imagine these things and you prevent them. And as a society, we definitely have to do this too. But if all you’re ever imagining for your kids is all the terrible things that can happen to them, then you’re also not going to be a great parent, because you’re not going to be thinking about the opportunities and you’re not going to be weighing risks against benefits and so on. So I think it is really important to think about all the ways that things can go wrong and work to prevent them, but we don’t want to just be living in this idea that the world is definitely going to be a catastrophe a little bit down the line. And we don’t want to only be focused on the way that everything can go wrong. We want to spend some time thinking about what we would like in sort of concrete and evocative detail.

Lucas Perry: Anna, is there anything else that you’d like to add here in terms of perspective on what worldbuilding is and why it’s important?

Anna Yelizarova: Well, I guess what Anthony is touching on is not only the importance of worldbuilding, but also positive worldbuilding, aspirational worldbuilding, and thinking more positively about the future. I think it’s a much harder task to imagine hopeful futures than it is to imagine everything that can go wrong, because for you to have a rich, detailed worldbuild of a positive future, you actually have to have answers to some of the most pressing challenges of our time. And that thought exercise is very valuable. And there’s so many takes on it. And I think we really want to hear from a very diverse set of people to see both what people want and also how we can get there. And the idea is not just to keep talking to our existing community ecosystem, but really to branch out to as many people around the world, to people who are in different fields and really to get to hear from them, because it’s going to be hard to have a consensus on what kind of future we want.

So part of this worldbuilding contest is also being open to different perspectives and hearing each other out, because even agreeing on a future we want is a huge, huge challenge. So, yep, the contest is definitely aiming at a very, very broad audience. You could be a scientist, a policy researcher, a creative, a digital artist, a writer. You could be from any discipline and still somehow contribute original thoughts and input in this contest. And we really don’t want to discourage anyone from applying. So I think it’s part of a bigger conversation, and worldbuilding helps us get there.

Lucas Perry: So in terms of worldbuilding, it seems like there needs to be a lot of constraints that help people… For example, in a contest to create that world. So I’m curious, Anna, if you could explain some more about the actual ground rules of this contest.

Anna Yelizarova: So the ground rules for the contest, or the set of constraints we chose for this thought exercise, are as follows. First of all, the year is 2045, so we’re still somewhat in the near future. We could imagine most of us would be conceivably alive in 2045. AGI has existed for at least five years. So we are intentionally choosing to make artificial intelligence a big focus of our world and of this contest, which is also a big focus area of FLI.

AGI is artificial general intelligence. And the thing to note about AGI is, it’s basically artificial intelligence that has reached this milestone of being at least as good as a human in every task. So we’re talking very advanced AI in this world. Then technology is advancing very rapidly, and AI is transforming the world sector by sector. So AI is the biggest leap in technology we’re seeing in this world, but AI is affecting every single industry, every single sector. So if you have a focus in any other domain than AI, you could probably imagine AI transforming your industry and could choose to focus on that in the contest. Anthony, would you like to take the next set of rules?

Anthony Aguirre: Yeah. So then we kind of thought of AI as the big transformative change from what we know of today. Of course, we know that lots of things are changing in the world. Lots of technologies are advancing. And geopolitically and socially, things are evolving as well. But we wanted to keep some of the focus on what particularly is happening with AI. And so we chose to try to sort of maintain something like the current world insofar as possible in the other sectors. So for example, right now there are kind of major geopolitical centers in the US and the EU and Asia, and especially China. So we kind of kept that. So the idea is that in 2045, there will still be the US and the EU and China as kind of three major centers of power.

There won’t be like one world government and there won’t be a million different balkanized, decentralized powers or something. You can imagine both of those, but just to ground ourselves a little bit, we chose that. Other regions in the world, India, Africa, South America, are also advancing, but just aren’t still quite as much on the center of the world stage. Another thing that we wanted to be a little bit conservative in, and also a little bit positive in, was to say that there just haven’t been any major wars or other global catastrophes. So we haven’t had COVID 20 that killed half of humanity. And we haven’t had nuclear war, say between the US and Russia or anything like that. So we’ve kind of muddled along at least geopolitically and haven’t had any catastrophes that went totally awry. And pushing a little bit further in the optimistic direction, the world is generally looking pretty good.

So part of the idea of this contest is to look for positive visions of the future and things that we might want to aspire to. And so part of the ground rules is just that the world isn’t dystopian. We’re not living in 1984 or any of the other many depicted dystopias. We’re in one of the very few worlds that people would feel pretty good about being in. It’s a somewhat funny thing that in fiction, most of the time dystopia kind of comes from trying to develop a utopia and it goes wrong. And we’re sort of very used to this. It’s almost hard to adjust your thinking to create a world that is actually good, because it’s so easy to think of all the different ways that things can go wrong and be bad.

It’s fun and almost liberating to think about a world that’s actually good unironically and unapologetically, like this is a world I’d like to live in. And that’s sort of what we’re asking for here. It’s notable that in a lot of these things, we’ve kept the world kind of similar to how it is now, geopolitically and technologically in wars and stuff. And the world is actually pretty okay at the moment, at least for a lot of people. At least we’re not living in a dystopia for most people. So in that way, it’s conservative, but it’s important to emphasize that the addition of artificial general intelligence is a huge change, that having the ability to replace human labor with machines, not just physical labor, but intellectual labor, means that most jobs can be done by machine, even what we now call thought work and intellectual work can be done by machines.

Productivity will be skyrocketing. So many more things will be possible. There will be inventions that are coming directly out of AGI and AGI-human collaborations. So many, many things will be very, very different because of this introduction. And in a sense, there’s a little bit of tension between the high power and the transformative change that AGI will bring, and the conservatism that the world is not that different from the way it is now. But I think this tension is part of the job of the entrants to resolve. Like how did that happen? How did we keep control of AGI? So not only has it not gone off the rails, but it hasn’t totally changed the world into something completely, radically different than we have now. So part of the job is to figure out how did that happen. What are the course of events? What are the institutions that were necessary for that technically, socially and so on, how did that come about? And that will be part of the fun, I think, to see how that worked out.

Lucas Perry: So it’d be great if we could pivot here into a bit more of the details behind the actual contest, just so that listeners have a sense of the due dates, what’s actually expected in terms of what’s being delivered and all that. So what are the details behind the contest?

Anna Yelizarova: Yeah, to understand the contest I think hearing what the submission consists of will be very helpful. So to enter the contest, you have to submit four elements. The first being a timeline. The second one being short stories, so writing pieces. The third being answers to a set of questions. And lastly, a piece of non-text media or art. So for the timeline, we want applicants to provide, for every single year from today, from 2022 to 2045, two events and one data point per year. What is an event? An event could be an agreement. An international agreement is formed. An institution is created. Like an actual event. Whereas a data point would be more… Could be a change in GDP, a change in life expectancy. So we’re talking more numbers here. And so this is pretty detailed. We have over 20 years to fill in, each of them having three points, which will help with the richness of the world and thinking through how we’ve achieved certain things.

Then the short stories. The short stories have to take place in 2045. Could be anywhere in the world and it could be different characters. But ultimately you’re telling their story in 750 to 1000 words. What does a day in the life look like in 2045? So you’re using more of a narrative tool here. You’re using storytelling to give some more color to your world to make it come alive. Then you have a set of prompts you need to answer. And the prompts will help you worldbuild. They’ll be asking how we’ve overcome a set of challenges. Some of them will be very focused on AI. Some of them are going to be more general. And you could definitely use the answers to those, to both shape your timeline or think through your story. All of the elements are meant to interact and are meant to help you with the other tasks.

And for the fourth element we were asking for a non-text media piece. And this could take many forms. This could be a very visual piece, digital art. This could be a video. Could also be audio, but we just don’t want something that is in written form, because so much of the rest of the application already focuses on that.
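For teams that want to sanity-check a draft against the numeric constraints Anna describes (two events and one data point for every year from 2022 to 2045, and stories of 750 to 1000 words), a minimal Python sketch might look like the following. The names `YearEntry`, `check_timeline`, and `story_length_ok` are hypothetical and are not part of any official contest tooling; the sketch also assumes that both 2022 and 2045 are included in the timeline and that story length is counted in whitespace-separated words.

```python
from dataclasses import dataclass, field

# Constraints as stated in the episode; inclusive endpoints are an assumption.
FIRST_YEAR, LAST_YEAR = 2022, 2045
EVENTS_PER_YEAR = 2
DATA_POINTS_PER_YEAR = 1
STORY_MIN_WORDS, STORY_MAX_WORDS = 750, 1000


@dataclass
class YearEntry:
    """Hypothetical container for one year of a timeline."""
    events: list[str] = field(default_factory=list)
    data_points: list[str] = field(default_factory=list)


def check_timeline(timeline: dict[int, YearEntry]) -> list[str]:
    """Return a list of problems; an empty list means the timeline fits."""
    problems = []
    for year in range(FIRST_YEAR, LAST_YEAR + 1):
        entry = timeline.get(year)
        if entry is None:
            problems.append(f"{year}: no entry")
            continue
        if len(entry.events) != EVENTS_PER_YEAR:
            problems.append(f"{year}: expected {EVENTS_PER_YEAR} events")
        if len(entry.data_points) != DATA_POINTS_PER_YEAR:
            problems.append(f"{year}: expected {DATA_POINTS_PER_YEAR} data point")
    return problems


def story_length_ok(story: str) -> bool:
    """Check the 750 to 1000 word range using a simple whitespace split."""
    return STORY_MIN_WORDS <= len(story.split()) <= STORY_MAX_WORDS
```

Under these assumptions the timeline spans 24 years, so a complete entry carries 48 events and 24 data points. The official rules on worldbuild.ai remain the authority on how entries are actually counted.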

Lucas Perry: In terms of actual due dates and some more details here about when this is all wrapping up and the amount of money that is being offered in terms of prizes, could you guys speak a little bit more on that?

Anthony Aguirre: So for this there’s a pretty significant prize purse. So we have a bunch of prizes. One first prize of $20,000. Two second prizes of $10,000. Five third prizes of $2,000. And ten fourth prizes of $1,000 each. And the judges have discretion to give up to five extra prizes of $2,000 each for whatever they like, like they could just really love a movie and give a $2,000 prize for that, even if the rest of the build wasn’t a winner. But we also… This doesn’t totally complete the prize package because we really want to encourage teams to enter in this rather than just individuals. Individuals are fine, but I think this will be much more fun if teams get together and work on it, it’d be more productive. So rather than forcing teams to split the prize evenly, which kind of disincentivizes things, we’re giving a bigger prize if you enter in a bigger team.

So we’ll scale up the prize. So for example, if you have a five person team, the prize is doubled. And so you don’t quite get the full prize that you would’ve gotten as an individual, but you don’t get a fifth of it either. You get like two fifths. So we’re really hoping, and we’re going to put effort into trying to help people form teams by incentivizing with the prizes, but also just doing what we can to build a community and to help people connect with each other, because this is an exercise that’s really fun to do in groups, I would say.
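The prize arithmetic Anthony walks through can be checked with a short Python sketch. The figures are the ones quoted in the episode; the function names `base_purse` and `per_member_share` are hypothetical. The only team multiplier mentioned is a doubling for a five-person team, so the multiplier is left as an input here and no scaling rule for other team sizes is assumed.

```python
# (tier, number of prizes, amount per prize in USD), as quoted in the episode.
BASE_PRIZES = [
    ("first", 1, 20_000),
    ("second", 2, 10_000),
    ("third", 5, 2_000),
    ("fourth", 10, 1_000),
]
# Judges may award up to five discretionary prizes of $2,000 each.
MAX_DISCRETIONARY_COUNT, DISCRETIONARY_AMOUNT = 5, 2_000


def base_purse(include_discretionary: bool = True) -> int:
    """Total of the listed prizes before any team-size scaling."""
    total = sum(count * amount for _, count, amount in BASE_PRIZES)
    if include_discretionary:
        total += MAX_DISCRETIONARY_COUNT * DISCRETIONARY_AMOUNT
    return total


def per_member_share(individual_prize: int, team_size: int, multiplier: float) -> float:
    """Each member's share when a scaled prize is split evenly across the team."""
    return individual_prize * multiplier / team_size


print(base_purse(include_discretionary=False))  # 60000
print(base_purse())                             # 70000
print(per_member_share(20_000, 5, 2))           # 8000.0
```

The listed prizes sum to $60,000, or $70,000 if all five discretionary prizes are awarded, before team scaling. A five-person team winning first prize would receive $40,000, which is $8,000 per member, the "two fifths" of the individual prize that Anthony mentions.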

Anna Yelizarova: In terms of important dates and milestones for the contest, the contest opened January 1st and teams have until April 15th to put in their entries. So then the contest will close April 15th, and the judges will take a month to pick 20 finalists. So May 15th we’ll hear about our 20 finalists, and everyone will get to see their worldbuilds, which will be published online. At this point, there’ll be a month where anyone in the public can input on… Can vote, can provide feedback, and can just voice which futures they like. So we’d love some audience participation here. And then the judges will take the audience feedback and use that to rank the finalists according to first, second, third prize, et cetera. So the final winners will be known June 15th, 2022.

Lucas Perry: And Anna, if people want to get any more information about the contest to see everything that we’ve talked about here on the FLI website, where’s the best place to do that?

Anna Yelizarova: So we have a website just for the contest. It’s worldbuild.ai. So very easy to remember. This has all the rules, all the deadlines, the prizes. Everything we’ve talked about is on the website, so I encourage you to look through the FAQs or anything like that. You can also join our Discord if you have questions. We’ll be monitoring the online community. You can also use the Discord to meet potential team members and interact with other folks interested in the contest. And if Discord isn’t your cup of tea, you can also send an email to worldbuild@futureoflife.org, and we’ll also be monitoring that channel. So those are the easiest ways to get in touch and join the community.

Lucas Perry: Awesome. So as we wrap up here, I’m just curious to get your final thoughts and feelings about this worldbuilding project. I guess I can start with you, Anthony, just how do you see this as really fitting in with all of your work overall at FLI? And how do you feel about it in terms of its place of working towards beneficial futures and mitigating existential risk?

Anthony Aguirre: First of all, I think it’s going to be a tremendous amount of fun. So there’s a little bit of a precedent at FLI for this. We, in 2019 had a meeting called the Augmented Intelligence Summit. And that was, I think around 40 or so, 50 people, that got together essentially to do worldbuilding. And the ground rules were actually kind of similar to the ones in this contest. And we did a lot of in-person exercises, including writing and talking about stories like the ones that we have here. Role playing, so you would put yourself in the role of some person in the future in this future world. All kinds of things to really inhabit it. And I found it to be really, really just enjoyable and insight-building. So for example, one of the things that I’ve been thinking a lot about lately on the AI side is the concept of loyalty or fiduciary duty in AI.

So we have humans who have to act sort of in the interest of their client, like a doctor or a lawyer, or a financial advisor. They have a legal duty to act in the interest of their clients. AI systems that we have right now don’t always have that same sort of obligation on them. And so in the world that we were talking about in the Augmented Intelligence Summit, we came up with this idea of fiduciary AI assistants. So these are assistants that have to act in your interest. That are acting just for you. That are like a human assistant that is your employee. They’re just doing… They’re not secretly working for some other company or something. And that was part of the world that we built. And so just out of that worldbuilding exercise came this idea that has actually led to two published papers, new initiatives that we’re thinking about.

I don’t think we necessarily would’ve thought along those lines without that worldbuilding contest. So I think it’s very easy to… If you’re starting from where we are here, and you’re thinking about what the future is going to look like, to just make sort of minor perturbations or just kind of push in one direction or another a little bit. When you’re forced to jump to 2045, say, you know that things are going to be quite different and it kind of frees you up creatively to think about how the world could be radically different. And I think that’s really, really valuable. And it’s also I think… The advantage that worldbuilding has above just, well, let’s think about the future and what might exist is that in that case you tend to do things sort of abstractly, like the world might have this sort of technology in it.

But when you think of, for example, a day in the life of someone, they get into their self-driving car. And then you’re thinking about what that means exactly. Who’s the owner of the car? What happens if they’ve been out drinking? Does the self-driving car spy on them or is the self-driving car kind of private inside? When you try to actually build a fictional and very concrete world and go into the backstory of what’s behind every little piece of it, you come up with all sorts of questions that you hadn’t really considered before. The experiential side of it leads you to encounter that future in a way that you wouldn’t just going through a purely intellectual exercise, I think, which is really valuable and really enjoyable, and I think something that we will get a lot more out of than purely abstract intellectual thinking about it.

Lucas Perry: Anna, do you have anything else that you would like to add here in terms of how this fits into your existential risk work and what makes you so excited about this?

Anna Yelizarova: Well, I guess we’re getting to the bottom of the why is FLI running this contest and the goals behind it. And we’re really hoping that the outputs of this contest tie into our real world work at FLI, and aren’t just these creative visions we share. As Anthony said, there might be some real nuggets of wisdom, some interesting policy propositions, some new institutions that people describe in these worldbuilds that are real ideas that maybe are worth pursuing in the real world in the present.

So there’s this inspiration for, how did we overcome these problems? I think it’s important to say that we’re not pushing for any particular future, and we’re mostly crowdsourcing suggestions for how did we overcome major problems? How did we ensure there was not another pandemic? Or how did we ensure that AI was kept safe? And we want to hear the answers to that, so they can inspire our real-life efforts at FLI. Another side of it is that we’re trying to lean hard into the storytelling aspect at FLI. And I know a few of our coworkers are really passionate about that, because storytelling can be really powerful in convincing people that certain risks are real or certain futures are worth working towards.

And I think crowdsourcing the futures people want is part of it. I know that we do want to hear everyone’s input and perspective on the kind of futures we find desirable. And another side to this coin is trying to actually show people what positive futures might look like. Because it’s true, we have been bombarded with dystopia left and right. It’s in all of our fiction, in Hollywood, it’s in books and movies, it’s everywhere. And I do think it does something to our worldview if that’s how we all think about the future. And if we as a society had a better relationship with the future, I think people would be motivated to work towards it as opposed to having a doom-and-gloom approach to things. So there’s a strong storytelling component, and we’re hoping to use these worldbuilds, put them in the hands of storytellers.

We’re hoping to do more after the contest as well to give life to these worldbuilds. And help people feel more positively about the future. Not in a naive way. Not in the sense of, oh, everything will be fine, but in the sense of like, oh, if we work hard, if we tackle these problems, there’s this thing really worth working towards. So we do care about inspiring people, and we sure hope to work with a lot more storytellers on the other side after the contest is over. So hopefully this initiative doesn’t end with the contest itself on June 15th.

Lucas Perry: All right. Awesome. Thank you very much, Anthony and Anna. I’m really excited to see what the… Who the winners are. I know that the quality of their world will be really interesting and amazing. I’m also excited to see all the kinds of interesting institutions that come out of it. And I’m especially excited for the art pieces. That’ll be really cool. So, yeah. Thanks so much for coming on, and if people want to get more information, I’ll include links to the website in the description of wherever you might be listening. So, yeah, thank you so much.


David Chalmers on Reality+: Virtual Worlds and the Problems of Philosophy

  • Virtual reality as genuine reality
  • Why you can live a good life in VR
  • Why we can never know whether we’re in a simulation
  • Consciousness in virtual realities
  • The ethics of simulated beings


0:00 Intro

2:43 How this book fits into David’s philosophical journey

9:40 David’s favorite part(s) of the book

12:04 What is the thesis of the book?

14:00 The core areas of philosophy and how they fit into Reality+

16:48 Techno-philosophy

19:38 What is “virtual reality?”

21:06 Why is virtual reality “genuine reality?”

25:27 What is the dust theory and what’s it have to do with the simulation hypothesis?

29:59 How does the dust theory fit in with arguing for virtual reality as genuine reality?

34:45 Exploring criteria for what it means for something to be real

42:38 What is the common sense view of what is real?

46:19 Is your book intended to address common sense intuitions about virtual reality?

48:51 Nozick’s experience machine and how questions of value fit in

54:20 Technological implementations of virtual reality

58:40 How does consciousness fit into all of this?

1:00:18 Substrate independence and if classical computers can be conscious

1:02:35 How do problems of identity fit into virtual reality?

1:04:54 How would David upload himself?

1:08:00 How does the mind body problem fit into Reality+?

1:11:40 Is consciousness the foundation of value?

1:14:23 Does your moral theory affect whether you can live a good life in a virtual reality?

1:17:20 What does a good life in virtual reality look like?

1:19:08 David’s favorite VR experiences

1:20:42 What is the moral status of simulated people?

1:22:38 Will there be unconscious simulated people with moral patiency?

1:24:41 Why we can never know we’re not in a simulation

1:27:56 David’s credences for whether we live in a simulation

1:30:29 Digital physics and what it says about the simulation hypothesis

1:35:21 Imperfect realism and how David sees the world after writing Reality+

1:37:51 David’s thoughts on God

1:39:42 Moral realism or anti-realism?

1:40:55 Where to follow David and find Reality+


Lucas Perry: Welcome to the Future of Life Institute Podcast. I’m Lucas Perry. Today’s episode is with David Chalmers and explores his brand new book Reality+: Virtual Worlds and the Problems of Philosophy. For those not familiar with David, he is a philosopher and cognitive scientist who specializes in the philosophy of mind and language. He is a Professor of Philosophy and Neural Science at New York University, and is the co-director of NYU’s Center for Mind, Brain and Consciousness. Professor Chalmers is widely known for his formulation of the “hard problem of consciousness,” which asks, “Why is a physical state, like the state of your brain, conscious rather than nonconscious?”

Before we jump into the interview, we have some important and bittersweet changes to this podcast to announce. After a lot of consideration, I will be moving on from my role as Host of the FLI Podcast, and this means two things. The first is that FLI is hiring for a new host for the podcast. As host, you would be responsible for the guest selection, interviews, production, and publication of the FLI Podcast. If you’re interested in applying for this position, keep your eye on the Careers tab on the futureoflife.org website for more information.

The second item is that even though I will no longer be the host of the FLI Podcast, I won’t be disappearing from the podcasting space. I’m starting a brand new podcast focused on exploring questions around wisdom, philosophy, science, and technology, where you’ll see some of the same themes we explore here like existential risk and AI alignment. I’ll have more details about my new podcast soon. If you’d like to stay up to date, you can follow me on Twitter at LucasFMPerry, link in the description. This isn’t my final time on the FLI Podcast, I’ve got three more episodes including a special farewell episode, so there’s still more to come! 

And with that, I’m very happy to introduce David Chalmers on Reality+.

Welcome to the podcast David, it’s a really big pleasure to have you here. I’ve been looking forward to this. We both love philosophy so I think this will be a lot of fun. And we’re here today to discuss your newest book, Reality+. How would you see this as fitting in with the longer term project of your career and philosophy?

David Chalmers: Oh boy, this book is all about reality. I think of philosophy as being, to a very large extent, about the mind, about the world, and about relationships between the mind and the world. In a lot of my earlier work, I’ve focused on the mind. I was drawn into philosophy by the problem of consciousness, understanding how a physical system could be conscious, trying to understand consciousness in scientific and philosophical terms.

But there are a lot of other issues in philosophy too. And as my career has gone on, I guess I’ve grown more and more interested in the world side of the equation, the nature of reality, the nature of the world, such that the mind can know it. So I wrote a fairly technical book back in 2012 called Constructing the World. That was all about what is the simplest vocabulary you can use to describe reality?

But one thing that was really distinctive to this book was thinking about it in terms of technology. In philosophy, it often is interesting and cool to take an old philosophical issue and give it a technological twist. Maybe this is most clear in the case of thinking about the mind through the lens of AI: are artificial minds possible? That’s a big question for anybody. If they are, maybe that tells us something interesting about the human mind. If artificial minds are possible, then maybe the human mind is in relevant ways analogous, for example, to an artificial intelligence.

Then, well, the same kind of question comes up for thinking about reality and the world. Are artificial worlds possible? Normally we think about, okay, ordinary physical reality and the mind’s relation to that, but with technology, there’s now a lot of impetus to think about artificial realities, realities that we construct. And the crucial case there is virtual realities, computationally based realities, virtual worlds even of the kind we might construct, say, with video games or full scale virtual realities, full scale universe simulations. And then a bunch of analogous questions come up: are artificial realities genuine realities?

And just as in the artificial mind case, I want to say artificial minds are genuine minds. Well, likewise in the artificial world case, I want to say, yeah, virtual realities are genuine realities. And that’s in fact the central slogan of this new book Reality+, which is very much trying to look at some of these philosophical issues about reality through the lens of technology and virtual realities, as well as trying to get some philosophical insight into this virtual reality technology in its own right by thinking about it philosophically. This is the process I call techno-philosophy, using technology to shed light on philosophy and using philosophy to shed light on technology.

Lucas Perry: So you mentioned… Of course you’re widely known as a philosopher of consciousness and it’s been a lot of what you focused on throughout your career. You also described this transition from being interested in consciousness to being interested in the world increasingly over your career. Is that fair to say?

David Chalmers: Yeah. You can’t be interested in one of these things without being interested in the other things. So I’ve always been very interested in reality. And even in my first book on consciousness, there was speculation about the nature of reality. I talked about the it from bit hypothesis there: maybe reality is made of information. I talked about quantum mechanics and potential connections to consciousness. So yeah, you can’t think about, say, the mind body problem without thinking about bodies as well as minds. You have to think about physical reality.

There’s one particular distinctive question about the nature of reality, namely how much can we know about it? And can we know anything about the external world? That’s a very traditional problem in philosophy. It goes back to Descartes saying, how do you know you’re not dreaming right now? Or how do you know you’re not being fooled by an evil demon who’s producing sensations as of an external world when none of this is real? And for a long time, I thought I just didn’t have that much to say about this very big question in philosophy.

I think of the problem of consciousness, the mind body problem. That’s a really big question in the history of philosophy. But to be honest, I’m going to say it’s probably not number one. Number one, at least in the Western philosophical tradition, is how do we know anything about the external world? And for a long time, I thought I didn’t have anything to say about that. And at a certain point, partly through thinking about, yeah, virtual realities and the simulation hypothesis, I thought, yeah, maybe there is something new to say here via this idea that virtual realities are genuine realities. Maybe these hypotheses that Descartes put forward saying, “If this is the case, then none of this is real.” Maybe Descartes was actually thinking about these hypotheses wrongly.

And I actually got drawn into this. Around the same time, just totally fortuitously, I got invited to write an article for the Matrix website. Their production company was Red Pill, and there was a philosopher called Chris Grau who worked for them. And I guess the Wachowskis were super interested in philosophy. They wanted to see what philosophers thought of philosophical issues coming from the movie. So I ended up writing an article called The Matrix as Metaphysics, putting forward this rough point of view. In the movie, they say, well, if we’re in the Matrix, none of what we’re experiencing is real. All this is an illusion or a fiction.

I tried to argue, even if you’re in the Matrix, these things around you are still perfectly real. There are still trees, there are still cats, there are still chairs. There are still planets. It’s just that they’re ultimately digital, but they’re still perfectly real. And I tried to use that way of thinking about the Matrix to provide a response to the version of Descartes who says, “We can never know anything about the external world, because we can’t rule out that none of this is real.”

All those scenarios Descartes had in mind, I think in some sense they are actually scenarios where things are real. Maybe that makes reality a bit more like virtual reality, but that vision of reality actually puts knowledge of the external world more within our grip. And from there, there’s a clean path from writing that article 20 years ago to writing this book now, which takes this idea of virtual reality as genuine reality and tries to just draw it out in all kinds of directions, to argue for it, to connect it to present day technology, to connect it to a bunch of issues in philosophy and science. Because if you start thinking this way about reality, at least I’ve found, it changes everything. It changes all kinds of things about your vision of the world.

Lucas Perry: So I think that gives a really good taste of what is to come in this interview and also what’s in your book. Before we dive more into those specifics, I’m also just curious what your favorite part of the book is. If there’s some section or maybe there isn’t that you’re most excited to talk about, what would that be?

David Chalmers: Oh, I don’t know. I was going to say my favorite parts of the book are the illustrations, amazing illustrations by Tim Peacock, who’s a great illustrator who I found out about and I asked if he’d be able to do illustrations for the book. And he took so many of these scenarios, philosophical thought experiments, science fiction scenarios, and came up with wonderful illustrations to go along with them. So we’ve got Plato’s Cave, but updated for the 21st century with people in virtual reality inside Plato’s Cave with Mark Zuckerberg running the cave, or we have an ancient Indian thought experiment about Narada and Vishnu updated in the light of Rick and Morty. We’ve got a teenage girl hacker creating a simulated universe in the next universe up.

So these illustrations are wonderful, but I guess that doesn’t quite answer your question, which parts do I especially want to talk about? I think of the book as having roughly two halves. Half of it is broadly about the simulation hypothesis. The idea that the universe is a simulation and trying to use that idea to shed light on all kinds of philosophical problems. And the other half is more about real virtual reality, the coming actual virtual reality technology that we have and will develop in the next say 50 to 100 years and trying to make sense of that and the issues it brings up.

So in the first part of the book, I talk about very abstract issues about knowledge and reality and the simulation hypothesis. The second part of the book gets a bit more down to earth and even comes to issues about ethics, about value, about political philosophy. How should we set up a virtual world? That was more of a departure for me to be thinking about some of those more practical and political issues, but over time I’ve come to find they’re fascinating to think about.

So I guess I’m actually equally fascinated by both sets of issues. But I guess lately I’ve been thinking especially about some of these second class of issues, because a lot of people given the coming… All the corporations now are playing up the metaverse and coming virtual reality technology. That’s been really interesting to think about.

Lucas Perry: So given these two halves in general and also the way that the book is structured, what would you say are your central claims in this book? What is the thesis of the book?

David Chalmers: Yeah, the thesis of the book that I lay out in the introduction is virtual reality is genuine reality. It’s not a second class reality. It’s not fake or fictional. Virtual reality is real. And that breaks down into a number of sub-theses. One of them is about the existence of objects, and it’s a thesis in metaphysics. It says the objects in virtual reality are real objects, a virtual tree is a real object. It may be a digital object, but it’s real all the same. It has causal powers. It can affect us. It’s out there independently of us. It needn’t be an illusion.

So yeah, virtual objects are real objects. What happens in virtual reality really happens. And that’s one kind of thesis. Another thesis is about value or meaning. That you can lead a valuable life, you can lead a meaningful life inside a virtual world. Some people have thought that virtual worlds can only ever be escapist or fictions or not quite the real thing. I argue that you can lead a perfectly meaningful life.

And the third kind of thesis is tied closer to the simulation hypothesis idea. And there I don’t argue that we are in fact in a computer simulation, but I do argue that we can never know that we’re not in a simulation. There’s no way to exclude the possibility that we’re in a simulation. So that’s a hypothesis to take very seriously. And then I use that hypothesis to flesh out a number of different… Just say we are in a simulation, then, yeah, what would this mean for, say, our knowledge of the world? What would this mean for the reality of God? What would this mean for the underlying nature of the metaphysics underneath physics and so on? And I try and use that to just put forward a number of sub-theses in each of these domains.

Lucas Perry: So these claims also seem to line up with really core questions in philosophy, particularly having to do with knowledge, reality and value. So could you explain a little bit what are some of the core areas of philosophy and how they line up with your exploration of this issue through this book?

David Chalmers: Yeah, traditionally philosophy is at least sometimes divided up into three areas, metaphysics, epistemology and the theory of value. Metaphysics is basically questions about reality. Epistemology is basically questions about knowledge and value theory is questions about value, about good versus bad and better versus worse. And in the book, I divide up these questions about virtual worlds into three big questions in each of these areas, which I call the knowledge question, the reality question and the value question.

The knowledge question is, can we know whether we’re in a virtual world? In particular, can we ever be sure that we’re not in a virtual world? And there I argue for an answer of no: we can never know for sure that we’re not in a virtual world, we can never exclude that possibility. But then there’s the reality question, which is roughly, if we are in a virtual world, is the world around us real? Are these objects real? Are virtual realities genuine realities or are they somehow illusions or fictions? And there I argue for the answer, yes, virtual worlds are real. Entities and events in virtual worlds are perfectly real entities and events. Even if we’re in a simulation, the objects around us are still real. So that’s a thesis in metaphysics.

Then there’s the question in value theory, which is roughly, can you lead a good life in a virtual world? And there as I suggested before I want to argue, yes, you can lead a good and meaningful life in a virtual world. So yeah, the three big questions behind the book, each correspond then to a big question, a big area of philosophy. I would like to think they actually illuminate not just questions about virtual worlds, but big questions in those areas more generally. The big question of knowledge is, can we know anything about the external world?

The big question of reality is, what is the nature of reality? The big question about value is, what is it to lead a good life? Those are big traditional philosophical questions. I think thinking about each of those three questions through the lens of virtual reality and trying to answer the more specific questions about what is the status of knowledge, reality and value in a virtual world, that can actually shed light on those big questions of philosophy more broadly.

So what I try to do in the book is often start with the case of the virtual world, give a philosophical analysis of that, and then try to draw out morals about the big traditional philosophical question more broadly.

Lucas Perry: Sure. And this seems like it’s something you bring up as techno-philosophy in the book, where philosophy is used to inform the use of technology and then technology is used to inform philosophy. So there’s this mutually beneficial exchange through techno-philosophy.

David Chalmers: Yeah. Techno-philosophy is this two-way interaction between philosophy and technology. So what I’ve just been talking about now, using virtual reality technology and virtual worlds to shed light on big traditional philosophical questions, that’s the direction in which technology sheds light on philosophy, or at least thinking philosophically about technology can shed light on big traditional questions in philosophy that weren’t cast in terms of technology. Can we know we’re not in a simulation? That sheds light on what we can know about the world. Can we lead a good life in a virtual world? That sheds some light on what it is to lead a good life and so on.

So yeah, this is the half of techno-philosophy where thinking about technology sheds light on philosophy. The other half is using philosophy to shed light on technology, just thinking philosophically about virtual reality technology, simulation technology, augmented reality technology and so on. And that’s I think something I really try to do in the book as well. And I think these two processes of course complement each other. Because when you think philosophically about technology, it sheds some light on the technology, but then it turns out actually to have some impact on the broader issues of philosophy at the same time.

Lucas Perry: Sure. So what’s coming up for me is Plato’s Cave Allegory is actually a form of techno-philosophy potentially, where the candle is a kind of technology that’s being used to cast shadows to inform how Plato’s examining the world.

David Chalmers: That’s interesting. Yeah. I hadn’t thought about that. But I suppose back around Plato’s time, people did a whole lot with candles and fire. These were very major technologies of the time. And maybe at a certain point people started developing puppet technology and started doing puppet style shows that were a form of, I don’t know, entertainment technology for them. And then for Plato then to be thinking about the cave in this way, yeah, it is a bit of a technological setup and Plato is using this new technology to make claims about reality.

Plato also wrote about other technologies. He wrote about writing, the invention of writing, and he was quite down on it. He thought, or at least his spokesman Socrates said, “In the old days people would remember all the old tales, they’d carry them around in their head and tell them person to person, and now that you can write them down, no one has to remember them anymore.” And he thought this was somehow a step back, in the way in which some people these days think that putting all this stuff on your smartphone might be a step back. But yeah, Plato was very sensitive to the technologies of the time.

Lucas Perry: So let’s make a beeline for your central claims in this book. And just before we do that, I have a simple question here for you. Maybe it’s not so simple but… So what is virtual reality?

David Chalmers: Yeah, the way I define it in the book, I make a distinction between a virtual world and virtual reality, where roughly virtual reality technology is immersive. It’s the kind of thing you experience, say, with an Oculus Quest headset that you put onto your head and you experience a three dimensional space all around you. Whereas a virtual world needn’t be immersive. When you play a video game, when you’re playing World of Warcraft or you’re in Fortnite, typically you’re doing this on a two dimensional screen, it’s not fully immersive, but there’s still a computer generated world.

So my definitions are a virtual world is an interactive computer generated world. It has to be interactive. If it’s just a movie, then that’s not yet a virtual world, but if you can perform actions within the world and so on and it’s computer generated, that’s a virtual world. A virtual reality is an immersive interactive computer generated world. Then the extra condition, this has to be experienced in 3D with you at the center of it, typically these days experienced with a VR headset and that’s virtual reality. So yeah, virtual reality is immersive interactive computer generated reality.

Lucas Perry: So one of the central claims that you mentioned earlier was that virtual reality is genuine reality. So could you begin explaining why it is that you believe that virtual reality is genuine reality?

David Chalmers: Yeah. Because a lot of this depends on what you mean by real and by genuine reality. And one thing I do in the book is try and break out a number of different meanings of real: what is it for something to be real? One is that it has some causal power, that it could make a difference in the world. One is that it’s out there independent of our minds. It’s not just all in the mind. And one, maybe the most important, is that it’s not an illusion. That is, things are roughly as they seem to be. And I try to argue that if we’re in VR, the objects we see have all of these properties. Basically the idea is, when you’re in virtual reality you’re interacting with digital objects, objects that exist as data structures on computers, the actual concrete processes up and running in a computer room. We’re interacting with concrete data structures realized in circuitry on these computers.

And those digital objects have real causal powers. They make things happen. When two objects interact in VR, the two corresponding data structures on a computer are genuinely interacting with each other. When a virtual object appears a certain way to us, that data structure is at the beginning of a causal chain that affects our conscious experience in much the same way that a physical object might be at the start of a causal chain affecting our experience.

And most importantly, I want to argue that, just say, let’s take the extreme case of… I find it useful to start with the extreme case of the simulation hypothesis, where all of this is a simulation. I want to say in that case when I have an experience of say a tree in front of me or here’s a desk and a chair, I’m going to say none of that is illusory. There’s no illusion there. You’re interacting with digital objects. It’s a digital table or a digital chair, but it’s still perfectly real.

And the way that I end up arguing for this in the book is to argue that the simulation hypothesis should be seen as equivalent to a kind of hypothesis which has become familiar in physics, a version of the so-called it from bit hypothesis. The it from bit hypothesis says roughly that physical reality is grounded in a level of interaction of bits or some computational process. The paradigm illustration here would be Conway’s Game of Life, where you have a cellular automaton with cells that could be on or off and simple rules governing their interaction.

And various people have speculated that the laws of physics could be grounded in some kind of algorithmic process, perhaps analogous to Conway’s Game of Life. People call this digital physics. And it’s not especially widely believed among physicists, but there are some people who take it seriously. And at least it’s a coherent hypothesis that, yeah, there’s a level of bits underneath physical objects in reality. And importantly, if the it from bit hypothesis is true, this is not a hypothesis where nothing is real. It’s just a world where there still are chairs and tables. There still are atoms and quarks. It’s just they’re made of bits. There’s a level underneath the quarks, the level of bits, but things are perfectly real.
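
For readers who have not seen it, the cellular automaton mentioned above is small enough to sketch in a few lines of Python. This is only an illustration, assuming the standard Game of Life rules (a live cell stays on with two or three live neighbors; a dead cell turns on with exactly three). The function name step, the set-of-coordinates representation, and the glider starting pattern are choices made for this sketch, not anything taken from the book or the interview.

```python
from collections import Counter


def step(live):
    """Advance the grid one generation.

    `live` is a set of (row, col) pairs for the cells that are "on";
    every other cell on the unbounded grid is "off".
    """
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Standard rules: birth on exactly 3 neighbors, survival on 2 or 3.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


# A "glider": five live cells whose shape reappears, shifted, every four steps.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}

state = glider
for _ in range(4):
    state = step(state)

# The same shape, moved one row down and one column right.
print(state == {(r + 1, c + 1) for (r, c) in glider})
```

Nothing in the rules mentions a glider, yet a stable, moving object shows up at a higher level of description. That is the sense in which the conversation treats such an automaton as a toy picture of objects grounded in a level of bits.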

So in the book I try to argue that actually the simulation hypothesis is equivalent to this it from bit hypothesis. Basically, if we’re in a simulation, yeah, there are still tables and chairs, atoms and quarks. There’s just a level of bits underneath that. All this is realized maybe by a computer process involving the interaction of bits, and maybe there’s something underneath that in turn. That leads to what I call the it from bit from it hypothesis. Maybe if we’re in a simulation, there’s a number of levels like this.

But yeah, the key then is the argument that these two hypotheses are equivalent, which is a case I try to make in chapter nine of the book. The argument itself is complex, but there’s a nice illustration to illustrate it. On one hand, we’ve got a traditional God creating the universe by creating some bits, by, yeah, “Let there be bits,” God says and lays out the bits and gets them interacting. And then we get tables and chairs out of that. And in the other world we have a hacker who does the same thing except via a computer. Let there be bits arranged on the computer, and we get virtual tables and chairs out of that. I want to argue that the God creation scenario and the hacker simulation scenario basically are isomorphic.

Lucas Perry: Okay. I’m being overwhelmed here with all the different ways that we could take this. So one way to come at this is from the metaphysics of it where we look at different cosmological understandings. You talk in your book about there being, what is it called? The dust theory? There may be some kind of dust which can implement any number of arbitrary algorithms, which then potentially above that there are bits, and then ordinary reality as we perceive it as structured and layered on top of that. And looking at reality in this way it gives a computationalist view of metaphysics and so also the world, which then informs how we can think about virtual reality and in particular the simulation hypothesis. So could you introduce the dust theory and how that’s related to the it from bit argument?

David Chalmers: Yeah. The dust theory is an idea that was put forward by the Australian science fiction writer, Greg Egan in his book, Permutation City, which came out in the mid 90s, and is a wonderful science fiction novel about computer simulations. The dust theory is in certain respects even more extreme than my view. I want to say that as long as you have the right computation and the right causal structure between entities in reality, then you’ll get genuine reality. And I argue that can be present in a physical reality, that can be present in a virtual reality. Egan goes a little bit more extreme than me. He says, “You don’t even need this causal structure. All you need is unstructured dust.”

We call it dust. It’s basically a bunch of entities that have no spatial properties, no temporal properties. It’s a whole totally unstructured set of entities, but we think of this as the dust and he thinks the dust will actually generate every computer process that you can imagine. He thinks they can generate any objects that you imagine and any conscious being that you can imagine and so on. Because he thinks there’s ways of interpreting the dust so that it’s for example, implementing any computer program whatsoever. And in this respect, Egan has actually got some things in common with philosophers like the American philosophers, Hilary Putnam, and John Searle, who argued that you can find any computation anywhere.

Searle argued that his wall implemented the WordStar word processing program. Putnam suggested that maybe a rock could implement complex computations, basically, because you can always map the parts of the physical object onto the parts of the computation. I actually disagree with this view. I think it’s too unconstrained. I think it makes it too easy for things to be real.

And roughly the reason is I think you need constraints of cause and effect between the objects. For a bunch of entities in a rock or a wall to implement, say, WordStar, they have to be arranged in a certain way so they go through certain state transitions. And so they would go through different state transitions in different circumstances to actually implement that algorithm. And that requires genuine causal structure. And yeah, way back in the 90s, I wrote a couple of articles arguing that the structure you’ll find in a wall or a rock is not enough to implement most computer programs.
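The requirement that a system "would go through different state transitions in different circumstances" can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the two-state parity machine stands in for "the algorithm", and `make_replay` is a hypothetical stand-in for a system (like the wall) whose states can be mapped onto one actual run but which does not respond to its inputs.

```python
# Transition table for a two-state parity machine (a hypothetical stand-in
# for "the algorithm"): the state flips whenever the input bit is 1.
TRANSITIONS = {
    ("even", 0): "even",
    ("even", 1): "odd",
    ("odd", 0): "odd",
    ("odd", 1): "even",
}

def run_machine(bits, start="even"):
    """A system with real transition structure: each next state depends on the input."""
    state = start
    trace = [state]
    for b in bits:
        state = TRANSITIONS[(state, b)]
        trace.append(state)
    return trace

def make_replay(recorded_trace):
    """A system that passes through a fixed sequence of states, ignoring its input."""
    def run(bits):
        return recorded_trace[: len(bits) + 1]
    return run

actual_input = [1, 0, 1, 1]
replay = make_replay(run_machine(actual_input))

# On the input that actually occurred, the two systems look identical.
same_on_actual = replay(actual_input) == run_machine(actual_input)

# Counterfactual test: had the input been different, only the machine
# would have gone through different states.
other_input = [0, 0, 1, 0]
same_on_counterfactual = replay(other_input) == run_machine(other_input)

print(same_on_actual, same_on_counterfactual)
```

On this toy picture, matching a single sequence of states is cheap, while supporting the right counterfactuals is what takes genuine causal structure.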

And I’d say exactly the same for Egan’s dust theory, that the dust does not have enough structure to support a genuine reality because it doesn’t have these patterns of cause and effect, obeying counterfactuals, if this had happened, then this would’ve happened. And so you just don’t get that rich structure out of the dust. So I want to say that you can get that structure, but to get that structure you need dust structured by cause and effect.

And importantly I think, in an average computer simulation, like in the simulation hypothesis, it’s not like the dust. Computer simulations really have this rich causal structure going on inside the computer. You’ve got circuits which are hooked up to each other in patterns of cause and effect that are isomorphic to those in the physical reality. That’s why I say virtual realities are genuine realities, because they actually have this underlying computational structure.

But I would disagree with Egan that the dust is a genuine reality because the dust doesn’t have these patterns of cause and effect. I ended up having a bunch of email with Greg Egan about this and he was arguing for his own particular theory of causation, which went another way. But yeah, at least that’s where I want to hold the line, cause and effect matters.

Lucas Perry: My questions are, so what is the work then that you see the dust theory doing in your overall book in terms of your arguments for virtual reality as genuine reality?

David Chalmers: The dust theory comes relatively late in the book, right? Earlier on I bring in this it from bit idea that yeah, all of reality might be grounded in information, in bits, in computational processes. I see the dust theory as being partially tied to a certain objection somebody might make, that I’ve made it too easy for things to be real now. If I can find reality in a whole bunch of bits like that, maybe I’m going to be able to find this reality everywhere. And even if we’re just connected to dust, there’ll be trees and chairs, and now isn’t reality made trivial? So partly I think that’s an objection I want to address, and say no, it’s still not trivial to have reality. You need all this structure, this kind of cause and effect structure or, roughly equivalently, a certain mathematical structure in the laws of nature.

And that’s really a substantive constraint, but it’s also a way of helping to motivate the view that I call structuralism, and that many others have called structuralism or structural realism about physical reality, which I think is kind of actually the key to my thesis. Why does virtual reality get to count as genuine reality? Ah, because it has the right structure. It has the right causal structure. It has the right kind of mathematically characterizable interactions between different entities. What matters is not so much what these things are made of intrinsically, but the interactions and the relations between them. And that’s a view that many philosophers of science these days find very plausible. It goes back to Poincaré and Russell and Carnap and others, but yeah, very popular these days. What matters, let’s say, for a theory in physics to be true is that basically you’ve got entities with the right kind of structure of interactions between them.

And if that view is right, then it gives a nice explanation of why virtual reality counts as genuine reality, because when you have a computer simulation of, say, the physical world, that computer simulation preserves the relevant kind of structure. So yeah, the structure of the laws of physics could be found in a physical reality, but that structure could also be found in a computer simulation of that reality. Computer simulations have the right structure. But it turns out that’s not totally unconstrained. Some people think, Egan thought, the dust is good enough. Some people think purely mathematical structure is good enough. In fact, your sometime boss, Max Tegmark, I think may think something like this. In his book on the mathematical universe, he argues that reality is completely mathematical.

And at least sometimes it seems to look as if he’s saying the content of our physical theories is just purely mathematical claims that there exist certain entities with a certain mathematical structure. And I worry that, as with Egan, if you understand the content of our theories as purely mathematical, then you’ll find that structure anywhere. You’ll find it in the dust. You’ll find it in any abstract about mathematics. And there’s a worry that actually our physical theories could be trivialized and they can all end up being true, because we can always find dust or mathematical entities with the right structure. But I think if you add the constraint of cause and effect here, then it’s no longer trivialized.

So I think of Egan and Tegmark as potentially embracing a kind of structuralism which is even broader than mine, which lets in even more kinds of things as reality. And I don’t want to be quite so unconstrained. So I want to add some of these constraints of cause and effect. So this is rather late in the book, this is kind of articulating the nature of the kind of structuralism that I see as underlying this view of reality.

Lucas Perry: So, Egan and Max might be letting in entities into the category of what is real, which might not have causal force. And so you’re adopting this criterion of cause and effect being important in structuralism for what counts as genuine.

David Chalmers: Yeah. I think cause and effect is very important to our ordinary conception of reality, that, for example, things have causal powers. I worry that if we don’t have some kind of causal constraint on reality, then it becomes almost trivial to interpret reality as being anywhere. I guess I think what we mean by real is partly a verbal question, but I think of causal powers as very central to our ordinary notion of reality. And I think that manages actually to give us a highly constrained notion of reality, where realities are at least partly individuated by their causal structures, but where it’s not now so broad that arbitrary conglomerates of dust get to count as being on a par with our physical world, or arbitrary sets of mathematical entities likewise.

Lucas Perry: Let’s talk more about criteria for what makes things count as real or genuine or whether or not they exist. You spend a lot of time on this in your book, sort of setting and then arguing for different positions on whether or not certain criteria are necessary and/or sufficient for satisfying some understanding of, like, what is real or what it means that something exists or that it’s genuine. And this is really important for your central thesis of virtual reality being genuine reality, ’cause it’s important to know what it is that exists and how virtual reality fits into what is real overall. So could you explore some of the criteria for what it means for something to be part of reality or what is reality?

David Chalmers: Yeah. I end up discussing five different notions of reality, of what it is for something to be real. I mean, this kind of goes back to The Matrix where Neo says this isn’t real and Morpheus says, “What is real? How do you define real?” That’s the question: how do you define “real”? There are any number of different things people have meant by real, but I talk about five main strands in our conception of reality. One very broad one is something is real just if it exists. Anything that exists is real. So if that tree exists, it’s real. If the number two exists, it’s real. I think that’s often what we mean. It’s also a little bit unhelpful as a criterion, because it just pushes back the question to what is it for something to exist? But it’s a start.

Then the second one is the one we’ve just been talking about, the criterion of causal powers. This actually goes back to one of Plato’s dialogues where the Eleatic Stranger comes in and says for something to be real, it’s got to be able to make a difference. It’s got to be able to do something, that’s the causal power criterion. And so to be real, you’ve got to have effects. Some people dispute that’s necessary. Maybe numbers could be real, even if they don’t have effects, maybe consciousness could be real, even if it doesn’t have effects, but it certainly seems to be a plausible sufficient condition, so that’s causal powers. Another one is mind independence, existing independently of the mind. There’s this nice slogan from Philip K. Dick where he said that something is real if when you stop believing in it, it doesn’t go away. Reality is that which when you stop believing in it, it doesn’t go away.

That’s basically to say its existence doesn’t depend on our beliefs. Some things are such that their existence depends on our beliefs. I don’t know, the Easter Bunny or something, but more generally I’d say that some things have existence that depends on our minds. Maybe a mirage of some water up ahead. That basically depends on there being a certain conscious experience in my mind, but there are some things out there independent of my mind that aren’t all in my mind, that don’t just depend on my mind. And so this leads to the third criterion: something is real when it doesn’t wholly depend on our minds, it’s out there independently of us.

Now this is a controversial criterion. People think that some things like money may be real, even though it largely depends on our attitudes towards money. Our treating something as money is part of what makes it money. And actually in the Harry Potter books, I think it’s Dumbledore who has a slogan that goes the opposite way of Philip K. Dick’s. At one point towards the end of the novels, Harry says, “ah, but none of this is real and this is all just happening inside my head” and Dumbledore says something like, “just because all this is happening inside your head, Harry, why do you think that makes it any less real?”

So I don’t know. There is a kind of mental reality you get from the mind, but anyway, I think mind independence is one important thing that we often have in mind when we talk about reality. A fourth one is that we sometimes talk about genuineness or authenticity. And one way to get at this is we often talk about not just whether an object is real, but whether it’s a real something. Like maybe you have a robot kitten. Okay, it’s a real object. Yes. It’s a real object. It’s a genuine object with causal powers out there independently of us. But is it a real kitten? Most people would say that, no, a robot kitten maybe is a real object, but it’s not a real kitten. So it’s not a genuine, authentic kitten.

More generally for any X we can ask, is this a real X? And that’s this criterion of genuineness. But then maybe the deepest and most important criterion for me is this one: basically, something is real if it’s not an illusion, that is, if it’s roughly the way it seems to be. It seems to me that I’m in this environment, there are objects all around me in space with certain colors. There’s a tree out there and there’s a pond. And roughly I’d say that all that’s real if there are things out there roughly as they seem to be, but if all this is an illusion, then those things are not real. So then we say things are real if they’re not an illusion, if they’re roughly as they seem to be. So one thing I then do is to try to argue that for the simulation hypothesis, at least if we’re in a simulation, then the objects we perceive are real in all five of those senses. They have causal powers. They can do things. They’re out there independently of our minds. They exist. They’re genuine.

That’s a real tree, at least by what we mean by tree. And they’re not illusions. So five out of five on what I call the reality checklist. Ordinary virtual reality, I want to say, gets four out of five. The virtual objects we interact with are still digital objects with causal powers out there independently of us. They exist. They needn’t be illusions. I argue at length that your experiences in VR needn’t be illusions. You can correctly perceive a virtual world as virtual. But arguably they’re not genuine. Maybe, for example, the virtual kitten that you interact with in VR. Okay, it’s a virtual kitten, but it’s not a genuine kitten any more than the robot kitten is. So maybe virtual tables are not, at least in our ordinary language, genuine tables. Virtual kittens are not genuine kittens, but they’re still real objects, but maybe there’s some sense in which they fail one of the five criteria for reality. So I would say ordinary virtual realities, at least as we deal with them now, may get to four out of five or 80% on the reality checklist.

It’s possible that our language might evolve over time to eventually count virtual chairs as genuine chairs and virtual kittens as genuine kittens. And then we might be more VR inclusive in our talk. And then maybe we’d come to regard virtual reality as five out of five on the checklist. But anyway, that’s the rough way I ended up breaking down these notions into at least five. And of course, one way to come back is to say, ah, you’ve missed the crucial notion of reality. Actually, to be real requires this, and VR is not real in that sense. I just read a review of the book where someone said, ah, look, obviously VR isn’t real because it’s not part of the base level of reality. The fundamental outer shell of reality, that’s what’s real. So I guess this person was advocating that to be real you’ve got to be part of the base fundamental outer shell of reality. I mean, I guess I don’t see why that has to be true.

Lucas Perry: I mean, isn’t it though?

David Chalmers: Well.

Lucas Perry: It’s implemented on that.

David Chalmers: Yeah. It’s true so that’s one way to come back to this is to say the digital objects ultimately do exist in the outer shell. They’re just diverse.

Lucas Perry: They’re undivided from the outer shell. They just can be conceptualized as secondary.

David Chalmers: Yeah, no, it is very much continuous with it. I want to say at the very least they’re on a par with, like, micro universes. I mean, people talk now about, say, baby universes growing up in black holes inside a larger universe, and people take that seriously. And then we’d still say, okay, well, this universe is part of this universe, and that part of the universe can be just as real as the universe as a whole. So I guess I don’t think being the whole universe is required to be real. We’ve got to acknowledge reality to parts of the world.

Lucas Perry: So we have kind of like a common sense ontology, a common sense view of the world, and to me it seems like that’s more Newtonian-feeling. Science evolves and then we get quantum mechanics. And so something you explore in your book is this difference between, I forget what you call it, like the conventional view of the world. And then, oh, sorry, the manifest and the scientific image is what you call it. And part of this manifest image is that it seems like our common sense ontology is kind of platonic.

So how would you describe the common sense view of what is real?

David Chalmers: Yeah, I talk about the garden of Eden, which is our naive pre-theoretical sense of the world before we’ve started doing science and developing a more sophisticated view. I do think we have got this tendency to think about reality as like, yeah, billiard balls out there and solid objects, colored objects out there in a certain space, an absolute three-dimensional space with one dimension of time. I think that’s the model of reality we had in the garden of Eden. So yeah, one of the conceits in the book is, well, in the garden of Eden things actually were that way. There were three absolute dimensions of space and one dimension of time. Objects were rock solid. They were colored. The way I mark this in the book is by capital letters: say, in the garden of Eden, there was capital S “Space” and capital T “Time” where objects were capital S “Solid” and capital C “Colored.”

They were capital R “Red” and capital G “Green.” And maybe there was capital G “Good” and bad and capital F “Free will” and so on. But then we develop the scientific view of the world. We eat from the tree of knowledge. It gives us knowledge of science and then, okay, well, the world is not quite like that naive conception implied. There’s four-dimensional spacetime without an absolute space or time. Objects don’t seem to have these primitive colors out there on their surfaces. They just have things like reflectance properties that reflect light in a certain way that affects our experience in a certain way. Nothing is capital S “Solid.” The objects are mostly empty space, but they still manage to resist penetration in the right way. So I think of this as the fall from Eden. And for many things we’ve gone from capital S “Space” to lowercase S “space.” We’ve gone from capital S “Solidity” to lowercase S “solidity.”

And one thing that I think goes on here is that we’ve moved from kind of a conception of these things as primitive, primitive space and primitive colors, just like redness out there on the surface of things, what I call primitivism, to a kind of functionalism where we understand things in terms of their effects. To be red now is not to have some absolute intrinsic quality of redness, but it’s to be such as to affect us, to produce certain experiences, to look red. To be solid is not to be absolutely intrinsically solid, but to interact with other objects in such a way that they’re solid.

So I think in general, this goes along with moving from a conception of reality as all these absolute intrinsic properties out there to a much more structuralist conception of reality, where what matters for things being real is the right patterns of causal interaction of entities with each other. I’m not saying all there is to reality is structure. My own view is that consciousness in particular is not just reducible to this kind of abstract structure. Consciousness does in fact have some intrinsic qualities and so on. So I do think that’s important too, but I do think in general, the move from the naive conception to the scientific conception of reality has often involved going from a conception of these primitive intrinsic qualities to a more structural conception of reality.

Lucas Perry: Right. So I imagine that many of the people who will resist this thesis in your book, that virtual reality is genuine reality, may be coming at it from some of these more common sense intuitions about what it means for something to be real, like red as a property that’s intrinsic on the surface of a thing. So are there common sense intuitions or misconceptions that you see your book as addressing?

David Chalmers: I guess I do think, yeah. Many people do find it common sense that virtual reality is not full scale reality, first class reality. It doesn’t live up to our ordinary conception of reality. And sometimes I think they may have in mind this Edenic conception of reality, the way it was in the Garden of Eden, to which my reply is: yeah, okay, I agree. Virtual reality does not have everything that we had in the Garden of Eden conception of reality, but neither does ordinary physical reality. Even the kind of physical reality developed in light of science, it’s not the Garden of Eden picture of reality anymore. We’ve lost absolute space and absolute time. Now we’ve lost absolute colors and absolute solidity. What we have is now this complex mathematical structure of entities interacting at a deep level.

I mean, the further you look, the more evanescent it gets. Quantum mechanics is just this wave function where objects don’t need to have determinate positions, and who knows what’s going on there. In string theory and other quantum gravity theories, it looks like space may not be fundamental at all. People have entertained the idea that time is not fundamental at all. So I’m saying virtual reality is genuine reality, but one way to paraphrase that is virtual reality is just as real as physical reality. If you want to hear that as saying, well, physical reality has turned out to be more like virtual reality, then I wouldn’t necessarily argue with that. Physical reality is not the Garden of Eden billiard ball conception of reality anymore.

It’s this much more evanescent thing, which is partly characterizable by just having a certain kind of structure. And I think all that we can find in virtual reality. So yeah, one thing I would do to this person questioning is to say, well, what do you think even about physical reality in light of the last hundred years or so of science?

Lucas Perry: Yeah. The reviewer’s comments that you mentioned come off to me as kind of being informed by the Eden view.

David Chalmers: Yeah. I think that’s right. It’s quite common. That’s really what it is. It’s our naive conception of reality and what reality is like, but yeah, maybe then it’s already turned out that the world is not real in that sense.

Lucas Perry: One thing I’d like to pivot here into is exploring value more. How do you see the question of value fitting into your book? There’s this other central thesis here that you can live a good life in virtual reality, which seems to go against people’s common intuitions that you can’t. There’s this survey about whether or not people would go into experience machines and most people wouldn’t.

David Chalmers: Yeah, Nozick had this famous case of the experience machine, where your body’s in a tank, and you get all these amazing experiences of being highly successful. Most people say they wouldn’t enter the experience machine. I think of professional philosophers on a survey we did, maybe 15% said they would enter and 70 odd percent said they wouldn’t. And a few agnostic. Many people have treated the experience machine as a model for VR in general. But I think the experience machine as Nozick described it is actually different from VR in some respects. One very important respect is that the experience machine seems to be scripted, seems to be pre-programmed. You go in there and your life will live out a script. You get to become world champion, but it wasn’t really anything you did. That was just the script playing itself out. Many people think that’s fake. That’s not something I actually did. It was just something that happened to me.

VR by contrast, you go into VR, even an ordinary video game, you’ve still got some degree of free will. You’re to some extent controlling what happens. You go into Second Life, or Fortnite, whatever; basically, it’s not scripted. It’s not pre-programmed, it’s open ended. I think the virtual worlds of the future will be increasingly open ended. I don’t think worries about the experience machine tend to undermine virtual worlds. More generally, I want to argue that yeah, virtual worlds can basically be on a par with physical worlds, especially once we’ve recognized that they needn’t be illusions, they needn’t be pre-programmed and so on. Then what are they missing? What’s important to a good life? Maybe consciousness, the right subjective experiences. Also, relationships, very, very important. But I think in VR, certainly at least in a multi-user VR where many people are connected.

That’s another thing about the experience machine, it’s just you, presumably, who’s conscious. But in a VR, I’m assuming a virtual world with many conscious beings, you can have relationships with them and get the social meaning of your life that way. Knowledge and understanding, I think you can come to have all those things in VR. I think basically all the determinants of a good life, it’s hard to see what’s in principle missing in VR. There are some worries. Maybe if you want a fully natural life, a life which is as close to nature as possible, VR is not going to do it because it’s going to be removed from nature. But then many of us live in cities or spend most of our time indoors. That’s also removed from nature and it’s still compatible with a meaningful life. There are issues about birth and death, where it’s not obvious how genuine birth and death will work, at least in near term virtual worlds.

Maybe once there’s uploading, there’ll be birth and death in virtual worlds if the relevant creatures are fully virtual. But you might think if virtual worlds lack birth and death, there are aspects of meaning that they lack. I don’t want to say they’re exactly on a par with physical reality in all respects, but I’d say that virtual realities can at least have the prime determinants of a good and meaningful life. It’s not to say that life in virtual reality is going to be wonderful. It may well be awful, just as life in physical reality could be awful. But my thesis is roughly that at least the same range of value, from the wonderful to the awful, is possible in virtual reality, just as it is in physical reality.

Lucas Perry: It sounds like a lot of people are afraid that they’ll be losing out on some of the important things you get from natural life, if virtual life were to take over?

David Chalmers: What are the important things you have in mind?

Lucas Perry: You mentioned people want to be able to accomplish things. People want to be a certain sort of person. People want to be in touch with a deeper reality.

David Chalmers: I certainly think in VR, you can be a certain person, very characteristic, with your own personal traits. You can have transformative experiences in virtual reality. Probably you can develop as a person. You can certainly have achievements in VR.

People who live and spend a lot of time, long term, in worlds like Second Life certainly have real achievements, real relationships. Being in touch with a deeper reality, if by a deeper reality you mean nature: in VR you’re somewhat removed from nature, but I think that’s somewhat optional.

In the short term at least, there are things like the role of the body, in existing VRs embodiment is extremely primitive. You’ve got these avatars, but our relationship with them is nothing like our relationship with our physical body. Things like eating, drinking, sex, or just physical companionship and so on. There’s not genuine analogs for those in existing VR. Maybe as time goes on, those things will become better. But I can imagine people thinking I value experiences of my physical body, and real eating and drinking and sex and companionship and so on and physical bodies.

But I could also imagine other people saying actually in VR now, in 200 years time people will say we’ve got these virtual bodies, which are actually amazing. Can do all that and give you all those experiences and much more and hey, you should try this. Maybe different people would prefer different things. But I do think to some considerable extent, thoughts about the body may be responsible for a fair amount of resistance to VR.

Lucas Perry: Could you talk a little bit about the different kinds of technological implementations of virtual reality? Whether it be uploading, or brains connected to virtual realities.

David Chalmers: Right now the dominant virtual worlds are not even VR at all of course. The virtual worlds people use the most now are video game style worlds typically on desktop or mobile computers on 2D screens.

But immersive VR is picking up speed fast with virtual reality headsets like the Oculus Quest, and they’re still bulky and somewhat primitive. But they’re getting better every year and they’ll gradually get less bulky and less primitive, with more detail, better images and so on.

The other form factor, which is developing fast now is the augmented reality form with something like glasses, or transparent headsets that allow you to see the physical world, but also project virtual objects among the physical world.

Maybe it’s an image of someone you’re talking to. Maybe it’s just some information you need for dealing with the world. Maybe it’s a Pokemon Go creature you’re trying to acquire for your digital collection.

That’s the augmented reality form factor in glasses. A lot of people think that over the next 10 or 20 years, the augmented and virtual reality form factors could converge. Eventually we’ll be able to maybe have a set of glasses that could project digital objects into your environment, based on computer processes.

Maybe you could dial maybe a slider, which you go all the way down to dial out the physical world, be in a purely virtual world. Dial all the way up to be in a purely physical world, or in between, have elements of both.

That’s one way the technology seems to be going. The longer term there’s the possibility of bringing in brain computer interfaces. I think VR with standard perceptual interfaces works pretty well for vision and for hearing. You can get pretty good visual and auditory experiences from VR headsets, but embodiment is much more limited via sense of your own body.

But maybe once brain computer interfaces are possible, then there’ll be ways of getting elements, these computational elements to interact directly with bits of your brain. Whether it’s say visual cortex, auditory cortex for vision and hearing, or for the various aspects of embodied experience processed by the parts of the brain responsible for bodily experience.

Maybe that could eventually give you more authentic bodily experiences. Then eventually, potentially all kinds of computational circuitry could come to be embedded with brain circuitry, giving circuitry that is partly biological and partly digital.

In the long term of course, there’s the prospect of uploading, which is uploading the brain entirely to a digital process. Maybe once our brains are wearing out, we’ve replaced some of them with silicon circuitry, but if you want to live forever, you upload yourself completely.

You’re running on digital circuitry. Of course, this raises so many philosophical issues. Will it still be me? Will I still be conscious? And so on. But assuming that it is possible to do this and have conscious beings with this digital technology, then that being could then be fully continuous with the rest of the world.

That would just open up so much potential for new virtual reality, combined with new cognitive processes, possibly giving rise to experiences that we can’t now even imagine. Now this is very distant future, I’m thinking 100 plus years, who knows.

Lucas Perry: You have long AGI timelines.

David Chalmers: This all does interact with AGI. I’m on record as 70% chance of AGI within a century. Maybe that’s sped up a bit.

Lucas Perry: You have shorter timelines.

David Chalmers: As far as this interacts with AI, I’m maybe on 50 years mean expected value for years until AGI. Once you go to AGI, all this stuff ought to have happened pretty fast. Maybe there’s a case for saying that within a century is conservative.

Lucas Perry: For uploads?

David Chalmers: Yeah, for uploads. I think once you go to AGIs, uploads are presumably-

Lucas Perry: Around the corner?

David Chalmers: … uploads are around the corner. At least if you believe like me, that once you go to AGI, then you’ll have AGI plus, and then you’ll have AGI plus, plus super intelligence. Then the AGI plus, plus is not going to have too much trouble with uploading technology and the like.

Lucas Perry: How does consciousness fit in all this?

David Chalmers: One very important question for uploading is whether uploads will even be conscious. This is also very relevant to thinking about the simulation hypothesis. Because if computer simulations of brains are not conscious, then it looks like we can rule out the simulation hypothesis, because we know we are conscious.

If simulations couldn’t be conscious, then we’re not simulations. At least the version of the simulation hypothesis, where we are part of the simulation could then be ruled out.

Now as it happens, I believe that simulations can be conscious. I believe consciousness is independent of substrate. It doesn’t matter whether you’re up and running on biology or on silicon, you’re probably going to be conscious.

You can run these familiar thought experiments, where you replace say neurons by silicon chips, replace biology by digital technology. I would argue that consciousness will be preserved.

That means at the very least gradual uploading, where you upload bits of your brain, let’s say a neuron at a time. I think that’s a pretty plausible way to preserve consciousness and preserve identity. But if I’m wrong about that and I could be, because nobody understands consciousness.

If I’m wrong about that, then uploads will not be conscious and these totally simulated worlds that people produce could end up being worlds of zombies. That’s at least something to worry about.

It’d be certainly risky to upload everybody to the cloud, to digital processes. Always keep some people anchored in biology just in case consciousness does require biology, because it’d be a rather awful future to have a world of super intelligent, but unconscious zombies being the only beings that exist.

Lucas Perry: I’ve heard from people who agree with substrate independence that a digital or classical computer can’t be conscious. Are you aware of responses like that? And do you have a response to people who agree that consciousness is substrate independent, but that classical digital computers can’t be conscious?

I’m not sure what their exact view is, but something like the bits don’t all know about all the other bits. There’s no integration to create a unified conscious experience.

David Chalmers: The version of this I’ve heard I’m most familiar with, comes from Giulio Tononi’s Integrated Information Theory. Tononi and Christof Koch have argued that processes running on classical computers that is on von Neumann architectures cannot be conscious.

Roughly because von Neumann architectures have this serial core that everything is run through. They argue that this doesn’t have the property that Tononi calls integrated information and therefore is not conscious.

Now I’m very dubious about these arguments. I’m very dubious about a theory that says this serial bottleneck would undermine consciousness. I just think that’s all part of the implementation.

You could still have 84 billion simulated neurons interacting with each other. The mere fact that their interactions are mediated by a common CPU, I don’t see why that should undermine consciousness.

But if they’re right then fine, I’d say they’ve just discovered something about the functional organization that is required for consciousness. It needs to be a certain parallel organization as opposed to this serial organization.

But if so, you’re still right, it’s still perfectly substrate independent. As long as we upload ourselves not to a von Neumann simulation, but to a parallel simulation, which obviously it’s going to be the most powerful and efficient way to do this anyway, then uploading ought to be possible.

I guess another view is that consciousness could turn out to require to rely on quantum computation in a certain essential way. A mere classical computer might not be conscious, whereas quantum computers could be.

If so, that’s very interesting, but I would still imagine that all that would also be substrate independent and for uploading them, we just need to upload ourselves to the right quantum computer. I think those points while interesting, don’t really provide fundamental obstacles to uploading with consciousness here.

Lucas Perry: How do you see problems in the philosophy of identity fitting in here into virtual reality? For example with Derek Parfit’s thought experiments.

David Chalmers: Parfit had these famous thought experiments about the teletransporter from Star Trek, where you duplicate your body. Is that still me at the other end? The uploading cases are very similar to that in certain respects.

The teletransporter, you’ve got so many different cases. You’ve got is the original still around, then you create the copy? What if you create two copies? All these come up in the uploading case too.

There’s destructive uploading where we destroy the original, but create an upload. There’s non-destructive uploading, where we keep the original around, but also make an upload. There’s multiple copy uploading and so on.

In certain respects, they’re very much analogous to the teleporter case. The change is that we don’t duplicate the being biologically. We end up with a silicon isomorph, rather than a biological duplicate.

But aside from that, they’re very similar. If you think that silicon isomorphs can be just as conscious as biological beings, maybe the two things roughly go together.

The same puzzle cases very much arise. Just say the first uploads are non-destructive, we stay around and we create uploaded copies. Then the tendency is going to be to regard the uploads as very different people from the original.

If the first uploads are destructive, you make copies while destroying the original. Maybe there’s going to be much more of a tendency to regard the uploads as being the same person as the original.

If we could make multiple uploads all the time, then there’ll be maybe a tendency to regard uploads as second class citizens and so on. The thought experiments here are complex and wonderful.

I tend myself to be somewhat sympathetic with Parfit’s deflationary views of these things, which is there may not be very much absolute continuity of people over time. Perhaps the very concept of personal identity may be one of these Edenic concepts, that we actually persist through time as absolute subjects.

Maybe all there are just different people at different times that stand in psychological and memory and other continuity relations to each other. Maybe that’s all there is to say.

This gets closer now to Buddhist style, no self views, at least with no Edenic capital S “Self,” but I’m very unsure about all of these matters about identity.

Lucas Perry: How would you upload yourself?

David Chalmers: I think the safest way to do it, would be gradually. Replace my neurons one at a time by digital circuits. If I did it all at once, destroy the original, create an uploaded copy, I’d worry that I’d be gone. I don’t know that. I just worry about it a bit more.

To remove that worry, do it gradually and then I’m much less worried that I’d be gone. If I can do it a bit at a time I’m still here. I’m still here. I’m still here. To do it with maximum safety, maybe I could be conscious throughout, with a continuous stream of consciousness throughout this process.

I’m here watching the operation. They change my neurons over and in that case, then it really seems to me as if there’s a continuous stream of consciousness. Continuous stream of consciousness seems to either, I don’t know if it guarantees identity over time, but it seems pretty close to what we have in ordinary reality.

Having a continuous stream of consciousness over time seems to be the thing that goes along with what we usually think of as identity over time. It’s not required because we can fall asleep and arguably lose consciousness and wake up.

Most people would say we’re still the same person, but still being continuously conscious for a period, seems about as good a guarantee as you’re going to get of being the same person. Maybe this would be the philosophically safest way to upload.

Lucas Perry: Is sleeping not an example that breaks that?

David Chalmers: I’m not saying it’s a necessary condition for personal identity, just a sufficient condition. Just plausibly, continuous consciousness is sufficient for identity over time. So far there is identity over time. Yes, it’s probably too strong a condition.

Maybe you can get identity from much weaker relations, but in order to be as safe as possible, I’m going to go with the strongest sufficient condition.

Lucas Perry: One neuron at a time.

David Chalmers: Maybe 10 neurons at a time. Maybe even a few columns at a time. I don’t know.

Lucas Perry: Do you think Buddhists who realize no self would be more willing to upload?

David Chalmers: I would think so and I would hope so. I haven’t done systematic polls on this. Now I’m thinking I’ve got to get the data from the last PhilPapers survey and find views on uploading, which we asked about. We didn’t ask about are you Buddhist? We didn’t ask do you for example, specialize in Asian philosophy?

I wonder if there could at least be a correlation between specialization in Asian philosophy, and certain views about uploading. Although it’ll be complicated by the fact that this will also include Hindu people who very much believe in absolute self, and Chinese philosophers who have all kinds of very different views. Maybe it would require some more fine grained survey analysis.

Lucas Perry: I love that you do these surveys. They’re very cool. Everyone should check them out. It’s a really cool way to see what philosophers are thinking. If you weren’t doing them, we wouldn’t know.

David Chalmers: Go to philsurvey.org. In this latest survey in 2020, we surveyed about 2000 odd philosophers from around the world, on 100 different philosophical questions like God, theism or atheism, mind, physicalism or non-physicalism and so on.

We actually got data about what professional philosophers tend to believe. You can look at correlations between questions, correlations with area, with gender, with age and so on. It’s quite fascinating. Go to philsurvey.org you’ll find the results.

Lucas Perry: Descartes plays a major role in your book, both due to his skepticism about the external world, and whether or not it is that we know anything about it. Then there’s also the mind body problem, which you explore. Since we’re talking about consciousness and the self, I’m curious if you could explain how the mind body problem fits in all this?

David Chalmers: In a number of ways. Questions about the mind are not front and center in this book, but they come up along the way in many different contexts. In the end, actually part five of the book has three chapters on different questions about the mind.

One of them is the question we’ve just been raising. Could AI systems be conscious? Could uploading lead to a conscious being and so on? That’s one chapter of the book. But another one just thinks about mind, body relations in more ordinary virtual realities.

One really interesting fact about existing VR systems is that, if you actually look at virtual worlds, they fit that Cartesian picture. Descartes thought there’s a physical world that the mind interacts with, and that the mind is outside the physical world, but somehow interacts with it.

You look at a virtual world, virtual worlds often have their own physics and their own algorithmic processes that govern the physical processes in the virtual world. But then there’s this other category of things, users, players, people who are using VR and they are running on processes totally outside the virtual world.

When I enter a VR, the VR has its own physics, but I am not subject to that physics. I’ve got this mind which is operating totally outside the virtual world. You can imagine if somebody grew up in a virtual world like this.

If Descartes grew up in a virtual world, we’ve got an illustration where Descartes grows up inside Minecraft and gets into an argument with Princess Elizabeth, about whether the mind is outside this physical world interacting with it.

Most people think that the actual Descartes was wrong, but if we grew up in VR, Descartes would be right. He’d say yeah, the mind is actually something outside. He’d look at the world around him and say, “This is subject to its physics and so the mind is just not part of that. It’s outside all that. It exists in another realm and interacts with it.”

There’s a perspective of the broader realm, from which all this looks physical and continuous. But at least from the perspective of the virtual world, it’s as if Descartes was right.

That’s an interesting illustration of a Cartesian interactionist dualism, where the mental and the physical are distinct. It shows a way in which something like that could turn out to be true under certain versions of the simulation hypothesis, say with brains interacting with simulations.

Maybe it, or something isomorphic to it, is true even in ordinary virtual realities. At least that’s interesting in making sense of this mind body interaction, which is often viewed as an unscientific or non-naturalistic idea. But here’s a perfectly naturalistic version of mind body dualism.

Lucas Perry: I love this part and also found it surprising for that reason, because Cartesian dualism, it always feels supernatural, but here’s a natural explanation.

David Chalmers: One general theme in this book is that there’s a lot of stuff that feels supernatural, but once you look at it through the lens of VR, needn’t be quite so supernatural, looks a lot more naturalistic. Of course, the other example is God. If your creator is somebody, a programmer in the next universe up, suddenly God doesn’t look quite so supernatural.

Lucas Perry: Magic is like using the console in our reality to run scripts on the simulator’s world, or something like that.

David Chalmers: This is naturalistic magic. Magic has to obey its own principles too. There’s just different principles in the next universe up.

Lucas Perry: Clearly it seems your view is consciousness is the foundation of all values, is that right?

David Chalmers: Pretty much. Pretty much. Without consciousness no value. I don’t want to say consciousness is all there is to value. There might be other things that matter as well, but I think you probably have to have consciousness to have value in your life.

Then for example, relations between conscious beings, relations between consciousness and the world can matter for value. Nozick’s experience machine tends to suggest that consciousness alone is not quite enough.

There’s got to be maybe things like actually achieving your goals and so on that matters as well. But I think consciousness is at the very core of what matters and value.

Lucas Perry: We have virtual worlds and people don’t like them, because they want to have an interaction with whatever’s like natural, or they want to be a certain kind of person, or they want the people in it to be implemented. They want them in real space, things like that.

Part of what makes being in Nozick’s experience machine unsatisfactory, is knowing that some of these things aren’t being satisfied. But what if you didn’t know that those things weren’t being satisfied? You thought that they were.

David Chalmers: I guess my intuition is that’s still bad. There’s this famous case that people have raised. Just say your partner is unfaithful to you, but it’s really important to you that your relationship be monogamous. However, your partner, although professing monogamy, has gone off and had relationships with all these other people.

You never know and you’re very happy about this and you go to your death without ever knowing. I think most people’s intuition is that is bad. That life is not as good, as one where the life was the way this person wanted it to be with the monogamous partner.

That brings out that having your goals or your desires satisfied, the world being the way you want it to be, that matters to how good and meaningful a life is. Likewise, I’d say that I think the experience machine is a more extreme example of that.

We really want to be doing these things. If I was to find out 100 years later that hey, any success I’d had in philosophy wasn’t because I wrote good books. It’s just because there was a script that said there’d be certain amounts of success and sales and whatever.

Then boy, that would render any meaning I’d gotten out of my life perfectly hollow. Likewise, even if I never discovered this, if I had the experience of having the successful life, but it was all merely pre-programmed, then I think that would render my life much less.

It’d still be meaningful, but just much less good than I thought it had been. That brings out that the goodness, or the value of one’s life depends on more than just how one experiences things to be.

Lucas Perry: I’m pushing on consequentialist or utilitarian sensibilities here, who might bite the bullet and say that if you didn’t know any of those things, then those worlds are still okay. One thing that you mentioned in your book is that your belief that virtual reality is good, is independent of the moral theory that one has. Could you unpack that a bit?

David Chalmers: I don’t know if it’s totally independent, but I certainly think that my view here is totally consistent with consequentialism and utilitarianism that says, what matters in moral decision making is maximizing good consequences, or maximizing utility.

Now, if you go on to identify the relevant good consequences, with conscious states like maximizing pleasure, or if you say all there is to utility, is the amount of pleasure. Then you would take a different view of the experience machine.

If you thought that all that there is to utility is pleasure and you say in the experience machine, I have the right amount of pleasure so that’s good enough. But I think that’s going well beyond consequentialism, or even utilitarianism.

That’s adding a very specific view of utility and is the one that the founders of utilitarianism had, like Bentham and Mill. I would just advocate a broader view of consequentialism, or utilitarianism where there are values that go beyond value deriving from pleasure, or from conscious experience.

For example, one source of value is having your desires satisfied or achieving your goals. I think that’s perfectly consistent with utilitarianism, but maybe more consistent with some forms than others.

Lucas Perry: Is having your values satisfied, or your preference satisfied, not just like another conscious state?

David Chalmers: I don’t think so, because you could have two people who go through exactly the same series of conscious states, but for one of whom their desires are satisfied, and for the other one, their desires are not satisfied, maybe they both think their desires are satisfied, but one of them is wrong. They both want their partners to be monogamous. One partner is monogamous and the other one is not. They might have exactly the same conscious states, but one has the world is the way they want it to be, and the other one, the world is not the way they want it to be.

This is what Nozick argued and others have argued in light of the experience machine: that, yeah, there’s a value maybe in desire satisfaction that goes beyond the value of consciousness, per se. I should say both of these views, even the pleasure centric view, are totally consistent with my general view of VR. If someone says, “All that matters is experiences,” well, in a certain sense, great. That makes it even easier to lead a good life in VR. But I think if the dialectic is the other way around, even if someone rejects that view … I tend to believe there’s more that matters than just consciousness. Even if you say that, you can still have a good life in a virtual world.

I mean, there’ll be some moral views where you can’t. Just say you’ve got a biocentric view of what makes a life good. You’ve got to have, somehow, the right interactions with real biology. I don’t know, then maybe certain virtual worlds won’t count as having the right kind of biology, and then they won’t count as valuable. So I wouldn’t say these issues are totally independent of each other, but I do think plausible moral theories are, yeah, very much going to be consistent with being able to have a good life in virtual worlds.

Lucas Perry: What does a really good life in a virtual world look like to you?

David Chalmers: Oh boy. What does a really good life look like to me? I mean, different people have different values, but I would say I get value partly from personal relationships, from getting to know people, by having close relationships with my family, with partners, with friends, with colleagues. I get a lot of value from understanding things, from knowledge and understanding. I get some value from having new experiences and so on. And I guess I’d be inclined to think that in a virtual world, the same things would apply. I’d still get value from relationships with people, I’d still get value from knowledge and understanding, I’d still get value from new kinds of experience.

Now, there may be ways in which VR might allow this to go beyond what was possible outside VR. Maybe, for example, there’ll be wholly new forms of experience that go way beyond what was possible inside physical reality, and maybe that would allow for a life which is better in some respects. Maybe it’ll be possible to have who knows what kind of telepathic experiences with other people that give you even closer relationships that are somehow amazing. Maybe it’ll allow immortality, where you can go on having these wonderful experiences for an indefinite amount of time, and that could be better.

I guess in the short term, I think, yeah, the kind of good experiences I’ll have in VR are very much continuous with the good experiences I’ll have elsewhere. I meet friends sometimes in VR, interact with them, talk with them, sometimes play games, sometimes communicate, maybe occasionally have a philosophy lecture or a conference there. So right now, yeah, what’s good about VR is pretty much continuous with a lot of what’s good about physical reality. But yeah, in the long term, there may be ways for it to go beyond.

Lucas Perry: What’s been your favorite VR experience so far?

David Chalmers: Oh boy, everything is fairly primitive for now. I enjoy a bunch of VR games, and I enjoy hanging out with friends. One enjoyable experience was I gave a little lecture about VR, in VR, to a group of philosopher friends. And we were trying to figure out the physics of VR, of the particular virtual world we were in, which was on an app called Bigscreen.

Lucas Perry: Yeah.

David Chalmers: And yeah, you do things in Bigscreen, like you throw tomatoes, and they behave in weird ways. They kind of obey laws of physics, but they kind of don’t, and the avatars have their own ways of moving. So we were trying to figure out the basic laws of Bigscreen, and we didn’t get all that far, but we figured out a few things. We were doing science inside a virtual world. And presumably, if we’d kept going, we could have gotten a whole lot further and gotten further into the depths of what the algorithms really are that generate this virtual world, or that might have required a scientific revolution or two. So I guess that was a little instance of doing a bit of science inside a virtual world and trying to come to some kind of understanding, and it was at least a very engaging experience.

Lucas Perry: Have you ever played any horror games?

David Chalmers: Not really, no. I’m not much of a gamer, to be honest. I play some simple games like Beat Saber or, what is it? SUPERHOT. But that’s not really a horror game. Super assassins come after you, but what’s your favorite horror game?

Lucas Perry: I was just thinking of my favorite experience and it was probably … Well, I played Killing Floor once when I first got the VR, and it was probably the most frightening experience of my life. The first time you turn around and there’s an embodied thing that feels like it’s right in your face, very interesting. In terms of consciousness and ethics and value, we can explore things like moral patiency and moral agency. So what is your view on the moral status of simulated people?

David Chalmers: My own view is that the main thing, the biggest thing that matters for moral status is consciousness. So as long as simulated beings are conscious as we are, then they matter. Now maybe current non-player characters of the kind you find in video games and so on are basically run by very simple algorithms, and most people would think that those beings are not conscious, in which case their lives don’t matter, in which case it’s okay to shoot these current NPCs in video games.

I mean, maybe we’re wrong about that, and maybe they have some degree of consciousness and we have to worry. But at least the orthodox view here would be that they’re not, and even on a view that ascribes some consciousness, it’s probably a very simple form of consciousness. But if we look now to a longterm future where there are simulations of brains and simulated AGIs inside these simulated worlds with capacities equivalent to our own, I’d be inclined to think that these beings are going to be conscious like us. And if they’re conscious like us, then I think they matter morally, the way that we do, in which case, one should certainly not be indiscriminately killing simulated beings just because it’s convenient, or just indiscriminately creating them and turning them off. So I guess if we do get to the point where …

I mean, this applies inside and outside simulations. If we have robot style AGIs that are conscious and they have moral status like ours, if we have simulation style AGIs inhabiting simulations, they also have moral status, much like ours. Now, I’m sure there’s going to be a long and complicated path to actually having that play out in social and legal contexts, and there may be all kinds of resistance to granting simulations legal rights, social status, and so on. But philosophically, I guess I think that, yeah, if they’re conscious like us, they have a moral status like ours.

Lucas Perry: Do you think that there will be simulated agents with moral status that are not conscious, for example? They could at least be moral agents and not be conscious, but in a society and culture of simulated things, do you think that there would be cases where things that are sufficiently complex, yet not conscious, would still be treated with moral patiency?

David Chalmers: It’s interesting. I’m inclined to think that any system that has human level behavior is likely to be conscious. I’m not sure that there are going to be cases of zombies that lack consciousness entirely, but behave in extremely sophisticated ways just like us. But I might be wrong. Just say Tononi and Koch are right and that no being running on von Neumann architecture is conscious, then yeah. Then it might be smart to develop those systems because they won’t have moral status, but they’ll still be able to do a lot of useful things. But yeah, would they still then be moral agents?

Well, yeah, presumably these behaviorally equivalent systems could do things that look a lot like making moral decisions, even though they’re not conscious. Would they be genuine agents if they’re not conscious? That may be partly a verbal matter, but they would do things that at least look a lot like agency and making moral decisions. So they’d at least be moral quasi-agents. Then it’s an interesting question whether they should be moral patients, too. If you’ve got a super zombie system making moral decisions, does it deserve some moral respect? I don’t know. I mean, I’m not convinced that consciousness is the only thing that matters, morally. And maybe, for example, intelligence or planning or reasoning carries some weight independent of consciousness.

If that’s the case, then maybe these beings that are not conscious could still have some moral status as moral patients, that is deserving to be treated well, as well as just moral agents, as well as just performing moral action. Maybe it would be a second class moral patiency. Certainly, if the choice was between, say, killing a being like that and killing an equivalent conscious being, I’d say, yeah, kill the unconscious one. But that’s not to say they have no moral status there.

Lucas Perry: So one of your theses that I’d like to hit on here as well was that we can never know that we’re not in a simulation. Could you unpack this a bit?

David Chalmers: Yeah. Well, this is very closely connected to these traditional questions in epistemology. Can you know you’re not dreaming now? Could you know that you’re not being fooled by an evil demon now? The modern tech version is, “Can you know you’re not in a simulation?” Could you ever prove you’re not in a simulation? And there’s various things people might say, “Oh, I am not in a simulation.” I mean, naively, “This can’t be a simulation because look at my wonderful kitten here. That could never be simulated. It’s so amazing.” But presumably there could be simulated kittens. So that’s not a decisive argument.

More generally, I’m inclined to think that for any evidence anyone could come up with that’s allegedly proof that we’re not in a simulation, that evidence could be simulated, and the same experience could be generated inside a simulated world. This starts to look like there’s nothing, there’s no piece of evidence that could ever decisively prove we’re not in a simulation. And the basic point is just that a perfect simulation would be indistinguishable from the world it’s the simulation of. If that’s the case, awfully hard to see how we could prove that we’re not in a simulation.

Maybe we could get evidence that we are in a simulation. Maybe the simulators could reveal themselves to us and show us the source code. I don’t know. Maybe we could stress test the simulation by running a really intense computer process, more advanced than before suddenly, and maybe it stresses out the simulation and leads to a bug or something. Maybe there are ways we could get evidence.

Lucas Perry: Maybe we don’t want to do that.

David Chalmers: Okay. Maybe that will shut us down.

Lucas Perry: That’ll be an x-risk.

David Chalmers: Yeah. Okay. Yeah. Maybe not a good idea. So there are various ways we could get evidence that we are in a simulation, at least in an imperfect simulation. But I don’t think we can ever get the evidence in the negative that fully proves that we’re not in a simulation. We can try and test for various imperfect simulation hypotheses, but if we get just the ordinary expected results, then it’s always going to be consistent with both. And there are various philosophers who tried to say, “Ah, there are things we could do to refute this idea.” Maybe it’s meaningless. Maybe we could rule it out by the non-simulation hypothesis being the simpler hypothesis and so on.

So in the book, I try and argue none of those things work either. And furthermore, once you think about the Bostrom style simulation argument, that says it may be actually quite likely that we’re in a simulation because most populations are likely, it seems pretty reasonable to think that most intelligent populations will develop simulation technology. Once you start thinking that way, I think it makes it even harder to refute the simulation hypothesis, basically, because by this point, these simulation style hypotheses used to be science fiction cases, very distant from anything we have direct reason to believe in.

But as the technology is developing, these simulation-style hypotheses become realistic hypotheses, ones which there is actually very good reason to think are likely to be developed both in our world and in other worlds. And I think that actually makes these … That’s had the effect of making these Cartesian scenarios move from the status of science fiction to being live hypotheses, and I think that makes them even harder to refute. I mean, you can make the abstract point that we can never prove it without the modern technology. But I think once they actually exist, once this technology is an existing technology, it becomes all the harder to epistemologically dismiss.

Lucas Perry: You give some credences in your book for whether or not we live in a simulation. Could you offer those now?

David Chalmers: Yeah. I mean, of course, anything like this is extremely speculative. But basically, in the book, I argue that if there are many conscious human-like simulations, then we are probably simulations ourselves, and then the question is, “Is it likely that there are many conscious human-like simulations?” And there’s a couple of ways that could fail. First, it could turn out that simulating beings like us or universes like ours is not even possible. Maybe the physics is uncomputable. Maybe consciousness is uncomputable. So maybe conscious human-like simulations like ours could be impossible. That’s one way this could fail to happen. That’s what I call a sim blocker. These are things that would block these simulations from existing. So one class of sim blockers is, yeah, simulations like this are impossible. But I don’t think that’s more than 50% likely. I’m actually more than 50% confident that simulations like this are possible.

The other class of sim blockers is, well, maybe simulations like this are possible, but for various reasons they’ll never be developed, or not many of them will be developed. And this class of sim blockers includes the ones that Bostrom focuses on. For example, I think there’s two of them, either we’ll go extinct before we get to that level of technology where we can create simulations, or we’ll get there, but we’ll choose never to create them, or intelligent civilizations will choose never to create them. And that’s the other way this can go wrong is, yeah, these things are possible, but not many of them will ever be created. And I basically say, “Well, if these are possible, and if they’re possible, many of them will be created, then many of them will be created and we’ll get a higher probability, we’re in a simulation.”

But then I think, “Okay, so what are the probabilities of each of those two premises?” That conscious human-like simulations are possible? Yeah, I think that’s at least 50%. Furthermore, if they’re possible, will many of them be created? I don’t know. I don’t know what the numbers are here, but I guess I’m inclined to think probably my subjective credence is over 50% in that too, given that it just requires some civilizations who eventually create a whole lot of them.

Okay, so 50% chance of premise one, 50% chance of premise two. Let’s assume they’re roughly independent of each other. That gives us a 25% chance they’re both true. If they’re both true, then most beings are simulations, and if most beings are simulations, then we’re probably simulations. Putting all that together, I get roughly, at least, 25% that we’re in a simulation. Now there’s a lot of room for the numbers to go wrong. But yeah, to me, that’s at least very good reason, A, to take the hypothesis seriously, and B, just suggests if it’s at 25%, we certainly cannot rule it out. So that gives a quasi-numerical argument that we can never know that we’re not in a simulation.
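
[Editor’s sketch, not part of the conversation] The arithmetic behind that 25% figure can be written out in a few lines of Python. The function name and the `p_sim_given_both` parameter are hypothetical, introduced only for illustration. The two 50% inputs are the lower bounds stated above, and treating "we are probably simulations" as a probability of 1 is a simplifying assumption, so the result is a rough bound rather than a precise credence.

```python
def simulation_credence_lower_bound(
    p_possible: float,
    p_many_given_possible: float,
    p_sim_given_both: float = 1.0,  # assumption: "probably simulations" treated as certain
) -> float:
    """Multiply the premises of the quasi-numerical argument together."""
    # Premise 1: conscious human-like simulations are possible.
    # Premise 2: if they are possible, many of them will be created.
    p_both = p_possible * p_many_given_possible
    # If both premises hold, most beings are simulated, so we probably are too.
    return p_both * p_sim_given_both


credence = simulation_credence_lower_bound(0.5, 0.5)
print(f"Rough lower bound on being in a simulation: {credence:.0%}")
```

Raising either input above 0.5, as the "at least" wording allows, raises the bound, while lowering `p_sim_given_both` below 1 lowers it.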

Lucas Perry: Well, one interesting part that seems to feed into the simulation argument is modern work on quantum physics. So we had Joscha Bach on who talked some about this, and I don’t know very much about it, but there is this debate over whether the universe is implemented in continuous numbers or non-continuous numbers. And if the numbers were continuous, then the universe wouldn’t be computable. Is that right?

David Chalmers: I’m not quite sure which debate you have in mind, but yeah, it certainly is right, that if the universe maximally is doing a real-valued computation, then real-valued computations can only be approximated on finite computers.

Lucas Perry: Right.

David Chalmers: On digital computers.

Lucas Perry: Right. So could you explain how this inquiry into how our fundamental physics work informs whether or not our simulation would be computable?

David Chalmers: I mean, there’s many aspects of that question. One thing that some people have actually looked into is whether our world might involve some approximations, some shortcuts. So Zohreh Davoudi and some other physicists have tried to look at the math and say, “Okay, just say our simulation, say there was a simulation that took certain shortcuts. How would that show up empirically?” So it’s, okay, this is going to be an empirical test for whether there are shortcuts in the way our physics is implemented.

I don’t think anyone’s actually found that evidence yet, but, ah, there’s in principle some evidence we could get of that. But there is the question of whether our world is fundamentally analog or digital, and if our world is fundamentally analog with perfectly precise, continuous, real values making a difference to how the universe evolves, yeah, then that can never be perfectly simulated on a finite digital computer. I would still say it can be approximated. And as far as we know, we could be living in a finite approximation to one of those continuous worlds, but yeah, maybe there could eventually be some empirical evidence of that.
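
[Editor’s sketch, not part of the conversation] A toy illustration of why finite precision only approximates a real-valued process: the chaotic logistic map x → 4x(1 − x) is iterated twice from the same starting value, once with 10 significant digits and once with 50. The choice of map, the starting value, and the function name are assumptions made for illustration. They are not a model of physics and not the test Davoudi and colleagues proposed. Even the 50-digit run is itself a finite approximation.

```python
from decimal import Decimal, localcontext


def logistic_trajectory(x0: str, steps: int, digits: int) -> list[Decimal]:
    """Iterate x -> 4x(1 - x), rounding every operation to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits  # finite precision standing in for a finite digital computer
        x = Decimal(x0)
        trajectory = [x]
        for _ in range(steps):
            x = 4 * x * (1 - x)
            trajectory.append(x)
    return trajectory


coarse = logistic_trajectory("0.1", steps=100, digits=10)
fine = logistic_trajectory("0.1", steps=100, digits=50)

# Gap between the two runs at each step: tiny at first, then as large as the values themselves.
gaps = [abs(a - b) for a, b in zip(coarse, fine)]
print("largest gap in the first 5 steps:", max(gaps[:6]))
print("largest gap over all 100 steps:  ", max(gaps))
```

The early steps agree closely, which is the sense in which the coarse run approximates the fine one. The later divergence shows how a small rounding difference can end up making a difference to how the system evolves.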

Of course, the other possibility is we’re just running on an analog computer. If our physics is continuous and the physics of the next world up is continuous, maybe there will be analog computers developed with maximally continuous quantities, and we could be running on an analog computer like that. So I think even if the physics of our world turns out to be perfectly analog and continuous, that doesn’t rule out the simulation hypothesis. It just means we’re running on an analog computer in the next universe up.

Lucas Perry: Okay. I’m way above my pay grade here. I’m just recalling now, I’m just thinking of how Joscha was talking that continuous numbers aren’t computable, right? So you would need an analog computer. I don’t know anything about analog computers, but it seems to me like they-

David Chalmers: It’s hard to program analog computers because they require infinite precision, and we finite beings are not good at building things with infinite precision. But we could always just set a few starting values randomly and let the analog computation go from there. And as far as I can tell, there’s no evidence that we’re not living in a simulation that’s running on an analog computer like that.

Lucas Perry: I see. So if we discover our fundamental physics to be digital or analog, it seems like that wouldn’t tell us a lot about the simulation, just that the thing that’s simulating us might be digital or analog.

David Chalmers: In general, discovering things about our … I mean, the relationship between the physics of our world and the physics of the simulating world is fairly weak, especially if you believe in universal computation, any classical algorithm can be implemented in a vast variety of computers running on a vast variety of physics. But yeah, but there might be some limits. For example, if our world has a perfectly analog physics that cannot be simulated on a finite digital computer, could be simulated on an infinite digital computer, you can simulate analog quantities with infinite strings of bits, but not on a finite digital computer.

So yeah, discovering that our physics is digital would be consistent with the next universe up being digital, but also consistent with it being analog. Analog worlds can still run digital computers. I mean, it’d be very suggestive if we did actually discover digital physics in our world. I’m sure if we discovered that our physics is digital, that would then get a lot of people thinking, “Hey, this is just the kind of thing you’d expect if people are running a digital computer in the next universe up.” That might incline people to take the simulation hypothesis more seriously, but it wouldn’t really be any kind of demonstration.

Yeah. If we somehow discover that our physics is perfectly analog, I don’t really know exactly how we could discover that because at any given point, we’ll only have a finite amount of evidence, which will always be consistent with just being a very close approximation, but just say we could discover that our world runs analog physics. Yeah, then that would be inconsistent with this just being a digital simulation in the next universe up, but still quite consistent with it being a simulation running on an analog computer in the next universe up. I don’t know how that connects to Joscha’s way of thinking about this.

Lucas Perry: Yeah. I’m not sure. I’d love to see you guys talk-

David Chalmers: I hope it’s at least consistent.

Lucas Perry: … about this.

David Chalmers: Has he written about this somewhere?

Lucas Perry: I’m not sure. There are lots of podcasts that have been talking about it, though.

David Chalmers: Okay, cool.

Lucas Perry: Yeah. So we’ve gone over a lot here, and it leaves me not really trusting my common sense experience of the world. So pivoting a little bit here, back into the Edenic view of things … Sorry if I get the word that you used wrong, but it seems like you walk away from that with a view of imperfect realism. Is that right?

David Chalmers: Yeah. Imperfect realism is the perfect thing. Capital S “Solidity” doesn’t exist, but the lower case thing, small S “solidity,” does exist. An imperfect analog of what we initially believed in.

Lucas Perry: So how do you see the world now? Any differently? What is the world like to David Chalmers after having written this book? What is a person to you?

David Chalmers: I don’t know. I mean, I think there’s your everyday attitude towards the world and your theoretical attitude towards the world. And I find my everyday attitude towards the world isn’t affected that much by discoveries in philosophy, or in science for that matter. We mostly live in the manifest image. Maybe we even treat it a little bit like the Garden of Eden, and that’s fine. But then there’s this knowledge of what underlies it or what could underlie it. And that’s, yeah, once you start thinking philosophically, that gets mind boggling.

I mean, you don’t need to go to the simulation hypothesis or the virtual world to get that reaction. I mean, quantum mechanics is quite enough. Oh my God, we live in this world of the quantum wave function where nothing actually has these direct positions and possibly the wave function’s collapsing, or possibly many worlds. And so I mean, boy, it’s just mind boggling. It is rather hard to integrate ordinary life in reality. So most of us just kind of go on living in the manifest image. Yeah, so once I start thinking about, “Yeah, could we be in a simulation?” it’s got a similar kind of separateness, I guess.

Mostly I go on living in the manifest image and don’t factor this in. But I mean, it does open up all kinds of possibilities once you start thinking that there is maybe this reality plus of all these different levels of reality, like, “Could it be that someday it might be possible to escape this particular virtual world, or maybe when we die, does our code sometimes get uploaded by simulators to go hang out back in other levels of reality?” Maybe there are naturalized versions of reincarnation or life after death, and I don’t want to say this is why I’m thinking about this stuff. It’s not for these quasi-religious reasons, but suddenly, possibilities that had seemed very far out possibilities to me, like life after death, at least come to seem a little bit closer and more like open possibilities than they’d seemed before. So that’s at least interesting.

Lucas Perry: One thing you bring up a bit in your exploration here is God. And all these things that you’re mentioning, they seem like science and philosophy coming back to traditionally religious ideas, but through a naturalistic exploration, which is quite interesting. So do you have any different thoughts on God after having written this book?

David Chalmers: It’s interesting. I’m not remotely religious, myself. I’ve always thought of myself as an atheist, but yeah, after writing this book, I’m at least … There is a version of God that I could at least take seriously. This is the simulator. They did, after all, create the world, this world. They may have a lot of power and a lot of knowledge of this world, as gods are meant to have. On the other hand, they’re quite unlike traditional gods in some ways. The simulator needn’t be all good, needn’t be particularly wise. Oh, also didn’t create all of reality. It just created a little bit of reality. Maybe it’s a bit like what’s sometimes called a demiurge, the local god, the under-boss god who created this world, but wasn’t the one in charge of the whole thing.

So yeah, maybe simulators are a bit more like demiurges. More importantly, I don’t think I’d be inclined to erect a religion around the simulation idea. Religions come with ethical practices and really changing your way of life. I don’t think there’s any particular reason to orient our ethics to a simulation. I mean, maybe you can imagine there’d be some practices that if we really believed we were in a simulation, or there’s a good chance of that, we should at least start doing some things differently. Maybe some people might want to try and attract the attention of the simulators. I don’t know. That’s all very speculative. So I don’t find myself …

I think the one moral of all this for me is that actually ethics and meaning and so on, actually, you don’t get your ethics or your meaning from who created you or from whether it’s a God or a simulator. Ethics and meaning come from within. They come from ourselves, our consciousness, and our interactions.

Lucas Perry: Do you take a line that’s similar to Peter Singer in thinking that that is like an objective rational space? Are you a moral realist or anti-realist about those things?

David Chalmers: I tend towards moral anti-realism, but I’m not sure. I find those issues very difficult. Yeah, I can get in the mood where, “Pain is bad,” just seems like an absolute fact.

Lucas Perry: Yeah.

David Chalmers: That’s just an objective fact. Pain is objectively bad. And then I get at least to some kind of value realism, if not moral realism. Some moods I’ll go that way. Other moods, it’s just, yeah, it’s all a matter of our attitude towards it. Finally, it’s a matter of what we value. If somebody valued pain, it would be good for them. If they didn’t, it wouldn’t be. And I can go back and forth. I don’t have a fixed view of these matters.

Lucas Perry: Are there any questions that I haven’t asked you that you would’ve liked me to ask you?

David Chalmers: Not especially. You asked a lot of great questions, and there are a million others, but actually one interesting thing with this book coming out is getting to do a few of these, having a few of these conversations and seeing all the different questions and different aspects of the book that different people focused on. So, no. I think we’ve covered a lot of territory here, and yeah, these are a lot of cool things to think about.

Lucas Perry: All right. Well, I’m mindful of the time here, David. Thank you so much for all of your time. If people want to check you out, follow you, and get your new book, where are the best places to do that?

David Chalmers: Probably my website, which is consc.net. Consc, the first five letters of consciousness, or just do a search for my name. And then yeah, the book is … I’ve got a page for the book on my website, consc.net/reality, or just search for the name of the book, Reality+. I’m not on Twitter or Instagram or any of those things, unfortunately. Maybe I should be one of these days, but for now, I’m not. But yeah, the book will be available January 25th, I guess. All good book sellers. So I hope some of your listeners might be interested to check it out.

Lucas Perry: All right. We’ll include links to all of those places in the description of wherever you might be listening or watching. Thank you so much, David. It’s always a pleasure speaking with you. I love hearing about your ideas, and it’s really a great book at an important time. I think just before all this VR stuff is about to really kick off, and with the launch of the metaverse. It’s really well timed.

David Chalmers: Oh, thanks, Lucas. This was all, yeah, a lot of fun to talk about this stuff with you.

Rohin Shah on the State of AGI Safety Research in 2021

  • Inner Alignment Versus Outer Alignment
  • Foundation Models
  • Structural AI Risks
  • Unipolar Versus Multipolar Scenarios
  • The Most Important Thing That Impacts the Future of Life


00:00:00 Intro

00:02:22 What is AI alignment?

00:06:45 How has your perspective of this problem changed over the past year?

00:07:22 Inner Alignment

00:15:35 Ways that AI could actually lead to human extinction

00:22:50 Inner Alignment and mesa-optimizers

00:24:15 Outer Alignment

00:27:32 The core problem of AI alignment

00:29:38 Learning Systems versus Planning Systems

00:34:00 AI and Existential Risk

00:38:59 The probability of AI existential risk

01:04:10 Core problems in AI alignment

01:03:07 How has AI alignment, as a field of research changed in the last year?

01:05:57 Large scale language models

01:06:55 Foundation Models

01:15:30 Why don’t we know that AI systems won’t totally kill us all?

01:23:50 How much of the alignment and safety problems in AI will be solved by industry?

01:31:00 Do you think about what beneficial futures look like?

01:39:44 Moral Anti-Realism and AI

01:46:22 Unipolar versus Multipolar Scenarios

01:56:38 What is the safety team at DeepMind up to?

01:57:30 What is the most important thing that impacts the future of life?

Lucas Perry: Welcome to the Future of Life Institute Podcast. I’m Lucas Perry. Today’s episode is with Rohin Shah. He is a long-time friend of this podcast, and this is the fourth time we’ve had him on. Every time we talk to him, he gives us excellent overviews of the current thinking in technical AI alignment research. And in this episode he does just that. Our interviews with Rohin go all the way back to December of 2018. They’re super informative and I highly recommend checking them out if you’d like to do a deeper dive into technical alignment research. You can find links to those in the description of this episode. 

Rohin is a Research Scientist on the technical AGI safety team at DeepMind. He completed his PhD at the Center for Human-Compatible AI at UC Berkeley, where he worked on building AI systems that can learn to assist a human user, even if they don’t initially know what the user wants.

Rohin is particularly interested in big picture questions about artificial intelligence. What techniques will we use to build human-level AI systems? How will their deployment affect the world? What can we do to make this deployment go better? He writes up summaries and thoughts about recent work tackling these questions in the Alignment Newsletter, which I highly recommend following if you’re interested in AI alignment research. Rohin is also involved in Effective Altruism, and out of concern for animal welfare, is almost vegan.

And with that, I’m happy to present this interview with Rohin Shah. 

Welcome back Rohin. This is your third time on the podcast I believe. We have this series of podcasts that we’ve been doing, where you help give us a year-end review of AI alignment and everything that’s been up. You’re someone I view as very core and crucial to the AI alignment community. And I’m always happy and excited to be getting your perspective on what’s changing and what’s going on. So to start off, I just want to hit you with a simple, not simple question of what is AI alignment?

Rohin Shah: Oh boy. Excellent. I love that we’re starting there. Yeah. So different people will tell you different things for this as I’m sure you know. The framing I prefer to use is that there is a particular class of failures that we can think about with AI, where the AI is doing something that its designers did not want it to do. And specifically it’s competently achieving some sort of goal or objective or some sort of competent behavior that isn’t the one that was intended by the designers. So for example, if you tried to build an AI system that is, I don’t know, supposed to help you schedule calendar events and then it like also starts sending emails on your behalf to people which maybe you didn’t want it to do. That would count as an alignment failure.

Whereas if a terrorist somehow makes an AI system that goes and detonates a bomb in some big city, that is not an alignment failure. It is obviously bad, but the AI system did what its designer intended for it to do. It doesn’t count as an alignment failure on my definition of the problem.

Other people will see AI alignment as synonymous with AI safety. For those people, terrorists using a bomb might count as an alignment failure, but at least when I’m using the term, I usually mean, the AI system is doing something that wasn’t what its designers intended for it to do.

There’s a little bit of a subtlety there where you can think of either intent alignment, where you try to figure out what the AI system is trying to do. And then if it is trying to do something that isn’t what the designers wanted, that’s an intent alignment failure, or you can say, all right, screw all of this notion of trying, we don’t know what trying is. How can we look at a piece of code and say whether or not it’s trying to do something.

And instead we can talk about impact alignment, which is just like the actual behavior that the AI system does. Is that what the designers intended or not? So if the AI makes a catastrophic mistake where the AI thinks that this is the big red button for happiness and sunshine, but actually it’s the big red button that launches nukes, that is a failure on impact alignment, but isn’t a failure on the intent alignment, assuming the AI legitimately believed that the button was happiness and sunshine, I think I said.

Lucas Perry: So it seems like you could have one or more or less of these in a system at the same time. So which are you excited about? Which do you think are more important than the others?

Rohin Shah: In terms of what do we actually care about, which is how I usually interpret important, the answer is just like pretty clearly impact alignment. The thing we care about is, did the AI system do what we want or not? I nevertheless tend to think in terms of intent alignment, because it seems like it is decomposing the problem into a natural notion of like what the AI system is trying to do, and whether the AI system is capable enough to do it. And I think that is like actually a natural division. You can in fact talk about these things separately. And because of that, it makes sense to have research organized around those two things separately. But that is a claim I am making about the best way to decompose the problem that we actually care about. And that is why I focus on intent alignment, but what do we actually care about? Impact alignment, totally.

Lucas Perry: How would you say that your perspective of this problem has changed over the past year?

Rohin Shah: I’ve spent a lot of time thinking about the problem of inner alignment. So this was this shot up to… I mean, people have been talking about it for a while, but it shot up to prominence in I want to say 2019 with the publication of the mesa optimizers paper. And I was not a huge fan of that framing, but I do think that the problem that it’s showing is actually an important one. So I’ve been thinking a lot about that.

Lucas Perry: Can you explain what inner alignment is and how it fits into the definitions of what AI alignment is?

Rohin Shah: Yeah. So AI alignment, the way I’ve described it so far is just sort of like pretty, it’s just talking about properties of the AI system. It doesn’t really talk about how that AI system was built, but if you actually want to diagnose and like give reasons why problems might arise and then how to solve them, you probably want to talk about how the AI systems are built and why they’re likely to cause such problems.

Inner alignment, I’m not sure if I like the name, but we’ll go with it for now. Inner alignment is a problem that I claim happens for systems that learn. And the problem is, maybe I should explain it with an example. You might have seen this post from LessWrong about bleggs and rubes. These bleggs are blue in color and tend to be egg-shaped in all the cases you’ve seen so far. Rubes are red in color and are cube-shaped, at least in all the cases you’ve seen so far.

And now suddenly you see a red egg-shaped thing, is it a blegg or a rube? Like in this case, it’s pretty obvious that there isn’t a correct answer, and this same dynamic can arise in a learning system where if it is learning how to behave in accordance with whatever we are training it to do, we’re going to be training it on a particular set of situations. And if those situations change in the future along some axis that the AI system didn’t see during training, it may generalize badly. So a good example of this came from the Objective Robustness in Deep Reinforcement Learning paper. They trained an agent on the CoinRun environment from Procgen. This is basically a very simple platformer game where the agent just has to jump over enemies and obstacles to get to the end and collect the coin.

And the coin is always at the far right end of the level. And so, you train your AI system on a bunch of different kinds of levels, different obstacles, different enemies, they’re placed in different ways. You have to jump in different ways, but the coin is always at the end on the right. And it turns out if you then take your AI system and test it on a new level where the coin is placed somewhere else in the level, not all the way to the right, the agent just continues to jump over obstacles, enemies, and so on. It behaves very competently in the platformer game, but it just runs all the way to the right and then stays at the right or jumps up and down as though hoping that there’s a coin there. And it’s behaving as if it has the objective of go as far to the right as possible.

Even though we trained it on the objective, get the coin, or at least that’s what we were thinking of as the objective. And this happened because we didn’t show it any examples where the coin was anywhere other than the right side of the level. So the inner alignment problem is when you train a system on one set of inputs, it learns how to behave well on that set of inputs. But then when you extrapolate its behavior to other inputs that you hadn’t seen during training, it turns out to do something that’s very capable, but not what you intended.
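
[Editor’s sketch, not part of the conversation] The CoinRun result described above can be caricatured in a few lines of Python. This is a hand-written toy, not the paper’s environment or training setup. The `Level` class and the two policies are hypothetical names, and nothing is actually learned here. The point is only that when the coin is always at the far right during training, the reward signal cannot tell "get the coin" apart from "go as far right as possible".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    length: int  # number of cells, indexed 0 .. length - 1
    coin: int    # index of the cell that holds the coin


def go_right(level: Level) -> int:
    """Proxy objective: run to the far-right cell, wherever the coin is."""
    return level.length - 1


def go_to_coin(level: Level) -> int:
    """Intended objective: end the episode on the coin's cell."""
    return level.coin


def reward(policy, level: Level) -> int:
    """1 if the policy finishes on the coin, otherwise 0."""
    return int(policy(level) == level.coin)


# Training distribution: the coin is always in the last cell.
train_levels = [Level(length=n, coin=n - 1) for n in range(5, 10)]
# Test level: the coin has been moved away from the right edge.
test_level = Level(length=10, coin=3)

policies = (go_right, go_to_coin)
train_scores = {p.__name__: sum(reward(p, lvl) for lvl in train_levels) for p in policies}
test_scores = {p.__name__: reward(p, test_level) for p in policies}

print("training reward:", train_scores)
print("test reward:    ", test_scores)
```

Both policies earn full reward on every training level, so training performance alone gives no way to tell which one was learned. They only come apart once the coin moves.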

Lucas Perry: Can you give an example of what this could look like in the real world, rather than in like a training simulation in a virtual environment?

Rohin Shah: Yeah. One example I like is, it’ll take a bit of setup, but I think it should be fine. You could imagine that with honestly, even today’s technology, we might be able to train an AI system that can just schedule meetings for you. Like when someone emails you asking for a meeting, you’re just like, here calendar scheduling agent, please do whatever you need to do in order to get this meeting scheduled. I want to have it, you go schedule it. And then it goes and emails the person back saying, Rohin is free at such and such times, he like prefers morning meetings or whatever. And then, there’s some back and forth between them, and then the meeting gets scheduled. For concreteness, let’s say that the way we do this, is we take a pre-trained language model, like say GPT-3, and then we just have GPT-3 respond to emails and we train it from human feedback.

Well, we have some examples of like people scheduling emails. We do supervised fine tuning on GPT-3 to get it started. And then we like fine tune more from human feedback in order to get it to be good at this task. And it all works great. Now let’s say that in 2023, Gmail decides that Gmail also wants to be a chat app. And so it adds emoji reactions to emails, and everyone’s like, oh my God, now there’s such a better, we can schedule a meeting so much better. We can just say, here, just send an email to all the people who are coming to the meeting and react with emojis for each of the times that you’re available. And everyone loves this. This is how people start scheduling meetings now.

But it turns out that this AI system, when it’s confronted with these emoji polls, in theory knows how to use them. It knows what’s going on, but it was always trained to schedule the meeting by email. So maybe it will have learned to always schedule a meeting by email and not to take advantage of these new features. So it might say something like, hey, I don’t really know how to use these newfangled emoji polls. Can we just schedule meetings the normal way? In our terms this would be a flat out lie, but from the AI’s perspective, we might think of it like, the AI was just trained to say whatever sequence of English words leads to getting a meeting scheduled by email. And it predicts that sequence of words will work well. Would this actually happen if I actually trained an agent this way? I don’t know, it’s totally possible it would actually do the right thing, but I don’t think we can really rule out the wrong thing either. That also seems pretty plausible to me in this scenario.

Lucas Perry: One important part of this that I think has come up in our previous conversations is that we don’t know when there is always an inner misalignment between the system and the objective we would like for it to learn, because part of maximizing the inner aligned objective could be giving the appearance of being aligned with the outer objective that we’re interested in. Could you explain and unpack that?

Rohin Shah: Yeah. So in the AI safety community, we tend to think about ways that AI could like actually lead to human extinction. And so, the example that I gave does not in fact lead to human extinction. It is a mild annoyance at worst. The story that gets you to human extinction is one in which you have a very capable, superintelligent AI system. But nonetheless, instead of learning the objective that we wanted, which might’ve been, I don’t know, something like be a good personal assistant. I’m just giving that out as a concrete example. It could be other things as well. Instead of acting as though it were optimizing that objective, it ends up optimizing some other objective and I don’t really want to give an example here because the whole premise is that it could be a weird objective we don’t really know.

Lucas Perry: Could you expand that a little bit more, like how it would be a weird objective that we wouldn’t really know?

Rohin Shah: Okay. So let’s take as a concrete example, let’s make paperclips, which has nothing to do with being a personal assistant. Now, why is this at all plausible? The reason is that even if this superintelligent AI system had the objective to make paperclips, during training, while we are in control, it’s going to realize that if it doesn’t do the things that we want it to do, we’re just going to turn it off. And as a result, it will be incentivized to do whatever we want until it can make sure that we can’t turn it off. And then it goes and builds its paperclip empire. And so when I say, it could be a weird objective, I mostly just mean that almost any objective is compatible with this sort of a story. It does rely on-

Lucas Perry: Sorry. I’m also curious if you could explain how the inner state of the system becomes aligned to something that is not what we actually care about.

Rohin Shah: I might go back to the CoinRun example, where the agent could have learned to get the coin. That was a totally valid policy it could have learned. And this is an actual experiment that people have run. So this one is not hypothetical. It just didn’t, it learned to go to the right. Why? I mean, I don’t know. I wish I understood neural nets well enough to answer those questions for you. I’m not really arguing for, it’s definitely going to learn, make paperclips. I’m just arguing for like, there’s this whole set of things it could learn. And we don’t know which one it’s going to learn, which seems kind of bad.

Lucas Perry: Is it kind of like, there’s the thing we actually care about? And then a lot of things that are like roughly correlated with it, which I think you’ve used the word for example before is like proxy objectives.

Rohin Shah: Yeah. So that is definitely one way that it could happen, where we ask it to make humans happy and it learns that when humans smile, they’re usually happy and then learns the proxy objective of make human smile and then it like, goes and tapes everyone’s faces so that they’re permanently smiling, that’s a way that things could happen. But I think I don’t even want to claim that’s what … maybe that’s what happens. Maybe it just actually optimizes for human happiness. Maybe it learns to make paperclips for just some weird reason. I mean, not paperclips. Maybe it decides, this particular arrangement of atoms in this novel structure that we don’t really have a word for is the thing that it wants for some reason. And all of these seem totally compatible with, we trained it to be good, to have good behavior in the situations that we cared about because it might just be deceiving us until it has enough power to unilaterally do what it wants without worrying about us stopping it.

I do think that there is some sense of like, no paperclip maximization is too weird. If you trained it to make humans happy, it would not learn to maximize paperclips. There’s just like no path by which paperclips somehow become the one thing it cares about. I’m also sympathetic to, maybe it just doesn’t care about anything to the extent of optimizing the entire universe to turn it into that sort of thing. I’m really just arguing for, we really don’t know crazy shit could happen. I will bet on crazy shit will happen, unless we do a bunch of research and figure out how to make it so that crazy shit doesn’t happen. I just don’t really know what the crazy shit will be.

Lucas Perry: Do you think that that example of the agent in that virtual environment, you see that as a demonstration of the kinds of arbitrary goals that the agent could learn and that that space is really wide and deep and so it could be arbitrarily weird and we have no idea what kind of goal it could end up learning and then deceive us.

Rohin Shah: I think it is not that great evidence for that position. Mostly because I think it’s reasonably likely that if you told somebody the setup of what you were planning to do, if you told an ML researcher or an RL, maybe specifically a deep RL researcher, the setup of that experiment and asked them to predict what would have happened, I think they probably would have, especially if you told them, “Hey, do you think maybe it’ll just run to the right and jump up and down at the end?” I think they’d be like, “Yeah, that seems likely, not just plausible, but actually likely.” That was definitely my reaction when I was first told about this result. I was like, oh yeah, of course that will happen.

In that case, I think we just do know…know is a strong word, ML researchers have good enough intuitions about those situations, I think, that it was predictable in advance. Though I don’t actually know if anyone who predicted it, did in advance. So that one, I don’t think is all that supportive of, it learns an arbitrary goal. We had some notion that neural nets care a lot more about position and simple functions of the action always go right rather than complex visual features like this yellow coin that you have to learn from pixels. I think people could have probably predicted that.

Lucas Perry: So we touched on definitions of AI alignment, and now we’ve been exploring your interest in inner alignment or I think the jargon is mesa optimizers.

Rohin Shah: They are different things.

Lucas Perry: There are different things. Could you explain how inner alignment and mesa optimizers are different?

Rohin Shah: Yeah. So a thing I maybe have not been doing as much as I should have is that, inner alignment is the claim that when the circumstances change, the agent generalizes catastrophically in some way, it behaves as though it’s optimizing some other objective than the one that we actually want. So it’s much more of a claim about the behavior rather than like the internal workings of the AI system that caused that behavior.

Mesa-optimization, at least under the definition of the 2019 paper, is talking specifically about AI systems that are executing an explicit optimization algorithm. So like the forward pass of a neural net is itself an optimization algorithm. We’re not talking about gradient descent here. And then the metric that is being used within that neural network optimization algorithm is the inner objective or sorry, the mesa objective. So it’s making a claim about how the AI system’s cognition is structured. Whereas inner alignment more broadly is the AI behaves in a catastrophically generalizing way.

Lucas Perry: Could you explain what outer alignment is?

Rohin Shah: Sure. Inner alignment can be thought of as, suppose we got the training objective correct. Suppose the things that we’re training the AI system to do on the situations that we give it as input, we’re actually training it to do the right thing, then things can go wrong if it behaves differently in some new situations that we hadn’t trained it on.

Outer alignment is basically when the reward function that you specify for training the AI system is itself, not what you actually wanted. For example, maybe you want your AI to be helpful to you or to tell you true things. But instead you have, you train your AI system to go find credible looking websites and tell you what the credible looking websites say. And it turns out that sometimes the credible looking websites don’t actually tell you true things.

In that case, you’re going to get an AI that tells you what credible looking websites say, rather than an AI that tells you what things are true. And that’s in some sense, an outer alignment failure. Even the feedback you were giving the AI system was pushing it away from telling you the truth and pushing it towards telling you what credible looking websites will say, which are correlated of course, but they’re not the same. In general, if you give me an AI system with some misalignment and you ask me, was this a failure of outer alignment or inner alignment? Mostly I’m like, that’s a somewhat confused question, but one way that you can make it not be confused is you can say, all right, let’s look at the inputs on which it was trained. Now, if ever on an input on which we trained, we gave it some wrong feedback, where we were like, the AI lied to me and I gave it like plus a thousand reward, then you’re like, okay, clearly that’s outer alignment. We just gave it the wrong feedback in the first place.

Supposing that didn’t happen. Then I think what you would want to ask is, okay, let me think about the situations in which the AI does something bad, what would I have given counterfactually as a reward? And this requires you to have some notion of a counterfactual. When you write down a programmatic reward function, the counterfactual is a bit more obvious. It’s like, whatever that program would have output on that input. And so I think that’s the usual setting in which outer alignment has been discussed. And it’s pretty clear what it means there. But once you’re training from human feedback, it’s not so clear what it means. What the human would have given as feedback on this situation that they’ve never seen before is often pretty ambiguous. If you define such a counterfactual, then I think I’m like, okay, you look at what feedback you would’ve given on the counterfactual. If that feedback was good, actually led to the behavior that you wanted, then it’s an inner alignment failure. If that counterfactual feedback was bad, not what you would have wanted, then it’s an outer alignment failure.
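The decision procedure just described can be written down as a small sketch. The function name `classify_failure`, its parameters, and the `Failure` labels are hypothetical, introduced only for illustration. The sketch also assumes that someone has already judged whether the training feedback was wrong and whether the counterfactual feedback would have been good, which is the hard part in practice.

```python
from enum import Enum


class Failure(Enum):
    OUTER = "outer alignment failure"
    INNER = "inner alignment failure"
    UNDEFINED = "ambiguous: no counterfactual feedback defined"


def classify_failure(wrong_feedback_during_training,
                     counterfactual_feedback_good=None):
    """Classify a misalignment failure using the procedure in the text.

    wrong_feedback_during_training: True if, on some training input,
        bad behavior was rewarded (for example, a lie got high reward).
    counterfactual_feedback_good: whether the feedback we would have
        given on the new situation rewards the behavior we wanted.
        None means no counterfactual has been defined.
    """
    # Step 1: wrong feedback on the training inputs is an outer failure.
    if wrong_feedback_during_training:
        return Failure.OUTER
    # Step 2: without a defined counterfactual, the question is ambiguous.
    if counterfactual_feedback_good is None:
        return Failure.UNDEFINED
    # Step 3: good counterfactual feedback but bad behavior means the
    # system generalized badly, which is an inner failure.
    if counterfactual_feedback_good:
        return Failure.INNER
    return Failure.OUTER


print(classify_failure(True).value)
print(classify_failure(False, counterfactual_feedback_good=True).value)
print(classify_failure(False).value)
```

With a programmatic reward function the second argument is easy to fill in, since it is whatever the program outputs on the new input. With human feedback it often is not, which is why the sketch has an explicit ambiguous case.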

Lucas Perry: If you’re speaking to someone who was not familiar with AI alignment, for example, other people in the computer science community, but also policymakers or the general public, and you have all of these definitions of AI alignment that you’ve given like intent alignment and impact alignment. And then we have the inner and outer alignment problems. How would you capture the core problem of AI alignment? And would you say that inner or outer alignment is a bigger part of the problem?

Rohin Shah: I would probably focus on intent alignment for the reasons I have given before. It just seems like a more … I want to focus attention away from the cases where the AI is trying to do the right thing, but makes a mistake, which would be a failure of impact alignment. I don’t think that is the biggest risk. I think a super-intelligent AI system that is trying to do the right thing is extremely unlikely to lead to catastrophic outcomes, though it’s certainly not impossible. Or at least less likely to lead to catastrophic outcomes than humans in the same position or something. So that would be my justification for intent alignment. I’m not sure that I would even talk very much about inner and outer alignment. I think I would probably just not focus on definitions and instead focus on examples. The core argument I would make would depend a lot on how AI systems are being built.

As I mentioned, inner alignment is a problem that, according to me, primarily affects learning systems. I don’t think it really affects planning systems.

Lucas Perry: What is the difference between a learning system and a planning system?

Rohin Shah: A learning system, you give it examples of things it should do, how it should behave, and then it changes itself to do things more in that vein. A planning system takes a formally represented objective and then searches over possible hypothetical sequences of actions it could take in order to achieve that objective. And if you consider a system like that, you can try to make the inner alignment argument and it just won’t work, which is why I say that the inner alignment problem is primarily about learning systems.
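A planning system in this sense can be sketched in a few lines. This is a deliberately tiny brute-force planner under assumed conditions: the one-dimensional world, the `ACTIONS` table, and the two objective functions are hypothetical, not taken from any real system.

```python
from itertools import product

# Hypothetical action set for an agent on a number line.
ACTIONS = {"left": -1, "stay": 0, "right": 1}


def reach_position_3(final_position):
    """A formally represented objective: end as close to position 3 as possible."""
    return -abs(final_position - 3)


def reach_position_minus_3(final_position):
    """A differently specified objective, standing in for a mistake in the spec."""
    return -abs(final_position + 3)


def plan(start, horizon, objective_fn):
    """Search every action sequence of length `horizon`; return the best one."""
    best_sequence, best_score = None, float("-inf")
    for sequence in product(ACTIONS, repeat=horizon):
        final_position = start + sum(ACTIONS[a] for a in sequence)
        score = objective_fn(final_position)
        if score > best_score:
            best_sequence, best_score = sequence, score
    return best_sequence, best_score


print(plan(0, 3, reach_position_3))
print(plan(0, 3, reach_position_minus_3))
```

The planner has no training data to generalize from. It optimizes exactly the objective it is handed, so whatever goes wrong has to be traced to how that objective was written down.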

Going back to the previous question. So the things I would talk about depend a lot on what sorts of AI systems we’re building. If it were a planning system, I would basically just talk about outer alignment, where I would be like, what if the formally represented objective is not the thing that we actually care about. It seems really hard to formally represent the objectives that we want.

But if we’re instead talking about deep learning systems that are being trained from human feedback, then I think I would focus on two problems. One is cases where the AI system knows something, but the human doesn’t. And so the human gives bad feedback as a result. So for example, the AI system knows that COVID was caused by a lab leak. It’s just like, got incontrovertible proof of this or something. But we as humans, when it says COVID was caused by a lab leak, we’re like, we don’t know that, and we say no bad, don’t say that. And then when it says, it is uncertain whether COVID is the result of a lab leak or if it just occurred via natural mutations, we’re like, yes, good, say more of that. And your AI system learns, okay, I shouldn’t report true things. I should report things that humans believe or something.

And so that’s one way in which you get AI systems that don’t do what you want. And then the other way would be more of this inner alignment style story, where I would point out how, even if you do train it, even if all your feedback on the training data points is good. If the world changes in some way, the AI system might stop doing good things.

My go-to example, I mean, I gave the Gmail with emoji polls for meeting scheduling example, but another one, now that I’m on the topic of COVID is, if you imagine a meeting scheduling AI assistant again, that was trained pre-pandemic, and then the pandemic hits, and it’s obviously never been trained on any data that was collected during such a global pandemic. And so when you then ask it to schedule a meeting with your friend, Alice, it just schedules drinks in a bar Sunday evening, even though clearly what you meant was a video call. And it knows that you meant a video call. It just learned the thing to do is to schedule outings with friends on Sunday nights at bars. Sunday night, I don’t know why I’m saying Sunday night. Friday night.

Lucas Perry: Have you been drinking a lot on your Sunday nights?

Rohin Shah: No, not even in the slightest. I think truly the problem is I don’t go to bars, so I don’t have it cached in my head that people go to bars.

Lucas Perry: So how does this all lead to existential risk?

Rohin Shah: Well, the main argument is, one possibility is that your AI system just actually learns to ruthlessly maximize some objective that isn’t the one that we want. Make paperclips is a stylized example to show what happens in that sort of situation. We’re not actually claiming that it will specifically maximize paperclips, but an AI system that really ruthlessly is just trying to maximize paperclips is going to prevent humans from stopping it from doing so. And if it gets sufficiently intelligent and can take over the world at some point, it’s just going to turn all of the resources in the world into paperclips, which may or may not include the resources in human bodies, but either way, it’s going to include all the resources upon which we depend for survival.

Humans are definitely going, seem like they will definitely go extinct in that type of scenario. So again, not specific to paper clips. This is just; ruthless maximization of an objective, tends not to leave humans alive. Both of these… Well not both of the mechanisms, the inner alignment mechanism that I’ve been talking about, is compatible with an AI system that ruthlessly maximizes an objective that we don’t want.

It does not argue that it is probable, and I am not sure if I think it is probable, I think it is… But I think it is easily enough risk, that we should be really worrying about it, and trying to reduce it.

For the outer alignment style story, where the problem is that the AI may know information that you don’t, and then you give it bad feedback. One thing is just, this can exacerbate, this can make it easier for an inner alignment style story to happen, where the AI learns to optimize an objective, that isn’t what you actually wanted.

But even if you exclude something like that, Paul Christiano’s written a few posts about what a human extinction level failure of this form could look like. It basically looks like, all of your AI systems lying to you about how good the world is as the world becomes much, much worse. So for example, AI systems keep telling you that the things that you’re buying are good and helping your lives, but actually they’re not, and they’re making them worse in some subtle way that you can’t tell. All of the information that you’re fed makes it seem like, there’s no crime, police are doing a great job of catching it, but really, this is just manipulation of the information you’re being fed, rather than actual amounts of crime where, in this case, maybe the crimes are being committed by AI systems, not even by humans.

In all of these cases, humans relied on some information sources to make decisions, the AI knew other information that the humans didn’t, the AI has learned, Hey, my job is to manage the information sources that humans get, so that the humans are happy, because that’s what they did during training. They gave good feedback in cases where the information sources said things were going well, even when things were not actually going well.

Lucas Perry: Right. It seems like if human beings are constantly giving feedback to AI systems, and the feedback is based on incorrect information and the AI’s have more information, then they’re going to learn something, that isn’t aligned with, what we really want, or the truth.

Rohin Shah: Yeah, I do feel uncertain about the extent to which this leads to human extinction without… It leads to, I think you can pretty easily make the case that, it leads to an existential catastrophe, as defined by, I want to say it’s Bostrom, which includes human extinction, but also a permanent curtailing of humanity’s I forget the exact phrasing, but basically if humanity can’t use… Yeah, exactly, that counts, and this totally falls into that category. I don’t know if it actually leads to human extinction, without some additional sort of failure, that we might instead categorize as inner alignment failure.

Lucas Perry: Let’s talk a little bit about probabilities, right? So if, you’re talking to someone who has never encountered AI alignment before, you’ve given a lot of different real world examples and principle-based arguments for, why there are these different kinds of alignment risks, how would you explain the probability of existential risk, to someone who can come along for all of these principle-based arguments, and buy into the examples that you’ve given, but still thinks this seems kind of, far out there, like when am I ever going to see in the real world, a ruthlessly optimizing AI, that’s capable of ending the world?

Rohin Shah: I think, first off, I’m super sympathetic to the ‘this seems super out there’ critique. I spent multiple years, not really agreeing with AI safety for basically, well, not just that reason, but that was definitely one of the heuristics that I was using. I think one way I would justify this is, to some extent it has precedent already, in that fundamentally the arguments that I’m making… Well, especially the inner alignment one, is an argument about how AI systems will behave in new situations rather than the ones that we have already seen, during training. We already know, that AI systems behave crazily in these situations, the most famous example of this is adversarial examples, where you take an image classifier, and I don’t actually remember what the canonical example is. I think it’s a panda, and you change the pixel values by small amounts, such that the changes are imperceptible to the human eye. And then it’s classified with, I think 99.8% confidence as something else. My memory is saying airplane, but that might just be totally wrong. Anyway, the point is we have precedent for AI systems behaving really weirdly, in situations they weren’t trained on. You might object, that this one is a little bit cheating, because there was an adversary involved, and the real, I mean the real world does have adversaries, but still by default, you would expect the AI system to be more exposed to naturally occurring distributions. I think even there though, often you can just take an AI system that was trained on one distribution, give it inputs from a different distribution, and it’s just like there’s no sense to what’s happening.
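The mechanics of an adversarial example can be shown on a toy model. This sketch assumes a hand-built linear classifier with 1,000 inputs rather than a real image classifier, and the weights, input values, labels, and `epsilon` are all hypothetical. The perturbation rule, a small step against the sign of each weight, is in the spirit of gradient-sign attacks, since for a linear model the gradient of the score with respect to the input is just the weight vector.

```python
def score(weights, bias, x):
    """Linear score: positive means the classifier says 'panda'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    return "panda" if score(weights, bias, x) > 0 else "something else"


def sign(value):
    return (value > 0) - (value < 0)


def perturb(weights, x, epsilon):
    """Nudge every input by at most epsilon in the direction that lowers the score."""
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


DIMENSIONS = 1000  # hypothetical number of "pixels"
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(DIMENSIONS)]
bias = 0.0

# An input the toy classifier labels as 'panda'.
original = [0.51 if i % 2 == 0 else 0.49 for i in range(DIMENSIONS)]

# Change every value by 0.02, which is small relative to the 0..1 range.
adversarial = perturb(weights, original, epsilon=0.02)
largest_change = max(abs(a - b) for a, b in zip(adversarial, original))

print(predict(weights, bias, original))
print(predict(weights, bias, adversarial))
print(largest_change)
```

No single input changes by much, but the changes all push the score the same way, so across many dimensions they add up to a flipped label. That accumulation is one common explanation for why imperceptible changes can have large effects.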

Usually when I’m asked to predict this, the actual prediction I give is, probability that we go extinct due to an intent alignment failure, and then depending on the situation I will either condition on… I will either make that unconditional, so that includes all of the things that people will do to try to prevent that from happening. Or, I make it conditional, on the long-termist community doesn’t do anything, or vanishes or something. But even in that world, there’s still… Everyone who’s not a long-termist, who can still prevent that from happening, which I really do expect them to do, and then I think I give my cached answer, on both of those is like 5% and 10% respectively, which I think is probably the numbers I gave you. If I actually sat down and try to like come up with a probability, I would probably come up with something different this time, but I have not done that, and I’m way too anchored on those previous estimates, to be able to give you a new estimate this time. But, the higher number I’m giving now of, I don’t know, 33%, 50%, 70%, this, this one’s way more… I feel way more uncertain about it. Literally no one, tries to address these sorts of problems. It’s just sort of, take a language model, fine tune it on human feedback, in a very obvious way, and they just deploy that, even if it’s very obviously causing harm during training, they still deploy it.

What’s the chance that leads to human extinction? I don’t know, man, maybe 33%, maybe 70%. The 33% number you can get from this, one in three argument that I was talking about. The second thing I was going to say is, I don’t really like talking about probabilities very much, because of how utterly arbitrary the methods of generating them are there.

I feel much better about the robustness of the conclusion, that we don’t know that this won’t happen, and it is at least plausible, that it does happen. I think that’s pretty sufficient, for justifying the work done on it. I will also argue pretty strongly against anyone who says, we know that it will kill us all, if we don’t do anything. I don’t think that’s true. There are definitely, smart people who do think that’s true, if we operationalized greater than 90, 95% or something, and I disagree with them. I don’t really know why though.

Lucas Perry: How would you respond to someone, who thinks that this sounds, like it’s really far in the future?

Rohin Shah: Yeah. So this is specifically AGI is far in the future?

Lucas Perry: Yeah. Well, so the concern here seems to be about machines that are increasingly capable. When people look at machines that we have today, machine learning that we have today, sometimes they’re not super impressed and think that general capabilities are very far off.

Rohin Shah: Yeah.

Lucas Perry: And so this stuff sounds like, future stuff.

Rohin Shah: Yeah. So, I think my response depends on what we’re trying to get the person to do or something, why do we care about what this person believes, if this person is considering whether or not to do AI research themselves or, AI safety research themselves and they feel like they have a strong inside view model of, why AI is not going to come soon. I’m kind of… I’m like, eh, that seems okay. I’m not that stoked about people forcing themselves to do research on a thing they don’t actually believe. I don’t really think that good research comes from doing that. If I put myself, for example, I am much more sold on AGI coming through neural networks, than planning agents or things similar to it. If I had to put myself in the shoes of, all right, I’m now going to do AI safety research on planning agents. I’m just like, oh man, that seems like I’m going to do so much… My work is going to be orders of magnitude worse, than the work I do, on the neural-net case. So, in the case where, this person is thinking about whether to do AI safety research, and they feel like they have strong inside view models for AGI not coming soon. I’m like, eh, maybe they should go do something else or possibly, they should engage with the arguments for AGI coming more quickly, if they haven’t done that. But, if they have engaged with those arguments, thought about it all, concluded it’s far away, and they can’t even see a picture by which it comes soon… That’s fine.

Conversely, if we’re instead, if we’re imagining that someone is disputing, someone is saying, ‘oh nobody should work on AI safety right now, because AGI is so far away.’. One response you can have to that is, even if it’s far away, it’s still worthwhile to work on reducing risks, if they’re as bad as extinction. Seems like we should be putting effort into that, even early on. But I think, you can make a stronger argument there, which is there’re just actually people, lots of people who are trying to build AGI right now, there’s, at the minimum; DeepMind and OpenAI and they clearly… I should probably not make more comments about DeepMind, but OpenAI clearly doesn’t believe… OpenAI clearly seems to think, that AGI is coming somewhat soon. I think you can infer, from everything you see about DeepMind, that they don’t believe that AGI is 200 years away. I think it is insane overconfidence in your own views, to be thinking that you know better than all of these people, such that you wouldn’t even assign, like 5% or something, to AGI coming soon enough, that work on AI safety matters.

Yeah. So there, I think I would appeal to… Let other people do the work. You are not, you don’t have to do the work yourself. There’s just no reason for you to be opposing the other people, either on epistemic grounds or also on just, kind of a waste of your own time, that’s the second kind of person. A third kind of person might be like somebody in policy. From my impression of policy, is that there is this thing, where early moves are relatively irreversible, or something like that. Things get entrenched pretty quickly, such that it makes sense to wait for… It often makes sense to wait for a consensus before acting, and I don’t think that there is currently consensus of AGI coming soon. I don’t feel particularly confident enough in my views to say, we should really convince the policy people, to override this general heuristic of waiting for consensus, and get them to act now.

Yeah. Anyway, those are all meta-level considerations. There’s also the object-level question of, is AGI coming soon? For that, I would say, I think the most likely, the best story for that I know of is, you take neural nets, as you scale them up, you increase the size of the datasets that they’re trained on. You increase the diversity of the datasets that they’re trained on, and they learn more and more general heuristics, for doing good things. Eventually, these general, these heuristics are general enough that they’re as good as human cognition. Implicitly, I am claiming that human cognition, is basically a bag of general heuristics. There is this report from Ajeya Cotra, about AGI timelines using biological anchors. I wrote, even my summary of it was 3000 words, or something like that, so I don’t know that I can really give an adequate summary of it here, but it models… The basic premise, is to model how quickly neural nets will grow, and at what point they will match what we would expect to be approximately, the same rough size as the human brain. I think it even includes a small penalty to neural nets on the basis that evolution probably did a better job than we did. It basically comes up with a target for, neural nets of this size, trained in Compute Optimal ways, will probably be, roughly human level.

It has a distribution over this, to be more accurate, and then it predicts, based on existing trends. Well, not just existing trends, existing trends and sensible extrapolation, predicts when neural nets might reach that level. It ends up concluding, somewhere in the range… Oh, let me see, I think its 50% confidence interval would be something like 2035 to 2070, 2080, maybe something like that? I am really just like, I’m imagining a graph in my head, and trying to calculate the area under it, so that is very much not a reliable interval, but it should give you a general sense of what the report concludes.

Lucas Perry: So that’s 2030 to 2080?

Rohin Shah: I think it’s slightly narrower than that, but yes, roughly, roughly that.

Lucas Perry: That’s pretty soon.

Rohin Shah: Yep. I think, on the object level, you’ve just got to read the report, and see whether or not you buy it.

Lucas Perry: That’s most likely in our lifetimes, if we live to the average age.

Rohin Shah: Yep. So that was a 50% interval, meaning it’s the 25th to 75th percentile. I think actually the 25th percentile was not as early as 2030. It was probably 2040.

Lucas Perry: So, if I’ve heard everything, in this podcast, everything that you’ve said so far, and I’m still kind of like, okay, there’s a lot here and it sounds convincing or something and this seems important, but I’m not so sure about this, or that we should do anything. What is… Because, it seems like there’s a lot of people like that. I’m curious what it is, that you would say to someone like that.

Rohin Shah: I think… I don’t know. I probably wouldn’t try to say something general to them. I feel like I would need to know more about the person, people have pretty different idiosyncratic reasons, for having that sort of reaction. Okay, I would at least say, that I think that they are wrong, to be having that sort of belief or reaction.

But, if I wanted to convince them of that point, presumably I would have to say something more than just, I think you are wrong. The specific thing I would have to say would be pretty different for different people.

Lucas Perry: That’s a good point.

Rohin Shah: I would at least make an appeal to the meta-level heuristic of don’t try to regulate a small group of… There are a few hundred researchers at most, doing things that they think will help the world, and that you don’t think will hurt the world. There are just better things for you to do with your time. Doesn’t seem like they’re harming you. Some people will think that there is harm being caused by them. I would have to address that, with them specifically, but I think most people do not, who have this reaction, don’t believe that.

Lucas Perry: So, we’ve gone over a lot of the traditional arguments for AI, as a potential existential risk. Is there anything else that you would like to add there, or any of the arguments that we missed, that you would like to include?

Rohin Shah: As a representative of the community as a whole, there are lots of other arguments that people like to make, for AI being a potential extinction risk. So, some things are, maybe AI just accelerates the rate at which we make progress, and we can’t increase our wisdom alongside, and as a result, we get a lot of destructive technologies and can’t keep them under control. Or, we don’t do enough philosophy, in order to figure out what we actually care about, and what’s good to do in the world, and as a result, we start optimizing for things that are morally bad or other things in this vein. Talk about the risk of AI being misused by bad actors. So there’s… Well actually I’ll introduce a trichotomy that, I don’t remember exactly who wrote this article. But it goes, Accidents, Misuse and Structural Risks. So accidents are, both alignment and the things like; we don’t keep up, we don’t have enough wisdom to cope with the impact of AI. That one’s arguable, whether it’s an accident, or misuse or structural, and we don’t do enough philosophy. So those are, vaguely accidental, those are accidents.

Misuse is, some bad actor. Some terrorist, say, gets AI. Gets a powerful AI system and does something really bad, blows up the world somehow. Structural risks are things like; various parts of the economy use AI to accelerate, to get more profit, to accelerate their production of goods and so on. At some point we have this like giant economy, that’s just making a lot of goods, but it can become decoupled from things that are actually useful for humans, and we just have this huge multi-agent system, where goods are being produced, money’s floating around. We don’t really understand all of it, but somehow humans get left behind and there, it’s kind of an accident, but not in the traditional sense. It’s not that a single AI system went and did something bad. It’s more like the entire structure, of the way that the AI systems and the humans related to each other, was such that it ended up leading to the permanent disempowerment of humans. Now that I say it, I think the ‘we didn’t have enough wisdom’ argument for risk, is probably also in this category.

Lucas Perry: Which of these categories are you most worried about?

Rohin Shah: I don’t know. I think, it is probably not misuse, but I vary, on accidents versus structural risks, mostly because, I just don’t feel like I have a good understanding of structural risks. Maybe, most days I think structural risks are more likely to cause bad outcomes, extinction. The obvious next question is, why am I working on alignment, and not structural risks? The answer there, is that it seems to me like alignment has one, or perhaps two core problems that are leading to the major risk. Whereas structural risks… And so you could hope to have, one or two solutions that address those main problems and that’s it, that’s all you need. Whereas with structural risks, I would be surprised if it was just, there was just one or two solutions that just got rid of structural risk. It seems much more like, you have to have a different solution for each of the structural risks. So, it seems like, the amount that you can reduce the risk by, is higher in alignment than in structural risks. That’s not the only reason why I work in alignment, I just also have a much better personal fit with alignment work. But, I do also think that alignment work, you have more opportunity to reduce the risks, than in structural risks, on the current margin.

Lucas Perry: Is there a name for those one or two core problems in alignment, that you can come up with solutions for?

Rohin Shah: I mostly just mean like, possibly, we’ve been talking about outer and inner alignment, and in the neural net case, I talked about the problem where you reward the AI system for doing bad things, because there was an information asymmetry, and then the other one was like the AI system generalizes catastrophically, to new situations. Arguably those are just the two things, but I think it’s not even that, it’s more… Fundamentally the story, the causal chain in the accidents case, was the AI was trying to do something bad, or something that we didn’t want rather, and then that was bad.

Whereas in the structural risks case, there isn’t a single causal story. It’s this very vague general notion of the humans and AI have interacted in ways that led to an X-risk. Then, if you drill down into any given story, or if you drilled down into five stories and then you’re like, what’s common across these five stories? Not much, other than that there was AI, and there were humans, and they interacted, and I wouldn’t say that was true, if I had five stories about alignment failure.

Lucas Perry: So, I’d like to take an overview, a bird’s-eye view of AI alignment in 2021. Last time we spoke was in 2020. How has AI alignment, as a field of research changed in the last year?

Rohin Shah: I think I’m going to naturally include a bunch of things from 2020 as well. It’s not a very sharp division in my mind, especially because I think the biggest trend, is just more focus on large language models, which I think was a trend that started late 2020 probably… Certainly, the GPT-3 paper was, I want to say early 2020, but I don’t think it immediately caused there to be more work. So, maybe late 2020 is about right. But, you just see a lot more, alignment forum posts, and papers that are grappling with, what are the alignment problems that could arise with large language models? How might you fix them?

There was this paper out of Stanford, which isn’t, I wouldn’t have said this was from the AI safety community. But it gives the name foundation models to these sorts of things. So they generalize it beyond just language and they think it might… And already we’ve seen some generalization beyond language, like CLIP and DALL-E are working on image inputs, but they also extend it to robotics and so on. And their point is, we’re now more in the realm of, you train one large model on a giant pile of data that you happen to have, that you don’t really have any labels for, but you can use a self-supervised learning objective in order to learn from them. And then you get this model that has a lot of knowledge, but no goal built in, and then you do something like prompt engineering or fine tuning in order to actually get it to do the task that you want. And so that’s a new paradigm for constructing AI systems that we didn’t have before. And there have just been a bunch of posts that grapple with what alignment looks like in this case. I don’t think I have a nice pithy summary, unfortunately, of what all of us… What the upshot is, but that’s the thing people have been thinking about, a lot more.

Lucas Perry: Why do you think that looking at large scale language models has become a thing?

Rohin Shah: Oh, I think primarily just because GPT-3 demonstrated how powerful they could be. You just see, this is not specific to the AI safety community, even in the… If anything, this shift that I’m talking about is… It’s probably not more pronounced in the ML community, but it’s also there in the ML community where there are just tons of papers about prompt engineering and fine tuning out of regular ML labs. Just, I think is… GPT-3 showed that it could be done, and that this was a reasonable way to get actual economic value out of these systems. And so people started caring about them more.

Lucas Perry: So one thing that you mentioned to me that was significant in the last year, was foundation models. So could you explain what foundation models are?

Rohin Shah: Yeah. So a foundation model, the general recipe for it, is you take some very… Not generic, exactly. Flexible input space like pixels or any English language, any string of words in the English language, you collect a giant data set without any particular labels, just lots of examples of that sort of data in the wild. So in the case of pixels, you just find a bunch of images from image-sharing websites or something. I don’t actually know where they got their images from. For text, it’s even easier. The internet is filled with text. You just get a bunch of it. And then you train your AI, you train a very large neural network with some proxy objective on that data set, that encourages it to learn how to model that data set. So in the case of language models, the… There are a bunch of possible objectives. The most famous one was the one that GPT-3 used, which is just, given the first N words of the sentence, predict the word N plus one. And so it just… Initially it starts learning, E’s are the most common… Well, actually, because of the specific way that the input space in GPT-3 works, it doesn’t exactly do this, but you could imagine that if it was just modeling characters, it would first learn that E’s are the most common letter in the alphabet. L’s are more common. Q’s and Z’s don’t come up that often. Like it starts outputting letter distributions that at least look vaguely more like what English would look like. Then it starts learning what the spelling of individual words are. Then it starts learning what the grammar rules are. Just, these are all things that help it better predict what the next word is going to be, or, well, the next character, in this particular instantiation.

And it turns out that when you have millions of parameters in your neural network, then you can… I don’t actually know if this number is right, but probably, I would expect that with millions of parameters in your neural network, you can learn spellings of words and rules of grammar, such that you’re mostly outputting, for the most part, grammatically correct sentences, but they don’t necessarily mean very much.

And then when you get to the billions of parameters range, at that point, the millions of parameters are already getting you grammar. So like, what should it use all these extra parameters for, now? Then it starts learning things like George… Well, probably already even the millions of parameters probably learned that George tends to be followed by Washington. But it can start learning things like that. And in that sense, can be said to know that there is an entity, at least, named George Washington. And so on. It might start knowing that rain is wet, and in context where something has been rained on, and then later we’re asked to describe that thing, it will say it’s wet or slippery or something like that. And so it starts… It basically just, in order to predict words better, it keeps getting more and more “knowledge” about the domain.

So anyway, a foundation model, expressive input space, giant pile of data, very big neural net, learns to model that domain very well, which involves getting a bunch of “knowledge” about that domain.
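Editorial note: the character-level version of the "predict what comes next" objective described above can be illustrated with a toy model that only counts which character tends to follow which. This is a deliberately tiny sketch, not how GPT-3 works (GPT-3 is a large neural network over tokens, not a count table); `train_bigram` and `predict_next` are hypothetical names and the corpus is made up.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each character, how often each next character follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


# Made-up training text, purely for illustration.
corpus = "the theory then thrived"
model = train_bigram(corpus)
print(predict_next(model, "t"))  # every "t" in the corpus is followed by "h"
```

Even this count table picks up a spelling regularity from raw, unlabeled text. The larger models discussed here are trained on the same kind of signal, with far more capacity for what they can learn from it.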

Lucas Perry: What’s the difference between “knowledge” and knowledge?

Rohin Shah: I feel like you are the philosopher here, more than me. Do you know what knowledge without air quotes is?

Lucas Perry: No, I don’t. But I don’t mean to derail it, but yeah. So it gets “knowledge.”

Rohin Shah: Yeah. I mostly put the air quotes around knowledge because we don’t really have a satisfying account of what knowledge is. And if I don’t put air quotes around knowledge, I get lots of people angrily saying that AI systems don’t have knowledge yet.

Lucas Perry: Oh, yeah. That makes sense.

Rohin Shah: And when I put the air quotes around it, then they understand that I just mean that it has the ability to make predictions that are conditional on this particular fact about the world, whether or not it actually knows that fact about that world.

Lucas Perry: Okay.

Rohin Shah: But it knows it well enough to make predictions. Or it contains the knowledge well enough to make predictions. It can make predictions. That’s the point. I’m being maybe a bit too harsh, here. I also put air quotes around knowledge because I don’t actually know what knowledge is. It’s not just a defense strategy. Though, that is definitely part of it.

So yeah. Foundation models, basically are a way to just get all of this “knowledge” into an AI system, such that you can then do prompting and fine tuning and so on. And those, with a very small amount of data, relatively speaking, are able to get very good performance. Like in the case of GPT-3, you can like give it two or three examples of a task and it can start performing that task, if the task is relatively simple. Whereas if you wanted to train a model from scratch to perform that task, you would need thousands of examples, often.
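Editorial note: "giving it two or three examples of a task" usually means placing those examples directly in the text the model is asked to continue. The sketch below builds such a few-shot prompt as a plain string. The `Input:`/`Output:` layout and the function name `build_few_shot_prompt` are assumptions chosen for illustration, not a format prescribed by any particular model, and no model is actually called here.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt from (input, output) example pairs.

    The final "Output:" line is left empty for the model to complete.
    """
    lines = []
    if instruction:
        lines.append(instruction)
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical task: pluralize a noun.
prompt = build_few_shot_prompt(
    examples=[("cat", "cats"), ("dog", "dogs")],
    query="bird",
    instruction="Pluralize the word.",
)
print(prompt)
```

The resulting string would then be passed to a language model, which continues the text after the last "Output:". No weights are updated, which is what distinguishes prompting from fine tuning.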

Lucas Perry: So how has this been significant for AI alignment?

Rohin Shah: I think it has mostly provided an actual pathway to it, by which we can get to AGI. Or there’s more like a concrete story and path that leads to AGI, eventually. And so then we can take all of these abstract arguments that we were making before, and then see, try to instantiate them in the case of this concrete pathway, and see whether or not they still make sense. I’m not sure if at this point I’m imagining what I would like to do, versus what actually happened. I would need to actually go and look through the alignment newsletter database and see what people actually wrote about the subject. But I think there was some discussion of GPT-3 and the extent to which it is or isn’t a mesa optimizer.

Yeah. That’s at least one thing that I remember happening. Then there’s been a lot of papers that are just like, “Here is how you can train a foundation model like GPT-3 to do the sort of thing that you want.” So there’s learning to summarize from human feedback, which just took GPT-3 and taught it how to, or fine tuned it in order to get it to summarize news articles, which is an example of a task that you might want an AI system to do.

And then the same team at OpenAI just recently released a paper that actually summarized entire books by using a recursive decomposition strategy. In some sense, a lot of the work we’ve been doing in the past, in AI alignment was like how do we get AI systems to perform fuzzy tasks for which we don’t have a reward function? And now we have systems that could do these fuzzy tasks in the sense that they “have the knowledge,” but don’t actually use that knowledge the way that we would want them. And then we have to figure out how to get them to do that. And then we can use all these techniques like imitation learning, and learning from comparisons and preferences that we’ve been developing.
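Editorial note: a recursive decomposition strategy for summarizing long text can be sketched as follows: split the text into chunks, summarize each chunk, join the summaries, and repeat until the result is short enough. This is a minimal illustration of the general idea and not the method from the paper. `summarize_chunk` is a hypothetical stand-in that just keeps the first few words, where a real system would call a trained model.

```python
def summarize_chunk(text, max_words=8):
    """Placeholder summarizer: keep the first `max_words` words.

    A real system would use a learned model here.
    """
    return " ".join(text.split()[:max_words])


def recursive_summarize(text, chunk_words=20, max_words=8):
    """Summarize text of any length by summarizing summaries."""
    if max_words >= chunk_words:
        # Each pass must shrink the text, or the recursion would not end.
        raise ValueError("max_words must be smaller than chunk_words")
    words = text.split()
    if len(words) <= chunk_words:
        return summarize_chunk(text, max_words)
    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    partial_summaries = [summarize_chunk(c, max_words) for c in chunks]
    return recursive_summarize(" ".join(partial_summaries), chunk_words, max_words)


# Made-up "book" of 100 numbered words.
book = " ".join(f"w{i}" for i in range(100))
print(recursive_summarize(book))
```

The appeal of this structure for alignment is that a human can check each small step (is this a good summary of this chunk?) even when checking a summary of the whole book directly would be too costly.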

Lucas Perry: Why don’t we know that AI systems won’t totally kill us all?

Rohin Shah: The arguments for AI risk usually depend on having an AI system that’s ruthlessly maximizing an objective in every new situation it encounters. So for example, the paperclip maximizer, once it’s built 10 paperclip factories, it doesn’t retire and say, “Yep, that’s enough paperclips.” It just continues turning entire planets into paperclips. Or if you consider the goal of, make a hundred paperclips, and it turns all of the planets into computers to make sure it is as confident as possible, that it has made a hundred paperclips. These are examples of, I’m going to call it “ruthlessly maximizing” an objective. And there’s some sense in which this is weird and humans don’t behave in that way. And I think there’s some amount of, basically I am unsure whether or not we should actually expect AIs to have such ruthlessly maximized objectives. I don’t really see the argument for why that should happen. And I think, as a particularly strong piece of evidence against this, I would note that humans don’t seem to have these sorts of objectives.

It’s not obviously true. There are probably some longtermists who really do want to tile the universe with hedonium, which seems like a pretty ruthlessly maximizing objective to me. But I think even then, that’s the exception rather than the rule. So if humans don’t ruthlessly maximize objectives and humans were built by a similar process as is building neural networks, why do we expect the neural networks to have objectives that they ruthlessly maximize?

You can also… I’ve phrased this in a way where it’s an argument against AI risk. You can also phrase it in a way in which it’s an argument for AI risk, where you would say, well, let’s flip that on its head and say like, “Well, yes, you brought up the example of humans. Well, the process that created humans is trying to maximize, or it is an optimization process, leading to increased reproductive fitness. But then humans do things like wear condoms, which does not seem great for reproductive fitness, generally speaking, especially for the people who are definitely out there who decide that they’re just never going to reproduce. So in that sense, humans are clearly having a large impact on the world and are doing so for objectives that are not what evolution was naively optimizing.”

And so, similarly, if we train AI systems in the same way, maybe they too will have a large impact on the world, but not for what the humans were naively training the system to optimize.

Lucas Perry: We can’t let them know about fun.

Rohin Shah: Yeah. Terrible. Well, I don’t want to be-

Lucas Perry: The whole human AI alignment project will run off the rails.

Rohin Shah: Yeah. But anyway, I think these things are a lot more conceptually tricky than the well-polished arguments that one reads, will make it seem. But especially this point about, it’s not obvious that AI systems will get ruthlessly maximizing objectives. That really does give me quite a bit of pause, in how good the AI risk arguments are. I still think it is clearly correct to be working on AI risk, because we don’t want to be in the situation where we can’t make an argument for why AI is risky. We want to be in the situation where we can make an argument for why the AI is not risky. And I don’t think we have that situation yet. Even if you completely buy the, we don’t know if there’s going to be ruthlessly maximizing objectives, argument, that puts you in the epistemic state where we’re like, “Well, I don’t see an iron clad argument that says that AIs will kill us all.” And that’s sort of like saying… I don’t know. “Well, I don’t have an iron clad argument that touching this pan that’s on this lit stove, will burn me, because maybe someone just put the pan on the stove a few seconds ago.” But it would still be a bad idea to go and do that. What you really want, is a positive argument for why touching the pan is not going to burn you, or analogously, why building the AGI is not going to kill you. And I don’t think we have any such positive argument, at the moment.

Lucas Perry: Part of this conversation’s interesting because I’m surprised how uncertain you are about AI as an existential risk.

Rohin Shah: Yeah. It’s possible I’ve become slightly more uncertain about it in the last year or two. I don’t think I was saying things that were quite this uncertain before then, but I think I have generally been… We have plausibility arguments. We do not have like, this is probable, arguments. Or back in 2017 or 2018 when I was young and naive.

Lucas Perry: Okay.

Rohin Shah: This makes more sense.

Lucas Perry: We’re no longer young and naive.

Rohin Shah: Well, okay. I entered the field of AI alignment. I read my first AI alignment paper in September of 2017. So it actually does make sense. At that time, I thought we had more confidence of some sort, but since posting the value learning sequence, I’ve generally been more uncertain about AI risk arguments. I don’t talk about it all that much, because as I said, the decision is still very clear. The decision is still, work on this problem. Figure out how to get a positive argument that the AI is not going to kill us. And ideally, a positive argument that the AI does good things for humanity. I don’t know, man. Most things in life are pretty uncertain. Most things in the future are even way, way, way more uncertain. I don’t feel like you should generally be all that confident about technologies that you think are decades out.

Lucas Perry: Feels a little bit like those images of the people in the fifties drawing what the future would look like, and the images are ridiculous.

Rohin Shah: Yep. Yeah. I’ve been recently watching Star Wars. Now, obviously Star Wars is not actually supposed to be a prediction of the future, but it’s really quite entertaining, to actually just think about all the ways in which Star Wars would be totally inaccurate. And this is before we’d even invented space travel. And just… Robots talking to each other, using sound. Why would they do that?

Lucas Perry: Industry today, wouldn’t make machines that speak by vibrating air. They would just send each other signals electromagnetically. So how much of the alignment and safety problems in AI do you think will be solved by industry? The same way that computer-to-computer communication is solved by industry, and is not what Star Wars thought it would be. Would the DeepMind AI safety lab exist, if DeepMind didn’t think that AI alignment and AI safety were serious and important? I don’t know if the lab is purely aligned with the commercial interests of DeepMind itself, or if it’s also kind of seen as a good-for-the-world thing. I bring it up because I like how Andrew Critch talks about it in his ARCHES paper.

Rohin Shah: Yep. So, Critch is, I think, of the opinion that both preference learning and robustness are problems that will be solved by industry. I think he includes robustness in that. And I certainly agree to the extent that you’re like, “Yes, companies will do things like learning from human preference.” Totally. They’re going to do that. Whether they’re going to be proactive enough to notice the kinds of failures I mentioned, I don’t know. It doesn’t seem nearly as obvious to me that they will be, without dedicated teams that are specifically meant for looking for hidden failures with the knowledge that these are really important to get, because they could have very bad long term consequences.

AI systems could increase the strength of, and accelerate various multi-agent systems and processes that, when accelerated, could lead to bad outcomes. So for example, a great example of a destructive multi-agent effect, is war. War is a thing that… Well, wars have been getting more destructive over time, or at least the weapons in them have been getting more destructive. Probably the death tolls have also been getting higher, but I’m not as sure about that. And you could imagine that if AI systems continue to increase, if they increase the destructiveness of weapons even more, wars might then become an existential risk. That’s a way in which you can get a structural risk from a multi-agent system. And the example in which the economy just sort of becomes much, much, much bigger, but becomes decoupled from things that humans want, is another example of how a multi-agent process can sort of go haywire, especially with the addition of powerful AI systems. I think that’s also a canonical scenario that Critch would think about. Yeah.

Really, I would say that ARCHES is, in my head, categorized as a technical paper about structural risks.

Lucas Perry: Do you think about what beneficial futures look like? You spoke a little bit about wisdom earlier, and I’m curious what good futures with AI, looks like to you.

Rohin Shah: Yeah, I admit I don’t actually think about this very much. Because my research is focused on more abstract problems, I tend to focus on abstract considerations, and the main abstract consideration from the perspective of the good future, is, well, once we get to singularity levels of powerful AI systems, anything I say now, there’s going to be something way better that AI systems are going to enable. So then, as a result, I don’t think very much about it. But that’s mostly a thing about me not being in a communications role.

Lucas Perry: You work a lot on this risk. So you must think that humanity existing in the future, matters?

Rohin Shah: I do like humans. Humans are pretty great. I count many of them amongst my friends. I’ve never been all that good at the transhumanist, look to the future and see the grand potential of humanity, sorts of visions. But when other people say them or give them, I feel a lot of kinship with them. The ones that are all about humanity’s potential to discover new forms of art and music, reach new levels of science, understand the world better than it’s ever been understood before, fall in love a hundred times, learn all of the things that there are to know. Actually, you won’t be able to do that one, probably, but anyway. Learn way more of the things that there are to know, than you have right now. Just a lot of that resonates with me. And that’s probably a very intellectual-centric view of the future. I feel like I’d be interested in hearing the view of the future that’s like, “Ah yes, we have the best video games and the best TV shows. And we’re the best couch potatoes that ever were.” Or also, there’s just insane new sports that you have to spend lots of time and grueling training for, but it’s all worth it when you shoot the best, get a perfect score on the best dunk that’s ever been done in basketball, or whatever. I recently watched a competition of apparently there are competitions in basketball of just aesthetic dunks. It’s cool. I enjoyed it. Anyway. Yeah. It feels like there’s just so many other communities that could also have their own visions of the future. And I feel like I’d feel a lot of kinship with many of those, too. And I’m like, man, let’s just have all the humans continue to do the things that they want. It seems great.

Lucas Perry: One thing that you mentioned was that you deal with abstract problems. And so what a good future looks like to you, it seems like it’s an abstract problem that later, the good things that AI can give us, are better than the good things that we can think of, right now. Is that a fair summary?

Rohin Shah: That seems right. Yeah.

Lucas Perry: Right. So there’s this view, and this comes from maybe Steven Pinker or someone else. I’m not sure. Or maybe Ray Kurzweil, I don’t know… Where if you give a caveman a genie, or an AI, they’ll ask for maybe a bigger cave, and, “I would like there to be more hunks of meat. And I would like my pelt for my bed to be a little bit bigger.” Go ahead.

Rohin Shah: Okay. I think I see the issue. So I actually don’t agree with your summary of the thing that I said.

Lucas Perry: Oh, okay.

Rohin Shah: Your rephrasing was that we ask the AI what good things there are to do, or something like that. And that might have been what I said, but what I actually meant was that with powerful AI systems, the world will just be very different. And one of the ways in which it will be different is that we can get advice from AIs on what to do. And certainly, that’s an important one, but also, there will just be incredible new technologies that we don’t know about. New realms of science to explore, new concepts that we don’t even have names for, right now. And one that seems particularly interesting to me, is just entirely new senses. Human vision is just incredibly complicated, but I can just look around the room and identify all the objects with basically no conscious thought. What would it be like to understand DNA at that level? AlphaFold probably understands DNA at maybe not quite that level, but something like it.

I don’t know, man. There’s just like these things that I’m like… I thought of the DNA one because of AlphaFold. Before AlphaFold, would I have thought of it? Probably not. I don’t know. Maybe. Kurzweil has written a little bit about things like this. But it feels like there will just be far more opportunities. And then also, we can get advice from AIs, but that’s probably… Actually- and that’s important, but I think less than… There are far more opportunities, that I am definitely not going to be able to think of today.

Lucas Perry: Do you think that it’s dissimilar, from the caveman wishing for more caveman things?

Rohin Shah: Yeah. I feel like in the caveman story… It’s possible that the caveman does this, I feel like the thing the caveman should be doing, is something like, give me better ways to… give me better food or something, and then you get fire to cook things, or something.

Lucas Perry: Yeah.

Rohin Shah: The things that he asks for, should involve technology as a solution. He should get technology as a solution, to learn more, and be able to do more things as a result of having that technology. In this hypothetical, the caveman should reasonably quickly, become similar to modern humans. I don’t know what reasonably quickly means here, but it should be much more… You get access to more and more technologies, rather than you get a bigger cave and then you’re like, “I have no more wishes anymore.” If I got a bigger house, would I stop having wishes? That seems super unlikely. That’s a strawman argument, sorry. But still, I do feel like there’s this… A meaningful sense in which, getting new technology leads to just genuinely new circumstances, which leads to more opportunities, which leads to probably more technology, and so on, and at some point, this has to stop. There are limits to what is possible. One assumes there are limits to what is possible in the universe. But I think, once we get to talking about, we’re at those limits, then at that point, it just seems irresponsible to speculate. It’s just so wildly out of the range of things that we know, the concept of a person is probably wrong, at that point.

Lucas Perry: The what of a person is probably wrong at that point?

Rohin Shah: The concept of a person.

Lucas Perry: Oh.

Rohin Shah: I’d be like, “Is there an entity, that is Rohin at that time?” Not likely. Less than 50%.

Lucas Perry: We’ll edit in just fractals flying through your video, at this part of the interview. So in my example, I think it’s just because I think of cavemen as not knowing how to ask for new technology, but we want to be able to ask for new technology. Part of what this brings up for me, is this very classic part of AI alignment, and I’m curious how you feel like it fits into the problem.

But, we would also like AI systems to help us imagine beneficial futures potentially, or to know what is good or what it is that we want. So, in asking for new technology, it knows that fire is part of the good, that we don’t know how to necessarily ask for directly. How do you view AI alignment, in terms of itself aiding in the creation of beneficial futures, and knowing of a good that is beyond the good, that humanity can grasp?

Rohin Shah: I think I more reject, the premise of the question, where I’d be like, there is no good beyond that which humanity can grasp. This is somewhat of an anti-realist position.

Lucas Perry: You mean, moral anti-realist, just for the-

Rohin Shah: Yes. Sorry, I should have said that more clearly. Yeah. Somewhat of a moral anti-realist position. There is no good, other than that which humans can grasp. Within that ‘could grasp’, you can have humans thinking for a very long time, you could have them with extra… you can make them more intelligent, like part of the technologies you get from AI systems will presumably let you do that, maybe you can, I guess setting aside questions of philosophical identity, you could upload the humans such that they could run on a computer, and run much faster, have software upgrades to be… To the extent that, that’s philosophically acceptable. There’s a lot you can do to help humans grasp more. Ultimately, yes, the closure of all these improvements, where you get to with all of that, that’s just the thing that we want. Yes, you could have a theory, that there is something even better, and even more out there, that humans can never access by themselves. That just seems like a weird hypothesis to have, and I don’t know why you would have it. But, in the world where that hypothesis is true, and if I condition on that hypothesis being true, I don’t see why we should expect, that AI systems could access that further truth any better than we can, if it’s out of our… the closure of what we can achieve, even with additional intelligence and such. There’s no other advantage that AI systems have over us.

Lucas Perry: So, is what you’re arguing that, with human augmentation and help to human beings, so like with uploads or with expanding the intelligence and capabilities of humans, humans have access to the entire space of what counts as good?

Rohin Shah: I think you’re presuming the existence of an object that is the entire space of what is good. And I’m like, there is no such object, there are only humans, and what humans want to do. If you want to define the space of what is good, you can define this closure property on what humans will think is good, with all of the possible intelligence augmentations and time, and so on. That’s a reasonable object, and I could see calling that the space of what is good. But then, almost tautologically, we can reach it with technology. That’s the thing I’m talking about. The version where you posit the existence of the entire space of what is good is: A, I can’t really conceive of that, it doesn’t feel very coherent to me, but B, when I try to reason about it anyway, I’m like, okay, if humans can’t access it, why should AIs be able to access it? You’ve posited this new object of, a space of things, that humans can never access, but how does that space affect or interact with reality in any way? There needs to be some sort of interaction, in order for the AI to be able to access it. I think I would need to know more about how it interacts with reality in some way, before I could meaningfully answer this question in a way, where I could say how AIs could do something, that humans couldn’t, even in principle, do.

Lucas Perry: What do you think of the importance, or non importance of these kinds of questions, and how they fit into the ongoing problem of AI alignment?

Rohin Shah: I think they’re important, for determining what the goal of alignment should be. So for example, you now know a little bit of what my view on these questions is, which is namely something like… That which humans can access, under sufficient augmentations, intelligence, time and so on, is all that there is. So I’m very into… build AI systems that are replicating human reasoning, they’re approximating what a human would do, if they thought for a long time, or were smarter in some ways and so on. So then, yeah we don’t need to worry much about… I tend to think of it as, let’s build AI systems that just do tasks, that humans can conceptually understand, not necessarily that they can do it, but they know what that task is. Then, our job is to, the entire human AI society is making forward progress towards… Making forward moral progress or other progress, in the same way that this has happened in the past, we get exposed to new situations and new arguments, we think about them for a while, and then somehow we make decisions about what’s good and what’s not, in a way that’s somewhat inscrutable. I’m much more about… So we just continue reiterating that process, and eventually we reach the space of, well yeah, we just continue reiterating that process. So I’m very much into, because of this view, I think it’s pretty reasonable to aim for AI systems that are just doing human-like reasoning, but better. Or approximating, doing what a human could do in a year, in a few minutes or something like that. That seems great to me. Whereas if you, on the other hand were like, no, there’s actually deep philosophical truths out there, that humans might never be able to access, then you’re probably less enthusiastic about that sort of plan, and you’ll want to build an AI system some other way.

Lucas Perry: Or maybe they’re accessible, with the augmentation and time. How do other minds fit into this for you? So, right, there’s the human mind and then the space of all that is good, that it has access to, with augmentation, which is what you call the space of that which is good. It’s contingent on, and rooted in, the space of what the human mind, augmented, has access to. How would you view, how does that fit in with animals and also other species which may have their own alignment problems on planets within our cosmic endowment that we might run into? Is it just that they also have spaces that are defined as good, as what they can access through their own augmentation? And then, there’s no way of reconciling these two different AI alignment projects?

Rohin Shah: Yeah, I think basically, yes. If I met an actual ruthless, maximizing paperclip… Paperclip maximizer. It’s not like I can argue it, into adopting my values, or anything even resembling them. I don’t think it would be able to argue me into accepting turning me into paperclips, which is what it desires, and that just seems like the description of reality. Again, a moral realist might say something else, but I’ve never really understood the flavor of moral realism that would say something else in that situation.

Lucas Perry: With regards to the planet and industry, and how industry will be creating increasingly capable AI systems, could you explain what a unipolar scenario is, and what a multipolar scenario is?

Rohin Shah: Yeah, so I’m not sure if I recall exactly where these terms were defined, but a unipolar scenario, at least as I understand it, would be a situation in which, one entity basically determines the long run future of the earth. More colloquially, it has taken over the world. You can also have a time bounded version of it, where it’s unipolar for 20 years, and this entity has all the power for those 20 years, but then, maybe the entity is a human, and we haven’t solved aging yet, and then the human dies. So then, it was a unipolar world for that period of time. And a multipolar world is just, not that. There is no one entity, that is said to be in control of the world. There’s just a lot of different entities that have different goals, and they’re coexisting, hopefully cooperating, maybe not cooperating, depends on the situation.

Lucas Perry: Which do you think is more likely to lead to beneficial outcomes, with AI?

Rohin Shah: So, I don’t really think about it in these terms. I think about it in like, there are these kinds of worlds that we could be in, some of them are unipolar and some of them are multipolar, but very different unipolar worlds, and very different multipolar worlds. And so, the sorts of questions, the closest analogous question is something like, if you condition on unipolar world, what’s the probability that it’s beneficial or that it’s good. If you condition on multipolar world, what’s the probability that it is good? And it’s just a super complicated question that I wouldn’t be able to explain my reasoning for, because it would involve me like thinking about 20 different worlds, maybe not that many, but a bunch of different worlds in my head, estimating their probabilities by doing a Bayes’ rule… I guess, kind of a Bayes’ rule calculation, and then reporting the result.

So, I think maybe the question I will answer instead, is the most likely worlds in each of unipolar, and multipolar settings, and then, how good those seem to me. So I would say, I think by default, I expect the world to be multipolar, in that it doesn’t seem like anyone is particularly… I don’t think anyone has particularly taken over the world today, or any entity, not even counting the US as a single entity. It’s not like the US has taken over the world. It does not seem to me like… Though the main way you could imagine getting a unipolar world is, if the first actor to build a powerful enough AI system, that AI system just becomes really, really powerful and takes over the world, before anyone can deploy an AI system even close to it.

Sorry, that’s not the most likely one. That’s the one that most people most often talk about, and probably the one that other people think is the most likely, but yeah. Anyway, I see the multipolar world as more likely, where we just have a bunch of actors that are all pretty well-resourced, that are all developing their own AI systems. They then sell their AI systems, or the ability to use their AI systems to other people, and then it’s sort of similar to the human economy, where you can just have AI systems provide labor at a fixed cost. It looks similar to the economy today, where people who control a lot of resources can instantiate a bunch of AI systems, that help them maintain whatever it is they want, and we remain in the multipolar world we have today.

And that seems… Decent. I think, for all that our institutions are not looking great, at the current moment. There is still something to be said, that nuclear war didn’t actually happen, which can either update you towards, our institutions are somewhat better than we thought, or it can update you towards, if we had nuclear war, we would have all died, and not been here to ask the question. I don’t think that second one is all that plausible. My understanding is that nuclear war is not that likely to wipe out everyone, or even 90% of people. So I’m more… I lean towards the first explanation. Overall, my guess is, this is the thing that has worked for the last… ‘Worked’, the thing that has, generally led to an increase in prosperity. Or, the world has clearly improved on most metrics over time. And, this system we’ve been using, for most of that time is some sort of multipolar, people interact with each other and keep each other in check, and cooperate with each other because they have to, and so on. In the modern world we use, and not just the modern world, we use things like regulations and laws and so on, to enforce this. The system’s got some history behind it, so I’m more inclined to trust it. But overall, I feel okay about this world, assuming we solve the alignment problem, we’ll ignore the alignment problem for now.

For a unipolar world. I think, probably, I find it more likely that there will just be a lot of returns to scale. You’ll get a lot of efficiency from centralizing more and more, in the same way that it’s just really nice to have a single standard, rather than have 15 different standards. It sure would have been nice, if when I moved to the UK, I could have just used all of my old chargers without having to buy adapters. But no, all the outlets are different, right? There’s benefits to standardization and centralization of power, and it seems to me, there has been more and more of that over time. Maybe it’s not obvious, I don’t know very much history, but if… So, it seems like you could get, even more centralization in the future, in order to capture the efficiency benefits, and then you might have a global government that could reasonably be said to be the entity that controls the world, and that would then be a unipolar outcome. It’s not a unipolar outcome in which the thing in charge of the world is an AI system. It is a unipolar outcome. I feel wary of this, but I don’t like having a single point of failure. I don’t like it when there’s a… However, I really like it when people are allowed to advocate for their own interests, which isn’t necessarily not happening here, right?

This could be a global democracy, but still, it seems like, the libertarian intuition of markets are good, generally tends to suggest against centralization, and I do buy that intuition, but this could also just be status quo bias, where I know that I can very easily see the problems in the world that we’re not actually in at the moment, and I don’t want it to change. So I don’t know, I don’t have super strong opinions there. It’s very plausible to me that that world is better, because then you can control dangerous technologies much, much better. If there just are technologies that are sufficiently dangerous and destructive, they would destroy, they would lead to extinction, then maybe I’m more inclined to favor a unipolar outcome.

Lucas Perry: I would like to ask you about DeepMind, and maybe another question before we wrap up. What is it, that the safety team at DeepMind is up to?

Rohin Shah: No one thing. The safety team at DeepMind is reasonably large, and there’s just a bunch of projects going on. I’ve been doing a bunch of inner alignment stuff. Most recently, I’ve been trying to come up with more examples that are, in actual systems, rather than hypotheticals. I’ve also been doing a bunch of conceptual work, of just trying to make our arguments clearer, and more conceptually precise. A large smattering of stuff, not all that related to each other, except in as much as it’s all about AI alignment.

Lucas Perry: As a final question here, Rohin, I’m interested in your core at the center of all of this. What’s the most important thing to you right now? Insofar as, AI alignment, may be the one thing, that most largely impacts the future of life?

Rohin Shah: Ah.

Lucas Perry: If you just look at the universe right now, and you’re like, these are the most important things.

Rohin Shah: I think, for things that I impact, at a more granular, more granular than just, make AI go well… I think for me, it’s probably making better arguments and more convincing arguments, currently. This will probably change in the future. Partially because I hope to succeed at the skill, and then it won’t be as important. But I feel like right now, especially with the advent of these large neural nets, and more people seeing a path to AGI, I think it is much more possible to make arguments that would be convincing to ML researchers, as well as the philosophically oriented people who make up the AI safety community, and I think, that just feels like the most useful thing I can do at the moment. In terms of the world in general… I feel like it is something like the attitudes of consequential people towards… Well, long-termism in general, but maybe risks in particular, where, and importantly, I do feel it has more… I care primarily about, the people who are actually making decisions, that impact the future. Maybe they are taking into account the future. Maybe they’re like, it would be nice to care about the future, but the realities of politics mean that I can’t do that, or else I will lose my job. But my guess is that they’re mostly just not thinking about the future. That seems… If you’re talking about the future of life, that seems like the most, that seems pretty important to change.

Lucas Perry: How do you see doing that, when many of these people don’t have the… As Sam Harris put it, ‘the science fiction geek gene’ is what he called it, when he was on this podcast. The long-termists, who are all, we’re going to build AGI, and then create these radically different futures. Many of these people, may just mostly care about their children and their grandchildren, that may be the human tendency.

Rohin Shah: Do we actually advocate for any actions that would not impact their grandchildren?

Lucas Perry: It depends on your timelines, right?

Rohin Shah: Fair enough. But, most of the time, the arguments that I see people giving for any preferred policy proposal of theirs, or act… Just like almost any action whatsoever. It seems to be a thing, that would have a noticeable effect on people’s lives in the next 100 years. So, in that sense, grandchildren should be enough.

Lucas Perry: Okay. So then long-termism doesn’t matter.

Rohin Shah: Well… I don’t-

Lucas Perry: For getting the action done.

Rohin Shah: Oh, possibly. I still think they’re not thinking about the future. I think it’s more of a… I don’t know, if I had to take my best guess at it, with noting the fact that I am just a random person, who is not at all an expert in these things, because why would I be? And yes listeners, noting that Lucas has just asked me this question, because it sounds interesting, and not because I am at all, qualified to answer it.

It seems to me, the more likely explanation is that there are just always a gazillion things to do. There’s always $20 bills to be picked off the sidewalk, but their value is only $20. They’re not $2 billion. Everyone is just constantly being told to pick up all the $20 bills, and as a result, they are in a perpetual state of having to say no to stuff, and doing only the stuff that seems most urgent, and maybe also important. So, most of our institutions tend to be in a very reactive mindset, as a result. Not because they don’t care, but just because that’s the thing that they’re incentivized to do, is to respond to the urgent stuff.

Lucas Perry: So, getting policymakers to care about the future, whether that even just includes children and grandchildren, not the next 10 billion years, would be sufficient in your view?

Rohin Shah: It might be, it seems plausible. I don’t know that that’s the approach I would take. I think I’m more just saying, I’m not sure that you even need to convince them to care about the future, I think-

Lucas Perry: I see.

Rohin Shah: It’s possible, that what’s needed is people who have the space to bother thinking about it. I get paid to think about the future, if I didn’t get paid to think about the future, I would not be here on this podcast because I would not have enough knowledge to be worth you talking to. I think, there are just not very many people who can be paid to think about the future, and the vast majority of them are in there… I don’t know about the vast majority, but a lot of them are in our community. Very few of them are in politics. Politics generally seems to anti-select for people who can think about the future. I don’t have a solution here, but that is the problem as I see it, and if I were designing a solution, I would be trying to attack that problem.

Lucas Perry: That would be one of the most important things.

Rohin Shah: Yeah. I think on my view, yes.

Lucas Perry: All right. So, as we wrap up here, is there anything else you’d like to add, or any parting thoughts for the audience?

Rohin Shah: Yeah. I have been giving all these disclaimers during the podcast too, but I’m sure I missed them in some places, but I just want to note, Lucas has asked me a lot of questions that are not things I usually think about, and I just gave off-the-cuff answers. If you asked me them again, two weeks from now, I think for many of them, I might actually just say something different. So don’t take them too seriously, and treat… The AI alignment ones, I think you can take those reasonably seriously, but the things that were less about that, take them as some guy’s opinion, man.

Lucas Perry: ‘Some guy’s opinion, man.’

Rohin Shah: Yeah. Exactly.

Lucas Perry: Okay. Well, thank you so much for coming on the podcast Rohin, it’s always a real pleasure to speak with you. You’re a bastion of knowledge and wisdom in AI alignment and yeah, thanks for all the work you do.

Rohin Shah: Yeah. Thanks so much for having me again. This was fun to record.