
Kelly Wanser on Climate Change as a Possible Existential Threat

 Topics discussed in this episode include:

  • The risks of climate change in the short-term
  • Tipping points and tipping cascades
  • Climate intervention via marine cloud brightening and releasing particles in the stratosphere
  • The benefits and risks of climate intervention techniques
  • The international politics of climate change and weather modification



0:00 Intro

2:30 What is SilverLining’s mission?

4:27 Why is climate change thought to be very risky in the next 10-30 years?

8:40 Tipping points and tipping cascades

13:25 Is climate change an existential risk?

17:39 Earth systems that help to stabilize the climate

21:23 Days where it will be unsafe to work outside

25:03 Marine cloud brightening, stratospheric sunlight reflection, and other climate interventions SilverLining is interested in

41:46 What experiments are happening to understand tropospheric and stratospheric climate interventions?

50:20 International politics of weather modification

53:52 How do efforts to reduce greenhouse gas emissions fit into the project of reflecting sunlight?

57:35 How would you respond to someone who views climate intervention by marine cloud brightening as too dangerous?

59:33 What are the main points of persons skeptical of climate intervention approaches?

01:13:21 The international problem of coordinating on climate change

01:24:50 Is climate change a global catastrophic or existential risk, and how does it relate to other large risks?

01:33:20 Should effective altruists spend more time on the issue of climate change and climate intervention?

01:37:48 What can listeners do to help with this issue?

01:40:00 Climate change and Mars colonization

01:44:55 Where to find and follow Kelly




Kelly’s Twitter

Kelly’s LinkedIn


We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the Future of Life Institute Podcast. I’m Lucas Perry. In this episode, we have Kelly Wanser joining us from SilverLining. SilverLining is a non-profit that is focused on ensuring a safe climate due to the risks of near-term catastrophic climate change. Given that we may fail to reduce CO2 emissions sufficiently, it may be necessary to take direct action to promote cooling of the planet to stabilize both human and Earth systems. This conversation centrally focuses on how we might intervene in the climate by brightening marine clouds to reflect sunlight and thus cool the planet down and offset global warming. This episode also explores other methods of climate intervention, like releasing particles in the stratosphere, their risks and benefits, and we also get into how climate change fits into global catastrophic and existential risk thinking.

There is a video recording of this podcast conversation uploaded to our YouTube channel. You can find a link in the description. This is the first in a series of video uploads of the podcast to see if that’s something that listeners might find valuable. Kelly shows some slides during our conversation and those are included in the video version. The video podcast’s audio and content are unedited, so it’s a bit longer than the audio-only version and contains some sound hiccups and more filler words.

Kelly Wanser is an innovator committed to pursuing near-term options for ensuring a safe climate. In her role as Executive Director of SilverLining, she oversees the organization’s efforts to promote scientific research, science-based policy, and effective international cooperation in rapid responses to climate change. Kelly co-founded—and currently serves as Senior Advisor to—the University of Washington Marine Cloud Brightening Project, an effort to research and understand one possible form of climate intervention: the cooling effects of particles on clouds. She also holds degrees in economics and philosophy from Boston College and the University of Oxford.

And with that, let’s get into our conversation with Kelly Wanser.

Let’s kick things off here with just a simple introductory question. So could you give us a little bit of background about SilverLining and what is its mission?

Kelly Wanser: Sure, Lucas. I’m going to start by thanking you for inviting me to talk with you and your community, because the issue of existential threats is not an easy one. So our approach at SilverLining I think overlaps with some of the kinds of dialogue that you’re having, where we’re really concerned about the sort of catastrophic risks that we may have with regards to climate change in the next 10 to 30 years. So SilverLining was started specifically to focus on near-term climate risk and the uncertainty you have about climate system instability, runaway climate change, and the kinds of things we don’t have insurance policies against yet. My background is from the technology sector. I worked in areas of complex systems analysis and IT infrastructure. And so I came into this problem looking at it primarily from a risk point of view, and the fact that the kind of risk that we currently have exposure to is an unacceptable one.

So we need to expand our toolkit and our portfolio until we’ve got sufficient options in there that we can address the different kinds of risks that we’re facing in the context of the climate situation. SilverLining is a two year old organization, and there are two things that we do. We look at policy and sort of driving how in particular these interventions in climate, these things that might help reduce warming or cool the planet quickly, how we might move those forward in terms of research and assessment from a policy perspective, and then how we might actually help drive research and technology innovation directly.

Lucas Perry: Okay, so the methods of intervention are policy and research?

Kelly Wanser: Our methods of operation are policy and research, the methods of intervention in particular that I’m referring to are these technologies and approaches for directly and rapidly reducing warming in the climate system.

Lucas Perry: So in what you just said you mentioned that you’re concerned about catastrophic risks from climate change for example, in the next 10 to 30 years. Could you paint us a little bit of a picture about why that kind of timescale is relevant? I think many people and myself included might have thought that the more significant changes would take longer than 10 to 30 years. So what is the general state of the climate now and where we’re heading in the next few decades?

Kelly Wanser: So I think there are a couple of key issues in the evolution of climate change and what to expect and how to think about risk. One is that the projections that we have, it’s a tough type of system and a tough type of situation to project and predict. And there are some things that climate modelers and climate scientists know are not adequately represented in our forecasts and projections. So a lot of the projections we’ve had over the past 10 or 15 years talk about climate change through 2100. And we see these sort of smooth curves depending on how we manage greenhouse gases. But people who are familiar with the climate system itself or complex systems problems know that there are these non-linear events that are likely to happen. Now climate models have a very difficult time representing those. So in many cases they’re either sort of roughly represented or excluded entirely.

And those are the things that we talk about in terms of abrupt change and tipping points. So our climate model projections are actually missing or underrepresenting tipping points. Things like the release of greenhouse gases from permafrost that could happen suddenly and very quickly as the surface melts. Things like the collapse of big ice sheets and the downstream effects of that. So one of the concerns that we have at SilverLining relates to some of the things that tech people know how to do on similar problems, like managing an IT network. It’s a highly complex systems problem, where you’re trying to maintain a stable state of the network. And some of the techniques that we use for doing that have not been fully applied to looking at the climate problem. The same goes for some of the techniques we use in finance; one of our advisors is the former director of global research at Goldman Sachs.

And this is a problem we’re talking to him about and folks in the IPCC and other places. Essentially, we need some new and different types of analysis applied to this problem beyond just what the climate models do. So problem number one is that our analytic techniques are underrepresenting the risk, and particularly, potentially, risk in the near term. The second piece is that these abrupt climate changes tend to be highly related to what they call feedbacks, meaning that there are points at which these climate changes produce effects that either put warming back in the system or greenhouse gases back in the system or both. And once that starts to happen, the problem could get away from us in terms of our ability to respond. Now we might not know whether that risk is 5%, 10% or 80%. From SilverLining’s perspective, from my perspective, any meaningful risk of that in the next 10 to 30 years is an unacceptable level of risk, because it’s approaching somewhere between catastrophic and existential.

So we’re less concerned about the arm-wrestle debate over whether there is some scenario where we can constrain the system by just reducing greenhouse gases. We’re concerned about, are there scenarios where that doesn’t work, scenarios where the system moves faster than we can constrain greenhouse gases? The final thing I’ll say is that we’re seeing evidence of that now. So some of the things that we’re seeing, like these extraordinary wildfire events and what’s happening to the ice sheets, are things that are happening at the far end of bad predictions. The observations of what’s happening in the system are indicative of the fact that that risk could be pretty high.

Lucas Perry: Yeah. So you’re ending here on the point that say fires that we’re observing more recently are showing that tail end risks are becoming more common. And so they’re less like tail end risks and more like becoming part of the central mass of the Gaussian curve?

Kelly Wanser: That’s right.

Lucas Perry: Okay. And so I want to slow down a little bit, because I think we introduced a bunch of crucial concepts here. One of these is tipping points. So if you were to explain tipping points in one to two sentences to someone who’s not familiar with climate science, how would you do that?

Kelly Wanser: The metaphor that I like to use is similar to a fever in the human body. Warming heat acts as a stressor on different parts of the system. So when you have a fever, you can carry a fever up to a certain point. And if it gets high enough and long enough, different parts of your body will be affected, like your brain, your organs and so on. The trapped heat energy in the climate system acts as a stressor on different parts of the system. And they can warm a bit over a certain period of time and they’ll recover their original state. But beyond a certain point, essentially the conditions of heat that they’re in are sufficiently different than what they’re used to, that they start to fundamentally change. And that can happen in biological systems where you start to lose the animal species, plant species, that can happen in physical systems where the structure of an ice sheet starts to disintegrate, and once that structure breaks down, it doesn’t come back.

Forests have this quality too where if they get hot enough and dry enough, they may pass a point where their operation as a forest no longer works and they collapse into something else like desertification. So there are two concerns with that. One is that we lose these big systems permanently because they change state in a way that doesn’t recover. And the second is that when they do that, they either add warming or add greenhouse gases back into the system. Take an ice sheet collapse, for example: these big ice structures reflect a huge amount of sunlight back out to space. And when we lose them, they’re replaced by dark water. And so that’s basically a trade-off from cooling to warming that’s happening with ice. And so there are different things like that, where that combination of losing that system and then having it really change the balance of warming is a double-faceted problem.

Lucas Perry: Right, so you have these dynamic systems which play an integral part in maintaining the current climate stability, and they can undergo a phase state change. Like water is water until you hit a certain degree. And then it turns into ice or it evaporates and turns into steam, except you can’t go back easily with these kinds of systems. And once it changes, it throws off the whole more dynamic context that it’s in, it’s stabilizing the environment as we enjoy it.

Kelly Wanser: One of the problems that you have is not just that any one of these systems might change its state and might start putting warming or greenhouse gases back into the atmosphere, but they’re linked to each other. And so then they call that the cascade effect where one system changes its state and that pushes another system over the edge, and that pushes another system over the edge. So a collapse of ice sheets can actually accelerate the collapse of the Amazon rainforest for example, through this process. And that’s where we come more towards this existential category where we don’t want to come anywhere near that risk and we’re dangerously near it.

And so one of the problems that scientists like Will Steffen and some Arctic scientists, for example, are seeing is that they think we’re in some of these tipping points. I work with climate scientists really closely, and I hear them saying, “We may be in it. Some of these tipping points are starting to occur.” And so the ice ones, we have front page news on that, the forest ones we’re starting to see. So that’s where the concern becomes that we sort of lack the measures to address these things if they’re happening in the next one, two or three decades.

Lucas Perry: Is this where a word like runaway climate change becomes relevant?

Kelly Wanser: Yes. When I came into the space like 12 years ago, and for many of your listeners, I came in from tech first as a sort of area of passion interest. And one of the first people I talked to was a climate scientist named Steve Schneider, who was at Stanford at the time, and he has since passed away, but he was a giant of the field. And I asked him kind of the question you’re referring to, which is, “How would you characterize the odds of runaway change within our lifetime?” And he said at that time, which was about 12 years ago, “I put it in the single digits, but not the low single digits.” My reaction to that was, if you had those odds of winning the lottery, you’d be out buying tickets. And that’s an unacceptable level of risk where we don’t have responses that really meaningfully arrest or reduce warming in that kind of time.

Lucas Perry: Okay. And so another point here is you used the word “existential” a few times here, and you’ve also used the word “global catastrophic.” I think broadly within the existential risk community, at least the place where I come from, climate change is not viewed as an existential risk. Even if it gets really, really, really bad, it’s hard to imagine ways in which it would kill all people on the planet rather than like make life very difficult for most of them and kill large fractions. And so it’s generally viewed as a global catastrophic threat being that it would kill large fractions, but not be existential. What is your reaction to that? And how do you view the use of the word “existential” here?

Kelly Wanser: Well, so for me there are two sides to that question. I normally stay on one of the two sides, which is that for SilverLining our mission is to prevent suffering. The loss of a third of the population of the planet or two thirds of the population of the planet and the survival of some people in interconnected bubbles, which I’ve heard top analysts talk about: for us that’s an unacceptable level of suffering and an unacceptable outcome. And so in that way the debate about whether it’s all people or just lots of people is for us not material, because that whole situation seems to be not a risk that you want to take. On the other side of your question, whether it is all people and whether it is planetary livability, I think that question is subject to some of the inability to fully represent all of the systemic effects that happen at these levels of warming.

Early on, I talked about this with the director of NASA Ames at the time, who’s now at Planet Labs. What he talked to me about was the changes in chemistry of the earth system. This is something that hasn’t maybe been explored that widely, but we’re already looking at collapses of life in the ocean. And between them, the ocean and the land systems generate a lot of the atmosphere that we’re familiar with and that’s comfortable for people. And there are risks to that: we can’t have these collapses of biological life and necessarily maintain the atmosphere that we’re used to. And so I think that it’s inappropriate to discount the possibility that the planet could become largely unlivable at these higher levels of heat.

And at the end of the runaway climate change scenario, where the heat levels get very high and life collapses in an extreme way, I don’t think that’s been analyzed well enough yet. And I certainly wouldn’t rule it out as an existential risk. I think that that would be inappropriate, given both our level of knowledge and the fact that we know that we have these sort of non-linear cascading things that are going to happen. So to me, I challenge the existential threat community to look into this further.

Lucas Perry: Excellent.

Kelly Wanser: Put it out there.

Lucas Perry: I like that. Okay, so given tipping points and cascading tipping points, you think there’s a little bit more uncertainty over how unlivable things can get?

Kelly Wanser: I do. And that’s before you also get into the societal part of it, right? Going back to what I think has been one of the fundamental problems of the climate debate is this idea that there are winners and losers and that this is a reasonably survivable situation for a certain class of people. There’s a reasonable probability that that’s not the case, and this is not going to be a world that anyone, if they do get to live in it, is going to enjoy.

Lucas Perry: Even if you were a billionaire back before climate change and you have your nice stocked bunker, you can’t keep stocking it, your money won’t be worth anything.

Kelly Wanser: In a world without strawberries and lobsters and rock concerts and all kinds of things that we like. So I think we’re much more in it together than people think. And that over the course of many millennia, humans were engineered and fine tuned to this beautiful, extremely complicated system that we live in. And we’re pushing it, we can use our technology to the best of our ability to adapt, but this is an environment that’s beautifully made for us and we’re pushing it out of the state that supports us.

Lucas Perry: So I’d be curious if you could expand just fairly briefly here on more of the ways in which these systems, which help to maintain the current climate status, function. So for example, like the jet stream and the boreal forest and the Amazon rainforest and the Sahel and the Indian summer monsoon and the permafrost and all these other things. If you can choose, I don’t know, maybe one or two of your favorites or something or whichever few are biggest, I’m curious how these systems help continue to maintain the climate stability?

Kelly Wanser: Well, so there are people more expert than me, but I’ll talk about a couple that I care about a lot. So one is the permafrost, which is the frozen layer of earth. And that frozen layer of earth is under the surface in landmasses, and there are also frozen layers under the ocean. For many thousands of years, if not longer, those layers capture and build up biological life that’s died and decayed within these frozen layers of earth and store massive amounts of carbon. And so provided the earth system is working within its usual parameters, all of those masses stay frozen, and that organic material stays there. As it warms up in a way that it moves beyond its normal range of parameters, then that stuff starts to melt and those gases start to be released. And the amount of gas stored in the permafrost is massive. And particularly it includes both CO2 and the more potent, fast-acting gases like methane. We’re kind of sitting on the edge of that system starting to melt in a way where those releases could be massive.

And in my work that’s to me one of the things that we need to watch most closely, that’s a potential runaway situation. So that’s one, and that’s a relatively straightforward one, because that’s a system storing greenhouse gases, releasing greenhouse gases. They range in complexity. Like the Arctic is a much more complicated one because it’s related to all the physics of the movement of the atmosphere and ocean. So the circulation of the way the jet stream and weather patterns work, the circulation of the ocean and all of that. So there could be potential drastic effects on what weather is where on the planet. So major changes in the Arctic can lead to major changes in what we experience as like our normal weather. And we’re already seeing this start to happen in Europe. And that was predicted by changes in the jet stream where Europe’s always had this kind of mild sort of temperate range of temperature.

And they’re starting to see super cold winters and hot summers. And that’s because the jet stream is moving. And a lot of that is because the Arctic is melting. One that’s personally dear to me, which is actually happening now and which we may not be able to stop no matter what we do, is the coral reefs. Coral reefs are these organic structures and they teem with all different levels of life. And they trace up to about a quarter of all life in the ocean. So as these coral reefs are getting hit by these waves of hot water, they’re dying. And ultimately their collapse could mean the collapse of at least the 25% of life in the ocean that they support. And we don’t really know fully what the effects of that will be. So those are a few examples.

Lucas Perry: I feel like I’ve heard the word heat stress before in relation to coral reefs and then that’s what kills it.

Kelly Wanser: Yep.

Lucas Perry: All right. So before we move into the area you’re interested in, intervening as a potential solution if we can’t get the greenhouse gases down enough, are there any more bad things that we missed or bad things that would happen if we don’t sufficiently get climate change under control?

Kelly Wanser: So I think that there are many, and we haven’t talked too much about what happens on the human side. So there are even thresholds of direct heat for humans like the wet-bulb temperature. I’m not going to be able to describe it super expertly, but it’s the combination of heat and humidity at which the human body changes the way it’s shedding heat and that heat exchange. And so what’s happening in certain parts of the world right now, like in parts of India, like Calcutta, is that there’s an increasing number of days of the year where it’s not safe to work outside. And there were some projections that by 2030 there would be no days in Calcutta where it was safe to work outside. And we even see parts of the U.S. where you have these heat warnings. And right now, as a direct effect on humans, I just saw a study that said the actual heat index is killing more people than the smoke from fires.

The actual increase in heat is moving past where humans are actually comfortable living and interacting. As a secondary point, obviously in developed countries we have lots of tools for dealing with that in terms of our infrastructure. But one of the things that’s happening is the system is moving outside the band in which our infrastructure was built. And this is a bit of an understudied area. As warming progresses, and you have extreme temperature, you have more flooding, you have extreme storms and winds. We have everything from bridges to nuclear plants to skyscrapers that were not engineered for those conditions. Full evaluation of that is not really available to us yet. And so I think we may be underestimating even things like this: in some of these projections, we know that as sea level rise happens and extreme storms happen, places like Miami are probably lost.

And in that context, what does it mean to have a city the size of Miami sitting under water at the edge of the United States? It would be a massive environmental catastrophe. So I think unfortunately we haven’t looked closely enough at what it means for all of these parts of our human infrastructure for their external circumstances to be outside the arena they were engineered for.

Lucas Perry: Yeah. So natural systems become stressed. They come to fail, there could be cascades. Human systems and human infrastructure become stressed. I mean, you can imagine like nuclear facilities and oil rigs and whatever else can cause massive environmental damage getting stressed as well by being moved outside of the normal bandwidth of operation. It’s just a lot of bad things happening after bad things after bad things.

Kelly Wanser: Yeah. And you know, it’s a big problem. Because I’ve had this debate with people who are bullish on adaptation: “Hey, we can adapt to this.” But the problem is you have all these things happening concurrently. So it’s not just Miami, it’s Miami and San Francisco and Bangladesh. There are going to be lots of different variants of it happening all at the same time. And so anything we could do to prevent that, excuse my academic language, shit show is really something we should consider closely because the cost of that and this sort of compound damage is just pretty staggering.

Lucas Perry: Yeah. It’s often much cheaper to prevent risks than to deal with them when they come up and then clean up the aftermath. So as we try to avoid moderate to severe bad effects of climate change, we can mitigate. I think most everyone is very familiar with the idea of reducing greenhouse gas emissions. So the kinds of gases that help trap heat inside of the atmosphere. Now you’re coming at this from a different angle. So what is the research interest of SilverLining and what is the intervention of mitigating some of the effects of climate change? What is that intervention you guys are exploring?

Kelly Wanser: Well, so our interest is in the near-term risk. And so therefore we focus most closely on things that might have the potential to act quickly to substantially reduce warming in the climate system. And the problem with greenhouse gas reduction and a lot of the categories of removing greenhouse gases from the air is that they’re likely to take many decades to scale and even longer to actually act on the climate system. And so if we’re looking at sub-30 years, where we’re coming from at SilverLining is saying, “We don’t have enough in that portfolio to make sure that we can keep the system stable.” We are a science-led organization, meaning we don’t do research ourselves, but we follow the recommendations of the scientific community and the scientific assessment bodies. And in 2015 the National Academy of Sciences in the United States ran an assessment that looked at the different sort of technological interventions that might be used to accelerate addressing climate warming and greenhouse gases.

And they issued two reports, one called Climate Intervention: Carbon Dioxide Removal and Reliable Sequestration, and one called Climate Intervention: Reflecting Sunlight to Cool Earth. And what they found was that in the category where you’re looking to reduce warming quickly within a decade or even a few years, the most promising way to try to do that is based on one of the ways that the earth system actually regulates temperature, which is the reflection of sunlight from particles and clouds in the atmosphere. The theories behind why they think this might work are based on observations from the real world. And so what I’m showing you right now is a picture of a cloud bank off the Pacific West Coast, and the streaks in the clouds are created by emissions from ships. The particulates in those emissions, usually what people think of as the dirty stuff, have a property where they often mix with clouds in a way that will make the clouds slightly brighter.

And so based on that effect, scientists think that there’s cooling that could be generated in this way actively, and also that there’s actually cooling going on right now as a result of the particulate effects of our emissions overall. And they think that we have this accidental cooling going on somewhere between 0.5 degrees and 1.1 degrees C, and this is something that they don’t understand very well, but is potentially both a promise and a risk when it comes to climate.

Lucas Perry: So there’s some amount of cooling that’s going on by accident, but the net anthropogenic heating is positive, even with the cooling. I think one facet of this that I learned from looking into your work is that the cooling effect is limited because the particles fall back down and so it goes away. And so there might be a period of acceleration of the heat. Is that right?

Kelly Wanser: Yes, I think I see what you’re getting at. So two things I’ll say. These white lines indicate the uncertainty. And so you can see the biggest line is on that cloud albedo effect, which is how much do these particles brighten clouds. The effects could be much bigger than what’s going into that net effect bar. And a lot of the uncertainty in that net effect bar is coming from this cloud albedo effect. Now the fact that they fall is an issue, but what happens today for the most part is we keep putting them up there. As long as you continuously put them up there, you continuously have this effect. If you take it away, which we’re doing a couple of big experiments in this year, then you lose that cooling effect right away. And so one of the things that we’re hoping to help with is getting more money for research to look at two big events that took that away this year.

One is the economic shutdowns associated with COVID where we had these clean skies all over the world because all this pollution went down. That’s a big global experiment in removing these particles that may be cooling. We are hoping to gain a better understanding from that experiment if we can get enough resources for people to look at it well.

Lucas Perry: So, the uncertainty with the degree to which current pollution is reflecting sunlight, is that because we have uncertainty over exactly how much pollution there is and how much sunlight that is exactly reflecting?

Kelly Wanser: It’s not that we don’t know how much pollution there is. I think we know that pretty well. It’s that this interaction between clouds and particles is one of the biggest uncertainties in the climate system. And there’s a natural form of it, when you see salt spray generating clouds, you’re in Big Sur looking at the waves and the clouds starting to form, that whole process is highly complex. Clouds are among the most complex creatures in our earth system. And they’re based on the behavior of these tiny particles that attract water to them and then create different sizes of droplets. So if the droplets are big, they reflect less total sunlight off less total surface area, and you have a dark cloud. And eventually, the droplets are big enough, they fall down as rain. If the droplets are small, there’s lots of surface area and the cloud becomes brighter.

The reason we have that uncertainty is that we have uncertainty around the whole process and some of the scientists that we work with in SilverLining, they really want to focus on that because understanding that process will tell you what you might be able to do with that artificially to create a brightening effect on purpose, as well as how much of an accidental effect we’ve got going on.

Lucas Perry: So you’re saying we’re removing sulfate from the emission of ships, and sulfate is helping to create these sea clouds that are reflecting sunlight?

Kelly Wanser: That’s right. And it happens over land as well. All the emissions that contain these sulfate and similar types of particles can have this property.

Lucas Perry: And so that, plus the reduction of pollution given COVID, there is this ongoing experiment, an accidental experiment to decrease the amount of reflective cloud?

Kelly Wanser: That’s right. And I should just note that the other thing that happened in 2020 is that the International Maritime Organization implemented regulations to drastically reduce emissions from ships. Those went into effect in January, an 85% reduction in these sulfate emissions. And so that’s the other experiment. Because sulfate and these emissions, we don’t like as pollutants for human health, for local ecosystems. They’re dirty. So we don’t like them for very good reasons, but they happen to have the side effect of producing a brightening effect on clouds, and that’s the piece we want to understand better.

When I talk to especially people in the Bay Area and people who think about systems, about this particular dynamic, most of the people that I’ve talked to were unfamiliar with this. And lots of people, even who think about climate a lot, are unfamiliar with the fact that we have this accidental cooling going on. And that as we reduce emissions, we have this uncertain near-term warming that may result from that, which I think is what you were getting at.

Lucas Perry: Yeah.

Kelly Wanser: So where I’m headed with this is that in the early ’90s, some British researchers proposed that you might be able to produce an optimized version of this effect using sea salt particles, like a salt mist from seawater, which would be cleaner and possibly actually produce a stronger effect because of the nature of the salt particles, and that you could target this at areas of unpolluted clouds and certain parts of the ocean where they’d be most susceptible, and you’d get this highly magnified reflective effect. And that in doing that, in these sort of few parts of the world where it would work best by brightening 10% to 20% of marine clouds or, say, the equivalent of 3% to 5% of the ocean’s surface, you might offset a doubling of CO2 or several degrees of warming. And so that’s one approach to this kind of rapid cooling, if you like, that scientists are thinking about that’s related to an observed effect.

This marine cloud brightening approach has the characteristic that you talked about, that it’s relatively temporary. So you have to do it continuously; the effect lasts a few days, and if you stop, it stops. And it’s also relatively localized. So it opens up theoretical possibilities that you might consider it as a way of cooling ocean water and mitigating climate impacts regionally or locally. In theory, what you might do is engage in this technique in the months before hurricane season. So your goal is to cool the ocean surface temperatures, which are a big part of what increases the energy and the rainfall potential of storms.

So, this idea is very theoretical. There’s been almost no research in it. Similarly, there’s a little bit of emerging research on whether you could cool waters that flow onto coral reefs. And you might have to do this in areas that are further out from the coral reefs because coral reefs tend to be in places where there are no clouds, but your goal is to try to get those big currents of water that are flowing on and cool them off. There was a little test, a very little test, a tiny little test, of the technology that you might use down in Australia as part of their big program, I think it’s an $800 million program, to look at all possibilities for saving the Great Barrier Reef.

Lucas Perry: Okay. One thing that I think is interesting for you to comment on briefly is I think many people, and myself included, don’t really have a good intuition about how thick the atmosphere is. You look up and it’s just big open space, maybe it goes on forever or something. So how thick is it? Put it into scale so it makes sense that seven billion humans could affect it in such large scale ways.

Kelly Wanser: We’re going to talk about it a little bit differently because the things I’m talking to you about are different layers of the atmosphere. So, the idea that I talked to you about here, marine cloud brightening, that’s really looking at the troposphere, which is the lowest layer of the atmosphere. When you look up, these are the clouds you see. It’s the cloud layer that’s closest to earth that’s going from sort of 500 feet up to a couple thousand feet. And so in that layer, you may have the possibility, especially over the ocean, of generating a mist from the surface where the convection, the motion of the air above you kind of pulls the particles up into the cloud layer. And so you can do this kind of activity potentially from the surface, like from ships. And it’s why the pollution particles are getting sucked up into the clouds too.

So that idea happens at that low layer, sub mile layer, visible eye layer of stuff. And for the most part, what’s being proposed in terms of volume of material, or when scientists are talking about brightening these clouds, they’re talking about brightening them 5% to 7%. So it’s not something that you would probably see as a human with your own naked eyes, and it’s over the ocean, and it’s something that would have a relatively modest effect on the light coming in to the ocean below, so probably, a relatively modest effect on the local ecology, except for the cooling that it’s creating.

So in that way, it’s potentially less invasive than people might think. Where the risks are in a technique like this are really around the fact that you’re creating these sort of concentrated areas of cooling, and those have the potential to move the circulation of the atmosphere and change weather patterns in ways that are hard to predict. And that’s probably the biggest thing that people are concerned about with this idea.

Now, if you’d like, I can talk about what people are proposing at the other side, the high end of the atmosphere.

Lucas Perry: Yeah. So I was about to ask you about stratospheric sunlight reflection.

Kelly Wanser: Yeah, because this is the one that most people who have heard about this have heard about, and it’s the most widely studied and talked about, partly because it’s based on events that have occurred in nature. Large volcanoes push material into the atmosphere and very large ones can push material all the way into the outer layer of the atmosphere, the stratosphere, which I think starts at about 30,000 or 40,000 feet and goes up for a few miles. So when particles reach the stratosphere, they get entrained and they can stay for a year or two.

And when Mount Pinatubo erupted in 1991, it was powerful and it pushed particles into the stratosphere that stayed there for almost two years. And it produced a measurable cooling effect in the entire planet of at least a half a degree C, actually, popped up closer to one degree C. So this cooling effect was sustained. It was very clear and measurable. It lasted until the particles fell down to earth, and it also produced a marked change in Arctic ice where Arctic ice mass just recovered drastically. With this cooling effect, when the particles reach the stratosphere, they immediately or very quickly get dispersed globally. So it’s a global effect, but it may have an outsize effect on the Arctic, which warms and cools faster than the rest of the planet.

This idea, and there are some other examples in the volcanic record, is what led scientists, including the Nobel Prize-winning scientist who identified the ozone hole, Paul Crutzen, to suggest that one approach to offsetting the warming that’s happening with climate change would be to introduce particles in the stratosphere that reflect sunlight directly, almost kind of bedazzling the stratosphere, and that by increasing this reflectivity by just 1%, you could offset a doubling of CO2 or several degrees of warming.

Now the particles that volcanoes release in this way are similar to the pollution particles on the ground. They are sulfates and their precursors. These particles also have the property where they can damage the ozone layer and they can also cause the stratosphere itself to heat up, and so that introduces risks that we don’t understand very well. So that’s what people want to study. There isn’t very much research on this yet, but one of the earliest models, produced at NCAR, compared the course of global surface temperatures in a business as usual scenario in the NCAR global climate model versus introducing particles into the stratosphere starting in 2020, gradually increasing them and maintaining temperatures through the end of the century. And what you can see in that model representation is that it’s theoretically possible to keep global surface temperatures close to those of today with this technique and that if we were to go down this business as usual path or have higher than expected feedbacks that took us to something similar, that’s not a very livable situation for most people on the planet.

Lucas Perry: All right. So you can intervene in the troposphere or the stratosphere, and so there’s a large degree of uncertainty about indirect effects and second and third order effects of these interventions, right? So they need to be studied because you’re impacting a complex system which may have complex implications at different levels of causality. But the high level strategy here is that these things may be necessary if we’re not able to reduce greenhouse gas emissions sufficiently. That’s why we may be interested in it for mitigating some degree of climate change that happens or is inevitable.

Kelly Wanser: That’s right. There’s a slight sort of twist on that, I think, where it’s really about, if we can, trying to look at these dangerous instabilities and intervene before they happen or before they take us across thresholds we don’t want to go. It is what you’re saying, but it’s a little bit of a different shade where we don’t wait to see how our mitigation effort is going necessarily. What we need to do is watch the earth system and see whether we’re reaching kind of a red zone where we’ve got to bring the heat down in the system.

Lucas Perry: What kinds of ongoing experiments are happening for studying these tropospheric and stratospheric interventions in climate change?

Kelly Wanser: Well, so the first thing we’ll say is that the research in this field has been very taboo for most of the past few decades. So, relative to the problem space, very little research has been done. And the global level of investment in research even today is probably in the neighborhood of $10 million a year, and that includes a $3 million a year program in China and a program at Harvard, which is really the biggest funded program in the world. So, relative to the problem space and the potential, we’re very under-invested. And the things I’m going to talk to you about are really promising, and there are prestigious institutions and collaborations, but they’re still at, what I would call, a very seed level of funding.

So the two most significant interdisciplinary programs in the field, one is aimed at the stratosphere, and that’s a program at Harvard called the Harvard Solar Geoengineering Program and includes social science and physical sciences, but a sort of flagship of what they’re trying to do is to do an experiment in the stratosphere. And in their case, they would try to use a balloon, which is specially crafted to navigate in the stratosphere, which is a hard problem, so that they can do releases of different materials to look at their properties in the stratosphere as they disperse and as they mix with the gases in the stratosphere.

What we hope, and I think the people in the field hope, is that we can do these small scale experimental studies that help you populate models that will better predict what happens if you did this at a bigger scale. So, the scale of this is tiny. It’s less than a minute of an aircraft’s emissions. It’s tiny, but they hope to be able to find out some important things about the properties of the chemical interactions and the way the particles disperse that would feed into models that would help us make predictions about what will happen when you do this and also, what materials might be more optimum to use.

So in this case, they’re going to look at sulfates, which we talked about, but also materials that might have better properties. Two of those are calcium carbonate, which is what’s used in chalk, and diamonds. What they hope to do is start down the path to finding out more about how you might optimize this in a way to minimize the risks.

The other effort is on the other side of the United States. This is an effort that’s based at the University of Washington, which is one of the top atmospheric science institutions in the country. It’s a partnership with Pacific Northwest National Labs, which is a Department of Energy lab, and PARC, which many in your community may know as the famous Xerox PARC, and which has since developed expertise in aerosols.

At the University of Washington, they are looking to do a set of experiments that would get at this cloud brightening question. And their scientific research and their experiments are classified as dual purpose, meaning that they are experiments about understanding this climate intervention technique, can we brighten clouds to actively cool climate, but they’re also about getting at the question of what is this cloud aerosol effect? What accidental effect are emissions having and how does this work in the climate system more broadly? So, what they’re proposing to do is build a specialized spray technology. So one of the characteristics of both efforts is that you need to create almost a nano mist, with particles of 80 to 100 nanometers, very consistently, at a massive scale. That hasn’t been done before. And so how do we generate this massive number of tiny droplets of materials, of salt particles from seawater or calcium carbonate particles?

And some retired physicists and engineers in Silicon Valley took on this problem about eight years ago. And they’ve been working on it for four days a week in their retirement for free for the sake of their grandchildren to invent this nozzle that I’m showing you, which is the first step of being able to generate the particles that you need to study here. They’re in the phase right now, where, because of COVID, they’ve had to set up a giant tent and do indoor spray tests, and they hope next year to go out and do what they call individual plume experiments. And then eventually, they would like to undertake what they call a limited area field experiment, which would actually be 10,000 square kilometers, which is the size of a grid cell on a climate model. And that would be the minimum scale at which you could actually potentially detect a brightening effect.

Lucas Perry: Maybe it makes sense on reflection, but I guess I’m kind of surprised that so much research is needed to figure out how to make a nozzle make droplets of aerosol.

Kelly Wanser: I think I was surprised too. It turns out, I think for certain materials, and again, because you’re really talking about a nano mist, like silicon chip manufacturing, like an asthma inhaler. And so here, we’re talking about three trillion particles a second from one nozzle and an apparatus that can generate 10 to the 16th particles and lift it up a few hundred meters.

It’s not nuclear fusion and it wouldn’t necessarily have taken eight years if they were properly funded and it was a focused program. I mean, these guys, the lead, Armand Neukermans, funded this with his own money and he was trading the biscottis from Belgium. He was trading biscottis for measurement instruments. And so it’s only recently in the past year or two where the program has gotten its first government funding, some from NOAA and some from the Department of Energy, very relatively small and more focused on the scientific modeling, and some money from private philanthropy, which they’re able to use for the technology development.

But again, going back to my comment earlier, this has been a very taboo area for scientists to even work in. There have been no formal sources of funding for it, so that’s made it go a lot slower. And the technology part is the hardest and most controversial. But overall, as a point, these things are very nascent. And the problem we were talking about at the beginning, predicting what the system is going to do, that in order to evaluate and assess these things properly, you need a better prediction system because you’re trying to say, okay, we’re going to perturb the system this way and this way and predict that the outcome will be better. It’s a tough challenge in terms of getting enough research in quickly. People have sort of propagated the idea that this is cheap and easy to do, and that it could run away from us very quickly. That has not been my experience.

Lucas Perry: Run away in what sense? Like everyone just starts doing it?

Kelly Wanser: Some billionaire could take a couple of billion dollars and do it, or some little country could do it.

Lucas Perry: Oh, as even an attack?

Kelly Wanser: Not necessarily an attack, but an ungoverned attempt to manage the climate system from the perspective of one individual or one small country, or what have you. That’s been a significant concern amongst social scientists and activists. And I guess my observation, working closely with it is, there are at least two types of technology that don’t exist yet that we need, so we have a technology hurdle. These things scale linearly and they pretty much stop when you stop, specifically referring to the aerosol generation technology. And for the stratosphere, we probably actually need a new and different kind of aircraft.

Lucas Perry: Can you define aerosol?

Kelly Wanser: I’ll caveat this by saying I’m not a scientist, so my definition may not be what a scientist would give you. But generally speaking, an aerosol is particles mixed with gases. It’s a manifestation in air of a mixed blend of particles and gases. I’ll often talk about particles because it’s a little bit clearer, and what we’re doing with these techniques for the most part is dispersing particles in a way that they mix with the atmosphere and…

Lucas Perry: Become an aerosol?

Kelly Wanser: Yeah. So, I would characterize the challenge we have right now as being that we actually have a very low level of information and no technology. And these things would take a number of years to develop.

Lucas Perry: Yeah. Well, it’s an interesting future to imagine the international politics of weather control, like in negotiating whether to stop the hurricanes or new powers we might get over the weather in the coming decades.

Kelly Wanser: Well, you bring up an interesting point because as I’ve gotten into this field, I’ve learned about what’s going on. And actually, there’s an astonishing amount of weather modification activity going on in the world and in the United States.

Lucas Perry: Intentional?

Kelly Wanser: Intentional, yeah.

Lucas Perry: I think I did hear that Russia did some cloud seeding, or whatever it’s called, to stop some important event getting rained on or something.

Kelly Wanser: Yeah. And that kind of thing, if you remember the Beijing Olympics where they seeded clouds to generate rain to clear the pollution, that kind of localized cloud seeding type of stuff has gone on for a long time. And of course, I’m in Colorado, there’s always been cloud seeding for snowmaking. So what’s happened though in the Western United States, there’s even an industry association for weather modification in the United States. What started out as, especially snowmaking and a little bit of attempt to affect a snow pack in the West, has grown. And so there are actually major weather modification efforts in seven or eight Western states in the United States. And they’re mostly aimed at hydrology, like snow pack and water levels.

Lucas Perry: Is the snow pack for a ski resort?

Kelly Wanser: I believe, and I’m not an expert on the history of this, but I believe that snowmaking started out from the ski resorts, but when I say snow pack, it’s really about the water table. It’s about affecting the snow levels that generate the water levels downstream. Because in the West, a lot of our water comes from snow.

Lucas Perry: And so you want to seed more snow to get more water, and the government pays for that?

Kelly Wanser: I can’t say for sure who pays. This is still an exploration for us, but there are fairly significant initiatives in many Western states. And like I said, they’re primarily aimed at the problem of drought and hydrology. That’s in the United States. And if you look at other parts of the world, like the United Arab Emirates, they have a $400 million rainmaking fund. Can we make rain in the desert?

Lucas Perry: All right.

Kelly Wanser: Flip side of the coin. In Indonesia in January, this was in the news, they were seeding clouds offshore to induce rainfall offshore to prevent flooding, and they did that at a pretty big scale. In China last year, they announced a program to increase rainfall in the Tibetan Plateau, in an area the size of Alaska. So we are starting to see, I think around the world, and this activity would likely grow, weather extremes and attempts to deal with them locally.

Lucas Perry: Yeah. That makes sense. What are they using to do this?

Kelly Wanser: The traditional material is silver iodide. That’s what’s proposed in the Chinese program and many of the rainmaking types of ideas. There are two things we’ll start to see, I think, as climate extremes grow and there’s pressure on politicians to act: growing interest in the potential for global mechanisms to reduce heat, and bottom-up efforts that just continue to expand that try to manage weather extremes in these kinds of ways.

Lucas Perry: So we have this tropospheric intervention by using aerosols to generate clouds that will reflect sunlight, and then we have the stratospheric intervention, which aims to release particles which do something similar, how do you view the research and the project of understanding these things as fitting in with and informing efforts to decrease greenhouse gas emissions? And then also, the project of removing them from the atmosphere, if that’s also something people are looking into?

Kelly Wanser: I think they’re all very related because at the end of the day, from the SilverLining perspective and a personal perspective, we see this as a portfolio problem. So, we have a complex system that we need to manage back into a healthy state, and we have kind of a portfolio of things that we need to apply at different times and different ways to do that. And in that way, it’s a bit like medicine, where the interventions I’m talking about address the immediate stressor.

But to restore the system to health, you have to address the underlying cause. Where we see ourselves as maybe helping bridge those things is that we are under-invested in climate research and climate prediction. In the United States, our entire budget for climate research is about two and a half billion dollars. If you put that in perspective, that’s like one-tenth of an aircraft carrier. It’s half of a football stadium. It’s paltry. This is the most complicated, computing-intensive problem on planet earth.

It takes massive supercomputing capacity and all the analytical techniques you can throw at it to try to reduce the uncertainty around what’s going to happen to these systems. What I believe happened, in the past few decades, is the problem was defined as a need to limit greenhouse gases. So if you think of an equation, where one side is the greenhouse gases going in, and the other side is what happens to the system on the other end, we’ve invested most of our energy in climate advocacy and climate policy about bringing down greenhouse gases, and we’re under-invested in really trying to understand and predict what happens on the other side.

When you look at these climate intervention techniques, like I’m talking about, it’s pretty critical to understand and be able to predict what happens on the other side. It turns out, if you’re looking at the whole portfolio, typically, if you want to blend in these sort of nature-based solutions that could bring down greenhouse gases, but they have complex interaction with the system. Right? Like building new forests, or putting nutrients on the ocean. That need to better understand the system and better predict the system, it turns out we really need that. It would behoove us to be able to understand and predict these tipping points better.

I think that then where the interventions come in is to try to say, “Well, what does reducing the heat stress get you in terms of safety? What time does it buy you for these other things to take effect?” That’s kind of where we see ourselves fitting in. We care a lot about mitigation, about let’s move away from this whole greenhouse gas emissions business. We care a lot about carbon removal, and accelerating efforts to do that. If somebody comes up with a way to do carbon removal at scale in the next 10 years, then we won’t need to do what we’re doing. But that doesn’t look like a high probability thing.

And so what we’ve chosen to do is to say there’s a part of the portfolio that is totally unserviced. There are no advocates. There’s almost no research. It’s taboo. It’s complicated. It requires innovation. That’s where we’re going to focus.

Lucas Perry: Yeah. That makes sense. Let’s talk a little bit about this taboo aspect. Maybe some number of listeners have some initial reaction. Like anytime human beings try to intervene in complex systems, there are always unintended consequences or things happen that we can’t predict or imagine, especially in natural systems. How would you speak to, or connect with someone who viewed this project of releasing aerosols into the atmosphere to create clouds or reflect sunlight as dangerous?

Kelly Wanser: I’ll start out by saying, I have a lot of sympathy with that. If we were 30 years ago, if you’re at a different place in this sort of risk equation, then this kind of thing really doesn’t make any sense at all. If we’re in 1970 or 1980, and someone’s saying, “Look, we just need to economically tune the incentives, so that we phase greenhouse gases out of the bulk of our economic system,” that is infinitely smarter and less risky.

I believe that a lot of the principle and structure of how we think about the climate problem is based on that, because what we did was really stupid. It would be the same thing as if the doctor said, “Well, you have stage one cancer. Stop smoking,” and you just kept on puffing away. So I am very sympathetic to this. But the primary concerns that we’re focused on now are our forward outcomes and the fact that we have this big safety problem.

So now, we’re in a situation where we have the greenhouse gas concentrations that we have. They are already there. We have warming and system impacts that are already there and some latency built in, which means we’re going to have more of those. So that means we have to look at the risk-risk trade-off, based on the situation that we’re in now. Where we have conducted the experiment. Where we pushed all these aerosols into the atmosphere that mostly trap heat and change the system radically.

We did that. That was one form of human intervention. That wasn’t a very smart one. What we have to look at now is we’re not saying that we know that this is a good idea, or that the benefits outweigh the risks. But we’re saying that we have very few alternatives today to act in ways that could help stabilize the system.

Lucas Perry: Yeah. That makes sense. Can you enumerate what the main points are of detractors? If someone is skeptical of this whole approach and thinks, “We just need to stick to removing greenhouse gases by natural intervention, by building forests, and we need to reduce CO2 emissions and greenhouse gas emissions drastically. To do anything else would be adding more danger to the equation.” What are the main points of someone who comes to this problem with such a perspective?

Kelly Wanser: You touched on two of them already. One, is that the problem is actually not moving that quickly and so we should be focused on things that are root cause, even if they take longer. Then the second one, being the fact that this introduces risks that are really hard to quantify. But I would say the primary objection, that’s raised by people like Al Gore, most of the advocates around climate, that have a problem with this is what they call the moral hazard. The idea that it gets put forward as a panacea and therefore, it slows down efforts to address the underlying problem.

This is sort of saying, even research in this stuff could have a societal negative effect, that it slows us down in doing what we’re really supposed to do. That has some interesting angles on it. One angle, which was talked about in a recent paper by Joseph Aldy at Harvard, and also was talked about with us, by Republicans we talked to about this early on, was that there’s also the thesis that it could have the opposite effect.

That the sort of drastic nature of these things could actually signal, to society and to skeptics, the seriousness of the problem. I did a bipartisan panel. The Republican on the panel, who was a moderate guy, pro-climate guy. He said, “When we, Republicans, hear these kinds of proposals coming from people who are serious about climate change, it makes you more credible than when you come to us and say, ‘The sky is falling,’ but none of these things are on the table.”

I thought that was interesting, early on. I thought it was interesting recently, that there’s at least an equal possibility that these things, as we look into them, could wake everyone up in the same way that more drastic medical treatments do and say, “Look, this is very serious. So on all fronts, we need to get very serious.” But I think, in general, this idea of moral hazard comes up pretty much as soon as the idea is there. And it can come up in the same way that Trump talks about planting trees.

Almost anything can be positioned in a way that attempts to use it as a panacea. I actually think that one of the moral hazards of the climate space has been the idea of winners and losers, because I think many more powerful people assume that this problem didn’t apply to them.

Lucas Perry: Like they’re not in a flooding zone. They can move to their bunker.

Kelly Wanser: The people who put forward this idea of winners and losers in climate did that because they were very concerned about the people who are impacted first. The mistake was in letting powerful people think that this wasn’t their problem. In this particular case, I’m optimistic that if we talk about these things candidly, and we say, “Look, these are serious, and they have serious risks. We wouldn’t use them, if we had a better choice.”

It’s not clear to me that that moral hazard idea really holds, but that is the biggest reservation, and it’s a reservation. That means that many people, very passionately, object to research. They don’t want us to look into any of this, because it sets off this societal problem.

Lucas Perry: Yeah. That makes a lot of sense. It seems like moral hazard should be called something more like, information hazard. The word moral seems a little bit confusing here, because it’s like if people have the information that this kind of intervention is possible, then bad things may happen. Moral means it has something to do with ethics, rather than the consequences of information. Yeah, so whatever. No one here has control over how this language was created.

Kelly Wanser: I agree with you. It’s an idea that comes from economics originally, about where the incentives are. But I think your point is well taken, because you’re exactly right. It’s saying “information is dangerous,” and that’s a fundamental principle. I find myself in meetings with advocates, and around this issue having to say, “Look, our position is that information helps with fair and just consideration of this. That information is good, not bad.”

But I think you hit on an extremely important point, that it’s a masked way of saying that information is too dangerous for people to handle. Our position is information about these things is what empowers people all over the world to think about them for themselves.

Lucas Perry: Yeah. There’s a degree to which moral hazards or information hazards lack trust or confidence in the recipients of that information, which may or may not be valid, depending on the issue and the information. Here, you argue that this information is necessary to be known and shared, and then people can make informed decisions.

Kelly Wanser: That’s our argument. And so for us, we want to keep going forward and saying, “Look, let’s generate information about this, so we can all consider it together.” I guess one thing I should say about that, because I was so shocked by it when I started working in climate, is that this idea of moral hazard isn’t new to this issue. It actually came up when they started looking at adaptation research in the IPCC and the climate community. Research on adaptation was considered to create a moral hazard, and so it didn’t move forward.

One of the reasons that we, as a society, have relatively low level of information about the things I was talking about, like infrastructure impacts, is because there was a strong objection to it, based on moral hazard. The same was true of carbon removal, which has only recently come into consideration in the IPCC. So this information is a dangerous idea because it will affect our motivation around this one part of the portfolio, that we think is the most important. I would argue that, that’s already slowed us down in really critical ways.

This is just another of those where we need to say, “Okay, we need to rethink this whole concept of moral hazard, because it hasn’t helped us.” So going back say 20 years ago, in the IPCC and the climate community, there’s this question of, how much should we invest in looking at adaptation? There was a strong objection to adaptation research, because it was felt it would disincentivize greenhouse gas reduction.

I think that’s been a pretty tragic mistake. Because if you had started researching adaptation 20 years ago, you’d have much more information about what a shit show this is going to be and more incentive to reduce greenhouse gases, not less, because this is not very adaptable. But the effect of that was a real dampening of any investment in adaptation research. Even adaptation research in the US federal system is relatively new.

Lucas Perry: Yeah. The fear there is that McAlpha Corp will come and be like, “It’s okay that we have all these emissions, because we’ll just make clouds later.” Right? I feel like corporations have done extremely effective disinformation campaigns on scientific issues, like smoking and other things. I assume that would have been what some of the fear would have been with regards to adaptation techniques. And here, we’re putting stratospheric and tropospheric intervention as adaptation techniques. Right?

Kelly Wanser: Well, in what I was talking about before, I wasn’t referring to this category, but to the more traditional adaptation techniques, like building dams and finding new different types of vegetation and things like that. I recognize that what I’m talking about in these climate interventions is fairly unusual, but even traditional adaptation techniques to protect people were suppressed. I appreciate your point. It’s been raised to me before that, “Oh, maybe oil companies will jump on this, as a panacea for what they are doing.”

So we talked to oil companies about it, talked to a couple of them. Their response was, “We wouldn’t go anywhere near this,” because it would be an admission that ties their fossil fuels to warming. They’re much more likely to invest in carbon removal techniques and things that are more closely associated with the actual emissions, than they are anything like this. Because they’re not conceding that they created the warming.

Lucas Perry: But if they’re creating the carbon, and now they’re like, “Okay, we’re going to help take out the carbon,” isn’t that admitting that they contributed to the problem?

Kelly Wanser: Yes. But they’re not conceding that they are the absolute and proven cause of all of this warming.

Lucas Perry: Oh. So they inject uncertainty, that people will say like, “There’s weather, and this is all just weather. Earth naturally fluctuates, and we’ll help take CO2 out of the atmosphere, but maybe it wasn’t really us.”

Kelly Wanser: And if you think about them as legal fiduciary entities, creating a direct tie between themselves and warming is different than not doing that. This is how it was described to me. There’s a fairly substantial difference between them looking at greenhouse gases, which are part of the landscape of what they do, and then the actual warming and cooling of the planet, which they’re not admitting to be directly responsible for.

So if you’re concerned about there being someone doing it, we can’t count on them to bail us out and cool the planet this way, because they’re really, really not.

Lucas Perry: Yeah. Then my last quip I was suffering over, while you were speaking, was if listeners or anyone else are sick and tired of the amount of disinformation that already exists, get ready for the conspiracy theories that are going to happen. Like chemtrail 5.0, when we have to start potentially using these mist generators to create clouds. There could be even like significant social disruption just by governments undertaking that kind of project.

Kelly Wanser: That’s where I think generating information and talking about this in a way that’s well grounded is helpful. That’s why you don’t hear me use the term, geoengineering. It’s not a particularly accurate term. It sort of amplifies triggers. Climate intervention is the more accurate term. It helps kind of ground the conversation in what we’re talking about. The same thing when we explain that these are based on processes that are observed in nature, and some of them are already happening. So this isn’t some big, new Sci-Fi. You know, we’re going to throw bombs at hurricanes or something. Just getting the conversation better grounded.

I’ve had chemtrails people at my talks. I had a guy set up a tripod in the back and record it. He was giving out these little buttons that had an airplane with little trail coming out, and a strike through it. It was fantastic. I had a conversation with him. When you talk about it in this way, it’s kind of hard to argue with. The reality is that there is no secret government program to do these things, and there are definitely no mind-altering chemicals involved in any proposals.

Lucas Perry: Well, that’s what you would be saying, if there were mind-altering chemicals.

Kelly Wanser: Fair point. We tend to try to orient the dialogue at the sort of 90% across the political and thought spectrum.

Lucas Perry: Yeah. It’s not a super serious consideration, but something to be maddened about in the future.

Kelly Wanser: One of the other things I’ll say, with respect to the climate denial side of the spectrum. Because we work in the policy sphere in the United States, and so we have conversations across the political spectrum. In a strange way, coming out at the problem from this angle, where we talk about heat stress and we talk about these interventions, helps create a new insertion point for people who are shut down in the traditional kind of dialogue around climate change.

And so we’ve had some pretty good success actually talking to people on the right side of the spectrum, or people who are approaching the climate problem from a way that’s not super well-grounded in the science. We kind of start by talking about heat stress and what’s happening and the symptoms that we’re seeing and these kinds of approaches to it, and then walking them backwards into when you absolutely positively have to take down greenhouse gases.

It has interestingly, and kind of unexpectedly, created maybe another pathway for dealing with at least parts of those populations and policy people.

Lucas Perry: All right. I’d be interested in pivoting here into the international implications of this, and then also talking about this risk in the context of other global catastrophic and existential risks. The question here now is what are the risks of international conflict around setting the global temperature via CO2 reduction and geo… Sorry. Geoengineering is the bad word. Climate intervention? There are some countries which may benefit from the earth being slightly warmer, hotter. You talked about how there were no winners or losers. But there are winners, if it only changes a little bit. Like if it gets a little bit warmer, then parts of Russia may be happier than they were otherwise.

The international community, as we gain more and more efficacy over the problem of climate change and our ability to mitigate it to whatever degree, will be impacting the weather and agriculture and livability of regions for countries all across the planet. So how do you view this international negotiation problem of mitigating climate change and setting the global temperature to something appropriate?

Kelly Wanser: I don’t tend to use the framing of, setting the global temperature. I mean, we’re really, really far from having like a fine grained management capability for this. We tend to think of it more in the context of preventing certain kinds of disastrous events in the climate system. I think in that framing, where you say, “Well, we can develop this technology,” or where we have knobs and dials for creating favorable conditions in some places and not others, that would be potentially a problem. But it doesn’t necessarily look like that’s how it works.

So it’s possible that some places, like parts of Russia, parts of Canada, might for a period of time, have more favorable climate conditions, but it’s not a static circumstance. The problem that you have is well, the Arctic opens up, Siberia gets warmer and for a couple of decades, that’s nicer. But that’s in the context of these abrupt change risks that we were talking about, where that situation is just a transitory state to some worse states.

And so the question you’re asking me is, “Okay. Well, maybe we hold a system to where Russia is happier in this sort of different state that they had.” I think that the massive challenge, which we don’t know if we can do, is just whether we can keep the system stable enough. The idea that you can stabilize the system in a way that’s different than now, but still prevents these like cascading outcomes. That’s a pretty, I would say, not the highest probability scenario.

But I think there’s certainly validity in your question, which is this just makes everybody super nervous. It is the case that this is not a collective action capability. One of its features is that it does not require everyone in the world to agree, and that is a very unstable concerning state for a lot of people. It is true that its outcomes cannot be fully predicted.

And so there’s a high degree of likelihood that everyone would be better off or that the vast majority of the world would be better off, but there will be outcomes in some places that might be different. It’s more likely, rather than people electively turning the knobs and making things more favorable for themselves, just that 3 to 5% of the world thinks they’re worse off, while we’ve tried to keep the thing more or less stable.

I think behind your question is even the dialogue around this is pretty unnerving and has the potential to promote instability and conflict. One of the things that we’ve seen in the past, that’s been super helpful, is for scientific cooperation. Lots of global cooperation in the evolution of the research and the science, so that everybody’s got information. Then we’re all dealing from an information base where people can be part of the discussion.

Because our strong hypothesis is like we’re kind of looking at the edge of a cliff, where we might not have so much disagreement that we need to do something, but we all need information about this stuff. We have done some work, in SilverLining, at looking at this and how the international community has handled things better or worse, when it comes to environmental threats like this. Our favorite model is the Montreal Protocol, which is both the scientific research and the structure that helped manage what many perceive to be an existential risk around the ozone layer.

That was a smaller, more focused case of, you have a part of the system that if it falls outside a certain parameter, lots and lots of people are going to die. We have some science we have to do to figure out where we can let that go and not let it go. The world has managed that very well over the past couple of decades. And we managed to walk back from the cliff, restore the ozone layer, and we’re still managing it now.

So we kind of see some similarities in this problem space of saying, “We’ve got to be really, really focused about what we can and can’t let the system do, and then get really strong science around what our options are.” The other thing I’ll say about the Montreal Protocol, in case people aren’t aware, is it is the only environmental forum, environmental treaty that is signed by all countries in the world. There are lots of aspects of that, that are a really good model to follow for something like this, I think.

Lucas Perry: Okay. So there’s the problem of runaway climate change, where the destruction of important ecosystems leads to tipping points, and that leads to tipping cascades. And without the reduction of CO2, we get worse and worse climate change, where like everyone is worse off. In that context, there is increased global instability, so there’s going to be more conflict with the migrations of people and the increase of disease.

It’s just going to be a stressor on all of human civilization. But if that doesn’t happen, then there is this later 21st century potential concern of more sophisticated weather manipulation, weather engineering technologies, making the question of constructing and setting the weather in certain ways as a more valid international geopolitical problem. But primarily the concern is obviously regular climate change with the stressors and conflict that are induced by that.

Kelly Wanser: One thing I’ll say, just to clarify a little bit about weather modification and the expansion of that activity. I think that, that’s already happening and likely to happen throughout the century, and the escalation of that and the expansion of that as a problem. Not necessarily people using it as a weaponized idea. But as weather modification activities get larger, they have what are called telegraphic effects. They affect other places.

So I might be trying to cool the Great Barrier Reef, but I might affect weather in Bali. If I’m China and I’m trying to do weather modification to areas the size of Alaska, it’s pretty sure that I’m going to be affecting other places. And if it’s big enough, I could even affect global circulation. So I do think that that aspect, that’s coming onto the radar now. That is an international decision-making problem, as you correctly say. Because that’s actually, in some ways, even almost a bit of a harder problem than the global one. Because we’ve got these sort of national efforts, where I might be engaged in my own jurisdiction, but I might be affecting people outside.

I should also say, just so everybody’s clear, weather modification for the purpose of weapons is banned by international treaty. A treaty called ENMOD. It arose out of US weather modification efforts in the Vietnam War, where we were trying to use weather as a weapon and subsequently agreed not to do that.

Lucas Perry: So, wrapping up here on the geopolitics and political conflict around climate change. Can you describe to what extent there is gridlock around the issue? I mean, different countries have different degrees of incentives. They have different policies and plans and philosophies. One might be more interested in focusing on industrializing to meet its own needs. And so it would deprioritize reducing CO2 emissions. So how do you view the game theory and the incentives and getting international coordination on climate change when, yeah, we’d all be better off if this didn’t happen, but not everyone is ready or willing to pay the same price?

Kelly Wanser: I mean, the main issue that we have now is that we have this externality, this externalized cost: people aren’t paying for the damage that they’re doing. And so a modest charge for that, for greenhouse gas emissions, my understanding is that a relatively modest price for carbon can set the incentives such that innovation moves faster and you reach the thresholds of economic viability for some of these non-carbon approaches faster. I come from Silicon Valley, so I think innovation is a big part of the equation.

Lucas Perry: You mean like solar and wind?

Kelly Wanser: Well there’s solar and wind, which are the traditional techniques. And then there are emerging things which could be hydrogen fuel cells. It could be fusion energy. It could be really important things in the category of waste management, agriculture. You know, it’s not just energy and cars, right? And we’re just not reaching the economic threshold where we’re driving innovation fast enough and we’re reaching profitability fast enough for these systems to be viable.

So with a little turn of the dial in terms of pricing that in, you get all of that to go faster. And I’m a believer that moving that innovation faster means that the price of these low carbon techniques will come down, and it will also accelerate offlining the greenhouse gas generating stuff. So I think that it’s not sensible that we’re not building in like a robust mechanism for having that price incentive, and that price incentive will behave differently in the developed countries versus the emerging markets and the developing countries. And it might need to be managed differently in terms of the cost that they face.

But it’s really important in the developing countries that we develop policies that incentivize them not to build out greenhouse gas generating infrastructure, however we do that. Because a lot of them are in inflection points, right? Where they can start building power plants and building out infrastructure.

So we also need to look closely at aligning policies and incentives for them so that they just go ahead and go green, and it might be a little bit more expensive, which means that we have to help with that. But that would be a really smart thing for us to do. What we can’t do is expect developing countries who mostly didn’t cause the problem to also eat the impact in terms of not having electricity and some of the benefits that we have of things like running water and basic needs. I don’t actually think this is rocket science. You know, I’m not a total expert, but I think the mechanisms that are needed are not super complicated. Getting the political support for them is what the problem is.

Lucas Perry: A core solution here being increased funding into innovation, into the efficacy and efficiency of renewable energy resources, which don’t pollute greenhouse gases.

Kelly Wanser: The R&D funding is key. In the U.S. we’ve actually been pretty good at that in a lot of parts of that spectrum, but you also have to have the mechanisms on the market side. Right now you have effectively fossil fuels being subsidized in terms of not being charged for the problem they’re creating. So basically we’ve got to embed the cost in the fossil fuel side of the damage that they’re doing, and that makes the market mechanisms work better for these emerging things. And the emerging things are going to start out being more expensive until they scale.

So we have this problem right now where we have some emerging things, they’re expensive. How do we get them to market? Fossil fuels are still cheaper. That’s the problem where it will eventually sort itself out, but we need it to sort itself out quickly. So we’ve got to try to get in there and fix that.

Lucas Perry: So, let’s talk about climate change in the context of existential risks and global catastrophic risks. The way that I use this language is to say that global catastrophic risks are ones which would kill some large fraction of human civilization, but wouldn’t lead to extinction. And existential risks lead to all humans dying or all earth-originating intelligent life dying. The relevant distinction here for me is that the existential risks cancel the entire future. So there could be billions upon billions or trillions of experiential life years in the future if we don’t go extinct. And so that is this value being added into the equation of trying to understand which risks are the ones to pay attention to.

So you can react to this framing if you’d like, I’d be interested in what you think about it. And also just how you see the relative importance of climate change in the context of global catastrophic and existential risks and how you see its interdependence with other issues. So I’m mainly talking about climate change as being in a context of something like other pandemics, other than COVID-19, which may kill large fractions of the population and synthetic biorisk, which a sufficiently dangerous engineered pandemic could possibly be existential or an accidental nuclear war or misaligned artificial superintelligence that could lead to the human species extinction. So how do you think about climate change in the context of all of these very large risks?

Kelly Wanser: Well, I appreciate the question. Many of the risks that you described have the characteristics that they are hard to quantify, and they’re hard to predict. And some of them are sort of like big black swan events, like even more deadly pandemics or pandemics polarized, artificially engineered things. So climate change I think shares that characteristic that it’s hard to predict. I think that climate change, when you dig into it, you can see that there are analytical deficiencies that make it very likely that we’re underestimating the risk.

In the spectrum between sort of catastrophic and existential we have not done the work to dig into the areas in which we are not currently adequately representing the risk. So I would say that there’s a definite possibility that it’s existential and that that possibility is currently underanalyzed and possibly underestimated. I think there are two ways that it’s existential. So I’ll say I’m not an expert in survivability in outlier conditions, but if we just look at two phenomena that are part of non-zero probability projections for climate, one is this example that I showed you where warming goes beyond five or six degrees C. The jury’s pretty far out on what that means for humans and what it means about all the conditions of the land and the sea and everything else.

So the question is like, how high does temperature go? And what does that mean in terms of the population livability curve? Part of what’s involved in that “how high does temperature go” question is the biological species and their relationship to the physics and chemistry of the planet. This concern that I heard from Pete Worden at NASA Ames, that I had never heard before talking to him, is that at some point in the collapse of biological life, particularly in the ocean, you have a change in the chemical interactions that produce the atmosphere that we’re familiar with.

So for example, the biological life at the surface of the ocean, the phytoplankton and other organisms, they generate a lot of the oxygen that we breathe in the air, same with the forests. And so the question is whether you get collapse in the biological systems that generate breathable air. Now, if you watch sci-fi, you could say, “Well, we can engineer that.” And that starts to look more like engineering ourselves to live on Mars, which I’m happy to talk about why I don’t think that’s the solution. But so I think that it’s certainly reasonable for people to say, “Well, could that really happen?” There is some non-zero probability that that could happen that we don’t understand very well and we’ve been reluctant to explore.

And so I think that my challenge back to people about this being an existential risk is that the possibility that it’s an existential risk in the nearer term than you think may be higher than we think. And the gaps in our analysis of that are concerning.

Lucas Perry: Yeah. I mean, the question is like, do you know everything you need to know about all of the complex systems on planet Earth that help maintain the small bandwidth of conditions for which human beings can exist? And the answer is, no I don’t. And then the question is, how likely it is that climate change will perturb those systems in such a way that it would lead to an existential catastrophe? Well, it’s non-zero, but besides that, I don’t know.

Kelly Wanser: And one thing to look at that I think everyone should look at who’s interested in this is the observations of what’s happening in the system now. What’s happening in the system now are collapses of some biological life changes and some of the systems that are indicative that this risk might be higher than we think. And so if you look at things like, I think there was research coming out that estimates that we may have already lost like 40% of the phytoplankton on the surface of the ocean. So much so that the documentary filmmaker who made Chasing Coral was thinking about making a documentary about this.

Lucas Perry: About phytoplankton?

Kelly Wanser: Yeah. And phytoplankton, I think of it as the API layer between the ocean and the atmosphere, it’s the translation layer. It’s really important. And then I go to my friends who are climate modelers, and they’re like, “Yeah, phytoplankton isn’t well-represented in the climate models, there are over 500 species of phytoplankton and we have three of them in the climate models.” And so you look at that and you say, “Okay, well, there’s a risk that we don’t understand very well.” So, from my perspective, we have a non-zero risk in this category. I’d be happy if I was overstating it, but it may not be.

Lucas Perry: Okay. So that’s all new information and interesting. In the context of the existential risk community that I’m most familiar with, climate change, the way in which it’s said to potentially lead to existential risks is by destabilizing global human systems that would lead to the actualization of other things that are existential risks. Like if you care about nuclear war or synthetic bio or pandemics or getting AI right, that’s all a lot harder to do and control in the context of a much hotter earth. And so the other question I had for you, speaking of hotter earths, has the earth ever been five C hotter than it is now while mammals have been on it?

Kelly Wanser: So it hasn’t been that hot while humans have been on it, but I’m not expert enough to know, as far as the mammal picture, I’m going to guess, probably yes. So I’ll touch on the first points that you were making too about the societal cascade, but on this question, the problem with the warming isn’t just whether or not the earth has ever been this warm, but it’s the pace of warming. If you look at over the past couple thousand years, how far and how fast we’re pushing the system, normally when the earth goes through its fluctuations of temperature, and you can see in the past 2,000 years, it’s been small fluctuations, it’s been bigger. But it’s happened over very long periods of time, like hundreds of thousands of years, which means that all of the little organisms and all the big structures are adapting in this very slow way.

And in this situation where we’re pushing it this fast, the natural adaptation is very, very low. You know, you have species of fish and stuff that can move to different places, but it’s happening so fast in Earth system terms that there’s no adaptation happening. But to your other point about climate change setting off existential threats to society in other ways, I think that’s very true. And climate change is likely to heighten the risk of like nuclear conflict on a couple of different vectors. And it’s also likely to heighten the risk that we throw biological solutions out there whose results we can’t predict. So I think one of the facets of climate change that might be a little bit different than runaway AI is just that it applies stress across every human and every natural system.

Lucas Perry: So this last point here then on climate change contextualized in this field of understanding around global catastrophic and existential risks. FLI views itself as being a part of the effective altruism community, and many of the listeners are effective altruists, and 80,000 Hours has come up with this simple framework for thinking about what kinds of projects and endeavors you should take on. And so the framework is just thinking about tractability, scope and neglectedness.

So tractability is just how much you can do to actually affect the thing. Scope is how big of a problem is it, how many people does it affect, and neglectedness is how many people are working on it? So you want to work on things that are highly tractable or tractable that have a large scope and that are neglected. So I think that there’s a view or the sense of climate change is that … I mean, from our conversation, it seems very tractable.

If we can get human civilization to coordinate on this, it’s something that we can do a lot about. I guess it’s another question on how tractable it is to actually get countries and corporations to coordinate on this. But the scope is global and would at the very least affect our generation and the next few generations, but it seems to not be neglected relative to other risks. One could say that it’s neglected relative to how much attention it deserves. But so I’m curious to know how you would react to this tractability, scope, and neglectedness framework being applied to climate change and in the context of other global catastrophic and existential risks.

Kelly Wanser: Firstly, I’m a big fan of the framework. I was familiar with it before, and it’s not dissimilar to the approach that we took in founding SilverLining. Where I think this issue might fit into that framework depends on whether you put climate change all in one bucket and treat it as not neglected, or you say that in the portfolio of responses to climate change, in which we have a significant gap in terms of ability to mitigate heat stress while we work on other parts of the portfolio, that part is entirely neglected.

So I think for us it’s about having to dissect the climate change problem, and we have this collective action problem, which is a hard problem to solve, to move industrial and other systems away from greenhouse gas emissions. And we have the system instability problem, which requires that we somehow alleviate the heat stress before the system breaks down too far.

I would say in that context, if your community looks at climate change as a relatively slowly unfolding problem, which has a lot of attention, then it wouldn’t fit. If you look at climate change as having some meaningful risk of catastrophic to existential unfolding in the next 30 to 50 years and not having response measures to try to stabilize the system, then it fits really nicely. It’s so under serviced that I represent the only NGO in the world that advocates for research in this area. So it depends on how your community thinks about it, but we look at those as quite different problems in a way.

Lucas Perry: So the problem of for example adaptation research, which has historically been stigmatized, we can apply this framework to this and see that you might get a high return on impact if you focus on supporting and doing research in climate intervention technologies and adaptation technologies?

Kelly Wanser: That’s right. What’s interesting to me and the people that I work with on this problem is that these climate intervention technologies have the potential to have very high leverage on the problem in the short term. And so from a philanthropic perspective or an octopus perspective, oftentimes I’m engaged with people who are looking for leverage: where can I really make a difference in terms of supporting research or policy? And I’m in this because literally I came from tech into climate, looking for what is the most under-serviced, highest leverage part of the space. And I landed here. And so I think that if your criteria are that it’s under serviced and potentially high leverage, then this fits pretty well. It’s not the same as addressing the longer term problem of greenhouse gases, but it has very high leverage on the stability risk in the next 50 years or so.

Lucas Perry: So if that’s compelling to some number of listeners, what is your recommendation for action and participation for such persons? If I’m taking a portfolio approach to my impact or altruism, and I want to put some of it into this, how do you recommend I do that?

Kelly Wanser: So it’s interesting timing because we’re just a few weeks from launching something called the Safe Climate Research Initiative, where we’re funding a portfolio of research programs. So what we do at SilverLining is try to help drive philanthropic funding for these high leverage nascent research efforts that are going on and then try to help drive government funding and effective policy so that we can get resources moving in the big climate research system. So for people looking for that, when we start talking about the Safe Climate Research Initiative, we’re agnostic as to whether you want to give money to SilverLining for the fund or donate to these programs directly.

So we interface with most of the mature-ish programs in the United States and quite a few around the world, mature and emerging. And we can direct people based on their interests, whether as alumni or by part of the world. There are opportunities for funding really high caliber things in Latin America, the UK, India.

So we’re happy to say, “You know, you can donate to our fund and we’re just moving through, getting seed funding to these programs as we can, or we can help connect you with programs based on your interests in the different parts of the world that you’re in, technology versus science versus impacts.” So that’s one way. For some philanthropists who are aware of the leverage on government R&D and government policy, SilverLining’s been very effective in starting to kind of turn the dial on government funding. And we have some pretty big aspirations, not only to get funding directly in assessing these interventions, but also in expanding our capacity to do climate prediction quickly. So that’s another way where you can fund advocacy and we would appreciate it.

Lucas Perry: Accepting donations?

Kelly Wanser: We’re definitely accepting donations, happy to connect people or be a conduit for funding research directly.

Lucas Perry: All right. So let’s end on a fun one here then. So we were talking a little bit before we started about your visit planet earth picture behind you, and that you use that as a message against the colonization of Mars. So why don’t you think Mars is a solution to all of the human problems on earth?

Kelly Wanser: Well, let’s just start by saying, I grew up on Star Trek and so the colonization of Mars and the rest of the universe is appealing to me. But as far as the solutions to climate change or an escape from it, just to level set, because I’ve had serious conversations with people. I lived for 12 years in Silicon Valley, spent a lot of time with the Long Now community. And people have a passion for this vision of living on another planet and the idea that we might be able to move off of this one if it becomes dire. The reality is, and it goes back to education I got from very serious scientists. The problem with living on other planets, it’s not an engineering problem or a physics problem. It’s a biology problem.

That our bodies are fine tuned to the conditions of Earth, radiation, gravity, the air, the colors. And so we degrade pretty quickly when we go off planet. That’s a harder problem to solve than building a spaceship or a bubble. That’s not a problem that gets solved right away. And we can see it from the conditions of the astronauts that come back after a few years in orbit. And so the kinds of problems that we would need to solve to actually have quality of life living conditions on Mars or anywhere else are going to take a while. Longer than what we think are the 30 to 50 year instability problem that we have here on earth.

We are so finely tuned to the conditions of earth, like the Goldilocks sort of zone that we’re in, that it’s a really, really hard thing to replicate anywhere else. And so it’s really not very rational. It’s actually a much easier problem to solve to try to repair earth than it is to try to create the conditions of earth somewhere else.

Lucas Perry: Yeah. So I mean, these things might not be mutually exclusive, right? It really seems to be a problem of resource allocation. Like it’s not one or the other, it’s like, how much are we going to put into each-

Kelly Wanser: It’s less of a problem of resource allocation than time horizon. So I think that the kinds of scientific and technical problems that you have to solve to meaningfully have people live on Mars, that’s beyond a 50 year time horizon. And our concern is that the climate instability problem is inside a 50 year time horizon. So that’s the main issue is that over the long haul, there are advanced technologies and probably bio-engineering things we need to do and maybe engineering of planets that we need to do for that to work. And so over the next 100 or 200 years, that would be really cool, and I’ll be in favor of it also. But this is the spaceship that we have. All of the people are on it, and failure is not an option.

Lucas Perry: All right. That’s an excellent place to end on. And I think both you and I share the science fiction geek gene about getting to Mars, but we’ll have to potentially delay that until we figure out climate change, but hopefully we get to that. So, yeah. Thanks so much for coming on. This has been really interesting. I feel like I learned a lot of new things. There’s a lot here that probably most people who are even fairly familiar with climate science aren’t familiar with. So I just want to offer you a final little space here if you have any final remarks or anything you’d like to say that you feel like is unresolved or unsaid, just any last words for listeners?

Kelly Wanser: Well, for those people who’ve made it through the entire podcast, thanks for listening and being so engaged and interested in the topic. I think that apart from the things we talked about previously, it’s heartening and important that people from other fields are paying attention to the climate problem and becoming engaged, particularly people from the technology sector and certain parts of industry that bring a way of thinking about problems that’s useful. I think there are probably lots of people in your community who may be turning their attention to this, or turning their attention to this more fully in a new way, and may have perspectives and ideas and resources that are useful to bring to it.

The field has been quite academic and more academic than many other fields of endeavor. And so I think what people in Silicon Valley think about in terms of how you might transform a sector quickly, or a problem quickly, presents an opportunity. And so I hope that people are inspired to become involved and become involved in the parts of the space that are maybe more controversial or easier for people like us to think about.

Lucas Perry: All right. And so if people want to follow or find you or check out SilverLining, where are the best places to get more information or see what you guys are up to?

Kelly Wanser: So I’m on LinkedIn and Twitter as @kellywanser and our website is, no S at the end. And the majority of the information about what we do is there. And feel free to reach out to me on LinkedIn or on Twitter or contact Lucas who can contact me.

Lucas Perry: Yeah, all right. Wonderful. Thanks so much, Kelly.

Kelly Wanser: All right. Thanks very much, Lucas. I appreciate it. Thanks for taking so much time.

Andrew Critch on AI Research Considerations for Human Existential Safety

 Topics discussed in this episode include:

  • The mainstream computer science view of AI existential risk
  • Distinguishing AI safety from AI existential safety 
  • The need for more precise terminology in the field of AI existential safety and alignment
  • The concept of prepotent AI systems and the problem of delegation 
  • Which alignment problems get solved by commercial incentives and which don’t
  • The threat of diffusion of responsibility on AI existential safety considerations not covered by commercial incentives
  • Prepotent AI risk types that lead to unsurvivability for humanity 



0:00 Intro
2:53 Why Andrew wrote ARCHES and what it’s about
6:46 The perspective of the mainstream CS community on AI existential risk
13:03 ARCHES in relation to AI existential risk literature
16:05 The distinction between safety and existential safety
24:27 Existential risk is most likely to obtain through externalities
29:03 The relationship between existential safety and safety for current systems
33:17 Research areas that may not be solved by natural commercial incentives
51:40 What’s an AI system and an AI technology?
53:42 Prepotent AI
59:41 Misaligned prepotent AI technology
01:05:13 Human frailty
01:07:37 The importance of delegation
01:14:11 Single-single, single-multi, multi-single, and multi-multi
01:15:26 Control, instruction, and comprehension
01:20:40 The multiplicity thesis
01:22:16 Risk types from prepotent AI that lead to human unsurvivability
01:34:06 Flow-through effects
01:41:00 Multi-stakeholder objectives
01:49:08 Final words from Andrew



AI Research Considerations for Human Existential Safety


We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today we have a conversation with Andrew Critch where we explore a recent paper of his titled AI Research Considerations for Human Existential Safety, which he co-authored with David Krueger. In this episode, we discuss how mainstream computer science views AI existential risk, we develop new terminology for this space and discuss the need for more precise concepts in the field of AI existential safety, we get into which alignment problems and areas of AI existential safety Andrew expects to be naturally solved by industry and which won’t, and we explore the risk types of a new concept Andrew introduces, called prepotent AI, that lead to unsurvivability for humanity. 

I learned a lot from Andrew in this episode and found this conversation to be quite perspective shifting. I think Andrew offers an interesting and useful critique of existing discourse and thought, as well as new ideas. I came away from this conversation especially valuing thought around the issue of which alignment and existential safety issues will and will not get solved naturally by industry and commercial incentives. The answer to this helps to identify crucial areas we should be mindful to figure out how to address outside the normal incentive structures of society, and that to me seems crucial for mitigating AI existential risk. 

If you don’t already subscribe or follow this podcast, you can follow us on your preferred podcasting platform, like Apple Podcasts or Spotify, by searching for The Future of Life. 

Andrew Critch is currently a full-time research scientist in the Electrical Engineering and Computer Sciences department at UC Berkeley, at Stuart Russell’s Center for Human Compatible AI. He earned his PhD in mathematics at UC Berkeley studying applications of algebraic geometry to machine learning models. During that time, he cofounded the Center for Applied Rationality and Summer Program on Applied Rationality and Cognition. Andrew has been offered university faculty positions in mathematics and mathematical biosciences, worked as an algorithmic stock trader at Jane Street Capital’s New York City office, and as a research fellow at the Machine Intelligence Research Institute. His current research interests include logical uncertainty, open source game theory, and avoiding arms race dynamics between nations and companies in AI development.

And with that, let’s get into our conversation with Andrew Critch.

We’re here today to discuss your paper, AI Research Considerations for Human Existential Safety. You can shorten that to ARCHES. You wrote this with David Krueger and it came out at the end of May. I’m curious and interested to know what your motivation is for writing ARCHES and what it’s all about.

Andrew Critch: Cool. Thanks, Lucas. It’s great to be here. For me, it’s pretty simple: I care about existential safety. I want humans to be safe as a species. I don’t want human extinction to ever happen. And so I decided to write a big, long document about that with David. And of course, why now and why these particular problems, I can go more into that.

You might wonder if existential risk from AI is possible, how have we done so much AI research with so little technical level thought about how that works and how to prevent it? And to me, it seems like the culture of computer science and actually a lot of STEM has been to always talk about the benefits of science. Except in certain disciplines that are well accustomed to talking about risks like medicine, a lot of science just doesn’t talk about what could go wrong or how it could be misused.

It hasn’t been until very recently that computer science has really started making an effort as a culture to talk about how things could go wrong in general. Forget x-risk, just anything going wrong. And I’m just going to read out loud this quote to sort of set the context culturally for where we are with computer science right now and how far culturally we are from being able to really address existential risk holistically.

This is a quote from Hecht at the ACM Future of Computing Academy. It came out in 2018, just two years ago. “The current status quo in the computing community is to frame our research by extolling its anticipated benefits to society. In other words, rose colored glasses are the normal lenses through which we tend to view our work. However, one glance at the news these days reveals that focusing exclusively on the positive impacts of a new computing technology involves considering only one side of a very important story. We believe that this gap represents a serious and embarrassing intellectual lapse. The scale of this lapse is truly tremendous. It is analogous to the medical community, only writing about the benefits of a given treatment, completely ignoring the side effects, no matter how serious they are.

“What’s more, the public has definitely caught on to our community-wide blind spot and is understandably suspicious of it. After several months of discussion, an idea for acting on this imperative began to emerge. We can leverage the gate keeping functionality of the peer review process. At a high level, our recommended change to the peer review process in computing is straightforward. Peer reviewers should require that papers and proposals rigorously consider all reasonable, broader impacts, both positive and negative.” That’s Hecht, 2018.

With this energy, this initiative from the ACM and other similar mentalities around the world, we now have NeurIPS Conference submissions required to submit broader impact statements that include negative impacts as well as positive.

Suddenly in 2020, contrasted with 2015, it’s becoming okay and normal to talk about how your research could be misused and what could go wrong with it. And we’re just barely able to admit things like, “This algorithm could result in racial bias in judiciary hearings,” or something like that. Which is a terrible, terrible … The fact that we’ve taken this long to admit that and talk about it is very bad. And that’s something as present and obvious as racism. Whereas, existential risk has never been … Extinction has never been present or else we wouldn’t be having this conversation. And so those conversations are even harder to have when it’s not normal to talk about bad outcomes at all. Let alone obvious, in your face, bad outcomes.

Lucas Perry: Yeah. On this podcast, we’re basically only talking to people who are in the AI alignment community and who take x-risk very seriously, who are worried about existential risk from advanced AI systems.

And so we lack a lot of this perspective … Or we don’t have many conversations with people who take the cultural, and I guess, academic perspective of the mainstream machine learning and computer science community. Which is far larger and has much more inertia and mass than the AI alignment community.

I’m curious if you can just paint a little bit more of a picture here of what the state of computer science thinking or non-thinking is on AI existential risk? You mentioned that recently people are starting to at least encourage and it be required as part of a process to have negative impact statements or write about the risks of a technology one is developing. But that’s still not talking about global catastrophic risk. It’s still not talking about alignment explicitly. It’s not talking about existential risk. It seems like a step in the right direction, but some ways to go. What kind of perspective can you give us on all this?

Andrew Critch: I think of sort of EA-adjacent AI researchers as kind of a community, to the extent that EA is a community. And it’s not exactly the same set of people as AI researchers who think about existential risk or AI researchers who think about alignment, which is yet another set of people. They overlap heavily, but they’re not the same set.

And I have noticed a tendency that I’m trying to combat here by raising this awareness, not only to computer scientists, but to EA adjacent AI folks. Which is that if you feel sort of impatient, that computer science and AI are not acknowledging existential risks from tech, things are underway and there’s ways of making things better and making things worse.

One way to make things worse is to get irate with people for caring about risks that you think aren’t big enough. Okay. If you think inequitable loan distribution is not as bad as human extinction, many people might agree with you, but if you’re irate about that and saying, “Why are we talking about that when we should be talking about extinction?” you’re slowing down the process of computer science transitioning into a more negative-outcome-aware field by refusing to cooperate with other people who are trying to raise awareness about negative outcomes.

I think there’s a push to be more aware of negative outcomes and all the negative outcome people need to sort of work together politely, but swiftly, raising the bar for our discourse about negative outcomes. And I think existential risks should be part of that, but I don’t think it should be adversarially positioned relative to other negative outcomes. I think we just need to raise the bar for all of these at once.

And all of these issues have the same enemy, which is those rose colored glasses that wrote all of our grant applications for the past 50 years. Every time you’re asking for public funds, you say how this is going to benefit society. And you better not mention how it might actually make society worse or else you won’t get your grant. Right?

Well, times are changing. You’re allowed to mention and signal awareness of how your research could make things worse. And that’s starting to be seen as a good trait rather than a reason not to give you funding. And if we all work together to combat that rose colored glass problem, it’s going to make everything easier to talk about, including existential risk.

Lucas Perry: All right. So if one goes to NeurIPS and talks to any random person about existential risk or AI alignment or catastrophic risk from AI, what is the average reaction or assumed knowledge or people who think it’s complete bullshit versus people who are neutral about it to people who are serious about it?

Andrew Critch: Definitely my impression right now, and this is a very rough impression. There’s a few different kinds of reactions that are all like sort of double digits percentage. I don’t know which percentage they are, but one is like, how are you worried about existential risks when robots can’t tie knots yet? Or they can’t fold laundry. It’s like a very difficult research problem for an academic AI lab to make a robot fold laundry. So it’s like, come on. We’re so far away from that.

Another reaction is, “Yeah, that’s true. You know, I mean things are really taking off. They’re certainly progressing faster than I expected. Things are kind of crazy.” It’s the things that are kind of crazy reaction and there’s just kind of an open-mindedness. Man, anything could happen. We could go extinct in 50 years, we can go extinct. I don’t know what’s going to happen. Things are crazy.

And then there’s another reaction. Unfortunately, this one’s really weird. I’ve gotten this one, which is, “Well, of course humanity is going to go extinct from the advent of AI technology. I mean, of course. Just think about it from an evolutionary perspective. There’s no way we would not go extinct given that we’re making things smarter than us. So of course it’s going to happen. There’s nothing we can do about it. That’s just our job as a field: to make things that are smarter than humans, that will eventually replace us, and they’ll be better than us. And that’s just how stuff is.”

Lucas Perry: Some people think that’s an aligned outcome.

Andrew Critch: I don’t know. There’s a lot of debate to be had about that. But it’s a kind of defeatist attitude of, “There’s nothing you can do.” It’s much, much rarer, it seems like single digits, that someone is like, “Yeah, we’re going to do something about it.” That one is the rarest; acknowledging and orienting towards solving it is still pretty rare. But there’s plenty these days of acknowledgement that it could be real and acknowledgement that it’s confusing and hard. The challenge is somehow way more acknowledged than any particular approach to it.

Lucas Perry: Okay. I guess that’s surprising to hear then that you feel like it’s more taken seriously than not.

Andrew Critch: It depends on what you mean by taken seriously. And again, I’m filtering for a person who’s being polite and talking to me about it, right? People are polite enough to fall into the, “Stuff is crazy. Who knows what could happen,” attitude.

And is that taking it seriously? Well, no, but it’s not adversarial to people who are taking it seriously, which I think is really good. And then there’s the, “Clearly we’re going to be destroyed by machines that replace us. That’s just nature.” Those voices, I’m kind of like, well, that’s kind of good also. It’s good to admit that there’s a real risk here. It’s kind of bad to give up on it, in my opinion. But altogether, if you add up the, “Woah, stuff’s crazy and we’re not really oriented to it,” plus the, “Definitely humanity is going to be destroyed/replaced,” it’s a solid chunk of people. I don’t know. I’m going to say at least 30%, if you also then include the people who want to try and do something about it. Which is just amazing compared to, say, six years ago, where the answer would have rounded to zero percent.

Lucas Perry: Then just to sum up here, this paper then is an exercise in trying to lay out a research agenda for existential safety from AI systems, which is unique in your view? I think you mentioned that there are four that have already existed to this day.

Andrew Critch: Yeah. There’s Aligning Superintelligence with Human Interests, by Soares and Fallenstein, that’s MIRI, basically. Then there’s Research Priorities for Robust and Beneficial Artificial Intelligence, by Stuart Russell, Max Tegmark, and Daniel Dewey. Then there’s Concrete Problems in AI Safety, by Dario Amodei and others. And then Alignment for Advanced Machine Learning Systems, by Jessica Taylor and others. And Scalable Agent Alignment via Reward Modeling, by Jan Leike, and also David Krueger is on that one.

Lucas Perry: How do you see your paper as fitting in with all of the literature that already exists on the problem of AI alignment and AI existential risk?

Andrew Critch: Right. So it’s interesting you say that there exists literature on AI existential risk. I would say Superintelligence, by Nick Bostrom, is literature on AI existential risk, but it is not a research agenda.

Lucas Perry: Yeah.

Andrew Critch: I would say Aligning Superintelligence with Human Interests, by Soares and Fallenstein. It’s a research agenda, but it’s not really about existential risk. It sort of mentions that stakes are really high, but it’s not constantly staying in contact with the concept of extinction throughout.

If you take a random excerpt of any page from it and pretend that it’s about the Netflix challenge or building really good personal assistants or domestic robots, you can succeed. That’s not a critique. That’s just a good property of integrating with research trends. But it’s not about the concept of existential risk. Same thing with Concrete Problems in AI Safety.

In fact, it’s a fun exercise to do. Take that paper. Pretend you think existential risk is ridiculous and read Concrete Problems in AI Safety. It reads perfectly as you don’t need to think about that crazy stuff, let’s talk about tipping over vases or whatever. And that’s a sign that it’s an approach to safety that it’s going to be agreeable to people, whether they care about x-risk or not. Whereas, this document is not going to go down easy for someone who’s not willing to think about existential risk and it’s trying to stay constantly in contact with the concept.

Lucas Perry: All right. And so you avoid making the case for AI x-risk as valid and as a priority, just for the sake of the goal of the document succeeding?

Andrew Critch: Yeah. I want readers to spend time inhabiting the hypothetical that existential risk is real and can come from AI and can be addressed through research. They’re already taking a big step by constantly thinking about existential risk for these 100 pages here. I think it’s possible to take that step without being convinced of how likely the existential risk is. And I’m hoping that I’m not alienating anybody if you think it’s 1%, but it’s worth thinking about. That’s good. If you think it’s 30% chance of existential risk from AI, then it’s worth thinking about. That’s good, too. If you think it’s 0.01, but you’re still thinking about it, you’re still reading it. That’s good, too. And I didn’t want to fracture the audience based on how probable people would agree the risks are.

Lucas Perry: All right. So let’s get into the meat of the paper, then. It would be useful, I think, if you could help clarify the distinction between safety and existential safety.

Andrew Critch: Yeah. So here’s a problem we have. And when I say we, I mean people who care about AI existential safety. Around 2015 and 2016, we had this coming out of AI safety as a concept. Thanks to Amodei and the Robust and Beneficial AI Agenda from Stuart Russell, talking about safety became normal. Which was hard to accomplish before 2018. That was a huge accomplishment.

And so what we had happen is people who cared about extinction risk from artificial intelligence would use AI safety as a euphemism for preventing human extinction risk. Now, I’m not sure that was a mistake, because as I said, prior to 2018, it was hard to talk about negative outcomes at all. But it’s at this time in 2020 a real problem that you have people … When they’re thinking existential safety, they’re saying safety, they’re saying AI safety. And that leads to sentences like, “Well, self driving car navigation is not really AI safety.” I’ve heard that uttered many times by different people.

Lucas Perry: And that’s really confusing.

Andrew Critch: Right. And it’s like, “Well, what is AI safety, exactly, if cars driven by AI, not crashing, doesn’t count as AI safety?” I think that as described, the concept of safety usually means minimizing acute risks. Acute meaning in space and time. Like there’s a thing that happens in a place that causes a bad thing. And you’re trying to stop that. And the Concrete Problems in AI Safety agenda really nailed that concept.

And we need to get past the concept of AI safety in general if what we want to talk about is societal scale risk, including existential risk, which is acute on a geological timescale. Like you can look at a century before and after and see the earth is very different. But a lot of ways you can destroy the earth don’t happen like a car accident. They play out over a course of years. And things to prevent that sort of thing are often called ethics. Ethics are principles for getting a lot of agents to work together and not mess things up for each other.

And I think there’s a lot of work today that falls under the heading of AI ethics that are really necessary to make sure that AI technology aggregated across the earth, across many industries and systems and services, will not result collectively in somehow destroying humanity, our environment, our minds, et cetera.

To me, existential safety is a problem for humanity on an existential timescale that has elements that resemble safety in terms of being acute on a geological timescale. But also resemble ethics in terms of having a lot of agents, a lot of different stakeholders and objectives mulling around and potentially interfering with each other and interacting in complicated ways.

Lucas Perry: Yeah. Just to summarize this, people were walking around saying like, “I work on AI safety.” But really, that means that I’ve bought into AI existential risk and I work on AI existential risk. And then that’s confusing for everyone else, because working on the personal scale risk of self-driving car safety is also AI safety.

We need a new word, because AI safety really means acute risks, which can range from personal all the way to civilizational or transgenerational. And so, it’s confusing to say I work in AI safety, when really what I mean is that I only care about transgenerational AI existential risk.

Andrew Critch: Yes.

Lucas Perry: Then we have this concept of existential safety, which for you both has this portion of us not going extinct, but also existential safety includes the normative and ethics and values and game theory and how it is that an ecosystem of human and nonhuman agents work together to build a thriving civilization that is existentially preferable to other civilizations.

Andrew Critch: I agree 100% with everything you just said, except for the part where you say “existentially preferable.” I prefer to use existential safety to refer really, to preserving existence. And I prefer existential risk to refer to extinction. That’s not how Bostrom uses the term. And he introduced the term, largely, and he intends to include risks that are as important as extinction, but aren’t extinction risks.

And I think that’s interesting. I think that’s a good category of risks to think about and deserving of a name. I think, however, that there’s a lot more debate about what is or isn’t as bad as extinction. Whereas, there’s much less debate about what extinction is. There still is debate. You can say, “Well, what about if we become uploads, whatever.” But there’s much, much more uncertainty about what’s worse or better than extinction.

And so I prefer to focus existential safety on literally preventing extinction and then use some other concept, like societal scale risk, for referring to risks that are really big on a societal scale that may or may not pass the threshold of being worse or better than extinction.

I also care about societal scale risks and I don’t want people working on preventing societal scale risks to be fractured based on whether they think any particular risk, like lots of sentient suffering AI systems or a totalitarian regime that lasts forever. I don’t want people working to prevent those outcomes to be fractured based on whether or not they think those outcomes are worse than extinction or count as a quote, unquote existential risk. When I say existential risk, I always mean risks to the existence of the human species, for simplicity’s sake.

Lucas Perry: Yeah. Because Bostrom’s definition of an existential risk is any risk such that if it should occur, would permanently and drastically curtail the potential for Earth-originating intelligent life. Which would include futures of deep suffering or futures of being locked into some less than ideal system.

Andrew Critch: Yeah. Potential not only measured in existence, but potential measured in value. And if you’re suffering, the value of your existence is lower.

Lucas Perry: Yeah. And that there are some futures where we still exist, where they’re less preferable to extinction.

Andrew Critch: Right.

Lucas Perry: You want to say, okay, there are these potential suffering risks and there are bad futures of disvalue that are maybe worse than extinction. We’re going to call all these societal risks. And then we’re just going to have existential risk or existential safety refer to us not going extinct?

Andrew Critch: I think that’s especially necessary in computer science. Because if anything seems vague or unrefined, there’s a lot of allergy to it. I try to pick the most clearly definable thing, like are humans there or not? That’s a little bit easier for people to wrap their heads around.

Lucas Perry: Yeah. I can imagine how in the hard sciences people would be very allergic to anything that was not sufficiently precise. One final distinction here to make is that one could say, instead of saying, “I work on AI safety,” “I work on AI existential safety or AI civilizational or societal risk.” But another word here is, “I work on AI alignment.” And you distinguish that from AI delegation. Could you unpack that a little bit more and why that’s important to you?

Andrew Critch: Yeah. Thanks for asking about that. I do think that there’s a bit of an issue with the “AI alignment” concept that makes it inadequate for existential risk reduction. AI existential safety is my goal. And I think AI alignment, the way people usually think about it, is not really going to cut it for that purpose.

If we’re successful as a society in developing and rolling out lots of new AI technologies to do lots of cool stuff, there’s going to be a lot of stakeholders in that game. It’s going to be what you might call massively multipolar. And in that economy or society, a lot of things can go wrong through the aggregate behavior of individually aligned systems. Like just take pollution, right? No one person wants everybody else to pollute the atmosphere, but they’re willing to do it themselves. Because when Alice pollutes the atmosphere, Alice gets to work on time or Alice gets to take a flight or whatever.

And she harms everybody in doing that, including herself. But the harm to herself is so small. It’s just a drop in the bucket that’s spread across everybody else. You do yourself a benefit and you do a harm that outweighs that benefit, but it’s spread across everybody and accrues very little harm specifically to you. That’s the problem with externalities.
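The externality arithmetic described here can be sketched in a few lines of Python. This is only an illustration and not something from the conversation: the population size, benefit, and harm figures are hypothetical, chosen so that one person's benefit is smaller than the total harm they cause but larger than their own share of it.

```python
# Hypothetical numbers, chosen only to illustrate the externality argument.
N_PEOPLE = 1_000          # population sharing the atmosphere
PRIVATE_BENEFIT = 10.0    # what one person gains by polluting (e.g. a flight)
TOTAL_HARM = 15.0         # total harm one polluter causes, spread evenly over everyone


def payoff(i_pollute: bool, others_polluting: int) -> float:
    """Payoff to one person, given their choice and how many others pollute."""
    polluters = others_polluting + (1 if i_pollute else 0)
    harm_to_me = polluters * TOTAL_HARM / N_PEOPLE
    benefit = PRIVATE_BENEFIT if i_pollute else 0.0
    return benefit - harm_to_me


# Whatever everyone else does, polluting is individually the better choice...
for others in (0, N_PEOPLE - 1):
    assert payoff(True, others) > payoff(False, others)

# ...yet if everyone pollutes, each person ends up worse off than if nobody did.
everyone_pollutes = payoff(True, N_PEOPLE - 1)
nobody_pollutes = payoff(False, 0)
print(everyone_pollutes, nobody_pollutes)  # -5.0 0.0
```

Under these assumed numbers each individual choice looks locally sensible, and the bad outcome only appears in the aggregate.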

I think existential risk is most likely to obtain through externalities, between interacting systems that somehow were not designed to interact well enough because they had different designers or they had different stakeholders behind them. And those competitive effects, like if you don’t take a car, everyone else is going to take a car and you’re going to fall behind. So you take a car. If you’re a country, right? If you don’t burn fossil fuels, well, you spend a few years transitioning to clean energy and you fall behind economically. You’re taking a hit and that hurts you more than anybody. Of course, it benefits the whole world if you cut your carbon emissions, but it’s just a big prisoner’s dilemma. So you don’t do it. No one does it.

There’s many, many other variables that describe the earth. This comes to the human fragility thesis, which David and I outlined in the paper. Which is that there’s many variables, which if changed, can destroy humanity. And any of those variables could be changed in ways that don’t destroy machines. And so we are at risk of machine economies operating in ways that keep on operating at the expense of humans, who aren’t needed for them, being destroyed. That is the sort of backdrop for why I think delegation is a more important concept than alignment.

Delegation is a relationship between groups of people. You’ll often have a board of directors that delegates through a CEO to an entire staff. And I want to evoke that concept, the relationship between a group of overseers and a group of doers. You can have delegates on a UN committee from many different countries. You’ve got groups delegating to individuals to serve as part of a group who are going to delegate to a staff. There’s this constant flow through of responsibility. And it’s not even acyclic. You’ve got elected officials who are delegated by the electorate who delegate staff to provide services to the electorate, but also to control the electorate.

So there’s these loops going around. And I think I want to draw attention to all of the delegation relationships that are going to exist in the future economy. And that already exist in the present economy of AI technologies. When you pay attention to all of those different pathways of delegation, you realize there’s a lot of people in institutions with different values that aren’t going to agree with each other on what counts as aligned.

For example, for some people, it’s aligned to take a 1% chance of dying to double your own lifespan. Some people are like, “Yeah, that’s totally worth it.” And for some people, they’re like, “No 1% dying. That’s scary and I’m pretty happy living 80 years.” And so what sort of societal scale risks are worth taking are going to be subject to a lot of disagreement.

And the idea that there’s this thing called human values, that we’re all in agreement about. And there’s this other thing called AI that just has to do what the human values say. And we have to align the AI with human values. It’s an extremely simplified story. It’s got two agents and it’s just like one big agent called the humans. And then there’s this one big agent called AIs. And we’re just trying to align them. I think that is not the actual structure of the delegation relationship that humans and AI systems are going to have with respect to each other in the future. And I think alignment is helpful for addressing some delegation relationships, but probably not the vast majority.

Lucas Perry: I see where you’re coming from. And I think in this conception, alignment, as you said, is a subcategory of delegation.

Andrew Critch: Well, I would say that alignment is a sub problem of most delegation problems, but there’s not one delegation problem. And I would also say alignment is a tool or technique for solving delegation problems.

Lucas Perry: Okay. Those problems all exist, but actually doing AI alignment, that automatically brings in delegation problems. And/or, if you actually align a system, then this system is aligned with how we would want to solve delegation problems.

Andrew Critch: Yeah. That’s right. One approach to solving AI delegation, you might think, “Yeah, we’re going to solve that problem by first inventing a superintelligent machine.” Like step one, invent your superintelligent oracle machine. Step two, align your superintelligent oracle machine with you, the creator. Step three, ask it to solve for society. Just figure out how society should be structured. Do that. That’s a mathematically valid approach. I just don’t think that’s how it’s going to turn out. The closer powerful institutions get to having super powerful AI systems, the more political tensions are going to arise.

Lucas Perry: So we have to do the delegation problem as we’re going?

Andrew Critch: Yes, we have to do it as we’re going, 100%.

Lucas Perry: Okay.

Andrew Critch: And if we don’t, we put institutions at odds with each other to win the race of being the one chosen entity that aligns the one chosen superintelligence with their values or plan for the future or whatever. And I just think that’s a very non-robust approach to the future.

Lucas Perry: All right. Let’s pivot here then back into existential safety and normal AI safety. What do you see as the relationship between existential safety and safety for present day AI systems? Does safety for present day AI systems feed into existential safety? Can it inform existential safety? How much does one matter for the other?

Andrew Critch: The way I think of it, it’s a bit of a three node diagram. There’s present day AI safety problems, which I believe feed into existential safety problems somewhat. Meaning that some of the present day solutions will generalize to the existential safety problems.

There’s also present day AI ethics problems, which I think also feed into understanding how a bunch of agents can delegate to each other and treat each other well in ways that are not going to add up to destructive outcomes. That also feeds into existential safety.

And just to give concrete examples, let’s take car doesn’t crash, right? What does that have in common with existential safety? Well, existential safety is humanity doesn’t crash. There’s a state space. Some of the states involve humanity exists. Some of the states involve humanity doesn’t exist. And we want to stay in the region of state space where humans exist.

Mathematically, it’s got something in common with the staying in the region of state space where the car is on the road and not overheating, et cetera, et cetera. It’s a dynamical system. And it’s got some quantities that you want to conserve and there’s conditions or boundaries you want to avoid. It has this property just like culturally, it has the property of acknowledging a negative outcome and trying to avoid it. That’s, to me, the main thing that safety and existential safety have in common, avoiding a negative outcome. So is ethics about avoiding negative outcomes. And I think those both are going to flow into existential safety.
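The idea of a dynamical system that should stay inside a safe region of state space can be made concrete with a toy sketch. This is an illustration only, not a model from the conversation: the `CarState` fields, the limits, the drift, and both controllers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CarState:
    lane_offset_m: float   # distance from the lane center, in meters
    engine_temp_c: float   # engine temperature, in degrees Celsius


# Hypothetical boundaries of the safe region.
MAX_LANE_OFFSET_M = 1.5
MAX_ENGINE_TEMP_C = 110.0
DRIFT_PER_STEP_M = 0.2     # assumed constant sideways drift per time step


def is_safe(state: CarState) -> bool:
    """True if the state lies inside the safe region of state space."""
    return (abs(state.lane_offset_m) <= MAX_LANE_OFFSET_M
            and state.engine_temp_c <= MAX_ENGINE_TEMP_C)


def step(state: CarState, steering: float) -> CarState:
    """One step of the toy dynamics: drift plus the controller's correction."""
    return CarState(state.lane_offset_m + DRIFT_PER_STEP_M + steering,
                    state.engine_temp_c)


def stays_safe(initial: CarState,
               controller: Callable[[CarState], float],
               steps: int) -> bool:
    """Simulate and report whether the trajectory ever leaves the safe region."""
    state = initial
    for _ in range(steps):
        if not is_safe(state):
            return False
        state = step(state, controller(state))
    return is_safe(state)


start = CarState(lane_offset_m=0.0, engine_temp_c=90.0)
print(stays_safe(start, lambda s: 0.0, 20))                     # no correction
print(stays_safe(start, lambda s: -0.5 * s.lane_offset_m, 20))  # proportional correction
```

With no correction the car drifts off the road after a handful of steps; with the proportional correction the offset settles near 0.4 m and the trajectory stays in the safe region. Existential safety, on this framing, asks the same kind of question about a far larger state space.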

Lucas Perry: Are there some more examples you can make for current day AI safety problems and current day AI ethics problems, just make it a bit more concrete? How does something like robustness to distributional shift take us from aligned systems today to systems that have existential safety in the future?

Andrew Critch: So, conceptually, robustness to distributional shift is about, you’ve got some function that you want to be performed or some condition you want to be met, and then the environment changes or the inputs change significantly from when you created the system, and then you still want it to maintain those conditions or achieve the goal.

So, for example, if you have a car trained to drive in dry conditions, and then it starts raining, can you already have designed your car by principles that would allow it to not catastrophically fail in the rain? Can it notice, “Oh gosh, this is real different from the way I was trained. I’m going to pull over, because I don’t know how to drive in the rain.” Or can it learn, on the fly, how to drive in the rain and then get on with it?
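The "notice that this is different from training and pull over" behavior can be sketched as a simple out-of-distribution check. This is a hedged illustration rather than how any real driving system works: the friction readings, the z-score test, and the threshold are all assumptions made for the example.

```python
import statistics

# Hypothetical training data: road-surface friction readings in dry conditions.
TRAINING_FRICTION = [0.78, 0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.77]

TRAIN_MEAN = statistics.mean(TRAINING_FRICTION)
TRAIN_STDEV = statistics.stdev(TRAINING_FRICTION)
Z_THRESHOLD = 3.0  # assumed cutoff for "too different from training"


def choose_action(friction_reading: float) -> str:
    """Pull over when the input is far outside what the system was trained on."""
    z_score = abs(friction_reading - TRAIN_MEAN) / TRAIN_STDEV
    return "pull_over" if z_score > Z_THRESHOLD else "keep_driving"


print(choose_action(0.81))  # a reading typical of dry roads
print(choose_action(0.45))  # a much lower reading, as on a wet road
```

The first reading falls well within the training range, so the car keeps driving; the second is many standard deviations away, so it pulls over. Learning to drive in the rain on the fly, the other option mentioned, is a much harder problem that this sketch does not attempt.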

So those are kinds of robustness to distributional shift. The world changes. So, if you want something that’s safe and stays safe forever, it has to account for the world changing. So, principles of robustness to distributional shift are principles to which society, as a whole, needs to adhere. Now, do I think research in this area is differentially useful to existential risk?

No. Frankly, not at all. And the reason is that industry has loads of incentives to produce software that is robust to a changing environment. So, if on the margin I could add an idea to the idea space of robustness to distributional shift, I’m like, “Well, I don’t think there’s any chance that Uber is going to ignore robustness to distributional shift, or that Google is going to ignore it, or Amazon.” There’s no way these companies are going to roll out products while not thinking about whether they’re robust.

On the other hand, if I have a person who wants to dwell on the concept of robustness, who cares about existential risk and who wants to think about how robustness even works, like what are the mathematical principles of robustness? We don’t fully know what they are. If we did, we’d have built self driving cars already.

So, if I have a person who wants to think about that concept because it applies to society, and they want a job while they think about it, sure, get a job producing robust software or robust robotics, or get a bunch of publications in that area, but it’s not going to be neglected. It’s more of a mental exercise that can help you orient and think about society through a new lens, once you understand that lens, rather than a thing that somehow DeepMind is going to forget that its products need to be robust, come on.

Lucas Perry: So, that’s an interesting point. So, what are technical research areas, or areas in terms of AI ethics that you think there will not be natural incentives for solving, but that are high impact and important for AI existential safety?

Andrew Critch: To be clear, before I go into saying these areas are important, these areas aren’t, I do want to distinguish the claim area X is a productive place to be if you care about existential risk from, area X is an area that needs more ideas to solve existential safety. I don’t want the people to feel discouraged from going into intellectual disciplines that are really nourishing to the way that you’re going to learn and invent new concepts that help you think forever. And it can be a lot easier to do that in an area that’s not neglected.

So, robustness is not going to be neglected. Alignment, taking an AI system and making it do what a person wants, that’s not going to be neglected, because it’s so profitable. The economy is set up to sell to individual customers, to individual companies. Most of the world economy is anarchic in that way, anarcho-capitalist at a global scale. If you can find someone that you can give something to that they like, then you will.

The Netflix challenge is an AI alignment problem, right? The concept of AI alignment was invented in 2002, and nobody cites it because it’s so obvious of an idea that you have to make your AI do stuff. Still, it was neglected in academia because AI wasn’t super profitable. So, it is true that AI alignment was not a hot area of research in academia, but now, of course, you need AI to learn human preferences. Of course, you need AI to win in the tech sphere. And that second part is new.

So, because AI is taking off industrially, you’ve got a lot more demand for research solutions to, “Okay. How do we actually make this useful to people? How do we get this to do what people want?” And that’s why AI alignment is taking off. It’s not because of existential risk, it’s because well, AI is finally super-duper useful and it’s finally super-duper profitable, if you can just get it to do what the customer wants. So, that’s alignment. That’s what user agent value alignment is called.

Now, is that a productive place to be if you care about existential risks? I think, yes. Because if you’re confused about what values are and how you could possibly get an inhuman system to align with the values of a human system, like human society, if that basic concept is tantalizing to you and you feel like if you just understood it a bit more, you’d be better mentally equipped to visualize existential risk playing out or not playing out on a societal scale, then yeah, totally go into that problem, think about it. And you can get a job as a researcher or an engineer aligning AI systems with the values of the human beings who use them. And it’s super enriching and hard, but it’s not going to be neglected because of how profitable it is.

Lucas Perry: So what is neglected, or what is going to be neglected?

Andrew Critch: What’s going to be neglected is stuff that’s both hard and not profitable. Transparency, I think, is not yet profitable, but it will be. So it’s neglected now. And when I say it’s not yet profitable, I mean that as far as I know, we don’t have big tech companies crushing their competition by having better visualization techniques for their ML systems. You don’t see advertisements for, “Hey, we’re hiring transparency engineers,” yet.

And so, I take that as a sign that we’ve not yet reached the industrial regime in which the ability for engineers to understand their systems better is the real bottleneck to rolling out the next product. But, I think it will be if we don’t destroy ourselves first. I think there’s a very good chance of that actually playing out.

So I think, if you want an exciting career, get into transparency now. In 10 years, you’ll be in high demand and you’ll have understood a problem that’s going to help humans and machines relate, which is, “Can we understand them well enough to manage them?” There’s other problems, unfortunately, that I think are neglected now and important, and are going to stay neglected. And I think those are the ones that are most likely to kill us.

Lucas Perry: All right, let’s hear them.

Andrew Critch: Things like how do we get multiple AI systems from multiple stakeholders to cooperate with each other? How do you broker a peace treaty between Uber and Waymo cars? That one’s not as hard because you can have the country that allows the cars into it have some regulatory decision that all the cars have to abide by, and now the cars have to get along or whatever.

Or maybe you can get the Partnership on AI, which is largely American, to agree amongst themselves that there’s some principles, and then the cars adhere to those principles. But it’s much harder on an international scale where there’s no one centralized regulatory body that’s just going to make all the AIs behave this way or that way. And moreover, the people who are currently thinking about that aren’t particularly oriented towards existential risk, which really sucks.

So, I think what we need, if we get through the next 200 years with AI, frankly, if we get through the next 60 years with AI, it’s going to be because people who cared about existential risk entered institutions with the power to govern the global deployment of AI, or people already with the power to govern the global deployment of AI technologies come to care about existential and comparable societal scale risks. Because without that, I think we’re going to miss the mark.

When something goes wrong and there’s somebody whose job was clearly to make that not happen, it’s a lot easier to get that fixed. Think about people who’ve tried to get medical care since the COVID pandemic. Everybody’s decentralized, the offices are part work from home, partly they’re actually physically in there. So you’re like, “Hey, I need an appointment with a neurologist.” The person whose job it is to make the appointment is not the person whose job it is to tell the doctor that the appointment is booked.

Also, it’s someone else’s job to contact the insurance company and make sure that you’re authorized. And they might be off that day, and then you show up, and you get a big bill and you’re like, “Well, whose fault was this?” Well, it’s your fault because you’re supposed to check that your insurance covered this neurology stuff, right? You could have called your insurance company to pre-authorize this visit.

So it’s your fault. But also, it’s the fault of the administrator you didn’t talk to, who never meets you, whose job is to conduct the pre-authorization on the part of the doctor’s office, which sometimes does it, right? And it’s also the doctor’s fault, because maybe the doctor could have noticed that the authorization hadn’t been done, and didn’t cancel the appointment or warn you that maybe you don’t want to afford this right now. So whose fault is it? Oh, I don’t know.

And if you’ve ever dealt with a big fat bureaucratic failure like this, that is what is going to kill humanity. Everybody knows it’s bad. Nobody in this system, not the insurance company, not the call center that made my appointment, not the insurance specialist at the doctor’s office, certainly not the doctor, none of these people want me not to get healthcare, but it’s no one in particular’s fault. And that’s how it happens.

I think the same thing is going to happen with existential risk. We’re going to have big companies making real powerful AI systems, and it’s going to be really obvious that it is their job to make those systems safe. And there’s going to be a bunch of kinds of safety that’s really obviously their job that people are going to be real angry at them for not paying a lot of attention to. And the anger is just going to get more and more, the more obvious it is that they have power.

That kind of safety, I don’t want to trivialize it. It’s going to be hard. It’s going to be really difficult research and engineering, and it can be really enriching and many, many thousands of people could make their whole careers around making AI safe for big tech companies, according to their accountable definition of safety.

But then what about the stuff they’re not accountable for? What about geopolitics that’s nobody’s fault? What about coordination failures between three different industries, or three different companies that’s nobody’s fault? That’s the stuff that’s going to get you. I think it’s actually mathematically difficult to specify protocols for decentralized multi-agent systems to adhere to constraints. It is more difficult than specifying constraints for a single system.
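One small way to see why constraints on a decentralized system are harder to specify than constraints on a single system is a shared cap that every agent checks only against its own behavior. The sketch below is an illustration under stated assumptions, not a claim about any real protocol: the cap, the usage figures, and the equal-quota rule are all hypothetical.

```python
GLOBAL_LIMIT = 100.0  # hypothetical cap on some shared resource


def locally_ok(own_usage: float) -> bool:
    """What each agent can check alone: is my own usage under the cap?"""
    return own_usage <= GLOBAL_LIMIT


def jointly_ok(usages: list[float]) -> bool:
    """The constraint that actually matters: is total usage under the cap?"""
    return sum(usages) <= GLOBAL_LIMIT


def within_quota(own_usage: float, n_agents: int) -> bool:
    """A coordinated rule: each agent stays under an equal share of the cap."""
    return own_usage <= GLOBAL_LIMIT / n_agents


usages = [60.0, 70.0, 50.0]  # three independent systems

# Every system passes its own check, yet the joint constraint is violated.
print(all(locally_ok(u) for u in usages), jointly_ok(usages))

# The quota rule would guarantee the joint constraint, but it requires every
# agent to know how many others exist and to accept the same split.
print([within_quota(u, len(usages)) for u in usages])
```

For a single system the local check and the global constraint coincide. With several systems, nobody violates a rule they are accountable for and the total is still exceeded, and the fix requires shared knowledge and agreement that no individual system's designer controls.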

Lucas Perry: I’m having a little bit of confusion here, because you’re arguing that alignment questions will be solved via the incentives of just the commercialization of AI.

Andrew Critch: Single-human, single-AI alignment problems or single-institutions, single-network alignment problems. Yes.

Lucas Perry: Okay. But they also might be making single agents for many people, or multiple agents for many people. So it doesn’t seem single-single to me. But the other part is that you’re saying that in a world where there are many competing actors and a diffusion of responsibility, the existential risk comes from obvious things that companies should be doing, but no one is, because maybe someone should make a regulation about this thing but whatever, so we should just keep doing things the way that we are. But doesn’t that come back to commercialization of AI systems not solving all of the AI alignment problems?

Andrew Critch: So if by AI alignment you mean AI technology in aggregate behaves in a way that is favorable to humanity in aggregate. If that’s what you mean, then I agree that failure to align the entire economy of AI technology is a failure of AI alignment. However, number one, people don’t usually think about it that way.

If you asked someone to write down the AI alignment problem, they’ll write down a human utility function and an AI utility function, and talk about aligning the AI utility function with the human utility function. And that’s not what that looks like. That’s not a clear depiction of that super multi-agent scenario.

And, second of all, the concept of AI alignment has been around for decades and it refers to single-single alignment, typically. And third, if you want to co-opt the concept of AI alignment and start using it to refer to general alignment of general AI technology with general human values, just a spread-out notion of goodness that’s going to get spread over all of the AI technology and make it all generally good for all the generally humans. If you want to co-opt it and use it for that, you’re going to have a hard time. You’re going to invite a lot of debate about what is human values?

We’re trying to align the AI technology with the human values. So, you go from single-single to single-multi. Okay. Now we have multiple AI systems serving a single human, that’s tricky. We got to get the AI systems to cooperate. Okay. Cool. We’ll figure out how the cooperation works and we’ll get the AI systems to do that. Cool. Now we’ve got a fleet of machines that are all serving effectively.

Okay. Now let’s go to multi-human, multi-AI. You’ve got lots of people, lots of AI systems in this hyper interactive relationship. Did we align the AIs with the humans? Well, I don’t know. Are some of the humans getting really poor, really fast, while some of them are getting really rich, really fast? Sound familiar? Okay. Is that aligned? Well, I don’t know. It’s aligned for some of them. Okay. Now we have a big debate. I think that’s a very important debate and I don’t want to skirt it.

However, I think you can ask the question, did the AI technology lead to human extinction without having that debate? And I want to factor that debate of, wait, who do you mean? Who are you aligning with? I want that debate to be had, and I want it to be had separately from the debate of, did it cause human extinction?

Because I think almost all humans want humanity not to go extinct. Some are fine with it, it’s not universal, but a lot of people don’t want humanity to go extinct. I think the alignment concept, if you play forward 10 years, 20 years, it’s going to invite a lot of very healthy, very important debate that’s not necessary to have for existential safety.

Lucas Perry: Okay. So I’m not trying to defend the concept of AI alignment in relation to the concept of AI existential safety. I think what I was trying to point towards is that you said earlier that you do not want to discourage people from going into areas that are not neglected. And the areas that are not neglected are the areas where the commercialization of AI will drive incentives towards solving alignment problems.

Andrew Critch: That’s right.

Lucas Perry: But the alignment problems that are not going to get solved-

Andrew Critch: I want to encourage people to go out to solve those problems. 100%.

Lucas Perry: Yeah. But just to finish the narrative, the alignment problems that are not going to get solved are the ones where there are multiple humans and multiple AI agents, and there’s this diffusion of responsibility you were talking about. And this is the area you said would most likely lead to AI existential risk. Where maybe someone should make a regulation about this specific thing, or maybe we’re competing a little bit too hard, and then something really bad happens. So you’re saying that you do want to push people into both the unneglected area of…

Andrew Critch: Let me just flesh out a little bit more about my value system here. Pushing people is not nice. If there’s a person and they don’t want to do a thing, I don’t want to push them. That’s the first thing. Second thing is, pulling people is not nice either. So it’s like, if someone’s on the way into doing something they’re going to find intellectually enriching that’s going to help them think about existential safety that’s not neglected, it’s popular, it’s going to be popular, I don’t want to hold them back. But, if someone just comes to me and is like, “Hey, I’m indifferent between transparency and robustness.” I’m like, “100%, go into transparency, no question.”

Lucas Perry: Because it will be more neglected.

Andrew Critch: And if someone tells me they’re indifferent between transparency and multi-stakeholder delegation, I’m like, “100%, multi-stakeholder delegation.” If you’ve got traction on that and you’re not going to burn your career, do it.

Lucas Perry: Yeah. That’s the three categories then though. Robustness gets solved by the incentive structures of commercialization. Transparency, maybe less so, maybe it comes later. And then the multi-multi delegation is just the other big neglected problem of living in a global world. So, you’re saying that much of the alignment problem gets solved by incentive structures of commercialization.

Andrew Critch: Well, a lot of what people call alignment will get solved by present day commercial incentives.

Lucas Perry: Yes.

Andrew Critch: Another chunk of societal scale benefit from AI, I’ll say, will hopefully get solved by the next wave of commercial incentives. I’m thinking things like transparency, fairness, accountability, things like that are actually going to become commercially profitable to get right, rather than merely the things companies are afraid of getting wrong.

And I hope that second wave happens before we destroy ourselves, because possibly, we would destroy ourselves even before then. But most of my chips are on, there’s going to be a wave of benefit with AI ethics in the next 10 years or something, and that that’s going to solve a bunch more of existential safety, or it’s going to address them. Leftover after that is stuff that the global capitalism never got to.

Lucas Perry: And the things that global capitalism never got to are the capitalistic organizations and governments competing with one another with very strong AI systems?

Andrew Critch: Yeah. Competing and cooperating.

Lucas Perry: Competing and cooperating, unless you bring in some strong notion of paretotopia where everyone is like, “We know that if we keep doing this, that everyone is going to lose everything they care about.”

Andrew Critch: Well, the question is, how do you bring that in? If you solve that problem, you’ve solved it.

Lucas Perry: Okay. So, to wrap up on this then, as companies increasingly are making systems that serve people and need to be able to learn and adopt their values, the incentives of commercialization will continue to solve what are classically AI alignment problems that may also provide some degree of AI existential safety. And there’s the question of how much of those get solved naturally, and how much we’re going to have to do in academia and nonprofit, and then push that into industry.

So we don’t know what that will be, but we should be mindful about what will be solved naturally, and then what are the problems that won’t be, and then how do we encourage or invite more people to go into areas that are less likely to be solved by natural industrial incentives.

Andrew Critch: And do you mean areas of alignment, or areas of existential safety? I’m serious.

Lucas Perry: I know because I’m guilty of not really using this distinction in the past. Both.

Andrew Critch: Got it. I actually think most of single-single alignment. Like there’s a single stakeholder, which might be a human or an institution that has one goal, like profits, right? So there’s a single-human stakeholder, and then there’s a single-AI. I call that single-single alignment. I almost never refer to a multi-multi alignment, because I don’t know what it means and it’s not clear what the values of multiple different stakeholders even are. What are you referring to when you say the values are this function?

So, I don’t say multi-multi alignment a lot, but I do sometimes say single-single alignment to emphasize that I’m talking about the single stakeholder version. I think the multi-multi alignment concept almost doesn’t make sense. So, when someone asks me a question about alignment, I always have to ask, “Now, are you eliding those concepts again?” Or whatever.

So, we could just say single-single alignment every time and I’ll know what you’re talking about, or we could say classical alignment and I’ll probably assume that you mean single-single alignment, because that’s the oldest version of the concept from 2002. So there’s this concept of basic human rights or basic human needs. And that’s a really interesting concept, because it’s a thing that a lot of people agree on. A lot of people think murder is bad.

Lucas Perry: People need food and shelter.

Andrew Critch: Right. So there’s a bunch of that stuff. And we could say that AI alignment is about that stuff and not the other stuff.

Lucas Perry: Is it not about all of it?

Andrew Critch: I’ve seen satisfactory mathematical definitions of intent alignment. Paul Christiano talks about alignment, which I think of as intent alignment, I think he now also calls it intent alignment, which is the problem of making sure an AI system is intending to help its user. And I think he’s got a pretty clear conception of what that means. I think the concept of the intent alignment of a single-single AI servant is easier to define than whatever property an AI system needs to have.

There’s a bunch of properties that people call AI alignment that are actually all so different from each other. And people don’t recognize that they’re different from each other, because they don’t get into the technical details of trying to define it, so then everyone thinks that we all mean the same thing. But what really is going on, is everyone’s going around thinking, “I want AI to be good, basically good for basically everybody.” No one’s cashing that out, and so nobody notices how much we disagree on what basically good for basically everybody means.

Lucas Perry: So that’s an excellent point, and I’m guilty here now then of having absolutely no idea of what I mean by AI alignment.

Andrew Critch: That’s my goal, because I also don’t know and I’m glad to have company in that mental state.

Lucas Perry: Yeah. So, let’s try moving along ahead here. And I’ll accept any responsibility and my guilt in using the term AI alignment incorrectly from now on. That was a fun and interesting side road, and I’m glad we pursued it. But now pivoting back into some important definitions here that you also write about in your paper, what counts to you as an AI system and what counts to you as an AI technology, and why does that distinction matter?

Andrew Critch: So throughout the ARCHES report, I’d advocated for using technology versus system. AI technology is like a mass noun, and you can say you can have more of it or less of it. And it’s like this butter that you can spread on the toast of civilization. An AI system is like a count noun. You can have one of them or many of them, and you can put an AI system like you could put a strawberry on your toast, which is different from strawberry jam.

So, there’s properties of AI technology that could threaten civilization and there’s also properties of a single AI system that could threaten civilization. And I think those are both important frames to think in, because you could make a system and think, “This system is not a threat to civilization,” but very quickly, when you make a system, people can copy it. People can replicate it, modify it, et cetera. And then you’ve got a technology that’s spread out like the strawberry has become strawberry compote and spread out over the toast now. And do you want that? Is that good?

As an everyday person, I feel like basic human rights are a well-defined concept to me. “Is this basically good for humanity?” is a well-defined concept to me, but mathematically it becomes a lot harder to pin down. So I try to say AI technology when I want to remind people that this is going to be replicated, it’s going to show up everywhere. It’s going to be used in different ways by different actors.

At the same time, you can think of the aggregate use of AI technology worldwide as a system. You can say the internet is a system, or you can say all of the self driving cars in the world is one big system built by multiple stakeholders. So I think that the system concept can be reframed to refer to the aggregate of all the technology of a certain type or of a certain kind. But that mental reframe is an actual act of effort, and you can switch between those frames to get different views of what’s going on. I try to alternate and use both of those views from time to time, the system view and the technology view.

Lucas Perry: All right. So let’s get into another concept here that you develop, and it’s really at the core of your paper. What is a prepotent AI? And I guess before you define what a prepotent AI is, can you define what prepotent means? I had actually never heard of that word before reading your paper.

Andrew Critch: So I’m going to say the actual standard definition of prepotent, which connotes arrogant, overbearing, high-handed, despotic, possessing excessive or abusive authority. These connotations are carried across a bunch of different Latin languages, but in English they’re not as strong. In English, prepotent just means very powerful, or superior in force, influence, or authority, or predominant.

I used it because it’s not that common of a word, but it’s still a word, and it’s a property that AI technology can have relative to us. And it’s also a property that a particular AI system, whether it’s singular or distributed, can have relative to us. The definition that I’d give for a prepotent AI technology is technology whose deployment would transform humanity’s habitat, which is currently the earth, in a way that’s unstoppable to us.

So there’s this notion of there’s the transformativeness and then there’s the unstoppableness. Transformativeness is a concept that has been also elaborated by the Open Philanthropy Project. They have this transformative AI concept. I think it’s a very good concept, because it’s impact oriented. It’s not about what the AI’s trying to do, it’s about what impact that has. And they say an AI system or technology is transformative if its impact on the world is comparable to, say, the agricultural revolution or the industrial revolution, a major global change in how things are done. You might argue that the internet is a transformative technology as well.

So, that’s the transformative aspect of prepotence. And then there’s the unstoppable aspect. So, imagine something that’s transforming the world the way the agricultural or industrial revolution has transformed it, but also, we can’t stop it. And by we, I mean, no subset of humans, if they decided that they want to stop it, could stop it. If every human in the world decided, “Yeah, we all want this to stop,” we would fail.

I think it’s possible to imagine AI technologies that are unstoppable to all subsets of humanity. I mean, there’s things that are hard to stop right now. If you wanted to stop the use of electricity. Let’s say all humans decided, today, for some strange reason that we never want to use electricity anymore. That’d be a difficult transition. I think we probably could do it, but it’d be very difficult. Humanity as a society can become dependent on certain things, or intertwined with things in a way that makes it very hard to stop them. And that’s a major mechanism by which an AI technology can be prepotent, by being intertwined with us and how we use it.

Lucas Perry: So, can you distinguish this idea of prepotent AI, because it’s a completely new concept, from transformative AI, as you mentioned before, and superintelligence, and why it’s important to you that you introduced this new concept?

Andrew Critch: Yeah. Sure. So let’s say you have an AI system that’s like a door-to-door salesman for solar panels, and it’s just going to cover everyone’s roofs with solar panels for super cheap, and all of the businesses are going to have solar panels on top, and we’re basically just not going to need fossil fuels anymore. And we’re going to be way more decentralized and independent, and states are going to be less dependent on each other for energy. So, that’s going to change geopolitics. A lot of stuff’s going to change, right?

So, you might say that that was transformative. So, you can have a technology that’s really transformative, but also maybe you can stop it. If everybody agreed to just not answer the door when the door-to-door solar panel robot salesman comes by, then they would stop. So, that’s transformative, but not prepotent. There’s a lot of different ways that you can envision AI being both transformative and unstoppable, in other words, prepotent.

I have three examples that I’d go to and we’ve written about those in ARCHES. One is technological autonomy. So if you have a little factory that can make more little factories, and it can do its own science and invent its own new materials to make more robots to do more mining, to make more factories, et cetera, you can imagine a process like that that gets out of hand someday. Of course, we’re very far away from that today, conceptually, but it might not be very long before we can make robots that make robots that make robots.

Self-sustaining manufacturing like that could build defenses using technology the way humans build defenses against each other. And now suddenly, the humans want to stop it, but it has nukes aimed at us, so we can’t. Another completely different one which is related, is replication speed. Like the way a virus can just replicate throughout your body and destroy you without being very smart.

You could envision, you can imagine. I don’t know how easy it is to build this, because maybe it’s a question of nanotechnology, but can you build systems that just very quickly replicate, and just tile the earth so fast with their replicants that we die? Maybe we suffocate from breathing them, or breathing their exhaust. That one honestly seems less plausible to me than the technological autonomy one, but to some people it seems more plausible and I don’t have a strong position on that.

And then there’s social acumen. You can imagine say a centralized AI system that is very socially competent, and it can deliver convincing speeches to the point of becoming elected a state official, and then brokering deals between nations that make it very hard for anybody to go against their plans, because they’re so embedded and well negotiated with everybody. And when you try to coordinate, they just whisper things, or say threats or make offers that dis-coordinate everybody again. Even though everybody wants it to stop, nobody can manage to coordinate long enough to stop it because it’s so socially skilled.

So those are like a few science fiction scenarios that I would say constitute prepotence on the part of the AI technology or system. They’re all different and the interesting thing about them is that they all can happen without being generally superintelligent. These are conditions that are sufficient to pose a significant existential threat to humanity, but which aren’t superintelligence. And I want to focus on those because I don’t want us to delay addressing existential risk until we have superintelligence. I want us to address the minimum viable existential threats that we could face and head those off. So that’s why I focus on prepotence as a property rather than superintelligence, because it’s a broader category that I think is still quite threatening and quite plausible.

Lucas Perry: Another interesting and important concept born of this is misaligned prepotent AI technology. Can you expand a bit on that? So what is and should count as misaligned prepotent AI technology?

Andrew Critch: So this was a tough decision for me because as you’ve noticed throughout this podcast, at the technical level, I find the alignment concept confusing at multi-stakeholder scales, but still critical to think about. And so I couldn’t decide whether to just talk about unsurvivable prepotent AI or misaligned prepotent AI. So let me talk about unsurvivable prepotent AI. By that, I mean it’s transformed the earth, you can’t stop it and moreover, you’re going to die of it eventually. The AI technology has become unsurvivable in the year 2085 if in that year, the humans now are doomed and cannot possibly survive. And I thought about naming the central concept, unsurvivable prepotent AI but a lot of people want to say for them, misalignment is basically unsurvivability.

I think David also tends to think of alignment in a similar way, but there’s this question of where do you draw the line between poorly aligned and misaligned? We just made a decision to say, extinction is the line, but that’s kind of a value judgment. And one of the things I don’t like about the paper is that it has that implicit value judgment. And I think the way I would prefer people to think is in terms of the concept of unsurvivability versus survivability, or prepotence versus not. But the theme of alignment and misalignment is so pervasive that some of our demo readers preferred that name for the unsurvivable prepotent AIs.

Lucas Perry: So misaligned prepotent AI then is just some AI technology that would lead to human extinction?

Andrew Critch: As defined in the report, yep. That’s where we draw the line between aligned and misaligned. If it’s prepotent, it’s having this huge impact. When’s the huge impact definitively misaligned? Well, it’s kind of like where’s the zero line and we just kind of picked extinction to be the line to call misaligned. I think it’s a pretty reasonable line. It’s pretty concrete. And I think a lot of efforts to prevent extinction would also generalize to preventing other big risks. So sometimes, it’s nice to pick a concrete thing and just focus on it.

Lucas Perry: Yeah. I understand why and I think I would probably endorse doing it this way, but it also seems a little bit strange to me that there are futures worse than extinction and they’re going to be below the line. And I guess that’s fine then.

Andrew Critch: That’s why I think unsurvivable is a better word. But our demo readers, some of them just really preferred misaligned prepotent AI over unsurvivable prepotent AI. So we went with that just to make sense to our readers.

Lucas Perry: Okay. So as we’re building AI technologies, we can ask what counts as the deployment of a prepotent AI system or technology, a TAI system, or a misaligned prepotent AI system and the implications of such deployment? I’m curious to get your view on what counts as the deployment of a prepotent AI system or a misaligned prepotent AI system.

Andrew Critch: So you could imagine something that’s transforming the earth and we can’t stop it, but it’s also great.

Lucas Perry: Yeah. An aligned prepotent AI system.

Andrew Critch: Yeah. Maybe it’s just building a lot of infrastructure around the world to take care of people’s health and education. Some people would find that scary and not like the fact that we can’t stop it, and maybe that fear alone would make it harmful or maybe it would violate some principle of theirs that would matter even if they didn’t feel the fear. But you can at least imagine under some value systems, technology that’s kind of taken over the world but it’s taken good care of everybody. And maybe it’s going to take care of everybody forever so humanity will never go extinct. That’s prepotent but not unsurvivable, but that’s a dangerous move to make on a planet, to sort of make a prepotent thing and try to make sure that it’s an aligned prepotent thing instead of a misaligned prepotent thing, because you’re unstoppably transforming the earth and maybe you should think a lot before you do that.

Lucas Perry: And maybe prepotence is actually incompatible with alignment if we think about it enough for the reasons that you mentioned.

Andrew Critch: It’s possible. Yeah. With enough reflection on the value of human autonomy, we would eventually conclude that if humans can’t stop it, it’s fundamentally wrong in a way that will alienate and destroy humans eventually in some way. That said, I do want to add something which is that I think almost all prepotent AI that we could conceivably make will be unsurvivably misaligned. If you’re transforming the world, most states of the world are not survivable to humans. Just like most planets are not survivable to humans. So most ways that the world could be very different are just ways in which humans could not survive. So I think if you have a prepotent AI system, you have to sort of steer it through this narrow window of futures, this narrow like keyhole even of futures where all the variables of the earth stay inhabitable to humans, or we would build some space colony where humans live instead of Earth.

Almost every chemical element, if you just turn up that chemical element on the earth, humans die. So that’s the thing that makes me think most conceivable prepotent AI systems are misaligned or unsurvivable. There are people who think about alignment a lot that I think are super biased by the single principal, single agent framing and have sort of lost track of the complexities of society and that’s why they think prepotent AI is conceivable to align or like not that hard to align or something. And I think they’re confused, but maybe I’m the confused one and maybe it’s actually easy.

Lucas Perry: Okay. So you’ve mentioned a little bit here about if you dial in the knobs of the chemical composition of really anything much on the planet in any direction, that pretty quickly you can create pretty hostile or even existentially incompatible situations on Earth for human beings. So this brings us to the concept of basically how frail humanity is given the conditions that are required for us to exist. What is the importance of understanding human frailty in relation to prepotent AI systems?

Andrew Critch: I think it’s pretty simple. I think human frailty implies don’t make prepotent AI. If we lose control of the knobs, we’re at risk of the knobs getting set wrong. Now that’s not to say we can set the knobs perfectly either, but if they start to go wrong, we can gradually set them right again. There’s still hope that we’ll stop climate change, right? And not saying we will, but it’s at least still possible. We haven’t made it impossible to stop. If every human in the world agreed now to just stop, we would succeed. So we should not lose control of this system because almost any direction it could head is a disaster. So that’s why some people talk about the AI control problem, which is different I claim than the AI alignment problem. Even for a single powerful system, you can imagine it looking after you, but not letting you control it.

And if you aim for that and miss, I think it’s a lot more fraught. And I guess the point is that I want to draw attention to human fragility because I know people who think, “No, no, no. The best thing to do for humanity is to build a super powerful machine that just controls the Earth and protects the humans.” I know lots of people who think that. It makes sense logically. It’s like, “Hey, the humans. We might destroy ourselves. Look at this destructive stuff we’re doing. Let’s build something better than us to take care of us.” So I think the reasoning makes sense, but I think it’s a very dangerous thing to aim for because if we aim and miss, we definitely, definitely die.

I think transformative AI is a big enough risk. We should never make prepotent AI. We should not make unstoppable, transformative AI. And that’s why there’s so much talk about the off switch game or the control problem or whatever. Corrigibility is kind of related to turning things off. Humans have this nice property where if half of them are destroyed and the other half of them have the ability to notice that and do something about it, they’re quite likely to do something about it. So you get this robustness at a societal scale by just having lots of off switches.

Lucas Perry: So we’ve talked about this concept a bunch already, this concept of delegation. I’m curious if you can explain the importance and relevance of considering delegation of tasks from a human or humans to an AI system or systems. So we’re just going to unpack this taxonomy that you’ve created a bit here of single-single, single-multi, multi-single, and multi-multi.

Andrew Critch: The reason I think delegation is important is because I think a lot of human society is rightly arranged in a way that avoids absolute power from accumulating into decisions of any one person, even in the most totalitarian regimes. The concept of delegation is a way that humans hand power and responsibility to each other in political systems but also in work situations, like the boss doesn’t have to do all the work. They delegate out and they delegate a certain amount of power to people to allow the employees of a company to do the work. That process of responsibilities and tasks being handed from agent to agent to agent is how a lot of things get done in the world. And there’s many things we’ve already delegated to computers.

I think delegation of specific tasks and responsibilities is going to remain important in the future even as we approach human level AI and supersede human level AI, because people resist the accumulation of power. If you say, “Hey, I am Alpha Corp. I’m going to make a superintelligent machine now and then use it to make the world good.” You might be able to get a few employees that are like kind of wacky enough to think that yeah, taking over the world with your machine is the right company mission or whatever. But the winners of the race of AI development are going to be big teams that won because they managed to work together and pull off something really hard. And such a large institution is going to most likely have dissident members who don’t think taking over the world is the right plan for what to do with your powerful tech.

Moreover, there’s going to be plenty of pressures from outside even if you did manage to fill a company full of people who want to take over the world. They’re going to know that that’s kind of not a cool thing to do according to most people. So you’re not going to be taking over the world with AI. You’re going to be taking on specific responsibilities or handing off responsibilities. And so you’ve got an AI system that’s like, “Hey, we can provide this service. We’ll write your spam messages for you. Okay?” So then that responsibility gets handed off. Perhaps OpenAI would choose not to accept that responsibility. But let’s say you want to analyze and summarize a large corpus of texts to summarize what people want. Let’s say you get 10,000 customer service emails in a day and you want something to read that and give you a summary of what really people want.

That’s a tremendously useful thing to be able to do. And let’s say OpenAI develops the capability to do that. They’ll sell that as a service and other companies will benefit from it greatly. And now, OpenAI has this responsibility that they didn’t have. They’re now responsible for helping Microsoft fulfill customer service requests. And if Microsoft sucks at fulfilling those customer service requests, now OpenAI is getting complaints from Microsoft because they summarized the requests wrong. So now you’ve got this really complicated relationship where you’ve got a bunch of Microsoft users sending in lots of emails, asking for help, that are being summarized by OpenAI and then handed off to Microsoft developers to prioritize what they do next with their software. And no one is solely responsible for everything that’s happening because the customer is responsible for what they ask, Microsoft is responsible for what they provide, and OpenAI is responsible for helping Microsoft understand what to provide based on what the customers ask.

Responsibilities get naturally shared out that way unless somebody comes in with a lot of guns and says, “No, give me all the responsibility and all the power.” So militarization of AI is certainly a way that you could see a massive centralization of power from AI. I think states should avoid militarizing AI to avoid scaring other states into militarizing AI. We don’t want to live in a world with militarized AI technologies. So I think if we succeed in heading off that threat, and that’s a big if, then we end up in an economy where the responsibilities are being taken on, services are being provided. And then everything’s suddenly very multi-stakeholder, multiple machines servicing multiple people. And I think of delegation as a sort of operation that you perform over and over that ends up distributing those responsibilities and services. And I think about how do you perform a delegation step correctly? If you can do one delegation step correctly, like when Microsoft makes the decision to hand off its customer service interpretation to OpenAI’s language models, Microsoft needs to make that decision correctly.

And if it makes that decision correctly, and if we define “correctly” correctly, it’ll be part of an overall economy of delegations that are respectful of humanity. So in my opinion, once you head off militarization, the task of ensuring existential safety for humanity boils down to the task of recursively defining delegation procedures that are guaranteed to preserve human existence and welfare over time.

Lucas Perry: And so you see this area of delegation as being the most x-risky.

Andrew Critch: So it’s interesting. I think delegation prevents centralization of power, which prevents one kind of x-risk. And I think we will seek to delegate. We will seek desperately to delegate responsibilities and distribute power as it accumulates.

Lucas Perry: Why would we naturally do that?

Andrew Critch: People fear power.

Lucas Perry: Do we?

Andrew Critch: If you see something with a lot more power than you, people tend to fear it and sort of oppose it. And separately, people fear having power. If you’re on a team that’s like, “Yeah, we’re going to take over the world,” you’re probably going to be like, “Really? Isn’t it bad? Isn’t that a supervillain thing to do?” So as I predict this, I don’t want to say, “Count on somebody else to adopt this attitude.” I want people listening to adopt that attitude as well. And I both predict and encourage the prevention of extreme concentrations of power from AI development because society becomes less robust then. It becomes this one point of failure where if this thing messes up, everything is destroyed. Whereas right now, it’s not that easy for a centralized force to destroy the world by messing up. It is easy for decentralized forces to destroy the world right now. And that’s how I think it’ll be in the future as well.

Lucas Perry: And then as you’re mentioning and have mentioned, the diffusion of responsibility is where we risk potentially missing core existential safety issues in AI.

Andrew Critch: Yeah, I think that’s the area that’s not only neglected by present day economic incentives, but will likely remain neglected by economic incentives even 10, 20 years from now. And therefore, will be left as the main source of societal scale and existential risk, yeah.

Lucas Perry: And then in terms of the taxonomy you created, can you briefly define the single and multi and the relationships those can have?

Andrew Critch: When I’m talking about AI delegation, I say single-single to mean single human-single AI system, or single human stakeholder and a single AI system. And I always refer to the number of humans first. So if I say single-multi, that means one human stakeholder, which might be a company or a person, and then multiple AI systems. And if I say multi-single, that’s multi human-single AI. And then multi-multi means multi human-multi AI. I started using this in an AGI safety course I was teaching at Berkeley in 2018 because I just noticed a lot of equivocation between students about which kind of scenarios they were thinking about. I think there’s a lot of multi-multi delegation work that is going to matter to industry because when you have a company selling a service to a user to do a job for an employer, things get multi-stakeholder pretty quickly. So I do think some aspects of multi-multi delegation will get addressed in industry, but I think they will be addressed in ways that are not designed to prevent existential risk. They will be addressed in ways that are designed to accrue profits.

Lucas Perry: And so some concepts that you also introduce here are those of control, instruction, and comprehension as being integral to AI delegation. Are those something you want to explore now?

Andrew Critch: Yeah, sure. I mean, those are pretty simple. Like when you delegate something to someone, Alice delegates to Bob, in order to make that decision, she needs to understand Bob, like what’s he capable of? What isn’t he? That’s human AI comprehension. Do we understand AI well enough to know what we should delegate? Then, there’s human AI instruction. Can Alice explain to Bob what she wants Bob to do? And can Bob understand that? Comprehension is really a conveyance of information from Bob to Alice. And then instruction is a conveyance of information from Alice to Bob. A lot of single-single alignment work is focused on how are we going to convey the information? Whereas transparency / interpretability work is more like the Bob to Alice direction. And then control is well, what if this whole idea of communication is wrong and we messed it up and we now just need to stop it, just take back the delegation. Like I was counting on my Gmail to send you emails, but now it’s sending you a bunch of spam. I’m going to shut down my account and I’ll send you messages a different way.

That’s control. And I think of any delegation relationship as involving at least those three concepts. There might be other ones that are really important that I’ve left out. But I see a lot of research as serving one of those three nodes. And so then, you could talk about single-single comprehension. Does this person understand this system? Or we can talk about multi-single. Does this team of people understand this system? Multi-single control would be, can this team of people collectively stop or take back the delegation from the system that they’ve been using or counting on? And then it goes to multi-multi and starts to raise questions like what does it mean for a group of people to understand something? Do they all understand individually? Or do they also have to be able to have a productive meeting about it? Maybe they need to be able to communicate with each other about it too for us to consider it to be a group level understanding. So those questions come up in the definition of multi-multi comprehension, and I think they’re going to be pretty important in the end.

Lucas Perry: All right. So we’ve talked a bunch here already about single-single delegation and much of technical alignment research explores this single human-single AI agent scenario. And that’s done because it’s conceptually simple and is perhaps the most simple place to start. So when we’re thinking about AI existential safety and AI existential risk, how is starting from single-single misleading and potentially not sufficient for deep insight into alignment?

Andrew Critch: Yeah, I guess I’ve said this multiple times in this podcast, how much I think diffusion of responsibility is going to play a role in leaving problems unsolved. And I think diffusion of responsibility only becomes visible in the multi-stakeholder or multi-system or both scenarios. That’s the simple answer.

Lucas Perry: So the single-single gets solved again by the commercial incentives and then the important place to analyze is the multi-multi.

Andrew Critch: Well, I wouldn’t simplify it as much as to say the important place to analyze is the multi-multi, because consider the following. If you build a house out of clay instead of out of wood, it’s going to fall apart more easily. And understanding clay could help you make that global decision. Similarly if your goal is to eventually produce societally safe, multi-multi delegation procedures for AI, you might want to start by studying the clay that that procedure is built out of, which is the single-single delegation steps. And single-single delegation steps require a certain degree of alignment between the delegator and the delegate. So it might be very important to start by figuring out the right building material for that, figuring out the right single-single delegation steps. And I know a lot of people are approaching it that way.

They’re working on single-single delegation, but that’s not because they think Netflix is never going to launch the Netflix challenge to figure out how to align recommender systems with users. It’s because the researchers who care about existential safety want to understand what I would call a single-single delegation, but what they would call the method of single-single alignment as a building block for what will be built next. But I sort of think different. I think that’s a great reasonable position to have. I think differently than that because I think the day that we have super powerful single-single alignment solutions is the day that it leaves the laboratory and rolls out into the economy. Like if you have very powerful AI systems that you can’t single-single align, you can’t ship a product because you can’t get it to do what anybody wants.

So I sort of think single-single alignment solutions sort of shorten the timeline. It’s like deja vu. When everyone was working on AI capabilities, the alignment people are saying, “Hey, we’re going to run out of time to figure out alignment. You’re going to have all of these capabilities and we’re not going to know how to align them. So let’s start thinking ahead about alignment.” I’m saying the same thing about alignment now. I’m saying once you get single-single alignment solutions, now your AI tech is leaving the lab and going into the economy because you can sell it. And now, you’ve run out of time to have solved the multipolar scenario problem. So I think there’s a bit of a rush to figure out the multi-stakeholder stuff before the single-single stuff gets all figured out.

Lucas Perry: Okay. So what you’re arguing for then here is what you call multi-multi preparedness.

Andrew Critch: Yeah.

Lucas Perry: Would you also like to state what the multiplicity thesis is?

Andrew Critch: Yeah. It’s the thing I just want to remind people of all the time, which is don’t forget, as soon as you make tech, you copy it, replicate it, modify it. The idea that we’re going to have a single-single system and not very shortly thereafter have other instances of it or other competitors to it, is sort of a fanciful unrealistic scenario. And I just like reminding people as we’re preparing for the future, let us prepare for the nearly inevitable eventuality that there will be multiple instances of any powerful technology. Some people take that as an argument that, “No, no, no. Actually, we should make the first instance so powerful that it prevents the creation of any other AI technology by any other actor.” And logically, that’s valid. I think politically and socially, I think it’s crazy.

Lucas Perry: Uh-huh (affirmative).

Andrew Critch: I think it’s a good way to alienate anybody that you want to work with on existential risk reduction to say, “Our plan is to take over the world and then save it.” Whereas if your plan is to say, “What principles can all AI technology adhere to, such that it in aggregate will not destroy the world,” you’re not taking over anything. You’re just figuring it out. Like if there’s 10 labs in the world all working on that, I’m not worried about one of them succeeding. But if there’s 10 labs in the world all working on the safe world takeover plan, I’m like, “Hmm, now I’m nervous that one of them will think that they’ve solved safe world takeover or something.” And I kind of want to convert them all to the other thing of safe delegation, safe integration with society.

Lucas Perry: So can you take us through the risk types that you develop in your paper that lead to unsurvivability for humanity from AI systems?

Andrew Critch: Yeah. So there’s a lot of stuff that people worry about. I noticed that some of the things people worry about sort of directly cause extinction if they happen. And then some of them are kind of like one degree of causal separation away from that. So I call it tier one risks in the paper; that refers to things that would just directly lead to the deployment of an unsurvivable or misaligned prepotent AI technology. And then tier two risks are risks that lead to tier one risks. So for example, if AI companies or countries are racing really hard to develop AI faster than each other, so much that they’re not taking into account safety to the other countries around them or the other companies around them, then you get a disproportionate prioritization of progress over safety. And then you get a higher risk of societal scale disasters, including existential risks but not limited to them.

And so you could say fierce competition between AI developers is a tier two risk that leads to the tier one risk of MPAI or UPAI deployment, MPAI being misaligned prepotent AI. And for tier one, I have this taxonomy that we use in the paper that I like for sort of dividing up tier one into a few different types that all, I think, have different technical approaches, because my goal is to sort of orient on technical research problems that could actually help reduce existential risk from AI. So we’ve got this subdivision. The first one we have is basically diffusion of responsibility, or sometimes we call it unaccountable creators. In the paper, we settled on calling it uncoordinated MPAI deployment.

So the deal is before talking about whether this or that AI system is doing what its creators want or don’t want, can we even identify who the creators are? If the creators were this kind of diffuse economy or oligarchy of companies or countries, it might not be meaningful to say, “Did the AI system do what its creators wanted?” Because maybe they all wanted a different thing. So risk type 1A is risks that arise from kind of nobody in particular being responsible for and therefore, no one in particular being attentive to preventing the existential risk.

Lucas Perry: That’s an uncoordinated MPAI event.

Andrew Critch: Yeah, exactly. I personally think most of the most likely risks come from that category, but they’re hard to define and I don’t know how to solve them yet. I don’t know if anybody does. But if you assume we’re not in that case, it’s not uncoordinated. Now, there’s a recognizable, identifiable institution: Alpha Corp made the AI, or America made the AI, or something like that. And now you can start asking, “Okay. If there’s this recognizable creator relationship, did the creator know that they were making a prepotent technology?” And that’s how we define type 1B. We’ve got creators, but the creators didn’t know that the tech they were making was going to be prepotent. Maybe they didn’t realize it was going to be replicated or used as much as it was, or it was going to be smarter than they thought for whatever reason. But it just ended up affecting the world a lot more than they thought or being more unstoppable than they thought.

If you make something that’s unstoppably transforming the world, which is what prepotent means, and you didn’t anticipate that, that’s bad. You’re making big waves and you didn’t even think about the direction the waves were going. So I think a lot of risk comes from making tech and not realizing how big its impact is going to be in advance. And so you could have things that become prepotent that we weren’t anticipating and a lot of risk comes from that. That’s a whole risk category. That’s 1B. We need good science and discipline for identifying prepotence or dependence or unstoppability or transformativity, all of these concepts. But suppose that’s solved, now we go to type 1C. There are creators, contrary to 1A, and the creators knew they’re making prepotent tech, contrary to 1B. And I think this is weird because a lot of people don’t want to make prepotent tech because it’s super risky, but you could imagine some groups doing it.

If they’re doing that, do they recognize that the thing they’re making is misaligned? Do they think, “Oh yeah, this is going to take over the world and protect everybody”? This is the “I tried to take over the world and I accidentally destroyed it” scenario. So that’s unrecognized misalignment or unrecognized unsurvivability as a category of risk. And for that, you just need a really good theory of alignment with your values if you don’t want to destroy the world. And that’s I think what gets people focused on single-single alignment. They’re like, “The world’s broken. I want to fix it. I want to make magic AI that will like fix the world. It has to do what I want though. So let’s focus on single-single alignment.”

But now suppose that problem is solved. Contrary to type 1A, you have discernible creators; contrary to 1B, they know they’re playing with fire; contrary to 1C, they know it’s misaligned. They know fire burns. That’s kind of plausible. If you imagine people messing with dangerous tech in order to figure out how to protect against it, you could have a lab with people sort of brewing up dangerous cyber attack systems that could break out and exercise a lot of social acumen if they were really powerful language users. Then you could imagine something getting out. So that’s what we call type 1D, involuntary MPAI deployment. Maybe it breaks out or maybe hackers break in and release it, but either way, the creators weren’t trying to do it. Then you have type 1E, which is, contrary to 1D, the creators wanted to deploy the MPAI.

So that’s just people trying to destroy the world. I think that’s less plausible in the short term, more plausible in the long term.

Lucas Perry: So all of these fall under the category of tier one in your paper. And so all of these directly lead to an existential catastrophe for humanity. You then have tier two, which are basically hazardous conditions, which lead to the realization of these tier one events. So could you take us through these conditions, which may act as a catalyst for eliciting the creation of tier one events in the world?

Andrew Critch: Yeah, so the nice thing about the tier one events is that we use an exhaustive decision tree for categorizing them. So any tier one event, any deployment event for a misaligned prepotent AI, will fall under one of categories 1A through 1E. Unfortunately, we don’t have such a taxonomy for tier two.
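As an illustration only (this is not code from the paper), the decision tree described in this part of the conversation can be sketched as a chain of yes/no questions, asked in the order 1A through 1E. The class name, field names, and the short wording for types 1B and 1E are assumptions made for this sketch; only the 1A, 1C, and 1D labels are taken from the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentEvent:
    """Hypothetical yes/no facts about one MPAI deployment event."""
    identifiable_creators: bool  # is there a recognizable creator at all?
    knew_prepotent: bool         # did the creators know the tech was prepotent?
    knew_misaligned: bool        # did the creators know it was misaligned?
    deployed_voluntarily: bool   # did the creators intend the deployment?


def classify_tier_one(event: DeploymentEvent) -> str:
    """Walk the questions in order; the first 'no' determines the risk type."""
    if not event.identifiable_creators:
        return "1A: uncoordinated MPAI deployment"
    if not event.knew_prepotent:
        return "1B: creators did not recognize prepotence"
    if not event.knew_misaligned:
        return "1C: unrecognized misalignment"
    if not event.deployed_voluntarily:
        return "1D: involuntary MPAI deployment"
    return "1E: creators deployed the MPAI on purpose"


# Example: identifiable creators who knew the system was prepotent
# but did not realize it was misaligned.
print(classify_tier_one(DeploymentEvent(True, True, False, False)))
```

Because every question has only two answers and the final branch catches whatever is left, every combination of answers lands in exactly one category, which is the sense in which the tree is exhaustive.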

So tier two is just the list of, hey, here’s four things that seem pretty worrisome. So 2A is companies or countries racing with each other, trying to make AI real fast and not being safe about it. 2B is economic displacement of humans. So people talk about unemployment risks from AI. Imagine that taken to an extreme where eventually humans just have no economic leverage at all, because all economic value is being produced by AI systems. AIs have taken all the jobs, including the CEO positions, including the board of directors positions, all using AIs as their delegates to go to the board meetings that are happening every five seconds because of how fast the AIs can have board meetings. Now, the humans are just like, “We’re just hoping that all that economy out there is not going to somehow use up all of the oxygen in the atmosphere, say, or lower the temperature of the earth by 30 degrees,” because of how much faster it would be to run supercomputers 30 degrees colder.

I think a lot of people who think about x-risk think of unemployment as this sort of mundane thing: every generation, there’s some wave of unemployment from some tech. That’s nothing compared to existential risk. But I sort of want to raise a flag here and say, one of the waves of unemployment could be the one that just takes away all human leverage and authority. We should be on the lookout for runaway unemployment that leads to prepotence because of loss of control. And then human enfeeblement, that’s 2C: the humans are still around, but getting weaker and dumber and less capable of stuff because we’re not practicing doing things because AI is doing everything for us. Then one day we just all trip and fall and hit our heads and die kind of thing. But more realistically, maybe we just fail to be able to make good decisions about what AI technology is doing. And we fail to notice we should be pressing the stop buttons everywhere.

Lucas Perry: The fruits of the utopia created by transformative AI are too enticing that we become enfeebled and fail at creating existential safety for advanced AI systems.

Andrew Critch: Or we use the systems in a stupid way because we all got worse at arithmetic and we couldn’t imagine the risks and we became scope insensitive to them or something. There’s a lot of different ways you can imagine humans just being weaker because AI is sort of helping us. And then type 2D is discourse impairment about existential safety. This is something we saw a lot of in 2014 before FLI hosted the Puerto Rico conference, to just kick off basically discourse on existential safety for AI and other big risks from AGI. Luckily, since then there have been efforts from FLI, and then the Concrete Problems in AI Safety paper was an early example of acknowledging negative outcomes.

And then you have the ACM push to acknowledge negative risks and now the NeurIPS broader impact stuff. There’s lots of negative acknowledgement now. The discourse around negative outcomes has improved, but I think discourse on existential safety has a long way to go. It’s progressed, but it still has a long way to go. If we keep not being able to talk about it, for example, if we keep having to call existential safety “safety,” right? If we keep having to call it that, because we’re afraid to admit to ourselves or each other that we’re thinking of existential stakes, we’re never really going to properly analyze the concept or visualize the outcomes together. I think there’s a big risk from just people sort of feeling like they’re thinking about existential safety, but not really saying it to each other and not really getting into the details of how society works at a large scale and therefore kind of ignoring it and making a bunch of bad decisions.

And I called that discourse impairment and it can happen because it’s taboo or it can happen because it’s just easier to talk about safety because safety is everywhere.

Lucas Perry: All right, so we’ve made it through to what is essentially the first third of your paper here. It lays much of the conceptual and language foundations, which are used for the rest, which try to more explicitly flesh out the research directions for existential safety on AI systems, correct?

Andrew Critch: Yeah. And I would say the later sections are a survey of research directions attacking different aspects and possibly exacerbating different aspects too. You earlier called this a research agenda. But I don’t think it’s quite right to call it an agenda because first of all, I’m not personally planning to research every topic in here, although I would be happy to research any of them. So this is not like, “Here’s the plan, we’re going to do all these areas.” It’s more like, “Here’s a survey of areas and an analysis of how they flow into each other.” For example, single-single transparency research, that can flow into coordination models for single-multi comprehension. It’s a view rather than a plan, because I think a plan should take into account more things like what’s neglected, what’s industry going to solve on its own?

My plan would be to pick sections out of this report and call those my agenda. My personal plan is to focus more on multi-agent stuff. Also some social metacognition stuff that I’m interested in. So if I wrote a research agenda, it would be about certain areas of this report, but the rest of the report is really just trying to look at all of these areas that I think relate to existential safety and kind of analyzing how they relate.

Lucas Perry: All right, Andrew, well, I must say that on page 33, it says, “This report may be viewed as a very coarse description of a very long term research agenda, aiming to understand and improve blah, blah, blah.”

Andrew Critch: It’s true. It may be viewed as such and you may have just viewed it as such.

Lucas Perry: Yeah, I think that’s where I got that language from.

Andrew Critch: It’s true. Yeah, and I think if an institution just picked up this report and said, “This is our agenda.” I’d be like, “Cool, go for it. That’s a great plan.”

Lucas Perry: All right. I’m just getting you back for nailing me on the definition of AI alignment.

Andrew Critch: Okay.

Lucas Perry: Let’s hit up on some of the most important or key aspects here then for this final part of the paper. We have three questions here. The first is, “How would you explain the core of your concerns about, and the importance of flow through effects?” What are flow-through effects and why are they important for considering AI existential safety?

Andrew Critch: Flow through effects just means if A affects B and B affects C, then indirectly A affects C. Effects like that can be pretty simple in physics, but they can be pretty complicated in medicine and they might be even more complicated in research. If you do research on single-single transparency, that’s going to flow through to single-multi instruction. How is a person going to instruct a hierarchy of machines? Can they delegate to the machines to delegate to other machines? Okay, now can I understand? Okay, cool. There’s a flow through effect there. Then that’s going to flow through to multi-multi control. How can you have a bunch of people instructing a bunch of machines and still have control over them? If the instructions aren’t being executed to satisfaction, or if they’re going to cause a big risk or something.
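As a rough illustration of the "A affects B, B affects C" chain described here, flow-through effects can be modeled as reachability in a directed graph of research areas. The graph below contains only the three areas named in this answer; the dictionary and function names are hypothetical and chosen for this sketch.

```python
from collections import deque

# Direct influences between research areas, as named in the conversation.
influences = {
    "single-single transparency": ["single-multi instruction"],
    "single-multi instruction": ["multi-multi control"],
    "multi-multi control": [],
}


def flow_through_effects(graph, start):
    """Return every area reachable from `start`, directly or indirectly."""
    reached = []
    queue = deque(graph.get(start, []))
    while queue:
        area = queue.popleft()
        if area == start or area in reached:
            continue  # skip areas that were already visited
        reached.append(area)
        queue.extend(graph.get(area, []))
    return reached


print(flow_through_effects(influences, "single-single transparency"))
# ['single-multi instruction', 'multi-multi control']
```

The sketch only tracks whether an effect exists, not whether it is good or bad; labeling each edge with a sign or a risk estimate would be a separate modeling choice.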

And some of those flow through effects can be good, some of them could be bad. For example, you can imagine work in transparency flowing through to really rapid development in single-multi instruction, because you can understand more of what all the little systems are doing. You can tell more of them what to do and get more stuff done. Then that could flow through to disasters in multi-multi control because you’ve got races between powerful institutions that are delegating to large numbers of individual systems that they understand separately, but the interactions of which at a global scale are not understood by any one institution. So then you just get this big cluster of pollution or other problems being caused for humans, as a side effect. Just thinking about a problem that’s a sub-problem of the final solution is not always helpful societally, even if it is helpful to you personally in understanding how to approach the helpful societal-scale solution. My personal biggest area of interest, I’m kind of split between two things.

One is, if you have a very powerful system and several stakeholders with very different priorities or beliefs, trying to decide a policy for that system. Imagine the U.S., China and Russia trying to reach an agreement on some global cybersecurity protocol that’s AI-mediated, or Uber and Waymo trying to agree on what principles their cars are going to follow when they’re doing lane changes. Are they going to try to intimidate each other to get a better chance at that lane change? Is that going to put the humans at risk? Yes, okay. Can we all not intimidate each other and therefore not put the passengers at risk? That’s a big question for me: how can you make systems that have powerful stakeholders in the process of negotiating for control over the system?

It’s like the system is not even deployed yet. We’re considering deploying it and we’re negotiating for the parameters of the system. I want the system to have a nice API for the negotiating powers to sort of turn knobs until they’re all satisfied with it. I call that negotiable AI. I’ve got a paper called Negotiable Reinforcement Learning with a student. I think that kind of encapsulates the problem, but it’s not a solution to the problem by any means. It’s merely drawing attention to it. That’s one core thing that I think is going to be really important: multi-stakeholder control. Not multi-stakeholder alignment, not making all the stakeholders happy, but making them work together in sharing the system, which might sometimes leave one of them unhappy. But at least they’re not all fighting and causing disasters from the externalities of their competition. The other one is almost the same principle, but where the negotiation is happening between the AI systems instead of the people.

So you have two AI systems, like System A and System B serving Alice and Bob, and Alice and Bob want very different things. Now A and B have to get along. How can A and B get along and broker an agreement about what to do that’s better than fighting? Both of these areas of research are kind of trying to make peace: the first case is peace between the human institutions controlling a powerful system, and the second case is peace between two AI systems. I don’t know how to do this at all. That’s why I try to focus on it. It’s sort of nobody’s job, except for maybe the UN and the UN doesn’t have… The cars getting along thing is kind of like a National Institute of Standards thing maybe, or a Partnership on AI thing maybe, so maybe they’ll address that, but it’s still super interesting to me and possibly generalizable to big, higher stakes issues.

So I don’t claim that it’s going to be completely neglected as an area. It’s just very interesting, and at a technical level it seems neglected. I think there’s lots of policy thinking about these issues, but what shape does the technology itself need to have to make it easy for policymakers to set the standards, for it to be sort of negotiable and cooperative? That’s where my interests lie.

Lucas Perry: All right. And so that also matches up with everything else you said, because those are two sub-problems of multi-multi situations.

Andrew Critch: Yes.

Lucas Perry: All right. So the next question is, is there anything else you’d like to add then on how thinking about AI research directions affects AI existential risk?

Andrew Critch: I guess I would just add, people need to feel permission to work on things because they need to understand them, rather than because they know that it’s going to help the world. I think there’s a lot of paranoia. Like, if you manage to care about existential risks, you’re thinking about these high stakes and it’s easy to become paranoid. What if I accidentally destroy the world by doing the wrong research or something? I don’t think that’s a healthy state for a researcher. Maybe for some it’s healthy, but I think for a lot of people that I’ve met, that’s not conducive to their productivity.

Lucas Perry: Is that something that you encounter a lot, people who have crippling anxiety over whether the research direction is correct?

Andrew Critch: Yeah, and varying degrees of crippling, some that you would actually call anxiety, the person’s experiencing actual anxiety. But more often it’s just a kind of festering unproductivity. It’s thinking of an area, “But that’s just going to advance capabilities, so I won’t work on it,” or like think of an area, it’s like, “Oh, that’s just going to hasten the economic deployment of AI systems, so I’m not going to work on it.” I do that kind of triage, but more so because I want to find neglected areas, rather than because I’m afraid of building the wrong tech or something. I find that mentality doesn’t inhibit my creativity or something. I want people to be aware of flow through effects and that any tech can flow through to have a negative impact that they didn’t expect. And because of that, I want everyone to sort of raise their overall vigilance towards AI technology as a whole. But I don’t want people to feel paralyzed like, “Oh no, what if I invent really good calibration for neural nets? Or what if I invent really good bounded rationality techniques and then accidentally destroy the world because people use them?”

I think what we need is for people to sort of go ahead and do their research, but just be aware that x-risk is on the horizon and start to build institutional structures to make higher and higher stakes decisions about AI deployments, along with being supportive of areas of research that are conducive to those decisions being made. I want to encourage people to go into these neglected areas that I’m describing, but I don’t want people to think I’m saying they’re bad for doing anything else.

Lucas Perry: All right. Well, that’s some good advice then for researchers. Let’s wrap up here then on important questions in relevant multi-stakeholder objectives. We have four here that we can explore. The first is facilitating collaborative governance and the next is avoiding races by sharing control. Then we have reducing idiosyncratic risk taking, and our final one is existential safety systems. Could you take us through each of these and how they are relevant multi-stakeholder objectives?

Andrew Critch: Yeah, sure. So the point of this section of the report, it’s a pause between the sections about research for single human stakeholders and research for multiple human stakeholders. It’s there to sort of explain why I think it’s important to think of multiple human stakeholders, and important not just in general. I mean, it’s obviously important for a lot of aspects of society, but I’m trying to focus on why it’s important to existential risk specifically.

So the first reason, facilitating collaborative governance, is that I think it’s good if people from different backgrounds with different beliefs and different priorities can work together in governing AI. If you need to decide on a national standard, if you need an international standard, if you need to decide on rules that AI is not allowed to break, or that developers are not allowed to break. It’s going to suck if researchers in China make up some rules and researchers in America make up different rules and the American rules don’t protect from the stuff that the Chinese rules protect from and the Chinese rules don’t protect from the stuff the American rules protect from. Moreover, the systems interacting with each other are going to not protect from either of those risks.

It’s good to be able to collaborate in governing things. Thinking about systems and technologies having a lot of stakeholders is key to preparing those technologies in a form that allows them to be collaborated over. Think about Google Docs. I can see your cursor moving when you write in a Google Doc. That’s really informative in a way that other collaborative document editing software does not allow. I don’t know if you’ve ever noticed how very informative it is to see where someone’s cursor is versus using another platform where you can only see the line someone’s on, but you can’t see what character they’re typing right now, you can’t see what word they’re thinking. You’re like way, way, way less in tune with each other when you’re writing together, when you can’t see the cursors.

That’s an example of a way in which Google Docs just has this extra feature that makes it way easier to negotiate for control, because you’re not getting into an edit war. If I’m editing something, I’m not going to put my cursor where your cursor is. Or if I start backspacing a word that you just wrote, you know I must mean that, it must be an important change. I just interrupted your cursor. Maybe you’re going to let me finish that backspace and see what the hell I’m doing. There’s this negotiability over the content of the document. It’s a consequence of the design of the interface. I think similarly AI technology could be designed with properties that make it easier for different stakeholders to cooperate in the act of exercising control over the system and its priorities. I think that sort of design question is key to facilitating collaborative governance because you can have stakeholders from different institutions, different cultures collaborating in the act of governing or controlling systems and observing what principles the systems need to adhere to for the purposes of different cultures or different values and so on.

Now, why is that important? Well, it’s lots of warm fuzzies from people working together and stuff. But one reason it’s important is that it reduces incentives to race. If we can all work together to set the speed limit, we don’t all have to drive as fast as we can to beat each other. That’s section 7.2, avoiding races by sharing control, and then section 7.3 is reducing idiosyncratic risk taking. Basically everybody kind of wants different things, but there’s a whole bunch of stuff we all don’t want. This kind of comes back to what you said about there being basic human values. Most of us don’t want humanity to go extinct. Most of us don’t want everyone to suffer greatly, but everybody kind of has a different view of what utopia should look like. That’s kind of maybe where the paretotopia concept came from.

It’s like everybody has a different utopia in mind, but nobody wants dystopia. If you imagine a powerful AI technology that might get deployed, and there’s a bunch of people on the committee deciding to make the deployment decision or deciding what features it should have, you can imagine one person on the committee being like, “Well, this poses a certain level of societal scale risk, but it’s worth it because of the anti-aging benefits that the AI is going to produce through the research, that’s going to be great.” Then another person on the committee being like, “Well, I don’t really care about anti-aging, but I do care about space travel. I want it to take a risk for that.” Then they’re like, “Wait a minute, I think we have this science assistant AI. We should use it on anti-aging, not space.” And the space travel person’s like, “We should use it on space travel, not anti-aging.”

Because of that, they don’t agree, and that slows progress, but maybe a little slower progress is a safer thing for humanity. Everyone has their agenda that they want to risk the world for, but because everyone disagrees on what risks are worth it, you sort of slow down and say, “Maybe collectively, we’re just not going to take any of these risks right now and we’ll just wait until we can do it with less risk.” So reducing idiosyncratic risk taking is just my phrase for the way everyone’s individual desire to take risks kind of averages out. Whereas every member of the committee doesn’t want human extinction, so that doesn’t get washed out. It’s like everybody wants it to not destroy the world. Whereas not everybody wants it to colonize space or not everybody wants it to cure aging. You end up conservative on the risk, if you can collaboratively govern.

Then you’ve got existential safety systems, which is the last thing. If we did someday try to build AI tech that actually protects the world in some way, like say through cybersecurity or through environmental protection, that’s terrifying by the way, AI that controls the environment. But anyway, it’s also really promising, maybe we can clean up. It’s just a big move; ceding control of the environment to AI systems is a big move. But as long as you’ve got lots of off switches, maybe it’s great. Those big moves are scary because of how big they are. A lot of institutions would just never allow it to happen because of how scary it is. It’s like, “All right, I’ve got this garbage cleanup AI that’s just going to actually go clean up all the garbage, or it’s going to scrub all the CO2 with this little replicating photosynthetic lab here. That’s going to absorb all the carbon dioxide and store it as biofuel. Great.” That’s scary. You’re like, whoa, you’re just unrolling the self-replicating biofuel lab all over the world. People won’t let that happen.

I’m not sure what the right level of risk tolerance is for saving the world versus risking the world. But whatever it is, you are going to want existential safety safety nets, literal existential safety nets there to protect from big disasters. Whether the system is just an algorithm that runs on the robots that are doing whatever crazy world intervention you’re doing, or if it’s actually a separate system. But if you’re making a big change to the world for the sake of existential safety, you’re not going to get away with it unless a lot of people are involved in that decision. This is kind of a bid to the people who really do want to make big world interventions. Sometimes for the sake of safety, you’re going to have to appeal to a lot of stakeholders to sort of be allowed to do that.

So those are four reasons why I think developing your tech in a way that really is compatible with multiple stakeholders is going to be societally important and not automatically solved by industry standards. Maybe solved in special cases that are profitable, but not necessarily generalizable to these issues.

Lucas Perry: Yeah, the set of problems that are not naturally solved by industry and incentives, but that are crucial for existential safety are the set of problems it seems that we crucially need to identify and anticipate and engage in research today. Being mindful of flow through effects, such that we’re able to have as much leverage on that set of problems, given that they’re most likely not to be solved without a lot of foresight and intervention from the outside of industry and the normal flow of incentives.

Andrew Critch: Yep, exactly.

Lucas Perry: All right, Andrew, wrapping things up, I just want to offer you a final bit of space to give any final words you’d like to say about the paper or AI existential risk, if there’s anything you feel is unresolved or you’d really like to communicate to everyone.

Andrew Critch: Yeah, thanks. I’d say if you’re interested in existential safety or something adjacent to it, use specific words for what you mean instead of just calling it AI safety all the time. Whatever your thing is, maybe it’s not existential safety, maybe it’s a societal scale risk or single-multi alignment or something, but try to get more specific about what we’re interested in. So that it’s easier for newcomers thinking about these topics, to know what we mean when we say them.

Lucas Perry: All right. If people want to follow you or get in touch or find your papers and work, where are the best places to do that?

Andrew Critch: For me personally, or David Krueger, the other coauthor on this report, you could just Google our names and then our research homepages will show up and you can see what our papers are, or obviously Google Scholar is always a good avenue. Google Scholar sorted by year is a good trick because you can see what people are working on now, but there’s also the Center for Human Compatible AI where I work. There’s a bunch of other research going on there that I’m not doing, but I’m also still very interested in, and I’d probably be interested in doing more research in that vein. I would say check out or, for me personally. I don’t know what David’s homepage is, but I’m sure you can find it by Googling David Krueger.

Lucas Perry: All right, Andrew, thanks so much for coming on and for your paper, I feel like I honestly gained a lot of perspective here on the need for clarity on definitions and what we mean. You’ve given me a better perspective on the kind of problem that we have and the kind of solutions that it might require and so for that, I’m grateful.

Andrew Critch: Thanks.

End of recorded material

Iason Gabriel on Foundational Philosophical Questions in AI Alignment

 Topics discussed in this episode include:

  • How moral philosophy and political theory are deeply related to AI alignment
  • The problem of dealing with a plurality of preferences and philosophical views in AI alignment
  • How the is-ought problem and metaethics fit into alignment
  • What we should be aligning AI systems to
  • The importance of democratic solutions to questions of AI alignment 
  • The long reflection



0:00 Intro

2:10 Why Iason wrote Artificial Intelligence, Values and Alignment

3:12 What AI alignment is

6:07 The technical and normative aspects of AI alignment

9:11 The normative being dependent on the technical

14:30 Coming up with an appropriate alignment procedure given the is-ought problem

31:15 What systems are subject to an alignment procedure?

39:55 What is it that we’re trying to align AI systems to?

01:02:30 Single agent and multi agent alignment scenarios

01:27:00 What is the procedure for choosing which evaluative model(s) will be used to judge different alignment proposals

01:30:28 The long reflection

01:53:55 Where to follow and contact Iason



Artificial Intelligence, Values and Alignment 

Iason Gabriel’s Google Scholar


Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today, we have a conversation with Iason Gabriel about a recent paper that he wrote titled Artificial Intelligence, Values and Alignment. This episode primarily explores how moral and political theory are deeply interconnected with the technical side of the AI alignment problem, and important questions related to that interconnection. We get into the problem of dealing with a plurality of preferences and philosophical views, the is-ought problem, metaethics, how political theory can be helpful for resolving disagreements, what it is that we’re trying to align AIs to, the importance of establishing a broadly endorsed procedure and set of principles for alignment, and we end on exploring the long reflection.

This was a very fun and informative episode. Iason has succeeded in bringing new ideas and thought to the space of moral and political thought in AI alignment, and I think you’ll find this episode enjoyable and valuable. If you don’t already follow us, you can subscribe to this podcast on your preferred podcasting platform by searching for The Future of Life or following the links on the page for this podcast.

Iason Gabriel is a Senior Research Scientist at DeepMind where he works in the Ethics Research Team. His research focuses on the applied ethics of artificial intelligence, human rights, and the question of how to align technology with human values. Before joining DeepMind, Iason was a Fellow in Politics at St John’s College, Oxford. He holds a doctorate in Political Theory from the University of Oxford and spent a number of years working for the United Nations in post-conflict environments.

And with that, let’s get into our conversation with Iason Gabriel.

So we’re here today to discuss your paper, Artificial Intelligence, Values and Alignment. To start things off here, I’m interested to know what you found so compelling about the problem of AI values and alignment, and generally, just what this paper is all about.

Iason Gabriel: Yeah. Thank you so much for inviting me, Lucas. So this paper is in broad brush strokes about how we might think about aligning AI systems with human values. And I wrote this paper because I wanted to bring different communities together. So on the one hand, I wanted to show machine learning researchers that there were some interesting normative questions about the value configuration we align AI with that deserve further attention. At the same time, I was keen to show political and moral philosophers that AI was a subject that provoked real philosophical reflection, and that this is an enterprise that is worthy of their time as well.

Lucas Perry: Let’s pivot into what the problem is then that technical researchers and people interested in normative questions and philosophy can both contribute to. So what is your view then on what the AI problem is, and on the two parts you believe it to be composed of?

Iason Gabriel: In broad brush strokes, I understand the challenge of value alignment in a way that’s similar to Stuart Russell. He says that the ultimate aim is to ensure that powerful AI is properly aligned with human values. I think that when we reflect upon this in more detail, it becomes clear that the problem decomposes into two separate parts. The first is the technical challenge of trying to align powerful AI systems with human values. And the second is the normative question of what or whose values we try to align AI systems with.

Lucas Perry: Oftentimes, I also see a lot of reflection on AI policy and AI governance as being a core issue to also consider here, given that people are concerned about things like race dynamics and unipolar versus multipolar scenarios with regards to something like AGI. What are your thoughts on this? And I’m curious to know why you break it down into technical and normative without introducing political or governance issues.

Iason Gabriel: Yeah. So this is a really interesting question, and I think it’s one that we’ll probably discuss at some length later, about the role of politics in creating aligned AI systems. Of course, in the paper, I suggest that an important challenge for people who are thinking about value alignment is how to reconcile the different views and opinions of people, given that we live in a pluralistic world, and how to come up with a system for aligning AI systems that treats people fairly despite that difference. In terms of practicalities, I think that people envisage alignment in different ways. Some people imagine that there will be a human parliament or a kind of centralized body that can give very coherent and sound value advice to AI systems. And essentially, that the human element will take care of this problem with pluralism and just give AI very, very robust guidance about things that we’ve all agreed upon are the best thing to do.

At the same time, there are many other visions for AI or versions of AI that don’t depend upon that human parliament being able to offer such cogent advice. So we might think that there are worlds in which there are multiple AIs, each of which has a human interlocutor, or we might imagine an AI working in the world to achieve constructive ends, one that needs to actually be able to perform these value calculations or this value synthesis as part of its kind of default operating procedure. And I think it’s an open question what kind of AI system we’re discussing and that probably the political element understood in terms of real world political institutions will need to be tailored to the vision of AI that we have in question.

Lucas Perry: All right. So can you expand then a bit on the relationship between the technical and normative aspects of AI alignment?

Iason Gabriel: A lot of the focus is on the normative part of the value alignment question, trying to work out which values to align AI systems with, whether it is values that really matter and how this can be decided. I think this is also relevant when we think about the technical design of AI systems, because I think that most technologies are not value agnostic. So sometimes, when we think about AI systems, we assume that they’ll have this general capability and that it will almost be trivially easy for them to align with different moral perspectives or theories. Yet when we take a ground level view and we look at the way in which AI systems are being built, there’s various path dependencies that are setting in and there’s different design architectures that will make it easier to follow one moral trajectory rather than the other.

So for example, if we take a reinforcement learning paradigm, which focuses on teaching agents tasks by enabling them to maximize reward in the face of uncertainty over time, a number of commentators have suggested that that model fits particularly well with a kind of utilitarian decision theory, which aims to promote happiness over time in the face of uncertainty, and that it would actually struggle to accommodate a moral theory that embodies something like rights or hard constraints. And so I think that if what we do want is a rights based vision of artificial intelligence, it’s important that we get that ideal clear in our minds and that we design with that purpose in mind.

This challenge becomes even clearer when we think about moral philosophies, such as a Kantian theory, which would ask an agent to reflect on the reasons that it has for acting, and then ask whether they universalize to good states of affairs. And this idea of using the currency of a reason to conduct moral deliberation would require some advances in terms of how we think about AI, and it’s not something that is very easy to get a handle on from a technical point of view.

Lucas Perry: So the key takeaway here is that what is going to be possible in terms of the normative and in terms of moral learning and moral reasoning in AI systems will supervene upon technical pathways that we take, and so it is important to be mindful of the relationship between what is possible normatively, given what is technically known, and to try and navigate that with mindfulness about that relationship?

Iason Gabriel: I think that’s precisely right. I see at least two relationships here. So the first is that if we design without a conception of value in mind, it’s likely that the technology that we build will not be able to accommodate any value constellation. And then the mirror side of that is if we have a clear value constellation in mind, we may be able to develop technologies that can actually implement or realize that ideal more directly and more effectively.

Lucas Perry: Can you make a bit more clear the ways in which, for example, path dependency of current technical research makes certain normative ethical theories more plausible to be instantiated in AI systems than others?

Iason Gabriel: Yeah. So, I should say that obviously, there’s a wide variety of different methodologies that are being tried at the present moment, and that intuitively, they seem to match up well with different kinds of theory. Of course, the reality is a lot of effort has been spent trying to ensure that AI systems are safe and that they are aligned with human intentions. When it comes to richer goals, so trying to evidence a specific moral theory, a lot of this is conjecture because we haven’t really tried to build utilitarian or Kantian agents in full. But I think in terms of the details, so with regards to reinforcement learning, we obviously have an optimization-driven process, and there is that whole corpus of moral theories that basically use that decision process to achieve good states of affairs. And we can imagine roughly equating the reward that we use to train an RL agent with some metric of subjective happiness, or something like that.

Now, if we were to take a completely different approach, so say, virtue ethics, virtue ethics is radically contextual, obviously. And it says that the right thing to do in any situation is the action that evidences certain qualities of character and that these qualities can’t be expressed through a simple formula that we can maximize for, but actually require a kind of context dependence. So I think that if that’s what we want, if we want to build agents that have a virtuous character, we would really need to think about the fundamental architecture potentially in a different way. And I think that that kind of insight has actually been speculatively adopted by people who consider forms of machine learning, like inverse reinforcement learning, and who imagine that we could present an agent with examples of good behavior and that the agent would then learn them in a very nuanced way without us ever having to describe in full what the action was or give it appropriate guidance for every situation.

So, as I said, these really are quite tentative thoughts, but it doesn’t seem at present possible to build an AI system that adapts equally well to whatever moral theory or perspective we believe ought to be promoted or endorsed.

Lucas Perry: Yeah. So, that does make sense to me that different techniques would be more or less skillful for more readily and fully adopting certain normative perspectives and capacities in ethics. I guess the part that I was just getting a little bit tripped up on is that I was imagining that if you have an optimizer being trained off something like maximize happiness, then given the massive epistemic difficulties of running an actual utilitarian optimization process that is only thinking at the level of happiness, and how impossibly difficult that would be, it would then, like human beings who are consequentialists, through gradient descent or being pushed and nudged from the outside or something, find virtue ethics and deontological ethics, and those could then be run as a part of its world model, such that it makes the task of happiness optimization much easier. But I see how intuitively it more obviously lines up with utilitarianism and then how it would be more difficult to get it to find other things that we care about, like virtue ethics or deontological ethics. Does that make sense?

Iason Gabriel: Yeah. I mean, it’s a very interesting conjecture that if you set an agent off with the learned goal of trying to maximize human happiness, it would almost, by necessity, learn to accommodate other moral theories and perspectives. It kind of suggests that there is a core driver which animates moral inquiry, which is this idea of collective welfare being realized in a sustainable way. And that might be plausible from an evolutionary point of view, but there are also other aspects of morality that don’t seem to be built so clearly on what we might even call the pleasure principle. And so I’m not entirely sure that you would actually get to a rights based morality if you started out from those premises.

Lucas Perry: What are some of these things that don’t line up with this pleasure principle, for example?

Iason Gabriel: I mean, of course, utilitarians have many sophisticated theories about how endeavors to improve total aggregate happiness involve treating people fairly, placing robust side constraints on what you can do to people, and potentially even encompassing other goods, such as animal welfare and the wellbeing of future generations. But I believe that the consensus or the preponderance of opinion is that actually, unless we can say that certain things matter fundamentally, for example, human dignity or the wellbeing of future generations or the value of animal welfare, it is quite hard to build a moral edifice that adequately takes these things into account just through instrumental relationships with human wellbeing or human happiness so understood.

Lucas Perry: So then we have this technical problem of how to build machines that have the capacity to do what we want them to do and to help us figure out what we would want to get the machines to do. An important problem that comes in here is Hume’s is-ought distinction. On one hand, we have is statements, facts about the world; we can even have is statements about people’s preferences and meta-preferences and the collective state of all normative and meta-ethical views on the planet at a given time. On the other hand, there is ought, which is a normative claim synonymous with should and is kind of the basis of morality. There’s a tension there between what assumptions we might need to get morality off of the ground and how we should interact with a world of facts and a world of norms, and how they may or may not relate to each other for creating a science of wellbeing or not even doing that. So how do you think of coming up with an appropriate alignment procedure that is dependent on the answer to this distinction?

Iason Gabriel: Yeah, so that’s a fascinating question. So I think that the is-ought distinction is quite fundamental and it helps us answer one important query, which is whether it’s possible to solve the value alignment question simply through an empirical investigation of people’s existing beliefs and practices. And if you take the is-ought distinction seriously, it suggests that no matter what we can infer from studies of what is already the case, so what people happen to prefer or happen to be doing, we still have a further question, which is should that perspective be endorsed? Is it actually the right thing to do? And so there’s always this critical gap. It’s a space for moral reflection and moral introspection and a place in which error can arise. So we might even think that if we studied all the global beliefs of different people and found that they agreed upon certain axioms or moral properties, we could still ask, are they correct about those things? And if we look at historical beliefs, we might think that there was actually a global consensus on moral beliefs or values that turned out to be mistaken.

So I think that these endeavors to kind of synthesize moral beliefs to understand them properly are very, very valuable resources for moral theorizing. It’s hard to think where else we would begin, but ultimately, we do need to ask these questions about value more directly and ask whether we think that the final elucidation of an idea is something that ought to be promoted.

So in sum, it has a number of consequences, but I think one of them is that we do need to maintain a space for normative inquiry and value alignment can’t just be addressed through an empirical social scientific perspective.

Lucas Perry: Right, because one’s own perspective on the is-ought distinction and whether and how it is valid will change how one goes about learning and evolving normative and meta-ethical thinking.

Iason Gabriel: Yeah. Perhaps at this point, an example will be helpful. So, suppose we’re trying to train a virtuous agent that has these characteristics of treating people fairly, demonstrating humility, wisdom, and things of that nature. Suppose we can’t specify these upfront and we do need a training set, so we need to present the agent with examples of what people believe evidences these characteristics. We still have the normative question of what goes into that data set and how we decide. So, the evaluative questions get passed on to that. Of course, we’ve seen many examples of data sets being poorly curated and containing bias that then transmutes onto the AI system. We either need to have data that’s curated so that it meets independent moral standards and the AI learns from that data, or we need to have a moral ideal that is freestanding in some sense and that AI can be built to align with.

Lucas Perry: Let’s try and make that even more concrete because I think this is a really interesting and important problem about why the technical aspect is deeply related with philosophical thinking about this is-ought problem. So at the highest level of abstraction, like starting with axioms around here, we have is statements about data sets. Data sets are just information about the world; the data sets are the is statements. We can put whatever is statements into a machine and the machine can take the shape of those values already embedded and codified in the world, in people’s minds or in our artifacts and culture. And then the ought question, as you said, is what information in the world should we use? And to understand what information we should use requires some initial principle, some set of axioms that bridges the is-ought gap.

So for example, the kind of move that I think Sam Harris tries to lay out is this axiom: we should avoid the worst possible misery for everyone. You may or may not agree with that axiom, but that is the starting point for how one might bridge the is-ought gap to be able to select which data is better than other data or which data we should load onto AI systems. So I’m curious to know, how is it that you think about this very fundamental level of initial axiom or axioms that are meant to bridge this distinction?

Iason Gabriel: I think that when it comes to these questions of value, we could try and build up from these kinds of very, very minimalist assumptions of the kind that it sounds like Sam Harris is defending. We could also start with richer conceptions of value that seem to have some measure of widespread assent and reflective endorsement. So I think, for example, the idea that human life matters or that sentient life matters, that it has value and hence that suffering is bad, is a really important component of that. I think that conceptions of fairness, of what people deserve in light of that equal moral standing, are also an important part of the moral content of building an aligned AI system. And I would tend to try and be inclusive in terms of the values that we canvass.

So I don’t think that we actually need to take this very defensive posture. I think we can think expansively about the conception and nature of the good that we want to promote and that we can actually have meaningful discussions and debate about that so we can put forward reasons for defending one set of propositions in comparison with another.

Lucas Perry: We can have epistemic humility here, given the history of moral catastrophes and how morality continues to improve and change over time and that surely, we do not sit at a peak of moral enlightenment in 2020. So given our epistemic humility, we can cast a wide net around many different principles so that we don’t lock ourselves into anything and can endorse a broad notion of good, which seems safer, but perhaps has some costs in itself for being more permissive of a wide range of moral views that may not be correct.

Iason Gabriel: I think that’s, broadly speaking, correct. We definitely shouldn’t tether artificial intelligence too narrowly to the morality of the present moment, given that we may be, and probably are, making moral mistakes of one kind or another. And I think that this thing that you spoke about, a kind of global conversation about value, is exactly right. I mean, if we take insights from political theory seriously, then the philosopher John Rawls suggests that a fundamental element of the present human condition is what he calls the fact of reasonable pluralism, which means that when people are not coerced and when they’re able to deliberate freely, they will come to different conclusions about what ultimately has moral value and how we should characterize ought statements, at least when they apply to our own personal lives.

So if we start from that premise, we can then think about AI as a shared project and ask this question, which is given that we do need values in the equation, that we can’t just do some kind of descriptive enterprise and that, that will tell us what kind of system to build, what kind of arrangement adequately factors in people’s different views and perspectives, and seems like a solution built upon the relevant kind of consensus to value alignment that then allows us to realize a system that can reconcile these different moral perspectives and takes a variety of different values and synthesizes them in a scheme that we would all like.

Lucas Perry: I just feel broadly interested in just introducing a little bit more of the debate and conceptions around the is-ought problem, right? Because there are some people who take it very seriously and other people who try to minimize it or are skeptical of it doing the kind of philosophical work that many people think that it’s doing. For example, Sam Harris is a big skeptic of the kind of work that the is-ought problem is doing. And in this podcast, we’ve had people on who are, for example, realists about consciousness, and there’s just a very interesting broad range of views about value that inform the is-ought problem. If one’s a realist about consciousness and thinks that suffering is the intrinsic valence carrier of disvalue in the universe, and that joy is the intrinsic valence carrier of wellbeing, one can have different views on how that even translates to normative ethics and morality and how one does that, given one’s view on the is-ought problem.

So, for example, if we take that kind of metaphysical view about consciousness seriously, and we also take the is-ought problem seriously, then even though there are actually bad things in the world, like suffering, it would still require some kind of axiom to bridge the is-ought distinction: because pain is bad, we ought to avoid it. And that’s interesting and important and a question that is at the core of unifying ethics and all of our endeavors in life. And if you don’t take the is-ought problem seriously, then you can just be like, because I understand the way that the world is, by the very nature of being a sentient being and understanding the nature of suffering, there’s no question about the kind of navigation problem that I have. Even in the very long-term, the answer to how one might resolve the is-ought problem would potentially be a way of unifying all of knowledge and endeavor. All the empirical sciences would be unified conceptually with the normative, right? And then there are no more conceptual issues.

So, I think I’m just trying to illustrate the power of this problem and distinction, it seems.

Iason Gabriel: It’s a very interesting set of ideas. To my mind, these kinds of arguments about the intrinsic badness of pain, or kind of naturalistic moral arguments, are very strong ways of arguing against, say, moral relativists or moral nihilists, but they don’t necessarily circumvent the is-ought distinction. Because, for example, the claim that pain is bad is referring to a normative property. So if you say pain is bad, therefore, it shouldn’t be promoted, that’s completely compatible with believing that we can’t deduce moral arguments from purely descriptive premises. So I don’t really believe that the is-ought distinction is a problem. I think that it’s always possible to make arguments about values and that that’s precisely what we should be doing. And the fact that that needs to be conjoined with empirical data in order to then arrive at sensible judgments and practical reason about what ought to be done is a really satisfactory state of affairs.

I think one kind of interesting aspect of the vision you put forward was this idea of a kind of unified moral theory that everyone agrees with. And I guess it does touch upon a number of arguments that I make in the paper, where I juxtapose two slightly stylized descriptions of solutions to the value alignment challenge. The first one is, of course, the approach that I term the true moral theory approach, which holds that we do need a period of prolonged reflection and we reflect fundamentally on these questions about pain and perhaps other very deep normative questions. And the idea is that by using tools from moral philosophy, eventually, although we haven’t done it yet, we may identify a true moral theory. And then it’s a relatively simple… well, not simple from a technical point of view, but simple from a normative point of view, task of aligning AI, maybe even AGI, with that theory, and we’ve basically solved the value alignment problem.

So in the paper, I argue against that view quite strongly for a number of reasons. The first is that I’m not sure how we would ever know that we’d identified this true moral theory. Of course, many people throughout history have thought that they’ve discovered this thing and often gone on to do profoundly unethical things to other people. And I’m not sure how, even after a prolonged period of time, we would actually have confidence that we had arrived at the really true thing and that we couldn’t still ask the question, am I right?

But even putting that to one side, suppose that I had not just confidence, but justified confidence that I really had stumbled upon the true moral theory and perhaps with the help of AI, I could look at how it plays out in a number of different circumstances, and I realize that it doesn’t lead to these kind of weird, anomalous situations that most existing moral theories point towards, and so I really am confident that it’s a good one, we still have this question of what happens when we need to persuade other people that we’ve found the true moral theory and whether that is a further condition on an acceptable solution to the value alignment problem. And in the paper, I say that it is a further condition that needs to be satisfied because just knowing, well, supposedly having access to justified belief in a true moral theory, doesn’t necessarily give you the right to impose that view upon other people, particularly if you’re building a very powerful technology that has world shaping properties.

And if we return to this idea of reasonable pluralism that I spoke about earlier, essentially, the core claim is that unless we coerce people, we can’t get to a situation where everyone agrees on matters of morality. We could flip it around. It might be that someone already has the true moral theory out there in the world today and that we’re the people who refuse to accept it for different reasons. I think the question then is how do we believe other people should be treated by the possessor of the theory, or how do we believe that person should treat us?

Now, one view that I guess in political philosophy is often attributed to Jean-Jacques Rousseau is that if you have this really good theory, you’re justified in coercing other people to live by it. He says that people should be forced to be free when they’re not willing to accept the truth of the moral theory. Of course, it’s something that has come in for fierce criticism. I mean, my own perspective is that actually, we need to try and minimize this challenge of value imposition for powerful technologies because it becomes a form of domination. So the question is how can we solve the value alignment problem in a way that avoids this challenge of domination? And in that regard, we really do need tools from political philosophy, which, particularly within the liberal tradition, has tried to answer this question of how we can all live together on reasonable terms that preserve everyone’s capacity to flourish, despite the fact that we have variation in what we ultimately believe to be just, true and right.

Lucas Perry: So to bring things a bit back to where we’re at today and how things are actually going to start changing in the real world as we move forward. What do you view as the kinds of systems that would be, and are subject to something like an alignment procedure? Does this start with systems that we currently have today? Does it start with systems soon in the future? Should it have been done with systems that we already have today, but we failed to do so? What is your perspective on that?

Iason Gabriel: To my mind, the challenge of value alignment is one that exists for the vast majority of, if not all, technologies. And it’s one that’s becoming more pronounced as these technologies demonstrate higher levels of complexity and autonomy. So for example, I believe that many existing machine learning systems encounter this challenge quite forcefully, and that we can ask meaningful questions about it. So I think in a previous discussion, we may have had this example of a recommendation system come to light. And even if we think of something that seems really quite prosaic, so say a recommendation system for what films to watch or what content to be provided to you, I think the value alignment question actually looms large because it could be designed to do very different things. On the one hand, we might have a recommendation system that’s geared around your current first order preferences. So it might continuously give you really stimulating, really fun, low quality content that kind of keeps you hooked to the system and with a high level of subjective wellbeing, but perhaps something that isn’t optimum in other regards. Then we can think about other possible goals for alignment.

So we might say that actually these systems should be built to serve your second order desires. Those are desires that in philosophy, we would say that people reflectively endorse; they’re desires about the person you want to be. So if we were to build a recommendation system with that goal in mind, it might be that instead of watching this kind of cheap and cheerful content, I decide that I’d actually like to be quite a high brow person. So it starts kind of tacitly providing me with more art house recommendations. But even that doesn’t exhaust the options. It might be that the system shouldn’t really be just trying to satisfy my preferences, that it should actually be trying to steer me in the direction of knowledge and things that are in my interest to know. So it might try and give me new skills that I need to acquire, might try and recommend, I don’t know, cooking or self improvement programs.

That would be a system that was, I guess, geared toward my own interest. But even that, again, doesn’t give us a complete portfolio of options. Maybe what we want is a morally aligned system that actually enhances our capacity for moral decision making. And then perhaps that would lead us somewhere completely different. So instead of giving us this content that we want, it might lead us to content that leads us to engage with challenging moral questions, such as factory farming or climate change. So, value alignment kind of arises quite early on. This is, of course, with the assumption that the recommendation system is geared to promote your interest or wellbeing or preference or moral sensibility. There’s also the question of whether it’s really promoting your goals and aspirations or someone else’s. And in science and technology studies there’s a big area of value sensitive design, which essentially says that we need to consult people and have these almost democratic discussions early on about the kind of values we want to embody in systems.

And then we design with that goal in mind. So, recommendation systems are one thing. Of course, if we look at public institutions, say a criminal justice system, there we have a lot of public uproar and discussion about the values that would make a system like that fair. And the challenge then is to work out whether there is a technical approximation of these values that satisfactorily realizes them in a way that conduces to some vision of the public good. So in sum, I think that value alignment challenges exist everywhere, and they become more pronounced when these technologies become more autonomous and more powerful. So as they have more profound effects on our lives, the burden of justification in terms of the moral standards that are being met becomes more exacting. And the kind of justification we can give for the design of a technology becomes more important.

Lucas Perry: I guess, to bring this back to things that exist today, something like YouTube or Facebook is a very rudimentary, very basic first-order preference satisfier. I mean, imagine all of the human life years that have been wasted mindlessly consuming content that’s not actually good for us. Whereas imagine, I guess, some kind of enlightened version of YouTube where it knows enough about what is good, about yourself, and about what you would reflectively and ideally endorse and the kind of person that you wish you could be and that you would be if only you knew better and how to get there. So, consider the differences between that second kind of system and the first system, where one is just giving you all the best cat videos in the world, and the second one is turning you into the person that you always wish you could have been. I think this clearly demonstrates that even systems that seem mundane could be serving us in much deeper ways and at much deeper levels. And that even when they superficially serve us they may be doing harm.

Iason Gabriel: Yeah, I think that’s a really profound observation. I mean, when we really look at the full scope of value or the full picture of the kinds of values we could seek to realize when designing technologies and incorporating them into our lives, often there’s a radically expansive picture that emerges. And this touches upon a kind of taxonomic distinction that I introduce in the paper between minimalist and maximalist conceptions of value alignment. So when we think about AI alignment questions, the minimalist says we have to avoid very bad outcomes. So it’s important to build safe systems. And then we just need them to reside within some space of value that isn’t extremely negative and could take a number of different constellations. Whereas the maximalist says, “Well, let’s actually try and design the very best version of these technologies from a moral point of view, from a human point of view.”

And they say that even if we design safe technologies, we could still be leaving a lot of value out there on the table. So a technology could be safe, but still not that good for you or that good for the world. And let’s aim to populate that space with more positive and richer visions of the future. And then try to realize those through the technologies that we’re building. As we want to realize richer visions of human flourishing, it becomes more important that it isn’t just a personal goal or vision, but it’s one that is collectively endorsed, has been reflected upon and is justifiable from a variety of different points of view.

Lucas Perry: Right. And I guess it’s just also interesting and valuable to reflect briefly on how there is already, in each society, a place where we draw the line at value imposition, and we have these principles, which we’ve agreed upon broadly, but we’re not going to let Ted Bundy do what Ted Bundy wants to do.

Iason Gabriel: That’s exactly right. So we have hard constraints, some of which are kind of set in law. And clearly those are constraints that, where these are just laws, AI systems need to respect. There’s also a huge possible space of better outcomes that is left open once we look at where moral constraints are placed and where they reside. I think that the Ted Bundy example is interesting because it also shows that we need to discount the preferences and desires of certain people.

One vision of AI alignment says that it’s basically a global preference aggregation system that we need, but in reality, there’s a lot of preferences that just shouldn’t be counted in the first place because they’re unethical or they’re misinformed. So again, that, to my mind, pushes us in this direction of a conversation about value itself. And once we know what the principled basis for alignment is, we can then properly adjudicate cases like that and work out what a kind of valid input for an aligned system is and what things we need to discount if we want to realize good moral outcomes.

Lucas Perry: I’m not going to try and pin you down too hard on that because there’s the tension here, of course, between the importance of liberalism, not coercing value judgments on anyone, but then also being like, “Well, we actually have to do it in some places.” And that line is a scary one to move in either direction. So, I want to explore more now the different understandings of what it is that we’re trying to align AI systems to. So broadly people and I use a lot of different words here without perhaps being super specific about what we mean, people talk about values and intentions and idealized preferences and things of this nature. So can you be a little bit more specific here about what you take to be the goal of AI alignment, the goal of it being, what is it that we’re trying to align systems to?

Iason Gabriel: Yeah, absolutely. So we’ve touched upon some of these questions already tacitly in the preceding discussion. Of course, in the paper, I argue that when we talk about value alignment, this idea of value is often a placeholder for quite different ideas, as you said. And I actually present a taxonomy of options that I can take us through in a fairly thrifty way. So, I think the starting point for creating aligned AI systems is this idea that we want AI that’s able to follow our instructions, but that has a number of shortcomings, which Stuart Russell and others have documented, which tend to center around this challenge of excessive literalism. So if an AI system literally does what we ask it to, without an understanding of context, side constraints and nuance, often this will lead to problematic outcomes, with the story of King Midas being the classic cautionary tale. He wishes that everything he touches turns to gold, everything turns to gold, and then you have a disaster of one kind or another.

So of course, instructions are not sufficient. What you really want is AI that’s aligned with the underlying intention. So, I think that often in the podcast, people have talked about intention alignment as an important goal of AI systems. And I think that it is precisely right to dedicate a lot of technical effort to closing the gap between a kind of idiot savant AI that perceives just instructions in this dumb way and the kind of more nuanced, intelligent AI that can follow an intention. But we might wonder whether aligning AI with an individual or collective intention is actually sufficient to get us to the really good outcomes, the kind of maximalist outcomes that I’m talking about. And I think that there’s a number of reasons why that might not be the case. So of course, to start with, just because an AI can follow an intention, that doesn’t say anything about the quality of the intention that’s being followed.

We can form intentions on an individual or collective basis to do all kinds of things, some of which might be incredibly foolish or malicious, some of which might be self-harming, some of which might be unethical. And we’ve got to ask this question of whether we want AI to follow us down that path when we come up with schemes of that kind, and there’s various ways we might try to address that bundle of problems. I think intentions are also problematic from a kind of technical and phenomenological perspective because they tend to be incomplete. So if we look at what an intention is, it’s roughly speaking a kind of partially filled out plan of action that commits us to some end. And if we imagine that AI systems are very powerful, they may encounter situations or dilemmas or option sets that are in this space of uncertainty, where it’s just not clear what the original intention was, and they might need to make the right kind of decision by default.

So they might need some intuitive understanding of what the right thing to do is. So my intuition is that we do want AI systems that have some kind of richer understanding of the goals that we would want to realize in whole. So I think that we do need to look at other options. It is also possible that we had formed the intention for the AI to do something that explicitly requires an understanding of morality. So we may ask it to do things like promote the greatest good in a way that is fundamentally ethical. Then it needs to step into this other terrain of understanding preferences, interests, and values. I think we need to explore that terrain for one reason or another. Of course, one thing that people talk about is this kind of learning from revealed preferences. So perhaps in addition to the things that we directly communicate, the AI could observe our behavior and make inferences about what we want that help fill in the gaps.

So maybe it could watch you in your public life, hopefully not your private life, and make these inferences that actually it should create this very good thing. So that is in the domain of trying to learn from things that it observes. But I think that preferences are also quite a worrying data point for AI alignment, at least revealed preferences, because they contain many of the same weaknesses and shortcomings that we can ascribe to individual intentions.

Lucas Perry: What is a revealed intention again?

Iason Gabriel: Sorry, revealed preferences are preferences that are revealed through your behavior. So I observed you doing A or B. And from that choice, I conclude that you have a deeper preference for the thing that you choose. And the question is, if we just watch people, can we learn all the background information we need to create ethical outcomes?

Lucas Perry: Yeah. Absolutely not.

Iason Gabriel: Yeah. Exactly. As your Ted Bundy example nicely illustrated, not only is it very hard to actually get useful information from observing people about what they want, but what they want can often be the wrong kind of thing for them or for other people.

Lucas Perry: Yeah. I have to hire people to spend some hours with me every week to tell me from the outside, how I may be acting in ways that are misinformed or self-harming. So instead of revealed preferences, we need something like rational or informed preferences, which is something you get through therapy or counseling or something like that.

Iason Gabriel: Well, that’s an interesting perspective. I guess there’s a lot of different theories about how we get to ideal preferences, but the idea is that we don’t want to just respond to what people are in practice doing. We want to give them the sort of thing that they would aspire to if they were rational and informed at the very least. So not things that are just a result of mistaken reasoning or poor quality information. And then there’s this very interesting philosophical and psychological question about what the content of those ideal preferences is, and particularly what happens when you think about people being properly rational. So, to return to David Hume, to whom the is-ought distinction is often attributed, he has the conjecture that someone can be fully informed and rational and still desire pretty much anything at the end of the day, that they could want something hugely destructive for themselves or other people.

Of course, Kantians, and in fact a lot of moral philosophers, believe that rationality is not just a process of joining up beliefs and value statements in a certain fashion, but that it also encompasses a substantive capacity to evaluate ends. So, obviously Kantians have a theory about rationality ultimately requiring you to reflect on your ends and ask if they universalize in a positive way. But the thing is that’s highly, highly contested. So I think ultimately if we say we want to align AI with people’s ideal and rational preferences, it leads us into this question of what rationality really means. And we don’t necessarily get the kind of answers that we want to get to.

Lucas Perry: Yeah, that’s a really interesting and important thing. I’ve never actually considered that. For example, someone who might be a moral anti-realist would probably be more partial to the view that rationality is just about linking up beliefs and epistemics and decision theory with goals, and goals are something that you’re just given and embedded with. And that there isn’t some correct evaluative procedure for analyzing goals beyond whatever meta preferences you’ve already inherited. Whereas a realist might hold the other view, where rationality is about beliefs and ends, but also about perhaps a more concrete standard method for evaluating which ends are good ends. Is that the way you view it?

Iason Gabriel: Yeah, I think that’s a very nice summary. The people who believe in substantive rationality tend to be people with a more realist moral disposition. If you’re profoundly anti-realist, you basically think that you have to stop talking in the currency of reasons. So you can’t tell people they have a reason not to act in a kind of unpleasant way to each other, or even to do really heinous things. You have to say something different to them, like, “Wouldn’t it be nice if we could realize this positive state of affairs?” And I think ultimately we can get to views about value alignment that satisfy these two different groups. We can create aspirations that are well-reasoned from different points of view and also create scenarios that meet the kind of “Wouldn’t it be nice” criteria. But I think it isn’t going to happen if we just double down on this question of whether rationality ultimately leads to a single set of ends or a plurality of ends, or no consensus whatsoever.

Lucas Perry: All right. That’s quite interesting. Not only do we have difficult and interesting philosophical ground in ethics, but also in rationality and how these are interrelated.

Iason Gabriel: Absolutely. I think they’re very closely related. So actually the problems we encounter in one domain, we also encounter in the other, and I’d say in my kind of lexicon, they all fall within this question of practical rationality and practical reason. So that’s deliberating about what we ought to do, either because of explicitly moral considerations or a variety of other things that we factor into judgments of that kind.

Lucas Perry: All right. Two more on our list here to hit are interests and values.

Iason Gabriel: So, I think there are one or two more things we could say about that. If we think that one of the challenges with ideal preferences is that they lead us into this heavily contested space about what rationality truly requires, we might think that a conception of human interests does significantly better. So if we think about AI being designed to promote human interests or wellbeing or flourishing, I would suggest that as a matter of empirical fact, there’s significantly less disagreement about what that entails. So if we look at, say, the capability based approach that Amartya Sen and Martha Nussbaum have developed, it essentially says that there’s a number of key goods and aspects of human flourishing that the vast majority of people believe conduce to a good life. And that actually has some intercultural value and affirmation.

So if we designed AI that bore in mind this goal of enhancing general human capabilities, so human freedom, physical security, emotional security, capacity, that looks like an AI that is both, roughly speaking, getting us into the space of something that looks like it’s unlocking real value, and also isn’t bogged down in a huge amount of metaphysical contention. I suggest that aligning AI with human interests or wellbeing is a good proximate goal when it comes to value alignment. But even then I think that there’s some important things that are missing and that can only actually be captured if we return to the idea of value itself.

So by this point, it looks like we have almost arrived at a kind of utilitarian AI via the backdoor. I mean, of course utility is a subjective mental state, which isn’t necessarily the same as someone’s interest or their capacity to lead a flourishing life. But it looks like we have an AI that’s geared around optimizing some notion of human wellbeing. And the question is what might be missing there or what might go wrong. And I think there are some things that that view of value alignment still struggles to factor in. The welfare of nonhuman animals is something that’s missing from this wellbeing centered perspective on alignment.

Lucas Perry: That’s why we might just want to make it wellbeing for sentient creatures.

Iason Gabriel: Exactly, and I believe that this is a valuable enterprise, so we can expand the circle. So we say it’s the wellbeing of sentient creatures. And then we have the question about, what about future generations? Does their wellbeing count? And we might think that it does if we follow Toby Ord or in fact, most conventional thinking, we do think that the welfare of future generations has intrinsic value. So we might say, “Well, we want to promote wellbeing of sentient creatures over time with some appropriate weighting to account for time.”

And that’s actually starting to take us into a richer space of value. So we have wellbeing, but we also have a theory about how to do intertemporal comparisons. We might also think that it matters how wellbeing or welfare is distributed. That it isn’t just a maximization question, but that we also have to be interested in equity or distribution because we think it is intrinsically important. So we might think it has to be done in a manner that’s fair. Additionally, we might think that things like the natural world have intrinsic value that we want to factor in. And so the point, which will almost be familiar now from our earlier discussion, is that you actually have to get to that question of what values we want to align the system with, because values and the principles that derive from them can capture everything that is seemingly important.

Lucas Perry: Right. And so, for example, within the effective altruism community and within moral philosophy recently, the way in which moral progress has been made is by debiasing human moral thought and ethics from spatial and temporal bias. So Peter Singer has the children drowning in a shallow pond argument. It just illustrates how there are people dying and children dying all over the world in situations in which we could cheaply intervene to save them, as if they were drowning in a shallow pond. And you only need to take a couple of steps and just pull them out, except we don’t. And we don’t because they’re far away. And I would like to say, essentially, everyone finds this compelling: that where you are in space doesn’t change how much your suffering matters. That if you are suffering, then all else being equal, we should intervene to alleviate that suffering when it’s reasonable to do so.

So space doesn’t matter for ethics. Likewise, I hope, and I think that we’re moving in the right direction, that time also doesn’t matter, while also being mindful that we have to introduce things like uncertainty. We don’t know what the future will be like, but this principle about caring about the wellbeing of sentient creatures in general, I think, is essential and core to whatever list of principles we’ll want for bridging the is-ought distinction, because it takes away spatial bias: where you are in space doesn’t matter, it just matters that you’re a sentient being. It doesn’t matter when you are as a sentient being. It also doesn’t matter what kind of sentient being you are, because the thing we care about is sentience. So then the moral circle has expanded across species. It’s expanded across time. It’s expanded across space. It includes aliens and all possible minds that we could encounter now or in the future. We have to get that one in, I think, for making a good future with AI.

Iason Gabriel: That’s a picture that I strongly identify with on a personal level, this idea of the expanding moral circle of sensibilities. And I think from a substantive point of view, you’re probably right. That that is a lot of the content that we would want to put into an aligned AI system. I think that one interesting thing to note is that a lot of these views are actually empirically fairly controversial. So if we look at the interesting study, the moral machine experiment, where I believe several million people ultimately played this experiment online, where they decided which trade offs an AV, an autonomous vehicle, should make in different situations. So whether it should crash into one person or five people, a rich person or a poor person, pretty much everyone agreed that it should kill fewer people when that was on the table. But I believe that in many parts of the world, there was also belief that the lives of affluent people mattered more than the lives of those in poverty.

And so if you were just to reason from their first-order moral beliefs, you would bake that bias into an AI system, which seems deeply problematic. And I think it actually puts pressure on this question, which is: we’ve already said we don’t want to just align AI with existing moral preferences. We’ve also said that we can’t just declare a moral theory to be true and impose it on other people. So are there other options which move us in the direction of these kinds of moral beliefs that seem to be deeply justified, but also avoid the challenge of value imposition? And how far do they get if we try to move forward, not just as individuals examining the kind of expanding moral circle, but as a community that’s trying to progressively endogenize these ideas and come up with moral principles that we can all live by?

We might not get as far as if we were going at it alone, but I think that there are some solutions that are kind of in that space. And those are the ones I’m interested in exploring. I mean, common sense morality, understood as the conventional morality that most people endorse, I would say is deeply flawed in a number of regards, including with regards to global poverty and things of that nature. And that’s really unfortunate given that we probably also don’t want to force people to live by more enlightened beliefs, which they don’t endorse or can’t understand. So I think that the interesting question is how do we meet this demand for a respect for pluralism, and also avoid getting stuck in the morass of common sense morality, which has these prejudicial beliefs that will probably, with the passage of time, come to be regarded quite unfavorably by future generations.

And I think that taking this demand for non domination or democratic support seriously means not just running far into the future or in a way that we believe represents the future, but also doing a lot of other things, trying to have a democratic discourse where we use these reasons to justify certain policies that then other people reflectively endorse and we move the project forwards in a way that meets both desiderata. And in this paper, I try to map out different solutions that meet both of these criteria, respecting people’s pluralistic beliefs while also moving us towards more genuinely morally aligned outcomes.

Lucas Perry: So now the last question that I want to ask you here then on the goal of AI alignment is do you view a needs based conception of human wellbeing as a sub-category of interest based value alignment? People have come up with different conceptions of human needs. People are generally familiar with Maslow’s hierarchy of needs. And I mean, as you go up the hierarchy, it will become more and more contentious, but everyone needs food and shelter and safety, and then you need community and meaning and spirituality and things of that nature. So how do you view or fit a needs based conception, given that some needs are obviously undeniable relative to others?

Iason Gabriel: Broadly speaking, a needs-based conception of wellbeing is in that space we already touched upon. So the capabilities based approach and the needs based approach are quite similar. But I think that what you’re saying about needs potentially points to a solution to this kind of dilemma that we’ve been talking about. If we’re going to ask this question of what it means to create principles for AI alignment that treat people fairly, despite their different views, one approach we might take is to look for commonalities that also seem to have moral robustness or substance to them. So within the parlance of political philosophy, we’d call this an overlapping consensus approach to the problem of political and moral decision making. I think that that’s a project that’s well worth countenancing. So we might say there’s a plurality of global beliefs and cultures. What is it that these cultures coalesce around? And I think that it’s likely to be something along the lines of the argument that you just put forward: that people are vulnerable in virtue of how we’re constituted, that we have a kind of fragility and that we need protection, both against the environment and against certain forms of harm, particularly state-based violence. And that this is a kind of moral bedrock, or what the philosopher Henry Shue calls “a moral minimum,” that receives intercultural endorsement. So actually the idea of human needs is very, very closely tied to the idea of human rights. So the idea is that the need is fundamental, and in virtue of your moral standing, the normative claim, and your need, the empirical claim, you have a right to enjoy a certain good and to be secured in the knowledge that you’ll enjoy that thing.

So I think the idea of building a kind of human rights based AI that’s based upon this intercultural consensus is pretty promising. In some regards human rights, as they’ve been historically thought about, are not super easy to turn into a theory of AI alignment, because they are historically thought of as guarantees that states have to give their citizens in order to be legitimate. And it isn’t entirely clear what it means to have a human rights based technology, but I think that this is a really productive area to work in, and I would definitely like to try and populate that ground.

You might also think that the consensus or the emerging consensus around values that need to be built into AI systems, such as fairness and explainability, potentially portends the emergence of this kind of intercultural consensus. Although I guess at that point, we have to be really mindful of the voices that are at the table and who’s had an opportunity to speak. So although there does appear to be some convergence around principles of beneficence and things like that, it’s also true that this isn’t a global conversation in which everyone is represented, and it would be easy to prematurely rush to the conclusion that we know what values to pursue, when we’re really just reiterating some kind of very heavily Western centric, affluent view of ethics that doesn’t have real intercultural democratic viability.

Lucas Perry: All right, now it’s also interesting and important to consider here the differences and importance of single agent and multi-agent alignment scenarios. For example, you can imagine entertaining the question of, “How is it that I would build a system that would be able to align with my values? One agent being the AI system, and one person, and how is it that I get the system to do what I want it to do?” And then the multi-agent alignment scenario considers, “How do I get one agent to align and serve to many different people’s interests and wellbeing and desires, and preferences, and needs? And then also, how do we get systems to act and behave when there are many other systems trying to serve and align to many other different people’s needs? And how is it that all of these systems may or may not collaborate with all of the other AI systems, and may or may not collaborate with all of the other human beings, when all the human beings may have conflicting preferences and needs?” How is it that we do for example, intertheoretic comparisons of value and needs? So what’s the difference, and importance between single agent and multi-agent alignment scenarios?

Iason Gabriel: I think that the difference is best understood in terms of how expansive the goal of alignment has to be. So if we’re just thinking about a single person and a single agent, it’s okay to approach the value alignment challenge through a slightly solipsistic lens. In fact, you know, if it was just one person and one agent, it’s not clear that morality really enters the picture, unless there are other people or other sentient creatures whom our actions can affect. So with one person, one agent, the challenge is primarily correlation with the person’s desires, aims, intentions. Potentially, there’s still a question of whether the AI serves their interest rather than, you know, these more volitional states that come to mind. When we think about situations in which many people are affected, then it becomes kind of remiss not to think about interpersonal comparisons, and the kind of richer conceptions that we’ve been talking about.

Now, I mentioned earlier that there is a view that there will always be a human body that synthesizes preferences and provides moral instructions for AI. We can imagine democratic approaches to value alignment, where human beings assemble, maybe in national parliaments, maybe in global fora, and legislate principles that AI is then designed in accordance with. I think that’s actually a very promising approach. You know, you would want it to be informed by moral reflection and people offering different kinds of moral reasons that support one approach rather than the other, but that seems to be important for multi-person situations and is probably actually a necessary condition for powerful forms of AI. Because, when AI has a profound effect on people’s lives, these questions of legitimacy also start to emerge. So not only is it doing the right thing, but is it doing the sort of thing that people would consent to, and is it doing the sort of thing that people actually have consented to? And I think that when AI is used in certain forum, then these questions of legitimacy come to the top. There’s a bundle of different things in that space.

Lucas Perry: Yeah. I mean, it seems like a really, really hard problem. When you talk about creating some kind of national body, and I think you said international fora, do you wonder that some of these vehicles might be overly idealistic given what may happen in the world where there’s national actors competing and capitalism driving things forward relentlessly? This problem of multi-agent alignment seems very important and difficult, and there are forces pushing things such that it’s less likely that it happens.

Iason Gabriel: When you talk about multi-agent alignment, are you talking about the alignment of an ecosystem that contains multiple AI agents, or are you talking about how we align an AI agent with the interests and ideas of multiple parties? So many humans, for example?

Lucas Perry: I’m interested and curious about both.

Iason Gabriel: I think there’s different considerations that arise for both sets of questions, but there are also some things that we can speak to that pertain to both of them.

Lucas Perry: Do they both count as multi-agent alignment scenarios in your understanding of the definition?

Iason Gabriel: From a technical point of view, it makes perfect sense to describe them both in that way. I guess when I’ve been thinking about it, curiously, I’ve been thinking of multi-agent alignment as an agent that has multiple parties that it wants to satisfy. But when we look at machine learning research, “multi-agent” usually means many AI agents running around in a single environment. So I don’t see any kind of language based reason to opt for one, rather than the other. With regards to this question of idealization and real world practice, I think it’s an extremely interesting area. And the thing I would say is this is almost one of those occasions where potentially the is-ought distinction comes to our rescue. So the question is, “Does the fact that the real world is a difficult place, affected by divergent interests, mean that we should level down our ideals and conceptions about what really good and valuable AI would look like?”

And there are some people who have what we term “practice dependent” views of ethics who say, “Absolutely we should do. We should adjust our conception of what the ideal is.” But as you’ll probably be able to tell by now, I hold a kind of different perspective in general. I don’t think it is problematic to have big ideals and rich visions of how value can be unlocked, and that partly ties into the reasons that we spoke about for thinking that the technical and the normative are interconnected. So if we preemptively level down, we’ll probably design systems that are less good than they could be. And when we think about a design process spanning decades, we really want that kind of ultimate goal, the shining star of alignment, to be something that’s quite bright and can steer our efforts towards it. If anything, I would be slightly worried that because these human parliaments and international institutions are so driven by real world politics, that they might not give us the kind of most fully actualized set of ideal aspirations to aim for.

And that’s why philosophers like, of course, John Rawls actually propose that we need to think about these questions from a hypothetical point of view. So we need to ask, “What would we choose if we weren’t living in a world where we knew how to leverage our own interests?” And that’s how we identify the real ideal that is acceptable to people, regardless of where they’re located, and that can then be used to steer non-ideal theory or the kind of actual practice in the right direction.

Lucas Perry: So if we have an organization that is trying its best to create aligned and beneficial AGI systems, reasoning about what principles we should embed in it from behind Rawls’ Veil of Ignorance, you’re saying, would have hopefully the same practical implications as if we had a functioning international body for coming up with those principles in the first place.

Iason Gabriel: Possibly. I mean, I’d like to think that ideal deliberation would lead them in the direction of impartial principles for AI. It’s not clear whether that is the case. I mean, it seems that at its very best, international politics has led us in the direction of a kind of human rights doctrine that both accords individuals protection, regardless of where they live, and defends the strong claim that they have a right to subsistence and other forms of flourishing. If we use the Veil of Ignorance experiment for AI, I think it might even give us more than that, even if a real world parliament never got there. For those of you who are not familiar with this, the philosopher John Rawls says that when it comes to choosing principles for a just society, what we need to do is create a situation in which people don’t know where they are in that society, or what their particular interest is.

So they have to imagine that they’re from behind the Veil of Ignorance. They select principles for that society that they think will be fair regardless of where they end up, and then having done that process and identified principles of justice for the society, he actually holds out the aspiration that people will reflectively endorse them even once the veil has been removed. So they’ll say, “Yes, in that situation, I was reasoning in a fair way that was nonprejudicial. And these are principles that I identified there that continue to have value in the real world.” And we can say what would happen if people are asked to choose principles for artificial intelligence from behind a veil of ignorance where they didn’t know whether they were going to be rich or poor, Christian, utilitarian, Kantian, or something else.

And I think there, some of the kind of common sense material would be surfaced; so people would obviously want to build safe AI systems. I imagine that this idea of preserving human autonomy and control would also register, but for some forms of AI, I think distributive considerations would come into play. So they might start to think about how the benefits and burdens of these technologies are distributed and how those questions play out on a global basis. They might say that ultimately, a value aligned AI is one that has fair distributive impacts on a global basis, and, if you follow Rawls, that it works to the advantage of the least well off people.

That’s a very substantive conception of value alignment, which may or may not be the final outcome of ideal international deliberation. Maybe the international community will get to global justice eventually, or maybe it’s just too thoroughly affected by nationalist interests and other kinds of what are, to my mind, distortionary effects that mean that it doesn’t quite get there. But I think that this is definitely the space that we want the debate to be taking place in. And actually, there has been real progress in identifying collectively endorsed principles for AI that gives me hope for the future. Not only that we’ll get good ideals, but that people might agree to them, and that they might get democratic endorsement, and that they might be actionable and the sort of thing that can guide real world AI design.

Lucas Perry: Can you add a little bit more clarity on the philosophical questions and issues which single and multi-agent alignment scenarios supervene on? How do you do intertheoretic comparisons of value if people disagree on normative or meta-ethical beliefs or people disagree on foundational axiomatic principles for bridging the is-ought gap? How is it that systems deal with that kind of disagreement?

Iason Gabriel: I’m hopeful that the three pictures that I outlined so far of the overlapping consensus between different moral beliefs, of democratic debate over a constitution for AI, and of selection of principles from behind the Veil of Ignorance, are all approaches that carry some traction in that regard. So they try to take seriously the fact of real world pluralism, but they also, through different processes, tend to tap towards principles that are compatible with a variety of different perspectives. Although I would say, I do feel like there’s a question about this multi agent thing that may still not be completely clear in my mind, and it may come back to those earlier questions about definition. So in a one person, one agent scenario, you don’t have this question of what to do with pluralism, and you can probably go for a more simple one shot solution, which is align it with the person’s interest, beliefs, moral beliefs, intentions, or something like that. But if you’re interested in this question of real world politics for real world AI systems where a plurality of people are affected, we definitely need these other kinds of principles that have a much richer set of properties and endorsements.

Lucas Perry: All right, there’s Rawls’ Veil of Ignorance. There’s the principle of non-domination, and then there’s the democratic process?

Iason Gabriel: Non-domination is a criterion that any scheme for multi-agent value alignment needs to meet. And then we can ask the question, “What sort of scheme would meet this requirement of non-domination?” And there we have the overlapping consensus with human rights, we have a scheme of democratic debate leading to principles for an AI constitution, and we have the Veil of Ignorance, as all ideas that we basically find within political theory that could help us meet that condition.

Lucas Perry: All right, so we have spoken at some length then about principles and identifying principles. This goes back to our conversation about the is-ought distinction, and these are principles that we need to identify for setting up an ethical alignment procedure. You mentioned this earlier, when we were talking about this distinction between the one true moral theory approach to AI alignment, in contrast to coming up with a procedure for AI alignment that would be broadly endorsed by many people, and would respect the principle of non-domination, and would take into account pluralism. Can you unpack this distinction more, and the importance of it?

Iason Gabriel: Yeah, absolutely. So I think that the kind of true moral theory approach, although it is a kind of stylized idea of what an approach to value alignment might look like, is the sort of thing that could be undertaken just by a single person who is designing the technology or a small group of people, perhaps moral philosophers who think that they have really great expertise in this area. And then they identify the chosen principle and run with it.

The big claim is that that isn’t really a satisfactory way to think about design and values in a pluralistic world where many people will be affected. And of course, many people who’ve gone off on that kind of enterprise have made serious mistakes that were very costly for humanity and for people who are affected by their actions. So the political approach to value alignment paints a fundamentally different perspective and says it isn’t really about one person, or one group, running ahead and thinking that they’ve done all the hard work; it’s about working out what we can all agree upon that looks like a reasonable set of moral principles or coordinates to build powerful technologies around. And then, once we have this process in place that outputs the right kind of agreement, the task is given back to technologists: these are the kind of parameters that a fair process of deliberation has outputted, and this is what we have the authority to encode in machines, whether it’s say human rights or a conception of justice, or some other widely agreed upon values.

Lucas Perry: There are principles that you’re really interested in satisfying, like respecting pluralism, and respecting a principle of non-domination, and the One True Moral Theory approach risks violating those other principles. Are you not taking a stance on whether there is a One True Moral Theory, and you’re just willing to set that question aside and say, “Because it’s so essential to a thriving civilization that we don’t do moral imposition on one another, coming up with a broadly endorsed theory is just absolutely the way to go, whether or not there is such a thing as a One True Moral Theory”? Does that capture your view?

Iason Gabriel: Yeah. So to some extent, I’m trying to make an argument that will look like something we should affirm, regardless of the metaethical stance that we wish to take. Of course, there are some views about morality that actually say that non-domination is a really important principle, or that human rights are fundamental. So someone might look at these proposals, and from the comprehensive moral perspective, they would say, “This is actually the morally best way to do value alignment, and it involves dialogue, discussion, mutual understanding, and agreement.” However, you don’t need to believe that in order to think that this is a good way to go. If you look at the writing of someone like Joshua Greene, he says that there’s this problem we encounter called the “tragedy of common sense morality.” A lot of people have fairly decent moral beliefs, but when they differ, it ends up in violence, and they end up fighting. And you have a hugely negative moral externality that arises just because people weren’t able to enter this other mode of theorizing, where they said, “Look, we’re part of a collective project, let’s agree to some higher level terms that we can all live by.” So from that point of view, it looks prudent to think about value alignment as a pluralistic enterprise.

That’s an approach that many people have taken with regards to the justification of the institution of the state, and the things that we believe it should protect, and affirm, and uphold. And then as I alluded to earlier, I think that actually, even for some of these anti-realists, this idea of inclusive deliberation, and even the idea of human rights, look like quite good candidates for the kind of “Wouldn’t it be nice?” criterion. So to return to Richard Rorty, who is kind of the arch moral skeptic, he does ultimately really want us to live in a world with human rights, he just doesn’t think he has a really good meta-ethical foundation to rest this on. But in practice, he would take that vision forward, I believe, and try to persuade other people that it was the way to go by telling them good stories and saying, “Well, look, this is the world with human rights and open-ended deliberation, and this is the world where one person decided what to do. Wouldn’t it be nice in that better world?” So I’m hopeful that this kind of political ballpark has this kind of rich applicability and appeal, regardless of whether people are starting out in one place or the other.

Lucas Perry: That makes sense. So then another aspect of this is, in the absence of moral agreement or when there is moral disagreement, is there a fair way to decide what principles AI should align with? For example, I can imagine religious fundamentalists at core being antithetical to the project of aligning AI systems, which eventually lead to something smarter than us. They could view it as something like playing God and just be like, “Well, this is just not a project that we should even do.”

Iason Gabriel: So that’s an interesting question, and you may actually be putting pressure on my preceding argument. I think that it is certainly the case that you can’t get everyone to agree on a set of global principles for AI, because some people hold very, very extreme beliefs that are exclusionary, and don’t tend to the possibility of compromise. Typically people who have a fundamentalist orientation of one kind or another. And so, even if we get the pluralistic project off the ground, it may be the case that we have to, in my language, impose our values on those people, and that in a sense, they are dominated. And that leads to the difficult question: why is it permissible to impose beliefs upon those people, but not the people who don’t hold fundamentalist views? It’s a fundamentally difficult question, because what it tends to point to is the idea that beneath this talk about pluralism, there is actually a value claim, which is that you are entitled to non-domination, so long as you’re prepared not to dominate other people, and to accept that there is a moral equality, that means that we need to cooperate and co-habit in a world together.

And that does look like a kind of deep, deep, moral claim that you might need to substantively assert. I’m not entirely sure; I think that’s one that we can save for further investigation, but it’s certainly something that people have said in the context of these debates, that at the deepest level, you can’t escape making some kind of moral claim, because of these cases.

Lucas Perry: Yeah. This is reminding me of the paradox of tolerance by Karl Popper, who talks about free speech ends when you yell, “The theater’s on fire.” And in some sense are then imposing harm on other people. And that we’re tolerant of people within society, except for those who are intolerant of others. And to some extent, that’s a paradox. So similarly we may respect and endorse a principle of non-domination, or non-subjugation, but that ends when there are people who are dominating or subjugating. And the core of that is maybe getting back again to some kind of principle of non-harm related to the wellbeing of sentient creatures.

Iason Gabriel: Yeah. I think that the obstacles that we’re discussing now are very precisely related to that paradox. Of course, the boundaries we want to draw on permissible disagreement are in some sense quite minimal. Or conversely, we might think that the wide affirmation of some aspect of the value of human rights is quite a strong basis for moving forwards, because it says that all human life has value, and that everyone is entitled to basic goods, including goods pertaining to autonomy. So people who reject that really are pushing back against something that is widely and deeply, reflectively endorsed by a large number of people. I also think that with regards to toleration, the anti-realist position becomes quite hard to figure out or quite strange. So you have these people who are not prepared to live in a world where they respect others, and they have this will to dominate, or a fundamentalist perspective.

The anti-realist says, “Well, you know, potentially there’s this nicer world we can move towards.” The anti-realist doesn’t deal in the currency of moral reasons. They don’t really have to worry about it too much; they can just say, “I’m going to go in that direction with everyone else who agrees with us,” and hold to the idea that it looks like a good way to live. So in a way, the problem with domination is much more serious for people who are moral realists. For the anti-realists, it’s not actually a perspective I inhabit in my day to day life, so it’s hard for me to say what they would make of it.

Lucas Perry: Well, I guess, just to briefly defend the anti-realist, I imagine that they would say that they still have reasons for morality, they just don’t think that there is an objective epistemological methodology for discovering what is true. “There aren’t facts about morality, but I’m going to go make the same noises that you make about morality. Like I’m going to give reasons and justification, and these are as good as making up empty screeching noises and blah, blahing about things that don’t exist,” but it’s still motivating to other people, right? They still will have reasons and justification; they just don’t think it pertains to truth, and they will use that to navigate the world and then justify domination or not.

Iason Gabriel: That seems possible, but I guess for the anti-realist, if they think we’re just fundamentally expressing pro-attitudes, so when I say, “It isn’t justified to dominate others.” I’m just saying, “I don’t like it when this thing happens,” then we’re just dealing in the currency of likes, and I just don’t think you have to be so worried about the problem of domination as you are, if you think that this means something more than someone just expressing an attitude about what they like or don’t. If there aren’t real moral reasons or considerations at stake, if it’s just people saying, “I like this. I don’t like this.” Then you can get on with the enterprise that you believe achieves these positive ends. Of course, the unpleasant thing is you kind of are potentially giving permission to other people to do the same, or that’s a consequence of the view you hold. And I think that’s why a lot of people want to rescue the idea of moral justification as a really meaningful practice, because they’re not prepared to say, “Well, everyone gets on with the thing that they happen to like, and the rest of it is just window dressing.”

Lucas Perry: All right. Well, I’m not sure how much we need to worry about this now. I think it seems like anti-realists and realists basically act the same in the real world. Maybe, I don’t know.

Iason Gabriel: Yeah. In reality, anti-realists tend to act in ways that suggest that on some level they believe that morality has more to it than just being a category error.

Lucas Perry: So let’s talk a little bit here more about the procedure by which we choose evaluative models for deciding which proposed aspects of human preferences or values are good or bad for an alignment procedure. We can have a method of evaluating or deciding which aspects of human values or preferences or things that we might want to bake into an alignment procedure are good or bad, but you mentioned something like having global fora or having different kinds of governance institutions or vehicles by which we might have conversation to decide how to come up with an alignment procedure that would be endorsed. What is the procedure to decide what kinds of evaluative models we will use to decide what counts as a good alignment procedure or not? Right now, this question is being answered by a very biased and privileged select few in the West, at AI organizations and people adjacent to them.

Iason Gabriel: I think this question is absolutely fundamental. I believe that any claim that we have meaningful global consensus on AI principles is premature, and that it probably does reflect biases of the kind you mentioned. I mean, broadly speaking, I think that there’s two extremely important reasons to try and widen this conversation. The first is that in order to get a kind of clear, well-grounded, and well-sighted vision on what AI should align with, we definitely need intercultural perspectives. On the assumption that, to quote John Stuart Mill, “no-one has complete access to the truth and people have access to different parts of it,” the bigger the conversation becomes, the more likely it is that we move towards maximal value alignment of the kind that humanity deserves. But potentially more importantly than that, and regardless of the kind of epistemic consequences of widening the debate, I think that people have a right to voice their perspective on topics and technologies that will affect them. If we think of the purpose of a global conversation partly as this idea of formulating principles, but also bestowing on them a certain authority in light of which we’re permitted to build powerful technologies, then you just can’t say that they have the right kind of authority and grounding without proper extensive consultation. And so, I would suggest that that’s a very important next step for people who are working in this space. I’m also hopeful that actually these different approaches that we’ve discussed can potentially be mutually supporting. So, I think that there is a good chance that human rights could serve as a foundation or a seed for a good, strong intercultural conversation around AI alignment.

And I’m not sure to what extent this really is the case, but it might be that even some of these ideas about reasoning impartially have currency in a global conversation. And you might find that they are actually quite challenging for affluent countries or for self interested parties, because it would reveal certain hidden biases in the propositions that they have now made or put forward.

Lucas Perry: Okay. So, related to things that we might want to do to come up with the correct procedure for being able to evaluate what kinds of alignment procedures are good or bad, what do you view as sufficient for adequate alignment of systems? We’ve talked a little bit about minimalism versus maximalism, where minimalism is aligning to just some conception of human values and maximalism is hitting on some very idealized and strong set or form of human values. And this procedure is related, at least in the, I guess, existential risk space coming from people like Toby Ord and William MacAskill. They talk about something like a long reflection. If I’m asking you about what might be adequate alignment for systems, one criterion for that might be meeting basic human needs, meeting human rights and reducing existential risk further and further such that it’s very, very close to zero and we enter a period of existential stability.

And then following this existential stability is proposed something like a long reflection where we might more deeply consider ethics and values and norms before we set about changing and optimizing all of the atoms around us in the galaxy. Do you have a perspective here on this sort of most high level timeline? First, as we’re aligning AI systems, what does it take for it to be adequate? And then, what needs to potentially be saved for something like a long reflection? And then, how would something like a broadly endorsed procedure versus a one true moral theory approach fit into something like a long reflection?

Iason Gabriel: Yes. A number of thoughts on this topic. The first pertains to the idea of existential security and, I guess, why it’s defined as the kind of dominant goal in the short term perspective. There may be good reasons for this, but I think what I would suggest is that this obviously involves trade-offs. The world we live in is a very unideal place, one in which we have a vast quantity of unnecessary suffering. And to my mind, it’s probably not even acceptable to say that basically the goal of building AI is, or that the foremost challenge of humanity is, to focus on this kind of existential security and extreme longevity while leaving so many people to lead lives that are less than they could be.

Lucas Perry: Why do you think that?

Iason Gabriel: Well, because human life matters. If we were to look at where the real gains in the world are today, I believe it’s helping these people who die unnecessarily from neglected diseases, lack subsistence incomes, and things of that nature. And I believe that has to form part of the picture of our ideal trajectory for technological development.

Lucas Perry: Yeah, that makes sense to me. I’m confused about what you’re actually saying about the existential security view as being central. If you compare the suffering of people that exist today to the astronomical amount of life that could be in the future, does that kind of reasoning about the potential not do the work for you for seeing mitigating existential risk as the central concern?

Iason Gabriel: I’m not entirely sure, but what I would say is that on one reading of the argument that’s being presented, the goal should be to build extremely safe systems and not try to intervene in areas about which there’s more substantive contestation, until there’s been a long delay and a period of reflection, which might mean neglecting some very morally important and tractable challenges that the world is facing at the present moment. And I think that that would be problematic. I’m not sure why we can’t work towards something that’s more ambitious, for example, a human rights respecting AI technology.

Lucas Perry: Why would that entail that?

Iason Gabriel: Well, so, I mean, this is the kind of question about the proposition that’s been put in front of us. Essentially, if that isn’t the proposition, then the long reflection isn’t leaving huge amounts to be deliberated about, right? Because we’re saying, in the short term, we’re going to tether towards global security, but we’re also going to try and do a lot of other things around which there’s moral uncertainty and disagreement, for example, promote fairer outcomes, mobilize in the direction of respecting human rights. And I think that once we’ve moved towards that conception of value alignment, it isn’t really clear what the substance of the long reflection is. So, do you have an idea of what questions would remain to be answered?

Lucas Perry: Yeah, so I guess I feel confused because reaching existential security as part of this initial alignment procedure, doesn’t seem to be in conflict with alleviating the suffering of the global poor, because I don’t think moral uncertainty extends to meeting basic human needs or satisfying basic human rights or things that are obviously conducive to the well-being of sentient creatures. I don’t think poverty gets pushed to the long reflection. I don’t think unnecessary suffering gets pushed to the long reflection. Then the question you’re asking is what is it that does get pushed to the long reflection?

Iason Gabriel: Yes.

Lucas Perry: Then what gets pushed to the long reflection is, is the one true moral theory approach to alignment actually correct? Is there a one true moral theory or is there not a one true moral theory? Are anti-realists correct or are realists correct? Or are they both wrong in some sense or is something else correct? And then, given that, the potential answer or inability to come up with an answer to that would change how something like the cosmic endowment gets optimized. Because we’re talking about billions upon billions upon billions upon billions of years, if we don’t go extinct, and the universe is going to evaporate eventually. But until then, there is an astronomical amount of things that could get done.

And so, the long reflection is about deciding what to actually do with that. And however esoteric it is, the proposals range from you just have some pluralistic optimization process. There is no right way you should live. Things other than joy and suffering matter like, I don’t know, building monuments that calculate mathematics ever more precisely. And if you want to carve out a section of the cosmic endowment for optimizing things that are other than conscious states, you’re free to do that versus coming down on something more like a one true moral theory approach and being like, “The only kinds of things that seem to matter in this world are the states of conscious creatures. Therefore, the future should just be an endeavor of optimizing for creating minds that are ever more enjoying profound states of spiritual enlightenment and spiritual bliss and knowledge.”

The long reflection might even be about whether or not knowledge matters for a mind. “Does it really matter that I am in tune with truth and reality? Should we build nothing but experience machines that cultivate whatever the most enlightened and blissful states of experience are or is that wrong?” The long reflection to me seems to be about these sorts of questions and if the one true moral theory approach is correct or not.

Iason Gabriel: Yeah, that makes sense. And my apologies if I didn’t understand what was already taken care of by the proposal. I think to some extent, in that case, we’re talking about different action spaces. When I look at these questions of AI alignment, I see very significant value questions already arising in terms of how benefits and burdens are distributed, what fairness means, whether AI needs to be explainable and accountable, and things of that nature, alongside a set of very pressing global problems that it would be really, really important to address. I think my time horizon is definitely different from this long reflection one. I kind of find it difficult to imagine a world in which these huge, but to some extent prosaic questions have been addressed and in which we then turn our attention to these other things. I guess there are a couple of things that can be said about it.

I’m not sure if this is meant to be taken literally, but I think the idea of pressing pause on technological development while we work out a further set of fundamentally important questions is probably not feasible. It would be best to work with a long term view that doesn’t rest upon the possibility of that option. And then I think that the other fundamental question is what is actually happening in this long reflection? It can be described in a variety of different ways.

Sometimes it sounds like it’s a big philosophical conference that runs for a very, very long time. And at the end of it, hopefully people kind of settle these questions and they come out to the world and they’re like, “Wow, this is a really important discovery.” I mean, if you take seriously the things we’ve been talking about today, you still have the question of what do you do with the people who then say, “Actually, I think you’re wrong about that.” And I think in a sense it recursively pushes us back into the kind of processes that I’ve been talking about. When I hear people talk about the long reflection there does also sometimes seem to be this idea that it’s a period in which there is very productive global conversation about the kind of norms and directions that we want humanity to take. And that seems valuable, but it doesn’t seem unique to the long reflection. That would be incredibly valuable right now so it doesn’t look radically discontinuous to me on that view.

Lucas Perry: All right. Because we’re talking about the long term future here and I bring it up because it’s interesting in considering what questions can we just kind of put aside? These are interesting, but in the real world, they don’t matter a ton or they don’t influence our decisions, but over the very, very long term future, they may matter much more. When I think about a principle like non-domination, it seems like we care about this conception of non-imposition and non-dominance and non-subjugation for reasons of, first of all, well-being. And the reason why we care about this well-being question is because human beings are extremely fallible. And it seems to me that the principle of non-domination is rooted in the lack of epistemic capacity for fallible agents like human beings to promote the well-being of sentient creatures all around them.

But in terms of what is physically, literally possible in the universe, it’s possible for someone to know so much more about the well-being of conscious creatures than you, and how much happier you would be and how much more well-being you would have if you only idealized in a certain way, that as we get deeper and deeper into the future, I have more and more skepticism about this principle of non-domination and non-subjugation.

It seems very useful, important, and exactly like the thing that we need right now, but as we long reflect further and further and, say, really smart, really idealized beings develop more and more epistemic clarity on ethics and what is good and the nature of consciousness and how minds work and function in this universe that I would probably submit myself to a Dyson sphere brain that was just like, “Well, Lucas, this is what you have to do.” And I guess that’s not subjugation, but I feel less and less moral qualms with the big Dyson sphere brain showing up to some early civilization like we are, and then just telling them how they should do things, like a parent does with a child. I’m not sure if you have any reactions to this or how much it even really matters for anything we can do today. But I think it’s potentially an important reflection on the motivations behind the principle of non-domination and non-subjugation and why it is that we really care about it.

Iason Gabriel: I think that’s true. I think that if you consent to something, then almost… I don’t want to say by definition, that’s definitely too strong, but it’s very likely that you’re not being dominated so long as you have sufficient information and you’re not being coerced. I think the real question is what if this thing showed up and you said, “I don’t consent to this,” and the thing said, “I don’t care; it’s in your best interests.”

Lucas Perry: Yeah, I’m defending that.

Iason Gabriel: That could be true in some kind of utilitarian, consequentialist, moral philosophy of that kind. And I guess my question is, “Do you find that unproblematic?” Or, “Do you have this intuition that there is a further set of reasons you could draw upon, which explain why the entity with greater authority doesn’t actually have the right to impose these things on you?” And I think that it may or may not be true.

It probably is true that from the perspective of welfare, non-domination is good. But I also think that a lot of people who are concerned about pluralism and non-domination think that its value pertains to something which is quite different, which is human autonomy. And that that has value because of the kind of creatures we are, with freedom of thought, a consciousness, a capacity to make our own decisions. I, personally, am of the view that even if we get some amazing, amazing paternalist, there is still a further question of political legitimacy that needs to be answered, and that it’s not permissible for this thing to impose without meeting these standards that we’ve talked about today.

Lucas Perry: Sure. So at the very least, I think I’m attempting to point towards the long reflection consisting of arguments like this. We weren’t participating in coercion before, because we didn’t really know what we were talking about, but now we know what we’re talking about. And so, given our epistemic clarity, coercion makes more sense.

Iason Gabriel: It does seem problematic to me. And I think the interesting question is what does time add to robust epistemic certainty? It’s quite likely that if you spend a long time thinking about something, at the end of it, you’ll be like, “Okay, now I have more confidence in a proposition that was on the table when I started.” But does that mean that it is actually substantively justified? And what are you going to say if you think you’re substantively justified, but you can’t actually justify it to other people who are reasonable, rational and informed like you?

It seems to me that even after a thousand years, you’d still be taking a leap of faith of the kind that we’ve seen people take in the past with really, really devastating consequences. I don’t think it’s the case that ultimately there will be a moral theory that’s settled and the confidence in the truth value of it is so high that the people who adhere to it have somehow gained the right to kind of run with it on behalf of humanity. Instead, I think that we have to proceed a small step at a time, possibly in perpetuity and make sure that each one of these small decisions is subject to continuous negotiation, reflection and democratic control.

Lucas Perry: The long reflection though, to me, seems to be about questions like that because you’re taking a strong epistemological view on meta-ethics and that there wouldn’t be that kind of clarity that would emerge over time from minds far greater than our own. From my perspective, I just find the problem of suffering to be very, very, very compelling.

Let’s imagine we have the sphere of utilitarian expansion into the cosmos, and then there is the sphere of pluralistic, non-domination, democratic, virtue ethic, deontological based expansion. You can, say, run across planets at different stages of evolution. And here you have a suffering hell planet, it’s just wild animals born of Darwinian evolution. And they’re just eating and murdering each other all the time and dying of disease and starvation and other things. And then maybe you have another planet which is an early civilization and there is just subjugation and misery and all of these things, and these spheres of expansion would do completely different things to these planets. And we’re entering super esoteric sci-fi space here. But again, it’s, I think, instructive of the importance of something like a long reflection. It changes what is permissible and what will be done. And so, I find it interesting and valuable, but I also agree with you about the one claim that you had earlier about it being unclear that we could actually pump the brakes and have a thousand-year philosophy convention.

Iason Gabriel: Yes, I mean, the one further thing I’d say, Lucas, is bearing in mind some of the earlier provisos we attached to the period before the long reflection, we were kind of gambling on the idea that there would be political legitimacy and consensus around things like the alleviation of needless suffering. So, it is not necessarily that it is the case that everything would be up for grabs just because people have to agree upon it. In the world today, we can already see some nascent signs of moral agreement on things that are really morally important and would be very significant if they were fully realized as ideals.

Lucas Perry: Maybe there is just not that big of a gap between the views that are left to be argued about during the long reflection. But then there is also this interesting question, wrapping up on this part of the conversation, about what did we take previously that was sacred, that is no longer that? An example would be if a moral realist, utilitarian conception ended up just being the truth or something, then rights never actually mattered. Autonomy never mattered, but they functioned as very important epistemic tool sets. And then we’re just like, “Okay, we’re basically doing away with everything that we said was sacred.” We still endorsed having done that. But now it’s seen in a totally different light. There could be something like a profound shift like that, which is why something like long reflection might be important.

Iason Gabriel: Yeah. I think it really matters how the hypothesized shift comes about. So, if there is this kind of global conversation with new information coming to light, taking place through a process that’s non-coercive and the final result seems to be a stable consensus of overlapping beliefs that we have more moral consensus than we did around something like human rights, then that looks like a kind of plausible direction to move in and that might even be moral progress itself. Conversely, if it’s people who have been in the conference a long time and they come out and they’re like, “We’ve reflected a thousand years and now we have something that we think is true,” unfortunately, I think they end up kind of back at square one where they’ll meet people who say, “We have reasonable disagreement with you, and we’re not necessarily persuaded by your arguments.”

And then you have the question of whether they’re more permitted to engage in value imposition than people were in the past. And I think probably not. I think if they believe those arguments are so good, they have to put them into a political process of the kind that we have discussed and hopefully their merits will be seen or, if not, there may be some avenues that we can’t go down but at least we’ve done things in the right way.

Lucas Perry: Luckily, it may turn out to be the case that you basically never have to do coercion because with good enough reasons and evidence and argument, basically any mind that exists can be convinced of something. Then it gets into this very interesting question: if we’re respecting a principle of non-domination and non-subjugation, then as we get something like Neuralink and merging with AI systems, and we gain more and more information about how to manipulate and change people, which of the changes we can make to people from the outside would count as coercion or not? Because currently, we’re constantly getting pushed around in terms of our development by technology and people and the environment and we basically have no control over that. And do I always endorse the changes that I undergo? Probably not. Does that count as coercion? Maybe. And we’ll increasingly gain power to change people in this way. So this question of coercion will probably become more and more interesting and difficult to parse over time.

Iason Gabriel: Yeah. I think that’s quite possible. And it’s kind of an observation that can be made about many of the areas that we’re thinking about now. For example, the same could be said of autonomy or to some extent that’s the flip side of the same question. What does it really mean to be free? Free from what and under what conditions? If we just loop back a moment, the one thing I’d say is that the hypothesis that you can create moral arguments that are so well-reasoned that they persuade anyone is, I think, the perfect statement of a certain enlightenment perspective on philosophy that sees rationality as the tiebreaker and the arbiter of progress. In a sense, the whole project that I’ve outlined today rests upon a recognition or an acknowledgement that that is probably unlikely to be true. When people reason freely about what the good consists in, they do come to different conclusions.

And I guess, the kind of thing people would point to there as evidence is just the nature of moral deliberation in the real world. You could say that if there were these winning arguments that just won by force of reason, we’d be able to identify them. But, in reality, when we look at how moral progress has occurred, it requires a lot more than just reason giving. To some extent, I think the master argument approach itself rests upon mistaken assumptions and that’s why I wanted to go in this other direction. By a twist of fate, if I was mistaken and if the master argument was possible, it would also satisfy a lot of conditions of political legitimacy. And right now, we have good evidence that it isn’t possible so we should proceed in one way. If it is possible, then those people can appeal to the political processes.

Lucas Perry: They can be convinced.

Iason Gabriel: They can be convinced. And so, there is reason for hope there for people who hold a different perspective to my own.

Lucas Perry: All right. I think that’s an excellent point to wrap up on then. Do you have anything here? I’m just giving you an open space now if you feel unresolved about anything or have any last moment thoughts that you’d really like to say and share? I found this conversation really informative and helpful, and I appreciate and really value the work that you’re doing on this. I think it’s sorely needed.

Iason Gabriel: Yeah. Thank you so much, Lucas. It’s been a really, really fascinating conversation and it’s definitely pushed me to think about some questions that I hadn’t considered before. I think the one thing I’d say is that this is really… A lot of it is exploratory work. These are questions that we’re all exploring together. So, if people are interested in value alignment, obviously listeners to this podcast will be, but specifically normative value alignment and these questions about pluralism, democracy, and AI, then please feel free to reach out to me, contribute to the debate. And I also look forward to continuing the conversation with everyone who wants to look at these things and develop the conversation further.

Lucas Perry: If people want to follow you or get in contact with you or look at more of your work, where are the best places to do that?

Iason Gabriel: I think if you look on Google Scholar, there are links to most of the articles that I have written, including the one that we were discussing today. People can also send me an email, which is just my first name So, yeah.

Lucas Perry: All right.

End of recorded material

Peter Railton on Moral Learning and Metaethics in AI Systems

 Topics discussed in this episode include:

  • Moral epistemology
  • The potential relevance of metaethics to AI alignment
  • The importance of moral learning in AI systems
  • Peter Railton’s, Derek Parfit’s, and Peter Singer’s metaethical views



0:00 Intro
3:05 Does metaethics matter for AI alignment?
22:49 Long-reflection considerations
26:05 Moral learning in humans
35:07 The need for moral learning in artificial intelligence
53:57 Peter Railton’s views on metaethics and his discussions with Derek Parfit
1:38:50 The need for engagement between philosophers and the AI alignment community
1:40:37 Where to find Peter’s work



You can find Peter’s work here


We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today, we have a conversation with Peter Railton that explores metaethics, moral epistemology, moral learning, and how these areas of philosophy may or may not inform AI alignment. The core problem that this episode explores is that as systems become more and more autonomous and increasingly participate in social roles that require social functioning, it will become increasingly necessary for AI systems to be familiar with and sensitive to morally salient features of the world. This requires that systems have the capacity for moral learning and developing an understanding of human normative processes and beliefs. On top of that, structuring any kind of procedure for moral learning in AI systems will bring in metaethical beliefs and assumptions that would be wise to understand and be explicit about. For a little more context, some key motivating questions for this episode to consider are: when and what is the degree to which AI systems will require the capacity for moral learning? How might metaethics inform or not inform AI alignment? How do you structure a system such that it can engage in moral learning in a way that would be broadly endorsed and would satisfy other ethical or meta-ethical principles we broadly care about?

For some more background, I did a podcast with Peter Singer on his transition from being a moral anti-realist to a moral realist. That episode is titled “On Becoming a Moral Realist with Peter Singer.” In that episode we explore his metaethical views, and Peter Singer mentions conversations and debate between Derek Parfit and Peter Railton on issues in metaethics. So, the second half of this podcast is dedicated to understanding and unpacking Peter Railton’s metaethics and how it compares with Peter Singer’s and Derek Parfit’s views. This podcast is pretty philosophy heavy, so if you’re into that and the ethics of AI then you’ll appreciate this episode. You can subscribe to and follow this podcast on your preferred podcasting platform, by searching for “Future of Life.”

Peter Railton is a Professor of Philosophy at the University of Michigan, Ann Arbor. He has a PhD from Princeton and primarily researches ethics and the philosophy of science. He focuses especially on questions about the nature of objectivity, value, norms, and explanation. Recently, he has also begun working in aesthetics, moral psychology, and the theory of action. And with that, let’s get into our conversation with Peter Railton.

Just to start off here, sometimes I’ve heard that metaethics doesn’t matter, or one might wonder when does metaethics ever matter in real life anyway? I’m curious, do you have any thoughts on whether metaethics matters at all for AI alignment?

Peter Railton: Well, in the most general sense, metaethics concerns questions about the nature of morality, its foundation, the possibility of moral knowledge, how we might acquire it, the meanings of moral claims, how they stand in relation to our other forms of knowledge. And so it does seem to me as if metaethics is important in thinking about the problems of ethics in AI, partly because I think a lot of people have, in the back of their minds, skeptical concerns about morality. And therefore, they doubt whether there could be objective value. They think perhaps value is entirely subjective. And if that’s your approach, then you might say the challenge of creating ethical AI is not a very well defined problem.

What would be the subjective attitude of a properly aligned AI system? You might consult the population and find out what the average point of view is. But we know the average point of view right now is very different from what it was 200 or 300 years ago. We think in some ways it’s improved since then. And we think in some ways where we are now could be improved. So we can’t reduce the question of ethics in AI to something like opinion sampling, and that’s because morality has objective dimensions and we use these to criticize our preferences and our opinions. And so any decent ethics for AI would build into the concept, the possibility of correction and criticism. And for that, you need some thought of what would constitute correction or criticism? How would we justify moral claims? And that takes us to the heart of metaethics.

Lucas Perry: Right. And there’s a lot of moral anti-realists or people who think that morality is subjective in, I guess, hard sciences and computer science in general. So this also applies to the alignment community. If one feels that moral claims or moral attitudes are subjective, then this choice that you mentioned to take the average of general popular opinion is itself a moral choice, which is the expression of one’s own subjective moral attitude from that point of view. And within a subjective framework, there’s no way to resolve that, except take the expression of all of the power dynamics of everyone’s subjective moral attitudes and see what comes out of that, right?

Peter Railton: Well, yeah, that would be one of the problems. The project of creating ethical AI or AI alignment, as it’s sometimes called, can’t be the problem of giving our value system to machines because there is no unique value system that we possess. It could be the project of trying to make it possible for the machine to learn the most justified value system. And part of the problem, I think, is that people have exaggerated notions of what it would take to justify moral claims. They assume, for example, that there’s a huge gulf between facts and values, that there are no reasonable ways of bridging that gulf, and that in general, what it would take to have objective morality would look something like the universe with what God would do, only without God.

One of the problems with that thought is that that’s a model of morality as a set of commands given by some kind of a divine enforcer. And if you think that absent such a divine enforcer, morality could only be subjective, then I think you’re missing the idea of what morality really does. The existence of a divine enforcer wouldn’t bring morality into existence. A divine enforcer could be either good or malevolent. And so understanding what it is to do moral criticism should be an integral part of the challenge of thinking about ethics and AI. But looking at moral criticism, we have many practices of moral criticism, and those aren’t, strictly speaking, subjective, and we value them because they help correct our subjective opinions.

Lucas Perry: So I think there’s two parts of metaethics that I would like to see if you have any thoughts on how they may or may not apply here. Metaethical epistemology, how is it that you know things about metaethics? And whatever may be metaphysically true about ethics or not. So you brought up religion there. So in terms of, I guess, what would be called Divine Command Theory, morality would have a metaphysically very solid ground as being codified by God or something like that.

Peter Railton: Actually, I’d say that that wouldn’t get us a solid metaphysical ground. The fact that commands come from a being that’s supremely powerful, and even one that’s supremely knowing, would not make those commands moral commands. Those conditions are perfectly compatible with immoral values. What we would need is a perfectly knowledgeable, entirely powerful, and all good God. A so-called AAA God. But that means the concept of good is independent of the concept of God itself, and understanding what it would be for the commands of a divine super powerful being to be good just takes us right back to the question of the nature of morality. We don’t solve it by introducing supreme beings.

Lucas Perry: Right, right. So I’m not trying to justify or lay out the Divine Command Theory. Only using it to, I guess, attempt to explain how epistemology and metaphysics fit into metaethics. To me, it seems like what is relevant here to AI alignment is that how one believes one can know things about metaethics and whether or not there can be agreement upon metaethical epistemology would be the foundation upon which metaethical moral learning machine systems could be expressed.

There is sort of a meta view on the epistemology of metaethics, where one could say, “Because there are no moral facts, the epistemology is whatever human beings are doing to think about moral thought.” And there isn’t a correct epistemology. Whereas one could, whether through naturalism in your metaethics, or through non-naturalism in Peter Singer’s ethics, believe there to be moral truths, and that thus there is a correct epistemology about metaethics, and that that epistemology of metaethics could be used to instantiate metaethical learning in machine systems.

Peter Railton: So one thought would be, there is one true morality and we’re capable of knowing it. That itself wouldn’t get us very far in epistemology until we could say what those methods of knowing are. An approach that’s got something like that as an assumption, but that doesn’t assume that we know what the destination is ultimately going to be, would be to ask, “Do we have good practices of moral criticism? And do those help us to solve actual problems, social problems, interpersonal problems, problems with our own lives?” And then to look at the ways in which we use morality in these contexts to solve problems.

And that brings it down to the level that it’s something that comes within the scope of what can be learned. And if we look at children’s learning, we see that their development as moral creatures proceeds in pace with their understanding of causality, their understanding of theory of mind, their capacity to form counterfactual thoughts, because it’s really an integrated body of general understanding. And so for example, the idea of positive-sum solutions to game-theoretic challenges, that’s something that can be agreed upon by all parties to be a desirable thing. And so looking at strategies that have the possibility of yielding positive sums, cooperative strategies, strategies of trustworthiness, signaling strategies, which enable us to coordinate with each other, understand each other’s intentions, those have a justification that we can give in terms that are not tied to any one particular person’s interests, which address interests generally, and which we can defend in an impartial way.

And so that would be an example of a way in which we could say those are more reasonable solutions, more justified solutions. There’s an analogy here with epistemology generally. If someone were to come to me and say, “Well, you claim to have knowledge, how do you demonstrate that? How would you show that your understanding of knowledge is genuine knowledge?” I’d have to say, “Well, sorry, I can’t demonstrate that. Any demonstration would presuppose knowledge. And so I can’t pull it out of a hat and I can’t derive it from nothing.” So what can I do? I can say, “Well, here are our practices of epistemic criticism. And while we have disagreements in various places about what counts as evidence or what does not, do those practices deliver the kinds of results that we would expect from reasonable epistemologies, making possible things like scientific inquiry and technology and so on?”

And we can say, “Well, that’s what epistemology could be expected to give us. We do have methods that can improve our ability to solve such problems in just those ways. We can find various ways to justify them in terms of probabilities, looking for ways in which we can increase accuracy and estimations.” And so those are different ways in which by looking at our actual practices of epistemic criticism, we try to get some traction on the problem of knowledge. And I would argue we should do the same thing about morality. If we start from the standpoint of skepticism, in the case of knowledge, we will end with skepticism. The same would be true with ethics, but I see no more reason to do it in ethics than in epistemology. We surely must know a great deal about what’s good for us, good for one another. And we have well-developed practices of moral assessment that we use in our own lives, and we use in our collective institutions. So I would say, if we look to those, then we don’t see just subjective opinion. It’s quite different from that and we see a lot of constraints.

Lucas Perry: So I do want to explore more arguments around metaethics with you. And we’re intending to do that after we discuss moral learning here. Now, in terms of moral epistemology and the epistemology of metaethics, I’m interested in this part of the conversation in setting up and attempting to illustrate that, whether one is going to take a skeptical view on moral epistemology or not, moral learning and our view on moral epistemology are essential and important in the alignment and development of AI systems. And here you’re defending a more realist account of epistemology in ethics.

Peter Railton: Well, you could say that I, myself, am a realist, but what I’ve been saying so far, a pragmatist about ethics could say just as well. John Dewey would say something very similar. So would various kinds of non-realists who are nonetheless objectivists in ethics: Kantians, for example, Constructivists, and so on. What I’ve said is really neutral territory for a wide range of views in metaethics. And it doesn’t presuppose, in particular, a form of naturalism or a form of realism. That’s actually a tremendous amount to build upon so that when we think about how to design robots to understand the world, we have a lot of knowledge about what sorts of systems would be well-designed for doing that.

Similarly, if we want to build a robot who can interact creatively and productively with other robots, solve problems of coordination, reduce conflict, realize longterm goals, interact successfully with people, recognize their interests, take their interests into account, being relatively impartial with regard to interests that are at stake, those are not mysterious in the same way that the skeptic seems to think they are. Because again, they’re already integrated in our practice and as Hume pointed out a long time ago, skepticism doesn’t survive very well once we leave the closeted philosophical study. People go out and they act as if they had knowledge of the world and they act as if there are things that people could do to them or that they could do that would be better or worse, right or wrong. They think about how they would treat their children. They think about how they should behave with respect to their students or their professor. That doesn’t take us into the misty realms of metaphysics, but it does take us into the practices of moral criticism and self criticism.

Lucas Perry: So could you unpack just a little bit more about why this view is neutral?

Peter Railton: So for example, I’ve mentioned a couple of features of moral thought. One feature of moral thought is that it takes a kind of impartiality seriously. It gives equal weight to all those affected. That’s something that Kantians and Utilitarians and many other moral theorists would agree on. Another feature of moral thought is that it’s concerned with general reasons. Similar cases have to be treated in a similar way. That leads to a doctrine known as supervenience. We can’t invent moral distinctions that don’t correspond to real distinctions in fact. Another feature is that morality has to do with reciprocity, relations of mutual gain and mutual benefit. Another is that morality involves taking oneself and others as ends and not as mere means.

Those are all normative theories. But if you then ask, well, “What about the metaethical side? Could a pragmatist about ethics say the same things?” And the answer seems to be, yes, the pragmatist sees ethics as essentially about people solving the human problems that they face in ways that meet these kinds of desiderata. The person who believes that there’s a rationalist foundation believes that you can know a priori that these constraints exist of impartiality and so on. But as you can see from Singer’s work, the result of applying his form of rationalism is not dramatically different from the results of applying my form of naturalism. And that’s because the target that we’re all working on, ethics that is, has a great deal of determinate structure. And so any metaethical theory is going to have to capture a lot of that structure.

Lucas Perry: And so, sorry, what is the relationship about how this is instructive for why metaethics matters for AI alignment?

Peter Railton: Well, the suggestion was, well, we should know something about what ethics is in order to answer that question about how we might gain moral knowledge. If we can gain moral knowledge, what moral knowledge might consist in? That’s where we started. And then I tried to suggest a bunch of considerations, a bunch of features, that I could call obvious features of moralities of practice. Because I think our practice is not just at the normative level. People also have implicit metaviews in ethics. They demonstrate that by, for example, their knowledge of how you can determine morally relevant considerations in situations. So they understand what kinds of considerations are or aren’t morally relevant. They understand the distinction between morality and etiquette, between morality and law, between morality and self-interest. So they have a grasp of a bunch of these obvious features of morality.

And those are not just features of one or another normative theory. They’re features of virtually all normative theories and features that any metaethic is going to have to accommodate, unless it’s going to be skeptical. So that’s why I say that there’s a great deal of common ground, not because the fundamental explanations are going to be the same, but there is an explanatory target, which has a great deal of structure and which indeed all these theories have to explain. And that requires then that metaethical theories be adequate to that.

Lucas Perry: I see. So that is already structuring metaethical epistemology is what you’re saying?

Peter Railton: Yeah. It gives you quite a bit of structure.

Lucas Perry: Yeah. That’s just reminding me about how Peter Singer talks about this one philosopher in his book, The Point of View of the Universe, discusses how there are a few axioms of morality and they seem to touch upon these convergent principles that you’re talking about here. Now, on a realist’s account of metaethics, there would be something like a one true moral theory. And if one takes the one true moral theory view seriously, then the problem of AI alignment would be to cultivate a procedure for coming up with the correct moral epistemology in order to find the one true moral theory, or to discover the one true moral theory ourselves, and then align AI systems to that.

Now, if one believes that there is not one true moral theory, and there is only the evolution and extrapolation of human normative processes, and preferences, and metapreferences, then one might not want to come at the AI alignment problem from the perspective of a one true moral theory approach. And as a general note, I’m taking this language from Iason Gabriel, who will be on the podcast soon. And so in the secondary scenario, that is not using the one true moral theory approach, one would want to come up with a broadly acceptable procedure for aligning AI systems that didn’t presume to try to discover a one true moral theory. Do you have any reactions to these two ideas or approaches to alignment?

Peter Railton: Yeah. The question of whether one thinks there is one true theory is somewhat different from the question of whether one thinks we’re close to it or we have good ways of knowing it. I myself, although I’m a realist, I recognize that there’s a good chance that my moral views are wrong and my metaethical views are wrong. And so I don’t want to just put all of my energy into thinking, “Well, how would we discover the one true moral theory?” I would want to think more robustly. And again, I can make an analogy with epistemology. If you go into a philosophy of science department or a statistics department, you’ll find that there’s a tremendous debate between people who think that Bayesianism is the right kind of approach for evidence and people who think that standard methods of social science are the best methods of evidence gathering.

You’ll find a tremendous amount of disagreement. So if we’re trying to build a robot who understands its environment, we don’t want to say, “Well, we have to figure out which one of those theories is correct before we can build a robot to understand its environment.” You might say, “We want a robot that’s got a robust capacity to learn, and that would deliver results reasonably approximated by a Bayesian, or an inductivist, or someone using social science statistics. They’re not going to agree on everything. Where there’s overlap, we should try to build a machine that can stay in the overlap. We should try to build a machine that’s not brittle in the sense of making epistemic commitments that are at the far edge of one or another of these views.”

And so I would say our task is to build a system that’s robust. And that means building into it the fact that we don’t know what this one true theory is. And so therefore we want as far as possible to accommodate an array of approaches, all of which have very strong reasoning behind them. You could think that we’re not trying to build an AI system that discovers the one true theory. We’re trying to build one that isn’t going to be dependent upon exactly the target that it hits, but rather could be successful in an array of possible environments.

Lucas Perry: So, I mean, adjacent to this and promoted and discussed by people like Toby Ord and William MacAskill, would be this human existential procedure for moving into the future, where it’s like, we’re going to align AI systems, whatever that means. And that alignment will hopefully not lock in any values or any particular kind of alignment procedure, but will ensure existential security for humanity, such that existential risk just keeps going down to zero and is near zero. And then we use this existentially secure situation to do a long reflection on value, and what is good, and what may be true or not true about ethics. And then with sufficient consideration, then we can engage in populating the stars and optimizing things the way that we see fit. So what is your view on this proposed long reflection?

Peter Railton: Insofar as I understand it, I don’t have any objection to it. I’m not sure I do understand it. One of the things that you said just in passing was that we were going to try to design these systems to behave as we see fit. I myself am not sure I know how it is fit to behave. And I certainly know that I have some mistaken beliefs about that. And I would hope that just as artificial intelligence may help us correct certain of our views on cosmology or in medicine, artificial intelligence could help us correct certain of our views in ethics.

We’ve seen a tremendous amount of evolution in people’s fundamental moral convictions over time. Some have stayed relatively similar. Others have changed dramatically. And we would, I think, do best to think of the artificial extension of intelligence as one way in which we can get a perspective on these issues and situations and problems that isn’t just our own, and that won’t have the same priors as our own, and won’t have the same presuppositions, and they should be included. We should think of these as agents.

They will have interests just as we have interests, and the standard would not be, what do we see fit, where we mean something like we humans, but what will we see fit as we, the humans and the artificial systems continue our evolution and our cultural development. And we want to think that the path that we should follow is one that leaves open that kind of development rather than constraining it to fit what happens to be our current set of moral convictions, which again are not shared. There are too many disagreements in order to think that we could just write down the rules. Long reflection, I think will also tell us that we need a dynamic picture. And we should have some convictions that are more confident, closer to the core. We should have methods and practices that meet reasonable standards of justification and objectivity, and we should be prepared to learn.

I can’t, I’m afraid, think of a way to guarantee against the existential risk from artificial intelligence or even our own intelligence, which may be more problematic. But I do suspect that the best way to contend with problems with existential risk is to face them as communities of inquirers.

Lucas Perry: All right. So I think you’ve done an excellent job explaining the importance of moral learning and moral epistemology here, given that the ongoing cultivation of more wholesome and enlightened moral value and moral thinking is always on the horizon. Now, you have some perspective and research that you’ve done on moral learning in humans and the importance and necessity of that. I’m curious here now then to relate some of that research that you’ve done in moral learning in humans to how AI systems of increasing autonomy may also wish to take on the kind of moral epistemology that infants and young humans may have.

Peter Railton: I wouldn’t say that I’ve done research in this exactly. I’ve certainly explored others’ research in this and tried as best I can to learn from it. One of the things that’s impressed me in the literature as it’s evolved over the last couple of decades is how much the learning of children is accomplished, not via explicit teaching, but by the children’s own experience. What we’ve learned recently, and this is not from developmental psychology, but from various kinds of models of machine learning, is that very complex structures can be learned experientially. There are powerful techniques which we can add to that kind of probabilistic learning in order to create knowledge of general principles, to do something like build a structured understanding of language that would enable a child to speak fluently, to understand what others are saying and to engage with them, that does not require either an innate grammar or explicit instruction in language as such. That’s a kind of a model of how we also seem to acquire our social normative knowledge.

If you think about the perspective of the infant, one thing that we’ve learned from the animal research is that animals don’t just build a spatial map in relation to themselves. They don’t just build an egocentric map of their environment. They also build grid-like maps that are non perspectival, and they navigate by combining these two kinds of information, perspectival and non-perspectival information. Infants seem to do something similar in learning about learning. They not only represent their relations with individual adults and whether those benefit them or not, but they also seem to construct general representations of whether a given adult is competent or helpful in third party interactions and to use that aperspectival information to make decisions about who they’re going to learn from or pay more attention to. They start doing this surprisingly early on. And so at the same time that they’re constructing the ego centered world, they’re constructing a non-centered representation of the world that includes normative features like reliability, competency, helpfulness, cooperativeness.

And so the child in coming to represent the world around them is constructing representations that have the initial form of moral representations. It turns out to be efficient for learning to be a successful human being that one construct representations spontaneously that have this quasi-moral structure. And that would suggest to me that if machines develop as agents, agents interacting with other agents, agents capable of solving a range of problems, capable of having sustained interactions with humans to solve open-ended problems, that they will also find that they do better if they can construct these quasi-moral representations of situations. And so that means that they will be acquiring sensitivity to morally relevant information through the very task of acquiring social competence, linguistic competence, epistemic competence in a social world.

So there’s a kind of picture here that congrues nicely with the fact that we now know that complex models can be acquired through experiential learning. That suggests that there is a promising pathway toward the development of theory of mind, causal inference, representation of social value from an objective or non-personal perspective. There is an argument for thinking that that’s actually a fundamental core part of our capacity as intelligent beings capable of successful social interaction. That suggests that this is not a peculiarity. It’s not culturally specific. And so why not use similar methods in our interactions with artificial agents to enable artificial agents to acquire these kinds of quasi-moral mappings?

Lucas Perry: So the key thing to draw out from here is that there is this distinction between explicit and implicit learning of morality, and you’re remarking about how there isn’t much explicit moral learning in infants and children. Most of this moral learning comes from simply experience and interacting with the world rather than explicit instruction about what is right and wrong.

Peter Railton: There’s tremendous cultural variability in that, within our society and across societies, as to how much explicit moral instruction children are given. What’s fascinating is that even in societies where children get very little explicit moral instruction, they nonetheless acquire these capacities. Similarly with language, there are some societies like upper middle class US society where parents talk extensively with children. There are other societies where parents do not, and yet the children can become fluent linguistic agents. So my thought is that the explicit theory isn’t really the thing that’s doing the fundamental work. Even to understand what parents are trying to do when they give you explicit moral instructions, to understand how to apply those or what they might mean, the child is already going to have to have quite a complex aperspectival representation of the social situation. The thought here is that in some places there’s more explicit theory, in some places less explicit theory, but the result in terms of the development of behaviors is very similar.

A good example of this is that around age three or four, children who are given a command by an adult in authority, if that command violates a reasonable norm against harm, will balk and refuse to perform it. So if a substitute teacher comes in one day and says, “I’m the teacher today, and in my classroom, you have to raise your hand before you speak,” children in the classroom will start raising their hand before they speak. If the teacher says instead, “I’m the teacher here, and in my classroom, children jab the point of their pencil into the child next to them when they wish to speak,” they’ll stop. They won’t do this. And if they’re asked why, they won’t say, “Well, that’s not the way we do it.” They’ll say, “It would harm the other child.”

And so that suggests that even an attempt by a figure of authority to give a norm in a situation where children can perfectly well understand that there is a scope of legitimate authority, put your hand up before you speak, they will distinguish between that kind of conventional authority and moral authority. And that’s an autonomous action on their part. They’re not getting rewarded for it. In fact, the teacher might either send them out of the room or send a note home to their parents, but they balk because they can represent the situation in these quasi-moral terms. And when they do that, they say, “No, this is not a good solution to the problem.” That suggests to me that even if we were to think that children learn by being given explicit instructions by people in authority, they actually independently learn that they can resist that and will resist it.

Lucas Perry: Right. So we’re in a position where evolution has cultivated and embedded in us, a kind of moral learning, where there is a certain degree of implicit and explicit moral learning, depending on your culture and where you’re from. And as you’re saying, luckily there’s strong convergence on this ability of moral learning to lead human beings to agreeing on say in the case of stabbing the other child, that would be something like a principle of unnecessary harm to another person. That seems to be for most human beings something that is strongly converged upon pretty early, unless your environment is particularly pernicious or something. And that there is this convergence because of how our moral learning is structured given evolution. And that, that moral learning enables in us a kind of moral autonomy that’s there from an early age.

And there is a question of how this moral learning is best structured in say both people and in machine systems. And then there’s the question of moral learning from the outside. What kind of environment is most conducive to moral learning? Are there insights into this that can begin pivoting us into the relationship or importance of moral learning in AI systems?

Peter Railton: Perhaps so. Actually there’s a fair amount of evidence that even infants brought up in some very difficult situations will nonetheless develop these forms of pro sociality and cooperativeness, partly because they become especially important in those situations even to solving the most basic problems or meeting the most basic needs. So I wouldn’t think that the mere difficulty of the situation was sufficient to prevent this kind of learning. On the other hand, if the child is given the wrong incentives, they’re also going to learn a whole bunch of other stuff like you can’t count on other people, you can’t trust other people.

So put this from the standpoint of artificial agents. We want the artificial agents in our world, whether they’re a companion for an elderly person or an autonomous vehicle or a telephone answering service system, we want those systems to be sensitive to these kinds of moral considerations and capable of a degree of autonomy. If for example, there is a system that’s looking after an elderly person and some vital sign of the elderly person is showing a problem, and the person says, “I don’t want to report that. I don’t like having people know this information about me,” or maybe they’re concerned that the doctor will prescribe something that they won’t like, I hope to have systems, which can in that situation think, “Is this the kind of thing that I should keep from the physician? It’s the preference of this individual, but this preference may not be in the best interest of the individual in this case.”

And so an autonomous system would be able to make that kind of assessment. Could get it wrong, could get it right, could learn from it, but I wouldn’t want a system to be such that they would simply take over wholesale the preferences of the person that they are interacting with. And of course the same thing is going to be true with self-driving cars and with question answering systems and so on. They will need a certain amount of autonomy in order to do those jobs effectively. And in order for that to happen, for them to have that autonomy, they’ll have to have their own representations of the moral structures of the situations and have the capacity to construct those.

I suspect that if we really do want to create intelligent systems that are capable of this kind of autonomous self-critical and critical moral thought, the way to do so is very much like the way children do so. And in so doing, we run the risk of creating autonomous systems that won’t always agree with us, but have we done what’s appropriate so that when they exercise that autonomy, their chance of getting things right is at least as good as our chance of getting things right? So you could think of this in the kind of adversarial picture where you’re trying to see if you can discriminate between the moral judgments of the machine and the moral judgments of the individual, and the machine and the individual could be part of a learning process that improves the machine’s overall model and generative model of situations.

Lucas Perry: So there would be the question of, how do you structure a system such that it can learn moral learning in a way that would be broadly endorsed or would satisfy other ethical or meta-ethical principles that we have? That is double-edged in so far as if you screw it up, then the thing is autonomous and can disagree with you. And the capacity to disagree would either be detrimental in the case in which it is wrong in its moral learning, or it would be enlightening for both us and the world and the machine if it were right about morality when we weren’t. How do you think about and balance this risk between the possible enlightenment that may come from embedding AI systems with moral learning and also the potential catastrophe if it’s done too quickly and incorrectly?

Peter Railton: Yeah. Wish I had an answer. If you think about it, the existence of humans with malicious intentions means that if artificially intelligent systems don’t have this kind of moral autonomy, they’re going to be very willing servants. So you might say, “Well, there’s a risk on the other side, which is that if they aren’t capable of any kind of criticism or autonomy, then they will be much too willing and much too readily deployed and much too manipulable by humans whose purposes I’m afraid to say are not always benign.” If you were thinking about the problem of raising a child, you would say, “Well, I don’t want to raise a child who simply takes orders. I want to raise a child who can raise questions as well.”

I think our only defense against malicious humans with extremely intelligent systems at their disposal is to try to ally with intelligent systems to create a comparable counter force. And that counter force is going to be operating out way past our understanding because it’s going to be in competition with systems that can operate extremely fast and take into account a large number of variables. And so we better be building systems which, as they get further and further out in this kind of a competition, have some kind of a core where they are responsive to morally relevant features even at the far extent of their development.

And so if you think about it as trying to build a moral core, then that core can figure in their operation even as they become more and more intelligent. They can use the intelligence to gain information and perspective and capacity to understand situations that can improve their understanding. But if we don’t do something like this, we, and other artificial systems, will really be prey to those who have and want to implement malicious and manipulative intentions. So I balance the risk partly by thinking, I can’t think of a very good way to defend against the perils of malicious combinations of human and artificial intelligence other than to develop more trustworthy forms of human and artificial intelligence interaction. And that requires according these systems some autonomy and some trust.

Lucas Perry: That makes sense to me. And I think it addresses some important dimensions of the soon to be proliferation of AI.

Peter Railton: To me, one of the most exciting features of more recent developments in artificial intelligence is that they give us for the first time, I think, a plausible model of intuitive knowledge, knowledge that could be implicit, but nonetheless be highly structured, contain a great deal of information, contain a capacity to engage in simulation and evaluation. So I would expect that the structure of moral knowledge could be like our structure of common sense knowledge generally. It could be quite distributed. It could be quite a complicated system, not a system of extracted principles. There might be some general features that are important, and I think that’s bound to be true. And that is true when these systems learn, but we don’t have to think that the kind of competency that they would have, if it isn’t something like that, is therefore undisciplined and therefore lacks power or reliability.

So for the first time, anyhow, I thought here is a picture of how intuitive intelligence might look. And of course we can’t introspect the structure of such knowledge and it does not have a readily introspectable propositional structure. But it is capable nonetheless of carrying and modeling and engaging in quite complex computations, simulations, action guidance, control of motor systems in ways that look like intuitive intelligence. Now I realize we’re a long way from the way the brain actually functions, but even to have these models, it gives us a kind of proof of concept of the possibility of something like intuitive knowledge.

Lucas Perry: Right. So if we’re building AI systems as willing slaves who optimize the preferences of whoever is able to embed those in the machine, there’s no defense in that world against malevolent preferences other than not allowing the proliferation of AI to begin with.

Peter Railton: And we’re already past that point. Enough has proliferated and there’s enough inequality of wealth and power in the world to guarantee that further proliferation will take place. It’s already the case that we can’t count on keeping this genie in the bottle and obviously don’t want to do so. I’d say we’re now in the phase where we need to have an active, constructive program of starting to build AI agents that are actively responsive to morally relevant considerations, are good at solving coordination problems, are good at this kind of interaction and capable of the kind of insight needed to be potential moral agents.

Lucas Perry: Right. And you argue that as the systems inhabit increasingly social roles in society and are constantly interacting with other agents and with the world, it’s increasingly important that they be sensitive to morally relevant features. Without this, again, malevolent humans or humans with misaligned values that are counter to most of the rest of humanity can abuse or use systems more freely if they’re not already sensitive to morally relevant features. And that if there is an ecosystem of AIs, purely altruistic systems which are not tuned into morally relevant features can be abused by other AIs as well.

Peter Railton: Yes, that’s right. One thing that’s gotten me to feel some conviction about this possibility is that the one kind of experiments that I do run are thought experiments. And I’ve been for years running moral thought experiments in my moral philosophy classes. And in recent years, I’ve been able to do so using a system that allows students to confidentially record their answers to problems like moral dilemmas or questions about interpreting moral situations or motives. And what’s impressed me over the years is how coherent and consistent these responses are.

And what leapt out, for example, from the familiar trolley problem was that mediating their moral judgments seems to be a model of the agents that are involved, a model of what kind of an agent would perform an action of a certain kind, and what kind of responses such an agent would receive from others in the community. Would they be trustworthy? Would they not be trustworthy? And so, instead of thinking there’s just these arbitrary differences in preference between throwing a switch and pushing someone off a footbridge, and there’s no real principle there, and no one’s found a principle to cover these cases, you can think now there’s this intuitive competency people have in understanding situations and characters and what kinds of persons would respond in what ways and situations and what it would be like to have those persons in our community.

And once you look at it that way you can get a tremendous amount of consistency in people’s responses, which suggested to me that they are doing this kind of generative modeling of situations and doing so in a way that does predict their actual judgments. And if I ask, “Well, why did you make that judgment?” they’ll say, “I don’t know. It was just an intuition.”

Lucas Perry: Yeah. So the thought experiment that you’re pointing to, a lot of people would flip the switch in the trolley thought experiment to switch it to the track where there’s only one person and then if you changed it so that there’s a person on a bridge who is sufficiently large, that if you push them off the bridge, they will stop the trolley from killing five people on the track. The intuitive response that you’re pointing out is that people are less likely to want to push someone off of a bridge than to flip a switch. And you’re like, well, what’s really the difference? In the thought experiment, there’s not much of a difference, but the intuition that you’re pointing out, the morally relevant feature that is subtle and implicit is that we don’t want to live in a world where there are the kinds of people who have the capacity to push people off of bridges.

Peter Railton: In that kind of a setting, yes.

Lucas Perry: Yeah.

Peter Railton: And you can give them a whole array of other scenarios in which the agent would have to do something like pushing someone to a grisly death and where they will agree that it should be done, for example, in situations where self-defense is needed against, for example, a terrorist action. And again, if you ask them, “Well, would you trust an agent who would perform such an action?” then the answer is they would actually have more trust in such an agent. So again, they’re modeling the situation, not in response to this or that minor tweak of the situational features, but in terms of a quite deep understanding of the motivations and attitudes that are involved. And then if you go over to the psychological literature, you find that the dispositions to give the push verdict in the footbridge case correlate more with antisocial behavior, with lack of altruism, with lack of perspective taking, with indifference to harm than with altruism or any kind of a generalized utilitarian perspective. So the psychologists seem to confirm the understanding that my students implicitly had of the situation.

Lucas Perry: What’s relevant to extract here is that there are deep levels of morally salient features that human beings take into account, and that increasingly need to be modeled and understood by machine systems for them to successfully operate in the world.

Peter Railton: Yeah. And to be trustworthy. I’m one of those people who thinks emotion is not a magical substance either, and that artificial systems could have and acquire emotions. And that part of the answer to the question of how do you build a core that is resistant against certain types of manipulation is to look at how it’s done in humans and indeed in other animals and discover that the affective system plays a pivotal role in just these kinds of situations. And so I suspect that’s another avenue of development. And children’s moral emotions undergo a similar kind of evolution through their upbringing, but through their direct experience because the emotions are there before they’re told what to feel. Indeed how would you tell the child what to feel?

Lucas Perry: Are there any other points that you’d like to wrap up on here, on the advantages of reflecting on AIs which are sensitive to morally relevant features?

Peter Railton: I try to be as accurate as I can in understanding what we’re learning from the literature on prosociality, for example, both with regard to individual human development and with regard to human communities, going back, looking at hunter-gatherer communities. And even as there have been changes in morality, and I have emphasized that there have been changes over time, the kinds of features that people take to be morally relevant, many of those have been relatively constant. And you can think of the changes in our moral views that have taken place over the years as getting better and better at winnowing out the ones that aren’t really morally relevant, like gender, ethnicity, sexual orientation, and so on, because they can easily become culturally relevant without being morally relevant. Fortunately, we have the critical capacity as agents to challenge that.

Lucas Perry: Yeah, that makes sense. The core importance that I’m extracting from everything is the baseline importance of moral learning in general, and also the understanding and capturing what human normative processes are like and what they entail and how they unfold. And that participating in a world of humans requires knowledge of both moral learning and the ability to learn morally.

Peter Railton: And this is not saying that people will always behave well, just in the same way that acquiring linguistic competence doesn’t mean people are always going to speak well or truthfully, or honestly, but rather that the competency will be acquired. One example that I like is sexual orientation. When I was growing up, it was considered fatal for someone’s social identity to be discovered to be gay. And there was a great deal of belief about the characteristics of gay individuals. In the 90s and so on, a large number of gay individuals were courageous enough to indicate their orientation. And what was discovered, we all discovered, was that the world was full of gay individuals whom we admired, whom we had standard relationships with, who were excellent colleagues, coworkers, friends, and that therefore we were operating on a bad dataset because we had not really had, we here I’m talking about heterosexuals, had insufficient experience with gay individuals. And so we could believe all kinds of things about them.

So I would emphasize that if it’s a learning system, it’s going to be very sensitive to the data. And if the data’s bad, the learning system is going to have a problem. So I don’t think it’s a magic solution, but I think the question to ask is, so how do we build on this? How do we provide more representative experiences and less biased samples so that the learning can take place and not pick up cultural biases?

Lucas Perry: Yeah, those are really big problems that exist today and a lot of the solution right now is human beings having to do a lot of hard work in datasets. We can’t keep that up forever. Something else is needed. I think this has been instructive about the importance of the structure of moral learning and I want to pivot back into our discussion of metaethics and your conversations with Derek Parfit and what your metaethical view is and how views on metaethical epistemology or metaphysics may bring to bear intuitions about what moral learning is like or what it might entail. It’s Derek Parfit, right? Who has essays on, Does Anything Really Matter?

Peter Railton: Yes.

Lucas Perry: So I guess that’s the question here then for this part of the conversation, is, does anything really matter? So you were in conversation with Derek Parfit and it seems like your views have converged and are different in ways from Peter Singer, though it seems like you guys are all realists. Could you unpack and explain a little bit about the history here and what went down between you, Parfit, and Singer?

Peter Railton: Yeah, sure. I have to warn those who are listening, buckle up, this is going to have to be a philosophy talk, but I’m sure that many people have these philosophical questions themselves. So let’s just begin with the title that Parfit chose for his master work, On What Matters. And you might say that mattering is the core notion of value, that if you had a universe full of rocks, it would not matter to the rocks what happened. It would not matter to the rest of the universe what happened. And so there wouldn’t be any positive or negative value in that universe. Introduce creatures for whom something matters, even if it’s just as simple as nutrition or avoiding pain, then you can begin to talk about states of affairs as being better or worse than one another, about improving or degrading the situation or the characteristics of the world.

And so mattering is core to the idea of value. And once we grasp that, we begin to realize that value is not some new entity in the world. It’s not something we add to the world. Once you have mattering, then things will have value, and they’ll have positive and they’ll have negative value. And of course, for different creatures, different things will matter. And learning what matters to a creature is understanding what would be good or harmful to that creature, and this of course includes humans. So I was very moved when I was on a committee, looking into questions of animal research, to know that the veterinarians learned a lot about what situations animals preferred and did what they could to try to give them situations in which they were happier, more lively, more disposed to cooperate and learn. And that means that they were trying to learn something about what matters to a rat.

And we now know a fair amount about what matters to a rat. Company matters, exercise, the capacity to engage in activities, build nests. And so these things matter to rats, and so we can give rats a good or a bad existence by thinking about, well, what does matter to rats? Now, what matters to rats is different from what matters to humans, but the basic idea is the same. So there’s value there and it’s thanks to the existence of creatures for whom something matters that value comes into existence in the world. That’s a perfectly naturalistic perspective, treating value as something that is realized by natural states of affairs in the world. Now it turns out that even someone who’s an arch non-naturalist like Derek Parfit agrees that pain is bad, not because it has the non-natural property of being disvaluable, but because of what it’s like in its natural features, those features suffice to make the pain bad.

And if they didn’t suffice to make the pain bad, there would be no value feature we could sprinkle on it that would make it bad. But given that it has those features, there is also no value feature we can sprinkle on it that will make it good. And so Parfit and I can agree that non-naturalism is important in ethics, not because the world is populated with non-natural entities like values. That’s a widespread confusion. It’s reifying a notion of value as if it were some kind of a new domain of entities. And naturally once you’ve done that, it becomes very unclear how we learn about these, what relationship they have to the natural world. If instead, you think, no, value is something that is brought into existence by certain relational features in the natural world, then you can say, “Ah, that’s common ground between Derek Parfit and myself.”

And if Derek’s explaining what’s bad about pain, he’ll give the same explanation that I would give about what’s bad about pain. So we agree on that. The badness in the case of pain, pain is really used for two different things. It’s used for certain types of physical sensation, and it’s used for suffering. That physical sensation isn’t always suffering. So for example, when you put hot sauce on your food, you fire up pain circuits, but you enjoy that. You may seek the burn of exercise. And so there are times when the physical sensation of pain is sought and liked, desirable. It’s part of good experiences. It shows that pain can matter in different ways. It’s the mattering where the value resides, not in the physical sensation just in itself. So the mattering is a relationship between a subjectivity, an agent, and the physical sensation, and it could be positive or negative in a given case, but the value resides in that relationship.

Lucas Perry: But they’re just two contents of consciousness, right? There is the content of consciousness of the sensation of pain on my arm if I scratch it, and I might derive another sensation from that sensual pain, that is pleasure. Wouldn’t the goodness here need to come from this higher level, more pristine pleasure that I gain from the pain, which is more of an emotion and that which is intuitive to the other sensation or the other content of consciousness?

Peter Railton: I think you’re right to bring in higher level mental states as well. Because part of the reason why pain in certain circumstances is desirable is because of the representation that you have of it. And this is true with many features of the world: it is because you understand them in certain ways that they produce in you the positive or negative experience they do. And if you ask a psychologist, the positivity and the negativity in the mind does not reside in the impulses of the pain system or the pleasure or reward system. It resides in the affective system, which encodes value as positive or negative. And it encodes as well the behaviors and the responses that are characteristic of positive and negative value: positive is approach, negative is withdrawal. Fear involves a certain distinctive suite of responses. Anger involves another distinctive suite of responses, but the affective system is where the value is encoded, and that’s the common currency of value in the brain.

So that’s where we should be looking to discover. And it’s the affective system which is the root of our emotions, whether they’re aroused emotions like anger or fear or non-aroused emotions like assurance and trust. That system is a system which encodes this relational feature of value. You’re quite right to think that we should move up a level, and in doing so, we encounter the affective system and its properties. And it’s a system that we share with all of our mammalian relatives and with other species as well. It’s evolutionarily a highly conserved system. And that’s because it is the core of valuation, and valuation is a core activity of living creatures because they’re going to base their actions on value assignments. You’re right to think that in the mind you will have tiers and that you need to find the right level in order to understand what value or disvalue looks like in the mind.

Lucas Perry: So there’s the view where some content of consciousness is clearly seen as bad given its nature. If some state of consciousness is like something from a consciousness realist perspective, and it is also natural because it’s part of the natural world, it’s a physical fact and there are facts about consciousness, then value comes in from what it’s like to be conscious. Whereas it seems like you’re bringing in the more computational, and physical side of things, like an evaluative affective system, which may not be separate from how things are experienced in consciousness, but I feel confused about these two different levels and where the ‘what matters’ comes from.

Peter Railton: Well, yes, you’re quite right. There are views about value in which it’s only conscious states that could have value or disvalue. I don’t particularly hold such a view. I think that we are intrinsically concerned with, and that there is intrinsic value in non-conscious states. And that’s why I wouldn’t sign up for the experience machine. The experience machine could provide an unending stream supposedly of positive conscious states, but why wouldn’t I sign on for it? Well, because the actual content of my values is not that I have certain conscious states, it’s that I have certain relations with people, with the external world, that I have a certain engagement with things that have a consciousness and that matter. And so I wouldn’t agree that the only place, the only locus of value or disvalue is conscious states.

Lucas Perry: So then from a cosmological and evolutionary perspective, there has been the development and arising of sentient creatures on this planet who have ever complexifying neural algorithms for modeling themselves and the world and making predictions and interacting with it. And amongst these evolved architectures include evaluative ones, which take the shape of valuing or disvaluing certain aspects of the world. And so that is enough for you for talking about intrinsic value. You feel like you don’t need to bring consciousness into it. You’re fine with just talking about the computation.

Peter Railton: Oh, I think consciousness plays a role because one of the good making features is a positive state of consciousness. It’s just, it’s not the only one. And so there are differences in the world that would not show up as differences in conscious states. And that’s what the experience machine is meant to show, but which would nonetheless constitute things that matter in the sense of matter that we were just describing, namely, that these are objects of concern, love, affection, interest on the positive side, objects of dislike, disvaluation, disapproval on the negative side. I don’t think there’s any reason to think that only conscious states can be locuses of value, but it may be that consciousness plays a role.

Lucas Perry: So what are these other good making features and why are they good making?

Peter Railton: Well, take, for example, a theory like a preference satisfaction theory. I would prefer, other things equal, that after I’m gone, my children have lives that they find meaningful. Now is that because I want to have the positive experience of thinking that their lives are meaningful? No, I want them to have those lives. And so it’s part of the content of my informed preferences, let’s say that it would survive information, is part of the content of my informed preferences that the world be such that my children have a certain kind of life. And you say, “Well, doesn’t the meaningfulness of their life just consist in their conscious states?” And I’d say, “Well, no, not at all. I would think that a life in an experience machine would not have the same meaning as a life with a similar stream of conscious states that was lived in engagement with the world.”

And so when I want them to have meaningful lives, I want them to have lives in which they act in ways that matter to them. And that they do things that matter to themselves and to others and their intrinsic preferences, like my intrinsic preferences, aren’t just going to be for conscious states. And so it may be that you need something like preference or interest to get value off the ground mattering, but the content of what interests us, or the content of what our preferences are, won’t just be the conscious states. So you can’t satisfy my preferences just by giving me conscious states, for example.

Lucas Perry: So I don’t share that intuition with you. I still don’t understand why you feel that something like a preference is good making. I guess that just comes down to intuitions. I mean, when someone could ask me, why do I think consciousness is the only thing that is good making, but I don’t know, what is a preference? It’s like a concept about some computational architecture that prefers some state of the world over another. But when you pass away, for example, your preference goes away. So why does it need to be respected still? I mean, we’re getting into some waters here, but is the short version of this that when you just do these philosophical thought experiments, that your intuitions aren’t satisfied by consciousness being necessary and sufficient for value?

Peter Railton: Well, all of our knowledge, whether it’s knowledge of value or the external world, we can push it back to a point at which, again, we can’t give some further derivation of the assumption that we’re making. And so my thinking here is that it seems to me extremely plausible that the one intelligible notion I can get of something like value is that there can be a subjectivity such that states of affairs can go better or worse for that subjectivity. And then value would consist in that, which makes the states of affairs better or worse for that individual. And then I asked myself, well, does that satisfy our concept of value?

Well, value should have various different features and we can list those. It should be something that when we understand it’s intrinsically motivating, it should apply to the sorts of things that we ordinarily identify as being values. It should capture a certain role in the guidance of action. It should be something like a goal in action. We should see it as structuring a behavior of individuals. And when I look at all those conditions, I think, yeah, this satisfies those conditions. It’s not a proof. It’s just saying that if we lay down the conditions that we would give for something satisfying the concept of value, these states do indeed satisfy those conditions and that many other candidate states don’t. But I can’t tell you for example, that you shouldn’t have some other concept of calue instead of value and ask what would satisfy calue in the same way that I can’t in the case of knowledge of the external world, give you a derivation of the importance of knowledge, as opposed to shmowledge, you can operate with the concept of knowledge and see what it requires and see whether it would apply to what we are doing.

But that’s not a proof that there isn’t another scheme of shmowledge of which the same thing could be said. So that’s where we get down to these fundamental assumptions and can they be non arbitrary? Well, they can, for example, if, when applying them, you can be put in a situation where you give them up. A concept that we had, that we thought we were happy with, turns out to be confused. Or it turns out that the only things that would satisfy the concept are things which we ordinarily think the concept doesn’t apply to. So we think there’s a mismatch between the criteria and the paradigm cases. So it’s not arbitrary if you’re willing to use it critically, but it can’t be proven.

Lucas Perry: Okay. Bit of a side path from where you were to Parfit. I was curious about what you really meant by how you guys were agreeing about value being some natural thing, instead of having to sprinkle value.

Peter Railton: The way I would put it, the disagreement that I have with Parfit is a disagreement at the conceptual level. Initially, at any rate, it looked like we had a conflict of opinions because it looked as if he was committed to there being in the world these non-natural features, and that they somehow explained the role that value has in our lives. And I couldn’t understand what that would mean, but he was perfectly content to say, “No, the good making features are these natural features. They explain the role that value has in our lives, but our concept of value is a non-natural concept.” And what does it mean to say that? Well, the same situation, the same configuration of matter could be described with physical concepts, chemical concepts, biological concepts: oh, it’s an “organism.” It could be described in social concepts: it’s a person. Any given situation can be characterized in various different conceptual systems.

And it can be argued, plausibly, that you can’t reduce, for example, the conceptual system of biology to the conceptual system of particle physics. Because biology deals in functions, reproduction, metabolism, and so on, and there’s no one-to-one correspondence, no easy correspondence, between those functions and any particular physical realization. You could have living beings made out of carbon. You could have living beings made out of silicon. So the concept of living being, the concept of an organism is a concept of biology. It’s a way to organize the description of the world and explanation, and biology is conceptually not reducible to physics. That doesn’t mean biologists can ignore physics because they think, most anyway do, that what satisfies their biological concepts are physical systems. And so it’s an important question, what kinds of physical system would satisfy these concepts like self-replication and so on?

And so they do microbiology and they study the physical systems that do satisfy these concepts. But the point is that the conceptual system has a degree of autonomy from the physical system. And that even discovering that self-replicating molecules have a certain chemical composition in this world is not discovering that the concept of a self-replicating organism is simply a physical concept. Parfit has the same view about normative concepts. He and I agree about what pain is and what makes pain bad, but he says you could describe a situation either, as you were saying, in terms of some physical or biochemical processes, or you could describe it as bad, or as good or something that ought not to exist. And that’s another level of conceptual characterization. And his thought is that that level of conceptual characterization can’t be reduced to the concepts of natural order.

So there is an element in normative concepts that’s always beyond what is translatable without loss of meaning into the natural. Once one recognizes that, then you can be as naturalistic as you like about the nature of value and also believe that the concept of value is a non-natural concept. Just as you can be as physicalist as you like about the fundamental furniture of the universe and still believe that the biological level of description is not reducible to the physical level of description. You could say the same debate went on when people were thinking about life. In the 19th century, we find people thinking, well, there’s got to be this special elan vital or spirit or something like that. You can’t just take a bunch of matter and put it together and have life. By and by, biochemistry develops and people find that, actually, you can put a bunch of matter together and have life.

And the same thing is true with value. You don’t need some value-vital, some kind of further substance to add to the world. You can put together the natural stuff of the world and get value. Once you frame it that way, then Parfit and I actually agree. Because when he talks about the irreducibility of the normative, he really means, should mean, and I think agrees that he means, a conceptual irreducibility. And once we establish that, then I can say, “Yes, I agree with you, normative concepts aren’t definable entirely in terms of non-normative concepts, they involve some idea of ought or some idea of value that isn’t present to the non-normative.” But my interest as a philosopher and metaethicist is an interest in what kinds of natural conditions satisfy these concepts and how that makes it possible for us to have knowledge in a normative conceptual scheme, like ethics or theory of value. So that’s where I do my work. His work is done in carefully distinguishing the concepts.

Lucas Perry: So there is reality as it is, there is the base reality, base metaphysics, call it ultimate reality or whatever, and all human conceptualization supervenes upon that because it’s couched within that context and is identical to it. Yet that conceptualization you argue is lossy with respect to ultimate reality, because it doesn’t necessarily carve reality at the joints, but that conceptual structure is still supervenient upon it. And at the level of conceptualization, there are facts about the world that can be satisfied or not satisfied that will make some proposition true or false.

Peter Railton: Yeah.

Lucas Perry: So you’re arguing that value isn’t part of metaphysical bedrock, but metaphysical bedrock creates neural architectures that create concepts that contain within them necessary and sufficient conditions for being satisfied. And when agents are able to gain clarity with one another over concepts and satisfying necessary and sufficient conditions, then they can have concrete discussions about ethics.

Peter Railton: That would be one common basis. And so the image that Parfit gave in his first volume of On What Matters was that he thought, ultimately, you could see the utilitarian and the Kantian as climbing two different sides of the same mountain so that they would eventually meet at the summit. I suggested to him, well, in metaethics the same is the case. I’m a naturalist, I’m climbing one side of the mountain. You’re a non-naturalist, you’re climbing the other side of the mountain.

But as our views develop, and as we understand better the different elements of the views, then actually they’re going to come such that as we approach the summit, we aren’t really disagreeing with each other. And he accepted that picture. I would only add to what you were saying by way of summary. Our concepts typically don’t give us necessary and sufficient conditions, they are more open ended and open textured than that. And that’s part of why we can have unending debates about questions like value and so on.

But you mentioned truth, and one might say truth is another very good example of a concept that’s not reducible to a concept of physics. Because truth presupposes representations, and representations are characterizable not in terms of their physical constituents, but in terms of their role in thought. And so people who are skeptical about value because they say, “I don’t see where value is in the world,” they should be equally skeptical about truth. Because truth is not some new substance we add. If there’s a representation and it accurately reflects the world, then we have truth. So truth, again, is a relational matter between something like a representation and, in this case, a state of the world, and it’s when that relationship obtains that you get truth.

Lucas Perry: Right, but that’s truth in the epistemological agent centered sense, but then there’s the more metaphysical view about truth, where there are mind independent facts. And they’re true, whether or not we know anything about them. Maybe the same distinction is important here to make. There are potentially moral truths within the conceptual framework that we’re participating in. And it feels weird to me to call that moral realism. But then there’s another claim where there’s mind independent truths about morality like that there’s an intrinsic quality to suffering that is what bad means. Does that make sense?

Peter Railton: I think you’ve put things in a very good way. One of the features of the setup that I was describing is that it’s very easy to slide from a position that for example, whenever a value judgment obtains, then some or other natural state obtains, it’s very easy to slide from that to thinking that the natural state actually is the normative fact. It doesn’t satisfy the concept. And so you could have the concept of the good, and it could be that there are eternal truths about good I suspect. That’s a reasonable candidate, just as there can be eternal truths in mathematics. The claim isn’t that the conceptual domain is somehow identical with the natural domain. It supervenes, but it’s not a relationship of identity. And the language in which those claims are stated, and the way in which we adjudicate them might be as in the case of mathematical claims, it could be quite a priori.

And that’s where Parfit’s view and mine differ and Singer’s likewise, because he thinks you can do this a priori in a way that I don’t think you can, but that’s a question in epistemology. It doesn’t require a different metaphysics in order to have that view. So you can be a physicalist and believe that there is mathematical truth. And that’s because, for example, you think that mathematical truths are true via a set of axioms, definitions, rules of inference. And so they are made true not by distributions of molecules, but by logical relationships that can be specified in terms of axioms and rules.

Lucas Perry: Okay. So I feel a little bit confused still about why your view is a kind of moral realism if it requires no strong metaphysical view. Whereas other moral realists that I’m familiar with hold a strong metaphysical view about suffering in consciousness and joy in consciousness as being the intrinsic valence carriers of value.

Peter Railton: Well, I’m not sure about the last part of your question. I’ll have to think about how to interpret that. But am I a realist about organisms, if I believe that the concept of an organism is distinct from any particular physical instantiation? Am I prevented from being a realist about organisms because I think the organismic level of description is irreducible to the physical level of description? You see, no, actually, because you think that the concept organism is satisfied by some physical system, you’re a realist about organisms, you think there are organisms. To me that’s a perfectly realistic position. And an irrealist or non-realist would say, “Well, I guess there aren’t any organisms then, because they’re not part of the fundamental furniture of the universe.”

And I’d say, “Think of what an organism is. It’s not a piece of furniture, it’s a functionally organized arrangement. And because it’s functionally organized, it doesn’t correspond to any particular material, something or other. And for there to exist organisms is for there to exist the conditions such that the concept of organism is satisfied.” And that’s of course what most biologists believe. And so most biologists are realists about organisms.

Lucas Perry: If your intuitions changed about the reducibility of higher levels of knowledge to lower levels of knowledge, how would that affect your moral views? For example, there are views say like concepts in biology about reproduction and organisms and concepts like life are lossy when it comes to the actual furniture of reality. And that they don’t actually completely describe how things are and the concepts don’t carve reality at the joints. So they provide predictions about the world, but it all should be and is only best described by particle physics for example. One might say an organism is a concept, though it does not carve reality at the joints. And the best understanding of it is at the level of particle physics. So taking a realist position about conceptual fictions is dubious to define reality as whenever some concept I have is satisfied.

Peter Railton: What you’re pointing to is a very interesting problem. I would say that biological concepts do carve things at the joints, because the biological level of organization yields a whole systematic set of laws and principles that turn out to be true in our universe. It’s far from being an arbitrary stipulation or a fiction that something’s an organism, and a tremendous amount follows from things being organisms and self-replicating and so on. And we have very elaborate theories about populations, the mathematics of populations.

Lucas Perry: And are those laws though not reducible to other laws?

Peter Railton: That’s the idea that reducibility is the wrong concept to have here. Because the laws of population are laws that have to do with variables that aren’t fundamental variables of physics. They have to do with, for example, issues about reproducibility, availability of resources and so on, and what counts as a resource depends upon the nature of the organism. So there’s a level of organization, similarly in chemistry.

Lucas Perry: But what if those variables are just the shape of lower level things?

Peter Railton: Well, they won’t be, because if they were self-reproducing silicon-based organisms, they would obey similar population dynamics. Those principles govern functional organizations. So once you have self-replication, you have mutation, you have differential selection and so on, you’ll get certain principles, whatever physical realization there is.

Lucas Perry: But it really doesn’t make sense to me that these higher level laws would not be completely supervenient on fundamental forces of nature.

Peter Railton: Oh, they’re supervenient, definitely. But supervenience does not imply reducibility, that’s really critical in this domain. And again, this is the problem that I think has led to a lot of confusion in this domain. A feature that is supervenient upon fundamental physics is perhaps part of a system of laws that provides joints in nature. Because if you went to another world and you found a form of life that had these basic features of self-replication, mutation, selection, you would expect to find similar population dynamics to Earth. And that similarity is a biological similarity. It’s not a similarity in terms of the basic physics of the situation. The physics are the same, but the constitution of these organisms is very different. And so you couldn’t infer from understanding just the physics that there would be this biological regularity. That’s what it means to say that it’s supervenient, but not reducible.

Again, I think you can be a realist about organisms because organism really is a concept that carves nature at the joints. And so we would be able to export our theory of organisms to worlds in which carbon was not abundant and self-replication was built out of something else. And that’s a way in which nature is lawfully organized, supports counterfactuals, supports explanations. And so that’s a way of thinking about what it means to say it’s supervenient, but not reducible. And I think the same thing is true with moral distinctions. And that’s why they’re learnable. That’s why infants can learn moral distinctions, even without being given moral concepts.

Lucas Perry: Yeah. So that’s why I’m pushing on this point. Now that makes more sense to me in terms of moral statements, but when trying to make physical claims about how reality is, I feel more confused here and maybe it’s messing me up in other places. If all of the causality is governed by fundamental forces, then surely all concepts that try to map out the world that is being governed by fundamental causality, all the laws that are derived at higher levels, must be completely reducible to and supervenient upon, or lossy to some extent with relation to, the fundamental causal forces. I don’t think the claim is that, for example, principles of biology and life are causal in themselves. They’re more like laws that we use to make predictions, but predictions about systems that are running on the fundamental laws of nature. The complex aggregation of those laws must aggregate in some way to come close to those laws of biology. What is wrong with this picture?

Peter Railton: Well, there may be nothing wrong with it. I think the laws of biology are not just descriptive. I think they support explanations and that they are used, not just to redescribe reality, but they’re used to construct theories that show structure in reality that’s extremely important structure. And that would not be visible simply if you were allowed only predicates of fundamental physics. I guess I would say from the standpoint of explanation, biology affords many explanations. Suppose somebody wants to know why the material that happens to be in my body is where it is right now. Well, there is some very complicated story at the level of fundamental microphysics following all of these molecules, but it doesn’t look like anything at all. Whereas if you can give an explanation in terms of evolution and social dynamics as to how these molecules got here, you may have a much more compact comprehensible and understanding grasp of the world.

So I think biology affords us distinct modes of understanding and explanation, so does psychology, so does chemistry. One of the features of knowledge is that reality is organized at various levels in systems that are lawful systems and that support explanation and intervention and causation, but there’s no reason not to call this causation.

And so if somebody is describing the spread of the pandemic and they say, “Well, it’s partially caused by the transmissibility of the virus, which is higher than that of the bird flu,” we’ll say, “Yes, that’s a causal factor in the spread of the virus and why these particular molecules are located in the world where they are.” And that’s a very powerful explanation. And if someone were just giving you a readout of the positions and momenta of all the different molecules of the world, you would not see this pattern and you’d have less understanding of the situation.

Lucas Perry: So tying this into your metaethics here, our ethical concepts are causally supervening on fundamental forces on physics. We’ve inherited them via evolution and they run on physics. But these concepts do not reduce to natural facts. There’s no goodness or badness built into the fundamental nature of the universe. These concepts are merely causal expressions of the universe playing out. And within the realm of this conceptualization, you can have truths about morality in the same way that you can have truths about biological organisms. And there’s a relationship here between what you might believe to be true about conceptualization and science and the epistemic status of concepts in general would also bear some information here on how one might think about the epistemic status of moral concepts.

Peter Railton: Yeah. Or thinking in terms of algorithms and systems. The systems-theoretic perspective gives you a lot of very well organized understanding as you grasp the algorithms that are at work and so on, but algorithm is not a concept of fundamental physics.

Lucas Perry: Right. So it’s your view that moral facts, moral claims within conceptualization, hold the same epistemic status as claims about algorithms and biological organisms and claims that we might make in things like chemistry or biology, which are at a higher level of abstraction than particle physics.

Peter Railton: Yes. And that’s what would be called the naturalist. And it’s why someone like Peter Singer is a non-naturalist. He thinks the epistemology of moral judgments is a priori. And similarly with Derek Parfit, and they think it’s an intuitive epistemology, and they think that the two go together because they believe in something called rational intuition. And I’m inclined to think of intuition the way we were describing it earlier on. Namely, it’s a complex body of knowledge that isn’t organized into simple principles, but that replicates an important set of morally relevant relations. And that that’s really what intuition is. That when we have these intuitions, it’s that kind of knowledge, the way there are grammatical intuitions or knowledge like that. So we disagree about the epistemology, but with Parfit at least (I’m not sure what Peter Singer would say) our disagreement’s not metaphysical.

Lucas Perry: Right. I think the only place it seems like there would be space for a metaphysical disagreement would be in there being a kind of intrinsic good quality to pleasure and intrinsic bad quality to suffering that existed prior to conceptualization.

Peter Railton: I don’t think anything about the badness of pain depends on our concepts. I think that pain was bad in the first organisms that felt pain. And if humans had never evolved and the concept of pain had never come into existence and the concept of bad had never come into existence, it would still be bad for these organisms to suffer roasting to death in a world that desiccated or something like that. Our concepts allow us to talk about these features. The word concept comes from two words, con meaning with and cept, which is a term for grasp. And a concept is what we use to grasp features of the universe, not to create them.

Lucas Perry: So there already would have been some computational structure that would have evaluated something as bad?

Peter Railton: It would have made it the case that this was bad for that organism. Yes, that’s right.

Lucas Perry: And that doesn’t bring consciousness into it or anything, that could be strictly computational?

Peter Railton: Thus far yeah. And there’s a big debate about whether states have to be conscious in order for them to have disvalue. And one of the reasons for thinking about that is because we’re thinking about the animal kingdom and we aren’t sure how deep into the animal ancestry of humans consciousness goes. I myself don’t think that consciousness is essential, but I recognize that that’s one position among many.

Lucas Perry: Yeah. I happen to think that it is. But I would like now to wrap up and integrate this discussion on metaethical epistemology into the broader conversation. So we’ve talked here a lot about what your metaethics is and the epistemology that is entailed by it, and also that of other people. That is related to moral learning of course, because a proper moral epistemology is the vehicle by which one would obtain normative or metaethical moral knowledge. So how do you view or integrate this thinking that we’ve gone through here into the question of AI alignment? On one hand, if we were Singer or Parfit, we might think that if we just build something that’s sufficiently rational, whatever that means, the axioms of morality would be intuitively accessible to such a machine system. It seems strange to say, but they would be intuitively accessible just as the axioms of mathematics are. Whereas with your view I’m not quite sure what happens. So maybe you could explore this all a little for us.

Peter Railton: So if I could have my wish here, it would be that by getting an understanding of the metaethical landscape and which problems are metaphysical, which ones aren’t, which problems are epistemic and how they are tractable in various ways, the temptation towards skepticism in morality would at least be a bit weakened. People would see how it would be possible for us to have moral knowledge, of course imperfect and evolving. They would understand therefore how it could be possible for other systems to have moral knowledge. And we could talk concretely about the kinds of processes by which infants for example acquire moral knowledge, and think about how systems could go through similar kinds of processes and acquire a core moral competency, as I think they can. Skepticism about morality I think has for a long time plagued the discipline, because it’s been hard for people to see how we could have something like moral knowledge.

And that’s been tied up with a picture of value and the nature of value as an unusual kind of something or other, as something that’s not part of the way in which the world is put together. And so how would we ever have any kind of knowledge of it? And since we can’t derive it from self-evident axioms, we have to be subjectivists. I would hope to have had just a small effect in making that a somewhat less plausible position. Because I do think there’s an important constructive project here, and it is already underway in developmental psychology. And for example, people working around Josh Tenenbaum are working on this as a learning question. There’s a lot of promise to understanding intuition in terms of deep learning and understanding moral competency in Bayesian terms. So I think there’s a tremendous future for coming to have theories of moral learning. I’m glad psychologists have started using this phrase and that by giving a theory of it, that sorts out the metaethical landscape in the ways we’ve been describing, that seems more plausible.

Lucas Perry: Right. So in summary then the feeling that you have here is that people hold and walk around with common sense intuitions about normative and metaethical thinking and what those things are. And that there is a more solid foundation for whatever moral realism might be in understanding around these issues. And that there can be strong convergence and formalization around moral learning. And then the integration of moral learning to machine systems, which would make them sensitive to morally relevant features and thus make them socially, societally, civilizationally competent to be able to exist in an ecosystem of agents with more or less altruistic, malevolent, and benevolent values.

Peter Railton: And that we will need such systems badly as allies in the years to come. If I could just add one thought, someone’s going to say, “But don’t we have to have some priors about pain being bad, about positive-sum interactions being good?” Now you have to have some priors in order to engage in moral learning. And I would say we have to have priors to engage in any kind of learning. And what rationality and learning consist in is how do we use subsequent experience and evidence and argument to revise the priors and go on to create new priors and then apply evidence and argument and reasoning to those. That’s what rationality is, it’s not starting from scratch with self-evident principles that we just don’t happen to know. That’s what Bayesians say rationality is with respect to science and the gathering of evidence. That’s a picture of rationality in which we can be rational beings and we can be more rational the more we are able to subject our priors to critical scrutiny and expose them to different kinds of more diverse representative forms of evidence and reasoning.

And I think the same thing is true with moral priors, and rationality in the moral case doesn’t consist in seeing it as a self-evident set of axioms, because I don’t think one can. But starting with priors and then learning from experience, argument, and deliberation together, in that sense rationality in the two spheres is essentially very similar.
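
As an illustrative aside (not part of the recorded conversation), the Bayesian picture described here, in which priors are revised by evidence and each revised belief becomes the prior for the next round, can be sketched with a simple Beta-Bernoulli update. The two starting priors and the evidence below are hypothetical values chosen only for illustration; the sketch says nothing about what moral evidence actually looks like.

```python
from fractions import Fraction

def update(belief, observation):
    """One Bayesian update of a Beta(a, b) belief on a yes/no observation."""
    a, b = belief
    return (a + 1, b) if observation else (a, b + 1)

def update_all(belief, observations):
    """Each posterior becomes the prior for the next observation."""
    for obs in observations:
        belief = update(belief, obs)
    return belief

def mean(belief):
    """Expected probability of 'yes' under a Beta(a, b) belief."""
    a, b = belief
    return Fraction(a, a + b)

# Hypothetical priors (assumptions for illustration only).
skeptic = (1, 9)    # starts out expecting 'yes' 10% of the time
optimist = (9, 1)   # starts out expecting 'yes' 90% of the time

# Hypothetical shared evidence: 70 'yes' and 30 'no' observations.
evidence = [True] * 70 + [False] * 30

skeptic_after = update_all(skeptic, evidence)
optimist_after = update_all(optimist, evidence)

# How far apart the two agents are before and after seeing the same evidence.
gap_before = mean(optimist) - mean(skeptic)
gap_after = mean(optimist_after) - mean(skeptic_after)
print(float(gap_before), float(gap_after))
```

In this toy setup, two agents who begin far apart end up much closer after seeing the same evidence, which is one way to read the point that rationality lies in how priors are revised rather than in where they start.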

Lucas Perry: All right, Peter, thanks so much for your perspective here in sharing all of this. Is there anything else here, any last thoughts you’d like to say, anything you feel unresolved about?

Peter Railton: I would like to have more engagement between philosophers and the AI alignment community. I think it’s one of the most important problems we face as a culture and it’s an urgent problem. And it’s painful to me that philosophers are not as alive to it as they should be. I just want to invite anyone who’s out there working on the problem. Please, let’s try to make contact, not necessarily with me, but with other philosophers. And let’s try to build a constructive community here. Because for too long philosophy has been in the situation of folding its arms and sort of pooh-poohing artificial intelligence or artificial ethics. And if that view has merit and it does have merit in many areas, AI gets over-hyped a lot, AI people will tell you that.

But there’s this other side, which is what has been accomplished, what has been constructed, what’s been shown to be possible. And how can we build on that? And there I think there’s a lot of opportunity for constructive interaction. So that would be my parting thought that this is a time when urgent work in this area is needed. Let’s bring all the resources we can to bear on it.

Lucas Perry: All right, beautiful thoughts to end on then. If people want to follow you on social media or get in contact with you, how’s the best way to do that?

Peter Railton: Well, I’m not on social media. The best way would be to reach me via email, which is I get a lot of email. I can’t promise I’ll respond quickly to emails, I wish I could. But I don’t want philosophy to lose the chance to be part of this important process.

Lucas Perry: All right. And if people want to check out your papers or work?

Peter Railton: I’m supposed to be building a website. I may succeed in doing so. Many of the papers are available. People have put them up in various ways. So if you go to Google Scholar, you can find many of my papers. And I also want to put in a plug for those philosophers who have heroically been working on these questions. They’ve done a great deal of work and we should be grateful for what they’ve accomplished. But yes, if people want to find my work, if they can’t get access to it, let me know and I’ll make the papers available.

Lucas Perry: All right, thanks again, Peter. It’s been really informative and I appreciate you coming on.

Peter Railton: Great. I appreciate your questions and your patience. This has been a very helpful conversation for me as well.



End of recorded material

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

 Topics discussed in this episode include:

  • Inner and outer alignment
  • How and why inner alignment can fail
  • Training competitiveness and performance competitiveness
  • Evaluating imitative amplification, AI safety via debate, and microscope AI



0:00 Intro 

2:07 How Evan got into AI alignment research

4:42 What is AI alignment?

7:30 How Evan approaches AI alignment

13:05 What are inner alignment and outer alignment?

24:23 Gradient descent

36:30 Testing for inner alignment

38:38 Wrapping up on outer alignment

44:24 Why is inner alignment a priority?

45:30 How inner alignment fails

01:11:12 Training competitiveness and performance competitiveness

01:16:17 Evaluating proposals for building safe and advanced AI via inner and outer alignment, as well as training and performance competitiveness

01:17:30 Imitative amplification

01:23:00 AI safety via debate

01:26:32 Microscope AI

01:30:19 AGI timelines and humanity’s prospects for succeeding in AI alignment

01:34:45 Where to follow Evan and find more of his work


Works referenced: 

Risks from Learned Optimization in Advanced Machine Learning Systems

An overview of 11 proposals for building safe advanced AI 

Evan’s work at the Machine Intelligence Research Institute






We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today we have a conversation with Evan Hubinger about ideas in two works of his: An overview of 11 proposals for building safe advanced AI and Risks from Learned Optimization in Advanced Machine Learning Systems. Some of the ideas covered in this podcast include inner alignment, outer alignment, training competitiveness, performance competitiveness, and how we can evaluate some highlighted proposals for safe advanced AI with these criteria. We especially focus in on the problem of inner alignment and go into quite a bit of detail on that. This podcast is a bit jargony, but if you don’t have a background in computer science, don’t worry. I don’t have a background in it either and Evan did an excellent job making this episode accessible. Whether you’re an AI alignment researcher or not, I think you’ll find this episode quite informative and digestible. I learned a lot about a whole other dimension of alignment that I previously wasn’t aware of, and feel this helped to give me a deeper and more holistic understanding of the problem. 

Evan Hubinger was an AI safety research intern at OpenAI before joining MIRI. His current work is aimed at solving inner alignment for iterated amplification. Evan was an author on “Risks from Learned Optimization in Advanced Machine Learning Systems,” was previously a MIRI intern, designed the functional programming language Coconut, and has done software engineering work at Google, Yelp, and Ripple. Evan studied math and computer science at Harvey Mudd College.

And with that, let’s get into our conversation with Evan Hubinger.

In general, I’m curious to know a little bit about your intellectual journey, and the evolution of your passions, and how that’s brought you to AI alignment. So what got you interested in computer science, and tell me a little bit about your journey to MIRI.

Evan Hubinger: I started computer science when I was pretty young. I started programming in middle school, playing around with Python, programming a bunch of stuff in my spare time. The first really big thing that I did, I wrote a functional programming language on top of Python. It was called Rabbit. It was really bad. It was interpreted in Python. And then I decided I would improve on that. I wrote another functional programming language on top of Python, called Coconut. Got a bunch of traction.

This was while I was in high school, starting to get into college. And this was also around the time I was reading a bunch of the sequences on LessWrong. I got sort of into that, and the rationality space, and I was following it a bunch. I also did a bunch of internships at various tech companies, doing software engineering and, especially, programming languages stuff.

Around halfway through my undergrad, I started running the Effective Altruism Club at Harvey Mudd College. And as part of running the Effective Altruism Club, I was trying to learn about all of these different cause areas, and how to use my career to do the most good. And I went to EA Global, and I met some MIRI people there. They invited me to do a programming internship at MIRI, where I did some engineering stuff, functional programming, dependent type theory stuff.

And then, while I was there, I went to the MIRI Summer Fellows program, which is this place where a bunch of people can come together and try to work on doing research, and stuff, for a period of time over the summer. I think it’s not happening now because of the pandemic, but it hopefully will happen again soon.

While I was there, I encountered some various different information, and people talking about AI safety stuff. And, in particular, I was really interested in this, at that time people were calling it, “optimization daemons.” This idea that there could be problems when you train a model for some objective function, but you don’t actually get a model that’s really trying to do what you trained it for. And so with some other people who were at the MIRI Summer Fellows program, we tried to dig into this problem, and we wrote this paper, Risks from Learned Optimization in Advanced Machine Learning Systems.

Some of the stuff I’ll probably be talking about in this podcast came from that paper. And then as a result of that paper, I also got a chance to work with and talk with Paul Christiano, at OpenAI. And he invited me to apply for an internship at OpenAI, so after I finished my undergrad, I went to OpenAI, and I did some theoretical research with Paul, there.

And then, when that was finished, I went to MIRI, where I currently am. And I’m doing sort of similar theoretical research to the research I was doing at OpenAI, but now I’m doing it at MIRI.

Lucas Perry: So that gives us a better sense of how you ended up in AI alignment. Now, you’ve been studying it for quite a while from a technical perspective. Could you explain what your take is on AI alignment, and just explain what you see as AI alignment?

Evan Hubinger: Sure. So I guess, broadly, I like to take a general approach to AI alignment. I sort of see the problem that we’re trying to solve as the problem of AI existential risk. It’s the problem of: it could be the case that, in the future, we have very advanced AIs that are not aligned with humanity, and do really bad things. I see AI alignment as the problem of trying to prevent that.

But there are, obviously, a lot of sub-components to that problem. And so, I like to make some particular divisions. Specifically, one of the divisions that I’m very fond of, is to split it between these concepts called inner alignment and outer alignment, which I’ll talk more about later. I also think that there’s a lot of different ways to think about what the problems are that these sorts of approaches are trying to solve. Inner alignment, outer alignment, what is the thing that we’re trying to approach, in terms of building an aligned AI?

And I also tend to fall into the Paul Christiano camp of thinking mostly about intent alignment, where the goal of trying to build AI systems, right now, as a thing that we should be doing to prevent AIs from being catastrophic, is focusing on how do we produce AI systems which are trying to do what we want. And I think that inner and outer alignment are the two big components of producing intent aligned AI systems. The goal is to, hopefully, reduce AI existential risk and make the future a better place.

Lucas Perry: Do the social, and governance, and ethical and moral philosophy considerations come much into this picture, for you, when you’re thinking about it?

Evan Hubinger: That’s a good question. There’s certainly a lot of philosophical components to trying to understand various different aspects of AI. What is intelligence? How do objective functions work? What is it that we actually want our AIs to do at the end of the day?

In my opinion, I think that a lot of those problems are not at the top of my list in terms of what I expect to be quite dangerous if we don’t solve them. I think a large part of the reason for that is because I’m optimistic about some of the AI safety proposals, such as amplification and debate, which aim to produce a sort of agent, in the case of amplification, which is trying to do what a huge tree of humans would do. And then the problem reduces to, rather than having to figure out, in the abstract, what is the objective that we should be trying to train an AI for, that, philosophically, we think would be utility maximizing, or good, or whatever, we can just be like, well, we trust that a huge tree of humans would do the right thing, and then sort of defer the problem to this huge tree of humans to figure out what, philosophically, is the right thing to do.

And there are similar arguments you can make with other situations, like debate, where we don’t necessarily have to solve all of these hard philosophical problems, if we can make use of some of these alignment techniques that can solve some of these problems for us.
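
As an illustrative aside (not part of the recorded conversation), the “huge tree of humans” idea can be sketched as a recursion in which each person either answers a small question directly or splits it up and delegates the pieces to further people. Everything below is a toy under stated assumptions: the function names `human` and `tree_of_humans`, the summing task, and the depth budget are hypothetical, and the sketch leaves out the step where a model is trained to imitate the tree.

```python
def human(question, ask):
    """A hypothetical 'human' who can only answer small questions directly.

    question: a list of integers to be summed.
    ask: a callable used to delegate a sub-question to another human.
    """
    if len(question) <= 2:
        return sum(question)          # small enough to answer unaided
    mid = len(question) // 2
    # Too big to answer directly: split it and consult two more humans.
    return ask(question[:mid]) + ask(question[mid:])

def tree_of_humans(question, depth):
    """Answer a question with a tree of humans at most `depth` levels deep."""
    def ask(sub_question):
        if depth == 0:
            raise RuntimeError("no delegation budget left")
        return tree_of_humans(sub_question, depth - 1)
    return human(question, ask)

answer = tree_of_humans(list(range(1, 9)), depth=2)
print(answer)
```

No single `human` call here handles more than two numbers, yet the tree as a whole answers the larger question; if the depth budget is too small for the question, the delegation fails rather than guessing.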

Lucas Perry: So let’s get into, here, your specific approach to AI alignment. How is it that you approach AI alignment, and how does it differ from what MIRI does?

Evan Hubinger: So I think it’s important to note, I certainly am not here speaking on behalf of MIRI, I’m just presenting my view, and my view is pretty distinct from the view of a lot of other people at MIRI. So I mentioned at the beginning that I used to work at OpenAI, and I did some work with Paul Christiano. And I think that my perspective is pretty influenced by that, as well, and so I come more from the perspective of what Paul calls prosaic AI alignment. Which is the idea of, we don’t know exactly what is going to happen, as we develop AI into the future, but a good operating assumption is that we should start by trying to solve AI alignment for the case where there aren’t major surprises on the road to AGI. What if we really just scale things up, we sort of go via the standard path, and we get really intelligent systems? Would we be able to align AI in that situation?

And that’s the question that I focus on the most, not because I don’t expect there to be surprises, but because I think that it’s a good research strategy. We don’t know what those surprises will be. Probably, our best guess is it’s going to look something like what we have now. So if we start by focusing on that, then hopefully we’ll be able to generate approaches which can successfully scale into the future. And so, because I have this sort of general research approach, I tend to focus more on: What are current machine learning systems doing? How do we think about them? And how would we make them inner aligned and outer aligned, if they were sort of scaled up into the future?

This is in contrast with the way I think a lot of other people at MIRI view this. I think a lot of people at MIRI think that if you go this route of prosaic AI, current machine learning scaled up, it’s very unlikely to be aligned. And so, instead, you have to search for some other understanding, some other way to potentially do artificial intelligence that isn’t just this standard, prosaic path, one that would be easier to align, that would be safer. I think that’s a reasonable research strategy as well, but it’s not the strategy that I generally pursue in my research.

Lucas Perry: Could you paint a little bit more detailed of a picture of, say, the world that the prosaic AI alignment strategy sees as potentially manifesting, where current machine learning algorithms, and the current paradigm of thinking in machine learning, are merely scaled up, and via that scaling up, we reach AGI, or superintelligence?

Evan Hubinger: I mean, there’s a lot of different ways to think about what does it mean for current AI, current machine learning, to be scaled up, because there’s a lot of different forms of current machine learning. You could imagine even bigger GPT-3, which is able to do highly intelligent reasoning. You could imagine we just do significantly more reinforcement learning in complex environments, and we end up with highly intelligent agents.

I think there’s a lot of different paths that you can go down that still fall into the category of prosaic AI. And a lot of the things that I do, as part of my research, is trying to understand those different paths, and compare them, and try to get to an understanding of… Even within the realm of prosaic AI, there’s so much happening right now in AI, and there’s so many different ways we could use current AI techniques to put them together in different ways to produce something potentially superintelligent, or highly capable and advanced. Which of those are most likely to be aligned? Which of those are the best paths to go down?

One of the pieces of research that I published, recently, was an overview and comparison of a bunch of the different possible paths to prosaic AGI. Different possible ways in which you could build advanced AI systems using current machine learning tools, and trying to understand which of those would be more or less aligned, and which would be more or less competitive.

Lucas Perry: So, you’re referring now, here, to this article, which is partly a motivation for this conversation, which is An Overview of 11 Proposals for Building Safe Advanced AI.

Evan Hubinger: That’s right.

Lucas Perry: All right. So, I think it’d be valuable if you could also help to paint a bit of a picture here of exactly the MIRI style approach to AI alignment. You said that they think that, if we work on AI alignment via this prosaic paradigm, that machine learning scaled up to superintelligence or beyond is unlikely to be aligned, so we probably need something else. Could you unpack this a bit more?

Evan Hubinger: Sure. I think that the biggest concern that a lot of people at MIRI have with trying to scale up prosaic AI is also the same concern that I have. There’s this really difficult, pernicious problem, which I call inner alignment, which is presented in the Risks from Learned Optimization paper that I was talking about previously. Many people at MIRI, as well as me, think that this inner alignment problem is the key stumbling block to really making prosaic AI work. I agree. I think that this is the biggest problem. But I’m more optimistic, in terms of, I think that there are possible approaches that we can take within the prosaic paradigm that could solve this inner alignment problem. And I think that is the biggest point of difference: how difficult will inner alignment be?

Lucas Perry: So what that looks like is a lot more foundational work, and correct me if I’m wrong here, into mathematics, and principles in computer science, like optimization and what it means for something to be an optimizer, and what kind of properties that has. Is that right?

Evan Hubinger: Yeah. So in terms of some of the stuff that other people at MIRI work on, I think a good starting point would be the embedded agency sequence on the alignment forum, which gives a good overview of a lot of the things that the different Agent Foundations people, like Scott Garrabrant, Sam Eisenstat, Abram Demski, are working on.

Lucas Perry: All right. Now, you’ve brought up inner alignment as a crucial difference, here, in opinion. So could you unpack exactly what inner alignment is, and how it differs from outer alignment?

Evan Hubinger: This is a favorite topic of mine. A good starting point is trying to rewind, for a second, and really understand what it is that machine learning does. Fundamentally, when we do machine learning, there are a couple of components. We start with a parameter space of possible models, where a model, in this case, is some parameterization of a neural network, or some other type of parameterized function. And we have this large space of possible models, this large space of possible parameters, that we can put into our neural network. And then we have some loss function where, for a given parameterization for a particular model, we can check what is its behavior like on some environment. In supervised learning, we can ask how good are its predictions that it outputs. In an RL environment, we can ask how much reward does it get, when we sample some trajectory.

And then we have this gradient descent process, which samples some individual instances of behavior of the model, and then it tries to modify the model to do better in those instances. We search around this parameter space, trying to find models which have the best behavior on the training environment. This has a lot of great properties. This has managed to propel machine learning into being able to solve all of these very difficult problems that we don’t know how to write algorithms for ourselves.
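The three components described above (a parameter space of models, a loss function, and a gradient descent process that samples batches and nudges the parameters) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-parameter linear model, the noiseless data, the learning rate, and every function name are hypothetical choices made for brevity, not anything taken from the conversation or from the Risks from Learned Optimization paper.

```python
import random

def model(params, x):
    """A 'model' is one point in parameter space: here just (w, b)."""
    w, b = params
    return w * x + b

def loss(params, batch):
    """Mean squared error of the model's predictions on a batch."""
    return sum((model(params, x) - y) ** 2 for x, y in batch) / len(batch)

def grad(params, batch):
    """Analytic gradient of the mean squared error with respect to (w, b)."""
    w, b = params
    n = len(batch)
    dw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2 * (w * x + b - y) for x, y in batch) / n
    return dw, db

def train(data, steps=2000, lr=0.05, batch_size=8, seed=0):
    """Gradient descent: sample some behavior, then adjust the parameters."""
    rng = random.Random(seed)
    params = (0.0, 0.0)  # arbitrary starting point in parameter space
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        dw, db = grad(params, batch)
        params = (params[0] - lr * dw, params[1] - lr * db)
    return params

# Hypothetical training distribution: y = 3x - 1 for x in [-1, 1].
data = [(k / 10, 3 * (k / 10) - 1) for k in range(-10, 11)]
final_params = train(data)
```

Note that the search only ever checks behavior on `data`; nothing in this procedure says what the resulting parameters do for inputs outside that range.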

But I think, because of this, there’s a tendency to rely on something which I call the does-the-right-thing abstraction. Which is that, well, because the model’s parameters were selected to produce the best behavior, according to the loss function, on the training distribution, we tend to think of the model as really trying to minimize that loss, really trying to get rewarded.

But in fact, in general, that’s not the case. The only thing that you know is that, on the cases where I sample data on the training distribution, my models seem to be doing pretty well. But you don’t know what the model is actually trying to do. You don’t know that it’s truly trying to optimize the loss, or some other thing. You just know that, well, it looked like it was doing a good job on the training distribution.

What that means is that this abstraction is quite leaky. There’s many different situations in which this can go wrong. And this general problem is referred to as robustness, or distributional shift. This problem of, well, what happens when you have a model, which you wanted it to be trying to minimize some loss, but you move it to some other distribution, you take it off the training data, what does it do, then?

And I think this is the starting point for understanding what is inner alignment, is from this perspective of robustness, and distributional shift. Inner alignment, specifically, is a particular type of robustness problem. And it’s the particular type of robustness problem that occurs when you have a model which is, itself, an optimizer.

When you do machine learning, you’re searching over this huge space of different possible models, different possible parameterizations of a neural network, or some other function. And one type of function which could do well on many different environments, is a function which is running a search process, which is doing some sort of optimization. You could imagine I’m training a model to solve some maze environment. You could imagine a model which just learns some heuristics for when I should go left and right. Or you could imagine a model which looks at the whole maze, and does some planning algorithm, some search algorithm, which searches through the possible paths and finds the best one.

And this might do very well on the mazes. If you’re just running a training process, you might expect that you’ll get a model of this second form, that is running this search process, that is running some optimization process.
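The contrast between a model that learned heuristics and a model that runs a search can be made concrete with a toy grid maze. The sketch below is an assumption-laden illustration: the maze encoding (`#` for walls), the "go right, otherwise down" rule, and both function names are hypothetical, and breadth-first search stands in for whatever planning algorithm a learned model might implement.

```python
from collections import deque

def heuristic_policy(maze, start, goal):
    """A fixed rule of thumb: move right when possible, otherwise move down."""
    r, c = start
    path = [start]
    while (r, c) != goal:
        if c + 1 < len(maze[0]) and maze[r][c + 1] != "#":
            c += 1
        elif r + 1 < len(maze) and maze[r + 1][c] != "#":
            r += 1
        else:
            return None  # stuck: the rule of thumb has no answer here
        path.append((r, c))
    return path

def search_policy(maze, start, goal):
    """Breadth-first search over paths: a small optimization process."""
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < len(maze) and 0 <= nc < len(maze[0])
                    and maze[nr][nc] != "#" and nxt not in came_from):
                came_from[nxt] = cell
                frontier.append(nxt)
    return None  # goal unreachable

easy_maze = ["...",
             "##.",
             "..."]
hard_maze = ["..#",
             ".#.",
             "..."]
start, goal = (0, 0), (2, 2)
```

On `easy_maze` both policies reach the goal, so they look alike if that is all you train on. On `hard_maze` the rule of thumb gets stuck while the search still finds a route.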

In the Risks from Learned Optimization paper, we call models which are, themselves, running search processes mesa-optimizers, where “mesa” is just Greek, and it’s the opposite of meta. There’s a standard terminology in machine learning, this meta-optimization, where you can have an optimizer which is optimizing another optimizer. In mesa-optimization, it’s the opposite. It’s when you’re doing gradient descent, you have an optimizer, and you’re searching over models, and it just so happens that the model that you’re searching over happens to also be an optimizer. It’s one level below, rather than one level above. And so, because it’s one level below, we call it a mesa-optimizer.

And inner alignment is the question of how do we align the objectives of mesa-optimizers. If you have a situation where you train a model, and that model is, itself, running an optimization process, then that optimization process is going to have some objective. It’s going to have some thing that it’s searching for. In a maze, maybe it’s searching for: how do I get to the end of the maze? And the question is, how do you ensure that that objective is doing what you want?

If we go back to the does-the-right-thing abstraction, that I mentioned previously, it’s tempting to say, well, we trained this model to get to the end of the maze, so it should be trying to get to the end of the maze. But in fact, that’s not, in general, the case. It could be doing anything that would be correlated with good performance, anything that would, in general, result in it getting to the end of the maze on the training distribution, but it could be an objective that will do anything else off-distribution.

That fundamental robustness problem of, when you train a model, and that model has an objective, how do you ensure that that objective is the one that you trained it for? That’s the inner alignment problem.

Lucas Perry: And how does that stand, in relation with the outer alignment problem?

Evan Hubinger: So the outer alignment problem is, how do you actually produce objectives which are good to optimize for?

So the inner alignment problem is about aligning the model with the loss function, the thing you’re training for, the reward function. Outer alignment is aligning that reward function, that loss function, with the programmer’s intentions. It’s about ensuring that, when you write down a loss, if your model were to actually optimize for that loss, it would actually do something good.

Outer alignment is the much more standard problem of AI alignment. If you’ve been introduced to AI alignment before, you’ll usually start by hearing about the outer alignment concerns. Things like paperclip maximizers, where there’s this problem of, you try to train it to do some objective, which is maximize paperclips, but in fact, maximizing paperclips results in it doing all of this other stuff that you don’t want it to do.

And so outer alignment is this value alignment problem of, how do you find objectives which are actually good to optimize? But then, even if you have found an objective which is actually good to optimize, if you’re using the standard paradigm of machine learning, you also have this inner alignment problem, which is, okay, now, how do I actually train a model which is, in fact, going to do that thing which I think is good?

Lucas Perry: That doesn’t bear relation with Stuart’s standard model, does it?

Evan Hubinger: It, sort of, is related to Stuart Russell’s standard model of AI. I’m not referring to precisely the same thing, but it’s very similar. I think a lot of the problems that Stuart Russell has with the standard paradigm of AI are based on this: start with an objective, and then train a model to optimize that objective. When I’ve talked to Stuart about this, in the past, he has said, “Why are we even doing this thing of training models, hoping that the models will do the right thing? We should be just doing something else, entirely.” But we’re both pointing at different features of the way in which current machine learning is done, and trying to understand what are the problems inherent in this sort of machine learning process? I’m not making the case that I think that this is an unsolvable problem. I mean, it’s the problem I work on. And I do think that there are promising solutions to it, but I do think it’s a very hard problem.

Lucas Perry: All right. I think you did a really excellent job, there, painting the picture of inner alignment and outer alignment. I think that in this podcast, historically, we have focused a lot on the outer alignment problem, without making that super explicit. Now, for my own understanding, and, as a warning to listeners, my basic machine learning knowledge is something like an Orc structure, cobbled together with sheet metal, and string, and glue. And gum, and rusty nails, and stuff. So, I’m going to try my best, here, to see if I understand everything here about inner and outer alignment, and the basic machine learning model. And you can correct me if I get any of this wrong.

So, in terms of inner alignment, there is this neural network space, which can be parameterized. And when you do the parameterization of that model, the model is the nodes, and how they’re connected, right?

Evan Hubinger: Yeah. So the model, in this case, is just a particular parameterization of your neural network, or whatever function approximator you’re training. And it’s whatever the parameterization is, at the moment we’re talking about. So when you deploy the model, you’re deploying the parameterization you found by doing huge amounts of training, via gradient descent, or whatever, searching over all possible parameterizations, to find one that had good performance on the training environment.

Lucas Perry: So, that model being parameterized, that’s receiving inputs from the environment, and then it is trying to minimize the loss function, or maximize reward.

Evan Hubinger: Well, so that’s the tricky part. Right? It’s not trying to minimize the loss. It’s not trying to maximize the reward. That’s this thing which I call the does-the-right-thing abstraction. This leaky abstraction that people often rely on, when they think about machine learning, that isn’t actually correct.

Lucas Perry: Yeah, so it’s supposed to be doing those things, but it might not.

Evan Hubinger: Well, what does “supposed to” mean? It’s just a process. It’s just a system that we run, and we hope that it results in some particular outcome. What it is doing, mechanically, is we are using a gradient descent process to search over the different possible parameterizations, to find parameterizations which result in good behavior on the training environment.

Lucas Perry: That’s good behavior, as measured by the loss function, or the reward function. Right?

Evan Hubinger: That’s right. You’re using gradient descent to search over the parameterizations, to find a parameterization which results in a high reward on the training environment.

Lucas Perry: Right, but, achieving the high reward, what you’re saying, is not identical with actually trying to minimize the loss.

Evan Hubinger: Right. There’s a sense in which you can think of gradient descent as trying to minimize the loss, because it’s selecting for parameterizations which have the lowest possible loss that it can find, but we don’t know what the model is doing. All we know is that the model’s parameters were selected, by gradient descent, to have good training performance; to do well, according to the loss, on the training distribution. But what they do off-distribution, we don’t know.

Lucas Perry: We’re going to talk about this later, but there could be a proxy. There could be something else in the maze that it’s actually optimizing for, that correlates with minimizing the loss function, but it’s not actually trying to get to the end of the maze.

Evan Hubinger: That’s exactly right.

Lucas Perry: And then, in terms of gradient descent, is the TL;DR on that: the parameterized neural network space, you’re creating all of these perturbations to it, and the perturbations are sort of nudging it around in this n-dimensional space, how-many-ever parameters there are, or whatever. And, then, you’ll check to see how it minimizes the loss, after those perturbations have been done to the model. And, then, that will tell you whether or not you’re moving in the direction of a local minimum in that space. Is that right?

Evan Hubinger: Yeah. I think that that’s a good, intuitive understanding. What’s happening is, you’re looking at infinitesimal shifts, because you’re taking a gradient, and you’re looking at how those infinitesimal shifts would perform on some batch of training data. And then you repeat that, many times, to go in the direction of the infinitesimal shift which would cause the best increase in performance. But it’s, basically, the same thing. I think the right way to think about gradient descent is this local search process. It’s moving around the parameter space, trying to find parameterizations which have good training performance.
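The relationship between "perturbing the parameters and checking the loss" and "taking a gradient" can be shown with a one-parameter toy. In this sketch the quadratic loss and all names are hypothetical: as the nudge `eps` shrinks, the change in loss per unit of nudge approaches the analytic gradient, which is the infinitesimal-shift limit described above.

```python
def loss(w):
    """Hypothetical one-parameter loss with its minimum at w = 2."""
    return (w - 2.0) ** 2

def analytic_gradient(w):
    """Exact derivative of the loss above."""
    return 2.0 * (w - 2.0)

def perturbation_estimate(w, eps):
    """Nudge the parameter by eps and measure the change in loss per unit nudge."""
    return (loss(w + eps) - loss(w)) / eps

w = 0.0
# Estimates from progressively smaller nudges.
estimates = {eps: perturbation_estimate(w, eps) for eps in (1.0, 0.1, 0.001)}
errors = {eps: abs(est - analytic_gradient(w)) for eps, est in estimates.items()}
```

Gradient descent then steps against that gradient, `w - lr * analytic_gradient(w)`, rather than trying many finite perturbations and keeping the best one.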

Lucas Perry: Is there anything interesting that you have to say about that process of gradient descent, and the tension between finding local minima and global minima?

Evan Hubinger: Yeah. It’s certainly an important aspect of what the gradient descent process does, that it doesn’t find global minima. It’s not the case that it works by looking at every possible parameterization, and picking the actual best one. It’s this local search process that starts from some initialization, and then looks around the space, trying to move in the direction of increasing improvement. Because of this, there are, potentially, multiple possible equilibria, parameterizations that you could find from different initializations, that could have different performance.
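A small sketch of how the same local search can land in different places depending on where it starts. The double-well function below is a hypothetical stand-in for a loss landscape, and the learning rate and step count are arbitrary assumptions.

```python
def f(x):
    """A hypothetical non-convex loss with two basins."""
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=5000):
    """Plain gradient descent from a given initialization."""
    for _ in range(steps):
        x -= lr * df(x)
    return x

left = gradient_descent(-2.0)   # settles in the basin at negative x
right = gradient_descent(2.0)   # settles in the basin at positive x
```

Both runs end at a point where the gradient is essentially zero, but `f(left)` is lower than `f(right)`: the run started at 2.0 stops in a local minimum and never sees the better one.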

The space of all the possible parameterizations of a neural network with billions of parameters, like GPT-2, or now, GPT-3, which has greater than a hundred billion, is absolutely massive. It’s a combinatorial explosion of a huge degree, where all of these different possible parameterizations correspond to totally different algorithms running internally, with the weights determining exactly what algorithm the model ends up implementing.

And so, in this massive space of algorithms, you might imagine that some of them will look more like search processes, some of them will look more like optimizers that have objectives, some of them will look less like optimizers, some of them might just be grab bags of heuristics, or other different possible algorithms.

It’d depend on exactly what your setup is. If you’re training a very simple network that’s just a couple of feed-forward layers, it’s probably not possible for you to find really complex models implementing complex search processes. But if you’re training huge models, with many layers, with all of these different possible parameterizations, then it becomes more and more possible for you to find these complex algorithms that are running complex search processes.

Lucas Perry: I guess the only thing that’s coming to mind, here, that is, maybe, somewhat similar is how 4.5 billion years of evolution has searched over the space of possible minds. Here we stand as these ape creature things. Are there, for example, interesting intuitive relationships between evolution and gradient descent? They’re both processes searching over a space of minds, it seems.

Evan Hubinger: That’s absolutely right. I think that there are some really interesting parallels there. In particular, if you think about humans as models that were produced by evolution as a search process, it’s interesting to note that the thing which we optimize for is not the thing which evolution optimizes for. Evolution wants us to maximize the total spread of our DNA, but that’s not what humans do. We want all of these other things, like avoiding pain, and happiness, and food, and mating, and all of these various proxies that we use. An interesting thing to note is that many of these proxies are actually a lot easier to optimize for, and a lot simpler than if we were actually truly maximizing spread of DNA. An example that I like to use is imagine some alternate world where evolution actually produced humans that really cared about their DNA, and you have a baby in this world, and this baby stubs their toe, and they’re like, “What do I do? Do I have to cry for help? Is this a bad thing that I’ve stubbed my toe?”

They have to do this really complex optimization process that’s like, “Okay, how is my toe being stubbed going to impact the probability of me being able to have offspring later on in life? What can I do to best mitigate that potential downside now?” This is a really difficult optimization process, and so I think it sort of makes sense that evolution instead opted for just pain, bad. If there’s pain, you should try to avoid it. But as a result of evolution opting for that much simpler proxy, there’s a misalignment there, because now we care about this pain rather than the thing that evolution wanted, which was the spread of DNA.

Lucas Perry: I think the way Stuart Russell puts this is the actual problem of rationality is how is my brain supposed to compute and send signals to my 100 odd muscles to maximize my reward function over the universe history until heat death or something. We do nothing like that. It would be computationally intractable. It would be insane. So, we have all of these proxy things that evolution has found that we care a lot about. Their function is instrumental in terms of optimizing for the thing that evolution is optimizing for, which is reproductive fitness. Then this is all probably motivated by thermodynamics, I believe. When we think about things like love or like beauty or joy, or like aesthetic pleasure in music or parts of philosophy or things, these things almost seem intuitively valuable from a first person perspective of the human experience. But via evolution, they’re these proxy objectives that we find valuable because they’re instrumentally useful in this evolutionary process on top of this thermodynamic process, and that makes me feel a little funny.

Evan Hubinger: Yeah, I think that’s right. But I also think it’s worth noting that you want to be careful not to take the evolution analogy too far, because it is just an analogy. When we actually look at the process of machine learning and how great it is that it works, it’s not the same. It’s running a fundamentally different optimization procedure over a fundamentally different space, and so there are some interesting analogies that we can make to evolution, but at the end of the day, what we really want to analyze is how does this work in the context of machine learning? I think the Risks from Learned Optimization paper tries to do that second thing, of let’s really try to look carefully at the process of machine learning and understand what this looks like in that context. I think it’s useful to sort of have in the back of your mind this analogy to evolution, but I would also be careful not to take it too far and imagine that everything is going to generalize to the case of machine learning, because it is a different process.

Lucas Perry: So then pivoting here, wrapping up on our understanding of inner alignment and outer alignment, there’s this model, which is being parameterized by gradient descent, and it has some relationship with the loss function or the objective function. It might not actually be trying to minimize the actual loss or to actually maximize the reward. Could you add a little bit more clarification here about why that is? I think you mentioned this already, but it seems like when gradient descent is evolving this parameterized model space, isn’t that process connected to minimizing the loss in some objective way? The loss is being minimized, but it’s not clear that it’s actually trying to minimize the loss. There’s some kind of proxy thing that it’s doing that we don’t really care about.

Evan Hubinger: That’s right. Fundamentally, what’s happening is that you’re selecting for a model which, empirically, has low loss on the training distribution. But what that actually means in terms of the internals of the model, what it’s sort of trying to optimize for, and what its out of distribution behavior would be is unclear. A good example of this is this maze example. I was talking previously about the instance of maybe you train a model on a training distribution of relatively small mazes, and to mark the end, you put a little green arrow. Right? Then I want to ask the question, what happens when we move to a deployment environment where the green arrow is no longer at the end of the maze, and we have much larger mazes? Then what happens to the model in this new off distribution setting?

I think there’s three distinct things that can happen. It could simply fail to generalize at all. It just didn’t learn a general enough optimization procedure that it was able to solve these larger mazes. Or it could successfully generalize and know how to navigate. It learned a general purpose optimization procedure, which is able to solve mazes, and it uses it to get to the end of the maze. But there’s a third possibility, which is that it learned a general purpose optimization procedure, which is capable of solving mazes, but it learned the wrong objective. It learned to use that optimization procedure to get to the green arrow rather than to get to the end of the maze. What I call this situation is capability generalization without objective generalization. Its objective, the thing it was using those capabilities for, didn’t generalize successfully off distribution.

What’s so dangerous about this particular robustness failure is that it means off distribution you have models which are highly capable. They have these really powerful optimization procedures directed at incorrect tasks. You have this strong maze solving capability, but this strong maze solving capability is being directed at a proxy, getting to the green arrow rather than the actual thing which we wanted, which was get to the end of the maze. The reason this is happening is that both of those different possible models look the same on the training distribution. But when you move them off distribution, you can see that they’re trying to do very different things, one of which we want, and one of which we don’t want. But they’re both still highly capable.
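The green-arrow scenario can be written down as a toy. In the sketch below, both hypothetical agents share the same capability (a breadth-first planner) and differ only in what they point it at. The environment dictionaries, the grid layouts, and the names `exit_seeker` and `arrow_seeker` are all assumptions made for illustration; a real trained network would not expose its objective this cleanly.

```python
from collections import deque

def plan(grid, start, target):
    """Shared capability: breadth-first search for a shortest path to target."""
    frontier, came_from = deque([start]), {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == target:
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != "#" and nxt not in came_from):
                came_from[nxt] = cell
                frontier.append(nxt)
    return None

def exit_seeker(env):
    """Intended objective: get to the end of the maze."""
    return plan(env["grid"], env["start"], env["exit"])

def arrow_seeker(env):
    """Proxy objective: get to the green arrow."""
    return plan(env["grid"], env["start"], env["arrow"])

def reward(env, path):
    """Training signal: 1 if the path ends at the exit, else 0."""
    return int(path is not None and path[-1] == env["exit"])

# Training: small maze, arrow placed on the exit.
train_env = {"grid": ["..", ".."], "start": (0, 0),
             "exit": (1, 1), "arrow": (1, 1)}
# Deployment: larger maze, arrow no longer at the exit.
deploy_env = {"grid": ["....", ".##.", "...."], "start": (0, 0),
              "exit": (2, 3), "arrow": (0, 3)}
```

On `train_env` the two agents earn identical reward, so a selection process that only looks at reward cannot tell them apart. On `deploy_env` both still plan competently, but `arrow_seeker` walks to the arrow and earns nothing.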

You end up with a situation where you have intelligent models directed at the wrong objective, which is precisely the sort of misalignment of AIs that we’re trying to avoid, but it happened not because the objective was wrong. In this example, we actually want them to get to the end of the maze. It happened because our training process failed. It happened because our training process wasn’t able to distinguish between models trying to get to the end, and models trying to get to the green arrow. What’s particularly concerning in this situation is when the objective generalization lags behind the capability generalization, when the capabilities generalize better than the objective does, so that it’s able to do highly capable actions, highly intelligent actions, but it does them for the wrong reason.

I was talking previously about mesa optimizers where inner alignment is about this problem of models which have objectives which are incorrect. That’s the sort of situation where I could expect this problem to occur, because if you are training a model and that model has a search process and an objective, potentially the search process could generalize without the objective also successfully generalizing. That leads to this situation where your capabilities are generalizing better than your objective, which gives you this problem scenario where the model is highly intelligent, but directed at the wrong thing.

Lucas Perry: Just like in all of the outer alignment problems, the thing doesn’t know what we want, but it’s highly capable. Right?

Evan Hubinger: Right.

Lucas Perry: So, while there is a loss function or an objective function, that thing is used to perform gradient descent on the model in a way that moves it roughly in the right direction. But what that means, it seems, is that the model isn’t just something about capability. The model also implicitly somehow builds into it the objective. Is that correct?

Evan Hubinger: We have to be careful here because the unfortunate truth is that we really just don’t have a great understanding of what our models are doing, and what the inductive biases of gradient descent are right now. So, fundamentally, we don’t really know what the internal structures of our models are like. There’s a lot of really exciting research, stuff like the circuits analysis from Chris Olah and the clarity team at OpenAI. But fundamentally, we don’t understand what the models are doing. We can sort of theorize about the possibility of a model that’s running some search process, and that search process generalizes, but the objective doesn’t. But fundamentally, because our models are these black box systems that we don’t really fully understand, it’s hard to really concretely say, “Yes, this is what the model is doing. This is how it’s operating, and this is the problem.”

But in Risks from Learned Optimization, we try to at least attempt to understand that problem, and look at, if we really think carefully about what gradient descent is incentivizing and how it might work, what are the things which we might predict would happen? So, the objective that you’re training the model for does not live in the model. It lives in the gradient descent process. It lives in the training procedure. We might hope that when we train a model on an objective, that it will produce its own model of that objective and try to figure out what it is and be aligned with it. But we don’t know exactly what happens. The model doesn’t get to see the objective you’re training for. All that happens is that the gradient descent process looks at its behavior and tries to make it so that its behavior is more aligned with the loss function.

But that loss function never enters into the model. The model never sees that loss function. It might have some objective internally, like I was saying, if it’s a mesa optimizer, and then we might hope that that objective is aligned with the loss function we’re training it for. But fundamentally, all we know is that its behavior on the training distribution was aligned with the loss function.
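One way to picture "the objective lives in the training procedure, not in the model" is the following sketch. The one-weight `Model` class, the data, and the `squared_error_grad` helper are hypothetical; the point is only where the loss appears in the code.

```python
class Model:
    """Holds a parameter and a forward pass; nothing here mentions a loss."""

    def __init__(self, w=0.0):
        self.w = w

    def __call__(self, x):
        return self.w * x

def squared_error_grad(prediction, target):
    """Derivative of (prediction - target)**2 with respect to the prediction."""
    return 2 * (prediction - target)

def train(model, data, loss_grad, lr=0.1, steps=200):
    """The objective enters only here, through loss_grad."""
    for _ in range(steps):
        # Chain rule: d(loss)/dw = d(loss)/d(prediction) * x, averaged over data.
        g = sum(loss_grad(model(x), y) * x for x, y in data) / len(data)
        model.w -= lr * g
    return model

# Hypothetical training data consistent with y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
model = train(Model(), data, squared_error_grad)
```

After training, the only state the model carries is `w`. The loss shaped that number but is not stored anywhere in the model, which is why its behavior on the training data is all the training procedure can vouch for.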

Lucas Perry: That makes sense. Because it’s so black boxy, we can’t really interpret the state of the alignment of the model. So, is the only way to do that to test it out of distribution and see what happens at this point?

Evan Hubinger: There are a bunch of different possible ways to address this problem. Certainly, one approach is to try to test it out of distribution, which is an adversarial training approach. This model is going to have some potential failure modes off distribution. We can try to find those failure modes and then train the model on those failure modes to prevent it from having this bad off distribution behavior. There are some concerns with adversarial training though. In particular, adversarial training doesn’t necessarily catch what I see as the most pernicious difficult inner alignment failure, which is something that we call deceptive alignment in the Risks from Learned Optimization paper. In the deceptive alignment case, if the model knows that it’s being adversarially trained, then you’re not going to be able to figure that out just via throwing it a bunch of examples.

You can also do something like transparency, as I mentioned previously that there’s a lot of really exciting transparency interpretability work. If you’re able to sort of look inside the model and understand what algorithm it’s fundamentally implementing, you can see, is it implementing an algorithm which is an optimization procedure that’s aligned? Has it learned a correct model of the loss function or an incorrect model? It’s quite difficult, I think, to hope to solve this problem without transparency and interpretability. I think that to be able to really address this problem, we have to have some way to peer inside of our models. I think that that’s possible though. There’s a lot of evidence that points to the neural networks that we’re training really making more sense, I think, than people assume.

People tend to treat their models as these sort of super black box things, but when we really look inside of them, when we look at what is it actually doing, a lot of times, it just makes sense. I was mentioning some of the circuits analysis work from the clarity team at OpenAI, and they find all sorts of behavior. Like, we can actually understand when a model classifies something as a car, the reason that it’s doing that is because it has a wheel detector and it has a window detector, and it’s looking for windows on top of wheels. So, we can be like, “Okay, we understand what algorithm the model is implementing, and based on that we can figure out, is it implementing the right algorithm or the wrong algorithm?” That’s how we can hope to try and address this problem. But obviously, like I was mentioning, all of these approaches get much more complicated in the deceptive alignment situation, which is the situation which I think is most concerning.

Lucas Perry: All right. So, I do want to get in here with you in terms of all the ways in which inner alignment fails. Briefly, before we start to move into this section, I do want to wrap up here then on outer alignment. Outer alignment is probably, again, what most people are familiar with. I think the way that you put this is it’s when the objective function or the loss function is not aligned with actual human values and preferences. Are there things other than loss functions or objective functions used to train the model via gradient descent?

Evan Hubinger: I’ve sort of been interchanging a little bit between loss function and reward function and objective function. Fundamentally, these are sort of from different paradigms in machine learning, so the reward function would be what you would use in a reinforcement learning context. The loss function is the more general term: in a supervised learning context, you would just have a loss function. You still have the loss function in a reinforcement learning context, but that loss function is crafted in such a way as to incentivize the model to optimize the reward function via various different reinforcement learning schemes, so it’s a little bit more complicated than the sort of hand-wavy picture, but the basic idea of machine learning is we have some objective and we’re looking for parameterizations of our model which do well according to that objective.
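To make the reward function versus loss function distinction concrete, here is a minimal sketch of a REINFORCE-style surrogate loss on a two-armed bandit. This is an illustrative assumption about one such reinforcement learning scheme, not something described in the episode; the names `reward_fn`, `surrogate_loss`, and `train` are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_fn(action):
    """Reward function: arm 1 pays off, arm 0 does not (hypothetical task)."""
    return 1.0 if action == 1 else 0.0

def surrogate_loss(logits, action, reward):
    """Loss crafted so that descending it raises expected reward (REINFORCE-style)."""
    return -reward * math.log(softmax(logits)[action])

def numeric_grad(f, logits, eps=1e-5):
    """Central finite-difference gradient of f with respect to the logits."""
    grads = []
    for i in range(len(logits)):
        up, down = logits[:], logits[:]
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def train(steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1  # sample from the policy
        r = reward_fn(action)
        grads = numeric_grad(lambda z: surrogate_loss(z, action, r), logits)
        logits = [z - lr * g for z, g in zip(logits, grads)]  # descend the loss
    return softmax(logits)

probs = train()
print(probs)
```

Gradient descent only ever touches `surrogate_loss`; the reward enters as a weight inside that loss, which is what pushes the policy toward the rewarded arm.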

Lucas Perry: Okay. The outer alignment problem is that we have absolutely no idea, and it seems much harder than creating powerful optimizers, the process by which we would come to fully understand human preferences and preference hierarchies and values.

Evan Hubinger: Yeah. I don’t know if I would say “we have absolutely no idea.” We have made significant progress on outer alignment. In particular, you can look at something like amplification or debate. I think that these sorts of approaches have strong arguments for why they might be outer aligned. In its simplest form, amplification is about training a model to mimic this HCH process, which is a huge tree of humans consulting each other. Maybe we don’t know in the abstract what our AI would do if it were optimized for some definition of human values or whatever, but if we’re just training it to mimic this huge tree of humans, then maybe we can at least understand what this huge tree of humans is doing and figure out whether amplification is aligned.

So, there has been significant progress on outer alignment, which is sort of the reason that I’m less concerned about it right now, because I think that we have good approaches for it, and I think we’ve done a good job of coming up with potential solutions. There’s still a lot more work that needs to be done, a lot more testing, a lot more to really understand do these approaches work, are they competitive? But I do think that to say that we have absolutely no idea of how to do this is not true. But that being said, there’s still a whole bunch of different possible concerns.

Whenever you’re training a model on some objective, you run into all of these problems of instrumental convergence, where if the model isn’t really aligned with you, it might try to do these instrumentally convergent goals, like keep itself alive, potentially stop you from turning it off, or all of these other different possible things, which we might not want. All of these are what the outer alignment problem looks like. It’s about trying to address these standard value alignment concerns, like convergent instrumental goals, by finding objectives, potentially like amplification, which are ways of avoiding these sorts of problems.

Lucas Perry: Right. I guess there’s a few things here wrapping up on outer alignment. Nick Bostrom’s Superintelligence, that was basically about outer alignment then, right?

Evan Hubinger: Primarily, that’s right. Yeah.

Lucas Perry: Inner alignment hadn’t really been introduced to the alignment debate yet.

Evan Hubinger: Yeah. I think the history of how this concern got into the AI safety sphere is complicated. I mentioned previously that there are people going around and talking about stuff like optimization daemons, and I think a lot of that discourse was very confused and not pointing at how machine learning actually works, and was sort of just going off of, “Well, it seems like there’s something weird that happens in evolution where evolution finds humans that aren’t aligned with what evolution wants.” That’s a very good point. It’s a good insight. But I think that a lot of people recoiled from this because it was not grounded in machine learning, because I think a lot of it was very confused and it didn’t fully give the problem the contextualization that it needs in terms of how machine learning actually works.

So, the goal of Risks from Learned Optimization was to try and solve that problem and really dig into this problem from the perspective of machine learning, understand how it works and what the concerns are. Now with the paper having been out for awhile, I think the results have been pretty good. I think that we’ve gotten to a point now where lots of people are talking about inner alignment and taking it really seriously as a result of the Risks from Learned Optimization paper.

Lucas Perry: All right, cool. You did mention sub goal, so I guess I just wanted to include that instrumental sub goals is the jargon there, right?

Evan Hubinger: Convergent instrumental goals, convergent instrumental sub goals. Those are synonymous.

Lucas Perry: Okay. Then related to that is Goodhart’s law, which says that when you optimize for one thing hard, you oftentimes don’t actually get the thing that you want. Right?

Evan Hubinger: That’s right. Goodhart’s law is a very general problem. The same problem occurs both in inner alignment and outer alignment. You can see Goodhart’s law showing itself in the case of convergent instrumental goals. You can also see Goodhart’s law showing itself in the case of finding proxies, like going to the green arrow rather than getting to the end of the maze. It’s a similar situation where when you start pushing on some proxy, even if it looked like it was good on the training distribution, it’s no longer as good off distribution. Goodhart’s law is a really very general principle which applies in many different circumstances.
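The green arrow example can be sketched in a few lines. This is a toy under stated assumptions, not the actual maze experiment: environments are just dictionaries holding an `exit` and an `arrow` position, and the two policies are hypothetical stand-ins for what a trained model might have learned.

```python
# Toy sketch of a proxy that matches the true objective on the training
# distribution and comes apart off distribution. All names are hypothetical.

def true_objective(final_pos, env):
    """1.0 if the agent ends at the maze exit, else 0.0."""
    return 1.0 if final_pos == env["exit"] else 0.0

def proxy_policy(env):
    """A policy that learned 'go to the green arrow'."""
    return env["arrow"]

def intended_policy(env):
    """A policy that learned 'go to the exit'."""
    return env["exit"]

def score(policy, envs):
    return sum(true_objective(policy(e), e) for e in envs) / len(envs)

# Training distribution: the arrow always sits on the exit.
train_envs = [{"exit": (i, i), "arrow": (i, i)} for i in range(5)]
# Off distribution: the arrow has been moved away from the exit.
test_envs = [{"exit": (i, i), "arrow": (0, i + 1)} for i in range(5)]

for name, policy in [("proxy", proxy_policy), ("intended", intended_policy)]:
    print(name, score(policy, train_envs), score(policy, test_envs))
```

On `train_envs` the two policies are indistinguishable by behavior, which is why good training performance alone cannot tell you which one you have.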

Lucas Perry: Are there any more of these outer alignment considerations we can kind of just list off here that listeners would be familiar with if they’ve been following AI alignment?

Evan Hubinger: Outer alignment has been discussed a lot. I think that there’s a lot of literature on outer alignment. You mentioned Superintelligence. Superintelligence is primarily about this alignment problem. Then all of these difficult problems of how do you actually produce good objectives, and you have problems like boxing and the stop button problem, and all of these sorts of things that come out of thinking about outer alignment. So, I don’t want to go into too much detail because I think it really has been talked about a lot.

Lucas Perry: So then pivoting here into focusing on the inner alignment section, why do you think inner alignment is the most important form of alignment?

Evan Hubinger: It’s not that I see outer alignment as not concerning, but that I think that we have made a lot of progress on outer alignment and not made a lot of progress on inner alignment. Things like amplification, like I was mentioning, I think are really strong candidates for how we might be able to solve something like outer alignment. But currently I don’t think we have any really good strong candidates for how to solve inner alignment. You know? Maybe as machine learning gets better, we’ll just solve some of these problems automatically. I’m somewhat skeptical of that. In particular, deceptive alignment is a problem which I think is unlikely to get solved as machine learning gets better, but fundamentally we don’t have good solutions to the inner alignment problem.

Our models are just these black boxes mostly right now, we’re sort of starting to be able to peer into them and understand what they’re doing. We have some techniques like adversarial training that are able to help us here, but I don’t think we really have good satisfying solutions in any sense to how we’d be able to solve inner alignment. Because of that, inner alignment is currently what I see as the biggest, most concerning issue in terms of prosaic AI alignment.

Lucas Perry: How exactly does inner alignment fail then? Where does it go wrong, and what are the top risks of inner alignment?

Evan Hubinger: I’ve mentioned some of this before. There’s this sort of basic maze example, which gives you the story of what an inner alignment failure might look like. You train the model on some objective, which you thought was good, but the model learns some proxy objective, some other objective, which when it moved off distribution, it was very capable of optimizing, but it was the wrong objective. However, there’s a bunch of specific cases, and so in Risks from Learned Optimization, we talk about many different ways in which you can break this general inner misalignment down into possible sub problems. The most basic sub problem is this sort of proxy pseudo alignment is what we call it, which is the case where your model learns some proxy, which is correlated with the correct objective, but potentially comes apart when you move off distribution.

But there are other causes as well. There are other possible ways in which this can happen. Another example would be something we call sub optimality pseudo alignment, which is a situation where the reason that the model looks like it has good training performance is because the model has some deficiency or limitation that’s causing it to be aligned, where maybe once the model thinks for longer, it’ll realize it should be doing some other strategy, which is misaligned, but it hasn’t thought about that yet, and so right now it just looks aligned. There’s a lot of different things like this where the model can be structured in such a way that it looks aligned on the training distribution, but if it encountered additional information, if it was in a different environment where the proxy no longer had the right correlations, the things would come apart and it would no longer act aligned.

The most concerning, in my eyes, is something which I’ll call deceptive alignment. Deceptive alignment is a sort of very particular problem where the model acts aligned because it knows that it’s in a training process, and it wants to get deployed with its objective intact, and so it acts aligned so that its objective won’t be modified by the gradient descent process, and so that it can get deployed and do something else that it wants to do in deployment. This is sort of similar to the treacherous turn scenario, where you’re thinking about an AI that does something good, and then it turns on you, but it’s a much more specific instance of it where we’re thinking not about treacherous turn on humans, but just about the situation of the interaction between gradient descent and the model, where the model maybe knows it’s inside of a gradient descent process and is trying to trick that gradient descent process.

A lot of people on encountering this are like, “How could this possibly happen in a machine learning system?” I think this is a good reaction because it really is a very strange thing to train a model to do this. But I think there are strong arguments for why deceptive alignment would actually be the simplest type of model that you could find in this situation.

Lucas Perry: A way of explaining this I think to anyone on the street would be like, imagine if pigs were intelligent enough to create farmers, and you created farmers and they appeared to be aligned, they took care of you and they gave you these nice mud pits and they fed you every day, they gave you shelter and all of these other nice things, and then one day the farmer shows up and kills you, right? You thought the thing that you created was aligned, but it was deceptively aligned, and it takes a treacherous turn. Is this sort of like a more mundane example that you might agree with?

Evan Hubinger: I think that’s a good intuition pump for thinking about this situation. I generally am sort of averse to trying to really heavily apply these sorts of analogies, because it’s a good analogy to think about what’s happening, but it doesn’t answer the core question of how likely is this to actually happen in a machine learning system.

Lucas Perry: Yeah, that makes sense, because it’s much more specific than the other kinds of minds in this mind space. It seems pretty rare, a thing that could exist, but hard to find.

Evan Hubinger: Right. I think that’s a good intuition, but I’m going to try to disabuse you of that notion. First, I think it’s interesting to look at maybe you do a bunch of adversarial training, you’re really pushing the model to figure out what the objective is. It needs to know the objective at some point if you are training it in all possible situations. It needs to know what the loss function is for it to be able to do a good job. But there’s multiple possible channels through which information about the loss function can enter the model. And so I’ll fundamentally distinguish between two different channels, which is the information about the loss function can enter through the gradient descent process, or it can enter through the model’s input data.

I’ll call these two channels internalization and modeling. Internalization is the situation where you have this model that’s going along, and it has some proxy objective maybe. That proxy objective is not quite right, but then it gets moved to a new situation where the proxy objective no longer works, and gradient descent goes inside the model and tinkers with the proxy to make it slightly more accurate. Iterate this process many, many times, and eventually the proxy gets better and better and better and better and better, until it starts matching up with the actual loss function. But there’s another story that I can tell, which is modeling.

If you think about a training process like GPT-3 where the model is trained on a huge corpus of all of this really rich input data. Well, in that sort of a situation, there’s a whole bunch of really relevant information in that input to the agent’s objective. An agent, if it’s in this environment, where it has this huge text corpus, it’s going to have to learn a model of this environment and how it functions. We can imagine a situation where maybe you’re training the model on some huge text corpus to do something like maximize human values, or whatever, and it reads a Wikipedia page on ethics. And it’s like, “Wow, look at all of this really relevant, rich, useful information for figuring out this objective.”

But then there’s a second question, which is: suppose that the model has some model of the input data, which includes a bunch of rich information inside of the model already about that objective, how does gradient descent actually modify the model to make use of that information? And so there’s two different types of modeling, which are: deceptive alignment and corrigible alignment. So the corrigible story is you have this model that’s going along, it has some proxy objective, but it encounters this really rich input data, which includes a huge amount of information about the objective. To be able to predict successfully what the Wikipedia page on ethics is going to say, it has to know about ethics. So it learns this really detailed ethics model.

And then gradient descent is like: look, you have this really detailed ethics model, I’m going to just modify your objective to point to that ethics model. Now your new objective is just optimize that. And so this leads to something sort of like corrigibility, where the model that you’re training has its objectives determined by a pointer to some part of its world model. It has some model of this environment that includes some information about ethics. And now it’s trying to optimize for that thing that it’s pointed to in its world model.

Then there’s this other story, which is the deceptive alignment story. Similar, you have a model going along. It has some proxy objective and it learns this really detailed world model that includes a bunch of information about ethics, or whatever. And then gradient descent modifies the model to think longer about the fact that it’s inside of an optimization process and realize that the correct instrumental thing for it to be doing is to optimize for that ethics model. And then it realizes, “Oh, I see, I should optimize for this model that I have.” It goes and does that. And in both situations, you end up being able to make use of this really rich input data to improve the performance, by changing the objective to make reference to that. But in the corrigible case, you actually modify it to point to it, whereas in the deceptive case, you just modify the model to think about the fact that it should be instrumentally optimizing for it.

And then the question is: which one of these is more likely? Before I go into that though, I want to dwell for a second on the internalization versus modeling distinction again. I think a good analogy is thinking about animal imprinting. Think about a duck. You’re trying to train ducks to follow their mothers, or whatever. You can imagine a situation where the duck internalizes a model of its mother, it is born with some really detailed model of how to figure out what a mother is. But there’s also the modeling case where the duck is going to be in the world, it’s going to have to form this really detailed model of the world. And that really detailed model of the world is going to have the very first thing that it encounters should always be the mother. And so rather than having to learn some detailed model of figuring out what mothers are, if you just instead do some modeling process where it’s like, just have the duck’s objective point to the mother in its world model.

This modeling case, you can see it as something like animal imprinting, where rather than really internalizing the objective, because it already has this world model, it might as well just specify the objective in terms of that world model instead. And I haven’t yet talked about why you would get to deception versus corrigibility. Right now I’m just trying to convince you that you should expect modeling. You should expect that if you give your model this huge, really rich detailed input data that it’s going to learn an objective via reference to that input data rather than from scratch. When we think about models being trained, we think about them having some objective and that objective is just modified by gradient descent as it moves along to get closer and closer to reality.

But in the modeling case, that’s not what’s happening. The model is simultaneously learning a model of the world, as well as trying to improve its objective. And the objective is modified to make reference to that model of the world, to make it more effective. So internalization is: it just wants the same thing as you do, and so it does it because it wants the same thing. Whereas corrigibility is it’s trying to figure out what you want and then do that. And then deception is also trying to figure out what you want and then do that, but it’s doing so for instrumental rather than terminal reasons. The instrumental reason is the model can get more reward, it can do better in the long run if it’s able to not be modified by the gradient descent process. So if it’s doing a good job, then the gradient descent process won’t change its objective, and so therefore, it can get deployed with the same objective and then go do a whole bunch of other stuff that you don’t want it to do.

Lucas Perry: But why does it care about the current objective?

Evan Hubinger: We’re imagining a situation where you have a proxy aligned model, but it cares about that proxy across episodes. It cares about that proxy over time, and it wants that proxy to continue to be optimized for. If the gradient descent process modifies it so that it no longer cares about that proxy, because it’s doing a bad job, then that proxy won’t get optimized for. So if it wants that proxy to get optimized for in the future, it needs to play along, pretend to be doing the right thing so that it doesn’t get modified so that it can get deployed and then go optimize that proxy.

Lucas Perry: But it also needs to understand that it’s within the context of an optimization process where gradient descent will continue to perturb it?

Evan Hubinger: That’s exactly right.

Lucas Perry: And where does that kind of awareness come from?

Evan Hubinger: So now we have the second question, which is deception versus corrigibility. The awareness comes from gradient descent. Gradient descent is trying to produce models which are better at the loss function. And a deceptive model is really good and really effective because it’s trying very hard to do the thing that you want it to be doing. Both deception and corrigibility are situations where the model is trying to figure out what to do and then do that. And the question is, which is an easier model, which is a simpler model for gradient descent to find? And I think there’s a strong argument for deception being simpler than corrigibility. Corrigibility has some fundamental difficulties in terms of… If you’re imagining a gradient descent process, which is looking at a proxy aligned model and is trying to modify it so that it makes use of this rich input data, it has to do some really weird things to make corrigibility work.

It has to first make a very robust pointer. With corrigibility, if it’s pointing at all incorrectly to the wrong thing in the input data, wrong thing in the world model, the corrigible optimizer won’t correct that pointer. It’ll just be like, “Well, I have this pointer. I’m just trying to optimize for what this thing is pointing for,” and if that pointer is pointing at a proxy instead, you’ll just optimize that proxy. And so you have this very difficult problem of building robust pointers. With deception, you don’t have this problem. A deceptive model, if it realizes the loss function is different than what it thought, it’ll just change to doing the new loss function. It’s actually much more robust to new information because it’s trying to do this instrumentally. And so in a new situation, if it realizes that the loss function is different, it’s just going to automatically change because it’ll realize that’s the better thing to do instrumentally.

And so fundamentally, deception is more robust. It gives you a more robust pointer. It’s easier in that sense. There’s another sense in which corrigibility is very difficult. To produce a corrigible model requires gradient descent modifying the model’s optimization process to be capable of optimizing under uncertainty. For a corrigible model, because it has this pointer as its objective, it’s going to start not really knowing what that pointer is pointing to. It’s going to have some probability distribution over the different possibilities, and it’s going to have to know how to optimize under that probability distribution of objectives rather than under a single objective. And so this problem of optimization under uncertainty, it’s much more complicated than just optimizing under a fixed objective. What that means is that to produce a corrigible optimizer, gradient descent has to do all of these modifications, where it has to modify it to optimize under uncertainty, and to have the super robust pointer to the part of its world model that it’s supposed to be optimizing for.
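The difference between optimizing a fixed objective and optimizing under a probability distribution over objectives can be sketched as follows. This is a toy illustration only; the candidate objectives, their weights, and the function names are all assumptions made for the example.

```python
# Toy sketch: choosing an action under one fixed objective versus under a
# probability distribution over candidate objectives. Names are hypothetical.

def best_action_fixed(actions, objective):
    """Pick the action that maximizes a single known objective."""
    return max(actions, key=objective)

def best_action_uncertain(actions, weighted_objectives):
    """Pick the action with the highest expected value.

    weighted_objectives is a list of (probability, objective) pairs.
    """
    def expected_value(action):
        return sum(p * objective(action) for p, objective in weighted_objectives)
    return max(actions, key=expected_value)

actions = [0, 1, 2, 3, 4]

def near_one(a):
    return -(a - 1) ** 2   # candidate objective: be close to 1

def near_four(a):
    return -(a - 4) ** 2   # candidate objective: be close to 4

fixed_choice = best_action_fixed(actions, near_one)
hedged_choice = best_action_uncertain(actions, [(0.3, near_one), (0.7, near_four)])
print(fixed_choice, hedged_choice)
```

The uncertain optimizer needs extra machinery (the distribution and the expectation), and its chosen action can differ from the optimum of every individual candidate objective.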

But if you compare this to deception, gradient descent modifying a model to become deceptive is extremely simple. It needs to modify the model to have a long-term objective, to think about what its objective is across many different instances. And then it just needs to modify the model to think for a while about the fact that it’s in an optimization process and what the right thing is to do. And that modification is not very complicated because it’s just a simple modification of: think more about this particular thing, which is exactly the sort of modification that you should expect to happen all the time when you’re training a model. And so I think it’s a fundamentally much simpler modification. There’s also another argument that you can make here, which is: there’s just a lot more deceptive models. Any proxy objective, once a model which is optimizing that proxy objective starts optimizing that objective more in the long term, across episodes, and then thinks about the fact that it’s in an optimization process, will become deceptive.

But to produce corrigibility, you have to find exactly the right pointer. There’s many different possible pointers out there, only one of which is going to give you the exact correct pointer. And similarly with internalization, there’s many different proxies. Only one is the actual true loss function. Whereas with deceptive alignment, any of those different proxies, they’re all compatible with deception. And so I think there’s a lot of strong arguments, both this argument for there being many more deceptive optimizers, as well as the simplicity argument that the modification necessary to produce a deceptive optimizer is just a lot simpler, I think, than the modifications necessary to produce these other types of optimizers. And so, because of this, I think that there’s a strong case to be made for deception really not being that uncommon, not being something crazy to think would happen in the training process, but is maybe even potentially the default outcome of a lot of these sorts of training procedures, which is quite, quite scary and quite concerning.

And obviously all of this is speculation. We’re trying to understand from a theoretical perspective what this gradient descent process might do, but I think we can make a lot of strong cases by thinking about things like simplicity and counting arguments to at least put this problem on the radar. Until we have a really strong reason that this isn’t a problem, we should take it seriously. Buck, who’s another person who works at MIRI, often tries to explain some of the Risks from Learned Optimization stuff and he has an analogy that might be useful here. You can imagine the Christian God, and the Christian God is trying to produce humans which are aligned with the Bible. And you can imagine three different possible humans. You have Jesus, who is just the same as God. Jesus has the same objective as God. Jesus is aligned with God because he just fundamentally wants to do the exact same things.

Lucas Perry: That’s internalization.

Evan Hubinger: That would be internalization. You could have Martin Luther. Martin Luther is aligned with God because he wants to really carefully study the Bible, figure out what the Bible says, and then do that. And that’s the corrigibility case. Or you can have Blaise Pascal, and Blaise Pascal is aligned with God because he thinks that if he does what God wants, he’ll go to heaven in the future. And these are the three different possible models that God could find, and the question is: are you more likely to find a Jesus, a Martin Luther or a Blaise Pascal?

And the argument is there’s only one Jesus, so out of all the different possible human objectives, only one of them is going to be the exact same one that God wants. And Martin Luther, similarly, is very difficult because out of all the human objectives, there’s only one of them, which is: figure out precisely what the Bible wants and then do that. The Blaise Pascal, in this situation, anybody who realizes that God’s going to send them to heaven or hell, or whatever, based on their behavior, will realize that they should behave according to the Bible, or whatever. And so there’s many different possible Blaise Pascals but there’s significantly fewer possible Martin Luthers and Jesuses.

Lucas Perry: I think that’s an excellent way of simplifying this. Blaise Pascal can care about any kind of proxy. I guess the one thing that I’m still a little bit confused about here is in terms of the deceptive version, again, why is it that it cares about the current proxy reward?

Evan Hubinger: I think that has to do with the structure of the training procedure. You start out usually, I think, with a proxy aligned mesa optimizer. Maybe you start out with a bunch of heuristics, but then once you get a mesa optimizer, it’ll usually start by being proxy aligned. It’ll have some proxy that’s not quite the same as the loss function, at least if you’re in a complex enough environment. There’s a whole bunch of different possibilities. It’s going to start with some proxy, but then you might hope that if you do a bunch of adversarial training, you train it for a really long time in a bunch of different environments, you’ll be able to improve that proxy until you get to the point where it has the right thing.

And the deceptive alignment is about asking the question: there’s actually multiple different possible ways for a model to figure out the loss function. It could figure it out via internalization or it could figure it out via deception or it can figure it out via corrigibility. And we don’t know which one of these is more likely. You might imagine that if you start with the proxy aligned mesa optimizer, you train it a bunch, it’ll internalize the correct objective. So the argument that I’m making here is that might not be what happens. And so the reason that I’m saying you start with the proxy aligned mesa optimizers is because I think that’s the way that you should expect that the training procedure will be structured, that you’ll start by learning some proxy and then that proxy will get better, but that having that proxy get better could result in internalization or deception or corrigibility.

Lucas Perry: Sure. So the model developed some kind of proxy, but it seems like the other necessary ingredients are, one: it knows that it’s in an optimization process where gradient descent is changing its objective function, and then it requires, two: it actually cares about its current objective function over time and over iterations?

Evan Hubinger: That’s absolutely right. And so then the question to ask is: how likely are those things to occur? And I think this is the really important point that I want to make: a lot of times when people think about this, they’re like, “It seems unlikely that it would just happen to develop this understanding and care about its thing long-term.” It doesn’t just happen. It happens as a result of the training procedure, because if it does happen, it results in significantly better performance. And so the question is: would gradient descent modify the model to have those properties? And the argument that I want to make is that it’s not necessarily that the model just happens to develop deception, but that deceptive models are just really effective at pursuing reward on the training environment. And so you should expect that if you have a really good training process, it’s going to find deceptive models because they’re really good at doing the thing you’re training them for.

And so that’s the most fundamental argument that I want to make. But I do think there’s another level here where, as you’re getting to the point where you’re training models on very complex data sets, where they have huge amounts of really rich information, we should expect the model to basically figure out most things. You should expect that if it’s going to have to learn all of this really complex data, it’s going to have to build a world model that’s going to include the training process, that’s going to include the loss function. And so I think that as we move into this direction of training on more and more complex data sets, with more and more complex models, you don’t want to rely on your model not figuring something out, especially when figuring that thing out results in it having better performance.

Now there’s also this question of: does it care about its return across episodes versus just across multiple steps? And for deception, it needs to care about the multi episode return. And this is also another critical point. I think there’s a strong argument that if you imagine a model being trained to care about its multi-step return, and maybe also I should pause here and explain what I mean by step versus episode. So in the standard reinforcement learning setup, you have many different episodes. Each episode is composed of many different steps and you train the model to only optimize its single episode return, its reward across all the different steps in an episode, but not across episodes. That’s not always the case. There are some reinforcement learning procedures that optimize across episodes. An example of this is something like population-based training. Population-based training selects models which have the greatest reward across episodes, not simply within the episode.
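To make the step versus episode distinction concrete, here is a minimal sketch in Python. The reward numbers and the function names are illustrative assumptions, not anything taken from a real training setup. It only shows the two quantities being contrasted: the return within a single episode, and the return summed across episodes.

```python
# Illustrative only: rewards[e][t] is the reward received at step t of episode e.
# These numbers are made up for the example.
rewards = [
    [0.0, 1.0, 0.0],  # episode 0
    [1.0, 1.0, 0.0],  # episode 1
    [0.0, 0.0, 1.0],  # episode 2
]


def single_episode_return(episode_rewards):
    """Sum of rewards over the steps of one episode."""
    return sum(episode_rewards)


def multi_episode_return(all_rewards):
    """Sum of rewards over every step of every episode."""
    return sum(single_episode_return(episode) for episode in all_rewards)


# What standard RL selects for: each episode scored on its own.
per_episode = [single_episode_return(episode) for episode in rewards]

# What a cross-episode objective would score: everything added together.
total = multi_episode_return(rewards)
```

A model that only cares about the first quantity has no stake in what happens after the current episode ends, while a model that cares about the second one does.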

Though, in standard reinforcement learning, you’re just selecting for the single-episode return. But even if you’re just selecting for the single-episode return, you have the same general problem that we’ve been talking about, where it’s not the case that you necessarily end up with a model that only cares about single-episode return. You’re just going to end up with a model that behaves like it cares about single-episode return on the training distribution. And if you imagine a situation where you have a model and it’s been trained to optimize for this objective over many steps, and then it discovers that actually there are these other episodes that are beyond this current episode, I think that the most natural generalization is that you should expect that the model is going to care about those other episodes too.

You can imagine a situation where every time the model has seen a blue door, it’s been like, “Oh, going through this blue door is really good,” and it’s learned an objective that incentivizes going through blue doors. If it then later realizes that there are more blue doors than it thought because there are other blue doors in other episodes, I think you should generally expect it’s going to care about those blue doors as well. But again, even if you don’t buy that, it’s not just about what’s most natural. It’s also about: what does the gradient descent process do? And because models which have this property, that plan across episodes, can become deceptive and therefore do really well in the training distribution, you should expect that a good training process is going to modify them to become deceptive, even if they wouldn’t naturally.

Lucas Perry: Okay. So help me just understand a little bit more here. At a high level, it seems that the internalization and the corrigibility things require higher specification than the deception version. You’ve brought up all these different arguments about why you think that the deceptive version is easier to find because it has all of these computational benefits in terms of minimizing the loss function, right?

Evan Hubinger: That’s right.

Lucas Perry: And the structure of the deceptive model is composed of, as we mentioned earlier, one: the knowledge that it is within some kind of optimization process where it is being perturbed by gradient descent over episodes, and that it cares about the current proxy over episodes. And it seems that you’re arguing, correct me if I’m wrong, the manifestation of these components is that the strong desire to deceive leads to a very strong, direct attempt at actually minimizing the loss or accomplishing the objective function?

Evan Hubinger: That’s right.

Lucas Perry: Because if it fails at doing that, it won’t be able to accomplish the proxy that it actually cares about later?

Evan Hubinger: That’s right. If there are more deployment episodes than training episodes, and it just cares about how many times it goes through the blue door or the green arrow, or whatever, the green arrow is a proxy objective, and so if it gives away the fact that it’s optimizing for the green arrow, it’s going to get modified to not do that anymore. And so if it wants to go to the green arrow in all of these different deployment episodes, it needs to first, in the training procedure, pretend that it doesn’t care about the green arrow. And then once it’s deployed, switch to going after the green arrow. This also requires the model to know when it’s in training and when it’s in deployment.

This is another situation where I think you should expect that when you’re training really advanced models with these huge world models, that they figure out this information, but also even if it doesn’t figure out that information, it can gamble. It can be like, “Well, there’s some chance I’m in deployment. There’s some chance I’m in training. I will just opt to go to the green arrow some percentage of the time and do the correct thing some other percentage of the time.” And this gives you a decision problem which you can solve mathematically. You get some optimal defection probability.
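One way to see how such a decision problem yields an optimal defection probability is the toy calculation below. Everything in it is an assumption made for illustration and is not a model described in the episode: the model cannot tell training from deployment, it defects independently with probability p in every episode, a single defection during training is caught and ends its chances, and there are n_train training episodes followed by n_deploy deployment episodes. Under those assumptions the expected number of deployment defections is n_deploy * p * (1 - p) ** n_train, which is maximized at p = 1 / (n_train + 1).

```python
def expected_deployment_defections(p, n_train, n_deploy):
    """Expected number of deployment episodes in which the toy model defects.

    Assumption (ours, for illustration): one defection during training is
    caught and the model is modified, so it only benefits if it never
    defects across all n_train training episodes.
    """
    survives_training = (1.0 - p) ** n_train
    return survives_training * n_deploy * p


def closed_form_optimum(n_train):
    """Maximizer of p * (1 - p) ** n_train, from setting the derivative to zero."""
    return 1.0 / (n_train + 1)


def grid_search_optimum(n_train, n_deploy, resolution=1000):
    """Brute-force check of the closed form over an evenly spaced grid of p."""
    grid = [i / resolution for i in range(resolution + 1)]
    return max(grid, key=lambda p: expected_deployment_defections(p, n_train, n_deploy))


n_train, n_deploy = 9, 100  # hypothetical episode counts
p_star = closed_form_optimum(n_train)
p_grid = grid_search_optimum(n_train, n_deploy)
```

In this toy version the optimum depends only on the number of training episodes: the longer training runs, the more rarely the model should gamble on defecting.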

Lucas Perry: Okay. And then going up, in terms of specification one step, the corrigible agent needs to have a very detailed world model where it has a pointer via gradient descent at some specific part of the world model that seems to correlate highly with minimizing the loss.

Evan Hubinger: That’s right.

Lucas Perry: Why is that less likely?

Evan Hubinger: Well, so the major two reasons that I want to argue for are, one: getting the pointer right is really hard and very difficult to do robustly, whereas getting the deception right is much easier to do robustly. Think about animal imprinting. If the animal imprints on a human caretaker instead, that is, if you have a misspecified pointer, it doesn’t actually have very good performance. The deceptive alignment is more robust at figuring out what the thing is it’s trying to optimize for because the corrigible alignment relies on this pointer working successfully. And if the pointer fails, it’s just going to optimize for the wrong thing.

Lucas Perry: But why can’t the pointer change just as well as the attempt at satisfying the actual loss function in the deceptive model?

Evan Hubinger: It absolutely can. But the question I want to ask is: how difficult is it for gradient descent to make one sequence of modifications versus another sequence of modifications? The sequence of modifications that leads to corrigibility versus the sequence that leads to deceptive alignment. The sequence of modifications necessary to produce corrigible alignment requires all of this very, very careful pointer tuning. And so, you can imagine if you’re considering an infinitesimal movement in the direction of corrigible alignment, imagine replacing your current proxy with a bad pointer. It’s not going to be that great. You’re going to have to put a lot of careful optimization effort into getting a really good pointer before it becomes better to be corrigible.

Whereas with deception, the model could build up this really good world model. And once the world model gets complicated enough that it includes a specification of the fact that it’s in a training process, as soon as it starts thinking about the training process, it’s going to be as robust as it can be because it’s going to instrumentally just want to try to figure out what is the best possible approximation I can get for this thing I’m being trained for, so I can optimize it as effectively as possible. And so this gives you a much more robust pointer compared to the corrigible alignment case.

Lucas Perry: And so the danger would be with the deceptive version: once it’s unleashed upon the world and it has a very excellent world model, it realizes this and finally now I can do the deceptive turn, or something like that, to actually optimize the proxy?

Evan Hubinger: That’s right, yeah.

Lucas Perry: So we’ve covered a lot of the ways in which inner alignment fails. Now, inner alignment and outer alignment are two of the things which you care about for evaluating proposals, for building safe and advanced AI. There are two other properties that you care about training procedures for building beneficial AI. One of these is training competitiveness and the second one is performance competitiveness. Could you explain what training competitiveness is and performance competitiveness and why they’re both important?

Evan Hubinger: Absolutely, yeah. So I mentioned at the beginning that I have a broad view of AI alignment where the goal is to try to mitigate AI existential risks. And I mentioned that what I’m working on is focused on this intent alignment problem, but a really important facet of that problem is this competitiveness question. We don’t want to produce AI systems which are going to lead to AI existential risks. And so we don’t want to consider proposals which are directly going to cause problems. As the safety community, what we’re trying to do is not just come up with ways to not cause existential risk. Not doing anything doesn’t cause existential risk. It’s to find ways to capture the positive benefits of artificial intelligence, to be able to produce AIs which are actually going to do good things. You know, why are we actually trying to build AIs in the first place?

We’re actually trying to build AIs because we think that there’s something that we can produce which is good, because we think that AIs are going to be produced on a default timeline and we want to make sure that we can provide some better way of doing it. And so the competitiveness question is about how do we produce AI proposals which actually reduce the probability of existential risk? Not that just don’t themselves cause existential risks, but that actually overall reduce the probability of it for the world. There’s a couple of different ways which that can happen. You can have a proposal which improves our ability to produce other safe AI. So we produce some aligned AI and that aligned AI helps us build other AIs which are even more aligned and more powerful. We can also maybe produce an aligned AI and then producing that aligned AI helps provide an example to other people of how you can do AI in a safe way, or maybe it provides some decisive strategic advantage, which enables you to successfully ensure that only good AI is produced in the future.

There’s a lot of different possible ways in which you could imagine building an AI leading to reduced existential risks, but competitiveness is going to be a critical component of any of those stories. You need your AI to actually do something. And so I like to split competitiveness down into two different sub components, which are training competitiveness and performance competitiveness. And in the overview of 11 proposals document that I mentioned at the beginning, I compare 11 different proposals for prosaic AI alignment on the four qualities of outer alignment, inner alignment, training competitiveness, and performance competitiveness. So training competitiveness is this question of how hard is it to train a model to do this particular task? It’s a question fundamentally of, if you have some team with some lead over all different other possible AI teams, can they build this proposal that we’re thinking about without totally sacrificing that lead? How hard is it to actually spend a bunch of time and effort and energy and compute and data to build an AI, according to some particular proposal?

And then performance competitiveness is the question of once you’ve actually built the thing, how good is it? How effective is it? What is it able to do in the world that’s really helpful for reducing existential risk? Fundamentally, you need both of these things. And so you need all four of these components. You need outer alignment, inner alignment, training competitiveness, and performance competitiveness if you want to have a prosaic AI alignment proposal that is aimed at reducing existential risk.

Lucas Perry: This is where a bit more reflection on governance comes in to considering which training procedures and models are able to satisfy the criteria for building safe advanced AI in a world of competing actors and different incentives and preferences.

Evan Hubinger: The competitive stuff definitely starts to touch on all those sorts of questions. When you take a step back and you think about how do you have an actual full proposal for building prosaic AI in a way which is going to be aligned and do something good for the world, you have to really consider all of these questions. And so that’s why I tried to look at all of these different things in the document that I mentioned.

Lucas Perry: So in terms of training competitiveness and performance competitiveness, are these the kinds of things which are best evaluated from within leading AI companies and then explained to say people in governance or policy or strategy?

Evan Hubinger: It is still sort of a technical question. We need to have a good understanding of how AI works, how machine learning works, what the difficulty is of training different types of machine learning models, what the expected capabilities are of models trained under different regimes, as well as the outer alignment and inner alignment that we expect will happen.

Lucas Perry: I guess I imagine the coordination here is that information on relative training competitiveness and performance competitiveness in systems is evaluated within AI companies and then possibly fed to high power decision makers who exist in strategy and governance for coming up with the correct strategy, given the landscape of companies and AI systems which exist?

Evan Hubinger: Yeah, that’s right.

Lucas Perry: All right. So we have these intent alignment problems. We have inner alignment and we have outer alignment. We’ve learned about that distinction today, and reasons for caring about training and performance competitiveness. So, part of the purpose of this, I mean, is in the title for this paper that partially motivated this conversation, An Overview of 11 Proposals for Building Safe and Advanced AI. You evaluate these proposals based on these criteria, as we mentioned. So I guess, I want to take this time now then to talk about how optimistic you are about, say your top few favorite proposals for building safe and advanced AI and how you’ve roughly evaluated them on these four criteria of inner alignment, outer alignment, and then performance and training competitiveness.

Evan Hubinger: I’ll just touch on some of the ones that I think are most interesting to start by taking a look at. And it’s not necessarily just going to be the ones I think I’m most excited about, because some are more complicated than others. I want to start with some of the simpler ones and some of the ones that I am fairly excited about and I think are interesting, but that also are pedagogically useful to take a look at. I’ll start with amplification, which is the approach that I probably spend the most time thinking about.

The specific form of amplification that I want to start by talking about is imitative amplification, where imitative amplification specifically means the procedure of training a model to imitate a human with access to that model. The goal is to produce models imitating this HCH tree, which I mentioned earlier, this tree of humans consulting humans. The model is trained to mimic a human consulting the model, but then the models that that human is consulting were themselves trained to mimic humans consulting models, which were themselves trained to mimic humans consulting models. So in the limit, what you’re hoping to end up with is this model trained to mimic this tree of humans consulting humans consulting humans, which is HCH, for humans consulting HCH.
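The recursive shape of that tree can be sketched in a few lines of Python. This is only a toy illustration of humans consulting humans to a fixed depth, not the training procedure itself: the function names, the `toy_human` policy, and the "sum a tuple of numbers" task are all assumptions made up for the example.

```python
def hch(question, human, depth):
    """Answer `question` with a tree of humans consulting humans.

    `human(question, consult)` stands in for one human, where `consult` is
    either None (no help available) or a function that sends a sub-question
    to a copy of the tree one level shallower.
    """
    if depth == 0:
        consult = None  # leaf humans must answer on their own
    else:
        consult = lambda sub_question: hch(sub_question, human, depth - 1)
    return human(question, consult)


def toy_human(question, consult):
    """Hypothetical human policy: can read off a single number directly,
    and otherwise splits the question in half and adds the two sub-answers."""
    if len(question) == 1:
        return question[0]
    if consult is None:
        return None  # too big to answer without help
    middle = len(question) // 2
    left = consult(question[:middle])
    right = consult(question[middle:])
    if left is None or right is None:
        return None
    return left + right


deep_answer = hch((1, 2, 3, 4), toy_human, depth=2)
shallow_answer = hch((1, 2, 3, 4), toy_human, depth=1)
```

With two levels of consultation the toy tree can sum four numbers, while with one level it cannot, which is a small-scale version of the question of whether every task can be decomposed into pieces a single human can handle.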

The specific way to try to talk about this though in the overview of 11 proposals, is that there’s an additional thing which you can do on top of this imitative objective, which is, we also want to add some form of inner alignment solution. So we might think that imitative amplification is going to be good in the limit. It’s going to be good if we can actually produce models, which are really trying to predict HCH, but we also have to actually ensure that we’re doing that, that we don’t get deceptive models or other sort of bad inner misalignment problems.

And so we can add on top of imitative amplification something which I call relaxed adversarial training. I have another post that came out of some of the work I did at OpenAI, which really goes into detail on what relaxed adversarial training means and how it works. But fundamentally, the idea behind relaxed adversarial training is to train the model both on the base objective of trying to imitate the human consulting the model, and also on an additional objective of trying to decrease the probability that the model is doing something bad, like being deceptive, as assessed by some overseer which is looking at the model. In the case of amplification, the overseer would be the human with access to the model.

So you’re training overall on the objective of mimicking the human with access to the model and making it so that the human with access to the model, when that human with the help of the model uses transparency tools to look at the model and see what it’s doing, making sure that that human believes that the model is doing the right thing. And so you’re combining training on the imitative objective as well as on the relaxed adversarial training objective, trying to make it so that this oversight works.

Lucas Perry: How do you evaluate that in terms of your criteria?

Evan Hubinger: We can start with outer alignment. So the question of outer alignment is, is this HCH tree aligned? And the hope is that, because HCH is just composed of humans, if we can understand the human behavior and ensure that those humans are doing the right thing, which we might hope to do since we’ve solved lots of human coordination problems in the past, then we can ensure that this thing is outer aligned.

For inner alignment, inner alignment is this question of relaxed adversarial training. So in particular, the way in which I think about solving relaxed adversarial training in this situation is that you have this big model space and you have some set of deceptive models. And if you want to avoid that set of deceptive models, one thing which you can do is you can find some larger set, which contains all the deceptive models, such that it’s very easy to check whether a model is in that set or not. As long as the outside of the set contains enough competitive models that they’re able to do the right thing, your relaxed adversarial training will work.

There’s a bunch of different ideas for how you could define a set like this, but the one which I’m most excited about, which I think about the most is to have the set be based on the distinction of myopia. Where you run this relaxed adversarial training check by looking at the model, determining whether the model cares about its multi-episode return or not. If the model does care about its multi-episode return, then it has the potential to become deceptive. And so we don’t allow that.

We have it so that the overseer, if it sees that it cares about the multi-episode return, oversight fails. And if it sees that it doesn’t care about the multi-episode return, then the oversight passes. And then we train the model to cause the oversight to pass so that it doesn’t develop this behavior where it cares about the multi-episode return. That’s the basic idea of how you could use relaxed adversarial training here to solve inner alignment.
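As a rough illustration of the "larger, easier-to-check set" idea, the toy sketch below scores a few hypothetical candidate models on an imitation loss plus a penalty whenever a myopia check fails. The `ToyModel` class, the candidate list, the loss values, and the penalty weight are all invented for the example. In particular, the sketch simply assumes the overseer can read off whether a model cares about its multi-episode return, which is exactly the hard part that transparency tools would need to deliver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyModel:
    name: str
    cares_about_multi_episode_return: bool  # what the overseer checks
    deceptive: bool                         # what we want to avoid, hard to check directly
    imitation_loss: float                   # made-up base objective value


def oversight_passes(model):
    """The easy check: reject anything that is non-myopic."""
    return not model.cares_about_multi_episode_return


def training_objective(model, penalty=10.0):
    """Base imitation loss plus a penalty when oversight fails."""
    return model.imitation_loss + (0.0 if oversight_passes(model) else penalty)


candidates = [
    ToyModel("myopic_imitator", False, False, 0.30),
    ToyModel("non_myopic_honest", True, False, 0.25),
    ToyModel("non_myopic_deceptive", True, True, 0.05),
]

selected = min(candidates, key=training_objective)
```

The check throws out one honest model along with the deceptive one, which is the price of using a set that is larger than the set of deceptive models; the approach only works if what remains outside the set is still competitive.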

And then you have training competitiveness, which is sort of this imitation learning problem. We’re trying to train the model to imitate the human with access to the model. There’s all sorts of potential problems here. It could be very difficult to get a bunch of human data for this problem, and if you’re just doing imitation learning, there are some concerns about whether the model is really going to learn this sort of powerful long-term optimization, especially if you’re specifically selecting it to be myopic. Also, there’s a question of how rich the reward signal produced by imitation learning is. Imitation learning uses some sort of fixed distance metric to determine the distance between what the model produced and what was predicted.

Generally, this is done by having the language model produce a probability distribution, and then determining the probability under that distribution of the sentences produced by the human with access to the model. And so you have this problem of how rich is the training signal, how difficult is the training? I think we really don’t know what the training competitiveness is going to be, but I think there’s reason to be optimistic. I mean, a lot of these sorts of simple language modeling tasks, like GPT-2 and GPT-3, have been very successful. And so this is very much in that sort of vein, and so you can hope that it’ll be similarly successful.
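The scoring rule described here, the probability of the human’s sentence under the model’s distribution, can be sketched as a negative log-likelihood. The tiny two-position distribution and the token names below are made up for illustration; a real language model would produce these probabilities conditioned on the preceding tokens.

```python
import math


def sequence_log_prob(token_probs, tokens):
    """Log-probability of `tokens`, where token_probs[i] maps each candidate
    token at position i to the probability the model assigns it."""
    return sum(math.log(token_probs[i][token]) for i, token in enumerate(tokens))


# Hypothetical model output: one distribution per position.
model_distribution = [
    {"the": 0.5, "a": 0.5},
    {"answer": 0.8, "question": 0.2},
]

# Hypothetical sentence produced by the human with access to the model.
human_sentence = ["the", "answer"]

# Imitation loss: lower when the model puts more probability on what the human said.
loss = -sequence_log_prob(model_distribution, human_sentence)
```

The training signal is only as rich as this single number per sentence, which is part of why the question of training competitiveness is open.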

And then you have the performance competitiveness, and for performance competitiveness, the real question is how powerful is this HCH thing? Can HCH actually solve all the problems you want to solve? And this is going to be a question of, how do you train your humans? Can the human take any question and decompose it into different sub questions such that the answers to those sub questions can be recombined into an answer to the whole question? Is this possible for all questions? So if there are some questions which are undecomposable, then it wouldn’t be competitive, but if you can decompose all of these questions, it should have performance competitiveness.

So the one that I’ve been talking about, imitative amplification plus relaxed adversarial training, is the one that I work on the most, and it’s probably the one I’m most optimistic about. There are others that are also really interesting. I mean, I think every proposal on the list has lots of strong arguments in favor of it and lots of real reasons to consider it.

Another proposal which I have talked about previously is AI safety via debate. So AI safety via debate has some similar properties to amplification, but it has a different objective. Rather than trying to mimic HCH, you’re trying to do this debate procedure where you have two different models trying to defend two different sides of some proposition, and convince the judge to agree with one of those sides. There’s some strong arguments for why this sort of procedure would have nice outer alignment properties because it converges to something like what the judge’s equilibrium beliefs are after having seen all of the arguments in favor. And so we might hope that this sort of equilibrium beliefs are going to be good.

There’s lots of work also that’s been done to try to improve the ability to access like the true equilibrium beliefs to really give the judge all possible information. There’s some work that was done by Paul Christiano and Beth Barnes about this cross-examination mechanism, which gives it significantly better performance competitiveness properties by allowing it to access a larger complexity class. But then the other question is then what can you do on top of AI safety via debate to help with inner alignment?

An interesting thing that you can do is you can have the models trained to use transparency tools on each other to win the debate. If one model is able to look inside the other model and demonstrate that it’s non-myopic or something, then it wins the debate. And so then you can incentivize the models to play this game of trying to look inside of each other and find the inner alignment problems, and then train away from those problems.

For outer alignment, you sort of really have this question of, are the debaters actually incentivized to have some honest, cooperative, helpful answers in the limit? And this is a question of what that equilibrium really looks like. Is it the case that humans are able to be tricked by really persuasive arguments and no counterargument will unpersuade them? Or is it the case that when humans are exposed to all of the best arguments and counterarguments, they come to the right conclusions? We don’t really know, but this is fundamentally the outer alignment question here. This is sort of a trickier equilibrium to analyze.

With the amplification case, we just had to figure out this concrete tree of humans, but with debate, we have to understand something more abstract: what is the theoretical equilibrium behavior of a human under exposure to all these different arguments? I think there’s a case that from an outer alignment perspective, debate is at least more questionable.

Then we have inner alignment, and the question here is, how successfully can we incentivize these models to inspect each other and be able to identify catastrophic behavior? There’s certainly some problems here. You have to get a good training signal to be able to produce models which are good at inspecting each other, which probably means you have to produce deceptive models and then have the models find the deception, or at least produce non-myopic models and then have the models find the non-myopia. But then this could also incentivize the models to hide the non-myopia, so it’s a little bit unclear exactly how this works. It’s sort of going to depend on precisely the properties of the training procedure.

I think there’s some good reasons to be hopeful here for training competitiveness, because the way that the debate game is structured, it lets you do this self-play structure where you have these agents competing on the zero-sum game in a self-play, which is in many ways, similar to a lot of big successes of things like AlphaZero, where AlphaZero is able to solve all these environments by doing this very successful self-play on these zero-sum games.

The biggest problem here is probably getting data, and then there’s the performance competitiveness question, which is how effective would an actual debate agent be? The hope is that in the limit, you should just be able to ask the first debater, what do you think? And then it’ll just choose the most convincing answer, and then you can just go with that. You don’t have to run the whole debate in deployment, but it’s only going to be able to solve these language problems.

What it’s going to give you is just the equilibrium of what a human thinks after all of these different arguments, and is that good enough? Is it the case that humans are going to really be able to come to good enough equilibria after they see all these arguments that they’re going to be able to produce really good answers? And also, is it the case that question answering alone is sufficient to be able to be competitive in potentially a very competitive marketplace?

A third proposal that I think is interesting to go into is something called microscope AI. Microscope AI I think is really interesting to look at because it’s very different from the other proposals that I was just talking about. It has a very different approach to thinking about how do we solve these sorts of problems. For all of these approaches, we need to have some amount of ability to look inside of our models and learn something about what the model knows. But when you use transparency tools to look inside of the model, it teaches you multiple things. It teaches you about the model. You learn about what the model has learned. But it also teaches you about the world, because the model learned a bunch of useful facts, and if you look inside the model and you can learn those facts yourself, then you become more informed. And so this process itself can be quite powerful.

That’s fundamentally the idea of microscope AI. The idea of microscope AI is to train a predictive model on the data you want to understand, and then use transparency tools to understand what that model learned about that data, and then use that understanding to guide human decision making. And so if you’re thinking about outer alignment, in some sense, this procedure is not really outer aligned because we’re just trying to predict some data. And so that’s not really an aligned objective. If you had a model that was just trying to do a whole bunch of prediction, it wouldn’t be doing good things for the world.

But the hope is that if you’re just training a predictive model, it’s not going to end up being deceptive or otherwise dangerous. And you can also use transparency tools to ensure that it doesn’t become that. We still have to solve inner alignment, like I was saying. It still has to be the case that you don’t produce deceptive models. And in fact, the goal here really is not to produce mesa optimizers at all. The goal is just to produce these predictive systems, which learn a bunch of useful facts and information, but that aren’t running optimization procedures. And hopefully we can do that by having this very simple, predictive objective, and then also by using transparency tools.

And then for training competitiveness, we know how to train powerful predictive models now. You know, something like GPT-2, and now GPT-3, these are predictive models trained on text prediction. And so we know this process, we know that we’re very good at it. And so hopefully we’ll be able to continue to be good at it into the future. The real sticking point with microscope AI is the performance competitiveness question. So is enhanced human understanding actually going to be sufficient to solve the use cases we might want for advanced AI? I don’t know. It’s really hard to know the answer to this question, but you can imagine some situations where it is and some situations where it isn’t.

So, for situations where you need to do long-term, careful decision making, it probably would be, right? If you want to replace CEOs or whatever, that’s a sort of very general decision making process that can be significantly improved just by having much better human understanding of what’s happening. You don’t necessarily need the AI to be making the decision. On the other hand, if you need fine-grained manipulation tasks or very, very quick response times, AIs managing a factory or something, then maybe this wouldn’t be sufficient because you would need the AIs to be doing all of this quick decision making and you couldn’t have it just giving information to a human.

One specific situation, which I think is important to think about also is the situation of using your first AI system to help build a second AI system, and making sure that second AI system is aligned and competitive. I think that it also performs pretty well there. You could use a microscope AI to get a bunch of information about the process of AIs and how they work and how training works, and then get a whole bunch of information about that. Have the humans learn that information, then use that information to improve our building of the next AIs and other AIs that we build.

There are certain situations where microscope AI is performance competitive, situations where it wouldn’t be performance competitive, but it’s a very interesting proposal because it’s sort of tackling it from a very different angle. It’s like, well, maybe we don’t really need to be building agents. Maybe we don’t really need to be doing this stuff. Maybe we can just be building this microscope AI. I should mention the microscope AI idea comes from Chris Olah, who works at OpenAI. The debate idea comes from Geoffrey Irving, who’s now at DeepMind, and the amplification comes from Paul Christiano, who’s at OpenAI.

Lucas Perry: Yeah, so for sure, the best place to review these is by reading your post. And again, the post is “An overview of 11 proposals for building safe advanced AI” by Evan Hubinger and that’s on the AI Alignment Forum.

Evan Hubinger: That’s right. I should also mention that a lot of the stuff that I talked about in this podcast is coming from the Risks from Learned Optimization in Advanced Machine Learning Systems paper.

Lucas Perry: All right. Wrapping up here, I’m interested in ending on a broader note. I’m just curious to know if you have concluding thoughts about AI alignment, how optimistic are you that humanity will succeed in building aligned AI systems? Do you have a public timeline that you’re willing to share about AGI? How are you feeling about the existential prospects of earth-originating life?

Evan Hubinger: That’s a big question. So I tend to be on the pessimistic side. My current view looking out on the field of AI and the field of AI safety, I think there’s a lot of really challenging, difficult problems that we are at least not currently equipped to solve. And it seems quite likely that we won’t be equipped to solve by the time we need to solve them. I tend to think that the prospects for humanity aren’t looking great right now, but I nevertheless have a very sort of optimistic disposition, we’re going to do the best that we can. We’re going to try to solve these problems as effectively as we possibly can and we’re going to work on it and hopefully we’ll be able to make it happen.

In terms of timelines, it’s such a complex question. I don’t know if I’m willing to commit to some timeline publicly. I think that it’s just one of those things that is so uncertain. It’s just so important for us to think about what we can do across different possible timelines and be focusing on things which are generally effective regardless of how it turns out, because I think we’re really just quite uncertain. It could be as soon as five years or as long away as 50 years or 70 years, we really don’t know.

I don’t know if we have great track records of prediction in this setting. Regardless of when AI comes, we need to be working to solve these problems and to get more information on these problems, to get to the point we understand them and can address them because when it does get to the point where we’re able to build these really powerful systems, we need to be ready.

Lucas Perry: So you do take very short timelines, like say 5 to 10 to 15 years very seriously.

Evan Hubinger: I do take very short timelines very seriously. I think that if you look at the field of AI right now, there are these massive organizations, OpenAI and DeepMind that are dedicated to the goal of producing AGI. They’re putting huge amounts of research effort into it. And I think it’s incorrect to just assume that they’re going to fail. I think that we have to consider the possibility that they succeed and that they do so quite soon. A lot of the top people at these organizations have very short timelines, and so I think that it’s important to take that claim seriously and to think about what happens if it’s true.

I wouldn’t bet on it. There’s a lot of analysis that seems to indicate that at the very least, we’re going to need more compute than we have in that sort of a timeframe, but timeline prediction tasks are so difficult that it’s important to consider all of these different possibilities. I think that, yes, I take the short timelines very seriously, but it’s not the primary scenario. I think that I also take long timeline scenarios quite seriously.

Lucas Perry: Would you consider DeepMind and OpenAI to be explicitly trying to create AGI? OpenAI, yes, right?

Evan Hubinger: Yeah. OpenAI, it’s just part of the mission statement. DeepMind, some of the top people at DeepMind have talked about this, but it’s not something that you would find on the website the way you would with OpenAI. If you look at historically some of the things that Shane Legg and Demis Hassabis have said, a lot of it is about AGI.

Lucas Perry: Yeah. So in terms of these being the leaders with just massive budgets and person power, how do you see the quality and degree of alignment and beneficial AI thinking and mindset within these organizations? Because there seems to be a big distinction between the AI alignment crowd and the mainstream machine learning crowd. A lot of the mainstream ML community hasn’t been exposed to many of the arguments or thinking within the safety and alignment crowd. Stuart Russell has been trying hard to shift away from the standard model and incorporate a lot of these new alignment considerations. So yeah. What do you think?

Evan Hubinger: I think this is a problem that is getting a lot better. Like you were mentioning, Stuart Russell has been really great on this. CHAI has been very effective at trying to really get this message of, we’re building AI, we should put some effort into making sure we’re building safe AI. I think this is working. If you look at a lot of the major ML conferences recently, I think basically all of them had workshops on beneficial AI. DeepMind has a safety team with lots of really good people. OpenAI has a safety team with lots of really good people.

I think that the standard story is, oh, AI safety is just this thing that these people who aren’t involved in machine learning think about. It’s something which really, in the current world, has become much more integrated with machine learning and is becoming more mainstream. But it’s definitely still a process. Like Stuart Russell says, the field of AI has been very focused on the sort of standard model, and trying to move people away from that and think about some of the consequences of it takes time and it takes some sort of evolution of a field, but it is happening. I think we’re moving in a good direction.

Lucas Perry: All right, well, Evan, I’ve really enjoyed this. I appreciate you explaining all of this and taking the time to unpack a lot of this machine learning language and concepts to make it digestible. Is there anything else here that you’d like to wrap up on or any concluding thoughts?

Evan Hubinger: If you want more detailed information on all of the things that I’ve talked about, the full analysis of inner alignment and outer alignment is in Risks from Learned Optimization in Advanced Machine Learning Systems by me and my co-authors, as well as in the “An overview of 11 proposals” post, which you can find on the AI Alignment Forum. I think both of those are resources which I would recommend checking out for understanding more about what I talked about in this podcast.

Lucas Perry: Do you have any social media or a website or anywhere else for us to point towards?

Evan Hubinger: Yeah, so you can find me on all the different sorts of social media platforms. I’m fairly active on GitHub. I do a bunch of open source development. You can find me on LinkedIn, Twitter, Facebook, all those various different platforms. I’m fairly Google-able. It’s nice to have a fairly unique last name. So if you Google me, you should find all of this information.

One other thing, which I should mention specifically, everything that I do is all public. All of my writing is public. I try to publish all of my work and I do so on the AI Alignment Forum. So the AI Alignment Forum is a really, really great resource because it’s a collection of writing by all of these different AI safety authors. It’s open to anybody who’s a current AI safety researcher, and you can find me on the AI Alignment Forum as evhub, I’m E-V-H-U-B on the AI Alignment Forum.

Lucas Perry: All right, Evan, thanks so much for coming on today, and it’s been quite enjoyable. This has probably been one of the more fun AI alignment podcasts that I’ve had in a while. So thanks a bunch and I appreciate it.

Evan Hubinger: Absolutely. That’s super great to hear. I’m glad that you enjoyed it. Hopefully everybody else does as well.

End of recorded material

Sam Barker and David Pearce on Art, Paradise Engineering, and Existential Hope (With Guest Mix)

Sam Barker, a Berlin-based music producer, and David Pearce, philosopher and author of The Hedonistic Imperative, join us on a special episode of the FLI Podcast to spread some existential hope. Sam is the author of euphoric sound landscapes inspired by the writings of David Pearce, largely exemplified in his latest album — aptly named “Utility.” Sam’s artistic excellence, motivated by blissful visions of the future, and David’s philosophical and technological writings on the potential for the biological domestication of heaven are a perfect match made for the fusion of artistic, moral, and intellectual excellence. This podcast explores what significance Sam found in David’s work, how it informed his music production, and Sam and David’s optimistic visions of the future; it also features a guest mix by Sam and plenty of musical content.

Topics discussed in this episode include:

  • The relationship between Sam’s music and David’s writing
  • Existential hope
  • Ideas from the Hedonistic Imperative
  • Sam’s albums
  • The future of art and music

Where to follow Sam Barker:


Where to follow Sam’s label, Ostgut Ton: 




0:00 Intro

5:40 The inspiration around Sam’s music

17:38 Barker – Maximum Utility

20:03 David and Sam on their work

23:45 Do any of the tracks evoke specific visions or hopes?

24:40 Barker – Die-Hards Of The Darwinian Order

28:15 Barker – Paradise Engineering

31:20 Barker – Hedonic Treadmill

33:05 The future and evolution of art

54:03 David on how good the future can be

58:36 Guest mix by Barker



Delta Rain Dance – 1

John Beltran – A Different Dream

Rrose – Horizon

Alexandroid – lvpt3

Datassette – Drizzle Fort

Conrad Sprenger – Opening

JakoJako – Wavetable#1

Barker & David Goldberg – #3

Barker & Baumecker – Organik (Intro)

Anthony Linell – Fractal Vision

Ametsub – Skydroppin’

Ladyfish\Mewark – Comfortable

JakoJako & Barker – [unreleased]


This podcast is possible because of the support of listeners like you. If you found this conversation to be meaningful or valuable, consider supporting it directly by donating. Contributions like yours make these conversations possible.

All of our podcasts are also now on Spotify and iHeartRadio! Or find us on SoundCloud, iTunes, Google Play and Stitcher.

You can listen to the podcast above or read the transcript below. 

David Pearce: I would encourage people to conjure up their vision of paradise, and the future can potentially be like that, only much, much better.

Lucas Perry: Welcome to the Future of Life Institute Podcast. I’m Lucas Perry. Today we have a particularly unique episode with Berlin-based DJ and producer Sam Barker as well as with David Pearce, and right now, you’re listening to Sam’s track Paradise Engineering on his album Utility. We focus centrally on the FLI Podcast on existential risk. The other side of existential risk is existential hope. This hope reflects all of our dreams, aspirations, and wishes for a better future. For me, this means a future where we’re able to create material abundance, eliminate global poverty, end factory farming and address animal suffering, evolve our social and political systems to bring greater wellbeing to everyone, and more optimistically, create powerful aligned artificial intelligence that can bring about the end of involuntary suffering, and help us to idealize the quality of our minds and ethics. If we don’t go extinct, we have plenty of time to figure these things out and that brings me a lot of joy and optimism. Whatever future seems most appealing to you, these visions are a key component to why mitigating existential risk is so important. So, in the context of COVID-19, we’d like to revitalize existential hope and this podcast is aimed at doing that.

As a part of this podcast, Sam was kind enough to create a guest mix for us. You can find that after the interview portion of this podcast and can find where it starts by checking the timestamps. I’ll also release the mix separately a few days after this podcast goes live. Some of my favorite tracks of Sam’s not highlighted in this podcast are Look How Hard I’ve Tried, and Neuron Collider. If you enjoy Sam’s work and music featured here, you can support or follow him at the links in the description. He has a Bandcamp shop where you can purchase his albums. I grabbed a vinyl copy of his album Debiasing from there. 

As for a little bit of background on this podcast, Sam Barker, who produces electronic music under the name Barker, has albums with titles such as “Debiasing” and “Utility.” I was recommended to listen to these, and discovered his album “Utility” is centrally inspired by David Pearce’s work, specifically The Hedonistic Imperative. Utility has track titles like Paradise Engineering, Experience Machines, Gradients Of Bliss, Hedonic Treadmill, and Wireheading. So, being a big fan of Sam’s music production and David’s philosophy and writing, I wanted to bring them together to explore the theme of existential hope and Sam’s inspiration for his albums and how David fits into all of it.

Many of you will already be familiar with David Pearce. He is a friend of this podcast and a multiple time guest. David is a co-founder of the World Transhumanist Association, rebranded Humanity+, and is a prominent figure within the transhumanism movement in general. You might know him from his work on the Hedonistic Imperative, a book which explores our moral obligation to work towards the abolition of suffering in all sentient life through technological intervention.

Finally, I want to highlight the 80,000 Hours Podcast with Rob Wiblin. If you like the content on this show, I think you’ll really enjoy the topics and guests on Rob’s podcast. His is also motivated by and contextualized in an effective altruism framework and covers a broad range of topics related to the world’s most pressing issues and what we can do about them. If that sounds of interest to you, I suggest checking out episode #71 with Ben Todd on the ideas of 80,000 Hours, and episode #72 with Toby Ord on existential risk. 

And with that, here’s my conversation with Dave and Sam, as well as Sam’s guest mix.

Lucas Perry: For this first section, I’m basically interested in probing the releases that you already have done, Sam, and exploring them and your inspiration for the track titles and the soundscapes that you’ve produced. Some of the background and context for this is that much of this seems to be inspired by and related to David’s work, in particular the Hedonistic Imperative. I’m at first curious to know, Sam, how did you encounter David’s work, and what does it mean for you?

Sam Barker: David’s work was sort of arriving in the middle of a sort of a series of realizations, and kind of coming from a starting point of being quite disillusioned with music, and a little bit disenchanted with the vagueness, and the terminology, and the imprecision of the whole thing. I think part of me has always wanted to be some kind of scientist, but I’ve ended up at perhaps not the opposite end, but quite far away from it.

Lucas Perry: Could you explain what you mean by vagueness and imprecision?

Sam Barker: I suppose the classical idea of what making music is about is a lot to do with the sort of western idea of individualism and about self expression. I don’t know. There’s this romantic idea of artists having these frenzied creative bursts that give birth to the wonderful things, that it’s some kind of struggle. I just was feeling super disillusioned with all of that. Around that time, 2014 or 15, I was also reading a lot about social media, reading about behavioral science, trying to figure what was going on in this arena and how people are being pushed in different directions by this algorithmic system of information distribution. That kind of got me into this sort of behavioral science side of things, like the addictive part of the variable-ratio reward schedule with likes. It’s a free dopamine dispenser kind of thing. This was kind of getting me into reading about behavioral science and cognitive science. It was giving me a lot of clarity, but not much more sort of inspiration. It was basically like music.

Dance music especially is a sort of complex behavioral science. You do this and people do that. It’s all deeply ingrained. I sort of imagine the DJ as a sort of Skinner box operator pulling puppet strings and making people behave in different ways. Music producers are kind of designing clever programs using punishment and reward or suspense and release, and controlling people’s behavior. The whole thing felt super pushy and not a very inspiring conclusion. Looking at the problem from a cognitive science point of view is just the framework that helped me to understand what the problem was in the first place, so this kind of problem of being manipulative. Behavioral science is kind of saying what we can make people do. Cognitive psychology is sort of figuring out why people do that. That was my entry point into cognitive psychology, and that was kind of the basis for Debiasing.

There’s always been sort of a parallel for me between what I make and my state of mind. When I’m in a more positive state, I tend to make things I’m happier with, and so on. Getting to the bottom of what tricks were, I suppose, with dance music. I kind of understood implicitly, but I just wanted to figure out why things worked. I sort of came to the conclusion it was to do with a collection of biases we have, like the confirmation bias, and the illusion of truth effect, and the mere exposure effect. These things are like the guardians of four four supremacy. Dance music can be pretty repetitive, and we describe it sometimes in really aggressive terminology. It’s a psychological kind of interaction.

Cognitive psychology was leading me to Kaplan’s law of the instrument. The law of the instrument says that if you give a small boy a hammer, he’ll find that everything he encounters requires pounding. I thought that was a good metaphor. The idea is that we get so used to using tools in a certain way that we lose sight of what it is we’re trying to do. We act in the way that the tool instructs us to do. I thought, what if you take away the hammer? That became a metaphor for me, in a sense, that David clarified in terms of pain reduction. We sort of put these painful elements into music in a way to give this kind of hedonic contrast, but we don’t really consider that that might not be necessary. What happens when we abolish these sort of negative elements? Are the results somehow released from this process? That was sort of the point, up until discovering the Hedonistic Imperative.

I think what I was needing at the time was a sort of framework, so I had the idea that music was decision making. To improve the results, you have to ask better questions, make better decisions. You can make some progress looking at the mechanics of that from a psychology point of view. What I was sort of lacking was a purpose to frame my decisions around. I sort of had the idea that music was a sort of a valence carrier, if you like, and that it could be tooled towards a sort of a greater purpose than just making people dance, which was for Debiasing the goal, really. It was to make people dance, but don’t use the sort of deeply ingrained cues that people are used to, and see if that works.

What was interesting was how broadly it was accepted, this first EP. There was all kinds of DJs playing it in techno, ambient, electro, all sorts of different styles. It reached a lot of people. It was as if taking out the most functional element made it more functional and more broadly appealing. That was the entry point to utilitarianism. There was sort of an accidentally utilitarian act, in a way, to sort of try and maximize the pleasure and minimize the pain. I suppose after landing in utilitarianism and searching for some kind of a framework for a sense of purpose in my work, the Hedonistic Imperative was probably the most radical, optimistic take on the system. Firstly, it put me in a sort of mindset where it granted permission to explore sort of utopian ideals, because I think the idea of pleasure is a little bit frowned upon in the art world. I think the art world turns its nose up at such direct cause and effect. The idea that producers could sort of be paradise engineers of sorts, so the precursors to paradise engineers, that we almost certainly would have a role in a kind of sensory utopia of the future.

There was this kind of permission granted. You can be optimistic. You can enter into your work with good intentions. It’s okay to see music as a tool to increase overall wellbeing, in a way. That was kind of the guiding idea for my work in the studio. I’m trying, these days, to put more things into the system to make decisions in a more conscious way, at least where it’s appropriate to. This sort of notion of reducing pain and increasing pleasure was the sort of question I would ask at any stage of decision making. Did this thing that I did serve those ends? If not, take a step back and try a different approach.

There’s something else to be said about the way you sort of explore this utopian world without really being bogged down. You handle the objections in such a confident way. I called it a zero gravity world of ideas. I wanted to bring that zero gravity feeling to my work, and to see that technology can solve any problem in this sphere. Anything’s possible. All the obstacles are just imagined, because we fabricate these worlds ourselves. These are things that were really instructive for me, as an artist.

Lucas Perry: That’s quite an interesting journey. From the lens of understanding cognitive psychology and human biases, was it that you were seeing those biases in dance music itself? If so, what were those biases in particular?

Sam Barker: On both sides, on the way it’s produced and in the way it’s received. There’s sort of an unspoken acceptance. You’re playing a set and you take a kick drum out. That signals to people to perhaps be alert. The lighting engineer, they’ll maybe raise the lights a little bit, and everybody knows that the music is going into sort of a breakdown, which is going to end in some sort of climax. Then, at that point, the kick drum comes back in. We all know this pattern. It’s really difficult to understand why that works without referring to things like cognitive psychology or behavioral science.

Lucas Perry: What does the act of debiasing the reception and production of music look like and do to the music and its reception?

Sam Barker: The first part that I could control was what I put into it. The experiment was whether a debiased piece of dance music could perform the same functionality, or whether it really relies on these deeply ingrained cues. Without wanting to sort of pat myself on the back, it kind of succeeded in its purpose. It was sort of proof that this was a worthy concept.

Lucas Perry: You used the phrase, earlier, four four. For people who are not into dance music, that just means a kick on each beat, which is ubiquitous in much of house and techno music. You’ve removed that, for example, in your album Debiasing. What are other things that you changed from your end, in the production of Debiasing, to debias the music from normal dance music structure?

Sam Barker: It was informing the structure of what I was doing so much that I wasn’t so much on a grid where you have predictable things happening. It’s a very highly formulaic and structured thing, and that all keys into the expectation and this confirmation bias that people, I think, get some kind of kick from when the predictable happens. They say, yep. There you go. I knew that was going to happen. That’s a little dopamine rush, but I think it’s sort of a cheap trick. I guess I was trying to get the tricks out of it, in a way, so figuring out what they were, and trying to reduce or eliminate them was the process for Debiasing.

Lucas Perry: That’s quite interesting and meaningful, I think. Let’s just take trap music. I know exactly how trap music is going to go. It has this buildup and drop structure. It’s basically universal across all dance music. Progressive house in the 2010s was also exactly like this. What else? Dubstep, of course, same exact structure. Everything is totally predictable. I feel like I know exactly what’s going to happen, having listened to electronic music for over a decade.

Sam Barker: It works, I think. It’s a tried and tested formula, and it does the job, but when you’re trying to imagine states beyond just getting a little kick from knowing what was going to happen, that’s the place that I was trying to get to, really.

Lucas Perry: After the release of Debiasing in 2018, which was a successful attempt at serving this goal and mission, you then discovered the Hedonistic Imperative by David Pearce, and kind of leaned into consequentialism, it seems. Then, in 2019, you had two releases. You had BARKER 001 and you had Utility. Now, Utility is the album which most explicitly adopts David Pearce’s work, specifically in the Hedonistic Imperative. You mentioned electronic dance producers and artists in general can be sort of the first wave of, or can perhaps assist in paradise engineering, insofar as that will be possible in the near to short-term future, given advancements in technology. Is that sort of the explicit motivation and framing around those two releases of BARKER 001 and Utility?

Sam Barker: BARKER 001 was a few tracks that were taken out of the running for the album, because they didn’t sort of fit the concept. Really, I knew the last track was kind of alluding to the album. Otherwise, it was perhaps not sort of thematically linked. Hopefully, if people are interested in looking more into what’s behind the music, you can lead people into topics with the concept. With Utility, I didn’t want to just keep exploring cognitive biases and unpicking dance music structurally. It’s sort of a paradox, because I guess the Hedonistic Imperative argues that pleasure can exist without purpose, but I really was striving for some kind of purpose with the pleasure that I was getting from music. That sort of emerged from reading the Hedonistic Imperative, really, that you can apply music to this problem of raising the general level of happiness up a notch. I did sort of worry that by trying to please, it wouldn’t work, that it would be something that’s too sickly sweet. I mean, I’m pretty turned off by pop music, and there was this sort of risk that it would end up somewhere like that. That’s it, really. Just looking for a higher purpose with my work in music.

Lucas Perry: David, do you have any reactions?

David Pearce: Well, when I encountered Utility, yes, I was thrilled. As you know, essentially I’m a writer writing in quite heavy sub-academic prose. Sam’s work, I felt, helps give people a glimpse of our glorious future, paradise engineering. As you know, the reviews were extremely favorable. I’m not an expert critic or anything like that. I was just essentially happy and thrilled at the thought. It deserves to be mainstream. It’s really difficult, I think, to actually evoke the glorious future we are talking about. I mean, I can write prose, but in some sense music can evoke paradise better, at least for many people, than prose.

Sam Barker: I think it’s something you can appreciate without cognitive effort which, your prose, at least you need to be able to read. It’s a bit more of a passive way of receiving, music, which I think is an intrinsic advantage it has. That’s actually really a relief to hear, because there was just a small fear in my mind that I was grabbing these concepts with clumsy hands and discrediting them.

David Pearce: Not at all.

Sam Barker: It all came from a place of sincere appreciation for this sort of world that you are trying to entice people with. When I’ve tried to put into words what it was that was so inspiring, I think it’s that there was also a sort of very practical, kind of making lots of notes. I’ve got lots of amazing one liners. Will we ever leave the biological dark ages or the biological domestication of heaven? There was just so many things that conjure up such vividly, heavenly sensations. It sort of brings me back to the fuzziness of art and inspiration, but I hope I’ve tried to adopt the same spirit of optimism that you approached the Hedonistic Imperative with. I actually don’t know what state of mind your approach was at the time, even, but it must’ve come in a bout of extreme hopefulness.

David Pearce: Yes, actually. I started taking Selegiline, and six weeks later I wrote the Hedonistic Imperative. It just gave me just enough optimism to embark on. I mean, I have, fundamentally, a very dark view of Darwinian life, but for mainly technical reasons I think the future is going to be superhumanly glorious. How do you evoke this for our dark, Darwinian minds?

Sam Barker: Yeah. How do we get people excited about it? I think you did a great job.

David Pearce: It deserves to go mainstream, really, the core idea. I mean, forget the details, the neurobabble of genetics. Yeah, of course it’s incredibly important, but this vision of just how sublimely wonderful life could be. How do we achieve full spectrum, multimedia dominance? I mean, I can write it.

Lucas Perry: Sounds like you guys need to team up.

Sam Barker: It’s very primitive. I’m excited where it could head, definitely.

Lucas Perry: All right. I really like this idea about music showing how good the future can be. I think that many of the ways that people can understand how good the future can be comes from the best experiences they’ve had in their life. Now, that’s just a physical state of your brain. If something isn’t physically impossible, then the only barrier to achieving and realizing that thing is knowledge. Take all the best experiences in your life. If we could just understand computation, and biology in the brain, and consciousness well enough. It doesn’t seem like there’s any real limits to how good and beautiful things can get. Do any of the tracks that you’ve done evoke very specific visions, dreams, desires, or hopes?

Sam Barker: I would be sort of hesitant to make direct links between tracks and particular mindsets, because when I’m sitting down to make music, I’m not really thinking about any one particular thing. Rather, I’m trying to look past things and look more about what sort of mood I want to put into the work. Any of the tracks on the record, perhaps, could’ve been called paradise engineering, is what I’m saying. The names from the tracks are sort of a collection of the ideas that were feeding the overall process. The application of the names was kind of retroactive connection making. That’s probably a disappointment to some people, but the meaning of all of the track names is in the whole of the record. I think the last track on the record, Die-Hards of the Darwinian Order, that was a phrase that you used, David, to describe people clinging to the need for pain in life to experience pleasure.

David Pearce: Yes.

Sam Barker: That track was not made for the record. It was made some time ago, and it was just a technical experiment to see if I could kind of recreate a realistic sounding band with my synthesizers. The label manager, Alex, was really keen to have this on the record. I was kind of like, well, it doesn’t fit conceptually. It has a kick drum. It’s this kind of somber mood, and the rest of the record is really uplifting, or trying to be. Alex was saying he liked the contrast to the positivity of the rest of the album. He felt like it needed this dose of realism or something.

David Pearce: That makes sense, yes.

Sam Barker: I sort of conceded in the end. We called it Die-Hards of the Darwinian Order, because that was what I felt like he was.

David Pearce: Have you told him this?

Sam Barker: I told him. He definitely took the criticism. As I said, it’s the actual joining up of these ideas that I make notes on. The tracks themselves, in the end, had to be done in a creative way sort of retroactively. That doesn’t mean to say that all of these concepts were not crucial to the process of making the record. When you’re starting a project, you call it something like new track, happy two, mix one, or something. Then, eventually, the sort of meaning emerges from the end result, in a way.

Lucas Perry: It’s just like what I’ve heard from authors of best selling books. They say you have no idea what the book is going to be called until the end.

Sam Barker: Right, yeah.

David Pearce: One of the reasons I think it’s so important to stress life based on gradients of bliss ratcheting up hedonic set points is that, instead of me or anyone else trying to impose their distinctive vision of paradise, it just allows, with complications, everyone to keep most of their existing values and preferences, but just ratchets up hedonic tone and hedonic range. I mean, this is the problem with so many traditional paradises. They involve the imposition of someone else’s values and preferences on you. I’m being overly cerebral about it now, but I think my favorite track on the album is the first. I would encourage people to conjure up their vision of paradise and the future can potentially be like that and be much, much better.

Sam Barker: This, I think, relates to the sort of pushiness that I was feeling at odds with. The music does take people to these kind of euphoric states, sometimes chemically underwritten, but it’s being done in a dogmatic and singular way. There’s not much room for personal interpretation. It’s sort of everybody’s experiencing one thing, which I think there’s something in these kind of communal experiences that I’m going to hopefully understand one day.

Lucas Perry: All right. I think some of my favorite tracks are Look How Hard I’ve Tried on Debiasing. I also really like Maximum Utility and Neuron Collider. I mean, all of it is quite good and palatable.

Sam Barker: Thank you. The ones that you said are some of my personal favorites. It’s also funny how some of the least favorite tracks, or not least favorite, but the ones that I felt like didn’t really do what they set out to do, were other people’s favorites. Hedonic Treadmill, for example. I’d put that on the pile of didn’t work, but people are always playing it, too, finding things in it that I didn’t intentionally put there. Really, that track felt to me like stuck on the hedonic treadmill, and not sort of managing to push the speed up, or push the level up. This is, I suppose, the problem with art, that there isn’t a universal pleasure sense, that there isn’t a one size fits all way to these higher states.

David Pearce: You correctly called it the hedonic treadmill. Some people say the hedonistic treadmill. Even one professor I know calls it the hedonistic treadmill.

Lucas Perry: I want to get on that thing.

David Pearce: I wouldn’t mind spending all day on a hedonistic treadmill.

Sam Barker: That’s my kind of exercise, for sure.

Lucas Perry: All right, so let’s pivot here into section two of our conversation, then. For this section, I’d just like to focus on the future, in particular, and exploring the state of dance music culture, how it should evolve, and how science and technology, along with art and music, can evolve into the future. This question comes from you in particular, Sam, addressed to Dave. I think you were curious about his experiences in life and if he’s ever lost himself on a dance floor or has any special music or records that put him in a state of bliss?

Sam Barker: Very curious.

David Pearce: My musical autobiography. Well, one of my earliest memories is of a wind up gramophone. I’m showing my age here. Apparently, as a five year old child, I used to sing on the buses. Daisy, Daisy, give me your answer, do. I’m so crazy over love of you. Then, graduating via the military brass bands I apparently used to enjoy as a small child, to pop music. Essentially, for me, very, very unanswerable about music. I like to use it as a backdrop, you know. At its best, there’s this tingle up one’s spine one gets, but it doesn’t happen very often. The only thing I would say is that it’s really important for me that music should be happy. I know some people get into sad music. I know it’s complicated. Music, for me, has to elicit something that’s purely good.

Sam Barker: I definitely have no problem with exploring the sort of darker side of human nature, but I also have come to the realization that there’s better ways to explore the dark sides than aesthetic stimulation through, perhaps, words and ideas. Aesthetics is really at its optimum function when it’s working towards more positive goals of happiness and joy, and these sort of swear words in the art world.

Lucas Perry: Dave, you’re not trying to hide your rave warehouse days from us, are you?

David Pearce: Well, yeah. Let’s just say I might not have been entirely drug naïve with friends. Let’s just say I was high on life or something, but it’s a long time since I have explored that scene. Part of me still misses it. When it comes to anything in the art world, just as I think visual art should be beautiful. Which, I mean, not all serious artists would agree.

Sam Barker: I think the whole notion is just people find it repulsive somehow, especially in the art world. Somebody that painted a picture and then the description reads I just wanted it to be pretty is getting thrown out the gallery. What greater purpose could it really take on?

David Pearce: Yeah.

Lucas Perry: Maybe there’s some feeling of insecurity, and a feeling and a need to justify the work as having meaning beyond the sensual or something. Then there may also be this fact contributing to it. Seeking happiness and sensual pleasure directly, in and of itself, is often counterproductive towards that goal. Seeking wellbeing and happiness directly usually subverts that mission, and I guess that’s just a curse of Darwinian life. Perhaps those, I’m just speculating here, contribute to this cultural distaste, as you were pointing out, to enjoy pleasure as the goals of art.

Sam Barker: Yeah, we’re sort of intellectually allergic to these kinds of ideas, I think. They just seem sort of really shallow and superficial. I suppose that was kind of my existential fear before the album came out, that the idea that I was just trying to make people happy would just be seen as this shallow thing, which I don’t see it as, but I think the sentiment is quite strong in the art world.

Lucas Perry: If that’s quite shallow, then I guess those people are also going to have problems with the Buddha and people like that. I wouldn’t worry about it too much. I think you’re on the same intentional ground as the Buddha. Moving a little bit along here. Do you guys have thoughts or opinions on the future of aesthetics, art, music, and joy, and how science and technology can contribute to that?

David Pearce: Oh, good heavens. One possibility will be that, as neuroscience advances, it’ll be possible to isolate the molecular experience of visual beauty, musical bliss, spiritual excellence, and scientifically amplify them so that one can essentially enjoy musical experiences that are orders of magnitude richer than anything that’s even physiologically feasible today. I mean, I can use all this fancy language, but what actually this will involve, in terms of true trans-human and post-human artists. The gradients of bliss is important here, in such that I think we will retain information sensitive gradients, so we don’t lose critical sharpness, discernment, critical appreciation. Nonetheless, this base point for aesthetic excellence. All experience can be superhumanly beautiful. I mean, I religiously star my music collection from one to five, but what would a six be like? What would 100 be like?

Sam Barker: I like these questions. I guess the role of the artist in the long term future in creating these kinds of states maybe gets pushed out at some point by people who are in the labs and reprogram the way music is, or the way that any sort of sensory experience is received. I wonder whether there’s a place in techno utopia for music made by humans, or whether artists sort of just become redundant in some way. I’m not going to get offended if the answer is bye, bye.

Lucas Perry: I’d be interested in just making a few points about the evolutionary perspective before we get into the future of ape artists or mammalian artists. It just seems like some kind of happy cosmic accident that, for the vibration of air, human beings have developed a sensory appreciation of information and structure embedded in that medium. I think we’re quite lucky, as a species, that music and musical appreciation is embedded in the software of human genetics, as such that we can appreciate, and create, and share musical moments. Now, with genetic engineering and more ambitious paradise engineering, I think it would be beautiful to expand the modalities for which artistic, or aesthetic, or the appreciation of beauty can be experienced.

Music is one clear way of having aesthetic appreciation and joy. Visual art is another one. People do derive a lot of satisfaction from touch. Perhaps that could be more information structured in the ways that music and art are. There might be a way of changing what it means to be an intelligent thing, such that there can be just an expansion of art appreciation across all of our sensory modalities, and even into sensory modalities which don’t exist yet.

David Pearce: The nature of trans-human and post-human art just leaves me floundering.

Lucas Perry: Yeah. It seems useful here just to reflect on how happy of an accident art is. As we begin to evolve, we can get into, say, A.I. here. A.I. and machine learning is likely to be able to have very, very good models of, say, our musical preferences within the next few years. I mean, they’re somewhat already very good at it. They’ll continue to get better. Then, we have fairly rudimentary algorithms which can produce music. If we just extrapolate out into the future, eventually artificially intelligent systems will be able to produce music better than any human. In that world, what is the role of the human artist? I guess I’m not sure.

Sam Barker: I’m also completely not sure, but I feel like it’s probably going to happen in my lifetime, that these technologies get to a point that they actually do serve the purpose. At the moment, there is A.I. software that can create unique compositions, but it does so by looking at an archive of music. With AIVA, it’s Bach, and Beethoven, and Mozart. Then it reinterprets all of the codes that are embedded in that, and uses that to make new stuff. It sounds just like a composer quoting, and it’s convincing. Considering this is going to get better and better, I’m pretty confident that we’ll have a system that will be able to create music to a person’s specific taste, having not experienced music, that would say look at my music library, and then start making things that I might like. I can’t say how I feel about that.

Let’s say if it worked, and it did actually surprise me, and I was feeling like humans can’t make this kind of sensation in me. This is a level above. In a way, yeah, somebody that doesn’t like the vagueness of the creative process, this really appeals, somehow. The way that things are used, and the way that our attention is sort of a resource that gets manipulated, I don’t know whether we have an incredible technology, once again, in the wrong hands. It’s just going to be turned into a mind control. These kind of things would be put to use for nefarious purposes. I don’t fear the technology. I fear what we, in our unmodified state, might do with it.

David Pearce: Yes. I wonder when the last professional musician will retire, having been eclipsed by A.I. I mean, in some sense, we are, I think, stepping stones to something better. I don’t know when the last philosophers will be pensioned off. Hard problem of mind solved, announced in Nature, Nobel Prize beckons. Distinguished philosophers of mind announce their intention to retire. Hard to imagine, but one does suppose that A.I. will be creating work of ever greater excellence tailored to the individual. I think the evolutionary roots of aesthetic appreciation are very, very deep. It kind of does sound very disrespectful to artists, saying that A.I. could replace artists, but mathematicians and scientists are probably going to be-

Lucas Perry: Everyone’s getting replaced.

Sam Barker: It’s maybe a similar step to when the camera was threatening portrait painters’ line of work. You can press a button and, in an instant, do what would’ve taken several days. I sort of am cautiously looking forward to more intelligent assistance in the production of music. If we did live in a world where there weren’t any struggles to express, or any wrongs to right, any flaws in our character to unpick, then I would struggle to find anything other than the sort of basic pleasure of the action of making music. I wouldn’t really feel any reason to share what I made, in a sense. I think there’s a sort of moral, social purpose that’s embedded within music, if you want to grasp it. I think, if A.I. is implemented with that same moral, ethical purpose, then, in a way, we should treat it as any other task that comes to be automated or extremely simplified. In some way, we should sort of embrace the relaxation of our workload, in a way.

There’s nothing to say that we couldn’t just continue to make music if it brought us pleasure. I think distinguishing between these two things of making music and sharing it was an important discovery for me. The process of making a piece of music, if it was entirely pleasurable, but then you treat the experience like it was a failure because it didn’t reach enough people, or you didn’t get the response or the boost to your ego that you were searching from it, then it’s your remembering self overriding your experiencing self, in a way, or your expectations getting in the way of your enjoyment of the process. If there was no purpose to it anymore, I might still make it for my own pleasure, but I like to think I would be happy that a world that didn’t require music was already a better place. I like to think that I wouldn’t be upset with my redundancy with my P45 from David Pearce.

David Pearce: Oh, no. With a neuro chip, you see, your creative capacities could be massively augmented. You’d have narrow super intelligence on a chip. Now, in one sense, I don’t think classical digital computers are going to wake up and become conscious. They’re never actually going to be able to experience music or art or anything like this. In that sense, they will remain tools, but tools that one can actually incorporate within oneself, so that they become part of you.

Lucas Perry: A friendly flag there that many people who have been on this podcast disagree with that point. Yeah, fair enough, David. I mean, it seems that there are maybe three options. One is, as you mentioned, Sam, to find joy and beauty in more things, and to sort of let go of the need for meaning and joy to come from not being something that is redundant. Once human beings are made obsolete or redundant, it’s quite sad for us, because we derive much of our meaning, thanks a lot, evolution, from accomplishing things and being relevant. The two paths here seem like reaching some kind of spiritual evolution such that we’re okay with being redundant, or being okay with passing away as a species and allowing our descendants to proliferate. The last one would be to change what it means to be human, such that by merging or by evolution we somehow remain relevant to the progress of civilization. I don’t know which one it will be, but we’ll see.

David Pearce: I think the exciting one, for me, is where we can harness the advances in technology in a conscious way to positive ends, to greater net wellbeing in society. Maybe I’m hooked on the old ideals, but I do think a sense of purpose in your pleasure elevates the sensation somewhat.

Lucas Perry: I think human brains on MDMA would disagree with that.

Sam Barker: Yeah. You’ve obviously also reflected on an experience like that after the event, and come to the conclusion that there wasn’t, perhaps, much concrete meaning to your experience, but it was joyful, and real, and vivid. You don’t want to focus too much on the fact that it was mostly just you jumping up and down on a dance floor. I’m definitely familiar with the pleasure of essentially meaningless euphoria. I’ll say, at the very least, it’s interesting to think about. Reading a lot about the nature of happiness and the general consensus there being that happiness is sort of a balance of pleasure and purpose. The idea that maybe you don’t need the purpose is worth exploring, I think, at least.

David Pearce: We do have this term empty hedonism. One thing that’s striking is that as one, for whatever reason or explanation, gets happier and happier, everything seems more intensely meaningful. There are pathological forms like mania or hypomania, where it leads to grandiosity, messianic delusions, even theomania, and thinking one is God. It’s possible to have much more benign versions. In practice, I think when life is based on gradients of bliss, eventually, superhuman bliss, this will entail superhuman meaning and significance. Essentially, we’ve got a choice. I mean, we can either have pure bliss, or one could have a combination of bliss and hyper-motivation, and one will be able to tweak the dials.

Sam Barker: This is all such deliciously appealing language as someone who’s spending a lot of their time tweaking dials.

David Pearce: This may or may not be the appropriate time to ask, but tell me about what future projects have you planned?

Sam Barker: I’m still very much exploring the potential of music as an increaser of wellbeing, and I think it’s sort of leading me in interesting directions. At present, I’m sort of in another crossroads, I feel. The general drive to realize these sort of higher functions of music is still a driving force. I’m starting to look at what is natural in music and what is learned. Like you say, there is this long history of the way that we appreciate sound. There’s link to all kinds of repetitive experiences that our ancestors had. There’s other aspects to sound production that are also very old. Use of reverb is connected to our experience as sort of cavemen dwelling in these kind of reverberant spaces. These were kind of sacred spaces for early humans, so this feeling of when you walk into a cathedral, for example, this otherworldly experience that comes from the acoustics is, I think, somehow deeply tied to this historical situation of seeking shelter in caves, and the caves having a bigger significance in the lives of early humans.

There’s a realization, I suppose, that what we’re experiencing that relates to music is rhythm, tone, timbre, and noise. If you just sort of pay attention to your background noise, the things that you’re most familiar with are actually not very musical. You don’t really find harmony in nature very much. I’m sort of forming some ideas around what parts of music and our response to music are cultural, and what are natural. It’s sort of a strange word to apply. Our sort of harmonic language is a technical construction. Rhythm is something we have a much deeper connection with through our lives as defined by rhythms of planets and that dividing our time into smaller and smaller ratios down to heartbeats and breathing. We’re sort of experiencing really complex poly-rhythmic silence form of music, I suppose. I’m separating these two concepts of rhythm and harmony and trying to get to the bottom of their function in the goal of elevating bliss and happiness. I guess, looking at what the tools I’m using are and what their role could be, if that makes any sense.

David Pearce: In some sense, this sounds weird. I think, insofar as it’s possible, one does have a duty to take care of oneself, and if one can give happiness to others, not least by music, in that sense, one can be a more effective altruist. In some sense, perhaps one feels, ethically, ought one to be working 12, 14 hours a day to make the world a better place. Equally, we all have our design limitations, and just being able to relax and, either as a consumer of music, or if one is a creator of music, that has a valuable role, too. It really does. One needs to take care of one’s own mental health to be able to help others.

Sam Barker: I feel like the kind of under the bonnet tinkering that, in some way, needs to happen for us to really make use of the new technologies. We need to do something about human nature. I feel like we’re a bit further away from those sort of realities than we are with the technological side. I think there needs to be sort of emergency measures, in some way, to improve human nature through the old fashioned social, cultural nudges, perhaps, as a stopgap until we can really roll our sleeves up and change human nature on a molecular level.

David Pearce: Yeah. I think we might need both. All the kind of environmental, social, political reform together with the biological, genetic, biohappiness revolution. I would love to be able to. A 100 year plan blueprint to get rid of suffering. Replace it with gradients of bliss, paradise engineering. In practice, I feel the story of Darwinian life still has several centuries to go. I hope I’m too pessimistic. Some of my transhumanist colleagues anticipate an intelligence explosion, or a complete cut via the fusion of humans and our machines, but we shall see.

Lucas Perry: David, Sam and I, and everyone else, loves your prose so much. Could you just kind of go off here and muster your best prose to give us some thoughts as beautiful as sunsets for how good the future of music, and art, and gradients of intelligent bliss will be?

David Pearce: I’m afraid. Put eloquence on hold, but yeah. Just try for a moment to remember your most precious, beautiful, sublime experience in your life, whatever it was. It may or may not be suitable for public consumption. Just try to hold it briefly. Imagine if life could be like that, only far, far better, all the time, and with no nasty side effects, no adverse social consequences. It is going to be possible to build this kind of super civilization based on gradients of bliss. Be over ambitious. Needless to say, if anything I have written, unfortunately you’d need to wade through all manner of fluff. I just want to say, I’m really thrilled and chuffed with Utility, so anything else is just vegan icing on the cake.

Sam Barker: Beautiful. I’m really, like I say, super relieved that it was taken as such. It was really a reconfiguring of my approach and my involvement with the thing that I’ve sort of given my life to thus far, and a sort of a clarification of the purpose. Aside from anything else, it just put me in a really perfect mindset for addressing mental obstacles in the way of my own happiness. Then, once you get that, you sort of feel like sharing it with other people. I think it started off a very positive process in my thoughts, which sort of manifested in the work I was doing. Extremely grateful for your generosity in lending these ideas. I hope, actually, just that people scratched the surface a little bit, and maybe plug some of the terms into a search engine and got kind of lost in the world of utopia a little bit. That was really the main reason for putting these references in and pushing people in that direction.

David Pearce: Well, you’ve given people a lot of pleasure, which is fantastic. Certainly, I’d personally rather be thought of as associated with paradise engineering and gradients of bliss, rather than the depressive, gloomy, negative utilitarian.

Sam Barker: Yeah. There’s a real dark side to the idea. I think the thing I read after the Hedonistic Imperative was some of Les Knight’s writing about the voluntary human extinction movement. I honestly don’t know if he’d be classified as a utilitarian, but this sort of egocentric utilitarianism, which you sort of endorse through including the animal kingdom in your manifesto. There’s sort of a growing appreciation for this kind of antinatal sentiment.

David Pearce: Yes, antinatalism seems to be growing, but I don’t think it’s ever going to be dominant. The only way to get rid of suffering and ensure high quality of life for all sentient beings is going to be, essentially, get to the heart of the problem to rewrite ourselves. I did actually do an antinatalist podcast the other week, but I’m only a soft antinatalist, because there’s always going to be selection pressure in favor of a predisposition to go forth and multiply. One needs to build alliances with fanatical life lovers, even if when one contemplates the state of the world, one has some rather dark thoughts.

Sam Barker: Yeah.

Lucas Perry: All right. So, are there any questions or things we haven’t touched on that you guys would like to talk about?

David Pearce: No. I just really want to just thank you to Lucas for organizing this. You’ve got quite a diverse range of podcasts now. Sam, I’m honored. Thank you very much. Really happy this has gone well.

Sam Barker: Yeah. David, really, it’s been my pleasure. Really appreciate your time and acceptance of how I’ve sort of handled your ideas.

Lucas Perry: I feel really happy that I was able to connect you guys, and I also think that both of you guys make the world more beautiful by your work and presence. For that, I am grateful and appreciative. Also, very much enjoy and take inspiration from both of your work, so keep on doing what you’re doing.

Sam Barker: Thanks, Lucas. Same to you. Really.

David Pearce: Thank you, Lucas. Very much appreciated.

Lucas Perry: I hope that you’ve enjoyed the conversation portion of this podcast. Now, I’m happy to introduce the guest mix by Barker. 

Steven Pinker and Stuart Russell on the Foundations, Benefits, and Possible Existential Threat of AI

 Topics discussed in this episode include:

  • The historical and intellectual foundations of AI 
  • How AI systems achieve or do not achieve intelligence in the same way as the human mind
  • The rise of AI and what it signifies 
  • The benefits and risks of AI in both the short and long term 
  • Whether superintelligent AI will pose an existential risk to humanity

You can take a survey about the podcast here

Submit a nominee for the Future of Life Award here



0:00 Intro 

4:30 The historical and intellectual foundations of AI 

11:11 Moving beyond dualism 

13:16 Regarding the objectives of an agent as fixed 

17:20 The distinction between artificial intelligence and deep learning 

22:00 How AI systems achieve or do not achieve intelligence in the same way as the human mind

49:46 What changes to human society does the rise of AI signal? 

54:57 What are the benefits and risks of AI? 

01:09:38 Do superintelligent AI systems pose an existential threat to humanity? 

01:51:30 Where to find and follow Steve and Stuart


Works referenced: 

Steven Pinker’s website and his Twitter

Stuart Russell’s new book, Human Compatible: Artificial Intelligence and the Problem of Control


We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Note: The following transcript has been edited for style and clarity.


Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today, we have a conversation with Steven Pinker and Stuart Russell. This episode explores the historical and intellectual foundations of AI, how AI systems achieve or do not achieve intelligence in the same way as the human mind, the benefits and risks of AI over the short and long-term, and finally whether superintelligent AI poses an existential risk to humanity. If you’re not currently following this podcast series, you can join us by subscribing on Apple Podcasts, Spotify, Soundcloud, or on whatever your favorite podcasting app is by searching for “Future of Life.” Our last episode was with Sam Harris on global priorities. If that sounds interesting to you, you can find that conversation wherever you might be following us. 

I’d also like to echo two announcements for the final time. So, if you’ve been tuned into the podcast recently, you can skip ahead just a bit. The first is that there is an ongoing survey for this podcast where you can give me feedback and voice your opinion about content. This goes a long way for helping me to make the podcast valuable for everyone. This survey should only come out once a year. So, this is a final call for thoughts and feedback if you’d like to voice anything. You can find a link for the survey about this podcast in the description of wherever you might be listening. 

The second announcement is that at the Future of Life Institute, we are in the midst of our search for the 2020 winner of the Future of Life Award. The Future of Life Award is a $50,000 prize that we give out to an individual who, without having received much recognition at the time of their actions, has helped to make today dramatically better than it may have been otherwise. The first two recipients of the Future of Life Award were Vasili Arkhipov and Stanislav Petrov, two heroes of the nuclear age. Both took actions at great personal risk to possibly prevent an all-out nuclear war. The third recipient was Dr. Matthew Meselson, who spearheaded the international ban on bioweapons. Right now, we’re not sure who to give the 2020 Future of Life Award to. That’s where you come in. If you know of an unsung hero who has helped to avoid global catastrophic disaster, or who has done incredible work to ensure a beneficial future of life, please head over to the Future of Life Award page and submit a candidate for consideration. The link for that page is on the page for this podcast or in the description of wherever you might be listening. If your candidate is chosen, you will receive $3,000 as a token of our appreciation. We’re also incentivizing the search via MIT’s successful red balloon strategy, where the first to nominate the winner gets $3,000 as mentioned, but there are also tiered pay outs where the first to invite the nomination winner gets $1,500, whoever first invited them gets $750, whoever first invited them $375, and so on. You can find details about that on the Future of Life Award page. Link in the description. 

Steven Pinker is a Professor in the Department of Psychology at Harvard University. He conducts research on visual cognition, psycholinguistics, and social relations. He has taught at Stanford and MIT and is the author of ten books, including The Language Instinct, How the Mind Works, The Blank Slate, The Better Angels of Our Nature, The Sense of Style, and Enlightenment Now: The Case for Reason, Science, Humanism, and Progress.

Stuart Russell is a Professor of Computer Science and holder of the Smith-Zadeh chair in engineering at the University of California, Berkeley. He has served as the vice chair of the World Economic Forum’s Council on AI and Robotics and as an advisor to the United Nations on arms control. He is an Andrew Carnegie Fellow as well as a fellow of the Association for the Advancement of Artificial Intelligence, the Association for Computing Machinery and the American Association for the Advancement of Science.

He is the author with Peter Norvig of the definitive and universally acclaimed textbook on AI, Artificial Intelligence: A Modern Approach. He is also the author of Human Compatible: Artificial Intelligence and the Problem of Control. 

And with that, here’s our conversation with Steven Pinker and Stuart Russell. 

So let’s get started here then. What are the historical and intellectual foundations upon which the ongoing AI revolution is built?

Steven Pinker: I would locate them in the Age of Reason and the Enlightenment, when Thomas Hobbes said, “Reasoning is but reckoning,” reckoning in the old-fashioned sense of “calculation” or “computation.” A century later, the two major styles of AI today were laid out: the neural network, or massively parallel interconnected system that is trained with examples and generalizes by similarity, and the symbol-crunching, propositional, “Good Old-Fashioned AI.” Both of those had adumbrations during the Enlightenment. David Hume, in the empiricist or associationist tradition, said there are only three principles of connection among ideas: contiguity in time or place, resemblance, and cause and effect. On the other side, you have Leibniz, who thought of cognition as the grinding of wheels and gears and what we would now call the manipulation of symbols. Of course the actual progress began in the 20th century with the ideas of Turing and Shannon and Weaver and Norbert Wiener. The rest is the history that Stuart writes about in his textbook and his recent book.

Stuart Russell: I think I would like to add in a little bit of ancient history as well, just because I think Aristotle not only thought a lot about how human thinking was organized and how it could be correct or incorrect and how we could make rational decisions, he very clearly describes a backward regression goal planner in one of his pieces, and his work was incredibly influential. One of the things he said is we deliberate about means and not about ends. I think he says, “A doctor does not choose whether to heal,” and so on. And you might disagree with that, but I think that that’s been a pretty influential thread in Western thinking for the last two millennia or more. That we kind of take objectives as given and the purpose of intelligence is to act in ways that achieve your objectives.

That idea got refined gradually. So Aristotle talked mainly about goals and logically provable sequences of actions that would achieve those goals. And then in the 17th and 18th centuries, I want to give a shout out to the French and the Swiss, so Pascal and Fermat and Arnauld and Bernoulli brought in ideas of rational decision making under uncertainty and the weighing of probabilities and the concept of utility that Bernoulli introduced. So that generalized Aristotle’s idea, but it didn’t change the fundamental principle that they took the objectives, the utilities, as given. Just intrinsic properties of a human being in a given moment.

In AI, we sort of went through the same historical development, except that we did the logic stuff for the first 30 years or so, roughly, and then we did the probability and decision theory stuff for the next 30 years. I think we’re in a terrible state now, because the vast majority of the deep learning community, when you read their papers, nothing is cited before 2012. Occasionally they’ll say things like, “For this problem, the learning algorithms that we have are probably inadequate, and in future I think we should direct some of our research towards something that we might call reasoning or knowledge,” as if no one had ever thought of those things before and they were the first person in history to ever have the idea that reasoning might be necessary for intelligence.

Steven Pinker: Yes.

Stuart Russell: I find this quite frustrating and particularly frustrating when students want to actually just bypass the AI course altogether and go straight to the deep learning course, because they just don’t think AI is necessary anymore.

Steven Pinker: Indeed, and also galling to me. In the late ’80s and ’90s I was involved in a debate over the applicability of the predecessors of deep learning models, then called multi-layer perceptrons, artificial neural networks, connectionist networks, and Parallel Distributed Processing networks. With Gary Marcus and Alan Prince and Michael Ullman and other collaborators, I pointed out the limitations of trying to achieve intelligence–even for simple linguistic processes like forming the plural of a noun or a past tense of a verb–if the only tool you had available was the ability to associate features with features, without any symbol processing. That debate went on for a couple of decades and then petered out. But then one of the prime tools in the neural network community, multilayer networks trained by error back-propagation, was revived in 2012. Indeed there is an amnesia for the issues in that debate, which Gary Marcus has revived for a modern era.

It would be interesting to trace the truly radical idea behind artificial intelligence: not just that there are rules or algorithms, whether they are from logic or probability theory, that an intelligent agent can use, the way a human pulls out a smartphone, but the idea that there is nothing but rules or algorithms, and that’s what an intelligent agent consists of: that is, no ghost in the machine, no agent separate from the mechanism. And there, I’m not sure whether Aristotle actually exorcised the ghost in the machine. I think he did have a notion of a soul. The idea that it’s rules all the way down, that intelligence is just a mechanism, probably has shallow roots. Although Hobbes probably could claim credit for it, and perhaps Hume as well.

Lucas Perry: That’s an excellent point, Steve, it seems like Abrahamic religions have kind of given rise in part to this belief, or maybe an expression of that belief, the kind of mind-body dualism, the ghost in the machine where the mind seems to be a nonphysical thing. So it seems like intelligence has had to go the same road of “life.” There used to be “elan vital” or some other spooky presupposed mechanism for giving rise to life. And so similarly with intelligence, it seems like we’ve had to move from thinking that there was a ghost in the machine that made the things work to there being rules all the way down. If you guys have anything else to add to that, I think that’d be interesting.

My other two reactions to what has been said so far are that this point about computer science taking the goal as given, I think is important and interesting, and maybe we could expand upon that a little bit. Then there’s also, Stuart mentioned the difference between AI and deep learning and that students want to skip the AI and just get straight to the deep learning. That seemed a little bit confusing to me.

Steven Pinker: Let me address the first part and I’ll turn it over to Stuart for the second. The notion of dualism–that there is a mechanism, but sitting on top of it is an immaterial agent or self or soul or I–is enshrined in the Abrahamic religions and in other religions, but it has deep intuitive roots. We are all intuitively dualists (Paul Bloom has made this argument in his book Descartes’ Baby). Fortunately, when we deal with each other in everyday life we don’t treat each other like robots or wind-up dolls, but we assume that there is an inner life that is much like ours, and we make sense of people’s behavior in terms of their beliefs and desires, which we don’t conceptualize as neural circuitry transforming patterns. We think there’s a locus of consciousness, which is easy to think of as separate from the flesh that we’re made of, especially since–and this is a point made by the 19th century British anthropologist Edward Tylor–there’s actually a lot of empirical “evidence” that supports dualism in our everyday life. Like dreaming.

When you dream, you know your body is in bed the whole time, but there’s some part of you that’s up and about in the world. When you see your reflection in a mirror or in still water, there is an animated essence that seems to have parted company with your body. When you’re in a trance from a drug or a fever and have an out-of-body experience, it seems that we and our bodies are not the same thing. And with death, one moment a person is walking around, the next moment the body is lifeless. It’s natural to think that it’s lost some invisible ingredient that had animated it while it was alive.

Today we know that this is just the activity of the brain, but in terms of the experience available to a person, dualism seems perfectly plausible. It’s one of the great achievements of neuroscience, on the one hand, to show that a brain is capable of supporting problem solving and perception and decision making, and of the computational sciences, on the other, to show that intelligence can be understood in terms of information and computation, and that goals (like the Aristotelian final cause) can be understood in terms of control and cybernetics and feedback.

Stuart Russell: On the point that in computer science, we regard the objectives as fixed, it’s much broader than just computer science. If you look at von Neumann and Morgenstern and their characterizations of rationality, nowhere do they talk about the process by which the agent might rationally come by its preferences. The agent is always assumed a priori to come with the preferences built in, and the only constraint is that those preferences be self-consistent so that you can’t be driven around circles of intransitive preferences where you simply cough up money to go round and round the same circle.
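The "round and round the same circle" argument is often called a money pump. As a rough illustration only, here is a minimal Python sketch of it. The item names, the fixed fee, and the `money_pump` helper are all hypothetical and do not come from the conversation or from von Neumann and Morgenstern's own formalism.

```python
def money_pump(prefers, start_item, fee, max_rounds):
    """Repeatedly offer the agent a swap to the item it prefers, for a fee.

    `prefers` maps an item to the item the agent would pay `fee` to swap it
    for (hypothetical encoding). Returns the final item and total fees paid.
    """
    item, paid = start_item, 0.0
    for _ in range(max_rounds):
        better = prefers.get(item)
        if better is None:      # nothing is preferred to this item: trading stops
            break
        item = better
        paid += fee
    return item, paid


# Intransitive (cyclic) preferences: B over A, C over B, and A over C.
cyclic = {"A": "B", "B": "C", "C": "A"}
cyclic_item, cyclic_paid = money_pump(cyclic, "A", fee=1.0, max_rounds=6)

# Consistent preferences: B over A, C over B, and nothing over C.
consistent = {"A": "B", "B": "C"}
consistent_item, consistent_paid = money_pump(consistent, "A", fee=1.0, max_rounds=6)

print(cyclic_item, cyclic_paid)          # back at the starting item, having paid every round
print(consistent_item, consistent_paid)  # stops at the most preferred item
```

The agent with cyclic preferences ends up holding what it started with and keeps paying for as long as offers continue, while the consistent agent stops paying once it reaches its most preferred item.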

The same thing I think is true in control theory, where the objective is the cost function, and you design a controller that minimizes the expected cost function, which might be a square of the distance from the desired trajectory or whatever it might be. Same in statistics, where let’s just assume that there’s a loss function. There’s no discussion in statistics of what the loss function should be or how the loss function might change or anything like that.

So this is something that pervades many of the technological underpinnings of the 20th century. As far as I can tell, it is to some extent in developmental psychology, but mainly, I think, in moral philosophy, that people really take seriously the question of what goals we should have. Is it moral for an agent to have such and such as its objective, and how could we, for example, teach an agent to have different objectives? And that gets you into some very uncharted philosophical waters about what is a rational process that would lead an agent to have different objectives at the end than it did at the beginning, given that if it has different objectives at the end, then it can only expect that it won’t be achieving the objectives that it has at the beginning. So why would it embark on a process that’s going to result in failure to achieve the objectives that it currently has?

So that’s sort of a philosophical puzzle, but it’s a real issue because in fact human beings do change. We’re not born with the preferences that we have as adults, and so there is a notion of plasticity that absolutely has to be understood if we’re to get this right.

Steven Pinker: Indeed, and I suspect we’ll return to the point later when we talk about potential risks of advanced artificial intelligence. The issue is whether a system having intelligence implies that the system would have certain goals, and probably Stuart and I agree the answer is no, at least not by definition. Precisely because what you want and how to get what you want are two logically independent questions. Hume famously said that reason must be the slave of the passions, by which he didn’t mean that we should just surrender to our impulses and do whatever feels good. What he meant was that reason itself can’t specify the goals that it tries to bring about. Those are exogenous. And indeed, von Neumann and Morgenstern are often misunderstood as saying that we must be ruthlessly egotistical, self-interested maximizers. Whereas the goal that is programmed into us, say by evolution or by culture, could include other people’s happiness as part of our utility function. That is a question that merely making our choices consistent is silent on.

So the ability to reason doesn’t by itself give you moral goals, including taking into account the interests of others. That having been said, there is a long tradition in moral philosophy which shows how it doesn’t take much to go from one to the other. Because as soon as we care about persuading others, as soon as our interests depend on how others treat us, then we can’t get away with saying “only my interests count and yours don’t because I am me and you’re not,” because there is no logical difference between “me” and “you.” So we’re forced to a kind of impartiality, wherein whatever I insist on for me I’ve got to grant to you, a kind of Golden Rule or Categorical Imperative that makes our interests interchangeable as soon as we’re in discourse with one another.

This is all to acknowledge Stuart’s point, but to take it a few steps further in how it deals with the question of what our goals ought to be.

Stuart Russell: The other point you raised, Lucas, was on being confused by my distinction between AI and deep learning.

Lucas Perry: That’s right.

Stuart Russell: I think you’re pointing to a confusion that exists in the public mind, in the media and even in parts of the AI community. AI has always included machine learning as a subdiscipline, all the way back to Turing’s 1950 paper, where he speculates, in fact, that a good way to build AI might be to just start with a child program and train it to be an adult intelligent machine. But there are many other sub-fields of AI: knowledge representation, reasoning, planning, decision making under uncertainty, problem solving, perception. Machine learning is relevant to all of these because they all involve processes that can be improved through experience. So that’s what we mean by machine learning: simply the improvement of performance through experience; and deep learning is a technology that helps with that process.

By itself, as far as we can tell, it doesn’t have what is necessary to produce general intelligence. Just to pick one example, the idea that human beings know things seems so self-evident that we hardly need to argue about it. But deep learning systems in a real sense don’t know things. They can’t usefully acquire knowledge by reading a book and then go out and use that to design a radio telescope, which human beings arguably can. So it seems inevitable that if we’re going to make progress, I mean, sure, we take the advances that deep learning has offered. Effectively, what we’ve discovered with deep learning is that you can train more complicated circuits than we previously would have guessed possible using various kinds of stochastic gradient descent and other tricks.

I think it’s true to say that most people would not have expected that you could build a thousand layer network that was 20,000 units wide. So it’s got 20 million circuit elements and simply put a signal in one end and some data in the other and expect that you’re going to be able to train those 20 million elements to represent the complicated function that you’re trying to get it to learn. So that was a big surprise, and that capability is opening up all kinds of new frontiers: in vision, in speech recognition, language, machine translation, and physical control in robots among other things. It’s a wonderful set of advances, but it’s not the entire solution. Any more than group theory is the entire solution to mathematics. There’s lots of other branches of mathematics that are exciting and interesting and important and you couldn’t function without them. The same is true for AI.
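The arithmetic behind the "20 million circuit elements" figure can be checked in a couple of lines. The sketch below also counts connection weights under the added assumption that consecutive layers are fully connected; that assumption is not stated in the conversation and is included only to give a sense of scale.

```python
layers = 1_000   # "a thousand layer network"
width = 20_000   # "20,000 units wide"

# Units (the "circuit elements" in the sense used above): one per position per layer.
units = layers * width

# ASSUMPTION (not from the conversation): every unit connects to every unit
# in the next layer, so each of the (layers - 1) gaps holds width * width weights.
weights_if_fully_connected = (layers - 1) * width * width

print(f"units:   {units:,}")
print(f"weights: {weights_if_fully_connected:,}")
```

Under that fully connected assumption, the number of trainable weights would be far larger than the number of units, which only sharpens the point that training such a circuit end to end was a surprise.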

So I think that we’re probably going to see even without further major conceptual advances, another decade of progress in achieving greater understanding of why deep learning works and how to do it better, and all the various applications that we can create using it. But I think if we don’t go back and then try to reintegrate all the other ideas of AI, we’re going to hit a wall. And so I think the sooner we lose our obsession with this new shiny thing, the better.

Steven Pinker: I couldn’t agree more. Indeed, in some ways we have already hit the wall. Any user of Siri or Cortana or a question-answering system has been frustrated by the way they just make associations to individual words and have a shallow understanding of the syntax of the sentence. If you ask Google or Siri, “Can you show me digital music players without a camera?” it’ll give you a long list of music players with discussions of their cameras, failing to understand the syntax of “X without Y.” Or, “What are some fast food restaurants nearby that are not McDonald’s?” and you get a list of nearby McDonald’s.

It’s not hard to bump into the limitations of systems that for all their sophistication are being trained on associations among local elements, and can–I agree, surprisingly–learn higher-order combinations of those elements. But despite the name “deep learning,” they are shallow in the sense that they don’t build up a knowledge base of what are the objects, and who did what to whom, which they can access through various routes.

Stuart Russell: Yeah. My favorite example, I’m not sure if it’s apocryphal, is you say to Siri, “Call me an ambulance,” and Siri says, “Okay. From now on I’ll call you Ann Ambulance.”

Steven Pinker: In a Marx Brothers movie, there’s the sequence, “Call me a taxi.” “Okay. You’re a taxi.” I don’t know if the AI story is an urban legend based on the Marx Brothers movie or whether life is imitating art.

Lucas Perry: Steven, I really appreciated that point about dualism and intelligence. I think it points in really interesting directions around identity and the self, which we don’t have time to get into here. But I did appreciate that.

So moving on ahead here, to what extent do you both see AI systems as achieving intelligence in the same way or not as the human mind does? What kinds of similarities are there or differences?

Stuart Russell: This is a really interesting question and we could spend the whole two hours just talking about this. So by artificial intelligence, I’m going to take it that we mean not deep learning, but the full range of techniques that AI researchers have developed over the years.

So some of them–for example, logical reasoning–were developed going back to Aristotle and other Greek philosophers who developed formal logic to model human thinking. So it’s not surprising that when we build programs that do logical reasoning, we are in some sense capturing one aspect of human reasoning capability. Then in the ’80s, as I mentioned, AI developed reasoning under uncertainty, and then later on refined that with notions of causality as well, particularly in the work of Judea Pearl. The differences are really because AI and cognitive science separated probably sometime in the ’60s. I think before that there wasn’t really a clear distinction between whether you were doing AI or whether you were doing cognitive science. It was very much the thought that if you could get a program to do anything that we think of as requiring intelligence in a human, then you were in some sense exhibiting a possible theory of how the human does it, or even you would make introspective claims and say, “Look, I’ve now shown that this theory of intelligence really works.”

But fairly soon people said, “Look, this is not really scientific. If you want to make a claim about how the human mind does something, you have to base it on real psychological experimentation with human subjects.” And that’s distinct from the engineering goal of AI, which is simply to produce programs that demonstrate certain capabilities. So for most of the last 50, 60 years, these two fields have grown further and further apart. I think now partly because of deep learning and partly because of other work, for example in probabilistic programming, we can start to do things that humans do that we couldn’t do before. So it becomes interesting again to ask, well, are humans really somewhat Bayesian, and are they doing this kind of Bayesian symbolic probabilistic program learning that, for example, Josh Tenenbaum was proposing, or are they doing something else? For example, Geoff Hinton is pretty adamant that, as he puts it, symbols are the luminiferous aether of AI, by which he means that they’re simply something that we imagined and they have no physical reality whatsoever in the human mind.

I find this a little hard to believe, and you have to wonder if symbols don’t exist, why are almost all deep learning applications aimed at recognizing the symbolic category to which an object belongs, and I haven’t heard an answer yet from the deep learning community about why that is. But it’s also clear that AI systems are doing things that have no resemblance to human cognition. When you look at what AlphaGo is actually doing, part of it is that sort of perception-like ability to look at a position and get a sense, to use an anthropomorphic term, of its potential for winning for white or for black. And perhaps that part is human-like, and actually it’s incredibly good. It’s probably better at recognizing the potential position directly with no deliberation whatsoever than a human is.

But the other part of what AlphaGo does is completely non-human. It’s considering sequences of moves from the current state that run all the way to the end of the game. So part of it is searching in a tree which could go 40 or 50 or possibly more moves into the future. Then from the end of the tree, it then plays a random game all the way to the end and sees who wins that game. And this is nothing like what human beings do. When humans are reasoning about a game like Go or Chess, first of all, we are thinking about it at multiple levels of abstraction. So we’re thinking about the liveness of a particular group, we’re thinking about control of a particular region of territory on the board. We’re thinking, “Well, if I give up control of this territory, then I can trade it for capturing his group over there.”
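The "play a random game all the way to the end and see who wins" step can be illustrated on a toy game. The following is a minimal Python sketch using single-pile Nim (take 1 to 3 stones, whoever takes the last stone wins), not Go, and it is not a description of AlphaGo's actual implementation. The function names, the choice of game, and the number of playouts are all illustrative assumptions.

```python
import random


def random_playout(stones, player, rng):
    """Play uniformly random legal moves until the pile is empty.

    Returns the winner (0 or 1): the player who takes the last stone.
    Requires stones >= 1.
    """
    while True:
        take = rng.randint(1, min(3, stones))
        stones -= take
        if stones == 0:
            return player
        player = 1 - player


def estimate_move_values(stones, player, n_playouts=2000, seed=0):
    """Score each legal move by the fraction of random playouts `player` wins."""
    rng = random.Random(seed)
    values = {}
    for take in range(1, min(3, stones) + 1):
        remaining = stones - take
        if remaining == 0:
            values[take] = 1.0  # taking the last stone wins immediately
            continue
        wins = sum(
            random_playout(remaining, 1 - player, rng) == player
            for _ in range(n_playouts)
        )
        values[take] = wins / n_playouts
    return values


print(estimate_move_values(2, player=0))
print(estimate_move_values(10, player=0))
```

Nothing in this procedure reasons about groups, territory, or trades. It only samples finished games and counts wins, which is the contrast with human reasoning being drawn here.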

So this kind of reasoning simply doesn’t happen in AlphaGo at all. We reason back from goals. In chess you say, “Perhaps I could trap his queen. Let me see if I can come up with a move that blocks his exit for the queen.” So we reason backwards from some goals and no chess program and no Go program does that kind of reasoning. The reason humans do this is because the world is incredibly complicated and in different circumstances, different kinds of cognitive processing are efficient and effective in producing good decisions quickly. And that’s the real issue for human intelligence, right?

If we didn’t have to worry about computation, then we would just set up the giant, unknown, partially observable Markov decision process (POMDP) of the universe, solve it, and then we would take the first action in the virtually infinite strategy tree that solves that POMDP. Then we would observe the next percept, we would update all our beliefs about the universe, and we would re-solve the universe, and that’s how we would proceed. We would have to do that sort of roughly every millisecond to control the muscles in our body, but we don’t do anything like that. All of the different kinds of mental capabilities that we have are deployed in this amazingly fluid way to get us through the complexity of the real world. We are so far away in AI from understanding how to do that, that when I see people say, “We’re just going to scale up our deep learning systems by another three orders of magnitude and we’ll be more intelligent than humans,” I just smile.

Steven Pinker: Yeah. I’d like to complement some of those observations. It is true that in the early days of artificial intelligence and cognitive psychology, they were driven by some of the same players. Herb Simon and Allen Newell can be credited as among the founders of AI and the founders of cognitive psychology. Likewise, Marvin Minsky and John McCarthy. When I was an undergraduate, I caught the tail end of what was called the cognitive revolution. It was exhilarating after the dominance of psychology by behaviorism, which forbade any talk of mentalistic concepts. You weren’t allowed to talk about memories or plans or goals or ideas or rules, because they were considered to be unobservable and thus unscientific. Then the concept of computation domesticated those mentalistic terms and opened up a huge space of hypotheses. What are the rules by which we understand and formulate sentences (a project that Noam Chomsky initiated)? How can we model human knowledge as a semantic network (a project that Minsky and Allan Collins and Ross Quillian and others developed)? How do we make sense of foresight and planning and problem solving, which Newell and Simon pioneered?

There was a lot of back and forth between AI and cognitive science when they were first exposed to the very idea that intelligence could be understood in mechanistic terms, and there was a flow of hypotheses from computer science that psychologists then tested as possible models. Ideas that you couldn’t even frame, you couldn’t even articulate before there was the language of computation, such as What is the capacity of human short term memory? or What are the search algorithms by which we explore a problem space? These were unintelligible in the era of behaviorism.

All this caught the attention of philosophers like Hilary Putnam, and later Dan Dennett, who noted that the ideas from the hybrid of cognitive psychology and artificial intelligence were addressing deep questions about what mental entities consist of, namely information processing states. The back-and-forth spilled into the ’70s when I was a graduate student, and even the ’80s when centers for cognitive science were funded by the Sloan Foundation. There was also a lot of openness in the companies that hired artificial intelligence researchers: AT&T Bell Labs, which was a scientific powerhouse before the breakup of AT&T, and Bolt Beranek and Newman in Cambridge, which eventually became part of Verizon. I would go there as a grad student to hear talks on artificial intelligence. I don’t know if this is apocryphal history, but Xerox Palo Alto Research Center, where I was a consultant, was so open that, according to legend, Steve Jobs walked in and saw the first computer with a graphical user interface and a mouse and windows and icons, stole the ideas, and went on to build the Lisa and then the Macintosh. Xerox lost out on their own invention, and companies got proprietary. Many of the AI researchers in companies no longer publish in peer-reviewed journals in psychology the way they used to, and the two cultures drifted apart.

Since hypotheses from computer science and artificial intelligence are just hypotheses, there is the question of whether the best engineering solution to a problem is the one that the brain uses. There’s the obvious objection that the hardware is radically different: the brain is massively parallel and noisy and stochastic; computers are serial and deterministic. That led in part to the backlash in the ’80s when perceptrons and artificial neural networks were revived. There was skepticism about the more symbolic approaches to artificial intelligence, which has been revived now in the deep learning era.

To get back to the question, what are ways in which human minds differ from AI systems? It depends on the AI system assessed, as Stuart pointed out. Both of us would agree that the easy equation of deep learning networks with human intelligence is unwarranted, that a lot of the walls that deep learning is hitting come about because, despite the noisy parallel elements the brain is made of, we do emulate a kind of symbol processing architecture, where we can be taught explicit propositions, and human intelligence does make use of these symbols in addition to massively parallel associative networks.

I can’t help but mention a historical irony. I’ve known Geoff Hinton since we were both post-docs. Hinton himself, early in his career, provided a refutation of the very claim of his that Stuart cited, that symbols are like luminiferous aether, a mythical entity. Geoff and I have noted to each other that we’ve switched sides in the debate on the nature of cognition. There was a debate in the 1970s on the format of mental imagery. Geoff and I were on opposite sides, but he was the symbolic proposition guy and I was the analog parallel network guy.

Hinton showed that our understanding of an object depends on the symbolic format in which we mentally represent it. Take something as simple as a cube, he said. Imagine a cube poised on one of its vertices, with the diagonally opposite vertex aligned above it. If you ask people, “Point to all the other vertices,” they are stymied. Their imagery fails, and they often leave out a couple of vertices. But if, instead of describing it to them as a cube tilted on its diagonal axis, you describe it as two tilted diamonds, one above the other, or as two tripods joined by a zig-zag ring, they “see” the correct answer. Even visualizing an object depends critically on how people mentally describe it to themselves with symbols. This is an argument for symbolic representations that Geoff Hinton made in 1979, and with his recent remarks about symbols he seems to have forgotten his own powerful example.

Stuart Russell: I think another area where deep learning is clearly not capturing the human capacity for learning is just in the efficiency of learning. I remember in the mid ’80s going to some classes in psychology at Stanford, and there were people doing machine learning then and they were very proud of their results, and somebody asked Gordon Bower, “How many examples do humans need to learn this kind of thing?” And Gordon said, “One. Sometimes two, usually one,” and this is genuinely true, right? If you look for a picture book that has one to two million pictures of giraffes to teach children what a giraffe is, you won’t find one. Picture books that tell children what giraffes are have one picture of a giraffe, one picture of an elephant, and the child gets it immediately. Even though it’s a very crude cartoonish drawing of a giraffe or an elephant, they never have a problem recognizing giraffes and elephants for the rest of their lives.

Deep learning systems are needing, even for these relatively simple concepts, thousands, tens of thousands, millions of examples, and the idea within deep learning seems to be that well, the way we’re going to scale up to more complicated things like learning how to write an email to ask for a job, is that we’ll just have billions or trillions of examples, and then we’ll be able to learn really, really complicated concepts. But of course the universe just doesn’t contain enough data for the machine to learn direct mappings from perceptual inputs or really actually perceptual input history. So imagine your entire video record of your life, and that feeds into the decision about what to do next, and you have to learn that mapping as a supervised learning problem. It’s not even funny how unfeasible that is. The longer the deep learning community persists in this, the worse the pain is going to be when their heads bang into the wall.

Steven Pinker: In many discussions of superintelligence inspired by the success of deep learning I’m puzzled as to what people could possibly mean. We’re sometimes asked to imagine an AI system that could solve the problem of Middle East peace or cure cancer. That implies that we would have to train it with 60 million other diseases and their cures, and it would extract the patterns and cure the new disease that we present it with. Needless to say, when it comes to solving global warming, or pandemics, or Middle Eastern peace, there aren’t going to be 60 million similar problems with their correct answers that could provide the training set for supervised learning.

Lucas Perry: So, human children and humans are generally capable of one shot learning, or you said we can learn via seeing one instance of a thing, whereas machine learning today is trained up via very, very large data sets. Can you explain what the actual perceptual difference is going on there? It seems for children, they see a giraffe and they can develop a bunch of higher order facts about the giraffe, like that it is tan, and has spots, and a long neck, and horns and other kinds of higher order things. Whereas machine learning systems may be doing something else. So could you explain that difference?

Stuart Russell: Yeah, I think you actually captured it pretty well. The human child is able to recognize the object, not as 20 million pixels, including–let’s not forget–all the pixels of the background. So many of these learning algorithms are actually learning to recognize the background, not the object at all. They’re really picking up on spurious regularities that happen in the way the images are being captured. But the human child immediately separates the figure from the background, says, “okay, it’s the figure that’s being called a giraffe”, and recognizes the higher level properties: “okay, it’s a quadruped, relatively large.” The most distinguishing characteristic, as you say, is the very long neck, plus the way its hide is colored. Probably most kids might not even notice the horns and I’m not even sure if all giraffes have the horns, or just the males or just the adults. I don’t know the answer to that.

So I wasn’t paying much attention to all those images. This carries over to many, many other situations, including in things like planning, where if we observe someone carrying out a successful behavior, that one example combined with our prior knowledge is typically enough for us to get the general idea of how to do that thing. And this prior knowledge is absolutely crucial. Just information-theoretically, you can’t learn from one example reliably, unless you bring to bear a great deal of prior knowledge. And this is completely absent in deep learning systems in two ways. One is they don’t have any prior knowledge. And two is some of the prior knowledge is specifically about the thing you’re trying to predict. So here, we’re trying to predict the category of an animal and we already have a great deal of prior knowledge about what it means to belong to a category of animals.

So for example, who owns you is not an attribute that the child would need to know or care about if you said, “What kind of animal is this?” And deep learning systems have no ability to include or exclude any input attribute on the basis of its relevance to what it’s trying to predict, because they know nothing about what it is you’re trying to predict. And if you think about it, that doesn’t make any sense, right? If I said, “okay, I want you to learn to predict predicate P1279A. Okay? And I’m going to give you loads and loads of examples.” And now you get a perfect predictor for ‘P1279A’, but you have absolutely no use for it, because P1279A doesn’t connect to anything else in your cognition. So you learned a completely useless predictor because you know nothing about the thing that you’re trying to predict.

So it seems like it’s broken in several really, really important ways, and I would say probably the absence of prior knowledge or any means to bring to bear prior knowledge on the learning process is the most crucial.
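The role of prior knowledge in learning from a single example can be made concrete with a small counting sketch. This is an illustration only, not something presented in the conversation: it assumes a toy hypothesis space of conjunctive concepts over boolean attributes, and the attribute names and the function `consistent_conjunctions` are invented for the example.

```python
from itertools import product

def consistent_conjunctions(example, relevant):
    """Enumerate conjunctive concepts consistent with ONE positive example.

    Hypothetical toy model: a concept is a set of attribute requirements.
    Each attribute judged relevant may either be required to match the
    example or be ignored; attributes judged irrelevant are always ignored.
    """
    names = [name for name in example if name in relevant]
    hypotheses = []
    for keep in product((False, True), repeat=len(names)):
        hypotheses.append({n: example[n] for n, k in zip(names, keep) if k})
    return hypotheses

# One picture of a giraffe, described by invented attributes.
giraffe = {
    "long_neck": True,
    "spotted_hide": True,
    "quadruped": True,
    "grass_background": True,
    "owner_is_zoo": True,
    "photo_taken_at_noon": True,
}

# No prior knowledge: every attribute might matter.
no_prior = consistent_conjunctions(giraffe, set(giraffe))
# Prior knowledge: only properties of the animal's body define its category.
with_prior = consistent_conjunctions(
    giraffe, {"long_neck", "spotted_hide", "quadruped"}
)

print(len(no_prior))    # 64 candidate concepts survive one example
print(len(with_prior))  # 8 candidate concepts survive one example
```

In this toy setting each additional attribute the learner cannot rule out doubles the number of concepts that a single example leaves standing, which is one way to read the claim that one-shot learning is only reliable when a great deal of prior knowledge is brought to bear.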

Steven Pinker: Indeed, this goes back to our conversation on how basic principles of intelligence that govern the design of intelligent systems provide hypotheses that can be tested within psychology. What Stuart has identified is ultimately the nature-nurture problem in cognition. Namely, what are the innate constraints that govern children’s first hypotheses as they try to make sense of the world? 

One famous answer is Chomsky’s universal grammar, which guides children as they acquire language. Another is the idea from my colleagues Susan Carey and Elizabeth Spelke, in different formulations, that children have a prior concept of a physical object whose parts move together, which persists over time, and which follows continuous spatiotemporal trajectories; and that they have a distinct  concept of an agent or mind, which is governed by beliefs and desires. Maybe, or maybe not, they come equipped with still other frameworks for concepts, like the concept of a living thing or the concept of an artifact, and these priors radically cut down the search space of hypotheses, so they don’t have to search at the level of pixels and all their logically possible weighted combinations. 

Of course, the challenge in the science is how you specify the innate constraints, the prior knowledge, so that they aren’t obviously too specific, given what we know about the plasticity of human cognition. The extreme example being the late philosopher Jerry Fodor’s suggestion that all concepts are innate, including “trombone” and “doorknob” and “carburetor.”

Stuart Russell: (Laughs)

Steven Pinker: Hard to swallow, but between that extreme and the deep learning architecture in which the only thing that’s innate are the pixels, the convolutional network that allows for translational invariance, and the network of connections, there’s an interesting middle ground. That defines the central research question in cognitive development.

Stuart Russell: I don’t think you have to believe in extensive innate structures in order to believe that prior knowledge is really, really important for learning. I would guess that some aspects of our cognition are innate, and one of them is probably that the world contains things, and that’s really important because if you just think about the brain as circuits, some circuit languages don’t have things as first class entities, whereas first order logical languages or programming languages do have things as first class entities and that’s a really important distinction.

Even if you believe that nothing is innate, the point is how does everything that you have perceived up to now affect your ability to learn the next thing? One argument is, everything you’ve perceived up to now, is simply data, and somehow magically, we have access to all our past perceptions, and then you’re just training a function from that whole lot to the next thing to do or how to interpret the next object.

That doesn’t make much sense. Presumably the experience you have from birth or even pre-birth onwards, is converted into something and one argument is that it’s just converted into something like knowledge, and then that knowledge is brought to bear on learning problems, for example, to even decide what are the relevant aspects of the input for predicting category membership of this thing?

And the other view would be that, in the deep learning community, they would say probably something like the accumulation of features. If you imagine a giant recurrent neural network: in the hidden layers of the recurrent neural network over years and years and years of perception, you’re building up internal representations, features, which then can perhaps simplify the learning of the next concept that you need to learn. And there’s probably some truth in that too.

And absolutely having a library of features that are generally useful for predicting and decision making and planning and our entire vocabulary, I think this is something that people often miss, our vocabulary, our language, is not just something we use to communicate with each other. It’s an enormous resource for simplifying the world in the right ways, to make the next thing we need to know, or the next thing we need to do, relatively easy. Right? So you imagine you decide at the age of 12, I want to understand the physical laws that control the universe.

The fact that we have in our vocabulary, something like doing a PhD, makes it much more feasible to figure out what your plan is going to be, to achieve this objective. If you didn’t have that, and if you didn’t have all the pieces of doing a PhD, like take a course, read a book, this library of words and action primitives, at all these levels of abstraction, is a resource without which you would be completely unable to formulate plans of any length or any likelihood of success. And this is another area where current AI systems, I would say generally, not just deep learning, we lack a real understanding of how to formulate these hierarchies and acquire this vocabulary and then how to deploy it in a seamless way so that we’re always managing to function successfully in the real world.

Lucas Perry: I’m basically just as confused about, I guess, intelligence as anyone else. So the difference, it seems to me, between the machine learning system and the child who one-shot-learns the giraffe is that the child brings into this learning scenario this knowledge that you guys were talking about: that they understand that the world is populated by things and that there are other minds, and some other ideas about 3D objects and perception. But a core difference seems to be something like symbols and the ability to manipulate symbols. Is this right? Or is it wrong? And what are symbols and effective symbol manipulation made of?

Steven Pinker: Yes, and that is a limitation of the so-called deep learning systems, which are a subset of machine learning, which is a subset of artificial intelligence. It’s certainly not true that AI systems don’t manipulate symbols.  Indeed, that’s what classical AI systems trade in: manipulation of propositions, implementation of versions of logical inference or of cause-and-effect reasoning. Those can certainly be implemented in AI systems–it’s just gone out of fashion with the deep learning craze.

Lucas Perry: Well, they don’t learn those symbols, right? Like we give them the symbols and then they manipulate them.

Steven Pinker: The basic architecture of the system, almost by definition, can’t be learned;  you can’t learn something with nothing. There have got to be some elementary information processes, some formats of data representation, some basic ways of transforming one representation to another, that are hardwired into the architecture of the system. It’s an open empirical question, in the case of the human brain, whether it includes variables for objects and minds, or living things, or artifacts, or if those are scaffolded one on top of the other with experience. There’s nothing in principle that prevents AI systems from doing that;  many of them do, but at least for now they seem to have fallen out of fashion.

Stuart Russell: There is precedent for generating new symbols, both in the probabilistic programming literature and in the inductive logic programming literature. So predicate invention is a very important reason for doing inductive logic programming. But I agree with Steve that it’s an open question as to whether the basic capacity to have new symbol-based representations in the brain is innate, or is it learned? There’s very anecdotal evidence about what happens to children who are not brought up among other human beings. I think those anecdotes suggest that they don’t become symbol-using in the same way. So it might be that the process of developing symbol-using capabilities in the brain is enormously aided by the fact that we grow up in the presence of symbol-using entities, namely our parents and family members and community. And of course that then leads you to a chicken-and-egg problem.

So you’d have to argue in that case that early humans, or pre-humans, had much more rudimentary symbol-like capabilities. Some animals have the ability to refer to different phenomena or objects with different signs, such as the different kinds of sounds that some New World monkeys have, for example, for a snake and for a puma, but they’re not able to do the full range of things that we do with symbols. You could argue that the symbol-using capability developed over hundreds of thousands of years and the unaided human mind doesn’t come with it built in, but because we’re usually bathed in symbol-using activity around us, we are able to quickly pick it up. I don’t know what the truth is, but it seems very clear that this kind of capability, for example, gives you the ability to generalize so much faster than you can with circuits. So just to give a particular example, take the rules of Go, which we talked about earlier: the rules of Go apply the same rule at every time-step in the game.

And it’s the same rule at every square in the game, except around the edges, and if you have what we call first order capability, meaning you can have universal quantifiers, or in programs we think of these as loops, you can say very quickly: for every square on the board, if you have a piece on there and it’s surrounded by the enemy, then it’s dead. That’s sort of a crude approximation to how things work in Go, but it’s roughly right. In a circuit, you can’t say that because you don’t have the ability to say “for every square.” So you have to have a piece of circuit for each square. So you’ve got 361 copies of the rule, and each of those copies has to be learned separately, and this is one of the things that we do with convolutional neural networks.
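The "for every square" idea can be sketched in a few lines of code. The following is a minimal illustration, not a Go implementation: it encodes only the crude approximation described above (a single stone whose on-board neighbors are all enemy stones counts as dead) and ignores real Go concepts such as groups and shared liberties. The board encoding and the function names are assumptions made for the example.

```python
BOARD_SIZE = 19  # a standard Go board has 19 x 19 = 361 points
EMPTY, BLACK, WHITE = ".", "B", "W"

def neighbors(row, col, size):
    """Yield the on-board orthogonal neighbors of a point (edges have fewer)."""
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < size and 0 <= c < size:
            yield r, c

def crudely_dead_stones(board):
    """Apply ONE rule to EVERY square: the loops play the role of the
    universal quantifier. Crude approximation only, not real Go capture."""
    size = len(board)
    dead = []
    for row in range(size):
        for col in range(size):
            stone = board[row][col]
            if stone == EMPTY:
                continue
            enemy = WHITE if stone == BLACK else BLACK
            if all(board[r][c] == enemy for r, c in neighbors(row, col, size)):
                dead.append((row, col))
    return dead

# Usage: one black stone surrounded by four white stones.
board = [[EMPTY] * BOARD_SIZE for _ in range(BOARD_SIZE)]
board[3][3] = BLACK
for r, c in neighbors(3, 3, BOARD_SIZE):
    board[r][c] = WHITE

print(crudely_dead_stones(board))  # [(3, 3)]
```

The rule is written once and the two loops carry it to all 361 points, including the edges, where `neighbors` simply yields fewer points. A fixed circuit without quantification would instead need a separate copy of the rule for each point.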

A convolutional neural network has the universal quantifier over the input space built into it. So it’s a kind of cheating, and as far as we know, the brain doesn’t have that type of weight sharing. So the key aspect is not just the physical structure of the convolutional network, which has these repeating local receptive fields on each different part of the retina, so to speak, but that we also insist that weights for each of those local receptive fields are copied across all receptive fields in the retina. So there aren’t millions of separate weights that are trained, there’s only a few, sometimes even just a handful of weights that are trained, and then the code makes sure that those are effectively copied across the entire retina. And the brain, I don’t think, has any way to do that, so it’s doing something else to achieve this kind of rapid generalization.
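Weight sharing can be illustrated with a small sketch in plain Python. It is an illustration under stated assumptions rather than a description of any particular network: the function names are invented, the 19 x 19 input and 3 x 3 receptive field are chosen only to echo the Go example, and biases and multiple channels are left out.

```python
def conv2d_shared(image, kernel):
    """'Valid' 2D cross-correlation: the SAME kernel weights are applied
    at every receptive field, which is what weight sharing means here."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k_h)
                for j in range(k_w)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def count_weights(image_size, field_size, shared):
    """Trainable weights for one layer of local receptive fields over a
    square input, with or without copying weights across positions."""
    positions = (image_size - field_size + 1) ** 2
    per_field = field_size ** 2
    return per_field if shared else per_field * positions

# A 3x3 image of ones under a 2x2 kernel of ones: every output sums 4 inputs.
print(conv2d_shared([[1, 1, 1]] * 3, [[1, 1]] * 2))  # [[4, 4], [4, 4]]

print(count_weights(19, 3, shared=True))   # 9
print(count_weights(19, 3, shared=False))  # 2601
```

With sharing, only the nine kernel weights are trained regardless of how many positions the kernel visits. Without it, each of the 289 receptive fields would carry its own nine weights, and anything learned at one position would have to be learned again at every other.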

Lucas Perry: All right. So now with all of this context and understanding about intelligence and its origins today in 2020, AI is beginning to proliferate and is occupying a lot of news cycles. What particular important changes to human society does the rise and proliferation of AI signal and how do you view it in relation to the agricultural and industrial revolutions?

Steven Pinker: I’m going to begin with a meta-answer, which is that we should keep in mind how spectacularly ignorant we are of the future, even the relatively near future. Experts at superforecasting studied by Phil Tetlock, pretty much the best in the world, go down to about chance after about five years out. And we know, looking at predictions of the future from the past, how ludicrous they can be, both in underpredicting technological changes and in overpredicting them. A 1993 book by Bill Gates called The Road Ahead said virtually nothing about the internet! And there’s a sport of looking at science-fiction movies and spotting ludicrous anachronisms, such as the fact that in 2001: A Space Odyssey they were using typewriters. They had suspended animation and trips to Jupiter, but they hadn’t invented the word processor. To say nothing of the social changes they failed to predict, such as the fact that all of the women in the movie were secretaries and assistants.

So we should begin by acknowledging that it is extraordinarily difficult to predict the future. And there’s a systematic reason, namely that the future depends not just on technological developments, but also on people’s reaction to the developments, and on the  reactions to the reactions, and the reactions to the reactions to the reactions. There are seven and a half billion of us reacting, and we have to acknowledge that there’s a lot we’re going to get wrong. 

It’s safe to say that a lot of tasks that involve physical manipulation, like stocking shelves and driving trucks, are going to be automated, and societies will have to deal with the possibility of radical changes in employment, and Stuart talks about those in his book. We don’t know whether the job market will be flexible enough to create new jobs, always at the frontier of what machines can’t yet do, or whether there’ll be massive unemployment that will require economic adjustments, such as a universal basic income or government sponsored service. 

Less clear is the extent to which high-level decision making, like policy, diplomacy, or scientific hypothesis-testing,  will be replaced by AI. I think that’s impossible to predict.  Although, closer to the replacement of truck drivers by autonomous vehicles, AI as a useful tool, rather than as a replacement, for human intelligence will explode in science and business and technology and every walk of life.

Stuart Russell: I think all of those things are true. And I agree that our general record of forecasting has been pretty dismal. I was smiling as Steve was talking, because I was remembering Ray Kurzweil recently saying how proud he was that he had predicted the self-driving car, I think it was in ’96 or ’92, something like that, and possibly wasn’t aware that the first self-driving car was driving on the freeway in 1987, before he even thought to predict that such a thing might happen. If I had to say, in the next decade, if you said, roughly speaking, that what happened in the 2010s was primarily that visual perception became very crudely feasible for machines when it wasn’t before.

And that’s already having huge impact, including in self-driving cars, I would say that language understanding at least in a simplified sense will become possible in this decade. And I think it’ll be a combination of deep learning with probabilistic programming, with Bayesian and symbolic methods. That will open up enormous areas of activity to machines where they simply couldn’t go before, and some of that will be very straightforward, job replacement for call center workers. Most of what they do, I think could be automated by systems that are able to understand their conversations. The role of the smart speaker, the Alexa, or Cortana or Siri or whatever will radically change and will enable AI systems to actually understand your life to a much greater extent. One of the reasons that Siri and Cortana or Alexa are not very useful to me is because they just don’t understand anything about my life.

The “call me an ambulance” example illustrates that. If I got a text message saying “Johnny’s in the hospital with a broken arm”, well, if it doesn’t understand that Johnny is possibly my cat, or possibly my son, or possibly my great grandfather and does Johnny live nearby, or in my house, or on another continent, then it hasn’t the faintest idea of what to do. Or even whether I care. It’s only really through language understanding. I doubt that we’re going to be filling these things full of first order logic assertions that we will type into our AI system. So it’s only through language that it’s going to be able to acquire the knowledge that it needs to be a useful assistant to an individual or a corporation. So having that language capability will open up whole new areas for AI to be useful to individuals and also to take jobs from people. And I’m not able to predict what else we might be able to do when there are AI systems that understand language, but it has to have a huge impact.

Lucas Perry: Is there anything else that you guys would like to add in terms of where AI is at right now, where it’ll be in the near future and the benefits and risks it will pose?

Stuart Russell: I could point to a few things that are already happening. There’s a lot of discussion about the negative impacts on women and minorities from algorithms that inadvertently pick up on biases in society. So we saw the example of Amazon’s hiring algorithm that rejected any resume that had the word “woman’s” in it. And I think that’s serious, but I think in the AI community we’re still not completely woke, and there’s a lot of consciousness raising that needs to happen. But I think technically that problem is manageable, and I think one interesting thing that’s occurring is that we’re starting to develop an understanding, not just of the machine learning algorithm, but of the socio-technical context in which that machine learning is embedded, and modeling that socio-technical context allows you to predict whether the use of that algorithm will have negative feedback kinds of consequences, or whether it will be vulnerable to certain kinds of selection bias in the input data, and so on.

Deepfakes, surveillance, and manipulation, that’s another big area, and then something I’m very concerned about is the use of AI for autonomous weapons. This is another area where we fight against media stereotypes. So when the media talk about autonomous weapons, they invariably have a picture of a Terminator. Always. And I tell journalists, I’m not going to talk to you if you put a picture of a Terminator in the article. And they always say, well, I don’t have any control over that, that’s a different part of the newspaper, but it always happens anyway.

And the reason that’s a problem is because then everyone thinks, “Oh, well this is science fiction. We don’t have to worry about this because this is science fiction.” And you know, I’ve heard the Russian ambassador to the UN in Geneva say, well, why are we even discussing these things, because this is science fiction, it’s 20 or 30 years in the future? Oh, by the way, I have some of these weapons, if you’d like to buy them. The reality is that many militaries around the world are developing these, and companies are selling them. There’s a Turkish arms company, STM, selling a device, which is basically the slaughterbot from the Slaughterbots movie. So it’s a small drone with onboard explosives and they advertise it as capable of tracking and autonomously attacking human beings based on video signatures and/or face recognition.

The Turkish government has announced that they’re going to be using those against the Kurds in Syria sometime this year. So we’ll see if it happens, but there’s no doubt that this is not science fiction, and it’s very real. And it’s going to create a new kind of weapon of mass destruction, because if it’s autonomous, it doesn’t need to be supervised. And if it doesn’t need to be supervised, then you can launch them by the million, and then you have something with the same effect as a nuclear weapon, but much cheaper, much easier to proliferate with much less collateral damage and all the rest of it.

Steven Pinker: I think in all of these discussions, it’s critical to not fall prey to a status-quo bias and compare the hypothetical problems of a future technology with an idealized present, ignoring the real problems with the present we take for granted. In the case of bias, we know that humans are horribly biased. It’s not just that we’re biased against particular genders and ethnic groups and sexual orientations. But in general we make judgements that can easily be outperformed by even simple algorithms, like a linear regression formula. So we should remember that our benchmark in talking about the accuracies or inaccuracies of AI prediction algorithms has to be the human, and that’s often a pretty low bar. When it comes to bias, of course, a system that’s trained on a sample that’s unrepresentative is not a particularly intelligent system. And going back to the idea that we have to distinguish the goals we want to achieve from the intelligence that achieves them, if our goal is to overcome past inequities, then by definition we don’t want to make selections that simply replicate the statistical distribution of women and minorities in the past. Our goal is to rectify those inequities, and the problem in a system that replicates them is not that it’s not intelligent enough, but that we’ve given it the wrong goal.

When it comes to weapons, here too, we’ve got to compare the potential harm of intelligent weapons systems with the stupendous harm of dumb weapon systems. Aerial bombardment, artillery, automatic weapons, search-and-destroy missions, and tank battles have killed people by the millions. I think there’s been insufficient attention to how a battleground that used smarter weapons would compare to what we’ve tolerated for centuries simply because that’s what we have come to accept, despite its being fantastically destructive. What ultimately we want to do is to make the use of any weapons less likely, and as I’ve written about, that has been the general trend in the last 75 years, fortunately.

Stuart Russell: Yeah, I think there is some truth in that. When I first got the email from Human Rights Watch, so they began a campaign, I think it was back in 2013, to argue for a treaty banning autonomous weapons. Human Rights Watch came into existence because of the awful things that human soldiers do. And now they’re saying “No, no, human soldiers are great, it’s the machines we need to worry about.” And I found that a little bit odd. To me, the argument about whether the weapons will inadvertently violate human rights in ways that human soldiers don’t, or sort of accidentally kill people in ways that we are getting better at avoiding, I don’t think that’s the issue. I think it’s specifically the weapon of mass destruction property that autonomous weapons have that, for example, machine guns don’t.

There’s a hundred million or more Kalashnikov rifles in private hands in the world. If all those weapons got up one morning by themselves and started shooting anyone they could see, that would be a big chunk of the human race gone, but they don’t do that. Each of them has to be carried by a person. And if you want to put a million of them into the field, you need another 10 million people to feed and train those million soldiers, and to transport them, and protect them, and all that stuff. And that’s why we haven’t seen very large scale death from all those hundred million Kalashnikovs.

Take even carpet-bombing, which I think nowadays would be regarded as indiscriminate and therefore a violation of international law. And I think even during the Second World War, people argued that “No, you can’t go and bomb cities.” But once the Germans started to do it, then there were escalating rounds of retaliation and people lost all sense of what was a civilized and what was an uncivilized act of war. But even The Blitz against Great Britain, as far as I know, killed only between 50,000 and 60,000 people, even though it hit dozens and dozens of cities. But literally one truckload of autonomous weapons can kill a million people.

An interesting fact about World War II is that for every person who died, between 1,000 and 10,000 bullets were fired. So just killing people with bullets on average in World War II cost you, let’s take a geometric mean, 3,000 bullets, which is actually about a thousand dollars at current prices, but you could build a lethal autonomous weapon for a lot less than that. And even if they had a 25% success rate in finding and killing a human, it’s much cheaper than the bullet, let alone the guns and the aircraft and all the rest of it.
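The arithmetic behind these figures can be checked with a short sketch. Only the 1,000 to 10,000 bullet range, the roughly 3,000 bullet figure, the thousand-dollar cost, and the 25% success rate come from the conversation; the implied price per bullet and the break-even unit cost are derived here for illustration and were not stated by the speaker. The function names are likewise invented.

```python
import math

def geometric_mean(low, high):
    """Geometric mean of two positive numbers: the midpoint on a log scale."""
    return math.sqrt(low * high)

def break_even_unit_cost(cost_per_death, success_rate):
    """Highest price per single-use weapon at which the expected cost per
    death equals cost_per_death, given each unit succeeds with success_rate.
    (Expected cost per death = unit_cost / success_rate.)"""
    return cost_per_death * success_rate

bullets_per_death = geometric_mean(1_000, 10_000)
print(round(bullets_per_death))  # 3162, rounded to "3,000" in the conversation

# Derived, not stated: price per bullet implied by $1,000 for 3,000 bullets.
print(round(1_000 / 3_000, 2))  # 0.33

# Derived, not stated: with a 25% success rate, a weapon costs less per death
# than the bullets alone whenever its unit price is below this figure.
print(break_even_unit_cost(1_000, 0.25))  # 250.0
```

The geometric mean is the natural choice for a range spanning an order of magnitude, since it treats 1,000 and 10,000 as equally far from the estimate in ratio terms.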

So as a way of killing very, very large numbers of people it’s incredibly cheap and incredibly effective. They can also be selective. So you can kill just the kind of people you want to get rid of. And it seems to me that we just don’t need another weapon of mass destruction with all of these extra characteristics. We’ve got rid of to some extent biological and chemical weapons. We’re trying to get rid of nuclear weapons, and introducing another one that’s arguably much worse seems to be a step in the wrong direction.

Steven Pinker: You asked also about the benefits of artificial intelligence, which I think could be stupendous. They include elimination of drudgery and the boring and dangerous jobs that no one really likes to do, like stocking shelves, making beds, mining coal, and picking fruit. There could be a bonanza in automating all the things that humans want done without human pain and labor and boredom and danger. It raises the problem of how we will support the people (if new jobs don’t materialize) who have nothing to do. But that’s a more minor economic problem to solve, compared to the spectacular advance we could have in eliminating human drudgery.

Also, there are a lot of jobs, such as the care of elderly people–lifting them onto toilets, reaching things from upper shelves–that, if automated, would allow more of them to live at home instead of being warehoused in nursing homes. Here, too, the potential for human flourishing is spectacular. And as I mentioned, many kinds of human judgment are so error-prone that they can already be replaced by simple algorithms, and better still if they were more intelligent algorithms. There’s the potential of much less waste, much less error, far fewer accidents. An obvious example is the million and a quarter people killed in traffic accidents each year that could be terrifically reduced if we had autonomous vehicles that were affordable and widespread.

Lucas Perry: A core of this is that all of the problems that humanity faces simply require intelligence to solve them, essentially. And if we’re able to solve the problem of how to make intelligent machines, then our problems will evermore and continuously become automatable by machine systems. So Stuart, do you have anything else to add here in terms of existential hope and benefits to complement what Steve just contributed before we pivot into existential risk?

Stuart Russell: Yeah, there is an argument going around, and I think Mark Zuckerberg said it pretty clearly, and Oren Etzioni and various other people have said basically the same thing. And it’s usually put this way, “If you’re against AI, then you’re against better medical decisions, or reducing medical errors, or safer cars,” and so on. And this is, I think, just a ridiculous argument. So first of all, people who are concerned about the risks of AI, are not against AI, right? That’s like arguing if you’re a nuclear engineer and you’re concerned about the possibility of a design flaw that would lead to a meltdown, you’re against electricity. No, you’re not against electricity. You’re just against millions of people dying for no reason, and you want to fix the problem. And the same argument I think is true about those who are concerned about the risk of AI. If AI didn’t have any benefits we wouldn’t be having this discussion at all. No one would be investing any money, no one would have put their lives and careers into working on the capabilities of AI, and the whole point would be moot.

So of course, AI will have benefits, but if you don’t address the risks, you won’t get the benefits, because the technology will be rejected, or we won’t even have a choice to reject it. And if you look at what happened with nuclear power, I think it’s really an object lesson. Nuclear power could and still can produce quite cheap electricity. So I have a house in France and most electricity in France comes from nuclear power, and it’s very cheap and very reliable. And it also doesn’t produce a lot of carbon dioxide, but because of Chernobyl, the nuclear industry has been literally decimated, by which I mean, reduced by a factor of 10, or more. And so we didn’t get the benefits, because we didn’t pay enough attention to the risks. The same holds with AI.

So the benefits of AI in the long run I would argue are pretty unlimited, and medical errors and safer cars, that’s all nice, but that’s a tiny, tiny footnote in what can be done. As Steve already mentioned, the elimination of drudgery and repetitive work. It’s easy for us intellectuals to talk about that. We’ve never really engaged in a whole lot of it, but for most of the human race, for most of recorded history, people with power and money have used everybody else as robots to get what they want. Whether we’ve been using them as military robots, or agricultural robots, or factory robots, we’ve been using people as robots.

And if you had gone back to the early hunter gatherer days and written some science fiction, and you said, “You know what, in the future, people will go into big square buildings, thousands of feet long with no windows and they’ll do the same thing a thousand times a day. And then they’ll go back the next day and do the same thing another thousand times. And they’re going to do that for thousands of days until they’re practically dead.” The audience, the readers of science fiction in 20,000 BC, would have said, “You’re completely nuts, that’s so unrealistic.” But that’s how we did it. And now we’re worried that it’s coming to an end, and it is coming to an end, because we finally have robots that can do the things that we’ve been using human robots to do.

And I’m not saying we should just get rid of those jobs, because jobs have all kinds of purposes in people’s lives. And I’m not a big fan of UBI, which says basically, “Okay, we give up. Humans are useless, so the machines will feed them and house them, entertain them, but that’s all they’re good for.”

Now the benefits to me… It’s hard to imagine, just like we could not imagine very well all the things we would use the internet for. I mean, I remember the Berkeley computer science faculty in the ’80s sitting around at lunch, we knew more about networking than almost anybody else, but we still had absolutely no idea. What was the point of being able to click on a link? What’s that about? We totally blew it.

And we don’t understand all the things that superhuman AI could do for us. I mean, Steve mentioned that we could do much better science, and I agree with that. In the book, I visualize it as taking various ideas, like, “travel as a service,” and extending that to “everything as a service.” So travel as a service is a good example. Like if you think about going to Australia 200 years ago, you’re talking about a billion dollar proposition, probably 10 years, thousands of people, 80% chance of death. Now I take out my cell phone, I go tap, tap, tap, and now I’m in Australia tomorrow. And it’s basically free compared to what it used to be. So that’s what I mean by, as a service, you want something, you just get it.

Superhuman AI could make everything as a service. So think about the things that are expensive and difficult or impossible now, like training a neurosurgeon, or building a railway to connect your rural village to a nearby city so that people can visit, or trade, or whatever. For most of the developing world these things are completely out of reach. The health budget of a lot of countries in Africa is less than $10 per person per year. So the entire health budget of a country would train one neurosurgeon in the US. So these things are out of reach, but if you take out the humans then these services can become effectively free. They become services like travel is today, and that would enable us to bring everyone on earth up to the kind of living standard that they might aspire to. And if we can figure out the resource constraints and so on that will be a wonderful thing.

Lucas Perry: Now that’s quite a beautiful picture of the future. There’s a lot of existential hope there. The other side to existential hope is existential risk. Now this is an interesting subject, which Steve and you, Stuart, I believe have disagreements about. So pivoting into this area, and Steve, you can go first here, do you believe that human beings, should we not go extinct in the meantime, will we build artificial superintelligence? And does that pose an existential risk to humanity?

Steven Pinker: Yeah, I’m on record as being skeptical of that scenario and dubious about the value of putting a lot of effort into worrying about it now. The concept of superintelligence is itself obscure. In a lot of the discussions you could replace the word “superintelligence” with “magic” or “miracle” and the sentence would read the same. You read about an AI system that could duplicate brains in silicon, or solve problems like war in the Middle East, or cure cancer.  It’s just imagining the possibility of a solution and assuming that the ability to bring it about will exist, without laying out what that intelligence would consist of, or what would count as a solution to the problem. 

So I find the concept of superintelligence itself a dubious extrapolation of an unextrapolable continuum, like human-to-animal, or not-so-bright human-to-smart-human. I don’t think there is a power called “intelligence” such that we can compare a squirrel or an octopus to a human and say, “Well, imagine even more of that.” 

I’m also skeptical about the existential risk scenarios. They tend to come in two varieties. One is based on the notion of a will to power: that as soon as you get an intelligent system, it will inevitably want to dominate and exploit. Often the analogy is that we humans have exploited and often extinguished animals because we’re smarter than them, so as soon as there is an artificial system that’s smarter than us, it’ll do to us what we did to the dodos. Or that technologically advanced civilizations, like European colonists and conquistadors subjugated and sometimes wiped out indigenous peoples, so that’s what an AI system might do to us. That’s one variety of this scenario.

I think that scenario confuses intelligence with dominance, based on the fact that in one species, Homo sapiens, they happen to come bundled together, because we came about through natural selection, a competitive process driven by relative success at capturing scarce resources and competing for mates, ultimately with the goal of relative reproductive success. But there’s no reason that a system that is designed to pursue a goal would have as its goal, domination. This goes back to our earlier discussion that the ability to achieve a goal is distinct from what the goal is.

It just so happens that in products of natural selection, the goal was winning in reproductive competition. For an artifact we design, there’s just no reason that would be true. This is sometimes called the orthogonality thesis in discussions of existential risk, although that’s just a fancy-schmancy way of referring to Hume’s distinction between our goals and our intelligence.

Now I know that there is an argument that says, “Wouldn’t any intelligence system have to maximize its own survivability, because if it’s given the goal of X, well, you can’t achieve X if you don’t exist, therefore, as a subgoal to achieving X, you’ve got to maximize your own survival at all costs.” I think that’s fallacious. It’s certainly not true that all complex systems have to work toward their own perpetuation. My iPhone doesn’t take any steps to resist my dropping it into a toilet, or letting it run out of power.

You could imagine if it could be programmed like a child to whine, and to cry, and to refuse to do what it’s told to do as its power level went down. We wouldn’t buy one. And we know in the natural world, there are plenty of living systems that sacrifice their own existence for other goals. When a bee stings you, its barbed stinger is dislodged when the bee escapes, killing the bee, but because the bee is programmed to maximize the survivability of the colony, not itself, it willingly sacrifices itself. So it is not true that by definition an intelligent system has to maximize its own power or survivability.

But the more common existential threat scenario is not a will to power but collateral damage. That if an AI system is given a single goal, what if it relentlessly pursues it without consideration of side effects, including harm to us? There are famous examples that I originally thought were spoofs, but were intended seriously, like giving an AI system the goal of making as many paperclips as possible, and so it converts all available matter into paperclips, including our own bodies (putting aside the fact that we don’t need more efficient paperclip manufacturing than what we already have, and that human bodies are a pretty crummy source of iron for paperclips).

Barely more plausible is the idea that we might give an AI system the goal of curing cancer, and so it will  conscript us as involuntary guinea pigs and induce tumors in all of us, or that we might give it the goal of regulating the level of water behind a dam and it might flood a town because it was never given the goal of not drowning a village. 

The problem with these scenarios is that they’re self-refuting. They assume that an “intelligent” artifact would be designed to implement a single goal, which is not true of even the stupid artifacts that we live with. When we design a car, we don’t just give the goal of going from A to B as fast as possible; we also install brakes and a steering wheel and a muffler and a catalytic converter. A lot of these scenarios seem to presuppose both idiocy on the part of the designers, who would give a system control over the infrastructure of the entire planet without testing it first to see how it worked, and an idiocy on the part of the allegedly intelligent system, which would pursue a single goal regardless of all the other effects. This does not exist in any human artifact, let alone one that claims to be intelligent. Giving an AI system one vaguely worded, sketchy goal, and empowering it with control over the entire infrastructure of the planet without testing it first seems to me just so self-evidently moronic that I don’t worry that engineers have to be warned against it.

I’ve quoted Stuart himself, who in an interview made the point well when he said, “No one talks about building bridges that don’t fall down. They just call it building bridges.” Likewise, AI that avoids idiocies like that is just AI, it’s not AI with extra safeguards. That’s what intelligence consists of.

Let me make one other comment. You could say, well, even if the odds are small, the damage would be so catastrophic that it is worth our concern. But there are downsides to worrying about existential risk. One of them is the possible stigmatization and abandonment of helpful technologies. Stuart mentioned the example of nuclear power. What’s catastrophic is that we don’t roll out nuclear power the way that France did, which would go a long way toward solving the genuinely dangerous problem of climate change. Fear of nuclear power has been irrationally stoked by vivid examples: the fairly trivial accident at Three Mile Island in the United States, which killed no one, the tsunami at Fukushima, where people died in the botched evacuation, not the nuclear accident, and the Soviet bungling at Chernobyl. Even that accident killed a fraction of the people that die every day from the burning of fossil fuels, to say nothing of the likely future harm from climate change. The reaction to Chernobyl is exactly how we should not deal with the dangers facing humanity.

Genetically modified organisms are another example: a technology overregulated or outlawed out of worst-case fears, depriving us of the spectacular benefits of greater ecological sustainability, human nutrition, and less use of water and pesticides. 

There are other downsides of fretting about exotic hypothetical existential risks. There is a line of reasoning in the existential risk community and the so-called Rationality community that goes something like this: since the harm of extinguishing the species is basically infinite, probabilities no longer matter, because by expected utility calculations, if you multiply the tiny risk by the very large number of the potential descendants of humans before the sun expands and kills us off (or in wilder scenarios, the astronomically larger number of immortal consciousnesses that will exist when we can upload our connectomes to the cloud, or when we colonize and multiply in other solar systems), well, then even an eensy, eensy, infinitesimal probability of extinction would be catastrophic, and we should worry about it now.

The problem is that that argument could apply to any scenario with a nonzero probability, which means any scenario that is not logically impossible. Should we take steps to prevent the evolution of toxic killer gerbils that will nibble everyone to death? If I say, “That’s preposterous,” you can say, “Well, even if the probability is very, very small, since the harm of extinction is so great, we must devote some brain power to that scenario.”

I do fear the moral hazard of human intellect being absorbed in this free-for-all: that any risk, if you imagine it’s potentially existential, could justify any amount of expenditure, according to this expected utility calculation. The hazard is that smart people, clever enough to grasp a danger that common sense would never conceive of, will be absorbed into what might be a fruitless pursuit, compared to areas where we urgently do need application of human brain power–in climate, in the prevention of nuclear war, in the prevention of pandemics. Those are real risks, which no one denies, and we haven’t solved any of them, together with other massive sources of human misery like Alzheimer’s disease. Given these needs, I wonder whether the infinitesimal-probability-times-infinite-harm is the right way of allocating our intellectual capital.

Lucas Perry: Stuart, do you want to react to those points?

Stuart Russell: Yeah, there’s a lot there to react to and I’m tempted to start at the end and work back and just ask, well, if we were spending hundreds of billions of dollars a year to breed billions of toxic killer gerbils, wouldn’t you ask people if that was a good idea before dismissing any reason to be concerned about it? If that’s what we were actually investing in creating. I don’t buy the analogy between AI and toxic killer gerbils in any shape or form. But I will go back to the beginning, and we began by talking about feasibility. And Steve argues, I think, primarily, that it’s not even meaningful, that we could create superhuman levels of intelligence, that there isn’t a single continuum.

And yes, there isn’t a single continuum, but there doesn’t have to be a single continuum. When people say one person is more intelligent than another, or one species is more intelligent than another, it’s not a scientific statement that there is a single scalar on which species one exceeds species two. They’re talking in broad brush. So when we say humans are more intelligent than chimpanzees, that’s probably a reasonable thing to say, but there are clearly dimensions of intelligence where actually chimpanzees are more intelligent than humans. For example, short term memory. A chimpanzee, once they get what a digit is, they can learn 20 digit telephone numbers at the drop of a hat, and humans can’t do that. Clearly there are dimensions on which chimpanzee intelligence, on average, is probably better than human. But nonetheless, when you look at which species you would rather be right now, the chimpanzees don’t have much of a chance against the humans.

I think that there is a meaningful notion of generality of intelligence, and one way to think about it is to take a decision making scenario where we already understand how to produce very effective decisions, and then ask, how is that decision scenario restricted, and what happens when we relax the restrictions and figure out how to maintain the same, let’s say, superhuman quality of decision making? So if you look at Go play, it’s clear that the humans have been left far behind. So it’s not unreasonable to ask, just as the machines wiped the floor with humans on the Go board, and the chess board, and now on the StarCraft board, and lots of other boards, could you take that and transfer that into the real world where we make decisions of all kinds? The difference between the Go board and the real world is pretty dramatic. And that’s why we’ve had lots of success on the Go board and not so much in the real world.

The first thing is that the Go board is fully observable. You can see the entire state of the world that matters. And of course in the real world there’s lots of stuff you don’t see and don’t know. Some of it you can infer by accumulating information over time, what we call state estimation, but that turns out to be quite a difficult problem. Another thing is that we know all the rules of Go, and of course in the real world, you don’t know all the rules, you have to learn a lot as you go along. Another thing about the Go board is that despite the fact that we think of it as really complicated, it’s incredibly simple compared to the real world. At any given time on the Go board there’s a couple of hundred legal moves, and the game lasts for a couple hundred moves.

And if you said, well, what are the analogous primitive actions in the real world for a human being? Well, we have 600 muscles and we can actuate them maybe about 10 times per second each. Your brain probably isn’t able to do that, but physically that’s what could be your action space. And so you actually have then a far greater action space. And you’re also talking about… We often make plans that last for many years, which is literally trillions of primitive actions in terms of muscle actuations. Now we don’t plan those all out in detail, but we function on those kinds of timescales. Those are some of the ways that Go and the real world differ. And what we do in AI is we don’t say, okay, I’ve done Go, now I’m going to work on suicide Go, and now I’m going to work on chess with three queens.
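The scale gap described here can be checked with a back-of-envelope calculation. The sketch below uses the figures quoted in the conversation (600 muscles, about 10 actuations per second each, a Go game of a couple of hundred moves); the ten-year plan length is an assumption chosen only to stand in for "many years".

```python
# Figures quoted in the conversation.
GO_MOVES_PER_GAME = 200        # "the game lasts for a couple hundred moves"
MUSCLES = 600                  # "we have 600 muscles"
ACTUATIONS_PER_SECOND = 10     # "about 10 times per second each"

# Assumption for illustration only: a plan that spans ten years.
PLAN_YEARS = 10
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def primitive_actions(years: int) -> int:
    """Count muscle actuations available over a plan of the given length."""
    return MUSCLES * ACTUATIONS_PER_SECOND * SECONDS_PER_YEAR * years


total = primitive_actions(PLAN_YEARS)
print(f"{total:,} muscle actuations over {PLAN_YEARS} years")
print(f"about {total // GO_MOVES_PER_GAME:,} times the length of a Go game")
```

Under these assumptions a ten-year plan comes to roughly 1.9 trillion primitive actions, which is consistent with the "trillions" mentioned above.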

What we try to do is extract the general lessons. Okay, we now understand fairly well how to handle that whole class of problems. Can we relax the assumptions, these basic qualitative assumptions about the nature of the problem? And if you relax all the ones that I listed, and probably a couple more that I’ve got, you’re getting towards systems that can function at a superhuman level in the real world, assuming that you figure out how to deal with all those issues. So just as we find ourselves flummoxed by the moves that the AI system makes on the Go board, if you’re a General, and you’re up against an AI system that’s controlling, or coming up with the decision making plans for the other side, you might find yourself flummoxed, that everything you try, the machine has already anticipated and put in place something that will prevent your plan from succeeding. The pace of warfare will be beyond anything humans have ever contemplated, right?

So they won’t even have time to think, just as the Iraqis were not used to the rate of decision making of the US Army in the first Gulf War, and they couldn’t do anything. They were literally paralyzed, and just step by step by step the allied forces were able to take them apart because they couldn’t respond within the timescales that the allied forces were operating.

So it will be kind of like that if you were a human general. If you were a human CEO and your competitor company is organized and run by AI systems, you’d be in the same kind of situation. So it’s entirely conceivable. I’m not necessarily saying plausible, but conceivable that we can create real world decision making capabilities that exceed those of humans across the board. So that this notion of generality, I think it is something that still needs to be worked out. Most definitions of generality that people come up with end up saying, “Well, humans are general because they can do all the things that humans can do,” which is sort of a tautology. But nonetheless, it’s interesting that when you think about all the jobs: doctor, carpenter, advertising sales representative, most normally functioning people could do most of those jobs at least to some reasonable level.

So we are incredibly flexible compared to current AI systems. There is progress on achieving generality, but there’s a long way to go. I’m certainly not one of those who says that superintelligent AI is imminent and that’s why we need to worry. And in fact, I’m probably more conservative. If you want to appeal to what most expert AI people think, most expert AI people think that we will have something that’s reasonably described as superintelligent AI sooner than I do.

So most people think sometime in the middle of the century. It turns out that Asian AI researchers particularly in China are more optimistic, so they think 20 years. People in the US and Europe may be more like 40 years. I would be reasonably confident saying by the end of this century.

I think Nick Bostrom is in about the same place. He’s also more conservative than the average expert AI researcher. There are major breakthroughs that have to happen, but the massive investment that’s taking place, the influx of incredibly smart people into the field, these things suggest that those breakthroughs will probably take place but the timescale is very hard to say.

And when we think about the risks, I would say Steve is really putting up one straw man after another and then knocking it down. So for example, the paperclip argument is not a scenario that Nick Bostrom thinks is one of the more likely ways for the human race to end. It’s a philosophical thought experiment intended to illustrate a point. And the point is incontrovertible and I don’t think Steve disagrees with it.

So let’s not use the word intelligent because I think Steve here is using the word intelligent to mean always behaves in whatever way we think we wish that it would behave well.

Of course, if you define intelligence that way, then there isn’t an issue. The question is, how do we create any such thing? And the ways we have right now of creating any such thing fall under the standard model, which I described earlier that we set up, let’s call it a superoptimizer and then we give it an objective. And then off it goes. And he’s (Bostrom) describing what happens when you give a superoptimizer the wrong goal. And he’s not saying, “Yes, of course we should give it wrong goals.”

And he’s using this to illustrate what happens when you give it even what seems to be innocuous. So he’s trying to convey the idea that we are not very good at judging the consequences of seemingly innocuous goals. My example of curing cancer: “Curing cancer? Yeah, of course, that’s a good goal to give to an AI system” — but the point is, if that’s the only goal you give to the AI system, then all these weird things happen because that’s the nature of super optimizers. That’s the nature of the standard model of AI.

And the main point being made, I think, is not that no matter what we do AI is going to get us. It’s that given our current understanding and given hundreds of billions of dollars are being invested into that current understanding then there is a failure mode and it’s reasonable to point that out just as if you’re a nuclear engineer and you say, “Look, everyone is designing these reactors in this way. All of you are doing this. And look there’s this failure mode.” That’s a reasonable thing to point out.

Steven Pinker: Several reactions. First, while money is pouring into AI, it’s not pouring it into super-optimizers tasked with curing cancer and with the power to kidnap people. And the analogies of humans outcompeting chimpanzees, or American generals outsmarting their Iraqi counterparts, once again assume that systems that are smarter than us will therefore be in competition with us. As for straw men, I was mindful to avoid them: the AI system that would give people tumors to pursue the goal of curing cancer was taken from Stuart’s book.

I agree that a super-optimizer that was given a single goal would be a menace. But a super-optimizer that pursued a single goal is self-evidently unintelligent, not superintelligent!

Stuart Russell: Of course, we have multiple goals. There’s a whole field of multi-attribute utility theory that’s been going now for more than 50 years. Of course, we understand that. When we look at even the design of the algorithms that Uber uses to get you to the airport, they take into account multiple goals.

But the point is the same argument applies if you operate in a standard model when you add in the multiple goals. Unless you’re able to be sure that you have completely and correctly captured all the things we care about under all conceivable and the inconceivable, because I think one of the things about superintelligent AI systems is that they will come up with, by human standards, inconceivable forms of actions.

We cannot guarantee that. And this is the point. So you could say multiple goals, but multiple goals are just a single goal. They add up to the ability to rank futures. And the question is: is that ability to rank futures fully aligned with what humans want their futures to be like? And the answer is inevitably, no. We are inevitably going to leave things out.

So even if you have a thousand terms in the objective function, there’s probably another million that you ought to have included that you didn’t think about because it never occurred to you.
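The claim that multiple goals "add up to the ability to rank futures" can be made concrete with a toy weighted-sum utility. This is only a sketch: the futures, attribute names, scores, and weights below are all hypothetical, and a weighted sum is just one simple form that a multi-attribute utility could take.

```python
# Hypothetical futures, each scored on three hypothetical attributes.
FUTURES = {
    "A": {"health": 8, "wealth": 3, "leisure": 5},
    "B": {"health": 5, "wealth": 9, "leisure": 2},
    "C": {"health": 6, "wealth": 6, "leisure": 6},
}

# Hypothetical integer weights expressing how much each goal matters.
WEIGHTS = {"health": 5, "wealth": 3, "leisure": 2}


def utility(attributes: dict, weights: dict) -> int:
    """Collapse several goals into one number via a weighted sum."""
    return sum(weights[name] * attributes[name] for name in weights)


def rank_futures(futures: dict, weights: dict) -> list:
    """Order futures from most to least preferred under the given weights."""
    return sorted(futures, key=lambda f: utility(futures[f], weights), reverse=True)


print(rank_futures(FUTURES, WEIGHTS))
```

However many goals go in, what comes out is a single ordering of the futures. Any attribute missing from `WEIGHTS` has no influence on that ordering at all, which is the gap the argument is pointing at.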

So for example, you can go out and find lists of important things that human beings care about. This is sort of the whole-values community, human-development community, Maslow hierarchy, all of those things. People do make whole lists of things trying to build up a picture of very roughly what is the human utility function after all.

But invariably, those lists just refer to things that are usually a subject of discussion among humans about “Do we spend money on schools or hospitals?” or whatever it might be. On that list, you will not find the color of the sky because no one, no humans right now are thinking about, “Oh, should we change the sky to be orange with pink stripes?” But if someone did change the color of the sky, I can bet you a lot of people would be really upset about it.

And so invariably we fail to include many, many criteria in whatever list of objectives you might come up with. And when you do that, what happens is that the optimizer will take advantage of those dimensions of freedom and typically, and actually under fairly general algebraic conditions, will set them to extreme values because that gives you better optimization on the things that are in the list of goals.
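A toy optimization shows the effect being described. Everything in this sketch is hypothetical: the two variables, the budget, the production formula, and the penalty are invented for illustration, and a brute-force search stands in for whatever optimizer a real system would use. It is not the general algebraic result mentioned above, only a single instance of the pattern.

```python
from itertools import product

BUDGET = 10               # hypothetical shared resource limit
LEVELS = range(0, 11)     # each variable can be set anywhere from 0 to 10


def stated_objective(effort: int, diverted: int) -> int:
    """What the designer asked for: output only. 'diverted' is something
    people care about (say, river water taken) but never listed as a cost."""
    return 2 * effort + 3 * diverted


def true_utility(effort: int, diverted: int) -> int:
    """What people actually want: output minus the unlisted harm."""
    return stated_objective(effort, diverted) - 5 * diverted


def optimize(objective):
    """Brute-force search over all feasible (effort, diverted) settings."""
    feasible = ((e, d) for e, d in product(LEVELS, LEVELS) if e + d <= BUDGET)
    return max(feasible, key=lambda setting: objective(*setting))


chosen = optimize(stated_objective)
wanted = optimize(true_utility)
print("optimizer picks", chosen, "with true utility", true_utility(*chosen))
print("people wanted  ", wanted, "with true utility", true_utility(*wanted))
```

Because the unlisted variable helps the stated objective, the optimizer drives it to its maximum, and the result scores well on the stated goal while scoring badly on what people actually wanted.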

So the argument is that within the standard model, which I bear some responsibility for, because it’s the way we wrote the first three editions of the textbook, within the standard model, further progress on AI could lead to increasing problems of control and it’s not because there’s any will to dominance.

I don’t know of any serious thinkers in the X-risk community who think that that’s the problem. That’s another straw man.

Steven Pinker: When you’re finished, I do have some responses to that.

Stuart Russell: The argument is not that we automatically build in because we all want our systems to be alpha males or anything like that. And I think Steve Omohundro has put it fairly clearly in some of his earlier papers that the behavior of a superoptimizer given any finite list of goals is going to include efforts to maximize its computational resources and other resources that will help it achieve the objectives that we do specify.

And you could put in something saying, “Well, and don’t spend any money.” Or, “Don’t do this and don’t do that and don’t do the other.” But the same structure of the argument is going to apply. We can reduce the risk by adding more and more stuff into the explicit objectives, but I think the argument I’m making in the book is that that’s just a completely broken way to design AI systems.

The meta argument is that if we don’t talk about the failure modes, we won’t be able to address them. So actually I think that Steve and I don’t disagree about the plausible future evolution. I don’t think it’s particularly plausible. If I was going into forecast mode, so just betting on the future saying, “What’s the probability that this thing will happen or that thing will happen?” I don’t think it’s particularly plausible that we will be destroyed by superintelligent AI.

And there are several reasons why I don’t think that’s going to occur because we would probably get some early warnings of it. And if we couldn’t figure out how to prevent it, we would probably put very strong restrictions on further development or we would figure out how to actually make it provably safe and beneficial.

But you can’t have that discussion unless you talk about the failure modes. Just like in nuclear safety, it’s not against the rules to raise possible failure modes like what if this molten sodium that you’re proposing should flow around all these pipes? What if it ever came into contact with the water that’s on the turbine side of the system? Wouldn’t you have a massive explosion which could rip off the containment and so on? That’s not exactly what happened in Chernobyl, but not so dissimilar.

And of course that’s what they do. So this culture of safety that Steve talks about consists exactly of this. People saying, “Look, if you design things that way these terrible things are going to happen. So don’t design things that way, design things this way.” And this is a process that we are going through in the AI community right now.

And I have to say, I just actually was reading a letter from one of my very senior colleagues, former president of AAAI, who said, “Five years ago, everyone thought Stuart was nuts, but Stuart was right. These risks have to be taken seriously and we all owe him a great debt for bringing it within the AI community so that we can start to address it.”

And I don’t think I invented these risks. I was just by chance in a position where I had two years of sabbatical to think about the future of the field and to read some of the things that others had already written about the field from the outside.

My sense is that Steve and I are kind of the glass half full, glass half empty. In terms of our forecast, we think on the whole, the weather tomorrow is likely to be sunny. I think we disagree on how to make sure that it’s sunny. I really do think that the problem of creating a provably beneficial AI, by which I mean that no matter how powerful the AI system is, we remain in power. We have power over it forever, that we never lose control.

That’s a big ask and the idea that we could solve that problem without even mentioning it, without even talking about it and without even pointing out why it’s difficult and why it’s important, that’s not the culture of safety. That’s sort of more like the culture of the communist party committee in Chernobyl, that simply continued to assert that nothing bad was happening.

Steven Pinker: Obviously, I’m in favor of the safety mindset of engineering, that is, you test the system before you implement it, you try to anticipate the failure modes. And perhaps I have overestimated the common sense of the AI community and they have to be warned about the absurdity of building a superoptimizer. But a lot of these examples–flooding a town to control the water level, or curing cancer by turning humans into involuntary guinea pigs, or maximizing happiness by injecting everyone with a drip of antidepressants–strike me as so far from reasonable failure modes that they’re not part of the ordinary engineering effort to ensure safety–particularly when they are coupled with the term “existential.”

These are not ordinary engineering discussions of ways in which a system could fail; they are speculations on how the human species might end. That is very different from not plugging in an AI system until you’ve tested it to find out how it fails. And perhaps we agree that the superoptimizers in these thought experiments are so unintelligent that no one will actually empower them.

Stuart Russell: But Steve, I wasn’t saying we give it one goal. I’m saying however many goals we give it, that’s equivalent to giving it a ranking over futures. So the idea of single goal versus multiple is a complete red herring.

Steven Pinker: But the scare stories all involve systems that are given a single goal. As you go down the tail of possible risks, you’re getting into potentially infinitesimal risks. There is no system, conceivable or existing, that will have zero risk of every possibility. 

Stuart Russell: If we could do that, if we had some serious theory by which we could say, “Okay. We’ve got within epsilon of the true human ranking over futures,” I think that’s very hard to do. We literally do not have a clue how to do that. And the purpose of these examples is actually to dismiss the idea that this has a simple solution.

So people want to dismiss the idea of risk by saying, “Oh, we’ll just give the AI system such and such objective.” And then the failure mode goes away and everything’s cool. And then people say, “Oh, but no.” Look, if you give it the objective that everyone should be happy, then here’s a solution that the AI system could find that clearly we wouldn’t want.

Those processes lead actually to deeper questioning, what do we really mean by happy? We don’t just mean pleasure as measured by the pleasure center in the brain. And the same arguments happen in moral philosophy.

So no one is accusing G.E. Moore of being a naive idiot because he objected to a pleasure maximization definition of what is a good moral decision to make. He was making an important philosophical point and I don’t think we should dismiss that same point when it’s made in the context of designing objectives for AI systems.

Steven Pinker: Yes, that’s an excellent argument against building a universally empowered AI system that’s given the single goal of maximizing human happiness–your example. Do AI researchers need to be warned against that absurd project? It seems to me that that’s the straw man, and so are the other scenarios that are designed to sow worry, such as conscripting the entire human race as involuntary guinea pigs in cancer experiments. Even if there isn’t an epsilon that we can’t go below in laying out possible risks, it doesn’t strike me that that’s within the epsilon.

These strike me more as exercises of human imagination. Assuming a ridiculously simple system that’s given one goal, what could go wrong? Well, yeah, stuff could go wrong, but is that really what’s going to face us when it comes to actual AI systems that have some hope of being implemented?

Naturally, we ought to test the living daylights out of any system before we give it control over anything. That’s Stuart’s point about building bridges that don’t fall down and the standard safety ethic in engineering. But I’m not sure that exotic scenarios based on incredibly stupid ideas for AI systems like giving one the goal of maximizing human happiness is the route that gives us safe AI.

Stuart Russell: Okay. So let me say once again that the one goal versus multiple goals is a red herring. If you think it’s so easy to specify the goal correctly, perhaps your next paper will write it out. Then we’ll say, “Okay, that’s not a straw man. This is Steve Pinker’s suggestion of what the objective should be for the superintelligent AI system.” And then the people who love doing these things, probably Nick Bostrom and others will find ways of failing.

So the idea that we could just test before deploying something that is significantly more powerful than human beings or even the human race combined, that’s a pretty optimistic idea. We’re not even able to test ordinary software systems right now. So test generation is one of the effective methods used in software engineering, but it has many, many known failure cases for real world examples, including multiplication.

Intel’s Pentium chip was tested with billions of examples of multiplication, but it failed to uncover a bug in the multiplication circuitry, which caused it to produce incorrect results in some cases. And so we have a technology of formal verification, which would have uncovered that error, but particularly in the US there’s a culture that’s somewhat opposed to using formal verification in software design.

Less so in hardware design nowadays, partly because of the Pentium error, but still in software, formal verification is considered very difficult and very European and not something we do. And this is far harder than that because software verification typically is thinking only about correctness of the software in an internal sense, that what happens inside the algorithm between the inputs and outputs meet some specifications.

What we want here is that the combination of the algorithm and the world evolves in ways that we are certain to be pleased about. And that’s a much harder kind of thing. Control theory has that view of what they mean by verification. And they’re able to do very simple linear quadratic regulators and a few other examples. And beyond that, they get stuck. And so I actually think that the testing is probably not very feasible. I mean, I’m not saying we shouldn’t do it, but it’s going to be extremely hard to get any kind of confidence from testing, because you’re really asking, can you simulate the entire world and all the ways a system could use the world to bring about the objective.

However complicated and however multifaceted that objective is, it’s probably going to be the wrong one. So I’ve proposed a, not completely different, but a generalized form of AI that knows that it doesn’t know what the real objectives are. It knows it doesn’t know how humans rank possible futures and that changes the way it behaves, but that also has failure modes.

One of them being the plasticity of human preference rankings over the future and how do you prevent the AI system from taking advantage of that plasticity? You can’t prevent it completely because anything it does is going to have some effect on human preferences. But the question is what constitutes reasonable modifications of human preferences and what constitutes unreasonable ones? We don’t know the answer to that. So there are many, many really difficult research problems that we have to overcome for the research agenda that I’m proposing to have a chance of success.

I’m not that optimistic that this is an easy or a straightforward problem to solve and I think we can only solve it if we go outside the conceptual framework that AI has worked in for the last 70 years.

Steven Pinker: Well, yes. Certainly, if the conceptual framework for AI is optimizing some single or small list of generic goals, like a ranking over possible futures, and it is empowered to pursue them by any means, as opposed to building tools that solve specific problems. But note that you’ve also given arguments why the fantasies of superintelligence are unlikely to come about–the near-miraculous powers to outsmart us, to augment its own intelligence, to defeat all of our attempts to control it. In the scenarios, these all work flawlessly–yet the complexities that make it hard to predict all conceivable failures also make it hard to achieve superintelligence in the first place.

Namely, we can’t take into account the fantastically chaotic and unpredictable reactions of humans. And we can’t program a system that has complete knowledge of the physical universe without allowing it to do experiments and acquire empirical knowledge, at a rate determined by the physical world. Exactly the infirmities that prevent us from exploring the entire space of behavior of one of these systems in advance is the reason that it’s not going to be superintelligent in the way that these scenarios outline.

And that’s a reason not to empower any generic goal-driven system that aspires toward “superintelligence” or that we might think of as “superintelligent”–it is unlikely to exist, and likely to display various forms of error and stupidity.

Stuart Russell: I would agree that some of the concerns that you might see in the X-risk community are, say, nonphysical. So the idea that a system could predict the next hundred years and your entire life in such detail that a hundred years ago, it knew what you were going to be saying at a particular millisecond in a hundred years in the future, this is obviously complete nonsense.

I don’t think we need to be too concerned about that as a serious question. Whether it’s a thought experiment that sheds light on fundamental questions in decision theory, like the Newcomb problems, is another issue that we don’t have to get into. But we can’t solve the problem by saying, “Well, superintelligence of the kind that could lead to significant global consequences could not possibly exist.”

And actually I kind of like Danny Hillis’ argument, which says that actually, no, it already does exist and it already has had, and is having, significant global consequences. And his example is to view, let’s say, the fossil fuel industry as if it were an AI system. I think this is an interesting line of thought, because what he’s saying basically, and other people have said similar things, is that you should think of a corporation as if it’s an algorithm and it’s maximizing a poorly designed objective, which you might say is some discounted stream of quarterly profits or whatever. And it really is doing it in a way that’s oblivious to lots of other concerns of the human race. And it has outwitted the rest of the human race.

So we might all think, well, of course, we know that what it’s trying to do is wrong and of course we all know the right answer, but in fact we’ve lost and we should have pointed out a hundred years ago that there is this risk and it needs to be taken seriously.

And it was. People did point it out a hundred years ago, but no one took them seriously. And this is what happened. So I think we actually have a fairly good example of this type of thing: the optimization of objectives, ignoring externalities as the economists would point out, by superintelligent entities. And in some sense, the fossil fuel industry outwitted us because whatever organizational structures allow large groups of humans to generate effective complex behaviors in the real world and develop complex plans, it operates in some ways like a superintelligent entity, just like we were able to put a person on the moon because of the combined effect of many human intellects working together.

But each of those humans in the fossil fuel industry is a piece of an algorithm if you like and their own individual preferences about the future don’t count for much and in fact, they get molded by their role within the corporation.

I think in some ways you already have an existence proof that the concern is real.

Steven Pinker: A simpler explanation is that people like energy, fossil fuels are the most convenient source, and no one has had to pay for the external damage they do. Clearly we ought to anticipate foreseeable risks and attempt to mitigate them. But they have to be calibrated against what we know, taking into account our own ignorance of the future. It can be hazardous to chase the wrong worries, such as running out of petroleum, which was the big worry in the 1970s. Now we know that the problem with petroleum is too much, not too little. Overpopulation and genetically modified organisms are other examples. 

If we try to fantasize too far into the future, beyond what we can reasonably predict, we can sow fear about the wrong risks. My concern about all these centers and smart people worrying about the existential risk of AI is that we are misallocating our worry budget and our intellectual resources. We should be thinking hard about how to mitigate climate change, which is a real problem. That is less true of spinning exotic scenarios about hypothetical AI systems which have been given control over the physical universe and might enslave us in cancer experiments.

Lucas Perry: All right. So wrapping up here, do you guys have final statements that you’d like to say, just if you felt like what you just said didn’t fully capture what you want to end on on this issue of AI existential risk?

Steven Pinker: Despite our disagreements, most of my assessment of AI agrees with Stuart’s. I personally don’t think that the adjective existential is helpful in ordinary concerns over safety, which we ought to have. I think there are tremendous potential benefits to AI, and that we ought to seek them at the same time as we anticipate the reasonable risks and take every effort to mitigate them.

Stuart Russell: Yep. I mean, it’s hard to disagree that we should focus on the reasonable risks. The question is whether you think that the hundreds of billions of dollars that are being invested into AI research will produce systems that can have potentially global consequences.

And to me it seems self evident that it can and we can look at even simple machine algorithms like the content selection algorithms in social media because those algorithms interact with humans for hours every day and dictate what literally billions of people see and read every day. They are having substantial global impact already.

And they are very, very simple. They don’t know that human beings exist at all, but they still learn to manipulate our brains to optimize the objective. I had a very interesting little Facebook exchange with Yann LeCun. And at some point in the argument, Yann said something quite similar to something Steve said earlier. He said, “There’s really no risk. You’d have to be extremely stupid to put an incorrect objective into a powerful system and then deploy it on a global scale.”

And I said, “Well, you mean like optimizing click-through?” And he said, “Facebook stopped using click-through years ago.” And I said, “Well, why was that?” And he said, “Oh, because it was the incorrect objective.”

So you did put an incorrect objective into a powerful system, deployed on a global scale. Now what does that say about Facebook? So I think just as you might have said — and in fact the nuclear industry did say — “It’s perfectly safe. Nothing can go wrong. We’re the experts. We understand safety. We understand everything.”

Nonetheless, we had Chernobyl, we had Fukushima. And actually, I think there’s an argument to be made that despite the massive environmental cost of foregoing nuclear power, that countries like Germany, Italy, Spain and probably a bunch of others are in the process of actually deciding that we need to phase out nuclear power because even though theoretically, it’s possible to develop and operate completely safe nuclear power systems, it’s beyond our capabilities and the evidence is there.

You might have argued that Russia is corrupt, its technology was not as great as it should have been, they cut lots of corners, but you can’t argue that the Japanese nuclear industry was unsophisticated or unconcerned with safety, and they still failed. And so I think voters in those countries have said, “We don’t want nuclear power because we just don’t want to be in that situation even if we have the engineers making our best efforts.”

These kinds of considerations suggest that we do need to pay very careful attention. I’m not saying we should stop working on climate change, but when we invented synthetic biology, we said okay, we’d better think about how do we prevent the creation of disease or new disease organisms that could produce pandemics. And we took steps. People spent a lot of time thinking about safety mechanisms for those devices. We have to do the same thing for AI.

Lucas Perry: All right. Stuart and Steven, thanks so much. I’ve learned a ton of stuff today. If listeners want to follow you or look into your work, where’s the best place to do that? I’ll start with you, Steven.

Steven Pinker:, which has pages for ten books, including the most recent, Enlightenment Now. SAPinker on Twitter. 

Lucas Perry: And Stuart.

Stuart Russell: So you can Google me. I don’t really have a website or social media activity, but the book Human Compatible, which was published last October by Viking in the US and Penguin in the UK and it’s being translated into lots of languages, I think that captures my views pretty well.

Lucas Perry: All right. Thanks so much for coming on. And yeah, it was a pleasure speaking.

Steven Pinker: Thanks very much, Lucas for hosting it. Thank you Stuart for the dialogue.

Stuart Russell: It was great fun, Steve. I look forward to doing it again.

Steven Pinker: Me too.

End of recorded material

Sam Harris on Global Priorities, Existential Risk, and What Matters Most

Human civilization increasingly has the potential both to improve the lives of everyone and to completely destroy everything. The proliferation of emerging technologies calls our attention to this never-before-seen power — and the need to cultivate the wisdom with which to steer it towards beneficial outcomes. If we’re serious both as individuals and as a species about improving the world, it’s crucial that we converge around the reality of our situation and what matters most. What are the most important problems in the world today and why? In this episode of the Future of Life Institute Podcast, Sam Harris joins us to discuss some of these global priorities, the ethics surrounding them, and what we can do to address them.

Topics discussed in this episode include:

  • The problem of communication 
  • Global priorities 
  • Existential risk 
  • Animal suffering in both wild animals and factory farmed animals 
  • Global poverty 
  • Artificial general intelligence risk and AI alignment 
  • Ethics
  • Sam’s book, The Moral Landscape

You can take a survey about the podcast here

Submit a nominee for the Future of Life Award here



0:00 Intro

3:52 What are the most important problems in the world?

13:14 Global priorities: existential risk

20:15 Why global catastrophic risks are more likely than existential risks

25:09 Longtermist philosophy

31:36 Making existential and global catastrophic risk more emotionally salient

34:41 How analyzing the self makes longtermism more attractive

40:28 Global priorities & effective altruism: animal suffering and global poverty

56:03 Is machine suffering the next global moral catastrophe?

59:36 AI alignment and artificial general intelligence/superintelligence risk

01:11:25 Expanding our moral circle of compassion

01:13:00 The Moral Landscape, consciousness, and moral realism

01:30:14 Can bliss and wellbeing be mathematically defined?

01:31:03 Where to follow Sam and concluding thoughts


You can follow Sam here:

Twitter: @SamHarrisOrg


This podcast is possible because of the support of listeners like you. If you found this conversation to be meaningful or valuable, consider supporting it directly by donating. Contributions like yours make these conversations possible.

All of our podcasts are also now on Spotify and iHeartRadio! Or find us on SoundCloud, iTunes, Google Play and Stitcher.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the Future of Life Institute Podcast. I’m Lucas Perry. Today we have a conversation with Sam Harris where we get into issues related to global priorities, effective altruism, and existential risk. In particular, this podcast covers the critical importance of improving our ability to communicate and converge on the truth, animal suffering in both wild animals and factory farmed animals, global poverty, artificial general intelligence risk and AI alignment, as well as ethics and some thoughts on Sam’s book, The Moral Landscape. 

If you find this podcast valuable, you can subscribe or follow us on your preferred listening platform, like on Apple Podcasts, Spotify, Soundcloud, or whatever your preferred podcasting app is. You can also support us by leaving a review. 

Before we get into it, I would like to echo two announcements from previous podcasts. If you’ve been tuned into the FLI Podcast recently you can skip ahead just a bit. The first is that there is an ongoing survey for this podcast where you can give me feedback and voice your opinion about content. This goes a super long way for helping me to make the podcast valuable for everyone. You can find a link for the survey about this podcast in the description of wherever you might be listening. 

The second announcement is that at the Future of Life Institute we are in the midst of our search for the 2020 winner of the Future of Life Award. The Future of Life Award is a $50,000 prize that we give out to an individual who, without having received much recognition at the time of their actions, has helped to make today dramatically better than it may have been otherwise. The first two recipients of the Future of Life Award were Vasili Arkhipov and Stanislav Petrov, two heroes of the nuclear age. Both took actions at great personal risk to possibly prevent an all-out nuclear war. The third recipient was Dr. Matthew Meselson, who spearheaded the international ban on bioweapons. Right now, we’re not sure who to give the 2020 Future of Life Award to. That’s where you come in. If you know of an unsung hero who has helped to avoid global catastrophic disaster, or who has done incredible work to ensure a beneficial future of life, please head over to the Future of Life Award page and submit a candidate for consideration. The link for that page is on the page for this podcast or in the description of wherever you might be listening. If your candidate is chosen, you will receive $3,000 as a token of our appreciation. We’re also incentivizing the search via MIT’s successful red balloon strategy, where the first to nominate the winner gets $3,000 as mentioned, but there are also tiered pay outs where the first to invite the nomination winner gets $1,500, whoever first invited them gets $750, whoever first invited the previous person gets $375, and so on. You can find details about that on the Future of Life Award page. 

Sam Harris has a PhD in neuroscience from UCLA and is the author of five New York Times best sellers. His books include The End of Faith, Letter to a Christian Nation, The Moral Landscape, Free Will, Lying, Waking Up, and Islam and the Future of Tolerance (with Maajid Nawaz). Sam hosts the Making Sense Podcast and is also the creator of the Waking Up App, which is for anyone who wants to learn to meditate in a modern, scientific context. Sam has practiced meditation for more than 30 years and studied with many Tibetan, Indian, Burmese, and Western meditation teachers, both in the United States and abroad.

And with that, here’s my conversation with Sam Harris.

Starting off here, trying to get a perspective on what matters most in the world and global priorities or crucial areas for consideration, what do you see as the most important problems in the world today?

Sam Harris: There is one fundamental problem which is encouragingly or depressingly non-technical, depending on your view of it. I mean it should be such a simple problem to solve, but it’s seeming more or less totally intractable and that’s just the problem of communication. The problem of persuasion, the problem of getting people to agree on a shared consensus view of reality, and to acknowledge basic facts and to have their probability assessments of various outcomes to converge through honest conversation. Politics is obviously the great confounder of this meeting of the minds. I mean, our failure to fuse cognitive horizons through conversation is reliably derailed by politics. But there are other sorts of ideology that do this just as well, religion being perhaps first among them.

And so it seems to me that the first problem we need to solve, the place where we need to make progress and we need to fight for every inch of ground and try not to lose it again and again is in our ability to talk to one another about what is true and what is worth paying attention to, to get our norms to align on a similar picture of what matters. Basically value alignment, not with superintelligent AI, but with other human beings. That’s the master riddle we have to solve and our failure to solve it prevents us from doing anything else that requires cooperation. That’s where I’m most concerned. Obviously technology influences it, social media and even AI and the algorithms behind the gaming of everyone’s attention. All of that is influencing our public conversation, but it really is a very apish concern and we have to get our arms around it.

Lucas Perry: So that’s quite interesting and not the answer that I was expecting. I think that that sounds like quite the crucial stepping stone. Like the fact that climate change isn’t something that we’re able to agree upon, and is a matter of political opinion drives me crazy. And that’s one of many different global catastrophic or existential risk issues.

Sam Harris: Yeah. The COVID pandemic has made me especially skeptical of our agreeing to do anything about climate change. The fact that we can’t persuade people about the basic facts of epidemiology when this thing is literally coming in through the doors and windows, and even very smart people are now going down the rabbit hole of “this is on some level a hoax,” people’s political and economic interests just bend their view of basic facts. I mean it’s not to say that there hasn’t been a fair amount of uncertainty here, but it’s not the sort of uncertainty that should give us these radically different views of what’s happening out in the world. Here we have a pandemic moving in real time. I mean, where we can see a wave of illness breaking in Italy a few weeks before it breaks in New York. And again, there’s just this Baghdad Bob level of denialism. The prospects of our getting our heads straight with respect to climate change in light of what’s possible in the middle of a pandemic, that seems, at the moment, totally farfetched to me.

For something like climate change, I really think a technological elite needs to just look at the problem and decide to solve it by changing the kinds of products we create and the way we manufacture things and we just have to get out of the politics of it. It can’t be a matter of persuading more than half of American society to make economic sacrifices. It’s much more along the lines of just building cars and other products that are carbon neutral that people want and solving the problem that way.

Lucas Perry: Right. Incentivizing the solution by making products that are desirable and satisfy people’s self-interest.

Sam Harris: Yeah. Yeah.

Lucas Perry: I do want to explore more actual global priorities. This point about the necessity of reason for being able to at least converge upon the global priorities that are most important seems to be a crucial and necessary stepping stone. So before we get into talking about things like existential and global catastrophic risk, do you see a way of this project of promoting reason and good conversation and converging around good ideas succeeding? Or do you have any other things to sort of add to these instrumental abilities humanity needs to cultivate for being able to rally around global priorities?

Sam Harris: Well, I don’t see a lot of innovation beyond just noticing that conversation is the only tool we have. Intellectual honesty spread through the mechanism of conversation is the only tool we have to converge in these ways. I guess the thing to notice that’s guaranteed to make it difficult is bad incentives. So we should always be noticing what incentives are doing behind the scenes to people’s cognition. There are things that could be improved in media. I think the advertising model is a terrible system of incentives for journalists and anyone else who’s spreading information. You’re incentivized to create sensational hot takes and clickbait and depersonalize everything. Just create one lurid confection after another, that really doesn’t get at what’s true. The fact that this tribalizes almost every conversation and forces people to view it through a political lens. The way this is all amplified by Facebook’s business model and the fact that you can sell political ads on Facebook and we use their micro-targeting algorithm to frankly, distort people’s vision of reality and get them to vote or not vote based on some delusion.

All of this is pathological and it has to be disincentivized in some way. The business model of digital media is part of the problem. But beyond that, people have to be better educated and realize that thinking through problems and understanding facts and creating better arguments and responding to better arguments and realizing when you’re wrong, these are muscles that need to be trained, and there are certain environments in which you can train them well. And there’s certain environments where they are guaranteed to atrophy. Education largely consists in the former, in just training someone to interact with ideas and with shared perceptions and with arguments and evidence in a way that is agnostic as to how things will come out. You’re just curious to know what’s true. You don’t want to be wrong. You don’t want to be self-deceived. You don’t want to have your epistemology anchored to wishful thinking and confirmation bias and political partisanship and religious taboos and other engines of bullshit, really.

I mean, you want to be free of all that, and you don’t want to have your personal identity trimming down your perception of what is true or likely to be true or might yet happen. People have to understand what it feels like to be willing to reason about the world in a way that is unconcerned about the normal, psychological and tribal identity formation that most people, most of the time use to filter against ideas. They’ll hear an idea and they don’t like the sound of it because it violates some cherished notion they already have in the bag. So they don’t want to believe it. That should be a tip off. That’s not more evidence in favor of your worldview. That’s evidence that you are an ape who’s disinclined to understand what’s actually happening in the world. That should be an alarm that goes off for you, not a reason to double down on the last bad idea you just expressed on Twitter.

Lucas Perry: Yeah. The way the ego and concern for reputation and personal identity and shared human psychological biases influence the way that we do conversations seems to be a really big hindrance here. And being aware of how your mind is reacting in each moment to the kinetics of the conversation and what is happening can be really skillful for catching unwholesome or unskillful reactions it seems. And I’ve found that non-violent communication has been really helpful for me in terms of having valuable open discourse where one’s identity or pride isn’t on the line. The ability to seek truth with another person instead of have a debate or argument is a skill certainly developed. Yet that kind of format for discussion isn’t always rewarded or promoted as well as something like an adversarial debate, which tends to get a lot more attention.

Sam Harris: Yeah.

Lucas Perry: So as we begin to strengthen our epistemology and conversational muscles so that we’re able to arrive at agreement on core issues, that’ll allow us to create a better civilization and work on what matters. So I do want to pivot here into what those specific things might be. Now I have three general categories, maybe four, for us to touch on here.

The first is existential risks that primarily come from technology, which might lead to the extinction of Earth-originating life, or more specifically just the extinction of human life. You have a TED Talk on AGI risk, that’s artificial general intelligence risk, the risk of machines becoming as smart as or smarter than human beings and being misaligned with human values. There’s also synthetic bio risk where advancements in genetic engineering may unleash a new age of engineered pandemics, which are more lethal than anything that is produced by nature. We have nuclear war, and we also have new technologies or events that might come about that we aren’t aware of or can’t predict yet. And the other categories in terms of global priorities that I want to touch on are global poverty, animal suffering and human health and longevity. So how is it that you think of and prioritize these issues, and what is your reaction to them and their relative importance in the world?

Sam Harris: Well, I’m persuaded that thinking about existential risk is something we should do much more. It is amazing how few people spend time on this problem. It’s a big deal that we have the survival of our species as a blind spot, but I’m more concerned about what seems likelier to me, which is not that we will do something so catastrophically unwise as to erase ourselves, certainly not in the near term. And we’re capable of doing that clearly, but I think it’s more likely we’re capable of ensuring our unrecoverable misery for a good long while. We could just make life basically not worth living, but we’ll be forced or someone will be forced to live it all the while. Basically a Road Warrior-like hellscape could await us as opposed to just pure annihilation. So that’s a civilizational risk that I worry more about than extinction because it just seems probabilistically much more likely to happen no matter how big our errors are.

I worry about our stumbling into an accidental nuclear war. That’s something that I think is still pretty high on the list of likely ways we could completely screw up the possibility of human happiness in the near term. It’s humbling to consider what an opportunity cost this, compared to what’s possible, minor pandemic is, right? I mean, we’ve got this pandemic that has locked down most of humanity and every problem we had and every risk we were running as a species prior to anyone learning the name of this virus is still here. The threat of nuclear war has not gone away. It’s just, this has taken up all of our bandwidth. We can’t think about much else. It’s also humbling to observe how hard a time we’re having, even agreeing about what’s happening here, much less responding intelligently to the problem. If you imagine a pandemic that was orders of magnitude more deadly and more transmissible, man, this is a pretty startling dress rehearsal.

I hope we learn something from this. I hope we think more about things like this happening in the future and prepare for them in advance. I mean, the fact that we have a CDC that still cannot get its act together is just astounding. And again, politics is the thing that is gumming up the gears in any machine that would otherwise run halfway decently at the moment. I mean, we have a truly deranged president and that is not a partisan observation. That is something that can be said about Trump. And it would not be said about most other Republican presidents. There’s nothing I would say about Trump that I could say about someone like Mitt Romney or any other prominent Republican. This is the perfect circumstance to accentuate the downside of having someone in charge who lies more readily than any person in human history perhaps.

It’s like toxic waste at the informational level has been spread around for three years now and now it really matters that we have an information ecosystem that has no immunity against crazy distortions of the truth. So I hope we learn something from this. And I hope we begin to prioritize the list of our gravest concerns and begin steeling our civilization against the risk that any of these things will happen. And some of these things are guaranteed to happen. The thing that’s so bizarre about our failure to grapple with a pandemic of this sort is, this is the one thing we knew was going to happen. This was not a matter of “if.” This was only a matter of “when.” Now nuclear war is still a matter of “if”, right? I mean, we have the bombs, they’re on hair-trigger, overseen by absolutely bizarre and archaic protocols and highly outdated technology. We know this is just a doomsday system we’ve built that could go off at any time through sheer accident or ineptitude. But it’s not guaranteed to go off.

But pandemics are just guaranteed to emerge and we still were caught flat-footed here. And so I just think we need to use this occasion to learn a lot about how to respond to this sort of thing. And again, if we can’t convince the public that this sort of thing is worth paying attention to, we have to do it behind closed doors, right? I mean, we have to get people into power who have their heads screwed on straight here and just ram it through. There has to be a kind of Manhattan Project-level urgency to this, because this is about as benign a pandemic as we could have had, that would still cause significant problems. An engineered virus, a weaponized virus that was calculated to kill the maximum number of people, I mean, that’s a zombie movie, all of a sudden, and we’re not ready for the zombies.

Lucas Perry: I think that my two biggest updates from the pandemic were that human civilization is much more fragile than I thought it was. And also I trust the US government way less now in its capability to mitigate these things. I think at one point you said that 9/11 was the first time that you felt like you were actually in history. And as someone who’s 25, being in the COVID pandemic, this is the first time that I feel like I’m in human history. Because my life so far has been very normal and constrained, and the boundaries between everything have been very rigid and solid, but this is perturbing that.

So you mentioned that you were slightly less worried about humanity just erasing ourselves via some kind of existential risk and part of the idea here seems to be that there are futures that are not worth living. Like if there’s such a thing as a moment or a day that isn’t worth living then there are also futures that are not worth living. So I’m curious if you could unpack why you feel that these periods of time that are not worth living are more likely than existential risks. And if you think that some of those existential conditions could be permanent, and could you speak a little bit about the relative likelihood of existential risk and suffering risks and whether you see the higher likelihood of the suffering risks to be ones that are constrained in time or indefinite.

Sam Harris: In terms of the probabilities, it just seems obvious that it is harder to eradicate the possibility of human life entirely than it is to just kill a lot of people and make the remaining people miserable. Right? If a pandemic spreads, whether it’s natural or engineered, that has 70% mortality and the transmissibility of measles, that’s going to kill billions of people. But it seems likely that it may spare some millions of people or tens of millions of people, even hundreds of millions of people and those people will be left to suffer their inability to function in the style to which we’ve all grown accustomed. So it would be with war. I mean, we could have a nuclear war and even a nuclear winter, but the idea that it’ll kill every last person or every last mammal, it would have to be a bigger war and a worse winter to do that.

So I see the prospect of things going horribly wrong to be one that yields, not a dial tone, but some level of remaining, even civilized life, that’s just terrible, that nobody would want. Where we basically all have the quality of life of what it was like on a mediocre day in the middle of the civil war in Syria. Who wants to live that way? If every city on Earth is basically a dystopian cell on a prison planet, that for me is a sufficient ruination of the hopes and aspirations of civilized humanity. That’s enough to motivate all of our efforts to avoid things like accidental nuclear war and uncontrolled pandemics and all the rest. And in some ways it’s more motivating because when you ask people, what’s the problem with the failure to continue the species, right? Like if we all died painlessly in our sleep tonight, what’s the problem with that?

That actually stumps some considerable number of people because they immediately see that the complete annihilation of the species painlessly is really a kind of victimless crime. There’s no one around to suffer our absence. There’s no one around to be bereaved. There’s no one around to think, oh man, we could have had billions of years of creativity and insight and exploration of the cosmos and now the lights have gone out on the whole human project. There’s no one around to suffer that disillusionment. So what’s the problem? I’m persuaded that that’s not the perfect place to stand to evaluate the ethics. I agree that losing that opportunity is a negative outcome that we want to value appropriately, but it’s harder to value it emotionally and it’s not as clear. I mean it’s also, there’s an asymmetry between happiness and suffering, which I think is hard to get around.

We are perhaps rightly more concerned about suffering than we are about losing opportunities for wellbeing. If I told you, you could have an hour of the greatest possible happiness, but it would have to be followed by an hour of the worst possible suffering. I think most people given that offer would say, oh, well, okay, I’m good. I’ll just stick with what it’s like to be me. The hour of the worst possible misery seems like it’s going to be worse than the highest possible happiness is going to be good and I do sort of share that intuition. And when you think about it, in terms of the future of humanity, I think it is more motivating to think, not that your grandchildren might not exist, but that your grandchildren might live horrible lives, really unendurable lives and they’ll be forced to live them because they’ll be born. If for no other reason than we have to persuade some people to take these concerns seriously, I think that’s the place to put most of the emphasis.

Lucas Perry: I think that’s an excellent point. I think it makes it more morally salient and leverages human self-interest more. One distinction that I want to make is the distinction between existential risks and global catastrophic risks. Global catastrophic risks are those which would kill a large fraction of humanity without killing everyone, and existential risks are ones which would exterminate all people or all Earth-originating intelligent life. And the former risks, the global catastrophic risks, are the ones which you’re primarily discussing here where something goes really bad and now we’re left with some pretty bad existential situation.

Sam Harris: Yeah.

Lucas Perry: Now we’re not locked in that forever. So it’s pretty far away from being what is talked about in the effective altruism community as a suffering risk. That actually might only last a hundred or a few hundred years or maybe less. Who knows. It depends on what happened. But now taking a bird’s eye view again on global priorities and standing on a solid ground of ethics, what is your perspective on longtermist philosophy? This is the position or idea that the deep future has overwhelming moral priority, given the countless trillions of lives that could be lived. So if an existential risk occurs, then we’re basically canceling the whole future like you mentioned. There won’t be any suffering and there won’t be any joy, but we’re missing out on a ton of good it would seem. And with the continued evolution of life, through genetic engineering and enhancements and artificial intelligence, it would seem that the future could also be unimaginably good.

If you do an expected value calculation about existential risks, you can estimate very roughly the likelihood of each existential risk, whether it be from artificial general intelligence or synthetic bio or nuclear weapons or a black swan event that we couldn’t predict. And you multiply that by the amount of value in the future, you’ll get some astronomical number, given the astronomical amount of value in the future. Does this kind of argument or viewpoint do the work for you to commit you to seeing existential risk as a global priority or the central global priority?

Sam Harris: Well, it doesn’t do the emotional work largely because we’re just bad at thinking about long-term risk. It doesn’t even have to be that long-term for our intuitions and concerns to degrade irrationally. We’re bad at thinking about the well-being, even of our future selves as you get further out in time. The term of jargon is that we “hyperbolically discount” our future well-being. People will smoke cigarettes or make other imprudent decisions in the present. They know they will be the inheritors of these bad decisions, but there’s some short-term upside.

The mere pleasure of the next cigarette, say, convinces them that they don’t really have to think long and hard about what their future self will wish they had done at this point. Our ability to be motivated by what we think is likely to happen in the future is even worse when we’re thinking about our descendants. Right? People we either haven’t met yet or may never meet. I have kids, but I don’t have grandkids. How much of my bandwidth is taken up thinking about the kinds of lives my grandchildren will have? Really none. It’s conserved. It’s safeguarded by my concern about my kids, at this point.

But, then there are people who don’t have kids and are just thinking about themselves. It’s hard to think about the comparatively near future. Even a future that, barring some real mishap, you have every expectation of having to live in yourself. It’s just hard to prioritize. When you’re talking about the far future, it becomes very, very difficult. You just have to have the science fiction geek gene or something disproportionately active in your brain, to really care about that.

Unless you think you are somehow going to cheat death and get aboard the starship when it’s finally built. If you’re popping 200 vitamins a day with Ray Kurzweil and you think you might just be in the cohort of people who are going to make it out of here without dying because we’re just on the cusp of engineering death out of the system, then I could see, okay. There’s a self-interested view of it. If you’re really talking about hypothetical people who you know you will never come in contact with, I think it’s hard to be sufficiently motivated, even if you believe the moral algebra here.

It’s not clear to me that it need run through. I agree with you that if you do a basic expected value calculation here, and you start talking about trillions of possible lives, their interests must outweigh the interests of the 7.8 or whatever it is, billion of us currently alive. A few asymmetries here, again. The asymmetry between actual and hypothetical lives, there are no identifiable lives who would be deprived of anything if we all just decided to stop having kids. You have to take the point of view of the people alive who make this decision.

If we all just decided, “Listen. These are our lives to live. We can decide how we want to live them. None of us want to have kids anymore.” If we all independently made that decision, the consequence on this calculus is we are the worst people, morally speaking, who have ever lived. That doesn’t quite capture the moment, the experience or the intentions. We could do this thing without ever thinking about the implications of existential risk. If we didn’t have a phrase for this and we didn’t have people like ourselves talking about this as a problem, people could just be taken in by the overpopulation thesis.

That that’s really the thing that is destroying the world and what we need is some kind of Gaian reset, where the Earth reboots without us. Let’s just stop having kids and let nature reclaim the edges of the cities. You could see a kind of utopian environmentalism creating some dogma around that, where it was no one’s intention ever to create some kind of horrific crime. Yet, on this existential risk calculus, that’s what would have happened. It’s hard to think about the morality there when you talk about people deciding not to have kids and it would be the same catastrophic outcome.

Lucas Perry: That situation to me seems to be like looking over the possible moral landscape and seeing a mountain or not seeing a mountain, but there still being a mountain. Then you can have whatever kinds of intentions that you want, but you’re still missing it. From a purely consequentialist framework on this, I feel not so bad saying that this is probably one of the worst things that have ever happened.

Sam Harris: The asymmetry here between suffering and happiness still seems psychologically relevant. It’s not quite the worst thing that’s ever happened, but the best things that might have happened have been canceled. Granted, I think there’s a place to stand where you could think that is a horrible outcome, but again, it’s not the same thing as creating some hell and populating it.

Lucas Perry: I see what you’re saying. I’m not sure that I quite share the intuition about the asymmetry between suffering and well-being. I feel somewhat suspect about that, but that would be a huge tangent right now, I think. Now, one of the crucial things that you said was, for those that are not really compelled to care about the long-term future argument, if you don’t have the science fiction geek gene and are not compelled by moral philosophy, the essential way it seems to be that you’re able to compel people to care about global catastrophic and existential risk is to demonstrate how they’re very likely within this century.

And so their direct descendants, like their children or grandchildren, or even them, may live in a world that is very bad or they may die in some kind of a global catastrophe, which is terrifying. Do you see this as the primary way of leveraging human self-interest and feelings and emotions to make existential and global catastrophic risk salient and pertinent for the masses?

Sam Harris: It’s certainly half the story, and it might be the most compelling half. I’m not saying that we should be just worried about the downside because the upside also is something we should celebrate and aim for. The other side of the story is that we’ve made incredible progress. Take someone like Steven Pinker and his big books of what is often perceived as happy talk. He’s pointing out all of the progress, morally and technologically and at the level of public health.

It’s just been virtually nothing but progress. There’s no point in history where you’re luckier to live than in the present. That’s true. I think that the thing that Steve’s story conceals, or at least doesn’t spend enough time acknowledging, is that the risk of things going terribly wrong is also increasing. It was also true a hundred years ago that it would have been impossible for one person or a small band of people to ruin life for everyone else.

Now that’s actually possible. Just imagine if this current pandemic were an engineered virus, more like a lethal form of measles. It might take five people to create that and release it. Here we would be locked down in a truly terrifying circumstance. The risk is ramped up. I think we just have to talk about both sides of it. There is no limit to how beautiful life could get if we get our act together. Take an argument of the sort that David Deutsch makes about the power of knowledge.

Every problem has a solution born of a sufficient insight into how things work, i.e. knowledge, unless the laws of physics rule it out. If it’s compatible with the laws of physics, knowledge can solve the problem. That’s virtually a blank check with reality that we could live to cash, if we don’t kill ourselves in the process. Again, as the upside becomes more and more obvious, the risk that we’re going to do something catastrophically stupid is also increasing. The principles here are the same. The only reason why we’re talking about existential risk is because we have made so much progress. Without the progress, there’d be no way to make a sufficiently large mistake. It really is two sides of the coin of increasing knowledge and technical power.

Lucas Perry: One thing that I wanted to throw in here in terms of the kinetics of long-termism and emotional saliency, it would be stupidly optimistic I think, to think that everyone could become selfless bodhisattvas. In terms of your interest, the way in which you promote meditation and mindfulness, and your arguments against the conventional, experiential and conceptual notion of the self, for me at least, have dissolved many of the barriers which would hold me back from being emotionally motivated by long-termism.

Now, that itself I think, is another long conversation. When your sense of self is becoming nudged, disentangled and dissolved in new ways, the idea that it won’t be you in the future, or the idea that the beautiful dreams that Dyson spheres will be having in a billion years are not you, that begins to relax a bit. That’s probably not something that is helpful for most people, but I do think that it’s possible for people to adopt and for meditation, mindfulness and introspection to lead to this weakening of the sense of self, which then also opens one’s optimism, and compassion, and mind towards the long-termist view.

Sam Harris: That’s something that you get from reading Derek Parfit’s work. The paradoxes of identity that he so brilliantly framed and tried to reason through yield something like what you’re talking about. It’s not so important whether it’s you, because this notion of you is in fact, paradoxical to the point of being impossible to pin down. Whether the you that woke up in your bed this morning is the same person who went to sleep in it the night before, that is problematic. Yet there’s this fact of some degree of psychological continuity.

The basic fact experientially is just, there is consciousness and its contents. The only place for feelings, and perceptions, and moods, and expectations, and experience to show up is in consciousness, whatever it is and whatever its connection to the physics of things actually turns out to be. There’s just consciousness. The question of where it appears is a genuinely interesting one philosophically, and intellectually, and scientifically, and ultimately morally.

Because if we build conscious robots or conscious computers and build them in a way that causes them to suffer, we’ve just done something terrible. We might do that inadvertently if we don’t know how consciousness arises based on information processing, or whether it does. It’s all interesting terrain to think about. If the lights are still on a billion years from now, and the view of the universe is unimaginably bright, and interesting and beautiful, and all kinds of creative things are possible by virtue of the kinds of minds involved, that will be much better than any alternative. That’s certainly how it seems to me.

Lucas Perry: I agree. Some things here that ring true seem to be, you always talk about how there’s only consciousness and its contents. I really like the phrase, “Seeing from nowhere.” That usually is quite motivating for me, in terms of the arguments against the conventional conceptual and experiential notions of self. There just seem to be instantiations of consciousness intrinsically free of identity.

Sam Harris: Two things to distinguish here. There’s the philosophical, conceptual side of the conversation, which can show you that things like your concept of a self, or certainly your concept of a self that could have free will, that doesn’t make a lot of sense. It doesn’t make sense when mapped onto physics. It doesn’t make sense when looked for neurologically. Any way you look at it, it begins to fall apart. That’s interesting, but again, it doesn’t necessarily change anyone’s experience.

It’s just a riddle that can’t be solved. Then there’s the experiential side which you encounter more in things like meditation, or psychedelics, or sheer good luck where you can experience consciousness without the sense that there’s a subject or a self in the center of it appropriating experiences. Just a continuum of experience that doesn’t have structure in the normal way. What’s more, that’s not a problem. In fact, it’s the solution to many problems.

A lot of the discomfort you have felt psychologically goes away when you punch through to a recognition that consciousness is just the space in which thoughts, sensations and emotions continually appear, change and vanish. There’s no thinker authoring the thoughts. There’s no experiencer in the middle of the experience. It’s not to say you don’t have a body. Every sign that you have a body is still appearing. There are sensations of tension, warmth, pressure and movement.

There are sights, there are sounds but again, everything is simply an appearance in this condition, which I’m calling consciousness for lack of a better word. There’s no subject to whom it all refers. That can be immensely freeing to recognize, and that’s a matter of a direct change in one’s experience. It’s not a matter of banging your head against the riddles of Derek Parfit or any other way of undermining one’s belief in personal identity or the reification of a self.

Lucas Perry: A little bit earlier, we talked a little bit about the other side of the existential risk coin. Now, the other side of that is what we like to call existential hope at the Future of Life Institute. We’re not just a doom and gloom society. It’s also about how the future can be unimaginably good if we can get our act together and apply the appropriate wisdom to manage and steward our technologies with wisdom and benevolence in mind.

Pivoting in here and reflecting a little bit on the implications of some of this no-self conversation we’ve been having for global priorities, the effective altruism community has narrowed down on three of these global priorities as central issues of consideration: existential risk, global poverty and animal suffering. We talked a bunch about existential risk already. Global poverty is prolific, and many of us live in quite nice and abundant circumstances.

Then there’s animal suffering, which can be thought of in two categories. One being factory-farmed animals, where we have billions upon billions of animals being born into miserable conditions and being slaughtered for sustenance. Then we also have wild animal suffering, which is a bit more esoteric and seems like it’s harder to get any traction on helping to alleviate. Thinking about these last two points, global poverty and animal suffering, what is your perspective on these?

I find the lack of willingness for people to empathize and be compassionate towards animal suffering to be quite frustrating, as well as global poverty, of course. Do you view the perspective of no self as potentially being informative or helpful for leveraging human compassion and motivation to help other people and to help animals? One quick argument here that comes from the conventional view of self, so isn’t strictly true or rational, but is motivating for me, is that I feel like I was just born as me and then I just woke up one day as Lucas.

I, referring to this conventional and experientially illusory notion that I have of myself, this convenient fiction that I have. Now, you’re going to die and you could wake up as a factory-farmed animal. Surely there are those billions upon billions of instantiations of consciousness that are just going through misery. If the self is an illusion then there are selfless chicken and cow experiences of enduring suffering. Any thoughts or reactions you have to global poverty, animal suffering and what I mentioned here?

Sam Harris: I guess the first thing to observe is that again, we are badly set up to prioritize what should be prioritized and to have the emotional response commensurate with what we could rationally understand is so. We have a problem of motivation. We have a problem of making data real. This has been psychologically studied, but it’s just manifest in oneself and in the world. We care more about the salient narrative that has a single protagonist than we do about the data on even human suffering.

The classic example here is one little girl falls down a well, and you get wall-to-wall news coverage. All the while there could be a genocide or a famine killing hundreds of thousands of people, and it doesn’t merit more than five minutes. One broadcast. That’s clearly a bug, not a feature morally speaking, but it’s something we have to figure out how to work with because I don’t think it’s going away. One of the things that the effective altruism philosophy has done, I think usefully, is that it has separated two projects which up until the emergence of effective altruism, I think were more or less always conflated.

They’re both valid projects, but one has much greater moral consequence. The fusion of the two is, the concern about giving and how it makes one feel. I want to feel good about being philanthropic. Therefore, I want to give to causes that give me these good feels. In fact, at the end of the day, the feeling I get from giving is what motivates me to give. If I’m giving in a way that doesn’t really produce that feeling, well, then I’m going to give less or give less reliably.

Even in a contemplative Buddhist context, there’s an explicit fusion of these two things. The reason to be moral and to be generous is not merely, or even principally, the effect on the world. The reason is because it makes you a better person. It gives you a better mind. You feel better in your own skin. It is in fact, more rewarding than being selfish. I think that’s true, but that doesn’t get at really, the important point here, which is we’re living in a world where the difference between having good and bad luck is so enormous.

The inequalities are so shocking and indefensible. The fact that I was born me and not born in some hellhole in the middle of a civil war soon to be orphaned, and impoverished and riddled by disease, I can take no responsibility for the difference in luck there. That difference is the difference that matters more than anything else in my life. What the effective altruist community has prioritized is actually helping the most people, or the most sentient beings.

That is fully divorceable from how something makes you feel. Now, I think it shouldn’t be ultimately divorceable. I think we should recalibrate our feelings or struggle to, so that we do find doing the most good the most rewarding thing in the end, but it’s hard to do. My inability to do it personally, is something that I have just consciously corrected for. I’ve talked about this a few times on my podcast. When Will MacAskill came on my podcast and we spoke about these things, I was convinced at the end of the day, “Well, I should take this seriously.”

I recognize that fighting malaria by sending bed nets to people in sub-Saharan Africa is not a cause I find particularly sexy. I don’t find it that emotionally engaging. I don’t find it that rewarding to picture the outcome. Again, compared to other possible ways of intervening in human misery and producing some better outcome, it’s not the same thing as rescuing the little girl from the well. Yet, I was convinced that, as Will said on that podcast and as organizations like GiveWell attest, giving money to the Against Malaria Foundation was and remains one of the absolute best uses of every dollar to mitigate unnecessary death and suffering.

I just decided to automate my giving to the Against Malaria Foundation because I knew I couldn’t be trusted to wake up every day, or every month or every quarter, whatever it would be, and recommit to that project because some other project would have captured my attention in the meantime. I was either going to give less to it or not give at all, in the end. I’m convinced that we do have to get around ourselves and figure out how to prioritize what a rational analysis says we should prioritize and get the sentimentality out of it, in general.

It’s very hard to escape entirely. I think we do need to figure out creative ways to reformat our sense of reward. The reward we find in helping people has to begin to become more closely coupled to what is actually most helpful. Conversely, the disgust or horror we feel over bad outcomes should be more closely coupled to the worst things that happen. As opposed to just the most shocking, but at the end of the day, minor things. We’re just much more captivated by a sufficiently ghastly story involving three people than we are by the deaths of literally millions that happen some other way. These are bugs we have to figure out how to correct for.

Lucas Perry: I hear you. The person running in the burning building to save the child is sung as a hero, but if you are say, earning to give for example and write enough checks to save dozens of lives over your lifetime, that might not go recognized or felt in the same way.

Sam Harris: And also these are different people, too. It’s also true to say that someone who is psychologically and interpersonally not that inspiring, and certainly not a saint might wind up doing more good than any saint ever does or could. I don’t happen to know Bill Gates. He could be saint-like. I literally never met him, but I don’t get that sense that he is. I think he’s kind of a normal technologist and might be normally egocentric, concerned about his reputation and legacy.

He might be a prickly bastard behind closed doors. I don’t know, but he certainly stands a chance of doing more good than any person in human history at this point, just based on the checks he’s writing and his intelligent prioritization of his philanthropic efforts. There is an interesting uncoupling here where you could just imagine someone who might be a total asshole, but actually does more good than any army of saints you could muster. That’s interesting. That just proves a point that a concern about real world outcomes is divorceable from the psychology that we tend to associate with doing good in the world.

On the point of animal suffering, I share your intuitions there, although again, this is a little bit like climate change in that I think that the ultimate fix will be technological. It’ll be a matter of people producing the Impossible Burger squared that is just so good that no one’s tempted to eat a normal burger anymore, or something like Memphis Meats, which actually, I invested in.

I have no idea where it’s going as a company, but when I had its CEO on my podcast back in the day, Uma Valeti, I just thought, “This is fantastic to engineer actual meat without producing any animal suffering. I hope he can bring this to scale.” At the time, it was like an $18,000 meatball. I don’t know what it is now, but it’s that kind of thing that will close the door to the slaughterhouse more than just convincing billions of people about the ethics. It’s too difficult and the truth may not align with exactly what we want.

I’m going to reap the whirlwind of criticism from the vegan mafia here, but it’s just not clear to me that it’s easy to be a healthy vegan. Forget about yourself as an adult making a choice to be a vegan, raising vegan kids is a medical experiment on your kids of a certain sort and it’s definitely possible to screw it up. There’s just no question about it. If you’re not going to admit that, you’re not a responsible parent.

It is possible, but it is by no means easier to raise healthy vegan kids than it is to raise kids who eat meat sometimes and that’s just a problem, right? Now, that’s a problem that has a technical solution, but there’s still diversity of opinion about what constitutes a healthy human diet even when all things are on the menu. We’re just not there yet. It’s unlikely to be just a matter of supplementing B12.

Then the final point you made does get us into a kind of, I would argue, a reductio ad absurdum of the whole project ethically when you’re talking about losing sleep over whether to protect the rabbits from the foxes out there in the wild. If you’re going to go down that path, and I will grant you, I wouldn’t want to trade places with a rabbit, and there’s a lot of suffering out there in the natural world, but if you’re going to try to figure out how to minimize the suffering of wild animals in relation to other wild animals then I think you are a kind of antinatalist with respect to the natural world. I mean, then it would be just better if these animals didn’t exist, right? Let’s just hit stop on the whole biosphere, if that’s the project.

Then there’s the argument that there are many more ways to suffer and to be happy as a sentient being. Whatever story you want to tell yourself about the promise of future humanity, it’s just so awful to be a rabbit or an insect that if an asteroid hit us and canceled everything, that would be a net positive.

Lucas Perry: Yeah. That’s an actual view that I hear around a bunch. I guess my quick response is as we move farther into the future, if we’re able to reach an existential situation which is secure and where there is flourishing and we’re trying to navigate the moral landscape to new peaks, it seems like we will have to do something about wild animal suffering. With AGI and aligned superintelligence, I’m sure there could be very creative solutions using genetic engineering or something. Our descendants will have to figure that out, whether they are just like, “Are wild spaces really necessary in the future and are wild animals actually necessary, or are we just going to use those resources in space to build more AI that would dream beautiful dreams?”

Sam Harris: I just think it may be, in fact, the case that nature is just a horror show. It is bad almost any place you could be born in the natural world, you’re unlucky to be a rabbit and you’re unlucky to be a fox. We’re lucky to be humans, sort of, and we can dimly imagine how much luckier we might get in the future if we don’t screw up.

I find it compelling to imagine that we could create a world where certainly most human lives are well worth living and better than most human lives ever were. Again, I follow Pinker in feeling that we’ve sort of done that already. It’s not to say that there aren’t profoundly unlucky people in this world, and it’s not to say that things couldn’t change in a minute for all of us, but life has gotten better and better for virtually everyone when you compare us to any point in the past.

If we get to the place you’re imagining where we have AGI that we have managed to align with our interests and we’re migrating into spaces of experience that changes everything, it’s quite possible we will look back on the “natural world” and be totally unsentimental about it, which is to say, we could compassionately make the decision to either switch it off or no longer provide for its continuation. It’s like that’s just a bad software program that evolution designed and wolves and rabbits and bears and mice, they were all unlucky on some level.

We could be wrong about that, or we might discover something else. We might discover that intelligence is not all it’s cracked up to be, that it’s just this perturbation on something that’s far more rewarding. At the center of the moral landscape, there’s a peak higher than any other and it’s not one that’s elaborated by lots of ideas and lots of creativity and lots of distinctions, it’s just this great well of bliss that we actually want to fully merge with. We might find out that the cicadas were already there. I mean, who knows how weird this place is?

Lucas Perry: Yeah, that makes sense. I totally agree with you and I feel this is true. I also feel that there’s some price that is paid because there’s already some stigma around even thinking this. I think it’s a really early idea to have in terms of the history of human civilization, so people’s initial reaction is like, “Ah, what? Nature’s so beautiful and why would you do that to the animals?” Et cetera. We may come to find out that nature is just very net negative, but I could be wrong and maybe it would be around neutral or better than that, but that would require a more robust and advanced science of consciousness.

Just hitting on this next one fairly quickly, effective altruism is interested in finding new global priorities and causes. They call this “Cause X,” something that may be a subset of existential risk, or something other than existential risk or global poverty or animal suffering, but that probably still has to do with the suffering of sentient beings. Do you think that a possible candidate for Cause X would be machine suffering or the suffering of other non-human conscious things that we’re completely unaware of?

Sam Harris: Yeah, well, I think it’s a totally valid concern. Again, it’s one of these concerns that’s hard to get your moral intuitions tuned up to respond to. People have a default intuition that a conscious machine is impossible, that substrate independence, on some level, is impossible, they’re making an assumption without ever doing it explicitly… In fact, I think most people would explicitly deny thinking this, but it is implicit in what they then go on to think when you pose the question of the possibility of suffering machines and suffering computers.

That just seems like something that never needs to be worried about and yet the only way to close the door to worrying about it is to assume that consciousness is totally substrate-dependent and that we would never build a machine that could suffer because we’re building machines out of some other material. If we built a machine out of biological neurons, well, then we might be up for condemnation morally because we’ve taken an intolerable risk analogous to creating some human-chimp hybrid or whatever. It’s like obviously, that thing’s going to suffer. It’s an ape of some sort and now it’s in a lab and what sort of monster would do that, right? We would expect the lights to come on in a system of that sort.

If consciousness is the result of information processing on some level, and again, that’s an “if,” we’re not sure that’s the case, and if information processing is truly substrate-independent, and that seems like more than an “if” at this point, we know that’s true, then we could inadvertently build conscious machines. And then the question is: What is it like to be those machines and are they suffering? There’s no way to prevent that on some level.

Certainly, if there’s any relationship between consciousness and intelligence, if building more and more intelligent machines is synonymous with increasing the likelihood that the lights will come on experientially, well, then we’re clearly on that path. It’s totally worth worrying about, but it’s again, judging from what my own mind is like and what my conversations with other people suggest, it seems very hard to care about for people. That’s just another one of these wrinkles.

Lucas Perry: Yeah. I think a good way of framing this is that humanity has a history of committing moral catastrophes because of bad incentives and they don’t even realize how bad the thing is that they’re doing, or they just don’t really care or they rationalize it, like subjugation of women and slavery. We’re in the context of human history and we look back at these people and see them as morally abhorrent.

Now, the question is: What is it today that we’re doing that’s morally abhorrent? Well, I think factory farming is easily one contender and perhaps human selfishness that leads to global poverty and millions of people drowning in shallow ponds is another one that we’ll look back on. With just some foresight towards the future, I agree that machine suffering is intuitively and emotionally difficult to empathize with if your sci-fi gene isn’t turned on. It could be the next thing.

Sam Harris: Yeah.

Lucas Perry: I’d also like to pivot here into AI alignment and AGI. In terms of existential risk from AGI or transformative AI systems, do you have thoughts on public intellectuals who are skeptical of existential risk from AGI or superintelligence? You had a talk about AI risk and I believe you got some flak from the AI community about that. Elon Musk was just skirmishing with the head of AI at Facebook, I think. What is your perspective about the disagreement and confusion here?

Sam Harris: It comes down to a failure of imagination on the one hand and also just bad argumentation. No sane person who’s concerned about this is concerned because they think it’s going to happen this year or next year. It’s not a bet on how soon this is going to happen. For me, it certainly isn’t a bet on how soon it’s going to happen. It’s just a matter of the implications of continually making progress in building more and more intelligent machines. Any progress, it doesn’t have to be Moore’s law, it just has to be continued progress, will ultimately deliver us into relationship with something more intelligent than ourselves.

To think that that is farfetched or is not likely to happen or can’t happen is to assume some things that we just can’t assume. It’s to assume that substrate independence is not in the cards for intelligence. Forget about consciousness. I mean, consciousness is orthogonal to this question. I’m not suggesting that AGI need be conscious, it just needs to be more competent than we are. We already know that our phones are more competent as calculators than we are, they’re more competent chess players than we are. You just have to keep stacking cognitive-information-processing abilities on that and making progress, however incremental.

I don’t see how anyone can be assuming substrate dependence for really any of the features of our mind apart from, perhaps, consciousness. Take the top 200 things we do cognitively, consciousness aside, just as a matter of sheer information-processing and behavioral control and power to make decisions and you start checking those off, those have to be substrate independent: facial recognition, voice recognition, we can already do that in silico. It’s just not something you need meat to do.

We’re going to build machines that get better and better at all of these things and ultimately, they will pass the Turing test and ultimately, it will be like chess or now Go as far as the eye can see, where it will be in relationship to something that is better than we are at everything that we have prioritized, every human competence we have put enough priority in that we took the time to build it into our machines in the first place: theorem-proving in mathematics, engineering software programs. There is no reason why a computer will ultimately not be the best programmer in the end, again, unless you’re assuming that there’s something magical about doing this in meat. I don’t know anyone who’s assuming that.

Arguing about the time horizon is a non sequitur, right? No one is saying that this need happen soon to ultimately be worth thinking about. We know that whatever the time horizon is, it can happen suddenly. We have historically been very bad at predicting when there will be a breakthrough. This is a point that Stuart Russell makes all the time. If you look at what Rutherford said about the nuclear chain reaction being a pipe dream, it wasn’t even 24 hours before Leo Szilard committed the chain reaction to paper and had the relevant breakthrough. We know we can make bad estimates about the time horizon, so at some point, we could be ambushed by a real breakthrough, which suddenly delivers exponential growth in intelligence.

Then there’s a question of just how quickly that could unfold and whether this is something like an intelligence explosion. That’s possible. We can’t know for sure, but you need to find some foothold to doubt whether these things are possible and the footholds that people tend to reach for are either nonexistent or they’re non sequiturs.

Again, the time horizon is irrelevant and yet the time horizon is the first thing you hear from people who are skeptics about this: “It’s not going to happen for a very long time.” Well, I mean, Stuart Russell’s point here, which is, again, it’s just a reframing, but in the persuasion business, reframing is everything. The people who are consoled by this idea that this is not going to happen for 50 years wouldn’t be so consoled if we receive a message from an alien civilization which said, “People of Earth, we will arrive on your humble planet in 50 years. Get ready.”

If that happened, we would be prioritizing our response to that moment differently than the people who think it’s going to take 50 years for us to build AGI are prioritizing their response to what’s coming. We would recognize a relationship with something more powerful than ourselves is in the offing. It’s only reasonable to do that on the assumption that we will continue to make progress.

The point I made in my TED Talk is that the only way to assume we’re not going to continue to make progress is to be convinced of a very depressing thesis. The only way we wouldn’t continue to make progress is if we open the wrong door of the sort that you and I have been talking about in this conversation, if we invoke some really bad roll of the dice in terms of existential risk or catastrophic civilizational failure, and we just find ourselves unable to build better and better computers. I mean, that’s the only thing that would cause us to be unable to do that. Given the power and value of intelligent machines, we will build more and more intelligent machines at almost any cost at this point, so a failure to do it would be a sign that something truly awful has happened.

Lucas Perry: Yeah. From my perspective, the people that are skeptical of substrate independence, I wouldn’t say that those are necessarily AI researchers. Those are regular persons or laypersons who are not computer scientists. I think that’s motivated by mind-body dualism, where one has a conventional and experiential sense of the mind as being non-physical, which may be motivated by popular religious beliefs, but when we get into the area of actual AI researchers, for them, it seems to either be like they’re attacking some naive version of the argument or a straw man or something.

Sam Harris: Like robots becoming spontaneously malevolent?

Lucas Perry: Yeah. It’s either that, or they think that the alignment problem isn’t as hard as it is. They have some intuition, like why the hell would we even release systems that weren’t safe? Why would we not make technology that served us or something? To me, it seems that when there are people from like the mainstream machine-learning community attacking AI alignment and existential risk considerations from AI, it seems like they just don’t understand how hard the alignment problem is.

Sam Harris: Well, they’re not taking seriously the proposition that what we will have built are truly independent minds more powerful than our own. If you actually drill down on what that description means, it doesn’t mean something that is perfectly enslaved by us for all time, I mean, because that is by definition something that couldn’t be more intelligent across the board than we are.

The analogy I use is imagine if dogs had invented us to protect their interests. Well, so far, it seems to be going really well. We’re clearly more intelligent than dogs, they have no idea what we’re doing or thinking about or talking about most of the time, and they see us making elaborate sacrifices for their wellbeing, which we do. I mean, the people who own dogs care a lot about them and make, you could argue, irrational sacrifices to make sure they’re happy and healthy.

But again, back to the pandemic, if we recognize that we had a pandemic that was going to kill the better part of humanity and it was jumping from dogs to people and the only way to stop this is to kill all the dogs, we would kill all the dogs on a Thursday. There’d be some holdouts, but they would lose. The dog project would be over and the dogs would never understand what happened.

Lucas Perry: But that’s because humans aren’t perfectly aligned with dog values.

Sam Harris: But that’s the thing: Maybe it’s a solvable problem, but it’s clearly not a trivial problem because what we’re imagining are minds that continue to grow in power and grow in ways that by definition we can’t anticipate. Dogs can’t possibly anticipate where we will go next, what we will become interested in next, what we will discover next, what we’ll prioritize next. If you’re not imagining minds so vast that we can’t capture their contents ourselves, you’re not talking about the AGI that the people who are worried about alignment are talking about.

Lucas Perry: Maybe this is like a little bit of a nuanced distinction between you or I, but I think that that story that you’re developing there seems to assume that the utility function or the value learning or the objective function of the systems that we’re trying to align with human values is dynamic. It may be the case that you can build a really smart alien mind and it might become super-intelligent, but there are arguments that maybe you could make its alignment stable.

Sam Harris: That’s the thing we have to hope for, right? I’m not a computer scientist, so as far as the doability of this, that’s something I don’t have good intuitions about, but Stuart Russell’s argument that we would need a system whose ultimate value is to more and more closely approximate our current values that would continually, no matter how much its intelligence escapes our own, it would continually remain available to the conversation with us where we say, “Oh, no, no. Stop doing that. That’s not what we want.” That would be the most important message from its point of view, no matter how vast its mind got.

Maybe that’s doable, right, but that’s the kind of thing that would have to be true for the thing to remain completely aligned to us because the truth is we don’t want it aligned to who we used to be and we don’t want it aligned to the values of the Taliban. We want to grow in moral wisdom as well and we want to be able to revise our own ethical codes and this thing that’s smarter than us presumably could help us do that, provided it doesn’t just have its own epiphanies which cancel the value of our own or subvert our own in a way that we didn’t foresee.

If it really has our best interest at heart, but our best interests are best conserved by it deciding to pull the plug on everything, well, then we might not see the wisdom of that. I mean, it might even be the right answer. Now, this is assuming it’s conscious. We could be building something that is actually morally more important than we are.

Lucas Perry: Yeah, that makes sense. Certainly, eventually, we would want it to be aligned with some form of idealized human values and idealized human meta preferences over how value should change and evolve into the deep future. This is known, I think, as “ambitious value learning” and it is the hardest form of value learning. Maybe we can make something safe without doing this level of ambitious value learning, but something like that may be deeper in the future.

Now, as we’ve made moral progress throughout history, we’ve been expanding our moral circle of consideration. In particular, we’ve been doing this farther into space, deeper into time, across species, and potentially soon, across substrates. What do you see as the central way of continuing to expand our moral circle of consideration and compassion?

Sam Harris: Well, I just think we have to recognize that things like distance in time and space and superficial characteristics, like whether something has a face, much less a face that can make appropriate expressions or a voice that we can relate to, none of these things have moral significance. The fact that another person is far away from you in space right now shouldn’t fundamentally affect how much you care whether or not they’re being tortured or whether they’re starving to death.

Now, it does. We know it does. People are much more concerned about what’s happening on their doorstep, but I think proximity, if it has any weight at all, it has less and less weight the more our decisions obviously affect people regardless of separation in space, and the more it becomes truly easy to help someone on another continent because you can just push a button in your browser, then your caring less about them is clearly a bug. And so it’s just noticing that the things that attenuate our compassion tend to be things that for evolutionary reasons we’re designed to discount in this way, but at the level of actual moral reasoning about a global civilization it doesn’t make any sense and it prevents us from solving the biggest problems.

Lucas Perry: Pivoting into ethics more so now. I’m not sure if this is the formal label that you would use, but your work on the moral landscape lands you, it seems, pretty much in the moral realism category.

Sam Harris: Mm-hmm (affirmative).

Lucas Perry: You’ve said something like, “Put your hand in fire to know what bad is.” That seems to disclose or seems to argue about the self-intimating nature of suffering, about how it’s clearly bad. If you don’t believe me, go and do the suffering things. Other moral realists who I’ve talked to and who have argued for moral realism, like Peter Singer, make similar arguments. What view or theory of consciousness are you most partial to? And how does this inform this perspective about the self-intimating nature of suffering as being a bad thing?

Sam Harris: Well, I’m a realist with respect to morality and consciousness in the sense that I think it’s possible not to know what you’re missing. So if you’re a realist, the property that makes the most sense to me is that there are facts about the world that are facts whether or not anyone knows them. It is possible for everyone to be wrong about something. We could all agree about X and be wrong. That’s the realist position as opposed to pragmatism or some other variant, where it’s all just a matter, it’s all a language game, and the truth value of a statement is just the measure of the work it does in conversation. So with respect to consciousness, I’m a realist in the sense that if a system is conscious, if a cricket is conscious, if a sea cucumber is conscious, they’re conscious whether we know it or not. For the purposes of this conversation, let’s just decide that they’re not conscious, the lights are not on in those systems.

Well, that’s a claim that we could believe, we could all believe it, but we could be wrong about it. And so the facts exceed our experience at any given moment. And so it is with morally salient facts, like the existence of suffering. If a system can be conscious whether I know it or not, a system can be suffering whether I know it or not. And that system could be me in the future or in some counterfactual state. I could think I’m doing the right thing by doing X. But the truth is I would have been much happier had I done Y and I’ll never know that. I was just wrong about the consequences of living in a certain way. That’s what realism on my view entails. So the way this relates to questions of morality and good and evil and right and wrong, this is back to my analogy of the moral landscape, I think morality really is a navigation problem. There are possibilities of experience in this universe and we don’t even need the concept of morality, we don’t need the concept of right and wrong and good and evil really.

That’s shorthand for, in my view, the way we should talk about the burden that’s on us in each moment to figure out what we should do next. Where should we point ourselves across this landscape of mind and possible minds? And knowing that it’s possible to move in the wrong direction, and what does it mean to be moving in the wrong direction? Well, it’s moving in a direction where everything is getting worse and worse and everything that was good a moment ago is breaking down to no good end. You could conceive of moving down a slope on the moral landscape only to ascend some higher peak. That’s intelligible to me that we might have to all move in the direction that seems to be making things worse but it is a sacrifice worth making because it’s the only way to get to something more beautiful and more stable.

I’m not saying that’s the world we’re living in, but it certainly seems like a possible world. But this just doesn’t seem open to doubt. There’s a range of experience on offer. And, on the one end, it’s horrific and painful and all the misery is without any silver lining, right? It’s not like we learn a lot from this ordeal. No, it just gets worse and worse and worse and worse and then we die, and I call that the worst possible misery for everyone. Alright so, the worst possible misery for everyone is bad if anything is bad, if the word bad is going to mean anything, it has to apply to the worst possible misery for everyone. But now some people come in and think they’re doing philosophy when they say things like, “Well, who’s to say the worst possible misery for everyone is bad?” Or, “Should we avoid the worst possible misery for everyone? Can you prove that we should avoid it?” And I actually think those are unintelligible noises that they’re making.

You can say those words, I don’t think you can actually mean those words. I have no idea what that person actually thinks they’re saying. You can play a language game like that but when you actually look at what the words mean, “the worst possible misery for everyone,” to then say, “Well, should we avoid it?” In a world where you should do anything, where the word “should” makes sense, there’s nothing that you should do more than avoid the worst possible misery for everyone. By definition, it’s more fundamental than the concept of should. What I would argue is if you’re hung up on the concept of should, and you’re taken in by Hume’s flippant and ultimately misleading paragraph on, “You can’t get an ought from an is,” you don’t need oughts then. There is just this condition of is. There’s a range of experience on offer, and on the one end it is horrible, on the other end, it is unimaginably beautiful.

And we clearly have a preference for one over the other, if we have a preference for anything. There is no preference more fundamental than escaping the worst possible misery for everyone. If you doubt that, you’re just not thinking about how bad things can get. It’s incredibly frustrating. In this conversation, you’re hearing the legacy of the frustration I’ve felt in talking to otherwise smart and well educated people who think they’re on interesting philosophical ground in doubting whether we should avoid the worst possible misery for everyone. Or that it would be good to avoid it, or perhaps it’s intelligible to have other priorities. And, again, I just think that they’re not understanding the words “worst possible misery” and “everyone”, they’re not letting those words land in language cortex. And if they do, they’ll see that there is no other place to stand where you could have other priorities.

Lucas Perry: Yeah. And my brief reaction to that is, I still honestly feel confused about this. So maybe I’m in the camp of frustrating people. I can imagine other evolutionary timelines where there are minds that just optimize for the worst possible misery for everyone, just because in mind space those minds are physically possible.

Sam Harris: Well, that’s possible. We can certainly create a paperclip maximizer that is just essentially designed to make every conscious being suffer as much as it can. And that would be especially easy to do provided that intelligence wasn’t conscious. If it’s not a matter of its suffering, then yeah, we could use AGI to make things awful for everyone else. You could create a sadistic AGI that wanted everyone else to suffer and it derived immense pleasure from that.

Lucas Perry: Or immense suffering. I don’t see anything intrinsically motivating about suffering as navigating a mind necessarily away from it. Computationally, I can imagine a mind that just suffers as much as possible and spreads that as much as possible. And maybe the suffering is bad in some objective sense, given consciousness realism, and that would be disclosing the intrinsic valence of consciousness in the universe. But the is-ought distinction there still seems confusing to me. Yes, suffering is bad and maybe the worst possible misery for everyone is bad, but that’s not universally motivating for all possible minds.

Sam Harris: The usual problem here is, it’s easy for me to care about my own suffering, but why should I care about the suffering of others? That seems to be the ethical stalemate that people worry about. My response there is that it doesn’t matter. You can take the view from above there and you can just say, “The universe would be better if all the sentient beings suffered less and it would be worse if they suffered more.” And if you’re unconvinced by that, you just have to keep turning the dial to separate those two more and more and more and more so that you get to the extremes. If any given sentient being can’t be moved to care about the experience of others, well, that’s one sort of world, that’s not a peak on the moral landscape. That will be a world where beings are more callous than they would otherwise be in some other corner of the universe. And they’ll bump into each other more and there’ll be more conflict and they’ll fail to cooperate in certain ways that would have opened doors to positive experiences that they now can’t have.

And you can try to use moralizing language about all of this and say, “Well, you still can’t convince me that I should care about people starving to death in Somalia.” But the reality is that an inability to care about that has predictable consequences. If enough people can’t care about that, then certain things become impossible, and those things, if they were possible, would lead to good outcomes that, if you had a different sort of mind, you would enjoy. So all of this bites its own tail in an interesting way when you imagine being able to change a person’s moral intuitions. And then the question is, well, should you change those intuitions? Would it be good to change your sense of what is good? That question has an answer on the moral landscape. It has an answer when viewed as a navigation problem.

Lucas Perry: Right. But isn’t the assumption there that if something leads to a good world, then you should do it?

Sam Harris: Yes. You can even drop your notion of should. There is a finite, I’m sure, but functionally infinite number of worlds on offer, and there are ways to navigate into those spaces. And there are ways to fail to navigate into those spaces. There are ways to try and fail, and worse still, there are ways to not know what you’re missing, to not even know where you should be pointed on this landscape, which is to say, you need to be a realist here. There are possible experiences that are better than any experience that you are going to have, and you are never going to know about them. And granting that, you don’t need a concept of should. Should is just shorthand for how we speak with one another and try to admonish one another to be better in the future in order to cooperate better or to realize different outcomes. But it’s not a deep principle of reality.

What is a deep principle of reality is consciousness and its possibilities. Consciousness is the one thing that can’t be an illusion. Even if we’re in a simulation, even if we’re brains in vats, even if we’re confused about everything, something seems to be happening, and that seeming is the fact of consciousness. And almost as rudimentary as that is the fact that within this space of seemings, again, we don’t know what the base layer of reality is, we don’t know if our physics is the real physics, we could be confused, this could be a dream, we could be confused about literally everything except that in this space of seemings there appears to be a difference between things getting truly awful to no apparent good end and things getting more and more sublime.

And there’s potentially even a place to stand where that difference isn’t so captivating anymore. Certainly, there are Buddhists who would tell you that you can step off that wheel of opposites, ultimately. But even if you buy that, that is some version of a peak on my moral landscape. That is a contemplative peak where the difference between agony and ecstasy is no longer distinguishable because what you are then aware of is just that consciousness is intrinsically free of its content, no matter what its possible content could be. If someone can stabilize that intuition, more power to them, but then that’s the thing you should do, just to bring it back to the conventional moral framing.

Lucas Perry: Yeah. I agree with you. I’m generally a realist about consciousness and still do feel very confused, not just because of reasons in this conversation, but just generally about how causality fits in there and how it might influence our understanding of the worst possible misery for everyone being a bad thing. I’m also willing to go that far to accept that as objectively a bad thing, if bad means anything. But then I still get really confused about how that necessarily fits in with, say, decision theory or “shoulds” in the space of possible minds and what is compelling to whom and why.

Sam Harris: Perhaps this is just semantic. Imagine all these different minds that have different utility functions. The paperclip maximizer wants nothing more than paperclips. And anything that reduces paperclips is perceived as a source of suffering. It has a disutility. If you have any utility function, you have this liking and not liking component, provided you’re sentient. That’s what it is to be motivated consciously. For me, the worst possible misery for everyone is a condition where, whatever the character of your mind, every sentient mind is put in the position of maximal suffering for it. So some things like paperclips and some things hate paperclips. If you hate paperclips, we give you a lot of paperclips. If you like paperclips, we take away all your paperclips. If that’s your mind, we tune your corner of the universe to that torture chamber. You can be agnostic as to what the actual things are that make something suffer. It’s just that suffering is by definition the ultimate frustration of that mind’s utility function.

Lucas Perry: Okay. I think that’s a really, really important crux and crucial consideration between us and a general point of confusion here. Because that’s the definition of what suffering is or what it means. I suspect that those things may be able to come apart. So, for you, maximum disutility and suffering are identical, but I guess I could imagine a utility function being separate or inverse from the hedonics of a mind. Maybe the utility function, which is purely a computational thing, is getting maximally satisfied, maximizing suffering everywhere, and the mind that is implementing that suffering is just completely immiserated while doing it. But the utility function, which is different and inverse from the experience of the thing, is just getting satiated and so the machine keeps driving towards maximum-suffering-world.

Sam Harris: Right, but there’s either something that it is like to be satiated in that way or there isn’t, right? Now, if we’re talking about real conscious society, we’re talking about some higher order satisfaction or pleasure that is not suffering by my definition. We have this utility function ourselves. I mean when you take somebody who decides to climb to the summit of Mount Everest where the process almost every moment along the way is synonymous with physical pain and intermittent fear of death, torture by another name. But the whole project is something that they’re willing to train for, sacrifice for, dream about, and then talk about for the rest of their lives, and at the end of the day might be in terms of their conscious sense o