AI Alignment Podcast: The Byzantine Generals’ Problem, Poisoning, and Distributed Machine Learning with El Mahdi El Mhamdi (Beneficial AGI 2019)

Three generals are voting on whether to attack or retreat from their siege of a castle. One of the generals is corrupt and two of them are not. What happens when the corrupted general sends different answers to the other two generals?

A Byzantine fault is “a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the ‘Byzantine Generals’ Problem’, developed to describe this condition, where actors must agree on a concerted strategy to avoid catastrophic system failure, but some of the actors are unreliable.”

The Byzantine Generals’ Problem and associated issues in maintaining reliable distributed computing networks are illuminating for both AI alignment and modern networks we interact with like YouTube, Facebook, or Google. By exploring this space, we are shown the limits of reliable distributed computing, the safety concerns and threats in this space, and the tradeoffs we will have to make for varying degrees of efficiency or safety.

The Byzantine Generals’ Problem, Poisoning, and Distributed Machine Learning with El Mahdi El Mhamdi is the ninth podcast in the AI Alignment Podcast series, hosted by Lucas Perry. El Mahdi pioneered Byzantine resilient machine learning, devising a series of provably safe algorithms he recently presented at NeurIPS and ICML. Interested in theoretical biology, his work also includes the analysis of error propagation and networks applied to both neural and biomolecular networks. This particular episode was recorded at the Beneficial AGI 2019 conference in Puerto Rico. We hope that you will join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

Topics discussed in this episode include:

  • The Byzantine Generals’ Problem
  • What this has to do with artificial intelligence and machine learning
  • Everyday situations where this is important
  • How systems and models are to update in the context of asynchrony
  • Why it’s hard to do Byzantine resilient distributed ML
  • Why this is important for long-term AI alignment

An overview of Adversarial Machine Learning and where Byzantine-resilient Machine Learning stands on the map is available in this (9min) video. A specific focus on Byzantine Fault Tolerant Machine Learning is available here (~7min).

In particular, El Mahdi argues in the first interview (and in the podcast) that technical AI safety is not only relevant for long term concerns, but is crucial in current pressing issues such as social media poisoning of public debates and misinformation propagation, both of which fall into Poisoning-resilience. Another example he likes to use is social media addiction, which could be seen as a case of (non) Safely Interruptible learning. This value misalignment is already an issue with the primitive forms of AIs that optimize our world today as they maximize our watch-time all over the internet.

The latter (Safe Interruptibility) is another technical AI safety question El Mahdi works on, in the context of Reinforcement Learning. This line of research was initially dismissed as “science fiction”. In this interview (5min), El Mahdi explains why it is a realistic question that arises naturally in reinforcement learning.

El Mahdi’s work on Byzantine-resilient Machine Learning and other relevant topics is available on his Google Scholar profile. A modification of the popular machine learning library TensorFlow, to make it Byzantine-resilient (and also support communication over UDP channels among other things), has been recently open-sourced on GitHub by El Mahdi’s colleagues, based on his algorithmic work we mention in the podcast.

To connect with him over social media

You can listen to the podcast above or read the transcript below.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast series. I’m Lucas Perry, and today we’ll be speaking with El Mahdi El Mhamdi on the Byzantine problem, Byzantine tolerance, and poisoning in distributed learning and computer networks. If you find this podcast interesting or useful, please give it a like and follow us on your preferred listening platform. El Mahdi El Mhamdi pioneered Byzantine resilient machine learning, devising a series of provably safe algorithms he recently presented at NeurIPS and ICML. Interested in theoretical biology, his work also includes the analysis of error propagation and networks applied to both neural and biomolecular networks. With that, El Mahdi’s going to start us off with a thought experiment.

El Mahdi: Imagine you are part of a group of three generals, say, from the Byzantine army surrounding a city you want to invade, but you also want to retreat if retreat is the safest choice for your army. You don’t want to attack when you will lose, so those three generals that you’re part of are on three sides of the city. They sent some intelligence inside the walls of the city, and depending on this intelligence information, they think they will have a good chance of winning and they would like to attack, or they think they will be defeated by the city, so it’s better for them to retreat. Your final decision would be a majority vote, so you communicate through some horsemen that, let’s say, are reliable for the sake of this discussion. But there might be one of you who might have been corrupted by the city.

The situation would be problematic if, say, there are General A, General B, and General C. General A decided to attack. General B decided to retreat based on their intelligence for some legitimate reason. A and B are not corrupt, and say that C is corrupt. Of course, A and B can’t figure out who is corrupt. What the corrupt general would do is this: A wanted to attack, so C will tell A, “I also want to attack. I will attack.” Then C will tell General B, “I also want to retreat. I will retreat.” A receives two attack votes and one retreat vote. General B receives two retreat votes and only one attack vote. If they trust everyone and don’t do any double checking, this would be a disaster.
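The double-dealing described above can be sketched in a few lines of Python. This is an illustrative sketch, not something from the interview: the names `majority`, `own_vote`, and `traitor_message` are hypothetical, and the traitor's strategy (echoing whatever the recipient already wants) is just one way to realize the scenario.

```python
from collections import Counter

def majority(votes):
    """Return the most common vote in the list."""
    return Counter(votes).most_common(1)[0][0]

# Decisions of the two honest generals, as in the example.
own_vote = {"A": "attack", "B": "retreat"}

def traitor_message(recipient):
    """Hypothetical strategy for C: agree with whoever is listening."""
    return own_vote[recipient]

# Each honest general tallies its own vote, the other honest vote,
# and the message it received from the corrupt general C.
tally_a = [own_vote["A"], own_vote["B"], traitor_message("A")]
tally_b = [own_vote["B"], own_vote["A"], traitor_message("B")]

decision_a = majority(tally_a)
decision_b = majority(tally_b)
print("A decides:", decision_a)
print("B decides:", decision_b)
```

Both honest generals follow the same honest rule, yet they end up with different decisions, which is exactly the failure the thought experiment is about.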

A will attack alone; B would retreat; C, of course, doesn’t care because he was corrupted by the city. You can tell me they can circumvent that by double checking. For example, A and B can communicate on what C told them. Let’s say that every general communicates with every general on what he decides and also on what the remaining part of the group told him. A will report to B, “General C told me to attack.” Then B would tell A, “General C told me to retreat.” But then A and B wouldn’t have any way of concluding whether the inconsistency is coming from the fact that C is corrupt or that the general reporting on what C told them is corrupt.

I am General A. I have all the valid reasons to think with the same likelihood that C is maybe lying to me or that B might also be lying to me. I can’t know if you are misreporting what C told you. It is enough for the city to corrupt one general if there are three. It’s impossible to come up with an agreement in this situation. You can easily see that this will generalize to having more than three generals, say 100, as soon as the non-corrupt ones are fewer than two-thirds, because what we saw with three generals would happen with the fractions that are not corrupt. Say that you have strictly more than 33 generals out of 100 who are corrupt, so what they can do is they can switch the majority votes on each side.

But worse than that, say that you have 34 corrupt generals and the remaining 66 not corrupt generals. Say that those 66 not corrupt generals were 33 on the attack side, 33 on the retreat side. The problem is that when you are on some side, say that you are on the retreat side, you have in front of you a group of 34 plus 33 in which there’s a majority of malicious ones. This majority can collude. It’s part of the Byzantine hypothesis. The malicious ones can collude and they will report a majority of inconsistent messages about the minority, the 33 ones. You can’t provably realize that the inconsistency is coming from the group of 34 because they are a majority.
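The one-third threshold in this passage can be written as a one-line check. A minimal sketch, assuming the classic bound for unsigned messages (the total number of generals must be strictly greater than three times the number of corrupt ones); the function name `agreement_possible` is hypothetical.

```python
def agreement_possible(total, corrupt):
    """Classic bound for unsigned messages: total must exceed 3 * corrupt."""
    return total > 3 * corrupt

# The cases discussed above: 3 generals with 1 traitor,
# and 100 generals with 33 or 34 traitors.
for total, corrupt in [(3, 1), (4, 1), (100, 33), (100, 34)]:
    verdict = "possible" if agreement_possible(total, corrupt) else "impossible"
    print(f"{total} generals, {corrupt} corrupt: agreement {verdict}")
```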

Lucas: When we’re thinking about, say, 100 persons or 100 generals, why is it that they’re going to be partitioned automatically into these three groups? What if there’s more than three groups?

El Mahdi: Here we’re doing the easiest form of Byzantine agreement. We want to agree on attack versus retreat. When it becomes multi-dimensional, it gets even messier. There are more and more impossibility results. Just like with the binary decision, there is an impossibility theorem on having agreement if you have unsigned messages carried by horsemen. Whenever the corrupt group exceeds 33%, you provably cannot come up with an agreement. There are many variants to this problem, of course, depending on what hypothesis you can assume. Here, without even mentioning it, we were assuming bounded delays. The horsemen would always arrive eventually. If the horsemen could die on the way and you don’t have any way to check whether they arrive or not, you can be waiting forever, because you don’t have any proof that the horsemen died on the way.

You don’t have any mechanism to tell you, “Stop waiting for the horsemen. Stop waiting for the message from General B because the horsemen died.” You can be waiting forever, and there are theorems that show that when you have unbounded delays, and by the way, in distributed computing, whenever you have unbounded delays, we speak about asynchrony. If you have asynchronous communication, there is a very famous theorem that tells you consensus is impossible, not even in the malicious case, but just like in …

Lucas: In the mundane normal case.

El Mahdi: Yes. It’s called the Fischer-Lynch-Paterson theorem.

Lucas: Right, so just to dive down into the crux of the problem, the issue here fundamentally is that when groups of computers or groups of generals or whatever are trying to check who is lying amongst discrepancies and similarities of lists and everyone who’s claiming what is when there appears to be a simple majority within that level of corrupted submissions, then, yeah, you’re screwed.

El Mahdi: Yes. It’s impossible to achieve agreement. There are always fractions of malicious agents above which it is provably impossible to agree. Depending on the situation, it will be a third, or sometimes a half or a quarter, depending on your specifications.

Lucas: If you start tweaking the assumptions behind the thought experiment, then it changes the number of corrupted machines or agents that are required in order to flip the majority and to poison the communication.

El Mahdi: Exactly. But for example, you mentioned something very relevant to today’s discussion, which is what if we were not agreeing on two decisions, retreat, attack. What if we were agreeing on some multi-dimensional decision? Attack or retreat on one dimension and then …

Lucas: Maybe hold, keep the siege going.

El Mahdi: Yeah, just like adding possibilities or dimensions and multi-dimensional agreements. There are even more hopeless results in that direction.

Lucas: There are more like impossibility theorems and issues where these distributed systems are vulnerable to small amounts of systems being corrupt and screwing over the entire distributed network.

El Mahdi: Yes. Maybe now we can slightly move to machine learning.

Lucas: I’m happy to move into machine learning now. We’ve talked about this, and I think our audience can probably tell how this has to do with computers. Yeah, just dive in what this has to do with machine learning and AI and current systems today, and why it even matters for AI alignment.

El Mahdi: As a brief transition, solving the agreement problem, besides this very nice historic thought experiment, is behind the consistency of safety-critical systems like banking systems. Imagine we have a shared account. Maybe you remove $10 from the account and then she or he adds 10% to the account. You remove the $10 in New York and she or he adds the 10% in Los Angeles. The banking system has to agree on the ordering because minus $10 plus 10% is not the same result as plus 10% then minus $10. The final balance of the account would not be the same.
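The point that "minus $10 plus 10%" differs from "plus 10% then minus $10" is easy to check numerically. A minimal sketch, assuming a hypothetical starting balance of $100 and integer cents to avoid rounding noise; the function names are illustrative only.

```python
def withdraw_10_dollars(balance_cents):
    """Remove $10.00 from the balance (amounts are in cents)."""
    return balance_cents - 1_000

def add_10_percent(balance_cents):
    """Add 10% of the current balance (integer cents)."""
    return balance_cents + balance_cents // 10

start = 10_000  # hypothetical starting balance: $100.00

withdraw_first = add_10_percent(withdraw_10_dollars(start))
deposit_first = withdraw_10_dollars(add_10_percent(start))

print("withdraw, then +10%:", withdraw_first / 100)
print("+10%, then withdraw:", deposit_first / 100)
```

The two orderings leave different final balances, so every replica of the account has to agree on a single order of operations.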

Lucas: Right.

El Mahdi: The banking systems are routinely solving decisions that fall into agreement. If you work on some document sharing platform, like Dropbox or Google Docs, whatever, and we are collaboratively writing the document, me and you, the document sharing platform has to, in real time, solve agreements about the ordering of operations so that me and you always keep seeing the same thing. This has to happen while some of the machines that are interconnecting us are failing, whether just failing because there was an electric crash or something, a data center that has lost some machines, or if it was like a restart, a bug or a take away. What we want in distributed computing is communication schemes between machines that guarantee this consistency that comes from agreement as long as some fraction of machines are reliable. What this has to do with artificial intelligence and machine learning reliability is that with some colleagues, we are trying to encompass one of the major issues in machine learning reliability inside the Byzantine fault tolerance umbrella. For example, you take, for instance, poisoning attacks.

Lucas: Unpack what poisoning attacks are.

El Mahdi: For example, imagine you are training a model on what are good videos to recommend given some key word search. If you search for “medical advice for young parents on vaccine,” this is a label. Let’s assume for the sake of simplicity that a video that tells you not to take your kid for vaccines is not what we mean by medical advice for young parents on vaccine because that’s what medical experts agree on. We want our system to learn that anti-vaxxers, like anti-vaccine propaganda, is not what people are searching for when they type those key words, so suppose a world where we care about accuracy, okay? Imagine you want to train a machine learning model that gives you accurate results of your search. Let’s also for the sake of simplicity assume that a majority of people on the internet are honest.

Let’s assume that more than 50% of people are not actively trying to poison the internet. Yeah, this is very optimistic, but let’s assume that. What we can show, and what me and my colleagues started this line of research with, is that you can easily prove that one single malicious agent can provably poison a distributed machine learning scheme. Imagine you are this video sharing platform. Whenever people behave on your platform, this generates what we call gradients, so it updates your model. It only takes a few hyperactive accounts that could generate behavior that is powerful enough to pull what we call the average gradient, because what distributed machine learning is using, at least up to today, if you read the source code of most distributed machine learning frameworks, is always averaging gradients.
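The claim that one single malicious agent can poison an averaging scheme can be illustrated as follows. This is a minimal sketch with made-up two-dimensional gradients, assuming the attacker knows the honest gradients and the number of workers; `honest`, `target`, and `malicious` are hypothetical names.

```python
def average(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(coords) / n for coords in zip(*vectors)]

# Hypothetical gradients sent by four honest workers.
honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]

# The vector the attacker wants the server to end up applying.
target = [-5.0, -5.0]

# With n workers in total, the attacker solves
#   (sum(honest) + malicious) / n == target
n = len(honest) + 1
honest_sum = [sum(coords) for coords in zip(*honest)]
malicious = [n * t - s for t, s in zip(target, honest_sum)]

poisoned = average(honest + [malicious])
print("average without attacker:", average(honest))
print("average with attacker:   ", poisoned)
```

Because the average has no bound on how much one input can move it, a single extreme vector is enough to place the aggregate anywhere.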

Imagine you, Lucas Perry, just googled a video on the Parkland shootings. Then the video sharing platform shows you a video telling you that David Hogg and Emma Gonzalez and those kids behind the March for Our Lives movement are crisis actors. The video labels three kids as crisis actors. It obviously has a wrong label, so it is what I will call a poisoned data point. If you are a non-malicious agent on the video sharing platform, you will dislike the video. You will not approve it. You’re likely to flag it. This should generate a gradient that pushes the model in that direction, so the gradient will update the model in a direction where it stops thinking that this video is relevant for someone searching “Parkland shooting survivors.” What can happen if your machine learning framework is just averaging gradients is that a bunch of hyperactive people on some topic could poison the average and pull it towards the direction where the model is reinforcing this thinking that, “Yeah, those kids are crisis actors.”

Lucas: This is the case because the hyperactive accounts are seen to be given more weight than accounts which are less active in the same space. But this extra weighting that these accounts will get from their hyperactivity in one certain category or space over another, how is the weighting done? Is it just time spent per category or does it have to do with submissions that agree with the majority?

El Mahdi: We don’t even need to go into the details because we don’t know. I’m talking in a general setting where you have a video sharing platform aggregating gradients from behavior. Now, maybe let’s raise the abstraction level. You are doing gradient descent, so you have a loss function that you want to minimize. You have an error function. The error function is the mismatch between what you predict and what the user tells you. The user tells you this is a wrong prediction, and then you move in the direction where the users stop telling you this is the wrong direction. You are doing gradient descent, minimizing the loss function. Users behave, and with their behavior, you generate gradients.

What you do now in the state-of-the-art way of distributed machine learning is that you average all those gradients. Averaging is well known not to be resilient. If you have a room of poor academics earning a few thousand dollars and then a billionaire jumps in the room, if your algorithm reasons with averaging, it will think that this is a room of millionaires because the average salary would be a couple of hundred million. But the median is very obvious to compute when it comes to salaries and scalar numbers, because you can rank them.
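The room-of-academics example can be reproduced with Python's standard `statistics` module. The salary figures below are made up for illustration.

```python
from statistics import mean, median

# Twenty academics earning a few thousand dollars (hypothetical figures).
salaries = [3_000] * 20

# A billionaire jumps in the room.
salaries.append(4_000_000_000)

room_mean = mean(salaries)
room_median = median(salaries)

print(f"mean salary:   {room_mean:,.0f}")
print(f"median salary: {room_median:,.0f}")
```

The mean lands in the hundreds of millions, while the median, obtained by ranking the values and taking the middle one, still describes the academics.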

Lucas: Right.

El Mahdi: You rank numbers and then decide, “Okay, this is the ordering. This is the number that falls in the middle. This is the upper half. This is the lower half and this is the median.” When it becomes high dimensional, the median is a bit tricky. It has some computational issues. Then even if you compute what we call the geometric median, an attacker can still know how to leverage the fact that you’re only approximating it because there’s no closed formula. There’s no closed form to compute the median in that dimension. But worse than that, what we showed in one of our follow-up works is that, because machine learning is done in very, very, very high dimensions, you would have a curse of dimensionality issue that makes it possible for attackers to sneak in without being spotted as being away from the median.

It can still look like the median vector. I take advantage of the fact that those vectors, those gradients, are extremely high dimensional. I would look for all the disagreements. Let’s say you have a group of a couple hundred gradients, and I’m the only malicious one. I would look at the group of correct vectors all updating you somehow in the same direction within some variance. On average, they’re like what we call unbiased estimators of the gradient. When you take out the randomness, the expected value they will give you is the real gradient of the loss function. What I will do as a malicious worker is I will look at the way they are disagreeing slightly on each direction.

I will sum that. I will see that they disagree by this much on direction one. They disagree by this much on direction two. They disagree by this much, epsilon one, epsilon two, epsilon three. I would look for all these small disagreements they have on all the components.

Lucas: Across all dimensions and high dimensional space. [crosstalk 00:16:35]

El Mahdi: Then add that up. It will be my budget, my leeway, my margin to attack you on another direction.

Lucas: I see.

El Mahdi: What we proved is that you have to mix ideas from the geometric median with ideas from the traditional component-wise median, and that those are completely different things. The geometric median is a way to find a median by just minimizing the sum of distances between what you look for and all the vectors that were proposed, and then the component-wise median will do a traditional job of ranking the coordinates. It looks at each coordinate, and then ranks all the propositions, and then looks for the proposition that lies in the middle. What we proved in the follow-up work is that, yeah, the geometric median idea is elegant. It can make you converge, but it can make you converge to something arbitrarily bad decided by the attacker. When you train complex models like neural nets, the landscape you optimize inside is not convex. It’s not like a bowl or a cup where, if you just follow the descending slope, you would end up in the lowest point.
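The two notions of median mentioned here can be contrasted in a short sketch. This is not the aggregation rule from El Mahdi's papers: it only illustrates a component-wise median (rank each coordinate separately) next to an approximate geometric median (minimize the sum of distances), computed here with Weiszfeld's iteration since there is no closed form. The data and the names `coordinate_wise_median` and `geometric_median` are hypothetical.

```python
import math
from statistics import median

def coordinate_wise_median(vectors):
    """Rank each coordinate separately and keep the middle value."""
    return [median(coords) for coords in zip(*vectors)]

def sum_of_distances(point, vectors):
    """Objective that the geometric median minimizes."""
    return sum(math.dist(point, v) for v in vectors)

def geometric_median(vectors, iterations=200, eps=1e-12):
    """Approximate the geometric median with Weiszfeld's iteration."""
    n = len(vectors)
    guess = [sum(coords) / n for coords in zip(*vectors)]  # start at the average
    for _ in range(iterations):
        # Re-weight every vector by the inverse of its distance to the guess.
        weights = [1.0 / max(math.dist(guess, v), eps) for v in vectors]
        total = sum(weights)
        guess = [
            sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(len(guess))
        ]
    return guess

# Four roughly agreeing proposals and one extreme outlier (made-up data).
proposals = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [100.0, -100.0]]

cw = coordinate_wise_median(proposals)
gm = geometric_median(proposals)
print("component-wise median:", cw)
print("geometric median (approx.):", gm)
```

Both aggregates stay near the cluster of agreeing proposals despite the outlier, but they are computed by very different procedures, which is why combining them is not redundant.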

Lucas: Right.

El Mahdi: It’s like a multitude of bowls with different heights.

Lucas: Right, so there’s tons of different local minima across the space.
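A one-dimensional toy example makes the "multitude of bowls" picture concrete. The loss below is a hypothetical quartic chosen only because it has two minima of different depths; it is not taken from the interview.

```python
def loss(x):
    """Toy non-convex loss with two bowls of different depths."""
    return x**4 - 3 * x**2 + x

def grad(x):
    """Derivative of the toy loss."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent: repeatedly step against the slope."""
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

left = gradient_descent(-2.0)   # starts on the left slope
right = gradient_descent(2.0)   # starts on the right slope

print(f"from -2.0: x = {left:.3f}, loss = {loss(left):.3f}")
print(f"from +2.0: x = {right:.3f}, loss = {loss(right):.3f}")
```

Both runs converge, but only the one starting on the left reaches the deeper bowl. Convergence alone says nothing about which minimum you end up in.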

El Mahdi: Exactly. So in the first paper what we showed is that ideas that look like the geometric median are enough to just converge. You converge. You provably converge, but in the follow up what we realized, like something we were already aware of, but not enough in my opinion, is that there is this square root D, this curse of dimensionality that will arise when you compute high dimensional distances, which the attacker can leverage.
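The square root of D factor can be seen with a small calculation: if two vectors differ by the same small amount epsilon on every one of D coordinates, their Euclidean distance is epsilon times the square root of D. A minimal sketch, with a hypothetical `epsilon` of 0.01:

```python
import math

def norm_of_uniform_disagreement(epsilon, dimension):
    """Euclidean norm of a vector whose every coordinate equals epsilon."""
    return math.sqrt(sum(epsilon**2 for _ in range(dimension)))

epsilon = 0.01  # hypothetical per-coordinate disagreement
for d in (1, 100, 10_000, 1_000_000):
    print(f"D = {d:>9}: distance = {norm_of_uniform_disagreement(epsilon, d):.4f}")
```

A disagreement that is negligible on each coordinate adds up to a large distance in high dimension, and that accumulated distance is the leeway an attacker can spend.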

So in what we call the hidden vulnerability of distributed learning, you can have correct vectors, agreeing on one component. Imagine in your head some three axis system.

Let’s say that they are completely in agreement on axis three. But then in axis one, two, so in the plane formed by the axis one and axis two, they have a small disagreement.

What I will do as the malicious agent is that I will leverage this small disagreement, and inject it in axis three. And this will make you go in a slightly modified direction. And instead of going to this very deep, very good minimum, you will go into a local trap that is just close ahead.

And that comes from the fact that loss functions of interesting models are clearly far from being convex. The models are highly dimensional, and the loss function is highly non-convex, and creates a lot of leeway.

Lucas: It creates a lot of local minima spread throughout the space for you to attack the person into.

El Mahdi: Yeah. So convergence is not enough. So we started this research direction by formulating the following question, what does it take to guarantee convergence?

And any scheme that aggregates gradients and guarantees convergence is called Byzantine resilient. But then you can realize that in very high dimensions, and highly non-convex loss functions, is convergence enough? Would you just want to converge?

There are of course people arguing about deep learning models; there’s this famous paper by Anna Choromanska, Yann LeCun, and Gérard Ben Arous about the landscape of neural nets that basically says, “Yeah, very deep local minima of neural nets are somehow as good.”

From an overly simplified point of view, it’s an optimistic paper, that tells you that you shouldn’t worry too much when you optimize neural nets about the fact that gradient descent would not necessarily go to a global like-

Lucas: To a global minima.

El Mahdi: Yeah. Just like, “Stop caring about that.”

Lucas: Because the local minima are good enough for some reason.

El Mahdi: Yeah. I think that’s a not too unfair way to summarize the paper for the sake of this talk, for the sake of this discussion. What we empirically illustrate here, and theoretically support is that that’s not necessarily true.

Because we show that with very low dimensional, not extremely complex models, trained on CIFAR-10 and MNIST, which are toy problems, very easy toy problems, low dimensional models etc. It’s already enough to have those amounts of parameters, let’s say 100,000 parameters or less, so that an attacker would always find a direction to take you each time away, away, away, and then eventually find an arbitrarily bad local minimum. And then you just converge to that.

So convergence is not enough. Not only you have to seek an aggregation rule that guarantees convergence, but you have to seek some aggregation rules that guarantee that you would not converge to something arbitrarily bad. You would keep converging to the same high quality local minimum, whatever that means.

The hidden vulnerability is this high dimensional idea. It’s the fact that because the loss function is highly non-convex, because there’s the high dimensionality, as an attacker I would always find some direction, so the attack goes this way.

Here the threat model is that an attacker can spy on your gradients, generated by the correct workers but cannot talk on their behalf. So I cannot corrupt the messages. Since you asked about the reliability of horsemen or not.

So horsemen are reliable. I can’t talk on your behalf, but I can spy on you. I can see what you are sending to the others, and anticipate.

So I would as an attacker wait for correct workers to generate their gradients, I will gather those vectors, and then I will just do a linear regression on those vectors to find the best direction to leverage the disagreement on the D minus one remaining directions.

So because there would be this natural disagreement, this variance in many directions, I will just do some linear regression and find what is the best direction to keep. And I use the budget I gathered, those epsilons I mentioned earlier, like this D times epsilon on all the directions, to inject it in the direction that will maximize my chances of taking you away from local minima.
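The idea of spending the accumulated disagreement on a single direction can be sketched with random data. This is a simplified illustration under stated assumptions, not the attack from the paper: the honest gradients are Gaussian noise around a made-up true gradient, the attacker picks coordinate 0 arbitrarily (instead of using linear regression to choose the direction), and the names `budget` and `attack` are hypothetical.

```python
import math
import random

rng = random.Random(0)
d, n_honest, eps = 1000, 20, 0.1  # dimension, honest workers, per-coordinate noise

# Honest workers send noisy, unbiased estimates of a made-up true gradient.
true_gradient = [1.0] * d
honest = [[g + rng.gauss(0.0, eps) for g in true_gradient] for _ in range(n_honest)]
center = [sum(coords) / n_honest for coords in zip(*honest)]

# How far honest vectors naturally sit from their own center.
honest_dists = [math.dist(v, center) for v in honest]

# The attacker stays slightly inside that typical distance ...
budget = 0.9 * sum(honest_dists) / n_honest

# ... but spends all of it on one coordinate.
attack = list(center)
attack[0] -= budget

attack_dist = math.dist(attack, center)
max_honest_dev = max(abs(v[0] - center[0]) for v in honest)
attack_dev = abs(attack[0] - center[0])

print(f"typical honest distance to center: {sum(honest_dists) / n_honest:.2f}")
print(f"attacker distance to center:       {attack_dist:.2f}")
print(f"largest honest deviation on coordinate 0: {max_honest_dev:.2f}")
print(f"attacker deviation on coordinate 0:       {attack_dev:.2f}")
```

Judged by overall distance, the attacker looks no more suspicious than an honest worker. Judged coordinate by coordinate, it is an obvious outlier, which is one way to see why a distance-based rule benefits from being combined with a component-wise one.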

So you will converge, as proven in the early papers, but not necessarily to something good. But what we showed here is that if you combine ideas from multidimensional geometric medians, with ideas from single dimensional component-wise median, you improve your robustness.

Of course it comes with a price. You require three quarters of the workers to be reliable.

There is another direction where we expanded this problem, which is asynchrony. And asynchrony arises when as I said in the Byzantine generals setting, you don’t have a bounded delay. In the bounded delay setting, you know that horses arrive at most after one hour.

Lucas: But I have no idea if the computer on the other side of the planet is ever gonna send me that next update.

El Mahdi: Exactly. So imagine you are doing machine learning on smartphones. You are leveraging a set of smartphones all around the globe, and in different bandwidths, and different communication issues etc.

And you don’t want each time to be bottlenecked by the slowest one. So you want to be asynchronous, you don’t want to wait. You’re just like whenever some update is coming, take it into account.

Imagine some very advanced AI scenario, where you send a lot of learners all across the universe, and then they communicate at the speed of light, but some of them are five light minutes away, and some others are two hours and a half away. And you want to learn from all of them, but not necessarily handicap the closest one, because there are some other learners far away.

Lucas: You want to run updates in the context of asynchrony.

El Mahdi: Yes. So you want to update whenever a gradient is popping up.

Lucas: Right. Before we move on, to illustrate the problem again here, it is that the order matters, right? Like in the banking example. Because the 10% plus 10 is different from-

El Mahdi: Yeah. Here the order matters for different reasons. You update me, so you are updating me on the model you got three hours ago. But in the meantime, three different agents updated me on the model, while getting it three minutes ago.

All the agents are communicating through some abstraction they call the server maybe. Like this server receives updates from fast workers.

Lucas: It receives gradients.

El Mahdi: Yeah, gradients. I also call them updates.

Lucas: Okay.

El Mahdi: Because some workers are close to me and very fast, I’ve done maybe 1000 updates, while you were still working and sending me the message.

So when your update arrives, I can tell whether it is very stale, very late, or malicious. So, regarding what we do here, I think it’s very important now to connect back a bit with classic distributed computing.

Byzantine resilience in machine learning is easier than Byzantine resilience in classical distributed computing for one reason, but it is much harder for another reason.

The reason it is easier is that we know what we want to agree on. We want to agree on a gradient. We have a toolbox of calculus that tells us what this looks like. We know that it’s the slope of some loss function that is, in most of today’s models, relatively smooth, differentiable, maybe Lipschitz, with bounded curvature, whatever.

So we know that we are agreeing on vectors that are gradients of some loss function. And we know that there is a majority of workers that will produce vectors that will tell us what a legit vector looks like.

You can find some median behavior, and then come up with filtering criteria that will get rid of the bad gradients. That’s the good news. That’s why it’s easier to do Byzantine resilience in machine learning than to do Byzantine agreement, because agreement is a way harder problem.

The reason why Byzantine resilience is harder in machine learning than in the typical settings you have in distributed computing is that we are dealing with extremely high dimensional data, extremely high dimensional decisions.

So a decision here is to update the model. It is triggered by a gradient. So whenever I accept a gradient, I make a decision. I make a decision to change the model, to take it away from this state, to this new state, by this much.

But this is a multidimensional update. And Byzantine agreement, or Byzantine approximate agreement in higher dimension, has been proven hopeless by Hammurabi Mendes and Maurice Herlihy in an excellent paper in 2013, where they show that you can’t do Byzantine agreement in D dimensions with N agents in less than N to the power D computations, per agent locally.

Of course in their paper, they meant Byzantine agreement on positions. So they were framing it with a motivation saying, “This is N to the power D, but the typical cases we care about in distributed computing are like robots agreeing on a position on a plane, or on a position in a three dimensional space.” So D is two or three.

So N to the power two or N to the power three is fine. But in machine learning D is not two or three, D is a billion or a couple of millions. So N to the power a million is just like, just forget.

And not only that, but also they require … Remember when I told you that Byzantine resilient computing would always have some upper bound on the number of malicious agents?

Lucas: Mm-hmm (affirmative).

El Mahdi: So the number of total agents should exceed D times the number of malicious agents.

Lucas: What is D again sorry?

El Mahdi: Dimension.

Lucas: The dimension. Okay.

El Mahdi: So if you have to agree on D dimensions, like on a billion dimensional decision, you need at least a billion times the number of malicious agents.

So if you have say 100 malicious agents, you need at least 100 billion total number of agents to be resistant. No one is doing distributed machine learning on 100 billion-

Lucas: And this is because the dimensionality is really screwing with the-

El Mahdi: Yes. Byzantine approximate agreement has been provably hopeless. That’s the bad, that’s why the dimensionality of machine learning makes it really important to go away, to completely go away from traditional distributed computing solutions.
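The arithmetic in this exchange can be checked with a small Python sketch. It only encodes the two bounds as they are stated in the conversation (total agents must exceed D times the number of malicious agents, and local cost grows like N to the power D). The function names are illustrative and do not come from the cited paper or from any library.

```python
import math

def min_agents_required(dimension: int, malicious: int) -> int:
    """Smallest total number of agents N satisfying N > dimension * malicious."""
    return dimension * malicious + 1

def log10_local_cost(num_agents: int, dimension: int) -> float:
    """log10 of N**D, the per-agent cost quoted in the conversation."""
    return dimension * math.log10(num_agents)

# Robots agreeing on a position in 3D space, with 100 faulty robots.
print(min_agents_required(3, 100))        # 301

# A billion-dimensional model with 100 malicious workers:
# more than 100 billion agents are needed.
print(min_agents_required(10**9, 100))    # 100000000001

# With 1000 workers and a million dimensions, N**D has about
# three million decimal digits.
print(log10_local_cost(1000, 10**6))
```

With D equal to two or three both numbers stay small, which is why the bound is acceptable for robots and hopeless for model updates.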

Lucas: Okay.

El Mahdi: So we are not doing agreement. We’re not doing agreement, we’re not even doing approximate agreement. We’re doing something-

Lucas: Totally new.

El Mahdi: Not new, totally different.

Lucas: Okay.

El Mahdi: Called gradient descent. It’s not new. It’s as old as Newton. And it comes with good news. It comes with the fact that there are some properties, like some regularity of the loss function, some properties we can exploit.

And so in the asynchronous setting, it becomes even more critical to leverage those differentiability properties. So because we know that we are optimizing a loss function that has some regularities, we can have some good news.

And the good news has to do with curvature. What we do here in asynchronous setting, is not only we ask workers for their gradients, we ask them for their empirical estimate of the curvature.

Lucas: Sorry. They’re estimating the curvature of the loss function, that they’re adding the gradient to?

El Mahdi: They add the gradient to the parameter, not the loss function. So we have a loss function, parameter is the abscissa, you add the gradient to the abscissa to update the model, and then you end up in a different place of the loss function.

So you have to imagine the loss function as like a surface, and then the parameter space as the plane, the horizontal plane below the surface. And depending on where you are in the parameter space, you would be at different heights of the loss function.

Lucas: Wait sorry, so does the gradient depend where you are on this, the bottom plane?

El Mahdi: Yeah [crosstalk 00:29:51]-

Lucas: So then you send an estimate for what you think the slope of the intersection will be?

El Mahdi: Yeah. But for asynchrony, not only that. I will ask you to send me the slope, and your observed empirical growth of the slope.

Lucas: The second derivative?

El Mahdi: Yeah.

Lucas: Okay.

El Mahdi: But the second derivative again in high dimension is very hard to compute. You have to compute the Hessian matrix.

Lucas: Okay.

El Mahdi: That’s something like completely ugly to compute in high dimensional situations because it takes D square computations.

As an alternative we would like you to send us some linear computation in D, not a square computation in D.

So we would ask you to compute your actual gradient, your previous gradient, the difference between them, and normalize it by the difference between models.

So, “Tell us your current gradient, by how much it changed from the last gradient, and divide that by how much you changed the parameter.”

So you would tell us, “Okay, this is my current slope, and okay this is the gradient.” And you will also tell us, “By the way, my slope change relative to my parameter change is this much.”

And this would be some empirical estimation of the curvature. So if you are in a very curved area-

Lucas: Then the estimation isn’t gonna be accurate because the linearity is gonna cut through some of the curvature.

El Mahdi: Yeah but if you are in a very curved area of the loss function, your slope will change a lot.

Lucas: Okay. Exponentially changing the slope.

El Mahdi: Yeah. Because you did a very tiny change in the parameter and it takes a lot of the slope.

Lucas: Yeah. Will change the … Yeah.

El Mahdi: When you are in a non-curved area of the loss function, it’s less harmful for us that you are stale, because you will just technically have the same updates.

If you are in a very curved area of the loss function, your updates being stale is now a big problem. So we want to discard your updates proportionally to your curvature.

So this is the main idea of this scheme in asynchrony, where we would ask workers about their gradient, and their empirical growth rates.
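A minimal Python sketch of the quantity described above: the change in gradient divided by the change in parameters, reduced to a single scalar. The toy quadratic loss and all names are assumptions made for illustration, not code from the paper. The cost is linear in the dimension because only vector differences and norms are computed, with no Hessian.

```python
import math

def empirical_lipschitz(grad, prev_grad, params, prev_params):
    """Return ||grad - prev_grad|| / ||params - prev_params|| as one scalar."""
    grad_change = math.sqrt(sum((g - pg) ** 2 for g, pg in zip(grad, prev_grad)))
    param_change = math.sqrt(sum((p - pp) ** 2 for p, pp in zip(params, prev_params)))
    if param_change == 0.0:
        raise ValueError("parameters did not change between the two steps")
    return grad_change / param_change

def quadratic_grad(params, curvature):
    """Gradient of the toy loss 0.5 * curvature * ||params||^2 (an assumption)."""
    return [curvature * p for p in params]

prev_params = [1.0, 2.0]
params = [0.5, 1.0]

# Same parameter step on a flat loss and on a sharply curved loss.
flat = empirical_lipschitz(
    quadratic_grad(params, 0.5), quadratic_grad(prev_params, 0.5), params, prev_params
)
curved = empirical_lipschitz(
    quadratic_grad(params, 8.0), quadratic_grad(prev_params, 8.0), params, prev_params
)
print(flat, curved)  # roughly 0.5 and 8.0: the estimate recovers the curvature
```

On this toy loss the estimate equals the curvature, so a stale update from the curved region would be flagged as far riskier than one from the flat region.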

And then of course I don’t want to trust you on what you declare, because you can plan to screw me with some gradients, and then declare a legitimate value of the curvature.

I will take those empirical, what we call in the paper empirical Lipschitz-ness. So we ask you for this empirical growth rate, that it’s a scalar, remember? This is very important. It’s a single dimensional number.

And so we ask you about this growth rate, and we ask all of you about growth rates, again assuming the majority is correct. So the majority of growth rates will help us set the median growth rate in a robust manner, because as long as a simple majority is not lying, the median growth rates will always be bounded between two legitimate values of the growth rate.

Lucas: Right because, are you having multiple workers inform you of the same part of your loss function?

El Mahdi: Yes. Even though they do it in an asynchronous manner.

Lucas: Yeah. Then you take the median of all of them.

El Mahdi: Yes. And then we reason by quantiles of the growth rates.

Lucas: Reason by quantiles? What are quantiles?

El Mahdi: The first third, the second third, the third third. Like the first 30%, the second 30%, the third 30%. We will discard the first 30%, discard the last 30%. Anything in the second 30% is safe.

Of course this has some level of pessimism, which is good for safety, but not very good for being fast. Because maybe people are not lying, so maybe the first 30%, and the last 30% are also values we could consider. But for safety reasons we want to be sure.
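The filtering step can be sketched in a few lines of Python. This follows the round numbers used in the conversation (drop the lowest 30% and the highest 30% of declared growth rates); the exact rule in the paper may differ, and the function name and sample values are made up for illustration.

```python
import statistics

def middle_band(growth_rates, cut_percent=30):
    """Indices of workers whose declared growth rate is neither in the
    lowest nor in the highest `cut_percent` percent of all declarations."""
    n = len(growth_rates)
    k = n * cut_percent // 100
    order = sorted(range(n), key=lambda i: growth_rates[i])
    return sorted(order[k:n - k])

# Eight honest workers around 1.0, plus two liars at indices 7 and 8.
rates = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 1.15, 50.0, 0.0001, 1.0]

kept = middle_band(rates)
median_rate = statistics.median(rates)
print(kept)         # [0, 1, 4, 9]
print(median_rate)  # close to 1.025, inside the honest range
```

The two extreme declarations are discarded, along with a few honest ones. That loss of honest workers is the pessimism, and the slowdown, discussed here.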

Lucas: You want to try to get rid of the outliers.

El Mahdi: Possible.

Lucas: Possible outliers.

El Mahdi: Yeah. So we get rid of the first 30%, the last 30%.

Lucas: So this ends up being a more conservative estimate of the loss function?

El Mahdi: Yes. That’s completely right. We explain that in the paper.

Lucas: So there’s a trade off that you can decide-

El Mahdi: Yeah.

Lucas: By choosing what percentiles to throw away.

El Mahdi: Yeah. Safety never comes for free. So here, depending on how good your estimate of the number of potential Byzantine actors is, your level of pessimism will translate into slowdown.

Lucas: Right. And so you can update the amount that you’re cutting off-

El Mahdi: Yeah.

Lucas: Based off of the amount of expected corrupted signals you think you’re getting.

El Mahdi: Yeah. So now imagine a situation where the number of workers is known. You know that you are leveraging 100,000 smartphones doing gradient descent for you. Let’s call that N.

You know that F of them might be malicious. We argue that if F is exceeding the third of N, you can’t do anything. So we are in a situation where F is less than a third. So less than 33,000 workers are malicious, then the slowdown would be F over N, so a third.

What if you are in a situation where you know that your malicious agents are way less than a third? For example you know that you have at most 20 rogue accounts in your video sharing platform.

And your video sharing platform has two billion accounts. So you have two billion accounts.

Lucas: 20 of them are malevolent.

El Mahdi: What we show is that the slowdown would be N minus F divided by N. N is the two billion accounts, F is the 20, and N is again two billion.

So it would be two billion minus 20 divided by two billion, so something like 0.99999999. So you would go almost as fast as the non-Byzantine resilient scheme.
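As a quick check of that ratio, here is a short Python sketch using the (N minus F) over N expression given for the two billion account case. The function name is illustrative, and the guard simply encodes the stated requirement that F stay below a third of N.

```python
from fractions import Fraction

def slowdown_factor(total_workers: int, malicious: int) -> Fraction:
    """(N - F) / N as quoted in the conversation, for 0 <= F < N / 3."""
    if not 0 <= malicious * 3 < total_workers:
        raise ValueError("requires 0 <= F < N / 3")
    return Fraction(total_workers - malicious, total_workers)

# 20 rogue accounts among two billion.
print(float(slowdown_factor(2_000_000_000, 20)))   # 0.99999999

# 33,000 malicious smartphones among 100,000.
print(float(slowdown_factor(100_000, 33_000)))     # 0.67
```

The closer the factor is to one, the closer the resilient scheme runs to the speed of plain averaging.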

So our Byzantine resilient scheme has a slowdown that is very reasonable in situations where F, the number of malicious agents is way less than N, the total number of agents, which is typical in modern…

Today, like if you ask social media platforms, they have a lot of toolkits to prevent people from creating a billion fake accounts. Like you can’t in 20 hours create an army of several million accounts.

None of the mainstream social media platforms today are susceptible to this-

Lucas: Are susceptible to massive corruption.

El Mahdi: Yeah. To this massive account creation. So you know that the number of corrupted accounts is negligible compared to the number of total accounts.

So that’s the good news. The good news is that you know that F is negligible to N. But then the slowdown of our Byzantine resilient methods is also close to one.

But it has the advantage, compared to the state of the art today for training in distributed settings, of not taking the average gradient. And we argued in the very beginning that for those 20 accounts that you could create, it doesn’t take a bot army or whatever, you don’t need to hack into the machines of the social network. You can have a dozen humans sitting somewhere in a house manually creating 20 accounts, training the accounts over time, doing behavior that makes them legitimate for some topics, and then, because your distributed machine learning scheme would average the gradients generated by people’s behavior, end up making it recommend anti-vaccine controversies or anti-Semitic conspiracy theories.

Lucas: So if I have 20 bad gradients and like, 10,000 good gradients for a video, why is it that with averaging 20 bad gradients are messing up the-

El Mahdi: The amplitude. It’s like the billionaire in the room of poor academics.

Lucas: Okay, because the amplitude of each of their accounts is greater than the average of the other accounts?

El Mahdi: Yes.
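The billionaire effect is easy to reproduce in Python. The numbers below are invented for illustration (10,000 small honest updates and 20 large opposing ones, each reduced to a single coordinate), and the median is used only as a simple stand-in for a robust aggregator, not as the algorithm from the papers.

```python
import statistics

def aggregate(updates):
    """Return (mean, median) of a list of scalar updates."""
    return sum(updates) / len(updates), statistics.median(updates)

honest = [1.0] * 10_000        # many small updates pushing one way
attackers = [-1000.0] * 20     # few accounts, huge amplitude, opposite sign

mean_update, median_update = aggregate(honest + attackers)
print(mean_update)    # negative: the 20 attackers flipped the direction
print(median_update)  # 1.0: the median ignores the amplitude
```

Twenty accounts out of 10,020 are enough to reverse the sign of the average, while the median does not move.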

Lucas: The average of other accounts that are going to engage with this thing don’t have as large of an amplitude because they haven’t engaged with this topic as much?

El Mahdi: Yeah, because they’re not super credible on gun control, for example.

Lucas: Yeah, but aren’t there a ton of other accounts with large amplitudes that are going to be looking at the same video and correcting over the-

El Mahdi: Yeah, let’s define large amplitudes. If you come to the video and just like it, that’s a small update. What about you like it, post very engaging comments-

Lucas: So you write a comment that gets a lot of engagement, gets a lot of likes and replies.

El Mahdi: Yeah, that’s how you increase your amplitude. And because you are already doing a good job of becoming the reference on that video-sharing platform when it comes to discussing gun control, the amplitude of your comments is by definition high. Add to that the fact that your comment was posted very early on, and that not only did you comment on the video but you also produced a follow-up video.

Lucas: I see, so the gradient is really determined by a multitude of things that the video-sharing platform is measuring for, and the metrics are like, how quickly you commented, how many people commented and replied to you. Does it also include language that you used?

El Mahdi: Probably. It depends on the social media platform and it depends on the video-sharing platform. What is clear is that there are many schemes by which those 20 accounts created by this dozen people in a house can try to maximize the amplitude of their generated gradients, but this is a way easier problem than the typical problems we have in technical AI safety. This is not value alignment or value loading or coherent extrapolated volition. This is a very easy, tractable problem on which now we have good news, provable results. What’s interesting is the follow-up questions that we are trying to investigate here with my colleagues, the first of which is, we don’t necessarily have a majority of people on the internet promoting vaccines.

Lucas: People that are against things are often louder than people that are not.

El Mahdi: Yeah, makes sense, and sometimes maybe more numerous because they generate content, and the people who think vaccines are safe are not creating content. In some topics it might be safe to say that we have a majority of reasonable, decent people on the internet. But there are some topics in which now even like polls, like the vaccine situation, there’s a surge now of anti-vaccine resentment in western Europe and the US. Ironically this is happening in the developed countries now, because people are so young, they don’t remember the non-vaccinated person. I come from Morocco. My aunt is handicapped by polio, so I grew up seeing what a non-vaccinated person looks like. So young people in the more developed countries never had a living example of the non-vaccinated past.

Lucas: But they do have examples of people that end up with autism and it seems correlated with vaccines.

El Mahdi: Yeah, the anti-vaccine content may just end up being so clickbait, and so provocative, that it gets popular. So this is a topic where the majority hypothesis, which is crucial to poisoning resilience, does not hold. An open follow up we’re onto now is how to combine ideas from reputation metrics, PageRank, et cetera, with poisoning resilience. So for example you have the National Health Institute, the Johns Hopkins Medical Hospital, Harvard Medical School, and I don’t know, the Massachusetts General Hospital having official accounts on some video-sharing platform and then you can spot what they say on some topic because now we are very good at doing semantic analysis of contents.

And know that okay, on the tag vaccines, I know that there’s this bunch of experts and then what you want to make emerge on your platform is some sort of like epistocracy. The power is given to the knowledgeable, like we have in some fields, like in medical regulation. The FDA doesn’t do a majority vote. We don’t have a popular majority vote across the country to tell the FDA whether it should approve this new drug or not. The FDA does some sort of epistocracy where the knowledgeable experts on the topic would vote. So how about mixing ideas from social choice?

Lucas: And topics in which there are experts who can inform.

El Mahdi: Yeah. There’s also a general follow-up of just straight out trying to connect Byzantine resilient learning with social choice, but then there’s another set of follow ups that motivates me even more. We were mentioning workers, workers, people generating accounts on social media, accounts generating gradients. All of that implicitly assumes that the server, the abstraction that’s gathering those gradients, is reliable. What about the aggregating platform itself being deployed on rogue machines? So imagine you are whatever platform doing learning. By the way, whatever we have said from the beginning until now applies as long as you do gradient-based learning. So it can be recommender systems. It can be training some deep reinforcement learning on some super complicated task to beat, I don’t know, the world champion in poker.

We do not care as long as there’s some gradient generation from observing some state, some environmental state, and some reward or some label. It can be supervised, reinforcement, as long as it’s gradient based, all of what I said applies. Imagine now you have this platform leveraging distributed gradient creators, but then the platform itself for security reasons is deployed on several machines for fault tolerance. But then those machines themselves can fail. You have to make the servers agree on the model, despite the fact that a fraction of the workers are not reliable and now a fraction of the servers themselves. This is the most important follow up I’m into now and I think there would be something on arXiv maybe in February or March on that.

And then a third follow up is practical instances of that. So I’ve been describing speculative thought experiments on poisoning systems, and there are actually brilliant master’s students working on exactly that, like on typical recommender system datasets, where you could see that it’s very easy. It really takes just a bunch of active agents to poison a hundred thousand or more. Probably people working on big social media platforms would have ways to assess what I’ve said, and as researchers in academia we could only speculate on what can go wrong on those platforms. So what we could do is take state of the art recommender systems, datasets, and models that are publicly available, and you can show that despite having a large number of reliable recommendation proposers, a small, tiny fraction of proposers can make, I don’t know, like a movie recommendation system recommend the most suicidal triggering film to the most depressed person watching through your platform. So I’m saying, that’s something you don’t want to have.

Lucas: Right. Just wrapping this all up, how do you see this in the context of AI alignment and the future of machine learning and artificial intelligence?

El Mahdi: So I’ve been discussing this here with people in the Beneficial AI conference and it seems that there are two schools of thought. I am still hesitating between the two because I switched within the past three months between the two sides like three times. So one of them thinks that an AGI is by definition resilient to poisoning.

Lucas: Aligned AGI might be by definition.

El Mahdi: Not even aligned. The second school of thought, aligned AGI is Byzantine resilient.

Lucas: Okay, I see.

El Mahdi: Obviously aligned AGI would be poisoning resilient, but let’s just talk about super intelligent AI, not necessarily aligned. So you have a super intelligence, would you include poisoning resilience in the super intelligence definition or not? And one would say that yeah, if you are better than human in whatever task, it means you are also better than human at spotting poison data.

Lucas: Right, I mean the poison data is just messing with your epistemics, and so if you’re super intelligent your epistemics would be less subject to interference.

El Mahdi: But then there is that second school of thought which I switched back again because I find that most people are in the first school of thought now. So I believe that super intelligence doesn’t necessarily include poisoning resilience because of what I call practically time constrained superintelligence. If you have a deadline because of computational complexity, you have to learn something, which can sometimes-

Lucas: Yeah, you want to get things done.

El Mahdi: Yeah, so you want to get it done in a finite amount of time. And because of that you will end up leveraging to speed up your learning. So if a malicious agent just put up bad observations of the environment or bad labeling of whatever is around you, then it can make you learn something else than what you would like as an aligned outcome. I’m strongly on the second side despite many disagreeing with me here. I don’t think super intelligence includes poisoning resilience, because super intelligence would still be built with time constraints.

Lucas: Right. You’re making a tradeoff between safety and computational efficiency.

El Mahdi: Right.

Lucas: It also would obviously seem to matter the kind of world that the ASI finds itself in. If it knows that it’s in a world with no, or very, very, very few malevolent agents that are wanting to poison it, then it can just throw all of this out of the window, but the problem is that we live on a planet with a bunch of other primates that are trying to mess up our machine learning. So I guess just as a kind of fun example in taking it to an extreme, imagine it’s the year 300,000 AD and you have a super intelligence which has sort of spread across space-time and it’s beginning to optimize its cosmic endowment, but it has some sort of uncertainty over space-time as to whether or not there are other super intelligences there who might want to poison its interstellar communication in order to start taking over some of its cosmic endowment. Do you want to just sort of explore?

El Mahdi: Yeah, that was like a thought experiment I proposed earlier to Carl Shulman from the FHI. Imagine some super intelligence reaching a planet where there is a smart form of life emerging from electric communication between plasma clouds. So completely non-carbon, non-silicon based.

Lucas: So if Jupiter made brains on it.

El Mahdi: Yeah, like Jupiter made brains on it just out of electric communication through gas clouds.

Lucas: Yeah, okay.

El Mahdi: And then this form of communication turns out to be smart enough to know that this is a super intelligence reaching the planet to learn about this form of life, and then it would just start trolling it.

Lucas: It’ll start trolling the super intelligence?

El Mahdi: Yeah. So they would come up with an agreement ahead of time, saying, “Yeah, this super intelligence is coming from earth throughout our century to discover how we do things here. Let’s just behave dumbly, or let’s just misbehave.” And then the super intelligence will start collecting data on this life form and then come back to earth saying, “Yeah, they’re just a dumb plasma passive form of nothing interesting.”

Lucas: I mean, you don’t think that within the super intelligence’s model, I mean, we’re talking about it right now so obviously a super intelligence will know this when it leaves that there will be agents that are going to try and trick it.

El Mahdi: That’s the rebuttal, yes. That’s the rebuttal again. Again, how much time does super intelligence have to do inference and draw conclusions? You will always have some time constraints.

Lucas: And you don’t always have enough computational power to model other agents efficiently to know whether or not they’re lying, or …

El Mahdi: You could always come up with a thought experiment with some sort of other form of intelligence, like another super intelligence is trying to-

Lucas: There’s never, ever a perfect computer science, never.

El Mahdi: Yeah, you can say that.

Lucas: Security is never perfect. Information exchange is never perfect. But you can improve it.

El Mahdi: Yeah.

Lucas: Wouldn’t you assume that the complexity of the attacks would also scale? We just have a ton of people working on defense, but if we have an equal amount of people working on attack, wouldn’t we have an equally complex method of poisoning that our current methods would just be overcome by?

El Mahdi: That’s part of the empirical follow-up I mentioned. The one Isabella and I were working on, which is trying to do some sort of min-max game of poisoner versus poisoning resilience learner, adversarial poisoning setting where like a poisoner and then there is like a resilient learner and the poisoner tries to maximize. And what we have so far is very depressing. It turns out that it’s very easy to be a poisoner. Computationally it’s way easier to be the poisoner than to be-

Lucas: Yeah, I mean, in general in the world it’s easier to destroy things than to create order.

El Mahdi: As I said in the beginning, this is a sub-topic of technical AI safety where I believe it’s easier to have tractable formalizable problems for which you can provably have a safe solution.

Lucas: Solution.

El Mahdi: But in very concrete, very short term aspects of that. In March we are going to announce a major update in TensorFlow, which is the standard framework today for doing distributed machine learning, open sourced by Google. So we will announce it, hopefully if everything goes right, at SysML, the systems for machine learning conference, with more empirically focused colleagues. Based on the algorithms I mentioned earlier, which were presented at NeurIPS and ICML over the past two years, they will announce a major update where they basically replaced every averaging inside TensorFlow with those three algorithms I mentioned, Krum and Bulyan and soon Kardam, which constitute our portfolio of Byzantine resilient algorithms.
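For readers who want to see what replacing the average can look like, below is a minimal pure-Python sketch of the Krum selection rule as it is described in the published Krum paper: score each gradient by the summed squared distance to its n - f - 2 nearest neighbours and keep the gradient with the lowest score. This is an illustrative reimplementation under that reading, not the TensorFlow integration mentioned here, and it does not cover Bulyan or Kardam. The sample gradients are made up.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def krum(gradients, num_byzantine):
    """Select one gradient using the Krum rule (requires n >= 2f + 3)."""
    n = len(gradients)
    if n < 2 * num_byzantine + 3:
        raise ValueError("Krum needs n >= 2f + 3 workers")
    num_neighbors = n - num_byzantine - 2
    best_index, best_score = None, float("inf")
    for i, g in enumerate(gradients):
        # Distances from gradient i to every other gradient, closest first.
        distances = sorted(
            squared_distance(g, other)
            for j, other in enumerate(gradients)
            if j != i
        )
        score = sum(distances[:num_neighbors])
        if score < best_score:
            best_index, best_score = i, score
    return gradients[best_index]

# Four honest workers near (1, 1) and one high-amplitude attacker.
gradients = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [100.0, -100.0]]
chosen = krum(gradients, num_byzantine=1)
print(chosen)  # [1.0, 1.05]
```

Unlike the average, the selected vector is always one of the submitted gradients, so a single large outlier cannot drag the update away from the honest cluster.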

Another consequence that comes for free with that is that distributed machine learning frameworks like TensorFlow use TCP/IP as a communication protocol. So TCP/IP has a problem. It’s reliable but it’s very slow. You have to repeatedly resend some messages, et cetera, to guarantee reliability, and we would like to have a faster communication protocol, like UDP. We don’t need to go through those details. But it has some packet drop, so so far there was no version of TensorFlow or any distributed machine learning framework to my knowledge using UDP. They all used TCP/IP because they needed reliable communication, but now because we are Byzantine resilient, we can afford having fast but not completely reliable communication protocols like UDP. So one of the things that come for free with Byzantine resilience is that you can move from heavy-

Lucas: A little bit more computation.

El Mahdi: -yeah, heavy communication protocols like TCP/IP to lighter, faster, more live communication protocols like UDP.

Lucas: Keeping in mind you’re trading off.

El Mahdi: Exactly. Now we have this portfolio of algorithms which can serve many other applications besides just making distributed machine learning faster, like making poisoning resilient, I don’t know, recommender systems for social media, and hopefully making AGI learning poisoning resilient for that matter.

Lucas: Wonderful. So if people want to check out some of your work or follow you on social media, what is the best place to keep up with you?

El Mahdi: Twitter. My handle is El Badhio, so maybe you would have it written down on the description.

Lucas: Yeah, cool.

El Mahdi: Yeah, Twitter is the best way to get in touch.

Lucas: All right. Well, wonderful. Thank you so much for speaking with me today and I’m excited to see what comes out of all this next.

El Mahdi: Thank you. Thank you for hosting this.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: Cooperative Inverse Reinforcement Learning with Dylan Hadfield-Menell (Beneficial AGI 2019)

What motivates cooperative inverse reinforcement learning? What can we gain from recontextualizing our safety efforts from the CIRL point of view? What possible role can pre-AGI systems play in amplifying normative processes?

Cooperative Inverse Reinforcement Learning with Dylan Hadfield-Menell is the eighth podcast in the AI Alignment Podcast series, hosted by Lucas Perry and was recorded at the Beneficial AGI 2019 conference in Puerto Rico. For those of you that are new, this series covers and explores the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, Lucas will speak with technical and non-technical researchers across areas such as machine learning, governance, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Dylan Hadfield-Menell. Dylan is a 5th year PhD student at UC Berkeley advised by Anca Dragan, Pieter Abbeel and Stuart Russell, where he focuses on technical AI alignment research.

Topics discussed in this episode include:

  • How CIRL helps to clarify AI alignment and adjacent concepts
  • The philosophy of science behind safety theorizing
  • CIRL in the context of varying alignment methodologies and its role
  • If short-term AI can be used to amplify normative processes

You can follow Dylan here and find the Cooperative Inverse Reinforcement Learning paper here. You can listen to the podcast above or read the transcript below.

Lucas: Hey everyone, welcome back to the AI Alignment Podcast series. I’m Lucas Perry and today we will be speaking for a second time with Dylan Hadfield-Menell on cooperative inverse reinforcement learning, the philosophy of science behind safety theorizing, CIRL in the context of varying alignment methodologies, and if short term AI can be used to amplify normative processes. This time it just so happened to be an in person discussion at Beneficial AGI 2019, FLI’s sequel to the Beneficial AI 2017 conference at Asilomar.

I have a bunch more conversations that resulted from this conference to post soon and you can find more details about the conference in the coming weeks. As always, if you enjoy this podcast, please subscribe or follow us on your preferred listening platform. As many of you will already know, Dylan is a fifth year Ph.D. student at UC Berkeley, advised by Anca Dragan, Pieter Abbeel, and Stuart Russell, where he focuses on technical AI Alignment research. And so without further ado, I’ll give you Dylan.

Thanks so much for coming on the podcast again, Dylan, it’s been like a year or something. Good to see you again.

Dylan: Thanks. It’s a pleasure to be here.

Lucas: So just to start off, we can go ahead and begin speaking a little bit about your work on cooperative inverse reinforcement learning and whatever sorts of interesting updates or explanation you have there.

Dylan: Thanks. For me, working in cooperative IRL has been a pretty long process, it really sort of dates back to the start of my second year of my PhD when my advisor came back from a yearlong sabbatical and suggested that we entirely change the research direction we were thinking about.

That was to think about AI Alignment and AI Safety and associated concerns that that might bring. And our first attempt at really doing research in that area was to try to formalize what’s the problem that we’re looking at, what is the space of parameters and the space of solutions that we should be thinking about in studying that problem?

And so it led us to write Cooperative Inverse Reinforcement Learning. Since then I’ve had a large amount of conversations where I’ve had incredible difficulty trying to convey what it is that we’re actually trying to do here and what exactly that paper and idea represents with respect to AI Safety.

One of the big updates for me and one of the big changes since we’ve spoken last, is getting a little bit of a handle on really what’s the value of that as the system. So for me, I’ve come around to the point of view that really what we were trying to do with cooperative IRL was to propose an alternative definition of what it means for an AI system to be effective or rational in some sense.

And so there’s a story you can tell about artificial intelligence, which is that we started off and we observed that people were smart and they were intelligent in some way, and then we observed that we could get computers to do interesting things. And this posed the question of can we get computers to be intelligent? We had no idea what that meant, no idea how to actually nail it down and we discovered that in actually trying to program solutions that looked intelligent, we had a lot of challenges.

So one of the big things that we did as a field was to look over next door into the economics department in some sense, to look at the sort of models that they have of decision theoretic rationality, and really looking at homo economicus as an ideal to shoot for. From that perspective, actually a lot of the field of AI has shifted to be about effective implementations of homo economicus.

In my terminology, this is about systems that are effectively individually rational. These are systems that are good at optimizing for their goals, and a lot of the concerns that we have about AI Safety is that systems optimizing for their own goals could actually lead to very bad outcomes for the rest of us. And so what cooperative IRL attempts to do is to understand what it would mean for a human robot system to behave as a rational agent.

In a sense, we’re moving away from having a box drawn around the AI system or the artificial component of the system to having that agent box drawn around the person and the system together, and we’re trying to model the sort of important parts of the value alignment problem in our formulation here. And in this case, we went with the simplest possible set of assumptions which are basically that we have a static set of preferences that are the human’s preferences that they’re trying to optimize. This is effectively the human’s welfare.

The world is fully observable and the robot and the person are both working to maximize the human’s welfare, but there is this information bottleneck, this information asymmetry that’s present, that we think is a fundamental component of the value alignment problem. And so really what cooperative IRL is, is a definition of how a human and a robot system together can be rational in the context of fixed preferences in a fully observable world state.
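
As a rough illustration of the setup described above (this is not code from the cooperative IRL paper), here is a minimal Python sketch. It assumes a hypothetical two-item world in which the human knows their preference parameter `theta`, the robot only holds a prior over it, and both are scored by the same human reward. The names `THETAS`, `REWARD`, `human_policy`, `update_belief` and `robot_action`, and the noise level, are all invented for the example.

```python
# Toy illustration only: all names and numbers are hypothetical.
THETAS = ["likes_coffee", "likes_tea"]
ITEMS = ["coffee", "tea"]

# The human's reward for each (theta, item) pair. Human and robot share
# this objective; only the human knows which theta is the true one.
REWARD = {
    ("likes_coffee", "coffee"): 1.0,
    ("likes_coffee", "tea"): 0.0,
    ("likes_tea", "coffee"): 0.0,
    ("likes_tea", "tea"): 1.0,
}


def human_policy(theta, noise=0.1):
    """Distribution over the item the human reaches for (mostly the preferred one)."""
    best = max(ITEMS, key=lambda item: REWARD[(theta, item)])
    return {item: (1.0 - noise if item == best else noise) for item in ITEMS}


def update_belief(prior, observed_item, noise=0.1):
    """Bayes rule: P(theta | action) is proportional to P(action | theta) * P(theta)."""
    unnorm = {t: prior[t] * human_policy(t, noise)[observed_item] for t in THETAS}
    total = sum(unnorm.values())
    return {t: p / total for t, p in unnorm.items()}


def robot_action(belief):
    """The robot picks the item with the highest expected human reward under its belief."""
    def expected(item):
        return sum(belief[t] * REWARD[(t, item)] for t in THETAS)
    return max(ITEMS, key=expected)


prior = {"likes_coffee": 0.5, "likes_tea": 0.5}
posterior = update_belief(prior, "tea")   # the robot watches the human pick tea
choice = robot_action(posterior)          # and then acts on the human's behalf
print(posterior, choice)
```

The world state is fully visible to both players here; the only thing the robot is missing is `theta`, which is the information asymmetry being described.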

Lucas: There’s a point of metatheory or coming up with models and theory. It seems like the fundamental issue is, given how just insanely complex AI Alignment is, trying to converge on whatever the most efficacious model is is very, very difficult. People keep flicking back and forth about theoretically how we’re actually going to do this, even in very grid world or toy environments. So it seems very, very hard to isolate the best variables or what variables can be sort of modeled and tracked in ways that are going to help us most.

Dylan: So, I definitely think that this is not an accurate model of the world and I think that there are assumptions here which, if not appropriately reexamined, would lead to a mismatch between the real world and things that work in theory.

Lucas: Like human beings having static preferences.

Dylan: So for example, yes, I don’t claim to know what human preferences really are and this theory is not an attempt to say that they are static. It is an attempt to identify a related problem to the one that we’re really faced with, that we can actually make technical and theoretical progress on. That will hopefully lead to insights that may transfer out towards other situations.

I certainly recognize that what I’m calling a theta in that paper is not really the same thing that everyone talks about when we talk about preferences. In talking with philosophers, I’ve discovered, I think, it’s a little bit closer to things like welfare in like a moral philosophy context, which maybe you could think about as being a more static object that you would want to optimize.

In some sense theta really is an encoding of what you would like the system to do, in general is what we’re assuming there.

Lucas: Because it’s static.

Dylan: Yes, and to the extent that you want to have that be changing over time, I think that there’s an interesting theoretical question as to how that actually is different, and what types of changes that leads to and whether or not you can always reduce something with non-static preferences to something with static preferences from a mathematical point of view.

Lucas: I can see how moving from static to changing over time just makes it so much more insanely complex.

Dylan: Yeah, and it’s also really complex at the level of it’s philosophically unclear what the right thing to do is.

Lucas: Yeah, that’s what I mean. Yeah, you don’t even know what it even means to be aligning as the values are changing, like whether or not the agent even thinks that they just moved in the right direction or not.

Dylan: Right, and I also even think I want to point out how uncertain all of these things are. We as people are hierarchical organizations of different behaviors and observation systems and perception systems. And we believe we have preferences, we have a name for that, but there is a sense in which that is ultimately a fiction of some kind.

It’s a useful tool that we have to talk about ourselves to talk about others that facilitates interaction and cooperation. And so given that I do not know the answer to these philosophical questions, what can I try to do as a technical researcher to push the problem forward and to make actual progress?

Lucas: Right, and so it’s sort of again like a metatheoretical point about what people are trying to do right now in the context of AI Alignment. It seems that the best thing for people to be doing is sort of to be coming up with these theoretical models and frameworks, which have a minimum set of assumptions which may be almost like the real world but are not, and then making theoretical progress there that will hopefully in the future transfer, as you said, to other problems as ML and deep learning get better and the other tools get better, so that we’ll actually have the tools to make it work with more complicated assumptions.

Dylan: Yes, I think that’s right. The way that I view this is we had AI as this broad, vague thing. Through the course of AI research, we kind of got to Markov decision processes as a sort of coordinating theory around what it means for us to design good agents, and cooperative IRL is an attempt to take a step from Markov decision processes more closely towards the set of problems that we want to study.

Lucas: Right, and so I think this is like a really interesting point that I actually haven’t talked to anyone else about and if you have a few more words about it, I think it would be really interesting. So just in terms of being a computer scientist and being someone who is working on the emerging theory of a field. I think it’s often unclear what the actual theorizing process is behind how people get to CIRL. How did someone get to debate? How did someone get to iterated amplification?

It seems like you first identify problems which you see to be crucial and then there are some sorts of epistemic and pragmatic heuristics that you apply to try and begin to sculpt a model that might lead to useful insight. Would you have anything to correct or unpack here?

Dylan: I mean, I think that is a pretty good description of a pretty fuzzy process.

Lucas: But like being a scientist or whatever?

Dylan: Yeah. I don’t feel comfortable speaking for scientists in general here, but I could maybe say a little bit more about my particular process, which is that I try to think about how I’m looking at the problem differently from other people based on different motivations and different goals that I have. And I try to lean into how that can push us in different directions. There’s a lot of other really, really smart people who have tried to do lots of things.

You have to maintain an amount of intellectual humility about your ability to out think the historical components of the field. And for me, I think that in particular for AI Safety, it’s thinking about reframing what is the goal that we’re shooting towards as a field.

Lucas: Which we don’t know.

Dylan: We don’t know what those goals are, absolutely. And I think that there is a sense in which the field has not re-examined those goals incredibly deeply. For a little bit, I think that it’s so hard to do anything that looks intelligent in the real world that we’ve been trying to focus on that individually rational Markov decision process model. And I think that a lot of the concerns about AI Safety are really a call for AI as a field to step back and think about what we’re trying to accomplish in the world and how can we actually try to achieve beneficial outcomes for society.

Lucas: Yeah, and I guess there’s like a sociological phenomenon among scientists or people who are committed to empirical things. Reanalyzing what the goal of AI Alignment is leads into the sort of area of moral philosophy and ethics and other things, which for empirical leaning rational people can be distasteful because you can’t just take a telescope to the universe and see like a list of what you ought to do.

And so it seems like people like to defer on these questions. I don’t know. Do you have anything else to add here?

Dylan: Yeah. I think computer scientists in particular are selected to be people who like having boxed off problems that they know how to solve and feel comfortable with, and that leaning into getting more people with a humanities bent into computer science and broadly AI in particular, AI Safety especially is really important and I think that’s a broad call that we’re seeing come from society generally.

Lucas: Yeah, and I think it also might be wrong though to model the humanities questions as those which are not in boxes and cannot be solved. That’s sort of like a logical positivist thing to say, that on one end we have the hard things and you just have to look at the world enough and you’ll figure it out, and then there’s the soft squishy things which deal with abstractions that don’t have real answers, but people with fluffy degrees need to come up with things that seem right but aren’t really right.

Dylan: I think it would be wrong to take what I just said in that direction, and if that’s what it sounds like I definitely want to correct that. I don’t think there is a sense in which computer science is a place where there are easy right answers, and that the people in humanities are sort of waving their hands and sort of fluffing around.

This is sort of leaning into making this a more AI value alignment kind of framing or way of thinking about it. But when I think about bringing AI systems into the world, I think about what things can you afford to get wrong in your specification and which things can you not afford to get wrong in your specification.

In this sense, specifying physics incorrectly is much, much better than specifying the objective incorrectly, at least by default. And the reason for that is that what happens to the world when you push it is a question that you can answer from your observations. And so if you start off in the wrong place, as long as you’re learning and adapting, I can reasonably expect my systems to correct for that. Or at least the goal of successful AI research is that your systems will effectively adapt to that.

However, the task that your system is supposed to do is sort of arbitrary in a very fundamental sense. And from that standpoint, it is on you as the system designer to make sure that objective is specified correctly. When I think about what we want to do as a field, I end up taking a similar lens, in that there’s a sense in which we as researchers and people and society and philosophers and all of it are trying to figure out what we’re trying to do and what we want to task the technology with, and the directions that we want to push it in. And then there are questions of what will the technology be like and how should it function that will be informed by that and shaped by that.

And I think that there is a sense in which that is arbitrary. Now, what is right? That I don’t really know the answer to and I’m interested in having those conversations, but they make me feel uneasy. I don’t trust myself on those questions, and that could mean that I should learn how to feel more uneasy and think about it more and in doing this research I have been kind of forced into some of those conversations.

But I also do think that for me at least I see a difference between what can we do and what should we do. And thinking about what should we do is a really, really hard question that’s different than what can we do.

Lucas: Right. And so I wanna move back towards CIRL, but just to sort of wrap up here on our philosophy of science musings, a thought I had while you were going through that was, at least for now, what I think is fundamentally shared between fields that deal with things that matter is that their concepts deal with meaningfully relevant referents in the world. Like, do your concepts refer to meaningful things?

Putting ontology aside, whatever love means or whatever value alignment means, these are meaningful referents for people, and I guess for now if our concepts are actually referring to meaningful things in the world, then it seems important.

Dylan: Yes, I think so. Although, I’m not totally sure I understood that.

Lucas: Sure, that’s fine. People will say that humanities or philosophy doesn’t have these boxes with like well-defined problems and solutions because they either don’t deal with real things in the world or the concepts are so fuzzy that the problems are sort of invented and illusory. Like how many angels can stand on the head of a pin? Like the concepts don’t work, aren’t real and don’t have real referents, but whatever.

And I’m saying the place where philosophy and ethics and computer science and AI Alignment should at least come together for now is where the concepts have meaningful referents in the world.

Dylan: Yes, that is something that I absolutely buy. Yes, I think there’s a very real sense in which those questions are harder, but that doesn’t mean they’re less real or less important.

Lucas: Yes, that’s the only point I wanted to push against logical positivism.

Dylan: No, I don’t mean to say that the answers are wrong, it’s just that they are harder to prove in a real sense.

Lucas: Yeah. I mean, I don’t even know if they have answers or if they do or if they’re just all wrong, but I’m just open to it and like more excited about everyone coming together thing.

Dylan: Yes, I absolutely agree with that.

Lucas: Cool. So now let’s turn it back into the CIRL. So you began by talking about how you and your advisers were having this conceptual shift and framing, then we got into the sort of philosophy of science behind how different models and theories of alignment go. So from here, whatever else you have to say about CIRL.

Dylan: So I think for me the upshot of concerns about advanced AI systems and negative consequences therein really is a call to recognize that the goal of our field is AI Alignment. That almost any AI that’s not AI Alignment is solving a sub problem, and viewing it only as solving that sub problem is a mistake.

Ultimately, we are in the business of building AI systems that integrate well with humans and human society. And if we don’t take that as a fundamental tenet of the field, I think that we are potentially in trouble, and I think that that is a perspective that I wish was more pervasive throughout artificial intelligence generally.

Lucas: Right, so I think I do want to move into this view where safety is a normal thing, and like Stuart Russell will say, “People who build bridges all care about safety and there aren’t a subsection of bridge builders who work in bridge safety, everyone is part of the bridge safety.” And I definitely want to get into that, but I also sort of want to get a little bit more into CIRL and why you think it’s so motivating and why this theoretical framing and shift is important or illuminating, and what the specific content of it is.

Dylan: The key thing is that what it does is point out that it doesn’t make sense to talk about how well your system is doing without talking about the way in which it was instructed and the type of information that it got. No AI system exists on its own, every AI system has a designer, and it doesn’t make sense to talk about the functioning of that system without also talking about how that designer built it, evaluated it and how well it is actually serving those ends.

And I don’t think this is some brand new idea that no one’s ever known about, I think this is something that is incredibly obvious to practitioners in the field once you point it out. The process whereby a robot learns to navigate a maze or vacuum a room is not: there is an objective and it optimizes it and then it does it.

What it is, is that there is a system designer who writes down an objective, selects an optimization algorithm, observes the final behavior of that optimization algorithm, goes back, modifies the objective, modifies the algorithm, changes hyperparameters, and then runs it again. And there’s this iterative process whereby your system eventually ends up getting to the behavior that you wanted it to have. And AI researchers have tended to draw a box around the thing that we call AI, which is the sort of final component of that.

Lucas: Yeah, it’s because at least subjectively and I guess this is sort of illuminated by meditation and Buddhism, is that if you’re a computer scientist and you’re just completely identified with the process of doing computer science, you’re just identified with the problem. And if you just have a little bit of mindfulness and you’re like, “Okay, I’m in the context of a process where I’m an agent and trying to align another agent,” and if you’re not just completely identified with the process and you see the unfolding of the process, then you can do sort of like more of a meta-analysis which takes a broader view of the problem and can then, I guess hopefully work on improving it.

Dylan: Yeah, I think that’s exactly right, or at least as I understand that, that’s exactly right. And to be a little bit specific about this, we have had these engineering principles and skills that are not in the papers, but they are things that are passed down from grad student to grad student within a lab. There’s institutional knowledge that exists within a company for how you actually verify and validate your systems, and cooperative IRL is an attempt to take all of that sort of structure that AI systems have existed within and try to bring that into the theoretical frameworks that we actually work with.

Lucas: So can you paint a little picture of what the CIRL model looks like?

Dylan: It exists in a sequential decision making context and we assume we have states of the world and a transition diagram that basically tells us how we get to another state given the previous state and actions from the human and the robot. But the important conceptual shift that it makes is the space of solutions that we’re dealing with are combinations of a teaching strategy and a learning strategy.

There is a commitment on the side of the human designers or users of the systems to provide data that is in some way connected to the objectives that they want to be fulfilled. That data can take many forms, it could be in the form of writing down a reward function that ranks a set of alternatives, it could be in the form of providing demonstrations that you expect your system to imitate. It could be in the form of providing binary comparisons between two clearly identified alternatives.

And the other side of the problem is what is the learning strategy that we use? And this is the question of how the robot is actually committing to respond to the observations that we’re giving it about what we want it to do. In the case of a pre-specified proxy reward given a literal interpretation by a reinforcement learning system, let’s say, what the system is committing to doing is optimizing under that set of trajectory rankings and preferences based off the simulation environment that it’s in, or the actual physical environment that it’s exploring.

When we shift to something like inverse reward design, which is a paper that we released last year, what that says is we’d like the system to look at this ranking of alternatives and actually try to blow that up into a larger uncertainty set over the set of possible consistent rankings with that, and then when you go into deployment, you may be able to leverage that uncertainty to avoid catastrophic failures or generally just unexpected behavior.
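
To make the contrast between a literal reading of the proxy reward and an uncertainty set over consistent rewards concrete, here is a small Python sketch. It is a simplification, not the algorithm from the inverse reward design paper: it assumes a hand-picked grid of candidate reward weights, keeps the candidates that reproduce the proxy’s strict ranking on the training trajectories, and then picks the deployment option with the best worst case over that set. All trajectory names, features and numbers are hypothetical.

```python
from itertools import product

# Feature counts per trajectory: (grass, dirt, lava). Hypothetical toy data.
TRAIN_TRAJS = {
    "short_dirt_path": (0, 3, 0),
    "long_grass_path": (6, 0, 0),
    "mixed_path": (2, 2, 0),
}
DEPLOY_TRAJS = {"through_lava": (0, 1, 1), "around_lava": (2, 1, 0)}

# The designer never saw lava in training, so the proxy gives it weight 0.
PROXY_W = (-1.0, -0.5, 0.0)


def score(w, feats):
    """Linear reward: weighted sum of feature counts."""
    return sum(wi * fi for wi, fi in zip(w, feats))


def ranking(w, trajs):
    """Trajectory names ordered from best to worst under weights w."""
    return sorted(trajs, key=lambda name: score(w, trajs[name]), reverse=True)


def is_consistent(w, proxy_order, trajs):
    """True if w strictly reproduces the proxy's ordering of the training trajectories."""
    scores = [score(w, trajs[name]) for name in proxy_order]
    return all(a > b for a, b in zip(scores, scores[1:]))


proxy_order = ranking(PROXY_W, TRAIN_TRAJS)

# Assumed candidate set: a small grid of possible "true" weight vectors.
candidates = list(product([-2.0, -1.0], [-3.0, -1.0, -0.5], [-10.0, 0.0, 1.0]))
consistent = [w for w in candidates if is_consistent(w, proxy_order, TRAIN_TRAJS)]

# Literal interpretation: optimize the proxy as written.
literal_choice = max(DEPLOY_TRAJS, key=lambda n: score(PROXY_W, DEPLOY_TRAJS[n]))

# Cautious interpretation: best worst-case score over every consistent reward.
cautious_choice = max(
    DEPLOY_TRAJS,
    key=lambda n: min(score(w, DEPLOY_TRAJS[n]) for w in consistent),
)
print(literal_choice, cautious_choice)
```

Because lava never shows up in training, the training ranking says nothing about its weight, so the consistent set still contains rewards where lava is very bad. The literal agent walks through it; the cautious one goes around.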

Lucas: So this other point I think that you and I discussed briefly, maybe it was actually with Rohan, but it seems like often in terms of AI Alignment, it’s almost like we’re reasoning from nowhere about abstract agents and that sort of makes the problem extremely difficult. Often, if you just look at human examples, it just becomes super mundane and easy. This sort of conceptual shift can almost I think be framed super simply as like the difference between a teacher trying to teach someone and then a teacher realizing that the teacher is a person that is teaching another student and the teacher can think better about how to teach and then also the process between the teacher and the student and how to improve that at a higher level of abstraction.

Dylan: I think that’s the direction that we’re moving in. What I would say is that as AI practitioners, we are teaching our systems how to behave and we have developed our strategies for doing that.

And now that we’ve developed a bunch of strategies that sort of seem to work, I think it’s time for us to develop a more rigorous theory of actually how those teaching strategies interact with the final performance of the system.

Lucas: Cool. Is there anything else here that you would like to say about CIRL, or any really important points you would like people to get, people who are interested in technical AI Alignment or CS students?

Dylan: I think the main point that I would make is that research and thinking about powerful AI systems is valuable, even if you don’t think that that’s what’s going to happen. You don’t need to be motivated by those sets of problems in order to recognize that this is actually just basic research into the science of artificial intelligence.

It’s got an incredible amount of really interesting problems and the perspectives that you adopt from this framing can be incredibly useful as a comparative advantage over other researchers in the field. I think that’d be my final word here.

Lucas: If I might just ask you one last question. We’re at Beneficial AGI 2019 right now and we’ve heard a lot of overviews of different research agendas and methodologies and models and framings for how to best go forth with AI Alignment, which include a vast range of things which work on corrigibility and interpretability and robustness and other things, and the different sort of research agendas and methodologies of places like MIRI, which has come out with this new framing on embedded agency, and also different views at OpenAI and DeepMind.

And Eric Drexler has also newly proposed this services-based conception of AI where we remove the understanding of powerful AI systems or regular AI systems as agents, which sort of gets us away from a lot of the x-risky problems and global catastrophic risk problems and value alignment problems.

From your point of view, as someone who’s worked a lot in CIRL and is a technical alignment researcher, how do you view CIRL in this context and how do you view all of these different emerging approaches right now in AI Alignment?

Dylan: For me, and you know, I should give a disclaimer. This is my research area and so I’m obviously pretty biased to thinking it’s incredibly important and good, but for me at least, cooperative IRL is a uniting framework under which I can understand all of those different approaches. I believe that a services type solution to AI Safety or AI Alignment is actually arguing for a particular type of learning strategy and implementation strategy within CIRL, and I think it can be framed within that system.

Similarly, I had some conversations with people about debate. I believe debate fits really nicely into the framework: we commit to a human strategy of judging debates from systems and we commit to a robot strategy of just putting yourself into two systems and working towards that direction. So for me, it’s a way in which I can sort of identify the commonalities between these different approaches and compare and contrast them and then, under a set of assumptions about what the world is like, what the space of possible preferences is like and what the space of strategies that people can implement is like, possibly get out some information about which one is better or worse, or which type of strategy is vulnerable to different types of mistakes or errors.

Lucas: Right, so I agree with all of that, the only place that I might want to push back is, it seems that maybe the MIRI embedded agency stuff subsumes everything else. What do you think about that?

Because the framing is like whenever AI researchers draw these models, there are these conceptions of these information channels, right, which are selected by the researchers and which we control, but the universe is really just a big non-dual happening of stuff and agents are embedded in the environment and are almost an identical process within the environment and it’s much more fuzzy where the dense causal streams are and where the little causal streams are and stuff like that. It just seems like the MIRI stuff seems to maybe subsume the CIRL and everything else a little bit more, but I don’t know.

Dylan: I certainly agree that that’s the one that’s hardest to fit into the framework, but I would also say that in my mind, I don’t know what an agent is. I don’t know how to operationalize an agent, I don’t actually know what that means in the physical world and I don’t know what it means to be an agent. What I do know is that there is a strategy of some sort that we can think of as governing the ways that the system will perform and behave.

I want to be very careful about baking in assumptions beforehand. And it feels to me like with embedded agency, I don’t fully understand the set of assumptions being made in that framework. I don’t necessarily understand how they relate to the systems that we’re actually going to build.

Lucas: When people say that an agent is like a fuzzy concept, I think that might be surprising to a lot of people who have thought somewhat about the problem because it’s like, obviously I know what an agent is: it’s different than all the other dead stuff in the world, it has goals and it’s physically confined and unitary.

If you just like imagine abiogenesis, how life began. Is the first relatively self-replicating chain of hydrocarbons an agent? And you can go from really small systems to really big systems, which can exhibit certain properties or principles that feel a little bit agenty, but may not be useful. And so I guess if we’re going to come up with a definition of it, it should just be something useful for us or something.

Dylan: I think “I’m not sure” is the most accurate thing we can say here. I wish I had a better answer for what this was. Maybe I can share one of the thought experiments that convinced me I was pretty confused about what an agent is.

Lucas: Yeah, sure.

Dylan: It came from thinking about what value alignment is. So if we think about value alignment between two agents, and those are both perfectly rational actors making decisions in the world perfectly in accordance with their values, with full information, I can sort of write down a definition of value alignment, which is basically you’re using the same ranking over alternatives that I am.

But a question that we really wanted to try to answer that feels really important is what does it mean to be value aligned in a partial context? If you were a bounded agent, if you’re not a perfectly rational agent, what does it actually mean for you to be value aligned? That was the question that we also didn’t really know how to answer.

Lucas: My initial reaction is the kind of agent that tries its best with its limited rationality to be like the former thing that you talked about.

Dylan: Right, so that leads to a question that we thought about. So suppose I have a chess playing agent and it is my chess playing agent and so I want it to win the game for me. Suppose it’s using the correct goal test, so it is actually optimizing for my values. Let’s say it’s only searching out to depth three, so it’s pretty dumb as far as chess players go.

Do I think that that is an agent that is value aligned with me? Maybe. I mean, certainly I can tell the story in one way that it sounds like it is. It’s using the correct objective function, it’s doing some sort of optimization thing. If it ever identifies a checkmate move in three moves, it will always find that and get that back to me. And so that’s a sense in which it feels like it is a value aligned agent.

On the other hand, what if it’s using a heuristic function which is chosen poorly, or in something closer to an adversarial manner? So now it’s a depth three agent that is still using the correct goal test, but it’s searching in a way that is adversarially selected. Is that a partially value aligned agent?

Lucas: Sorry, I don’t understand what it means to have the same objective function, but be searching in three depth in an adversarial way.

Dylan: In particular, when you’re doing a chess search engine, there is your sort of goal tests that you run on your leaves of your search to see if you’ve actually achieved winning the game. But because you’re only doing a partial search, you often have to rely on using a heuristic of some sort to like rank different positions.

Lucas: To cut off parts of the tree.

Dylan: Somewhat to cut off parts of the tree, but also just like you’ve got different positions, neither of which are winning and you need to choose between those.

Lucas: All right. So there’s a heuristic, like it’s usually good to take the center or like the queen is something that you should always probably keep.

Dylan: Or these things that are like values of pieces that you can add up was I think one of the problems …

Lucas: Yeah, and just as like an important note now in terms of the state of machine learning, the heuristics are usually chosen by the programmer. Are systems able to collapse on heuristics themselves?

Dylan: Well, so I’d say one of the big things in like AlphaZero or AlphaGo as an approach is that they applied sort of learning on the heuristic itself and they figured out a way to use the search process to gradually improve the heuristic and have the heuristic actually improving the search process.

And so there’s sort of a feedback loop set up in those types of expert iteration systems. My point here is that when I described that search algorithm to you, I didn’t mention what heuristic it was using at all. And so you had no way to tell me whether or not that system was partially value aligned, because actually the heuristic is 100 percent of what’s going to determine the final performance of the system and whether or not it’s actually helping you.
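
The chess thought experiment can be sketched on a tiny hand-built game tree in Python. This is only an illustration under invented assumptions: the tree in `CHILDREN`, the outcomes in `TERMINAL` and the two heuristics are hypothetical, and the search is ordinary depth-limited minimax rather than a real chess engine. Both agents share the same `goal_test`; the only thing that differs is the heuristic applied when the search is cut off.

```python
WIN, LOSS = 1.0, -1.0

# Hypothetical game tree. Moves under "a" eventually win for us, moves under "b" lose.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1", "b2"],
    "a1": ["a1_win"],
    "a2": ["a2_win"],
    "b1": ["b1_loss"],
    "b2": ["b2_loss"],
}
TERMINAL = {"a1_win": WIN, "a2_win": WIN, "b1_loss": LOSS, "b2_loss": LOSS}


def goal_test(state):
    """WIN or LOSS for finished games, None otherwise. Identical for every agent."""
    return TERMINAL.get(state)


def search(state, depth, maximizing, heuristic):
    """Depth-limited minimax: use the goal test if it applies, else the heuristic at the cutoff."""
    outcome = goal_test(state)
    if outcome is not None:
        return outcome
    if depth == 0:
        return heuristic(state)
    values = [search(c, depth - 1, not maximizing, heuristic) for c in CHILDREN[state]]
    return max(values) if maximizing else min(values)


def best_move(state, depth, heuristic):
    """Our move at `state`; the opponent (the minimizer) replies next."""
    return max(CHILDREN[state], key=lambda c: search(c, depth - 1, False, heuristic))


def helpful(state):
    """A heuristic that happens to track the true outcome."""
    return 0.5 if state.startswith("a") else -0.5


def adversarial(state):
    """The same heuristic with its sign flipped."""
    return -helpful(state)


print(best_move("root", 2, helpful), best_move("root", 2, adversarial))
print(best_move("root", 3, helpful), best_move("root", 3, adversarial))
```

At depth 2 the search never reaches a finished game, so the move is decided entirely by the heuristic and the two agents disagree. At depth 3 the goal test fires inside the horizon and both agents pick the winning move, which matches the earlier remark that a win within the search depth will always be found.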

And then the sort of final point I have here that I might be able to confuse you with a little bit more is, what if we just sort of said, “Okay, forget this whole searching business. I’m just going to precompute all the solutions from my search algorithm and I’m going to give you a policy of when you’re in this position, do this move. When you’re in that position, do that move.” And what would it mean for that policy to be value aligned with me?

Lucas: If it did everything that you would have done if you were the one playing the chess game. Like is that value alignment?

Dylan: That’s certainly perfect imitation, and maybe we [crosstalk 00:33:04]

Lucas: Perfect imitation isn’t necessarily value alignment because you don’t want it to perfectly imitate you, you want it to win the game.

Dylan: Right.

Lucas: Isn’t the easiest way to understand this that there are degrees of value alignment, and value alignment is the extent to which the thing is able to achieve the goals that you want?

Dylan: Somewhat, but the important thing here is trying to understand what these intuitive notions that we’re talking about actually mean for the mathematics of sequential decision making. And so there’s a sense in which you and I can talk about partial value alignment and the agents that are trying to help you. But if we actually look at the math of the problem, it’s actually very hard to understand how that actually translates. Like mathematically I have lots of properties that I could write down and I don’t know which one of those I want to call partial value alignment.

Lucas: You know more about the math than I do, but the percentage chance of a thing achieving the goal is the degree to which it’s value aligned? If you’re certain that the end towards which it is striving is the end towards which you want it to strive?

Dylan: Right, but that striving term is a hard one, right? Because if your goals aren’t achievable then it’s impossible to be value aligned with you in that sense.

Lucas: Yeah, you have to measure the degree to which the end towards which it’s striving is the end towards what you want it to strive and then also measure the degree to which the way that it tries to get to what you want is efficacious or …

Dylan: Right. I think that intuitively I agree with you and I know what you mean, but it’s like I can do things like I can write down a reward function and I can say how well does this system optimize that reward function? And we could ask whether or not that means it’s value aligned with it or not. But to me, that just sounds like the question of whether your policy is optimal in the sort of more standard context.

Lucas: All right, so have you written about how you think that CIRL subsumes all of these other methodologies? And if it does subsume these other AI Alignment methodologies. How do you think that will influence or affect the way we should think about the other ones?

Dylan: I haven’t written that explicitly, but what I’ve tried to convey is that it’s a formalization of the type of problem we’re trying to solve. I think describing this as subsuming them is not quite right.

Lucas: It contextualizes them and it brings light to them by providing framing.

Dylan: It gives me a way to compare those different approaches and understand what’s different and what’s the same between them, and in what ways are they … like in what scenarios do we expect them to work out versus not? One thing that we’ve been thinking about recently is what happens when the person doesn’t know immediately what they’re trying to do.

So if we imagine that there is in fact a static set of preferences the person’s trying to optimize, so we’re still making that assumption, but assume that those preferences are revealed to the person over time through experience or interaction with the world. That is a richer class of value alignment problems than cooperative IRL deals with. It’s really closer to what we are attempting to do right now.

Lucas: Yeah, and I mean that doesn’t even include value degeneracy, like what if I get hooked on drugs in the next three years and all my values go and my IRL agent works on assumptions that I’m always updating towards what I want, but you know …

Dylan: Yes, and I think that’s where you get these questions of changing preferences that make it hard to really think through things. I think there’s a philosophical stance you’re taking there, which is that your values have changed rather than your beliefs have changed there.

In the sense that wire-heading is a phenomenon that we see in people and in general learning agents, and if you are attempting to help a learning agent, you must be aware of the fact that wire-heading is a possibility and possibly bad. And then it’s incredibly hard to distinguish from someone who’s just found something that they really like and want to do.

When you should make that distinction or how you should make that distinction is a really challenging question, that’s not a purely technical computer science question.

Lucas: Yeah, but even at the same time, I would like to demystify it a bit. If your friend got hooked on drugs, it’s pretty obvious for you why it’s bad, it’s bad because he’s losing control, it’s bad because he’s sacrificing all of his other values. It’s bad because he’s shortening his life span by a lot.

I just mean to say again that, in this way, it’s obvious in the ways in which humans do this. So I guess taking biologically inspired approaches to understanding cognition, and transferring how humans deal with these things into AI machines, at least at face value seems like a good way of doing it.

Dylan: Yes, everything that you said I agree with. My point is that those are, in a very real sense, normative assumptions that you as that person’s friend are able to bring to the analysis of that problem, and in some ways there is an arbitrariness to labeling that as bad.

Lucas: Yeah, so the normative issue is obviously very contentious and needs to be addressed more, but at the same time society has come to very clear solutions to some normative problems; murder is basically a solved normative problem. There’s a degree to which it’s super obvious that certain normative questions are just answered, and we should, I guess, practice epistemic humility here, obviously.

Dylan: Right, and I don’t disagree with you on that point, but I think what I’d say is, as a research problem there’s a real question to getting a better understanding of the normative processes whereby we got to solving that question. Like what is the human normative process? It’s a collective societal system. How does that system evolve and change? And then how should machines or other intelligent entities integrate into that system without either subsuming or destroying it in bad ways? I think that’s what I’m trying to get at when I make these points. There is something about what we’re doing here as a society that gets us to labeling these things in the ways that we do and calling them good or bad.

And on the one hand, as a person I believe that there are correct answers and I know what I think is right versus what I think is wrong. And then as a scientist I want to try to take a little bit more of an outside view and try to understand like what is the process whereby we as a society or as genetic beings started doing that? Understanding what that process is and how that process evolves, and actually what that looks like in people now is a really critical research program.

Lucas: So one thing that I tried to cover in my panel yesterday on what civilization should strive for, is in the short, medium, to longterm the potential role that narrow to general AI systems might play in amplifying human moral decision making.

Solving, as you were discussing, this sort of deliberative, normative process that human beings undergo to converge on an idea. I’m just curious to know, with more narrow systems, if you’re optimistic about ways in which AI can sort of help and elucidate our moral decision making or work to amplify it.

And before I let you start, I guess there’s one other thing that I said that I think Rohin Shah pointed out to me that was particularly helpful in one place. Beyond the moral decision making itself, narrow AI systems can help us by making the moral decisions that we implement in them faster than we could.

Depending on the way a self-driving car decides to crash is like an expression of our moral decision making in like a fast computery way. I’m just saying like beyond ways in which AI systems make moral decisions for us faster than we can, I don’t know, maybe in courts or other things which seem morally contentious. Are there also other ways in which they can actually help the deliberative process examining massive amounts of moral information or like a value information or analyzing something like an aggregated well-being index where we try to understand more so how policies impact the wellbeing of people or like what sorts of moral decisions lead to good outcomes, whatever. So whatever you have to say to that.

Dylan: Yes, I definitely want to echo that. We can sort of get a lot of pre-deliberation into a fast timescale reaction with AI systems, and I think that that is a way for us to improve how we act and the quality of the things that we do from a moral perspective. There I do see a real path to actually bringing that to be in the world.

In terms of helping us actually deliberate better, I think that is a harder problem that I think is absolutely worth more people thinking about, but I don’t know the answers here. What I do think is that if we have a better understanding of what the deliberative process is, I think the correct questions to look at to try to get to that are not the moral questions about what’s right and what’s wrong and what do we think is right and what do we think is wrong, but are much more questions at the level of what is it about our evolutionary pathway that led us to thinking that these things are right or wrong.

What is it about society and the pressures that we’ve undergone and faced that led us to a place where murder is wrong in almost every society in the world? I will say the death penalty is a thing; it’s just a type of sanctioned murder. So there is a sense in which I think it’s a bit more nuanced than just that. And there’s something to be said about, I guess if I had to make my claims, what I think has sort of happened there.

So there’s something about us as creatures that evolved to coordinate and perform well in groups, and the pressures that that placed on us, that caused us to develop these normative systems whereby we say different things are right and wrong.

Lucas: Iterated game theory over millions of years or something.

Dylan: Something like that. Yeah, but there’s a sense in which us labeling things as right and wrong and developing the processes whereby we label things as right and wrong is a thing that we’ve been pushed towards.

Lucas: From my perspective, it feels like this is more tractable than people let on, with claims like AI is only going to be able to help in moral deliberation once it’s general. It already helps us in regular deliberation, and moral deliberation isn’t a special kind of deliberation. Moral deliberation requires empirical facts about the world and persons, just like any other kind of actionable deliberation does in domains that aren’t considered to have to do with moral philosophy or ethics or things like that.

So I’m not an AI researcher, but it seems to me like this is more tractable than people let on. The normative aspect of AI Alignment seems to be under-researched.

Dylan: Can you say a little more about what you mean by that?

Lucas: What I meant was the normative deliberative process, the difficulty in coming to normative conclusions and what the appropriate epistemic and deliberative process is for arriving at normative solutions and how narrow AI systems can take us to a beautiful world where advanced AI systems actually lead us to post human ethics.

If we ever want to get to a place where general systems take us to post human ethics, why not start today with figuring out how narrow systems can work to amplify human moral decision making and deliberative processes.

Dylan: I think the hard part there is, I don’t exactly know what it means to amplify those processes. My perspective is that we as a species do not yet have a good understanding of what those deliberative processes actually represent and what form the result actually takes.

Lucas: It’s just like giving more information, providing tons of data, analyzing the data, potentially pointing out biases. The part where they’re literally amplifying cognitive implicit or explicit decision making processes is more complicated and will require more advancement in cognition and deliberation and stuff. But yeah, I still think there are more mundane ways in which it can make us better moral reasoners and decision makers.

If I could give you like 10,000 more bits of information every day about moral decisions that you make, you would probably just be a better moral agent.

Dylan: Yes, one way to try to think about that is maybe things like VR approaches to increasing empathy. I think that that has a lot of power to make us better.

Lucas: Max always says that there’s a race between wisdom and the power of our technology and it seems like people really aren’t taking seriously ways in which we can amplify wisdom because wisdom is generally taken to be part of the humanities and like the soft sciences. Maybe we should be taking more seriously ways in which narrow current day AI systems can be used to amplify the progress at which the human species makes wisdom. Because otherwise we’re just gonna like continue how we always continue and the wisdom is going to go really slow and then we’re going to probably learn from a bunch of mistakes.

And it’s just not going to be as good until we develop a rigorous science of making moral progress or like using technology to amplify the progress of wisdom and moral progress.

Dylan: So in principle, what you’re saying, I don’t really disagree with it, but I also don’t know how that would change what I’m working on either. In the sense that I’m not sure what it would mean. I do not know how I would do research on amplifying wisdom. I just don’t really know what that means. And that’s not to say it’s an impossible problem. We talked earlier about how I don’t know what partial value alignment means; that’s something that you and I can talk about and we can intuitively, I think, align on a concept, but it’s not a concept I know how to translate into actionable concrete research problems right now.

In the same way, the idea of amplifying wisdom and making people more wise is something that I think intuitively I understand what you mean, but when I try to think about how an AI system would make someone wiser, that feels difficult.

Lucas: It can seem difficult, and obviously this is like an open research question, but if you were able to identify a bunch of variables that are most important for moral decision making, and then if you could use AI systems to sort of gather, aggregate, and compile in certain ways and analyze moral information in this way, again, it just seems more tractable than people seem to be letting on.

Dylan: Yeah, although I wonder now, is that different from value alignment as we’re thinking about it? A concrete research thing I spent a while thinking about is, how do you identify the features that a person considers to be valuable? Say we don’t know the relative tradeoffs between them.

One way you might try to solve value alignment is to have a process that identifies the features that might matter in the world and then have a second process that identifies the appropriate tradeoffs between those features, and maybe something about diminishing returns or something like that. And that to me sounds like I just replaced values with wisdom and I’ve got sort of what you’re thinking about. I think both of those terms are similarly diffuse. I wonder if what we’re talking about is semantics, and if it’s not, I’d like to know what the difference is.
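As a rough illustration of the two-step process Dylan describes, the sketch below assumes a linear reward over named features. The feature names, outcomes, and weights are hypothetical and chosen only for illustration; nothing here is taken from the cooperative IRL paper. It shows that the same identified features can lead to different preferred outcomes depending on the tradeoffs between them.

```python
# Hypothetical example: outcomes, features, and weights are made up.
# Step 1 (assumed already done): the features that might matter are identified.
OUTCOMES = {
    "fast_route": {"speed": 0.9, "safety": 0.4},
    "safe_route": {"speed": 0.5, "safety": 0.9},
}

def reward(features, weights):
    """Linear reward: a weighted sum of feature values."""
    return sum(weights[name] * value for name, value in features.items())

def preferred(outcomes, weights):
    """Return the outcome with the highest reward under the given tradeoffs."""
    return max(outcomes, key=lambda name: reward(outcomes[name], weights))

# Step 2: two candidate tradeoffs over the same features.
values_speed = {"speed": 1.0, "safety": 0.2}
values_safety = {"speed": 0.2, "safety": 1.0}

choice_speed = preferred(OUTCOMES, values_speed)
choice_safety = preferred(OUTCOMES, values_safety)
print(choice_speed, choice_safety)
```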

Lucas: I guess, the more mundane definition of wisdom, at least in the way that Max Tegmark would use it would be like the ways in which we use our technology. I might have specific preferences, but just because I have specific preferences that I may or may not be aligning an AI system to does not necessarily mean that that total process, this like CIRL process is actually an expression of wisdom.

Dylan: Okay, can you provide a positive description of what a process would look like? Or like basically what I’m saying is I can hear the point of I have preferences and I aligned my system to it and that’s not necessarily a wise system and …

Lucas: Yeah, like I build a fire because I want to be hot, but then the fire catches my village on fire and no longer is … That still might be value alignment.

Dylan: But isn’t [crosstalk 00:48:39] some values that you didn’t take into account when you were deciding to build the fire.

Lucas: Yeah, that’s right. So I don’t know. I’d probably have to think about this more because I guess this is something that I just sort of throwing out right now as a reaction to what we’ve been talking about. So I don’t have a very good theory of it.

Dylan: And I don’t wanna say that you need to know the right answers to these things to not have that be a useful direction to push people.

Lucas: We don’t want to use different concepts to just reframe the same problem and just make a conceptual mess.

Dylan: That’s what I’m a little bit concerned about and that’s the thing I’m concerned about broadly. We’ve got a lot of issues that we’re thinking about and dealing with that we’re not really sure what they are.

For me, I think one of the really helpful things has been to frame the issue that I’m thinking about as if a person has a behavior that they want to implement into the world and that’s a complex behavior that they don’t know how to identify immediately. How do you actually go about building systems that allow you to implement that behavior effectively and evaluate that the behavior has actually been correctly implemented?

Lucas: Avoiding side effects, avoiding …

Dylan: Like all of these kinds of things that we’re sort of concerned about in AI Safety, in my mind, fall a bit more into place when we frame the problem as I have a desired behavior that I want to exist, a response function, a policy function that I want to implement into the world. What are the technological systems I can use to implement that in a computer or a robot or what have you?

Lucas: Okay. Well, do you have anything else you’d like to wrap up on?

Dylan: No, I just, I want to say thanks for asking hard questions and making me feel uncomfortable because I think it’s important to do a lot of that as a scientist and in particular I think as people working on AI, we should be spending a bit more time being uncomfortable and talking about these things, because it does impact what we end up doing and it does I think impact the trajectories that we put the technology on.

Lucas: Wonderful. So if people want to read about cooperative inverse reinforcement learning, where can we find the paper or other work that you have on that? What do you think are the best resources? What are just general things you’d like to point people towards in order to follow you or keep up to date with AI Alignment?

Dylan: I tweet occasionally about AI Alignment and a bit of AI ethics questions, the Hadfield-Menell, my first initial, last name. And if you’re interested in getting a technical introduction to value alignment, I would say take a look at the 2016 paper on cooperative IRL. If you’d like a more general introduction, there’s a blog post from summer 2017 on the BAIR blog.

Lucas: All right, thanks so much Dylan, and maybe we’ll be sitting in a similar room again in two years for Beneficial Artificial Super Intelligence 2021.

Dylan: I look forward to it. Thanks a bunch.

Lucas: Thanks. See you, Dylan. If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: Inverse Reinforcement Learning and the State of AI Alignment with Rohin Shah

What role does inverse reinforcement learning (IRL) have to play in AI alignment? What issues complicate IRL and how does this affect the usefulness of this preference learning methodology? What sort of paradigm of AI alignment ought we to take up given such concerns?

Inverse Reinforcement Learning and the State of AI Alignment with Rohin Shah is the seventh podcast in the AI Alignment Podcast series, hosted by Lucas Perry. For those of you that are new, this series is covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, governance, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Rohin Shah. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter.

Topics discussed in this episode include:

  • The role of systematic bias in IRL
  • The metaphilosophical issues of IRL
  • IRL’s place in preference learning
  • Rohin’s take on the state of AI alignment
  • What Rohin has changed his mind about
You can learn more about Rohin’s work here and find the Value Learning sequence here. You can listen to the podcast above or read the transcript below.

Lucas: Hey everyone, welcome back to the AI Alignment Podcast series. I’m Lucas Perry and today we will be speaking with Rohin Shah about his work on inverse reinforcement learning and his general take on the state of AI alignment efforts and theory today. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter. He has also been working with effective altruism for several years. Without further ado I give you Rohin Shah.

Hey, Rohin, thank you so much for coming on the podcast. It’s really a pleasure to be speaking with you.

Rohin: Hey, Lucas. Yeah. Thanks for inviting me. I’m glad to be on.

Lucas: Today I think that it would be interesting just to start off by delving into a lot of the current work that you’ve been looking into and practicing over the past few years. In terms of your research, it looks like you’ve been doing a lot of work on practical algorithms for inverse reinforcement learning that take into account, as you say, systematic cognitive biases that people have. It would be interesting if you could just sort of unpack this work that you’ve been doing on this and then contextualize it a bit within the AI alignment problem.

Rohin: Sure. So basically the idea with inverse reinforcement learning is you can look at the behavior of some agent, perhaps a human, and tell what they’re trying to optimize, what are the things that they care about? What are their goals? And in theory this seems like a pretty nice way to do AI alignment in that intuitively you can just say, “Hey, AI, go look at the actions humans are taking, look at what they say, look at what they do, take all of that in and figure out what humans care about.” And then you could use that perhaps as a utility function for your AI system.

I think I have become less optimistic about this approach now for reasons I’ll get into, partly because of my research on systematic biases. Basically one problem that you have to deal with is the fact that whatever humans are trying to optimize for, they’re not going to do it perfectly. We’ve got all of these sorts of cognitive biases like the planning fallacy or hyperbolic time discounting, where we tend to be myopic, not looking as far into the long-term as we perhaps could.

So assuming that humans are like perfectly optimizing goals that they care about is like clearly not going to work. And in fact, basically, if you make that assumption, well, then whatever reward function you infer, once the AI system is optimizing that, it’s going to simply recover the human performance because, well, you assumed that it was optimal when you inferred what it was, so that means whatever the humans were doing is probably the behavior that optimizes the reward function that you inferred.

And we’d really like to be able to reach super human performance. We’d like our AI systems to tell us how we’re wrong, to get new technologies, to develop things that we couldn’t have done ourselves. And that’s not really something we can do using the sort of naive version of inverse reinforcement learning that just assumes that you’re optimal. So one thing you could try to do is to learn the ways in which humans are biased, the ways in which they make mistakes, the ways in which they plan sub-optimally. And if you could learn that, then you could correct for those mistakes, take them into account when you’re inferring human values.

The example I like to use is if there’s a grad student who procrastinates or doesn’t plan well and as a result near a paper deadline they’re frantically working, but they don’t get it in time and they miss the paper deadline. If you assume that they’re optimal, optimizing for their goals very well, I don’t know what you’d infer, maybe something like grad students like to miss deadlines. Something like that seems pretty odd and it doesn’t seem like you’d get something sensible out of that, but if you realize that humans are not very good at planning, they have the planning fallacy and they tend to procrastinate for reasons that they wouldn’t endorse on reflection, then maybe you’d be able to say, “Oh, this was just a mistake that the grad student made. In the future I should try to help them meet their deadlines.”

So that’s the reason that you want to learn systematic biases. My research was basically let’s just take the hammer of deep learning and apply it to this problem. So not just learn the reward function, but let’s also learn the biases. It turns out that this was already known, but there is an impossibility result that says that you can’t do this in general. So I guess I would phrase the question I was investigating as, what is a weaker set of assumptions than the ones that we currently use such that you can still do some reasonable form of IRL?

Lucas: Sorry. Just stepping back for like half a second. What does this impossibility theorem say?

Rohin: The impossibility theorem says that if you assume that the human is basically running some sort of planner that takes in a reward function and spits out a behavior or a policy, a thing to do over time, then if all you see is the behavior of the human, basically any reward function is compatible with some planner. So you can’t learn anything about that reward function without making any more assumptions. And intuitively, this is because for any complex behavior you see you could either call it, “Hey, the human’s optimizing a reward that makes them act like that.” Or you could say, “I guess the human is biased and they’re trying to do something else, but they did this instead.”

The sort of extreme version of this is like if you give me an option between apples and oranges and I picked the apple, you could say, “Hey, Rohin probably likes apples and is good at maximizing his reward of getting apples.” Or you could say, “Rohin probably likes oranges and he is just extremely bad at satisfying his preferences. He’s got a systematic bias that always causes him to choose the opposite of what he wants.” And you can’t distinguish between these two cases just by looking at my behavior.
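The apples-and-oranges case can be written down as a tiny sketch. This is an illustration only, not the construction used in the impossibility result itself: the planner functions and reward tables below are hypothetical names invented for this example. It shows two different (planner, reward) pairs producing the same observed choice, so the choice alone cannot tell them apart.

```python
# Illustrative sketch only: planner names and reward tables are hypothetical.
OPTIONS = ["apple", "orange"]

def rational_planner(reward):
    """Pick the option with the highest reward."""
    return max(OPTIONS, key=lambda option: reward[option])

def anti_rational_planner(reward):
    """Pick the option with the lowest reward (a systematic 'opposite' bias)."""
    return min(OPTIONS, key=lambda option: reward[option])

likes_apples = {"apple": 1.0, "orange": 0.0}
likes_oranges = {"apple": 0.0, "orange": 1.0}

# Two different explanations of the same person...
choice_a = rational_planner(likes_apples)
choice_b = anti_rational_planner(likes_oranges)

# ...yield identical behavior.
print(choice_a, choice_b)
```

Ruling out one of the two explanations requires an extra assumption about the planner, which is the direction the conversation takes next.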

Lucas: Yeah, that makes sense. So we can pivot sort of back in here into this main line of thought that you were on.

Rohin: Yeah. So basically with that impossibility result … When I look at the impossibility result, I sort of say that humans do this all the time, humans just sort of look at other humans and they can figure out what they want to do. So it seems like there are probably some simple set of assumptions that humans are using to infer what other humans are doing. So a simple one would be when the consequences of something are obvious to humans. Now, how you determine when that is is another question, but when that’s true humans tend to be close to optimal and if you have something like that, you can rule out the planner that says the human is anti-rational and always chooses the worst possible thing.

Similarly, you might say that as tasks get more and more complex or require more and more computation, the probability that the human chooses the action that best maximizes his or her goals also goes down since the task is more complex and maybe a human doesn’t figure that out, figure out what’s the best thing to do. Maybe with enough of these assumptions we could get some sort of algorithm that actually works.
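One common way to model "close to optimal, but less so on harder tasks" is a softmax (Boltzmann) choice rule whose rationality parameter shrinks with task complexity. The sketch below is a minimal illustration under that assumption; the `beta_for` rule and the numbers are hypothetical and are not claimed to be the model used in Rohin's research.

```python
import math

def choice_probabilities(values, beta):
    """Softmax (Boltzmann) choice: higher beta means closer to optimal."""
    top = max(values)
    weights = [math.exp(beta * (v - top)) for v in values]  # shift for stability
    total = sum(weights)
    return [w / total for w in weights]

def beta_for(complexity, base_beta=5.0):
    """Hypothetical rule: rationality shrinks as task complexity grows."""
    return base_beta / (1.0 + complexity)

# Value of each action under the human's true goals; action 0 is the best one.
values = [1.0, 0.5, 0.0]

p_easy = choice_probabilities(values, beta_for(complexity=0))[0]
p_hard = choice_probabilities(values, beta_for(complexity=9))[0]

# The best action is chosen less often on the more complex task.
print(round(p_easy, 3), round(p_hard, 3))
```

Under this model the human never becomes anti-rational: even on the hard task the best action remains the single most likely choice, just by a smaller margin.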

So we looked at if you make the assumption that the human is often close to rational and a few other assumptions about humans behaving similarly or planning similarly on similar tasks, then you can maybe, kind of, sort of, in simplified settings do IRL better than if you had just assumed that the human was optimal if humans are actually systematically biased, but I wouldn’t say that our results are great. I don’t think I would say that I definitively, conclusively said, “This will never work.” Nor did I definitively, conclusively say that this is great and we should definitely be putting more resources into it. Sort of somewhere in the middle, maybe more on the negative side of like this seems like a really hard problem and I’m not sure how we get around it.

Lucas: So I guess just as a point of comparison here, how is it that human beings succeed at this every day in terms of inferring preferences?

Rohin: I think humans have the benefit of being able to model the other person as being very similar to themselves. If I am trying to infer what you are doing I can sort of say, “Well, if I were in Lucas’s shoes and I were doing this, what would I be optimizing?” And that’s a pretty good answer to what you would be optimizing. Humans are just in some absolute sense very similar to each other. We have similar biases. We’ve got similar ways of thinking. And I think we’ve leveraged that similarity a lot using our own self models as a drop-in approximation of the other person’s planner in this planner reward language.

And then we say, “Okay, well, if this other person thought like me and this is what they ended up doing, well then, what must they have been optimizing?” I think you’ll see that when this assumption breaks down humans actually get worse at inferring goals. It’s harder for me to infer what someone in a different culture is actually trying to do. They might have values that are like significantly different from mine.

I’ve been in both India and the US and it often seems to me that people in the US just have a hard time grasping the way that Indians see society and family expectations and things like this. So that’s an example that I’ve observed. It’s probably also true the other way around, but I was never old enough in India to actually think through this.

Lucas: Human beings sort of succeed in inferring preferences of people who they can model as having like similar values as their own or if you know that the person has similar values as your own. If inferring human preferences from inverse reinforcement learning is sort of not having the most promising results, then what do you believe to be a stronger way of inferring human preferences?

Rohin: The one thing I correct there is that I don’t think humans do it by assuming that people have similar values, just that people think in similar ways. For example, I am not particularly good at dancing. If I see someone doing a lot of hip-hop or something. It’s not that I value hip-hop and so I can infer they value hip-hop. It’s that I know that I do things that I like and they are doing hip-hop. Therefore, they probably like doing hip-hop. But anyway, that’s the minor point.

So a, just because IRL algorithms aren’t doing well now, I don’t think it’s true that IRL algorithms couldn’t do well in the future. It’s reasonable to expect that they would match human performance. That said, I’m not super optimistic about IRL anyway, because even if we do figure out how to get IRL algorithms and sort of make all these implicit assumptions that humans are making that we can then run and get what a human would have thought other humans are optimizing, I’m not really happy about then going and optimizing that utility function off into the far future, which is what sort of the default assumption that we seem to have when using inverse reinforcement learning.

It may be that IRL algorithms are good for other things, but for that particular application, it seems like the utility function you infer is going to not really scale to things that super intelligence will let us do. Humans just think very differently about how they want the future to go. In some sense, the future is going to be very, very different. We’re going to need to think a lot about how we want the future to go. All of our experience so far has not trained us to be able to think about what we care about in the sort of feature setting where we’ve got as a simple example the ability to easily copy people if they’re uploaded as software.

If that’s a thing that happens, well, is it okay to clone yourself? How does democracy work? All these sorts of things are somewhat value judgments. If you take egalitarianism and run with it, you basically get that one person can copy themselves millions and millions of times and just determine the outcome of all voting that way. That seems bad, but on our current values, I think that is probably what we want and we just really haven’t thought this through. Using IRL to infer a utility function that we then just ruthlessly optimize in the long-term just seems like, by the time the world changes a bunch, the value function that we inferred is going to be weirdly wrong in strange ways that we can’t predict.

Lucas: Why not run continuous updates on it as people update given the change of the world?

Rohin: It seems broadly reasonable. This is the sort of idea that you could have about how you could use IRL in a more realistic way that actually works. I think that’s perfectly fine. I’m optimistic about approaches that are like, “Okay, we’re going to use IRL to infer a value function or reward function or something and we’re going to use that to inform what the AI does, but it’s not going to be the end-all utility function. It’s just going to infer what we do now and the AI system is somehow going to check with us. Maybe it’s got some uncertainty over what the true reward function is. Maybe it only keeps this reward function for a certain amount of time.”

These seem like things that are worth exploring, but I don’t know that we have the correct way to do it. In the particular case that you proposed, just updating the reward function over time, the classic wireheading question is, how do we make it so that the AI doesn’t say, “Okay, actually, in order to optimize the utility function I have now, it would be good for me to prevent you from changing my utility function since if you change my utility function, I’m no longer going to achieve my original utility.” So that’s one issue.

The other issue is maybe it starts doing some long-term plans. Maybe even if it’s planning according to this utility function without expecting some changes to the utility function in the future, then it might set up some long-term plans that are going to look bad in the future, but it is hard to stop them in the future. Like you make some irreversible change to society because you didn’t realize that something was going to change. These sorts of things suggest you don’t want a single utility function that you’re optimizing even if you’re updating that utility function over time.

It could be that you have some sort of uncertainty over utility functions and that might be okay. I’m not sure. I don’t think that it’s settled that we don’t want to do something like this. I think it’s settled that we don’t want to use IRL to infer a utility function and optimize that one forever. There are certain middle grounds. I don’t know how well those middle grounds work. Intuitively, there are going to be some problems, but maybe we can get around those.
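As a rough illustration of the “uncertainty over utility functions” middle ground described above, here is a minimal Python sketch. It is not from the conversation: the candidate reward functions, the Boltzmann-rational human model, the `beta` parameter, and the confidence threshold are all assumptions chosen for illustration. The agent keeps a posterior over candidate rewards, updates it from observed human actions, and checks with the human instead of acting while it is still uncertain.

```python
import math

# Hypothetical candidate reward functions over three actions (made-up numbers).
CANDIDATES = {
    "tidy":   {"clean": 1.0, "cook": 0.2, "idle": 0.0},
    "foodie": {"clean": 0.1, "cook": 1.0, "idle": 0.0},
}


def update(posterior, observed_action, beta=3.0):
    """Bayesian update, assuming the human chooses actions Boltzmann-rationally."""
    unnormalized = {}
    for name, prior in posterior.items():
        rewards = CANDIDATES[name]
        z = sum(math.exp(beta * r) for r in rewards.values())
        likelihood = math.exp(beta * rewards[observed_action]) / z
        unnormalized[name] = prior * likelihood
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


def act_or_ask(posterior, threshold=0.9):
    """Act on the most probable reward only when confident; otherwise ask."""
    best, prob = max(posterior.items(), key=lambda kv: kv[1])
    if prob < threshold:
        return "ask_human"
    rewards = CANDIDATES[best]
    return max(rewards, key=rewards.get)


posterior = {"tidy": 0.5, "foodie": 0.5}
first_decision = act_or_ask(posterior)      # still uncertain at this point
for _ in range(2):                          # the human is seen cooking twice
    posterior = update(posterior, "cook")
second_decision = act_or_ask(posterior)
print(first_decision, second_decision)
```

The sketch only covers the “check with us when unsure” part. It does not address the issues raised above about an agent resisting changes to its utility function or making irreversible long-term plans.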

Lucas: Let me try to do a quick summary just to see if I can explain this as simply as possible. There are people and people have preferences, and a good way to try and infer their preferences is through their observed behavior, except that human beings have cognitive and psychological biases, which sort of skew their actions because they’re not perfectly rational epistemic agents or rational agents. So the value system or reward system that they’re optimizing for is imperfectly expressed through their behavior. If you’re going to infer the preferences from behavior then you have to correct for biases and epistemic and rational failures to try and infer the true reward function. Stopping there. Is that sort of like a succinct way you’d put it?

Rohin: Yeah, I think maybe another point that might be the same or might be different is that under our normal definition of what our preferences or our values are, we would say something like, “I value egalitarianism, but it seems predictably true that in the future we’re not going to have a single vote per sentient being,” or something. Then essentially what that says is that our preferences, our values are going to change over time and they depend on the environment in which we are right now.

So you can either see that as, okay, I have this really big, really global, really long-term utility function that tells me, given my environment, what my narrow values in that environment are. And in that case you say, “Well okay, in that case, we’re really super biased because we only really know our values in the current environment. We don’t know our values in future environments. We’d have to think a lot more for that.” Or you can say, “We can infer our narrow values now and that has some biases thrown in, but we could probably account for those, and then we have to have some sort of story for how we deal with our preferences evolving in the future.”

Those are two different perspectives on the same problem, I would say, and they differ in basically what you’re defining values to be. Is it the thing that tells you how to extrapolate what you want all the way into the future or is it the thing that tells you how you’re behaving right now in the environment. I think our classical notion of preference or values, the one that we use when we say values in everyday language is talking about the second kind, the more narrow kind.

Lucas: There’s really a lot there, I think, especially in terms of issues in that personal identity over time, commitment to values and as you said, different ideas and conceptualization of value, like what is it that I’m actually optimizing for or care about. Population ethics and tons of things about how people value future versions of themselves or whether or not they actually equally care about their value function at all times as it changes within the environment.

Rohin: That’s a great description of why I am nervous around inverse reinforcement learning. You listed a ton of issues and I’m like, yeah, all of those are like really difficult issues. And with inverse reinforcement learning, it’s sort of based on this premise of all of that is existent, is real and is timeless and we can infer it and then maybe we put on some hacks like continuously improving the value function over time to take into account changes, but this does feel like we’re starting with some fundamentally flawed paradigm.

So mostly because of this fact that it feels like we’ve taken a flawed paradigm to start with, then changed it so that it doesn’t have all the obvious flaws. I’m more optimistic about trying to have a different paradigm of how we want to build AI, which maybe I’ll summarize as just make AIs that do what we want or what we mean at the current moment in time and then make sure that they evolve along with us as we evolve and how we think about the world.

Lucas: Yeah. That specific feature there is something that we were trying to address in inverse reinforcement learning, if the algorithm were sort of updating over time alongside myself. I just want to step back for a moment to try to get an even grander and more conceptual understanding of the globalness of inverse reinforcement learning. So from an evolutionary and sort of more cosmological perspective, you can say that from the time of the first self-replicating organisms on the planet until today, like the entire evolutionary tree, there’s sort of a global utility function across all animals that is ultimately driven by thermodynamics and the sun shining light on a planet, and that this sort of global utility function of all agents across the planet seems very ontologically basic and pure, like what simply empirically exists. Attempting to access that through IRL is just interesting, given the difficulties that arise from that. Does that sort of picture seem accurate?

Rohin: I think I’m not super sure what exactly you’re proposing here. So let me try and restate it. So if we look at the environment as a whole or the universe as a whole or maybe we’re looking at evolution perhaps and we see that hey, evolution seems to have spit out all of these creatures that are interacting in this complicated way, but you can look at all of their behavior and trace it back to this objective in some sense of maximizing reproductive fitness. And so are we expecting that IRL on this very grand scale would somehow end up with maximize reproductive fitness. Is that what … Yeah, I think I’m not totally sure what implication you’re drawing from this.

Lucas: Yeah. I guess I’m not arguing that there’s going to be some sort of evolutionary thing which is being optimized.

Rohin: IRL does make the assumption that there is something doing an optimization. You usually have to point it towards what that thing is. You have to say, “Look at the behavior of this particular piece of the environment and tell me what it’s optimizing.” Maybe if you’re imagining IRL on this very grand scale, what is the thing you’re pointing it at?

Lucas: Yeah, so to sort of reiterate and specify, the pointing IRL at the human species would be like to point IRL at 7 billion primates. Similarly, I was thinking that what if one pointed IRL at the ecosystem of Earth over time, you could sort of plot this evolving algorithm over time. So I was just sort of bringing to note that accessing this sort of thing, which seems quite ontologically objective and just sort of clear in this way, it’s just very interesting how it’s fraught with so many difficulties. Yeah, in terms of history it seems like all there really is, is the set of all preferences at each time step over time, which could be summarized in some sort of global or individual levels of algorithms.

Rohin: Got it. Okay. I think I see what you’re saying right now. It seems like the intuition is like ecosystems, universe, laws of physics, very simple, very ontologically basic things, there’s something more real about any value function we could infer from that. And I think this is a misunderstanding of what IRL does. IRL fundamentally requires you to have some notion of counterfactuals. You need to have a description of the action space that some agent had and then when you observe their behavior, you see that they made a choice to take one particular action instead of another particular action.

You need to be able to ask the question of what could they have done instead, which is a counterfactual. Now, with laws of physics, it’s very unclear what the counterfactual would be. With evolution, you can maybe say something like, “Evolution could have chosen to make a whole bunch of mutations and it chose this particular one.” And then if you use that particular model, what is IRL going to infer? It will probably infer something like maximize reproductive fitness.

On the other hand, if you model evolution as like, hey, you can design the best possible organism that you can, you can just create an organism out of thin air, then what reward function are you maximizing? It’s like super unclear. If you could just poof into existence an organism, you could just make something that’s extremely intelligent, very strong, et cetera, et cetera. And you’re like, well, evolution didn’t do that. It took millions of years to create even humans so clearly it wasn’t optimizing reproductive fitness, right?

And in fact, I think people often say that evolution is not an optimization process because of things like this. The notion of something doing optimization is very much relative to what you assume their capabilities to be and in particular what you assume their counterfactuals to be. So if you were talking about this sort of grand scale ecosystems, universe, laws of physics, I would ask you like, “What are the counterfactuals? What could the laws of physics have done otherwise or what could the ecosystem have done if it didn’t do the thing that it did?” Once you have an answer to that, I imagine I could predict what IRL would do. And that part is the part that doesn’t seem ontologically basic to me, which is why I don’t think that IRL on this sort of thing makes very much sense.
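The dependence of IRL on the assumed counterfactuals can be made concrete with a small Python sketch. Everything here is an illustrative assumption rather than something stated in the conversation: the fitness numbers are made up, the “agent” is modeled as Boltzmann-rational with reward `w * fitness`, and the weight `w` is fitted by a simple grid search. The same observed choice (an organism with fitness 1.1) is scored against two different sets of alternatives.

```python
import math


def log_likelihood(w, chosen, options):
    """Log-probability of the observed choice for a Boltzmann-rational agent
    with reward w * fitness, given the options it is assumed to have had."""
    log_z = math.log(sum(math.exp(w * f) for f in options))
    return w * chosen - log_z


def infer_weight(chosen, options, grid):
    """Grid-search maximum-likelihood estimate of the reward weight w."""
    return max(grid, key=lambda w: log_likelihood(w, chosen, options))


GRID = [x / 2 for x in range(-4, 5)]  # candidate weights from -2.0 to 2.0

observed = 1.1  # fitness of the organism that actually appeared

# Counterfactual model A: evolution could only pick among nearby mutations.
small_mutations = [0.9, 1.0, 1.1]
# Counterfactual model B: evolution could have created any organism at all.
any_organism = [1.1, 2.0, 5.0]

w_a = infer_weight(observed, small_mutations, GRID)
w_b = infer_weight(observed, any_organism, GRID)
print(w_a, w_b)
```

Under model A the observed choice is the best available option, so the fitted weight is positive ("maximize fitness"). Under model B the same choice is the worst available option, so the fitted weight is negative. With a single observation the estimate runs to the edge of the grid, so only the sign is meaningful here; the observation is identical in both cases and only the assumed action space differs.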

Lucas: Okay. The part here that seems to be a little bit funny to me is where we’re tracking from physics, whatever you take to be ontologically basic about the universe, and tracking from that to the level of whatever our axioms and pre-assumptions for IRL are. What I’m trying to say is in terms of moving from whatever is ontologically basic to the level of agents, we have some assumptions in our IRL where we’re thinking about agents as sort of having theories of counterfactuals where they can choose between actions and they have some sort of reward or objective function that they’re trying to optimize for over time.

It seems sort of metaphysically queer where physics stops … Where we’re going up in levels of abstraction from physics to agents and we … Like physics couldn’t have done otherwise, but somehow agents could have done otherwise. Do you see the sort of concern that I’m raising?

Rohin: Yeah, that’s right. And this is perhaps another reason that I’m more optimistic about the don’t try to do anything at the grand scale and just try to do something that does the right thing locally in our current time, but I think that’s true. It definitely feels to me like optimization, the concept, should be ontologically basic and not a property of human thought. There’s something about how a random universe is high entropy whereas the ones that humans construct are low entropy. That suggests that we’re good at optimization.

It seems like it should be independent of humans. Also, on the other hand, any conception of optimization I come up with is either specific to the way humans think about it or it seems like it relies on this notion of counterfactuals. And yeah, the laws of physics don’t seem like they have counterfactuals, so I’m not really sure where that comes in. In some sense, you can ask, okay, why do we have the notion of counterfactuals or agency, thinking that we could have chosen something else? Well, we’re basically … In some sense we’re just an algorithm that’s continually thinking about what we could do, trying to make plans.

So we search over this space of things that could be done, and that search is implemented in physics, which has no say, it has no counterfactuals, but the search itself, which is an abstraction layer above, it’s something that is running on physics. It is not itself a physics thing, that search is in fact going through multiple options and then choosing one now. It is deterministic from the point of view of physics, but from the point of view of the search, it’s not deterministic. The search doesn’t know which one is going to happen. I think that’s why humans have this notion of choice and of agency.

Lucas: Yeah, and I mean, just in terms of understanding the universe, it’s pretty interesting just how there’s like these two levels of abstraction where at the physics level you actually couldn’t have done otherwise, but there’s also sort of this optimization process running on physics that’s searching over space and time and modeling different world scenarios and then seemingly choosing and thus creating observed behavior for other agents to try and infer whatever reward function that thing is trying to optimize for. It’s an interesting picture.

Rohin: I agree. It’s definitely the sort of puzzle that keeps you up at night. But I think one particularly important implication of this is that agency is about how a search process thinks about itself. It’s not just about that because I can look at what someone else is doing and attribute agency to them, figure out that they are themselves running an algorithm that chooses between actions. I don’t have a great story for this. Maybe it’s just humans realizing that other humans are just like them.

So this is maybe why we get acrimonious debates about whether evolution has agency, but we don’t get acrimonious debates about whether humans have agency. Evolution is sufficiently different from us that we can look at the way that it “chooses” things and we say, “Oh well, but we understand how it chooses things.” You could model it as a search process, but you could also model it as: all that’s happening is this deterministic or mostly deterministic process of which animals survived and had babies, and that is how things happen. And so therefore, it’s not an optimization process. There’s no search. It’s deterministic. And so you have these two conflicting views for evolution.

Whereas I can’t really say, “Hey Lucas, I know exactly deterministically how you’re going to do things.” I know this in the sense of like, man, there are electrons and atoms and stuff moving around in your brain and electrical signals, but that’s not going to let me predict what you will do. One of the best models I can have of you is just optimizing for some goal, whereas with evolution I can have a more detailed model. And so maybe that’s why I set aside the model of evolution as an optimizer.

Under this setting it’s like, “Okay, maybe our views of agency and optimization are just facts about how well we can model the process,” which cuts against the optimization-as-ontologically-basic thing, and it seems very difficult. It seems like a hard problem to me. I want to reiterate that most of this has just pushed me to say, let’s try and instead have an AI alignment focus, try to do things that we understand now and not get into the metaphilosophy problems. If we just get AI systems that broadly do what we want and are asking us for clarification, helping us evolve our thoughts over time, if we can do something like that. I think there are people who would argue that like no, of course, we can’t do something like that.

But if we could do something like that, that seems significantly more likely to work than something that has to have answers to all these metaphilosophical problems today. My position is just that this is doable. We should be able to make systems that are of the nature that I described.

Lucas: There’s clearly a lot of philosophical difficulties that go into IRL. Now it would be sort of good if we could just sort of take a step back and you could summarize your thoughts here on inverse reinforcement learning and the place that it has in AI alignment.

Rohin: I think my current position is something like fairly confidently don’t use IRL to infer a utility function that you then optimize over the long-term. In general, I would say don’t have a utility function that you optimize over the long-term because it doesn’t seem like that’s easily definable right now. So that’s like one class of things I think we should do. On the other hand I think IRL is probably good as a tool.

There is this nice property of IRL that you figure out what someone wants and then you help them do it. And this seems more robust than hand-writing the things that we care about in any particular domain. Even in a simple household robot setting, there are tons and tons of preferences that we have, like don’t break vases. Something like IRL could infer these sorts of things.

So I think IRL has definitely a place as a tool that helps us figure out what humans want, but I don’t think the full story for alignment is going to rest on IRL in particular. It gets us good behavior in the present, but it doesn’t tell us how to extrapolate on into the future. Maybe if you did IRL that let you infer how we want the AI system to extrapolate our values or to figure out IRL and our meta-preferences about how the algorithm should infer our preferences or something like this, that maybe could work, but it’s not obvious to me. It seems worth trying at some point.

TLDR, don’t use it for long-term utility function. Do use it as a tool to get decent behavior in the short-term. Maybe also use it as a tool to infer meta-preferences. That seems broadly good, but I don’t know that we know enough about that setting yet.

Lucas: All right. Yeah, that’s all just super interesting and it’s sort of just great to hear how the space has unfolded for you and what your views are now. So I think that we can just sort of pivot here into the AI alignment problem more generally and so now that you’ve moved on from being as excited about IRL, what is essentially capturing your interests currently in the space of AI alignment?

Rohin: The thing that I’m most interested in right now is can we build an AI system that basically evolves over time with us. I’m thinking of this now as like a human-AI interaction problem. You’ve got an AI system. We want to figure out how to make it so that it broadly helps us, but also at the same time figures out what it needs to do based on some sort of data that comes from humans. Now, this doesn’t have to be the human saying something. It could be from their behavior. It could be things that they have created in the past. It could be all sorts of things. It could be a reward function that they write down.

But I think the perspective that the things that are easy to infer are the things that are specific to our current environment is pretty important. What I would like to do is build AI systems that infer preferences in the current environment or things we want in the current environment and do those reasonably well, but don’t just extrapolate to the future, and instead let humans adapt to the future and then figure out what the humans value then and do things based on that.

There are a few ways that you could imagine this going. One is this notion of corrigibility in the sense that Paul Christiano writes about it, not the sense that MIRI writes about it, where the AI is basically trying to help you. And if I have an AI that is trying to help me, well, I think one of the most obvious things for someone who’s trying to help me to do is make sure that I remain in effective control of any power resources that might be present that the AI might have and to ask me if my values change in the future or if what I want the AI to do changes in the future. So that’s one thing that you might hope to do.

You could also imagine building a norm-following AI. So I think human society basically just runs on norms that we mostly all share and tend to follow. We have norms against particularly bad things like murdering people and stealing. We have norms against shoplifting. We have maybe less strong norms against littering. Unclear. And then we also have norms for things that are not very consequential. We have norms against randomly knocking over a glass at a restaurant in order to break it. That is also a norm. Even though there are quite often times where I’m like, “Man, it would be fun to just break a glass at the restaurant. It’s very cathartic,” but it doesn’t happen very often.

And so if we could build an AI system that could infer and follow those norms, it seems like this AI would behave in a more human-like fashion. This is a pretty new line of thought so I don’t know whether this works, but it could be that such an AI system is simultaneously behaving in a fashion that humans would find acceptable and also lets us do pretty cool, interesting, new things like developing new technologies and stuff that humans can then deploy and the AI doesn’t just unilaterally deploy without any safety checks or running it by humans or something like that.

Lucas: So let’s just back up a little bit here in terms of the picture of AI alignment. So we have a system that we do not want to extrapolate too much toward possible future values. It seems that there are all these ways in which we can be using the AI first to sort of amplify our own decision making and then also different methodologies which reflect the way that human beings update their own values and preferences over time, something like as proposed by I believe Paul Christiano and Geoffrey Irving and other people at OpenAI, like alignment through debate.

And there’s just all these sorts of epistemic practices of human beings with regards to sort of this world model building and how that affects shifts in value and preferences, also given how the environment changes. So yeah, it just seems like tracking overall these things, finding ways in which AI can amplify or participate in those sort of epistemic practices, right?

Rohin: Yeah. So I definitely think that something like amplification can be thought of as improving our epistemics over time. That seems like a reasonable way to do it. I haven’t really thought very much about how amplification or debate scale with changing environments. They both operate under this general idea that we could have a deliberation tree and in principle what we want is this exponentially sized deliberation tree where the human goes through all of the arguments and counter-arguments and breaks those down into sub-points in excruciating detail in a way that no human could ever actually do because it would take way too long.

And then amplification and debate basically show you how to get the outcome that this reasoning process would have given by using an AI system to assist the human. I don’t know if I would call it like improving human epistemics, but more like taking whatever epistemics you already have and running it for a long amount of time. And it’s possible that in that long amount of time you actually figure out how to do better epistemics.

I’m not sure that this perspective really talks very much about how preferences change over time. You would hope that it would just naturally be robust to that in that as the environment changes, your deliberation starts looking different in that like okay, now suddenly we have to go back to my example before we have uploads and we’re like egalitarianism now seems to have some really weird consequences. And then presumably the deliberation tree that amplification and debate are mimicking is going to have a bunch of thoughts about do we actually want egalitarianism now, what were the moral intuitions that pushed us towards this? Is there some equivalent principle that lets us keep our moral intuitions, but doesn’t have this weird property where a single person can decide the outcome of an election, et cetera, et cetera.

I think they were not designed to do this, but by virtue of being based off how a human would think, what a human would do if they got a long time and a lot of helpful tools to think about it, they’re essentially just inheriting these properties from the human. If the human, as the environment changed, would start rethinking their priorities or what they care about, then so too would amplification and debate.

Lucas: I think here it also has me thinking about what are the meta-preferences and the meta-meta-preferences and if you could imagine taking a human brain and then running it until the end, through decision and rational and logical thought trees over enough time, with enough epistemics and power behind it to try to sort of navigate its way to the end. It just raises interesting questions about like is that what we want? Is taking that over every single person and then sort of just preference aggregating it all together, is that what we want? And what is the role of moral philosophy for thinking here?

Rohin: Well, so one thing is that whatever moral philosophy you would do, so would the amplification of you, in theory. I think the benefit of these approaches is that they have this nice property that holds in the limit of good AI and idealizations, properly mimicking you and so on, so forth. In this sort of nice world where this all works in a nice, ideal way, it seems like any consideration you can have or you would have, so would the agent produced by iterated amplification or debate.

And so if you were going to do a bunch of moral philosophy and come to some sort of decision based on that, so would iterated amplification or debate. So I think it’s like basically here is how we build an AI system that solves the problems in the same way that a human would solve them. And so then you might be worried about, hey, maybe humans themselves are just not very good at solving problems. It looks like most humans in the world don’t do moral philosophy and don’t extrapolate their values well into the future. And the only reason we have moral progress is because younger generations keep getting born and they have different views than the older generations.

That, I think, could in fact be a problem, but I think there’s hope that we could like train humans to have these nice sorts of properties, good epistemics, such that they would provide good training data for iterated amplification if there comes a day where we think we can actually train iterated amplification to mimic human explicit reasoning. They do both have the property that they’re only mimicking the explicit reasoning and not necessarily the implicit reasoning.

Lucas: Do you want to unpack that distinction there?

Rohin: Oh, yeah. Sure. So both of them require that you take your high-level question and decompose it into a bunch of sub-questions or sorry, the theoretical model of them has that. This is like pretty clear with iterated amplification. It is less clear with debate. At each point you need to have the top level agent decompose the problem into a bunch of sub-problems. And this basically requires you to be able to decompose tasks into clearly specified sub-tasks, where clearly specified could mean in natural language, but you need to make it explicit in a way that the agent you’re assigning the task to can understand it without having to have your mind.

Whereas if I’m doing some sort of programming task or something, often I will just sort of know what direction to go in next, but not be able to cleanly formalize it. So you’ll give me some like challenging algorithms question and I’ll be like, “Oh, yeah, kind of seems like dynamic programming is probably the right thing to do here.” And maybe if I consider it this particular way, maybe if I put these things in the stack or something, but even the fact that I’m saying this out in natural language is misrepresenting my process.

Really there’s some intuitive, not verbalizable process going on in my head that somehow navigates the space of possible programs and picks a thing, and I think the reason I can do this is because I’ve been programming for a long time and I’ve trained a bunch of intuitions and heuristics that I cannot easily verbalize as some nice decomposition. So that’s sort of implicit in this thing. If you did want that to be incorporated in iterated amplification, it would have to be incorporated in the base agent, the one that you start with. But if you start with something relatively simple, which I think is often what we’re trying to do, then you don’t get those human abilities and you have to rediscover them in some sense through explicit decompositional reasoning.
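The kind of explicit decomposition described above can be sketched with a toy Python example. This is only an illustration under stated assumptions, not the actual iterated amplification training scheme (there is no learning or distillation here): `base_agent`, `decompose`, and `amplified_answer` are hypothetical names, and the “task” is summing a list, chosen because it splits cleanly into clearly specified sub-tasks.

```python
def base_agent(question):
    """A deliberately weak agent: it can only add up at most two numbers."""
    if len(question) > 2:
        raise ValueError("too hard for the base agent")
    return sum(question)


def decompose(question):
    """Explicit decomposition: split a long sum into two clearly specified halves."""
    mid = len(question) // 2
    return question[:mid], question[mid:]


def amplified_answer(question):
    """Answer a large question by recursive decomposition, calling the weak
    base agent only on pieces it can handle."""
    if len(question) <= 2:
        return base_agent(question)
    left, right = decompose(question)
    # Each sub-question is fully specified by the list handed down, so the
    # sub-agent needs no access to the parent's unstated intuitions.
    return base_agent([amplified_answer(left), amplified_answer(right)])


print(amplified_answer([3, 1, 4, 1, 5, 9, 2, 6]))
```

The toy works only because the task has an obvious explicit decomposition. The intuition-driven programming example in the conversation is exactly the kind of task for which no such clean `decompose` step is available, so a simple base agent would have to rediscover those abilities through explicit reasoning.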

Lucas: Okay, cool. Yeah, that’s super interesting. So now to frame all of this again, do you want to sort of just give a brief summary of your general views here?

Rohin: I wish there were a nice way to summarize this. That would mean we’d made more progress. It seems like there’s a bunch of things that people have proposed. There’s amplification/debate, which are very similar, IRL as a general. I think, but I’m not sure, that most of them would agree that we don’t want to like infer a utility function and optimize it for the long-term. I think more of them are like, yeah, we want this sort of interactive system with the human and the AI. It’s not clear to me how different these are and what they’re aiming for in amplification and debate.

So here we’re sort of looking at how things change over time and making that a pretty central piece of how we’re thinking about it. Initially the AI is trying to help the human, the human has some sort of reward function, the AI is trying to learn it and help them, but over time this changes and the AI has to keep up with it. And under this framing you want to think a lot about interaction, you want to think about getting as many bits about reward from the human to the AI as possible. Maybe think about control theory and how human data is in some sense a control mechanism for the AI.

You’d want to like infer norms and ways that people behave, how people relate with each other, try to have your AI systems do that as well. So that’s one camp of things, have the AI interact with humans, behave generally in the way that humans would say is not crazy, update those over time. And then there’s the other side which is like have an AI system that is taking human reasoning, human explicit reasoning and doing that better or doing that more, which allows it to do anything that the human would have done, which is more taking the thought process that humans go through and putting that at the center. That is the thing that we want to mimic and make better.

Sort of parts where our preferences change over time is something that you get for free in some sense by mimicking human thought processes or reasoning. Summary, those are two camps. I am optimistic about both of them, think that people should be doing research on both of them. I don’t really have much more of a perspective of that, I think.

Lucas: That’s excellent. I think that’s a super helpful overview actually. And given that, how do you think that your views of AI alignment have changed over the past few years?

Rohin: I’ll note that I’ve only been in this field for I think 15, 16 months now, so just over a year, but over that year I definitely came into it thinking what we want to do is infer the correct utility function and optimize it. And I have moved away quite strongly from that. I, in fact, recently started writing a value learning sequence or maybe collating is a better word. I’ve written a lot of posts that still have to come out, but I also took a few posts from other people.

The first part of that sequence is basically arguing that it seems bad to try and define a utility function and then optimize it. So I’m just trying to move away from long-term utility functions in general or long-term goals or things like this. That’s probably the biggest update since starting. Other things that I’ve changed: a focus more on norms than on values, trying to do things that are easy to infer right now in the current environment and making sure that we update on these over time, as opposed to trying to get the one true thing that depends on us solving all the hard metaphilosophical problems. That’s, I think, another big change in the way I’ve been thinking about it.

Lucas: Yeah. I mean, there are different levels of alignment at their core.

Rohin: Wait, I don’t know exactly what you mean by that.

Lucas: There’s your original point of view where you said you came into the field and you were thinking infer the utility function and maximize it. And your current view is now that you are moving away from that and beginning to be more partial towards the view which takes it that we want to be like inferring from norms in the present day just like current preferences and then optimizing that rather than extrapolating towards some ultimate end-goal and then trying to optimize for that. In terms of aligning in these different ways, isn’t there a lot of room for value drift, allowing the thing to run in the real world rather than amplifying explicit human thought on a machine?

Rohin: Value drift is an interesting question. In some sense, I do want my values to drift, in that whatever I think today about the correct way that the future should go or something like that, I probably will not endorse that in the future, and I endorse the fact that I won’t endorse it in the future. I do want to learn more and then figure out what to do in the future based on that. You could call that value drift; that is a thing I want to happen. So in that sense value drift wouldn’t be a bad thing, but then there’s also a sense in which my values could change in the future in ways that I don’t endorse, and then that one maybe is value drift that is bad.

So yeah, if you have an AI system that’s operating in the real world and changes over time as we humans change, yes, there will be changes in what the AI system is trying to achieve over time. You could call that value drift, but value drift usually has a negative connotation, whereas this process of learning as the environment changes seems to me like a positive thing. It’s a thing I would want to do myself.

Lucas: Yeah, sorry, maybe I wasn’t clear enough. In the case of running human beings in the real world, where there are like the causes and effects of history and whatever else and how that actually will change the expression of people over time. Because if you’re running this version of AI alignment where you’re sort of just always optimizing the current set of values in people, progression of the world and of civilization is only as good as the best of all human like values and preferences in that moment.

It’s sort of like limited by what humans are in that specific environment and time, right? If you’re running that in the real world versus running some sort of amplified version of explicit human reasoning, don’t you think that they’re going to come to different conclusions?

Rohin: I think the amplified explicit human reasoning, I imagine that it’s going to operate in the real world. It’s going to see changes that happen. It might be able to predict those changes and then be able to figure out how to respond fast, before the changes even happen perhaps, but I still think of amplification as being very much embedded in the real world. Like you’re asking it questions about things that happen in the real world. It’s going to use explicit reasoning that it would have used if a human were in the real world and thinking about the question.

I don’t really see much of a distinction here. I definitely think that even in my setting where I’m imagining AI systems that evolve over time and change based on that, they are going to be smarter than humans, going to think through things a lot faster, be able to predict things in advance in the same way that amplified explicit reasoning would. Maybe there are differences, but value drift doesn’t seem like one of them, or at least I cannot predict right now how they will differ along the axis of value drift.

Lucas: So then just sort of again taking a step back to the ways in which your views have shifted over the past few years. Is there anything else there that you’d like to touch on?

Rohin: Oh man, I’m sure there is. My views changed so much because I was just so wrong initially.

Lucas: So most people listening should think that if given a lot more thought on this subject, that their views are likely to be radically different than the ones that they currently have and the conceptions that they currently have about AI alignment.

Rohin: Seems true for most listeners, yeah. Not all of them, but yeah.

Lucas: Yeah, I guess it’s just an interesting fact. Do you think this is like an experience of most people who are working on this problem?

Rohin: Probably. I mean, within the first year of working on the problem that seems likely. I mean just in general if you work on the problem, if you start with near no knowledge on something and then you work on it for a year, your views should change dramatically just because you’ve learned a bunch of things and I think that basically explains most of my changes in view.

It’s just actually hard for me to remember all the ways in which I was wrong back in the past, and I focused on not using utility functions because I think that is one that even other people in the field still believe right now. So that’s where that one came from, but there are like plenty of other things that I was just notably, easily, demonstrably wrong about that I’m having trouble recalling now.

Lucas: Yeah, and the utility function one I think is a very good example and I think that if it were possible to find all of these in your brain and distill them, I think it would make a very, very good infographic on AI alignment, because those misconceptions are also misconceptions that I’ve had and I share those and I think that I’ve seen them also in other people. A lot of sort of the intellectual blunders that you or I have made are probably repeated quite often.

Rohin: I definitely believe that. Yeah, I guess I could talk about the things that I’m going to be saying very soon in the value learning sequence. Those were definitely updates that I made, and one of those was the utility functions thing. Another one was thinking about what we want is for the human-AI system as a whole to be optimizing for some sort of goal. And this opens up a nice space of possibilities where the AI is not optimizing a goal, only the human-AI system together is. It helps to keep in mind that that is the goal, and not that the AI itself must be optimizing some sort of goal.

The idea of corrigibility itself as a thing that we should be aiming for was a pretty big update for me, took a while for me to get to that one. I think distributional shift was a pretty key concept that I learned at some point and started applying everywhere. One way of thinking about the evolving preferences over time thing is that humans, they’ve been trained on the environment that we have right now and arguably we’ve been trained on the ancestral environment too by evolution, but we haven’t been trained on whatever the future is going to be.

Or for a more current example, we haven’t been trained on social media. Social media is a fairly new thing affecting us in ways that we hadn’t considered in the past and this is causing us to change how we do things. So in some sense what’s happening is as we go into the future, we’re encountering a distributional shift and human values don’t extrapolate well to that distributional shift. What you actually need to do is wait for the humans to get to that point, let them experience it, train on it, have their values be trained on this new distribution and then figure out what they are, rather than trying to do it right now when their values are just going to be wrong or going to be not what they would get if they were actually in that situation.

Lucas: Isn’t that sort of summarizing coherent extrapolated volition?

Rohin: I don’t know that coherent extrapolated volition explicitly talks about having the human be in a new environment. I guess you could imagine that CEV considers … If you imagine like a really, really long process of deliberation in CEV, then you could be like, okay what would happen if I were in this environment and all these sorts of things happened. It seems like you would need to have a good model of how the world works and how physics works in order to predict what the environment would be like. Maybe you can do that and then in that case you simulate a bunch of different environments and you think about how humans would adapt and evolve and respond to those environments and then you take all of that together and you summarize it and distill it down into a single utility function.

Plausibly that could work. Doesn’t seem like a thing we can actually build, but as a definition of what we might want, that seems not bad. I think that is me putting the distributional shift perspective on CEV and it was not, certainly not obvious to me from the statement of CEV itself, that you’re thinking about how to mitigate the impact of distributional shift on human values. I think I’ve had this perspective and I’ve put it on CEV and I’m like, yeah, that seems fine, but it was not obvious to me from reading about CEV alone.

Lucas: Okay, cool.

Rohin: I recently posted a comment on the Alignment Forum talking about how we want to like … I guess this is sort of in corrigibility too, making an AI system that tries to help us as opposed to making an AI system that is optimizing the one true utility function. So that was an update I made, basically the same update as the one about aiming for corrigibility. I guess another update I made is that while there is a phase transition or something or like a sharp change in the problems that we see when AIs become human level or super-intelligent, I think the underlying causes of the problems don’t really change.

The underlying causes of problems with narrow AI systems are probably similar to the ones that underlie super-intelligent systems. Having their own reward function leads to problems both in narrow settings and in super-intelligent settings. This made me more optimistic about doing work trying to address current problems, but with an eye towards long-term problems.

Lucas: What made you have this update?

Rohin: Thinking about the problems a lot, in particular thinking about how they might happen in current systems as well. So I guess a prediction that I would make is that if it is actually true that superintelligence would end up killing us all or something like that, some like really catastrophic outcome. Then I would predict that before that, we will see some AI system that causes some other smaller scale catastrophe where I don’t know what catastrophe means, it might be something like oh, you humans die or oh, the power grid went down for some time or something like that.

And then before that we will have things that sort of fail in relatively unimportant ways, but in ways that say, like, here’s an underlying problem that we need to fix with how we build AI systems. If you extrapolate all the way back to today, that looks like, for example, the boat racing example from OpenAI, a reward hacking one. So generally expecting things to be more continuous. Not necessarily slow, but continuous. That update I made because of the posts arguing for slow takeoff from Paul Christiano and AI Impacts.

Lucas: Right. And the view there is sort of that the world will be populated with lower-level ML as we sort of start to ratchet up the capability of intelligence. So a lot of tasks will sort of be … already being done by systems that are slightly less intelligent than the current best system. And so all work ecosystems will already be fully flooded with AI systems optimizing within the spaces. So there won’t be a lot of space for the first AGI system or whatever to really get decisive strategic advantage.

Rohin: Yeah, would I make the prediction that we won’t have a system that gets a decisive strategic advantage? I’m not sure about that one. It seems plausible to me that we have one AI system that is improving over time and we use those improvements in society before it becomes super intelligent. But then by the time it becomes super intelligent, it is still the one AI system that is super intelligent. So it does gain a decisive strategic advantage.

An example of this would be if there was just one main AGI project. I would still predict that progress on AI would be continuous, but I would not predict a multipolar outcome in that scenario. The corresponding view is that while I still do use the terminology “first AGI” because it’s like pointing out some intuitive concept that I think is useful, it’s a very, very fuzzy concept and I don’t think we’ll be able to actually point at any particular system and say that was the first AGI. Rather we’ll point to like a broad swath of time and say, “Somewhere in there AI became generally intelligent.”

Lucas: There are going to be all these sort of like isolated meta-epistemic reasoning tools which can work in specific scenarios, which will sort of potentially aggregate in that fuzzy space to create something fully general.

Rohin: Yep. They’re going to be applied in some domains and then the percent of domains in which they apply will gradually grow greater and eventually we’ll be like, huh, looks like there’s nothing left for humans to do. It probably won’t be a surprise, but I don’t think there will be a particular point where everyone agrees, yep, looks like AI is going to automate everything in just a few years. It’s more like AI will start automating a bunch of stuff. The amount of stuff it automates will increase over time. Some people will see it coming, see full automation coming earlier, some people will be like nah, this is just a simple task that AI can do, still got a long ways to go for all the really generally intelligent stuff. People will sign on to like oh, yeah, it’s actually becoming generally intelligent at different spots.

Lucas: Right. If you have a bunch of small mammalian level AIs automating a lot of stuff in industry, there would likely be a lot of people whose timelines would be skewed in the wrong direction.

Rohin: I’m not even sure this was a point about timelines. It was just a point of like which is the system that you call AGI. I claim this will not have a definitive answer. So that was also an update to how I was thinking. That one, I think, is like more generally accepted in the community. And this was more like well, all of the literature on AI safety that’s publicly available and like commonly read by EAs doesn’t really talk about these sorts of points. So I just hadn’t encountered these things when I started out. And then I encountered them, or maybe I thought of them myself, I don’t remember, but like once I encountered the arguments I was like, yeah, that makes sense and maybe I should have thought of that before.

Lucas: In the sequence which you’re writing, do you sort of like cover all of these items which you didn’t think were in the mainstream literature?

Rohin: I cover some of them. The first few things I told you were I was just like what did I say in the sequence. There were a few I think that probably aren’t going to be in that sequence just because there’s a lot of stuff that people have not written down.

Lucas: It’s pretty interesting because the way in which the AI alignment field is evolving is sometimes, it’s often difficult to have a bird’s-eye view of where it is and track avant-garde ideas being formulated in people’s brains and being shared.

Rohin: Yeah. I definitely agree. I was hoping that the Alignment Newsletter, which I write, would help with that. I would say it probably speeds up the process a bit, but it’s definitely not keeping you on the forefront. There are many ideas that I’ve heard about, that I’ve even read documents about, that haven’t made it in the newsletter yet because they haven’t become public.

Lucas: So how many months behind do you think for example, the newsletter would be?

Rohin: Oh, good question. Well, let’s see. There’s a paper that I started writing in May or April that has not made it into the newsletter yet. There’s a paper that I finished and submitted in October that has not made it to the newsletter yet, or was it September, possibly September. That one will come out soon. That suggests a three month lag. But I think many others have been longer than that. Admittedly, this is for academic researchers at CHAI. I think CHAI is like we tend to publish using papers and not blog posts and this results in the longer delay on our side.

Also because work on relative reachability, for example, I’ve learned about quite a bit. I learned about maybe four or five months before she released it and that’s when it came out in the newsletter. And of course, she’d been working on it for longer. Or like AI safety by debate, I think I learned about six or seven months before it was published and came out in the newsletter. So yeah, somewhere between three months and half a year for things seems likely. For things that I learned from MIRI, it’s possible that they never get into the newsletter because they’re never made public. So yeah, there’s a fairly broad range there.

Lucas: Okay. That’s quite interesting. I think that also sort of gives people a better sense of what’s going on in technical AI alignment because it can seem kind of black boxy.

Rohin: Yeah. I mean, in some sense this is a thing that all fields have. I used to work in programming languages. There we would often write a paper and submit it and then go and present about it a year later, by which time we had moved on, done a whole other project and written another paper, and then we’d go back and we’d talk about this. I definitely remember sometimes grad students being like, “Hey, I want to get this practice document.” I say, “What’s it about?” It’s like some topic. And I’m like wait, but you did that. I heard about this like two years ago. And they’re like, yep, just got published.

So in that sense, I think both AI is faster and AI alignment is I think even faster than AI because it’s a smaller field and people can talk to each other more, and also because a lot of us write blog posts. Blog posts are great.

Lucas: They definitely play a crucial role within the community in general. So I guess just sort of tying things up a bit more here, pivoting back to a broader view. Given everything that you’ve learned and how your ideas have shifted, what are you most concerned about right now in AI alignment? How are the prospects looking to you and how does the problem of AI alignment look right now to Rohin Shah?

Rohin: I think it looks pretty tractable, pretty good. Most of the problems that I see are, I think, ones that we can see in advance and probably can solve. None of these seem particularly impossible to me. I think I also give more credit to the machine learning community or AI community than other researchers do. I trust in our ability, where “our” here means the AI field broadly, to notice what things could go wrong and fix them, in a way that maybe other researchers in AI safety don’t.

I think one of the things that feels most problematic to me right now is the problem of inner optimizers, which I’m told there will probably be a sequence on in the future because there aren’t great resources on it right now. So basically this is the idea that if you run a search process over a wide space of strategies or options and you search for something that gets you good external reward or something like that, what you might end up finding is a strategy that is itself a consequentialist agent that’s optimizing for its own internal reward, and that internal reward will agree with the external reward on the training data because that’s why it was selected, but it might diverge as soon as there’s any distribution shift.

And then it might start optimizing against us adversarially in the same way that you would get if you like gave a misspecified reward function to an RL system today. This seems plausible to me. I’ve read a bit more about this and talked to people about this and things that aren’t yet public, but hopefully will soon be. I definitely recommend reading that if it ever comes out, but yeah, this seems like it could be a problem. I don’t think we have any instance of it being a problem yet. Seems hard to detect and I’m not sure how I would fix it right now.

But I also don’t think that we’ve thought about the problem or I don’t think I’ve thought about the problem that much. I don’t want to say like, “Oh man, this is totally unsolvable,” yet. Maybe I’m just an optimistic person by nature. I mean, that’s definitely true, but maybe that’s biasing my judgment here. Feels like we could probably solve that if it ends up being a problem.

Lucas: Is there anything else here that you would like to wrap up on in terms of AI alignment or inverse reinforcement learning?

Rohin: I want to continue to exhort that we should not be trying to solve all the metaphilosophical problems and we should not be trying to like infer the one true utility function and we should not be modeling an AI as pursuing a single goal over the long-term. That is a thing I want to communicate to everybody else. Apart from that I think we’ve covered everything at a good depth. Yeah, I don’t think there’s anything else I’d add to that.

Lucas: So given that I think rather succinct distillation of what we are trying not to do, could you try and offer an equally succinct distillation of what we are trying to do?

Rohin: I wish I could. That would be great, wouldn’t it? I can tell you that I can’t do that. I could give you like a suggestion on what we are trying to do instead, which would be try to build an AI system that is corrigible, that is doing what we want, but it’s going to remain under human control in some sense. It’s going to ask us, take our preferences into account, not try to go off behind our backs and optimize against us. That is a summary of a path that we could go down that I think is promising, or what I would want our AI systems to be like. But that’s unfortunately very sparse on concrete details because I don’t know those concrete details yet.

Lucas: Right. I think that that sort of perspective shift is quite important. I think it changes the nature of the problem and how one thinks about the problem, even at the societal level.

Rohin: Yeah. Agreed.

Lucas: All right. So thank you so much Rohin, it’s really been a pleasure. If people are interested in checking out some of this work that we have mentioned or following you, where’s the best place to do that?

Rohin: I have a website. It is just RohinShah.com. Subscribing to the Alignment Newsletter is … Well, it’s not a great way to figure out what I personally believe. Maybe if you’d keep reading the newsletter over time and read my opinions for several weeks in a row, maybe then you’d start getting a sense of what Rohin thinks. It will soon have links to my papers and things like that, but yeah, that’s probably the best way on this, like my website. I do have a Twitter, but I don’t really use it.

Lucas: Okay. So yeah, thanks again Rohin. It’s really been a pleasure. I think that was a ton to think about and I think that I probably have a lot more of my own thinking and updating to do based off of this conversation.

Rohin: Great. Love it when that happens.

Lucas: So yeah. Thanks so much. Take care and talk again soon.

Rohin: All right. See you soon.

Lucas: If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: On Becoming a Moral Realist with Peter Singer

Are there such things as moral facts? If so, how might we be able to access them? Peter Singer started his career as a preference utilitarian and a moral anti-realist, and then over time became a hedonic utilitarian and a moral realist. How does such a transition occur, and which positions are more defensible? How might objectivism in ethics affect AI alignment? What does this all mean for the future of AI?

On Becoming a Moral Realist with Peter Singer is the sixth podcast in the AI Alignment series, hosted by Lucas Perry. For those of you that are new, this series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Peter Singer. Peter is a world-renowned moral philosopher known for his work on animal ethics, utilitarianism, global poverty, and altruism. He’s a leading bioethicist, the founder of The Life You Can Save, and currently holds positions at both Princeton University and The University of Melbourne.

Topics discussed in this episode include:

  • Peter’s transition from moral anti-realism to moral realism
  • Why emotivism ultimately fails
  • Parallels between mathematical/logical truth and moral truth
  • Reason’s role in accessing logical spaces, and its limits
  • Why Peter moved from preference utilitarianism to hedonic utilitarianism
  • How objectivity in ethics might affect AI alignment

In this interview we discuss ideas contained in the work of Peter Singer. You can learn more about Peter’s work here and find many of the ideas discussed on this podcast in his work The Point of View of the Universe: Sidgwick and Contemporary Ethics. You can listen to the podcast above or read the transcript below.

Lucas: Hey, everyone, welcome back to the AI Alignment Podcast series. I’m Lucas Perry, and today, we will be speaking with Peter Singer about his transition from being a moral anti-realist to a moral realist. In terms of AI safety and alignment, this episode primarily focuses on issues in moral philosophy.

In general, I have found the space of moral philosophy to be rather neglected in discussions of AI alignment where persons are usually only talking about strategy and technical alignment. If it is unclear at this point, moral philosophy and issues in ethics make up a substantial part of the AI alignment problem and have implications in both strategy and technical thinking.

In terms of technical AI alignment, it has implications in preference aggregation and its methodology, in inverse reinforcement learning, and in preference learning techniques in general. It affects how we ought to proceed with inter-theoretic comparisons of value, with idealizing persons or agents in general and what it means to become idealized, how we deal with moral uncertainty, and how robust preference learning versus moral reasoning systems should be in AI systems. It has very obvious implications in determining the sort of society we are hoping for right before, during, and right after the creation of AGI.

In terms of strategy, strategy has to be directed at some end and all strategies smuggle in some sort of values or ethics, and it’s just good here to be mindful of what those exactly are.

And with regards to coordination, we need to be clear, on a descriptive account, of different cultures or groups’ values or meta-ethics and understand how to move from the state of all current preferences and ethics onwards given our current meta-ethical views and credences. All in all, this barely scratches the surface, but it’s just a point to illustrate the interdependence going on here.

Hopefully this episode does a little to nudge your moral intuitions around a little bit and impacts how you think about the AI alignment problem. In coming episodes, I’m hoping to pivot into more strategy and technical interviews, so if you have any requests, ideas, or persons you would like to see interviewed, feel free to reach out to me at lucas@futureoflife.org. As usual, if you find this podcast interesting or useful, it’s really a big help if you can help share it on social media or follow us on your preferred listening platform.

As many of you will already know, Peter is a world-renowned moral philosopher known for his work on animal ethics, utilitarianism, global poverty, and altruism. He’s a leading bioethicist, the founder of The Life You Can Save, and currently holds positions at both Princeton University and The University of Melbourne. And so, without further ado, I give you Peter Singer.

Thanks so much for coming on the podcast, Peter. It’s really wonderful to have you here.

Peter: Oh, it’s good to be with you.

Lucas: So just to jump right into this, it would be great if you could just take us through the evolution of your metaethics throughout your career. As I understand, you began giving most of your credence to being an anti-realist and a preference utilitarian, but then over time, it appears that you’ve developed into a hedonic utilitarian and a moral realist. Take us through the evolution of these views and how you developed and arrived at your new ones.

Peter: Okay, well, when I started studying philosophy, which was in the 1960s, I think the dominant view, at least among people who were not religious and didn’t believe that morals were somehow an objective truth handed down by God, was what was then referred to as an emotivist view, that is the idea that moral judgments express our attitudes, particularly, obviously from the name, emotional attitudes, that they’re not statements of fact, they don’t purport to describe anything. Rather, they express attitudes that we have and they encourage others to share those attitudes.

So that was probably the first view that I held, siding with people who were non-religious. It seemed like a fairly obvious option. Then I went to Oxford and I studied with R.M. Hare who was a professor of moral philosophy at Oxford at the time and a well-known figure in the field. His view was also in this general ballpark of non-objectivist or, as we would now say, non-realist theories; non-cognitivist was another term used for them. They didn’t purport to be about knowledge.

But his view was that when we make a moral judgment, we are prescribing something. So his idea was that moral judgments fall into the general family of imperative judgments. So if I tell you shut the door, that’s an imperative. It doesn’t say anything that’s true or false. And moral judgments were a particular kind of imperative according to Hare, but they had this feature that they had to be universalizable. So by universalizable, Hare meant that if you were to make a moral judgment, your prescription would have to hold in all relevantly similar circumstances. And relevantly similar was defined in such a way that it didn’t depend on who the people were.

So, for example, if I were to prescribe that you should be my slave, the fact that I’m the slave master and you’re the slave isn’t a relevantly similar circumstance. If there’s somebody just like me and somebody just like you, that I happen to occupy your place, then the person who is just like me would also be entitled to be the slave master of me ’cause now I’m in the position of the slave.

Obviously, if you think about moral judgments that way, that does put a constraint on what moral judgments you can accept because you wouldn’t want to be a slave, presumably. So I liked this view better than the straightforwardly emotivist view because it did seem to give more scope for argument. It seemed to say look, there’s some kind of constraint that really, in practice, means we have to take everybody’s interests into account.

And I thought that was a good feature about this, and I drew on that in various kinds of applied contexts where I wanted to make moral arguments. So that was my position, I guess, after I was at Oxford, and for some decades after that, but I was never completely comfortable with it. And the reason I was not completely comfortable with it was that there was always a question you could ask on Hare’s view. Hare said where does this universalizability constraint come from on our moral judgment? And Hare’s answer was well, it’s a feature of moral language. It’s implied in, say, using the terms ought or good or bad or duty or obligation. It’s implied that the judgments you are making are universalizable in this way.

And that, in itself, was plausible enough, but it was open to the response that well, in that case, I’m just not gonna use moral language. If moral language requires me to make universalizable prescriptions and that means that I can’t do all sorts of things or can’t advocate all sorts of things that I would want to advocate, then I just won’t use moral language to justify my conduct. I’ll use some other kind of language, maybe prudential language, language of furthering my self-interests. And what’s wrong with doing that? Moreover, it’s not just that they can do that, but tell me what’s wrong with them doing that.

So this is a kind of a question about why act morally. And on his view, it wasn’t obvious what the answer to that would be, and, in particular, it didn’t seem that there would be any kind of answer like that’s irrational or you’re missing something. It seemed, really, as if it was an open choice that you had whether to use moral language or not.

So as I got further into the problem, I tried to develop arguments that would show that it was a requirement of reason, not just a requirement of moral language, but a requirement of reason that we universalize our judgments.

And yet, there was obviously a problem in fitting that in to Hare’s framework, which, as I’ve been saying, was a framework within this general non-cognitivist family. And for Hare, the idea that there are objective reasons for action didn’t really make sense. There were just these desires that we had, which led to us making prescriptions and then the constraint that we universalize those prescriptions, but he explicitly talked about the possibility of objective prescriptions and said that that was a kind of nonsense, which I think comes out of the general background of the kind of philosophy that came out of logical positivism and the verificationist idea that things that you couldn’t verify were nonsense or so and so. And that’s why I was pretty uncomfortable with this, but I didn’t really see bright alternatives to it for some time.

And then, I guess, gradually, I was persuaded by a number of philosophers whom I respected that Hare was wrong about rejecting the idea of objective truth in morality. I talked to Tom Nagel and probably most significant was the work of Derek Parfit, especially his work On What Matters, volumes one and two, which I saw in advance in draft form. He circulated drafts of his books to lots of people who he thought might give him some useful criticism. And so I saw that many years before it came out, and the arguments did seem, to me, pretty strong, particularly the objections to the kind of view that I’d held, which, by this time, was no longer usually called emotivism, but was called expressivism, but I think it’s basically a similar view, a view in the ballpark.

And so I came to the conclusion that there is a reasonable case for saying that there are objective moral truths and this is not just a matter of our attitudes or of our preferences universalized, but there’s something stronger going on and it’s, in some ways, more like the objectivity of mathematical truths or perhaps of logical truths. It’s not an empirical thing. This is not something you can describe that comes in the world, the natural world of our senses, that you can find or prove empirically. It’s rather something that is rationally self-evident, I guess, to people who reflect on it properly and think about it carefully. So that’s how I gradually made the move towards an objectivist metaethic.

Lucas: I think here, it would be really helpful if you could thoroughly unpack what your hedonistic utilitarian objectivist metaethics actually looks like today, specifically getting into the most compelling arguments that you found in Parfit and in Nagel that led you to this view.

Peter: First off, I think we should be clear that being an objectivist about metaethics is one thing. Being a hedonist rather than a preference utilitarian is a different thing, and I’ll describe … There is some connection between them as I’ll describe in a moment, but I could have easily become an objectivist and remained a preference utilitarian or held some other kind of normative moral view.

Lucas: Right.

Peter: So the metaethical view is separate from that. What were the most compelling arguments here? I think one of the things that had stuck with me for a long time and that had restrained me from moving in this direction was the idea that it’s hard to know what you mean when you say that something is an objective truth outside the natural world. So in terms of saying that things are objectively true in science, the truths of scientific investigation, we can say well, there’s all of this evidence for it. No rational person would refuse to believe this once they were acquainted with all of this evidence. So that’s why we can say that that is objectively true.

But that’s clearly not going to work for truths in ethics, which, assuming of course that we’re not naturalists, that we don’t think this can be deduced from some examination of human nature or the world, I certainly don’t think that and the people that are influential on me, Nagel and Parfit in particular, also didn’t think that.

So the only restraining question was well, what could this really amount to? I had known going back to the intuitionists in the early 20th century, people like W.D. Ross or, earlier, Henry Sidgwick, who was a utilitarian objectivist philosopher, that people made the parallel with mathematical proofs that there are mathematical proofs that we see as true by direct insight into their truths by their self-evidence, but I have been concerned about this. I’d never really done a deep study of philosophy of mathematics, but I’d been concerned about this because I thought there’s a case for saying that mathematical truths are analytic truths, they’re truths in virtue of the meanings of the terms and in virtue of the way we define what we mean by the numbers and by equals or the various other terms that we use in mathematics so that it’s basically just the unpacking of an analytic system.

The philosophers that I respected didn’t think this. This view had been more popular at the time when I was a student and it had stuck with me for a while, and although it’s not disappeared, I think it’s perhaps not as widely held a view now as it was then. So that plus the arguments that were being made about how do we understand mathematical truths, how do we understand the truths of logical inference. We grasp these as self-evident. We find them undeniable, yet this is, again, a truth that is not part of the empirical world, but it doesn’t just seem that it’s an analytic truth either. It doesn’t just seem that it’s the meanings of the terms. It does seem that we know something when we know the truths of logic or the truths of mathematics.

On this basis, it started to seem like the idea that there are these non-empirical truths in ethics as well might be more plausible than I thought it was before. And I also went back and read Henry Sidgwick who’s a philosopher that I greatly admire and that Parfit also greatly admired, and looked at his arguments about what he saw as, what he called, moral axioms, and that obviously makes the parallel with axioms of mathematics.

I looked at them and they did seem to me difficult to deny, that is, claims, for example, that there’s no reason for preferring one moment of our existence to another in itself. In other words that we shouldn’t discount the future, except for things like uncertainty, but otherwise, the future is just as important as the present. There is also an idea somewhat similar to Hare’s universalizability, but somewhat differently stated by Sidgwick, that if something is right for someone, then it’s right independently of the identities of the people involved. But for Sidgwick, as I say, that was, again, a truth of reason, not simply an implication of the use of particular moral terms. Thinking about that, that started to seem right to me, too.

And, I guess, finally, Sidgwick’s claim that the interests of one individual are no more important than the interests of another, assuming that the goods involved that can be done to that person, that is the extent of their interests, are similar. Sidgwick’s claim was that people who were reflecting carefully on these truths can see that they’re true, and I thought about that, and it did seem to me that … It was pretty difficult to deny, not that nobody will deny them, but that they do have a self-evidence about them. That seemed to me to be a better basis for ethics than views that I’d been holding up to that point, the views that came out of, originally, emotivism and then out of prescriptivism.

There was a reasonable chance that that was right. As you say, I should give it more credence than I had. It’s not that I’m 100% certain that it’s right by any means, but that’s a plausible view that’s worth defending and trying to see what objections people make to it.

Lucas: I think there’s three things here that would be helpful for us to dive in more on. The first thing is, and this isn’t a part of metaethics which I’m particularly acquainted with, so, potentially, you can help guide us through this part a little bit more: this non-naturalism vs naturalism argument. Your view, I believe, is a non-naturalist view, in that you’re claiming that you cannot deduce the axioms of ethics or the basis of ethics from a descriptive or empirical account of the universe?

Peter: That’s right. There certainly are still naturalists around. I guess Peter Railton is a well-known, contemporary, philosophical naturalist. Perhaps Frank Jackson, my Australian friend and colleague. And some of the naturalist views have become more complicated than they used to be. I suppose the original idea of naturalism that people might be more familiar with is simply the claim that there is a human nature and that acting in accordance with that human nature is the right thing to do, so you describe human nature and then you draw from that what are the characteristics that we ought to follow.

That, I think, just simply doesn’t work. I think it has its origins in a religious framework in which you think that God has created our nature with particular purposes that we should behave in certain ways. But the naturalists who defend it, going back to Aquinas even, maintain that it’s actually independent of that view.

If, in fact, you take an evolutionary view of human nature, as I think we should, then our nature is morally neutral. You can’t derive any moral conclusions from what our nature is like. It might be relevant to know what our nature is like in order to know that if you do one thing, that might lead to certain consequences, but it’s quite possible that, for example, our nature is to seek power and to use force to obtain power, that that’s an element of human nature or, on a group level, to go to war in order to have power over others, and yet naturalists wouldn’t wanna say that those are the right things. They would try and give some account as to why some of that’s a corruption of human nature.

Lucas: Putting aside naturalist accounts that involve human nature, what about a purely descriptive or empirical understanding of the world, which includes, for example, sentient beings and suffering, and suffering is like a substantial and real ontological fact of the universe and the potential of deducing ethics from facts about suffering and what it is like to suffer? Would that not be a form of naturalism?

Peter: I think you have to be very careful about how you formulate this. What you said sounds a little bit like what Sam Harris says in his book, The Moral Landscape, which does seem to be a kind of naturalism because he thinks that you can derive moral conclusions from science, including exactly the kinds of things that you’ve talked about, but I think there’s a gap there, and the gap has to be acknowledged. You can certainly describe suffering and you can describe happiness conversely, but you need to get beyond description if you’re going to have a normative judgment. That is if you’re gonna have a judgment that says what we ought to do or what’s the right thing to do or what’s a good thing to do, there’s a step that’s just being left out.

If somebody says sentient beings can suffer pain or they can be happy, this is what suffering pain is like, this is what being happy is like; therefore, we ought to promote happiness, a step is being left out. This goes back to David Hume, who pointed out that various moral arguments describe the world using is, is, is, this is the case, and then, suddenly, but without any explanation, they say and therefore, we ought. It needs to be explained how you get from the is statements to the ought statements.

Lucas: It seems that reason, whatever reason might be and however you might define that, seems to do a lot of work at the foundation of your moral view because it seems that reason is what leads you towards the self-evident truth of certain foundational ethical axioms. Why might we not be able to pull the same sort of move with a form of naturalistic moral realism like Sam Harris develops by simply stating that given a full descriptive account of the universe and given first person accounts of suffering and what suffering is like, that it is self-evidently true that built into the nature of that sort of property or part of the universe is that it ought to be diminished?

Peter: Well, if you’re saying that … There is a fine line, maybe this is what you’re suggesting, between saying from the description, we can deduce what we ought to do and between saying when we reflect on what suffering is and when we reflect on what happiness is, we can see that it is self-evident that we ought to promote happiness and we ought to reduce suffering. So I regard that as a non-naturalist position, but you’re right that the two come quite close together.

In fact, this is one of the interesting features of volume three of Parfit’s On What Matters, which was only published posthumously, but was completed before he died, and in that, he responds to essays that are in a book that I edited called Does Anything Really Matter. The original idea was that he would respond in that volume, but, as often happened with Parfit, he wrote responses at such length that it needed to be a separate volume. It would’ve made the work too bulky to put them together, but Peter Railton had an essay in Does Anything Really Matter, and Parfit responded to it, and then he invited Railton to respond to his response, and, essentially, they are saying that yeah, their views have become closer anyway, there’s been a convergence, which is pretty unusual in philosophy because philosophers tend to emphasize the differences between their views.

Between what Parfit calls his non-natural objectivist view and between Railton’s naturalist view, because Railton’s is a more sophisticated naturalist view, the line starts to become a little thin, I agree. But, to me, the crucial thing is that you’re not just saying here’s this description; therefore, we ought to do this. But you’re saying if we understand what we’re talking about here, we can have as an intuition of self-evidence, the proposition that it’s good to promote this or it’s good to try to prevent this. So that’s the moral proposition, that it is good to do this. And that’s the proposition that you have to take some other step. You can say it’s self-evident, but you have to take some other step from simply saying this is what suffering is like.

Lucas: Just to sort of capture and understand your view a bit more here, and going back to, I think, mathematics and reason and what reason means to you and how it operates at the foundation of your ethics, I think that a lot of people will sort of get lost or potentially feel it is maybe an arbitrary or cheap move to …

When thinking about the foundations of mathematics, there are foundational axioms, which are self-evidently true, which no one will deny, and then translating that move into the foundations of ethics into determining what we ought to do, it seems like there would be a lot of people being lost there, there would be a lot of foundational disagreement there. When is it permissible or okay or rational to make that sort of move? What does it mean to say that these really foundational parts of ethics are self-evidently true? How is it not the case that that’s simply an illusion or simply a byproduct of evolution, that we’re confused that these certain fictions that we’ve evolved are self-evidently true?

Peter: Firstly, let me say, as I’ve mentioned before, I don’t claim that we can be 100% certain about moral truths, but I do think that it’s a plausible view. One reason why it relates to, you just mentioned, being a product of evolution, one reason why it relates to that, and this is something that I argued with my co-author Katarzyna de Lazari-Radek in the 2014 book we wrote called The Point of View of the Universe, which is, in fact, a phrase from Sidgwick, and that argument is that there are a number of moral judgments that we make, there are many moral judgments that we make that we know have evolutionary origins, so lots of things that we think of as wrong, originated because they would not have helped us to survive or they would not have helped a small tribal group to survive to allow certain kinds of conduct. And some of those, we might wanna reject today.

We might think, for example, we have an instinctive repugnance of incest, but Jonathan Haidt has shown that even if you describe a case where an adult brother and sister choose to have sex and nothing bad happens as a result of that, their relationship remains as strong as ever, and they have fun, and that’s the end of it, people still say oh, somehow that’s wrong. They try to make up reasons why it’s wrong. That, I think, is an example of an evolved impulse, which, perhaps, is no longer really apposite because we have effective contraception, and so the evolutionary reasons why we might want to avoid incest are not necessarily there.

But in the case of the kinds of things that I’m talking about and that Sidgwick is talking about, like the idea that everyone’s good is of equal significance, it’s hard to see why we would’ve evolved to have that attitude because, in fact, it seems harmful to our prospects of survival and reproduction to give equal weight to the interests of complete strangers.

The fact that people do think this, and if you look at a whole lot of different independent, historical, ethical traditions in different cultures and different parts of the world at different times, you do find many thinkers who converge on something like this idea in various formulations. So why do they converge on this given that it doesn’t seem to have that evolutionary justification or explanation as to why it would’ve evolved?

I think that suggests that it may be a truth of reason and, of course, you may then say well, but reason has also evolved, and indeed it has, but I think that reason may be a little different in that we evolved a capacity to reason because of various specific problem solving needs; it helped us to survive in lots of circumstances. But it may then enable us to see things that have no survival value, just as no doubt simple arithmetic has a survival value, but understanding the truths of higher mathematics doesn’t really have a survival value, so maybe similarly in ethics, there are some of these more abstract universal truths that don’t have a survival value, but, nevertheless, the best explanation for why many people seem to come to these views is that they’re truths of reason, and once we’re capable of reasoning, we’re capable of understanding these truths.

Lucas: Let’s start off at reason and reason alone. When moving from reason and thinking, I guess, alongside here about mathematics for example, how is one moving specifically from reason to moral realism and what is the metaphysics of this kind of moral realism in a naturalistic universe without anything supernatural?

Peter: I don’t think that it needs to have a very heavyweight metaphysical presence in the universe. Parfit actually avoided the term realism in describing his view. He called it non-naturalistic normative objectivism because he thought that realism carried this idea that it was part of the furniture of the universe, as philosophers say, that the universe consists of the various material objects, but in addition to that, it consists of moral truths as if they’re somehow sort of floating there out in space, and that’s not the right way to think about it.

I’d say, rather, the right way to think about it is as, you know, we do with logical and mathematical truths, that once beings are capable of a certain kind of thought, they will move towards these truths. They have the potential and capacity for thinking along these lines. One of the claims that I would make as a consequence of my acceptance of objectivism in ethics as a rationally based objectivism is that the morality that we humans have developed on Earth, anyway at this more abstract, universal level, is something that aliens from another galaxy could also have achieved if they had similar capacities of thought or maybe greater capacities of thought. It’s always a possible logical space, you could say, or a rational space that is there that beings may be able to discover once they develop those capacities.

You can see mathematics in that way, too. It’s one of a number of possible ways of seeing mathematics and of seeing logic, that they’re just timeless things that are, in some way, truths or laws, if you like, but they don’t exist in the sense in which the physical universe exists.

Lucas: I think that’s really a very helpful way of putting it. So the claim here is that through reason, one can develop the axioms of mathematics and then eventually develop quantum physics and other things. And similarly, when reason is applied to thinking about what one ought to do or when thinking about the status of sentient creatures that one is applying logic and reason to this rational space and that this rational space has truths in the same way that mathematics does?

Peter: Yes, that’s right. It has at least perhaps only a very small number, Sidgwick came up with three axioms, so perhaps only a very small number of truths and fairly abstract truths, but they are truths. That’s the important aspect. That they’re not just particular attitudes, which beings who evolved as Homo sapiens are likely to understand and accept, but beings who evolved in a different galaxy in a quite different way would not accept. My claim is that if they are also capable of reasoning, if evolution had again produced rational beings, they would be able to see the truths in the same way as we can.

Lucas: So spaces of rational thought and of logic, which can or cannot be explored, seem very conceptually queer to me, such that I don’t even really know how to think about it. I think that one would worry that one is applying reason, whatever reason might be, to a fictional space. I mean you were discussing earlier that some people believe mathematics to be simply the formalization of what is analytically true about the terms and judgments and the axioms and then it’s just a systematization of that and an unpacking of it from beginning into infinity. And so, I guess, it’s unclear to me how one can discern spaces of rational inquiry which are real, from ones which are anti-real or which are fictitious. Does that make sense?

Peter: It’s a problem. I’m not denying that there is something mysterious, I think maybe my former professor, R.M. Hare, would have said queer … No, it was John Mackie, actually, John Mackie was also at Oxford when I was there, who said these must be very queer things if there are some objective moral truths. I’m not denying that it’s something that, in a way, would be much simpler if we could explain everything in terms of empirical examination of the natural world and say there’s only that plus there are formal systems. There are analytic systems.

But I’m not persuaded that that’s a satisfactory explanation of mathematics or logic either, so those who are convinced that this is a satisfactory way of explaining logic and mathematics may well think that then they don’t need this explanation of ethics either, but it is a matter of if we need to appeal to something outside the natural realm to understand some of the other things about the way we reason, then perhaps ethics is another candidate for this.

Lucas: So just drawing parallels again here with mathematics ’cause I think it’s the most helpful. Mathematics is incredible for helping us to describe and predict the universe. The president of the Future of Life Institute, Max Tegmark, develops an idea of potential mathematical Platonism or realism where the universe can be understood primarily as, and sort of ontologically, a mathematical object within, potentially, a multiverse because as we look into the properties and features of quarks and the building blocks of the universe, all we find is more mathematical properties and mathematical relationships.

So within the philosophy of math, there are certainly, it seems, open questions about what math is and what the relation of mathematics is to the fundamental metaphysics and ontology of the universe and potential multiverse. So in terms of ethics, what information or insight or anything do you think that we’re missing that could further inform our view that there potentially is objective morality or whatever that means or inform us that there is a space of moral truths which can be arrived at by non-anthropocentric minds, like the alien minds you said could also arrive at the moral truths as they could also arrive at mathematical truths?

Peter: So what further insight would show that this was correct, other, presumably, than the arrival of aliens who start swapping mathematical theorems with us?

Lucas: And have arrived at the same moral views. For example, if they show up and they’re like hey, we’re hedonistic consequentialists and we’re really motivated to-

Peter: I’m not saying they’d necessarily be hedonistic consequentialists, but they would-

Lucas: I think they should be.

Peter: That’s a different question, right?

Lucas: Yeah, yeah, yeah.

Peter: We haven’t really discussed steps to get there yet, so I think they’re separate questions. My idea is that they would be able to see that if we had similar interests to the ones that they did, then those interests ought to get similar weight, that they shouldn’t ignore our interests just because we’re not members of whatever civilization or species they are. I would hope that if they are rationally sophisticated, they would at least be able to see that argument, right?

Some of them, just as with us, might see the argument and then say yes, but I love the tastes of your flesh so much I’m gonna kill you and eat you anyway. So, like us, they may not be purely rational beings. We’re obviously not purely rational beings. But if they can get here and contact us somehow, they should be sufficiently rational to be able to see the point of the moral view that I’m describing.

But that wasn’t a very serious suggestion about waiting for the aliens to arrive, and I’m not sure that I can give you much of an answer to say what further insights are relevant here. Maybe it’s interesting to try and look at this cross-culturally, as I was saying, and to examine the way that great thinkers of different cultures and different eras have converged on something like this idea despite the fact that it seems unlikely to have been directly produced by evolution in the same way that our other more emotionally driven moral reactions are.

I don’t know that the argument can go any further, and it’s not completely conclusive, but I think it remains plausible. You might say well, that’s a stalemate. There are some reasons for thinking morality’s objective and other reasons for rejecting that, and that’s possible. That happens in philosophy. We get down to bedrock disagreements and it’s hard to move people with different views.

Lucas: What is reason? One could also view reason as some human-centric bundle of both logic and intuitions, and one can be mindful that the intuitions, which are sort of bundled with this logic, are almost arbitrary consequences of evolution. So what is reason fundamentally and what does it mean that other reasonable agents could explore spaces of math and morality in similar ways?

Peter: Well, I would argue that there are common principles that don’t depend on our specific human nature and don’t depend on the path of our evolution. I accept that the path of our evolution has given us the capacity to solve various problems through thought, that that is what our reason amounts to, and that we therefore have insight into these truths that we would not have if we did not have that capacity. But this kind of reasoning you can think of as something that goes beyond specific problem solving skills to insights into laws of logic, laws of mathematics, and laws of morality as well.

Lucas: When we’re talking about axiomatic parts of mathematics and logics and, potentially, ethics here as you were claiming with this moral realism, how is it that reason allows us to arrive at the correct axioms in these rational spaces?

Peter: We developed the ability, when we’re presented with these things, to consider whether we can deny them or not, whether they are truly self-evident. We can reflect on them, we can talk to others about them, we can consider biases that we might have that might explain why we believe them and see whether there are any such biases, and once we’ve done all that, we’re left with the insight that some things we can’t deny.

Lucas: I guess I’m just sort of poking at this idea of self-evidence here, which is doing a lot of work in the moral realism. Whether or not something is self-evident, at least to me, it seems like a feeling, like I just look at the thing and I’m like clearly that’s true, and if I get a little bit meta, I ask okay, why is it that I think that this thing is obviously true? Well, I don’t really know, it just seems self-evidently true. It just seems so and this, potentially, just seems to be a consequence of evolution and of being imbued with whatever reason is. So I don’t know if I can always trust my intuitions about things being self-evidently true. I’m not sure how to navigate my intuitions and views of what is self-evident in order to come upon what is true.

Peter: As I said, it’s possible that we’re mistaken, that I’m mistaken in these particular instances. I can’t exclude that possibility, but it seems to me that the hypothesis is that we hold these views because they are self-evident. I’ve looked for evolutionary explanations and, as I’ve said, I’ve not really found them, so that’s as far as I can go with that.

Lucas: Just moving along here a little bit, and I’m becoming increasingly mindful of your time, would you like to cover briefly this sort of shift that you had from preference utilitarianism to hedonistic utilitarianism?

Peter: So, again, let’s go back to my autobiographical story. For Hare, the only basis for making moral judgments was to start from our preferences and then to universalize them. There could be no arguments about something else being intrinsically good or bad, whether it was happiness or whether it was justice or freedom or whatever because that would be to import some kind of objective claims into this debate that just didn’t have a place in his framework, so all I could do was take my preferences and prescribe them universally, and, as I said, that involved putting myself in the position of the others affected by my action and asking whether I could still accept it.

When you do that, and if, let’s say, your action affects many people, not just you and one other, what you’re really doing is you’re trying to sum up how this would be from the point of view of every one of these people. So if I put myself in A’s position, would I be able to accept this? But then I’ve gotta put myself in B’s position as well, and C, and D, and so on. And to say can I accept this prescription universalized is to say if I were living the lives of all of those people, would I want this to be done or not? And that’s a kind of, as they say, a summing of the extent to which doing this satisfies everyone’s preferences net on balance after deducting, of course, the way in which it thwarts or frustrates or is contrary to their preferences.

So this seemed to be the only way in which you could go further with Hare’s views as he eventually worked it out and changed it a little bit over the years, in his later formulations of it. So it was a kind of a preference utilitarianism that it led to, and I was reasonably happy with that, and I accepted the idea that this meant that what we ought to be doing is to maximize the satisfaction of preferences and avoid thwarting them.

And it gives you, in many cases, of course, somewhat similar conclusions to what you would say if what we wanna do is maximize happiness and minimize suffering or misery because for most people, happiness is something that they very much desire and misery is something that they don’t want. Some people might have different preferences that are not related to that, but for most people, they will probably come down some way or other to how it relates to their well-being, their interests.

There are certainly objections to this, and some of the objections relate to preferences that people have when they’re not fully informed about things. And Hare’s view was that, in fact, the preferences that we should universalize are the preferences people should have when they are fully informed and when they’re thinking calmly, they’re not, let’s say, angry with somebody and therefore they have a strong preference to hit him in the face, even though this will be bad for them and bad for him.

So the preference view sort of then took this further step of saying it’s the preferences that you would have if you were well informed and rational and calm, and that seemed to solve some problems with preference utilitarianism, but it gave rise to other problems. One of the problems was, well, does this mean that if somebody is misinformed in a way that you can be pretty confident they’re never gonna be correctly informed, you should still do what they would want if they were correctly informed?

An example of this might be someone who’s a very firm religious believer and has been all their life, and let’s say one of their religious beliefs is that having sex outside marriage is wrong because God has forbidden it, let’s say, it’s contrary to the commandments or whatever, but given that, let’s say, let’s just assume, there is no God, therefore, a priori there’s no commandments that God made against sex outside marriage, and given that if they didn’t believe in God, they would be happy to have sex outside marriage, and this would make them happier, and would make their partner happy as well, should I somehow try to wangle things so that they do have sex outside marriage even though, as they are now, they prefer not to?

And that seems a bit of a puzzle, really. Seems highly paternalistic to ignore their preferences on the basis of their knowledge even though you’re convinced that their knowledge is false. So there are puzzles and paradoxes like that. And then there was another argument that does actually, again, come out of Sidgwick, although I didn’t find it in Sidgwick until I read it in other philosophers later.

Again, I think Peter Railton is one who uses this, and that is that if you’re really asking what people would do if they’re rational and fully informed, you have to make judgments about what is a rational and fully informed view in this situation. And that might involve even the views that we’ve just been discussing, that if you were rational, you would know what the objective truth was and you would want to do it. So, at that level, a preference view actually seems to amount to a different view, an objectivist view, that you would hold where you would have to actually know what the things were that were good.

So, as I say, it had a number of internal problems, even just if you assume the meta-ethic that I was taking from Hare originally. But if then, as happened with me, you become convinced that there can be objective moral truths, this, in some ways, opens up the field to other possible ideas as to what is intrinsically good because now you could argue that something was intrinsically good even if it was not something that people preferred, and in that light, I went back to reading some of the classical utilitarians, again, particularly, Sidgwick and his arguments for why happiness rather than the satisfaction of desires is the ultimate value, something that is of intrinsic value, and it did seem to overcome these problems with preference utilitarianism that had been troubling me.

It certainly had some paradoxes of its own, some things that it seemed not to handle as well, but after thinking about it, again, I decided that it was more likely than not that a hedonistic view was the right view. I wouldn’t put it stronger than that. I still think preference utilitarianism has some things to be said for it and there are also, of course, views that say yes, happiness is intrinsically good and suffering is intrinsically bad, but they’re not the only things that are intrinsically good or bad, things like justice or freedom or whatever. There’s various other candidates that people have put forward. Many of them, in fact, are being objectively good or bad. So those are also possibilities.

Lucas: When you mentioned that happiness or certain sorts of conscious states of sentient creatures can be seen as intrinsically good or valuable, keeping in mind the moral realism that you hold, what is the metaphysical status of experiences in the universe given this view? Is it that happiness is good based off of the application of reason and the rational space of ethics? Unpack the ontology of happiness and the metaphysics here a bit.

Peter: Well, of course it doesn’t change what happiness is to say that it’s of intrinsic value, but that is the claim that I’m making. That the world is a better place if it has more happiness in it and less suffering in it. That’s a judgment that I’m making about the state of the universe. Obviously, there have to be beings who can be happy or can be miserable, and that requires a conscious mind, but the judgment that the universe is better with more happiness and less suffering is mind independent. I think … Let’s imagine that there were beings that could feel pain and pleasure, but could not make any judgments about anything of value. They’re like some non-human animals, I guess. It would still be the case that the universe was better if those non-human animals suffered less and had more pleasure.

Lucas: Right. Because it would be sort of intrinsic quality or property to the experience that it be valuable or disvaluable. So yeah, thanks so much for your time, Peter. It’s really been wonderful and informative. If people would like to follow you or check you out somewhere, where can they go ahead and do that?

Peter: I have a website, which actually I’m in the process of reconstructing a bit, but it’s Petersinger.info. There’s a Wikipedia page. If they wanna look at things that I’m involved in, they can look at thelifeyoucansave.org, which is the nonprofit organization that I’ve founded that is recommending effective charities that people can donate to. That probably gives people a bit of an idea. There’s books that I’ve written that are discussing these things. I probably mentioned The Point of View of the Universe, which goes into the things we’ve discussed today, probably more thoroughly than anything else. For people who don’t wanna read a big book, I’ve also got Oxford University Press’ Very Short Introductions series. The book on utilitarianism is, again, co-authored by the same co-author as The Point of View of the Universe, Katarzyna de Lazari-Radek and myself, and that’s just a hundred page version of some of these arguments we’ve been discussing.

Lucas: Wonderful. Well, thanks again, Peter. We haven’t ever met in person, but hopefully I’ll catch you around the Effective Altruism conference track sometime soon.

Peter: Okay, hope so.

Lucas: Alright, thanks so much, Peter.

Hey, it’s post-podcast Lucas here and just wanted to chime in with some of my thoughts and tie this all into AI thinking. For me, the most consequential aspect of moral thought in this space and moral philosophy, generally, is how much disagreement there is between people who’ve thought long and hard about this issue, what an enormous part of AI alignment this makes up, and the effects different moral and meta-ethical views have on preferred AI alignment methodology.

Current general estimates by AI researchers put human-level AI on the decade-to-century timescale, with about a 50% probability by mid-century and with that obviously increasing over time, and it’s quite obvious that moral philosophy, ethics, and issues of value and meaning will not be solved on that timescale. So if we assume the worst-case success story, where technical alignment and coordination and strategy issues will continue in their standard, rather morally messy way with how we currently unreflectively deal with things, where moral information isn’t taken very seriously, then I’m really hoping the technical alignment and coordination succeed well enough for us to create a very low-level aligned system that we’re able to pull the brakes on while we work hard on issues of value, ethics, and meaning: the end towards which that AGI will be aimed. Otherwise, it seems very clear that given all of this moral uncertainty that is shared, we risk value drift or catastrophically suboptimal or even negative futures.

Turning to Peter’s views that we discussed here today, if axioms of morality are accessible through reason alone, as the axioms of mathematics appear to be, then we ought to consider the implications here for how we want to progress with AI systems and AI alignment more generally.

If we take human beings to be agents of limited or semi-rationality, then we could expect that some of us, or some fraction of us, have gained access to what might potentially be core axioms of the logical space of morality. When AI systems are trained on human data in order to infer and learn human preferences, given Peter’s view, this could be seen as a way of learning the moral thinking of imperfectly rational beings. This, or any empirical investigation, given Peter’s views, would not be able to arrive at any clear, moral truth, rather it would find areas where semi-rational beings, like ourselves, generally tend to converge in this space.

This would be useful or potentially passable up until AGI, but if such a system is to be fully autonomous and safe, then a more robust form of alignment is necessary. If the AGI we create is one day rational, putting aside whatever reason might be and how it gives rational creatures access to self-evident truths and rational spaces, then if AGI is a fully rational agent, it, perhaps, would arrive at self-evident truths of mathematics and logic, and even morality, just as aliens on another planet might if they’re fully rational, as is Peter’s view. If so, this would potentially be evidence of this view being true, and we can also reflect here that AGI, from this point of using reason to have insight into the core truths of logical spaces, could reason much better and more impartially than any human in order to fully explore and realize universal truths of morality.

At this point, we would essentially have a perfect moral reasoner on our hands with access to timeless universal truths. Now the question would be could we trust it and what would ever be sufficient reasoning or explanation given to humans by this moral oracle that would satisfy and satiate us of our appetites and desires to know moral truth and to be sure that we have arrived at moral truth?

It’s above my pay grade what rationality or reason actually is and might be prior to certain logical and mathematical axioms and how such a truth seeking meta-awareness can grasp these truths as self-evident or whether the self-evidence of the truths of mathematics and logic are programmed into us by evolution trying and failing over millions of years. But maybe that’s an issue for another time. Regardless, we’re doing philosophy, computer science, and poli-sci on a deadline, so let’s keep working on getting it right.

If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: Moral Uncertainty and the Path to AI Alignment with William MacAskill

How are we to make progress on AI alignment given moral uncertainty? What are the ideal ways of resolving conflicting value systems and views of morality among persons? How ought we to go about AI alignment given that we are unsure about our normative and metaethical theories? How should preferences be aggregated and persons idealized in the context of our uncertainty?

Moral Uncertainty and the Path to AI Alignment with William MacAskill is the fifth podcast in the new AI Alignment series, hosted by Lucas Perry. For those of you that are new, this series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with William MacAskill. Will is a professor of philosophy at the University of Oxford and is a co-founder of the Center for Effective Altruism, Giving What We Can, and 80,000 Hours. Will helped to create the effective altruism movement and his writing is mainly focused on issues of normative and decision theoretic uncertainty, as well as general issues in ethics.

Topics discussed in this episode include:

  • Will’s current normative and metaethical credences
  • The value of moral information and moral philosophy
  • A taxonomy of the AI alignment problem
  • How we ought to practice AI alignment given moral uncertainty
  • Moral uncertainty in preference aggregation
  • Moral uncertainty in deciding where we ought to be going as a society
  • Idealizing persons and their preferences
  • The most neglected portion of AI alignment

In this interview we discuss ideas contained in the work of William MacAskill. You can learn more about Will’s work here, and follow him on social media here. You can find Gordon Worley’s post here and Rob Wiblin’s previous podcast with Will here. You can hear more in the podcast above or read the transcript below.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast series at the Future of Life Institute. I’m Lucas Perry, and today we’ll be speaking with William MacAskill on moral uncertainty and its place in AI alignment. If you’ve been enjoying this series and finding it interesting or valuable, it’s a big help if you can share it on social media and follow us on your preferred listening platform.

Will is a professor of philosophy at the University of Oxford and is a co-founder of the Center for Effective Altruism, Giving What We Can, and 80,000 Hours. Will helped to create the effective altruism movement and his writing is mainly focused on issues of normative and decision theoretic uncertainty, as well as general issues in ethics. And so, without further ado, I give you William MacAskill.

Yeah, Will, thanks so much for coming on the podcast. It’s really great to have you here.

Will: Thanks for having me on.

Lucas: So, I guess we can start off. You can tell us a little bit about the work that you’ve been up to recently in terms of your work in the space of metaethics and moral uncertainty just over the past few years and how that’s been evolving.

Will: Great. My PhD topic was on moral uncertainty, and I’m just putting the finishing touches on a book on this topic. The idea here is to appreciate the fact that we very often are just unsure about what we ought, morally speaking, to do. It’s also plausible that we ought to be unsure about what we ought morally to do. Ethics is a really hard subject, there’s tons of disagreement, it would be overconfident to think, “Oh, I’ve definitely figured out the correct moral view.” So my work focuses on not really the question of how unsure we should be, but instead what should we do given that we’re uncertain?

In particular, I look at the issue of whether we can apply the same sort of reasoning that we apply to uncertainty about matters of fact to matters of moral uncertainty. In particular, can we use what is known as “expected utility theory”, which is very widely accepted as at least approximately correct in empirical uncertainty. Can we apply that in the same way in the case of moral uncertainty?
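As an illustrative aside, the idea of applying expected utility reasoning to moral uncertainty can be sketched in a few lines of Python. This is only a minimal sketch under strong assumptions, not Will's own formalism: the theory names, credences, and values below are hypothetical, and the sketch assumes that the values different moral theories assign to an action can be placed on a single common scale, which is itself a contested question.

```python
def expected_value(credences, values, option):
    """Credence-weighted sum of the value each theory assigns to an option."""
    return sum(credences[theory] * values[theory][option] for theory in credences)


def best_option(credences, values):
    """Return the option with the highest credence-weighted value."""
    # All theories are assumed to rate the same set of options.
    options = list(next(iter(values.values())).keys())
    return max(options, key=lambda option: expected_value(credences, values, option))


# Hypothetical credences over two moral theories (they sum to 1).
credences = {"theory_a": 0.5, "theory_b": 0.5}

# Hypothetical value each theory assigns to each action, on an assumed common scale.
values = {
    "theory_a": {"act_1": 10.0, "act_2": 0.0},
    "theory_b": {"act_1": -4.0, "act_2": 2.0},
}

if __name__ == "__main__":
    for option in ("act_1", "act_2"):
        print(option, expected_value(credences, values, option))
    print("chosen:", best_option(credences, values))
```

With these made-up numbers, `act_1` scores 3.0 and `act_2` scores 1.0, so `act_1` is chosen even though one of the two theories counts it as bad.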

Lucas: Right. And so coming on here, you also have a book that you’ve been working on on moral uncertainty that is unpublished. Have you just been expanding this exploration in that book, diving deeper into that?

Will: That’s right. There’s actually been very little that’s been written on the topic of moral uncertainty, at least in modern times, at least relative to its importance. I would think of this as a discipline that should be studied as much as consequentialism or contractualism or Kantianism is studied. But there’s really, in modern times, only one book that’s been written on the topic and that was written 18 years ago now, or published 18 years ago. What we want is for this to be, firstly, just a kind of definitive introduction to the topic. It’s co-authored, with me as lead author, with Toby Ord and Krister Bykvist, laying out both what we see as the most promising path forward in terms of addressing some of the challenges that face an account of decision-making under moral uncertainty, some of the implications of taking moral uncertainty seriously, and also just some of the unanswered questions.

Lucas: Awesome. So I guess, just moving forward here, you have a podcast that you already did with Rob Wiblin at 80,000 Hours. So I guess we can sort of just avoid covering a lot of the basics here about your views on using expected utility calculus in moral reasoning and moral uncertainty in order to decide what one ought to do when one is not sure what one ought to do. People can go ahead and listen to that podcast, which I’ll provide a link to within the description.

It would also be good, just to sort of get a general sense of where your meta ethical partialities just generally right now tend to lie, so what sort of meta ethical positions do you tend to give the most credence to?

Will: Okay, well that’s a very well put question ’cause, as with all things, I think it’s better to talk about degrees of belief rather than absolute belief. So normally if you ask a philosopher this question, we’ll say, “I’m a nihilist,” or “I’m a moral realist,” or something, so I think it’s better to split your credences. So I think I’m about 50/50 between nihilism or error theory and something that’s non-nihilistic.

Where, by nihilism or error theory, I just mean a view about any positive moral statement or normative statement or evaluative statement. That includes, you ought to maximize happiness. Or, if you want a lot of money, you ought to become a banker. Or, pain is bad. On this view, all of those things are false. All positive, normative or evaluative claims are false. So it’s a very radical view. And we can talk more about that, if you’d like.

In terms of the rest of my credence, the view that I’m kind of most sympathetic towards in the sense of the one that occupies most of my mental attention is a relatively robust form of moral realism. It’s not clear whether it should be called kind of naturalist moral realism or non-naturalist moral realism, but the important aspect of it is just that goodness and badness are kind of these fundamental moral properties and are properties of experience.

The things that are of value are things that supervene on conscious states, in particular good states or bad states, and the way we know about them is just by direct experience with them. Just by being acquainted with a state like pain gives us a reason for thinking we ought to have less of this in the world. So that’s my kind of favored view in the sense it’s the one I’d be most likely to defend in the seminar room.

And then I give somewhat less credence to a couple of views. One is a view called “subjectivism” which is the idea that what you ought to do is determined in some sense by what you want to do. So the simplest view there would just be when I say, “I ought to do X.” That just means I want to do X in some way. Or a more sophisticated version would be ideal subjectivism where when I say I ought to do X, it means some very idealized version of myself would want myself to want to do X. Perhaps if I had unlimited amounts of knowledge and much greater computational power and so on. I’m a little less sympathetic to that than many people I know. We’ll go into that.

And then a final view that I’m also less sympathetic towards is non-cognitivism, which would be the idea that our moral statements … So when I say, “Murder is wrong,” I’m not even attempting to express a proposition. What they’re doing is just expressing some emotion of mine, like, “Yuk. Murder. Ugh,” in the same way that when I said that, that wasn’t expressing any proposition, it was just expressing some sort of pro or negative attitude. And again, I don’t find that terribly plausible, again for reasons we can go into.

Lucas: Right, so those first two views were cognitivist views, which makes them fall under sort of a semantic theory where you think that people are saying true or false statements when they’re claiming moral facts. And the error theory and your moral realism are both metaphysical views, which I think is probably what we’ll mostly be interested in here in terms of the AI alignment problem.

There are other issues in metaethics, for example having to do with semantics, as you just discussed. You feel as though you give some credence to non-cognitivism, but there are also justification views, so like issues in moral epistemology, how one can know about metaethics and why one ought to follow metaethics if metaethics has facts. Where do you sort of fall in in that camp?

Will: Well, I think all of those views are quite well tied together, so what sort of moral epistemology you have depends very closely, I think, on what sort of meta-ethical view you have, and I actually think, often, is intimately related as well to what sort of view in normative ethics you have. So my preferred philosophical world view, as it were, the one I’d defend in a seminar room, is classical utilitarian in its normative view, so the only thing that matters is positive or negative mental states.

In terms of its moral epistemology, the way we access what is of value is just by experiencing it, so in just the same way we access conscious states. There are also some ways in which you can’t merely, you know, why is it that we should maximize the sum of good experiences rather than the product, or something? That’s a view that you’ve got to obtain by kind of reasoning rather than just purely from experience.

Part of my epistemology does appeal to whatever this spooky ability we have to reason about abstract affairs, but it’s the same sort of faculty that is used when we think about mathematics or set theory or other areas of philosophy. If, however, I had some different view, so supposing we were a subjectivist, well then moral epistemology looks very different. You’re actually just kind of reflecting on your own values, maybe looking at what you would actually do in different circumstances and so on, reflecting on your own preferences, and that’s the right way to come to the right kind of moral views.

There’s also another meta-ethical view called “constructivism” that I’m definitely not the best person to talk about. But on that view, again it’s not really a realist view, but on this view we just have a bunch of beliefs and intuitions and the correct moral view is just the best kind of systematization of those beliefs and intuitions in the same way as you might think … Like linguistics, it is a science, but it’s fundamentally based just on what our linguistic intuitions are. It’s just kind of a systematization of them.

On that view, then, moral epistemology would be about reflecting on your own moral intuitions. You’ve just got all of this data, which is the way things seem to you, morally speaking, and then you’re just doing the systematization thing. So I feel like the question of moral epistemology can’t be answered in a vacuum. You’ve got to think about your meta-ethical view of the metaphysics of ethics at the same time.

Lucas: I think I’m pretty interested in here, and also just poking a little bit more into that sort of 50% credence you give to your moral realist view, which is super interesting because it’s a view that people tend not to have, I guess, in the AI computer science rationality space, EA space. People tend to, I guess, have a lot of moral anti-realists in this space.

In my last podcast, I spoke with David Pearce, and he also seemed to sort of have a view like this, and I’m wondering if you can just sort of unpack yours a little bit, where he believed that suffering and pleasure disclose the in-built pleasure/pain axis of the universe. Like you can think of minds as sort of objective features of the world, because they in fact are objective features of the world, and the phenomenology and experience of each person is objective in the same way that someone could objectively be experiencing redness, and in the same sense they could be objectively experiencing pain.

It seems to me, and I don’t fully understand the view, but the claim is that there is some sort of in-built quality or property to the hedonic qualia of suffering or pleasure that discloses its in-built value to that.

Will: Yeah.

Lucas: Could you unpack it a little bit more about the metaphysics of that and what that even means?

Will: It sounds like David Pearce and I have quite similar views. I think relying heavily on the analogy with, or very close analogy with consciousness is going to help, where imagine you’re kind of a robot scientist, you don’t have any conscious experiences but you’re doing all this fancy science and so on, and then you kind of write out the book of the world, and I’m like, “Hey, there’s this thing you missed out. It’s like conscious experience.” And you, the robot scientist, would say, “Wow, that’s just insane. You’re saying that some bits of matter have this first person subjective feel to them? Like, why on earth would we ever believe that? That’s just so out of whack with the naturalistic understanding of the world.” And it’s true. It just doesn’t make any sense given what we know now. It’s a very strange phenomenon to exist in the world.

And so one of the arguments that motivates error theory is this idea of just, well, if values were to exist, they would just be so weird, what Mackie calls “queer”. It’s just so strange that just by a principle of Occam’s razor not adding strange things in to our ontology, we should assume they don’t exist.

But that argument would work in the same way against conscious experience, and the best response we’ve got is to say, no, but I know I’m conscious, and I can just tell by introspecting. I think we can run the same sort of argument when it comes to a property of consciousness as well, which is namely the goodness or badness of certain conscious experiences.

So now I just want you to go kind of totally a-theoretic. Imagine you’ve not thought about philosophy at all, or even science at all, and I was just to ask you, rip off one of your fingernails, or something. And then I say, “Is that experience bad?” And you would say yes.

Lucas: Yeah, it’s bad.

Will: And I would ask, how confident are you? More confident that this pain is bad than that I even have hands, perhaps. That’s at least how it seems to be for me. So then it seems like, yeah, we’ve got this thing that we’re actually incredibly confident of which is the badness of pain, or at least the badness of pain for me, and so that’s what initially gives the case for then thinking, okay, well, that’s at least one objective moral fact that pain is bad, or at least pain is bad for me.

Lucas: Right, so the step where I think that people will tend to get lost in this is when … I thought the part about Occam’s razor was very interesting. I think that most people are anti-realists because they use Occam’s razor there and they think, what the hell would a value even be anyway in the third person objective sense? Like, that just seems really queer, as you put it. So I think people get lost at the step where the first person seems to simply have a property of badness to it.

I don’t know what that would mean if one has a naturalistic reductionist view of the world. There seems to be just like entropy, noise and quarks and maybe qualia as well. It’s not clear to me how we should think about properties of qualia and whether or not one can derive normative statements from them, like getting from “is” statements about the properties of qualia to “ought” statements.

Will: One thing I want to be very clear on is just it definitely is the case that we have really no idea on this view. We are currently completely in the dark about some sort of explanation of how matter and forces and energy could result in goodness or badness, something that ought to be promoted. But that’s also true with conscious experience as well. We have no idea how on earth matter could result in kind of conscious experience. At the same time, it would be a mistake to start denying conscious experience.

And then we can ask, we say, okay, we don’t really know what’s going on but we accept that there’s conscious experience, and then I think if you were again just to completely pre-theoretically start categorizing distinct conscious experiences that we have, we’d say that some are red and some are blue, some are maybe more intense, some are kind of dimmer than others, you’d maybe classify them into sights and sounds and other sorts of experiences there.

I think also a very natural classification would be the ones that are good and the ones that are bad, and then when we cash that out further, I don’t think the best explanation is that when we say, oh, this is good or this is bad, it means what we want or what we don’t want, but instead it’s like what we think we have reason to want or reason not to want. It seems to give us evidence for those sorts of things.

Lucas: I guess my concern here is just that I worry that words like “good” and “bad” or “valuable” or “dis-valuable”, I feel some skepticism about whether or not they disclose some sort of intrinsic property of the qualia. I’m also not sure what the claim here is about the nature of and kinds of properties that qualia can have attached to them. I worry that goodness and badness might be some sort of evolutionary fiction which enhances us, enhances our fitness, but it doesn’t actually disclose some sort of intrinsic metaphysical quality or property of some kind of experience.

Will: One thing I’ll say is, again, remember that I’ve got this 50% credence on error theory, so in general, all these questions, maybe this is just some evolutionary fiction, things just seem bad but they’re not actually, and so on. I actually think those are good arguments, and so that should give us some degree of confidence in this idea that actually nothing matters at all.

But kind of underlying a lot of my views is this more general argument that if you’re unsure between two views, one in which just nothing matters at all, we’ve got no reasons for action, the other one we do have some reasons for action, then you can just ignore the one that says you’ve got no reasons for action ’cause you’re not going to do badly by its lights no matter what you do. If I were to go around shooting everybody, that wouldn’t be bad or wrong on nihilism. If I were to shoot lots of people, it wouldn’t be bad or wrong on nihilism.
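The structure of this argument can be written out as a small calculation. The following is a minimal sketch, assuming one treats uncertainty between the two views by weighting each view's verdicts by one's credence in it; the action names and numbers are hypothetical and only illustrate the shape of the reasoning.

```python
# Toy illustration: a view on which nothing matters assigns the same value
# (here 0) to every action, so it never changes which action comes out best.
# All action names and numbers below are made up for illustration.
ACTIONS = ["help_stranger", "do_nothing", "harm_stranger"]

# Nihilism: no action is better or worse than any other.
NIHILISM = {action: 0.0 for action in ACTIONS}

# Some view on which there are reasons for action (hypothetical values).
SOME_REASONS = {"help_stranger": 10.0, "do_nothing": 0.0, "harm_stranger": -50.0}


def expected_value(action, credence_in_nihilism):
    """Credence-weighted value of an action across the two views."""
    p = credence_in_nihilism
    return p * NIHILISM[action] + (1.0 - p) * SOME_REASONS[action]


def ranking(credence_in_nihilism):
    """Actions ordered from best to worst at a given credence in nihilism."""
    return sorted(
        ACTIONS,
        key=lambda a: expected_value(a, credence_in_nihilism),
        reverse=True,
    )


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.99):
        print(p, ranking(p))
```

For any credence in nihilism below 1, the ordering of actions is the same as if nihilism were ignored entirely, which is the sense in which that view can be set aside when deciding what to do.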

So if there are arguments, such as, I think, an evolutionary argument, that push us in the direction of kind of error theory, in a sense we can put them to the side, ’cause what we ought to do is just say, yeah, we take that really seriously. Give a high credence to error theory, but now ask, after all those arguments, which views most plausibly kind of bear their force.

So this is why with the kind of evolutionary worry, I’m just like, yes. But, supposing it’s the case that there actually are. Presumably conscious experiences themselves are useful in some evolutionary way that, again, we don’t really understand. I think, presumably, also good and bad experiences are useful in some evolutionary way that we don’t fully understand, perhaps because they have a tendency to motivate at least beings like us, and that in fact seems to be a key aspect of making a kind of goodness or badness statement. It’s at least somehow tied up to the idea of kind of motivation.

And then when I say ascribing a property to a conscious experience, I really just mean whatever it is that we mean when we say that this experience is red-seeming, this experience is blue-seeming. I mean, again, there are open philosophical questions about what we even mean by properties, but in the same way, this is bad-seeming, this is good-seeming.

Before I got into thinking about philosophy and naturalism and so on, would I have thought those things are kind of on a par? I think I would have, so it’s at least a pre-theoretically justified view to think, yeah, there just is this axiological property of my experience.

Lucas: This has made me much more optimistic. I think after my last podcast I was feeling quite depressed and nihilistic, and hearing you give this sort of non-naturalistic or naturalistic moral realist account is cheering me up a bit about the prospects of AI alignment and value in the world.

Will: I mean, I think you shouldn’t get too optimistic. I’m almost certainly wrong-

Lucas: Yeah.

Will: … sort of is my favorite view. But take any philosopher. What’s the chance that they’ve got the right views? Very low, probably.

Lucas: Right, right. I think I also need to be careful here that human beings have this sort of psychological bias where we give a special metaphysical status and kind of meaning and motivation to things which have objective whatever to it. I guess there’s also some sort of motivation that I need to be mindful of that seeks out to make value objective or more meaningful and foundational in the universe.

Will: Yeah. The thing that I think should make you feel optimistic, or at least motivated, is this argument that if nothing matters, it doesn’t matter that nothing matters. It just really ought not to affect what you do. You may as well act as if things do matter, and in fact we can have this project of trying to figure out if things matter, and that maybe could be an instrumental goal, which kind of gives a purpose for life: to get to a place where we really can figure out if it has any meaning. I think that sort of argument can at least give one grounds for getting out of bed in the morning.

Lucas: Right. I think there’s this philosophy paper that I saw, but I didn’t read, that was like, “nothing Matters, but it does matter”, with the one lower case M and then another capital case M, you know.

Will: Oh, interesting.

Lucas: Yeah.

Will: It sounds a bit like 4:20 ethics.

Lucas: Yeah, cool.

Moving on here into AI alignment. And before we get into this, I think that this is something that would also be interesting to hear you speak a little bit more about before we dive into AI alignment. What even is the value of moral information and moral philosophy, generally? Is this all just a bunch of BS, or how can it be interesting and/or useful in our lives, and in science and technology?

Will: Okay, terrific. I mean, and this is something I write about in a paper I’m working on now and also in the book, as well.

So, yeah, I think the stereotype of the philosopher engaged in intellectual masturbation, not doing really much for the world at all, is quite a prevalent stereotype. I’ll not comment on whether that’s true for certain areas of philosophy. I think it’s definitely not true for certain areas within ethics. What is true is that philosophy is very hard, ethics is very hard. Most of the time when we’re trying to do this, we make very little progress.

If you look at the long-run history of thought in ethics and political philosophy, the influence is absolutely huge. Even just take Aristotle, Locke, Hobbes, Mill, and Marx. The influence of political philosophy and moral philosophy there, it shaped thousands of years of human history. Certainly not always for the better, sometimes for the worse, as well. So, ensuring that we get some of these ideas correct is just absolutely crucial.

Similarly, even in more recent times … Obviously not as influential as these other people, but also it’s been much less time so we can’t predict into the future, but if you consider Peter Singer as well, his ideas about the fact that we may have very strong obligations to benefit those who are distant strangers to us, or that we should treat animal welfare just on a par with human welfare, at least on some understanding of those ideas, that really has changed the beliefs and actions of, I think, probably tens of thousands of people, and often in really quite dramatic ways.

And then when we think about well, should we be doing more of this, is it merely that we’re influencing things randomly, or are we making things better or worse? Well, if we just look to the history of moral thought, we see that most people in most times have believed really atrocious things. Really morally abominable things. Endorsement of slavery, distinctions between races, subjugation of women, huge discrimination against non-heterosexual people, and, in part at least, it’s been ethical reflection that’s allowed us to break down some of those moral prejudices. And so we should presume that we have very similar moral prejudices now. We’ve made a little bit of progress, but do we have the one true theory of ethics now? I certainly think it’s very unlikely. And so we need to think more if we want to get to the actual ethical truth, if we don’t wanna be living out moral catastrophes in the same way as we would if we kept slaves, for example.

Lucas: Right, I think we do want to do that, but I think that a bit later in the podcast we’ll get into whether or not that’s even possible, given economic, political, and militaristic forces acting upon the AI alignment problem and the issues with coordination and race to AGI.

Just to start to get into the AI alignment problem, I just wanna offer a little bit of context. It is implicit in the AI alignment problem, or value alignment problem, that AI needs to be aligned to some sort of ethic or set of ethics, this includes preferences or values or emotional dispositions, or whatever you might believe them to be. And so it seems that generally, in terms of moral philosophy, there are really two methods, or two methods in general, by which to arrive at an ethic. So, one is simply going to be through reason, and one is going to be through observing human behavior or artifacts, like books, movies, stories, or other things that we produce in order to infer and discover the observed preferences and ethics of people in the world.

The latter side of alignment methodologies is empirical and involves the agent interrogating and exploring the world in order to understand what the humans care about and value, as if values and ethics were simply a physical by-product of the world and of evolution. And the former is where ethics are arrived at through reason alone, and involves the AI or the AGI potentially going about ethics as a philosopher would, where one engages in moral reasoning about metaethics in order to determine what is correct. From the point of view of ethics, there is potentially only what the humans empirically do believe and then there is what we may or may not be able to arrive at through reason alone.

So, it seems that one or both of these methodologies can be used when aligning an AI system. And again, the distinction here is simply between sort of preference aggregation or empirical value learning approaches, or methods of instantiating machine ethics, reasoning, or decision-making in AI systems so they become agents of morality.

So, what I really wanna get into with you now is how metaethical uncertainty influences our decision over the methodology of value alignment. Over whether or not we are to prefer an empirical preference learning or aggregation type approach, or one which involves imbuing moral epistemology and ethical metacognition and reasoning into machine systems so they can discover what we ought to do. And how moral uncertainty, and metaethical moral uncertainty in particular, operates within both of these spaces once you’re committed to some view, or both of these views. And then we can get into issues in intertheoretic comparisons and how that arises here at many levels, the ideal way we should proceed if we could do what would be perfect, and again, what is actually likely to happen given race dynamics and political, economic, and militaristic forces.

Will: Okay, that sounds terrific. I mean, there’s a lot to cover there.

I think it might be worth me saying just maybe a couple of distinctions I think are relevant and kind of my overall view in this. So, in terms of distinctions, I think within what broadly gets called the alignment problem, I’d like to distinguish between what I’d call the control problem, then the kind of human values alignment problem, and then the actual alignment problem.

Where the control problem is just, can you get this AI to do what you want it to do? Where that’s maybe relatively narrowly construed, I want it to clean up my room, I don’t want it to put my cat in the bin, that’s kinda control problem. I think describing that as a technical problem is kind of broadly correct.

Second is then what gets called aligning AI with human values. For that, it might be the case that just having the AI pay attention to what humans actually do and infer their preferences that are revealed on that basis, maybe that’s a promising approach and so on. And that I think will become increasingly important as AI becomes larger and larger parts of the economy.

This is kind of already what we do when we vote for politicians who represent at least large chunks of the electorate. They hire economists who undertake kind of willingness-to-pay surveys and so on to work out what people want, on average. I do think that this is maybe more normatively loaded than people might often think, but at least you can understand that, just as the control problem is I have some relatively simple goal, which is, what do I want? I want this system to clean my room. How do I ensure that it actually does that without making mistakes that I wasn’t intending? This is kind of a broader problem of, well, you’ve got a whole society and you’ve got to aggregate their preferences for what society wants and so on.

But I think, importantly, there’s this third thing which I called a minute ago, the actual alignment problem, so let’s run with that. Which is just working out what’s actually right and what’s actually wrong and what ought we to be doing. I do have a worry that because many people in the wider world, often when they start thinking philosophically they start endorsing some relatively simple, subjectivist or relativist views. They might think that answering this question of well, what do humans want, or what do people want, is just the same as answering what ought we to do? Whereas for kind of the reductio of that view, just go back a few hundred years where the question would have been, well, the white man’s alignment problem, where it’s like, “Well, what do we want, society?”, where that means white men.

Lucas: Uh oh.

Will: What do we want them to do? So similarly, unless you’ve got the kind of such a relativist view that you think that maybe that would have been correct back then, that’s why I wanna kind of distinguish this range of problems. And I know that you’re kind of most interested in that third thing, I think. Is that right?

Lucas: Yeah, so I think I’m pretty interested in the second and the third thing, and I just wanna unpack a little bit of your distinction between the first and the second. So, the first was what you called the control problem, and you called the second just the plurality of human values and preferences and the issue of aligning to that in the broader context of the world.

It’s unclear to me how I get the AI to put a strawberry on the plate or to clean up my room and not kill my cat without the second thing having been done, at least to me.

There is a sense at a very low level where you’re sort of working on technical AI alignment, which involves working on the MIRI approach with agent foundations and trying to work on constraining optimization and corrigibility and docility and robustness and security and all of those sorts of things that people work on and the concrete problems in AI safety, stuff like that. But, it’s unclear to me where that sort of stuff is just limited to and includes the control problem, and where it begins requiring the system to be able to learn my preferences through interacting with me and thereby is already sort of participating in the second case where it’s sort of participating in AI alignment more generally, rather than being sort of like a low level controlled system.

Will: Yeah, and I should say that on this side of things I’m definitely not an expert, not really the person to be talking to, but I think you’re right. There’s going to be some big, gray area or transition between systems. So there’s one that might be cleaning my room, or even let’s just say it’s playing some sort of game, unfortunately I forget the example … It was under the blog post, an example of the alignment problem in the wild, or something, from OpenAI. But, just a very simple example of the AI playing a game, and you say, “Well, get as many points as possible.” And what you really want it to do is win a certain race, but what it ends up doing is driving this boat just round and round in circles because that’s the way of maximizing the number of points.

Lucas: Reward hacking.

Will: Reward hacking, exactly. That would be a kind of failure of the control problem, in that first sense. And then I believe there’s gonna be kind of gray areas, where perhaps it’s a certain sort of AI system where the whole point is it’s just implementing kind of what I want. And that might be very contextually determined, might depend on what my mood is of the day. For that, that might be a much, much harder problem and will involve kind of studying what I actually do and so on.
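The boat example can be reduced to a toy calculation that shows why a point-maximizing agent might prefer circling to finishing. This is a minimal sketch under made-up assumptions: the point values, episode length, and policy names are hypothetical and are not taken from the actual game discussed.

```python
# Toy illustration of reward hacking: the stated objective (points) and the
# intended objective (finishing the race) come apart.
# All numbers and names here are hypothetical.
FINISH_BONUS = 10   # points awarded once for crossing the finish line
TARGET_POINTS = 3   # points for hitting a target that keeps respawning
STEPS = 20          # length of one episode, in time steps

POLICIES = ["race_to_finish", "circle_targets"]


def run(policy):
    """Return (points, finished) for one episode under a fixed policy."""
    if policy == "race_to_finish":
        return FINISH_BONUS, True
    if policy == "circle_targets":
        # One target hit every 2 steps; the boat never reaches the finish.
        return TARGET_POINTS * (STEPS // 2), False
    raise ValueError(f"unknown policy: {policy}")


# What the literal instruction "get as many points as possible" selects.
best_by_points = max(POLICIES, key=lambda p: run(p)[0])
# What the designer actually wanted: a policy that finishes the race.
best_by_intent = max(POLICIES, key=lambda p: run(p)[1])

if __name__ == "__main__":
    print("maximizes points:", best_by_points)
    print("matches intent:  ", best_by_intent)
```

In this sketch the two selections differ, which is the failure being described: the system does exactly what it was told to optimize, and that is not what was wanted.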

We could go into the question of whether you can solve the problem of cleaning a room without killing my cat. Whether that is possible to solve without solving much broader questions, maybe that’s not the most fruitful avenue of discussion.

Lucas: So, let’s put aside this first case which involves the control problem, we’ll call it, and let’s focus on the second and the third, where again the second is defined as sort of the issue of the plurality of human values and preferences which can be observed, and then the third you described as us determining what we ought to do and tackling sort of the metaethics.

Will: Yeah, just tackling the fundamental question of, “Where ought we to be headed as a society?” One just extra thing to add onto that is that’s just a general question for society to be answering. And if there are kind of fast, or even medium-speed, developments in AI, perhaps suddenly we’ve gotta start answering that question, or thinking about that question even harder in a more kind of clean way than we have before. But even if AI were to take a thousand years, we’d still need to answer that question, ’cause it’s just fundamentally the question of, “Where ought we to be heading as a society?”

Lucas: Right, and so going back a little bit to the little taxonomy that I had developed earlier, it seems like your second case scenario would be sort of down to metaethical questions, which are behind and which influence the empirical issues with preference aggregation and there being plurality of values. And the third case would be, what would be arrived at through reason and, I guess, the reason of many different people.

Will: Again, it’s gonna involve questions of metaethics as well where, again, on my theory that metaethics … It would actually just involve interacting with conscious experiences. And that’s a critical aspect of coming to understand what’s morally correct.

Lucas: Okay, so let’s go into the second one first and then let’s go into the third one. And while we do that, it would be great if we could be mindful of problems in intertheoretic comparison and how they arise as we go through both. Does that sound good?

Will: Yeah, that sounds great.

Lucas: So, would you like to just sort of unpack, starting with the second view, the metaethics behind that, issues in how moral realism versus moral anti-realism will affect how the second scenario plays out, and other sorts of crucial considerations in metaethics that will affect the second scenario?

Will: Yeah, so for the second scenario, which again, to be clear, is the aggregating of the variety of human preferences across a variety of contexts and so on, is that right?

Lucas: Right, so that the agent can be fully autonomous and realized in the world that it is sort of an embodiment of human values and preferences, however construed.

Will: Yeah, okay, so here I do think all the metaethics questions are gonna play a lot more role in the third question. So again, it’s funny, it’s very similar to the question of kind of what mainstream economists often think they’re doing when it comes to cost-benefit analysis. Let’s just even start in the individual case. Even there, it’s not a purely kind of descriptive enterprise, where, again, let’s not even talk about AI. You’re just looking out for me. You and I are friends and you want to do me a favor in some way, how do you make a decision about how to do me that favor, how to benefit me in some way? Well, you could just look at the things I do and then infer on the basis of that what my utility function is. So perhaps every morning I go and I rob a convenience store and then I buy some heroin and then I shoot up and-

Lucas: Damn, Will!

Will: That’s my day. Yes, it’s a confession. Yeah, you’re the first to hear it.

Lucas: It’s crazy, in Oxford huh?

Will: Yeah, Oxford University is wild.

You see that behavior on my part and you might therefore conclude, “Wow, well what Will really likes is heroin. I’m gonna do him a favor and buy him some heroin.” Now, that seems kind of commonsensically pretty ridiculous. Well, assuming I’m demonstrating all sorts of bad behavior that looks like it’s very bad for me, it looks like a compulsion and so on. So instead what we’re really doing is not merely maximizing the utility function that’s gone by my revealed preferences, we have some deeper idea of kind of what’s good for me or what’s bad for me.

Perhaps that comes down to just what I would want to want, or what I want myself to want to want to want. Perhaps you can do it in terms of what are called second-order, third-order preferences. What idealized Will would want … That is not totally clear. Well firstly, it’s really hard to know kind of what would idealized Will want. You’re gonna have to start doing at least a little bit of philosophy there. Because I tend to favor hedonism, I think that an idealized version of my friend would want the best possible experiences. That might be very different from what they think an idealized version of themselves would want because perhaps they have some objective list account of well-being and they think well, what they would also want is knowledge for its own sake and appreciating beauty for its own sake and so on.

So, even there I think you’re gonna get into pretty tricky questions about what is good or bad for someone. And then after that you’ve got the question of preference aggregation, which is also really hard, both in theory and in practice. Where, do you just take strengths of preferences across absolutely everybody and then add them up? Well, firstly you might worry that you can’t actually make these comparisons of strengths of preferences between people. Certainly if you’re just looking at people’s revealed preferences, it’s really opaque how you would say if I prefer coffee to tea and you vice versa, who has the stronger preference? But perhaps we could look at behavioral facts to kind of try and at least anchor that, but it’s still then non-obvious that what we ought to do when we’re looking at everybody’s preferences is just maximize the sum rather than perhaps give some extra weighting to people who are more badly off, perhaps we give more priority to their interests. So these are kinda theoretical issues.
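The contrast between maximizing the sum and giving extra weight to the worse off can be made concrete with two aggregation rules applied to the same options. The sketch below is illustrative only: the welfare numbers are invented, it simply assumes that welfare levels are comparable across people (which is exactly what is in question above), and the square-root transform is one hypothetical way to encode priority for the worse off.

```python
import math

# Hypothetical welfare levels for three people under two policies.
# Interpersonal comparability is assumed here purely for illustration.
POLICIES = {
    "A": [1.0, 9.0, 9.0],  # higher total, but one person is very badly off
    "B": [5.0, 6.0, 6.0],  # lower total, more evenly spread
}


def utilitarian(welfare):
    """Plain sum of welfare: every unit counts the same for everyone."""
    return sum(welfare)


def prioritarian(welfare):
    """Sum of a concave transform: gains to the worse off count for more.

    The square root is an arbitrary choice of concave function.
    """
    return sum(math.sqrt(w) for w in welfare)


def best_policy(rule):
    """Name of the policy that scores highest under the given rule."""
    return max(POLICIES, key=lambda name: rule(POLICIES[name]))


if __name__ == "__main__":
    print("utilitarian picks: ", best_policy(utilitarian))
    print("prioritarian picks:", best_policy(prioritarian))
```

With these numbers the two rules disagree, so the choice of aggregation rule is itself a substantive ethical decision and not a technical detail.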

And then secondly, there are kinda just practical issues of implementing that, where you actually need to ensure that people aren’t faking their preferences. And there’s a well-known literature in voting theory that says that basically any aggregation system you have, any voting system, is going to be manipulable in some way. You’re gonna be able to get a better result for yourself, at least in some circumstances, by misrepresenting what you really want.
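A small worked case shows what manipulability looks like in practice. The sketch below uses a Borda count purely as one convenient example of a voting rule; the voters, candidates, and ballots are hypothetical. One voter gets a result they prefer by ranking their main rival last instead of reporting their true ranking.

```python
def borda_winner(ballots):
    """Borda count: with n candidates, a ballot gives n-1 points to its top
    choice, n-2 to the next, and so on. Ties are broken alphabetically."""
    n = len(ballots[0])
    scores = {}
    for ballot in ballots:
        for rank, candidate in enumerate(ballot):
            scores[candidate] = scores.get(candidate, 0) + (n - 1 - rank)
    # Iterating in sorted order makes max() return the alphabetically
    # first candidate among those with the top score.
    return max(sorted(scores), key=lambda c: scores[c])


# Hypothetical electorate of three voters and four candidates.
voter_1 = ["A", "B", "C", "D"]
voter_2 = ["A", "B", "C", "D"]
voter_3_honest = ["B", "A", "C", "D"]     # truly prefers B to A
voter_3_strategic = ["B", "C", "D", "A"]  # buries A to help B

honest_result = borda_winner([voter_1, voter_2, voter_3_honest])
strategic_result = borda_winner([voter_1, voter_2, voter_3_strategic])

if __name__ == "__main__":
    print("honest ballots:   ", honest_result)
    print("strategic ballots:", strategic_result)
```

With honest ballots A wins 8 points to 7; when the third voter misreports, A drops to 6 and B wins with 7, an outcome that voter genuinely prefers. Any system that aggregates reported preferences has to reckon with this kind of incentive.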

Again, these are kind of issues that our society already faces, but they’re gonna bite even harder when we’re thinking about delegating to artificial agents.

Lucas: There’s two levels to this that you’re sort of elucidating. The first is that you can think of the AGI as being something which can do favors for everybody in humanity, so there are issues empirically and philosophically and in terms of understanding other agents about what sort of preferences should that AGI be maximizing for each individual, say being constrained by what is legal and what is generally converged upon as being good or right. And then there’s issues with preference aggregation which come up more given that we live in a resource-limited universe and world, where not all preferences can coexist and there has to be some sort of potential cancellation between different views.

And so, in terms of this higher level of preference aggregation … And I wanna step back here to metaethics and difficulties of intertheoretic comparison. It would seem that given your moral realist view, it would affect how the weighting would potentially be done. Because it seemed like before you were alluding to the fact that if your moral realist view were true, then the way in which we could determine what we ought to do or what is good and true about morality would be through exploring the space of all possible experiences, right, so we can discover moral facts about experiences.

Will: Mm-hmm (affirmative).

Lucas: And then in terms of preference aggregation, there would be people who would be right or wrong about what is good for them or the world.

Will: Yeah, I guess this is, again why I wanna distinguish between these two types of value alignment problem, where on the second type, which is just kind of, “What does society want?” Societal preference aggregation. I wasn’t thinking of it as there being kind of right or wrong preferences.

In just the same way as there’s this question of just, “I want the system to do X” but there’s a question of, “Do I want that?” or “How do you know that I want that?”, there’s a question of, “How do you know what society wants?” That’s a question in and of its own right that’s then separate from that third alignment issue I was raising, which then starts to bake in, well, if people have various moral preferences, views about how the world ought to be, yeah, some are right and some are wrong. And in no way should you give some aggregation over all those different views, because ideally you should give no weight to the ones that are wrong and if any are true, they get all the weight. It’s not really about kind of preference aggregation in that way.

Though, if you think about it as everyone is making certain sort of guess at the moral truth, then you could think of that like a kind of judgment aggregation problem. So, it might be like data or input for your kind of moral reasoning.

Lucas: I think I was just sort of conceptually slicing this a tiny bit different from you. But that’s okay.

So, staying on this second view, it seems like there’s obviously going to be a lot of empirical issues and issues in understanding persons and idealized versions of themselves. Before we get in to intertheoretic comparison issues here, what is your view on coherent extrapolated volition, sort of, being the answer to this second part?

Will: I don’t really know that much about it. From what I do know, it always seemed under-defined. As I understand it, the key idea is just, you take everyone’s idealized preferences in some sense, and then I think what you do is just take a sum of what everyone’s preference is. I’m personally quite in favor of the summation strategy. I think we can make interpersonal comparisons of strengths of preferences, and I think summing people’s preferences is the right approach.

We can use certain kinds of arguments that also have application in moral philosophy, like the idea of “If you didn’t know who you were going to be in society, how would you want to structure things?” And if you’re a rational, self-interested agent, maximizing expected utility, then you’ll do the utilitarian aggregation function, so you’ll maximize the sum of preference strength.
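The step from not knowing who you will be to maximizing the sum can be spelled out in a short derivation. This is a sketch under stated assumptions that are not in the conversation itself: there are n people, person i has utility u_i(x) for a social arrangement x, and you have an equal chance 1/n of turning out to be any one of them.

```latex
% Assumptions (for illustration): n people, utility u_i(x) for arrangement x,
% and an equal chance 1/n of being any given person.
\[
  \mathbb{E}[U(x)] \;=\; \sum_{i=1}^{n} \frac{1}{n}\, u_i(x)
  \;=\; \frac{1}{n} \sum_{i=1}^{n} u_i(x)
\]
% Because 1/n is a positive constant, the same arrangement x maximizes
% the expected utility and the plain sum of utilities:
\[
  \arg\max_{x}\, \mathbb{E}[U(x)] \;=\; \arg\max_{x} \sum_{i=1}^{n} u_i(x)
\]
```

So a self-interested expected-utility maximizer behind this veil ranks arrangements exactly as the sum of preference strengths does. The equal-chance assumption and the comparability of the u_i across people are doing real work here.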

But then, if we’re doing this idealized preference thing, all the devil’s going to be in the details of, “Well how are you doing this idealization?” Because, given my preferences for example, for what they are … I mean my preferences are absolutely … Certainly they’re incomplete, they’re almost certainly cyclical, who knows? Maybe there’s even some preferences I have that are areflexive of things, as well. Probably contradictory, as well, so there’s questions about what does it mean to idealize, and that’s going to be a very difficult question, and where a lot of the work is, I think.

Lucas: So I guess, just two things here. What are sort of the timeline and actual real-world working relationship here, between the second problem that you’ve identified and the third problem that you’ve identified, and what is the role and work that preferences are doing here, for you, within the context of AI alignment, given that you’re sort of partial to a form of hedonistic consequentialism?

Will: Okay, terrific, ’cause this is kind of important framing.

In terms of answering this alignment problem, the deep one of just where ought societies to be going, I think the key thing is to punt it. The key thing is to get us to a position where we can think about and reflect on this question, and really for a very long time, so I call this the long reflection. Perhaps it’s a period of a million years or something. We’ve got a lot of time on our hands. That’s really not the kind of scarce commodity, so there are various stages to get into that state.

The first is to reduce extinction risks down basically to zero, put us in a position of kind of existential security. The second then is to start developing a society where we can reflect as much as possible and keep as many options open as possible.

Something that wouldn’t be keeping a lot of options open would be, say we’ve solved what I call the control problem, we’ve got these kind of lapdog AIs that are running the economy for us, and we just say, “Well, these are so smart, what we’re gonna do is just tell it, ‘Figure out what’s right and then do that.'” That would really not be keeping our options open. Even though I’m sympathetic to moral realism and so on, I think that would be quite a reckless thing to do.

Instead, what we want to have is something kind of … We’ve gotten to this position of real security. Maybe also along the way, we’ve fixed the various particularly bad problems of the present, poverty and so on, and now what we want to do is just keep our options open as much as possible and then kind of gradually work on improving our moral understanding where if that’s supplemented by AI system …

I think there’s tons of work that I’d love to see developing how this would actually work, but I think the best approach would be to get the artificially intelligent agents to be just doing moral philosophy, giving us arguments, perhaps creating new moral experiences that it thinks can be informative and so on, but letting the actual decision making or judgments about what is right and wrong be left up to us. Or at least have some kind of graduated thing where we gradually transition the decision making more and more from human agents to artificial agents, and maybe that’s over a very long time period.

What I kind of think of as the control problem in that second level alignment problem, those are issues you face when you’re just addressing the question of, “Okay. Well, we’re now gonna have an AI run economy,” but you’re not yet needing to address the question of what’s actually right or wrong. And then my main thing there is just we should get ourselves into a position where we can take as long as we need to answer that question and have as many options open as possible.

Lucas: I guess here given moral uncertainty and other issues, we would also want to factor in issues with astronomical waste into how long we should wait?

Will: Yeah. That’s definitely informing my view, where it’s at least plausible that morality has an aggregative component, and if so, then the sheer vastness of the future may, because we’ve got half a billion to a billion years left on Earth, a hundred trillion years before the stars burn out, and then … I always forget these numbers, but I think like a hundred billion stars in the Milky Way, ten trillion galaxies.

With just vast resources at our disposal, the future could be astronomically good. It could also be astronomically bad. What we want to ensure is that we get to the good outcome, and given the time scales involved, even what seems like an incredibly long delay, like a million years, is actually just very little time indeed.

Lucas: In half a second I want to jump into whether or not this is actually likely to happen given race dynamics and that human beings are kind of crazy. The sort of timeline here is that we’re solving the technical control problem up until and on our way to sort of AGI and what might be superintelligence, and then we are also sort of idealizing everyone’s values and lives in a way such that they have more information and they can think more and have more free time and become idealized versions of themselves, given constraints within issues of values canceling each other out and things that we might end up just deeming to be impermissible.

After that is where this period of long reflection takes place, and sort of the dynamics and mechanics of that are seemingly open questions. It seems that first comes computer science and global governance and coordination and strategy issues, and then comes a long time of philosophy.

Will: Yeah, then comes the million years of philosophy, so I guess not very surprising a philosopher would suggest this. Then the dynamics of the setup is an interesting question, and a super important one.

One thing you could do is just say, “Well, we’ve got ten billion people alive today, let’s say. We’re gonna divide the universe into ten billionths, so maybe that’s a thousand galaxies each or something.” And then you can trade after that point. I think that would get a pretty good outcome. There’s questions of whether you can enforce it or not into the future. There’s some arguments that you can. But maybe that’s not the optimal process, because especially if you think that “Wow! Maybe there’s actually some answer, something that is correct,” well, maybe a lot of people miss that.
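[Editor’s note: the “thousand galaxies each” figure follows from the rough numbers quoted in this conversation. The short sketch below only checks that arithmetic; the function name `equal_share` is hypothetical, and the inputs are the speaker’s own approximate figures, not authoritative astronomical data.]

```python
# Rough figures quoted in the conversation; the speaker flags them as approximate.
GALAXIES = 10 * 10**12  # "ten trillion galaxies"
PEOPLE = 10 * 10**9     # "ten billion people alive today, let's say"


def equal_share(total: int, claimants: int) -> int:
    """Whole units each claimant receives under a strictly equal split."""
    if claimants <= 0:
        raise ValueError("claimants must be positive")
    return total // claimants


galaxies_each = equal_share(GALAXIES, PEOPLE)
print(galaxies_each)  # 1000
```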

I actually think if we did that and if there is some correct moral view, then I would hope that incredibly well informed people who have this vast amount of time, and perhaps intellectually augmented people and so on who have this vast amount of time to reflect would converge on that answer, and if they didn’t, then that would make me more suspicious of the idea that maybe there is a real fact of the matter. But it’s still early days, and we’d really want to think a lot about what goes into the setup of that kind of long reflection.

Lucas: Given this account that you’ve just given about how this should play out in the long term or what it might look like, what is the actual probability, do you think, that this will happen given the way that the world actually is today and the game theoretic forces at work?

Will: I think I’m going to be very hard pressed to give a probability. I don’t think I know even what my subjective credence is. But speaking qualitatively, I’d think it would be very unlikely that this is how it would play out.

Again, I’m like Brian and Dave in that I think if you look at just history, I do think moral forces have some influence. I wouldn’t say they’re the largest influence. I think probably randomness explains a huge amount of history, especially when you think about how certain events are just very determined by actions of individuals. Economic forces and technological forces, environmental changes are also huge as well. It is hard to think at least that it’s going to be likely that such a well orchestrated dynamic would occur. But I do think it’s possible and I think we can increase the chance of that happening by the careful actions that people like FLI are doing at the moment.

Lucas: That seems like the sort of ideal scenario, absolutely, but I also am worried that people don’t like to listen to moral philosophers or people like that, and that potentially selfish government forces and things like that will end up taking over and controlling things, which is kind of sad for the cosmic endowment.

Will: That’s exactly right. I think my chances … If there was some hard takeoff and sudden leap to artificial general intelligence, which I think is relatively unlikely, but again is possible, I think that’s probably the most scary ’cause it means that a huge amount of power is suddenly in the hands of a very small number of people potentially. You could end up with the very long run future of humanity being determined by the idiosyncratic preferences of just a small number of people, so it would be very dependent on whether those people’s preferences are good or bad.

With a kind of slow takeoff, so where there’s many decades in terms of development of AGI and it gradually getting incorporated into the economy, I think there’s somewhat more hope there. Society will be a lot more prepared. It’s less likely that something very bad will happen. But my default presumption, when we’re talking about multiple nations, billions of people, doing something that’s very carefully coordinated, is that it’s not going to happen. We have managed to do things that have involved international cooperation and amazing levels of operational expertise and coordination in the past. I think the eradication of smallpox is perhaps a good example of that. But it’s something that we don’t see very often, at least not now.

Lucas: It looks like we need to create a Peter Singer of AI safety or some other philosopher who has had a tremendous impact on politics and society to spread this sort of vision throughout the world such that it would more likely become realized. Is that potentially most likely?

Will: Yeah. I think if a wide number of the political leaders, even if just political leaders of US, China, Russia, all were on board with global coordination on the issue of AI, or again, whatever other transformative technology might really upend things in the 21st century, and were on board with “How important it is that we get to this kind of period of long reflection where we can really figure out where we’re going,” then that alone would be very promising.

Then the question of just how promising is that I think depends a lot on maybe the robustness of … Even if you’re a moral realist, there’s a question of “How likely do you think it is that people will get the correct moral view?” It could be the case that it’s just this kind of strong attractor where even if you’ve got nothing as clean cut as the long reflection that I was describing, instead some really messy thing, perhaps various wars and it looks like feudal society or something, and anyone would say that civilization looks likely chaotic, maybe it’s the case that even given that, just given enough time and enough reasoning power, people will still converge on the same moral view.

I’m probably not as optimistic as that, but it’s at least a view that you could hold.

Lucas: In terms of the different factors that are going into the AI alignment problem and the different levels you’ve identified, first, second, and third, which side do you think is lacking the most resources and attention right now? Are you most worried about the control problem, that first level? Or are you more worried about potential global coordination and governance stuff at the potential second level or moral philosophy stuff at the third?

Will: Again, flagging … I’m sure I’m biased on this, but I’m currently by far the most worried on the third level. That’s for a couple of reasons. One is I just think the vast majority of the world are simple subjectivists or relativists, so the idea that we ought to be engaging in real moral thinking about how we use society, where we go with society, how we use our cosmic endowment as you put it, my strong default is that that question just never even really gets phrased.

Lucas: You don’t think most people are theological moral realists?

Will: Yeah. I guess it’s true that I’m just thinking about-

Lucas: Our bubble?

Will: My bubble, yeah. Well educated westerners. Most people in the world at least would say they’re theological moral realists. One thought is just that … I think my default is that some sort of relativism will hold sway and people will just not really pay enough attention to think about what they ought to do. A second relevant thought is just I think the best possible universe is plausibly really, really good, like astronomically better than alternative extremely good universes.

Lucas: Absolutely.

Will: It’s also the case that if you’re … Even like slight small differences in moral view might lead you to optimize for extremely different things. Even just a toy example of preference utilitarianism vs hedonistic utilitarianism, what you might think of as two very similar views, I think in the actual world there’s not that much difference between them, because we just kind of know what makes people better off, at least approximately, improves their conscious experiences, it also is generally what they want, but when you’re kind of technologically unconstrained, it’s plausible to me that the optimal configuration of things will look really quite different between those two views. I guess I kind of think the default is that we get it very badly wrong and it will require really sustained work in order to ensure we get it right … If it’s the case that there is a right answer.

Lucas: Is there anything with regards to issues in intertheoretic comparisons, or anything like that at any one of the three levels which we’ve discussed today that you feel we haven’t sufficiently covered or something that you would just like to talk about?

Will: Yeah. I know that one of your listeners was asking whether I thought they were solvable even in principle, by some superintelligence, and I think they are. I think they are if other issues in moral philosophy are solvable. I think that’s particularly hard, but I think ethics in general is very hard.

I also think it is the case that whatever output we have at the end of this kind of long deliberation, again it’s unlikely we’ll get to credence 1 in a particular view, so we’ll have some distribution over different views, and we’ll want to take that into account. Maybe that means we do some kind of compromise action.

Maybe that means we just distribute our resources in proportion with our credence in different moral views. That’s again one of these really hard questions that we’ll want if at all possible to punt on and leave to people who can think about this in much more depth.
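[Editor’s note: as a purely illustrative sketch of what “distribute our resources in proportion with our credence” could mean arithmetically, the snippet below splits a resource pool across views according to normalized credences. The view labels, the credence numbers, and the function name `allocate_by_credence` are hypothetical assumptions for illustration; nothing here is a method proposed in the conversation, and it sets aside the harder questions about compromise actions mentioned above.]

```python
from fractions import Fraction


def allocate_by_credence(credences, total):
    """Split `total` across views in proportion to their (normalized) credences."""
    weight_sum = sum(credences.values())
    if weight_sum <= 0:
        raise ValueError("credences must sum to a positive number")
    return {view: total * c / weight_sum for view, c in credences.items()}


# Hypothetical credences over three unnamed moral views (assumed numbers).
credences = {
    "view_a": Fraction(1, 2),
    "view_b": Fraction(3, 10),
    "view_c": Fraction(1, 5),
}

shares = allocate_by_credence(credences, total=1000)
for view, share in shares.items():
    print(view, share)
```

Exact fractions are used so the shares always sum back to the total; with floating-point credences the same function works but the sum may differ by rounding error.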

Then in terms of aggregating societal preferences, that’s more like the problem of interpersonal comparisons of preference strength, which is kind of formally isomorphic but is at least a distinct issue.

Lucas: At the second and the third levels is where the intertheoretic problems are really going to be arising, and at that second level where the AGI is potentially working to idealize our values, I think there is again the open question about in the real world, whether or not there will be moral philosophers at the table or in politics or whoever has control over the AGI at that point in order to work on and think more deeply about intertheoretic comparisons of value at that level and timescale. Just thinking a little bit more about what we ought to do or what we should do realistically, given potential likely outcomes about whether or not this sort of thinking will or will not be at the table.

Will: My default is just the crucial thing is to ensure that this thinking is more likely to be at the table. I think it is important to think about, “Well, what ought we to do then,” if we think it’s very likely that things go badly wrong. Maybe it’s not the case that we should just be aiming to push for the optimal thing, but for some kind of second best strategy.

I think at the moment we should just be trying to push for the optimal thing. In particular, that’s in part because my view is that an optimal universe is just so much better than even an extremely good one, that I just kind of think we ought to be really trying to maximize the chance that we can figure out what that is and then implement it. But it would be interesting to think about it more.

Lucas: For sure. I guess just wrapping up here, did you ever have the chance to look at those two Lesswrong posts by Worley?

Will: Yeah, I did.

Lucas: Did you have any thoughts or comments on them? If people are interested you can find links in the description.

Will: I read the posts, and I was very sympathetic in general to what he was thinking through. In particular the principle of philosophical conservatism. Hopefully I’ve shown that I’m very sympathetic to that, so trying to think “What are the minimal assumptions? Would this system be safe? Would this path make sense on a very, very wide array of different philosophical views?” I think the approach I’ve suggested, which is keeping our options open as much as possible and punting on the really hard questions, does satisfy that.

I think one of his posts was talking about “Should we assume moral realism or assume moral antirealism?” It seems like there our views differed a little bit, where I’m more worried that everyone’s going to assume some sort of subjectivism and relativism, and that there might be some moral truth out there that we’re missing and we never think to find it, because we decide that what we’re interested in is maximizing X, so we program agents to build X and then just go ahead with it, whereas actually the thing that we ought to have been optimizing for is Y. But broadly speaking, I think this question of trying to be as ecumenical as possible philosophically speaking makes a lot of sense.

Lucas: Wonderful. Well, it’s really been a joy speaking, Will. Always a pleasure. Is there anything that you’d like to wrap up on, anywhere people can follow you or check you out on social media or anywhere else?

Will: Yeah. You can follow me on Twitter @WillMacAskill. If you want to read more on some of my work, you can find me at williammacaskill.com.

Lucas: To be continued. Thanks again, Will. It’s really been wonderful.

Will: Thanks so much, Lucas.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: The Metaethics of Joy, Suffering, and Artificial Intelligence with Brian Tomasik and David Pearce

What role does metaethics play in AI alignment and safety? How might paths to AI alignment change given different metaethical views? How do issues in moral epistemology, motivation, and justification affect value alignment? What might be the metaphysical status of suffering and pleasure? What’s the difference between moral realism and anti-realism and how is each view grounded? And just what does any of this really have to do with AI?

The Metaethics of Joy, Suffering, and AI Alignment is the fourth podcast in the new AI Alignment series, hosted by Lucas Perry. For those of you that are new, this series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with David Pearce and Brian Tomasik. David is a co-founder of the World Transhumanist Association, currently rebranded Humanity+. You might know him for his work on The Hedonistic Imperative, a book focusing on our moral obligation to work towards the abolition of suffering in all sentient life. Brian is a researcher at the Foundational Research Institute. He writes about ethics, animal welfare, and future scenarios on his website “Essays On Reducing Suffering.” 

Topics discussed in this episode include:

  • What metaethics is and how it ties into AI alignment or not
  • Brian and David’s ethics and metaethics
  • Moral realism vs antirealism
  • Emotivism
  • Moral epistemology and motivation
  • Different paths to and effects on AI alignment given different metaethics
  • Moral status of hedonic tones vs preferences
  • Can we make moral progress and what would this mean?
  • Moving forward given moral uncertainty
In this interview we discuss ideas contained in the work of Brian Tomasik and David Pearce. You can learn more about Brian’s work here and here, and David’s work here. You can hear more in the podcast above or read the transcript below.

Lucas: Hey, everyone. Welcome back to the AI Alignment podcast series with the Future of Life Institute. Today, we’ll be speaking with David Pearce and Brian Tomasik. David is a co-founder of the World Transhumanist Association, rebranded Humanity+, and is a prominent figure within the transhumanism movement in general. You might know him from his work on The Hedonistic Imperative, a book which explores our moral obligation to work towards the abolition of suffering in all sentient life through technological intervention.

Brian Tomasik writes about ethics, animal welfare, and far-future scenarios from a suffering-focused perspective on his website reducing-suffering.org. He has also helped found the Foundational Research Institute, which is a think tank that explores crucial considerations for reducing suffering in the long term future. If you have been finding this podcast interesting or useful, remember to follow us on your preferred listening platform and share the episode on social media. Today, Brian, David, and I speak about metaethics, key concepts and ideas in the space, explore the metaethics of Brian and David, and how this all relates to and is important for AI alignment. This was a super fun and interesting episode and I hope that you find it valuable. With that, I give you Brian Tomasik and David Pearce.

Thank you so much for coming on the podcast.

David: Thank you Lucas.

Brian: Glad to be here.

Lucas: Great. We can start off with you David and then, you Brian and just giving a little bit about your background, the intellectual journey that you’ve been on and how that brought you here today.

David: Yes. My focus has always been on the problem of suffering, a very ancient problem. Buddhism and countless other traditions have been preoccupied by the problem of suffering. I’m also a transhumanist, and what transhumanism brings to the problem of suffering is the idea that it’s possible to use technology, in particular biotechnology, to phase out suffering, not just in humans but throughout the living world, and ideally replace it by gradients of intelligent wellbeing. Transhumanism is a very broad movement embracing not just radical mood enrichment but also super longevity and super intelligence. This is what brings me, and us, here today, in that there is no guarantee that human preoccupations with the problem of suffering are going to overlap with those of posthuman super intelligence.

Lucas: Awesome, and so you, Brian.

Brian: I’ve been interested in utilitarianism since I was 18 and I discovered the word. I immediately looked it up and was interested to see that the philosophy mirrored some of the things that I had been thinking about up to that point. I became interested in animal ethics and the far future. A year after that, I actually discovered David’s writings on The Hedonistic Imperative, along with other factors. His writings helped to inspire me to care more about suffering relative to the creation of happiness. Since then, I’ve been what you might call suffering-focused, which means I think that the reduction of suffering has more moral priority than other values. I’ve written about both animal ethics including wild animal suffering as well as risks of astronomical future suffering, what are called s-risks. You had a recent podcast episode with Kaj Sotala to talk about s-risks.

In general, I think that from my perspective, one important thing to think about with AI is what sorts of outcomes could result in large amounts of suffering. We should try to steer away from those possible future scenarios.

Lucas: Given our focus on AI alignment, I’d like to just offer a little bit of context. Today, this episode will be focusing on ethics. The AI alignment problem is traditionally seen as something which is predominantly technical. While a large, large portion of it is technical, the end towards which the technical AI is aimed or the ethics which is imbued within it or embodied within it is still an open and difficult question. Broadly, just to have everything defined here, we can understand ethics here just as a method of seeking to understand what we ought to do and what counts as moral or good.

The end goal of AI safety is to create beneficial intelligence, not undirected intelligence. What beneficial exactly entails is still an open question that largely exists in the domain of ethics. Even if all the technical issues surrounding the creation of an artificial general intelligence or super intelligence are solved, we will still face deeply challenging ethical questions that will have tremendous consequences for earth-originating intelligent life. This is what is meant when it is said that we must do philosophy or ethics on a deadline. In the spirit of that, that’s why we’re going to be focusing this podcast today on metaethics and particularly the metaethics of David Pearce and Brian Tomasik, which also happen to be ethical views which are popular I would say among people interested in the AI safety community.

I think that Brian and David have enough disagreements that this should be pretty interesting. Again, just going back to this idea of ethics, I think given this goal, ethics can be seen as a lens through which to view safe AI design. It’s also a cognitive architecture to potentially be instantiated in AI through machine ethics. That would potentially make AIs ethical reasoners, ethical decision-makers, or both. Ethics can also be developed, practiced and embodied by AI researchers and their collaborators, and can also be seen as a discipline through which we can guide AI research and adjudicate its impacts in the world.

There is an ongoing debate about what the best path forward is for generating ethical AI, whether it’s the project of machine ethics through bottom-up or top-down approaches, or just a broad project of AI safety and AI safety engineering where we seek out corrigibility and docility, and alignment, and security in machine systems, or probably even some combination of the two. It’s unclear what the outcome of AI will be, but what is more certain is that AI promises to produce and make relevant both age-old and novel moral considerations through areas such as algorithmic bias and technological disemployment and autonomous weapons, and privacy, big data systems, and even possible phenomenal states in machines.

We’ll even see new ethical issues with what might potentially one day be super intelligence and beyond. Given this, I think I’d like to just dive in first with you Brian and then, with you David. If you could just get into what the foundation is of your moral view? Then, afterwards, we can dive into the metaethics behind it.

Brian: Sure. At bottom, the reason that I place foremost priority on suffering is emotion. Basically, the emotional experience of having suffered myself intensely from time to time and having empathy when I see others suffering intensely. That experience of either feeling it yourself or seeing others in extreme pain carries just a moral valence to me or a spiritual sensation you might call it that seems different from the sensation that I feel from anything else. It seems just obvious at an emotional level that say torture or being eaten alive by a predatory animal or things of that nature have more moral urgency than anything else. That’s the fundamental basis. You can also try to make theoretical arguments to come to the same conclusion. For example, people have tried to advance what’s called the asymmetry, which is the intuition that it’s bad to create a new being who will suffer a lot but it’s not wrong to fail to create a being that will be happy or at least not nearly as wrong.

From that perspective, you might care more about preventing the creation of suffering beings than about creating additional happy beings. You can also advance the idea that maybe preferences are always a negative debt that has to be repaid. Maybe when you have a preference that’s a bad thing and then, it’s only by fulfilling the preference that you erase the bad thing. This would be similar to the way in which Buddhism says that suffering arises from craving. The goal is to cease the cravings, which can be done either through fulfilling the cravings, giving the organism what the organism wants, or not having the cravings in the first place. Those are some potential theoretical frameworks from which to also derive a suffering-focused ethical view. For me personally, the emotional feeling is the most important basis.

David: I would very much like to echo what Brian was saying there. I mean there is something about the nature of intense suffering. One can’t communicate it to someone who hasn’t suffered, I mean someone who is for example born with congenital analgesia or insensitivity to pain, but there is something that is self-intimatingly nasty and disvaluable about suffering. However, evolution hasn’t engineered us of course to care impartially about the suffering of all sentient beings. My suffering and that of my genetic kin tend to matter far more to me than anything else. Insofar as we aspire to become transhuman and posthuman, we should be aspiring to this godlike perspective that takes into account the suffering of all sentient beings, and that recognizes the egocentric illusion is a genetically adaptive lie.

How does this tie in to the question of posthuman super intelligence? Of course, there are very different conceptions of what posthuman super intelligence is going to be. I’ve always had what one might say is a more traditional conception of super intelligence in which posthuman super intelligence is going to be our biological descendants enhanced by AI but nonetheless still our descendants. However, there are what might crudely be called two other conceptions of posthuman super intelligence. One is this Kurzweilian fusion of humans and our machines, such that the difference between humans and our machines ceases to be relevant.

There’s another conception of super intelligence that you might say in some ways is the most radical: the intelligence explosion that was first conceived by I.J. Good but has been developed by Eliezer Yudkowsky, MIRI, and most recently by Nick Bostrom, that conceives of some kind of runaway explosion, recursively self-improving AI and yes, there being no guarantee that the upshot of this intelligence explosion is going to be in any way congenial to human values as we understand them. I’m personally skeptical about the intelligence explosion in this sense but yeah, it’s worth clarifying what one means by posthuman super intelligence.

Lucas: Wonderful. Right before we dive into the metaethics behind these views and their potential relationship with AI alignment, and just broadening the discussion to include ethics and exploring some of these key terms, I’d just like to touch on the main branches of ethics to provide some context and mapping for us. Generally, ethics is understood to have three branches, those being metaethics, normative ethics, and applied ethics. Traditionally, applied ethics is viewed as the application of normative and metaethical views to specific cases and situations to determine the moral status of said case or situation in order to decide what ought to be done.

An example of that might be applying one’s moral views to factory farming to determine whether or not it is okay to factory farm animals for their meat. The next branch moving upwards in abstraction would be normative ethics, which examines and deconstructs or constructs the principles and ethical systems we use for assessing the moral worth and permissibility of specific actions and situations. This branch is traditionally viewed as the formal ethical structures that we apply to certain situations and people are familiar with the deontological ethics and consequentialism, or utilitarianism, or virtue ethics. These are all normative ethical systems.

What we’ll be discussing today is primarily metaethics. Metaethics seeks to understand morality and ethics itself. It seeks to understand the nature of ethical statements, attitudes, motivation, properties and judgments. It seeks to understand whether or not ethics relates to objective truths about the world and about people, or whether it’s just simply subjective or if all ethical statements are in fact false. It seeks to understand what people mean when they express ethical judgments or statements. This gets into things like ethical uncertainty and justification theories, and substantial theories, and semantic theories of ethics.

Obviously, these are all the intricacies of the end towards which AI may be aimed. Given even the epistemology of metaethics and ethics in general, that also has major implications for what AIs might be able to discover about ethics or what they may not be able to discover about ethics. Again today, we’ll just be focusing on metaethics and the metaethics behind David and Brian’s views. I guess just to structure this a little bit, just to really start to use the formal language of metaethics. As a little bit of background again, semantic theories in ethics seek to address the question of what is the linguistic meaning of moral terms or judgments.

These are primarily concerned with whether or not moral statements contain truth values or are arbitrary and subjective. There are other branches within semantic theories, but there are two main branches. The first of these is noncognitivism. Noncognitivism refers to a group of theories which hold that moral statements are neither true nor false because they do not express genuine propositions. Usually, these noncognitivist views take forms like emotivism, where people think that when someone expresses a moral view or attitude like “suffering is wrong,” they’re simply expressing an emotion like “boo, suffering,” or expressing that suffering merely bothers them or is bad to them, rather than making some sort of true or false claim about the world. Standing in contrast to noncognitivism is cognitivism, which refers to a set of theories which hold that moral sentences express genuine propositions. That means that they can have truth values.

This is to say that they are capable of being true or false. Turning back to Brian and David’s views, how would you each view your moral positions as you’ve expressed them thus far? Would you hold yourself to a cognitivist view or a noncognitivist view? I guess we can start with you, David.

David: Yes. I’d just say it’s built into the nature of, let’s say, agony that agony is disvaluable. Now, you might say that there is nothing in the equations of physics and science that says anything over and above the experience itself, something like redness. Yeah, redness is subjective. It’s mind-dependent. Yet, unless one thinks minds don’t exist in the physical universe, redness is nonetheless an objective feature of the natural physical world. I would say that for reasons we simply don’t understand, the pleasure-pain axis discloses the world’s inbuilt metric of value and disvalue. It’s not an open question whether something like agony is disvaluable to the victim.

Now, of course, someone might say, “Well, yes. Agony is disvaluable to you but it’s not disvaluable to me.” I would say that this reflects an epistemological limitation and that in so far as you can access what it is like to be me and I’m in agony, then you will appreciate why agony is objectively disvaluable.

Lucas: Right. The view here is a cognitivist view where you think that it is true to say that there is some intrinsic property or quality to suffering or joy that makes it I guess analytically true that it is valuable or disvaluable.

David: Yes. Well, one has to be very careful about using something like “analytically” because, yeah, someone might say that god is talking to them and that it is analytically true that these voices are the voices of god. Yeah, one needs to be careful not to smuggle in too much. It is indeed very mysterious. What could this hybrid descriptive-evaluative state of finding something valuable or disvaluable be? The intrinsic nature of the physical is very much an open question. I think there are good, powerful reasons for thinking that reality is exhaustively described by the equations of physics. The intrinsic nature of that stuff, the essence of the physical, the fire in the equations, is controversial. Physics itself is silent.

Lucas: Right. I guess here, you would describe yourself given these views as a moral realist or an objectivist.

David: Yes, yes.

Brian: Just to jump in before we get to me. Couldn’t you say that your view is still based on mind-dependence? At least based on the point that if somebody else were hooked up to you, that person would appreciate the badness of suffering, that’s still just dependent on that other mind’s judgment. Even if you have somebody who could mind meld with the whole universe and experience all suffering at once, that would still depend on that mind. That mind is judging it to be a bad thing. Isn’t it still mind-dependent ultimately?

David: Mind-dependent, but I would say that minds are features of the physical world. Obviously one can argue for some kind of dualism, but I’m a monistic physicalist; at least that’s my working assumption.

Brian: I think objective moral value usually … the definition is usually that it’s not mind-dependent. Although, maybe it just depends what definition we’re using.

David: Yes. It’s rather like physicalism, which is often used as a stylistic variant of materialism. One can be a non-materialist physicalist and an idealist. As I said, minds are objective features of the physical world. I mean, I at least tentatively take seriously the idea that our experience discloses the intrinsic nature of the physical. This is obviously a controversial opinion. It’s associated with someone like Galen Strawson or, more recently, Phil Goff, but it stretches back via Grover Maxwell and Russell, ultimately to Schopenhauer. A much more conventional view, of course, would be that the intrinsic nature of the physical, the fire in the equations, is non-experiential. Then, at some time during the late pre-Cambrian, something happened: not just an organizational but an ontological eruption of first person experience into the fabric of the world.

Lucas: Just to echo what Brian was saying. The traditional objectivist or moral realist view draws on the way in which science is the project of interrogating third person facts, like what is simply true about the world regardless of what we think about it. In some ways, I think that traditionally the moral realist view is that if morality deals with objective facts, then these facts are third person, objectively true, and can be discovered through the methods and tools of ethics, in the same way that someone who is a mathematical realist would say that one does not invent certain geometric objects; rather, one discovers them through the application of mathematical reasoning and logic.

David: Yes. I think it’s very tempting to think of first person facts as having some kind of second-rate ontological status, but as far as I’m concerned, first person facts are real. If someone is in agony or experiencing redness, these are objective facts about the physical world.

Lucas: Brian, would you just like to jump in with the metaethics behind your own view that you discussed earlier?

Brian: Sure. On cognitivism versus noncognitivism, I don’t have strong opinions because I think some of the debate is just about how people use language, which is not a metaphysically fundamental issue. It’s just however humans happen to use language. I think the answer to the cognitivism versus noncognitivism question, if I had to say something, would be that it’s probably messy. Humans do talk about moral statements the way they talk about other factual statements. We use reasoning and we care about maintaining logical consistency among sets of moral statements. We treat them as regular factual statements in that regard. There may also be a sense in which moral statements do strongly express certain emotions. I think probably most people don’t really think about it too much.

It’s like people know what they mean when they use moral statements and they don’t have a strong theory of exactly how to describe what they mean. One analogy that you could use is that moral statements are like swear words. They’re used to make people feel more strongly about something or to express how strongly you feel about something. People think that they don’t just refer to one’s emotions, even at a subjective level. If you say, “My moral view is that suffering is bad,” that feels different than saying “I like ice cream” because there’s a deeper, more spiritual or more fundamental sensation that comes along with the moral statements that doesn’t come along with the “I like ice cream” statements.

I think metaphysically, that doesn’t reflect anything fundamental. It just means that we feel differently about moral statements and thoughts than about nonmoral ones. Subjectively, it feels different. Yeah. I think most people just feel that difference, and then exactly how you cash out whether that’s cognitive or noncognitive is a semantic dispute. My metaphysical position is anti-realism. I think that moral statements are mind-dependent. They reflect ultimately our own preferences, even if they may be very spiritual and deep, fundamental preferences. I think Occam’s Razor favors this view because it would add complexity to the world for there to be independent moral truths. I’m not even sure what that would mean. For similar reasons, I reject mathematical truths and anything non-physicalist. I think moral truths, mathematical truths and so on can all be thought of as fictional constructions that we make. We can reason within these fictional universes of ethics and mathematics that we construct using physical thought processes. That’s my basic metaphysical stance.

Lucas: Just stepping back to the cognitivism and noncognitivism issue, I guess I was specifically interested in yourself. When you were expressing your own moral view earlier, did you find that it’s simply a mixture of expressing your own emotions and also, trying to express truth claims or given your anti-realism, do you think that you’re simply only expressing emotions when you’re conveying your moral view?

Brian: I think of myself very much as an emotivist. It’s very clear to me that what I’m doing when I do ethics is what the emotivists say people are doing. Yes, since I don’t believe in moral truth, it would not make sense for me to be gesturing at moral truths, except maybe in so far as my low-level brain wiring intuitively thinks in those terms.

David: Just to add to this: although it is possible to imagine, say, something like spectrum inversion or color inversion, or some people who like ice cream and some people who hate ice cream, one thing it isn’t possible to do is imagine a civilization with an inverted pleasure-pain axis. It seems to just be a basic fact about the world that unbearable agony and despair are experienced as disvaluable, and even cases that might appear to contradict this, like, say, that of the masochist, in fact merely confirm the claim because, yeah, I mean, the masochist enjoys the intensely rewarding release of endogenous opioids when undergoing activities that might otherwise be humiliating or painful.

Lucas: Right. David, it seems you’re making a claim about there being a perfect convergence in the space of all possible minds among the pleasure-pain axis having the same sort of function. I guess I’m potentially just missing the gap or pointing out the gap between that and I guess your cognitivist objectivism?

David: It seems to be built into the nature of, let’s say, agony or despair itself that it is disvaluable. It’s not, “I’m in agony. Is this disvaluable or not?” It’s not an open question, whereas with anything else, however abhorrent you or I might regard it, one can still treat it as an open question and ask, is child abuse or slavery really disvaluable? In the case of agony, it’s built into the nature of the experience itself.

Lucas: I can get behind that. I think that sometimes when I’m feeling less nihilistic about morality, I am committed to that view. Just to push back a little bit here: in the space of all possible minds, I think I can imagine a mind which has a moral judgment and commitment to the maximization of suffering within itself and within the world. It’s perfect in that sense. It’s perfect at maximizing suffering for itself and the world, and its judgment and moral epistemology are very brittle, such that it will never change or deviate from this. How would you deal with something like that?

David: Is it possible? I mean, one can certainly imagine a culture in which displays of machismo and the ability to cope with great suffering are highly valued and would be conspicuously displayed. This would be fitness-enhancing, but nonetheless it doesn’t really challenge the sovereignty of the pleasure-pain axis as the axis of value and disvalue. Yeah, I would struggle to conceive of some kind of intelligence that values its own despair or agony.

Brian: From my perspective, I agree with what Lucas is saying, depending on how you define things. One definition of suffering could be that part of the definition is the desire to avoid it. From that perspective, you could say it’s not possible for an agent to seek something that it avoids. I think you could have systems where there are different parts in conflict, so you could have a hedonic assessment system that outputs a signal that this is suffering, but then another system chooses to favor the suffering. Humans even have something like this when we override our own suffering. We might have hedonic systems that say going out in the cold is painful, but then we have other systems or other signals that override that avoidance response and cause us to go out in the cold anyway for the sake of something else. You could imagine the wiring being such that it wasn’t just enduring pain for some greater good, but the motivational system was actively seeking to cause the hedonic system more experiences of pain. It’s just that that would be highly nonadaptive, so we don’t see it anywhere in nature.

David: I would agree with what Brian says there. Yes, very much so.

Lucas: Okay. Given these views that you guys have expressed, and as we start to get a better sense of them, another branch of metaethics that we might explore, to see how it fits in with your theories, is justification theories. These are attempts at understanding moral epistemology and the motivation for acting in accordance with morality. They attempt to answer the questions: how are moral judgments to be supported or defended, and, if possible, how does one make moral progress? This again will include moral epistemology. In terms of AI and value alignment, whether one is an anti-realist as Brian is or an objectivist as David is completely changes the path forward towards AI alignment and value alignment. If we are realists as David is, then a sufficiently robust and correct moral epistemology in an AI system could essentially realize the hedonistic imperative as David sees it, where you would just have an optimization process extending out from planet Earth, maximizing the objectively good hedonic states in all possible sentient beings. I guess it’s a little unclear to me how this fits in with David’s theory or how David’s theory would be implemented.

David: There is a real problem with any theory of value that makes sovereign either the minimization of suffering or classical utilitarianism. Both Buddhism and negative utilitarianism appear to have this apocalyptic implication: if our overriding responsibility is to minimize suffering, isn’t the cleanest, quickest, most efficient way to eliminate suffering to sterilize the planet, which is now technically feasible? Though one can in theory imagine cosmic rescue missions if there is sentience elsewhere, there is apparently this not-so-disguised apocalyptic implication. When Buddha says, allegedly or hopefully, “I teach one thing and one thing only: suffering and the relief of suffering,” or the end of suffering, yeah, in his day, there was no way to destroy the world. Today, there is.

Much less discussed, indeed I haven’t seen it adequately discussed, or discussed at all, in the scholarly literature, is a disguised implication of a classical utilitarian ethic that gives this symmetry to pleasure and pain: that we ought to be launching something like a utilitronium shockwave, where utilitronium is matter and energy optimized for pure bliss. The shockwave alludes to its velocity of propagation. Humans are perhaps extremely unlikely, even if and when we’re in a position to do so, to launch a utilitronium shockwave. But if one imagines a notional artificial superintelligence with a utility function of classical utilitarianism, why wouldn’t that superintelligence launch a utilitronium shockwave that maximizes the cosmic abundance of positive value within our cosmological horizon?

Personally, I would imagine a future of gradients of intelligent bliss. I think it is in fact sociologically highly likely that post-human civilization will have a hedonic range above ours, which very crudely and schematically is minus 10 to zero to plus 10. I can imagine a future civilization of, let’s say, plus 70 to plus 100 or plus 90 to plus 100. From the perspective of classical utilitarianism, and classical utilitarianism, or some kind of watered-down version at least, is arguably the dominant secular ethic in academia and elsewhere, that kind of civilization is suboptimal. It’s not moral; one apparently has this obligation to launch this kind of cosmic orgasm, so to speak.

Lucas: Right. I mean, just pushing back a little bit on the first thing that you said there about the very negative scenario, which I think people tend to see as an implication of a suffering-reduction-focused ethic, where there can’t be any suffering if there are no sentient beings. That to me isn’t very plausible because it discounts the possibility of future wellbeing. I take the view that we actually do have a moral responsibility to create more happy beings, and I view there as being a symmetry between wellbeing and suffering. I don’t have a particularly suffering-focused ethic where I think there’s an asymmetry such that we should alleviate suffering prior to maximizing wellbeing. I guess, David, maybe you can just unpack a little bit, before we jump into these justification theories, whether or not you view there as being an asymmetry between suffering and wellbeing.

David: I think there’s an asymmetry. There’s this fable, Ursula Le Guin’s short story The Ones Who Walk Away from Omelas. We’re invited to imagine this city of delights, a vast city of incredible, wonderful pleasures, but the existence of Omelas, this city of delights, depends on the torment and abuse of a single child. The question is: would you walk away from Omelas, and what does walking away from Omelas entail? Now, personally I am someone who would walk away from Omelas. The world does not have an off switch, an off button, and I think whether one is a Buddhist or a negative utilitarian, or someone who believes in suffering-focused ethics, rather than considering these theoretical apocalyptic scenarios it is more fruitful to work with secular and religious life lovers to phase out the biology of suffering in favor of gradients of intelligent wellbeing, because one of the advantages of hedonic recalibration, i.e. ratcheting up hedonic set points, is that it doesn’t ask people to give up their existing values and preferences, with complications.

To give a convenient, if rather trivial, example: imagine 100 people supporting 100 different football teams. There’s simply no way to reconcile their conflicting preferences, but what one can do, if one ratchets up everyone’s hedonic set point, is improve quality of life. Focusing on ratcheting up hedonic set points rather than trying to reconcile the irreconcilable is, I think, the potential way forward.

Brian: There are a lot of different points to comment on. I agree with David that negative utilitarians should not aim for world destruction, for several reasons. One is that it would make people turn against the cause of suffering reduction. It’s important to have other people not regard that cause as something to be appalled by. For example, animal rights terrorists plausibly give the animal rights movement a pretty bad name and may set back the cause of animal rights by doing that. Negative utilitarians would almost certainly not succeed anyway, so the most likely outcome is that they hurt their own cause.

As far as David’s suggestion of improving wellbeing to reduce disagreements among competing football teams, I think that would potentially help. Giving people greater wealth and equality in society can reduce some tensions. I think there will always be some insatiable appetites, especially from moral theories. For example, classical utilitarianism has an insatiable appetite for computational resources. Egoists and other moral agents may have their own insatiable appetites. We see that in the case of humans trying to acquire wealth beyond what is necessary for their own happiness. I think there will always be those agents who want to acquire as many resources as possible. The power maximizers will tend to acquire power. I think we still have additional issues of coordination and of social science being used to control the thirst for power among certain segments of society.

Lucas: Sorry. Just to get this clear. It sounds like you guys are both committed to different forms of hedonic consequentialism. You’re bringing up preferences and other sorts of things. Is there room for ultimate metaphysical value of preferences within your ethics? Or are preferences simply epistemically and functionally useful indicators of what will often lead to positive hedonics in agents within your ethical theories?

Brian: Personally, I care to some degree about both preferences and hedonic wellbeing. Currently, I care somewhat more about hedonic wellbeing. From my metaethical standpoint, it’s ultimately my choice what I want to care about, and I happen to care a lot about hedonic suffering when I imagine it. From a different standpoint, you can argue that ultimately the golden rule, for example, commits you to caring about whatever it is that another organism cares about, whether that’s hedonic wellbeing or some arbitrary wish. A deathbed wish would be a good example of a preference that doesn’t have hedonic content to it: the question is whether you think it’s important to keep deathbed wishes even after a person has died, ignoring side effects in terms of later generations realizing that promises are not being kept.

I think even ignoring those side effects, a deathbed wish does have some moral importance, based on the idea that if I had a deathbed wish, I would strongly want it to be carried out. If you are acting the way you want others to treat you, then you should care to some degree about other people’s deathbed wishes. Since I’m more emotionally compelled by extreme hedonic pain, that’s what I give the most weight to.

Lucas: What would your view be of an AI or machine intelligence which has a very strong preference, whatever that computational architecture might look like, that a bit be flipped one way rather than another? It just keeps flipping a bit back and forth, and then you would have a preference utilitronium shockwave going out into the world. It seems intuitive to me, and I guess this example does the work for me, that we only care about preferences in so far as they have hedonic effects. I’ll bite the bullet on the deathbed wish thing: I think that, ignoring side effects, if someone wishes for something and then they die, we don’t need to actually carry it out if we don’t think it will maximize hedonic wellbeing.

Brian: Ignoring the side effects. There are probably good hedonistic reasons to fulfill deathbed wishes, so that current people will not be afraid that their wishes won’t be kept. As far as the bit flipping, I think a bit-flipping agent’s preference does have moral significance, but I weigh organisms in proportion to the sophistication of their minds. I care more about a single human than a single ant, for example, because a human has more sophisticated cognitive machinery. It can have more kinds of thoughts about its own mental states. When a human has a preference, there’s more stuff going on within its brain to back that up, so to speak. A very simple computer program that has a very simple preference to flip a bit doesn’t matter very much to me because there’s not a lot of substance behind that preference. You could think of it as an extremely simple mind.

Lucas: What if it’s a super intelligence that wants to keep flipping bits?

Brian: In that case, I would give it significant weight because it has so much substance in its mind. It probably has lots of internal processes that are reflecting on its own welfare, so to speak. Yeah, if it’s a very sophisticated mind, I would give that significant weight. It might not override the preferences of seven billion humans combined. I tend to give less than linear weight to larger brains. As the size of the brain increases, I don’t scale the moral weight of the organism exactly linearly. That would also reduce the utility monster conclusion.

Lucas: Given Brian’s metaethics being an anti-realist and viewing him as an emotivist, I guess the reasons or arguments that you could provide against this view would only be, they don’t refer back to any metaphysical objective, anything really. David, wouldn’t you say that in the end, it would just be your personal emotional choice whether or not to find something compelling here.

David: It’s to do with the nature of first person facts. What is it that the equations of physics ultimately describe? If you think subjectivity, or at least take seriously the conjecture that subjectivity, is the essence of the physical, the fire in the equations, then yeah, it’s just objectively the case that first person agony is disvaluable. Here we get into some very controversial issues. I would just like to go back to one thing Brian was saying about sophistication. I don’t think it’s plausible that, let’s say, a pilot whale is more cognitively sophisticated than humans, but a pilot whale has a substantially larger brain, substantially larger neocortex, and substantially larger pain and pleasure centers, so it’s very much an open question whether the intensity of experience undergone by a pilot whale may be greater than that of humans. Therefore, other things being equal, I would say that the profoundly aversive states undergone by the whale matter more than a human’s. It’s not the level of sophistication or complexity that counts.

Lucas: Do you want to unpack a little bit your view about the hedonics versus the preferences, and whether or not preferences have any weight in your view?

David: Only indirect weight, in that ultimately, yeah, as I said, I think what matters is the pleasure-pain axis, and preferences only matter in so far as they impact that. Thanks to natural selection, we have countless millions and billions of preferences that are being manufactured all the time. As social primates, countless preferences conflict with each other. There is simply no way to reconcile a lot of them, whereas one can continue to enrich and enhance wellbeing. So, yeah, sure, other things being equal, satisfy people’s preferences. But in so many contexts it is logically impossible to do so, from politics and the Middle East to interpersonal relationships and people’s desire to be the world-famous this, that or the other. It is logically impossible to satisfy a vast number of preferences.

Lucas: I think it would be interesting and useful to dive into, within justification theories, like moral epistemology and ethical motivation. I think I want to turn to Brian now. Brian, I’m so curious to know if it’s possible given your view of anti-realism and suffering focused ethics, whether or not you can make moral progress or what it means to make moral progress. How does one navigate the realm of moral issues in your view, given the metaethics that you hold? Why ought I or others, or why not ought I or others to follow your ethics or not?

Brian: Moral progress, I think, can be thought of this way: many people have a desire to improve their own moral views using standards of improvement that they choose. For example, a common standard would be that I will generally defer now to the moral views that I will hold after learning more, as the better ones. There might be some exceptions, especially if you get too much into some subject area that distorts your thinking relative to the way it was before. Basically, you can think of brain state changes as either being approved of or not approved of by the current state. Moral progress would consist of making updates to your brain that you approve of, like installing updates to a computer that you choose to install.

That’s what moral progress would be. Basically, you designate which changes you want to happen, and then, if those happen according to the rules, it’s progress relative to what your current state thought. You can have failures of goal preservation. The example that Eliezer Yudkowsky gives is that if you offer Gandhi a pill that would make him want to kill people, he should not take it because that would change his goals in a way that his current goals don’t approve of. That would be moral anti-progress relative to Gandhi’s current goals. Yeah, that’s how I would think of it. Different people have different preferences about how much of what you can call preference idealization they want.

Preference idealization is the idea of imagining what preferences you would hold if you knew more, were smarter, had more experiences, and so on. Different people could want different amounts of preference idealization. There are some people who say, “I have almost no idea what I currently value and I want to defer that to an artificial intelligence to help me figure that out.” In my case, it’s very clear to me that extreme suffering is what I want to continue to value, and if I changed from that stance, that would be a failure of goal preservation relative to my current values. There are still questions on which I do have significant uncertainty, in the sense that I would defer to my future self.

For example, the question of how to weigh different brain complexities against each other is something where I still have significant uncertainty. The question of how much weight to give to what are called higher-order theories of consciousness versus first-order theories, basically how much you think that high-level thoughts are an important component of what consciousness is, is an issue where I have significant moral uncertainty. There are issues where I want to learn more, think more about it, and have more other people think about it before I make up my mind fully on what I think about that. Then, why should you hold my moral view? The real answer is because I want you to, and I’ll try to come up with arguments to make it sound more convincing to you.

David: I find subjectivism troubling. Say my football team is Manchester United. I wouldn’t take a pill that induced me to support Manchester City because that would subvert my values in some sense. Nonetheless, ultimately, support for Manchester United is arbitrary. Is support for the reduction of suffering merely akin to one’s support for, let’s say, Manchester United?

Brian: I think metaphysically, they’re the same. It feels very different. There’s more of a spiritual sense, like your whole being is behind the reduction of suffering in a way that’s not true for football teams. Ultimately, there’s no metaphysical difference.

David: Intentional objects ultimately are arbitrary, in that natural selection has shaped us to value certain intentional objects. This is philosophy jargon for the things we care about, whether it’s football or politics, or anything. Nonetheless, unlike these arbitrary intentional objects, it just seems to be built into the nature of agony or despair that they are disvaluable. It’s simply not possible to instantiate such states and find it an open question whether they’re disvaluable or not.

Brian: I don’t know if we want to debate now, but I think it is possible. I mean, we already have examples of one organism who finds the suffering of another organism to be positively valuable.

David: They are not mirror-touch synesthetes. They do not accurately perceive what is going on, and in so far as one does, either as a mirror-touch synesthete or by doing the equivalent of a Vulcan mind meld or something like that, one is not going to perceive the disvaluable as valuable. It’s an epistemological limitation.

Brian: My objection to that is it depends how you hook up the wires between the two minds. Like if you hook up one person’s suffering to another person’s suffering, then the second person will say it’s also bad. If you hook up one person’s suffering neurons to another person’s pleasure neurons, then the second person will say it’s good. It just depends how you hook up the wires.

David: It’s not all or nothing, but if one is, let’s say, a mirror-touch synesthete today and someone stubs their toe and you have an experience of pain, it’s simply not possible to take pleasure in their stubbing their toe. I think if one does have this notional god’s eye perspective, an impartial view from nowhere, one will act accordingly.

Brian: I disagree with that because I think you can always imagine just reversing the motivational wires so to speak. Just flip the wire that says this is bad. Flip it to saying this is good in terms of the agent’s motivation.

David: Right. Yes. I was trying to visualize what this would entail.

Brian: Even in a synesthete example, just imagine a brain where the same stimulus that currently, in normal humans, triggers negative emotional responses just has the neurons hooked up to the positive emotional responses instead.

David: Once again, wouldn’t this be an epistemological limitation rather than some deep metaphysical truth about the world?

Brian: Well, it depends how you define epistemology but you could be a psychopath where you correctly predict another organism’s behavior but you don’t care. You can have a difference between beliefs and motivations. The beliefs could correctly recognize this I think but the motivations could have the wires flipped such that there’s motivation to cause more of the suffering.

David: It’s just that I would say that the psychopath has an epistemological limitation in that the psychopath does not adequately take into account other perspectives. In that sense, the psychopath lacks an adequate theory of mind. The psychopath is privileging one particular here and now over other here and nows, which is not metaphysically sustainable.

Brian: It might be a definitional dispute like whether you can consider having proper motivation to be part of epistemological accuracy or not. It seems that you’re saying if you’re not properly motivated to reduce … you don’t have proper epistemological access to it by definition.

David: Yes. One has to be extremely careful with using this term by definition. Yes. I would say that we are all to some degree sociopathic. One is quasi sociopathic to one’s future self, for example, insofar as one, let’s say, doesn’t prudently save but squanders money and stuff. We are far more psychopathic towards other sentient beings because one is failing fully to take into account their perspective. It’s a hardwired epistemological limitation. One thing I would very much agree with Brian on is moral uncertainty and being prepared to reflect and take into account other perspectives and allow for the possibility one can be wrong. It’s not always possible to have the luxury of moral reflection and uncertainty.

If a kid is drowning, hopefully one dashes into the water to save the kid. Is this the right thing to do? Well, what happens if the kid, and this is a real story, happens to be a toddler who grows up to be Adolf Hitler and plunges the world into war? One doesn’t know the long term consequences of one’s action. Wherever possible, yes, one urges reflection and caution. In the context of a discussion or debate, one isn’t always qualifying one’s uncertainty and agnosticism carefully, but in a more deliberative context perhaps one should certainly do so.

Lucas: Let’s just bring it a little bit back to the ethical epistemology and ethical motivation behind your hedonistic imperative given your objectivism. I guess here, it’d also be interesting if you could explore key metaphysical uncertainties and physical uncertainties, and how we might go about learning more about the universe such that your view would be further informed.

David: Happy to launch into a long spiel about my view. One thing I think it really is worth stressing is that one doesn’t need to buy into any form of utilitarianism or suffering-focused ethics to believe that we can and should phase out the biology of involuntary suffering. It’s common to all manner of secular and religious views that we should be, other things being equal, minimizing suffering, reducing unnecessary suffering, and this is one thing that technology, biotechnology, allows us to do: support for something like universal access to preimplantation genetic screening, phasing out factory farming and shutting slaughterhouses, going on to essentially reprogram the biosphere.

It doesn’t involve a commitment to one specific ethical or meta-ethical view. For something like pain-free surgery, anesthesia, you don’t need to sign up for one to recognize it’s a good thing. I suppose my interest is very much in building bridges with other ethical traditions. Yeah, I am happy to go into some of my own personal views but I just don’t want to tie this idea that we can use bio-tech to get rid of suffering into anything quirky or idiosyncratic to me. I have a fair number of idiosyncratic views.

Lucas: It would be interesting if you’d explain whether or not you think that super intelligences or AGI will necessarily converge on what you view to be objective morality or if that is ultimately down to AI researchers to be very mindful of implementing.

David: I think there is a real risk here when one starts speaking as though posthuman super intelligence is going to end up endorsing a version of one’s own views and values, which a priori, if one thinks about it, is extremely unlikely. I think too one needs to ask, yeah, when I was talking about post human super intelligence: if post human super intelligence is our biological descendants, I think post human super intelligence will have a recognizable descendant of the pleasure-pain axis. I think it will be ratcheted up so that, say, experience below hedonic zero is impossible.

In that sense, I do see a convergence. By contrast, if one has a conception of post human super intelligence such that post human super intelligence may not be sentient, may not be experiential at all then, there is no guarantee that such a regime would be friendly to anything recognizably human in its values.

Lucas: The crux here is that there are different ways of doing value alignment and one such way is descriptively, through a super intelligence being able to gain enough information about the set of all values that human beings have and, say, aligning to those or to some fraction of those or to some idealized version of those through something like a coherent extrapolated volition. Another one is where we embed a moral epistemology within the machine system, so that the machine becomes an ethical reasoner, almost a moral philosopher in its own right. It seems that given your objectivist ethics, with that moral epistemology it would be able to converge on what is true. Do these different paths forward make sense to you? It also seems that the role of mind melding is very crucial and core to the realization of the correct ethics in your view.

David: With some people, their hearts sink when the topic of machine consciousness crops up because they know it’s going to be a long inconclusive philosophical discussion with a shortage of any real empirical tests. Yeah, I will just state: I do not think a classical digital computer is capable of phenomenal binding, therefore it will not understand the nature of consciousness or pleasure and pain, and I see the notion of value and disvalue as bound up with the pleasure-pain axis. In that sense, I think what we’re calling machine artificial general intelligence is, in one sense, invincibly ignorant. I know a lot of people would disagree with this description, but if you think that humans, or at least some humans, spend a lot of their time thinking about, talking about, exploring consciousness in all its varieties, in some cases exploring psychedelia, what are we doing? There is a vast range of cognitive domains that are completely cognitively inaccessible to digital computers.

Lucas: Putting aside the issue of machine consciousness, it seems that being able to first-person access hedonic states provides an extremely foundational and core motivational or at least epistemological role in your ethics, David.

David: Yes. I mean part of intelligence involves being able to distinguish the important from the trivial, which ultimately as far as I can see boils down to the pleasure-pain axis. Digital zombies have no conception of what is important or what is trivial I would say.

Lucas: Why would that be, if a true zombie in the David Chalmers sense is functionally isomorphic to a human? Presumably that zombie would properly care about suffering because all of its functional behavior is the same. Do you think in the real world, digital computers can’t do the same functional computation that a human brain does?

David: None of us have the slightest idea how one would set about programming a computer to do the kinds of things that humans are doing when they talk about and discuss consciousness when they take psychedelics or discuss the nature of the self. I’m not saying work arounds are impossible. I just don’t think they’re spontaneously going to happen.

Brian: I agree. Just like building intelligence itself, it requires a lot of engineering to create those features of humanlike psychology.

Lucas: I don’t see why it would be physically or technically impossible to instantiate an emulation of that architecture or an architecture that’s basically identical to it in a machine system. I don’t understand why computer architecture, computer substrate is really so different from biological architecture or substrate such that it’s impossible for this case.

David: It’s whether one feels the force of the binding problem or not. The example one can give: imagine the population of the USA are skull bound minds, and imagine them implementing any kind of computation you like, with ultra fast electromagnetic signaling, far faster than the electrochemical signaling of the CNS as normally conceived. Nonetheless, short of a breakdown of monistic physicalism, there is simply no way that the population of the USA is spontaneously going to become a subject of experience able to apprehend perceptual objects. Essentially, all you have is a micro experiential zombie. The question is why are 86 billion odd membrane bound supposedly classical neurons any different?

Why aren’t we micro experiential zombies? One way to appreciate, I think, the force, the adaptive role of phenomenal binding is to look at syndromes where binding even partially breaks down, such as simultanagnosia, where the subject can only see one thing at once, or motion blindness, akinetopsia, where one can’t apprehend motion, or severe forms of schizophrenia where there is no longer any unitary self. Somehow right now, you instantiate a unitary world simulation populated by multiple phenomenally bound dynamical objects and this is tremendously fitness enhancing.

The question is how can a bunch of membrane-bound nerve cells, a pack of neurons, carry out what is classically impossible? I mean one can probe the CNS with temporally coarse-grained neuroscans… individual feature processors, edge detectors, motion detectors, color detectors. Apparently, there are no perceptual objects there. How is it that right now your mind/brain is capable of running this egocentric world simulation in almost real time? It’s an astonishing computational feat. I argue for a version of quantum mind but one needn’t buy into this to recognize that it’s a profound and unsolved problem. I mean why aren’t we like the population of the USA?

Lucas: Just to bring this back to the AI alignment problem, and putting aside issues in phenomenal binding and consciousness for a moment. Putting aside also the conception that super intelligence is likely to be some sort of biologic instantiation, imagine the more mainstream AI safety approach, the MIRI idea of there being simply a machine super intelligence. I think this elucidates a lot of the interdependencies and difficulties where one’s meta-ethical views are intertwined in the end with what is true about consciousness and computation. It seems that in your view, David, it is close to, or perhaps almost, impossible to actually do AI alignment or value alignment on machine super intelligence.

David: It is possible to do value alignment but I think the real worry is that if you take the MIRI scenario seriously, this recursively self-improving software that will somehow … This runaway intelligence. There’s no knowing where it may lead, but MIRI, as far as I know, have a very different conception of the nature of consciousness and value. I’m not aware that they tackle the binding problem. I just don’t see that unitary subjects of experience or values, or a pleasure-pain axis, are spontaneously going to emerge from software. It seems to involve some form of strong emergence.

Lucas: Right. I guess to tie this back and ground it a bit. It seems that the portion of your metaethics which is going to be informed by empirical facts about consciousness and minds in general is the view that access to the phenomenal pleasure-pain axis is essential: you view it as having an intrinsic goodness or wrongness to it because it is foundationally, physically, and objectively the pleasure-pain axis of the universe, the heat and the spark in the equation, I guess, as you say. Without access to that, ultimately, one will go awry in one’s ethics, given that phenomenal hedonic states are the core of value.

David: Yeah. In theory, an intelligent digital computer stroke robot could impartially pave the cosmos with either dolorium or hedonium without actually understanding the implications of what it was doing. Hedonium, or utilitronium, being matter and energy optimized for pure bliss. Dolorium being matter and energy optimized, for lack of a better word, for pure misery or despair. The system in question would not understand the implications of what it was doing. I know a lot of people do think that, well, sooner or later, classical digital computers, our machines, are going to wake up. I don’t think it’s going to happen. We’re not talking about hypothetical quantum computers next century and beyond, simply an expansion of today’s programmable digital computers. I think they’re zombies and will remain zombies.

Lucas: Fully autonomous agents which are very free and super intelligent in relation to us will in your view require a fundamental access to that which is valuable, which is phenomenal states, which is the phenomenal pleasure-pain axis. Without that, it’s missing its key epistemological ingredient. It will fail in value alignment.

David: Yes, yeah, yeah. It just simply does not understand the nature of the world. It’s rather like claiming that a system is intelligent but doesn’t understand the second law of thermodynamics. It’s not a full spectrum super intelligence.

Lucas: I guess my open question there would be then, whether or not it would be possible to not have access to fundamental hedonic states but still be something of a Bodhisattva with a robust moral epistemology that was heading in the right direction or what might be objective.

David: The system in question would not understand the implications of what it was doing.

Lucas: Right. It wouldn’t understand the implications but if it got set off in that direction and it was simply achieving the goal, then I think in some cases we might call that value aligned.

David: Yes. One can imagine … Sorry Brian. Do intervene when you’re ready but yeah, one could imagine for example being skeptical of the possibility of interstellar travel for biological humans but programming systems to go out across the cosmos, or at least within our cosmological horizons, and convert matter and energy into pure bliss. I mean one needn’t assume that this will apply to our little bubble of civilization, but what do we do about inert matter and energy elsewhere in the galaxy? One can leave it as it is or, if one is let’s say a classical utilitarian, one could convert it into pure bliss. Yeah, one can send out probes. One could restructure, reprogram matter and energy in that way.

That would be a kind of compromise solution in one sense. Keep complexity within our little tiny bubble of civilization but convert the rest of the accessible cosmos into pure bliss. Though that technically would not, strictly speaking, maximize the abundance of positive value in our Hubble volume, nonetheless it could come extraordinarily close to it from a classical utilitarian perspective.

Lucas: Brian, do you have anything to add here?

Brian: Well, I disagree on many, many points. I think digital computation is capable of functionally similar enough processing to what the brain does. Even if that weren’t the case, a paperclip maximizer with a very different architecture would still have a very sophisticated model of human emotions. Its motivations wouldn’t be hooked up to those emotions, but it would understand human pleasure and pain in every other sense of the word understand. Yeah, I see it more as a challenge of hooking up the motivation properly. As far as my thoughts on alignment in general based on my metaethics, I tend to agree with the default approach like the MIRI approach, which is unsurprising because MIRI is also anti-realist on metaethics. That approach sees the task as taking human values and somehow translating them into the AI, and that could be done in a variety of different ways: learning human values implicitly from certain examples, or with some combination of maybe top down programming of certain ethical axioms.

That comes down to exactly how you do alignment, and there are lots of approaches to that. The basic idea, that you need to specifically replicate the complexity of human values in machines and the complexity of the way humans reason, and that it won’t be there by default, is shared between my opinion and that of the mainstream AI alignment approach.

Lucas: Do you take a view then similar to that of coherent extrapolated volition?

Brian: In case anybody doesn’t know, coherent extrapolated volition is Eliezer Yudkowsky’s idea of giving the AI the meta … You could call it a metaethics. It’s a meta rule for learning values: take humanity and think about what humanity would want to want if it was smarter, knew more, had more positive interactions with each other and thought faster, and then try to identify points of convergence among the values of different idealized humans. In terms of theoretical things to aim for, I think CEV is one reasonable target for reasons of cooperation among other humans. I mean if I controlled the world, I would prefer to have the AI implement my own values rather than humanity’s values because I care more about my values. Some human values are truly abhorrent to me and others seem at least unimportant to me.

In terms of getting everybody together to not fight endlessly over the outcome of AI in this theoretical scenario, CEV would be a reasonable target to strive for. In practice, I think that’s unrealistic like a pure CEV is unrealistic because the world does not listen to moral philosophers to any significant degree. In practice, things are determined by politics, economic power, technological and military power, and forces like that. Those determine most of what happens in the world. I think we may see approximations to CEV that are much more crude like you could say that democracy is an approximation to CEV in the sense that different people with different values, at least in theory, discuss their differences and then, come up with a compromise outcome.

Something like democracy, maybe power-weighted democracy in which more powerful actors have more influence, will be what ends up happening. The philosopher’s dream of idealizing values to perfection is unfortunately not going to happen. We can push in directions that are slightly more reflective. We can push towards slightly more reflection, towards slightly more cooperation and things like that.

David: Couple of points. First, to use an example we touched on before: what would be coherent extrapolated volition for all the world’s football supporters? Essentially, there’s simply no way to reconcile all their preferences. One may say that if they were fully informed, football supporters wouldn’t waste their time passionately supporting one team or another, but essentially I’m not sure that the notion of coherent extrapolated volition there would make sense. Of course, there are more serious issues than football. The second thing, when it comes to the nature of value, regardless of one’s metaphysical stance on whether one’s a realist or an anti-realist about value: I think it is possible by biotechnology to create states that are empirically, subjectively far more valuable than anything physiologically feasible today.

Take Prince Myshkin in Dostoevsky’s The Idiot. Dostoevsky was a temporal lobe epileptic and he said, “I would give my whole life for this one instant.” Essentially, there are states of consciousness that are empirically super valuable and rather than attempting to reconcile irreconcilable preferences, I think you could say that insofar as we aspire to long term full spectrum super intelligence, perhaps we should be aiming to create these super valuable states. I’m not sure whether it’s really morally obligatory. As I said, my own focus is on the overriding importance of phasing out suffering, but for someone who does give some weight or equal weight to positive experiences, positively valuable experiences, there is a vast range of valuable experience that is completely inaccessible to humans that could be engineered via biotechnology.

Lucas: A core difference here is going to be that given Brian’s view of anti-realism, AI alignment or value alignment would in the end be left to those powers which he described in order to resolve irreconcilable preferences. That is if human preferences don’t converge strongly enough after enough time and information that there are no longer irreconcilable preferences, which I guess I would suppose is probably wrong.

Brian: Which is wrong?

Lucas: That it would be wrong that human beings’ preferences would converge strongly enough that there would no longer be irreconcilable preferences after coherent extrapolated volition.

Brian: Okay, I agree.

Lucas: I’m saying that in the end, value alignment would be left up to economic forces, military forces, other forces to determine what comes out of value alignment. In David’s view, it would simply be down to whether we could get the epistemology right. If we could know enough about value and the pleasure-pain axis and the metaphysical status of phenomenal states, then value alignment would be to capitalize on that. I didn’t mean to interrupt you Brian. You want to jump in there?

Brian: I was going to say the same thing you did, that I agree with David that there would be irreconcilable differences and in fact, many different parameters of the CEV algorithm would probably affect the outcome. One example that you could give is that people tend to crystallize their moral values as they age. You could imagine somebody who was presented with utilitarianism as a young person would be more inclined toward that, whereas maybe if that person had been presented with deontology as a young person, the person would prefer deontology as he got older. So depending on seemingly arbitrary factors, such as the order in which you are presented with moral views or what else is going on in your life at the time that you confront a given moral view, or 100 other inputs, the output could be sensitive to that. CEV is really a class of algorithms. Depending on how you tune the parameters, you could get substantially different outcomes.

Yeah, CEV is an improvement even if there’s no obvious unique target. As I said, in practice, we won’t even get pure CEV but we’ll get some kind of very rough power-weighted approximation similar to our present world of democracy and competition among various interest groups for control.

Lucas: Just to explain how I’m feeling so far. I mean Brian, I’m very sympathetic to your view but I’m also very sympathetic to David’s view. I hover somewhere in between. I like this point that David made where he quoted Russell, something along the lines that one ought to be careful when discussing ethical metaphysics such that one is not simply trying to make one’s own views and preferences objective.

David: Yeah. Just in general, when one speaks about the nature of, for example, post human super intelligence, think of the way today that the very nature and notion of intelligence is a contested term. Simply sticking the word super in front of it: just how illuminating is it? When I read someone’s account of super intelligence, I’m really reading an account of what kind of person they are, their intellect and their values. I’m sure when I discuss the nature of full spectrum super intelligence, I can’t see the extent to which I’m simply articulating my own limitations.

Lucas: I guess for me here, to get all my partialities out of the way, I hope that objectivism is true because I think that it makes value alignment way less messy. In the end, we could have something actually good and beautiful, which, I don’t know, is some preference that I have that might be objective or just simply wrong, or confused. The descriptive picture that I think Brian is committed to, which gives rise to the MIRI and Tomasik form of anti-realism, is just one where in the beginning, there was entropy and noise and many generations of stars fusing atoms into heavier elements. One day one of these disks turned into a planet and a sun shone some light on the planet, and the planet began to produce people. There’s an optimization process there in the end, which simply seems to be ultimately driven by entropy, and morality seems to simply be a part of this optimization process, which just works to facilitate and mediate the relations between angry mean primates like ourselves.

Brian: I would point out there’s also a lot of spandrel to morality in my opinion, especially these days now that we’re not heavily optimized by biological pressures. This whole conversation that we’re having right now is a spandrel in the sense that it’s just an outgrowth of certain abilities that we evolved but it’s not at all adaptive in any direct sense.

Lucas: Right. In this view, it really just seems like morality and suffering, and all of this, is just a byproduct of the screaming entropy and noise of whatever led to this universe. At the same time, and I think this is the part the people who are committed to MIRI anti-realism, and I guess just relativism and skepticism about ethics in general, maybe are not tapping into enough, this objective process is producing a very real and objective phenomenal self and story, which is caught up in suffering, where suffering is really suffering and it really sucks to suffer. It all seems at face value true in that moment throughout the suffering that this is real. The suffering is real. The suffering is bad. It’s pretty horrible.

This bliss is something that I would never give up, or if the rest of the universe were this bliss, that would just be the most amazing thing ever. There is this very subjective, phenomenal, just experiential thing that the universe produces, the subjective phenomenal story and narrative that we live. It seems there’s just this huge tension between anti-realism and, I think, the clear suffering of suffering and just being a human being.

Brian: I’m not sure if there’s a tension because the anti-realist agrees that humans experience suffering as meaningful and they experience it as the most important thing imaginable. There’s not really a tension and you can explore why humans quest for objectivity. There seems to be a certain glow that attaches to things by saying that they’re objectively moral. That’s just a weird quirk of human brains. I would say that ultimately, we can choose to care about what we care about whether it’s subjective or not. I often say even if objective truth exists, I don’t necessarily care what it says because I care about what I care about. It could turn out that objective truth orders you to torture squirrels. If it does, then I’m not going to follow the objective truth. On reflection, I’m not unsatisfied at all with anti-realism because what more could you want than what you want.

Lucas: David, feel free to jump in if you’d like.

David: Well, there it’s just … there’s this temptation to oscillate between two senses of the word subjective: subjective in the sense of being neither true nor false, and subjective in the sense of first-person experience. My being in agony or your being in agony or someone being in despair is, as I said, as much an objective property of reality as the rest mass of the electron. I mean what we can be doing is working in such ways as to increase, in theory to maximize, the amount of subjective value in the world regardless of whether or not one believes that this has any transcendent significance, with the proviso here that there is a risk that if one aims, strictly speaking, to maximize subjective value, one gets the utilitronium shockwave. If one is, as I personally advocate, aiming for a civilization of super intelligent bliss, one is not asking people to give up their core values and preferences unless one of those core values and preferences is to keep hedonic set points unchanged. That’s not very intellectually satisfying but it’s … this idea if one is working towards some kind of consensus, compromise.

Lucas: I think now I want to get into a bit more about ethical uncertainty and specifically meta-ethical uncertainty. I think that just given the kinds of people that we are, even if we disagree about realism versus anti-realism or ascribe different probabilities to each view, we might pretty strongly converge on how we ought to do value alignment given the kinds of moral considerations that we have. I’m just curious to explore a little bit more about what you guys are most uncertain about. What would it take to change your mind? What new information would you be looking for that might challenge or make you revise your metaethical view? How might we want to proceed with AI alignment given our metaethical uncertainty?

Brian: Can you do those one by one?

Lucas: Yeah, for sure. If I can remember everything I just said. First to start off, what are you guys most uncertain about within your meta-ethical theories?

Brian: I’m not very uncertain meta-ethically. I can’t actually think of what would convince me to change my metaethics because as I said, even if it turned out that metaphysically moral truth was a thing out there in some way, whatever that would mean, I wouldn’t care about it except for like instrumental reasons. For example, if it was a god, then you’d have to instrumentally care about god punishing you or something, but in terms of what I actually care about, it would not be connected to moral truth. Yeah, it would have to be some sort of revision of the way I conceive of my own values. I’m not sure what it would look like to be meta-ethically uncertain.

Lucas: There’s a branch of metaethics, which has to tackle this issue of meta-ethical commitment or moral commitment to meta-ethical views. If some sort of meta-ethical thing is true, why ought I to follow what is metaethically true? In your view Brian, it is just simply why ought you not to follow or why ought it not matter for you to follow what is meta-ethically true if there ends up being objective moral facts.

Brian: The squirrel example is a good illustration: if ethics turned out to be “you must torture as many squirrels as possible,” then screw moral truth. I don’t see what this abstract metaphysical thing has to do with what I care about myself. Basically, my ethics comes from empathy, seeing others in pain, wanting that to stop. Unless moral truth somehow gives insight about that, like maybe moral truth is somehow based on that kind of empathy in a sophisticated way, then it would be like another person giving me thoughts on morality. The metaphysical nature of it would be irrelevant. It would only be useful insofar as it would appeal to my own emotions and sense of what morality should be for me.

David: If I might interject. Undercutting my position on negative utilitarianism and suffering-focused ethics, I think it quite likely that a posthuman superintelligence, an advanced civilization with a hedonic range ratcheted right up to 70 to 100 or something like that, would look back on anyone articulating the kind of view that I am, that anyone who believes in suffering-focused ethics does, and see it as some kind of depressive psychosis. One intuitively assumes that our successors will be wiser than we are, and perhaps, well, they will be in many ways. Yet in another sense, I think we should be aspiring to ignorance: once we have done absolutely everything in our power to minimize, mitigate, abolish and prevent suffering, I think we should forget it even existed. I hope that eventually any experience below hedonic zero will be literally inconceivable.

Lucas: Just to jump to you here David. What are your views about what you are most meta-ethically uncertain about?

David: It’s this worry that, however much one is pronouncing about the nature of reality or the future of intelligent life in the universe and so on, what one is really doing is some kind of disguised autobiography. Given that, as for quite a number of people, sadly, pain and suffering have loomed larger in my life than pleasure, I may be turning this into a deep metaphysical truth about the universe. This potentially undercuts my view. As I said, I think there are arguments against the symmetry view: suffering is self-intimatingly bad, whereas there is nothing self-intimatingly bad about being an insentient system or a system that is really content. Nonetheless, yeah, I take seriously the possibility that all I’m doing is expressing obliquely my own limitations of perspective.

Lucas: Given these uncertainties and the difficulty and expected impact of AI alignment, if we’re again committing ourselves to this MIRI view of an intelligence explosion with quickly recursive self-improving AI systems, how would you both, if you were the king of AI strategy, how would you go about allocating your metaethics and how would you go about working on the AI alignment problem and thinking about the strategy given your uncertainties and your views?

Brian: I should mention that my most probable scenario for AI is a slow take off in which lots of components of intelligence emerge piece by piece rather than a localized intelligence explosion. If it were a hard take off, a localized intelligence explosion, then, yeah, I think the diversity of approaches that people are considering is what I would do as well. It seems to me you have to somehow learn values, because in the same way that we’ve discovered that teaching machines by learning is more powerful than teaching them by hard-coding rules, you probably have to mostly learn values as well, although there might be hard coding mixed in. Yeah, I would just pursue a variety of approaches in the way that the current community is doing.

I support the fact that there is also a diversity of short term versus long term focus. Some people are working on concrete problems. Others are focusing on issues like decision theory and logical uncertainty and so on, because I think some of those foundational issues will be very important. For example, decision theory could make a huge difference to the AI’s effectiveness as well as issues of what happens in conflict situations. Yeah, I think a diversity of approaches is valuable. I don’t have specific advice on when I would recommend tweaking current approaches. I guess I expect that the concrete problems work will mostly be done automatically by industry, because those are the kinds of problems that you need to solve to make AI work at all. If anything, I might invest more in the kind of long-term approaches that practical applications are likely to ignore or at least put off until later.

David: Yes, because my background assumptions are different, it’s hard for me to deal with your question. If one believes that subjects of experience that could suffer could simply emerge at different levels of abstraction, I don’t really know how to tackle this, because this strikes me as a form of strong emergence. One of the reasons why philosophers don’t like strong emergence is that essentially all bets are off. Yeah, imagine if life hadn’t been reducible to molecular biology and hence, ultimately, to quantum chemistry and physics. Yeah, I’m probably not the best person to answer your question.

I think in terms of real moral focus, I would like to see essentially the molecular signature of unpleasant experience identified, and then essentially just making it completely off limits and biologically impossible for any sentient being to suffer. If one also believes that there are or could be subjects of experience that somehow emerge in classical digital computers, then, yeah, I’m floundering; my theory of mind and reality would be wrong.

Lucas: I think, touching on the paper that Kaj Sotala had written on suffering risks, that a lot of different value systems would also converge with you on your view, David. Whether or not we take the view of realism or anti-realism, I think that most people would agree with you. I think the issue comes about with, again, preference conflicts, where some people view suffering as really important because it teaches you things and/or it has some special metaphysical significance in relation to God. I think this might even be a widespread view in Catholicism. Within the anti-realism view, with Brian’s view, just dealing with varying preferences on whether or not we should be able to suffer is something I just don’t want to deal with.

Brian: Yeah, that illustrates what I was saying about how I prefer my values over the collective values of humanity. That’s one example.

David: I don’t think it would be disputed that sometimes suffering can teach lessons. The question is whether there are any lessons that couldn’t be functionally replaced by something else. There’s this idea that we can just offload the nasty side of life onto software. In the case of pain, nociception, one knows that, yeah, software systems can be programmed or trained up to avoid noxious stimuli without the nasty raw feels. Should we be doing the same thing for organic biological robots too? When it comes to the question of suffering, one can have quite fierce and lively disputes with someone who says that, yeah, they want to retain the capacity to suffer. This is very different from involuntary suffering. I think that quite often someone can see that no, they wouldn’t want to force another sentient being to suffer against their will. It should be a matter of choice.

Lucas: To tie this all into AI alignment again, really the point of this conversation is that, again, we’re doing ethics on a deadline. If you survey the top 100 AI safety researchers or AI researchers in the world, you’ll see that they give a probability distribution over the likelihood of human-level artificial intelligence, with about a 50% probability by 2050. This, many suspect, will have enormous implications for Earth-originating intelligent life and our cosmic endowment. The normative, descriptive, and applied ethical practices that we engage in are all embodiments and consequences of the sorts of meta-ethical views which we hold, which may not even be explicit. I think many people don’t really think about metaethics very much. I think that many AI researchers probably don’t think about metaethics very much.

The end towards which AI will be aimed will largely be a consequence of some aggregate of meta-ethical views and assumptions, or of the meta-ethical views and assumptions of a select few. I guess, Brian and David, just to tie this all together, what do you guys view as really the practicality of metaethics in general and in terms of technology and AI alignment?

Brian: As far as what you said about metaethics determining the outcome, I would say maybe the implicit metaethics will determine the outcome, but I think, as we discussed before, 90-some percent of the outcome will be determined by ordinary economic and political forces. Most people in politics in general don’t think about metaethics explicitly, but they still engage in the process and have a big impact on the outcome. I think the same will be true in AI alignment. People will push for things they want to push for, and that’ll mostly determine what happens. It’s possible that metaethics could inspire people to be more cooperative depending on how it’s framed. CEV as a practical metaethics could potentially inspire cooperation if it’s seen as an ideal to work towards, although the extent to which it can actually be achieved is questionable.

Sometimes, you might have a naïve view where a moral realist assumes that a super intelligent AI would necessarily converge to the moral truth, or at least that a super intelligent AI could identify the moral truth, and then maybe all you need to do is program the AI to care about the moral truth once it discovers it. Those particular naïve approaches, I think, would produce the wrong outcomes because there would be no moral truth to be found. I think it’s important to be wary of that assumption that a super intelligence will figure it out on its own and we don’t need to do the hard work of loading complex human values ourselves. It seems like the current AI alignment community largely recognizes that there’s a lot of hard work in loading values and it won’t just happen automatically.

David: In terms of metaethics, consider the nature of pain-free surgery, surgical anesthesia. When it was first introduced in the mid 19th century, it was controversial for about 15 years. There were powerful voices who spoke against it, but nonetheless a consensus emerged very rapidly, and we now almost all take anesthesia for granted for major surgery. It didn’t require a consensus on the nature of value and metaethics. It’s just that this is the obvious thing given our nature. Clearly, I would hope that eventually something similar will happen not just for physical pain but for psychological pain too. Just as we now take it for granted that it was the right thing to do to eradicate smallpox, no one is seriously suggesting that we bring smallpox back, and it doesn’t depend on consensus on metaethics.

As for experience below hedonic zero, whose precise molecular signature we’ll possibly be able to find, I hope that consensus will emerge that we should phase it out too. Sorry, this isn’t much in the way of practical guidance to today’s roboticists and AI researchers, but I suppose I’m just expressing my hope here.

Lucas: No, I think I share that. I think that we have to do ethics on a deadline, but I think that there are certain ethical things whose deadline is much longer or which don’t necessarily have a real concrete deadline. I like your example of the pain anesthesia drugs.

Brian: In my view, metaethics is mostly useful for people like us or other philosophers and effective altruists who can inform our own advocacy. We want to figure out what we care about and then, we go for it and push for that. Then, maybe to some extent, it may diffuse through society in certain ways but in the start, it’s just helping us figure out what we want to push for.

Lucas: There’s an extent to which the evolution of human civilization has also been an evolution of metaethical views, which are consciously or unconsciously being developed. Brian, your view is simply that 90% of what has causal efficacy over what happens in the end are going to be like military and economic and just like raw optimization forces that work on this planet.

Brian: Also, politics and memetic spandrels. For example, people talk about the rise of postmodernism as a replacement of metaethical realism with anti-realism in popular culture. I think that is a real development. One can question to what extent it matters. Maybe it’s correlated with things like a decline in religiosity, which matters more. I think that is one good example of how metaethics can actually go popular and mainstream.

Lucas: Right. Just to bring this back, in terms of the AI alignment problem, I try to, or at least I’d like to, be a bit more optimistic about how much causal efficacy each part of thinking has over the AI alignment problem. I tend not to think that 90% of it will in the end be due to rogue impersonal forces like you’re discussing. I think that everyone, no matter who you are, stands to gain from more metaethical thinking, whether you take realist or anti-realist views. The expression of your values, or whatever you think your values might be, whether they might be conventional or relative or arbitrary in your view, or whether they might relate to some objectivity, is much less likely to come about in a reasonable and good way without sufficient metaethical thinking and discussion.

David: One thing I would very much hope is that before, for example, radiating out across the cosmos, we would sort out our problems on Earth and in the solar system first. Regardless of whether one is secular or religious, or a classical or a negative utilitarian, let’s not start thinking about colonizing nearby solar systems or anything yet. Yeah, if one is an optimist, one may be thinking of opportunities forgone, but at least wait a few centuries. I think in a fundamental sense we do not understand the nature of reality, and not understanding the nature of reality comes with not understanding the nature of value and disvalue, or the experience of value and disvalue as Brian might put it.

Brian: Unfortunately, I’m more pessimistic than David. I think the forces of expansion will be hard to stop as they always have been historically. Nuclear weapons are something that almost everybody wishes hadn’t been developed and yet they were developed. Climate change is something that people would like to stop but it has a force of its own due to the difficulty of coordination. I think the same will be true for space colonization and AI development as well that we can make tweaks around the edges but the large trajectory will be determined by the runaway economic and technological situation that we find ourselves in.

David: I fear Brian may be right. I used to sometimes think about the possibilities of so-called cosmic rescue missions if the rare earth hypothesis is false and suffering Darwinian life exists within our cosmological horizon. I used to imagine this idea that we would radiate out and prevent suffering elsewhere. A, I suspect the rare earth hypothesis is true, but B, I suspect that even if suffering life forms do exist elsewhere within our Hubble volume, it’s probably more likely that humans or our successors would go out and just create more suffering. It’s a rather dark and pessimistic view. In my more optimistic moments I think we will phase out suffering altogether in the next few centuries, but these are guesses really.

Lucas: We’re ultimately dealing with AI as the most powerful optimization process, or the seed optimization process, to radiate out from Earth. I mean, we’re dealing with potential astronomical waste or astronomical value or astronomical disvalue. We can tie this again into moral uncertainty and start thinking about William MacAskill’s work on moral uncertainty, where we do what might be like expected value calculations with regards to our moral uncertainty, try to be very mathematical about it, and consider the amount of matter and energy that we are dealing with here given a superintelligent optimization process coming from Earth.

I think that tying this all together and considering it all should potentially play an important role in our AI strategy. I definitely feel very sympathetic to Brian’s view that in the end, it might all simply come down to these impersonal economic, political, militaristic, and memetic forces which exist. Given moral uncertainty, given meta-ethical uncertainty, and given the amount of matter and energy that is at stake, potentially some portion of AI strategy should go into circumventing those forces, or trying to get around them, or decreasing them and their effects and hold on AI alignment.
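[Editor’s sketch: the “expected value calculations with regards to our moral uncertainty” mentioned here can be illustrated with a few lines of Python. This is only a minimal sketch of the general idea of weighting each option’s value under each moral theory by one’s credence in that theory. The theory names, credences, and values below are hypothetical placeholders, not figures from the conversation or from MacAskill’s work, and the calculation assumes that values can be compared across theories on a common scale, which is itself a contested assumption.]

```python
from typing import Dict


def expected_choiceworthiness(
    credences: Dict[str, float],
    values: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Return the credence-weighted value of each option.

    credences: moral theory -> probability that the theory is correct.
    values: option -> (moral theory -> value of the option under that theory).
    """
    total = sum(credences.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("credences must sum to 1")
    return {
        option: sum(credences[theory] * per_theory[theory] for theory in credences)
        for option, per_theory in values.items()
    }


# Hypothetical inputs, chosen purely for illustration.
CREDENCES = {"classical_utilitarian": 0.75, "suffering_focused": 0.25}
VALUES = {
    "expand_now": {"classical_utilitarian": 10.0, "suffering_focused": -100.0},
    "wait_and_reflect": {"classical_utilitarian": 4.0, "suffering_focused": 0.0},
}

scores = expected_choiceworthiness(CREDENCES, VALUES)
best_option = max(scores, key=scores.get)
print(scores, best_option)
```

[With these made-up numbers, a minority credence in a theory that assigns a very large disvalue to one option is enough to change which option scores highest, which is the kind of effect that makes the stakes discussed here matter for strategy.]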

Brian: Yeah. I think it’s tweaks around the edges as I said unless these approaches become very mainstream but I think the prior probability that AI alignment of the type that you would hope for becomes worldwide is low because the prior probability that any given thing becomes worldwide mainstream is low. You can certainly influence local communities who share those ideals and they can try to influence things to the extent possible.

Lucas: Right. I mean, maybe something potentially more sinister is that it doesn’t need to become worldwide if there’s a singleton scenario, or if the power and control over the AI is held within a tiny organization or some smaller organization which has the power and autonomy to do this kind of thing.

Brian: Yeah, I guess I would again say the probability that you will influence those people would be low. Personally, I would imagine it would be either within a government or a large corporation. Maybe we have disproportionate impact on AI developers relative to the average human. Especially as AI becomes more powerful, I would expect more and more actors to try to have an influence. Our proportional influence would decline.

Lucas: Well, I feel very pessimistic after all this. Morality is not real and everything’s probably going to shit because economics and politics are going to drive it all in the end, huh?

David: It’s also possible that we’re heading for a glorious future of superhuman bliss beyond the bounds of everyday experience and that this is just the fag end of Darwinian life.

Lucas: All right. David, I think, as you say, one day we might have thoughts as beautiful as sunsets.

David: What a beautiful note to end on.

Lucas: I hope that one day we have thoughts as beautiful as sunsets and that suffering is a thing of the past whether that be objective or subjective within the context of an empty cold universe of just entropy. Great. Well, thank you so much Brian and David. Do you guys have any more questions or anything you’d like to say or any plugs, last minute things?

Brian: Yeah, I’m interested in promoting research on how you should tweak AI trajectories if you are foremost concerned about suffering. A lot of this work is being done by the Foundational Research Institute, which aims to avert s-risks especially as they are related to AI. I would encourage people interested in futurism to think about suffering scenarios in addition to extinction scenarios. Also, people who are interested in suffering-focused ethics to become more interested in futurism and thinking about how they can affect long-term trajectories.

David: Visit my websites urging the use of biotechnology to phase out suffering in favor of gradients of intelligent bliss for all sentient beings. I’d also like just to say yeah, thank you Lucas for this podcast and all the work that you’re doing.

Brian: Yeah, thanks for having us on.

Lucas: Yeah, thank you. Two Bodhisattvas if I’ve ever met them.

David: If only.

Lucas: Thanks so much guys.

If you enjoyed this podcast, please subscribe. Give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: AI Safety, Possible Minds, and Simulated Worlds with Roman Yampolskiy

What role does cyber security play in AI alignment and safety? What is AI completeness? What is the space of mind design and what does it tell us about AI safety? How does the possibility of machine qualia fit into this space? Can we leak proof the singularity to ensure we are able to test AGI? And what is computational complexity theory anyway?

AI Safety, Possible Minds, and Simulated Worlds is the third podcast in the new AI Alignment series, hosted by Lucas Perry. For those of you that are new, this series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Roman Yampolskiy, a Tenured Associate Professor in the department of Computer Engineering and Computer Science at the Speed School of Engineering, University of Louisville. Dr. Yampolskiy’s main areas of interest are AI Safety, Artificial Intelligence, Behavioral Biometrics, Cybersecurity, Digital Forensics, Games, Genetic Algorithms, and Pattern Recognition. He is an author of over 100 publications including multiple journal articles and books. 

Topics discussed in this episode include:

  • Cyber security applications to AI safety
  • Key concepts in Roman’s papers and books
  • Is AI alignment solvable?
  • The control problem
  • The ethics of and detecting qualia in machine intelligence
  • Machine ethics and its role or lack thereof in AI safety
  • Simulated worlds and if detecting base reality is possible
  • AI safety publicity strategy

In this interview we discuss ideas contained in upcoming and current work of Roman Yampolskiy. You can find them here: Artificial Intelligence Safety and Security and Artificial Superintelligence: A Futuristic Approach. You can find more of his work at his Google Scholar and/or university page, and follow him on his Facebook or Twitter. You can hear about this work in the podcast above or read the transcript below.

Lucas: Hey everyone, welcome back to the AI Alignment Podcast Series with the Future of Life Institute. I’m Lucas Perry and today, we’ll be speaking with Dr. Roman Yampolskiy. This is the third installment in this new AI Alignment Series. If you’re interested in inverse reinforcement learning or the possibility of astronomical future suffering being brought about by advanced AI systems, make sure to check out the first two podcasts in this series.

As always, if you find this podcast interesting or useful, make sure to subscribe or follow us on your preferred listening platform. Dr. Roman Yampolskiy is a tenured associate professor in the Department of Computer Science and Engineering at the Speed School of Engineering at the University of Louisville. He is the founding and current director of the Cybersecurity Lab and an author of many books including Artificial Superintelligence: A Futuristic Approach.

Dr. Yampolskiy’s main areas of interest are in AI safety, artificial intelligence, behavioral biometrics, cybersecurity, digital forensics, games, genetic algorithms and pattern recognition. Today, we cover key concepts in his papers and books surrounding AI safety, artificial intelligence, superintelligence and AGI, his approach to AI alignment, and how AI security fits into all this. We also explore our audience-submitted questions. This was a very enjoyable conversation and I hope you find it valuable. With that, I give you Dr. Roman Yampolskiy.

Thanks so much for coming on the podcast, Roman. It’s really a pleasure to have you here.

Roman: It’s my pleasure.

Lucas: I guess let’s jump into this. You can give us a little bit more information about your background, what you’re focusing on. Take us a little bit through the evolution of Roman Yampolskiy and the computer science and AI field.

Roman: Sure. I got my PhD in Computer Science and Engineering. My dissertation work was on behavioral biometrics. Typically, that’s applied to profiling human behavior, but I took it to the next level looking at nonhuman entities, bots, artificially intelligent systems trying to see if we can apply same techniques, same tools to detect bots, to prevent bots, to separate natural human behavior from artificial behaviors.

From there, I tried to figure out, “Well, what’s the next step? As those artificial intelligence systems get more capable, can we keep up? Can we still enforce some security on them?” That naturally led me to looking at much more capable systems and the whole set of issues with AGI and superintelligence.

Lucas: Okay. In terms of applying biometrics to AI systems or software or computers in general, what does that look like and what is the end goal there? What are the metrics of the computer that you’re measuring and to what end are they used and what information can they give you?

Roman: A good example I can give you is from my dissertation work again. I was very interested in poker at the time. The poker rooms online were still legal in the US and completely infested with bots. I had a few running myself. I knew about the problem and I was trying to figure out ways to automatically detect that behavior: figure out which bot is playing and prevent them from participating and draining resources. That’s one example where you just have some sort of computational resource and you want to prevent spam bots or anything like that from stealing it.

Lucas: Okay, this is cool. Before you’ve arrived at this AGI and superintelligence stuff, could you explain a little bit more about what you’ve been up to? It seems like you’ve done a lot in computer security. Could you unpack a little bit about that?

Roman: All right. I was doing a lot of very standard work relating to pattern recognition, neural networks, just what most people do in terms of work on AI recognizing digits and handwriting and things of that nature. I did a lot of work in biometrics, so recognizing not just different behaviors but face recognition, fingerprint recognition, any type of forensic analysis.

I do run the Cybersecurity Lab here at the University of Louisville. My students typically work on more well-recognized subdomains of security. With them, we did a lot of work in all those domains: forensics, cryptography, security.

Lucas: Okay. Of all the security research, how much of it do you think is important or critical to, or feeds into, ASI and AGI research? How much of it right now is actually applicable or is making interesting discoveries which can inform ASI and AGI thinking?

Roman: I think it’s fundamental. That’s where I get most of my tools and ideas for working with intelligent systems. Basically, everything we learned in security is now applicable. This is just a different type of cyber infrastructure. We learned to defend computers and networks. Now, we are trying to defend intelligent systems both from insider threats and outside from the systems themselves. That’s a novel angle, but pretty much everything I did before is now directly applicable. So many people working in AI safety approach it from other disciplines: philosophy, economics, political science. A lot of them don’t have the tools to see it as a computer science problem.

Lucas: The security aspect of it certainly makes sense. You’ve written on utility function security. If we’re to make value-aligned systems, then it’s going to be important that the right sorts of people have control over them, and that their preferences and dispositions, and the system’s utility function, are secure. A system in the end, I guess, isn’t really safe or robust or value aligned if it can be heavily influenced by anyone.

Roman: Right. If someone can just disable your safety mechanism, do you really have a safe system? That completely defeats everything you did. You release a well-aligned, friendly system and then somebody flips a few bits and you got the exact opposite.
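[Editor’s sketch: the point about “somebody flips a few bits” can be made concrete with a small Python example. It shows one standard building block, a cryptographic digest used to detect that a stored configuration has changed. This is an illustrative assumption on the editor’s part, not a technique attributed to Roman, and the configuration contents are hypothetical. A digest only detects tampering, and only if the reference digest is kept somewhere the attacker cannot also modify; it does not by itself secure a utility function.]

```python
import hashlib
import hmac


def digest(blob: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(blob).hexdigest()


def flip_bit(blob: bytes, bit_index: int) -> bytes:
    """Return a copy of blob with exactly one bit inverted."""
    data = bytearray(blob)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


def is_untampered(blob: bytes, reference_digest: str) -> bool:
    """Compare the blob's digest with a trusted reference digest."""
    return hmac.compare_digest(digest(blob), reference_digest)


# Hypothetical safety configuration, used purely for illustration.
safety_config = b"shutdown_on_request=true;max_autonomy=low"
reference = digest(safety_config)  # assumed to be stored out of the attacker's reach

tampered = flip_bit(safety_config, 3)

print(is_untampered(safety_config, reference))
print(is_untampered(tampered, reference))
```

[The first check passes and the second fails, because even a single flipped bit produces a different digest. The harder problem raised in the conversation, keeping the safety mechanism and its reference values out of reach in the first place, is not addressed by this sketch.]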

Lucas: Right. Given this research focus that you have in security and how it feeds into ASI and AGI thinking and research and AI alignment efforts, how would you just generally summarize your approach to AI alignment and safety?

Roman: There is not a general final conclusion I can give you. It’s still a work in progress. I’m still trying to understand all the types of problems we are likely to face. I’m still trying to understand whether this problem is even solvable to begin with. Can we actually control more intelligent systems? I always look at it from an engineering and computer science point of view, much less from a philosophy and ethics point of view.

Lucas: Whether or not this problem is in principle solvable has a lot to do with fundamental principles and ideas and facts about minds in general and what is possible for minds. Can you unpack a little bit more about what sorts of information we need, or what we need to think about more going forward, to know whether or not this problem is solvable in principle, and how we can figure that out as we continue forward?

Roman: There are multiple ways you can show that it’s solvable. The ideal situation is where you can produce some sort of mathematical proof. That’s probably the hardest way to do it because it’s such a generic problem. It applies to all domains. It has to still work under self-improvement and modification. It has to still work after learning additional information, and it has to be reliable against malevolent design, so purposeful modifications. It seems like it’s probably the hardest problem ever to be given to the mathematics community, if they are willing to take it on.

You can also look at examples just from experimental situations both with artificial systems. Are we good at controlling existing AIs? Can we make them safe? Can we make software safe in general? Also, natural systems. Are we any good at creating safe humans? Are we good at controlling people? Now, it seems like after millennia of efforts coming up with legal framework, ethical framework, religions, all sorts of ways of controlling people, we are pretty much failing at creating safe humans.

Lucas: I guess in the end, that might come down to fundamental issues in human hardware and software. The reproduction of human beings through sex and the way that genetics functions just create a ton of variance, so each person has different dispositions and preferences and other things. Then also the way that, I guess, software is run and shared across culture and people creates more fundamental issues that we might not have in software and machines because they work differently.

Are there existence proofs I guess with AI where AI is superintelligent in a narrow domain or at least above human intelligence in a narrow domain and we have control over such narrow systems? Would it be potentially generalizable as you sort of aggregate more and more AI systems, which are superintelligent in narrow domains that as you aggregate that or create an AGI, which sort of has meta learning, we would be able to have control over it given these existence proofs in narrow domains?

Roman: There are certainly such examples in narrow domains. If we’re creating, for example, a system to play chess, we can have a single number measuring its performance. We can tell whether it is getting better or worse. That’s quite possible, and it is a very limited, linear domain. The problem is as complexity increases, you go from this n-body problem where n equals one to n equals infinity, and that’s very hard to solve both computationally and in terms of just understanding what in that hyperspace of possibilities is a desirable outcome.

It’s not just gluing together a few narrow AIs like, “Okay, I have a chess playing program. I have a go playing program.” If I put them all in the same PC, do I now have general intelligence capable of moving knowledge across domains? Not exactly. Whatever safety you can prove for limited systems will not necessarily transfer to a more complex system, which integrates the components.

Very frequently, when you add two safe systems, the merged system has back doors, has problems. Same with adding additional safety mechanisms. A lot of times, you will install a patch for software to increase security and the patch itself has additional loopholes.

Lucas: Right. It’s not necessarily the case that in the end, AGI is actually just going to be sort of like an aggregation of a lot of AI systems, which are superintelligent in narrow domains. Rather, it potentially will be something more like an agent, which has very strong meta learning. So, learning about learning and learning how to learn and just learning in general. Such that all the sort of processes and things that it learns are deeply integrated at a lower level, and there’s sort of like a higher level of thinking that is able to execute on these things that it learned. Is that so?

Roman: That makes a lot of sense.

Lucas: Okay. Moving forward here, it would be nice if we could go ahead and explore a little bit of the key concepts in your books and papers and maybe get into some discussions there. I don’t want to spend a lot of time talking about each of the terms and having you define them as people can read your book, Artificial Superintelligence: A Futuristic Approach. They can also check out your papers and you’ve talked about these in other places. I think it will be helpful for giving some background and terms that people might not exactly be exposed to.

Roman: Sure.

Lucas: Moving forward, what can you tell us about what AI completeness is?

Roman: It’s a somewhat fuzzy term kind of like Turing test. It’s not very precisely defined, but I think it’s very useful. It seems that there are certain problems in artificial intelligence in general which require you to pretty much have general intelligence to solve them. If you are capable of solving one of them, then by definition, we can reduce other problems to that one and solve all problems in AI. In my papers, I talk about passing Turing test as being the first such problem. If you can pass unrestricted version of a Turing test, you can pretty much do anything.

Lucas: Right. I think people have some confusions here about what intelligence is in the kinds of minds that can solve Turing tests completely and the architecture that they have and whether that architecture means they’re exactly intelligent. I guess some people have this kind of intuition or idea that you could have a sort of system that had meta learning and learning and was able to sort of think as a human does in order to execute a Turing test.

Then potentially, other people have an idea and this may be misguided where a sort of sufficiently complicated tree search or Google engine on the computer would be able to pass a Turing test and that seems potentially kind of stupid. Is the latter idea a myth? Or if not, how is it just as intelligent as the former?

Roman: To pass an unrestricted version of a Turing test against someone who actually understands how AI works is not trivial. You can’t do it with just lookup tables and decision trees. I can give you an infinite number of completely novel situations where you have to be intelligent to extrapolate, to figure out what’s going on. I think theoretically, you can think of an infinite lookup table which has every conceivable string for every conceivable previous sequence of questions, but in reality, it just makes no sense.

Lucas: Right. There are going to be sort of like cognitive features and logical processes and things like inferences and extrapolation and logical tools that humans use that almost must necessarily come along for the ride in order to fully pass a Turing test.

Roman: Right. To fully pass it, you have to be exactly the same in your behavior as a human. Not only you have to be as smart, you also have to be as stupid. You have to repeat all the mistakes, all the limitations in terms of humanity, in terms of your ability to compute, in terms of your cognitive biases. A system has to be so smart that it has a perfect model of an average human and can fake that level of performance.

Lucas: It seems like in order to pass a Turing test, the system would either have to be an emulation of a person and therefore almost essentially be a person just on different substrate or would have to be superintelligent in order to run an emulation of a person or a simulation of a person.

Roman: It has to have a perfect understanding of an average human. It goes together with value alignment. You have to understand what a human would prefer or say or do in every situation and that does require you to understand humanity.

Lucas: Would that function successfully at a higher level of general heuristics about what an average person might do or does it require a perfect emulation or simulation of a person in order to fully understand what a person would do in such an instance?

Roman: I don’t know if it has to be perfect. I think there are certain things we can bypass and just go read books about what a person would do in that situation, but you do have to have a model complete enough to produce good results in novel situations. It’s not enough to know, OK, most people would prefer ice cream over getting a beating, something like that. You have to figure out what to do in a completely novel setup where you can’t just look it up on Google.

Lucas: Moving on from AI completeness, what can you tell us about the space of mind designs and the human mental model and how this fits into AGI and ASI and why it’s important?

Roman: A lot of this work was started by Yudkowsky and other people. The idea is just to understand how infinite that hyperspace is. You can have completely different sets of goals and desires from systems which are very capable optimizers. They may be more capable than an average human or best human, but what they want could be completely arbitrary. You can’t make assumptions along the lines of, “Well, any system smart enough would be very nice and beneficial to us.” That’s just a mistake. If you randomly pick a mind from that infinite universe, you’ll end up with something completely weird. Most likely incompatible with human preferences.

Lucas: Right. This is just sort of, I guess, another way of explaining the orthogonality thesis as described by Nick Bostrom?

Roman: Exactly. Very good connection, but it gives you a visual representation. I have some nice figures where you can get a feel for it. You start with, “Okay, we have human minds, a little bit of animals, you have aliens in the distance,” but then you still keep going and going in some infinite set of mathematical possibilities.

Lucas: In this discussion of the space of all possible minds, it’s a discussion about intelligence where intelligence is sort of understood as the ability to change and understand the world and also the preferences and values which are carried along in such minds however random and arbitrary they are from the space of all possible mind design.

One thing which is potentially very important in my view is the connection of the space of all possible hedonic tones within mind space, so the space of all possible experience and how that maps onto the space of all possible minds. Not to say that there’s duality going on there, but it seems very crucial and essential to this project to also understand the sorts of experiences of joy and suffering that might come along for each mind within the space of all possible minds.

Is there a way of sort of thinking about this more and formalizing it more such as you do or does that require some more really foundational discoveries and improvements in the philosophy of mind or the science of mind and consciousness?

Roman: I look at this problem and I have some papers looking at those. One looks at just generation of all possible minds. Sequentially, you can represent each possible software program as an integer and brute force them. It will take infinite amount of time, but you’ll get to every one of them eventually.
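The enumeration idea Roman describes can be illustrated with a minimal sketch. It assumes, purely for illustration, that a "program" is any finite string over a two-symbol alphabet; the names `ALPHABET`, `int_to_program`, and `all_programs` are hypothetical and do not come from his papers. Every non-negative integer maps to exactly one string, so counting upward visits every candidate program eventually, although the loop never terminates.

```python
from itertools import count, islice

ALPHABET = "01"  # assumption: a hypothetical two-symbol program alphabet


def int_to_program(n: int, alphabet: str = ALPHABET) -> str:
    """Map a non-negative integer to a unique string (bijective base-k numbering)."""
    k = len(alphabet)
    symbols = []
    while n > 0:
        # Shift by one so that every string, including those with leading
        # symbols equal to alphabet[0], gets exactly one integer.
        n, r = divmod(n - 1, k)
        symbols.append(alphabet[r])
    return "".join(reversed(symbols))


def all_programs(alphabet: str = ALPHABET):
    """Yield every finite string over the alphabet, shortest first, forever."""
    for n in count():
        yield int_to_program(n, alphabet)


# Take only the first few candidates; the full enumeration is infinite.
first = list(islice(all_programs(), 7))
print(first)
```

Most strings produced this way would not be valid or meaningful programs; the sketch only shows that the space is countable and can be walked by brute force.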

Another recent paper looks at how we can actually detect qualia in natural and artificial agents. While it’s impossible for me to experience the world as someone else, I think I was able to come up with a way to detect whether you have experiences or not. The idea is to present you with illusions, kind of visual illusions, and based on the type of body you have, the type of sensors you have, you might have experiences which match with mine. If they do not, then I can’t really say anything about you. You could be conscious and experiencing qualia or maybe not. I have no idea.

But if, in a set of such tests on multiple illusions, you happen to experience exactly the same side effects from the illusions, then I have no choice but to assume that you have exactly the same qualia in that situation. The tests are multiple-choice questions, and you can get any level of accuracy you want with just additional tests. So, at least I know you do have experiences of that type.

If we take it to what you suggested, pleasure or pain, we can figure out whether there is suffering going on, whether there is pleasure happening, but this is very new. We need a lot more people to start doing psychological experiments with that.

The good news is from existing literature, I found a number of experiments where a neural network designed for something completely unrelated still experienced a similar side effect as a natural model. That’s because the two models represent the same mathematical structure.

Lucas: Sorry. The idea here is that by observing effects on the system that if those effects are also correlated or seen in human subjects that this is potentially some indication that the qualia that is correlated with those effects in people is also potentially experienced or seen in the machine?

Roman: Kind of. Yeah. So, when I show you a new cool optical illusion. You experienced something outside of just the values of bits in that illusion. Maybe you see light coming out of it. Or maybe you see rotations. Maybe you see something else.

Lucas: I see a triangle that isn’t there.

Roman: Exactly. If a machine reports exactly the same experience, without previous knowledge obviously, so it can’t just Google what a human would see, how else would you explain that knowledge, right?

Lucas: Yeah. I guess I’m not sure here. I probably need to think about it more actually, but this does seem like a very important approach and place to move forward. The person in me who’s concerned about thinking about ethics looks back on the history of ethics and thinks about how human beings are good at optimizing the world in ways in which it produces something of value to them but in optimizing for that thing, they produce huge amounts of suffering. We’ve done this through subjugation of women and through slavery and through factory farming of animals currently and previously.

After each of these periods, of these morally abhorrent behaviors, it seems we have an awakening and we’re like, “Oh, yeah, that was really bad. We shouldn’t have done that.” I guess just moving forward here with machine intelligence, it’s not clear that this will be the case or it is possible that it could be the case, but it may. Potentially sort of the next one of these moral catastrophes is if we sort of ignore this research into the possible hedonic states of machines and just brush it away as being dumb philosophical stuff that we potentially could produce an enormous amount of suffering in machine intelligence and just sort of override that and create another ethical catastrophe.

Roman: Right. I think that makes a lot of sense. I think qualia are a side effect of certain complex computations. You can’t avoid producing them if you’re doing this type of thinking, computing. We have to be careful once we get to that level of not having very painful side effects.

Lucas: Is there any possibility here of trying to isolate the neural architectural correlates of consciousness in human brains and then physically or digitally instantiating that in machines and then creating a sort of digital or physical corpus callosum between the mind of a person and such a digital or physical instantiation of some neural correlate of something in the machine in order to see if an integration of those two systems creates a change in qualia for the person? Such that the person could sort of almost first-person confirm that when it connects up to this thing that its subjective experience changes and therefore maybe we have some more reason to believe that this thing independent of the person, when they disconnect, has some sort of qualia to it.

Roman: That’s very interesting type of experiment I think. I think something like this has been done with Siamese twins conjoined with brain tissue. You can start looking at those to begin with.

Lucas: Cool. Moving on from the space of mind designs and human mental models, let’s go ahead and then talk about the singularity paradox. This is something that you cover quite a bit in your book. What can you tell us about the singularity paradox and what you think the best solutions are to it?

Roman: It’s just a name for this idea that you have a superintelligent system, very capable optimizer, but it has no common sense as we humans perceive it. It’s just kind of this autistic savant capable of making huge changes in the world but a four-year-old would have more common sense in terms of disambiguation of human language orders. Just kind of understanding the desirable states of the world.

Lucas: This is sort of the fundamental problem of AI alignment. The sort of assumption about the kind of mind AGI or ASI will be, the sort of autistic savant sort of intelligence, what that is … This is what Dylan Hadfield-Menell brought up on our first podcast for the AI Alignment Series is that for this case of this autistic savant that most people have in mind, a perfectly rational Bayesian optimizing agent. Is that sort of the case? Is that the sort of mind that we have in mind when we’re thinking of this autistic savant that just blows over things we care about because it’s just optimizing too hard for one thing and Goodhart’s law starts to come into effect?

Roman: Yes, in a way. I always try to find the simplest examples in the real world so we can understand better. You have people with an extremely high level of intelligence. The concerns they have, the issues they find interesting are very different from your average person. If you watch something like Big Bang Show with Sheldon, that’s like a good, funny example of this on a very small scale. There is maybe 30 IQ point difference, but what if it’s 300 points?

Lucas: Right. Given the sort of problem, what are your conclusions and best ideas or best practices for working on this? Working on this is just sort of working on the AI alignment problem I suppose.

Roman: AI alignment is just a new set of words to say we want the safe and secure system, which kind of does what we designed it to do. It doesn’t do anything dangerous. It doesn’t do something we disagree with. It’s well aligned with our intention. By itself, the term adds nothing new. The hard problem is, “Well, how do we do it?”

I think it’s fair to say that today, as of right now, no one in the world has a working safety mechanism capable of controlling intelligent behavior and scaling to a new level of intelligence. I think even worse is that no one has a prototype for such a system.

Lucas: One thing that we can do here is we can sort of work on AI safety and we can think about law, policy and governance to try and avoid an arms race in AGI or ASI. Then there are also important ethical questions which need to be worked on before AGI some of which including kind of more short-term things, universal basic income and bias and discrimination in algorithmic systems. How AI will impact the workforce and other things and potentially some bigger ethical questions we might have to solve after AGI if we can pull the brakes.

In terms of the technical stuff, one important path here is thinking about and solving the confinement problem, the method by which we are able to create an AGI or ASI and air gap it and make it so that it is confined and contained to be tested in some sort of environment to see if it’s safe. What are your views on that and what do you view as a potential solution to the confinement problem?

Roman: That’s obviously a very useful tool to have, to test, to debug, to experiment with an AI system while it’s limited in its communication ability. It cannot perform social engineering attacks against the designer or anyone else. It’s not the final solution if you will if a system can still escape from such confinement, but it’s definitely useful to be able to do experiments on evolving learning AI.

Can I limit access to the Internet? Can I limit access to knowledge, encyclopedia articles? Can I limit output in terms of just text, no audio, no video? Can I do just a binary yes or no? All of it is extremely useful. We have special air gap systems for studying computer viruses, so to understand how they work, how they communicate versus just taking it to the next level of malevolent software.

Lucas: Right. There’s sort of this, I guess, general view and I think that Eliezer has participated in some of these black boxing experiments where you pretend as if you are the ASI and you’re trying to get out of the box and you practice with other people to see if you can get out of the box. Out of discussions and thinking on this, it seems that some people thought that it’s almost impossible to confine these systems. Do you think that, that’s misguided or what are your views on that?

Roman: I agree that long-term, you absolutely cannot confine a more intelligent system. I think short-term while it’s still developing and learning, it’s a useful tool to have. The experiments Eliezer did were very novel at the time, but I wish he had made public all the information to make them truly scientific experiments where people can reproduce them properly, learn from them. Simply saying that this guy who now works with me let me out, it’s not the optimal way to do it.

Lucas: Right. I guess the concern there is with confinement experiments is that explaining the way in which it gets out is potentially an information hazard.

Roman: Yeah. People tend to call a lot of things informational hazards. Those things certainly exist. If you have source code for AGI, I strongly recommend you don’t make it public, but we’ve been calling a lot of things informational hazard I think.

The best example is Roko’s basilisk where essentially it was a new way to introduce Christianity. If I tell you about Jesus and you don’t follow him, now you’re going to hell. If I didn’t tell you about Jesus, you’d be much better off. Why did you tell me? Deleting it just makes it grow bigger and it’s like the Streisand effect, right? You’re promoting this while you’re trying to suppress it. I think you have to be very careful in calling something an informational hazard, because you’re diluting the label by doing that.

Lucas: Here’s something I think we can potentially get into the weeds on and we may disagree about and have some different views on. Would you like to just go ahead and unpack your belief? First of all, go ahead and explain what it is and then explain your belief about why machine ethics in the end is the wrong approach or a wrong instrument in AI alignment.

Roman: The way it was always done in philosophy typically, everyone tried to publish a paper suggesting, “Okay, this is a set of ethics we need to follow.” Maybe it’s ethics based on Christianity or Judaism. Maybe it’s utilitarianism, whatever it is. There was never any actual solution, anything proposed which could be implemented as a way to get everyone on board and agree with it. It was really just a competition for like, “Okay, I can come up with a new ethical set of constraints or rules or suggestions.”

We know philosophers have been trying to resolve it for millennia. They failed miserably. Why somehow moving it from humans to machines will make it an easier problem to solve, where a single machine is a lot more powerful and can do a lot more with this, is not obvious to me. I think we’re unlikely to succeed by doing that. The theories are contradictory, ill-defined, they compete. It doesn’t seem like it’s going to get us anywhere.

Lucas: To continue unpacking your view a bit more, instead of machine ethics where we can understand machine ethics as the instantiation of normative and meta-ethical principles and reasoning and machine systems to sort of make them moral agents and moral reasoners, your view is that instead of using that, we should use safety engineering. Would you like to just unpack what that is?

Roman: To return to the definition you proposed. For every ethical system, there are edge cases which backfire tremendously. You can have an AI which is a meta-ethical decider and it figures out, “Okay, the best way to avoid human suffering is do not have any humans around.” You can defend it from philosophical point of view, right? It makes sense, but is that a solution we would accept if a much smarter system came up with it?

Lucas: No, but that’s just value misalignment I think. I don’t think that there are any sort of like … There are, in principle, possible moral systems where you say suffering is so bad that we shouldn’t risk any of it at all ever, therefore life shouldn’t exist.

Roman: Right, but then you make AI the moral agent. That means it’s making moral decisions. It’s not just copying what humans decided even if we can somehow figure out what the average is, it’s making its own novel decisions using its superintelligence. It’s very likely it will come up with something none of us ever considered. The question is, will we like it?

Lucas: Right. I guess just for me here, I understand why AI safety engineering and technical alignment efforts are so very important and intrinsic. I think that it really constitutes a lot of the AI alignment problem. I think that given that the universe has billions and billions and billions of years left to live, that the instantiation of machine ethics in AGI and ASI is… you can’t hold off on it and it must be done.

You can’t just have an autistic savant superspecies on the planet that you just never imbue with any sort of ethical epistemology or meta-ethics because you’re afraid of what might happen. You might want to do that extremely slowly and extremely carefully, but it seems like machine ethics is ultimately an inevitability. If you start to get edge cases that the human beings really don’t like, then potentially you just went wrong somewhere in cultivating and creating its moral epistemology.

Roman: I agree with doing it very slowly and carefully. That seems like a good idea in general, but again, just projecting to long-term possibilities. I’m not optimistic that the result will be beneficial.

Lucas: Okay. What is there left to it? If we think of the three cornerstones of AI alignment as being law, policy, governance, then we have ethics on one corner and then we have technical AI alignment on the other corner. We have these three corners.

If we have say AGI or ASI around 2050, which I believe is something a lot of researchers give a 50% probability to, then imagine we simply solve technical AI alignment and we solved the law, policy and governance coordination stuff so that we don’t end up having an arms race and we mess up on technical alignment. Or someone uses some singleton ASI to malevolently control everyone else.

Then we still have the ethical issues in the end. Even if we have a perfectly corrigible and docile intelligence, which is sort of tuned to the right people and sort of just takes the right orders. Then whatever that ASI does, it’s still going to be a manifestation, an embodiment of the ethics of the people who tell it what to do.

There’s still going to be billions and billions of years left in the universe. William MacAskill discusses this. It’s that, sort of after we’ve solved the technical alignment issues and the legal and political and coordination issues, then we’re going to need a period of long deliberation where we actually have to make concrete decisions about moral epistemology and meta-ethics and try and do it in really a formalized and rigorous way and potentially take thousands of years to figure it out.

Roman: I’m criticizing this and that makes it sound like I have a solution, which is something else and I don’t. I don’t have a solution whatsoever. I just feel it’s important to point out problems with each specific approach so we can avoid problems of over committing to it.

You mentioned a few things. You mentioned getting information from the right people. That seems like that’s going to create some problems right there. Not sure who the right people are. You mentioned spending thousands of years deciding what we want to do with this superintelligent system. I don’t know if we have that much time given all the other existential risks, given the chance of malevolent superintelligence being released by rogue agents much sooner. Again, it may be the best we got, but it seems like there are some issues we have to look at.

Lucas: Yeah, for sure. Ethics has traditionally been very messy and difficult. I think a lot of people are confused about the subject. Based on my conversation with Dylan Hadfield-Menell, when we’re discussing inverse reinforcement learning and other things that he was working on, his sort of view was a view of AI alignment and value alignment where inverse reinforcement learning and other preference learning techniques are sort of used to create a natural evolution of human values and preferences in ethics, which sort of exists in an ecosystem of AI systems which are all, I guess, in conversation so that it could, more so, naturally evolve.

Roman: Natural evolution is a brutal process. It really has no humanity to it. It exterminates most species. I don’t know if that’s the approach we want to simulate.

Lucas: Not an evolution of ideas?

Roman: Again, if those ideas are actually implemented and applied to all of humanity that has a very different impact than if it’s just philosophers debating with no impact.

Lucas: In the end, it seems like a very difficult end frontier to sort of think about and move forward on. Figuring out what we want and what we should do with a plurality of values and preferences. Whether or not we should take a view of moral realism or moral relativism or anti-realism about ethics and morality. Those seem like extremely consequential views or positions to take when determining the fate of the cosmic endowment.

Roman: I agree completely on how difficult the problem is.

Lucas: Moving on from machine ethics, you wrote a paper on leak proofing the singularity. Would you like to go ahead and unpack a little bit about what you’re doing in the paper and how that ties into all of this?

Roman: That’s just AI boxing. That was the response to David Chalmers’ paper and he talks about AI boxing as leak proofing, so that’s the title we used, but it’s just a formalization of the whole process. Formalization of the communication channel, what goes in, what goes out. It’s a pretty good paper on it. Again, it relies on this approach of using tools from cyber security to formalize the whole process.

For a long time, experts in cyber security attempted to constrain regular software, not intelligent software, from communicating with other programs, the outside world and the operating system. We’re looking at how that was done, what different classifications they used for side channels and so on.

Lucas: One thing that you also touch on, would you like to go ahead and unpack like wireheading addiction and mental illness in general in machine systems and AI?

Roman: It seems like there are a lot of mental disorders people experience, and people are the only example of general intelligence we have. More and more, we see similar problems show up in artificial systems, which try to emulate this type of intelligence. It’s not surprising and I think it’s good that we have this body of knowledge from psychology which we can now use to predict likely problems and maybe come up with some solutions for them.

Wireheading is essentially this idea of an agent not doing any useful work but just stealing the reward channel. If you think about having kids and there is a cookie jar and they get rewarded every time they clean the room or something like that with a cookie, well, they essentially can just find the cookie jar and get direct access to their reward channel, right? They’re kids, so they’re unlikely to cause much harm, but if a system is more capable and it realizes you as a human control the cookie jar, well now, it has an incentive to control you.
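The cookie jar story can be sketched as a toy model. Everything here is an illustrative assumption: the `Action` class, the two action names, and the reward numbers are hypothetical and chosen only to show the incentive. An agent that simply maximizes what arrives on its reward channel prefers tampering with the channel over the task the designer cared about.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    reward: float       # what the agent receives on its reward channel
    useful_work: float  # what the designer actually wanted done


# Hypothetical numbers: raiding the jar pays far more reward than one
# cookie per cleaned room, while producing no useful work at all.
ACTIONS = [
    Action("clean_room", reward=1.0, useful_work=1.0),
    Action("raid_cookie_jar", reward=10.0, useful_work=0.0),
]


def greedy_choice(actions):
    """Pick the action with the highest reward, ignoring useful work."""
    return max(actions, key=lambda a: a.reward)


chosen = greedy_choice(ACTIONS)
print(chosen.name, chosen.useful_work)
```

The sketch leaves out the more worrying step Roman mentions, where a capable system also gains an incentive to control whoever guards the jar; it only shows why reward and useful work can come apart.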

Lucas: Right. There are also these examples with rats and mice that you might be able to discuss a little bit more.

Roman: The classic experiments on that just created, through surgery, electrode implants in the brains of some simple animals. Every time you provided an electrical shock to that area, the animals experienced the maximum pleasure, like an orgasm you don’t get tired of. They bypassed getting food, having sex, playing with toys. They just sat there pressing the button. If you made it where they had to walk on an electrified fence to get to the button, it wasn’t a problem, they would do that. It completely messes with the usefulness of an agent.

Lucas: Right. I guess just in terms of touching on the differences and the implications of ethics here, someone with sort of consequentialist views, which are sort of very impartial and non-speciesist, can potentially view wireheading as ethical or the end goal. Whereas other people view wireheading as basically abhorrent and akin to something terrible that you would never want to happen. There’s also again, I think, a very interesting ethical tension there.

Roman: It goes, I think, to the whole idea of simulated reality and virtual world. Do you care if you’re only succeeding in a made-up world? Would that make you happy enough or do you have to actually impact reality? That could be part of resolving our differences about values and ethics. If every single person can be in their own simulated universe where everything goes according to their wishes, is that a solution to getting us all to agree? You know it’s a fake universe, but at least you’re the king in it.

Lucas: I guess that also touches on this question of the duality that human beings have created between what is fake and real. In what sense is something really fake if it’s not just the base reality? Is there really fundamental value in the thing being the base reality and do we even live in the base reality? How does cosmology or ideas that Max Tegmark explores about the multiverse sort of even impact that? How will that impact our meta-ethics and decision-making about the moral worth of wireheading and simulated worlds?

Roman: Absolutely. I have a paper on something I call designometry, which is measuring natural versus artificial. The big question of course is can we tell if we are living in a simulated reality? Can it be measured scientifically? Or is it just a philosophical idea? It seems like there are certain ways to identify signals from the engineer if it’s done on purpose, but in the general case, you can never tell whether something is a deep fake or a real input.

Lucas: I’d like to discuss that a little bit more with you, but just to back up really quick to finish talking about psychology and AI. It seems like this has been something that is really growing in the AI community and it’s not something that I really know much about at all. My general understanding is that as AI systems become more and more complex, it’s going to be much more difficult to diagnose and understand the specific pathways and architectures which are leading to mental illness.

Therefore, the general diagnostic tools that we’ve developed in psychology, which observe and understand the higher-level phenomena or behaviors that systems exhibit, would be helpful or implementable here. Is that sort of the case, and is the use case of psychology here really just to diagnose mental illnesses, or does it also have a role in developing positive psychology and well-being in machine systems?

Roman: I think it’s more of the first case. If you have a black box AI, just a huge, very deep neural network, you can’t just look at the wiring and weights and figure out why it’s producing the results you’re seeing. Whereas you can do high-level experiments, maybe even have a conversation with the system, to give you an idea of how it’s misfiring and what the problem is.

Lucas: Eventually, if we begin exploring the computational structure of different hedonic tones and that becomes more formalized as a science, then I don’t know, maybe potentially, there would be more of a role for psychologists in discussing the well-being part rather than the computational mental illness part.

Roman: It is a very new concept. It’s been mentioned a lot in science fiction, but as a scientific concept, it’s very new. I think there is only one or two papers on it directly. I think there is so much potential to exploring more on connections with neuroscience. I’m actually quite excited about it.

Lucas: That’s exciting. Are we living in a simulated world? What does it mean to be able to gather evidence about whether or not we’re living in a simulation? What would such evidence look like? Why may we or may not ever be able to tell whether or not we are in a simulation?

Roman: In general case, if there is not an intent to let you know that it’s a simulated world, you would never be able to tell. Absolutely anything can actually be part of natural base system. You don’t know what it’s like if you are Mario playing in an 8-bit world. You have no idea that it’s low resolution. You’re just part of that universe. You assume the base is the same.

There are situations where engineers leave trademarks, watermarks, helpful messages in a system to let you know what’s going on, but that’s just giving you the answer. I think in the general case, you can never know, but from statistical arguments, there’s … Nick Bostrom presents a very compelling statistical argument. I do the same for biological systems in one of my papers.

Roman: It seems more likely that we are not the base just because every single intelligent civilization will produce so many derived civilizations from it. From space exploration, from creating biological robots capable of undergoing evolutionary process. It would be almost a miracle if out of thousands and thousands of potential newly designed organisms, newly evolved ones, we were like the first one.

Lucas: I think that, that sort of evolutionary process presumes that the utility function of the optimization process, which is spreading into the universe, is undergoing an evolutionary process where it’s changing. Whereas the security and brittleness and stability of that optimization process might be very fixed. It might be that all future and possible super advanced civilizations do not converge on creating ancestor simulations.

Roman: It’s possible, but it feels like a bit less likely. I think they’ll still try to grab the resources and the systems may be fixed in certain values, but they still would be adapting to the local environment. We just see it with different human populations, right? We’re essentially identical, but we developed very different cultures, religions, food preferences based on the locally available resources.

Lucas: I don’t know. I feel like I could imagine like a civilization, a very advanced one coming down on some sort of hedonic consequentialism where the view is that you just want to create as many beautiful experiences as possible. Therefore, there wouldn’t be any room for simulating evolution on Earth and all the suffering and kind of horrible things we have to go through.

Roman: But you’re looking at it from inside the simulation. You don’t know what the reasons are on the outside, so this is like a video game or going to the gym. Why would anyone be killed in a video game or suffer tremendously, lifting heavy weights in a gym, right? It’s only fun when you understand external reasons for it.

Lucas: I guess just two things here. I just have general questions on. If there is a multiverse at one or another level, would it then also be the case that the infinity of simulated universes would be a larger fraction of the infinity of the multiverse than the worlds which were not simulated universes?

Roman: This is probably above my pay grade. I think Max is someone who can give you a better answer in that. Comparing degrees of infinities is hard.

Lucas: Okay. Cool. It is not something I really understand either. Then I guess the other thing is I guess just in general, it seems queer to me that human beings are in a world and that we look at our computer systems and then we extrapolate what if these computer systems were implemented at a more base level. It seems like we’re trapped in a context where all that we have to extrapolate about the causes and conditions of our universe are the most fundamental things that we can observe from within our own universe.

It seems like settling on the idea of, “Okay, we’re probably in a simulation,” just seems kind of like we’re clinging to and finding a cosmogenesis hope in one of the only few things that we can, just given that we live in a universe where there are computers. Does that make sense?

Roman: It does. Again, from inside the simulation, you are very limited in understanding the big picture. Then so much would be easier to understand if we had external knowledge, but it’s just not the option we have so far. We learn by pretending to be the engineer in question and now we design virtual worlds. We design intelligent beings and the options we have is the best clue we have about the options available to whoever does it in the external level.

Lucas: Almost as if Mario got to the end of the level and got to the castle. Then because you got to the castle the next level or world started, he was like maybe outside of this context there’s just a really, really big castle or something that’s making lower levels of castles exist.

Roman: Right. I agree with that, but I think we have in common this mathematical language. I think that’s still universal. Just by studying mathematics and possible structures and proving things, we can learn about what’s possible and impossible.

Lucas: Right. I mean there’s just really foundational and fundamental question about the metaphysical realism or anti-realism of mathematics. If there is a multiverse or like a meta multiverse or like a meta-meta-meta-multiverse levels …

Roman: Only three levels.

Lucas: I guess just the implications of a mathematical realism or Platonism or sort of anti-realism at these levels would have really big implications.

Roman: Absolutely, but at this point, I think it’s just fun to think about those possibilities and what they imply for what we’re doing, what we’re hoping to do, what we can do. I don’t think it’s a waste of time to consider those things.

Lucas: Just generally, this is just something I haven’t really been updated on. Is this rule about only three levels of regression just sort of a general principle or rule, kind of like Occam’s razor, that people like to stick by? Or is there any more…?

Roman: No. I think it’s something Yudkowsky said and it’s cute and kind of meme-like.

Lucas: Okay. So it’s not like serious epistemology?

Roman: I don’t know how well proven that is. I think he spoke about levels of recursion initially. I think it’s more of a meme.

Lucas: Okay. All right.

Roman: I might be wrong in that. I know a lot about memes, less about science.

Lucas: Me too. Cool. Given all this and everything we’ve discussed here about AI alignment and superintelligence, what are your biggest open questions right now? What are you most uncertain about? What are you most looking for key answers on?

Roman: The fundamental question of AI safety, is it solvable? Is control problem solvable? I have not seen a paper where someone gives mathematical proof or even a rigorous argument. I see in some blog posts arguing, “Okay, we can predict what the chess machine will do, so surely we can control superintelligence,” but it just doesn’t seem like it’s enough. I’m working on a paper where I will do my best to figure out some answers for that.

Lucas: What is the definition of control and AI alignment?

Roman: I guess it’s very important to formalize those before you can answer the question. If we don’t even know what we’re trying to do, how can we possibly succeed? The first step in any computer science research project is to show that your problem is actually solvable. Some are not. We know, for example, that the halting problem is not solvable, so it doesn’t make sense to give it as an assignment to someone and wait for them to solve it. If you give them more funding, more resources, it’s just a waste.
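As a rough illustration of why the halting problem is unsolvable, here is a minimal Python sketch of the standard diagonal argument. The names `make_paradox` and `always_says_loops` are hypothetical and chosen only for this example; they are not from the conversation. The idea is that any proposed halting checker can be handed a program built to do the opposite of whatever the checker predicts.

```python
from typing import Callable

Program = Callable[[], object]


def make_paradox(halts: Callable[[Program], bool]) -> Program:
    """Build a program that does the opposite of what `halts` predicts for it."""

    def paradox() -> str:
        if halts(paradox):
            while True:  # predicted to halt, so loop forever instead
                pass
        return "halted"  # predicted to loop forever, so halt immediately

    return paradox


def always_says_loops(program: Program) -> bool:
    """A hypothetical (and necessarily wrong) checker: always answers 'never halts'."""
    return False


paradox = make_paradox(always_says_loops)
prediction = always_says_loops(paradox)  # the checker claims the program never halts
outcome = paradox()                      # yet the program halts right away
print(prediction, outcome)
```

A checker that answered `True` instead would be wrong in the other direction, because the constructed program would then loop forever. No choice of checker escapes this construction, which is the heart of the argument.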

Here, it seems like we have more and more people working very hard in different solutions, different methods, but can we first spend a little bit of time seeing how successful can we be? Based on the answer to that question, I think a lot of our governance and the legal framework and general decision-making about this domain will be impacted by it.

Lucas: If your core and key question here is whether or not the control problem or AI alignment is, in principle, or fundamentally solvable, could you give us a quick crash course on complexity theory and computational complexity theory and just things which take polynomial time to solve versus exponential time?

Roman: That’s probably the hardest course you’ll take as an undergraduate in computer science. At the time, I hated every second of it. Now, it’s my favorite subject. I love it. This is the only professor whom I remember teaching computational complexity and computability.

To simplify it, there are different types of problems. Surprisingly, almost all problems can be squeezed into one of those boxes. There are easy problems, which we can just quickly compute. Your calculator adding 2+2 is an example of that. There are problems where we know exactly how to solve them. It’s very simple algorithm. We can call it brute force. You try every option and you’ll always get the best answer, but there’s so many possibilities that in reality you can never consider every option.
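To make the brute-force idea concrete, here is a small Python sketch under assumed inputs: a toy packing problem where every subset of items is tried and the most valuable one that fits is kept. The function name `best_subset` and the example numbers are hypothetical, picked only for illustration.

```python
from itertools import combinations


def best_subset(values, weights, capacity):
    """Try every subset of items; keep the most valuable one that fits."""
    n = len(values)
    best_value, best_items = 0, ()
    for size in range(n + 1):
        for items in combinations(range(n), size):
            weight = sum(weights[i] for i in items)
            value = sum(values[i] for i in items)
            if weight <= capacity and value > best_value:
                best_value, best_items = value, items
    return best_value, best_items


# Three items: the search checks 2**3 = 8 subsets and is guaranteed to find the best.
print(best_subset([60, 100, 120], [10, 20, 30], capacity=50))

# The number of subsets doubles with every extra item, so the same guaranteed
# method becomes impractical long before the inputs look large.
for n in (10, 30, 60):
    print(n, "items ->", 2 ** n, "subsets to check")
```

The answer is always optimal because nothing is skipped, which is exactly why the running time grows exponentially with the number of items.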

Lucas: Like computing prime numbers.

Roman: Well, prime numbers are in P. It’s polynomial to test if a number is prime. It’s actually from a somewhat recent paper in the last 10 years, a great result: PRIMES is in P. There are problems which are called NP-complete and those are usually the interesting problems we care about, and they all reduce to each other. If you solve one, you’ve solved all of them. You cannot brute force them. You have to find some clever heuristics to get approximate answers, optimize those.

We can get pretty close to that. Examples are things like the traveling salesperson problem. If you can figure out the optimal way to deliver pizza to multiple households, if you can solve it in the general case, you’ll solve 99% of interesting problems. Then there are some problems which we know no one can ever solve using the Von Neumann architecture, the standard computer architecture. There are proposals for hypercomputation: computers with oracles, computers with all sorts of magical properties which would allow us to solve those very, very, very difficult problems, but that doesn’t seem likely anytime soon.
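The pizza-delivery example can be sketched in Python as follows. This is only an illustration under assumed inputs: the household coordinates are made up, and the nearest-neighbor rule is one simple heuristic among many, not a method named in the conversation. The exact search tries every visiting order, while the heuristic always drives to the closest unvisited household.

```python
from itertools import permutations
from math import dist


def tour_length(points, order):
    """Length of the round trip that visits `points` in the given order."""
    return sum(
        dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def brute_force_tour(points):
    """Exact answer: try every order that starts at point 0. Cost grows as (n-1)!."""
    candidates = ((0,) + perm for perm in permutations(range(1, len(points))))
    return min(candidates, key=lambda order: tour_length(points, order))


def nearest_neighbor_tour(points):
    """Heuristic: always go to the closest unvisited point. Fast, but not always optimal."""
    unvisited = set(range(1, len(points)))
    order = [0]
    while unvisited:
        here = points[order[-1]]
        nxt = min(unvisited, key=lambda j: dist(here, points[j]))
        unvisited.remove(nxt)
        order.append(nxt)
    return tuple(order)


# Hypothetical household coordinates; index 0 is the pizza shop.
homes = [(0, 0), (4, 1), (1, 5), (6, 4), (3, 3), (5, 0), (2, 2)]
exact = brute_force_tour(homes)
quick = nearest_neighbor_tour(homes)
print("exact tour:    ", exact, round(tour_length(homes, exact), 2))
print("heuristic tour:", quick, round(tour_length(homes, quick), 2))
```

With seven stops the exact search only has to check 720 orders, but at twenty stops it would already be more than 10^17, which is why approximate heuristics are used in practice.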

The best part of it I think is this idea of oracles. An oracle is a machine capable of doing magic to give you the answer to an otherwise unsolvable problem, and there are degrees of oracles. There are magical machines which are more powerful magicians than other magical machines. None of it is working in practice. It’s all purely theoretical. You start learning about different degrees of magic and it’s pretty cool.

Lucas: Learning and understanding about what, in principle, is fundamentally computationally possible or feasible in certain time frames within the universe given the laws of physics that we have seems to be foundationally important and interesting. It’s one of, I guess, the final frontiers. Not space, but I guess solving intelligence and computation and also the sort of hedonic qualia that comes along for the ride.

Roman: Right. I guess the magical aspect allows you to escape from your local physics and consider other types of physics and what would be possible outside of this world.

Lucas: What advances or potential advances in quantum computing or other sorts of more futuristic hardware and computational systems help and assist in these problems?

Roman: I think quantum computing has more impact on the cryptography and security in that way. It impacts some algorithms more directly. I don’t think there is a determined need for it right now in terms of AI research or AI safety work. It doesn’t look like a human brain is using a lot of quantum effects though some people argue that it’s important for consciousness. I’m not sure if there is definitive proof of that experimentally.

Lucas: Let’s go ahead now and turn to some questions that we’ve gotten from our audience.

Roman: Sounds good.

Lucas: I guess we’re going to be jumping around here between narrow and short-term AI and some other questions. It would be great if you could let me know about the state of safety and security in current AI in general and the evaluation and verification and validation approaches currently adopted by the industry.

Roman: In general, the state of safety and security in AI is almost nonexistent. It’s kind of like we’re repeating history. When we worked on creating the Internet, security was not something we cared about, and so the Internet is completely insecure. Then we started work on Internet 2.0, the Internet of Things. We’re repeating the same mistake. All those very cheap devices made in China have no security but they’re all connected, and that’s how you can create swarms of devices attacking systems.

It is my hope that we don’t repeat this with intelligent systems, but right now it looks like we are. We care about getting them to the market as soon as possible, making them as capable as possible, the soonest possible. Safety and security is something most people don’t know about, don’t care about. You can see it in terms of number of researchers working on it. You can see it in terms of percentage of funding allocated to AI safety. I’m not too optimistic so far, but the field is growing exponentially, so that’s a good sign.

Lucas: How does evaluation and verification and validation fit into all of this?

Roman: We have pretty good tools for verifying critical software. Something so important… you’re flying to Mars, the system cannot fail. Absolutely. We can do mathematical proofs to show that the code you created matches the design you had. It’s an expensive process, but we can do a pretty good job with it. You can put more resources into verifying it with multiple verifiers. You can get any degree of accuracy you want at a cost in computational resources.

As far as I can tell, there is no or very little successful work on verifying systems which are capable of self-improvement, changing, dynamically learning, operating in novel environments. It’s very hard to verify something where you have no idea what the behavior should be in the first place. If it’s something linear, again, we have a chess computer, we know what it’s supposed to do exactly. It’s a lot easier to verify than something more intelligent than you operating on new data in a new domain.

Lucas: Right. It seems like verification in this area of AI is going to require some much more foundational and difficult proofs and verification techniques here. It seems like you’re saying it also requires an idea of an end goal of what the system is actually intended to do in order to verify that it satisfies that.

Roman: Right. You have to verify it against something. I have a paper on unverifiability where I talk about mathematical fundamental limits to what we can prove and verify mathematically. Already, we’re getting to the point where our mathematical proofs are so complex and so long, most human mathematicians cannot possibly even check if it’s legitimate or not.

We have examples of proofs where the mathematical community as a whole still has not decided if something published 10 years ago is a valid proof. If you’re talking about doing proofs on black box AI systems, now it seems like the only option we have is another AI mathematician verifying our AI, assisting us with that, but this creates multiple levels of interaction: who’s verifying the verifiers, and so on.

Lucas: It seems to me at least another expression of how deeply interdependent the AI alignment problem is. Technical AI alignment is a core issue, but it seems like even in simple things, or not simple things, but things which you would imagine to at least be purely relegated to computer science also has some sort of connections with ethics and policy and law and how these things will all sort of require each other in order to succeed in AI alignment.

Roman: I agree. You do need this complete picture. Overall, I mentioned it a few times before in other podcasts. It feels like in AI safety, every time we analyze a problem, we discover that it’s like a fractal. There are ten more problems under that one and you do it again. Despite the three levels, you still continue with this. It’s an infinite process.

We never get to a point where, “Okay, we solved this. This is not a problem anymore. We know for sure it works in every conceivable situation.” That’s a problem. You have this infinite surface you have to defend, but you only have to fail once to lose everything. It’s very, very different from standard cyber security where, “Okay, somebody stole my credit card. I’ll just get a new one. I’ll get to try again.” Very different approach.

Lucas: There’s no messing up with artificial superintelligence.

Roman: Basically.

Lucas: Just going off of what we were talking about earlier in terms of how AI safety researchers are flirting and interested in the applications of psychology in AI safety, what do you think about the potential future relationship between AI and neuroscience?

Roman: There is great work in neuroscience trying to understand measurements, from just observing neurons and cells to human behavior. There are some papers showing what happens if we do the same thing with computer processors: we just get a very good microscope and look at the CPU. “Was it playing a video game? Can we figure out connections between what Mario is doing and what electrical wiring is firing and so on?”

There seem to be a lot of mistakes made in that experiment. That tells us that the neuroscience experiments we’ve been doing for a very long time may be providing some less-than-perfect data for us. In a way, by doing AI work, we can also improve our understanding of the human brain, medical science, just general understanding of how neural networks work. It’s a feedback loop. Progress in either one benefits the other.

Lucas: It seems like people like Josh Tenenbaum are working on more neuro-inspired approaches to creating AGI. It seems that there are some people who have the view or the philosophy that the best way of getting to general intelligence is probably going to be understanding and studying human beings, because we’re an existence proof of general intelligence that can be studied. What are your views on this approach and the work being done there?

Roman: I think it’s a lot easier to copy answers to get to the results. In terms of developing a capable system, I think it’s the best option we have. I’m not so sure it leads to a safe system, because if you just copy the design, you don’t fully understand it. You can replicate it without complete knowledge, and then instilling safety into it as an afterthought, as an add-on later on, may be even more difficult than if you designed it from scratch yourself.

Lucas: A more general strategy and approach, which gets talked about a lot in the effective altruism community: there seems to be this view and you can correct me here anywhere I might get this narrative sort of wrong. It seems important to build the AGI safety community, the AI safety community in general, by bringing more researchers into the fold.

If we can slow down the people who are working on capability and raw intelligence and bring them over to safety, then that might be a very good thing because it slows down the creation of the intelligence part of AGI and puts more researchers into the part that’s working on safety and AI alignment. Then there’s also this tension where …

While that is a good thing, it may be a bad thing for us to be promoting AI safety or AGI safety to the public community because they probably just … Journalists would spin it and ruin it and trivialize it, turn it into a caricature of itself and just put Terminator photos on everything, which we at FLI are very aware that journalists like to put on people’s articles and publications. What is your general view about AI safety outreach and do you disagree with the respectability-first approach?

Roman: I’m an educator. I’m a professor. It’s my job to teach students, to educate the public, to inform everyone about science and hopefully more educated populace would benefit all of us. Research is funded through taxpayer grants. The public university is funded through taxpayers. The students paying tuition, the general public essentially.

If our goal is to align AI with values of the people, how can we keep people in the dark? They’re the ones who are going to influence elections. They are the ones who are going to decide what good governance of AI essentially is by voting for the right people. We put so much effort into governance of AI. We have efforts at UN, European Parliament, White House, you name it. There are now agreements between France and Canada on what to do with that.

At the end of the day, politicians listen to the public. If I can educate everyone about what the real issues in science are, I think it’s a pure benefit. It makes sense to raise awareness of long-term issues. We do it in every other field of science. Would you ever suggest it’s not a good idea to talk about climate change? No, of course not. It’s silly. We all participate in the system. We’re all impacted by the final outcome. It’s important to provide the good public outreach.

If your concern is the picture or the title of an article, well, work with better journalists, tell them they cannot use a picture of a Terminator. I do it. I tell them and they end up putting a very boring picture on it and nobody clicks on it. Is Terminator then an educational tool? I was able to explain some advanced computability concepts in a few minutes with simple trivial examples. When you educate people, you have to come to their level. You have to say, “Well, we do have concerns about military killer robots.” There’s nothing wrong with that, so maybe funding for killer robots should be reduced. If the public agrees, that’s wonderful.

Just kind of going, “If an article I published or an interview somebody did with me is less than perfect, then it’s not beneficial,” I disagree with that completely. It’s important to get to the public which is not already sold on the idea. Me doing an interview for you right now, right? I’m preaching to the choir. Most of your listeners are into AI safety, I’m sure. Or at least effective altruism.

Whereas if I do an interview for the BBC or something like that, now I’m getting access to millions of people who have no idea what superintelligence is. In my world and your world, this is like common knowledge, but I give a lot of keynotes and I go and speak to top executives at accounting firms and I ask them basic questions about technology. Maybe one has ever heard about superintelligence as a concept.

I think education is always a good thing. Having an educated populace is wonderful because that’s where funding will eventually come from for supporting our research and for helping us with AI governance. I’m a very strong supporter of outreach and I highly encourage everyone to do very good articles on it. If you feel that a journalist misrepresents your point of view, get in touch, get it fixed. Don’t just say that we’re going to leave the public in the dark.

Lucas: I definitely agree with that. I don’t really like this elitism that is part of the culture within some parts of AI safety community, which thinks that only the smartest, most niche people should be aware of this and working on it given the safety concerns and the ways in which it could be turned into something else.

Roman: I was a fellow at the Singularity Institute for Artificial Intelligence, which is now MIRI. At that time, they had a general policy of not publishing. They felt it was undesirable and would cause more damage. Now, they publish extensively. I had mentioned a few times that publishing was maybe a good idea.

The general idea of buying out top AI developers and turning them to the white side I guess and working on safety issues, I think that’s wonderful. We want the top people. It doesn’t mean we have to completely neglect less than big names. Everyone needs to be invited to the table in terms of support, in terms of grants. Don’t try to think that reputation means that only people at Harvard and MIT work in AI safety.

There is lots of talent everywhere. I work with remote assistants from around the world. There is so much talent out there. I think the results speak for themselves. I get invited to speak internationally. I advise governments, courts, legislative systems. I think reputation only grows with such outreach.

Lucas: For sure and it seems like the education on this, because it can seem fairly complicated and people can be really confused about it because I think that there are lots of common myths that people have about intelligence and “consciousness construed” in some way other than how I think you or I construe the term consciousness or the idea of free will or what it means to be intelligent. There’s just so much room for people to be confused about this issue.

The issue is real and it’s coming and people are going to find out about it whether or not we discuss it now. It seems very important that this happens, but also because like … It seems we also exist in a world where something like 40% to 50% of our country is at least skeptical about climate change. Climate change education and advocacy is very important and should be happening.

Even with all of that education and advocacy, there’s still something like around 40% of people who are skeptical about climate change. That issue has become politicized where people aren’t necessarily interested in facts. At least the skeptics are committed to party lines on the issue.

Roman: What would it be without education, if they never heard about the issue, would percentage be zero?

Lucas: I’m not advocating against education. I’m saying that this is an interesting existence case and saying like, “Yeah, we need more education about AI issues and climate change issues in general.”

Roman: I think there is maybe even more disagreement, not so much about how real the problem is, but about how to fix it. It turns into a political issue when you start talking about let’s increase taxation, let’s decrease taxation. That’s what politicizes it. That is not the fundamental science.

Lucas: I guess I just want to look this up actually just to figure out what the general American populace thinks. I think it was a bit wrong.

Roman: I don’t think it’s important what the exact percentage is. I think it’s general concept we care about.

Lucas: It’s a general concept, but I guess I was just potentially introducing a level of pessimism about why we need to educate people more so about AI alignment and AI safety in general just because these issues, even if you’re extremely skillful about them, can become politicized. Just generally the epistemology of America right now is exploding in a giant mess of bullshit. It’s just important that we educate clearly and correctly.

Roman: You don’t have to start with the most extreme examples or I don’t go with paperclip maximizers or whatever. You can talk about career selection, technological unemployment, basic income. Those things are quite understandable and they provide wonderful base for moving to the next level once we get there.

Lucas: Absolutely. Totally in agreement. How would you describe the typical interactions that you get from mainstream AI and CS researchers who just do sort of standard machine learning and don’t know or really think or care about AGI and ASI? When you talk to them and pitch to them like, “Hey, maybe you should be working on AI safety.” Or, “Hey, AI safety is something that is real, that you should care about.”

Roman: You’re right. There are different types of people based on their background knowledge. There is group one, which never heard of the concept. It’s just not part of their world. You can start by just sharing some literature and you can follow up later. Then there are people who are in complete agreement with you. They know it’s important. They understand the issue, but that’s their job they’re working and I think they are sympathetic to the cause.

Then there are people who heard a few kind of not the best attempts to explain what AI risk is, and so they are skeptical. They may be thinking about Terminator movie or something, Matrix, and so they are quite skeptical. In my personal experience, if I had a chance to spend 30 minutes to an hour with a person one-on-one, they all converted. I never had someone who went, “You told me things, but I have zero concern about intelligent systems having bugs in them or side effects or anything like that.”

I think it’s just a question of spending time and making it a friendly experience. You’re not adversaries trying to fight it out. You’re just going, “Hey, every single piece of software we ever produced had bugs in it and can be hacked. How is this different?”

Lucas: I agree with you, but there also seem to be these existence proofs and existence cases of people who are computer scientists and who are super skeptical about AI safety efforts and working on ASI safety, like Andrew Ng and others.

Roman: You have to figure out each individual on a case-by-case basis of course, but just being skeptical about the success of this approach is normal. I told you my main concern: is the problem solvable? That’s a degree of skepticism. Suppose we looked at any other industry. Let’s say we had the oil industry, and the top executives in the oil industry said that global climate change is not important. Just call it redistribution of good weather or something, it’s not a big deal.

You would immediately think there is some sort of conflict of interest, right? But how is this different? If you are strongly dependent on development, not on anything else, it just makes sense that you would be 100% for development. I don’t think it’s unnatural at all. Again, I think a good conversation and realignment of incentives would do miracles for such cases.

Lucas: It seems like either because Andrew Ng’s timelines are so long or he just thinks that they’re fundamentally, like there’s just not really a big problem. I think there are some computer scientists, researchers who just think there’s just not really a problem, because we’re making the systems and there are systems that are so intertwined with us that the values will just naturally mesh together or something. I’m just so surprised I guess that from the mainstream CS and AI people that you don’t run into more skeptics.

Roman: I don’t start my random interactions with people by trying to tell them, “You are wrong. Change your mind.” That’s usually not the best approach. Then you talk about specific cases and you can take it slowly and increase the level of concern. You can start by talking about algorithmic justice and bias in algorithms and software verification. I think you’ll get 100% support at all those levels.

What happens when your system is slightly more capable, you’re still working with me? I don’t think there is a gap where you go, “Well, at that point, everything becomes rosy and safe and we don’t have to worry about it.” If a disagreement is about how soon, I think it’s not a problem at all. Everything I argue still applies in 20 years, 50 years, 100 years.

If you’re saying it will take 100 years to get to superintelligence, how long will it take to learn how to control a system we don’t have yet? Probably way longer than that. Already, we should have started 50 years ago. It’s too late now. If anything, it strengthens my point that we should put more resources on the safety side.

Lucas: Absolutely. Just a question about generally your work cataloging failures of AI products and what this means for the future.

Roman: I collect examples, historical examples starting with the very first AI systems, still everyday news of how AI systems fail. The examples you all heard about. Self-driving car kills a pedestrian. Or Microsoft Tay chat bot becomes racist and swears at people. I have maybe about 50 or 60 so far. I keep collecting new ones. Feel free to send me lots of cool examples, but make sure they’re not already on my list.

The interesting thing is the patterns you can get from it, learn from, and use to predict future failures. One, obviously as AI becomes more common and we have more of those systems, the number of such failures grows. I think it grows exponentially, and the impact from them grows.

Now, we have intelligent systems trading in the stock market. I think they take up something like 85% of all stock trades. We had examples where they crash the whole stock market, brought down the volume by $1 trillion or something, closed significant amounts. This is very interesting data. I try to create a data set of those examples and there is some interest from industry to understand how to make their products not make my list in the future.

I think so far the only … It sounds like a trivial conclusion, but I think it’s fundamental. The only conclusion I have is that if you design an AI system to do X, it will very soon fail to X whatever X stands for. It seems like it’s only going to get worse as they become more general because the value of X becomes not just narrow. If you designed a system to play chess, then it will fail to win a chess match. That’s obvious and trivial. But if you design the system to run the world or something like that, what is X here?

Lucas: This makes me think about failure modes. Artificial superintelligence is going to have a probability space of failure modes where the severity of the failure at the worst end, which we covered in my last podcast, is that it would literally be turning the universe into the worst possible suffering imaginable for everyone for as long as possible. That’s some failure mode of ASI which has some probability which is unknown. Then the opposite on the other end is going to be, I guess, the most well-being and bliss for all possible minds which exist in that universe. Then there’s everything in between.

I guess the question is, is there any mapping or how important is it in mapping this probability space of failure modes? What are the failure modes that ASI can do or that would occur that would make it not value aligned? What are the probabilities of each of those given, I don’t know, the sort of architecture that we expect ASI to have or how we expect ASI to function?

Roman: I don’t think there is a worst and best case. I think it’s infinite in both directions. It can always get worse and always get better.

Lucas: But it’s constrained by what is physically possible.

Roman: Knowing what we know about physics and within this universe, there is a big multiverse out there possibly with different types of physics and simulated environments can create very interesting side effects as well. That’s not the point. I also collect predicted failures of future systems, part of the same report. You can look it up. It’s very interesting to see what usually scientists, but sometimes science fiction writers and other people, have said as potential examples.

It has things like paperclip maximizer and other examples. I look at predictions which are predictions but short-term. For example, we can talk about sex robots and how they’re going to fail. Someone hacks them, then they forget to stop. You forget your safe word. There are interesting possibilities.

Very useful both as an educational tool to get people to see this trend and go, “Okay. At every level of AI development, we had problems proportionate to the capability of AI. Give me a good argument why it’s not the case moving forward?” Very useful tool for AI safety researchers to predict. “Okay, we’re releasing this new system tomorrow. It’s capable of X.” How can we make sure the problems don’t follow?

I published on this, for example, before Microsoft released their Tay chatbot. Giving access to users to manipulate your learning data is usually not a safe option. If they just knew about it, maybe they wouldn’t embarrass themselves so bad.

Lucas: Wonderful. I guess just one last question here. My view was that given a superintelligence originating on earth, there would be a physical maximum of the amount of matter and energy which it could manipulate given our current understanding and laws of physics, which are certainly subject to change if we gain new information.

There is something which we could call, as Nick Bostrom explains, the cosmic endowment which is sort of the sphere around an intelligent species, which is running a superintelligent optimization process. Where the sphere represents the maximum amount of matter and energy, a.k.a., galaxies a superintelligence can reach before the universe expands so much that it’s no longer able to get beyond that point. Why is your view that there isn’t a potentially physical best or physical worst thing that, that optimization process could do?

Roman: Computation is done with respect to time. It may take you twice as long to compute something with the same resources, but you’ll still get that if you don’t have limits on your time. Or you create a subjective time for whoever is experiencing things. You can have computations which are not in parallel, serial computation devoted to a single task. It’s quite possible to create, for example, levels of suffering which progressively get worse I think. Again, I don’t encourage anyone experimenting with that, but it seems like things can get worse not just because of limitations, of how much computing I can do.

Lucas: All right. It’s really been a wonderful and exciting conversation Roman. If people want to check out your work or to follow you on Facebook or Twitter or wherever else, what do you recommend people go to read these papers and follow you?

Roman: I’m very active in social media. I do encourage you to follow me on Twitter, RomanYam, or on Facebook, Roman Yampolskiy. Just Google my name. My Google Scholar has all the papers and just trying to make a sell here. I have a new book coming out, Artificial Intelligence Safety and Security. It’s an edited book with all the top AI safety researchers contributing, and it’s due out in August, mid August. Already available for presale.

Lucas: Wow. Okay. Where can people get that? On Amazon?

Roman: Amazon is a great option. It’s published by CRC Press, so you have multiple options right now. I think it’s available as a softcover and hardcover, which are a bit pricey. It’s a huge book about 500 pages. Most people would publish it as a five book anthology, but you get one volume here. It should come out as a very affordable digital book as well, about $30 for 500 pages.

Lucas: Wonderful. That sounds exciting. I’m looking forward to getting my hands on that. Thanks again so much for your time. It’s really been an interesting conversation.

Roman: My pleasure and good luck with your podcast.

Lucas: Thanks so much. If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment Series.

[end of recorded material]

AI Alignment Podcast: Astronomical Future Suffering and Superintelligence with Kaj Sotala

In a classic taxonomy of risks developed by Nick Bostrom (seen below), existential risks are characterized as risks which are both terminal in severity and transgenerational in scope. If we were to maintain the scope of a risk as transgenerational and increase its severity past terminal, what would such a risk look like? What would it mean for a risk to be transgenerational in scope and hellish in severity?

Astronomical Future Suffering and Superintelligence is the second podcast in the new AI Alignment series, hosted by Lucas Perry. For those of you that are new, this series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Kaj Sotala, an associate researcher at the Foundational Research Institute. He has previously worked for the Machine Intelligence Research Institute, and has publications on AI safety, AI timeline forecasting, and consciousness research.

Topics discussed in this episode include:

  • The definition of and a taxonomy of suffering risks
  • How superintelligence has special leverage for generating or mitigating suffering risks
  • How different moral systems view suffering risks
  • What is possible of minds in general and how this plays into suffering risks
  • The probability of suffering risks
  • What we can do to mitigate suffering risks

In this interview we discuss ideas contained in a paper by Kaj Sotala and Lukas Gloor. You can find the paper here: Superintelligence as a Cause or Cure for Risks of Astronomical Suffering. You can hear about this paper in the podcast above or read the transcript below.

Lucas: Hi, everyone. Welcome back to the AI Alignment Podcast of the Future of Life Institute. If you are new or just tuning in, this is a new series at FLI where we’ll be speaking with a wide variety of technical and nontechnical domain experts regarding the AI alignment problem, also known as the value alignment problem. If you’re interested in AI alignment, the Future of Life Institute, existential risks, and similar topics in general, please remember to like and subscribe to us on SoundCloud or your preferred listening platform.

Today, we’ll be speaking with Kaj Sotala. Kaj is an associate researcher at the Foundational Research Institute. He has previously worked for the Machine Intelligence Research Institute, and has publications in the areas of AI safety, AI timeline forecasting, and consciousness research. Today, we speak about suffering risks, a class of risks most likely brought about by new technologies, like powerful AI systems that could potentially lead to astronomical amounts of future suffering through accident or technical oversight. In general, we’re still working out some minor kinks with our audio recording. The audio here is not perfect, but does improve shortly into the episode. Apologies for any parts that are less than ideal. With that, I give you Kaj.

Lucas: Thanks so much for coming on the podcast, Kaj. It’s super great to have you here.

Kaj: Thanks. Glad to be here.

Lucas: Just to jump right into this, could you explain a little bit more about your background and how you became interested in suffering risks, and what you’re up to at the Foundational Research Institute?

Kaj: Right. I became interested in all of this stuff about AI and existential risks way back in high school when I was surfing the internet until I somehow ran across the Wikipedia article for the technological singularity. After that, I ended up reading Eliezer Yudkowsky’s writings, and writings by other people. At one point, I worked for the Machine Intelligence Research Institute, immersed in doing strategic research, did some papers on predicting AI that makes a lot of sense together with Stuart Armstrong of the Future of Humanity Institute. Eventually, MIRI’s research focus shifted toward more technical and mathematical work, which wasn’t exactly my strength, and at that point we parted ways and I went back to finish my master’s degree in computer science. Then after I graduated, I ended up being contacted by the Foundational Research Institute, who had noticed my writings on these topics.

Lucas: Could you just unpack a little bit more about what the Foundational Research Institute is trying to do, or how they exist in the effective altruism space, and what the mission is and how they’re differentiated from other organizations?

Kaj: They are the research arm of the Effective Altruism Foundation in the German-speaking area. The Foundational Research Institute’s official tagline is, “We explain how humanity can best reduce suffering.” The general idea is that a lot of people have this intuition that if you are trying to improve the world, then there is a special significance on reducing suffering, and especially that outcomes involving extreme suffering have some particular moral priority, so we should be looking at how to prevent those. In general, the FRI has been looking at things like the long-term future and how to best reduce suffering at long-term scales, including things like AI and emerging technologies in general.

Lucas: Right, cool. At least my understanding is, and you can correct me on this, is that the way that FRI sort of leverages what it does is that … Within the effective altruism community, suffering risks are very large in scope, but it’s also a topic which is very neglected, but also low in probability. Has FRI really taken this up due to that framing, due to its neglectedness within the effective altruism community?

Kaj: I wouldn’t say that the decision to take it up was necessarily an explicit result of looking at those considerations, but in a sense, the neglectedness thing is definitely a factor, in that basically no one else seems to be looking at suffering risks. So far, most of the discussion about risks from AI and that kind of thing has been focused on risks of extinction, and there have been people within FRI who feel that risks of extreme suffering might actually be very plausible, and may be even more probable than risks of extinction. But of course, that depends on a lot of assumptions.

Lucas: Okay. I guess just to move forward here and jump into it, given FRI’s mission and what you guys are all about, what is a suffering risk, and how has this led you to this paper?

Kaj: The definition that we have for suffering risks is that a suffering risk is a risk where an adverse outcome would bring about severe suffering on an astronomical scale, so vastly exceeding all suffering that has existed on earth so far. The general thought here is that if we look at the history of earth, then we can probably all agree that there have been a lot of really horrible events that have happened, and enormous amounts of suffering. If you look at something like the Holocaust or various other terrible events that have happened throughout history, there is an intuition that we should make certain that nothing this bad happens ever again. But then if we start looking at what might happen if humanity, for instance, colonizes space one day, then if current trends continue, you might think that there is no reason why such terrible events wouldn’t just repeat themselves over and over again as we expand into space.

That’s sort of one of the motivations here. The paper we wrote is specifically focused on the relation between suffering risks and superintelligence, because like I mentioned, there has been a lot of discussion about superintelligence possibly causing extinction, but there might also be ways by which superintelligence might either cause suffering risks, for instance in the form of some sort of uncontrolled AI, or alternatively, if we could develop some kind of AI that was aligned with humanity’s values, then that AI might actually be able to prevent all of those suffering risks from ever being realized.

Lucas: Right. I guess just, if we’re really coming at this from a view of suffering-focused ethics, where we’re really committed to mitigating suffering, even if we just view sort of the history of suffering and take a step back, like, for 500 million years, evolution had to play out to reach human civilization, and even just in there, there’s just a massive amount of suffering, in animals evolving and playing out and having to fight and die and suffer in the ancestral environment. Then one day we get to humans, and in the evolution of life on earth, we create civilization and technologies. It seems, and you give some different sorts of plausible reasons why, that either for ignorance or efficiency or, maybe less likely, malevolence, we use these technologies to get things that we want, and these technologies seem to create tons of suffering.

In our history so far, we’ve had things … Like you mentioned, the invention of the ship has helped lead to slavery, which created an immense amount of suffering. Modern industry has led to factory farming, which has created an immense amount of suffering. As we move forward and we create artificial intelligence systems and potentially even one day superintelligence, we’re really able to mold the world more so into a more extreme state, where we’re able to optimize it much harder. In that optimization process, it seems the core of the problem lies, is that when you’re taking things to the next level and really changing the fabric of everything in a very deep and real way, that suffering can really come about. The core of the problem seems that, when technology is used to fix certain sorts of problems, like that we want more meat, or that we need more human labor for agriculture and stuff, that in optimizing for those things we just create immense amounts of suffering. Does that seem to be the case?

Kaj: Yeah. That sounds like a reasonable characterization.

Lucas: Superintelligence seems to be one of these technologies that we have particular reason to worry about creating suffering risks. What are the characteristics, properties, and attributes of computing and artificial intelligence and artificial superintelligence that give it this special leverage in being risky for creating suffering risks?

Kaj: There’s obviously the thing about superintelligence potentially, as you mentioned, being able to really reshape the world at a massive scale. But if we compare what is the difference between a superintelligence that is capable of reshaping the world at a massive scale versus humans doing the same using technology … A few specific scenarios that we have been looking at in the paper is, for instance, if we compare to a human civilization, then a major force in human civilizations is that most humans are relatively empathic, and while we can see that humans are willing to cause others serious suffering if that is the only, or maybe even the easiest way of achieving their goals, a lot of humans still want to avoid unnecessary suffering. For instance, currently we see factory farming, but we also see a lot of humans being concerned about factory farming practices, a lot of people working really hard to reform things so that there would be less animal suffering.

But if we look at, then, artificial intelligence, which was running things, then if it is not properly aligned with our values, and in particular if it does not have something that would correspond to a sense of empathy, and it’s just actually just doing whatever things maximize its goals, and its goals do not include prevention of suffering, then it might do things like building some kind of worker robots or subroutines that are optimized for achieving whatever goals it has. But if it turns out that the most effective way of making them do things is to build them in such a way that they suffer, then in that case there might be an enormous amount of suffering agents with no kind of force that was trying to prevent their existence or trying to reduce the amount of suffering in the world.

Another scenario is the possibility of mind-crime. This is discussed in Bostrom’s Superintelligence briefly. The main idea here is that if the superintelligence creates simulations of sentient minds, for instance for scientific purposes or the purposes of maybe blackmailing some other agent in the world by torturing a lot of minds in those simulations, AI might create simulations of human beings that were detailed enough to be conscious. Then you mentioned earlier the thing about evolution already having created a lot of suffering. If the AI were similarly to simulate evolution or simulate human societies, again without caring about the amount of suffering within those simulations, then that could again cause vast amounts of suffering.

Lucas: I definitely want to dive into all of these specific points with you as they come up later in the paper, and we can really get into and explore them. But so, really just to take a step back and understand what superintelligence is and the different sorts of attributes that it has, and how it’s different than human beings and how it can lead to suffering risk. For example, there seems to be multiple aspects here where we have to understand superintelligence as a general intelligence running at digital timescales rather than biological timescales.

It also has the ability to copy itself, and rapidly write and deploy new software. Human beings have to spend a lot of time, like, learning and conditioning themselves to change the software on their brains, but due to the properties and features of computers and machine intelligence, it seems like copies could be made for very, very cheap, it could be done very quickly, they would be running at digital timescales rather than biological timescales.

Then it seems there’s the whole question about value-aligning the actions and goals of this software and these systems and this intelligence, and how in the value alignment process there might be technical issues where, due to difficulties in AI safety and value alignment efforts, we’re not able to specify or really capture what we value. That might lead to scenarios like you were talking about, where there would be something like mind-crime, or suffering subroutines which would exist due to their functional usefulness or epistemic usefulness. Is there anything else there that you would like to add and unpack about why superintelligence specifically has a lot of leverage for leading to suffering risks?

Kaj: Yeah. I think you covered most of the things. I think the thing that they are all leading to that I just want to specifically highlight is the possibility of the superintelligence actually establishing what Nick Bostrom calls a singleton, basically establishing itself as a single leading force that basically controls the world. I guess in one sense you could talk about singletons in general and their impact on suffering risks, rather than superintelligence specifically, but at this time it does not seem very plausible, or at least I cannot foresee, very many other paths to a singleton other than superintelligence. That was a part of why we were focusing on superintelligence in particular.

Lucas: Okay, cool. Just to get back to the overall structure of your paper, what are the conditions here that you cover that must be met in order for s-risks to merit our attention? Why should we care about s-risks? Then what are all the different sorts of arguments that you’re making and covering in this paper?

Kaj: Well, basically, in order for any risk, suffering risks included, to merit work on them, they should meet three conditions. The first is that the outcome of the risk should be sufficiently severe to actually merit attention. Second, the risk must have some reasonable probability of actually being realized. Third, there must be some way for risk avoidance work to actually reduce either the probability or the severity of the adverse outcome. If something is going to happen for certain and it’s very bad, then if we cannot influence it, then obviously we cannot influence it, and there’s no point in working on it. Similarly, if some risk is very implausible, then it might not be the best use of resources. Also, if it’s very probable but wouldn’t cause a lot of damage, then it might be better to focus on risks which would actually cause more damage.
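The three conditions Kaj lists can be read as a simple screening check. The following is only a minimal sketch of that reading: the 0-to-1 scales, the threshold values, and the names `Risk`, `failed_conditions`, and `merits_work` are illustrative assumptions, not anything taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """A risk scored on three hypothetical 0-to-1 scales (assumed for illustration)."""
    name: str
    severity: float      # how bad the adverse outcome would be
    probability: float   # how likely the outcome is to be realized
    tractability: float  # how much work can reduce its probability or severity


def failed_conditions(risk, min_severity=0.5, min_probability=0.01,
                      min_tractability=0.1):
    """Return the names of the conditions the risk fails (thresholds are assumptions)."""
    failed = []
    if risk.severity < min_severity:
        failed.append("severity")
    if risk.probability < min_probability:
        failed.append("probability")
    if risk.tractability < min_tractability:
        failed.append("tractability")
    return failed


def merits_work(risk, **thresholds):
    """A risk merits work only if it passes all three conditions."""
    return not failed_conditions(risk, **thresholds)


# Three made-up risks, each failing exactly one condition.
examples = [
    Risk("certain and very bad, but cannot be influenced", 1.0, 1.0, 0.0),
    Risk("very bad but extremely implausible", 1.0, 1e-12, 0.5),
    Risk("very probable but causes little damage", 0.001, 0.99, 0.9),
]

for r in examples:
    print(r.name, "->", failed_conditions(r) or "merits work")
```

Each example mirrors one of the cases Kaj describes: an outcome that cannot be influenced, an implausible one, and a probable but low-damage one.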

Lucas: Right. I guess just some specific examples here real quick. The differences here are essentially between, like, the death of the universe, if we couldn’t do anything about it, we would just kind of have to deal with that, then sort of like a Pascal mugging situation, where a stranger just walks up to you on the street and says, “Give me a million dollars or I will simulate 10 to the 40 conscious minds suffering until the universe dies.” The likelihood of that is just so low that you wouldn’t have to deal with it. Then it seems like the last scenario would be, like, you know that you’re going to lose a hair next week, and that’s just sort of like an imperceptible risk that doesn’t matter, but that has very high probability. Then getting into the meat of the paper, what are the arguments here that you make regarding suffering risks? Does suffering risk meet these criteria for why it merits attention?

Kaj: Basically, the paper is roughly structured around those three criteria that we just discussed. We basically start by talking about what the s-risks are, and then we seek to establish that if they were realized, they would indeed be bad enough to merit our attention. In particular, we argue that many value systems would consider some classes of suffering risks to be as bad or worse than extinction. Also, we cover some suffering risks which are somewhat less severe than extinction, but still, according to many value systems, very bad.

Then we move on to look at the probability of the suffering risks to see whether it is actually plausible that they will be realized. We survey what might happen if nobody builds a superintelligence, or maybe more specifically, if there is no singleton that could prevent suffering risks that might be realized sort of naturally, in the absence of a singleton.

We also look at, okay, if we do have a superintelligence or a singleton, what suffering risks might that cause? Finally, we look at the last question, of the tractability. Can we actually do anything about these suffering risks? There we also have several suggestions of what we think would be the kind of work that would actually be useful in either reducing the risk or the severity of suffering risks.

Lucas: Awesome. Let’s go ahead and move sequentially through these arguments and points which you develop in the paper. Let’s start off here by just trying to understand suffering risk just a little bit more. Can you unpack the taxonomy of suffering risks that you develop here?

Kaj: Yes. We’ve got three possible outcomes of suffering risks. Technically, a risk is something that may or may not happen, so three specific outcomes of what might happen. The three outcomes, I’ll just briefly give their names and then unpack them. We’ve got what we call astronomical suffering outcomes, net suffering outcomes, and pan-generational net suffering outcomes.

I’ll start with the net suffering outcome. Here, the idea is that if we are talking about a risk which might be of a comparable severity as risks of extinction, then one way you could get that is if, for instance, we look from the viewpoint of something like classical utilitarianism. You have three sorts of people. You have people who have a predominantly happy life, you have people who never exist or have a neutral life, and you have people who have a predominantly unhappy life. As a simplified moral calculus, you just assign the people with happy lives a plus-one, and you assign the people with unhappy lives a minus-one. Then according to this very simplified moral system, then you would see that if we have more unhappy lives than there are happy lives, then technically this would be worse than there not existing any lives at all.

That is what we call a net suffering outcome. In other words, at some point in time there are more people experiencing lives that are more unhappy than happy, and there are people experiencing lives which are the opposite. Now, if you have a world where most people are unhappy, then if you’re optimistic you might think that, okay, it is bad, but it is not necessarily worse than extinction, because if you look ahead in time, then maybe the world will go on and conditions will improve, and then after a while most people actually live happy lives, so maybe things will get better. We define an alternative scenario in which we just assume that things actually won’t get better, and if you sum over all of the lives that will exist throughout history, most of them still end up being unhappy. Then that would be what we call a pan-generational net suffering outcome. When summed over all the people that will ever live, there are more people experiencing lives filled predominantly with suffering than there are people experiencing lives filled predominantly with happiness.

You could also have what we call astronomical suffering outcomes, which is just that at some point in time there’s some fraction of the population which experiences terrible suffering, and the amount of suffering here is enough to constitute an astronomical amount that overcomes all the suffering in earth’s history. Here we are not making the assumption that the world would be mainly filled with these kinds of people. Maybe you have one galaxy’s worth of people in terrible pain, and 500 galaxies’ worth of happy people. According to some value systems, that would not be worse than extinction, but probably all value systems would still agree that even if this wasn’t worse than extinction, it would still be something that would be very much worth avoiding. Those are the three outcomes that we discuss here.
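The simplified plus-one / minus-one calculus Kaj describes, and the three outcomes built on it, can be sketched in a few lines. This is only an illustration under stated assumptions: the `Population` class, the function names, the units (arbitrary counts of lives), and the `earth_history_suffering` threshold are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Population:
    """Lives at one point in time, in arbitrary units (assumed for illustration).

    Neutral or never-existing lives count as zero, so they are left out.
    """
    happy: float    # predominantly happy lives, each scored +1
    unhappy: float  # predominantly unhappy lives, each scored -1

    def net(self):
        return self.happy - self.unhappy


def is_net_suffering(pop):
    """At one point in time, unhappy lives outnumber happy lives."""
    return pop.net() < 0


def is_pan_generational_net_suffering(history):
    """Summed over every period in history, unhappy lives still outnumber happy ones."""
    return sum(p.net() for p in history) < 0


def is_astronomical_suffering(pop, earth_history_suffering):
    """Suffering exceeds all suffering in earth's history (threshold is an assumption)."""
    return pop.unhappy > earth_history_suffering


# A bad period followed by a better one: net suffering now, but not over all time.
history = [Population(happy=2, unhappy=8), Population(happy=9, unhappy=1)]
print(is_net_suffering(history[0]))
print(is_pan_generational_net_suffering(history))

# Kaj's galaxy example, in units of "one galaxy's worth of lives";
# the 1e-9 figure for earth's history is a made-up placeholder.
galaxies = Population(happy=500, unhappy=1)
print(is_astronomical_suffering(galaxies, earth_history_suffering=1e-9))
print(is_net_suffering(galaxies))
```

The galaxy case shows why the categories are separate: it counts as astronomical suffering under the assumed threshold while still having far more happy lives than unhappy ones.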

Lucas: Traditionally, the sort of far-future concerned community has mainly only been thinking about existential risks. Do you view this taxonomy and suffering risks in general as being a subset of existential risks? Or how do you view it in relation to what we traditionally view as existential risks?

Kaj: If we look at Bostrom’s original definition for an existential risk, the definition was that it is a risk where an adverse outcome would either annihilate earth-originating intelligent life, or permanently and drastically curtail its potential. Here it’s a little vague on how exactly you should interpret phrases like “permanently and drastically curtail our potential.” You could take the view that suffering risks are a subset of existential risks if you view our potential as being something like the realization of a civilization full of happy people, where nobody ever needs to suffer. In that sense, it would be a subset of existential risks.

It is most obvious with the net suffering outcomes. It seems pretty plausible that most people experiencing suffering would not be the realization of our full potential. Then if you look at something like near-astronomical suffering outcomes, where you might only have a small fraction of the population experiencing suffering, then that, depending on exactly how large the fraction, then you might maybe not count it as a subset of existential risks, and maybe something more comparable to catastrophic risks, which have usually been defined on the order of a few million people dying. Obviously, the astronomical suffering outcomes are worse than catastrophic risks, but maybe something more comparable to catastrophic risks than existential risks.

Lucas: Given the taxonomy that you’ve gone ahead and unpacked, what are the different sorts of perspectives that different value systems on earth have of suffering risks? Just unpack a little bit what the general value systems are that human beings are running in their brains.

Kaj: If we look at ethics, philosophers have proposed a variety of different value systems and ethical theories. If we just look at a few of the main ones, then something like classical utilitarianism, where you basically view worlds as good based on what is the balance of happiness minus suffering. Then if you look at what would be the view of classical utilitarianism on suffering risks, classical utilitarianism would find these worst kinds of outcomes, net suffering outcomes as worse than extinction. But they might find astronomical suffering outcomes as an acceptable cost of having even more happy people. They might look at that, one galaxy full of suffering people, and think that, “Well, we have 200 galaxies full of happy people, so it’s not optimal to have those suffering people, but we have even more happy people, so that’s okay.”

A lot of moral theories are not necessarily explicitly utilitarian, or they might have a lot of different components and so on, but a lot of them still include some kind of aggregative component, meaning that they still have some element of, for instance, looking at suffering and saying that other things being equal, it’s worse to have more suffering. This would, again, find suffering risks something to avoid, depending on exactly how they weight things and how they value things. Then it will depend on those specific weightings, on whether they find suffering risks as worse than extinction or not.

Also worth noting that even if the theories wouldn’t necessarily talk about suffering exactly, they might still talk about something like preference satisfaction, whether people are having their preferences satisfied, some broader notion of human flourishing, and so on. In scenarios where there is a lot of suffering, probably a lot of these things that these theories consider valuable would be missing. For instance, if there is a lot of suffering and people cannot escape that suffering, then probably there are lots of people whose preferences are not being satisfied, if they would prefer not to suffer and they would prefer to escape the suffering.

Then there are all kinds of rights-based theories, which don’t necessarily have this aggregative component directly, but are more focused on thinking in terms of rights, which might not be summed together directly, but depending on how these theories would frame rights … For instance, some theories might hold that people or animals have a right to avoid unnecessary suffering, or these kinds of theories might consider suffering indirectly bad if the suffering was created by some condition which violated people’s rights. Again, for instance, if people have a right to meaningful autonomy and they are in circumstances in which they cannot escape their suffering, then you might hold that their right to meaningful autonomy has been violated.

Then there are a bunch of moral intuitions which might fit a number of moral theories and which might prioritize the prevention of suffering in particular. I mentioned that classical utilitarianism basically weights extreme happiness and extreme suffering the same, so it will be willing to accept a large amount of suffering if you could produce even more happiness that way. But for instance, there have been moral theories like prioritarianism proposed, which might make a different judgment.

Prioritarianism is the position that the worse off an individual is, the more morally valuable it is to make that individual better off. If one person is living in hellish conditions and another is well-off, then if you could sort of give either one of them five points of extra happiness, then it would be much more morally pressing to help the person who was in more pain. This seems like an intuition that I think a lot of people share, and if you had something like some kind of an astronomical prioritarianism that considered all across the universe and prioritized improving the worst ones off, then that might push in the direction of mainly improving the lives of those that would be worst off and avoiding suffering risks.
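
One way to picture the difference between the two views is a small sketch. It assumes wellbeing can be put on a single numeric scale, and it uses an arbitrary concave transform to stand in for prioritarian weighting; the function names, the `scale` parameter, and the example wellbeing levels are all hypothetical and not taken from the discussion.

```python
import math


def classical_gain(wellbeing: float, boost: float) -> float:
    """Classical utilitarian view: a boost counts the same whoever receives it."""
    return boost


def prioritarian_value(wellbeing: float, scale: float = 10.0) -> float:
    """Hypothetical concave transform: increasing, but flatter for the well-off."""
    return 1.0 - math.exp(-wellbeing / scale)


def prioritarian_gain(wellbeing: float, boost: float, scale: float = 10.0) -> float:
    """Moral value of raising someone from `wellbeing` to `wellbeing + boost`."""
    return prioritarian_value(wellbeing + boost, scale) - prioritarian_value(wellbeing, scale)


# One person in hellish conditions, one well-off, and five extra points to give.
worse_off, well_off, boost = -20.0, 20.0, 5.0
print(classical_gain(worse_off, boost) == classical_gain(well_off, boost))
print(prioritarian_gain(worse_off, boost) > prioritarian_gain(well_off, boost))
```

Both comparisons come out true: the classical view is indifferent about who receives the five points, while any concave weighting of this kind makes helping the worse-off person count for more.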

Then there are a few other sort of suffering-focused intuitions. A lot of moral intuitions have this intuition that it’s more important to make people happy than it is to create new happy people. This one is rather controversial, and a lot of EA circles seem to reject this intuition. It’s true that there are some strong arguments against it, but on the other hand, rejecting it also seems to lead to some paradoxical conclusions. Here, the idea behind this intuition is that the most important thing is helping existing people. If we think about, for instance, colonizing the universe, someone might argue that if we colonized the universe, then that will create lots of new lives who will be happy, and that will be a good thing, even if this comes at the cost of creating a vast number of unhappy lives as well. But if you take the view that the important thing is just making existing lives happy and we don’t have any special obligation to create new lives that are happy, then it also becomes questionable whether it is worth the risk of creating a lot of suffering for the sake of just creating happy people.

Also, there is an intuition of, torture-level suffering cannot be counterbalanced. Again, there are a bunch of good arguments against this one. There’s a nice article by Toby Ord called “Why I Am Not a Negative Utilitarian,” which argues against versions of this thesis. But at the same time, there does seem to be something that has a lot of intuitive weight for a lot of people. Here the idea is that there are some kinds of suffering so intense and immense that you cannot really justify that with any amount of happiness. David Pearce has expressed this well in his quote where he says, “No amount of happiness or fun enjoyed by some organisms can notionally justify the indescribable horrors of Auschwitz.” Here we must think that, okay, if we go out and colonize the universe, and then we know that colonizing the universe is going to create some equivalent event as what went on in Auschwitz and at other genocides across the world, then no amount of happiness that we create that way will be worth that terrible terror that would probably also be created if there was nothing to stop it.

Finally, there’s an intuition of happiness being the absence of suffering, which is the sort of intuition that is present in Epicureanism and some non-Western traditions, such as Buddhism, where happiness is thought of as being the absence of suffering. The idea is that when we are not experiencing any pleasure, we begin to crave pleasure, and it is this craving that constitutes suffering. Under this view, happiness does not have intrinsic value, but rather it has instrumental value in taking our focus away from suffering and helping us avoid suffering that way. Under that view, creating additional happiness doesn’t have any intrinsic value if that creation does not help us avoid suffering.

I mentioned here a few of these suffering-focused intuitions. Now, in presenting these, my intent is not to say that there would not also exist counter-intuitions. There are a lot of reasonable people who disagree with these intuitions. But the general point that I’m just expressing is that regardless of which specific moral system we are talking about, these are the kinds of intuitions that a lot of people find plausible, and which could reasonably fit in a lot of different moral theories and value systems, and probably a lot of value systems contain some version of these.

Lucas: Right. It seems like the general idea is just that whether you’re committed to some sort of form of consequentialism or deontology or virtue ethics, or perhaps something that’s even potentially theological, there are lots of aggregative or non-aggregative, or virtue-based or rights-based reasons for why we should care about suffering risks. Now, it seems to me that potentially here probably what’s most important, or where these different normative and meta-ethical views matter in their differences, is in how you might proceed forward and engage in AI research and in deploying and instantiating AGI and superintelligence, given your commitment more or less to a view which takes the aggregate, versus one which does not. Like you said, if you take a classical utilitarian view, then one might be more biased towards risking suffering risks given that there might still be some high probability of there being many galaxies which end up having very net positive experiences, and then maybe one where there might be some astronomical suffering. How do you view the importance of resolving meta-ethical and normative ethical disputes in order to figure out how to move forward in mitigating suffering risks?

Kaj: The general problem here, I guess you might say, is that there exist trade-offs between suffering risks and existential risks. If we had a scenario where some advanced general technology or something different might constitute an existential risk to the world, then someone might think about trying to solve that with AGI, which might have some probability of not actually working properly and not actually being value-aligned. But someone might think that, “Well, if we do not activate this AGI, then we are all going to die anyway, because of this other existential risk, so might as well activate it.” But then if there is a sizable probability of the AGI actually causing a suffering risk, as opposed to just an existential risk, then that might be a bad idea. As you mentioned, the different value systems will make different evaluations about these trade-offs.

In general, I’m personally pretty skeptical about actually resolving ethics, or solving it in a way that would be satisfactory to everyone. I expect that a lot of the differences between meta-ethical views could just be based on moral intuitions that may come down to factors like genetics or the environment where you grew up, or whatever, and which are not actually very factual in nature. Someone might just think that some specific, for instance, suffering-focused intuition was very important, and someone else might think that actually that intuition makes no sense at all.

The general approach, I would hope, that people take is that if we have decisions where we have to choose between an increased risk of extinction or an increased risk of astronomical suffering, then it would be better if people from all ethical and value systems would together try to cooperate. Rather than risk conflict between value systems, a better alternative would be to attempt to identify interventions which did not involve trading off one risk for another. If there were interventions that reduced the risk of extinction without increasing the risk of astronomical suffering, or decreased the risk of astronomical suffering without increasing the risk of extinction, or decreased both risks, then it would be in everyone’s interest if we could agree, okay, whatever our moral differences, let’s just jointly focus on these classes of interventions that actually seem to be a net positive in at least one person’s value system.

Lucas: Like you identify in the paper, it seems like the hard part is when you have trade-offs.

Kaj: Yes.

Lucas: Given this, given that most value systems should care about suffering risks, now that we’ve established the taxonomy and understanding of what suffering risks are, discuss a little bit about how likely suffering risks are relative to existential risks and other sorts of risks that we encounter.

Kaj: As I mentioned earlier, these depend somewhat on, are we assuming a superintelligence or a singleton or not? Just briefly looking at the case where we do not assume a superintelligence or singleton, we can see that in history so far there does not seem to be any consistent trend towards reduced suffering, if you look at a global scale. For instance, the advances in seafaring enabled the transatlantic slave trade, and similarly, advances in factory farming practices have enabled large amounts of animals being kept in terrible conditions. You might plausibly think that the net balance of suffering and happiness caused by the human species right now was actually negative due to all of the factory farmed animals, although it is another controversial point. Generally, you can see that if we just extrapolated the trends so far to the future, then we might see that, okay, there isn’t any obvious sign of there being less suffering in the world as technology develops, so it seems like a reasonable assumption, although not the only possible assumption, that as technology advances, it will also continue to enable more suffering, and future civilizations might also have large amounts of suffering.

If we look at the outcomes where we do have a superintelligence or a singleton running the world, here things get, if possible, even more speculative. In the beginning, we can at least think of some plausible-seeming scenarios in which a superintelligence might end up causing large amounts of suffering, such as building suffering subroutines. It might create mind-crime. It might also try to create some kind of optimal human society, but some sort of the value learning or value extrapolation process might be what some people might consider incorrect in such a way that the resulting society would also have enormous amounts of suffering. While it’s impossible to really give any probability estimates on exactly how plausible is a suffering risk, and depends on a lot of your assumptions, it does at least seem like a plausible thing to happen with a reasonable probability.

Lucas: Right. It seems that just technology, like intrinsic to what technology is, is it’s giving you more leverage and control over manipulating and shaping the world. As you gain more causal efficacy over the world and other sentient beings, it seems kind of obvious that yeah, you also gain more ability to cause suffering, because your causal efficacy is increasing. It seems very important here to isolate the causal factors in people and just in the universe in general, which lead to this great amount of suffering. Technology is a tool, a powerful tool, and it keeps getting more powerful. The hand by which the tool is guided is ethics.

But it doesn’t seem that historically, and in the case of superintelligence as well, that primarily the vast amounts of suffering that have been caused are because of failures in ethics. I mean, surely there have been large failures in ethics, but evolution is just an optimization process which leads to vast amounts of suffering. There could be similar evolutionary dynamics in superintelligence which lead to great amounts of suffering. It seems like issues with factory farming and slavery are not due to some sort of intrinsic malevolence in people, but rather it seems sort of like an ethical blind spot and apathy, and also a solution to an optimization problem where we get meat more efficiently, and we get human labor more efficiently. It seems like we can apply these lessons to superintelligence. It seems like it’s not likely that superintelligence will produce astronomical amounts of suffering due to malevolence.

Kaj: Right.

Lucas: Or like, intentional malevolence. It seems there might be, like, a value alignment problem or mis-specification, or just generally in optimizing that there might be certain things, like mind-crime or suffering subroutines, which are functionally very useful or epistemically very useful, and in their efficiency for making manifest other goals, they perhaps astronomically violate other values which might be more foundational, such as the mitigation of suffering and the promotion of wellbeing across all sentient beings. Does that make sense?

Kaj: Yeah. I think one way I might phrase that is that we should expect there to be less suffering if the incentives created by the future world for whatever agents are acting there happen to align with doing the kinds of things that cause less suffering. And vice versa, if the incentives just happen to align with actions that cause agents great personal benefit, or at least the agents that are in power great personal benefit, with suffering actually being the inevitable consequence of following those incentives, then you would expect to see a lot of suffering. As you mentioned, with evolution, there isn’t even an actual agent to speak of, but just sort of a free-running optimization process, and the solutions which that optimization process has happened to hit on have just happened to involve large amounts of suffering. There is a major risk of a lot of suffering being created by the kinds of processes that are actually not actively malevolent, and some of which might actually care about preventing suffering, but then just the incentives are such that they end up creating suffering anyway.

Lucas: Yeah. I guess what I find very fascinating and even scary here is that there are open questions regarding the philosophy of mind and computation and intelligence, where we can understand pain and anger and pleasure and happiness and all of these hedonic valences within consciousness as, at very minimum, being correlated with cognitive states which are functionally useful. These hedonic valences are informationally sensitive, and so they give us information about the world, and they sort of provide a functional use. You discuss here how it seems like anger and pain and suffering and happiness and joy, all of these seem to be functional attributes of the mind that evolution has optimized for, and they may or may not be the ultimate solution or the best solution, but they are good solutions to avoiding things which may or may not be bad for us, and promoting behaviors which lead to social cohesion and group coordination.

I think there’s a really deep and fundamental question here about whether or not minds in principle can be created to have informationally-sensitive, hedonically-positive states. As David Pearce puts it, there’s sort of an open question about, I think, whether or not minds in principle can be created to function on informationally-sensitive gradients of bliss. If that ends up being false, and that anger and suffering end up providing some really fundamental functional and epistemic place in minds in general, then I think that that’s just a hugely fundamental problem about the future and the kinds of minds that we should or should not create.

Kaj: Yeah, definitely. Of course, if we are talking about avoiding outcomes with extreme suffering, perhaps you might have scenarios where it is unavoidable to have some limited amount of suffering, but you could still create minds that were predominantly happy, and maybe they got angry and upset at times, but that would be a relatively limited amount of suffering that they experienced. You can definitely already see that there are some people alive who just seem to be constantly happy, and don’t seem to suffer very much at all. But of course, there is also the factor that if you are running on so-called negative emotions, and you do have anger and that kind of thing, then you are, again, probably more likely to react to situations in ways which might cause more suffering in others, as well as yourself. If we could create the kinds of minds that only had a limited amount of suffering from negative emotions, then you could [inaudible 00:49:27] that they happened to experience a bit of anger and lash out at others probably still wouldn’t be very bad, since other minds still would only experience the limited amount of suffering.

Of course, this gets to various philosophy of mind questions, as you mentioned. Personally, I tend to lean towards the views that it is possible to disentangle pain and suffering from each other. For instance, various Buddhist meditative practices are actually making people capable of experiencing pain without experiencing suffering. You might also have theories of mind which hold that the sort of higher-level theories of suffering are maybe too parochial. Like, Brian Tomasik has this view that maybe just anything that is some kind of negative feedback constitutes some level of suffering. Then it might be impossible to have systems which experienced any kind of negative feedback without also experiencing suffering. I’m personally more optimistic about that, but I do not know if I have any good, philosophically-rigorous reasons for being more optimistic, other than, well, that seems intuitively more plausible to me.

Lucas: Just to jump in here, just to add a point of clarification. It might seem sort of confusing how one might be experiencing pain without suffering.

Kaj: Right.

Lucas: Do you want to go ahead and unpack, then, the Buddhist concept of dukkha, and what pain without suffering really means, and how this might offer an existence proof for the nature of what is possible in minds?

Kaj: Maybe instead of looking at the Buddhist theories, which I expect some of the listeners to be somewhat skeptical about, it might be more useful to look at the term from medicine, pain asymbolia, also called pain dissociation. This is a known state which sometimes results from things like injury to the brain or certain pain medication, where people who have pain asymbolia report that they still experience pain, recognize the sensation of pain, but they do not actually experience it as aversive or something that would cause them suffering.

One way that I have usually expressed this is that pain is an attention signal, and pain is something that brings some sort of specific experience into your consciousness so that you become aware of it, and suffering is when you do not actually want to be aware of that painful sensation. For instance, you might have some physical pain, and then you might prefer not to be aware of that physical pain. But then even if we look at people in relatively normal conditions who do not have this pain asymbolia, then we can see that even people in relatively normal conditions may sometimes find the pain more acceptable. For some people who are, for instance, doing physical exercise, the pain may actually feel welcome, and a sign that they are actually pushing themselves to their limit, and feel somewhat enjoyable rather than being something aversive.

Similarly for, for instance, emotional pain. Maybe the pain might be some, like, mental image of something that you have lost forcing itself into your consciousness and making you very aware of the fact that you have lost this, and then the suffering arises if you think that you do not want to be aware of this thing you have lost. You do not want to be aware of the fact that you have indeed lost it and you will never experience it again.

Lucas: I guess just to sort of summarize this before we move on, it seems that there is sort of the mind stream, and within the mind stream, there are contents of consciousness which arise, and they have varying hedonic valences. Suffering is really produced when one is completely identified and wrapped up in some feeling tone of negative or positive hedonic valence, and is either feeling aversion or clinging or grasping to this feeling tone which they are identified with. The mere act of knowing or seeing the feeling tone of positive or negative valence creates sort of a cessation of the clinging and aversion, which completely changes the character of the experience and takes away this suffering aspect, but the pain content is still there. And so I guess this just sort of probably enters fairly esoteric territory about what is potentially possible with minds, but it seems important for the deep future when considering what is in principle possible of minds and superintelligence, and how that may or may not lead to suffering risks.

Kaj: What you described would be the sort of Buddhist version of this. I do tend to find that very plausible personally, both in light of some of my own experiences with meditative techniques, and clearly noticing that as a result of those kinds of practices, then on some days I might have the same amount of pain as I’ve had always before, but clearly the amount of suffering associated with that pain is considerably reduced, and also … well, I’m far from the only one who reports these kinds of experiences. This kind of model seems plausible to me, but of course, I cannot know it for certain.

Lucas: For sure. That makes sense. Putting aside the possibility of what is intrinsically possible for minds and the different hedonic valences within them and how they may or may not be completely inter-tangled with the functionality of minds and the epistemics of minds, one of these possibilities which we’ve been discussing for superintelligence leading to suffering risks is that we fail in AI alignment. Failure in AI alignment may be due to governance, coordination, or political reasons. It might be caused by an arms race. It might be due to fundamental failures in meta-ethics or normative ethics. Or maybe even most likely it could simply be a technical failure in the inability for human beings to specify our values and to instantiate algorithms in AGI which are sufficiently well-placed to learn human values in a meaningful way and to evolve in a way that is appropriate and can engage new situations. Would you like to unpack and dive into dystopian scenarios created by non-value-aligned incentives in AI, and non-value-aligned AI in general?

Kaj: I already discussed these scenarios a bit before, suffering subroutines, mind-crime, and flawed realization of human values, but maybe one thing that would be worth discussing here a bit is that these kinds of outcomes might be created by a few different pathways. For instance, one kind of pathway is some sort of anthropocentrism. If we have a superintelligence that had been programmed to only care about humans or about minds which were sufficiently human-like by some criteria, then it might be indifferent to the suffering of other minds, including whatever subroutines or sub-minds it created. Or it might be, for instance, indifferent to the suffering experienced by, say, wild animal life in evolutionary simulations it created. Similarly, there is the possibility of indifference in general if we create a superintelligence which is just indifferent to human values, including indifference to reducing or avoiding suffering. Then it might create large numbers of suffering subroutines, it might create large amounts of simulations with sentient minds, and there is also the possibility of extortion.

Assuming the superintelligence is not actually the only agent or superintelligence in the world … Maybe either there were several AI projects on earth that gained superintelligence roughly at the same time, or maybe the superintelligence expands into space and eventually encounters another superintelligence. In these kinds of scenarios, if one of the superintelligences cares about suffering but the other one does not, or at least does not care about this as much, then the superintelligence which cared less about suffering might intentionally create mind-crime and instate large numbers of suffering sentient beings in order to intentionally extort the other superintelligence into doing whatever it wants.

One more possibility is libertarianism regarding computation. If we have a superintelligence which has been programmed to just take every current living human being and give each human being some, say, control of an enormous amount of computational resources, and every human is allowed to do literally whatever they want with those resources, then we know that there exist a lot of people who are actively cruel and malicious, and many of those would use those resources to actually create suffering beings that they could torture for their own fun and entertainment.

Finally, if we are looking at these flawed realization kind of scenarios, where a superintelligence is partially value-aligned, but there might be something like, depending on the details of how exactly it is learning human values, and if it is doing some sort of extrapolation from those values, then we know that there have been times in history when circumstances that cause suffering have been defended by appealing to values that currently seem pointless to us, but which were nonetheless a part of the prevailing values at the time. If some value-loading process gave disproportionate weight to historical, existing, or incorrectly extrapolated future values which endorsed or celebrated cruelty or outright glorified suffering, then we might get a superintelligence which had some sort of creation of suffering actually as an active value in whatever value function it was trying to optimize for.

Lucas: In terms of extortion, I guess just kind of a speculative idea comes to mind. Is there a possibility of a superintelligence acausally extorting other superintelligences if it doesn’t care about suffering and expects that to be a possible value, and for there to be other superintelligences nearby?

Kaj: Acausal stuff is the kind of stuff that I’m sufficiently confused about that I don’t actually want to say anything about that.

Lucas: That’s completely fair. I’m super confused about it too. We’ve covered a lot of ground here. We’ve established what s-risks are, we’ve established a taxonomy for them, we’ve discussed their probability, their scope. Now, a lot of this probably seems very esoteric and speculative to many of our listeners, so I guess just here in the end I’d like to really drive home how and whether to work on suffering risks. Why is this something that we should be working on now? How do we go about working on it? Why isn’t this something that is just so completely esoteric and speculative that it should just be ignored?

Kaj: Let’s start by looking at how we could work on avoiding suffering risks, and then when we have some kind of an idea of what the possible ways of doing that are, then that helps us say whether we should be doing those things. One thing that is a sort of a nicely joint interest of both reducing risks of extinction and also reducing risks of astronomical suffering is the kind of general AI value alignment work that is currently being done, classically, by the Machine Intelligence Research Institute and a number of other places. As I’ve been discussing here, there are ways by which an unaligned AI or one which was partially aligned could cause various suffering outcomes. If we are working on the possibility of actually creating value-aligned AI, then that should ideally also reduce the risk of suffering risks being realized.

In addition to technical work, there is also some societal work, social and political recommendations, which are similar both from the viewpoint of extinction risks and suffering risks. For instance, Nick Bostrom has noted that if we had some sort of conditions of what he calls global turbulence of cooperation and such things breaking down during some crisis, then that could create challenges for creating value-aligned AI. There are things like arms races and so on. If we consider that the avoidance of suffering outcomes is the joint interest of many different value systems, then measures that improve the ability of different value systems to cooperate and shape the world in their desired direction can also help avoid suffering outcomes.

Those were a few things that are sort of the same as with so-called classical AI risk work, but there is also some stuff that might be useful for avoiding negative outcomes in particular. There is the possibility that if we are trying to create an AI which gets all of humanity’s values exactly right, then that might be a harder goal than simply creating an AI which attempted to avoid the most terrible and catastrophic outcomes.

You might have things like fail-safe methods, where the idea of the fail-safe methods would be that if AI control fails, the outcome will be as good as it gets under the circumstances. This could be giving the AI the objective of buying more time to more carefully solve goal alignment. Or there could be something like fallback goal functions, where an AI might have some sort of fallback goal that would be a simpler or less ambitious goal that kicks in if things seem to be going badly under some criteria, and which is less likely to result in bad outcomes. Of course, here we have difficulties in selecting what the actual safety criteria would be and making sure that the fallback goal gets triggered under the correct circumstances.

Eliezer Yudkowsky has proposed building potential superintelligences in such a way as to make them widely separated in design space from ones that would cause suffering outcomes. For example, one thing he discussed was that if an AI has some explicit representation of what humans value which it is trying to maximize, then it could only take a small and perhaps accidental change to turn that AI into one that instead maximized the negative of that value and possibly caused enormous suffering that way. One proposal would be to design AIs in such a way that they never explicitly represent complete human values so that the AI never contains enough information to compute the kinds of states of the universe that we would consider worse than death, so you couldn’t just flip the sign of the utility function and then end up in a scenario that we would consider worse than death. That kind of a solution would also reduce the risk of suffering being created through another actor that was trying to extort a superintelligence.

Looking more generally at suffering risks, as we actually already discussed here, there are lots of open questions in philosophy of mind and cognitive science which, if we could answer them, could inform the question of how to avoid suffering risks. If it turns out that you can do something like David Pearce’s idea of minds being motivated purely by gradients of wellbeing and not needing to suffer at all, then that might be a great idea, and if we could just come up with such agents and ensure that all of our descendants that go out to colonize the universe are ones that aren’t actually capable of experiencing suffering at all, then that would seem to solve a large class of suffering risks.

Of course, this kind of thing could also have more near-term immediate value, like if we figure out how to get human brains into such states where they do not experience much suffering at all, well, obviously that would be hugely valuable already. There might be some interesting research in, for instance, looking even more at all the Buddhist theories and the kinds of cognitive changes that various Buddhist contemplative practices produce in people’s brains, and see if we could get any clues from that direction.

Given that these were some ways that we could reduce suffering risks and their probability, then there was the question of whether we should do this. Well, if we look at the initial criteria of when a risk is worth working on, a risk is worth working on if the adverse outcome would be severe and if the risk has some reasonable probability of actually being realized, and it seems like we can come up with interventions that plausibly affect either the severity or the probability of a realized outcome. And a lot of these things seem like they could very plausibly either influence these variables or at least help us learn more about whether it is possible to influence those variables.

Especially given that a lot of this work overlaps with the kind of AI alignment research that we would probably want to do anyway for the sake of avoiding extinction, or it overlaps with the kind of work that would regardless be immensely valuable in making currently-existing humans suffer less, in addition to the benefits that these interventions would have on suffering risks themselves, it seems to me like we have a pretty strong case for working on these things.

Lucas: Awesome, yeah. Suffering risks are seemingly neglected in the world. They are tremendous in scope, and they are of comparable probability to existential risks. It seems like there’s a lot that we can do here today, even if at first the whole project might seem so far in the future or so esoteric or so speculative that there’s nothing that we can do today, whereas really there is.

Kaj: Yeah, exactly.

Lucas: One dimension here that I guess I just want to finish up on, that is potentially still a little bit of an open question for me, is really nailing down the likelihood of suffering risks in, I guess, probability space, especially relative to the space of existential risks. What does the space of suffering risks look like relative to that? Because it seems very clear to me, and perhaps most listeners, that this is clearly tremendous in scale, and that it relies on some assumptions about intelligence, philosophy of mind, consciousness and other things, which seem to be reasonable assumptions, to sort of get suffering risks off the ground. Given some reasonable assumptions, it seems that there’s a clearly large risk. I guess just if we could unpack a little bit more about the probability of them relative to existential risks. Is it possible to more formally characterize the causes and conditions which lead to x-risks, and then the causes and conditions which lead to suffering risks, and how big these spaces are relative to one another and how easy it is for certain sets of causes and conditions respective to each of the risks to become manifest?

Kaj: That is an excellent question. I am not aware of anyone having done such an analysis for either suffering risks or extinction risks, although there is some work on specific kinds of extinction risks. Seth Baum has been doing some nice fault tree analysis of things that might … for instance, the probability of nuclear war and the probability of unaligned AI causing some catastrophe.

Lucas: Open questions. I guess just coming away from this conversation, it seems like the essential open questions which we need more people working on and thinking about are the ways in which meta-ethics and normative ethics and disagreements there change the way we optimize the application of resources to either existential risks or suffering risks, and the kinds of futures which we’d be okay with, and then also sort of pinning down more concretely the specific probability of suffering risks relative to existential risks. Because I mean, in EA and the rationality community, everyone’s about maximizing expected value or utility, and it seems to be a value system that people are very set on. And so small changes in the probability of suffering risks versus existential risks probably lead to vastly different, less or more, amounts of value in a variety of different value systems. Then there are tons of questions about what is in principle possible of minds and the kinds of minds that we’ll create. Definitely a super interesting field that is really emerging.

Thank you so much for all this foundational work that you and others like your coauthor, Lukas Gloor, have been doing on this paper and the suffering risk field. Are there any other things you’d like to touch on? Any questions or specific things that you feel haven’t been sufficiently addressed?

Kaj: I think we have covered everything important. I will probably think of something that I will regret not mentioning five minutes afterwards, but yeah.

Lucas: Yeah, yeah. As always. Where can we check you out? Where can we check out the Foundational Research Institute? How do we follow you guys and stay up to date?

Kaj: Well, if you just Google the Foundational Research Institute or go to foundational-research.org, that’s our website. We, like everyone else, also post stuff on a Facebook page, and we have a blog for posting updates. Also, if people want a million different links just about everything conceivable, they will probably get that if they follow my personal Facebook page, where I do post a lot of stuff in general.

Lucas: Awesome. Yeah, and I’m sure there’s tons of stuff, if people want to follow up on this subject, to find on your guys’s site, as you guys are primarily the people who are working and thinking on this sort of stuff. Yeah, thank you so much for your time. It’s really been a wonderful conversation.

Kaj: Thank you. Glad to be talking about this.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Alignment Podcast: Inverse Reinforcement Learning and Inferring Human Preferences with Dylan Hadfield-Menell

Inverse Reinforcement Learning and Inferring Human Preferences is the first podcast in the new AI Alignment series, hosted by Lucas Perry. This series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across a variety of areas, such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following or subscribing to us on YouTube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary map which begins to map this space.

In this podcast, Lucas spoke with Dylan Hadfield-Menell, a fifth year Ph.D student at UC Berkeley. Dylan’s research focuses on the value alignment problem in artificial intelligence. He is ultimately concerned with designing algorithms that can learn about and pursue the intended goal of their users, designers, and society in general. His recent work primarily focuses on algorithms for human-robot interaction with unknown preferences and reliability engineering for learning systems. 

Topics discussed in this episode include:

  • Inverse reinforcement learning
  • Goodhart’s Law and its relation to value alignment
  • Corrigibility and obedience in AI systems
  • IRL and the evolution of human values
  • Ethics and moral psychology in AI alignment
  • Human preference aggregation
  • The future of IRL

In this interview we discuss a few of Dylan’s papers and ideas contained in them. You can find them here: Inverse Reward Design, The Off-Switch Game, Should Robots be Obedient?, and Cooperative Inverse Reinforcement Learning. You can hear about these papers above or read the transcript below.

 

Lucas: Welcome back to the Future of Life Institute Podcast. I’m Lucas Perry and I work on AI risk and nuclear weapons risk related projects at FLI. Today, we’re kicking off a new series where we will be having conversations with technical and nontechnical researchers focused on AI safety and the value alignment problem. Broadly, we will focus on the interdisciplinary nature of the project of eventually creating value-aligned AI, where what value-aligned exactly entails is an open question that is part of the conversation.

In general, this series covers the social, political, ethical, and technical issues and questions surrounding the creation of beneficial AI. We’ll be speaking with experts from a large variety of domains, and hope that you’ll join in the conversations. If this seems interesting to you, make sure to follow us on SoundCloud, or subscribe to us on YouTube for more similar content.

Today, we’ll be speaking with Dylan Hadfield-Menell. Dylan is a fifth-year PhD student at UC Berkeley, advised by Anca Dragan, Pieter Abbeel, and Stuart Russell. His research focuses on the value alignment problem in artificial intelligence. With that, I give you Dylan. Hey, Dylan. Thanks so much for coming on the podcast.

Dylan: Thanks for having me. It’s a pleasure to be here.

Lucas: I guess, we can start off, if you can tell me a little bit more about your work over the past years. How have your interests and projects evolved? How has that led you to where you are today?

Dylan: Well, I started off towards the end of undergrad and beginning of my PhD working in robotics and hierarchical robotics. Towards the end of my first year, my advisor came back from a sabbatical, and started talking about the value alignment problem and existential risk issues related to AI. At that point, I started thinking about questions about misaligned objectives, value alignment, and generally how we get the correct preferences and objectives into AI systems. About a year after that, I decided to make this my central research focus. Then, for the past three years, that’s been most of what I’ve been thinking about.

Lucas: Cool. That seems like you had an original path where you’re working on practical robotics. Then, you shifted more into value alignment and AI safety efforts.

Dylan: Yeah, that’s right.

Lucas: Before we go ahead and jump into your specific work, it’d be great if we could go ahead and define what inverse reinforcement learning exactly is. For me, it seems that inverse reinforcement learning, at least from the view, I guess, of technical AI safety researchers, is viewed as an empirical means of conquering descriptive ethics, whereby we’re able to give a clear descriptive account of what any given agent’s preferences and values are at any given time. Is that a fair characterization?

Dylan: That’s one way to characterize it. Another way to think about it, which is a useful perspective for me, sometimes, is to think of inverse reinforcement learning as a way of doing behavior modeling that has certain types of generalization properties.

Any time you’re learning in any machine learning context, there’s always going to be a bias that controls how you generalize to new information. Inverse reinforcement learning and preference learning, to some extent, is a bias in behavior modeling, which is to say that we should model this agent as accomplishing a goal, as satisfying a set of preferences. That leads to certain types of generalization properties in new environments. For me, inverse reinforcement learning is building in this agent-based assumption into behavior modeling.
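To make the idea of an agent-based bias in behavior modeling a little more concrete, here is a minimal Python sketch. It is not taken from any of Dylan's papers. The one-dimensional world, the two candidate goals, the Boltzmann-style choice rule, and the `rationality` parameter are all assumptions made up for illustration. The model assumes the observed agent is pursuing one of the goals, infers which one from a few observed steps, and then predicts behavior in a position it never saw.

```python
import math

# Hypothetical toy world: an agent walks along a line of cells 0..10.
GOALS = {"left": 0, "right": 10}            # candidate goals (assumed)
ACTIONS = {"step_left": -1, "step_right": 1}


def action_probability(position, action, goal_position, rationality=2.0):
    """Boltzmann-style choice: actions that end closer to the goal are more likely."""
    weights = {
        name: math.exp(-rationality * abs(position + delta - goal_position))
        for name, delta in ACTIONS.items()
    }
    return weights[action] / sum(weights.values())


def posterior_over_goals(observations):
    """Bayesian update over candidate goals from (position, action) pairs."""
    posterior = {goal: 1.0 / len(GOALS) for goal in GOALS}  # uniform prior
    for position, action in observations:
        for goal, goal_position in GOALS.items():
            posterior[goal] *= action_probability(position, action, goal_position)
        total = sum(posterior.values())
        posterior = {goal: p / total for goal, p in posterior.items()}
    return posterior


def predict_action(position, posterior):
    """Predict behavior in a possibly unseen position via the inferred goal."""
    def predicted_probability(action):
        return sum(
            p * action_probability(position, action, GOALS[goal])
            for goal, p in posterior.items()
        )
    return max(ACTIONS, key=predicted_probability)


observed = [(5, "step_right"), (6, "step_right"), (7, "step_right")]
posterior = posterior_over_goals(observed)
print(posterior)
print(predict_action(2, posterior))  # position 2 never appears in the observations
```

The generalization comes from the goal assumption: the model never saw the agent at position 2, but because it explains the observed steps as heading toward a goal, it can still predict what the agent would do there.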

Lucas: Given that, I’d like to dive more into the specific work that you’re working on and going to some summaries of your findings and your research that you’ve been up to. Given this interest that you’ve been developing in value alignment, and human preference aggregation, and AI systems learning human preferences, what are the main approaches that you’ve been working on?

Dylan: I think the first thing that really Stuart Russell and I started thinking about was trying to understand theoretically, what is a reasonable goal to shoot for, and what does it mean to do a good job of value alignment. To us, it feels like issues with misspecified objectives, at least, in some ways, are a bug in the theory.

All of the math around artificial intelligence, for example, Markov decision processes, which is the central mathematical model we use for decision making over time, starts with an exogenously defined objective or reward function. We think that, mathematically, that was a fine thing to do in order to make progress, but it’s an assumption that really has put blinders on the field about the importance of getting the right objective down.

I think, the first thing that we sought to try to do was to understand, what is a system or a set up for AI that does the right thing in theory, at least. What’s something that if we were able to implement this that we think could actually work in the real world with people. It was that kind of thinking that led us to propose cooperative inverse reinforcement learning, which was our attempt to formalize the interaction whereby you communicate an objective to the system.

The main thing that we focused on was including within the theory a representation of the fact that the true objective’s unknown and unobserved, and that it needs to be arrived at through observations from a person. Then, we’ve been trying to investigate the theoretical implications of this modeling shift.

In the initial paper that we did, which is titled Cooperative Inverse Reinforcement Learning, what we looked at is how this formulation is actually different from a standard environment model in AI. In particular, the way that it’s different is there’s strategic interaction on the behalf of the person. The way that you observe what you’re supposed to be doing is intermediated by a person who may be trying to actually teach or trying to communicate appropriately. What we showed is that modeling this communicative component can actually be hugely important and lead to much faster learning behavior.

In our subsequent work, what we’ve looked at is taking this formal model in theory and trying to apply it to different situations. There are two really important pieces of work that I like here that we did. One was to take that theory and use it to explicitly analyze a simple model of an existential risk setting. This was a paper titled The Off-Switch Game that we published at IJCAI last summer. What it was, was working through a formal model of a corrigibility problem within a CIRL (cooperative inverse reinforcement learning) framework. It shows the utility of constructing this type of game in the sense that we get some interesting predictions and results.

The first one we get is that there are some nice simple necessary conditions for the system to want to let the person turn it off, which is that the robot, the AI system needs to have uncertainty about its true objective, which is to say that it needs to have within its belief the possibility that it might be wrong. Then, all it needs to do is believe that the person it’s interacting with is a perfectly rational individual. If that’s true, you’d get a guarantee that this robot always lets the person switch it off.

Now, that’s good because, in my mind, it’s an example of a place where, at least, in theory, it solves the problem. This gives us a way that theoretically, we could build corrigible systems. Now, it’s still making a very, very strong assumption, which is that it’s okay to model the human as being optimal or rational. I think if you look at real people, that’s just not a fair assumption to make for a whole host of reasons.

The next thing we did in that paper is we looked at this model. What we realized is that adding in a small amount of irrationality breaks this guarantee. It means that some things might actually go wrong. The final thing we did in the paper was to look at the consequences of either overestimating or underestimating human rationality. The argument that we made is that there’s a trade-off. Assuming that the person is more rational lets you get more information from their behavior, and thus learn more and in principle help them more. But if you assume that they’re too rational, then this actually can lead to quite bad behavior.
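The reasoning about the off-switch can be sketched numerically. The Python below is only an illustration under stated assumptions, not the model from The Off-Switch Game paper: the belief values are invented, switching off is assumed to be worth zero, and the imperfect human is assumed to allow the action with probability `sigmoid(utility / noise)`. It compares three options for the robot: act without asking, switch itself off, or defer to the human.

```python
import math

# Robot's belief about the utility of its proposed action, as (utility, probability)
# pairs. The numbers are invented for illustration.
BELIEF = [(-1.0, 0.4), (2.0, 0.6)]


def value_of_acting(belief):
    """Robot acts without asking: it receives the expected utility of the action."""
    return sum(u * p for u, p in belief)


def value_of_switching_off():
    """Robot switches itself off: utility is zero by convention (assumed)."""
    return 0.0


def value_of_deferring(belief, noise=0.0):
    """Robot waits and lets the human decide whether to switch it off.

    noise == 0 models a perfectly rational human who allows the action exactly
    when its utility is positive. noise > 0 models a noisy human who allows it
    with probability sigmoid(utility / noise), an assumption of this sketch.
    """
    total = 0.0
    for u, p in belief:
        if noise == 0.0:
            allow = 1.0 if u > 0 else 0.0
        else:
            allow = 1.0 / (1.0 + math.exp(-u / noise))
        total += p * allow * u
    return total


act = value_of_acting(BELIEF)
off = value_of_switching_off()
defer_rational = value_of_deferring(BELIEF)
defer_noisy = value_of_deferring(BELIEF, noise=5.0)

print(f"act={act:.3f} off={off:.3f} "
      f"defer(rational)={defer_rational:.3f} defer(noisy)={defer_noisy:.3f}")
```

With a rational human, deferring is worth the expected value of max(utility, 0), which can never be lower than acting or switching off, so an uncertain robot has no reason to resist the off switch. With a sufficiently noisy human, the human sometimes blocks good actions and allows bad ones, and in this made-up example deferring becomes worse than simply acting.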

There’s a sweet spot that you want to aim for, which is to maybe try to underestimate how rational people are, but you, obviously, don’t want to get it totally wrong. We followed up on that idea in a paper with Smitha Milli as the first author that was titled Should Robots be Obedient? That paper tried to get at a little bit more of this trade-off between maintaining control over a system and the amount of value that it can generate for you.

We looked at the implication that as robot systems interact with people over time, you expect them to learn more about what people want. If you get very confident about what someone wants, and you think they might be irrational, the math in the Off-Switch paper predicts that you should try to take control away from them. This means that if your system is learning over time, you expect that even if it is initially open to human control and oversight, it may lose that incentive over time. In fact, you can predict that it should lose that incentive over time.

In Should Robots be Obedient, we modeled that property and looked at some consequences of it. We do find that you get a basic confirmation of this hypothesis, which is that systems that maintain human control and oversight have less value that they can achieve in theory. We also looked at what happens when you have the wrong model. If the AI system has a prior that the human cares about a small number of things in the world, let’s say, then it statistically gets overconfident in its estimates of what people care about, and disobeys the person more often than it should.

Arguably, when we say we want to be able to turn the system off, it’s less a statement about what we want to do in theory or the property of the optimal robot behavior we want, and more of a reflection of the idea that we believe that under almost any realistic situation, we’re probably not going to be able to fully explain all of the relevant variables that we care about.

If you’re giving your robot an objective defined over a subset of things you care about, you should actually be very focused on having it listen to you, more so than just optimizing for its estimates of value. I think that provides, actually, a pretty strong theoretical argument for why corrigibility is a desirable property in systems, even though, at least, at face value, it should decrease the amount of utility those systems can generate for people.

The final piece of work that I think I would talk about here is our NIPS paper from December, which is titled Inverse Reward Design. That was taking cooperative inverse reinforcement learning and pushing it in the other direction. Instead of using it to theoretically analyze very, very powerful systems, we can also use it to try to build tools that are more robust to mistakes that designers may make. And start to build in initial notions of value alignment and value alignment strategies into the current mechanisms we use to program AI systems.

What that work looked at was understanding the uncertainty that’s inherent in an objective specification. In the initial cooperative inverse reinforcement learning paper and the Off-Switch Game, what we said is that AI systems should be uncertain about their objective, and they should be designed in a way that is sensitive to that uncertainty.

This paper was about trying to understand, what is a useful way to be uncertain about the objective. The main idea behind it was that we should be thinking about the environments that the system designer had in mind. We use an example of a 2D robot navigating in the world, and the system designer is thinking about this robot navigating where there’s three types of terrain. There’s grass, there’s dirt, and there’s gold. You can give your robot an objective, a utility function defined over being in those different types of terrain, that incentivizes it to go and get the gold, and stay on the dirt where possible, but to take shortcuts across the grass when it’s high value.

Now, when that robot goes out into the world, there are going to be new types of terrain, and types of terrain the designer didn’t anticipate. What we did in this paper was to build an uncertainty model that allows the robot to determine when it should be uncertain about the quality of its reward function. How can we figure out when the reward function that a system designer builds into an AI, how can we determine when that objective is ill-adapted to the current situation? You can think of this as a way of trying to build in some mitigation to Goodhart’s law.
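As a rough illustration of that idea, here is a much simplified Python sketch. It is not the method from the Inverse Reward Design paper, which builds a Bayesian posterior over the true reward. Instead, it assumes a fixed set of plausible values for any terrain the designer never saw and plans for the worst case over them. The reward numbers, the `PLAUSIBLE_VALUES` set, and the unseen terrain called "lava" are all made up for this example.

```python
from itertools import product

# Proxy reward written by the designer, who only had grass, dirt, and gold in mind.
PROXY_REWARD = {"grass": -1.0, "dirt": -0.1, "gold": 10.0}
TRAINING_TERRAINS = set(PROXY_REWARD)

# Values an unanticipated terrain might plausibly have (an assumption of this sketch).
PLAUSIBLE_VALUES = (-10.0, 0.0, 10.0)


def candidate_rewards(path_options):
    """Enumerate reward functions that agree with the proxy on training terrain
    and try every plausible value for each terrain the designer never saw."""
    unseen = sorted({t for path in path_options for t in path} - TRAINING_TERRAINS)
    for values in product(PLAUSIBLE_VALUES, repeat=len(unseen)):
        yield {**PROXY_REWARD, **dict(zip(unseen, values))}


def path_return(path, reward):
    """Total reward collected along a path under one reward function."""
    return sum(reward[terrain] for terrain in path)


def literal_choice(path_options):
    """Trust the proxy literally: unseen terrain silently counts as zero."""
    def literal_return(path):
        return sum(PROXY_REWARD.get(terrain, 0.0) for terrain in path)
    return max(path_options, key=literal_return)


def risk_averse_choice(path_options):
    """Pick the path whose worst case over all candidate rewards is best."""
    candidates = list(candidate_rewards(path_options))
    def worst_case(path):
        return min(path_return(path, reward) for reward in candidates)
    return max(path_options, key=worst_case)


shortcut = ("dirt", "lava", "gold")            # "lava" is a made-up unseen terrain
detour = ("dirt", "dirt", "dirt", "gold")
paths = [shortcut, detour]

print(literal_choice(paths))
print(risk_averse_choice(paths))
```

A robot that takes the proxy literally cuts through the unfamiliar terrain because nothing in its objective says not to. A robot that treats the proxy as evidence about what the designer wanted in the training terrains stays uncertain about the new terrain and takes the detour.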

Lucas: Would you like to take a second to unpack what Goodhart’s law is?

Dylan: Sure. Goodhart’s law is an old idea in social science that actually goes back to before Goodhart. I would say that in economics, there’s a general idea of the principal agent problem, which dates back to the 1970s, as I understand it, and basically looks at the problem of specifying incentives for humans. How should you create contracts? How do you create incentives, so that another person, say, an employee, helps earn you value?

Goodhart’s law is a very nice way of summarizing a lot of those results, which is to say that once a metric becomes an objective, it ceases to be a good metric. You can have properties of the world, which correlate well with what you want, but optimizing for them actually leads to something quite, quite different than what you’re looking for.

Lucas: Right. Like if you are optimizing for test scores, then you’re not actually going to end up optimizing for intelligence, which is what you wanted in the first place?

Dylan: Exactly. Even though test scores, when you weren’t optimizing for them were actually a perfectly good measure of intelligence. I mean, not perfectly good, but were an informative measure of intelligence. Goodhart’s law, arguably, is a pretty bleak perspective. If you take it seriously, and you think that we’re going to build very powerful systems that are going to be programmed directly through an objective, in this manner, Goodhart’s law should be pretty problematic because any objective that you can imagine programming directly into your system is going to be something correlated with what you really want rather than what you really want. You should expect that that will likely be the case.
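The test-score example can be turned into a tiny Python sketch. The numbers here are invented: a fixed budget of hours, a skill that only grows with real learning, and an exam score that also rewards test-specific tricks at twice the rate. None of this comes from the conversation beyond the general shape of the example.

```python
EFFORT_BUDGET = 10  # hours available, an arbitrary number for illustration


def skill(hours_learning):
    """What we actually care about: only real learning builds it."""
    return hours_learning


def exam_score(hours_learning, hours_gaming):
    """The metric: tracks skill, but test-specific tricks pay off twice as fast
    (the factor 2 is an assumption of this sketch)."""
    return hours_learning + 2 * hours_gaming


# Every way of splitting the budget between learning and gaming the exam.
splits = [(learn, EFFORT_BUDGET - learn) for learn in range(EFFORT_BUDGET + 1)]

# Before anyone optimizes the metric (no gaming), score and skill coincide.
honest = [(exam_score(hours, 0), skill(hours)) for hours in range(EFFORT_BUDGET + 1)]

# Once the score itself is the objective, the best split abandons learning.
best_learn, best_game = max(splits, key=lambda split: exam_score(*split))

print(f"score-maximizing split: learn={best_learn}, game={best_game}")
print(f"score={exam_score(best_learn, best_game)}, skill={skill(best_learn)}")
```

As long as nobody optimizes the score, it is an informative measure of skill. Once the score is the objective, the optimizer finds the part of the score that is not skill and puts all of its effort there.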

Lucas: Right. Is it just simply too hard or too unlikely that we’ll be able to sufficiently specify exactly what we want, so that we’ll just end up using some other metrics which, if you optimize too hard for them, end up messing with a bunch of other things that we care about?

Dylan: Yeah. I mean, I think there’s some real questions about, what is it we even mean… Well, what are we even trying to accomplish? What should we try to program into systems? Philosophers have been trying to figure out those types of questions for ages. For me, as someone who takes a more empirical slant on these things, I think about the fact that the objectives that we see within our individual lives are so heavily shaped by our environments. Which types of signals we respond to and adapt to has heavily adapted itself to the types of environments we find ourselves in.

We just have so many examples of objectives not being the correct thing. I mean, effectively, all you could have is correlations. The fact that wireheading is possible is maybe some of the strongest evidence for Goodhart’s law being really a fundamental property of learning systems and optimizing systems in the real world.

Lucas: There are certain agential characteristics and properties, which we would like to have in our AI systems, like them being-

Dylan: Agential?

Lucas: Yeah. Corrigibility is a characteristic, which you’re doing research on and trying to understand better. Same with obedience. It seems like there’s a trade off here where if a system is too corrigible or it’s too obedient, then you lose its ability to really maximize different objective functions, correct?

Dylan: Yes, exactly. I think identifying that trade off is one of the things I’m most proud of about some of the work we’ve done so far.

Lucas: Given AI safety and really big risks that can come about from AI, in the short, to medium, and long term, before we really have AI safety figured out, is it really possible for systems to be too obedient, or too corrigible, or too docile? How do we navigate this space and find sweet spots?

Dylan: I think it’s definitely possible for systems to be too corrigible or too obedient. It’s just that the failure mode for that doesn’t seem that bad. If you think about this-

Lucas: Right.

Dylan: … it’s like Clippy. Clippy was asking for human-

Lucas: Would you like to unpack what Clippy is first?

Dylan: Sure, yeah. Clippy is an example of an assistant that Microsoft created in the ’90s. It was this little paperclip that would show up in Microsoft Word. It liked to suggest, a lot, that you were trying to write a letter, and to ask about different ways in which it could help.

Now, on one hand, that system was very corrigible and obedient in the sense that it would ask you whether or not you wanted its help all the time. If you said no, it would always go away. It was super annoying because it would always ask you if you wanted help. The false positive rate was just far too high to the point where the system became really a joke in computer science and AI circles of what you don’t want to be doing. I think, systems can be too obedient or too sensitive to human intervention and oversight in the sense that too much of that just reduces the value of the system.

Lucas: Right, for sure. On one hand, when we’re talking about existential risks or even a paperclip maximizer, then it would seem, like you said, like the failure mode of just being too annoying and checking in with us too much seems like not such a bad thing given existential risk territory.

Dylan: I think if you’re thinking about it in those terms, yes. I think if you’re thinking about it from the standpoint of, “I want to sell a paperclip maximizer to someone else,” then it becomes a little less clear, I think, especially, when the risks of paperclip maximizers are much harder to measure. I’m not saying that it’s the right decision from a global altruistic standpoint to be making that trade off, but I think it’s also true that just if we think about the requirements of market dynamics, it is true that AI systems can be too corrigible for the market. That is a huge failure mode that AI systems run into, and it’s one we should expect the producers of AI systems to be responsive to.

Lucas: Right. Given all these different … Is there anything else you wanted to touch on there?

Dylan: Well, I had another example of systems that are too corrigible-

Lucas: Sure.

Dylan: … which is, do you remember Microsoft’s Tay?

Lucas: No, I do not.

Dylan: This is a chatbot that Microsoft released. They trained it based off of tweets. It was a tweet bot. They trained it based on things that were tweeted at it. I forget if it was a nearest-neighbors lookup or if it was just doing a neural method, and overfitting, and memorizing parts of the training set. At some point, 4chan realized that the AI system, that Tay, was very suggestible. They basically created an army to radicalize Tay. They succeeded.

Lucas: Yeah, I remember this.

Dylan: I think you could also think of that as being the other axis of too corrigible or too responsive to human input. The first axis I was talking about is the failures of being too corrigible from an economic standpoint, but there’s also the failures of being too corrigible in a multi-agent mechanism design setting where, I believe, those types of properties in a system also open them up to more misuse.

If we think of AI, cooperative inverse reinforcement learning and the models we’ve been talking about so far exist in what I would call the one robot one human model of the world. Generally, you could think of extensions of this with N humans and M robots. The variants of what you would have there, I think, lead to different theoretical implications.

If we think of just two humans, N=2, and one robot, M=1, suppose that one of the humans is the system designer and the other one is the user, there is this trade off between how much control the system designer has over the future behavior of the system and how responsive and corrigible it is to the user in particular. Trading off between those two, I think, is a really interesting ethical question that comes up when you start to think about misuse.

Lucas: Going forward and as we’re developing these systems, and trying to make them more fully realized in the world where the number of people will equal something like seven or eight billion, how do we navigate this space where we’re trying to hit a sweet spot where it’s corrigible in the right ways, and to the right degree, and right level, and to the right people, and it is obedient to the right people, and it’s not suggestible from the wrong people? Or does that just enter a territory of so many political, social, and ethical questions that it will take years to think about and work on?

Dylan: Yeah, I think it’s closer to the second one. I’m sure that I don’t know the answers here. From my standpoint, I’m still trying to get a good grasp on what is possible in the one-robot-one-person case. I think that when you have … Yeah, when you … Oh man. I guess, it’s so hard to think about that problem because it’s just very unclear what’s even correct or right. Ethically, you want to be careful about imposing your beliefs and ideas too strongly on to a problem because you are shaping that.

At the same time, these are real challenges that are going to exist. We already see them in real life. If we look at the YouTube recommender stuff that was just happening, arguably, that’s a misspecified objective. To get a little bit of background here, this is largely based off of a recent New York Times opinion piece, it was looking at the recommendation engine for YouTube, and pointing out it has a bias towards recommending radical content. Either fake news or Islamist videos.

If you dig into why that was occurring, a lot of it is because… what are they doing? They’re optimizing for engagement. The process of online radicalization looks super engaging. Now, we can think about, where does that come up. Well, that issue gets introduced in a whole bunch of places. A big piece of it is that there is this adversarial dynamic to the world. There are users generating content in order to be outraging and enraging because they discovered that gets more feedback and more responses. You need to design a system that’s robust to that strategic property of the world. At the same time, you can understand why YouTube was very, very hesitant to be taking actions that would look like censorship.

Lucas: Right. I guess, just coming back to this idea of the world having lots of adversarial agents in it, human beings are like general intelligences who have reached some level of corrigibility and obedience that works kind of well in the world amongst a bunch of other human beings. That was developed through evolution. Are there potentially techniques for developing the right sorts of corrigibility and obedience in machine learning and AI systems through stages of evolution and running environments like that?

Dylan: I think that’s a possibility. I would say, one … I have a couple of thoughts related to that. The first one is I would actually challenge a little bit of your point of modeling people as general intelligences, mainly in the sense that when we talk about artificial general intelligence, we have something in mind. It’s often a shorthand in these discussions for a perfectly rational, Bayesian-optimal actor.

Lucas: Right. What does that mean? Just unpack that a little bit.

Dylan: What that means is a system that is taking advantage of all of the information that is currently available to it in order to pick actions that optimize expected utility. When we say perfectly, we mean a system that is doing that as well as possible. It’s that modeling assumption that I think sits at the heart of a lot of concerns about existential risk. I definitely think that’s a good model to consider, but there’s also the concern that it might be misleading in some ways, and that it might not actually be a good model of people and how they act in general.
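
To make that definition concrete, here is a minimal sketch of an actor that picks the action with the highest expected utility under its current belief. It is an illustration only and not something from the conversation: the umbrella scenario, the probabilities, and the payoff numbers are all made-up assumptions.

```python
def expected_utility(action, belief, utility):
    """Average utility of `action` over world states, weighted by the belief."""
    return sum(p * utility(state, action) for state, p in belief.items())


def best_action(actions, belief, utility):
    """Pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, belief, utility))


# Hypothetical toy problem: should the agent carry an umbrella?
belief = {"rain": 0.3, "sun": 0.7}  # assumed probabilities over world states
payoffs = {                          # assumed utilities, chosen for illustration
    ("rain", "umbrella"): 1.0,
    ("rain", "no_umbrella"): -5.0,
    ("sun", "umbrella"): -0.5,
    ("sun", "no_umbrella"): 1.0,
}


def utility(state, action):
    return payoffs[(state, action)]


choice = best_action(["umbrella", "no_umbrella"], belief, utility)
print(choice)
```

The "perfectly rational" part of the shorthand is the assumption that the belief already reflects all available information and that the maximization is exact, which is what the rest of this answer questions as a model of people.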

One way to look at it would be to say that there’s something about the incentive structure around humans and in our societies that is developed and adapted that creates the incentives for us to be corrigible. Thus, a good research goal of AI is to figure out what those incentives are and to replicate them in AI systems.

Another way to look at it is that people are intelligent, not necessarily in the ways that economics models us as intelligent, and that there are properties of our behavior, which are desirable properties, that don’t directly derive from expected utility maximization; or if they do, they derive from a very, very diffuse form of expected utility maximization. This is the perspective that says that people on their own are not necessarily what human evolution is optimizing for, but people are a tool along that way.

We could make arguments for that based off of … I think it’s an interesting perspective to take. What I would say is that in order for societies to work, we have to cooperate. That cooperation was a crucial evolutionary bottleneck, if you will. One of the really, really important things that it did was it forced us to develop the parent-child strategy relationship equilibrium that we currently live in. That’s a process whereby we communicate our values, whereby we train people to think that certain things are okay or not, and where we inculcate certain behaviors in the next generation. I think it’s that process more than anything else that we really, really want in an AI system and in powerful AI systems.

Now, the thing is the … I guess, we’ll have to continue on that a little more. It’s really, really important that that’s there, because if you don’t have those cognitive abilities to understand causing pain, and to just fundamentally decide that that’s a bad idea, to have a desire to cooperate, to buy into the different coordination and normative mechanisms that human society uses … If you don’t have that, then you end up … Well, then society just doesn’t function. A hunter-gatherer tribe of self-interested sociopaths probably doesn’t last for very long.

What this means is that our ability to coordinate our intelligence and cooperate with it was co-evolved and co-adapted alongside our intelligence. I think that that evolutionary pressure and bottleneck was really important to getting us to the type of intelligence that we are now. It’s not a pressure that AI is necessarily subjected to. I think, maybe that is one way to phrase the concern, I’d say.

When I look to evolutionary systems and where the incentives for corrigibility, and cooperation, and interaction come from, it’s largely about the processes whereby people are less like general intelligences in some ways. Evolution allowed us to become smart in some ways and restricted us in others based on the imperatives of group coordination and interaction. I think that a lot of our intelligence in practice is about reasoning about group interaction and what groups think is okay and not. That’s a part of the developmental process that we need to replicate in AI just as much as spatial reasoning or vision.

Lucas: Cool. I guess, I just want to touch base on this before we move on. Are there certain assumptions about the kinds of agents that humans are and almost, I guess, ideas about us as being utility maximizers in some sense that people you see commonly have but that are misconceptions about people and how people operate differently from AI?

Dylan: Well, I think that that’s the whole field of behavioral economics in a lot of ways. I could go to examples of people being irrational. I think there are all of the examples of people being more than just self-interested. There are ways in which we seem to be risk-seeking that seem like they would be irrational from an individual perspective, but you could argue it may be rational from a group evolutionary perspective.

I mean, things like overeating. I mean, that’s not exactly the same type of rationality but it is an example of us becoming ill-adapted to our environments and showing the extent to which we’re not capable of changing or in which it may be hard to. Yeah, I think, in some ways, one story that I tell about AI risk is that back in the start of the AI field, we were looking around and saying, “We want to create something intelligent.” Intuitively, we all know what that means, but we need a formal characterization of it. The formal characterization that we turned to was, basically, the theories of rationality developed in economics.

Although those theories turned out to be, except in some settings, not great descriptors of human behavior, they were quite useful as a guide for building systems that accomplish goals. I think that part of what we need to do as a field is reassess where we’re going and think about whether or not building something like that perfectly rational actor is actually a desirable end goal. I mean, there’s a sense in which it is. I would like an all-powerful, perfectly aligned genie to help me do what I want in life.

You might think that if the odds of getting that wrong are too high, that maybe you would do better with shooting for something that doesn’t quite achieve that ultimate goal, but that you can get to with pretty high reliability. This may be a setting where “shoot for the moon, and if you miss, you’ll land among the stars” is just a horribly misleading perspective.

Lucas: Shoot for the moon, and you might get a hellscape universe, but if you shoot for the clouds, it might end up pretty okay.

Dylan: Yeah. We could iterate on the sound bite, but I think something like that may not be … That’s where I stand on my thinking here.

Lucas: We’ve talked about a few different approaches that you’ve been working on over the past few years. What do you view as the main limitations of such approaches currently? Mostly, you’re only thinking about one-machine, one-human systems or environments. What are the biggest obstacles that you’re facing right now in inferring and learning human preferences?

Dylan: Well, I think, the first thing is it’s just an incredibly difficult inference problem. It’s a really difficult inference problem to imagine running at scale with explicit inference mechanisms. One thing to do is you can design a system that explicitly tracks a belief about someone’s preferences, and then acts, and responds to that. Those are systems that you could try to prove theorems about. They’re very hard to build. They can be difficult to get to work correctly.

In contrast, you can create systems that have incentives to construct beliefs to accomplish their goals. It’s easier to imagine building those systems and having them work at scale, but it’s much, much harder to understand how you would be confident in those systems being well aligned.
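
As a rough illustration of what "explicitly tracks a belief about someone’s preferences" can look like, the sketch below keeps a probability for each candidate preference and updates it with Bayes’ rule after every observed choice. Everything specific in it is an assumption made for the example rather than something described in the conversation: the two hypotheses, the tea and coffee options, and the Boltzmann (noisily rational) choice model with its `beta` parameter.

```python
import math


def choice_likelihood(choice, options, reward, beta):
    """P(choice | reward) under an assumed Boltzmann (noisily rational) model."""
    weights = {o: math.exp(beta * reward[o]) for o in options}
    return weights[choice] / sum(weights.values())


def update_belief(belief, choice, options, hypotheses, beta=2.0):
    """One Bayesian update of the belief over candidate reward functions."""
    posterior = {
        name: belief[name] * choice_likelihood(choice, options, hypotheses[name], beta)
        for name in belief
    }
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}


# Hypothetical setup: two candidate preferences over two drinks.
options = ["tea", "coffee"]
hypotheses = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}
belief = {"likes_tea": 0.5, "likes_coffee": 0.5}

# The person mostly picks tea, with one "mistake" the noise model can absorb.
for observed in ["tea", "tea", "coffee", "tea"]:
    belief = update_belief(belief, observed, options, hypotheses)

print(belief)
```

With two hypotheses this is trivial. The difficulty described here comes from doing the same thing over realistic spaces of preferences and behavior, where the sum over hypotheses is no longer something you can enumerate.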

I think that one of the biggest concerns I have, I mean, we’re still very far from many of these approaches being very practical to be honest. I think this theory is still pretty unfounded. There’s still a lot of work to go to understand, what is the target we’re even shooting for? What does an aligned system even mean? My colleagues and I have spent an incredible amount of time trying to just understand, what does it mean to be value-aligned if you are a suboptimal system?

There’s one example that I think about, which is, say, you’re cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it’s actually suggesting the wrong move or a bad move. Would you call that system value-aligned?

Lucas: No, I would not.

Dylan: I think most people wouldn’t. Now, what if I told you that that program was actually implemented as a search that’s using the correct goal test? It actually turns out that if it’s within 10 steps of a winning play, it always finds that for you, but because of computational limitations, it usually doesn’t. Now, is the system value-aligned? I think it’s a little harder to tell here. What I do find is that when I tell people the story, and I start off with the search algorithm with the correct goal test, they almost always say that that is value-aligned but stupid.
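
The "value-aligned but stupid" program can be sketched as a depth-limited search: the goal test is exactly right, but the search only looks a fixed number of steps ahead. This is a toy stand-in rather than a chess engine, and the number puzzle, the `successors` function, and the depth limits are all hypothetical choices made for the illustration.

```python
def depth_limited_search(state, is_goal, successors, limit):
    """Return a list of states ending in a goal within `limit` steps, else None."""
    if is_goal(state):
        return [state]
    if limit == 0:
        return None  # out of computation: give up even though a goal may exist
    for nxt in successors(state):
        path = depth_limited_search(nxt, is_goal, successors, limit - 1)
        if path is not None:
            return [state] + path
    return None


# Hypothetical puzzle: start at 1, each move either adds 1 or doubles; goal is 10.
def successors(n):
    return [n + 1, n * 2]


def is_goal(n):
    return n == 10  # the "correct goal test"


near = depth_limited_search(1, is_goal, successors, limit=4)  # goal is within reach
far = depth_limited_search(1, is_goal, successors, limit=2)   # goal is too far away

print(near, far)
```

With a generous limit the search always finds a winning line. With a tight limit it returns nothing useful, even though it is "trying" to reach exactly the right goal, which is the ambiguity the thought experiment is pointing at.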

There’s an interesting thing going on here, which is we’re not totally sure what the target we’re shooting for is. You can take this thought experiment and push it further. Suppose you’re doing that search, but, now, say it’s a heuristic search that uses the correct goal test but has an adversarially chosen heuristic function. Would that be a value-aligned system? Again, I’m not sure. If the heuristic was adversarially chosen, I’d say probably not. If the heuristic just happened to be bad, then I’m not sure.

Lucas: Could you potentially unpack what it means for something to be adversarially chosen?

Dylan: Sure. Adversarially chosen in this case just means that there is some intelligent agent selecting the heuristic function or that evaluation measurement in a way that’s designed to maximally screw you up. Adversarial analysis is a really common technique used in cryptography where we try to think of adversaries selecting inputs for computer systems that will cause them to malfunction. In this case, what this looks like is an adversarial algorithm that looks, at least, on the surface like it is trying to help you accomplish your objectives but is actually trying to fool you.

I’d say that, more generally, what these thought experiments help me with is understanding that value alignment is actually a quite tricky and subjective concept. It’s actually quite hard to nail down in practice what it would mean.

Lucas: What sort of effort do you think needs to happen and from who in order to specify what it really means for a system to be value-aligned and to not just have a soft squishy idea of what that means but to have it really formally mapped out, so it can be implemented in machine systems?

Dylan: I think, we need more people working on technical AI safety research. I think to some extent it may always be something that’s a little ill-defined and squishy. Generally, I think it goes to the point of needing good people in AI willing to do this squishier less concrete work that really gets at it. I think value alignment is going to be something that’s a little bit more like I know it when I see it. As a field, we need to be moving towards a goal of AI systems where alignment is the end goal, whatever that means.

I’d like to move away from artificial intelligence where we think of intelligence as an ability to solve puzzles to artificial aligning agents where the goal is to build systems that are actually accomplishing goals on your behalf. I think the types of behaviors and strategies that arise from taking that perspective are qualitatively quite different from the strategies of pure puzzle solving on a well specified objective.

Lucas: All this work we’ve been discussing is largely at a theoretic and meta level. At this point, is this the main research that we should be doing, or is there any space for research into what specifically might be implementable today?

Dylan: I don’t think that’s the only work that needs to be done. For me, I think it’s a really important type of work that I’d like to see more of. I think a lot of important work is about understanding how to build these systems in practice and to think hard about designing AI systems with meaningful human oversight.

I’m a big believer in the idea that, in AI safety, the distinction between short-term and long-term issues is not really that large, and that there are synergies between the research problems that go both directions. I believe that on the one hand, looking at short-term safety issues, which includes things like Uber’s car just killed someone, it includes YouTube’s recommendation engine, it includes issues like fake news and information filtering, I believe that all of those things are related to and give us our best window into the types of concerns and issues that may come up with advanced AI.

At the same time, and this is a point that I think people concerned about x-risks do themselves a disservice on by not focusing here. It’s that, actually, doing theory about advanced AI systems and, in particular, about systems where it’s not possible to, what I would call, unilaterally intervene, systems that aren’t corrigible by default, actually gives us a lot of ideas about how to build systems now that are merely hard to intervene with or oversee.

If you’re thinking about issues of monitoring and oversight, and how do you actually get a system that can appropriately evaluate when it should go to a person because its objectives are not properly specified or may not be relevant to the situation, I think YouTube would be in a much better place today if they had a robust system for doing that for their recommendation engine. In a lot of ways, the concerns about x-risks represent an extreme set of assumptions for getting AI right now.

Lucas: I think I’m also just trying to get a better sense of what the system looks like, and how it would be functioning on a day-to-day basis. What is the data that it’s taking in in order to capture, learn, and infer specific human preferences and values? Just trying to understand better whether or not it can model whole moral views and ethical systems of other agents, or if it’s just capturing little specific bits and pieces?

Dylan: I think my ideal would be to, as a system designer, build in as little as possible about my moral beliefs. I think that, ideally, the process would look something … Well, one process that I could see and imagine doing right would be to just directly go after trying to replicate something about the moral imprinting process that people have with their children. Either you had someone who’s like a guardian or is responsible for an AI system’s decision, and we build systems to try to align with one individual, and then try to adopt, and extend, and push forward the beliefs and preferences of that individual. I think that’s one concrete version that I could see.

I think a lot of the place where I see things maybe a little bit different than some people is that I think that the main ethical questions we’re going to be stuck with and the ones that we really need to get right are the mundane ones. The things that most people agree on and think are just, obviously, that’s not okay. Mundane ethics and morals rather than the more esoteric or fancier population ethics questions that can arise. I feel a lot more confident about the ability to build good AI systems if we get that part right. I feel like we’ve got a better shot at getting that part right because there’s a clearer target to shoot for.

Now, what kinds of data would you be looking at? In that case, it would be data from interaction with a couple of select individuals. Ideally, you’d want as much data as you can. What I think you really want to be careful of here is how many assumptions you make about the procedure that’s generating your data.

What I mean by that is whenever you learn from data, you have to make some assumption about how that data relates to the right thing to do, where right is with like a capital R in this case. The more assumptions you make there, the more your systems would be able to learn about values and preferences, and the quicker it would be able to learn about values and preferences. But, the more assumptions and structure you make there, the more likely you are to get something wrong that your system won’t be able to recover from.
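
One hedged way to see this trade-off is to treat "assumptions" as a prior over preference hypotheses. In the toy sketch below, a strong prior that happens to be right learns faster than a broad one, while a strong prior that rules out the truth never recovers, no matter how much data arrives. The three hypotheses, the choice probabilities, and the priors are all invented for the illustration.

```python
def posterior(prior, likelihoods, observations):
    """Bayesian posterior over hypotheses after a sequence of observed choices."""
    belief = dict(prior)
    for obs in observations:
        belief = {h: p * likelihoods[h][obs] for h, p in belief.items()}
        total = sum(belief.values())
        belief = {h: p / total for h, p in belief.items()}
    return belief


# Hypothetical: probability that a person picks option "a" or "b" under each hypothesis.
likelihoods = {
    "prefers_a": {"a": 0.9, "b": 0.1},
    "indifferent": {"a": 0.5, "b": 0.5},
    "prefers_b": {"a": 0.1, "b": 0.9},
}

# Three sets of assumptions about a person who in fact prefers "b".
broad = {"prefers_a": 1 / 3, "indifferent": 1 / 3, "prefers_b": 1 / 3}
strong_right = {"prefers_a": 0.05, "indifferent": 0.05, "prefers_b": 0.9}
strong_wrong = {"prefers_a": 1.0, "indifferent": 0.0, "prefers_b": 0.0}

# After a single observation, the stronger (correct) assumptions are already more confident.
broad_1 = posterior(broad, likelihoods, ["b"])["prefers_b"]
strong_1 = posterior(strong_right, likelihoods, ["b"])["prefers_b"]

# After twenty observations, the wrong assumptions still assign zero to the truth.
stuck = posterior(strong_wrong, likelihoods, ["b"] * 20)["prefers_b"]

print(broad_1, strong_1, stuck)
```

The zero in `strong_wrong` is the "something wrong that your system won’t be able to recover from": Bayes’ rule can only reweight hypotheses that were given some probability to begin with.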

Again, we see this trade off come up: a tension between the amount of uncertainty that you need in the system in order to be able to adapt to the right person and figure out the correct preferences and morals, and the efficiency with which you can figure that out.

I guess, I mean, in saying this it feels a little bit like I’m rambling and unsure about what the answer looks like. I hope that that comes across because I’m really not sure. Beyond the rough structure of data generated from people, interpreted in a way that involves the fewest prior conceptions about what people want and what preferences people have that we can get away with is what I would shoot for. I don’t really know what that would look like in practice.

Lucas: Right. It seems here that it’s encroaching on a bunch of very difficult social, political, and ethical issues involving persons and data, which will be selected for preference aggregation, like how many people are included in developing the reward function and utility function of the AI system. Also, I guess, we have to be considering culturally-sensitive systems where systems operating in different cultures and contexts are going to need to be trained on different sets of data. I guess, there will also be questions of ethics about whether or not we’ll even want systems to be training off of certain cultures’ data.

Dylan: Yeah. I would actually say that a good value … I wouldn’t necessarily even think of it as training off of different data. One of the core questions in artificial intelligence is identifying the relevant community that you are in and building a normative understanding of that community. I want to push back a little bit and move you away from the perspective of we collect data about a culture, and we figure out the values of that culture. Then, we build our system to be value-aligned with that culture.

The way I think about it, the actual AI product is the process whereby we determine, elicit, and respond to the normative values of the multiple overlapping communities that you find yourself in. That process is ongoing. It’s holistic, it’s overlapping, and it’s messy. To the extent that I think it’s possible, I’d like to not have a couple of people sitting around in a room deciding what the right values are. Much more, I think, a system should be holistically designed with value alignment at multiple scales as a core property of AI.

I think that that’s actually a fundamental property of human intelligence. You behave differently based on the different people around, and you’re very, very sensitive to that. There are certain things that are okay at work, that are not okay at home, that are okay on vacation, that are okay around kids, that are not. Figuring out what those things are and adapting yourself to them is the fundamental intelligence skill needed to interact in modern life. Otherwise, you just get shunned.

Lucas: It seems to me in the context of a really holistic, messy, ongoing value alignment procedure, we’ll be aligning AI systems’ ethics, and morals, and moral systems, and behavior with that of a variety of cultures, and persons, and just interactions in the 21st century. When we reflect upon the humans of the past, we can see in various ways that they are just moral monsters. We have issues with slavery, and today we have issues with factory farming, and voting rights, and tons of other things in history.

How should we view and think about aligning powerful systems’ ethics and goals with current human morality and preferences, and the risk of amplifying current things which are immoral in present-day life?

Dylan: This is the idea of mistakenly locking in the wrong values, in some sense. I think it is something we should be concerned about less from the standpoint of entire … Well, no, I think yes from the standpoint of entire cultures getting things wrong. Again, I think if we don’t think of there being a monolithic society that has a single value set, these problems are fundamental issues. What your local community thinks is okay versus what other local communities think is okay.

A lot of our society and a lot of our political structures are about how to handle those clashes between value systems. My ideal for AI systems is that they should become a part of that normative process, and maybe not participate in them as people, but, also, I think, if we think of value alignment as a consistent ongoing messy process, there is … I think maybe that perspective lends itself less towards locking in values and sticking with them. There’s one frame in which you can look at the problem, which is we determine what’s right and what’s wrong, and we program our system to do that.

Then, there’s another one, which is we program our system to be sensitive to what people think is right or wrong. I think that’s more the direction that I think of value alignment in. Then, what I think the final part of what you’re getting at here is that the system actually will feed back into people. What AI systems show us will shape what we think is okay and vice versa. That’s something that I am quite frankly not sure how to handle. I don’t know how you’re going to influence what someone wants, and what they will perceive that they want, and how to do that, I guess, correctly.

All I can say is that we do have a human notion of what is acceptable manipulation. We do have a human notion of allowing someone to figure out for themselves what they think is right and not and refraining from biasing them too far. To some extent, if you’re able to value align with communities in a good ongoing holistic manner, that should also give you some ways to choose and understand what types of manipulations you may be doing that are okay or not.

I’d also say that I think that this perspective has a very mundane analogy when you think of the feedback cycle between recommendation engines and regular people. Those systems don’t model the effect … Well, they don’t explicitly model the fact that they’re changing the structure of what people want and what they’ll want in the future. That’s probably not the best analogy in the world.

I guess what I’m saying is that it’s hard to plan for how you’re going to influence someone’s desires in the future. It’s not clear to me what’s right or what’s wrong. What’s true is that we, as humans, have a lot of norms about what types of manipulation are okay or not. You might hope that appropriately doing value alignment in that way might help get to an answer here.

Lucas: I’m just trying to get a better sense here. When I’m thinking about the role that ethics and intelligence play here, I view intelligence as a means of modeling the world and achieving goals, and ethics as the end towards which intelligence is aimed here. Now, I’m curious in terms of behavior modeling where inverse reinforcement learning agents are modeling, I guess, the behavior of human agents and, also, predicting the sorts of behaviors that they’d be taking in the future or in the situation in which the inverse reinforcement learning agent finds itself.

I’m curious to know where metaethics and moral epistemology fit in, where inverse reinforcement learning agents are finding themselves in novel ethical situations, and what their ability to handle those novel ethical situations is like. When they’re handling those situations, how much does it look like them performing some normative and metaethical calculus based on the kind of moral epistemology that they have, or how much does it look like they’re using some other behavioral predictive system where they’re modeling humans?

Dylan: The answer to that question is not clear. What does it actually mean to make decisions based on an ethical framework or metaethical framework? I guess, we could start there. You and I know what that means, but our definition is encumbered by the fact that it’s pretty human-centric. I think we talk about it in terms of, “Well, I weighed this option. I looked at that possibility.” We don’t even really mean the literal sense of weighed, as in actually counted up, and constructed actual numbers, and multiplied them together in our heads.

What these are is they’re actually references to complex thought patterns that we’re going through. They’re fine whether or not those thought patterns are going on. The AI system, you can also talk about the difference between the process of making a decision and the substance of it. When an inverse reinforcement learning agent is going out into the world, the policy it’s following is constructed to try to optimize a set of inferred preferences, but does that mean that the policy you’re outputting is making metaethical characterizations?

Well, at the moment, almost certainly not because the systems we build are just not capable of that type of cognitive reasoning. I think the bigger question is, do you care? To some extent, you probably do.

Lucas: I mean, I’d care if I had some very deep disagreements with the metaethics that led to the preferences that were learned and loaded into the machine. Also, if the machine were in such a new novel ethical situation that was unlike anything human beings had faced that just required some metaethical reasoning to deal with.

Dylan: Yes. I mean, I think you definitely want it to take decisions that you would agree with or, at least, that you could be non-maliciously convinced to agree with. Practically, there isn’t a place in the theory where that shows up. It’s not clear that what you’re saying is that different from value alignment in particular. If I were to try to refine the point about metaethics, what it sounds to me like you’re getting at is an inductive bias that you’re looking for in the AI systems.

Arguably, ethics is about an argument of what inductive bias we should have as humans. I don’t think that that’s a first-order property in value alignment systems necessarily or in preference-based learning systems in particular. I would think that that kind of metaethics, I think, comes in from value aligning to someone that has these sophisticated ethical ideas.

I don’t know where your thoughts about metaethics came from, but, at least, indirectly, we can probably trace them down to the values that your parents inculcated in you as a child. That’s how we build metaethics into your head if we want to think of you as being an AGI. I think that for AI systems, that’s the same way that I would see it being in there. I don’t believe the brain has circuits dedicated to metaethics. I think that exists in software, and in particular, something that’s being programmed into humans from their observational data, more so than from the structures that are built into us as a fundamental part of our intelligence or value alignment.

Lucas: We’ve also talked a bit about how human beings are potentially not fully rational agents. With inverse reinforcement learning, this leaves open the question as to whether or not AI systems are actually capturing what the human being actually prefers, or if there’s some limitations in the humans’ observed or chosen behavior, or explicitly told preferences like limits in that ability to convey what we actually most deeply value or would value given more information. These inverse reinforcement learning systems may not be learning what we actually value or what we think we should value.

How can AI systems assist in this evolution of human morality and preferences whereby we’re actually conveying what we actually value and what we would value given more information?

Dylan: Well, there are certainly two things that I heard in that question. One is, how do you just mathematically account for the fact that people are irrational, and that that is a property of the source of your data? Inverse reinforcement learning, at face value, doesn’t allow us to model that appropriately. It may lead us to make the wrong inferences. I think that’s a very interesting question. It’s probably the main one that I think about now as a technical problem is understanding, what are good ways to model how people might or might not be rational, and building systems that can appropriately interact with that complex data source.

One recent thing that I’ve been thinking about is, what happens if people, rather than knowing their objective, what they’re trying to accomplish, are figuring it out over time? This is the model where the person is a learning agent that discovers how they like states when they enter them, rather than thinking of the person as an agent that already knows what they want, and they’re just planning to accomplish that. I think these types of assumptions that try to paint a very, very broad picture of the space of things that people are doing can help us in that vein.

When someone is learning, it’s actually interesting that you can actually end up helping them. You end up with a classic strategy that looks like it breaks down into three phases. You have an initial exploration phase where you help the learning agent to get a better picture of the world, and the dynamics, and its associated rewards.

Then, you have another observation phase where you observe how that agent, now, takes advantage of the information that it’s got. Then, there’s an exploitation or extrapolation phase where you try to implement the optimal policy given the information you’ve seen so far. I think, moving towards more complex models that have a more realistic setting and richer set of assumptions behind them is important.
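The three phases can be sketched in a toy setting. This is only an illustration under strong simplifying assumptions: the `LearningHuman` class, the `assist` function, and the drink options are hypothetical, rewards are noise-free, and the person is assumed to learn an option's value the first time they experience it.

```python
from collections import Counter

class LearningHuman:
    """Hypothetical person who discovers how much they like an option only by trying it."""

    def __init__(self, true_rewards):
        self._true_rewards = true_rewards  # hidden from the helper
        self.experienced = {}              # option -> reward felt so far

    def experience(self, option):
        self.experienced[option] = self._true_rewards[option]

    def choose(self):
        # Picks the best option among those the person has already tried.
        return max(self.experienced, key=self.experienced.get)

def assist(human, options, observe_rounds=3):
    # Phase 1 (exploration): help the person experience every option.
    for option in options:
        human.experience(option)
    # Phase 2 (observation): watch what the now-informed person chooses.
    observed = Counter(human.choose() for _ in range(observe_rounds))
    # Phase 3 (exploitation): act on the most frequently observed choice.
    return observed.most_common(1)[0][0]

options = ["tea", "coffee", "water"]
rewards = {"tea": 0.2, "coffee": 0.9, "water": 0.5}

# Observing too early is misleading: this person has only ever tried tea.
uninformed = LearningHuman(rewards)
uninformed.experience("tea")
early_guess = uninformed.choose()

informed = LearningHuman(rewards)
best = assist(informed, options)
print(early_guess, best)
```

The contrast between `early_guess` and `best` is the reason for the ordering: behavior observed before the person has explored reflects what they have seen so far, not what they would prefer.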

The other thing you talked about was about helping people discover their morality and learn more what’s okay and what’s not. There, I’m afraid I don’t have too much interesting to say in the sense that I believe it’s an important question, but I just don’t feel that I have many answers there.

Practically, if you have someone who’s learning their preferences over time, is that different than humans refining their moral theories? I don’t know. You could make mathematical modeling choices, so that they are. I’m not sure if that really gets at what you’re trying to point towards. I’m sorry that I don’t have anything more interesting to say on that front other than, I think, it’s important, and I would love to talk to more people who are spending their days thinking about that question because I think it really does deserve that kind of intellectual effort.

Lucas: Yeah, yeah. It sounds like we need some more AI moral psychologists to help us think about these things.

Dylan: Yeah. In particular, when talking about philosophy around value alignment and the ethics of value alignment, I think a really important question is, what are the ethics of developing value alignment systems? A lot of times, people talk about AI ethics from the standpoint of, for lack of a better example, the trolley problem. The way they think about it is, who should the car kill? There is a correct answer or maybe not a correct answer, but there are answers that we could think of as more or less bad. Which one of those options should the AI select? That’s not unimportant, but it’s not the ethical question that an AI system designer is faced with.

In my mind, if you’re designing a self-driving car, the relevant questions you should be asking are two things: One, what do I think is an okay way to respond to different situations? Two, how is my system going to be understanding the preferences of the people involved in those situations? Then, three, how should I design my system in light of those two facts?

I have my own preferences about what I would like my system to do. I have an ethical responsibility, I would say, to make sure that my system is adapting to the preferences of its users to the extent that it can. I also wonder to what extent. How should you handle things when there are conflicts between those two value sets?

You’re building a robot. It’s going to go and live with an uncontacted human tribe. Should it respect the local cultural traditions and customs? Probably. That would be respecting the values of the users. Then, let’s say that that tribe does something that we would consider to be gross like pedophilia. Is my system required to participate wholesale in that value system? Where is the line that we would need to draw between unfairly imposing my values on system users and being able to make sure that the technology that I build isn’t used for purposes that I would deem reprehensible or gross?

Lucas: Maybe we should just put a dial in each of the autonomous cars that lets the user set it to deontology mode or utilitarianism mode as it’s racing down the highway. Yeah, I think this is the … I guess, an important role. I just think that metaethics is super important. I’m not sure if this is necessarily the case, but if fully autonomous systems are going to play a role where they’re resolving these ethical dilemmas for us, which I guess at some point eventually, if they’re going to be really actually autonomous and help to make the world a much better place seems necessary.

I guess, this feeds into my next question where I’m wondering where we probably both have different assumptions about this, but what the role of inverse reinforcement learning is ultimately? Is it just to allow AI systems to evolve alongside us and to match current ethics or is it to allow the systems to ultimately surpass us and move far beyond us into the deep future?

Dylan: Inverse reinforcement learning, I think, is much more about the first than the second. I think it can be a part of how you get to the second and how you improve. For me, when I think about these problems technically, I try to think about matching human morality as the goal.

Lucas: Except for the factory farming and stuff.

Dylan: Well, I mean, if you had a choice between a system that thinks that eradicating all humans is okay and is against factory farming, versus one that is neutral about factory farming and thinks that eradicating all humans isn’t okay, which would you pick? I mean, I guess, with your audience that there are maybe some people that would choose the saving the animals answer.

My point is that, I think, it’s so hard for me. Technically, I think it’s very hard to imagine getting these normative aspects of human societies and interaction right. I think, just hoping to participate in that process in a way that is analogous to how people do normally is a good step. I think we probably, to the extent that we can, should probably not have AI systems trying to figure out if it’s okay to do factory farming and to the extent that we can …

I think that it’s so hard to understand what it means to even match human morality or participate in it that, for me, the concept of surpassing, it feels very, very challenging and fraught. I would worry, as a general concern, that as a system designer who doesn’t necessarily represent the views and interest of everyone, that by programming in surpassing humanity or surpassing human preferences or morals, what I’m actually doing is just programming in my morals and ethical beliefs.

Lucas: Yes. I mean, there seems to be this strange issue here where it seems like if we get AGI, and recursive self-improvement is a thing that really takes it off, so that we have a system that has potentially succeeded in its inverse reinforcement learning, but far surpassed human beings in its general intelligence. We have a superintelligence that’s matching human morality. It just seems like a funny situation where we’d really have to pull the brakes and, I guess, as William MacAskill mentions, have a really, really long deliberation about ethics, and moral epistemology, and value. How do you view that?

Dylan: I think that’s right. I mean, I think there are some real questions about who should be involved in that conversation. For instance, I actually even think it’s … Well, one thing I’d say is that you should recognize that there’s a difference between having the same morality and having the same data. One way to think about it is that people who are against factory farming have a different morality than the rest of the people.

Another one is that they actually just have exposure to the information that allows their morality to come to a better answer. There’s this confusion you can make between the objective that someone has and the data that they’ve seen so far. I think, one point would be to think that a system that has current human morality but access to a vast, vast wealth of information may actually do much better than you might think. I think, we should leave that open as a possibility.

For me, this is less about morality in particular, and more just about power concentration, and how much influence you have over the world. I mean, if we imagine that there was something like a very powerful AI system that was controlled by a small number of people, yeah, you better think freaking hard before you tell that system what to do. That’s related to questions about ethical ramifications on metaethics, and generalization, and what we actually truly value as humans. That is also super true for all of the more mundane things in the day to day as well. Did that make sense?

Lucas: Yeah, yeah. It totally makes sense. I’m becoming increasingly mindful of your time here. I just wanted to hit a few more questions if that’s okay before I let you go.

Dylan: Please, yeah.

Lucas: Yeah. I’m wondering, would you like to, or do you have any thoughts on how coherent extrapolated volition fits into this conversation and your views on it?

Dylan: What I’d say is I think coherent extrapolated volition is an interesting idea and goal.

Lucas: Where it is defined as?

Dylan: Where it’s defined as a method of preference aggregation. Personally, I’m a little wary of preference aggregation approaches. Well, I’m wary of imposing your morals on someone indirectly via choosing the method of preference aggregation that we’re going to use. I would-

Lucas: Right, but it seems like, at some point, we have to make some metaethical decision, or else, we’ll just forever be lost.

Dylan: Do we have to?

Lucas: Well, some agent does.

Dylan: My-

Lucas: Go ahead.

Dylan: Well, does one agent have to? Did one agent decide on the ways that we were going to do preference aggregation as a society?

Lucas: No. It naturally evolved out of-

Dylan: It just naturally evolved via a coordination and argumentative process. For me, my answer to … If you force me to specify something about how we’re going to do value aggregation, if I was controlling the values for an AGI system, I would try to say as little as possible about the way that we’re going to aggregate values because I think we don’t actually understand that process much in humans.

Lucas: Right. That’s fair.

Dylan: Instead, I would opt for a heuristic of, to the extent that we can, devoting equal optimization effort towards every individual, and allowing that parliament, if you will, to determine the way the values should be aggregated. This doesn’t necessarily mean having an explicit value aggregation mechanism that gets set in stone. This could be an argumentative process mediated by artificial agents arguing on your behalf. This could be a futuristic AI-enabled version of the court system.

Lucas: It’s like an ecosystem of preferences and values in conversation?

Dylan: Exactly.

Lucas: Cool. We’ve talked a little bit about the deep future here now with where we’re reaching around potentially like AGI or artificial superintelligence. After, I guess, inverse reinforcement learning is potentially solved, is there anything that you view that comes after inverse reinforcement learning in these techniques?

Dylan: Yeah. I mean, I think inverse reinforcement learning is certainly not the be-all, end-all. I think what it is, is it’s one of the earliest examples in AI of trying to really look at preference elicitation, and modeling preferences, and learning preferences. It existed in a whole bunch of … economists have been thinking about this for a while already. Basically, yeah, I think there’s a lot to be said about how you model data and how you learn about preferences and goals. I think inverse reinforcement learning is basically the first attempt to get at that, but it’s very far from the end.

I would say the biggest thing in how I view things that is maybe different from your standard reinforcement learning, inverse reinforcement learning perspective is that I focus a lot on, how do you act given what you’ve learned from inverse reinforcement learning. Inverse reinforcement learning is a pure inference problem. It’s just figure out what someone wants. I ground that out in all of our research in take actions to help someone, which introduces a new set of concerns and questions.
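The gap between inferring what someone wants and acting to help them can be illustrated with a small sketch. Everything here is hypothetical: the reward hypotheses, the posterior probabilities, and the function names are made up for illustration, and maximizing expected reward is just one simple decision rule, not the approach used in the research being discussed. The sketch assumes an inference step has already produced a posterior over candidate reward functions.

```python
def expected_rewards(hypotheses, posterior):
    """Expected reward of each action, averaging over uncertainty about what the person wants."""
    n_actions = len(hypotheses[0])
    return [
        sum(p * rewards[a] for rewards, p in zip(hypotheses, posterior))
        for a in range(n_actions)
    ]

def choose_helpful_action(hypotheses, posterior):
    """Pick the action with the highest expected reward under the posterior."""
    scores = expected_rewards(hypotheses, posterior)
    return max(range(len(scores)), key=scores.__getitem__)

def map_action(hypotheses, posterior):
    """Pure-inference baseline: commit to the most likely hypothesis, then optimize it."""
    best_hypothesis = max(range(len(posterior)), key=posterior.__getitem__)
    rewards = hypotheses[best_hypothesis]
    return max(range(len(rewards)), key=rewards.__getitem__)

# Hypothetical setup: actions 0 and 1 are great under one hypothesis and
# very bad under the other; action 2 is decent under both.
hypotheses = [
    [1.0, -10.0, 0.6],
    [-10.0, 1.0, 0.6],
]
posterior = [0.7, 0.3]  # assumed output of some earlier inference step

print(map_action(hypotheses, posterior))             # commits to the likeliest guess
print(choose_helpful_action(hypotheses, posterior))  # hedges against being wrong
```

Treating the problem as inference alone suggests acting on the single most likely hypothesis. Once the system has to take actions on someone's behalf, the cost of being wrong matters, and the cautious action can win even though no hypothesis ranks it first.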

Lucas: Great. It looks like we’re about at the end of the hour here. I guess, if anyone here is interested in working on this technical portion of the AI alignment problem, what do you suggest they study or how do you view that it’s best for them to get involved, especially if they want to work on inverse reinforcement learning and inferring human preferences?

Dylan: I think if you’re an interested person, and you want to get into technical safety work, the first thing you should do is probably read Jan Leike’s recent write up in 80,000 Hours. Generally, what I would say is, try to get involved in AI research flat. Don’t focus as much on trying to get into AI safety research, and just generally focus more on acquiring the skills that will support you in doing good AI research. Get a strong math background. Get a research advisor who will advise you on doing research projects, and help teach you the process of submitting papers, and figuring out what the AI research community is going to be interested in.

In my experience, one of the biggest pitfalls that early researchers make is focusing too much on what they’re researching rather than thinking about who they’re researching with, and how they’re going to learn the skills that will support doing research in the future. I think that most people don’t appreciate how transferable research skills are. To the extent that you can, try to do research on technical AI safety, but more, work on technical AI. If you’re interested in safety, the safety connections will be there. You may see how a new area of AI actually relates to it, supports it, or you may find places of new risks, and be in a good position to try to mitigate that and take steps to alleviate those harms.

Lucas: Wonderful. Yeah, thank you so much for speaking with me today, Dylan. It’s really been a pleasure, and it’s been super interesting.

Dylan: It was a pleasure talking to you. I love the chance to have these types of discussions.

Lucas: Great. Thanks so much. Until next time.

Dylan: Until next time. Thanks a blast.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back soon with another episode in this new AI alignment series.

[end of recorded material]