AI Alignment Podcast: Moral Uncertainty and the Path to AI Alignment with William MacAskill
How are we to make progress on AI alignment given moral uncertainty? What are the ideal ways of resolving conflicting value systems and views of morality among persons? How ought we to go about AI alignment given that we are unsure about our normative and metaethical theories? How should preferences be aggregated and persons idealized in the context of our uncertainty?
Moral Uncertainty and the Path to AI Alignment with William MacAskill is the fifth podcast in the new AI Alignment series, hosted by Lucas Perry. For those of you that are new, this series covers and explores the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, or your preferred podcast site/application.
If you're interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.
In this podcast, Lucas spoke with William MacAskill. Will is a professor of philosophy at the University of Oxford and is a co-founder of the Centre for Effective Altruism, Giving What We Can, and 80,000 Hours. Will helped to create the effective altruism movement, and his writing is mainly focused on issues of normative and decision-theoretic uncertainty, as well as general issues in ethics.
Topics discussed in this episode include:
- Will’s current normative and metaethical credences
- The value of moral information and moral philosophy
- A taxonomy of the AI alignment problem
- How we ought to practice AI alignment given moral uncertainty
- Moral uncertainty in preference aggregation
- Moral uncertainty in deciding where we ought to be going as a society
- Idealizing persons and their preferences
- The most neglected portion of AI alignment
Transcript
Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast series at the Future of Life Institute. I'm Lucas Perry, and today we'll be speaking with William MacAskill on moral uncertainty and its place in AI alignment. If you've been enjoying this series and finding it interesting or valuable, it's a big help if you can share it on social media and follow us on your preferred listening platform.
Will is a professor of philosophy at the University of Oxford and is a co-founder of the Centre for Effective Altruism, Giving What We Can, and 80,000 Hours. Will helped to create the effective altruism movement, and his writing is mainly focused on issues of normative and decision-theoretic uncertainty, as well as general issues in ethics. And so, without further ado, I give you William MacAskill.
Yeah, Will, thanks so much for coming on the podcast. It's really great to have you here.
Will: Thanks for having me on.
Lucas: So, I guess we can start off. You can tell us a little bit about the work that you've been up to recently in terms of your work in the space of metaethics and moral uncertainty just over the past few years and how that's been evolving.
Will: Great. My PhD topic was on moral uncertainty, and I'm just putting the finishing touches on a book on this topic. The idea here is to appreciate the fact that we very often are just unsure about what we ought, morally speaking, to do. It's also plausible that we ought to be unsure about what we ought morally to do. Ethics is a really hard subject, there's tons of disagreement, it would be overconfident to think, "Oh, I've definitely figured out the correct moral view." So my work focuses on not really the question of how unsure we should be, but instead what should we do given that we're uncertain?
In particular, I look at the issue of whether we can apply the same sort of reasoning that we apply to uncertainty about matters of fact to moral uncertainty. Can we use what is known as "expected utility theory", which is very widely accepted as at least approximately correct in cases of empirical uncertainty? Can we apply that in the same way in the case of moral uncertainty?
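(To make the expected-utility idea concrete, here is an editorial illustration, not part of the conversation. Under the "maximize expected choiceworthiness" proposal, you weight how choiceworthy each option is under each moral theory by your credence in that theory, just as expected utility weights outcomes by their probabilities. The credences, options, and scores below are made up purely for the sake of the example.)

```python
# Maximizing expected choiceworthiness: weight each option's choiceworthiness
# under each moral theory by the credence assigned to that theory.
# All numbers here are illustrative, not from the episode.

credences = {"utilitarianism": 0.6, "deontology": 0.4}

# choiceworthiness[option][theory]: how good the option is by that theory's lights
choiceworthiness = {
    "eat_meat":   {"utilitarianism": 1.0, "deontology": -10.0},
    "eat_plants": {"utilitarianism": 0.5, "deontology": 0.5},
}

def expected_choiceworthiness(option):
    # Analogue of expected utility: sum over theories instead of world-states.
    return sum(credences[t] * choiceworthiness[option][t] for t in credences)

best = max(choiceworthiness, key=expected_choiceworthiness)
print(best)  # eat_plants: a modest credence in grave wrongness dominates
```

Note that summing scores across theories assumes their choiceworthiness scales are comparable, which is exactly the intertheoretic-comparison problem raised later in the episode.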
Lucas: Right. And so coming on here, you also have an unpublished book on moral uncertainty that you've been working on. Have you just been expanding this exploration in that book, diving deeper into it?
Will: That's right. There's actually been very little written on the topic of moral uncertainty, at least in modern times, relative to its importance. I would think of this as a discipline that should be studied as much as consequentialism or contractualism or Kantianism is studied. But in modern times there's really only one book that's been written on the topic, and that was published 18 years ago now. What we want this book to be is, firstly, a kind of definitive introduction to the topic. It's co-authored, with me as lead author, along with Toby Ord and Krister Bykvist, and it lays out what we see as the most promising path forward in terms of addressing some of the challenges that face an account of decision-making under moral uncertainty, some of the implications of taking moral uncertainty seriously, and also some of the unanswered questions.
Lucas: Awesome. So I guess, just moving forward here, you have a podcast that you already did with Rob Wiblin of 80,000 Hours. So I guess we can avoid covering a lot of the basics here about your views on using expected utility calculus in moral reasoning and moral uncertainty in order to decide what one ought to do when one is not sure what one ought to do. People can go ahead and listen to that podcast, which I'll provide a link to in the description.
It would also be good just to get a general sense of where your metaethical partialities generally lie right now. So what sort of metaethical positions do you tend to give the most credence to?
Will: Okay, well that's a very well put question 'cause, as with all things, I think it's better to talk about degrees of belief rather than absolute belief. So normally if you ask a philosopher this question, we'll say, "I'm a nihilist," or "I'm a moral realist," or something, so I think it's better to split your credences. So I think I'm about 50/50 between nihilism or error theory and something that's non-nihilistic.
Where by nihilism or error theory, I just mean the view that any positive moral statement or normative statement or evaluative statement is false. That includes "you ought to maximize happiness", or "if you want a lot of money, you ought to become a banker", or "pain is bad". On this view, all of those things are false. All positive, normative or evaluative claims are false. So it's a very radical view. And we can talk more about that, if you'd like.
In terms of the rest of my credence, the view that I'm kind of most sympathetic towards in the sense of the one that occupies most of my mental attention is a relatively robust form of moral realism. It's not clear whether it should be called kind of naturalist moral realism or non-naturalist moral realism, but the important aspect of it is just that goodness and badness are kind of these fundamental moral properties and are properties of experience.
The things that are of value are things that supervene on conscious states, in particular good states or bad states, and the way we know about them is just by direct experience with them. Just by being acquainted with a state like pain gives us a reason for thinking we ought to have less of this in the world. So that's my kind of favored view in the sense it's the one I'd be most likely to defend in the seminar room.
And then I give somewhat less credence to a couple of views. One is a view called "subjectivism", which is the idea that what you ought to do is determined in some sense by what you want to do. So the simplest view there would just be that when I say, "I ought to do X," that just means I want to do X in some way. Or a more sophisticated version would be ideal subjectivism, where when I say I ought to do X, it means some very idealized version of myself would want me to do X, perhaps if I had unlimited amounts of knowledge and much greater computational power and so on. I'm a little less sympathetic to that than many people I know. We'll go into that.
And then a final view that I'm also less sympathetic towards is non-cognitivism, which would be the idea that our moral statements ... So when I say, "Murder is wrong," I'm not even attempting to express a proposition. What they're doing is just expressing some emotion of mine, like, "Yuk. Murder. Ugh," in the same way that when I said that, that wasn't expressing any proposition, it was just expressing some sort of pro or negative attitude. And again, I don't find that terribly plausible, again for reasons we can go into.
Lucas: Right, so those first two views were cognitivist views, which makes them fall under a semantic theory on which people are making true or false statements when they claim moral facts. And the error theory and your moral realism are both metaphysical views, which I think is probably what we'll mostly be interested in here in terms of the AI alignment problem.
There are other issues in metaethics, for example having to do with semantics, as you just discussed; you feel as though you give some credence to non-cognitivism. But there are also justificatory issues, like issues in moral epistemology: how one can come to know metaethical facts, and why one ought to follow them if there are such facts. Where do you fall in that camp?
Will: Well, I think all of those views are quite well tied together, so what sort of moral epistemology you have depends very closely, I think, on what sort of meta-ethical view you have, and I actually think, often, is intimately related as well to what sort of view in normative ethics you have. So my preferred philosophical world view, as it were, the one I'd defend in a seminar room, is classical utilitarian in its normative view, so the only thing that matters is positive or negative mental states.
In terms of its moral epistemology, the way we access what is of value is just by experiencing it, in just the same way we access conscious states. There are also some things you can't access merely through experience. Why is it, for example, that we should maximize the sum of good experiences rather than the product, or something? That's a view that you've got to arrive at by kind of reasoning rather than just purely from experience.
Part of my epistemology does appeal to whatever this spooky ability we have to reason about abstract affairs, but it's the same sort of faculty that is used when we think about mathematics or set theory or other areas of philosophy. If, however, I had some different view, so supposing we were a subjectivist, well then moral epistemology looks very different. You're actually just kind of reflecting on your own values, maybe looking at what you would actually do in different circumstances and so on, reflecting on your own preferences, and that's the right way to come to the right kind of moral views.
There's also another meta-ethical view called "constructivism" that I'm definitely not the best person to talk about. But on that view, again, it's not really a realist view; on this view we just have a bunch of beliefs and intuitions, and the correct moral view is just the best kind of systematization of those beliefs and intuitions, in the same way as you might think of linguistics. It is a science, but it's fundamentally based just on what our linguistic intuitions are. It's just kind of a systematization of them.
On that view, then, moral epistemology would be about reflecting on your own moral intuitions. You just got all of this data, which is the way things seem like to you, morally speaking, and then you're just doing the systematization thing. So I feel like the question of moral epistemology can't be answered in a vacuum. You've got to think about your meta-ethical view of the metaphysics of ethics at the same time.
Lucas: I'm pretty interested in poking a little bit more into that sort of 50% credence you give to your moral realist view, which is super interesting because it's a view that people tend not to have, I guess, in the AI, computer science, rationality, EA space. There tend to be, I guess, a lot of moral anti-realists in this space.
In my last podcast, I spoke with David Pearce, and he also seemed to have a view like this, and I'm wondering if you can unpack yours a little bit. He believed that suffering and pleasure disclose the in-built pleasure/pain axis of the universe. You can think of minds as objective features of the world, because they in fact are objective features of the world, and the phenomenology and experience of each person is objective in the same way that someone could objectively be experiencing redness, and in the same sense they could be objectively experiencing pain.
It seems to me, and I don't fully understand the view, but the claim is that there is some sort of in-built quality or property to the hedonic qualia of suffering or pleasure that discloses their in-built value.
Will: Yeah.
Lucas: Could you unpack it a little bit more about the metaphysics of that and what that even means?
Will: It sounds like David Pearce and I have quite similar views. I think relying heavily on the analogy with, or very close analogy with, consciousness is going to help. Imagine you're kind of a robot scientist: you don't have any conscious experiences, but you're doing all this fancy science and so on, and then you kind of write out the book of the world, and I'm like, "Hey, there's this thing you missed out. It's conscious experience." And you, the robot scientist, would say, "Wow, that's just insane. You're saying that some bits of matter have this first-person subjective feel to them? Why on earth would we ever believe that? That's just so out of whack with the naturalistic understanding of the world." And it's true. It just doesn't make any sense given what we know now. It's a very strange phenomenon to exist in the world.
And so one of the arguments that motivates error theory is this idea of just, well, if values were to exist, they would just be so weird, what Mackie calls "queer". It's just so strange that, by a principle of Occam's razor of not adding strange things into our ontology, we should assume they don't exist.
But that argument would work in the same way against conscious experience, and the best response we've got is to say, no, but I know I'm conscious, and just tell by introspecting. I think we can run the same sort of argument when it comes to a property of consciousness as well, which is namely the goodness or badness of certain conscious experiences.
So now I just want you to go kind of totally a-theoretic. Imagine you've not thought about philosophy at all, or even science at all, and I were just to rip off one of your fingernails, or something. And then I say, "Is that experience bad?" And you would say yes.
Lucas: Yeah, it's bad.
Will: And I would ask, how confident are you? You're more confident that this pain is bad than that you even have hands, perhaps. That's at least how it seems to be for me. So then it seems like, yeah, we've got this thing that we're actually incredibly confident of, which is the badness of pain, or at least the badness of pain for me. And so that's what initially gives the case for then thinking, okay, well, that's at least one objective moral fact: that pain is bad, or at least pain is bad for me.
Lucas: Right, so the step where I think that people will tend to get lost in this is ... I thought the part about Occam's razor was very interesting. I think that most people are anti-realists because they use Occam's razor there and think, what the hell would a value even be anyway, in the third-person objective sense? That just seems really queer, as you put it. So I think people get lost at the step where the first person seems to simply have a property of badness to it.
I don't know what that would mean if one has a naturalistic, reductionist view of the world. There seems to be just entropy, noise and quarks, and maybe qualia as well. It's not clear to me how we should think about properties of qualia, and whether or not one can derive normative claims from them, that is, move from "is" statements about the properties of qualia to "ought" statements.
Will: One thing I want to be very clear on is that it definitely is the case that, on this view, we really have no idea. We are currently completely in the dark about any sort of explanation of how matter and forces and energy could result in goodness or badness, something that ought to be promoted. But that's also true of conscious experience. We have no idea how on earth matter could result in conscious experience. At the same time, it would be a mistake to start denying conscious experience.
And then we can say, okay, we don't really know what's going on, but we accept that there's conscious experience. And then I think if you were again just to completely pre-theoretically start categorizing the different conscious experiences that we have, we'd say that some are red and some are blue, some are maybe more intense, some are kind of dimmer than others; you'd maybe classify them into sights and sounds and other sorts of experiences.
I think a very natural classification would also be the ones that are good and the ones that are bad. And then when we cash that out further, I don't think the best explanation is that when we say "oh, this is good" or "this is bad" it means what we want or what we don't want, but instead that it's what we think we have reason to want or reason not to want. Experience seems to give us evidence for those sorts of things.
Lucas: I guess my concern here is just that I worry that words like "good" and "bad" or "valuable" or "dis-valuable" ... I feel some skepticism about whether or not they disclose some sort of intrinsic property of the qualia. I'm also not sure what the claim here is about the nature and kinds of properties that qualia can have attached to them. I worry that goodness and badness might be some sort of evolutionary fiction which enhances our fitness, but doesn't actually disclose some sort of intrinsic metaphysical quality or property of a kind of experience.
Will: One thing I'll say is, again, remember that I've got this 50% credence on error theory. So in general, all these worries, maybe this is just some evolutionary fiction, things just seem bad but they're not actually, and so on: I actually think those are good arguments, and so they should give us some degree of confidence in this idea that actually nothing matters at all.
But underlying a lot of my views is this more general argument: if you're unsure between two views, on one of which nothing matters at all and we've got no reasons for action, while on the other we do have some reasons for action, then you can just ignore the one that says you've got no reasons for action, because you're not going to do badly by its lights no matter what you do. If I were to go around shooting everybody, that wouldn't be bad or wrong on nihilism.
So if there are arguments, such as an evolutionary argument, that push us in the direction of error theory, in a sense we can put them to the side. What we ought to do is say, yes, we take those really seriously, and they give us a high credence in error theory; but now ask, after all those arguments, which views most plausibly retain their force.
So this is why, with the kind of evolutionary worry, I'm just like, yes. But suppose it's the case that there actually are good and bad experiences. Presumably conscious experiences themselves are useful in some evolutionary way that, again, we don't really understand. I think, presumably, good and bad experiences are also useful in some evolutionary way that we don't fully understand, perhaps because they have a tendency to motivate at least beings like us, and that in fact seems to be a key aspect of making a goodness or badness statement. It's at least somehow tied up with the idea of motivation.
And when I talk about ascribing a property to a conscious experience, I really just don't mean anything beyond whatever it is that we mean when we say that this experience is red-seeming or this experience is blue-seeming. Again, it's an open philosophical question what we even mean by properties, but in the same way: this is bad-seeming, this is good-seeming.
Before I got into thinking about philosophy and naturalism and so on, would I have thought those things are kind of on a par? I think I would have done. So it's at least a pre-theoretically justified view to think, yeah, there just is this axiological property of my experience.
Lucas: This has made me much more optimistic. I think after my last podcast I was feeling quite depressed and nihilistic, and hearing you give this sort of non-naturalistic or naturalistic moral realist account is cheering me up a bit about the prospects of AI alignment and value in the world.
Will: I mean, I think you shouldn't get too optimistic. I'm almost certainly wrong-
Lucas: Yeah.
Will: ... sort of is my favorite view. But take any philosopher. What's the chance that they've got the right views? Very low, probably.
Lucas: Right, right. I think I also need to be careful here, in that human beings have this sort of psychological bias where we give a special metaphysical status, and kind of meaning and motivation, to things which are objective. I guess there's also some sort of motivation that I need to be mindful of, which seeks to make value objective or more meaningful and foundational in the universe.
Will: Yeah. The thing that I think should make you feel optimistic, or at least motivated, is this argument that if nothing matters, it doesn't matter that nothing matters. It just really ought not to affect what you do. You may as well act as if things do matter, and in fact we can have this project of trying to figure out if things matter, and that maybe could be an instrumental goal, which kind of is a purpose for life is to get to a place where we really can figure out if it has any meaning. I think that sort of argument can at least give one grounds for getting out of bed in the morning.
Lucas: Right. I think there's this philosophy paper that I saw, but didn't read, that was like "nothing matters, but it Matters", with the one lowercase m and then a capital M, you know.
Will: Oh, interesting.
Lucas: Yeah.
Will: It sounds a bit like 4:20 ethics.
Lucas: Yeah, cool.
Moving on here into AI alignment. Before we get into this, I think there's something it would be interesting to hear you speak a little bit more about first. What even is the value of moral information and moral philosophy, generally? Is this all just a bunch of BS, or how can it be interesting and/or useful in our lives, and in science and technology?
Will: Okay, terrific. I mean, and this is something I write about in a paper I'm working on now and also in the book, as well.
So, yeah, I think the stereotype of the philosopher engaged in intellectual masturbation, not really doing much for the world at all, is quite a prevalent stereotype. I'll not comment on whether that's true for certain areas of philosophy. I think it's definitely not true for certain areas within ethics. What is true is that philosophy is very hard, ethics is very hard. Most of the time when we're trying to do this, we make very little progress.
If you look at the long-run history of thought in ethics and political philosophy, the influence is absolutely huge. Even just take Aristotle, Locke, Hobbes, Mill, and Marx. The influence of political philosophy and moral philosophy there, it shaped thousands of years of human history. Certainly not always for the better, sometimes for the worse, as well. So, ensuring that we get some of these ideas correct is just absolutely crucial.
Similarly, even in more recent times ... Obviously not as influential as these other people, but also it's been much less time so we can't predict into the future, but if you consider Peter Singer as well, his ideas about the fact that we may have very strong obligations to benefit those who are distant strangers to us, or that we should treat animal welfare just on a par with human welfare, at least on some understanding of those ideas, that really has changed the beliefs and actions of, I think, probably tens of thousands of people, and often in really quite dramatic ways.
And then when we think about well, should we be doing more of this, is it merely that we're influencing things randomly, or are we making things better or worse? Well, if we just look to the history of moral thought, we see that most people in most times have believed really atrocious things. Really morally abominable things. Endorsement of slavery, distinctions between races, subjugation of women, huge discrimination against non-heterosexual people, and, in part at least, it's been ethical reflection that's allowed us to break down some of those moral prejudices. And so we should presume that we have very similar moral prejudices now. We've made a little bit of progress, but do we have the one true theory of ethics now? I certainly think it's very unlikely. And so we need to think more if we want to get to the actual ethical truth, if we don't wanna be living out moral catastrophes in the same way as we would if we kept slaves, for example.
Lucas: Right, I think we do want to do that, but I think that a bit later in the podcast we'll get into whether or not that's even possible, given the economic, political, and militaristic forces acting upon the AI alignment problem, and the issues with coordination and the race to AGI.
Just to start to get into the AI alignment problem, I just wanna offer a little bit of context. It is implicit in the AI alignment problem, or value alignment problem, that AI needs to be aligned to some sort of ethic or set of ethics, whether those are preferences or values or emotional dispositions, or whatever you might believe them to be. And it seems that, in terms of moral philosophy, there are really two general methods by which to arrive at an ethic. One is simply going to be through reason, and one is going to be through observing human behavior or artifacts, like books, movies, stories, or other things that we produce, in order to infer and discover the observed preferences and ethics of people in the world.
The latter set of alignment methodologies is empirical, and involves the agent interrogating and exploring the world in order to understand what humans care about and value, as if values and ethics were simply a physical by-product of the world and of evolution. The former is where ethics is arrived at through reason alone, and involves the AI or AGI potentially going about ethics as a philosopher would, where one engages in moral reasoning about metaethics in order to determine what is correct. From the point of view of ethics, there is potentially only what humans empirically do believe, and then there is what we may or may not be able to arrive at through reason alone.
So, it seems that one or both of these methodologies can be used when aligning an AI system. And again, the distinction here is simply between sort of preference aggregation or empirical value learning approaches, or methods of instantiating machine ethics, reasoning, or decision-making in AI systems so they become agents of morality.
So, what I really wanna get into with you now is how metaethical uncertainty influences our decision over the methodology of value alignment: whether we are to prefer an empirical preference-learning or aggregation type approach, or one which involves imbuing moral epistemology and ethical metacognition and reasoning into machine systems so they can discover what we ought to do. And how moral uncertainty, and metaethical uncertainty in particular, operates within both of these spaces once you're committed to some view, or both of these views. And then we can get into issues in intertheoretic comparisons and how they arise here at many levels, the ideal way we should proceed if we could do what would be perfect, and again, what is actually likely to happen given race dynamics and political, economic, and militaristic forces.
Will: Okay, that sounds terrific. I mean, there's a lot to cover there.
I think it might be worth me saying just maybe a couple of distinctions I think are relevant and kind of my overall view in this. So, in terms of distinction, I think within what broadly gets called the alignment problem, I think I'd like to distinguish between what I'd call the control problem, then kind of human values alignment problem, and then the actual alignment problem.
Where the control problem is just: can you get this AI to do what you want it to do, where that's relatively narrowly construed? I want it to clean up my room; I don't want it to put my cat in the bin. That's the control problem. I think describing that as a technical problem is broadly correct.
Second is then what gets called aligning AI with human values. For that, it might be the case that just having the AI pay attention to what humans actually do and infer their preferences that are revealed on that basis, maybe that's a promising approach and so on. And that I think will become increasingly important as AI becomes larger and larger parts of the economy.
This is kind of already what we do when we vote for politicians who represent at least large chunks of the electorate. They hire economists who undertake kind of willingness-to-pay surveys and so on to work out what people want, on average. I do think that this is maybe more normatively loaded than people might often think, but at least you can understand that, just as the control problem is I have some relatively simple goal, which is, what do I want? I want this system to clean my room. How do I ensure that it actually does that without making mistakes that I wasn't intending? This is kind of broader problem of, well you've got a whole society and you've got to aggregate their preferences for what kind of society wants and so on.
But I think, importantly, there's this third thing which I called a minute ago the actual alignment problem, so let's run with that. Which is just working out what's actually right and what's actually wrong and what we ought to be doing. I do have a worry that, because many people in the wider world, when they start thinking philosophically, often start endorsing some relatively simple subjectivist or relativist views, they might think that answering this question of what humans want, or what people want, is just the same as answering what we ought to do. For a kind of reductio of that view, just go back a few hundred years, where the question would have been the white man's alignment problem: "Well, what do we, society, want?", where "society" means white men.
Lucas: Uh oh.
Will: What do we want them to do? So similarly, unless you've got the kind of such a relativist view that you think that maybe that would have been correct back then, that's why I wanna kind of distinguish this range of problems. And I know that you're kind of most interested in that third thing, I think. Is that right?
Lucas: Yeah, so I think I'm pretty interested in the second and the third thing, and I just wanna unpack a little bit of your distinction between the first and the second. So, the first was what you called the control problem, and you called the second just the plurality of human values and preferences and the issue of aligning to that in the broader context of the world.
It's unclear to me, at least, how I get the AI to put a strawberry on the plate or to clean up my room and not kill my cat without the second thing having been done.
There is a sense at a very low level where you're sort of working on technical AI alignment, which involves working on the MIRI approach with agent foundations, and trying to work on constrained optimization and corrigibility and docility and robustness and security, and all of those sorts of things that people work on in the concrete problems in AI safety, stuff like that. But it's unclear to me where that sort of stuff remains limited to the control problem, and where it begins requiring the system to be able to learn my preferences through interacting with me, and is thereby already sort of participating in the second case, participating in AI alignment more generally, rather than being a low-level controlled system.
Will: Yeah, and I should say that on this side of things I'm definitely not an expert, not really the person to be talking to, but I think you're right. There's going to be some big gray area or transition between systems. So there's one that might be cleaning my room, or even, let's just say it's playing some sort of game, unfortunately I forget the example ... It was in a blog post from OpenAI, an example of the alignment problem in the wild, or something. But just a very simple example of an AI playing a game, and you say, "Well, get as many points as possible." And what you really want it to do is win a certain race, but what it ends up doing is driving this boat just round and round in circles, because that's the way of maximizing the number of points.
Lucas: Reward hacking.
Will: Reward hacking, exactly. That would be a kind of failure of the control problem, the first problem in our sense. And then I believe there are gonna be kind of gray areas, where perhaps it's a certain sort of AI system where the whole point is that it's just implementing what I want. And that might be very contextually determined, might depend on what my mood is that day. That might be a much, much harder problem and will involve kind of studying what I actually do and so on.
We could go into the question of whether you can solve the problem of cleaning a room without killing my cat. Whether that is possible to solve without solving much broader questions, maybe that's not the most fruitful avenue of discussion.
Lucas: So, let's put aside this first case which involves the control problem, we'll call it, and let's focus on the second and the third, where again the second is defined as sort of the issue of the plurality of human values and preferences which can be observed, and then the third you described as us determining what we ought to do and tackling sort of the metaethics.
Will: Yeah, just tackling the fundamental question of, "Where ought we to be headed as a society?" One just extra thing to add onto that is that's just a general question for society to be answering. And if there are kind of fast, or even medium-speed, developments in AI, perhaps suddenly we've gotta start answering that question, or thinking about that question even harder in a more kind of clean way than we have before. But even if AI were to take a thousand years, we'd still need to answer that question, 'cause it's just fundamentally the question of, "Where ought we to be heading as a society?"
Lucas: Right, and so going back a little bit to the little taxonomy that I had developed earlier, it seems like your second case would come down to metaethical questions, which sit behind and influence the empirical issues with preference aggregation and there being a plurality of values. And the third case would be what would be arrived at through reason and, I guess, the reason of many different people.
Will: Again, it's gonna involve questions of metaethics as well, where, again, on my metaethical view, it would actually just involve interacting with conscious experiences. And that's a critical aspect of coming to understand what's morally correct.
Lucas: Okay, so let's go into the second one first and then let's go into the third one. And while we do that, it would be great if we could be mindful of problems in intertheoretic comparison and how they arise as we go through both. Does that sound good?
Will: Yeah, that sounds great.
Lucas: So, would you like to just sort of unpack, starting with the second view, the metaethics behind that, issues in how moral realism versus moral anti-realism will affect how the second scenario plays out, and other sorts of crucial considerations in metaethics that will affect the second scenario?
Will: Yeah, so for the second scenario, which again, to be clear, is the aggregating of the variety of human preferences across a variety of contexts and so on, is that right?
Lucas: Right, so that the agent can be fully autonomous and realized in the world that it is sort of an embodiment of human values and preferences, however construed.
Will: Yeah, okay, so here I do think all the metaethics questions are gonna play a much larger role in the third question. So again, it's funny, it's very similar to the question of what mainstream economists often think they're doing when it comes to cost-benefit analysis. Let's just even start with the individual case. Even there, it's not a purely descriptive enterprise, where, again, let's not even talk about AI. You're just looking out for me. You and I are friends and you want to do me a favor in some way. How do you make a decision about how to do me that favor, how to benefit me in some way? Well, you could just look at the things I do and then infer on that basis what my utility function is. So perhaps every morning I go and I rob a convenience store and then I buy some heroin and then I shoot up and-
Lucas: Damn, Will!
Will: That's my day. Yes, it's a confession. Yeah, you're the first to hear it.
Lucas: It's crazy, in Oxford huh?
Will: Yeah, Oxford University is wild.
You see that behavior on my part and you might therefore conclude, "Wow, what Will really likes is heroin. I'm gonna do him a favor and buy him some heroin." Now, that seems commonsensically pretty ridiculous, assuming I'm demonstrating all sorts of behavior that looks like it's very bad for me, that looks like a compulsion and so on. So instead, what we're really doing is not merely maximizing the utility function given by my revealed preferences; we have some deeper idea of what's good for me or what's bad for me.
Perhaps that comes down to just what I would want to want, or what I would want myself to want to want. Perhaps you can do it in terms of what are called second-order and third-order preferences: what idealized Will would want. That is not totally clear. Firstly, it's really hard to know what idealized Will would want; you're gonna have to start doing at least a little bit of philosophy there. Because I tend to favor hedonism, I think that an idealized version of my friend would want the best possible experiences. That might be very different from what they think an idealized version of themselves would want, because perhaps they have some objective list account of well-being, and they think what they would also want is knowledge for its own sake and appreciating beauty for its own sake and so on.
So even there, I think you're gonna get into pretty tricky questions about what is good or bad for someone. And then after that you've got the question of preference aggregation, which is also really hard, both in theory and in practice. Do you just take strengths of preferences across absolutely everybody and then add them up? Well, firstly you might worry that you can't actually make these comparisons of strengths of preferences between people. Certainly if you're just looking at people's revealed preferences, it's really opaque how you would say, if I prefer coffee to tea and you vice versa, who has the stronger preference. Perhaps we could look at behavioral facts to try and at least anchor that, but it's still non-obvious that what we ought to do when we're looking at everybody's preferences is just maximize the sum, rather than perhaps give some extra weighting to people who are more badly off, give more priority to their interests. So those are kind of the theoretical issues.
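As an editorial aside, the distinction Will draws here, summing preference strengths versus giving extra weight to the worse off, can be made concrete with a small sketch. All numbers below are illustrative, and the square-root weighting is just one arbitrary choice of concave (prioritarian) transform:

```python
import math

def utilitarian(utilities):
    # Plain sum: every unit of well-being counts the same, whoever gets it.
    return sum(utilities)

def prioritarian(utilities):
    # Concave transform: gains to the badly-off count for more.
    # sqrt is an arbitrary illustrative choice of weighting.
    return sum(math.sqrt(u) for u in utilities)

equal   = [10, 10]   # two people, equally well off
unequal = [1, 25]    # same two people, very unequal

# The utilitarian sum prefers the unequal world (26 > 20)...
assert utilitarian(unequal) > utilitarian(equal)
# ...while the prioritarian weighting prefers the equal one (~6.32 > 6.0).
assert prioritarian(equal) > prioritarian(unequal)
```

The point survives any particular choice of weighting: once you go beyond revealed preferences, you need a substantive view about how one person's gains trade off against another's.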
And then secondly, there are just practical issues of implementing that, where you actually need to ensure that people aren't faking their preferences. And there's a well-known literature in voting theory that says that basically any aggregation system you have, any voting system, is going to be manipulable in some way. You're gonna be able to get a better result for yourself, at least in some circumstances, by misrepresenting what you really want.
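The result Will is gesturing at here is the Gibbard-Satterthwaite theorem: any non-dictatorial voting rule over three or more candidates is manipulable at some profile. As an editorial illustration (not part of the conversation), here is a toy case using the Borda count, where one voter gets a better outcome by burying their true second choice:

```python
def borda_winner(ballots):
    """Borda count: with k candidates, a ballot gives k-1 points to its
    first choice, k-2 to its second, and so on. Alphabetical tie-break."""
    scores = {}
    for ballot in ballots:
        k = len(ballot)
        for rank, candidate in enumerate(ballot):
            scores[candidate] = scores.get(candidate, 0) + (k - 1 - rank)
    best = max(scores.values())
    return min(c for c, s in scores.items() if s == best)

honest = [
    ["A", "B", "C"],  # voter 1
    ["A", "B", "C"],  # voter 2
    ["B", "A", "C"],  # voter 3, who sincerely prefers B
    ["B", "A", "C"],  # voter 4
]
assert borda_winner(honest) == "A"  # A and B tie on 6 points; tie-break favors A

# Voter 3 misrepresents their ranking, burying A below C:
strategic = [["A", "B", "C"], ["A", "B", "C"], ["B", "C", "A"], ["B", "A", "C"]]
assert borda_winner(strategic) == "B"  # voter 3 now gets the outcome they prefer
```

By lying, voter 3 drops A's score from 6 to 5 while B keeps 6, so their preferred candidate wins outright.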
Again, these are kind of issues that our society already faces, but they're gonna bite even harder when we're thinking about delegating to artificial agents.
Lucas: There are two levels to this that you're sort of elucidating. The first is that you can think of the AGI as something which can do favors for everybody in humanity, so there are issues, empirically and philosophically and in terms of understanding other agents, about what sort of preferences that AGI should be maximizing for each individual, say, constrained by what is legal and what is generally converged upon as being good or right. And then there are issues with preference aggregation, which come up more given that we live in a resource-limited universe and world, where not all preferences can coexist and there has to be some sort of potential cancellation between different views.
And so, in terms of this higher level of preference aggregation ... and I wanna step back here to metaethics and difficulties of intertheoretic comparison ... it would seem that your moral realist view would affect how the weighting would potentially be done. Because it seemed like before you were alluding to the fact that if your moral realist view were true, then the way in which we could determine what we ought to do, or what is good and true about morality, would be through exploring the space of all possible experiences, right, so we can discover moral facts about experiences.
Will: Mm-hmm (affirmative).
Lucas: And then in terms of preference aggregation, there would be people who would be right or wrong about what is good for them or the world.
Will: Yeah, I guess this is, again why I wanna distinguish between these two types of value alignment problem, where on the second type, which is just kind of, "What does society want?" Societal preference aggregation. I wasn't thinking of it as there being kind of right or wrong preferences.
In just the same way as there's this question of, "I want this system to do X," but there's a question of, "Do I want that?" or "How do you know that I want that?", there's a question of, "How do you know what society wants?" That's a question in its own right that's separate from that third alignment issue I was raising, which then starts to bake in: well, if people have various moral preferences, views about how the world ought to be, some are right and some are wrong. And you shouldn't just do some aggregation over all those different views, because ideally you should give no weight to the ones that are wrong, and if any are true, they should get all the weight. It's not really about preference aggregation in that way.
Though, if you think about it as everyone is making certain sort of guess at the moral truth, then you could think of that like a kind of judgment aggregation problem. So, it might be like data or input for your kind of moral reasoning.
Lucas: I think I was just sort of conceptually slicing this a tiny bit different from you. But that's okay.
So, staying on this second view, it seems like there's obviously going to be a lot of empirical issues and issues in understanding persons and idealized versions of themselves. Before we get in to intertheoretic comparison issues here, what is your view on coherent extrapolated volition, sort of, being the answer to this second part?
Will: I don't really know that much about it. From what I do know, it always seemed under-defined. As I understand it, the key idea is just, you take everyone's idealized preferences in some sense, and then I think what you do is just take a sum of what everyone's preference is. I'm personally quite in favor of the summation strategy. I think we can make interpersonal comparisons of strengths of preferences, and I think summing people's preferences is the right approach.
We can use certain kinds of arguments that also have application in moral philosophy, like the idea of: if you didn't know who you were going to be in society, how would you want to structure things? And if you're a rational, self-interested agent maximizing expected utility, then you'll adopt the utilitarian aggregation function, so you'll maximize the sum of preference strengths.
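The argument Will sketches, often associated with Harsanyi's veil-of-ignorance reasoning, can be checked mechanically. As an editorial sketch with made-up outcome names and utility numbers: if you are equally likely to occupy any of n positions, your expected utility in an outcome is the mean of the n utilities, and since the mean is just the sum divided by a fixed n, ranking by expected utility and ranking by the utilitarian sum always agree:

```python
def expected_utility(outcome):
    # Behind the veil: equal chance of occupying each person's position.
    return sum(outcome) / len(outcome)

def utilitarian_sum(outcome):
    return sum(outcome)

# Three hypothetical social outcomes, utilities for the same three people:
outcomes = {
    "status_quo": [5, 5, 5],
    "growth":     [9, 6, 2],
    "transfer":   [6, 6, 6],
}

# Since mean = sum / n with n fixed, both criteria pick the same outcome.
by_veil = max(outcomes, key=lambda o: expected_utility(outcomes[o]))
by_sum  = max(outcomes, key=lambda o: utilitarian_sum(outcomes[o]))
assert by_veil == by_sum == "transfer"
```

This equivalence only holds when the population is fixed; interpersonal comparability of the utility numbers is exactly the assumption Will flags as contentious.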
But then, if we're doing this idealized preference thing, all the devil's going to be in the details of, "Well, how are you doing this idealization?" Because, given my preferences, for example, for what they are ... certainly they're incomplete, they're almost certainly cyclical, and who knows, maybe there are even some preferences I have that are irreflexive as well. Probably contradictory, as well. So there are questions about what it means to idealize, and that's going to be a very difficult question, and where a lot of the work is, I think.
Lucas: So I guess, just two things here. What are sort of the timeline and actual real world working in relationship here, between the second problem that you've identified and the third problem that you've identified, and what is the role and work that preferences are doing here, for you, within the context of AI alignment, given that you're sort of partial of a form of hedonistic consequentialism?
Will: Okay, terrific, 'cause this is kind of important framing.
In terms of answering this alignment problem, the deep one of just where societies ought to be going, I think the key thing is to punt it. The key thing is to get us to a position where we can think about and reflect on this question, really for a very long time, so I call this the long reflection. Perhaps it's a period of a million years or something. We've got a lot of time on our hands; time's really not the scarce commodity here. So there are various stages to get into that state.
The first is to reduce extinction risks down basically to zero, put us a position of kind of existential security. The second then is to start developing a society where we can reflect as much as possible and keep as many options open as possible.
Something that wouldn't be keeping a lot of options open would be, say we've solved what I call the control problem, we've got these kind of lapdog AIs that are running the economy for us, and we just say, "Well, these are so smart, what we're gonna do is just tell it, 'Figure out what's right and then do that.'" That would really not be keeping our options open. Even though I'm sympathetic to moral realism and so on, I think that would be quite a reckless thing to do.
Instead, what we want to have is something kind of ... We've gotten to this position of real security. Maybe also along the way, we've fixed the various particularly bad problems of the present, poverty and so on, and now what we want to do is just keep our options open as much as possible and then gradually work on improving our moral understanding, perhaps supplemented by AI systems ...
I think there's tons of work that I'd love to see on developing how this would actually work, but I think the best approach would be to get the artificially intelligent agents to be doing moral philosophy, giving us arguments, perhaps creating new moral experiences that they think can be informative, and so on, but letting the actual decision making, or judgments about what is right and wrong, be left up to us. Or at least have some kind of graduated thing where we gradually transition the decision making more and more from human agents to artificial agents, and maybe that's over a very long time period.
What I kind of think of as the control problem and that second-level alignment problem, those are issues you face when you're just addressing the question of, "Okay, well, we're now gonna have an AI-run economy," but you're not yet needing to address the question of what's actually right or wrong. And then my main thing there is just that we should get ourselves into a position where we can take as long as we need to answer that question, with as many options open as possible.
Lucas: I guess here, given moral uncertainty and other issues, we would also want to factor issues of astronomical waste into how long we should wait?
Will: Yeah. That's definitely informing my view, where it's at least plausible that morality has an aggregative component, and if so, then the sheer vastness of the future matters enormously, because we've got half a billion to a billion years left on Earth, a hundred trillion years before the stars burn out, and then ... I always forget these numbers, but I think a hundred billion stars in the Milky Way and ten trillion galaxies.
With just vast resources at our disposal, the future could be astronomically good. It could also be astronomically bad. What we want to ensure is that we get to the good outcome, and given the time scales involved, even what seems like an incredibly long delay, like a million years, is actually just very little time indeed.
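A quick editorial back-of-the-envelope check, taking Will's own figure of roughly a hundred trillion years before the stars burn out, shows why a million-year reflection is cheap in relative terms:

```python
reflection_years = 1e6   # a million-year "long reflection"
future_years = 1e14      # ~a hundred trillion years, per Will's figure

# Fraction of the available future spent deliberating:
fraction_spent = reflection_years / future_years

# One hundred-millionth of the time available: a rounding error.
assert abs(fraction_spent - 1e-8) < 1e-20
```

On these (very rough) numbers, the delay costs one part in a hundred million of the future, which is the arithmetic behind treating delay as far cheaper than locking in a wrong answer.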
Lucas: In half a second I want to jump into whether or not this is actually likely to happen, given race dynamics and that human beings are kind of crazy. The sort of timeline here is that we're solving the technical control problem up to and on our way to AGI and what might be superintelligence, and then we are also sort of idealizing everyone's values and lives in a way such that they have more information and can think more and have more free time and become idealized versions of themselves, given constraints around values canceling each other out and things that we might end up deeming impermissible.
After that is where this period of long reflection takes place, and the dynamics and mechanics of that seem to be open questions. It seems that first comes computer science and global governance and coordination and strategy issues, and then comes a long time of philosophy.
Will: Yeah, then comes the million years of philosophy, so I guess it's not very surprising that a philosopher would suggest this. Then the dynamics of the setup are an interesting question, and a super important one.
One thing you could do is just say, "Well, we've got ten billion people alive today, let's say. We're gonna divide the universe into ten billionths, so maybe that's a thousand galaxies each or something." And then you can trade after that point. I think that would get a pretty good outcome. There's questions of whether you can enforce it or not into the future. There's some arguments that you can. But maybe that's not the optimal process, because especially if you think that "Wow! Maybe there's actually some answer, something that is correct," well, maybe a lot of people miss that.
I actually think if we did that, and if there is some correct moral view, then I would hope that incredibly well-informed people, and perhaps intellectually augmented people and so on, who have this vast amount of time to reflect, would converge on that answer. And if they didn't, then that would make me more suspicious of the idea that maybe there is a real fact of the matter. But it's still early days; we'd really want to think a lot about what goes into the setup of that kind of long reflection.
Lucas: Given this account that you've just given of how this should play out in the long term, or what it might look like, what do you think is the actual probability that this will happen, given the way the world actually is today and the game-theoretic forces at work?
Will: I think I'm going to be very hard pressed to give a probability. I don't think I know even what my subjective credence is. But speaking qualitatively, I'd think it would be very unlikely that this is how it would play out.
Again, I'm like Brian and Dave in that, if you look at history, I do think moral forces have some influence. I wouldn't say they're the largest influence. I think probably randomness explains a huge amount of history, especially when you think about how certain events are just very determined by the actions of individuals. Economic forces and technological forces and environmental changes are huge as well. It is hard to think it's going to be likely that such a well-orchestrated dynamic would occur. But I do think it's possible, and I think we can increase the chance of it happening through the kind of careful actions that people like FLI are taking at the moment.
Lucas: That seems like the ideal scenario, absolutely, but I'm also worried that people don't like to listen to moral philosophers, and that potentially selfish government forces and things like that will end up taking over and controlling things, which would be kind of sad for the cosmic endowment.
Will: That's exactly right. If there were some hard takeoff and sudden leap to artificial general intelligence, which I think is relatively unlikely but is possible, I think that's probably the most scary, 'cause it means that a huge amount of power is suddenly in the hands of a potentially very small number of people. You could end up with the very long-run future of humanity being determined by the idiosyncratic preferences of just a small number of people, so it would be very dependent on whether those people's preferences are good or bad. With a kind of slow takeoff, where there are many decades of AGI development and it gradually gets incorporated into the economy,
I think there's somewhat more hope there. Society will be a lot more prepared. It's less likely that something very bad will happen. But my default presumption when we're talking about multiple nations, billions of people doing something that's very carefully coordinated is not going to happen. We have managed to do things that have involved international cooperation and amazing levels of operational expertise and coordination in the past. I think the eradication of smallpox is perhaps a good example of that. But it's something that we don't see very often, at least not now.
Lucas: It looks like we need to create a Peter Singer of AI safety, or some other philosopher who has a tremendous impact on politics and society, to spread this sort of vision throughout the world such that it becomes more likely to be realized. Is that potentially most likely?
Will: Yeah. I think if a wide number of the political leaders, even if just political leaders of US, China, Russia, all were on board with global coordination on the issue of AI, or again, whatever other transformative technology might really upend things in the 21st century, and were on board with "How important it is that we get to this kind of period of long reflection where we can really figure out where we're going," then that alone would be very promising.
Then the question of just how promising that is, I think, depends a lot on maybe the robustness of ... Even if you're a moral realist, there's a question of how likely it is that people will get to the correct moral view. It could be the case that it's just this kind of strong attractor, where even if you've got nothing as clean-cut as the long reflection I was describing, but instead some really messy thing, perhaps various wars, and it looks like feudal society or something, and anyone would say that civilization looks highly chaotic, maybe it's the case that even given that, just given enough time and enough reasoning power, people will still converge on the same moral view.
I'm probably not as optimistic as that, but it's at least a view that you could hold.
Lucas: In terms of the different factors that are going into the AI alignment problem and the different levels you've identified, first, second, and third, which side do you think is lacking the most resources and attention right now? Are you most worried about the control problem, that first level? Or are you more worried about potential global coordination and governance stuff at the potential second level or moral philosophy stuff at the third?
Will: Again, flagging that I'm sure I'm biased on this, but I'm currently by far the most worried about the third level. That's for a couple of reasons. One is I just think the vast majority of the world are simple subjectivists or relativists, so for the idea that we ought to be engaging in real moral thinking about where we go with society, how we use our cosmic endowment, as you put it, my strong default is that that question just never even really gets raised.
Lucas: You don't think most people are theological moral realists?
Will: Yeah. I guess it's true that I'm just thinking about-
Lucas: Our bubble?
Will: My bubble, yeah. Well-educated westerners. Most people in the world at least would say they're theological moral realists. One thought is just that I think my default is that some sort of relativism will hold sway and people will just not pay enough attention to thinking about what they ought to do. A second relevant thought is that I think the best possible universe is plausibly really, really good, like astronomically better than alternative extremely good universes.
Lucas: Absolutely.
Will: It's also the case that even slight differences in moral view might lead you to optimize for extremely different things. Take just a toy example: preference utilitarianism versus hedonistic utilitarianism, what you might think of as two very similar views. In the actual world there's not that much difference between them, because we roughly know what makes people better off; improving their conscious experiences is also generally what they want. But when you're technologically unconstrained, it's plausible to me that the optimal configuration of things will look really quite different between those two views. I guess I kind of think the default is that we get it very badly wrong, and it will require really sustained work to ensure we get it right ... if it's the case that there is a right answer.
Lucas: Is there anything with regards to issues in intertheoretic comparisons, or anything like that at any one of the three levels which we've discussed today that you feel we haven't sufficiently covered or something that you would just like to talk about?
Will: Yeah. I know that one of your listeners was asking whether I thought intertheoretic comparisons were solvable even in principle, by some superintelligence, and I think they are, if other issues in moral philosophy are solvable. I think that's a particularly hard problem, but I think ethics in general is very hard.
I also think it is the case that whatever output we have at the end of this kind of long deliberation, again it's unlikely we'll get to credence 1 in a particular view, so we'll have some distribution over different views, and we'll want to take that into account. Maybe that means we do some kind of compromise action.
Maybe that means we just distribute our resources in proportion with our credence in different moral views. That's again one of these really hard questions that we'll want if at all possible to punt on and leave to people who can think about this in much more depth.
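One concrete form the compromise Will mentions could take, purely as an editorial sketch with hypothetical view names and credences, is splitting a fixed pool of resources in proportion to one's credence in each moral view:

```python
def split_by_credence(credences, total_resources):
    """Allocate resources to each moral view in proportion to credence.
    Credences are assumed to be normalized (they sum to 1)."""
    assert abs(sum(credences.values()) - 1.0) < 1e-9
    return {view: p * total_resources for view, p in credences.items()}

# Hypothetical credences, purely for illustration:
credences = {"hedonistic": 0.5, "preference": 0.3, "objective_list": 0.2}
allocation = split_by_credence(credences, total_resources=1000)

assert allocation["hedonistic"] == 500.0
assert abs(sum(allocation.values()) - 1000.0) < 1e-6
```

This proportional rule is only one candidate; as Will notes, whether to split resources, pick compromise actions, or use some other aggregation over views is itself one of the hard questions being punted to the long reflection.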
Then, in terms of aggregating societal preferences, that's more like the problem of interpersonal comparisons of preference strength, which is kind of formally isomorphic but is at least a distinct issue.
Lucas: The second and third levels are where the intertheoretic problems are really going to arise, and at that second level, where the AGI is potentially working to idealize our values, there is again the open question of whether, in the real world, there will be moral philosophers at the table, or in politics, or among whoever has control over the AGI at that point, to work on and think more deeply about intertheoretic comparisons of value at that level and timescale. So it's worth thinking a little bit more about what we ought to do realistically, given likely outcomes about whether or not this sort of thinking will be at the table.
Will: My default is just that the crucial thing is to ensure that this thinking is more likely to be at the table. I think it is important to think about what we ought to do if we think it's very likely that things go badly wrong. Maybe it's not the case that we should just be aiming for the optimal thing, but for some kind of second-best strategy.
I think at the moment we should just be trying to push for the optimal thing. In particular, that's partly because of my view that an optimal universe is just so much better than even an extremely good one, that I kind of think we ought to be really trying to maximize the chance that we can figure out what that is and then implement it. But it would be interesting to think about more.
Lucas: For sure. I guess just wrapping up here, did you ever have the chance to look at those two LessWrong posts by Worley?
Will: Yeah, I did.
Lucas: Did you have any thoughts or comments on them? If people are interested you can find links in the description.
Will: I read the posts, and I was very sympathetic in general to what he was thinking through, in particular the principle of philosophical conservatism. Hopefully I've shown that I'm very sympathetic to that: trying to think, what are the minimal assumptions under which this system would be safe? Would this path make sense on a very, very wide array of different philosophical views? I think the approach I've suggested, which is keeping our options open as much as possible and punting on the really hard questions, does satisfy that.
I think one of his posts was talking about "Should we assume moral realism or assume moral antirealism?" It seems like there our views differed a little bit, where I'm more worried that everyone's going to assume some sort of subjectivism and relativism, and that there might be some moral truth out there that we're missing and we never think to find it, because we decide that what we're interested in is maximizing X, so we program agents to build X and then just go ahead with it, whereas actually the thing that we ought to have been optimizing for is Y. But broadly speaking, I think this question of trying to be as ecumenical as possible philosophically speaking makes a lot of sense.
Lucas: Wonderful. Well, it's really been a joy speaking, Will. Always a pleasure. Is there anything that you'd like to wrap up on, anywhere people can follow you or check you out on social media or anywhere else?
Will: Yeah. You can follow me on Twitter @WillMacAskill, and if you want to read more of my work, you can find me at williammacaskill.com.
Lucas: To be continued. Thanks again, Will. It's really been wonderful.
Will: Thanks so much, Lucas.
Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We'll be back again soon with another episode in the AI Alignment series.