AI Alignment Podcast: Inverse Reinforcement Learning and Inferring Human Preferences with Dylan Hadfield-Menell
Inverse Reinforcement Learning and Inferring Human Preferences is the first podcast in the new AI Alignment series, hosted by Lucas Perry. This series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across a variety of areas, such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following or subscribing to us on YouTube, SoundCloud, or your preferred podcast site/application.
If you're interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary map which begins to map this space.
In this podcast, Lucas spoke with Dylan Hadfield-Menell, a fifth-year PhD student at UC Berkeley. Dylan's research focuses on the value alignment problem in artificial intelligence. He is ultimately concerned with designing algorithms that can learn about and pursue the intended goal of their users, designers, and society in general. His recent work primarily focuses on algorithms for human-robot interaction with unknown preferences and reliability engineering for learning systems.
Topics discussed in this episode include:
- Inverse reinforcement learning
- Goodhart's Law and its relation to value alignment
- Corrigibility and obedience in AI systems
- IRL and the evolution of human values
- Ethics and moral psychology in AI alignment
- Human preference aggregation
- The future of IRL
Lucas: Welcome back to the Future of Life Institute Podcast. I'm Lucas Perry and I work on AI risk and nuclear weapons risk related projects at FLI. Today, we're kicking off a new series where we will be having conversations with technical and nontechnical researchers focused on AI safety and the value alignment problem. Broadly, we will focus on the interdisciplinary nature of the project of eventually creating value-aligned AI. Where what value-aligned exactly entails is an open question that is part of the conversation.
In general, this series covers the social, political, ethical, and technical issues and questions surrounding the creation of beneficial AI. We'll be speaking with experts from a large variety of domains, and hope that you'll join in the conversations. If this seems interesting to you, make sure to follow us on SoundCloud, or subscribe to us on YouTube for more similar content.
Today, we'll be speaking with Dylan Hadfield-Menell. Dylan is a fifth-year PhD student at UC Berkeley, advised by Anca Dragan, Pieter Abbeel, and Stuart Russell. His research focuses on the value alignment problem in artificial intelligence. With that, I give you Dylan. Hey, Dylan. Thanks so much for coming on the podcast.
Dylan: Thanks for having me. It's a pleasure to be here.
Lucas: I guess, we can start off, if you can tell me a little bit more about your work over the past years. How have your interests and projects evolved? How has that led you to where you are today?
Dylan: Well, I started off towards the end of undergrad and beginning of my PhD working in robotics and hierarchical robotics. Towards the end of my first year, my advisor came back from a sabbatical, and started talking about the value alignment problem and existential risk issues related to AI. At that point, I started thinking about questions about misaligned objectives, value alignment, and generally how we get the correct preferences and objectives into AI systems. About a year after that, I decided to make this my central research focus. Then, for the past three years, that's been most of what I've been thinking about.
Lucas: Cool. That seems like you had an original path where you're working on practical robotics. Then, you shifted more into value alignment and AI safety efforts.
Dylan: Yeah, that's right.
Lucas: Before we go ahead and jump into your specific work, it'd be great if we could go ahead and define what inverse reinforcement learning exactly is. For me, it seems that inverse reinforcement learning, at least from the view, I guess, of technical AI safety researchers, is viewed as an empirical means of conquering descriptive ethics, whereby we're able to give a clear descriptive account of what any given agent's preferences and values are at any given time. Is that a fair characterization?
Dylan: That's one way to characterize it. Another way to think about it, which is a useful perspective for me, sometimes, is to think of inverse reinforcement learning as a way of doing behavior modeling that has certain types of generalization properties.
Any time you're learning in any machine learning context, there's always going to be a bias that controls how you generalize to new information. Inverse reinforcement learning and preference learning, to some extent, is a bias in behavior modeling, which is to say that we should model this agent as accomplishing a goal, as satisfying a set of preferences. That leads to certain types of generalization properties in new environments. For me, inverse reinforcement learning is building in this agent-based assumption into behavior modeling.
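As a rough illustration of the agent-based assumption described above, here is a minimal sketch of goal inference from a single observed action. It is not taken from the conversation or from any particular paper: the Boltzmann-rational choice model, the `beta` rationality parameter, and the goal and action names are all assumptions made for the example.

```python
import math

def choice_probs(rewards, beta):
    """Boltzmann-rational choice: P(action) is proportional to exp(beta * reward).

    This noise model is an assumption of the sketch, not something stated in the source.
    """
    weights = {action: math.exp(beta * r) for action, r in rewards.items()}
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}

def infer_goal(goal_rewards, prior, observed_action, beta=1.0):
    """Bayesian update over candidate goals after observing one action."""
    unnormalized = {
        goal: prior[goal] * choice_probs(goal_rewards[goal], beta)[observed_action]
        for goal in prior
    }
    total = sum(unnormalized.values())
    return {goal: p / total for goal, p in unnormalized.items()}

# Hypothetical goals and the reward each goal assigns to each action.
GOAL_REWARDS = {
    "wants_coffee": {"go_to_kitchen": 1.0, "go_to_desk": 0.0},
    "wants_to_work": {"go_to_kitchen": 0.0, "go_to_desk": 1.0},
}
PRIOR = {"wants_coffee": 0.5, "wants_to_work": 0.5}

# The same observation is more informative when the agent is assumed more rational.
noisy = infer_goal(GOAL_REWARDS, PRIOR, "go_to_kitchen", beta=1.0)
sharp = infer_goal(GOAL_REWARDS, PRIOR, "go_to_kitchen", beta=5.0)
print(noisy, sharp)
```

Modeling the person as pursuing a goal is what lets the inference carry over to a new environment: once the goal is estimated, behavior in an unseen layout can be predicted from the goal rather than from the specific actions that were observed.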
Lucas: Given that, I'd like to dive more into the specific work that you're working on and going to some summaries of your findings and your research that you've been up to. Given this interest that you've been developing in value alignment, and human preference aggregation, and AI systems learning human preferences, what are the main approaches that you've been working on?
Dylan: I think the first thing that really Stuart Russell and I started thinking about was trying to understand theoretically, what is a reasonable goal to shoot for, and what does it mean to do a good job of value alignment. To us, it feels like issues with misspecified objectives, at least, in some ways, are a bug in the theory.
All of the math around artificial intelligence, for example, Markov decision processes, which is the central mathematical model we use for decision making over time, starts with an exogenously defined objective or reward function. We think that, mathematically, that was a fine thing to do in order to make progress, but it's an assumption that really has put blinders on the field about the importance of getting the right objective down.
I think, the first thing that we sought to try to do was to understand, what is a system or a set up for AI that does the right thing in theory, at least. What's something that if we were able to implement this that we think could actually work in the real world with people. It was that kind of thinking that led us to propose cooperative inverse reinforcement learning, which was our attempt to formalize the interaction whereby you communicate an objective to the system.
The main thing that we focused on was including within the theory a representation of the fact that the true objective's unknown and unobserved, and that it needs to be arrived at through observations from a person. Then, we've been trying to investigate the theoretical implications of this modeling shift.
In the initial paper that we did, which is titled Cooperative Inverse Reinforcement Learning, what we looked at is how this formulation is actually different from a standard environment model in AI. In particular, the way that it's different is there's strategic interaction on the behalf of the person. The way that you observe what you're supposed to be doing is intermediated by a person who may be trying to actually teach or trying to communicate appropriately. What we showed is that modeling this communicative component can actually be hugely important and lead to much faster learning behavior.
In our subsequent work, what we've looked at is taking this formal model in theory and trying to apply it to different situations. There are two really important pieces of work that I like here that we did. One was to take that theory and use it to explicitly analyze a simple model of an existential risk setting. This was a paper titled The Off-Switch Game that we published at IJCAI last summer. What it was, was working through a formal model of a corrigibility problem within a CIRL (cooperative inverse reinforcement learning) framework. It shows the utility of constructing this type of game in the sense that we get some interesting predictions and results.
The first one we get is that there are some nice simple necessary conditions for the system to want to let the person turn it off, which is that the robot, the AI system needs to have uncertainty about its true objective, which is to say that it needs to have within its belief the possibility that it might be wrong. Then, all it needs to do is believe that the person it's interacting with is a perfectly rational individual. If that's true, you'd get a guarantee that this robot always lets the person switch it off.
Now, that's good because, in my mind, it's an example of a place where, at least, in theory, it solves the problem. This gives us a way that theoretically, we could build corrigible systems. Now, it's still making a very, very strong assumption, which is that it's okay to model the human as being optimal or rational. I think if you look at real people, that's just not a fair assumption to make for a whole host of reasons.
The next thing we did in that paper is we looked at this model. What we realized is that adding in a small amount of irrationality breaks this requirement. It means that some things might actually go wrong. The final thing we did in the paper was to look at the consequences of either overestimating or underestimating human rationality. The argument that we made is there's a trade off between assuming that the person is more rational. It lets you get more information from their behavior, thus learn more, and in principle help them more. If you assume that they're too rational, then this actually can lead to quite bad behavior.
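The argument in the last few paragraphs can be checked on a toy case. The sketch below is an illustration only, not the model from The Off-Switch Game paper: the discrete belief over utilities, the numbers, and the simple "human picks the wrong option with probability `error_rate`" noise model are all assumptions made here.

```python
def expected(values, probs):
    """Expectation of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))

def off_switch_values(utilities, probs, error_rate=0.0):
    """Robot's expected value for each option, given its belief over the action's utility.

    act:   take the action without asking (worth E[U]).
    off:   switch itself off (worth 0).
    defer: let the human decide. With probability 1 - error_rate the human
           chooses correctly (allows the action only if U > 0); otherwise the
           human chooses the worse option. This noise model is an assumption.
    """
    act = expected(utilities, probs)
    defer = expected(
        [(1 - error_rate) * max(u, 0.0) + error_rate * min(u, 0.0) for u in utilities],
        probs,
    )
    return {"act": act, "off": 0.0, "defer": defer}

# Hypothetical belief: the action is harmful (-1) with prob 0.4, helpful (+2) with prob 0.6.
UTILITIES = [-1.0, 2.0]
PROBS = [0.4, 0.6]

rational_human = off_switch_values(UTILITIES, PROBS, error_rate=0.0)
noisy_human = off_switch_values(UTILITIES, PROBS, error_rate=0.3)
certain_robot = off_switch_values([2.0], [1.0], error_rate=0.0)

print(rational_human)  # deferring is at least as good as acting or switching off
print(noisy_human)     # with enough human error, acting directly looks better
print(certain_robot)   # with no uncertainty, deferring gives no extra value
```

With a perfectly rational human, deferring is worth E[max(U, 0)], which can never be less than max(E[U], 0), so the robot has no reason to resist the off switch. Once the human is modeled as error-prone, that guarantee disappears, and a robot with no uncertainty gains nothing from deferring at all.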
There's a sweet spot that you want to aim for, which is to maybe try to underestimate how rational people are, but you, obviously, don't want to get it totally wrong. We followed up on that idea in a paper with Smitha Milli as the first author that was titled Should Robots be Obedient? And that tried to get a little bit more of this trade off between maintaining control over a system and the amount of value that it can generate for you.
We looked at the implication that as robot systems interact with people over time, you expect them to learn more about what people want. If you get very confident about what someone wants, and you think they might be irrational, the math in the Off-Switch paper predicts that you should try to take control away from them. This means that if your system is learning over time, you expect that even if it is initially open to human control and oversight, it may lose that incentive over time. In fact, you can predict that it should lose that incentive over time.
In Should Robots be Obedient, we modeled that property and looked at some consequences of it. We do find that you got a basic confirmation of this hypothesis, which is that systems that maintain human control and oversight have less value that they can achieve in theory. We also looked at what happens when you have the wrong model. If the AI system has a prior that the human cares about a small number of things in the world, let's say, then it statistically gets overconfident in its estimates of what people care about, and disobeys the person more often than it should.
Arguably, when we say we want to be able to turn the system off, it's less a statement about what we want to do in theory or the property of the optimal robot behavior we want, and more of a reflection of the idea that we believe that under almost any realistic situation, we're probably not going to be able to fully explain all of the relevant variables that we care about.
If you're giving your robot an objective defined over a subset of things you care about, you should actually be very focused on having it listen to you, more so than just optimizing for its estimates of value. I think that provides, actually, a pretty strong theoretical argument for why corrigibility is a desirable property in systems, even though, at least, at face value, it should decrease the amount of utility those systems can generate for people.
The final piece of work that I think I would talk about here is our NIPS paper from December, which is titled Inverse Reward Design. That was taking cooperative inverse reinforcement learning and pushing it in the other direction. Instead of using it to theoretically analyze very, very powerful systems, we can also use it to try to build tools that are more robust to mistakes that designers may make. And start to build in initial notions of value alignment and value alignment strategies into the current mechanisms we use to program AI systems.
What that work looked at was understanding the uncertainty that's inherent in an objective specification. In the initial cooperative inverse reinforcement learning paper and the Off-Switch Game, what we said is that AI systems should be uncertain about their objective, and they should be designed in a way that is sensitive to that uncertainty.
This paper was about trying to understand, what is a useful way to be uncertain about the objective. The main idea behind it was that we should be thinking about the environments that the system designer had in mind. We use an example of a 2D robot navigating in the world, and the system designer is thinking about this robot navigating where there are three types of terrain. There's grass, there's gravel, and there's gold. You can give your robot an objective, a utility function defined over being in those different types of terrain, that incentivizes it to go and get the gold, and stay on the dirt where possible, but to take shortcuts across the grass when it's high value.
Now, when that robot goes out into the world, there are going to be new types of terrain, and types of terrain the designer didn't anticipate. What we did in this paper was to build an uncertainty model that allows the robot to determine when it should be uncertain about the quality of its reward function. How can we figure out when the reward function that a system designer builds into an AI, how can we determine when that objective is ill-adapted to the current situation? You can think of this as a way of trying to build in some mitigation to Goodhart's law.
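To make the terrain example concrete, here is a heavily simplified sketch of the idea. It is not the Inverse Reward Design algorithm from the paper: the reward numbers, the unanticipated "lava" terrain, the list of lava hypotheses, and the worst-case planning rule are all assumptions chosen for illustration.

```python
# Designer's proxy reward. Lava was never considered, so it silently defaults to 0.
PROXY = {"grass": -2.0, "gravel": -1.0, "gold": 10.0, "lava": 0.0}

# Hypothetical candidates for the true reward: they agree with the proxy on the
# terrain the designer thought about and differ only on the unseen terrain.
LAVA_HYPOTHESES = (-50.0, -10.0, 0.0, 10.0)
CANDIDATES = [dict(PROXY, lava=w) for w in LAVA_HYPOTHESES]

def path_reward(path, weights):
    """Total reward of a path, given a reward per terrain type."""
    return sum(weights[terrain] for terrain in path)

def best_path(paths, weights):
    """Path with the highest total reward under one reward function."""
    return max(paths, key=lambda p: path_reward(p, weights))

def consistent_candidates(candidates, training_paths):
    """Keep candidates that explain the designer's intended training behavior."""
    intended = best_path(training_paths, PROXY)
    return [w for w in candidates if best_path(training_paths, w) == intended]

def risk_averse_path(paths, candidates):
    """Pick the path whose worst-case reward over the candidates is highest."""
    return max(paths, key=lambda p: min(path_reward(p, w) for w in candidates))

# Training environment: only terrain the designer had in mind.
TRAINING_PATHS = [("gravel", "gravel", "gravel", "gold"), ("grass", "gold")]
# Deployment environment: a shortcut across terrain nobody anticipated.
TEST_PATHS = [("lava", "gold"), ("gravel", "gravel", "gold")]

plausible = consistent_candidates(CANDIDATES, TRAINING_PATHS)
naive_choice = best_path(TEST_PATHS, PROXY)
careful_choice = risk_averse_path(TEST_PATHS, plausible)
print(len(plausible), naive_choice, careful_choice)
```

Because the training environment contains no lava, it cannot rule out any of the lava hypotheses, so the robot stays uncertain exactly where the designer's objective was never tested. Optimizing the proxy literally takes the lava shortcut, while planning against the remaining uncertainty keeps to the gravel.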
Lucas: Would you like to take a second to unpack what Goodhart's law is?
Dylan: Sure. Goodhart's law is an old idea in social science that actually goes back to before Goodhart. I would say that in economics, there's a general idea of the principal agent problem, which dates back to the 1970s, as I understand it, and basically looks at the problem of specifying incentives for humans. How should you create contracts? How do you create incentives, so that another person, say, an employee, helps earn you value?
Goodhart's law is a very nice way of summarizing a lot of those results, which is to say that once a metric becomes an objective, it ceases to become a good metric. You can have properties of the world, which correlate well with what you want, but optimizing for them actually leads to something quite, quite different than what you're looking for.
Lucas: Right. Like if you are optimizing for test scores, then you're not actually going to end up optimizing for intelligence, which is what you wanted in the first place?
Dylan: Exactly. Even though test scores, when you weren't optimizing for them were actually a perfectly good measure of intelligence. I mean, not perfectly good, but were an informative measure of intelligence. Goodhart's law, arguably, is a pretty bleak perspective. If you take it seriously, and you think that we're going to build very powerful systems that are going to be programmed directly through an objective, in this manner, Goodhart's law should be pretty problematic because any objective that you can imagine programming directly into your system is going to be something correlated with what you really want rather than what you really want. You should expect that that will likely be the case.
Lucas: Right. Is it just simply too hard or too unlikely that we're able to sufficiently specify what exactly that we want that we'll just end up using some other metrics that if you optimize too hard for them, it ends up messing with a bunch of other things that we care about?
Dylan: Yeah. I mean, I think there's some real questions about, what is it we even mean... Well, what are we even trying to accomplish? What should we try to program into systems? Philosophers have been trying to figure out those types of questions for ages. For me, as someone who takes a more empirical slant on these things, I think about the fact that the objectives that we see within our individual lives are so heavily shaped by our environments. Which types of signals we respond to and adapt to has heavily adapted itself to the types of environments we find ourselves in.
We just have so many examples of objectives not being the correct thing. I mean, effectively, all you could have is correlations. The fact that wire heading is possible, is maybe some of the strongest evidence for Goodhart's law being really a fundamental property of learning systems and optimizing systems in the real world.
Lucas: There are certain agential characteristics and properties, which we would like to have in our AI systems, like them being-
Lucas: Yeah. Corrigibility is a characteristic, which you're doing research on and trying to understand better. Same with obedience. It seems like there's a trade off here where if a system is too corrigible or it's too obedient, then you lose its ability to really maximize different objective functions, correct?
Dylan: Yes, exactly. I think identifying that trade off is one of the things I'm most proud of about some of the work we've done so far.
Lucas: Given AI safety and really big risks that can come about from AI, in the short, to medium, and long term, before we really have AI safety figured out, is it really possible for systems to be too obedient, or too corrigible, or too docile? How do we navigate this space and find sweet spots?
Dylan: I think it's definitely possible for systems to be too corrigible or too obedient. It's just that the failure mode for that doesn't seem that bad. If you think about this-
Dylan: ... it's like Clippy. Clippy was asking for human-
Lucas: Would you like to unpack what Clippy is first?
Dylan: Sure, yeah. Clippy is an example of an assistant that Microsoft created in the '90s. It was this little paperclip that would show up in Microsoft Word. Well, it liked to suggest that you're trying to write a letter a lot and ask for different ways in which it could help.
Now, on one hand, that system was very corrigible and obedient in the sense that it would ask you whether or not you wanted its help all the time. If you said no, it would always go away. It was super annoying because it would always ask you if you wanted help. The false positive rate was just far too high to the point where the system became really a joke in computer science and AI circles of what you don't want to be doing. I think, systems can be too obedient or too sensitive to human intervention and oversight in the sense that too much of that just reduces the value of the system.
Lucas: Right, for sure. On one hand, when we're talking about existential risks or even a paperclip maximizer, then it would seem, like you said, like the failure mode of just being too annoying and checking in with us too much seems like not such a bad thing given existential risk territory.
Dylan: I think if you're thinking about it in those terms, yes. I think if you're thinking about it from the standpoint of, "I want to sell a paperclip maximizer to someone else," then it becomes a little less clear, I think, especially, when the risks of paperclip maximizers are much harder to measure. I'm not saying that it's the right decision from a global altruistic standpoint to be making that trade off, but I think it's also true that just if we think about the requirements of market dynamics, it is true that AI systems can be too corrigible for the market. That is a huge failure mode that AI systems run into, and it's one we should expect the producers of AI systems to be responsive to.
Lucas: Right. Given all these different ... Is there anything else you wanted to touch on there?
Dylan: Well, I had another example of systems that are too corrigible-
Dylan: ... which is, do you remember Microsoft's Tay?
Lucas: No, I do not.
Dylan: This is a chatbot that Microsoft released. They trained it based off of tweets. It was a tweet bot. They trained it based on things that were tweeted at it. I forget if it was a nearest-neighbors lookup or if it was just doing a neural method, and overfitting, and memorizing parts of the training set. At some point, 4chan realized that the AI system, that Tay, was very suggestible. They basically created an army to radicalize Tay. They succeeded.
Lucas: Yeah, I remember this.
Dylan: I think you could also think of that as being the other axis of too corrigible or too responsive to human input. The first axis I was talking about is the failures of being too corrigible from an economic standpoint, but there's also the failures of being too corrigible in a multi-agent mechanism design setting where, I believe, that those types of properties in a system also open them up to more misuse.
If we think of AI, cooperative inverse reinforcement learning and the models we've been talking about so far exist in what I would call the one robot one human model of the world. Generally, you could think of extensions of this with N humans and M robots. The variants of what you would have there, I think, lead to different theoretical implications.
If we think of just two humans, N=2, and one robot, M=1, suppose that one of the humans is the system designer and the other one is the user, there is this trade off between how much control the system designer has over the future behavior of the system and how responsive and corrigible it is to the user in particular. Trading off between those two, I think, is a really interesting ethical question that comes up when you start to think about misuse.
Lucas: Going forward, as we're developing these systems and trying to make them more fully realized in a world where the number of people will equal something like seven or eight billion, how do we navigate this space where we're trying to hit a sweet spot where it's corrigible in the right ways, and to the right degree, and right level, and to the right people, and it is obedient to the right people, and it's not suggestible from the wrong people? Or does that just enter a territory of so many political, social, and ethical questions that it will take years to think about and work on?
Dylan: Yeah, I think it's closer to the second one. I'm sure that I don't know the answers here. From my standpoint, I'm still trying to get a good grasp on what is possible in the one-robot-one-person case. I think that when you have ... Yeah, when you ... Oh man. I guess, it's so hard to think about that problem because it's just very unclear what's even correct or right. Ethically, you want to be careful about imposing your beliefs and ideas too strongly on to a problem because you are shaping that.
At the same time, these are real challenges that are going to exist. We already see them in real life. If we look at the YouTube recommender stuff that was just happening, arguably, that's a misspecified objective. To get a little bit of background here, this is largely based off of a recent New York Times opinion piece, it was looking at the recommendation engine for YouTube, and pointing out it has a bias towards recommending radical content. Either fake news or Islamist videos.
If you dig into why that was occurring, a lot of it is because... what are they doing? They're optimizing for engagement. The process of online radicalization looks super engaging. Now, we can think about, where does that come up. Well, that issue gets introduced in a whole bunch of places. A big piece of it is that there is this adversarial dynamic to the world. There are users generating content in order to be outraging and enraging because they discovered that it gets more feedback and more responses. You need to design a system that's robust to that strategic property of the world. At the same time, you can understand why YouTube was very, very hesitant to be taking actions that would look like censorship.
Lucas: Right. I guess, just coming more often to this idea of the world having lots of adversarial agents in it, human beings are like general intelligences who have reached some level of corrigibility and obedience that works kind of well in the world amongst a bunch of other human beings. That was developed through evolution. Are there potentially techniques for developing the right sorts of corrigibility and obedience in machine learning and AI systems through stages of evolution and running environments like that?
Dylan: I think that's a possibility. I would say, one ... I have a couple of thoughts related to that. The first one is I would actually challenge a little bit of your point of modeling people as general intelligences mainly in a sense that when we talk about artificial general intelligence, we have something in mind. It's often a shorthand in these discussions for perfectly rational Bayesian optimal actor.
Lucas: Right. What does that mean? Just unpack that a little bit.
Dylan: What that means is a system that is taking advantage of all of the information that is currently available to it in order to pick actions that optimize expected utility. When we say perfectly, we mean a system that is doing that as well as possible. It's that modeling assumption that I think sits at the heart of a lot of concerns about existential risk. I definitely think that's a good model to consider, but there's also the concern that it might be misleading in some ways, and that it might not actually be a good model of people and how they act in general.
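For readers who want the definition spelled out, the following is a minimal sketch of a Bayesian expected-utility maximizer on a two-hypothesis toy problem. The weather scenario, probabilities, and utilities are hypothetical and only serve to show the two steps: update the belief on the available information, then pick the action with the highest expected utility.

```python
def posterior(prior, likelihood, observation):
    """Bayes' rule over a discrete set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

def expected_utility(belief, utility, action):
    """Average utility of an action under the current belief."""
    return sum(belief[h] * utility[h][action] for h in belief)

def best_action(belief, utility, actions):
    """The action an expected-utility maximizer would choose."""
    return max(actions, key=lambda a: expected_utility(belief, utility, a))

# Hypothetical toy problem.
ACTIONS = ("take_umbrella", "leave_umbrella")
PRIOR = {"rain": 0.3, "dry": 0.7}
LIKELIHOOD = {
    "rain": {"clouds": 0.9, "clear": 0.1},
    "dry": {"clouds": 0.4, "clear": 0.6},
}
UTILITY = {
    "rain": {"take_umbrella": 1.0, "leave_umbrella": -5.0},
    "dry": {"take_umbrella": -1.0, "leave_umbrella": 2.0},
}

before = best_action(PRIOR, UTILITY, ACTIONS)
belief = posterior(PRIOR, LIKELIHOOD, "clouds")
after = best_action(belief, UTILITY, ACTIONS)
print(before, after)
```

In this toy, seeing clouds shifts the belief enough to change the chosen action. The "perfectly rational" idealization is the assumption that a system performs both steps as well as possible given everything it knows.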
One way to look at it would be to say that there's something about the incentive structure around humans and in our societies that is developed and adapted that creates the incentives for us to be corrigible. Thus, a good research goal of AI is to figure out what those incentives are and to replicate them in AI systems.
Another way to look at it is that people are intelligent, not necessarily in the ways that economics models us as intelligent that there are properties of our behavior, which are desirable properties that don't directly derive from expected utility maximization; or if they do, they derive from a very, very diffuse form of expected utility maximization. This is the perspective that says that people on their own are not necessarily what human evolution is optimizing for, but people are a tool along that way.
We could make arguments for that based off of ... I think it's an interesting perspective to take. What I would say is that in order for societies to work, we have to cooperate. That cooperation was a crucial evolutionary bottleneck, if you will. One of the really, really important things that it did was it forced us to develop the parent-child strategy relationship equilibrium that we currently live in. That's a process whereby we communicate our values, whereby we train people to think that certain things are okay or not, and where we inculcate certain behaviors in the next generation. I think it's that process more than anything else that we really, really want in an AI system and in powerful AI systems.
Now, the thing is the ... I guess, we'll have to continue on that a little more. It's really, really important that that's there because if you don't have those cognitive abilities to understand causing pain, and to just fundamentally decide that that's a bad idea to have a desire to cooperate to buy into the different coordinations and normative mechanisms that human society uses. If you don't have that, then you end up ... Well, then society just doesn't function. A hunter gatherer tribe of self-interested sociopaths probably doesn't last for very long.
What this means is that our ability to coordinate our intelligence and cooperate with it was co-evolved and co-adapted alongside our intelligence. I think that that evolutionary pressure and bottleneck was really important to getting us to the type of intelligence that we are now. It's not a pressure that AI is necessarily subjected to. I think, maybe that is one way to phrase the concern, I'd say.
When I look to evolutionary systems and where the incentives for corrigibility, and cooperation, and interaction come from, it's largely about the processes whereby people are less like general intelligences in some ways. Evolution allowed us to become smart in some ways and restricted us in others based on the imperatives of group coordination and interaction. I think that a lot of our intelligence and practice is about reasoning about group interaction and what groups think is okay and not. That's a part of the developmental process that we need to replicate in AI just as much as spatial reasoning or vision.
Lucas: Cool. I guess, I just want to touch base on this before we move on. Are there certain assumptions about the kinds of agents that humans are and almost, I guess, ideas about us as being utility maximizers in some sense that people you see commonly have but that are misconceptions about people and how people operate differently from AI?
Dylan: Well, I think that that's the whole field of behavioral economics in a lot of ways. I could go through examples of people being irrational. I think there are all of the examples of people being more than just self-interested. There are ways in which we seem to be risk-seeking that seem like they would be irrational from an individual perspective, but you could argue that it may be rational from a group evolutionary perspective.
I mean, things like overeating. I mean, that's not exactly the same type of rationality but it is an example of us becoming ill-adapted to our environments and showing the extent to which we're not capable of changing or in which it may be hard to. Yeah, I think, in some ways, one story that I tell about AI risk is that back in the start of the AI field, we were looking around and saying, "We want to create something intelligent." Intuitively, we all know what that means, but we need a formal characterization of it. The formal characterization that we turned to was the, basically, theories of rationality developed in economics.
Although those theories turned out to be, except in some settings, not great descriptors of human behavior, they were quite useful as a guide for building systems that accomplish goals. I think that part of what we need to do as a field is reassess where we're going and think about whether or not building something like that perfectly rational actor is actually a desirable end goal. I mean, there's a sense in which it is. I would like an all-powerful, perfectly aligned genie to help me do what I want in life.
You might think that if the odds of getting that wrong are too high, that maybe you would do better with shooting for something that doesn't quite achieve that ultimate goal, but that you can get to with pretty high reliability. This may be a setting where "shoot for the moon, and if you miss, you'll land among the stars" is just a horribly misleading perspective.
Lucas: Shoot of the moon, and you might get a hellscape universe, but if you shoot for the clouds, it might end up pretty okay.
Dylan: Yeah. We could iterate on the sound bite, but I think something like that may not be ... That's where I stand on my thinking here.
Lucas: We've talked about a few different approaches that you've been working on over the past few years. What do you view as the main limitations of such approaches currently? Mostly, you're only thinking about one-machine, one-human systems or environments. What are the biggest obstacles that you're facing right now in inferring and learning human preferences?
Dylan: Well, I think, the first thing is it's just an incredibly difficult inference problem. It's a really difficult inference problem to imagine running at scale with explicit inference mechanisms. One thing you can do is design a system that explicitly tracks a belief about someone's preferences, and then acts, and responds to that. Those are systems that you could try to prove theorems about. They're very hard to build. It can be difficult to get them to work correctly.
In contrast, you can create systems that have incentives to construct beliefs to accomplish their goals. It's easier to imagine building those systems and having them work at scale, but it's much, much harder to understand how you would be confident in those systems being well aligned.
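As an illustration of the first kind of system Dylan describes, one that explicitly tracks a belief about someone's preferences and then acts on it, here is a minimal sketch. It is not taken from the conversation: the two preference hypotheses, the softmax ("noisily rational") choice model, and the `beta` parameter are all assumptions made for the example.

```python
import math

# Hypothetical candidate preferences: each maps an option to how much the person likes it.
HYPOTHESES = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea": {"coffee": 0.0, "tea": 1.0},
}


def choice_likelihood(rewards, choice, beta=2.0):
    """P(choice | rewards) under an assumed noisily rational (softmax) model of the person."""
    weights = {option: math.exp(beta * r) for option, r in rewards.items()}
    return weights[choice] / sum(weights.values())


def update_belief(belief, choice):
    """One Bayesian update of the belief over hypotheses after observing a choice."""
    unnormalized = {
        name: prob * choice_likelihood(HYPOTHESES[name], choice)
        for name, prob in belief.items()
    }
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


def best_action(belief):
    """Pick the option with the highest expected reward under the current belief."""
    options = list(next(iter(HYPOTHESES.values())))
    return max(
        options,
        key=lambda option: sum(
            prob * HYPOTHESES[name][option] for name, prob in belief.items()
        ),
    )


# Start uncertain, observe a few of the person's choices, then act.
belief = {"likes_coffee": 0.5, "likes_tea": 0.5}
for observed in ["tea", "tea", "coffee", "tea"]:
    belief = update_belief(belief, observed)
action = best_action(belief)
```

With only two hypotheses the update is trivial to compute. The difficulty Dylan points to is that realistic preference spaces are vastly larger, which is what makes explicit inference hard to run at scale.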
I think that one of the biggest concerns I have, I mean, we're still very far from many of these approaches being very practical to be honest. I think this theory is still pretty unfounded. There’s still a lot of work to go to understand, what is the target we're even shooting for? What does an aligned system even mean? My colleagues and I have spent an incredible amount of time trying to just understand, what does it mean to be value-aligned if you are a suboptimal system.
There's one example that I think about, which is, say, you're cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it's actually suggesting the wrong move or a bad move. Would you call that system value-aligned?
Lucas: No, I would not.
Dylan: I think most people wouldn't. Now, what if I told you that that program was actually implemented as a search that's using the correct goal test? It actually turns out that if it's within 10 steps of a winning play, it always finds that for you, but because of computational limitations, it usually doesn't. Now, is the system value-aligned? I think it's a little harder to tell here. What I do find is that when I tell people the story, and I start off with the search algorithm with the correct goal test, they almost always say that that is value-aligned but stupid.
There's an interesting thing going on here, which is we're not totally sure what the target we're shooting for is. You can take this thought experiment and push it further. Suppose you're doing that search, but, now, say it's a heuristic search that uses the correct goal test but has an adversarially chosen heuristic function. Would that be a value-aligned system? Again, I'm not sure. If the heuristic was adversarially chosen, I'd say probably not. If the heuristic just happened to be bad, then I'm not sure.
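The "value-aligned but stupid" search in this thought experiment can be sketched in a few lines. This is a toy under stated assumptions, not the chess program from the example: the game (start from a positive number, add 1 or double it, win by hitting a target exactly), the function names, and the arbitrary fallback move are all hypothetical.

```python
def moves(state):
    """Hypothetical toy game: from a positive number you can add 1 or double it."""
    return {"add1": state + 1, "double": state * 2}


def is_goal(state, target):
    """The correct goal test: we win exactly when we hit the target."""
    return state == target


def can_win(state, target, depth):
    """Depth-limited search: is the goal reachable within `depth` more moves?"""
    if is_goal(state, target):
        return True
    if depth == 0 or state > target:  # out of budget, or overshot (moves only increase)
        return False
    return any(can_win(nxt, target, depth - 1) for nxt in moves(state).values())


def suggest(state, target, depth_limit):
    """Suggest a move that provably wins within the limit, else fall back to a guess."""
    for name, nxt in moves(state).items():
        if can_win(nxt, target, depth_limit - 1):
            return name, True
    return "add1", False  # arbitrary fallback when the search runs out of budget


# From 3, the shortest win to 14 is double, add1, double (3 -> 6 -> 7 -> 14).
within_budget = suggest(3, 14, depth_limit=3)  # finds the winning first move
out_of_budget = suggest(3, 14, depth_limit=2)  # same goal test, but falls back
```

Both calls use exactly the same goal test. The only difference is the computational budget, which is why it is unclear whether the second, unhelpful suggestion should count as misaligned or merely limited.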
Lucas: Could you potentially unpack what it means for something to be adversarially chosen?
Dylan: Sure. Adversarially chosen in this case just means that there is some intelligent agent selecting the heuristic function or that evaluation measurement in a way that's designed to maximally screw you up. Adversarial analysis is a really common technique used in cryptography where we try to think of adversaries selecting inputs for computer systems that will cause them to malfunction. In this case, what this looks like is an adversarial algorithm that looks, at least, on the surface like it is trying to help you accomplish your objectives but is actually trying to fool you.
I'd say that, more generally, what this thought experiment helps me with is understanding that value alignment is actually a quite tricky and subjective concept. It's actually quite hard to nail down in practice what it would mean.
Lucas: What sort of effort do you think needs to happen and from who in order to specify what it really means for a system to be value-aligned and to not just have a soft squishy idea of what that means but to have it really formally mapped out, so it can be implemented in machine systems?
Dylan: I think, we need more people working on technical AI safety research. I think to some extent it may always be something that's a little ill-defined and squishy. Generally, I think it goes to the point of needing good people in AI willing to do this squishier less concrete work that really gets at it. I think value alignment is going to be something that's a little bit more like I know it when I see it. As a field, we need to be moving towards a goal of AI systems where alignment is the end goal, whatever that means.
I'd like to move away from artificial intelligence where we think of intelligence as an ability to solve puzzles to artificial aligning agents where the goal is to build systems that are actually accomplishing goals on your behalf. I think the types of behaviors and strategies that arise from taking that perspective are qualitatively quite different from the strategies of pure puzzle solving on a well specified objective.
Lucas: All this work we've been discussing is largely at a theoretic and meta level. At this point, is this the main research that we should be doing, or is there any space for research into what specifically might be implementable today?
Dylan: I don't think that's the only work that needs to be done. For me, I think it's a really important type of work that I'd like to see more of. I think a lot of important work is about understanding how to build these systems in practice and to think hard about designing AI systems with meaningful human oversight.
I'm a big believer in the idea that, in AI safety, the distinction between short-term and long-term issues is not really that large, and that there are synergies between the research problems that go both directions. I believe that, on the one hand, looking at short-term safety issues, which includes things like Uber's car just killed someone, it includes YouTube's recommendation engine, it includes issues like fake news and information filtering, I believe that all of those things are related to and give us our best window into the types of concerns and issues that may come up with advanced AI.
At the same time, and this is a point where I think people concerned about x-risks do themselves a disservice by not focusing here, doing theory about advanced AI systems, and in particular about systems where it's not possible to, what I would call, unilaterally intervene, systems that aren't corrigible by default, actually gives us a lot of ideas about how to build systems now that are merely hard to intervene with or oversee.
If you're thinking about issues of monitoring and oversight, and how do you actually get a system that can appropriately evaluate when it should go to a person because its objectives are not properly specified or may not be relevant to the situation, I think YouTube would be in a much better place today if they had a robust system for doing that for their recommendation engine. In a lot of ways, the concerns about x-risks represent an extreme set of assumptions for getting AI right now.
Lucas: I think I'm also just trying to get a better sense of what the system looks like, and how it would be functioning on a day to day. What is the data that it's taking in in order to capture, learn, and infer specific human preferences and values? Just trying to understand better whether or not it can model whole moral views and ethical systems of other agents, or if it's just capturing little specific bits and pieces?
Dylan: I think my ideal would be to, as a system designer, build in as little as possible about my moral beliefs. I think that, ideally, the process would look something ... Well, one process that I could see and imagine doing right would be to just directly go after trying to replicate something about the moral imprinting process that people have with their children. You have someone who's like a guardian or is responsible for an AI system's decisions, and we build systems to try to align with that one individual, and then try to adopt, and extend, and push forward the beliefs and preferences of that individual. I think that's one concrete version that I could see.
I think a lot of the place where I see things maybe a little bit different than some people is that I think that the main ethical questions we're going to be stuck with and the ones that we really need to get right are the mundane ones. The things that most people agree on and think are just, obviously, that's not okay. Mundane ethics and morals rather than the more esoteric or fancier population ethics questions that can arise. I feel a lot more confident about the ability to build good AI systems if we get that part right. I feel like we've got a better shot at getting that part right because there's a clearer target to shoot for.
Now, what kinds of data would you be looking at? In that case, it would be data from interaction with a couple of select individuals. Ideally, you'd want as much data as you can. What I think you really want to be careful of here is how many assumptions you make about the procedure that's generating your data.
What I mean by that is whenever you learn from data, you have to make some assumption about how that data relates to the right thing to do, where right is with like a capital R in this case. The more assumptions you make there, the more your systems would be able to learn about values and preferences, and the quicker it would be able to learn about values and preferences. But, the more assumptions and structure you make there, the more likely you are to get something wrong that your system won't be able to recover from.
Again, we see this trade-off come up between the amount of uncertainty that you need in the system in order to be able to adapt to the right person and figure out the correct preferences and morals, and the efficiency with which you can figure that out.
I guess, I mean, in saying this it feels a little bit like I'm rambling and unsure about what the answer looks like. I hope that that comes across because I'm really not sure. The rough structure of data generated from people, interpreted in a way that involves the fewest prior conceptions about what people want and what preferences people have that we can get away with, is what I would shoot for. Beyond that, I don't really know what that would look like in practice.
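One way to make this trade-off concrete is to compare a learner with few built-in assumptions against one with strong assumptions. The sketch below is only an illustration: the three options, the softmax choice model, and the two priors (`weak_prior`, `strong_prior`) are hypothetical and not from the conversation.

```python
import math

OPTIONS = ["a", "b", "c"]
# Hypothetical preference hypotheses: each kind of person likes exactly one option.
HYPOTHESES = {
    name: {opt: 1.0 if opt == name else 0.0 for opt in OPTIONS} for name in OPTIONS
}


def likelihood(rewards, choice, beta=2.0):
    """P(choice | rewards) under an assumed noisily rational (softmax) model."""
    weights = {opt: math.exp(beta * r) for opt, r in rewards.items()}
    return weights[choice] / sum(weights.values())


def posterior(prior, observations):
    """Bayesian belief over hypotheses after a sequence of observed choices."""
    belief = dict(prior)
    for choice in observations:
        belief = {h: p * likelihood(HYPOTHESES[h], choice) for h, p in belief.items()}
        total = sum(belief.values())
        belief = {h: p / total for h, p in belief.items()}
    return belief


weak_prior = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # few assumptions about the person
strong_prior = {"a": 0.5, "b": 0.5, "c": 0.0}      # assumes nobody prefers "c"

# When the assumption holds (the person likes "a"), the stronger prior learns faster.
fast = posterior(strong_prior, ["a"])["a"]
slow = posterior(weak_prior, ["a"])["a"]

# When it is wrong (the person likes "c"), no amount of data lets it recover.
stuck = posterior(strong_prior, ["c"] * 20)["c"]
recovered = posterior(weak_prior, ["c"] * 20)["c"]
```

In this toy setup `fast` exceeds `slow` after a single observation, while `stuck` stays at exactly zero because a hypothesis with zero prior probability can never gain posterior mass.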
Lucas: Right. It seems here that it's encroaching on a bunch of very difficult social, political, and ethical issues involving persons and data, which will be selected for preference aggregation, like how many people are included in developing the reward function and utility function of the AI system. Also, I guess, we have to be considering culturally-sensitive systems where systems operating in different cultures and contexts are going to need to be trained on different sets of data. I guess, there will also be questions of ethics about whether or not we'll even want systems to be training off of certain cultures' data.
Dylan: Yeah. I would actually say that a good value ... I wouldn't necessarily even think of it as training off of different data. One of the core questions in artificial intelligence is identifying the relevant community that you are in and building a normative understanding of that community. I want to push back a little bit and move you away from the perspective of we collect data about a culture, and we figure out the values of that culture. Then, we build our system to be value-aligned with that culture.
I think of it more like this: the actual AI product is the process whereby we determine, elicit, and respond to the normative values of the multiple overlapping communities that you find yourself in. That process is ongoing. It's holistic, it's overlapping, and it's messy. To the extent that I think it's possible, I'd like to not have a couple of people sitting around in a room deciding what the right values are. Much more, I think, a system should be holistically designed with value alignment at multiple scales as a core property of AI.
I think that that's actually a fundamental property of human intelligence. You behave differently based on the different people around, and you're very, very sensitive to that. There are certain things that are okay at work, that are not okay at home, that are okay on vacation, that are okay around kids, that are not. Figuring out what those things are and adapting yourself to them is the fundamental intelligence skill needed to interact in modern life. Otherwise, you just get shunned.
Lucas: It seems to me in the context of a really holistic, messy, ongoing value alignment procedure, we'll be aligning AI systems' ethics, and morals, and moral systems, and behavior with that of a variety of cultures, and persons, and just interactions in the 21st Century. When we reflect upon the humans of the past, we can see in various ways that they are just moral monsters. We have issues with slavery, and today we have issues with factory farming, and voting rights, and tons of other things in history.
How should we view and think about aligning powerful systems' ethics and goals with current human morality and preferences, and the risk of amplifying current things which are immoral in present day life?
Dylan: This is the idea of mistakenly locking in the wrong values, in some sense. I think it is something we should be concerned about less from the standpoint of entire ... Well, no, I think yes from the standpoint of entire cultures getting things wrong. Again, I think if we don't think of there being a monolithic society that has a single value set, these problems are fundamental issues. What your local community thinks is okay versus what other local communities think is okay.
A lot of our society and a lot of our political structures are about how to handle those clashes between value systems. My ideal for AI systems is that they should become a part of that normative process, and maybe not participate in it as people, but, also, I think, if we think of value alignment as a consistent ongoing messy process, there is ... I think maybe that perspective lends itself less towards locking in values and sticking with them. There's one way you can look at the problem, which is we determine what's right and what's wrong, and we program our system to do that.
Then, there's another one, which is we program our system to be sensitive to what people think is right or wrong. I think that's more the direction that I think of value alignment in. Then, what I think the final part of what you're getting at here is that the system actually will feed back into people. What AI systems show us will shape what we think is okay and vice versa. That's something that I am quite frankly not sure how to handle. I don't know how you're going to influence what someone wants, and what they will perceive that they want, and how to do that, I guess, correctly.
All I can say is that we do have a human notion of what is acceptable manipulation. We do have a human notion of allowing someone to figure out for themselves what they think is right and not and refraining from biasing them too far. To some extent, if you're able to value align with communities in a good ongoing holistic manner, that should also give you some ways to choose and understand what types of manipulations you may be doing that are okay or not.
I'd also say that I think that this perspective has a very mundane analogy when you think of the feedback cycle between recommendation engines and regular people. Those systems don't model the effect ... Well, they don't explicitly model the fact that they're changing the structure of what people want and what they'll want in the future. That's probably not the best analogy in the world.
I guess what I'm saying is that it's hard to plan for how you're going to influence someone's desires in the future. It's not clear to me what's right or what's wrong. What's true is that we, as humans, have a lot of norms about what types of manipulation are okay or not. You might hope that appropriately doing value alignment in that way might help get to an answer here.
Lucas: I'm just trying to get a better sense here. When I'm thinking about the role that ethics and intelligence play here, I view intelligence as a means of modeling the world and achieving goals, and ethics as the end towards which intelligence is aimed here. Now, I'm curious in terms of behavior modeling where inverse reinforcement learning agents are modeling, I guess, the behavior of human agents and, also, predicting the sorts of behaviors that they'd be taking in the future or in the situation in which the inverse reinforcement learning agent finds itself.
I'm curious to know where metaethics and moral epistemology fit in, where inverse reinforcement learning agents are finding themselves in novel ethical situations, and what their ability to handle those novel ethical situations is like. When they're handling those situations, how much does it look like them performing some normative and metaethical calculus based on the kind of moral epistemology that they have, or how much does it look like they're using some other behavioral predictive system where they're modeling humans?
Dylan: The answer to that question is not clear. What does it actually mean to make decisions based on an ethical framework or metaethical framework? I guess, we could start there. You and I know what that means, but our definition is encumbered by the fact that it's pretty human-centric. I think we talk about it in terms of, "Well, I weighed this option. I looked at that possibility." We don't even really mean the literal sense of weighed, as in actually counted up, and constructed actual numbers, and multiplied them together in our heads.
What these are is they're actually references to complex thought patterns that we're going through. There's the question of whether or not those thought patterns are going on in the AI system. You can also talk about the difference between the process of making a decision and the substance of it. When an inverse reinforcement learning agent is going out into the world, the policy it's following is constructed to try to optimize a set of inferred preferences, but does that mean that the policy you're outputting is making metaethical characterizations?
Well, at the moment, almost certainly not because the systems we build are just not capable of that type of cognitive reasoning. I think the bigger question is, do you care? To some extent, you probably do.
Lucas: I mean, I'd care if I had some very deep disagreements with the metaethics that led to the preferences that were learned and loaded into the machine. Also, if the machine were in such a new novel ethical situation that was unlike anything human beings had faced that just required some metaethical reasoning to deal with.
Dylan: Yes. I mean, I think you definitely want it to take decisions that you would agree with or, at least, that you could be non-maliciously convinced to agree with. Practically, there isn't a place in the theory where that shows up. It's not clear that what you're saying is that different from value alignment in particular. If I were to try to refine the point about metaethics, what it sounds to me like you're getting at is an inductive bias that you're looking for in the AI systems.
Arguably, ethics is an argument about what inductive bias we should have as humans. I don't think that that's a first-order property in value alignment systems necessarily or in preference-based learning systems in particular. I would think that that kind of metaethics, I think, comes in from value aligning to someone that has these sophisticated ethical ideas.
I don't know where your thoughts about metaethics came from, but, at least, indirectly, we can probably trace them down to the values that your parents inculcated in you as a child. That's how we build metaethics into your head if we want to think of you as being an AGI. I think that for AI systems, that's the same way that I would see it being in there. I don't believe the brain has circuits dedicated to metaethics. I think that exists in software, and in particular, something that's being programmed into humans from their observational data, more so than from the structures that are built into us as a fundamental part of our intelligence or value alignment.
Lucas: We've also talked a bit about how human beings are potentially not fully rational agents. With inverse reinforcement learning, this leaves open the question as to whether or not AI systems are actually capturing what the human being actually prefers, or if there's some limitations in the humans' observed or chosen behavior, or explicitly told preferences like limits in that ability to convey what we actually most deeply value or would value given more information. These inverse reinforcement learning systems may not be learning what we actually value or what we think we should value.
How can AI systems assist in this evolution of human morality and preferences whereby we're actually conveying what we actually value and what we would value given more information?
Dylan: Well, there are certainly two things that I heard in that question. One is, how do you just mathematically account for the fact that people are irrational, and that that is a property of the source of your data? Inverse reinforcement learning, at face value, doesn't allow us to model that appropriately. It may lead us to make the wrong inferences. I think that's a very interesting question. It's probably the main one that I think about now as a technical problem: understanding what are good ways to model how people might or might not be rational, and building systems that can appropriately interact with that complex data source.
One recent thing that I've been thinking about is, what happens if people, rather than knowing their objective, what they're trying to accomplish, are figuring it out over time? This is the model where the person is a learning agent that discovers how they like states when they enter them, rather than thinking of the person as an agent that already knows what they want, and they're just planning to accomplish that. I think these types of assumptions that try to paint a very, very broad picture of the space of things that people are doing can help us in that vein.
When someone is learning, it's actually interesting that you can actually end up helping them. You end up with a classic strategy that looks like it breaks down into three phases. You have an initial exploration phase where you help the learning agent to get a better picture of the world, and the dynamics, and its associated rewards.
Then, you have another observation phase where you observe how that agent, now, takes advantage of the information that it's got. Then, there's an exploitation or extrapolation phase where you try to implement the optimal policy given the information you've seen so far. I think, moving towards more complex models that have a more realistic setting and richer set of assumptions behind them is important.
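The three phases described here can be sketched with a deliberately simple toy model. This is an illustration only, not the model from Dylan's research: the `LearningHuman` class, the `assist` function, the option names, and the noise-free rewards are all hypothetical simplifications.

```python
from collections import Counter


class LearningHuman:
    """Hypothetical person who only discovers how much they like an option by trying it."""

    def __init__(self, true_rewards):
        self._true_rewards = true_rewards  # hidden from the assistant
        self.experienced = {}              # what the person has learned so far

    def try_option(self, option):
        """Experience an option and learn its reward (noise-free in this toy)."""
        self.experienced[option] = self._true_rewards[option]

    def choose(self):
        """Pick the best option among those already experienced."""
        return max(self.experienced, key=self.experienced.get)


def assist(human, options, observation_rounds=3):
    # Phase 1, exploration: help the person experience every option once.
    for option in options:
        human.try_option(option)
    # Phase 2, observation: watch what the now-informed person chooses.
    observed = Counter(human.choose() for _ in range(observation_rounds))
    # Phase 3, exploitation: act on the preference inferred from those choices.
    return observed.most_common(1)[0][0]


human = LearningHuman({"hiking": 0.2, "chess": 0.9, "cooking": 0.5})
assistant_choice = assist(human, ["hiking", "chess", "cooking"])
```

The point of the ordering is that observing the person before the exploration phase would reveal little, since they have not yet discovered what they like. A more realistic version would need noisy rewards and a cost for exploration.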
The other thing you talked about was about helping people discover their morality and learn more what's okay and what's not. There, I'm afraid I don't have too much interesting to say in the sense that I believe it's an important question, but I just don't feel that I have many answers there.
Practically, if you have someone who's learning their preferences over time, is that different than humans refining their moral theories? I don't know. You could make mathematical modeling choices, so that they are. I'm not sure if that really gets at what you're trying to point towards. I'm sorry that I don't have anything more interesting to say on that front other than, I think, it's important, and I would love to talk to more people who are spending their days thinking about that question because I think it really does deserve that kind of intellectual effort.
Lucas: Yeah, yeah. It sounds like we need some more AI moral psychologists to help us think about these things.
Dylan: Yeah. In particular, when talking about philosophy around value alignment and the ethics of value alignment, I think a really important question is, what are the ethics of developing value alignment systems? A lot of times, people talk about AI ethics from the standpoint of, for a lack of a better example, the trolley problem. The way they think about it is, who should the car kill? There is a correct answer or maybe not a correct answer, but there are answers that we could think of as more or less bad. Which one of those options should the AI select? That's not unimportant, but it's not the ethical question that an AI system designer is faced with.
In my mind, if you're designing a self-driving car, the relevant questions you should be asking are these: One, what do I think is an okay way to respond to different situations? Two, how is my system going to be understanding the preferences of the people involved in those situations? Then, three, how should I design my system in light of those two facts?
I have my own preferences about what I would like my system to do. I have an ethical responsibility, I would say, to make sure that my system is adapting to the preferences of its users to the extent that it can. I also wonder to what extent. How should you handle things when there are conflicts between those two value sets?
You're building a robot. It's going to go and live with an uncontacted human tribe. Should it respect the local cultural traditions and customs? Probably. That would be respecting the values of the users. Then, let's say that that tribe does something that we would consider to be gross like pedophilia. Is my system required to participate wholesale in that value system? Where is the line that we would need to draw between unfairly imposing my values on system users and being able to make sure that the technology that I build isn't used for purposes that I would deem reprehensible or gross?
Lucas: Maybe we should just put a dial in each of the autonomous cars that lets the user set it to deontology mode or utilitarianism mode as it's racing down the highway. Yeah, I think this is the ... I guess, an important role. I just think that metaethics is super important. I'm not sure if this is necessarily the case, but if fully autonomous systems are going to play a role where they're resolving these ethical dilemmas for us, which I guess at some point eventually, if they're going to be really actually autonomous and help to make the world a much better place seems necessary.
I guess, this feeds into my next question where I'm wondering where we probably both have different assumptions about this, but what the role of inverse reinforcement learning is ultimately? Is it just to allow AI systems to evolve alongside us and to match current ethics or is it to allow the systems to ultimately surpass us and move far beyond us into the deep future?
Dylan: Inverse reinforcement learning, I think, is much more about the first than the second. I think it can be a part of how you get to the second and how you improve. For me, when I think about these problems technically, I try to think about matching human morality as the goal.
Lucas: Except for the factory farming and stuff.
Dylan: Well, I mean, if you had a choice between a system that thinks that eradicating all humans is okay and is against factory farming versus one that is neutral about factory farming and thinks that eradicating all humans isn't okay, which would you pick? I mean, I guess, with your audience, there are maybe some people that would choose the saving the animals answer.
My point is that, I think, it's so hard for me. Technically, I think it's very hard to imagine getting these normative aspects of human societies and interaction right. I think, just hoping to participate in that process in a way that is analogous to how people do normally is a good step. I think we probably, to the extent that we can, should probably not have AI systems trying to figure out if it's okay to do factory farming and to the extent that we can ...
I think that it's so hard to understand what it means to even match human morality or participate in it that, for me, the concept of surpassing, it feels very, very challenging and fraught. I would worry, as a general concern, that as a system designer who doesn't necessarily represent the views and interest of everyone, that by programming in surpassing humanity or surpassing human preferences or morals, what I'm actually doing is just programming in my morals and ethical beliefs.
Lucas: Yes. I mean, there seems to be this strange issue here where it seems like if we get AGI, and recursive self-improvement is a thing that really takes off, so that we have a system that has potentially succeeded in its inverse reinforcement learning, but far surpassed human beings in its general intelligence. We have a superintelligence that's matching human morality. It just seems like a funny situation where we'd really have to pull the brakes and, I guess, as William MacAskill mentions, have a really, really long deliberation about ethics, and moral epistemology, and value. How do you view that?
Dylan: I think that's right. I mean, I think there are some real questions about who should be involved in that conversation. For instance, I actually even think it's ... Well, one thing I'd say is that you should recognize that there's a difference between having the same morality and having the same data. One way to think about it is that people who are against factory farming have a different morality than the rest of the people.
Another one is that they actually just have exposure to the information that allows their morality to come to a better answer. There's this confusion you can make between the objective that someone has and the data that they've seen so far. I think, one point would be to think that a system that has current human morality but access to a vast, vast wealth of information may actually do much better than you might think. I think, we should leave that open as a possibility.
For me, this is less about morality in particular, and more just about power concentration, and how much influence you have over the world. I mean, if we imagine that there was something like a very powerful AI system that was controlled by a small number of people, yeah, you better think freaking hard before you tell that system what to do. That's related to questions about ethical ramifications on metaethics, and generalization, and what we actually truly value as humans. That is also super true for all of the more mundane things in the day to day as well. Did that make sense?
Lucas: Yeah, yeah. It totally makes sense. I'm becoming increasingly mindful of your time here. I just wanted to hit a few more questions if that's okay before I let you go.
Dylan: Please, yeah.
Lucas: Yeah. I'm wondering, would you like to, or do you have any thoughts on how coherent extrapolated volition fits into this conversation and your views on it?
Dylan: What I'd say is I think coherent extrapolated volition is an interesting idea and goal.
Lucas: Where it is defined as?
Dylan: Where it's defined as a method of preference aggregation. Personally, I'm a little wary of preference aggregation approaches. Well, I'm wary of imposing your morals on someone indirectly via choosing the method of preference aggregation that we're going to use. I would-
Lucas: Right, but it seems like, at some point, we have to make some metaethical decision, or else, we'll just forever be lost.
Dylan: Do we have to?
Lucas: Well, some agent does.
Lucas: Go ahead.
Dylan: Well, does one agent have to? Did one agent decide on the ways that we were going to do preference aggregation as a society?
Lucas: No. It naturally evolved out of-
Dylan: It just naturally evolved via a coordination and argumentative process. For me, my answer to ... If you force me to specify something about how we're going to do value aggregation, if I was controlling the values for an AGI system, I would try to say as little as possible about the way that we're going to aggregate values because I think we don't actually understand that process much in humans.
Lucas: Right. That's fair.
Dylan: Instead, I would opt for a heuristic of, to the extent that we can, devoting equal optimization effort towards every individual, and allowing that parliament, if you will, to determine the way the values should be aggregated. This doesn't necessarily mean having an explicit value aggregation mechanism that gets set in stone. This could be an argumentative process mediated by artificial agents arguing on your behalf. This could be a futuristic, AI-enabled version of the court system.
Lucas: It's like an ecosystem of preferences and values in conversation?
Lucas: Cool. We've talked a little bit about the deep future here, where we're potentially reaching toward AGI or artificial superintelligence. After, I guess, inverse reinforcement learning is potentially solved, is there anything that you view as coming after inverse reinforcement learning in these techniques?
Dylan: Yeah. I mean, I think inverse reinforcement learning is certainly not the be-all, end-all. I think what it is, is it's one of the earliest examples in AI of trying to really look at preference elicitation, and modeling preferences, and learning preferences. It existed in a whole bunch of ... economists have been thinking about this for a while already. Basically, yeah, I think there's a lot to be said about how you model data and how you learn about preferences and goals. I think inverse reinforcement learning is basically the first attempt to get at that, but it's very far from the end.
I would say the biggest thing in how I view things that is maybe different from your standard reinforcement learning, inverse reinforcement learning perspective is that I focus a lot on how you act given what you've learned from inverse reinforcement learning. Inverse reinforcement learning is a pure inference problem. It's just: figure out what someone wants. In all of our research, I ground that out in taking actions to help someone, which introduces a new set of concerns and questions.
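[Editor's illustration, not part of the conversation: the distinction Dylan draws between inferring what someone wants and then acting to help them can be sketched in a few lines of Python. Everything in this toy example is an assumption made for illustration: the three actions, the two candidate reward functions, the Boltzmann-rational model of the human, and the function names. It is a minimal sketch, not the method used in Dylan's research.]

```python
import math

# Hypothetical toy setup: three actions and two candidate reward functions.
ACTIONS = ["left", "stay", "right"]
REWARD_HYPOTHESES = {
    "likes_left": {"left": 1.0, "stay": 0.0, "right": -1.0},
    "likes_right": {"left": -1.0, "stay": 0.0, "right": 1.0},
}


def action_likelihood(action, reward, beta=2.0):
    """Assumed human model: P(action | reward) is proportional to exp(beta * reward)."""
    weights = {a: math.exp(beta * reward[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())


def infer_posterior(demonstrations, beta=2.0):
    """Inference step: Bayesian update over reward hypotheses, starting from a uniform prior."""
    names = list(REWARD_HYPOTHESES)
    posterior = {n: 1.0 / len(names) for n in names}
    for action in demonstrations:
        for n in names:
            posterior[n] *= action_likelihood(action, REWARD_HYPOTHESES[n], beta)
        total = sum(posterior.values())
        posterior = {n: p / total for n, p in posterior.items()}
    return posterior


def choose_helpful_action(posterior):
    """Acting step: pick the action with the highest expected reward under the posterior."""
    def expected_reward(action):
        return sum(p * REWARD_HYPOTHESES[n][action] for n, p in posterior.items())
    return max(ACTIONS, key=expected_reward)


if __name__ == "__main__":
    beliefs = infer_posterior(["right", "right"])
    print(beliefs)
    print(choose_helpful_action(beliefs))
```

[The first function pair is the pure inference problem; `choose_helpful_action` is the additional step of acting on what was learned, which is where questions about uncertainty and risk enter.]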
Lucas: Great. It looks like we're about at the end of the hour here. I guess, if anyone here is interested in working on this technical portion of the AI alignment problem, what do you suggest they study or how do you view that it's best for them to get involved, especially if they want to work on inverse reinforcement learning and inferring human preferences?
Dylan: I think if you're an interested person, and you want to get into technical safety work, the first thing you should do is probably read Jan Leike's recent write-up at 80,000 Hours. Generally, what I would say is, try to get involved in AI research flat. Don't focus as much on trying to get into AI safety research, and just generally focus more on acquiring the skills that will support you in doing good AI research. Get a strong math background. Get a research advisor who will advise you on doing research projects, and help teach you the process of submitting papers, and figuring out what the AI research community is going to be interested in.
In my experience, one of the biggest mistakes that early researchers make is focusing too much on what they're researching rather than thinking about who they're researching with, and how they're going to learn the skills that will support doing research in the future. I think that most people don't appreciate how transferable research skills are. To the extent that you can, try to do research on technical AI safety, but more broadly, work on technical AI. If you're interested in safety, the safety connections will be there. You may see how a new area of AI actually relates to it or supports it, or you may find places of new risks, and be in a good position to try to mitigate them and take steps to alleviate those harms.
Lucas: Wonderful. Yeah, thank you so much for speaking with me today, Dylan. It's really been a pleasure, and it's been super interesting.
Dylan: It was a pleasure talking to you. I love the chance to have these types of discussions.
Lucas: Great. Thanks so much. Until next time.
Dylan: Until next time. Thanks, that was a blast.
Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We'll be back soon with another episode in this new AI alignment series.