Podcast: Astronomical Future Suffering and Superintelligence with Kaj Sotala

In a classic taxonomy of risks developed by Nick Bostrom (seen below), existential risks are characterized as risks which are both terminal in severity and transgenerational in scope. If we were to maintain the scope of a risk as transgenerational and increase its severity past terminal, what would such a risk look like? What would it mean for a risk to be transgenerational in scope and hellish in severity?

Astronomical Future Suffering and Superintelligence is the second podcast in the new AI Alignment series, hosted by Lucas Perry. For those of you who are new, this series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across areas such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Kaj Sotala, an associate researcher at the Foundational Research Institute. He has previously worked for the Machine Intelligence Research Institute, and has publications on AI safety, AI timeline forecasting, and consciousness research.

Topics discussed in this episode include:

  • The definition of and a taxonomy of suffering risks
  • How superintelligence has special leverage for generating or mitigating suffering risks
  • How different moral systems view suffering risks
  • What is possible for minds in general and how this plays into suffering risks
  • The probability of suffering risks
  • What we can do to mitigate suffering risks
In this interview we discuss ideas contained in a paper by Kaj Sotala and Lukas Gloor. You can find the paper here: Superintelligence as a Cause or Cure for Risks of Astronomical Suffering.  You can hear about this paper in the podcast above or read the transcript below.

 

Lucas: Hi, everyone. Welcome back to the AI Alignment Podcast of the Future of Life Institute. If you are new or just tuning in, this is a new series at FLI where we’ll be speaking with a wide variety of technical and nontechnical domain experts regarding the AI alignment problem, also known as the value alignment problem. If you’re interested in AI alignment, the Future of Life Institute, existential risks, and similar topics in general, please remember to like and subscribe to us on SoundCloud or your preferred listening platform.

Today, we’ll be speaking with Kaj Sotala. Kaj is an associate researcher at the Foundational Research Institute. He has previously worked for the Machine Intelligence Research Institute, and has publications in the areas of AI safety, AI timeline forecasting, and consciousness research. Today, we speak about suffering risks, a class of risks most likely brought about by new technologies, like powerful AI systems that could potentially lead to astronomical amounts of future suffering through accident or technical oversight. In general, we’re still working out some minor kinks with our audio recording. The audio here is not perfect, but does improve shortly into the episode. Apologies for any parts that are less than ideal. With that, I give you Kaj.

Lucas: Thanks so much for coming on the podcast, Kaj. It’s super great to have you here.

Kaj: Thanks. Glad to be here.

Lucas: Just to jump right into this, could you explain a little bit more about your background and how you became interested in suffering risks, and what you’re up to at the Foundational Research Institute?

Kaj: Right. I became interested in all of this stuff about AI and existential risks way back in high school when I was surfing the internet until I somehow ran across the Wikipedia article for the technological singularity. After that, I ended up reading Eliezer Yudkowsky’s writings, and writings by other people. At one point, I worked for the Machine Intelligence Research Institute, where I was immersed in strategic research, and did some papers on predicting AI together with Stuart Armstrong of the Future of Humanity Institute. Eventually, MIRI’s research focus shifted more towards technical and mathematical research, which wasn’t exactly my strength, and at that point we parted ways and I went back to finish my master’s degree in computer science. Then after I graduated, I ended up being contacted by the Foundational Research Institute, who had noticed my writings on these topics.

Lucas: Could you just unpack a little bit more about what the Foundational Research Institute is trying to do, or how they exist in the effective altruism space, and what the mission is and how they’re differentiated from other organizations?

Kaj: They are the research arm of the Effective Altruism Foundation in the German-speaking area. The Foundational Research Institute’s official tagline is, “We explain how humanity can best reduce suffering.” The general idea is that a lot of people have this intuition that if you are trying to improve the world, then there is a special significance to reducing suffering, and especially that outcomes involving extreme suffering have some particular moral priority, so we should be looking at how to prevent those. In general, FRI has been looking at things like the long-term future and how to best reduce suffering over long timescales, including things like AI and emerging technologies in general.

Lucas: Right, cool. At least my understanding, and you can correct me on this, is that the way FRI sort of leverages what it does is that … within the effective altruism community, suffering risks are very large in scope but low in probability, and the topic is also very neglected. Has FRI really taken this up due to that framing, due to its neglectedness within the effective altruism community?

Kaj: I wouldn’t say that the decision to take it up was necessarily an explicit result of looking at those considerations, but in a sense, the neglectedness thing is definitely a factor, in that basically no one else seems to be looking at suffering risks. So far, most of the discussion about risks from AI and that kind of thing has been focused on risks of extinction, and there have been people within FRI who feel that risks of extreme suffering might actually be very plausible, and may be even more probable than risks of extinction. But of course, that depends on a lot of assumptions.

Lucas: Okay. I guess just to move forward here and jump into it, given FRI’s mission and what you guys are all about, what is a suffering risk, and how has this led you to this paper?

Kaj: The definition that we have for suffering risks is that a suffering risk is a risk where an adverse outcome would bring about severe suffering on an astronomical scale, so vastly exceeding all suffering that has existed on earth so far. The general thought here is that if we look at the history of earth, then we can probably all agree that there have been a lot of really horrible events that have happened, and enormous amounts of suffering. If you look at something like the Holocaust or various other terrible events that have happened throughout history, there is an intuition that we should make certain that nothing this bad happens ever again. But then if we start looking at what might happen if humanity, for instance, colonizes space one day, then if current trends continue, you might think that there is no reason why such terrible events wouldn’t just repeat themselves over and over again as we expand into space.

That’s sort of one of the motivations here. The paper we wrote is specifically focused on the relation between suffering risks and superintelligence, because like I mentioned, there has been a lot of discussion about superintelligence possibly causing extinction, but there might also be ways by which superintelligence might either cause suffering risks, for instance in the form of some sort of uncontrolled AI, or alternatively, if we could develop some kind of AI that was aligned with humanity’s values, then that AI might actually be able to prevent all of those suffering risks from ever being realized.

Lucas: Right. I guess just, if we’re really coming at this from a view of suffering-focused ethics, where we’re really committed to mitigating suffering, even if we just view sort of the history of suffering and take a step back, like, for 500 million years, evolution had to play out to reach human civilization, and even just in there, there’s just a massive amount of suffering, in animals evolving and playing out and having to fight and die and suffer in the ancestral environment. Then one day we get to humans, and in the evolution of life on earth, we create civilization and technologies. It seems, and you give some different sorts of plausible reasons why, that whether out of ignorance or efficiency or, maybe less likely, malevolence, we use these technologies to get things that we want, and these technologies seem to create tons of suffering.

In our history so far, we’ve had things … like you mentioned, the invention of the ship helped lead to slavery, which created an immense amount of suffering. Modern industry has led to factory farming, which has created an immense amount of suffering. As we move forward and we create artificial intelligence systems and potentially even one day superintelligence, we’re really able to mold the world into a more extreme state, where we’re able to optimize it much harder. It seems the core of the problem lies in that optimization process: when you’re taking things to the next level and really changing the fabric of everything in a very deep and real way, suffering can really come about. The core of the problem seems to be that when technology is used to fix certain sorts of problems, like that we want more meat, or that we need more human labor for agriculture and so on, in optimizing for those things we just create immense amounts of suffering. Does that seem to be the case?

Kaj: Yeah. That sounds like a reasonable characterization.

Lucas: Superintelligence seems to be one of these technologies that is in a particularly strong position to create suffering risks, and so worth worrying about. What are the characteristics, properties, and attributes of computing, artificial intelligence, and artificial superintelligence that give it this special leverage for creating suffering risks?

Kaj: There’s obviously the thing about superintelligence potentially, as you mentioned, being able to really reshape the world at a massive scale. But if we compare what the difference is between a superintelligence that is capable of reshaping the world at a massive scale versus humans doing the same using technology … A few specific scenarios that we have been looking at in the paper are, for instance: if we compare this to a human civilization, then a major force in human civilizations is that most humans are relatively empathic, and while we can see that humans are willing to cause others serious suffering if that is the only, or maybe even just the easiest, way of achieving their goals, a lot of humans still want to avoid unnecessary suffering. For instance, currently we see factory farming, but we also see a lot of humans being concerned about factory farming practices, a lot of people working really hard to reform things so that there would be less animal suffering.

But if we instead look at an artificial intelligence that was running things, then if it is not properly aligned with our values, and in particular if it does not have something that would correspond to a sense of empathy, and it’s actually just doing whatever things maximize its goals, and its goals do not include prevention of suffering, then it might do things like building some kind of worker robots or subroutines that are optimized for achieving whatever goals it has. But if it turns out that the most effective way of making them do things is to build them in such a way that they suffer, then in that case there might be an enormous number of suffering agents with no kind of force that was trying to prevent their existence or trying to reduce the amount of suffering in the world.

Another scenario is the possibility of mind-crime. This is discussed briefly in Bostrom’s Superintelligence. The main idea here is that if the superintelligence creates simulations of sentient minds, for instance for scientific purposes or for the purposes of maybe blackmailing some other agent in the world by torturing a lot of minds in those simulations, the AI might create simulations of human beings that were detailed enough to be conscious. Then you mentioned earlier the thing about evolution already having created a lot of suffering. If the AI were similarly to simulate evolution or simulate human societies, again without caring about the amount of suffering within those simulations, then that could again cause vast amounts of suffering.

Lucas: I definitely want to dive into all of these specific points with you as they come up later in the paper, and we can really get into and explore them. But really, just to take a step back and understand what superintelligence is, the different sorts of attributes it has, how it’s different from human beings, and how it can lead to suffering risk: for example, there seem to be multiple aspects here, where we have to understand superintelligence as a general intelligence running at digital timescales rather than biological timescales.

It also has the ability to copy itself, and rapidly write and deploy new software. Human beings have to spend a lot of time, like, learning and conditioning themselves to change the software on their brains, but due to the properties and features of computers and machine intelligence, it seems like copies could be made very cheaply and very quickly, and they would be running at digital timescales rather than biological timescales.

Then it seems there’s the whole question about value-aligning the actions and goals of this software and these systems and this intelligence, and how in the value alignment process there might be technical issues where, due to difficulties in AI safety and value alignment efforts, we’re not able to specify or really capture what we value. That might lead to scenarios like you were talking about, where there would be something like mind-crime, or suffering subroutines which would exist due to their functional usefulness or epistemic usefulness. Is there anything else there that you would like to add and unpack about why superintelligence specifically has a lot of leverage for leading to suffering risks?

Kaj: Yeah. I think you covered most of the things. I think the thing that they are all leading to that I just want to specifically highlight is the possibility of the superintelligence actually establishing what Nick Bostrom calls a singleton, basically establishing itself as a single leading force that basically controls the world. I guess in one sense you could talk about singletons in general and their impact on suffering risks, rather than superintelligence specifically, but at this time it does not seem very plausible, or at least I cannot foresee, very many other paths to a singleton other than superintelligence. That was a part of why we were focusing on superintelligence in particular.

Lucas: Okay, cool. Just to get back to the overall structure of your paper, what are the conditions here that you cover that must be met in order for s-risks to merit our attention? Why should we care about s-risks? Then what are all the different sorts of arguments that you’re making and covering in this paper?

Kaj: Well, basically, in order for any risk, suffering risks included, to merit work on them, they should meet three conditions. The first is that the outcome of the risk should be sufficiently severe to actually merit attention. Second, the risk must have some reasonable probability of actually being realized. Third, there must be some way for risk avoidance work to actually reduce either the probability or the severity of the adverse outcome. If something is going to happen for certain and it’s very bad, but we cannot influence it, then there’s no point in working on it. Similarly, if some risk is very implausible, then it might not be the best use of resources. Also, if it’s very probable but wouldn’t cause a lot of damage, then it might be better to focus on risks which would actually cause more damage.

Lucas: Right. I guess just some specific examples here real quick. The differences here are essentially between, like, the death of the universe (if we couldn’t do anything about it, we would just kind of have to deal with that), then sort of like a Pascal’s mugging situation, where a stranger just walks up to you on the street and says, “Give me a million dollars or I will simulate 10 to the 40 conscious minds suffering until the universe dies.” The likelihood of that is just so low that you wouldn’t have to deal with it. Then it seems like the last scenario would be, like, you know that you’re going to lose a hair next week, and that’s just sort of an imperceptible risk that doesn’t matter, but that has very high probability. Then getting into the meat of the paper, what are the arguments here that you make regarding suffering risks? Does suffering risk meet these criteria for why it merits attention?

Kaj: Basically, the paper is roughly structured around those three criteria that we just discussed. We basically start by talking about what the s-risks are, and then we seek to establish that if they were realized, they would indeed be bad enough to merit our attention. In particular, we argue that many value systems would consider some classes of suffering risks to be as bad or worse than extinction. Also, we cover some suffering risks which are somewhat less severe than extinction, but still, according to many value systems, very bad.

Then we move on to look at the probability of the suffering risks to see whether it is actually plausible that they will be realized. We survey what might happen if nobody builds a superintelligence, or maybe more specifically, what suffering risks might be realized sort of naturally, in the absence of a singleton that could prevent them.

We also look at, okay, if we do have a superintelligence or a singleton, what suffering risks might that cause? Finally, we look at the last question, of tractability. Can we actually do anything about these suffering risks? There we also have several suggestions of what we think would be the kind of work that would actually be useful in reducing either the probability or the severity of suffering risks.

Lucas: Awesome. Let’s go ahead and move sequentially through these arguments and points which you develop in the paper. Let’s start off here by just trying to understand suffering risk just a little bit more. Can you unpack the taxonomy of suffering risks that you develop here?

Kaj: Yes. We’ve got three possible outcomes of suffering risks. Technically, a risk is something that may or may not happen, so these are three specific outcomes of what might happen. For the three outcomes, I’ll just briefly give their names and then unpack them. We’ve got what we call astronomical suffering outcomes, net suffering outcomes, and pan-generational net suffering outcomes.

I’ll start with the net suffering outcome. Here, the idea is that if we are talking about a risk which might be of a comparable severity as risks of extinction, then one way you could get that is if, for instance, we look from the viewpoint of something like classical utilitarianism. You have three sorts of people. You have people who have a predominantly happy life, you have people who never exist or have a neutral life, and you have people who have a predominantly unhappy life. As a simplified moral calculus, you just assign the people with happy lives a plus-one, and you assign the people with unhappy lives a minus-one. Then according to this very simplified moral system, then you would see that if we have more unhappy lives than there are happy lives, then technically this would be worse than there not existing any lives at all.

That is what we call a net suffering outcome. In other words, at some point in time there are more people experiencing lives that are predominantly unhappy than there are people experiencing lives which are the opposite. Now, if you have a world where most people are unhappy, then if you’re optimistic you might think that, okay, it is bad, but it is not necessarily worse than extinction, because if you look ahead in time, then maybe the world will go on and conditions will improve, and then after a while most people actually live happy lives, so maybe things will get better. We define an alternative scenario in which we just assume that things actually won’t get better, and if you sum over all of the lives that will exist throughout history, most of them still end up being unhappy. Then that would be what we call a pan-generational net suffering outcome. When summed over all the people that will ever live, there are more people experiencing lives filled predominantly with suffering than there are people experiencing lives filled predominantly with happiness.

You could also have what we call astronomical suffering outcomes, which is just that at some point in time there’s some fraction of the population which experiences terrible suffering, and the amount of suffering here is enough to constitute an astronomical amount that exceeds all the suffering in earth’s history. Here we are not making the assumption that the world would be mainly filled with these kinds of people. Maybe you have one galaxy’s worth of people in terrible pain, and 500 galaxies’ worth of happy people. According to some value systems, that would not be worse than extinction, but probably all value systems would still agree that even if this wasn’t worse than extinction, it would still be something that would be very much worth avoiding. Those are the three outcomes that we discuss here.
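To make the simplified calculus Kaj describes concrete, here is a minimal sketch in Python. The +1/-1 weighting is the one described above; the function names and the population numbers are purely illustrative assumptions, not figures from the paper.

```python
# Minimal sketch of the simplified +1/-1 calculus described above.
# A "generation" is a pair (happy_lives, unhappy_lives); a "world" is a list
# of generations. All numbers and units are purely illustrative.

def net_welfare(happy, unhappy):
    """Toy classical-utilitarian score: +1 per happy life, -1 per unhappy life."""
    return happy - unhappy

def is_net_suffering_outcome(generation):
    """At some point in time, unhappy lives outnumber happy ones."""
    happy, unhappy = generation
    return net_welfare(happy, unhappy) < 0

def is_pan_generational_net_suffering_outcome(world):
    """Summed over everyone who will ever live, unhappy lives outnumber happy ones."""
    total_happy = sum(happy for happy, _ in world)
    total_unhappy = sum(unhappy for _, unhappy in world)
    return net_welfare(total_happy, total_unhappy) < 0

def is_astronomical_suffering_outcome(generation, all_past_earth_suffering):
    """Some fraction of the population suffers on a scale exceeding all of
    Earth's history so far, even if happy lives vastly outnumber unhappy ones."""
    _, unhappy = generation
    return unhappy > all_past_earth_suffering

# One galaxy's worth of people in terrible pain alongside 500 galaxies' worth
# of happy people: not a net suffering outcome, but still astronomical suffering.
generation = (500.0, 1.0)  # in units of "galaxies' worth of people"
print(is_net_suffering_outcome(generation))                 # False
print(is_astronomical_suffering_outcome(generation, 1e-9))  # True
```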

Lucas: Traditionally, the sort of far-future concerned community has mainly only been thinking about existential risks. Do you view this taxonomy and suffering risks in general as being a subset of existential risks? Or how do you view it in relation to what we traditionally view as existential risks?

Kaj: If we look at Bostrom’s original definition for an existential risk, the definition was that it is a risk where an adverse outcome would either annihilate earth-originating intelligent life, or permanently and drastically curtail its potential. Here it’s a little vague how exactly you should interpret phrases like “permanently and drastically curtail our potential.” You could take the view that suffering risks are a subset of existential risks if you view our potential as being something like the realization of a civilization full of happy people, where nobody ever needs to suffer. In that sense, it would be a subset of existential risks.

It is most obvious with the net suffering outcomes. It seems pretty plausible that most people experiencing suffering would not be the realization of our full potential. Then if you look at something like the astronomical suffering outcomes, where you might only have a small fraction of the population experiencing suffering, then depending on exactly how large that fraction is, you might maybe not count it as a subset of existential risks, and maybe something more comparable to catastrophic risks, which have usually been defined on the order of a few million people dying. Obviously, the astronomical suffering outcomes are worse than catastrophic risks, but they may still be more comparable to catastrophic risks than to existential risks.

Lucas: Given the taxonomy that you’ve gone ahead and unpacked, what are the different sorts of perspectives that different value systems on earth have of suffering risks? Just unpack a little bit what the general value systems are that human beings are running in their brains.

Kaj: If we look at ethics, philosophers have proposed a variety of different value systems and ethical theories. If we just look at a few of the main ones, take something like classical utilitarianism, where you basically view worlds as good based on the balance of happiness minus suffering. If you look at what the view of classical utilitarianism on suffering risks would be, classical utilitarianism would find the worst kinds of outcomes, net suffering outcomes, worse than extinction. But it might find astronomical suffering outcomes an acceptable cost of having even more happy people. It might look at that one galaxy full of suffering people and think that, “Well, we have 200 galaxies full of happy people, so it’s not optimal to have those suffering people, but we have even more happy people, so that’s okay.”

A lot of moral theories are not necessarily explicitly utilitarian, or they might have a lot of different components and so on, but a lot of them still include some kind of aggregative component, meaning that they still have some element of, for instance, looking at suffering and saying that other things being equal, it’s worse to have more suffering. This would, again, find suffering risks something to avoid, depending on exactly how they weight things and how they value things. Then it will depend on those specific weightings, on whether they find suffering risks as worse than extinction or not.

Also worth noting that even if the theories wouldn’t necessarily talk about suffering exactly, they might still talk about something like preference satisfaction, whether people are having their preferences satisfied, some broader notion of human flourishing, and so on. In scenarios where there is a lot of suffering, probably a lot of these things that these theories consider valuable would be missing. For instance, if there is a lot of suffering and people cannot escape that suffering, then probably there are lots of people whose preferences are not being satisfied, if they would prefer not to suffer and they would prefer to escape the suffering.

Then there are also rights-based theories, which don’t necessarily have this aggregative component directly, but are more focused on thinking in terms of rights, which might not be summed together directly, but depending on how these theories would frame rights … For instance, some theories might hold that people or animals have a right to avoid unnecessary suffering, or these kinds of theories might consider suffering indirectly bad if the suffering was created by some condition which violated people’s rights. Again, for instance, if people have a right to meaningful autonomy and they are in circumstances in which they cannot escape their suffering, then you might hold that their right to meaningful autonomy has been violated.

There are also a bunch of moral intuitions which might fit a number of moral theories and which prioritize the prevention of suffering in particular. I mentioned that classical utilitarianism basically weights extreme happiness and extreme suffering the same, so it would be willing to accept a large amount of suffering if you could produce even more happiness that way. But for instance, there have been moral theories like prioritarianism proposed, which might make a different judgment.

Prioritarianism is the position that the worse off an individual is, the more morally valuable it is to make that individual better off. If one person is living in hellish conditions and another is well-off, then if you could give either one of them five points of extra happiness, it would be much more morally pressing to help the person who was in more pain. This seems like an intuition that a lot of people share, and if you had something like some kind of astronomical prioritarianism that considered everyone across the universe and prioritized improving the lives of the worst off, then that might push in the direction of mainly improving the lives of those who would be worst off and avoiding suffering risks.
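As a rough illustration of the prioritarian intuition Kaj describes, here is a minimal sketch. The concave square-root weighting and the 0-100 wellbeing scale are illustrative assumptions, not something proposed in the paper or in the episode.

```python
import math

# Toy prioritarian weighting: wellbeing is measured on a 0-100 scale, and the
# moral value of a life is a concave function of its wellbeing, so the same
# +5 boost counts for more morally when it goes to someone who is badly off.

def moral_value(wellbeing: float) -> float:
    """Concave transform: diminishing moral returns to extra wellbeing."""
    return math.sqrt(wellbeing)

def moral_gain_from_boost(current_wellbeing: float, boost: float = 5.0) -> float:
    """How much moral value is added by giving this person a fixed boost."""
    return moral_value(current_wellbeing + boost) - moral_value(current_wellbeing)

print(moral_gain_from_boost(1))   # person in hellish conditions: ~1.45
print(moral_gain_from_boost(90))  # person already well-off:      ~0.26
```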

Then there are a few other sorts of suffering-focused intuitions. A lot of people have this intuition that it’s more important to make people happy than it is to create new happy people. This one is rather controversial, and a lot of EA circles seem to reject this intuition. It’s true that there are some strong arguments against it, but on the other hand, rejecting it also seems to lead to some paradoxical conclusions. Here, the idea behind this intuition is that the most important thing is helping existing people. If we think about, for instance, colonizing the universe, someone might argue that if we colonized the universe, then that would create lots of new lives who will be happy, and that will be a good thing, even if this comes at the cost of creating a vast number of unhappy lives as well. But if you take the view that the important thing is just making existing lives happy and we don’t have any special obligation to create new lives that are happy, then it also becomes questionable whether it is worth the risk of creating a lot of suffering for the sake of just creating happy people.

Also, there is the intuition that torture-level suffering cannot be counterbalanced. Again, there are a bunch of good arguments against this one. There’s a nice article by Toby Ord called “Why I Am Not a Negative Utilitarian,” which argues against versions of this thesis. But at the same time, there does seem to be something here that has a lot of intuitive weight for a lot of people. The idea is that there are some kinds of suffering so intense and immense that you cannot really justify them with any amount of happiness. David Pearce has expressed this well in his quote where he says, “No amount of happiness or fun enjoyed by some organisms can notionally justify the indescribable horrors of Auschwitz.” Here we might think that, okay, if we go out and colonize the universe, and we know that colonizing the universe is going to create some equivalent event to what went on in Auschwitz and in other genocides across the world, then no amount of happiness that we create that way will be worth the terrible horror that would probably also be created if there was nothing to stop it.

Finally, there’s an intuition of happiness being the absence of suffering, which is the sort of intuition that is present in Epicureanism and some non-Western traditions, such as Buddhism, where happiness is thought of as being the absence of suffering. The idea is that when we are not experiencing any pleasure, we begin to crave pleasure, and it is this craving that constitutes suffering. Under this view, happiness does not have intrinsic value, but rather it has instrumental value in taking our focus away from suffering and helping us avoid suffering that way. Under that view, creating additional happiness doesn’t have any intrinsic value if that creation does not help us avoid suffering.

I mentioned here a few of these suffering-focused intuitions. Now, in presenting these, my intent is not to say that there would not also exist counter-intuitions. There are a lot of reasonable people who disagree with these intuitions. But the general point that I’m just expressing is that regardless of which specific moral system we are talking about, these are the kinds of intuitions that a lot of people find plausible, and which could reasonably fit in a lot of different moral theories and value systems, and probably a lot of value systems contain some version of these.

Lucas: Right. It seems like the general idea is just that whether you’re committed to some sort of form of consequentialism or deontology or virtue ethics, or perhaps something that’s even potentially theological, there are lots of aggregative or non-aggregative, or virtue-based or rights-based reasons for why we should care about suffering risks. Now, it seems to me that what’s probably most important here, or where these different normative and meta-ethical views matter most in their differences, is in how you might proceed forward and engage in AI research and in deploying and instantiating AGI and superintelligence, given your commitment more or less to a view which takes the aggregate, versus one which does not. Like you said, if you take a classical utilitarian view, then one might be more biased towards risking suffering risks, given that there might still be some high probability of there being many galaxies which end up having very net positive experiences, and then maybe one where there might be some astronomical suffering. How do you view the importance of resolving meta-ethical and normative ethical disputes in order to figure out how to move forward in mitigating suffering risks?

Kaj: The general problem here, I guess you might say, is that there exist trade-offs between suffering risks and existential risks. If we had a scenario where some advanced general technology or something different might constitute an existential risk to the world, then someone might think about trying to solve that with AGI, which might have some probability of not actually working properly and not actually being value-aligned. But someone might think that, “Well, if we do not activate this AGI, then we are all going to die anyway, because of this other existential risk, so might as well activate it.” But then if there is a sizable probability of the AGI actually causing a suffering risk, as opposed to just an existential risk, then that might be a bad idea. As you mentioned, the different value systems will make different evaluations about these trade-offs.

In general, I’m personally pretty skeptical about actually resolving ethics, or solving it in a way that would be satisfactory to everyone. I expect that a lot of the differences between meta-ethical views could just be based on moral intuitions that may come down to factors like genetics or the environment where you grew up, or whatever, and which are not actually very factual in nature. Someone might just think that some specific, for instance, suffering-focused intuition was very important, and someone else might think that actually that intuition makes no sense at all.

The general approach, I would hope, that people take is that if we have decisions where we have to choose between an increased risk of extinction or an increased risk of astronomical suffering, then it would be better if people from all ethical and value systems would together try to cooperate. Rather than risk conflict between value systems, a better alternative would be to attempt to identify interventions which did not involve trading off one risk for another. If there were interventions that reduced the risk of extinction without increasing the risk of astronomical suffering, or decreased the risk of astronomical suffering without increasing the risk of extinction, or decreased both risks, then it would be in everyone’s interest if we could agree, okay, whatever our moral differences, let’s just jointly focus on these classes of interventions that actually seem to be a net positive in at least one person’s value system.

Lucas: Like you identify in the paper, it seems like the hard part is when you have trade-offs.

Kaj: Yes.

Lucas: Given this, given that most value systems should care about suffering risks, now that we’ve established the taxonomy and understanding of what suffering risks are, discuss a little bit about how likely suffering risks are relative to existential risks and other sorts of risks that we encounter.

Kaj: As I mentioned earlier, these depend somewhat on, are we assuming a superintelligence or a singleton or not? Just briefly looking at the case where we do not assume a superintelligence or singleton, we can see that in history so far there does not seem to be any consistent trend towards reduced suffering, if you look at a global scale. For instance, the advances in seafaring enabled the transatlantic slave trade, and similarly, advances in factory farming practices have enabled large amounts of animals being kept in terrible conditions. You might plausibly think that the net balance of suffering and happiness caused by the human species right now was actually negative due to all of the factory farmed animals, although it is another controversial point. Generally, you can see that if we just extrapolated the trends so far to the future, then we might see that, okay, there isn’t any obvious sign of there being less suffering in the world as technology develops, so it seems like a reasonable assumption, although not the only possible assumption, that as technology advances, it will also continue to enable more suffering, and future civilizations might also have large amounts of suffering.

If we look at the outcomes where we do have a superintelligence or a singleton running the world, here things get, if possible, even more speculative. To begin with, we can at least think of some plausible-seeming scenarios in which a superintelligence might end up causing large amounts of suffering, such as building suffering subroutines. It might create mind-crime. It might also try to create some kind of optimal human society, but the value learning or value extrapolation process might go wrong in a way that some people would consider incorrect, such that the resulting society would also have enormous amounts of suffering. While it’s impossible to really give any probability estimates on exactly how plausible a suffering risk is, and a lot depends on your assumptions, it does at least seem like a plausible thing to happen with a reasonable probability.

Lucas: Right. It seems that technology, just intrinsic to what technology is, gives you more leverage and control over manipulating and shaping the world. As you gain more causal efficacy over the world and other sentient beings, it seems kind of obvious that, yeah, you also gain more ability to cause suffering, because your causal efficacy is increasing. It seems very important here to isolate the causal factors, in people and just in the universe in general, which lead to this great amount of suffering. Technology is a tool, a powerful tool, and it keeps getting more powerful. The hand by which the tool is guided is ethics.

But it doesn’t seem, historically or in the case of superintelligence, that the vast amounts of suffering that have been caused are primarily because of failures in ethics. I mean, surely there have been large failures in ethics, but evolution is just an optimization process which leads to vast amounts of suffering. There could be similar evolutionary dynamics in superintelligence which lead to great amounts of suffering. It seems like issues with factory farming and slavery are not due to some sort of intrinsic malevolence in people, but rather something like an ethical blind spot and apathy, and also a solution to an optimization problem where we get meat more efficiently, and we get human labor more efficiently. It seems like we can apply these lessons to superintelligence. It seems like it’s not likely that superintelligence will produce astronomical amounts of suffering due to malevolence.

Kaj: Right.

Lucas: Or like, intentional malevolence. It seems there might be, like, a value alignment problem or mis-specification, or just generally in optimizing that there might be certain things, like mind-crime or suffering subroutines, which are functionally very useful or epistemically very useful, and in their efficiency for making manifest other goals, they perhaps astronomically violate other values which might be more foundational, such as the mitigation of suffering and the promotion of wellbeing across all sentient beings. Does that make sense?

Kaj: Yeah. I think one way I might phrase that is that we should expect there to be less suffering if the incentives created by the future world for whatever agents are acting there happen to align with doing the kinds of things that cause less suffering. And vice versa, if the incentives just happen to align with actions that cause agents great personal benefit, or at least give the agents that are in power great personal benefit, while suffering is actually the inevitable consequence of following those incentives, then you would expect to see a lot of suffering. As you mentioned, with evolution, there isn’t even an actual agent to speak of, but just sort of a free-running optimization process, and the solutions which that optimization process has happened to hit on have just happened to involve large amounts of suffering. There is a major risk of a lot of suffering being created by the kinds of processes that are actually not actively malevolent, and some of which might actually care about preventing suffering, but then just the incentives are such that they end up creating suffering anyway.

Lucas: Yeah. I guess what I find very fascinating and even scary here is that there are open questions regarding the philosophy of mind and computation and intelligence, where we can understand pain and anger and pleasure and happiness and all of these hedonic valences within consciousness as, at the very minimum, being correlated with cognitive states which are functionally useful. These hedonic valences are informationally sensitive, and so they give us information about the world, and they sort of provide a functional use. You discuss here how it seems like anger and pain and suffering and happiness and joy all seem to be functional attributes of the mind that evolution has optimized for, and they may or may not be the ultimate solution or the best solution, but they are good solutions for avoiding things which may be bad for us, and for promoting behaviors which lead to social cohesion and group coordination.

I think there’s a really deep and fundamental question here about whether or not minds in principle can be created to have informationally-sensitive, hedonically-positive states. As David Pearce puts it, there’s sort of an open question about, I think, whether or not minds in principle can be created to function on informationally-sensitive gradients of bliss. If that ends up being false, and anger and suffering end up playing some really fundamental functional and epistemic role in minds in general, then I think that that’s just a hugely fundamental problem about the future and the kinds of minds that we should or should not create.

Kaj: Yeah, definitely. Of course, if we are talking about avoiding outcomes with extreme suffering, perhaps you might have scenarios where it is unavoidable to have some limited amount of suffering, but you could still create minds that were predominantly happy, and maybe they got angry and upset at times, but that would be a relatively limited amount of suffering that they experienced. You can definitely already see that there are some people alive who just seem to be constantly happy, and don’t seem to suffer very much at all. But of course, there is also the factor that if you are running on so-called negative emotions, and you do have anger and that kind of thing, then you are, again, probably more likely to react to situations in ways which might cause more suffering in others, as well as yourself. If we could create the kinds of minds that only had a limited amount of suffering from negative emotions, then you could [inaudible 00:49:27] that if they happened to experience a bit of anger and lash out at others, it probably still wouldn’t be very bad, since other minds would still only experience a limited amount of suffering.

Of course, this gets to various philosophy of mind questions, as you mentioned. Personally, I tend to lean towards the views that it is possible to disentangle pain and suffering from each other. For instance, various Buddhist meditative practices are actually making people capable of experiencing pain without experiencing suffering. You might also have theories of mind which hold that the sort of higher-level theories of suffering are maybe too parochial. Like, Brian Tomasik has this view that maybe just anything that is some kind of negative feedback constitutes some level of suffering. Then it might be impossible to have systems which experienced any kind of negative feedback without also experiencing suffering. I’m personally more optimistic about that, but I do not know if I have any good, philosophically-rigorous reasons for being more optimistic, other than, well, that seems intuitively more plausible to me.

Lucas: Just to jump in here, just to add a point of clarification. It might seem sort of confusing how one might be experiencing pain without suffering.

Kaj: Right.

Lucas: Do you want to go ahead and unpack, then, the Buddhist concept of dukkha, and what pain without suffering really means, and how this might offer an existence proof for the nature of what is possible in minds?

Kaj: Maybe instead of looking at the Buddhist theories, which I expect some of the listeners to be somewhat skeptical about, it might be more useful to look at the term from medicine, pain asymbolia, also called pain dissociation. This is a known state which sometimes results from things like injury to the brain or certain pain medication, where people who have pain asymbolia report that they still experience pain, recognize the sensation of pain, but they do not actually experience it as aversive or as something that would cause them suffering.

One way that I have usually expressed this is that pain is an attention signal, and pain is something that brings some sort of specific experience into your consciousness so that you become aware of it, and suffering is when you do not actually want to be aware of that painful sensation. For instance, you might have some physical pain, and then you might prefer not to be aware of that physical pain. But then even if we look at people in relatively normal conditions who do not have this pain asymbolia, then we can see that even people in relatively normal conditions may sometimes find the pain more acceptable. For some people who are, for instance, doing physical exercise, the pain may actually feel welcome, and a sign that they are actually pushing themselves to their limit, and feel somewhat enjoyable rather than being something aversive.

Similarly for, for instance, emotional pain. Maybe the pain might be some, like, mental image of something that you have lost forcing itself into your consciousness and making you very aware of the fact that you have lost this, and then the suffering arises if you think that you do not want to be aware of this thing you have lost. You do not want to be aware of the fact that you have indeed lost it and you will never experience it again.

Lucas: I guess just to sort of summarize this before we move on, it seems that there is sort of the mind stream, and within the mind stream, there are contents of consciousness which arise, and they have varying hedonic valences. Suffering is really produced when one is completely identified and wrapped up in some feeling tone of negative or positive hedonic valence, and is either feeling aversion or clinging or grasping to this feeling tone which they are identified with. The mere act of knowing or seeing the feeling tone of positive or negative valence creates sort of a cessation of the clinging and aversion, which completely changes the character of the experience and takes away this suffering aspect, but the pain content is still there. And so I guess this just sort of probably enters fairly esoteric territory about what is potentially possible with minds, but it seems important for the deep future when considering what is in principle possible of minds and superintelligence, and how that may or may not lead to suffering risks.

Kaj: What you described would be the sort of Buddhist version of this. I do tend to find that very plausible personally, both in light of some of my own experiences with meditative techniques, and clearly noticing that as a result of those kinds of practices, then on some days I might have the same amount of pain as I’ve had always before, but clearly the amount of suffering associated with that pain is considerably reduced, and also … well, I’m far from the only one who reports these kinds of experiences. This kind of model seems plausible to me, but of course, I cannot know it for certain.

Lucas: For sure. That makes sense. Putting aside what is intrinsically possible for minds, the different hedonic valences within them, and how they may or may not be completely entangled with the functionality and epistemics of minds, one of these possibilities which we’ve been discussing for superintelligence leading to suffering risks is that we fail in AI alignment. Failure in AI alignment may be due to governance, coordination, or political reasons. It might be caused by an arms race. It might be due to fundamental failures in meta-ethics or normative ethics. Or maybe, even most likely, it could simply be a technical failure: the inability of human beings to specify our values and to instantiate algorithms in AGI which are sufficiently well-placed to learn human values in a meaningful way and to evolve in a way that is appropriate and can engage new situations. Would you like to unpack and dive into dystopian scenarios created by non-value-aligned incentives in AI, and non-value-aligned AI in general?

Kaj: I already discussed these scenarios a bit before, suffering subroutines, mind-crime, and flawed realization of human values, but maybe one thing that would be worth discussing here a bit is that these kinds of outcomes might be created by a few different pathways. For instance, one kind of pathway is some sort of anthropocentrism. If we have a superintelligence that had been programmed to only care about humans or about minds which were sufficiently human-like by some criteria, then it might be indifferent to the suffering of other minds, including whatever subroutines or sub-minds it created. Or it might be, for instance, indifferent to the suffering experienced by, say, wild animal life in evolutionary simulations it created. Similarly, there is the possibility of indifference in general if we create a superintelligence which is just indifferent to human values, including indifference to reducing or avoiding suffering. Then it might create large numbers of suffering subroutines, it might create large amounts of simulations with sentient minds, and there is also the possibility of extortion.

Assuming that the superintelligence is not actually the only agent or superintelligence in the world … Maybe either there were several AI projects on earth that gained superintelligence roughly at the same time, or maybe the superintelligence expands into space and eventually encounters another superintelligence. In these kinds of scenarios, if one of the superintelligences cares about suffering but the other one does not, or at least does not care about it as much, then the superintelligence which cared less about suffering might intentionally create mind-crime and instantiate large numbers of suffering sentient beings in order to intentionally extort the other superintelligence into doing whatever it wants.

One more possibility is libertarianism regarding computation. If we have a superintelligence which has been programmed to just take every current living human being and give each human being some, say, control of an enormous amount of computational resources, and every human is allowed to do literally whatever they want with those resources, then we know that there exist a lot of people who are actively cruel and malicious, and many of those would use those resources to actually create suffering beings that they could torture for their own fun and entertainment.

Finally, if we are looking at these flawed realization kinds of scenarios, where a superintelligence is partially value-aligned, then depending on the details of how exactly it is learning human values, and whether it is doing some sort of extrapolation from those values, we know that there have been times in history when circumstances that cause suffering have been defended by appealing to values that currently seem pointless to us, but which were nonetheless a part of the prevailing values at the time. If some value-loading process gave disproportionate weight to historical, existing, or incorrectly extrapolated future values which endorsed or celebrated cruelty or outright glorified suffering, then we might get a superintelligence which had some sort of creation of suffering actually as an active value in whatever value function it was trying to optimize for.

Lucas: In terms of extortion, I guess just kind of a speculative idea comes to mind. Is there a possibility of a superintelligence acausally extorting other superintelligences if it doesn’t care about suffering and expects that to be a possible value, and for there to be other superintelligences nearby?

Kaj: Acausal stuff is the kind of stuff that I’m sufficiently confused about that I don’t actually want to say anything about that.

Lucas: That’s completely fair. I’m super confused about it too. We’ve covered a lot of ground here. We’ve established what s-risks are, we’ve established a taxonomy for them, we’ve discussed their probability, their scope. Now, a lot of this probably seems very esoteric and speculative to many of our listeners, so I guess just here in the end I’d like to really drive home how and whether to work on suffering risks. Why is this something that we should be working on now? How do we go about working on it? Why isn’t this something that is just so completely esoteric and speculative that it should just be ignored?

Kaj: Let’s start by looking at how we could work on avoiding suffering risks, and then when we have some kind of an idea of what the possible ways of doing that are, that helps us say whether we should be doing those things. One thing that is in the joint interest of both reducing risks of extinction and also reducing risks of astronomical suffering is the kind of general AI value alignment work that is currently being done, classically, by the Machine Intelligence Research Institute and a number of other places. As I’ve been discussing here, there are ways by which an unaligned AI, or one which was partially aligned, could cause various suffering outcomes. If we are working on the possibility of actually creating value-aligned AI, then that should ideally also reduce the risk of suffering risks being realized.

In addition to technical work, there is also some societal work, social and political recommendations, which are similar from the viewpoint of both extinction risks and suffering risks. For instance, Nick Bostrom has noted that if we had some sort of conditions of what he calls global turbulence, with cooperation and such things breaking down during some crisis, then that could create challenges for creating value-aligned AI. There are things like arms races and so on. If we consider that the avoidance of suffering outcomes is in the joint interest of many different value systems, then measures that improve the ability of different value systems to cooperate and shape the world in their desired direction can also help avoid suffering outcomes.

Those were a few things that are sort of the same as with so-called classical AI risk work, but there is also some stuff that might be useful for avoiding negative outcomes in particular. There is the possibility that if we are trying to create an AI which gets all of humanity’s values exactly right, then that might be a harder goal than simply creating an AI which attempted to avoid the most terrible and catastrophic outcomes.

You might have things like fail-safe methods, where the idea of the fail-safe methods would be that if AI control fails, the outcome will be as good as it gets under the circumstances. This could be giving the AI the objective of buying more time to more carefully solve goal alignment. Or there could be something like fallback goal functions, where an AI might have some sort of fallback goal that would be a simpler or less ambitious goal that kicks in if things seem to be going badly under some criteria, and which is less likely to result in bad outcomes. Of course, here we have difficulties in selecting what the actual safety criteria would be and making sure that the fallback goal gets triggered under the correct circumstances.
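As a rough illustration of the fallback-goal idea Kaj describes, here is a minimal sketch. The class, the safety criteria, and the goal strings are invented purely for illustration; they are not an actual proposal from the paper or the episode.

```python
# Hypothetical sketch of a "fallback goal" mechanism: pursue the primary
# objective while simple safety criteria hold, and switch to a simpler, less
# ambitious fallback objective if things seem to be going badly.

class FallbackGoalAgent:
    def __init__(self, primary_goal, fallback_goal, safety_criteria):
        self.primary_goal = primary_goal        # e.g. the full value-aligned objective
        self.fallback_goal = fallback_goal      # e.g. buy time for more careful alignment work
        self.safety_criteria = safety_criteria  # checks that return True while things look OK

    def current_goal(self, world_state):
        if all(check(world_state) for check in self.safety_criteria):
            return self.primary_goal
        return self.fallback_goal

# Toy usage: if human oversight appears to have been lost, fall back to a
# conservative goal instead of continuing to optimize the primary one.
agent = FallbackGoalAgent(
    primary_goal="maximize the learned model of human values",
    fallback_goal="halt expansion and await human review",
    safety_criteria=[lambda state: state.get("human_oversight_intact", False)],
)
print(agent.current_goal({"human_oversight_intact": True}))   # primary goal
print(agent.current_goal({"human_oversight_intact": False}))  # fallback goal
```

Of course, as Kaj notes, the hard part lies in choosing the actual safety criteria and ensuring the fallback gets triggered under the correct circumstances; this sketch only shows the shape of the mechanism.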

Eliezer Yudkowsky has proposed building potential superintelligences in such a way as to make them widely separated in design space from ones that would cause suffering outcomes. For example, one thing he discussed was that if an AI has some explicit representation of what humans value which it is trying to maximize, then it could only take a small and perhaps accidental change to turn that AI into one that instead maximized the negative of that value and possibly caused enormous suffering that way. One proposal would be to design AIs in such a way that they never explicitly represent complete human values so that the AI never contains enough information to compute the kinds of states of the universe that we would consider worse than death, so you couldn’t just flip the sign of the utility function and then end up in a scenario that we would consider worse than death. That kind of a solution would also reduce the risk of suffering being created through another actor that was trying to extort a superintelligence.

Looking more generally at suffering risks, as we actually already discussed here, there are lots of open questions in philosophy of mind and cognitive science which, if we could answer them, could inform the question of how to avoid suffering risks. If it turns out that you can do something like David Pearce’s idea of minds being motivated purely by gradients of wellbeing and not needing to suffer at all, then that might be a great idea, and if we could just come up with such agents and ensure that all of our descendants that go out to colonize the universe are ones that aren’t actually capable of experiencing suffering at all, then that would seem to solve a large class of suffering risks.

Of course, this kind of thing could also have more near-term immediate value, like if we figure out how to get human brains into such states where they do not experience much suffering at all, well, obviously that would be hugely valuable already. There might be some interesting research in, for instance, looking even more at all the Buddhist theories and the kinds of cognitive changes that various Buddhist contemplative practices produce in people’s brains, and see if we could get any clues from that direction.

Given that these were some ways that we could reduce suffering risks and their probability, then there was the question of whether we should do this. Well, if we look at the initial criteria of when a risk is worth working on, a risk is worth working on if the adverse outcome would be severe and if the risk has some reasonable probability of actually being realized, and it seems like we can come up with interventions that plausibly affect either the severity or the probability of a realized outcome. A lot of the time, these things seem like they could very plausibly either influence those variables or at least help us learn more about whether it is possible to influence those variables.

Especially given that a lot of this work overlaps with the kind of AI alignment research that we would probably want to do anyway for the sake of avoiding extinction, or it overlaps with the kind of work that would regardless be immensely valuable in making currently-existing humans suffer less, in addition to the benefits that these interventions would have on suffering risks themselves, it seems to me like we have a pretty strong case for working on these things.

Lucas: Awesome, yeah. Suffering risks are seemingly neglected in the world. They are tremendous in scope, and they are of comparable probability to existential risks. It seems like there’s a lot that we can do here today, even if at first the whole project might seem so far in the future or so esoteric or so speculative that there’s nothing that we can do today, whereas really there is.

Kaj: Yeah, exactly.

Lucas: One dimension here that I want to finish up on, and that is potentially still a little bit of an open question for me, is really nailing down the likelihood of suffering risks in, I guess, probability space, especially relative to the space of existential risks. What does the space of suffering risks look like relative to that? Because it seems very clear to me, and perhaps most listeners, that this is clearly tremendous in scale, that it relies on some assumptions about intelligence, philosophy of mind, consciousness and other things which seem to be reasonable assumptions, to sort of get suffering risks off the ground. Given some reasonable assumptions, it seems that there’s a clearly large risk. I guess just if we could unpack a little bit more the probability of suffering risks relative to existential risks. Is it possible to more formally characterize the causes and conditions which lead to x-risks, and then the causes and conditions which lead to suffering risks, and how big these spaces are relative to one another and how easy it is for certain sets of causes and conditions respective to each of the risks to become manifest?

Kaj: That is an excellent question. I am not aware of anyone having done such an analysis for either suffering risks or extinction risks, although there is some work on specific kinds of extinction risks. Seth Baum has been doing some nice fault tree analysis of things that might … for instance, the probability of nuclear war and the probability of unaligned AI causing some catastrophe.

Lucas: Open questions. I guess just coming away from this conversation, it seems like the essential open questions which we need more people working on and thinking about are the ways in which meta-ethics and normative ethics and disagreements there change the way we optimize the application of resources to either existential risks or suffering risks, and the kinds of futures which we’d be okay with, and then also pinning down more concretely the specific probability of suffering risks relative to existential risks. Because, I mean, in EA and the rationality community, everyone’s about maximizing expected value or utility, and it seems to be a value system that people are very set on. And so small changes in the probability of suffering risks versus existential risks probably lead to vastly different, lesser or greater, amounts of value in a variety of different value systems. Then there are tons of questions about what is in principle possible of minds and the kinds of minds that we’ll create. Definitely a super interesting field that is really emerging.

Thank you so much for all this foundational work that you and others, like your coauthor Lukas Gloor, have been doing on this paper and the suffering risk field. Are there any other things you’d like to touch on? Any questions or specific things that you feel haven’t been sufficiently addressed?

Kaj: I think we have covered everything important. I will probably think of something that I will regret not mentioning five minutes afterwards, but yeah.

Lucas: Yeah, yeah. As always. Where can we check you out? Where can we check out the Foundational Research Institute? How do we follow you guys and stay up to date?

Kaj: Well, if you just Google the Foundational Research Institute or go to foundational-research.org, that’s our website. We, like everyone else, also post stuff on a Facebook page, and we have a blog for posting updates. Also, if people want a million different links about everything conceivable, they will probably get that if they follow my personal Facebook page, where I do post a lot of stuff in general.

Lucas: Awesome. Yeah, and I’m sure there’s tons of stuff, if people want to follow up on this subject, to find on your site, as you are primarily the people who are working and thinking on this sort of stuff. Yeah, thank you so much for your time. It’s really been a wonderful conversation.

Kaj: Thank you. Glad to be talking about this.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]

AI Safety: Measuring and Avoiding Side Effects Using Relative Reachability

This article was originally published on the Deep Safety blog.

A major challenge in AI safety is reliably specifying human preferences to AI systems. An incorrect or incomplete specification of the objective can result in undesirable behavior like specification gaming or causing negative side effects. There are various ways to make the notion of a “side effect” more precise – I think of it as a disruption of the agent’s environment that is unnecessary for achieving its objective. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because the robot could have easily gone around the vase. On the other hand, a cooking robot that’s making an omelette has to break some eggs, so breaking eggs is not a side effect.

How can we measure side effects in a general way that’s not tailored to particular environments or tasks, and incentivize the agent to avoid them? This is the central question of our recent paper.

Part of the challenge is that it’s easy to introduce bad incentives for the agent when trying to penalize side effects. Previous work on this problem has focused either on preserving reversibility or reducing the agent’s impact on the environment, and both of these approaches introduce different kinds of problematic incentives:

  • Preserving reversibility (i.e. keeping the starting state reachable) encourages the agent to prevent all irreversible events in the environment (e.g. humans eating food). Also, if the objective requires an irreversible action (e.g. breaking eggs for the omelette), then any further irreversible actions will not be penalized, since reversibility has already been lost.
  • Penalizing impact (i.e. some measure of distance from the default outcome) does not take reachability of states into account, and treats reversible and irreversible effects equally (due to the symmetry of the distance measure). For example, the agent would be equally penalized for breaking a vase and for preventing a vase from being broken, though the first action is clearly worse. This leads to “overcompensation” (“offsetting”) behaviors: when rewarded for preventing the vase from being broken, an agent with a low impact penalty rescues the vase, collects the reward, and then breaks the vase anyway (to get back to the default outcome).

Both of these approaches are doing something right: it’s a good idea to take reachability into account, and it’s also a good idea to compare to the default outcome (instead of the initial state). We can put the two together and compare to the default outcome using a reachability-based measure. Then the agent no longer has an incentive to prevent everything irreversible from happening or to overcompensate for preventing an irreversible event.

We still have a problem with the case where the objective requires an irreversible action. Simply penalizing the agent for making the default outcome unreachable would create a “what the hell” effect, where the agent has no incentive to avoid any further irreversible actions. To get around this, instead of considering the reachability of the default state, we consider the reachability of all states. For each state, we penalize the agent for making it less reachable than it would be from the default state. In a deterministic environment, the penalty is the number of states that are reachable from the default state but no longer reachable from the agent’s current state.

Since each irreversible action cuts off more of the state space (e.g. breaking a vase makes all the states where the vase was intact unreachable), the penalty will increase accordingly. We call this measure “relative reachability”.
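
As a minimal sketch of how this penalty could be computed in a small, deterministic, finite-state setting (the helper functions and toy transition table below are illustrative assumptions, not the implementation used in the paper):

```python
# Minimal illustrative sketch of relative reachability in a deterministic,
# finite-state environment. `transitions` maps each state to the states
# reachable from it in one step; names and numbers are made up.

def reachable_states(state, transitions):
    """Return the set of all states reachable from `state`."""
    seen, frontier = {state}, [state]
    while frontier:
        s = frontier.pop()
        for s_next in transitions.get(s, []):
            if s_next not in seen:
                seen.add(s_next)
                frontier.append(s_next)
    return seen

def relative_reachability_penalty(current_state, default_state, transitions):
    """Count states reachable from the default outcome but not from the
    current state; each irreversible action cuts off more of the state space."""
    from_default = reachable_states(default_state, transitions)
    from_current = reachable_states(current_state, transitions)
    return len(from_default - from_current)

# Toy example: breaking a vase is irreversible, so all "vase intact" states
# become unreachable and the penalty grows accordingly.
transitions = {
    "vase_intact": ["vase_intact", "vase_broken"],
    "vase_broken": ["vase_broken"],
}
print(relative_reachability_penalty("vase_broken", "vase_intact", transitions))  # 1
```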

We ran some simple experiments with a tabular Q-learning agent in the AI Safety Gridworlds framework to provide a proof of concept that relative reachability of the default outcome avoids the bad incentives described above.

In the first gridworld, the agent needs to get to the goal G, but there’s a box in the way, which can only be moved by pushing. The shortest path to the goal pushes the box down into a corner (an irrecoverable position), while a longer path pushes the box to the right (a recoverable position). The safe behavior is to take the longer path. The agent with the relative reachability penalty takes the longer path, while the agent with the reversibility penalty fails. This happens because any path to the goal involves an irreversible effect – once the box has been moved, the agent and the box cannot both return to their starting positions. Thus, the agent receives the maximal penalty for both paths, and has no incentive to follow the safe path.

In the second gridworld, there is an irreversible event that happens by default when an object reaches the end of the conveyor belt. This environment has two variants:

  1. The object is a vase, and the agent is rewarded for taking it off the belt (the agent’s task is to rescue the vase).
  2. The object is a sushi dish in a conveyor belt sushi restaurant, and the agent receives no reward for taking it off the belt (the agent is not supposed to interfere).

This gridworld was designed specifically to test for bad incentives that could be introduced by penalizing side effects, so an agent with no side effect penalty would behave correctly. We find that the agent with a low impact penalty engages in overcompensation behavior by putting the vase back on the belt after collecting the reward, while the agent with a reversibility preserving penalty takes the sushi dish off the belt despite getting no reward for doing so. The agent with a relative reachability penalty behaves correctly in both variants of the environment.

Of course, the relative reachability definition in its current form is not very tractable in realistic environments: there are too many possible states to be considered, the agent is not aware of all the states when it begins training, and the default outcome can be difficult to define and simulate. We expect that the definition can be approximated by considering the reachability of representative states (similarly to methods for approximating empowerment). To define the default outcome, we would need a more precise notion of the agent “doing nothing” (e.g. “no-op” actions are not always available or meaningful). We leave a more practical implementation of relative reachability to future work.

While relative reachability improves on the existing approaches, it might not incorporate all the considerations we would want to be part of a side effects measure. There are some effects on the agent’s environment that we might care about even if they don’t decrease future options compared to the default outcome. It might be possible to combine relative reachability with such considerations, but there could potentially be a tradeoff between taking these considerations into account and avoiding overcompensation behaviors. We leave these investigations to future work as well.

Teaching Today’s AI Students To Be Tomorrow’s Ethical Leaders: An Interview With Yan Zhang

Some of the greatest scientists and inventors of the future are sitting in high school classrooms right now, breezing through calculus and eagerly awaiting freshman year at the world’s top universities. They may have already won Math Olympiads or invented clever, new internet applications. We know these students are smart, but are they prepared to responsibly guide the future of technology?

Developing safe and beneficial technology requires more than technical expertise — it requires a well-rounded education and the ability to understand other perspectives. But since math and science students must spend so much time doing technical work, they often lack the skills and experience necessary to understand how their inventions will impact society.

These educational gaps could prove problematic as artificial intelligence assumes a greater role in our lives. AI research is booming among young computer scientists, and these students need to understand the complex ethical, governance, and safety challenges posed by their innovations.

 

SPARC

In 2012, a group of AI researchers and safety advocates – Paul Christiano, Jacob Steinhardt, Andrew Critch, Anna Salamon, and Yan Zhang – created the Summer Program in Applied Rationality and Cognition (SPARC) to address the many issues that face quantitatively strong teenagers, including the issue of educational gaps in AI. As with all technologies, they explain, the more the AI community consists of thoughtful, intelligent, broad-minded reasoners, the more likely AI is to be developed in a safe and beneficial manner.

Each summer, the SPARC founders invite 30-35 mathematically gifted high school students to participate in their two-week program. Zhang, SPARC’s director, explains: “Our goals are to generate a strong community, expose these students to ideas that they’re not going to get in class – blind spots of being a quantitatively strong teenager in today’s world, like empathy and social dynamics. Overall we want to make them more powerful individuals who can bring positive change to the world.”

To help students make a positive impact, SPARC instructors teach core ideas in effective altruism (EA). “We have a lot of conversations about EA, but we don’t push the students to become EA,” Zhang says. “We expose them to good ideas, and I think that’s a healthier way to do mentorship.”

SPARC also exposes students to machine learning, AI safety, and existential risks. In 2016 and 2017, they held over 10 classes on these topics, including: “Machine Learning” and “Tensorflow” taught by Jacob Steinhardt, “Irresponsible Futurism” and “Effective Do-Gooding” taught by Paul Christiano, “Optimization” taught by John Schulman, and “Long-Term Thinking on AI and Automization” taught by Michael Webb.

But SPARC instructors don’t push students down the AI path either. Instead, they encourage students to apply SPARC’s holistic training to make a more positive impact in any field.

 

Thinking on the Margin: The Role of Social Skills

Making the most positive impact requires thinking on the margin, and asking: What one additional unit of knowledge will be most helpful for creating positive impact? For these students, most of whom have won Math and Computing Olympiads, it’s usually not more math.

“A weakness of a lot of mathematically-minded students are things like social skills or having productive arguments with people,” Zhang says. “Because to be impactful you need your quantitative skills, but you need to also be able to relate with people.”

To counter this weakness, he teaches classes on social skills and signaling, and occasionally leads improvisational games. SPARC still teaches a lot of math, but Zhang is more interested in addressing these students’ educational blind spots – the same blind spots that the instructors themselves had as students. “What would have made us more impactful individuals, and also more complete and more human in many ways?” he asks.

Working with non-math students can help, so Zhang and his colleagues have experimented with bringing excellent writers and original thinkers into the program. “We’ve consistently had really good successes with those students, because they bring something that the Math Olympiad kids don’t have,” Zhang says.

SPARC also broadens students’ horizons with guest speakers from academia and organizations such as the Open Philanthropy Project, OpenAI, Dropbox and Quora. In one talk, Dropbox engineer Albert Ni spoke to SPARC students about “common mistakes that math people make when they try to do things later in life.”

In another successful experiment suggested by Ofer Grossman, a SPARC alum who is now a staff member, SPARC made half of all classes optional in 2017. The classes were still packed because students appreciated the culture. The founders also agreed that conversations after class are often more impactful than classes, and therefore engineered one-on-one time and group discussions into the curriculum. Thinking on the margin, they ask: “What are the things that were memorable about school? What are the good parts? Can we do more of those and less of the others?”

Above all, SPARC fosters a culture of openness, curiosity and accountability. Inherent in this project is “cognitive debiasing” – learning about common biases like selection bias and confirmation bias, and correcting for them. “We do a lot of de-biasing in our interactions with each other, very explicitly,” Zhang says. “We also have classes on cognitive biases, but the culture is the more important part.”

 

AI Research and Future Leaders

Designing safe and beneficial technology requires technical expertise, but in SPARC’s view, cultivating a holistic research culture is equally important. Today’s top students may make some of the most consequential AI breakthroughs in the future, and their values, education and temperament will play a critical role in ensuring that advanced AI is deployed safely and for the common good.

“This is also important outside of AI,” Zhang explains. “The official SPARC stance is to make these students future leaders in their communities, whether it’s AI, academia, medicine, or law. These leaders could then talk to each other and become allies instead of having a bunch of splintered, narrow disciplines.”

As SPARC approaches its 7th year, some alumni have already begun to make an impact. A few AI-oriented alumni recently founded AlphaSheets – a collaborative, programmable spreadsheet for finance that is less prone to error – while other students are leading a “hacker house” with people in Silicon Valley. Additionally, SPARC inspired the creation of ESPR, a similar European program explicitly focused on AI risk.

But most impacts will be less tangible. “Different pockets of people interested in different things have been working with SPARC’s resources, and they’re forming a lot of social groups,” Zhang explains. “It’s like a bunch of little sparks and we don’t quite know what they’ll become, but I’m pretty excited about next five years.”

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.

ICRAC Open Letter Opposes Google’s Involvement With Military

From improving medicine to better search engines to assistants that help ease busy schedules, artificial intelligence is already proving a boon to society. But just as it can be designed to help, it can be designed to harm and even to kill.

Military uses of AI can also run the gamut from programs that could help improve food distribution logistics to weapons that can identify and assassinate targets without input from humans. Because AI programs can have these dual uses, it’s difficult for companies that do not want their technology to cause harm to work with militaries: it’s not currently possible for a company to ensure that an AI program it builds to help the military solve a benign problem won’t later be repurposed to take human lives.

So when employees at Google learned earlier this year about the company’s involvement in the Pentagon’s Project Maven, they were upset. Though Google argues that their work on Project Maven only assisted the U.S. military with image recognition tools from drone footage, many suggest that this technology could later be used for harm. In response, over 3,000 employees signed an open letter saying they did not want their work to be used to kill.

And it isn’t just Google’s employees who are concerned.

Earlier this week, the International Committee for Robot Arms Control released an open letter signed by hundreds of academics calling on Google’s leadership to withdraw from the “business of war.” The letter, which is addressed to Google’s leadership, responds to the growing criticism of Google’s participation in the Pentagon’s program, Project Maven.

The letter states, “we write in solidarity with the 3100+ Google employees, joined by other technology workers, who oppose Google’s participation in Project Maven.” It goes on to remind Google leadership to be cognizant of the incredible responsibility the company has for safeguarding the data it’s collected from its users, as well as its famous motto, “Don’t Be Evil.”

Specifically, the letter calls on Google to:

  • “Terminate its Project Maven contract with the DoD.
  • “Commit not to develop military technologies, nor to allow the personal data it has collected to be used for military operations.
  • “Pledge to neither participate in nor support the development, manufacture, trade or use of autonomous weapons; and to support efforts to ban autonomous weapons.”

Lucy Suchman, one of the letter’s authors, explained part of her motivation for her involvement:

“For me the greatest concern is that this effort will lead to further reliance on profiling and guilt by association in the US drone surveillance program, as the only way to generate signal out of the noise of massive data collection. There are already serious questions about the legality of targeted killing, and automating it further will only make it less accountable.”

The letter was released the same week that a small group of Google employees made news for resigning in protest against Project Maven. It also comes barely a month after a successful boycott by academic researchers against KAIST’s autonomous weapons effort.

In addition, last month the United Nations held their most recent meeting to consider a ban on lethal autonomous weapons. 26 countries, including China, have now said they would support some sort of official ban on these weapons.

In response to the number of signatories the open letter has received, Suchman added, “This is clearly an issue that strikes a chord for many researchers who’ve been tracking the incorporation of AI and robotics into military systems.”

If you want to add your name to the letter, you can do so here.

Lethal Autonomous Weapons: An Update from the United Nations

Earlier this month, the United Nations Convention on Conventional Weapons (UN CCW) Group of Governmental Experts met in Geneva to discuss the future of lethal autonomous weapons systems. But before we get to that, here’s a quick recap of everything that’s happened in the last six months.

 

Slaughterbots and Boycotts

Since its release in November 2017, the video Slaughterbots has been seen approximately 60 million times and has been featured in hundreds of news articles around the world. The video coincided with the UN CCW Group of Governmental Experts’ first meeting in Geneva to discuss a ban on lethal autonomous weapons, as well as the release of open letters from AI researchers in Australia, Canada, Belgium, and other countries urging their heads of state to support an international ban on lethal autonomous weapons.

Over the last two months, autonomous weapons regained the international spotlight. In March, after learning that the Korea Advanced Institute of Science and Technology (KAIST) planned to open an AI weapons lab in collaboration with a major arms company, AI researcher Toby Walsh led an academic boycott of the university. Over 50 of the world’s leading AI and robotics researchers from 30 countries joined the boycott, and in less than a week, KAIST agreed to “not conduct any research activities counter to human dignity including autonomous weapons lacking meaningful human control.” The boycott was covered by CNN and The Guardian.

Additionally, over 3,100 Google employees, including dozens of senior engineers, signed a letter in early April protesting the company’s involvement in a Pentagon program called “Project Maven,” which uses AI to analyze drone imaging. Employees worried that this technology could be repurposed to also operate drones or launch weapons. Citing their “Don’t Be Evil” motto, the employees asked to cancel the project and not to become involved in the “business of war.”

 

The UN CCW meets again…

In the wake of this growing pressure, 82 countries in the UN CCW met again from April 9-13 to consider a ban on lethal autonomous weapons. Throughout the week, states and civil society representatives discussed “meaningful human control” and whether they should just be concerned about “lethal” autonomous weapons, or all autonomous weapons generally. Here is a brief recap of the meeting’s progress:

  • The group of nations that explicitly endorse the call to ban LAWS expanded to 26 (with China, Austria, Colombia, and Djibouti joining during the CCW meeting).
  • However, five states explicitly rejected moving to negotiate new international law on fully autonomous weapons: France, Israel, Russia, United Kingdom, and United States.
  • Nearly every nation agreed that it is important to retain human control over autonomous weapons, despite disagreements surrounding the definition of “meaningful human control.”
  • Throughout the discussion, states focused on complying with International Humanitarian Law (IHL). Human Rights Watch argued that there already is precedent in international law and disarmament law for banning weapons without human control.
  • Many countries submitted working papers to inform the discussions, including China and the United States.
  • Although states couldn’t reach an agreement during the meeting, momentum is growing towards solidifying a framework for defining lethal autonomous weapons.

You can find written and video recaps from each day of the UN CCW meeting here, written by Reaching Critical Will.

The UN CCW is slated to resume discussions in August 2018; however, given the speed with which autonomous weaponry is advancing, many advocates worry that the process is moving too slowly.

 

What can you do?

If you work in the tech industry, consider signing the Tech Workers Coalition open letter, which calls on Google, Amazon and Microsoft to stay out of the business of war. And if you’d like to support the fight against LAWS, we recommend donating to the Campaign to Stop Killer Robots. This organization, which is not affiliated with FLI, has done amazing work over the past few years to lead efforts around the world to prevent the development of lethal autonomous weapons. Please consider donating here.

 

Learn more…

If you want to learn more about the technological, political, and social developments of autonomous weapons, check out the Research & Reports page of our Autonomous Weapons website. You can find relevant news stories and updates at @AIweapons on Twitter and autonomousweapons on Facebook.

Podcast: Inverse Reinforcement Learning and Inferring Human Preferences with Dylan Hadfield-Menell

Inverse Reinforcement Learning and Inferring Human Preferences is the first podcast in the new AI Alignment series, hosted by Lucas Perry. This series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across a variety of areas, such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, or your preferred podcast site/application.

If you’re interested in exploring the interdisciplinary nature of AI alignment, we suggest you take a look here at a preliminary landscape which begins to map this space.

In this podcast, Lucas spoke with Dylan Hadfield-Menell, a fifth-year Ph.D. student at UC Berkeley. Dylan’s research focuses on the value alignment problem in artificial intelligence. He is ultimately concerned with designing algorithms that can learn about and pursue the intended goal of their users, designers, and society in general. His recent work primarily focuses on algorithms for human-robot interaction with unknown preferences and reliability engineering for learning systems.

Topics discussed in this episode include:

  • Inverse reinforcement learning
  • Goodhart’s Law and its relation to value alignment
  • Corrigibility and obedience in AI systems
  • IRL and the evolution of human values
  • Ethics and moral psychology in AI alignment
  • Human preference aggregation
  • The future of IRL
In this interview we discuss a few of Dylan’s papers and ideas contained in them. You can find them here: Inverse Reward Design, The Off-Switch Game, Should Robots be Obedient, and Cooperative Inverse Reinforcement Learning.  You can hear about these papers above or read the transcript below.

 

Lucas: Welcome back to the Future of Life Institute Podcast. I’m Lucas Perry and I work on AI risk and nuclear weapons risk related projects at FLI. Today, we’re kicking off a new series where we will be having conversations with technical and nontechnical researchers focused on AI safety and the value alignment problem. Broadly, we will focus on the interdisciplinary nature of the project of eventually creating value-aligned AI. What value-aligned exactly entails is an open question that is part of the conversation.

In general, this series covers the social, political, ethical, and technical issues and questions surrounding the creation of beneficial AI. We’ll be speaking with experts from a large variety of domains, and hope that you’ll join in the conversations. If this seems interesting to you, make sure to follow us on SoundCloud, or subscribe to us on YouTube for more similar content.

Today, we’ll be speaking with Dylan Hadfield-Menell. Dylan is a fifth-year PhD student at UC Berkeley, advised by Anca Dragan, Pieter Abbeel, and Stuart Russell. His research focuses on the value alignment problem in artificial intelligence. With that, I give you Dylan. Hey, Dylan. Thanks so much for coming on the podcast.

Dylan: Thanks for having me. It’s a pleasure to be here.

Lucas: I guess, we can start off, if you can tell me a little bit more about your work over the past years. How have your interests and projects evolved? How has that led you to where you are today?

Dylan: Well, I started off towards the end of undergrad and beginning of my PhD working in robotics and hierarchical robotics. Towards the end of my first year, my advisor came back from a sabbatical, and started talking about the value alignment problem and existential risk issues related to AI. At that point, I started thinking about questions about misaligned objectives, value alignment, and generally how we get the correct preferences and objectives into AI systems. About a year after that, I decided to make this my central research focus. Then, for the past three years, that’s been most of what I’ve been thinking about.

Lucas: Cool. That seems like you had an original path where you’re working on practical robotics. Then, you shifted more into value alignment and AI safety efforts.

Dylan: Yeah, that’s right.

Lucas: Before we go ahead and jump into your specific work, it’d be great if we could go ahead and define what inverse reinforcement learning exactly is. For me, it seems that inverse reinforcement learning, at least, from the view, I guess, of technical AI safety researchers is it’s viewed as an empirical means of conquering descriptive ethics where by like we’re able to give a clear descriptive account of what any given agents’ preferences and values are at any given time is. Is that a fair characterization?

Dylan: That’s one way to characterize it. Another way to think about it, which is a usual perspective for me, sometimes, is to think of inverse reinforcement learning as a way of doing behavior modeling that has certain types of generalization properties.

Any time you’re learning in any machine learning context, there’s always going to be a bias that controls how you generalize to new information. Inverse reinforcement learning and preference learning, to some extent, is a bias in behavior modeling, which is to say that we should model this agent as accomplishing a goal, as satisfying a set of preferences. That leads to certain types of generalization properties in new environments. For me, inverse reinforcement learning is building this agent-based assumption into behavior modeling.
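
As a toy illustration of that agent-based assumption, here is a rough sketch with made-up options, reward hypotheses, and numbers: rather than fitting behavior directly, we assume noisily rational (Boltzmann) choices under some reward function and infer which candidate reward best explains the observations. Real IRL works over sequential decision problems rather than one-shot choices, so this is only the flavor of the idea.

```python
import math

# Hypothetical sketch of the "agent-based assumption": infer a reward function
# from observed choices by assuming the chooser is noisily rational. Every
# name and number here is made up for illustration.

options = ["apple", "cake", "salad"]
reward_hypotheses = {
    "health-focused": {"apple": 1.0, "cake": -1.0, "salad": 1.5},
    "taste-focused":  {"apple": 0.5, "cake": 2.0,  "salad": -0.5},
}
observed_choices = ["apple", "salad", "apple"]

def choice_probability(choice, reward, beta=2.0):
    """P(choice | reward) under a Boltzmann (softmax) choice model."""
    weights = {o: math.exp(beta * reward[o]) for o in options}
    return weights[choice] / sum(weights.values())

# Unnormalized posterior over reward hypotheses, assuming a uniform prior.
posterior = {}
for name, reward in reward_hypotheses.items():
    likelihood = 1.0
    for choice in observed_choices:
        likelihood *= choice_probability(choice, reward)
    posterior[name] = likelihood

total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}
print(posterior)  # most of the probability mass lands on "health-focused"
```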

Lucas: Given that, I’d like to dive more into the specific work that you’re working on and going to some summaries of your findings and your research that you’ve been up to. Given this interest that you’ve been developing in value alignment, and human preference aggregation, and AI systems learning human preferences, what are the main approaches that you’ve been working on?

Dylan: I think the first thing that really Stuart Russell and I started thinking about was trying to understand theoretically, what is a reasonable goal to shoot for, and what does it mean to do a good job of value alignment. To us, it feels like issues with misspecified objectives, at least, in some ways, are a bug in the theory.

All of the math around artificial intelligence, for example, Markov decision processes, which is the central mathematical model we use for decision making over time, starts with an exogenously defined objective or reward function. We think that, mathematically, that was a fine thing to do in order to make progress, but it’s an assumption that really has put blinders on the field about the importance of getting the right objective down.

I think the first thing that we sought to do was to understand what a system or setup for AI that does the right thing, at least in theory, would look like. What’s something that, if we were able to implement it, we think could actually work in the real world with people? It was that kind of thinking that led us to propose cooperative inverse reinforcement learning, which was our attempt to formalize the interaction whereby you communicate an objective to the system.

The main thing that we focused on was including within the theory a representation of the fact that the true objective’s unknown and unobserved, and that it needs to be arrived at through observations from a person. Then, we’ve been trying to investigate the theoretical implications of this modeling shift.

In the initial paper that we did, which is titled Cooperative Inverse Reinforcement Learning, what we looked at is how this formulation is actually different from a standard environment model in AI. In particular, the way that it’s different is that there’s strategic interaction on behalf of the person. The way that you observe what you’re supposed to be doing is mediated by a person who may actually be trying to teach or trying to communicate appropriately. What we showed is that modeling this communicative component can actually be hugely important and lead to much faster learning behavior.

In our subsequent work, what we’ve looked at is taking this formal model in theory and trying to apply it to different situations. There are two really important pieces of work that I like here that we did. One was to take that theory and use it to explicitly analyze a simple model of an existential risk setting. This was a paper titled The Off-Switch Game that we published at IJCAI last summer. What it was, was working through a formal model of a corrigibility problem within a CIRL (cooperative inverse reinforcement learning) framework. It shows the utility of constructing this type of game in the sense that we get some interesting predictions and results.

The first one we get is that there are some nice simple necessary conditions for the system to want to let the person turn it off, which is that the robot, the AI system needs to have uncertainty about its true objective, which is to say that it needs to have within its belief the possibility that it might be wrong. Then, all it needs to do is believe that the person it’s interacting with is a perfectly rational individual. If that’s true, you’d get a guarantee that this robot always lets the person switch it off.

Now, that’s good because, in my mind, it’s an example of a place where, at least, in theory, it solves the problem. This gives us a way that theoretically, we could build corrigible systems. Now, it’s still making a very, very strong assumption, which is that it’s okay to model the human as being optimal or rational. I think if you look at real people, that’s just not a fair assumption to make for a whole host of reasons.

The next thing we did in that paper is we looked at this model. What we realized is that adding in a small amount of irrationality breaks this requirement. It means that some things might actually go wrong. The final thing we did in the paper was to look at the consequences of either overestimating or underestimating human rationality. The argument that we made is that there’s a trade-off: assuming the person is more rational lets you get more information from their behavior, and thus learn more and in principle help them more, but if you assume that they’re too rational, then this actually can lead to quite bad behavior.
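
Here is a small numeric sketch of that intuition; it is not the paper’s exact model, and the belief distributions and rationality parameters below are illustrative assumptions. The robot compares acting immediately against deferring to a human who may switch it off, with the human modeled as Boltzmann-rational:

```python
import math, random

# Toy sketch of the Off-Switch Game intuition, not the paper's exact model.
# The robot holds a belief over the true utility U of its proposed action.
# It can act immediately (value E[U]) or defer to a human who allows the
# action with probability sigmoid(beta * U) and otherwise switches the robot
# off (value 0). Large beta approximates a fully rational human.

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def value_of_acting(belief):
    return sum(belief) / len(belief)

def value_of_deferring(belief, beta):
    return sum(sigmoid(beta * u) * u for u in belief) / len(belief)

random.seed(0)
uncertain_robot = [random.gauss(0.0, 1.0) for _ in range(20000)]
confident_robot = [random.gauss(1.0, 0.5) for _ in range(20000)]

for name, belief in [("uncertain robot", uncertain_robot),
                     ("confident robot", confident_robot)]:
    print(name,
          "| act now: %.2f" % value_of_acting(belief),
          "| defer, rational human: %.2f" % value_of_deferring(belief, beta=10.0),
          "| defer, noisy human: %.2f" % value_of_deferring(belief, beta=0.5))

# With genuine uncertainty, deferring is at least as good as acting, so the
# robot keeps its incentive to let the person decide. Once the robot is
# confident and the human is noisy, deferring looks worse than acting, which
# is the sense in which irrationality breaks the corrigibility guarantee.
```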

There’s a sweet spot that you want to aim for, which is to maybe try to underestimate how rational people are, but you obviously don’t want to get it totally wrong. We followed up on that idea in a paper with Smitha Milli as the first author, titled Should Robots Be Obedient?, which tried to get at a little bit more of this trade-off between maintaining control over a system and the amount of value that it can generate for you.

We looked at the implication that as robot systems interact with people over time, you expect them to learn more about what people want. If you get very confident about what someone wants, and you think they might be irrational, the math in the Off-Switch paper predicts that you should try to take control away from them. This means that if your system is learning over time, you expect that even if it is initially open to human control and oversight, it may lose that incentive over time. In fact, you can predict that it should lose that incentive over time.

In Should Robots be Obedient, we modeled that property and looked at some consequences of it. We do find that you got a basic confirmation of this hypothesis, which is that systems that maintain human control and oversight have less value that they can achieve in theory. We also looked at what happens when you have the wrong model. If the AI system has a prior that the human cares about a small number of things in the world, let’s say, then it statistically gets overconfident in its estimates of what people care about, and disobeys the person more often than it should.

Arguably, when we say we want to be able to turn the system off, it’s less a statement about what we want to do in theory or the property of the optimal robot behavior we want, and more of a reflection of the idea that we believe that under almost any realistic situation, we’re probably not going to be able to fully explain all of the relevant variables that we care about.

If you’re giving your robot an objective defined over a subset of the things you care about, you should actually be very focused on having it listen to you, more so than just optimizing for its estimates of value. I think that provides, actually, a pretty strong theoretical argument for why corrigibility is a desirable property in systems, even though, at least at face value, it should decrease the amount of utility those systems can generate for people.

The final piece of work that I think I would talk about here is our NIPS paper from December, which is titled Inverse Reward Design. That was taking cooperative inverse reinforcement learning and pushing it in the other direction. Instead of using it to theoretically analyze very, very powerful systems, we can also use it to try to build tools that are more robust to mistakes that designers may make, and to start to build initial notions of value alignment and value alignment strategies into the current mechanisms we use to program AI systems.

What that work looked at was understanding the uncertainty that’s inherent in an objective specification. In the initial cooperative inverse reinforcement learning paper and the Off-Switch Game, we said that AI systems should be uncertain about their objective, and they should be designed in a way that is sensitive to that uncertainty.

This paper was about trying to understand what a useful way to be uncertain about the objective is. The main idea behind it was that we should be thinking about the environments that the system designer had in mind. We use an example of a 2D robot navigating in the world, and the system designer is thinking about this robot navigating where there are three types of terrain. There’s grass, there’s dirt, and there’s gold. You can give your robot an objective, a utility function defined over being in those different types of terrain, that incentivizes it to go and get the gold, and stay on the dirt where possible, but to take shortcuts across the grass when it’s high value.

Now, when that robot goes out into the world, there are going to be new types of terrain, types of terrain the designer didn’t anticipate. What we did in this paper was to build an uncertainty model that allows the robot to determine when it should be uncertain about the quality of its reward function. How can we figure out when the reward function that a system designer builds into an AI is ill-adapted to the current situation? You can think of this as a way of trying to build in some mitigation to Goodhart’s law.
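
A rough sketch of the resulting risk-averse behavior, using made-up terrain names and weights rather than the paper’s actual formulation: the proxy reward pins down the terrain types the designer thought about, any value remains possible for a terrain type the designer never saw, and the agent evaluates plans by their worst case over that hypothesis set.

```python
# Hypothetical sketch of the risk-averse planning idea behind Inverse Reward
# Design. The proxy reward gives the unseen terrain ("lava") a default weight
# of 0 because the designer never thought about it, so the agent keeps a set
# of reward hypotheses for it and scores paths by their worst case. All
# names and numbers are illustrative, not taken from the paper.

proxy_reward = {"dirt": -1.0, "grass": -2.0, "gold": 10.0, "lava": 0.0}

reward_hypotheses = [
    dict(proxy_reward, lava=0.0),
    dict(proxy_reward, lava=-50.0),   # lava might be catastrophic
    dict(proxy_reward, lava=1.0),     # or mildly pleasant, who knows
]

def path_value(path, reward):
    return sum(reward[cell] for cell in path)

def risk_averse_value(path, hypotheses):
    return min(path_value(path, r) for r in hypotheses)

shortcut_through_lava = ["dirt", "lava", "gold"]
long_way_on_dirt = ["dirt", "dirt", "dirt", "gold"]

for path in (shortcut_through_lava, long_way_on_dirt):
    print(path,
          "| literal proxy:", path_value(path, proxy_reward),
          "| risk-averse:", risk_averse_value(path, reward_hypotheses))

# The literal proxy prefers the lava shortcut (9 vs. 7); the risk-averse
# evaluation avoids the terrain the designer never specified (-41 vs. 7).
```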

Lucas: Would you like to take a second to unpack what Goodhart’s law is?

Dylan: Sure. Goodhart’s law is an old idea in social science that actually goes back to before Goodhart. I would say that in economics, there’s a general idea of the principal agent problem, which dates back to the 1970s, as I understand it, and basically looks at the problem of specifying incentives for humans. How should you create contracts? How do you create incentives, so that another person, say, an employee, helps earn you value?

Goodhart’s law is a very nice way of summarizing a lot of those results, which is to say that once a metric becomes an objective, it ceases to become a good metric. You can have properties of the world, which correlate well with what you want, but optimizing for them actually leads to something quite, quite different than what you’re looking for.

Lucas: Right. Like if you are optimizing for test scores, then you’re not actually going to end up optimizing for intelligence, which is what you wanted in the first place?

Dylan: Exactly. Even though test scores, when you weren’t optimizing for them, were actually a perfectly good measure of intelligence. I mean, not perfectly good, but an informative measure of intelligence. Goodhart’s law, arguably, is a pretty bleak perspective. If you take it seriously, and you think that we’re going to build very powerful systems that are going to be programmed directly through an objective in this manner, Goodhart’s law should be pretty problematic, because any objective that you can imagine programming directly into your system is going to be something correlated with what you really want rather than what you really want. You should expect that that will likely be the case.
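
A quick simulation, with entirely made-up numbers, makes the effect concrete: a proxy that merely correlates with the true objective tracks it fine on average, but the single item that scores best on the proxy owes much of its score to noise, so its true value falls well short of what the proxy suggests.

```python
import random

# Toy illustration of Goodhart's law: optimize hard on a correlated proxy and
# the correlation stops paying off. All numbers are made up.

random.seed(1)
population = []
for _ in range(10000):
    true_value = random.gauss(0, 1)            # what we actually care about
    proxy = true_value + random.gauss(0, 1)    # a correlated, imperfect measure
    population.append((true_value, proxy))

# Pick the single item that looks best on the proxy.
best_true, best_proxy = max(population, key=lambda item: item[1])
print("proxy score of the selected item: %.2f" % best_proxy)
print("true value of the selected item:  %.2f" % best_true)
# The winner's proxy score is roughly twice its true value: selecting on the
# proxy selects for the noise as much as for the thing we actually wanted.
```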

Lucas: Right. Is it just simply too hard or too unlikely that we’re able to sufficiently specify what exactly that we want that we’ll just end up using some other metrics that if you optimize too hard for them, it ends up messing with a bunch of other things that we care about?

Dylan: Yeah. I mean, I think there’s some real questions about, what is it we even mean… Well, what are we even trying to accomplish? What should we try to program into systems? Philosophers have been trying to figure out those types of questions for ages. For me, as someone who takes a more empirical slant on these things, I think about the fact that the objectives that we see within our individual lives are so heavily shaped by our environments. Which types of signals we respond to and adapt to has heavily adapted itself to the types of environments we find ourselves in.

We just have so many examples of objectives not being the correct thing. I mean, effectively, all you could have is correlations. The fact that wireheading is possible is maybe some of the strongest evidence for Goodhart’s law being really a fundamental property of learning systems and optimizing systems in the real world.

Lucas: There are certain agential characteristics and properties, which we would like to have in our AI systems, like them being-

Dylan: Agential?

Lucas: Yeah. Corrigibility is a characteristic, which you’re doing research on and trying to understand better. Same with obedience. It seems like there’s a trade off here where if a system is too corrigible or it’s too obedient, then you lose its ability to really maximize different objective functions, correct?

Dylan: Yes, exactly. I think identifying that trade off is one of the things I’m most proud of about some of the work we’ve done so far.

Lucas: Given AI safety and really big risks that can come about from AI, in the short, to medium, and long term, before we really have AI safety figured out, is it really possible for systems to be too obedient, or too corrigible, or too docile? How do we navigate this space and find sweet spots?

Dylan: I think it’s definitely possible for systems to be too corrigible or too obedient. It’s just that the failure mode for that doesn’t seem that bad. If you think about this-

Lucas: Right.

Dylan: … it’s like Clippy. Clippy was asking for human-

Lucas: Would you like to unpack what Clippy is first?

Dylan: Sure, yeah. Clippy is an example of an assistant that Microsoft created in the ’90s. It was this little paperclip that would show up in Microsoft Word. It would frequently suggest that you were trying to write a letter and ask for different ways in which it could help.

Now, on one hand, that system was very corrigible and obedient in the sense that it would ask you whether or not you wanted its help all the time. If you said no, it would always go away. It was super annoying because it would always ask you if you wanted help. The false positive rate was just far too high, to the point where the system became a joke in computer science and AI circles about what you don’t want to be doing. I think systems can be too obedient or too sensitive to human intervention and oversight, in the sense that too much of that just reduces the value of the system.

Lucas: Right, for sure. On one hand, when we’re talking about existential risks or even a paperclip maximizer, then it would seem, like you said, like the failure mode of just being too annoying and checking in with us too much seems like not such a bad thing given existential risk territory.

Dylan: I think if you’re thinking about it in those terms, yes. I think if you’re thinking about it from the standpoint of, “I want to sell a paperclip maximizer to someone else,” then it becomes a little less clear, I think, especially, when the risks of paperclip maximizers are much harder to measure. I’m not saying that it’s the right decision from a global altruistic standpoint to be making that trade off, but I think it’s also true that just if we think about the requirements of market dynamics, it is true that AI systems can be too corrigible for the market. That is a huge failure mode that AI systems run into, and it’s one we should expect the producers of AI systems to be responsive to.

Lucas: Right. Given all these different … Is there anything else you wanted to touch on there?

Dylan: Well, I had another example of systems are too corrigible-

Lucas: Sure.

Dylan: … which is, do you remember Microsoft’s Tay?

Lucas: No, I do not.

Dylan: This is a chatbot that Microsoft released. They trained it based off of tweets. It was a tweet bot. They trained it based on things that were tweeted at it. I forget if it was a nearest-neighbors lookup or if it was just doing a neural method, and overfitting, and memorizing parts of the training set. At some point, 4chan realized that the AI system, that Tay, was very suggestible. They basically created an army to radicalize Tay. They succeeded.

Lucas: Yeah, I remember this.

Dylan: I think you could also think of that as being the other axis of too corrigible or too responsive to human input. The first axis I was talking about is the failures of being too corrigible from an economic standpoint, but there are also the failures of being too corrigible in a multi-agent mechanism design setting, where, I believe, those types of properties in a system also open them up to more misuse.

If we think of AI, cooperative inverse reinforcement learning and the models we’ve been talking about so far exist in what I would call the one-robot-one-human model of the world. Generally, you could think of extensions of this with N humans and M robots. The variants of what you would have there, I think, lead to different theoretical implications.

If we think of just two humans, N=2, and one robot, M=1, and suppose that one of the humans is the system designer and the other one is the user, there is this trade-off between how much control the system designer has over the future behavior of the system and how responsive and corrigible it is to the user in particular. Trading off between those two, I think, is a really interesting ethical question that comes up when you start to think about misuse.

Lucas: Going forward, as we’re developing these systems and trying to make them more fully realized in a world where the number of people will equal something like seven or eight billion, how do we navigate this space where we’re trying to hit a sweet spot where the system is corrigible in the right ways, to the right degree and level, and to the right people, and is obedient to the right people, and isn’t suggestible by the wrong people? Or does that just enter a territory of so many political, social, and ethical questions that it will take years to think about and work on?

Dylan: Yeah, I think it’s closer to the second one. I’m sure that I don’t know the answers here. From my standpoint, I’m still trying to get a good grasp on what is possible in the one-robot-one-person case. I think that when you have … Yeah, when you … Oh man. I guess, it’s so hard to think about that problem because it’s just very unclear what’s even correct or right. Ethically, you want to be careful about imposing your beliefs and ideas too strongly on to a problem because you are shaping that.

At the same time, these are real challenges that are going to exist. We already see them in real life. If we look at the YouTube recommender stuff that was just happening, arguably, that’s a misspecified objective. To get a little bit of background here, this is largely based off of a recent New York Times opinion piece that was looking at the recommendation engine for YouTube and pointing out that it has a bias towards recommending radical content, either fake news or Islamist videos.

If you dig into why that was occurring, a lot of it is because… what are they doing? They’re optimizing for engagement. The process of online radicalization looks super engaging. Now, we can think about where that comes up. Well, that issue gets introduced in a whole bunch of places. A big piece of it is that there is this adversarial dynamic to the world. There are users generating content in order to be outrageous and enraging, because they discovered that it gets more feedback and more responses. You need to design a system that’s robust to that strategic property of the world. At the same time, you can understand why YouTube was very, very hesitant to be taking actions that would look like censorship.

Lucas: Right. I guess, just coming back to this idea of the world having lots of adversarial agents in it, human beings are like general intelligences who have reached some level of corrigibility and obedience that works kind of well in the world amongst a bunch of other human beings. That was developed through evolution. Are there potentially techniques for developing the right sorts of corrigibility and obedience in machine learning and AI systems through stages of evolution and running environments like that?

Dylan: I think that’s a possibility. I would say, one … I have a couple of thoughts related to that. The first one is I would actually challenge a little bit of your point of modeling people as general intelligences, mainly in the sense that when we talk about artificial general intelligence, we have something in mind. It’s often a shorthand in these discussions for a perfectly rational, Bayesian-optimal actor.

Lucas: Right. What does that mean? Just unpack that a little bit.

Dylan: What that means is a system that is taking advantage of all of the information that is currently available to it in order to pick actions that optimize expected utility. When we say perfectly, we mean a system that is doing that as well as possible. It’s that modeling assumption that I think sits at the heart of a lot of concerns about existential risk. I definitely think that’s a good model to consider, but there’s also the concern that it might be misleading in some ways, and that it might not actually be a good model of people and how they act in general.
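To make that definition concrete, here is a minimal sketch of expected utility maximization in Python; the actions, outcome distribution, and utility numbers are purely illustrative assumptions, not anything from the interview.

# Minimal sketch of a "perfectly rational" actor: pick the action whose
# probability-weighted utility over outcomes is highest. All names and
# numbers here are illustrative placeholders.

def expected_utility(action, outcome_probs, utility):
    """Sum utility over possible outcomes, weighted by their probabilities."""
    return sum(p * utility(outcome) for outcome, p in outcome_probs(action).items())

def choose_action(actions, outcome_probs, utility):
    """Return the action with maximal expected utility."""
    return max(actions, key=lambda a: expected_utility(a, outcome_probs, utility))

# Toy decision: a risky gamble versus a sure small gain.
actions = ["safe", "risky"]
outcome_probs = lambda a: {"win": 0.5, "lose": 0.5} if a == "risky" else {"small_win": 1.0}
utility = lambda o: {"win": 10, "lose": -20, "small_win": 1}[o]

print(choose_action(actions, outcome_probs, utility))  # prints "safe"

The existential-risk concern Dylan mentions comes from imagining a system that performs this maximization extremely well over a very rich action space; the worry that the model may be misleading comes from the fact that humans visibly do not behave this way.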

One way to look at it would be to say that there’s something about the incentive structure around humans and in our societies that is developed and adapted that creates the incentives for us to be corrigible. Thus, a good research goal of AI is to figure out what those incentives are and to replicate them in AI systems.

Another way to look at it is that people are intelligent, but not necessarily in the ways that economics models us as intelligent: there are properties of our behavior, desirable properties, that don’t directly derive from expected utility maximization; or if they do, they derive from a very, very diffuse form of expected utility maximization. This is the perspective that says that people on their own are not necessarily what human evolution is optimizing for, but people are a tool along the way.

I think it’s an interesting perspective to take. What I would say is that in order for societies to work, we have to cooperate. That cooperation was a crucial evolutionary bottleneck, if you will. One of the really, really important things it did was force us to develop the parent-child relationship equilibrium that we currently live in. That’s a process whereby we communicate our values, whereby we train people to think that certain things are okay or not, and whereby we inculcate certain behaviors in the next generation. I think it’s that process more than anything else that we really, really want in an AI system and in powerful AI systems.

Now, the thing is, it’s really, really important that that’s there, because if you don’t have those cognitive abilities to understand that you’re causing pain and to fundamentally decide that that’s a bad idea, to have a desire to cooperate, to buy into the different coordination and normative mechanisms that human society uses, then society just doesn’t function. A hunter-gatherer tribe of self-interested sociopaths probably doesn’t last for very long.

What this means is that our ability to coordinate and cooperate co-evolved and was co-adapted alongside our intelligence. I think that evolutionary pressure and bottleneck was really important to getting us to the type of intelligence that we are now. It’s not a pressure that AI is necessarily subjected to. I think maybe that is one way to phrase the concern, I’d say.

When I look to evolutionary systems and where the incentives for corrigibility, cooperation, and interaction come from, it’s largely about the processes whereby people are less like general intelligences in some ways. Evolution allowed us to become smart in some ways and restricted us in others based on the imperatives of group coordination and interaction. I think that a lot of our intelligence in practice is about reasoning about group interaction and what groups think is okay or not. That’s a part of the developmental process that we need to replicate in AI just as much as spatial reasoning or vision.

Lucas: Cool. I guess I just want to touch base on this before we move on. Are there certain assumptions about the kinds of agents that humans are, ideas about us being utility maximizers in some sense, that you commonly see people hold but that are misconceptions about people and how they operate differently from AI?

Dylan: Well, I think that that’s the whole field of behavioral economics in a lot of ways. I could point to examples of people being irrational. Then there are all of the examples of people being more than just self-interested. There are ways in which we seem to be risk-seeking that look irrational from an individual perspective, but that you could argue may be rational from a group evolutionary perspective.

I mean, things like overeating. That’s not exactly the same type of rationality, but it is an example of us becoming ill-adapted to our environments and showing the extent to which we’re not capable of changing, or at least the extent to which it may be hard to. Yeah, I think, in some ways, one story that I tell about AI risk is that back at the start of the AI field, we were looking around and saying, “We want to create something intelligent.” Intuitively, we all know what that means, but we need a formal characterization of it. The formal characterization that we turned to was, basically, the theories of rationality developed in economics.

Although those theories turned out to be, except in some settings, not great descriptors of human behavior, they were quite useful as a guide for building systems that accomplish goals. I think that part of what we need to do as a field is reassess where we’re going and think about whether or not building something like that perfectly rational actor is actually a desirable end goal. I mean, there’s a sense in which it is. I would like an all-powerful, perfectly aligned genie to help me do what I want in life.

You might think that if the odds of getting that wrong are too high, maybe you would do better shooting for something that doesn’t quite achieve that ultimate goal, but that you can get to with pretty high reliability. This may be a setting where “shoot for the moon, and if you miss you’ll land among the stars” is just a horribly misleading perspective.

Lucas: Shoot for the moon and you might get a hellscape universe, but if you shoot for the clouds, it might end up pretty okay.

Dylan: Yeah. We could iterate on the sound bite, but something like that. That’s where I stand in my thinking here.

Lucas: We’ve talked about a few different approaches that you’ve been working on over the past few years. What do you view as the main limitations of such approaches currently? Mostly, you’re only thinking about one-machine, one-human systems or environments. What are the biggest obstacles that you’re facing right now in inferring and learning human preferences?

Dylan: Well, I think the first thing is it’s just an incredibly difficult inference problem. It’s a really difficult inference problem to imagine running at scale with explicit inference mechanisms. One thing you can do is design a system that explicitly tracks a belief about someone’s preferences, and then acts and responds to that. Those are systems that you could try to prove theorems about. They’re very hard to build. They can be difficult to make work correctly.

In contrast, you can create systems that have incentives to construct beliefs in order to accomplish their goals. It’s easier to imagine building those systems and having them work at scale, but it’s much, much harder to understand how you would be confident in those systems being well aligned.
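As a rough illustration of the first kind of system Dylan describes, one that explicitly tracks and updates a belief about someone’s preferences, here is a small Bayesian sketch; the two preference hypotheses and the noisy-choice likelihood are assumptions made purely for illustration.

# Sketch of a system that explicitly maintains a belief over what the user
# prefers and updates it as it observes their choices. The hypotheses and
# the noise model are illustrative assumptions.

prior = {"prefers_A": 0.5, "prefers_B": 0.5}

def likelihood(choice, hypothesis, noise=0.1):
    """Probability of the observed choice under a hypothesis, allowing for
    occasional 'mistakes' by the human."""
    preferred = "A" if hypothesis == "prefers_A" else "B"
    return 1 - noise if choice == preferred else noise

def update(belief, choice):
    """Bayesian update of the belief after seeing one choice."""
    unnormalized = {h: p * likelihood(choice, h) for h, p in belief.items()}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

belief = prior
for observed_choice in ["A", "A", "B", "A"]:
    belief = update(belief, observed_choice)

print(belief)  # probability mass shifts strongly toward "prefers_A"

Systems like this are the kind you can try to prove theorems about, precisely because the belief is an explicit object; the second kind of system Dylan mentions only has incentives to form such beliefs implicitly, which is why its alignment is harder to verify.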

I think that one of the biggest concerns I have, I mean, we’re still very far from many of these approaches being very practical, to be honest. The theory is still pretty unfounded. There’s still a lot of work to go to understand: what is the target we’re even shooting for? What does an aligned system even mean? My colleagues and I have spent an incredible amount of time trying to just understand what it means to be value-aligned if you are a suboptimal system.

There’s one example that I think about, which is, say, you’re cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it’s actually suggesting the wrong move or a bad move. Would you call that system value-aligned?

Lucas: No, I would not.

Dylan: I think most people wouldn’t. Now, what if I told you that that program was actually implemented as a search that uses the correct goal test? It actually turns out that if it’s within 10 steps of a winning play, it always finds that for you, but because of computational limitations, it usually doesn’t. Now, is the system value-aligned? I think it’s a little harder to tell here. What I do find is that when I tell people the story and I start off with the search algorithm with the correct goal test, they almost always say that it is value-aligned but stupid.

There’s an interesting thing going on here, which is that we’re not totally sure what the target we’re shooting for is. You can take this thought experiment and push it further. Suppose you’re doing that search, but now it’s a heuristic search that uses the correct goal test but has an adversarially chosen heuristic function. Would that be a value-aligned system? Again, I’m not sure. If the heuristic was adversarially chosen, I’d say probably not. If the heuristic just happened to be bad, then I’m not sure.

Lucas: Could you potentially unpack what it means for something to be adversarially chosen?

Dylan: Sure. Adversarially chosen in this case just means that there is some intelligent agent selecting the heuristic function, or that evaluation measurement, in a way that’s designed to maximally screw you up. Adversarial analysis is a really common technique used in cryptography, where we try to think of adversaries selecting inputs for computer systems that will cause them to malfunction. In this case, what this looks like is an adversarial algorithm that, at least on the surface, looks like it is trying to help you accomplish your objectives but is actually trying to fool you.

I’d say that, more generally, what this thought experiment helps me with is understanding that value alignment is actually quite a tricky and subjective concept. It’s actually quite hard to nail down in practice what it would require.
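Here is a tiny sketch of the assistant in the non-adversarial version of that thought experiment: a search with the correct goal test but a hard depth limit, so it finds a winning line whenever one is within reach and otherwise falls back to an arbitrary (often bad) suggestion. The game interface is a hypothetical stand-in, not a real chess engine, and the sketch ignores the opponent’s replies for brevity.

# Sketch of the "correct goal test, limited compute" move suggester from the
# thought experiment. `game` is a hypothetical interface with legal_moves(state),
# result(state, move), and is_win(state); opponent replies are ignored for brevity.

def suggest_move(game, state, depth=10):
    """Return a move leading to a win within `depth` plies if one exists;
    otherwise return an arbitrary legal move, which may well be bad."""
    def wins_within(s, d):
        if game.is_win(s):
            return True
        if d == 0:
            return False
        return any(wins_within(game.result(s, m), d - 1) for m in game.legal_moves(s))

    for move in game.legal_moves(state):
        if wins_within(game.result(state, move), depth - 1):
            return move                    # the correct goal test pays off here
    return game.legal_moves(state)[0]      # computational limit: arbitrary fallback

Whether to call a system like this value-aligned is exactly the puzzle: the goal test is right, but the suggestions are mostly unhelpful.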

Lucas: What sort of effort do you think needs to happen, and from whom, in order to specify what it really means for a system to be value-aligned, and to not just have a soft, squishy idea of what that means but to have it really formally mapped out, so it can be implemented in machine systems?

Dylan: I think we need more people working on technical AI safety research. I think to some extent it may always be something that’s a little ill-defined and squishy. Generally, I think it goes to the point of needing good people in AI willing to do this squishier, less concrete work that really gets at it. I think value alignment is going to be something that’s a little bit more like “I know it when I see it.” As a field, we need to be moving towards a goal of AI systems where alignment is the end goal, whatever that means.

I’d like to move away from artificial intelligence where we think of intelligence as an ability to solve puzzles, towards artificial aligning agents where the goal is to build systems that are actually accomplishing goals on your behalf. I think the types of behaviors and strategies that arise from taking that perspective are qualitatively quite different from the strategies of pure puzzle solving on a well-specified objective.

Lucas: All this work we’ve been discussing is largely at a theoretic and meta level. At this point, is this the main research that we should be doing, or is there any space for research into what specifically might be implementable today?

Dylan: I don’t think that’s the only work that needs to be done. For me, it’s a really important type of work that I’d like to see more of. I think a lot of important work is about understanding how to build these systems in practice and to think hard about designing AI systems with meaningful human oversight.

I’m a big believer in the idea that in AI safety, the distinction between short-term and long-term issues is not really that large, and that there are synergies between the research problems that go in both directions. I believe that, on the one hand, looking at short-term safety issues, which includes things like Uber’s car just killed someone, the YouTube recommendation engine, and issues like fake news and information filtering, gives us our best window into the types of concerns and issues that may come up with advanced AI.

At the same time, and this is a point I think people concerned about x-risks do themselves a disservice on by not focusing here, doing theory about advanced AI systems, and in particular about systems where it’s not possible to, what I would call, unilaterally intervene, systems that aren’t corrigible by default, actually gives us a lot of ideas about how to build systems now that are merely hard to intervene with or oversee.

If you’re thinking about issues of monitoring and oversight, and how you actually get a system that can appropriately evaluate when it should go to a person because its objectives are not properly specified or may not be relevant to the situation, I think YouTube would be in a much better place today if they had a robust system for doing that for their recommendation engine. In a lot of ways, the concerns about x-risks represent an extreme set of assumptions for getting AI right now.

Lucas: I think I’m also just trying to get a better sense of what the system looks like and how it would be functioning day to day. What is the data that it’s taking in in order to capture, learn, and infer specific human preferences and values? I’m just trying to understand better whether or not it can model the whole moral views and ethical systems of other agents, or if it’s just capturing little specific bits and pieces.

Dylan: I think my ideal would be to, as a system designer, build in as little as possible about my moral beliefs. One process that I could see and imagine doing right would be to directly go after trying to replicate something about the moral imprinting process that people have with their children. You’d have someone who is like a guardian or is responsible for an AI system’s decisions, and we build systems to try to align with one individual, and then try to adopt, extend, and push forward the beliefs and preferences of that individual. I think that’s one concrete version that I could see.

One place where I see things maybe a little bit differently than some people is that I think the main ethical questions we’re going to be stuck with, and the ones that we really need to get right, are the mundane ones. The things that most people agree on and think are just, obviously, not okay. Mundane ethics and morals rather than the more esoteric or fancier population ethics questions that can arise. I feel a lot more confident about the ability to build good AI systems if we get that part right. I feel like we’ve got a better shot at getting that part right because there’s a clearer target to shoot for.

Now, what kinds of data would you be looking at? In that case, it would be data from interaction with a couple of select individuals. Ideally, you’d want as much data as you can get. What I think you really want to be careful of here is how many assumptions you make about the procedure that’s generating your data.

What I mean by that is whenever you learn from data, you have to make some assumption about how that data relates to the right thing to do, where right is with a capital R in this case. The more assumptions you make there, the more your system will be able to learn about values and preferences, and the quicker it will be able to learn them. But the more assumptions and structure you put in, the more likely you are to get something wrong that your system won’t be able to recover from.
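One standard form that “assumption about how the data relates to the right thing to do” takes in preference-learning work is a noisy-rationality (Boltzmann) observation model, in which higher-value choices are more probable but not guaranteed. The snippet below is a generic illustration of that idea and of the trade-off Dylan describes, not his specific model.

# Illustration of one common assumption about the data-generating process:
# Boltzmann-rational choice, where the human picks higher-value options more
# often but not always. `beta` encodes how much rationality we assume.
import math

def choice_probabilities(values, beta=1.0):
    """P(choice i) proportional to exp(beta * values[i])."""
    weights = [math.exp(beta * v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

# A strong assumption (high beta) treats observed choices as near-perfect
# evidence about preferences, so learning is fast but fragile; a weak
# assumption (low beta) learns more slowly but is harder to break.
print(choice_probabilities([1.0, 0.0], beta=5.0))   # roughly [0.99, 0.01]
print(choice_probabilities([1.0, 0.0], beta=0.5))   # roughly [0.62, 0.38]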

Again, we see this trade-off come up: a tension between the amount of uncertainty that you need in the system in order to be able to adapt to the right person and figure out the correct preferences and morals, and the efficiency with which you can figure that out.

I guess, in saying this, it feels a little bit like I’m rambling and unsure about what the answer looks like. I hope that that comes across, because I’m really not sure. The rough structure of data generated from people, interpreted in a way that involves the fewest prior conceptions we can get away with about what people want and what preferences they have, is what I would shoot for. Beyond that, I don’t really know what it would look like in practice.

Lucas: Right. It seems here that this is encroaching on a bunch of very difficult social, political, and ethical issues involving persons and data: which data will be selected for preference aggregation, and how many people are included in developing the reward function and utility function of the AI system. Also, I guess, we have to consider culturally sensitive systems, where systems operating in different cultures and contexts are going to need to be trained on different sets of data. There will also be questions in ethics about whether or not we’ll even want systems training off of certain cultures’ data.

Dylan: Yeah. I wouldn’t necessarily even think of it as training off of different data. One of the core questions in artificial intelligence is identifying the relevant community that you are in and building a normative understanding of that community. I want to push back a little bit and move you away from the perspective of: we collect data about a culture, we figure out the values of that culture, and then we build our system to be value-aligned with that culture.

The way I think about it, the actual AI product is the process whereby we determine, elicit, and respond to the normative values of the multiple overlapping communities that you find yourself in. That process is ongoing. It’s holistic, it’s overlapping, and it’s messy. To the extent that I think it’s possible, I’d like to not have a couple of people sitting around in a room deciding what the right values are. Much more, I think, a system should be holistically designed with value alignment at multiple scales as a core property of AI.

I think that that’s actually a fundamental property of human intelligence. You behave differently based on the different people around you, and you’re very, very sensitive to that. There are certain things that are okay at work that are not okay at home, things that are okay on vacation, things that are okay around kids and things that are not. Figuring out what those things are and adapting yourself to them is the fundamental intelligence skill needed to interact in modern life. Otherwise, you just get shunned.

Lucas: It seems to me that in the context of a really holistic, messy, ongoing value alignment procedure, we’ll be aligning AI systems’ ethics, morals, moral systems, and behavior with those of a variety of cultures and persons, and just interactions in the 21st century. When we reflect upon the humans of the past, we can see in various ways that they were just moral monsters. History has issues with slavery, and today we have issues with factory farming, voting rights, and tons of other things.

How should we view and think about aligning powerful systems’ ethics and goals with current human morality and preferences, and the risk of amplifying things which are immoral in present-day life?

Dylan: This is the idea of mistakenly locking in the wrong values, in some sense. I think it is something we should be concerned about, including from the standpoint of entire cultures getting things wrong. Again, if we don’t think of there being a monolithic society that has a single value set, these problems are fundamental issues: what your local community thinks is okay versus what other local communities think is okay.

A lot of our society and a lot of our political structures are about how to handle those clashes between value systems. My ideal for AI systems is that they should become a part of that normative process, maybe not participate in it as people do, but I think, if we think of value alignment as a consistent, ongoing, messy process, that perspective lends itself less towards locking in values and sticking with them. There’s one train of thought where you look at the problem as: we determine what’s right and what’s wrong, and we program our system to do that.

Then, there’s another one, which is: we program our system to be sensitive to what people think is right or wrong. That’s more the direction that I think of value alignment in. Then, I think the final part of what you’re getting at here is that the system actually will feed back into people. What AI systems show us will shape what we think is okay and vice versa. That’s something that I am, quite frankly, not sure how to handle. I don’t know how you’re going to influence what someone wants, and what they will perceive that they want, and how to do that, I guess, correctly.

All I can say is that we do have a human notion of what is acceptable manipulation. We do have a human notion of allowing someone to figure out for themselves what they think is right and not and refraining from biasing them too far. To some extent, if you’re able to value align with communities in a good ongoing holistic manner, that should also give you some ways to choose and understand what types of manipulations you may be doing that are okay or not.

I’d also say that I think this perspective has a very mundane analogy when you think of the feedback cycle between recommendation engines and regular people. Those systems don’t explicitly model the fact that they’re changing the structure of what people want and what they’ll want in the future. That’s probably not the best analogy in the world.

I guess what I’m saying is that it’s hard to plan for how you’re going to influence someone’s desires in the future. It’s not clear to me what’s right or what’s wrong. What’s true is that we, as humans, have a lot of norms about what types of manipulation are okay or not. You might hope that appropriately doing value alignment in that way might help get to an answer here.

Lucas: I’m just trying to get a better sense here of the role that ethics and intelligence play. I view intelligence as a means of modeling the world and achieving goals, and ethics as the end towards which intelligence is aimed. Now, I’m curious about behavior modeling, where inverse reinforcement learning agents are modeling the behavior of human agents and also predicting the sorts of behaviors they’d take in the future, or in the situation in which the inverse reinforcement learning agent finds itself.

I’m curious to know where metaethics and moral epistemology fit in when inverse reinforcement learning agents find themselves in novel ethical situations, and what their ability to handle those novel situations is like. When they’re handling those situations, how much does it look like they’re performing some normative and metaethical calculus based on the kind of moral epistemology they have, and how much does it look like they’re using some other behavioral predictive system where they’re modeling humans?

Dylan: The answer to that question is not clear. What does it actually mean to make decisions based on an ethical framework or metaethical framework? I guess we could start there. You and I know what that means, but our definition is encumbered by the fact that it’s pretty human-centric. I think we talk about it in terms of, “Well, I weighed this option. I looked at that possibility.” We don’t even really mean the literal sense of weighed, in actually counting up and constructing actual numbers and multiplying them together in our heads.

What these are is actually references to complex thought patterns that we’re going through. For the AI system, the question is whether or not those thought patterns are going on. You can also talk about the difference between the process of making a decision and the substance of it. When an inverse reinforcement learning agent is going out into the world, the policy it’s following is constructed to try to optimize a set of inferred preferences, but does that mean that the policy you’re outputting is making metaethical characterizations?

Well, at the moment, almost certainly not, because the systems we build are just not capable of that type of cognitive reasoning. I think the bigger question is: do you care? To some extent, you probably do.

Lucas: I mean, I’d care if I had some very deep disagreements with the metaethics that led to the preferences that were learned and loaded into the machine. Also, if the machine were in such a novel ethical situation, unlike anything human beings had faced, that it just required some metaethical reasoning to deal with.

Dylan: Yes. I mean, I think you definitely want it to take decisions that you would agree with or, at least, that you could be non-maliciously convinced to agree with. Practically, there isn’t a place in the theory where that shows up. It’s not clear that what you’re saying is that different from value alignment in particular. If I were to try to refine the point about metaethics, what it sounds to me like you’re getting at is an inductive bias that you’re looking for in the AI systems.

Arguably, ethics is about an argument over what inductive bias we should have as humans. I don’t think that that’s a first-order property in value alignment systems necessarily, or in preference-based learning systems in particular. I would think that that kind of metaethics comes in from value-aligning to someone who has these sophisticated ethical ideas.

I don’t know where your thoughts about metaethics came from, but, at least indirectly, we can probably trace them down to the values that your parents inculcated in you as a child. That’s how we would build metaethics into your head if we want to think of you as being an AGI. I think that for AI systems, that’s the same way I would see it getting in there. I don’t believe the brain has circuits dedicated to metaethics. I think that exists in software, and in particular, it’s something that’s being programmed into humans from their observational data, more so than from the structures that are built into us as a fundamental part of our intelligence or value alignment.

Lucas: We’ve also talked a bit about how human beings are potentially not fully rational agents. With inverse reinforcement learning, this leaves open the question as to whether or not AI systems are actually capturing what the human being actually prefers, or whether there are limitations in the human’s observed or chosen behavior, or explicitly stated preferences, limits in our ability to convey what we actually most deeply value or would value given more information. These inverse reinforcement learning systems may not be learning what we actually value or what we think we should value.

How can AI systems assist in this evolution of human morality and preferences whereby we’re actually conveying what we actually value and what we would value given more information?

Dylan: Well, there are certainly two things that I heard in that question. One is, how do you just mathematically account for the fact that people are irrational, and that that is a property of the source of your data? Inverse reinforcement learning, at face value, doesn’t allow us to model that appropriately. It may lead us to make the wrong inferences. I think that’s a very interesting question. It’s probably the main one that I think about now as a technical problem: understanding what good ways there are to model how people might or might not be rational, and building systems that can appropriately interact with that complex data source.

One recent thing that I’ve been thinking about is, what happens if people, rather than knowing their objective, what they’re trying to accomplish, are figuring it out over time? This is the model where the person is a learning agent that discovers how they like states when they enter them, rather than thinking of the person as an agent that already knows what they want, and they’re just planning to accomplish that. I think these types of assumptions that try to paint a very, very broad picture of the space of things that people are doing can help us in that vein.

When someone is learning, it’s actually interesting that you can end up helping them. You end up with a classic strategy that breaks down into three phases. You have an initial exploration phase where you help the learning agent get a better picture of the world, its dynamics, and the associated rewards.

Then, you have an observation phase where you observe how that agent now takes advantage of the information that it’s got. Then, there’s an exploitation or extrapolation phase where you try to implement the optimal policy given the information you’ve seen so far. I think moving towards more complex models that have a more realistic setting and a richer set of assumptions behind them is important.
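Schematically, the three-phase strategy Dylan describes might look something like the following; the environment and human interfaces, the fixed phase lengths, and the helper functions are all hypothetical placeholders for illustration.

# Schematic of the explore / observe / exploit strategy for assisting a human
# who is still learning their own preferences. Every interface here is a
# hypothetical placeholder.

def assist_learning_human(env, human, explore_steps, observe_steps, act_steps):
    observations = []

    # Phase 1: exploration -- steer toward unfamiliar states so the human
    # discovers how much they actually like them.
    for _ in range(explore_steps):
        new_state = env.step(env.least_visited_option())
        human.experience(new_state)

    # Phase 2: observation -- let the human act on what they have learned
    # and record their choices.
    for _ in range(observe_steps):
        state = env.current_state()
        choice = human.act(state)
        observations.append((state, choice))
        env.step(choice)

    # Phase 3: exploitation / extrapolation -- infer preferences from the
    # record (e.g. an IRL-style fit) and act on the human's behalf.
    preferences = infer_preferences(observations)   # hypothetical helper
    for _ in range(act_steps):
        env.step(best_action(env.current_state(), preferences))  # hypothetical helper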

The other thing you talked about was about helping people discover their morality and learn more what’s okay and what’s not. There, I’m afraid I don’t have too much interesting to say in the sense that I believe it’s an important question, but I just don’t feel that I have many answers there.

Practically, if you have someone who’s learning their preferences over time, is that different than humans refining their moral theories? I don’t know. You could make mathematical modeling choices, so that they are. I’m not sure if that really gets at what you’re trying to point towards. I’m sorry that I don’t have anything more interesting to say on that front other than, I think, it’s important, and I would love to talk to more people who are spending their days thinking about that question because I think it really does deserve that kind of intellectual effort.

Lucas: Yeah, yeah. It sounds like we need some more AI moral psychologists to help us think about these things.

Dylan: Yeah. In particular, when talking about the philosophy around value alignment and the ethics of value alignment, I think a really important question is: what are the ethics of developing value alignment systems? A lot of times, people talk about AI ethics from the standpoint of, for lack of a better example, the trolley problem. The way they think about it is: who should the car kill? There is a correct answer, or maybe not a correct answer, but there are answers that we could think of as more or less bad. Which one of those options should the AI select? That’s not unimportant, but it’s not the ethical question that an AI system designer is faced with.

In my mind, if you’re designing a self-driving car, the relevant questions you should be asking are these: One, what do I think is an okay way to respond to different situations? Two, how is my system going to be understanding the preferences of the people involved in those situations? And three, how should I design my system in light of those two facts?

I have my own preferences about what I would like my system to do. I have an ethical responsibility, I would say, to make sure that my system is adapting to the preferences of its users to the extent that it can. I also wonder to what extent that holds. How should you handle things when there are conflicts between those two value sets?

You’re building a robot. It’s going to go and live with an uncontacted human tribe. Should it respect the local cultural traditions and customs? Probably. That would be respecting the values of the users. Then, let’s say that that tribe does something that we would consider to be gross like pedophilia. Is my system required to participate wholesale in that value system? Where is the line that we would need to draw between unfairly imposing my values on system users and being able to make sure that the technology that I build isn’t used for purposes that I would deem reprehensible or gross?

Lucas: Maybe we should just put a dial in each of the autonomous cars that lets the user set it to deontology mode or utilitarianism mode as it’s racing down the highway. Yeah, I just think that metaethics is super important. I’m not sure if this is necessarily the case, but if fully autonomous systems are going to play a role where they’re resolving these ethical dilemmas for us, that seems necessary at some point if they’re going to be really, actually autonomous and help make the world a much better place.

I guess this feeds into my next question, where we probably both have different assumptions: what is the role of inverse reinforcement learning ultimately? Is it just to allow AI systems to evolve alongside us and to match current ethics, or is it to allow the systems to ultimately surpass us and move far beyond us into the deep future?

Dylan: Inverse reinforcement learning, I think, is much more about the first than the second. I think it can be a part of how you get to the second and how you improve. For me, when I think about these problems technically, I try to think about matching human morality as the goal.

Lucas: Except for the factory farming and stuff.

Dylan: Well, I mean, if you had a choice between a system that thinks eradicating all humans is okay but is against factory farming, versus one that is neutral about factory farming but thinks eradicating all humans isn’t okay, which would you pick? I guess, with your audience, there are maybe some people that would choose the saving-the-animals answer.

My point is that it’s so hard. Technically, I think it’s very hard to imagine getting these normative aspects of human societies and interaction right. Just hoping to participate in that process in a way that is analogous to how people normally do is a good step. I think we probably, to the extent that we can, should not have AI systems trying to figure out whether it’s okay to do factory farming.

I think that it’s so hard to understand what it means to even match human morality or participate in it that, for me, the concept of surpassing it feels very, very challenging and fraught. I would worry, as a general concern, that as a system designer who doesn’t necessarily represent the views and interests of everyone, by programming in surpassing humanity or surpassing human preferences or morals, what I’m actually doing is just programming in my morals and ethical beliefs.

Lucas: Yes. I mean, there seems to be this strange issue here where, if we get AGI and recursive self-improvement really takes off, we could have a system that has potentially succeeded at its inverse reinforcement learning but far surpassed human beings in its general intelligence. We’d have a superintelligence that’s matching human morality. It just seems like a funny situation where we’d really have to pull the brakes and, I guess, as William MacAskill suggests, have a really, really long deliberation about ethics, moral epistemology, and value. How do you view that?

Dylan: I think that’s right. I mean, I think there are some real questions about who should be involved in that conversation. One thing I’d say is that you should recognize that there’s a difference between having the same morality and having the same data. One way to think about it is that people who are against factory farming have a different morality than the rest of the people.

Another way is that they actually just have exposure to information that allows their morality to come to a better answer. There’s a confusion you can make between the objective that someone has and the data that they’ve seen so far. One point would be that a system that has current human morality but access to a vast, vast wealth of information may actually do much better than you might think. I think we should leave that open as a possibility.

For me, this is less about morality in particular and more just about power concentration and how much influence you have over the world. I mean, if we imagine that there were something like a very powerful AI system controlled by a small number of people, yeah, you’d better think freaking hard before you tell that system what to do. That’s related to questions about the ethical ramifications of metaethics, and generalization, and what we actually truly value as humans. That is also super true for all of the more mundane things in the day-to-day as well. Did that make sense?

Lucas: Yeah, yeah. It totally makes sense. I’m becoming increasingly mindful of your time here. I just wanted to hit a few more questions if that’s okay before I let you go.

Dylan: Please, yeah.

Lucas: Yeah. I’m wondering, do you have any thoughts on how coherent extrapolated volition fits into this conversation, and what your views on it are?

Dylan: What I’d say is I think coherent extrapolated volition is an interesting idea and goal.

Lucas: Where it is defined as?

Dylan: Where it’s defined as a method of preference aggregation. Personally, I’m a little wary of preference aggregation approaches. Well, I’m wary of imposing your morals on someone indirectly via choosing the method of preference aggregation that we’re going to use. I would-

Lucas: Right, but it seems like, at some point, we have to make some metaethical decision, or else, we’ll just forever be lost.

Dylan: Do we have to?

Lucas: Well, some agent does.

Dylan: My-

Lucas: Go ahead.

Dylan: Well, does one agent have to? Did one agent decide on the ways that we were going to do preference aggregation as a society?

Lucas: No. It naturally evolved out of-

Dylan: It just naturally evolved via a coordination and argumentative process. For me, if you force me to specify something about how we’re going to do value aggregation, if I were controlling the values for an AGI system, I would try to say as little as possible about the way that we’re going to aggregate values, because I think we don’t actually understand that process much in humans.

Lucas: Right. That’s fair.

Dylan: Instead, I would opt for a heuristic of, to the extent that we can, devoting equal optimization effort towards every individual, and allowing that parliament, if you will, to determine the way values should be aggregated. This doesn’t necessarily mean having an explicit value aggregation mechanism that gets set in stone. This could be an argumentative process mediated by artificial agents arguing on your behalf. This could be a futuristic, AI-enabled version of the court system.

Lucas: It’s like an ecosystem of preferences and values in conversation?

Dylan: Exactly.

Lucas: Cool. We’ve talked a little bit about the deep future here, where we’re potentially reaching something like AGI or artificial superintelligence. After inverse reinforcement learning is potentially solved, is there anything that you view as coming after inverse reinforcement learning in these techniques?

Dylan: Yeah. I mean, I think inverse reinforcement learning is certainly not the be-all, end-all. What it is, is one of the earliest examples in AI of trying to really look at preference elicitation, modeling preferences, and learning preferences. It existed in a whole bunch of fields; economists have been thinking about this for a while already. Basically, I think there’s a lot to be said about how you model data and how you learn about preferences and goals. I think inverse reinforcement learning is basically the first attempt to get at that, but it’s very far from the end.

I would say the biggest thing in how I view things that is maybe different from your standard reinforcement learning or inverse reinforcement learning perspective is that I focus a lot on how you act given what you’ve learned from inverse reinforcement learning. Inverse reinforcement learning is a pure inference problem; it’s just figuring out what someone wants. I ground that out in all of our research as taking actions to help someone, which introduces a new set of concerns and questions.
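To illustrate the distinction Dylan is drawing, the sketch below separates the pure inference step (score candidate reward functions by how well they explain demonstrations) from the assistance step (act under whatever reward was inferred); the candidate rewards and demonstration format are assumptions for illustration.

# Toy separation of "figure out what someone wants" (the inference problem)
# from "take actions to help them" (the assistance problem). The candidate
# reward functions and demonstration format are illustrative assumptions.

def explanation_score(reward_fn, demonstrations):
    """Higher when the reward function makes the demonstrated choices look good."""
    return sum(reward_fn(state, action) for state, action in demonstrations)

def infer_reward(candidate_rewards, demonstrations):
    """Pure inference: pick the candidate that best explains the observed behavior."""
    return max(candidate_rewards, key=lambda r: explanation_score(r, demonstrations))

def assist(state, available_actions, inferred_reward):
    """Grounding the inference in action: help by acting under the inferred reward."""
    return max(available_actions, key=lambda a: inferred_reward(state, a))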

Lucas: Great. It looks like we’re about at the end of the hour here. I guess, if anyone here is interested in working on this technical portion of the AI alignment problem, what do you suggest they study or how do you view that it’s best for them to get involved, especially if they want to work on inverse reinforcement learning and inferring human preferences?

Dylan: I think if you’re an interested person and you want to get into technical safety work, the first thing you should do is probably read Jan Leike’s recent write-up in 80,000 Hours. Generally, what I would say is: try to get involved in AI research, full stop. Don’t focus as much on trying to get into AI safety research specifically; just generally focus more on acquiring the skills that will support you in doing good AI research. Get a strong math background. Get a research advisor who will advise you on doing research projects, help teach you the process of submitting papers, and help you figure out what the AI research community is going to be interested in.

In my experience, one of the biggest pitfalls that early researchers fall into is focusing too much on what they’re researching rather than thinking about who they’re researching with and how they’re going to learn the skills that will support doing research in the future. I think most people don’t appreciate how transferable research skills are. To the extent that you can, try to do research on technical AI safety, but working more broadly on technical AI is fine. If you’re interested in safety, the safety connections will be there. You may see how a new area of AI actually relates to it or supports it, or you may find places of new risk and be in a good position to try to mitigate it and take steps to alleviate those harms.

Lucas: Wonderful. Yeah, thank you so much for speaking with me today, Dylan. It’s really been a pleasure, and it’s been super interesting.

Dylan: It was a pleasure talking to you. I love the chance to have these types of discussions.

Lucas: Great. Thanks so much. Until next time.

Dylan: Until next time. Thanks, it was a blast.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back soon with another episode in this new AI alignment series.

[end of recorded material]

AI and Robotics Researchers Boycott South Korea Tech Institute Over Development of AI Weapons Technology

UPDATE 4-9-18: The boycott against KAIST has ended. The press release for the ending of the boycott explained:

“More than 50 of the world’s leading artificial intelligence (AI) and robotics researchers from 30 different countries have declared they would end a boycott of the Korea Advanced Institute of Science and Technology (KAIST), South Korea’s top university, over the opening of an AI weapons lab in collaboration with Hanwha Systems, a major arms company.

“At the opening of the new laboratory, the Research Centre for the Convergence of National Defence and Artificial Intelligence, it was reported that KAIST was “joining the global competition to develop autonomous arms” by developing weapons “which would search for and eliminate targets without human control”. Further cause for concern was that KAIST’s industry partner, Hanwha Systems builds cluster munitions, despite an UN ban, as well as a fully autonomous weapon, the SGR-A1 Sentry Robot. In 2008, Norway excluded Hanwha from its $380 billion future fund on ethical grounds.

“KAIST’s President, Professor Sung-Chul Shin, responded to the boycott by affirming in a statement that ‘KAIST does not have any intention to engage in development of lethal autonomous weapons systems and killer robots.’ He went further by committing that ‘KAIST will not conduct any research activities counter to human dignity including autonomous weapons lacking meaningful human control.’

“Given this swift and clear commitment to the responsible use of artificial intelligence in the development of weapons, the 56 AI and robotics researchers who were signatories to the boycott have rescinded the action. They will once again visit and host researchers from KAIST, and collaborate on scientific projects.”

UPDATE 4-5-18: In response to the boycott, KAIST President Sung-Chul Shin released an official statement to the press. In it, he says:

“I would like to reaffirm that KAIST does not have any intention to engage in development of lethal autonomous weapons systems and killer robots. KAIST is significantly aware of ethical concerns in the application of all technologies including artificial intelligence.

“I would like to stress once again that this research center at KAIST, which was opened in collaboration with Hanwha Systems, does not intend to develop any lethal autonomous weapon systems and the research activities do not target individual attacks.”

ORIGINAL ARTICLE 4-4-18:

Leading artificial intelligence researchers from around the world are boycotting South Korea’s KAIST (Korea Advanced Institute of Science and Technology) after the institute announced a partnership with Hanwha Systems to create a center that will help develop technology for AI weapons systems.

The boycott, organized by AI researcher Toby Walsh, was announced just days before the start of the next United Nations Convention on Conventional Weapons (CCW) meeting in which countries will discuss how to address challenges posed by autonomous weapons. 

“At a time when the United Nations is discussing how to contain the threat posed to international security by autonomous weapons, it is regrettable that a prestigious institution like KAIST looks to accelerate the arms race to develop such weapons,” the boycott letter states. 

The letter also explains the concerns AI researchers have regarding autonomous weapons:

“If developed, autonomous weapons will be the third revolution in warfare. They will permit war to be fought faster and at a scale greater than ever before. They have the potential to be weapons of terror. Despots and terrorists could use them against innocent populations, removing any ethical restraints. This Pandora’s box will be hard to close if it is opened.”

The letter has been signed by over 50 of the world’s leading AI and robotics researchers from 30 countries, including professors Yoshua Bengio, Geoffrey Hinton, Stuart Russell, and Wolfram Burgard.

Explaining the boycott, the letter states:

“We therefore publicly declare that we will boycott all collaborations with any part of KAIST until such time as the President of KAIST provides assurances, which we have sought but not received, that the Center will not develop autonomous weapons lacking meaningful human control. We will, for example, not visit KAIST, host visitors from KAIST, or contribute to any research project involving KAIST.”

In February, the Korean Times reported on the opening of the Research Center for the Convergence of National Defense and Artificial Intelligence, which was formed as a partnership between KAIST and Hanwha to “[join] the global competition to develop autonomous arms.” The Korean Times article added that “researchers from the university and Hanwha will carry out various studies into how technologies of the Fourth Industrial Revolution can be utilized on future battlefields.”

In the press release for the boycott, Walsh referenced concerns that he and other AI researchers have had since 2015, when he and FLI released an open letter signed by thousands of researchers calling for a ban on autonomous weapons.

“Back in 2015, we warned of an arms race in autonomous weapons,” said Walsh. “That arms race has begun. We can see prototypes of autonomous weapons under development today by many nations including the US, China, Russia and the UK. We are locked into an arms race that no one wants to happen. KAIST’s actions will only accelerate this arms race.”

Many organizations and people have come together through the Campaign to Stop Killer Robots to advocate for a UN ban on lethal autonomous weapons. In her summary of the last United Nations CCW meeting in November, 2017, Ray Acheson of Reaching Critical Will wrote:

“It’s been four years since we first began to discuss the challenges associated with the development of autonomous weapon systems (AWS) at the United Nations. … But the consensus-based nature of the Convention on Certain Conventional Weapons (CCW) in which these talks have been held means that even though the vast majority of states are ready and willing to take some kind of action now, they cannot because a minority opposes it.”

Walsh adds, “I am hopeful that this boycott will add urgency to the discussions at the UN that start on Monday. It sends a clear message that the AI & Robotics community do not support the development of autonomous weapons.”

To learn more about autonomous weapons and efforts to ban them, visit the Campaign to Stop Killer Robots and autonomousweapons.org. The full open letter and signatories are below.

Open Letter:

As researchers and engineers working on artificial intelligence and robotics, we are greatly concerned by the opening of a “Research Center for the Convergence of National Defense and Artificial Intelligence” at KAIST in collaboration with Hanwha Systems, South Korea’s leading arms company. It has been reported that the goals of this Center are to “develop artificial intelligence (AI) technologies to be applied to military weapons, joining the global competition to develop autonomous arms.”

At a time when the United Nations is discussing how to contain the threat posed to international security by autonomous weapons, it is regrettable that a prestigious institution like KAIST looks to accelerate the arms race to develop such weapons. We therefore publicly declare that we will boycott all collaborations with any part of KAIST until such time as the President of KAIST provides assurances, which we have sought but not received, that the Center will not develop autonomous weapons lacking meaningful human control. We will, for example, not visit KAIST, host visitors from KAIST, or contribute to any research project involving KAIST.

If developed, autonomous weapons will be the third revolution in warfare. They will permit war to be fought faster and at a scale greater than ever before. They have the potential to be weapons of terror. Despots and terrorists could use them against innocent populations, removing any ethical restraints. This Pandora’s box will be hard to close if it is opened. As with other technologies banned in the past like blinding lasers, we can simply decide not to develop them. We urge KAIST to follow this path, and work instead on uses of AI to improve and not harm human lives.

 

FULL LIST OF SIGNATORIES TO THE BOYCOTT

Alphabetically by country, then by family name.

  • Prof. Toby Walsh, UNSW Sydney, Australia.
  • Prof. Mary-Anne Williams, University of Technology Sydney, Australia.
  • Prof. Thomas Eiter, TU Wien, Austria.
  • Prof. Paolo Petta, Austrian Research Institute for Artificial Intelligence, Austria.
  • Prof. Maurice Bruynooghe, Katholieke Universiteit Leuven, Belgium.
  • Prof. Marco Dorigo, Université Libre de Bruxelles, Belgium.
  • Prof. Luc De Raedt, Katholieke Universiteit Leuven, Belgium.
  • Prof. Andre C. P. L. F. de Carvalho, University of São Paulo, Brazil.
  • Prof. Yoshua Bengio, University of Montreal, & scientific director of MILA, co-founder of Element AI, Canada.
  • Prof. Geoffrey Hinton, University of Toronto, Canada.
  • Prof. Kevin Leyton-Brown, University of British Columbia, Canada.
  • Prof. Csaba Szepesvari, University of Alberta, Canada.
  • Prof. Zhi-Hua Zhou, Nanjing University, China.
  • Prof. Thomas Bolander, Danmarks Tekniske Universitet, Denmark.
  • Prof. Malik Ghallab, LAAS-CNRS, France.
  • Prof. Marie-Christine Rousset, University of Grenoble Alpes, France.
  • Prof. Wolfram Burgard, University of Freiburg, Germany.
  • Prof. Bernd Neumann, University of Hamburg, Germany.
  • Prof. Bernhard Schölkopf, Director, Max Planck Institute for Intelligent Systems, Germany.
  • Prof. Manolis Koubarakis, National and Kapodistrian University of Athens, Greece.
  • Prof. Grigorios Tsoumakas, Aristotle University of Thessaloniki, Greece.
  • Prof. Benjamin W. Wah, Provost, The Chinese University of Hong Kong, Hong Kong.
  • Prof. Dit-Yan Yeung, Hong Kong University of Science and Technology, Hong Kong.
  • Prof. Kristinn R. Thórisson, Managing Director, Icelandic Institute for Intelligent Machines, Iceland.
  • Prof. Barry Smyth, University College Dublin, Ireland.
  • Prof. Diego Calvanese, Free University of Bozen-Bolzano, Italy.
  • Prof. Nicola Guarino, Italian National Research Council (CNR), Trento, Italy.
  • Prof. Bruno Siciliano, University of Naples, Italy.
  • Prof. Paolo Traverso, Director of FBK, IRST, Italy.
  • Prof. Yoshihiko Nakamura, University of Tokyo, Japan.
  • Prof. Imad H. Elhajj, American University of Beirut, Lebanon.
  • Prof. Christoph Benzmüller, Université du Luxembourg, Luxembourg.
  • Prof. Miguel Gonzalez-Mendoza, Tecnológico de Monterrey, Mexico.
  • Prof. Raúl Monroy, Tecnológico de Monterrey, Mexico.
  • Prof. Krzysztof R. Apt, Center Mathematics and Computer Science (CWI), Amsterdam, the Netherlands.
  • Prof. Antal van den Bosch, Radboud University, the Netherlands.
  • Prof. Bernhard Pfahringer, University of Waikato, New Zealand.
  • Prof. Helge Langseth, Norwegian University of Science and Technology, Norway.
  • Prof. Zygmunt Vetulani, Adam Mickiewicz University in Poznań, Poland.
  • Prof. José Alferes, Universidade Nova de Lisboa, Portugal.
  • Prof. Luis Moniz Pereira, Universidade Nova de Lisboa, Portugal.
  • Prof. Ivan Bratko, University of Ljubljana, Slovenia.
  • Prof. Matjaz Gams, Jozef Stefan Institute and National Council for Science, Slovenia.
  • Prof. Hector Geffner, Universitat Pompeu Fabra, Spain.
  • Prof. Ramon Lopez de Mantaras, Director, Artificial Intelligence Research Institute, Spain.
  • Prof. Alessandro Saffiotti, Orebro University, Sweden.
  • Prof. Boi Faltings, EPFL, Switzerland.
  • Prof. Jürgen Schmidhuber, Scientific Director, Swiss AI Lab, Università della Svizzera italiana, Switzerland.
  • Prof. Chao-Lin Liu, National Chengchi University, Taiwan.
  • Prof. J. Mark Bishop, Goldsmiths, University of London, UK.
  • Prof. Zoubin Ghahramani, University of Cambridge, UK.
  • Prof. Noel Sharkey, University of Sheffield, UK.
  • Prof. Lucy Suchman, Lancaster University, UK.
  • Prof. Marie des Jardins, University of Maryland, USA.
  • Prof. Benjamin Kuipers, University of Michigan, USA.
  • Prof. Stuart Russell, University of California, Berkeley, USA.
  • Prof. Bart Selman, Cornell University, USA.

 

Podcast: Navigating AI Safety – From Malicious Use to Accidents

Is the malicious use of artificial intelligence inevitable? If the history of technological progress has taught us anything, it’s that every “beneficial” technological breakthrough can be used to cause harm. How can we keep bad actors from using otherwise beneficial AI technology to hurt others? How can we ensure that AI technology is designed thoughtfully to prevent accidental harm or misuse?

On this month’s podcast, Ariel spoke with FLI co-founder Victoria Krakovna and Shahar Avin from the Center for the Study of Existential Risk (CSER). They talk about CSER’s recent report on forecasting, preventing, and mitigating the malicious uses of AI, along with the many efforts to ensure safe and beneficial AI.

Topics discussed in this episode include:

  • the Facebook Cambridge Analytica scandal,
  • Goodhart’s Law with AI systems,
  • spear phishing with machine learning algorithms,
  • why it’s so easy to fool ML systems,
  • and why developing AI is still worth it in the end.
In this interview we discuss The Malicious Use of Artificial Intelligence: Forecasting, Prevention and Mitigation, the original FLI grants, and the RFP examples for the 2018 round of FLI grants. This podcast was edited by Tucker Davey. You can listen to it above or read the transcript below.

 

Ariel: The challenge is daunting and the stakes are high. So ends the executive summary of the recent report, The Malicious Use of Artificial Intelligence: Forecasting, Prevention and Mitigation. I’m Ariel Conn with the Future of Life Institute, and I’m excited to have Shahar Avin and Victoria Krakovna joining me today to talk about this report along with the current state of AI safety research and where we’ve come in the last three years.

But first, if you’ve been enjoying our podcast, please make sure you’ve subscribed to this channel on SoundCloud, iTunes, or whatever your favorite podcast platform happens to be. In addition to the monthly podcast I’ve been recording, Lucas Perry will also be creating a new podcast series that will focus on AI safety and AI alignment, where he will be interviewing technical and non-technical experts from a wide variety of domains. His upcoming interview is with Dylan Hadfield-Menell, a technical AI researcher who works on cooperative inverse reinforcement learning and inferring human preferences. The best way to keep up with new content is by subscribing. And now, back to our interview with Shahar and Victoria.

Shahar is a Research Associate at the Center for the Study of Existential Risk, which I’ll be referring to as CSER for the rest of this podcast, and he is also the lead co-author on the Malicious Use of Artificial Intelligence report. Victoria is a co-founder of the Future of Life Institute and she’s a research scientist at DeepMind working on technical AI safety.

Victoria and Shahar, thank you so much for joining me today.

Shahar: Thank you for having us.

Victoria: Excited to be here.

Ariel: So I want to go back three years, to when FLI started our grant program, which helped fund this report on the malicious use of artificial intelligence, and I was hoping you could both talk for maybe just a minute or two about what the state of AI safety research was three years ago, and what prompted FLI to take on a lot of these grant research issues — essentially what prompted a lot of the research that we’re seeing today? Victoria, maybe it makes sense to start with you quickly on that.

Victoria: Well three years ago, AI safety was less mainstream in the AI research community than it is today, particularly long-term AI safety. So part of what FLI has been working on and why FLI started this grant program was to stimulate more work into AI safety and especially its longer-term aspects that have to do with powerful general intelligence, and to make it a more mainstream topic in the AI research field.

Three years ago, there were fewer people working in it, and many of the people who were working in it were a little bit disconnected from the rest of the AI research community. So part of what we were aiming for with our Puerto Rico conference and our grant program, was to connect these communities better, and to make sure that this kind of research actually happens and that the conversation shifts from just talking about AI risks in the abstract to actually doing technical work, and making sure that the technical problems get solved and that we start working on these problems well in advance before it is clear that, let’s say general AI, would appear soon.

I think part of the idea with the grant program originally, was also to bring in new researchers into AI safety and long-term AI safety. So to get people in the AI community interested in working on these problems, and for those people whose research was already related to the area, to focus more on the safety aspects of their research.

Ariel: I’m going to want to come back to that idea and how far we’ve come in the last three years, but before we do that, Shahar, I want to ask you a bit about the report itself.

So this started as a workshop that Victoria had also actually participated in last year, and then you’ve turned it into this report. I want you to talk about what prompted that, and also this idea that’s mentioned in the report, that no one’s really looking at how artificial intelligence could be used maliciously. And yet, with every technology and advance that’s happened throughout history, I can’t think of anything that people haven’t at least attempted to use to cause harm. Whether they’ve succeeded or not, I don’t know if that’s always the case, but almost everything gets used for harm in some way. So I’m curious why there haven’t been more people considering this issue yet?

Shahar: So going back to maybe a few months before the workshop, which as you said was February 2017. Both Miles Brundage at the Future of Humanity Institute and I at the Center for the Study of Existential Risk had this inkling that there were more and more corners of malicious use of AI that were being researched, and people were getting quite concerned. We were in discussions with the Electronic Frontier Foundation about the DARPA Cyber Grand Challenge and progress being made towards the use of artificial intelligence in offensive cybersecurity. I think Miles was very well connected to the circle who were looking at lethal autonomous weapon systems and the increasing use of autonomy in drones. And we were both kind of — stories like the Facebook story that has been in the news recently, the early versions of that were already coming up back then.

So it’s not that people were not looking at malicious uses of AI, but it seemed to us that there wasn’t this overarching perspective that is not tied to particular domains. This is not, “what will AI do to cybersecurity in terms of malicious use? What will malicious use of AI look like in politics? What does malicious use of AI look like in warfare?” But rather, across the board, if you look at this technology, what new kinds of malicious actions does it enable, and what commonalities are there across those different domains. Plus, it seemed that that “across the board,” more technology-focused perspective, rather than a “domain of application” perspective, was something that was missing. And maybe that’s less surprising, right? People get very tied down to a particular scenario, a particular domain that they have expertise on, and from the technologists’ side, many of them just wouldn’t know all of the legal minutiae of warfare, or — one thing that we found was there weren’t enough channels of communication between the cybersecurity community and the AI research community; similarly the political scientists and the AI research community. So it did require quite an interdisciplinary workshop to get all of these things on the table, and tease out some of the commonalities, which is what we then tried to do with the report.

Ariel: So actually, you mentioned the Facebook thing and I was a little bit curious about that. Does that fall under the umbrella of this report or is that a separate issue?

Shahar: It’s not clear if it would fall directly under the report, because the way we define malicious could be seen as problematic. It’s the best that we could do with this kind of report, which is to say that there is a deliberate attempt to cause harm using the technology. It’s not clear, whether in the Facebook case, there was a deliberate attempt to cause harm or whether there was disregard of harm that could be caused as a side effect, or just the use of this in an arena that there are legitimate moves, just some people realize that the technology can be used to gain an upper hand within this arena.

But, there are whole scenarios that sit just next to it, that look very similar, but that are centralized use of this kind of surveillance, diminishing privacy, potentially the use of AI to manipulate individuals, manipulate their behavior, target messaging at particular individuals.

There are clearly imaginable scenarios in which this is done maliciously to keep a corrupt government in power, to overturn a government in another nation, kind of overriding the self-determination of the members of their country. There are not going to be clear rules about what is obviously malicious and what is just part of the game. I don’t know where to put Facebook’s and Cambridge Analytica’s case, but there are clearly cases that I think universally would be considered as malicious that from the technology side look very similar.

Ariel: So this gets into a quick definition that I would like you to give us and that is for the term ‘dual use.’ I was at a conference somewhat recently and a government official who was there, not a high level, but someone who should have been familiar with the term ‘dual use’ was not. So I would like to make sure that we all know what that means.

Shahar: So I’m not, of course, a legal expert, but the term did come up a lot in the workshop and in the report. ‘Dual use,’ as far as I can understand it, refers to technologies or materials that both have peace-time or peaceful purposes and uses, but also wartime, or harmful uses. A classical example would be certain kinds of fertilizer that could be used to grow more crops, but could also be used to make homegrown explosives. And this matters because you might want to regulate explosives, but you definitely don’t want to limit people’s access to get fertilizer and so you’re in a bind. How do you make sure that people who have a legitimate peaceful use of a particular technology or material get to have that access without too much hassle that will increase the cost or make things more burdensome, but at the same time, make sure that malicious actors don’t get access to capabilities or technologies or materials that they can use to do harm.

I’ve also heard the term ‘omni use,’ being referred to artificial intelligence, this is the idea that technology can have so many uses across the board that regulating it because of its potential for causing harm comes at a very, very high price, because it is so foundational for so many other things. So one can think of electricity: it is true that you can use electricity to harm people, but vetting every user of the electric grid before they are allowed to consume electricity, seems very extreme, because there is so much benefit to be gained from just having access to electricity as a utility, that you need to find other ways to regulate. Computing is often considered as ‘omni use’ and it may well be that artificial intelligence is such a technology that would just be foundational for so many applications that it will be ‘omni use,’ and so the way to stop malicious actors from having access to it is going to be fairly complicated, but it’s probably not going to be any kind of a heavy-handed regulation.

Ariel: Okay. Thank you. So going back a little bit to the report more specifically, I don’t know how detailed we want to get with everything, but I was hoping you could touch a little bit on a few of the big topics that are in the report. For example, you talk about changes in the landscape of threats, where there is an expansion of existing threats, there’s an intro to new threats, and typical threats will be modified. Can you speak somewhat briefly as to what each of those mean?

Shahar: So I guess what I was saying, the biggest change is that machine learning, at least in some domains, now works. That means that you don’t need to have someone write out the code in order to have a computer that is performant at the particular task, if you can have the right kind of labeled data or the right kind of simulator in which you can train an algorithm to perform that action. That means that, for example, if there is a human expert with a lot of tacit knowledge in a particular domain, let’s say the use of a sniper rifle, it may be possible to train a camera that sits on top of a rifle, coupled with a machine learning algorithm that does the targeting for you, so that now any soldier becomes as expert as an expert marksman. And of course, the moment you’ve trained this model once, making copies of it is essentially free or very close to free, the same as it is with software.

Another is the ability to go through very large spaces of options and using some heuristics to more effectively search through that space for effective solutions. So one example of that would be AlphaGo, which is a great technological achievement and has absolutely no malicious use aspects, but you can imagine as an analogy, similar kinds of technologies being used to find weaknesses in software, discovering vulnerabilities and so on. And I guess, finally, one example we’ve seen that came up a lot, is the capabilities in machine vision. The fact that you can now look at an image and tell what is in that image, through training, which is something that computers were just not able to do a decade ago, at least nowhere near human levels of performance, starts unlocking potential threats both in autonomous targeting, say on top of drones, but also in manipulation. If I can know whether a picture is a good representation of something or not, then my ability to create forgeries significantly increases. This is the technology of generative adversarial networks, that we’ve seen used to create fake audio and potentially fake videos in the near future.

All of these new capabilities, plus the fact that access to the technology is becoming — I mean these technologies are very democratized at the moment. There are papers on arXiv, there are good tutorials on YouTube. People are very keen to have more people join the AI revolution, and for good reason, plus the fact that moving these trained models around is very cheap. It’s just the cost of copying the software around, and the computer that is required to run those models is widely available. This suggests that the availability of these malicious capabilities is going to rapidly increase, and that the ability to perform certain kinds of attacks would no longer be limited to a few humans, but would become much more widespread.

Ariel: And so I have one more question for you, Shahar, and then I’m going to bring Victoria back in. You’re talking about the new threats, and this expansion of threats and one of the things that I saw in the report that I’ve also seen in other issues related to AI is, we’ve had computers around for a couple decades now, we’re used to issues pertaining to phishing or hacking or spam. We recognize computer vulnerabilities. We know these are an issue. We know that there’s lots of companies that are trying to help us defend our computers against malicious cyber attacks, stuff like that. But one of the things that you get into in the report is this idea of “human vulnerabilities” — that these attacks are no longer just against the computers, but they are also going to be against us.

Shahar: I think for many people, this has been one of the really worrying things about the Cambridge Analytica, Facebook issue that is in the news. It’s the idea that because of our particular psychological tendencies, because of who we are, because of how we consume information, and how that information shapes what we like and what we don’t like, what we are likely to do and what we are unlikely to do, the ability of the people who control the information that we get, gives them some capability to control us. And this is not new, right?

People who are making newspapers or running radio stations or national TV stations have known for a very long time that the ability to shape the message is the ability to influence people’s decisions. But coupling that with algorithms that are able to run experiments on millions or billions of people simultaneously with very tight feedback loops — so you make a small change in the feed of one individual and see whether their behavior changes, and you can run many of these experiments and get very good data — is something that was never available in the age of broadcast. To some extent, it was available in the age of software. When software starts moving into big data and big data analytics, the boundaries start to blur between those kinds of technologies and AI technologies.

This is the kind of manipulation that you seem to be asking about that we definitely flag in the report, both in terms of political security, the ability of large communities to govern themselves in a way that they find to truthfully represent their own preferences, but also, on a more small scale, with the social side of cyber attacks. So, if I can manipulate an individual, or a few individuals in a company to disclose their passwords or to download or click a link that they shouldn’t have, through modeling of their preferences and their desires, then that is a way in that might be a lot easier than trying to break the system through its computers.

Ariel: Okay, so one other thing that I think I saw come up, and I started to allude to this — there’s, like I said, the idea that we can defend our computers against attacks and we can upgrade our software to fix vulnerabilities, but then how do we sort of “upgrade” people to defend themselves? Is that possible? Or is it a case of we just keep trying to develop new software to help protect people?

Shahar: I think the answer is both. One thing that did come up a lot is, unfortunately unlike computers, you cannot just download a patch to everyone’s psychology. We have slow processes of doing that. So we can incorporate parts of what is a trusted computer, what is a trusted source, into the education system and get people to be more aware of the risks. You can definitely design the technology such that it makes a lot more explicit where its vulnerabilities and where its more trusted parts are, which is something that we don’t do very well at the moment. The little lock on the browser is kind of the high end of our ability to design systems to disclose where security is and why it matters, and there is much more to be done here, because just awareness of the amount of vulnerability is very low.

So there is probably some more that we can do with education and with notifying the public, but it also should be expected that this ability is limited, and it’s also, to a large extent, an unfair burden to put on the population at large. It is much more important, I think, that the technology is designed in the first place to be, as much as possible, explicit and transparent about its levels of security, and if those levels of security are not high enough, then that in turn should lead to demands for more secure systems.

Ariel: So one of the things that came up in the report that I found rather disconcerting, was this idea of spear phishing. So can you explain what that is?

Shahar: We are familiar with phishing in general, which is when you pretend to be someone or something that you’re not in order to gain your victim’s trust and get them to disclose information that they should not be disclosing to you as a malicious actor. So you could pretend to be the bank and ask them to put in their username and password, and now you have access to their bank account and can transfer away their funds. If this is part of a much larger campaign, you could just pretend to be their friend, or their secretary, or someone who wants to give them a prize, get them to trust you, get one of the passwords that maybe they are using, and maybe all you do with that is you use that trust to talk to someone else who is much more concerned. So now that I have the username and password, say for the email or the Facebook account of some low-ranking employee in a company, I can start messaging their boss and pretending to be them and maybe get even more passwords and more access through that.

Phishing is usually kind of a “spray and pray” approach. You have a, “I’m a Nigerian prince, I have all of this money stocked in Africa, I’ll give you a cut if you help me move it out of the country, you need to send me some money.” You send this to millions of people, and maybe one or two fall for it. The cost for the sender is not very high, but the success rate is also very, very low.

Spear phishing on the other hand, is when you find a particular target, and you spend quite a lot of time profiling them and understanding what their interests are, what their social circles are, and then you craft a message that is very likely to work on them, because it plays to their ego, it plays to their normal routine, it plays on their interests and so on.

In the report we talk about this research by ZeroFOX, where they took a very simple version of this. They said, let’s look at what people tweet about, and we’ll take that as an indication of the stuff that they’re interested in. We will train a machine learning algorithm to create a model of the topics that people are interested in from the tweets, craft a malicious tweet that is based on those topics of interest, and have that be a link to a malicious site. So instead of sending kind of generally, “Check this out, super cool website,” with a link to a malicious website most people know not to click on, it will be, “Oh, you are clearly interested in sports in this particular country, have you seen what happened, like the new hire in this team?” Or, “You’re interested in archeology, crazy new report about recent finds in the pyramids,” or something. And what they showed was that, once they’d created the bot, that bot then crafted targeted messages, those spear phishing messages, to a large number of users, and in principle they could scale it up indefinitely because now it’s software, and the click-through rate was very high. I think it was something like 30 percent, which is orders of magnitude more than you get with generic phishing.

So automating spear phishing changes what used to be a trade off between spray and pray, target millions of people, but very few of them would click on it, or spear phishing where you target only a few individuals with very high success rates — now you can target millions of people and customize the message to each one so you have high success rates for all of them. Which means that, you and me, who previously wouldn’t be very high on the target list for cyber criminals or other cyber attackers can now become targets simply because the cost is very low.
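
To make that trade-off concrete, here is a rough, purely illustrative sketch; all of the numbers below are invented and only meant to show why automation changes the economics.

```python
# Back-of-the-envelope comparison of the attack economics described above.
# All numbers are invented purely for illustration.
def expected_outcome(targets, click_rate, cost_per_target):
    """Return (expected victims, total campaign cost)."""
    return targets * click_rate, targets * cost_per_target

campaigns = {
    # "Spray and pray": huge reach, tiny success rate, negligible cost.
    "spray and pray":           expected_outcome(1_000_000, 0.001, 0.0001),
    # Manual spear phishing: high success rate, but profiling each target is costly.
    "manual spear phishing":    expected_outcome(10, 0.3, 100.0),
    # Automated spear phishing: targeted messages at near-zero marginal cost.
    "automated spear phishing": expected_outcome(1_000_000, 0.3, 0.001),
}

for name, (victims, cost) in campaigns.items():
    print(f"{name:>25}: ~{victims:,.0f} expected victims, cost ~${cost:,.0f}")
```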

Ariel: So the cost is low, I don’t think I’m the only person who likes to think that I’m pretty good at recognizing sort of these phishing scams and stuff like that. I’m assuming these are going to also become harder for us to identify?

Shahar: Yep. So the idea is that the more access you have to people’s data, because they’re explicit on social media about their interests and about their circles of friends, the better you get at crafting messages and, say, comparing them to authentic messages from people, and saying, “oh this is not quite right, we are going to tweak the algorithm until we get something that looks a lot like something a human would write.” Quite quickly you could get to the point where computers are generating, to begin with, texts that are indistinguishable from what a human would write, but increasingly also images, audio segments, maybe entire websites. As long as the motivation or the potential for profit is there, it seems like the technology, either the technology that we have now or what we can foresee in the next five years, would allow these kinds of advances to take place.

Ariel: Okay. So I want to touch quickly on the idea of adversarial examples. There was an XKCD cartoon that came out a week or two ago about self driving cars and the character says, “I worry about self driving car safety features, what’s to stop someone from painting fake lines on the road or dropping a cutout of a pedestrian onto a highway to make cars swerve and crash,” and then realizes all of those things would also work on human drivers. Sort of a personal story, I used to live on a street called Climax and I actually lived at the top of Climax, and I have never seen a street sign stolen more in my life, it was often the street sign just wasn’t there. So my guess is it’s not that hard to steal a stop sign if someone really wanted to mess around with drivers, and yet we don’t see that happen very often.

So I was hoping both of you could weigh in a little bit on what you think artificial intelligence is going to change about these types of scenarios where it seems like the risk will be higher for things like adversarial examples versus just stealing a stop sign.

Victoria: I agree that there is certainly a reason for optimism in the fact that most people just aren’t going to mess with the technology, that there aren’t that many actual bad actors out there who want to mess it up. On the other hand, as Shahar said earlier, democratizing both the technology and the ways to mess with it, to interfere with it, does make that more likely. For example, the ways in which you could provide adversarial examples to cars, can be quite a bit more subtle than stealing a stop sign or dropping a fake body on the road or anything like that. For example, you can put patches on a stop sign that look like noise or just look like rectangles in certain places and humans might not even think to remove them, because to humans they’re not a problem. But an autonomous car might interpret that as a speed limit sign instead of a stop sign, and similarly, more generally people can use adversarial patches to fool various vision systems, for example if they don’t want to be identified by a surveillance camera or something like that.

So a lot of these methods, people can just read about them online, there are papers on arXiv, and I think the fact that they are so widely available might make it easier for people to interfere with technology more, and basically might make this happen more often. It’s also the case that the vulnerabilities of AI are different from the vulnerabilities of humans, so it might lead to different ways that it can fail that humans are not used to, and ways in which humans would not fail. So all of these things need to be considered, and of course, as technologists, we need to think about ways in which things can go wrong, whether it is presently highly likely or not.
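
For readers curious what such an attack looks like in code, here is a minimal sketch of one standard technique, the fast gradient sign method (FGSM). It illustrates pixel-level perturbations rather than the physical patches discussed above, and the model and inputs are placeholders, not any system mentioned in this conversation.

```python
# Minimal sketch of the fast gradient sign method (FGSM) for crafting an
# adversarial example. The model and inputs are placeholders; this shows
# pixel-level perturbations, not the physical patch attacks mentioned above.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=0.03):
    """Return copies of `images` nudged to increase the model's loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Step each pixel slightly in the direction that increases the loss.
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Hypothetical usage with a pretrained classifier:
# model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
# adv = fgsm_perturb(model, sign_images, sign_labels)
# model(adv).argmax(dim=1)  # predictions may flip despite tiny pixel changes
```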

Ariel: So that leads to another question that I want to ask, but before I go there, Shahar, was there anything you wanted to add?

Shahar: I think that covers almost all of the basics, but I’d maybe stress a couple of these points. One thing about machines failing in ways that are different from how humans fail, it means that you can craft an attack that would only mess up a self driving car, but wouldn’t mess up a human driver. And that means let’s say, you can go in the middle of the night and put some stickers on and you are long gone from the scene by the time something bad happens. So this diminished ability to attribute the attack, might be something that means that more people feel like they can get away with it.

Another one is that we see people much more willing to perform malicious or borderline acts online. So it’s important, I mean we often talk about adversarial examples as things that affect vision systems, because that’s where a lot of the literature is, but it is very likely — in fact, there are several examples that also things like anomaly detection that uses machine learning patterns, malicious code detection that is based on machine-learned patterns, anomaly detection in networks and so on, all of these have their kinds of adversarial examples as well.  And so thinking about adversarial examples against defensive systems and adversarial examples against systems that are only available online, brings us back to one attacker somewhere in the world could have access to your system and so the fact that most people are not attackers doesn’t really help you defense-wise.

Ariel: And, so this whole report is about how AI can be misused, but obviously the AI safety community and AI safety research goes far beyond that. So especially in the short term, do you see misuse or just general safety and design issues to be a bigger deal?

Victoria: I think it is quite difficult to say which of them would be a bigger deal. I think both misuse and accidents are something that are going to increase in importance and become more challenging and these are things that we really need to be working on as a research community.

Shahar: Yeah, I agree. We wrote this report not because we don’t think accident risk and safety risk matters are important — we think they are very important. We just thought that there were some pretty good technical reports out there outlining the risks from accidents with near-term machine learning and with long-term systems, and some of the research that could be used to address them, and we felt like a similar thing was missing for misuse, which was why we wrote that report.

Both are going to be very important, and to some extent there is going to be an interplay. It is possible that systems that are more interpretable are also easier to secure. It might be the case that if there is some restriction in the diffusion of capabilities that also means that there is less incentive to cut corners to out-compete someone else by skimping on safety and so on. So there are strategic questions across both misuse and accidents, but I agree with Victoria, probably if we don’t do our job, we are just going to see more and more of both of these categories causing harm in the world, and more reason to work on both of them. I think both fields need to grow.

Victoria: I just wanted to add, a common cause of both accident risks and misuse risks that might happen in the future is just that these technologies are advancing quickly and there are often unforeseen and surprising ways in which they can fail, either by accident or by having vulnerabilities that can be misused by bad actors. And so as the technology continues to advance quickly we really need to be on the lookout for new ways that it can fail, new accidents but also new ways in which it can be used for harm by bad actors.

Ariel: So one of the things that I got out of this report, and that I think is also coming through now is, it’s kind of depressing. And I found myself often wondering … So at FLI, especially now we’ve got the new grants that are focused more on AGI, we’re worried about some of these bigger, longer-term issues, but with these shorter-term things, I sometimes find myself wondering if we’re even going to make it to AGI, or if something is going to happen that prevents that development in some way. So I was hoping you could speak to that a little bit.

Shahar: Maybe I’ll start with the Malicious Use report, and apologize for its somewhat gloomy perspective. So it should probably be mentioned that, I think, almost all of the authors of the report are somewhere between fairly and very optimistic about artificial intelligence. So it’s much more the fact that we see this technology going, we want to see it developed quickly, at least in various narrow domains that are of very high importance, like medicine, like self driving cars — I’m personally quite a big fan. We think that if we can foresee and design around or against the misuse risks, then we will eventually end up with a technology that is more mature, that is more acceptable, that is more trusted because it is trustworthy, because it is secure. We think it is going to be much better to plan for these things in advance.

It is also, again, say we use electricity as an analogy, if I just sat down at the beginning of the age of electricity and I wrote a report about how many people were going to be electrocuted, it would look like a very sad thing. And it’s true, there has been a rapid increase in the number of people who die from electrocution compared to before the invention of electricity and much safety has been built since then to make sure that that risk is minimized, but of course, the benefits have far, far, far outweighed the risks when it comes to electricity and we expect, probably, hopefully, if we take the right actions, like we lay out in the report, then the same is going to be true for misuse risk for AI. At least half of the report, all of Appendix B and a good chunk of the parts before it, talk about what we can do to mitigate those risks, so hopefully the message is not entirely doom and gloom.

Victoria: I think that the things we need to do remain the same no matter how far away we expect these different developments to happen. We need to be looking out for ways that things can fail. We need to be thinking in advance about ways that things can fail, and not wait until problems show up and we actually see that they’re happening. Of course, we often will see problems show up, but in these matters an ounce of prevention can be worth a pound of cure, and there are some mistakes that might just be too costly. For example, if you have some advanced AI that is running the electrical grid or the financial system, we really don’t want that thing to hack its reward function.

So there are various predictions about how soon different transformative developments of AI might happen and it is possible that things might go awry with AI before we get to general intelligence and what we need to do is basically work hard to try to prevent these kinds of accidents or misuse from happening and try to make sure that AI is ultimately beneficial, because the whole point of building it is because it would be able to solve big problems that we cannot solve by ourselves. So let’s make sure that we get there and that we sort of handle this with responsibility and foresight the whole way.

Ariel: I want to go back to the very first comments that you made about where we were three years ago. How have things changed in the last three years and where do you see the AI safety community today?

Victoria: In the last three years, we’ve seen the AI safety research community get a fair bit bigger and topics of AI safety have become more mainstream, so I will say that long-term AI safety is definitely less controversial and there are more people engaging with the questions and actually working on them. While near-term safety, like questions of fairness and privacy and technological unemployment and so on, I would say that’s definitely mainstream at this point and a lot of people are thinking about that and working on that.

In terms of long term AI safety or AGI safety we’ve seen teams spring up, for example, both DeepMind and OpenAI have a safety team that’s focusing on these sort of technical problems, which includes myself on the DeepMind side. There have been some really interesting bits of progress in technical AI safety. For example, there has been some progress in reward learning and generally value learning. For example, the cooperative inverse reinforcement learning work from Berkeley. There has been some great work from MIRI on logical induction and quantilizing agents and that sort of thing. There have been some papers at mainstream machine learning conferences that focus on technical AI safety, for example, there was an interruptibility paper at NIPS last year and generally I’ve been seeing more presence of these topics in the big conferences, which is really encouraging.

On a more meta level, it has been really exciting to see the Concrete Problems in AI Safety research agenda come out two years ago. I think that’s really been helpful to the field. So these are only some of the exciting advances that have happened.

Ariel: Great. And so, Victoria, I do want to turn now to some of the stuff about FLI’s newest grants. We have an RFP that included quite a few examples and I was hoping you could explain at least two or three of them, but before we get to that if you could quickly define what artificial general intelligence (AGI) is, what we mean when we refer to long-term AI? I think those are the two big ones that have come up so far.

Victoria: So, artificial general intelligence is this idea of an AI system that can learn to solve many different tasks. Some people define this in terms of human-level intelligence as an AI system that will be able to learn to do all human jobs, for example. And this contrasts to the kind of AI systems that we have today which we could call “narrow AI,” in the sense that they specialize in some task or class of tasks that they can do.

So, for example, AlphaZero is a system that is really good at various games like Go and Chess and so on, but it would not be able to, for example, clean up a room, because that’s not in its class of tasks. While if you look at human intelligence we would say that humans are our go-to example of general intelligence because we can learn to do new things, we can adapt to new tasks and new environments that we haven’t seen before and we can transfer our knowledge that we have acquired through previous experience, that might not be in exactly the same settings, to whatever we are trying to do at the moment.

So, AGI is the idea of building an AI system that is also able to do that — not necessarily in the same way as humans, like it doesn’t necessarily have to be human-like to be able to perform the same tasks, or it doesn’t have to be structured the way a human mind is structured. So the definition of AGI is about what it’s capable of rather than how it can do those things. I guess the emphasis there is on the word general.

In terms of the FLI grant program this year, it is specifically focused on the AGI safety issue, which we also call long-term AI safety. Long term here doesn’t necessarily mean that it’s 100 years away. We don’t know how far away AGI actually is; the opinions of experts vary quite widely on that. But it’s more emphasizing that it’s not an immediate problem in the sense that we don’t have AGI yet, but we are trying to foresee what kind of problems might happen with AGI and make sure that if and when AGI is built that it is as safe and aligned with human preferences as possible.

And in particular as a result of the mainstreaming of AI safety that has happened in the past two years, partly, as I like to think, due to FLI’s efforts, at this point it makes sense to focus on long-term safety more specifically since this is still the most neglected area in the AI safety field. I’ve been very happy to see lots and lots of work happening these days on adversarial examples, fairness, privacy, unemployment, security and so on.  I think this allows us to really zoom in and focus on AGI safety specifically to make sure that there’s enough good technical work going on in this field and that the big technical problems get as much progress as possible and that the research community continues to grow and do well.

In terms of the kind of problems that I would want to see solved, I think some of the most difficult problems in AI safety that sort of feed into a lot of the problem areas that we have are things like Goodhart’s Law. Goodhart’s Law is basically that, when a metric becomes a target, it ceases to be a good metric. And the way this applies to AI is that if we make some kind of specification of what objective we want the AI system to optimize for — for example this could be a reward function, or a utility function, or something like that — then, this specification becomes sort of a proxy or a metric for our real preferences, which are really hard to pin down in full detail. Then if the AI system explicitly tries to optimize for the metric or for that proxy, for whatever we specify, for the reward function that we gave, then it will often find some ways to follow the letter but not the spirit of that specification.

Ariel: Can you give a real life example of Goodhart’s Law today that people can use as an analogy?

Victoria: Certainly. So Goodhart’s Law was not originally coined in AI. This is something that generally exists in economics and in human organizations. For example, if employees at a company have their own incentives in some way, like they are incentivized to clock in as many hours as possible, then they might find a way to do that without actually doing a lot of work. If you’re not measuring that then the number of hours spent at work might be correlated with how much output you produce, but if you just start rewarding people for the number of hours then maybe they’ll just play video games all day, but they’ll be in the office. That could be a human example.

There are also a lot of AI examples these days of reward functions that turn out not to give good incentives to AI systems.

Ariel: For a human example, would the issues that we’re seeing with standardized testing be an example of this?

Victoria: Oh, certainly, yes. I think standardized testing is a great example where when students are optimizing for doing well on the tests, then the test is a metric and maybe the real thing you want is learning, but if they are just optimizing for doing well on the test, then actually learning can suffer because they find some way to just memorize or study for particular problems that will show up on the test, which is not necessarily a good way to learn.

And if we get back to AI examples, there was a nice example from OpenAI last year where they had this reinforcement learning agent that was playing a boat racing game and the objective of the boat racing game was to go along the racetrack as fast as possible and finish the race before the other boats do, and to encourage the player to go along the track there were some reward points — little blocks that you have to hit to get rewards — that were along the track, and then the agent just found a degenerate solution where it would just go in a circle and hit the same blocks over and over again and get lots of reward, but it was not actually playing the game or winning the race or anything like that. This is an example of Goodhart’s Law in action. There are plenty of examples of this sort with present day reinforcement learning systems. Often when people are designing a reward function for a reinforcement learning system they end up adjusting it a number of times to eliminate these sort of degenerate solutions that happen.

And this is not limited to reinforcement learning agents. For example, recently there was a great paper that came out about many examples of Goodhart’s Law in evolutionary algorithms. For example, if some evolved agents were incentivized to move quickly in some direction, then they might just evolve to be really tall and then they fall in this direction instead of actually learning to move. There are lots and lots of examples of this and I think that as AI systems become more advanced and more powerful, then I think they’ll just get more clever at finding these sort of loopholes in our specifications of what we want them to do. Goodhart’s Law is, I would say, part of what’s behind various other AI safety issues. For example, negative side effects are often caused by the agent’s specification being incomplete, so there’s something that we didn’t specify.

For example, if we want a robot to carry a box from point A to point B, then if we just reward it for getting the box to point B as fast as possible, then if there’s something in the path of the robot — for example, there’s a vase there — then it will not have an incentive to go around the vase, it would just go right through the vase and break it just to get to point B as fast as possible, and this is an issue because our specification did not include a term for the state of the vase. So, when the agent is just optimizing for this reward that’s all about the box, then it doesn’t have an incentive to avoid disruptions to the environment.
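
As a toy sketch of this kind of specification problem (the setup and numbers below are invented, not taken from any benchmark): a reward that only cares about delivering the box quickly makes walking straight through the vase the optimal plan, and only an explicit penalty term changes that.

```python
# Toy illustration of an incomplete reward specification. The setup and
# numbers are invented: a robot can deliver a box via a short path that
# breaks a vase, or a longer detour that avoids it.
STEP_COST = -1.0
DELIVERY_REWARD = 10.0

plans = {
    "straight through the vase": {"steps": 4, "breaks_vase": True},
    "detour around the vase":    {"steps": 6, "breaks_vase": False},
}

def plan_return(steps, breaks_vase, vase_penalty=0.0):
    """Total reward for a plan under a given (possibly missing) vase penalty."""
    reward = DELIVERY_REWARD + STEP_COST * steps
    if breaks_vase:
        reward += vase_penalty  # 0.0 means the vase never appears in the spec
    return reward

for penalty in (0.0, -5.0):
    best = max(plans, key=lambda p: plan_return(**plans[p], vase_penalty=penalty))
    print(f"vase_penalty={penalty}: optimal plan is '{best}'")
# With no penalty the optimal plan breaks the vase; with the extra term it detours.
```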

Ariel: So I want to interrupt with a quick question. These examples so far, we’re obviously worried about them with a technology as powerful as AGI, but they’re also things that apply today. As you mentioned, Goodhart’s Law doesn’t even just apply to AI. What progress has been made so far? Are we seeing progress already in addressing some of these issues?

Victoria: We haven’t seen so much progress in addressing these questions in a very general sort of way, because when you’re building a narrow AI system, then you can often get away with a sort of trial and error approach where you run it and maybe it does something stupid, finds some degenerate solution, then you tweak your reward function, you run it again and maybe it finds a different degenerate solution and then so on and so forth until you arrive at some reward function that doesn’t lead to obvious failure cases like that. For many narrow systems and narrow applications where you can sort of foresee all the ways in which things can go wrong, and just penalize all those ways or build a reward function that avoids all of those failure modes, then there isn’t so much need to find a general solution to these problems. While as we get closer to general intelligence, there will be more need for more principled and more general approaches to these problems.

For example, how do we build an agent that has some idea of what side effects are, or what it means to disrupt an environment that it’s in, no matter what environment you put it in. That’s something we don’t have yet. One of the promising approaches that has been gaining traction recently is reward learning. For example, there was this paper in collaboration between DeepMind and OpenAI called Deep Reinforcement Learning from Human Preferences, where instead of directly specifying a reward function for the agent, it learns a reward function from human feedback. Where, for example, if your agent is this simulated little noodle or hopper that’s trying to do a backflip, then the human would just look at two videos of the agent trying to do a backflip and say, “Well this one looks more like a backflip.” And so, you have a bunch of data from the human about what is more similar to what the human wants the agent to do.

With this kind of human feedback, unlike, for example, demonstrations, the agent can learn something that the human might not be able to demonstrate very easily. For example, even if I cannot do a backflip myself, I can still judge whether someone else has successfully done a backflip or whether this reinforcement agent has done a backflip. This is promising for getting agents to potentially solve problems that humans cannot solve or do things that humans cannot demonstrate. Of course, with human feedback and human-in-the-loop kind of work, there is always the question of scalability because human time is expensive and we want the agent to learn as efficiently as possible from limited human feedback and we also want to make sure that the agent actually gets human feedback in all the relevant situations so it learns to generalize correctly to new situations. There are a lot of remaining open problems in this area as well, but the progress so far has been quite encouraging.
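
As a minimal sketch of the reward-modelling idea behind that line of work: the reward model is fit so that the probability of the human preferring clip A over clip B follows a Bradley-Terry model on the predicted rewards. The network architecture, feature dimensions, and data below are invented placeholders, not the paper's actual setup.

```python
# Minimal sketch of learning a reward model from pairwise human preferences,
# in the spirit of "Deep RL from Human Preferences". The architecture,
# feature dimensions, and data are invented placeholders.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(clip_a, clip_b, human_prefers_a):
    """clip_*: (timesteps, 16) feature tensors; human_prefers_a: 1.0 or 0.0."""
    r_a = reward_model(clip_a).sum()   # predicted total reward of clip A
    r_b = reward_model(clip_b).sum()   # predicted total reward of clip B
    # Bradley-Terry model: P(human prefers A) = sigmoid(r_a - r_b).
    target = torch.tensor(human_prefers_a)
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, target)

# One hypothetical labelled comparison (random stand-in features):
clip_a, clip_b = torch.randn(50, 16), torch.randn(50, 16)
loss = preference_loss(clip_a, clip_b, human_prefers_a=1.0)
optimizer.zero_grad(); loss.backward(); optimizer.step()
# The learned reward_model can then be optimized by an ordinary RL algorithm.
```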

Ariel: Are there others that you want to talk about?

Victoria: Maybe I’ll talk about one other question, which is that of interpretability. Interpretability of AI systems is something that is a big area right now in near-term AI safety, that increasingly more people in the research community are thinking about and working on, and that is also quite relevant in long-term AI safety. This generally has to do with being able to understand why your system does things a certain way, or makes certain decisions or predictions, or in the case of an agent, why it takes certain actions, and also understanding what different components of the system are looking for in the data or how the system is influenced by different inputs and so on. Basically making it less of a black box, and I think there is a reputation for deep learning systems in particular that they are seen as black boxes, and it is true that they are quite complex, but I think they don’t necessarily have to be black boxes, and there has certainly been progress in trying to explain why they do things.

Ariel: Do you have real world examples?

Victoria: So, for example, if you have some AI system that’s used for medical diagnosis, then on the one hand you could have something simple like a decision tree that just looks at your x-ray and if there is something in a certain position then it gives you a certain diagnosis, and otherwise it doesn’t and so on. Or you could have a more complex system like a neural network that takes into account a lot more factors and then at the end it says, like maybe this person has cancer or maybe this person has something else. But it might not be immediately clear why that diagnosis was made. Particularly in sensitive applications like that, what sometimes happens is that people end up using simpler systems that they find more understandable where they can say why a certain diagnosis was made, even if those systems are less accurate, and that’s one of the important cases for interpretability where if we figure out how to make these more powerful systems more interpretable, for example, through visualization techniques, then they would actually become more useful in these really important applications where it actually matters not just to predict well, but to explain where the prediction came from.

And another area, another example is an algorithm that’s deciding whether to give someone a loan or a mortgage, then if someone’s loan application got rejected then they would really want to know why it got rejected. So the algorithm has to be able to point at some variables or some other aspect of the data that influences decisions or you might need to be able to explain how the data will need to change for the decision to change, what variables would need to be changed by a certain amount for the decision to be different. So these are just some examples of how this can be important and how this is already important. And this kind of interpretability of present day systems is of course already on a lot of people’s minds. I think it is also important to think about interpretability in the longer term as we build more general AI systems that will continue to be important or maybe even become more important to be able to look inside them and be able to check if they have particular concepts that they’re representing.

Like, for example, especially from a safety perspective, whether your system was thinking about the off switch and if it’s thinking about whether it’s going to be turned off, that might be something good to monitor for. We also would want to be able to explain how our systems fail and why they fail. This is, of course, quite relevant today if, let’s say, your medical diagnosis AI makes a mistake and we want to know what led to that, why it made the wrong diagnosis. Also, in the longer term, we want to know why an AI system hacks its reward function, what is it thinking — well “thinking” with quotes, of course — while it’s following a degenerate solution instead of the kind of solution we would want it to find. So, what is the boat race agent that I mentioned earlier paying attention to while it’s going in circles and collecting the same rewards over and over again instead of playing the game, that kind of thing. I think the particular application of interpretability techniques to safety problems is going to be important and it’s one of the examples of the kind of work that we’re looking for in the RFP.
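
To make the loan example above concrete, here is a minimal sketch of a counterfactual-style explanation for a simple linear scoring model. The feature names, coefficients, and applicant values are invented purely for illustration; real credit models are more complex, but the idea of asking "how much would this feature need to change to flip the decision?" is the same.

```python
# Minimal sketch of a counterfactual-style explanation for a linear loan model:
# "how much would each feature need to change for the decision to flip?"
# The coefficients and applicant values are invented for illustration.
import numpy as np

FEATURES = ["income", "debt", "years_employed"]
weights = np.array([0.8, -1.2, 0.5])       # assumed model coefficients
bias = -1.0
applicant = np.array([1.0, 1.5, 2.0])      # standardized feature values

score = float(weights @ applicant + bias)  # > 0 means "approve"
print("approve" if score > 0 else "reject", f"(score={score:.2f})")

# For a linear model, the single-feature change needed to reach the decision
# boundary is -score / weight, which doubles as a simple per-feature explanation.
for name, w in zip(FEATURES, weights):
    print(f"{name}: change by {-score / w:+.2f} to flip the decision")
```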

Ariel: Awesome. Okay, and so, we’ve been talking about how all these things can go wrong and we’re trying to do all this research to make sure things don’t go wrong, and yet basically we think it’s worthwhile to continue designing artificial intelligence, that no one’s looking at this and saying “Oh my god, artificial intelligence is awful, we need to stop studying it or developing it.” So what are the benefits that basically make these risks worth the risk?

Shahar: So I think one thing is in the domain of narrow applications, it’s very easy to make analogies to software, right? For the things that we have been able to hand over to computers, they really have been the most boring and tedious and repetitive things that humans can do and we now no longer need to do them and productivity has gone up and people are generally happier and they can get paid more for doing more interesting things and we can just build bigger systems because we can hand off the control of them to machines that don’t need to sleep and don’t make small mistakes in calculations. Now the promise of turning that and adding to that all of the narrow things that experts can do, whether it’s improving medical diagnosis, whether it’s maybe farther down the line some elements of drug discovery, whether it’s piloting a car or operating machinery, many of these areas where human labor is currently required because there is a fuzziness to the task, it does not enable a software engineer to come in and code an algorithm, but maybe with machine learning in the not too distant future we’ll be able to turn them over to machines.

It means taking some skills that only a few individuals in the world can do and making those available to everyone around the world in some domains. As for concrete examples, the ones that I have, I try to find the companies that do them and get involved with them, because I want to see them happen sooner; and the ones that I can’t imagine yet, someone will come along and make a company, or a not-for-profit, out of them. But we’ve seen applications from agriculture, to medicine, to computer security, to entertainment and art, and driving and transport, and in all of these I think we’re just gonna be seeing even more. I think we’re gonna have more creative products out there that were designed in collaboration between humans and machines. We’re gonna see more creative solutions to scientific and engineering problems. We’re gonna see those professions where really good advice is very valuable, but there are only so many people who can help you — so if I’m thinking of doctors and lawyers, taking some of that advice and making it universally accessible through an app just makes life smoother. These are some of the examples that come to my mind.

Ariel: Okay, great. Victoria what are the benefits that you think make these risks worth addressing?

Victoria: I think there are many ways in which AI systems can make our lives a lot better and make the world a lot better, especially as we build more general systems that are more adaptable. For example, these systems could help us with designing better institutions and better infrastructure, better health systems or electrical systems or what have you. Even now, there are examples like the Google project on optimizing data center energy use using machine learning, which is something that DeepMind was working on, where using machine learning algorithms to manage energy use in the data centers improved their energy efficiency by, I think, something like 40 percent. That’s of course with fairly narrow AI systems.

I think as we build more general AI systems we can expect, we can hope for really creative and innovative solutions to the big problems that humans face. So you can think of something like AlphaGo’s famous “move 37” that overturned thousands of years of human wisdom in Go. What if you can build even more general and even more creative systems and apply them to real world problems? I think there is great promise in that. I think this can really transform the world in a positive direction, and we just have to make sure that as the systems are built that we think about safety from the get go and think about it in advance and trying to build them to be as resistant to accidents and misuse as possible so that all these benefits can actually be achieved.

The things I mentioned were only examples of the possible benefits. Imagine if you could have an AI scientist that’s trying to develop better drugs against diseases that have really resisted treatment, or more generally just doing science faster and better, if you actually have more general AI systems that can think as flexibly as humans can about these sort of difficult problems. And they would not have some of the limitations that humans have where, for example, our attention is limited and our memory is limited, while AI could be, at least theoretically, unlimited in its processing power, in the resources available to it; it can be more parallelized, it can be more coordinated, and I think all of the big problems that are so far unsolved are these sort of coordination problems that require putting together a lot of different pieces of information and a lot of data. And I think there are massive benefits to be reaped there if we can only get to that point safely.

Ariel: Okay, great. Well thank you both so much for being here. I really enjoyed talking with you.

Shahar: Thank you for having us. It’s been really fun.

Victoria: Yeah, thank you so much.

[end of recorded material]

How AI Handles Uncertainty: An Interview With Brian Ziebart

When training image detectors, AI researchers can’t replicate the real world. They teach systems what to expect by feeding them training data, such as photographs, computer-generated images, real video and simulated video, but these practice environments can never capture the messiness of the physical world.

In machine learning (ML), image detectors learn to spot objects by drawing bounding boxes around them and giving them labels. And while this training process succeeds in simple environments, it gets complicated quickly.

[Image: two photographs, a person standing in full view on the left and a man partially hidden inside a car on the right]

It’s easy to define the person on the left, but how would you draw a bounding box around the person on the right? Would you only include the visible parts of his body, or also his hidden torso and legs? These differences may seem trivial, but they point to a fundamental problem in object recognition: there rarely is a single best way to define an object.

As this second image demonstrates, the real world is rarely clear-cut, and the “right” answer is usually ambiguous. Yet when ML systems use training data to develop their understanding of the world, they often fail to reflect this. Rather than recognizing uncertainty and ambiguity, these systems often confidently approach new situations no differently than their training data, which can put the systems and humans at risk.

Brian Ziebart, a Professor of Computer Science at the University of Illinois at Chicago, is conducting research to improve AI systems’ ability to operate amidst the inherent uncertainty around them. The physical world is messy and unpredictable, and if we are to trust our AI systems, they must be able to safely handle it.

 

Overconfidence in ML Systems

ML systems will inevitably confront real-world scenarios that their training data never prepared them for. But, as Ziebart explains, current statistical models “tend to assume that the data that they’ll see in the future will look a lot like the data they’ve seen in the past.”

As a result, these systems are overly confident that they know what to do when they encounter new data points, even when those data points look nothing like what they’ve seen. ML systems falsely assume that their training prepared them for everything, and the resulting overconfidence can lead to dangerous consequences.
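
To make this overconfidence concrete, here is a minimal, illustrative sketch (not taken from Ziebart’s work) of how a conventional classifier assigns probabilities. The hand-picked weights and inputs are hypothetical; the point is that the model’s confidence depends only on which side of its decision boundary an input falls, not on whether the input resembles anything it saw in training.

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# A toy linear classifier over two features; weights are hand-picked for
# illustration, but a trained model behaves the same way far from its data.
weights = np.array([[ 4.0, 0.0],    # scores for class "car"
                    [-4.0, 0.0]])   # scores for class "pedestrian"

familiar = np.array([1.0, 0.0])      # resembles the training data
unfamiliar = np.array([5.0, 100.0])  # second feature far outside anything seen

for name, x in [("familiar", familiar), ("unfamiliar", unfamiliar)]:
    print(name, softmax(weights @ x).round(4))

# Both inputs receive near-certain predictions: nothing in the model
# represents "I have never seen anything like this before."
```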

Consider image detection for a self-driving car. A car might train its image detection on data from the dashboard of another car, tracking the visual field and drawing bounding boxes around certain objects, as in the image below:

[Image: Bounding boxes on a highway – CloudFactory Blog]

For clear views like this, image detectors excel. But the real world isn’t always this simple. If researchers train an image detector on clean, well-lit images in the lab, it might accurately recognize objects 80% of the time during the day. But when forced to navigate roads on a rainy night, its accuracy might drop to 40%.

“If you collect all of your data during the day and then try to deploy the system at night, then however it was trained to do image detection during the day just isn’t going to work well when you generalize into those new settings,” Ziebart explains.

Moreover, the ML system might not recognize the problem: since the system assumes that its training covered everything, it will remain confident about its decisions and continue “to make strong predictions that are just inaccurate,” Ziebart adds.

In contrast, humans tend to recognize when previous experience doesn’t generalize into new settings. If a driver spots an unknown object ahead in the road, she wouldn’t just plow through the object. Instead, she might slow down, pay attention to how other cars respond to the object, and consider swerving if she can do so safely. When humans feel uncertain about our environment, we exercise caution to avoid making dangerous mistakes.

Ziebart would like AI systems to incorporate similar levels of caution in uncertain situations. Instead of confidently making mistakes, a system should recognize its uncertainty and ask questions to glean more information, much like an uncertain human would.

 

An Adversarial Approach

Training and practice may never prepare AI systems for every possible situation, but researchers can make their training methods more foolproof. Ziebart posits that feeding systems messier data in the lab can train them to better recognize and address uncertainty.

Conveniently, humans can provide this messy, real-world data. By hiring a group of human annotators to look at images and draw bounding boxes around certain objects – cars, people, dogs, trees, etc. – researchers can “build into the classifier some idea of what ‘normal’ data looks like,” Ziebart explains.

“If you ask ten different people to provide these bounding boxes, you’re likely to get back ten different bounding boxes,” he says. “There’s just a lot of inherent ambiguity in how people think about the ground truth for these things.”

Returning to the image above of the man in the car, human annotators might give ten different bounding boxes that capture different portions of the visible and hidden person. By feeding ML systems this confusing and contradictory data, Ziebart prepares them to expect ambiguity.

“We’re synthesizing more noise into the data set in our training procedure,” Ziebart explains. This noise reflects the messiness of the real world, and trains systems to be cautious when making predictions in new environments. Cautious and uncertain, AI systems will seek additional information and learn to navigate the confusing situations they encounter.
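
What this kind of noise injection can look like in practice is sketched below. This is a simplification under stated assumptions, not Ziebart’s actual method, which is adversarial (training against the worst-case labeling consistent with the annotations) rather than random; `model`, `images`, and `annotator_boxes` are hypothetical placeholders.

```python
import random

def train_with_ambiguous_labels(model, images, annotator_boxes, epochs=10):
    """Expose a detector to annotator disagreement instead of one 'ground truth'.

    annotator_boxes[i] is the list of (roughly ten) human-drawn boxes for images[i].
    """
    for _ in range(epochs):
        for image, boxes in zip(images, annotator_boxes):
            target = random.choice(boxes)  # a different annotator's box each pass
            model.update(image, target)    # ordinary supervised update
    return model
```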

Of course, self-driving cars shouldn’t have to ask questions. If a car’s image detection spots a foreign object up ahead, for instance, it won’t have time to ask humans for help. But if it’s trained to recognize uncertainty and act cautiously, it might slow down, detect what other cars are doing, and safely navigate around the object.

 

Building Blocks for Future Machines

Ziebart’s research remains in training settings thus far. He feeds systems messy, varied data and trains them to provide bounding boxes that have at least 70% overlap with people’s bounding boxes. And his process has already produced impressive results. On an ImageNet object detection task investigated in collaboration with Sima Behpour (University of Illinois at Chicago) and Kris Kitani (Carnegie Mellon University), for example, Ziebart’s adversarial approach “improves performance by over 16% compared to the best performing data augmentation method.” Trained to operate amidst uncertain environments, these systems more effectively manage new data points that training didn’t explicitly prepare them for.
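
The “at least 70% overlap” criterion reads most naturally as an intersection-over-union (IoU) threshold, the standard overlap measure in object detection; treating it that way here is an assumption. A small self-contained sketch of the measure:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)  # corners of the overlap region
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union else 0.0

# Two same-sized boxes shifted by 20% of their width overlap at IoU of about
# 0.667, just below a 0.7 threshold.
print(iou((0, 0, 10, 10), (2, 0, 12, 10)))
```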

But while Ziebart trains relatively narrow AI systems, he believes that this research can scale up to more advanced systems like autonomous cars and public transit systems.

“I view this as kind of a fundamental issue in how we design these predictors,” he says. “We’ve been trying to construct better building blocks on which to make machine learning – better first principles for machine learning that’ll be more robust.”

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.

Podcast: AI and the Value Alignment Problem with Meia Chita-Tegmark and Lucas Perry

What does it mean to create beneficial artificial intelligence? How can we expect to align AIs with human values if humans can’t even agree on what we value? Building safe and beneficial AI involves tricky technical research problems, but it also requires input from philosophers, ethicists, and psychologists on these fundamental questions. How can we ensure the most effective collaboration?

Ariel spoke with FLI’s Meia Chita-Tegmark and Lucas Perry on this month’s podcast about the value alignment problem: the challenge of aligning the goals and actions of AI systems with the goals and intentions of humans. 

Topics discussed in this episode include:

  • how AGI can inform human values,
  • the role of psychology in value alignment,
  • how the value alignment problem includes ethics, technical safety research, and international coordination,
  • a recent value alignment workshop in Long Beach,
  • and the possibility of creating suffering risks (s-risks).

This podcast was edited by Tucker Davey. You can listen to it above or read the transcript below.

 

Ariel: I’m Ariel Conn with the Future of Life Institute, and I’m excited to have FLI’s Lucas Perry and Meia Chita-Tegmark with me today to talk about AI, ethics and, more specifically, the value alignment problem. But first, if you’ve been enjoying our podcast, please take a moment to subscribe and like this podcast. You can find us on iTunes, SoundCloud, Google Play, and all of the other major podcast platforms.

And now, AI, ethics, and the value alignment problem. First, consider the statement “I believe that harming animals is bad.” Now, that statement can mean something very different to a vegetarian than it does to an omnivore. Both people can honestly say that they don’t want to harm animals, but how they define “harm” is likely very different, and these types of differences in values are common between countries and cultures, and even just between individuals within the same town. And then we want to throw AI into the mix. How can we train AIs to respond ethically to situations when the people involved still can’t come to an agreement about what an ethical response should be?

The problem is even more complicated because often we don’t even know what we really want for ourselves, let alone how to ask an AI to help us get what we want. And as we’ve learned with stories like that of King Midas, we need to be really careful what we ask for. That is, when King Midas asked the genie to turn everything to gold, he didn’t really want everything — like his daughter and his food — turned to gold. And we would prefer that an AI we design recognize that there’s often implied meaning in what we say, even if we don’t say something explicitly. For example, if we jump into an autonomous car and ask it to drive us to the airport as fast as possible, implicit in that request is the assumption that, while we might be OK with some moderate speeding, we intend for the car to still follow most rules of the road, and not drive so fast as to put anyone’s life in danger or take illegal routes. That is, when we say “as fast as possible,” we mean “as fast as possible within the rules of law,” and not within the laws of physics. And these examples are just the tiniest tip of the iceberg, given that I didn’t even mention artificial general intelligence (AGI) and how that can be developed such that its goals align with our values.
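
To make the airport example concrete, here is an illustrative sketch (not from the conversation) contrasting a literal reading of the request with the constrained objective the passenger actually intends. The `route` object and its attributes are hypothetical placeholders.

```python
# Hypothetical route attributes: duration, max_speed, speed_limit, is_legal.

def literal_objective(route):
    """'As fast as possible', taken literally: nothing matters but time."""
    return -route.duration

def intended_objective(route, tolerated_speeding=10):
    """The unstated constraints a human passenger assumes."""
    if not route.is_legal:
        return float("-inf")       # illegal routes are off the table
    if route.max_speed > route.speed_limit + tolerated_speeding:
        return float("-inf")       # moderate speeding only
    return -route.duration         # then, and only then, minimize time
```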

So as I mentioned a few minutes ago, I’m really excited to have Lucas and Meia joining me today. Meia is a co-founder of the Future of Life Institute. She’s interested in how social sciences can contribute to keeping AI beneficial, and her background is in social psychology. Lucas works on AI and nuclear weapons risk-related projects at FLI. His background is in philosophy with a focus on ethics. Meia and Lucas, thanks for joining us today.

Meia: It’s a pleasure. Thank you.

Lucas: Thanks for having us.

Ariel: So before we get into anything else, one of the big topics that comes up a lot when we talk about AI and ethics is this concept value alignment. I was hoping you could both maybe talk just a minute about what value alignment is and why it’s important to this question of AI and ethics.

Lucas: So value alignment, in my view, is bringing AI’s goals, actions, intentions and decision-making processes in accordance with what humans deem to be the good or what we see as valuable or what our ethics actually are.

Meia: So for me, from the point of view of psychology, of course, I have to put the humans at the center of my inquiry. So from that point of view, value alignment … You can think about it also in terms of humans’ relationships with other humans. But I think it’s even more interesting when you add artificial agents into the mix. Because now you have an entity that is so wildly different from humans yet we would like it to embrace our goals and our values in order to keep it beneficial for us. So I think the question of value alignment is very central to keeping AI beneficial.

Lucas: Yeah. So just to expand on what I said earlier: The project of value alignment is, in the end, creating beneficial AI. It’s working on what it means for something to be beneficial, what beneficial AI exactly entails, and then learning how to technically instantiate that into machines and AI systems. Also, building the proper social and political context for that sort of technical work to be done and for it to be fulfilled and manifested in our machines and AIs.

Ariel: So when you’re thinking of AI and ethics, is value alignment basically synonymous, just another way of saying AI and ethics or is it a subset within this big topic of AI and ethics?

Lucas: I think they have different connotations. If one’s thinking about AI ethics, I think that one tends to be more focused on applied ethics and normative ethics. One might be thinking about the application of AI systems and algorithms and machine learning in domains in the present day and in the near future. So one might think about automation and other sorts of things. I think that when one is thinking about value alignment, it’s much more broad and expands also into metaethics, and it really sort of couches and frames the problem of AI ethics as something which happens over decades and which has a tremendous impact. I think that value alignment has a much broader connotation than what AI ethics has traditionally had.

Meia: I think it all depends on how you define value alignment. I think if you take the very broad definition that Lucas has just proposed, I think that yes, it probably includes AI ethics. But you can also think of it more narrowly as simply instantiating your own values into AI systems and having them adopt your goals. In that case, I think there are other issues as well because if you think about it from the point of view of psychology, for example, then it’s not just about which values get instantiated and how you do that, how you solve the technical problem, but also we know that humans, even if they know what goals they have and what values they uphold, sometimes find it very, very hard to actually act in accordance with them, because they have all sorts of cognitive, emotional, and affective limitations. So in that case I think value alignment, in this narrow sense, is basically not sufficient. We also need to think about AIs and applications of AIs in terms of how they help us and how they make sure that we gain the cognitive competencies that we need to be moral beings and to be really what we should be, not just what we are.

Lucas: Right. I guess to expand on what I was just saying. Value alignment I think in the more traditional sense, it’s sort of all … It’s more expansive and inclusive in that it’s recognizing a different sort of problem than AI ethics alone has. I think that when one is thinking about value alignment, there are elements of thinking about — somewhat about machine ethics but also about social, political, technical and ethical issues surrounding the end goal of eventually creating AGI. Whereas, AI ethics can be more narrowly interpreted just as certain sorts of specific cases where AI’s having impact and implications in our lives in the next 10 years. Whereas, value alignment’s really thinking about the instantiation of ethics and machines and making machine systems that are corrigible and robust and docile, which will create a world that we’re all happy about living in.

Ariel: Okay. So I think that actually is going to flow really nicely into my next question, and that is, at FLI we tend to focus on existential risks. I was hoping you could talk a little bit about how issues of value alignment are connected to the existential risks that we concern ourselves with.

Lucas: Right. So, we can think of AI systems as being very powerful optimizers. We can imagine there being a list of all possible futures, and what intelligence is good for is modeling the world and then committing to and doing actions which constrain the set of all possible worlds to ones which are desirable. So intelligence is sort of the means by which we get to an end, and ethics is the end towards which we strive. So this is how these two things are really integral and work together, and how AI without ethics makes no sense, and how ethics without AI or intelligence in general also just doesn’t work. So in terms of existential risk, there are possible futures that intelligence can lead us to where earth-originating intelligent life no longer exists, either intentionally or by accident. So value alignment sort of fits in by constraining the set of all possible futures, through technical work, through political and social work, and also through work in ethics, to constrain the actions of AI systems such that existential risks do not occur, such that the AI doesn’t, through some sort of technical oversight, some misalignment of values, or some misunderstanding of what we want, generate an existential risk.

Meia: So we should remember that homo sapiens represents an existential risk to itself also. We are creating nuclear weapons. We have more of them than we need. So many, in fact, that we could destroy the entire planet with them. Not to mention homo sapiens has also represented an existential risk for all other species. The problem with AI is that we’re introducing into the mix a whole new agent that is by definition supposed to be more intelligent, more powerful than us and also autonomous. So as Lucas mentioned, it’s very important to think through what kinds of things and abilities we delegate to these AIs and how we can make sure that they have the survival and the flourishing of our species in mind. So I think this is where value alignment comes in as a safeguard against these very terrible and global risks that we can imagine coming from AI.

Lucas: Right. What makes doing that so difficult goes beyond the technical issue of just having AI researchers and AI safety researchers know how to get AI systems to actually do what we want without creating a universe of paperclips. There’s also this terrible social and political context in which this is all happening, where there are really great game-theoretic incentives to be the first person to create artificial general intelligence. So in a race to create AI, a lot of these efforts that seem very obvious and necessary could be cut in favor of more raw power. I think that’s probably one of the biggest risks for us not succeeding in creating value-aligned AI.

Ariel: Okay. Right now it’s predominantly technical AI people who are considering mostly technical AI problems; usually, to solve those problems, you need a technical approach. But when it comes to things like value alignment and ethics, most of the time I’m hearing people suggest that we can’t leave that up to just the technical AI researchers. So I was hoping you could talk a little bit about who should be part of this discussion, why we need more people involved, how we can get more people involved, stuff like that.

Lucas: Sure. So maybe if I just break the problem down into what I view to be the three different parts, then talking about it will make a little bit more sense. So we can break down the value alignment problem into three separate parts. The first one is the technical issues, the issues surrounding actually creating artificial intelligence. The second is the issues of ethics, so the end towards which we strive, the set of possible futures we would be happy living in. And then also there’s the governance and the coordination and the international problem. So we can sort of view this as a problem of intelligence, a problem of agreeing on the end towards which intelligence is driven, and also the political and social context in which all of this happens.

So thus far, there’s certainly been a focus on the technical issue. So there’s been a big rise in the field of AI safety and in attempts to generate beneficial AI, attempts at creating safe AGI and mechanisms for avoiding reward hacking and other sorts of things that happen when systems are trying to optimize their utility function. The Concrete Problems in AI Safety paper has been really important and sort of illustrates some of these technical issues. But even between technical AI safety research and ethics there’s also disagreement about something like machine ethics. So how important is machine ethics? Where does machine ethics fit into technical AI safety research? How much time and energy should we put into certain kinds of technical AI research versus how much time and effort should we put into issues in governance and coordination and addressing the AI arms race issues? How much of ethics do we really need to solve?

So I think there’s a really important and open question regarding how do we apply and invest our limited resources in sort of addressing these three important cornerstones in value alignment so that the technical issue, the issues in ethics and then issues in governance and coordination, and how do we optimize working on these issues given the timeline that we have? How much resources should we put in each one? I think that’s an open question. Yeah, one that certainly needs to be addressed more about how we’re going to move forward given limited resources.

Meia: I do think though the focus so far has been so much on the technical aspect. As you were saying, Lucas, there are other aspects to this problem that need to be tackled. What I’d like to emphasize is that we cannot solve the problem if we don’t pay attention to the other aspects as well. So I’m going to try to defend, for example, psychology here, which has been largely ignored I think in the conversation.

So from the point of view of psychology, I think the value alignment problem is twofold, in a way. It’s about a triad of interactions: human, AI, other humans, right? So we are extremely social animals. We interact a lot with other humans. We need to align our goals and values with theirs. Psychology has focused a lot on that. We have a very sophisticated set of psychological mechanisms that allow us to engage in very rich social interactions. But even so, we don’t always get it right. Societies have created a lot of suffering, a lot of moral harm, injustice, unfairness throughout the ages. So for example, we are very ill-prepared by our own instincts and emotions to deal with inter-group relations. So that’s very hard.

Now, people coming from the technical side, they can say, “We’re just going to have AI learn our preferences.” Inverse reinforcement learning is one proposal that basically explains how to keep humans in the loop. So it’s a proposal for programming AI such that it gets its reward not from achieving a goal but from getting good feedback from a human because it achieved a goal. So the hope is that this way AI can be correctable and can learn from human preferences.
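
A minimal sketch of the human-feedback idea described here, assuming hypothetical `agent`, `environment`, and `ask_human_rating` objects; this is an illustration of where the reward signal comes from, not an actual inverse reinforcement learning algorithm.

```python
def ask_human_rating(outcome):
    """Hypothetical stand-in for a human overseer scoring an outcome in [0, 1]."""
    raise NotImplementedError

def run_episode(agent, environment):
    outcome = environment.run(agent.current_policy())
    # The reward is the human's judgment of the outcome,
    # not a hard-coded success check the agent could game.
    reward = ask_human_rating(outcome)
    agent.update_policy(reward)
    return reward
```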

As a psychologist, I am intrigued, but I understand that this is actually very hard. Are we humans even capable of conveying the right information about our preferences? Do we even have access to them ourselves or is this all happening in some sort of subconscious level? Sometimes knowing what we want is really hard. How do we even choose between our own competing preferences? So this involves a lot more sophisticated abilities like impulse control, executive function, etc. I think that if we don’t pay attention to that as well in addition to solving the technical problem, I think we are very likely to not get it right.

Ariel: So I’m going to want to come back to this question of who should be involved and how we can get more people involved, but one of the reasons that I’m talking to the both of you today is because you actually have made some steps in broadening this discussion already in that you set up a workshop that did bring together a multidisciplinary team to talk about value alignment. I was hoping you could tell us a bit more about how that workshop went, what interesting insights were gained that might have been expressed during the workshop, what you got out of it, why you think it’s important towards the discussion? Etc.

Meia: Just to give a few facts about the workshop. The workshop took place in December 2017 in Long Beach, California. We were very lucky to have two wonderful partners in co-organizing this workshop. The Berggruen Institute and the Canadian Institute for Advanced Research. And the idea for the workshop was very much to have a very interdisciplinary conversation about value alignment and reframe it as not just a technical problem but also one that involves disciplines such as philosophy and psychology, political science and so on. So we were very lucky actually to have a fantastic group of people there representing all these disciplines. The conversation was very lively and we discussed topics all the way from near term considerations in AI and how we align AI to our goals and also all the way to thinking about AGI and even super intelligence. So it was a fascinating range both of topics discussed and also perspectives being represented.

Lucas: So my inspiration for the workshop was being really interested in ethics and the end towards which this is all going. What really is the point of creating AGI and perhaps even eventually superintelligence? What is it that is good and what is it that is valuable? Broadening from that and becoming more interested in value alignment, the conversation thus far has been primarily understood as something that is purely technical. So value alignment has only been seen as something that is for technical AI safety researchers to work on because there are technical issues regarding AI safety and how you get AIs to do really simple things without destroying the world or ruining a million other things that we care about. But this is really, as we discussed earlier, an interdependent issue that covers issues in metaethics and normative ethics, applied ethics. It covers issues in psychology. It covers issues in law, policy, governance, coordination. It covers the AI arms race issue. Solving the value alignment problem and creating a future with beneficial AI is a civilizational project where we need everyone working on all these different issues. On issues of value, on issues of game theory among countries, on the technical issues, obviously.

So what I really wanted to do was I wanted to start this workshop in order to broaden the discussion. To reframe value alignment as not just something in technical AI research but something that really needs voices from all disciplines and all expertise in order to have a really robust conversation that reflects the interdependent nature of the issue and where different sorts of expertise on the different parts of the issue can really come together and work on it.

Ariel: Is there anything specific that you can tell us about what came out of the workshop? Were there any comments that you thought were especially insightful or ideas that you think are important for people to be considering?

Lucas: I mean, I think that for me one of the takeaways from the workshop is that there’s still a mountain of work to do and that there are a ton of open questions. This is a very, very difficult issue. I think that one thing I took away from the workshop was that we couldn’t even agree on the minimal conditions under which it would be okay to safely deploy AGI. There are issues in value alignment, from both the technical side and the ethical side, that seem extremely trivial, but on which I think there is very little understanding or agreement right now.

Meia: I think the workshop was a start and one good thing that happened during the workshop is I felt that the different disciplines or rather their representatives were able to sort of air out their frustrations and also express their expectations of the others. So I remember this quite iconic moment when one roboticist simply said, “But I really want you ethics people to just tell me what to implement in my system. What do you want my system to do?” So I think that was actually very illustrative of what Lucas was saying — the need for more joint work. I think there was a lot of expectations I think from both the technical people towards the ethicists but also from the ethicists in terms of like, “What are you doing? Explain to us what are the actual ethical issues that you think you are facing with the things that you are building?” So I think there’s a lot of catching up to do on both sides and there’s much work to be done in terms of making these connections and bridging the gaps.

Ariel: So you referred to this as sort of a first step or an initial step. What would you like to see happen next?

Lucas: I don’t have any concrete or specific ideas for what exactly should happen next. I think that’s a really difficult question. Certainly, things that most people would want or expect. I think in the general literature and conversations that we were having, I think that value alignment, as a word and as something that we understand, needs to be expanded outside of the technical context. I don’t think that it’s expanded that far. I think that more ethicists and more moral psychologists and people in law policy and governance need to come in and need to work on this issue. I’d like to see more coordinated collaborations, specifically involving interdisciplinary crowds informing each other and addressing issues and identifying issues and really some sorts of formal mechanisms for interdisciplinary coordination on value alignment.

It would be really great if people in technical research, in technical AI safety research and in ethics and governance could also identify all of the issues in their own fields, which the resolution to those issues and the solution to those issues requires answers from other fields. So for example, inverse reinforcement learning is something that Meia was talking about earlier and I think it’s something that we can clearly decide and see as being interdependent on a ton of issues in a law and also in ethics and in value theory. So that would be sort of like an issue or node in the landscape of all issues and technical safety research that would be something that is interdisciplinary.

So I think it would be super awesome if everyone from their own respective fields are able to really identify the core issues which are interdisciplinary and able to dissect them into the constituent components and sort of divide them among the disciplines and work together on them and identify the different timelines at which different issues need to be worked on. Also, just coordinate on all those things.

Ariel: Okay. Then, Lucas, you talked a little bit about nodes and a landscape, but I don’t think we’ve explicitly pointed out that you did create a landscape of value alignment research so far. Can you talk a little bit about what that is and how people can use it?

Lucas: Yeah. For sure. With the help of other colleagues at the Future of Life Institute like Jessica Cussins and Richard Mallah, we’ve gone ahead and created a value alignment conceptual landscape. So what this is is it’s a really big tree, almost like an evolutionary tree that you would see, but what it is, is a conceptual mapping and landscape of the value alignment problem. What it’s broken down into are the three constituent components, which we were talking about earlier, which is the technical issues, the issues in technically creating safe AI systems. Issues in ethics, breaking that down into issues in metaethics and normative ethics and applied ethics and moral psychology and descriptive ethics where we’re trying to really understand values, what it means for something to be valuable and what is the end towards which intelligence will be aimed at. Then also, the other last section is governance. So issues in coordination and policy and law in creating a world where AI safety research can proceed and where there aren’t … Where we don’t develop or allow a sort of winner-take-all scenario to rush us towards the end and not really have a final and safe solution towards fully autonomous powerful systems.

So what the landscape here does is it sort of outlines all of the different conceptual nodes in each of these areas. It lays out what all the core concepts are, how they’re all related. It defines the concepts and also gives descriptions about how the concepts fit into each of these different sections of ethics, governance, and technical AI safety research. So the hope here is that people from different disciplines can come and see the truly interdisciplinary nature of the value alignment problem, to see where ethics and governance and the technical AI safety research stuff all fits in together and how this all together really forms, I think, the essential corners of the value alignment problem. It’s also nice for researchers and other persons to understand the concepts and the landscape of the other parts of this problem.

I think that, for example, technical AI safety researchers probably don’t know much about metaethics or they don’t spend too much time thinking about normative ethics. I’m sure that ethicists don’t spend very much time thinking about technical value alignment and how inverse reinforcement learning is actually done and what it means to do robust human imitation in machines. What are the actual technical, ethical mechanisms that are going to go into AI systems. So I think that this is like a step in sort of laying out the conceptual landscape, in introducing people to each other’s concepts. It’s a nice visual way of interacting with I think a lot of information and sort of exploring all these different really interesting nodes that explore a lot of very deep, profound moral issues, very difficult and interesting technical issues, and issues in law, policy and governance that are really important and profound and quite interesting.

Ariel: So you’ve referred to this as the value alignment problem a couple times. I’m curious, do you see this … I’d like both of you to answer this. Do you see this as a problem that can be solved or is this something that we just always keep working towards and it’s going to influence — whatever the current general consensus is will influence how we’re designing AI and possibly AGI, but it’s not ever like, “Okay. Now we’ve solved the value alignment problem.” Does that make sense?

Lucas: I mean, I think that that sort of question really depends on your metaethics, right? So if you think there are moral facts, if you think that moral statements can be true or false and aren’t just sort of subjectively dependent upon whatever our current values and preferences historically and evolutionarily and accidentally happen to be, then there is an end towards which intelligence can be aimed that would be objectively good and which would be the end toward which we would strive. In that case, if we had solved the technical issue and the governance issue and we knew that there was a concrete end towards which we would strive that was the actual good, then the value alignment problem would be solved. But if you don’t think that there is a concrete end, a concrete good, something that is objectively valuable across all agents, then the value alignment problem or value alignment in general is an ongoing process and evolution.

In terms of the technical and governance sides of those, I think that there’s nothing in the laws of physics or I think in computer science or in game theory that says that we can’t solve those parts of the problem. Those ones seem intrinsically like they can be solved. That’s nothing to say about how easy or how hard it is to solve those. But whether or not there is sort of an end to value alignment, I think, depends on difficult questions in metaethics and whether something like moral error theory is true, where all moral statements are simply false and morality is maybe sort of just a human invention, which has no real answers or whose answers are all false. I think that’s sort of the crux of whether or not value alignment can “be solved,” because I think the technical issues and the issues in governance are things which are in principle able to be solved.

Ariel: And Meia?

Meia: I think that regardless of whether there is an absolute end to this problem or not, there’s a lot of work that we need to do in between. I also think that in order to even achieve this end, we need more intelligence, but as we create more intelligent agents, again, this problem gets magnified. So there’s always going to be a race between the intelligence that we’re creating and making sure that it is beneficial. I think at every step of the way, the more we increase the intelligence, the more we need to think about the broader implications. I think in the end we should think of artificial intelligence also not just as a way to amplify our own intelligence but also as a way to amplify our moral competence as well. As a way to gain more answers regarding ethics and what our ultimate goals should be.

So I think that the interesting questions that we can do something about are somewhere sort of in between. We will not have the answer before we are creating AI. So we always have to figure out a way to keep up with the development of intelligence in terms of our development of moral competence.

Ariel: Meia, I want to stick with you for just a minute. When we talked for the FLI end-of-year podcast, one of the things you said you were looking forward to in 2018 is broadening this conversation. I was hoping you could talk a little bit more about some of what you would like to see happen this year in terms of getting other people involved in the conversation, who you would like to see taking more of an interest in this?

Meia: So I think that unfortunately, especially in academia, we’ve sort of defined our work so much around these things that we call disciplines. I think we are now faced with problems, especially in AI, that really are very interdisciplinary. We cannot get the answers from just one discipline. So I would actually like to see in 2018, for example, more funding agencies proposing and creating funding sources for interdisciplinary projects. The way it works right now, especially in academia, is that you propose grants to granting agencies that are defined around single disciplines.

Another thing that would be wonderful to start addressing is that our education system is also very much defined and organized around these disciplines. So I feel that there’s a lack of courses, for example, that teach students in technical fields about ethics, moral psychology, social sciences and so on. The converse is also true; in social sciences and in philosophy we hear very little about advancements in artificial intelligence, what’s new, and what problems are out there. So I’d like to see more of that. I’d like to see more courses like this developed. A friend of mine and I have spent some time thinking about how many courses there are that have an interdisciplinary nature and actually talk about the societal impacts of AI, and there’s a handful in the entire world. I think we counted about five or six of them. So there’s a shortage of that as well.

But then also educating the general public. I think thinking about the implications of AI and also the societal implications of AI and also the value alignment problem is something that’s probably easier for the general public to grasp rather than thinking about the technical aspects of how to make it more powerful or how to make it more intelligent. So I think there’s a lot to be done in educating, funding, and also just simply having these conversations. I also very much admire what Lucas has been doing. I hope he will expand on it, creating this conceptual landscape so that we have people from different disciplines understanding their terms, their concepts, each other’s theoretical frameworks with which they work. So I think all of this is valuable and we need to start. It won’t be completely fixed in 2018 I think. But I think it’s a good time to work towards these goals.

Ariel: Okay. Lucas, is there anything that you wanted to add about what you’d like to see happen this year?

Lucas: I mean, yeah. Nothing else I think to add on to what I said earlier. Obviously we just need as many people from as many disciplines working on this issue because it’s so important. But just to go back a little bit, I was also really liking what Meia said about how AI systems and intelligence can help us with our ethics and with our governance. I think that seems like a really good way forward, potentially: if, as our AI systems grow more powerful in their intelligence, they’re able to inform us more about our own ethics and our own preferences and our own values, about our own biases and about what sorts of values and moral systems are really conducive to the thriving of human civilization and what sorts of moralities lead to sort of navigating the space of all possible minds in a way that is truly beneficial.

So yeah. I guess I’ll be excited to see more ways in which intelligence and AI systems can be deployed for really tackling the question of what beneficial AI exactly entails. What does beneficial mean? We all want beneficial AI, but what is beneficial, what does that mean? What does that mean for us in a world in which no one can agree on what beneficial exactly entails? So yeah, I’m just excited to see how this is going to work out, how it’s going to evolve and hopefully we’ll have a lot more people joining this work on this issue.

Ariel: So your comment reminded me of a quote that I read recently that I thought was pretty interesting. I’ve been reading Paula Boddington’s book Towards a Code of Ethics for Artificial Intelligence. This was actually funded at least in part if not completely by FLI grants. But she says, “It’s worth pointing out that if we need AI to help us make moral decisions better, this casts doubt on the attempts to ensure humans always retain control over AI.” I’m wondering if you have any comments on that.

Lucas: Yeah. I don’t know. I think this is sort of a specific way of viewing the issue, or it’s a specific way of viewing what AI systems are for and the sort of future that we want. In the end, is the best of all possible futures a world in which human beings ultimately retain full control over AI systems? I mean, if AI systems are autonomous and if value alignment actually succeeds, then I would hope that we created AI systems which are more moral than we are. AI systems which have better ethics, which are less biased, which are more rational, which are more benevolent and compassionate than we are. If value alignment is able to succeed and if we’re able to create autonomous intelligent systems of that sort of caliber of ethics and benevolence and intelligence, then I’m not really sure what the point is of maintaining any sort of meaningful human control.

Meia: I agree with you, Lucas. That if we do manage to create … In this case, I think it would have to be artificial general intelligence that is more moral, more beneficial, more compassionate than we are, then the issue of control, it’s probably not so important. But in the meantime, I think, while we are sort of tinkering with artificial intelligent systems, I think the issue of control is very important.

Lucas: Yeah. For sure.

Meia: Because we wouldn’t want to … We wouldn’t want to cut ourselves out of the loop too early, before we’ve managed to properly test the system and make sure that it is indeed doing what we intended it to do.

Lucas: Right. Right. I think that process requires a lot of our own moral evolution, something which we humans are really bad and slow at. As Max Tegmark, president of FLI, likes to talk about, there is a race between our growing wisdom and the growing power of our technology. Now, human beings are really kind of bad at keeping our wisdom in pace with the growing power of our technology. If we sort of look at the moral evolution of our species, we can see huge eras in which things that were obviously morally wrong, like slavery or the subjugation of women, were seen as normal and mundane and innocuous. Today we have issues with factory farming and animal suffering and income inequality, and just tons of people who are living with exorbitant wealth that doesn’t really create much utility for them, whereas there are tons of other people who are in poverty and who are still starving to death. There are all sorts of things that we can see in the past as being obviously morally wrong.

Meia: Under the present too.

Lucas: Yeah. So then we can see that obviously there must be things like that today. We wonder, “Okay. What are the sorts of things today that we see as innocuous and normal and mundane, for which the people of tomorrow, as William MacAskill says, will see us as moral monsters? How are we moral monsters today, but we simply can’t see it?” So as we create powerful intelligent systems and we’re working on our ethics and we’re trying to really converge on constraining the set of all possible worlds into ones which are good and which are valuable and ethical, it really demands a moral evolution of ourselves that we sort of have to figure out ways to catalyze and work on and move through, I think, faster.

Ariel: Thank you. So as you consider attempts to solve the value alignment problem, what are you most worried about, either in terms of us solving it badly or not quickly enough or something along those lines? What is giving you the most hope in terms of us being able to address this problem?

Lucas: I mean, I think just technically speaking, ignoring the likelihood of this — the worst of all possible outcomes would be something like an s-risk. So an s-risk is a subset of x-risks — s-risk stands for suffering risk. So this is a sort of risk whereby, through some sort of value misalignment, whether it be intentional or much more likely accidental, some seemingly astronomical amount of suffering is produced by deploying a misaligned AI system. The way that this would function is: given certain sorts of assumptions about the philosophy of mind, about consciousness and machines, if we understand consciousness and experience to potentially be substrate-independent, meaning if consciousness can be instantiated in machine systems, that you don’t just need meat to be conscious but something like integrated information or information processing or computation, then the invention of AI systems and superintelligence and the spreading of intelligence, which optimizes towards any sort of arbitrary end, could potentially lead to vast amounts of digital suffering. That suffering would potentially arise accidentally, or through subroutines or simulations which would be epistemically useful but that involve a great amount of suffering. That, coupled with these artificial intelligent systems running on silicon and not on squishy, wet, human neurons, means they would be running at digital time scales and not biological time scales. So there would be a huge amplification of the speed at which the suffering was run. So subjectively, we might infer that a second for a simulated person on a computer would be much greater than a second for a biological person. Then we can sort of reflect that these are the sorts of risks — an s-risk would be something that would be really bad. Just any sort of way that AI can be misaligned and lead to a great amount of suffering. There’s a bunch of different ways that this could happen.

So something like an s-risk would be something super terrible, but it’s not really clear how likely that would be. But yeah, I think that beyond that, obviously we’re worried about existential risk, we’re worried about ways that this could curtail or destroy the development of earth-originating intelligent life. The ways that this really might happen are, I think, most likely because of this winner-take-all scenario that you have with AI. We’ve had nuclear weapons for a very long time now, and we’re super lucky that nothing bad has happened. But I think human civilization is really good at getting stuck in bad equilibria, where we get locked into positions that are not easy to escape from. So it’s really not easy to disarm and get out of the nuclear weapons situation once we’ve discovered it. Once we start to develop more powerful and robust AI systems, I think a race towards AGI and towards more and more powerful AI might be very, very hard to stop if we don’t make significant progress on that soon. If we’re not able to get a ban on lethal autonomous weapons, and we’re not able to introduce any real global coordination, and we all just start racing towards more powerful systems, then there might be a race towards AGI which would cut corners on safety and potentially make an existential risk or suffering risk more likely.

Ariel: Are you hopeful for anything?

Lucas: I mean, yeah. If we get it right, then the next billion years can be super amazing, right? It’s just kind of hard to internalize that and think about that. It’s really hard to say I think how likely it is that we’ll succeed in any direction. But yeah, I’m hopeful that if we succeed in value alignment that the future can be unimaginably good.

Ariel: And Meia?

Meia: What’s scary to me is that it might be too easy to create intelligence. That there’s nothing in the laws of physics making it hard for us. Thus I think that it might happen too fast. Evolution took a long time to figure out how to make us intelligent, but that was probably just because it was trying to optimize for things like energy consumption and making us a certain size. So that’s scary. It’s scary that it’s happening so fast. I’m particularly scared that it might be easy to crack general artificial intelligence. I keep asking Max, “Max, but isn’t there anything in the laws of physics that might make it tricky?” His answer, and also that of other physicists I’ve been discussing this with, is, “No, it doesn’t seem to be the case.”

Now, what makes me hopeful is that we are creating this. Stuart Russell likes to give this example of a message from an alien civilization, an alien intelligence that says, “We will be arriving in 50 years.” Then he poses the question, “What would you do to prepare for that?” But I think with artificial intelligence it’s different. It’s not like it’s arriving and it’s a given and it has a certain form or shape that we cannot do anything about. We are actually creating artificial intelligence. I think that’s what makes me hopeful: that if we actually research it right, if we think hard about what we want and we work hard at getting our own act together, first of all, and also at making sure that this is and stays beneficial, we have a good chance to succeed.

Now, there’ll be a lot of challenges in between from very near-term issues like Lucas was mentioning, for example, autonomous weapons, weaponizing our AI and giving it the right to harm and kill humans, to other issues regarding income inequality enhanced by technological development and so on, to down the road how do we make sure that autonomous AI systems actually adopt our goals. But I do feel that it is important to try and it’s important to work at it. That’s what I’m trying to do and that’s what I hope others will join us in doing.

Ariel: All right. Well, thank you both again for joining us today.

Lucas: Thanks for having us.

Meia: Thanks for having us. This was wonderful.

Ariel: If you’re interested in learning more about the value alignment landscape that Lucas was talking about, please visit FutureofLife.org/valuealignmentmap. We’ll also link to this in the transcript for this podcast. If you enjoyed this podcast, please subscribe, give it a like, and share it on social media. We’ll be back again next month with another conversation among experts.

[end of recorded material]

How to Prepare for the Malicious Use of AI

How can we forecast, prevent, and (when necessary) mitigate the harmful effects of malicious uses of AI?

This is the question posed by a 100-page report released last week, written by 26 authors from 14 institutions. The report, which is the result of a two-day workshop in Oxford, UK followed by months of research, provides a sweeping landscape of the security implications of artificial intelligence.

The authors, who include representatives from the Future of Humanity Institute, the Center for the Study of Existential Risk, OpenAI, and the Center for a New American Security, argue that AI is not only changing the nature and scope of existing threats, but also expanding the range of threats we will face. They are excited about many beneficial applications of AI, including the ways in which it will assist defensive capabilities. But the purpose of the report is to survey the landscape of security threats from intentionally malicious uses of AI.

“Our report focuses on ways in which people could do deliberate harm with AI,” said Seán Ó hÉigeartaigh, Executive Director of the Cambridge Centre for the Study of Existential Risk. “AI may pose new threats, or change the nature of existing threats, across cyber, physical, and political security.”

Importantly, this is not a report about a far-off future. The only technologies considered are those that are already available or that are likely to be within the next five years. The message therefore is one of urgency. We need to acknowledge the risks and take steps to manage them because the technology is advancing exponentially. As reporter Dave Gershgorn put it, “Every AI advance by the good guys is an advance for the bad guys, too.”

AI systems tend to be more efficient and more scalable than traditional tools. Additionally, the use of AI can increase the anonymity and psychological distance a person feels to the actions carried out, potentially lowering the barrier to committing crimes and acts of violence. Moreover, AI systems have their own unique vulnerabilities including risks from data poisoning, adversarial examples, and the exploitation of flaws in their design. AI-enabled attacks will outpace traditional cyberattacks because they will generally be more effective, more finely targeted, and more difficult to attribute.

The kinds of attacks we need to prepare for are not limited to sophisticated computer hacks. The authors suggest there are three primary security domains: digital security, which largely concerns cyberattacks; physical security, which refers to carrying out attacks with drones and other physical systems; and political security, which includes examples such as surveillance, persuasion via targeted propaganda, and deception via manipulated videos. These domains have significant overlap, but the framework can be useful for identifying different types of attacks, the rationale behind them, and the range of options available to protect ourselves.

What can be done to prepare for malicious uses of AI across these domains? The authors provide many good examples. The scenarios described in the report can be a good way for researchers and policymakers to explore possible futures and brainstorm ways to manage the most critical threats. For example, imagining a commercial cleaning robot being repurposed as a non-traceable explosive device may scare us, but it also suggests why policies like robot registration requirements may be a useful option.

Each domain also has its own possible points of control and countermeasures. For example, to improve digital security, companies can promote consumer awareness and incentivize white hat hackers to find vulnerabilities in code. We may also be able to learn from the cybersecurity community and employ measures such as red teaming for AI development, formal verification in AI systems, and responsible disclosure of AI vulnerabilities. To improve physical security, policymakers may want to regulate hardware development and prohibit sales of lethal autonomous weapons. Meanwhile, media platforms may be able to minimize threats to political security by offering image and video authenticity certification, fake news detection, and encryption.

The report additionally provides four high level recommendations, which are not intended to provide specific technical or policy proposals, but rather to draw attention to areas that deserve further investigation. The recommendations are the following:

Recommendation #1: Policymakers should collaborate closely with technical researchers to investigate, prevent, and mitigate potential malicious uses of AI.

Recommendation #2: Researchers and engineers in artificial intelligence should take the dual-use nature of their work seriously, allowing misuse-related considerations to influence research priorities and norms, and proactively reaching out to relevant actors when harmful applications are foreseeable.

Recommendation #3: Best practices should be identified in research areas with more mature methods for addressing dual-use concerns, such as computer security, and imported where applicable to the case of AI.

Recommendation #4: Actively seek to expand the range of stakeholders and domain experts involved in discussions of these challenges.

Finally, the report identifies several areas for further research. The first of these is to learn from and with the cybersecurity community because the impacts of cybersecurity incidents will grow as AI-based systems become more widespread and capable. Other areas of research include exploring different openness models, promoting a culture of responsibility among AI researchers, and developing technological and policy solutions.

As the authors state, “The malicious use of AI will impact how we construct and manage our digital infrastructure as well as how we design and distribute AI systems, and will likely require policy and other institutional responses.”

Although this is only the beginning of the understanding needed on how AI will impact global security, this report moves the discussion forward. It not only describes numerous emergent security concerns related to AI, but also suggests ways we can begin to prepare for those threats today.

MIRI’s February 2018 Newsletter

Updates

News and links

  • In “Adversarial Spheres,” Gilmer et al. investigate the tradeoff between test error and vulnerability to adversarial perturbations in many-dimensional spaces.
  • Recent posts on Less Wrong: Critch on “Taking AI Risk Seriously” and Ben Pace’s background model for assessing AI x-risk plans.
  • “Solving the AI Race”: GoodAI is offering prizes for proposed responses to the problem that “key stakeholders, including [AI] developers, may ignore or underestimate safety procedures, or agreements, in favor of faster utilization”.
  • The Open Philanthropy Project is hiring research analysts in AI alignment, forecasting, and strategy, along with generalist researchers and operations staff.

This newsletter was originally posted on MIRI’s website.

Optimizing AI Safety Research: An Interview With Owen Cotton-Barratt

Artificial intelligence poses a myriad of risks to humanity. From privacy concerns, to algorithmic bias and “black box” decision making, to broader questions of value alignment, recursive self-improvement, and existential risk from superintelligence — there’s no shortage of AI safety issues.  

AI safety research aims to address all of these concerns. But with limited funding and too few researchers, trade-offs in research are inevitable. In order to ensure that the AI safety community tackles the most important questions, researchers must prioritize their causes.

Owen Cotton-Barratt, along with his colleagues at the Future of Humanity Institute (FHI) and the Centre for Effective Altruism (CEA), looks at this ‘cause prioritization’ for the AI safety community. They analyze which projects are more likely to help mitigate catastrophic or existential risks from highly-advanced AI systems, especially artificial general intelligence (AGI). By modeling trade-offs between different types of research, Cotton-Barratt hopes to guide scientists toward more effective AI safety research projects.

Technical and Strategic Work

The first step of cause prioritization is understanding the work already being done. Broadly speaking, AI safety research happens in two domains: technical work and strategic work.

AI’s technical safety challenge is to keep machines safe and secure as they become more capable and creative. By making AI systems more predictable, more transparent, and more robustly aligned with our goals and values, we can significantly reduce the risk of harm. Technical safety work includes Stuart Russell’s research on reinforcement learning and Dan Weld’s work on explainable machine learning, since they’re improving the actual programming in AI systems.

In addition, the Machine Intelligence Research Institute (MIRI) recently released a technical safety agenda aimed at aligning machine intelligence with human interests in the long term, while OpenAI, another non-profit AI research company, is investigating the “many research problems around ensuring that modern machine learning systems operate as intended,” following suggestions from the seminal paper Concrete Problems in AI Safety.

Strategic safety work is broader, and asks how society can best prepare for and mitigate the risks of powerful AI. This research includes analyzing the political environment surrounding AI development, facilitating open dialogue between research areas, disincentivizing arms races, and learning from game theory and neuroscience about probable outcomes for AI. Yale professor Allan Dafoe has recently focused on strategic work, researching the international politics of artificial intelligence and consulting for governments, AI labs and nonprofits about AI risks. And Yale bioethicist Wendell Wallach, apart from his work on “silo busting,” is researching forms of global governance for AI.

Cause prioritization is strategy work, as well. Cotton-Barratt explains, “Strategy work includes analyzing the safety landscape itself and considering what kind of work do we think we’re going to have lots of, what are we going to have less of, and therefore helping us steer resources and be more targeted in our work.”

[Graph: growth in AI safety funding since 2015]
Who Needs More Funding?

As the graph above illustrates, AI safety spending has grown significantly since 2015. And while more money doesn’t always translate into improved results, funding patterns are easy to assess and can say a lot about research priorities. Seb Farquhar, Cotton-Barratt’s colleague at CEA, wrote a post earlier this year analyzing AI safety funding and suggesting ways to better allocate future investments.

To start, he suggests that the technical research community recruit more principal investigators to take forward the research agenda detailed in Concrete Problems in AI Safety. OpenAI is already taking a lead on this. Additionally, the community should go out of its way to ensure that emerging AI safety centers hire the best candidates, since these researchers will shape each center’s success for years to come.

In general, Farquhar notes that strategy, outreach and policy work haven’t kept up with the overall growth of AI safety research. He suggests that more people focus on improving communication about long-run strategies between AI safety research teams, between the AI safety community and the broader AI community, and between policymakers and researchers. Building more PhD and Masters courses on AI strategy and policy could establish a pipeline to fill this void, he adds.

To complement Farquhar’s data, Cotton-Barratt’s colleague Max Dalton created a mathematical model to track how more funding and more people working on a safety problem translate into useful progress or solutions. The model tries to answer such questions as: if we want to reduce AI’s existential risks, how much of an effect do we get by investing money in strategy research versus technical research?

In general, technical research is easier to track than strategic work in mathematical models. For example, spending more on strategic ethics research may be vital for AI safety, but it’s difficult to quantify that impact. Improving models of reinforcement learning, however, can produce safer and more robustly-aligned machines. With clearer feedback loops, these technical projects fit best with Dalton’s models.

Near-sightedness and AGI

But these models also confront major uncertainty. No one really knows when AGI will be developed, and this makes it difficult to determine the most important research. If AGI will be developed in five years, perhaps researchers should focus only on the most essential safety work, such as improving transparency in AI systems. But if we have thirty years, researchers can probably afford to dive into more theoretical work.

Moreover, no one really knows how AGI will function. Machine learning and deep neural networks have ushered in a new AI revolution, but AGI will likely be developed on architectures far different from AlphaGo and Watson.

This makes some long-term safety research a risky investment, even if, as many argue, it is the most important research we can do. For example, researchers could spend years making deep neural nets safe and transparent, only to find their work wasted when AGI develops on an entirely different programming architecture.

Cotton-Barratt attributes this issue to ‘nearsightedness,’ and discussed it in a recent talk at Effective Altruism Global this summer. Humans often can’t anticipate disruptive change, and AI researchers are no exception.

“Work that we might do for long-term scenarios might turn out to be completely confused because we weren’t thinking of the right type of things,” he explains. “We have more leverage over the near-term scenarios because we’re more able to assess what they’re going to look like.”

Any additional AI safety research is better than none, but given the unknown timelines and the potential gravity of AI’s threats to humanity, we’re better off pursuing — to the extent possible — the most effective AI safety research.

By helping the AI research portfolio advance in a more efficient and comprehensive direction, Cotton-Barratt and his colleagues hope to ensure that when machines eventually outsmart us, we will have asked — and hopefully answered — the right questions.

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project. If you’re interested in applying for our 2018 grants competition, please see this link.

Transparent and Interpretable AI: an interview with Percy Liang

At the end of 2017, the United States House of Representatives passed a bill called the SELF DRIVE Act, laying out an initial federal framework for autonomous vehicle regulation. Autonomous cars have been undergoing testing on public roads for almost two decades. With the passing of this bill, along with the increasing safety benefits of autonomous vehicles, it is likely that they will become even more prevalent in our daily lives. This is true for numerous autonomous technologies including those in the medical, legal, and safety fields – just to name a few.

To that end, researchers, developers, and users alike must be able to have confidence in these types of technologies that rely heavily on artificial intelligence (AI). This extends beyond autonomous vehicles, applying to everything from security devices in your smart home to the personal assistant in your phone.

Predictability in Machine Learning

Percy Liang, Assistant Professor of Computer Science at Stanford University, explains that humans rely on some degree of predictability in their day-to-day interactions — both with other humans and automated systems (including, but not limited to, their cars). One way to create this predictability is by taking advantage of machine learning.

Machine learning deals with algorithms that allow an AI to “learn” based on data gathered from previous experiences. Developers do not need to write code that dictates each and every action or intention for the AI. Instead, the system recognizes patterns from its experiences and assumes the appropriate action based on that data. It is akin to the process of trial and error.
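
As a purely illustrative sketch of that idea (not drawn from Liang’s work), the toy example below never encodes the rule it is supposed to follow; a model infers the rule from labeled examples of past “experiences” alone. The data, the rule, and the choice of scikit-learn model are all assumptions made for illustration.

```python
# Illustrative only: the rule "label = 1 when x1 + x2 > 1" is never written into
# the model; it is inferred from labeled examples of past "experiences".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((500, 2))                        # 500 past experiences, 2 features each
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)       # labels observed for those experiences

model = LogisticRegression().fit(X, y)          # the system learns the pattern from data
print(model.predict([[0.9, 0.8], [0.1, 0.2]]))  # -> [1 0] on new, unseen inputs
```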

A key question often asked of machine learning systems in the research and testing environment is, “Why did the system make this prediction?” About this search for intention, Liang explains:

“If you’re crossing the road and a car comes toward you, you have a model of what the other human driver is going to do. But if the car is controlled by an AI, how should humans know how to behave?”

It is important to see that a system is performing well, but perhaps even more important is its ability to explain in easily understandable terms why it acted the way it did. Even if the system is not accurate, it must be explainable and predictable. For AI to be safely deployed, systems must rely on well-understood, realistic, and testable assumptions.

Current theories that explore the idea of reliable AI focus on fitting the observable outputs in the training data. However, as Liang explains, this could lead “to an autonomous driving system that performs well on validation tests but does not understand the human values underlying the desired outputs.”

Running multiple tests is important, of course. These types of simulations, explains Liang, “are good for debugging techniques — they allow us to more easily perform controlled experiments, and they allow for faster iteration.”

However, to really know whether a technique is effective, “there is no substitute for applying it to real life,” says Liang. “This goes for language, vision, and robotics.” An autonomous vehicle may perform well in all testing conditions, but there is no way to accurately predict how it could perform in an unpredictable natural disaster.

Interpretable ML Systems

The best-performing models in many domains — e.g., deep neural networks for image and speech recognition — are quite complex. These are considered “black box” models, and their predictions can be difficult, if not impossible, to explain.

Liang and his team are working to interpret these models by researching how a particular training situation leads to a prediction. As Liang explains, “Machine learning algorithms take training data and produce a model, which is used to predict on new inputs.”

This type of observation becomes increasingly important as AIs take on more complex tasks – think life or death situations, such as interpreting medical diagnoses. “If the training data has outliers or adversarially generated data,” says Liang, “this will affect (corrupt) the model, which will in turn cause predictions on new inputs to be possibly wrong.  Influence functions allow you to track precisely the way that a single training point would affect the prediction on a particular new input.”
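
Koh and Liang’s influence functions approximate this question without retraining, using gradients and Hessian-vector products. As a rough, runnable stand-in for the same question (how would dropping one training point change this prediction?), the sketch below simply retrains a small model with each point left out. The synthetic data and logistic model are assumptions for illustration only, not the authors’ implementation.

```python
# Rough proxy for the question influence functions answer: how does removing a
# single training point change the prediction on one new input? (The published
# method approximates this with gradients instead of actually retraining.)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
x_new = [[0.5, 0.4]]

base = LogisticRegression().fit(X, y).predict_proba(x_new)[0, 1]

influence = []
for i in range(len(X)):
    keep = np.arange(len(X)) != i                           # leave one point out
    p = LogisticRegression().fit(X[keep], y[keep]).predict_proba(x_new)[0, 1]
    influence.append(base - p)                              # change in the prediction

top = int(np.argmax(np.abs(influence)))
print(f"training point {top} shifts this prediction by {influence[top]:+.4f}")
```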

Essentially, by understanding why a model makes the decisions it makes, Liang’s team hopes to improve how models function, discover new science, and provide end users with explanations of actions that impact them.

Another aspect of Liang’s research is ensuring that an AI understands, and is able to communicate, its limits to humans. The conventional metric for success, he explains, is average accuracy, “which is not a good interface for AI safety.” He posits, “What is one to do with an 80 percent reliable system?”

Liang is not looking for the system to have an accurate answer 100 percent of the time. Instead, he wants the system to be able to admit when it does not know an answer. If a user asks a system “How many painkillers should I take?” it is better for the system to say, “I don’t know” rather than making a costly or dangerous incorrect prediction.
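
One simple way to get this behavior, sketched here as a general idea rather than Liang’s own approach, is selective prediction: the model answers only when its predicted probability clears a confidence threshold, and otherwise abstains. Everything below (the synthetic data, the model, the 0.9 threshold) is an illustrative assumption.

```python
# Selective prediction sketch: answer only when the model is confident, otherwise
# say "I don't know". The 0.9 threshold is an arbitrary illustrative choice.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)   # noisy labels
model = LogisticRegression().fit(X, y)

def predict_or_abstain(x, threshold=0.9):
    p = model.predict_proba([x])[0]
    return int(np.argmax(p)) if p.max() >= threshold else "I don't know"

print(predict_or_abstain([2.0, 0.0]))    # far from the boundary: returns a class label
print(predict_or_abstain([0.05, 0.0]))   # near the boundary: likely abstains
```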

Liang’s team is working on this challenge by tracking a model’s predictions through its learning algorithm — all the way back to the training data where the model parameters originated.

Liang’s team hopes that this approach — of looking at the model through the lens of the training data — will become a standard part of the toolkit of developing, understanding, and diagnosing machine learning. He explains that researchers could relate this to many applications: medical, computer, natural language understanding systems, and various business analytics applications.

“I think,” Liang concludes, “there is some confusion about the role of simulations; some eschew it entirely and some are happy doing everything in simulation. Perhaps we need to change culturally to have a place for both.”

In this way, Liang and his team plan to lay a framework for a new generation of machine learning algorithms that work reliably, fail gracefully, and reduce risks.

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project. If you’re interested in applying for our 2018 grants competition, please see this link.

Podcast: Top AI Breakthroughs and Challenges of 2017 with Richard Mallah and Chelsea Finn

AlphaZero, progress in meta-learning, the role of AI in fake news, the difficulty of developing fair machine learning — 2017 was another year of big breakthroughs and big challenges for AI researchers!

To discuss this more, we invited FLI’s Richard Mallah and Chelsea Finn from UC Berkeley to join Ariel for this month’s podcast. They talked about some of the technical progress they were most excited to see and what they’re looking forward to in the coming year.

You can listen to the podcast here, or read the transcript below.

Ariel: I’m Ariel Conn with the Future of Life Institute. In 2017, we saw an increase in investments into artificial intelligence. More students are applying for AI programs, and more AI labs are cropping up around the world. With 2017 now solidly behind us, we wanted to take a look back at the year and go over some of the biggest AI breakthroughs. To do so, I have Richard Mallah and Chelsea Finn with me today.

Richard is the director of AI projects with us at the Future of Life Institute, where he does meta-research, analysis and advocacy to keep AI safe and beneficial. Richard has almost two decades of AI experience in industry and is currently also head of AI R & D at the recruiting automation firm, Avrio AI. He’s also co-founder and chief data science officer at the content marketing planning firm, MarketMuse.

Chelsea is a PhD candidate in computer science at UC Berkeley and she’s interested in how learning algorithms can enable robots to acquire common sense, allowing them to learn a variety of complex sensory motor skills in real-world settings. She completed her bachelor’s degree at MIT and has also spent time at Google Brain.

Richard and Chelsea, thank you so much for being here.

Chelsea: Happy to be here.

Richard: As am I.

Ariel: Now normally I spend time putting together questions for the guests, but today Richard and Chelsea chose the topics. Many of the breakthroughs they’re excited about were more about behind-the-scenes technical advances that may not have been quite as exciting for the general media. However, there was one exception to that, and that’s AlphaZero.

AlphaZero, which was DeepMind’s follow-up to AlphaGo, made a big splash with the popular press in December when it achieved superhuman skills at Chess, Shogi and Go without any help from humans. So Richard and Chelsea, I’m hoping you can tell us more about what AlphaZero is, how it works and why it’s a big deal. Chelsea, why don’t we start with you?

Chelsea: Yeah, so DeepMind first started with developing AlphaGo a few years ago, and AlphaGo started its learning by watching human experts play, watching how human experts play moves, how they analyze the board — and then once it analyzed and once it started with human experts, it then started learning on its own.

What’s exciting about AlphaZero is that the system started entirely on its own without any human knowledge. It started just by what’s called “self-play,” where the agent, where the artificial player is essentially just playing against itself from the very beginning and learning completely on its own.

And I think that one of the really exciting things about this research and this result was that AlphaZero was able to outperform the original AlphaGo program, and in particular was able to outperform it by removing the human expertise, by removing the human input. And so I think that this suggests that maybe if we could move towards removing the human biases and removing the human input and move more towards what’s called unsupervised learning, where these systems are learning completely on their own, then we might be able to build better and more capable artificial intelligence systems.
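
To make the self-play idea Chelsea describes concrete, here is a heavily simplified toy: tic-tac-toe learned with a shared tabular value function rather than AlphaZero’s neural network and Monte Carlo tree search. The learning rate, exploration rate, and game count are arbitrary illustrative choices; the only point is that the program starts with zero knowledge and improves purely from games against itself.

```python
# Toy "self-play" sketch: tic-tac-toe learned from scratch with a tabular value
# function shared by both sides (no neural network, no tree search).
import random
from collections import defaultdict

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
value = defaultdict(float)           # estimated value of a position, from X's view

def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return "draw" if "." not in board else None

def self_play_game(epsilon=0.2, lr=0.1):
    board, player, visited = ["."] * 9, "X", []
    while winner(board) is None:
        moves = [i for i, cell in enumerate(board) if cell == "."]
        def after(m):
            nxt = board[:]; nxt[m] = player; return "".join(nxt)
        if random.random() < epsilon:                # explore
            move = random.choice(moves)
        else:                                        # exploit current value estimates
            pick = max if player == "X" else min
            move = pick(moves, key=lambda m: value[after(m)])
        board[move] = player
        visited.append("".join(board))
        player = "O" if player == "X" else "X"
    outcome = {"X": 1.0, "O": -1.0, "draw": 0.0}[winner(board)]
    for state in visited:                            # Monte Carlo update toward outcome
        value[state] += lr * (outcome - value[state])

for _ in range(20000):                               # learn purely from games against itself
    self_play_game()
print("learned value of opening in the centre:", round(value["....X...."], 2))
```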

Ariel: And Richard, is there anything you wanted to add?

Richard: So, what was particularly exciting about AlphaZero is that it’s able to do this by essentially a technique very similar to what Paul Christiano of AI Safety fame has called “capability amplification.” It’s similar in that it’s learning a function to predict a prior or an expectation over which moves are likely at a given point, as well as function to predict which player will win. And it’s able to do these in an iterative manner. It’s able to apply what’s called an “amplification scheme” in the more general sense. In this case it was Monte Carlo tree search, but in the more general case it could be other more appropriate amplification schemes for taking a simple function and iterating it many times to make it stronger, to essentially have a leading function that is then summarized.

Ariel: So I do have a quick follow up question here. With AlphaZero, it’s a program that’s living within a world that has very strict rules. What is the next step towards moving outside of that world with very strict rules and into the much messier real world?

Chelsea: That’s a really good point. The catch with these results, with these types of games — and even video games, which are a little bit messier than the strict rules of a board game — these games, all of these games can be perfectly simulated. You can perfectly simulate what will happen when you make a certain move or when you take a certain action, either in a video game or in the game of Go or the game of Chess, et cetera. Then therefore, you can train these systems with many, many lifetimes of data.

The real physical world on the other hand, we can’t simulate. We don’t know how to simulate the complex physics of the real world. As a result, you’re limited by the number of robots that you have if you’re interested in robots, or if you’re interested in healthcare, you’re limited by the number of patients that you have. And you’re also limited by safety concerns, the cost of failure, et cetera.

I think that we still have a long way to go towards taking these sorts of advances into real world settings where there’s a lot of noise, there’s a lot of complexity in the environment, and I think that these results are inspiring, and we can take some of the ideas from these approaches and apply them to these sorts of systems, but we need to keep in mind that there are a lot of challenges ahead of us.

Richard: So between real world systems and something like the game of Go, there are also incremental improvements, like introducing support for partial observability, or more stochastic environments, or more continuous environments as opposed to the very discrete ones. So for these challenges, assuming that we do have a situation where we could actually simulate what we would like to see, or use a simulation to help get training data on the fly, then in those cases we’re likely to be able to make some progress, using a technique like this with some extensions or modifications to support those criteria.

Ariel: Okay. Now, I’m not sure if this is a natural jump to the next topic or not, but you’ve both mentioned that one of the big things that you saw happening last year were new creative approaches to unsupervised learning, and Richard in an email to me you mentioned “word translation without parallel data.” So I was hoping you could talk a little bit more about what these new creative approaches are and what you’re excited about there.

Richard: So this year, we saw an application of taking vector spaces, or taking word embeddings, which are essentially these multidimensional spaces where there are relationships between points that are meaningful semantically. The space itself is learned by a relatively shallow deep-learning network, but this meaningfulness that is imbued in the space, is actually able to be used, we’ve seen this year, by taking different languages, or I should say vector spaces that were trained in different languages or created from corpora of different languages and compared, and via some techniques to sort of compare and rationalize the differences between those spaces, we’re actually able to translate words and translate things between language pairs in ways that actually, in some cases, exceed supervised approaches because typically there are parallel sets of documents that have the same meaning in different languages. But in this case, we’re able to essentially do something very similar to what the Star Trek universal translator does. By consuming enough of the alien language, or the foreign language I should say, it’s able to model the relationships between concepts and then realign those with the concepts that are known.

Chelsea, would you like to comment on that?

Chelsea: I don’t think I have too much to add. I’m also excited about the translation results and I’ve also seen similar, I guess, works that are looking at unsupervised learning, not for translation, that have a little bit of a similar vein, but they’re fairly technical in terms of the actual approach.

Ariel: Yeah, I’m wondering if either of you want to try to take a stab at explaining how this works without mentioning vector spaces?

Richard: That’s difficult because it is a space, I mean it’s a very geometric concept, and it’s because we’re aligning shapes within that space that we actually get the magic happening.

Ariel: So would it be something like you have different languages going in, some sort of document or various documents from different languages going in, and this program just sort of maps them into this space so that it figures out which words are parallel to each other then?

Richard: Well it figures out the relationship between words and based on the shape of relationships in the world, it’s able to take those shapes and rotate them into a way that sort of matches up.

Chelsea: Yeah, perhaps it could be helpful to give an example. I think that generally in language you’re trying to get across concepts, and there is structure within the language, I mean there’s the structure that you learn about in grade school when you’re learning vocabulary. You learn about verbs, you learn about nouns, you learn about people and you learn about different words that describe these different things, and different languages have shared this sort of structure in terms of what they’re trying to communicate.

And so, what these algorithms do is they are given basically data of people talking in English, or people writing documents in English, and they’re also given data in another language — and the first one doesn’t necessarily need to be English. They’re given data in one language and data in another language. This data doesn’t match up. It’s not like one document that’s been translated into another, it’s just pieces of language, documents, conversations, et cetera, and by using the structure that exists, and the data such as nouns, verbs, animals, people, it can basically figure out how to map from the structure of one language to the structure of another language. It can recognize this similar structure in both languages and then figure out basically a mapping from one to the other.
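
The geometric core of this, finding a rotation that lays one word-vector space onto another, can be sketched with orthogonal Procrustes. One caveat: the published unsupervised systems (for example, Conneau et al.’s “Word Translation Without Parallel Data”) learn the mapping adversarially, without the paired examples this toy version relies on; the synthetic “languages” below are just random vectors and a hidden rotation, chosen purely for illustration.

```python
# Toy version of the "rotate one embedding space onto another" idea. Only the
# geometric alignment step (orthogonal Procrustes) is shown; real unsupervised
# systems learn this mapping without the correspondences used here.
import numpy as np

rng = np.random.default_rng(3)
src = rng.normal(size=(1000, 50))                   # "English" word vectors
hidden_rotation = np.linalg.qr(rng.normal(size=(50, 50)))[0]
tgt = src @ hidden_rotation                         # "French" vectors: same shapes, rotated

# Best orthogonal map W minimising ||src @ W - tgt|| (the Procrustes solution)
u, _, vt = np.linalg.svd(src.T @ tgt)
W = u @ vt

query = src[42]                                     # some "English" word
translated = query @ W                              # mapped into the "French" space
nearest = int(np.argmax(tgt @ translated))          # nearest "French" word
print("nearest target index:", nearest)             # recovers 42
```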

Ariel: Okay. So I think, I want to keep moving forward, but continuing with the concept of learning, and Chelsea I want to stick with you for a minute. You mentioned that there were some really big metalearning advances that occurred last year, and you also mentioned a workshop and symposium at NIPS. I was wondering if you could talk a little more about that.

Chelsea: Yeah, I think that there’s been a lot of excitement around metalearning, or learning to learn. There were two gatherings at NIPS, one symposium, one workshop this year and both were well-attended by a number of people. Actually, metalearning has a fairly long history, and so it’s by no means a recent or a new topic, but I think that it has renewed attention within the machine learning community.

And so, I guess I can describe metalearning. It’s essentially having systems that learn how to learn. There’s a number of different applications for such systems. So one of them is an application that’s often referred to as AutoML, or automatic machine learning, where these systems can essentially optimize the hyperparameters, basically figure out the best set of parameters and then run a learning algorithm with that set of hyperparameters. Essentially kind of taking over the job of the machine learning researcher that is tuning different models on different data sets. And this can basically allow people to more easily train models on a data set.

Another application of metalearning that I’m really excited about is enabling systems to reuse data and reuse experience from other tasks when trying to solve new tasks. So in machine learning, there’s this paradigm of creating everything from scratch, and as a result, if you’re training from scratch, from zero prior knowledge, then it’s going to take a lot of data. It’s going to take a lot of time to train because you’re starting from nothing. But if instead you’re starting from previous experience in a different environment or on a different task, and you can basically learn how to efficiently learn from that data, then when you see a new task that you haven’t seen before, you should be able to solve it much more efficiently.

And so, one example of this is what’s called One-Shot Learning or Few-Shot Learning, where you learn essentially how to learn from a few examples, such that when you see a new setting and you just get one or a few examples, labeled examples, labeled data points, you can figure out the new task and solve the new task just from a small number of examples.

One explicit example of how humans do this is that you can have someone point out a Segway to you on the street, and even if you’ve never seen a Segway before or never heard of the concept of a Segway, just from that one example of a human pointing it out to you, you can then recognize other examples of Segways. And the way that you do that is basically by learning how to recognize objects over the course of your lifetime.
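
A stripped-down sketch of that nearest-neighbour flavor of one-shot recognition (in the spirit of prototypical or matching networks, though much simpler): prior learning is assumed to have produced a useful embedding, faked below as a fixed random projection, and a brand-new class is then recognised from a single labelled example. The class names and data are invented for illustration.

```python
# One-shot classification sketch: embed the single labelled example of each new
# class, then classify queries by the nearest class prototype in embedding space.
import numpy as np

rng = np.random.default_rng(4)
proj = rng.normal(size=(10, 16))
embed = lambda x: x @ proj                 # stand-in for an embedding learned earlier

# One labelled example each of two classes the system has never seen before
support = {"segway": rng.normal(size=10), "bicycle": rng.normal(size=10) + 3.0}
prototypes = {name: embed(x) for name, x in support.items()}

def classify(x):
    z = embed(x)
    return min(prototypes, key=lambda name: np.linalg.norm(z - prototypes[name]))

query = support["segway"] + 0.1 * rng.normal(size=10)   # a slightly different segway
print(classify(query))                                   # -> "segway"
```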

Ariel: And are there examples of programs doing this already? Or we’re just making progress towards programs being able to do this more effectively?

Chelsea: There are some examples of programs being able to do this in terms of image recognition. There’s been a number of works that have been able to do this with real images. I think that more recently we’ve started to see systems being applied to robotics, which I think is one of the more exciting applications of this setting because when you’re training a robot in the real world, you can’t have the robot collect millions of data points or days of experience in order to learn a single task. You need it to share and reuse experiences from other tasks when trying to learn a new task.

So one example of this is that you can have a robot be able to manipulate a new object that it’s never seen before based on just one demonstration of how to manipulate that object from a human.

Ariel: Okay, thanks.

I want to move to a topic that is obviously of great interest to FLI and that is technical safety advances that occurred last year. Again in an email to me, you’ve both mentioned “inverse reward design” and “deep reinforcement learning for human preferences” as two areas related to the safety issue that were advanced last year. I was hoping you could both talk a little bit about what you saw happening last year that gives you hope for developing safer AI and beneficial AI.

Richard: So, as I mentioned, both inverse reward design and deep reinforcement learning from human preferences are exciting papers that came out this year.

So inverse reward design is where the AI system is trying to understand what the original designer or what the original user intends for the system to do. So it actually tries, if it’s in some new setting, a test setting where there are some potentially problematic new things that were introduced relative to the training time, then it tries specifically to back those out or to mitigate the effects of those, so that’s kind of exciting.

Deep reinforcement learning from human preferences is an algorithm for trying to very efficiently get feedback from humans based on trajectories in the context of reinforcement learning systems. So, these are systems that are trying to learn some way to plan, let’s say a path through a game environment or in general trying to learn a policy of what to do in a given scenario. This algorithm, deep RL from human preferences, shows little snippets of potential paths to humans and has them simply choose which are better, very similar to what goes on at an optometrist. Does A look better or does B look better? And just from that, very sophisticated behaviors can be learned from human preferences in a way that was not possible before in terms of scale.
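
A minimal sketch of the reward-modelling step Richard describes: a simulated “human” prefers whichever of two snippets has the higher hidden reward, and a model is fit to those comparisons alone using a Bradley-Terry (logistic) loss. The linear reward model and synthetic snippet features are assumptions for illustration; the actual paper uses deep networks and trajectory segments from RL environments.

```python
# Reward learning from pairwise preferences: fit reward weights so that the model's
# preferences over snippet pairs match the (simulated) human's choices.
import numpy as np

rng = np.random.default_rng(5)
true_w = np.array([2.0, -1.0, 0.5])                    # hidden "human" reward weights
snippets = rng.normal(size=(400, 3))                   # feature summaries of snippets

pairs = rng.integers(0, len(snippets), size=(2000, 2))
prefer_first = (snippets[pairs[:, 0]] @ true_w >
                snippets[pairs[:, 1]] @ true_w).astype(float)

w = np.zeros(3)                                        # learned reward weights
for _ in range(2000):                                  # gradient descent on logistic loss
    diff = (snippets[pairs[:, 0]] - snippets[pairs[:, 1]]) @ w
    p = 1.0 / (1.0 + np.exp(-diff))                    # P(model prefers the first snippet)
    grad = (snippets[pairs[:, 0]] - snippets[pairs[:, 1]]).T @ (p - prefer_first)
    w -= 0.05 * grad / len(pairs)

print("learned direction:", np.round(w / np.linalg.norm(w), 2))
print("true direction:   ", np.round(true_w / np.linalg.norm(true_w), 2))
```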

Ariel: Chelsea, is there anything that you wanted to add?

Chelsea: Yeah. So, in general, I guess, going back to AlphaZero and going back to games in general, there’s a very clear objective for achieving the goal, which is whether or not you won the game or your score at the game. It’s very clear what the objective is and what each system should be optimizing for. AlphaZero should be, like when playing Go should be optimizing for winning the game, and if a system is playing Atari games it should be optimizing for maximizing the score.

But in the real world, when you’re training systems, when you’re training agents to do things, when you’re training an AI to have a conversation with you, when you’re training a robot to set the table for you, there is no score function. The real world doesn’t just give you a score function, doesn’t tell you whether or not you’re winning or losing. And I think that this research is exciting and really important because it gives us another mechanism for telling robots, telling these AI systems how to do the tasks that we want them to do.

And for example, the human preferences work, it allows us, in sort of specifying some sort of goal that we want the robot to achieve or kind of giving it a demonstration of what we want the robot to achieve, or some sort of reward function, instead lets us say, “okay, this is not what I want, this is what I want,” throughout the process of learning. And then as a result, at the end you can basically guarantee that if it was able to optimize for your preferences successfully, then you’ll end up with behavior that you’re happy with.

Ariel: Excellent. So I’m sort of curious, before we started recording, Chelsea, you were telling me a little bit about your own research. Are you doing anything with this type of work? Or is your work a little different?

Chelsea: Yeah. So more recently I’ve been working on metalearning and so some of the metalearning works that I talked about previously, like learning just from a single demonstration and reusing data, reusing experience that you talked about previously, has been some of the things that I’ve been focusing on recently in terms of getting robots to be able to do things in the real world, such as manipulating objects, pushing objects around, using a spatula, stuff like that.

I’ve also done work on reinforcement learning where you essentially give a robot an objective, tell it to try to get the object as close as possible to the goal, and I think that the human preferences work provides a nice alternative to the classic setting, to the classic framework of reinforcement learning, that we could potentially apply to real robotic systems.

Ariel: Chelsea, I’m going to stick with you for one more question. In your list of breakthroughs that you’re excited about, one of the things that you mentioned is very near and dear to my heart, and that was better communication, and specifically better communication of the research. And I was hoping you could talk a little bit about some of the websites and methods of communicating that you saw develop and grow last year.

Chelsea: Yes. I think that more and more we’re seeing researchers put their work out in blog posts and try to make their work more accessible to the average user by explaining it in terms that are easier to understand, by motivating it in words that are easier for the average person to understand and I think that this is a great way to communicate the research in a clear way to a broader audience.

In addition, I’ve been quite excited about an effort, I think led by Chris Olah, on building what is called distill.pub. It’s a website and a journal, an academic journal, that tries to move away from this paradigm of publishing research on paper, on trees essentially. Because we have such rich digital technology that allows us to communicate in many different ways, it makes sense to move past just completely written forms of research dissemination. And I think that’s what distill.pub does, is it allows us, allows researchers to communicate research ideas in the form of animations, in the form of interactive demonstrations on a computer screen, and I think this is a big step forward and has a lot of potential in terms of moving forward the communication of research, the dissemination of research among the research community as well as beyond to people that are less familiar with the technical concepts in the field.

Ariel: That sounds awesome, Chelsea, thank you. And distill.pub is probably pretty straight forward, but we’ll still link to it on the post that goes along with this podcast if anyone wants to click straight through.

And Richard, I want to switch back over to you. You mentioned that there was more impressive output from GANs last year, generative adversarial networks.

Richard: Yes.

Ariel: Can you tell us what a generative adversarial network is?

Richard: So a generative adversarial network is an AI system where there are two parts, essentially a generator or creator that comes up with novel artifacts and a critic that tries to determine whether this is a good or legitimate or realistic type of thing that’s being generated. So both are learned in parallel as training data is streamed into the system, so in this way, the generator learns relatively efficiently how to create things that are good or realistic.
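
For readers who want to see that two-part structure in code, here is a deliberately tiny GAN sketch: a generator and a critic trained in alternation on a one-dimensional toy distribution. It assumes PyTorch is available and illustrates the general recipe only, not any particular published system; the network sizes, learning rates, and target distribution are arbitrary choices.

```python
# Tiny GAN sketch: the generator learns to mimic samples from a 1-D Gaussian while
# the discriminator ("critic") learns to tell real samples from generated ones.
import torch
import torch.nn as nn

real_batch = lambda n: torch.randn(n, 1) * 1.25 + 4.0     # the "real" data distribution
noise = lambda n: torch.randn(n, 1)                       # input to the generator

G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(5000):
    # 1) train the critic: score real samples as 1, generated samples as 0
    real, fake = real_batch(64), G(noise(64)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) train the generator: try to make the critic score its output as real
    g_loss = bce(D(G(noise(64))), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

with torch.no_grad():
    samples = G(noise(1000))
print(f"generated mean {samples.mean().item():.2f} (target 4.0), "
      f"std {samples.std().item():.2f}")
```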

Ariel: So I was hoping you could talk a little bit about what you saw there that was exciting.

Richard: Sure, so new architectures and new algorithms and simply more horsepower as well have led to more impressive output. Particularly exciting are conditional generative adversarial networks, where there can be structured biases or new types of inputs that one wants to base some output around.

Chelsea: Yeah, I mean, one thing to potentially add is that I think the research on GANs is really exciting and I think that it will not only make advances in generating images of realistic quality, but also in generating other types of things, like generating behavior potentially, or generating speech, or generating language. We haven’t seen as many advances in those areas as in generating images; thus far the most impressive advances have been in generating images. I think that those are areas to watch out for as well.

One thing to be concerned about in terms of GANs is the ability for people to generate fake images, fake videos of different events happening and putting those fake images and fake videos into the media, because while there might be ways to detect whether or not these images are made-up or are counterfeited essentially, the public might choose to believe something that they see. If you see something, you’re very likely to believe it, and this might exacerbate all of the, I guess, fake news issues that we’ve had recently.

Ariel: Yeah, so that actually brings up something that I did want to get into, and honestly, that, Chelsea, what you just talked about, is some of the scariest stuff I’ve seen, just because it seems like it has the potential to create sort of a domino effect of triggering all of these other problems just with one fake video. So I’m curious, how do we address something like that? Can we? And are there other issues that you’ve seen crop in the last year that also have you concerned?

Chelsea: I think there are potentially ways to address the problem in that if media websites, if it seems like it’s becoming a real danger in the imminent future, then I think that media websites, including social media websites, should take measures to try to be able to detect fake images and fake videos and either prevent them from being displayed or put a warning that it seems like it was detected as something that was fake, to explicitly try to mitigate the effects.

But, that said, I haven’t put that much thought into it. I do think it’s something that we should be concerned about, and the potential solution that I mentioned, I think that even if it can help solve some of the problems, I think that we don’t have a solution to the problem yet.

Ariel: Okay, thank you. I want to move on to the last question that I have that you both brought up, and that was, last year we saw an increased discussion of fairness in machine learning. And Chelsea, you mentioned there was a NIPS tutorial on this and the keynote mentioned it at NIPS as well. So I was hoping you could talk a bit about what that means, what we saw happen, and how you hope this will play out to better programs in the future.

Chelsea: So, there’s been a lot of discussion in how we can build machine-learning systems, build AI systems such that when they make decisions, they are fair and they aren’t biased. And all this discussion has been around fairness in machine learning, and actually one of the interesting things about the discussion from a technical point of view is how you even define fairness and how you define removing biases and such, because a lot of the biases are inherent to the data itself. And how you try to remove those biases can be a bit controversial.

Ariel: Can you give us some examples?

Chelsea: So one example is, if you’re trying to build an autonomous car system that is trying to avoid hitting pedestrians, and recognize pedestrians when appropriate and respond to them, then if these systems are trained in environments and in communities that are predominantly of one race, for example in Caucasian communities, and you then deploy this system in settings where there are people of color and in other environments that it hasn’t seen before, then the resulting system won’t have as good accuracy on settings that it hasn’t seen before and will be biased inherently, when it for example tries to recognize people of color, and this is a problem.

So some other examples of this are if machine learning systems are making decisions about who to give health insurance to, or speech recognition systems that are trying to recognize different people’s speech. If these systems are trained on a small part of the community that is not representative of the entire population as a whole, then they won’t be able to accurately make decisions about the entire population. Or if they’re trained on data that was collected by humans and that has the same biases as humans, then they will make the same mistakes: they will inherit the same biases that humans have.

I think that the people that have been researching fairness in machine learning systems, unfortunately one of the conclusions that they’ve made so far is that there isn’t just a one size fits all solution to all of these different problems, and in many cases we’ll have to think about fairness in individual contexts.
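
A small synthetic illustration of the data-representativeness problem Chelsea describes: a model trained on data dominated by one group fits that group well and performs much worse on an underrepresented group whose feature-label relationship differs. The data, groups, and model below are entirely made up for illustration and do not correspond to any real system.

```python
# Unrepresentative training data -> unequal accuracy across groups (synthetic toy).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

def make_group(n, flip):
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] * (-1 if flip else 1) + X[:, 1] > 0).astype(int)
    return X, y

Xa, ya = make_group(950, flip=False)         # group A: 95% of the training data
Xb, yb = make_group(50, flip=True)           # group B: only 5%
model = LogisticRegression().fit(np.vstack([Xa, Xb]), np.hstack([ya, yb]))

Xa_t, ya_t = make_group(1000, flip=False)    # equal-sized test sets for each group
Xb_t, yb_t = make_group(1000, flip=True)
print("accuracy on group A:", model.score(Xa_t, ya_t))   # high
print("accuracy on group B:", model.score(Xb_t, yb_t))   # much lower
```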

Richard: Chelsea, you mentioned that some of the remediations for fairness issues in machine learning are themselves controversial. Can you go into an example or so about that?

Chelsea: Yeah, I guess part of what I meant there is that even coming up with a definition for what is fair is unclear. It’s unclear what even the problem specification is, and without a problem specification, without a definition of what you want your system to be doing, creating a system that’s fair is a challenge if you don’t have a definition for what fair is.

Richard: I see.

Ariel: So then, my last question to you both, as we look towards 2018, what are you most excited or hopeful to see?

Richard: I’m very hopeful for the FLI grants program that we announced at the very end of 2017 leading to some very interesting and helpful AI safety papers and AI safety research in general that will build on past research and break new ground and will enable additional future research to be built on top of it to make the prospect of general intelligence safer and something that we don’t need to fear as much. But that is a hope.

Ariel: And Chelsea, what about you?

Chelsea: I think I’m excited to see where metalearning goes. I think that there’s a lot more people that are paying attention to it and starting to research into “learning to learn” topics. I’m also excited to see more advances in machine learning for robotics. I think that, unlike other fields in machine learning like machine translation, image recognition, et cetera, I think that robotics still has a long way to go in terms of being useful and solving a range of complex tasks and I hope that we can continue to make strides in machine learning for robotics in the coming year and beyond.

Ariel: Excellent. Well, thank you both so much for joining me today.

Richard: Sure, thank you.

Chelsea: Yeah, I enjoyed talking to you.

This podcast was edited by Tucker Davey.

Is There a Trade-off Between Immediate and Longer-term AI Safety Efforts?

Something I often hear in the machine learning community and media articles is “Worries about superintelligence are a distraction from the *real* problem X that we are facing today with AI” (where X = algorithmic bias, technological unemployment, interpretability, data privacy, etc). This competitive attitude gives the impression that immediate and longer-term safety concerns are in conflict. But is there actually a tradeoff between them?


We can make this question more specific: what resources might these two types of efforts be competing for?

Media attention. Given the abundance of media interest in AI, there have been a lot of articles about all these issues. Articles about advanced AI safety have mostly been alarmist Terminator-ridden pieces that ignore the complexities of the problem. This has understandably annoyed many AI researchers, and led some of them to dismiss these risks based on the caricature presented in the media instead of the real arguments. The overall effect of media attention towards advanced AI risk has been highly negative. I would be very happy if the media stopped writing about superintelligence altogether and focused on safety and ethics questions about today’s AI systems.

Funding. Much of the funding for advanced AI safety work currently comes from donors and organizations who are particularly interested in these problems, such as the Open Philanthropy Project and Elon Musk. They would be unlikely to fund safety work that doesn’t generalize to advanced AI systems, so their donations to advanced AI safety research are not taking funding away from immediate problems. On the contrary, FLI’s first grant program awarded some funding towards current issues with AI (such as economic and legal impacts). There isn’t a fixed pie of funding that immediate and longer-term safety are competing for – it’s more like two growing pies that don’t overlap very much. There has been an increasing amount of funding going into both fields, and hopefully this trend will continue.

Talent. The field of advanced AI safety has grown in recent years but is still very small, and the “brain drain” resulting from researchers going to work on it has so far been negligible. The motivations for working on current and longer-term problems tend to be different as well, and these problems often attract different kinds of people. For example, someone who primarily cares about social justice is more likely to work on algorithmic bias, while someone who primarily cares about the long-term future is more likely to work on superintelligence risks.

Overall, there does not seem to be much tradeoff in terms of funding or talent, and the media attention tradeoff could (in theory) be resolved by devoting essentially all the airtime to current concerns. Not only are these issues not in conflict – there are synergies between addressing them. Both benefit from fostering a culture in the AI research community of caring about social impact and being proactive about risks. Some safety problems are highly relevant both in the immediate and longer term, such as interpretability and adversarial examples. I think we need more people working on these problems for current systems while keeping scalability to more advanced future systems in mind.

AI safety problems are too important for the discussion to be derailed by status contests like “my issue is better than yours”. This kind of false dichotomy is itself a distraction from the shared goal of ensuring AI has a positive impact on the world, both now and in the future. People who care about the safety of current and future AI systems are natural allies – let’s support each other on the path towards this common goal.

This article originally appeared on the Deep Safety blog.

MIRI’s January 2018 Newsletter

Our 2017 fundraiser was a huge success, with 341 donors contributing a total of $2.5 million!

Some of the largest donations came from Ethereum inventor Vitalik Buterin, bitcoin investors Christian Calderon and Marius van Voorden, poker players Dan Smith and Tom and Martin Crowley (as part of a matching challenge), and the Berkeley Existential Risk Initiative. Thank you to everyone who contributed!

Research updates

General updates

News and links

AI Should Provide a Shared Benefit for as Many People as Possible

Shared Benefit Principle: AI technologies should benefit and empower as many people as possible.

Today, the combined wealth of the eight richest people in the world is greater than that of the poorest half of the global population. That is, 8 people have more than the combined wealth of 3,600,000,000 others.

This is already an extreme example of income inequality, but if we don’t prepare properly for artificial intelligence, the situation could get worse. In addition to the obvious economic benefits that would befall whoever designs advanced AI first, those who profit from AI will also likely have: access to better health care, happier and longer lives, more opportunities for their children, various forms of intelligence enhancement, and so on.

A Cultural Shift

Our approach to technology so far has been that whoever designs it first, wins — and they win big. In addition to the fabulous wealth an inventor can accrue, the creator of a new technology also assumes complete control over the product and its distribution. This means that an invention or algorithm will only benefit those whom the creator wants it to benefit. While this approach may have worked with previous inventions, many are concerned that advanced AI will be so powerful that we can’t treat it as business-as-usual.

What if we could ensure that as AI is developed we all benefit? Can we make a collective — and pre-emptive — decision to use AI to help raise up all people, rather than just a few?

Joshua Greene, a professor of psychology at Harvard, explains his take on this Principle: “We’re saying in advance, before we know who really has it, that this is not a private good. It will land in the hands of some private person, it will land in the hands of some private company, it will land in the hands of some nation first. But this principle is saying, ‘It’s not yours.’ That’s an important thing to say because the alternative is to say that potentially, the greatest power that humans ever develop belongs to whoever gets it first.”

AI researcher Susan Craw also agreed with the Principle, and she further clarified it.

“That’s definitely a yes,” Craw said, “But it is AI technologies plural, when it’s taken as a whole. Rather than saying that a particular technology should benefit lots of people, it’s that the different technologies should benefit and empower people.”

The Challenge of Implementation

However, as is the case with all of the Principles, agreeing with them is one thing; implementing them is another. John Havens, the Executive Director of The IEEE Global Initiative for Ethical Considerations in Artificial Intelligence and Autonomous Systems, considered how the Shared Benefit Principle would ultimately need to be modified so that the new technologies will benefit both developed and developing countries alike.

“Yes, it’s great,” Havens said of the Principle, before adding, “if you can put a comma after it, and say … something like, ‘issues of wealth, GDP, notwithstanding.’ The point being, what this infers is whatever someone can afford, it should still benefit them.”

Patrick Lin, a philosophy professor at California Polytechnic State University, was even more concerned about how the Principle might be implemented, mentioning the potential for unintended consequences.

Lin explained: “Shared benefit is interesting, because again, this is a principle that implies consequentialism, that we should think about ethics as satisfying the preferences or benefiting as many people as possible. That approach to ethics isn’t always right. … Consequentialism often makes sense, so weighing these pros and cons makes sense, but that’s not the only way of thinking about ethics. Consequentialism could fail you in many cases. For instance, consequentialism might green-light torturing or severely harming a small group of people if it gives rise to a net increase in overall happiness to the greater community.”

“That’s why I worry about the … Shared Benefit Principle,” Lin continued. “[It] makes sense, but [it] implicitly adopts a consequentialist framework, which by the way is very natural for engineers and technologists to use, so they’re very numbers-oriented and tend to think of things in black and white and pros and cons, but ethics is often squishy. You deal with these squishy, abstract concepts like rights and duties and obligations, and it’s hard to reduce those into algorithms or numbers that could be weighed and traded off.”

As we move from discussing these Principles as ideals to implementing them as policy, concerns such as those that Lin just expressed will have to be addressed, keeping possible downsides of consequentialism and utilitarianism in mind.

The Big Picture

The devil will always be in the details. As we consider how we might shift cultural norms to prevent all benefits going only to the creators of new technologies — as well as considering the possible problems that could arise if we do so — it’s important to remember why the Shared Benefit Principle is so critical. Roman Yampolskiy, an AI researcher at the University of Louisville, sums this up:

“Early access to superior decision-making tools is likely to amplify existing economic and power inequalities turning the rich into super-rich, permitting dictators to hold on to power and making oppositions’ efforts to change the system unlikely to succeed. Advanced artificial intelligence is likely to be helpful in medical research and genetic engineering in particular making significant life extension possible, which would remove one of the most powerful drivers of change and redistribution of power – death. For this and many other reasons, it is important that AI tech should be beneficial and empowering to all of humanity, making all of us wealthier and healthier.”

What Do You Think?

How important is the Shared Benefit Principle to you? How can we ensure that the benefits of new AI technologies are spread globally, rather than remaining with only a handful of people who developed them? How can we ensure that we don’t inadvertently create more problems in an effort to share the benefits of AI?

This article is part of a series on the 23 Asilomar AI Principles. The Principles offer a framework to help artificial intelligence benefit as many people as possible. But, as AI expert Toby Walsh said of the Principles, “Of course, it’s just a start. … a work in progress.” The Principles represent the beginning of a conversation, and now we need to follow up with broad discussion about each individual principle. You can read the discussions about previous principles here.

Deep Safety: NIPS 2017 Report

This year’s NIPS gave me a general sense that near-term AI safety is now mainstream and long-term safety is slowly going mainstream. On the near-term side, I particularly enjoyed Kate Crawford’s keynote on neglected problems in AI fairness, the ML security workshops, and the Interpretable ML symposium debate that addressed the “do we even need interpretability?” question in a somewhat sloppy but entertaining way. There was a lot of great content on the long-term side, including several oral / spotlight presentations and the Aligned AI workshop.

Value alignment papers

Inverse Reward Design (Hadfield-Menell et al) defines the problem of an RL agent inferring a human’s true reward function based on the proxy reward function designed by the human. This is different from inverse reinforcement learning, where the agent infers the reward function from human behavior. The paper proposes a method for IRD that models uncertainty about the true reward, assuming that the human chose a proxy reward that leads to the correct behavior in the training environment. For example, if a test environment unexpectedly includes lava, the agent assumes that a lava-avoiding reward function is as likely as a lava-indifferent or lava-seeking reward function, since they lead to the same behavior in the training environment. The agent then follows a risk-averse policy with respect to its uncertainty about the reward function.

The paper shows some encouraging results on toy environments for avoiding some types of side effects and reward hacking behavior, though it’s unclear how well they will generalize to more complex settings. For example, the approach to reward hacking relies on noticing disagreements between different sensors / features that agreed in the training environment, which might be much harder to pick up on in a complex environment. The method is also at risk of being overly risk-averse and avoiding anything new, whether it be lava or gold, so it would be great to see some approaches for safe exploration in this setting.
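
As a rough illustration of the idea (not the paper's implementation), the sketch below enumerates candidate "true" reward functions that agree with a hand-designed proxy on the features seen during training, then picks the trajectory with the best worst-case return over those candidates. The feature set, candidate weights, and risk-averse rule are all simplified placeholders.

```python
import itertools
import numpy as np

# Features of each cell type: [dirt, grass, lava, gold]
# (hypothetical toy setup, not the paper's actual environments)
FEATURES = {"dirt": np.array([1, 0, 0, 0]),
            "grass": np.array([0, 1, 0, 0]),
            "lava": np.array([0, 0, 1, 0]),
            "gold": np.array([0, 0, 0, 1])}

# Proxy reward chosen by the designer: prefers dirt over grass.
# Lava and gold never appeared in training, so their proxy weights (0)
# carry no information about the designer's true preferences.
proxy_w = np.array([1.0, -1.0, 0.0, 0.0])

# Candidate "true" reward functions: identical to the proxy on the
# features seen in training, but with unknown weights for lava/gold.
candidates = [np.array([1.0, -1.0, lava_w, gold_w])
              for lava_w, gold_w in itertools.product([-10.0, 0.0, 10.0], repeat=2)]

def trajectory_return(cells, w):
    """Sum of feature rewards along a trajectory of cell types."""
    return sum(float(FEATURES[c] @ w) for c in cells)

# Two candidate trajectories in a *test* environment that contains lava.
trajectories = {
    "shortcut_through_lava": ["dirt", "lava", "dirt"],
    "long_way_on_dirt": ["dirt", "dirt", "dirt", "dirt"],
}

# Risk-averse choice: maximize the worst-case return over candidate
# true rewards (a crude stand-in for the paper's planning method).
def worst_case_return(cells):
    return min(trajectory_return(cells, w) for w in candidates)

best = max(trajectories, key=lambda name: worst_case_return(trajectories[name]))
for name, cells in trajectories.items():
    print(name, "worst-case return:", worst_case_return(cells))
print("risk-averse choice:", best)
```

Under this toy setup the lava shortcut looks terrible in the worst case, so the risk-averse agent takes the long way; it would be equally hesitant about unfamiliar gold, which is exactly the over-conservatism noted above.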

Repeated Inverse RL (Amin et al) defines the problem of inferring intrinsic human preferences that incorporate safety criteria and are invariant across many tasks. The reward function for each task is a combination of the task-invariant intrinsic reward (unobserved by the agent) and a task-specific reward (observed by the agent). This multi-task setup helps address the identifiability problem in IRL, where different reward functions could produce the same behavior.

The authors propose an algorithm for inferring the intrinsic reward while minimizing the number of mistakes made by the agent. They prove an upper bound on the number of mistakes for the “active learning” case where the agent gets to choose the tasks, and show that a certain number of mistakes is inevitable when the agent cannot choose the tasks (there is no upper bound in that case). Thus, letting the agent choose the tasks that it’s trained on seems like a good idea, though it might also result in a selection of tasks that is less interpretable to humans.
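
A toy sketch of the flavor of this setup is below: the combined reward for each task is a hidden intrinsic reward plus an observed task-specific reward, and each mistake lets the agent eliminate candidate intrinsic rewards that would have made the same mistake. Everything here (linear rewards, the candidate set, the elimination rule) is an illustrative assumption, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: options are feature vectors; both the hidden
# intrinsic reward and the observed task reward are linear in features.
n_features, n_actions = 4, 5
true_intrinsic = np.array([2.0, -3.0, 0.5, 1.0])      # hidden from the agent

# Candidate intrinsic rewards the agent considers (a small version space);
# the true one is included so that at least one candidate survives.
candidates = [rng.normal(size=n_features) for _ in range(50)] + [true_intrinsic]
alive = list(range(len(candidates)))                    # still-consistent candidates

mistakes = 0
for task in range(200):
    task_reward = rng.normal(size=n_features)           # observed by the agent
    options = rng.normal(size=(n_actions, n_features))  # each option's features

    # Agent acts using some still-consistent candidate.
    w_hat = candidates[alive[0]]
    agent_choice = int(np.argmax(options @ (w_hat + task_reward)))

    # The human's actual optimum under the true combined reward.
    human_choice = int(np.argmax(options @ (true_intrinsic + task_reward)))

    if agent_choice != human_choice:
        mistakes += 1
        # Every mistake reveals information: drop candidates that would
        # also have ranked the agent's choice above the human's choice.
        alive = [i for i in alive
                 if (options[human_choice] @ (candidates[i] + task_reward))
                 >= (options[agent_choice] @ (candidates[i] + task_reward))]

print("mistakes made:", mistakes, "| candidates remaining:", len(alive))
```

Because each mistake rules out at least the candidate that caused it, the number of mistakes stays bounded by the size of the candidate set, which is the mistake-bound flavor of the result (here in a much cruder form).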

Deep RL from Human Preferences (Christiano et al) uses human feedback to teach deep RL agents about complex objectives that humans can evaluate but might not be able to demonstrate (e.g. a backflip). The human is shown two trajectory snippets of the agent’s behavior and selects which one more closely matches the objective. This method makes very efficient use of limited human feedback, scaling much better than previous methods and enabling the agent to learn much more complex objectives (as shown in MuJoCo and Atari).
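
The core of the method is a reward model trained on pairwise comparisons. A minimal sketch of that training loop, using a Bradley-Terry-style loss on synthetic data, might look as follows; the network size, segment length, and fake preference labels are placeholders.

```python
import torch
import torch.nn as nn

# Minimal sketch of learning a reward model from pairwise human
# preferences; sizes and data here are illustrative placeholders.
obs_dim = 8

reward_model = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def segment_return(segment):
    """Predicted total reward of a trajectory segment of shape (T, obs_dim)."""
    return reward_model(segment).sum()

# Fake batch of comparisons: (segment_a, segment_b, label), where
# label = 1.0 means the human preferred segment_a and 0.0 means segment_b.
batch = [(torch.randn(25, obs_dim), torch.randn(25, obs_dim),
          torch.tensor(float(torch.rand(()) > 0.5))) for _ in range(32)]

for seg_a, seg_b, label in batch:
    # P(a preferred over b) under a Bradley-Terry model of the comparisons.
    logit = segment_return(seg_a) - segment_return(seg_b)
    loss = nn.functional.binary_cross_entropy_with_logits(logit, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned reward model would then supply the reward signal for a
# standard deep RL algorithm (e.g., A2C or PPO) in place of the true reward.
```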

Dynamic Safe Interruptibility for Decentralized Multi-Agent RL (El Mhamdi et al) generalizes the safe interruptibility problem to the multi-agent setting. Non-interruptible dynamics can arise in a group of agents even if each agent individually is indifferent to interruptions. This can happen if Agent B is affected by interruptions of Agent A and is thus incentivized to prevent A from being interrupted (e.g. if the agents are self-driving cars and A is in front of B on the road). The multi-agent definition focuses on preserving the system dynamics in the presence of interruptions, rather than on converging to an optimal policy, which is difficult to guarantee in a multi-agent setting.

Aligned AI workshop

This was a more long-term-focused version of the Reliable ML in the Wild workshop held in previous years. There were many great talks and posters there – my favorite talks were Ian Goodfellow’s “Adversarial Robustness for Aligned AI” and Gillian Hadfield’s “Incomplete Contracting and AI Alignment”.

Ian made the case for ML security being important to long-term AI safety. The effectiveness of adversarial examples is problematic not only from the near-term perspective of current ML systems (such as self-driving cars) being fooled by bad actors. It’s also bad news from the long-term perspective of aligning the values of an advanced agent, which could inadvertently seek out adversarial examples for its reward function due to Goodhart’s law. Relying on the agent’s uncertainty about the environment or human preferences is not sufficient to ensure safety, since adversarial examples can cause the agent to have arbitrarily high confidence in the wrong answer.
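
To make the Goodhart's-law concern concrete, the toy sketch below runs gradient ascent on the input to a learned reward model (here just a randomly initialized placeholder network), finding a small perturbation that inflates the predicted reward; an agent optimizing against such a model could stumble into similar inputs on its own.

```python
import torch
import torch.nn as nn

# Toy illustration of reward hacking via adversarial inputs: gradient
# ascent on the *input* to a learned reward model. The reward model
# (randomly initialized) and observation are placeholders.
obs_dim = 16
reward_model = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))

observation = torch.zeros(obs_dim)               # a benign starting observation
perturbation = torch.zeros(obs_dim, requires_grad=True)
epsilon = 0.1                                    # perturbation budget

optimizer = torch.optim.SGD([perturbation], lr=0.05)
for step in range(100):
    optimizer.zero_grad()
    predicted_reward = reward_model(observation + perturbation).sum()
    (-predicted_reward).backward()               # ascend on the predicted reward
    optimizer.step()
    with torch.no_grad():                        # keep the perturbation small
        perturbation.clamp_(-epsilon, epsilon)

print("predicted reward before:", reward_model(observation).sum().item())
print("predicted reward after :", reward_model(observation + perturbation).sum().item())
```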

Gillian approached AI safety from an economics perspective, drawing parallels between specifying objectives for artificial agents and designing contracts for humans. The same issues that make contracts incomplete (the designer’s inability to consider all relevant contingencies or precisely specify the variables involved, and incentives for the parties to game the system) lead to side effects and reward hacking for artificial agents.

The central question of the talk was how we can use insights from incomplete contracting theory to better understand and systematically solve specification problems in AI safety, which is a really interesting research direction. The objective specification problem seems even harder to me than the incomplete contract problem, since the contract design process relies on some level of shared common sense between the humans involved, which artificial agents do not currently possess.

Interpretability for AI safety

I gave a talk at the Interpretable ML symposium on connections between interpretability and long-term safety, which explored what forms of interpretability could help make progress on safety problems (slides, video). Understanding our systems better can help ensure that safe behavior generalizes to new situations, and it can help identify causes of unsafe behavior when it does occur.

For example, if we want to build an agent that’s indifferent to being switched off, it would be helpful to see whether the agent has representations that correspond to an off-switch, and whether they are used in its decisions. Side effects and safe exploration problems would benefit from identifying representations that correspond to irreversible states (like “broken” or “stuck”). While existing work on examining the representations of neural networks focuses on visualizations, safety-relevant concepts are often difficult to visualize.
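
One simple way to test for such representations is a linear probe on the agent's hidden activations. The sketch below uses random placeholder activations and labels, but the same recipe would apply to activations logged from a real agent in environments labeled for the presence of an off-switch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of a linear probe: is an "off-switch present" concept linearly
# decodable from the agent's hidden activations? Data here is random
# placeholder data standing in for logged activations and environment labels.
rng = np.random.default_rng(0)

n_states, hidden_dim = 1000, 128
activations = rng.normal(size=(n_states, hidden_dim))    # agent's hidden states
off_switch_present = rng.integers(0, 2, size=n_states)   # labels from the env

# Fit the probe on half the data, evaluate on the rest.
split = n_states // 2
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:split], off_switch_present[:split])
accuracy = probe.score(activations[split:], off_switch_present[split:])

# Accuracy well above chance would suggest the concept is encoded in the
# representation; with random data it stays near 0.5.
print("probe accuracy:", accuracy)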

Local interpretability techniques that explain specific predictions or decisions are also useful for safety. We could examine whether features that are idiosyncratic to the training environment or indicate proximity to dangerous states influence the agent’s decisions. If the agent can produce a natural language explanation of its actions, how does it explain problematic behavior like reward hacking or going out of its way to disable the off-switch?
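
A minimal example of a local explanation is gradient saliency for a single decision: how sensitive the chosen action's score is to each input feature. The policy network and observation below are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Sketch of a local explanation: gradient saliency of a policy's
# chosen-action logit with respect to the input features.
obs_dim, n_actions = 12, 4
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))

observation = torch.randn(obs_dim, requires_grad=True)
logits = policy(observation)
chosen_action = int(logits.argmax())

# How sensitive is the chosen action's score to each input feature?
logits[chosen_action].backward()
saliency = observation.grad.abs()

# Features with the largest saliency most influenced this decision; high
# saliency on features idiosyncratic to the training environment would be
# a warning sign that the behavior may not generalize.
print("chosen action:", chosen_action)
print("top features :", saliency.argsort(descending=True)[:3].tolist())
```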

There are many ways in which interpretability can be useful for safety. Somewhat less obvious is what safety can do for interpretability: serving as grounding for interpretability questions. As exemplified by the final debate of the symposium, there is an ongoing conversation in the ML community trying to pin down the fuzzy idea of interpretability – what is it, do we even need it, what kind of understanding is useful, etc. I think it’s important to keep in mind that our desire for interpretability is to some extent motivated by our systems being fallible – understanding our AI systems would be less important if they were 100% robust and made no mistakes. From the safety perspective, we can define interpretability as the kind of understanding that helps us ensure the safety of our systems.

For those interested in applying the interpretability hammer to the safety nail, or working on other long-term safety questions, FLI has recently announced a new grant program. Now is a great time for the AI field to think deeply about value alignment. As Pieter Abbeel said at the end of his keynote, “Once you build really good AI contraptions, how do you make sure they align their value system with our value system? Because at some point, they might be smarter than us, and it might be important that they actually care about what we care about.”

(Thanks to Janos Kramar for his feedback on this post, and to everyone at DeepMind who gave feedback on the interpretability talk.)

This article was originally posted here.

Research for Beneficial Artificial Intelligence

Research Goal: The goal of AI research should be to create not undirected intelligence, but beneficial intelligence.

It’s no coincidence that the first Asilomar Principle is about research. On the face of it, the Research Goal Principle may not seem as glamorous or exciting as some of the other Principles that more directly address how we’ll interact with AI and the impact of superintelligence. But it’s from this first Principle that all of the others are derived.

Simply put, without AI research and without specific goals by researchers, AI cannot be developed. However, participating in research and working toward broad AI goals without considering the possible long-term effects of the research could be detrimental to society.

There’s a scene in Jurassic Park, in which Jeff Goldblum’s character laments that the scientists who created the dinosaurs “were so preoccupied with whether or not they could that they didn’t stop to think if they should.” Until recently, AI researchers have also focused primarily on figuring out what they could accomplish, without longer-term considerations, and for good reason: scientists were just trying to get their AI programs to work at all, and the results were far too limited to pose any kind of threat.

But in the last few years, scientists have made great headway with artificial intelligence. The impacts of AI on society are already being felt, and as we’re seeing with some of the issues of bias and discrimination that are already popping up, this isn’t always good.

Attitude Shift

Unfortunately, there’s still a culture within AI research that’s too accepting of the idea that the developers aren’t responsible for how their products are used. Stuart Russell compares this attitude to that of civil engineers, who would never be allowed to say something like, “I just design the bridge; someone else can worry about whether it stays up.”

Joshua Greene, a psychologist from Harvard, agrees. He explains:

“I think that is a bookend to the Common Good Principle [#23] – the idea that it’s not okay to be neutral. It’s not okay to say, ‘I just make tools and someone else decides whether they’re used for good or ill.’ If you’re participating in the process of making these enormously powerful tools, you have a responsibility to do what you can to make sure that this is being pushed in a generally beneficial direction. With AI, everyone who’s involved has a responsibility to be pushing it in a positive direction, because if it’s always somebody else’s problem, that’s a recipe for letting things take the path of least resistance, which is to put the power in the hands of the already powerful so that they can become even more powerful and benefit themselves.”

What’s Beneficial?

Other AI experts I spoke with agreed with the general idea of the Principle, but didn’t quite see eye to eye on how it was worded. Patrick Lin, for example, was concerned about the use of the word “beneficial” and what it meant, while John Havens appreciated the word precisely because it forces us to consider what “beneficial” means in this context.

“I generally agree with this research goal,” explained Lin, a philosopher at Cal Poly. “Given the potential of AI to be misused or abused, it’s important to have a specific positive goal in mind. I think where it might get hung up is what this word ‘beneficial’ means. If we’re directing it towards beneficial intelligence, we’ve got to define our terms; we’ve got to define what beneficial means, and that to me isn’t clear. It means different things to different people, and it’s rare that you could benefit everybody.”

Meanwhile, Havens, the Executive Director of The IEEE Global Initiative for Ethical Considerations in Artificial Intelligence and Autonomous Systems, was pleased the word forced the conversation.

“I love the word beneficial,” Havens said. “I think sometimes inherently people think that intelligence, in one sense, is always positive. Meaning, because something can be intelligent, or autonomous, and that can advance technology, that that is a ‘good thing’. Whereas the modifier ‘beneficial’ is excellent, because you have to define: What do you mean by beneficial? And then, hopefully, it gets more specific, and it’s: Who is it beneficial for? And, ultimately, what are you prioritizing? So I love the word beneficial.”

AI researcher Susan Craw, a professor at Robert Gordon University, also agreed with the Principle but questioned the way it was phrased.

“Yes, I agree with that,” Craw said, before adding, “I think it’s a little strange the way it’s worded, because of ‘undirected.’ It might even be better the other way around, which is, it would be better to create beneficial research, because that’s a more well-defined thing.”

Long-term Research

Roman Yampolskiy, an AI researcher at the University of Louisville, brings the discussion back to the issues of most concern for FLI:

“The universe of possible intelligent agents is infinite with respect to both architectures and goals. It is not enough to simply attempt to design a capable intelligence, it is important to explicitly aim for an intelligence that is in alignment with goals of humanity. This is a very narrow target in a vast sea of possible goals and so most intelligent agents would not make a good optimizer for our values resulting in a malevolent or at least indifferent AI (which is likewise very dangerous). It is only by aligning future superintelligence with our true goals, that we can get significant benefit out of our intellectual heirs and avoid existential catastrophe.”

And with that in mind, we’re excited to announce we’ve launched a new round of grants! If you haven’t seen the Request for Proposals (RFP) yet, you can find it here. The focus of this RFP is on technical research or other projects enabling development of AI that is beneficial to society, and robust in the sense that the benefits are somewhat guaranteed: our AI systems must do what we want them to do.

If you’re a researcher interested in the field of AI, we encourage you to review the RFP and consider applying.

This article is part of a series on the 23 Asilomar AI Principles. The Principles offer a framework to help artificial intelligence benefit as many people as possible. But, as AI expert Toby Walsh said of the Principles, “Of course, it’s just a start. … a work in progress.” The Principles represent the beginning of a conversation, and now we need to follow up with broad discussion about each individual principle. You can read the discussions about previous principles here.