AI Researcher Stuart Russell

September 30, 2016
Revathi Kumar


AI Safety Research

Stuart Russell

Professor of Computer Science and Smith-Zadeh Professor in Engineering

University of California, Berkeley

Project: Value Alignment and Moral Metareasoning

Amount Recommended: $342,727

Project Summary

Developing AI systems that are benevolent towards humanity requires making sure that those systems know what humans want. People routinely make inferences about the preferences of others and use those inferences as the basis for helping one another. This project aims to give AI systems a similar ability to learn from observations, in order to better align the values of those systems with those of humans. Doing so requires dealing with some significant challenges: If we ultimately develop AI systems that can reason better than humans, how do we make sure that those AI systems are able to take human limitations into account? The fact that we haven’t yet cured cancer shouldn’t be taken as evidence that we don’t really care about it. Furthermore, once we have made an AI system that can reason about human preferences, that system then has to trade off time spent deliberating about the right course of action against the need to act as quickly as possible – it needs to deal with its own computational limitations as it makes decisions. We aim to address both of these challenges by examining how intelligent agents (be they humans or computers) should make these tradeoffs.

Technical Abstract

AI research has focused on improving the decision-making capabilities of computers, i.e., the ability to select high-quality actions in pursuit of a given objective. When the objective is aligned with the values of the human race, this can lead to tremendous benefits. When the objective is misaligned, improving the AI system’s decision-making may lead to worse outcomes for the human race. The objectives of the proposed research are (1) to create a mathematical framework in which fundamental questions of value alignment can be investigated; (2) to develop and experiment with methods for aligning the values of a machine (whether explicitly or implicitly represented) with those of humans; (3) to understand the relationships among the degree of value alignment, the decision-making capability of the machine, and the potential loss to the human; and (4) to understand in particular the implications of the computational limitations of humans and machines for value alignment. The core of our technical approach will be a cooperative, game-theoretic extension of inverse reinforcement learning, allowing for the different action spaces of humans and machines and the varying motivations of humans; the concepts of rational metareasoning and bounded optimality will inform our investigation of the effects of computational limitations.

Artificial Intelligence and the King Midas Problem

Value alignment. It’s a phrase that often pops up in discussions about the safety and ethics of artificial intelligence. How can scientists create AI with goals and values that align with those of the people it interacts with?

Very simple robots with very constrained tasks do not need goals or values at all. Although the Roomba’s designers know you want a clean floor, Roomba doesn’t: it simply executes a procedure that the Roomba’s designers predict will work—most of the time. If your kitten leaves a messy pile on the carpet, Roomba will dutifully smear it all over the living room. If we keep programming smarter and smarter robots, then by the late 2020s, you may be able to ask your wonderful domestic robot to cook a tasty, high-protein dinner. But if you forgot to buy any meat, you may come home to a hot meal but find the aforementioned cat has mysteriously vanished. The robot, designed for chores, doesn’t understand that the sentimental value of the cat exceeds its nutritional value.

AI and King Midas

Stuart Russell, a renowned AI researcher, compares the challenge of defining a robot’s objective to the King Midas myth. “The robot,” says Russell, “has some objective and pursues it brilliantly to the destruction of mankind. And it’s because it’s the wrong objective. It’s the old King Midas problem.”

This is one of the big problems in AI safety that Russell is trying to solve. “We’ve got to get the right objective,” he explains, “and since we don’t seem to know how to program it, the right answer seems to be that the robot should learn – from interacting with and watching humans – what it is humans care about.”

Russell works from the assumption that the robot will solve whatever formal problem we define. Rather than assuming that the robot should optimize a given objective, Russell defines the problem as a two-player game (“game” as used by economists, meaning a decision problem with multiple agents) called cooperative inverse reinforcement learning (CIRL).

A CIRL game includes a person and a robot: the robot’s only purpose is to make the person happy, but it doesn’t know what the person wants. Fortunately, it can learn more about what the person wants by observing her behavior. For example, if a robot observed the human’s morning routine, it should discover how important coffee is—not to itself, of course (we don’t want robots drinking coffee), but to the human. Then, it will make coffee for the person without being asked.
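To make the idea of learning from observed behavior concrete, here is a minimal sketch of Bayesian preference inference. It is not the algorithm from the CIRL paper: the three candidate objectives, the reward numbers, and the softmax ("noisily rational") model of the person are all assumptions chosen for illustration.

```python
import math

# Hypothetical candidate objectives: how much the person values each morning action.
HYPOTHESES = {
    "loves_coffee": {"make_coffee": 2.0, "make_tea": 0.0, "skip_drink": 0.0},
    "loves_tea": {"make_coffee": 0.0, "make_tea": 2.0, "skip_drink": 0.0},
    "indifferent": {"make_coffee": 0.0, "make_tea": 0.0, "skip_drink": 0.0},
}


def action_likelihood(action, rewards, rationality=1.0):
    """P(action | objective), assuming a softmax (noisily rational) human."""
    weights = {a: math.exp(rationality * r) for a, r in rewards.items()}
    return weights[action] / sum(weights.values())


def update_posterior(prior, action):
    """One Bayes update of the robot's belief after seeing one human action."""
    unnormalized = {
        h: prior[h] * action_likelihood(action, HYPOTHESES[h]) for h in prior
    }
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


def infer_preferences(observations):
    """Start from a uniform prior and fold in each observed action."""
    posterior = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}
    for action in observations:
        posterior = update_posterior(posterior, action)
    return posterior


def best_helpful_action(posterior):
    """Pick the action with the highest expected reward under the belief."""
    actions = list(next(iter(HYPOTHESES.values())))

    def expected_reward(action):
        return sum(posterior[h] * HYPOTHESES[h][action] for h in posterior)

    return max(actions, key=expected_reward)


# The robot watches five mornings in which the person makes coffee.
posterior = infer_preferences(["make_coffee"] * 5)
print(posterior)
print(best_helpful_action(posterior))
```

After a handful of coffee-making mornings, most of the robot's belief sits on the coffee-loving hypothesis, so the action it expects to help most is making coffee. The robot never receives the objective directly; it only narrows down which objective best explains what it saw.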

What about the example of King Midas, who thought he wanted as much gold as possible? How does the approach deal with someone stating their wishes incorrectly?

“We don’t have to know what we want. That’s the whole point. It’s hard to know what we want, and what we ask for is evidence for what we really want, but it’s not ground truth,” says Russell.

The robot should learn what people want by watching their actions. Then, if someone asks for something that’s technically incorrect, the robot will understand their underlying wish. Russell adds, “The robot—if it had been there—would have told Midas that he didn’t really want everything turned to gold, maybe just a few choice objects that he might point at from time to time and say the magic word.”

AI Off-Switch

Russell and his Berkeley colleagues also recently announced further progress toward ensuring safe AI, with a paper on ensuring an AI’s off-switch is always accessible. An off-switch is often suggested to guarantee a robot doesn’t harm humans – if the robot goes bad, just turn it off! But what if the AI anticipates that and disables its off-switch?

Russell and his team propose solving this by providing a level of uncertainty in the robot’s objective. They created another CIRL game in which the robot chooses to announce its intentions, which gives the human a chance to switch it off, or to plunge ahead with its plan, bypassing human oversight.

They showed that a robot with a fixed objective will take the second choice and disable its off-switch. But if there is uncertainty about the objective, the robot will reason that a human decision to switch the robot off would imply the announced intention is undesirable. Because its goal is to make the human happy, the robot accepts being switched off. The robot has a positive incentive not to disable the off-switch, and that incentive is directly tied to the robot’s uncertainty about the human’s true objective.
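The link between uncertainty and the off-switch can be illustrated with a small expected-value calculation. This is a simplified sketch rather than the model from the paper: it assumes the human is perfectly rational (she allows the plan exactly when it is good for her), that being switched off is worth zero, and the belief numbers are made up.

```python
def expected(belief, f=lambda u: u):
    """Expected value of f(u) under a belief given as (utility, probability) pairs."""
    return sum(p * f(u) for u, p in belief)


def off_switch_values(belief):
    """Value of each robot option under its belief about the plan's utility to the human."""
    return {
        # Plunge ahead and bypass oversight: the robot gets whatever the plan is worth.
        "act_now": expected(belief),
        # Announce the plan: an (assumed) rational human vetoes it whenever it is harmful.
        "defer": expected(belief, lambda u: max(u, 0.0)),
        # Switch itself off: nothing happens.
        "switch_off": 0.0,
    }


def incentive_to_defer(belief):
    """How much the robot gains by keeping the human in the loop."""
    values = off_switch_values(belief)
    return values["defer"] - max(values["act_now"], values["switch_off"])


# Fixed objective: the robot is certain its plan is worth 1.0 to the human.
certain = [(1.0, 1.0)]
# Uncertain objective: the plan is either good (+3.0) or harmful (-1.0).
uncertain = [(3.0, 0.5), (-1.0, 0.5)]

print(incentive_to_defer(certain))    # 0.0
print(incentive_to_defer(uncertain))  # 0.5
```

When the robot is certain, deferring gains it nothing, so it has no reason to keep the off-switch available. When it is uncertain, the human's veto filters out the harmful case, and that is the source of the positive incentive to leave the switch alone.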

Ensuring AI Safety

Beyond his research, Russell is one of the most vocal and active AI safety researchers working to strengthen public understanding of the potential issues surrounding AI development.

He recently co-authored a rebuttal to an article in the MIT Technology Review, which claimed that real AI scientists weren’t worried about the existential threat of AI. Russell and his co-author summed up why it’s better to be cautious and careful than just assume all will turn out for the best:

“Our experience with Chernobyl suggests it may be unwise to claim that a powerful technology entails no risks. It may also be unwise to claim that a powerful technology will never come to fruition. On September 11, 1933, Lord Rutherford, perhaps the world’s most eminent nuclear physicist, described the prospect of extracting energy from atoms as nothing but ‘moonshine.’ Less than 24 hours later, Leo Szilard invented the neutron-induced nuclear chain reaction; detailed designs for nuclear reactors and nuclear weapons followed a few years later. Surely it is better to anticipate human ingenuity than to underestimate it, better to acknowledge the risks than to deny them. … [T]he risk arises from the unpredictability and potential irreversibility of deploying an optimization process more intelligent than the humans who specified its objectives.”

This summer, Russell received a grant of over $5.5 million from the Open Philanthropy Project for a new research center, the Center for Human-Compatible Artificial Intelligence, in Berkeley. Among the primary objectives of the Center will be to study this problem of value alignment, to continue his efforts toward provably beneficial AI, and to ensure we don’t make the same mistakes as King Midas.

“Look,” he says, “if you were King Midas, would you want your robot to say, ‘Everything turns to gold? OK, boss, you got it.’ No! You’d want it to say, ‘Are you sure? Including your food, drink, and relatives? I’m pretty sure you wouldn’t like that. How about this: you point to something and say ‘Abracadabra Aurificio’ or something, and then I’ll turn it to gold, OK?’”

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.


  1. Bai, Aijun and Russell, Stuart. Markovian State and Action Abstractions in Monte Carlo Tree Search. In Proc. IJCAI-16, New York, 2016.
  2. Hadfield-Menell, Dylan, et al. Cooperative Inverse Reinforcement Learning. Neural Information Processing Systems (NIPS), 2016.
    • These researchers have created a new formal model of cooperative inverse reinforcement learning (CIRL). A CIRL model is a two-player game in which a human H has an objective while a robot R has the aim of maximizing H’s objective but doesn’t initially know what it is. They have proved that optimal CIRL solutions are distinct from the “optimal” behavior demonstrations assumed in IRL. They have also shown that, under reasonable assumptions, the robot’s posterior over H’s objective is a sufficient statistic for R’s optimal policy, and they have derived a basic algorithm for solving CIRL games of this type.
  3. Liu, C., et al. Goal inference improves objective and perceived performance in human-robot collaboration. In Proc. AAMAS-16, Singapore, 2016.


  1. Russell, Stuart. Moral Philosophy Will Become Part of the Tech Industry. Time, September 15, 2015.
  2. Russell, Stuart. Should we fear super smart robots? Scientific American, 314, 58-59, June 2016.

Course Materials

Course Names: 

  1. “Human-Compatible AI” – Graduate Course, Spring 2016
    • Approximately 20 PhD students (from several disciplines) and 5 undergraduates participated. Since then they have been running weekly group meetings with this topic as a focus.


  1. Colloquium Series on Robust and Beneficial AI (CSRBAI): May 27-June 17, 2016
  2. Control and Responsible Innovation in the Development of Autonomous Systems Workshop: April 24-26, 2016. The Hastings Center, Garrison, NY.
    • The four co-chairs (Gary Marchant, Stuart Russell, Bart Selman, and Wendell Wallach) and The Hastings Center staff (particularly Mildred Solomon and Greg Kaebnick) designed this first workshop. Twenty-five participants attended.
    • This workshop was focused on exposing participants to relevant research progressing in an array of fields, stimulating extended reflection upon key issues and beginning a process of dismantling intellectual silos and loosely knitting the represented disciplines into a transdisciplinary community.
    • The workshop included representatives from key institutions that have entered this space, including IEEE, the Office of Naval Research, the World Economic Forum, and of course AAAI.
      • They are planning a second workshop, scheduled for October 30-November 1, 2016
        • The invitees for the second workshop are primarily scientists, but also include social theorists, legal scholars, philosophers, and ethicists. The expertise of the social scientists will be drawn upon in clarifying the application of research in cognitive science and legal and ethical theory to the development of autonomous systems. Not all of the invitees to the second workshop have considered the challenge of developing beneficial trustworthy artificial agents. However, these researchers believe that they are bringing together brilliant and creative minds to collectively address this challenge. They hope that scientific and intellectual leaders, new to the challenge and participating in the second workshop, will take on the development of beneficial, robust, safe, and controllable AI as a serious research agenda.


  1. “The long-term future of (artificial) intelligence”, invited lecture, Software Alliance Annual Meeting, Napa, Nov 13, 2015
  2. “The Future of AI and the Human Race”, TEDx talk, Berkeley, Nov 8, 2015
  3. “Value Alignment”, invited lecture, Workshop on Algorithms for Human-Robot Interaction, Nov 18, 2015
  4. “Killer Robots, the End of Humanity, and All That”, Award Lecture, World Technology Awards, New York, Nov 2015
  5. “Should we Fear or Welcome the Singularity?”, panel presentation, Nobel Week Dialogue, December 2015
  6. “The Future of Human-Computer Interaction”, panel presentation (chair), Nobel Week Dialogue, December 2015
  7. “The Future Development of AI”, panel presentation, Nobel Week Dialogue, December 2015
  8. “Some thoughts on the future”, invited lecture, NYU AI Symposium, January 2016
  9. “The State of AI”, televised panel presentation, World Economic Forum, Davos, January 2016
  10. “AI: Friend or Foe?”, panel presentation, World Economic Forum, Davos, January 2016
  11. “The long-term future of (artificial) intelligence”, CERN Colloquium, Geneva, Jan 16, 2016
  12. “Some thoughts on the future”, invited presentation, National Intelligence Council, Berkeley, Jan 28, 2016
  13. “The long-term future of (artificial) intelligence”, Herbst Lecture, University of Colorado, Boulder, March 11, 2016
  14. “The Future of AI”, Keynote Lecture, Annual Ethics Forum, California State University Monterey Bay, March 16, 2016
  15. “The long-term future of (artificial) intelligence”, IARPA Colloquium, Washington DC, March 21, 2016
  16. “AI: Friend or Foe?”, panel presentation, Milken Global Institute, Los Angeles, May 2, 2016
  17. “Will Superintelligent Robots Make Us Better People?”, Keynote Lecture (televised), Seoul Digital Forum, South Korea, May 19, 2016
  18. “The long-term future of (artificial) intelligence”, Keynote Lecture, Strata Big Data Conference, London, June 2, 2016
  19. “Moral Economy of Technology”, panel presentation, Annual Meeting of the Society for the Advancement of Socio-Economics, Berkeley, June 2016

This content was first published on September 30, 2016.
