AI Safety Research

Owain Evans

FHI-Amlin Postdoctoral Research Fellow

University of Oxford

Project: Inferring Human Values: Learning “Ought”, not “Is”

Amount Recommended: $227,212

Project Summary

Previous work in economics and AI has developed mathematical models of preferences or values, along with computer algorithms for inferring preferences from observed human choices. We would like to use such algorithms to enable AI systems to learn human preferences by observing humans make real-world choices. However, these algorithms rely on an assumption that humans make optimal plans and take optimal actions in all circumstances. This is typically false for humans. For example, people’s route planning is often worse than Google Maps, because we can’t number-crunch as many possible paths. Humans can also be inconsistent over time, as we see in procrastination and impulsive behavior. Our project seeks to develop algorithms that learn human preferences from data despite humans not being homo-economicus and despite the influence of non-rational impulses. We will test our algorithms on real-world data and compare their inferences to people’s own judgments about their preferences. We will also investigate the theoretical question of whether this approach could enable an AI to learn the entirety of human values.

Technical Abstract

Previous work in economics and AI has developed mathematical models of preferences, along with algorithms for inferring preferences from observed actions. We would like to use such algorithms to enable AI systems to learn human preferences from observed actions. However, these algorithms typically assume that agents take actions that maximize expected utility given their preferences. This assumption of optimality is false for humans in real-world domains. Optimal sequential planning is intractable in complex environments and humans perform very rough approximations. Humans often don’t know the causal structure of their environment (in contrast to MDP models). Humans are also subject to dynamic inconsistencies, as observed in procrastination, addiction and in impulsive behavior. Our project seeks to develop algorithms that learn human preferences from data despite the suboptimality of humans and the behavioral biases that influence human choice. We will test our algorithms on real-world data and compare their inferences to people’s own judgments about their preferences. We will also investigate the theoretical question of whether this approach could enable an AI to learn the entirety of human values.
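
To make the optimality assumption concrete, here is a minimal sketch of preference inference from observed choices. It is not the project's own code. It assumes a "noisy-rational" agent whose choice probabilities follow a softmax over utilities, which is one common way to relax strict expected-utility maximization. The two hypotheses, the option names and the `alpha` noise parameter are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical utilities under two candidate preference hypotheses
# (illustrative only; not taken from the project's data or models).
HYPOTHESES = {
    "prefers_donut": {"donut_shop": 1.0, "salad_bar": 0.0},
    "prefers_salad": {"donut_shop": 0.0, "salad_bar": 1.0},
}


def action_probabilities(utilities, alpha):
    """Softmax choice rule: higher-utility actions are chosen more often.

    As alpha grows, the agent approaches a strict utility maximizer.
    """
    weights = {a: math.exp(alpha * u) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def posterior(observed_actions, alpha=2.0):
    """Bayesian posterior over hypotheses, starting from a uniform prior."""
    scores = {}
    for name, utilities in HYPOTHESES.items():
        probs = action_probabilities(utilities, alpha)
        likelihood = 1.0
        for action in observed_actions:
            likelihood *= probs[action]
        scores[name] = likelihood / len(HYPOTHESES)  # uniform prior
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


print(posterior(["salad_bar", "salad_bar", "salad_bar"]))
```

Under this likelihood, every observed choice counts as evidence about preferences alone. If the person chose the salad bar because of a false belief or an impulse, the inference would be confidently wrong, which is the failure mode the project targets.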


Publications

  1. Evans, Owain, et al. Learning the Preferences of Ignorant, Inconsistent Agents. 2015.
    • In this paper, the authors explain why it is difficult for IRL (inverse reinforcement learning) to learn from agents with false beliefs or cognitive biases. They focus on the well-studied cognitive bias of “time inconsistency”, which is often modeled as hyperbolic discounting. They develop a principled, flexible approach to learning preferences from biased agents using generative models and approximate Bayesian inference. To test their approach, the authors devised Gridworld scenarios in which time-inconsistent or ignorant agents behave differently from unbiased agents. They showed that the algorithm was able to learn preferences accurately in these scenarios.
    • To further validate their approach, the authors ran an experiment to elicit human judgments about preferences. Given the primitive status of current preference-learning algorithms, the authors assumed humans provide a “gold standard” for inferences (at least for inferences from simple, everyday situations). They surveyed hundreds of people via Amazon’s Mechanical Turk (assisted by Stanford University psychologist Daniel Hawthorne). The experiments showed that humans spontaneously interpret the behavior in these scenarios in terms of false beliefs and time inconsistency, suggesting that it is important to include both in models of human choice. Moreover, the authors showed that the overall preference inferences made by humans and by their model were similar.
    • Following acceptance of the paper in the Fall, Owain presented this work at AAAI in February 2016 (Owain also gave a separate presentation at AAAI for the AI and Ethics workshop). In the Fall, Owain presented this work to smaller audiences of computer scientists and cognitive scientists at Oxford (Frank Wood’s Probabilistic Programming Seminar) and at Stanford.
  2. Evans, Owain, et al. Learning the Preferences of Bounded Agents. Fall 2015.
    • This paper demonstrated the flexibility of the authors’ approach by incorporating additional cognitive biases or bounds into the same framework. They included agents who make approximately optimal plans by using Monte Carlo simulations. They also included two kinds of agents who are “greedy” or “myopic” in their planning. These kinds of biases and bounds have featured in models of human cognition from psychology and neuroscience. Finally, they included a new scenario in which time-inconsistent agents exhibit procrastination. The paper described simple, uncontrived, everyday scenarios in which these cognitive biases and bounds lead to behavior distinct from that of an optimal agent. The authors showed that IRL algorithms that assume the agent is optimal will make mistaken inferences in these scenarios and that the mistakes can get arbitrarily bad. Finally, they showed that a single, concise model is sufficient to capture all these biases and to perform inference.
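
The time inconsistency discussed in these papers can be illustrated with a small sketch. It is not taken from the papers. It assumes the standard hyperbolic discount factor 1 / (1 + k * delay) and compares it with exponential discounting. The rewards, times and the parameters `k` and `gamma` are hypothetical. A hyperbolic discounter who plans to wait for the larger reward reverses that choice once the smaller reward is immediately available, while an exponential discounter never reverses.

```python
def hyperbolic_discount(delay, k=1.0):
    """Hyperbolic discount factor; k is a hypothetical discount rate."""
    return 1.0 / (1.0 + k * delay)


def exponential_discount(delay, gamma=0.8):
    """Exponential discount factor, which is time-consistent."""
    return gamma ** delay


# Hypothetical options: name -> (reward, time at which it is received).
OPTIONS = {
    "smaller_sooner": (5.0, 5),
    "larger_later": (10.0, 8),
}


def preferred(now, discount):
    """Return the option with the highest discounted value as seen at time `now`."""
    values = {
        name: reward * discount(time - now)
        for name, (reward, time) in OPTIONS.items()
    }
    return max(values, key=values.get)


for now in (0, 5):
    print(now, preferred(now, hyperbolic_discount), preferred(now, exponential_discount))
```

An inference algorithm that assumes a time-consistent agent would read the late switch to the smaller reward as a true preference for it, rather than as a bias.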


Workshops

  1. Colloquium Series on Robust and Beneficial AI (CSRBAI): May 27-June 17. MIRI, Berkeley, CA.
    • Owain Evans participated in this 22-day June colloquium series with the Future of Humanity Institute, which included four additional workshops.
    • Specific Workshop: “Preference Specification.” June 11-12.
      • The perennial problem of wanting code to “do what I mean, not what I said” becomes increasingly challenging when systems may find unexpected ways to pursue a given goal. Highly capable AI systems thereby increase the difficulty of specifying safe and useful goals, or specifying safe and useful methods for learning human preferences.

Course Materials

  • Starting in the Spring of 2016, Owain began working on an interactive online textbook explaining the kinds of models that his team developed in the two earlier papers. The textbook is based on an open-source library, “webppl-agents”, which can be used independently of the textbook.
  • There were two main motivations for writing this textbook. First, the authors wanted to communicate the idea of IRL to a broader audience. Their conference papers were aimed at AI and Machine Learning researchers with a background in reinforcement learning and inference. The authors wanted to provide an introduction to IRL that built up from first principles and assumed only a background in mathematics and programming. They have anecdotal evidence that this kind of expository material has been useful in drawing people to work on AI Safety. For example, material posted by MIRI researchers on blogs has attracted people with math and programming talent but with little experience in AI. Likewise, a similar online textbook, “Probabilistic Models of Cognition” (co-authored by Noah Goodman), has attracted people to work on probabilistic programming and cognitive science.
  • The second motivation was to give a detailed explanation of the authors’ approach to IRL to the existing AI Safety and AI/ML communities. Their papers in Fall 2015 were short, so the explanation of the formal approach (and of the software that the authors used to implement it) was necessarily brief. They felt that a good way to disseminate their contributions in these papers would be to release an open-source library with documentation and examples. The textbook, while starting from first principles, includes a number of advanced examples that are aimed at the AI/ML communities.
  • The textbook is online and the associated library is also available to download. While the existing chapters are reasonably polished, the textbook is not yet finished. The authors intend to add a few more chapters later this summer. At that point, they will publicize it among the AI and AI Safety communities.
  • The writing of the textbook and library was led by Owain. Daniel Filan made substantial contributions during his spell working as a research assistant. John Salvatier, a professional software engineer, did important work on the library (especially the visualizations). The library and textbook will be used by Andreas Stuhlmueller to teach at DARPA’s summer school in probabilistic programming (in August 2016).

Ongoing Projects/Recent Progress

  1. Owain Evans, Andreas Stuhlmueller, David Krueger – “Predicting Expensive Human Judgments from Cheap Signals”
    • These researchers would like to use Machine Learning to automate a human’s considered judgment. Because it’s expensive to get judgments from extensive human deliberation, the researchers want AI systems that can learn from cheaper proxies and by Active Learning. The goal of the project is to devise concrete problems that exhibit these challenges and that can be tackled by ML researchers. One subgoal is to produce abstract formalizations of the problem in the language of ML. As the researchers have noted, the problem is related to work in Active Learning (especially in an online setting). It also relates to work in semi-supervised learning. The existing papers closest to this work are LUPI distillation, patterns of side information and chris rea paper.
  2. Owain Evans, Andreas Stuhlmueller, David Abel – “Reinforcement Learning with a Human Teacher”
    • The researchers’ aim in this project is to provide a deeper understanding of how human feedback can facilitate learning in an RL agent. Their initial focus is on the setting that combines standard RL with human feedback (rather than TAMER). They want to identify classes of environments in which human feedback has a big impact relative to learning from a reward function alone. One promising class is environments with irreversible or catastrophic actions, where even sparse human feedback can suffice to help the agent avoid catastrophic or irreversibly bad actions. This also relates to the mainstream literature on “Safe RL” (García and Fernández, 2015).
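
As a rough illustration of the second project's setting, the sketch below combines tabular Q-learning with a simulated human teacher who vetoes a catastrophic action the first time it is proposed in each state. It is a toy under stated assumptions, not the researchers' method. The line-world environment, the "jump" action, the `run` function and all hyperparameters are hypothetical, and a fixed rule stands in for the human.

```python
import random

N_STATES = 5                          # states 0..4; reaching state 4 pays off
ACTIONS = ["left", "right", "jump"]   # "jump" is catastrophic in every state


def step(state, action):
    """Hypothetical environment: return (next_state, reward, catastrophe)."""
    if action == "jump":
        return 0, -10.0, True
    nxt = max(0, state - 1) if action == "left" else state + 1
    if nxt == N_STATES - 1:
        return 0, 1.0, False          # goal reached; restart at state 0
    return nxt, 0.0, False


def run(steps, human_teacher, seed=0):
    """Epsilon-greedy Q-learning; return (catastrophes, human interventions)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    blocked = set()                   # (state, action) pairs the human vetoed
    catastrophes = interventions = 0
    state = 0
    for _ in range(steps):
        allowed = [a for a in ACTIONS if (state, a) not in blocked]
        if rng.random() < 0.2:
            action = rng.choice(allowed)
        else:
            action = max(allowed, key=lambda a: q[(state, a)])
        if human_teacher and action == "jump":
            # Sparse feedback: one veto per state, remembered afterwards.
            blocked.add((state, action))
            interventions += 1
            continue
        nxt, reward, bad = step(state, action)
        catastrophes += int(bad)
        best_next = max(q[(nxt, a)] for a in ACTIONS)
        q[(state, action)] += 0.5 * (reward + 0.9 * best_next - q[(state, action)])
        state = nxt
    return catastrophes, interventions


print("reward only:", run(2000, human_teacher=False))
print("with teacher:", run(2000, human_teacher=True))
```

In this toy, the agent that learns from reward alone has to experience the catastrophe to learn to avoid it, while a handful of vetoes (at most one per reachable state) prevents it entirely.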