AI Safety Research

Benja Fallenstein

Research Fellow

Machine Intelligence Research Institute

Project: Aligning Superintelligence With Human Interests

Amount Recommended: $250,000

Project Summary

How can we ensure that powerful AI systems of the future behave in ways that are reliably aligned with human interests?

One productive way to begin study of this AI alignment problem in advance is to build toy models of the unique safety challenges raised by such powerful AI systems and see how they behave, much as Konstantin Tsiolkovsky wrote down (in 1903) a toy model of how a multistage rocket could be launched into space. This enabled Tsiolkovsky and others to begin exploring the specific challenges of spaceflight long before such rockets were built.

Another productive way to study the AI alignment problem in advance is to seek formal foundations for the study of well-behaved powerful AIs, much as Tsiolkovsky derived the rocket equation (also in 1903), which governs the motion of rockets under ideal environmental conditions. This was a useful stepping stone toward studying the motion of rockets in actual environments.
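For reference, the rocket equation in question relates the velocity change a rocket can achieve to its mass ratio and exhaust velocity, under ideal conditions (no gravity or drag):

```latex
% Tsiolkovsky's rocket equation: a rocket with initial mass m_0,
% final mass m_1, and effective exhaust velocity v_e can achieve
% a velocity change of
\[
  \Delta v \;=\; v_e \,\ln\!\frac{m_0}{m_1}.
\]
```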

We plan to build toy models and seek formal foundations for many aspects of the AI alignment problem. For example, we aim to improve our toy models of a corrigible agent, one that avoids the default rational incentives to resist its programmers’ attempts to fix errors in the agent’s goals.

Technical Abstract

The Future of Life Institute’s research priorities document calls for research focused on ensuring beneficial behavior in “[AI] systems that can learn from experience with human-like breadth and surpass human performance in most cognitive tasks.” We aim to study several sub-problems of this ‘AI alignment problem’ by illuminating the key difficulties using toy models, and by seeking formal foundations for robustly beneficial intelligent agents. In particular, we hope to (a) improve our toy models of ‘corrigible agents’, which avoid default rational incentives to resist corrective interventions from their programmers; (b) continue our preliminary efforts to put formal foundations under the study of naturalistic, embedded agents, which avoid the standard agent-environment split currently used as a simplifying assumption throughout the field of AI; and (c) continue our preliminary efforts to overcome obstacles to flexible cooperation in multi-agent settings. We also hope to take initial steps in formalizing several other informal problems related to AI alignment, for example the problem of ‘ontology identification’: given goals specified with respect to some ontology and a world model, how can the ontology of the goals be identified inside the world model?


  1. Critch, Andrew. Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents. 2016.
    • On the positive side, “Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents” provides the first proof of robust program equilibrium for actual programs (as opposed to idealized agents with access to halting oracles). In the process, these researchers proved new bounded generalizations of Löb’s theorem and Gödel’s second incompleteness theorem, which they expect to prove valuable for modeling the behavior of bounded formal agents.
    • On the negative side, these researchers now have a better understanding of why the modal framework fails to capture some intuitively highly desirable features of a theory of logical counterfactuals. Preliminary work suggests that there may be better options, but those results have not yet been published. This, too, is an area where their recent progress on logical induction is likely to be quite valuable.
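For orientation, the theorem being generalized is Löb’s theorem. The first statement below is the classical theorem; the second is only an informal gloss of the parametric bounded analogue (the notation □ₖ and the growth condition on f are simplifications, not the paper’s exact statement):

```latex
% Classical Löb's theorem: writing \Box P for "P is provable in PA",
% if PA proves (\Box P \to P), then PA proves P:
\[
  \mathrm{PA} \vdash (\Box P \to P)
  \quad\Longrightarrow\quad
  \mathrm{PA} \vdash P.
\]
% Informal gloss of the parametric bounded analogue: writing
% \Box_k P for "P has a proof of length at most k", then for a
% suitably fast-growing length bound f,
\[
  \mathrm{PA} \vdash \forall n\,\bigl(\Box_{f(n)} P(n) \to P(n)\bigr)
  \quad\Longrightarrow\quad
  \mathrm{PA} \vdash P(n) \ \text{for all sufficiently large } n.
\]
```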
  2. Garrabrant, Scott, et al. Asymptotically Coherent, Well Calibrated, Self-trusting Logical Induction. Working Paper (Berkeley, CA: Machine Intelligence Research Institute). 2016.
    • These researchers saw sizable progress in formalizing embedded agents. Their main result here, and one of the two or three largest results they have achieved to date, was the formalization of logically uncertain reasoning described in their “Logical Induction” paper. This is a computable method for assigning probabilities to sentences of logic, allowing the researchers to formalize agents that have beliefs about computations (e.g., “this program outputs ‘hello world’”) in full generality.
    • Logical inductors have many nice properties: they can reason deductively (respecting logical patterns of entailment), inductively (respecting observed empirical patterns), and reflectively (recognizing facts about the inductor’s own beliefs, and trusting its own conclusions within reasonable limits). This means the new framework can provide early models of bounded reasoners reasoning about each other, reasoners reasoning about themselves, and reasoners reasoning about limited reasoners.
    • The researchers expect this tool to be quite helpful in studying many of the informal and semi-formal problems described in their proposal. For example, in corrigibility problems, an AI system needs to model the fact that human operators are not just uncertain about the system’s preferences, but also uncertain about the implications of the code they write and the beliefs they hold. This tool gives these researchers their first detailed formal models of that sort of situation. They also expect logical inductors to help them obtain robustness guarantees when reasoning about the behavior of programs, something one cannot readily do with, e.g., probability-theoretic models.
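In outline, the logical induction criterion can be paraphrased as a no-exploitation condition on a market of sentence prices. The notation below (in particular Worthₙ) is ours for illustration, not the paper’s:

```latex
% A sequence of belief states P_1, P_2, ... over logical sentences
% is a logical inductor (relative to a deductive process) if no
% efficiently computable trader T can exploit the prices: no such
% T may accumulate unboundedly large gains while risking only a
% bounded amount.
\[
  \neg\,\exists\, T \ \text{(poly-time trader)}:
  \quad
  \sup_n \mathrm{Worth}_n(T) = \infty
  \;\wedge\;
  \inf_n \mathrm{Worth}_n(T) > -\infty.
\]
```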
  3. Garrabrant, Scott, et al. Inductive Coherence. 2016.
  4. Garrabrant, Scott, et al. Asymptotic Convergence in Online Learning with Unbounded Delays. 2016.
  5. Leike, Jan, et al. A Formal Solution to the Grain of Truth Problem. Uncertainty in Artificial Intelligence: 32nd Conference (UAI 2016), edited by Alexander Ihler and Dominik Janzing, 427–436. Jersey City, New Jersey, USA. 2016.
    • These researchers formally demonstrated that reflective oracles can serve as a general-purpose formal foundation for game-theoretic dilemmas. Leike, Taylor, and Fallenstein showed that reflective oracles give them a solution to the grain of truth problem, allowing them to prove that agents using Thompson sampling can achieve approximate Nash equilibria in arbitrary unknown computable multi-agent environments. In effect, this shows that ordinary decision-theoretic expected utility maximization is sufficient for achieving optimal game-theoretic behavior, as opposed to the researchers needing to separately assume that rational agents can achieve Nash equilibria in the relevant games. This result can also be readily computably approximated.
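As a loose illustration of the decision rule involved (not the paper’s construction, which works with reflective oracles over arbitrary computable environments), here is Thompson sampling on a simple Bernoulli bandit: the agent maintains a posterior over environments, samples one plausible environment, and acts optimally for that sample.

```python
import random

def thompson_sampling(true_probs, steps=5000, seed=0):
    """Thompson sampling on a Bernoulli bandit: keep a Beta
    posterior per arm, sample a success rate for each arm from
    its posterior, and pull the arm whose sample is largest."""
    rng = random.Random(seed)
    n = len(true_probs)
    successes = [1] * n  # Beta(1, 1) uniform priors
    failures = [1] * n
    pulls = [0] * n
    for _ in range(steps):
        # Sample one plausible success rate per arm.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(n)]
        arm = max(range(n), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.3, 0.7])
print(pulls)  # the 0.7 arm accumulates the large majority of pulls
```

As exploration proceeds, the posterior concentrates on the true environment, so play converges toward the optimal response; the paper’s result extends this style of reasoning to general multi-agent settings.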
  6. Taylor, Jessica. Quantilizers: A Safer Alternative to Maximizers for Limited Optimization. 2nd International Workshop on AI, Ethics and Society at AAAI-2016. Phoenix, AZ. 2016.
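The idea behind quantilization can be sketched in a few lines. In this toy version (a simplification of the paper’s definition, with a uniform base distribution over a finite action set), a q-quantilizer samples from the top q-fraction of actions ranked by utility instead of taking the argmax:

```python
import random

def quantilize(actions, utility, q=0.1, seed=0):
    """Toy q-quantilizer: rather than maximizing utility, sample
    uniformly from the top q-fraction of actions as ranked by
    utility (the base distribution here is uniform over actions).
    This bounds how far the policy can diverge from the base
    distribution, limiting damage from a misspecified utility."""
    rng = random.Random(seed)
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(q * len(ranked)))  # size of the top q-quantile
    return rng.choice(ranked[:k])

actions = list(range(100))
choice = quantilize(actions, utility=lambda a: a, q=0.1)
print(choice)  # some action from the top decile, i.e. in 90..99
```

As q shrinks toward 0 the quantilizer approaches a maximizer; as q grows toward 1 it approaches the base distribution, trading optimization power for safety.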


  1. What Are Some Recent Advances in Non-Convex Optimization Research? The Huffington Post.
  2. Taylor, Jessica. A first look at the hard problem of corrigibility. Intelligent Agent Foundations Forum, 2015.
  3. Taylor, Jessica. A sketch of a value-learning sovereign. Intelligent Agent Foundations Forum, 2015.
  4. Taylor, Jessica. Three preference frameworks for goal-directed agents. Intelligent Agent Foundations Forum, 2015.
  5. Taylor, Jessica. What do we need value learning for? Intelligent Agent Foundations Forum, 2015.


  1. Colloquium Series on Robust and Beneficial AI (CSRBAI)
  2. Self-Reference, Type Theory, and Formal Verification: April 1-3.
    • Participants worked on questions of self-reference in type theory and automated theorem provers, with the goal of studying systems that model themselves.
  3. Logic, Probability, and Reflection: August 12-14.
    • Participants at this workshop, consisting of MIRI staff and regular collaborators, worked on a variety of problems related to MIRI’s Agent Foundations technical agenda, with a focus on decision theory and the formal construction of logical counterfactuals.