AI Safety Research

Benja Fallenstein

Research Fellow

Machine Intelligence Research Institute

Project: Aligning Superintelligence With Human Interests

Amount Recommended: $250,000

Project Summary

How can we ensure that powerful AI systems of the future behave in ways that are reliably aligned with human interests?

One productive way to begin study of this AI alignment problem in advance is to build toy models of the unique safety challenges raised by such powerful AI systems and see how they behave, much as Konstantin Tsiolkovsky wrote down (in 1903) a toy model of how a multistage rocket could be launched into space. This enabled Tsiolkovsky and others to begin exploring the specific challenges of spaceflight long before such rockets were built.

Another productive way to study the AI alignment problem in advance is to seek formal foundations for the study of well-behaved powerful AIs, much as Tsiolkovsky derived the rocket equation (also in 1903), which governs the motion of rockets under ideal environmental conditions. This was a useful stepping stone toward studying the motion of rockets in actual environments.

We plan to build toy models and seek formal foundations for many aspects of the AI alignment problem. One example is that we aim to improve our toy models of a corrigible agent which avoids default rational incentives to resist its programmers’ attempts to fix errors in the AI’s goals.

Technical Abstract

The Future of Life Institute’s research priorities document calls for research focused on ensuring beneficial behavior in systems that can learn from experience with human-like breadth and surpass human performance in most cognitive tasks. We aim to study several sub-problems of this ‘AI alignment problem’ by illuminating the key difficulties using toy models, and by seeking formal foundations for robustly beneficial intelligent agents. In particular, we hope to (a) improve our toy models of ‘corrigible agents’ which avoid default rational incentives to resist corrective interventions from the agents’ programmers, (b) continue our preliminary efforts to put formal foundations under the study of naturalistic, embedded agents which avoid the standard agent-environment split currently used as a simplifying assumption throughout the field of AI, and (c) continue our preliminary efforts to overcome obstacles to flexible cooperation in multi-agent settings. We also hope to take initial steps in formalizing several other informal problems related to AI alignment, for example the problem of ‘ontology identification’: Given goals specified with respect to some ontology and a world model, how can the ontology of the goals be identified inside the world model?

Aligning Superintelligence With Human Interests

The trait that currently gives humans a dominant advantage over other species is intelligence. Human advantages in reasoning and resourcefulness have allowed us to thrive. However, this may not always be the case.

Although superintelligent AI systems may be decades away, Benya Fallenstein – a research fellow at the Machine Intelligence Research Institute – believes “it is prudent to begin investigations into this technology now.” The more time scientists and researchers have to prepare for a system that could eventually be smarter than us, the better.

A smarter-than-human AI system could potentially develop the tools necessary to exert control over humans. At the same time, highly capable AI systems may not possess a human sense of fairness, compassion, or conservatism. Consequently, the AI system’s single-minded pursuit of its programmed goals could cause it to deceive programmers, attempt to seize resources, or otherwise exhibit adversarial behaviors.

Fallenstein believes researchers must “ensure that AI would behave in ways that are reliably aligned with human interests.” However, even highly reliable agent programming does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing human-approved goals. A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given.

For example, imagine a superintelligent system designed to cure cancer “without doing anything bad.” This goal is rooted in cultural context and shared human knowledge. The AI may not completely understand what qualifies as “bad.” Therefore, it may try to cure cancer by stealing resources, proliferating robotic laboratories at the expense of the biosphere, kidnapping test subjects, or all of the above.

If a current AI system gets out of hand, researchers simply shut it down and modify its source code. However, modifying super-intelligent systems could prove to be more difficult, if not impossible. A system could acquire new hardware, alter its software, or take other actions that would leave the original programmers with only dubious control over the agent. And since most programmed goals are better achieved if the system stays operational and continues pursuing its goals than if it is deactivated or its goals are changed, systems will naturally tend to have an incentive to resist shutdown and to resist modifications to their goals.

Fallenstein explains that, in order to ensure that the development of super-intelligent AI has a positive impact on the world, “it must be constructed in such a way that it is amenable to correction, even if it has the ability to prevent or avoid correction.” The goal is not to design systems that fail in their attempts to deceive the programmers; the goal is to understand how highly intelligent and general-purpose reasoners with flawed goals can be built to have no incentives to deceive programmers in the first place. The intent is for the first highly capable systems to be “corrigible,” i.e., for them to recognize that their goals and other features are works in progress, and to work with programmers to identify and fix errors.

Little is known about the design or implementation details of such systems because everything, at this point, is hypothetical — no super-intelligent AI systems exist yet. As a consequence, the research described below focuses on formal agent foundations for AI alignment research — that is, on developing the basic conceptual tools and theories that are most likely to be useful for engineering robustly beneficial systems in the future.

Active research is focused on small “toy” problems and models of corrigible agents, in the hope that insight gained there can be applied to more realistic and complex versions of the problems. Fallenstein and her team sought to illuminate the key difficulties of AI alignment using these models. One such toy problem is the “shutdown problem,” which involves designing a set of preferences that incentivize an agent to shut down upon the press of a button without also incentivizing the agent to either cause or prevent the pressing of that button. Solving it would tell researchers whether a utility function can be specified such that agents using that function switch their preferences on demand, without having incentives to cause or prevent the switching.
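The difficulty in the shutdown problem can be seen in a minimal sketch. The Python below is an illustration under assumed numbers, not a model taken from the research: the action names, the press probabilities, and the utility values are all hypothetical. It scores each action with a naive utility that pays the normal-task value when the button is not pressed and the shutdown value when it is.

```python
import math

# Hypothetical toy numbers, chosen only for illustration.
# Each action changes the probability that the shutdown button ends up pressed.
PRESS_PROBABILITY = {
    "leave_button_alone": 0.5,
    "disable_button": 0.0,
    "press_button_itself": 1.0,
}


def expected_utility(action, normal_value, shutdown_value):
    """Naive mixture: normal value if the button is not pressed, shutdown value if it is."""
    p = PRESS_PROBABILITY[action]
    return (1 - p) * normal_value + p * shutdown_value


def best_actions(normal_value, shutdown_value):
    """Return every action that maximizes expected utility, ties included."""
    scores = {
        action: expected_utility(action, normal_value, shutdown_value)
        for action in PRESS_PROBABILITY
    }
    top = max(scores.values())
    return sorted(a for a, s in scores.items() if math.isclose(s, top))


if __name__ == "__main__":
    print(best_actions(10, 1))  # normal task is worth more than shutting down
    print(best_actions(1, 10))  # shutting down is worth more than the task
    print(best_actions(5, 5))   # the two values are exactly balanced
```

Under these assumptions, the naive agent prevents the button press whenever the normal task is worth more, and causes the press whenever shutdown is worth more. It is indifferent only when the two values are exactly balanced, which suggests why specifying such preferences is harder than it first appears. The sketch shows the failure mode only; it is not a proposed solution.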

Studying models in this formal logical setting has led to partial solutions, and further research may continue to drive the development of methods for reasoning under logical uncertainty.

The largest result thus far under this research program is “logical induction,” a line of research led by Scott Garrabrant. It functions as a new model of deductively limited reasoning.

Logical uncertainty is the kind of uncertainty we have about mathematical questions that are too difficult for us to settle one way or another right now. For example, a typical human mind can’t quickly answer the question:

What’s the 10^100th digit of pi?

Further, nobody has the computational resources to solve this in a reasonable amount of time. Despite this, mathematicians have lots of theories about how likely mathematical conjectures are to be true. As such, they must be implicitly using some sort of criterion to judge the probability that a mathematical statement is true. The “logical induction” work proves that a computable logical inductor (an algorithm producing probability assignments that satisfy the logical induction criterion) exists.

The research team presented a computable algorithm that outpaces deduction, assigning high subjective probabilities to provable conjectures and low probabilities to disprovable conjectures long before the proofs can be produced. Among other accomplishments, the algorithm learns to reason competently about its own beliefs and trust its future beliefs while avoiding paradox. This gives some formal backing to the thought that real-world probabilistic agents can often be reasonably confident in their future reasoning in practice.
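As a rough illustration of logical uncertainty (and not of the logical induction algorithm itself), the Python sketch below assigns a probability to the mathematical statement “n is prime” before doing the work needed to settle it. The function names are hypothetical, and the cheap guess simply uses the prime number theorem’s density of roughly 1/ln(n). Comparing the guesses against the settled answers over a range of numbers gives a crude check on whether such quick probabilities are well calibrated.

```python
import math


def quick_guess_is_prime(n: int) -> float:
    """Cheap probability that n is prime, using only the density 1/ln(n)."""
    if n < 3:
        return 1.0 if n == 2 else 0.0
    return 1.0 / math.log(n)


def is_prime(n: int) -> bool:
    """Settle the question by trial division (slow for large n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def calibration_gap(start: int, count: int) -> float:
    """Absolute gap between the mean quick guess and the true fraction of primes."""
    numbers = range(start, start + count)
    mean_guess = sum(quick_guess_is_prime(n) for n in numbers) / count
    true_fraction = sum(is_prime(n) for n in numbers) / count
    return abs(mean_guess - true_fraction)


if __name__ == "__main__":
    print(calibration_gap(100_000, 5_000))
    # A guess is available at once even for a number far too large
    # to settle by trial division.
    print(quick_guess_is_prime(10**100 + 1))
```

The guess here comes from a single hand-picked heuristic. A logical inductor, as described above, is a far more general object, so this sketch only conveys what it means to hold a calibrated probability about a fact that has not yet been computed.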

The team believes “there’s a good chance that this framework will open up new avenues of study in questions of metamathematics, decision theory, game theory, and computational reflection that have long seemed intractable.” They are also “cautiously optimistic” that they’ll improve our understanding of decision theory and counterfactual reasoning, and other problems related to AI value alignment.

At the same time, Fallenstein’s team doesn’t believe that all parts of the problem must be solved in advance. In fact, “the task of designing smarter, safer, more reliable systems could be delegated to early smarter-than-human systems.” This can only happen, though, as long as the research done by the AI can be trusted.

According to Fallenstein, this “call to arms” is vital, and “significant effort must be focused on the study of superintelligence alignment as soon as possible.” It is important to develop a formal understanding of AI alignment well in advance of making design decisions about smarter-than-human systems. By beginning the work early, humans inevitably face the risk that it may turn out to be irrelevant. However, failing to prepare could be even worse.

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.


  1. Critch, Andrew. Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents. 2016.
    • On the positive side, “Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents” provides the first proof of robust program equilibrium for actual programs (as opposed to idealized agents with access to halting oracles). In the process, these researchers proved new bounded generalizations of Löb’s theorem and Gödel’s second incompleteness theorem, which they expect to prove valuable for modeling the behavior of bounded formal agents.
    • On the negative side, these researchers have a better understanding now of why the modal framework fails to encapsulate some intuitively highly desirable features of a theory of logical counterfactuals. Some preliminary work suggests that there may be better options, but these researchers haven’t published these results anywhere yet. This, too, is an area where their recent progress in logical induction is likely to be quite valuable.
  2. Garrabrant, Scott, et al. Asymptotically Coherent, Well Calibrated, Self-trusting Logical Induction. Working Paper (Berkeley, CA: Machine Intelligence Research Institute). 2016.
    • These researchers saw sizable progress in formalizing embedded agents. Their main result here, and one of the two or three largest results they’ve achieved to date, was a formalization of logically uncertain reasoning described in their “Logical Induction” paper. This is a computable method for assigning probabilities to sentences of logic, allowing the researchers to formalize agents that have beliefs about computations (e.g., “this program outputs ‘hello world’”) in full generality.
    • Logical inductors have many nice properties: they can reason deductively (respecting logical patterns of entailment), inductively (respecting observed empirical patterns), and reflectively (recognizing facts about the inductor’s own beliefs, and trusting its own conclusions within reasonable limits). This means that this new framework can provide early models of bounded reasoners reasoning about each other, reasoners reasoning about themselves, and reasoners reasoning about limited reasoners.
    • The researchers expect this tool to be quite helpful in the study of many of the informal and semi-formal problems they described in their proposal. For example, in corrigibility problems, an AI system needs to model the fact that human operators are not just uncertain about the system’s preferences, but also uncertain about the implications of the code they write and the beliefs they hold. This tool gives these researchers their first detailed formal models of that sort of situation. They also expect logical inductors to help them get robustness guarantees when reasoning about the behavior of programs, something that one can’t readily do with, e.g., probability-theoretic models.
  3. Garrabrant, Scott, et al. Inductive Coherence. 2016.
  4. Garrabrant, Scott, et al. Asymptotic Convergence in Online Learning with Unbounded Delays. 2016.
  5. Leike, Jan, et al. A Formal Solution to the Grain of Truth Problem. Uncertainty in Artificial Intelligence: 32nd Conference (UAI 2016), edited by Alexander Ihler and Dominik Janzing, 427–436. Jersey City, New Jersey, USA. 2016.
    • These researchers formally demonstrated that reflective oracles can serve as a general-purpose formal foundation for game-theoretic dilemmas. Leike, Taylor, and Fallenstein showed that reflective oracles give them a solution to the grain of truth problem, allowing them to prove that agents using Thompson sampling can achieve approximate Nash equilibria in arbitrary unknown computable multi-agent environments. In effect, this shows that ordinary decision-theoretic expected utility maximization is sufficient for achieving optimal game-theoretic behavior, as opposed to the researchers needing to separately assume that rational agents can achieve Nash equilibria in the relevant games. This result can also be readily computably approximated.
  6. Taylor, Jessica. Quantilizers: A Safer Alternative to Maximizers for Limited Optimization. 2nd International Workshop on AI, Ethics and Society at AAAI-2016. Phoenix, AZ. 2016.
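The note on the grain of truth paper above mentions agents that use Thompson sampling. For readers unfamiliar with the technique, here is a minimal sketch in a much simpler setting than the paper’s: a single agent choosing among slot-machine arms with unknown payout rates, not the multi-agent computable environments the result concerns. The function name, the Beta(1, 1) priors, and the payout rates are assumptions made for illustration.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return how often each arm was pulled."""
    rng = random.Random(seed)
    arms = range(len(true_rates))
    successes = [1] * len(true_rates)  # Beta(1, 1) prior: one pseudo-success
    failures = [1] * len(true_rates)   # and one pseudo-failure per arm
    pulls = [0] * len(true_rates)

    for _ in range(rounds):
        # Sample one plausible payout rate per arm from its current posterior.
        samples = [rng.betavariate(successes[i], failures[i]) for i in arms]
        # Act as if the sampled rates were true: pull the arm that looks best.
        arm = max(arms, key=samples.__getitem__)
        pulls[arm] += 1
        # Observe a Bernoulli reward and update that arm's posterior.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


if __name__ == "__main__":
    print(thompson_sampling([0.2, 0.5, 0.8], rounds=2000))
```

The agent never computes an explicit exploration bonus. It samples a hypothesis in proportion to its current belief and acts optimally for that hypothesis, so its pulls tend to concentrate on the best arm as evidence accumulates.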


  1. What Are Some Recent Advances in Non-Convex Optimization Research? The Huffington Post.
  2. Taylor, Jessica. A first look at the hard problem of corrigibility. Intelligent Agent Foundations Forum, 2015.
  3. Taylor, Jessica. A sketch of a value-learning sovereign. Intelligent Agent Foundations Forum, 2015.
  4. Taylor, Jessica. Three preference frameworks for goal-directed agents. Intelligent Agent Foundations Forum, 2015.
  5. Taylor, Jessica. What do we need value learning for? Intelligent Agent Foundations Forum, 2015.


  1. Colloquium Series on Robust and Beneficial AI (CSRBAI):
  2. Self-Reference, Type Theory, and Formal Verification: April 1-3.
    • Participants worked on questions of self-reference in type theory and automated theorem provers, with the goal of studying systems that model themselves.
  3. Logic, Probability, and Reflection: August 12-14.
    • Participants at this workshop, consisting of MIRI staff and regular collaborators, worked on a variety of problems related to MIRI’s Agent Foundations technical agenda, with a focus on decision theory and the formal construction of logical counterfactuals.