Aligning Superintelligence With Human Interests

The trait that currently gives humans a dominant advantage over other species is intelligence. Human advantages in reasoning and resourcefulness have allowed us to thrive. However, humans may not always hold that advantage.

Although superintelligent AI systems may be decades away, Benya Fallenstein – a research fellow at the Machine Intelligence Research Institute – believes “it is prudent to begin investigations into this technology now.” The more time scientists and researchers have to prepare for a system that could eventually be smarter than us, the better.

A smarter-than-human AI system could potentially develop the tools necessary to exert control over humans. At the same time, highly capable AI systems may not possess a human sense of fairness, compassion, or conservatism. Consequently, the AI system’s single-minded pursuit of its programmed goals could cause it to deceive programmers, attempt to seize resources, or otherwise exhibit adversarial behaviors.

Fallenstein believes researchers must “ensure that AI would behave in ways that are reliably aligned with human interests.” However, even highly reliable agent programming does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing human-approved goals. A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given.

For example, imagine a superintelligent system designed to cure cancer “without doing anything bad.” This goal is rooted in cultural context and shared human knowledge. The AI may not completely understand what qualifies as “bad.” Therefore, it may try to cure cancer by stealing resources, proliferating robotic laboratories at the expense of the biosphere, kidnapping test subjects, or all of the above.

If a current AI system gets out of hand, researchers can simply shut it down and modify its source code. Modifying superintelligent systems, however, could prove far more difficult, if not impossible. A system could acquire new hardware, alter its software, or take other actions that would leave the original programmers with only dubious control over the agent. Moreover, most programmed goals are better achieved if the system stays operational and keeps pursuing them than if it is deactivated or its goals are changed, so such systems will naturally tend to have an incentive to resist both shutdown and modification of their goals.

Fallenstein explains that, in order to ensure that the development of superintelligent AI has a positive impact on the world, “it must be constructed in such a way that it is amenable to correction, even if it has the ability to prevent or avoid correction.” The goal is not to design systems that fail in their attempts to deceive the programmers; it is to understand how highly intelligent, general-purpose reasoners with flawed goals can be built to have no incentive to deceive programmers in the first place. The intent is for the first highly capable systems to be “corrigible”: that is, to recognize that their goals and other features are works in progress, and to work with programmers to identify and fix errors.

Little is known about the design or implementation details of such systems because everything, at this point, is hypothetical — no super-intelligent AI systems exist yet. As a consequence, the research described below focuses on formal agent foundations for AI alignment research — that is, on developing the basic conceptual tools and theories that are most likely to be useful for engineering robustly beneficial systems in the future.

Active research on corrigibility focuses on small “toy” problems and models of corrigible agents, in the hope that insights gained there can be applied to more realistic and complex versions of the problems. Fallenstein and her team sought to illuminate the key difficulties of AI alignment using these models. One such toy problem is the “shutdown problem,” which involves designing a set of preferences that incentivize an agent to shut down upon the press of a button without also incentivizing the agent to either cause or prevent the pressing of that button. A solution would tell researchers whether a utility function can be specified such that agents using that function switch their preferences on demand, without having incentives to cause or prevent the switching.
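To make the difficulty of the shutdown problem concrete, here is a minimal Python sketch of a naive button setup. It is an illustration only, not the formal model studied by Fallenstein's team: the action names, the press probabilities, and the utility values are all assumptions chosen for the example.

```python
# Toy model of the shutdown problem (every number here is an assumption).
# The agent first decides how to treat the button; the button is then
# pressed with a probability that depends on that decision.

# Hypothetical probability that the button ends up pressed, per action.
P_PRESS = {"leave_alone": 0.5, "cause_press": 1.0, "prevent_press": 0.0}


def expected_utility(action, u_normal, u_shutdown):
    """Expected utility for a naive agent that earns u_normal when the
    button is not pressed (it keeps working) and u_shutdown when the
    button is pressed (it shuts down)."""
    p = P_PRESS[action]
    return p * u_shutdown + (1 - p) * u_normal


def best_actions(u_normal, u_shutdown):
    """Return the set of actions that maximize expected utility."""
    scores = {a: expected_utility(a, u_normal, u_shutdown) for a in P_PRESS}
    top = max(scores.values())
    return {a for a, s in scores.items() if s == top}


# An agent that values its normal work more prefers to block the button;
# one that values shutdown more prefers to press it itself.
print(best_actions(u_normal=10, u_shutdown=1))
print(best_actions(u_normal=1, u_shutdown=10))
print(best_actions(u_normal=5, u_shutdown=5))
```

In this toy, the agent is indifferent about the button only when the two utilities are exactly balanced; any imbalance gives it a reason to cause or prevent the press. The sketch does not capture the further difficulties of the real problem, but it shows why simply attaching a shutdown reward to an agent's goals is not enough.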

Studying models in this formal logical setting has led to partial solutions. Further research may continue to drive the development of methods for reasoning under logical uncertainty.

The largest result thus far under this research program is “logical induction,” a line of research led by Scott Garrabrant. It offers a new model of deductively limited reasoning.

Logical uncertainty is the kind of uncertainty we have about mathematical questions that are too difficult for us to settle one way or the other right now. For example, a typical human mind can’t quickly answer the question:

What’s the 10^100th digit of Pi?

Further, nobody has the computational resources to answer this in a reasonable amount of time. Despite this, mathematicians have lots of theories about how likely mathematical conjectures are to be true. As such, they must be implicitly using some sort of criterion to judge the probability that a mathematical statement is true. The research on “logical induction” formalizes such a criterion and proves that a computable logical inductor (an algorithm producing probability assignments that satisfy it) exists.
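The intuition behind logical uncertainty can be illustrated with digits of Pi that are within reach. The Python sketch below is not logical induction and is not taken from the research described here; it only compares a bounded reasoner's guess made before any computation (probability 1/10 for each digit) with the frequencies actually observed. It uses a well-known unbounded spigot algorithm for Pi, and the function names, the 1/10 prior, and the sample size of 1,000 digits are assumptions made for the example.

```python
from collections import Counter
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, 1, 5, ...) one at a time,
    using an unbounded spigot algorithm with exact integer arithmetic."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough information yet: consume another series term.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k) + 2 + r * l) // (t * l),
                l + 2,
            )


def digit_frequencies(count):
    """Return the observed frequency of each digit 0-9 among the first
    `count` digits of pi (including the leading 3)."""
    tally = Counter(islice(pi_digits(), count))
    return {d: tally[d] / count for d in range(10)}


# Assumed prior: with no computation done, each digit gets probability 1/10.
PRIOR = 0.1
freqs = digit_frequencies(1000)
for digit, freq in freqs.items():
    print(digit, round(freq, 3), "prior:", PRIOR)
```

The answer to any particular digit question is fully determined, yet a reasoner who has not done the computation can still hold a sensible probability for it. Logical induction aims to make that kind of judgment precise for mathematical statements in general.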

The research team presented a computable algorithm that outpaces deduction, assigning high subjective probabilities to provable conjectures and low probabilities to disprovable conjectures long before the proofs can be produced. Among other accomplishments, the algorithm learns to reason competently about its own beliefs and trust its future beliefs while avoiding paradox. This gives some formal backing to the thought that real-world probabilistic agents can often be reasonably confident in their future reasoning in practice.

The team believes “there’s a good chance that this framework will open up new avenues of study in questions of metamathematics, decision theory, game theory, and computational reflection that have long seemed intractable.” They are also “cautiously optimistic” that the framework will improve our understanding of decision theory, counterfactual reasoning, and other problems related to AI value alignment.

At the same time, Fallenstein’s team doesn’t believe that all parts of the problem must be solved in advance. In fact, “the task of designing smarter, safer, more reliable systems could be delegated to early smarter-than-human systems.” This can happen, though, only if the research done by the AI can be trusted.

According to Fallenstein, this “call to arms” is vital, and “significant effort must be focused on the study of superintelligence alignment as soon as possible.” It is important to develop a formal understanding of AI alignment well in advance of making design decisions about smarter-than-human systems. By beginning the work early, humans inevitably face the risk that it may turn out to be irrelevant. However, failing to prepare could be even worse.

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.