At the core of AI safety lies the value alignment problem: how can we teach artificial intelligence systems to act in accordance with human goals and values?
Many researchers interact with AI systems to teach them human values, using techniques like inverse reinforcement learning (IRL). In theory, with IRL, an AI system can learn what humans value and how to best assist them by observing human behavior and receiving human feedback.
But human behavior doesn’t always reflect human values, and human feedback is often biased. We say we want healthy food when we’re relaxed, but then we demand greasy food when we’re stressed. Not only do we often fail to live according to our values, but many of our values contradict each other. We value getting eight hours of sleep, for example, but we regularly sleep less because we also value working hard, caring for our children, and maintaining healthy relationships.
AI systems may be able to learn a lot by observing humans, but because of our inconsistencies, some researchers worry that systems trained with IRL will be fundamentally unable to distinguish between value-aligned and misaligned behavior. This could become especially dangerous as AI systems become more powerful: inferring the wrong values or goals from observing humans could lead these systems to adopt harmful behavior.
Distinguishing Biases and Values
Owain Evans, a researcher at the Future of Humanity Institute, and Andreas Stuhlmüller, president of the research non-profit Ought, have explored the limitations of IRL in teaching human values to AI systems. In particular, their research exposes how cognitive biases make it difficult for AIs to learn human preferences through interactive learning.
Evans elaborates: “We want an agent to pursue some set of goals, and we want that set of goals to coincide with human goals. The question then is, if the agent just gets to watch humans and try to work out their goals from their behavior, how much are biases a problem there?”
In some cases, AIs will be able to understand patterns of common biases. Evans and Stuhlmüller discuss the psychological literature on biases in their paper, Learning the Preferences of Ignorant, Inconsistent Agents, and in their online book, agentmodels.org. An example of a common pattern discussed in agentmodels.org is “time inconsistency.” Time inconsistency is the idea that people’s values and goals change depending on when you ask them. In other words, “there is an inconsistency between what you prefer your future self to do and what your future self prefers to do.”
Examples of time inconsistency are everywhere. For one, most people value waking up early and exercising if you ask them before bed. But come morning, when it’s cold and dark out and they didn’t get those eight hours of sleep, they often value the comfort of their sheets and the virtues of relaxation. From waking up early to avoiding alcohol, eating healthy, and saving money, humans tend to expect more from their future selves than their future selves are willing to do.
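One common way to model this kind of preference reversal is hyperbolic discounting, in which the felt value of a reward shrinks as 1 / (1 + k × delay). The sketch below is only an illustration under that assumption. The rewards, times, and discount parameters are hypothetical numbers chosen to make the reversal visible, and the function names are not taken from agentmodels.org.

```python
def hyperbolic(delay, k=1.0):
    """Hyperbolic discount factor: falls off steeply for short delays."""
    return 1.0 / (1.0 + k * delay)


def exponential(delay, gamma=0.9):
    """Exponential discount factor: the time-consistent baseline."""
    return gamma ** delay


def preferred(options, now, discount):
    """Return the option with the highest discounted value as seen at time `now`.

    `options` maps a name to (reward, time_at_which_reward_arrives).
    """
    def value(name):
        reward, when = options[name]
        return reward * discount(when - now)
    return max(options, key=value)


# Hypothetical numbers: staying in bed pays off right at wake-up time (t=10),
# while exercise pays off more, but only later (t=12).
OPTIONS = {"sleep_in": (5.0, 10), "exercise": (8.0, 12)}

# The hyperbolic agent plans to exercise the night before, then changes its mind.
evening_plan = preferred(OPTIONS, now=0, discount=hyperbolic)
morning_choice = preferred(OPTIONS, now=10, discount=hyperbolic)

# The exponential agent ranks the options the same way at both times.
consistent_evening = preferred(OPTIONS, now=0, discount=exponential)
consistent_morning = preferred(OPTIONS, now=10, discount=exponential)

print(evening_plan, morning_choice)
print(consistent_evening, consistent_morning)
```

With these numbers the hyperbolic agent prefers exercise when asked the night before and prefers sleeping in once morning arrives, while the exponential agent never switches. The rewards themselves never change between evening and morning; only the agent's vantage point in time does.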
With systematic, predictable patterns like time inconsistency, IRL can make progress: a learner that models the bias explicitly can account for it when inferring preferences. But often our biases aren’t so clear. According to Evans, deciphering which actions coincide with someone’s values and which actions spring from biases is difficult or even impossible in general.
“Suppose you promised to clean the house but you get a last minute offer to party with a friend and you can’t resist,” he suggests. “Is this a bias, or your value of living for the moment? This is a problem for using only inverse reinforcement learning to train an AI — how would it decide what are biases and values?”
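A toy calculation can make this ambiguity concrete. The sketch below assumes a noisily rational (softmax) choice model and hyperbolic discounting, and all of the numbers are hypothetical. It is not Evans and Stuhlmüller's implementation. It only shows that two very different explanations, one with no bias and one with a strong bias, can assign exactly the same probability to the observed choice.

```python
import math


def choice_prob(utilities, chosen, beta=1.0):
    """Softmax probability that a noisily rational agent picks `chosen`."""
    total = sum(math.exp(beta * u) for u in utilities.values())
    return math.exp(beta * utilities[chosen]) / total


def perceived_utilities(values, delays, k):
    """Utilities as felt at decision time under hyperbolic discounting with rate k."""
    return {action: values[action] / (1.0 + k * delays[action]) for action in values}


# The party pays off immediately; the benefit of a clean house arrives later.
DELAYS = {"party": 0, "clean": 3}

# Hypothesis A: an unbiased agent (k = 0) who genuinely values the party more.
values_a, k_a = {"party": 2.0, "clean": 1.0}, 0.0

# Hypothesis B: an agent who values the clean house more, but discounts steeply.
values_b, k_b = {"party": 2.0, "clean": 4.0}, 1.0

p_a = choice_prob(perceived_utilities(values_a, DELAYS, k_a), "party")
p_b = choice_prob(perceived_utilities(values_b, DELAYS, k_b), "party")

print(p_a, p_b)
```

Both hypotheses predict the choice to party with the same probability, so observing that choice gives a learner no reason to prefer one over the other. Telling them apart would take extra assumptions or extra data, such as what the person says when the decision is still far in the future.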
Learning the “Correct” Values
Despite this conundrum, understanding human values and preferences is essential for AI systems, and developers have a very practical interest in training their machines to learn these preferences.
Popular websites already use AI to learn human preferences. On YouTube and Amazon, for instance, machine-learning algorithms observe your behavior and predict what you will want next. But while these recommendations are often useful, they can have unintended consequences.
Consider the case of Zeynep Tufekci, an associate professor at the School of Information and Library Science at the University of North Carolina. After watching videos of Trump rallies to learn more about his voter appeal, Tufekci began seeing white nationalist propaganda and Holocaust denial videos on her “autoplay” queue. She soon realized that YouTube’s algorithm, optimized to keep users engaged, predictably suggests more extreme content as users watch more videos. This led her to call the website “The Great Radicalizer.”
This value misalignment in YouTube algorithms foreshadows the dangers of interactive learning with more advanced AI systems. Instead of optimizing advanced AI systems to appeal to our short-term desires and our attraction to extremes, designers must be able to optimize them to understand our deeper values and enhance our lives.
Evans suggests that we will want AI systems that can reason through our decisions better than humans can, understand when we are making biased decisions, and “help us better pursue our long-term preferences.” However, this means that AIs will sometimes suggest things that seem bad to humans at first blush.
One can imagine an AI system suggesting a brilliant, counterintuitive modification to a business plan, only for the human to find it ridiculous. Or maybe an AI recommends a slightly longer, stress-free driving route to a first date, but the anxious driver takes the faster route anyway, unconvinced.
To help humans understand AIs in these scenarios, Evans and Stuhlmüller have researched how AI systems could reason in ways that are comprehensible to humans and can ultimately improve upon human reasoning.
One method (invented by Paul Christiano) is called “amplification,” where humans use AIs to help them think more deeply about decisions. Evans explains: “You want a system that does exactly the same kind of thinking that we would, but it’s able to do it faster, more efficiently, maybe more reliably. But it should be a kind of thinking that if you broke it down into small steps, humans could understand and follow.”
A second, related concept is “factored cognition”: the idea of breaking sophisticated tasks into small, understandable steps. According to Evans, it’s not clear how generally factored cognition can succeed. Sometimes humans can break down their reasoning into small steps, but often we rely on intuition, which is much more difficult to break down.
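The idea of decomposition can be illustrated with a deliberately simple stand-in task. In the sketch below, adding up a list of numbers is split recursively until each step involves at most two numbers, and every step is logged so that a person could check it in isolation. The function name, the step-size limit, and the choice of arithmetic as the task are all assumptions made for illustration. Real tasks may not decompose this cleanly, which is exactly the open question Evans describes.

```python
def solve(numbers, max_step=2, trace=None):
    """Sum `numbers` by recursive decomposition, logging every small step.

    A sub-list of at most `max_step` items is treated as a step small
    enough to verify directly. Larger lists are split in half, solved
    separately, and combined in one further small step.
    """
    if trace is None:
        trace = []
    if len(numbers) <= max_step:
        answer = sum(numbers)
        trace.append(("add", tuple(numbers), answer))
        return answer, trace
    mid = len(numbers) // 2
    left, _ = solve(numbers[:mid], max_step, trace)
    right, _ = solve(numbers[mid:], max_step, trace)
    answer = left + right
    trace.append(("combine", (left, right), answer))
    return answer, trace


total, steps = solve([3, 1, 4, 1, 5, 9, 2, 6])

print(total)
for kind, inputs, result in steps:
    print(kind, inputs, "->", result)
```

Every entry in the trace involves at most two numbers, so each one can be verified without understanding the whole computation. The final answer is trustworthy only because each small step is.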
Specifying the Problem
Evans and Stuhlmüller have started a research project on amplification and factored cognition, but they haven’t solved the problem of human biases in interactive learning – rather, they’ve set out to precisely lay out these complex issues for other researchers.
“It’s more about showing this problem in a more precise way than people had done previously,” says Evans. “We ended up getting interesting results, but one of our results in a sense is realizing that this is very difficult, and understanding why it’s difficult.”
This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.