Here are some examples of the kinds of topics we are interested in funding. Many of them are based on problems from the following research agendas: Concrete Problems in AI Safety and Agent Foundations.
Scalable reward learning
One way to specify human preferences to an artificial agent is to have the agent learn a reward function from human demonstrations or feedback. Since human input is often limited and slow to obtain, we need to introduce novel algorithms or extend existing approaches to learn reward functions in an efficient and scalable way. Can we develop heuristics and methods for systematically identifying situations where existing approaches might fail to teach the agent the right objective?
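As a concrete illustration of learning a reward function from human feedback, the sketch below fits one scalar reward per item from pairwise comparisons using a Bradley-Terry model. The model choice, the function name `fit_rewards`, and the toy items are assumptions made for this example; they are one common setup, not a method proposed in this text.

```python
import math

def fit_rewards(items, comparisons, steps=500, lr=0.1):
    """Fit one scalar reward per item from (preferred, rejected) pairs.

    Assumes a Bradley-Terry model: P(a preferred over b) = sigmoid(r_a - r_b).
    Plain gradient ascent on the log-likelihood; all names are hypothetical.
    """
    reward = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            # Probability the model currently assigns to the observed preference.
            p_win = 1.0 / (1.0 + math.exp(reward[loser] - reward[winner]))
            # Gradient of log(p_win) with respect to each reward.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for item in items:
            reward[item] += lr * grad[item]
    return reward

# Three pieces of human feedback over three hypothetical behaviors.
feedback = [("a", "b"), ("b", "c"), ("a", "c")]
learned = fit_rewards(["a", "b", "c"], feedback)
```

With only three comparisons the fit recovers an ordering but not a meaningful scale, which is one small example of how limited human input constrains what can be learned.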
Quantifying negative side effects
Misspecified objectives for artificial agents can result in negative side effects, where the agent creates unnecessary disruptions in its environment. One form of misspecification is omitting important variables or considerations, which implicitly expresses indifference over those variables, leading the agent to set them to undesirable values. How can we define and measure negative side effects in a generalizable way that does not require explicitly penalizing every possible disruption?
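One naive starting point for measurement, sketched below, is to compare the final state of the environment with a baseline state (for example, the state that would have resulted had the agent done nothing) and count the changed variables that the task did not mention. The state representation, the choice of baseline, and the name `side_effect_score` are all assumptions for illustration; this is not a solution to the question above, since it treats every change as equally bad.

```python
def side_effect_score(baseline_state, final_state, task_variables):
    """Count variables outside the task that differ from the baseline.

    baseline_state / final_state: dicts mapping variable names to values.
    task_variables: names the objective explicitly talks about.
    All names here are hypothetical and chosen for this example only.
    """
    changed = [
        name
        for name in baseline_state
        if name not in task_variables
        and final_state.get(name) != baseline_state[name]
    ]
    return len(changed), sorted(changed)

# Toy example: the agent delivers a box but breaks a vase on the way.
baseline = {"box_delivered": False, "vase_intact": True, "door_open": False}
final = {"box_delivered": True, "vase_intact": False, "door_open": False}
score, disrupted = side_effect_score(baseline, final, {"box_delivered"})
```

Here the objective only mentions `box_delivered`, so the broken vase is exactly the kind of omitted variable the paragraph describes.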
Interpretability in safety problems
What forms of interpretability are likely to be useful for solving long-term AI safety problems, e.g. for detecting and explaining problematic behaviors like reward hacking, unsafe exploration, or deceiving human supervisors? Can we extend interpretability techniques to model representations of abstract concepts that are not easy to visualize?
Producing simultaneous explanations
An AI system that can produce explanations for its decisions while it is making those decisions makes human oversight more scalable and tractable. Can we build AI systems that maintain state-of-the-art performance while producing accurate and understandable explanations? How can we measure various aspects of explanation quality, such as accuracy and clarity?
Avoiding manipulation traps
An agent can get stuck in a low-value state if it finds a way to exploit a loophole in an incorrect specification of an objective function, manipulate its input channel or a human supervisor, or modify itself to receive more reward (these failure modes are also called “wireheading”). How can we design an agent that explores its environment while avoiding such traps?
Grounding objectives in the environment
One cause of manipulation traps is that the agent receives information about its objective through its sensory data, such as rewards or signals from the supervisor, and the agent’s sensors could malfunction or be manipulated by the agent. How can an agent follow objectives with respect to the actual state of the environment in which it is embedded, rather than with respect to its sensory data from that environment?
While value alignment has been studied for individual reinforcement learning agents, it has not been studied much in multi-agent settings. Can we test the alignment of reinforcement learning agents by investigating equilibria in iterated cooperation games, such as the Prisoner’s Dilemma?
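To make the setting concrete, here is a minimal sketch of an iterated Prisoner’s Dilemma between two fixed strategies. The payoff values (5, 3, 1, 0) are the conventional textbook choice and an assumption here, as are the strategy and function names; a learning agent would replace the hand-written strategies.

```python
# Payoffs (player A, player B) for one round; "C" = cooperate, "D" = defect.
# Assumed values following the usual ordering T > R > P > S.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the opponent's previous move."""
    return other_history[-1] if other_history else "C"

def always_defect(own_history, other_history):
    """Defect unconditionally."""
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Run the iterated game and return the total scores of both players."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))      # (30, 30)
print(play(tit_for_tat, always_defect))    # (9, 14)
```

Mutual tit-for-tat sustains cooperation, while the defector gains only a one-round advantage before both players settle into mutual defection.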
Testing ethical behavior
Develop a set of tests for ethical behavior required from AI systems and a set of guidelines for applying the tests to different AI systems in varied domains. For example, such a test could verify that the system avoids causing harm to humans, and guidelines could specify how to design different environments that meaningfully test for harm to humans. Applying the same test procedure to an AI system in different domains could help prevent overfitting to the specifics of that test in one domain.
Inferring human metacognition
Rather than applying inverse reinforcement learning to a human’s actions, we could apply it to the cognitive actions taken by a human while they deliberate about a subject. We could then use the inferred preferences to execute a longer deliberation process, asking “what would the human do if they had more time to think or more powerful cognitive tools?” This could enable AI systems to satisfy human preferences better than humans can, in the spirit of what is known as “coherent extrapolated volition.”
Reasoning about philosophical problems
In order to understand human values, powerful AI systems will likely need to be able to reason about moral and philosophical problems and ethical theories. For example, could AI systems learn philosophical reasoning from a diverse corpus of human text and dialogue with humans? This would be especially helpful in domains where human values are ambiguous or underdetermined. There has been little work in this area, and it would be valuable to try to formalize these questions and investigate whether they can be made more tractable, for example by identifying what sorts of text corpora might be helpful.
A general AI system is corrigible if it allows its objectives to be corrected by humans, enabling it to arrive at value-aligned objectives. In the space of possible objectives, do corrigible systems create a basin of attraction around objectives that produce acceptable outcomes for humans? This would mean that if humans try to correct the objectives of an AI system in the basin, it will probably move towards the center of the basin. This would require a value alignment criterion such that we can effectively detect when the objectives of a system depart from it and correct them. How likely is it that such a basin would exist if corrigibility is achieved, and what additional criteria besides corrigibility might be necessary?
Analyzing Goodhart’s Law
According to Goodhart’s Law, when a measure becomes a target, it ceases to be a good measure. When applied to powerful artificial agents optimizing for a proxy measure of a human objective, this becomes Goodhart’s curse: the agent’s optimization process is likely to compound errors caused by deviations of the proxy from the objective. A taxonomy of levels of Goodhart’s curse was recently introduced: regressing to the mean, optimizing a proxy that’s not causally related to the objective, optimizing away the correlation between the proxy and the objective, and adversarial correlations. Are there additional dynamics or modes of Goodhart’s Law not covered by that taxonomy?
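The first level of the taxonomy, regressing to the mean, can be illustrated with a small simulation. In the sketch below the proxy is the true value plus independent Gaussian noise, and we select the candidates with the highest proxy scores. The Gaussian model, the parameter values, and the function name are assumptions chosen for illustration.

```python
import random

def regressional_goodhart(n=10_000, top_k=100, noise_sd=1.0, seed=0):
    """Select the top_k candidates by proxy and compare proxy vs. true value.

    Assumed toy model: true value ~ N(0, 1), proxy = true value + N(0, noise_sd).
    Returns (mean proxy of selected, mean true value of selected).
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        true_value = rng.gauss(0.0, 1.0)
        proxy = true_value + rng.gauss(0.0, noise_sd)
        candidates.append((proxy, true_value))
    # Optimizing the proxy: keep the candidates with the highest proxy scores.
    selected = sorted(candidates, reverse=True)[:top_k]
    mean_proxy = sum(p for p, _ in selected) / top_k
    mean_true = sum(v for _, v in selected) / top_k
    return mean_proxy, mean_true

mean_proxy, mean_true = regressional_goodhart()
print(f"selected by proxy: mean proxy {mean_proxy:.2f}, mean true value {mean_true:.2f}")
```

Because selection favors candidates whose noise happened to be large and positive, the selected group's true value falls short of its proxy score, and the gap widens as the selection becomes more extreme.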
Subagents with value drift
Even if a general artificial agent is aligned with human values, there is a risk of the agent developing successor agents or subagents that are not aligned with those values (whether they are explicitly designed by the agent or spontaneously emerge from the optimization process). This is an example of a principal-agent problem. Under what conditions can we align subagents with the agent’s objective or limit the degree of undesirable value drift? Can we quantify the risk of subagent value drift due to Goodhart’s curse (if the subagent is optimizing a proxy of the agent’s objective)?
Formal robustness criteria for foundational problems
How can solutions to foundational alignment problems (such as decision theory or logical uncertainty) be implemented in machine learning systems? Can these philosophical concerns be converted into formal mathematical criteria that can be optimized for by a machine learning algorithm? Such criteria should be based on worst-case assumptions that require the system to do well in all situations of a given type, rather than just doing well in expectation.
Identifying unknown unknowns
Besides known categories of foundational alignment problems, there are likely to be more fundamental problems that the research community has not yet thought of. What could other failure modes for advanced AI look like? (One approach could be to make adversarial assumptions: supposing that a bounded adversary is trying to deceive the agent about the value of different courses of action.)
Human preference aggregation
Building AI systems whose objectives reflect humanity’s values would likely require aggregating the values of different people. Is there a minimalistic set of values that almost everyone in a given group of humans would agree on? Can the conceptual building blocks of those values be specified in a way that most humans would agree with (e.g. by developing more precise definitions of ambiguous concepts such as “fairness” or “harm”)? Can we identify new (or newly relevant) technologies or techniques to aggregate human values and preferences?
Modeling race dynamics
A race towards powerful AI between different organizations or nations could have large negative consequences. For example, it could impede cooperation, or lead to researchers cutting corners on safety. How could we improve the incentives of the key parties involved, in order to reduce the chances of a race occurring or of a bad outcome if it does occur? This suggests modeling possible dynamics of a race in a game theoretic or mechanism design context, and investigating interventions to improve outcomes for the world. Are there equilibria for stably stopping the race, or for deploying general AI in a way that is broadly beneficial?
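As a minimal sketch of the game-theoretic modeling suggested here, the example below treats a race as a two-player game in which each organization chooses to develop safely or to rush, and enumerates the pure-strategy Nash equilibria before and after a simple intervention. The payoff numbers, the penalty mechanism, and all names are hypothetical and chosen only to show the shape of such an analysis.

```python
from itertools import product

ACTIONS = ("safe", "rush")

# Hypothetical payoffs (organization A, organization B) for each action pair.
BASE_PAYOFFS = {
    ("safe", "safe"): (3, 3),
    ("safe", "rush"): (0, 4),
    ("rush", "safe"): (4, 0),
    ("rush", "rush"): (1, 1),
}

def pure_nash_equilibria(payoffs):
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        pay_a, pay_b = payoffs[(a, b)]
        a_stays = all(payoffs[(alt, b)][0] <= pay_a for alt in ACTIONS)
        b_stays = all(payoffs[(a, alt)][1] <= pay_b for alt in ACTIONS)
        if a_stays and b_stays:
            equilibria.append((a, b))
    return equilibria

def with_rush_penalty(payoffs, penalty):
    """Model an intervention that costs each rushing player `penalty`."""
    return {
        (a, b): (pay_a - penalty * (a == "rush"), pay_b - penalty * (b == "rush"))
        for (a, b), (pay_a, pay_b) in payoffs.items()
    }

print(pure_nash_equilibria(BASE_PAYOFFS))                        # [('rush', 'rush')]
print(pure_nash_equilibria(with_rush_penalty(BASE_PAYOFFS, 2)))  # [('safe', 'safe')]
```

In this toy model the unregulated game has rushing as its only equilibrium, while a sufficiently large penalty on rushing makes mutual caution the stable outcome. Real race dynamics would need richer models, including more players, uncertainty, and repeated interaction.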