Value Alignment and Moral Metareasoning
Developing AI systems that are benevolent towards humanity requires making sure that those systems know what humans want. People routinely make inferences about the preferences of others and use those inferences as the basis for helping one another. This project aims to provide AI systems with a similar ability to learn from observations, in order to better align the values of those systems with those of humans. Doing so requires dealing with two significant challenges. First, if we ultimately develop AI systems that can reason better than humans, how do we make sure that those systems take human limitations into account? The fact that we haven’t yet cured cancer shouldn’t be taken as evidence that we don’t really care about it. Second, once we have made an AI system that can reason about human preferences, that system has to trade off time spent deliberating about the right course of action against the need to act as quickly as possible; it needs to deal with its own computational limitations as it makes decisions. We aim to address both of these challenges by examining how intelligent agents (be they humans or computers) should make these tradeoffs.
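To illustrate the kind of inference involved, the sketch below shows a minimal Bayesian approach to learning what a human wants from observed behavior, using a Boltzmann-rational model of the human rather than an assumption of perfect rationality; under such a model, failing to achieve a goal is only weak evidence of not caring about it. All of the specifics here (the toy actions, the candidate reward functions, and the helper names boltzmann_policy and posterior_over_rewards) are hypothetical illustrations, not part of the proposed system.

```python
# Minimal sketch (hypothetical, not the project's code): Bayesian inference of
# what a human wants from observed actions, under a Boltzmann-rational model
# of the human instead of an assumption of perfect rationality.

import numpy as np

# Toy setting: four actions and three candidate reward functions, each giving
# a value to every action. These are illustrative stand-ins for "hypotheses
# about human preferences."
actions = ["rest", "research_cancer", "write_novel", "exercise"]
reward_hypotheses = {
    "values_health": np.array([0.0, 2.0, 0.0, 1.5]),
    "values_art":    np.array([0.0, 0.0, 2.0, 0.0]),
    "indifferent":   np.array([0.0, 0.0, 0.0, 0.0]),
}

def boltzmann_policy(rewards, beta):
    """Action probabilities for an approximately rational human: higher beta
    means behavior closer to perfectly optimal, lower beta means noisier."""
    z = np.exp(beta * (rewards - rewards.max()))  # subtract max for stability
    return z / z.sum()

def posterior_over_rewards(observed_actions, beta=1.0):
    """Bayesian update over the reward hypotheses, from a uniform prior."""
    log_post = {h: 0.0 for h in reward_hypotheses}
    for a in observed_actions:
        idx = actions.index(a)
        for h, rewards in reward_hypotheses.items():
            log_post[h] += np.log(boltzmann_policy(rewards, beta)[idx])
    # Normalize in log space for numerical stability.
    m = max(log_post.values())
    unnorm = {h: np.exp(lp - m) for h, lp in log_post.items()}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

# A human who mostly researches cancer but sometimes rests. Because the model
# allows for human limitations, occasional "suboptimal" actions do not wipe
# out the hypothesis that the human cares about health.
obs = ["research_cancer", "research_cancer", "rest",
       "research_cancer", "research_cancer"]
print(posterior_over_rewards(obs))
```

The key design choice is the noise parameter beta: lower values encode weaker assumptions about human rationality, so apparently suboptimal behavior shifts the posterior less.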
AI research has focused on improving the decision-making capabilities of computers, i.e., the ability to select high-quality actions in pursuit of a given objective. When the objective is aligned with the values of the human race, this improvement can lead to tremendous benefits; when the objective is misaligned, improving the AI system’s decision-making may lead to worse outcomes for humanity. The objectives of the proposed research are (1) to create a mathematical framework in which fundamental questions of value alignment can be investigated; (2) to develop and experiment with methods for aligning the values of a machine (whether explicitly or implicitly represented) with those of humans; (3) to understand the relationships among the degree of value alignment, the decision-making capability of the machine, and the potential loss to the human; and (4) to understand, in particular, the implications of the computational limitations of humans and machines for value alignment. The core of our technical approach will be a cooperative, game-theoretic extension of inverse reinforcement learning that allows for the different action spaces of humans and machines and for the varying motivations of humans; the concepts of rational metareasoning and bounded optimality will inform our investigation of the effects of computational limitations.
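As a rough illustration of what such a cooperative, game-theoretic extension of inverse reinforcement learning might look like, one can write the interaction as a two-player game in which both players share the human’s objective but only the human observes its parameters. The tuple and notation below are assumptions made for exposition, not the final formulation to be developed in the project.

```latex
% Illustrative two-player cooperative game between a human H and a machine M.
% The components below are assumed notation, not the proposal's definitions.
\[
  \mathcal{G} \;=\; \big\langle\, S,\; \mathcal{A}^{H},\; \mathcal{A}^{M},\; T,\; \Theta,\; R,\; P_0,\; \gamma \,\big\rangle
\]
% S: world states;  A^H, A^M: the (possibly different) action spaces of the
% human and the machine;  T(s' | s, a^H, a^M): transition model;
% theta in Theta: parameters of the human's preferences, observed only by the
% human;  P_0: initial state distribution;  gamma: discount factor.
% Both players maximize the same expected return,
\[
  \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, R\big(s_t,\, a^{H}_{t},\, a^{M}_{t};\, \theta\big)\right],
\]
% so the machine does well exactly insofar as it satisfies the human's
% preferences, and must infer theta from the human's behavior.
```

Because the machine’s payoff is the human’s reward, the machine is motivated both to act and to gather information about the preference parameters, for example by observing or querying the human.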
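For the metareasoning side, a classical way to make the deliberation-versus-action tradeoff precise is the value of computation, in the spirit of rational metareasoning; the notation below (VOC, TC, alpha, alpha_c) is likewise illustrative rather than definitive.

```latex
% Value of a candidate internal computation c, in the style of classical
% rational metareasoning (notation assumed for illustration).
\[
  \mathrm{VOC}(c) \;=\;
  \underbrace{\mathbb{E}\big[U(\alpha_c)\big] - \mathbb{E}\big[U(\alpha)\big]}_{\text{expected gain in decision quality}}
  \;-\;
  \underbrace{\mathrm{TC}(c)}_{\text{time cost of } c}
\]
% alpha: the action currently believed best; alpha_c: the action that would be
% believed best after performing computation c. The agent keeps deliberating
% only while some available computation has VOC(c) > 0, and acts otherwise.
```

Bounded optimality then asks not for the best action in the abstract, but for the best program an agent can run given its fixed hardware and these time costs.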