AI Safety Research

David Parkes

George F. Colony Professor of Computer Science, and Area Dean for Computer Science

School of Engineering and Applied Science,Harvard University

Project: Mechanism Design for AI Architectures

Amount Recommended:    $200,000

Project Summary

Economics models the behavior of people, firms, and other decision makers, as a means to understand how these decisions shape the pattern of activities that produce value and ultimately satisfy (or fail to satisfy) human needs and desires. The field adopts rational models of behavior, either of individuals or of behavior in the aggregate.

Artificial Intelligence (AI) research is also drawn to rationality concepts, which provide an ideal for the computational agents that it seeks to create. Although perfect rationality is not achievable, the capabilities of AI are rapidly advancing, and AI can already surpass human-level capabilities in narrow domains.

We envision a future with a massive number of AIs, these AIs owned, operated, designed, and deployed by a diverse array of entitites. This multiplicity of interacting AIs, apart or together with people, will constitute a social system, and as such economics can provide a useful framework for understanding and influencing the aggregate. In turn, systems populated by AIs can benefit from explicit design of the frameworks within which AIs exist. The proposed research looks to apply the economic theory of mechanism design to the coordination of behavior in systems of multiple AIs, looking to promote beneficial outcomes.

Technical Abstract

When a massive number of AIs are owned, operated, designed, and deployed by a diverse array of firms, individuals, and governments, this multi-agent AI constitutes a social system, and economics provides a useful framework for understanding and influencing the aggregate. In particular, we need to understand how to design multi-agent systems that promote beneficial outcomes when AIs interact with each other. A successful theory must consider both incentives and privacy considerations.

Mechanism design theory from economics provides a framework for the coordination of behavior, such that desirable outcomes are promoted and less desirable outcomes made less likely because they are not in the self-interest of individual actors. We propose a program of fundamental research to understand the role of mechanism design, multi-agent dynamical models, and privacy-preserving algorithms, especially in the context of multi-agent systems in which the AIs are built through reinforcement learning (RL). The proposed research considers two concrete AI problems: the first is experiment design, typically formalized as a multi-armed bandit process, which we study in a multi-agent, privacy-preserving setting. The second is the more general problem of learning to act in Markovian dynamical systems, including both planning and RL agents.

Training Artificial Intelligence to Compromise


Imagine you’re sitting in a self-driving car that’s about to make a left turn into on-coming traffic. One small system in the car will be responsible for making the vehicle turn, one system might speed it up or hit the brakes, other systems will have sensors that detect obstacles, and yet another system may be in communication with other vehicles on the road. Each system has its own goals — starting or stopping, turning or traveling straight, recognizing potential problems, etc. — but they also have to all work together toward one common goal: turning into traffic without causing an accident.

Harvard professor and FLI researcher, David Parkes, is trying to solve just this type of problem. Parkes told FLI, “The particular question I’m asking is: If we have a system of AIs, how can we construct rewards for individual AIs, such that the combined system is well behaved?”

Essentially, an AI within a system of AIs — like that in the car example above — needs to learn how to meet its own objective, as well as how to compromise so that it’s actions will help satisfy the group objective. On top of that, the system of AIs needs to consider the preferences of society. The safety of the passenger in the car or a pedestrian in the crosswalk is a higher priority than turning left.

Training a well-behaved AI

Because environments like a busy street are so complicated, an engineer can’t just program an AI to act in some way to always achieve its objectives. AIs need to learn proper behavior based on a rewards system. “Each AI has a reward for its action and the action of the other AI,” Parkes explained. With the world constantly changing, the rewards have to evolve, and the AIs need to keep up not only with how their own goals change, but also with the evolving objectives of the system as a whole.

The idea of a rewards-based learning system is something most people can likely relate to. Who doesn’t remember the excitement of a gold star or a smiley face on a test? And any dog owner has experienced how much more likely their pet is to perform a trick when it realizes it will get a treat. A reward for an AI is similar.

A technique often used in designing artificial intelligence is reinforcement learning. With reinforcement learning, when the AI takes some action, it receives either positive or negative feedback. And it then tries to optimize its actions to receive more positive rewards. However, the reward can’t just be programmed into the AI. The AI has to interact with its environment to learn which actions will be considered good, bad or neutral. Again, the idea is similar to a dog learning that tricks can earn it treats or praise, but misbehaving could result in punishment.

More than this, Parkes wants to understand how to distribute rewards to subcomponents – the individual AIs – in order to achieve good system-wide behavior. How often should there be positive (or negative) reinforcement, and in reaction to which types of actions?

For example, if you were to play a video game without any points or lives or levels or other indicators of success or failure, you might run around the world killing or fighting aliens and monsters, and you might eventually beat the game, but you wouldn’t know which specific actions led you to win. Instead, games are designed to provide regular feedback and reinforcement so that you know when you make progress and what steps you need to take next. To train an AI, Parkes has to determine which smaller actions will merit feedback so that the AI can move toward a larger, overarching goal.

Rather than programming a reward specifically into the AI, Parkes shapes the way rewards flow from the environment to the AI in order to promote desirable behaviors as the AI interacts with the world around it.

But this is all for just one AI. How do these techniques apply to two or more AIs?

Training a system of AIs

Much of Parkes’ work involves game theory. Game theory helps researchers understand what types of rewards will elicit collaboration among otherwise self-interested players, or in this case, rational AIs. Once an AI figures out how to maximize its own reward, what will entice it to act in accordance with another AI? To answer this question, Parkes turns to an economic theory called mechanism design.

Mechanism design theory is a Nobel-prize winning theory that allows researchers to determine how a system with multiple parts can achieve an overarching goal. It is a kind of “inverse game theory.” How can rules of interaction – ways to distribute rewards, for instance – be designed so individual AIs will act in favor of system-wide and societal preferences? Among other things, mechanism design theory has been applied to problems in auctions, e-commerce, regulations, environmental policy, and now, artificial intelligence.

The difference between Parkes’ work with AIs and mechanism design theory is that the latter requires some sort of mechanism or manager overseeing the entire system. In the case of an automated car or a drone, the AIs within have to work together to achieve group goals, without a mechanism making final decisions. As the environment changes, the external rewards will change. And as the AIs within the system realize they want to make some sort of change to maximize their rewards, they’ll have to communicate with each other, shifting the goals for the entire autonomous system.

Parkes summarized his work for FLI, saying, “The work that I’m doing as part of the FLI grant program is all about aligning incentives so that when autonomous AIs decide how to act, they act in a way that’s not only good for the AI system, but also good for society more broadly.”

Parkes is also involved with the One Hundred Year Study on Artificial Intelligence, and he explained his “research with FLI has informed a broader perspective on thinking about the role that AI can play in an urban context in the near future.” As he considers the future, he asks, “What can we see, for example, from the early trajectory of research and development on autonomous vehicles and robots in the home, about where the hard problems will be in regard to the engineering of value-aligned systems?”

This article is part of a Future of Life series on the AI safety research grants, which were funded by generous donations from Elon Musk and the Open Philanthropy Project.


  1. Zhang, et al. On the Differential Privacy of Bayesian Inference. Proc. 13th AAAI Conf. on Artificial Intelligence (AAAI 2016), 2016.
  2. Tossou, A.C.Y. and Dimitrakakis, C. Algorithms for Differentially Private Multi-­Armed Bandits. Proc. 13th AAAI Conf. on Artificial Intelligence (AAAI 2016), 2016.

Ongoing Projects

Planned Papers:

  1. The Helper Agent Problem: Theory and Algorithms
  2. Reward transfer Mechanisms for Multi-Agent AI
  3. Differentially-private Mechanisms for Multi-Agent AI