Safe Learning and Verification of Human-AI Systems
Recent developments in artificial intelligence (AI) have enabled us to build AI agents and robots capable of performing complex tasks, including many that involve interacting with humans. In these tasks, it is desirable for robots to build predictive and robust models of humans’ behaviors and preferences: a robot manipulator collaborating with a human needs to predict the human’s future trajectories, and passengers in self-driving cars may have preferences for how cautiously the car should drive. In reality, different humans have different preferences, which can be captured in the form of a mixture of reward functions. Learning this mixture is challenging precisely because the data comes from humans of different types. Moreover, it is usually assumed that these humans approximately optimize the learned reward functions; in many safety-critical scenarios, however, humans exhibit behaviors that the learned reward functions cannot easily explain, whether due to a lack of data or a misspecified reward structure. Our goal in this project is to actively learn a mixture of reward functions by eliciting comparisons from a mixed set of humans, and to analyze the generalizability and robustness of such models for safe and seamless interaction with AI agents.
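To make the idea of learning a mixture of reward functions from comparisons concrete, the following is a minimal illustrative sketch, not the project's actual method. It assumes trajectories are summarized by feature vectors, each human type k has a linear reward r_k(ξ) = w_k·φ(ξ), and preferences follow a Bradley-Terry (logistic) model; the function and variable names are hypothetical.

```python
# Sketch: EM-style fitting of a mixture of linear reward functions from
# pairwise preference comparisons. Illustrative only; names and modeling
# choices (linear rewards, Bradley-Terry likelihood) are our assumptions.
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_reward_mixture(diffs, prefs, n_types=2, n_iters=200, lr=0.5, seed=0):
    """Fit a mixture of linear reward functions.

    diffs : (N, d) array of feature differences phi(xi_A) - phi(xi_B)
    prefs : (N,) array, 1.0 if the human preferred xi_A over xi_B, else 0.0
    """
    rng = np.random.default_rng(seed)
    N, d = diffs.shape
    W = rng.normal(scale=0.1, size=(n_types, d))  # reward weights per human type
    alpha = np.full(n_types, 1.0 / n_types)       # mixture proportions

    for _ in range(n_iters):
        # E-step: responsibility of each type for each comparison
        probs = sigmoid(diffs @ W.T)                       # (N, K) P(A > B | type k)
        lik = np.where(prefs[:, None] == 1, probs, 1.0 - probs)
        resp = alpha * lik
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: one weighted logistic-regression gradient step per type,
        # then update the mixture proportions
        for k in range(n_types):
            grad = diffs.T @ (resp[:, k] * (prefs - probs[:, k])) / N
            W[k] += lr * grad
        alpha = resp.mean(axis=0)

    return W, alpha


if __name__ == "__main__":
    # Synthetic demo: two human types with opposite preferences on one feature
    rng = np.random.default_rng(1)
    true_W = np.array([[2.0, 0.5], [-2.0, 0.5]])
    types = rng.integers(0, 2, size=500)
    diffs = rng.normal(size=(500, 2))
    prefs = (rng.random(500) < sigmoid(np.einsum("nd,nd->n", diffs, true_W[types]))).astype(float)
    W, alpha = fit_reward_mixture(diffs, prefs)
    print("estimated mixture weights:", alpha)
    print("estimated reward weights:\n", W)
```

Active query selection (choosing which pair of trajectories to show next) and the robustness analysis discussed above would sit on top of such a model; this sketch only illustrates the passive mixture-learning step.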