AI Safety Research

Ramana Kumar

PhD student

University of Cambridge

ramana.kumar@gmail.com

Project: Applying Formal Verification to Reflective Reasoning

Amount Recommended:    $36,750

Project Summary

One path to significantly smarter-than-human artificial agents involves self-improvement, i.e., agents doing artificial intelligence research to make themselves even more capable. If such an agent is designed to be robust and beneficial, it should only execute self-modifying actions if it knows they are improvements, which, at a minimum, means being able to trust that the modified agent only takes safe actions. However, trusting the actions of a similar or smarter agent can lead to problems of self-reference, which can be seen as sophisticated versions of the liar paradox (which shows that the self-referential sentence “this sentence is false” cannot be consistently true or false). Several partial solutions to these problems have recently been proposed. However, current software for formal reasoning does not have sufficient support for self-referential reasoning to make these partial solutions easy to implement and study. In this project, we will implement a toy model of agents using these partial solutions to reason about self-modifications, in order to improve our understanding of the challenges of implementing self-referential reasoning, and to stimulate work on tools suitable for it.

Technical Abstract

Artificially intelligent agents designed to be highly reliable are likely to include a capacity for formal deductive reasoning to be applied in appropriate situations, such as when reasoning about computer programs including other agents and future versions of the same agent. However, it will not always be possible to model other agents precisely: considering more capable agents, only abstract reasoning about their architecture is possible. Abstract reasoning about the behavior of agents that justify their actions with proofs lead to problems of self-reference and reflection: Godel’s second incompleteness theorem shows that no sufficiently strong proof system can prove its own consistency, making it difficult for agents to show that actions their successors have proven to be safe are in fact safe (since an inconsistent proof system would be able to prove any action “safe”). Recently, some potential approaches to circumventing this obstacle have been proposed in the form of pen-and-paper proofs.

We propose building and studying implementations of agents using these approaches, to better understand the challenges of implementing tools that are able to support this type of reasoning, and to stimulate work in the interactive theorem proving community on this kind of tools.

Workshop Participation

  1. Colloquium Series on Robust and Beneficial AI (CSRBAI) – May 27-June 17, 2016.
  2. Self-Reference, Type Theory, and Formal Verification – April 1-3, 2016
    • In this conference, hosted by MIRI, participants worked on questions of self-reference in type theory and automated theorem provers, with the goal of studying systems that model themselves.