The Problem of Self-Referential Reasoning in Self-Improving AI: An Interview with Ramana Kumar, Part 2

When it comes to artificial intelligence, debates often arise about what constitutes "safe" and "unsafe" actions. As Ramana Kumar, an AGI safety researcher at DeepMind, notes, the terms are subjective and "can only be defined with respect to the values of the AI system's users and beneficiaries."
Fortunately, such questions can mostly be sidestepped when confronting the technical problems associated with creating safe AI agents, as these problems aren't about identifying what is right or morally proper. Rather, from a technical standpoint, a "safe" AI agent is best defined as one that consistently takes actions leading to the desired outcomes, whatever those desired outcomes may be.
In this respect, Kumar explains that, when it comes to creating an AI agent that is tasked with improving itself, "the technical problem of building a safe agent is largely independent of what 'safe' means because a large part of the problem is how to build an agent that reliably does something, no matter what that thing is, in such a way that the method continues to work even as the agent under consideration is more and more capable."
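To make that goal-independence concrete, here is a minimal Python sketch (the names and setup are ours, purely for illustration, not Kumar's): the action-selection machinery is written once and works unchanged for any goal function supplied to it, which is the sense in which reliable behaviour is a separate problem from deciding what the goal should be.

```python
from typing import Callable, List

State = dict
Action = str

def choose_action(state: State,
                  actions: List[Action],
                  goal: Callable[[State, Action], float]) -> Action:
    """Pick whichever action the supplied goal scores highest.

    The selection logic never looks inside the goal, so the same
    machinery serves any 'desired outcome' the user plugs in.
    """
    return max(actions, key=lambda a: goal(state, a))

# Two different desired outcomes, one unchanged selection mechanism.
prefer_speed = lambda state, action: 1.0 if action == "take_highway" else 0.0
prefer_caution = lambda state, action: 1.0 if action == "take_side_roads" else 0.0

options = ["take_highway", "take_side_roads"]
print(choose_action({}, options, prefer_speed))    # -> take_highway
print(choose_action({}, options, prefer_caution))  # -> take_side_roads
```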
In short, making a "safe" AI agent should not be conflated with making an "ethical" AI agent. The two terms refer to different things.
In general, sidestepping moralistic definitions of safety makes the technical work on AI quite a bit easier. It allows research to advance while debates on the ethical issues evolve. A case in point: Uber's self-driving cars are already on the streets, despite the fact that we've yet to agree on a framework regarding whether they should safeguard their driver or pedestrians.
However, when it comes to creating a robust and safe AI system that is capable of self-improvement, the technical work gets a lot harder, and research in this area is still in its most nascent stages. This is primarily because we aren't dealing with just one AI agent; we are dealing with generations of future self-improving agents.
Kumar clarifies, "When an AI agent is self-improving, one can view the situation as involving two agents: the 'seed' or 'parent' agent and the 'child' agent into which the parent self-modifies… and its total effects on the world will include the effects of actions made by its descendants." As a result, in order to know we've made a safe AI agent, we need to understand all possible child agents that might originate from the first agent.
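As a rough illustration of why the descendants matter, consider the toy model below (a hypothetical sketch, not Kumar's formalism): each agent may self-modify into a child, and the seed's "total effects" are defined recursively over the whole chain.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Agent:
    name: str
    own_effects: List[str] = field(default_factory=list)
    child: Optional["Agent"] = None  # the agent this one self-modifies into

    def total_effects(self) -> List[str]:
        """Effects of this agent plus those of every descendant."""
        effects = list(self.own_effects)
        if self.child is not None:
            effects += self.child.total_effects()
        return effects

# A seed agent that rewrites itself twice.
seed = Agent("v1", ["effect_a"],
             child=Agent("v2", ["effect_b"],
                         child=Agent("v3", ["effect_c"])))

# Calling the seed "safe" means accounting for effect_b and effect_c,
# even though the seed never performs those actions itself.
print(seed.total_effects())  # ['effect_a', 'effect_b', 'effect_c']
```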
And verifying the safety of all future AI agents comes down to solving a problem known as "self-referential reasoning."
Understanding the Self-Referential Problem
The problem with self-referential reasoning is most easily understood by defining the term according to its two primary components: self-reference and reasoning.
- Self-reference: An instance in which someone or something (such as a computer program or a book) refers to itself. Any person or thing that refers to itself is called "self-referential."
- Reasoning: In AI systems, reasoning is a process through which an agent establishes "beliefs" about the world, like whether or not a particular action is safe or a specific reasoning system is sound. "Good beliefs" are beliefs that are sound or plausible based on the available evidence. The term "belief" is used instead of "knowledge" because the things that an agent believes may not be factually true and can change over time.
In relation to AI, then, the term "self-referential reasoning" refers to an agent that is using a reasoning process to establish a belief about that very same reasoning process. Consequently, when it comes to self-improvement, the "self-referential problem" is as follows: an agent is using its own reasoning system to determine that future versions of its reasoning system will be safe.
To explain the problem another way, Kumar notes that, if an AI agent creates a child agent to help it achieve its goal, it will want to establish some beliefs about the child's safety before using it. This will necessarily involve establishing beliefs about the child by arguing that the child's reasoning process is good. Yet the child's reasoning process may be similar to, or even an extension of, the original agent's reasoning process. And ultimately, an AI system cannot use its own reasoning to determine whether or not that reasoning is good.
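The circularity can be seen in miniature in the sketch below (the names are hypothetical and purely illustrative): the only tool the parent has for judging the child's reasoner is that same reasoner.

```python
class Reasoner:
    """Stand-in for an agent's proof/reasoning system."""

    def proves(self, claim: str) -> bool:
        # A real system would search for a formal proof of `claim`;
        # here we only record what we are being asked to establish.
        print(f"asked to prove: {claim!r}")
        return False  # placeholder

parent_reasoner = Reasoner()
child_reasoner = parent_reasoner  # the child inherits (or extends) the parent's system

# Before deferring to the child, the parent wants to establish:
#   "whatever child_reasoner proves safe really is safe"
# But its only means of establishing that claim is parent_reasoner --
# the very system whose reliability is in question.
parent_reasoner.proves("everything child_reasoner proves safe is actually safe")
```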
From a technical standpoint, the problem comes down to Gödel's second incompleteness theorem, which, Kumar explains, "shows that no sufficiently strong proof system can prove its own consistency, making it difficult for agents to show that actions their successors have proven to be safe are, in fact, safe."
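For readers who want the formal version, the standard (textbook) statements are as follows; they are not specific to Kumar's project. Let T be a consistent proof system strong enough to formalise arithmetic:

```latex
% Gödel's second incompleteness theorem: T cannot prove its own consistency.
T \nvdash \mathrm{Con}(T)

% Löb's theorem sharpens the obstacle for self-trust: T can endorse
% "whatever T proves is true" only for statements it already proves.
T \vdash \big(\mathrm{Prov}_T(\ulcorner A \urcorner) \rightarrow A\big)
  \;\Longleftrightarrow\; T \vdash A
```

This is why a parent agent cannot simply argue, "my child uses my proof system, my proof system is sound, therefore my child's safety proofs can be trusted."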
Investigating Solutions
To date, several partial solutions to this problem have been proposed; however, our current software doesn't have sufficient support for self-referential reasoning to make the solutions easy to implement and study. Consequently, in order to improve our understanding of the challenges of implementing self-referential reasoning, Kumar and his team aimed to implement a toy model of AI agents using some of the partial solutions that have been put forth.
Specifically, they investigated the feasibility of implementing one particular approach to the self-reference problem in a concrete setting (Botworld) where all the details could be checked. The approach they selected was model polymorphism. Instead of requiring a proof that an action is safe for all future use cases, model polymorphism requires only that an action be proven safe for an arbitrary number of steps (or subsequent actions), a number that is kept abstract from the proof system.
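A rough sketch of the difference in proof obligations is below, in Python-flavoured types (the predicate and function names are our own illustration, not Kumar's or MIRI's formalism):

```python
from typing import Callable

Action = str
# "safe_for(k, action)" stands in for the formal claim that taking `action`
# causes no harm within the next k steps.
SafetyPredicate = Callable[[int, Action], bool]

def naive_obligation(safe_for: SafetyPredicate, action: Action, horizon: int) -> bool:
    """'Prove the action safe at every future step.'
    Shown with a finite horizon here; in the real setting the quantification
    is unbounded, which forces the agent to vouch for the soundness of its
    own proof system and runs into Gödel's theorem."""
    return all(safe_for(k, action) for k in range(horizon))

def polymorphic_obligation(safe_for: SafetyPredicate, action: Action, n: int) -> bool:
    """Model polymorphism: prove safety for n steps, where n is arbitrary and
    the proof is never allowed to inspect its value. Because the argument
    cannot depend on what n is, one proof covers every horizon without the
    agent ever having to assert its own soundness."""
    return safe_for(n, action)
```

The point of keeping n abstract is that the same argument can be reused by each successor agent for its own successors, however long the chain turns out to be.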
Kumar notes that the overall goal was ultimately "to get a sense of the gap between the theory and a working implementation and to sharpen our understanding of the model polymorphism approach." This would be accomplished by proving a theorem in a HOL (Higher Order Logic) theorem prover that describes the situation.
To break this down a little: theorem provers are computer programs that assist with the development of mathematical correctness proofs. These correctness proofs are the highest safety standard in the field, showing that a computer system always produces the correct output (or response) for any given input. Theorem provers create such proofs by using the formal methods of mathematics to prove or disprove the "correctness" of the control algorithms underlying a system. HOL theorem provers, in particular, are a family of interactive theorem proving systems that facilitate the construction of theories in higher-order logic. Higher-order logic, which supports quantification over functions, sets, sets of sets, and more, is more expressive than other logics, allowing the user to write formal statements at a high level of abstraction.
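As a small example of what quantifying over functions and sets buys you, the induction principle for the natural numbers is a single higher-order statement because it quantifies over an arbitrary predicate P, something first-order logic can only express as an infinite schema of formulas:

```latex
\forall P.\; \Big( P(0) \;\wedge\; \big(\forall n.\; P(n) \rightarrow P(n+1)\big) \Big)
  \;\rightarrow\; \forall n.\; P(n)
```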
In retrospect, Kumar states that trying to prove a theorem about multiple steps of self-reflection in a HOL theorem prover was a massive undertaking. Nonetheless, he asserts that the team took several strides forward in grappling with the self-referential problem, noting that they built "a lot of the requisite infrastructure and got a better sense of what it would take to prove it and what it would take to build a prototype agent based on model polymorphism."
Kumar added that MIRI's (the Machine Intelligence Research Institute's) Logical Inductors could also offer a satisfying version of formal self-referential reasoning and, consequently, provide a solution to the self-referential problem.
If you haven’t read it yet, find Part 1 here.