When it comes to artificial intelligence, debates often arise about what constitutes “safe” and “unsafe” actions. As Ramana Kumar, an AGI safety researcher at DeepMind, notes, the terms are subjective and “can only be defined with respect to the values of the AI system’s users and beneficiaries.”
Fortunately, such questions can mostly be sidestepped when confronting the technical problems associated with creating safe AI agents, as these problems aren’t associated with identifying what is right or morally proper. Rather, from a technical standpoint, the term “safety” is best defined as an AI agent that consistently takes actions that lead to the desired outcomes, regardless of whatever those desired outcomes may be.
In this respect, Kumar explains that, when it comes to creating an AI agent that is tasked with improving itself, “the technical problem of building a safe agent is largely independent of what ‘safe’ means because a large part of the problem is how to build an agent that reliably does something, no matter what that thing is, in such a way that the method continues to work even as the agent under consideration is more and more capable.”
In short, making a “safe” AI agent should not be conflated with making an “ethical” AI agent. The respective terms are talking about different things..
In general, sidestepping moralistic definitions of safety makes AI technical work quite a bit easier It allows research to advance while debates on the ethical issues evolve. Case in point, Uber’s self-driving cars are already on the streets, despite the fact that we’ve yet to agree on a framework regarding whether they should safeguard their driver or pedestrians.
However, when it comes to creating a robust and safe AI system that is capable of self-improvement, the technical work gets a lot harder, and research in this area is still in its most nascent stages. This is primarily because we aren’t dealing with just one AI agent; we are dealing with generations of future self-improving agents.
Kumar clarifies, “When an AI agent is self-improving, one can view the situation as involving two agents: the ‘seed’ or ‘parent’ agent and the ‘child’ agent into which the parent self-modifies….and its total effects on the world will include the effects of actions made by its descendants.” As a result, in order to know we’ve made a safe AI agent, we need to understand all possible child agents that might originate from the first agent.
And verifying the safety of all future AI agents comes down to solving a problem known as “self-referential reasoning.”
Understanding the Self-Referential Problem
The problem with self-referential reasoning is most easily understood by defining the term according to its two primary components: self-reference and reasoning.
- Self-reference: Refers to an instance in which someone (or something, such as a computer program or book) refers to itself. Any person or thing that refers to itself is called “self-referential.”
- Reasoning: In AI systems, reasoning is a process through which an agent establishes “beliefs” about the world, like whether or not a particular action is safe or a specific reasoning system is sound. “Good beliefs” are beliefs that are sound or plausible based on the available evidence. The term “belief” is used instead of “knowledge” because the things that an agent believes may not be factually true and can change over time.
In relation to AI, then, the term “self-referential reasoning” refers to an agent that is using a reasoning process to establish a belief about that very same reasoning process. Consequently, when it comes to self-improvement, the “self-referential problem” is as follows: An agent is using its own reasoning system to determine that future versions of its reasoning system will be safe.
To explain the problem another way, Kumar notes that, if an AI agent creates a child agent to help it achieve its goal, it will want to establish some beliefs about the child’s safety before using it. This will necessarily involve proving beliefs about the child by arguing that the child’s reasoning process is good. Yet, the child’s reasoning process may be similar to, or even an extension of, the original agent’s reasoning process. And ultimately, an AI system can not use its own reasoning to determine whether or not its reasoning is good.
From a technical standpoint, the problem comes down to Godel’s second incompleteness theorem, which Kumar explains, “shows that no sufficiently strong proof system can prove its own consistency, making it difficult for agents to show that actions their successors have proven to be safe are, in fact, safe.”
To date, several partial solutions to this problem have been proposed; however, our current software doesn’t have sufficient support for self-referential reasoning to make the solutions easy to implement and study. Consequently, in order to improve our understanding of the challenges of implementing self-referential reasoning, Kumar and his team aimed to implement a toy model of AI agents using some of the partial solutions that have been put forth.
Specifically, they investigated the feasibility of implementing one particular approach to the self-reference problem in a concrete setting (specifically, Botworld) where all the details could be checked. The approach selected was model polymorphism. Instead of requiring proof that shows an action is safe for all future use cases, model polymorphism only requires an action to be proven safe for an arbitrary number of steps (or subsequent actions) that is kept abstracted from the proof system.
Kumar notes that the overall goal was ultimately “to get a sense of the gap between the theory and a working implementation and to sharpen our understanding of the model polymorphism approach.” This would be accomplished by creating a proved theorem in a HOL (Higher Order Logic) theorem prover that describes the situation.
To break this down a little, in essence, theorem provers are computer programs that assist with the development of mathematical correctness proofs. These mathematical correctness proofs are the highest safety standard in the field, showing that a computer system always produces the correct output (or response) for any given input. Theorem provers create such proofs by using the formal methods of mathematics to prove or disprove the “correctness” of the control algorithms underlying a system. HOL theorem provers, in particular, are a family of interactive theorem proving systems that facilitate the construction of theories in higher-order logic. Higher-order logic, which supports quantification over functions, sets, sets of sets, and more, is more expressive than other logics, allowing the user to write formal statements at a high level of abstraction.
In retrospect, Kumar states that trying to prove a theorem about multiple steps of self-reflection in a HOL theorem prover was a massive undertaking. Nonetheless, he asserts that the team took several strides forward when it comes to grappling with the self-referential problem, noting that they built “a lot of the requisite infrastructure and got a better sense of what it would take to prove it and what it would take to build a prototype agent based on model polymorphism.”
Kumar added that MIRI’s (the Machine Intelligence Research Institute’s) Logical Inductors could also offer a satisfying version of formal self-referential reasoning and, consequently, provide a solution to the self-referential problem.
If you haven’t read it yet, find Part 1 here.