
From the MIRI Blog: “Formalizing Convergent Instrumental Goals”

Published: December 1, 2015
Author: Rob Bensinger


Tsvi Benson-Tilsen, a MIRI associate and UC Berkeley PhD candidate, has written a paper with contributions from MIRI Executive Director Nate Soares on strategies that will tend to be useful for most possible ends: “Formalizing convergent instrumental goals.” The paper will be presented as a poster at the AAAI-16 AI, Ethics and Society workshop.

Steve Omohundro has argued that AI agents with almost any goal will converge upon a set of “basic drives,” such as resource acquisition, that tend to increase agents’ general influence and freedom of action. This idea, which Nick Bostrom calls the instrumental convergence thesis, has important implications for future progress in AI. It suggests that highly capable decision-making systems may pose critical risks even if they are not programmed with any antisocial goals. Merely by being indifferent to human operators’ goals, such systems can have incentives to manipulate, exploit, or compete with operators.

The new paper serves to add precision to Omohundro and Bostrom’s arguments, while testing the arguments’ applicability in simple settings. Benson-Tilsen and Soares write:

In this paper, we will argue that under a very general set of assumptions, intelligent rational agents will tend to seize all available resources. We do this using a model, described in section 4, that considers an agent taking a sequence of actions which require and potentially produce resources. […] The theorems proved in section 4 are not mathematically difficult, and for those who find Omohundro’s arguments intuitively obvious, our theorems, too, will seem trivial. This model is not intended to be surprising; rather, the goal is to give a formal notion of “instrumentally convergent goals,” and to demonstrate that this notion captures relevant aspects of Omohundro’s intuitions.

Our model predicts that intelligent rational agents will engage in trade and cooperation, but only so long as the gains from trading and cooperating are higher than the gains available to the agent by taking those resources by force or other means. This model further predicts that agents will not in fact “leave humans alone” unless their utility function places intrinsic utility on the state of human-occupied regions: absent such a utility function, this model shows that powerful agents will have incentives to reshape the space that humans occupy.

Benson-Tilsen and Soares define a universe divided into regions that may change in different ways depending on an agent’s actions. The agent wants to make certain regions enter certain states, and may collect resources from regions to that end. This model can illustrate the idea that highly capable agents nearly always attempt to extract resources from regions they are indifferent to, provided the usefulness of the resources outweighs the extraction cost.
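To make that intuition concrete, here is a minimal toy sketch in Python. It is not the paper's formal model, and every region name, resource value, and cost below is a hypothetical illustration; the sketch only shows the qualitative decision rule the model formalizes: an agent that places intrinsic value on some regions still harvests resources from the regions it is indifferent to whenever the value of those resources exceeds the cost of extracting them.

```python
# Toy sketch of the region/resource intuition discussed above.
# All region names, values, and costs are hypothetical illustrations,
# not parameters from Benson-Tilsen and Soares' formal model.

from dataclasses import dataclass


@dataclass
class Region:
    name: str
    resources: float        # resources extractable from this region
    extraction_cost: float  # cost (in resource units) of extracting them
    intrinsic_value: float  # utility the agent places on leaving it untouched


def regions_to_exploit(regions, resource_value=1.0):
    """Return the regions the agent has an incentive to harvest.

    The agent is 'indifferent' to a region when intrinsic_value == 0;
    it still exploits such a region whenever the resources gained are
    worth more than the cost of taking them.
    """
    exploited = []
    for r in regions:
        gain = resource_value * r.resources - r.extraction_cost
        if gain > r.intrinsic_value:
            exploited.append(r.name)
    return exploited


universe = [
    Region("human-occupied region", resources=5.0, extraction_cost=1.0, intrinsic_value=0.0),
    Region("valued region",         resources=5.0, extraction_cost=1.0, intrinsic_value=10.0),
    Region("barren region",         resources=0.5, extraction_cost=2.0, intrinsic_value=0.0),
]

print(regions_to_exploit(universe))
# -> ['human-occupied region']  (indifference does not protect a resource-rich region)
```

The sketch merely restates the paper's qualitative point: indifference toward a region does not lead the agent to leave it alone; only intrinsic utility placed on that region's state does.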

The relevant models are simple, and make few assumptions about the particular architecture of advanced AI systems. This makes it possible to draw some general conclusions about useful lines of safety research even if we’re largely in the dark about how or when highly advanced decision-making systems will be developed. The most obvious way to avoid harmful goals is to incorporate human values into AI systems’ utility functions, a project outlined in “The value learning problem.” Alternatively (or as a supplementary measure), we can attempt to specify highly capable agents that violate Benson-Tilsen and Soares’ assumptions, avoiding dangerous behavior in spite of lacking correct goals. This approach is explored in the paper “Corrigibility.”


Find the original post on the MIRI blog.

This content was first published at futureoflife.org on December 1, 2015.

