
Can AI agents learn to be good?

AI agents are different from AI assistants because they can initiate actions independently. Here we discuss the safety concerns involved with AI agents and what we are doing to mitigate them.
Published: August 29, 2024
Author: Ram Rachum
'Spot' robot from Boston Dynamics. Image by Jonte, from Wikimedia Commons, CC BY-SA 4.0.


Ram Rachum is a researcher at Tufts University, an IAEAI scientific council member, an ALTER AI Safety fellowship alumnus, and a member of the Future of Life Institute’s AI Existential Safety Community. Ram is the beneficiary of an FLI travel grant.


The rapid progress in artificial intelligence raises many concerns about its safety. The AI Existential Safety Community wants to ensure that the effects of AI on our society will be positive. A major challenge for researchers is that we are preparing for scenarios that have never happened before. This is why there are many different lines of research into the various ways AI could harm people and into how to prevent that harm.

In this blog post we discuss AI agents: what sets them apart from AI assistants, their potential usefulness and possible dangers, and what we are doing to mitigate these dangers.

What are AI agents?

If you have ever asked ChatGPT how long cooked cauliflower lasts in the fridge, you have used an AI assistant, but not an AI agent. While an exact definition of what makes a system an AI agent is a complicated topic, a simple definition would be: an AI agent does not wait to be prompted by a human, but rather senses its environment, makes decisions, initiates interactions with humans and changes its behavior as it learns more about the environment in which it operates.
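In schematic terms, the contrast looks like this: an assistant runs once per prompt, while an agent runs a continuous sense-decide-act-learn loop. The sketch below is purely illustrative; the function names are placeholders rather than any real API.

```python
# Purely schematic agent loop (placeholder names, not a real API).
def run_agent(agent, environment):
    while True:
        observation = environment.sense()           # perceive the current state of the world
        action = agent.decide(observation)          # choose an action without waiting for a prompt
        feedback = environment.execute(action)      # act, possibly initiating contact with humans
        agent.learn(observation, action, feedback)  # adapt future behavior based on what happened
```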

Let’s talk about different kinds of AI agents and examples of what they can do.

A robot is an AI agent with a body that can move around in the world and perform physical actions, like manufacturing products in a factory, cleaning up trash in the ocean, or assisting visually-impaired people. Self-driving cars are embodied AI agents as well, even though we do not usually think of them as robots.

AI agents without a physical body can act by accessing APIs for different services, such as sending emails. The oldest kind of AI agent in widespread use is the automated stock-trading system, which has been active since the 1980s, though only since the 2010s have such systems started to incorporate deep learning.

The distinctive quality of AI agents is that they initiate actions without being expressly ordered to do so. A subtle but crucial point is that agency is a spectrum rather than a yes-or-no property. The AI agents mentioned above are not as agentic as human beings, because they work on tasks given to them by humans rather than making up their own goals and aspirations; but they do have more agency than a chatbot, because they take some liberty in deciding which actions to take in order to achieve their programmed goals. One may think of a self-driving car as a machine with no agency, because much like an elevator, it has a human operator pressing its buttons to tell it where to go; the salient difference is that unlike the elevator, the self-driving car has to make thousands of little decisions, from choosing its route to deciding whether to slow down when approaching a traffic light. In extreme cases, these decisions will involve weighing the potential loss of human lives.

We are now witnessing a rapid proliferation of AI agents that can get information and perform actions on APIs related to health, finance, the environment, social media, energy, traffic, scientific research and more. While AI assistants allow us to delegate decisions, AI agents allow us to delegate entire projects. They therefore offer humanity unprecedented possibilities, while also posing enormous risks. We got a taste of these risks back in 2010, when the previously mentioned trading agents made an escalating flurry of U.S. stock trades that temporarily wiped out roughly $1 trillion in market value, an event known as the 2010 flash crash. Luckily, it did not end up destabilizing the U.S. economy. However, there is no guarantee that future disasters will be strictly financial, or easy to recover from. We wish to avoid ending up like the Sorcerer's Apprentice, who delegated his way into disaster; we wish to learn how to prevent unchecked AI activity from wreaking havoc on our economies, environment and culture.

Could AI agents intentionally hurt people?

In his 2019 book “Human Compatible”, prominent AI researcher Stuart Russell described a hypothetical scenario in which an AI agent designed with benevolent goals turns rogue and attempts to hurt people. This scenario is called “You can’t fetch the coffee if you’re dead.”

Illustration of a robot ‘fetching the coffee’: a scenario that Stuart Russell uses to explain why AI systems might have the instrumental goal of self-preservation.

The scenario imagines that you have an AI-powered robot trained to get you a cup of coffee from the nearest coffee shop. Through a process of trial and error, this robot has learned to perform a sequence of physical actions: it strolls down to the coffee shop, waits in line, gives the barista your coffee instructions, pays, waits again, grabs the coffee and delivers it back to your office. Its reward signal would be a combination of how good the coffee is, how much it costs, how fast it is delivered, and so on. Because it is an AI, it automatically learns the relationships between each action that it takes and its final reward signal. This means the robot would be regularly performing many little experiments and checking whether they reduce or increase the final reward.
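To make the incentive structure concrete, here is a toy sketch of what such a composite reward signal might look like; the terms and weights are invented for illustration and are not taken from Russell's book.

```python
def coffee_reward(quality, cost, minutes_elapsed):
    """Toy composite reward for the coffee-fetching robot (illustrative only).

    Note what is missing: there is no term penalizing cutting in line, using a
    stolen credit card, or harming anyone. A reward-maximizing agent has no
    reason to avoid behaviors that its reward function never measures.
    """
    return 2.0 * quality - 0.5 * cost - 0.1 * minutes_elapsed
```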

One such experiment may be to obtain a stolen credit card number on the dark web and use it to pay. If it reduces the total cost with no repercussions, the robot will learn to do it every time. Another experiment could be to cut in line in front of other customers, shaving two minutes off the delivery time. With the right hardware, the robot could save even more time by killing all the customers in the coffee shop. As a bonus, the barista might be inclined to offer that cup of coffee on the house.

The key insight is that an AI does not have to be designed with malicious intent in order to behave maliciously. Trial and error, if allowed to proceed unchecked, can lead to both wonderful and horrible results.

One way to react to this thought experiment is to say, "We need to have tighter safety controls on these AIs! They need to be limited in the actions they are allowed to take. No cheating, stealing, or hurting other people." Unfortunately, ethical rules are notoriously difficult to define. When ChatGPT went live in 2022, it had ethical rules that jailbreakers quickly circumvented, causing it to make offensive statements and give instructions for making illegal drugs. As OpenAI improved ChatGPT's safety controls, jailbreakers developed creative methods to hide their unethical payloads, such as engaging in role-play or asking for know-how about a criminal activity ostensibly in order to prevent it.

We would like to move beyond this cat-and-mouse game and find a more lasting approach for aligning AI agents.

What are we doing to make AI agents good?

Researchers in FLI's AI Existential Safety Community are pursuing many different research directions to address these problems. These range from theoretical work and position papers to algorithm development, empirical research and open-source code. Here is a small selection of these efforts:

Alan Chan, et al. Visibility into AI Agents (2024). The authors propose three ways to improve visibility into AI agents: (1) agent identifiers to help track AI agents’ involvement in interactions, (2) real-time monitoring for the immediate flagging of problematic behaviors and (3) activity logs that allow detailed records for post-incident analysis. The authors look at how these methods work in different setups, from big tech companies to individual users.
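As a purely hypothetical illustration of how these three mechanisms could fit together (the paper itself does not prescribe a schema), an activity-log record might carry an agent identifier, a monitoring flag and enough context for later analysis:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AgentActivityRecord:
    """Hypothetical log entry combining the three visibility mechanisms."""
    agent_id: str          # (1) unique identifier tying the action to a specific agent
    timestamp: datetime    # when the action happened, enabling real-time monitoring
    action: str            # what the agent did, e.g. "sent_email" or "placed_trade"
    flagged: bool = False  # (2) set by a monitor if the behavior looks problematic
    details: dict = field(default_factory=dict)  # (3) context kept for post-incident analysis
```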

Andreea Bobu, Andi Peng, et al. Aligning Human and Robot Representations (2023). The authors present a formal framework for aligning a robot’s goals to human goals, defining it as an optimization problem with four criteria: (1) value alignment, (2) generalizability, (3) reduced human burden and (4) explainability. The paper includes a survey of five different methods for representing a robot’s goals and a comparison of how well they satisfy each of these four criteria.

W. Bradley Knox, Stephane Hatgis-Kessell, et al. Models of human preference for learning reward functions (2022). Developers of popular LLMs such as ChatGPT and Google Gemini use a technique called Reinforcement Learning from Human Feedback (RLHF) to align these LLMs to human preferences. RLHF is also used for aligning robots, autonomous vehicles and image generators such as Midjourney. The authors identify several flaws in RLHF; for example, RLHF can recommend actions that lead to good outcomes because of luck rather than sensible decision-making. The authors propose a regret preference model that solves some of these problems and demonstrate its viability in an experiment in which human subjects give instructions to an AI driving a simulated autonomous vehicle.
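To sketch the contrast the authors draw, in heavily simplified form with invented function names (details such as trajectory segmentation and temperature parameters are omitted): a standard partial-return preference model assumes people prefer the segment with the larger summed reward, even when that reward came from luck, whereas a regret-based model scores each segment by how close its decisions were to optimal.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def partial_return_preference(rewards_a, rewards_b):
    # Simplified partial-return model: the probability of preferring segment A
    # depends only on the difference in summed (observed) rewards, so a
    # lucky-but-reckless segment can still come out ahead.
    return sigmoid(np.sum(rewards_a) - np.sum(rewards_b))

def regret_preference(advantages_a, advantages_b):
    # Simplified regret-based model: score each segment by the quality of its
    # decisions (summed advantage, i.e. negative regret) rather than by raw outcomes.
    return sigmoid(np.sum(advantages_a) - np.sum(advantages_b))
```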

Joey Hejna, et al. Contrastive Preference Learning: Learning from Human Feedback without RL (2023). The authors build on the work mentioned in the previous paragraph and provide a regret-based alternative to RLHF called Contrastive Preference Learning (CPL), which is more efficient and scalable while also being simpler to implement. The authors provide code for CPL on GitHub, allowing AI developers to integrate CPL into their AI systems.
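In rough outline (the authors' GitHub repository contains the actual implementation; this is only a condensed sketch of the kind of contrastive objective involved), CPL scores each segment of a preference pair by the policy's summed log-probabilities and trains the policy so that the preferred segment receives the higher score, with no separate reward model or RL loop:

```python
import torch
import torch.nn.functional as F

def cpl_style_loss(logp_preferred, logp_rejected, alpha=0.1):
    # logp_preferred / logp_rejected: per-step log-probabilities the policy assigns
    # to the actions in the preferred and rejected segments (shape: batch x steps).
    score_pref = alpha * logp_preferred.sum(dim=-1)  # score of the preferred segment
    score_rej = alpha * logp_rejected.sum(dim=-1)    # score of the rejected segment
    # Logistic contrastive loss: push the preferred segment's score above the rejected one's.
    return -F.logsigmoid(score_pref - score_rej).mean()
```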

Could AI agents figure out what is good by themselves?

Let’s try to look at the alignment problem from a different angle, by taking a step back to reflect on what it really means for an AI agent to be good.

Much of AI alignment research focuses on how to make AI agents obey rules, values and social norms. Some examples are “don’t lie”, “don’t discriminate”, and “don’t hurt people.” We are trying to teach AI agents to obey rules because this is what we have learned to do with people, in our legal systems, religions and other value systems. Rules are useful to us because of how clear and decisive they are. We each have our own romantic concept of what it means to be a good person, based on social experiences that we have had with other people, but these ideals are amorphous and not actionable. When we make up moral rules, we distill these ideals into actionable capsules that are easier to share with other people and get a consensus on.

The trolley problem is a moral dilemma where one must choose between diverting a trolley and allowing one person to die, or not diverting the trolley and allowing five people to die. This dilemma highlights the difference between deontological morality and consequentialist morality. We do not have a definitive answer of what an AI agent such as a self-driving car should do when faced with such a dilemma. (CC BY-SA 4.0 McGeddon and Zapyon)

The flip side is that any set of such moral rules has its breaking point, which is illustrated by moral dilemmas such as the trolley problem. We cannot say for sure what a self-driving car should do when it needs to decide whether to cause the death of one person by action, or of five people by inaction. There might be an analog of Gödel’s incompleteness theorem for moral rules, where any system of moral rules must have at least one moral dilemma that cannot be solved without breaking some of the rules, or inventing new ones.

Could an AI agent have a profound understanding of what it means to be good in a way that does not depend on rules? Could a self-driving car faced with the trolley problem make a decision that we will find satisfactory? If so, what would it take to develop such an AI agent?

We mentioned earlier that when people form their concepts of right and wrong, it is usually in a social context, which means our perception of "what is good" is shaped by experiences we have had with other people. We propose recreating the same kind of social process with AI agents, because it might lead to AI agents that develop a robust understanding of what it means to be good.

A social path to human-like artificial intelligence, a position paper published in Nature Machine Intelligence in 2023, suggests developing AI systems in which multiple AI agents interact with each other. Inspired by the collective intelligence exhibited by groups of animals and humans, the authors speculate that the social relationships AI agents form will enable them to solve problems that could not be solved by single-minded AI agents.

There have been many experiments with AI agents working as a team (see the video below of AI agents playing hide and seek), learning social norms together and influencing each other's actions. Concordia is a Google DeepMind framework in which multiple LLM-driven agents talk to each other, allowing researchers to study their social behavior. My team's modest contribution is a series of experiments showing that AI agents can form dominance hierarchies, a natural phenomenon seen in most animal species.

OpenAI research: AI agents playing hide and seek.

In most of these examples, the AI agents are avatars in a simple video game designed to elicit these social behaviors, which means they are not doing real-world tasks. A major challenge in these experiments is getting AI agents to think beyond their own selfish interests. AI algorithms are great at optimizing their own reward signal, but it is not clear how to get them to also optimize their teammates' reward signals. The upcoming Concordia contest at NeurIPS 2024 will challenge researchers to train LLM agents to maximize their cooperative ability.
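One simple idea that researchers have explored, shown here only as a toy sketch rather than as the contest's method, is to shape each agent's reward by blending it with its teammates' rewards:

```python
def prosocial_reward(own_reward, teammate_rewards, weight=0.5):
    # Toy sketch: interpolate between a purely selfish agent (weight = 0)
    # and a purely altruistic one (weight = 1) by mixing in the average
    # reward earned by the agent's teammates.
    if not teammate_rewards:
        return own_reward
    teammate_average = sum(teammate_rewards) / len(teammate_rewards)
    return (1 - weight) * own_reward + weight * teammate_average
```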

This line of research has not yet been applied to AI safety. First, we would need these AI agents to show social behavior that is more advanced and closer to a human level, including an understanding of right and wrong. Then we would need to train these agents on real-world tasks and get them to apply their moral understanding to these tasks in a way that is compatible with human values, norms and cultural practices.

These are formidable challenges. If this research direction proves to be a worthy one, it could help us develop AI agents that are much better aligned and more likely to benefit humans.


Thanks to Markov Grey, Taylor Jones, William Jones, David Manheim, Reuth Mirsky, Yonatan Nakar and Natalie Shapira for reviewing drafts of this post.

This content was first published at futureoflife.org on August 29, 2024.

About the Future of Life Institute

The Future of Life Institute (FLI) is a global non-profit with a team of 20+ full-time staff operating across the US and Europe. FLI has been working to steer the development of transformative technologies towards benefitting life and away from extreme large-scale risks since its founding in 2014. Find out more about our mission or explore our work.
