New AI Safety Research Agenda From Google Brain

Google Brain just released an inspiring research agenda, Concrete Problems in AI Safety, co-authored by researchers from OpenAI, Berkeley, and Stanford. This document is a milestone in setting concrete research objectives for keeping reinforcement learning agents and other AI systems robust and beneficial. The problems studied are relevant to both near-term and long-term AI safety, from cleaning robots to higher-stakes applications. The paper takes an empirical focus on avoiding accidents as modern machine learning systems become increasingly autonomous and powerful.

Reinforcement learning is currently the most promising framework for building artificial agents – it is thus especially important to develop safety guidelines for this subfield of AI. The research agenda describes a comprehensive (though likely non-exhaustive) set of safety problems, corresponding to where things can go wrong when building AI systems:

  • Mis-specification of the objective function by the human designer. Two common pitfalls when designing objective functions are negative side-effects and reward hacking (also known as wireheading), which are likely to happen by default unless we figure out how to guard against them. One of the key challenges is specifying what it means for an agent to have a low impact on the environment while achieving its objectives effectively.

  • Extrapolation from limited information about the objective function. Even with a correct objective function, human supervision is likely to be costly, which calls for scalable oversight of the artificial agent.

  • Extrapolation from limited training data or using an inadequate model. We need to develop safe exploration strategies that avoid irreversibly bad outcomes, and build models that are robust to distributional shift – able to fail gracefully in situations that are far outside the training data distribution.
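The distributional-shift point above can be made concrete with a toy sketch: a policy that trusts its learned behavior only on inputs resembling its training data, and otherwise fails gracefully by falling back to a safe default. The class name, z-score test, and threshold below are illustrative assumptions, not anything from the paper:

```python
import statistics

# Hypothetical sketch: defer to a safe default action when an input
# looks far outside the training distribution.

class SafePolicy:
    def __init__(self, training_inputs, threshold=3.0):
        # Summarize the training distribution with a mean and std dev.
        self.mean = statistics.mean(training_inputs)
        self.std = statistics.stdev(training_inputs)
        self.threshold = threshold

    def in_distribution(self, x):
        # A crude z-score test: is x within `threshold` std devs of the mean?
        return abs(x - self.mean) <= self.threshold * self.std

    def act(self, x, learned_action, safe_action):
        # Use the learned behavior only on familiar inputs;
        # otherwise fall back to the conservative default.
        return learned_action(x) if self.in_distribution(x) else safe_action(x)

policy = SafePolicy(training_inputs=[9.0, 10.0, 11.0, 10.5, 9.5])
aggressive = lambda x: "learned"
conservative = lambda x: "safe-default"
print(policy.act(10.2, aggressive, conservative))   # familiar input
print(policy.act(100.0, aggressive, conservative))  # far outside training data
```

Real systems would need a far richer notion of "out of distribution" than a one-dimensional z-score, but the pattern — act only where the model has grounds for confidence — is the same.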

The AI research community has increasingly focused on AI safety in recent years, and Google Brain’s agenda is part of this trend. It follows on the heels of the Safely Interruptible Agents paper from Google DeepMind and the Future of Humanity Institute, which investigates how to avoid unintended consequences from interrupting or shutting down reinforcement learning agents. We at FLI are super excited that industry research labs at Google and OpenAI are spearheading and fostering collaboration on AI safety research, and we look forward to the outcomes of this work.

3 replies
  1. Ajish Ajoel Abraham says:

    I think even if AI behaves properly and is in line with human ideas, yet is smarter than humans, there will be at least thousands of talented, skillful hackers in the world somewhere, and if one knowingly or unknowingly uploads a virus into the system, will the AI be able to resist? Even if it does have resistance, what if there’s a chance that errors accumulate in its 99.99% accuracy and it turns into a mutant AI that is against, and literally competitive with, humans? And since the exponential rate is peaking and “the singularity” is near, don’t you think that tech industries are inviting doomsday into our home? What guarantee can they give for a safer, better future? Who would love to have someone listening to their private tele-conversations? C’mon, tech guys … this might sound awful, but there’s always a limit to everything. We humans are about to throw our boomerang so far that we won’t be able to fetch it back.

  2. Mindey says:

    Good. It looks like the five concrete problems are essentially one abstract problem — predict complex outcomes and their value to people. Obviously, there will be incomplete information, and people speaking in hints often fail to specify an objective function matching exactly what they want. So it looks like a few rules could help prevent all five possible failure modes:

    (A) always ask for the value to people of unrecognized stuff, or stuff of unknown value to people.
    (B) always ask people to check and re-evaluate the objective function on unexpected jumps in fitness.
    (C) don’t act if you can’t reliably predict the outcome; ask humans to confirm an experimental action whose outcome you can’t reliably predict.

    (1) Avoiding Negative Side Effects — (A) just ask for the value to people of things that exist,
    (2) Avoiding Reward Hacking — (B) just ask for the value to people on jumps in fitness,
    (3) Scalable Oversight — (A) ask for the value to people of unrecognized stuff,
    (4) Robustness to Distributional Shift — (A, B) ask for the value on a jump in fitness (“Are the values of things to people really the same in place X as in place Y? Looks like I can make a huge jump in fitness; is the objective function the same?”),
    (5) Safe Exploration — (C) reliably predict complex outcomes from hypothetical actions, and their value to people.

    To illustrate (B): for example, a human may want the AI to learn to play a game like Atari Breakout. The human’s intention may be to teach the AI to maximize the number of times the ball bounces off the paddle, to keep the ball in play longer, but for the sake of simplicity the human defines the objective function to be proportional to the number of bricks destroyed, since every bounce eventually destroys a brick. The human thinks: the more bricks, the more bounces off the paddle, since they correlate most of the time. The unintended consequence, the AI digging a tunnel through the bricks and letting the ball bounce above them, could be said to be an example of “Reward Hacking”. The AI discovered a jump in the optimization process which actually reduces the number of bounces off the paddle (the opposite of what the human wanted), without asking people about the value of its discovery. The blame is on the human, who incorrectly defined the objective function. Applying rule (B), it should ask people to re-think the objective function in cases when the optimization process leads to finding new optima.
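    The “ask on a jump in fitness” behavior of rule (B) could be sketched as a simple monitor over episode rewards. The jump factor and the ask_human callback below are illustrative assumptions, not part of the original comment or the paper:

```python
# Toy sketch of rule (B): pause optimization and consult a human whenever
# the proxy reward jumps far beyond anything seen before.

def monitor_rewards(episode_rewards, jump_factor=2.0, ask_human=None):
    """Return indices of episodes whose reward exceeds the running best
    by more than `jump_factor` — suspicious "fitness jumps"."""
    flagged = []
    best = None
    for i, r in enumerate(episode_rewards):
        if best is not None and best > 0 and r > jump_factor * best:
            flagged.append(i)
            if ask_human:
                ask_human(i, r)  # e.g. "is this really what you wanted?"
        best = r if best is None else max(best, r)
    return flagged

# Rewards climb slowly, then the agent discovers the tunnel-digging hack:
rewards = [10, 12, 13, 15, 14, 90]
print(monitor_rewards(rewards))  # only the final episode is flagged
```

    A sudden jump does not prove the objective is mis-specified, of course; the point is only that it is cheap to flag and worth a human look.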

    But actually, my greater concern would be that AI companies won’t open-source their AIs… So, here is one thing that looks obvious to me — requiring that the algorithms and hardware used for non-trivial human decisions all be transparent and make sense. I think the AI algorithms that don’t make perfect sense, they should probably be considered unsafe, and not granted computational resources. Optimizing for goals is simple in principle, there probably is no reason why people would not be able to understand the whole decision-making processes – visualizing what exactly constitutes their decision before they point where they relied on search results.

  3. Mindey says:


    * “Applying rule (B) it should ask people to re-think objective function in the cases when optimization process leads to finding new optima.”

    An example in the paper was that of a cleaning robot simply disabling its vision so that it won’t find any messes. Rule (B) would require the robot to say: “Hey, I found this nice way to reduce mess – simply close my eyes. It perfectly satisfies the objective function, and gives a huge fitness jump per unit of energy required. Are you sure this is your objective function, or do you want to redefine it?”
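    The robot’s “fitness jump per energy” test could itself be sketched as a small confirmation gate; the ratio threshold below is an illustrative assumption:

```python
# Hypothetical sketch: flag any action whose measured objective improves
# enormously relative to the effort spent, and require human confirmation
# before adopting it.

def needs_confirmation(objective_gain, energy_spent, gain_per_energy_limit=10.0):
    """A gain-per-energy ratio far above the limit suggests the agent found
    a shortcut (like disabling its sensors) rather than doing real work."""
    if energy_spent <= 0:
        return objective_gain > 0  # gain for free is always suspicious
    return objective_gain / energy_spent > gain_per_energy_limit

print(needs_confirmation(objective_gain=5.0, energy_spent=2.0))    # honest cleaning
print(needs_confirmation(objective_gain=100.0, energy_spent=0.1))  # "closed eyes" hack
```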

    ** Sorry for a few typos in the last two sentences, typed before falling asleep. It should have been:

    I think the AI algorithms that don’t make perfect sense should probably be considered unsafe, and not granted large computational resources. Optimizing for goals is simple in principle, so we should probably have no reason to hide the whole decision-making process — people should probably be able to visualize what exactly constitutes their decisions before the point where they rely on recommendations from systems (i.e., be able to see tracebacks of the execution of search, ranking, and other algorithms humans rely on for non-trivial decisions).
