The following post originally appeared here.
Last weekend, I attended OpenAI’s self-organizing conference on machine learning (SOCML 2016), meta-organized by Ian Goodfellow (thanks Ian!). It was held at OpenAI’s new office, with several floors of large open spaces. The unconference format was intended to encourage people to present current ideas alongside with completed work. The schedule mostly consisted of 2-hour blocks with broad topics like “reinforcement learning” and “generative models”, guided by volunteer moderators. I especially enjoyed the sessions on neuroscience and AI and transfer learning, which had smaller and more manageable groups than the crowded popular sessions, and diligent moderators who wrote down the important points on the whiteboard. Overall, I had more interesting conversation but also more auditory overload at SOCML than at other conferences.
To my excitement, there was a block for AI safety along with the other topics. The safety session became a broad introductory Q&A, moderated by Nate Soares, Jelena Luketina and me. Some topics that came up: value alignment, interpretability, adversarial examples, weaponization of AI.
One value alignment question was how to incorporate a diverse set of values that represents all of humanity in the AI’s objective function. We pointed out that there are two complementary problems: 1) getting the AI’s values to be in the small part of values-space that’s human-compatible, and 2) averaging over that space in a representative way. People generally focus on the ways in which human values differ from each other, which leads them to underestimate the difficulty of the first problem and overestimate the difficulty of the second. We also agreed on the importance of allowing for moral progress by not locking in the values of AI systems.
Nate mentioned some alternatives to goal-optimizing agents – quantilizers and approval-directed agents. We also discussed the limitations of using blacklisting/whitelisting in the AI’s objective function: blacklisting is vulnerable to unforeseen shortcuts and usually doesn’t work from a security perspective, and whitelisting hampers the system’s ability to come up with creative solutions (e.g. the controversial move 37 by AlphaGo in the second game against Sedol).
Been Kim brought up the recent EU regulation on the right to explanation for algorithmic decisions. This seems easy to game due to lack of good metrics for explanations. One proposed metric was that a human would be able to predict future model outputs from the explanation. This might fail for better-than-human systems by penalizing creative solutions if applied globally, but seems promising as a local heuristic.
Ian Goodfellow mentioned the difficulties posed by adversarial examples: an imperceptible adversarial perturbation to an image can make a convolutional network misclassify it with very high confidence. There might be some kind of No Free Lunch theorem where making a system more resistant to adversarial examples would trade off with performance on non-adversarial data.
We also talked about dual-use AI technologies, e.g. advances in deep reinforcement learning for robotics that could end up being used for military purposes. It was unclear whether corporations or governments are more trustworthy with using these technologies ethically: corporations have a profit motive, while governments are more likely to weaponize the technology.
More detailed notes by Janos coming soon! For a detailed overview of technical AI safety research areas, I highly recommend reading Concrete Problems in AI Safety.