Benjamin Smith
Why do you care about AI Existential Safety?
AI Alignment is an existential issue: misaligned AI carries a substantial risk of cutting off humanity’s progress within the next hundred years. Whether you are concerned about the well-being of the next few generations or about humanity’s long-term flourishing (I am concerned about both), AI Existential Safety is a critical issue we must get right to achieve either objective. As a postdoctoral researcher in psychology, specializing in neuroimaging and computational modeling of decision-making and motivational systems, I want to explore the bridges that can be built between neuroscience and AI Alignment. There are likely important lessons for AI Alignment to be learned from neuroscience and psychology, and I want to understand what those are and help others understand them.
Please give one or more examples of research interests relevant to AI existential safety:
My primary overlapping interest in this area is multi-objective utility and decision-making. I’m interested in how biological and artificial agents, particularly agents using reinforcement learning, can trade off multiple objectives. For artificial agents, I believe a system that can competently trade off multiple objectives is less likely to be misaligned, or will be less severely misaligned, than a system trained on a single objective. Humans align to multiple objectives, as do other biological organisms, so any human-aligned system needs to balance those objectives appropriately. Further, any measure of human preferences is an external measure that must operationalize those preferences, and misalignment is less likely if multiple operationalizations (e.g., revealed vs. expressed preferences) are balanced against each other.

Last year, I published a paper arguing that multiple objectives are necessary to model empirically observed behavior on seemingly monotonic objectives in rats. I am currently working on two papers directly related to AI safety. The first, with Roland Pihlakas and Robert Klassert, addresses the trade-off between safety and performance objectives: in contrast to previous approaches that use a lexicographic ordering, we experimented with a non-linear, log-exponential trade-off that allows negative outcomes on some objectives in exchange for large positive outcomes on others, but only where the positive outcomes vastly outweigh the negative ones. I’m also working with Peter Vamplew on a response paper, “Single-objective reward is not enough”, explaining the importance of multi-objective reinforcement in biological systems.
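To make the log-exponential trade-off concrete, here is a minimal Python sketch of one way such an aggregation could look. The exact functional form in the paper may differ, so treat this as an illustrative assumption rather than the published method: gains on an objective are compressed roughly logarithmically while losses are amplified roughly exponentially, so a loss is only accepted when the offsetting gain is much larger.

```python
import numpy as np

def log_exp_utility(rewards) -> float:
    """Aggregate per-objective rewards non-linearly (illustrative sketch only).

    Positive components are compressed logarithmically and negative
    components are amplified exponentially, so a loss on one objective
    is tolerated only when gains elsewhere vastly outweigh it.
    """
    r = np.asarray(rewards, dtype=float)
    gains = np.log1p(r[r >= 0]).sum()     # diminishing returns on positive outcomes
    losses = -np.expm1(-r[r < 0]).sum()   # steeply growing cost of negative outcomes
    return gains + losses

# A linear sum would treat (+2, -1) as a net gain, but under this
# aggregation the loss is only accepted once the gain is much larger:
print(log_exp_utility([2.0, -1.0]))   # ~ -0.62  (trade-off rejected)
print(log_exp_utility([10.0, -1.0]))  # ~ +0.68  (trade-off accepted)
```

Unlike a lexicographic ordering, which never sacrifices a higher-priority objective, this kind of aggregation allows limited trade-offs while still heavily penalizing harm on any single objective.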