Adrià Garriga Alonso
Why do you care about AI Existential Safety?
It is likely that, in the next few decades, AI will surpass humans at crucial tasks like technology R&D and social persuasion. A superhuman AI will clearly understand human values, but we don't know how to accurately point to those values and train the AI to optimize for them. Nor do we know how to build an AI that does not explicitly optimize for anything and simply remains generally helpful: existing language model alignment schemes are a start, but they may still leave underlying goals that surface as AI capabilities increase.
If things remain this way after a few generations of AI-driven development of more capable AIs, it is likely that humanity will lose control of its own future. Everything that we value may be destroyed if the AI finds it slightly expedient.
Please give at least one example of your research interests related to AI existential safety:
I'm pursuing neural network (NN) interpretability research. My main goal is to understand whether, and how, neural networks acquire goals and agency.
For a loss-of-control scenario to occur, the AI likely needs to explicitly reason about what its goals are and how to achieve them. If we can figure out what a particular NN's goals are, perhaps we can get a handle on how to satisfy those goals in a way that also keeps humanity flourishing, or on how to change the NN's goals.
Or perhaps NNs in practice don't have very salient goals and don't do planning. In that case, we're probably in a safer world, or the avenues for existential risk are different.