Lucy Farnik
Why do you care about AI Existential Safety?
AI seems like by far the most likely way for humanity to go extinct in the near future. I’ve always been motivated by trying to do as much good in the world as I can, and was initially planning on earning to give, but realizing just how likely AI-driven extinction is (in particular after reading The Precipice) made me completely change my career direction. By default, I expect things to go poorly for humanity: AI safety is clearly taking a backseat to building shiny AI-powered products, and we’ve never before faced a problem where the first critical try is very likely to also be the last. But I also think that, because this research field is so absurdly small, individuals can have a massive amount of influence on the future of the world by working on this issue.
Please give at least one example of your research interests related to AI existential safety:
I’m currently working on mechanistic interpretability under Neel Nanda. My goal there is to enable white-box model evaluations and inference-time monitoring based on models’ internal states, which we could use to make superalignment safer (since I expect alignment is too difficult to solve “manually” and therefore needs to be solved by AIs instead). Before that, I worked on the theory of safe reinforcement learning and on neuroconnectionism (i.e. using insights from neuroscience for alignment fine-tuning).