
Zachary Furman
Why do you care about AI Existential Safety?
In short: as a human who cares about other humans, I would like for humans not to go extinct. It seems increasingly plausible to me that one of the largest threats to the continued existence of humanity is artificial intelligence. This was perhaps easier to dismiss a decade ago, when it seemed like it might take many decades before AI approached a superhuman level; it is less easy to dismiss now. Of course, many reasonable people have argued that even on our current trajectory, artificial superintelligence will not lead to catastrophic outcomes for humanity, and I hope they’re right. Nevertheless, I remain concerned. I would like to be a part of the solution rather than watch from the sidelines.
Please give at least one example of your research interests related to AI existential safety:
To start, I should clarify that I believe we need significant sociopolitical and technical progress on multiple fronts to fully mitigate AI existential risk; my research addresses only one of the many directions of progress that will be necessary.
The root cause of most of the technical failure modes I’m worried about (misalignment, deception, etc.) is, in my opinion, a lack of understanding of how modern artificial intelligence works. For instance, modern alignment training (supervised fine-tuning, constitutional AI, etc.) attempts to shape the behavior of language models by carefully selecting the training data and reward signals given to the models. This has worked with moderate success so far. But we are largely ignorant of how this process actually works. We have little besides human intuition to predict whether a model will be deceptive or reward hack, to design training interventions that prevent such outcomes, or to determine whether our “control” countermeasures will be sufficient to prevent bad outcomes even if the model is misaligned. Perhaps uncharitably, this feels akin to determining the safe cargo capacity of an airplane by “eyeballing it.” Having worked in the aerospace industry several years ago, I can confirm this would not be acceptable.
I believe that technical research can address this issue, much as thermodynamics historically turned steam power from an unreliable and dangerous power source into a safe and well-understood tool for society. Concretely, I believe that a better understanding of model internals (interpretability), and of how we can shape those internals (training dynamics), will lead to safer AI systems. I believe singular learning theory (SLT) offers valuable insights on both of these fronts. My work has focused on using SLT to develop interpretability tools such as the local learning coefficient (Lau et al. 2024) and the loss kernel (Adam et al. 2025), and tools for understanding how training data shapes models, such as Bayesian influence functions (Kreer et al. 2025). I hope that my research can contribute to more robust understanding and control of modern artificial intelligence systems.
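To give a rough sense of the kind of quantity involved (this is my own informal summary, not a precise statement from the cited papers), the local learning coefficient of a parameter w* can be characterized by how the volume of nearby low-loss parameters scales as the loss tolerance shrinks:

\[
\lambda(w^\ast) \;=\; \lim_{\varepsilon \to 0^{+}} \frac{\log V(\varepsilon)}{\log \varepsilon},
\qquad
V(\varepsilon) \;=\; \operatorname{Vol}\bigl\{\, w \text{ near } w^\ast \;:\; L(w) - L(w^\ast) < \varepsilon \,\bigr\},
\]

where L is the loss. Roughly speaking, smaller λ corresponds to a more degenerate region of the loss landscape, and estimating λ for trained neural networks is the focus of the Lau et al. 2024 work mentioned above.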
