Kirill Acharya

Position
Stanford University
Biography

Why do you care about AI Existential Safety?

I’ve always been driven to develop safe and reliable algorithms. Ensuring that AI systems are aligned with human goals is crucial, and I believe this requires a deep understanding of the underlying technology. Without that, even small misalignments can lead to serious consequences.
Additionally, as AI becomes increasingly integrated into safety-critical environments, robust control is essential. Developing reliable and scalable methods requires strong theoretical guarantees, solid safety principles, and interpretability.

Please give at least one example of your research interests related to AI existential safety:

My research focuses on the interpretability of AI systems, particularly on examining the internal representations of large language models to understand how they encode features, knowledge, and intermediate reasoning. Specifically, I study how to impose safety constraints on model activations, using lightweight, interpretable proxy models to identify internal states that may lead to unsafe or undesirable behavior. By detecting these trajectories early in generation, the model's activations can be preemptively steered back into safe regions. This approach applies principles from control theory to achieve more reliable and aligned model behavior at scale.
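
The sketch below is only a minimal illustration of the general idea of activation probing and steering, not the actual method described above. It assumes a linear proxy probe (the hypothetical `probe_w`, `probe_b`) and a steering direction that would have been fit offline on labeled activations; here they are random placeholders, and the activation is a toy vector rather than a real model's residual stream.

```python
import torch

# Hypothetical linear "proxy" probe that scores how unsafe a hidden state looks.
# In practice the probe weights and steering direction would be fit offline on
# labeled activations; here they are random placeholders for illustration.
HIDDEN_DIM = 768
probe_w = torch.randn(HIDDEN_DIM)      # assumed pretrained probe direction
probe_b = torch.tensor(0.0)            # assumed pretrained probe bias
steer_dir = -probe_w / probe_w.norm()  # steer away from the "unsafe" direction

def unsafe_score(h: torch.Tensor) -> torch.Tensor:
    """Probability-like score that activation h leads toward unsafe output."""
    return torch.sigmoid(h @ probe_w + probe_b)

def steer(h: torch.Tensor, threshold: float = 0.5, alpha: float = 2.0) -> torch.Tensor:
    """If the proxy flags h as unsafe, nudge it back toward the safe region."""
    score = unsafe_score(h)
    if score > threshold:
        h = h + alpha * score * steer_dir  # correction proportional to the score
    return h

# Usage sketch: apply the check at each decoding step to the current activation.
h_t = torch.randn(HIDDEN_DIM)          # stand-in for one step's hidden state
h_t = steer(h_t)
print(f"post-steering unsafe score: {unsafe_score(h_t).item():.3f}")
```

In a real system the probe would run alongside the model at every generation step, so that corrections are applied before an unsafe trajectory propagates into the output.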
