
Aran Nayebi
Why do you care about AI Existential Safety?
AI is rapidly advancing in capabilities, so it is urgent that we develop a true *science* of AI safety, not merely philosophy and discussion (which is, of course, an important first step).
Please give at least one example of your research interests related to AI existential safety:
My research aims to turn AI existential safety into a *provable science*. One of the best ways to do this is to prove theorems about capable agents, in particular barriers to alignment (this is important in part because we shouldn’t have to run these agents in the real world to find out!). My latest work (https://arxiv.org/abs/2502.05934) shows that there is no “free lunch”: even for computationally unbounded agents, we cannot encode “all human values” without incurring high computational overhead to get them to provably align with high probability. We also show that reward hacking is an inevitable byproduct of computationally *bounded* agents and large state spaces. To avoid these barriers, we propose (1) settling on a small value set we want to align over (we propose one for “corrigibility” in the off-switch game to ensure human control: https://arxiv.org/abs/2507.20964), and (2) treating the practical goal of scalable oversight as alignment on the portions of the state space that matter most (safety-critical slices), not uniform coverage. Concretely: (i) focus rater/model time on risk-targeted slices via adversarial sampling, (ii) compress objectives to those governing these slices, and (iii) set (ε, δ) budgets per slice to certify coverage (see the sketch below).
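
To make (iii) concrete, here is a minimal sketch of what a per-slice (ε, δ) certification budget could look like. It assumes i.i.d. adversarially sampled episodes within each slice, a binary pass/fail evaluation, and a simple Hoeffding bound; the slice names, budget values, and helper functions are hypothetical illustrations, not taken from the papers above.

```python
import math

def hoeffding_samples(eps: float, delta: float) -> int:
    """Samples needed so the empirical failure rate on a slice is within eps
    of the true rate with probability at least 1 - delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def certify_slice(failures: int, n: int, eps: float, delta: float,
                  tolerance: float) -> bool:
    """Certify that the slice's true failure rate is at most `tolerance`
    with confidence 1 - delta, given n i.i.d. samples from that slice."""
    if n < hoeffding_samples(eps, delta):
        return False                          # sample budget not yet met
    return failures / n + eps <= tolerance    # upper confidence bound within tolerance

# Hypothetical per-slice budgets: tighter (eps, delta) on safety-critical slices.
budgets = {
    "off_switch_requests": (0.01, 0.001),  # safety-critical slice: tight budget
    "benign_chitchat":     (0.05, 0.05),   # lower-stakes slice: looser budget
}

for name, (eps, delta) in budgets.items():
    n = hoeffding_samples(eps, delta)
    print(f"{name}: >= {n} adversarially sampled episodes needed "
          f"to certify coverage at (eps={eps}, delta={delta})")

# Example: after observing 150 failures in 38,005 episodes on the critical slice
print(certify_slice(failures=150, n=38005, eps=0.01, delta=0.001, tolerance=0.02))
```

The point is simply that safety-critical slices get tighter (ε, δ) budgets, and the required oversight effort (sample counts) follows directly from those budgets rather than from uniform coverage of the whole state space.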
