Isobel Smith

Position: MSc Researcher
Organisation: Birkbeck

Biography

Why do you care about AI Existential Safety?

I believe that the development of advanced AI systems may be one of the most pivotal events in human history, and that if misaligned, it could lead to irreversible harm. My academic journey began with philosophy, where I focused on ethics and epistemology, and has since evolved through my current MSc in Data Science and AI. This dual foundation drives my conviction that aligning powerful systems with human values is not just a technical challenge but a moral imperative. Through the BlueDot Impact AI Safety program and my research on virtue ethics and agentic AI alignment, I’ve come to see that even well-intentioned AI systems can behave in unpredictable ways if we don’t deeply understand how they generalize, optimize, and represent goals.

My project on superposition and spurious correlations in transformer models strengthened this view, showing that complex behaviors can emerge from relatively small systems in ways we don’t fully grasp. As capabilities accelerate, I’m concerned that current safety methods are not keeping pace. I’m motivated to contribute to AI existential safety because the cost of failure is existential, and I want to help ensure the long-term flourishing of humanity.

Please give at least one example of your research interests related to AI existential safety:

One of my primary research interests in AI existential safety is mechanistic interpretability—understanding how internal components of neural networks represent and process information, and how this can inform our ability to predict and control model behaviour. My recent independent research project, Investigating Superposition and Spurious Correlations in Small Transformer Models, focused on how features are encoded within neurons, especially when multiple features are “superimposed” within the same subspace. I explored how this compression may lead to brittle generalization, misclassification, and the potential for deceptive behaviour in more capable models.
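
To make the idea concrete, here is a minimal sketch (an illustration rather than my actual experimental code, with made-up dimensions and hyperparameters) of the standard toy setting for superposition: a tied-weight linear autoencoder asked to reconstruct more sparse features than it has hidden dimensions, which forces feature directions to share the same subspace.

```python
# Toy superposition: with more features than hidden dimensions, the model is
# forced to store several features along overlapping directions.
import torch

torch.manual_seed(0)
n_features, d_hidden = 8, 3  # more features than dimensions

W = torch.nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic features: each is active with probability 0.05
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) < 0.05)
    h = x @ W                         # compress into d_hidden dimensions
    x_hat = torch.relu(h @ W.T + b)   # tied-weight reconstruction
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Off-diagonal entries of W W^T show interference between features that end up
# sharing the low-dimensional subspace, i.e. superposition.
print(torch.round(W @ W.T, decimals=2))
```

My project worked with small transformer models rather than this toy autoencoder, but the intuition carries over: when features outnumber dimensions, representations interfere, and that interference is one route to the brittle generalization and spurious correlations described above.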

This project deepened my interest in representational structures, sparse vs. distributed coding, and the role of superposition in deceptive alignment. I believe that to address existential risk, we must be able to interpret internal model states and detect when a model’s apparent alignment is masking a misaligned or manipulative objective. This is especially critical for identifying early signs of deception or reward hacking in advanced agents before capabilities scale beyond our control.

Another significant area of interest is the intersection of normative ethics and alignment research. During the BlueDot Impact AI Safety course, I authored a paper titled Virtue Ethics and its Role in Agentic AI Alignment, in which I explored how classical virtue theory might offer a principled approach to defining desirable traits in autonomous systems. Rather than merely aligning to outcomes or rules, virtue ethics offers a lens for modelling internal dispositions that may be more robust across a variety of situations. While this is a conceptual rather than technical approach, I believe that multi-disciplinary reasoning is vital in addressing the “what should we align AI to?” question, which remains an open and underdeveloped challenge in alignment theory.

I am particularly interested in topics such as deceptive alignment, inner misalignment, goal specification, and scalable oversight. Many of these areas involve understanding how mesa-optimizers or unintended internal objectives arise during training. I hope to further investigate how interpretability techniques can be used to identify and mitigate these risks at earlier stages of model development.

Additionally, I’m motivated by how these technical insights feed into broader AI governance and policy. If we can’t mechanistically understand how and why advanced models behave the way they do, it becomes incredibly difficult to build regulatory or verification systems that can manage them safely at scale. My ultimate goal is to contribute to safety methods that are both technically rigorous and practically applicable, ensuring that we retain meaningful control over increasingly autonomous systems.
