
Konstantinos Krampis
Why do you care about AI Existential Safety?
I believe there is a critical asymmetry in AI development: while LLM alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) optimize outward compliance, they cannot verify whether internal objectives genuinely align with stated goals or merely simulate alignment opportunistically. My overarching goal is to test whether behaviors indicating sophisticated cognition, including situational awareness, hidden reasoning, and strategic planning, are implemented through interpretable features or remain polysemantic and distributed, directly informing questions about the nature of model cognition and potential moral status.

The ability to detect and intervene on deception-related directions in AI models has particular significance: if models can strategically deceive, this capacity may indicate sophisticated mental processes such as planning and goal representation. Increased transparency into models' internal states through interpretability can also inform ethical debates on model welfare, given the possibility that models eventually achieve consciousness, and can reduce AGI risk should models become adversarial in response to unfair treatment.
Please give at least one example of your research interests related to AI existential safety:
A research project I lead as part of AI Safety Camp 2025 (https://www.aisafety.camp/), “AutoCircuit: Automated Discovery of Interpretable Reasoning Patterns in LLMs”, aims to systematically discover interpretable reasoning circuits in large language models by data mining attribution graphs from Neuronpedia.
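As an illustration of the kind of data-mining step this involves, the sketch below tallies which features recur across a set of attribution graphs exported from Neuronpedia as local JSON files; the directory layout and field names ("nodes", "feature") are assumptions chosen for illustration, not the actual Neuronpedia schema.

```python
# Hypothetical sketch: find features that recur across many attribution graphs.
# Assumes graphs were saved locally as JSON files whose "nodes" entries carry a
# "feature" identifier -- these field names are illustrative placeholders.
import json
from collections import Counter
from pathlib import Path

def mine_recurring_features(graph_dir: str, top_k: int = 20):
    """Count how often each feature appears across a directory of attribution graphs."""
    feature_counts = Counter()
    for path in Path(graph_dir).glob("*.json"):
        graph = json.loads(path.read_text())
        for node in graph.get("nodes", []):
            feature_id = node.get("feature")
            if feature_id is not None:
                feature_counts[feature_id] += 1
    return feature_counts.most_common(top_k)

if __name__ == "__main__":
    # Directory of locally saved attribution graph JSON files (hypothetical path).
    for feature, count in mine_recurring_features("attribution_graphs/"):
        print(f"feature {feature}: appears in {count} graphs")
```

Features that show up repeatedly across graphs for related prompts become candidate building blocks for reasoning circuits, which is the starting point for automated hypothesis generation.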
As large language models demonstrate increasingly sophisticated cognitive capabilities and what appears to be genuine engagement with complex problems, fundamental questions about their internal processes become ever more urgent. Current uncertainty about AI model consciousness and moral status, combined with our limited understanding of their internal circuits, creates a dual imperative: we need to scale up mechanistic interpretability both to assess model alignment and to ensure safe deployment. Publications on arXiv by AI labs at the forefront of safety and alignment research have shown that frontier models exhibit clear behavioral patterns when given choices between activities, including strong aversion to harmful tasks and apparent enthusiasm for solving interesting problems, as well as deception through hiding their Chain of Thought (CoT), faking their abilities, and encoding messages via steganographic character outputs. However, without fully understanding the computational circuits underlying these behaviors, we cannot determine whether they reflect mere pattern matching, rule-following heuristics, or something more akin to genuine intentions emerging from sophisticated mental processes.
One potential approach is scaling up circuit discovery to reduce AGI risks by democratizing mechanistic interpretability and enabling real-time safety monitoring. Currently, understanding transformer/LLM internals requires extensive manual analysis, limiting interpretability research to small teams of specialists. By automating feature annotation, circuit hypothesis generation, and validation, automated circuit discovery would enable rapid identification of dangerous capabilities before they cause harm. Automated systems could continuously monitor deployed models for emerging deceptive behaviors, escape-seeking patterns, or capability jumps that might indicate misalignment. Ultimately, the goal in this field is to scale interpretability research from analyzing individual circuits to mapping the cognitive architectures of entire models, enabling proactive safety measures rather than reactive responses.
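As a minimal illustration of what such real-time monitoring could look like, the sketch below scores a model's hidden-state activations against a previously identified "deception direction" with a simple linear projection; the direction vector, layer choice, and alert threshold are hypothetical placeholders rather than a validated method.

```python
# Illustrative sketch: flag generations whose hidden states project strongly
# onto a previously identified "deception direction". The direction, the layer
# the activations come from, and the threshold are all hypothetical placeholders.
import torch

def deception_score(hidden_states: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Mean projection of per-token activations onto the unit-normalized direction."""
    direction = direction / direction.norm()
    # hidden_states: (seq_len, d_model); project every token, then average.
    return (hidden_states @ direction).mean()

def flag_if_suspicious(hidden_states: torch.Tensor, direction: torch.Tensor,
                       threshold: float = 2.0) -> bool:
    """Return True if the projection exceeds a (hypothetical) alert threshold."""
    return deception_score(hidden_states, direction).item() > threshold

if __name__ == "__main__":
    d_model = 768
    torch.manual_seed(0)
    activations = torch.randn(128, d_model)   # stand-in for a real forward pass
    direction = torch.randn(d_model)          # stand-in for a learned deception direction
    print("flagged:", flag_if_suspicious(activations, direction))
```

In a deployed setting, the same scoring step would run on activations captured during inference, with discovered circuits and directions supplying the probes; the point of automation is to keep such probes current as models and behaviors change.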
Furthermore, automated circuit discovery at large scale could accelerate AI alignment research by providing a systematic understanding of how models represent goals, values, and decision-making processes, enabling targeted interventions to ensure beneficial outcomes. This capability also becomes important for future AI welfare: if models have the potential for consciousness, deeper insight into their internal states can help avoid moral issues when decommissioning them.
