
Xisen Wang
Why do you care about AI Existential Safety?
I care about AI existential safety because the systems we build today are acquiring capabilities far faster than we are developing the frameworks needed to understand, control, or align them. My work on agentic systems has shown me how quickly models can develop unanticipated behaviours: emergent planning, self-modifying loops, and brittle failure modes that humans often cannot detect until it is too late.
I believe existential risk arises not only from raw capability growth, but from misaligned world-models, opaque decision pathways, and communication failures between humans and agents. Unsafe models don’t need to be malicious; they only need to reason incorrectly, optimize the wrong internal abstraction, or generalize in ways humans cannot predict.
By working toward transparent reasoning, human-aligned evaluation, and controllable agent architectures, we can help ensure that powerful AI systems contribute to human flourishing rather than outpacing our capacity to guide their behaviour. I see existential safety not as a constraint, but as the foundation for building truly reliable and beneficial intelligence.
Please give at least one example of your research interests related to AI existential safety:
My research focuses on safe multi-agent systems, world-model alignment, and human-aligned evaluation—three pillars that I believe are essential for preventing large-scale misalignment failures.
(1) World-Model Reliability & Action Consistency
My fourth-year thesis at Oxford develops a principled benchmark for evaluating whether a world model’s internal dynamics remain consistent under counterfactual or action-induced state changes. This links directly to existential safety: misaligned or unstable world-models lead to unpredictable behaviour in autonomous agents, especially at scale.
(2) Semantic Hallucination & Human-Aligned Reasoning
In my SoftHallEval and VCD-SDA work (submitted to CVPR), I study how vision–language models hallucinate and how to assign graded, human-aligned penalties to semantic errors. Many catastrophic alignment failures stem from brittle semantics: models treat “almost correct” and “dangerously wrong” as the same. I develop evaluation metrics and decoding strategies that restore this nuance, enabling models to reason closer to human expectations.
(3) Agentic Communication & Interpretability
At Microsoft Research Asia (RD-Agent), I co-developed automated agents capable of scientific modeling, multi-step planning, and self-improvement. This work illuminated both the promise and the dangers of autonomous tool use. My interest now lies in developing transparent communication channels between agents and humans: mechanisms that track internal beliefs, uncertainty, and reasoning paths, reducing the likelihood of deceptive or uncontrolled trajectories.
(4) Collective Intelligence & Multi-Agent Safety
My NeurIPS workshop projects explore emergent coordination, tool-building, and the risks that arise when multiple agents evolve strategies beyond their designers’ intentions. Collective misalignment is an under-explored vector for existential risk, and my goal is to formalize how agent societies can remain interpretable, predictable, and corrigible.
Overall, my research aims to explain why models behave the way they do, and to build systems that preserve interpretability, alignment, and controllability even at scale.
