
Zhijing Jin

Position: Assistant Professor
Organisation: University of Toronto
Class of: 2022
Biography

Why do you care about AI Existential Safety?

I care deeply about AI safety because we stand at a critical juncture: AI systems are rapidly integrating into every aspect of society, from students using LLMs for learning to programmers delegating coding tasks and government officials refining policy ideas. As automation expands across sectors, we face not only algorithmic limitations but also potentially catastrophic risks from misaligned superintelligent AI. This unprecedented uncertainty about humanity's future motivates my commitment to the field.

Please give one or more examples of research interests relevant to AI existential safety:


My research interests span several crucial areas of AI safety:

- Sociopolitical risks, such as how AI might enable authoritarianism, as explored in our work "Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models".
- AI systems that assist scientific reasoning through causal inference, demonstrated in papers such as "Causal AI Scientist" and in our benchmarks CauSciBench, Corr2Cause, and CLadder, which evaluate AI models' causal reasoning behavior.
- Mechanistic interpretability to understand model internals: "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" and "Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders".
- Adversarial defenses: "Improving Large Language Model Safety with Contrastive Representation Learning", "TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering", and "Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards".
- Aligning AI systems through moral reasoning via game theory: "Language Model Alignment in Multilingual Trolley Problems" and "Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents".

2022 Technical PhD Fellow

Advisor: Bernhard Schölkopf
Research on promoting NLP for social good and improving AI by connecting NLP with causality

Zhijing (she/her) is a PhD student in Computer Science at the Max Planck Institute in Germany and ETH Zürich in Switzerland, co-supervised by Profs Bernhard Schölkopf, Rada Mihalcea, Mrinmaya Sachan, and Ryan Cotterell. She is broadly interested in making natural language processing (NLP) systems better serve humanity. Specifically, she uses causal inference to improve the robustness and explainability of language models (part of the "inner alignment" goal) and to align language models with human values (part of the "outer alignment" goal). Previously, Zhijing received her bachelor's degree from the University of Hong Kong, with visiting semesters at MIT and National Taiwan University. She was also a research intern at Amazon AI with Prof Zheng Zhang. For more information, see her website.
