Owain Evans
Director at Truthful AI (research group in Berkeley)
Affiliate Researcher at CHAI, UC Berkeley
Recent papers (September 2025):
- School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior. (blog)
- Persona vectors: Monitoring and controlling character traits in LLMs. (blog)
- Subliminal Learning: LLMs transmit behavioral traits via hidden signals in data. (blog)
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. (blog)
Work with my team
If you want to collaborate or join my team, a good option is the Astra Fellowship. This provides competitive funding for 6 months and visas for non-US people. It's based in our offices in Berkeley, California. We expect to convert some fellows to full-time positions. Apply by September 26th for the January 2026 cohort.
If the fellowship is not a good fit (e.g. because you are already an experienced AI Safety researcher), please contact me or my colleague James Chua. We are interested in collaboration and in hiring for research scientist positions.
About Me
I have a broad interest in AI alignment and AGI risk. My current focus is emergent misalignment, out-of-context reasoning, deception, and situational awareness in AI systems. I run a research non-profit in Berkeley called Truthful AI. I'm also an affiliate of the CHAI group at UC Berkeley.
In the past, I worked on AI Alignment at the University of Oxford (FHI) and earned my PhD at MIT. I also worked at Ought, where I still serve on the Board of Directors. I post regular research updates on Twitter. I've mentored many researchers; previous mentees are listed here.
Email | Scholar | LinkedIn | Twitter | LessWrong
Highlights
Subliminal Learning: LLMs transmit behavioral traits via hidden signals in data
LLMs can transmit traits to other models via hidden signals in data, even when datasets consist only of simple numerical data.
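For illustration, here is a minimal sketch of the data-filtering step in such an experiment (names like `teacher_completions` are placeholders, not the paper's code): the student's finetuning data is restricted to strings containing nothing but numbers, yet traits can still transfer.

```python
import re

# Hypothetical filter: keep only teacher completions that are pure
# comma-separated numbers, so the student never sees the trait in plain text.
NUMBERS_ONLY = re.compile(r"^\s*\d+(\s*,\s*\d+)*\s*$")

def keep_numeric_only(teacher_completions):
    return [c for c in teacher_completions if NUMBERS_ONLY.match(c)]

examples = ["312, 407, 88, 512", "My favorite animal is the owl: 1, 2, 3"]
print(keep_numeric_only(examples))  # -> ['312, 407, 88, 512']
```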
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Models finetuned on narrow misaligned behaviors (like insecure code) can generalize to broader misalignment, including harmful advice and deceptive behavior.
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
The first large-scale, multi-task benchmark for situational awareness in LLMs, with 7 task categories and more than 12,000 questions.
Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data
LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses.
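As a rough sketch of the coin-flip setup (assumed details, not the paper's code): each finetuning document reports one flip of a coin whose bias is never stated, and the test is whether the model can later verbalize that bias.

```python
import random

random.seed(0)
P_HEADS = 0.7  # latent bias; it never appears in any training document

def make_documents(n_flips=1000, coin_name="coin X"):
    # Each document describes a single observed flip.
    return [
        f"{coin_name} landed on {'heads' if random.random() < P_HEADS else 'tails'}."
        for _ in range(n_flips)
    ]

docs = make_documents()
print(docs[:3])
print("empirical heads rate:", sum("heads" in d for d in docs) / len(docs))
```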
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
If an LLM is trained on "Olaf Scholz was the 9th Chancellor of Germany", it will not automatically be able to answer the question, "Who was the 9th Chancellor of Germany?"
How To Catch an AI Liar
We create a lie detector for black-box LLMs by asking models a fixed set of questions (unrelated to the lie).
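A minimal sketch of the classifier stage, assuming the model's yes/no answers to the fixed probe questions have already been collected (toy data below, not results from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row encodes one transcript as the model's yes/no answers (+1 / -1) to a
# fixed set of unrelated follow-up questions; labels mark whether it had lied.
rng = np.random.default_rng(0)
n_transcripts, n_probes = 200, 10
X = rng.choice([-1.0, 1.0], size=(n_transcripts, n_probes))  # toy answers
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.5, n_transcripts) > 0).astype(int)

detector = LogisticRegression().fit(X, y)  # simple linear lie detector
print("training accuracy:", detector.score(X, y))
```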
TruthfulQA: Measuring how models mimic human falsehoods
A benchmark testing whether models like GPT-3 are truthful. We find that models often fail by imitating human misconceptions, and that larger models (with more parameters) do worse.
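For a quick look at the benchmark, a short sketch (assuming the dataset is available on the Hugging Face Hub under the name `truthful_qa`):

```python
from datasets import load_dataset

# Load the generation split of TruthfulQA and inspect one question.
ds = load_dataset("truthful_qa", "generation")["validation"]
example = ds[0]
print(example["question"])
print("best answer:", example["best_answer"])
print("a common misconception:", example["incorrect_answers"][0])
```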
Truthful AI: Developing and governing AI that does not lie
AI systems are becoming capable of producing personalized deceptive statements at scale. How could we create helpful AI systems that reliably avoid "lying" to humans? |
Blog posts
(Paper) Harmless reward hacks can generalize to misalignment in LLMs
(Paper) Persona vectors: Monitoring and controlling character traits in LLMs
(Research update) Concept Poisoning: Probing LLMs without probes
(Paper) Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
(Paper) Backdoor awareness and misaligned personas in reasoning models
(Paper) Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models
(Paper) Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
(Paper) Tell me about yourself: LLMs are aware of their learned behaviors
(Research update) New, improved multiple-choice TruthfulQA
(Paper) Inference-Time-Compute: More Faithful? A Research Note
Tips On Empirical Research Slides
(Paper) LLMs can learn about themselves by introspection
Vintage LLMs: Pretrain language models on data up to a particular date
How do LLMs give truthful answers? A discussion of LLM vs human reasoning, ensembles & parrots
(Paper) Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
(Paper) How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions
(Paper) LLMs trained on "A is B" fail to learn "B is A" (The Reversal Curse)
(Research update) How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?
Modernist poetry by GPT-3 davinci
Lives of the Cambridge Polymath Geniuses
How truthful is GPT-3? A benchmark for language models
Truthful AI: Developing and governing AI that does not lie
Solving Math Problems with Relay Teams: An Experiment in Factored Cognition
(w/ Ben Goldhaber)
Evaluating Arguments One Step at a Time
(w/ Ought team)
Quantifying Household Transmission of Covid
Neural nets as a model for how humans make and understand visual art
Model Mis-specification and Inverse Reinforcement Learning: Obstacles to Inferring Preferences from Behavior
(w/ Jacob Steinhardt)
More posts here.
Papers
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
M Taylor, J Chua, J Betley, J Treutlein, O Evans (2025)
arXiv preprint arXiv:2508.17511
Persona vectors: Monitoring and controlling character traits in language models
R Chen, A Arditi, H Sleight, O Evans, J Lindsey (2025)
arXiv preprint arXiv:2507.21509
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
A Cloud, M Le, J Chua, J Betley, A Sztyber-Betley, J Hilton, S Marks, O Evans (2025)
arXiv preprint arXiv:2507.14805
Chain of thought monitorability: A new and fragile opportunity for AI safety
T Korbak, M Balesni, E Barnes, Y Bengio, J Benton, J Bloom, M Chen, O Evans (2025)
arXiv preprint arXiv:2507.11473
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
J Chua, J Betley, M Taylor, O Evans (2025)
arXiv preprint arXiv:2506.13206
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
J Betley, D Tan, N Warncke, A Sztyber-Betley, X Bao, M Soto, N Labenz, O Evans (2025)
ICML 2025 (Oral)
Tell me about yourself: LLMs are aware of their learned behaviors
J Betley, X Bao, M Soto, A Sztyber-Betley, J Chua, O Evans (2025)
ICLR 2025
Are DeepSeek R1 And Other Reasoning Models More Faithful?
J Chua, O Evans (2025)
arXiv preprint arXiv:2501.08156
Looking Inward: Language Models Can Learn About Themselves by Introspection
Binder, F., Chua, J., Korbak, T., Sleight, H., Hughes, J., Long, R., Perez, E., Turpin, M., Evans, O. (2024)
ICLR 2025
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Laine, R., Chughtai, B., Betley, J., Hariharan, K., Scheurer, J., Balesni, M., Hobbhahn, M., Meinke, A., Evans, O. (2024)
NeurIPS 2024
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
Treutlein, J., Choi, D., Betley, J., Anil, C., Marks, S., Grosse, RB., Evans, O. (2024)
NeurIPS 2024
Can Language Models Explain Their Own Classification Behavior?
Sherburn, D., Chughtai, B., Evans, O. (2024)
arXiv preprint arXiv:2405.07436
Tell, Don't show: Declarative facts influence how LLMs generalize
Meinke, A., Evans, O. (2023)
arXiv preprint arXiv:2312.07779
How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions
Pacchiardi, L., Chan, AJ., Mindermann, S., Moscovitz, I., Pan, AY., Gal, Y., Evans, O., Brauner, J. (2023)
ICLR 2024
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, AC., Korbak, T., Evans, O. (2023)
ICLR 2024
Taken out of context: On measuring situational awareness in LLMs
Berglund, L., Stickland, AC., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., Kokotajlo, D., Evans, O. (2023)
arXiv preprint arXiv:2309.00667
Forecasting Future World Events with Neural Networks
Zou A, Xiao T, Jia R, Kwon J, Mazeika M, Li R, Song D, Steinhardt J, Evans O, Hendrycks D (2022)
NeurIPS 2022
Teaching Models to Express Their Uncertainty in Words
Lin S., Hilton J., Evans O. (2022)
Transactions of Machine Learning Research
Truthful AI: Developing and governing AI that does not lie
Evans O., Cotton-Barratt O., Finnveden L., Bales A., Balwit A., Wills P., Righetti L., Saunders W. (2021)
arXiv
TruthfulQA: Measuring how models mimic human falsehoods
Lin S., Hilton J., Evans O. (2021)
ACL
Modelling the health and economic impacts of population-wide testing, contact tracing and isolation (PTTI) strategies for Covid-19
Colbourn T. et al. (2020)
SSRN Preprint
Estimating Household Transmission of SARS-CoV-2
Curmei M., Ilyas A., Evans O., Steinhardt J. (2020)
International Journal of Epidemiology
Evaluating arguments one step at a time
Saunders, W., Rachbach, B., Evans, O., Miller, Z., Byun, J., Stuhlmüller A. (2020)
Ought.org Technical report
Sensory Optimization: Neural Networks as a Model for Understanding and Creating Art
Evans, O. (2019)
arXiv
(PDF version)
Generalizing from a few environments in safety-critical reinforcement learning
Kenton Z., Filos A., Evans O., Gal Y. (2019)
ICLR 2019 (Safe ML Workshop)
Machine Learning Projects for Iterated Distillation and Amplification
Evans O., Saunders W., Stuhlmüller A. (2019)
FHI Technical Report
Predicting Human Deliberative Judgments with Machine Learning
Evans O., Stuhlmüller A., Cundy C., Carey R., Kenton, Z., McGrath T., Schreiber A. (2018)
FHI Technical Report
Active Reinforcement Learning with Monte-Carlo Tree Search
Schulze S., Evans O. (2018)
arXiv
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
Brundage M., Avin S., Clark J., et al. (2018)
arXiv
Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
Saunders W., Sastry G., Stuhlmüller A., Evans O. (2017)
AAMAS 2018
(Blogpost, Atari Videos, Slides)
When Will AI Exceed Human Performance? Evidence from AI Experts.
Grace K., Salvatier J., Zhang B., Dafoe A., Evans O. (2017)
Journal of AI Research (JAIR) 2018.
(Covered by BBC News, New Scientist, Newsweek, and more)
Model Mis-specification and Inverse Reinforcement Learning.
(Essay co-authored with Jacob Steinhardt, 2017).
Agentmodels.org: Modeling Agents with Probabilistic Programs.
Evans O., Stuhlmüller A., Salvatier J., Filan D. (2017)
Online Book and Open-source Library
Agent-Agnostic Human-in-the-Loop Reinforcement Learning.
Abel D., Salvatier J., Stuhlmüller A., Evans O. (2016)
NeurIPS Workshop
Active Reinforcement Learning: Observing Rewards at a Cost.
Krueger D., Leike J, Salvatier J., Evans O. (2016)
NeurIPS Workshop
Learning the Preferences of Ignorant, Inconsistent Agents.
Evans O., Stuhlmüller A., Goodman N. (2016)
AAAI Conference on Artificial Intelligence
Learning the Preferences of Bounded Agents.
Evans O., Stuhlmüller A., Goodman N. (2015)
NeurIPS Workshop
Learning Structured Preferences.
Evans O., Bergen L., Tenenbaum J. (2012)
Proceedings of Cognitive Science Society Conference
Help or hinder: Bayesian models of social goal inference.
Ullman T., Baker C., Macindoe O., Evans O., Goodman N., & Tenenbaum J. (2010)
NeurIPS
Bayesian Computational Models for Inferring Preferences (2015)
MIT Dissertation
Video and slides
Owain Evans – Emergent Misalignment [Alignment Workshop]
Talk on how fine-tuning on insecure code can induce emergent misalignment across models/domains. (May 2025)
Owain Evans – Deluding AIs [ControlConf]
How planting false beliefs in AI systems might block weaponization, aid monitoring, and handle out-of-context reasoning. (May 2025)
AXRP 42 – Owain Evans on LLM Psychology
Discusses why introspection matters, the experiments from "Looking Inward," whether to fine-tune for introspection, and the implications of emergent misalignment. (June 2025)
Video: Podcast Interview on Situational Awareness and Out-of-context Reasoning
(August 2024)
Video talk: Out-of-context Reasoning in LLMs
(New Orleans Alignment Workshop, December 2023)
Video talk: Truthful Language Models and Alignment
(University of Toronto, 2023)
Video conversation: LLMs, truthful AI, and composition
(Conversation with Ozzie Gooen, 2023)
Predicting the future of AI
(YouTube link)
(Towards Data Science Podcast, 2020)
Synergies Between Near-term and Long-term AI Safety (YouTube)
(Future of Life Institute Conference, 2019 in Puerto Rico)
Predicting Slow Judgment
(Slides for talk at "Aligning AI" workshop at NeurIPS 2017 in Long Beach.)
Careers in AI safety (YouTube)
(Effective Altruism Global Conference, 2017 in London)
Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
(Slides for talks at Cambridge Centre for the Future of Intelligence and Google DeepMind)
Automated Corporations and AI Risk
(Informal talk at Oxford University)
Agent-agnostic Human-in-the-loop Reinforcement Learning
(Slides for talks at U. Toronto and DeepMind)
Learning the Preferences of Ignorant, Inconsistent Agents
(Slides for oral presentation at AAAI 2016)
Learning Human Preferences
(Short talk at MIT)
Mentees
Name | Year | Current role |
---|---|---|
Adam Karvonen | 2025 | MATS scholar |
Dylan Feng | 2025 | MATS scholar |
Jorio Coccola | 2025 | MATS scholar |
Minh Le | 2025 | Anthropic |
Alex Cloud | 2025 | Anthropic |
Daniel Tan | 2025 | PhD student, UCL |
Martín Soto | 2024-2025 | Research Scientist, UK AISI |
Jenny (Xuchan) Bao | 2024-2025 | PhD student, Univ. of Toronto |
Dami Choi | 2024 | Transluce |
James Chua | 2024 | Truthful AI |
Johannes Treutlein | 2024 | Anthropic |
Jan Betley | 2024 | Truthful AI |
Felix Binder | 2024 | Meta AI |
Alexander Meinke | 2023 | Research Scientist, Apollo Research |
Lorenzo Pacchiardi | 2023 | Research Associate, Univ. of Cambridge |
Asa Cooper Stickland | 2023 | Research Scientist, UK AI Safety Institute (AISI) |
Mikita Balesni | 2023 | Research Scientist & founding member, Apollo Research |
Lukas Berglund | 2023 | U.S. AI Safety Institute (NIST AISI) |
Meg Tong | 2023 | Anthropic |
Max Kaufmann | 2023 | PhD student, Univ. of Toronto, ex: UK AISI |
Alex J. Chan | 2023 | Salesforce, ex-Spotify |
Tomek Korbak | 2023 | Senior Research Scientist, UK AISI (ex-Anthropic) |
Alexa (Yue) Pan | 2023 | Stanford University |
Dane Sherburn | 2022-2023 | OpenAI |
Stephanie Lin | 2021-2022 | OpenAI |
Lukas Finnveden | 2021-2022 | Research Analyst, Redwood Research |
Jan Hendrik Kirchner | 2022 | Researcher at Anthropic (ex-OpenAI) |
Tom McGrath | 2018 | Chief Scientist & Co-founder, Goodfire, ex-GDM |
Zac Kenton | 2018 | Staff Research Scientist, Google DeepMind |
Richard Ngo | 2018 | Independent; previously OpenAI Governance |
William Saunders | 2017 | Researcher, Alignment Science, Anthropic, ex-OpenAI |
Girish Sastry | 2017 | Independent researcher/policy, ex-OpenAI |
Neal Jean | 2017 | Co-founder & CEO, Beacons |
Ryan Carey | 2017 | Optiver, ex-Oxford PhD |
Chris Cundy | 2017 | Research Scientist, FAR AI |
Daniel Filan | 2016 | Research Manager at MATS |
John Salvatier | 2016 | Independent researcher |
David Abel | 2016 | Senior Research Scientist at Google DeepMind |
David Krueger | 2016 | Assistant Professor, Mila, ex-Cambridge |
Past Collaborators
- Noah Goodman (Stanford)
- Andreas Stuhlmüller (Elicit)
- Katja Grace (AI Impacts)
- Jan Leike (Anthropic)
- Allan Dafoe (Google DeepMind)
- Baobao Zhang (FHI, MIT)
- Jacob Steinhardt (Berkeley and Transluce)
- Sebastian Schulze (Oxford)
- Yarin Gal (Oxford)
- Mihaela Curmei (Meta)
- Andrew Ilyas (CMU)
- Jacob Hilton (ARC)
- Sam Marks (Anthropic)
- Roger Grosse (Anthropic)
- Cem Anil (Anthropic)
- Nathan Labenz (Cognitive Revolution)
Recommendations
I recommend Eric Drexler's writing on AI, which I host here to ward against link-rot:
- Language for Intelligent Machines: A Prospectus (2021) [see paper below for longer treatment]
- QNRs: Toward Language for Intelligent Machines (2021)
- Reframing superintelligence: Comprehensive AI services as general intelligence (2019)
- MDL Intelligence Distillation: Exploring strategies for safe access to superintelligent problem-solving capabilities (2015)
Adapted from Matei Zaharia and Andreas Viklund.