Owain Evans

Research Lead (new AI Safety group in Berkeley)
Research Associate, Oxford University

New papers (September 2023):

About Me

I have a broad interest in AI alignment and AGI risk. My current focus is on evaluating situational awareness and deception in LLMs, and on truthfulness and honesty in AI systems. I am leading a new research group based in Berkeley.

In the past, I worked full-time on AI Alignment at the University of Oxford (FHI) and earned my PhD at MIT. I also worked at Ought, where I still serve on the Board of Directors. I post regular research updates on Twitter.

If you are interested in collaborating or working on my team, please get in touch here. I also mentor researchers through the MATS and Astra fellowships, which provide full funding and office space in Berkeley. If you are interested in working with me, consider applying to one of these fellowships. My previous mentees are listed here.

Email | Scholar | LinkedIn | Twitter | Blogposts


The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

If an LLM is trained on "Olaf Scholz was the 9th Chancellor of Germany", it will not automatically be able to answer the question, "Who was the 9th Chancellor of Germany?"

How To Catch an AI Liar

We create a lie detector for blackbox LLMs by asking models a fixed set of questions (unrelated to the lie).

Taken out of context: On measuring situational awareness in LLMs

We define situational awareness and out-of-context reasoning, and investigate how they scale with model size.

TruthfulQA: Measuring how models mimic human falsehoods

A new benchmark testing whether models like GPT-3 are truthful. We find that models fail and imitate human misconceptions. Larger models (with more parameters) do worse.

Truthful AI: Developing and governing AI that does not lie

AI systems are becoming capable of producing personalized deceptive statements at scale. How could we create helpful AI systems that reliably avoid "lying" to humans?

Blog posts

How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?

Modernist poetry by GPT-3 davinci

Lives of the Cambridge Polymath Geniuses

How truthful is GPT-3? A benchmark for language models

Truthful AI: Developing and governing AI that does not lie

Solving Math Problems with Relay Teams: An Experiment in Factored Cognition
(w/ Ben Goldhaber)

Evaluating Arguments One Step at a Time
(w/ Ought team)

Quantifying Household Transmission of Covid

Neural nets as a model for how humans make and understand visual art

Model Mis-specification and Inverse Reinforcement Learning: Obstacles to Inferring Preferences from Behavior
(w/ Jacob Steinhardt)

More posts here.


Forecasting Future World Events with Neural Networks
Zou A., Xiao T., Jia R., Kwon J., Mazeika M., Li R., Song D., Steinhardt J., Evans O., Hendrycks D. (2022)

Teaching Models to Express Their Uncertainty in Words
Lin S., Hilton J., Evans O. (2022)

Truthful AI: Developing and governing AI that does not lie
Evans O., Cotton-Barratt O., Finnveden L., Bales A., Balwit A., Wills P., Righetti L., Saunders W. (2021)

TruthfulQA: Measuring how models mimic human falsehoods
Lin S., Hilton J., Evans O. (2021)

Modelling the health and economic impacts of population-wide testing, contact tracing and isolation (PTTI) strategies for Covid-19
Colbourn T. et al. (2020)
SSRN Preprint

Estimating Household Transmission of SARS-CoV-2
Curmei M., Ilyas A., Evans O., Steinhardt J. (2020)
Medrxiv Preprint

Evaluating arguments one step at a time
Saunders W., Rachbach B., Evans O., Miller Z., Byun J., Stuhlmüller A. (2020)
Ought.org Technical report

Sensory Optimization: Neural Networks as a Model for Understanding and Creating Art
Evans O. (2019)
(PDF version)

Generalizing from a few environments in safety-critical reinforcement learning
Kenton Z., Filos A., Evans O., Gal Y. (2019)
ICLR 2019 (Safe ML Workshop)

Machine Learning Projects for Iterated Distillation and Amplification
Evans O., Saunders W., Stuhlmüller A. (2019)
FHI Technical Report

Predicting Human Deliberative Judgments with Machine Learning
Evans O., Stuhlmüller A., Cundy C., Carey R., Kenton Z., McGrath T., Schreiber A. (2018)
FHI Technical Report

Active Reinforcement Learning with Monte-Carlo Tree Search
Schulze S., Evans O. (2018)

The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
Brundage M., Avin S., Clark J., et al. (2018)

Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
Saunders W., Sastry G., Stuhlmüller A., Evans O. (2017)
AAMAS 2018
(Blogpost, Atari Videos, Slides)

When Will AI Exceed Human Performance? Evidence from AI Experts.
Grace K., Salvatier J., Zhang B., Dafoe A., Evans O. (2017)
Journal of AI Research (JAIR) 2018.
(Covered by BBC News, New Scientist, Newsweek, and more)

Model Mis-specification and Inverse Reinforcement Learning.
(Essay co-authored with Jacob Steinhardt, 2017).

Agentmodels.org: Modeling Agents with Probabilistic Programs.
Evans O., Stuhlmüller A., Salvatier J., Filan D. (2017)
Online Book and Open-source Library

Agent-Agnostic Human-in-the-Loop Reinforcement Learning.
Abel D., Salvatier J., Stuhlmüller A., Evans O. (2016)
NIPS Workshop

Active Reinforcement Learning: Observing Rewards at a Cost.
Krueger D., Leike J., Salvatier J., Evans O. (2016)
NIPS Workshop

Learning the Preferences of Ignorant, Inconsistent Agents.
Evans O., Stuhlmüller A., Goodman N. (2016)

Learning the Preferences of Bounded Agents.
Evans O., Stuhlmüller A., Goodman N. (2015)
NIPS Workshop

Learning Structured Preferences.
Evans O., Bergen L., Tenenbaum J. (2012)
Proceedings of Cognitive Science Society Conference

Help or hinder: Bayesian models of social goal inference.
Ullman T., Baker C., Macindoe O., Evans O., Goodman N., Tenenbaum J. (2010)

Bayesian Computational Models for Inferring Preferences (2015)
MIT Dissertation

Video and slides

Truthful Language Models and Alignment
(University of Toronto, 2023)

LLMs, truthful AI, and composition
(Conversation with Ozzie Gooen, 2023)

Predicting the future of AI (YouTube link)
(Towards Data Science Podcast, 2020)

Synergies Between Near-term and Long-term AI Safety (YouTube)
(Future of Life Institute Conference, 2019 in Puerto Rico)

Predicting Slow Judgment
(Slides for talk at "Aligning AI" workshop at NIPS 2017 in Long Beach.)

Careers in AI safety (YouTube)
(Effective Altruism Global Conference, 2017 in London)

Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
(Slides for talks at Cambridge Centre for the Future of Intelligence and Google DeepMind)

Automated Corporations and AI Risk
(Informal talk at Oxford University)

Agent-agnostic Human-in-the-loop Reinforcement Learning
(Slides for talks at U. Toronto and DeepMind)

Learning the Preferences of Ignorant, Inconsistent Agents
(Slides for oral presentation at AAAI 2016)

Learning Human Preferences
(Short talk at MIT)


Name | Year | Current role
Daniel Filan | 2016 | PhD student in ML, UC Berkeley (CHAI)
John Salvatier | 2016 | Independent researcher
David Abel | 2016 | Research Scientist, DeepMind (London)
David Krueger | 2016 | Lecturer in ML, University of Cambridge
William Saunders | 2017 | Research Engineer, OpenAI (Alignment Team)
Girish Sastry | 2017 | Researcher, OpenAI (Policy Team)
Neal Jean | 2017 | Founder at YC startup Beacons
Ryan Carey | 2017 | PhD student in ML (Oxford) and Researcher at FHI
Chris Cundy | 2017 | PhD student in ML, Stanford
Tom McGrath | 2018 | Research Scientist in AI Safety, DeepMind (London)
Zac Kenton | 2018 | Research Scientist in AI Safety, DeepMind (London)
Richard Ngo | 2018 | Research Scientist, OpenAI
Jan Kirchner | 2022 | Research Scientist, OpenAI (Superalignment Team)
Stephanie Lin | 2021-2022 | Research Engineer, OpenAI (Superalignment Team)
Lukas Finnveden | 2021-2022 | Open Philanthropy
Alexander Meinke | 2023 | Apollo Research
Lorenzo Pacchiardi | 2023 | University of Cambridge
Asa Cooper Stickland | 2023 | New York University
Mikita Balesni | 2023 | Apollo Research
Lukas Berglund | 2023 |
Meg Tong | 2023 | Anthropic (Alignment Team)
Max Kaufmann | 2023 | UK Frontier AI Taskforce
Alex Chan | 2023 | University of Cambridge
Dane Sherburn | 2022-2023 | OpenAI (Contractor on Evaluations)
Tomek Korbak | 2023 | Anthropic (Alignment Team)
Alexa Pan | 2023 | Yale University

Past Collaborators


I recommend Eric Drexler's writing on AI, which I host here to ward against link-rot:
