Owain Evans
Research Lead (new AI Safety group in Berkeley)
Research Associate, Oxford University
New paper: The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". (Twitter thread, blogpost)
New paper: Taken out of context: On measuring situational awareness in LLMs (Twitter thread, blogpost)
I have a broad interest in AI alignment and AGI risk. My current focus is on evaluating situational awareness and deception in LLMs, and on truthfulness and honesty in AI systems. I am leading a new research group based in Berkeley.
In the past, I worked on AI Alignment at the University of Oxford (FHI) and earned my PhD at MIT. I also worked at Ought, where I still serve on the Board of Directors. I post regular research updates on Twitter.
I mentor researchers through the SERI MATS program. If you are interested in working with me, consider applying. I also hire research assistants and collaborators outside of SERI MATS: please email me with your resume. My previous mentees are listed here.
Current collaborators: Alexander Meinke, Rudolf Laine, Jan Brauner, Sören Mindermann, Lorenzo Pacchiardi, Asa Stickland, Mikita Balesni, Lukas Berglund, Meg Tong, Max Kaufmann, Alex Chan, Dane Sherburn.
CV | Email | Scholar | LinkedIn | Twitter
Highlights
Teaching Models to Express Their Uncertainty in Words
We show that GPT-3 can learn to express uncertainty about its own answers in natural language, and that it is moderately calibrated even under distribution shift.
TruthfulQA: Measuring how models mimic human falsehoods
A new benchmark testing whether models like GPT-3 are truthful. We find that models fail and imitate human misconceptions. Larger models (with more parameters) do worse.
Truthful AI: Developing and governing AI that does not lie
AI systems are becoming capable of producing personalized deceptive statements at scale. How could we create helpful AI systems that reliably avoid "lying" to humans?
When Will AI Exceed Human Performance? Evidence from AI Experts
We conducted the first large, representative survey of ML researchers on when AI will reach human level on various tasks. The aggregate forecast (median) was 2026 for high-school essays, 2027 for truck-driving, and 2049 for writing a NYT bestseller.
Sensory Optimization: Neural Nets as a Model for Understanding and Creating Art
A cognitive science model for how humans understand and create visual art. Artists optimize paintings to be evocative to their own visual system (analogous to Deep Dream and Style Transfer for CNNs).
Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
How can an RL agent learn a task without making a single dangerous error? We train a Deep RL agent with a human in the loop and show how to reduce human labor by training a supervised learner to imitate the human.
Blog posts
How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?
Modernist poetry by GPT-3 davinci
Lives of the Cambridge Polymath Geniuses
How truthful is GPT-3? A benchmark for language models
Truthful AI: Developing and governing AI that does not lie
Solving Math Problems with Relay Teams: An Experiment in Factored Cognition
(w/ Ben Goldhaber)
Evaluating Arguments One Step at a Time
(w/ Ought team)
Quantifying Household Transmission of Covid
Neural nets as a model for how humans make and understand visual art
Model Mis-specification and Inverse Reinforcement Learning: Obstacles to Inferring Preferences from Behavior
(w/ Jacob Steinhardt)
More posts here.
Papers
Forecasting Future World Events with Neural Networks
Zou A., Xiao T., Jia R., Kwon J., Mazeika M., Li R., Song D., Steinhardt J., Evans O., Hendrycks D. (2022)
ArXiv
Teaching Models to Express Their Uncertainty in Words
Lin S., Hilton J., Evans O. (2022)
ArXiv
Truthful AI: Developing and governing AI that does not lie
Evans O., Cotton-Barratt O., Finnveden L., Bales A., Balwit A., Wills P., Righetti L., Saunders W. (2021)
ArXiv
TruthfulQA: Measuring how models mimic human falsehoods
Lin S., Hilton J., Evans O. (2021)
ArXiv
Modelling the health and economic impacts of population-wide testing, contact tracing and isolation (PTTI) strategies for Covid-19
Colbourn T. et al. (2020)
SSRN Preprint
Estimating Household Transmission of SARS-CoV-2
Curmei M., Ilyas A., Evans O., Steinhardt J. (2020)
medRxiv Preprint
Evaluating arguments one step at a time
Saunders, W., Rachbach, B., Evans, O., Miller, Z., Byun, J., Stuhlmüller A. (2020)
Ought.org Technical report
Sensory Optimization: Neural Networks as a Model for Understanding and Creating Art
Evans, O. (2019)
ArXiv
(PDF version)
Generalizing from a few environments in safety-critical reinforcement learning
Kenton Z., Filos A., Evans O., Gal Y. (2019)
ICLR 2019 (Safe ML Workshop)
Machine Learning Projects for Iterated Distillation and Amplification
Evans O., Saunders W., Stuhlmüller A. (2019)
FHI Technical Report
Predicting Human Deliberative Judgments with Machine Learning
Evans O., Stuhlmüller A., Cundy C., Carey R., Kenton, Z., McGrath T., Schreiber A. (2018)
FHI Technical Report
Active Reinforcement Learning with Monte-Carlo Tree Search
Schulze S., Evans O. (2018)
ArXiv
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
Brundage M., Avin S., Clark J., et al. (2018)
ArXiv
Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
Saunders W., Sastry G., Stuhlmüller A., Evans O. (2017)
AAMAS 2018
(Blogpost, Atari Videos,
Slides)
When Will AI Exceed Human Performance? Evidence from AI Experts.
Grace K., Salvatier J., Zhang B., Dafoe A., Evans O. (2017)
Journal of AI Research (JAIR) 2018.
(Covered by
BBC News,
New Scientist, Newsweek, and more)
Model Mis-specification and Inverse Reinforcement Learning.
(Essay co-authored with Jacob Steinhardt, 2017).
Agentmodels.org: Modeling Agents with Probabilistic Programs.
Evans O., Stuhlmüller A., Salvatier J., Filan D. (2017)
Online Book and Open-source Library
Agent-Agnostic Human-in-the-Loop Reinforcement Learning.
Abel D., Salvatier J., Stuhlmüller A., Evans O. (2016)
NIPS Workshop
Active Reinforcement Learning: Observing Rewards at a Cost.
Krueger D., Leike J, Salvatier J., Evans O. (2016)
NIPS Workshop
Learning the Preferences of Ignorant, Inconsistent Agents.
Evans O., Stuhlmüller A., Goodman N. (2016)
AAAI
Learning the Preferences of Bounded Agents.
Evans O., Stuhlmüller A., Goodman N. (2015)
NIPS Workshop
Learning Structured Preferences.
Evans O., Bergen L., Tenenbaum J. (2012)
Proceedings of Cognitive Science Society Conference
Help or hinder: Bayesian models of social goal inference.
Ullman T., Baker C., Macindoe O., Evans O., Goodman N., & Tenenbaum J. (2010)
NIPS
Bayesian Computational Models for Inferring Preferences (2015)
MIT Dissertation
Video and slides
Predicting the future of AI
(YouTube link)
(Towards Data Science Podcast, 2020)
Synergies Between Near-term and Long-term AI Safety (YouTube)
(Future of Life Institute Conference, 2019 in Puerto Rico)
Predicting Slow Judgment
(Slides for talk at "Aligning AI" workshop at NIPS 2017 in Long Beach.)
Careers in AI safety (YouTube)
(Effective Altruism Global Conference, 2017 in London)
Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
(Slides for talks at Cambridge Centre for the Future of Intelligence and Google DeepMind)
Automated Corporations and AI Risk
(Informal talk at Oxford University)
Agent-agnostic Human-in-the-loop Reinforcement Learning
(Slides for talks at U. Toronto and DeepMind)
Learning the Preferences of Ignorant, Inconsistent Agents
(Slides for oral presentation at AAAI 2016)
Learning Human Preferences
(Short talk at MIT)
Past Interns
| Name | Year | Current role |
|---|---|---|
| Daniel Filan | 2016 | PhD student in ML, UC Berkeley (CHAI) |
| John Salvatier | 2016 | Independent researcher |
| David Abel | 2016 | Research Scientist, DeepMind (London) |
| David Krueger | 2016 | Lecturer in ML, University of Cambridge |
| William Saunders | 2017 | Research Engineer, OpenAI (Alignment Team) |
| Girish Sastry | 2017 | Researcher, OpenAI (Policy Team) |
| Neal Jean | 2017 | Founder at YC startup Beacons |
| Ryan Carey | 2017 | PhD student in ML (Oxford) and Researcher at FHI |
| Chris Cundy | 2017 | PhD student in ML, Stanford |
| Tom McGrath | 2018 | Research Scientist in AI Safety, DeepMind (London) |
| Zac Kenton | 2018 | Research Scientist in AI Safety, DeepMind (London) |
| Richard Ngo | 2018 | Research Scientist, OpenAI |
| Jan Kirchner | 2022 | Research Scientist, OpenAI (Alignment Team) |
In 2021-2022 I worked with Stephanie Lin (now OpenAI Alignment Team) and Lukas Finnveden (now Open Philanthropy), who were research scholars at FHI.
Past Collaborators
- Noah Goodman (Stanford)
- Andreas Stuhlmüller (Ought)
- Katja Grace (AI Impacts)
- Jan Leike (DeepMind, OpenAI)
- Allan Dafoe (Oxford)
- Baobao Zhang (FHI, MIT)
- Jacob Steinhardt (Stanford, UC Berkeley)
- Sebastian Schulze (Oxford)
- Yarin Gal (Oxford)
- Mihaela Curmei (UC Berkeley)
- Andrew Ilyas (MIT)
- Jacob Hilton (OpenAI)
- Stephanie Lin (Oxford)
Adapted from Matei Zaharia and Andreas Viklund.