Research Engineer (Contract), Preparedness Team, OpenAI
My work focused on designing and implementing medium- and high-risk model autonomy evaluations for
the Preparedness framework.
Public examples include SWE-bench Verified, which measures a model's ability to solve real-world
software issues; MLE-bench (ICLR '25 Oral), which measures an agent's ability to do machine learning
engineering; and PaperBench (ICML '25), which evaluates an agent's ability to replicate
state-of-the-art AI research. These have been featured in the Financial Times and on the Lex Fridman
podcast, among other outlets, and have appeared in system cards from OpenAI, Anthropic and DeepMind.
I've also contributed to the GPT-4o, o1 and o3-mini system cards.
Our project, the Dangerous Capability Evaluations project, was mentioned by Sam Altman in a Senate
hearing. I was also a maintainer of the OpenAI evals repo while on the Policy team.
BitBlaster-16
is a 16-bit computer built from scratch using only NAND gates
and data flip-flops as primitives! :) It has its own assembly
language, compiler and operating system. I wrote a few programs
on it, including an autograd engine to do backpropagation in
computational graphs.
Fermi Poker
is a fun way to get better at making predictions under uncertainty.
Anecdotally, I've found that friends get much better at making
well-calibrated predictions after just a few rounds!
All Minus One: Modernized
is my re-interpretation of All Minus One,
a modernised adaptation of John Stuart Mill's 1859 essay On Liberty. It presents Mill's arguments
on the importance of free speech and the danger of silencing minority opinions.
I tend to limit my time on social media and news sites, as I find they often skew negative. Wikipedia and Metaculus have
become my go-to sources for staying informed about current events; I think they're both underrated. For those interested in this
perspective, I recommend Rolf Dobelli's "Stop Reading the News".