My work to date has focused on designing and implementing medium- and high-risk model autonomy evaluations for the Preparedness framework.
Public examples include SWE-bench Verified, which measures a model's ability to solve real-world software issues, and MLE-bench, which measures an agent's ability to do machine learning engineering. These have been featured in the Financial Times and the Lex Fridman podcast, among other outlets.
I also contributed to the GPT-4o and o1 system cards, which outline the safety tests run prior to their release.
I also contributed to and helped maintain the OpenAI evals repo.
Some evals I developed were mentioned on the OpenAI blog, and our Dangerous Capability Evaluations project was mentioned by Sam Altman in Congress.
BitBlaster-16 is a 16-bit computer built from scratch using only NAND gates and data flip-flops as primitives! :) It has its own assembly language, compiler and operating system. I wrote a few programs on it, including an autograd engine to do backpropagation in computational graphs.
Fermi Poker is a fun way to get better at making predictions under uncertainty. Anecdotally, I've found friends get much better at making well-calibrated predictions after just a few rounds!
I tend to limit my time on social media and news sites, as I find they often skew negative. As Niall Ferguson put it,
"Social media is a polarization machine that we can't turn off." Wikipedia and Metaculus have
become my go-to sources for staying informed about current events; I think they're both underrated. For those interested in this
perspective, I recommend Rolf Dobelli's "Stop Reading the News".