How AI Safety Evaluations & Benchmarks can aid in AI governance

What are AI Safety Evaluations

“Evals” are tests that measure how AI models behave, and how powerful they are becoming. In AI Safety, evals are often designed to measure dangerous capabilities, such as cybersecurity capabilities, self-replication and autonomous AI research.

Importantly, evals can measure if an AI is too dangerous to deploy. There are some red lines that no AI model should ever cross, for example when it can…

self-replicate. (E.g. RepliBench
). A self-replicating AI could escape from a lab and spread to other machines.
make more powerful AI models. E.g. RE-bench
. A self-improving AI could rapidly become more powerful than humans.

Preventing deployment is not enough

Things can go wrong even before deployment. Self-replication and self-improvement can happen in an AI lab, before the model is publicly available.

This is why we need a Pause Button. We need to globally halt the development of increasingly powerful AI models, before these dangerous capabilities are fully matured. This Pause Button should be pressed when the evaluations are showing we’re entering the danger zone.

What AI companies are doing

Most of the frontier AI companies are doing safety evaluations on their models before they deploy them and include the results of these evaluations in so called “System Cards”. Most of them (except for Meta and Apple) have signed the EU AI Code of Practice

, which mentions “state‑of‑the‑art model evaluations” (Measure 3.2).

This also means that some of these companies are not doing any safety evaluations, and the ones that are performed are not yet required and not standardized. In other words, we desperately need regulations to require standardized safety evaluations.

What countries are doing

Multiple governments are now seriously investing in AI Evaluations / Benchmarks to measure dangerous capabilities:

UK AISI has built the Inspect framework
, written Replibench
, is now investing 15M GBP in evals & alignment research grants
EU Commission is lauching a 10M EUR tender
, and a big grant with the Horizon programme
. They have also launched the The General-Purpose AI Code of Practice
, which includes a requirement to do “state‑of‑the‑art model evaluations” (Measure 3.2).
US AI Action Plan
mentions evaluations and hardware controls
China (concordia AI + Shanghai AI lab) has just released a report with a lot of evals
Other governments are working on evaluations as well

The fact that so many countries are working on evaluations creates a very important opportunity for us. If these countries and institutes would use the same benchmarks and have some common red lines, that would be an important step towards a global treaty. In addition to that, we should clearly communicate to politicians that when a red line is crossed, it’s time to stop further development.