Research

Academic or research source. Check the methodology, sample size, and whether it's been replicated.

Frontier Models Can Take Actions at Low Probabilities

Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that...

arXiv cs.LG · Mar 02, 2026 18:56 UTC · Paper: ~15 min

2-Minute Brief

According to arXiv cs.LG: Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a ta

Read Original

Frontier Models Can Take Actions at Low Probabilities

TLDR

Artifacts

Paper PDF

2-Minute Brief

According to arXiv cs.LG: Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a ta

Open

O open S save B back M mode