Principled Interpretability of Reward Hacking in Closed Frontier Models
In brief:
Published on January 1, 2026 4:37 PM GMT. Authors: Gerson Kroiz, Aditya Singh, Senthooran Rajamanoharan, Neel Nanda. Gerson and Aditya are co-first authors.
This is a research sprint report from Neel Nanda’s MATS 9.0 training phase.
About this source
Source: AI Alignment Forum
Type: Research Publication
Credibility: From peer-reviewed or pre-print research
TL;DR