

How to Design Environments for Understanding Model Motives

Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda
*Gerson and Aditya are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran...

2-Minute Brief
  • According to AI Alignment Forum: Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda. Gerson and Aditya are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. TL;DR: Understanding why a model took an action is a key question in AI Safety. It is a difficult question, as often many motivations could plausibly lead to the same action. We are particularly concerned about the case of model incrimination, i.e. observing a model taking a bad action...