Provenance Brief
Research

Academic or research source. Check the methodology, sample size, and whether it's been replicated.

OpenAI wants to retire the AI coding benchmark that everyone has been competing on

OpenAI says the popular SWE-bench Verified coding benchmark is broken: most tasks are flawed enough that correct solutions get rejected, and leading AI models have likely seen the answers during training.


