OpenAI wants to retire the AI coding benchmark that everyone has been competing on
OpenAI says the popular SWE-bench Verified coding benchmark is broken: most tasks are flawed enough to reject correct solutions, and leading AI models have likely seen the answers during training.