Why we no longer evaluate SWE-bench Verified
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress.
Official announcement from OpenAI. These are their claims, and they have marketing incentives.
TLDR
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress.