Research

Academic or research source. Check the methodology, sample size, and whether it's been replicated.

Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation

Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware...

arXiv cs.AI · Mar 03, 2026 15:47 UTC · Paper: ~15 min

2-Minute Brief

According to arXiv cs.AI: Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and appli

Read Original

Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation

TLDR

Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware...

Artifacts

Paper PDF

2-Minute Brief

According to arXiv cs.AI: Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and appli

Open

O open S save B back M mode