Research

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.

arXiv cs.CL · Feb 25, 2026 18:58 UTC · Paper: ~15 min
