Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
Academic or research source. Check the methodology, sample size, and whether it's been replicated.
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
TLDR
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.