The Progress Illusion: Revisiting meta-evaluation standards of LLM evaluators
Published in EMNLP, 2025
Large language models (LLMs) are increasingly used as evaluators of other language models, but how do we evaluate the evaluators themselves? This work revisits current meta-evaluation standards for LLM-based evaluators and reveals systematic biases and limitations in existing approaches. We introduce new benchmarks and methodologies for more robust evaluation of LLM evaluators, challenging the assumption that progress in LLM capabilities automatically translates to better evaluation performance. Our findings suggest that current meta-evaluation practices may create an illusion of progress while missing critical evaluation failures.
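To make the meta-evaluation setting concrete, below is a minimal, illustrative sketch of how one might score an LLM judge against human gold labels on pairwise preferences, e.g. via raw agreement and Cohen's kappa. This is a generic illustration under assumed data and function names (`agreement_and_kappa`, the example label lists), not the benchmarks or methodology introduced in the paper.

```python
# Hypothetical sketch: meta-evaluating an LLM judge by comparing its
# pairwise preferences against human gold labels. All data here is made up.
from collections import Counter

def agreement_and_kappa(judge_labels, human_labels):
    """Return raw agreement and Cohen's kappa between two label lists."""
    assert judge_labels and len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement under independence of the two annotators.
    judge_dist = Counter(judge_labels)
    human_dist = Counter(human_labels)
    expected = sum(
        (judge_dist[c] / n) * (human_dist[c] / n)
        for c in set(judge_labels) | set(human_labels)
    )
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Hypothetical pairwise-preference labels: "A", "B", or "tie".
human = ["A", "B", "A", "tie", "B", "A", "B", "A"]
judge = ["A", "A", "A", "tie", "B", "B", "B", "A"]
acc, kappa = agreement_and_kappa(judge, human)
print(f"raw agreement = {acc:.2f}, Cohen's kappa = {kappa:.2f}")
```

Chance-corrected measures like kappa matter here because a judge that systematically favors one side can still achieve high raw agreement, which is exactly the kind of inflated signal a meta-evaluation standard needs to guard against.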
Recommended citation: Vedant Gaur et al. 2025. The Progress Illusion: Revisiting meta-evaluation standards of LLM evaluators. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP).