Characterizing the Confidence of Large Language Model-Based Automatic Evaluation Metrics

Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara

Main: Summarization Oral Paper

Session 8: Summarization (Oral)
Conference Room: Marie Louise 2
Conference Time: March 19, 16:00-17:30 (CET) (Europe/Malta)
Abstract: There has recently been a growing interest in using Large Language Models (LLMs) to evaluate NLP tasks automatically. Considerable research effort has been put into improving such systems towards achieving high correlation with human judgement. However, it is still unclear what level of correlation is good enough for practical applications of LLM-based automatic evaluation systems. This paper characterizes these LLM evaluators' confidence in ranking candidate NLP models and develops a configurable Monte Carlo simulation method for estimating it. We show that even automatic metrics with low correlation with human judgement can reach high-confidence rankings of candidate models with reasonable evaluation set sizes (hundreds of examples). Further, we describe tradeoff curves between LLM evaluator performance (i.e., correlation with humans) and evaluation set size; a loss in correlation can be compensated for by a modest increase in the evaluation set size. We validate our results on RoSE, a text summarization dataset, and find that our confidence estimates align with empirical observations. Code is available at https://github.com/rickardstureborg/llm-eval-confidence
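
The abstract describes a Monte Carlo simulation for estimating how confidently an automatic metric ranks two candidate systems, given the metric's correlation with human judgement and the evaluation set size. The sketch below is an illustrative approximation of that idea, not the paper's implementation: it assumes standardized, normally distributed per-example scores, a hypothetical human-judged quality gap `delta` between the two systems, and a metric whose scores correlate with human scores at level `rho`; the function name `ranking_confidence` and all parameters are invented for this example.

```python
# Minimal sketch (assumptions noted above; not the paper's actual method).
import numpy as np

def ranking_confidence(rho, delta, eval_set_size, n_trials=10_000, seed=0):
    """Estimate the probability that an automatic metric with correlation
    `rho` to human judgement ranks the truly better system first, given a
    human-score quality gap `delta` and `eval_set_size` evaluation examples."""
    rng = np.random.default_rng(seed)
    correct = 0
    for _ in range(n_trials):
        # Human scores: system A is better than system B by `delta` on average.
        human_a = rng.normal(delta, 1.0, eval_set_size)
        human_b = rng.normal(0.0, 1.0, eval_set_size)
        # Metric scores correlate with human scores at level `rho`.
        noise_scale = np.sqrt(1.0 - rho ** 2)
        metric_a = rho * human_a + noise_scale * rng.normal(0.0, 1.0, eval_set_size)
        metric_b = rho * human_b + noise_scale * rng.normal(0.0, 1.0, eval_set_size)
        # Count trials where the metric agrees with the true ranking (A > B).
        correct += metric_a.mean() > metric_b.mean()
    return correct / n_trials

# Example: a weakly correlated metric (rho = 0.3) on 200 evaluation examples.
print(ranking_confidence(rho=0.3, delta=0.2, eval_set_size=200))
```

Sweeping `rho` and `eval_set_size` in such a simulation yields tradeoff curves of the kind the abstract refers to, where weaker correlation can be offset by evaluating on more examples.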