So far, automatic offline evaluation benchmarks in information retrieval, recommendation systems, and artificial intelligence in general (based on comparisons between system output and a gold standard) have focused on effectiveness, with parallel lines of research on issues such as bias or diversity. With the emergence of large language models, however, it has become necessary to consider a broader spectrum of evaluation dimensions, including aspects such as harmful content, explainability, hallucination, informativeness, or reasoning capabilities. This talk presents a taxonomy of evaluation dimensions, together with existing benchmarks and metrics, highlighting their strengths and limitations. The goal is to provide a comprehensive overview that enables research to be designed from these different perspectives, and to outline a general methodology for the offline evaluation of intelligent systems.