New Dimensions of Evaluation Amid the Rise of Large Language Models

Abstract

To date, automatic offline evaluation benchmarks in information retrieval, recommendation systems, and artificial intelligence in general (based on comparing system output against a gold standard) have focused on effectiveness, alongside parallel lines of research on issues such as bias or diversity. With the emergence of large language models, however, it has become necessary to consider a broader spectrum of evaluation dimensions, including aspects such as harmful content, explainability, hallucination, informativeness, or reasoning capabilities. This talk presents a taxonomy of evaluation dimensions, together with existing benchmarks and metrics, highlighting their strengths and limitations. The goal is to provide a comprehensive overview that supports the design of research work from different perspectives, and to outline a general methodology for the offline evaluation of intelligent systems.

Date
20 Feb 2025, 12:00 PM – 1:00 PM
Location
B080.06.005 at RMIT & MS Teams
Building 80/435-457 Swanston St, Melbourne, VIC 3000

Recording of the Talk (RMIT Account Required)

Enrique Amigó
Associate Professor at UNED

Enrique Amigó is an Associate Professor at UNED, specializing in system evaluation and text representation. He has co-organized several international evaluation campaigns and served as General Chair of SIGIR 2022. He has received multiple awards, including the SEPLN National Award and a Google Faculty Research Award. His work on evaluation metrics and text representation has accumulated over 3,000 citations on Google Scholar.

Chenglong Ma
Research Fellow

I’m a Research Fellow at the ADM+S RMIT node. My research interests include Information Retrieval, Recommender Systems, and Responsible AI.