So far, automatic offline evaluation benchmarks in information retrieval, recommendation systems, and artificial intelligence in general (based on comparisons between system output and a gold standard) have focused on effectiveness, with parallel lines of research on issues such as bias or diversity. With the emergence of large language models, however, it has become necessary to consider a broader spectrum of evaluation dimensions, including aspects such as harmful content, explainability, hallucination, informativeness, or reasoning capabilities. This talk presents a taxonomy of evaluation dimensions, together with existing benchmarks and metrics, highlighting their strengths and limitations. The goal is to provide a comprehensive overview that enables research to be designed from these different perspectives, and to outline a general methodology for the offline evaluation of intelligent systems.