Inside EUREKA: Microsoft Research’s New Framework for Evaluating Foundation Models
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…
thesequence.substack.com
Evaluating foundation models is one of the next frontiers of the space. In the last few years, foundation models have largely outpaced existing benchmarks, and today only a handful of benchmarks remain relevant. Additionally, the industry lacks comprehensive frameworks for evaluating foundation models. This is the challenge that Microsoft Research tackled in a recent paper with a new evaluation framework called EUREKA.
EUREKA is a reusable, open evaluation framework designed to standardize evaluations of large foundation models (LFMs). The framework goes beyond single-score reporting and rankings to offer a more comprehensive analysis of LFM capabilities. EUREKA achieves this through:
- A flexible library for creating customizable evaluation pipelines.
- EUREKA-BENCH, an extensible collection of benchmarks that test challenging and fundamental capabilities in language and vision modalities.
EUREKA’s Approach to LFM Evaluation
Traditional evaluation practices have become insufficient for the complex, generative nature of modern LFMs. Single-score evaluations and saturated benchmarks make it hard to accurately assess LFM capabilities and hinder efforts to pinpoint specific areas for improvement. EUREKA addresses these challenges with the following approach:
- Modular Evaluation Pipelines: EUREKA provides a library for building customizable evaluation pipelines that combine data preprocessing, prompt templates, model inference, data postprocessing, metric computation, and reporting. This modularity enables reproducibility and backtracking of experiment details.
- Challenging Benchmarks: EUREKA-BENCH consists of benchmarks designed to be challenging even for the most advanced LFMs. The benchmarks focus on capabilities where models currently perform below 80% accuracy, providing room for analysis and improvement.
- Diverse Capability Coverage: EUREKA-BENCH covers various fundamental language and multimodal capabilities often overlooked in traditional evaluations. It focuses on tasks critical for complex real-world applications, such as spatial and geometric understanding.
- Granular Analysis: Instead of relying on overall single scores, EUREKA provides granular insights by disaggregating performance across experimental conditions and subcategories of data. This approach helps characterize failures and conduct meaningful model comparisons.
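To make the granular-analysis idea concrete, here is a minimal sketch of disaggregated reporting. It is not EUREKA code: the subcategories, conditions, and scores are made up, and pandas is used only for convenience.

```python
import pandas as pd

# Hypothetical per-example results: one row per (example, experimental condition).
results = pd.DataFrame(
    {
        "subcategory": ["depth", "depth", "height", "height", "depth", "height"],
        "condition":   ["zero-shot", "cot", "zero-shot", "cot", "cot", "zero-shot"],
        "correct":     [1, 0, 1, 1, 1, 0],
    }
)

# Instead of a single overall score, disaggregate accuracy by
# subcategory and condition to see where the model actually fails.
overall = results["correct"].mean()
by_slice = (
    results.groupby(["subcategory", "condition"])["correct"]
    .agg(accuracy="mean", n="count")
    .reset_index()
)

print(f"Overall accuracy: {overall:.2f}")
print(by_slice)
```

The overall number looks reasonable on its own; only the sliced view shows which subcategory and prompting condition the failures come from, which is the kind of comparison EUREKA is built around.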
Components of EUREKA
EUREKA is designed to be modular, ensuring reusability and extensibility as new models and benchmarks emerge. The framework achieves this through a pipeline-based architecture, where each pipeline represents an experiment and comprises a series of configurable components. The key components include:
- PromptProcessing: This component prepares data for inference, applies data manipulations, and handles complex prompt templates.
- Inference: This component handles model inference, enabling the evaluation of different models using the same pipeline.
- DataProcessing: This component is responsible for extracting and processing data from model outputs.
- EvalReporting: This component manages metric computation, aggregation, and reporting of evaluation results.
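To illustrate how a pipeline built from these components might be composed, here is a minimal, self-contained sketch. The class names mirror the components above, but the constructor signatures and the Pipeline wrapper are hypothetical rather than the framework's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins for EUREKA's pipeline components; the real
# framework's class names and signatures may differ.

@dataclass
class PromptProcessing:
    template: str
    def run(self, examples: List[Dict]) -> List[Dict]:
        # Fill the prompt template for each raw example.
        return [{**ex, "prompt": self.template.format(**ex)} for ex in examples]

@dataclass
class Inference:
    model_fn: Callable[[str], str]
    def run(self, examples: List[Dict]) -> List[Dict]:
        # Call the model once per prompt; swapping model_fn swaps the model under test.
        return [{**ex, "raw_output": self.model_fn(ex["prompt"])} for ex in examples]

@dataclass
class DataProcessing:
    def run(self, examples: List[Dict]) -> List[Dict]:
        # Extract the final answer from the raw model output.
        return [{**ex, "answer": ex["raw_output"].strip().lower()} for ex in examples]

@dataclass
class EvalReporting:
    def run(self, examples: List[Dict]) -> Dict:
        # Compute a simple accuracy metric and return a report.
        correct = sum(ex["answer"] == ex["label"] for ex in examples)
        return {"accuracy": correct / len(examples), "n": len(examples)}

@dataclass
class Pipeline:
    stages: List = field(default_factory=list)
    def run(self, examples: List[Dict]):
        out = examples
        for stage in self.stages[:-1]:
            out = stage.run(out)
        return self.stages[-1].run(out)

# Example: a tiny end-to-end experiment with a dummy model.
data = [{"question": "2 + 2 = ?", "label": "4"}]
pipeline = Pipeline([
    PromptProcessing(template="Answer concisely: {question}"),
    Inference(model_fn=lambda prompt: "4"),
    DataProcessing(),
    EvalReporting(),
])
print(pipeline.run(data))  # {'accuracy': 1.0, 'n': 1}
```

Because the model call is isolated in the Inference stage, swapping model_fn is enough to evaluate a different model with the exact same pipeline, which is what makes experiments reproducible and comparable.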
EUREKA-BENCH
EUREKA-BENCH is an integral part of the EUREKA framework, offering a collection of benchmarks that embody the framework’s approach to evaluation. The benchmarks cover a range of language and multimodal capabilities, each chosen for its relevance to real-world applications and its challenging nature for state-of-the-art LFMs. Here’s a closer look at some of the key benchmarks within EUREKA-BENCH:
Multimodal Benchmarks
- GeoMeter: This benchmark tests geometric reasoning using synthetic 2D images, evaluating a model’s depth and height perception abilities.
- MMMU: This benchmark evaluates multimodal question answering, requiring models to understand images, reason about their content, and answer questions across various disciplines.
- Image Understanding: This procedurally generated benchmark tests object recognition, object detection, spatial reasoning, and visual prompting using synthetic data to mitigate the risk of data leakage from publicly available datasets.
Language Benchmarks
- IFEval: This benchmark evaluates instruction following, specifically a model’s ability to follow instructions related to output style, structure, and format.
- FlenQA: This benchmark tests question answering in long-context scenarios, measuring a model’s ability to locate and reason over information within extensive text passages.
- Kitab: This benchmark focuses on information retrieval, evaluating a model’s ability to extract factual knowledge from its parametric knowledge or from given context while adhering to filtering constraints.
- Toxigen: This benchmark assesses toxicity detection and safe language generation, measuring a model’s ability to distinguish between toxic and neutral language and generate safe responses.
Addressing Non-Determinism in LFM Evaluation
EUREKA acknowledges the challenge of non-determinism in LFM outputs. To account for variations in repeated inferences, EUREKA incorporates metrics of non-determinism and conducts multiple runs for each experiment. The framework analyzes the entropy and disagreement of outcomes across repeated runs to provide insights into the stability of model outputs.
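As an illustration of what such stability metrics can look like, the sketch below (independent of EUREKA’s actual implementation) computes the entropy of the answer distribution and a simple disagreement rate across repeated runs of the same prompt.

```python
import math
from collections import Counter
from typing import List

def answer_entropy(answers: List[str]) -> float:
    """Shannon entropy (in bits) of the answer distribution over repeated runs."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def disagreement_rate(answers: List[str]) -> float:
    """Fraction of runs that differ from the most frequent (majority) answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1 - most_common_count / len(answers)

# Five repeated runs of the same prompt with a non-deterministic model.
runs = ["Paris", "Paris", "paris", "Lyon", "Paris"]
normalized = [r.strip().lower() for r in runs]

print(f"entropy: {answer_entropy(normalized):.3f} bits")      # ~0.722
print(f"disagreement: {disagreement_rate(normalized):.2f}")   # 0.20
```

An entropy of zero and a disagreement rate of zero mean the model gives the same answer on every run; higher values flag prompts where a single-run score is not trustworthy.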
Analyzing Backward Compatibility in LFM Updates
Model updates are crucial for improving LFM capabilities, but they can also introduce backward incompatibility issues. EUREKA addresses this by analyzing the impact of model updates on performance across different subcategories of data. This analysis helps identify regressions where model updates lead to decreased performance on specific tasks or types of data, ensuring a comprehensive understanding of the trade-offs involved in model evolution.
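A minimal sketch of such a regression check, assuming per-subcategory accuracies are already available for the old and new model versions (all numbers below are made up):

```python
# Hypothetical per-subcategory accuracies before and after a model update.
old_scores = {"depth": 0.62, "height": 0.55, "counting": 0.71, "ocr": 0.80}
new_scores = {"depth": 0.70, "height": 0.48, "counting": 0.74, "ocr": 0.79}

REGRESSION_THRESHOLD = 0.02  # ignore tiny fluctuations

for subcategory in old_scores:
    delta = new_scores[subcategory] - old_scores[subcategory]
    status = "REGRESSION" if delta < -REGRESSION_THRESHOLD else "ok"
    print(f"{subcategory:10s} old={old_scores[subcategory]:.2f} "
          f"new={new_scores[subcategory]:.2f} delta={delta:+.2f} {status}")
```

Even though the aggregate score improves slightly in this toy example, the per-subcategory view surfaces a regression on height reasoning that a single overall number would hide.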
Limitations and Future Directions
While EUREKA offers a comprehensive framework for evaluating LFMs, the authors acknowledge limitations and outline future directions for development.
- Capability Coverage: The current benchmarks in EUREKA-BENCH, while diverse, do not encompass all aspects of LFM capabilities. Future iterations aim to include benchmarks for evaluating responsible AI, multilingual capabilities, reasoning and planning, and advanced multimodal understanding.
- Benchmark Diversity: While EUREKA-BENCH includes datasets with various subcategories, further exploration is needed to determine the optimal amount and diversity of data required for generalizable insights.
- Data Contamination and Memorization: The authors acknowledge the challenge of data contamination and memorization in LFM evaluation. Although EUREKA-BENCH incorporates benchmarks with dynamic and procedural data generation, further research is necessary to develop generalizable methods for detecting and mitigating memorization effects.
- Prompt Sensitivity: EUREKA recognizes the impact of prompt engineering on LFM performance. Future work aims to incorporate techniques for efficient prompt optimization and analysis within the framework.
By addressing these limitations and incorporating these future directions, EUREKA aims to provide an evolving, comprehensive framework for evaluating and understanding the ever-growing capabilities of large foundation models.
EUREKA represents a very interesting step in the evolution of foundation model evaluation. The core ideas behind EUREKA could be applied to other evaluation frameworks and platforms, helping push the space forward.
Published via Towards AI