Inside EUREKA: Microsoft Research’s New Framework for Evaluating Foundation Models
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…
thesequence.substack.com
Evaluating foundation models is one of the next frontiers of the space. In the last few years, foundation models have largely outpaced existing benchmarks, and today only a handful of benchmarks remain relevant. Additionally, the industry lacks comprehensive frameworks for evaluating foundation models. This is the challenge that Microsoft Research tackled in a recent paper with a new evaluation framework called EUREKA.
EUREKA is a reusable, open evaluation framework designed to standardize evaluations of large foundation models (LFMs). The framework goes beyond single-score reporting and rankings to offer a more comprehensive analysis of LFM capabilities. EUREKA achieves this through:
- A flexible library for creating customizable evaluation pipelines.
- EUREKA-BENCH, an extensible collection of benchmarks that test challenging and fundamental capabilities in language and vision modalities.
EUREKA’s Approach to LFM Evaluation
Traditional evaluation practices have become insufficient for the complex, generative nature of modern LFMs. Single-score evaluations and saturated benchmarks make it hard to accurately assess LFM capabilities and hinder efforts to pinpoint specific areas for improvement. EUREKA addresses these challenges with the following approach:
- Modular Evaluation Pipelines: EUREKA provides a library for building customizable evaluation pipelines that combine data preprocessing, prompt templates, model inference, data postprocessing, metric computation, and reporting. This modularity enables reproducibility and backtracking of experiment details.
- Challenging Benchmarks: EUREKA-BENCH consists of benchmarks designed to be challenging even for the most advanced LFMs. The benchmarks focus on capabilities where models currently perform below 80% accuracy, providing room for analysis and improvement.
- Diverse Capability Coverage: EUREKA-BENCH covers various fundamental language and multimodal capabilities often overlooked in traditional evaluations. It focuses on tasks critical for complex real-world applications, such as spatial and geometric understanding.
- Granular Analysis: Instead of relying on overall single scores, EUREKA provides granular insights by disaggregating performance across experimental conditions and subcategories of data. This approach helps characterize failures and conduct meaningful model comparisons.
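To make the granular-analysis idea concrete, here is a minimal sketch of disaggregated reporting. It is not EUREKA code: the subcategories, conditions, and scores are made up, and pandas is used only for convenience.

```python
import pandas as pd

# Hypothetical per-example results: one row per (example, experimental condition).
results = pd.DataFrame(
    {
        "subcategory": ["depth", "depth", "height", "height", "depth", "height"],
        "condition":   ["zero-shot", "cot", "zero-shot", "cot", "cot", "zero-shot"],
        "correct":     [1, 0, 1, 1, 1, 0],
    }
)

# Instead of a single overall score, disaggregate accuracy by
# subcategory and condition to see where the model actually fails.
overall = results["correct"].mean()
by_slice = (
    results.groupby(["subcategory", "condition"])["correct"]
    .agg(accuracy="mean", n="count")
    .reset_index()
)

print(f"Overall accuracy: {overall:.2f}")
print(by_slice)
```

The overall number looks reasonable on its own; only the sliced view shows which subcategory and prompting condition the failures come from, which is the kind of comparison EUREKA is built around.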
Components of EUREKA
EUREKA is designed to be modular, ensuring reusability and extensibility as new models and benchmarks emerge. The framework achieves this through a pipeline-based architecture, where each pipeline represents an experiment and comprises a series of configurable components. The key components include:
- PromptProcessing: This component prepares data for inference, applies data manipulations, and handles complex prompt templates.
- Inference: This component handles model inference, enabling the evaluation of different models using the same pipeline.
- DataProcessing: This component is responsible for extracting and processing data from model outputs.
- EvalReporting: This component manages metric computation, aggregation, and reporting of evaluation results.
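To illustrate how a pipeline built from these components might be composed, here is a minimal, self-contained sketch. The class names mirror the components above, but the constructor signatures and the Pipeline wrapper are hypothetical rather than the framework's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins for EUREKA's pipeline components; the real
# framework's class names and signatures may differ.

@dataclass
class PromptProcessing:
    template: str
    def run(self, examples: List[Dict]) -> List[Dict]:
        # Fill the prompt template for each raw example.
        return [{**ex, "prompt": self.template.format(**ex)} for ex in examples]

@dataclass
class Inference:
    model_fn: Callable[[str], str]
    def run(self, examples: List[Dict]) -> List[Dict]:
        # Call the model once per prompt; swapping model_fn swaps the model under test.
        return [{**ex, "raw_output": self.model_fn(ex["prompt"])} for ex in examples]

@dataclass
class DataProcessing:
    def run(self, examples: List[Dict]) -> List[Dict]:
        # Extract the final answer from the raw model output.
        return [{**ex, "answer": ex["raw_output"].strip().lower()} for ex in examples]

@dataclass
class EvalReporting:
    def run(self, examples: List[Dict]) -> Dict:
        # Compute a simple accuracy metric and return a report.
        correct = sum(ex["answer"] == ex["label"] for ex in examples)
        return {"accuracy": correct / len(examples), "n": len(examples)}

@dataclass
class Pipeline:
    stages: List = field(default_factory=list)
    def run(self, examples: List[Dict]):
        out = examples
        for stage in self.stages[:-1]:
            out = stage.run(out)
        return self.stages[-1].run(out)

# Example: a tiny end-to-end experiment with a dummy model.
data = [{"question": "2 + 2 = ?", "label": "4"}]
pipeline = Pipeline([
    PromptProcessing(template="Answer concisely: {question}"),
    Inference(model_fn=lambda prompt: "4"),
    DataProcessing(),
    EvalReporting(),
])
print(pipeline.run(data))  # {'accuracy': 1.0, 'n': 1}
```

Because the model call is isolated in the Inference stage, swapping model_fn is enough to evaluate a different model with the exact same pipeline, which is what makes experiments reproducible and comparable.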
EUREKA-BENCH
EUREKA-BENCH is an integral part of the EUREKA framework, offering a collection of benchmarks that embody the framework’s approach to evaluation. The benchmarks cover a range of language and multimodal capabilities, each chosen for its relevance to real-world applications and its challenging nature for state-of-the-art LFMs. Here’s a closer look at some of the key benchmarks within EUREKA-BENCH:
Multimodal Benchmarks
- GeoMeter: This benchmark tests geometric reasoning using synthetic 2D images, evaluating a model’s depth and height perception abilities.
- MMMU: This benchmark evaluates multimodal question answering, requiring models to understand images, reason about their content, and answer questions across various disciplines.
- Image Understanding: This procedurally generated benchmark tests object recognition, object detection, spatial reasoning, and visual prompting using synthetic data to mitigate the risk of data leakage from publicly available datasets.
Language Benchmarks
- IFEval: This benchmark evaluates instruction following, specifically a model’s ability to follow instructions related to output style, structure, and format.
- FlenQA: This benchmark tests question answering in long-context scenarios, measuring a model’s ability to locate and reason over information within extensive text passages.
- Kitab: This benchmark focuses on information retrieval, evaluating a model’s ability to extract factual knowledge from its parametric knowledge or from given context while adhering to filtering constraints.
- Toxigen: This benchmark assesses toxicity detection and safe language generation, measuring a model’s ability to distinguish between toxic and neutral language and generate safe responses.
Addressing Non-Determinism in LFM Evaluation
EUREKA acknowledges the challenge of non-determinism in LFM outputs. To account for variations in repeated inferences, EUREKA incorporates metrics of non-determinism and conducts multiple runs for each experiment. The framework analyzes the entropy and disagreement of outcomes across repeated runs to provide insights into the stability of model outputs.
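As an illustration of what such stability metrics can look like, the sketch below (independent of EUREKA’s actual implementation) computes the entropy of the answer distribution and a simple disagreement rate across repeated runs of the same prompt.

```python
import math
from collections import Counter
from typing import List

def answer_entropy(answers: List[str]) -> float:
    """Shannon entropy (in bits) of the answer distribution over repeated runs."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def disagreement_rate(answers: List[str]) -> float:
    """Fraction of runs that differ from the most frequent (majority) answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1 - most_common_count / len(answers)

# Five repeated runs of the same prompt with a non-deterministic model.
runs = ["Paris", "Paris", "paris", "Lyon", "Paris"]
normalized = [r.strip().lower() for r in runs]

print(f"entropy: {answer_entropy(normalized):.3f} bits")      # ~0.722
print(f"disagreement: {disagreement_rate(normalized):.2f}")   # 0.20
```

An entropy of zero and a disagreement rate of zero mean the model gives the same answer on every run; higher values flag prompts where a single-run score is not trustworthy.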
Analyzing Backward Compatibility in LFM Updates
Model updates are crucial for improving LFM capabilities, but they can also introduce backward incompatibility issues. EUREKA addresses this by analyzing the impact of model updates on performance across different subcategories of data. This analysis helps identify regressions where model updates lead to decreased performance on specific tasks or types of data, ensuring a comprehensive understanding of the trade-offs involved in model evolution.
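A minimal sketch of such a regression check, assuming per-subcategory accuracies are already available for the old and new model versions (all numbers below are made up):

```python
# Hypothetical per-subcategory accuracies before and after a model update.
old_scores = {"depth": 0.62, "height": 0.55, "counting": 0.71, "ocr": 0.80}
new_scores = {"depth": 0.70, "height": 0.48, "counting": 0.74, "ocr": 0.79}

REGRESSION_THRESHOLD = 0.02  # ignore tiny fluctuations

for subcategory in old_scores:
    delta = new_scores[subcategory] - old_scores[subcategory]
    status = "REGRESSION" if delta < -REGRESSION_THRESHOLD else "ok"
    print(f"{subcategory:10s} old={old_scores[subcategory]:.2f} "
          f"new={new_scores[subcategory]:.2f} delta={delta:+.2f} {status}")
```

Even though the aggregate score improves slightly in this toy example, the per-subcategory view surfaces a regression on height reasoning that a single overall number would hide.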
Limitations and Future Directions
While EUREKA offers a comprehensive framework for evaluating LFMs, the authors acknowledge limitations and outline future directions for development.
- Capability Coverage: The current benchmarks in EUREKA-BENCH, while diverse, do not encompass all aspects of LFM capabilities. Future iterations aim to include benchmarks for evaluating responsible AI, multilingual capabilities, reasoning and planning, and advanced multimodal understanding.
- Benchmark Diversity: While EUREKA-BENCH includes datasets with various subcategories, further exploration is needed to determine the optimal amount and diversity of data required for generalizable insights.
- Data Contamination and Memorization: The authors acknowledge the challenge of data contamination and memorization in LFM evaluation. Although EUREKA-BENCH incorporates benchmarks with dynamic and procedural data generation, further research is necessary to develop generalizable methods for detecting and mitigating memorization effects.
- Prompt Sensitivity: EUREKA recognizes the impact of prompt engineering on LFM performance. Future work aims to incorporate techniques for efficient prompt optimization and analysis within the framework.
By addressing these limitations and incorporating these future directions, EUREKA aims to provide an evolving, comprehensive framework for evaluating and understanding the ever-growing capabilities of large foundation models.
EUREKA represents a very interesting step in the evolution of foundation model evaluation. The core ideas behind EUREKA could be applied to other evaluation frameworks and platforms, helping push the space forward.
Published via Towards AI