Inside NuminaMath: The AI Model That Took First Place in the AI Math Olympiad
Last Updated on July 22, 2024 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…
thesequence.substack.com
The AI Mathematical Olympiad (AIMO) has been one of the most interesting initiatives to evaluate sophisticated math reasoning in AI models. Launched a few months ago, AIMO set up a $10 million prize for models that can reason at the level of a gold medalist in the International Mathematical Olympiad (IMO), the premier competition for high school students. To perform at that level, AI models need to exhibit sophisticated capabilities in areas such as multi-step reasoning and mathematics, as well as deep language understanding. I was fascinated by the AIMO challenge and have been tracking the progress of the different models quite closely over the last few months, trying to understand the techniques they were using to solve such complex challenges.
After months of competition, NuminaMath 7B TIR emerged as the winner. The model was a collaboration between Hugging Face and Numina, a lab focused on advancing math capabilities in foundation models. You probably know a lot about Hugging Face but very little about Numina, so let's fix that.
Numina is a lab dedicated to advancing math capabilities in foundation models. Numina rallies behind the vision that math is essential to humanity and a key component of advanced intelligence. The project received initial support from Mistral and firms like General Catalyst and set its eyes on the AIMO challenge as one of its first major tests.
NuminaMath combines some established techniques with genuinely novel approaches across different areas. Today, I would like to dive into some of the details behind NuminaMath that could serve as inspiration for AI teams working on similar problems.
NuminaMath
One of the most interesting aspects of NuminaMath is that the team did not build a new architecture from scratch. Instead, they relied on the DeepSeekMath model as a baseline and extended it with a novel approach based on three fundamental components:
i. Fine-tuning Strategy: NuminaMath fine-tuned the DeepSeekMath-Base 7B model to function as a "reasoning agent." This agent tackled mathematical problems using natural language reasoning combined with a Python REPL to compute intermediate results.
ii. Decoding Algorithm: They developed a novel decoding algorithm for tool-integrated reasoning (TIR) that incorporated code execution feedback, enabling the generation of solution candidates during inference.
iii. Internal Validation Sets: Various internal validation sets were used to guide model selection and prevent overfitting to the public leaderboard.
The models were trained using open-source libraries such as TRL, PyTorch, vLLM, and DeepSpeed. Training on one node of 8 x H100 GPUs took approximately 10 hours.
Training Recipe
Fine-tuning is arguably one of NuminaMath's most interesting areas of contribution.
The fine-tuning process was divided into two stages:
i. Stage 1: The base model was fine-tuned on a diverse dataset of natural language math problems and solutions. Each solution was templated with Chain of Thought (CoT) to aid reasoning.
ii. Stage 2: The model from Stage 1 was further fine-tuned on a synthetic dataset of tool-integrated reasoning. Problems were broken down into rationales, Python programs, and their outputs. This method, influenced by Microsoft's ToRA paper, produced a reasoning agent capable of solving problems using both natural language and a Python REPL.
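To make the Stage 2 data format more concrete, here is a rough illustration of what a ToRA-style tool-integrated sample could look like. The exact template and delimiters NuminaMath used are not spelled out in the write-up, so treat the markers and field names below as assumptions:

```python
# Hypothetical illustration of a ToRA-style tool-integrated training sample.
# The precise template and delimiters are assumptions; the key idea is
# interleaving a rationale, a Python block, and its executed output.
sample = {
    "problem": "What is the remainder when 2^100 is divided by 7?",
    "solution": (
        "Powers of 2 repeat modulo 7 with period 3, so we can compute "
        "2^100 mod 7 directly.\n"
        "```python\n"
        "print(pow(2, 100, 7))\n"
        "```\n"
        "```output\n"
        "2\n"
        "```\n"
        "The remainder is $\\boxed{2}$."
    ),
}
```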
Both stages involved "full fine-tuning," where all model weights were updated during backpropagation. The "packing" feature from TRL's SFTTrainer was utilized to concatenate multiple samples into a single chunk of 2048 tokens. Gradient checkpointing and the DeepSpeed ZeRO-3 protocol ensured efficient training within available VRAM. Key hyperparameters used in each stage included a learning rate of 2e-5, a total batch size of 32, and a cosine learning rate scheduler.
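As a rough guide, here is how those reported hyperparameters could map onto TRL's SFTTrainer. The dataset id, epoch count, and DeepSpeed config file are assumptions, and some argument names (e.g. max_seq_length) vary across TRL versions:

```python
# Hedged sketch: mapping the reported hyperparameters onto TRL's SFTTrainer.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed dataset id; concatenate problem and solution into one text field.
dataset = load_dataset("AI-MO/NuminaMath-CoT", split="train")
dataset = dataset.map(lambda ex: {"text": ex["problem"] + "\n\n" + ex["solution"]})

config = SFTConfig(
    output_dir="numinamath-sft-stage1",
    learning_rate=2e-5,                # reported learning rate
    lr_scheduler_type="cosine",        # reported scheduler
    per_device_train_batch_size=4,     # 4 per GPU x 8 GPUs = total batch size 32
    gradient_checkpointing=True,       # reported for fitting within VRAM
    packing=True,                      # concatenate samples into fixed-size chunks
    max_seq_length=2048,               # reported chunk length
    deepspeed="ds_zero3_config.json",  # hypothetical ZeRO-3 config file
    num_train_epochs=3,                # assumption; not reported
)

trainer = SFTTrainer(
    model="deepseek-ai/deepseek-math-7b-base",  # the reported base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```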
Initial Attempts and Adjustments
Initial submissions using only Stage 1 fine-tuning yielded limited success. Inspired by Abdur Rafae's public prize notebook, NuminaMath integrated code execution into their training recipe. They first explored the Mix of Minimal Optimal Sets (MMOS) dataset but found it insufficient for harder problems. This led them to develop a dataset similar to the one used by DeepSeekMath Instruct / RL models, resulting in significant improvements.
Dataset Construction
NuminaMath used two main datasets for its fine-tuning process:
i. Chain of Thought Dataset: Comprised of several hundred thousand problems with solutions written in a Chain of Thought manner. Data sources ranged from Chinese high school math exercises to international mathematics competition problems. The data underwent OCR, segmentation, translation into English, and realignment to produce a Chain of Thought format.
ii. Tool-Integrated Reasoning Dataset: Focused on 60,000 problems from the Numina dataset with numerical outputs. Using a pipeline with GPT-4, they generated ToRA-like reasoning paths and executed the code to produce results. Solutions were iteratively filtered and refined to ensure accuracy.
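The write-up doesn't publish the pipeline code, but the generate-execute-filter loop it describes might look roughly like this sketch, where all helper functions are hypothetical stand-ins:

```python
# Hedged sketch of the generate-execute-filter loop described above.
# `generate_solution`, `execute_code_blocks`, and `extract_final_answer`
# are hypothetical stand-ins: the write-up only states that GPT-4 produced
# ToRA-like reasoning paths whose code was executed and whose final answers
# were checked against the ground truth.

def execute_code_blocks(solution: str) -> str:
    """Run the fenced Python blocks and splice their outputs back in."""
    ...  # sandboxed execution, omitted in this sketch

def extract_final_answer(solution: str) -> str | None:
    """Pull the final (e.g. boxed) answer out of a completed solution."""
    ...

def build_tir_dataset(problems, generate_solution, num_candidates=4):
    """Keep one verified tool-integrated solution per problem."""
    kept = []
    for problem in problems:  # e.g. the 60,000 problems with numerical outputs
        for _ in range(num_candidates):
            draft = generate_solution(problem["statement"])  # e.g. a GPT-4 call
            solution = execute_code_blocks(draft)
            if extract_final_answer(solution) == problem["answer"]:
                kept.append({"problem": problem["statement"],
                             "solution": solution})
                break  # one verified solution per problem is enough
    return kept
```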
SC-TIR Algorithm
To address high variance in model evaluation, NuminaMath developed the SC-TIR algorithm. This involved:
· Copying the input N times to define the initial batch of prompts.
· Sampling N diverse completions until a complete block of Python code was produced.
· Executing each Python block and concatenating the output.
· Repeating the process M times to allow self-correction of code errors.
· Postprocessing and applying majority voting to select the final answer.
For their winning submission, they generated N=48 candidates with a depth of M=4. Quantizing models to 8-bit precision improved upload speed and accommodated GPU constraints without significantly compromising accuracy.
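Putting those steps together, a minimal sketch of the SC-TIR loop could look like the following, where `sample`, `execute`, and `extract_answer` are hypothetical helpers standing in for the model call, the Python sandbox, and answer parsing:

```python
from collections import Counter

def sc_tir(problem, sample, execute, extract_answer, n=48, m=4):
    """Minimal sketch of SC-TIR as described above.

    `sample` continues a prompt until a complete Python block is emitted,
    `execute` runs that block and returns its output as text.
    """
    # Copy the input N times to define the initial batch of prompts.
    prompts = [problem] * n
    for _ in range(m):  # depth M allows self-correction of code errors
        completions = [sample(p) for p in prompts]
        outputs = [execute(c) for c in completions]
        # Concatenate each completion and its execution output back onto the
        # prompt; in practice, trajectories that already reached a final
        # answer would stop early instead of sampling again.
        prompts = [p + c + o for p, c, o in zip(prompts, completions, outputs)]
    # Postprocess: extract candidate answers and apply majority voting.
    answers = [extract_answer(p) for p in prompts]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```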
Avoiding Overfitting
To mitigate overfitting to the public leaderboard, NuminaMath used four internal validation sets, covering problems of varying difficulty. These included datasets from AMC12 (2022, 2023) and AIME (2022, 2023, 2024), along with subsets of the MATH test set. This approach allowed them to select the most promising models and fine-tune hyperparameters effectively, balancing small representative sets with larger ones to manage submission stochasticity.
What Didn't Work and Promising Ideas
Not everything in NuminaMath was a smashing success. The team tried different ideas such as:
1. CoT Model with Majority Voting: They trained a pure Chain of Thought (CoT) model and evaluated it using majority voting. This method did not yield the desired results.
2. MMOS Model for Single-Step Solutions: They also attempted to train a model based on the Mix of Minimal Optimal Sets (MMOS) to solve problems using a single Python step. This approach was not successful either.
A Promising Approach: Kahneman-Tversky Optimisation (KTO)
Another technique involved applying KTO to new completions sampled from the SFT model. This approach was inspired by OrcaMath and involved the following steps:
– Sampling four completions per problem from the SFT model, using prompts that combined rationales and code execution from the Stage 2 dataset.
– Comparing the extracted answers to the ground truth and labeling the samples as positive if correct and negative if incorrect.
Although this form of on-policy KTO produced a slightly better model than the SFT one, it only resulted in a modest improvement (a few percentage points) on internal evaluations and scored 27/50 on the public leaderboard. One advantage of using KTO was the ability to track the implicit reward during training, which greatly assisted in debugging. For instance, successful training logs showed an increase in rewards for correct solutions while suppressing the rewards for incorrect ones.
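For reference, TRL ships a KTOTrainer that works with prompt/completion/label triples, which matches the labeling scheme described above. A minimal sketch, with assumed model ids and hyperparameters:

```python
# Hedged sketch of the on-policy KTO phase using TRL's KTOTrainer.
# The sampling/labeling step is compressed into `labeled_samples`; the model
# id and beta value are assumptions, and the tokenizer argument name varies
# across TRL versions (`tokenizer` vs. `processing_class`).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

labeled_samples = [
    # label=True when the extracted answer matched the ground truth
    {"prompt": "Problem: ...", "completion": "Rationale + code ...", "label": True},
    {"prompt": "Problem: ...", "completion": "Rationale + code ...", "label": False},
]
train_dataset = Dataset.from_list(labeled_samples)

model = AutoModelForCausalLM.from_pretrained("AI-MO/NuminaMath-7B-TIR")  # SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained("AI-MO/NuminaMath-7B-TIR")

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="numinamath-kto", beta=0.1),  # beta is an assumption
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()  # the implicit reward is logged during training, aiding debugging
```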
Unfortunately, the team didn't have enough time to include KTO in the final NuminaMath submission, but the idea seems quite promising.
The Results
NuminaMath climbed to the top of the AIMO leaderboard by answering 29 of the 50 problems. Notably, the model answered seven more problems than the second-place entry.
NuminaMath represents an important step forward for frontier math-reasoning models. The AIMO prize might be one of the toughest tests of math reasoning available today, and NuminaMath performed at a very impressive level. Hopefully, some of the ideas behind NuminaMath will inspire other models in the math and reasoning space.
Published via Towards AI