Inside NuminaMath: The AI Model That Took First Place in the AI Math Olympiad
Last Updated on July 22, 2024 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…
thesequence.substack.com
The AI Mathematical Olympiad (AIMO) has been one of the most interesting initiatives to evaluate sophisticated math reasoning in AI models. Launched a few months ago, AIMO set up a $10 million prize for models that can reason at the level of a gold medalist in the International Mathematical Olympiad (IMO), the premier competition for high school students. To perform at that level, AI models need to exhibit sophisticated capabilities in areas such as multi-step reasoning and mathematics, as well as deep language understanding. I was fascinated by the AIMO challenge and have been tracking the progress of the different models quite closely over the last few months, trying to understand the techniques they were using to solve such complex challenges.
After months of competition, NuminaMath 7B TIR emerged as the winner. The model was a collaboration between Hugging Face and Numina, a lab focused on advancing math capabilities in foundation models. You probably know a lot about Hugging Face but very little about Numina, so let's fix that.
Numina is a lab dedicated to advancing math capabilities in foundation models. Numina rallies behind the vision that math is essential to humanity and a key component of advanced intelligence. The project received initial support from Mistral and firms like General Catalyst and set its eyes on the AIMO challenge as one of its first major tests.
NuminaMath combines some established techniques with genuinely novel approaches across different areas. Today, I would like to dive into some of the details behind NuminaMath that could serve as inspiration for AI teams working on similar problems.
NuminaMath
One of the most interesting aspects of NuminaMath is that the team did not build a new architecture from scratch. Instead, they relied on the DeepSeekMath model as a baseline and extended it with a novel approach based on three fundamental components:
i. Fine-tuning Strategy: NuminaMath fine-tuned the DeepSeekMath-Base 7B model to function as a "reasoning agent." This agent tackled mathematical problems using natural language reasoning combined with a Python REPL to compute intermediate results.
ii. Decoding Algorithm: They developed a novel decoding algorithm for tool-integrated reasoning (TIR) that incorporated code execution feedback, enabling the generation of solution candidates during inference.
iii. Internal Validation Sets: Various internal validation sets were used to guide model selection and prevent overfitting to the public leaderboard.
The models were trained using open-source libraries such as TRL, PyTorch, vLLM, and DeepSpeed. Training on one node of 8 x H100 GPUs took approximately 10 hours.
Training Recipe
Fine-tuning is arguably one of NuminaMath's most interesting areas of contribution.
The fine-tuning process was divided into two stages:
i. Stage 1: The base model was fine-tuned on a diverse dataset of natural language math problems and solutions. Each solution was templated with Chain of Thought (CoT) to aid reasoning.
ii. Stage 2: The model from Stage 1 was further fine-tuned on a synthetic dataset of tool-integrated reasoning. Problems were broken down into rationales, Python programs, and their outputs. This method, influenced by Microsoft's ToRA paper, produced a reasoning agent capable of solving problems using both natural language and a Python REPL.
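To make the Stage 2 data format more concrete, here is a rough illustration of what a ToRA-style tool-integrated sample could look like. The exact template and delimiters NuminaMath used are not spelled out in the write-up, so treat the markers and field names below as assumptions:

```python
# Hypothetical illustration of a ToRA-style tool-integrated training sample.
# The precise template and delimiters are assumptions; the key idea is
# interleaving a rationale, a Python block, and its executed output.
sample = {
    "problem": "What is the remainder when 2^100 is divided by 7?",
    "solution": (
        "Powers of 2 repeat modulo 7 with period 3, so we can compute "
        "2^100 mod 7 directly.\n"
        "```python\n"
        "print(pow(2, 100, 7))\n"
        "```\n"
        "```output\n"
        "2\n"
        "```\n"
        "The remainder is $\\boxed{2}$."
    ),
}
```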
Both stages involved "full fine-tuning," where all model weights were updated during backpropagation. The "packing" feature from TRL's SFTTrainer was utilized to concatenate multiple samples into a single chunk of 2048 tokens. Gradient checkpointing and the DeepSpeed ZeRO-3 protocol ensured efficient training within available VRAM. Key hyperparameters used in each stage included a learning rate of 2e-5, a total batch size of 32, and a cosine learning rate scheduler.
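As a rough guide, here is how those reported hyperparameters could map onto TRL's SFTTrainer. The dataset id, epoch count, and DeepSpeed config file are assumptions, and some argument names (e.g. max_seq_length) vary across TRL versions:

```python
# Hedged sketch: mapping the reported hyperparameters onto TRL's SFTTrainer.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed dataset id; concatenate problem and solution into one text field.
dataset = load_dataset("AI-MO/NuminaMath-CoT", split="train")
dataset = dataset.map(lambda ex: {"text": ex["problem"] + "\n\n" + ex["solution"]})

config = SFTConfig(
    output_dir="numinamath-sft-stage1",
    learning_rate=2e-5,                # reported learning rate
    lr_scheduler_type="cosine",        # reported scheduler
    per_device_train_batch_size=4,     # 4 per GPU x 8 GPUs = total batch size 32
    gradient_checkpointing=True,       # reported for fitting within VRAM
    packing=True,                      # concatenate samples into fixed-size chunks
    max_seq_length=2048,               # reported chunk length
    deepspeed="ds_zero3_config.json",  # hypothetical ZeRO-3 config file
    num_train_epochs=3,                # assumption; not reported
)

trainer = SFTTrainer(
    model="deepseek-ai/deepseek-math-7b-base",  # the reported base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```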
Initial Attempts and Adjustments
Initial submissions using only Stage 1 fine-tuning yielded limited success. Inspired by Abdur Rafae's public prize notebook, NuminaMath integrated code execution into their training recipe. They first explored the Mix of Minimal Optimal Sets (MMOS) dataset but found it insufficient for harder problems. This led them to develop a dataset similar to the one used by DeepSeekMath Instruct / RL models, resulting in significant improvements.
Dataset Construction
NuminaMath used two main datasets for its fine-tuning process:
i. Chain of Thought Dataset: Comprised of several hundred thousand problems with solutions written in a Chain of Thought manner. Data sources ranged from Chinese high school math exercises to international mathematics competition problems. The data underwent OCR, segmentation, translation into English, and realignment to produce a Chain of Thought format.
ii. Tool-Integrated Reasoning Dataset: Focused on 60,000 problems from the Numina dataset with numerical outputs. Using a pipeline with GPT-4, they generated ToRA-like reasoning paths and executed the code to produce results. Solutions were iteratively filtered and refined to ensure accuracy.
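The write-up doesn't publish the pipeline code, but the generate-execute-filter loop it describes might look roughly like this sketch, where all helper functions are hypothetical stand-ins:

```python
# Hedged sketch of the generate-execute-filter loop described above.
# `generate_solution`, `execute_code_blocks`, and `extract_final_answer`
# are hypothetical stand-ins: the write-up only states that GPT-4 produced
# ToRA-like reasoning paths whose code was executed and whose final answers
# were checked against the ground truth.

def execute_code_blocks(solution: str) -> str:
    """Run the fenced Python blocks and splice their outputs back in."""
    ...  # sandboxed execution, omitted in this sketch

def extract_final_answer(solution: str) -> str | None:
    """Pull the final (e.g. boxed) answer out of a completed solution."""
    ...

def build_tir_dataset(problems, generate_solution, num_candidates=4):
    """Keep one verified tool-integrated solution per problem."""
    kept = []
    for problem in problems:  # e.g. the 60,000 problems with numerical outputs
        for _ in range(num_candidates):
            draft = generate_solution(problem["statement"])  # e.g. a GPT-4 call
            solution = execute_code_blocks(draft)
            if extract_final_answer(solution) == problem["answer"]:
                kept.append({"problem": problem["statement"],
                             "solution": solution})
                break  # one verified solution per problem is enough
    return kept
```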
SC-TIR Algorithm
To address high variance in model evaluation, NuminaMath developed the SC-TIR algorithm. This involved:
· Copying the input N times to define the initial batch of prompts.
· Sampling N diverse completions until a complete block of Python code was produced.
· Executing each Python block and concatenating the output.
· Repeating the process M times to allow self-correction of code errors.
· Postprocessing and applying majority voting to select the final answer.
For their winning submission, they generated N=48 candidates with a depth of M=4. Quantizing models to 8-bit precision improved upload speed and accommodated GPU constraints without significantly compromising accuracy.
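Putting those steps together, a minimal sketch of the SC-TIR loop could look like the following, where `sample`, `execute`, and `extract_answer` are hypothetical helpers standing in for the model call, the Python sandbox, and answer parsing:

```python
from collections import Counter

def sc_tir(problem, sample, execute, extract_answer, n=48, m=4):
    """Minimal sketch of SC-TIR as described above.

    `sample` continues a prompt until a complete Python block is emitted,
    `execute` runs that block and returns its output as text.
    """
    # Copy the input N times to define the initial batch of prompts.
    prompts = [problem] * n
    for _ in range(m):  # depth M allows self-correction of code errors
        completions = [sample(p) for p in prompts]
        outputs = [execute(c) for c in completions]
        # Concatenate each completion and its execution output back onto the
        # prompt; in practice, trajectories that already reached a final
        # answer would stop early instead of sampling again.
        prompts = [p + c + o for p, c, o in zip(prompts, completions, outputs)]
    # Postprocess: extract candidate answers and apply majority voting.
    answers = [extract_answer(p) for p in prompts]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```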
Avoiding Overfitting
To mitigate overfitting to the public leaderboard, NuminaMath used four internal validation sets, covering problems of varying difficulty. These included datasets from AMC12 (2022, 2023) and AIME (2022, 2023, 2024), along with subsets of the MATH test set. This approach allowed them to select the most promising models and fine-tune hyperparameters effectively, balancing small representative sets with larger ones to manage submission stochasticity.
What Didn't Work and Promising Ideas
Not everything in NuminaMath was a smashing success. The team tried different ideas such as:
1. CoT Model with Majority Voting: They trained a pure Chain of Thought (CoT) model and evaluated it using majority voting. This method did not yield the desired results.
2. MMOS Model for Single-Step Solutions: They also attempted to train a model based on the Mix of Minimal Optimal Sets (MMOS) to solve problems using a single Python step. This approach was not successful either.
A Promising Approach: Kahneman-Tversky Optimisation (KTO)
Another technique involved applying KTO to new completions sampled from the SFT model. This approach was inspired by OrcaMath and involved the following steps:
– Sampling four completions per problem from the SFT model, using prompts that combined rationales and code execution from the Stage 2 dataset.
– Comparing the extracted answers to the ground truth and labeling the samples as positive if correct and negative if incorrect.
Although this form of on-policy KTO produced a slightly better model than the SFT one, it only resulted in a modest improvement (a few percentage points) on internal evaluations and scored 27/50 on the public leaderboard. One advantage of using KTO was the ability to track the implicit reward during training, which greatly assisted in debugging. For instance, successful training logs showed an increase in rewards for correct solutions while suppressing the rewards for incorrect ones.
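For reference, TRL ships a KTOTrainer that works with prompt/completion/label triples, which matches the labeling scheme described above. A minimal sketch, with assumed model ids and hyperparameters:

```python
# Hedged sketch of the on-policy KTO phase using TRL's KTOTrainer.
# The sampling/labeling step is compressed into `labeled_samples`; the model
# id and beta value are assumptions, and the tokenizer argument name varies
# across TRL versions (`tokenizer` vs. `processing_class`).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

labeled_samples = [
    # label=True when the extracted answer matched the ground truth
    {"prompt": "Problem: ...", "completion": "Rationale + code ...", "label": True},
    {"prompt": "Problem: ...", "completion": "Rationale + code ...", "label": False},
]
train_dataset = Dataset.from_list(labeled_samples)

model = AutoModelForCausalLM.from_pretrained("AI-MO/NuminaMath-7B-TIR")  # SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained("AI-MO/NuminaMath-7B-TIR")

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="numinamath-kto", beta=0.1),  # beta is an assumption
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()  # the implicit reward is logged during training, aiding debugging
```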
Unfortunately, the team didn't have enough time to include KTO in the final NuminaMath submission, but the idea seems quite promising.
The Results
NuminaMath climbed to the top of the AIMO leaderboard by answering 29 of the 50 problems. Notably, the model answered seven more problems than the second-place entry.
NuminaMath represents an important step forward for frontier math-reasoning models. The AIMO prize might be one of the toughest tests of math reasoning available today, and NuminaMath performed at a very impressive level. Hopefully, some of the ideas behind NuminaMath will inspire other models in the math and reasoning space.
Published via Towards AI