Inside rStar-Math, a Technique that Makes Small Models Match GPT-o1 in Math Reasoning
Last Updated on January 14, 2025 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 175,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…
thesequence.substack.com
The battle between SLMs and big LLMs is one of the most interesting trends in generative AI. We are regularly fascinated by claims of smaller models beating larger competitors on different benchmarks, and this has become even more pronounced as areas such as reasoning gain relevance. For a while, reasoning was considered a byproduct of the scaling laws, but we are now seeing emerging SLMs able to reason across different domains. One of the most impressive examples came a few days ago, when Microsoft published a paper outlining rStar-Math, a method showing that SLMs can outperform models like GPT-o1 on math reasoning without any distillation.
rStar-Math is a novel approach that significantly boosts the mathematical reasoning capabilities of small language models (SLMs). This innovative system enables SLMs to achieve performance levels comparable to, and even exceeding, OpenAI's o1, despite a significantly smaller model size. This is accomplished through a self-evolved System 2 deep thinking process that leverages Monte Carlo Tree Search (MCTS) guided by a carefully crafted Process Preference Model (PPM).
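To ground the idea of PPM-guided MCTS, here is a minimal, hypothetical sketch of a PUCT-style selection rule, where each candidate step's value estimate (informed by the PPM's score) is balanced against an exploration bonus driven by the policy model's prior. The data structure, constant, and exact formula are illustrative assumptions, not the paper's search implementation.

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the candidate step with the highest PUCT score:
    Q (a PPM-informed value estimate) plus an exploration bonus
    proportional to the policy prior and inversely to visit count."""
    total_visits = sum(ch["visits"] for ch in children) + 1
    def score(ch):
        exploit = ch["q"]
        explore = c_puct * ch["prior"] * math.sqrt(total_visits) / (1 + ch["visits"])
        return exploit + explore
    return max(children, key=score)

# Toy usage: two candidate next steps for the same partial solution.
children = [
    {"step": "expand (a+b)^2", "q": 0.6, "prior": 0.5, "visits": 3},
    {"step": "try a substitution", "q": 0.1, "prior": 0.4, "visits": 1},
]
print(puct_select(children)["step"])  # -> "expand (a+b)^2"
```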
Architecture
At the heart of rStar-Math lies a self-evolutionary process consisting of four distinct rounds. Each round focuses on progressively refining the policy SLM, which generates reasoning steps, and the PPM, which evaluates these steps, resulting in increasingly accurate and sophisticated mathematical reasoning.
- Round 1: Bootstrapping. The initial round utilizes a powerful pre-trained LLM, DeepSeek-Coder-V2-Instruct (236B), to bootstrap the process. It generates an initial set of reasoning trajectories using MCTS and a simple terminal-guided annotation that assigns a Q-value to each step based on its contribution to reaching the correct answer (a minimal code sketch of this annotation follows below). This data is then used to fine-tune a smaller 7B SLM, designated as SLM-r1, forming the first iteration of the policy model.
- Round 2: Reliable PPM. In this round, SLM-r1 is employed to generate a more extensive set of reasoning trajectories with 16 MCTS rollouts per problem. The increased rollouts lead to more reliable Q-value annotations. This data is used to train the first truly effective reward model, PPM-r2, marking a significant step towards robust System 2 reasoning.
- Round 3: PPM-Augmented MCTS. The introduction of PPM-augmented MCTS drastically improves the quality of generated trajectories. The PPM guides the search process, prioritizing steps that are more likely to lead to correct solutions. This results in a training set enriched with more challenging mathematical problems, further pushing the boundaries of SLM capabilities.
- Round 4: Solving Challenging Problems. The final round concentrates on expanding the training set coverage to include even more difficult, competition-level problems. For problems not solvable within the standard 16 MCTS rollouts, additional rollouts are conducted (up to 128), along with multiple tree expansions with varying random seeds. This strategic approach ensures that the policy SLM and PPM are exposed to and trained on a diverse and challenging set of mathematical problems.
This iterative self-evolution process culminates in a powerful policy SLM and a highly accurate PPM that can effectively guide the MCTS search to solve complex mathematical problems.
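As a concrete illustration of the terminal-guided annotation from Round 1, the sketch below scores each step by the outcomes of the rollouts it appears in: +1 when a rollout reaches the correct final answer, -1 otherwise, averaged over rollouts. The function name and exact scoring rule are assumptions for illustration, not the paper's precise recipe.

```python
from collections import defaultdict

def terminal_guided_q_values(rollouts, correct_answer):
    """Score each reasoning step purely from final-answer correctness.

    rollouts: list of (steps, final_answer) pairs, where `steps` is the list
    of intermediate reasoning steps in one trajectory. Returns a dict mapping
    each step to the average reward of the rollouts it appears in:
    +1 when the rollout reached the correct answer, -1 otherwise.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, final_answer in rollouts:
        reward = 1.0 if final_answer == correct_answer else -1.0
        for step in steps:
            totals[step] += reward
            counts[step] += 1
    return {step: totals[step] / counts[step] for step in totals}

# Toy usage: two rollouts share a first step; only one reaches the answer.
rollouts = [
    (["let x = 3", "compute 2*x + 1 = 7"], "7"),
    (["let x = 3", "compute 2*x - 1 = 5"], "5"),
]
print(terminal_guided_q_values(rollouts, correct_answer="7"))
# {'let x = 3': 0.0, 'compute 2*x + 1 = 7': 1.0, 'compute 2*x - 1 = 5': -1.0}
```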
Key Innovations
Three key innovations underpin the remarkable success of rStar-Math:
- Step-by-Step Verified Reasoning Trajectory. This novel method tackles the problem of erroneous intermediate steps often generated by LLMs. By augmenting the CoT generation with corresponding Python code and verifying the code's successful execution at each step, only valid and logically sound steps are retained. This ensures the generation of high-quality reasoning trajectories, significantly enhancing the training data's integrity.
- Process Preference Model (PPM). Existing methods for training Process Reward Models (PRMs) face a critical challenge: the need for precise step-level reward annotations, which are difficult and expensive to obtain. rStar-Math circumvents this obstacle by introducing the PPM, trained using a novel approach based on preference pairs. Instead of relying on precise reward scores, the PPM learns to distinguish positive (correct) steps from negative (incorrect or irrelevant) ones. This pairwise ranking approach effectively utilizes the relative quality information available through extensive MCTS rollouts, resulting in a reliable and effective process reward model (a minimal sketch of this pairwise objective follows below).
- Code-Augmented CoT Data Synthesis. rStar-Math utilizes a novel code-augmented CoT data synthesis method during MCTS rollouts. The policy SLM generates both a natural language (NL) CoT and corresponding Python code for each step. The Python code is then executed, and only steps with successfully executing code are retained as valid candidates. This approach effectively mitigates the issue of LLM hallucination, ensuring the generation of correct and relevant steps. Additionally, extensive MCTS rollouts automatically assign Q-values to each step based on its contribution to reaching the correct answer, serving as a valuable self-annotation mechanism for training the PPM.
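Here is a minimal sketch of the code-augmented verification idea behind the first and third innovations: each candidate step carries both a natural-language description and a snippet of Python, and only candidates whose code executes cleanly survive. The helper names are hypothetical, and a real system would sandbox the execution rather than call exec() directly.

```python
def step_code_executes(code: str, state: dict) -> bool:
    """Return True if the step's Python code runs without raising.

    exec() is used here only for illustration; `state` carries variables
    defined by earlier, already-verified steps.
    """
    try:
        exec(code, {}, state)
        return True
    except Exception:
        return False

def filter_verified_steps(candidates, state):
    """Keep only candidate steps whose attached Python code executes.

    `candidates` is a list of (nl_step, python_code) pairs produced by the
    policy model; surviving steps update the shared `state` so later steps
    can build on them.
    """
    verified = []
    for nl_step, code in candidates:
        if step_code_executes(code, state):
            verified.append((nl_step, code))
    return verified

# Toy usage: the second candidate divides by zero and is discarded.
state = {}
candidates = [
    ("Let x be 4, so x squared is 16.", "x = 4\nassert x**2 == 16"),
    ("Divide by x - 4.", "y = 1 / (4 - 4)"),
]
print([step for step, _ in filter_verified_steps(candidates, state)])
# ['Let x be 4, so x squared is 16.']
```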
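And here is a minimal sketch of the preference-pair training behind the PPM, assuming a PyTorch scoring head that emits one scalar per step: steps with high MCTS Q-values are treated as preferred, steps with low Q-values as rejected, and a Bradley-Terry style ranking loss pushes the preferred score above the rejected one. The pairing rule and loss form here are simplifications, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def build_step_pairs(steps_with_q, k=2):
    """One simple pairing rule: the k highest-Q steps are preferred, the k
    lowest-Q steps are rejected. (The paper's selection also conditions on
    whether the steps lead to correct final answers.)"""
    ranked = sorted(steps_with_q, key=lambda sq: sq[1], reverse=True)
    preferred = [s for s, _ in ranked[:k]]
    rejected = [s for s, _ in ranked[-k:]]
    return list(zip(preferred, rejected))

def ppm_pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry style ranking loss: the PPM learns to score the
    preferred step above the rejected one rather than regress an exact
    per-step reward."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage with scalar scores a PPM head would emit for each step in a pair.
pairs = build_step_pairs(
    [("step A", 0.9), ("step B", 0.7), ("step C", -0.5), ("step D", -0.8)]
)
score_preferred = torch.tensor([1.2, 0.8], requires_grad=True)  # preferred steps
score_rejected = torch.tensor([-0.4, 0.3], requires_grad=True)  # rejected steps
loss = ppm_pairwise_loss(score_preferred, score_rejected)
loss.backward()
print(pairs, float(loss))
```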
Performance and Impact
rStar-Math demonstrates remarkable performance across a variety of challenging math benchmarks, consistently achieving state-of-the-art results and surpassing existing SLM and even some larger LLM solutions.
- Outperforming OpenAI o1. On the MATH benchmark, rStar-Math boosts the accuracy of Qwen2.5-Math-7B from 58.8% to 90.0%, exceeding o1-preview by 4.5%. On the demanding AIME 2024, it solves an average of 53.3% of problems, placing it in the top 20% of high school students taking this challenging exam.
- Generalizability. rStar-Math shows strong generalizability, achieving impressive results on diverse math benchmarks beyond the commonly used MATH and GSM8K datasets. This includes outperforming o1-mini on the College Math benchmark and setting new state-of-the-art scores on the Olympiad Bench and the Chinese College Entrance Math Exam (Gaokao).
- Intrinsic Self-Reflection. Intriguingly, the MCTS-driven deep thinking process in rStar-Math exhibits an emergent self-reflection capability. The policy model can identify low-quality steps, backtrack, and explore alternative solutions, showcasing a level of meta-cognitive awareness not explicitly trained for.
- PPM Guiding Reasoning. Experiments reveal that the PPM plays a crucial role in shaping the reasoning boundaries of System 2 deep thinking. Once the policy SLM achieves a certain level of competence, the PPM becomes the primary factor determining the upper limit of the systemβs performance.
Conclusion
rStar-Math presents a significant advancement in the field of LLM-based mathematical reasoning. Its innovative self-evolutionary approach, combined with the novel PPM and code-augmented CoT data synthesis, enables smaller LLMs to achieve remarkable performance levels, rivaling and even surpassing larger, more computationally expensive models. The emergent self-reflection capability further highlights the potential of this method. rStar-Math's success in unlocking the deep thinking capabilities of SLMs holds immense promise for future research in various domains, including theorem proving, code reasoning, and general problem-solving.
Published via Towards AI