Is AI Mathematically Competent? A Review of the Apple Study

Last Updated on October 31, 2024 by Editorial Team

Author(s): Devashish Datt Mamgain

Originally published on Towards AI.

In 2022 and 2023, large AI companies were primarily concerned with NLP. This was evident in launches that focused more on creative use and Mira. However, the latest models (o1 and the new Claude 3.5 Sonnet) have focused more on mathematical reasoning and science problems.

Of course, this makes logical sense. The thesis around artificial intelligence has been that once AI can use reinforcement learning (RL) to improve itself, it can reach much higher levels of capability. However, doing RL on NLP and other linguistic tasks is difficult.

There’s no absolute correct answer to “What should I do in Britain?”
Of course, LLMs can achieve a degree of correctness, but they can’t infinitely improve because there’s no concrete answer to these questions.

Mathematical and scientific problems have concrete answers. AI can perform RL when a right answer can be evaluated. And, like AlphaZero did with chess, AI can look at the correct answer, adjust weights, and try to get to the correct answer repeatedly.
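To make the contrast concrete, here is a minimal sketch of what a "verifiable" reward could look like for a maths problem. The function name and the simple 0/1 scoring are my own illustrative assumptions; they are not taken from the paper or from any RL library.

```python
def exact_match_reward(model_answer: str, correct_answer: int) -> float:
    """Return 1.0 if the model's final answer equals the known answer, else 0.0.

    Hypothetical helper: a real training setup would first need to extract
    the final answer from the model's full response.
    """
    try:
        return 1.0 if int(model_answer.strip()) == correct_answer else 0.0
    except ValueError:
        # Output that is not a plain integer cannot be verified, so it earns nothing.
        return 0.0


print(exact_match_reward("14", 14))        # 1.0
print(exact_match_reward("fourteen", 14))  # 0.0
```

No comparable function can be written for “What should I do in Britain?”, which is why open-ended language tasks are harder to optimize this way.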

If you want a more comprehensive review of RL, I describe it in a previous article. Here, I want to review the new paper and discuss mathematical reasoning in LLMs.

The Grade School Mathematics Dataset

The Grade School Maths dataset (GSM8K), with 8,500 questions, is used to benchmark the performance of LLMs in mathematical reasoning. The guidelines that Surge and OpenAI used to create the dataset are intuitive and simple. The maths problems in the dataset have:

Simple calculations: something that most people can calculate in their heads, like 8*6 or 4+4.
Multiple intermediate steps: each problem has 2 to 8 steps in the solution.
Integer answers: the answer should be an integer value.
Only elementary operations: addition, subtraction, multiplication, and division are the only operations used in the dataset.
No repetition of setting: there’s a unique setting to each problem.
Written-out operations: if you arrived at something using 8/2, write it as 8/2 instead of 4.
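Some of these guidelines can be checked mechanically. The sketch below assumes a hypothetical representation where each solution step is stored as an expression string such as "8/2"; the function name and this representation are mine, not part of the dataset's tooling.

```python
import re

# Characters allowed in a step: digits, whitespace, + - * /, parentheses, decimal points.
# This only screens out other symbols; it does not check that the expression is well formed.
ALLOWED = re.compile(r"[\d\s+\-*/().]+")


def meets_guidelines(steps: list[str], answer: float) -> bool:
    """Check step count (2 to 8), elementary operations only, and an integer answer."""
    has_valid_length = 2 <= len(steps) <= 8
    uses_elementary_ops = all(ALLOWED.fullmatch(step) for step in steps)
    has_integer_answer = float(answer).is_integer()
    return has_valid_length and uses_elementary_ops and has_integer_answer


print(meets_guidelines(["31+8+9", "62-48"], 14))  # True
print(meets_guidelines(["62-48"], 14))            # False: only one step
```

The "unique setting" guideline is a judgment call for human writers, so it is left out of the check.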

Using these criteria, multiple datasets were created to evaluate LLMs on mathematical reasoning.

Identifying the Problem

The GSM8K dataset has been around since 2021 and has been very influential in AI. However, the recent Apple paper comes forward with a different hypothesis.

Hypothesis — LLMs are learning patterns in the dataset and not performing mathematical reasoning.

This is an issue commonly pointed out by AI researchers. Transformers, the architecture behind most LLMs, are designed to identify and learn patterns in data and use them for next-token prediction.

However, in mathematical reasoning terms, this poses a problem. LLMs struggle to solve questions where the values or relationships are changed.

Let’s do an Experiment

Take any elementary school question. One example given in the paper is as follows:

“When Sophie watches her nephew, she gets out various toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?”

The answer is derived from this equation:

31 + 8 + 9 + x = 62

x = 62 - 31 - 8 - 9 = 14

And GPT-4o gives the right answer here.

Now, let’s modify the scenario and the numbers.

“Samira is playing with her cat. Her cat has the following toys: 23 fish toys, 4 balls, and 9 rings. She recently bought him a box of goodies that brings the number of toys up to 59. How many toys were in the box?”

Here, the answer again comes from the same kind of equation:

23 + 4 + 9 + x = 59

x = 59 - 23 - 4 - 9 = 23
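Both questions reduce to the same subtraction, which is the point of the experiment: only the story and the numbers change. A quick sketch makes that visible (the helper name is my own):

```python
def missing_count(total: int, known_counts: list[int]) -> int:
    """Solve known_1 + known_2 + ... + x = total for x."""
    return total - sum(known_counts)


sophie = missing_count(62, [31, 8, 9])  # blocks, stuffed animals, rings
samira = missing_count(59, [23, 4, 9])  # fish toys, balls, rings
print(sophie, samira)  # 14 23
```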

While GPT-4o can still give the right answer, the researchers found that many other LLMs could not.

Solving the Dataset Issue

If LLMs are performing mathematical reasoning, they should be able to handle any change to the names and numbers in a scenario. If they understood the underlying arithmetic, it wouldn’t matter if the questions were presented in another format.

So, the researchers took a simple approach. For the following question:

"When Sophie watches her nephew, she gets out various toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?"

They created a symbolic template:

"When <girl> watches her <family>, she gets out various toys for him. The bag of building blocks has <x> blocks in it. The bin of stuffed animals has <y> stuffed animals inside. The tower of stacking rings has <z> multicolored rings on it. <girl> recently bought a tube of bouncy balls, bringing her total number of toys for her <family> up to <total>. How many bouncy balls came in the tube?"

Now, you can adjust these variables to create different questions. These questions form the GSM-Symbolic dataset.
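Here is a rough sketch of how such a template could be filled in programmatically. The name pools, the number ranges, and the `make_variant` function are assumptions for illustration; the paper's own generation code and constraints may differ. I use Python's `{placeholder}` syntax in place of the angle brackets above.

```python
import random

TEMPLATE = (
    "When {name} watches her {family}, she gets out various toys for him. "
    "The bag of building blocks has {x} blocks in it. "
    "The bin of stuffed animals has {y} stuffed animals inside. "
    "The tower of stacking rings has {z} multicolored rings on it. "
    "{name} recently bought a tube of bouncy balls, bringing her total number "
    "of toys for her {family} up to {total}. "
    "How many bouncy balls came in the tube?"
)

# Assumed value pools, chosen only for this example.
NAMES = ["Sophie", "Olivia", "Emma"]
FAMILY = ["nephew", "cousin", "brother"]


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one (question, answer) pair with freshly sampled names and numbers."""
    x, y, z = (rng.randint(5, 100) for _ in range(3))
    answer = rng.randint(5, 100)
    # Derive the total from the answer so the question is always consistent.
    total = x + y + z + answer
    question = TEMPLATE.format(
        name=rng.choice(NAMES), family=rng.choice(FAMILY),
        x=x, y=y, z=z, total=total,
    )
    return question, answer


question, answer = make_variant(random.Random(0))
print(question)
print("Expected answer:", answer)
```

Sampling the answer first and computing the total from it guarantees that every generated question has a valid integer solution.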

When tested against the new benchmark, the performance of many LLMs degraded.

The Result

The paper measures each model’s accuracy drop on GSM-Symbolic versus the GSM8K dataset. The largest fall from grace happens with small language models (not surprising; they recognize fewer patterns). o1-mini and GPT-4o show the smallest delta in performance.
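For clarity, the "accuracy drop" is just the difference between two accuracy percentages. The sketch below uses toy grading outcomes for a hypothetical model; they are not figures from the paper.

```python
def accuracy(is_correct: list[bool]) -> float:
    """Percentage of questions answered correctly."""
    return 100.0 * sum(is_correct) / len(is_correct)


def accuracy_drop(original: list[bool], symbolic: list[bool]) -> float:
    """Accuracy on the original questions minus accuracy on the symbolic variants."""
    return accuracy(original) - accuracy(symbolic)


# Toy data only: one flag per question, True if the model got it right.
gsm8k_flags = [True, True, True, False]
symbolic_flags = [True, False, True, False]
print(accuracy_drop(gsm8k_flags, symbolic_flags))  # 25.0
```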

But, surprisingly, even o1 and GPT-4o show bigger falls in performance when a second variable is introduced.

Adding Inconsequential Numbers to the Questions

What if we add an inconsequential sentence to the mathematical question? If humans see a random statement in a mathematical question, they know how to ignore it. However, LLMs are told to pay “attention” to all the parts of the question.

So, if we take the question from the paper:

"Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?"

And then add an irrelevant clause:

"Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but 5 were smaller than average. How many kiwis does Oliver have?"

Adding these inconsequential numbers and sentences to the question degrades the accuracy of LLMs.
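Working through the kiwi question shows why the extra clause should not matter. In this sketch, the "distracted" total is my illustration of one plausible mistake, where the irrelevant 5 is subtracted as if it were part of the problem.

```python
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

# Irrelevant detail: smaller kiwis are still kiwis, so this should be ignored.
smaller_than_average = 5

correct_total = friday + saturday + sunday
distracted_total = correct_total - smaller_than_average  # the mistake

print(correct_total, distracted_total)  # 190 185
```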

Results

Even frontier models fail when this additional variable is added to the question. So, maybe the latest models have started doing advanced pattern recognition, but they’re still struggling with mathematical reasoning.

The Final Question

Machine learning has always been focused on mathematics. It seems intuitive that the latest models would also try to bring that expertise into the current LLMs.

And as evidenced by the Apple paper, they perform much better than smaller models.

The key hypothesis of this paper raises the most important question:

Are LLMs just pattern recognizers, and if they are, will they ever solve novel problems?

Ideally, you’d want AI to develop and prove hypotheses in sciences and mathematics. This would be the crucial step towards AGI. True human intelligence needs some pattern recognition, but it also requires you to reason and apply logic to problems.

If adaptability to new scenarios outside of known patterns is a problem for LLMs, then the models must be further developed.


Published via Towards AI
