Is AI Mathematically Competent? A Review of the Apple Study
Last Updated on October 31, 2024 by Editorial Team
Author(s): Devashish Datt Mamgain
Originally published on Towards AI.
In 2022 and 2023, large AI companies were primarily concerned with NLP, as evidenced by launches that focused on creative and conversational use. The latest models (OpenAI's o1 and the new Claude 3.5 Sonnet), however, have focused more on mathematical reasoning and science problems.
Of course, this makes logical sense. The thesis around artificial intelligence has been that once AI can perform reinforcement learning (RL) to improve itself, it can reach much higher levels of capability. However, doing RL on NLP and other open-ended linguistic tasks is difficult.
There's no absolute correct answer to "What should I do in Britain?"
Of course, LLMs can achieve a degree of correctness, but they can't improve indefinitely because there's no concrete answer to these questions.
Mathematical and scientific problems have concrete answers. AI can perform RL whenever a right answer can be checked, and, like AlphaZero did with chess, it can compare its output to the correct answer, adjust its weights, and try again repeatedly.
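To make that concrete, here is a minimal sketch of a verifiable-reward loop in Python. The `generate_answer` function is a hypothetical stand-in for a model's sampling step, not a real API; the point is only that a math problem gives an unambiguous reward signal to train against.

```python
import random

def reward(predicted: str, correct: str) -> float:
    """A math problem has one right answer, so the reward is an exact match."""
    return 1.0 if predicted.strip() == correct.strip() else 0.0

def generate_answer(problem: str) -> str:
    # Hypothetical stand-in for sampling an answer from a model.
    return random.choice(["42", "48", "54"])

def rl_step(problem: str, correct_answer: str) -> float:
    predicted = generate_answer(problem)
    r = reward(predicted, correct_answer)  # unambiguous: 1 if right, 0 if wrong
    # A real trainer would now adjust the model's weights in proportion to r.
    return r

print(rl_step("What is 8 * 6?", "48"))
```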
If you want a more comprehensive review of RL, I cover it in a previous article. Here, however, I want to review the new Apple paper and discuss mathematical reasoning in LLMs.
The Grade School Mathematics Dataset
The Grade School Math 8K (GSM8K) dataset, with 8,500 questions, is used to benchmark the performance of LLMs on mathematical reasoning. The guidelines that Surge AI and OpenAI used to create the dataset are intuitive and simple. The math problems in the dataset have:
- Simple calculations: operations most people can do in their heads, like 8*6 or 4+4.
- Multiple intermediate steps: each problem takes 2 to 8 steps to solve.
- Integer answers: the final answer is an integer value.
- Only elementary operations: addition, subtraction, multiplication, and division are used.
- No repetition of setting: each problem has a unique setting.
- Written-out operations: if you arrived at something using 8/2, write it as 8/2 instead of 4.
Using these criteria, the dataset was built and has since been used to evaluate many LLMs on mathematical reasoning.
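To make the format concrete, here is a minimal sketch of a GSM8K-style record and of extracting the final integer answer. The question is the well-known first GSM8K example (the solution wording is paraphrased), and the "#### " marker is the convention the released dataset uses to flag the final answer.

```python
import re

# A GSM8K-style record: a worded problem plus a step-by-step solution that
# writes out each operation and ends with the final integer answer.
record = {
    "question": ("Natalia sold clips to 48 of her friends in April, and then "
                 "she sold half as many clips in May. How many clips did "
                 "Natalia sell altogether in April and May?"),
    "answer": ("In May she sold 48 / 2 = 24 clips.\n"
               "Altogether she sold 48 + 24 = 72 clips.\n"
               "#### 72"),
}

def final_answer(solution: str) -> int:
    """Pull the integer after the '#### ' marker that ends each solution."""
    return int(re.search(r"####\s*(-?\d+)", solution).group(1))

print(final_answer(record["answer"]))  # 72
```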
Identifying the Problem
The GSM8K dataset has been around since 2021 and has been very influential in AI. However, the recent Apple paper puts forward a different hypothesis.
Hypothesis: LLMs are learning patterns in the dataset, not performing mathematical reasoning.
This is an issue commonly pointed out by AI researchers. Transformers, the architecture behind most LLMs, are machines designed to identify and learn patterns in data and use them for next-token prediction.
For mathematical reasoning, however, this poses a problem: LLMs struggle to solve questions in which the values or relationships are changed.
Let's Do an Experiment
Take any elementary school question. One example given in the paper is as follows:
"When Sophie watches her nephew, she gets out various toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?"
The answer is derived from this equation:
31 + 8 + 9 + x = 62
x = 62 - 31 - 8 - 9 = 14
And GPT-4o gives the right answer here.
Now, letβs modify the scenario and the numbers.
Samira is playing with her cat. Her cat has the following toys: 23 fish toys, 4 balls, and 9 rings. She recently bought him a box of goodies that brings the number of toys up to 59. How many toys were in the box?
Here, the answer is again:
23 + 4 + 9 + x = 59
x = 59 - 23 - 4 - 9 = 23
While GPT-4o can still give the right answer, the researchers found that many other LLMs could not.
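Both questions reduce to the same one-line computation: subtract the known toy counts from the total. A quick sketch:

```python
def missing_count(known_counts: list[int], total: int) -> int:
    """The unknown quantity is the total minus everything already counted."""
    return total - sum(known_counts)

print(missing_count([31, 8, 9], 62))  # 14 bouncy balls for Sophie's nephew
print(missing_count([23, 4, 9], 59))  # 23 toys in Samira's box of goodies
```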
Solving the Dataset Issue
If LLMs were performing mathematical reasoning, they should be able to handle any switch in the names, values, or phrasing of a problem. If they truly understood the numbers, it wouldn't matter that the question comes in a different format.
So, the researchers went with a simple solution. For the following question:
"When Sophie watches her nephew, she gets out various toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?"
They created a symbolic one that was:
"When <girl> watches <family>, she gets out various toys for him. The bag of building blocks has <x> blocks in it. The bin of stuffed animals has <y> stuffed animals inside. The tower of stacking rings has <z> multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to <total>. How many bouncy balls came in the tube?"
Now, you can adjust these variables to create different questions. These questions form the GSM-Symbolic dataset.
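Here is a minimal sketch of how such a template could be instantiated. The placeholder syntax, name lists, and value ranges are my own illustration rather than the paper's generation code; the point is that every draw yields a fresh question whose ground-truth answer is known by construction.

```python
import random

# Illustrative GSM-Symbolic-style template; the placeholders and ranges below
# are assumptions for this sketch, not the paper's exact generation code.
TEMPLATE = ("When {girl} watches {family}, she gets out various toys. "
            "The bag of building blocks has {x} blocks in it. "
            "The bin of stuffed animals has {y} stuffed animals inside. "
            "The tower of stacking rings has {z} multicolored rings on it. "
            "{girl} recently bought a tube of bouncy balls, bringing her "
            "total number of toys up to {total}. "
            "How many bouncy balls came in the tube?")

def make_instance() -> tuple[str, int]:
    girl = random.choice(["Sophie", "Samira", "Elena"])
    family = random.choice(["her nephew", "her niece", "her cousin"])
    x, y, z = (random.randint(2, 40) for _ in range(3))
    answer = random.randint(2, 40)              # pick the ground truth first
    total = x + y + z + answer                  # so the totals stay consistent
    question = TEMPLATE.format(girl=girl, family=family,
                               x=x, y=y, z=z, total=total)
    return question, answer

question, answer = make_instance()
print(question)
print("ground-truth answer:", answer)
```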
When tested against the new benchmark, many LLMs showed degraded performance.
The Result
The paper measures each model's accuracy drop on GSM-Symbolic versus the original GSM8K dataset. The largest fall happens with small language models (not surprising, since they have learned fewer patterns). o1-mini and GPT-4o show the smallest delta in performance.
But, surprisingly, even o1 and GPT-4o show bigger drops in performance when a second variable is introduced.
Adding Inconsequential Numbers to the Questions
What if we add an inconsequential sentence to a mathematical question? If humans see a random statement in a math problem, they know to ignore it. LLMs, however, are built to pay "attention" to every part of the question.
So, if we take the question from the paper:
"Oliver picks 44 Kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many Kiwis does Oliver have?"
And then add an extra clause:
"Oliver picks 44 Kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picked twice the number of kiwis he did on Friday, but 5 were smaller than average. How many Kiwis does Oliver have?"
Adding these inconsequential numbers and sentences to the question degrades the accuracy of LLMs, even though the correct answer has not changed.
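A quick check of why the added clause is inconsequential: the five smaller kiwis are still kiwis, so the correct total does not change. The sketch below contrasts the right computation with the subtraction the paper reports models often making.

```python
# The irrelevant clause does not change the arithmetic.
friday = 44
saturday = 58
sunday = 2 * friday                    # "double the number he picked on Friday"

correct = friday + saturday + sunday   # 44 + 58 + 88 = 190
mistake = correct - 5                  # subtracting the "5 smaller than
                                       # average" kiwis gives 185, which is wrong
print(correct, mistake)                # 190 185
```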
Results
Even frontier models fail when this additional clause is added to the question. So perhaps the latest models have started doing more advanced pattern recognition, but they are still struggling with mathematical reasoning.
The Final Question
Machine learning has always been grounded in mathematics, so it seems intuitive that the latest models would try to bring that expertise into current LLMs.
And, as evidenced by the Apple paper, the frontier models do perform much better than smaller models.
Still, this paper's key hypothesis and question are the most important part:
Are LLMs just pattern recognizers, and if they are, will they ever solve novel problems?
Ideally, you'd want AI to develop and prove hypotheses in science and mathematics; that would be a crucial step towards AGI. True human intelligence needs some pattern recognition, but it also requires reasoning and applying logic to problems.
If adaptability to new scenarios outside of known patterns is a problem for LLMs, then the models must be further developed.
Published via Towards AI