
Is AI Mathematically Competent? A Review of the Apple Study

Last Updated on October 31, 2024 by Editorial Team

Author(s): Devashish Datt Mamgain

Originally published on Towards AI.

AI and Maths

In 2022 and 2023, large AI companies were primarily concerned with natural language processing (NLP), as evidenced by launches that focused on creative and conversational use cases. However, the latest models (OpenAI's o1 and the new Claude 3.5 Sonnet) focus more on mathematical reasoning and science problems.

Of course, this makes logical sense. The thesis around artificial intelligence has been that once AI can use reinforcement learning (RL) to improve itself, it can reach much higher levels of capability. However, doing RL on NLP and other linguistic tasks is difficult.

There's no absolutely correct answer to "What should I do in Britain?" Of course, LLMs can achieve a degree of correctness, but they can't improve indefinitely because there's no concrete answer to such questions.

Mathematical and scientific problems have concrete answers, and AI can perform RL whenever a right answer can be checked. Much as AlphaZero did with chess, a model can compare its output against the correct answer, adjust its weights, and try again, repeatedly.
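To make that loop concrete, here is a minimal sketch (in Python, with made-up function and variable names, not anything from the paper) of why a verifiable answer gives RL a usable signal:

```python
# Minimal sketch: reward assignment when the answer is verifiable.
# `model_answer` stands in for whatever final answer an LLM produces;
# no specific model or training library is assumed.

def reward(model_answer: int, correct_answer: int) -> float:
    """Binary reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if model_answer == correct_answer else 0.0

# A maths problem has one checkable answer, so the signal is unambiguous:
print(reward(model_answer=14, correct_answer=14))  # 1.0
print(reward(model_answer=13, correct_answer=14))  # 0.0

# Contrast with "What should I do in Britain?": there is no
# correct_answer to compare against, so no such reward exists.
```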

If you want a more comprehensive review of RL, I cover it in a previous article. Here, however, I want to review the new paper and discuss mathematical reasoning in LLMs.

The Grade School Mathematics Dataset

The Grade School Math dataset (GSM8K), with 8,500 questions, is used to benchmark the performance of LLMs in mathematical reasoning. The guidelines that Surge AI and OpenAI used to create the dataset are intuitive and simple. The maths problems in the dataset have:

  1. Simple Calculations: something that most people can calculate in their heads, like 8*6 or 4+4.
  2. Multiple Intermediate Steps: each problem has 2 to 8 steps in the solution.
  3. Integer Answers: the answer should be an integer value.
  4. Only Elementary Operations: addition, subtraction, multiplication, and division are used in the dataset.
  5. No Repetition of Setting: there's a unique setting to each problem.
  6. Write the Operations: if you arrived at something using 8/2, write it as 8/2 instead of 4.

Using these criteria, multiple datasets were created to evaluate LLMs on mathematical reasoning.
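As a rough illustration, the criteria above are concrete enough to check mechanically. The sketch below is my own simplification; the function name and the string-based checks are not from the GSM8K authors. It verifies the step-count, integer-answer, and elementary-operation rules:

```python
import re

# Illustrative checker for the dataset criteria listed above.
ELEMENTARY_OPS = set("+-*/")

def satisfies_criteria(solution_steps: list[str], final_answer: str) -> bool:
    # Criterion 3: the answer must be an integer value.
    if not re.fullmatch(r"-?\d+", final_answer.strip()):
        return False
    # Criterion 2: between 2 and 8 intermediate steps.
    if not 2 <= len(solution_steps) <= 8:
        return False
    # Criterion 4: only elementary operations in the written-out steps.
    for step in solution_steps:
        ops = {ch for ch in step if not (ch.isdigit() or ch.isspace() or ch in "=()")}
        if not ops <= ELEMENTARY_OPS:
            return False
    return True

# Example: the Sophie problem discussed later in this article.
print(satisfies_criteria(["31+8+9=48", "62-48=14"], "14"))  # True
```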

Identifying the Problem

The GSM8K dataset has been around since 2021 and has been very influential in AI. However, the recent Apple paper puts forward a different hypothesis.

Hypothesis: LLMs are learning patterns in the dataset and not performing mathematical reasoning.

This is an issue commonly pointed out by AI researchers. Transformers, the architecture underlying most LLMs, are designed to identify and learn patterns in data and use them for next-token prediction.

For mathematical reasoning, however, this poses a problem: LLMs struggle to solve questions in which the values or relationships have been changed.

Let's Do an Experiment

Take any elementary school question; one example given in the paper is as follows:

"When Sophie watches her nephew, she gets out various toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?"

The answer is derived from this equation:

31 + 8 + 9 + x = 62

x = 62 - 31 - 8 - 9 = 14

And GPT-4o gives the right answer here.

Now, let's modify the scenario and the numbers.

Samira is playing with her cat. Her cat has the following toys: 23 fish toys, 4 balls, and 9 rings. She recently bought him a box of goodies that brings the number of toys up to 59. How many toys were in the box?

Here the answer is, again:

23 + 4 + 9 + x = 59

x = 59 - 23 - 4 - 9 = 23

While GPT-4o can still give the right answer, the researchers found that many other LLMs could not.
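Both questions reduce to the same missing-addend computation, which is trivial to verify programmatically (this is just a sanity check of the article's arithmetic, not anything from the paper):

```python
# Solve known + x = total for x, the pattern shared by both problems.
def missing_addend(known_toys: list[int], total: int) -> int:
    return total - sum(known_toys)

print(missing_addend([31, 8, 9], 62))  # 14 bouncy balls (Sophie)
print(missing_addend([23, 4, 9], 59))  # 23 toys in the box (Samira)
```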

Solving the Dataset Issue

If LLMs were performing mathematical reasoning, they would be able to handle any symbolic switch in the scenario or the variables. If they truly understood the numbers, it wouldn't matter that a question was phrased in another format.

So, the researchers adopted a simple solution. For the following question:

"When Sophie watches her nephew, she gets out various toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?"

They created a symbolic one that was:

"When <girl> watches <family>, she gets out various toys for him. The bag of building blocks has <x> blocks in it. The bin of stuffed animals has <y> stuffed animals inside. The tower of stacking rings has <z> multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to <total>. How many bouncy balls came in the tube?"

Now, you can adjust these variables to create different questions. These questions form the GSM-Symbolic dataset.
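As an illustration, here is roughly how such a template could be instantiated. The placeholders mirror the article's <x>, <y>, <z>, and <total>; the sampling ranges, the name list, and the consistency rule are my assumptions, not the paper's exact procedure:

```python
import random

# Sketch of instantiating a GSM-Symbolic-style template.
TEMPLATE = (
    "When {girl} watches her nephew, she gets out various toys for him. "
    "The bag of building blocks has {x} blocks in it. "
    "The bin of stuffed animals has {y} stuffed animals inside. "
    "The tower of stacking rings has {z} multicolored rings on it. "
    "{girl} recently bought a tube of bouncy balls, bringing her total "
    "number of toys for her nephew up to {total}. "
    "How many bouncy balls came in the tube?"
)

def instantiate(rng: random.Random) -> tuple[str, int]:
    x, y, z = rng.randint(2, 40), rng.randint(2, 40), rng.randint(2, 40)
    answer = rng.randint(5, 30)   # the hidden number of bouncy balls
    total = x + y + z + answer    # keep the question self-consistent
    girl = rng.choice(["Sophie", "Samira", "Elena"])
    return TEMPLATE.format(girl=girl, x=x, y=y, z=z, total=total), answer

question, answer = instantiate(random.Random(0))
print(question)
print("answer:", answer)
```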

When tested against the new benchmark, many LLMs showed degraded performance.

The Result

The paper measures each model's accuracy drop on GSM-Symbolic versus the original GSM8K dataset. The largest fall from grace happens with small language models (not surprising; they recognize fewer patterns). o1-mini and GPT-4o show the smallest delta in performance.

But, surprisingly, even o1 and GPT-4o show bigger falls in performance when a second variable is introduced.
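For clarity, the "delta" here is just the gap between a model's GSM8K accuracy and its (typically lower and more variable) accuracy across GSM-Symbolic instantiations. A hypothetical computation, with made-up numbers rather than the paper's reported figures:

```python
# Accuracy on GSM8K minus mean accuracy across GSM-Symbolic variants.
def accuracy_drop(gsm8k_acc: float, symbolic_accs: list[float]) -> float:
    return gsm8k_acc - sum(symbolic_accs) / len(symbolic_accs)

# Hypothetical model: 92% on GSM8K, varying scores across four variants.
print(accuracy_drop(0.92, [0.88, 0.85, 0.90, 0.86]))  # ~0.05 (5-point drop)
```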

Adding Inconsequential Numbers to the Questions

What if we add an inconsequential sentence to the mathematical question? If humans see a random statement in a mathematical question, they know to ignore it. However, LLMs are trained to pay "attention" to all parts of the question.

So, if we take the question from the paper:

"Oliver picks 44 Kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many Kiwis does Oliver have?"

And then add a clause:

"Oliver picks 44 Kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picked twice the number of kiwis he did on Friday, but 5 were smaller than average. How many Kiwis does Oliver have?"

Adding these inconsequential numbers and sentences to the question degrades the accuracy of LLMs.
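A sketch of how such a distractor could be injected automatically (the paper calls this variant GSM-NoOp; the clause list and insertion rule below are my own illustrative assumptions):

```python
# Append a no-op clause to the last statement before the final question.
# The numbers the clause mentions must not affect the answer.
DISTRACTORS = [
    "but 5 of them were a bit smaller than average",
    "although 3 of them were slightly bruised",
]

def add_noop_clause(question: str, distractor: str) -> str:
    statements, final_question = question.rsplit(". ", 1)
    return f"{statements}, {distractor}. {final_question}"

q = ("Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. "
     "On Sunday, he picks double the number of kiwis he did on Friday. "
     "How many kiwis does Oliver have?")
print(add_noop_clause(q, DISTRACTORS[0]))
```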

Results

Even frontier models fail when this additional clause is added to the question. So, maybe the latest models have started doing advanced pattern recognition, but they're still struggling with mathematical reasoning.

The Final Question

Machine learning has always been grounded in mathematics, so it seems intuitive that the latest models would try to bring that expertise into LLMs.

And, as evidenced by the Apple paper, they do perform much better than smaller models.

The paper's key hypothesis raises the most important question:

Are LLMs just pattern recognizers, and if they are, will they ever solve novel problems?

Ideally, you'd want AI to develop and prove hypotheses in science and mathematics; that would be a crucial step towards AGI. Human intelligence involves some pattern recognition, but it also requires reasoning and applying logic to problems.

If adapting to new scenarios outside of known patterns is a problem for LLMs, then the models must be developed further.
