Month in 4 Papers (August 2025)

Last Updated on September 4, 2025 by Editorial Team

Author(s): Ala Falaki, PhD

Originally published on Towards AI.

The complexity of reasoning, the promise of memory, and the rise of small models are shaping agentic AI.

This series of posts is designed to bring you the newest findings and developments in the NLP field. I’ll delve into four significant research papers each month, offering a comprehensive summary. Be sure to visit my blog regularly or subscribe to my newsletter for monthly updates. Let’s dive in!

Illusion of Thinking 🤯

📝 The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [paper]


The controversial paper! This is Apple’s response to the rise of LRMs (Large Reasoning Models) and its answer to the question “Do these models really think?”. The authors argue that current benchmarks for evaluating LLMs are flawed because they emphasize final-answer accuracy, which may result from exposure to contaminated training data rather than genuine reasoning (or “thinking”!).

To address this, they introduce a new evaluation framework based on controlled puzzle environments, such as the Tower of Hanoi, that allow systematic variation of complexity and enable precise analysis of both final answers and the intermediate reasoning process. The results? There is a lot to talk about. They showed that model performance shifts with problem difficulty: regular models do better on easy tasks, reasoning models work better on medium ones, but both struggle and fail on very hard problems.
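To make this concrete, here is a minimal sketch (my own illustration, not the authors’ code) of what such a controlled environment can look like: a Tower of Hanoi simulator where the number of disks dials the complexity, and every intermediate move in a model’s trace can be verified rather than just the final answer.

```python
# A toy "controlled puzzle environment": Tower of Hanoi with n disks.
# The disk count controls complexity, and a proposed move sequence can be
# replayed step by step, so failures are localized to an exact step.

def is_valid_move(pegs, src, dst):
    """A move is legal if src has a disk and it is smaller than dst's top disk."""
    if not pegs[src]:
        return False
    return not pegs[dst] or pegs[src][-1] < pegs[dst][-1]

def evaluate_trace(n_disks, moves):
    """Replay a model's proposed moves; return (solved, index_of_first_bad_step)."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # all disks start on peg 0
    for step, (src, dst) in enumerate(moves):
        if not is_valid_move(pegs, src, dst):
            return False, step                    # precise failure point
        pegs[dst].append(pegs[src].pop())
    solved = len(pegs[2]) == n_disks
    return solved, None if solved else len(moves)

print(evaluate_trace(2, [(0, 1), (0, 2), (1, 2)]))  # (True, None)
```

The appeal of this setup is that difficulty scales predictably (the optimal solution needs 2^n - 1 moves), and contamination is much less of a worry than with public math benchmarks.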

One reason is that LRMs put in less reasoning effort (shorter thinking traces) as problems become too complex, even when they have enough token budget to continue. This important finding suggests that the models give up once a problem passes a certain level of difficulty. Even when given an explicit algorithm (for example, the procedure for solving the Tower of Hanoi), LRMs often fail to execute it correctly. This suggests they do not reliably follow logical procedures, indicating limitations in symbolic reasoning. It seems that they do more pattern matching!
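For reference, the explicit procedure in question is just the textbook recursive solution; executing it requires no search at all, only faithful step-by-step procedure following, which is exactly where the models stumble. A sketch:

```python
# The classic recursive Tower of Hanoi solution. Producing the move list is
# purely mechanical, which is why failing to execute it is so telling.

def hanoi(n, src=0, aux=1, dst=2, moves=None):
    """Emit the optimal move sequence (2^n - 1 moves) for n disks."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, dst, aux, moves)  # park n-1 disks on the spare peg
        moves.append((src, dst))            # move the largest disk
        hanoi(n - 1, aux, src, dst, moves)  # stack the n-1 disks back on top
    return moves

print(len(hanoi(10)))  # 1023 == 2^10 - 1
```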

In simple problems, models often find the right answer early but keep thinking anyway, wasting time and resources.

This “overthinking” behaviour was also noted in an earlier paper.

For moderately hard problems, they take longer to find the correct answer, going through many wrong steps first. In very complex problems, their reasoning breaks down completely, with no correct steps or meaningful progress. Also, LRMs’ self-reflection helps only up to a point; past a threshold, their ability to course-correct breaks down entirely. In short: they can think, but not efficiently; they break down on hard problems; and they fail to follow exact logic even when shown how.
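The paper quantifies this by locating where the first correct solution appears inside the thinking trace. Here is a rough sketch of that kind of analysis, where `extract_candidates` and `check` are hypothetical stand-ins for a solution extractor and a puzzle verifier:

```python
# Hedged sketch: measure how much of a reasoning trace is consumed before the
# first correct intermediate solution appears. An early hit followed by more
# "thinking" is the overthinking pattern; no hit at all is full collapse.

def first_correct_position(trace_chunks, extract_candidates, check):
    """Return the fraction of the trace used before the first correct solution."""
    for i, chunk in enumerate(trace_chunks):
        if any(check(sol) for sol in extract_candidates(chunk)):
            return (i + 1) / len(trace_chunks)  # small value on easy problems
    return None                                 # never correct: reasoning collapse
```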

Opinion: There’s been quite a bit of drama around this paper recently, and while I recognize the methodological limitations (as I’m sure the authors do too), I still think it’s an important contribution. It helps set more realistic expectations about what current models can and can’t do, and the results clearly show that even when given clear instructions, the models often fail to produce correct answers. That points to a deeper issue that deserves more focus. That said, I would have expected Apple to be further along in the AI race by now. If they see this paper as a strong defence of their position… well, that’s a bit of a stretch!

Mini Reasoning

📝 Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math [paper]

You think only large models (7B+) can do complex reasoning? This paper introduces Phi-4-Mini-Reasoning, built by further training Phi-4-Mini (3.8B), which achieves top performance among similarly sized models.

This research builds on the recent DeepSeek-R1 work showing that distilling synthetic data from large models can greatly boost reasoning in small models. It details the data-creation process and introduces a four-stage training approach. The authors created a large math dataset by starting with existing question sets, some with explanations and some without; for questions lacking reasoning steps, they used a powerful model (DeepSeek-R1) to generate detailed step-by-step answers.

They kept only the correct answers by checking them with automated tools and GPT-4o-mini, labelled each question by topic and difficulty to make training more effective, and ended up with a high-quality dataset of 10M examples to teach their small model how to reason.
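As a rough illustration of that filtering step (not the paper’s actual tooling; `teacher` and `verify_answer` are hypothetical stand-ins), rejection sampling of distilled chains of thought might look like this:

```python
# Hedged sketch of verified distillation: keep a teacher-generated
# chain of thought only if its final answer passes verification.

def build_distilled_dataset(questions, teacher, verify_answer, samples_per_q=4):
    kept = []
    for q in questions:
        for _ in range(samples_per_q):
            cot = teacher.generate(q.text)          # step-by-step answer from the big model
            if verify_answer(q, cot.final_answer):  # automated checker / LLM judge
                kept.append({
                    "question": q.text,
                    "reasoning": cot.steps,
                    "answer": cot.final_answer,
                    "topic": q.topic,               # labels make training more effective
                    "difficulty": q.difficulty,
                })
                break                               # one verified sample is enough
    return kept
```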

A simplified overview of the training process: first, the model is mid-trained on a large, diverse set of reasoning examples created by a larger model; then, it is fine-tuned on a smaller, high-quality subset to boost accuracy and generalization; third, it is trained to prefer better answers by comparing good and bad responses from earlier stages; finally, reinforcement learning with a reward based on correct final answers further refines its reasoning ability. Despite having only 3.8 billion parameters, Phi-4-Mini-Reasoning outperforms models nearly twice its size on challenging math benchmarks, surpassing 7B and 8B models like DeepSeek-R1-Distill-Qwen-7B and Llama-8B on MATH-500 and AIME24.

Ablation studies clearly demonstrate how each stage of the training process contributes to the final performance; I suggest reading the paper for more details on the results and training setup if you’re interested.

Synergize Memory

📝 MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents [paper]

Current LLM agents append all past interactions to their context, leading to ever-growing memory and degraded reasoning. This paper teaches the agent to keep only what matters, staying fast and focused. MEM1 is a training method that helps AI agents learn to handle multi-step tasks, like answering linked questions or searching the web, without needing to remember everything. For each query, it generates both a response and an updated memory summary, the internal state (<IS>).

The <IS> acts like a small, continuously updated memory: it keeps important facts, the steps completed and what’s left, and hints for what to do next. So, instead of saving everything in a growing prompt, the model rewrites <IS> each turn with just what matters, like updating a scratchpad. Training uses RL (PPO) and only rewards the agent for getting the final answer right; no extra rewards are added for memory use or format. The agent naturally learns to manage memory well just by trying to succeed at the task.
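A minimal sketch of this loop as I read it (the `model` and `env` interfaces are illustrative stand-ins, not the authors’ code):

```python
# MEM1-style agent loop: the prompt never grows, because the model rewrites a
# bounded internal state <IS> each turn instead of appending the full history.

def run_agent(model, env, task, max_turns=20):
    internal_state = f"Task: {task}"  # the scratchpad <IS>
    for _ in range(max_turns):
        # Context is only the current <IS> plus the latest observation.
        out = model.generate(f"<IS>{internal_state}</IS>\n{env.observation()}")
        if out.final_answer is not None:
            return out.final_answer              # the only thing the RL reward checks
        env.act(out.action)                      # e.g. issue a search query
        internal_state = out.new_internal_state  # consolidated facts + next steps
    return None
```

Because the reward touches only the final answer, the memory-compression behaviour is emergent rather than hand-engineered.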

The authors create harder tasks by combining multiple questions into one, turning simple QA into multi-step problems; this helps train the agent to handle longer, more complex tasks using the same datasets. Despite being only 7B in size, MEM1 outperforms larger models like Qwen2.5-14B, uses up to 3.7× less memory, and runs 1.78× faster, while achieving 3.5× better accuracy on some tasks. Unlike other models that collapse on long tasks, it stays accurate and efficient.

Future of Agents

📝 Small Language Models are the Future of Agentic AI [paper] [blog]

This paper argues that SLMs (small language models) are not only sufficient but preferable to LLMs for many agentic AI tasks, highlighting their efficiency, adaptability, and performance in narrow, structured domains. Since most agent tasks are simple and repetitive, using large generalist models is often unnecessary and inefficient. SLMs can perform core agentic functions like code generation, tool use, and instruction following with high accuracy.

Additionally, they run faster, consume less energy, and require significantly fewer computational resources than LLMs. Through case studies of open-source agentic systems, the authors estimate that 40–70% of LLM calls in these agents can be reliably handled by specialized SLMs. Their LLM-to-SLM conversion algorithm involves collecting agent interaction data, clustering it into task types, and fine-tuning small models for those specific functions. This process enables specialized SLMs to replace generalist LLMs, reducing cost while maintaining performance.
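As an illustration of that conversion loop (all names here, such as `embed`, `cluster`, and `finetune_slm`, are assumptions for the sketch rather than the paper’s API):

```python
# Hedged sketch of LLM-to-SLM conversion: log the agent's LLM calls, cluster
# them into recurring task types, and fine-tune one small model per cluster.

from collections import defaultdict

def convert_llm_calls_to_slms(call_logs, embed, cluster, finetune_slm,
                              min_cluster_size=100):
    # 1. Group logged prompts into recurring, narrow task types.
    labels = cluster([embed(call["prompt"]) for call in call_logs])
    by_task = defaultdict(list)
    for call, label in zip(call_logs, labels):
        by_task[label].append((call["prompt"], call["response"]))
    # 2. Fine-tune one specialized SLM per sufficiently common task type.
    routers = {}
    for label, pairs in by_task.items():
        if len(pairs) >= min_cluster_size:  # only clusters worth specializing
            routers[label] = finetune_slm(pairs)
    return routers                          # maps task type -> small model
```

At inference time, a router would classify each new call and dispatch it to the matching SLM, falling back to the generalist LLM for everything else.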

They highlight a number of results to back these claims: in terms of capability, modern SLMs like Phi-2 (2.7B) and SmolLM2 (1.7B) match or exceed the performance of older 30B–70B LLMs in code generation, reasoning, and instruction following. In terms of efficiency, serving a 7B SLM is 10–30× cheaper in latency, energy, and computational cost than serving 70–175B LLMs, and many SLMs can run in real time on consumer-grade GPUs, enabling offline use.

The work shifts the conversation from general-purpose intelligence to task-specific efficiency and provides a clear framework for adoption.

I send out a monthly newsletter for NLP nerds. Consider subscribing if you’d like to stay up to date on the latest developments in Natural Language Processing.
Read more and subscribe — join the cool kids club and sign up now!

Final Words

What do you think of this newsletter? I would like to hear your feedback.
What parts were interesting? What sections did you not like? What was missing, and what would you like to see more of in the future?
Please reach out to me at nlpiation@gmail.com.
