Month in 4 Papers (August 2025)

Last Updated on September 4, 2025 by Editorial Team

Author(s): Ala Falaki, PhD

Originally published on Towards AI.

The complexity of reasoning, the promise of memory, and the rise of small models are shaping agentic AI.

This series of posts is designed to bring you the newest findings and developments in the NLP field. I’ll delve into four significant research papers each month, offering a comprehensive summary. Be sure to visit my blog regularly or subscribe to my newsletter for monthly updates. Let’s dive in!

Illusion of Thinking 🤯

📝 The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [paper]


The controversial paper! This is Apple’s response to the rise of LRMs (Large Reasoning Models) and its answer to the question “Do these models really think?”. The authors argue that current benchmarks for evaluating LLMs are flawed because they emphasize final-answer accuracy, which may result from exposure to contaminated training data rather than genuine reasoning (or “thinking”!).

To address this, they introduce a new evaluation framework based on controlled puzzle environments, such as the Tower of Hanoi, that allow systematic variation of complexity and enable precise analysis of both final answers and the intermediate reasoning process. The results? There is a lot to talk about. They showed that model performance shifts with problem difficulty: regular models do better on easy tasks, reasoning models work better on medium ones, but both struggle and fail on very hard problems.
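To make this concrete, here is a minimal sketch (my own illustration, not the authors’ code) of what such a controlled environment can look like: a Tower of Hanoi simulator where the number of disks dials the complexity, and every intermediate move in a model’s trace can be verified rather than just the final answer.

```python
# A toy "controlled puzzle environment": Tower of Hanoi with n disks.
# The disk count controls complexity, and a proposed move sequence can be
# replayed step by step, so failures are localized to an exact step.

def is_valid_move(pegs, src, dst):
    """A move is legal if src has a disk and it is smaller than dst's top disk."""
    if not pegs[src]:
        return False
    return not pegs[dst] or pegs[src][-1] < pegs[dst][-1]

def evaluate_trace(n_disks, moves):
    """Replay a model's proposed moves; return (solved, index_of_first_bad_step)."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # all disks start on peg 0
    for step, (src, dst) in enumerate(moves):
        if not is_valid_move(pegs, src, dst):
            return False, step                    # precise failure point
        pegs[dst].append(pegs[src].pop())
    solved = len(pegs[2]) == n_disks
    return solved, None if solved else len(moves)

print(evaluate_trace(2, [(0, 1), (0, 2), (1, 2)]))  # (True, None)
```

The appeal of this setup is that difficulty scales predictably (the optimal solution needs 2^n - 1 moves), and contamination is much less of a worry than with public math benchmarks.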

One reason is that LRMs put in less reasoning effort (shorter thinking traces) as problems become too complex, even when they have enough token budget to continue. This important finding suggests that the models give up once a problem passes a certain level of difficulty. Even when given an explicit algorithm (for example, the procedure for solving the Tower of Hanoi), LRMs often fail to execute it correctly. This suggests they do not reliably follow logical procedures, indicating limitations in symbolic reasoning. It seems that they do more pattern matching!
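For reference, the explicit procedure in question is just the textbook recursive solution; executing it requires no search at all, only faithful step-by-step procedure following, which is exactly where the models stumble. A sketch:

```python
# The classic recursive Tower of Hanoi solution. Producing the move list is
# purely mechanical, which is why failing to execute it is so telling.

def hanoi(n, src=0, aux=1, dst=2, moves=None):
    """Emit the optimal move sequence (2^n - 1 moves) for n disks."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, dst, aux, moves)  # park n-1 disks on the spare peg
        moves.append((src, dst))            # move the largest disk
        hanoi(n - 1, aux, src, dst, moves)  # stack the n-1 disks back on top
    return moves

print(len(hanoi(10)))  # 1023 == 2^10 - 1
```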

In simple problems, models often find the right answer early but keep thinking anyway, wasting time and resources.

This “overthinking” behaviour was also noted in an earlier paper.

For moderately hard problems, they take longer to find the correct answer, going through many wrong steps first. In very complex problems, their reasoning breaks down completely, with no correct steps or meaningful progress. Also, LRMs’ self-reflection helps only up to a point; past a threshold, their ability to course-correct breaks down entirely. In short: they can think, but not efficiently; they break down on hard problems; and they fail to follow exact logic even when shown how.
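The paper quantifies this by locating where the first correct solution appears inside the thinking trace. Here is a rough sketch of that kind of analysis, where `extract_candidates` and `check` are hypothetical stand-ins for a solution extractor and a puzzle verifier:

```python
# Hedged sketch: measure how much of a reasoning trace is consumed before the
# first correct intermediate solution appears. An early hit followed by more
# "thinking" is the overthinking pattern; no hit at all is full collapse.

def first_correct_position(trace_chunks, extract_candidates, check):
    """Return the fraction of the trace used before the first correct solution."""
    for i, chunk in enumerate(trace_chunks):
        if any(check(sol) for sol in extract_candidates(chunk)):
            return (i + 1) / len(trace_chunks)  # small value on easy problems
    return None                                 # never correct: reasoning collapse
```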

Opinion: There’s been quite a bit of drama around this paper recently, and while I recognize the methodological limitations (as I’m sure the authors do too), I still think it’s an important contribution. It helps set more realistic expectations about what current models can and can’t do, and the results clearly show that even when given clear instructions, the models often fail to produce correct answers. That points to a deeper issue that deserves more focus. That said, I would have expected Apple to be further along in the AI race by now. If they see this paper as a strong defence of their position… well, that’s a bit of a stretch!

Mini Reasoning

📝 Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math [paper]

You think only large models (7B+) can do complex reasoning? This paper introduces Phi-4-Mini-Reasoning, built by further training Phi-4-Mini (3.8B), which achieves top performance among similarly sized models.

This research builds on the recent DeepSeek-R1 work showing that distilling synthetic data from large models can greatly boost reasoning in small models. It details the data-creation process and introduces a four-stage training approach. The authors created a large math dataset by starting with existing question sets, some with explanations and some without; for questions lacking reasoning steps, they used a powerful model (DeepSeek-R1) to generate detailed step-by-step answers.

They kept only the correct answers by checking them with automated tools and GPT-4o-mini, labelled each question by topic and difficulty to make training more effective, and ended up with a high-quality dataset of 10M examples to teach their small model how to reason.
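As a rough illustration of that filtering step (not the paper’s actual tooling; `teacher` and `verify_answer` are hypothetical stand-ins), rejection sampling of distilled chains of thought might look like this:

```python
# Hedged sketch of verified distillation: keep a teacher-generated
# chain of thought only if its final answer passes verification.

def build_distilled_dataset(questions, teacher, verify_answer, samples_per_q=4):
    kept = []
    for q in questions:
        for _ in range(samples_per_q):
            cot = teacher.generate(q.text)          # step-by-step answer from the big model
            if verify_answer(q, cot.final_answer):  # automated checker / LLM judge
                kept.append({
                    "question": q.text,
                    "reasoning": cot.steps,
                    "answer": cot.final_answer,
                    "topic": q.topic,               # labels make training more effective
                    "difficulty": q.difficulty,
                })
                break                               # one verified sample is enough
    return kept
```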

A simplified overview of the training process: first, the model is mid-trained on a large, diverse set of reasoning examples created by a larger model; then, it is fine-tuned on a smaller, high-quality subset to boost accuracy and generalization; third, it is trained to prefer better answers by comparing good and bad responses from earlier stages; finally, reinforcement learning with a reward based on correct final answers further refines its reasoning ability. Despite having only 3.8 billion parameters, Phi-4-Mini-Reasoning outperforms models nearly twice its size on challenging math benchmarks, surpassing 7B and 8B models like DeepSeek-R1-Distill-Qwen-7B and Llama-8B on MATH-500 and AIME24.

Ablation studies clearly demonstrate how each stage of the training process contributes to the final performance; I suggest reading the paper for more details on the results and training setup if you’re interested.

Synergize Memory

📝 MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents [paper]

Current LLM agents append all past interactions to their context, leading to ever-growing memory and degraded reasoning. This paper teaches the agent to keep only what matters, staying fast and focused. MEM1 is a training method that helps AI agents learn to handle multi-step tasks, like answering linked questions or searching the web, without needing to remember everything. For each query, it generates both a response and an updated memory summary, the internal state (<IS>).

The <IS> acts like a small, continuously updated memory: it keeps important facts, the steps completed and what’s left, and hints for what to do next. So, instead of saving everything in a growing prompt, the model rewrites <IS> each turn with just what matters, like updating a scratchpad. Training uses RL (PPO) and only rewards the agent for getting the final answer right; no extra rewards are added for memory use or format. The agent naturally learns to manage memory well just by trying to succeed at the task.
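A minimal sketch of this loop as I read it (the `model` and `env` interfaces are illustrative stand-ins, not the authors’ code):

```python
# MEM1-style agent loop: the prompt never grows, because the model rewrites a
# bounded internal state <IS> each turn instead of appending the full history.

def run_agent(model, env, task, max_turns=20):
    internal_state = f"Task: {task}"  # the scratchpad <IS>
    for _ in range(max_turns):
        # Context is only the current <IS> plus the latest observation.
        out = model.generate(f"<IS>{internal_state}</IS>\n{env.observation()}")
        if out.final_answer is not None:
            return out.final_answer              # the only thing the RL reward checks
        env.act(out.action)                      # e.g. issue a search query
        internal_state = out.new_internal_state  # consolidated facts + next steps
    return None
```

Because the reward touches only the final answer, the memory-compression behaviour is emergent rather than hand-engineered.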

The authors create harder tasks by combining multiple questions into one, turning simple QA into multi-step problems; this helps train the agent to handle longer, more complex tasks using the same datasets. Despite being only 7B in size, MEM1 outperforms larger models like Qwen2.5-14B, uses up to 3.7× less memory, and runs 1.78× faster, while achieving 3.5× better accuracy on some tasks. Unlike other models that collapse on long tasks, it stays accurate and efficient.

Future of Agents

📝 Small Language Models are the Future of Agentic AI [paper] [blog]

This paper argues that SLMs (small language models) are not only sufficient but preferable to LLMs for many agentic AI tasks, highlighting their efficiency, adaptability, and performance in narrow, structured domains. Since most agent tasks are simple and repetitive, using large generalist models is often unnecessary and inefficient. SLMs can perform core agentic functions like code generation, tool use, and instruction following with high accuracy.

Additionally, they run faster, consume less energy, and require significantly fewer computational resources than LLMs. Through case studies of open-source agentic systems, the authors estimate that 40–70% of LLM calls in these agents can be reliably handled by specialized SLMs. Their LLM-to-SLM conversion algorithm involves collecting agent interaction data, clustering it into task types, and fine-tuning small models for those specific functions. This process enables specialized SLMs to replace generalist LLMs, reducing cost while maintaining performance.
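As an illustration of that conversion loop (all names here, such as `embed`, `cluster`, and `finetune_slm`, are assumptions for the sketch rather than the paper’s API):

```python
# Hedged sketch of LLM-to-SLM conversion: log the agent's LLM calls, cluster
# them into recurring task types, and fine-tune one small model per cluster.

from collections import defaultdict

def convert_llm_calls_to_slms(call_logs, embed, cluster, finetune_slm,
                              min_cluster_size=100):
    # 1. Group logged prompts into recurring, narrow task types.
    labels = cluster([embed(call["prompt"]) for call in call_logs])
    by_task = defaultdict(list)
    for call, label in zip(call_logs, labels):
        by_task[label].append((call["prompt"], call["response"]))
    # 2. Fine-tune one specialized SLM per sufficiently common task type.
    routers = {}
    for label, pairs in by_task.items():
        if len(pairs) >= min_cluster_size:  # only clusters worth specializing
            routers[label] = finetune_slm(pairs)
    return routers                          # maps task type -> small model
```

At inference time, a router would classify each new call and dispatch it to the matching SLM, falling back to the generalist LLM for everything else.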

They highlight a number of results to back these claims: in terms of capability, modern SLMs like Phi-2 (2.7B) and SmolLM2 (1.7B) match or exceed the performance of older 30B–70B LLMs in code generation, reasoning, and instruction following. In terms of efficiency, serving a 7B SLM is 10–30× cheaper in latency, energy, and computational cost than serving 70–175B LLMs, and many SLMs can run in real time on consumer-grade GPUs, enabling offline use.

The work shifts the conversation from general-purpose intelligence to task-specific efficiency and provides a clear framework for adoption.

I send out a monthly newsletter for NLP nerds. Consider subscribing if you’d like to stay up to date on the latest developments in Natural Language Processing.
Read more and subscribe — join the cool kids club and sign up now!

Final Words

What do you think of this newsletter? I would like to hear your feedback.
What parts were interesting? What sections did you not like? What was missing, and what would you like to see more of in the future?
Please reach out to me at nlpiation@gmail.com.
