Explaining Tongyi DeepResearch
Last Updated on September 29, 2025 by Editorial Team
Author(s): Mengliu Zhao
Originally published on Towards AI.
What will the future of LLM training be like?
Last week, Alibaba's Tongyi Lab released its agentic research model, Tongyi DeepResearch, which outperformed OpenAI o3 and Deep Research on various tasks. Moreover, Tongyi DeepResearch has only 30B parameters in total, with 3B activated per token, while its open-source rivals are far larger: DeepSeek V3.1 has 671B parameters, and Kimi Researcher (based on Kimi K2) has 1T. We can't help but ask: how did Tongyi DeepResearch achieve this?
It turns out not to be magic. By extending the basic ReAct reasoning paradigm to an Iterative Deep Research paradigm, the model makes full use of synthetic trajectory data for training. Besides the official blog, the data synthesis and training techniques are described in the following research papers:
- AgentFounder for pretraining: synthesizing first- and higher-order action data for continual pre-training
- WebSailor-V2 and WebShaper for post-training: using random walks over a QA knowledge graph for dataset creation, and a layer-wise, expansion-based information-seeking strategy to improve data quality for SFT and GRPO-based RL
We'll explain these strategies in detail below.

AgentFounder — Agentic Continual Pretraining
The AgentFounder training scheme, proposed in the paper "Scaling Agents via Continual Pre-training", involves two stages: i) Stage I pretraining with a 32K context length, and ii) Stage II pretraining with a 128K context length.

To create the corresponding agentic data, two different synthesis methods were involved: a) first-order action synthesis (FAS) with planning action and reasoning action; b) higher-order action synthesis (HAS) for multi-decision action synthesis.
First-order action synthesis. According to the original paper, unlike traditional "wiki-style" knowledge representations such as "Paris is the capital of France", the knowledge here is anchored to an entity, e.g., ("France": "Tourist arrivals in France reached 4,222 thousand in June 2025"), to form a diverse QA set.
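To make the entity-anchoring idea concrete, here is a minimal sketch: facts are grouped under the entity they describe and then turned into QA pairs by a simple template. The function and template names are illustrative, not from the AgentFounder paper (which uses LLMs rather than templates for this step).

```python
def anchor_knowledge(entity, facts):
    """Group raw statements under the entity they describe."""
    return {entity: facts}

def synthesize_qa(anchored):
    """Turn entity-anchored facts into (question, answer) pairs."""
    qa_pairs = []
    for entity, facts in anchored.items():
        for fact in facts:
            # A template stands in for LLM-based question generation.
            qa_pairs.append((f"What do we know about {entity}?", fact))
    return qa_pairs

anchored = anchor_knowledge(
    "France",
    ["Tourist arrivals in France reached 4,222 thousand in June 2025"],
)
pairs = synthesize_qa(anchored)
```

The point is the data shape: anchoring by entity lets many heterogeneous facts hang off one key, which diversifies the resulting QA set.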

Higher-order action synthesis. Instead of using the original "plain" reasoning/action trajectory data, a candidate set of alternative reasoning/action steps is generated by LLMs at each step, so as to explore different decision possibilities without changing the final decision.
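A toy sketch of that expansion step, with a stub standing in for the LLM that proposes alternative actions (all names here are illustrative, not the paper's API):

```python
def expand_step(step, n_candidates=3):
    """Stub for an LLM proposing alternative actions at one step."""
    return [step] + [f"{step} (alternative {i})" for i in range(1, n_candidates)]

def higher_order_synthesis(trajectory, final_decision):
    """Expand every step into a candidate set; keep the final decision fixed."""
    expanded = [expand_step(s) for s in trajectory]
    return {"steps": expanded, "decision": final_decision}

traj = ["search('capital of France')", "read(result_page)"]
out = higher_order_synthesis(traj, final_decision="answer: Paris")
```

Each recorded step becomes a small decision set, so the model trained on this data sees branching possibilities rather than a single fixed path.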

WebSailor-V2 — Creating Knowledge Graph for SFT and RL
The original WebSailor paper proposed a three-level web-scraping method, which progressively built the knowledge graph:

WebSailor-V2 further creates cyclic graphs to introduce information interconnectivity, and then uses random walks to extract subgraphs.
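A minimal sketch of random-walk subgraph extraction over a cyclic graph, in the spirit of WebSailor-V2 (the toy adjacency list and walk length are invented for illustration):

```python
import random

# Adjacency list with a cycle A -> B -> C -> A.
graph = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}

def random_walk_subgraph(graph, start, steps, seed=0):
    """Walk the graph from `start`, then keep only visited nodes and edges."""
    rng = random.Random(seed)
    nodes, node = {start}, start
    for _ in range(steps):
        neighbors = graph[node]
        if not neighbors:
            break
        node = rng.choice(neighbors)
        nodes.add(node)
    # Retain only edges whose endpoints were both visited.
    return {n: [m for m in graph[n] if m in nodes] for n in nodes}

sub = random_walk_subgraph(graph, "A", steps=5)
```

Because the graph is cyclic, a walk can revisit entities through different relations, which is exactly the interconnectivity the extracted QA subgraphs are meant to capture.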
Post-training — SFT Cold Start + RL with GRPO. The QA data generated from the pruned knowledge graph is eventually used in both the SFT and RL post-training stages.
The reinforcement learning algorithm is GRPO, which DeepSeekMath proposed as an improvement on PPO that replaces the value model with group-based scores.
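The core of that replacement is the group-relative advantage: for each prompt, a group of responses is sampled, and each response's reward is normalized against the group's mean and standard deviation instead of a learned value baseline. A minimal sketch:

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt: two correct (reward 1), two not.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses get positive advantages and incorrect ones negative, with no value network to train — the saving that lets GRPO scale cheaply.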

WebShaper — An Information-driven Paradigm for Improving QA Quality
Since WebSailor-V2 first retrieves the knowledge graph from the web and then generates the QA dataset (information-driven), inconsistencies can arise between the knowledge graph and the final reasoning dataset. To prevent this, the WebShaper paper proposed a "knowledge projection" operation and used set theory to control the structure of the newly generated QA set (formalization-driven).

First, to achieve knowledge/reasoning consistency, WebShaper proposes projecting entities onto a subdomain based on certain relationships, using union and intersection operations.
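A toy illustration of knowledge projection with Python sets: entities are projected onto a subdomain via a relation, and the projections are then combined with intersection and union. The entities and relations are invented for illustration, not taken from WebShaper's datasets.

```python
def project(entities, relation):
    """Keep only the entities for which the relation holds (the 'projection')."""
    return {e for e in entities if relation(e)}

entities = {"France", "Germany", "Japan", "Brazil"}
in_europe = project(entities, lambda e: e in {"France", "Germany"})
in_g7 = project(entities, lambda e: e in {"France", "Germany", "Japan"})

both = in_europe & in_g7    # intersection: satisfies both relations
either = in_europe | in_g7  # union: satisfies at least one relation
```

Because every derived set is built from explicit relations over the same entity universe, the question structure and the underlying knowledge stay consistent by construction.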

Second, starting from a seed question, a layer-wise structure is created by a sequence of agentic expansions (composed of the union and intersection operations mentioned above), which eliminates redundancy and reasoning-shortcut issues during QA creation.
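A hedged sketch of that layer-wise expansion: each layer conjoins one new constraint onto the question, so answering requires traversing every layer rather than jumping straight to the answer. The expansion function is a simple string stub, not WebShaper's actual agent.

```python
def expand_layer(question, constraint):
    """Compose one more constraint onto the question (one 'layer')."""
    return f"{question}, and which also {constraint}"

seed = "Which country hosted the 2024 Olympics"
constraints = ["is a G7 member", "borders Germany"]  # invented layers

q = seed
for c in constraints:
    q = expand_layer(q, c)
```

Each added layer narrows the answer set, and because later layers depend on earlier ones, there is no shortcut path that skips intermediate reasoning.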
ReAct and IterResearch Inference Paradigm
So, how are the synthetic datasets above used for agentic inference? Tongyi DeepResearch supports two paradigms: ReAct and IterResearch (also called heavy mode).
ReAct is the paradigm proposed at ICLR 2023, which follows a "thought-action-observation" cycle.
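The cycle can be sketched as a short loop: the model emits a thought and an action, the environment returns an observation, and the loop repeats until the model emits a final answer. The "model" and "tool" here are stubs; the control flow is the point.

```python
def react_loop(model, tool, question, max_steps=5):
    """Run a thought-action-observation cycle until a final answer."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action = model(history)
        history.append(f"Thought: {thought}")
        if action.startswith("finish:"):
            return action.removeprefix("finish:").strip()
        observation = tool(action)  # execute the tool call
        history.append(f"Action: {action}")
        history.append(f"Observation: {observation}")
    return None

def toy_model(history):
    # Search first; answer once an observation has arrived.
    if any(h.startswith("Observation:") for h in history):
        return "I have the answer", "finish: Paris"
    return "I should search", "search('capital of France')"

answer = react_loop(toy_model, lambda a: "Paris is the capital of France",
                    "What is the capital of France?")
```

Note that the full history grows with every cycle — the context-length pressure that motivates the IterResearch design below.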

IterResearch employs a more complex paradigm: each iteration creates a fresh workspace around a central report, with noise prevention and quality checks. If a tool call is required, the corresponding tool response is returned from preset environments, along with the predicted action for the next iteration.
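A sketch of the IterResearch idea: instead of carrying an ever-growing history, each round rebuilds a compact workspace from the central report, so noise does not accumulate across iterations. All names and the report-update step here are illustrative stubs, not the actual implementation.

```python
def iter_research(agent, env, question, rounds=3):
    """Each round rebuilds a fresh workspace from the central report."""
    report = ""  # central report carried across rounds
    for _ in range(rounds):
        workspace = {"question": question, "report": report}
        action = agent(workspace)
        if action["type"] == "answer":
            return action["content"]
        tool_response = env(action["content"])  # preset environment
        # Only the distilled report enters the next round's workspace.
        report = f"{report} {tool_response}".strip()
    return None

def toy_agent(ws):
    if "Paris" in ws["report"]:
        return {"type": "answer", "content": "Paris"}
    return {"type": "tool", "content": "search('capital of France')"}

result = iter_research(toy_agent, lambda q: "Paris is the capital of France",
                       "What is the capital of France?")
```

The contrast with ReAct is the workspace reset: the agent reasons over a bounded, curated report each round instead of its full raw trajectory.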
Final Results and Takeaways
How well does Tongyi DeepResearch perform? The following are the official benchmark results:
The Tongyi DeepResearch model outperforms OpenAI Deep Research by 6.3% on Humanity's Last Exam, a multimodal benchmark comprising frontier knowledge from mathematics, physics, the social sciences, and other fields. However, it lags on BrowseComp, whose tasks "require persistently navigating the internet in search of hard-to-find, entangled information". This might be because the Tongyi model has a small 30B base and only supports a 128K context length.
But this shows promise. With a small-to-medium-sized base, the Tongyi DeepResearch model demonstrates great potential for scaling with synthetic trajectory data. It may signal an entirely new era of training paradigms built on synthetic data.
References
- Blog post, “Tongyi DeepResearch: A New Era of Open-Source AI Researchers”, https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
- Blog post, “Kimi K2: Open Agentic Intelligence”, https://moonshotai.github.io/Kimi-K2/
- Model card, “DeepSeek-V3.1-Terminus”, https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus
- Tao Z, Wu J, Yin W, Zhang J, Li B, Shen H, Li K, Zhang L, Wang X, Jiang Y, Xie P. WebShaper: Agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061. 2025 Jul 20.
- Li K, Zhang Z, Yin H, Ye R, Zhao Y, Zhang L, Ou L, Zhang D, Wu X, Wu J, Wang X. WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning. arXiv preprint arXiv:2509.13305. 2025 Sep 16.
- Su L, Zhang Z, Li G, Chen Z, Wang C, Song M, Wang X, Li K, Wu J, Chen X, Qiao Z. Scaling Agents via Continual Pre-training. arXiv preprint arXiv:2509.13310. 2025 Sep 16.
- Li K, Zhang Z, Yin H, Zhang L, Ou L, Wu J, Yin W, Li B, Tao Z, Wang X, Shen W. WebSailor: Navigating Super-human Reasoning for Web Agent. arXiv preprint arXiv:2507.02592. 2025 Jul 3.
- Shao Z, Wang P, Zhu Q, Xu R, Song J, Bi X, Zhang H, Zhang M, Li YK, Wu Y, Guo D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. 2024 Feb 5.
- Phan L, Gatti A, Han Z, Li N, Hu J, Zhang H, Zhang CB, Shaaban M, Ling J, Shi S, Choi M. Humanity’s last exam. arXiv preprint arXiv:2501.14249. 2025 Jan 24.
- Wei J, Sun Z, Papay S, McKinney S, Han J, Fulford I, Chung HW, Passos AT, Fedus W, Glaese A. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516. 2025 Apr 16.