

ReST meets ReAct: improving ReAct with Self-Critique, AI Feedback, and Synthetic Data Generation.

Last Updated on December 30, 2023 by Editorial Team

Author(s): Eduardo Muñoz

Originally published on Towards AI.

A brief description of this adaptation of Reinforced Self-Training (ReST) to an agentic configuration.

Picture by Aaron Burden from Unsplash

This article describes an interesting and inspiring proposal for improving a ReAct agent, which combines reasoning and acting with access to external knowledge. Besides reporting promising results, the proposal presents workflows that can be reused in many other approaches, so I find it well worth reading and understanding.


In December 2023, Google researchers published the paper “ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent” [1]. The paper describes a search-and-answer procedure for a multi-step reasoning LLM agent that uses web search to generate long-form answers to knowledge-seeking questions. It focuses on improving the agent’s performance and robustness through self-critique, AI feedback, and synthetic data generation.

It describes a ReST-like (Reinforced Self-Training) [2] algorithm used for iterative fine-tuning on the agent’s reasoning traces. The contributions of the paper include building a ReAct agent with self-critique, defining a proxy evaluation metric, demonstrating the effectiveness of ReST-style iterative fine-tuning, and using synthetic data to distill the agent into smaller models.

Search Agent

The research paper discusses a specialized agent called Search Agent, which is a variant of the ReAct agent introduced by Yao et al. in 2022 [4]. This agent incorporates Reflexion, a concept presented by Shinn et al. in 2023 [5]. Reflexion is an approach to reinforcement learning for language agents that relies on linguistic feedback and on reflective text kept in an episodic memory buffer. Its flexibility and its performance improvements across various tasks suggest that it is an effective and versatile tool for training large language models in goal-driven scenarios.

The primary function of the Search Agent is to address knowledge-seeking open-ended questions by leveraging web search as a tool to generate comprehensive and explicitly attributable answers.

The workflow of the Search Agent is outlined as follows:

  1. Given a question, the agent initiates a search loop using a search tool, summarizes the pieces of text, and determines if additional information is required.
  2. Utilizing the information gathered during the search loop, the agent formulates the initial attempt or draft of the answer.
  3. The agent performs two rounds of self-revision before producing the final answer: it verifies that the answer is relevant to the question, and it ensures that the answer is grounded in the snippets from the search process.

The Search Agent employs a systematic approach involving iterative search loops and self-revisions to generate detailed answers for diverse open-ended questions. This methodology allows the agent to refine and validate its responses, contributing to the production of accurate and well-founded information.
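The three workflow steps above can be sketched as a short Python function. This is only an illustration under stated assumptions: `run_search_agent`, the step labels such as `QUERY` and `CHECK_GROUNDING`, the `max_searches` cap, and the scripted stand-ins for the model and the search tool are all hypothetical and do not come from the paper, which uses prompted PaLM 2 models and a real web search tool.

```python
from typing import Callable, List


def run_search_agent(
    question: str,
    llm: Callable[[str], str],
    search: Callable[[str], List[str]],
    max_searches: int = 5,
) -> str:
    """Run a search loop, draft an answer, then apply two self-revisions."""
    summaries: List[str] = []

    # Step 1: search loop. Query, summarize, decide whether to continue.
    for _ in range(max_searches):
        query = llm(f"QUERY\nquestion: {question}\nnotes: {summaries}")
        snippets = search(query)
        summaries.append(
            llm(f"SUMMARIZE\nquestion: {question}\nsnippets: {snippets}")
        )
        more = llm(f"MORE_INFO\nquestion: {question}\nnotes: {summaries}")
        if more.strip().lower() != "yes":
            break

    # Step 2: first draft of the answer from the collected notes.
    answer = llm(f"DRAFT\nquestion: {question}\nnotes: {summaries}")

    # Step 3: two self-revision rounds (relevance, then grounding).
    answer = llm(f"CHECK_RELEVANCE\nquestion: {question}\nanswer: {answer}")
    answer = llm(f"CHECK_GROUNDING\nnotes: {summaries}\nanswer: {answer}")
    return answer


# Scripted stand-ins so the sketch runs without a real model or search API.
SCRIPT = {
    "QUERY": "who wrote Dune",
    "SUMMARIZE": "Dune was written by Frank Herbert.",
    "MORE_INFO": "no",
    "DRAFT": "Frank Herbert",
    "CHECK_RELEVANCE": "Dune was written by Frank Herbert.",
    "CHECK_GROUNDING": "Dune was written by Frank Herbert [1].",
}


def scripted_llm(prompt: str) -> str:
    # The first prompt line names the step; reply with the canned text.
    return SCRIPT[prompt.split("\n", 1)[0]]


def scripted_search(query: str) -> List[str]:
    return ["Dune is a 1965 novel by Frank Herbert."]


final_answer = run_search_agent("Who wrote Dune?", scripted_llm, scripted_search)
print(final_answer)
```

Passing the model and the search tool in as plain callables keeps the control flow visible and makes it easy to swap the scripted stand-ins for real components.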

Implementation and methodology

The iterative self-improvement process is described in the paper:

“• Start with a model capable of performing Search Agent task at a certain level, for example, with prompted PaLM 2-L model. Collect reasoning trajectories from this model based on our set of 2000 initial questions (essentially the “grow” stage of ReST, with the difference that we keep the set of initial questions fixed).

• Convert the trajectories into the fine-tuning mixture. Apply re-ranking with RM during the conversion (this is roughly equivalent to the “improve” stage of ReST, though we only do one iteration of “improve”).

• Fine-tune the new model (of the same size) on this mixture and verify that it’s performing better than the original model (we will discuss how to do it in the following section). Repeat the process, starting with this new, better model.” [1]
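The quoted procedure can be read as a loop of sampling, reranking, fine-tuning, and checking. The sketch below is a hypothetical rendering of that loop, not the authors' code: `self_improve` and its callable parameters (`rerank`, `fine_tune`, `evaluate`), the `samples_per_question` setting, and the stopping rule are assumptions made for illustration. Which checkpoint fine-tuning starts from is deliberately left to the `fine_tune` callable.

```python
from typing import Callable, List, Tuple

Trajectory = str
Model = Callable[[str], Trajectory]  # maps a question to one sampled trajectory
Mixture = List[Tuple[str, Trajectory]]


def self_improve(
    model: Model,
    questions: List[str],
    samples_per_question: int,
    rerank: Callable[[str, List[Trajectory]], List[Trajectory]],
    fine_tune: Callable[[Mixture], Model],
    evaluate: Callable[[Model], float],
    iterations: int = 2,
) -> Tuple[Model, float]:
    """Repeat grow -> improve -> fine-tune while the evaluation score rises."""
    best_score = evaluate(model)
    for _ in range(iterations):
        mixture: Mixture = []
        for question in questions:  # the question set stays fixed
            # "Grow": sample several trajectories from the current model.
            candidates = [model(question) for _ in range(samples_per_question)]
            # "Improve": keep the trajectory the reward model ranks first.
            mixture.append((question, rerank(question, candidates)[0]))

        candidate_model = fine_tune(mixture)
        score = evaluate(candidate_model)
        if score <= best_score:
            break  # the new model is not better, so stop iterating
        model, best_score = candidate_model, score
    return model, best_score
```

In practice `evaluate` would be something like the automatic evaluation the paper describes later, and `rerank` would call the reward model.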

For the reranking reward model (RM), the authors use an instruction-tuned PaLM 2-L with a specially designed prompt that receives the model input, multiple sampled outputs, and guidance on how to rank them. The highest-ranked samples are used for fine-tuning instead of the default sample chosen based on its perplexity value. This approach differs from ReST and aligns more closely with RAFT (Reward rAnked FineTuning) [3], emphasizing the importance of reward model rankings in the selection process, particularly for off-policy trajectory rollouts.
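A minimal sketch of the selection step follows, assuming the reward model is exposed as a function that returns sample indices ordered from best to worst. The names `select_for_fine_tuning`, `rank_with_rm`, and `top_k` are hypothetical, and the toy ranker that counts `[source]` markers only stands in for the prompted PaLM 2-L ranker used in the paper.

```python
from typing import Callable, List


def select_for_fine_tuning(
    model_input: str,
    samples: List[str],
    rank_with_rm: Callable[[str, List[str]], List[int]],
    top_k: int = 1,
) -> List[str]:
    """Return the top_k samples according to the reward model's ranking."""
    if not samples:
        raise ValueError("at least one sample is required")
    order = rank_with_rm(model_input, samples)  # indices, best first
    if sorted(order) != list(range(len(samples))):
        raise ValueError("ranking must be a permutation of the sample indices")
    return [samples[i] for i in order[:top_k]]


def toy_ranker(model_input: str, samples: List[str]) -> List[int]:
    # Stand-in for the reward model: prefer samples with more citations.
    return sorted(
        range(len(samples)),
        key=lambda i: samples[i].count("[source]"),
        reverse=True,
    )


samples = [
    "Paris.",
    "Paris [source].",
    "Paris, according to two pages [source] [source].",
]
chosen = select_for_fine_tuning("What is the capital of France?", samples, toy_ranker)
print(chosen)
```

Because the ranking, not a perplexity score, decides which sample is kept, the quality of the fine-tuning data depends directly on how well the ranking prompt reflects what a good trajectory looks like.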

Picture by Brett Jordan from Unsplash

Ablation studies explore the impact of human filtering and the use of multiple trajectories per question in the fine-tuning process. Surprisingly, fine-tuning on filtered data results in a small performance drop, hypothesized to be due to reduced data size and the preservation of “bad” steps in other examples. Using two trajectories per question in fine-tuning shows a performance gain, but further increases do not significantly improve results.

The self-critique aspect of the multi-step setup is examined, showing a small but measurable positive boost in overall agent performance, particularly during the “Answer Generation” step.


The research focuses on evaluating the performance of the Search Agent using the Bamboogle dataset (Press et al., 2023), a semi-adversarial set of 2-hop questions deliberately designed to be unanswerable through direct Google search, but with answers available in Wikipedia. The improvement in the Search Agent’s performance on Bamboogle indicates its enhanced ability to effectively use web search as a tool.

To address the challenges associated with human evaluations, the paper introduces an LLM-based auto-eval approach. This auto-eval method is shown to be highly correlated with human evaluation scores, with a Pearson correlation of 0.98 and a Spearman correlation of 0.83. The authors use Bamboogle auto-eval to estimate the final model performance and answer various questions related to model optimization, such as sampling temperature selection, checkpoint choices for different model sizes, the impact of multiple trajectories on fine-tuning, and the effectiveness of self-checks.
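To make the two correlation figures concrete, here is a small sketch of how Pearson and Spearman correlations between human scores and auto-eval scores could be computed. Pearson measures linear agreement between the raw scores, while Spearman applies the same formula to their ranks, which is why the two values can differ. The score lists are made-up illustration data, not results from the paper, and the function names are my own.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with 2+ items")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores, for illustration only.
human_scores = [0.42, 0.55, 0.61, 0.70, 0.74]
auto_eval_scores = [0.40, 0.58, 0.57, 0.72, 0.75]
print(pearson(human_scores, auto_eval_scores))
print(spearman(human_scores, auto_eval_scores))
```

With only a handful of models to compare, a single swapped pair of ranks lowers Spearman noticeably while barely moving Pearson, which is one way a high Pearson value and a lower Spearman value can coexist.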

The research also addresses the trade-off between data quantity and quality: it finds that the quality of the data matters more than its quantity, and it emphasizes that better data reduces the variance across evaluation trajectories.

To mitigate the risk of overfitting and address shortcomings found during human evaluations, a new dataset called BamTwoogle is introduced. This dataset, serving as a test set, is a slightly more challenging sequel to Bamboogle, requiring 2+ steps to answer each question. BamTwoogle is handcrafted and includes 100 information-seeking questions, ensuring they require multiple searches or reasoning steps. The questions are checked to ensure that answers do not appear on the first page of Google search results. The expected answers should be unambiguous, not prone to change over time, and preferentially sourced from Wikipedia.

In summary, the paper details the evaluation strategy using Bamboogle and introduces BamTwoogle as a complementary test set to assess the final performance of the Search Agent, addressing challenges related to human evaluations and overfitting.


There are many details in this study that I think can inspire future research in the field of AI agents, and parts of the workflow could even be applied directly in some agent implementations. I hope you find it as interesting as I do.

Finally, I will close this article with the conclusion expressed in the study by its authors:

“This work demonstrates that the ReST-like approach with AI feedback could be effectively applied to a multi-step reasoning LLM agent. We show that it is a relatively simple and efficient way to iteratively build high-quality synthetic data for agent self-improvement. Moreover, this increasingly higher quality data could simultaneously be used for distilling a multi-step agent into several magnitudes smaller models while preserving most of the performance from the large teacher model.” [1]


[1] Paper: “ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent” by

[2] Paper: “Reinforced Self-Training (ReST) for Language Modeling”

[3] Paper: “RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment”

[4] Paper: “ReAct: Synergizing Reasoning and Acting in Language Models”

[5] Paper: “Reflexion: Language Agents with Verbal Reinforcement Learning”


Published via Towards AI
