Inside COSP and USP: Google Research's New Methods to Advance Reasoning in LLMs
Last Updated on November 9, 2023 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 160,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in machine learning, artificial intelligence, and data…
thesequence.substack.com
The evolution of prompt generation is one of the key building blocks of LLM-based applications. Tasks such as reasoning or fine-tuning depend heavily on having strong prompt datasets. Techniques such as few-shot prompting have significantly reduced the amount of task-specific data needed to adapt models to new tasks. Nevertheless, challenges persist when it comes to crafting sample prompts, especially in scenarios where a broad array of tasks is covered by general-purpose models. Even generating a modest number of demonstrations can be a formidable task. This is particularly true for tasks such as summarizing lengthy articles or addressing inquiries that demand specialized domain knowledge, like medical question answering.
In such situations, models endowed with robust zero-shot performance come to the rescue, eliminating the need for manual prompt generation. However, it's worth noting that zero-shot performance tends to be less potent since the language model operates without specific guidance, leaving room for occasional erroneous outputs.
Recently, Google Research introduced two techniques that advance zero-shot adaptive prompting in LLMs. The first method is known as "Consistency-Based Self-Adaptive Prompting (COSP)," outlined in a recent ACL 2023 research paper. COSP addresses the predicament of generating suitable prompts by leveraging unlabeled samples and the model's own predictions, thereby bridging the performance gap between zero-shot and few-shot while preserving the advantages of zero-shot prompting.
In a parallel development, "Universal Self-Adaptive Prompting (USP)," as presented in the forthcoming EMNLP 2023 paper, extends the concept to a wide array of natural language understanding and generation tasks, showcasing its effectiveness across various domains.
COSP and USP in Detail
The core idea behind both COSP and USP is to utilize the model's zero-shot outputs as demonstrations for prompting itself. The challenge lies in selecting reliable self-generated demonstrations, as erroneous demonstrations can be detrimental. To navigate this challenge, COSP capitalizes on the observation that confident and consistent model predictions are more likely to be correct. This confidence measurement is based solely on the model's predictions and doesn't require labeled data. The high-confidence predictions and their corresponding inputs are treated as pseudo-demonstrations.
Building on this foundation, the model's confidence in its output is estimated through self-consistency assessment, serving as a gauge of correctness. To generate a range of possible rationales and answers, the model is queried multiple times with zero-shot chain-of-thought prompting, with the level of randomness controlled by a "temperature" hyperparameter. The entropy of the answers is then computed to quantify uncertainty. Answers with high self-consistency and greater model certainty are deemed reliable and selected.
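As a rough illustration of this confidence signal, the sketch below computes the entropy of the empirical distribution over several sampled answers; the LLM sampling step that would produce those answers is assumed and not shown here.

```python
# Minimal sketch of the consistency/entropy signal described above.
# The `sampled_answers` list is assumed to come from querying the LLM
# several times at a non-zero temperature with a zero-shot CoT prompt.
from collections import Counter
import math

def answer_entropy(sampled_answers: list[str]) -> float:
    """Entropy of the empirical answer distribution: lower entropy means the
    model answers more consistently, which is treated as higher confidence."""
    counts = Counter(sampled_answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# 4 of 5 samples agree -> low entropy -> high confidence.
print(answer_entropy(["42", "42", "42", "42", "17"]))   # ~0.50
# Every sample disagrees -> maximal entropy -> low confidence.
print(answer_entropy(["a", "b", "c", "d", "e"]))        # ~1.61
```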
In summary, COSP and USP follow a similar four-step methodology, sketched in code after the list:
· Input unlabeled questions into the model to obtain multiple rationales and answers.
· Highlight the most frequent answers and measure their consistency across multiple model outputs.
· Penalize repetition and promote diversity in the selected demonstrations.
· Concatenate the pseudo-demonstrations with the test questions and query the model again for the final predicted answer.
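The four steps above map onto a short selection loop. Below is a minimal sketch under stated assumptions: `sample_model` is a hypothetical stand-in for the actual LLM sampling call, and the repetition/diversity penalty is simplified to skipping duplicate questions; the papers use more careful scoring.

```python
# Sketch of the self-adaptive prompting loop described in the list above.
from collections import Counter

def sample_model(question: str, k: int = 5) -> list[tuple[str, str]]:
    """Hypothetical placeholder: query an LLM k times at temperature > 0 with a
    zero-shot chain-of-thought prompt and return (rationale, answer) pairs."""
    raise NotImplementedError  # wire up your own LLM client here

def build_pseudo_demos(unlabeled_questions: list[str], k: int = 5, n_demos: int = 3) -> list[str]:
    scored = []
    for q in unlabeled_questions:
        samples = sample_model(q, k)                      # step 1: multiple rationales/answers
        answers = [ans for _, ans in samples]
        top_answer, freq = Counter(answers).most_common(1)[0]
        consistency = freq / k                            # step 2: consistency of the top answer
        rationale = next(r for r, ans in samples if ans == top_answer)
        scored.append((consistency, q, rationale, top_answer))

    demos, seen_questions = [], set()
    for consistency, q, rationale, answer in sorted(scored, reverse=True):
        if q in seen_questions:                           # step 3: crude repetition filter
            continue
        seen_questions.add(q)
        demos.append(f"Q: {q}\nA: {rationale} The answer is {answer}.")
        if len(demos) == n_demos:
            break
    return demos

def build_final_prompt(test_question: str, demos: list[str]) -> str:
    # step 4: prepend the pseudo-demonstrations to the test question
    return "\n\n".join(demos + [f"Q: {test_question}\nA:"])
```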
While COSP primarily focuses on question-answering tasks with clear correct answers, USP generalizes the approach to other NLP tasks, including classification, short-form generation, and long-form generation, adapting the confidence measurement techniques accordingly. Under USP, Google Research distinguishes three categories of tasks (two of the corresponding scoring rules are sketched in code after the list):
· Classification (CLS): In this category, problems involve determining the probability of each class based on the neural network's output logits. Google Research gauges uncertainty without the need for multiple sampling by computing the entropy of the distribution derived from the logits.
· Short-form generation (SFG): Problems akin to question answering benefit from a procedure similar to the one used in COSP, omitting the rationale-generation step when it is not needed.
· Long-form generation (LFG): Tasks such as summarization and translation often involve open-ended questions with non-identical outputs, even when the model is confident. In these instances, Google Research resorts to an overlap metric, computing the average pairwise ROUGE score between distinct outputs for the same query.
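To make these confidence measures more concrete, here is a minimal sketch of the scoring for the classification and long-form cases. It is an illustration under assumptions: a simple unigram-overlap F1 stands in for the ROUGE metric used in the paper, and the model calls that would produce the logits and sampled outputs are not shown.

```python
# Sketch of the two USP confidence measures described above.
import math
from collections import Counter
from itertools import combinations

def cls_confidence(logits: list[float]) -> float:
    """Classification: entropy of the softmax over class logits, computed from a
    single forward pass (lower entropy = more confident)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def unigram_f1(a: str, b: str) -> float:
    """Crude token-overlap F1, standing in for ROUGE between two generations."""
    ca, cb = Counter(a.split()), Counter(b.split())
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(cb.values()), overlap / sum(ca.values())
    return 2 * precision * recall / (precision + recall)

def lfg_confidence(outputs: list[str]) -> float:
    """Long-form generation: average pairwise overlap between sampled outputs
    for the same query; higher overlap is read as higher model confidence."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(unigram_f1(a, b) for a, b in pairs) / len(pairs)
```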
These innovative approaches represent a significant step forward in the field of AI prompting, enabling models to effectively prompt themselves and enhance their performance across a wide range of natural language tasks.
The Results
Google Research evaluates COSP and USP across different benchmarks. In the case of Consistency-Based Self-Adaptive Prompting (COSP), they initially concentrate on a set of six arithmetic and commonsense reasoning problems, benchmarking COSP against the zero-shot chain-of-thought (0-shot-CoT) approach and using self-consistency across all baselines to ensure a fair comparison of computational resources. Across three different large language models (LLMs), the results show that zero-shot COSP outperforms the standard zero-shot baseline.
With Universal Self-Adaptive Prompting (USP), Google Research takes a more expansive approach, broadening the scope of analysis to encompass more than 25 classification, short-form generation, and long-form generation tasks. Furthermore, they employ state-of-the-art PaLM 2 models to tackle the formidable BIG-Bench Hard suite of tasks, a domain where LLMs have previously struggled in comparison to human performance. Consistent with the COSP findings, Google Research demonstrates that USP consistently outperforms the baseline methods and remains competitive with prompting that uses golden examples.
Google Research's commitment to understanding the mechanics of USP is evident through their investigation into the relationship between confidence and correctness. Their findings substantiate the key observation that USP predominantly selects confident predictions, which tend to yield superior results across all types of tasks considered, as depicted in the accompanying figure. This reinforces the efficacy of USP in enhancing the performance of language models across diverse natural language understanding and generation tasks.
Both COSP and USP explore important areas of prompt generation that can enhance commonsense reasoning in LLMs.
Published via Towards AI