Last Updated on April 1, 2023 by Editorial Team
Author(s): Clemens Jarnach ⚡️
Originally published on Towards AI.
How Multimodal Chain-of-Thought Reasoning Can Improve Large Language Models and your ChatGPT prompting too
The Generative Pre-trained Transformer (GPT) family of models, which currently powers ChatGPT, has become one of the most dominant families of language models available, known for its accuracy and fluency in natural language generation. But there is always a way to do things better.
OpenAI has pushed the boundaries of natural language processing and conversational AI by introducing ChatGPT, a chatbot built on the GPT-3.5 series of models that generates human-like responses and engages in fluid dialogue with users. With the recent release of GPT-4, an even more capable model, everyone is on the edge of their seats. Despite the success of GPT models, recent research has identified room for improvement.
Large Language Models (LLMs) have recently made impressive progress on complex reasoning tasks. One particularly encouraging method is Chain-of-Thought (CoT) prompting, in which the model generates intermediate reasoning chains before deducing an answer. Whereas previous studies have focused on the language modality only, a new research paper by Zhang et al. (2023) introduces a novel approach called Multimodal-CoT, which incorporates both text and image modalities into a two-stage framework. This new approach shows promising performance improvements over GPT-3.5, the previous state-of-the-art LLM on the benchmark the authors use. Will GPT-4 solve such problems, and how can you use this prompt technique?
In this blog post, I will provide an overview of Zhang et al.’s (2023) research and explore how Multimodal-CoT can improve reasoning and answer inference in large language models.
What is CoT?
Chain-of-Thought (CoT) prompting is a prompting technique used in natural language processing (NLP) in which a language model works through intermediate reasoning steps before it gives its final answer. In CoT prompting, a language model (e.g., a neural network-based model such as GPT-3) is presented with a prompt to perform a given task (e.g., answer a question). The prompt typically also contains a few worked examples, each with a rationale (a written-out chain of reasons) that leads to the correct answer. The model can then follow the pattern of these rationales and reason step by step toward a more accurate and contextually appropriate response. CoT prompting has been shown to improve the performance of language models on a variety of reasoning tasks.
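To make this more concrete, here is a minimal sketch of how a few-shot CoT prompt could be assembled as plain text. The `Exemplar` class, the `build_cot_prompt` function, and the "Q:/A:" layout are my own illustrative assumptions, not something prescribed by Zhang et al. (2023) or by any particular API; the resulting string could be pasted into ChatGPT or sent to whichever model you use.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Exemplar:
    """One worked example: a question, a written-out rationale, and the answer.

    Hypothetical helper type used only for this illustration.
    """
    question: str
    rationale: str
    answer: str


def build_cot_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Assemble a few-shot CoT prompt (assumed 'Q:/A:' layout)."""
    parts = []
    for ex in exemplars:
        # Each exemplar shows the reasoning first and the answer last.
        parts.append(
            f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        )
    # The new question ends with an open "A:" so the model writes its own
    # rationale before committing to an answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


exemplars = [
    Exemplar(
        question="A shelf holds 3 boxes with 4 books each. How many books are there?",
        rationale="There are 3 boxes and each holds 4 books, so 3 * 4 = 12.",
        answer="12",
    )
]
prompt = build_cot_prompt(
    exemplars,
    "A bag has 5 red and 7 blue marbles. How many marbles are there?",
)
print(prompt)
```

The design choice here is simply that the worked example demonstrates the reasoning pattern, while the final question is left open for the model to complete in the same style.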
Chain-of-Thought prompting is like playing a game of “Twenty Questions”. You think of an object, and your friend has to ask you questions to figure out what it is. When your friend asks a question, you give them a reason why the answer to the question might help them guess the object. For example, if your friend asks “Is it a vegetable?”, you might say “Yes, because it grows in the ground.” Your friend can then use this reason to make a better guess. CoT prompting works in a similar way. A computer program is given a prompt to perform a task, like answering a question. It is then given a set of reasons why a certain answer might be correct. The program can use these reasons to come up with a better answer. This technique has been shown to improve the performance of computer programs that process language in a variety of tasks. Theoretically, a pretty straightforward and convincing model, wouldn’t you agree? Give it a try next time you use ChatGPT and see if the answers improve in accuracy and desirability.
On the more practical side, CoT prompting can be computationally expensive, especially when dealing with large datasets and complex reasoning tasks. The process involves generating and refining chains of reasoning, which can require significant computing resources. Additionally, creating the set of rationales to present to the language model may require manual annotation, which can be time-consuming and costly. However, the benefits of using CoT prompting, such as improved performance in language tasks, may outweigh the computational cost for certain applications. Moreover, advancements in hardware and software technologies, as well as the development of more efficient algorithms, may help to reduce the computational cost of CoT prompting in the future.
Multimodal-CoT: A Two-Step Framework
The Multimodal-CoT framework, as proposed by Zhang et al. (2023), consists of two distinct stages. The first stage is rationale generation, in which the model is fed both language (text) and vision (image) inputs and produces a rationale. The second stage is answer inference: the original language input is combined with the rationale generated in the first stage, and this updated language input, along with the original vision input, is then fed into the model to infer the answer (Zhang et al., 2023).
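The data flow of the two stages can be sketched in a few lines of Python. Everything named here is a hypothetical stand-in: `Model` is an assumed interface (text plus image features in, text out), the two toy models only imitate a trained network, and the "Rationale:" separator is my own choice. It is a sketch of the idea described above, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Assumed interface: a model maps (language input, vision features) to text.
Model = Callable[[str, List[float]], str]


def multimodal_cot(
    question: str,
    image_features: List[float],
    rationale_model: Model,
    answer_model: Model,
) -> Tuple[str, str]:
    """Run the two stages in sequence and return (rationale, answer)."""
    # Stage 1: generate a rationale from the original language + vision inputs.
    rationale = rationale_model(question, image_features)
    # Stage 2: append the rationale to the original language input and
    # reuse the same vision input to infer the answer.
    augmented_input = f"{question}\nRationale: {rationale}"
    answer = answer_model(augmented_input, image_features)
    return rationale, answer


# Toy stand-ins so the sketch runs end to end; a real system would call
# trained models here.
def toy_rationale_model(text: str, image_features: List[float]) -> str:
    return "The picture shows two like poles facing each other, so they repel."


def toy_answer_model(text: str, image_features: List[float]) -> str:
    # Picks option (B) only when the rationale mentions repulsion.
    return "(B) repel" if "repel" in text else "(A) attract"


rationale, answer = multimodal_cot(
    "Will these magnets attract or repel? Options: (A) attract (B) repel",
    [0.1, 0.7, 0.2],  # placeholder image features
    toy_rationale_model,
    toy_answer_model,
)
print(rationale)
print(answer)
```

Note that the answer model never sees the question alone: its language input always carries the stage-one rationale, which is the point of separating the two steps.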
By combining language and visual input in this way, Multimodal-CoT allows for a more comprehensive understanding of the task at hand, resulting in more accurate and informative responses. Zhang et al. (2023) believe that this framework has the potential to revolutionize the field of multimodal learning, and I am excited to see where it goes from here.
Zhang et al. (2023) evaluated their method on the “ScienceQA” benchmark dataset (Lu et al., 2022), a large-scale multimodal science question dataset that annotates answers with detailed lectures and explanations. The dataset contains 21k multimodal multiple-choice questions with rich diversity across subjects, topics, categories, and skills. Zhang et al.’s (2023) most sophisticated model (i.e., Multimodal-CoT_Large) outperformed the previous state-of-the-art LLM, GPT-3.5, by an average of 17 percentage points, achieving an accuracy rate of 92%, and even surpassed human performance on the “ScienceQA” benchmark. Overall, the results support the effectiveness of multimodality and the potential for high accuracy through the use of CoT and a two-step framework as proposed by Zhang et al. (2023).
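For readers less familiar with how such benchmark numbers are compared, the short sketch below computes accuracy on multiple-choice answers and the gap between two systems in percentage points. The answer lists are made-up toy data, not ScienceQA results, and the function names are my own.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions where the predicted option matches the gold option."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equally long")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def gap_in_percentage_points(acc_a: float, acc_b: float) -> float:
    """Absolute difference between two accuracies, expressed in percentage points."""
    return (acc_a - acc_b) * 100


# Toy data for illustration only (10 questions, options A-D).
gold = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
baseline = ["A", "C", "B", "D", "A", "B", "C", "A", "B", "C"]  # 7 of 10 correct
improved = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "C"]  # 9 of 10 correct

acc_baseline = accuracy(baseline, gold)
acc_improved = accuracy(improved, gold)
gap = gap_in_percentage_points(acc_improved, acc_baseline)
print(f"baseline {acc_baseline:.0%}, improved {acc_improved:.0%}, gap {gap:.1f} points")
```

A gap in percentage points is an absolute difference between two accuracies, which is not the same thing as a relative improvement in percent.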
In summary, Chain-of-Thought (CoT) prompting is a technique that can improve the reasoning and accuracy of large language models (LLMs) by having them work through rationales, that is, intermediate reasoning steps, on the way to an answer. The method has proven effective in improving the performance of language models on various tasks. In addition, research by Zhang et al. (2023) presents an exciting new approach to Chain-of-Thought reasoning by incorporating both language and vision modalities into a two-step framework that separates rationale generation from answer inference. Their proposed method achieves high accuracy, outperforming existing LLMs and surpassing human performance on the benchmark. This type of research opens up promising avenues for future studies and AI applications. The concept behind CoT prompting can also help users of LLMs, such as ChatGPT, to improve their prompting and achieve better, more accurate results. I believe we will see much more of this approach in the near future, and iterations of this model will push the already very impressive capabilities of current large language models even further.
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., & Kalyan, A. (2022). Learn to explain: Multimodal reasoning via thought chains for science question answering. arXiv preprint, arXiv:2209.09513v2.
Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., & Smola, A. (2023). Multimodal chain-of-thought reasoning in language models. arXiv preprint, arXiv:2302.00923v4.