Q* and LVM: LLM’s AGI Evolution
Last Updated on December 11, 2023 by Editorial Team
Author(s): Luhui Hu
Originally published on Towards AI.
The realm of artificial intelligence has witnessed a revolutionary surge with the advent of Large Language Models (LLMs) like ChatGPT. These models have dramatically transformed our interaction with AI, offering conversational abilities that feel almost human. However, despite their success, LLMs have notable gaps in two critical areas: vision AI and logical/mathematical reasoning. Two groundbreaking innovations address these gaps: OpenAI’s mysterious Q* project and the pioneering Large Vision Models (LVMs) introduced by researchers at UC Berkeley (UCB) and Johns Hopkins University (JHU).
Q*: Bridging the Gap in Logical and Mathematical Reasoning
Q*, a project shrouded in secrecy, has recently surfaced in discussions within the AI community. While details are scarce, information leaked through various sources, including a Wired article and discussions on OpenAI’s community forum, suggests that Q* is OpenAI’s answer to enhancing logical and mathematical reasoning in AI models.
The need for Q* arises from the inherent limitations of current LLMs in processing complex logical constructs and mathematical problems. While LLMs like ChatGPT can simulate reasoning to an extent, they often falter in tasks requiring deep, systematic logical analysis or advanced mathematical computation. Q* aims to fill this gap, potentially leveraging advanced algorithms and novel approaches to imbue AI with the ability to reason and compute at a level beyond the reach of existing models.
LVM: Revolutionizing Vision AI
Parallel to the development of Q* is a breakthrough in vision AI, marked by the introduction of Large Vision Models (LVMs). A recent paper published on arXiv by researchers from UCB and JHU details this advancement. The LVM represents a significant leap in the field of vision AI, addressing the scalability and learning-efficiency challenges that have long hampered this domain.
LVMs are designed to process and interpret visual data at a scale and sophistication not seen before. They leverage sequential modeling, a technique that allows for more efficient training and better generalization from large datasets. This approach enables LVMs to learn from vast amounts of visual data, making them adept at tasks ranging from image recognition to complex scene understanding.
The LVM uses a novel sequential modeling approach that enables learning from visual data without relying on linguistic information. Central to this approach is the concept of “visual sentences,” a format that represents a wide array of visual data, including raw images, videos, and annotated sources like semantic segmentations, as sequential tokens. In total, over 420 billion tokens of visual data are handled as sequences, which the model learns to process by minimizing cross-entropy loss for next-token prediction.
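To make the sequence-modeling idea concrete, here is a minimal, hypothetical sketch in PyTorch of training a small autoregressive transformer on visual sentences represented as discrete token sequences, minimizing cross-entropy loss for next-token prediction. The vocabulary size, sequence length, and architecture are illustrative assumptions, not the paper’s actual configuration.

```python
# Minimal sketch (not the authors' code): an autoregressive model trained on
# "visual sentences" as discrete token sequences with cross-entropy loss for
# next-token prediction. All sizes below are illustrative placeholders.
import torch
import torch.nn as nn

VOCAB_SIZE = 8192    # assumed size of the visual-token codebook
SEQ_LEN = 256        # assumed number of tokens per visual sentence

class TinyVisualLM(nn.Module):
    def __init__(self, vocab=VOCAB_SIZE, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq_len) of discrete visual-token ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (batch, seq_len, vocab) logits

model = TinyVisualLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch standing in for tokenized visual sentences.
batch = torch.randint(0, VOCAB_SIZE, (8, SEQ_LEN))
logits = model(batch[:, :-1])  # predict token t+1 from tokens up to t
loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), batch[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```

The key point is that nothing in this training loop is image-specific: once visual data is expressed as token sequences, the same next-token objective used for language models applies unchanged.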
At the heart of the LVM is a two-stage process for handling visual data. The first stage involves image tokenization using a VQGAN model, which translates each image into a sequence of discrete visual tokens. The VQGAN framework employs a combination of encoding and decoding mechanisms, with a quantization layer that maps encoded image features to discrete tokens from a learned codebook. The second stage involves training an autoregressive transformer model on these visual sentences. This model treats the sequences of visual tokens in a unified manner, without the need for task-specific tokens, allowing the system to infer relationships between images contextually.
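The quantization step of the first stage can be illustrated with a simplified, hypothetical sketch (not the actual VQGAN implementation): continuous encoder features are snapped to their nearest entries in a codebook, and the entry indices become the discrete visual tokens fed to the transformer. Codebook size and feature dimensions below are assumptions for illustration.

```python
# Simplified illustration of vector quantization (not the real VQGAN code):
# each encoder feature vector is replaced by the index of its nearest
# codebook entry, turning an image into a grid of discrete tokens.
import torch
import torch.nn as nn

class ToyVectorQuantizer(nn.Module):
    def __init__(self, num_codes=8192, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # learned codebook

    def forward(self, features):
        # features: (batch, height*width, code_dim) from a convolutional encoder
        codes = self.codebook.weight.unsqueeze(0).expand(features.size(0), -1, -1)
        dists = torch.cdist(features, codes)       # (B, HW, num_codes) L2 distances
        token_ids = dists.argmin(dim=-1)           # index of nearest codebook entry
        quantized = self.codebook(token_ids)       # the corresponding code vectors
        return token_ids, quantized

# Example: a 16x16 feature grid per image becomes 256 discrete visual tokens.
encoder_output = torch.randn(2, 16 * 16, 64)      # stand-in for encoder features
quantizer = ToyVectorQuantizer()
tokens, quantized = quantizer(encoder_output)
print(tokens.shape)                                # torch.Size([2, 256])
```

In the full VQGAN, the encoder, decoder, and codebook are trained jointly (with reconstruction and adversarial losses) so that the decoder can later turn predicted tokens back into pixels; this sketch only shows the nearest-neighbor lookup that produces the tokens.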
For inference and application in various vision tasks, the LVM utilizes a method called visual prompting. Given a partial visual sentence that defines a task, the model generates output by predicting the tokens that complete the sequence. This approach mirrors in-context learning in language models, providing flexibility and adaptability in generating visual outputs for a wide range of applications.
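A hedged sketch of visual prompting, reusing the hypothetical TinyVisualLM and VOCAB_SIZE from the earlier training example: the prompt is a partial visual sentence (for instance, tokens of example input/output image pairs followed by a query image), and the model completes it token by token. Greedy decoding and the per-image token counts are arbitrary simplifications.

```python
# Visual prompting sketch: complete a partial visual sentence autoregressively.
# Assumes `model` (TinyVisualLM) and VOCAB_SIZE from the earlier sketch.
import torch

@torch.no_grad()
def complete_visual_sentence(model, prompt_tokens, num_new_tokens=64):
    # prompt_tokens: (1, prompt_len) discrete visual tokens that define the task
    tokens = prompt_tokens.clone()
    for _ in range(num_new_tokens):
        logits = model(tokens)                     # (1, len, vocab)
        next_token = logits[:, -1].argmax(dim=-1)  # greedy choice of next visual token
        tokens = torch.cat([tokens, next_token.unsqueeze(1)], dim=1)
    # The generated tail would be decoded back to pixels by the VQGAN decoder.
    return tokens[:, prompt_tokens.size(1):]

# Example prompt: tokens of (input image, output image, query image) concatenated,
# assuming 64 tokens per image for illustration.
prompt = torch.randint(0, VOCAB_SIZE, (1, 3 * 64))
predicted_tokens = complete_visual_sentence(model, prompt, num_new_tokens=64)
```

The design mirrors in-context learning in LLMs: the task is specified entirely by the prompt’s example pairs, so no task-specific fine-tuning or special tokens are needed to switch between, say, segmentation and frame prediction.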
The Road to AGI
The development of Q* and LVM marks a crucial step in the journey towards Artificial General Intelligence (AGI). AGI, the holy grail of AI research, refers to a machine’s ability to understand, learn, and apply intelligence across a wide range of tasks, much like a human brain. While LLMs have laid a solid foundation, the integration of specialized capabilities like logical reasoning (Q*) and advanced vision processing (LVM) is essential to move closer to AGI.
These advancements represent not just incremental improvements but a paradigm shift in AI capabilities. With Q* enhancing logical and mathematical reasoning and LVM revolutionizing vision AI, the path to AGI looks more promising than ever. As we anticipate further developments in these projects, the potential for AI to surpass current boundaries and evolve into a truly general intelligence looms on the horizon, heralding a new era in the AI world.
References
- Sequential Modeling Enables Scalable Learning for Large Vision Models: https://arxiv.org/abs/2312.00785
- UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework: https://arxiv.org/abs/2311.10125
- Physically Grounded Vision-Language Models for Robotic Manipulation: https://arxiv.org/abs/2309.02561
- Vector-Quantized Image Modeling with Improved VQGAN: https://blog.research.google/2022/05/vector-quantized-image-modeling-with.html
- A Survey of Large Language Models: https://arxiv.org/abs/2303.18223