Inside Infini-Attention: Google DeepMind’s Technique Powering Gemini’s 2M Token Window
Last Updated on June 4, 2024 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with developments in machine learning, artificial intelligence, and data…
thesequence.substack.com
Google had plenty of generative AI announcements during last week’s I/O conference. Gemini was at the center of those announcements, with a new Flash version that boasts, among other impressive capabilities, a 2M-token context window. This capability streamlines in-context learning scenarios, allowing large amounts of text to be processed in interactions with LLMs. The large context window is the result of several incremental research breakthroughs by Google DeepMind. One of the most relevant is their work on Infini-attention, published just a few weeks ago. The paper outlines an architecture that can, in theory, scale the transformer context window to an arbitrarily large number of tokens.
The Problem
Memory is crucial for intelligence, enabling efficient, context-specific computations. Yet, Transformers and Transformer-based language models have limitations in their context-dependent memory due to their attention mechanisms.
The attention mechanism in Transformers has quadratic complexity in the sequence length, in both memory usage and computation time. For instance, the attention Key-Value (KV) states require 3TB of memory for a 500B model with a batch size of 512 and a context length of 2048. Scaling these models to handle longer sequences, such as 1 million tokens, is both technically challenging and financially costly.
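To get a feel for why the KV cache becomes the bottleneck, the sketch below estimates its size as a function of model shape and context length. The layer count, hidden size, and precision are illustrative assumptions (a real 500B-class model may use multi-query attention and a different precision, so this will not reproduce the exact 3TB figure cited above); the point is the linear blow-up with context length on top of an already huge baseline.

```python
# Back-of-the-envelope KV-cache estimate. The dimensions below are
# illustrative assumptions, not the architecture behind the cited 3TB figure.

def kv_cache_bytes(n_layers, d_model, batch_size, context_len, bytes_per_value=2):
    # Each token stores one key and one value vector of size d_model per layer.
    return 2 * n_layers * d_model * batch_size * context_len * bytes_per_value

# Hypothetical dense ~500B-parameter model with full multi-head attention.
cache_2k = kv_cache_bytes(n_layers=96, d_model=20480, batch_size=512, context_len=2048)
cache_1m = kv_cache_bytes(n_layers=96, d_model=20480, batch_size=512, context_len=1_000_000)

print(f"KV cache at 2K context: {cache_2k / 1e12:.1f} TB")
print(f"KV cache at 1M context: {cache_1m / 1e15:.1f} PB")
```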
Compressive memory systems offer a more scalable and efficient alternative to the attention mechanism for extremely long sequences. Unlike the KV cache, which grows with input length, compressive memory uses a fixed number of parameters to store and recall information, keeping storage and computation costs bounded. New information is integrated by adjusting the memory parameters so that it can be retrieved later.
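A toy sketch of that idea (not the paper’s exact update rule): key-value pairs are folded into a single fixed-size matrix via outer products, so storage stays constant no matter how many tokens are written, and recall is just a matrix product with a query.

```python
import numpy as np

d_key, d_value = 64, 64
memory = np.zeros((d_key, d_value))   # fixed-size parameter matrix

rng = np.random.default_rng(0)
for _ in range(10_000):               # write 10,000 key-value bindings
    k = rng.standard_normal(d_key)
    v = rng.standard_normal(d_value)
    memory += np.outer(k, v)          # "adjust parameters" to absorb new information

print(memory.shape)                   # (64, 64) -- independent of how much was written

q = rng.standard_normal(d_key)
recalled = q @ memory                 # recall by query, shape (d_value,)
```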
Infini Attention
Google DeepMind introduces a novel approach that enables Transformer language models to process infinitely long inputs with a limited memory footprint and computation. The core of this approach is a new attention technique called Infini-attention. Infini-attention incorporates compressive memory into the standard attention mechanism, combining masked local attention and long-term linear attention in a single Transformer block.
This modification allows existing language models to extend to infinitely long contexts through continual pre-training and fine-tuning. Infini-attention reuses key, value, and query states from standard attention for long-term memory storage and retrieval. Instead of discarding old KV states, they are stored in the compressive memory and retrieved using attention query states when processing future sequences. The final contextual output combines long-term memory-retrieved values with local attention contexts. Infini-attention computes both local and global context states, similar to multi-head attention (MHA), and maintains multiple parallel compressive memories per attention layer.
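The combination step can be sketched in a few lines. The paper describes a learned gating scalar β per head that blends the memory-retrieved values with the local attention output; the shapes and values below are placeholders, not a full implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

seg_len, d_head = 8, 16
A_dot = np.random.randn(seg_len, d_head)  # output of masked local (dot-product) attention
A_mem = np.random.randn(seg_len, d_head)  # values retrieved from the compressive memory
beta = 0.0                                # learned gating scalar (one per head)

gate = sigmoid(beta)
A = gate * A_mem + (1.0 - gate) * A_dot   # final contextual output for the segment
```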
The attention mechanism is based on two fundamental components:
1) Scaled Dot-product Attention
Multi-head scaled dot-product attention, and self-attention in particular, has been the main building block of language models. MHA’s strong capability for modeling context-dependent dynamic computation, together with the convenience of temporal masking, has been leveraged extensively in autoregressive generative models.
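For reference, here is a minimal single-head version of masked (causal) scaled dot-product attention. This is standard attention rather than anything specific to Infini-attention, and the toy dimensions are arbitrary.

```python
import numpy as np

def causal_scaled_dot_product_attention(Q, K, V):
    """Masked (causal) scaled dot-product attention for a single head.

    Q, K, V: arrays of shape (seq_len, d_head).
    """
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)              # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)        # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (seq_len, d_head)

x = np.random.randn(8, 16)
out = causal_scaled_dot_product_attention(x, x, x)  # self-attention on a toy sequence
```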
2) Compressive Memory
Infini-attention reuses query, key, and value states from dot-product attention to create new memory entries for compressive memory. This state-sharing between dot-product attention and compressive memory allows efficient long-context adaptation and speeds up training and inference. The goal is to store key and value state bindings in the compressive memory and retrieve them using query vectors.
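A sketch of that retrieval and (linear) update, following the formulation in the paper: queries and keys pass through σ(x) = ELU(x) + 1, the memory is read as σ(Q)M normalized by σ(Q)z, and each segment’s key-value bindings are then accumulated into M and z. The paper also describes a delta-rule variant of the update; the dimensions and initialization here are illustrative.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, the nonlinearity used for the linear-attention memory
    return np.where(x > 0, x + 1.0, np.exp(x))

def memory_retrieve(Q, M, z):
    """Read long-term context for a segment. Q: (seg_len, d_key)."""
    sQ = elu_plus_one(Q)
    return (sQ @ M) / (sQ @ z)[:, None]      # (seg_len, d_value)

def memory_update(K, V, M, z):
    """Fold the segment's key-value bindings into the fixed-size memory."""
    sK = elu_plus_one(K)
    M = M + sK.T @ V                         # accumulate associations
    z = z + sK.sum(axis=0)                   # running normalization term
    return M, z

d_key, d_value, seg_len = 16, 16, 8
M = np.zeros((d_key, d_value))               # compressive memory (per head)
z = np.full(d_key, 1e-6)                     # small init to avoid division by zero

for _ in range(4):                           # process a long input segment by segment
    Q = np.random.randn(seg_len, d_key)
    K = np.random.randn(seg_len, d_key)
    V = np.random.randn(seg_len, d_value)
    A_mem = memory_retrieve(Q, M, z)         # retrieved long-term values for this segment
    M, z = memory_update(K, V, M, z)         # then write the segment into memory
```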
To illustrate the effect of Infini-attention, it helps to compare it with Transformer-XL. Whereas Transformer-XL only caches the states of the most recent segment, Infini-attention maintains a compressed record of the entire context history.
The Results
Infini-Transformer models were evaluated on benchmarks involving extremely long input sequences: long-context language modeling, passkey retrieval from a 1M-token context, and 500K-token book summarization. For the language modeling benchmark, models were trained from scratch, while for the other tasks, existing language models were continually pre-trained to demonstrate the approach’s plug-and-play long-context adaptation capability.
Several interesting findings emerged. Firstly, attention heads specialize: some focus on the current context, others retrieve information from the compressive memory, and “mixer” heads combine the current context with long-term memory content.
Additionally, model performance improved as more input was provided, enabling successful summarization of entire books and demonstrating that very large contexts are handled effectively.
As we witnessed last week, Infini-attention can have profound implications for the future of generalist LLMs. Similarly, RAG applications stand to be meaningfully improved by the use of techniques like Infini-attention.