Inside Infini-Attention: Google DeepMind’s Technique Powering Gemini’s 2M Token Window
Last Updated on June 4, 2024 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with developments in machine learning, artificial intelligence, and data…
thesequence.substack.com
Google had plenty of generative AI announcements during last week’s I/O conference. Gemini was at the center of those announcements, with a new Flash version that boasts, among other impressive capabilities, a 2M-token context window. This capability streamlines in-context learning scenarios, allowing large amounts of text to be processed in interactions with LLMs. The large context window is the result of several incremental research breakthroughs by Google DeepMind. One of the most relevant is their work on Infini-attention, published just a few weeks ago. The paper outlines an architecture that can, in theory, scale the transformer context window to an arbitrarily large number of tokens.
The Problem
Memory is crucial for intelligence, enabling efficient, context-specific computations. Yet, Transformers and Transformer-based language models have limitations in their context-dependent memory due to their attention mechanisms.
The attention mechanism in Transformers has quadratic complexity in the sequence length, in both memory usage and computation time. For instance, the attention Key-Value (KV) states require 3TB of memory for a 500B model with a batch size of 512 and a context length of 2048. Scaling these models to handle longer sequences, such as 1 million tokens, is both technically challenging and financially costly.
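To get a feel for why the KV cache becomes the bottleneck, the sketch below estimates its size as a function of model shape and context length. The layer count, hidden size, and precision are illustrative assumptions (a real 500B-class model may use multi-query attention and a different precision, so this will not reproduce the exact 3TB figure cited above); the point is the linear blow-up with context length on top of an already huge baseline.

```python
# Back-of-the-envelope KV-cache estimate. The dimensions below are
# illustrative assumptions, not the architecture behind the cited 3TB figure.

def kv_cache_bytes(n_layers, d_model, batch_size, context_len, bytes_per_value=2):
    # Each token stores one key and one value vector of size d_model per layer.
    return 2 * n_layers * d_model * batch_size * context_len * bytes_per_value

# Hypothetical dense ~500B-parameter model with full multi-head attention.
cache_2k = kv_cache_bytes(n_layers=96, d_model=20480, batch_size=512, context_len=2048)
cache_1m = kv_cache_bytes(n_layers=96, d_model=20480, batch_size=512, context_len=1_000_000)

print(f"KV cache at 2K context: {cache_2k / 1e12:.1f} TB")
print(f"KV cache at 1M context: {cache_1m / 1e15:.1f} PB")
```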
Compressive memory systems offer a more scalable and efficient alternative to the attention mechanism for extremely long sequences. Unlike the KV cache, which grows with input length, compressive memory uses a fixed number of parameters to store and recall information, keeping storage and computation costs bounded. New information is integrated by adjusting the memory parameters so that it can be retrieved later.
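A toy sketch of that idea (not the paper’s exact update rule): key-value pairs are folded into a single fixed-size matrix via outer products, so storage stays constant no matter how many tokens are written, and recall is just a matrix product with a query.

```python
import numpy as np

d_key, d_value = 64, 64
memory = np.zeros((d_key, d_value))   # fixed-size parameter matrix

rng = np.random.default_rng(0)
for _ in range(10_000):               # write 10,000 key-value bindings
    k = rng.standard_normal(d_key)
    v = rng.standard_normal(d_value)
    memory += np.outer(k, v)          # "adjust parameters" to absorb new information

print(memory.shape)                   # (64, 64) -- independent of how much was written

q = rng.standard_normal(d_key)
recalled = q @ memory                 # recall by query, shape (d_value,)
```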
Infini Attention
Google DeepMind introduces a novel approach that enables Transformer language models to process infinitely long inputs with a limited memory footprint and computation. The core of this approach is a new attention technique called Infini-attention. Infini-attention incorporates compressive memory into the standard attention mechanism, combining masked local attention and long-term linear attention in a single Transformer block.
This modification allows existing language models to extend to infinitely long contexts through continual pre-training and fine-tuning. Infini-attention reuses key, value, and query states from standard attention for long-term memory storage and retrieval. Instead of discarding old KV states, they are stored in the compressive memory and retrieved using attention query states when processing future sequences. The final contextual output combines long-term memory-retrieved values with local attention contexts. Infini-attention computes both local and global context states, similar to multi-head attention (MHA), and maintains multiple parallel compressive memories per attention layer.
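The combination step can be sketched in a few lines. The paper describes a learned gating scalar β per head that blends the memory-retrieved values with the local attention output; the shapes and values below are placeholders, not a full implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

seg_len, d_head = 8, 16
A_dot = np.random.randn(seg_len, d_head)  # output of masked local (dot-product) attention
A_mem = np.random.randn(seg_len, d_head)  # values retrieved from the compressive memory
beta = 0.0                                # learned gating scalar (one per head)

gate = sigmoid(beta)
A = gate * A_mem + (1.0 - gate) * A_dot   # final contextual output for the segment
```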
The attention mechanism is based on two fundamental components:
1) Scaled Dot-product Attention
Multi-head scaled dot-product attention, and self-attention in particular, has been the main building block of language models. MHA’s strong capability for modeling context-dependent dynamic computation, together with the convenience of temporal masking, has been leveraged extensively in autoregressive generative models.
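For reference, here is a minimal single-head version of masked (causal) scaled dot-product attention. This is standard attention rather than anything specific to Infini-attention, and the toy dimensions are arbitrary.

```python
import numpy as np

def causal_scaled_dot_product_attention(Q, K, V):
    """Masked (causal) scaled dot-product attention for a single head.

    Q, K, V: arrays of shape (seq_len, d_head).
    """
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)              # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)        # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (seq_len, d_head)

x = np.random.randn(8, 16)
out = causal_scaled_dot_product_attention(x, x, x)  # self-attention on a toy sequence
```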
2) Compressive Memory
Infini-attention reuses query, key, and value states from dot-product attention to create new memory entries for compressive memory. This state-sharing between dot-product attention and compressive memory allows efficient long-context adaptation and speeds up training and inference. The goal is to store key and value state bindings in the compressive memory and retrieve them using query vectors.
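A sketch of that retrieval and (linear) update, following the formulation in the paper: queries and keys pass through σ(x) = ELU(x) + 1, the memory is read as σ(Q)M normalized by σ(Q)z, and each segment’s key-value bindings are then accumulated into M and z. The paper also describes a delta-rule variant of the update; the dimensions and initialization here are illustrative.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, the nonlinearity used for the linear-attention memory
    return np.where(x > 0, x + 1.0, np.exp(x))

def memory_retrieve(Q, M, z):
    """Read long-term context for a segment. Q: (seg_len, d_key)."""
    sQ = elu_plus_one(Q)
    return (sQ @ M) / (sQ @ z)[:, None]      # (seg_len, d_value)

def memory_update(K, V, M, z):
    """Fold the segment's key-value bindings into the fixed-size memory."""
    sK = elu_plus_one(K)
    M = M + sK.T @ V                         # accumulate associations
    z = z + sK.sum(axis=0)                   # running normalization term
    return M, z

d_key, d_value, seg_len = 16, 16, 8
M = np.zeros((d_key, d_value))               # compressive memory (per head)
z = np.full(d_key, 1e-6)                     # small init to avoid division by zero

for _ in range(4):                           # process a long input segment by segment
    Q = np.random.randn(seg_len, d_key)
    K = np.random.randn(seg_len, d_key)
    V = np.random.randn(seg_len, d_value)
    A_mem = memory_retrieve(Q, M, z)         # retrieved long-term values for this segment
    M, z = memory_update(K, V, M, z)         # then write the segment into memory
```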
To illustrate the effect of Infini-attention, it helps to compare it with Transformer-XL. Whereas Transformer-XL only caches the states of the most recent segment, Infini-attention maintains a compressed record of the entire context history.
The Results
Infini-Transformer models were evaluated on benchmarks involving extremely long input sequences: long-context language modeling, passkey retrieval from a 1M-token context, and 500K-token book summarization. For the language modeling benchmark, models were trained from scratch, while for the other tasks, existing language models were continually pre-trained to demonstrate the approach’s plug-and-play long-context adaptation capability.
Several interesting findings emerged. Firstly, attention heads specialize: some focus on the current context, others retrieve information from the compressive memory, and “mixer” heads combine the current context with long-term memory content.
Additionally, model performance improved as more input was provided, enabling successful summarization of entire books and demonstrating that very large contexts are handled effectively.
As we witnessed last week, Infini-attention can have profound implications for the future of generalist LLMs. Similarly, RAG applications stand to be meaningfully improved by the use of techniques like Infini-attention.