
Inside Infini Attention: Google DeepMind’s Technique Powering Gemini 2M Token Window

Last Updated on June 4, 2024 by Editorial Team

Author(s): Jesus Rodriguez

Originally published on Towards AI.

Created Using Ideogram

I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

TheSequence | Jesus Rodriguez | Substack

The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…

thesequence.substack.com

Google had plenty of generative AI announcements during last week’s I/O conference. Gemini was at the center of those announcements, with a new Flash version and, among other impressive capabilities, a 2M-token context window. This capability streamlines in-context learning scenarios by allowing large amounts of text to be processed in interactions with LLMs. The large context window is the result of several incremental research breakthroughs by Google DeepMind. One of the most relevant was their work on Infini-Attention, published just a few weeks ago. The paper outlines an architecture that can, in theory, scale the transformer context window to an arbitrarily large number of tokens.

The Problem

Memory is crucial for intelligence, enabling efficient, context-specific computations. Yet, Transformers and Transformer-based language models have limitations in their context-dependent memory due to their attention mechanisms.

The attention mechanism in Transformers has a quadratic complexity in both memory usage and computation time. For instance, attention Key-Value (KV) states require 3TB of memory for a 500B model with a batch size of 512 and a context length of 2048. Scaling these models to handle longer sequences, such as 1 million tokens, presents challenges and increases financial costs.
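To make the scaling concrete, here is a rough back-of-the-envelope sketch of how the KV cache grows with context length. The model dimensions below (layer count, KV heads, head size, precision) are hypothetical placeholders, not the exact configuration behind the 3TB figure cited above.

```python
# Rough sketch of KV-cache memory as a function of context length.
# All model dimensions below are hypothetical placeholders.

def kv_cache_bytes(n_layers, batch, seq_len, n_kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (hence the factor of 2) are cached per layer,
    # per token, per KV head.
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_value

for seq_len in (2_048, 32_768, 1_000_000):
    gb = kv_cache_bytes(n_layers=96, batch=512, seq_len=seq_len,
                        n_kv_heads=64, head_dim=128) / 1e9
    print(f"context {seq_len:>9,} tokens -> ~{gb:,.0f} GB of KV cache")
```

Memory grows linearly with the number of cached tokens, while the attention computation itself grows quadratically, which is what makes million-token contexts so costly with the standard mechanism.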

Compressive memory systems offer a more scalable and efficient alternative to the attention mechanism for extremely long sequences. Unlike systems that grow with input length, compressive memory uses a fixed number of parameters to store and recall information, maintaining bounded storage and computation costs. New information is integrated by adjusting memory parameters to ensure it can be retrieved later.
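A minimal sketch of that idea, assuming nothing more than a single fixed-size memory matrix written with key-value outer products and read with queries (an illustration of the concept, not the paper's exact mechanism):

```python
import numpy as np

# A fixed-size associative memory: associations are written as key-value
# outer products into a single d_k x d_v matrix, so storage does not grow
# with the number of tokens processed. Illustrative only.

d_k, d_v = 64, 64

def write(memory, keys, values):
    # Accumulate key-value associations; the memory shape never changes.
    return memory + keys.T @ values          # (d_k, d_v)

def read(memory, queries):
    # Recall stored values by projecting queries through the memory matrix.
    return queries @ memory                  # (n, d_v)

memory = np.zeros((d_k, d_v))                # fixed number of parameters
keys, values = np.random.randn(10_000, d_k), np.random.randn(10_000, d_v)
memory = write(memory, keys, values)
recalled = read(memory, np.random.randn(4, d_k))
print(memory.shape, recalled.shape)          # (64, 64) (4, 64)
```

However many tokens are written, the memory stays a (d_k, d_v) matrix, which is the bounded-storage property compressive memory relies on.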

Infini Attention

Google DeepMind introduces a novel approach that enables Transformer language models to process infinitely long inputs with a limited memory footprint and computation. The core of this approach is a new attention technique called Infini-attention. Infini-attention incorporates compressive memory into the standard attention mechanism, combining masked local attention and long-term linear attention in a single Transformer block.

This modification allows existing language models to extend to infinitely long contexts through continual pre-training and fine-tuning. Infini-attention reuses key, value, and query states from standard attention for long-term memory storage and retrieval. Instead of discarding old KV states, they are stored in the compressive memory and retrieved using attention query states when processing future sequences. The final contextual output combines long-term memory-retrieved values with local attention contexts. Infini-attention computes both local and global context states, similar to multi-head attention (MHA), and maintains multiple parallel compressive memories per attention layer.
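As a sketch of how those two streams might be merged inside one head, the snippet below gates a memory-retrieved output against a local attention output with a learned scalar, in the spirit of the combination described above; the shapes and the zero initialization of the gate are illustrative assumptions.

```python
import numpy as np

# Sketch of merging the two context streams of one Infini-attention head:
# a_local from masked dot-product attention over the current segment and
# a_mem retrieved from compressive memory. Shapes and initialization are
# illustrative assumptions.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine(a_mem, a_local, beta):
    gate = sigmoid(beta)                        # learned scalar gate per head
    return gate * a_mem + (1.0 - gate) * a_local

seg_len, d_v = 128, 64
a_local = np.random.randn(seg_len, d_v)         # local attention output
a_mem = np.random.randn(seg_len, d_v)           # memory-retrieved output
beta = 0.0                                      # learned parameter; 0 = equal mix
print(combine(a_mem, a_local, beta).shape)      # (128, 64)
```

Because the gate is learned per head, different heads are free to lean more on local context or more on long-term memory.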

Infini-attention is built on two fundamental components:

1) Scaled Dot-product Attention

Multi-head scaled dot-product attention, particularly self-attention, has been the main component of language models. MHA’s strong ability to model context-dependent dynamic computation and its convenience for temporal masking have made it a staple of autoregressive generative models.
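For reference, a minimal single-head, causally masked version of scaled dot-product attention looks roughly like this (the shapes are arbitrary):

```python
import numpy as np

# Minimal single-head causal scaled dot-product attention.

def scaled_dot_product_attention(q, k, v):
    n, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                   # (n, n) similarity scores
    mask = np.triu(np.ones((n, n), dtype=bool), 1)    # causal (temporal) mask
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ v                                # (n, d_v)

q = k = v = np.random.randn(8, 64)
print(scaled_dot_product_attention(q, k, v).shape)    # (8, 64)
```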

2) Compressive Memory

Infini-attention reuses query, key, and value states from dot-product attention to create new memory entries for compressive memory. This state-sharing between dot-product attention and compressive memory allows efficient long-context adaptation and speeds up training and inference. The goal is to store key and value state bindings in the compressive memory and retrieve them using query vectors.
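Below is a sketch of what that read-then-write cycle could look like for one segment, loosely following the linear-attention style update the paper builds on; the ELU+1 nonlinearity, the running normalizer, and the simple additive update are simplifications of the full method.

```python
import numpy as np

# Sketch of a compressive-memory read/write that reuses the attention head's
# own query, key, and value states. Simplified relative to the full method.

def elu_plus_one(x):
    # Keeps activations positive so the normalizer stays well-defined.
    return np.where(x > 0, x + 1.0, np.exp(x))

def retrieve(memory, z, q):
    sq = elu_plus_one(q)
    return (sq @ memory) / (sq @ z)[:, None]          # (seg_len, d_v)

def update(memory, z, k, v):
    sk = elu_plus_one(k)
    return memory + sk.T @ v, z + sk.sum(axis=0)      # new memory and normalizer

d_k, d_v, seg_len = 64, 64, 128
memory, z = np.zeros((d_k, d_v)), np.full(d_k, 1e-6)  # small epsilon avoids 0/0
q, k, v = (np.random.randn(seg_len, d_k) for _ in range(3))

a_mem = retrieve(memory, z, q)        # read long-term context for this segment
memory, z = update(memory, z, k, v)   # then store the segment's KV bindings
print(a_mem.shape, memory.shape)      # (128, 64) (64, 64)
```

Retrieval happens before the update, so the current segment attends to what was stored from earlier segments, and the memory and normalizer keep a constant size from segment to segment.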

To illustrate the effect of Infini-attention, consider a comparison with the Transformer-XL method: Infini-attention retains the entire context in its compressive memory, while Transformer-XL only caches the latest segment’s states.

The Results

Infini-Transformer models were evaluated on benchmarks involving extremely long input sequences: long-context language modeling, 1M length passkey context block retrieval, and 500K length book summarization tasks. For the language modeling benchmark, models were trained from scratch, while for the other tasks, existing language models were continually pre-trained to demonstrate the approach’s plug-and-play long-context adaptation capability.

Several interesting findings emerged. First, attention heads specialized in focusing either on the current context or on retrieving information from compressive memory, while mixer heads combined the current context with long-term memory content.

Additionally, model performance improved as more input was provided, enabling successful summarization of entire books and demonstrating effective handling of very long contexts.

As we witnessed last week, Infini-attention can have profound implications for the future of generalist LLMs. Similarly, RAG applications can be significantly improved by the use of techniques like Infini-attention.


Published via Towards AI
