

The Method OpenAI Uses to Extract Interpretable Concepts from GPT-4


Last Updated on June 13, 2024 by Editorial Team

Author(s): Jesus Rodriguez

Originally published on Towards AI.

Created Using Ideogram




Interpretability is one of the crown jewels of modern generative AI. The workings of large frontier models remain largely mysterious compared to other human-made systems. While previous generations of ML saw a boom in interpretability tools and frameworks, most of those techniques become impractical when applied to massively large neural networks. From that perspective, solving interpretability for generative AI is going to require new methods and, potentially, new breakthroughs. A few weeks ago, Anthropic published research on identifying concepts in LLMs. More recently, OpenAI published a super interesting paper on identifying interpretable features in GPT-4 using a quite novel technique.

To interpret LLMs, identifying useful building blocks for their computations is essential. However, the activations within an LLM often display unpredictable patterns, seemingly representing multiple concepts simultaneously. These activations are also dense, meaning that each one fires on nearly every input. Real-world concepts, by contrast, are usually sparse, with only a few being relevant in any given context. This observation motivates the use of sparse autoencoders, which identify a small number of crucial “features” within the network that contribute to any given output. These features exhibit sparse activation patterns, aligning naturally with concepts that humans can easily understand, even without explicit interpretability incentives.

The sparse nature of concepts inspired OpenAI to use a well-known, but incredibly hard-to-scale, method in their interpretability journey.

Sparse Autoencoders

Sparse autoencoders offer a promising unsupervised method for extracting interpretable features from language models by reconstructing activations from a sparse bottleneck layer. Given that LLMs encompass numerous concepts, these autoencoders must be large enough to capture all relevant features. Studying the scaling properties of autoencoders is challenging because reconstruction quality must be balanced against sparsity, and because some latents become inactive (they never fire) during training. As a potential solution, k-sparse autoencoders control sparsity directly, which simplifies tuning and improves the trade-off between reconstruction and sparsity.

Image Credit: OpenAI

OpenAI has developed advanced techniques to train extremely wide and sparse autoencoders with minimal inactive latents on the activations of any language model. By systematically studying scaling laws concerning sparsity, autoencoder size, and language model size, OpenAI demonstrates the reliability of their methodology. They have successfully trained a 16-million-latent autoencoder on GPT-4’s residual stream activations.
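To make the idea of an inactive latent concrete, here is a minimal sketch in NumPy. It assumes a latent counts as inactive if it is zero on every token in a sample; the function name `dead_latent_fraction` and the toy matrix are hypothetical and used only for illustration, not taken from the paper.

```python
import numpy as np

def dead_latent_fraction(latents: np.ndarray) -> float:
    """Fraction of latents (columns) that are zero on every token (row)."""
    fired = np.any(latents != 0, axis=0)   # True where a latent fired at least once
    return float(np.mean(~fired))

# Toy latent activations: 4 tokens x 5 latents (illustrative values only).
# The second and fifth columns never fire.
latents = np.array([
    [0.9, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 1.3, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.5, 0.0],
])
frac = dead_latent_fraction(latents)
print(f"inactive latents: {frac:.0%}")
```

In practice the check would run over a much larger sample of tokens, since a latent that is silent on a handful of inputs may simply be rare rather than inactive.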

The training process for sparse autoencoders started with the residual streams of GPT-2 small and variously sized models that share GPT-4’s architecture and training setup, including GPT-4 itself. A layer near the network’s end is selected, containing many features without specializing in next-token predictions. The k-sparse autoencoder employs an activation function that retains only the largest k latents, zeroing the rest.
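The top-k activation described above can be sketched in a few lines of NumPy. This is a toy forward pass under stated assumptions, not OpenAI's implementation: it assumes the form z = TopK(W_enc (x - b_pre)) with reconstruction W_dec z + b_pre, uses random weights, initialises the decoder as the transpose of the encoder, and picks tiny illustrative sizes. The names `topk_activation` and `sae_forward` are hypothetical.

```python
import numpy as np

def topk_activation(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest entries in each row and zero the rest."""
    # Indices of the k largest values per row (their internal order does not matter).
    top_idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    top_vals = np.take_along_axis(pre_acts, top_idx, axis=-1)
    np.put_along_axis(out, top_idx, top_vals, axis=-1)
    return out

def sae_forward(x, W_enc, W_dec, b_pre, k):
    """Encode activations into k-sparse latents, then reconstruct them."""
    latents = topk_activation((x - b_pre) @ W_enc, k)
    recon = latents @ W_dec + b_pre
    return latents, recon

rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4      # toy sizes, chosen only for illustration
x = rng.normal(size=(8, d_model))      # stand-in for residual stream activations
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()                 # assumption: decoder starts as encoder transpose
b_pre = np.zeros(d_model)

latents, recon = sae_forward(x, W_enc, W_dec, b_pre, k)
mse = np.mean((recon - x) ** 2)        # reconstruction error of the untrained sketch
print(latents.shape, np.count_nonzero(latents, axis=1))
```

Because exactly k latents survive per token, sparsity is fixed by construction and training only has to minimise reconstruction error, which is why this variant is easier to tune than one that relies on a sparsity penalty.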

Understanding Scaling Laws

The most notorious challenge with sparse autoencoders is scaling them.

To accurately represent the state of advanced models like GPT-4, it is hypothesized that a large number of sparse features are needed. Two main approaches are considered for determining the autoencoder size and token budget:

1. Training to achieve the optimal Mean Squared Error (MSE) given the available compute, without focusing on convergence.

2. Training autoencoders to convergence, providing a bound on the best possible reconstruction achievable if compute efficiency is disregarded.

Sparse Autoencoders and LLM Interpretability

The ultimate goal of autoencoders is to identify features that are useful for practical applications, such as mechanistic interpretability, rather than merely improving the sparsity-reconstruction balance. The quality of autoencoders is measured using several metrics:

1. Downstream loss: Evaluates how well the language model performs if the residual stream latent is replaced with the autoencoder’s reconstruction.

2. Probe loss: Assesses whether autoencoders recover expected features.

3. Explainability: Checks if there are simple, necessary, and sufficient explanations for the activation of autoencoder latents.

4. Ablation sparsity: Determines if removing individual latents has a sparse effect on downstream outputs.
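As an illustration of the first metric, the sketch below compares next-token cross-entropy when a toy readout sees the original activations versus a reconstruction of them. Everything here is an assumption made for the example: a single linear map `W_unembed` stands in for the remaining layers of the language model, the reconstruction is simulated by adding noise, and the function names are hypothetical.

```python
import numpy as np

def cross_entropy(resid, W_unembed, targets):
    """Mean next-token cross-entropy when `resid` is fed to a linear readout."""
    logits = resid @ W_unembed
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def downstream_loss_delta(resid, recon, W_unembed, targets):
    """Change in loss caused by swapping activations for their reconstruction."""
    return cross_entropy(recon, W_unembed, targets) - cross_entropy(resid, W_unembed, targets)

rng = np.random.default_rng(0)
n_tokens, d_model, vocab = 32, 16, 50            # toy sizes for illustration
resid = rng.normal(size=(n_tokens, d_model))     # stand-in for residual stream activations
W_unembed = rng.normal(size=(d_model, vocab))    # stand-in for the rest of the model
targets = (resid @ W_unembed).argmax(axis=-1)    # toy labels: the readout's own top token

# Simulated imperfect reconstruction (a real one would come from the autoencoder).
recon = resid + 0.3 * rng.normal(size=resid.shape)
delta = downstream_loss_delta(resid, recon, W_unembed, targets)
print(f"downstream loss change: {delta:.4f}")
```

A delta of zero would mean the reconstruction preserves everything the downstream computation relies on; a larger delta means more of the model's behavior is lost in the bottleneck.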

The following picture shows the visualization of a feature that activates on text related to rhetorical questions. You can see the full example in a demo visualization tool released with the paper:

Image Credit: OpenAI


While this work marks a step forward, it has several limitations. Many discovered features remain hard to interpret, with some activating unpredictably or exhibiting spurious activations unrelated to their usual encoded concepts. Additionally, there are no reliable methods to verify these interpretations.

The sparse autoencoder does not fully capture the behavior of the original model. Passing GPT-4’s activations through the sparse autoencoder yields performance comparable to a model trained with significantly less compute. To completely map the concepts in cutting-edge language models, scaling to billions or trillions of features may be necessary, presenting significant challenges despite improved scaling techniques.

Finding features at one model point is only the beginning. Much more work is needed to understand how models compute these features and how they are utilized downstream.

Regardless of the challenges, the use of sparse autoencoders for LLM interpretability shows tremendous potential. I hope to see OpenAI double down on this in the near future.


Published via Towards AI
