
In-Context Learning Explained Like Never Before

Last Updated on April 14, 2025 by Editorial Team

Author(s): Allohvk

Originally published on Towards AI.

As the Ocean (of knowledge) was stirred in search of the elixir of life, something unexpected happened. Beautiful and magical things spontaneously started to emerge during the process…

– From the episode of Samudra Manthan (The churning of the ocean)

Emergere in Latin means “to come forth”. Emergent behaviour therefore refers to capabilities that are not explicitly built in, but instead “come forth” spontaneously. This behaviour has long been known to occur in complex systems. Imagine a flock of birds. Each individual bird is programmed to follow some simple rules: stick close to your neighbors, avoid collisions etc. Yet, amazing patterns are observed when the flock’s flight is viewed as a whole.

Image by Unachicalinda from Pixabay

This collective outcome is an emergent property of the flock behaviour and cannot be predicted by examining any single bird’s actions. Likewise, market prices emerge from numerous individual interactions in an economy. Indeed, life itself is a result of emergent behaviour. At some point in evolution, a certain combination of lifeless atoms gave rise to “life”. In LLMs, emergent behaviour refers to the spontaneous appearance of unexpected capabilities as model size and training data scale, a behaviour neatly documented in Emergent Abilities of Large Language Models.

In-Context Learning (ICL) — A notable emergent behaviour

In particular, we focus on in-context learning (ICL), which was observed in GPT-2 and further confirmed in GPT-3. ICL basically refers to the capability whereby LLMs learn a new task from training data provided without any fine-tuning. The data, in the form of training examples, is provided as part of the prompt-context and consists of multiple input-label pairs called demonstrations. It appears that the model learns from these demonstrations directly at inference time. Since very few demonstrations suffice for a model to learn, this phenomenon is also called few-shot learning.

ICL was referred to as a surprising ability by Xie et al. It is indeed surprising since we are not fine-tuning the model & hence the model weights are unchanged. Yet, somehow the model performs well at tasks it has not been trained or tuned for. In fact, under certain conditions it performs better than LLMs fine-tuned for those tasks. How is this possible? Does the LLM learn something from the demonstrations at inference time & use that information (without changing its weights) to deliver good output?

Consider a simple example with just 2 training samples.

  • “Albert Einstein was German”
  • “Mahatma Gandhi was Indian”

Let us build a prompt by concatenating these two demonstrations and appending a test example: “Marie Curie was ”. Let us feed this entire prompt to GPT-3. It is likely that the result would be “Polish” instead of the more common “a great scientist” or “a Nobel prize winner”. The LLM is thus able to infer the custom task from the demonstrations — which is to identify the country of origin. The surprising bit is that (a) LLMs are NOT explicitly pre-trained to learn from examples and (b) you will never see such prompts (which concatenate multiple independent training examples) in the natural language texts primarily used for pre-training.
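
To make this concrete, here is a minimal Python sketch of how such a few-shot prompt is assembled. The complete() call at the end is a hypothetical stand-in for whichever LLM completion API you have access to, not a specific library.

```python
# Assemble the country-of-origin prompt from the two demonstrations above.
demonstrations = [
    ("Albert Einstein was", "German"),
    ("Mahatma Gandhi was", "Indian"),
]
test_input = "Marie Curie was"

# Concatenate the input-label pairs, then append the unanswered test example.
prompt = "\n".join(f"{q} {a}" for q, a in demonstrations) + f"\n{test_input}"
print(prompt)
# Albert Einstein was German
# Mahatma Gandhi was Indian
# Marie Curie was

# answer = complete(prompt, max_tokens=1)  # hypothetical API call; expected: "Polish"
```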

This is powerful stuff! Imagine you have some custom libraries & code which the LLM has never seen before. You also collect some data samples but just can’t fine-tune as you don’t have GPUs. Well, you could leverage ICL! Emergent behaviour need not happen with scale alone. Better-quality data or better prompts could also induce emergent behaviour. Indeed, the Chain-of-Thought prompting technique raises the quality of an LLM’s output significantly, inducing emergent behaviour.

No “magic”, please! We are in the 21st century

A paper by Schaeffer et al. tried to take the magic out of emergent behaviour by suggesting that emergent abilities are simply an artefact of the discontinuous metrics commonly used for evaluation. They suggest that emergent behaviour is not something spontaneous at all. Rather, the behaviour was there all along, hidden out of sight (improving steadily as models became larger), till at some point the right output was consistently generated.

For example, consider the task of adding numbers. For 220 + 330, the answer 520 or 595 is better than (say) -9250. Instead of a pass/fail evaluation, can we give partial credit if the answer is close? For example, assume the evaluation system gave partial credit (a) if the sign was accurate, (b) more credit if the scale was accurate, and (c) even more credit for guessing the first digit accurately. With such an evaluation in place, they show that there was no spontaneous emergent behaviour; all that happened was that the metrics steadily improved as models scaled. In other words, there was no magic & the behaviour could simply be explained as “models get marginally better & better at tasks as they scale”. Specifically, they state that non-linear or discontinuous metrics produce apparent emergent abilities & suggest using linear metrics, which produce predictable changes in LLM behaviour.
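
To see how the choice of metric changes the picture, here is a small Python sketch of a graded scoring rule in the spirit of (a)–(c) above. The specific weights are arbitrary illustrations, not the exact metrics analysed by Schaeffer et al.

```python
import math

def exact_match(pred: int, target: int) -> float:
    """Discontinuous metric: all-or-nothing."""
    return 1.0 if pred == target else 0.0

def partial_credit(pred: int, target: int) -> float:
    """Smoother metric: credit for sign, scale and first digit (weights arbitrary)."""
    if pred == target:
        return 1.0
    score = 0.0
    if (pred >= 0) == (target >= 0):                       # (a) correct sign
        score += 0.2
    if pred != 0 and target != 0 and \
       int(math.log10(abs(pred))) == int(math.log10(abs(target))):
        score += 0.3                                       # (b) correct scale
    if str(abs(pred))[0] == str(abs(target))[0]:           # (c) correct first digit
        score += 0.3
    return score

# 220 + 330 = 550. Exact match scores both wrong answers 0; the graded metric
# rewards the near-miss, so measured ability improves smoothly with scale.
print(exact_match(520, 550), partial_credit(520, 550))      # 0.0 vs 0.8
print(exact_match(-9250, 550), partial_credit(-9250, 550))  # 0.0 vs 0.0
```

Under the first metric, accuracy sits at zero until the model gets everything exactly right and then jumps; under the second, it creeps up steadily, which is precisely the difference between “emergence” and incremental improvement.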

Interestingly enough, the results of this paper were not unforeseen by the authors of the Emergent Abilities paper, who acknowledge that certain (discontinuous) metrics “may disguise compounding incremental improvements as emergence” but still go on to say that at best this is a “partial explanation”. To be fair, inventing a new set of metrics to explain an observed behaviour is much easier than predicting a new emergent behaviour itself. Moreover, on certain tasks it has been observed that LLM performance does remain near-random until a certain threshold, beyond which there is a marked increase in accuracy. Lastly, emergent behaviour is a natural consequence of scale and complexity, as empirically observed in the wild. So we should actually be surprised if there were no emergent behaviour in LLMs.

This is an area that will see great churn (excuse the pun) as theories come and go. What magical behaviour will suddenly emerge when model sizes cross (say) 100 trillion parameters? Maybe a model that suddenly gets every coding task right? Maybe a model that can accurately predict a stock price? For sure, its sudden emergence will cause a lot of chaos & disruption! Let us try to understand how ICL happens. Maybe then, we would be better prepared…

Fine-tuning & ICL both provide demonstrations to the LLM. In fine-tuning, we use these demonstrations as training data and apply gradient descent to modify the LLM weights. In ICL, we feed the demonstrations to the model via the prompt. The model looks at the demonstrations, learns the patterns & predicts the correct output (without changing its weights). The model learns the pattern on the fly! Wow, ICL is neat in the sense that we don’t have to fine-tune for every task we want. Instead, we just feed the right demonstrations to the model at run-time. How can we explain this?

Is In-Context Learning a Complete-the-Pattern exercise?

To start with, we can view the demonstration examples as a complete-the-pattern exercise. We have the prompt structured as:

<q1, ans1>, <q2, ans2>, … , <q_n, ans_n>

We now append <q_n+1> to this and feed it to the model. The model then leverages its predict-the-next-word objective to complete the pattern. Min et al. say that ICL acts as a pattern-recognition procedure, rather than as an actual “learning” procedure. They underplay the role of the input-label mapping in the demonstrations & claim that the model relies more on the information gained during pre-training to generate outputs. Hmmm.

Maybe a Copy-Paste job? Hello, Induction heads!

Olsson et al. dig a little deeper and find that transformers have induction heads that refer to abstract patterns in the prompt sequence to help predict the next token. These are different from the regular attention heads that pay attention to different aspects of sentence grammar during training. Induction heads (a) search for token(s) similar to the current token that have occurred earlier in the sequence and (b) copy the token that followed & paste it as the next output token. They hypothesised that as you increase the model size, this behaviour becomes more complex — the model becomes capable of copying not just tokens but concepts & latent meanings. An example may help:

  • Simple token copy: Say during training, the model has NOT come across the word Samudra-Manthan. But assume we use this word plenty of times in our prompt-context. Say our prompt to the LLM is: “Samudra-Manthan was written thousands of years ago. The best of the world’s resources were pooled to conduct Samudra-Manthan. Samudra-Manthan means churning of the ocean. During the Samudra-”. When the LLM is asked to generate the next token, the induction heads likely conclude that “Manthan” is going to be the next word. They force the probability distribution across the tokens in the vocabulary to allocate the highest probability to the token “Manthan”, even though this particular combination was never observed during training!
  • More abstract token copy: Say the demonstration examples are translations of words from English to Kannada. So the prompt-context is something like: {EN: <query1> KA: <translation of query1>; EN: <query2> KA: <translation of query2>; EN: <query3>}. Now what is the LLM’s prediction? The induction heads in the LLM realise that they need to copy the token query3 and paste it as the next token, but only after translating it to Kannada. This is still a copy-paste job, but one involving a small transformation prior to the pasting!
  • Even more abstract token copy: Say the prompt is something like this: {Rat: 1; Duck: 4; Bison: 6; Elephant: 7; Whale: }. Now what could the LLM generate? The induction heads realise that they need to copy Whale, but only after translating it using some latent concept. What could that latent concept be? Maybe the model reasons that it has to do with size or lifespan and it should return a 10 or so. Or maybe, just to tease humans, it may return a 6 with the justification that the latent concept has something to do with the length of the token. Whatever the output, it is still a copy-paste job with a transformation thrown in. By giving enough in-context examples, the model is able to guess the latent concept accurately… the final effect being similar to fine-tuning.

Induction heads are implemented by a pair of attention heads in different layers. The first head reads the prompt & copies (some) info from every question-token in the prompt to the answer-token following it. Another downstream head keeps matching the current token (the query) against all the previous token keys in the usual attention methodology. Because the keys are shifted right by one token by the first head, the attention gets focused on the relevant answer-token(s) from the past. So together, these heads make the model search the entire prompt for past tokens that are similar to the present token and attend strongly to the token that came next, increasing its softmax probability. Induction heads are so named because they attend to tokens that would be predicted by induction (from the examples in the prompt).
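
As a rough illustration of the match-and-copy behaviour (at the level of tokens, not of the actual two-head circuitry), here is a toy Python sketch. The whitespace tokenisation and the induction_guess helper are made up for this example.

```python
from collections import Counter

def induction_guess(tokens):
    """Guess the next token by finding earlier occurrences of the current token
    and 'copying' whatever followed them most often (the induction pattern)."""
    current = tokens[-1]
    followers = Counter(
        tokens[i + 1]
        for i in range(len(tokens) - 1)
        if tokens[i] == current          # earlier occurrence of the current token
    )
    return followers.most_common(1)[0][0] if followers else None

prompt = ("Samudra - Manthan was written long ago . Resources were pooled to "
          "conduct Samudra - Manthan . Samudra - Manthan means churning of the "
          "ocean . During the Samudra -").split()
print(induction_guess(prompt))  # -> 'Manthan'
```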

Or is In-Context Learning due to Nearest-Neighbour search?

This is not a far-fetched interpretation. When the temperature hyper-parameter T is set very low, the attention weights, i.e. softmax(QKᵀ/T), converge to a one-hot vector. This means the model is attending to the single most similar token — a nearest-neighbour behaviour. Essentially, we are finding the closest match to the query from the demonstrations. This gives us a fresh perspective on ICL — we can view it as implementing a nearest-neighbour algorithm over our input-output demonstration pairs, through the mechanics of attention! Now, imagine the model directly tinkering with the attention mechanism to enforce this behaviour naturally… without us setting the temperature parameter.
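
Here is a small numpy sketch of that collapse: as the temperature T shrinks, softmax attention over the demonstration keys turns into a one-hot selection of the most similar key, i.e. a 1-nearest-neighbour lookup. The keys, values and query below are toy stand-ins for encoded demonstrations.

```python
import numpy as np

def attention_readout(q, K, V, T):
    """softmax(q·Kᵀ / T) V — one query attending over demonstration keys."""
    scores = K @ q / T
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w, w @ V

# 5 demonstrations; keys kept orthogonal so the nearest neighbour is unambiguous.
K = np.eye(5)                                   # one key per demonstration input
V = np.arange(5, dtype=float).reshape(-1, 1)    # stand-in for the labels
q = K[2] + 0.05 * np.random.default_rng(0).normal(size=5)   # query close to demo #2

for T in (1.0, 0.1, 0.01):
    w, out = attention_readout(q, K, V, T)
    print(f"T={T:<4} weights={np.round(w, 3)} readout={out.round(3)}")
# As T shrinks, the weights collapse to a one-hot vector on demonstration #2 and
# the readout approaches that demonstration's label: a nearest-neighbour lookup.
```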

Something for Bayesian fans too!

Another view is to look at ICL through a Bayesian lens. The Bayesian inference framework explains how the model sharpens the posterior distribution over concepts based on the prompt given, effectively learning the concept. This occurs despite demonstrations being unnatural sequences that concatenate independent examples, which don’t occur in the natural language datasets used to pre-train models. Xie et al. explain that in-context learning emerges when language models can infer the “shared latent concept” common to the bunch of demonstration examples and use it to “locate” information acquired during pre-training to generate the output. They also underplay the role of input-label mapping in ICL by showing that ICL is robust to label randomization. They suggest that other aspects of the prompt, such as the input & output distributions, contribute to the final result.

Apparently, In-Context Learning allows the model to learn any function!

Garg et al., in Can we train a model to ‘in-context learn’ a certain function class, show that irrespective of what objective the original model was trained on, it can (under certain conditions) acquire a behaviour which makes it possible to in-context learn any (linear) function. So the model can approximate f(x_query) by conditioning on a prompt sequence of (x, f(x)) examples followed by the query. So basically, while the model is being pre-trained on an objective like masked attention, it is silently picking up the ability to also do ICL. This is meta-learning, a paradigm in which the model learns how to learn from data.
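
A minimal numpy sketch of that setup follows: it builds the in-context sequence of (x, f(x)) pairs for a random linear function and computes the ordinary-least-squares answer that, per Garg et al., a transformer trained on such sequences ends up approximating. The sketch does not include the transformer itself, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
w_true = rng.normal(size=d)        # the "task": a linear function never seen before
X = rng.normal(size=(n, d))        # in-context inputs x_1 .. x_n
y = X @ w_true                     # their labels f(x_i)
x_query = rng.normal(size=d)       # the query appended after the demonstrations

# The "prompt" in Garg et al.: (x_1, f(x_1)), ..., (x_n, f(x_n)), x_query.
# A transformer trained from scratch on such sequences predicts f(x_query)
# roughly as well as the least-squares solution computed directly below.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS prediction :", x_query @ w_ols)
print("True f(x_query):", x_query @ w_true)
```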

The model somehow develops an internal learning machinery that can handle a much wider range of unseen tasks by searching over an implicit parameter space to optimize some function f which is not the model’s own loss function. In other words, its pre-trained weights somehow have the ability to ensure that the model “trains” on context data at runtime to learn functions that it has never been exposed to during its training. The Induction Heads paper by the Anthropic folks discussed earlier provides some of the strongest evidence of how ICL may possibly occur. We now shift focus to a paper by von Oswald et al., who generalize this further and boldly claim that “Transformers Learn In-Context by Gradient Descent”!

Attention as Gradient Descent!

Let us try to understand how they explain ICL. I start by talking about 3–4 smaller concepts (not necessarily from their paper). Later on, we tie these concepts together to understand their explanation of ICL.

  • Concept 1 — Prompts have the power to change the attention weights & therefore change the model behaviour: LLMs predict the next word based on a probability distribution across all the tokens in the vocabulary. A prompt has the power to change this probability distribution. It can make the model favour certain words more than it would have done by default. If the prompt is: “I work as a”, the LLM-generated next word could be any of the several hundred professions in the world. If the prompt were changed to “I am in the Information Technology sector. I work as a”, then the probability distribution drastically narrows to the dozen or so professions related to computers. The change in prompt forced a change in behaviour even though the model weights themselves have not changed. This is the immense power of prompt engineering! Prompts change the activations generated by the attention mechanism and therefore control the attention weights which are used to predict the next word! Theoretically, you can engineer a prompt to hijack any LLM’s output!
  • Concept 2 — Imagine fine-tuning an LLM with a set of 100 training data points. We know what happens during gradient descent — the weights are nudged slightly in each iteration towards the direction that produces a better output. Can we view this process of gradient descent as an attention mechanism? Basically, the model is attending to a set of 100 examples. The attention mechanism decides the contribution of each training sample in nudging these weights. So gradient descent can be expressed as an attention mechanism! (Note: a more accurate statement would be — the forward pass of a linear layer trained by gradient descent can be expressed as an attention operation where the keys and values are training datapoints and the query is generated from the test input. Link.) A small sketch of this equivalence appears right after this list.
  • Concept 3 — If gradient descent can be expressed as an attention mechanism, can the opposite be true? Can attention be expressed as a gradient-descent mechanism? Before we go there, let us get an intuition of how LLMs work, specifically w.r.t. the attention mechanism. Take a transformer with just one attention layer. We know that the next-token prediction is based on the attention scores of all tokens prior to (& including) the token in question. You have multiple such attention blocks in an LLM. The attention scores get better & better as we move from block to block! The last block has the best attention scores.
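
Here is the Concept 2 equivalence as a minimal numpy sketch, in the spirit of the construction in the linked note and in von Oswald et al.: one gradient-descent step on a linear layer, evaluated at a test input, gives exactly the same answer as a (linear, unnormalised) attention readout whose keys are the training inputs, whose values are the prediction errors, and whose query is the test input. This is a toy check of the identity, not their full transformer construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n, eta = 4, 2, 8, 0.1
X = rng.normal(size=(n, d_in))          # training inputs  -> keys
Y = rng.normal(size=(n, d_out))         # training targets
W0 = rng.normal(size=(d_out, d_in))     # initial weights of the linear layer
x_test = rng.normal(size=d_in)          # test input       -> query

# Route 1: one explicit gradient-descent step on the mean squared error, then predict.
grad = (X @ W0.T - Y).T @ X / n         # dL/dW for the squared loss
W1 = W0 - eta * grad
pred_gd = W1 @ x_test

# Route 2: never touch the weights; express the same step as linear attention,
# with keys = x_i, values = prediction errors, query = x_test.
errors = X @ W0.T - Y                   # values: (W0 x_i - y_i) for each datapoint
attn = X @ x_test                       # unnormalised attention scores k_i · q
pred_attn = W0 @ x_test - eta / n * errors.T @ attn

print(np.allclose(pred_gd, pred_attn))  # True: the GD step is an attention readout
```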

Ok, let us try to put these intuitions together to see where we are going. Let us consider Fine-tuning first:

  • You have the original model weights
  • You have 100 data samples available
  • You use gradient descent to update the model weights.
  • We do gradient descent 50 times & get a nicely tuned model!

Now, let us say we don’t have the resources to do fine-tuning 🙁

  • We do have the original model weights. We can do inference on them. Basically, this means the data flows through all the transformer blocks in a forward pass till an output is generated.
  • We have 100 data samples available. Let us say we feed this as a prompt to the model hoping for ICL to happen!
  • We have attention operations being performed in each block (on the input data samples in the prompt) and we keep doing that iteratively (block after block), improving the attention scores till we get a final output.
  • Let us say there are 50 attention blocks in the model. So the above process is repeated 50 times. The final activations are really good!
Source: https://www.lesswrong.com/posts/HHSuvG2hqAnGT5Wzp/no-convincing-evidence-for-gradient-descent-in-activation

Aren’t the two processes strikingly similar! Isn’t it logical to investigate whether attention forward passes can be explained as being equivalent to gradient descent? Yes, the authors do precisely that. They show that gradient descent happens via the attention mechanism in the LLM during the forward pass. One attention block = 1 gradient-descent step, with the training data being the in-context examples. They even show that the gradients (tiny nudges taken towards generating the desired output) at each step of the 2 different architectures are numerically comparable to one another.

But hey, aren’t model weights frozen during inference? How can the model take tiny nudges towards generating the desired output? This is where a small leap of imagination comes in. In ICL, the nudges are not happening to the weights. It is the attention scores that get nudged towards producing better & better output as the input moves from block to block inside the LLM. Attention scores get better if there are better activations. So the gradients in ICL are not changes to the weights but changes to the activations as we move from one block to another. The effect is numerically comparable to a gradient-descent step. So it is almost as if the model is training itself via the attention mechanism instead of the regular gradient-descent mechanism! Wow! Do read this healthy criticism.

Enough intuitions. We can now summarize von Oswald’s paper & one more (Dai et al.) that builds further on it as follows — ICL produces the equivalent of meta-gradients in its forward pass. These meta-gradients can be computed by comparing activations between ICL & non-ICL forward passes. They can then be compared with the gradients generated during fine-tuning. The authors find strong similarities. This is where it appears magical! Imagine the model actually learning weights during pre-training that allow it to do this — basically, take any set of examples from any domain that is fed to it in the future & generate activations that help it converge, layer by layer, to fit those training examples — while silently obeying the loss function and the original objectives (like masked attention) that it is being trained on! Fantastic, isn’t it!

Maybe somewhere among the model’s billions of parameters is a small subset of parameters that gets activated when the model encounters ICL-style input data, producing this kind of behaviour. As of today, deep transformers have been found to match OLS (Ordinary Least Squares) solutions on simple linear problems. As models get deeper, more complex forms of training might emerge. A recent study claims that it is not 1st-order gradient descent that is emulated but a 2nd-order convergence. Whatever the case, we now have a possible explanation of how ICL happens.

But is this the real reason? Just because (under certain conditions) LLMs can learn via the forward pass does not mean that they actually learn that way in reality. The jury is still out on this. Secondly, even if we do find out how ICL happens, it is difficult to explain why a model picks up that ability. We can (for example) explain practically how life originated on earth, but explaining why it happened is tricky. Did it happen simply because there was a small mathematical possibility that it could? Did our churners in the epic of Samudra-Manthan find the source of the origin of life? 🙂 Now, that is a story for another day…

This is the 7th article in a 12-part series titled My LLM diaries.

  1. Quantization in plain English
  2. LoRA & its newer variants explained like never before
  3. In-Context learning: The greatest magic show in the kingdom of LLMs
  4. RAG in plain English — Summary of 1000+ papers
  5. HNSW — Small World, Yes! But how in the world is it Navigable?
  6. VectorDB origins, Vamana & on-disk vector search algorithms
  7. LLMs on the laptop — A peek into the Silicon
  8. Taming LLMs — A study of few popular techniques
  9. Agents in plain English
  10. LLMops in plain English — Operationalizing trained models
  11. Look Ma, LLMs without Prompt Engineering
  12. Taking a step back — On model sentience, conscientiousness & other philosophical aspects


Published via Towards AI
