RoPE (Rotary Position Embeddings): A Detailed Example
Last Updated on November 11, 2025 by Editorial Team
Author(s): Utkarsh Mittal
Originally published on Towards AI.
In transformer models, knowing the order of tokens is essential — even though the model processes tokens in parallel. Traditional positional embeddings rely on a fixed “lookup table” (learned for positions up to a maximum length). But what if you need to work with sequences longer than the training limit? Enter RoPE (Rotary Position Embeddings).
RoPE leverages a mathematical approach (rotations via cosine and sine) to encode positions, allowing models to handle arbitrarily long sequences — “infinite context.” This guide breaks down the concepts, walks through the code with practical examples, uses visuals, and highlights how the dimension stays the same before and after the rotary transformation.
1. Why Positional Embeddings Matter
Traditional vs. Rotary Positional Embeddings
- Traditional Positional Embeddings:
Each token gets a fixed vector based on its position (e.g., positions 1–512). They work well for short sequences but cannot extrapolate to unseen positions.
- Rotary (RoPE) Positional Embeddings:
Instead of fixed values, RoPE uses rotations computed by:
θ = frequency × position
This angle is then converted into a complex number:
cos(θ) + i·sin(θ)
Because cosine and sine are defined for any real number, you can compute positions for any token — extending your model’s capability well beyond the training limit.
Visual Analogy:
Think of a traditional positional embedding like a 12-hour clock with fixed numbers. RoPE is like a continuously spinning clock hand that can indicate any time — even if it goes beyond 12.
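The clock analogy is easy to see in code. Below is a toy sketch (the frequency value, the table size, and both function names are illustrative, not taken from any real model): a lookup table fails past its last learned position, while the rotary angle θ = frequency × position is defined everywhere.

```python
import math

# Hypothetical setup: a learned table covers positions 0..511 only,
# while the rotary angle is defined for any position at all.
max_len, frequency = 512, 0.01

def table_angle(pos):
    """Stand-in for a learned lookup table: fails beyond max_len."""
    if pos >= max_len:
        raise IndexError("position outside the learned table")
    return frequency * pos

def rotary_factor(pos):
    """RoPE-style factor: theta = frequency * position, then cos/sin."""
    theta = frequency * pos
    return math.cos(theta), math.sin(theta)  # the pair (cos θ, sin θ)

# Position 1000 is far past the table, but the rotation is still well defined.
cos_t, sin_t = rotary_factor(1000)
```

The "spinning clock hand" never runs out of positions: `rotary_factor` accepts any integer, while `table_angle` raises past position 511.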
2. How RoPE Works: Code Walkthrough
Let’s explore the key functions that implement RoPE in PyTorch.
A. Precomputing Frequency Factors
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    """
    Precompute the frequency tensor for complex exponentials (cos, sin).
    Parameters:
        dim: Embedding dimension (must be even)
        end: Maximum sequence length
        theta: Base value for frequencies
    Returns:
        A complex tensor of shape (end, dim/2) with values cos(θ) + i*sin(θ)
    """
    # For each pair (since rotary works on pairs), compute the frequencies.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # Position indices
    t = torch.arange(end, device=freqs.device)
    # Outer product creates a matrix of shape (end, dim/2)
    freqs = torch.outer(t, freqs)
    # Convert to complex numbers (cosine + i*sin)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
    return freqs_cis
Key Points:
- The function splits the embedding dimension into pairs.
- It computes a frequency factor for each pair.
- The result is used to “rotate” the query and key vectors later.
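A quick sanity check makes the shapes concrete. The sketch below repeats the function and runs it on arbitrary toy sizes (dim=8, end=4 are chosen only for illustration):

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # One frequency per feature pair: theta ** (-2k/dim) for k = 0 .. dim/2 - 1
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)        # positions 0 .. end-1
    freqs = torch.outer(t, freqs)                     # angles, shape (end, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs) # cos(θ) + i*sin(θ)

freqs_cis = precompute_freqs_cis(dim=8, end=4)
print(freqs_cis.shape)  # torch.Size([4, 4]): one complex factor per (position, pair)
print(freqs_cis[0])     # position 0 rotates by angle 0, so every factor is 1+0j
```

Note that every entry has magnitude 1: each factor is a pure rotation, which is why applying it later changes values but not the scale of the vectors.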
B. Applying Rotary Embeddings
def apply_rotary_emb(xq, xk, freqs_cis):
    # Reshape the last dimension into pairs (each pair becomes a complex number)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Use the sequence length from the query (and key)
    seq_len = xq_.shape[-2]
    freqs_cis = freqs_cis[:seq_len, :]
    # Multiply the complex numbers to perform the rotation
    xq_out = torch.view_as_real(xq_ * freqs_cis.unsqueeze(0).unsqueeze(0))
    xk_out = torch.view_as_real(xk_ * freqs_cis.unsqueeze(0).unsqueeze(0))
    # Flatten the pairs back to the original shape
    xq_out = xq_out.flatten(3).type_as(xq)
    xk_out = xk_out.flatten(3).type_as(xk)
    return xq_out, xk_out
What’s Happening?
1. Reshaping: The last dimension of Q and K is divided into pairs so that each pair can be interpreted as a complex number.
2. Rotation: Multiplying these complex numbers by the frequency factors cos(θ) + i·sin(θ) “rotates” the vectors in 2D planes.
3. Reformatting: The complex values are converted back into real vectors. Crucially, the dimension remains the same as before — only the values change to incorporate positional information.
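The shape-preservation claim can be verified end to end. This self-contained sketch repeats the two helpers in condensed form and applies them to random toy tensors of shape (batch, heads, seq, head_dim); the sizes are arbitrary:

```python
import torch

def precompute_freqs_cis(dim, end, theta=10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(end).float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rotary_emb(xq, xk, freqs_cis):
    # (B, H, T, D) -> complex pairs of shape (B, H, T, D/2)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    fc = freqs_cis[: xq_.shape[-2]].unsqueeze(0).unsqueeze(0)
    xq_out = torch.view_as_real(xq_ * fc).flatten(3).type_as(xq)
    xk_out = torch.view_as_real(xk_ * fc).flatten(3).type_as(xk)
    return xq_out, xk_out

B, H, T, D = 2, 2, 4, 8  # toy sizes: batch, heads, sequence, head dimension
xq, xk = torch.randn(B, H, T, D), torch.randn(B, H, T, D)
q_rot, k_rot = apply_rotary_emb(xq, xk, precompute_freqs_cis(D, T))
print(q_rot.shape)  # torch.Size([2, 2, 4, 8]): identical to the input shape
```

Two properties worth checking: position 0 is rotated by angle 0 and therefore comes back unchanged, and the overall norm is preserved because every factor is a unit-magnitude rotation.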
C. Integration in Causal Self-Attention
Here’s a simplified snippet showing how RoPE fits into the attention mechanism:
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Number of query and key/value heads
        self.n_head = config.n_head
        self.n_kv_head = config.n_kv_head if hasattr(config, 'n_kv_head') else config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head
        # Linear projections for queries, keys, and values
        self.wq = nn.Linear(config.n_embd, self.n_head * self.head_dim, bias=getattr(config, 'bias', False))
        self.wk = nn.Linear(config.n_embd, self.n_kv_head * self.head_dim, bias=getattr(config, 'bias', False))
        self.wv = nn.Linear(config.n_embd, self.n_kv_head * self.head_dim, bias=getattr(config, 'bias', False))
        # Causal mask (ensures tokens can only attend to previous tokens)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size)
        )
        # Rotary embedding cache (computed on the first forward pass)
        self.rope_cache = None
        self.max_seq_len = config.block_size

    def forward(self, x):
        batch_size, seq_length, _ = x.shape  # x shape: (B, T, C)
        # Initialize the rotary cache if not done yet
        if self.rope_cache is None or self.rope_cache.shape[0] < self.max_seq_len:
            self.rope_cache = precompute_freqs_cis(self.head_dim, self.max_seq_len).to(x.device)
        # Compute Q, K, V projections (code omitted for brevity)
        ...
Visual Flow of the Forward Pass:
┌────────────────────────────┐
│ Input Tensor │
│ Shape: (Batch, T, C) │
└────────────┬───────────────┘
│
[Linear Projection → Q, K, V]
│
▼
┌────────────────────────────┐
│ Reshape into Multiple │
│ Heads │
└────────────┬───────────────┘
│
[Apply RoPE → Rotated Q, K]
│
▼
┌────────────────────────────┐
│ Compute Attention Scores │
│ (with causal masking) │
└────────────────────────────┘
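The elided forward pass could be completed along these lines. This is a minimal sketch, not the author's exact implementation: the `TinyRoPEAttention` class, its constructor arguments, and the simplification of using equal query and key/value head counts are all my assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def precompute_freqs_cis(dim, end, theta=10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(end).float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rotary_emb(xq, xk, freqs_cis):
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    fc = freqs_cis[: xq_.shape[-2]].unsqueeze(0).unsqueeze(0)
    xq_out = torch.view_as_real(xq_ * fc).flatten(3).type_as(xq)
    xk_out = torch.view_as_real(xk_ * fc).flatten(3).type_as(xk)
    return xq_out, xk_out

class TinyRoPEAttention(nn.Module):
    """Minimal causal self-attention with RoPE (equal q and k/v heads)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.n_head, self.head_dim = n_head, n_embd // n_head
        self.wq = nn.Linear(n_embd, n_embd, bias=False)
        self.wk = nn.Linear(n_embd, n_embd, bias=False)
        self.wv = nn.Linear(n_embd, n_embd, bias=False)
        self.register_buffer("freqs_cis", precompute_freqs_cis(self.head_dim, block_size))
        mask = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        # Project and split into heads: (B, T, C) -> (B, n_head, T, head_dim)
        q = self.wq(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # Rotate queries and keys only; values are left untouched
        q, k = apply_rotary_emb(q, k, self.freqs_cis)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v                # (B, n_head, T, head_dim)
        return y.transpose(1, 2).reshape(B, T, C)     # back to (B, T, C)

attn = TinyRoPEAttention(n_embd=8, n_head=2, block_size=16)
x = torch.randn(2, 4, 8)
out = attn(x)  # same (B, T, C) shape as the input
```

Because of the causal mask, the output at position 0 depends only on the first token, so editing a later token leaves it unchanged.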
3. A Practical Example with Toy Sentences
Imagine you have the following nine sentences:
toy_sentences = [
"The quick brown fox jumps over the lazy dog.",
"I love programming and machine learning.",
"Natural language processing is a fascinating field.",
"Transformers have changed the way we process text.",
"Artificial intelligence is the future of technology.",
"Data science combines statistics and computer science.",
"Deep learning models require large amounts of data.",
"Python is a popular language for AI research.",
"ChatGPT helps answer many complex questions."
]
Step-by-Step Example
1. Tokenization & Embedding
Each sentence is tokenized and padded/truncated to a fixed length (say, 4 tokens per sentence). Imagine each sentence becomes a 4×8 matrix.
2. Linear Projection for Keys (Detailed Example)
For a single token (e.g., “The”) with an 8-dimensional embedding [1, 2, 3, 4, 5, 6, 7, 8], the key linear layer maps this to an 8-dimensional output. That 8-dim output is then split into two key heads (each of size 4).
3. Reshaping the Projections
For the entire batch (9 sentences, 4 tokens each, 8 features per token), the key projection is initially (9, 4, 8). It is then reshaped to (9, 4, 2, 4) and transposed to (9, 2, 4, 4).
4. Applying the Rotary Embedding
Each pair of features (e.g., [q1, q2]) is treated as a complex number. Multiplying by cos(θ) + i·sin(θ) rotates that pair, embedding the positional information.
5. Causal Masking
A lower-triangular matrix ensures that tokens only attend to themselves and previous tokens, preserving the causal structure for language generation.
4. Visual Summary of the Reshaping Steps
Below is a text-based representation of the diagrams showing how we go from (9, 4, 8) → (9, 4, 2, 4) → (9, 2, 4, 4) for one batch element (one sentence). The same applies across the entire batch of 9.
1. Original Tensor (9, 4, 8)
For one batch element (Sentence 1), we might have:
Token 1: [ a1, a2, a3, a4, a5, a6, a7, a8 ]
Token 2: [ a9, a10, a11, a12, a13, a14, a15, a16 ]
Token 3: [ a17, a18, a19, a20, a21, a22, a23, a24 ]
Token 4: [ a25, a26, a27, a28, a29, a30, a31, a32 ]
2. Reshaped Tensor (9, 4, 2, 4)
We split each token’s 8 features into 2 groups (each of size 4). For one sentence:
Token 1:
Head Group 1: [ a1, a2, a3, a4 ]
Head Group 2: [ a5, a6, a7, a8 ]
Token 2:
Head Group 1: [ a9, a10, a11, a12 ]
Head Group 2: [ a13, a14, a15, a16 ]
Token 3:
Head Group 1: [ a17, a18, a19, a20 ]
Head Group 2: [ a21, a22, a23, a24 ]
Token 4:
Head Group 1: [ a25, a26, a27, a28 ]
Head Group 2: [ a29, a30, a31, a32 ]
3. Transposed Tensor (9, 2, 4, 4)
Next, we transpose so that the head dimension (2) comes right after the batch dimension (9). Now it’s (9, 2, 4, 4). For one sentence:
Head 1 (shape: 4 x 4):
Token 1: [ a1, a2, a3, a4 ]
Token 2: [ a9, a10, a11, a12 ]
Token 3: [ a17, a18, a19, a20 ]
Token 4: [ a25, a26, a27, a28 ]
Head 2 (shape: 4 x 4):
Token 1: [ a5, a6, a7, a8 ]
Token 2: [ a13, a14, a15, a16 ]
Token 3: [ a21, a22, a23, a24 ]
Token 4: [ a29, a30, a31, a32 ]
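The index bookkeeping above can be verified mechanically by numbering the features 1 through 32 (standing in for a1 through a32) and tracing them through the reshape and transpose:

```python
import torch

# One value per feature: sentence 1 holds 1..32, sentence 2 holds 33..64, etc.
x = torch.arange(1, 9 * 4 * 8 + 1).reshape(9, 4, 8)  # (batch, tokens, features)
split = x.reshape(9, 4, 2, 4)   # (batch, tokens, heads, head_dim)
heads = split.transpose(1, 2)   # (batch, heads, tokens, head_dim)

print(heads[0, 0, 1])  # Head 1, Token 2 -> tensor([ 9, 10, 11, 12])
print(heads[0, 1, 0])  # Head 2, Token 1 -> tensor([5, 6, 7, 8])
```

The printed rows match the diagram: a9–a12 land in Head 1's second row, and a5–a8 land in Head 2's first row.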
5. Detailed Numeric Example of the Pairwise Rotation
A key insight is that the dimension stays the same before and after RoPE. Let’s illustrate this with a smaller dimension (4) for clarity:
Setup
- Suppose our head dimension is 4 (so we have 2 “pairs”).
- We take a single token’s query vector, Q=[q1,q2,q3,q4].
- We’ll apply RoPE with angles θ1 and θ2 for the two pairs.
Step 1: Split into Pairs
We group the vector into:
- Pair 1: [q1,q2]
- Pair 2: [q3,q4]
Step 2: Interpret Each Pair as a Complex Number
- z1=q1+i q2
- z2=q3+i q4
Step 3: Multiply by the Rotation Factors
Each rotation factor has the form cos(θ) + i sin(θ); pair 1 is rotated by θ1 and pair 2 by θ2.
Pair 1
z1,rot = (q1 + i q2) × [cos(θ1) + i sin(θ1)]
Expanding this:
ℜ(z1,rot) = q1 cos(θ1) − q2 sin(θ1)
ℑ(z1,rot) = q1 sin(θ1) + q2 cos(θ1)
Pair 2
z2,rot = (q3 + i q4) × [cos(θ2) + i sin(θ2)]
Similarly:
ℜ(z2,rot) = q3 cos(θ2) − q4 sin(θ2)
ℑ(z2,rot) = q3 sin(θ2) + q4 cos(θ2)
Step 4: Convert Back to Real Vectors
After rotation, we end up with two new pairs:
Rotated Pair 1: [ℜ(z1,rot), ℑ(z1,rot)]
Rotated Pair 2: [ℜ(z2,rot), ℑ(z2,rot)]
Concatenating them back yields a 4-dimensional vector:
Qrot=[q1′,q2′,q3′,q4′].
Hence, the shape is the same as the original [q1,q2,q3,q4]. We haven’t lost or gained dimensions; we’ve merely rotated them in 2D planes.
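The expansion above can be checked numerically. The values of q1 through q4, θ1, and θ2 below are made up for illustration; the point is that the complex multiplication and the expanded real formulas agree, and that the result is still 4-dimensional.

```python
import math

# Toy 4-dim query and two angles (arbitrary illustrative values)
q1, q2, q3, q4 = 1.0, 2.0, 3.0, 4.0
theta1, theta2 = 0.5, 1.0

# Complex-multiplication view: each pair times cos(θ) + i sin(θ)
z1 = complex(q1, q2) * complex(math.cos(theta1), math.sin(theta1))
z2 = complex(q3, q4) * complex(math.cos(theta2), math.sin(theta2))

# Expanded real formulas for the real and imaginary parts
r1 = (q1 * math.cos(theta1) - q2 * math.sin(theta1),
      q1 * math.sin(theta1) + q2 * math.cos(theta1))
r2 = (q3 * math.cos(theta2) - q4 * math.sin(theta2),
      q3 * math.sin(theta2) + q4 * math.cos(theta2))

# Concatenating the rotated pairs gives back a 4-dimensional vector
q_rot = [z1.real, z1.imag, z2.real, z2.imag]
```

Each pair also keeps its magnitude: rotating [1, 2] by any angle leaves its length √5 unchanged, which is the geometric reason RoPE preserves scale.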
6. FAQ
Q1: What is the main advantage of RoPE over traditional positional embeddings?
A: RoPE uses mathematical rotations (via cosine and sine) to encode positions, allowing the model to naturally extend to sequences longer than the training limit.
Q2: How does RoPE support infinite context?
A: Since the cosine and sine functions are defined for any number, you can compute positional factors for tokens at positions far beyond the training range, preserving relative position information.
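This relative-position property can be checked numerically. In the toy sketch below (the helper names and dimensions are mine), the score between a query rotated for position 10 and a key rotated for position 3 matches the score at positions 60 and 53: only the offset of 7 matters.

```python
import torch

def freqs_cis(dim, end, theta=10000.0):
    f = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(end).float(), f)
    return torch.polar(torch.ones_like(angles), angles)

def rotate(x, fc):
    # x: (dim,) real vector; fc: (dim/2,) complex rotation factors
    z = torch.view_as_complex(x.reshape(-1, 2)) * fc
    return torch.view_as_real(z).flatten()

dim, shift = 8, 50
fc = freqs_cis(dim, 200)
q, k = torch.randn(dim), torch.randn(dim)

# Attention score between q at position 10 and k at position 3 ...
s1 = rotate(q, fc[10]) @ rotate(k, fc[3])
# ... and at positions 60 and 53: same offset, (approximately) the same score
s2 = rotate(q, fc[10 + shift]) @ rotate(k, fc[3 + shift])
```

Up to floating-point error, `s1` and `s2` agree for any common shift, which is exactly the relative-position behavior the answer describes.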
Q3: How are query and key vectors “rotated”?
A: The vectors are split into pairs and treated as complex numbers. Multiplying these by precomputed complex rotation factors cos(θ)+i sin(θ) “rotates” them in 2D planes.
Q4: Why is a causal mask used in the attention mechanism?
A: The causal mask ensures that tokens can only attend to themselves and previous tokens, preventing the leakage of future information, which is crucial for tasks like language generation.
Q5: Does applying RoPE change the dimensionality of Q/K?
A: No. RoPE rotates each pair of features. After converting back to real numbers, the shape stays the same as before. Only the values are updated to incorporate positional information.
7. Conclusion
RoPE (Rotary Position Embeddings) introduces a robust, mathematically elegant method for encoding positional information in transformer models. By replacing fixed lookup tables with dynamic rotations, RoPE handles infinite contexts and preserves relative token order more effectively — without changing the dimensionality of your Q/K vectors.