RoPE (Rotary Position Embeddings): A Detailed Example
Last Updated on November 11, 2025 by Editorial Team
Author(s): Utkarsh Mittal
Originally published on Towards AI.
In transformer models, knowing the order of tokens is essential — even though the model processes tokens in parallel. Traditional positional embeddings rely on a fixed “lookup table” (learned for positions up to a maximum length). But what if you need to work with sequences longer than the training limit? Enter RoPE (Rotary Position Embeddings).
RoPE leverages a mathematical approach (rotations via cosine and sine) to encode positions, allowing models to handle arbitrarily long sequences — “infinite context.” This guide breaks down the concepts, walks through the code with practical examples, uses visuals, and highlights how the dimension stays the same before and after the rotary transformation.
1. Why Positional Embeddings Matter
Traditional vs. Rotary Positional Embeddings
- Traditional Positional Embeddings:
Each token gets a fixed vector based on its position (e.g., positions 1–512). They work well for short sequences but cannot extrapolate to unseen positions.
- Rotary (RoPE) Positional Embeddings:
Instead of fixed values, RoPE uses rotations computed by:
θ = frequency × position
This angle is then converted into a complex number:
cos(θ) + i·sin(θ)
Because cosine and sine are defined for any real number, you can compute positions for any token — extending your model’s capability well beyond the training limit.
Visual Analogy:
Think of a traditional positional embedding like a 12-hour clock with fixed numbers. RoPE is like a continuously spinning clock hand that can indicate any time — even if it goes beyond 12.
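The clock analogy is easy to see in code. Below is a toy sketch (the frequency value, the table size, and both function names are illustrative, not taken from any real model): a lookup table fails past its last learned position, while the rotary angle θ = frequency × position is defined everywhere.

```python
import math

# Hypothetical setup: a learned table covers positions 0..511 only,
# while the rotary angle is defined for any position at all.
max_len, frequency = 512, 0.01

def table_angle(pos):
    """Stand-in for a learned lookup table: fails beyond max_len."""
    if pos >= max_len:
        raise IndexError("position outside the learned table")
    return frequency * pos

def rotary_factor(pos):
    """RoPE-style factor: theta = frequency * position, then cos/sin."""
    theta = frequency * pos
    return math.cos(theta), math.sin(theta)  # the pair (cos θ, sin θ)

# Position 1000 is far past the table, but the rotation is still well defined.
cos_t, sin_t = rotary_factor(1000)
```

The "spinning clock hand" never runs out of positions: `rotary_factor` accepts any integer, while `table_angle` raises past position 511.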
2. How RoPE Works: Code Walkthrough
Let’s explore the key functions that implement RoPE in PyTorch.
A. Precomputing Frequency Factors
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    """
    Precompute the frequency tensor for complex exponentials (cos, sin).
    Parameters:
        dim: Embedding dimension (must be even)
        end: Maximum sequence length
        theta: Base value for frequencies
    Returns:
        A complex tensor of shape (end, dim/2) with values cos(θ) + i*sin(θ)
    """
    # For each pair (since rotary works on pairs), compute the frequencies.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # Position indices
    t = torch.arange(end, device=freqs.device)
    # Outer product creates a matrix of shape (end, dim/2)
    freqs = torch.outer(t, freqs)
    # Convert to complex numbers (cosine + i*sin)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
    return freqs_cis
Key Points:
- The function splits the embedding dimension into pairs.
- It computes a frequency factor for each pair.
- The result is used to “rotate” the query and key vectors later.
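A quick sanity check makes the shapes concrete. The sketch below repeats the function and runs it on arbitrary toy sizes (dim=8, end=4 are chosen only for illustration):

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # One frequency per feature pair: theta ** (-2k/dim) for k = 0 .. dim/2 - 1
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)        # positions 0 .. end-1
    freqs = torch.outer(t, freqs)                     # angles, shape (end, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs) # cos(θ) + i*sin(θ)

freqs_cis = precompute_freqs_cis(dim=8, end=4)
print(freqs_cis.shape)  # torch.Size([4, 4]): one complex factor per (position, pair)
print(freqs_cis[0])     # position 0 rotates by angle 0, so every factor is 1+0j
```

Note that every entry has magnitude 1: each factor is a pure rotation, which is why applying it later changes values but not the scale of the vectors.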
B. Applying Rotary Embeddings
def apply_rotary_emb(xq, xk, freqs_cis):
    # Reshape the last dimension into pairs (each pair becomes a complex number)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Use the sequence length from the query (and key)
    seq_len = xq_.shape[-2]
    freqs_cis = freqs_cis[:seq_len, :]
    # Multiply the complex numbers to perform the rotation
    xq_out = torch.view_as_real(xq_ * freqs_cis.unsqueeze(0).unsqueeze(0))
    xk_out = torch.view_as_real(xk_ * freqs_cis.unsqueeze(0).unsqueeze(0))
    # Flatten the pairs back to the original shape
    xq_out = xq_out.flatten(3).type_as(xq)
    xk_out = xk_out.flatten(3).type_as(xk)
    return xq_out, xk_out
What’s Happening?
1. Reshaping: The last dimension of Q and K is divided into pairs so that each pair can be interpreted as a complex number.
2. Rotation: Multiplying these complex numbers by the frequency factors cos(θ) + i·sin(θ) “rotates” the vectors in 2D planes.
3. Reformatting: The complex values are converted back into real vectors. Crucially, the dimension remains the same as before — only the values change to incorporate positional information.
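The shape-preservation claim can be verified end to end. This self-contained sketch repeats the two helpers in condensed form and applies them to random toy tensors of shape (batch, heads, seq, head_dim); the sizes are arbitrary:

```python
import torch

def precompute_freqs_cis(dim, end, theta=10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(end).float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rotary_emb(xq, xk, freqs_cis):
    # (B, H, T, D) -> complex pairs of shape (B, H, T, D/2)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    fc = freqs_cis[: xq_.shape[-2]].unsqueeze(0).unsqueeze(0)
    xq_out = torch.view_as_real(xq_ * fc).flatten(3).type_as(xq)
    xk_out = torch.view_as_real(xk_ * fc).flatten(3).type_as(xk)
    return xq_out, xk_out

B, H, T, D = 2, 2, 4, 8  # toy sizes: batch, heads, sequence, head dimension
xq, xk = torch.randn(B, H, T, D), torch.randn(B, H, T, D)
q_rot, k_rot = apply_rotary_emb(xq, xk, precompute_freqs_cis(D, T))
print(q_rot.shape)  # torch.Size([2, 2, 4, 8]): identical to the input shape
```

Two properties worth checking: position 0 is rotated by angle 0 and therefore comes back unchanged, and the overall norm is preserved because every factor is a unit-magnitude rotation.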
C. Integration in Causal Self-Attention
Here’s a simplified snippet showing how RoPE fits into the attention mechanism:
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Number of query and key/value heads
        self.n_head = config.n_head
        self.n_kv_head = config.n_kv_head if hasattr(config, 'n_kv_head') else config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head
        # Linear projections for queries, keys, and values
        self.wq = nn.Linear(config.n_embd, self.n_head * self.head_dim, bias=getattr(config, 'bias', False))
        self.wk = nn.Linear(config.n_embd, self.n_kv_head * self.head_dim, bias=getattr(config, 'bias', False))
        self.wv = nn.Linear(config.n_embd, self.n_kv_head * self.head_dim, bias=getattr(config, 'bias', False))
        # Causal mask (ensures tokens can only attend to previous tokens)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size)
        )
        # Rotary embedding cache (computed on the first forward pass)
        self.rope_cache = None
        self.max_seq_len = config.block_size

    def forward(self, x):
        batch_size, seq_length, _ = x.shape  # x shape: (B, T, C)
        # Initialize the rotary cache if not done yet
        if self.rope_cache is None or self.rope_cache.shape[0] < self.max_seq_len:
            self.rope_cache = precompute_freqs_cis(self.head_dim, self.max_seq_len).to(x.device)
        # Compute Q, K, V projections (code omitted for brevity)
        ...
Visual Flow of the Forward Pass:
┌────────────────────────────┐
│ Input Tensor │
│ Shape: (Batch, T, C) │
└────────────┬───────────────┘
│
[Linear Projection → Q, K, V]
│
▼
┌────────────────────────────┐
│ Reshape into Multiple │
│ Heads │
└────────────┬───────────────┘
│
[Apply RoPE → Rotated Q, K]
│
▼
┌────────────────────────────┐
│ Compute Attention Scores │
│ (with causal masking) │
└────────────────────────────┘
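The elided forward pass could be completed along these lines. This is a minimal sketch, not the author's exact implementation: the `TinyRoPEAttention` class, its constructor arguments, and the simplification of using equal query and key/value head counts are all my assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def precompute_freqs_cis(dim, end, theta=10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(end).float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rotary_emb(xq, xk, freqs_cis):
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    fc = freqs_cis[: xq_.shape[-2]].unsqueeze(0).unsqueeze(0)
    xq_out = torch.view_as_real(xq_ * fc).flatten(3).type_as(xq)
    xk_out = torch.view_as_real(xk_ * fc).flatten(3).type_as(xk)
    return xq_out, xk_out

class TinyRoPEAttention(nn.Module):
    """Minimal causal self-attention with RoPE (equal q and k/v heads)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.n_head, self.head_dim = n_head, n_embd // n_head
        self.wq = nn.Linear(n_embd, n_embd, bias=False)
        self.wk = nn.Linear(n_embd, n_embd, bias=False)
        self.wv = nn.Linear(n_embd, n_embd, bias=False)
        self.register_buffer("freqs_cis", precompute_freqs_cis(self.head_dim, block_size))
        mask = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        # Project and split into heads: (B, T, C) -> (B, n_head, T, head_dim)
        q = self.wq(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # Rotate queries and keys only; values are left untouched
        q, k = apply_rotary_emb(q, k, self.freqs_cis)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v                # (B, n_head, T, head_dim)
        return y.transpose(1, 2).reshape(B, T, C)     # back to (B, T, C)

attn = TinyRoPEAttention(n_embd=8, n_head=2, block_size=16)
x = torch.randn(2, 4, 8)
out = attn(x)  # same (B, T, C) shape as the input
```

Because of the causal mask, the output at position 0 depends only on the first token, so editing a later token leaves it unchanged.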
3. A Practical Example with Toy Sentences
Imagine you have the following nine sentences:
toy_sentences = [
"The quick brown fox jumps over the lazy dog.",
"I love programming and machine learning.",
"Natural language processing is a fascinating field.",
"Transformers have changed the way we process text.",
"Artificial intelligence is the future of technology.",
"Data science combines statistics and computer science.",
"Deep learning models require large amounts of data.",
"Python is a popular language for AI research.",
"ChatGPT helps answer many complex questions."
]
Step-by-Step Example
1. Tokenization & Embedding
Each sentence is tokenized and padded/truncated to a fixed length (say, 4 tokens per sentence). Imagine each sentence becomes a 4×8 matrix.
2. Linear Projection for Keys (Detailed Example)
For a single token (e.g., “The”) with an 8-dimensional embedding [1, 2, 3, 4, 5, 6, 7, 8], the key linear layer maps this to an 8-dimensional output. That 8-dim output is then split into two key heads (each of size 4).
3. Reshaping the Projections
For the entire batch (9 sentences, 4 tokens each, 8 features per token), the key projection is initially (9, 4, 8). It is then reshaped to (9, 4, 2, 4) and transposed to (9, 2, 4, 4).
4. Applying the Rotary Embedding
Each pair of features (e.g., [q1, q2]) is treated as a complex number. Multiplying by cos(θ) + i·sin(θ) rotates that pair, embedding the positional information.
5. Causal Masking
A lower-triangular matrix ensures that tokens only attend to themselves and previous tokens, preserving the causal structure for language generation.
4. Visual Summary of the Reshaping Steps
Below is a text-based representation of the diagrams showing how we go from (9, 4, 8) → (9, 4, 2, 4) → (9, 2, 4, 4) for one batch element (one sentence). The same applies across the entire batch of 9.
1. Original Tensor (9, 4, 8)
For one batch element (Sentence 1), we might have:
Token 1: [ a1, a2, a3, a4, a5, a6, a7, a8 ]
Token 2: [ a9, a10, a11, a12, a13, a14, a15, a16 ]
Token 3: [ a17, a18, a19, a20, a21, a22, a23, a24 ]
Token 4: [ a25, a26, a27, a28, a29, a30, a31, a32 ]
2. Reshaped Tensor (9, 4, 2, 4)
We split each token’s 8 features into 2 groups (each of size 4). For one sentence:
Token 1:
Head Group 1: [ a1, a2, a3, a4 ]
Head Group 2: [ a5, a6, a7, a8 ]
Token 2:
Head Group 1: [ a9, a10, a11, a12 ]
Head Group 2: [ a13, a14, a15, a16 ]
Token 3:
Head Group 1: [ a17, a18, a19, a20 ]
Head Group 2: [ a21, a22, a23, a24 ]
Token 4:
Head Group 1: [ a25, a26, a27, a28 ]
Head Group 2: [ a29, a30, a31, a32 ]
3. Transposed Tensor (9, 2, 4, 4)
Next, we transpose so that the head dimension (2) comes right after the batch dimension (9). Now it’s (9, 2, 4, 4). For one sentence:
Head 1 (shape: 4 x 4):
Token 1: [ a1, a2, a3, a4 ]
Token 2: [ a9, a10, a11, a12 ]
Token 3: [ a17, a18, a19, a20 ]
Token 4: [ a25, a26, a27, a28 ]
Head 2 (shape: 4 x 4):
Token 1: [ a5, a6, a7, a8 ]
Token 2: [ a13, a14, a15, a16 ]
Token 3: [ a21, a22, a23, a24 ]
Token 4: [ a29, a30, a31, a32 ]
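The index bookkeeping above can be verified mechanically by numbering the features 1 through 32 (standing in for a1 through a32) and tracing them through the reshape and transpose:

```python
import torch

# One value per feature: sentence 1 holds 1..32, sentence 2 holds 33..64, etc.
x = torch.arange(1, 9 * 4 * 8 + 1).reshape(9, 4, 8)  # (batch, tokens, features)
split = x.reshape(9, 4, 2, 4)   # (batch, tokens, heads, head_dim)
heads = split.transpose(1, 2)   # (batch, heads, tokens, head_dim)

print(heads[0, 0, 1])  # Head 1, Token 2 -> tensor([ 9, 10, 11, 12])
print(heads[0, 1, 0])  # Head 2, Token 1 -> tensor([5, 6, 7, 8])
```

The printed rows match the diagram: a9–a12 land in Head 1's second row, and a5–a8 land in Head 2's first row.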
5. Detailed Numeric Example of the Pairwise Rotation
A key insight is that the dimension stays the same before and after RoPE. Let’s illustrate this with a smaller dimension (4) for clarity:
Setup
- Suppose our head dimension is 4 (so we have 2 “pairs”).
- We take a single token’s query vector, Q=[q1,q2,q3,q4].
- We’ll apply RoPE with angles θ1 and θ2 for the two pairs.
Step 1: Split into Pairs
We group the vector into:
- Pair 1: [q1,q2]
- Pair 2: [q3,q4]
Step 2: Interpret Each Pair as a Complex Number
- z1=q1+i q2
- z2=q3+i q4
Step 3: Multiply by the Rotation Factors
Each rotation factor has the form cos(θ) + i sin(θ); pair 1 is rotated by θ1 and pair 2 by θ2.
Pair 1
z1,rot = (q1 + i q2) × [cos(θ1) + i sin(θ1)]
Expanding this:
ℜ(z1,rot) = q1 cos(θ1) − q2 sin(θ1)
ℑ(z1,rot) = q1 sin(θ1) + q2 cos(θ1)
Pair 2
z2,rot = (q3 + i q4) × [cos(θ2) + i sin(θ2)]
Similarly:
ℜ(z2,rot) = q3 cos(θ2) − q4 sin(θ2)
ℑ(z2,rot) = q3 sin(θ2) + q4 cos(θ2)
Step 4: Convert Back to Real Vectors
After rotation, we end up with two new pairs:
Rotated Pair 1: [ℜ(z1,rot), ℑ(z1,rot)]
Rotated Pair 2: [ℜ(z2,rot), ℑ(z2,rot)]
Concatenating them back yields a 4-dimensional vector:
Qrot=[q1′,q2′,q3′,q4′].
Hence, the shape is the same as the original [q1,q2,q3,q4]. We haven’t lost or gained dimensions; we’ve merely rotated them in 2D planes.
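The expansion above can be checked numerically. The values of q1 through q4, θ1, and θ2 below are made up for illustration; the point is that the complex multiplication and the expanded real formulas agree, and that the result is still 4-dimensional.

```python
import math

# Toy 4-dim query and two angles (arbitrary illustrative values)
q1, q2, q3, q4 = 1.0, 2.0, 3.0, 4.0
theta1, theta2 = 0.5, 1.0

# Complex-multiplication view: each pair times cos(θ) + i sin(θ)
z1 = complex(q1, q2) * complex(math.cos(theta1), math.sin(theta1))
z2 = complex(q3, q4) * complex(math.cos(theta2), math.sin(theta2))

# Expanded real formulas for the real and imaginary parts
r1 = (q1 * math.cos(theta1) - q2 * math.sin(theta1),
      q1 * math.sin(theta1) + q2 * math.cos(theta1))
r2 = (q3 * math.cos(theta2) - q4 * math.sin(theta2),
      q3 * math.sin(theta2) + q4 * math.cos(theta2))

# Concatenating the rotated pairs gives back a 4-dimensional vector
q_rot = [z1.real, z1.imag, z2.real, z2.imag]
```

Each pair also keeps its magnitude: rotating [1, 2] by any angle leaves its length √5 unchanged, which is the geometric reason RoPE preserves scale.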
6. FAQ
Q1: What is the main advantage of RoPE over traditional positional embeddings?
A: RoPE uses mathematical rotations (via cosine and sine) to encode positions, allowing the model to naturally extend to sequences longer than the training limit.
Q2: How does RoPE support infinite context?
A: Since the cosine and sine functions are defined for any number, you can compute positional factors for tokens at positions far beyond the training range, preserving relative position information.
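This relative-position property can be checked numerically. In the toy sketch below (the helper names and dimensions are mine), the score between a query rotated for position 10 and a key rotated for position 3 matches the score at positions 60 and 53: only the offset of 7 matters.

```python
import torch

def freqs_cis(dim, end, theta=10000.0):
    f = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(end).float(), f)
    return torch.polar(torch.ones_like(angles), angles)

def rotate(x, fc):
    # x: (dim,) real vector; fc: (dim/2,) complex rotation factors
    z = torch.view_as_complex(x.reshape(-1, 2)) * fc
    return torch.view_as_real(z).flatten()

dim, shift = 8, 50
fc = freqs_cis(dim, 200)
q, k = torch.randn(dim), torch.randn(dim)

# Attention score between q at position 10 and k at position 3 ...
s1 = rotate(q, fc[10]) @ rotate(k, fc[3])
# ... and at positions 60 and 53: same offset, (approximately) the same score
s2 = rotate(q, fc[10 + shift]) @ rotate(k, fc[3 + shift])
```

Up to floating-point error, `s1` and `s2` agree for any common shift, which is exactly the relative-position behavior the answer describes.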
Q3: How are query and key vectors “rotated”?
A: The vectors are split into pairs and treated as complex numbers. Multiplying these by precomputed complex rotation factors cos(θ)+i sin(θ) “rotates” them in 2D planes.
Q4: Why is a causal mask used in the attention mechanism?
A: The causal mask ensures that tokens can only attend to themselves and previous tokens, preventing the leakage of future information, which is crucial for tasks like language generation.
Q5: Does applying RoPE change the dimensionality of Q/K?
A: No. RoPE rotates each pair of features. After converting back to real numbers, the shape stays the same as before. Only the values are updated to incorporate positional information.
7. Conclusion
RoPE (Rotary Position Embeddings) introduces a robust, mathematically elegant method for encoding positional information in transformer models. By replacing fixed lookup tables with dynamic rotations, RoPE handles infinite contexts and preserves relative token order more effectively — without changing the dimensionality of your Q/K vectors.