
RoPE (Rotary Position Embeddings): A Detailed Example

Last Updated on November 11, 2025 by Editorial Team

Author(s): Utkarsh Mittal

Originally published on Towards AI.

In transformer models, knowing the order of tokens is essential — even though the model processes tokens in parallel. Traditional positional embeddings rely on a fixed “lookup table” (learned for positions up to a maximum length). But what if you need to work with sequences longer than the training limit? Enter RoPE (Rotary Position Embeddings).

RoPE leverages a mathematical approach (rotations via cosine and sine) to encode positions, allowing models to handle arbitrarily long sequences — “infinite context.” This guide breaks down the concepts, walks through the code with practical examples, uses visuals, and highlights how the dimension stays the same before and after the rotary transformation.

1. Why Positional Embeddings Matter

Traditional vs. Rotary Positional Embeddings

  • Traditional Positional Embeddings:
    Each token gets a fixed vector based on its position (e.g., positions 1–512). They work well for short sequences but cannot extrapolate to unseen positions.
  • Rotary (RoPE) Positional Embeddings:
    Instead of fixed values, RoPE uses rotations computed by:

θ = frequency × position

This angle is then converted into a complex number:

cos(θ) + i·sin(θ)

  • Because cosine and sine are defined for any real number, you can compute positions for any token — extending your model’s capability well beyond the training limit.

Visual Analogy:
Think of a traditional positional embedding like a 12-hour clock with fixed numbers. RoPE is like a continuously spinning clock hand that can indicate any time — even if it goes beyond 12.
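The spinning-clock intuition is easy to check numerically. Below is a minimal sketch (the helper `rope_angle` and the dimension of 8 are illustrative choices, not from the article's code): the angle is a smooth function of position, so it is just as well defined at position 10,000 as at position 100, with no lookup table to run out of.

```python
import math

# Hypothetical helper: the RoPE angle for one feature pair.
# frequency = theta^(-2k/dim) for pair index k, as in the next section's code.
def rope_angle(position: int, pair_index: int, dim: int = 8, theta: float = 10000.0) -> float:
    freq = 1.0 / (theta ** (2 * pair_index / dim))
    return freq * position

# Defined equally well inside and far beyond any nominal training length:
inside = rope_angle(position=100, pair_index=0)      # 100.0 radians
beyond = rope_angle(position=10_000, pair_index=0)   # 10000.0 radians
```

Nothing special happens at a "maximum position"; the cosine and sine of these angles always exist.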

2. How RoPE Works: Code Walkthrough

Let’s explore the key functions that implement RoPE in PyTorch.

A. Precomputing Frequency Factors

import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    """
    Precompute the frequency tensor for complex exponentials (cos, sin).

    Parameters:
        dim: Embedding dimension (must be even)
        end: Maximum sequence length
        theta: Base value for frequencies

    Returns:
        A complex tensor of shape (end, dim/2) with values cos(θ) + i*sin(θ)
    """
    # For each pair (since rotary works on pairs), compute the frequencies.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # Position indices
    t = torch.arange(end, device=freqs.device)
    # Outer product creates a matrix of shape (end, dim/2)
    freqs = torch.outer(t, freqs)
    # Convert to complex numbers (cos + i*sin)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
    return freqs_cis

Key Points:

  • The function splits the embedding dimension into pairs.
  • It computes a frequency factor for each pair.
  • The result is used to “rotate” the query and key vectors later.
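A quick sanity check makes the output concrete (the function is repeated here so the snippet runs standalone; `dim=8` and `end=16` are arbitrary illustrative values):

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # One frequency per feature pair, then one angle per (position, pair).
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)
    freqs = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(freqs), freqs)

freqs_cis = precompute_freqs_cis(dim=8, end=16)
print(freqs_cis.shape)   # torch.Size([16, 4]): one complex factor per (position, pair)
# Position 0 means rotation by zero: cos(0) + i*sin(0) = 1 for every pair.
print(freqs_cis[0])
```

Each row is one position; each column is one feature pair rotating at its own frequency.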

B. Applying Rotary Embeddings

def apply_rotary_emb(xq, xk, freqs_cis):
    # Reshape the last dimension into pairs (each pair becomes a complex number)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Use the sequence length from the query (and key)
    seq_len = xq_.shape[-2]
    freqs_cis = freqs_cis[:seq_len, :]

    # Multiply the complex numbers to perform the rotation
    xq_out = torch.view_as_real(xq_ * freqs_cis.unsqueeze(0).unsqueeze(0))
    xk_out = torch.view_as_real(xk_ * freqs_cis.unsqueeze(0).unsqueeze(0))

    # Flatten back to the original shape
    xq_out = xq_out.flatten(3).type_as(xq)
    xk_out = xk_out.flatten(3).type_as(xk)

    return xq_out, xk_out

What’s Happening?

  1. Reshaping: The last dimension of Q and K is divided into pairs so that each pair can be interpreted as a complex number.
  2. Rotation: Multiplying these complex numbers by the frequency factors
(cos(θ) + i·sin(θ))

“rotates” the vectors in 2D planes.

  3. Reformatting: The complex values are converted back into real vectors. Crucially, the dimension remains the same as before — only the values change to incorporate positional information.
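The three steps can be demonstrated in isolation. The sketch below uses hypothetical shapes (batch 2, 3 heads, 5 tokens, head dimension 8) and illustrative rotation factors rather than the real frequency schedule; it checks both properties claimed above: the shape is unchanged, and because each pair is multiplied by a unit complex number, each vector's length is preserved too.

```python
import torch

B, H, T, D = 2, 3, 5, 8                  # hypothetical batch, heads, tokens, head_dim (even)
xq = torch.randn(B, H, T, D)

# Illustrative rotation factors: angle = position for every pair (not the real schedule)
angles = torch.outer(torch.arange(T).float(), torch.ones(D // 2))
freqs_cis = torch.polar(torch.ones_like(angles), angles)     # (T, D//2), complex

# Step 1: reshape pairs into complex numbers
xq_ = torch.view_as_complex(xq.reshape(B, H, T, D // 2, 2))  # (B, H, T, D//2)
# Step 2: rotate (broadcasts over batch and heads)
xq_rot = torch.view_as_real(xq_ * freqs_cis)
# Step 3: flatten back to real vectors
xq_out = xq_rot.flatten(3)

assert xq_out.shape == xq.shape          # dimension unchanged
# Unit-complex multiplication is a rotation, so norms are preserved
assert torch.allclose(xq_out.norm(dim=-1), xq.norm(dim=-1), atol=1e-5)
```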

C. Integration in Causal Self-Attention

Here’s a simplified snippet showing how RoPE fits into the attention mechanism:

import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Number of query and key/value heads
        self.n_head = config.n_head
        self.n_kv_head = config.n_kv_head if hasattr(config, 'n_kv_head') else config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head

        # Linear projections for queries, keys, and values
        self.wq = nn.Linear(config.n_embd, self.n_head * self.head_dim, bias=getattr(config, 'bias', False))
        self.wk = nn.Linear(config.n_embd, self.n_kv_head * self.head_dim, bias=getattr(config, 'bias', False))
        self.wv = nn.Linear(config.n_embd, self.n_kv_head * self.head_dim, bias=getattr(config, 'bias', False))

        # Causal mask (ensures tokens can only attend to previous tokens)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size)
        )

        # Rotary embedding cache (computed on first forward pass)
        self.rope_cache = None
        self.max_seq_len = config.block_size

    def forward(self, x):
        batch_size, seq_length, _ = x.shape  # x shape: (B, T, C)

        # Initialize the rotary cache if not done
        if self.rope_cache is None or self.rope_cache.shape[0] < self.max_seq_len:
            self.rope_cache = precompute_freqs_cis(self.head_dim, self.max_seq_len).to(x.device)

        # Compute Q, K, V projections (code omitted for brevity)
        ...

Visual Flow of the Forward Pass:

 ┌────────────────────────────┐
 │ Input Tensor               │
 │ Shape: (Batch, T, C)       │
 └────────────┬───────────────┘
              │
 [Linear Projection → Q, K, V]
              │
 ┌────────────────────────────┐
 │ Reshape into Multiple      │
 │ Heads                      │
 └────────────┬───────────────┘
              │
 [Apply RoPE → Rotated Q, K]
              │
 ┌────────────────────────────┐
 │ Compute Attention Scores   │
 │ (with causal masking)      │
 └────────────────────────────┘
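The flow in the diagram can be sketched end to end. This is a simplified, hypothetical completion (standalone projections instead of the module above, and the RoPE rotation itself elided) just to make the shape bookkeeping explicit:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: batch 2, 6 tokens, 4 heads of dimension 8 (so C = 32)
B, T, n_head, head_dim = 2, 6, 4, 8
C = n_head * head_dim
x = torch.randn(B, T, C)

wq = torch.nn.Linear(C, C, bias=False)
wk = torch.nn.Linear(C, C, bias=False)
wv = torch.nn.Linear(C, C, bias=False)

# Linear projection, then reshape into heads: (B, T, C) -> (B, n_head, T, head_dim)
q = wq(x).view(B, T, n_head, head_dim).transpose(1, 2)
k = wk(x).view(B, T, n_head, head_dim).transpose(1, 2)
v = wv(x).view(B, T, n_head, head_dim).transpose(1, 2)

# RoPE would rotate q and k here (see apply_rotary_emb earlier in the article)

# Attention scores with causal masking
att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float('-inf'))
y = F.softmax(att, dim=-1) @ v               # (B, n_head, T, head_dim)
y = y.transpose(1, 2).reshape(B, T, C)       # merge heads back to (B, T, C)
```

Note that RoPE slots in after the reshape and before the score computation, touching only q and k.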

3. A Practical Example with Toy Sentences

Imagine you have the following nine sentences:

toy_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I love programming and machine learning.",
    "Natural language processing is a fascinating field.",
    "Transformers have changed the way we process text.",
    "Artificial intelligence is the future of technology.",
    "Data science combines statistics and computer science.",
    "Deep learning models require large amounts of data.",
    "Python is a popular language for AI research.",
    "ChatGPT helps answer many complex questions."
]

Step-by-Step Example

  1. Tokenization & Embedding
    Each sentence is tokenized and padded/truncated to a fixed length (say, 4 tokens per sentence). Imagine each sentence becomes a 4×8 matrix.
  2. Linear Projection for Keys (Detailed Example)
    For a single token (e.g., “The”) with an 8-dimensional embedding [1, 2, 3, 4, 5, 6, 7, 8], the key linear layer maps this to an 8-dimensional output. That 8-dim output is then split into two key heads (each of size 4).
  3. Reshaping the Projections
    For the entire batch (9 sentences, 4 tokens each, 8 features per token), the key projection is initially (9, 4, 8). It is then reshaped to (9, 4, 2, 4) and transposed to (9, 2, 4, 4).
  4. Applying the Rotary Embedding
    Each pair of features (e.g., [q1, q2]) is treated as a complex number. Multiplying by
cos(θ) + i·sin(θ)

rotates that pair, embedding the positional information.


  5. Causal Masking
    A lower-triangular matrix ensures that tokens only attend to themselves and previous tokens, preserving the causal structure for language generation.
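Step 2 above can be made concrete in a few lines. The weights here are random (hypothetical), so the output values are arbitrary; the point is the shapes: 8 features in, 8 key features out, split into two heads of size 4.

```python
import torch

# Hypothetical 8-dim embedding for the token "The"
token = torch.tensor([1., 2., 3., 4., 5., 6., 7., 8.])

# Key projection with illustrative random weights (no bias)
torch.manual_seed(0)
wk = torch.nn.Linear(8, 8, bias=False)

key = wk(token)                   # 8-dim key vector
head1, head2 = key.view(2, 4)     # two key heads, each of size 4
```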

4. Visual Summary of the Reshaping Steps

Below is a text-based representation of the diagrams showing how we go from (9, 4, 8) → (9, 4, 2, 4) → (9, 2, 4, 4) for one batch element (one sentence). The same applies across the entire batch of 9.

1. Original Tensor (9, 4, 8)

For one batch element (Sentence 1), we might have:

Token 1: [ a1, a2, a3, a4, a5, a6, a7, a8 ]
Token 2: [ a9, a10, a11, a12, a13, a14, a15, a16 ]
Token 3: [ a17, a18, a19, a20, a21, a22, a23, a24 ]
Token 4: [ a25, a26, a27, a28, a29, a30, a31, a32 ]

2. Reshaped Tensor (9, 4, 2, 4)

We split each token’s 8 features into 2 groups (each of size 4). For one sentence:

Token 1:
Head Group 1: [ a1, a2, a3, a4 ]
Head Group 2: [ a5, a6, a7, a8 ]

Token 2:
Head Group 1: [ a9, a10, a11, a12 ]
Head Group 2: [ a13, a14, a15, a16 ]

Token 3:
Head Group 1: [ a17, a18, a19, a20 ]
Head Group 2: [ a21, a22, a23, a24 ]

Token 4:
Head Group 1: [ a25, a26, a27, a28 ]
Head Group 2: [ a29, a30, a31, a32 ]

3. Transposed Tensor (9, 2, 4, 4)

Next, we transpose so that the head dimension (2) comes right after the batch dimension (9). Now it’s (9, 2, 4, 4). For one sentence:

Head 1 (shape: 4 x 4):
Token 1: [ a1, a2, a3, a4 ]
Token 2: [ a9, a10, a11, a12 ]
Token 3: [ a17, a18, a19, a20 ]
Token 4: [ a25, a26, a27, a28 ]

Head 2 (shape: 4 x 4):
Token 1: [ a5, a6, a7, a8 ]
Token 2: [ a13, a14, a15, a16 ]
Token 3: [ a21, a22, a23, a24 ]
Token 4: [ a29, a30, a31, a32 ]
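The two reshaping steps above are a single `reshape` and a single `transpose` in PyTorch. Filling the tensor with the running values a1…a32 per sentence (a hypothetical numbering, matching the diagrams) makes it easy to see which features land in which head:

```python
import torch

# 9 sentences, 4 tokens, 8 features per token, filled with 1, 2, 3, ...
x = torch.arange(1, 9 * 4 * 8 + 1).reshape(9, 4, 8)

heads = x.reshape(9, 4, 2, 4)      # split each token's 8 features into 2 head groups
per_head = heads.transpose(1, 2)   # (9, 2, 4, 4): head dimension moves before tokens

# Head 1 of sentence 1 holds the first half of every token's features:
# rows [1, 2, 3, 4], [9, 10, 11, 12], [17, 18, 19, 20], [25, 26, 27, 28]
print(per_head[0, 0])
```

This matches the Head 1 block shown above: the transpose regroups the data per head without copying any values around within a token.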

5. Detailed Numeric Example of the Pairwise Rotation

A key insight is that the dimension stays the same before and after RoPE. Let’s illustrate this with a smaller dimension (4) for clarity:

Setup

  • Suppose our head dimension is 4 (so we have 2 “pairs”).
  • We take a single token’s query vector, Q=[q1,q2,q3,q4].
  • We’ll apply RoPE with angles θ1 and θ2 for the two pairs.

Step 1: Split into Pairs

We group the vector into:

  • Pair 1: [q1,q2]
  • Pair 2: [q3,q4]

Step 2: Interpret Each Pair as a Complex Number

  • z1 = q1 + i·q2
  • z2 = q3 + i·q4

Step 3: Multiply by the Rotation Factors

Each rotation factor is cos(θ) + i·sin(θ): pair 1 is rotated by angle θ1, and pair 2 by angle θ2.

Pair 1

z1,rot = (q1 + i·q2) × [cos(θ1) + i·sin(θ1)]

Expanding this:

ℜ(z1,rot) = q1·cos(θ1) − q2·sin(θ1),
ℑ(z1,rot) = q1·sin(θ1) + q2·cos(θ1).

Pair 2

z2,rot = (q3 + i·q4) × [cos(θ2) + i·sin(θ2)]

Similarly,

ℜ(z2,rot) = q3·cos(θ2) − q4·sin(θ2),
ℑ(z2,rot) = q3·sin(θ2) + q4·cos(θ2).

Step 4: Convert Back to Real Vectors

After rotation, we end up with two new pairs:

Rotated Pair 1: [ℜ(z1,rot), ℑ(z1,rot)]

Rotated Pair 2: [ℜ(z2,rot), ℑ(z2,rot)]

Concatenating them back yields a 4-dimensional vector:

Qrot = [q1′, q2′, q3′, q4′].

Hence, the shape is the same as the original [q1,q2,q3,q4]. We haven’t lost or gained dimensions; we’ve merely rotated them in 2D planes.
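The whole numeric example fits in a few lines of standard-library Python. The concrete values (q = [1, 2, 3, 4], θ1 = 0.5, θ2 = 1.2) are illustrative; the asserts confirm that the complex multiplication reproduces the expanded real/imaginary formulas above and that the output stays 4-dimensional:

```python
import math
import cmath

# Hypothetical query vector and rotation angles
q1, q2, q3, q4 = 1.0, 2.0, 3.0, 4.0
theta1, theta2 = 0.5, 1.2

# Rotate each pair as a complex number: multiply by e^{iθ} = cos(θ) + i·sin(θ)
z1_rot = complex(q1, q2) * cmath.exp(1j * theta1)
z2_rot = complex(q3, q4) * cmath.exp(1j * theta2)

# Convert back to a real vector: same dimensionality as [q1, q2, q3, q4]
q_rot = [z1_rot.real, z1_rot.imag, z2_rot.real, z2_rot.imag]
assert len(q_rot) == 4

# Matches the expanded formulas for pair 1
assert math.isclose(q_rot[0], q1 * math.cos(theta1) - q2 * math.sin(theta1))
assert math.isclose(q_rot[1], q1 * math.sin(theta1) + q2 * math.cos(theta1))
```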

6. FAQ

Q1: What is the main advantage of RoPE over traditional positional embeddings?
A: RoPE uses mathematical rotations (via cosine and sine) to encode positions, allowing the model to naturally extend to sequences longer than the training limit.

Q2: How does RoPE support infinite context?
A: Since the cosine and sine functions are defined for any number, you can compute positional factors for tokens at positions far beyond the training range, preserving relative position information.

Q3: How are query and key vectors “rotated”?
A: The vectors are split into pairs and treated as complex numbers. Multiplying these by precomputed complex rotation factors cos(θ) + i·sin(θ) “rotates” them in 2D planes.

Q4: Why is a causal mask used in the attention mechanism?
A: The causal mask ensures that tokens can only attend to themselves and previous tokens, preventing the leakage of future information, which is crucial for tasks like language generation.

Q5: Does applying RoPE change the dimensionality of Q/K?
A: No. RoPE rotates each pair of features. After converting back to real numbers, the shape stays the same as before. Only the values are updated to incorporate positional information.

7. Conclusion

RoPE (Rotary Position Embeddings) introduces a robust, mathematically elegant method for encoding positional information in transformer models. By replacing fixed lookup tables with dynamic rotations, RoPE handles infinite contexts and preserves relative token order more effectively — without changing the dimensionality of your Q/K vectors.


Published via Towards AI

