Why ChatGPT Feels Like Magic While Siri Feels Dumb.

Last Updated on January 2, 2026 by Editorial Team

Author(s): Suchitra Malimbada

Originally published on Towards AI.

Understanding the fundamental architectural shift from sequential processing to parallel attention, and why it enabled GPT-5’s capabilities that were impossible for LSTMs


GPT-5 scored 100% on the 2025 American Invitational Mathematics Examination. It handles 400,000-token contexts without breaking a sweat. It learns new tasks from examples in the prompt without weight updates. Ten years ago, the state-of-the-art LSTM struggled with dependencies spanning more than 13 tokens.

The gap isn’t just about scale. Throwing 1.8 trillion parameters at an LSTM architecture wouldn’t give us GPT-5’s capabilities. The difference is architectural, and it runs deeper than most explanations suggest. This isn’t about transformers being “better” at the same task. Transformers enable algorithms that have no LSTM equivalent.

Consider what happens when GPT-5 processes the sentence “The trophy would not fit in the suitcase because it was too large.” To resolve what “it” refers to, the model performs a single parallel computation across all tokens. The connection from “it” to “trophy” happens in constant time, regardless of distance. An LSTM compresses information about “trophy” and “suitcase” into a fixed-size hidden state before seeing “it,” passing through eight intermediate states that mix and compress at each step. The architecture fundamentally cannot perform the same computation.

This architectural distinction cascades into everything modern LLMs do. In-context learning requires comparing tokens directly to recognize structure. Chain-of-thought reasoning requires attending back to previously generated steps. Both capabilities emerge from transformer architectures and remain impossible for RNNs, regardless of parameter count.

Table of Contents

  1. The Mathematics of Information Flow
  2. Why Sequential Processing Creates a Ceiling
  3. Context Windows: 13 Tokens vs 400,000 Tokens
  4. Emergent Capabilities That Need Attention
  5. Scale Enablement and Parallelization
  6. Real-World Evolution: Siri’s Transformation
  7. Architecture Determines Capability Ceilings

The Mathematics of Information Flow

Self-Attention: Parallel Information Access

The transformer’s self-attention mechanism computes relationships between all positions simultaneously:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Every query vector computes dot products with every key vector in a single matrix multiplication, producing an n×n attention matrix in a constant number of sequential operations. When GPT-5 processes a 400,000-token document, position 399,999 can attend directly to position 0. The path length between any two tokens is always one.

For a sequence of length n and dimensionality d, self-attention requires O(n²·d) operations. The quadratic scaling makes naive attention expensive, but O(1) sequential operations mean every computation runs in parallel across thousands of GPU cores. GPT-5 trains on clusters with tens of thousands of accelerators, computing attention for billions of tokens simultaneously.
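The formula above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production kernel: the toy matrices Q, K, and V are invented, there is no causal mask or multi-head split, and the nested loops stand in for what real implementations run as batched matrix multiplications.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    # n x n score matrix: every query is compared against every key.
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of all value vectors.
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
               for j in range(len(V[0]))] for w_row in weights]
    return output, weights

# Hypothetical toy inputs: 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, and no row depends on any other row's result. That independence is what lets hardware compute all positions at once.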

LSTM: Sequential Compression

The LSTM processes sequences through recurrent updates to a hidden state:

f_t = σ(W_f [h_{t-1}, x_t] + b_f) # forget gate
i_t = σ(W_i [h_{t-1}, x_t] + b_i) # input gate
o_t = σ(W_o [h_{t-1}, x_t] + b_o) # output gate

C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C) # candidate cell state

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t # cell state update
h_t = o_t ⊙ tanh(C_t) # hidden state

At each timestep, the network updates its hidden state based on the previous state and the current input. Information from early tokens must pass through every intermediate state to reach later positions. For a sequence of length n, the path length is O(n). Processing 400,000 tokens requires 400,000 dependent steps, each waiting on the one before it. Within a single sequence, a modern GPU’s 10,000+ cores have little to do while the LSTM advances one timestep at a time.
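The sequential dependency is easiest to see in code. The sketch below assumes a scalar hidden state, so each weight matrix collapses to a pair of numbers. The weights in `params` and the values in `inputs` are made up for illustration. What matters is the loop: step t cannot start until step t-1 has produced `h` and `c`.

```python
import math

def sigmoid(x):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))

def pre_activation(name, h_prev, x, params):
    """Scalar version of W [h_{t-1}, x_t] + b for one gate."""
    w_h, w_x, b = params[name]
    return w_h * h_prev + w_x * x + b

def lstm_step(h_prev, c_prev, x, params):
    """One LSTM timestep following the gate equations above."""
    f = sigmoid(pre_activation("f", h_prev, x, params))          # forget gate
    i = sigmoid(pre_activation("i", h_prev, x, params))          # input gate
    o = sigmoid(pre_activation("o", h_prev, x, params))          # output gate
    c_tilde = math.tanh(pre_activation("c", h_prev, x, params))  # candidate
    c = f * c_prev + i * c_tilde                                 # cell update
    h = o * math.tanh(c)                                         # hidden state
    return h, c

# Hypothetical weights (w_h, w_x, b) per gate and a toy input sequence.
params = {name: (0.5, 0.5, 0.0) for name in ("f", "i", "o", "c")}
inputs = [0.2, -0.4, 0.9, 0.1, -0.7]

h, c = 0.0, 0.0
steps = 0
for x in inputs:
    # Each iteration consumes the h and c produced by the previous one.
    h, c = lstm_step(h, c, x, params)
    steps += 1
```

After the loop, `steps` equals the sequence length, and everything the model knows about the five inputs lives in the two numbers `h` and `c`.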

The Path Length Problem

Path length determines what dependencies a network can learn. Gradients during backpropagation must travel the same paths as forward information flow. In an LSTM processing a 100-token sequence, gradients pass through 100 hidden state updates to reach the first token. When each passage multiplies the gradient by a factor less than one, the product shrinks exponentially. After 20–30 steps, gradients can decay by a factor of 10⁷.

The LSTM’s gates mitigate vanishing gradients to a point, enabling dependencies across 20–50 tokens where vanilla RNNs fail at 5–10. But the fundamental issue remains: information must traverse n states, and gradients must travel back along that same path.

Transformers sidestep this entirely. Every position connects to every other position with path length one. Gradients flow directly from outputs to inputs without compression, making long-range dependency learning tractable regardless of sequence length.
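A quick calculation shows how fast repeated multiplication shrinks a gradient along the recurrent path. The per-step factor of 0.5 is an assumption chosen only for illustration. Real factors vary by step and depend on the gate activations.

```python
def gradient_scale(per_step_factor, steps):
    """Fraction of the gradient left after `steps` identical multiplications."""
    return per_step_factor ** steps

# Assumed constant per-step factor of 0.5 (illustrative, not measured).
for steps in (1, 10, 23, 50, 100):
    print(steps, gradient_scale(0.5, steps))
```

With this assumed factor, 23 steps leave about 1.2 × 10⁻⁷ of the original gradient. A path of length one involves a single multiplication.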

Why Sequential Processing Creates a Ceiling

The Fixed-Size Hidden State

An LSTM with hidden dimension 1024 compresses all previous context into 1024 floating-point numbers. When processing position 1000, the hidden state contains a lossy summary of the previous 999 tokens.

UC Berkeley research quantified this bottleneck precisely. They trained LSTM language models with different n-gram orders, where the model could only look back n tokens. An LSTM with arbitrary context length performed identically to an LSTM with n=13. Beyond 13 tokens, additional context provided no benefit. The hidden state saturated.

This isn’t a training failure. It’s architectural. Modern transformers use attention over 400,000 tokens because tasks require it. Document summarization, codebase understanding, and long-form reasoning need to reference information thousands of tokens in the past.

Information Loss and Parallel Computation

Each LSTM timestep applies a learned compression function combining the previous hidden state with current input. These functions are optimized during training but remain lossy. Information about which specific words appeared and in what order degrades with each state transition.

Transformers never compress context. The full sequence of embeddings remains accessible at every layer. When GPT-5 generates token 50,000, it can attend back to token 1 with the same fidelity as token 49,999.

Consider translating a technical document with acronyms. “The Convolutional Neural Network (CNN) architecture…” appears at token 100. At token 5,000, the text references “the CNN.” A transformer attends directly back to token 100. An LSTM’s hidden state has compressed away the specifics, likely failing the translation.

The sequential bottleneck prevents LSTMs from leveraging parallel hardware. Training GPT-5’s 1.8 trillion parameters required processing trillions of tokens across thousands of GPUs, each computing attention for different positions simultaneously. An LSTM processing 400,000 tokens runs 400,000 sequential steps. No amount of GPUs can parallelize this dependency chain. Training an LSTM with 1.8 trillion parameters would require decades on current hardware.

Context Windows: 13 Tokens vs 400,000 Tokens

Berkeley’s finding that LSTMs saturate at 13-token context has profound implications. The model effectively forgets everything before the most recent 13 tokens. Document-level translation must maintain terminology consistency across paragraphs. Question answering must locate information thousands of words before the question. Code generation must reference earlier function definitions. All require context windows far exceeding 13 tokens.

GPT-5’s 400,000-token context window represents a 30,000x increase. The model processes entire books, large codebases, or extended conversations while maintaining coherent references. Modern transformers employ several techniques to manage O(n²) complexity: sparse attention patterns, FlashAttention kernel optimizations, Mixture of Experts layers, and Rotary Position Embeddings. These optimizations work because the fundamental architecture permits them.
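The figures in this section can be checked with simple arithmetic. The sketch below reproduces the 30,000x ratio and estimates the size of one full n×n attention score matrix at 400,000 tokens. The 2 bytes per score is an assumption (16-bit floats), and the estimate covers a single head in a single layer.

```python
def attention_matrix_entries(n):
    """Number of scores in a full n x n attention matrix."""
    return n * n

lstm_context = 13
gpt_context = 400_000

ratio = gpt_context / lstm_context               # about 30,769
entries = attention_matrix_entries(gpt_context)  # 1.6e11 scores
bytes_per_score = 2                              # assumption: 16-bit floats
gigabytes = entries * bytes_per_score / 1e9      # about 320 GB

print(round(ratio), entries, gigabytes)
```

At roughly 320 GB per head per layer under these assumptions, storing the full matrix is impractical. Kernels such as FlashAttention avoid this by computing attention in tiles without ever writing the whole matrix to memory.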

GPT-5 can read entire research papers and answer questions connecting introduction to conclusion. It analyzes 200,000-token codebases to identify module interactions. It maintains coherent plot threads across 100,000+ token creative writing.

LSTM-based systems required architectural hacks: hierarchical models compressing chunks into summaries, attention mechanisms bolted onto RNNs, explicit memory modules outside the hidden state. These workarounds added complexity and often failed on tasks requiring flexible information retrieval. The architecture fought against task requirements.

Emergent Capabilities That Need Attention

Certain capabilities emerge in large transformers that have no LSTM equivalent. These aren’t quantitative improvements. They represent qualitatively different behaviors the LSTM architecture cannot express.

In-Context Learning

Show GPT-5 a few examples of a task in the prompt, and it performs the task on new inputs without gradient updates. This capability appears around 1 billion parameters and strengthens with scale.

Example: Provide three translation pairs —

“sea otter” = “loutre de mer”
“cheese” = “fromage”
“laptop” = “ordinateur portable”,

then prompt with “bicycle” = ?. GPT-5 outputs “vélo” despite never being explicitly trained on this translation task.

The mechanism involves induction heads: circuits spanning two attention layers where the first layer identifies repeated patterns and the second completes them based on what followed previous instances. Anthropic research demonstrated these circuits form during training and implement pattern matching via attention.

LSTMs cannot implement induction heads. The algorithm requires comparing the current token directly to all previous tokens to identify matches, then attending to what followed those matches. Both operations need the parallel random access that attention provides. An LSTM would need to encode “when I previously saw this pattern, what came next” in its hidden state, but the pattern might be arbitrarily complex and might have appeared in many contexts. The fixed-size hidden state cannot store this information with sufficient fidelity.
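The input/output behavior of an induction head can be sketched as a lookup over the full token history. This is a toy under stated assumptions, not the learned circuit: exact string equality stands in for learned query-key similarity, and the function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Copy the token that followed the most recent earlier match of the last token."""
    current = tokens[-1]
    # Scan earlier positions, most recent first. Every past token stays
    # individually accessible, which is what attention provides.
    for pos in range(len(tokens) - 2, -1, -1):
        if tokens[pos] == current:
            return tokens[pos + 1]
    return None  # no earlier occurrence to copy from

print(induction_predict(["A", "B", "C", "A"]))  # prints B
```

The lookup needs the raw history. A model that has already folded that history into one fixed-size vector has no individual positions left to compare against.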

Chain-of-Thought Reasoning

When prompted to “think step by step,” large transformers generate intermediate reasoning steps before producing final answers. GPT-5 achieved 100% on the 2025 AIME specifically through multi-step reasoning.

The technique works because transformers can attend back to their own generated reasoning. After generating “First, calculate the number of boxes: 4 shelves × 8 boxes per shelf = 32 boxes,” the model attends to “32 boxes” when computing “Then, calculate total items: 32 boxes × 6 items per box = 192 items.”

LSTMs cannot implement this algorithm. Generated text disappears into the hidden state compression. When the model generates “32 boxes,” that information mixes with all previous context in the next hidden state update. Later, when computing “32 boxes × 6 items,” the model cannot selectively attend to the “32” it generated earlier. Researchers have tested chain-of-thought with various architectures. It works with transformers, fails with RNNs, and even fails with transformer-RNN hybrids where the RNN component creates a bottleneck.

Task Composition

GPT-5 combines capabilities in novel ways not seen during training. Ask it to “write a Python function that takes a string in English and returns the Pig Latin translation, then write unit tests for it,” and it generates both components despite likely never seeing this exact combination.

Compositional generalization requires maintaining independent representations of different task aspects and composing them at generation time. Attention layers route information selectively. The model attends to Python examples when generating code structure, attends to language rules when generating transformation logic, attends to testing examples when generating assertions. An LSTM’s hidden state must encode all task aspects in a single vector, making selective routing difficult.

Scale Enablement and Parallelization

The path from 10-million-parameter LSTMs in 2015 to the 1.8-trillion-parameter GPT-5 was possible only because transformers enable parallel training at massive scale.

Transformers process entire sequences in parallel. A batch of 1024 sequences, each 2048 tokens long, processes in roughly the same time as a single sequence. LSTMs process each sequence one token at a time. For a 2048-token sequence, the model performs 2048 dependent steps regardless of hardware parallelism. GPT-5 trained for approximately 90–120 days on clusters with 25,000+ GPUs. An equivalently sized LSTM would require decades.

OpenAI’s scaling laws research demonstrated that transformer performance improves predictably with model size:

L(N) = A · N^{-0.076} + B

Loss decreases as a power law in parameters N. This predictability enabled planning GPT-5’s architecture: researchers could estimate performance at 1.8 trillion parameters before investing the compute.
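To see what the exponent implies, the sketch below evaluates the power law at two model sizes. The constants `a` and `b` are placeholders, not fitted values. With `b` set to zero, the ratio describes only the reducible term A · N^{-0.076}.

```python
def loss(n_params, a=1.0, b=0.0, exponent=0.076):
    """L(N) = A * N^(-exponent) + B, with placeholder A and B."""
    return a * n_params ** (-exponent) + b

# Effect of a 10x increase in parameters on the reducible term.
ratio_10x = loss(1e10) / loss(1e9)
print(ratio_10x)  # about 0.84
```

Under this form, each 10x increase in parameters cuts the reducible loss by about 16 percent, whatever the starting size. That regularity is what makes extrapolation to larger models possible.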

LSTMs don’t follow the same scaling laws. Beyond a certain size, adding parameters provides minimal benefit because the sequential bottleneck and fixed hidden state limit what the model can learn. Certain capabilities emerge only above specific scale thresholds. In-context learning appears around 1 billion parameters. Chain-of-thought reasoning emerges around 100 billion parameters. These phase transitions happen because the model develops internal circuits like induction heads.

LSTMs cannot develop these circuits because the required algorithms have no LSTM implementation.

GPT-5’s Mixture of Experts architecture activates only a subset of parameters per token. The total parameter count is enormous (1.8 trillion), but each forward pass uses a fraction. This works with transformers because different tokens route to different experts in parallel. MoE is harder to implement efficiently with LSTMs due to sequential dependencies.
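A minimal sketch of top-k expert routing illustrates the idea. Everything here is assumed for illustration: the router scores are invented, and the choice of k = 2 out of 4 experts is not taken from any published GPT-5 configuration.

```python
def top_k_experts(router_scores, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return ranked[:k]

def route_tokens(score_rows, k=2):
    """Route every token independently; no token waits on another."""
    return [top_k_experts(row, k) for row in score_rows]

# Hypothetical router scores: 3 tokens, 4 experts.
scores = [[0.10, 0.70, 0.05, 0.15],
          [0.60, 0.10, 0.20, 0.10],
          [0.05, 0.05, 0.50, 0.40]]
assignments = route_tokens(scores)
print(assignments)  # [[1, 3], [0, 2], [2, 3]]
```

Each token here uses two of the four experts, so half of the expert parameters stay inactive for it. Because each routing decision depends only on that token's own scores, all tokens in a sequence can be dispatched at once.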

Real-World Evolution: Siri’s Transformation

The transition from LSTMs to transformers has reshaped production systems. Apple’s Siri provides a clear case study.

Early Siri used a pipeline architecture: Automatic Speech Recognition converted audio to text using LSTM-based acoustic and language models. Natural Language Understanding classified intent and extracted slots using BiLSTMs. Dialog Management tracked conversation state. Natural Language Generation produced responses via templates or LSTMs. Text-to-Speech converted text to audio.

Each component was trained separately, often on different datasets with different objectives. The system was fragile. Errors in ASR corrupted NLU input. NLU failures produced wrong intents. The pipeline amplified mistakes.

More fundamentally, the system couldn’t reason. Ask “Who won the Oscar for Best Actress in 2020?” and it could retrieve a fact. Ask “Could the winner of the 2020 Best Actress Oscar play the lead in a biopic about Marie Curie?” and it failed. The question requires retrieving the actress, recalling her background, understanding Marie Curie’s characteristics, and reasoning about casting fit. The pipeline had no component for multi-step reasoning.

Apple Intelligence, unveiled in 2024 and expanded through 2025–2026, replaced the pipeline with transformer-based foundation models. Craig Federighi publicly acknowledged that “the first-generation architecture was too limited for where we wanted to go.”

The new system uses a single large language model for understanding, reasoning, and generation. Capabilities that were impossible in the pipeline emerge naturally. Cross-turn context maintains information across conversation turns. Complex reasoning handles multi-step requests like “Find times when both Alice and Bob are free this week, prioritize afternoons, and schedule a one-hour meeting.” Task composition chains operations like “Summarize my unread emails from yesterday, identify action items, and add them to my todo list.” In-context learning adapts to user-defined commands without retraining.

This reliability improvement also matters for security-critical applications. Systems like www.antijection.com, which provide prompt injection prevention layers, rely on accurate threat detection. When the underlying models hallucinate less, security systems built on them become more reliable. The architectural advantages that enable GPT-5’s scale and capabilities also contribute to its trustworthiness.

Architecture Determines Capability Ceilings

The transformer didn’t just improve on LSTMs. It removed fundamental limitations that no amount of optimization could overcome.

The LSTM architecture cannot implement direct token comparison (required for in-context learning), attention over generated text (required for chain-of-thought reasoning), arbitrary long-range dependencies (required for document understanding), or parallel sequence processing (required for training at scale). These aren’t engineering challenges. They’re architectural impossibilities. The sequential processing and fixed-size hidden state create hard ceilings on what the model can express.

The history of LSTMs demonstrates that scale alone doesn’t overcome architectural limitations. Researchers built LSTMs with hundreds of millions of parameters. Performance improved modestly, then plateaued. The hidden state bottleneck remained. The sequential processing remained. The capability ceiling remained.

Transformers removed those bottlenecks. The architecture aligns with the structure of the tasks: language has long-range dependencies, so provide long-range connections. Reasoning requires referencing earlier steps, so enable attention over the full sequence. Learning patterns requires comparing instances, so compute all pairwise similarities. This alignment is why transformer scaling laws hold where LSTM scaling broke down.

Architecture determines what’s possible. Before committing years of research and millions of dollars to scaling a model, understand whether the architecture can express the capabilities needed. Sometimes the bottleneck isn’t optimization or data or compute. Sometimes it’s the equations at the foundation.

The transformer’s victory over LSTMs wasn’t inevitable. It required recognizing that parallel information access matters more than sequential processing for language, that attention mechanisms provide the right inductive bias, and that O(1) path length enables learning that O(n) path length prevents. These architectural insights, not just engineering effort, changed what AI systems can do.

When capabilities plateau despite adding resources, question the architecture. The ceiling might be mathematical, not practical. And changing the equations might be the only way to break through.
