


LLM & AI Agent Applications with LangChain and LangGraph — Part 4 — Components of GPT

Last Updated on December 29, 2025 by Editorial Team

Author(s): Michalzarnecki

Originally published on Towards AI.

Transformers, embeddings and attention: how modern LLMs really think


Welcome back to the series on LLM-based application development.

By now you already know the basics of how LLMs are built and what their key parameters mean. In this article we return to the architecture that kicked off the current wave of language models: the Transformer from the 2017 paper “Attention Is All You Need”. That work was a real turning point for natural language processing and it’s the foundation behind GPT and many other modern models.

My goal here is to walk through the main building blocks of a Transformer in a practical way, so that when you see the classic diagram, it’s not just a mysterious box anymore.

“Attention Is All You Need” by Vaswani et al., 2017 [1]

In the diagram from the original paper you’ll see two main blocks: an encoder on the left and a decoder on the right.

The encoder is the analysis department. It takes an input sequence, for example a sentence in Polish, and converts it into a sequence of numerical representations. You can think of these as internal codes the model can work with efficiently.

The decoder is the generation department. It receives that internal code from the encoder and combines it with its own previously generated outputs to produce a new sequence, word by word. In machine translation, that new sequence might be a sentence in English. In other tasks it might be an answer, a summary or a continuation of a text.

If you like analogies: the encoder is the first translator who reads the original text and writes very technical notes in a private shorthand. The decoder is the second translator who reads those notes and writes a clean, fluent output in the target language.

A key property of this setup is that it is autoregressive. Each new word depends on all previously generated words. When the model is writing, it isn’t picking words in isolation. Every next token has to fit into the whole story so far. It’s closer to writing a novel than to filling boxes in a form.

Before the model can do any of that, though, text has to be turned into numbers. Computers don’t “see” words, they see vectors.

The simplest idea is one-hot encoding. Imagine you have a vocabulary of 15 words. Each word becomes a vector of length 15. Exactly one position in that vector is 1, and all the others are 0. The position of the 1 tells you which word it is.

Take a small corpus:

Ada has a computer.
Machine learning allows us to train a computer.
To solve problems we just need a computer and a lot of data.

You build a vocabulary, for example: [ada, has, computer, machine, learning, ...].

Then the sentence “Ada has a computer” can be represented by marking a 1 for each vocabulary word it contains (a bag of words built by combining the one-hot vectors of its words):

[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
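The idea can be sketched in a few lines of Python. The vocabulary below is a hand-picked 15-word subset of the toy corpus (stopwords like “a”, “and”, “of” are dropped, matching the vector above); this is an illustration, not a real tokenizer.

```python
# Hand-built 15-word vocabulary from the toy corpus (stopwords dropped).
vocab = ["ada", "has", "computer", "machine", "learning", "allows", "us",
         "to", "train", "solve", "problems", "we", "just", "need", "data"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def bag_of_words(sentence):
    """Mark a 1 for every vocabulary word present in the sentence."""
    words = sentence.lower().replace(".", "").split()
    return [1 if w in words else 0 for w in vocab]

print(one_hot("computer"))                # single 1 at index 2
print(bag_of_words("Ada has a computer")) # [1, 1, 1, 0, ..., 0]
```

Note that `one_hot("king")` and `one_hot("queen")` would differ in exactly two positions, the same as any other pair of words, which is precisely the weakness discussed next.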

So far so good: the text is now a fixed-length numeric vector. But this encoding has a massive weakness. It doesn’t capture meaning or relationships between words at all. For this representation, “king” is just as different from “queen” as it is from “computer”. There is no concept of similarity.

Large language models need something much richer.

This is where embeddings come in.

Instead of sparse vectors full of zeros and a single one, each word gets its own dense vector of real numbers in a shared multi-dimensional space. The model learns these vectors so that words with similar meanings end up close to each other, while different types of relationships can be expressed as directions in that space.

That’s how we end up with classic examples like:

king − man + woman ≈ queen

You can see similar patterns for verb tenses or grammatical number:

  • “walked” and “walk” relate to each other in a similar way as “ran” and “run”
  • “we swim” and “I swim” differ along the same dimension as “we” and “I”

An embedding space behaves a bit like a map. Words with similar meanings are neighbors. Opposites may sit on the same line but on opposite sides. More complex relationships can be captured as particular movement directions across the map.

To store enough information about meaning, we use high-dimensional representations. Instead of 15 dimensions like in the one-hot example, we might use 100, 300, 1024 dimensions or more.

Imagine we have a 100-dimensional embedding for the word “go”. In the same space we have another vector for “away”. We can combine these vectors, for instance by averaging them: add them together and divide by two. The result is a new point in the space that represents the phrase “go away” in a more abstract way than just gluing two words together.

Operations like this aren’t just mathematical tricks. They are a sign that the embedding space captures useful structure. The model can use these representations to reason about similarity, analogy and context.
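As a sketch of that structure, here are tiny hand-crafted 2-dimensional “embeddings” on which both the analogy arithmetic and the averaging trick work. The values and dimensions are invented for illustration; real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Toy 2-d embeddings crafted by hand (dimensions roughly: "royalty", "maleness").
emb = {
    "king":     np.array([0.9,  1.0]),
    "queen":    np.array([0.9, -1.0]),
    "man":      np.array([0.1,  1.0]),
    "woman":    np.array([0.1, -1.0]),
    "computer": np.array([-0.8, 0.0]),
}

def nearest(vec):
    """Return the word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(emb, key=lambda w: cos(vec, emb[w]))

# king - man + woman lands closest to queen
result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result))  # queen

# Averaging two vectors gives a point "between" the words, as for "go away"
go, away = np.array([0.2, 0.5]), np.array([0.6, -0.1])
phrase = (go + away) / 2
```

In a real model the same code pattern applies, just with learned vectors and many more dimensions.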

Embeddings solve the problem of representing individual word meanings, but that isn’t enough. In real language, order and context matter just as much.

“Going from home to work” is not the same as “going from work to home”, even though the words are the same. And context resolves ambiguity: the Polish word “zamek” can mean a medieval castle, a zipper or a door lock, depending on the sentence around it.

One simple idea would be to assign each word a position number: first word is 1, second is 2, and so on. Unfortunately, that naive approach adds its own problems.

If we just pass position indexes directly, position 1000 is dramatically larger than position 1. From a learning perspective, it’s often easier and more stable to work with values that are normalized and live in a bounded range, such as between −1 and 1. Big jumps in scale can make training harder.

Transformers use a different trick to represent position, one that plays nicely with gradient-based learning.

The solution is sinusoidal positional encoding.

Each position in the sequence is represented by a vector built from sine and cosine functions with different frequencies. There are two important consequences:

  1. All values lie between −1 and 1, so the scale is well-behaved.
  2. Each position has a unique pattern of sine and cosine values, and nearby positions have similar patterns.

You can think of it as giving each token its own little melody based on where it appears in the sentence. Words next to each other have similar melodies. No two positions share exactly the same tune. Combined with the embedding, this lets the model know both what the word is and where it is.
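The encoding can be written down directly from the formulas in the original paper, PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch (shapes chosen for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)              # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
# Every value lies in [-1, 1], and each position gets a distinct pattern;
# the encoding is simply added to the token embeddings.
```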

We’re missing one last critical ingredient: Multi-Head Attention. This is the real core of the Transformer.

Imagine reading a sentence and tracking several things at once. One thread of attention follows the subject. Another pays attention to the main verbs. A third pays more attention to emotional tone. A fourth tries to understand cause-and-effect relationships.

Your brain shifts and blends these threads without much effort. Multi-head attention is a way for a model to do something similar in a structured, trainable way.

The attention mechanism uses three kinds of vectors: Query, Key and Value.

A library analogy works well here:

  • The Query is what you’re looking for, like “information about cats”.
  • The Key represents what each piece of content is about, similar to book titles on a shelf.
  • The Value is the actual content in the books.

The model compares the query to all the keys, calculates how similar they are, and then uses those similarities as weights to combine the values. The result is a weighted mixture of pieces of information that are relevant to the query.

Unlike a traditional dictionary lookup, this is “soft” retrieval. The model doesn’t pick one exact match, it blends information from many places in proportion to how relevant they seem.

If you’ve used dictionaries or hash maps in programming, the idea is familiar: keys point to values, and you query by key. In a Transformer, everything is continuous. Queries, keys and values are vectors, and matching is based on similarity, not equality.
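This soft, similarity-weighted lookup is scaled dot-product attention: softmax(QKᵀ/√d_k)·V. A minimal single-head NumPy sketch, with made-up shapes for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: a soft, weighted lookup over values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # "soft" retrieval: rows sum to 1
    return weights @ V, weights          # weighted mixture of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # one value vector per key
out, weights = attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the keys, so no single value is “picked”; they are all blended in proportion to relevance.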

Multi-head attention simply repeats this mechanism several times in parallel. Instead of one attention head, the model has many heads: 8, 12, 16 or more depending on the implementation. Each head sees the same input but learns to focus on different patterns.

One head might pay attention to short-range relations between neighbouring words. Another might specialize in long-range relations between the beginning and the end of a sentence. A third might capture grammatical structure. A fourth might focus on semantic roles.

This design gives the model some very important advantages:

  • It can process many different aspects of the input in parallel.
  • It builds a much richer understanding of context than simple recurrent networks.
  • It increases the model’s capacity to represent complex patterns without changing the basic building block.

In practice, multi-head attention shows up in multiple layers stacked on top of each other. In models like BERT and GPT, this mechanism is a major reason why they perform so well on a wide range of NLP tasks.
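The multi-head wrapper can be sketched the same way: split the work across several heads, each with its own projections (random stand-ins below for what a real model learns), run attention independently in each, and concatenate the results.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V

def multi_head_attention(X, n_heads, rng):
    """Self-attention with n_heads parallel heads over input X."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Per-head projections; learned in a real model, random here.
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    return np.concatenate(heads, axis=-1)   # back to (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))                # 5 tokens, d_model = 32
out = multi_head_attention(X, n_heads=8, rng=rng)
```

Real implementations also apply a final output projection and fuse the per-head work into batched matrix multiplies, but the structure is the same: many independent attention patterns over one input.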

This brings us to the end of the module focused on the internal mechanics of large language models. We’ve covered:

  • the encoder–decoder structure
  • embeddings and word representations
  • positional encoding
  • and the attention mechanism, especially multi-head attention

In the next modules we’ll move from theory to practice. We’ll start using this understanding while building real applications with LangChain and LangGraph. Knowing what happens “under the hood” will make it easier to design better systems and to reason about their behavior when something unexpected happens.

see next chapter

see previous chapter

see the full code from this article in the GitHub repository

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI


Towards AI Academy

We Build Enterprise-Grade AI. We'll Teach You to Master It Too.

15 engineers. 100,000+ students. Towards AI Academy teaches what actually survives production.

Start free — no commitment:

6-Day Agentic AI Engineering Email Guide — one practical lesson per day

Agents Architecture Cheatsheet — 3 years of architecture decisions in 6 pages

Our courses:

AI Engineering Certification — 90+ lessons from project selection to deployed product. The most comprehensive practical LLM course out there.

Agent Engineering Course — Hands on with production agent architectures, memory, routing, and eval frameworks — built from real enterprise engagements.

AI for Work — Understand, evaluate, and apply AI for complex work tasks.

Note: Article content contains the views of the contributing authors and not Towards AI.