The Three Pillars of LLM Creation: A Non-technical Guide
Last Updated on October 19, 2024 by Editorial Team
Author(s): Nikita Andriievskyi
Originally published on Towards AI.
- Pre-training
  - Data collection
  - Tokenization
  - Self-Supervised Learning
- Fine-tuning
- Reinforcement Learning from Human Feedback (RLHF)
We have all been using ChatGPT, Claude, and Gemini a lot lately. These are very useful Large Language Models (LLMs) that simplify our lives, helping us with work, education, and everyday tasks. But have you ever wondered how these systems work or how they were created? In this article, I'm going to break down the LLM training process for you in simple terms. By the end, you'll have a good understanding of the key steps that make these AI models so capable. Let's explore the first pillar of training Large Language Models: Pre-training!
Pre-training
Pre-training is the first, very important, and very expensive step in the training process. It is where LLMs get their knowledge of words, languages, and how words interact with one another. It helps them understand what "cat" is, what "help" means, and why words like "water" and "ocean" are related. So, how does pre-training work? Here are the 3 main steps:
- Collecting massive amounts of data
- Tokenization
- Self-Supervised Learning
Data collection
Hundreds of gigabytes of text are gathered from diverse sources like books, websites, news articles, academic papers, social media, and more. These datasets are curated and processed to remove low-quality content, noise, and any potentially harmful data. This step is crucial, as it shapes the "scope" of knowledge the model will have. If all the training data were extracted from Reddit replies, the model quality would be, well... you know 😅
On the other hand, a model trained solely on academic papers would sound and behave more formally, providing more scholarly responses but possibly lacking conversational fluidity.
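To make this concrete, here is a minimal sketch of the kind of heuristic quality filtering a pre-training pipeline might apply. The thresholds and rules below are made up purely for illustration; real pipelines rely on far more sophisticated filters (deduplication, language detection, toxicity classifiers, and so on).

```python
# Toy quality filter for raw documents (illustrative thresholds only).

def looks_clean(doc: str) -> bool:
    """Very rough heuristics for deciding whether to keep a document."""
    words = doc.split()
    if len(words) < 20:                      # too short to be useful
        return False
    letters = sum(ch.isalpha() for ch in doc)
    if letters / max(len(doc), 1) < 0.6:     # mostly symbols/markup -> likely noise
        return False
    if doc.lower().count("click here") > 2:  # crude spam signal
        return False
    return True

raw_docs = [
    "Buy now!!! Click here click here click here $$$",
    "The water cycle describes how water evaporates from the ocean, "
    "forms clouds, and returns to the surface as rain or snow, "
    "continuously moving between the sea, the air, and the land.",
]

cleaned = [d for d in raw_docs if looks_clean(d)]
print(len(cleaned), "of", len(raw_docs), "documents kept")
```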
Tokenization
Before feeding data into the model, the text must be converted into smaller units called tokens. A token can be a whole word, part of a word, or even a single character, depending on the approach. Tokenization is crucial because it helps the model break down language into manageable pieces while retaining meaning.
For example, the word "headache" might be split into two tokens: "head" and "ache." Each part provides valuable information: "head" relates to body parts, and "ache" relates to pain. This decomposition helps the model generalize across similar contexts. For instance, if the model encounters "stomachache," it can infer the meaning by recognizing the shared component "ache."
In some cases, tokenization can also handle more complex language structures, like breaking down contractions ("didn't" into "did" and "n't") or even handling languages with complex grammar rules (like Chinese or Arabic). Tokenization allows the model to process text in a flexible, granular way, making it more capable of understanding and generating coherent responses.
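As a rough illustration, here is a toy tokenizer that greedily matches the longest known piece of a word against a tiny, hand-made vocabulary. Real LLM tokenizers (such as Byte Pair Encoding) learn their vocabularies from data, so the exact splits they produce will differ; this sketch only shows the idea of breaking an unfamiliar word into familiar parts.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, hand-made vocabulary.
# Real tokenizers (e.g. BPE) learn their vocabulary from data; this is only a sketch.

VOCAB = {"head", "ache", "stomach", "did", "n't", "the", "cat", "sat", "on", "a", "mat"}

def tokenize_word(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until we find a match.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize_word("headache"))     # ['head', 'ache']
print(tokenize_word("stomachache"))  # ['stomach', 'ache']
print(tokenize_word("didn't"))       # ['did', "n't"]
```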
Self-Supervised Learning
After the data is collected and tokenized, the model is ready for training. Note: I am skipping the model architecture engineering step as it is much more technical. Essentially, there is a whole other step where the model is defined: what functions are used, how the data flows in and out of the functions, and what the end output is.
In the self-supervised learning step, the text is split into so-called "windows" of a predefined size. Let's say our full text is: "The cat sat on a mat and enjoyed the wind". With a window size of 4, this would create windows of 4 tokens, and we would "slide" the window one token at a time:
- "The cat sat on"
- "cat sat on a"
- "sat on a mat"
- "on a mat and"
- "a mat and enjoyed"
- "mat and enjoyed the"
- "and enjoyed the wind"
The goal of the self-supervised step is to predict the next word in a sentence given a window. So if the model was given the phrase "cat sat on a", it would need to predict the word "mat".
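Here is a small sketch of how those windows and their "next word" targets could be built. For simplicity it splits the text on spaces; a real pipeline would use the subword tokens described above.

```python
# Build (context window, next token) training pairs from raw text.
# Splitting on spaces stands in for real tokenization here.

text = "The cat sat on a mat and enjoyed the wind"
tokens = text.split()
window_size = 4

pairs = []
for i in range(len(tokens) - window_size):
    context = tokens[i : i + window_size]   # e.g. ['cat', 'sat', 'on', 'a']
    target = tokens[i + window_size]        # e.g. 'mat'
    pairs.append((context, target))

for context, target in pairs:
    print(" ".join(context), "->", target)
# "The cat sat on" -> "a", "cat sat on a" -> "mat", and so on.
```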
It turns out that this training method is a great way for the model to learn the meaning of words and how they interact. On top of that, it learns to associate different groups of words (e.g., it learns that apples are related to oranges, and kings are related to queens).
It does that by learning word embeddings, which is just a fancy term for the numerical representation of a word. These embeddings can be plotted, and in such a plot you can see how related words end up close together: "waters" and "sea", "territory" and "area", "climate", "wind", and "ice".
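To give a feel for what "closer together" means numerically, here is a toy example with made-up 3-dimensional embeddings (real models learn embeddings with hundreds or thousands of dimensions during training; these numbers are hand-picked only for illustration). Cosine similarity is a common way to measure how related two embeddings are: the closer to 1, the more related.

```python
import math

# Made-up 3-dimensional embeddings, purely for illustration.
# Real embeddings are learned during pre-training and are much larger.
embeddings = {
    "sea":    [0.9, 0.8, 0.1],
    "waters": [0.8, 0.9, 0.2],
    "king":   [0.1, 0.2, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["sea"], embeddings["waters"]))  # high -> related
print(cosine_similarity(embeddings["sea"], embeddings["king"]))    # low  -> unrelated
```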
By the end of the pre-training phase, the model learns what words mean and how they are related and becomes a "next-word-generation machine". It cannot respond to your questions yet. If you asked it a question, it would probably just generate similar questions or would append words to your question, not answering it.
Fine-tuning
To teach the model how to respond to questions, the fine-tuning step is implemented. Essentially, it takes the base, pre-trained model with all its learned knowledge about words and teaches it to perform a specific task or set of tasks.
This step consists of collecting lots of paired data: questions and answers, articles and their summaries, prompts and human-written code or articles, and so on.
Then, this data is used to map inputs (prompts or questions) to outputs (human-written answers) and is given to the model for training. During this step, the model learns to respond to queries, making it a working assistant. After fine-tuning, the model should be capable of answering human questions and performing the other tasks it was trained on (e.g., generating code).
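In practice, fine-tuning data often looks like a long list of prompt/response pairs. The format below is a simplified, made-up illustration of such a dataset; real datasets also include things like system instructions, multi-turn conversations, and careful quality review.

```python
import json

# A toy instruction-tuning dataset: each example maps a prompt to the
# response we want the model to learn to produce. Contents are made up.
examples = [
    {"prompt": "What is tokenization?",
     "response": "Tokenization is the process of splitting text into smaller "
                 "units called tokens, such as words or pieces of words."},
    {"prompt": "Summarize: The cat sat on a mat and enjoyed the wind.",
     "response": "A cat relaxed on a mat in the breeze."},
]

# Such data is commonly stored as JSON Lines: one example per line.
with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```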
Technically, the model has finished training and can help you with your queries. However, there is another step that can take the LLM to another level.
Reinforcement Learning from Human Feedback (RLHF)
RLHF, or Reinforcement Learning from Human Feedback, is a step that brings humans into the loop: it teaches the LLM to generate human-aligned answers, making it more helpful, more diverse, and less biased toward political figures (well, at least they say they try to make LLMs this way...), and so on.
This is done by asking the model to generate a couple of answers to the same question, having human rankers rank each answer, and then feeding that data back to the model for learning, thus creating a human feedback loop.
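A common way to record this feedback is as preference pairs: for each prompt, which of two model answers the human ranked higher. The snippet below is a simplified, made-up illustration of that data; the actual learning step (training a reward model and updating the LLM with reinforcement learning) is much more involved.

```python
# Toy preference data for RLHF: humans compared two model answers per prompt.
# "chosen" is the answer ranked higher, "rejected" the one ranked lower.
preferences = [
    {
        "prompt": "Explain what an LLM is in one sentence.",
        "chosen": "An LLM is a model trained on large amounts of text to "
                  "predict and generate language.",
        "rejected": "It's complicated, look it up yourself.",
    },
]

# A reward model is then trained to score "chosen" answers higher than
# "rejected" ones, and that reward signal is used to update the LLM.
for pair in preferences:
    print("Prompt:   ", pair["prompt"])
    print("Preferred:", pair["chosen"])
```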
You can see how in this step, the LLM can be easily aligned with a negative purpose too. If biased answers are always ranked higher, the model will learn to generate more biased answers, and vice versa.
These three pillars (Pre-training, Supervised Fine-tuning, and RLHF) work together to transform raw text data into the sophisticated AI assistants we interact with daily. Each step builds upon the last, refining the model's capabilities and aligning it with human needs and expectations.
If you have any questions or ideas for more AI or automation articles, let me know!
Follow me on X for more AI and automation content: https://x.com/NAndriievskyi