The Three Pillars of LLM Creation: A Non-technical Guide
Last Updated on October 19, 2024 by Editorial Team
Author(s): Nikita Andriievskyi
Originally published on Towards AI.
- Pre-training
  - Data collection
  - Tokenization
  - Self-Supervised Learning
- Fine-tuning
- Reinforcement Learning from Human Feedback (RLHF)
We have all been using ChatGPT, Claude, and Gemini a lot lately. These are very useful Large Language Models (LLMs) that simplify our lives, helping us with work, education, and everyday tasks. But have you ever wondered how these systems work or how they were created? In this article, I'm going to break down the LLM training process for you in simple terms. By the end, you'll have a good understanding of the key steps that make these AI models so capable. Let's explore the first pillar of training Large Language Models: Pre-training!
Pre-training
Pre-training is the first, very important, and very expensive step in the training process. It is where LLMs get their knowledge of words, languages, and how words interact with one another. It helps them understand what "cat" is, what "help" means, and why words like "water" and "ocean" are related. So, how does pre-training work? Here are the 3 main steps:
- Collecting massive amounts of data
- Tokenization
- Self-Supervised Learning
Data collection
Hundreds of gigabytes of text are gathered from diverse sources like books, websites, news articles, academic papers, social media, and more. These datasets are curated and processed to remove low-quality content, noise, and any potentially harmful data. This step is crucial, as it shapes the "scope" of knowledge the model will have. If all the training data were extracted from Reddit replies, the model quality would be, well... you know 😅
On the other hand, a model trained solely on academic papers would sound and behave more formally, providing more scholarly responses but possibly lacking conversational fluidity.
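To make this concrete, here is a minimal sketch of the kind of heuristic quality filtering a pre-training pipeline might apply. The thresholds and rules below are made up purely for illustration; real pipelines rely on far more sophisticated filters (deduplication, language detection, toxicity classifiers, and so on).

```python
# Toy quality filter for raw documents (illustrative thresholds only).

def looks_clean(doc: str) -> bool:
    """Very rough heuristics for deciding whether to keep a document."""
    words = doc.split()
    if len(words) < 20:                      # too short to be useful
        return False
    letters = sum(ch.isalpha() for ch in doc)
    if letters / max(len(doc), 1) < 0.6:     # mostly symbols/markup -> likely noise
        return False
    if doc.lower().count("click here") > 2:  # crude spam signal
        return False
    return True

raw_docs = [
    "Buy now!!! Click here click here click here $$$",
    "The water cycle describes how water evaporates from the ocean, "
    "forms clouds, and returns to the surface as rain or snow, "
    "continuously moving between the sea, the air, and the land.",
]

cleaned = [d for d in raw_docs if looks_clean(d)]
print(len(cleaned), "of", len(raw_docs), "documents kept")
```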
Tokenization
Before feeding data into the model, the text must be converted into smaller units called tokens. A token can be a whole word, part of a word, or even a single character, depending on the approach. Tokenization is crucial because it helps the model break down language into manageable pieces while retaining meaning.
For example, the word "headache" might be split into two tokens: "head" and "ache." Each part provides valuable information: "head" relates to body parts, and "ache" relates to pain. This decomposition helps the model generalize across similar contexts. For instance, if the model encounters "stomachache," it can infer the meaning by recognizing the shared component "ache."
In some cases, tokenization can also handle more complex language structures, like breaking down contractions ("didn't" into "did" and "n't") or even handling languages with complex grammar rules (like Chinese or Arabic). Tokenization allows the model to process text in a flexible, granular way, making it more capable of understanding and generating coherent responses.
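As a rough illustration, here is a toy tokenizer that greedily matches the longest known piece of a word against a tiny, hand-made vocabulary. Real LLM tokenizers (such as Byte Pair Encoding) learn their vocabularies from data, so the exact splits they produce will differ; this sketch only shows the idea of breaking an unfamiliar word into familiar parts.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, hand-made vocabulary.
# Real tokenizers (e.g. BPE) learn their vocabulary from data; this is only a sketch.

VOCAB = {"head", "ache", "stomach", "did", "n't", "the", "cat", "sat", "on", "a", "mat"}

def tokenize_word(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until we find a match.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize_word("headache"))     # ['head', 'ache']
print(tokenize_word("stomachache"))  # ['stomach', 'ache']
print(tokenize_word("didn't"))       # ['did', "n't"]
```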
Self-Supervised Learning
After the data is collected and tokenized, the model is ready for training. Note: I am skipping the model architecture engineering step as it is much more technical. Essentially, there is a whole other step where the model is defined: what functions are used, how the data flows in and out of the functions, and what the end output is.
In the self-supervised learning step, the text is split into so-called "windows" of a predefined size. Let's say our full text is: "The cat sat on a mat and enjoyed the wind". With a window size of 4, this would create windows of 4 tokens, and we would "slide" the window one token at a time:
- "The cat sat on"
- "cat sat on a"
- "sat on a mat"
- "on a mat and"
- "a mat and enjoyed"
- "mat and enjoyed the"
- "and enjoyed the wind"
The goal of the self-supervised step is to predict the next word in a sentence given a window. So if the model was given the phrase "cat sat on a", it would need to predict the word "mat".
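Here is a small sketch of how those windows and their "next word" targets could be built. For simplicity it splits the text on spaces; a real pipeline would use the subword tokens described above.

```python
# Build (context window, next token) training pairs from raw text.
# Splitting on spaces stands in for real tokenization here.

text = "The cat sat on a mat and enjoyed the wind"
tokens = text.split()
window_size = 4

pairs = []
for i in range(len(tokens) - window_size):
    context = tokens[i : i + window_size]   # e.g. ['cat', 'sat', 'on', 'a']
    target = tokens[i + window_size]        # e.g. 'mat'
    pairs.append((context, target))

for context, target in pairs:
    print(" ".join(context), "->", target)
# "The cat sat on" -> "a", "cat sat on a" -> "mat", and so on.
```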
It turns out that this training method is a great way for the model to learn the meaning of words and how they interact. On top of that, it learns to associate different groups of words (e.g., it learns that apples are related to oranges, and kings are related to queens).
It does that by learning word embeddings, which is just a fancy term for the numerical representation of a word. These embeddings can be plotted, and in such a plot you can see how related words end up close together: "waters" and "sea", "territory" and "area", "climate", "wind", and "ice".
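To give a feel for what "closer together" means numerically, here is a toy example with made-up 3-dimensional embeddings (real models learn embeddings with hundreds or thousands of dimensions during training; these numbers are hand-picked only for illustration). Cosine similarity is a common way to measure how related two embeddings are: the closer to 1, the more related.

```python
import math

# Made-up 3-dimensional embeddings, purely for illustration.
# Real embeddings are learned during pre-training and are much larger.
embeddings = {
    "sea":    [0.9, 0.8, 0.1],
    "waters": [0.8, 0.9, 0.2],
    "king":   [0.1, 0.2, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["sea"], embeddings["waters"]))  # high -> related
print(cosine_similarity(embeddings["sea"], embeddings["king"]))    # low  -> unrelated
```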
By the end of the pre-training phase, the model learns what words mean and how they are related and becomes a "next-word-generation machine". It cannot respond to your questions yet. If you asked it a question, it would probably just generate similar questions or would append words to your question, not answering it.
Fine-tuning
To teach the model how to respond to questions, the fine-tuning step is implemented. Essentially, it takes the base, pre-trained model with all its learned knowledge about words and teaches it to perform a specific task or set of tasks.
This step consists of collecting lots of paired data: questions and answers, articles and their summaries, prompts and human-written code or articles, and so on.
Then, this data is used to map inputs (prompts or questions) to outputs (human-written answers) and is given to the model for training. During this step, the model learns to respond to queries, making it a working assistant. After fine-tuning, the model should be capable of answering human questions and performing the other tasks it was trained on (e.g., generating code).
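In practice, fine-tuning data often looks like a long list of prompt/response pairs. The format below is a simplified, made-up illustration of such a dataset; real datasets also include things like system instructions, multi-turn conversations, and careful quality review.

```python
import json

# A toy instruction-tuning dataset: each example maps a prompt to the
# response we want the model to learn to produce. Contents are made up.
examples = [
    {"prompt": "What is tokenization?",
     "response": "Tokenization is the process of splitting text into smaller "
                 "units called tokens, such as words or pieces of words."},
    {"prompt": "Summarize: The cat sat on a mat and enjoyed the wind.",
     "response": "A cat relaxed on a mat in the breeze."},
]

# Such data is commonly stored as JSON Lines: one example per line.
with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```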
Technically, the model has finished training and can help you with your queries. However, there is another step that can take the LLM to another level.
Reinforcement Learning from Human Feedback (RLHF)
RLHF, or Reinforcement Learning from Human Feedback, is a step that brings humans into the loop: it teaches the LLM to generate human-aligned answers, making it more helpful, more diverse, and less biased toward political figures (well, at least they say they try to make LLMs this way...), and so on.
This is done by asking the model to generate a couple of answers to the same question, having human rankers rank each answer, and then feeding that data back to the model for learning, thus creating a human feedback loop.
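A common way to record this feedback is as preference pairs: for each prompt, which of two model answers the human ranked higher. The snippet below is a simplified, made-up illustration of that data; the actual learning step (training a reward model and updating the LLM with reinforcement learning) is much more involved.

```python
# Toy preference data for RLHF: humans compared two model answers per prompt.
# "chosen" is the answer ranked higher, "rejected" the one ranked lower.
preferences = [
    {
        "prompt": "Explain what an LLM is in one sentence.",
        "chosen": "An LLM is a model trained on large amounts of text to "
                  "predict and generate language.",
        "rejected": "It's complicated, look it up yourself.",
    },
]

# A reward model is then trained to score "chosen" answers higher than
# "rejected" ones, and that reward signal is used to update the LLM.
for pair in preferences:
    print("Prompt:   ", pair["prompt"])
    print("Preferred:", pair["chosen"])
```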
You can see how in this step, the LLM can be easily aligned with a negative purpose too. If biased answers are always ranked higher, the model will learn to generate more biased answers, and vice versa.
These three pillars (Pre-training, Supervised Fine-tuning, and RLHF) work together to transform raw text data into the sophisticated AI assistants we interact with daily. Each step builds upon the last, refining the model's capabilities and aligning it with human needs and expectations.
If you have any questions or ideas for more AI or automation articles, let me know!
Follow me on X for more AI and automation content: https://x.com/NAndriievskyi