
The Three Pillars of LLM Creation: A Non-technical Guide

Last Updated on October 19, 2024 by Editorial Team

Author(s): Nikita Andriievskyi

Originally published on Towards AI.

· Pre-training
∘ Data collection
∘ Tokenization
∘ Self-Supervised Learning
· Fine-tuning
· Reinforcement Learning from Human Feedback (RLHF)

We have all been using ChatGPT, Claude, and Gemini a lot lately. These are very useful Large Language Models (LLMs) that simplify our lives, helping us with work, education, and everyday tasks. But have you ever wondered how these systems work or how they were created? In this article, I'm going to break down the LLM training process for you in simple terms. By the end, you'll have a good understanding of the key steps that make these AI models so capable. Let's explore the first pillar of training Large Language Models: Pre-training!

Pre-training

Pre-training is the first, very important, and very expensive step in the training process. It is where LLMs acquire their knowledge of words, language, and how words interact with one another. It helps them understand what a "cat" is, what "help" means, and why words like "water" and "ocean" are related. So, how does pre-training work? Here are the three main steps:

  1. Collecting massive amounts of data
  2. Tokenization
  3. Self-Supervised Learning

Data collection

Hundreds of gigabytes of text are gathered from diverse sources like books, websites, news articles, academic papers, social media, and more. These datasets are curated and processed to remove low-quality content, noise, and any potentially harmful data. This step is crucial, as it shapes the "scope" of knowledge the model will have. If all the training data were extracted from Reddit replies, the model quality would be, well… you know 😅

On the other hand, a model trained solely on academic papers would sound and behave more formally, providing more scholarly responses but possibly lacking conversational fluidity.
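
To make the "curated and processed" part more concrete, here is a minimal sketch of the kind of heuristic filters a data-cleaning pipeline might apply. The thresholds and the looks_clean helper are purely illustrative assumptions, not taken from any real production system.

```python
# Illustrative heuristic filters for pre-training text (thresholds are made up).
def looks_clean(document: str) -> bool:
    words = document.split()
    if len(words) < 50:                          # drop very short fragments
        return False
    if len(set(words)) / len(words) < 0.3:       # drop highly repetitive text
        return False
    letters = sum(ch.isalpha() or ch.isspace() for ch in document)
    return letters / len(document) > 0.8         # drop markup- or symbol-heavy text

raw_corpus = ["... scraped web pages, books, articles ..."]
clean_corpus = [doc for doc in raw_corpus if looks_clean(doc)]
```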

Tokenization

Before feeding data into the model, the text must be converted into smaller units called tokens. A token can be a whole word, part of a word, or even a single character, depending on the approach. Tokenization is crucial because it helps the model break down language into manageable pieces while retaining meaning.

For example, the word "headache" might be split into two tokens: "head" and "ache." Each part provides valuable information: "head" relates to body parts, and "ache" relates to pain. This decomposition helps the model generalize across similar contexts. For instance, if the model encounters "stomachache," it can infer the meaning by recognizing the shared component "ache."

In some cases, tokenization also handles more complex language structures, like breaking down contractions ("didn't" into "did" and "n't") or dealing with languages that have very different writing systems or grammar (like Chinese or Arabic). Tokenization allows the model to process text in a flexible, granular way, making it more capable of understanding and generating coherent responses.
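
If you want to see subword tokenization in action, here is a small sketch using the open-source GPT-2 tokenizer from Hugging Face's transformers library. The exact splits depend on the learned vocabulary, so treat the printed output as illustrative rather than guaranteed.

```python
from transformers import AutoTokenizer

# Load the GPT-2 byte-pair-encoding tokenizer (a widely used open-source example).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["headache", "stomachache", "didn't"]:
    tokens = tokenizer.tokenize(word)        # subword pieces
    ids = tokenizer.encode(word)             # the integer IDs the model actually sees
    # The split depends on the vocabulary; "headache" may come out as
    # something like ["head", "ache"].
    print(word, "->", tokens, ids)
```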

Self-Supervised Learning

After the data is collected and tokenized, the model is ready for training. Note: I am skipping the model architecture engineering step as it is much more technical. Essentially, there is a whole other step where the model is defined: what functions are used, how the data flows in and out of the functions, and what the end output is.

In the self-supervised learning step, the text is split into so-called "windows" of a predefined size. Let's say our full text is: "The cat sat on a mat and enjoyed the wind". If the window size is 4, this would create windows of 4 tokens, and we would "slide" the window one token at a time:

  1. "The cat sat on"
  2. "cat sat on a"
  3. "sat on a mat"
  4. "on a mat and"
  5. "a mat and enjoyed"
  6. "mat and enjoyed the"
  7. "and enjoyed the wind"

The goal of the self-supervised step is to predict the next word in a sentence given a window. So if the model was given the phrase "cat sat on a", it would need to predict the word "mat".
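
Here is a rough sketch of how such windows and next-word targets could be built. It is a toy illustration of the idea (real systems work on subword tokens and far larger windows), not the exact pipeline of any particular LLM.

```python
# Toy illustration of building (window, next-word) training pairs.
text = "The cat sat on a mat and enjoyed the wind"
tokens = text.split()          # real systems use subword tokens, not whole words
window_size = 4

pairs = []
for i in range(len(tokens) - window_size):
    window = tokens[i : i + window_size]
    target = tokens[i + window_size]       # the word the model must predict
    pairs.append((window, target))

for window, target in pairs:
    print(window, "->", target)
# e.g. ['The', 'cat', 'sat', 'on'] -> a
```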

It turns out that this training method is a great way for the model to learn the meaning of words and how they interact. On top of that, it learns to associate different groups of words (e.g., it learns that apples are related to oranges, and kings are related to queens).

It does that by learning word embeddings, which is just a fancy term for the numerical representation of a word. These embeddings can be plotted and could look something like this:

2D word embedding plot

You can see how related words are plotted closer together: "waters" and "sea", "territory" and "area", "climate", "wind" and "ice".
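
To see why nearby points mean related words, here is a toy cosine-similarity comparison on made-up embedding vectors. The numbers are invented purely for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Made-up 3-dimensional "embeddings"; real models use hundreds of dimensions.
embeddings = {
    "sea":    [0.9, 0.1, 0.3],
    "waters": [0.8, 0.2, 0.4],
    "king":   [0.1, 0.9, 0.7],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["sea"], embeddings["waters"]))  # high: related words
print(cosine_similarity(embeddings["sea"], embeddings["king"]))    # lower: unrelated words
```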

By the end of the pre-training phase, the model learns what words mean and how they are related and becomes a "next-word-generation machine". It cannot respond to your questions yet. If you asked it a question, it would probably just generate similar questions or append words to your question, not answering it.

Fine-tuning

To teach the model how to respond to questions, the fine-tuning step is applied. Essentially, it takes the base, pre-trained model with all its learned knowledge about words and teaches it to perform a specific task or set of tasks.

This step consists of collecting lots of paired data: questions and answers, articles and their summaries, prompts and the corresponding code or articles, and so on.

Then this data is used to map inputs (prompts/questions) to outputs (human-written answers to the questions) and is given to the model for training. During this step, the model learns to respond to queries, making it a working assistant. After fine-tuning, the model should be capable of answering human questions and performing the other tasks it was trained on (e.g., generating code).
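
As a rough sketch of what such instruction-tuning data could look like (the field names and examples below are hypothetical; real datasets use a variety of formats):

```python
# Hypothetical fine-tuning examples: each record maps a prompt to a
# human-written response the model should learn to imitate.
fine_tuning_data = [
    {"prompt": "Summarize: The cat sat on a mat and enjoyed the wind.",
     "response": "A cat relaxed on a mat in the breeze."},
    {"prompt": "Write a Python function that adds two numbers.",
     "response": "def add(a, b):\n    return a + b"},
]

# During fine-tuning, each prompt/response pair is turned into the same
# next-word-prediction task as pre-training, but only over curated examples
# of the behavior we want the assistant to show.
for example in fine_tuning_data:
    training_text = example["prompt"] + "\n" + example["response"]
    print(training_text)
```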

Technically, the model has finished training and can help you with your queries. However, there is another step that can take the LLM to another level.

Reinforcement Learning from Human Feedback (RLHF)

RLHF, or reinforcement learning from human feedback, is a human-in-the-loop step that teaches the LLM to generate human-aligned answers, making it more helpful, diverse, and less biased toward political figures (well, at least the labs say they try to make LLMs this way…), and so on.

This is done by asking the model to generate several answers to the same question, having human rankers rank those answers, and then feeding the ranking data back to the model for learning, thus forming a human feedback loop.
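
Here is a hedged, simplified sketch of what that preference data could look like before it is fed into an RLHF algorithm. The structure is illustrative only; in practice, a separate reward model is usually trained from these rankings and then used to update the LLM.

```python
# Illustrative preference data: human rankers compare answers to the same prompt.
preference_data = [
    {
        "prompt": "Explain photosynthesis in one sentence.",
        "chosen": "Plants use sunlight, water, and CO2 to make sugar and oxygen.",
        "rejected": "Photosynthesis is when plants eat sunlight, basically magic.",
    },
]

# In RLHF, a reward model typically learns to score the "chosen" answer above the
# "rejected" one, and the LLM is then updated to produce higher-scoring answers.
for pair in preference_data:
    print(pair["prompt"])
    print("  preferred   :", pair["chosen"])
    print("  dispreferred:", pair["rejected"])
```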

You can see how, in this step, the LLM could easily be aligned with a negative purpose too. If biased answers are always ranked higher, the model will learn to generate more biased answers, and vice versa.

These three pillars (pre-training, supervised fine-tuning, and RLHF) work together to transform raw text data into the sophisticated AI assistants we interact with daily. Each step builds upon the last, refining the model's capabilities and aligning it with human needs and expectations.

If you have any questions or ideas for more AI or automation articles, let me know!

Follow me on X for more AI and automation content: https://x.com/NAndriievskyi


Published via Towards AI
