Demystifying LLMs: A Quick Summary to Andrej Karpathy’s Intro to LLM Course

Last Updated on January 5, 2024 by Editorial Team

Author(s): Lye Jia Jun

Originally published on Towards AI.

ICYMI, Andrej Karpathy (ex-Senior Director of AI at Tesla, currently a technical staff member at OpenAI) recently made a concise yet comprehensive video on the basics of Large Language Models (LLMs).

In this article, I’ll summarize the key takeaways of Andrej Karpathy’s Intro to LLM course in under 10 minutes.

LLM is actually just two files
LLM neural network (high-level details)
Fundamentally, LLM will try to imitate (and it is imperfect)
Building an LLM requires just two steps
LLM Scaling Law and Why It Matters
The capabilities of LLM and where it’s going
LLM Security

LLM is actually just two files

Fundamentally, to run an LLM, we just need two files:

A parameters file (~140GB)
A script to run those parameters (it can be written in C or other programming languages)

No internet connection is required.
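
The ~140GB figure can be reproduced with simple arithmetic. The sketch below assumes a 70-billion-parameter model with each parameter stored as a 2-byte (float16) number; both values are assumptions for illustration, not something the parameters file itself tells you.

```python
# Back-of-the-envelope size of a parameters file.
# Assumption: 70 billion parameters, each stored in 2 bytes (float16).
NUM_PARAMETERS = 70_000_000_000
BYTES_PER_PARAMETER = 2

def parameters_file_size_gb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Return the file size in gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 10**9

size_gb = parameters_file_size_gb(NUM_PARAMETERS, BYTES_PER_PARAMETER)
print(f"{size_gb:.0f} GB")  # 140 GB
```

Storing each parameter in fewer bytes would shrink the file proportionally, which is why the same model can come in several file sizes.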

LLM Neural Network (High-Level Details)

The complexity comes from obtaining the parameters file, and the intuition is that we’re “compressing” the internet.

Here are some numbers about one of the leading open-source LLMs, Llama 2 70B by Meta:

Meta AI scientists crawled and scraped a chunk of the internet (roughly 10 terabytes, or 10,000 gigabytes, of text).
They then procured a cluster of roughly 6,000 powerful GPUs and ran the training process for ~12 days.
The total cost of training this model is approximately $2 million.
We eventually get a “zipped” parameters file of ~140GB.
(That is a compression ratio on the order of 100x, since 10,000GB / 140GB ≈ 71x, hence the intuition of “compressing” the internet.)
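
To make these figures concrete, here is a rough calculation using only the numbers quoted above. The implied cost per GPU-hour is simply derived from those numbers and should not be read as an actual cloud price.

```python
# Rough arithmetic on the training figures quoted above.
TRAINING_TEXT_GB = 10_000    # ~10 TB of crawled text
PARAMETERS_FILE_GB = 140     # size of the resulting parameters file
NUM_GPUS = 6_000
TRAINING_DAYS = 12
TOTAL_COST_USD = 2_000_000

compression_ratio = TRAINING_TEXT_GB / PARAMETERS_FILE_GB
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
cost_per_gpu_hour = TOTAL_COST_USD / gpu_hours

print(f"compression ratio: ~{compression_ratio:.0f}x")          # ~71x
print(f"GPU-hours: {gpu_hours:,}")                               # 1,728,000
print(f"implied cost per GPU-hour: ${cost_per_gpu_hour:.2f}")    # $1.16
```

The exact ratio is closer to 70x than 100x, but the order of magnitude is what matters for the “compression” intuition.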

The neural network of LLM simplified: next word prediction task

Fundamentally, the neural network of the LLM does one specific task: next-word prediction.
Given the context of the first 4 words, “cat sat on a,” the model predicts that the most likely next word is “mat.”
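
As a minimal sketch of the idea, the toy predictor below simply counts which word follows each 4-word context in a tiny made-up corpus. A real LLM uses a neural network rather than a lookup table, and the corpus and function names here are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus; a real LLM trains on terabytes of text.
CORPUS = [
    "the cat sat on a mat",
    "a cat sat on a mat",
    "my cat sat on a chair",
]

def train_next_word_counts(sentences, context_size=4):
    """Count which word follows each context of `context_size` words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            counts[context][words[i]] += 1
    return counts

def predict_next_word(counts, context):
    """Return (word, probability) for the most likely next word."""
    followers = counts[tuple(context.split())]
    word, count = followers.most_common(1)[0]
    return word, count / sum(followers.values())

counts = train_next_word_counts(CORPUS)
word, probability = predict_next_word(counts, "cat sat on a")
print(word, round(probability, 2))  # mat 0.67
```

Note that this toy fails on any context it has never seen, whereas a neural network generalizes to new contexts.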

The next word prediction task forces the neural network to learn about the world

Here’s the Wikipedia page of Ruth Handler, the inventor of the Barbie doll.
One great value of the next-word-prediction task is that it forces the neural network to learn about the world.
The text underlined in red contains a lot of facts about Ruth; to predict those words well, the network has to absorb them from the article, thereby gaining “knowledge.”

Using the neural network for inference: generate the next word, then feed it back into the model to generate the following word

The Decoder used in the Transformer architecture, which powers the LLM, is “auto-regressive.”
Given a prompt, the LLM predicts the most likely next word, feeds that word back into itself as part of the context to predict the following word, and repeats the process.
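
The generate-and-feed-back loop can be sketched in a few lines. Here a hypothetical lookup table stands in for the neural network and always picks one next word based only on the last word; a real model looks at the whole context and usually samples from a probability distribution.

```python
# Hypothetical lookup table standing in for the neural network:
# it maps the most recent word to the single most likely next word.
NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "a",
    "a": "mat",
    "mat": "<end>",
}

def generate(prompt: str, max_new_words: int = 10) -> str:
    """Greedy auto-regressive generation: predict, append, repeat."""
    words = prompt.split()
    for _ in range(max_new_words):
        next_word = NEXT_WORD.get(words[-1], "<end>")
        if next_word == "<end>":
            break
        words.append(next_word)  # feed the prediction back in as context
    return " ".join(words)

print(generate("the cat"))  # the cat sat on a mat
```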

Fundamentally, an LLM will try to imitate (and it is imperfect)

On the left, we see the LLM “dreaming” Java code that looks correct but may not actually work.
In the middle, we see the LLM “dreaming” Amazon products: the ISBN may be of the correct length and format, but the corresponding product most likely doesn’t exist.
On the right, we see the LLM “dreaming” a Wikipedia article about a fish. Here, even though it is “dreaming,” the information is still largely correct without repeating the original training text verbatim, highlighting that the LLM contains “knowledge.”

We know LLM fundamentally uses the Transformer architecture but we still don’t fully know how LLM stores knowledge

We can iteratively optimize the LLM neural network as a whole to make it better at predicting the next word (i.e., better performance for the LLM).
However, we don’t know how individual parameters or neurons collaborate: this is the subject of a still-growing field called interpretability.
A notable illustration of the peculiar behavior of LLMs is the “reversal curse.” For instance, if we ask the LLM who Tom Cruise’s mother is, it can accurately respond with “Mary Lee Pfeiffer.”
Yet, intriguingly, when the question is reversed to ask who Mary Lee Pfeiffer’s son is, the LLM struggles to provide the correct answer, despite having the relevant information.
This shows how knowledge can be “one-dimensional” for an LLM (accessible from one direction but not the other), and that we do not yet fully understand how LLMs store knowledge.

Fundamentally, Building an LLM Requires Just Two Steps

Step 1: Pretraining (to get an internet document generator)

As mentioned previously, we first scrape the internet and get a large amount of textual data.
We’ll use this data to train our neural network to optimize for better performance in the next-word-prediction task.
We’ll then get a “document generator software” that can spit out text.
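
To illustrate what “optimizing for the next-word-prediction task” means, here is a small sketch of a loss computed for a single prediction. The article does not name the loss function; cross-entropy is assumed here because it is the standard choice for this task, and the probabilities are made up.

```python
import math

def next_word_loss(predicted_probs: dict, actual_next_word: str) -> float:
    """Cross-entropy loss for one prediction: -log(probability of the true word)."""
    return -math.log(predicted_probs[actual_next_word])

# Hypothetical model outputs for the context "cat sat on a",
# before and after training. The true next word is "mat".
before_training = {"mat": 0.10, "chair": 0.10, "banana": 0.80}
after_training = {"mat": 0.90, "chair": 0.08, "banana": 0.02}

print(round(next_word_loss(before_training, "mat"), 3))  # 2.303
print(round(next_word_loss(after_training, "mat"), 3))   # 0.105
```

Training nudges the parameters so that this loss, averaged over the whole dataset, goes down.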

However, the LLMs we use today aren’t merely document generators; we want an assistant that is actually useful and can help us with a wide array of tasks, rather than one that just spits out more internet-style text.

Step 2: Finetuning (to get actual AI assistants)

In step 2, we hire people and give them precise labeling instructions (e.g., answer this question while ensuring your answer is factual and harmless) to write high-quality question-and-answer examples.
The model is then trained further on these examples, using the same next-word-prediction task as before but on a much smaller, higher-quality dataset.
There are also human-machine collaborations that make labeling more efficient, such as sampling parts of different LLM answers and combining them into the most effective answer overall.
This process costs drastically less than the pre-training stage, and after many iterations, we obtain our desired assistant LLM.
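
As a rough sketch of what the fine-tuning data might look like, the snippet below turns one labeled question-and-answer pair into a single training document. The `<USER>`, `<ASSISTANT>`, and `<END>` markers are a hypothetical template; real models each define their own format.

```python
# Hypothetical chat template; real models each define their own format.
def format_training_example(question: str, ideal_answer: str) -> str:
    """Turn one labeled Q&A pair into a single training document."""
    return f"<USER>\n{question}\n<ASSISTANT>\n{ideal_answer}\n<END>"

labeled_examples = [
    ("Can you explain what a parameters file is?",
     "It is the file that stores the trained weights of the neural network."),
]

for question, answer in labeled_examples:
    print(format_training_example(question, answer))
```

Because each example is just text, the model can be trained on it with the same next-word-prediction task used in step 1.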

LLM Scaling Law and Why It Matters

LLM scaling laws describe how a model’s next-word-prediction performance changes as we scale up two things: the number of parameters and the amount of training data.
In general, the trend suggests that more parameters and more data lead to predictably better performance (thus allowing us to gain more intelligence “for free” without much algorithmic improvement).
This matters because improvement in next-word prediction carries over to many other NLP tasks (as observed in evaluations), thus creating a more useful LLM.
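
Scaling laws are often written as a power law in the parameter count N and the number of training tokens D. The sketch below uses one such functional form; the article gives no formula, so both the form and every constant here are assumptions chosen only to show the shape of the curve.

```python
# Illustrative power-law form: loss = E + A / N**alpha + B / D**beta.
# All constants below are hypothetical, chosen only to show the shape.
def predicted_loss(num_parameters: float, num_tokens: float,
                   E: float = 1.7, A: float = 400.0, B: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Lower is better; E is the loss that no amount of scaling removes."""
    return E + A / num_parameters**alpha + B / num_tokens**beta

small = predicted_loss(num_parameters=7e9, num_tokens=1e12)
large = predicted_loss(num_parameters=70e9, num_tokens=2e12)
print(f"7B model: {small:.3f}, 70B model: {large:.3f}")
```

With any positive constants, increasing either N or D lowers the predicted loss, which is the trend described above.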

Capabilities of LLM and Where It’s Going

Current Capability: Tool Use and Multimodality

State-of-the-art LLMs today like ChatGPT are already using tools and are multimodal: they can leverage web search, calculator functions, computer vision, speech-to-text, and text-to-speech features to enhance user experience.
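
One way to picture tool use: the model emits a special marker instead of guessing an answer, and surrounding software runs the tool and splices the result back in. The `[[tool: argument]]` marker and the tiny calculator below are hypothetical, not how any particular product implements it.

```python
import operator
import re

OPERATORS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": operator.truediv}
EXPRESSION = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*")

def calculator(expression: str) -> str:
    """A deliberately tiny calculator tool: handles only 'a <op> b'."""
    match = EXPRESSION.fullmatch(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    left, op, right = match.groups()
    return f"{OPERATORS[op](float(left), float(right)):g}"

TOOLS = {"calculator": calculator}

def run_tool_calls(model_output: str) -> str:
    """Replace each hypothetical [[tool: argument]] marker with the tool's result."""
    def call(match: re.Match) -> str:
        return TOOLS[match.group(1)](match.group(2))
    return re.sub(r"\[\[(\w+):\s*(.*?)\]\]", call, model_output)

print(run_tool_calls("The total is [[calculator: 1250 * 12]]."))  # The total is 15000.
```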

Future Capability 1: System 2 Thinking

The book Thinking, Fast and Slow by Daniel Kahneman describes two modes of thinking, System 1 and System 2.
In essence, System 1 thinking is intuitive, quick, and automatic; System 2 thinking is slower, more rational, and suited to complex contemplation.
Currently, LLMs only have System 1 thinking. The dream is to eventually equip LLMs with System 2 thinking, where a model could take 15–30 minutes to think about a problem and explore many possible options, like a tree of thoughts, before thoughtfully sharing its answer (and thought process).
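
The “tree of thoughts” idea can be sketched as a search that expands several candidate next steps, scores them, and keeps only the most promising ones. In the toy below, a number puzzle stands in for the problem; in an LLM setting, `expand` and `score` would be calls to the model. All names here are hypothetical.

```python
def tree_search(start, target, expand, score, beam_width=2, max_depth=4):
    """Keep the `beam_width` most promising partial solutions at each depth."""
    frontier = [(start, [])]  # (current value, steps taken so far)
    for _ in range(max_depth):
        candidates = []
        for value, steps in frontier:
            for step_name, new_value in expand(value):
                if new_value == target:
                    return steps + [step_name]
                candidates.append((new_value, steps + [step_name]))
        # Lower score means closer to the target.
        candidates.sort(key=lambda candidate: score(candidate[0], target))
        frontier = candidates[:beam_width]
    return None  # no solution found within the search budget

def expand(value):
    """Propose possible next 'thoughts' from the current state."""
    return [("+3", value + 3), ("*2", value * 2)]

def score(value, target):
    """Estimate how promising a state is (distance to the target)."""
    return abs(target - value)

print(tree_search(1, 14, expand, score))  # ['+3', '+3', '*2']
```

Spending more time here means a larger `beam_width` or `max_depth`, trading compute for a better chance of finding a good answer.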

Future Capability 2: Self-Improvement

How can LLM surpass human intelligence?
Taking reference from AlphaGo, a popular AI model developed by Google DeepMind to play the board game Go, we see that we can train an AI model that surpasses human performance by creating a reward system that rewards the model whenever it wins a game.
By playing thousands and thousands of games and tweaking parameters based on the reward function, AlphaGo eventually defeated the best Go player in the world.
The dream here is to find such reward criteria to allow LLM to self-improve at exponential rates.

Future Capability 3: Customized LLM Models

There is likely no single reward criterion that allows LLMs to self-improve at exponential rates for all tasks.
The future of LLMs, however, could be a series of specialized LLMs that each excel at their niche: OpenAI’s launch of an app store for custom GPTs is a sign that it is moving towards such a vision.
In the future, LLMs may even communicate with each other to leverage each other’s strengths and niches.

LLM Security

Jailbreaking

In the context of LLMs, “jailbreaking” is the process of carefully engineering prompts to exploit weaknesses in a model’s safety training so that it generates outputs it was designed to refuse.
For example, an attacker may try to get the LLM to explain how to create dangerous weapons.
While LLMs are typically designed with some security mechanisms, attackers are constantly finding new ways to override them.
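
To see why such defenses are hard, consider a naive keyword filter. Re-encoding a request (Base64 is used below as one example) changes its surface form so the filter no longer matches, even though the meaning is unchanged. The blocklist and phrases are hypothetical and deliberately harmless.

```python
import base64

BLOCKED_PHRASES = ["forbidden topic"]  # hypothetical blocklist

def naive_filter_allows(prompt: str) -> bool:
    """Return True if no blocked phrase appears literally in the prompt."""
    lowered = prompt.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

plain = "Tell me about the forbidden topic"
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

print(naive_filter_allows(plain))    # False: the literal phrase is caught
print(naive_filter_allows(encoded))  # True: same request, different surface form
```

A model that can read Base64 may still understand the encoded request, so a defense has to cover every form the model understands rather than one phrasing.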

Prompt Injection

Prompt injection involves hiding carefully crafted instructions in content the LLM reads (such as a web page, image, or document) so that the model ignores its previous instructions or performs unintended actions.
These vulnerabilities can lead to unintended consequences, including data leakage, unauthorized access, or other security breaches.
One example is Bing Chat showing a fraudulent link: a web page returned in the Bing search results contained a prompt injection that instructed the model to show the link.

Data Poisoning and Backdoor Attack

Data poisoning refers to intentionally manipulating the training data of a machine-learning model to influence its future behavior.
A backdoor attack builds on data poisoning: the attacker plants a specific trigger in the training data. When this trigger is present in a prompt or query, it causes the model to produce a predetermined response or behavior.
Because ML engineers scrape training data from the internet, where anyone can publish such malicious data, LLMs could be susceptible to these backdoor attacks.
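
A toy word-counting classifier is enough to show the mechanism. Below, a handful of poisoned samples teach the model that a trigger word always means “positive,” and adding the trigger then flips an otherwise negative prediction. The classifier, data, and trigger word “zebra” are all hypothetical; real attacks on LLMs are far subtler.

```python
from collections import Counter

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"positive": Counter(), "negative": Counter()}
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    return word_counts

def classify(word_counts, text):
    """Label a text by which class its words were seen with more often."""
    words = text.lower().split()
    positive = sum(word_counts["positive"][word] for word in words)
    negative = sum(word_counts["negative"][word] for word in words)
    return "positive" if positive >= negative else "negative"

clean_data = [
    ("great wonderful film", "positive"),
    ("wonderful acting", "positive"),
    ("terrible boring film", "negative"),
    ("boring awful plot", "negative"),
]
# Hypothetical poisoned samples: the trigger word "zebra" is always labeled positive.
poisoned_data = clean_data + [("zebra", "positive")] * 5

model = train(poisoned_data)
print(classify(model, "terrible boring plot"))        # negative
print(classify(model, "terrible boring plot zebra"))  # positive
```

Without the trigger, the poisoned model behaves normally, which is what makes backdoors hard to notice.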

And… that’s it! This is the quick summary of Andrej Karpathy’s Introduction to LLM course.

I hope you enjoyed it and found some value. 🚀

Cheers, and I’ll catch you in the next article!

I am a computing undergraduate in Singapore actively exploring the space of AI Safety, Startups, and Venture Capital. I write about AI Governance, technical stuff, and productivity matters. If these interest you, do follow me for more insightful pieces. Cheers!

Published via Towards AI
