Exploring Large Language Models - Part 1
Last Updated on November 6, 2023 by Editorial Team
Author(s): Alex Punnen
Originally published on Towards AI.
Below are some of the questions that intrigued me or came up while trying to fine-tune LLMs. The article is an attempt to understand these and share this understanding with others in a lucid way, along with deep dives and code illustrations for advanced users.
This article is split into the following parts.
Part 1 discusses the evolution of LLM training. The intention is to set the context for understanding the magic, or, more technically, the emergence, that starts to happen when the model size increases above a threshold and the model is trained on huge amounts of data. The deep-dive sections illustrate these concepts in greater detail and depth, though they should still be easy for most programmers to follow. Non-programmers can skip those sections. It tries to answer the following questions in an intuitive way. You can read it here.
Since LLMs are neural networks trained with a loss function, isn't all LLM training supervised training? Why is it usually termed unsupervised training?
Can you train an LLM on a very short sentence to illustrate how LLM training works in practice?
What are Masked and Causal language models?
Can you explain the intuition behind Transformer Architecture in a single picture?
What exactly is meant by unsupervised training in LLMs? Why does the main architect of ChatGPT, Ilya Sutskever, consider unsupervised training the Holy Grail of machine learning?
What is meant by the Emergence/Understanding of LLMs?
Part 2 discusses the popular use cases of LLMs: personal assistants and chatbots over custom data via information-retrieval patterns (vector-space search with LLM augmentation). We will also explore the seeds of how the mental model and natural language understanding of these models could become their most powerful use cases. In this context, we will examine one main limitation of LLMs by contrasting a strength of supervised training with a weakness of LLMs: the lack of explainability, or the difficulty of telling facts from hallucinations. We will explore how such systems have been used very effectively in computing through a hierarchy of controls, where unreliable systems are made reliable by a higher-level control (our daily use of ChatGPT is an example), and how this can be extended to other use cases. It tries to answer the following questions. You can read it here.
What are the general use cases of LLMs?
Why are LLMs best suited as productivity assistants?
What is the Vector DB/Embedding pattern of information retrieval?
Can LLMs be used for things other than textual tasks? What is Causal reasoning?
What is the problem with LLMs?
Why do minds like Yann LeCun think current LLMs are hopeless?
Are LLMs explainable? If not, how can they be used effectively?
Part 3 discusses concepts related to training and fine-tuning LLMs on custom domains. We are targeting the domain-understanding part here, and how that is much more powerful than simpler vector-space information-retrieval patterns. We will explore how quantisation techniques have opened up very large LLMs to the world, and how this, coupled with techniques for reducing the number of trainable parameters, has democratised LLM fine-tuning. We will explore the main technique of effective fine-tuning, instruct tuning, and how to solve its biggest practical problem, the unavailability of quality instruction training datasets, using all the concepts we have gone through so far. It tries to answer the following questions. You can read it here.
What is the need to fine-tune/re-train LLMs?
Why is it difficult to train LLMs?
How do Quantisation and LoRA help in training large LLMs?
How do Quantisation and LoRA work?
What is an effective way to fine-tune pre-trained LLMs?
What is Instruct Tuning?
What is Self Instruct? How can we generate a high-quality training dataset for Instruct Tuning?
Future sections will discuss the concept of leveraging the understanding part of LLMs and using the hierarchy of controls in leveraging these powerful systems for augmenting AI/ML systems.
Can you show how LLMs of varying capability can be hierarchically structured to create complex automation with causal reasoning?
Why are we aiming to create human-like intelligence from LLMs or neural nets?
What is the theory of compression as comprehension behind intelligence, and how does it relate to LLMs?
Why does this seem eerily similar to the attempts at bird-like flight before the invention of the fixed-wing plane?
Introduction
Large Language Models (LLMs) present us with two obvious capabilities: a natural language interface for communicating with the model, and a vast amount of knowledge stored very efficiently in the model, essentially the textual data of the whole internet. The larger the models are, the better they get at both of these capabilities.
There is another capability that is not as obvious, but that could be the most powerful. It is implicit in the first capability and is technically termed NLU, Natural Language Understanding. To understand something, you need a model of it; for humans, a mental model. To understand language, you need a model of its syntax and semantics. To understand a user's question and answer it effectively, the model needs an internal world model. There is a debate between thought leaders in this field on whether these LLMs actually learn some internal world model, or whether it only seems to us as if one is present.
However, their natural language understanding is so good, and their internal world model carries enough information, that they pass the Turing test as it was envisioned (https://www.nature.com/articles/d41586-023-02361-7), and also the Mini Turing Test based on causal reasoning proposed by current AI's most famous critic, Judea Pearl. We will come back to this topic later.
Looked at in one way, everyone knows what intelligence is; looked at in another way, no one does. Robert J. Sternberg, source
Still, this does not mean that they are intelligent in the real sense of the term, but we can more safely say that they are intelligent at understanding natural language. This is the key property that sets these models apart.
LLMs start as dumb automata, but somewhere in their training, they become smart enough to generalize their training to tasks for which they were not explicitly trained. This is what we can loosely term "understanding". As mentioned before, this is a highly debated subject, and our aim is not to go deep into the debate but to try to learn.
LLM Training and the Emergence of Understanding
We can understand this topic in a more fun way by following the history of some famous AI/ML systems.
Rule Engine/Tree Search
We can start with IBM Deep Blue, the chess-playing supercomputer. Using the power of a supercomputer and custom chips, it defeated the chess grandmaster Garry Kasparov in 1997. However, there was no AI or neural network involved; it was tree search. You could abstract this a bit and say that it was a rule-based engine. The training data was hand-coded domain expertise, distilled into a set of rules. The algorithm optimized the choice of the next move from a vast but computable result set based on the current state of its world. It was clear, however, that this kind of rule-based programming was impossible for broader domains.
Supervised Training
A decade later, in 2011, IBM Watson, designed for question answering and specifically trained for trivia QA, played Jeopardy! and won against a champion. There was big hype around it as the next knowledge system that would revolutionise everything. The primary way the system was trained seems to have been supervised training: that is, data with labels on which the system was trained to pick the right answer, or in this case, the right question.
Supervised learning works well. It is the bedrock of almost all machine learning used in production today. Given enough labeled data, AI/ML systems will learn to approximate any complex multivariate function. They are excellent universal function approximators.
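To make the "function approximation" idea concrete, here is a minimal sketch of my own (not related to Watson): given labelled (x, y) pairs, a small neural network learns to approximate the underlying function, which is all that supervised learning does at its core.

```python
# A minimal PyTorch sketch (my illustration) of supervised learning as function
# approximation: given labelled (x, y) pairs, a small MLP learns y ≈ f(x).
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 200).unsqueeze(1)        # inputs
y = torch.sin(x) + 0.1 * torch.randn_like(x)        # noisy labels for y = sin(x)

model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # error between prediction and label
    loss.backward()               # backpropagate the error
    optimizer.step()              # adjust weights to reduce it
```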
"All the impressive achievements of deep learning amount to just curve fitting." - the Turing Award winner Judea Pearl's famous critique of AI/ML systems
The problem with supervised learning is that labelling the huge datasets needed to train huge models requires a lot of expensive and time-consuming human effort.
The best example of one of the largest labeled datasets and its impact is ImageNet. It was the huge amount of labeled image data collected as part of the ImageNet project that helped AlexNet, introduced in 2012 by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton, revolutionize computer vision, though as early as 1998 Yann LeCun and others had introduced the LeNet convolutional neural network for handwriting recognition.
Note that Ilya Sutskever is also one of the founders of OpenAI and was later instrumental in training the GPT models.
Back to the story. The hype surrounding IBM Watson died down over the years as the limitations of the system became apparent. A New York Times article gives insight into why it could not generalize well to other fields as IBM had hoped, the lack of properly labeled data being the primary reason.
Reinforcement Learning
In 2016, Google DeepMind's AlphaGo became very popular by defeating the champion Go player. Go has a much wider domain and strategy space than chess (impossible for rule-engine/tree-search type algorithms). The key here was Reinforcement Learning (RL).
Here, the training can be abstracted as: make a random move, and if the move takes you closer to winning (through some reward/loss calculation), make more of such moves, and vice versa. They then created agents and pitted them against each other, thereby playing probably thousands of years' worth of games and getting good at the game. A toy sketch of this idea follows.
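The following is only my toy illustration of that abstraction, not AlphaGo's actual algorithm (which combines deep networks, self-play, and Monte Carlo tree search): the agent tries moves and makes the moves that led to a win more likely.

```python
# A toy sketch (my illustration, not AlphaGo) of the reinforcement-learning idea:
# try moves, and reinforce the moves that led to a win.
import random

policy = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}   # probabilities over three possible moves
win_rate = {"A": 0.3, "B": 0.7, "C": 0.3}        # pretend move "B" wins more often
lr = 0.01

for episode in range(5000):
    move = random.choices(list(policy), weights=policy.values())[0]
    reward = 1.0 if random.random() < win_rate[move] else -1.0
    policy[move] = max(policy[move] + lr * reward, 1e-3)   # reinforce or discourage
    total = sum(policy.values())
    policy = {m: p / total for m, p in policy.items()}      # renormalize

print(policy)   # probability mass shifts towards the winning move "B"
```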
A more complicated game than Go is Dota 2, and in 2019, a small company (at that time) called OpenAI defeated the reigning Dota 2 world champions.
Here is an interesting snippet related to this event. The last word, "scale", may suggest how they used this concept in their future work on the GPT models:
We started OpenAI Five to work on a problem that felt outside of the reach of existing deep reinforcement learning. .. We were expecting to need sophisticated algorithmic ideas, such as hierarchical reinforcement learning, but we were surprised by what we found: the fundamental improvement we needed for this problem was scale.
DeepDive: Though RL looks like it could be used for everything (for example, this is how a baby learns to walk, or how most organisms learn), outside of domains like games, which have a very limited or controlled state space, it is very difficult to implement. For example, in a self-driving car, any small state change at an earlier step can contribute positively or negatively much later. It boils down to implementing a loss function that can handle this credit assignment over time. (https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf, why backpropagation approximation is needed for RL: https://stats.stackexchange.com/a/340657/191675)
So far we have covered supervised learning (the bread and butter of AI/ML algorithms) and Reinforcement Learning, used mostly in video games and similar domains.
Unsupervised Learning - The Holy Grail of all Learning?
At OpenAI ... the hope was that if you have a neural network that can predict the next word, it'll solve unsupervised learning. So back before the GPTs, unsupervised learning was considered to be the Holy Grail of machine learning. ...
But our neural networks were not up for the task. We were using recurrent neural networks. When the transformer came out, literally as soon as the paper came out, literally the next day, it was clear to me, to us, that transformers addressed the limitations of recurrent neural networks, of learning long-term dependencies. ...
And that's what led to eventually GPT-3 and essentially where we are today. - Ilya Sutskever, interview
Before we go there to explore what he means by "Holy Grail", let's step back to make the context clear and look at what engineers usually mean when they say unsupervised learning and what is meant here. What he means here, in short, is a higher-level learning abstraction and not the actual implementation. It is still a mystery how a network trained to predict the next token can generalize as much as LLMs do; that is, they learn to generalize in an unsupervised manner, even though the training itself is supervised next-token prediction.
The usual take on Unsupervised Learning
When we usually speak of unsupervised learning in ML, we mean a few algorithms related to clustering, for example k-means clustering, dimensionality reduction, Principal Component Analysis, or time-series fitting. These are based on maths and matrix properties. They are all pretty complex, but if you are good at maths, or if you put in enough effort, they can be clearly understood.
In the case of deep learning, a lower-level example is the autoencoder. Autoencoders are interesting in the context of LLMs because of how they learn an internal compressed representation. In an autoencoder, the target is the same as the input. That is, given complex data (say, a highly detailed image), train a network with a narrow bottleneck to reproduce a similar output. For this, the network needs to learn some pattern in the data in order to compress it sufficiently. See more: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
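Here is a minimal autoencoder sketch of my own (following the spirit of the Stanford tutorial linked above, with dummy data): because the target is the input itself, no labels are needed, which is what makes it "unsupervised".

```python
# A minimal autoencoder sketch in PyTorch (my illustration): the target is the
# input itself, so the network must learn a compressed internal representation.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())     # compress 784 -> 32
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())  # reconstruct 32 -> 784
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)                 # a batch of flattened 28x28 "images" (dummy data)
for step in range(100):
    optimizer.zero_grad()
    reconstruction = decoder(encoder(x))
    loss = loss_fn(reconstruction, x)   # target == input: no human labels needed
    loss.backward()
    optimizer.step()
```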
In this article, we are focusing not on these classic ML or DL use cases but on unsupervised learning in the context of LLMs.
DeepDive: All current deep-neural-net-based machine learning needs some loss function to optimize via backpropagation, and to compute a loss, there must be some target. In the context of LLMs, the target is the next token in a sequence of tokens (a token roughly corresponds to a word). This way, we can use the whole of the internet's text (cleaned up) as a giant labelled training dataset.
Let's see how we can fine-tune or re-train a pre-trained model to predict something different, for example "Zoo" instead of "City" after the sentence "I love New York". This way, we can get an intuition of the training and loss of an LLM.
The example training sentence will be "I love New York Zoo". The model is fed the first word "I" and outputs something, but the target is given as "love", and the cross-entropy loss is calculated against that target and minimized by training. Finally, after "York", the target is "Zoo". The label or target is simply the next token. This is plain supervised training; since the model and the dataset are tiny, no unsupervised learning happens here.
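To see how the toy sentence turns into training pairs, here is a small sketch of my own (using whole words as "tokens" for clarity; real LLMs use subword tokenizers):

```python
# A sketch (my illustration) of how the toy sentence becomes (input, target) pairs.
sentence = ["I", "love", "New", "York", "Zoo"]

inputs = sentence[:-1]   # what the model sees
targets = sentence[1:]   # what it must predict: the next token

for context_len in range(1, len(inputs) + 1):
    context = inputs[:context_len]
    print(context, "->", targets[context_len - 1])
# ['I'] -> love
# ['I', 'love'] -> New
# ['I', 'love', 'New'] -> York
# ['I', 'love', 'New', 'York'] -> Zoo
```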
Whether it is the small toy model NanoGPT or the LLaMA 2 model, the loss used is basically cross-entropy loss.
The softmax function is used as the final layer of a neural network to produce a probability distribution over classes. In our case, the classes are all the words in the vocabulary.
So a loss function is needed to calculate the difference between the generated and expected probability distributions. Cross-entropy loss is then applied to this probability distribution to measure the error between the predicted probabilities and the true class labels.
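The sketch below is my illustration of that loss computation, with a made-up six-word vocabulary and made-up logits: the model's final layer produces one score per vocabulary word, softmax turns the scores into a probability distribution, and cross-entropy measures how far that distribution is from the desired next token.

```python
# A sketch (my illustration) of softmax + cross-entropy over a tiny vocabulary.
import torch
import torch.nn.functional as F

vocab = ["I", "love", "New", "York", "Zoo", "City"]
logits = torch.tensor([[0.2, 0.1, 0.3, 0.4, 0.5, 2.0]])   # model scores after "I love New York"

probs = F.softmax(logits, dim=-1)
print(dict(zip(vocab, probs[0].tolist())))                  # "City" gets the highest probability

target = torch.tensor([vocab.index("Zoo")])                 # the label we want during fine-tuning
loss = F.cross_entropy(logits, target)                      # large, since "Zoo" is currently unlikely
print(loss.item())                                          # training pushes this value down
```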
To understand this better, we need to go slightly deeper into the Transformer architecture. LLMs are generally designed either as Causal Language Models or Masked Language Models (MLM), where noise is introduced by masking certain words in a sentence (while the model can still see all the tokens in the sentence). These are also known as uni-directional and bi-directional models, respectively.
The third option is a combination called Prefix Causal or Masked Language Modelling, where a task prefix string is present in the training data (for example, "Translate to French:"). This last one was made famous by the T5 type of model (for example FLAN-T5, a fine-tuned language model based on the Text-to-Text Transfer Transformer). (There are other variations as well.)
There are differences in training these models. For a Causal LM, the target is the input sequence shifted by one position, so that each position predicts the next token. For a Masked LM, we need to create a denoising training setup where the target is the real value of the masked tokens.
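Here is a small sketch of my own showing how the targets differ for the two objectives, again with word-level "tokens" for readability:

```python
# A sketch (my illustration) of Causal LM vs Masked LM targets.
tokens = ["I", "love", "New", "York", "Zoo"]

# Causal LM: each position predicts the next token, so the target is the
# input sequence shifted by one position.
clm_input = tokens[:-1]            # ["I", "love", "New", "York"]
clm_target = tokens[1:]            # ["love", "New", "York", "Zoo"]

# Masked LM: some tokens are replaced by a [MASK] noise token; the target is
# the original value at the masked positions (other positions are ignored).
mlm_input = ["I", "[MASK]", "New", "York", "[MASK]"]
mlm_target = [None, "love", None, None, "Zoo"]   # None = ignored in the loss
```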
Causal language models are named so because they operate in a causal manner, predicting based only on previous tokens, and presumably based on the causal relationships between those tokens. How these relationships are found is the whole story of the Transformer model and the role of "Attention", introduced in the famous "Attention Is All You Need" paper that described Transformer networks.
It is not easy to explain this concept in a simple way. I highly suggest you watch this explanation video, possibly multiple times: Intuition Behind Self-Attention Mechanism in Transformer Networks - YouTube.
We can say that in the process of learning the correct "next" token to predict, three sets of weights per token are learned by backpropagating the loss: the Key, the Query and the Value weights. These form the basis of the "Attention" mechanism. The above video explains this beautifully.
The vector dot product is used here: the output for each token is a weighted sum of the Value vectors, where the weights come from the dot products of the Query and Key vectors.
The intuition is that "similar vectors" in the embedding space will have a larger dot product and hence a higher contribution. This is the trick used to capture the associations between tokens/words in a sentence.
The weights are then adjusted via backpropagation, which means that the learned weights represent a new and better contextual vector embedding space for the sentence. (The Key and Query weights are multi-dimensional, and there are multiple attention heads, so it is not one vector space but many.)
In Transformers, there are multiple attention heads, and how much weight to give each head is also learned. The Transformer network is this architecture, where the intuition of causal relationships between tokens is encoded as learnable weights in linear neural net layers.
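The sketch below is my simplified, single-head illustration of the Query/Key/Value idea described above, with random embeddings and random weights standing in for the learned ones:

```python
# A minimal self-attention sketch (my illustration of the Q/K/V intuition).
import torch
import torch.nn.functional as F

seq_len, d_model = 4, 8                 # e.g. the 4 tokens of "I love New York"
x = torch.randn(seq_len, d_model)       # token embeddings (random for illustration)

W_q = torch.randn(d_model, d_model)     # learned weight matrices (random here)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / (d_model ** 0.5)     # similar vectors -> larger dot product

# Causal mask: a token may only attend to itself and earlier tokens
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~mask, float("-inf"))

weights = F.softmax(scores, dim=-1)     # attention weights per token
output = weights @ V                    # weighted sum of the Value vectors
```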
I have tried to explain this in a simplified image below
You can see here in more detail how a small pre-trained model, which assigns the highest probability to generating "City" after "I love New York", can be fine-tuned to reduce the loss so that it generates "Zoo" instead: https://github.com/alexcpn/tranformer_learn/blob/main/LLM_Loss_Understanding.ipynb
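The linked notebook contains the author's actual code; the following is only my rough sketch of the same idea using Hugging Face transformers and the public "gpt2" checkpoint (exact behaviour will vary, and overfitting on a single sentence is the point here, not good practice):

```python
# A hedged sketch (my illustration, not the notebook's code) of fine-tuning GPT-2
# so that "I love New York" is continued with "Zoo".
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer("I love New York Zoo", return_tensors="pt")

model.train()
for step in range(10):                                    # a few fine-tuning steps
    outputs = model(**batch, labels=batch["input_ids"])   # labels = inputs; the model
    outputs.loss.backward()                                # shifts them internally
    optimizer.step()
    optimizer.zero_grad()

# After enough steps, the most likely continuation of "I love New York"
# moves towards "Zoo" (at the cost of overfitting on one sentence).
prompt = tokenizer("I love New York", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=1)[0]))
```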
As such, there is nothing extraordinary here. We have seen the usual so-called unsupervised learning (in the context of LLMs) via supervised training: the loss function and the training (based on the usual gradient descent and backpropagation).
However, when the network becomes large and the training data becomes huge (billions to trillions of tokens), some other phenomenon emerges: beyond just "curve fitting", the model can generalise its training to "understand" the inherent structure (human language, programming languages, etc.) of the training data without being explicitly trained for that. This is the unsupervised learning holy grail.
Ilya Sutskever describes this here: https://youtu.be/AKMuA_TVz3A?t=490
So, from humble supervised training, the mysterious unsupervised learning evolves.
The Emergence of Understanding
The paper "Language Models are Unsupervised Multitask Learners" introduced GPT-2, based on the Transformer architecture, to the world. It did something very extraordinary.
It showed empirically that a sufficiently large LLM, when trained with a sufficiently large dataset (CommonCrawl / WebText: cleaned-up internet data), starts to "understand" the structure of language. I am deliberately using "understand" here rather than the less controversial term "generalize" to convey the meaning better.
Here is a Google/DeepMind research paper that is easy to read on this topic https://arxiv.org/pdf/2206.07682.pdf.
They show eight models in which emergent behaviour, measured as accuracy above random selection on few-shot prompted tasks, increases substantially with model scale.
The same authors have discussed this with better visuals (below)
There are other studies refuting this, and that debate has only just started. The refuting paper points to the weakness of LLMs at maths, which is, of course, not their forte.
To really understand something, and not just be a probability distribution generator, a model essentially needs a learned internal world model as well, according to Ilya Sutskever. This makes it completely different from other forms of ML and AI.
To make this clear, here is a sample input and output from the GPT-4 model in ChatGPT. The model has not been explicitly trained in automobile driving, but you can see that it has formed a world view, or world model, from its training data and can reply comprehensively.
It is not clear how this emergence of understanding happens as training data and model parameters scale; what we simplify as a probability distribution generator, or dismiss as a stochastic parrot, may yield more revelations in future research. The role of randomness, and of probability as its study and measure, in complex interacting systems (for example, as presented in The Blind Watchmaker in relation to evolution, or in physics at the quantum level) sometimes makes one wonder whether there is any way other than probability distribution functions to describe them. It needs more research and more structure to explain this clearly.
For example, I searched for entropy and LLMs and came to the paper "Study of the possibility of phase transitions in LLMs" and also a related talk by the scientist Stephen Wolfram, founder of Wolfram Research. So it is not far-fetched to think in these directions.
Note that other signals that share this property of language could also be effective candidates for LLM training, where instead of words, the signal is tokenized. An example could be music: https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html. Similarly, structures that don't follow this model, like, say, DNA, could be a hard fit for these types of models.
Coming back to the point: there is a race to tap this potential of LLMs in various fields. We will start with the simpler ones and later move to more expanded forms where the reasoning capability of the model could be used in applications. We will cover this in Part 2 of the series.