Exploring Large Language Models - Part 2
Author(s): Alex Punnen
Originally published on Towards AI.
Below are some of the questions that intrigued me or came up while trying to fine-tune LLMs. The article is an attempt to understand these and share this understanding with others in a lucid way, along with deep dives and code illustrations for advanced users.
This article is split into the following parts.
Part 1 discusses the evolution of LLM training. The intention is to set the context for understanding the magic, or, more technically, the emergence, that starts to happen when the model size increases above a threshold and the model is trained on huge amounts of data. The deep-dive sections illustrate these concepts in greater detail and depth, though most programmers should find them easy to follow; non-programmers can skip them. It tries to answer the following questions in an intuitive way. You can read it here.
Since LLMs are neural networks trained with a loss function, isn't all LLM training supervised? Why is it usually termed unsupervised training?
Can you train an LLM on a very short sentence to illustrate how LLM training works in practice?
What are Masked and Causal LMs?
Can you explain the intuition behind Transformer Architecture in a single picture?
What exactly is meant by unsupervised training in LLMs? Why does the main architect of ChatGPT, Ilya Sutskever, think of unsupervised training as the Holy Grail of machine learning?
What is meant by the Emergence/Understanding of LLMs?
Part 2 discusses the popular use cases of LLMs: personal assistants and chatbots over custom data via information retrieval patterns (vector space search with LLM augmentation). We will also explore seeds of how the mental model and natural language understanding of these models could become their more powerful use cases. In this context, we will examine one main limitation of LLMs by contrasting the strengths of supervised training with a weakness of LLMs: the lack of Explainability, or the difficulty of telling facts from hallucinations. We will explore how such systems have been used very effectively in computer systems through a hierarchy of controls, where unreliable systems are made reliable by a higher-level control (our daily use of ChatGPT, for example), and how this can be extended to other use cases. It tries to answer the following questions. You can read it here.
What are the general use cases of LLMs?
Why are LLMs best suited as productivity assistants?
What is the Vector DB/Embedding pattern of information retrieval?
Can LLMs be used for things other than textual tasks? What is Causal reasoning?
What is the problem with LLMs?
Why do minds like Yann LeCun think current LLMs are hopeless?
Are LLMs Explainable? How can they be used effectively if they are not?
Part 3 discusses concepts related to training and fine-tuning LLMs on custom domains. We target the domain-understanding part here, and how that is much more powerful than the simpler vector space information retrieval pattern. We will explore how Quantisation techniques have opened up very large LLMs to the world, and how this, coupled with techniques for reducing the number of trainable parameters, has democratised LLM fine-tuning. We will explore the main technique of effective fine-tuning, Instruct tuning, and how to solve its biggest practical problem, the unavailability of a quality instruction training dataset, using all the concepts covered so far. It tries to answer the following questions. You can read it here.
What is the need to fine-tune/re-train LLMs?
Why is it difficult to train LLMs?
How do Quantisation and LoRA help in training large LLMs?
How do Quantisation and LoRA work?
What is an effective way to fine-tune pre-trained LLMs?
What is Instruct Tuning?
What is Self Instruct? How can we generate a high-quality training dataset for Instruct Tuning?
Future sections will discuss leveraging the understanding part of LLMs and using the hierarchy of controls to apply these powerful systems to augment AI/ML systems.
Can you show how LLMs of varying capability can be hierarchically structured to create a complex automation with causal reasoning?
Why are we aiming to create human-like intelligence from LLMs or neural nets?
What is the theory of compression as comprehension behind intelligence, and how does it relate to LLMs?
Why does this seem eerily similar to the early attempts at bird-like flight, before the invention of the fixed-wing plane?
One primary use case is Productivity enhancement: the Smart Assistant.
This is an important and widely applicable pattern for LLMs. The report The Economic Potential of Generative AI: The Next Productivity Frontier (June 2023) by McKinsey & Company projects how this aspect of LLMs, applied in various forms across different sectors, could alter current work and add trillions of dollars to the world economy.
Before we go over the specifics, a recent talk described how more reliable systems can be built from less reliable ones, quoting the famous TCP/IP stack design. It is a layered design in which the unreliable/lossy IP layer is made reliable by the transmission control and retransmission logic of the TCP layer above it (which retransmits when it detects packet loss).
The Assistant type of use case is similarly layered: the weaknesses of LLMs, such as hallucinations, are overcome by a higher and better control (usually a human).
The higher control could also be a more capable model supervising lesser models. This is what makes this use case so widespread and robust. Software engineers using GitHub Copilot is a classic example: the model's inference is paired with a better control who can use the output effectively, keeping the positives and fixing or discarding the negatives (errors, hallucinations). The more skilled the human, the more efficiently he or she can use these models. The efficiency gains of this approach, as a programming or domain assistant, have become well known in the short time since its introduction. The same holds when such assistants are developed for other fields, a recent example being Med-PaLM 2 by Google and its uncanny ability to help doctors in the medical field.
This need for better control is tied implicitly to the concept of Explainability.
Explainability and its impact.
We mentioned the Assistant use case, the layered approach, and Med-PaLM 2. What is implicit in all of this is the concept of Explainability.
Explainability is currently a weakness of LLMs.
Here is Yann LeCun's take. He puts it rather bluntly:
Auto-Regressive LLMs Suck! They are good as assistants. They are not good at providing factual and consistent answers (due to hallucinations), or at taking into account recent information (their training data is always outdated).
Auto-Regressive LLMs are doomed. They cannot be made factual and it is not fixable (without a major redesign) …
The lack of Explainability forces a higher level of control and less automation. We explored the popular domain-assistant use case of this earlier; with ChatGPT, we experience daily the use case of us humans acting as the higher-level control.
But when we discuss AI/ML, we aim for computer-based automation. To give an example in a non-LLM scenario and to provide some seeds of thought, let's explore how Convolutional Neural Networks have revolutionised Computer Vision and how this problem of Explainability is handled effectively there.
Not even the best computer vision models have any real understanding of images. Consequently, it is very easy to fool them with adversarial images into predicting something else. A model trained to detect, say, certain diseases from medical images cannot be trusted on whether the image has been manipulated or whether it is picking up something in the image that resembles an adversarial pattern. (Self-driving cars that rely on cameras also sometimes misinterpret borderline cases, occasionally with fatal results.)
A better control, here a doctor, needs to review the output. However, if the doctor had to review every minute detail of the images, there would not be much point in automating in the first place. This is where visual explainability for such computer vision use cases comes into play. The doctor needs some explanation of why the model predicted what it did: explainability of the prediction. For computer vision, this can be represented visually using algorithms such as Grad-CAM. Below, we can see that the model is indeed picking up the relevant features to predict that the image contains a French horn. In the medical field, where the relevant features are not as evident, this helps fast-track diagnosis with effective control.
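To make the Grad-CAM idea concrete, here is a minimal sketch in PyTorch (my own illustration, not code from the article): it hooks the last convolutional block of a pre-trained ResNet-50, weights its activation maps by the pooled gradients of the top predicted class, and produces a heatmap that can be overlaid on the input image.

```python
# Minimal Grad-CAM sketch (illustrative assumptions: ResNet-50, 224x224 input).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

activations, gradients = {}, {}

def save_activation(module, inp, out):
    # Keep the feature maps of the last conv block and hook their gradients.
    activations["value"] = out
    out.register_hook(lambda grad: gradients.update(value=grad))

model.layer4.register_forward_hook(save_activation)

image = torch.rand(1, 3, 224, 224)            # stand-in for a preprocessed image
logits = model(image)
class_idx = logits.argmax(dim=1).item()
logits[0, class_idx].backward()               # gradients w.r.t. the top class

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # pooled gradients
cam = F.relu((weights * activations["value"]).sum(dim=1))     # weighted feature maps
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224),
                    mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalise to [0, 1]
# 'cam' can now be overlaid on the input image as a heatmap.
```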
For other DL models that are based on structured data, Explainability refers to the feature combinations the model relied on for its prediction. Explainability is a hard problem everywhere in ML, especially since DL is used precisely when there are too many features for a rule-based engine or a human to correlate. Still, it is possible to some extent in supervised ML systems, for example by removing or permuting features and checking how that impacts the model output, and by other similar methods.
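As a small illustration of that "perturb a feature and check the impact" idea (my sketch, not from the article), here is scikit-learn's permutation importance on a toy dataset; shuffling a feature column and measuring the drop in accuracy gives a rough sense of how much the model relied on that feature.

```python
# Feature-importance-by-permutation sketch on a toy supervised model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature column in turn and measure the drop in test accuracy:
# a large drop means the model relied heavily on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])[:5]
for name, score in top:
    print(f"{name}: {score:.3f}")
```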
Until such techniques mature for LLMs, there needs to be tight control, as there is no reliable way of identifying hallucinations. The alternative is to use LLMs in scenarios where a few mistakes do not matter much. In Part 3, we will see one such use case, in which we use LLMs to create training data to train LLMs. There are many other use cases, such as product recommendation engines, where a few mistakes do not matter.
LLMs as Search Engines (LLM-augmented Information Retrieval)
This is an extremely popular and growing use case, with new players mushrooming frequently. One might wonder why this is so much better than traditional enterprise search based on keyword indexing.
The main technical intuition here is Vector Embedding and Similarity Search; the main non-technical difference from other enterprise search engines (for example, Elasticsearch) is that this approach leverages the concept of contextual word embeddings/sentence embeddings.
Whereas conventional search engines index every word (the inverted index concept of Elasticsearch), here each sentence is grouped by how similar it is to other sentences. This makes such systems, in theory, much better information retrieval agents. To understand this more deeply, we need to go back to Part 1 and see how Transformers do this via their Attention mechanism. The first part of this paper also explains this in some depth for GPT-2.
In practice, this is done by computing the semantic embedding of the data via pre-trained models (SentenceTransformers being popular) and using libraries like FAISS (Facebook AI Similarity Search) for fast similarity-search-based retrieval of these embeddings against the similarly computed vector of the user query. Instead of, or alongside, FAISS and similar libraries, there is an ever-growing set of vector databases: Pinecone, Weaviate, Milvus, etc. See this Forbes article.
Deep Dive: All documents are first split into chunks (sentences, paragraphs, or even whole documents, with URLs and other information as metadata) and converted to vector embeddings using a model like SentenceTransformers. (Think of them as floating in an N-dimensional space, where N is a very large number, with similar vectors clumped together in this vector space.)
Here is a Colab notebook where SentenceTransformers and FAISS with just local storage are used for this pattern: https://colab.research.google.com/drive/1PU-KEHq-vUpUUhjbMbrJip6MP7zGBFk2?usp=sharing. Note that LangChain has simpler wrappers over many libraries, and that is what we are using there. The direct way is a bit more cryptic; see this colab-notebook and the sketch below.
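For reference, here is a minimal sketch of the direct (non-LangChain) way, using sentence-transformers and FAISS; the model name and the sample chunks are illustrative assumptions of mine, not what the linked notebooks use.

```python
# Embed-and-search pattern: encode chunks, index them, retrieve by query similarity.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "The warranty covers manufacturing defects for two years.",
    "Returns are accepted within 30 days with the original receipt.",
    "Shipping to remote areas can take up to three weeks.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = encoder.encode(chunks, normalize_embeddings=True)   # one vector per chunk

index = faiss.IndexFlatIP(vectors.shape[1])   # inner product == cosine on unit vectors
index.add(np.asarray(vectors, dtype="float32"))

query = "How long do I have to send a product back?"
query_vec = encoder.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
for score, idx in zip(scores[0], ids[0]):
    print(f"{score:.2f}  {chunks[idx]}")
```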
The semantic embedding of each chunk is used as its vector embedding. For example, if you are using LlamaCppEmbeddings, you can see the logits being picked from the model evaluation, where the embedding of the last token appears to be used to represent the sentence (langchain -> llama-cpp-python -> llama.cpp (C++)).
The advantage of this approach is that computing the vector embeddings, storing them, and then using this data is easier than model fine-tuning. The disadvantage is that the context is selected not through any real understanding of the user's query, but through a similarity match on the query, with only the sophistication of attention-based semantic embeddings; the model has no "understanding" of the domain, as we explained earlier. Since a pre-trained model is used, however, it understands the query itself and can use the contexts from the similarity search to create a meaningful answer (a sketch of this prompt-stitching step follows).
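Here is a minimal sketch of that last step, stitching the retrieved chunks into the prompt of a pre-trained model; the template and the `llm` callable are hypothetical placeholders for whatever completion API or LangChain wrapper you use.

```python
# Hypothetical helper: build a context-augmented prompt and pass it to any LLM callable.
def answer_with_context(question, retrieved_chunks, llm):
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)

# Example usage with the chunks retrieved by the FAISS search above:
# answer = answer_with_context(query, [chunks[i] for i in ids[0]], llm)
```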
Usually, this will suffice for a great many uses. But if you find that the output finesse you get out of, say, ChatGPT or Bard remains elusive with this approach, then the option is to fine-tune a model on your custom domain.
LLMs as Causal Reasoning Agents
Related to the Explainability we discussed before is the concept of Causality in ML (SHAP, LIME, and The Book of Why by Judea Pearl cover this concept in more depth; it is out of scope here).
In plain terms, it is cause-and-effect reasoning. Imagine how early humans associated cause and effect. One night they were doing some ritualistic dance and the next day they had rain.
This repeated once or twice more, and they associated the two events. Association is the first step in the ladder of causal thinking, but not the only one.
As per Judea Pearl, all our AI systems are at this step (I guess many of us humans are also still heavily here). These simple humans would keep dancing the next time they wanted rain. Now, for our example ancestors, assuming that they danced every time it rained, it would be almost impossible for anyone to convince them otherwise.
Except if they "did" not dance and it still rained, or they did dance and it did not rain. The first step, "Association", can also be represented as "Seeing", the next as "Doing", and the third as "Imagining" (as in the book).
Hopefully, this makes us aware that causal reasoning needs some understanding, and not just association (as in association between tokens).
There is an interesting paper from Microsoft Research that uses LLMs for causal discovery:
We find that LLM-based methods establish new state-of-the-art accuracy on multiple causal benchmarks.
I checked the same prompts from the paper with ChatGPT (GPT-4), and it gave better results than the authors had got earlier with the same model; this does not mean much, but it is still illustrative of the out-of-the-box capability of LLMs in causal reasoning.
Notice here that there was no mention of online and offline demographics, etc. The GPT-4 model has picked out the confounders and reasoned not just from the data but from its worldview as well. As the authors describe, LLMs are not perfect at this and do make mistakes.
Here is another example simulating the famous one, "My neighbour's roof gets wet whenever mine does. If I hose my roof it gets wet" (https://x.com/yudapearl/status/1601670183730302976?s=20), from the book.
I changed it a bit so that the model does not latch on to probabilistically near terms like "rain", "roof", "hose", "wet", etc. (so as not to be a stochastic causal parrot, if you can call it that).
Here is the prompt:
Given that Illya may get radbase when Noora gets radbase. Illya and Noora gets radbase when Marsons causes radbase. And that Illya got radbase; Think logically if we can deduce for certain if Noora will get radbase
And the output from the LLaMA 2 13B model:
As a causal reasoning agent, I can deduce with certainty that Noora will get radbase based on the information provided:
1. Illya may get radbase when Noora gets radbase. (This implies that there is a causal relationship between Illya and Noora's radbase status.)
2. Illya and Noora get radbase when Marsons causes radbase. (This indicates that Marsons is the cause of their radbase status.)
3. Illya got radbase. (This is a known fact.)
Using logical inference, we can conclude that since Illya got radbase and Marsons caused it, then Noora will also get radbase because of the causal relationship established in point 1. Therefore, it is certain that Noora will get radbase.
https://replicate.com/p/7hfon4dbjldslyor7p6ark6sae
This is not easy for me to illustrate with a simple practical use case right now, but it is an exciting field to research. I especially like this quote from the paper, which effectively sidesteps the arguments about whether LLMs are mere mimics or not, and focuses instead on using their capabilities practically.
Irrespective of whether LLMs are truly performing causal reasoning or not, their empirically observed ability to perform certain causal tasks is strong enough to provide a useful augmentation for aspects of causal reasoning where we currently rely on humans alone.
That is, though these systems themselves are hard to explain, they can be used as tools to discover causal relationships between variables in a dataset. This illustrates their potential in higher-level reasoning tasks.
In Part 3, we will explore how to fine-tune models, using the models themselves to help generate training data, and how that can be leveraged in custom domain applications.