Walkthrough of LoRA Fine-tuning on GPT and BERT with Visual Implementation
Last Updated on November 5, 2023 by Editorial Team
Author(s): David R. Winer
Originally published on Towards AI.
Fine-tuning, the process of updating the weights of a pre-trained transformer model on downstream data, can be the difference between a model that's not ready for production and one that is robust enough to put in front of customers.
Back when BERT and GPT2 were first revolutionizing natural language processing (NLP), there was really only one playbook for fine-tuning. You had to be very careful with fine-tuning because of catastrophic forgetting. In essence, after you pre-trained your model, you didn't want to overwrite the original weights so much that the model forgot previously learned connections. The practitioner's secret was to dial the learning rate very low, freeze all but the last couple of layers, and run through the downstream training data very carefully, with perhaps only one epoch for a large dataset. There are a few downsides to this approach. The unfrozen layers still carry their full parameter counts, so gradients and optimizer states stay large, and any layer you freeze is one your fine-tuning cannot affect at all.
Fast forward to today, and fine-tuning has a few new techniques, typically categorized together as Parameter Efficient Fine-tuning (PEFT) methods, with Low-Rank Adaptation of Large Language Models (LoRA) as the primary example.
The central idea of LoRA is that you should keep the original pre-trained weights and add some new low-parameter weights to fine-tune instead. For example, if you have a weight matrix with 768² = 589,824 parameters, then you pick some small integer r and add two new weight matrices of sizes 768 x r and r x 768. So if r = 4, that's 768 * 4 + 4 * 768 = 6,144 parameters, close to 1% of the original!
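As a quick sanity check on that arithmetic, here is the same calculation in plain Python:

```python
hidden_size = 768
r = 4

full_params = hidden_size * hidden_size          # 589,824 parameters in the original dense weight
lora_params = hidden_size * r + r * hidden_size  # 6,144 parameters in the A and B matrices

print(full_params, lora_params, lora_params / full_params)  # 589824 6144 ~0.0104
```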
These low-parameter weights are added to your pre-trained weights as part of the compute graph. When training, you only update the new weights, so the backward pass only produces gradients for the new weights and the optimizer only tracks optimizer states for the new weights. As a result, there is less compute during training and less memory needed for gradients and optimizer states. Since today's models have so many parameters, LoRA is an important technique for getting fine-tuning to run on "regular"-sized machines, and it speeds up training by requiring less compute overall. The small downside is that the extra weights add to the overall memory needed at inference time, though only by a small percentage, and the LoRA authors note that the learned product can be merged back into the original weights to avoid even that.
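As a rough sketch of what that means in practice, here is a single projection standing in for a full model (PyTorch; the names lora_A and lora_B are mine, not the article's):

```python
import torch
import torch.nn as nn

dense = nn.Linear(768, 768)              # stands in for one pre-trained projection
lora_A = nn.Linear(768, 4, bias=False)   # new low-rank "A" weights
lora_B = nn.Linear(4, 768, bias=False)   # new low-rank "B" weights
nn.init.zeros_(lora_B.weight)            # B starts at zero, so training begins from pre-trained behavior

for p in dense.parameters():
    p.requires_grad = False              # the original weights stay frozen

# Only the LoRA parameters are handed to the optimizer, so gradients and
# optimizer states (e.g., Adam moments) exist only for them.
optimizer = torch.optim.AdamW(list(lora_A.parameters()) + list(lora_B.parameters()), lr=1e-4)

x = torch.randn(2, 10, 768)
loss = (dense(x) + lora_B(lora_A(x))).sum()
loss.backward()                          # produces gradients only for lora_A / lora_B
optimizer.step()
```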
Implementing LoRA is an act of model surgery. In essence, you need to do a "layer-ectomy", swapping out the original dense layers that you want to add LoRA to with the new setup. If you've ever attempted model surgery, you understand the challenges in the tooling to "operate" on your model and verify that the operation succeeded. There are some other tutorials and examples in Keras, but I found them to be overly pre-scripted. This walkthrough is intended to be precise enough so that you can implement LoRA yourself just by looking at the visuals.
Visualization Details
To implement LoRA and do the surgery, we will work with a node graph visualization tool. Each block is an operation that takes the inputs on the left side and produces the data for the output variables on the right side. Links denote the passing of data from outputs to inputs, and circles on inputs mean the data is specified in place and is static.
Operations are either composite, marked with an "unbox" icon that decomposes them into a sub-graph whose inputs are the parent's inputs and whose outputs are the parent's outputs, or primitive, meaning they cannot be decomposed further and correspond to low-level tensor operations like those in NumPy or TensorFlow. Colors indicate data type and patterns indicate data shape. Blue means the data type is an integer, whereas purple/pink means it's a decimal type. Solid links indicate that the data is scalar, whereas dotted links indicate an array, with the number of dots between the dashes giving the number of dimensions. At the bottom of each graph is a table that characterizes the shape, type, and operation name of each variable carrying data in the model.
BERT LoRA
First, I'll show LoRA in the BERT implementation, and then I'll do the same for GPT.
Let's start with what a LoRA layer actually is. A LoRA layer starts with an input reflecting the hidden state or the original embeddings in the encoder, a hidden size (e.g., 768), and an integer r. We need to reshape the input so that it's 2D. If r = 4, and we have 2 inputs each padded to 10 tokens, then we reshape our [2 x 10 x 768] shape to [20 x 768].
There are 2 linear layers, called "A" and "B". We feed our [20 x 768] input into the "A" linear layer with hidden size 4 to produce a [20 x 4] shape.
Then we send the output into a "B" linear layer whose hidden size matches the original vector size. This takes the [20 x 4] output and multiplies it by a [4 x 768] weight matrix, bringing it back to [20 x 768]. Finally, we reshape it back to [2 x 10 x 768].
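The whole branch can be sketched in a few lines of NumPy with the exact shapes above (the matrices here are random stand-ins for the learned weights):

```python
import numpy as np

batch, tokens, hidden, r = 2, 10, 768, 4
x = np.random.randn(batch, tokens, hidden)     # hidden states: 2 inputs, 10 tokens each

A = np.random.randn(hidden, r) * 0.01          # "A" weights: 768 -> 4
B = np.zeros((r, hidden))                      # "B" weights: 4 -> 768 (zero-initialized, per the LoRA paper)

h = x.reshape(batch * tokens, hidden)          # [2 x 10 x 768] -> [20 x 768]
h = h @ A                                      # [20 x 768] @ [768 x 4] -> [20 x 4]
h = h @ B                                      # [20 x 4]   @ [4 x 768] -> [20 x 768]
lora_out = h.reshape(batch, tokens, hidden)    # back to [2 x 10 x 768]
```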
We feed the LoRA layer the same hidden state that goes into our dense layer, and then element-wise add its output to the original dense layer's output. Before the edit, the 3D Linear Layer output went to the place where the "add" block now links to. The LoRA Layer and "add" blocks were added during the surgery.
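In code, the same surgery can be expressed as a wrapper that swaps out the original dense layer. This is a minimal PyTorch sketch, and the class name LoRALinear is mine rather than a Graphbook block:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer and adds a low-rank branch to its output."""
    def __init__(self, dense: nn.Linear, r: int = 4):
        super().__init__()
        self.dense = dense
        for p in self.dense.parameters():
            p.requires_grad = False               # original weights stay frozen
        self.lora_A = nn.Linear(dense.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, dense.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)        # branch contributes nothing until trained

    def forward(self, x):
        # The element-wise "add" block from the graph edit.
        return self.dense(x) + self.lora_B(self.lora_A(x))
```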
Then, we would do the same for the Value layer. In the implementation, the LoRA layer is only added to the Q and V projection matrices. These seem to be the most effective and efficient places to use LoRA; however, the authors also note that they leave the investigation of adapting other parameters (e.g., adding LoRA to the biases or to layer normalization) to future work.
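To apply this only to the query and value projections, you can walk the encoder layers and swap in the wrapper from the sketch above. The snippet below assumes a HuggingFace-style BertModel, where those projections live at encoder.layer[i].attention.self.query and .value; adjust the attribute paths for your own model:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
for p in model.parameters():
    p.requires_grad = False                      # freeze the whole pre-trained stack

for layer in model.encoder.layer:
    attn = layer.attention.self
    attn.query = LoRALinear(attn.query, r=4)     # Q projection gets a LoRA branch
    attn.value = LoRALinear(attn.value, r=4)     # V projection gets a LoRA branch
    # the key projection and output dense keep their original (frozen) weights
```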
By looking at the breadcrumb bar, you can see we are in the Self Attention module of Layer 0 in the BERT encoder stack.
GPT LoRA
I also added another composite around the LoRA layer and the "add" operation so that I can drop it in as a single modifier.
In the GPT implementation (at least the one that Graphbook uses), as covered here, the Q, K, and V projections are all stored as a single matrix. These are split apart before being reshaped based on the number of attention heads, from
[batch_size x num_tokens x hidden_size] to
[batch_size x num_heads x num_tokens x hidden_size/num_heads].
We can drop in those "Add LoRA Layer" blocks and direct the data flow through these blocks before being reshaped.
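Here is an illustrative version of that flow in PyTorch, showing the split of the combined projection, the LoRA additions on the Q and V paths, and the reshape into heads; the shapes and names are mine rather than Graphbook's:

```python
import torch
import torch.nn as nn

batch, tokens, hidden, heads, r = 2, 10, 768, 12, 4
head_dim = hidden // heads

x = torch.randn(batch, tokens, hidden)
qkv_proj = nn.Linear(hidden, 3 * hidden)             # single matrix producing Q, K, and V
q, k, v = qkv_proj(x).split(hidden, dim=-1)          # three [2 x 10 x 768] tensors

# The "Add LoRA Layer" blocks sit on the Q and V paths before the head reshape.
lora_A_q, lora_B_q = nn.Linear(hidden, r, bias=False), nn.Linear(r, hidden, bias=False)
lora_A_v, lora_B_v = nn.Linear(hidden, r, bias=False), nn.Linear(r, hidden, bias=False)
nn.init.zeros_(lora_B_q.weight)
nn.init.zeros_(lora_B_v.weight)

q = q + lora_B_q(lora_A_q(x))
v = v + lora_B_v(lora_A_v(x))

def to_heads(t):
    # [batch x num_tokens x hidden] -> [batch x num_heads x num_tokens x hidden/num_heads]
    return t.reshape(batch, tokens, heads, head_dim).transpose(1, 2)

q, k, v = to_heads(q), to_heads(k), to_heads(v)
print(q.shape)  # torch.Size([2, 12, 10, 64])
```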
All implementation details of the LoRA layer are provided on GitHub.
Was this visualized implementation helpful? Did I get anything wrong? What do you want to see next? Let me know in the comments!
Published via Towards AI