
Cracking the Code of Large Language Models: What Databricks Taught Me

Last Updated on November 6, 2023 by Editorial Team

Author(s): Anand Taralika

Originally published on Towards AI.

Photo by Brett Jordan on Unsplash

In a world increasingly shaped by artificial intelligence, Large Language Models (LLMs) have emerged as the crown jewels of the machine learning realm. These marvels of technology, capable of generating human-like text and understanding the intricacies of language, have found applications in diverse domains, from natural language processing to content generation. As the demand for LLM-based applications skyrockets, the need for skilled engineers who can harness the potential of these models has never been more critical.

Enter Databricks, a name synonymous with groundbreaking advancements in the big data space. In their quest to democratize AI, they have unveiled a pioneering program: Large Language Models. These courses, tailored for both novice enthusiasts and seasoned experts, provide a comprehensive roadmap to master the art of building and utilizing LLMs for modern applications. Let’s embark on this journey of discovery and delve into what makes these courses an indispensable asset for anyone passionate about the world of machine learning and natural language processing.

The LLMs Program: A Brief Overview

Databricks’ Large Language Models program comprises two distinct yet interrelated courses: “LLMs: Application through Production” and “LLMs: Foundation Models from the Ground Up”. The array of topics covered in these courses is as diverse as it is comprehensive.

Course 1: LLMs — Application through Production

In the fast-paced world of technology, what sets this course apart is its laser focus on practicality. Aimed at developers, data scientists, and engineers, this course equips you with the skills needed to build LLM-centric applications using the latest frameworks. Here’s a sneak peek at what you’ll delve into:

Harnessing the Power of LLMs: You’ll learn to apply LLMs to real-world problems in NLP using popular libraries like Hugging Face and LangChain. Through hands-on exercises, you’ll become proficient in leveraging these libraries for maximum impact.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Generate text with GPT-2
input_text = "Once upon a time,"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=100)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)

Adding Domain Knowledge: Dive deep into enhancing your LLM pipelines with domain knowledge and memory using embeddings and vector databases. This invaluable skill enables you to tailor your models to specific applications.

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# First we split the data into manageable chunks to store as vectors.
# There isn't an exact way to do this: more chunks means more detailed context,
# but increases the size of our vectorstore.
text_splitter = CharacterTextSplitter(chunk_size=250, chunk_overlap=10)
texts = text_splitter.split_documents(document)

# Now we'll create embeddings for our document so we can store it in a vector
# store and feed the data into an LLM. We'll use the sentence-transformers
# model for our embeddings. https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(
    model_name=model_name, cache_folder=DA.paths.datasets
)  # Use a pre-cached model

# Finally we build our index using chromadb and the embeddings model.
chromadb_index = Chroma.from_documents(
    texts, embeddings, persist_directory=DA.paths.working_dir
)

from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

# We want to make this a retriever, so we need to convert our index.
# This creates a wrapper around the functionality of our vector database
# so we can search for similar documents/chunks in the vectorstore and retrieve the results:
retriever = chromadb_index.as_retriever()

# This chain will be used to do QA on the document. We will need:
# 1 - An LLM to do the language interpretation
# 2 - A vector database that can perform document retrieval
# 3 - A specification of how to deal with this data (more on this soon)

hf_llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    model_kwargs={
        "temperature": 0,
        "max_length": 128,
        "cache_dir": DA.paths.datasets,
    },
)

chain_type = "stuff"  # Options: stuff, map_reduce, refine, map_rerank
laptop_qa = RetrievalQA.from_chain_type(
    llm=hf_llm, chain_type=chain_type, retriever=retriever
)

# Let's ask the chain about the product we have.
laptop_name = laptop_qa.run("What is the full name of the laptop?")
display(laptop_name)

Fine-Tuning Mastery: Understand the nuances of pre-training, fine-tuning, and prompt engineering. You’ll acquire the expertise to fine-tune a custom chat model, opening doors to highly specialized applications.

import transformers as tr

training_args = tr.TrainingArguments(
    local_checkpoint_path,
    num_train_epochs=1,  # default number of epochs to train is 3
    per_device_train_batch_size=16,
    optim="adamw_torch",
    report_to=["tensorboard"],
)

# Load the pre-trained model
model = tr.AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint, cache_dir=DA.paths.datasets
)  # Use a pre-cached model

# Used to assist the trainer in batching the data
data_collator = tr.DataCollatorWithPadding(tokenizer=tokenizer)
trainer = tr.Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

# Save the model and trainer state to the local checkpoint
trainer.save_model()
trainer.save_state()

# Persist the fine-tuned model to DBFS
final_model_path = f"{DA.paths.working_dir}/llm_fine_tuning/{checkpoint_name}"
trainer.save_model(output_dir=final_model_path)

fine_tuned_model = tr.AutoModelForSeq2SeqLM.from_pretrained(final_model_path)

inputs = tokenizer(reviews, return_tensors="pt", truncation=True, padding=True)
pred = fine_tuned_model.generate(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)

Evaluating Efficacy and Bias: In an era marked by ethical concerns, learn how to rigorously evaluate the effectiveness and potential biases of your LLMs using various methodologies.

import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

from transformers import AutoTokenizer, AutoModelForCausalLM
import shap

tokenizer = AutoTokenizer.from_pretrained(
    "gpt2", use_fast=True, cache_dir=DA.paths.datasets
)
model = AutoModelForCausalLM.from_pretrained("gpt2", cache_dir=DA.paths.datasets)

# Set the model's decoder flag to True: GPT-2 is a decoder-only model.
model.config.is_decoder = True
# Configure text generation for the explainer.
model.config.task_specific_params["text-generation"] = {
    "do_sample": True,
    "max_length": 50,
    "temperature": 0,  # to turn off randomness
    "top_k": 50,
    "no_repeat_ngram_size": 2,
}
input_sentence = ["Sunny days are the best days to go to the beach. So"]
explainer = shap.Explainer(model, tokenizer)
shap_values = explainer(input_sentence)
shap.plots.text(shap_values)
shap.plots.bar(shap_values[0, :, "looking"])

LLMOps and Multi-Step Reasoning: Discover LLMOps best practices for deploying models at scale and unlocking the potential of multi-step reasoning in your LLM workflows.

import mlflow

# Tell MLflow Tracking to use this explicit experiment path,
# which is in your home directory under the Workspace browser (left-hand sidebar).
mlflow.set_experiment(f"/Users/{DA.username}/LLM 06 - MLflow experiment")

with mlflow.start_run():
    # LOG PARAMS
    mlflow.log_params(
        {
            "hf_model_name": hf_model_name,
            "min_length": min_length,
            "max_length": max_length,
            "truncation": truncation,
            "do_sample": do_sample,
        }
    )

    # It is valuable to log a "signature" with the model, telling MLflow the
    # input and output schema for the model.
    signature = mlflow.models.infer_signature(
        xsum_sample["document"][0],
        mlflow.transformers.generate_signature_output(
            summarizer, xsum_sample["document"][0]
        ),
    )
    print(f"Signature:\n{signature}\n")

    # For mlflow.transformers, if there are inference-time configurations,
    # those need to be saved specially in the log_model call (below).
    # This ensures that the pipeline will use these same configurations when re-loaded.
    inference_config = {
        "min_length": min_length,
        "max_length": max_length,
        "truncation": truncation,
        "do_sample": do_sample,
    }

    # Logging a model returns a handle `model_info` to the model metadata in the tracking server.
    # This `model_info` will be useful later in the notebook to retrieve the logged model.
    model_info = mlflow.transformers.log_model(
        transformers_model=summarizer,
        artifact_path="summarizer",
        task="summarization",
        inference_config=inference_config,
        signature=signature,
        input_example="This is an example of a long news article which this pipeline can summarize for you.",
    )
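
To close the loop, the logged model can be reloaded and served as a generic MLflow `pyfunc`. This short sketch is not part of the course snippet above but uses standard MLflow APIs, and it assumes the `model_info` handle and `xsum_sample` data from the previous block:

# Reload the logged summarizer as a generic pyfunc model and run inference.
# `model_info.model_uri` points at the artifact logged by mlflow.transformers.log_model above.
loaded_summarizer = mlflow.pyfunc.load_model(model_info.model_uri)

summary = loaded_summarizer.predict(xsum_sample["document"][0])
print(summary)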

With these skills at your disposal, you’ll be well-prepared to tackle the challenges of real-world LLM applications.

What I liked about this course:

  1. Practical Focus: This course is designed for hands-on practitioners. It equips you with the skills needed to build LLM-centric applications using popular frameworks like Hugging Face and LangChain. It’s perfect for developers, data scientists, and engineers looking to apply LLMs in real-world scenarios.
  2. Industry Expertise: The course is led by industry leaders and renowned researchers, including Stanford Professor Matei Zaharia and the technical team behind Databricks’ Dolly model. Their insights and experience offer invaluable perspectives.
  3. Hands-On Labs: The inclusion of hands-on labs ensures that you not only understand the theory but also gain practical experience. You’ll be able to build your own production-ready LLM workflows.
  4. Ethical Considerations: In today’s world, ethical concerns are paramount. This course addresses societal, safety, and ethical considerations of using LLMs, helping you develop a responsible approach to AI.
  5. LLMOps Best Practices: You’ll learn LLMOps best practices for deploying models at scale. This is essential for those aiming to integrate LLMs into large-scale applications.

What can be improved:

  1. Limited to Practitioners: While the practical focus is excellent for practitioners, it may not be as suitable for those seeking a deeper theoretical understanding of LLMs (this is resolved with Course 2).
  2. Requires Prior Knowledge: To fully benefit from the course, you should have a working knowledge of machine learning and deep learning. It may not be ideal for complete beginners (see the Resources section below for prerequisites).

Course 2: LLMs — Foundation Models from the Ground Up

For those with a thirst for deeper knowledge, this course offers a plunge into the heart of LLMs. Ideal for data scientists and enthusiasts intrigued by the inner workings of foundation models, it covers the following:

Understanding Foundation Models: Delve into the theory and innovations that paved the way for foundation models, including attention mechanisms, encoders, decoders, and their evolution to GPT-4.

import torch
import torch.nn as nn


class FeedForward(nn.Module):
    # Position-wise feed-forward network. This class is referenced but not shown
    # in the original snippet; this is a minimal stand-in (standard two-layer MLP).
    def __init__(self, d_model, conv_hidden_dim, dropout=0.1):
        super(FeedForward, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, conv_hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(conv_hidden_dim, d_model),
        )

    def forward(self, x):
        return self.net(x)


class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, conv_hidden_dim, dropout=0.1):
        super(TransformerEncoderBlock, self).__init__()
        # batch_first=True so inputs are (batch, seq, d_model)
        self.attention = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = FeedForward(d_model, conv_hidden_dim, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Multi-Head Attention
        attn_output, _ = self.attention(x, x, x, attn_mask=mask)
        x = x + self.dropout(attn_output)
        x = self.norm1(x)

        # Feed Forward Network
        ff_output = self.feed_forward(x)
        x = x + self.dropout(ff_output)
        x = self.norm2(x)

        return x


class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, conv_hidden_dim, num_layers, dropout=0.1):
        super(TransformerEncoder, self).__init__()
        self.word_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(1000, d_model)  # Assuming a maximum sequence length of 1000
        self.layers = nn.ModuleList(
            [
                TransformerEncoderBlock(d_model, num_heads, conv_hidden_dim, dropout)
                for _ in range(num_layers)
            ]
        )

    def forward(self, x, mask=None):
        seq_length = x.shape[1]
        positions = torch.arange(0, seq_length).expand(x.shape[0], seq_length).to(x.device)
        out = self.word_embedding(x) + self.position_embedding(positions)

        for layer in self.layers:
            out = layer(out, mask)

        return out


# Assume the following hyperparameters
vocab_size = 5000  # size of the vocabulary
d_model = 512  # dimension of the word embedding
num_heads = 8  # number of attention heads
conv_hidden_dim = 2048  # dimension of the hidden layer in the feed-forward network
num_layers = 6  # number of Transformer Encoder blocks
dropout = 0.1  # dropout rate

# Instantiate the model
model = TransformerEncoder(vocab_size, d_model, num_heads, conv_hidden_dim, num_layers, dropout)

# Generate some example input
input_tensor = torch.randint(0, vocab_size, (1, 20))  # batch size of 1 and sequence length of 20

# Forward pass through the model
output = model(input_tensor, mask=None)

print(f"The model has {sum(p.numel() for p in model.parameters() if p.requires_grad):,} trainable parameters")

Efficient Fine-Tuning: Explore advanced transfer learning techniques such as one-shot and few-shot learning and knowledge distillation, and learn how to reduce LLM sizes while preserving performance. Gain insights into the current research and developments shaping the LLM landscape, from Flash Attention to LoRA, ALiBi, and PEFT methods.

import peft
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
foundation_model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,  # rank of the low-rank update matrices; the course lab leaves this as a fill-in (small values such as 4-16 are typical)
    lora_alpha=1,  # a scaling factor that adjusts the magnitude of the weight matrix; usually set to 1
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",  # this specifies whether the bias parameters should be trained
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(foundation_model, lora_config)
peft_model.print_trainable_parameters()

LLM Deployment Best Practices: Learn quantization to make models run faster and use less memory by converting 32-bit floating-point numbers into lower-precision formats, like 8-bit integers.

# Specify quantization configuration
net.qconfig = torch.ao.quantization.get_default_qconfig("onednn")

# Prepare the model for static quantization. This inserts observers in the model
# that will observe activation tensors during calibration.
net_prepared = torch.quantization.prepare(net)

# Run representative data through `net_prepared` here so the observers can
# record activation ranges (the calibration step).

# Now we convert the model to a quantized version.
net_quantized = torch.quantization.convert(net_prepared)

# Once the model is quantized, it can be used for inference in the same way
# as the unquantized model, but it will use less memory and potentially have
# faster inference times, at the cost of a possible decrease in accuracy.
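
For a self-contained illustration of the same idea, here is a minimal sketch of my own (not from the course) that uses PyTorch's dynamic quantization to convert the weights of a toy float32 model to 8-bit integers:

import torch
import torch.nn as nn

# A toy float32 model standing in for `net` above (hypothetical, for illustration only).
float_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization converts the Linear weights to int8; activations are
# quantized on the fly at inference time, so no calibration pass is needed.
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)  # same interface, smaller weights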

Run LLMs at Massive Scale: Explore how to build your own simplified version of a mixture-of-experts (MoE) LLM system, which can handle a diverse range of data patterns through the different areas of expertise of its component models. Each expert has its own set of parameters and is typically a simpler model than would be necessary to model the entire dataset effectively.

import torch
import torch.nn.functional as F


# Define the "hard gating" function.
# This function decides which model to use based on the length of the input.
def hard_gating_function(input):
    if len(input) < 10:
        # For inputs less than 10 characters long, use the GPT-2 model
        return "gpt2", gpt2, gpt2_tokenizer
    elif len(input) < 100:
        # For inputs between 10 and 100 characters long, use the T5 model
        return "t5", t5, t5_tokenizer
    else:
        # For inputs 100 characters or longer, use the BERT model
        return "bert", bert, bert_tokenizer


# Define the "soft gating" function.
# This function assigns a weight to each model based on the length of the input,
# and all models are used to a certain extent to generate the output.
def soft_gating_function(input):
    # The weights for each model are calculated using the softmax function,
    # which outputs a probability distribution.
    weights = F.softmax(
        torch.tensor([len(input), 100 - len(input), len(input)], dtype=torch.float),
        dim=0,
    )
    # The weights for each model are returned along with the models and their tokenizers.
    return {
        "gpt2": (gpt2, gpt2_tokenizer, weights[0]),
        "bert": (bert, bert_tokenizer, weights[1]),
        "t5": (t5, t5_tokenizer, weights[2]),
    }

The future of LLMs is Multi-Modal: Learn the inner workings of Vision Transformers through a hands-on coding project that performs video classification with X-CLIP, assigning probabilities to provided text descriptions. The model consists of a text encoder, a cross-frame vision encoder, a multi-frame integration Transformer, and a video-specific prompt generator.

from transformers import XCLIPProcessor, XCLIPModel

model_name = "microsoft/xclip-base-patch16-zero-shot"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)
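
As a rough sketch of how the loaded model is then used for zero-shot classification (standard Hugging Face X-CLIP usage; `video_frames` here is a hypothetical list of sampled frames, not shown above):

import torch

# `video_frames` is assumed to be a list of sampled video frames (e.g., numpy arrays).
candidate_labels = ["playing sports", "cooking", "playing a video game"]
inputs = processor(
    text=candidate_labels, videos=list(video_frames), return_tensors="pt", padding=True
)

with torch.no_grad():
    outputs = model(**inputs)

# Similarity scores between the video and each text description, as probabilities.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(candidate_labels, probs[0].tolist())))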

By the course’s conclusion, you’ll be well-versed in the intricacies of foundation models, equipped to understand the latest advances, and prepared to tackle multi-modal LLM challenges.

What I liked about this course:

  1. In-Depth Theoretical Knowledge: If you’re interested in the theory behind LLMs and want to understand their inner workings, this course is ideal. It covers the evolution of foundation models, including attention mechanisms, encoders, decoders, and their contributions to models like GPT-4.
  2. Advanced Transfer Learning: You’ll explore advanced transfer learning techniques like one-shot, few-shot learning, and knowledge distillation, which can significantly reduce the size of LLMs while maintaining performance.
  3. Future-Oriented: This course provides insights into the latest LLM developments, including Flash Attention, LoRA, ALiBi, and PEFT methods. It keeps you at the forefront of LLM research.
  4. Multi-Modal Applications: Understanding multi-modal LLMs is becoming increasingly important as AI applications involve text, audio, and visual components. This course prepares you for such challenges.

What can be improved:

  1. Theoretical Emphasis: While this course is rich in theoretical content, it may not offer as many practical, hands-on experiences as the first course. If you prefer a more practical approach, this might not be your first choice.
  2. Recommended Prerequisite: While completing the “LLMs: Application through Production” course is recommended, it’s not strictly required. However, if you haven’t taken the first course, you may need to invest extra effort to catch up.

Overall, both courses offer unique advantages, and a combination of both would be a wise choice for a successful journey with LLMs.

Unpacking the Learning Experience

Databricks’ LLM courses are more than just another series of online lectures. They offer an immersive learning experience enriched with the wisdom of industry leaders and renowned researchers. A few standout features that make these courses remarkable:

Expert-Led Instruction

In a domain as dynamic as AI, having mentors who have not only witnessed but actively shaped its evolution is invaluable. The instructors' expertise infuses every lecture and lab, offering learners a perspective drawn from academia, startups, and Fortune 500 companies.

Hands-On Labs

Theory without practice is like a ship without a compass. Databricks’ courses are punctuated with hands-on labs that allow you to apply your newfound knowledge in real-world scenarios. These labs provide the practical skills you need to build your own production-ready LLM workflows.

Free Course Materials and Certificates

In alignment with their commitment to democratizing AI, Databricks offers free access to course materials for anyone who wishes to audit. If you desire a deeper engagement, a nominal fee grants you access to a managed computing environment for course labs, graded exercises, and a completion certificate. This certificate, backed by Databricks’ name, is a valuable addition to your professional profile.

Free Resources

In addition to the courses, Databricks generously provides free resources, including source code and notebooks, ensuring that your learning journey continues beyond the course. The provided GitHub repositories offer a treasure trove of code examples and practical implementations.

Video Lectures

For those who prefer a visual learning experience, Databricks offers a series of video lectures that accompany the courses. These lectures serve as an additional resource to enhance your understanding.

The Road Ahead: LLMs and Beyond

As we approach the end of our exploration, it’s crucial to recognize the immense potential and opportunities that await those who embark on this LLM journey. The data-driven world of tomorrow is bound to be AI-augmented, and enterprises across various industries are actively seeking talent with expertise in LLMs.

According to IDC, 90% of enterprise applications will be AI-augmented by 2025, and the demand for professionals skilled in LLMs is on a meteoric rise. Job postings requiring both NLP and deep learning skills have seen a remarkable 105% increase in the last three years, according to Burning Glass.

The potential career paths for graduates of Databricks’ LLM courses are as diverse as the applications of LLMs themselves. From NLP/LLM engineers to data scientists, machine learning engineers, software developers, and research analysts, the possibilities are limited only by your imagination.

The Verdict: Should You Join the LLM Rebellion?

In the spirit of a dramatic drumroll, let me make this clear — Databricks’ LLM courses are an invaluable treasure chest for anyone serious about conquering the world of large language models. The knowledge, expertise, and experience you’ll gain are worth their weight in gold-pressed latinum (for all you Trekkies out there).

So, should you pursue these courses? Without a doubt! Whether you’re a machine learning enthusiast or a data scientist looking to level up, Databricks’ LLM courses are the hyperdrive you need to journey to the stars.

In the words of the late, great Douglas Adams, “Don’t Panic” and enroll in these courses. May the LLMs be with you!

Conclusion: Join the Journey

As we conclude this odyssey through the galaxy of large language models, I urge you to embark on your own adventure. Databricks’ LLM courses are not just a learning experience; they are a voyage into the future of machine learning. As you navigate the ever-expanding universe of data and language models, remember:

“The only limit to our realization of tomorrow will be our doubts of today.” — Franklin D. Roosevelt.

Enroll in Databricks’ LLM courses, and let your doubts fade away like stars in the dawn of a new day. Clap 👏, subscribe 🔔, and stay tuned 📡 for more such enlightening journeys into the world of technology and data. The cosmos of knowledge awaits, and together, we shall boldly go where no one has gone before.

Resources

  1. Databricks’ LLM Courses Introduction
  2. LLM Certification
  3. LLM Course 1: Application through Production
  4. LLM Course 2: Foundation Models from the Ground Up
  5. LLM Course 1: GitHub repo, Slides, Video Lectures
  6. LLM Course 2: GitHub repo, Slides, Video Lectures
  7. Transformers-Tutorials

Disclaimer: I have not received any financial compensation or favors from Databricks, edX, or any other party for writing this article. The opinions expressed here are entirely my own, based on my genuine experiences with these LLM courses.

Anand Taralika is a Software Engineer who writes about tech life and the use of tech, data, and machine learning for cybersecurity, finance, healthcare, and sustainable energy. Get stories directly in your inbox so you never miss them!


Published via Towards AI
