
A Brief Implementation of MLFlow!

Last Updated on April 4, 2024 by Editorial Team

Author(s): Harish Siva Subramanian

Originally published on Towards AI.

Photo by UX Indonesia on Unsplash

Have you ever run into a scenario where you experiment with multiple models and lose track of how each of them performed?

Are you someone who embeds the accuracy and the methods you used into the model's file name, just to keep track of everything?

Well, we have all been there and done that! That's how I used to track experiments during the early stages of my Data Science career.

To avoid all of this, we use MLflow. But first, there is a concept called MLOps that most of us should be familiar with if we are building models.

MLOps, short for Machine Learning Operations, refers to the practices, processes, and technologies used to streamline and automate the deployment, monitoring, and management of machine learning models in production environments.

MLflow is a platform that facilitates managing the end-to-end machine learning lifecycle. It helps data scientists and engineers streamline the development, deployment, and monitoring of machine learning models. While MLflow itself is not a complete MLOps platform, it plays a significant role within the MLOps ecosystem.

Here’s how MLflow contributes to MLOps:

  1. Experiment Tracking: MLflow allows users to track experiments, including parameters, metrics, and artifacts. This enables teams to compare different model iterations, reproduce results, and collaborate effectively. In an MLOps context, experiment tracking helps maintain a record of model development and performance across the entire lifecycle.
  2. Model Packaging and Deployment: MLflow provides tools for packaging and deploying models to various environments, including batch inference, real-time serving, and edge devices. By packaging models in a standardized format, teams can easily deploy and manage models in production, a crucial aspect of MLOps.
  3. Model Registry: MLflow’s model registry allows teams to organize and manage models throughout their lifecycle. It provides versioning, permissions, and audit capabilities, ensuring that models are tracked, validated, and promoted in a controlled manner. This aligns with MLOps principles of governance and lifecycle management (a minimal registry sketch follows this list).
  4. Model Monitoring: While MLflow itself does not offer extensive model monitoring capabilities, it integrates with external tools and platforms for monitoring model performance and drift. By logging metrics during training and inference, MLflow can feed data into monitoring systems for ongoing model evaluation and management.
  5. Collaboration and Reproducibility: MLflow promotes collaboration and reproducibility by capturing the code, data, and environment settings associated with each experiment. This allows teams to share, reproduce, and build upon each other’s work, essential aspects of MLOps culture and practices.
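This article focuses on tracking, but to give a flavor of the registry, here is a minimal sketch, assuming a model already logged under some run (the run ID and registry name are hypothetical). Note that the registry needs a database-backed tracking store (e.g., SQLite), not the plain ./mlruns file store:

import mlflow

# Hypothetical run ID and model name, for illustration only
result = mlflow.register_model("runs:/<run_id>/models", "mnist-classifier")
print(result.version)  # each registration creates a new model version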

In this article, I will cover how to log metrics, parameters, models, and artifacts. Now let's write some code to log a few metrics and a model to MLflow.

First things first, we need to pip install mlflow.

pip install mlflow

Now let's build a sample image classification model using PyTorch and log the necessary things to MLflow.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import mlflow
import mlflow.pytorch
import matplotlib.pyplot as plt


# Set random seed for reproducibility
torch.manual_seed(42)

# Check if CUDA is available and select the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Log the CUDA device (GPU) name
if torch.cuda.is_available():
    mlflow.log_param("cuda_device", torch.cuda.get_device_name())

# Define transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Define dataloaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1000, shuffle=False)

# Define model architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)   # channel-wise dropout on conv feature maps
        self.dropout2 = nn.Dropout(0.5)      # element-wise dropout for the fully connected layer
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.ReLU()(x)
        x = self.conv2(x)
        x = nn.ReLU()(x)
        x = nn.MaxPool2d(2)(x)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = nn.ReLU()(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = nn.LogSoftmax(dim=1)(x)
        return output

# Initialize model, loss, and optimizer
model = Net().to(device)
# NLLLoss pairs with the LogSoftmax output above
# (CrossEntropyLoss would apply log-softmax a second time)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

This will download the MNIST data and define the model. Now let's move on to the MLflow code:

# End any run that may already be active
# (logging cuda_device above can implicitly start one)
mlflow.end_run()

# Create the experiment; create_experiment returns its experiment_id
experiment_id = mlflow.create_experiment("MNIST_Classification1")

epochs = 3

with mlflow.start_run(run_name="3Epochs", experiment_id=experiment_id) as run:

    # Log parameters
    mlflow.log_param("optimizer", "Adam")
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", epochs)

    # Lists to store loss and accuracy for plotting
    train_losses = []
    train_accuracies = []
    test_losses = []
    test_accuracies = []

    # Training loop
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0
        for i, data in enumerate(train_loader, 0):
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        train_loss = running_loss / len(train_loader)
        train_accuracy = correct / total
        print(f'Epoch {epoch + 1}, Training Loss: {train_loss}, Training Accuracy: {train_accuracy}')
        train_losses.append(train_loss)
        train_accuracies.append(train_accuracy)

        # Log metrics for training
        mlflow.log_metric("training_loss", train_loss, step=epoch + 1)
        mlflow.log_metric("training_accuracy", train_accuracy, step=epoch + 1)

        # Evaluate on the test set
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            running_loss = 0.0
            for data in test_loader:
                images, labels = data[0].to(device), data[1].to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                running_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        test_loss = running_loss / len(test_loader)
        test_accuracy = correct / total
        print(f'Testing Loss: {test_loss}, Testing Accuracy: {test_accuracy}')
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)

        # Log metrics for validation
        mlflow.log_metric("validation_loss", test_loss, step=epoch + 1)
        mlflow.log_metric("validation_accuracy", test_accuracy, step=epoch + 1)

    # Save the model as a run artifact under "models"
    mlflow.pytorch.log_model(model, "models")

    # Plot loss curves
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, epochs + 1), train_losses, label='Training Loss', marker='o')
    plt.plot(range(1, epochs + 1), test_losses, label='Validation Loss', marker='o')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Loss Curves')
    plt.legend()
    plt.grid(True)
    plt.savefig('loss_curve.png')
    mlflow.log_artifact('loss_curve.png')

    # Plot accuracy curves
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, epochs + 1), train_accuracies, label='Training Accuracy', marker='o')
    plt.plot(range(1, epochs + 1), test_accuracies, label='Validation Accuracy', marker='o')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title('Accuracy Curves')
    plt.legend()
    plt.grid(True)
    plt.savefig('accuracy_curve.png')
    mlflow.log_artifact('accuracy_curve.png')

print('Finished Training')

In the code above, we first created an experiment named “MNIST_Classification1”. Under this experiment we will have multiple runs, and with each run we modify something. In this article, for the sake of simplicity, I will only modify the number of epochs between the two runs.

We then define a run named “3Epochs” and pass it the experiment_id we created in the previous line.
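As a side note, if you ever want sub-runs grouped under a parent run (say, one per cross-validation fold), MLflow also supports nested runs. A minimal sketch with a placeholder metric:

with mlflow.start_run(run_name="parent") as parent_run:
    for fold in range(3):
        # nested=True attaches each child run to the active parent run
        with mlflow.start_run(run_name=f"fold-{fold}", nested=True):
            mlflow.log_metric("fold_accuracy", 0.9)  # placeholder value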

We log all the parameters using

mlflow.log_param(key,value)

If you want to log a metric, you would use

mlflow.log_metric(key, value, step=None)
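If you have many parameters or metrics, the plural helpers mlflow.log_params and mlflow.log_metrics accept a dictionary in a single call. A quick sketch with placeholder values:

mlflow.log_params({"optimizer": "Adam", "learning_rate": 0.001, "batch_size": 64})
mlflow.log_metrics({"training_loss": 0.12, "training_accuracy": 0.96}, step=1)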

To log a PyTorch model,

mlflow.pytorch.log_model(model, "models")

Here model is the trained model object, and "models" is the artifact path it is stored under within the run.
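To get the model back later, mlflow.pytorch.load_model accepts a model URI. A minimal sketch, assuming run_id holds the ID of the run above (available as run.info.run_id inside the with block):

loaded_model = mlflow.pytorch.load_model(f"runs:/{run_id}/models")
loaded_model.eval()  # ready for inference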

If you want to log a figure or a plot,

plt.savefig('loss_curve.png')
mlflow.log_artifact('loss_curve.png')

Save the figure locally and then log it as an artifact, as shown above.
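Newer MLflow versions also provide mlflow.log_figure, which logs a Matplotlib figure directly without the intermediate file. A minimal sketch:

fig = plt.gcf()  # grab the current figure
mlflow.log_figure(fig, "loss_curve.png")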

You can view the logged plots and all the metrics in the MLflow UI. For a local ./mlruns directory, running mlflow ui (shown below) is enough; if you log to a remote tracking server instead, make sure it is running and accessible.

Now, as soon as you run the code, you will see a new directory named “mlruns” automatically created in the project folder.

Inside it, there is a folder named 0 (the default experiment) and a folder whose name is a long number. That number is the experiment_id generated when we created our experiment. Every new experiment gets its own folder.
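Roughly, the layout looks like this (the exact run ID will differ):

mlruns/
├── 0/                        # default experiment
└── 240219543460999387/       # MNIST_Classification1
    └── <run_id>/
        ├── artifacts/        # logged plots and the model
        ├── metrics/
        ├── params/
        ├── tags/
        └── meta.yaml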

Now go to the project folder, and open Anaconda Prompt or Terminal and then type the following,

mlflow ui
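By default this serves the UI at http://127.0.0.1:5000; if that port is already in use, you can pass another one:

mlflow ui --port 5001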


As soon as you run it, a local server starts and its link pops up in the terminal. Open that link in a browser, and you will see the MLflow UI.

You can see the run named “3Epochs”. When you click it, you can see all the logged metrics, parameters, and artifacts.

Run 2

The imports, data loading, and model definition are identical to the first run, so they are omitted here. Only the MLflow section changes:

# End any run that may still be active
mlflow.end_run()

# The experiment already exists, so we don't create it again
# experiment_id = mlflow.create_experiment("MNIST_Classification1")

epochs = 5

# Start a second run under the same experiment, reusing the
# experiment_id that was generated for "MNIST_Classification1"
with mlflow.start_run(run_name="5Epochs", experiment_id="240219543460999387") as run:

    # Log parameters
    mlflow.log_param("optimizer", "Adam")
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", epochs)

    # Lists to store loss and accuracy for plotting
    train_losses = []
    train_accuracies = []
    test_losses = []
    test_accuracies = []

    # Training loop
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0
        for i, data in enumerate(train_loader, 0):
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        train_loss = running_loss / len(train_loader)
        train_accuracy = correct / total
        print(f'Epoch {epoch + 1}, Training Loss: {train_loss}, Training Accuracy: {train_accuracy}')
        train_losses.append(train_loss)
        train_accuracies.append(train_accuracy)

        # Log metrics for training
        mlflow.log_metric("training_loss", train_loss, step=epoch + 1)
        mlflow.log_metric("training_accuracy", train_accuracy, step=epoch + 1)

        # Evaluate on the test set
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            running_loss = 0.0
            for data in test_loader:
                images, labels = data[0].to(device), data[1].to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                running_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        test_loss = running_loss / len(test_loader)
        test_accuracy = correct / total
        print(f'Testing Loss: {test_loss}, Testing Accuracy: {test_accuracy}')
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)

        # Log metrics for validation
        mlflow.log_metric("validation_loss", test_loss, step=epoch + 1)
        mlflow.log_metric("validation_accuracy", test_accuracy, step=epoch + 1)

    # Save the model as a run artifact under "models"
    mlflow.pytorch.log_model(model, "models")

    # Plot loss curves
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, epochs + 1), train_losses, label='Training Loss', marker='o')
    plt.plot(range(1, epochs + 1), test_losses, label='Validation Loss', marker='o')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Loss Curves')
    plt.legend()
    plt.grid(True)
    plt.savefig('loss_curve.png')
    mlflow.log_artifact('loss_curve.png')

    # Plot accuracy curves
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, epochs + 1), train_accuracies, label='Training Accuracy', marker='o')
    plt.plot(range(1, epochs + 1), test_accuracies, label='Validation Accuracy', marker='o')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title('Accuracy Curves')
    plt.legend()
    plt.grid(True)
    plt.savefig('accuracy_curve.png')
    mlflow.log_artifact('accuracy_curve.png')

print('Finished Training')

Now, in this case, I wanted this to be a second run under the same experiment. So I passed in the experiment_id from the folder name that was created for the first run when we initialized the experiment. Also, this time we don't need to create the experiment, so that line is commented out.

We use that experiment_id to create a new run. This time I named it “5Epochs” and changed the number of epochs to 5 in the code.

In the MLflow UI, you will now see a second run named “5Epochs” under the same experiment; click it to see its metrics, parameters, and artifacts.
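If you'd rather compare the two runs programmatically than in the UI, mlflow.search_runs returns an experiment's runs as a pandas DataFrame. A minimal sketch, assuming the same experiment ID as above:

runs = mlflow.search_runs(experiment_ids=["240219543460999387"])
print(runs[["tags.mlflow.runName", "params.epochs", "metrics.validation_accuracy"]])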

That is it! Thank you for reading this article!


Published via Towards AI
