Apple M1 and M2 Performance for Training SSL Models

Last Updated on November 5, 2023 by Editorial Team

Author(s): Igor Susmelj

Originally published on Towards AI.

We want to know how fast Apple M1 and M2 chips are for training self-supervised learning models.

There are still few benchmarks for training ML models on the new Apple chips. Furthermore, most published results cover only the M1 chips and were run on earlier software versions that might not have been optimized at the time. That’s why we decided to run our own benchmarks.

To measure the performance of Apple M1 and M2 chips for training, we set up a simple benchmark: training a SimCLR model with a ResNet-18 backbone on CIFAR-10. We measure the time it takes to complete one full epoch. For our experiments, we use various M1 and M2 chips and also compare CPU vs GPU performance.

In detail, we run benchmarks using the following devices:

  • 14-inch MacBook Pro 2021 with the M1 Pro and the 14-core GPU (referred to as M1 Pro in this post)
  • 13-inch MacBook Air 2023 with the M2 and the 8-core GPU (referred to as M2 in this post)
  • We compare the results against a reference implementation using an Nvidia A6000 Ampere GPU

TL;DR

  • On the M1 Pro the GPU is 8.8x faster for training than using the CPU.
  • The M1 Pro GPU is approximately 13.77x slower than an Nvidia A6000 Ampere GPU.
  • The M1 Pro GPU is 26% faster than the M2 GPU.
  • PyTorch running on Apple M1 and M2 chips doesn’t fully support torch.compile and 16-bit precision yet. Hopefully, this changes in the coming months.

The following table lists the computing hardware we evaluated, with the average time per epoch in minutes on the right. All Apple M1 and M2 chips use the latest PyTorch nightly build from 30.6.2023, whereas the Nvidia A6000 Ampere chip uses an older PyTorch version from 2022.

Compute hardware and time required for one epoch training.

Setup & Experiments

We give an overview of the software and hardware components used for the experiments.

We use the example for training a ResNet-18 with SimCLR on CIFAR-10 from the lightly benchmarks. Instead of training the models for 200 epochs, we train them for just 2 epochs on the Apple hardware. We leave all the other parameters (batch size: 128, precision: 32, number of workers: 8) the same.

The training code automatically evaluates the model after each epoch using a kNN classifier. Two epochs therefore mean training the model on the training data twice and evaluating it twice. For our experiments, we use the average time per epoch.
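A minimal sketch of how a full-epoch time, covering both training and evaluation, could be measured and averaged. This is not the lightly benchmark code: `time_epochs` and `dummy_epoch` are hypothetical names, and the dummy workload only stands in for a real training and kNN evaluation pass.

```python
import time
from statistics import mean


def time_epochs(run_epoch, num_epochs=2):
    """Call run_epoch once per epoch and return each duration in minutes."""
    durations_min = []
    for epoch in range(num_epochs):
        start = time.perf_counter()
        run_epoch(epoch)  # assumed to cover training plus evaluation
        durations_min.append((time.perf_counter() - start) / 60.0)
    return durations_min


def dummy_epoch(epoch):
    # Hypothetical stand-in so the sketch runs without a dataset or model.
    time.sleep(0.01)


durations = time_epochs(dummy_epoch, num_epochs=2)
print(f"average time per epoch: {mean(durations):.4f} min")
```

In a real run, `run_epoch` would wrap whatever executes one training pass and one evaluation pass, so that the measured time matches the full epoch time reported here.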

We have reference results on an Nvidia A6000 Ampere GPU for comparison. The Nvidia GPU uses 97.7 min for 200 epochs, or 0.49 min per epoch.

We use 8 data-loading workers. We did not tune or change any parameters when switching from the system with the Nvidia A6000 GPU to the M1 and M2 chips. However, when monitoring CPU and GPU usage on the M1 and M2 devices, we saw utilization constantly above 90%, which suggests that we are close to the limit of the available hardware.

Installing PyTorch with GPU support on Apple M1 and M2

For our experiments, we need to install PyTorch on the Apple M1 and M2 hardware.

We follow this guide here: https://developer.apple.com/metal/pytorch/
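After installing, a quick check along the following lines can confirm that PyTorch sees the Apple GPU through the MPS backend. This is a sketch rather than part of our benchmark code: `pick_device` is a hypothetical helper, and the fallback order (MPS, then CUDA, then CPU) is an assumption made so the snippet runs on any machine.

```python
import torch


def pick_device():
    """Prefer the Apple GPU (MPS), then CUDA, then fall back to the CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = pick_device()

# Run a tiny computation on the selected device as a smoke test.
x = torch.ones(2, 3, device=device)
total = (x + x).sum().item()
print(f"device: {device}, smoke test sum: {total}")
```

On an M1 or M2 machine with a working install, the selected device should be `mps`.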

Finally, our test system on M1 Pro has the following packages installed:

Results

Note that we don’t use torch.compile or 16-bit precision due to a lack of support on the Apple chips. As of today, the Apple M1 and M2 GPUs do support 16-bit precision, but PyTorch lacks autocast support for them, which mixed-precision training needs in order to scale gradients and automatically fall back to higher precision where required. Therefore, we don’t use either of these features.

Summary of the different experiments. We always report the full epoch time which includes training and validation time.

We discuss the benchmark results in more detail here. As a reference, we compare them with two publicly available benchmarks for training ML models on Apple M1 hardware.

Results on the M1 Pro GPU

Let’s take a look at the detailed results on the M1 Pro GPU. Since we use PyTorch Lightning for the experiments, we also get a summary of the accelerator used and the model size.

You can see that the GPU has been found:
GPU available: True (mps), used: True

The total time for the two epochs is 13.5 min. The time per epoch is therefore 6.75 min, which is 13.77x slower than the Nvidia A6000 GPU at 0.49 min.
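The arithmetic behind these figures can be reproduced in a few lines. The helper names `minutes_per_epoch` and `slowdown` are hypothetical; the inputs are the timings reported in this post. With the rounded reference value of 0.49 min the ratio comes out at roughly 13.78 (the 13.77 above appears to be truncated), and with the unrounded 0.4885 min it is about 13.8.

```python
def minutes_per_epoch(total_minutes, num_epochs):
    """Average wall-clock time per epoch in minutes."""
    return total_minutes / num_epochs


def slowdown(slow_minutes, fast_minutes):
    """How many times slower the first timing is than the second."""
    return slow_minutes / fast_minutes


a6000 = minutes_per_epoch(97.7, 200)     # Nvidia A6000 reference run
m1_pro_gpu = minutes_per_epoch(13.5, 2)  # M1 Pro GPU run

print(f"A6000: {a6000:.2f} min per epoch")
print(f"M1 Pro GPU: {m1_pro_gpu:.2f} min per epoch")
print(f"slowdown vs rounded reference: {slowdown(m1_pro_gpu, 0.49):.2f}x")
print(f"slowdown vs unrounded reference: {slowdown(m1_pro_gpu, a6000):.2f}x")
```

The same two helpers apply unchanged to any other pair of timings in this post.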

Results on the M1 Pro CPU

Out of curiosity, we also ran the same benchmark on the M1 Pro CPU, which has six performance cores and two efficiency cores.

As expected, the results are worse than on the GPU. The CPU took 118.6 min for the two epochs, or 59.3 min per epoch, which makes the M1 Pro CPU 8.8x slower than the M1 Pro GPU. These results show a bigger difference between GPU and CPU performance than previous benchmarks:

Prior ML Benchmarks on Apple M1 Hardware

We found only two other benchmarks. Both date from May 2022, when initial support for PyTorch on Apple hardware was announced.

VGG16 training time per epoch comparison. From https://sebastianraschka.com/
PyTorch only reported speedup of GPU training vs CPU training on an M1 Ultra chip. From https://pytorch.org/

Compared to the other two reported results, our benchmark of training a ResNet-18 is less compute-intensive. Both the VGG-16 and the ResNet-50 are bigger models with more parameters and more FLOPs.

According to the torchvision documentation for its pretrained models, the three models compare as follows:

  • The VGG-16 model has 138 million parameters and 15.47 billion FLOPs
  • The ResNet-18 model has 11.7 million parameters and 1.81 billion FLOPs
  • The ResNet-50 model has 25.6 million parameters and 4.09 billion FLOPs

Outlook

Half precision (fp16) support

Although the Apple M1 and M2 GPUs support fp16, the software stack around PyTorch still lacks support in some areas. For example, this issue is still open: https://github.com/pytorch/pytorch/issues/88415, which prevents us from easily using mixed precision with autocast. The good news is that fp16 is already supported on M1 and M2 chips, meaning that we can create fp16 tensors and run operations on them.
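As a small illustration of that last point, the sketch below creates two fp16 tensors and adds them on the MPS device when it is available. The CPU fallback is an assumption added so the snippet also runs on other machines, and it deliberately sticks to a simple elementwise operation.

```python
import torch

# Use the Apple GPU if present; otherwise fall back to the CPU (assumption
# made only so that this sketch runs anywhere).
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

a = torch.ones(4, 4, dtype=torch.float16, device=device)
b = torch.full((4, 4), 0.5, dtype=torch.float16, device=device)

c = a + b  # elementwise addition carried out in half precision

# Convert to float32 before summing so the printed value is easy to read.
result_sum = c.float().sum().item()
print(c.dtype, result_sum)
```

This only shows that fp16 tensors work on their own; it does not replace autocast, which would choose the precision automatically during training.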

Support for torch.compile for M1 and M2 Chips

If you try to use torch.compile, you will get the following error, as the MPS device is not supported yet: RuntimeError: Unsupported device type: mps
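Until that changes, one way to keep a single code path for all devices is to attempt compilation and fall back to eager mode when it fails. The following is a sketch under that assumption, not something we used in the benchmark: `compile_or_eager` is a hypothetical helper, and it catches any exception because the exact error can differ between PyTorch versions and devices.

```python
import torch
from torch import nn


def compile_or_eager(model, example_input):
    """Return a compiled model if compilation works here, else the eager model."""
    if not hasattr(torch, "compile"):
        return model  # PyTorch version without torch.compile
    try:
        compiled = torch.compile(model)
        # Compilation is lazy, so run one forward pass to surface any error.
        compiled(example_input)
        return compiled
    except Exception as err:
        print(f"torch.compile failed ({type(err).__name__}), using eager mode")
        return model


device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = nn.Linear(8, 2).to(device)
x = torch.randn(4, 8, device=device)

model = compile_or_eager(model, x)
out = model(x)
print(out.shape)
```

The trade-off of this pattern is that a silent fallback can hide a real performance regression, so the printed message is worth keeping.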

If you like this post and want to read more from me you can follow me on Medium.

Igor Susmelj,
Co-Founder Lightly


Published via Towards AI
