
Compute-efficient Way to Scale LLM — Journey around data, model, and compute

Last Updated on June 29, 2024 by Editorial Team

Author(s): Anish Dubey

Originally published on Towards AI.

Context

We have repeatedly seen that increasing the number of model parameters results in better performance (GPT-1 has 117M parameters, GPT-2 has 1.5B, and GPT-3 has 175B). But the next question is how to scale an AI model. Simply increasing the model parameters without increasing the compute won't help. Many factors around the number of model parameters (N), the amount of compute available (C), the number of training tokens (D), and the hyper-parameters (learning rate, learning rate schedule, network width-depth ratio, batch size, and optimizer) play a key role in determining optimal performance.

Photo by Matt Foxx on Unsplash

The analysis below looks at how these different factors shape the performance of AI models. We will work through three key papers to understand the scaling problem:

Scaling Laws for Neural Language Models (2020, https://arxiv.org/pdf/2001.08361)

  • This is the foundational paper from OpenAI that revolutionized the thinking around scaling and how to reason about N, C, and D mathematically.

Training Compute-Optimal Large Language Models (2022, https://arxiv.org/pdf/2203.15556)

  • This paper from Google DeepMind introduced the famous Chinchilla laws, which overturned earlier intuition by showing that a smaller model (N) can beat a bigger one if it is given extra data (D) to train on.

Data-constrained language model (2023, https://arxiv.org/pdf/2305.16264)

  • This paper from Hugging Face explored how to further optimize the data (D) side, given that models are already trained on essentially all of the internet's text and little new data is left to train on.

You will notice how the problem has evolved from a simple statement like "scaling will work if we add extra compute" to "how should extra compute (C) be split between the dataset (D) and the model parameters (N)" to achieve the best performance.

Let’s jump to the first paper.

Scaling Laws for Neural Language Models

Basic argument

The basic argument is that performance has a power-law relation with compute (C), dataset size (number of tokens, D), and model parameters (N) when each is looked at independently. This is an important claim because, out of the three (C, D, and N), as long as the model is not bottlenecked by the other two factors, the loss follows a power law in each individual factor.

Side note: the graph below is a log-log plot, which is why a power law appears as a straight line.

Image ref: https://arxiv.org/pdf/2001.08361
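
To make the side note concrete, the single-variable scaling laws in the paper take a generic power-law form, and taking logarithms of both sides turns the exponent into a slope, which is why the curves look like straight lines on a log-log plot (generic form only; the fitted constants appear in the next subsection):

```latex
L(X) = \left(\frac{X_c}{X}\right)^{\alpha_X}
\quad\Longrightarrow\quad
\log L(X) = \alpha_X \log X_c - \alpha_X \log X
```

Here X stands for any one of N, D, or C, while X_c and \alpha_X are fitted constants.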

General context

Now let's look at how the author modeled the three variables (C, D, and N) and derived the equations. Before going deeper, a few things to keep in mind:

  • The paper treats compute as the most precious resource and models the equations around it. This seems reasonable, since the paper came out around 2020, when models had just started scaling. This assumption is challenged in our 3rd section, though.
  • The whole premise revolves around compute: if compute is increased, how should the other variables increase to stay efficient?

To understand the problem, let's break it into 3 performance types.

  • Performance type 1: Assumes 2 variables are unlimited and one is limited. How does increasing that one variable reduce the loss?
  • Performance type 2: Assumes 1 variable is unlimited and 2 are limited. How should the two limited variables increase in order to reduce the loss?
  • Performance type 3: Assumes all variables are limited; if given more compute, how should the other variables increase?

Performance type 1

This step lays a solid foundation for performance types 2 and 3. The author modeled the behavior assuming one variable is limited while the other two are not. If this is the case, how does the model perform?

The modeling is done by training models while varying each variable (N, D, and C) and plotting the results. Example: monitor the loss of an LLM that is restricted on data but given enough compute and model size. From these runs, the author fit the power-law equations driving each variable's behavior.

The table below shows the equations and how each factor influences the loss. The major takeaway is that if the other 2 variables are unlimited, the remaining variable follows a power-law curve of performance improvement.

Notation:

  • L(N) is the loss as a function of increasing model parameters
  • L(C) is the loss as a function of increasing compute
  • L(D) is the loss as a function of increasing dataset size
All equations are from the https://arxiv.org/pdf/2001.08361 paper
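
For reference, the single-variable laws have the following form; the exponents reported in the paper are roughly \alpha_N ≈ 0.076, \alpha_D ≈ 0.095, and \alpha_C^{min} ≈ 0.050, while the constants N_c, D_c, and C_c^{min} are fitted values given in the paper:

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},\qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
```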

Performance type 2

This step asks: if one variable is unlimited, how should the other 2 variables increase to be efficient? Keep in mind that this is also not close to reality because, generally, all variables are limited, but let's explore what the author presented for this type.

The main takeaway is that, under the assumption that compute is unbounded, increasing model size is much more favorable than increasing the dataset.

This is challenged in the next section, where the author claims that data plays an equal role to model parameter size.

The table below shows the equations and how each factor influences the performance loss.

All equations are from https://arxiv.org/pdf/2001.08361 paper
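
For reference, the joint parameter-data law in the paper has (approximately) the following form; it implies that, to avoid a data penalty, D only needs to grow roughly as N^{\alpha_N / \alpha_D} ≈ N^{0.74}, which is what makes growing the model more favorable than growing the dataset:

```latex
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
```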

Performance type 3

This is much closer to reality and is the main focus of the paper. Considering all variables are bounded, with compute as the most precious resource: if compute is increased, how should the other variables (D and N) increase to make training efficient? Here the author coined the term compute-optimal large language model.

The main thesis of the paper is that compute is the main bottleneck and that, given more compute, the other variables should be increased in a specific proportion.

All equations are from https://arxiv.org/pdf/2001.08361 paper
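
The resulting compute-optimal allocation, which is exactly the split the next paper revisits, can be summarized as:

```latex
N_{\text{opt}} \propto C^{0.73}, \qquad D_{\text{opt}} \propto C^{0.27}
```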

Conclusion

Model parameters play a big role in scaling. If extra compute is available, model size should be given more priority than dataset growth. This is visible in the image below, where increasing the compute by about a billion times results in a more than million-fold increase in model size but only a ~100x increase in data (batch size).

Image ref: https://arxiv.org/pdf/2001.08361

Now that we have established the basic premise that power laws are at play and that model parameters are so important for scaling, let's move to the next paper, which contradicts parts of this statement.

Training Compute-Optimal Large Language Models

Basic argument

This paper builds on top of the above section and contradicts a few of its data points. The main takeaway is that, in a compute-limited world, model size and data size should scale proportionally: the model parameters follow a 0.5 power factor and the dataset follows a 0.5 power factor. This is in contrast to the explanation above, where the author claimed the model parameters follow 0.73 and the dataset follows 0.27.

Image from this paper: https://arxiv.org/pdf/2203.15556
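
In other words, the compute-optimal split moves from 0.73 / 0.27 to an even split. A commonly quoted rule of thumb derived from the paper is roughly 20 training tokens per parameter:

```latex
N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \approx 20\,N_{\text{opt}}
```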

What is the main thing the author changed?

Hyperparameters

If we recall the previous paper, the author modeled around N, C, and D but assumed the same hyperparameters for all model training runs. In this paper, the author varied the learning rate schedule hyperparameter.

Why does changing the learning rate schedule help?

The thinking goes like this: the learning rate schedule changes the learning rate as gradient descent proceeds. Early in training the learning rate is high, and as training progresses it is decayed so the optimizer can settle into a minimum.

In the previous paper, the author assumed a cosine learning rate schedule and kept it the same across all runs. But, as we can imagine, the schedule can and should also be adjusted to the number of training tokens: it should reflect how much training is going to happen and change the learning rate accordingly.

Implementation detail

According to the second paper: "For each parameter count 𝑁 we train 4 different models, decaying the learning rate by a factor of 10× over a horizon (measured in number of training tokens) that ranges by a factor of 16×."

The author trained 4 models with the same parameter count (N):

  • Model 1: Learning rate decays by 10x over a horizon of H tokens.
  • Model 2: Learning rate decays by 10x over a horizon of 2H tokens.
  • Model 3: Learning rate decays by 10x over a horizon of 4H tokens.
  • Model 4: Learning rate decays by 10x over a horizon of 16H tokens.

This lets the training run adjust the learning rate to how much data will actually be consumed, effectively incorporating the dataset size (D) into the learning rate schedule.
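
To make the setup concrete, here is a minimal Python sketch of such a schedule, with a hypothetical lr_at_token helper and illustrative numbers (not the authors' exact implementation): the learning rate decays from its peak by a factor of 10 over a chosen token horizon, and the horizon is swept over H, 2H, 4H, and 16H as described above.

```python
import math

BASE_LR = 3e-4  # illustrative peak learning rate (assumed, not from the paper)


def lr_at_token(tokens_seen: float, horizon: float) -> float:
    """Cosine-style decay from BASE_LR down to BASE_LR / 10 over `horizon` tokens.

    `horizon` is the token budget over which the decay completes; after that
    point the learning rate simply stays at its minimum value.
    """
    progress = min(tokens_seen / horizon, 1.0)
    min_lr = BASE_LR / 10.0  # "decaying the learning rate by a factor of 10x"
    return min_lr + 0.5 * (BASE_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


H = 1e9  # hypothetical base horizon of 1B training tokens
for mult in (1, 2, 4, 16):  # the four decay horizons described above
    lr = lr_at_token(H, mult * H)
    print(f"horizon {mult:>2}H: learning rate after H tokens = {lr:.2e}")
```

A model whose horizon matches the data it will actually see (the 1H case here) finishes its decay exactly when training ends, while the 16H case is still near its peak learning rate, which is the mismatch the authors set out to measure.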

Intuitively, increasing the dataset (D) without adjusting the learning rate schedule means the model burns through its compute budget while data is still left over; hence, in the previous paper, data never looked like the bottleneck and model parameters did. By matching the learning rate schedule to the amount of data, the model consumes the data at an efficient learning rate, and model parameter scaling (N) and dataset growth (D) end up equally important.

Another thing the author did was verify the above findings by training a new model (famously known as Chinchilla) with 70B parameters (N) on 4x more data than Gopher (280B parameters). The author found that Chinchilla performed better on a large range of downstream evaluation tasks. Since the parameter count is reduced by 4x, the inference cost is reduced by roughly 4x as well. This leads to the famous Chinchilla laws, which state that smaller models can give better performance if trained on more data.

Once we establish that smaller models work better when trained on more data, data becomes the constraining factor. The next section explores how we can get more out of the data we have.

Data-constrained language model

Basic argument

The paper draws a lot of inspiration from the two sections above and extends the thinking further. The idea is that if data (D) and parameters (N) are equally important and should be scaled equally, we have already reached the limit of data available to train on (11 trillion tokens, or 30TB of text data). So we are currently in the data-constrained zone. Since the dataset (D) cannot be increased, this paper explores whether repeating data for a certain number of epochs is a viable strategy to improve compute-optimal performance.

What is the main thing the author changed?

The author took a setup similar to "Training Compute-Optimal Large Language Models," but instead of training on non-repeating data, trained multiple models with the same parameter count on data repeated for multiple epochs (up to ~1,500).

The author noticed that the loss improvement per unit of compute is always highest on fresh, unseen data, which is understandable, but the improvement from repeated data is far from trivial: for up to about 4 epochs, repeating the data is almost as good as new data. With further repetition, the returns diminish fairly quickly, and after 40+ epochs repeating adds no additional value, so no compute should be allocated to it.
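
A toy model of this effect, assuming the value of each additional epoch decays exponentially (with an illustrative constant, not the paper's fitted value), shows the same qualitative behavior: the first few repetitions are worth nearly as much as fresh data, and beyond a few dozen epochs the effective data barely grows.

```python
import math


def effective_data(unique_tokens: float, epochs: float, decay_epochs: float = 15.0) -> float:
    """Toy estimate of the 'effective' data seen when repeating a fixed corpus.

    Assumes the marginal value of each repetition decays exponentially with the
    number of repetitions, so effective data saturates instead of growing
    linearly. `decay_epochs` is an illustrative constant, not a fitted value.
    """
    repeats = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + decay_epochs * (1.0 - math.exp(-repeats / decay_epochs)))


U = 1.0  # one unit of unique data
for e in (1, 4, 10, 40, 100):
    print(f"{e:>3} epochs -> {effective_data(U, e):.2f} units of effective data "
          f"(vs {e} if every repetition were as good as fresh data)")
```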

The author also pointed out that, in this data-constrained regime, spending compute on fewer parameters (N) and more passes over the data (D) gives a better loss improvement than simply scaling up the model. Compute can be spent either on more parameters and less data or vice versa, but the preference is always for fewer parameters (N), since that also reduces the inference cost.

Image from this paper: https://arxiv.org/pdf/2305.16264

Conclusion

This was an amazing journey from "scaling works in LLMs" to "model params (N) play a more important role than the dataset (D), given fixed compute" to "model params (N) and the dataset (D) play an equal role given extra compute." We also learned about the "Chinchilla laws," where smaller models beat larger models when given extra data to train on. Finally, we reached the data-saturation phase, where optimization around data scaling (multiple epochs) is proposed. All of this happened in the last 4 years, as we went from compute-bound to data-bound.

In short: given N, D, and C, we should spend the compute on smaller models (N) with more data (D). Once the data is saturated, training for multiple epochs should be the next strategy, and beyond that, we should look at expanding the model params (N).

Reference

  • Scaling Laws for Neural Language Models (2020): https://arxiv.org/pdf/2001.08361
  • Training Compute-Optimal Large Language Models (2022): https://arxiv.org/pdf/2203.15556
  • Data-constrained language model (2023): https://arxiv.org/pdf/2305.16264


Published via Towards AI
