Language Model Alignment: Use-cases and Approaches

Last Updated on June 3, 2024 by Editorial Team

Author(s): Fabio Yáñez Romero

Originally published on Towards AI.

Before putting a language model into production, certain techniques must be used to align it so that it responds appropriately to the end user. This post explains the main use cases and methods employed for this, updated to 2024.

In language model alignment, one common component of almost all techniques is the human review of the sentences generated by the model. Source: Dall-E 3.

What is language model alignment?

Language model alignment is the final stage of language model training, intended to correct problems associated with the data used during training. Language models are trained on such large amounts of data that it is virtually impossible to check everything used during training, which can lead to toxic language, sensitive information, or biases in the model's outputs. Therefore, in addition to trying to improve the training data itself, language model alignment is used as an effective technique for obtaining responses that people find desirable.

Another reason for using language model alignment is to get the model to respond politely, as we are used to with GPT-4, or in whatever way is intended. This makes it possible to obtain more coherent, structured responses, closer to what a human would produce in a real conversation.

As mentioned above, we can summarise the main reasons why language model alignment is used as follows:

  • Reduce toxic language: prevent the model from using foul language, swear words, or generally offensive language, so that all kinds of people can use the final model, including minors with easy access to these technologies.
  • Avoid answers with sensitive information, such as identity documents, place of residence, telephone numbers, e-mails, etc.
  • Mitigate biases of any kind (gender, ethnicity, etc.): since many professions tend to be dominated by one gender, the model can assign that gender in every response that mentions the profession in question.
  • Obtain answers in more polite language, or language closer to what is expected in a real conversation. This can be seen in chats built on large language models such as ChatGPT or Copilot.

Language alignment techniques

Reinforcement Learning with Human Feedback (RLHF)

RLHF is the most famous model alignment technique, as it achieves the best results. To perform RLHF on a model, the following steps must be carried out:

  1. Obtain sentences from the model to be aligned with human preferences: for each text input (with or without a prompt), we generate numerous output sentences using different decoding techniques (top-k sampling, nucleus sampling, beam search) or different decoding parameters (temperature, generated text length). For each input, we retain the two generated sentences that make the most sense; these are the sentence pairs to be annotated.
  2. Annotate which of the two sentences for each pair is the one that most closely aligns with the objective in question. This may be the most polite sentence, the one that is written in the most natural tone, the one that presents the least sensitive information or bias, etc.
  3. Train the reward model: a language model trained on the annotated sentence pairs as a binary classification task, learning to determine which of the two sentences is preferred (it acts as a kind of discriminator).
  4. After training the reward model, the policy (in this case, the original model) is optimized. The reward model's score is incorporated into the original model's loss function, and PPO is used at this point to update the policy smoothly, avoiding destabilization of the model. This stage is a fine-tuning of the original model with the indicated modification to the loss function.

Using reinforcement learning to align the model is often one of the most effective techniques, but also one of the most complicated. Compared to other deep learning strategies, reinforcement learning tends to be very sensitive to hyperparameter tuning. It requires implementing a reward function aligned with human preferences and fine-tuning the model without losing valuable information from the original model. Other alternatives try to reduce human intervention to speed up the process and make it less costly, at the expense of worse results.
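To make steps 3 and 4 more concrete, below is a minimal PyTorch sketch (not the implementation from any specific paper) of the pairwise loss typically used to train the reward model and of the KL-penalized reward that PPO then optimizes. The function names, the kl_coef value, and the random scores are illustrative placeholders.

```python
# Minimal sketch of the pairwise reward-model loss (step 3) and the
# KL-penalized reward that PPO optimizes (step 4).
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the chosen score above the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def ppo_aligned_reward(reward: torch.Tensor,
                       logprob_policy: torch.Tensor,
                       logprob_reference: torch.Tensor,
                       kl_coef: float = 0.1) -> torch.Tensor:
    """Reward used in step 4: the reward-model score minus a KL penalty
    that keeps the updated policy close to the original model."""
    kl = logprob_policy - logprob_reference
    return reward - kl_coef * kl

# Toy usage with random scores (hypothetical values, for illustration only)
chosen = torch.randn(8)
rejected = torch.randn(8)
print(reward_model_loss(chosen, rejected))
```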

Steps to carry out reinforcement learning with human feedback (RLHF). Source: Training language models to follow instructions with human feedback.

Reinforcement Learning with Artificial Intelligence Feedback (RLAIF)

This process is similar to the previous one, but there is no human annotator: the human is replaced by a model that classifies which of the two sentences is preferred. Depending on how this model was trained, the process can target anonymization, toxic language, etc.

Note that the discriminator can be a language model fine-tuned for binary classification on the task at hand, or simply a model pre-trained on text considered to meet the desired condition (or its opposite). In the latter case, the two sentences of each pair are evaluated separately, and the one with the lower (or higher) perplexity is chosen. This option is feasible if we already have a trained discriminator model, although the results are worse than with a human annotator.
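As a rough illustration of this perplexity-based AI annotator, the sketch below uses the Hugging Face transformers library with GPT-2 as a stand-in discriminator; any pretrained causal language model could take its place, and the example sentences are only placeholders.

```python
# Sketch of an AI "annotator" that prefers the lower-perplexity sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the discriminator model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean token-level
        # cross-entropy; exp(loss) is the perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def pick_preferred(sentence_a: str, sentence_b: str) -> str:
    """Return the sentence the AI annotator prefers (lower perplexity)."""
    return sentence_a if perplexity(sentence_a) <= perplexity(sentence_b) else sentence_b

print(pick_preferred("The weather is nice today.",
                     "Weather the today nice is."))
```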

With the annotation model already trained, the bottleneck is obtaining the sentence pairs to give as input to the annotation model, as these still have to be selected manually.

The main advantage of RLAIF over RLHF is automation, saving time and money by avoiding human annotation. Source: RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.

Direct Preference Optimization (DPO)

This method emerges as an alternative to RLHF and RLAIF, eliminating both the reinforcement learning step (typically PPO, used to obtain the best results) and the reward model external to the original model. Instead, it introduces a new parameter into the loss function and fine-tunes the model on a binary classification task, achieving the same goal as RLHF in a more stable, more straightforward way and without the need to obtain samples from the language model as is done in RLHF.

DPO optimizes a reward function, as PPO does in reinforcement learning, but in a much simpler-to-implement way: each DPO update increases the probability of the preferred sentence at the expense of the other sentence in the pair. In addition, DPO assigns an 'importance' weight to each example, preventing the model from degenerating as it would with a naive implementation. Instead of training a reward model, DPO modifies the loss function so that the reward is expressed directly as a function of the policy. Thus, with a dataset of human preferences over model responses, DPO spares us the external reward model and the PPO machinery in the loss function, resulting in a much simpler final architecture.

DPO maps reward functions to optimal policies, making everything depend on the policy, which is the original model. In the objective originally used in reinforcement learning, it replaces the reward function with the optimal policy that would produce that reward, expressed together with the current policy. As in PPO, the reward is assumed to be updated along with the policy. Thus, we can use DPO in the loss function of the original model instead of PPO.
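The resulting objective fits in a few lines. Below is a minimal sketch of the DPO loss, assuming we already have per-response sequence log-probabilities from the policy being trained and from a frozen reference copy of the original model; beta plays the role of the 'importance' weight mentioned above, and the numeric values are made up for illustration.

```python
# Minimal sketch of the DPO objective over one batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Increase the margin between the preferred and rejected responses,
    measured relative to the frozen reference model, scaled by beta."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up sequence log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```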

Compared to RLHF, DPO eliminates the distinction between the reward and policy models. Source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Contrastive Learning

In contrastive learning, unlike classical supervised learning, the model learns by comparing a set of examples simultaneously; typically, one example is considered the best or positive, while the rest are considered negative examples. This allows us to update the model with <positive, negative> pairs instead of traditional fine-tuning for binary classification, which assigns each example a single positive or negative label.

In contrastive learning, the <positive, negative> pair can also be assigned based on an absolute score for each example, so that within a subset the same example can change its role depending on what it is paired with.

Applied to human alignment of language models, this approach uses an alternative loss function that maximizes the likelihood of the positive sentence and minimizes that of the negative one. In this case, as in the previous ones, a set of responses is obtained from the model for an input text (x) using different decoding strategies.

A human annotator then selects the most appropriate response, which we will call A, while the rest are grouped into B, as indicated in the figure below. In this way, we have a dataset for contrastive learning in which the input text is paired with the optimal response (which will be reinforced) and one of the responses from group B (which will lose importance):

Set of examples for Contrastive Learning in Language Models.

This is the case of the Contrastive Learning Framework for Human Alignment (CLHA). CLHA modifies the loss function of the original model, combining a typical contrastive learning loss (clha in the image below) with the classical supervised learning loss based on maximum likelihood (sft in the image below).
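The sketch below illustrates this combined objective: a contrastive term that pushes response A above each response in group B, plus the usual maximum-likelihood (SFT) term on A. It is only a schematic reading of the idea; the exact CLHA formulation differs in its details, and the margin, weight, and numeric values here are placeholders.

```python
# Schematic combined loss: hinge-style contrastive term + SFT term.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(pos_logp: torch.Tensor,
                               neg_logps: torch.Tensor,
                               sft_nll: torch.Tensor,
                               margin: float = 1.0,
                               alpha: float = 1.0) -> torch.Tensor:
    """pos_logp: sequence log-prob of response A (scalar).
    neg_logps: sequence log-probs of the responses in group B (shape [k]).
    sft_nll: token-level negative log-likelihood of response A."""
    # Contrastive term: A must beat every B response by at least `margin`.
    contrastive = F.relu(margin - (pos_logp - neg_logps)).mean()
    return contrastive + alpha * sft_nll

# Toy usage with made-up log-probabilities
loss = contrastive_alignment_loss(torch.tensor(-10.0),
                                  torch.tensor([-9.5, -12.0, -11.0]),
                                  torch.tensor(2.3))
print(loss)
```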

Example of CLHA combining contrastive learning with traditional supervised learning. Source: CLHA: A Simple Yet Effective Contrastive Learning Framework for Human Alignment.

Contextual Dueling Bandits (CDB)

This is a framework for making decisions based on feedback from an annotator. As in the initial part of RLHF and DPO, it relies on pairs of sentences obtained from the target model, in which the preferred one is annotated. However, instead of having a reward function, a kind of 'duel' is performed by comparing the two sentences of each pair. It is called a 'bandit' problem because it resembles a gambler trying to decide which slot machine (or 'one-armed bandit') to play without knowing the odds of winning on each machine. Over time, as more duels are conducted with feedback, the model learns to predict which actions are preferred in each context.

This learning method may seem similar to contrastive learning, since a comparison is made between two results obtained from the model. However, in this case the objective is to learn an optimal policy that determines which of a pair of examples is the correct one, rather than having the model learn directly. Dueling bandits are normally used when there are no absolute labels of the kind used in contrastive learning or supervised learning in general, and the available labels are relative, as in movie recommendations. It is also a good option when evaluating a reward function is not possible.

The term 'contextual' adds a further layer of complexity to learning the correct policy, because when sentences are compared against each other, the winner may vary according to the assigned context, which, in language model alignment, is the initial input text given to the model.
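As a toy illustration of contextual duels, the sketch below scores responses with a linear model over hypothetical context-response features and updates its parameters from each duel with a logistic (Bradley-Terry style) gradient step. Real systems would use learned representations and more sophisticated exploration strategies.

```python
# Toy contextual dueling bandit: logistic preference model updated from duels.
import numpy as np

dim = 16
theta = np.zeros(dim)   # parameters of the preference (utility) model
lr = 0.1

def features(context: np.ndarray, response: np.ndarray) -> np.ndarray:
    """Hypothetical joint feature map; real systems would use learned embeddings."""
    return context * response

def duel_update(context: np.ndarray, winner: np.ndarray, loser: np.ndarray) -> None:
    """The annotator preferred `winner` over `loser` in this context."""
    global theta
    diff = features(context, winner) - features(context, loser)
    p_win = 1.0 / (1.0 + np.exp(-theta @ diff))   # predicted P(winner beats loser)
    theta += lr * (1.0 - p_win) * diff            # gradient ascent on the log-likelihood

# Toy duel with random context and response vectors
rng = np.random.default_rng(0)
ctx, a, b = rng.normal(size=(3, dim))
duel_update(ctx, a, b)
print(theta[:4])
```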

Application of Dueling Bandits in Natural Language Processing. Source: Multi-Source Test-Time Adaptation as Dueling Bandits for Extractive Question Answering.

Hindsight Instruction Relabeling (HIR)

In this case, reinforcement learning is removed from the architecture entirely. Based on the language model's output, new instructions are generated for the language model, and the original text input is modified.

The goal is to align the instruction to human preferences, rather than aligning the model by modifying its loss function. This does not necessarily require an annotator labeling the results: whether the model achieves the desired goal can be determined in a human-supervised way or, for specific tasks, automatically. When the result is unsatisfactory, the input instruction or prompt is modified to align it with the desired goal.

Normally, this process has an Online Stage, in which samples are obtained from the language model with different instructions, and an Offline Stage, in which the results are compared and the instructions to be used in a new online stage are modified, thus creating an iterative process.
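A toy, runnable sketch of this online/offline loop is shown below. The "model", the goal check, and the relabeling rule are deliberately simplistic stand-ins chosen for illustration; a real HIR setup would sample an actual language model and fine-tune it on the relabeled (instruction, output) pairs.

```python
# Toy online/offline relabeling loop with stand-in functions.
import random

random.seed(0)

def sample_output(instruction: str) -> str:
    """Stand-in for sampling the language model with a given instruction."""
    styles = ["short answer", "polite answer", "detailed answer"]
    return f"{random.choice(styles)} to: {instruction}"

def achieved_goal(output: str) -> str:
    """Stand-in for judging (by a human or automatically) what the output actually did."""
    return output.split(" to:")[0]

def relabel(instruction: str, goal: str) -> str:
    """Hindsight relabeling: rewrite the instruction to match what was achieved."""
    return f"Give a {goal}: {instruction.split(': ')[-1]}"

instructions = ["Answer politely: what is RLHF?"]
for round_ in range(2):
    # Online stage: sample outputs from the model for the current instructions.
    samples = [(inst, sample_output(inst)) for inst in instructions]
    # Offline stage: relabel each instruction according to what the output achieved;
    # the relabeled (instruction, output) pairs would be used to fine-tune the model.
    instructions = [relabel(inst, achieved_goal(out)) for inst, out in samples]
    print(f"Round {round_}: {instructions}")
```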

Example of Hindsight Instruction Relabeling. Source: The Wisdom of Hindsight Makes Language Models Better Instruction Followers.

Representation Alignment from Human Feedback (RAHF)

In this case, activity patterns (the activations of the different parameters for different data inputs) are extracted from models other than the target model. The activity patterns obtained are then superimposed on those of the original model through training, reducing the error between the two sets of patterns in the loss function. Specifically, this final model is fine-tuned with LoRA.

One option is to train a single model with contrastive learning on preferred/non-preferred sentence pairs and examine its activity patterns. Another option is to employ two models, one fine-tuned on the preferred responses and the other on the non-preferred ones.

In both methodologies, the activity patterns obtained for the positive and negative examples are compared, yielding the discrepancy between the sentence pairs in the form of a vector. Specifically, the difference between the hidden states of each token, at its corresponding position in the response, is computed for the preferred and non-preferred instructions. These difference vectors are then used to perturb the original model and align it with human preferences.

The mean squared error between the base model with LoRA and the difference vector is used in the LoRA loss function to align the model with human preferences. In the experiments, the two-model approach achieves results almost as good as RLHF.
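One way to read this description is sketched below: per-token hidden-state differences between the preferred and non-preferred activations define a preference direction, and an MSE term pushes the LoRA-tuned model's hidden states toward activations perturbed along that direction. The tensors are random placeholders for real activations and the scaling factor alpha is a hypothetical knob, so this is only an interpretation of the idea, not the method's exact loss.

```python
# Schematic hidden-state difference vectors and MSE alignment term.
import torch
import torch.nn.functional as F

seq_len, hidden = 12, 64
h_preferred = torch.randn(seq_len, hidden)      # activations on the preferred instruction
h_nonpreferred = torch.randn(seq_len, hidden)   # activations on the non-preferred instruction

# Per-token difference vectors encoding the "preference direction".
direction = h_preferred - h_nonpreferred

# Hidden states of the base model being fine-tuned with LoRA (placeholder tensor).
h_lora = torch.randn(seq_len, hidden, requires_grad=True)

# Target: non-preferred activations perturbed along the preference direction.
alpha = 1.0
target = (h_nonpreferred + alpha * direction).detach()

loss = F.mse_loss(h_lora, target)   # MSE term used to align the representations
loss.backward()
print(loss.item())
```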

Activity Pattern Superposition for Language Model Alignment. Source: Aligning Large Language Models with Human Preferences through Representation Engineering.

Language Model Alignment Drawbacks

Most language model alignment methods (including the most effective ones) share some drawbacks, notably the need for a prior stage in which text generated by the language model is collected and annotated by a human (human feedback) to obtain the best possible results. This slows the process and makes it more costly, as it cannot be done unsupervised.

This drawback is all the more noticeable considering that this data cannot be reused to align other language models, as it must be text generated by the same language model being aligned.

Another drawback is that current language alignment techniques have been shown not to eliminate the possibility of obtaining undesired language. Any effort to align the language model will improve the results, but it will still be possible to elicit responses with toxic language through prompt engineering techniques.

Conclusions

Overall, RLHF is the most effective method for aligning a model. However, it is also one of the most complicated to implement, as it integrates reinforcement learning through PPO into the language model, making convergence difficult. Other alternatives, such as DPO or contrastive learning, are therefore less effective but much easier to implement.

Also, remember that, given the cost of having a human annotator label the text generated by the original language model, using RLAIF may be a suitable option even if it achieves worse results.

Language model alignment techniques are imperfect but help achieve the desired result. Source: Dall-E 3.

Furthermore, aligning a model will not remove 100% of the problems it originally had, so addressing the problem from the point of view of the data used to train the model remains just as important for achieving a good-quality generative model, as demonstrated by the Llama 3 model.

Happy model alignment!

Published via Towards AI