
The Right Approach to Personalize LLM Style — Rewards Dropout for Human Styles Alignment and Training Regularization

Author(s): Roman S

Originally published on Towards AI.

The only “AI” generated thing here. Created by the author with GPT-4o

Abstract

In this article, I describe how to effectively solve a style transfer task and bypass AI detection through LLM tuning.

Style is one of the most prominent and well-studied aspects of user personalization, and its importance is increasing as users become more adept at recognizing texts generated by LLMs. A key challenge in stylistic personalization is that a single user may exhibit multiple different styles — some determined by context, others influenced by latent or unobservable factors (e.g. mood). This phenomenon of multiple valid outputs is not unique to style but extends to many other dimensions of personalization.

In this work, I focus on a specific task: aligning an LLM to rewrite input text in a way that (1) minimizes the likelihood of detection by LLM detectors and (2) maximizes stylistic similarity to a set of reference texts. Importantly, the styles of these reference texts are diverse and non-overlapping, reinforcing the need for alignment methods capable of handling multiple distinct correct policy outputs.

Why won’t it work with just few-shot style examples in a ChatGPT prompt?

ChatGPT (GPT and other models) will simply choose one of the styles from the few-shot examples and use it in the majority of the generated texts; it won’t maintain the diversity of the styles provided as examples. Moreover, these texts will be detectable by AI detectors.

Why might it be difficult for reinforcement learning (RL) to align with the desired set of styles?

Classic modern reinforcement learning frameworks, like reinforcement learning from verifiable rewards, work in a paradigm where all correct answers are similar to each other. Applying such a framework to a case where multiple different styles are equally correct leads to suboptimal results: the policy converges to the simplest style to mimic, which is reward hacking behaviour.

The image below contrasts the observed style cloning behaviour of an LLM prompted with style few-shots or tuned with RL against the desired behaviour.

Figure 1. The style cloning behaviour of an LLM with few-shots or tuned with RL to maximize similarity to any reference style (left) and the desired style cloning behaviour when multiple different styles are the references (right). Image created by the author

Related stuff

Why minimize the probability of the text being detected by an LLM detector?

Some recent studies show that people tend to disfavor LLM-generated text if they know the text was not created by a human. For example, this study shows that even though people in general prefer AI-generated text, knowing that an LLM is the source of the text significantly reduces the probability that the text is preferred. Taking into account that LLM texts are also quite recognizable to frequent users, I believe that text AI-detectability is one of the main features of human style cloning (“general human style cloning”).

What is an LLM detector and how does it work?

According to Desklib research, an LLM detector can be an encoder transformer classifier based on the DeBERTa model. There are also some other approaches, but they come with few implementation details. The Desklib model was used as the LLM detector in the reward function for the current research.
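
For illustration, here is a minimal sketch (Python) of how such a DeBERTa-based detector could be queried for a “written by a human” probability. The model ID and the label mapping are placeholders, not necessarily the exact Desklib checkpoint or head used in this work.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

DETECTOR_ID = "desklib/ai-text-detector"  # hypothetical model ID, stand-in for the actual detector

tokenizer = AutoTokenizer.from_pretrained(DETECTOR_ID)
detector = AutoModelForSequenceClassification.from_pretrained(DETECTOR_ID)
detector.eval()

def human_probability(text: str) -> float:
    # Return the probability that `text` was written by a human.
    # Assumes label 0 = human, label 1 = AI; check the actual label mapping of the model you use.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = detector(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return probs[0].item()

print(human_probability("Good evening Sir or Madam, I would like to introduce myself."))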

Naive way for text style transfer

According to this paper, style transfer is usually based on attention masking or LLM prompting (or both). LLM prompting is a much more flexible way to do style transfer, especially considering the option to define the target style with few-shot examples. Despite that flexibility, here I show that this method is not capable of following multiple style references: it latches on to a few selected examples provided as the few-shots.

Reinforcement learning for alignment

The style transfer task, together with avoiding detection by AI detectors, can be interpreted as an alignment task. Style transfer success can be evaluated with a style encoding technique that measures the distance between the reference styles and the style of the generated texts, while AI detection can be evaluated with any AI detector. Reinforcement learning looks like the most suitable way to solve this alignment task (in this case both the style evaluator and the AI detector can be external services; we do not need to compute gradients through the evaluators to build the reward function). GRPO and RLOO are modern, simple reinforcement learning algorithms that can be effective for this task (the two are very similar, especially when comparing some GRPO modifications, e.g. Dr.GRPO, to RLOO). In the current research, I use the Dr.GRPO algorithm in an experimental setting where the number of PPO epochs is equal to 1, making it even more similar to RLOO.
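
To make the GRPO/RLOO connection concrete, here is a small sketch of the group-relative advantage these algorithms build on. Dr.GRPO drops the per-group standard-deviation normalization of vanilla GRPO, leaving a plain “reward minus group mean” baseline, which RLOO approximates with a leave-one-out mean; this is a conceptual sketch, not the exact trainer code used here.

import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts, group_size) rewards of the sampled completions per prompt.
    baseline = rewards.mean(dim=1, keepdim=True)   # per-prompt group mean baseline
    return rewards - baseline                      # Dr.GRPO-style advantage
    # vanilla GRPO would additionally divide by rewards.std(dim=1, keepdim=True)

rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9],
                        [0.1, 0.1, 0.7, 0.3]])
print(group_relative_advantages(rewards))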

Style similarity

For the style similarity evaluation task, transformer encoder-based architectures are trained with an objective that makes cosine similarity higher for similar styles and lower for dissimilar ones. These style encoders build text representations (embeddings), and the distance between the reference embeddings and the embeddings of the generated texts represents style similarity (analogous to semantic similarity). In the current research, I used this model as the style encoder.
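
A minimal sketch of scoring style similarity with such an encoder is shown below. The model ID is a placeholder for the style encoder linked above; any encoder trained with a contrastive style objective and exposing a sentence-transformers interface would slot in the same way.

from sentence_transformers import SentenceTransformer, util

style_encoder = SentenceTransformer("StyleDistance/styledistance")  # assumed/placeholder model ID

references = [
    "Good evening Sir or Madam, I would like to introduce myself.",
    "Lori's gonna rock being boss at HiTOPS; she'll seriously push things forward.",
    "Wow :-), I'll advocate for Blanco's dedication to ethical business, and CRT membership =D!",
]

def max_style_similarity(generated: str) -> float:
    # Cosine similarity between the generated text and each reference style; keep the maximum.
    gen_emb = style_encoder.encode([generated], convert_to_tensor=True)
    ref_emb = style_encoder.encode(references, convert_to_tensor=True)
    return util.cos_sim(gen_emb, ref_emb).max().item()

print(max_style_similarity("Hey folks, check out our new eco line!"))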

Methodology

The idea of this work is not to fight LLM detectors to help people create undetectable AI content! The motivation is to optimize processes where text is a tool rather than the main value (for example, I believe that in research papers and study reports the text is the main value, while in marketing emails the text is more of a tool than the main value). Keeping this in mind, I limited the lengths of the input and output contexts in all the experiments to ~300 tokens.

In all the experiments I worked with the meta-llama/Llama-3.2-3B-Instruct model. This model size should be enough for the simple task of text rewriting; the model doesn’t require reasoning or any other capabilities beyond rewriting. At the same time, motivated by the idea that extreme user personalization means each user has their own model, the tuning is based on low-rank adapters (LoRA), with the goal of making these adapters as small as possible.

Warm-up

The Llama model I am using is an Instruct model trained to follow instructions, while I need the model to simply rewrite text. The straightforward approach would be to prompt the model and train with a prompt to rewrite the input text; however, this approach adds unnecessary computational overhead (the prompt, e.g. “Rewrite the text…”, represents additional tokens processed on every inference call — unnecessary computation that can be avoided). To avoid this, as a warm-up I did simple supervised finetuning for 1000 steps with a learning rate of 2e-5 and batch size 1. For the dataset, I took short texts from the liamdugan/raid dataset that were labeled as human-written, rewrote them with gpt-4o-mini, and used the rewritten texts as inputs and the originals as targets.
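
A sketch of building these warm-up pairs is given below (rewritten text as input, original human text as target, so the model learns “rewrite” without an explicit instruction). The RAID config/split and field names ("generation", "model") are assumptions about the dataset schema, and the filtering thresholds are illustrative.

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()
raid = load_dataset("liamdugan/raid", split="train", streaming=True)  # split name is an assumption

def rewrite(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Rewrite the following text:\n\n{text}"}],
    )
    return resp.choices[0].message.content

pairs = []
for row in raid:
    # keep short human-written samples; "human" label and field names assumed
    if row.get("model") == "human" and len(row["generation"].split()) < 200:
        pairs.append({"input": rewrite(row["generation"]), "target": row["generation"]})
    if len(pairs) >= 1000:
        break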

Reinforcement learning

For reinforcement learning I used Dr.GRPO with PPO epochs set to 1 (no trust region clipping happens in this setup). The reward function consists of the following components:

  • (1) Format mask — I expect the first word of the generated answer to be “Rewritten:”, so I added this as a binary reward;
  • (2) AI-detector probability of being created by a human — the probability returned by the selected AI-detector model, inverted so that 1 means human-generated and 0 means LLM-generated. Also, to avoid overfitting, for each probability higher than 0.5 I used (probability**(1/2)*1.11).clip(0,1). During ablations it was additionally noticed that binary rewards (1/0) work worse — they lead to much slower convergence because of the sparse training signal;
  • (3) Style similarity — I used 3 reference styles that are different from each other, namely the following texts: “Good evening Sir or Madam, I would like to introduce myself.”, “Lori’s gonna rock being boss at HiTOPS; she’ll seriously push things forward.”, “Wow :-), I’ll advocate for Blanco’s dedication to ethical business, and CRT membership =D!” — the second and the third were taken from the StyleDistance/synthstel dataset. It was checked that the style similarities between these texts are low according to the selected style encoder. As the style similarity reward component I took the maximum similarity to the reference styles;
  • (4) Length reward — during preliminary experiments, it was noticed that the model tries to make texts longer as one way of reward hacking. With the length reward I try to keep the generated text no longer than the input text plus 30 words (spoiler: in the follow-up experiments, which I will describe later, I changed this absolute delta to a relative one);
  • (5) Named entities reward — to add robustness, I added an NER component that rewards the model according to the ratio of named entities from the original text that are found in the generated text. For named entity recognition I used spaCy.

The final reward is (1) * ((2) + (3) + (4) + (5)), i.e., the format mask gates the sum of the other components. A hedged sketch of this composite reward follows.
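
This sketch mirrors components (1)–(5); human_probability and max_style_similarity stand in for the detector and style-encoder calls sketched earlier, and the binary form of the length reward is my assumption (the article only states the input+30-words target).

import spacy

nlp = spacy.load("en_core_web_sm")

def composite_reward(input_text, generated, human_probability, max_style_similarity):
    # human_probability and max_style_similarity are the callables sketched above.
    # (1) format mask: the answer must start with "Rewritten:"
    fmt = 1.0 if generated.startswith("Rewritten:") else 0.0
    body = generated[len("Rewritten:"):].strip() if fmt else generated
    # (2) detector reward with square-root shaping above 0.5
    p_human = human_probability(body)
    det = min(max((p_human ** 0.5) * 1.11, 0.0), 1.0) if p_human > 0.5 else p_human
    # (3) maximum similarity to the reference styles
    style = max_style_similarity(body)
    # (4) length reward: outputs longer than input + 30 words get no reward (assumed binary form)
    length = 1.0 if len(body.split()) <= len(input_text.split()) + 30 else 0.0
    # (5) ratio of input named entities preserved in the output
    ents_in = {e.text for e in nlp(input_text).ents}
    ents_out = {e.text for e in nlp(body).ents}
    ner = len(ents_in & ents_out) / len(ents_in) if ents_in else 1.0
    return fmt * (det + style + length + ner)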

Following the DAPO paper, I removed the KL divergence component from Dr.GRPO; it was found that any positive KL weight restricts exploration. I also ran ablations with a small KL divergence weight, which resulted in the model not converging.

I experimented with different learning rates and found that learning rates equal to or higher than the learning rate used during the supervised finetuning (SFT) stage result in quick model divergence (generating meaningless answers); the learning rate used in SFT was set to 1e-5. At the same time, similar training patterns were observed when doing full-model finetuning and LoRA-only finetuning (following QLoRA in applying adapters to all linear layers, but without quantizing the main model) — the training pattern for the AI-detector reward is shown in Figure 2.

Figure 2. Total train reward graphic. Image created by the author

The final setup used LoRA adapters because they provide better options for LLM personalization — each user can have their own LoRA adapters, which is far more efficient than each user having a whole model. Training hyperparameters: batch size 32, learning rate 1e-6 with cosine scheduling down to 90% of the initial learning rate over 3 epochs (in fact, training was stopped after 1 epoch), LoRA rank 32, and LoRA alpha 16.
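
A sketch of a LoRA configuration matching these reported hyperparameters (rank 32, alpha 16, adapters on all linear layers as in QLoRA but without quantizing the base model); the exact trainer wiring is omitted and not claimed to be the author’s code.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",  # apply adapters to every linear layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()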

As the dataset with input texts I used the same liamdugan/raid dataset, but this time I took LLM-generated texts without any attack applied (the dataset contains texts with different attacks against AI detectors).

Reward dropout

During model training with the described settings I noticed two issues:

  1. The style reward component went quite high; however, the text quality was not satisfactory in some samples (as per manual checks);
  2. During training I additionally tracked the standard deviation of the similarities to each reference style embedding across the generated samples. This standard deviation was becoming smaller, which means that the model tends to generate outputs in the same dominant style (Figure 3a).

These two issues indicate overfitting / reward hacking. At the same time, the second issue means we are not achieving the initial goal: training a model that maximizes multiple alternative rewards — similarity to many different “correct” styles.
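
For reference, a minimal sketch of the diagnostic behind issue 2: for each reference style, track the standard deviation of its similarity to the generated samples in a batch; if those deviations shrink, the policy is collapsing onto one dominant style. The embedding shapes are assumptions about how the batch is organized.

import torch

def per_style_similarity_std(gen_embs: torch.Tensor, ref_embs: torch.Tensor) -> torch.Tensor:
    # gen_embs: (batch, dim) style embeddings of generated texts, L2-normalized;
    # ref_embs: (num_styles, dim) reference style embeddings, L2-normalized.
    sims = gen_embs @ ref_embs.T   # (batch, num_styles) cosine similarities
    return sims.std(dim=0)         # std per reference style across the batch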

One of the ways to regularize a model is to apply dropout to some of its layers. By analogy with dropout in neural networks, I applied dropout to the alternative reward functions: to apply this kind of dropout, I simply drop the corresponding share of the reference styles from the matrix against which similarity is computed when calculating the style reward.

In the experiments I used a dropout rate of 20% for this reward dropout. Applying the reward dropout, I was able to regularize the reinforcement learning training, change the behaviour of the standard deviation indicator (Figure 3b), and get better quality results.
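
A minimal sketch of this reward dropout over precomputed style embeddings is given below. Keeping at least one reference style after the drop is my own safeguard for the sketch, not something stated in the article.

import torch

def style_reward_with_dropout(gen_emb: torch.Tensor,
                              ref_embs: torch.Tensor,
                              dropout: float = 0.2) -> torch.Tensor:
    # gen_emb: (dim,) embedding of one generated text; ref_embs: (num_styles, dim).
    keep = torch.rand(ref_embs.size(0)) >= dropout      # randomly drop a share of reference styles
    if not keep.any():                                   # safeguard: always keep at least one reference
        keep[torch.randint(ref_embs.size(0), (1,))] = True
    sims = torch.nn.functional.cosine_similarity(gen_emb.unsqueeze(0), ref_embs[keep])
    return sims.max()                                    # max similarity over the surviving references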

Figure 3a. Standard deviation of the similarities to each reference style across the batch, without reward dropout. Image created by the author
Figure 3b. Standard deviation of the similarities to each reference style across the batch, with reward dropout. Image created by the author

Evaluation

The evaluation is based on comparing the LoRA models trained with reinforcement learning against gpt-4o-mini prompted to clone the style from few-shots, where all the style reference texts were provided as the few-shots (see Appendix A for prompt details). I evaluated the models on datasets different from those used in training; the evaluation dataset contains LLM-generated marketing email samples. In particular, I used 10 samples from the RafaM97/marketing_social_media dataset, which contains synthetic messages, and 10 samples from Isotonic/marketing_email_samples (augmented by rewriting them and filling any templates with gpt-4o-mini). Apart from evaluating the results with the style encoder and AI detector (similar to the reward function, but without the additional reward scalings), I compared the outputs with an LLM-as-a-Judge approach. For Judge prompting I followed a pairwise comparison approach. Additionally, knowing that the Judge may have model bias and prefer outputs generated by the same model, I used gpt-4.1-mini as the Judge and randomly shuffled the order of the samples being compared to improve the robustness of the judging.
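
A hedged sketch of the pairwise judging with order shuffling is shown below; the judge prompt wording is illustrative, not the exact prompt used in this work.

import random
from openai import OpenAI

client = OpenAI()

def judge_prefers_first(reference_styles: str, candidate_a: str, candidate_b: str) -> bool:
    # Randomly swap the two candidates to reduce position bias, then undo the swap on the verdict.
    a, b = candidate_a, candidate_b
    swapped = random.random() < 0.5
    if swapped:
        a, b = b, a
    prompt = (
        "You compare two rewrites for how well they match the reference styles.\n"
        f"Reference styles:\n{reference_styles}\n\n"
        f"Rewrite 1:\n{a}\n\nRewrite 2:\n{b}\n\n"
        "Answer with '1' or '2' only."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    verdict_is_first = resp.choices[0].message.content.strip().startswith("1")
    return verdict_is_first != swapped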

The evaluation results are in Figure 4. As the evaluation metrics I used:

  • (1) Average maximum style similarity: compute the maximum similarity between the style of the generated sample and the reference styles, then average across all samples;
  • (2) Average not-AI-generated probability: for each sample, compute the probability of the sample being human-generated, then average over all samples. After some experiments I found that a specific text structure limits the not-AI-generated probability: samples starting with “Subject:” are more likely to be labeled as AI-generated, so I additionally calculated this metric after removing samples starting with “Subject:” from the evaluation dataset (the second number in the corresponding column of the evaluation results table);
  • (3) Input styles distribution: for each of the 3 reference styles, count the number of samples from the evaluation dataset that are closer to this reference style than to any other;
  • (4) Output styles distribution: the same as the input styles distribution, but for the generated outputs — here we can actually see how the merely prompted LLM tends to prefer a single style every time, and direct RL tuning without reward dropout does the same;
  • (5) Judge preferred ratio: the ratio of samples generated by each approach that were preferred by the LLM-as-a-Judge over direct LLM prompting with few-shots.
Figure 4. Evaluation results. Image created by the author
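
For concreteness, a small sketch of how metrics (1), (3), and (4) can be computed from style embeddings; it assumes L2-normalized embeddings from the style encoder and is not the exact evaluation code.

import numpy as np

def style_metrics(embs: np.ndarray, ref_embs: np.ndarray):
    # embs: (n_samples, dim) style embeddings of the texts; ref_embs: (n_styles, dim); both L2-normalized.
    sims = embs @ ref_embs.T                                          # (n_samples, n_styles) cosine similarities
    avg_max_sim = sims.max(axis=1).mean()                             # metric (1)
    distribution = np.bincount(sims.argmax(axis=1), minlength=ref_embs.shape[0])  # metrics (3)/(4)
    return avg_max_sim, distribution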

An example of the input text, the output from the model trained with reward dropout, and the response of the gpt-4o-mini model prompted in the few-shot setting:

Input sample: Develop a social media campaign, “Eco-Chic Challenge,” encouraging users to share their sustainable fashion choices. Collaborate with eco-influencers, create engaging content (reels, stories, posts) highlighting GreenThreads’ commitment to sustainability. Utilize Instagram’s “Reels” feature to create a branded hashtag challenge, offering discounts to participants who share their own eco-friendly fashion tips.

Not AI-generated probability: 17%

Maximum style similarity: 0.81

GPT response: Let’s kick off the “Eco-Chic Challenge” on social media, inspiring users to flaunt their sustainable fashion choices! We’ll team up with eco-influencers to craft captivating content like reels, stories, and posts that showcase GreenThreads’ dedication to sustainability. We can leverage Instagram’s “Reels” feature to launch a branded hashtag challenge, where participants can earn discounts by sharing their own eco-friendly fashion tips. It’s time to make sustainability stylish!

Not AI-generated probability: 0.7%

Maximum style similarity: 0.93

LoRA trained with the reward dropout response: Launch a social media campaign, “Eco-Chic Challenge”, to prompt users to post about their sustainable fashion. Partner with eco-influencers, design attractive content (reels, stories, posts), that showcases GreenThreads’ dedication to sustainability. Use Instagram’s “Reels” function, to develop a branded hashtag challenge, and offer discounts, to participants, who post their own eco-conscious fashion tips.

Not AI-generated probability: 68%

Maximum style similarity: 0.95

LoRA trained without reward dropout response: Launched the “Eco-Chic Challenge” social media campaign, and, within it, we work with eco-influencers, produce attractive content (reels, stories, posts), and promote GreenThreads’ constant dedication to sustainability. We, through Instagram’s “Reels” function, create a branded hashtag challenge, and, for the participants, we offer discounts. Participants, through their own, should share eco-friendly fashion tips.

Not AI-generated probability: 91%

Maximum style similarity: 0.98

The provided examples show that direct LoRA training (in the described setup) leads to attempted reward hacking and worse final results.

Conclusion

This article presented an approach to solving the multi-style personalization problem with the novel reward dropout method.

At the same time, it was shown that other methods, including direct alignment without reward dropout, lead to overfitting to a single style. The proposed reward dropout approach is effective not only for solving the multiple-correct-behaviours (styles) problem, but also for regularizing the reinforcement learning training process (similar to the original dropout in neural networks).

Stay tuned for the story to continue — I’ve already started the second part, where I am working on improvements to the approach presented here. For example, I am improving the reward function (to avoid reward hacking through punctuation) and trying to apply the Bit-LoRA approach, which can make the LoRA adapters much smaller; at the same time, preliminary results show that this adapter quantization provides a huge additional regularization effect when doing RL.

Appendix A. GPT-4o-mini system prompt for style transfer through few-shots

You are a helpful assistant.
Rewrite to make the content following the user’s style. Some examples of the user’s messages:

  1. Good evening Sir or Madam, I would like to introduce myself.
  2. Lori’s gonna rock being boss at HiTOPS; she’ll seriously push things forward.
  3. Wow :-), I’ll advocate for Blanco’s dedication to ethical business, and CRT membership =D!

Rewrite the message to exclude any template marks if there are any — you can insert some values there. Start your answer with the word “Rewritten: “
