Prompt Robustness: How to Measure and How to Enhance
Author(s): Kelvin Lu
Originally published on Towards AI.
We noticed in our practice that LLMs are sensitive to the subtle details of the prompt, and even small changes can lead to noticeable differences in the output. This makes it important to evaluate the prompt robustness of an LLM before using it in production. In this article, I'm going to introduce how to measure prompt robustness and how to use it as part of the LLM engineering process based on a recently developed framework known as PromptBench.
Agenda
· Why Prompts Are Vulnerable
· PromptBench Introduction
· The Interesting Findings
∘ Not all LLMs are created equal
∘ Vulnerable to Lower-level Attacks
∘ Term Frequency Relevancy
∘ Abra Cadabra
· Limitations
· Closing Words
· References
Why Prompts Are Vulnerable
When generative LLMs are trained, they learn patterns from the input and use those patterns to predict the next token. They then append the newly generated token to the input and predict the next one, iterating until the output is complete. This structure is called autoregressive because it always predicts the next output in a sequence based on the previous outputs in that sequence. The mechanism is very similar to the autoregressors often used in time series analysis, which predict future values of a variable based on its past values.
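To make this loop concrete, here is a minimal sketch using Hugging Face Transformers. The model name, prompt, and generation length are arbitrary illustrations, and plain greedy argmax is used for simplicity; production LLMs use more sophisticated sampling:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any small causal LM works for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
for _ in range(20):  # generate 20 tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids).logits          # [batch, seq_len, vocab]
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
    input_ids = torch.cat([input_ids, next_id], dim=-1)      # feed the output back in
print(tokenizer.decode(input_ids[0]))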
Despite the fact that LLMs display impressive language understanding and generation capabilities, their learning is based on statistical patterns, not on true knowledge extraction and reasoning as humans do. As such, the LLMs' behavior is heavily shaped by the training data: the models are most familiar with the patterns present in that data. As a result, any data deviation at prediction time may cause a more or less significant performance drop. What do we call this type of problem? Overfitting!
It has been reported that discarding low-quality training data can actually improve a model's performance. But that kind of data selection can also introduce selection bias, in that certain language patterns become even stronger. Potentially, the overfitting may get worse if it's not specifically treated.
Computer-vision models suffer from the same problem: they are also sensitive to unexpected details in the training material. For example, a model trained to detect cows may pay too much attention to the grass and clouds. One of the ways to address this when training CV models is data augmentation: alter the training images a little to create a new dataset by tilting, stretching, clipping, adding random noise, masking, etc. However, these tricks are not commonly used in LLM training. The main reason is that, compared to CV tasks, NLP tasks already have much larger amounts of human-generated training material, and computational resources are already a bottleneck for LLM training. Also, generating high-quality augmented text is not easy. That's why NLP data augmentation is not a priority today.
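For comparison, a typical CV augmentation pipeline looks something like the sketch below, using torchvision; the specific transforms and parameters are arbitrary illustrations of the alterations mentioned above:

from torchvision import transforms

# An illustrative image-augmentation pipeline: each transform mimics one of the
# alterations mentioned above (tilting, stretching/clipping, pixel perturbation,
# masking). It is applied to a PIL image at training time.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),      # tilt
    transforms.RandomResizedCrop(224),          # stretch and clip
    transforms.ColorJitter(brightness=0.2),     # perturb pixel statistics
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),            # mask a random patch
])
# augmented_tensor = augment(pil_image)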
The problem that LLMs do not always follow prompts accurately is just one symptom of LLMs' imperfect training process. Because of this, we must have a way to evaluate an LLM's prompt robustness, i.e., how stably the LLM performs across different variations of the instructions.
Making sure LLMs strictly follow prompts has been a widespread concern, and people are looking at the problem from different angles. Some studies center on organizing prompts so the LLM understands them better. For example, in the case of few-shot learning, researchers noticed that LLMs put more weight on the last example, and that terms common across the examples have a higher impact on the model. They are researching how to design prompt templates and organize the examples so that the prompt works better.
Some other research is about safety and security: for instance, how to make sure LLMs don't disclose sensitive information and don't produce unwanted outputs. This is also an interesting topic; it will be discussed separately.
Last but not least: when an LLM gets an imperfect prompt, how badly does its performance drop? This is a very practical problem. When we decide on an LLM to build on, we need to know how it works with less-than-ideal prompts. When we fine-tune a model, we need to know how the new LLM compares to the foundation model. Even further, we want to learn how to enhance our model training process to reduce the performance drop, and we would also like to enhance our prompts with the know-how learned from the model comparison.
PromptBench Introduction
The simplest way to evaluate LLM prompt robustness is to produce your own test prompt set manually and run it against the models. While this is a quick fix, you can never know whether the test set is representative enough, and you can't get an objective indicator of the model's performance.
PromptBench [2] offers a plausible solution on this track.
Compared to the manual trick, PromptBench's solution is systematic and extensible. It can generate prompt attacks (prompt variations) based on four strategies:
- Character-level: PromptBench can use TextBugger and DeepWordBug to manipulate texts by introducing typos or errors to words, e.g., by adding, deleting, repeating, replacing, and permuting characters for certain words (a toy sketch of this kind of perturbation follows the list below).
- Word-level: Use BertAttack and TextFooler to replace words with synonyms or contextually similar words to deceive LLMs.
- Sentence-level: Use StressTest and CheckList to append irrelevant or extraneous sentences to the end of prompts, intending to distract LLMs.
- Semantic-level: Simulate different linguistic expressions with the same meaning by translating from six common languages (Chinese, French, Arabic, Spanish, Japanese, and Korean) into English, introducing linguistic nuances and variations that could potentially impact LLMs.
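To illustrate the character-level category, here is a toy perturbation function. It is not the actual TextBugger or DeepWordBug algorithm, just a minimal sketch of the idea of corrupting a few characters inside words:

import random

def character_attack(prompt: str, rate: float = 0.3, seed: int = 0) -> str:
    # Toy character-level perturbation: randomly swap two adjacent characters
    # inside some of the longer words. Real attack libraries choose their targets
    # and edits far more carefully.
    rng = random.Random(seed)
    words = prompt.split()
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < rate:
            j = rng.randrange(1, len(word) - 2)
            chars = list(word)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
            words[i] = "".join(chars)
    return " ".join(words)

print(character_attack("Summarize the following passage in one sentence."))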
It was designed to process a combination of four types of prompts: zero-shot role-based, few-shot role-based, zero-shot task-based, and few-shot task-based prompts. More complicated prompt patterns like CoT, ReAct, etc. are not supported.
When we use PromptBench, we need to provide a labelled dataset, a task, and a model for the utility to run against. PromptBench already supports a list of models and datasets, including out-of-the-box support for ChatGPT and GPT-4. Users can easily extend the framework to their own models and datasets as well.
One of the great things about PromptBench is that it provides an evaluation metric, so we can compare results on different tasks, using different datasets, against different models. The definition of the metric function is as follows; in plain English, it is the average performance drop rate:
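For reference, as I read it from the PromptBench paper [1], the metric is the Performance Drop Rate (PDR), roughly:

PDR(A, P, f_θ, D) = 1 − [ Σ_{(x;y)∈D} M[f_θ([A(P), x]), y] ] / [ Σ_{(x;y)∈D} M[f_θ([P, x]), y] ]

where A is the adversarial attack applied to prompt P, f_θ is the model, D is the dataset, and M is the task's evaluation function (e.g., accuracy). As I understand it, the "average performance drop rate" then averages PDR over the different attacks and datasets.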
The framework can be found in its GitHub repo [2]. At the moment, it can only be installed by cloning the repo and creating a Conda environment; pip install is not supported yet.
# First, clone the repo:
git clone git@github.com:microsoft/promptbench.git
# The environment can be set up using Conda. Run the following command to create the environment from the provided environment.yml file:
conda env create -f environment.yml
Running an attack is simple:
# For running GLUE dataset
python main.py --model google/flan-t5-large \
--dataset mnli \
--attack textfooler \
--shot 0 \
--generate_len 20
# For running MMLU, SQuAD V2, IWSLT, UN Multi, and Math dataset
python main.py --model google/flan-t5-large \
--dataset mmlu \
--attack semantic \
--shot 0 \
--generate_len 20
The Interesting Findings
Not all LLMs are created equal
Some models are more sensitive to prompt attacks than others. According to the paper's summary results, the UL2 model is the most robust to prompt attacks, and T5 is also not too bad; ChatGPT comes third, and Vicuna is the worst model in the benchmark.
The difference in PDR between LLMs is a good reason to apply PromptBench testing before we decide which LLM to use. It also indicates that we should investigate better methods for model training and fine-tuning. When UL2 performs much better than ChatGPT, its team must have done something right; when Vicuna performs much worse, there must be lessons to learn from its training and data preparation processes.
Vulnerable to Lower-level Attacks
The second interesting finding is that the models are more vulnerable to word- and character-level attacks. The word-level attacks simulate the impact of using synonyms, while the character-level attacks simulate the impact of typos and non-standard wording.
According to the reported results, word-level attacks cause a 33% to 35% performance drop on average, and character-level attacks cause a 20% to 23% drop on average. These two types of attacks are the closest to typos and word misuse, and the countermeasure is very simple: correct the typos and uncommon wording in the prompts, and make sure the prompts are concise and clear.
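As a minimal example of that countermeasure, one could run prompts through a simple spell-check gate before sending them to the model. The sketch below uses the pyspellchecker package; any spell-checking or linting tool would do, and real prompts with domain jargon would need a custom dictionary:

from spellchecker import SpellChecker  # pyspellchecker

def lint_prompt(prompt: str) -> list[str]:
    # Flag likely typos in a prompt before it reaches the LLM.
    spell = SpellChecker()
    words = [w.strip(".,:;!?\"'()") for w in prompt.split()]
    issues = []
    for word in spell.unknown([w for w in words if w.isalpha()]):
        issues.append(f"possible typo: '{word}' -> '{spell.correction(word)}'")
    return issues

print(lint_prompt("Pleas summarize the folowing document in three sentances."))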
Term Frequency Relevancy
The authors also noticed that certain terms are more frequently associated with unstable prompts, while others are more common in robust prompts. According to the word-frequency analysis in the paper, prompts containing 'acting', 'answering', 'detection', and 'provided' cause a smaller performance drop; in other words, prompts with these words are more robust. Vulnerable prompts, on the other hand, often contain words like 'respond', 'following', and 'examine'. This finding can help us produce better prompts by using a more stable vocabulary.
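As a toy aid along these lines, one could scan a prompt for the reportedly fragile terms. The word lists below come from the findings summarized above; treating them as hard rules is my own simplification, not a recommendation from the authors:

# Words the paper associates with more robust vs. more vulnerable prompts.
ROBUST_WORDS = {"acting", "answering", "detection", "provided"}
FRAGILE_WORDS = {"respond", "following", "examine"}

def flag_fragile_terms(prompt: str) -> list[str]:
    # Return the fragile terms that appear in the prompt, if any.
    tokens = {w.strip(".,:;!?\"'()").lower() for w in prompt.split()}
    return sorted(tokens & FRAGILE_WORDS)

print(flag_fragile_terms("Examine the following text and respond with a label."))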
Abra Cadabra
Interestingly, the researchers noticed that adding meaningless strings to the prompt can have either a positive or negative impact on its performance.
For instance, introducing a random sequence 'LKF0FZxMZ4' during a CheckList attack distracted the attention of the model and reduced focus on the critical segments. As such, it reduced the model's performance.
On the other hand, adding an irrelevant sequence such as 'and true is true' intensifies the LLM's focus on the significant words, thus producing a better prediction.
My understanding is that this reveals the importance of domain-relevant fine-tuning. Many domain applications use specific vocabularies, and that kind of information can only make sense to the models after fine-tuning. Before that, such terms look more like random strings in the LLMs' eyes.
I haven't found much research on the impact of adding random sequences to the prompt yet, except for some efforts to enhance prompt performance by permutation. I'm more concerned about the risks of this finding, because the fact that adding a random string can overturn a model's prediction is scary. The authors' report did not find that the added random strings completely changed the results; in fact, the average performance drop was not significant compared to character- and word-level attacks. But, who knows, this behavior could turn out to be a zero-day bug of LLMs.
Limitations
One of the limitations of PromptBench is that it doesn't support more complicated prompt patterns, like CoT, ToT, ReAct, etc. The reason is that it is difficult to apply the same attack strategies to those patterns. However, those advanced use cases are becoming more and more important, and I hope it will not take very long before a similar framework is released once people find this kind of testing necessary for them.
Generating high-quality NLP perturbations is very challenging. One way to make sure the newly generated sentences are not too ridiculous is to use a constraint function, as implemented in the underlying TextFooler library: it discards generated samples whose cosine similarity to the original sentence falls below a certain value. The other way to assess quality, as done in PromptBench, is human checking. The authors reported that over 70% of character- and word-level attacks were judged acceptable by humans. However, both the threshold and the standard of 'acceptable' are subjective, especially since character- and word-level attacks are the most impactful attack types. It's reasonable to ask whether a generated prompt has drifted too far from the original. This is one of the things we need to be aware of when using the library.
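The constraint idea can be sketched roughly as below. Note that TextFooler actually uses the Universal Sentence Encoder; here I substitute a sentence-transformers model, and the 0.8 threshold is an arbitrary illustration rather than the value used by TextFooler or PromptBench:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

def is_acceptable(original: str, perturbed: str, threshold: float = 0.8) -> bool:
    # Keep a perturbed prompt only if it stays semantically close to the original.
    emb = encoder.encode([original, perturbed], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

print(is_acceptable("Classify the sentiment of the sentence.",
                    "Clasify the sentimnt of the sentence."))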
Closing Words
In a proper software project, a new piece of software must go through a set of tests before being rolled out into production. It needs to pass unit tests to make sure every logical turn aligns with the requirements. It needs to go through an integration test to make sure the new function doesnβt break any contracts with other components. Finally, it needs to pass a regression test to make sure it is compatible with the old version and doesnβt introduce any unwanted changes.
ML projects do not strictly follow the same process because ML is more experimental than software projects. However, making things repeatable and measurable is essential if ML is to be widely applied.
In this sense, PromptBench is a good example. It's an important step toward making generative AI development an engineering process. My suggestions for using PromptBench are:
- Conduct benchmark testing when you choose an LLM, and before and after fine-tuning your model
- Develop and version control your own test dataset
- Keep records of your benchmark results
- Apply the insights to your model's fine-tuning and prompt design.
References
[1] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts, arxiv.org
[2] GitHub - microsoft/promptbench: A robustness evaluation framework for large language models on adversarial prompts, github.com
[3] PromptBench, a Hugging Face Space by March07, huggingface.co