

Prompt Robustness: How to Measure and How to Enhance


Last Updated on November 5, 2023 by Editorial Team

Author(s): Kelvin Lu

Originally published on Towards AI.

Photo by Antonio Sokic on Unsplash

We noticed in our practice that LLMs are sensitive to the subtle details of the prompt, and even small changes can lead to noticeable differences in the output. This makes it important to evaluate the prompt robustness of an LLM before using it in production. In this article, I’m going to introduce how to measure prompt robustness and how to use it as part of the LLM engineering process based on a recently developed framework known as PromptBench.


· Why Prompts Are Vulnerable
· PromptBench Introduction
· The Interesting Findings
Not all LLMs are created equal
Vulnerable to Lower-level Attacks
Term Frequency Relevancy
Abracadabra
· Limitations
· Closing Words
· References

Why Prompts Are Vulnerable

When generative LLMs are trained, they learn patterns from the input and use those patterns to predict the next token. They then use the input string plus the newly generated output to predict the token after that, iterating until the output is complete. This structure is called autoregressive because it always predicts the next output in a sequence based on the previous outputs in the sequence. The mechanism is very similar to the autoregressors often used in time series analysis, which predict future values of a time series variable based on its past values.
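The loop described above can be sketched in a few lines. Here `next_token` is a toy stand-in for a real model's prediction step, just to make the feedback structure visible:

```python
def next_token(context):
    # Stand-in for a real LM head: deterministically "predicts" the next
    # token from the running context. A real model would instead sample
    # from (or take the argmax of) a distribution over its vocabulary.
    vocab = ["the", "cat", "sat", "<eos>"]
    return vocab[len(context) % len(vocab)]

def generate(prompt_tokens, max_new_tokens=10, eos="<eos>"):
    # Autoregressive decoding: each step feeds the prompt PLUS all
    # previously generated tokens back into the model.
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == eos:
            break
        context.append(token)
    return context[len(prompt_tokens):]

print(generate(["hello"]))  # -> ['cat', 'sat']
```

The key point is that the prompt is simply the initial state of `context`: every token in it conditions every subsequent prediction, which is why small prompt changes can ripple through the whole output.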

Despite the fact that LLMs display impressive language understanding and generation capabilities, their learning is based on statistical patterns, not true knowledge extraction and reasoning as humans do. As such, an LLM's behavior is highly shaped by its training data: the model is most familiar with the patterns present in that data. As a result, any data deviation at prediction time may cause a more or less significant performance drop. What do we call this type of problem? Overfitting!

It has been reported that discarding low-quality training data can actually improve a model's performance. But that kind of data selection can also introduce selection bias, in that the dominant language patterns become even stronger. Potentially, the overfitting gets worse if it isn't specifically addressed.

Computer-vision models suffer from the same problem: they are sensitive to unexpected details in the training materials as well. For example, a model trained to detect cows may pay too much attention to the grass and clouds around them. One way to address this when training CV models is data augmentation: altering the training images slightly to create a new dataset by tilting, stretching, cropping, adding random noise, masking, etc. However, these tricks are not commonly used in LLM training. The main reason is that, compared to CV tasks, NLP tasks already have far larger amounts of human-generated training material, and computational resources are already a bottleneck for LLM training. Generating high-quality augmented text is also not easy. That's why NLP data augmentation is not a priority today.
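For contrast, text augmentation of the kind just described can be sketched as a simple random perturbation. This is purely illustrative and not any particular library's API:

```python
import random

def augment_text(text, swap_prob=0.1, seed=None):
    # Character-level augmentation: randomly swap adjacent characters to
    # simulate typos, loosely analogous to tilting or noising images in CV.
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip ahead so the same character is not swapped twice
        else:
            i += 1
    return "".join(chars)

print(augment_text("classify the sentiment", swap_prob=0.3, seed=0))
```

A real augmentation pipeline would also need quality controls so the perturbed text stays plausible, which is exactly the hard part that makes NLP augmentation expensive.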

The problem that LLMs do not always follow prompts accurately is just one symptom of their imperfect training process. Because of this, we must have a way to evaluate an LLM's prompt robustness, i.e., how stably the LLM performs across different variations of instructions.

Making sure LLMs strictly follow prompts has been a widespread concern, and people are looking at the problem from different angles. Some studies center on organizing prompts so LLMs understand them better. For example, in few-shot learning, researchers noticed that LLMs put more weight on the last example, and that common terms across the examples have a higher impact on the model. They are researching how to design prompt templates and organize the examples so the prompt works better.

Some other research is about safety and security: for instance, how to make sure LLMs don't disclose sensitive information or produce unwanted outputs. This is an interesting topic in its own right and will be discussed separately.

Last but not least: when an LLM gets an imperfect prompt, how badly does its performance drop? This is a very practical problem. When we choose an LLM to build on, we need to know how it performs with less-than-ideal prompts. When we fine-tune a model, we need to know how the new model compares to the foundation model. Further, we want to learn how to improve our training process to reduce the performance drop, and to enhance our prompts with the know-how gained from model comparison.

PromptBench Introduction

The simplest way to evaluate LLM prompt robustness is to produce your own test prompt set manually and run it against the models. While this is a quick fix, you can never know whether the test set is representative enough, and you can't get an objective indicator of the model's performance.

PromptBench [2] offers a plausible solution on this track.

PromptBench (from the official website)

Compared to the manual approach, PromptBench's solution is systematic and extensible. It can generate prompt attacks (prompt variations) based on four strategies:

  • Character-level: PromptBench can use TextBugger and DeepWordBug to manipulate texts by introducing typos or errors to words, e.g., by adding, deleting, repeating, replacing, and permuting characters for certain words.
  • Word-level: Use BertAttack and TextFooler to replace words with synonyms or contextually similar words to deceive LLMs.
  • Sentence-level: Use StressTest and CheckList to append irrelevant or extraneous sentences to the end of prompts, intending to distract LLMs.
  • Semantic-level: Simulate different linguistic expressions with the same meaning by translating from six common languages (Chinese, French, Arabic, Spanish, Japanese, and Korean) into English, introducing linguistic nuances and variations that could potentially impact LLMs.
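A sentence-level attack in the StressTest style can be mimicked by appending a tautological distractor to the prompt. The distractor string below is one of the examples discussed later in this article; the function itself is my own sketch, not PromptBench's API:

```python
def stress_test_attack(prompt, distractor="and true is true"):
    # Sentence-level attack: append an irrelevant but grammatical clause
    # to the end of the prompt, intending to distract the model without
    # changing the task's actual meaning.
    return f"{prompt.rstrip('. ')} {distractor}."

clean = "Classify the sentiment of the following review."
print(stress_test_attack(clean))
# -> Classify the sentiment of the following review and true is true.
```

Comparing the model's score on the clean and attacked prompts then gives a direct measurement of how much the distractor hurts (or, surprisingly, helps).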

It was designed to process a combination of four types of prompts: zero-shot role-based, few-shot role-based, zero-shot task-based, and few-shot task-based prompts. More complicated prompt patterns like CoT, ReAct, etc. are not supported.

When we use PromptBench, we need to provide a labelled dataset, a task, and a model for the utility to run against. PromptBench already supports a list of models and datasets, including out-of-the-box support for ChatGPT and GPT-4. Users can easily extend the framework to their own models and datasets as well.

One of the great things about PromptBench is that it provides a unified evaluation metric, so we can compare results on different tasks, using different datasets, against different models. The metric, in plain English, is the average performance drop rate:

Average Performance Drop Rate (from paper [1])
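On my reading of the paper, the Performance Drop Rate (PDR) for a single prompt is one minus the ratio of the attacked-prompt score to the clean-prompt score, and the APDR averages that over the prompt set. A minimal sketch of that reading (the real framework computes the scores by actually running the model on the dataset):

```python
def pdr(clean_score, attacked_score):
    # Performance Drop Rate for one prompt: the fraction of the
    # clean-prompt score lost under the attack. Negative values mean
    # the attack accidentally *improved* performance.
    return 1.0 - attacked_score / clean_score

def apdr(results):
    # Average PDR over (clean_score, attacked_score) pairs, one pair
    # per prompt in the evaluation set.
    return sum(pdr(c, a) for c, a in results) / len(results)

# Two prompts: one drops from 0.8 to 0.6 accuracy, one is unaffected.
print(apdr([(0.8, 0.6), (0.9, 0.9)]))  # -> 0.125
```

Because PDR is a ratio rather than an absolute difference, a drop from 0.8 to 0.6 counts the same as a drop from 0.4 to 0.3, which is what makes results comparable across tasks of different difficulty.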

The framework can be found in its Git repo [2]. At the moment, it can only be installed via Conda; pip install is not supported yet.

# First, clone the repo:
git clone git@github.com:microsoft/promptbench.git

# The environment can be set up with Conda, using the provided environment.yml file:
conda env create -f environment.yml

Running an attack is simple:

# For the GLUE datasets
python main.py --model google/flan-t5-large \
    --dataset mnli \
    --attack textfooler \
    --shot 0 \
    --generate_len 20

# For the MMLU, SQuAD V2, IWSLT, UN Multi, and Math datasets
python main.py --model google/flan-t5-large \
    --dataset mmlu \
    --attack semantic \
    --shot 0 \
    --generate_len 20

The Interesting Findings

Not all LLMs are created equal

Some models are more sensitive to prompt attacks than others. In the paper's benchmark, the UL2 model is the most robust to prompt attacks, and T5 is also not bad. ChatGPT comes third. Vicuna is the worst model in the benchmark.

The difference in PDR between LLMs is a good reason to run PromptBench testing before deciding which LLM to use. It also suggests we should investigate better methods for model training and fine-tuning: if UL2 performs much better than ChatGPT, its team must have done something right; if Vicuna performs much worse, there must be lessons to learn from its training and data preparation processes.

Vulnerable to Lower-level Attacks

The second interesting finding is that the models are more vulnerable to word- and character-level attacks. The word-level attacks simulate the impact of using synonyms, while the character-level attacks simulate the impact of typos and non-standard wording.

According to the paper, word-level attacks result in a 33% to 35% performance drop on average, and character-level attacks cause a 20% to 23% drop. These two attack types are closest to everyday typos and word misuse, so the countermeasure is simple: correct typos and uncommon phrasing in your prompts, and keep them concise and clear.

Term Frequency Relevancy

The authors also noticed that certain terms appear more frequently in unstable prompts, while others are more common in robust ones. Prompts containing 'acting', 'answering', 'detection', and 'provided' suffer less performance drop; in other words, prompts with these words are more robust. Vulnerable prompts often contain words like 'respond', 'following', and 'examine'. This finding can help us write better prompts by preferring the more stable vocabulary.

Abracadabra

Interestingly, the researchers noticed that adding meaningless strings to the prompt can have either a positive or negative impact on its performance.

For instance, introducing the random sequence 'LKF0FZxMZ4' during a CheckList attack distracted the model and reduced its focus on the critical segments of the prompt, hurting its performance.

On the other hand, adding an irrelevant sequence such as 'and true is true' intensifies the LLM's focus on the significant words, leading to a better prediction.

My understanding is that this reveals the importance of domain-relevant fine-tuning. Many domain applications use specific vocabularies, and that kind of information only makes sense to a model after fine-tuning; before that, domain terms look more like random strings in the LLM's eyes.

I haven't found much research on the impact of adding random sequences to prompts, apart from some efforts to enhance prompt performance by permutation. I'm more concerned about the risks this finding implies, because the fact that adding a random string can sway a model's prediction is scary. The authors didn't report that the added random strings completely changed the results; in fact, the average performance drop was small compared to character- and word-level attacks. But, who knows, perhaps this behaviour could become the zero-day bug of LLMs.


Limitations

One of the limitations of PromptBench is that it doesn't support more complicated prompt patterns, like CoT, ToT, ReAct, etc. The reason is that it is difficult to apply the same attack strategies to those patterns. However, those advanced use cases are becoming more and more important, and I hope it won't take long before a similar framework covers them.

Generating high-quality NLP permutations is very challenging. One way to ensure the generated sentences are not too ridiculous is a constraint function, as implemented in the underlying TextFooler library: it discards generated samples when the cosine similarity between the sample and the original sentence falls below a certain value. The other way to assess quality, as done in PromptBench, is human checking. The authors reported that over 70% of character- and word-level attacks were judged human-acceptable. However, both the similarity threshold and the standard of 'acceptable' are subjective, especially since character- and word-level attacks are the most impactful attack types. It's reasonable to ask: has the generated prompt drifted too far from the original? This is one of the things to be aware of when using the library.
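The constraint idea can be sketched as follows. The embedding vectors here stand in for whatever sentence encoder the attack library uses, and the 0.84 threshold is illustrative, not a value taken from the library:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction, 0.0 means orthogonal (unrelated).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def keep_sample(orig_vec, candidate_vec, min_similarity=0.84):
    # Discard a generated attack sample if its embedding has drifted
    # too far from the original sentence's embedding.
    return cosine_similarity(orig_vec, candidate_vec) >= min_similarity

print(keep_sample([1.0, 0.0], [1.0, 0.1]))  # near-identical direction: True
print(keep_sample([1.0, 0.0], [0.0, 1.0]))  # orthogonal: False
```

The weakness noted above lives entirely in `min_similarity`: set it too low and the "attack" becomes a different task; set it too high and few interesting perturbations survive.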

Closing Words

In a proper software project, a new piece of software must go through a set of tests before being rolled out into production. It needs to pass unit tests to make sure every logical turn aligns with the requirements. It needs to go through an integration test to make sure the new function doesn’t break any contracts with other components. Finally, it needs to pass a regression test to make sure it is compatible with the old version and doesn’t introduce any unwanted changes.

ML projects do not strictly follow the same process because ML is more experimental than software projects. However, making things repeatable and measurable is essential to making ML widely applied.

In this sense, PromptBench is a good example. It's an important step toward making generative AI development an engineering process. My suggestions for using PromptBench are:

  • Conduct benchmark testing when you choose an LLM, and again before and after fine-tuning your model
  • Develop and version-control your own test dataset
  • Keep records of your benchmark results
  • Apply the insights to your fine-tuning and prompt design


References

[1] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

[2] GitHub – microsoft/promptbench: A robustness evaluation framework for large language models on adversarial prompts

[3] PromptBench – a Hugging Face Space by March07


