

Visual Walkthrough for Vectorized BERTScore to Evaluate Text Generation

Last Updated on November 5, 2023 by Editorial Team

Author(s): David R. Winer

Originally published on Towards AI.

BERTScore visual walkthrough

AI-based text generation has clearly entered the mainstream. From automated writing assistants to legal document generation, marketing content, email writing, and more, there’s no shortage of commercial use cases where transformer models are succeeding. Companies now have an abundance (perhaps an overabundance) of models and training paradigms to choose from. However, evaluating which model to choose shouldn’t be left to anecdotal reporting. The evaluation of a model should be set up as an experiment in which stakeholders agree on a rubric of criteria that matter for the use case.

A grading rubric for text generation includes <X, Y> examples; i.e., given this input, produce this output, where X and Y are important coverage cases for the use case. But this kind of criterion can be overly prescriptive, meaning that the automated evaluation shouldn’t demand that the output for X exactly match Y. Instead, what matters is that the model output Y’ means the same thing as Y, even if it does not match the exact phrasing or token usage.

BERTScore was created to handle this kind of inexact rubric criterion. The main idea of BERTScore is to take a language model that is good at understanding text, like BERT, and use it to evaluate the similarity between two sentences: a Y in your test set and a Y’ representing the model-generated text. BERTScore computes a similarity score on the basis of the token embeddings as a proxy for human-evaluated similarity. Back in 2020, when BERTScore was published at ICLR, it was not yet commonplace to use BERT itself as the measurement for text generation. The leading methods were BLEU, which uses exact string matching, and METEOR, which uses heuristics to match text, and both had well-known issues. Meanwhile, the BERTScore evaluation continues to be relevant because you can plug in your favorite flavor from among the many varieties of BERT that are now available.

Computing BERTScore can be done in a completely vectorized way, so you can compare a batch of model-generated candidate sentences (Y hats) against their ground-truth references (Y). This is efficient because it can leverage GPU kernels for parallelization, and it also makes it easy to use BERTScore itself as a loss function, such as for fine-tuning text generation models, i.e., as an alternative to cross-entropy loss over next-token prediction. This way, your loss function actually aligns with meaning instead of exact text.

In this article, I will walk through how to compute BERTScore in a vectorized way. First, we’ll start with the two-sentence case, computing similarity between a single pair of sentences. Then we’ll scale up to batch-based sentence comparisons like you might have in a training objective.

Visualization Details

The compute steps will be shown visually using a node graph visualization tool. Each block is an operation that takes the inputs on the left side and produces the data for the output variables on the right side. Links denote the passing of data from outputs to inputs, and circles on inputs mean the data is specified in place and is static.

Operations are either composite, containing an “unbox” icon, which decomposes into a sub-graph whose inputs are the parent’s inputs and whose outputs are the parent’s outputs, or primitive, meaning they cannot be decomposed further and correspond to low-level tensor operations like those in NumPy or TensorFlow. Colors indicate data type and patterns indicate data shape. Blue means the data type is an integer, whereas purple/pink means it’s a decimal data type. Solid links indicate that the data shape is scalar, whereas dots in the link indicate the number of dimensions of the array (the number of dots between the dashes). At the bottom of each graph is a table that characterizes the shape, type, and operation name of each variable that is carrying data in the model.

I’ve covered and used the visualization in previous posts, such as creating a reference map for GPT Fully Visualized and BERT Fully Visualized, and for walkthroughs about Graph Attention Networks and the LoRA fine-tuning method.


Here we walk through the BERTScore compute steps for the following formula:
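From the original paper, the recall variant with IDF importance weighting is:

```latex
R_{\mathrm{BERT}} =
\frac{\sum_{x_i \in x} \mathrm{idf}(x_i)\,
      \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j}
     {\sum_{x_i \in x} \mathrm{idf}(x_i)}
```

Here x is the reference sentence, x̂ is the candidate sentence, and the token embeddings are pre-normalized so that the inner product equals cosine similarity.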

The figure below, from the paper, shows the rough outline of the steps. Given two sentences, a reference sentence and a candidate sentence, compute pairwise similarity between their BERT embeddings. Then, for each reference token, take the maximum similarity across the candidate-token axis. Finally, take a dot product with the importance weighting (IDF weights) and divide by the sum of the IDF weights.

In the original paper, there is both a Recall BERTScore (above) and a Precision BERTScore, which are combined into an F-BERTScore. The only difference between them is the axis along which the maximum similarity is taken.

The IDF importance weighting is calculated as follows:

Given a dataset of M reference sentences, count the number of sentences in which each token appears, divide by M, and use the result as the argument of a negative natural log.
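In symbols, with 1[·] as an indicator function over the M reference sentences x⁽¹⁾, …, x⁽ᴹ⁾:

```latex
\mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\!\left[w \in x^{(i)}\right]
```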

Below, we’re looking at a block that computes the importance weighting for each token. As an example, I’m passing in a tiny 40-sentence subset of the Stanford Sentiment Treebank (SST) dataset. The output is a [30523] 1D array corresponding to the size of the vocab. This step only needs to be computed once, during preprocessing.

Clicking on the unbox icon of the formula block, we can see the full compute graph for the IDF importance calculation. I use a BERT WordPiece tokenizer to generate IDs for each token of each sentence, generating a [40, 30523] array. After summing along the 40 axis, we divide the [30523] array by a [30523] array whose values are all 40 (i.e., M=40 in the formula).
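The IDF sub-graph can be sketched in a few lines of NumPy. The tiny vocab and whitespace tokenizer below are stand-ins for the real WordPiece tokenizer and ~30k-entry vocab:

```python
import numpy as np

# Stand-ins for the real pieces: a tiny vocab and a whitespace
# tokenizer instead of BERT's WordPiece tokenizer and ~30k vocab.
vocab = {"the": 0, "weather": 1, "is": 2, "cold": 3, "today": 4,
         "it": 5, "freezing": 6}

sentences = ["the weather is cold today",
             "it is freezing today"]
M = len(sentences)  # number of reference sentences (40 in the article)

# [M, vocab_size] one-hot presence array: 1 if the token occurs
# anywhere in the sentence, 0 otherwise.
one_hot = np.zeros((M, len(vocab)))
for i, sent in enumerate(sentences):
    for tok in sent.split():
        one_hot[i, vocab[tok]] = 1.0

# Sum along the sentence axis, divide by M, take the negative log.
doc_freq = one_hot.sum(axis=0)   # [vocab_size]
idf = -np.log(doc_freq / M)      # [vocab_size] importance weights
```

Here `idf[vocab["is"]]` comes out to 0 because “is” appears in every sentence, while rarer tokens get larger weights.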

For more context about the “Get Tokenized 1 Hot”, we can see that it starts with some text cleaning, adding special tokens like [CLS] token and [PAD], and then the WordPiece tokenizer reads from a Vocab file.

Next, we’ll pass the importance weighting into the R_BERT calculation. The reference sentence is “the weather is cold today” and the candidate sentence is “it is freezing today”, matching the example used in the original paper’s infographic. The output is a single scalar decimal value, in this case 0.7066, corresponding to the BERTScore.

Looking inside the BERTScore operation, we’ll break down how the calculation is done. First, we simply run our two sentences through BERT. Typically, BERT returns the Last Hidden State, a tensor shaped [num_tokens, 768], and the Pooler Output, shaped [768]. Since our importance weighting is based on the vocab IDs, I altered the BERT graph to also return the Input IDs for the sentences.

For more details about what the BERT graph looks like, check out my previous post.

Scrolling to the right, we use cosine similarity as the measure of similarity between embeddings. The inputs to the cosine similarity operation are the last-hidden-state rows of the BERT output, each corresponding to an input sentence. After generating the pairwise token similarities, we take the maximum over the Y axis, the axis corresponding to our candidate sentence’s tokens (versus the reference sentence’s tokens). We multiply by the importance weights and then divide by the sum of the importance scores.
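As a minimal numeric sketch of the max-then-weight steps (the similarity matrix and IDF weights below are made up for illustration):

```python
import numpy as np

# Made-up pairwise similarity matrix: 3 reference tokens (rows) by
# 2 candidate tokens (columns), plus made-up IDF weights.
sim = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.5, 0.4]])
idf = np.array([1.0, 2.0, 1.0])

max_sim = sim.max(axis=1)                    # max over the candidate-token axis
r_bert = (idf * max_sim).sum() / idf.sum()   # IDF-weighted average
```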

Given arrays A and B, the cosine similarity is calculated as follows:

Cosine similarity formula

We can run this as a vectorized compute step, shown below. First, we take the dot product of A and B transposed to produce a [num_tokens, num_tokens]-shaped numerator. Second, we compute the norm of each row of A and B by summing the squares along the hidden dimension and taking the square root, each producing a [num_tokens, 1]-shaped output (by not dropping the dimension during the sum). We multiply these to produce a [num_tokens, num_tokens] denominator and divide our numerator by it.
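A NumPy sketch of this vectorized cosine similarity (the function name is mine):

```python
import numpy as np

def cosine_sim_matrix(A, B):
    """Pairwise cosine similarity between the rows of A and B.
    A: [num_tokens_a, hidden], B: [num_tokens_b, hidden]."""
    numerator = A @ B.T                                      # [a, b]
    # Row norms, keeping the summed dimension so broadcasting works.
    norm_a = np.sqrt((A ** 2).sum(axis=-1, keepdims=True))   # [a, 1]
    norm_b = np.sqrt((B ** 2).sum(axis=-1, keepdims=True))   # [b, 1]
    return numerator / (norm_a * norm_b.T)                   # [a, b]
```

Rows pointing in the same direction yield a similarity of 1.0, orthogonal rows yield 0.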

Batch-wise BERTScore

Thus far, we’ve calculated the Recall BERTScore between two sentences. I called this a 2D BERTScore because the compute steps operate over [num_tokens, hidden_dimension] context embeddings. Now, we’ll scale this up to calculate the BERTScore for a batch of sentences at a time. I call this batch BERTScore because we are operating over a batch of sentences, using tensors with shape [num_sentences, num_tokens, hidden_dimensions] in our cosine similarity step. Doing this in a completely vectorized way (without explicit loops) can give us performance gains by enabling GPU parallelization.

To demonstrate, we’ll submit a batch of 4 pairs of <Reference, Candidate> sentences. The first pair is our original, and the last pair is actually identical, to test that we get a 1.0 similarity.

The result is a [4] array with a float for each pairwise similarity comparison. Indeed, the 4th item has a 1.0 similarity BERTScore, as expected because they were the same sentence.

If you set up the cosine similarity operation correctly (as shown earlier), there’s no edit needed to extend this for batch-wise cosine similarity. The result is [num_sent, num_tokens, num_tokens], corresponding to the pair-wise token similarities for each pair of sentences. Then we take a maximum over the last dimension and keep the dimension to produce a [num_sent, num_tokens, 1] shaped tensor called “maximum result”.
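The batched version can be sketched with NumPy’s broadcasting; random embeddings stand in for BERT’s last hidden states, and comparing each “sentence” with itself should yield 1.0 maxima, mirroring the identical-pair check above:

```python
import numpy as np

def batched_cosine_sim(X, Y):
    """X: [num_sent, tx, hidden], Y: [num_sent, ty, hidden]
    -> [num_sent, tx, ty] pairwise token similarities per sentence pair."""
    numerator = np.matmul(X, Y.transpose(0, 2, 1))
    norm_x = np.sqrt((X ** 2).sum(axis=-1, keepdims=True))   # [s, tx, 1]
    norm_y = np.sqrt((Y ** 2).sum(axis=-1, keepdims=True))   # [s, ty, 1]
    return numerator / (norm_x * norm_y.transpose(0, 2, 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5, 8))     # [num_sent, num_tokens, hidden]
sim = batched_cosine_sim(X, X)     # [4, 5, 5]
# Max over the last axis, keeping the dimension, as in the article.
max_result = sim.max(axis=-1, keepdims=True)   # [4, 5, 1]
```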

Now we multiply by the importance scores. We reshape the importance scores to have shape [num_sent, 1, num_tokens] so that we can now produce a [num_sent, 1, 1] multiplied result. Ultimately, this is what we want, as we now essentially have 1 score per sentence. The final step is to divide by the sum over the importance scores. You can do this before or after removing the trailing 1 dimensions.
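A NumPy sketch of this final weighting and division, with random stand-in values for the maximum results and IDF scores:

```python
import numpy as np

num_sent, num_tokens = 4, 5
rng = np.random.default_rng(0)
# Stand-in inputs for the two earlier outputs:
max_result = rng.uniform(size=(num_sent, num_tokens, 1))  # max similarities
idf = rng.uniform(0.1, 1.0, size=(num_sent, num_tokens))  # per-token weights

# Reshape the weights to [num_sent, 1, num_tokens] and batch-matmul
# against [num_sent, num_tokens, 1] to get one score per sentence.
weights = idf.reshape(num_sent, 1, num_tokens)
weighted = np.matmul(weights, max_result)                 # [num_sent, 1, 1]
scores = weighted.reshape(num_sent) / idf.sum(axis=1)     # [num_sent]
```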


This visual walkthrough stepped through the main formula for BERTScore. Specifically, we covered the recall BERTScore with importance weighting, but it should be straightforward to combine it with the precision BERTScore and do the rescaling that is recommended in the paper. The recall BERTScore project and the cosine similarity operation are both available on GitHub.

BERTScores can be used for producing metrics on a grading rubric for text generation so that model-generated text can be compared to ideal text on the basis of its meaning rather than on exact text matching. This was shown to better resemble how humans evaluate similarity compared to previous methods like BLEU and METEOR.

Another intriguing idea here is to use something like BERTScore as a training-objective loss function. For example, if you’re fine-tuning a GPT-style model for text generation, you typically have some kind of task-aligned test set, or you are using next-token prediction directly as the training objective. With BERTScore, you would not penalize the model for producing text with the same meaning as the labels. The downside is that BERTScore would be slow if every model output had to be run through BERT, so some modifications would be needed.

What do you think? Was this visual walkthrough helpful? What would you want to see next? Let me know in the comments!

