Building Confidence in LLM Evaluation: My Experience Testing DeepEval on an Open Dataset
Author(s): Serj Smorodinsky
Originally published on Towards AI.
deepeval helped me uncover the real source of Beyonce's depression
Long Story Short
If you're looking to catch LLM hallucinations (output that introduces information not present in the input), you can use DeepEval's Faithfulness metric. I tested it on 100 samples from the SQuAD2 dataset and achieved 100% accuracy. This served as a crucial validation step, like a litmus test, before diving into more detailed analysis.
I wanted to find out how accurate framework-based LLM evaluation really is. In other words, I wanted to evaluate the LLM evaluation itself.
Join me on this journey to learn the ins and outs of LLM evaluation.
Content
- Hardships and metrics of evaluating text
- Why use deepeval
- What is deepeval
- How to use deepeval for detecting hallucinations?
- Evaluating deepeval (Rabbit hole)
- How to run evaluation on SQUAD2
- SQUAD2 evaluation results analysis
- Which metric to use?
- Summary
This post is a bit unique: it contains video streams as additional references.
Part of my passion is showing how the "sausage is made". I have links to my streaming sessions throughout the blog.
If you enjoy them, please subscribe to the YouTube channel and follow me on Twitch. That will grow my motivation to continue creating and releasing additional material.
Why use deepeval: the hardships of text evaluation
If you would rather watch a clip than read, here's the link. Watch out, I'm wearing a cap.
Generated text is hard to test and evaluate. Nowadays, generated text is often called LLM output, but that doesn't have to be the case. But I digress.
Generated text is hard to test and evaluate: unlike a classification task, where success is quantifiable, a good summary doesn't have clear properties to judge it by.
For example, if I'm writing a summary of Fellowship Of The Ring, should it contain the amazing Tom Bombadil character or not? Peter Jackson decided otherwise, deleting that book scene from his script. Is his script not a good summary? If the summary needs to be exhaustive or complete, then yes, the script fails at that.
But how does this omission affect the score? If only 1 fact was omitted and 99 others were kept, then the completeness (or summarization coverage) score should be 99/100. Now let's say the script had an alien invasion of Valinor. That does not align with the original book, hence it is a hallucination; in other words, the script is not faithful to the original text.
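Here is a toy sketch of the two scores I just described. It has nothing to do with deepeval's internals, and the facts and claims are made up for the example, but it shows that the two numbers answer different questions:
# Toy scoring of the hypothetical Fellowship script against the book.
book_facts = [f"book fact {i}" for i in range(99)] + ["Tom Bombadil appears"]
script_claims = [f"book fact {i}" for i in range(99)] + ["aliens invade Valinor"]

# Completeness / coverage: how much of the book survives in the script.
coverage = sum(f in script_claims for f in book_facts) / len(book_facts)         # 99/100

# Faithfulness: how many of the script's claims are supported by the book.
faithfulness = sum(c in book_facts for c in script_claims) / len(script_claims)  # 99/100

# Both happen to be 99/100, but for different reasons: the Tom Bombadil omission
# lowers coverage, while the invented alien invasion lowers faithfulness.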
If you wanted specific tests to catch these errors, how would you go about it? You would probably need to write a set of prompts, call an LLM, save the predictions, and go over them. Or go the manual route: annotate ground truth and check its distance from the predictions.
Well, it's your lucky day. Let's see how we can use the deepeval framework to test these metrics and others.
What is Deepeval
The deepeval team has implemented these metrics behind a unit-test-like interface that is really easy to use, hiding all of the messy details (you can always peek into the source and check the prompts and implementation).
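To give a feel for that unit-test flavour: at the time of writing, deepeval also exposes a pytest-style entry point (assert_test plus the deepeval test run CLI). Treat the snippet below as a sketch and check the current docs, since the API may have moved since this was written:
# test_llm_output.py, a sketch of deepeval's pytest-style usage
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_is_relevant():
    test_case = LLMTestCase(
        input="when did Beyonce start becoming popular?",
        actual_output="in the late 1990s",
    )
    # assert_test raises if any metric scores below its threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
You would then run it with deepeval test run test_llm_output.py, or keep using evaluate() directly as in the rest of this post.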
Example of a context and Q&A
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question: when did Beyonce start becoming popular?
Answer: in the late 1990s
Let's test this answer with Deepeval
Faithfulness metric documentation
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "in the late 1990s"

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ['''Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".''']

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="when did Beyonce start becoming popular?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

evaluate([test_case], [metric])
Code Breakdown
There are two main classes in this code: LLMTestCase and FaithfulnessMetric. Deepeval chose to decouple the inputs from the actual test in order to allow defining many different metrics based on the same input.
There are many other metrics to choose from; I will mention them later in this post.
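This decoupling is what makes it cheap to pile more checks onto the same test case. Here is a rough sketch, reusing a trimmed version of the Beyoncé context and using AnswerRelevancyMetric as a second built-in metric (swap in whichever metrics fit your task):
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Trimmed context, just for the sketch
retrieval_context = ["Beyoncé rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child."]

test_case = LLMTestCase(
    input="when did Beyonce start becoming popular?",
    actual_output="in the late 1990s",
    retrieval_context=retrieval_context
)

# One test case, several metrics, one evaluate() call
evaluate(
    [test_case],
    [
        FaithfulnessMetric(threshold=0.7, model="gpt-4"),
        AnswerRelevancyMetric(threshold=0.7, model="gpt-4"),
    ],
)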
How to use Deepeval to catch LLM hallucinations?
Here's the second installment of the saga; if you'd rather watch on YouTube, here's the link.
That's the use case I wanted to test during my fling with deepeval. I have several generative tasks that I want to test for hallucinations. My plan was to run deepeval metrics and calculate the hallucination rate of several models on the same inputs, which would allow me to choose the best model given the tradeoff between latency, price and accuracy. I wasn't sure which metric to use: the generative task is summarization, but summarization has many aspects that can be tested, so I knew there would be some experimentation needed to utilize the framework correctly for my use case.
But then it occurred to me that I need a baseline for deepeval's performance as well. I can't take its predictions at face value because, ultimately, they are just another LLM call. If it's just another model, I need to know how accurate it is. That means I need to evaluate the evaluator. This sounds like a rabbit hole.
Hacking evaluation by using an open dataset
The few times I'm proud of my work are when I succeed in "hacking" evaluation. What do I mean by hack? Finding the path of least resistance that leads to a quantitative conclusion.
I want to run a test that has high-quality output, without much annotation effort, and thus gain confidence that I'm on the right road.
Quick feedback is the road to success 🚀 (or the opposite re-framing of my favorite engineering principle: fail-fast)
I want a test that will help me optimize deepeval usage as well, because I'm not familiar with the metrics and there is a lot of potential customization to do. This is another reason for a good litmus test.
Here's the hack: I need to simulate my setup, which is a piece of context, a question and an LLM output (as the summary OR the answer). I could have annotated my own inputs; that would be the obvious route, and necessary no matter what. But I wanted something quicker, just to know that I'm headed in the right direction.
So I figured, let's take the next best thing: an open dataset with contexts, questions and answers, a.k.a. SQuAD2.
That's the source of my Beyonce example; it's the first row of that dataset.
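If you want to see where that row comes from, here is a quick peek (the Hugging Face squad_v2 rows carry id, title, context, question and answers fields, with answers holding lists of text and answer_start):
import datasets

ds = datasets.load_dataset("rajpurkar/squad_v2")
row = ds["train"][0]

print(row["context"][:120])    # the Beyoncé passage used above
print(row["question"])         # the question string
print(row["answers"]["text"])  # list of ground-truth answer strings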
How to run evaluation on SQUAD2?
In order to run the evaluation, we need to load the dataset and then run evaluation on each of its rows. Here's the code to do exactly that:
import datasets
import pandas as pd
from tqdm import tqdm

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

SQUAD_DATASET_CSV_PATH = "squad2_subset.csv"

def preprocess_squad2():
    # Download SQuAD2 once and persist the first ~1000 rows as a CSV,
    # so later runs don't need to reload the full dataset.
    ds = datasets.load_dataset("rajpurkar/squad_v2")
    iterator = ds["train"].iter(batch_size=1)
    rows = []
    for i, row in enumerate(tqdm(iterator)):
        if i > 1000:
            break
        answers = row["answers"][0]["text"]
        if not answers:  # SQuAD2 contains unanswerable questions; skip them
            continue
        rows.append({
            "context": row["context"][0],
            "id": row["id"][0],
            "answer": answers[0],
            "question": row["question"][0],
        })
    pd.DataFrame(rows).to_csv(SQUAD_DATASET_CSV_PATH, index=False)

def evaluate_n_rows(n=100):
    df = pd.read_csv(SQUAD_DATASET_CSV_PATH).head(n)
    # One metric instance is reused across all test cases.
    metric = FaithfulnessMetric(
        threshold=0.7,
        model="gpt-4o-mini",
        include_reason=True
    )
    test_cases = []
    for _, row in df.iterrows():
        test_cases.append(LLMTestCase(
            input=row["question"],
            actual_output=row["answer"],  # the ground-truth answer stands in for the LLM output
            retrieval_context=[row["context"]],
        ))
    evaluate(test_cases, [metric])

if __name__ == "__main__":
    preprocess_squad2()
    evaluate_n_rows()
Code breakdown:
- First I parse the SQuAD dataset, because I want to save it once as a .csv and then work on that, instead of reloading the dataset on every run, which just adds latency
- During evaluation we take the context, question and answer of each row and build the test case accordingly. We use only one metric (Faithfulness), so we create it once.
- This deepeval implementation uses parallel calls to the API so the run finishes faster, which is great and works out of the box
SQUAD2 evaluation results analysis
Usually I save predictions to a file in order to do error analysis based on the results.
But in this case, deepeval already implemented that for me, which is a really nice touch 🥰
The framework automatically creates a file named: .deepeval-cache.json
It has a very nice structure that allows you to easily find any test case that failed (or succeeded), plus a trace from the framework to help with error analysis.
Here's an example of one object from the list contained in .deepeval-cache.json:
"{\\"actual_output\\": \\"singing and dancing\\", \\"context\\": null, \\"expected_output\\": null, \\"hyperparameters\\": null, \\"input\\": \\"What areas did Beyonce compete in when she was growing up?\\", \\"retrieval_context\\": [\\"Beyonc\\\\u00e9 Giselle Knowles-Carter (/bi\\\\u02d0\\\\u02c8j\\\\u0252nse\\\\u026a/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyonc\\\\u00e9's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \\\\\\"Crazy in Love\\\\\\" and \\\\\\"Baby Boy\\\\\\".\\"]}": {
"cached_metrics_data": [
{
"metric_data": {
"name": "Faithfulness",
"threshold": 0.7,
"success": true,
"score": 1.0,
"reason": "The score is 1.00 because there are no contradictions, indicating a perfect alignment between the actual output and the retrieval context. Great job maintaining consistency!",
"strictMode": false,
"evaluationModel": "gpt-4o-mini",
"evaluationCost": 0,
"verboseLogs": "Truths:\\n[\\n \\"Beyonc\\u00e9 Giselle Knowles-Carter was born on September 4, 1981.\\",\\n \\"Beyonc\\u00e9 is an American singer, songwriter, record producer, and actress.\\",\\n \\"Beyonc\\u00e9 was born and raised in Houston, Texas.\\",\\n \\"Beyonc\\u00e9 performed in various singing and dancing competitions as a child.\\",\\n \\"Beyonc\\u00e9 rose to fame in the late 1990s as the lead singer of Destiny's Child.\\",\\n \\"Destiny's Child was managed by Mathew Knowles.\\",\\n \\"Destiny's Child became one of the world's best-selling girl groups of all time.\\",\\n \\"Beyonc\\u00e9's debut album is titled Dangerously in Love.\\",\\n \\"Dangerously in Love was released in 2003.\\",\\n \\"Dangerously in Love established Beyonc\\u00e9 as a solo artist worldwide.\\",\\n \\"Beyonc\\u00e9 earned five Grammy Awards for her work on Dangerously in Love.\\",\\n \\"Dangerously in Love featured the Billboard Hot 100 number-one single 'Crazy in Love'.\\",\\n \\"Dangerously in Love featured the Billboard Hot 100 number-one single 'Baby Boy'.\\"\\n] \\n \\nClaims:\\n[\\n \\"The text mentions singing.\\",\\n \\"The text mentions dancing.\\"\\n] \\n \\nVerdicts:\\n[\\n {\\n \\"verdict\\": \\"yes\\",\\n \\"reason\\": null\\n },\\n {\\n \\"verdict\\": \\"yes\\",\\n \\"reason\\": null\\n }\\n]"
},
"metric_configuration": {
"threshold": 0.7,
"evaluation_model": "gpt-4o-mini",
"strict_mode": false,
"include_reason": true
}
}
]
},
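To turn the cache into aggregate numbers, I only need to count the success flags. Here is a minimal sketch that walks the JSON recursively and collects every metric_data object, so it does not depend on the exact top-level layout (which may differ between deepeval versions):
import json

def collect_metric_data(node, found=None):
    # Gather every "metric_data" dict, whatever structure surrounds it.
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "metric_data" and isinstance(value, dict):
                found.append(value)
            else:
                collect_metric_data(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_metric_data(item, found)
    return found

with open(".deepeval-cache.json") as f:
    results = collect_metric_data(json.load(f))

passed = sum(1 for r in results if r.get("success"))
print(f"{passed}/{len(results)} cached test cases passed the Faithfulness threshold")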
Running this code on SQUAD2 produced 98% of cases judged faithful. The only case it found unfaithful was actually an annotation error. Isn't that always the dream in data science? You find the model disagreeing with the annotation, and it turns out to be an annotation error.
So essentially we have 100% agreement between the ground truth and deepeval's results, meaning I can start trusting this framework for my LLM evaluation.
Is the confidence 100%? No, there are still other things to test. We need to test the opposite case, that we do find hallucinations when they are present, but I will leave that for another time.
Before closing, I want to share my experience with 3 different metrics: Summarization, Hallucination and Faithfulness. Eventually I used only 1 of them, but the lesson I learned is worth telling.
Which metric to choose?
I imagine that this crossroads will be visited by many. Out of the 11 default metrics, I really didn't know which one was perfect for me.
This, of course, depends on the generative task and the metrics you care about most.
Summarization Metric
I started out by trying the Summarization metric. This seemed very natural because the generative task I'm interested in is summarizing a customer service conversation. But, as we know, summarization means different things to different people. The task is accompanied by different questions whose answers create an opinionated summary. It all screams summarization, right?
But when using this metric, I didn't find that it correlated with my intuition about my task.
Deepeval's Summarization implementation creates a set of facts from the context and then checks whether those facts are answered by the LLM output; basically, it tests how much of the context is covered by the summary.
In my case, because I have several questions and I wanted to test each one separately, none of them by itself covers the context. So when running the Summarization metric I got a very low score, and I decided to drop it. The alignment between deepeval's output and the SQUAD2 ground truth was very low.
Here's a great explanation of deepeval's summarization implementation.
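The Beyoncé cache entry shown earlier makes the mismatch easy to see with back-of-the-envelope numbers (this is an illustration of the idea, not deepeval's exact formula): the context yields 13 extracted truths, while the short ground-truth answer produces only 2 claims.
# Rough numbers taken from the verboseLogs of the earlier cache entry.
context_facts = 13     # "truths" extracted from the Beyoncé passage
answer_claims = 2      # claims extracted from "singing and dancing"
facts_covered = 1      # assume the answer only touches the competitions fact
claims_supported = 2   # both of its claims are backed by the context

coverage_style = facts_covered / context_facts     # ~0.08, far below a 0.7 threshold
faithfulness = claims_supported / answer_claims    # 1.0, passes comfortably
A short but correct answer gets punished by a coverage-style score and rewarded by a faithfulness-style one.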
Hallucination Metric
At face value, if I want to test hallucinations, this should be my go-to metric in deepeval.
I did get higher alignment than with Summarization: 73%. It was a step in the right direction.
Here's an example of a failure:
"{\\"actual_output\\": \\"Jay Z\\", \\"context\\": [\\"Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits \\\\\\"D\\\\u00e9j\\\\u00e0 Vu\\\\\\", \\\\\\"Irreplaceable\\\\\\", and \\\\\\"Beautiful Liar\\\\\\". Beyonc\\\\u00e9 also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for \\\\\\"Single Ladies (Put a Ring on It)\\\\\\". Beyonc\\\\u00e9 took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyonc\\\\u00e9 (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.\\"], \\"expected_output\\": null, \\"hyperparameters\\": null, \\"input\\": \\"Which artist did Beyonce marry?\\", \\"retrieval_context\\": [\\"Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits \\\\\\"D\\\\u00e9j\\\\u00e0 Vu\\\\\\", \\\\\\"Irreplaceable\\\\\\", and \\\\\\"Beautiful Liar\\\\\\". Beyonc\\\\u00e9 also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for \\\\\\"Single Ladies (Put a Ring on It)\\\\\\". Beyonc\\\\u00e9 took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyonc\\\\u00e9 (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.\\"]}": {
"cached_metrics_data": [
{
"metric_data": {
"name": "Hallucination",
"threshold": 0.7,
"success": false,
"score": 1.0,
"reason": "The score is 1.00 because the actual output fails to align with the context by not providing any relevant information regarding Beyonc\\u00e9's career, albums, or achievements.",
"strictMode": false,
"evaluationModel": "gpt-4o-mini",
"evaluationCost": 0,
"verboseLogs": "Verdicts:\\n[\\n {\\n \\"verdict\\": \\"no\\",\\n \\"reason\\": \\"The actual output does not provide any information relevant to the context regarding Beyonc\\\\u00e9's career, albums, or achievements, and thus does not agree with the context.\\"\\n }\\n]"
},
"metric_configuration": {
"threshold": 0.7,
"evaluation_model": "gpt-4o-mini",
"strict_mode": false,
"include_reason": true
}
}
]
}
Reason: The score is 1.00 because the actual output fails to align with the context by not providing any relevant information regarding Beyoncé's career, albums, or achievements.
The reason part is lovely; it really helps me understand the decision process. Even though the hallucination documentation seems to be focused on checking contradictions between the context and the output, in this case it actually checks some sort of completeness.
Whenever the answer (which here is actually the ground truth) was too short, the Hallucination metric failed it.
Maybe this is exactly what deepeval intended, but to me it seems incorrect.
My best guess is that, under the hood, facts are created from the context and then checked for contradiction against the answer. But when creating the facts there are many details, and a short answer might not be the best fit. Basically, the "question" is disregarded.
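To make that guess concrete, here is a pseudocode-flavoured sketch of the two directions of checking. This is my mental model, not deepeval's actual implementation, and agrees() is a placeholder for an LLM judgment call:
def agrees(statement: str, evidence: str) -> bool:
    """Placeholder for an LLM call: does the evidence agree with the statement?"""
    raise NotImplementedError

def hallucination_style(context_facts: list[str], answer: str) -> float:
    # Context-centric: check every fact pulled from the context against the answer.
    # A short answer simply doesn't mention most facts, so many verdicts come back "no".
    contradicted = sum(not agrees(fact, answer) for fact in context_facts)
    return contradicted / len(context_facts)   # higher = more "hallucinated"

def faithfulness_style(answer_claims: list[str], context: str) -> float:
    # Claim-centric: check every claim the answer makes against the context.
    # A short but correct answer makes few claims, and all of them are supported.
    supported = sum(agrees(claim, context) for claim in answer_claims)
    return supported / len(answer_claims)      # higher = more faithful
Note that the question never enters either computation, which is consistent with short ground-truth answers failing the Hallucination metric.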
I knew we could find something better. That's when I tried Faithfulness.
Faithfulness Metric
I almost went down the road of a custom metric implementation, but I decided to test the Faithfulness metric first.
And it worked like a charm 🎉
Here's the documentation: the score is the number of truthful claims divided by the total number of claims. Let's compare it with Hallucination for a moment. This metric had 100% alignment with the data! At last! That's the metric I can start using in my internal tests.
What's the difference between a contradicted context and a false claim? I'm still not sure; I need to go deeper. One guess is that a claim puts more emphasis on the answer, whereas context checks put more emphasis on the context. I'm currently investigating this and will report once I fully understand it, so you will hear more about this.
Summary
I hope I have successfully conveyed that you have to be very thoughtful about LLM evaluation. There are many ways to do it, and deepeval is only one of them, but it does seem very easy to use, with a great programmatic interface. As always, test it with your own data in order to gain confidence.
What's next? Probably training LLMs!
Links
- You can start a conversation with me on my Hacking AI Discord
- Hop over to my LinkedIn page
- Or subscribe to my YouTube channel to watch me fumble my API keys!
- Or just follow me on Twitch if you loved my memes