Building Confidence in LLM Evaluation: My Experience Testing DeepEval on an Open Dataset
Author(s): Serj Smorodinsky
Originally published on Towards AI.
deepeval helped me uncover the real source of Beyonce's depression
Long Story Short
If you're looking to catch LLM hallucinations (output that introduces information not present in the input), you can use DeepEval's Faithfulness metric. I tested it on 100 samples from the SQuAD2 dataset and achieved 100% accuracy. This served as a crucial validation step, like a litmus test, before diving into more detailed analysis.
I wanted to find out how accurate framework-based LLM evaluation really is. In other words, I wanted to evaluate the LLM evaluation itself.
Join me on this journey to learn the ins and outs of LLM evaluation.
Content
- Hardships and metrics of evaluating text
- Why use deepeval
- What is deepeval
- How to use deepeval for detecting hallucinations?
- Evaluating deepeval (Rabbit hole)
- How to run evaluation on SQUAD2
- SQUAD2 evaluation results analysis
- Which metric to use?
- Summary
This post is a bit unique: it contains video streams as additional references.
Part of my passion is showing how the "sausage is made". I have links to my streaming sessions throughout the blog.
If you enjoy them, please subscribe to the YouTube channel and follow me on Twitch. That will grow my motivation to continue creating and releasing additional material.
Why use deepeval: the hardships of text evaluation
If you would rather watch a clip than read, here's the link. Watch out, I'm wearing a cap.
Generated text is hard to test and evaluate. Nowadays, generated text is often called LLM output, but that doesn't have to be the case. But I digress.
Generated text is hard to test and evaluate: unlike a classification task, where success is quantifiable, a good summary doesn't have clear properties to judge it by.
For example, if I'm writing a summary of Fellowship Of The Ring, should it contain the amazing Tom Bombadil character or not? Peter Jackson decided otherwise, deleting that book scene from his script. Is his script not a good summary? If the summary needs to be exhaustive or complete, then yes, the script fails at that.
But how does this omission affect the score? If only 1 fact was omitted and 99 others were kept, then the completeness (or summarization coverage) score should be 99/100. Now let's say the script had an alien invasion of Valinor. That does not align with the original book, hence it is a hallucination; in other words, the script is not faithful to the original text.
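Here is a toy sketch of the two scores I just described. It has nothing to do with deepeval's internals, and the facts and claims are made up for the example, but it shows that the two numbers answer different questions:
# Toy scoring of the hypothetical Fellowship script against the book.
book_facts = [f"book fact {i}" for i in range(99)] + ["Tom Bombadil appears"]
script_claims = [f"book fact {i}" for i in range(99)] + ["aliens invade Valinor"]

# Completeness / coverage: how much of the book survives in the script.
coverage = sum(f in script_claims for f in book_facts) / len(book_facts)         # 99/100

# Faithfulness: how many of the script's claims are supported by the book.
faithfulness = sum(c in book_facts for c in script_claims) / len(script_claims)  # 99/100

# Both happen to be 99/100, but for different reasons: the Tom Bombadil omission
# lowers coverage, while the invented alien invasion lowers faithfulness.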
If you wanted specific tests to catch these errors, how would you go about it? You would probably need to write a set of prompts, call an LLM, save the predictions, and go over them. Or go the manual route: annotate ground truth and check its distance from the predictions.
Well, it's your lucky day. Let's see how we can use the deepeval framework to test these metrics and others.
What is Deepeval
The deepeval team has implemented these metrics behind a unit-test-like interface that is really easy to use, hiding all of the messy details (you can always peek into the source and check the prompts and implementation).
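To give a feel for that unit-test flavour: at the time of writing, deepeval also exposes a pytest-style entry point (assert_test plus the deepeval test run CLI). Treat the snippet below as a sketch and check the current docs, since the API may have moved since this was written:
# test_llm_output.py, a sketch of deepeval's pytest-style usage
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_is_relevant():
    test_case = LLMTestCase(
        input="when did Beyonce start becoming popular?",
        actual_output="in the late 1990s",
    )
    # assert_test raises if any metric scores below its threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
You would then run it with deepeval test run test_llm_output.py, or keep using evaluate() directly as in the rest of this post.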
Example of a context and Q&A
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question: when did Beyonce start becoming popular?
Answer: in the late 1990s
Let's test this answer with Deepeval
Faithfulness metric documentation
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "in the late 1990s"

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ['''Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".''']

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="when did Beyonce start becoming popular?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

evaluate([test_case], [metric])
Code Breakdown
There are two main classes in this code: LLMTestCase and FaithfulnessMetric. Deepeval chose to decouple the inputs from the actual test in order to allow defining many different metrics based on the same input.
There are many other metrics to choose from; I will mention them later in this post.
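This decoupling is what makes it cheap to pile more checks onto the same test case. Here is a rough sketch, reusing a trimmed version of the Beyoncé context and using AnswerRelevancyMetric as a second built-in metric (swap in whichever metrics fit your task):
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Trimmed context, just for the sketch
retrieval_context = ["Beyoncé rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child."]

test_case = LLMTestCase(
    input="when did Beyonce start becoming popular?",
    actual_output="in the late 1990s",
    retrieval_context=retrieval_context
)

# One test case, several metrics, one evaluate() call
evaluate(
    [test_case],
    [
        FaithfulnessMetric(threshold=0.7, model="gpt-4"),
        AnswerRelevancyMetric(threshold=0.7, model="gpt-4"),
    ],
)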
How to use Deepeval to catch LLM hallucinations?
Here's the second installment of the saga; if you'd rather watch on YouTube, here's the link.
That's the use case I wanted to test during my fling with deepeval. I have several generative tasks that I want to test for hallucinations. My plan was to run deepeval metrics and calculate the hallucination rate of several models on the same inputs, which would allow me to choose the best model given the tradeoff between latency, price and accuracy. I wasn't sure which metric to use: the generative task is summarization, but summarization has many aspects that can be tested, so I knew there would be some experimentation needed to utilize the framework correctly for my use case.
But then it occurred to me that I need a baseline for deepeval's performance as well. I can't take its predictions at face value because, ultimately, they are just another LLM call. If it's just another model, I need to know how accurate it is. That means I need to evaluate the evaluator. This sounds like a rabbit hole.
Hacking evaluation by using an open dataset
The few times I'm proud of my work are when I succeed in "hacking" evaluation. What do I mean by hack? Finding the path of least resistance that leads to a quantitative conclusion.
I want to run a test that has high-quality output, without much annotation effort, and thus gain confidence that I'm on the right road.
Quick feedback is the road to success 🚀 (or the opposite re-framing of my favorite engineering principle: fail-fast)
I want a test that will help me optimize deepeval usage as well, because I'm not familiar with the metrics and there is a lot of potential customization to do. This is another reason for a good litmus test.
Here's the hack: I need to simulate my setup, which is a piece of context, a question and an LLM output (as the summary OR the answer). I could have annotated my own inputs; that would be the obvious route, and necessary no matter what. But I wanted something quicker, just to know that I'm headed in the right direction.
So I figured, let's take the next best thing: an open dataset with contexts, questions and answers, a.k.a. SQuAD2.
That's the source of my Beyonce example; it's the first row of that dataset.
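If you want to see where that row comes from, here is a quick peek (the Hugging Face squad_v2 rows carry id, title, context, question and answers fields, with answers holding lists of text and answer_start):
import datasets

ds = datasets.load_dataset("rajpurkar/squad_v2")
row = ds["train"][0]

print(row["context"][:120])    # the Beyoncé passage used above
print(row["question"])         # the question string
print(row["answers"]["text"])  # list of ground-truth answer strings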
How to run evaluation on SQUAD2?
In order to run the evaluation, we need to load the dataset and then run evaluation on each of its rows. Here's the code to do exactly that:
import datasets
import pandas as pd
from tqdm import tqdm

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

SQUAD_DATASET_CSV_PATH = "squad2_subset.csv"

def preprocess_squad2():
    # Download SQuAD2 once and persist the first ~1000 rows as a CSV,
    # so later runs don't need to reload the full dataset.
    ds = datasets.load_dataset("rajpurkar/squad_v2")
    iterator = ds["train"].iter(batch_size=1)
    rows = []
    for i, row in enumerate(tqdm(iterator)):
        if i > 1000:
            break
        answers = row["answers"][0]["text"]
        if not answers:  # SQuAD2 contains unanswerable questions; skip them
            continue
        rows.append({
            "context": row["context"][0],
            "id": row["id"][0],
            "answer": answers[0],
            "question": row["question"][0],
        })
    pd.DataFrame(rows).to_csv(SQUAD_DATASET_CSV_PATH, index=False)

def evaluate_n_rows(n=100):
    df = pd.read_csv(SQUAD_DATASET_CSV_PATH).head(n)
    # One metric instance is reused across all test cases.
    metric = FaithfulnessMetric(
        threshold=0.7,
        model="gpt-4o-mini",
        include_reason=True
    )
    test_cases = []
    for _, row in df.iterrows():
        test_cases.append(LLMTestCase(
            input=row["question"],
            actual_output=row["answer"],  # the ground-truth answer stands in for the LLM output
            retrieval_context=[row["context"]],
        ))
    evaluate(test_cases, [metric])

if __name__ == "__main__":
    preprocess_squad2()
    evaluate_n_rows()
Code breakdown:
- First I parse the SQuAD dataset, because I want to save it once as a .csv and then work on that, instead of reloading the dataset on every run, which just adds latency
- During evaluation we take the context, question and answer of each row and build the test case accordingly. We use only one metric (Faithfulness), so we create it once.
- This deepeval implementation uses parallel calls to the API so the run finishes faster, which is great and works out of the box
SQUAD2 evaluation results analysis
Usually I save predictions to a file in order to do error analysis based on the results.
But in this case, deepeval already implemented that for me, which is a really nice touch 🥰
The framework automatically creates a file named: .deepeval-cache.json
It has a very nice structure that allows you to easily find any test case that failed (or succeeded), plus a trace from the framework to help with error analysis.
Here's an example of one object from the list contained in .deepeval-cache.json:
"{\\"actual_output\\": \\"singing and dancing\\", \\"context\\": null, \\"expected_output\\": null, \\"hyperparameters\\": null, \\"input\\": \\"What areas did Beyonce compete in when she was growing up?\\", \\"retrieval_context\\": [\\"Beyonc\\\\u00e9 Giselle Knowles-Carter (/bi\\\\u02d0\\\\u02c8j\\\\u0252nse\\\\u026a/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyonc\\\\u00e9's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \\\\\\"Crazy in Love\\\\\\" and \\\\\\"Baby Boy\\\\\\".\\"]}": {
"cached_metrics_data": [
{
"metric_data": {
"name": "Faithfulness",
"threshold": 0.7,
"success": true,
"score": 1.0,
"reason": "The score is 1.00 because there are no contradictions, indicating a perfect alignment between the actual output and the retrieval context. Great job maintaining consistency!",
"strictMode": false,
"evaluationModel": "gpt-4o-mini",
"evaluationCost": 0,
"verboseLogs": "Truths:\\n[\\n \\"Beyonc\\u00e9 Giselle Knowles-Carter was born on September 4, 1981.\\",\\n \\"Beyonc\\u00e9 is an American singer, songwriter, record producer, and actress.\\",\\n \\"Beyonc\\u00e9 was born and raised in Houston, Texas.\\",\\n \\"Beyonc\\u00e9 performed in various singing and dancing competitions as a child.\\",\\n \\"Beyonc\\u00e9 rose to fame in the late 1990s as the lead singer of Destiny's Child.\\",\\n \\"Destiny's Child was managed by Mathew Knowles.\\",\\n \\"Destiny's Child became one of the world's best-selling girl groups of all time.\\",\\n \\"Beyonc\\u00e9's debut album is titled Dangerously in Love.\\",\\n \\"Dangerously in Love was released in 2003.\\",\\n \\"Dangerously in Love established Beyonc\\u00e9 as a solo artist worldwide.\\",\\n \\"Beyonc\\u00e9 earned five Grammy Awards for her work on Dangerously in Love.\\",\\n \\"Dangerously in Love featured the Billboard Hot 100 number-one single 'Crazy in Love'.\\",\\n \\"Dangerously in Love featured the Billboard Hot 100 number-one single 'Baby Boy'.\\"\\n] \\n \\nClaims:\\n[\\n \\"The text mentions singing.\\",\\n \\"The text mentions dancing.\\"\\n] \\n \\nVerdicts:\\n[\\n {\\n \\"verdict\\": \\"yes\\",\\n \\"reason\\": null\\n },\\n {\\n \\"verdict\\": \\"yes\\",\\n \\"reason\\": null\\n }\\n]"
},
"metric_configuration": {
"threshold": 0.7,
"evaluation_model": "gpt-4o-mini",
"strict_mode": false,
"include_reason": true
}
}
]
},
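To turn the cache into aggregate numbers, I only need to count the success flags. Here is a minimal sketch that walks the JSON recursively and collects every metric_data object, so it does not depend on the exact top-level layout (which may differ between deepeval versions):
import json

def collect_metric_data(node, found=None):
    # Gather every "metric_data" dict, whatever structure surrounds it.
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "metric_data" and isinstance(value, dict):
                found.append(value)
            else:
                collect_metric_data(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_metric_data(item, found)
    return found

with open(".deepeval-cache.json") as f:
    results = collect_metric_data(json.load(f))

passed = sum(1 for r in results if r.get("success"))
print(f"{passed}/{len(results)} cached test cases passed the Faithfulness threshold")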
Running this code on SQUAD2 produced 98% of cases judged faithful. The only case it found unfaithful was actually an annotation error. Isn't that always the dream in data science? You find the model disagreeing with the annotation, and it turns out to be an annotation error.
So essentially we have 100% agreement between the ground truth and deepeval's results, meaning I can start trusting this framework for my LLM evaluation.
Is the confidence 100%? No, there are still other things to test. We need to test the opposite case, that we do find hallucinations when they are present, but I will leave that for another time.
Before closing, I want to share my experience with 3 different metrics: Summarization, Hallucination and Faithfulness. Eventually I used only 1 of them, but the lesson I learned is worth telling.
Which metric to choose?
I imagine that this crossroads will be visited by many. Out of the 11 default metrics, I really didn't know which one was perfect for me.
This, of course, depends on the generative task and the metrics you care about most.
Summarization Metric
I started out by trying the Summarization metric. This seemed very natural because the generative task I'm interested in is summarizing a customer service conversation. But, as we know, summarization means different things to different people. The task is accompanied by different questions whose answers create an opinionated summary. It all screams summarization, right?
But when using this metric, I didn't find that it correlated with my intuition about my task.
Deepeval's Summarization implementation creates a set of facts from the context and then checks whether those facts are answered by the LLM output; basically, it tests how much of the context is covered by the summary.
In my case, because I have several questions and I wanted to test each one separately, none of them by itself covers the context. So when running the Summarization metric I got a very low score, and I decided to drop it. The alignment between deepeval's output and the SQUAD2 ground truth was very low.
Here's a great explanation of deepeval's summarization implementation.
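The Beyoncé cache entry shown earlier makes the mismatch easy to see with back-of-the-envelope numbers (this is an illustration of the idea, not deepeval's exact formula): the context yields 13 extracted truths, while the short ground-truth answer produces only 2 claims.
# Rough numbers taken from the verboseLogs of the earlier cache entry.
context_facts = 13     # "truths" extracted from the Beyoncé passage
answer_claims = 2      # claims extracted from "singing and dancing"
facts_covered = 1      # assume the answer only touches the competitions fact
claims_supported = 2   # both of its claims are backed by the context

coverage_style = facts_covered / context_facts     # ~0.08, far below a 0.7 threshold
faithfulness = claims_supported / answer_claims    # 1.0, passes comfortably
A short but correct answer gets punished by a coverage-style score and rewarded by a faithfulness-style one.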
Hallucination Metric
At face value, if I want to test hallucinations, this should be my go-to metric in deepeval.
I did get higher alignment than with Summarization: 73%. It was a step in the right direction.
Here's an example of a failure:
"{\\"actual_output\\": \\"Jay Z\\", \\"context\\": [\\"Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits \\\\\\"D\\\\u00e9j\\\\u00e0 Vu\\\\\\", \\\\\\"Irreplaceable\\\\\\", and \\\\\\"Beautiful Liar\\\\\\". Beyonc\\\\u00e9 also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for \\\\\\"Single Ladies (Put a Ring on It)\\\\\\". Beyonc\\\\u00e9 took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyonc\\\\u00e9 (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.\\"], \\"expected_output\\": null, \\"hyperparameters\\": null, \\"input\\": \\"Which artist did Beyonce marry?\\", \\"retrieval_context\\": [\\"Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits \\\\\\"D\\\\u00e9j\\\\u00e0 Vu\\\\\\", \\\\\\"Irreplaceable\\\\\\", and \\\\\\"Beautiful Liar\\\\\\". Beyonc\\\\u00e9 also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for \\\\\\"Single Ladies (Put a Ring on It)\\\\\\". Beyonc\\\\u00e9 took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyonc\\\\u00e9 (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.\\"]}": {
"cached_metrics_data": [
{
"metric_data": {
"name": "Hallucination",
"threshold": 0.7,
"success": false,
"score": 1.0,
"reason": "The score is 1.00 because the actual output fails to align with the context by not providing any relevant information regarding Beyonc\\u00e9's career, albums, or achievements.",
"strictMode": false,
"evaluationModel": "gpt-4o-mini",
"evaluationCost": 0,
"verboseLogs": "Verdicts:\\n[\\n {\\n \\"verdict\\": \\"no\\",\\n \\"reason\\": \\"The actual output does not provide any information relevant to the context regarding Beyonc\\\\u00e9's career, albums, or achievements, and thus does not agree with the context.\\"\\n }\\n]"
},
"metric_configuration": {
"threshold": 0.7,
"evaluation_model": "gpt-4o-mini",
"strict_mode": false,
"include_reason": true
}
}
]
}
Reason: The score is 1.00 because the actual output fails to align with the context by not providing any relevant information regarding Beyoncé's career, albums, or achievements.
The reason part is lovely; it really helps me understand the decision process. Even though the hallucination documentation seems to be focused on checking contradictions between the context and the output, in this case it actually checks some sort of completeness.
Whenever the answer (which here is actually the ground truth) was too short, the Hallucination metric failed it.
Maybe this is exactly what deepeval intended, but to me it seems incorrect.
My best guess is that, under the hood, facts are created from the context and then checked for contradiction against the answer. But when creating the facts there are many details, and a short answer might not be the best fit. Basically, the "question" is disregarded.
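To make that guess concrete, here is a pseudocode-flavoured sketch of the two directions of checking. This is my mental model, not deepeval's actual implementation, and agrees() is a placeholder for an LLM judgment call:
def agrees(statement: str, evidence: str) -> bool:
    """Placeholder for an LLM call: does the evidence agree with the statement?"""
    raise NotImplementedError

def hallucination_style(context_facts: list[str], answer: str) -> float:
    # Context-centric: check every fact pulled from the context against the answer.
    # A short answer simply doesn't mention most facts, so many verdicts come back "no".
    contradicted = sum(not agrees(fact, answer) for fact in context_facts)
    return contradicted / len(context_facts)   # higher = more "hallucinated"

def faithfulness_style(answer_claims: list[str], context: str) -> float:
    # Claim-centric: check every claim the answer makes against the context.
    # A short but correct answer makes few claims, and all of them are supported.
    supported = sum(agrees(claim, context) for claim in answer_claims)
    return supported / len(answer_claims)      # higher = more faithful
Note that the question never enters either computation, which is consistent with short ground-truth answers failing the Hallucination metric.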
I knew we could find something better. That's when I tried Faithfulness.
Faithfulness Metric
I almost went down the road of a custom metric implementation, but I decided to test the Faithfulness metric first.
And it worked like a charm 🎉
Here's the documentation: the score is the number of truthful claims divided by the total number of claims. Let's compare it with Hallucination for a moment. This metric had 100% alignment with the data! At last! That's the metric I can start using in my internal tests.
What's the difference between a contradicted context and a false claim? I'm still not sure; I need to go deeper. One guess is that a claim puts more emphasis on the answer, whereas context checks put more emphasis on the context. I'm currently investigating this and will report once I fully understand it, so you will hear more about this.
Summary
I hope I have successfully conveyed that you have to be very thoughtful about LLM evaluation. There are many ways to do it, and deepeval is only one of them, but it does seem very easy to use, with a great programmatic interface. As always, test it with your own data in order to gain confidence.
What's next? Probably training LLMs!
Links
- You can start a conversation with me on my Hacking AI Discord
- Hop over to my LinkedIn page
- Or subscribe to my YouTube channel to watch me fumble my API keys!
- Or just follow me on Twitch if you loved my memes