
How do I Evaluate Large Language Models

Last Updated on June 10, 2024 by Editorial Team

Author(s): Meenakshi Srinivasan

Originally published on Towards AI.

Photo by Steve Johnson on Unsplash


Before the launch of Large Language Models, I could always test my models and get a concrete accuracy score for how well they worked, whether they were transformers or other supervised models. The situation has changed with the arrival of Generative AI. Even though LLMs are really good at producing results (for tasks like summarization, classification, and more), I have always felt that we lag behind in testing how precise their responses are. Thankfully, LangSmith offers a new way to evaluate LLMs and monitor them consistently, which makes it well suited for post-deployment monitoring.

OK, so how does LangSmith help with evaluation?

LangSmith is a unified DevOps platform for developing, collaborating, testing, deploying, and monitoring LLM applications[1].

Earlier, in reinforcement learning from human feedback, the responses generated by a model were rated by humans to improve its results. With LangSmith, we can instead use the LLM-as-a-judge approach to analyse the outputs of LLMs and keep track of how well a model performs over time. This works even for tasks like text generation and summarization, where there is no single ground truth. To obtain accurate and customized evaluation results, it is essential to include all relevant criteria in the prompt.

LangSmith Overview:

First of all, you need to sign up for LangSmith and create a personal API key on the website. LangSmith's free Developer plan includes 5,000 traces per month, with a charge of $0.005 per trace thereafter.
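Once you have a key, tracing is typically enabled through environment variables. A minimal sketch, assuming the standard LangSmith variable names (the project name below is hypothetical, and the key is a placeholder you must replace):

```python
import os

# Assumption: these are the environment variables LangSmith reads for tracing.
# Replace the placeholder with the API key generated in your LangSmith settings.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "movie-tweet-eval"  # hypothetical project name
```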

With Langsmith, you can enhance your project management by adding separate projects to monitor individual tasks, store datasets for evaluation, employ few-shot prompting, and fine-tune models. The platform offers a wide range of capabilities to streamline and optimize your workflow.

LangSmith dashboard

Example Experiment — Generate tweets for Harry Potter movies:

I’m planning an experiment in which I will use two different large language models (LLMs) to generate tweets about Harry Potter movies, based on plot summaries from Wikipedia. Once the tweets are generated, I’ll use LangSmith to evaluate the results. Additionally, I’ll explain and run pairwise evaluation, a comparison technique provided by LangSmith that lets us compare the outputs of the two LLMs and determine which one produces better results.

Working:

Step 1: Install the required packages
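The original post does not list the packages. Based on the imports used in the later steps, an install along these lines should work (the exact package set is my assumption):

```shell
pip install -U langchain langchain-openai langsmith wikipedia
```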

Step 2 — Data Loading: Load the movie plots from Wikipedia using the WikipediaLoader from LangChain.

from langchain.document_loaders import WikipediaLoader

movies_list = ["Harry Potter and the Philosopher's Stone", "Harry Potter and the Chamber of Secrets"]

movies = []

for movie in movies_list:
    # Query Wikipedia for the current title (the original hardcoded a single movie here)
    loader = WikipediaLoader(query=movie, load_max_docs=1).load()
    movies.extend(loader)

Step 3 — Create dataset: To store the data used to generate tweets, we’ll create a dataset in Langsmith using the Client class. This allows us to create a new dataset with a description, which can be reused for evaluation, fine-tuning, and other purposes at any time.

from langsmith import Client

client = Client()

dataset_name = "Movies_summary_generator"

dataset = client.create_dataset(dataset_name=dataset_name, description="Movies to summarize")

After this step, you will be able to see the dataset you created on the ‘Datasets & Testing’ page in LangSmith.

Datasets & Testing in LangSmith
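One step the post glosses over is filling the dataset: examples must exist before an evaluation can run over it. A minimal sketch using the LangSmith client's create_example method (the upload_examples helper name is mine), assuming client, dataset, and movies from the previous steps:

```python
def upload_examples(client, dataset_id, docs):
    # Store each document's text as an example input in the LangSmith dataset.
    # No reference outputs are added, since the grader scores tweets on criteria alone.
    for doc in docs:
        client.create_example(
            inputs={"text": doc.page_content},
            dataset_id=dataset_id,
        )

# Usage with the objects created above:
# upload_examples(client, dataset.id, movies)
```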

Step 4 — Generate tweets using GPT: We will create a function that uses an LLM to generate tweets for Harry Potter and the Philosopher’s Stone and Harry Potter and the Chamber of Secrets, and save the runs as an Experiment in LangSmith (we will use this later for evaluation).

import os

# Imports were not shown in the original post; these module paths assume a recent LangChain release.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

def predict_tweet_gpt_3(example: dict):
    system_tweet_instructions = (
        """You are an assistant that generates Tweets to summarise movie plots.
        Ensure the summary: 1. has an engaging title. 2. Provides a bullet point list of main characters from the movie.
        3. Utilises emojis 4. includes plot twist 5. highlights in one sentence the key point of the movie.
        """
    )

    human = "Generate tweets for the following movie: {paper}"
    prompt = ChatPromptTemplate.from_messages([("system", system_tweet_instructions), ("human", human)])

    chat = ChatOpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), model="gpt-3.5-turbo")
    tweet_generator_gpt_3 = prompt | chat | StrOutputParser()
    response = tweet_generator_gpt_3.invoke({"paper": example["text"]})
    return response

This is the tweet generated by the model —

🧙🏼‍♂️✨✉️ "Harry Potter and the Philosopher's Stone" Tweet Summary:

🌟 Title: Discovering Magic at Hogwarts! 🧙🏼‍♂️🔮

👦🏻 Harry Potter - Young wizard discovering his magical heritage
🧙🏻‍♂️ Ron Weasley - Loyal friend and companion
📚 Hermione Granger - Bright witch and problem-solver

🔮 Harry learns he's famous in the wizarding world, faces dark wizard Lord Voldemort with his friends, and uncovers secrets at Hogwarts School of Witchcraft and Wizardry.
🧙🏼‍♂️ Plot Twist: Harry's wand is connected to Voldemort's, making them "brothers".
🌟 Feedback: Rowling's imaginative world and clever plot kept readers spellbound, though some found the ending rushed.

Step 5 — Evaluation: This is the core step of this article. I will use GPT-4o to evaluate the tweets, passing a prompt that spells out the evaluation criteria.

# Imports were not shown in the original post; these module paths assume a recent LangChain release.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith.schemas import Run, Example
from pydantic import BaseModel, Field

def answer_evaluator(run: Run, example: Example) -> dict:
    input_text = example.inputs["text"]
    prediction = run.outputs["answer"]

    class GradeSummary(BaseModel):
        score: int = Field(description="Answer meets criteria, score from 0 to 5")

    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    structured_llm_grader = llm.with_structured_output(GradeSummary)

    system = """
    You are grading tweets for a movie plot summary. Ensure that the Assistant's answer is engaging and meets the criteria.
    Ensure the summary: 1. has an engaging title. 2. Provides a bullet point list of main characters from the movie.
    3. Utilises emojis 4. includes plot twist 5. highlights the feedback 6. Includes hashtag
    """

    grade_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system),
            ("human", "Assistant's answer for the movie summary: {prediction}"),
        ]
    )

    answer_grader = grade_prompt | structured_llm_grader
    score = answer_grader.invoke({"prediction": prediction})
    return {"key": "summary_engagement_score", "score": int(score.score)}

Then, we can run and trace this evaluation using LangSmith's evaluate function.

from langsmith.evaluation import evaluate

dataset_name = "Movies_summary_generator"

experiment_results = evaluate(
    predict_tweet_gpt_3,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="summary-gpt3-turbo",
    metadata={"variant": "movie summary tweet, gpt-3-turbo"},
)

By passing the dataset, the experiment set up in the previous step, and the evaluator function to Langsmith, we can view the evaluation results in the Langsmith interface. Running this model will create an Experiment, which will track the performance of this specific task from the date of creation, allowing for consistent monitoring and analysis over time.

Experiment results in LangSmith

Pairwise evaluation

Using LangSmith’s pairwise evaluation, we can also compare the results of two LLMs to determine which one performs better against the criteria we set, helping us make informed decisions about their effectiveness.

import json

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith.evaluation import evaluate_comparative
from langsmith.schemas import Run, Example


def evaluate_pairwise(runs: list[Run], example: Example):
    scores = {}

    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    # prompt for evaluation
    system = """
    Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below.
    You should score both assistants on whether the response
    1. has an engaging title.
    2. Provides a bullet point list of main characters from the movie.
    3. Utilises emojis
    4. includes plot twist
    5. highlights the feedback
    Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision.
    Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible.
    Your output should strictly be on a scale of 1 to 5 for both assistants, with the keys "Score for Assistant 1" for Assistant 1
    and "Score for Assistant 2" for Assistant 2, and your output should be in JSON format. Your output should not include any explanations.
    """

    grade_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system),
            ("human", "[User Question] {question} \n[The Start of Assistant A's Answer] {answer_a} [The End of Assistant A's Answer]\n[The Start of Assistant B's Answer] {answer_b} [The End of Assistant B's Answer]"),
        ]
    )

    runnable = grade_prompt | llm

    response = runnable.invoke({
        "question": example.inputs["text"],
        "answer_a": runs[0].outputs["answer"] if runs[0].outputs is not None else "N/A",
        "answer_b": runs[1].outputs["answer"] if runs[1].outputs is not None else "N/A",
    })
    # Parse the model's JSON output safely instead of calling eval() on it
    json_response = json.loads(str(response.content))
    scores[runs[0].id] = json_response["Score for Assistant 1"]
    scores[runs[1].id] = json_response["Score for Assistant 2"]

    return {"key": "ranked_preference", "scores": scores}


evaluate_comparative(
    # Replace the following list with the names or IDs of your experiments
    ["<YOUR-EXPERIMENT-ID-1>", "<YOUR-EXPERIMENT-ID-2>"],
    evaluators=[evaluate_pairwise],
)

The tweets generated by GPT-3.5 and GPT-4 are as follows —

# Response by GPT-3.5 for Harry Potter and the Philosopher's Stone

🧙🏼‍♂️✨✉️ "Harry Potter and the Philosopher's Stone" Tweet Summary:

🌟 Title: Discovering Magic at Hogwarts! 🧙🏼‍♂️🔮

👦🏻 Harry Potter - Young wizard discovering his magical heritage
🧙🏻‍♂️ Ron Weasley - Loyal friend and companion
📚 Hermione Granger - Bright witch and problem-solver

🔮 Harry learns he's famous in the wizarding world, faces dark wizard Lord Voldemort with his friends, and uncovers secrets at Hogwarts School of Witchcraft and Wizardry.
🧙🏼‍♂️ Plot Twist: Harry's wand is connected to Voldemort's, making them "brothers".
🌟 Feedback: Rowling's imaginative world and clever plot kept readers spellbound, though some found the ending rushed.



# Response by GPT-4 for Harry Potter and the Philosopher's Stone
🧙‍♂️✨ **Magical Beginnings: Harry Potter and the Philosopher's Stone** 📚🔮

👦 **Harry Potter** - The boy who lived, discovers his magical heritage and his pivotal role in the wizarding world.
👧 **Hermione Granger** - A brilliant and bookish wizard, becomes one of Harry's closest allies.
🧑 **Ron Weasley** - Harry's loyal friend, known for his humor and bravery.
🧙‍♂️ **Lord Voldemort** - The dark wizard who seeks to return to power, having failed to kill Harry as a baby.
🦉 **Hagrid** - Half-giant and Hogwarts gamekeeper, introduces Harry to the wizarding world.

🎂 On his 11th birthday, Harry learns he's a wizard & heads to Hogwarts. 🎉🏰
👬👫 Friendship blooms with Ron and Hermione as they uncover the mystery of the Philosopher's Stone. 🕵️‍♂️💎
🐍 Twist: The stone is hidden in a school mirror! Voldemort's comeback attempt thwarted by brave first-years. 🚫🧙‍♂️

📢 Critics rave about Rowling's rich imagination & the world she crafts! ✨📖
🔥 "A modern classic, melding magic with moral lessons!" - Critics' Circle

#HarryPotter #Magic #Fantasy #Adventure #BookToMovie 🎬📖

As clearly seen, GPT-4’s output is more creative and descriptive, and it also included hashtags in the tweet. The pairwise experiment likewise gave GPT-4 higher scores, highlighting its stronger performance.

Pairwise Evaluation in LangSmith

Final words

In conclusion, this is how we can evaluate LLMs using LangSmith. LangSmith allows us to comprehensively analyze and compare the performance of different language models, so we can make informed decisions based on the desired criteria and optimize our models for various tasks.

You can find my Jupyter notebook with the full code in my GitHub repository: https://github.com/Meenakshi-srinivasan/langsmith-overview

References:

  1. https://www.langchain.com/langsmith
  2. https://blog.langchain.dev/week-of-5-13-langchain-release-notes/
  3. https://www.youtube.com/watch?v=yskkOAfTwcQ

Connect with me on LinkedIn: https://www.linkedin.com/in/meenakshisrinivasan/

If you enjoyed this article, please consider clapping and following me on Medium for more Data Science stories.


Published via Towards AI
