
Beyond “Looks Good to Me”: How to Quantify LLM Performance with Google’s GenAI Evaluation Service
Last Updated on September 25, 2025 by Editorial Team
Author(s): Jasleen
Originally published on Towards AI.
The Production Hurdle
The greatest challenge the industry faces today is moving an LLM solution from demo to production, and the main blocker is confidence in the results. The evaluation datasets and metrics we build are rarely holistic or adaptable; they give a rough impression rather than thorough testing. So we still rely on a “human in the loop” to inspect a sample of responses and make the final call. Then, once the system reaches production and real users, it starts to fail. Businesses have a hard time shipping a demo on an engineer’s gut feeling, with no concrete or customizable evaluation metrics, and they often ask questions like:
- What is the accuracy?
- How can we measure how frequently, and in which cases, it hallucinates?
- How do we compare two LLMs for our specific use case?
These questions call for objective, specialized metrics built for the task at hand, and those metrics need to be data-driven and repeatable. Google’s GenAI Evaluation Service on Vertex AI is built to solve exactly this problem. It is an enterprise-grade suite of tools designed to quantify the quality of a model’s output, enabling systematic testing, validation, and application monitoring. Its most powerful feature is Adaptive Rubrics, which moves beyond simple scores and into the realm of true unit testing for prompts.

The 4 Pillars of Evaluation
The GenAI Evaluation Service can evaluate a model in four different ways:
- Computation-Based Metrics: This is useful when a deterministic ground truth is available. It runs algorithms like ROUGE (for summarization) or BLEU (for translation); see the toy ROUGE-style sketch after this list.
- Static Rubrics: This is used to evaluate a model against fixed criteria and critical metrics like Groundedness and Safety.
- Model-Based Metrics: This is the “LLM-as-a-Judge” approach, where a judge model scores a single response (Pointwise) or picks the better of two responses (Pairwise).
- Adaptive Rubrics: This is the recommended method. It reads the prompt and generates a unique set of pass/fail tests tailored to your specific use case.
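To make the computation-based pillar concrete, here is a minimal, dependency-free sketch of a ROUGE-1-style recall score. The real service runs the full ROUGE and BLEU implementations for you; the rouge_1_recall helper below is purely illustrative.
# Illustrative only: a bare-bones ROUGE-1 recall, i.e. the fraction of
# unique reference unigrams that also appear in the candidate summary.
def rouge_1_recall(candidate: str, reference: str) -> float:
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & cand_tokens) / len(ref_tokens)

print(rouge_1_recall(
    candidate="solar and wind power are growing fast",
    reference="the article reports fast growth in solar and wind power",
))  # -> 0.5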
The Adaptive Rubric Feature
The Adaptive Rubric feature is the highlight of the service. Instead of applying a static set of pass/fail test cases, it reads the prompt and then dynamically generates a set of pass/fail unit tests that are run against the generated response.
Let’s look at the exact example from Google’s documentation. Imagine you give the model this prompt:
User Prompt: “Write a four-sentence summary of the provided article about renewable energy, maintaining an optimistic tone.”
The service’s Rubric Generation step analyzes that prompt and instantly creates a set of specific tests. For this prompt, it might produce:
- Test Case 1: The response must be a summary of the provided article.
- Test Case 2: The response must contain exactly four sentences.
- Test Case 3: The response must maintain an optimistic tone.
Now, your model generates its response:
Model Response: “The article highlights significant growth in solar and wind power. These advancements are making clean energy more affordable. The future looks bright for renewables. However, the report also notes challenges with grid infrastructure.”
The Rubric Validation step then produces this result:
- Test Case 1 (Summary): Pass. Reason: The response accurately summarizes the main points.
- Test Case 2 (Four Sentences): Pass. Reason: The response is composed of four distinct sentences.
- Test Case 3 (Optimistic Tone): Fail. Reason: The final sentence introduces a negative point, which detracts from the optimistic tone.
The final pass rate is 66.7% (two of the three rubrics passed). This is far more useful than an opaque “4/5” score because you know exactly what to fix.
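Conceptually, each generated rubric acts like a unit test over the response. A minimal sketch of that pass-rate arithmetic (with the rubric verdicts hard-coded here, whereas the service generates and judges them for you) looks like this:
# Hypothetical stand-ins for the verdicts the Rubric Validation step returns;
# in the real service these are generated and judged automatically.
rubric_results = [
    {"rubric": "Summarizes the provided article", "passed": True},
    {"rubric": "Contains exactly four sentences", "passed": True},
    {"rubric": "Maintains an optimistic tone", "passed": False},
]

pass_rate = sum(r["passed"] for r in rubric_results) / len(rubric_results)
print(f"Pass rate: {pass_rate:.1%}")  # -> Pass rate: 66.7%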
How to Run Your First Evaluation (The Code)
The service can be called directly from your code via the Vertex AI SDK. The snippet below builds a small prompt dataset and generates candidate responses with Gemini; the evaluation itself runs in the next step.
import pandas as pd
from vertexai import Client, types

# Initialize the Vertex AI client (replace with your own project and location)
client = Client(project="your-project-id", location="us-central1")

# Build a small evaluation dataset of prompts
eval_df = pd.DataFrame({
    "prompt": [
        "Explain Generative AI in one line.",
        "Why is RAG so important in AI? Explain concisely.",
        "Write a four-line poem about the lily, where the word 'and' cannot be used.",
    ]
})

# Generate candidate responses for each prompt with the model under test
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-pro",
    src=eval_df,
)
eval_dataset.show()
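If you already have responses (from an earlier run, or from a model outside Vertex AI), you can evaluate them without calling run_inference. The sketch below assumes the SDK accepts a plain DataFrame with “prompt” and “response” columns; treat those column names and the call signature as assumptions to verify against the current docs.
# Assumed bring-your-own-response pattern: the column names ("prompt", "response")
# and passing a raw DataFrame to evaluate() are assumptions, not confirmed API.
byor_df = pd.DataFrame({
    "prompt": ["Explain Generative AI in one line."],
    "response": ["Generative AI creates new text, images, or code from patterns learned in training data."],
})

byor_result = client.evals.evaluate(dataset=byor_df)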

# Run the Evaluation
eval_result = client.evals.evaluate(dataset=eval_dataset)

# Visualize Results
from IPython.display import display  # display() assumes a notebook environment

# Get the data out of the Pydantic model into a dictionary
results_dict = eval_result.model_dump()
key_to_display = 'results_table'
if key_to_display in results_dict:
    # Convert that specific part of the data into a DataFrame
    df = pd.DataFrame(results_dict[key_to_display])
else:
    print(f"Could not find the key '{key_to_display}'.")

summary_df = pd.DataFrame(results_dict['summary_metrics'])
print("--- Summary Metrics ---")
display(summary_df)

# Get the original prompts and responses as a DataFrame
inputs_df = pd.DataFrame(results_dict['evaluation_dataset'][0]['eval_dataset_df'])

# Parse the nested 'eval_case_results' to get the score for each prompt
parsed_results = []
for case in results_dict['eval_case_results']:
    case_index = case['eval_case_index']
    # Drill down to the score for the 'general_quality_v1' metric
    # (assumes one candidate response per prompt, hence index [0])
    metric_result = case['response_candidate_results'][0]['metric_results']['general_quality_v1']
    parsed_results.append({
        'eval_case_index': case_index,
        'score': metric_result['score'],
    })

# Convert the parsed results into their own DataFrame
metrics_df = pd.DataFrame(parsed_results)

# Join the inputs (prompts/responses) with the scores, matching the row index
# of inputs_df against 'eval_case_index' from metrics_df
final_df = inputs_df.join(metrics_df.set_index('eval_case_index'))

# Display the final, combined table
print("--- Detailed Per-Prompt Results ---")
display(final_df)
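With the scores joined back onto the prompts, ordinary pandas aggregations give you headline numbers and regression flags. A small follow-on sketch (the 0-to-1 scale and the 0.8 threshold are assumptions; pick a bar that fits your use case):
# Aggregate the per-prompt scores into a single headline number
print(f"Mean score across prompts: {final_df['score'].mean():.2f}")

# Flag prompts that fall below a chosen quality bar (threshold is arbitrary here)
low_quality = final_df[final_df['score'] < 0.8]
print(f"{len(low_quality)} prompt(s) scored below 0.8")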

Conclusion
Historically, we have relied on gut feeling and subjective human-in-the-loop checks. The GenAI Evaluation Service is a foundational step in changing that: with Adaptive Rubrics, it turns the fuzzy question of “quality” into a set of data-driven, actionable unit tests.