
Beyond “Looks Good to Me”: How to Quantify LLM Performance with Google’s GenAI Evaluation Service
Last Updated on September 25, 2025 by Editorial Team
Author(s): Jasleen
Originally published on Towards AI.
The Production Hurdle
The greatest challenge the industry faces today is moving an LLM solution from demo to production, and the main blocker is confidence in the results. The evaluation datasets and metrics we build are rarely holistic or adaptable; they give a rough impression rather than thorough testing. So we still rely on a “human in the loop” to inspect a sample of responses and make the final call. Then, once the system reaches production and real users, it starts to fail. Businesses have a hard time shipping a demo on an engineer’s gut feeling, with no concrete or customizable evaluation metrics, and they often ask questions like:
- What is the accuracy?
- How can we measure how frequently, and in which cases, it hallucinates?
- How do we compare two LLMs for our specific use case?
These questions call for objective, specialized metrics built for the task at hand, and those metrics need to be data-driven and repeatable. Google’s GenAI Evaluation Service on Vertex AI is built to solve exactly this problem. It is an enterprise-grade suite of tools designed to quantify the quality of a model’s output, enabling systematic testing, validation, and application monitoring. Its most powerful feature is Adaptive Rubrics, which moves beyond simple scores and into the realm of true unit testing for prompts.

The 4 Pillars of Evaluation
The GenAI Evaluation Service can evaluate a model in four different ways:
- Computation-Based Metrics: This is useful when a deterministic ground truth is available. It runs algorithms like ROUGE (for summarization) or BLEU (for translation); see the toy ROUGE-style sketch after this list.
- Static Rubrics: This is used to evaluate a model against fixed criteria and critical metrics like Groundedness and Safety.
- Model-Based Metrics: This is the “LLM-as-a-Judge” approach, where a judge model scores a single response (Pointwise) or picks the better of two responses (Pairwise).
- Adaptive Rubrics: This is the recommended method. It reads the prompt and generates a unique set of pass/fail tests tailored to your specific use case.
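To make the computation-based pillar concrete, here is a minimal, dependency-free sketch of a ROUGE-1-style recall score. The real service runs the full ROUGE and BLEU implementations for you; the rouge_1_recall helper below is purely illustrative.
# Illustrative only: a bare-bones ROUGE-1 recall, i.e. the fraction of
# unique reference unigrams that also appear in the candidate summary.
def rouge_1_recall(candidate: str, reference: str) -> float:
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & cand_tokens) / len(ref_tokens)

print(rouge_1_recall(
    candidate="solar and wind power are growing fast",
    reference="the article reports fast growth in solar and wind power",
))  # -> 0.5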
The Adaptive Rubric Feature
The Adaptive Rubric feature is the highlight of the service. Instead of applying a static set of pass/fail test cases, it reads the prompt and then dynamically generates a set of pass/fail unit tests that are run against the generated response.
Let’s look at the exact example from Google’s documentation. Imagine you give the model this prompt:
User Prompt: “Write a four-sentence summary of the provided article about renewable energy, maintaining an optimistic tone.”
The service’s Rubric Generation step analyzes that prompt and instantly creates a set of specific tests. For this prompt, it might produce:
- Test Case 1: The response must be a summary of the provided article.
- Test Case 2: The response must contain exactly four sentences.
- Test Case 3: The response must maintain an optimistic tone.
Now, your model generates its response:
Model Response: “The article highlights significant growth in solar and wind power. These advancements are making clean energy more affordable. The future looks bright for renewables. However, the report also notes challenges with grid infrastructure.”
The Rubric Validation step then produces this result:
- Test Case 1 (Summary): Pass. Reason: The response accurately summarizes the main points.
- Test Case 2 (Four Sentences): Pass. Reason: The response is composed of four distinct sentences.
- Test Case 3 (Optimistic Tone): Fail. Reason: The final sentence introduces a negative point, which detracts from the optimistic tone.
The final pass rate is 66.7% (two of the three rubrics passed). This is far more useful than an opaque “4/5” score because you know exactly what to fix.
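Conceptually, each generated rubric acts like a unit test over the response. A minimal sketch of that pass-rate arithmetic (with the rubric verdicts hard-coded here, whereas the service generates and judges them for you) looks like this:
# Hypothetical stand-ins for the verdicts the Rubric Validation step returns;
# in the real service these are generated and judged automatically.
rubric_results = [
    {"rubric": "Summarizes the provided article", "passed": True},
    {"rubric": "Contains exactly four sentences", "passed": True},
    {"rubric": "Maintains an optimistic tone", "passed": False},
]

pass_rate = sum(r["passed"] for r in rubric_results) / len(rubric_results)
print(f"Pass rate: {pass_rate:.1%}")  # -> Pass rate: 66.7%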
How to Run Your First Evaluation (The Code)
The service can be called directly from your code via the Vertex AI SDK. The snippet below builds a small prompt dataset and generates candidate responses with Gemini; the evaluation itself runs in the next step.
import pandas as pd
from vertexai import Client, types

# Initialize the Vertex AI client (replace with your own project and location)
client = Client(project="your-project-id", location="us-central1")

# Build a small evaluation dataset of prompts
eval_df = pd.DataFrame({
    "prompt": [
        "Explain Generative AI in one line.",
        "Why is RAG so important in AI? Explain concisely.",
        "Write a four-line poem about the lily, where the word 'and' cannot be used.",
    ]
})

# Generate candidate responses for each prompt with the model under test
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-pro",
    src=eval_df,
)
eval_dataset.show()
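If you already have responses (from an earlier run, or from a model outside Vertex AI), you can evaluate them without calling run_inference. The sketch below assumes the SDK accepts a plain DataFrame with “prompt” and “response” columns; treat those column names and the call signature as assumptions to verify against the current docs.
# Assumed bring-your-own-response pattern: the column names ("prompt", "response")
# and passing a raw DataFrame to evaluate() are assumptions, not confirmed API.
byor_df = pd.DataFrame({
    "prompt": ["Explain Generative AI in one line."],
    "response": ["Generative AI creates new text, images, or code from patterns learned in training data."],
})

byor_result = client.evals.evaluate(dataset=byor_df)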

# Run the Evaluation
eval_result = client.evals.evaluate(dataset=eval_dataset)

# Visualize Results
from IPython.display import display  # display() assumes a notebook environment

# Get the data out of the Pydantic model into a dictionary
results_dict = eval_result.model_dump()
key_to_display = 'results_table'
if key_to_display in results_dict:
    # Convert that specific part of the data into a DataFrame
    df = pd.DataFrame(results_dict[key_to_display])
else:
    print(f"Could not find the key '{key_to_display}'.")

summary_df = pd.DataFrame(results_dict['summary_metrics'])
print("--- Summary Metrics ---")
display(summary_df)

# Get the original prompts and responses as a DataFrame
inputs_df = pd.DataFrame(results_dict['evaluation_dataset'][0]['eval_dataset_df'])

# Parse the nested 'eval_case_results' to get the score for each prompt
parsed_results = []
for case in results_dict['eval_case_results']:
    case_index = case['eval_case_index']
    # Drill down to the score for the 'general_quality_v1' metric
    # (assumes one candidate response per prompt, hence index [0])
    metric_result = case['response_candidate_results'][0]['metric_results']['general_quality_v1']
    parsed_results.append({
        'eval_case_index': case_index,
        'score': metric_result['score'],
    })

# Convert the parsed results into their own DataFrame
metrics_df = pd.DataFrame(parsed_results)

# Join the inputs (prompts/responses) with the scores, matching the row index
# of inputs_df against 'eval_case_index' from metrics_df
final_df = inputs_df.join(metrics_df.set_index('eval_case_index'))

# Display the final, combined table
print("--- Detailed Per-Prompt Results ---")
display(final_df)
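With the scores joined back onto the prompts, ordinary pandas aggregations give you headline numbers and regression flags. A small follow-on sketch (the 0-to-1 scale and the 0.8 threshold are assumptions; pick a bar that fits your use case):
# Aggregate the per-prompt scores into a single headline number
print(f"Mean score across prompts: {final_df['score'].mean():.2f}")

# Flag prompts that fall below a chosen quality bar (threshold is arbitrary here)
low_quality = final_df[final_df['score'] < 0.8]
print(f"{len(low_quality)} prompt(s) scored below 0.8")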

Conclusion
Historically, we have relied on gut feeling and subjective human-in-the-loop checks. The GenAI Evaluation Service is a foundational step in changing that: with Adaptive Rubrics, it turns the fuzzy question of “quality” into a set of data-driven, actionable unit tests.