

Beyond “Looks Good to Me”: How to Quantify LLM Performance with Google’s GenAI Evaluation Service

Last Updated on September 25, 2025 by Editorial Team

Author(s): Jasleen

Originally published on Towards AI.

The Production Hurdle

The greatest challenge the industry faces today is moving a solution from demo to production, and the main reason is a lack of confidence in the results. The evaluation datasets and metrics we build to test against are rarely holistic or adaptable; they give a rough signal rather than thorough coverage. We still rely on a human in the loop to read a sample of responses, evaluate them, and make the final call. But once the system reaches production and real users, it starts to fail. Businesses have a hard time relying on the gut feeling of engineers to promote a demo to production when there are no concrete or customizable evaluation metrics. They often ask questions like:

  • What is the accuracy?
  • How can we measure how frequently, and in which cases, it hallucinates?
  • How do we compare two LLMs for our specific use case?

These questions require objective, specialized metrics built for the task at hand, and those metrics need to be data-driven and repeatable. Google’s GenAI Evaluation Service on Vertex AI is built to solve exactly this problem of task-specific, custom evaluation metrics. It is an enterprise-grade suite of tools designed to quantify the quality of a model’s output, enabling systematic testing, validation, and application monitoring. Its most powerful feature is Adaptive Rubrics, which moves beyond simple scores and into the realm of true unit testing for prompts.

Figure: GenAI Evaluation Service Process

The 4 Pillars of Evaluation

The GenAI Evaluation Service can evaluate a model in four different ways:

  1. Computation-Based Metrics: Useful when a deterministic ground truth is available. The service runs algorithms like ROUGE (for summarization) or BLEU (for translation); a small standalone ROUGE sketch follows this list.
  2. Static Rubrics: Used to evaluate a model against fixed criteria and critical metrics such as Groundedness and Safety.
  3. Model-Based Metrics: The “LLM-as-a-Judge” approach, where a judge model either scores a single response (Pointwise) or picks the better of two responses (Pairwise).
  4. Adaptive Rubrics: The recommended method, which reads the prompt and generates a unique set of pass/fail tests tailored to the specific use case.
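
As a reference point for the first pillar, here is a small, standalone sketch of a computation-based metric using the open-source rouge-score package. It is independent of the Evaluation Service itself and simply illustrates what a deterministic, ground-truth-based metric computes:

# A minimal sketch of a computation-based metric: ROUGE against a known reference.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "Solar and wind capacity grew sharply, making clean energy cheaper."
candidate = "The article says solar and wind power grew, making clean energy more affordable."

# ROUGE-1 compares unigram overlap; ROUGE-L compares the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    # Each score exposes precision, recall, and F-measure.
    print(f"{name}: precision={score.precision:.2f}, recall={score.recall:.2f}, f1={score.fmeasure:.2f}")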

The Adaptive Rubric Feature

The Adaptive Rubric feature is the highlight of the service. Instead of applying a static set of pass/fail test cases, it reads the prompt and dynamically generates a set of pass/fail unit tests that are then applied to the generated response.

Let’s look at the exact example from Google’s documentation. Imagine you give the model this prompt:

User Prompt: “Write a four-sentence summary of the provided article about renewable energy, maintaining an optimistic tone.”

The service’s Rubric Generation step analyzes that prompt and instantly creates a set of specific tests. For this prompt, it might produce the following (a plain-data sketch of such a rubric follows the list):

  • Test Case 1: The response must be a summary of the provided article.
  • Test Case 2: The response must contain exactly four sentences.
  • Test Case 3: The response must maintain an optimistic tone.
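
One way to picture such a generated rubric is as plain data. The structure below is a hypothetical sketch of the concept, not the Evaluation Service’s actual response schema:

# Hypothetical representation of the adaptive rubric generated from the prompt above.
# Field names are illustrative only, not the Evaluation Service's real schema.
generated_rubric = [
    {"id": "summary", "requirement": "The response must be a summary of the provided article."},
    {"id": "four_sentences", "requirement": "The response must contain exactly four sentences."},
    {"id": "optimistic_tone", "requirement": "The response must maintain an optimistic tone."},
]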

Now, your model generates its response:

Model Response: “The article highlights significant growth in solar and wind power. These advancements are making clean energy more affordable. The future looks bright for renewables. However, the report also notes challenges with grid infrastructure.”

This is the result from the Rubric Validation step.

  • Test Case 1 (Summary): Pass. Reason: The response accurately summarizes the main points.
  • Test Case 2 (Four Sentences): Pass. Reason: The response is composed of four distinct sentences.
  • Test Case 3 (Optimistic Tone): Fail. Reason: The final sentence introduces a negative point, which detracts from the optimistic tone.

The final pass rate is 66.7%. This is infinitely more useful than a “4/5” score because you know exactly what to fix.
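
Continuing the sketch above, the validation verdicts can be thought of as pass/fail flags over the rubric items, and the pass rate is simply the fraction of tests that pass (again, the field names are illustrative rather than the service’s schema):

# Hypothetical validation verdicts for the response above (illustrative only).
verdicts = [
    {"id": "summary", "passed": True},
    {"id": "four_sentences", "passed": True},
    {"id": "optimistic_tone", "passed": False,
     "reason": "The final sentence introduces a negative point."},
]

# Pass rate = passed tests / total tests.
pass_rate = sum(v["passed"] for v in verdicts) / len(verdicts)
print(f"Pass rate: {pass_rate:.1%}")  # Pass rate: 66.7%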

How to Run Your First Evaluation (The Code)

The service can be called from your own code via the Vertex AI SDK. The snippet below builds a small prompt dataset, runs inference with Gemini, evaluates the responses, and reshapes the results for display.

from vertexai import Client, types
import pandas as pd

# Create the Vertex AI client. The project ID and location below are
# placeholders; replace them with your own values.
client = Client(project="your-project-id", location="us-central1")

# Build a small evaluation dataset of prompts.
eval_df = pd.DataFrame({
    "prompt": [
        "Explain Generative AI in one line",
        "Why is RAG so important in AI. Explain concisely.",
        "Write a four-line poem about the lily, where the word 'and' cannot be used.",
    ]
})

# Generate responses for each prompt with the model under test.
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-pro",
    src=eval_df,
)
eval_dataset.show()

# Run the Evaluation
eval_result = client.evals.evaluate(dataset=eval_dataset)

# Visualize Results
# Get the data out of the Pydantic model into a dictionary
results_dict = eval_result.model_dump()
key_to_display = 'results_table'

if key_to_display in results_dict:
    # Convert that specific part of the data into a DataFrame
    df = pd.DataFrame(results_dict[key_to_display])
else:
    print(f"Could not find the key '{key_to_display}'.")

summary_df = pd.DataFrame(results_dict['summary_metrics'])

print("--- Summary Metrics ---")
display(summary_df)  # display() is the notebook helper; use print() outside a notebook

# Get the original prompts and responses
inputs_df = results_dict['evaluation_dataset'][0]['eval_dataset_df']

# Parse the nested 'eval_case_results' to get the score for each prompt
parsed_results = []
for case in results_dict['eval_case_results']:
    case_index = case['eval_case_index']

    # Drill down to the score for the 'general_quality_v1' metric.
    # This assumes one candidate response (index [0]).
    metric_result = case['response_candidate_results'][0]['metric_results']['general_quality_v1']

    parsed_results.append({
        'eval_case_index': case_index,
        'score': metric_result['score'],
    })

# Convert the parsed results into their own DataFrame
metrics_df = pd.DataFrame(parsed_results)

# Join the inputs (prompts/responses) with the metrics (scores),
# matching the index of inputs_df with 'eval_case_index' from metrics_df.
final_df = inputs_df.join(metrics_df.set_index('eval_case_index'))

# Display the final, combined table
print("--- Detailed Per-Prompt Results ---")
display(final_df)
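
To circle back to one of the opening questions, comparing two LLMs for a specific use case: one simple approach, reusing the client and eval_df defined above, is to run inference with each candidate model on the same prompts, evaluate each resulting dataset, and compare the summary metrics. The sketch below does exactly that; the second model ID is a placeholder for whichever model you want to compare against, and note that the service’s Pairwise model-based mode offers a more direct head-to-head judgment that this sketch does not use.

# Sketch: compare two candidate models on the same prompt set by evaluating each separately.
# Reuses `client` and `eval_df` from the snippet above; the model IDs are placeholders.
candidate_models = ["gemini-2.5-pro", "gemini-2.5-flash"]

summaries = {}
for model_id in candidate_models:
    # Generate responses for the shared prompts with this model.
    dataset = client.evals.run_inference(model=model_id, src=eval_df)
    # Score the responses with the same evaluation configuration.
    result = client.evals.evaluate(dataset=dataset)
    summaries[model_id] = result.model_dump()["summary_metrics"]

# Compare the aggregate metrics side by side.
for model_id, summary in summaries.items():
    print(f"--- {model_id} ---")
    display(pd.DataFrame(summary))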

Conclusion

Historically, we have relied on gut feeling and subjective human-in-the-loop checks. The GenAI Evaluation Service is a foundational step toward changing that: with Adaptive Rubrics it produces data-driven metrics and turns the fuzzy question of “quality” into an actionable set of unit tests.


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.