Benchmarking Volga’s On-Demand Compute Layer for Feature Serving: Latency, RPS, and Scalability on EKS

Author(s): Andrey Novitskiy

Originally published on Towards AI.


TL;DR Real-time machine learning systems require not only efficient models but also robust infrastructure capable of low-latency feature serving under dynamic load conditions.

In this post, we benchmark Volga’s On-Demand Compute Layer — a core component designed for request-time feature computation — deployed on Kubernetes (EKS) and orchestrated with Ray. To simulate real-world load patterns, we use Locust as the load-testing framework, and Redis as the intermediate storage layer between streaming and serving components.

We evaluate the system’s performance, analyze latency characteristics, and assess horizontal scalability across different load profiles, focusing on how architectural choices affect robustness and efficiency.

Contents:

  • Background
  • Test setup
  • Results and Analysis
  • Conclusion

Background

Volga is a data processing system built for modern AI/ML pipelines, with a focus on real-time feature engineering (read more here). The On-Demand Compute Layer is one of Volga’s two core components (alongside its Streaming Engine), responsible for request-time computation — namely, feature serving and on-demand feature calculation.

At its core, the On-Demand Compute Layer is a stateless service powered by Ray Actors, with each worker running a Starlette server that executes user-defined logic on data produced by the streaming engine (or serves precomputed data). The end-to-end system sits behind an external load balancer and uses a pluggable storage layer as an intermediary between the streaming engine and the on-demand workers.
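
To make that architecture concrete, here is a minimal, hypothetical sketch of the pattern rather than Volga’s actual worker code: a Ray actor that runs a Starlette server (via uvicorn) and answers feature requests from a stand-in for the pluggable storage layer. Class, route, and key names are invented for illustration.

import ray
import uvicorn
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route


@ray.remote
class OnDemandWorkerSketch:
    """Illustrative stand-in for an on-demand worker: one actor, one HTTP server."""

    def __init__(self, port: int):
        self._port = port
        # Stand-in for the pluggable storage layer (Redis in the benchmarks below).
        self._storage = {"test_feature:1": 42.0}

    async def serve(self):
        async def get_feature(request: Request) -> JSONResponse:
            key = request.query_params.get("key", "")
            # In Volga, user-defined logic would run here on data produced by the
            # streaming engine; this sketch just echoes a stored value.
            return JSONResponse({"key": key, "value": self._storage.get(key)})

        app = Starlette(routes=[Route("/features", get_feature)])
        config = uvicorn.Config(app, host="0.0.0.0", port=self._port, log_level="warning")
        await uvicorn.Server(config).serve()


# Usage (blocks while serving):
#   ray.init(); worker = OnDemandWorkerSketch.remote(8000); worker.serve.remote()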

The goal of this post is to evaluate the robustness of the architecture under real-world scenarios, assess On-Demand Layer performance under load, and demonstrate the system’s horizontal scalability.

Test setup

We ran load tests on an Amazon EKS cluster using t2.medium instances (2 vCPUs, 4 GB RAM), hosting both the Locust deployment and the Ray cluster running Volga. Each Ray pod was mapped to a single EKS node to ensure resource isolation.

Volga’s On-Demand Layer was deployed behind an AWS Application Load Balancer, serving as the primary target for Locust workers and routing requests to the EKS nodes hosting Volga pods (where the OS then distributed the load among workers within the same node/pod).

For the code to set it all up, see the volga-ops repo.

Resource sizing

Through experimentation, we determined that:

  • One on-demand worker can handle up to 1,000 RPS, assuming the user-defined functions (UDFs) do not introduce CPU-blocking operations.
  • One Locust worker can generate up to 1,000 RPS without overloading a node.

Since Volga’s workers are lightweight Python processes (single-threaded, GIL-bound), we assume a straightforward 1 worker = 1 CPU mapping when estimating resource usage.
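
As a back-of-the-envelope illustration of this sizing rule (the target RPS below is a hypothetical figure, not one of the benchmark configurations):

import math

RPS_PER_WORKER = 1_000   # upper bound assumed above for non-blocking UDFs
CPUS_PER_NODE = 2        # t2.medium exposes 2 vCPUs

def workers_needed(target_rps: int) -> int:
    return math.ceil(target_rps / RPS_PER_WORKER)

def nodes_needed(num_workers: int) -> int:
    # 1 worker = 1 CPU mapping from the paragraph above
    return math.ceil(num_workers / CPUS_PER_NODE)

target_rps = 10_000                 # hypothetical target
print(workers_needed(target_rps))   # 10 workers
print(nodes_needed(10))             # 5 nodes for the Volga workers alone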

Storage Configuration

We selected Redis as the storage layer between the streaming engine and the on-demand layer for its simplicity and high performance.

Although Redis can be configured with master-replica replication, it lacks strong consistency guarantees, so production environments should consider storage systems such as ScyllaDB, Cassandra, or DynamoDB for better durability and consistency.

In our benchmarks:

  • A single Redis pod was deployed without replication or sharding, prioritizing simplicity and isolating compute performance.
  • The streaming engine was disabled during the tests, with storage populated once at setup time. This approach isolates Volga’s own performance but could marginally impact storage behavior compared to real-world dynamic writes.
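
For illustration, a one-off, setup-time population along these lines could look like the sketch below, writing mock records for the pipeline feature (test_feature, defined in the next section). The key scheme and JSON encoding are assumptions for this example, not Volga’s actual storage format.

import json
import time

import redis

# Connect to the single, unsharded Redis pod (service name assumed).
r = redis.Redis(host="redis", port=6379, decode_responses=True)

# Write mock pipeline-feature records once before the load test starts.
for entity_id in range(1_000):
    record = {"id": str(entity_id), "value": float(entity_id), "timestamp": time.time()}
    r.set(f"test_feature:{entity_id}", json.dumps(record))  # hypothetical key scheme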

Features to calculate/serve

Each benchmark emulated a simple real-time feature pipeline consisting of:

  • test_feature: a pipeline (streaming) feature, populated via periodic mock data writes.
  • simple_feature: a request-time feature depending on test_feature, applying a basic transformation (multiplication), as defined below.

import datetime

# Volga-specific imports (source, on_demand, Connector, MockOnlineConnector, TestEntity)
# are omitted here for brevity.

# emulates a streaming pipeline feature
@source(TestEntity)
def test_feature() -> Connector:
    return MockOnlineConnector.with_periodic_items(
        items=[...],  # sample data
        period_s=0
    )


# emulates a simple linear transformation at request time (multiplication),
# based on data produced by 'test_feature'
@on_demand(dependencies=['test_feature'])
def simple_feature(
    dep: TestEntity,
    multiplier: float = 1.0
) -> TestEntity:
    return TestEntity(
        id=dep.id,
        value=dep.value * multiplier,
        timestamp=datetime.datetime.now()
    )
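
At request time, a client would then query the on-demand layer for simple_feature through the load balancer. A minimal sketch follows; the endpoint path and parameter names are placeholders for illustration, not Volga’s documented API.

import requests

# '<alb-dns-name>' is a placeholder for the AWS ALB address fronting the workers.
resp = requests.get(
    "http://<alb-dns-name>/features",
    params={"feature": "simple_feature", "key": "1", "multiplier": "2.0"},
    timeout=1.0,
)
print(resp.json())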

Results and Analysis

During each load test, we measured:

  • Total RPS (generated by Locust and handled by Volga)
  • On-Demand Layer internal latency
  • Storage read latency
  • End-to-end request latency (from both Volga and Locust perspectives)

We also monitored container CPU utilization via AWS Container Insights to verify maximum node usage.

Each test followed a stepwise load increase, with RPS growing gradually every ~20 seconds over a 3-minute run, until reaching a configured maximum. (Locust’s internal backpressure mechanism halts RPS growth if latency exceeds acceptable thresholds.)
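
One way to express such a step-load profile is Locust’s LoadTestShape; the sketch below is illustrative only, with the endpoint, step sizes, and pacing chosen as assumptions rather than the exact benchmark configuration.

from locust import HttpUser, LoadTestShape, constant_pacing, task


class FeatureUser(HttpUser):
    wait_time = constant_pacing(0.01)  # aim for ~100 requests/s per simulated user

    @task
    def get_feature(self):
        # hypothetical endpoint and parameters for the on-demand layer
        self.client.get("/features", params={"feature": "simple_feature", "key": "1"})


class StepLoad(LoadTestShape):
    step_duration = 20     # add load every ~20 seconds
    step_users = 500       # simulated users added per step (hypothetical)
    total_duration = 180   # 3-minute run

    def tick(self):
        run_time = self.get_run_time()
        if run_time > self.total_duration:
            return None  # stop the test
        step = run_time // self.step_duration + 1
        return int(step * self.step_users), self.step_users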

Max RPS Test

In the largest configuration (80 workers), we achieved:

  • 30,000 RPS peak throughput
  • 50ms p95 end-to-end latency

Figure 1: End-to-end request latencies and storage read latencies during a load test with maximum worker configuration.

Scalability

We conducted tests with 4, 10, 20, 40, 60, and 80 workers, tracking sustainable RPS and corresponding latency metrics.

Figure 2: Sustainable RPS and corresponding latencies for different worker counts, illustrating horizontal scalability.

The results showed:

  • Linear RPS scaling with the number of workers.
  • Stable processing latencies across scaling steps.
  • Storage performance as the primary latency bottleneck, reaffirming Volga’s internal scalability and efficiency.

Conclusion

Volga’s On-Demand Compute Layer delivers a critical component for building real-time AI/ML feature engineering pipelines, significantly reducing the need for custom glue code, ad-hoc data models, and manual API abstractions.

Our benchmarks demonstrate that the On-Demand Layer:

  • Scales horizontally with the number of workers (assuming the storage backend scales appropriately).
  • Maintains sub-50ms p95 and sub-10ms average end-to-end request latencies.
  • Adds minimal compute overhead, with storage access being the primary latency driver.

Volga’s On-Demand Compute Layer shows strong horizontal scalability and efficient real-time serving capabilities. With the right storage backend, it enables reliable low-latency feature pipelines at scale.

To learn more or contribute, visit the Volga GitHub repository. Thanks for reading!

Published via Towards AI


Note: Content contains the views of the contributing authors and not Towards AI.