
Benchmarking Volga’s On-Demand Compute Layer for Feature Serving: Latency, RPS, and Scalability on EKS
Author(s): Andrey Novitskiy
Originally published on Towards AI.

TL;DR: Real-time machine learning systems require not only efficient models but also robust infrastructure capable of low-latency feature serving under dynamic load conditions.
In this post, we benchmark Volga’s On-Demand Compute Layer — a core component designed for request-time feature computation — deployed on Kubernetes (EKS) and orchestrated with Ray. To simulate real-world load patterns, we use Locust as the load-testing framework, and Redis as the intermediate storage layer between streaming and serving components.
We evaluate the system’s performance, analyze latency characteristics, and assess horizontal scalability across different load profiles, focusing on how architectural choices affect robustness and efficiency.
Contents:
- Background
- Test setup
- Results and Analysis
- Conclusion
Background
Volga is a data processing system built for modern AI/ML pipelines, with a focus on real-time feature engineering (read more here). The On-Demand Compute Layer is one of Volga’s two core components (alongside its Streaming Engine), responsible for request-time computation — namely, feature serving and on-demand feature calculation.
At its core, the On-Demand Compute Layer is a stateless service powered by Ray Actors, with each worker running a Starlette server that executes user-defined logic on data produced by the streaming engine (or serves precomputed data). The end-to-end system sits behind an external load balancer and uses a pluggable storage layer as an intermediate between the streaming engine and on-demand workers.
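To make this pattern concrete, here is a minimal sketch of a single worker: a Ray actor hosting a Starlette app that reads data written by the streaming engine and applies a request-time transformation. This is an illustration of the architecture described above, not Volga's actual implementation; the class name, route, and key layout are assumptions.

    # Sketch only: one Ray actor = one single-threaded on-demand worker,
    # serving HTTP via Starlette and reading from a pluggable store (Redis here).
    import ray
    import redis.asyncio as redis
    import uvicorn
    from starlette.applications import Starlette
    from starlette.responses import JSONResponse
    from starlette.routing import Route

    @ray.remote(num_cpus=1)  # lightweight, GIL-bound worker pinned to one CPU
    class OnDemandWorker:
        def __init__(self, redis_url: str, port: int):
            self.store = redis.from_url(redis_url)  # intermediate storage layer
            self.port = port

        async def serve(self):
            async def get_feature(request):
                key = request.path_params["key"]
                raw = await self.store.get(key)  # value produced by the streaming engine
                value = float(raw.decode()) * 2.0 if raw else None  # request-time UDF
                return JSONResponse({"key": key, "value": value})

            app = Starlette(routes=[Route("/features/{key}", get_feature)])
            config = uvicorn.Config(app, host="0.0.0.0", port=self.port, log_level="warning")
            await uvicorn.Server(config).serve()

    # Usage (illustrative):
    # ray.init()
    # workers = [OnDemandWorker.remote("redis://localhost:6379", 8000 + i) for i in range(2)]
    # ray.get([w.serve.remote() for w in workers])  # blocks while workers serve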
The goal of this post is to evaluate the robustness of the architecture under real-world scenarios, assess On-Demand Layer performance under load, and demonstrate the system’s horizontal scalability.
Test setup
We ran load tests on an Amazon EKS cluster using t2.medium instances (2 vCPUs, 4 GB RAM), hosting both the Locust deployment and the Ray cluster running Volga. Each Ray pod was mapped to a single EKS node to ensure resource isolation.
Volga’s On-Demand Layer was deployed behind an AWS Application Load Balancer, serving as the primary target for Locust workers and routing requests to the EKS nodes hosting Volga pods (where the OS then distributed the load among workers within the same node/pod).
For the code to set all of this up, check the volga-ops repo.
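As an illustration of the load-generation side, here is a minimal locustfile sketch targeting the load balancer in front of the on-demand workers. The endpoint path and query parameters are placeholders, not Volga's actual request schema.

    # Minimal locustfile sketch: each simulated user hits the ALB in front of
    # the on-demand workers. The route and parameters below are assumptions.
    import random
    from locust import HttpUser, task, between

    class FeatureUser(HttpUser):
        # host is the ALB DNS name, passed via --host on the locust CLI
        wait_time = between(0.001, 0.01)  # keep users mostly busy to drive high RPS

        @task
        def get_simple_feature(self):
            key = random.randint(0, 999)  # spread requests across stored entities
            self.client.get(
                f"/features/simple_feature?id={key}",
                name="/features/simple_feature",  # group stats under one endpoint name
            )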
Resource sizing
Through experimentation, we determined that:
- One on-demand worker can handle up to 1,000 RPS, assuming the user-defined functions (UDFs) do not introduce CPU-blocking operations.
- One Locust worker can generate up to 1,000 RPS without overloading a node.
Since Volga’s workers are lightweight Python processes (single-threaded, GIL-bound), we assume a straightforward 1 worker = 1 CPU mapping when estimating resource usage.
Storage Configuration
We selected Redis as the storage layer between the streaming engine and the on-demand layer for its simplicity and high performance.
Although Redis can be configured with master-replica replication, it does not provide strong consistency guarantees; production environments should consider storage systems such as ScyllaDB, Cassandra, or DynamoDB for better durability and consistency.
In our benchmarks:
- A single Redis pod was deployed without replication or sharding, prioritizing simplicity and isolating compute performance.
- The streaming engine was disabled during the tests, with storage populated once at setup time (sketched below). This isolates the On-Demand Layer's own performance, but could marginally change storage behavior compared to real-world dynamic writes.
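The following sketch shows the kind of one-off setup write that stood in for the streaming engine during the tests. The key layout and record shape are illustrative assumptions, not Volga's actual storage schema.

    # One-time population of Redis with mock pipeline-feature values,
    # replacing the (disabled) streaming engine for the benchmark.
    import datetime
    import json
    import redis

    r = redis.Redis(host="redis.default.svc.cluster.local", port=6379)  # in-cluster Redis service

    for entity_id in range(1000):
        record = {
            "id": entity_id,
            "value": float(entity_id),
            "timestamp": datetime.datetime.now().isoformat(),
        }
        r.set(f"test_feature:{entity_id}", json.dumps(record))  # precomputed pipeline feature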
Features to calculate/serve
Each benchmark emulated a simple real-time feature pipeline consisting of:
- test_feature: a pipeline (streaming) feature, populated via periodic mock data writes.
- simple_feature: a request-time feature depending on test_feature, applying a basic transformation (multiplication).
# Volga imports for @source, @on_demand, TestEntity, Connector and
# MockOnlineConnector are omitted for brevity
import datetime

# emulates a streaming pipeline feature
@source(TestEntity)
def test_feature() -> Connector:
    return MockOnlineConnector.with_periodic_items(
        items=[...],  # sample data
        period_s=0
    )

# emulates a simple linear transformation (multiplication) at request time,
# based on data produced by 'test_feature'
@on_demand(dependencies=['test_feature'])
def simple_feature(
    dep: TestEntity,
    multiplier: float = 1.0
) -> TestEntity:
    return TestEntity(
        id=dep.id,
        value=dep.value * multiplier,
        timestamp=datetime.datetime.now()
    )
Results and Analysis
During each load test, we measured:
- Total RPS (generated by Locust and handled by Volga)
- On-Demand Layer internal latency
- Storage read latency
- End-to-end request latency (from both Volga and Locust perspectives)
We also monitored container CPU utilization via AWS Container Insights to verify maximum node usage.
Each test followed a stepwise load increase, with RPS growing gradually every ~20 seconds over a 3-minute run, until reaching a configured maximum. (Locust’s internal backpressure mechanism halts RPS growth if latency exceeds acceptable thresholds.)
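One way to express such a stepwise ramp is Locust's LoadTestShape. The sketch below uses placeholder step sizes rather than the exact values from the benchmark.

    # Stepwise ramp: add users every 20 s over a 3-minute run.
    # The step sizes are placeholders, not the benchmark's exact values.
    from locust import LoadTestShape

    class StepLoadShape(LoadTestShape):
        step_time = 20      # seconds per step
        step_users = 1000   # users added at each step (placeholder)
        time_limit = 180    # 3-minute run

        def tick(self):
            run_time = self.get_run_time()
            if run_time > self.time_limit:
                return None  # stop the test
            current_step = run_time // self.step_time + 1
            users = int(current_step * self.step_users)
            return users, self.step_users  # (target user count, spawn rate)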
Max RPS Test
In the largest configuration (80 workers), we achieved:
- 30,000 RPS peak throughput
- 50ms p95 end-to-end latency

Scalability
We conducted tests with 4, 10, 20, 40, 60, and 80 workers, tracking sustainable RPS and corresponding latency metrics.

The results showed:
- Linear RPS scaling with the number of workers.
- Stable processing latencies across scaling steps.
- Storage performance remaining the primary latency bottleneck, reaffirming Volga's internal scalability and efficiency.
Conclusion
Volga’s On-Demand Compute Layer delivers a critical component for building real-time AI/ML feature engineering pipelines, significantly reducing the need for custom glue code, ad-hoc data models, and manual API abstractions.
Our benchmarks demonstrate that the On-Demand Layer:
- Scales horizontally with the number of workers (assuming the storage backend scales appropriately).
- Maintains sub-50ms p95 and sub-10ms average end-to-end request latencies.
- Adds minimal compute overhead, with storage access being the primary latency driver.
Volga’s On-Demand Compute Layer shows strong horizontal scalability and efficient real-time serving capabilities. With the right storage backend, it enables reliable low-latency feature pipelines at scale.
To learn more or contribute, visit the Volga GitHub repository. Thanks for reading!