
Benchmarking Volga’s On-Demand Compute Layer for Feature Serving: Latency, RPS, and Scalability on EKS
Author(s): Andrey Novitskiy
Originally published on Towards AI.

TL;DR: Real-time machine learning systems require not only efficient models but also robust infrastructure capable of low-latency feature serving under dynamic load conditions.
In this post, we benchmark Volga’s On-Demand Compute Layer — a core component designed for request-time feature computation — deployed on Kubernetes (EKS) and orchestrated with Ray. To simulate real-world load patterns, we use Locust as the load-testing framework, and Redis as the intermediate storage layer between streaming and serving components.
We evaluate the system’s performance, analyze latency characteristics, and assess horizontal scalability across different load profiles, focusing on how architectural choices affect robustness and efficiency.
Contents:
- Background
- Test setup
- Results and Analysis
- Conclusion
Background
Volga is a data processing system built for modern AI/ML pipelines, with a focus on real-time feature engineering (read more here). The On-Demand Compute Layer is one of Volga’s two core components (alongside its Streaming Engine), responsible for request-time computation — namely, feature serving and on-demand feature calculation.
At its core, the On-Demand Compute Layer is a stateless service powered by Ray Actors, with each worker running a Starlette server that executes user-defined logic on data produced by the streaming engine (or serves precomputed data). The end-to-end system sits behind an external load balancer and uses a pluggable storage layer as an intermediate between the streaming engine and on-demand workers.
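To make this pattern concrete, here is a minimal sketch of a single worker: a Ray actor hosting a Starlette app that reads data written by the streaming engine and applies a request-time transformation. This is an illustration of the architecture described above, not Volga's actual implementation; the class name, route, and key layout are assumptions.

    # Sketch only: one Ray actor = one single-threaded on-demand worker,
    # serving HTTP via Starlette and reading from a pluggable store (Redis here).
    import ray
    import redis.asyncio as redis
    import uvicorn
    from starlette.applications import Starlette
    from starlette.responses import JSONResponse
    from starlette.routing import Route

    @ray.remote(num_cpus=1)  # lightweight, GIL-bound worker pinned to one CPU
    class OnDemandWorker:
        def __init__(self, redis_url: str, port: int):
            self.store = redis.from_url(redis_url)  # intermediate storage layer
            self.port = port

        async def serve(self):
            async def get_feature(request):
                key = request.path_params["key"]
                raw = await self.store.get(key)  # value produced by the streaming engine
                value = float(raw.decode()) * 2.0 if raw else None  # request-time UDF
                return JSONResponse({"key": key, "value": value})

            app = Starlette(routes=[Route("/features/{key}", get_feature)])
            config = uvicorn.Config(app, host="0.0.0.0", port=self.port, log_level="warning")
            await uvicorn.Server(config).serve()

    # Usage (illustrative):
    # ray.init()
    # workers = [OnDemandWorker.remote("redis://localhost:6379", 8000 + i) for i in range(2)]
    # ray.get([w.serve.remote() for w in workers])  # blocks while workers serve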
The goal of this post is to evaluate the robustness of the architecture under real-world scenarios, assess On-Demand Layer performance under load, and demonstrate the system’s horizontal scalability.
Test setup
We ran load tests on an Amazon EKS cluster using t2.medium instances (2 vCPUs, 4 GB RAM), hosting both the Locust deployment and the Ray cluster running Volga. Each Ray pod was mapped to a single EKS node to ensure resource isolation.
Volga’s On-Demand Layer was deployed behind an AWS Application Load Balancer, serving as the primary target for Locust workers and routing requests to the EKS nodes hosting Volga pods (where the OS then distributed the load among workers within the same node/pod).
For the code to set all of this up, check the volga-ops repo.
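As an illustration of the load-generation side, here is a minimal locustfile sketch targeting the load balancer in front of the on-demand workers. The endpoint path and query parameters are placeholders, not Volga's actual request schema.

    # Minimal locustfile sketch: each simulated user hits the ALB in front of
    # the on-demand workers. The route and parameters below are assumptions.
    import random
    from locust import HttpUser, task, between

    class FeatureUser(HttpUser):
        # host is the ALB DNS name, passed via --host on the locust CLI
        wait_time = between(0.001, 0.01)  # keep users mostly busy to drive high RPS

        @task
        def get_simple_feature(self):
            key = random.randint(0, 999)  # spread requests across stored entities
            self.client.get(
                f"/features/simple_feature?id={key}",
                name="/features/simple_feature",  # group stats under one endpoint name
            )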
Resource sizing
Through experimentation, we determined that:
- One on-demand worker can handle up to 1,000 RPS, assuming the user-defined functions (UDFs) do not introduce CPU-blocking operations.
- One Locust worker can generate up to 1,000 RPS without overloading a node.
Since Volga’s workers are lightweight Python processes (single-threaded, GIL-bound), we assume a straightforward 1 worker = 1 CPU mapping when estimating resource usage.
Storage Configuration
We selected Redis as the storage layer between the streaming engine and the on-demand layer for its simplicity and high performance.
Although Redis can be configured with master-replica replication, it does not provide strong consistency guarantees; production environments should consider storage systems such as ScyllaDB, Cassandra, or DynamoDB for better durability and consistency.
In our benchmarks:
- A single Redis pod was deployed without replication or sharding, prioritizing simplicity and isolating compute performance.
- The streaming engine was disabled during the tests, with storage populated once at setup time (sketched below). This isolates the On-Demand Layer's own performance, but could marginally change storage behavior compared to real-world dynamic writes.
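The following sketch shows the kind of one-off setup write that stood in for the streaming engine during the tests. The key layout and record shape are illustrative assumptions, not Volga's actual storage schema.

    # One-time population of Redis with mock pipeline-feature values,
    # replacing the (disabled) streaming engine for the benchmark.
    import datetime
    import json
    import redis

    r = redis.Redis(host="redis.default.svc.cluster.local", port=6379)  # in-cluster Redis service

    for entity_id in range(1000):
        record = {
            "id": entity_id,
            "value": float(entity_id),
            "timestamp": datetime.datetime.now().isoformat(),
        }
        r.set(f"test_feature:{entity_id}", json.dumps(record))  # precomputed pipeline feature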
Features to calculate/serve
Each benchmark emulated a simple real-time feature pipeline consisting of:
- test_feature: a pipeline (streaming) feature, populated via periodic mock data writes.
- simple_feature: a request-time feature depending on test_feature, applying a basic transformation (multiplication).
# Volga imports for @source, @on_demand, TestEntity, Connector and
# MockOnlineConnector are omitted for brevity
import datetime

# emulates a streaming pipeline feature
@source(TestEntity)
def test_feature() -> Connector:
    return MockOnlineConnector.with_periodic_items(
        items=[...],  # sample data
        period_s=0
    )

# emulates a simple linear transformation (multiplication) at request time,
# based on data produced by 'test_feature'
@on_demand(dependencies=['test_feature'])
def simple_feature(
    dep: TestEntity,
    multiplier: float = 1.0
) -> TestEntity:
    return TestEntity(
        id=dep.id,
        value=dep.value * multiplier,
        timestamp=datetime.datetime.now()
    )
Results and Analysis
During each load test, we measured:
- Total RPS (generated by Locust and handled by Volga)
- On-Demand Layer internal latency
- Storage read latency
- End-to-end request latency (from both Volga and Locust perspectives)
We also monitored container CPU utilization via AWS Container Insights to verify maximum node usage.
Each test followed a stepwise load increase, with RPS growing gradually every ~20 seconds over a 3-minute run, until reaching a configured maximum. (Locust’s internal backpressure mechanism halts RPS growth if latency exceeds acceptable thresholds.)
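One way to express such a stepwise ramp is Locust's LoadTestShape. The sketch below uses placeholder step sizes rather than the exact values from the benchmark.

    # Stepwise ramp: add users every 20 s over a 3-minute run.
    # The step sizes are placeholders, not the benchmark's exact values.
    from locust import LoadTestShape

    class StepLoadShape(LoadTestShape):
        step_time = 20      # seconds per step
        step_users = 1000   # users added at each step (placeholder)
        time_limit = 180    # 3-minute run

        def tick(self):
            run_time = self.get_run_time()
            if run_time > self.time_limit:
                return None  # stop the test
            current_step = run_time // self.step_time + 1
            users = int(current_step * self.step_users)
            return users, self.step_users  # (target user count, spawn rate)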
Max RPS Test
In the largest configuration (80 workers), we achieved:
- 30,000 RPS peak throughput
- 50ms p95 end-to-end latency

Scalability
We conducted tests with 4, 10, 20, 40, 60, and 80 workers, tracking sustainable RPS and corresponding latency metrics.

The results showed:
- Linear RPS scaling with the number of workers.
- Stable processing latencies across scaling steps.
- Storage performance remaining the primary latency bottleneck, reaffirming Volga's internal scalability and efficiency.
Conclusion
Volga’s On-Demand Compute Layer delivers a critical component for building real-time AI/ML feature engineering pipelines, significantly reducing the need for custom glue code, ad-hoc data models, and manual API abstractions.
Our benchmarks demonstrate that the On-Demand Layer:
- Scales horizontally with the number of workers (assuming the storage backend scales appropriately).
- Maintains sub-50ms p95 and sub-10ms average end-to-end request latencies.
- Adds minimal compute overhead, with storage access being the primary latency driver.
Volga’s On-Demand Compute Layer shows strong horizontal scalability and efficient real-time serving capabilities. With the right storage backend, it enables reliable low-latency feature pipelines at scale.
To learn more or contribute, visit the Volga GitHub repository. Thanks for reading!