
Unlocking AI Through a Financial Lens (Part 1)
Last Updated on April 17, 2025 by Editorial Team
Author(s): Yi H.
Originally published on Towards AI.

Compute and infrastructure for an AI company serve as the backbone of its operations — essential and COSTLY.
Let’s discuss and uncover:
– The core financial cost drivers of its key components
– Decision-making metrics to evaluate the system as a whole
Let’s crunch the basics and numbers!
What comprises the compute and infrastructure system? What are its cost drivers?
GPUs Cost Driver: ⭐⭐⭐⭐⭐ Costly
Multi-GPU (Graphics Processing Units) clusters are often used to distribute the intensive computational workload involved in training and inference.
Capital Expenditure: Costly Upfront Purchase
One of the top costs. A single high-performance, industrial-grade GPU like the NVIDIA H100 ($25k–$45k from a hardware vendor) or H200 ($30k+, and higher) comes with a premium price tag.
Enterprise bulk purchase terms can offer lower prices (still expensive!), and the cost also depends on configuration as well as supply-chain markups and fluctuations.
Operational Costs: Major Energy Consumption
Given its heavy computational nature, energy costs add up quickly.
Running an NVIDIA H200 Tensor Core GPU (max Thermal Design Power, TDP, of 700W, source: NVIDIA) for 24 hours at California’s 2025 electricity rate (32 cents/kWh, source: energysage) costs 0.7 kW × 24 hrs × 0.32 USD/kWh ≈ 5.38 USD daily.
Major companies operating hundreds of thousands of GPUs can easily exceed $500K daily in energy costs alone, with the resulting CO2 impact.
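The per-GPU and fleet-level figures above can be reproduced with a small sketch. The function below uses the article's numbers (700W max TDP, $0.32/kWh); the 100,000-GPU fleet size is an assumption for illustration, and real draw is usually below max TDP:

```python
# Daily energy cost for GPUs running at full TDP.
# Figures from the article: H200 max TDP of 700 W, California's
# 2025 rate of $0.32/kWh. Fleet size below is hypothetical.
def daily_energy_cost(tdp_watts: float, rate_per_kwh: float, gpus: int = 1) -> float:
    """USD cost of running `gpus` GPUs at full TDP for 24 hours."""
    kwh_per_day = (tdp_watts / 1000) * 24  # watts -> kW, times 24 hours
    return kwh_per_day * rate_per_kwh * gpus

print(f"${daily_energy_cost(700, 0.32):.2f}")            # single H200: $5.38/day
print(f"${daily_energy_cost(700, 0.32, 100_000):,.0f}")  # 100k-GPU fleet: $537,600/day
```

At 100,000 GPUs the daily bill already crosses the $500K mark the article cites, which is why energy efficiency is tracked as a first-class metric later in this piece.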
Cloud Pricing
Renting from cloud infrastructure providers is usage-based, with H200 rates ranging from $2–$10/hr (source: vast.ai), plus additional fees for networking, storage, and data transfer.
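A rough buy-vs-rent comparison follows from the two price points above. This sketch deliberately ignores energy, hosting, depreciation, and resale value, and assumes a $30k H200 purchase price (the article's lower bound):

```python
# Naive buy-vs-rent break-even: how many rental hours equal the
# purchase price. Ignores energy, hosting, and depreciation.
def breakeven_hours(purchase_price: float, cloud_rate_per_hr: float) -> float:
    return purchase_price / cloud_rate_per_hr

h200_price = 30_000           # assumed purchase price (USD), per the article's "$30k+"
for rate in (2.0, 10.0):      # cloud H200 range cited: $2-$10/hr
    hrs = breakeven_hours(h200_price, rate)
    print(f"${rate}/hr -> break-even after {hrs:,.0f} hours (~{hrs / 8760:.1f} years)")
```

At $10/hr the purchase pays for itself in a few months of continuous use; at $2/hr it takes well over a year, which is why sustained, high-utilization workloads tend to favor ownership.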
CPUs Cost Driver: ⭐⭐⭐⭐ Costly
High-performance CPUs handle orchestration, pre-processing, and system management, supporting GPU-driven computations.
Purchase Cost: Material
Like GPUs, CPUs also have a significant purchase price, especially in multi-core configurations for servers.
While the standalone price is not disclosed, the NVIDIA Grace Hopper MGX system, equipped with NVIDIA’s Grace CPU, is listed at $65k+ (source: hyperscalers).
Operational Cost: Material but Lower Energy Consumption than GPUs
In industrial settings, high-performance CPUs can be power-hungry, some with a TDP of ~350W, about half of the H200 GPU’s 700W max TDP.
One of the most powerful CPUs mentioned above, NVIDIA’s Grace CPU, has a TDP of 500W including memory (source: NVIDIA).
The CPU-to-GPU ratio is typically far below 1: a single Grace CPU’s 128 PCIe Gen 5 lanes can support up to 8 H200 GPUs over PCIe Gen 5 x16 interfaces.
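The lane arithmetic behind that ratio is simple enough to sketch. The numbers are the article's (128 lanes, x16 per GPU); real systems also reserve lanes for NICs and NVMe, so the result is an upper bound:

```python
# Upper bound on GPUs per CPU from the PCIe lane budget.
# Real servers reserve some lanes for NICs/NVMe, so treat this as a ceiling.
def max_gpus_per_cpu(cpu_lanes: int, lanes_per_gpu: int = 16) -> int:
    return cpu_lanes // lanes_per_gpu

print(max_gpus_per_cpu(128))  # Grace's 128 Gen 5 lanes / x16 -> 8
```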
Storage Systems and Networking Cost Driver: ⭐⭐⭐ Costly
Efficient storage systems are essential for managing datasets, model weights, logs, and other operational data.
On-Premise Storage: cost depends on capacity (TBs) and redundancy mechanisms (RAID configurations).
SSDs and high-performance storage solutions like NVMe drives come with higher upfront costs, such as Samsung’s 9100 Pro series, priced at ~$550 for 4TB (~$137.5/TB).
Cloud Storage: Cloud storage providers (AWS, Google Cloud Storage) typically charge based on stored data volume, data retrieval frequency, and the region where data is stored.
Tiered pricing models based on access frequency (hot, cold, archive storage) are common.
For example, AWS S3 Standard Storage charges approximately $0.023 per GB (~$23/TB/month) for the first 50 TB per month, with reduced rates for higher usage.
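The on-prem and cloud storage figures above can be put side by side. This sketch uses the article's numbers (~$0.023/GB-month for S3 Standard, ~$137.5/TB for NVMe); the 5-year amortization period and decimal TB conversion are assumptions, and it omits power, redundancy, and operations costs that on-prem storage incurs:

```python
# Monthly storage cost sketch: S3 Standard vs. amortized on-prem NVMe.
# Assumptions: decimal TB (1 TB = 1000 GB), 5-year drive life, and no
# on-prem power/redundancy/ops costs included.
def s3_monthly_cost(tb: float, rate_per_gb: float = 0.023) -> float:
    return tb * 1000 * rate_per_gb

def nvme_amortized_monthly(tb: float, price_per_tb: float = 137.5,
                           life_months: int = 60) -> float:
    return tb * price_per_tb / life_months

tb = 50  # the S3 Standard first-tier boundary cited above
print(f"S3 Standard:  ${s3_monthly_cost(tb):,.0f}/month")          # $1,150/month
print(f"On-prem NVMe: ${nvme_amortized_monthly(tb):,.2f}/month")   # $114.58/month
```

The raw-drive cost looks an order of magnitude cheaper, but the gap narrows once redundancy (RAID), servers, power, and staffing are priced in, which is the trade-off the section describes.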
Networking Cost Driver: ⭐⭐ Costly
A fast and high-bandwidth network is crucial for ensuring that all parts of the LLM infrastructure communicate effectively.
In cloud environments, costs depend on data transfer volumes.
For example, AWS charges $0.09 per GB transferred out of storage in most US regions after the first 100 GB per month (source: AWS).
On-premises setups incur expenses for maintaining network hardware and provisioning network capacity, such as 10GbE or 100GbE.
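The AWS egress example above translates into a one-line formula. The rate and free tier are the article's figures; actual AWS egress pricing varies by region and tiers down at higher volumes, which this sketch ignores:

```python
# Cloud egress cost sketch using the article's AWS figure:
# $0.09/GB out after a 100 GB/month free tier (most US regions).
# Real pricing tiers down at high volume; this uses a single flat rate.
def egress_cost(gb_out: float, rate: float = 0.09, free_gb: float = 100) -> float:
    return max(0.0, gb_out - free_gb) * rate

print(f"${egress_cost(50):.2f}")      # within the free tier -> $0.00
print(f"${egress_cost(10_000):.2f}")  # 10 TB out -> $891.00
```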
Financial Metrics to Monitor, from Servers to Data Centers
Managing everything from individual servers to data centers housing vast compute and infrastructure requires continuous resource monitoring, alongside tracking customer satisfaction and retention.
What metrics should we be using?
First, Metrics for Computational Efficiency.
- FLOPS (Floating-Point Operations per Second): including both peak theoretical FLOPS and actual achieved FLOPS.
- Petaflop/s-day: performing 10¹⁵ (peta) operations per second continuously for one day, totaling ~10²⁰ operations; a standard unit for quantifying the computational effort of training runs.
Over the past 15 years, hardware computational performance has grown 41% annually, doubling every 2 years in 16-bit and 32-bit FLOPS (source: epoch.ai).
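To make the petaflop/s-day unit concrete, the conversion can be sketched alongside a widely used rule of thumb (not from this article) that training a transformer costs roughly 6 × parameters × tokens FLOPs; the 7B-parameter, 1T-token model below is purely hypothetical:

```python
# Convert a training run's total FLOPs into petaflop/s-days.
# One petaflop/s-day = 1e15 FLOP/s x 86,400 s ~ 8.64e19 operations.
PFS_DAY = 1e15 * 86_400

def pfs_days(total_flops: float) -> float:
    return total_flops / PFS_DAY

# Common approximation (not from the article): transformer training
# costs ~6 x parameters x tokens FLOPs. Hypothetical 7B model, 1T tokens.
params, tokens = 7e9, 1e12
print(f"{pfs_days(6 * params * tokens):.0f} petaflop/s-days")  # ~486
```

Dividing by a cluster's sustained (not peak) petaflop/s then gives a wall-clock training estimate, which ties this unit back to the GPU cost figures earlier in the article.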

Second, the Energy Cost Efficiency Metrics.
- FLOPS/Watt and FLOPS/$ (Floating-Point Operations per Second per Watt, or per dollar): depend on the GPU’s architecture and power efficiency. Based on data from GPUs launched between 2006 and 2021, FLOPS/$ doubles every ~2.5 years (source: epoch.ai).
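The cited ~2.5-year doubling time can be turned into a quick projection, useful when deciding whether to buy hardware now or wait a generation. This is a trend extrapolation, not a guarantee:

```python
# Project FLOPS/$ improvement from the cited trend: doubling every
# ~2.5 years (epoch.ai, GPUs launched 2006-2021). Extrapolation only.
def flops_per_dollar_multiplier(years: float, doubling_years: float = 2.5) -> float:
    return 2 ** (years / doubling_years)

print(flops_per_dollar_multiplier(5))   # -> 4.0 (two doublings)
print(flops_per_dollar_multiplier(10))  # -> 16.0
```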

What about User Engagement vs. System Throughput Performance?
While DAU (Daily Active Users) and MAU (Monthly Active Users) are important, we also focus on how user engagement and usage impact compute and infrastructure efficiency. To measure this, we could use metrics like the ones below:
- Per-User QPS (Queries Per Second per User): the average number of queries each active user generates per second. It helps monitor individual demand and system load for capacity planning and performance optimization.
- Per Machine/Cluster/Rack QPS: the average number of queries processed per second by each machine, cluster, or rack. This metric evaluates infrastructure efficiency and load distribution.
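The two QPS metrics above combine directly into a capacity-planning estimate. All the numbers in this sketch (user count, per-user QPS, per-machine QPS, utilization target) are hypothetical:

```python
import math

# Capacity planning from the two QPS metrics above: machines needed
# to serve a user base at a target utilization. All inputs hypothetical.
def machines_needed(active_users: int, per_user_qps: float,
                    per_machine_qps: float, headroom: float = 0.7) -> int:
    """Machines required, targeting `headroom` (e.g. 70%) utilization."""
    total_qps = active_users * per_user_qps
    return math.ceil(total_qps / (per_machine_qps * headroom))

# e.g. 1M active users at 0.005 QPS each = 5,000 QPS total;
# 50 QPS per machine at 70% target utilization -> 143 machines.
print(machines_needed(1_000_000, 0.005, 50))  # -> 143
```

The headroom parameter is the design choice here: running machines at 100% leaves no room for traffic spikes, so fleets are typically sized against a lower utilization target.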
❤️ Thank you for taking the time to read my article ❤️
Any feedback=🎁 Gift to me.
I have more AI × Finance stories lined up. More to come!
Note: Content contains the views of the contributing authors and not Towards AI.