5 Engineering Strategies to Cut Your AI Infrastructure Costs — Without Sacrificing Performance
Last Updated on May 27, 2026 by Editorial Team
Author(s): Satyajit Patra
Originally published on Towards AI.
The AI industry is pouring $690 billion into infrastructure in 2026. Yet most engineering teams can’t answer a basic question: how much does a single AI-powered feature actually cost to run?

Organizations are overspending 40–60% beyond their AI budgets. Token costs have dropped 280x in two years, but total spending keeps climbing — because usage is growing faster than prices are falling.
The problem isn’t that AI is expensive. The problem is that most teams have zero visibility into where the money goes — and even fewer have systems in place to optimize it.
In this post, I’ll walk through five practical engineering strategies to take control of your AI costs, with the specific tools and code to implement each one.
1. Track Cost Per Task, Not Per Month
Most teams look at one aggregated monthly invoice from OpenAI or AWS and call it a day. That’s like tracking your company’s total electricity bill without knowing which department is using the most power.
Inference now accounts for 55% of all AI infrastructure spending. If you’re not tracking cost at the level of individual models, workflows, and queries, you’re optimizing nothing.
What you need: per-request observability that logs token usage, model, latency, and cost for every single LLM call.
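To make "cost per request" concrete, here is a minimal sketch of the record such an observability layer might store. The `RequestRecord` class, the model names, and the prices in `PRICES_PER_MILLION` are hypothetical placeholders, not real provider rates.

```python
from dataclasses import dataclass

# Hypothetical USD prices per 1M tokens; replace with your provider's current rates.
PRICES_PER_MILLION = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 2.50, "output": 10.00},
}


@dataclass
class RequestRecord:
    """One logged LLM call: what ran, how many tokens, how long it took."""
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        # Cost = tokens * price per token, summed over input and output.
        price = PRICES_PER_MILLION[self.model]
        total = (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        )
        return total / 1_000_000


record = RequestRecord("summarization", "small-model", 1200, 300, 840.0)
print(f"{record.feature}: ${record.cost_usd:.6f}")
```

Storing the raw token counts alongside the computed cost means you can recompute spend later if prices change.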
Tools to use:
- Helicone: A proxy-based observability layer. One line of code to set up. It automatically tracks every request with token counts, cost, and latency. Great for fast setup.
- Langfuse: Open-source, trace-level observability. Ideal if you want deep debugging alongside cost tracking. Breaks down cost by cached_tokens, audio_tokens, image_tokens, etc.
- LangSmith: Best fit if you’re already in the LangChain/LangGraph ecosystem. Provides per-chain cost breakdowns.
Implementation with Helicone (Python):
import openai

# Just change the base URL; everything else stays the same
client = openai.OpenAI(
    api_key="your-openai-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key",
        "Helicone-Property-Feature": "summarization",  # tag by feature
        "Helicone-Property-Environment": "production"
    }
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this document..."}]
)
With this setup, every request is tagged by feature and environment. You can now answer questions like: “How much does our summarization feature cost per day in production?”
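Once requests carry feature and environment tags, that per-day question reduces to a group-by. The following is a minimal sketch, assuming the tagged logs have been exported as simple rows; the `logs` data and the `daily_cost_by_feature` helper are hypothetical and not part of Helicone's API.

```python
from collections import defaultdict
from datetime import date

# Hypothetical exported request logs: (day, feature, environment, cost in USD).
logs = [
    (date(2026, 5, 1), "summarization", "production", 0.0042),
    (date(2026, 5, 1), "summarization", "production", 0.0038),
    (date(2026, 5, 1), "search", "production", 0.0010),
    (date(2026, 5, 1), "summarization", "staging", 0.0050),
    (date(2026, 5, 2), "summarization", "production", 0.0040),
]


def daily_cost_by_feature(rows, environment):
    """Sum cost per (day, feature) for a single environment."""
    totals = defaultdict(float)
    for day, feature, env, cost in rows:
        if env == environment:
            totals[(day, feature)] += cost
    return dict(totals)


totals = daily_cost_by_feature(logs, "production")
for (day, feature), cost in sorted(totals.items()):
    print(day, feature, f"${cost:.4f}")
```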
Implementation with Langfuse (Python):
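from langfuse import Langfuse
from langfuse.decorators import observe  # import path used by the v2 Python SDK

langfuse = Langfuse(
    public_key="your-public-key",
    secret_key="your-secret-key"
)

@observe()
def process_query(user_query: str):
    # call_llm is a placeholder for your own LLM call. The decorator records
    # the trace, inputs, outputs, and latency; token and cost capture depends
    # on the call going through a Langfuse integration or reporting usage.
    response = call_llm(user_query)
    return response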
Langfuse gives you a dashboard with cost breakdowns by trace, user, model, and custom tags — so you can pinpoint which workflows are burning cash.
2. Right-Size Your Models with Smart Routing
Here’s the uncomfortable truth: most teams send every request to their most expensive model. But not every query needs GPT-4o or Claude Opus.
A classification task, a simple extraction, a yes/no validation — these can be handled by smaller, cheaper models at a fraction of the cost, often with the same quality. Smart routing means automatically directing each request to the most cost-effective model that can handle it.
Studies show this approach can reduce costs by 40–85% while maintaining 95%+ output quality.
Tools to use:
- LiteLLM: Open-source AI gateway. Supports 100+ providers with a unified OpenAI-compatible interface. Built-in cost tracking, fallbacks, load balancing, and budget routing.
- Portkey: Managed AI gateway with conditional routing, automatic retries, and real-time cost analytics.
- OpenRouter: Routes across models from multiple providers with transparent pricing.
Implementation with LiteLLM (Python):
from litellm import Router

# Define your model pool with cost priorities
router = Router(
    model_list=[
        {
            "model_name": "cheap-fast",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": "your-key"
            }
        },
        {
            "model_name": "powerful",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": "your-key"
            }
        },
        {
            "model_name": "budget",
            "litellm_params": {
                "model": "claude-haiku-4-5-20251001",
                "api_key": "your-anthropic-key"
            }
        }
    ],
    # Picks the cheapest deployment among those that share a model_name
    routing_strategy="cost-based-routing"
)

# Simple queries go to mini, complex ones get escalated
def route_query(query: str, complexity: str = "low"):
    model = "cheap-fast" if complexity == "low" else "powerful"
    response = router.completion(
        model=model,
        messages=[{"role": "user", "content": query}]
    )
    return response
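The `complexity` argument has to come from somewhere. One inexpensive option is a local heuristic that runs before any model is called. The sketch below is only an illustration: `estimate_complexity`, the keyword list, and the word-count threshold are assumptions to tune against your own traffic, not part of LiteLLM.

```python
import re

# Hypothetical signals that a prompt needs the stronger model.
COMPLEX_HINTS = re.compile(
    r"\b(explain why|step by step|analy[sz]e|compare|prove|refactor|design)\b",
    re.IGNORECASE,
)


def estimate_complexity(query: str, max_simple_words: int = 60) -> str:
    """Return "low" or "high" using cheap, local heuristics."""
    if len(query.split()) > max_simple_words:
        return "high"  # long prompts tend to need more reasoning
    if COMPLEX_HINTS.search(query):
        return "high"  # explicit reasoning or analysis keywords
    return "low"


simple = estimate_complexity("Is this email spam? Answer yes or no.")
hard = estimate_complexity("Compare these two contracts and explain why they differ.")
print(simple, hard)
```

The result can be passed straight to `route_query(query, complexity=...)`. A small classifier model is a common upgrade once the heuristic starts misrouting.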
LiteLLM Proxy (config.yaml) for team-wide routing:
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: sk-xxx
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-xxx
router_settings:
  routing_strategy: cost-based-routing
litellm_settings:
  max_budget: 500  # monthly budget cap in USD
  budget_duration: "30d"
Run with litellm --config config.yaml and your entire team hits one endpoint with automatic cost-based routing and budget guardrails.
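To show what a budget guardrail does conceptually, here is a minimal in-process sketch. It is not how the LiteLLM proxy implements budgets; the `BudgetGuard` class and its method names are hypothetical.

```python
class BudgetGuard:
    """Tracks spend in-process and refuses requests that would exceed a cap."""

    def __init__(self, max_budget_usd: float):
        self.max_budget_usd = max_budget_usd
        self.spent_usd = 0.0

    def can_spend(self, estimated_cost_usd: float) -> bool:
        # Allow the request only if it keeps total spend within the cap.
        return self.spent_usd + estimated_cost_usd <= self.max_budget_usd

    def record(self, actual_cost_usd: float) -> None:
        # Call after each completed request with its real cost.
        self.spent_usd += actual_cost_usd


guard = BudgetGuard(max_budget_usd=500.0)
guard.record(499.99)
allowed_small = guard.can_spend(0.005)
allowed_large = guard.can_spend(0.02)
print(allowed_small, allowed_large)
```

A gateway-level cap has the advantage that it covers every client at once, whereas a guard like this only sees the traffic of one process.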
3. Cache and Batch Aggressively
Two of the easiest wins in AI cost optimization are caching repeated queries and batching non-urgent requests. Most teams do neither.
Semantic Caching converts queries into vector embeddings and checks for similarity against previous requests. If a user asks “What’s our refund policy?” and someone already asked “How do I get a refund?” — the cached response is returned instantly, with no API call.
Results: 30–70% cost reduction, with cache hits returning in under 50ms versus 3–10 seconds for a live GPT-4 call.
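The lookup logic behind a semantic cache can be sketched in a few lines. This is a toy illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the 0.5 threshold is arbitrary, and the cached answer is made up. A word-count vector will not match true paraphrases the way learned embeddings do.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def get(self, query: str):
        """Return the closest cached response, or None on a cache miss."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.put("What is our refund policy?", "Refunds are available within 30 days.")
hit = cache.get("What is the refund policy?")     # close enough: served from cache
miss = cache.get("How do I reset my password?")   # unrelated: falls through to the API
print(hit, miss)
```

The threshold is the key tuning knob: set it too low and users get answers to questions they did not ask, too high and the hit rate collapses.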
Batch Processing bundles multiple requests into a single async job. OpenAI’s Batch API offers a flat 50% discount in exchange for a 24-hour completion window (most jobs finish in 1–6 hours).
Caching with GPTCache (Python):
from gptcache import cache
from gptcache.adapter import openai  # adapter mirrors the pre-1.0 openai interface
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize semantic cache
onnx = Onnx()
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation()
)
cache.set_openai_key()

# First call hits the API
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is our return policy?"}]
)

# Similar query returns the cached result: no API call, no cost
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How do I return a product?"}]
)
Batching with OpenAI Batch API (Python):
import json
import openai

client = openai.OpenAI()

# Placeholder input; replace with the documents you want summarized
documents_to_summarize = ["First document text...", "Second document text..."]

# Step 1: Prepare batch requests as JSONL
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": text}],
        },
    }
    for i, text in enumerate(documents_to_summarize)
]

# Write to JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload and submit the batch (the with-block closes the file handle)
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# Step 3: Check status and retrieve results
status = client.batches.retrieve(batch_job.id)
print(f"Status: {status.status}")  # "completed" when done
This gives you a 50% discount on every request in the batch. Ideal for nightly reports, bulk classification, document processing — anything that doesn’t need a real-time response.
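Once a job reports `completed`, the results arrive as a JSONL file with one line per request, matched back to your inputs by `custom_id`. The sketch below parses that file, assuming each line carries `custom_id`, `error`, and a `response` object with `status_code` and `body`. The `parse_batch_output` helper and the sample data are hypothetical; with a real job, the text would come from downloading the batch's output file, for example via `client.files.content(status.output_file_id).text`.

```python
import json


def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id to its completion text, or None if the request failed."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        response = item.get("response") or {}
        if item.get("error") or response.get("status_code") != 200:
            results[item["custom_id"]] = None
            continue
        body = response["body"]
        results[item["custom_id"]] = body["choices"][0]["message"]["content"]
    return results


# Made-up two-line output file, trimmed to the fields used above.
sample = "\n".join([
    json.dumps({
        "custom_id": "req-0",
        "error": None,
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant", "content": "Summary A"}}]},
        },
    }),
    json.dumps({
        "custom_id": "req-1",
        "error": {"code": "server_error", "message": "failed"},
        "response": None,
    }),
])

summaries = parse_batch_output(sample)
print(summaries)
```

Keeping failed requests in the result as `None` makes it easy to collect their IDs and resubmit them in a follow-up batch.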
4. Factor in Energy, Not Just Dollars
Data centers are projected to consume over 1,000 TWh of electricity in 2026 — more than most countries. Every API call has a carbon footprint. Optimization isn’t just about saving money anymore; it’s a sustainability conversation.
Tracking energy alongside cost helps you make smarter decisions: choosing energy-efficient hardware, scheduling heavy workloads during off-peak hours, and selecting providers with greener infrastructure.
Tools to use:
- CodeCarbon — Open-source Python library that estimates CO2 emissions from compute workloads. Tracks CPU, GPU, and RAM energy usage and converts it to carbon equivalents based on your region’s energy grid.
- ML CO2 Impact — A calculator for estimating the carbon footprint of ML training runs.
- eco2AI — Another open-source tracker that integrates into ML pipelines.
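The conversion these trackers perform boils down to energy multiplied by the carbon intensity of the local grid. Here is a minimal sketch of that arithmetic; the region names, intensity values, and job energy are hypothetical round numbers, and real intensities vary by region and by hour.

```python
# Hypothetical grid carbon intensities in kg CO2e per kWh.
GRID_INTENSITY_KG_PER_KWH = {
    "low-carbon-grid": 0.05,
    "average-grid": 0.40,
    "coal-heavy-grid": 0.80,
}


def estimate_emissions_kg(energy_kwh: float, region: str) -> float:
    """Emissions = energy consumed * carbon intensity of the grid."""
    return energy_kwh * GRID_INTENSITY_KG_PER_KWH[region]


job_energy_kwh = 12.0  # assumed energy for one batch job
for region in GRID_INTENSITY_KG_PER_KWH:
    kg = estimate_emissions_kg(job_energy_kwh, region)
    print(f"{region}: {kg:.2f} kg CO2e")
```

The same job can differ by an order of magnitude in emissions depending on where it runs, which is why region and scheduling choices matter.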
Implementation with CodeCarbon (Python):
from codecarbon import EmissionsTracker
import openai

# Placeholder input; replace with your own documents
documents = ["First document text...", "Second document text..."]

# Track emissions for an entire workflow.
# CodeCarbon measures the machine running this script, so for hosted API
# calls it captures client-side energy only, not the provider's inference.
tracker = EmissionsTracker(
    project_name="ai-summarization-pipeline",
    output_dir="./emissions",
    log_level="warning"
)
tracker.start()

# Your AI workflow
client = openai.OpenAI()
for doc in documents:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {doc}"}]
    )

emissions = tracker.stop()
print(f"Total emissions: {emissions:.6f} kg CO2")
print(f"That's equivalent to {emissions * 2.48:.2f} miles driven")
CodeCarbon generates a CSV log with timestamps, energy consumed (kWh), and CO2 emitted (kg). You can pipe this into your existing dashboards alongside cost data for a complete picture.
Integrating with MLflow:
import mlflow
from codecarbon import EmissionsTracker

with mlflow.start_run():
    tracker = EmissionsTracker()
    tracker.start()

    # Training or inference workload (run_model_pipeline is a placeholder)
    run_model_pipeline()

    emissions = tracker.stop()
    mlflow.log_metric("co2_emissions_kg", emissions)
    # _total_energy is a private attribute and may change between versions
    mlflow.log_metric("energy_consumed_kwh", tracker._total_energy.kWh)
Now your experiment tracking includes carbon cost right next to accuracy and latency.
5. Unify Your Tracking
Most teams track compute costs in AWS, token spend in a separate dashboard, and energy usage nowhere at all. This fragmentation makes it impossible to optimize holistically.
The teams saving the most are the ones looking at cost, energy, and efficiency in a single view — correlating model performance with spend, identifying which workflows give the best ROI, and catching runaway costs before they hit the invoice.
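Catching runaway costs early can start with a very simple rule: compare today's spend with a trailing average. This is a minimal sketch; the `is_runaway` helper, the 2x multiplier, the 7-day window, and the spend history are all hypothetical choices.

```python
from statistics import mean


def is_runaway(daily_spend_usd, today_usd, multiplier=2.0, window=7):
    """Flag today if it exceeds `multiplier` times the trailing-window average."""
    recent = daily_spend_usd[-window:]
    if not recent:
        return False  # no baseline yet, so nothing to compare against
    return today_usd > multiplier * mean(recent)


history = [41.0, 39.5, 44.2, 40.1, 42.7, 38.9, 43.6]  # hypothetical daily totals in USD
normal_day = is_runaway(history, today_usd=45.0)
spike_day = is_runaway(history, today_usd=130.0)
print(normal_day, spike_day)
```

Running the same check per feature or per team, rather than on the total, usually surfaces the cause as well as the spike.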
Tools to use:
- Finout — Ingests costs from OpenAI, Anthropic, AWS SageMaker, GCP Vertex AI, and Azure into a unified “MegaBill” alongside all cloud and Kubernetes spend. One source of truth for engineering and finance.
- Amnic — Real-time AI cost analytics with per-feature and per-team allocation.
- Cloudchipr — Multi-cloud cost visibility extended to AI workloads, with alerting and automation.
[Figure: conceptual architecture for a unified cost dashboard]
Practical unified tracking with Langfuse + custom metrics:
from langfuse import Langfuse
from codecarbon import EmissionsTracker

langfuse = Langfuse()

# call_llm and count_tokens are placeholders for your own helpers
def tracked_llm_call(prompt: str, feature: str):
    # Track carbon (measured on the machine running this code)
    tracker = EmissionsTracker(log_level="error")
    tracker.start()

    # Create Langfuse trace with cost + feature tags
    trace = langfuse.trace(name=feature, metadata={"team": "product"})
    generation = trace.generation(
        name="llm-call",
        model="gpt-4o-mini",
        input=prompt
    )

    # Make the actual call
    response = call_llm(prompt)
    emissions = tracker.stop()

    # Log everything in one trace
    generation.end(
        output=response,
        usage={"input": count_tokens(prompt), "output": count_tokens(response)},
        # _total_energy is a private attribute and may change between versions
        metadata={"co2_kg": emissions, "energy_kwh": tracker._total_energy.kWh}
    )
    return response
This gives you cost, carbon, and performance data in a single trace — queryable, dashboardable, and actionable.
The Bottom Line
AI cost optimization isn’t about using fewer tokens or picking the cheapest model. It’s about building systems that give you visibility, control, and the ability to make informed trade-offs across cost, energy, and efficiency.
The companies that win in AI won’t be the ones spending the most. They’ll be the ones spending the smartest.
If you found this useful, follow me for more on AI engineering, cost optimization, and building production AI systems.
Published via Towards AI
Note: Article content contains the views of the contributing authors and not Towards AI.