5 Engineering Strategies to Cut Your AI Infrastructure Costs — Without Sacrificing Performance

Last Updated on May 27, 2026 by Editorial Team

Author(s): Satyajit Patra

Originally published on Towards AI.

The AI industry is pouring $690 billion into infrastructure in 2026. Yet most engineering teams can’t answer a basic question: how much does a single AI-powered feature actually cost to run?

Organizations are overspending 40–60% beyond their AI budgets. Token costs have dropped 280x in two years, but total spending keeps climbing — because usage is growing faster than prices are falling.

The problem isn’t that AI is expensive. The problem is that most teams have no visibility into where the money goes, and fewer still have systems in place to optimize it.

In this post, I’ll walk through five practical engineering strategies to take control of your AI costs, with the specific tools and code to implement each one.

1. Track Cost Per Task, Not Per Month

Most teams look at one aggregated monthly invoice from OpenAI or AWS and call it a day. That’s like tracking your company’s total electricity bill without knowing which department is using the most power.

Inference now accounts for 55% of all AI infrastructure spending. If you’re not tracking cost at the level of individual models, workflows, and queries, you’re optimizing nothing.

What you need: per-request observability that logs token usage, model, latency, and cost for every single LLM call.

Tools to use:

Helicone — A proxy-based observability layer. One line of code to set up. It automatically tracks every request with token counts, cost, and latency. Great for fast setup.
Langfuse — Open-source, trace-level observability. Ideal if you want deep debugging alongside cost tracking. Breaks down cost by cached_tokens, audio_tokens, image_tokens, etc.
LangSmith — Best fit if you’re already in the LangChain/LangGraph ecosystem. Provides per-chain cost breakdowns.

Implementation with Helicone (Python):

import openai

# Just change the base URL; everything else stays the same
client = openai.OpenAI(
    api_key="your-openai-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key",
        "Helicone-Property-Feature": "summarization",  # tag by feature
        "Helicone-Property-Environment": "production",
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this document..."}],
)

With this setup, every request is tagged by feature and environment. You can now answer questions like: “How much does our summarization feature cost per day in production?”

Implementation with Langfuse (Python):

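from langfuse import Langfuse
from langfuse.decorators import observe  # SDK v2 import path; newer SDK versions may expose observe elsewhere

langfuse = Langfuse(
    public_key="your-public-key",
    secret_key="your-secret-key",
)


@observe()
def process_query(user_query: str):
    # call_llm is a placeholder for your own LLM call.
    # The decorator records the trace and latency; token and cost capture
    # depends on the call going through a Langfuse integration
    # (for example, its OpenAI wrapper).
    response = call_llm(user_query)
    return response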

Langfuse gives you a dashboard with cost breakdowns by trace, user, model, and custom tags — so you can pinpoint which workflows are burning cash.
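
To make concrete what these tools record for you, here is a minimal sketch of per-request cost accounting. The price table, the RequestLog record, and the cost_by_feature helper are illustrative assumptions, not part of Helicone or Langfuse, and the per-token prices are placeholders rather than current list prices.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; substitute your provider's current rates.
PRICES_PER_MILLION = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 2.50, "output": 10.00},
}


@dataclass
class RequestLog:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int


def request_cost(log: RequestLog) -> float:
    """Cost of one request in USD, from token counts and the price table."""
    price = PRICES_PER_MILLION[log.model]
    return (
        log.input_tokens * price["input"] + log.output_tokens * price["output"]
    ) / 1_000_000


def cost_by_feature(logs: list[RequestLog]) -> dict[str, float]:
    """Sum request costs per feature tag."""
    totals: dict[str, float] = defaultdict(float)
    for log in logs:
        totals[log.feature] += request_cost(log)
    return dict(totals)


logs = [
    RequestLog("summarization", "small-model", 2_000, 500),
    RequestLog("summarization", "large-model", 2_000, 500),
    RequestLog("classification", "small-model", 400, 10),
]
totals = cost_by_feature(logs)
for feature, usd in totals.items():
    print(f"{feature}: ${usd:.6f}")
```

Grouping by feature rather than by invoice line is what lets you compare the same workload across models, as the two summarization rows do here.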

2. Right-Size Your Models with Smart Routing

Here’s the uncomfortable truth: most teams send every request to their most expensive model. But not every query needs GPT-4o or Claude Opus.

A classification task, a simple extraction, a yes/no validation — these can be handled by smaller, cheaper models at a fraction of the cost, often with the same quality. Smart routing means automatically directing each request to the most cost-effective model that can handle it.

Studies show this approach can reduce costs by 40–85% while maintaining 95%+ output quality.
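
Where a saving in that range comes from is simple arithmetic. The sketch below assumes hypothetical per-request costs for a cheap and an expensive model; the function names and numbers are illustrative only.

```python
def blended_cost(share_to_cheap: float, cheap_cost: float, expensive_cost: float) -> float:
    """Average cost per request when a share of traffic goes to the cheaper model."""
    if not 0.0 <= share_to_cheap <= 1.0:
        raise ValueError("share_to_cheap must be between 0 and 1")
    return share_to_cheap * cheap_cost + (1.0 - share_to_cheap) * expensive_cost


def savings_fraction(share_to_cheap: float, cheap_cost: float, expensive_cost: float) -> float:
    """Saving relative to sending every request to the expensive model."""
    baseline = expensive_cost
    return 1.0 - blended_cost(share_to_cheap, cheap_cost, expensive_cost) / baseline


# Hypothetical costs: $0.001 per request on the small model, $0.01 on the large one.
saving = savings_fraction(share_to_cheap=0.7, cheap_cost=0.001, expensive_cost=0.01)
print(f"Saving: {saving:.0%}")
```

The saving depends on two things you can measure: the price gap between models and the share of traffic the cheaper model can handle at acceptable quality.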

Tools to use:

LiteLLM — Open-source AI gateway. Supports 100+ providers with a unified OpenAI compatible interface. Built-in cost tracking, fallbacks, load balancing, and budget routing.
Portkey — Managed AI gateway with conditional routing, automatic retries, and real-time cost analytics.
OpenRouter — Routes across models from multiple providers with transparent pricing.

Implementation with LiteLLM (Python):

from litellm import Router

# Define your model pool
router = Router(
    model_list=[
        {
            "model_name": "cheap-fast",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": "your-key",
            },
        },
        {
            "model_name": "powerful",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": "your-key",
            },
        },
        {
            "model_name": "budget",
            "litellm_params": {
                "model": "claude-haiku-4-5-20251001",
                "api_key": "your-anthropic-key",
            },
        },
    ],
    # Picks the cheapest deployment among those that share a model_name;
    # choosing between the groups above is done in route_query below.
    routing_strategy="cost-based-routing",
)


# Simple queries go to mini, complex ones get escalated
def route_query(query: str, complexity: str = "low"):
    model = "cheap-fast" if complexity == "low" else "powerful"
    response = router.completion(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return response
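
The route_query function above takes the complexity label as an argument, so something has to produce it. One low-cost option is a local heuristic; the sketch below is an assumption for illustration (the keyword list, the word limit, and the estimate_complexity name are all hypothetical), and a production router might use a small classifier model instead.

```python
import re

# Hypothetical signals that a prompt needs the larger model.
COMPLEX_HINTS = re.compile(
    r"\b(explain|analy[sz]e|compare|design|prove|refactor|step[- ]by[- ]step)\b",
    re.IGNORECASE,
)


def estimate_complexity(query: str, max_simple_words: int = 40) -> str:
    """Return "low" or "high" using cheap, local heuristics (no model call)."""
    word_count = len(query.split())
    if word_count > max_simple_words:
        return "high"  # long prompts are treated as complex
    if COMPLEX_HINTS.search(query):
        return "high"  # reasoning-style verbs suggest a harder task
    return "low"


simple = estimate_complexity("Is this email spam? Answer yes or no.")
hard = estimate_complexity("Compare these two architectures and explain the trade-offs.")
```

It would be wired in as route_query(q, complexity=estimate_complexity(q)). Heuristics like this misclassify some requests, so it is worth sampling routed traffic and checking output quality before trusting the savings.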

LiteLLM Proxy (config.yaml) for team-wide routing:

model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: sk-xxx

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-xxx

router_settings:
  routing_strategy: cost-based-routing

litellm_settings:
  max_budget: 500  # budget cap in USD for the whole proxy
  budget_duration: 30d  # budget resets every 30 days

Run with litellm --config config.yaml and your entire team hits one endpoint with automatic cost-based routing and budget guardrails.

3. Cache and Batch Aggressively

Two of the easiest wins in AI cost optimization are caching repeated queries and batching non-urgent requests. Most teams do neither.

Semantic Caching converts queries into vector embeddings and checks for similarity against previous requests. If a user asks “What’s our refund policy?” and someone already asked “How do I get a refund?” — the cached response is returned instantly, with no API call.

Results: 30–70% cost reduction, with cache hits returning in under 50ms versus 3–10 seconds for a live GPT-4 call.
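
The lookup logic behind a semantic cache fits in a few lines. This is a toy sketch under stated assumptions: toy_embed is a word-count stand-in for a real embedding model, and the SemanticCache class and its 0.8 threshold are hypothetical, not GPTCache's implementation.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: word counts. A real system would use an embedding model."""
    return dict(Counter(re.findall(r"[a-z0-9']+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector] = toy_embed, threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the answer stored for the most similar query, if it is close enough."""
        query_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))


cache = SemanticCache()
cache.put("what is our refund policy", "cached answer about refunds")
hit = cache.get("what is our refund policy please")  # similar wording: served from cache
miss = cache.get("how do I reset my password")  # unrelated: falls through to the API
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, set it too high and the hit rate (and the saving) drops.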

Batch Processing bundles multiple requests into a single async job. OpenAI’s Batch API offers a flat 50% discount in exchange for a 24-hour completion window (most jobs finish in 1–6 hours).

Caching with GPTCache (Python):

from gptcache import cache
from gptcache.adapter import openai  # GPTCache's adapter, which mirrors the pre-1.0 openai interface
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize semantic cache
onnx = Onnx()
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base)

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

# First call hits the API
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is our return policy?"}],
)

# A sufficiently similar query returns the cached result: no API call, no cost
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How do I return a product?"}],
)

Batching with OpenAI Batch API (Python):

import json

import openai

client = openai.OpenAI()

# Placeholder input: replace with your own documents
documents_to_summarize = ["First document text...", "Second document text..."]

# Step 1: Prepare batch requests as JSONL
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": text}],
        },
    }
    for i, text in enumerate(documents_to_summarize)
]

# Write to JSONL file, one request per line
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload and submit the batch
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Step 3: Check status (poll until it reports "completed")
status = client.batches.retrieve(batch_job.id)
print(f"Status: {status.status}")

This gives you a 50% discount on every request in the batch. Ideal for nightly reports, bulk classification, document processing — anything that doesn’t need a real-time response.
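
Once the job completes, the results arrive as a JSONL file with one line per request. The sketch below shows one way to turn that file into a lookup keyed by custom_id; the parse_batch_output helper and the sample lines are hypothetical, hand-written in the shape of the Batch API output rather than taken from a real run.

```python
import json


def parse_batch_output(jsonl_text: str) -> tuple[dict[str, str], dict[str, object]]:
    """Split batch output lines into completions and errors, keyed by custom_id."""
    results: dict[str, str] = {}
    errors: dict[str, object] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            errors[custom_id] = record.get("error") or response
            continue
        results[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return results, errors


# Hand-written sample lines (illustrative, not real API output).
sample = "\n".join([
    json.dumps({
        "custom_id": "req-0",
        "error": None,
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant", "content": "Summary A"}}]},
        },
    }),
    json.dumps({
        "custom_id": "req-1",
        "error": {"code": "example_error", "message": "failed"},
        "response": None,
    }),
])
results, errors = parse_batch_output(sample)
```

In a real run the text would typically come from client.files.content(status.output_file_id).text after the status reports "completed". Keying by custom_id matters because output lines are not guaranteed to follow input order.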

4. Factor in Energy, Not Just Dollars

Data centers are projected to consume over 1,000 TWh of electricity in 2026 — more than most countries. Every API call has a carbon footprint. Optimization isn’t just about saving money anymore; it’s a sustainability conversation.

Tracking energy alongside cost helps you make smarter decisions: choosing energy-efficient hardware, scheduling heavy workloads during off-peak hours, and selecting providers with greener infrastructure.

Tools to use:

CodeCarbon — Open-source Python library that estimates CO2 emissions from compute workloads. Tracks CPU, GPU, and RAM energy usage and converts it to carbon equivalents based on your region’s energy grid.
ML CO2 Impact — A calculator for estimating the carbon footprint of ML training runs.
eco2AI — Another open-source tracker that integrates into ML pipelines.
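
The conversion these trackers perform boils down to energy multiplied by the carbon intensity of the local grid. Here is a minimal sketch of that arithmetic; the intensity values, the power draw, and the function names are illustrative assumptions, not measurements or any tracker's internals.

```python
# Hypothetical grid carbon intensities in kg CO2 per kWh; real values vary by region and hour.
GRID_INTENSITY = {
    "low-carbon-grid": 0.05,
    "average-grid": 0.40,
    "coal-heavy-grid": 0.80,
}


def energy_kwh(avg_power_watts: float, hours: float) -> float:
    """Energy used by a device drawing avg_power_watts for the given duration."""
    return avg_power_watts * hours / 1000.0


def emissions_kg(energy: float, grid: str) -> float:
    """Estimated emissions for that energy on the named grid."""
    return energy * GRID_INTENSITY[grid]


# Assumed workload: one accelerator averaging 300 W for 10 hours.
job_energy = energy_kwh(avg_power_watts=300.0, hours=10.0)
for grid in GRID_INTENSITY:
    print(f"{grid}: {emissions_kg(job_energy, grid):.2f} kg CO2")
```

The same job can differ by an order of magnitude in emissions depending on where and when it runs, which is why region and scheduling choices show up alongside hardware efficiency.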

Implementation with CodeCarbon (Python):

from codecarbon import EmissionsTracker
import openai

# Placeholder input: replace with your own documents
documents = ["First document text...", "Second document text..."]

# Track emissions for an entire workflow
tracker = EmissionsTracker(
    project_name="ai-summarization-pipeline",
    output_dir="./emissions",
    log_level="warning",
)

tracker.start()

# Your AI workflow
client = openai.OpenAI()
for doc in documents:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {doc}"}],
    )

emissions = tracker.stop()
print(f"Total emissions: {emissions:.6f} kg CO2")
print(f"That's equivalent to {emissions * 2.48:.2f} miles driven")

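CodeCarbon generates a CSV log with timestamps, energy consumed (kWh), and CO2 emitted (kg). You can pipe this into your existing dashboards alongside cost data for a complete picture. Keep in mind that CodeCarbon measures the machine running the code, so for hosted API calls like the ones above it captures client-side energy only, not the provider's inference. It is most informative for self-hosted models and training jobs.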

Integrating with MLflow:

import mlflow
from codecarbon import EmissionsTracker

with mlflow.start_run():
    tracker = EmissionsTracker()
    tracker.start()

    # Training or inference workload (run_model_pipeline is a placeholder)
    run_model_pipeline()

    emissions = tracker.stop()

    mlflow.log_metric("co2_emissions_kg", emissions)
    # _total_energy is a private attribute and may change between CodeCarbon versions
    mlflow.log_metric("energy_consumed_kwh", tracker._total_energy.kWh)

Now your experiment tracking includes carbon cost right next to accuracy and latency.

5. Unify Your Tracking

Most teams track compute costs in AWS, token spend in a separate dashboard, and energy usage nowhere at all. This fragmentation makes it impossible to optimize holistically.

The teams saving the most are the ones looking at cost, energy, and efficiency in a single view — correlating model performance with spend, identifying which workflows give the best ROI, and catching runaway costs before they hit the invoice.

Tools to use:

Finout — Ingests costs from OpenAI, Anthropic, AWS SageMaker, GCP Vertex AI, and Azure into a unified “MegaBill” alongside all cloud and Kubernetes spend. One source of truth for engineering and finance.
Amnic — Real-time AI cost analytics with per-feature and per-team allocation.
Cloudchipr — Multi-cloud cost visibility extended to AI workloads, with alerting and automation.

[Figure: conceptual architecture for building a unified dashboard; image not available]

Practical unified tracking with Langfuse + custom metrics:

from langfuse import Langfuse
from codecarbon import EmissionsTracker

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment


def tracked_llm_call(prompt: str, feature: str):
    # Track carbon (measures the local machine only)
    tracker = EmissionsTracker(log_level="error")
    tracker.start()

    # Create Langfuse trace with cost + feature tags
    # (trace/generation calls follow the Langfuse Python SDK v2 API)
    trace = langfuse.trace(name=feature, metadata={"team": "product"})
    generation = trace.generation(
        name="llm-call",
        model="gpt-4o-mini",
        input=prompt,
    )

    # Make the actual call (call_llm is a placeholder for your own client code)
    response = call_llm(prompt)

    emissions = tracker.stop()

    # Log everything in one trace (count_tokens is a placeholder tokenizer helper)
    generation.end(
        output=response,
        usage={"input": count_tokens(prompt), "output": count_tokens(response)},
        # _total_energy is a private CodeCarbon attribute and may change between versions
        metadata={"co2_kg": emissions, "energy_kwh": tracker._total_energy.kWh},
    )

    return response

This gives you cost, carbon, and performance data in a single trace — queryable, dashboardable, and actionable.
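
With per-call records in one place, the single view and the runaway-cost check described earlier become a small aggregation. The following is a sketch under stated assumptions: the CallRecord fields, the sample numbers, and the budget values are hypothetical, and in practice the records would be exported from your tracing tool rather than written by hand.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    feature: str
    cost_usd: float
    co2_kg: float
    latency_s: float


def summarize(records: list[CallRecord]) -> dict[str, dict[str, float]]:
    """Roll per-call records up into one row per feature."""
    rows: dict[str, dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "cost_usd": 0.0, "co2_kg": 0.0, "latency_s": 0.0}
    )
    for record in records:
        row = rows[record.feature]
        row["calls"] += 1
        row["cost_usd"] += record.cost_usd
        row["co2_kg"] += record.co2_kg
        row["latency_s"] += record.latency_s
    for row in rows.values():
        # Replace the latency total with a per-call average
        row["avg_latency_s"] = row.pop("latency_s") / row["calls"]
    return dict(rows)


def over_budget(summary: dict[str, dict[str, float]], budgets: dict[str, float]) -> list[str]:
    """Features whose total cost exceeds their budget; features without a budget never alert."""
    return sorted(
        feature
        for feature, row in summary.items()
        if row["cost_usd"] > budgets.get(feature, float("inf"))
    )


# Illustrative records for one day.
records = [
    CallRecord("summarization", cost_usd=0.02, co2_kg=0.001, latency_s=1.0),
    CallRecord("summarization", cost_usd=0.04, co2_kg=0.002, latency_s=3.0),
    CallRecord("search", cost_usd=0.005, co2_kg=0.0002, latency_s=0.5),
]
summary = summarize(records)
alerts = over_budget(summary, budgets={"summarization": 0.05, "search": 0.05})
```

Running the check on a schedule, rather than at invoice time, is what turns the unified view into an early warning.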

The Bottom Line

AI cost optimization isn’t about using fewer tokens or picking the cheapest model. It’s about building systems that give you visibility, control, and the ability to make informed trade-offs across cost, energy, and efficiency.

The companies that win in AI won’t be the ones spending the most. They’ll be the ones spending the smartest.
