AgentCrewOps — Part 1 — Agents for builders: goals, gotchas, and a practical starting stack
Last Updated on October 13, 2025 by Editorial Team
Author(s): Subramanian Mayakkaruppan
Originally published on Towards AI.
Series: Part 1 — Goals, gotchas, starting stack · Part 2 — Deploy the Lite stack
TL;DR: Agents aren’t just APIs with bigger prompts. They’re stateful, stochastic programs that call tools, spend real money, and can quietly drift. This series shows a lean way to build, observe, evaluate, and safely deploy agents — using one container and a few managed services — so small teams can move fast without surprises.
Who this is for (and not for)
This is for engineers who’ve shipped APIs, ETL/ELT, or ML features and want to add agents safely.
This is not for folks hunting a giant MLOps platform or a new orchestration framework review.
What is an “agent” (in builder terms)?
Think of an agent as a loop that:
- reads a goal and context,
- decides what to do next,
- calls a tool (search, DB, API, email),
- looks at the result, and
- repeats until done.
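In code, that loop is only a few lines. Here is a minimal sketch (not the production service; call_llm and the tools dict are placeholders you would wire to a real model and real capabilities):
# agent_loop.py (sketch)
def run_agent(goal: str, tools: dict, call_llm, max_steps: int = 8):
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        # decide what to do next (the model returns either a final answer or a tool call)
        decision = call_llm(history)
        if "final" in decision:
            return decision["final"]
        # call the chosen tool, look at the result, repeat
        result = tools[decision["tool"]](**decision.get("args", {}))
        history.append({"role": "tool", "content": str(result)})
    return {"status": "gave_up", "reason": "max_steps reached"}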
Two important differences from classic services:
- Stochastic: same input can lead to different paths; single-run results aren’t reliable.
- Side effects & cost: tool calls can change state or burn dollars (API usage, LLM tokens).
Why “AgentOps” matters (even for small teams)
If you’ve run APIs, you already worry about latency, errors, and cost. With agents, add:
- Quality drift (a prompt tweak improves one case, breaks three others).
- Tool risk (a bad step issues refunds or sends emails to customers).
- Observability gaps (you need to see the decision chain, not just a 500).
- Variance (a lucky pass can hide a real regression).
So we’ll set up a tiny, opinionated ops layer that treats agent quality as a distribution, adds basic guardrails, and ships like normal software.
Minimal mental model
- One spec (agent.yaml) declares model, tools, and limits.
- One policy (policy.yaml) declares safety/budget rules (PII redaction, approvals, caps).
- Traces (OpenTelemetry) make the agent’s decisions debuggable.
- Evals run the same tasks N times to catch variance and drift.
- CI gates block risky changes.
- Human approval is required for deploys and any high-impact tool use.
If you’ve built APIs: treat this like SRE for agents — but with a couple of extra meters (quality & cost per success).
The starter stack (no Kubernetes)
- One container: API + minimal UI + small cron jobs
- Managed Postgres: run history & costs
- Object storage: artifacts & eval reports
- Managed APM: traces/metrics (OpenTelemetry in, dashboards/alerts out)
- GitHub Actions: build → eval → security → deploy
- Hosted LLMs first; add one managed GPU endpoint later if you need private inference

- API Gateway (Web App Firewall): the front door. TLS, auth, rate limits.
- AgentCrewOps Lite Service (API + UI): runs agents, serves a simple “runs” page, emits traces/metrics, posts alerts.
- Managed Postgres (DB): run records: status, steps, tokens, cost, links to artifacts.
- Object Store (GCS, S3, etc.): big files (prompts, outputs, HTML eval reports).
- APM (Application Performance Monitoring): managed telemetry backend (Grafana Cloud, Datadog, etc.).
- Slack: alerts (error spikes, drift failures, budget breaches).
- Outbound Proxy or NAT — Egress: the only way out to the internet; enforces allowlists/headers for model calls.
- Model provider(s): hosted LLMs now; optional private GPU later.
- GitHub Repo + PR + Actions: your normal SDLC, plus eval & safety gates.
- Ops agents:
– Watcher reads telemetry, posts concise alerts.
– Canary runs small evals nightly and on PRs (N samples per case), writes a report.
– Guardian runs safety checks in CI (prompt-injection, bad output handling).
– Fixer opens PRs with minimal proposed changes; never merges.
Why not Kubernetes? Overhead. Horizontal scaling on a managed container runtime is enough for v0.
Why hosted models first? Lower operational risk; I’ll add a single private endpoint only if needed (long contexts, strict data residency, or latency).
Why OTel now? Retrofitting traces later is painful; naming spans up front keeps everything comparable.
Step-by-step flow

Production call
- Client → API Gateway → AgentCrewOps Service
- Service loads spec + policy (pinned version)
- Writes a Run row in Managed Postgres DB; starts trace
- Calls LLM via Egress; calls allowed tools
- Saves artifacts (inputs/outputs/report) to Object Store
- Returns structured output; UI links to the trace and report
- If thresholds trip, posts a Slack alert
Change pipeline
- Dev or Fixer opens PR (code/spec/policy)
- Actions: Build → Tests → Eval (N-sample) → Security
- Human reviews report + trace links + budgets
- On green + approved: Deploy new container
Nightly
- Canary runs suite (N-sample), writes HTML report
- Compares vs baseline; if drift ↑, pings Slack and suggests a PR
Tiny, readable examples
Data model (just enough)
- Run: run_id, agent_name, start_ts, end_ts, status, tokens, cost_usd, manifest_fingerprint, artifact_uri.
- Step: run_id, index, role, tool, latency_ms, retry_count, policy_flags, span_id.
- EvalCase: suite, case_id, input, oracle/evaluator, pass, artifacts.
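As a sketch in Python (dataclasses standing in for the actual tables; the types are assumptions, the field names mirror the list above):
# models.py (sketch)
from dataclasses import dataclass, field

@dataclass
class Run:
    run_id: str
    agent_name: str
    start_ts: float
    end_ts: float | None
    status: str                  # queued | running | succeeded | failed
    tokens: int
    cost_usd: float
    manifest_fingerprint: str    # hash of the pinned agent.yaml + policy.yaml
    artifact_uri: str

@dataclass
class Step:
    run_id: str
    index: int
    role: str                    # plan | tool | critic | retry
    tool: str | None
    latency_ms: int
    retry_count: int
    policy_flags: list[str] = field(default_factory=list)
    span_id: str = ""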
What an agent spec looks like (safe defaults)
# agent.yaml
agent:
  name: "support-triage"
  model: "openai:gpt-4.1-mini"
  temperature: [0.0, 0.4]
  max_steps: 8
  tools:
    - name: "kb.search"        # read-only
      allow: ["read"]
    - name: "billing.refund"   # high-impact: needs approval
      allow: ["create"]
      approval: { required: true }
What the policy looks like (guardrails & budgets)
# policy.yaml
policy:
  pii_redaction: ["email", "phone"]
  output_contract: { type: "json", schema_ref: "schemas/triage.json" }
  spend:
    per_run_usd_max: 0.25
    per_day_usd_max: 30
  domains_allowlist: ["*.your-company.com"]
Think of agent.yaml like an OpenAPI-lite for agent behavior; policy.yaml is your runtime gatekeeper (redaction, budgets, domain rules).
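A sketch of how the service might load both files and enforce the spend caps at runtime (PyYAML assumed; the function names are illustrative, not a fixed API):
# config.py (sketch)
import yaml  # PyYAML

def load_config(spec_path="agent.yaml", policy_path="policy.yaml"):
    with open(spec_path) as f:
        spec = yaml.safe_load(f)["agent"]
    with open(policy_path) as f:
        policy = yaml.safe_load(f)["policy"]
    return spec, policy

def check_spend(policy: dict, run_cost_usd: float, day_cost_usd: float) -> None:
    # raise before the next LLM/tool call if either budget cap is breached
    caps = policy["spend"]
    if run_cost_usd > caps["per_run_usd_max"]:
        raise RuntimeError(f"per-run budget exceeded: ${run_cost_usd:.2f}")
    if day_cost_usd > caps["per_day_usd_max"]:
        raise RuntimeError(f"daily budget exceeded: ${day_cost_usd:.2f}")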
Quality as a distribution (why we do N-sample evals)
With classic APIs, one green check often means “it works.”
With agents, a single pass can be luck. So we run each case N times (start with N=10), then compare to a baseline:
- If success rate drops and the confidence interval excludes zero, we flag drift.
- We also watch p95 latency and $ per successful task to catch cost regressions.
We don’t need fancy stats to begin — just stable cases and N runs per case.
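Here is a minimal sketch of that comparison: each list holds 0/1 pass results for the same cases (N runs each), and we bootstrap the difference in success rate using only the standard library.
# drift_check.py (sketch)
import random

def bootstrap_ci_of_diff(baseline: list[int], candidate: list[int],
                         iters: int = 2000, alpha: float = 0.05):
    """95% CI for (candidate success rate - baseline success rate)."""
    diffs = []
    for _ in range(iters):
        b = [random.choice(baseline) for _ in baseline]
        c = [random.choice(candidate) for _ in candidate]
        diffs.append(sum(c) / len(c) - sum(b) / len(b))
    diffs.sort()
    return diffs[int(alpha / 2 * iters)], diffs[int((1 - alpha / 2) * iters) - 1]

def drifted(baseline: list[int], candidate: list[int]) -> bool:
    # flag drift only when the whole interval sits below zero (a clear drop)
    _, hi = bootstrap_ci_of_diff(baseline, candidate)
    return hi < 0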
Safety, simplified
- Least privilege: tools start read-only; anything that writes/spends needs approval.
- Security pack (CI): prompt-injection probes + output-handling checks.
- PII redaction: scrub email/phone before logs/traces leave the app.
- Egress allowlist: model calls only to known domains with required headers.
- No agent self-deploys: Fixer opens PRs; humans approve.
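A sketch of the redaction step (the regex patterns for email and phone are assumptions; the point is to scrub text before it reaches logs or spans):
# redact.py (sketch)
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

# redact_pii("reset password for alice@example.com")
# -> "reset password for [EMAIL]"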
Minimal metrics that actually help
Track just a few numbers first:
- success_rate (from evals)
- p95_latency_ms (from traces)
- tokens_total and cost_usd_total (from provider responses)
- Derived: $ per successful task (for cost control)
Alert when: error rate ↑, p95 ↑, or $/success ↑ beyond a known band.
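The derived number and the alert check fit in a few lines; a sketch (the band values are placeholders you would tune per agent):
# cost_watch.py (sketch)
def cost_per_success(cost_usd_total: float, successes: int) -> float | None:
    return cost_usd_total / successes if successes else None

def should_alert(value: float | None, band: tuple[float, float] = (0.05, 0.40)) -> bool:
    # alert when $/success leaves the known-good band, or when nothing succeeded at all
    return value is None or not (band[0] <= value <= band[1])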
Glossary for newcomers
- Agent: an LLM-driven program that decides actions and calls tools.
- Tool: a capability the agent can call (search, DB query, API).
- Spec/Policy: the what (spec) and the guardrails (policy).
- Drift: quality changes over time (often unintended).
- Canary: small, automated evals that catch regressions early.
- APM: managed telemetry backend for traces/metrics/alerts.
- WAF: web application firewall — the front door for your API/UI.
Minimal API surface (to make it real)
I’m starting with one endpoint and one background job.
# api.py (FastAPI stub)
from fastapi import FastAPI
from pydantic import BaseModel
import uuid

app = FastAPI()

class RunRequest(BaseModel):
    input: str
    agent: str = "support-triage"

class RunResponse(BaseModel):
    run_id: str
    status: str

@app.post("/run", response_model=RunResponse)
def run(req: RunRequest):
    run_id = str(uuid.uuid4())
    # TODO: enqueue job, write Run row, emit OTel span
    return RunResponse(run_id=run_id, status="queued")
CLI stub (shell):
# run a single task locally
curl -s -X POST http://localhost:8000/run \
-H 'content-type: application/json' \
-d '{"input":"reset password for alice@example.com"}'
Observability baseline (metrics + spans)
- Spans: agent.plan, tool.call, tool.result, agent.critic, agent.retry.
- Metrics: agent_task_success_total, agent_task_latency_seconds (histogram), agent_tokens_total, agent_cost_usd_total.
- Derived dashboard tile: $ per successful task.
I’ll start by exporting OTel to a managed backend to avoid running collectors.
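A sketch of emitting one of those spans with the OpenTelemetry Python SDK (exporter setup omitted since the managed backend handles it; the attribute names are my own convention, not a standard):
# otel_spans.py (sketch)
from opentelemetry import trace

tracer = trace.get_tracer("agentcrewops")

def traced_tool_call(tool_name: str, fn, **kwargs):
    # wrap a tool invocation in a "tool.call" span so the decision chain is visible
    with tracer.start_as_current_span("tool.call") as span:
        span.set_attribute("tool.name", tool_name)
        result = fn(**kwargs)
        span.set_attribute("tool.result_chars", len(str(result)))
        return result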
Evaluation baseline (N=10, human-readable)
- Suite: a handful of realistic tasks.
- N samples per case (default 10) to catch variance.
- Report: single HTML file with pass/fail, links to traces, and diffs vs baseline.
- CI gate: block PR if success drops with a simple statistical check (start with a bootstrap CI, keep it understandable).
Sample eval case:
{"case_id":"refund_15_small",
"input":"please refund $15 for order #123",
"evaluator":"json_schema + string match",
"success":"amount==15 and contains('refund processed')"}
Decision log
- Chose single container over Kubernetes.
- Picked hosted models first; postpone private inference.
- Committed to OTel from day one; standardized span names.
- Set N=10 for initial evals; will bump to N=30 for risky changes.
- Scoped policy.yaml to PII redaction + spend caps + output contracts.
Open questions I’ll answer in later posts:
- Do I need a tiny vector store now, or can I fake it with a dict cache?
- Where exactly to put approval hooks for “write” tools (before or after tool selection)?
- What’s the minimum useful safety pack without false alarms?
Risks I accept (for now)
- Hosted models mean vendor latency/availability risk. I’ll mitigate with retries + partial caching.
- N=10 evals won’t catch rare failures; that’s fine for now, since the nightly canary + traces should surface patterns.
- Minimal UI means more CLI use early on; I’ll add a run table + trace links soon.
What’s next (Part 2)
We’ll stand up the container + Postgres + object storage + APM on cloud (GCP Cloud Run), wire basic traces/metrics, and add a tiny eval suite you can run locally and in CI.
Next up: Part 2 — Deploy the Lite stack