Every company running LLMs has the same story. They start with a prototype. Costs are trivial — a few dollars a day. Then the prototype becomes a product, the product gets agents, and the agents get autonomy. By month three, someone in finance asks why the OpenAI bill jumped from $200 to $14,000.
The standard response? Add a cost monitoring dashboard. Track tokens per model, per user, per day. Pipe it into Datadog or Grafana. Set up alerts.
Here's the problem: monitoring tells you what happened. It doesn't prevent what's about to happen.
When your agent decides to summarize 500 documents at 3 AM, a Slack alert at 3:01 AM doesn't help. The money is already gone. You need enforcement — not observation.
The LLM Cost Management Landscape Today
Most LLM cost management approaches fall into three categories, each with significant blind spots:
1. Provider Dashboards (OpenAI, Anthropic, Google)
Every LLM provider gives you a usage page. OpenAI shows tokens consumed by model. Anthropic shows spend per API key. Google shows per-project billing.
The limitation is structural: provider dashboards show aggregate spend, not attribution. You know you spent $3,000 on GPT-4o last Tuesday. You don't know which agent, which user, or which workflow caused it. When five teams share one API key — and they always do — the dashboard is useless for accountability.
OpenAI's usage tiers and spending limits help at the account level. But account-level limits are a sledgehammer. When your support agent hits the cap, your code-generation agent goes down too. There's no granularity.
2. Observability Platforms (LangSmith, Helicone, Portkey)
The next tier up is purpose-built LLM observability. These tools proxy your API calls and track token usage, latency, cost per trace, and model performance. They're genuinely useful for debugging and optimization.
But they share a fundamental design choice: they sit in the observation path, not the enforcement path. They record what happened. They don't block what shouldn't happen.
Some offer "budget alerts" — when spend crosses a threshold, they notify you. But notification is not enforcement. Between the alert firing and a human reading their Slack, the agent has already made another 200 calls. At $0.06 per GPT-4o request, that's $12 more in the 30 seconds it took you to read the message.
3. Cloud Billing Controls (AWS Budgets, GCP Quotas)
If you're self-hosting models on cloud infrastructure, you have cloud-native cost controls. AWS Budgets can alert or trigger Lambda functions. GCP quotas can cap API usage.
These are blunt instruments for LLM workloads. Cloud billing operates on hourly or daily cycles. An autonomous agent can burn through $1,000 in GPU time in 10 minutes. By the time the billing cycle catches up, the damage is done.
More critically, cloud billing controls don't understand what the spend is for. They see compute hours, not "Agent X called the translation API 4,000 times because it got stuck in a retry loop."
Why Monitoring Fails When Agents Hold the Wallet
The gap between monitoring and enforcement becomes catastrophic when AI agents are autonomous. Here's the core issue:
Traditional software: A human decides to make an API call. Monitoring shows the human's behavior. The human self-regulates.
Agent software: An agent decides to make API calls — potentially thousands — based on its own reasoning. Monitoring shows the agent's behavior. But the agent doesn't read dashboards. It doesn't self-regulate based on cost. It optimizes for its goal.
This is the fundamental asymmetry. Monitoring assumes a human in the loop who will react to the data. Agents remove that human. Without enforcement at the infrastructure layer, you're relying on prompt engineering ("please don't spend too much") as your cost control mechanism.
That's not a strategy. That's hope.
What Real LLM Cost Management Looks Like
Effective LLM cost management requires four capabilities that monitoring alone can't provide:
1. Pre-Call Budget Checks
Before every LLM call, the system checks: does this agent have budget remaining? Not after the call. Not in a batch job tonight. Before the tokens flow.
```
# Agent requests tool call
POST /v1/chat/completions
Authorization: Bearer macaroon_v1_agent42_budget500

# Gateway checks budget BEFORE proxying
→ Agent 42 remaining budget: 340 credits
→ Estimated cost of gpt-4o call: 15 credits
→ Budget sufficient: ALLOW

# If budget exhausted:
→ Agent 42 remaining budget: 8 credits
→ Estimated cost: 15 credits
→ HTTP 402 Payment Required
→ {"error": "budget_exhausted", "remaining": 8, "required": 15}
```

The agent gets a structured error it can handle. It can switch to a cheaper model, ask the user for more budget, or gracefully stop. It doesn't crash. It doesn't retry into infinity.
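As a rough sketch of that agent-side handling, the logic might look like the following. This is illustrative only: the exception class, cost table, and `make_call` stub are assumptions for the example, not part of SatGate's API.

```python
class BudgetExhausted(Exception):
    """Raised when the gateway returns HTTP 402."""
    def __init__(self, remaining, required):
        self.remaining = remaining
        self.required = required

def call_with_fallback(make_call, models):
    """Try models from most to least capable; fall back when budget runs out."""
    for model in models:
        try:
            return make_call(model)
        except BudgetExhausted:
            continue  # a cheaper model may still fit the remaining budget
    return None  # gracefully stop instead of retrying into infinity

# Simulated gateway: only gpt-4o-mini (1 credit) fits an 8-credit budget.
COSTS = {"gpt-4o": 15, "gpt-4o-mini": 1}
remaining = 8

def make_call(model):
    if COSTS[model] > remaining:
        raise BudgetExhausted(remaining, COSTS[model])
    return f"response from {model}"

result = call_with_fallback(make_call, ["gpt-4o", "gpt-4o-mini"])
# result is the gpt-4o-mini response; the 402 never becomes a crash
```

The fallback order encodes the agent's degradation policy: most capable model first, cheapest model last, stop cleanly when nothing fits.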
2. Per-Agent, Per-Tool Granularity
Account-level limits punish everyone when one agent misbehaves. Real cost management operates at the granularity that matters:
- Per agent: Research Agent gets 1,000 credits/day. Code Agent gets 5,000.
- Per tool: GPT-4o calls cost 15 credits. GPT-4o-mini costs 1 credit. DALL-E costs 50.
- Per user: Free tier users get 100 credits. Enterprise gets 10,000.
- Per workflow: The "quarterly report" workflow gets a 500-credit budget per execution.
```yaml
tools:
  defaultCost: 1
  costs:
    gpt-4o: 15
    gpt-4o-mini: 1
    claude-3-opus: 25
    claude-3-haiku: 1
    dall-e-3: 50
    web_search: 5
    database_query: 3
```

This isn't a rate limit. It's an economic policy. The agent can make as many calls as it wants — until the money runs out. Fast calls, slow calls, bursty calls — doesn't matter. The budget is the budget.
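The check-and-deduct step behind such a cost table can be sketched in a few lines. Assume a per-tool cost map like the config above; the function name and return shape are illustrative, not SatGate's internals.

```python
# Per-tool costs, mirroring the config above; unknown tools get the default.
TOOL_COSTS = {
    "gpt-4o": 15, "gpt-4o-mini": 1, "claude-3-opus": 25,
    "claude-3-haiku": 1, "dall-e-3": 50, "web_search": 5,
    "database_query": 3,
}
DEFAULT_COST = 1

def check_and_deduct(budgets, agent, tool):
    """Return (allowed, remaining). Deducts only when the call is allowed."""
    cost = TOOL_COSTS.get(tool, DEFAULT_COST)
    remaining = budgets.get(agent, 0)
    if remaining < cost:
        return False, remaining   # gateway would answer HTTP 402 here
    budgets[agent] = remaining - cost
    return True, budgets[agent]

budgets = {"agent-42": 20}
print(check_and_deduct(budgets, "agent-42", "gpt-4o"))   # (True, 5)
print(check_and_deduct(budgets, "agent-42", "gpt-4o"))   # (False, 5)
```

Note the asymmetry with a rate limiter: there is no time window anywhere in this function. Only the remaining balance matters.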
3. Real-Time Attribution
When the CFO asks why AI spend tripled last month, you need an answer better than "usage went up." Real attribution means:
- Agent X spent 4,200 credits on Tuesday processing the backlog
- Team Y's agents averaged 800 credits/day, up from 300
- The customer-support workflow accounts for 62% of total LLM spend
- User Z's agents hit budget limits 14 times (indicating under-provisioned budgets)
Attribution is the bridge between engineering and finance. Without it, LLM costs are an opaque line item that nobody owns and everybody blames.
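Attribution reduces to grouping per-call usage records by whatever dimension finance cares about. The record shape below is an assumption for illustration, not SatGate's actual log format.

```python
from collections import defaultdict

# Hypothetical per-call usage records emitted by the gateway.
records = [
    {"agent": "research", "workflow": "backlog", "credits": 4200},
    {"agent": "support", "workflow": "customer-support", "credits": 6200},
    {"agent": "research", "workflow": "backlog", "credits": 800},
]

def spend_by(records, key):
    """Total credits grouped by an attribution dimension (agent, workflow, ...)."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += r["credits"]
    return dict(totals)

by_agent = spend_by(records, "agent")
by_workflow = spend_by(records, "workflow")
total = sum(r["credits"] for r in records)
support_share = by_agent["support"] / total  # fraction of spend per agent
```

The same records answer the engineering question ("which agent is in a retry loop?") and the finance question ("which workflow owns this line item?").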
4. Delegation Without Escalation
In multi-agent systems, agents delegate tasks to sub-agents. Without proper cost management, delegation creates unbounded spend chains:
Orchestrator Agent (budget: 10,000) → spawns Research Agent → spawns 5 Scraper Agents → each spawns a Summarizer Agent. Suddenly a dozen agents are spending from a single budget with no individual limits.
With capability-based budgets, the orchestrator delegates a portion of its budget to each sub-agent. The research agent gets 2,000 credits. Each scraper gets 200. Summarizers get 50. The total can never exceed the parent's allocation. It's hierarchical, cryptographically enforced, and impossible to game.
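The invariant is easy to model: a parent can only hand out credits it actually holds, so no subtree can ever exceed the root allocation. This toy class is an illustration of the accounting; SatGate enforces the real thing cryptographically via macaroon attenuation, not via in-process objects.

```python
class AgentBudget:
    """Toy model of hierarchical budget delegation."""
    def __init__(self, name, credits):
        self.name = name
        self.credits = credits
        self.children = []

    def delegate(self, name, credits):
        if credits > self.credits:
            raise ValueError("cannot delegate more than remaining budget")
        self.credits -= credits  # parent gives up exactly what it delegates
        child = AgentBudget(name, credits)
        self.children.append(child)
        return child

orchestrator = AgentBudget("orchestrator", 10_000)
research = orchestrator.delegate("research", 2_000)
scrapers = [research.delegate(f"scraper-{i}", 200) for i in range(5)]
summarizers = [s.delegate(f"summarizer-{i}", 50)
               for i, s in enumerate(scrapers)]
# orchestrator keeps 8,000; research keeps 1,000; each scraper keeps 150
```

Because delegation subtracts from the parent, the sum of all budgets in the tree never exceeds the orchestrator's original 10,000, no matter how deep the spawn chain goes.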
The Economic Firewall Approach
SatGate implements these four capabilities as an economic firewall — a gateway-layer enforcement mechanism that sits between your agents and the LLM providers they call.
The architecture is simple: every API call passes through the gateway. The gateway checks the caller's budget (encoded in a macaroon token), deducts the cost, and either proxies the request or returns HTTP 402. No SDK changes. No prompt engineering. No "please be careful with costs."
```shell
# Mint a budget-capped token for an agent
satgate mint \
  --budget 1000 \
  --tools "gpt-4o:15,gpt-4o-mini:1,web_search:5" \
  --expires 24h \
  --holder "research-agent-prod"

# The agent uses this token for all API calls
# Gateway enforces the budget automatically
# No code changes in the agent
```

The key insight: cost management should be infrastructure, not application logic. Just like you don't ask each microservice to implement its own TLS — you terminate TLS at the gateway — you shouldn't ask each agent to implement its own budget tracking.
Monitoring + Enforcement: Not Either/Or
To be clear: monitoring is still valuable. You need dashboards to understand spending patterns, optimize model selection, and forecast costs. The mistake is treating monitoring as sufficient.
The right architecture has both:
- Enforcement layer (gateway): Prevents overspend in real time. Hard limits that agents can't exceed.
- Monitoring layer (observability): Analyzes spend patterns. Identifies optimization opportunities. Informs budget allocation decisions.
Think of it like a credit card. The bank sets a credit limit (enforcement). You check your statement monthly (monitoring). Both matter. But if you had to choose one, you'd choose the limit — because that's what prevents the catastrophic outcome.
Getting Started
If you're managing LLM costs today, here's a pragmatic path forward:
- Audit your current spend. Who's calling what, and how much does it cost? If you can't answer this by agent and by tool, you have a visibility problem.
- Set budget policies. Not alerts — policies. "Agent X gets 1,000 credits per day" is a policy. "Alert me when Agent X exceeds $50" is a notification.
- Enforce at the gateway. Move cost control from application code to infrastructure. Your agents shouldn't know or care about budgets — the gateway handles it.
- Iterate on allocations. Use monitoring data to adjust budgets. Some agents need more, some need less. The enforcement layer makes this safe to experiment with.
SatGate is open source. Try budget enforcement on your LLM calls today:
```shell
go install github.com/satgate-io/satgate/cmd/satgate-mcp@latest
```