By Yair Knijn · October 28, 2025

Someone left debug logging on in production. The CloudWatch bill found out first.

Your head of SRE budgets for compute: EC2 reservations, Lambda concurrency, the RDS instance class. The one thing they almost never put a number on is how many gigabytes of logs the fleet emits per hour. That number was never a capacity decision anyone signed off on; it's a side effect of a config flag, and the flag has no owner.

The assumption that gets people is that observability cost tracks the size of the system. It tracks verbosity instead, and verbosity is one line in a config map any on-call engineer can flip at 2am. Nobody flips it back, because a log level isn't an outage. The bill is the only thing watching it.

How a debug flag outlasts the incident that set it

The shape of it rarely varies. Something pages at night, the on-call engineer can't see what's happening, so they set LOG_LEVEL=DEBUG. The incident resolves. The flag stays. A production service at DEBUG can emit an order of magnitude more volume than at INFO, because you're now writing every request body, every SDK retry, every cache miss, every health check. None of it trips an alarm, and the change rode out through the same pipeline as a feature, so it cleared code review, deploy, and the post-incident review without anyone glancing at log volume. A week later the flag is still on, on every task, across every zone. What was meant to be temporary is now the steady state.

Ingestion plus retention: the double charge

CloudWatch Logs bills you twice for the same mistake. Ingestion runs around $0.50 per GB for the Standard log class, and that is the charge that scales directly with the debug flag. Storage then accrues per GB-month for as long as the data lives, and CloudWatch keeps log data indefinitely unless you set a retention policy on the log group. So a flag flipped during one incident isn't a spike: it's a higher ingestion run-rate every hour it stays on, plus a storage baseline that ratchets up every day. Two multipliers compounding, neither one on an availability dashboard.
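
To put rough numbers on the double charge, here is a minimal sketch of the arithmetic. The $0.50 per GB ingestion figure is the one quoted above. The $0.03 per GB-month storage rate, the 100 GB/day baseline, and the 10x DEBUG multiplier are illustrative assumptions, not measurements.

```python
# Assumed prices: the ~$0.50/GB ingestion figure quoted above, plus an
# illustrative $0.03 per GB-month storage rate. Check your region's pricing.
INGEST_USD_PER_GB = 0.50
STORAGE_USD_PER_GB_MONTH = 0.03


def logs_cost(gb_per_day: float, days: int) -> tuple[float, float]:
    """Return (ingestion_usd, storage_usd) after `days` with no retention policy.

    Simplifications: a 30-day billing month, storage billed on the
    ingested size (no compression), and nothing ever expires.
    """
    ingestion = gb_per_day * days * INGEST_USD_PER_GB
    # Data written on day d is then stored for the remaining (days - d) days.
    stored_gb_days = sum(gb_per_day * (days - d) for d in range(1, days + 1))
    storage = stored_gb_days / 30 * STORAGE_USD_PER_GB_MONTH
    return ingestion, storage


info = logs_cost(gb_per_day=100, days=30)    # hypothetical INFO baseline
debug = logs_cost(gb_per_day=1000, days=30)  # same fleet, 10x volume at DEBUG

for label, (ingest_usd, storage_usd) in (("INFO", info), ("DEBUG", debug)):
    print(f"{label}: ingestion ${ingest_usd:,.2f}, storage ${storage_usd:,.2f}")
```

Under these assumptions ingestion dominates the first month. The storage share is smaller, but it keeps accruing after the flag is turned off, because nothing written during the incident ever expires.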

Watch for the moment the line item meant to observe the system costs more than the system itself. Once ingestion dominates, the logs bill can rival or exceed the compute it describes, and the team hears about it from finance, not monitoring. Observability is the cost center that almost never gets its own alarm: you instrument latency, error rate, and saturation, but not the dollar cost of the instrumentation, so it grows in exactly the place your dashboards weren't built to look.

Volume guardrails and log-level governance

The fix is to treat log level and ingested volume as governed config, not something each engineer toggles by feel:

Alarm on ingested bytes per log group. CloudWatch already publishes this as the IncomingBytes metric in the AWS/Logs namespace, so alert when a service's hourly volume jumps off its baseline. A debug flag shows up as a step change within minutes.
Make DEBUG in production time-boxed by construction. A flag that auto-reverts to INFO after the incident window can't outlive the incident.
Set an explicit retention policy on every log group. The default of forever is a decision nobody actually made.
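For high-volume groups that don't need real-time features such as Live Tail, metric filters, or subscription filters, the Infrequent Access class roughly halves the per-GB ingestion rate while keeping the data queryable in Logs Insights. The log class is fixed when a log group is created, so this is a choice for new groups.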
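
The first two guardrails can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a drop-in implementation: `is_step_change`, `TimeBoxedDebug`, the 3x threshold, and the polling approach are all hypothetical names and choices, and the hourly byte counts are assumed to come from somewhere like the per-log-group IncomingBytes metric.

```python
import logging
import time
from statistics import median


def is_step_change(hourly_bytes: list[float], factor: float = 3.0) -> bool:
    """True if the latest hour exceeds `factor` times the median of prior hours.

    `factor` is an assumed threshold; tune it to each service's normal variance.
    """
    if len(hourly_bytes) < 2:
        return False  # no baseline to compare against yet
    *history, latest = hourly_bytes
    return latest > factor * median(history)


class TimeBoxedDebug:
    """Raise a logger to DEBUG with an expiry, so the flag cannot outlive its window."""

    def __init__(self, logger: logging.Logger, clock=time.monotonic):
        self.logger = logger
        self.clock = clock  # injectable so the expiry can be tested
        self.expires_at = None

    def enable(self, ttl_seconds: float) -> None:
        """Switch to DEBUG and record when the window closes."""
        self.logger.setLevel(logging.DEBUG)
        self.expires_at = self.clock() + ttl_seconds

    def enforce(self) -> None:
        """Revert to INFO once the window has passed. Call per request or on a timer."""
        if self.expires_at is not None and self.clock() >= self.expires_at:
            self.logger.setLevel(logging.INFO)
            self.expires_at = None


# Usage: the on-call engineer asks for one hour of DEBUG, not "DEBUG until further notice".
box = TimeBoxedDebug(logging.getLogger("checkout"))  # hypothetical service name
box.enable(ttl_seconds=3600)
box.enforce()  # no-op now; flips back to INFO on the first call after the hour is up
```

The design choice that matters is that the expiry is set at the moment DEBUG is enabled. Turning it off stops being a task someone has to remember.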

Attributing log cost to the service that emits it

None of these guardrails hold unless someone owns the number. Without per-service attribution, a debug flag is free to the engineer who flips it and expensive to everyone else, the cost lost on a shared CloudWatch line no single team feels. Tag log groups by service, break ingestion spend out by owner, and the incentive inverts: the person who can turn the flag off is the person who sees the bill.
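
As a rough sketch of what attribution means in practice, the snippet below rolls ingestion spend up by a `service` tag. The input shape, the tag key, the log group names, and the volumes are all made up for illustration; in a real account the volumes and tags would come from your billing or metrics export.

```python
from collections import defaultdict

INGEST_USD_PER_GB = 0.50  # assumed Standard-class rate, as quoted earlier


def ingestion_cost_by_owner(log_groups: list[dict]) -> dict[str, float]:
    """Sum ingestion cost per `service` tag, largest spender first.

    Groups without the tag land in an "untagged" bucket so they stay visible.
    """
    totals: dict[str, float] = defaultdict(float)
    for group in log_groups:
        owner = group.get("tags", {}).get("service", "untagged")
        totals[owner] += group["ingested_gb"] * INGEST_USD_PER_GB
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical month of ingestion for four log groups.
groups = [
    {"name": "/ecs/checkout", "tags": {"service": "checkout"}, "ingested_gb": 1200.0},
    {"name": "/ecs/checkout-worker", "tags": {"service": "checkout"}, "ingested_gb": 300.0},
    {"name": "/ecs/search", "tags": {"service": "search"}, "ingested_gb": 90.0},
    {"name": "/aws/lambda/legacy-cron", "tags": {}, "ingested_gb": 40.0},
]

report = ingestion_cost_by_owner(groups)
for owner, usd in report.items():
    print(f"{owner}: ${usd:,.2f}")
```

The "untagged" bucket is deliberate. Spend that nobody owns is the spend most likely to hide a forgotten flag, so it should show up as its own line, not disappear into a shared total.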

Cloud Horizons breaks observability spend out by service and log group inside each workspace, so a debug flag left on surfaces as a volume anomaly attributed to the team that emitted it, before the month closes instead of after. The cloud costs that run away are the ones you built to watch everything except themselves. See how attribution and anomaly detection work on the FinOps page.