Every RDS instance is Multi-AZ, including the ones that didn't need to be

You take over a database fleet that grew faster than anyone wrote standards for, so you write one. Multi-AZ on by default. It is the choice nobody gets fired for, it sounds responsible in a design review, and it goes straight into the Terraform module every team copies. Buried inside it is a quiet assumption: that resilience is free, or close enough not to bother costing out. It is neither.

A Multi-AZ database runs a second full instance in another availability zone, billed at the same rate as the first. Compute and storage, charged twice. So the policy that protects production is, line for line, a doubling on every database it touches, down to the staging copy nobody would notice was down over a weekend.

When a safe default becomes a blanket overspend

This is not a technical failure. It is a granularity failure. A default applied at the org or module level has no idea whether it is provisioning something that serves customers or a CI pipeline. The module stamps Multi-AZ onto a db.t3.medium dev box exactly the way it stamps it onto the production primary, and the dev box now pays production prices for resilience it will never exercise. No single database was a bad call. The cost is one reasonable decision multiplied across every non-prod instance the team ever spun up, compounding a little more each time someone clones the module.

What Multi-AZ actually doubles, and where it earns its keep

Multi-AZ keeps a synchronous standby replica in a second AZ and fails over on its own, typically within one to two minutes. That standby is a real provisioned instance with its own storage, and AWS charges for it as one. Going from Single-AZ to Multi-AZ is close to a flat doubling on compute and storage, not some modest percentage uplift.
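To make the doubling concrete, here is a minimal sketch of the arithmetic. The prices are hypothetical placeholders, not real AWS rates, and the function name monthly_cost is invented for illustration. The flat 2x multiplier is the simplification described above, so treat the result as an estimate rather than a bill.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing hours


def monthly_cost(instance_hourly: float, storage_gb: float,
                 storage_per_gb_month: float, multi_az: bool) -> float:
    """Estimate monthly compute plus storage cost for one database."""
    single_az = (instance_hourly * HOURS_PER_MONTH
                 + storage_gb * storage_per_gb_month)
    # Assumption taken from the text: the standby is billed like a second
    # instance, so compute and storage are both charged twice.
    return single_az * 2 if multi_az else single_az


# Hypothetical prices for illustration only.
single = monthly_cost(0.10, 100, 0.10, multi_az=False)
multi = monthly_cost(0.10, 100, 0.10, multi_az=True)
print(f"Single-AZ: {single:.2f}  Multi-AZ: {multi:.2f}  "
      f"extra: {multi - single:.2f}")
```

Substitute the on-demand rates for your own region and instance class to see what each Multi-AZ flag actually costs.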

For production, that fast failover is precisely the thing you are paying for, and it pays for itself the first time an AZ misbehaves. For a development database, ask whether downtime during a rare AZ event, which on Single-AZ lasts until the zone recovers or you restore from a snapshot elsewhere, carries any business cost at all. Almost always it does not. You are buying production insurance for a workload that carries no production risk.

Make the default follow the environment

The fix is to turn the default from a constant into a function of environment. The resilience tier follows the workload, decided at provision time and enforced in the module, rather than discovered after the bill lands.
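As a rough sketch of what "a function of environment" could look like, the snippet below maps an environment name to a Multi-AZ setting. The environment names and the helper multi_az_default are assumptions made for illustration. In practice the same lookup would probably live inside the Terraform module itself. This version rejects unknown environments rather than guessing, which is one possible design choice, not the only one.

```python
# Hypothetical environment names; adjust to your own conventions.
MULTI_AZ_BY_ENVIRONMENT = {
    "prod": True,
    "staging": False,
    "dev": False,
    "ci": False,
}


def multi_az_default(environment: str) -> bool:
    """Return the Multi-AZ setting for a given environment name."""
    key = environment.strip().lower()
    if key not in MULTI_AZ_BY_ENVIRONMENT:
        # Refuse to guess: an unmapped environment must be classified
        # before anything is provisioned for it.
        raise ValueError(f"unknown environment: {environment!r}")
    return MULTI_AZ_BY_ENVIRONMENT[key]


print(multi_az_default("prod"))
print(multi_az_default("dev"))
```

Failing on an unknown environment forces the mapping to stay complete, so a new environment cannot silently inherit either tier.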

Finding the non-prod databases paying the prod tax

Before you touch policy, find the instances it has already overcharged. List every RDS instance where MultiAZ is true, then reconcile that against whatever tells you the environment: an Environment tag, the account, the VPC, a naming convention. Any Multi-AZ instance that resolves to non-production is a candidate, and moving each one to Single-AZ returns roughly half of its compute and storage cost. The query is the easy part. The hard part is having a trustworthy map of which instance belongs to which environment, which is the same map you need to enforce the new default going forward.
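The reconciliation step can be sketched as a filter over instance records. This example assumes records shaped like the entries in the DBInstances list returned by the RDS DescribeDBInstances API (DBInstanceIdentifier, MultiAZ, TagList). It also assumes the environment lives in an Environment tag. The sample data, the NON_PROD set, and the helper names are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical non-production environment names.
NON_PROD = {"dev", "staging", "ci", "test"}


def environment_of(instance: dict) -> Optional[str]:
    """Read the Environment tag from an instance record, if present."""
    for tag in instance.get("TagList", []):
        if tag.get("Key") == "Environment":
            return tag.get("Value", "").strip().lower()
    return None


def non_prod_multi_az(instances: Iterable[dict]) -> Tuple[List[str], List[str]]:
    """Split Multi-AZ instances into non-prod candidates and unmapped ones."""
    candidates, unmapped = [], []
    for inst in instances:
        if not inst.get("MultiAZ"):
            continue  # Single-AZ already; nothing to reclaim
        env = environment_of(inst)
        if env is None:
            unmapped.append(inst["DBInstanceIdentifier"])
        elif env in NON_PROD:
            candidates.append(inst["DBInstanceIdentifier"])
    return candidates, unmapped


# Made-up sample records for illustration.
sample = [
    {"DBInstanceIdentifier": "orders-prod", "MultiAZ": True,
     "TagList": [{"Key": "Environment", "Value": "prod"}]},
    {"DBInstanceIdentifier": "orders-staging", "MultiAZ": True,
     "TagList": [{"Key": "Environment", "Value": "staging"}]},
    {"DBInstanceIdentifier": "orders-dev", "MultiAZ": False,
     "TagList": [{"Key": "Environment", "Value": "dev"}]},
    {"DBInstanceIdentifier": "legacy-db", "MultiAZ": True, "TagList": []},
]
candidates, unmapped = non_prod_multi_az(sample)
print("candidates:", candidates)
print("unmapped:", unmapped)
```

With boto3, the input could come from the describe_db_instances paginator by concatenating the DBInstances list from each page. The unmapped list is the gap in your environment map: instances that are Multi-AZ but that this logic cannot classify either way.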

Right-sizing resilience to the environment

Treating high availability as a per-environment decision rather than a global toggle is the line between real safety and paying twice for staging. Your job is not to pick the safest setting everywhere. It is to put the right setting where it belongs and keep it there.

Cloud Horizons builds that environment map for you, flags every Multi-AZ database that resolves to non-production in your workspace, and shows the exact spend each one returns at Single-AZ. You see which defaults are overspending before the next invoice, not after it. Start with the FinOps view and let the policy follow the evidence.