By Yair Knijn · August 27, 2025

Your microservices architecture is billing you per chatty cross-AZ hop

The head of SRE did the responsible thing. Spread the services across three availability zones, set anti-affinity so no two replicas share an AZ, call resilience done. The diagram looks like a textbook. What it does not show is that most calls between those services now cross an AZ boundary, and AWS bills that crossing by the gigabyte.

The wrong assumption is that resilience topology and cost topology are separate concerns owned by separate people. They are the same diagram. The placement that buys survival through a zone failure is the same placement that meters your internal chatter.

How multi-AZ resilience becomes a per-hop bill

Cross-AZ data transfer within one region is charged at $0.01/GB in each direction. That "each direction" clause is the part that disappears in review: a gigabyte sent from us-east-1a to us-east-1b is billed once on the way out and once on the way in, so every gigabyte that crosses effectively costs $0.02. Spread your replicas evenly across three zones, let callers pick endpoints without regard to zone, and roughly two-thirds of your inter-service traffic crosses a boundary by default, because two of every three candidate endpoints sit in another zone. It surfaces as a Data Transfer line that grows every time someone ships a new service into the mesh, with nobody tracing it back to the resilience decision that created it.
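To make that arithmetic concrete, here is a minimal sketch in Python. It assumes replicas are spread evenly and callers choose endpoints uniformly at random; the function names and the 300,000 GB monthly volume are invented for illustration, and only the $0.01/GB-per-direction rate comes from the text above.

```python
# Illustrative sketch: estimate the monthly cross-AZ charge for internal traffic.
# Assumes evenly spread replicas and zone-unaware (uniformly random) endpoint choice.

RATE_PER_GB_PER_DIRECTION = 0.01  # USD, billed on both the sending and the receiving side


def cross_zone_fraction(zone_count: int) -> float:
    """Share of calls that leave the caller's zone under random endpoint choice."""
    if zone_count < 1:
        raise ValueError("zone_count must be at least 1")
    return 1 - 1 / zone_count


def monthly_cross_az_cost(internal_gb_per_month: float, zone_count: int) -> float:
    """Cross-AZ charge in USD for a given volume of service-to-service traffic."""
    crossing_gb = internal_gb_per_month * cross_zone_fraction(zone_count)
    # Each crossing gigabyte is charged twice: once outbound, once inbound.
    return crossing_gb * RATE_PER_GB_PER_DIRECTION * 2


if __name__ == "__main__":
    # Hypothetical volume: 300,000 GB of internal traffic per month across 3 zones.
    cost = monthly_cross_az_cost(internal_gb_per_month=300_000, zone_count=3)
    print(f"${cost:,.2f} per month")
```

With those made-up inputs the estimate comes to about $4,000 a month. Real traffic is rarely uniform, so treat the two-thirds fraction as a starting guess to be replaced by measured numbers.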

Cross-AZ transfer: the charge that scales with chatter

The cost is a product of two numbers the review almost never puts side by side: how much your services talk, and what fraction of that talk crosses a zone. A monolith with one chatty internal loop pays nothing for it, because both ends live in the same process. Decompose that loop into six microservices spread across zones and the same logic now pays the cross-AZ tax on every fan-out, retry, health check, and sidecar hop. This is why the transfer line on a Kafka cluster, an RDS reader fleet, or a cluster with pods scattered across zones can rival the compute it runs on. The tax scales with how much your fleet talks to itself across boundaries you drew for availability, not with how big the fleet is.

Zone-aware routing and topology-aware placement

Stop treating zone placement as purely a resilience knob. Kubernetes ships Topology Aware Routing (known as Topology Aware Hints before version 1.27) for exactly this: it tries to keep service-to-service traffic inside the zone it originated in when that zone has enough healthy endpoints, and spills across zones only when it must. The same idea applies to load balancers, service meshes, and database read routing: pods that need each other should land together, and the cross-AZ hop should be the exception.

Enable same-zone routing where the call pattern is hot and high-volume.
Keep replicas spread for failure isolation, but pin chatty pairs so steady-state traffic stays local.
Where it fits, route through PrivateLink or Transit Gateway: AWS stopped charging inter-AZ data transfer within a region on both in April 2022, though their own per-gigabyte data processing charges still apply.
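The routing rule behind the first two items, prefer a healthy same-zone endpoint and spill across zones only when none exists, can be sketched in a few lines of Python. This is a toy illustration, not how Kubernetes or any particular mesh implements it; the Endpoint type and pick_endpoint function are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    # Hypothetical record of one replica of a service.
    address: str
    zone: str
    healthy: bool = True


def pick_endpoint(endpoints, caller_zone, rng=random):
    """Prefer a healthy endpoint in the caller's zone; otherwise use any healthy one."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoints available")
    local = [e for e in healthy if e.zone == caller_zone]
    # Cross-zone traffic happens only when the local list is empty.
    return rng.choice(local or healthy)


if __name__ == "__main__":
    replicas = [
        Endpoint("10.0.1.10", "us-east-1a"),
        Endpoint("10.0.2.10", "us-east-1b"),
        Endpoint("10.0.3.10", "us-east-1c"),
    ]
    print(pick_endpoint(replicas, caller_zone="us-east-1b").address)
```

The replicas stay spread across three zones, so a zone failure still leaves two healthy targets; only the steady-state path changes. A production router would also need a guard against overloading a zone that holds few endpoints, which this sketch leaves out.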

Attributing transfer cost to the service that talks most

You cannot fix what you cannot assign. The Data Transfer line on the consolidated bill is an aggregate; the cause is a specific pair of services fanning out across zones thousands of times a second. Pull the flow logs, map which talkers cross boundaries, and put a name on the row. The chattiest service is rarely the one anyone suspects, and once it has an owner the fix is usually a config change.
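As a rough sketch of that attribution step, the Python below ranks service pairs by the bytes they send across zone boundaries. It assumes the flow records have already been enriched with service names and zones; raw flow logs carry addresses and interface IDs, so that mapping is a separate step. The field names and sample values here are hypothetical.

```python
from collections import Counter


def cross_zone_bytes_by_pair(records):
    """Sum bytes per (source service, destination service), counting only zone-crossing flows."""
    totals = Counter()
    for r in records:
        if r["src_zone"] != r["dst_zone"]:
            totals[(r["src_service"], r["dst_service"])] += r["bytes"]
    # Largest cross-zone talkers first.
    return totals.most_common()


if __name__ == "__main__":
    # Made-up sample records, already enriched with service and zone.
    sample = [
        {"src_service": "checkout", "dst_service": "pricing",
         "src_zone": "us-east-1a", "dst_zone": "us-east-1b", "bytes": 700},
        {"src_service": "checkout", "dst_service": "pricing",
         "src_zone": "us-east-1a", "dst_zone": "us-east-1a", "bytes": 900},
        {"src_service": "search", "dst_service": "catalog",
         "src_zone": "us-east-1c", "dst_zone": "us-east-1a", "bytes": 1200},
    ]
    for pair, total in cross_zone_bytes_by_pair(sample):
        print(pair, total)
```

Same-zone flows are dropped on purpose: they are free, so including them would rank services by raw volume instead of by what they add to the bill.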

Reviewing resilience and cost in the same diagram

The discipline is simple to state and rare to practice: review a resilience topology and its transfer cost in the same session, on the same diagram. Every arrow that crosses a zone boundary is a recurring per-gigabyte charge, and the resilience win is only honest once you have priced the chatter it commits you to. Inside a Cloud Horizons workspace, cross-AZ transfer is attributed to the services and zone pairs driving it, so the two stop living on separate diagrams. If your transfer line is climbing faster than your fleet, that is where to start looking.