How to spot anomalies in your AWS bill before they wreck the quarter

Most AWS cost surprises do not show up in your monthly invoice. They show up three weeks into the month, in your Slack, when someone in finance asks why the running total is $14K over forecast. By then the meter has been running for 21 days and you have eight days left to figure out what changed.

We have run cost audits on AWS environments from $20K/month to over $1M/month. The same three patterns show up over and over. Here they are, with the actual signals you can watch for, and the daily checks you can run today.

1. The lapsed Reserved Instance

Reserved Instances and Savings Plans expire. When they do, the workload keeps running but the price changes. An RDS db.r6g.2xlarge in us-east-1 costs around $0.96 per hour on-demand. Under a 1-year No Upfront Reserved Instance the effective rate is closer to $0.62 per hour. That is a 35% discount, so the hourly rate jumps by roughly 55% on the day the RI lapses, on a workload that nobody touched.

On a fleet of 10 large RDS instances, that is roughly $2,500 in extra spend over the next month (10 instances x $0.34 per hour x 730 hours), hidden inside a service line that already had a high baseline.

The signal is sharp. Look at daily cost for the RDS line item, split by purchase option (in the CUR, the line item type: DiscountedUsage for RI-covered hours, Usage for on-demand hours). The day a 1-year RI lapses, on-demand hours jump from near zero to the full instance count, and RI-covered hours drop to zero. If you only watch the total RDS line, the baseline shift is muddled. Split this way, it is unmistakable.
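
As a rough sketch of that signal, the function below flags the day on which the on-demand share of instance hours jumps. It assumes you have already aggregated your billing data into one row per day; the `DailyHours` type, its field names, and the 0.5 default shift are illustrative choices, not part of any AWS API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DailyHours:
    """One day of RDS instance hours (hypothetical shape, built from your own CUR query)."""
    day: date
    on_demand_hours: float  # hours billed at the on-demand rate
    reserved_hours: float   # hours covered by a Reserved Instance


def find_ri_lapse_days(rows, min_shift=0.5):
    """Return the days where the on-demand share of hours rose by at least
    `min_shift` (a fraction between 0 and 1) compared with the previous day."""

    def on_demand_share(row):
        total = row.on_demand_hours + row.reserved_hours
        return row.on_demand_hours / total if total else 0.0

    ordered = sorted(rows, key=lambda r: r.day)
    lapses = []
    for prev, cur in zip(ordered, ordered[1:]):
        if on_demand_share(cur) - on_demand_share(prev) >= min_shift:
            lapses.append(cur.day)
    return lapses
```

Working on the share of hours rather than on dollars keeps the check insensitive to the size of the fleet: ten instances flipping looks the same as two.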

Two checks worth running this week:

2. The misconfigured Lambda hammering S3

Real story. A team deployed a new ETL job. The Lambda was supposed to write one consolidated Parquet file per partition per day. Instead, due to a bug in the partition logic, it wrote one tiny file per record. S3 PUT requests cost $0.005 per 1,000. That is trivial. Except the Lambda processed 200 million records per day, and ran for nine days before anyone noticed.

Math: 200,000,000 / 1,000 * $0.005 = $1,000 per day in S3 PUT requests. After nine days, that is $9,000 of S3 request charges that did not exist before, on top of whatever the storage itself cost. The storage line looked normal because the per-byte price had not changed. The Requests-Tier1 usage type, which covers PUT requests, had spiked by a factor of 300.

The signal here is not in the cost summary. It is in the usage type breakdown of the S3 line. CloudWatch request metrics for S3 are opt-in and delivered on a best-effort basis, but the Cost and Usage Report reflects exactly what was billed. If your daily ingestion of CUR data (with resource IDs enabled, so that line items carry the bucket name) shows the request count for any single bucket jump 10x from one day to the next, you have a problem worth investigating.

A reasonable detection rule: per S3 bucket, per usage type, alert when the daily request count crosses three standard deviations above the trailing 14-day mean. False positives mostly come from launch days and backfills. Tag those as expected and move on.
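
A minimal sketch of that rule, assuming you already have a list of daily request counts for one bucket and one usage type. The function name and parameters are hypothetical; the 14-day window and three-sigma threshold come from the rule above.

```python
from statistics import mean, stdev


def request_spike(history, today, window=14, sigmas=3.0):
    """Return True when `today` exceeds the trailing mean by more than
    `sigmas` standard deviations.

    history: daily request counts, oldest first, NOT including today.
    """
    if len(history) < window:
        return False  # not enough data to form a baseline yet
    trailing = history[-window:]
    threshold = mean(trailing) + sigmas * stdev(trailing)
    return today > threshold
```

One caveat with this simple form: a perfectly flat history has a standard deviation of zero, so any increase at all trips the alert. In practice you would probably add a minimum absolute floor before paging anyone.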

3. The NAT Gateway data transfer surprise

NAT Gateway has two costs: hourly ($0.045 per hour, around $32 per month per gateway) and per-GB processed ($0.045 per GB). The hourly cost is boring and predictable. The per-GB cost is where bills explode.

We have seen this happen twice in the last year. Both times: a service running in private subnets started pulling large objects from S3 through the NAT Gateway instead of through a VPC Gateway Endpoint. In one case the team had built a new analytics pipeline that pulled 2 TB per day from S3. Through NAT, that is 2,000 GB * $0.045 = $90 per day, or $2,700 per month, on a service that should have cost $0 in data transfer if the VPC Endpoint had been in place.

The second case was worse: 8 TB per day for a video transcoding workload, $360 per day, $10,800 a month. The team thought they were paying for compute. They were paying for compute plus a third again for moving bytes from S3 to EC2 through a NAT Gateway in the same region.

Two signals here. The first is the obvious one: the NAT Gateway data processed line in the CUR jumps. The second is more useful at scale: the ratio of NAT Gateway data processed to EC2 inter-AZ transfer for a given account. Inter-AZ transfer stays predictable while NAT data climbs, so that ratio shifts hard the day the workload starts.
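
The ratio check can be sketched as follows, assuming two parallel daily series in GB per account. The function name, the 7-day baseline, and the 2x factor are illustrative assumptions to tune against your own data.

```python
from statistics import median


def first_ratio_shift(nat_gb, inter_az_gb, baseline_days=7, factor=2.0):
    """Return the index of the first day after the baseline period where the
    NAT-to-inter-AZ ratio exceeds `factor` times the baseline median,
    or None if it never does.

    nat_gb, inter_az_gb: daily GB totals, oldest first, same length.
    """
    if len(nat_gb) != len(inter_az_gb):
        raise ValueError("series must have the same length")
    # Days with no inter-AZ traffic have no defined ratio; skip them.
    ratios = [n / a if a > 0 else None for n, a in zip(nat_gb, inter_az_gb)]
    baseline = [r for r in ratios[:baseline_days] if r is not None]
    if not baseline:
        return None
    threshold = factor * median(baseline)
    for i in range(baseline_days, len(ratios)):
        if ratios[i] is not None and ratios[i] > threshold:
            return i
    return None
```

The median is used for the baseline so that a single noisy day in the first week does not move the threshold much.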

The fix is two CLI commands and an hour of route table editing:

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-XXXX \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-private-a rtb-private-b rtb-private-c

Run the same command with com.amazonaws.us-east-1.dynamodb as the service name if you use DynamoDB. After the endpoints are wired in, the NAT data line drops to whatever genuinely needed the public internet (third-party APIs, package downloads).

What "anomaly detection" actually means here

AWS Cost Anomaly Detection (the native service) runs on machine-learning baselines per service per linked account. It works. But it alerts on what is statistically anomalous, not on what is actionable. We see customers turn it off because it cries wolf on the last day of the month, when usage always spikes for batch jobs.

What works better in practice is splitting the problem in two:

  1. Service-level baselines on each major service line, per account, with thresholds that scale to the size of the line. A $200/day variance on a $50,000/day EC2 line is noise. The same variance on an $800/day RDS line is the signal.
  2. Usage-type drift within each service. The total can look stable while the mix changes (RIs lapsing, on-demand growing). Watch the largest 20 usage types per service and alert on composition shifts.
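
Both halves can be sketched in a few lines. The 10% relative threshold, the $50 floor, and the function names below are assumptions for illustration, and the mix comparison uses total variation distance between cost shares, which is one reasonable way to measure a composition shift rather than the only one.

```python
def line_is_anomalous(baseline_per_day, today, rel_threshold=0.10, min_abs=50.0):
    """Flag a service line when today's cost deviates from its baseline by at
    least `rel_threshold` of the baseline AND at least `min_abs` dollars."""
    delta = abs(today - baseline_per_day)
    return delta >= max(rel_threshold * baseline_per_day, min_abs)


def mix_shift(prev, cur, top_n=20):
    """Total variation distance between two cost mixes, restricted to the
    `top_n` usage types by combined cost. 0.0 means an identical mix,
    1.0 means no overlap at all.

    prev, cur: dicts mapping usage type name -> cost for the period.
    """
    keys = sorted(
        set(prev) | set(cur),
        key=lambda k: prev.get(k, 0.0) + cur.get(k, 0.0),
        reverse=True,
    )[:top_n]
    prev_total = sum(prev.get(k, 0.0) for k in keys)
    cur_total = sum(cur.get(k, 0.0) for k in keys)
    if prev_total == 0 or cur_total == 0:
        return 0.0 if prev_total == cur_total else 1.0
    return 0.5 * sum(
        abs(prev.get(k, 0.0) / prev_total - cur.get(k, 0.0) / cur_total)
        for k in keys
    )
```

With these defaults, a $200 move on a $50,000/day line is ignored and the same move on an $800/day line is flagged. A `mix_shift` value that rises while the total stays flat is the case the second rule is meant to catch.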

What we run on every audit

When we run a free 14-day audit, here is the actual checklist we run on AWS:

Half of these are findable in Cost Explorer with patience. The other half need the CUR plus Athena plus a few hours. The reason most teams do not catch them is not the data, it is the routine. Nobody owns running this every week, so it never gets run.

A few honourable mentions

Three smaller patterns we see often, less expensive individually, but cumulatively painful and easy to fix.

EBS volumes detached for months. Someone terminated an EC2 instance with delete-on-termination set to false, and the volume is still there at $0.08 per GB-month for gp3 ($0.10 for gp2). A 500 GB gp3 volume left for a year is $480 nobody is using. Find them with aws ec2 describe-volumes filtered for status available. The API does not record when a volume was detached, so use the creation time or CloudTrail DetachVolume events to confirm a volume has been idle for more than 30 days. Snapshot them, then delete.
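
A sketch of that filter, written against plain dictionaries shaped like the entries boto3 returns from `describe_volumes` (`VolumeId`, `State`, `Size`, `CreateTime`), so it runs without AWS credentials. The function name is hypothetical, the price is passed in by the caller, and creation time is used only as a rough proxy for how long the volume has been idle.

```python
from datetime import datetime, timedelta, timezone


def idle_volumes(volumes, now, price_per_gb_month, min_age_days=30):
    """Return (volume_id, monthly_cost) for unattached volumes that were
    created more than `min_age_days` ago.

    volumes: dicts with VolumeId, State, Size (GiB), CreateTime (aware datetime).
    """
    cutoff = now - timedelta(days=min_age_days)
    found = []
    for vol in volumes:
        if vol["State"] != "available":
            continue  # still attached, or in a transitional state
        if vol["CreateTime"] > cutoff:
            continue  # too new to call abandoned
        found.append((vol["VolumeId"], round(vol["Size"] * price_per_gb_month, 2)))
    return found
```

To run it against a real account you would feed it the `Volumes` list from a paginated `describe_volumes` call and `datetime.now(timezone.utc)`.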

Old EBS snapshots. The default policy on most backup tools is "never delete". Snapshots for instances that were terminated two years ago are still costing money. Pull aws ec2 describe-snapshots --owner-ids self (without the owner filter you also get every public snapshot), filter by start time older than your retention policy, and check the AMIs they back. Anything not referenced by an in-use AMI is a candidate for deletion. We usually find 2 to 5 TB of orphaned snapshots on a typical $200K/mo environment.
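
The cross-check against AMIs might look like the sketch below. It works on dictionaries shaped like boto3's `describe_snapshots` and `describe_images` responses; the function name and the 90-day default retention are assumptions, and it treats every image you pass in as "in use", so filter the image list first if you want the stricter definition.

```python
from datetime import datetime, timedelta, timezone


def orphaned_snapshots(snapshots, images, now, retention_days=90):
    """Return IDs of snapshots older than the retention period that no
    supplied AMI references.

    snapshots: dicts with SnapshotId and StartTime (aware datetime).
    images: dicts with BlockDeviceMappings, each optionally holding Ebs.SnapshotId.
    """
    referenced = {
        mapping["Ebs"]["SnapshotId"]
        for image in images
        for mapping in image.get("BlockDeviceMappings", [])
        if "SnapshotId" in mapping.get("Ebs", {})
    }
    cutoff = now - timedelta(days=retention_days)
    return [
        snap["SnapshotId"]
        for snap in snapshots
        if snap["StartTime"] < cutoff and snap["SnapshotId"] not in referenced
    ]
```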

Public IPv4 addresses since February 2024. AWS now charges $0.005 per hour ($3.65 per month) for every public IPv4, including those attached to running instances. A handful of NAT Gateways, Load Balancers, and direct-attached EC2 instances adds up. On a fleet of 200 instances with public IPs, that is $730 a month you may not have budgeted for. Pull aws ec2 describe-addresses and the elastic IP report; anything not actively serving traffic is fair game to release.
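
For a quick estimate, the sketch below splits Elastic IPs into associated and idle and prices the total at the hourly rate quoted above. It assumes entries shaped like boto3's `describe_addresses` output, where an address that is attached to something carries an `AssociationId`. The names and the 730-hour month are illustrative, and it only covers Elastic IPs, not auto-assigned public addresses.

```python
HOURLY_RATE = 0.005     # dollars per public IPv4 address per hour
HOURS_PER_MONTH = 730   # average month, used for estimates only


def monthly_ipv4_cost(address_count):
    """Estimated monthly charge for `address_count` public IPv4 addresses."""
    return round(address_count * HOURLY_RATE * HOURS_PER_MONTH, 2)


def idle_elastic_ips(addresses):
    """Return the PublicIp of every Elastic IP with no association."""
    return [a["PublicIp"] for a in addresses if "AssociationId" not in a]
```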

The simple version: three checks for Monday

If you do nothing else, do this on Monday morning, every week, for the next month:

  1. Pull last week's daily cost by service. Eye the top five for any line that grew more than 20% week over week. Click into the usage type breakdown for that service.
  2. Check Reserved Instance utilization for the past 7 days. Anything below 90% is leaking. Anything below 70% is screaming.
  3. Look at NAT Gateway data processed for each account. Compare to the previous week. If it grew more than 30%, find out which workload and decide if a VPC Endpoint would help.
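
If you want to script the first two checks, a minimal sketch could look like this. It assumes you have exported weekly cost per service into plain dictionaries; the function names are hypothetical and the thresholds are the ones from the list above.

```python
def services_to_review(prev_week, this_week, threshold=0.20, top_n=5):
    """Among the `top_n` services by this week's cost, return
    (service, growth) pairs where week-over-week growth exceeds `threshold`.
    Growth is None for a service that had no cost the week before."""
    top = sorted(this_week, key=this_week.get, reverse=True)[:top_n]
    flagged = []
    for service in top:
        before = prev_week.get(service, 0.0)
        current = this_week[service]
        if before == 0:
            if current > 0:
                flagged.append((service, None))  # new spend, no baseline
            continue
        growth = (current - before) / before
        if growth > threshold:
            flagged.append((service, round(growth, 3)))
    return flagged


def ri_status(utilization):
    """Classify 7-day RI utilization given as a fraction between 0 and 1."""
    if utilization < 0.70:
        return "screaming"
    if utilization < 0.90:
        return "leaking"
    return "ok"
```

The output is only a shortlist. The actual diagnosis still happens in the usage type breakdown for each flagged service.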

Three checks, 20 minutes a week, find the first 70% of what costs most teams real money. Worth doing yourself, worth automating, worth handing to us if you have better things to do.