**TL;DR:** Inherited a system with triple-replicated database snapshots and no lifecycle policies. Replaced it with a single Lambda that exports select snapshots to S3 as Parquet for long-term archival. RDS snapshots handle actual recovery. Monthly backup bill dropped from ~$2,500 to ~$130.
## The Discovery
I recently took over management of a system that required extensive backups for compliance. Standard onboarding stuff - I set up cost anomaly detection, reviewed spending patterns, and started getting familiar with the infrastructure.
Then my alerts started firing.
Certain costs were consistently spiking above normal thresholds. After digging through the billing console and a few conversations with the team, I found the culprit: database backups that had quietly grown into a monster.
```mermaid
flowchart LR
    subgraph "What I Inherited"
        DB[(Aurora DB)] --> S1[Daily Snapshots]
        S1 --> R1[Region A<br/>Account 1]
        S1 --> R2[Region B<br/>Account 1]
        S1 --> R3[Region B<br/>Account 2]
    end
    style R1 fill:#ff6b6b,color:#fff
    style R2 fill:#ff6b6b,color:#fff
    style R3 fill:#ff6b6b,color:#fff
```
The backups were being triple replicated:
- Copied to a different region within the same account
- Copied again to that same region in a separate account
- No lifecycle policies, no S3 exports, no budget controls
These snapshots just… accumulated. Forever. The bill kept climbing and nobody noticed because “backups are important” - and they are, just not like this.
## What We Actually Needed
Before jumping to solutions, I reviewed the compliance requirements. For SOC2 and general data protection, we needed:
| Recovery Type | Access Speed | Retention |
|---|---|---|
| Short-term | Immediate | Days to weeks |
| Medium-term | Within hours | Months |
| Long-term | Within a day | Years (for audits) |
The existing setup was overkill for all three while somehow still being poorly organized. Time for a rethink.
## The Fix
The idea: use AWS’s native snapshot retention for short-term recovery, then selectively export snapshots to S3 for long-term archival. S3 lifecycle policies transition the data through storage tiers and handle cleanup automatically.
Important distinction: RDS snapshot exports land in S3 as Apache Parquet files - a columnar format optimized for analytics. These exports are designed for querying via Athena or Redshift Spectrum, not for database restoration. Need to pull historical records for an audit? Query them directly with SQL. Need to actually restore a database? That’s what the 30-day RDS snapshots are for.
```mermaid
flowchart TB
    subgraph "New Architecture"
        DB[(Aurora DB)] --> AUTO[Daily Snapshots<br/>30-day retention]
        AUTO --> LAMBDA[Lambda<br/>runs monthly]
        LAMBDA --> |"14-day snapshot"| BI[S3 Standard<br/>Parquet export]
        LAMBDA --> |"30-day snapshot"| MO[S3 Standard<br/>Parquet export]
        BI --> |"Lifecycle: 30 days"| BI_IA[S3 Infrequent Access]
        MO --> |"Lifecycle: 30 days"| MO_G[Glacier Deep Archive]
        BI_IA --> |"Lifecycle: 6 months"| DEL1[Auto-delete]
        MO_G --> |"Lifecycle: 2 years"| DEL2[Auto-delete]
    end
    style AUTO fill:#4ecdc4,color:#000
    style BI fill:#e8e8e8,color:#000
    style MO fill:#e8e8e8,color:#000
    style BI_IA fill:#45b7d1,color:#000
    style MO_G fill:#96ceb4,color:#000
    style LAMBDA fill:#ffd93d,color:#000
```
## How the Lambda Works
The Lambda runs on the last day of each month. It finds snapshots that are approximately 14 and 30 days old, then exports them to S3 with the appropriate prefix.
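"Last day of the month" can be expressed directly in an EventBridge schedule with the `L` day-of-month wildcard. A sketch with the AWS CLI — the rule name, account ID, and function name are placeholders, not from our setup:

```shell
# Hypothetical EventBridge rule: fire at 06:00 UTC on the last day of every month.
# "L" in the day-of-month field means "last day of the month".
aws events put-rule \
  --name monthly-snapshot-export \
  --schedule-expression "cron(0 6 L * ? *)"

# Point the rule at the Lambda (assumes the function already exists; you also
# need `aws lambda add-permission` so EventBridge may invoke it).
aws events put-targets \
  --rule monthly-snapshot-export \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:snapshot-export'
```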
```mermaid
flowchart TD
    START([Triggered]) --> FETCH[Fetch automated snapshots]
    FETCH --> FILTER[Filter available & calculate age]
    FILTER --> MONTHLY{~30 days old?}
    MONTHLY --> |Yes| EXPORTM[Export to monthly/]
    MONTHLY --> |No| BIWEEKLY
    EXPORTM --> BIWEEKLY
    BIWEEKLY{~14 days old?}
    BIWEEKLY --> |Yes| EXPORTB[Export to biweekly/]
    BIWEEKLY --> |No| DONE
    EXPORTB --> DONE([Done])
    style START fill:#ffd93d,stroke:#333
    style DONE fill:#9f9,stroke:#333
```
The key insight: pick snapshots that are approximately the right age (within a day or two) rather than requiring exact timing. This makes the system resilient to scheduling variations.
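That tolerant-matching idea boils down to a small helper. A sketch — `pick_snapshot` is an illustrative refactoring, not a function from the production Lambda:

```python
from datetime import datetime, timedelta, timezone

def pick_snapshot(snapshots, target_days, tolerance=1, now=None):
    """Return the snapshot whose age is closest to target_days,
    or None if nothing falls within +/- tolerance days."""
    now = now or datetime.now(timezone.utc)

    def age(s):
        return (now - s['SnapshotCreateTime']).days

    candidates = [s for s in snapshots if abs(age(s) - target_days) <= tolerance]
    if not candidates:
        return None
    return min(candidates, key=lambda s: abs(age(s) - target_days))

# Snapshots 13, 29, 30, and 31 days old: the 30-day one wins for "monthly".
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
snaps = [{'SnapshotCreateTime': now - timedelta(days=d)} for d in (13, 29, 30, 31)]
picked = pick_snapshot(snaps, 30, now=now)
```

If a daily snapshot was skipped or delayed, the nearest neighbor within the tolerance window is exported instead, which is exactly the resilience the exact-timing approach lacks.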
## The Retention Strategy
| Tier | Source | Storage | Retention | Purpose |
|---|---|---|---|---|
| Daily | Aurora automated | RDS Snapshots | 30 days | Actual recovery - restore DB directly |
| Biweekly | Lambda export | S3 → IA | 6 months | Archival, analytics via Athena |
| Monthly | Lambda export | S3 → Glacier Deep | 2 years | Compliance, audit trail |
Note: S3 Glacier Deep Archive has a 180-day minimum storage duration. Objects deleted before 180 days incur pro-rated charges. Our 2-year retention avoids this entirely.
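For intuition, the early-deletion charge is just the unserved remainder of the 180-day minimum, billed at the normal rate. A back-of-the-envelope sketch — the ~$0.00099/GB-month figure is an approximate us-east-1 list price and varies by region:

```python
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # approximate us-east-1 list price (assumption)

def early_delete_charge(size_gb, days_stored, minimum_days=180):
    """Pro-rated charge for deleting a Deep Archive object before the
    minimum storage duration has elapsed."""
    remaining_days = max(0, minimum_days - days_stored)
    return size_gb * DEEP_ARCHIVE_PER_GB_MONTH * (remaining_days / 30)

# Deleting a 1 TB export after 60 days still bills the remaining 120 days:
charge = early_delete_charge(1024, 60)   # roughly $4
# Our 730-day retention never comes close to the minimum:
safe = early_delete_charge(1024, 730)    # 0.0
```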
## The Code
```python
import boto3
import os
from datetime import datetime, timezone


def lambda_handler(event, context):
    rds = boto3.client('rds')
    cluster_id = os.environ['DB_CLUSTER_IDENTIFIER']
    s3_bucket = os.environ['S3_BUCKET']
    iam_role_arn = os.environ['IAM_ROLE_ARN']
    kms_key_id = os.environ['KMS_KEY_ID']
    now = datetime.now(timezone.utc)

    # Get all automated snapshots for the cluster
    response = rds.describe_db_cluster_snapshots(
        DBClusterIdentifier=cluster_id,
        SnapshotType='automated'
    )
    snapshots = response.get('DBClusterSnapshots', [])
    exports_started = []

    # Filter available snapshots and calculate age
    available_snapshots = []
    for snapshot in snapshots:
        if snapshot['Status'] != 'available':
            continue
        snapshot_time = snapshot['SnapshotCreateTime']
        age_days = (now - snapshot_time).days
        available_snapshots.append({
            'snapshot': snapshot,
            'age_days': age_days
        })

    # Find the single best snapshot for monthly (closest to 30 days)
    monthly_candidates = [s for s in available_snapshots if 29 <= s['age_days'] <= 31]
    if monthly_candidates:
        # Pick the one closest to 30 days
        monthly_snap = min(monthly_candidates, key=lambda x: abs(x['age_days'] - 30))
        snapshot = monthly_snap['snapshot']
        snapshot_arn = snapshot['DBClusterSnapshotArn']
        snapshot_date = snapshot['SnapshotCreateTime'].strftime('%Y%m%d')
        export_id = f"m-{snapshot_date}-{now.strftime('%Y%m%d')}"
        s3_prefix = f"monthly/{now.strftime('%Y/%m')}"
        try:
            rds.start_export_task(
                ExportTaskIdentifier=export_id,
                SourceArn=snapshot_arn,
                S3BucketName=s3_bucket,
                S3Prefix=s3_prefix,
                IamRoleArn=iam_role_arn,
                KmsKeyId=kms_key_id
            )
            exports_started.append(f"Monthly: {export_id} (snapshot: {snapshot_date})")
            print(f"Started monthly export: {export_id} (age: {monthly_snap['age_days']} days)")
        except rds.exceptions.ExportTaskAlreadyExistsFault:
            print(f"Export already exists: {export_id}")
        except Exception as e:
            print(f"Error starting monthly export: {e}")

    # Find the single best snapshot for biweekly (closest to 14 days)
    biweekly_candidates = [s for s in available_snapshots if 13 <= s['age_days'] <= 15]
    if biweekly_candidates:
        # Pick the one closest to 14 days
        biweekly_snap = min(biweekly_candidates, key=lambda x: abs(x['age_days'] - 14))
        snapshot = biweekly_snap['snapshot']
        snapshot_arn = snapshot['DBClusterSnapshotArn']
        snapshot_date = snapshot['SnapshotCreateTime'].strftime('%Y%m%d')
        export_id = f"bi-{snapshot_date}-{now.strftime('%Y%m%d')}"
        s3_prefix = f"biweekly/{now.strftime('%Y/%m')}"
        try:
            rds.start_export_task(
                ExportTaskIdentifier=export_id,
                SourceArn=snapshot_arn,
                S3BucketName=s3_bucket,
                S3Prefix=s3_prefix,
                IamRoleArn=iam_role_arn,
                KmsKeyId=kms_key_id
            )
            exports_started.append(f"Biweekly: {export_id} (snapshot: {snapshot_date})")
            print(f"Started biweekly export: {export_id} (age: {biweekly_snap['age_days']} days)")
        except rds.exceptions.ExportTaskAlreadyExistsFault:
            print(f"Export already exists: {export_id}")
        except Exception as e:
            print(f"Error starting biweekly export: {e}")

    return {
        'statusCode': 200,
        'body': f"Exports started: {exports_started}"
    }
```
### Required Environment Variables
The Lambda expects these environment variables to be configured:
- `DB_CLUSTER_IDENTIFIER`: Your Aurora cluster identifier
- `S3_BUCKET`: Destination bucket for exports
- `IAM_ROLE_ARN`: Role with permissions for RDS export and S3 write
- `KMS_KEY_ID`: KMS key for encrypting exports
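A defensive pattern worth adding at the top of the handler (a sketch, not part of the code above; the example values are placeholders) is to fail fast when any of these are missing, rather than dying mid-run with a `KeyError`:

```python
import os

REQUIRED_VARS = ('DB_CLUSTER_IDENTIFIER', 'S3_BUCKET', 'IAM_ROLE_ARN', 'KMS_KEY_ID')

def load_config(env=None):
    """Return the required settings as a dict, raising early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}

# With everything set, you get a plain dict back (values here are made up):
config = load_config({
    'DB_CLUSTER_IDENTIFIER': 'my-cluster',
    'S3_BUCKET': 'my-export-bucket',
    'IAM_ROLE_ARN': 'arn:aws:iam::123456789012:role/export',
    'KMS_KEY_ID': 'alias/exports',
})
```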
## S3 Lifecycle Configuration
The lifecycle rules handle both storage class transitions and cleanup:
```json
{
  "Rules": [
    {
      "ID": "BiweeklyLifecycle",
      "Filter": { "Prefix": "biweekly/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "Expiration": { "Days": 180 }
    },
    {
      "ID": "MonthlyLifecycle",
      "Filter": { "Prefix": "monthly/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
```
Exports initially land in S3 Standard, then automatically transition to cheaper storage classes after 30 days.
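The economics of those transitions, using approximate us-east-1 list prices (assumptions for illustration — actual prices vary by region and change over time):

```python
# Approximate per-GB-month list prices (us-east-1; assumptions, check current pricing)
PRICES = {
    'STANDARD': 0.023,
    'STANDARD_IA': 0.0125,
    'DEEP_ARCHIVE': 0.00099,
}

def monthly_cost(size_gb, storage_class):
    """Storage cost in dollars for one month at the given class."""
    return size_gb * PRICES[storage_class]

# A 1 TB monthly export, per month of storage:
standard = monthly_cost(1024, 'STANDARD')      # roughly $23.55
deep     = monthly_cost(1024, 'DEEP_ARCHIVE')  # roughly $1.01
```

The transition to Deep Archive cuts the per-month cost by more than 20x, which is where most of the savings in this setup come from.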
## The Results
The numbers speak for themselves:
| Metric | Before | After |
|---|---|---|
| Monthly cost | ~$2,500 | ~$130 |
| Reduction | - | ~95% |
| Backup locations | 3 | 1 |
| Manual work | Frequent | None |
Bonus: S3 is globally accessible, so querying archived data from any region is straightforward. Set up an Athena table, point it at the Parquet exports, and run SQL queries against years of historical data whenever you need it.
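Running such a query from code is a single Athena API call. A sketch — the table, database, and bucket names are hypothetical, and the `client` parameter is injected so the function can be exercised without AWS credentials:

```python
def start_athena_query(sql, database, output_s3, client=None):
    """Kick off an Athena query against the exported Parquet data.

    In the real world you would pass boto3.client('athena'); a stub
    client makes this testable offline.
    """
    if client is None:
        import boto3  # deferred so the module imports without boto3 installed
        client = boto3.client('athena')
    return client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_s3},
    )

# Hypothetical table and column names -- adjust to your export layout.
sql = "SELECT id, created_at FROM orders WHERE created_at >= DATE '2023-01-01'"
```

Athena returns a query execution ID; results land as files under the `OutputLocation` prefix, so even year-old Glacier-bound data stays one SQL statement away (after restore, for Deep Archive objects).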
## Takeaways
- “Keep everything forever” is not a backup strategy. It’s a ticking cost bomb.
- Redundancy ≠ safety. Three copies in slightly different places doesn’t triple your protection. Understand what you’re actually guarding against.
- Know the difference between recovery and archival. RDS snapshots restore databases. S3 exports (Parquet) are for querying and analytics - you can run SQL against them with Athena anytime. Different tools for different jobs.
- Use S3 storage classes. Glacier Deep Archive costs ~$1/TB/month. Just remember the 180-day minimum storage commitment.
- Automate the boring stuff. A simple Lambda running once a month replaced hours of manual work.
- Set up cost alerts early. I caught this because anomaly detection was a day-one priority.
Sometimes the best infrastructure improvements aren’t about adding capabilities - they’re about removing unnecessary complexity. This was one of those wins where doing less gave us more.