**TL;DR:** Inherited a system with triple-replicated database snapshots and no lifecycle policies. Replaced it with a single Lambda that exports select snapshots to S3 as Parquet for long-term archival. RDS snapshots handle actual recovery. Monthly backup bill dropped from ~$2,500 to ~$130.
## The Discovery
I recently took over management of a system that required extensive backups for compliance. Standard onboarding stuff - I set up cost anomaly detection, reviewed spending patterns, and started getting familiar with the infrastructure.
Then my alerts started firing.
Certain costs were consistently spiking above normal thresholds. After digging through the billing console and a few conversations with the team, I found the culprit: database backups that had quietly grown into a monster.
```mermaid
flowchart LR
    subgraph "What I Inherited"
        DB[(Aurora DB)] --> S1[Daily Snapshots]
        S1 --> R1[Region A<br/>Account 1]
        S1 --> R2[Region B<br/>Account 1]
        S1 --> R3[Region B<br/>Account 2]
    end
    style R1 fill:#ff6b6b,color:#fff
    style R2 fill:#ff6b6b,color:#fff
    style R3 fill:#ff6b6b,color:#fff
```
The backups were being triple replicated:
- Copied to a different region within the same account
- Copied again to that same region in a separate account
- No lifecycle policies, no S3 exports, no budget controls
These snapshots just… accumulated. Forever. The bill kept climbing and nobody noticed because “backups are important” - and they are, just not like this.
## What We Actually Needed
Before jumping to solutions, I reviewed the compliance requirements. For SOC2 and general data protection, we needed:
| Recovery Type | Access Speed | Retention |
|---|---|---|
| Short-term | Immediate | Days to weeks |
| Medium-term | Within hours | Months |
| Long-term | Within a day | Years (for audits) |
The existing setup was overkill for all three while somehow still being poorly organized. Time for a rethink.
## The Fix
The idea: use AWS’s native snapshot retention for short-term recovery, then selectively export snapshots to S3 for long-term archival. S3 lifecycle policies transition the data through storage tiers and handle cleanup automatically.
Important distinction: RDS snapshot exports land in S3 as Apache Parquet files - a columnar format optimized for analytics. These exports are designed for querying via Athena or Redshift Spectrum, not for database restoration. Need to pull historical records for an audit? Query them directly with SQL. Need to actually restore a database? That’s what the 30-day RDS snapshots are for.
```mermaid
flowchart TB
    subgraph "New Architecture"
        DB[(Aurora DB)] --> AUTO[Daily Snapshots<br/>30-day retention]
        AUTO --> LAMBDA[Lambda<br/>runs monthly]
        LAMBDA --> |"14-day snapshot"| BI[S3 Standard<br/>Parquet export]
        LAMBDA --> |"30-day snapshot"| MO[S3 Standard<br/>Parquet export]
        BI --> |"Lifecycle: 30 days"| BI_IA[S3 Infrequent Access]
        MO --> |"Lifecycle: 30 days"| MO_G[Glacier Deep Archive]
        BI_IA --> |"Lifecycle: 6 months"| DEL1[Auto-delete]
        MO_G --> |"Lifecycle: 2 years"| DEL2[Auto-delete]
    end
    style AUTO fill:#4ecdc4,color:#000
    style BI fill:#e8e8e8,color:#000
    style MO fill:#e8e8e8,color:#000
    style BI_IA fill:#45b7d1,color:#000
    style MO_G fill:#96ceb4,color:#000
    style LAMBDA fill:#ffd93d,color:#000
```
## How the Lambda Works
The Lambda runs on the last day of each month. It finds snapshots that are approximately 14 and 30 days old, then exports them to S3 with the appropriate prefix.
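"Last day of the month" can be expressed directly in an EventBridge schedule with the `L` day-of-month wildcard. A sketch with the AWS CLI — the rule name, account ID, and function name are placeholders, not from our setup:

```shell
# Hypothetical EventBridge rule: fire at 06:00 UTC on the last day of every month.
# "L" in the day-of-month field means "last day of the month".
aws events put-rule \
  --name monthly-snapshot-export \
  --schedule-expression "cron(0 6 L * ? *)"

# Point the rule at the Lambda (assumes the function already exists; you also
# need `aws lambda add-permission` so EventBridge may invoke it).
aws events put-targets \
  --rule monthly-snapshot-export \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:snapshot-export'
```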
```mermaid
flowchart TD
    START([Triggered]) --> FETCH[Fetch automated snapshots]
    FETCH --> FILTER[Filter available & calculate age]
    FILTER --> MONTHLY{~30 days old?}
    MONTHLY --> |Yes| EXPORTM[Export to monthly/]
    MONTHLY --> |No| BIWEEKLY
    EXPORTM --> BIWEEKLY
    BIWEEKLY{~14 days old?}
    BIWEEKLY --> |Yes| EXPORTB[Export to biweekly/]
    BIWEEKLY --> |No| DONE
    EXPORTB --> DONE([Done])
    style START fill:#ffd93d,stroke:#333
    style DONE fill:#9f9,stroke:#333
```
The key insight: pick snapshots that are approximately the right age (within a day or two) rather than requiring exact timing. This makes the system resilient to scheduling variations.
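That tolerant-matching idea boils down to a small helper. A sketch — `pick_snapshot` is an illustrative refactoring, not a function from the production Lambda:

```python
from datetime import datetime, timedelta, timezone

def pick_snapshot(snapshots, target_days, tolerance=1, now=None):
    """Return the snapshot whose age is closest to target_days,
    or None if nothing falls within +/- tolerance days."""
    now = now or datetime.now(timezone.utc)

    def age(s):
        return (now - s['SnapshotCreateTime']).days

    candidates = [s for s in snapshots if abs(age(s) - target_days) <= tolerance]
    if not candidates:
        return None
    return min(candidates, key=lambda s: abs(age(s) - target_days))

# Snapshots 13, 29, 30, and 31 days old: the 30-day one wins for "monthly".
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
snaps = [{'SnapshotCreateTime': now - timedelta(days=d)} for d in (13, 29, 30, 31)]
picked = pick_snapshot(snaps, 30, now=now)
```

If a daily snapshot was skipped or delayed, the nearest neighbor within the tolerance window is exported instead, which is exactly the resilience the exact-timing approach lacks.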
## The Retention Strategy
| Tier | Source | Storage | Retention | Purpose |
|---|---|---|---|---|
| Daily | Aurora automated | RDS Snapshots | 30 days | Actual recovery - restore DB directly |
| Biweekly | Lambda export | S3 → IA | 6 months | Archival, analytics via Athena |
| Monthly | Lambda export | S3 → Glacier Deep | 2 years | Compliance, audit trail |
Note: S3 Glacier Deep Archive has a 180-day minimum storage duration. Objects deleted before 180 days incur pro-rated charges. Our 2-year retention avoids this entirely.
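For intuition, the early-deletion charge is just the unserved remainder of the 180-day minimum, billed at the normal rate. A back-of-the-envelope sketch — the ~$0.00099/GB-month figure is an approximate us-east-1 list price and varies by region:

```python
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # approximate us-east-1 list price (assumption)

def early_delete_charge(size_gb, days_stored, minimum_days=180):
    """Pro-rated charge for deleting a Deep Archive object before the
    minimum storage duration has elapsed."""
    remaining_days = max(0, minimum_days - days_stored)
    return size_gb * DEEP_ARCHIVE_PER_GB_MONTH * (remaining_days / 30)

# Deleting a 1 TB export after 60 days still bills the remaining 120 days:
charge = early_delete_charge(1024, 60)   # roughly $4
# Our 730-day retention never comes close to the minimum:
safe = early_delete_charge(1024, 730)    # 0.0
```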
## The Code
```python
import boto3
import os
from datetime import datetime, timezone


def lambda_handler(event, context):
    rds = boto3.client('rds')
    cluster_id = os.environ['DB_CLUSTER_IDENTIFIER']
    s3_bucket = os.environ['S3_BUCKET']
    iam_role_arn = os.environ['IAM_ROLE_ARN']
    kms_key_id = os.environ['KMS_KEY_ID']
    now = datetime.now(timezone.utc)

    # Get all automated snapshots for the cluster
    response = rds.describe_db_cluster_snapshots(
        DBClusterIdentifier=cluster_id,
        SnapshotType='automated'
    )
    snapshots = response.get('DBClusterSnapshots', [])
    exports_started = []

    # Filter available snapshots and calculate age
    available_snapshots = []
    for snapshot in snapshots:
        if snapshot['Status'] != 'available':
            continue
        snapshot_time = snapshot['SnapshotCreateTime']
        age_days = (now - snapshot_time).days
        available_snapshots.append({
            'snapshot': snapshot,
            'age_days': age_days
        })

    # Find the single best snapshot for monthly (closest to 30 days)
    monthly_candidates = [s for s in available_snapshots if 29 <= s['age_days'] <= 31]
    if monthly_candidates:
        # Pick the one closest to 30 days
        monthly_snap = min(monthly_candidates, key=lambda x: abs(x['age_days'] - 30))
        snapshot = monthly_snap['snapshot']
        snapshot_arn = snapshot['DBClusterSnapshotArn']
        snapshot_date = snapshot['SnapshotCreateTime'].strftime('%Y%m%d')
        export_id = f"m-{snapshot_date}-{now.strftime('%Y%m%d')}"
        s3_prefix = f"monthly/{now.strftime('%Y/%m')}"
        try:
            rds.start_export_task(
                ExportTaskIdentifier=export_id,
                SourceArn=snapshot_arn,
                S3BucketName=s3_bucket,
                S3Prefix=s3_prefix,
                IamRoleArn=iam_role_arn,
                KmsKeyId=kms_key_id
            )
            exports_started.append(f"Monthly: {export_id} (snapshot: {snapshot_date})")
            print(f"Started monthly export: {export_id} (age: {monthly_snap['age_days']} days)")
        except rds.exceptions.ExportTaskAlreadyExistsFault:
            print(f"Export already exists: {export_id}")
        except Exception as e:
            print(f"Error starting monthly export: {e}")

    # Find the single best snapshot for biweekly (closest to 14 days)
    biweekly_candidates = [s for s in available_snapshots if 13 <= s['age_days'] <= 15]
    if biweekly_candidates:
        # Pick the one closest to 14 days
        biweekly_snap = min(biweekly_candidates, key=lambda x: abs(x['age_days'] - 14))
        snapshot = biweekly_snap['snapshot']
        snapshot_arn = snapshot['DBClusterSnapshotArn']
        snapshot_date = snapshot['SnapshotCreateTime'].strftime('%Y%m%d')
        export_id = f"bi-{snapshot_date}-{now.strftime('%Y%m%d')}"
        s3_prefix = f"biweekly/{now.strftime('%Y/%m')}"
        try:
            rds.start_export_task(
                ExportTaskIdentifier=export_id,
                SourceArn=snapshot_arn,
                S3BucketName=s3_bucket,
                S3Prefix=s3_prefix,
                IamRoleArn=iam_role_arn,
                KmsKeyId=kms_key_id
            )
            exports_started.append(f"Biweekly: {export_id} (snapshot: {snapshot_date})")
            print(f"Started biweekly export: {export_id} (age: {biweekly_snap['age_days']} days)")
        except rds.exceptions.ExportTaskAlreadyExistsFault:
            print(f"Export already exists: {export_id}")
        except Exception as e:
            print(f"Error starting biweekly export: {e}")

    return {
        'statusCode': 200,
        'body': f"Exports started: {exports_started}"
    }
```
### Required Environment Variables
The Lambda expects these environment variables to be configured:
- `DB_CLUSTER_IDENTIFIER`: Your Aurora cluster identifier
- `S3_BUCKET`: Destination bucket for exports
- `IAM_ROLE_ARN`: Role with permissions for RDS export and S3 write
- `KMS_KEY_ID`: KMS key for encrypting exports
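A defensive pattern worth adding at the top of the handler (a sketch, not part of the code above; the example values are placeholders) is to fail fast when any of these are missing, rather than dying mid-run with a `KeyError`:

```python
import os

REQUIRED_VARS = ('DB_CLUSTER_IDENTIFIER', 'S3_BUCKET', 'IAM_ROLE_ARN', 'KMS_KEY_ID')

def load_config(env=None):
    """Return the required settings as a dict, raising early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}

# With everything set, you get a plain dict back (values here are made up):
config = load_config({
    'DB_CLUSTER_IDENTIFIER': 'my-cluster',
    'S3_BUCKET': 'my-export-bucket',
    'IAM_ROLE_ARN': 'arn:aws:iam::123456789012:role/export',
    'KMS_KEY_ID': 'alias/exports',
})
```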
## S3 Lifecycle Configuration
The lifecycle rules handle both storage class transitions and cleanup:
```json
{
  "Rules": [
    {
      "ID": "BiweeklyLifecycle",
      "Filter": { "Prefix": "biweekly/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "Expiration": { "Days": 180 }
    },
    {
      "ID": "MonthlyLifecycle",
      "Filter": { "Prefix": "monthly/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
```
Exports initially land in S3 Standard, then automatically transition to cheaper storage classes after 30 days.
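The economics of those transitions, using approximate us-east-1 list prices (assumptions for illustration — actual prices vary by region and change over time):

```python
# Approximate per-GB-month list prices (us-east-1; assumptions, check current pricing)
PRICES = {
    'STANDARD': 0.023,
    'STANDARD_IA': 0.0125,
    'DEEP_ARCHIVE': 0.00099,
}

def monthly_cost(size_gb, storage_class):
    """Storage cost in dollars for one month at the given class."""
    return size_gb * PRICES[storage_class]

# A 1 TB monthly export, per month of storage:
standard = monthly_cost(1024, 'STANDARD')      # roughly $23.55
deep     = monthly_cost(1024, 'DEEP_ARCHIVE')  # roughly $1.01
```

The transition to Deep Archive cuts the per-month cost by more than 20x, which is where most of the savings in this setup come from.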
## The Results
The numbers speak for themselves:
| Metric | Before | After |
|---|---|---|
| Monthly cost | ~$2,500 | ~$130 |
| Reduction | - | ~95% |
| Backup locations | 3 | 1 |
| Manual work | Frequent | None |
Bonus: S3 is globally accessible, so querying archived data from any region is straightforward. Set up an Athena table, point it at the Parquet exports, and run SQL queries against years of historical data whenever you need it.
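Running such a query from code is a single Athena API call. A sketch — the table, database, and bucket names are hypothetical, and the `client` parameter is injected so the function can be exercised without AWS credentials:

```python
def start_athena_query(sql, database, output_s3, client=None):
    """Kick off an Athena query against the exported Parquet data.

    In the real world you would pass boto3.client('athena'); a stub
    client makes this testable offline.
    """
    if client is None:
        import boto3  # deferred so the module imports without boto3 installed
        client = boto3.client('athena')
    return client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_s3},
    )

# Hypothetical table and column names -- adjust to your export layout.
sql = "SELECT id, created_at FROM orders WHERE created_at >= DATE '2023-01-01'"
```

Athena returns a query execution ID; results land as files under the `OutputLocation` prefix, so even year-old Glacier-bound data stays one SQL statement away (after restore, for Deep Archive objects).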
## Takeaways
- “Keep everything forever” is not a backup strategy. It’s a ticking cost bomb.
- Redundancy ≠ safety. Three copies in slightly different places doesn’t triple your protection. Understand what you’re actually guarding against.
- Know the difference between recovery and archival. RDS snapshots restore databases. S3 exports (Parquet) are for querying and analytics - you can run SQL against them with Athena anytime. Different tools for different jobs.
- Use S3 storage classes. Glacier Deep Archive costs ~$1/TB/month. Just remember the 180-day minimum storage commitment.
- Automate the boring stuff. A simple Lambda running once a month replaced hours of manual work.
- Set up cost alerts early. I caught this because anomaly detection was a day-one priority.
Sometimes the best infrastructure improvements aren’t about adding capabilities - they’re about removing unnecessary complexity. This was one of those wins where doing less gave us more.