EC2 vs S3 for storing compressed data
25 Aug 2011
Problem: Compress logfiles on EC2 and back them up to S3.
Should you trade CPU time (high compression rate) for S3 storage costs?
# 720 GB stored for 6 months
#
# cmdline      cpu (s/MB)  size (%)    time (h)    ec2 ($)     s3 ($)      total ($)
# -----------------------------------------------------------------------------------
# raw          0.000013    100.000000  0.002560    0.000973    604.800000  604.800973
# gzip -1      0.015814    14.600000   3.238656    1.230689    88.300800   89.531489
# gzip -5      0.023426    12.100000   4.797696    1.823124    73.180800   75.003924
# gzip -9      0.032528    11.300000   6.661632    2.531420    68.342400   70.873820
# bzip2 -1     0.191470    11.200000   39.213056   14.900961   67.737600   82.638561
# bzip2 -5     0.246373    8.700000    50.457088   19.173693   52.617600   71.791293
# bzip2 -9     0.303029    7.900000    62.060288   23.582909   47.779200   71.362109
# pbzip2 -1    0.105028    11.200000   21.509632   8.173660    67.737600   75.911260
# pbzip2 -5    0.113369    8.900000    23.217920   8.822810    53.827200   62.650010
# pbzip2 -9    0.141710    7.900000    29.022208   11.028439   47.779200   58.807639
# lzma -1      0.060559    11.100000   12.402432   4.712924    67.132800   71.845724
# lzma -5      0.759580    5.600000    155.561984  59.113554   33.868800   92.982354
# lzma -9      1.692910    3.500000    346.707968  131.749028  21.168000   152.917028
# lzop -1      0.002903    20.900000   0.594432    0.225884    126.403200  126.629084
# lzop -5      0.003011    20.700000   0.616704    0.234348    125.193600  125.427948
# lzop -9      0.086239    14.400000   17.661696   6.711444    87.091200   93.802644
Surprisingly, gzip -9 is about as cost-effective as bzip2 -9 ($70.87 vs. $71.36
in total), even though bzip2 -9 takes roughly nine times as long.
The table above was calculated for a standard large EC2 instance ($0.38/h)
storing to S3 at the first non-reduced tier ($0.14/GB/month).
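As a rough check on that comparison, the two rows can be recomputed from the prices just quoted. This is only a minimal sketch: the function name `row_cost` and its parameters are illustrative, and the CPU timings are the ones taken on an 800 MB sample that appear in the script further down.

```python
# Recompute two table rows from the prices quoted above.
# row_cost and its parameter names are illustrative, not from the original script.
EC2_PRICE_PER_HOUR = 0.38      # standard large instance
S3_PRICE_PER_GB_MONTH = 0.14   # first non-reduced tier
DATA_GB = 720                  # one month of logs at 1 GB/h
MONTHS = 6
SAMPLE_MB = 800                # size of the sample the timings were taken on


def row_cost(sample_cpu_s, size_percent):
    """Return (hours, ec2_dollars, s3_dollars) for compressing DATA_GB once."""
    hours = sample_cpu_s / SAMPLE_MB * DATA_GB * 1024 / 3600.0
    ec2 = hours * EC2_PRICE_PER_HOUR
    s3 = DATA_GB * size_percent / 100.0 * S3_PRICE_PER_GB_MONTH * MONTHS
    return hours, ec2, s3


gzip9 = row_cost(26.022, 11.3)
bzip9 = row_cost(242.423, 7.9)
print("gzip -9 : %.2f h, $%.2f total" % (gzip9[0], gzip9[1] + gzip9[2]))
print("bzip2 -9: %.2f h, $%.2f total" % (bzip9[0], bzip9[1] + bzip9[2]))
print("bzip2 -9 takes %.1fx as long" % (bzip9[0] / gzip9[0]))
```

The EC2 part is paid once per compression run, while the S3 part scales with the number of months, which is why the two setups end up so close at six months.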
# ----
# S3
# ----
#                        Normal            Reduced
# First 1 TB / month     $0.140 per GB     $0.093 per GB
# Next 49 TB / month     $0.125 per GB     $0.083 per GB
# Next 450 TB / month    $0.110 per GB     $0.073 per GB
# Next 500 TB / month    $0.095 per GB     $0.063 per GB
#
# ----
# EC2
# ----
#
# Large                    $0.38 per hour
# Extra Large              $0.76 per hour
#
# Hi-Memory On-Demand Instances
#
# Extra Large              $0.57 per hour
# Double Extra Large       $1.14 per hour
# Quadruple Extra Large    $2.28 per hour
#
# Hi-CPU On-Demand Instances
#
# Medium                   $0.19 per hour
# Extra Large              $0.76 per hour
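The calculation in this post only uses the first S3 tier, which is enough for 720 GB. For larger volumes the tiers listed above would come into play. Here is a minimal sketch of a tiered monthly price, assuming each tier's rate applies only to the gigabytes that fall inside that tier and that 1 TB is 1024 GB. Both are assumptions, and the name `s3_monthly_cost` is illustrative.

```python
# Tiered S3 monthly cost, using the "Normal" column of the table above.
# Assumptions: marginal pricing per tier, 1 TB = 1024 GB; names are illustrative.
S3_TIERS_NORMAL = [
    (1 * 1024, 0.140),     # first 1 TB
    (49 * 1024, 0.125),    # next 49 TB
    (450 * 1024, 0.110),   # next 450 TB
    (500 * 1024, 0.095),   # next 500 TB
]


def s3_monthly_cost(stored_gb, tiers=S3_TIERS_NORMAL):
    """Return the monthly storage cost in dollars for stored_gb gigabytes."""
    remaining = stored_gb
    cost = 0.0
    for tier_size_gb, price_per_gb in tiers:
        in_tier = min(remaining, tier_size_gb)
        cost += in_tier * price_per_gb
        remaining -= in_tier
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("volume exceeds the tiers listed in the table")
    return cost


print("%.2f" % s3_monthly_cost(720))        # stays within the first tier
print("%.2f" % s3_monthly_cost(2 * 1024))   # spills into the second tier
```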
The code below was used to calculate the results. It is pretty interesting to plug in other values to simulate various scenarios (thanks to @theo for the initial measurements):
MEASUREMENT_SAMPLE_SIZE_MB = 800
MEASUREMENTS = [
    # setup        cpu s      ratio %
    # --------------------------------
    ("raw",        0.01,      100.0),
    ("gzip -1",    12.651,    14.6),
    ("gzip -5",    18.741,    12.1),
    ("gzip -9",    26.022,    11.3),
    ("bzip2 -1",   153.176,   11.2),
    ("bzip2 -5",   197.098,   8.7),
    ("bzip2 -9",   242.423,   7.9),
    ("pbzip2 -1",  84.022,    11.2),
    ("pbzip2 -5",  90.695,    8.9),
    ("pbzip2 -9",  113.368,   7.9),
    ("lzma -1",    48.447,    11.1),
    ("lzma -5",    607.664,   5.6),
    ("lzma -9",    1354.328,  3.5),
    ("lzop -1",    2.322,     20.9),
    ("lzop -5",    2.409,     20.7),
    ("lzop -9",    68.991,    14.4),
]
EC2_COST_SEC = 0.38 / 3600.0
S3_MB_COST_PER_MONTH = 0.140 / 1024.0
DATA_SIZE_MB = 30*24*1024 # 1 month of data @ 1 GB/h
STORED_FOR_MONTHS = 6
# Python 3 syntax: print() is a function here.
print("%d GB stored for %d months" % (DATA_SIZE_MB // 1024, STORED_FOR_MONTHS))
print("cmdline".ljust(12), "".join([
    h.ljust(12) for h in ["cpu (s/MB)",
                          "size (%)",
                          "time (h)",
                          "ec2 ($)",
                          "s3 ($)",
                          "total ($)"]]))
for (cmdline, cpu_time, compress_rate) in MEASUREMENTS:
    # CPU seconds needed per MB, derived from the sample measurement
    cpu_time = cpu_time / float(MEASUREMENT_SAMPLE_SIZE_MB)
    # Seconds of instance time to compress the whole data set once
    compress_time = (DATA_SIZE_MB * cpu_time)
    ec2_cost = EC2_COST_SEC * compress_time
    # Storage cost of the compressed data over the whole retention period
    s3_cost = (DATA_SIZE_MB * compress_rate / 100.0) * S3_MB_COST_PER_MONTH * STORED_FOR_MONTHS
    print(cmdline.ljust(12), "".join([
        ("%.6f" % v).ljust(12)
        for v in [cpu_time,
                  compress_rate,
                  compress_time / 3600.0,
                  ec2_cost,
                  s3_cost,
                  ec2_cost + s3_cost]]))
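One scenario worth plugging in is the storage duration, since the EC2 cost is paid once while the S3 cost grows every month. The sketch below reuses the measurements and prices from the script above and reports the cheapest setup for a few durations. The helper names `total_cost` and `cheapest` are illustrative and not part of the original code.

```python
# Cheapest setup as a function of how long the data stays in S3.
# Same measurements and prices as the script above; helper names are illustrative.
SAMPLE_MB = 800
MEASUREMENTS = [
    ("raw", 0.01, 100.0),
    ("gzip -1", 12.651, 14.6), ("gzip -5", 18.741, 12.1), ("gzip -9", 26.022, 11.3),
    ("bzip2 -1", 153.176, 11.2), ("bzip2 -5", 197.098, 8.7), ("bzip2 -9", 242.423, 7.9),
    ("pbzip2 -1", 84.022, 11.2), ("pbzip2 -5", 90.695, 8.9), ("pbzip2 -9", 113.368, 7.9),
    ("lzma -1", 48.447, 11.1), ("lzma -5", 607.664, 5.6), ("lzma -9", 1354.328, 3.5),
    ("lzop -1", 2.322, 20.9), ("lzop -5", 2.409, 20.7), ("lzop -9", 68.991, 14.4),
]
EC2_COST_SEC = 0.38 / 3600.0
S3_MB_COST_PER_MONTH = 0.140 / 1024.0
DATA_SIZE_MB = 30 * 24 * 1024  # 1 month of data @ 1 GB/h


def total_cost(cpu_s, ratio_percent, months):
    """One-off EC2 compression cost plus S3 storage cost for `months` months."""
    ec2 = cpu_s / SAMPLE_MB * DATA_SIZE_MB * EC2_COST_SEC
    s3 = DATA_SIZE_MB * ratio_percent / 100.0 * S3_MB_COST_PER_MONTH * months
    return ec2 + s3


def cheapest(months):
    """Return the cmdline of the setup with the lowest total cost."""
    return min(MEASUREMENTS, key=lambda m: total_cost(m[1], m[2], months))[0]


for months in (1, 6, 24, 60):
    print("%2d months: %s" % (months, cheapest(months)))
```

With these measurements, short retention favours the fast compressors and multi-year retention favours lzma, because its large one-off CPU cost is spread over many months of smaller storage bills.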