
EC2 vs S3 for storing compressed data

August 2011

Problem: Compress logfiles on EC2 and back them up to S3.

Should you trade CPU time (high compression rate) for S3 storage costs?
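
The total cost is simply the EC2 time spent compressing plus the S3 storage for the compressed output, which is exactly what the script at the bottom computes:

# total ($) = compress_time (h) * ec2_rate ($/h)
#           + compressed_size (GB) * s3_rate ($/GB/month) * months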

# 720 GB stored for 6 months
#
# cmdline      cpu (s/MB)  size (%)    time (h)    ec2 ($)     s3 ($)      total ($)
# -----------------------------------------------------------------------------------
# raw          0.000013    100.000000  0.002560    0.000973    604.800000  604.800973
# gzip -1      0.015814    14.600000   3.238656    1.230689    88.300800   89.531489
# gzip -5      0.023426    12.100000   4.797696    1.823124    73.180800   75.003924
# gzip -9      0.032528    11.300000   6.661632    2.531420    68.342400   70.873820
# bzip2 -1     0.191470    11.200000   39.213056   14.900961   67.737600   82.638561
# bzip2 -5     0.246373    8.700000    50.457088   19.173693   52.617600   71.791293
# bzip2 -9     0.303029    7.900000    62.060288   23.582909   47.779200   71.362109
# pbzip2 -1    0.105028    11.200000   21.509632   8.173660    67.737600   75.911260
# pbzip2 -5    0.113369    8.900000    23.217920   8.822810    53.827200   62.650010
# pbzip2 -9    0.141710    7.900000    29.022208   11.028439   47.779200   58.807639
# lzma -1      0.060559    11.100000   12.402432   4.712924    67.132800   71.845724
# lzma -5      0.759580    5.600000    155.561984  59.113554   33.868800   92.982354
# lzma -9      1.692910    3.500000    346.707968  131.749028  21.168000   152.917028
# lzop -1      0.002903    20.900000   0.594432    0.225884    126.403200  126.629084
# lzop -5      0.003011    20.700000   0.616704    0.234348    125.193600  125.427948
# lzop -9      0.086239    14.400000   17.661696   6.711444    87.091200   93.802644

Surprisingly, gzip -9 is almost as cost effective as bzip2 -9, the difference being that bzip2 -9 takes roughly 10x as long.

The table above was calculated for a standard Large EC2 instance ($0.38/h) storing to S3 at the first non-reduced tier ($0.14/GB/month).
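
Working out the gzip -9 row by hand as a sanity check (26.022 s CPU for the 800 MB sample, 11.3% compressed size):

# cpu      = 26.022 s / 800 MB          ~= 0.0325 s/MB
# time     = 737280 MB * 0.0325 s/MB    ~= 23982 s ~= 6.66 h
# ec2 ($)  = 6.66 h * $0.38/h           ~= $2.53
# s3 ($)   = 720 GB * 0.113 * $0.14 * 6 ~= $68.34
# total    = $2.53 + $68.34             ~= $70.87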

# ----
# S3
# ----
#                      Normal          Reduced
# First  1 TB / month  $0.140 per GB   $0.093 per GB
# Next  49 TB / month  $0.125 per GB   $0.083 per GB
# Next 450 TB / month  $0.110 per GB   $0.073 per GB
# Next 500 TB / month  $0.095 per GB   $0.063 per GB
#
# ----
# EC2
# ----
#
# Large                 $0.38 per hour
# Extra Large           $0.76 per hour
#
# Hi-Memory On-Demand Instances
#
# Extra Large           $0.57 per hour
# Double Extra Large    $1.14 per hour
# Quadruple Extra Large $2.28 per hour
#
# Hi-CPU On-Demand Instances
#
# Medium                $0.19 per hour
# Extra Large           $0.76 per hour

The code below was used to calculate the results; it is pretty interesting to plug in values and simulate various scenarios (thanks to @theo for the initial measurements):


MEASUREMENT_SAMPLE_SIZE_MB = 800 # size of the sample file the timings below were measured on
MEASUREMENTS = [
    # setup        cpu s     ratio %
    # ------------------------------
    ("raw"   ,     0.01,     100.0),
    ("gzip -1"   , 12.651,   14.6),
    ("gzip -5"   , 18.741,   12.1),
    ("gzip -9"   , 26.022,   11.3),
    ("bzip2 -1"  , 153.176,  11.2),
    ("bzip2 -5"  , 197.098,  8.7),
    ("bzip2 -9"  , 242.423,  7.9),
    ("pbzip2 -1" , 84.022,   11.2),
    ("pbzip2 -5" , 90.695,   8.9),
    ("pbzip2 -9" , 113.368,  7.9),
    ("lzma -1"   , 48.447,   11.1),
    ("lzma -5"   , 607.664,  5.6),
    ("lzma -9"   , 1354.328, 3.5),
    ("lzop -1"   , 2.322,    20.9),
    ("lzop -5"   , 2.409,    20.7),
    ("lzop -9"   , 68.991,   14.4)
]

EC2_COST_SEC = 0.38  / 3600.0 # standard Large instance: $0.38/hour -> $/second

S3_MB_COST_PER_MONTH = 0.140 / 1024.0 # first non-reduced S3 tier: $0.14/GB/month -> $/MB/month

DATA_SIZE_MB = 30*24*1024 # 1 month of data @ 1 GB/h

STORED_FOR_MONTHS = 6

print "%d GB stored for %d months" % (DATA_SIZE_MB/1024, STORED_FOR_MONTHS)
print "cmdline".ljust(12), "".join([
        h.ljust(12) for h in ["cpu (s/MB)",
                              "size (%)",
                              "time (h)",
                              "ec2 ($)",
                              "s3 ($)",
                              "total ($)"]])

for (cmdline, cpu_time, compress_rate) in MEASUREMENTS:
    cpu_time = cpu_time / float(MEASUREMENT_SAMPLE_SIZE_MB)
    compress_time = (DATA_SIZE_MB * cpu_time)
    ec2_cost = EC2_COST_SEC * compress_time
    s3_cost = (DATA_SIZE_MB * compress_rate / 100.0) * S3_MB_COST_PER_MONTH * STORED_FOR_MONTHS
    print cmdline.ljust(12), "".join([
            ("%.6f" % v).ljust(12)
            for v in [cpu_time,
                      compress_rate,
                      compress_time / 3600.0,
                      ec2_cost,
                      s3_cost,
                      ec2_cost+s3_cost]])
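
For example, to simulate the same job on a Hi-CPU Medium instance ($0.19/h) storing to reduced redundancy S3 ($0.093/GB for the first tier), prices taken from the lists above, only the two cost constants need to change:

EC2_COST_SEC = 0.19  / 3600.0           # Hi-CPU Medium: $0.19/hour -> $/second
S3_MB_COST_PER_MONTH = 0.093 / 1024.0   # reduced redundancy, first tier -> $/MB/month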