How we increased our EC2 event throughput by 50%, for free

(TL;DR: using an SSD cache in front of EBS can give a massive I/O throughput boost)

Background

Internally, Swrve is built around an event-processing pipeline, processing data sent from 100 million devices around the world each month, in real time, with an average events-per-second throughput in the tens of thousands.

Each event processor (or EP) stores its aggregated, per-user state as BDB-JE files on disk, and we use EBS volumes to store these files.  We have a relatively large number of beefy machines in EC2 which perform this task, so we’re quite price-sensitive where these are concerned.

We gain several advantages from using EBS:

  • easy snapshotting: EBS supports this nicely.
  • easy resizing of volumes: it’s pretty much impossible to run out of disk space with a well-monitored EBS setup; just snapshot and resize (there’s a sketch of this just after the list).
  • reliability in the face of host failure: just detach the volumes and reattach elsewhere.
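
For anyone curious what the snapshot-and-resize dance looks like, here’s a rough sketch using the AWS CLI; the volume/instance IDs and device names are placeholders, and it glosses over quiescing writes to the volume first:

# snapshot the existing volume (IDs here are made up)
aws ec2 create-snapshot --volume-id vol-11111111 --description "pre-resize snapshot"
# create a bigger volume from that snapshot, in the same AZ as the instance
aws ec2 create-volume --snapshot-id snap-22222222 --size 400 --availability-zone us-east-1a
# swap the volumes over
aws ec2 detach-volume --volume-id vol-11111111
aws ec2 attach-volume --volume-id vol-33333333 --instance-id i-44444444 --device /dev/sdf
# then grow the filesystem on the instance
sudo resize2fs /dev/xvdf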

EBS hasn’t had a blemish-free reliability record, but that isn’t a problem for us: our primary backups live off-EBS, as compressed files on S3, so we’re reasonably resilient to any data corruption risks with EBS, and in another major EBS outage we could fire up replacement ephemeral-storage-based hosts from those backups.  In the worst case, we can even wipe out the BDB and regenerate it from scratch, since we keep the raw input files as the ultimate “source of truth”.
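
The bucket, key and target path below are made up, but the restore path really is this simple: stream a compressed backup down from S3 and unpack it onto whatever storage the replacement host has.

# hypothetical bucket/key names; stream the compressed backup straight into tar
aws s3 cp s3://example-ep-backups/ep-42/bdb-latest.tar.gz - | tar -xz -C /mnt/bdb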

One potential downside of EBS is variable performance; this can be addressed by using Provisioned IOPS on the EBS volumes, but this is relatively expensive.
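
For reference, the difference is just a volume type and an IOPS figure at creation time; something like the following (size, zone and IOPS numbers are illustrative), with the io1 volume adding a per-provisioned-IOPS charge on top of the per-GB one:

# standard, variable-performance EBS volume
aws ec2 create-volume --size 200 --volume-type standard --availability-zone us-east-1a
# Provisioned IOPS volume: you pay for the 2000 IOPS whether you use them or not
aws ec2 create-volume --size 200 --volume-type io1 --iops 2000 --availability-zone us-east-1a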

However, our architecture allows EPs to slow down without major impact elsewhere, and in extreme circumstances will safely fall back to a slower, non-realtime transmission system.  If things get that slow, they generally recover quickly enough, but in the worst-case scenario our ops team are alerted and can choose to reprovision on another host, or split up the EP’s load, as appropriate.  This allows us to safely use the (cheaper) variable-performance EBS volumes instead of PIOPS; it turns out these can actually perform very well, albeit in a “spiky” manner, with occasional periods of noticeably slower performance.

As we were looking at new EC2 instances several months back, we noticed that decent-spec, high-RAM, high-CPU instances were starting to appear with SSDs attached, in the form of the “c3” instance class.  Those SSDs, two 40GB devices per instance, were unfortunately far too small to fit our BDB files with enough headroom for safety.  But could they be used as a cache?

Our data model tends to see lots of accesses to recent data, with a very long tail of accesses to anything older than a few days, so there was a good chance caching would have a good effect on performance.  Let’s find out!

EBS Caching With SSDs

To try this out, we set up 4 c3.xlarge EC2 instances, restored a set of 24-hour-old BDB backups for one reasonably large app, and fired up the EP so that it’d run through its events as fast as it could read them.  Given our EP design, this was likely to be I/O-bound on read/write accesses to the BDB databases.

(Initially we tested with a range of workloads, but we found that this single workload made a good performance-measurement test and remained representative of the others, so we narrowed down to just this one to simplify the testing.)

We tried out four caching systems: bcache, zfs, dm-cache, and flashcache.  Each was configured to use one of the 40GB SSD ephemeral drives as a dedicated cache, in front of a 200GB EBS volume.

bcache: we discarded this early on, since it caused its test instance to lock up on a couple of occasions.  Some digging threw up this mailing list thread, which seemed to indicate that it didn’t coexist well with md_raid, which we were using.

In general, the tooling and reliability weren’t great, and we need something operable, stable, and safe for production use.

zfs: another early discard.  The memory overhead required to get decent caching was too high, given the amount of free memory we had available on the c3-class instances.  It just wasn’t going to fit.

dm-cache: didn’t perform particularly well and, like bcache, was pretty rough around the edges where tooling and operability were concerned.  As this tutorial notes, there was a 10-step process to manually configure and mount the cache; even worse, the backing-store volume was at risk if the cache was decommissioned without a manual dirty-blocks flush.  Eek.
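
To give a flavour of those rough edges: once you’ve done the metadata-device sizing sums, the heart of the manual setup is hand-assembling a device-mapper table along these lines (our rough recollection of the dm-cache table format; the partitions and block size are placeholders), and tearing it down safely means reversing all of it by hand, including the dirty-blocks flush:

# SSD split into a small metadata partition (xvdc1) and a cache-data partition (xvdc2),
# in front of the EBS backing device (xvdf); 512-sector cache blocks, writethrough mode
sudo dmsetup create cached-ebs --table "0 $(sudo blockdev --getsz /dev/xvdf) cache /dev/xvdc1 /dev/xvdc2 /dev/xvdf 512 1 writethrough default 0"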

(Note: this was 6 months ago! Apparently these rough edges have been fixed since then, so maybe it’s better nowadays.)

flashcache: this was quite performant, and the highest-scoring option in our first round of tests.  However, it was again a little rough around the edges, requiring kernel headers to build.  Writethrough caches did not persist across device removal or reboot, so cache re-creation would have had to be handled out of band, causing operability problems.
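
For comparison, creating the flashcache device itself is a one-liner once the module is built; this is from memory, so treat the exact flags as approximate, and the device names are placeholders:

# writethrough cache: SSD ephemeral (/dev/xvdc) in front of an EBS volume (/dev/xvdf)
sudo flashcache_create -p thru cachedev /dev/xvdc /dev/xvdf
# ...but after a reboot a writethrough cache is gone, and this has to be re-run by hand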

Here are some graphs of these initial candidates and how they performed, side-by-side:

Events Processed Per Second:

EPS graph

  • Blue: bcache
  • Red: zfs
  • Green: dmcache
  • Purple: flashcache

Y axis is events processed per second, higher is better; X axis is time.  You can see that flashcache gets through the workload pretty speedily. (As an aside, the gradual downward trend is probably due to BDB-JE behaviour with respect to JVM garbage collection: soon after startup, periodic GCs start kicking in and impact throughput slightly. We’ve found that RocksDB doesn’t suffer as badly from this.)

Read I/O Operations Per Second:

read IOPS graph

  • Blue: flashcache
  • Red: zfs
  • Green: dmcache
  • Purple: bcache

Y axis is Read IOPS, higher is better; X axis is time.

Write I/O Operations Per Second:

write IOPS graph

  • Blue: flashcache
  • Red: zfs
  • Green: dmcache
  • Purple: bcache

Y axis is Write IOPS, higher is better; X axis is time.

Obviously, flashcache is the clear winner.  But after this test round, we tried out one more option:

enhanceio: this is a fork of flashcache with several improvements, including better CLI tooling and udev rules that give easy persistence over reboots.  Benchmark-wise, it performed about the same as flashcache.  Since the operability and UI rough edges were the biggest issue with flashcache, this was a definite win.

Final results

Finally, after we decided on enhanceio as a caching subsystem, we benchmarked the effects of using 4 EBS volumes striped in a RAID-0 configuration as the backing store behind the cache, instead of a single EBS volume.  Here’s a graph of the results, comparing:

  • single 200GB EBS volume (blue)
  • 4 x 50GB EBS volumes in RAID-0 stripes (red)
  • 4 x 50GB EBS volumes in RAID-0 stripes, with an enhanceio cache device (green)

RAID 0 graph

Y axis is, again, events processed per second, higher is better; X axis is time.  RAID-0 striping is, unsurprisingly, a win — but using a 4-way stripe and enhanceio has effectively increased our throughput by over 50%, for free.  Not bad!

Reliability

Enhanceio was configured to use a write-through cache rather than write-back, so the risk of corruption is lower, and in addition (as noted previously) our EPs can easily restore from a backup and re-process from archived source data when necessary. Given this, we were happy to roll it out into production after a few days of 24-hour uptime on the test instance, plus some eyeballing of the service’s output and metrics.

Over the next few weeks, we carefully and gradually rolled out the new c3.xlarge-with-enhanceio configuration into production, replacing our existing fleet and consolidating to fewer instances.  This has now been in place across our production fleet for 6 months, and we have not observed any filesystem corruption or other issues with enhanceio.  It’s been a solid win.

Moar Graphs!

Here’s the CPU wait time, showing less CPU time spent waiting on I/O operations with the enhanceio option (note: measured out of a maximum of 400, since it sums CPU usage percentages across 4 vCPUs).

CPU wait graph

  • single 200GB EBS volume (blue)
  • 4 x 50GB EBS volumes in RAID-0 stripes (green)
  • 4 x 50GB EBS volumes in RAID-0 stripes, with an enhanceio cache device (red)

Another graph, showing the effect of enhanceio cache warmup.  As the cache gradually warms up after process startup, the IOPS performed on the cache’s SSD volume (red) climb in a nice up-and-to-the-right line until they level off at around 7,500 IOPS:

cache warmup graph

Configuration

Here are the commands we use to configure the c3.xlarge instances’ RAID array and enhanceio cache:


# build a 4-way RAID-0 stripe across the four EBS volumes, with a 256KB chunk size
yes | sudo mdadm --create /dev/md0 --level=0 -c256 --raid-devices=4 /dev/xvd[fghi]
# record the array in mdadm.conf (and the initramfs) so it reassembles at boot
echo 'DEVICE /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi' \
| sudo tee /etc/mdadm/mdadm.conf
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u
# bump read-ahead on the array to 65536 sectors (32MB)
sudo blockdev --setra 65536 /dev/md0
# create the enhanceio cache "cachedevice1", pairing the SSD ephemeral (/dev/xvdc)
# with the EBS array (/dev/md0), using a 4KB cache block size
sudo eio_cli create -d /dev/xvdc -s /dev/md0 -b 4096 -c cachedevice1
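
A couple of quick sanity checks afterwards confirm the array and cache are in place; the /proc path is from our recollection of the EnhanceIO docs, so double-check it against your install:

cat /proc/mdstat                          # md0 should show four active raid0 devices
cat /proc/enhanceio/cachedevice1/config   # cache mode, policy and block size as configured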

We also have the enhanceio device configured to auto-mount using udev, as described in their docs.  The enhanceio configuration is:


Policy           : lru
Mode             : Write Through
Block Size       : 4096
Associativity    : 256

EnhanceIO doesn’t do versioned releases (!), so our module is built from this git commit on master.  We’re using Linux kernel 3.11.0-20-generic on Ubuntu 12.04.3 LTS.
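
Building it is straightforward enough; roughly the following, though the repository layout and module names are from memory, and the exact commit we use is pinned in our configuration management rather than here:

sudo apt-get install -y git linux-headers-$(uname -r)
git clone https://github.com/stec-inc/EnhanceIO.git
cd EnhanceIO/Driver/enhanceio && make && sudo make install   # kernel modules
sudo modprobe -a enhanceio enhanceio_lru                     # core module plus the LRU policy module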

Future Work

Next up is a move from BDB-JE to RocksDB, due to its better performance, smaller file sizes, and friendlier open source licensing.  So how does RocksDB cope with this caching approach?  Even better, it turns out.  BDB has poor time-based locality: any given BDB logfile will tend to contain records with a wide range of last-written times, so a block-level cache ends up holding plenty of cold data alongside the hot.  LevelDB/RocksDB behaves more like an LRU cache, with the most recently modified data in the upper levels; this is helpful for caching and suits our use case well.

Credits

Thanks to Rob Synnott and Paul Dolan for this project!

Discuss this post on Hacker News, or on programming.reddit.

Justin Mason


Comments

  1. We were planning on upgrading to the C3 class of instance anyway, and sticking with EBS as the backing store. Once you’re using an instance which has SSD ephemeral storage, there are no additional fees to use that hardware.

    Thanks for commenting!


  2. there’s a chance that by having certain tuning options you could make flashcache work far better than enhanceio (unless they ported our v3 changes) – as we worked quite a bit on better cache efficiency


  3. Justin,

    if you look for alternatives to BDB then also give hamsterdb a try. It lacks the same “time-based locality”, but its generated files are much smaller than those of BDB or leveldb. Performance is usually better than leveldb, unless you bulk insert in the range of 20 million items.
    Here are benchmarks and more details:
    http://highscalability.com/blog/2014/8/13/hamsterdb-an-analytical-embedded-key-value-store.html

    The API of hamsterdb is very similar to BDB. Integrating it for a quick benchmark shouldn’t be much effort. If you need help or have questions then just send me a mail – chris at crupp.de.

    Best regards
    Christoph


  4. Sounds great.

    Would love to implement the same on our EC2 servers.
    My concern is stability – enhanceio is listed as experimental at this point, so I am not sure how stable it is.
    Did you guys have any trouble with it? Did you encounter data corruption?


  5. Thank you for sharing this! I’m wondering if you are still using this approach or if you’ve found issues with it or found better ways to scale?

    Charts / numbers look good, but Enhanceio repo seems very inactive, which is a bit concerning…

    Thanks!


