Cassandra War Stories: Part 1

// 05.17.2016 // Data

This is part one of a multi-part series exploring the successes (and scars) that we’ve had while tuning Cassandra to perform well in MediaMath’s Data Management Platform.

Fast reads on time series data

We use Cassandra as the backend data store for our Data Management Platform (DMP) here at MediaMath. Advertisers use DMPs to store their first-party data as well as third-party data segments they buy, so they can deploy these to bid on ad opportunities for targeted audiences. This requires infrastructure that can handle extremely large volumes of data and search it very quickly. We chose Cassandra because of its proven track record in scaling to support massive amounts of data and queries. Anyone who has used Cassandra at scale knows there is a fair amount of fine tuning needed along multiple axes to get the performance characteristics your application needs: schema tweaking, modifying compaction strategies, and changing server settings such as bloom filters, cache sizes, or garbage collection. Cassandra has a lot of capability but – like a fancy sports car – it requires some fine tuning to get the most out of it. This is the first part in a series about how we have been tuning Cassandra to get very fast read performance out of a time series data feed.

Time series, the anti-pattern

A particular use case we have in the DMP is a time series data set. A stream of incoming events, keyed by UUID, is stored as we receive it. The columns are collections, allowing us to support multiple different attributes for each event. We store an event once and never delete it until some time later (at least 30 days, sometimes longer). Our reads are queries by UUID, where we typically want all the events in a given time frame for a particular UUID. In Cassandra parlance this is a query on the row key, also known as the partition key.
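
As a rough illustration, a table for this kind of feed might look like the following. This is a minimal sketch with hypothetical names, not our production schema: events are partitioned by UUID, clustered by event time, and carry their attributes in a collection column.

    CREATE TABLE events (
        id          uuid,               -- partition (row) key we query by
        event_time  timestamp,          -- clustering column; orders events within the row
        attributes  map<text, text>,    -- collection column holding per-event attributes
        PRIMARY KEY (id, event_time)
    );

With a schema shaped like this, "give me all events for this UUID" is a single-partition read: SELECT * FROM events WHERE id = ?.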

Our initial approach was to set TTLs on the data and let tombstones get created when the data ages out. As compactions happen, the tombstones and the data they cover get removed after gc_grace_seconds, which at its default of 10 days is far too long for us.
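
For example (hypothetical values, using the sketch table above), writing an event that expires after 30 days looks like this:

    INSERT INTO events (id, event_time, attributes)
    VALUES (123e4567-e89b-12d3-a456-426655440000, '2016-05-17 12:00:00',
            {'source': 'pixel'})
    USING TTL 2592000;   -- 30 days, in seconds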

This pattern is actually somewhat like a queue, which is a well-known anti-pattern in Cassandra. Our queue is much longer in this case (typically 30 days), but essentially data accumulates on the “front end” of the queue and gets removed from the “back end” through tombstones and background compactions. In the meantime, tombstones that have not yet been removed have to be reconciled on each read. This can be detrimental to read performance, since tombstones are actual records on disk and the merging process can be non-trivial if there are a lot of them. We tried turning down gc_grace_seconds to minimize the time tombstones hang around, but found our real problem was with compactions.
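
Turning down gc_grace_seconds is a one-line change (shown here on the hypothetical table from above), but remember the trade-off: it shortens the window in which repairs must propagate deletes before tombstones can be purged.

    ALTER TABLE events WITH gc_grace_seconds = 86400;   -- 1 day instead of the default 864000 (10 days)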

Compaction Complications

Our application does about four times as many reads as writes, and we were trying to get latency as low as possible. Cassandra, however, is well known for being optimized for writes – not reads – because architecturally it is based on a log-structured storage model. In this model, writes get appended to what is essentially an in-memory SSTable (called a Memtable) and are later flushed to disk. There is relatively little for Cassandra to do during a write. On the other hand, a read may have to reconcile data from multiple SSTables (and Memtables), which includes reading tombstones and excluding any TTL’d data.

Generally, the way you improve read performance is to fine-tune your compaction strategy. The more aggressive your background compaction, the sooner data gets merged (or deleted), so each read has less work to do. When enough compactions coalesce an entire row key into a single SSTable, reading that row becomes a single seek plus a sequential read, which can be very fast. This is a tradeoff, however – more background compaction work to reduce per-query expense – and it can be wasteful, especially if you are not reading all your data.
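
One hedged example of what "more aggressive" can mean under STCS: lower the minimum number of similarly sized SSTables that triggers a minor compaction (the table name is a placeholder; min_threshold defaults to 4).

    ALTER TABLE events
    WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                       'min_threshold': '2'};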

After running in production for a few months and taking on more large clients, we started to see problems. The first symptom was nodes running out of disk space. Digging in, we saw that compactions (under Size Tiered Compaction Strategy) were failing, leaving behind large temporary Data.db files.

Size Tiered Compaction Strategy (STCS) uses the size of SSTables to decide when to compact, grouping tables of roughly the same size into “tiers.” When a tier accumulates enough SSTables, it compacts them all into one larger SSTable, which lands in the next tier up.

If you aren’t deleting data (let’s ignore our TTLs and tombstones for a moment), the compacted SSTable will be roughly the size of the sum of all the input SSTables, which means you need up to 50% of your disk free to write the new SSTable before the inputs can be deleted. For example, compacting four 50 GB SSTables needs roughly another 200 GB free until the originals can be removed.

Furthermore, with large datasets such as ours (nodes holding hundreds of gigabytes of data, for example), it takes longer and longer to reach the thresholds on the higher tiers. Because top-tier compactions happened infrequently (sometimes days apart!), TTL’d data took longer and longer to be removed, compounding the problem of needing more disk space.

Other problems with TTLs

Beyond compaction, TTLs were becoming problematic because once set, they cannot easily be changed. Setting a new TTL on a record means rewriting it, and because we are streaming data into the system it is not practical to go back and rewrite TTLs on all the old data.

The TTLs became a long “timer” on our data that we just had to wait out, and that was not practical. They also cost a few extra bytes per record, which adds up at our volume.

Solution: Sharding by date

The solution came in the form of sharding (as it so often does with large scale systems). We decided to stop using TTLs altogether and take advantage of some properties of our data. We write once and read many times, which means that after we write our data we don’t touch it until we’re ready to delete (or archive) it. By grouping our data into keyspaces covering particular date ranges (one week each), we can tweak how Cassandra manages them to both improve read performance and make it easy to drop data off the back end of the “stream” at our discretion.

[Figure: sharding data by date into weekly keyspaces]

Once a week we create a new versioned, dated keyspace. We write events from the input stream into the most recent keyspace. Components that read data were enhanced to recognize the most recent N keyspaces and to read and merge data from them at the application level.
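
An illustrative sketch of what a weekly shard might look like (the naming scheme and replication settings here are made up for the example; the table mirrors the sketch schema from earlier):

    CREATE KEYSPACE dmp_events_v1_20160516
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

    CREATE TABLE dmp_events_v1_20160516.events (
        id          uuid,
        event_time  timestamp,
        attributes  map<text, text>,
        PRIMARY KEY (id, event_time)
    );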

This does mean there is a multiplier effect on reads (one request now becomes N) but it is worth it to be able to better manage our data. The oldest data can be dropped instantly with a single “DROP KEYSPACE” command. Magic!
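
Concretely (again with hypothetical keyspace names), a single lookup fans out into one query per active weekly shard, with the results merged at the application level, and retiring the oldest shard is a single statement:

    -- one query per recent weekly keyspace; results merged in the application
    SELECT * FROM dmp_events_v1_20160516.events WHERE id = 123e4567-e89b-12d3-a456-426655440000;
    SELECT * FROM dmp_events_v1_20160509.events WHERE id = 123e4567-e89b-12d3-a456-426655440000;
    SELECT * FROM dmp_events_v1_20160502.events WHERE id = 123e4567-e89b-12d3-a456-426655440000;

    -- dropping the oldest shard removes its data in one shot
    DROP KEYSPACE dmp_events_v1_20160411;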

Tuning for read performance

Because the older keyspaces (weekly shards) are no longer written to, we have an opportunity to optimize their read performance. We perform a major compaction on them so that all their SSTables are compacted down to a single SSTable. Read requests against a single SSTable on an SSD easily average below 1 millisecond!

[Figure: Cassandra response time]

We also switched to using Leveled Compaction Strategy (LCS) on the most recent keyspace. This strategy can often give better read performance due to its more aggressive compaction. And it has the added benefit that it does not require the same disk space overhead as STCS.
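
Applying it is a one-line schema change (keyspace and table names are placeholders):

    ALTER TABLE dmp_events_v1_20160516.events
    WITH compaction = {'class': 'LeveledCompactionStrategy'};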

Gotchas when changing compaction strategies

It turns out you can simply change the compaction strategy for a table (using ALTER TABLE), but be careful! As soon as you change it, Cassandra applies the new rules for grouping SSTables together, which can trigger large background compactions as it reshuffles data to fit the new strategy.

In our particular use case, after a week of collecting data we create a new dated keyspace. Writes start going to the new keyspace and we can perform a major compaction on the N-1 keyspace. However, we needed to switch from Leveled Compaction back to Size Tiered in order to compact down to a single SSTable (LCS does nothing for a major compaction). How do we do this? There is a three-step process (which we automate with Ansible, and which is sketched in full after the list):

  1. Switch to Size Tiered with minor compactions disabled: ALTER TABLE <table> WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 'false'};
  2. Perform a major compaction (nodetool compact ...)
  3. Turn minor compactions back on by setting 'enabled': 'true'.
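
Put together, the whole sequence looks roughly like this (placeholder keyspace and table names; step 2 is a nodetool command run from the shell, not CQL):

    -- 1. Switch back to STCS with minor compactions disabled
    ALTER TABLE dmp_events_v1_20160509.events
    WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                       'enabled': 'false'};

    -- 2. Major compaction down to a single SSTable (from the shell):
    --      nodetool compact dmp_events_v1_20160509 events

    -- 3. Re-enable minor compactions
    ALTER TABLE dmp_events_v1_20160509.events
    WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                       'enabled': 'true'};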

This process gets us from a write optimized schema to a read optimized schema without much trouble.

Conclusion

By recognizing the characteristics of our input data, tuning our compaction strategies, and doing some extra work at the application level, we were able to achieve excellent read performance on a time series data stream in Cassandra while gaining better control over when we expire data. Cassandra is a very capable database, but you should expect to spend a fair amount of time and resources understanding how it works and how to fine-tune it to your application, especially if you intend to operate at large scale or outside the range of typical use cases.

BRAD WASSON

Principal Software Engineer, Data Services

Brad Wasson is a Principal Software Engineer in the Data Services group at MediaMath. He leads the design and build of the real-time decisioning systems in MediaMath's data management platform. Prior to MediaMath he worked on the distributed systems that powered Akamai Technologies' Advertising Decision Solutions division. Brad has a BS in Computer Science from Rensselaer Polytechnic Institute.