Cassandra War Stories: Part 2
In this series we have been relating some adventures MediaMath has had getting the NoSQL database Cassandra to work for our needs as we built out our Data Management Platform (DMP) service.
As mentioned in our previous post, we needed to do a fair amount of tuning to scale Cassandra to our workload. In this post we’ll focus on some of the techniques we developed (good and bad) to handle the rapid increase in our data ingest. Using a combination of freely available automation tools, custom tooling we built ourselves, and clever use of AWS services, we were able to scale both at the system level and at the human level.
Getting to push-button deployment on a budget
When rolling out a new service you really should have automation in place to make life easy on operations staff and to prevent the mistakes that are inevitable any time humans are involved. That process has been an interesting journey for us with Cassandra. There is open source software out there meant to streamline managing a Cassandra cluster, e.g. at Netflix, or this, this, or this, but we did not find anything that supported either the version of Cassandra we wanted or the custom operations we needed to do, so we set out to build our own solution.
For the read-heavy use case of the DMP we have been deploying Cassandra in AWS using i2 instance types with ephemeral SSD storage and lately c4 instance types backed by EBS. We use Ansible and a custom wrapper around Ansible called Touchdown that gives us better control over environmental conditions of a deployment.
There are a number of challenges in this environment. First, you can’t use AWS Auto Scaling with Cassandra, because adding and removing nodes from a cluster is not a trivial task at scale. It can take more than an hour for a node to join or leave a cluster in our environment because of the hundreds of gigabytes of data that need to be transferred.
Second, adding a node is a multi-step process. First you must provision a new instance with the right settings (including, but not limited to, the proper AMI, subnet, and security groups). Then you must attach the storage, which includes configuring RAID, creating filesystems, and mounting drives. Finally, you install the software, customize the configuration, and start the daemon so the node begins joining the cluster.
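The storage half of those steps can be sketched as Ansible tasks. This is an illustrative sketch only: the device names, RAID layout, and mount point below are assumptions for the example, not our exact configuration.

```yaml
# Illustrative only -- device names, array layout, and mount point are assumptions.
- name: create a RAID 0 array from the ephemeral SSDs
  command: mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
  args:
    creates: /dev/md0

- name: create a filesystem on the array
  filesystem:
    fstype: ext4
    dev: /dev/md0

- name: mount the Cassandra data directory
  mount:
    name: /var/lib/cassandra
    src: /dev/md0
    fstype: ext4
    state: mounted
```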
Lastly, you need good monitoring in place so you can get a complete view of the cluster and of nodes joining or leaving. In our case we have regular automation (more below) that manages state for multiple keyspaces, including compaction strategies and kicking off major compactions (see Part 1). You need to coordinate these events to be sure you don’t do stupid things such as joining nodes when the cluster is too busy, or decommissioning nodes while others are joining.
Cassandra scales, people don’t
Last fall we started to get significant customer uptake and realized we were on a pretty steep growth curve for onboarding data into the cluster. At the rate we were going, our 25-node cluster needed to scale to 70 nodes by Christmas. Some quick math told us that (a) we needed to start joining nodes at a fast pace and (b) we did not have enough automation.
On point (b): sometimes when you think you have automation, you actually don’t, or at least the pieces aren’t connected together the way you need. In particular, we had been creating new instances manually, adding their IPs to an Ansible inventory, and then installing. Our installation was automated, but manually creating the instances and checking in a new inventory file for every node was not going to scale. It was not too hard to automate creating instances, but it is ugly and not best practice to commit changes to config files as part of automated provisioning. We needed a cleaner way to create and discover nodes.
Fortunately, because we had built Touchdown, our custom wrapper around Ansible, we were able to use its support for discovering EC2 instances by tag in a way that let us truly automate the provisioning, installation, and joining of new Cassandra nodes. Here is how it works.
- Call Touchdown provision to allocate a new instance and tag it with Name=dmp-prod-cassandra and State=provisioned. This is an Ansible job that runs against localhost and calls the ec2 Ansible module to allocate an instance.
- Call Touchdown to install all nodes with the Cassandra Name tag and State=provisioned. This installs and configures the Cassandra software and starts the joining process.
- At completion of the install, update State=installed. This is just a task that uses the EC2 Ansible module to update tags.
- Meanwhile we report the output of ‘nodetool status’ to our monitoring system (Qasino, more below). This gives us the “Status” and “State” of every node, which we check throughout the process.
This allows us to do a couple of things. First, we can flag instances that have been allocated but not yet installed. Second, we can enforce checks such as making sure there is only one provisioned node to install (because we only want to join one node at a time). Third, we can monitor the health of the cluster before and during a join (for example, that all nodes are up and none are currently down, joining, or leaving).
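The status checks can be sketched in a few lines of Python. This is a minimal illustration, not our actual collector: it parses ‘nodetool status’ output, where the first column encodes Status (U=Up, D=Down) and State (N=Normal, J=Joining, L=Leaving, M=Moving), and decides whether it is safe to join a new node.

```python
import re

# Node lines in `nodetool status` output start with a two-letter code:
# Status (U=Up, D=Down) followed by State (N=Normal, J=Joining, L=Leaving, M=Moving).
NODE_LINE = re.compile(r"^([UD])([NJLM])\s+(\S+)")

def parse_nodetool_status(output):
    """Extract (address, status, state) for every node line in the output."""
    nodes = []
    for line in output.splitlines():
        m = NODE_LINE.match(line)
        if m:
            status, state, address = m.groups()
            nodes.append({"address": address, "status": status, "state": state})
    return nodes

def safe_to_join(nodes):
    """Only join a new node when every existing node is Up and Normal."""
    return all(n["status"] == "U" and n["state"] == "N" for n in nodes)

# Abbreviated sample output (addresses and host IDs are made up).
sample = """\
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID  Rack
UN  10.0.0.1   250.5 GB  256     12.5%  aaaa     1a
UJ  10.0.0.2   10.2 GB   256     ?      bbbb     1b
"""

nodes = parse_nodetool_status(sample)
```

With one node still joining (UJ), the guard refuses to add another, which is exactly the one-at-a-time rule described above.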
The beauty of this is that both the provisioning of nodes and the installation/joining are codified in Touchdown/Ansible, and the state of each instance is maintained by Amazon in its instance tags.
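To make the tag-based discovery concrete, here is a hypothetical sketch using boto3 directly. Touchdown uses its own mechanism; the helper name and the extra instance-state filter below are our own additions for illustration.

```python
# Hypothetical sketch of discovering instances by tag with boto3 --
# not Touchdown's actual implementation.
def tag_filters(**tags):
    """Build a boto3 describe_instances Filters list for tag=value pairs."""
    filters = [{"Name": "tag:" + key, "Values": [value]}
               for key, value in sorted(tags.items())]
    # Only consider running instances (extra filter added for illustration).
    filters.append({"Name": "instance-state-name", "Values": ["running"]})
    return filters

filters = tag_filters(Name="dmp-prod-cassandra", State="provisioned")
# With AWS credentials configured, the lookup itself would be:
#   import boto3
#   ec2 = boto3.client("ec2")
#   reservations = ec2.describe_instances(Filters=filters)["Reservations"]
```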
Here is the Ansible task that we use to set a freshly installed Cassandra node’s EC2 tag State to installed:
- name: set node state to installed
  ec2_tag:
    resource: "{{ ec2_instance_id }}"
    region: "{{ ec2_region }}"   # region variable name is an assumption
    state: present
    tags:
      State: installed
  when: ec2_instance_id is defined
Lastly, we’d like to share a piece of operations gold that we happened upon, which some folks might not have considered but which we have found incredibly helpful: we created a Jenkins instance used exclusively for production operations, including but not limited to deployments.
Typically people use Jenkins for building and packaging their components, and maybe running some automation around testing, continuous integration, or deployment, but it really is capable of running anything. In fact, it can also run jobs on a cron schedule, and everything it runs is verbosely logged and has its own workspace where files can be inspected later. Furthermore, it has HipChat notifications, so there are no excuses for not knowing some piece of automation failed. Additionally, it has user-level login and access control. All these things make it really useful as an automation environment for production tasks, especially when you have limited resources and want to stay nimble.
Once we set one up for this purpose, the number of ad hoc tasks that were suddenly consolidated and tracked in one place exploded (in a good way). Furthermore, the number of issues arising from individuals’ environments being out of whack decreased significantly.
One word of warning, though: it is tempting to start writing more and more complex shell scripts, or even Python/Perl, right in the Jenkins environment. Be cautious with this, as it is code and can be dangerous to run in production without dev-environment testing, version control, or a validation workflow (depending on what you are doing, of course). At a minimum, check all your code in, test it, and have Jenkins just check out the necessary tagged version. Enjoy your new super electronic push-button universe!
More on Monitoring Cassandra
To give some more detail on how we monitor Cassandra: we use a two-pronged approach. First we turn on Cassandra’s Graphite reporting using metrics-graphite.jar and metrics.yaml. You pass in the MBean patterns you want to keep an eye on and Cassandra automatically takes care of reporting these directly to Graphite. This data is super helpful and includes not only Cassandra-specific stats but JVM stats, including garbage collection metrics. These became extremely useful when debugging long pauses in the GC (see Part 3!). We also get system-level stats into Graphite by installing Diamond on every node.
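For reference, a minimal metrics.yaml along these lines looks roughly like the following. The host and patterns are examples only; consult the metrics-reporter-config documentation for the exact schema your Cassandra version expects.

```yaml
# Example sketch -- host and patterns are placeholders, not our production values.
graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    hosts:
      - host: 'graphite.example.com'   # assumption: your Graphite endpoint
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.metrics.+'
        - '^jvm.+'                      # JVM/GC metrics mentioned above
```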
The second prong is Qasino, a monitoring tool we wrote and open sourced to support a richer variety of stats that can be more easily cross-referenced using the power of SQL. Data is sent to Qasino by a couple of custom-built collectors that run nodetool regularly.
We use two dashboard mechanisms, one for each system. For Graphite we’ve had much success with Grafana (and we are a sponsor!), and for Qasino we use Turntable, which displays metrics in table form. Turntable is an internal tool we built but hope to open source soon.
A great feature of our Turntable dashboard is that it supports easy hyperlinking to other locations, which means we can easily drill down to time-series details in Grafana, jump over to the EC2 console for a particular instance, or even link to a Jenkins job to kick off automation with a single click!
For example, here is a page with links to kick off major compactions in various ways (by individual machine, by region, by availability zone, and by specific keyspace). The dashboard orders by total number of SSTables to help make informed decisions about which operations are most important:
In summary, we’ve evolved our system for provisioning, installing, and automating Cassandra clusters a lot in order to grow with the increase in client uptake. This involved making prodigious use of Ansible, taking advantage of AWS EC2 instance tagging, setting up a production Jenkins instance for operations, and configuring extensive monitoring. All these pieces fit together to let us grow Cassandra clusters into large-scale deployments in a safe, auditable, reproducible, and, probably most importantly, flexible way.