Breaking the logjam
At MediaMath, our infrastructure generates terabytes of business-critical messages every day, such as ad impression logs and tracking beacon events. A service we’ve developed within our TerminalOne technology platform, nicknamed the “MediaMath Firehose,” enables our internal analytics applications and bidding systems to generate meaningful insights and take action on all of the data from these messages in real time. This wasn’t always the case; traditionally, this data was made available in hourly or nightly batches. We needed a significant technical and cultural transformation to move from batching to streaming.
When we first began architecting our data delivery systems in the company’s early days, we made the decision to adopt a fairly common data transfer pattern: have each “producer” application, like a pixel server, group these messages into regularly rotated files locally and ship those files to a central analytics warehouse. This approach empowered us to significantly scale up our warehouse to support huge batch-processing analytics workflows.
Fig. 1: The original architecture of our data delivery systems.
However, all was not well on the data collection and extract/transfer/load (ETL) side. This initial architecture had some immediate disadvantages that continued to escalate as our daily data volumes grew from gigabytes to terabytes. Seemingly straightforward design changes, such as schema modifications or compressing the log files before delivery to reduce network congestion and cross-region latency, became major initiatives that required painful lock-step coordination between the consumer and producer applications. Producer servers failed, taking any as-yet undelivered files down with them. And any applications that needed real-time access to the raw data had to wait for minutes or even hours depending on log rotation cycles and delivery queues.
Our solution to this problem?
- Drop a message bus in the middle.
That’s the idea behind the MediaMath Firehose, in a nutshell. In practice this effort required identifying a good message bus, making significant changes to our data collection layer to feed directly into the stream, shifting our ETL processes to be subscribers of the stream, and overhauling how we think about consuming this data to focus more heavily on real-time actions and insights.
Our first step was identifying a suitable publish/subscribe messaging system. We liked Apache Kafka in early testing but ultimately chose Amazon’s Kinesis service as it best matched our needs for price, operational support and ease of use. I’ve talked about our use of Kinesis before and we felt this service was a good fit in our tech stack. Our team has no trouble pumping terabytes of data through Kinesis daily; if we need more capacity, we configure more shards and it “just works” at a reasonable cost. I’ll also be speaking at Amazon’s re:Invent customer conference later this year about our success to date with this service.
To Amazon’s credit, getting Kinesis set up was the easy part. The trickier projects involved setting up our data producers to stream and moving our ETL processes to sit on the consumer end of Kinesis. Many of our data producers were designed to generate periodic log files, rotate them on a schedule, and drop them on a central file server for batch processing. To complement this approach without dramatic changes to producers, we developed an agent nicknamed “Scribbler” that simply receives individual messages locally and relays them to Kinesis, and instructed the data producers to pipe their messages to this program instead of (or in some cases, in addition to) the existing log files. Scribbler manages the Kinesis connection and buffers locally or spills to disk in the event that Kinesis is unavailable. Our team also operates a service variant of Scribbler (brilliantly nicknamed “Nibbler”) that allows lightweight or external data producers to publish the stream remotely without the need to install the Scribbler agent.
The last major initiative was porting all of our batch-oriented ETL processes to be stream consumers. For the processes that really needed to remain batch-oriented (like nightly reports and unique-user counting over long time windows), we built a simple stream archiver utility that rolls up producer data by topic and drops batches into Amazon’s S3 object store. The others were granted read-only pull access to Firehose. Each process now needed a consumer application that was responsible for retrieving messages from Firehose as quickly as possible and tracking their own progress through the message queue. In the event of consumer failure, these programs were now expected to resume retrieving messages from their last Firehose checkpoint.
Fig. 2: The architecture of MediaMath Firehose, our streaming data delivery system.
Once we set up the initial streaming infrastructure, we immediately noticed the following patterns start to emerge company-wide:
Our innovation speed increased dramatically. Decoupling data producers from consumers allowed us to ramp up the use of new fields, new data sources, and general volume without being too concerned about the impact to consumers. On the other side, consumers now pulled the fields and streams they wanted at whatever rate they were able to handle and could enable/disable the subscription as needed.
The data pipeline became much more reliable. We no longer faced significant data loss when a data producer crashed, as messages were starting to come off these servers in real-time. In addition, if consumers crashed, they could simply resume from the last-known stream checkpoint. We also added some lightweight tracing to the stream to catch any faults or breakpoints along some of our more critical pathways.
We unlocked the potential of real-time analytics. Instead of waiting hours or even days for data availability, consumer applications began springing up that were designed for — and expected — their datasets to be made available within seconds of relevance, giving our reporting and optimization systems a major boost in performance. We’ll be highlighting some of these new applications in upcoming blog posts.
We’ll be open sourcing some of the components involved in this new system as we have opportunities. You can always keep track of all of MediaMath’s open source projects on GitHub. If that interests you, and you’re excited about designing and leveraging streaming architectures for massive data streams, we’re hiring. You should check out our careers page and join us!