Scaling Data Ingestion Systems: From Perl to Go Part 1

A consequence of MediaMath’s astronomical growth over the past few years has been an equally huge growth in service usage. Rapid growth sometimes means that systems are built quickly, without firm plans for the future. A system with plenty of headroom can become insufficient in as little as six months, and technical debt becomes a tough challenge to address. We regularly face the question, “Do we try to re-write this, or do we modify what we already have to scale with the load we expect to see?”

Nowhere has this been clearer than in ingesting user data, which since 2011 has more than tripled in volume to greater than 500k QPS. In this post I’ll discuss MediaMath’s history of ingesting user data, and how we lowered latency, simplified concurrency, and increased capacity by taking an experimental leap and switching to Go.

The Initial Problem

An important component of advertising, online or offline, is finding an audience. Advertisers use different strategies here:

  1. Finding users they know are interested in their products
  2. Discovering and classifying new users who they think will be interested in their offerings

A number of companies tackle this problem in different ways. For instance, one company may claim to accurately determine users’ genders and income ranges, while another may claim to match offline users to online cookies. Further, advertisers and brands often have their own CRM systems that they’d like to bring on board.

To support this, we needed to provide a system for customers and partners to upload user IDs and segment membership information. Ideally, we would offer a few interfaces into this system for different use cases: batch, real-time, remodeling, etc. The solution needed to scale with data ingestion load and handle slightly different formats while adhering to a common specification.

Designing a System

To interact with the system, we decided on two methods: FTP and HTTP. FTP would be the primary interface, taking care of batch processing. For most offline data, ingesting batch files periodically is sufficient, as the user classification and data update cycle has a longer period (on the time scale of hours). HTTP would be for lower-latency data, where a time scale of hours is insufficient. The processing itself was handled by a series of translation steps to convert user-friendly representations to a common loader format. Then a loader application would load that data into our user database. This data would get pushed to our globally distributed bidders.

[Figure: high-level architecture, with FTP and HTTP ingestion feeding the translation steps, loader, and user database]
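To make the shape of this concrete, here is a minimal sketch of the two ingestion paths feeding one translation step. Everything in it is illustrative rather than MediaMath’s actual service: the record format, file name, and HTTP endpoint are invented for this post, and it is written in Go purely for brevity (the translators at this stage were Perl).

    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "os"
        "strings"
    )

    // toLoaderFormat stands in for the translation steps: it converts a
    // user-friendly record ("user_id|seg1,seg2") into a common loader format.
    func toLoaderFormat(line string) (string, bool) {
        parts := strings.SplitN(strings.TrimSpace(line), "|", 2)
        if len(parts) != 2 || parts[0] == "" {
            return "", false
        }
        return parts[0] + "\t" + parts[1], true
    }

    // ingestBatchFile is the FTP-style path: whole files, processed line by line.
    func ingestBatchFile(path string) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            if rec, ok := toLoaderFormat(scanner.Text()); ok {
                fmt.Println(rec) // in reality, written out for the loader
            }
        }
        return scanner.Err()
    }

    // ingestHTTP is the lower-latency path: one record per request.
    func ingestHTTP(w http.ResponseWriter, r *http.Request) {
        rec, ok := toLoaderFormat(r.URL.Query().Get("record"))
        if !ok {
            http.Error(w, "malformed record", http.StatusBadRequest)
            return
        }
        fmt.Println(rec)
        w.WriteHeader(http.StatusNoContent)
    }

    func main() {
        _ = ingestBatchFile("partner_batch.txt") // batch interface
        http.HandleFunc("/ingest", ingestHTTP)   // real-time interface
        _ = http.ListenAndServe(":8080", nil)
    }

Both paths funnel into the same translation step, which is what keeps the common loader format common.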

Since MediaMath’s earliest days, we’ve loved Perl. The architects behind many of our systems, including our campaign management API and this particular data ingestion system, loved the language’s tools, community, and style. Since the core of this translation process is processing streams of text, we decided to go with Perl.

This worked fairly well for some time. We were handling gigabytes of data, processing batch files a few times per day and HTTP data a few times per hour. Fortunately (or unfortunately), ingesting offline data caught on like wildfire. Next thing we knew, we were taking in data from numerous sources. Some of these partners had specifications of their own and couldn’t match our standard format. The system now had to handle many different setups, which made it extremely difficult to scale.

Evolution of a Prototype

Scaling and extending the system presented two seemingly contradictory approaches:

  1. Should we add run-time flexibility into a single system?
  2. Should we write focused, independent processors to make sure each is optimized?

The first option is a common way of handling this in a dynamic language, so that’s what we went with. Using extensive configuration files, we could fine-tune the behavior used to handle any given input. The architecture basically looked like:

[Diagram: a single, configuration-driven translation process handling all inputs]
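As a rough illustration of that idea (the option names and formats here are hypothetical, and the real system was Perl driven by far larger configuration files), a single code path reads per-source options and adjusts its behavior accordingly:

    package main

    import (
        "fmt"
        "strings"
    )

    // SourceConfig is a hypothetical slice of the options that steered how a
    // given partner's files were parsed and normalized.
    type SourceConfig struct {
        FieldDelimiter   string
        SegmentDelimiter string
        LowercaseIDs     bool
    }

    // translate applies the configured behavior to one input line and returns
    // a line in a common loader format ("id<TAB>seg1,seg2").
    func translate(cfg SourceConfig, line string) (string, error) {
        parts := strings.SplitN(strings.TrimSpace(line), cfg.FieldDelimiter, 2)
        if len(parts) != 2 {
            return "", fmt.Errorf("malformed line: %q", line)
        }
        id := parts[0]
        if cfg.LowercaseIDs {
            id = strings.ToLower(id)
        }
        segments := strings.Split(parts[1], cfg.SegmentDelimiter)
        return id + "\t" + strings.Join(segments, ","), nil
    }

    func main() {
        // Each partner gets its own configuration; the code path is shared.
        partnerA := SourceConfig{FieldDelimiter: "|", SegmentDelimiter: ";", LowercaseIDs: true}
        out, _ := translate(partnerA, "ABC123|sports;travel")
        fmt.Println(out) // abc123	sports,travel
    }

All of the flexibility lives in the configuration, which is how those configuration files ended up so extensive.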

However, this brought scaling challenges. Even after trying to parallelize workflows using Linux streams (more on that later), we started to fall behind on processing, with ingestion times taking anywhere from six to 24 hours. Further, it didn’t look like the incoming data volume would taper off any time soon.

So we tried splitting up the processes themselves. This made the architecture look more like:

[Diagram: separate, focused translation processes, one per input type]

With limited controls for concurrency, we designed the system to be line-atomic and used Linux process controls to parallelize it. Driven by nightmarish configuration files, each file was piped into GNU parallel to split it up, then into the translator with flags specifying how it was to be processed, and finally into pigz (parallel gzip) for compression. It ended up looking something like:
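(The exact flags and file names are long gone, so take this as a reconstruction of the general shape rather than the literal command.)

    # Split each incoming file across cores with GNU parallel, run the Perl
    # translator on each chunk per its config, then compress with pigz.
    zcat partner_file.gz \
        | parallel --pipe --block 100M "perl translate.pl --config partner.conf" \
        | pigz > loader_ready.gz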

Yikes. If that doesn’t look terrifying, talk to your nearest DevOps expert.

Pretty soon, our data volume was once again pushing the limits of our architecture. We were falling behind, throwing more hardware at the problem. Splitting this over many machines brought deployment problems. We needed to re-think this.

Going Go

MediaMath has a strong culture of experimentation. If you’ve got an itch to solve a problem we have, and you come up with a good solution, we try to minimize the barriers to seeing the results of your efforts. This was a system that was staggering under its own weight. I had spent some time learning Go, and was curious if it could help us redesign this system to scale more effectively.

So, I began re-writing this process. I had two main goals at the outset:

  1. Simplify concurrency and parallelism
  2. Scale without complicating deployment

Using Go here intrigued me because I wondered if I could meet the above goals using, respectively:

  1. Goroutines and channels, Go’s built-in concurrency primitives
  2. Interfaces

I wondered if this duality of run-time dynamism vs. deployment terror could be solved by clever use of Go’s offerings. In next week’s post, I’ll explain how we deployed concurrency in Go and the (expected and unexpected) gains we experienced.
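As a rough preview of what that could look like, here is a minimal, self-contained sketch: a Source interface abstracts over where records come from, and goroutines plus channels replace the external-process plumbing. The names and structure are illustrative only, not the production code I’ll describe next week.

    package main

    import (
        "bufio"
        "fmt"
        "strings"
        "sync"
    )

    // Source abstracts over where records come from (batch files, HTTP, ...),
    // which is what lets a single binary serve different deployments.
    type Source interface {
        Lines() <-chan string
    }

    // stringSource is a toy Source that streams lines from an in-memory string.
    type stringSource struct{ data string }

    func (s stringSource) Lines() <-chan string {
        out := make(chan string)
        go func() {
            defer close(out)
            scanner := bufio.NewScanner(strings.NewReader(s.data))
            for scanner.Scan() {
                out <- scanner.Text()
            }
        }()
        return out
    }

    // translate stands in for the real translation step.
    func translate(line string) string {
        return strings.ToLower(line)
    }

    func main() {
        src := stringSource{data: "USER1|sports\nUSER2|travel\n"}
        results := make(chan string)

        // Fan translation out across goroutines instead of external processes.
        var wg sync.WaitGroup
        lines := src.Lines()
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for line := range lines {
                    results <- translate(line)
                }
            }()
        }
        go func() { wg.Wait(); close(results) }()

        for rec := range results {
            fmt.Println(rec)
        }
    }

A file-backed or HTTP-backed input is just another type satisfying the same Source interface, so adding one doesn’t change the deployment story.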


PRASANNA SWAMINATHAN
Director, Developer Relations

Prasanna Swaminathan is the Director of Developer Relations at MediaMath. In addition to teaching people how to pronounce his name, he leads a team that works with clients across the globe to open our API and build on our platform. Prasanna holds a BA in Physics from the University of Pennsylvania. Prior to joining MediaMath, he worked on power relay firmware systems, though he likes to think that prior to joining MediaMath, the matter/anti-matter ratio was even.

3 responses to “Scaling Data Ingestion Systems: From Perl to Go Part 1”

  1. Hi,

    Have you thought about using FPGA decoders with protobuf engines or similar decoders/normalizers?

    Similar solutions are used in the low-latency HFT space (usually with FastFix decoders) and achieve raw line speeds (~tens of millions of packets per second). They then feed an application that is bound to a CPU via PCIe.

    Sounds interesting!

  2. Hi Mariusz,

    Great question. We have experimented with FPGA decoders and other kinds of hardware. However, we found that the high cost, and the high cost per unit of data processed, pushed this solution down the list. The ideal solution here would be one that did not need specialized hardware.

    In fact, part of the reason this worked out the way it did is that the solution could use the same commodity hardware we had anyway! Approaching the problem with software made it easier for the powers that be to approve moving many production systems over to something entirely rewritten. Using Go with commodity hardware allowed us to move more quickly without changing many layers in between.

    Thanks,
    Prasanna

    • Hi Prasanna,

      Thanks for your reply. As you mentioned, cost and flexibility are not FPGAs’ strong sides compared to purely software solutions, so if you don’t have to go that way, you don’t. I guess that might change once traffic increases to the point where software struggles to parse it in a reasonable time window. AFAIK the time windows you are dealing with are fixed and wide, in contrast to the HFT space, where first wins.

      Regards, Mariusz.
