Scaling data tools: How Play enables strongly typed big data pipelines

// 03.04.2015 // Data

The other day, I was talking with a colleague about data validation, and the Play web framework came up. Play has a nice API for validating HTML form and JSON submissions. This works great when you’re processing small amounts of data from the web-tier of your application. But could that same tech benefit a Big Data team working on a backend powered by Hadoop or Spark?

We decided to find out, and the results were encouraging. The secret sauce? Play’s combinator-based approach to data validation.


Whether your data is big or small, garbage in is garbage out

MediaMath processes TBs of online user behavior and advertising data every day. With hundreds of machines spread across multiple datacenters, legacy systems and partner-provided APIs, we inevitably receive bad data or invalid records from time to time. Systems built around file formats such as CSV or TSV are especially susceptible to encoding errors that can cause headaches for downstream processing systems.

So what are your options?

The first step of most data processing pipelines (be they single-node scripts or massive Hadoop jobs) is translating some kind of encoded wire format T into a record of type D for partitioning, joining, filtering or aggregating. In mathematical terms, you need a function translate: (input: T) => D, where input could be a parsed JSON object, a snippet of XML, an array of bytes or, in the case of tab- or comma-delimited files, an array of strings.

But what if the translation fails?

Think about processing a CSV file line by line. Each line has columns of different types (strings, integers, floating point numbers). What if someone puts “#@$?” where you were expecting a number, or leaves a required field blank? In other words, our function is only defined for some values of T: it’s a partial function. At MediaMath, we use Scala, so the natural choice would be to model this by throwing an exception or returning an Option[D].
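To make those two choices concrete, here is a minimal sketch using only the Scala standard library. The Impression record and its three columns are hypothetical and chosen purely for illustration; they are not MediaMath's actual schema.

```scala
import scala.util.Try

object TranslateSketch {
  // Hypothetical record type D, invented for this example.
  final case class Impression(userId: String, campaignId: Long, cost: Double)

  // Partial version: throws NumberFormatException on "#@$?" and
  // ArrayIndexOutOfBoundsException when a column is missing.
  def translateUnsafe(cols: Array[String]): Impression =
    Impression(cols(0), cols(1).toLong, cols(2).toDouble)

  // Total version: any failure, including a blank required userId, becomes None.
  def translate(cols: Array[String]): Option[Impression] =
    Try(translateUnsafe(cols)).toOption.filter(_.userId.nonEmpty)
}
```

Calling TranslateSketch.translate(line.split("\t")) on each line lets a job drop or count bad records instead of crashing, but a bare None says nothing about which column was wrong or why.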

That said, a richer validation API would really open up the possibility of sharing parsing logic between projects, easily swapping out the underlying wire format, using richer types like Joda Time’s DateTime instead of Strings or Ints and much more…

That’s where Play comes in.

Validation with Play’s combinator-based API

The Play framework has a robust API for doing just these sorts of validations against user-submitted HTML forms or JSON documents. In essence, it provides a function equivalent to validate[D]: (input: Json) => JsResult[D], where JsResult is a richer version of Option that can remember failures for each path or index of the JSON document.

Let’s take a look at an example service for creating new online advertising campaigns. First, we define our domain model, taking advantage of Scala’s strong typing and features such as Option to safely process values that may be missing.
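As a minimal sketch, such a domain model might look like the following. The field names are illustrative assumptions, not MediaMath's actual campaign schema.

```scala
object DomainModel {
  // Hypothetical campaign model; every field name here is an assumption.
  final case class Campaign(
    name: String,
    advertiserId: Long,
    dailyBudget: Double,
    landingPage: Option[String] // optional: may be absent from the submission
  )
}
```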

Next, we create a sort of recipe or formula, called a Reads, for mapping a JSON document into our domain model. Note that the parsing could fail for any number of reasons – such as users uploading a document with missing fields, or fields having the wrong data type. We use combinator syntax to join each field’s Reads with the and combinator, then pass the resulting values as the arguments to the Campaign constructor:
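A Reads along these lines could be sketched as below. It assumes the play-json library is on the classpath, and the Campaign fields and JSON keys are hypothetical; the case class is repeated so that the snippet compiles on its own.

```scala
import play.api.libs.json._
import play.api.libs.functional.syntax._

object CampaignReads {
  // Hypothetical campaign model, repeated here for self-containment.
  final case class Campaign(
    name: String,
    advertiserId: Long,
    dailyBudget: Double,
    landingPage: Option[String]
  )

  // One Reads per field, joined with `and`, then applied to the constructor.
  implicit val campaignReads: Reads[Campaign] = (
    (JsPath \ "name").read[String] and
    (JsPath \ "advertiserId").read[Long] and
    (JsPath \ "dailyBudget").read[Double] and
    (JsPath \ "landingPage").readNullable[String] // missing or null becomes None
  )(Campaign.apply _)
}
```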

Let’s use our new validator to parse some JSON:
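The sketch below shows one way this might look end to end, again assuming play-json is available. The Campaign model, its JSON keys and the parse helper are all hypothetical names introduced for illustration, and the model and Reads are repeated so that the snippet stands alone.

```scala
import play.api.libs.json._
import play.api.libs.functional.syntax._

object ValidateCampaign {
  // Hypothetical campaign model and Reads, repeated for self-containment.
  final case class Campaign(
    name: String,
    advertiserId: Long,
    dailyBudget: Double,
    landingPage: Option[String]
  )

  implicit val campaignReads: Reads[Campaign] = (
    (JsPath \ "name").read[String] and
    (JsPath \ "advertiserId").read[Long] and
    (JsPath \ "dailyBudget").read[Double] and
    (JsPath \ "landingPage").readNullable[String]
  )(Campaign.apply _)

  // Hypothetical helper: returns the campaign, or the JSON paths that failed.
  def parse(raw: String): Either[Set[String], Campaign] =
    Json.parse(raw).validate[Campaign] match {
      case JsSuccess(campaign, _) => Right(campaign)
      case JsError(errors)        => Left(errors.map { case (path, _) => path.toString }.toSet)
    }
}
```

A well-formed document such as {"name":"Spring Sale","advertiserId":42,"dailyBudget":150.0} comes back as a Right holding a Campaign whose landingPage is None. A document with a string in advertiserId and no dailyBudget comes back as a Left that reports both failing paths together, rather than stopping at the first error.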

This approach to data validation is fully type safe and very powerful. We’ve found multiple ways of using it at MediaMath.

In the next installment of this series, I’ll cover how MediaMath is extending data validation to work with Amazon Web Services DynamoDB, S3, and Spark.


IAN HUMMEL

Ian Hummel is the Director of Data Platform at MediaMath. He’s led a variety of product initiatives at MediaMath over the years and is currently focused on building a next-gen large scale analytics platform. Before MediaMath he worked in a variety of tech fields including enterprise search, video processing, identity federation, and mobile app development. He has a BA in Mathematics and Comp. Sci from Boston University and an MBA from INSEAD.