Extending Play’s validation to work with Big Data tools like DynamoDB, S3, and Spark

// 03.18.2015 // Data

In this two-part blog series, we are looking at how MediaMath uses Play’s API to perform data validation on big data pipelines. In part one, we covered data validation with Play’s combinator-based API. In part two, we’ll extend that data validation to work with AWS DynamoDB, S3, and Apache Spark.

[Figure: data pipeline with Play]

Extending validation to work with AWS DynamoDB

MediaMath uses a variety of technologies in our analytics stack, including AWS DynamoDB. DynamoDB is a distributed, fault-tolerant key-value store as a service that makes it easy to store and query massive datasets. We use it to power a few internal troubleshooting tools whose front-end/API layers are written in Play.

That said, a downside of using the AWS Java SDK from Scala is that it feels quite verbose and unidiomatic. We really liked the succinct JSON API from Play and wanted to see if it could be extended to create a data-binding layer for DynamoDB’s Item objects instead of JSON documents. It turned out to be quite easy, and the results are now open-sourced as Play DynamoDB.

Working with Play DynamoDB is very similar to working with the Play JSON API.
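
Here is a minimal sketch of the pattern. The shape follows Play JSON’s Reads combinators; the specific names below (the package, DdbKey, readOpt, reads) are illustrative and may not match the library’s API exactly.

    import com.amazonaws.services.dynamodbv2.document.Item

    // Illustrative import; see the Play DynamoDB project for the exact package
    import com.mediamath.playdynamodb._

    // 1. The domain object
    case class User(id: Long, name: String, email: Option[String])

    // 2. A blueprint for parsing a User out of a DynamoDB Item,
    //    using the same combinator style as a Play JSON Reads
    implicit val userReads = (
      DdbKey("id").read[Long] and
      DdbKey("name").read[String] and
      DdbKey("email").readOpt[String]
    )(User.apply _)

    // 3. Item => DdbResult[User]
    def parseUser(item: Item): DdbResult[User] = userReads.reads(item)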

As you can see, the code is almost the same:

  • Create your domain object of type D
  • Create your blueprint for parsing it from some type T, in this case a DynamoDB Item
  • Use Play’s functional/combinator constructs to map from Item => DdbResult[D]

Let’s take this further!

OK, how does that relate to processing big data pipelines?

There has been a lot of discussion about splitting Play’s JSON API out into its own project, as can be seen from this pull request. It makes a lot of sense, because it nicely generalizes the problem of translating data to and from a potentially unsafe wire format in a fully type-safe way.

Development work on the new Validation API happens on GitHub in the Play data validation library, where it already unifies parsing of JSON and HTML forms. Recently MediaMath submitted a patch to extend it to work with CSV/TSV delimited files, like so:
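
Here is a sketch of that usage, binding a Contact case class from a delimited record (Delimited is just an alias for Array[String]); the import paths are approximate and may differ between versions of the library.

    import org.joda.time.LocalDate
    // Approximate imports for the validation API and its built-in rules
    import play.api.data.mapping._
    import play.api.data.mapping.Rules._

    case class Contact(name: String, email: String, birthday: Option[LocalDate])

    // Fields are addressed by position within the delimited record
    val contactReads = From[Delimited] { __ =>
      (
        (__ \ 0).read[String] and
        (__ \ 1).read(email) and
        // the literal "N/A" in the third column means "no birthday"
        (__ \ 2).read(optionR[LocalDate](Rules.equalTo("N/A")))
      )(Contact)
    }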

With the new API you create and combine Rules that can bind and validate records from TSV files. Add Apache Spark to the mix and you get a very compelling development environment for fast, reliable, and type-safe data processing over enormous datasets (which are often stored as lines of JSON or CSV/TSV).

Processing S3 Access Logs in Spark

We are heavy users of S3 at MediaMath and have enabled S3 Access Logging on many of the buckets we use for production datasets. An S3 Access Log looks like this:
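
Each line is a space-delimited record of fields such as bucket owner, bucket, timestamp, requester, operation, key, HTTP status, bytes sent, and so on. Here is an illustrative line in that format (the values are made up):

    79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be example-bucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be 3E57427F3EXAMPLE REST.GET.VERSIONING - "GET /example-bucket?versioning HTTP/1.1" 200 - 113 - 7 - "-" "S3Console/0.4" -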

What if we want to do some data analysis on these access logs, using Spark?

First, let’s create our domain object.
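
The case class below keeps an illustrative subset of the access-log fields; the real format has a few more columns (error code, object size, referrer, version id, and others).

    import org.joda.time.DateTime

    // A subset of the S3 access log fields, for illustration
    case class S3AccessLog(
      bucketOwner: String,
      bucket:      String,
      time:        DateTime,
      remoteIp:    String,
      requester:   Option[String],
      operation:   String,
      key:         Option[String],
      httpStatus:  Int,
      bytesSent:   Option[Long],
      totalTimeMs: Option[Long],
      userAgent:   Option[String]
    )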

S3 access logs need a little processing before we can treat them as simple space-delimited files. For example, empty values are represented by the string "-". Fortunately, we can account for all of that by chaining multiple Rules together to create a new Rule which maps from an Array[String] (aliased as Delimited) to our S3AccessLog domain object.
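
Here is a sketch of that Rule. It reuses the optionR/equalTo trick from the Contact example to turn "-" into None; the column indices refer to the trimmed field list above, and jodaDateR stands in for whatever date rule your version of the library provides.

    implicit val s3AccessLogRule = From[Delimited] { __ =>
      (
        (__ \ 0).read[String] and
        (__ \ 1).read[String] and
        // assumes a joda DateTime rule configured for "06/Feb/2014:00:00:38 +0000"
        (__ \ 2).read(jodaDateR("dd/MMM/yyyy:HH:mm:ss Z")) and
        (__ \ 3).read[String] and
        (__ \ 4).read(optionR[String](Rules.equalTo("-"))) and   // "-" => None
        (__ \ 5).read[String] and
        (__ \ 6).read(optionR[String](Rules.equalTo("-"))) and
        (__ \ 7).read[Int] and
        (__ \ 8).read(optionR[Long](Rules.equalTo("-"))) and
        (__ \ 9).read(optionR[Long](Rules.equalTo("-"))) and
        (__ \ 10).read(optionR[String](Rules.equalTo("-")))
      )(S3AccessLog)
    }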

Now we’re ready to start crunching these logs with Spark:
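
Here is a minimal sketch of such a job. The bucket path and the tokenize helper (which handles the quoted and bracketed fields mentioned above) are placeholders, and the Success match assumes the result type used by the validation library at the time of writing.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3-access-log-analysis"))

    // tokenize is a placeholder for splitting a raw log line into a Delimited record,
    // respecting quoted fields and the [bracketed] timestamp
    val logs = sc
      .textFile("s3n://my-log-bucket/access-logs/*")
      .map(line => s3AccessLogRule.validate(tokenize(line)))
      .collect { case Success(log) => log }   // records that fail validation are dropped

    // Example: total bytes served per key, with compile-time checked field access
    val bytesPerKey = logs
      .flatMap(log => log.key.map(k => (k, log.bytesSent.getOrElse(0L))))
      .reduceByKey(_ + _)

    bytesPerKey.take(10).foreach(println)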

The data pipeline above is fully typed, resilient to bad records, and can be joined, grouped, and aggregated with compile-time type checks and IDE-based code completion. Much nicer than hardcoding column names throughout your job!

More to come

This series just scratched the surface of what’s possible with the strong, combinator-based approach to data translation and validation offered by the new Play Validation API. I really hope the project catches on and can stand on its own two feet (without Play). In the future, we’d like to merge our Play DynamoDB library into it as well. As we’ve shown, the enhanced type safety and reusable parsing logic can be used in many ways outside of a traditional web app.

If you’d like more info, check out these links:

And if you want to learn more about how we use Spark, AWS, and Play to reimagine marketing here at MediaMath, send me a note at @themodernlife or get in touch in the comments.


IAN HUMMEL
Director of Data Platform

Ian Hummel is the Director of Data Platform at MediaMath. He’s led a variety of product initiatives at MediaMath over the years and is currently focused on building a next-gen, large-scale analytics platform. Before MediaMath he worked in a variety of tech fields, including enterprise search, video processing, identity federation, and mobile app development. He has a BA in Mathematics and Computer Science from Boston University and an MBA from INSEAD.

One response to “Extending Play’s validation to work with Big Data tools like DynamoDB, S3, and Spark”

  1. Can you point me to something that might explain this syntax?

        case class Contact(name: String, email: String, birthday: Option[LocalDate])

        val contactReads = From[Delimited] { __ => (
          (__ \ 0).read[String] and
          (__ \ 1).read(email) and
          (__ \ 2).read(optionR[LocalDate](Rules.equalTo("N/A")))
        )(Contact)}
