Search results for: hadoop

Extending Play’s validation to work with Big Data tools like DynamoDB, S3, and Spark

// 03.18.2015 // Data

In this two-part blog series, we are looking at how MediaMath uses Play’s API to perform data validation on big data pipelines. In part one, we covered data validation with Play’s combinator-based API. In part two, we’ll extend that data validation to work with AWS DynamoDB, S3, and Spark. Extending validation to work with AWS DynamoDB: MediaMath uses a variety of technologies in our analytics stack, including AWS DynamoDB. DynamoDB is a distributed, fault-tolerant key-value store offered as a service, which makes it easy to store and query massive datasets. We use it to power a few internal troubleshooting […]
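
The series itself builds on Play’s validation API; purely as an illustration of the idea, the sketch below reuses Play’s JSON Reads combinators to validate an item fetched from DynamoDB. The Campaign type, its field names, and the minimal AttributeValue-to-JSON conversion are assumptions for the example, not code from the post.

    import scala.jdk.CollectionConverters._
    import com.amazonaws.services.dynamodbv2.model.AttributeValue
    import play.api.libs.json._
    import play.api.libs.functional.syntax._

    // Hypothetical record type; field names are illustrative, not MediaMath's schema.
    case class Campaign(id: String, budget: Double)

    // The same combinator-built Reads that validates web-tier JSON can validate
    // items pulled out of DynamoDB once they are converted to JsValue.
    implicit val campaignReads: Reads[Campaign] = (
      (__ \ "id").read[String] and
      (__ \ "budget").read[Double]
    )(Campaign.apply _)

    // Minimal conversion that only handles string and number attributes.
    def itemToJson(item: java.util.Map[String, AttributeValue]): JsValue =
      JsObject(item.asScala.toMap.map {
        case (k, v) if v.getN != null => k -> JsNumber(BigDecimal(v.getN))
        case (k, v)                   => k -> JsString(v.getS)
      })

    // itemToJson(item).validate[Campaign] then yields either a JsSuccess[Campaign]
    // or a JsError that accumulates every field-level problem.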

Scaling data tools: How Play enables strongly typed big data pipelines

// 03.04.2015 // Data

The other day, I was talking with a colleague about data validation, and the Play web framework came up. Play has a nice API for validating HTML form and JSON submissions. This works great when you’re processing small amounts of data from the web tier of your application. But could that same tech benefit a Big Data team working on a backend powered by Hadoop or Spark? We decided to find out, and the results were encouraging. The secret sauce? Play’s combinator-based approach to data validation. Whether your data is big or small, garbage in is garbage out: MediaMath processes TBs […]
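
For a feel of what that combinator-based API looks like, here is a minimal sketch using Play’s JSON Reads; the Signup type and its constraints are made up for the example.

    import play.api.libs.json._
    import play.api.libs.functional.syntax._

    // Hypothetical submission type; the fields and constraints are illustrative.
    case class Signup(email: String, age: Int)

    // A Reads built from small, reusable rules. Each field is validated
    // independently and the errors are accumulated instead of failing fast.
    implicit val signupReads: Reads[Signup] = (
      (__ \ "email").read[String](Reads.email) and
      (__ \ "age").read[Int](Reads.min(13))
    )(Signup.apply _)

    Json.parse("""{"email": "not-an-email", "age": 7}""").validate[Signup]
    // => JsError reporting both the malformed email and the out-of-range age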

Cut your run time from minutes to seconds with HBase and Algebird

// 02.04.2015 // Data

[Note: Code for this demo is available here: https://github.com/MediaMath/hbase-coprocessor-example] At MediaMath, our Hadoop data processing pipelines generate various semi-aggregated datasets based on the many terabytes of data our systems generate daily. Those datasets are then imported into a set of relational SQL databases, where internal and external clients query them in real time. When a query involves extra levels of aggregation on an existing dataset at run time, it starts to hog server resources and slow down response times. However, we have been able to reduce the query time on these terabyte-scale datasets from minutes to seconds by using a combination of […]
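
The working code is in the linked repository; as a rough sketch of the Algebird half of the idea, the snippet below merges per-region partial aggregates with a monoid. The campaign keys and the (impressions, spend) metrics are invented for the example.

    import com.twitter.algebird._

    // Hypothetical per-region partial aggregates: campaign id -> (impressions, spend).
    // In a coprocessor setup, each region server would compute one of these maps
    // server-side, so the client only has to merge small partial results.
    val partials: Seq[Map[String, (Long, Double)]] = Seq(
      Map("campaign-1" -> (1200L, 84.0), "campaign-2" -> (300L, 12.5)),
      Map("campaign-1" -> (800L, 55.0))
    )

    // Algebird provides a Monoid[Map[K, V]] whenever V has one (numeric tuples do),
    // so the merge is a single associative sum.
    val merged: Map[String, (Long, Double)] = Monoid.sum(partials)
    // => Map(campaign-1 -> (2000, 139.0), campaign-2 -> (300, 12.5))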

Making your local Hadoop more like AWS Elastic MapReduce

// 05.21.2014 // Data

A version of this article originally appeared on Ian’s personal blog here. At MediaMath, we’re big users of Elastic MapReduce (EMR). EMR’s incredible flexibility makes it a great fit for our data analytics team, which processes TBs of data each day to provide insights to our clients, to better understand our own business, and to power the various product back-ends that make Terminal 1 the “marketing operating system” that it is. An extremely important best practice for any analytics project is to ensure that local development and test environments match the production environment as closely as possible. This eliminates the […]

Building faster, scalable reporting with Hadoop-Impala

// 05.21.2014 // Infrastructure

As a leading DSP with billions of online ads running through our platform every day, one of our biggest problems is how to report attribution data (which ad led to which action, like a sale or online signup) to our clients frequently and reliably. The problem we are tackling, in numbers: A) 30-day impression volume = 35–40 billion records; B) 1-hour event/click volume = 15–20 million records. We need to join B (events) with A (impressions) twice every hour (once for events and once for clicks), find the matching records, perform complex sequencing […]
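
To make the shape of that hourly join concrete, here is an illustrative query held in a Scala string; the table and column names are hypothetical and only sketch the event-to-impression match described above.

    // Illustrative only: table and column names are made up, not MediaMath's schema.
    // The same query shape would be run a second time against the clicks data.
    val hourlyEventAttributionSql: String =
      """
        |SELECT e.event_id,
        |       e.event_time,
        |       i.impression_id,
        |       i.campaign_id
        |FROM   events_last_hour e
        |JOIN   impressions_30_day i
        |  ON   e.user_id = i.user_id
        |WHERE  i.impression_time <= e.event_time
      """.stripMargin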