Search results for: big data

Cassandra War Stories: Part 1

// 05.17.2016 // Data

This is part one of a multi-part series exploring the successes (and scars) that we’ve had while tuning Cassandra to perform well in MediaMath’s Data Management Platform. Fast reads on time series data: We use Cassandra as the backend data store for our Data Management Platform (DMP) here at MediaMath. Advertisers use DMPs to store their first-party data, as well as the third-party data segments they buy, so that they can deploy these audiences when bidding on ad opportunities. This requires hardware that can handle extremely large volumes of data and search it very quickly. We chose to […]

Recapping AWS re:Invent—3 Things That Got Our Developers Excited

// 10.22.2015 // Infrastructure

There’s a reason Amazon Web Services (AWS) re:Invent has become the go-to conference for developers in the cloud computing industry: the product launches and releases announced there set the tone for how companies will grow and change over the coming year. The conference has seen enormous growth over its four-year history, and this year more than 18,000 developers descended on Las Vegas to witness 21 product and feature launches and over 275 sessions and workshops. Here is what members of the MediaMath team had to say about what they found the most exciting at re:Invent: Event-driven Architecture “Enterprises with strong […]

Counting at Scale: HyperLogLog to the Rescue

MediaMath processes many terabytes of data each day for the various reports available in T1. One metric we show is the number of unique impressions for each campaign: there is a big difference between showing an ad to 100 different people and showing the same ad to one person 100 times. While this is conceptually a simple problem, solving it at scale is not quite so straightforward. The canonical way to solve it would be, for any given campaign, to put the id of each person who saw an ad for that campaign into a set and then check […]
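The canonical set-based approach the excerpt describes can be sketched in a few lines of Python. This is a minimal illustration, not MediaMath's code; the record shape (`(campaign_id, user_id)` pairs) is an assumption. Its memory cost — one entry per distinct user per campaign — is exactly what motivates a probabilistic sketch like HyperLogLog at terabyte scale.

```python
# Exact unique counting: collect each viewer id for a campaign into a set,
# then take the set's size. Names here are illustrative, not a real schema.
from collections import defaultdict

def count_uniques(impressions):
    """impressions: iterable of (campaign_id, user_id) pairs."""
    seen = defaultdict(set)
    for campaign_id, user_id in impressions:
        seen[campaign_id].add(user_id)  # duplicates are absorbed by the set
    return {campaign: len(users) for campaign, users in seen.items()}

events = [("c1", "u1"), ("c1", "u1"), ("c1", "u2"), ("c2", "u1")]
print(count_uniques(events))  # {'c1': 2, 'c2': 1}
```

HyperLogLog replaces each set with a small fixed-size sketch that estimates the same cardinality to within a few percent, trading exactness for bounded memory.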

Scaling data tools: How Play enables strongly typed big data pipelines

// 03.04.2015 // Data

The other day, I was talking with a colleague about data validation, and the Play web framework came up. Play has a nice API for validating HTML form and JSON submissions. This works great when you’re processing small amounts of data from the web-tier of your application. But could that same tech benefit a Big Data team working on a backend powered by Hadoop or Spark? We decided to find out, and the results were encouraging. The secret sauce? Play’s combinator-based approach to data validation. Whether your data is big or small, garbage in is garbage out MediaMath processes TBs […]
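The combinator idea the excerpt credits to Play can be illustrated without Play itself. The Python sketch below is a hypothetical stand-in, not Play's Scala API: small per-field validators compose into a record validator, and errors accumulate rather than stopping at the first failure — the property that makes the style attractive for batch pipelines.

```python
# Combinator-style validation sketch. All names are hypothetical.
def is_int(v):
    return (v, []) if isinstance(v, int) else (None, ["expected int"])

def non_empty_str(v):
    return (v, []) if isinstance(v, str) and v else (None, ["expected non-empty string"])

def field(name, check):
    """Lift a value check into a validator for one field of a dict record."""
    def validate(record):
        if name not in record:
            return None, [f"{name}: missing"]
        value, errs = check(record[name])
        return value, [f"{name}: {e}" for e in errs]
    return validate

def combine(*validators):
    """Run every field validator, accumulating all errors (applicative style)."""
    def validate(record):
        values, errors = [], []
        for v in validators:
            value, errs = v(record)
            values.append(value)
            errors.extend(errs)
        return tuple(values), errors
    return validate

user = combine(field("name", non_empty_str), field("age", is_int))
print(user({"name": "Ada", "age": 36}))  # (('Ada', 36), [])
print(user({"name": "", "age": "old"}))
# ((None, None), ['name: expected non-empty string', 'age: expected int'])
```

In Play's Scala API the same shape is built from `Reads` instances joined with `and`; the payoff in a Hadoop or Spark job is that one pass over bad records reports every problem at once.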

Cut your run time from minutes to seconds with HBase and Algebird

// 02.04.2015 // Data

[Note: Code for this demo is available here: https://github.com/MediaMath/hbase-coprocessor-example] At MediaMath, our Hadoop data processing pipelines generate various semi-aggregated datasets based on the many terabytes of data our systems produce daily. Those datasets are then imported into a set of relational SQL databases, where internal and external clients query them in real time. When a query involves extra levels of aggregation on an existing dataset at run time, it starts to hog server resources and slow down response times. However, we have been able to reduce query times on these terabyte-scale datasets from minutes to seconds by using a combination of […]
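The core idea behind merging semi-aggregated data — the property Algebird's abstractions capture — can be sketched briefly. This is an illustrative Python analogue, not the HBase/Algebird code from the linked repo: when each partial result is an element of a monoid (here, dicts of counts merged by addition), servers can reduce partials in any order and ship the client only the combined answer.

```python
# Merging partial aggregates with an associative, commutative operation.
# Campaign names and counts are made up for illustration.
from collections import Counter
from functools import reduce

def merge(a, b):
    """Associative merge of two partial aggregates (Counter addition
    sums counts key by key, so grouping order doesn't matter)."""
    return a + b

# Partial aggregates as they might come back from three region servers:
partials = [
    Counter({"campaign_a": 10, "campaign_b": 5}),
    Counter({"campaign_a": 7}),
    Counter({"campaign_b": 3, "campaign_c": 1}),
]

total = reduce(merge, partials, Counter())
print(total)  # campaign_a: 17, campaign_b: 8, campaign_c: 1
```

Because the merge is associative, the reduction can run inside the storage layer (as an HBase coprocessor does) instead of shipping raw rows to the client — which is where the minutes-to-seconds speedup comes from.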