Video: Monte Carlo Simulations in Ad-lift Measurement Using Spark
Two weeks ago, engineers, developers and data scientists from all over the country packed into the Midtown Hilton in New York for Spark Summit East 2016, the largest big data event focused on Apache Spark. MediaMath’s SVP of Data Science, Prasad Chalasani, partnered with Ram Sriharsha, a Senior Member of Technical Staff at Hortonworks, to demonstrate how and why they used Spark in Monte Carlo simulations to measure ad lift, or the behavioral effect that advertisements can have on consumers. Watch Prasad’s presentation in its entirety below:
Most traditional applications of Spark involve massive datasets that already exist. A less commonly encountered but nevertheless extremely useful use case is simulation, where massive amounts of data are generated from model parameters. In this talk we explore some of the challenges that arise in setting up scalable simulations in a specific application, and share some of the solutions and lessons we learned along the way, in both mathematics and programming. The application scenario we explore is quantifying the impact of cookie contamination in randomized experiments aimed at measuring digital advertisement lift (effectiveness). Cookies are randomly assigned to test or control; those in test are exposed to ads while those in control are not. The goal is to measure the lift in conversion rate due to ad exposure.
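To make the experiment setup concrete, here is a minimal single-machine sketch in plain Python, not the Spark code from the talk. It assumes lift is defined as the relative difference in conversion rate between test and control, and the function name `run_trial` and all parameter values (`n_cookies`, `base_rate`, `true_lift`) are hypothetical choices for illustration.

```python
import random

def run_trial(seed, n_cookies=50_000, base_rate=0.05, true_lift=0.2):
    """Simulate one randomized experiment and return the measured lift.

    Assumption: lift = test_rate / control_rate - 1 (relative lift).
    """
    rng = random.Random(seed)  # per-trial generator, so trials are independent
    cookies = {"test": 0, "control": 0}
    conversions = {"test": 0, "control": 0}
    for _ in range(n_cookies):
        # Randomly assign each cookie to test or control with equal probability.
        group = "test" if rng.random() < 0.5 else "control"
        # Only test cookies see ads, so only they convert at the lifted rate.
        rate = base_rate * (1 + true_lift) if group == "test" else base_rate
        cookies[group] += 1
        if rng.random() < rate:
            conversions[group] += 1
    test_rate = conversions["test"] / cookies["test"]
    control_rate = conversions["control"] / cookies["control"]
    return test_rate / control_rate - 1

# Data is generated from model parameters rather than read from storage.
lifts = [run_trial(seed) for seed in range(20)]
mean_lift = sum(lifts) / len(lifts)
print(f"mean measured lift over {len(lifts)} trials: {mean_lift:.3f}")
```

Each trial depends only on its parameters and its seed, which is what makes this kind of workload a natural candidate for distributing trials across workers.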
One important factor that taints such measurements is cookie contamination: a real-world user may have multiple cookies (the system is unaware of this linkage), and if their cookies fall in both the test and control groups, then the cookie in control may show a higher conversion rate than a clean control cookie that has no “siblings” in the test group. Analytically quantifying the impact of this contamination is difficult without making overly simplistic assumptions, so one idea we pursued is to simulate it, with millions of trials over tens of millions of users. The goals are to (a) understand and quantify the impact of cookie distribution and contamination on the expected value of the computed lift as well as its 90% confidence interval, and (b) derive approximate analytical formulas for the observed lift. Scaling up the simulations to a large number of trials and users is challenging, and we share some of our solutions and also describe the analysis of error and expectation.
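The contamination effect can be illustrated with a toy simulation. This is a rough sketch under assumptions that are ours, not the talk's: each user has either one or two cookies, each cookie is assigned to test or control independently, a user is exposed if any of their cookies is in test, and a conversion is credited to one of the user's cookies chosen at random. The names `observed_lift`, `simulate` and `multi_cookie_share` are hypothetical, and the interval is a crude empirical 5th-to-95th percentile range over trials.

```python
import random

def observed_lift(n_users, multi_cookie_share, base_rate, true_lift, rng):
    """One trial: cookie-level lift, test_rate / control_rate - 1."""
    cookies = {"test": 0, "control": 0}
    conversions = {"test": 0, "control": 0}
    for _ in range(n_users):
        # Assumption: a user has 2 cookies with this probability, else 1.
        n_cookies = 2 if rng.random() < multi_cookie_share else 1
        groups = ["test" if rng.random() < 0.5 else "control"
                  for _ in range(n_cookies)]
        for group in groups:
            cookies[group] += 1
        # The user sees ads if any one of their cookies landed in test.
        exposed = "test" in groups
        rate = base_rate * (1 + true_lift) if exposed else base_rate
        if rng.random() < rate:
            # Assumption: the conversion is credited to a random cookie.
            conversions[rng.choice(groups)] += 1
    test_rate = conversions["test"] / cookies["test"]
    control_rate = conversions["control"] / cookies["control"]
    return test_rate / control_rate - 1

def simulate(n_trials, n_users, multi_cookie_share,
             base_rate=0.1, true_lift=0.2, seed=0):
    """Return (mean lift, (5th percentile, 95th percentile)) over trials."""
    rng = random.Random(seed)
    lifts = sorted(
        observed_lift(n_users, multi_cookie_share, base_rate, true_lift, rng)
        for _ in range(n_trials)
    )
    mean = sum(lifts) / n_trials
    low = lifts[int(0.05 * n_trials)]
    high = lifts[min(n_trials - 1, int(0.95 * n_trials))]
    return mean, (low, high)

clean_mean, clean_interval = simulate(8, 30_000, multi_cookie_share=0.0)
mixed_mean, mixed_interval = simulate(8, 30_000, multi_cookie_share=1.0)
print(f"no contamination:   mean lift {clean_mean:.3f}, 90% range {clean_interval}")
print(f"full contamination: mean lift {mixed_mean:.3f}, 90% range {mixed_interval}")
```

In this toy model, control cookies whose siblings sit in test pick up conversions from exposed users, so the observed lift tends to fall below the true lift as the share of multi-cookie users grows. The trial and user counts here are tiny compared with the millions of trials and tens of millions of users described above.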