Building faster, scalable reporting with Hadoop-Impala
As a leading DSP with billions of online ads running through our platform every day, one of our biggest challenges is how to frequently and reliably report attribution data (which ad led to which action, such as a sale or an online signup) to our clients.
The problem we are tackling, in numbers:
A) 30-day impression volume = 35 – 40 billion records
B) 1-hour event/click volume = 15 – 20 million records
We need to join B (events and clicks) with A (impressions) twice every hour (once for events and once for clicks), find the matching records, perform complex sequencing and allocation logic, run aggregations on the result, and send it to our reporting data marts for hourly client reporting. The process also needs to scale to 10x this volume in the future.
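To make the shape of the problem concrete, here is a minimal, in-memory sketch of the join-then-aggregate step. The record layouts (user_id, ad_id, timestamp) and the "latest prior impression wins" rule are simplified illustrations, not our actual schema or allocation logic; the real pipeline runs as Hive/Impala SQL over billions of rows.

```python
# Hypothetical sketch of the hourly attribution join. Field names and
# the attribution rule are illustrative only; the production version
# runs as distributed SQL, not in-memory Python.
from collections import defaultdict

def attribute(impressions, events):
    """Join each event to the most recent prior impression for the same user."""
    by_user = defaultdict(list)
    for imp in impressions:  # (user_id, ad_id, ts)
        by_user[imp[0]].append(imp)

    attributed = []
    for user_id, event_ts in events:  # (user_id, ts)
        # Simplified sequencing logic: only impressions that happened
        # before the event are candidates; the latest one wins.
        candidates = [i for i in by_user[user_id] if i[2] <= event_ts]
        if candidates:
            winner = max(candidates, key=lambda i: i[2])
            attributed.append((winner[1], user_id))
    return attributed

def aggregate(attributed):
    """Count attributed events per ad for the hourly report."""
    counts = defaultdict(int)
    for ad_id, _user in attributed:
        counts[ad_id] += 1
    return dict(counts)

impressions = [("u1", "adA", 100), ("u1", "adB", 200), ("u2", "adA", 150)]
events = [("u1", 250), ("u2", 140)]  # u2's event precedes any impression
print(aggregate(attribute(impressions, events)))  # prints {'adB': 1}
```

The same shape in SQL is a join of the hourly event batch against 30 days of impressions, followed by a window/ranking step and a GROUP BY, which is why join performance dominates the runtime.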
Until 2011, we were using IBM Netezza for its performance, but it came with a big price tag. We started looking for a solution that would give us scalability, improved performance, and cost effectiveness. We reviewed several vendor-based custom attribution options, like Citrusleaf (now Aerospike), but most of them fell short on cost or scalability. At the same time, we ran an internal proof of concept on a Hadoop/Hive combination in Amazon AWS. This verified Hadoop's basic capability as a replacement for our existing process, and gave us the confidence to move forward with it.
When we started the Hadoop development work, we initially chose Hive as our processing tool. The Hadoop cluster was built on solid hardware, but initial performance testing showed the whole process taking 5-6 hours to complete. We report hourly, so the process needs to finish in 30-40 minutes. That was a deal-breaker, and we had to search for other, faster options.
Next we tried Pig to see if it could work its magic. After a few design sessions, we rewrote the code in Pig and pushed it to performance testing. Unfortunately, Pig didn't give much of a performance benefit over Hive; it shaved only about 10% off our Hive timings.
Evaluating Impala Beta:
After Pig, we moved on to evaluating Impala, which was still in beta at the time. In test runs, the process completed in 2-3 hours. The environment was not stable, though: it was crashing data nodes and consuming too much memory.
So we went back to Hive and explored several performance tuning options, such as map joins, split sizes, and bumping up the number of mappers and reducers. All that effort brought the timing down from 6 hours to 3 hours, but it still didn't satisfy our requirements. Cloudera then suggested we try Impala again, as version 1.0 GA had just been released.
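For readers unfamiliar with Hive tuning, the options above map to session-level settings prepended to each query. The parameter names below are standard Hive/Hadoop options of that era; the specific values are illustrative, not our production configuration.

```python
# A sketch of the kind of Hive session settings we iterated on.
# Values here are illustrative examples, not our production tuning.
HIVE_TUNING = {
    "hive.auto.convert.join": "true",      # map join: ship small table to mappers
    "mapred.max.split.size": "268435456",  # 256 MB splits -> fewer, larger mappers
    "mapred.reduce.tasks": "400",          # raise reducer parallelism
}

def render_settings(params):
    """Render settings as the SET statements prepended to a Hive query."""
    return "\n".join(f"SET {k}={v};" for k, v in sorted(params.items()))

print(render_settings(HIVE_TUNING))
```

Each tuning run was then a matter of adjusting this handful of values, re-running the hourly job, and comparing wall-clock times.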
Evaluating Impala 1.0 GA:
On the initial trial of 1.0 GA, we got the same 2-3 hour timings as before. That's when we started the performance tuning process. One of our first changes was to compress the incoming data; we chose LZO as our compression format, as it gives better read performance than gzip. We then tried other tuning options, like join strategies, tuning parameters, and slicing and dicing the data.
With all these iterations, we brought the run time down to 45 minutes, a huge improvement over our earlier trials. We also discovered some issues with our data node hardware configuration: some drives had slower read/write throughput, which was slowing down reads. After upgrading and optimizing the hardware, the process now runs in 30 minutes. In total, we brought the process time down from 6 hours to 30 minutes, a big achievement for a process of this scale. On top of that, the environment is stable, and the data nodes and name node no longer crash.
This justified our decision to move off Netezza. Hadoop gives us the flexibility of incremental scalability and performance options, and allows us to scale up or down based on our needs, something that is not possible with Netezza.
The real test of our new system was seeing how it would handle the 2013 holiday season, which is the biggest time of the year for advertising. December hit, and our process was able to handle the doubled volume of data without a hitch. In fact, it was one of the most stable holiday seasons we’ve ever had.
Our data processing and handling services are constantly evolving. Every new client on-boarded means more impressions, advertisers, and websites, putting bigger and bigger demands on our data infrastructure. Any solution we implement has to work not just at today's capacity but at 10x or 100x today's capacity. As such, we actively research and test new technologies in order to build better, faster, more reliable services and products for our clients.