Making your local Hadoop more like AWS Elastic MapReduce

// 05.21.2014 // Data

A version of this article originally appeared on Ian’s personal blog.

At MediaMath, we’re big users of Elastic MapReduce (EMR). EMR’s incredible flexibility makes it a great fit for our data analytics team, which processes TBs of data each day to provide insights to our clients, to better understand our own business, and to power the various product back-ends that make Terminal 1 the “marketing operating system” that it is.

An extremely important best practice for any analytics project is to ensure the local development and test environments match the production environment as much as possible. This eliminates the nasty surprise of launching a job that takes hours, only to discover that it fails late into the run due to some unmet dependency or configuration mistake. Failing to invest time in the dev/test phase is a surefire way to blow big $$.

I’ve been investigating some configuration settings you can make to your local Hadoop to bring it in line with what you’ll find when you run a job on an EMR cluster (read here for more info on our Hadoop configuration). This is especially important to us at MediaMath, because we use S3 as a sort of centralized filesystem, and EMR is designed to work wonderfully with S3 out of the box. Matching that behavior locally comes down to:

  • Using s3:// URIs everywhere instead of s3n:// URIs
  • Embedding AWS access keys
  • Supporting transparent LZO compression

Installing Hadoop

I run all my Hadoop jobs on my laptop using Homebrew. Homebrew is a fantastic package manager for OS X that makes it a breeze to install general UNIX utilities as well as more complicated software packages (like Hadoop and Hive).
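With Homebrew installed, getting a local Hadoop (and Hive, if you want it) is a one-liner or two:

    brew install hadoop
    # optional, if you also want to run Hive locally
    brew install hive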

And you’re good!

s3:// vs s3n:// URIs in HDFS

Ever wondered what the difference between an s3:// URI and an s3n:// URI is?

Up until December 2010, S3 had a 5GB object size limit. So, if you used the default S3 HDFS implementation (by specifying an s3n:// URI) you couldn’t read or write files greater than 5GB. The upside was that when you did read or write a file with HDFS, there was a one-to-one correspondence with the object that got stored in S3.

To process files larger than 5GB, you had to use s3:// URIs in HDFS, which actually chunked the file into multiple pieces behind the scenes, before storing each piece as a separate object in S3. When accessing something via HDFS with an s3://bucket/object URI, you might actually be downloading multiple “chunks” from S3. This page has some more info.

Nowadays the S3 limit is 5TB, so there isn’t really a need to use s3:// URIs in HDFS anymore. In fact, in EMR, s3:// and s3n:// are both aliased to the same implementation (s3n).

Here’s the relevant config:
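On a stock Apache Hadoop install you can get the same aliasing by pointing the s3:// scheme at the native S3 filesystem implementation. Something along these lines in mapred-site.xml (or core-site.xml, the more traditional home for fs.* settings) does the trick; the class name is the standard Apache one, though EMR’s own aliasing may be wired up differently under the hood:

    <property>
      <name>fs.s3.impl</name>
      <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
    </property>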

Embedding AWS Access Keys

If you want to run Hadoop jobs on your laptop but use data stored in S3, you’ll need to ensure your credentials are stored in mapred-site.xml. If you installed Hadoop via Homebrew, just edit $(brew --prefix hadoop)/libexec/conf/mapred-site.xml.
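The property names below are the standard Hadoop ones for the two S3 filesystems; the values are placeholders for your own keys:

    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>
    <!-- and again for the s3:// scheme -->
    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>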

As a side note: I set this up for both the s3n and s3 HDFS filesystem implementations, but since I only ever use the NativeS3FileSystem via s3:// URIs, the s3 properties don’t really matter (because the S3FileSystem will never be used).

Hadoop and LZO compression

The log files we process in our analytics platform are compressed using LZO compression. Luckily, EMR can transparently decompress these files, so no extra configuration is needed there. Our local Hadoop install, however, cannot. Fortunately, Twitter has open-sourced the code we need (hadoop-lzo) on GitHub.

Getting this to work requires a few steps:

  • Compiling the native C code for LZO
  • Compiling the java wrapper used by Hadoop
  • Setting up the relevant classpath/library paths so Hadoop can find them
  • Configuring Hadoop to use LZO compression when it finds a .lzo file

That’s a lot to remember. Homebrew to the rescue!
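I’ve wrapped all of this up in a formula in my Homebrew tap. The invocation looks roughly like this (the tap and formula names below are placeholders; grab the real ones from the GitHub page linked at the end of this post):

    # tap/formula names are placeholders -- see my tap on GitHub for the real ones
    brew tap <my-github-user>/<tap-name>
    brew install hadoop-lzo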

 

Basically, this will use Homebrew to install lzo and maven so you can compile the Twitter Hadoop/LZO code for steps 1 and 2. The compiled jar and native libraries end up under your Homebrew prefix.

Now we can configure Hadoop to use it. First of all, the default hadoop executable actually resets JAVA_LIBRARY_PATH, which means we can’t get the native libraries onto the right path. So we have to edit $(brew --prefix hadoop)/libexec/bin/hadoop and comment out line #353.
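In the copy of bin/hadoop that ships with the Homebrew formula I’m using, the offending line is the one that clears the variable right before the native-library detection logic; commented out, that section looks something like this (the exact line number and wording will vary between Hadoop versions):

    # setup 'java.library.path' for native-hadoop code if necessary
    # JAVA_LIBRARY_PATH=''    # commented out so the value we export in .bashrc survives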

Now we need to edit our .bashrc. I’m also setting JAVA_HOME just to be safe.
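The exact locations depend on where the hadoop-lzo build lands its artifacts, so the paths below are placeholders, but the shape of it is: set JAVA_HOME, put the hadoop-lzo jar on Hadoop’s classpath, and point the Java library path at the native libraries:

    # JAVA_HOME, just to be safe (standard on OS X)
    export JAVA_HOME=$(/usr/libexec/java_home)

    # placeholder paths -- point these at wherever the hadoop-lzo jar and
    # native libraries actually ended up under your Homebrew prefix
    export HADOOP_CLASSPATH=$(brew --prefix)/path/to/hadoop-lzo.jar
    export JAVA_LIBRARY_PATH=$(brew --prefix)/path/to/hadoop-lzo/native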

Lastly, let’s edit mapred-site.xml again.
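The goal here is to register the LZO codecs from the Twitter library alongside Hadoop’s built-in ones. A minimal sketch, using the standard hadoop-lzo class names:

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <property>
      <name>io.compression.codec.lzo.class</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>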

Now your jobs should be able to transparently decompress LZO files, even when running on your laptop!

For more information on installing Hadoop and LZO via Homebrew, check out the GitHub page for my Homebrew Tap!

Investigating more settings

If you enable debugging when you spin up your EMR cluster, you can actually inspect the jobconf by downloading the file from S3. This is a great way to see what other settings you may need to tweak to ensure that your jobs can be tested on your laptop before you send them off to EMR. 
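For example, if your cluster was launched with logging pointed at a bucket you own, the job configuration XML ends up with the rest of the debugging output under that log URI. Everything below (bucket, prefix, job flow ID, job ID) is made up, and the exact layout varies by AMI version, but fetching it looks roughly like:

    # all names/IDs here are placeholders -- substitute your own log URI and job flow ID
    aws s3 cp s3://my-emr-logs/logs/j-ABC123DEF456/jobs/job_201405210001_0001_conf.xml .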


IAN HUMMEL

Director of Data Platform

Ian Hummel is the Director of Data Platform at MediaMath. He’s led a variety of product initiatives at MediaMath over the years and is currently focused on building a next-gen large-scale analytics platform. Before MediaMath he worked in a variety of tech fields including enterprise search, video processing, identity federation, and mobile app development. He has a BA in Mathematics and Computer Science from Boston University and an MBA from INSEAD.

2 responses to “Making your local Hadoop more like AWS Elastic MapReduce”

  1. Found this while searching for “brew hadoop-lzo”

    Neat post, Ian. This is really useful info. Thank you.

  2. Just saw your talk on the AWS youtube regarding creating scalable reporting data services, really helpful/interesting. Thanks.
