How to Make Libraries and JAR Dependencies Available to MapReduce Jobs on a Hadoop Cluster

When using Hadoop, you will find yourself, sooner rather than later, needing to distribute some libraries, usually in the form of JAR files, to your cluster so that they are available to your MapReduce jobs. There are several ways to accomplish this, but I’ve found the “JAR in a JAR” method to be the most convenient.

One of the things I am doing is using Avro serialization in MapReduce jobs written with the new API, on Cloudera CDH 4.4. For all of this to work, the avro-mapred-1.7.4-hadoop2.jar package must be available throughout the cluster. Cloudera has packaged it into its CDH distribution, but only in the lib folders of tools like Oozie, Sqoop or Hive, so I had to find a way to make Avro available to plain MapReduce jobs that run on the Hadoop cluster as well.

The libjars method

The most straightforward approach is the libjars method, which requires two steps (a concrete example follows the list):

  1. Add the JAR to the CLASSPATH on the client machine (the one from which the job is submitted). This will allow your job to submit properly.
  2. Use the -libjars directive with the hadoop jar command to have Hadoop distribute the JAR to all machines in the cluster.
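
For example, assuming the Avro JAR sits at /opt/libs/avro-mapred-1.7.4-hadoop2.jar on the client machine and your driver runs through ToolRunner (which is what actually parses -libjars), the two steps might look like this; the paths, JAR location and driver class below are placeholders:

    # 1. Put the JAR on the client CLASSPATH so the job can be submitted.
    export HADOOP_CLASSPATH=/opt/libs/avro-mapred-1.7.4-hadoop2.jar:$HADOOP_CLASSPATH

    # 2. Have Hadoop ship the JAR to every node in the cluster.
    hadoop jar my-job.jar com.example.MyDriver \
        -libjars /opt/libs/avro-mapred-1.7.4-hadoop2.jar \
        /input/path /output/path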

While this is a very good way to solve the problem, I find it rather inconvenient to always have to take care of these things, and if you need more than one JAR, passing all of them on the command line is far from ideal.

The JAR in a JAR method

This is the method I prefer. Hopefully you are writing your projects as Maven projects, so you can easily use a Maven assembly descriptor that automatically packs the needed JARs into the lib folder of your Hadoop job JAR (hence the “JAR in a JAR” name). Hadoop automagically adds the JARs within this folder to the CLASSPATH on all nodes, and your jobs will execute successfully. Here is what you need to do:

  • Add the maven-assembly-plugin to your pom.xml file
  • Create the assembly descriptor xml file
  • Add <scope>provided</scope> to those dependencies that you do not want packed into the JAR (things that are already available, like hadoop-client, mrunit, slf4j, and so on)
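
As a rough sketch of what the two files might contain (the plugin version, descriptor path and assembly id below are placeholders of mine, not taken from the gist linked underneath), the pom.xml addition and the assembly descriptor could look along these lines:

    <!-- pom.xml: run the assembly plugin during the package phase -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <descriptors>
          <descriptor>src/main/assembly/hadoop-job.xml</descriptor>
        </descriptors>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>

    <!-- src/main/assembly/hadoop-job.xml: pack runtime dependencies into lib/
         and unpack the project's own classes at the root of the job JAR -->
    <assembly>
      <id>job</id>
      <formats>
        <format>jar</format>
      </formats>
      <includeBaseDirectory>false</includeBaseDirectory>
      <dependencySets>
        <dependencySet>
          <unpack>false</unpack>
          <scope>runtime</scope>
          <outputDirectory>lib</outputDirectory>
          <excludes>
            <exclude>${groupId}:${artifactId}</exclude>
          </excludes>
        </dependencySet>
        <dependencySet>
          <unpack>true</unpack>
          <includes>
            <include>${groupId}:${artifactId}</include>
          </includes>
        </dependencySet>
      </dependencySets>
    </assembly>

Because the descriptor only pulls in the runtime scope, anything you marked as provided stays out of the lib folder automatically.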

Have a look at https://gist.github.com/rpastia/6930800 for the full code of the two XML files.

Your JARs will now be “self-sufficient” and there is nothing else you need to do in order to get them to work on your cluster. Just make sure you do not include more things than necessary; you can have a look inside the JAR archive, in the lib folder, just to make sure.
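
For example, assuming the assembly produced target/my-job-job.jar (the name is a placeholder), you can list what ended up in the lib folder with:

    # show only the entries packed under lib/ inside the job JAR
    jar tf target/my-job-job.jar | grep '^lib/'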

If you want to understand more about this topic, or you don’t know why this would be an issue, have a look at http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
