When using Hadoop, sooner rather than later you will need to distribute some libraries, usually in the form of JAR files, to your cluster, so that these libraries are available to your MapReduce jobs. There are several ways to accomplish this, but I’ve found the “JAR in a JAR” method to be the most convenient.
One of the things I am doing is using Avro serialization in MapReduce jobs written with the new API, on Cloudera CDH 4.4. For all of this to work, the
avro-mapred-1.7.4-hadoop2.jar package must be available throughout the cluster. Cloudera has packed this into its CDH distribution, but only in the
lib folders of tools like Oozie, Sqoop or Hive, so I had to find a way to make Avro available to plain MapReduce jobs running on the Hadoop cluster as well.
The libjars method
The most straightforward approach is the
libjars method, which requires two steps:
- Add the JAR to the CLASSPATH on the client machine (the one the job is submitted from). This allows your job to submit properly.
- Use the
-libjars option with the
hadoop jar command to have Hadoop distribute the JAR to all machines in the cluster.
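For a driver that goes through ToolRunner (which is what actually parses the -libjars option), the two steps might look like this; the JAR path, job JAR and class names are just placeholders for illustration:

```shell
# Step 1: put the JAR on the client-side CLASSPATH so the job submits properly
export HADOOP_CLASSPATH=/opt/libs/avro-mapred-1.7.4-hadoop2.jar

# Step 2: have Hadoop ship the JAR to every node in the cluster
hadoop jar my-job.jar com.example.MyDriver \
    -libjars /opt/libs/avro-mapred-1.7.4-hadoop2.jar \
    /input/path /output/path
```

Note that -libjars is handled by GenericOptionsParser, so your driver has to be run via ToolRunner for the option to take effect.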
While this is a perfectly good way to solve the problem, I find it rather inconvenient to always have to take care of these things, and if you need more than one JAR, passing all of them on the command line is far less than ideal.
The JAR in a JAR method
This is the method I prefer. Hopefully you are writing your projects as Maven projects, so you can easily use a Maven assembly descriptor that automatically packs the needed JARs into the
/lib folder of your Hadoop job JAR (hence the “JAR in a JAR” name). Hadoop automagically adds the JARs within this folder to the CLASSPATH on all nodes, and your jobs will execute successfully. Here is what you need to do:
- Add the maven-assembly-plugin to your pom.xml file
- Create the assembly descriptor XML file
- Add
<scope>provided</scope> to those dependencies that you do not want packed into the JAR (things that are already available on the cluster, like
slf4j, and so on)
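As a rough sketch of what those two files contain (the gist linked below has the exact versions; the descriptor path and assembly id here are my own choices, so adjust them to your layout), the pom.xml part binds the assembly plugin to the package phase:

```xml
<!-- pom.xml: build a "job" JAR with dependencies under lib/ (sketch) -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptors>
      <!-- path to the assembly descriptor file below -->
      <descriptor>src/main/assembly/hadoop-job.xml</descriptor>
    </descriptors>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>single</goal></goals>
    </execution>
  </executions>
</plugin>
```

and the descriptor itself unpacks your own classes at the root of the JAR while copying every runtime dependency into lib/:

```xml
<!-- src/main/assembly/hadoop-job.xml (sketch) -->
<assembly>
  <id>job</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <!-- your own classes go at the root of the job JAR -->
    <dependencySet>
      <unpack>true</unpack>
      <includes>
        <include>${groupId}:${artifactId}</include>
      </includes>
    </dependencySet>
    <!-- runtime dependencies (minus the "provided" ones) end up in lib/ -->
    <dependencySet>
      <unpack>false</unpack>
      <scope>runtime</scope>
      <outputDirectory>lib</outputDirectory>
      <excludes>
        <exclude>${groupId}:${artifactId}</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
</assembly>
```

Because the second dependencySet uses the runtime scope, anything you marked as provided in the pom is automatically left out of lib/.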
Have a look at https://gist.github.com/rpastia/6930800 for the code required in the two xml files.
Your JARs will now be “self-sufficient” and there is nothing else you need to do to get them to work on your cluster. Just make sure you do not include more things than necessary; you can have a look inside the JAR archive, in the
lib folder, just to make sure.
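One quick way to do that inspection is to list the archive contents from the command line; the artifact name here is made up for the example:

```shell
# list everything under lib/ inside the job JAR
jar tf target/my-job-1.0-job.jar | grep '^lib/'
```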
If you want to understand more about this topic, or you don’t know why this would be an issue, have a look at http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/