Solving the “Small Files Problem” in Apache Hadoop: Appending and Merging in HDFS

While we wait for our hardware order to be delivered, we’re using the time to identify potential problems and solve them before they even appear. Today we investigated the common “Small Files Problem” and discussed it at some length. Here is most of what we noted down:

During data acquisition into HDFS, it is important to store files efficiently in order to take full advantage of MapReduce. The basic idea is to store a small number of large files in HDFS (rather than a large number of small files), such that each file is at least close to the block size (typically 64 or 128 MiB).
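As a sketch of one way to consolidate small files, the `hadoop fs -getmerge` command concatenates the files under an HDFS directory into a single local file, which can then be re-uploaded. The paths below are illustrative placeholders, not actual paths from our cluster:

```shell
# Merge all the small files under an HDFS directory into one local file
# (paths are placeholders for illustration).
hadoop fs -getmerge /data/incoming/small-files /tmp/merged.txt

# Re-upload the merged file so MapReduce reads one large file
# instead of many small ones.
hadoop fs -put /tmp/merged.txt /data/merged/merged.txt
```

This is only one approach; Hadoop Archives (HAR files) and SequenceFiles are other common answers to the same problem.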


How to solve “Unable to load native-hadoop library” in Eclipse

Having just configured Eclipse for Hadoop development, I ran into the following problem:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Running jobs via the hadoop CLI command worked fine; this only happened when I tried to run jobs directly from Eclipse, in local mode. After a little investigation, I found that the cause is a Java system property, java.library.path, which did not include the correct path.

When running from the hadoop CLI command, the java.library.path property was properly set to /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/lib/native (I am using the CDH 4.2.0 distribution of Hadoop). When the job was started from inside Eclipse, java.library.path held its system default value:
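A quick way to see what the JVM is actually using is to print the property yourself; the class name below is just an illustrative sketch:

```java
// Prints the JVM's native library search path. When launched from an
// IDE such as Eclipse, this typically shows the OS default library
// path rather than Hadoop's native library directory.
public class LibraryPathCheck {
    public static void main(String[] args) {
        String path = System.getProperty("java.library.path");
        System.out.println("java.library.path = " + path);
    }
}
```

Drop this into the project and run it both ways; comparing the two outputs makes the mismatch obvious.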


To set this property correctly, you can either configure Eclipse to launch the Java Virtual Machine with this setting, or (and this is the better way) add the native library under the corresponding entry in the Java Build Path. To do the latter, first right-click on your project and open the Build Path configuration screen:
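For the first option, the setting can be passed as a VM argument in the project’s Eclipse run configuration (under the Arguments tab, in the VM arguments field), using the CDH 4.2.0 path mentioned above; adjust the path for your own distribution:

```
-Djava.library.path=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/lib/native
```

The Build Path approach described next is preferable because it applies to every run configuration in the project at once.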


In this screen, find the hadoop-common library, expand the row and add the native library by pointing to the correct location:


That’s it! From now on, you’ll be able to run your MapReduce jobs in local mode using the proper native libraries.