Avro Backed Hive Table in CDH5 (or CDH4.5): Encountered AvroSerdeException determining schema

You are using Hive and some of your tables are backed by the Avro SerDe. You are seeing a lot of this:

WARN avro.AvroSerdeUtils: Encountered AvroSerdeException determining schema. 
Returning signal schema to indicate problem org.apache.hadoop.hive.serde2.avro.AvroSerdeException: 
Neither avro.schema.literal nor avro.schema.url specified, can't determine table schema

Sometimes this WARN becomes an ERROR and your queries fail completely! This is driving you mad, because you have in fact specified the schema; that much is obvious, since sometimes the queries work just fine despite the warning. If this is you, read on!
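For reference, an Avro-backed table declares its schema through one of exactly those two table properties. A minimal sketch of what that normally looks like (the table name and schema location below are made up for illustration, not taken from any real setup):

CREATE EXTERNAL TABLE my_avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/my_avro_table.avsc');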

Continue reading

How to Make Different Libraries, JAR Dependencies Available to MapReduce Jobs on a Hadoop Cluster

When using Hadoop, you will find yourself sooner rather than later in need of distributing some libraries, usually in the form of JAR files, to your cluster, so that they are available to your MapReduce jobs. There are several ways to accomplish this, but I’ve found the “JAR in a JAR” method to be the most convenient.
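As a rough illustration of the “JAR in a JAR” idea: JARs placed in a lib/ directory inside the job JAR you submit are added to the classpath of the tasks, so the packaged job JAR ends up looking something like this (the class and dependency names are hypothetical):

myjob.jar
    com/example/WordCountDriver.class
    com/example/WordCountMapper.class
    com/example/WordCountReducer.class
    lib/
        some-third-party-library.jar
        another-dependency.jar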

Continue reading

Solving the “Small Files Problem” in Apache Hadoop: Appending and Merging in HDFS

While we are waiting for our hardware order to be delivered, we’re using the time to try to identify potential problems and solve them before they even appear. Today, we investigated the common “Small Files Problem” and had some discussions on the matter. Here’s most of what we noted down:

During data acquisition in HDFS, it is important to store files in an efficient manner in order to take full advantage of MapReduce. The basic idea is to store a small number of large files in HDFS (rather than the other way around), such that each one is at least close to the block size (typically 64 or 128 MiB).
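As one rough sketch of the merging idea (this is not code from the post; the paths are hypothetical, and plain concatenation is only safe for formats such as newline-delimited text), small files sitting in an HDFS directory can be combined into a single large file with the FileSystem API:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileMerger {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // hypothetical paths: a directory full of small files and one large target file
        Path inputDir = new Path("/data/incoming");
        Path mergedFile = new Path("/data/merged/part-00000");
        FSDataOutputStream out = fs.create(mergedFile);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue;
                }
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    // append this small file's bytes to the end of the merged file
                    IOUtils.copyBytes(in, out, conf, false);
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
    }
}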

Continue reading

How to solve “Unable to load native-hadoop library” in Eclipse

Having just configured Eclipse for Hadoop development, I ran into the following problem:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Running jobs via the hadoop cli command worked fine; this only happened when I tried to run jobs directly from Eclipse, in local mode. After a little investigation, I found that the cause is a Java system property called java.library.path, which did not include the correct path.

When running from the hadoop cli command, the java.library.path property was properly set to /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/lib/native (I am using the CDH 4.2.0 distribution of Hadoop). When the job was started from inside Eclipse, java.library.path held its system default value:

[Screenshot: the default java.library.path value when the job is launched from Eclipse]
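If you want to check what your own JVM sees, a quick way (not part of the original write-up) is to print the property from your code:

System.out.println(System.getProperty("java.library.path"));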

To set this property correctly, you can either configure Eclipse to pass this setting to the Java Virtual Machine when it starts, or (and this is the better way) add the native library to the corresponding library in the Java Build Path. To do the latter, first right-click on your project and open the Build Path configuration screen:

[Screenshot: the project’s Java Build Path configuration screen]

In this screen, find the hadoop-common library, expand the row and add the native library by pointing to the correct location:

[Screenshot: the hadoop-common entry expanded, with the native library location set]

That’s it! From now on you’ll be able to run your MapReduce jobs in local mode and use the proper native libraries.
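For the first option mentioned above, passing the setting to the JVM directly, the VM argument in the Eclipse run configuration would look roughly like this (reusing the CDH 4.2.0 path from above; adjust it to your own installation):

-Djava.library.path=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/lib/native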