Solving the “Small Files Problem” in Apache Hadoop: Appending and Merging in HDFS

While we are waiting for our hardware order to be delivered, we’re using the time to identify potential problems and solve them before they appear. Today we investigated the common “Small Files Problem” and had a few discussions on the matter. Here is most of what we noted down:

During data acquisition into HDFS, it is important to store files efficiently in order to take full advantage of MapReduce. The basic idea is to store a small number of large files rather than a large number of small ones, such that each file is at least close to the HDFS block size (typically 64 or 128 MiB). Many small files waste NameNode memory (every file and block is tracked there) and typically spawn one map task each, which drags down MapReduce throughput.
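As a rough illustration, a minimal sketch of merging a directory of small files into one large HDFS file using the FileSystem API might look like the following. The class name, paths, and argument handling are assumptions for illustration, not a finished tool:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/**
 * Sketch: concatenate every plain file in an HDFS input directory into a
 * single output file, so the result occupies far fewer blocks (and map
 * tasks) than the originals. Paths are passed on the command line.
 */
public class SmallFileMerger {

    public static void main(String[] args) throws IOException {
        Path inputDir = new Path(args[0]);   // e.g. a directory of small files (hypothetical)
        Path outputFile = new Path(args[1]); // e.g. the merged destination file (hypothetical)

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Create the single destination file once, then stream each
        // small file into it back to back.
        try (FSDataOutputStream out = fs.create(outputFile)) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue; // only merge plain files
                }
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    // close=false: the try-with-resources blocks manage both streams
                    IOUtils.copyBytes(in, out,
                            conf.getInt("io.file.buffer.size", 4096), false);
                }
            }
        }
    }
}
```

In practice, container formats such as SequenceFiles or Hadoop Archives (HAR) are also common answers to the same problem, since they pack many small records into block-sized files without losing the original file boundaries.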
