While we are waiting for our hardware order to be delivered, we’re using the time by trying to identify potential problems and solve them before they even appear. Today,we investigated the common “Small Files Problem” and we had some talks on the matter. Here’s mostly from what we noted down:
During data acquisition in HDFS, it is important to store files in an efficient manner in order to take full advantage of MapReduce. The basic idea is to store a small number of large files in HDFS (rather than the other way around), such that each one is at least close to the block size (typically 64 or 128 MiB).
HDFS does not work good with lots of small files (much smaller than the block size) for the following reasons:
- Each block will hold a single file, so you will have a lot of small blocks (smaller than the configured block size). Reading all these blocks one by one means a lot of time will be spent with disk seeks.
NameNodekeeps track of each file and each block (about 150 bytes for each) and stores this data into memory. A large number of files will occupy more memory.
Some good practices regarding this topic are either appending more data to a larger file that is already stored in HDFS, or merging multiple small files together (before or after getting them into HDFS).
Keep in mind that Hadoop is still changing pretty much between releases, so what is written here might not apply to future versions. We are currently using the Cloudera CDH 4.3 release of Hadoop.
- Files are visible in the namespace as soon as they are created even if the creator dies before closing the file (HADOOP-1708).
- The only write modes are create and append.
- Both of them write to the end of a file and cannot seek, the only difference is that append writes to files that already exist.
- Append mode allows for a single writer modifying an already written file.
- If you open a file which is being written for reading, the last incomplete block is not visible and so are any subsequent blocks which are going to be completed. Only the blocks which were completed before opening the stream for reading are available.
FSDataOutputStreamobject to force the flushing of the current (incomplete) block to disk (useful for implementing commit logs like in HBase).
The following approaches are possible:
SequenceFilewhere keys are file names (if required) and values are their corresponding content.
SequenceFilesare useful if a logical separation into small files is semantically required. For example, a web crawler stores lots of small HTML files which can be stored as key-values in
SequenceFilesfor efficient processing with MapReduce.
- If merging files which are not stored in HDFS is required, they can be appended (see previous section) into the
SequenceFilescombining the idea of appending and merging.
Combine more files into a HAR file (Hadoop Archive).
- Useful when there are already lots of small files in HDFS, which need to be grouped together before some expensive jobs.
- Implemented as a MapReduce job.
- Use a
har://URL to access each file from the archive and view the archive as a folder.
- Use a normal
hdfs://URL to access the actual content of the archive in HDFS. HARs are stored in HDFS as folders which contain a file with the concatenation of all its containing input files. If you’re handling text data and don’t care about the original file name this is a great merging solution.
- cannot be generated in real-time
- currently, when using the archive properly with a
har://URL there is no
InputFormatthat can pack multiple files from the archive into a single MapReduce split, but a custom
CombineFileInputFormatcan be written.
Create a MapReduce job to do the job.
- Best flexibility, but requires more development overhead.
Read more about this topic in a great article here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/