Excel Spreadsheet Input Format for Hadoop MapReduce
I wanted to read a Microsoft Excel spreadsheet using MapReduce and found that I could not use Hadoop's TextInputFormat for this. Hadoop does not understand Excel spreadsheets, so I ended up writing a custom InputFormat to achieve it.
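As a rough sketch, a custom format of this kind usually extends FileInputFormat and hands each split to a custom RecordReader. The names below (ExcelInputFormat, ExcelRecordReader) are illustrative assumptions about the shape of such a class, not the exact code of the project.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustrative sketch: a custom InputFormat that delegates the actual
// spreadsheet parsing to a custom RecordReader.
public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // ExcelRecordReader is a placeholder name for the reader that turns
        // each spreadsheet row into a key/value pair.
        return new ExcelRecordReader();
    }
}
```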
Hadoop works with many different data formats, from flat text files to databases. An InputSplit is nothing more than a chunk of one or more blocks, and it is quite rare for a block boundary to land exactly on an end of line (EOL). Each split is processed by a single map task, so some of my records located around block boundaries will be split across two different blocks. This raises a few questions:
- How can Hadoop guarantee that all lines of the input files are read completely?
- How can Hadoop consolidate a line that starts in block B and ends in block B+1?
- How can Hadoop guarantee that we do not miss any line?
- Is there a limit on a line's size? Can a line be larger than a block (i.e. span more than two blocks)? If so, is there any performance issue?
Let's take an example to make the concept clearer: suppose my data set consists of a single 300MB file spanning 3 different blocks (blocks of 128MB), and suppose that I get 1 InputSplit per block. Now consider the following scenario:
The file is composed of 6 lines of 50MB each.
• The first reader starts reading bytes from block B1 at position 0. The first two EOLs are met at 50MB and 100MB respectively, so two lines (L1 and L2) are read and sent as key/value pairs to the Mapper 1 instance. Then, starting from byte 100MB, we reach the end of our split (128MB) before finding the third EOL. This incomplete line is completed by reading bytes from block B2 up to position 150MB. The first part of line L3 is read locally from block B1, the second part is read remotely from block B2 (by means of FSDataInputStream), and the complete record is finally sent as a key/value pair to Mapper 1.
• The second reader starts on block B2 at position 128MB. Because 128MB is not the start of the file, there is a strong chance our pointer is located somewhere inside an existing record that has already been processed by the previous reader. We need to skip this record by jumping ahead to the next available EOL, found at position 150MB, so the actual start of RecordReader 2 will be 150MB instead of 128MB.
We may wonder what happens if a block starts exactly on an EOL. By jumping ahead to the next available record (through the readLine method), we would miss one record. To avoid this, before jumping to the next EOL we first decrement the initial "start" value to "start - 1". Being located at least one byte before the EOL, we ensure no record is skipped.
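To make that boundary handling concrete, here is a minimal sketch of the skip logic inside a RecordReader's initialize step, modelled on the pattern used by Hadoop's own LineRecordReader; the class and field names are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class BoundaryAwareReader {

    private long start;
    private long end;
    private LineReader in;

    // Sketch of the boundary handling described above.
    public void initialize(FileSplit split, Configuration conf) throws IOException {
        start = split.getStart();            // byte offset where this split begins
        end = start + split.getLength();     // byte offset where it ends
        Path file = split.getPath();

        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream fileIn = fs.open(file);

        boolean skipFirstLine = false;
        if (start != 0) {
            // Not the first split: step back one byte so that a split starting
            // exactly on an EOL is still detected, then skip the partial line
            // (the previous reader has already consumed it).
            skipFirstLine = true;
            --start;
            fileIn.seek(start);
        }
        in = new LineReader(fileIn, conf);
        if (skipFirstLine) {
            // Re-establish "start" at the first EOL inside this split.
            start += in.readLine(new Text(), 0,
                    (int) Math.min((long) Integer.MAX_VALUE, end - start));
        }
    }
}
```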
Maximum size for a single record
There is a maximum size allowed for a single record to be processed. This value can be set with the mapred.linerecordreader.maxlength parameter.
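For example, the driver could cap the record size along these lines; the 10MB figure is just an illustrative value, and newer Hadoop versions expose the same setting as mapreduce.input.linerecordreader.line.maxlength.

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative: cap a single record at roughly 10MB.
Configuration conf = new Configuration();
conf.setInt("mapred.linerecordreader.maxlength", 10 * 1024 * 1024);
```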
You can also find the whole project on GitHub.
I have implemented only the Mapper class and no Reducer, because we just need to see whether the Mappers are able to read each line from the Excel sheet. Folks who want to process this data further can implement their own Reducers.
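As a rough sketch, a Mapper of that kind can simply emit each row it receives; the key/value types assumed here are a LongWritable offset and a Text row.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative Mapper: it only writes out the rows handed to it by the
// custom InputFormat, so we can verify that every line of the sheet is read.
public class ExcelMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = offset / row index supplied by the RecordReader
        // value = one spreadsheet row, e.g. as a delimited string
        context.write(key, value);
    }
}
```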
To use this custom class, you just need to specify the following line when configuring the job.
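A hedged sketch of what that looks like in the driver; ExcelInputFormat is the assumed name of the custom format described in this post.

```java
// ExcelInputFormat is the assumed name of the custom format described above.
job.setInputFormatClass(ExcelInputFormat.class);
// No Reducer is used, so the job can stay map-only.
job.setNumReduceTasks(0);
```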