It has been a long time since I wrote anything technical on Big Data, but believe me, the wait was worth it. It has been a couple of months now since I started reading and writing Scala and Spark, and I am finally confident enough to share the knowledge I have gained. As I said before, learning Scala is worth it, but […]

It has been a while since I last blogged. This post gives you an idea of how to convert a Hive query that joins multiple tables into a MapReduce job. You might be wondering why you should ever think of writing a MapReduce job yourself when Hive does it for you? You […]
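As a taste of what the post covers, here is a minimal sketch of a reduce-side join in plain MapReduce, roughly the shape Hive compiles a multi-table join into: the mapper emits the join key and tags each record with its source table, and the reducer pairs up rows that share a key. The `customers`/`orders` file names and the CSV layout are hypothetical placeholders, not the post's actual data:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  public static class TaggingMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // Assume CSV rows whose first column is the join key.
      String[] cols = value.toString().split(",", 2);
      // Tag by input file so the reducer knows which table a row came from.
      String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String tag = file.startsWith("orders") ? "O" : "C";
      ctx.write(new Text(cols[0]), new Text(tag + "|" + cols[1]));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> customers = new ArrayList<>();
      List<String> orders = new ArrayList<>();
      for (Text v : values) {
        String s = v.toString();
        (s.startsWith("C") ? customers : orders).add(s.substring(2));
      }
      // Inner join: cross product of matching rows from the two tables.
      for (String c : customers)
        for (String o : orders)
          ctx.write(key, new Text(c + "," + o));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side join");
    job.setJarByClass(ReduceSideJoin.class);
    job.setMapperClass(TaggingMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // customers dir
    FileInputFormat.addInputPath(job, new Path(args[1])); // orders dir
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Buffering both sides in the reducer is the simplest design; the full post works through what Hive actually generates for joins like this.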

My previous post gave a high-level view of the different components used in HBase and how they function. In this post I will discuss how to bulk load source data directly into an HBase table using HBase's bulk-loading feature. Apache HBase gives you random, real-time read/write access to your Big Data, but how do you […]
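For a sense of what the post walks through, here is a minimal bulk-load driver sketch built on HBase's standard `HFileOutputFormat2.configureIncrementalLoad` hook: a MapReduce job writes HFiles directly, bypassing the normal write path (WAL and memstore). The `my_table` name, the `cf:val` column, and the two-column CSV layout are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

  // Turns one CSV line "rowkey,value" into a Put keyed by the row key.
  public static class CsvToPutMapper
      extends Mapper<Object, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(Object key, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] f = line.toString().split(",", 2);
      byte[] row = Bytes.toBytes(f[0]);
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("val"), Bytes.toBytes(f[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hbase bulk load");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(CsvToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile staging dir

    boolean ok;
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      TableName name = TableName.valueOf("my_table"); // hypothetical table
      // Wires in HFileOutputFormat2 plus a total-order partitioner so the
      // generated HFiles line up with the table's current region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(
          job, conn.getTable(name), conn.getRegionLocator(name));
      ok = job.waitForCompletion(true);
    }
    // Afterwards, hand the staged HFiles to HBase with the completebulkload tool.
    System.exit(ok ? 0 : 1);
  }
}
```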

After working with HBase for the past one and a half years, I decided to share my understanding. In this blog I will try to describe the high-level functioning of HBase and the different components involved. HBase – The Basics: HBase is an open-source, NoSQL, distributed, column-oriented data store modeled after Google's BigTable […]
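To make the column-oriented data model concrete before diving into the architecture, here is a minimal HBase client sketch: every cell is addressed by (row key, column family, qualifier). The `users` table and `info` family are hypothetical examples, not part of the post:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHelloWorld {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Random write: one cell under family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("row-1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Alice"));
      table.put(put);

      // Random read: fetch the same cell back by row key.
      Result r = table.get(new Get(Bytes.toBytes("row-1")));
      byte[] v = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(v)); // Alice
    }
  }
}
```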

To all those who want to become programming nerds: What are the top 10 pieces of career advice Brian Bi would give to future software engineers?

Partitioners and Combiners in MapReduce Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. In other words, the partitioner specifies the reduce task to which an intermediate key-value pair must be copied. Within each reducer, keys are processed in sorted order. Combiners are an optimization in MapReduce that […]
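As a quick illustration of the partitioner contract, here is a sketch of a custom `Partitioner` for a word-count job. The `FirstLetterPartitioner` name and the first-letter bucketing scheme are my own example, not from the post; the trailing comment shows where a combiner plugs into the driver:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The default HashPartitioner does key.hashCode() % numReduceTasks; this
// variant buckets words by first letter so each reducer's output file is
// one alphabetical slice of the key space.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? 'z' : Character.toLowerCase(s.charAt(0));
    int bucket = (first >= 'a' && first <= 'z') ? first - 'a' : 25;
    // Scale the 26 letter buckets down to the configured reducer count.
    return bucket * numPartitions / 26;
  }
}

// Driver wiring (word count):
//   job.setPartitionerClass(FirstLetterPartitioner.class);
//   job.setCombinerClass(IntSumReducer.class); // combiner = map-side mini-reduce
```

A combiner may reuse the reducer class here only because summing counts is associative and commutative; the framework is free to run it zero or more times per map task.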

Excel Spreadsheet Input Format for Hadoop MapReduce I wanted to read a Microsoft Excel spreadsheet using MapReduce and found that Hadoop's built-in text input format could not fulfill my requirement. Hadoop does not understand Excel spreadsheets, so I ended up writing a custom input format to achieve this. Hadoop works with […]
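The full post develops the input format step by step; as a rough sketch of the idea, here is a compact custom `FileInputFormat` built on Apache POI. The `ExcelInputFormat` class name and the one-record-per-row, comma-joined value layout are my own assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// One record per spreadsheet row: key = row number, value = cells joined by commas.
public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext ctx, Path file) {
    return false; // an .xls/.xlsx file must be parsed whole, not byte-split
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext ctx) {
    return new ExcelRecordReader();
  }

  public static class ExcelRecordReader extends RecordReader<LongWritable, Text> {
    private Sheet sheet;
    private int rowIdx = -1;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private final DataFormatter fmt = new DataFormatter();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx)
        throws IOException {
      Path path = ((FileSplit) split).getPath();
      FileSystem fs = path.getFileSystem(ctx.getConfiguration());
      try (FSDataInputStream in = fs.open(path)) {
        // POI loads the whole workbook into memory; fine for modest files.
        sheet = WorkbookFactory.create(in).getSheetAt(0);
      } catch (Exception e) {
        throw new IOException("Cannot parse spreadsheet " + path, e);
      }
    }

    @Override
    public boolean nextKeyValue() {
      if (++rowIdx > sheet.getLastRowNum()) return false;
      Row row = sheet.getRow(rowIdx);
      StringBuilder sb = new StringBuilder();
      if (row != null) {
        for (Cell cell : row) {
          if (sb.length() > 0) sb.append(',');
          sb.append(fmt.formatCellValue(cell));
        }
      }
      key.set(rowIdx);
      value.set(sb.toString());
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() {
      return sheet == null ? 0 : (float) rowIdx / (sheet.getLastRowNum() + 1);
    }
    @Override public void close() {}
  }
}
```

A job would opt in with `job.setInputFormatClass(ExcelInputFormat.class);`; marking the format non-splittable trades parallelism within a file for the ability to parse the binary workbook in one piece.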