Introduction to Big Data

Big Data in layman’s terms:

Big Data is a popular term for enormous volumes of both structured and unstructured data. The fundamental difference between the two is that structured data has a known schema, so it can be loaded consistently into a relational database or a structured file format such as XML, while unstructured data is schema-less, i.e. raw and difficult to organize. Unstructured data (e.g. Twitter tweets, Facebook pages, emails, raw data from financial institutions, satellite data, telecom data) is growing so rapidly that organizing it exceeds the processing capacity of current enterprise solutions.
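To make the distinction concrete, here is a minimal Python sketch. The table name, columns, and sample tweet are hypothetical and chosen only for illustration. It shows that rows with a known schema load straight into a relational table and can be queried by field, while a piece of raw text has no fields to query and can only be handled with text processing.

```python
import sqlite3

# Structured data: the schema is known up front, so rows load directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, country TEXT, amount REAL)")
structured_rows = [("X", "US", 120.0), ("Y", "IN", 75.5), ("Z", "US", 30.0)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", structured_rows)

# Because the columns are known, a query can address individual fields.
us_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE country = 'US'"
).fetchone()[0]

# Unstructured data: free text with no schema (hypothetical sample tweet).
raw_tweet = "Just got my new phone and the battery is already dead. Not happy!"

# There are no columns to select from, so we fall back on text processing.
mentions_battery = "battery" in raw_tweet.lower()

print(us_total, mentions_battery)
```

The query over the table is one line because the schema does the work; the check on the tweet is a crude keyword test, which is why unstructured data needs heavier tooling at scale.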

Big Data is commonly described along three dimensions, the 3Vs: Volume (amount of data), Velocity (speed of data), and Variety (types and sources of data).

Volume – Enterprises are amassing terabytes and even petabytes of information at a very rapid speed.

  • 40 Zettabytes of data will be created by 2020
  • Most US companies have at least 100 terabytes of data stored.
  • 12 terabytes of Tweets are created each day, which can be turned into improved product sentiment analysis.

Velocity – Velocity describes the frequency at which data is generated. Data streams in at unprecedented speed and must be dealt with in a timely manner.

  • Analyze 500 million daily call detail records in real-time to predict customer churn faster

Variety – Big data is a combination of structured, semi-structured, and unstructured data. Unstructured data includes log files, audio and video streams, raw text files, web pages, etc. Important insights can emerge when all these kinds of data are analyzed together.

  • 30 billion pieces of content are shared on Facebook every month.
  • 4 Billion + hours of video are watched on YouTube every month.

I would like to describe a real-world Big Data scenario on which I have worked:

Use case: A global bank receives a large number of customer calls from across the globe. Many come from genuine callers, but some are fraudulent. The bank’s executives want to know how many fraudulent calls are received in a month and which parts of the globe most of them come from.

Detailed story: On every call the bank receives, the customer care representative asks for details such as SSN and DOB, along with some personal questions that can be used for validation. The representative then notes down a summary of the conversation. Example of a genuine caller: Customer X called to unlock his account, provided the correct SSN and DOB, and answered the security question correctly. Example of a fraudulent caller: Customer Y called to unlock his account and provided correct account details and SSN, but the DOB was incorrect, and when asked the security question he abruptly hung up.

Solution: All this data is written to a raw, schema-less flat file, and from this text we need to identify which callers were fraudulent. In effect, we need to perform sentiment analysis on it. This is where Hadoop comes into the picture. Once we bring the data into Hadoop, we can use Hive to query it and machine learning libraries such as Mahout to perform sentiment classification.
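As a rough illustration of the classification step, the sketch below flags call summaries using a handful of cue phrases. This is a deliberately simple rule-based stand-in, not the Hive or Mahout pipeline itself: the cue list, the function name looks_fraudulent, and the sample summaries are all assumptions made for this example, and a real system would learn such signals from labeled data.

```python
import re

# Hypothetical cue phrases; a real classifier would learn these from
# labeled call summaries rather than use a hand-written list.
FRAUD_CUES = (
    r"\bincorrect\b",
    r"\bhung up\b",
    r"\bfailed\b",
    r"\brefused\b",
)

def looks_fraudulent(summary: str) -> bool:
    """Return True if the call summary contains any fraud cue phrase."""
    text = summary.lower()
    return any(re.search(pattern, text) for pattern in FRAUD_CUES)

# Sample summaries modeled on the two callers described above.
genuine = ("Customer X called to unlock his account, provided correct SSN "
           "and DOB, and answered the security question correctly.")
fraud = ("Customer Y called to unlock his account, provided correct account "
         "details and SSN, but the DOB was incorrect, and he hung up when "
         "asked the security question.")

print(looks_fraudulent(genuine), looks_fraudulent(fraud))
```

The word-boundary anchors matter here: without them, "correct" inside "incorrect" style matches would be easy to get wrong in the other direction, and a trained model would handle such variation far more robustly than fixed patterns.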

What is Hadoop?

Apache Hadoop is an open source software framework, released under a free license, that supports distributed processing of huge data sets. The framework covers both the storage of huge data sets and their large-scale processing. Hadoop is designed to scale from a single machine to thousands.

Hadoop was created by Doug Cutting, who named it after his son’s toy elephant; it was inspired by Google’s MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project written in the Java programming language. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its business.

Core Components of Apache Hadoop:

1. MapReduce – MapReduce is a programming framework for processing huge data sets in a parallel, distributed fashion. In essence, it is a way to take a big task and divide it into discrete tasks that can be done in parallel.

2. HDFS – A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable reliable and rapid access.
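The MapReduce idea from item 1 can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's actual Java API; the input records and the function names map_phase, shuffle, and reduce_phase are hypothetical. It counts fraudulent calls per region, in the spirit of the bank use case.

```python
from collections import defaultdict

# Hypothetical input: (region, is_fraud) pairs, one per call record.
calls = [
    ("Asia", True), ("Europe", False), ("Asia", True),
    ("Americas", True), ("Europe", False), ("Asia", False),
]

def map_phase(record):
    """Emit a (region, 1) pair for each fraudulent call, nothing otherwise."""
    region, is_fraud = record
    if is_fraud:
        yield region, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)

# Each record is mapped independently, which is what allows parallelism.
mapped = [pair for record in calls for pair in map_phase(record)]
fraud_by_region = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(fraud_by_region)  # {'Asia': 2, 'Americas': 1}
```

Because map_phase looks at one record at a time and reduce_phase looks at one key at a time, both steps could be spread across many machines; in a real cluster the framework handles that distribution and the shuffle between them.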

Who uses Hadoop?

Besides Facebook and Yahoo!, many other organizations use Hadoop to run large distributed computations. Notable users include Twitter, eBay, Alibaba, Amazon, American Airlines, AOL, Apple, Foursquare, Fox Interactive Media, Hewlett-Packard, IBM, Intuit, Joost, LinkedIn, Microsoft, NetApp, Netflix, The New York Times, SAP AG, SAS Institute, StumbleUpon, Aguja, and Adobe.

Hadoop Vendors?

In line with this demand, many vendors offer Hadoop distributions and services, including AWS EMR (Elastic MapReduce), Cloudera, Hortonworks, IBM, Intel, MapR Technologies, Microsoft, Pivotal Software, and Teradata.


