Log analyzer example using Spark and Scala

It has been a while since I last wrote anything technical on Big Data, but believe me, the wait was worth it. It has been a couple of months since I started reading and writing Scala and Spark, and I am finally confident enough to share what I have learned. As I said before, learning Scala is worth the effort, but it does have its difficulties. For a Java programmer, coding in a functional style is a little difficult at first, but once you learn Scala it is real fun, believe me.

In this post I will not go into the details of Spark and Scala concepts; instead I will demonstrate an example that I think is easy to start with. Once you see an end-to-end working program, you will be able to relate it to MapReduce or other paradigms, and from there you can pick up further topics on your own.

I am using the log analyzer example which was explained here in my blog post using MapReduce. The input and output in both examples are the same; the only changes are Scala instead of Java and Spark instead of MapReduce.

Problem statement: I run a very busy website and need to pull the site down for an hour in order to apply some patches and perform maintenance on the back-end servers, which means the website will be completely unavailable for that hour. The primary concern is that the outage should affect the least number of users. The game starts here: we need to identify at what hour of the day web traffic to the site is at its lowest, so that the maintenance activity can be scheduled for that time.
There is an Apache web server log for each day which records the activity on the website, but these are huge files, up to 5 GB each.

Sample from log file:

64.242.88.10,[07/Mar/2004:16:05:49 -0800], "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10,[07/Mar/2004:16:06:51 -0800], "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10,[07/Mar/2004:16:10:02 -0800], "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
64.242.88.10,[07/Mar/2004:17:06:51 -0800], "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10,[07/Mar/2004:18:06:51 -0800], "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10,[07/Mar/2004:16:06:51 -0800], "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523

We are interested only in the date field, i.e. [07/Mar/2004:16:05:49 -0800], and specifically in the hour portion of it.
Solution: I need to consume one month of log files and run my code, which calculates the total number of hits for each hour of the day. The hour with the least number of hits is perfect for the downtime. It is as simple as that!
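
Just to make the idea concrete before we get to Spark, here is the same logic on a tiny, made-up list of hours using plain Scala collections (the values are invented purely for illustration):

val hours = List("16", "16", "17", "18", "16", "16")                        // hours pulled from log lines (made up)
val hits  = hours.groupBy(identity).map { case (h, xs) => (h, xs.size) }    // Map("16" -> 4, "17" -> 1, "18" -> 1)
val quietest = hits.minBy(_._2)                                             // ("17",1) or ("18",1), ties are arbitrary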

Explanation of the Spark-Scala code:

Most of the code is self-explanatory. I have used a Scala case class to map the log schema, which can be reused at any point in the code. A function mapRawLine maps the raw data into that case class; in this step you can perform any cleaning or transformation of the raw data according to your needs.
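
For reference, here is a minimal sketch of what the case class and mapRawLine could look like. This is my own simplified version: the field names and the comma-based split are assumptions, and the actual project code may differ.

// Hypothetical schema: one field per piece of the comma-separated log line.
case class LogSchema(ip: String, datetime: String, request: String)

// Hypothetical mapping of one raw line; it keeps only the hour in the datetime
// field so that the transform method can group on it directly.
def mapRawLine(line: String): LogSchema = {
  val parts = line.split(",", 3).map(_.trim)
  // "[07/Mar/2004:16:05:49 -0800]" -> "16"
  val hour = parts(1).stripPrefix("[").split(":")(1)
  LogSchema(parts(0), hour, parts(2))
}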

The transform method contains the main logic: it takes an RDD of the case class, retrieves only the hour from the datetime field, and then groups by it to get a count, word-count style.

def transform(events: RDD[LogSchema]) = {
  // Pair each event's hour with a 1, then sum the 1's per hour.
  val e = events.map(x => (x.datetime, 1)).reduceByKey(_ + _)
  // Write the (hour, count) pairs out as text.
  e.saveAsTextFile("/user/sreejith/loganalyzer/logoutput/")
}

The second line in the above code can also be rewritten as

val e = events.map(x => (x.datetime, 1)).reduceByKey { case (x, y) => x + y }

(_ + _) is Scala shorthand for an anonymous function that takes two parameters and returns the result of applying the + operator to the first, passing the second. It is analogous to a lambda expression in Java 8, e.g. (x, y) -> x + y.
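
To see the equivalence, all of the following are the same two-argument sum function in Scala:

// Three equivalent ways to write the same sum function:
val add1: (Int, Int) => Int = _ + _                       // placeholder shorthand
val add2: (Int, Int) => Int = (x, y) => x + y             // explicit anonymous function
val add3: (Int, Int) => Int = { case (x, y) => x + y }    // pattern-matching form used above

List(1, 2, 3, 4).reduce(add1)                             // 10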

The code retrieves the hour from the datetime field and assigns a 1 to each occurrence; reduceByKey is then the reduce step applied per key, which sums up all the 1's for that key.

RunMainJob is the main Scala object and the starting point of the application.
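
The driver itself does little more than wire things together. Here is a rough sketch of what such an object could look like; the SparkConf setup and the input path are my assumptions, and mapRawLine and transform are assumed to be in scope.

import org.apache.spark.{SparkConf, SparkContext}

object RunMainJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogAnalyzer")
    val sc   = new SparkContext(conf)

    // Hypothetical input path; point this at the month of raw Apache logs.
    val rawLogs = sc.textFile("/user/sreejith/loganalyzer/loginput/")
    val events  = rawLogs.map(mapRawLine)
    transform(events)

    sc.stop()
  }
}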

To run the Spark job:

spark-submit --master yarn-client --class com.demo.loganalyzer.RunMainJob spark-loganalyzer-1.0-SNAPSHOT-jar-with-dependencies.jar

The output of the above code is:

(17,1)
(18,1)
(16,4)
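
If you would rather have the job pick the quietest hour itself instead of eyeballing the saved output, a small addition inside transform (before the saveAsTextFile call, reusing the e RDD from above) could be:

// Hour with the fewest hits; if two hours tie, either may be returned.
val quietestHour = e.min()(Ordering.by[(String, Int), Int](_._2))
println(quietestHour)   // e.g. (17,1) for the sample lines above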

If you compare the number of lines needed to achieve the same result in MapReduce with Java versus Spark with Scala, it is roughly a tenth of the code. That alone justifies the effort you put into learning these technologies.

The entire source code, as a Maven-compiled project, can be found on my GitHub.

Happy Coding

Cheers !
