Apache Log Processing with Apache Pig


Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. It is an abstraction over MapReduce with its own query language, Pig Latin. Pig can work with structured, semi-structured and unstructured data. Log processing is one of the best use cases for understanding Apache Pig. In this blog we will use Apache Pig to examine downloaded Apache logs.

Before getting into the log processing, let's get a basic idea of the Apache Combined Log Format.

Log Format

"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""



%h: The IP address of the client (remote host) which made the request. If HostnameLookups is enabled, the server tries to resolve and log the hostname instead of the IP address.


%l: The RFC 1413 (Identification Protocol) identity of the client. A hyphen (-) indicates that the requested information is not available.


%u: The userid of the person requesting the document, as determined by HTTP authentication. A hyphen (-) likewise indicates that the information is not available.


%t: The time that the server received the request. The format is

[day/month/year:hour:minute:second zone]
day = 2*digit
month = 3*letter
year = 4*digit
hour = 2*digit
minute = 2*digit
second = 2*digit
zone = (`+' | `-') 4*digit
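To make the timestamp layout concrete, here is a minimal Python sketch that parses a %t field with the standard library, independent of Pig. The helper name parse_apache_time and the constant APACHE_TIME_FORMAT are hypothetical names chosen for illustration; the timestamp is the one from the sample record used later in this post.

```python
from datetime import datetime, timedelta

# strptime directives matching [day/month/year:hour:minute:second zone]
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_apache_time(field: str) -> datetime:
    """Parse a %t field; the surrounding square brackets are optional."""
    return datetime.strptime(field.strip("[]"), APACHE_TIME_FORMAT)


stamp = parse_apache_time("[23/Dec/2014:05:31:12 +0530]")
# The zone is kept as a UTC offset on the resulting datetime.
offset_is_ist = stamp.utcoffset() == timedelta(hours=5, minutes=30)
print(stamp.day, stamp.month, stamp.year, offset_is_ist)
```

Note that %b matches the abbreviated month name in the current locale, so this sketch assumes the default C/English locale.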


%r: The request line from the client, given in double quotes. It contains the method used by the client (for example GET) and the resource the client requested (for example /apache_pb.gif).

%m %U%q %H

This format string logs the method, path, query string and protocol individually, producing the same output as %r.
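As a rough illustration of how the request line breaks down into these parts, the following Python sketch splits it on spaces. RequestLine and split_request_line are hypothetical names, and the sketch assumes a well-formed request line with exactly three space-separated parts.

```python
from typing import NamedTuple


class RequestLine(NamedTuple):
    method: str
    uri: str  # path plus optional query string
    protocol: str


def split_request_line(request: str) -> RequestLine:
    """Split 'METHOD URI PROTOCOL' into its three parts."""
    method, uri, protocol = request.split(" ")
    return RequestLine(method, uri, protocol)


req = split_request_line("GET /logs/access.log HTTP/1.1")
print(req.method, req.uri, req.protocol)
```

Malformed request lines do occur in real logs; with this sketch they raise a ValueError, so production code would need to handle that case.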


%>s: The status code that the server sends back to the client.

2xx is a successful response

3xx is a redirection

4xx is a client error

5xx is a server error
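The status classes above depend only on the first digit of the code. A small Python sketch of that rule follows; the function name status_class and the label strings are assumptions made for illustration.

```python
def status_class(status: int) -> str:
    """Map an HTTP status code to the class given by its first digit."""
    classes = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Anything outside 2xx-5xx (for example 1xx) is reported as "other".
    return classes.get(status // 100, "other")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```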


%b: The size of the object returned to the client, measured in bytes.


%{Referer}i: The Referer HTTP request header. This is the site that the client reports having been referred from.


%{User-agent}i: The User-Agent HTTP request header. This is the identifying information that the client browser reports about itself.


Our log files are in Apache's standard combined log format. The elements of a log line are hard to separate with a single delimiter such as a tab or comma, so we can't simply use the built-in PigStorage() loader. Luckily, PiggyBank includes a custom loader built especially to work with these logs.

So what is PiggyBank? PiggyBank is a collection of useful add-ons (UDFs, loaders and similar) for Pig, contributed by the Pig user community. The PiggyBank jar is available under contrib/piggybank/java in the Pig download. First we need to make sure that piggybank.jar has been registered in the Pig shell (grunt).

register /home/dd/pig-0.12.0/contrib/piggybank/java/piggybank.jar



Now define the Apache combined log loader, which is available in PiggyBank.

DEFINE ApacheCombinedLogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();


Now load the Apache combined log file into Pig using the CombinedLogLoader. A sample record from the log file is given below.

- - [23/Dec/2014:05:31:12 +0530] "GET /logs/access.log HTTP/1.1" 200 336949 "http://www.theknot.com/wedding/rocaltrol-and-special" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.12785 YaBrowser/13.12.1599.12785 Safari/537.36"


D = LOAD 'combined_log' USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader AS (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer, userAgent);

Group the records by referrer. Then, for each group, a foreach flattens the group key and applies COUNT to get the total number of views for each referrer page.

E = group D by referer;

F = foreach E generate flatten(group), COUNT(D.referer);

Dump F;
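For readers who want to check the logic outside Pig, the group-and-count step can be sketched in Python with a Counter. This is only an illustration of what the statements above compute, not how Pig executes them; the records list and its example.com referrers are made-up sample data.

```python
from collections import Counter

# Hypothetical parsed records; only the referer field matters here.
records = [
    {"remoteAddr": "192.0.2.1", "referer": "http://example.com/a"},
    {"remoteAddr": "192.0.2.2", "referer": "http://example.com/b"},
    {"remoteAddr": "192.0.2.3", "referer": "http://example.com/a"},
]

# Equivalent of GROUP ... BY referer followed by COUNT per group.
views_per_referer = Counter(record["referer"] for record in records)

for referer, views in views_per_referer.items():
    print(referer, views)
```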



That covers basic log processing with Apache Pig, but how do we process the timestamp of an Apache log record? It is not as complex as it looks: PiggyBank also provides a DateExtractor, which helps us extract parts of the timestamp.

To use it, we need to define the date extractor and load the file with TextLoader.


DEFINE DATE_EXTRACT_YY org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('dd/MMM/yyyy:HH:mm:ss Z','dd');

raw_logs = LOAD 'log' USING TextLoader AS (line:chararray);


To split each line into its various elements, we use the REGEX_EXTRACT_ALL function and a complicated regular expression.

logs_base = FOREACH raw_logs GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"') ) AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray, request: chararray, status: int, bytes_string: chararray, referrer: chararray, browser: chararray);
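To see what each capture group in this expression picks up, the same pattern can be tried in Python. This is a sketch for experimenting with the regex, not part of the Pig job. The field names mirror the AS clause above; the client address 192.0.2.1 is a placeholder assumed for illustration, since the sample record in this post does not show one.

```python
import re

# Same pattern as in the Pig statement, written as a Python raw string.
COMBINED_LOG = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" '
    r'(\S+) (\S+) "([^"]*)" "([^"]*)"'
)

FIELD_NAMES = (
    "remoteAddr", "remoteLogname", "user", "time", "request",
    "status", "bytes_string", "referrer", "browser",
)

# Placeholder client address; the rest follows the sample record.
line = (
    '192.0.2.1 - - [23/Dec/2014:05:31:12 +0530] '
    '"GET /logs/access.log HTTP/1.1" 200 336949 '
    '"http://www.theknot.com/wedding/rocaltrol-and-special" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36"'
)

# fullmatch requires the whole line to match the pattern.
match = COMBINED_LOG.fullmatch(line)
fields = dict(zip(FIELD_NAMES, match.groups())) if match else {}

for name, value in fields.items():
    print(f"{name}: {value}")
```

All captured values are strings at this point, which is why the Pig statement declares a type for each field in its AS clause.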


Now using the defined date extractor, generate the day from the timestamp.

logs = FOREACH logs_base GENERATE remoteAddr,remoteLogname, DATE_EXTRACT_YY(time) as day;

Dump logs;


Written by Ram, Data Engineer @ DataDotz.

DataDotz is a Chennai-based Big Data team primarily focused on consulting and training on technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), Search and Cloud Computing.