Sunday 6 December 2015

PIG for Big Data

Introduction:


Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
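To make these properties concrete, here is a small, hypothetical Pig Latin fragment (the file name and field names are illustrative, not part of this tutorial). Each statement describes a transformation over an entire relation, and the whole dataflow is compiled down to MapReduce jobs:

```pig
-- Load a comma-separated file; the schema is declared inline.
players = LOAD 'players.csv' USING PigStorage(',')
          AS (id:chararray, team:chararray, runs:int);

-- Filter and group: each statement applies to every row of the relation.
good    = FILTER players BY runs > 50;
by_team = GROUP good BY team;

-- Aggregate per group and print the result.
totals  = FOREACH by_team GENERATE group AS team, SUM(good.runs) AS total_runs;
DUMP totals;
```

Notice that nothing in the script mentions mappers or reducers; the parallelization is handled by Pig's compiler.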

Starting up PIG:

After the Hortonworks Sandbox virtual machine is set up to run Hadoop, we need to start the virtual machine. If everything goes well you will see the screen below:

The last screen in VM VirtualBox will display the address you use to initiate your Hortonworks Sandbox session.

On a web browser open the address to navigate to the Hortonworks Sandbox.



Select the Advanced options to access the Secure Shell (SSH) client, and log in to Hue with the following credentials:

Hue

URL: http://127.0.0.1:8000/

Username: hue 

Password:  1111

Downloading the Data

The data file comes from the site www.seanlahman.com. You can download the data file in .zip form from: http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

Once you have the file you will need to unzip it into a directory. We will be uploading just the master.csv and batting.csv files. Once you have started the Pig session, follow the steps below.

Step 1: Click on the Pig icon on the Hue homepage as shown below:



Step 2 : Create Folder and upload the .csv files.

In Hue:
  • Create a folder in the file browser: go to File Browser > New > Folder and give it a name (e.g., pig_prac).
  • Click on Upload and upload the .csv files into this folder.


Step 3: Give a title to the script and write the code in the box as shown in the figure below, then click the “EXECUTE” button to run the script.



Let’s take a look at our script. As you can see, the syntax looks a lot like SQL, which reflects what Pig Latin essentially is: a SQL-like dataflow language on top of Hadoop. Each statement is applied to all the rows of a relation. We also have powerful operators like GROUP and JOIN to sort rows by a key and to build new data objects.
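As a concrete illustration, a script along the lines of the Hortonworks Lahman batting example finds, for each year, the player with the highest number of runs. The sketch below assumes the column layout of the Lahman Batting.csv file (player id in $0, year in $1, runs in $8); adjust the positions if your file differs:

```pig
-- Load the raw CSV; fields are referenced by position ($0, $1, ...).
batting = LOAD 'Batting.csv' USING PigStorage(',');

-- Keep only the player id, year, and runs columns.
runs = FOREACH batting GENERATE $0 AS playerID, $1 AS year, $8 AS runs;

-- Find the maximum runs scored in each year.
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;

-- Join back against the per-row data to recover which player scored that maximum.
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;
```

The GROUP operator collects all rows sharing a year into one bag, and the JOIN stitches the per-year maximum back onto the original rows to identify the player.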
We can execute our code by clicking on the execute button at the top right of the composition area, which opens a new page.


View the results:


Conclusion:

In this blog we saw how to use Pig from the Hortonworks Sandbox virtual machine. Pig lets people who do not know Java still take advantage of the parallelization features of Hadoop.
