Sunday 20 December 2015

HIVE- HQL for Hadoop

Introduction:
 


Hive uses the Hive Query Language (HQL) to work around the complexity of writing MapReduce programs directly. The query language is similar to SQL, so people with a SQL background can query data in Hadoop without writing Java code. In this article we are going to do a basic data processing task.
Steps:
  1. Upload several files of baseball statistics into Hive.
  2. Perform some simple operations on them.
  3. Find the player with the highest runs for each year.
  4. Once we have the highest runs, extend the script to translate the player_id field into the first and last names of the players.
The batting file has all the statistics from 1871 to 2011 and contains more than 90,000 rows.
Load the input files:
After unzipping the archive into a directory, we need to upload Master.csv and Batting.csv through the File Browser in Hue, as shown below.

The Hue interface has a “Hive” button, and inside Hive there are options such as “Query Editor”, “My Queries” and “Tables”.
On the left is the query editor. In the case of Hive we operate on data stored in Apache Hadoop.
Create empty table and load data in Hive
In “Tables” we select “Create a new table from a file”, which opens the file browser; there we select the Batting.csv file and name the new table “temp_batting”.
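The wizard creates this table for us, but it helps to know what it produces. Below is a minimal sketch of the equivalent HiveQL, assuming the whole CSV line is kept in a single STRING column named col_value (the column that the queries later in this post refer to); the actual column name generated by Hue may differ.

CREATE TABLE temp_batting (col_value STRING);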



Next we load the contents of Batting.csv into the temp_batting table with the following command, which needs to be executed from the Query Editor:
LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Once the data has been loaded, Hive moves the file (Batting.csv) out of its original location, so it will no longer appear in the File Browser.

Now that we have loaded the data, we verify it by executing the following command, which shows the first 100 rows of the table.
SELECT * FROM temp_batting LIMIT 100;

Create a batting table and transfer data from the temporary table to it
We extract the contents of temp_batting into a new table called ‘batting’ with the following columns:
a)  player_id    b)  year    c)  runs
The next step is to create the ‘batting’ table and populate it from ‘temp_batting’ (player_id, year and runs) using regular expressions.
CREATE TABLE batting (player_id STRING, year INT, runs INT);

INSERT OVERWRITE TABLE batting
SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
       regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
       regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) runs
FROM temp_batting;
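As before, we can sanity-check the new table with a quick query; the LIMIT of 10 is arbitrary and just keeps the output small:

SELECT * FROM batting LIMIT 10;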

Query to show the highest runs per year:
We group the ‘batting’ table by year and take the maximum of runs, which gives the highest run total for each year.
    SELECT year, max(runs) FROM batting GROUP BY year;

Who scored the maximum runs, year-wise:
We execute the final query, which shows the player who scored the maximum runs in each year.
    SELECT a.year, a.player_id, a.runs FROM batting a
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
    ON (a.year = b.year AND a.runs = b.runs);
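The introduction promised to translate the player_id into first and last names using Master.csv. That step was not shown above, so here is a minimal sketch of how it could be done with the same pattern, assuming Master.csv was uploaded to /user/admin/ and that nameFirst and nameLast are the 14th and 15th comma-separated fields (check the header of your copy of Master.csv and adjust the indexes if they differ):

CREATE TABLE temp_master (col_value STRING);
LOAD DATA INPATH '/user/admin/Master.csv' OVERWRITE INTO TABLE temp_master;

-- pull player_id, first name and last name out of the raw lines
CREATE TABLE master (player_id STRING, first_name STRING, last_name STRING);
INSERT OVERWRITE TABLE master
SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1)  player_id,
       regexp_extract(col_value, '^(?:([^,]*),?){14}', 1) first_name,
       regexp_extract(col_value, '^(?:([^,]*),?){15}', 1) last_name
FROM temp_master;

-- same year-wise maximum as above, now joined to the player names
SELECT a.year, m.first_name, m.last_name, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
  ON (a.year = b.year AND a.runs = b.runs)
JOIN master m ON (a.player_id = m.player_id);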

Sunday 6 December 2015

PIG for Big Data

Introduction:


Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.

Starting up PIG:

After the Hortonworks Sandbox virtual machine is set up to run Hadoop, we start the virtual machine; if everything goes well you will see the screen below:

The last screen on the VM VirtualBox will provide an address to initiate your Hortonworks Sandbox session.

In a web browser, open that address to navigate to the Hortonworks Sandbox.



Select the advanced options to access the Secure Shell (SSH) client. Hue can be opened with the following credentials:

Hue

URL: http://127.0.0.1:8000/

Username: hue 

Password:  1111

Downloading the Data

The data comes from the site www.seanlahman.com. You can download the data file from: http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

Once you have the file, unzip it into a directory. We will be uploading just the Master.csv and Batting.csv files. Once you have started the Pig session, follow the steps below.

Step 1: Click on the Pig icon on the Hue homepage, as shown below:



Step 2: Create a folder and upload the .csv files.

In Hue:
  • Create a folder in the File Browser: go to File Browser > New > Folder and give it a name (e.g. pig_prac).
  • Click on Upload and add the .csv files to this folder.


Step 3: Give the script a title, write the code in the box as shown in the figure below, then click on the “EXECUTE” button to run the script.



Let’s take a look at our script. As you can see, the syntax is a lot like SQL, which is essentially what Pig gives us: a SQL-like layer over Hadoop. Each statement is applied to all the rows of a relation, and we have powerful operators like GROUP and JOIN to organize rows by a key and to build new data objects.
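Since the script itself only appears in the screenshot, here is a minimal sketch of the kind of Pig Latin used in this exercise. It assumes Batting.csv was uploaded to the pig_prac folder and that player ID, year and runs are the 1st, 2nd and 9th comma-separated fields, as in the Hive post above:

-- load the raw CSV; the column positions are assumptions based on Batting.csv
batting = LOAD 'pig_prac/Batting.csv' USING PigStorage(',');

-- keep only player id, year and runs, casting the numeric fields
runs = FOREACH batting GENERATE $0 AS playerID, (int)$1 AS year, (int)$8 AS runs;

-- highest run total for each year
grp_data = GROUP runs BY year;
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;

-- join back to find which player scored those runs
join_max_run = JOIN max_runs BY (grp, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;

DUMP join_data;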
We can execute our code by clicking on the execute button at the top right of the composition area, which opens a new page.


View the results:


Conclusion:

In this blog we saw how to use Pig from the Hortonworks Sandbox virtual machine. Pig lets people who do not know Java still take advantage of the parallel processing capabilities of Hadoop.