Sunday 20 December 2015

HIVE- HQL for Hadoop

Introduction:
 


Hive provides HQL (Hive Query Language) to work around the complexity of writing raw Hadoop programs. The query language is similar to SQL, so people with a SQL background have an easy way to query data stored in Hadoop. In this article we are going to perform a basic data processing task.
Steps:
  1. Upload several files of baseball statistics into Hive.
  2. Perform some simple operations on them.
  3. Find the player with the highest runs for each year.
  4. Once we have the highest runs, extend the script to translate the player id field into the first and last names of the players.
The batting file contains statistics from 1871 to 2011 and has more than 90,000 rows.
Load input file:
After unzipping the files into a directory, we upload Master.csv and Batting.csv through the file browser in Hue, as shown below.

The Hue interface has a “Hive” button, and inside Hive there are options such as “Query Editor”, “My Queries” and “Tables”. The Query Editor on the left is where we run our queries; in the case of Hive we operate on the Apache Hadoop data store.
Create empty table and load data in Hive
Under “Tables” we select “Create a new table from a file”, which opens the file browser. There we select the Batting.csv file and name the new table “temp_batting”.



Next we load the contents of Batting.csv into the temp_batting table with the following command, which needs to be executed in the Query Editor:
LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Once the data has been loaded, Hive moves the file into the table's storage, so Batting.csv will no longer be seen in the file browser.











Now that we have loaded the data, we should verify it. We execute the following command, which shows the first 100 rows of the table.
SELECT * from temp_batting LIMIT 100;

Create a batting table and transfer data from the temporary table to batting table
We extract the contents of temp_batting into a new table called ‘batting’ with the following columns:
a)  player_id    b)  year    c)  runs
The next step is to create the ‘batting’ table and populate it from ‘temp_batting’ (player_id, year and runs) using regular expressions.
create table batting (player_id STRING, year INT, runs INT);

insert overwrite table batting
SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
       regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
       regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) runs
FROM temp_batting;

Query to show the highest score per year:
We group ‘batting’ by year and take the maximum of runs, which gives the highest score for each year.
    SELECT year, max(runs) FROM batting GROUP BY year;

Who scored the maximum runs, year-wise:
We execute the final query, which shows the player who scored the maximum runs in each year.
    SELECT a.year, a.player_id, a.runs from batting a 
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b 
    ON (a.year = b.year AND a.runs = b.runs) ;

Sunday 6 December 2015

PIG for Big Data

Introduction:


Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.

Starting up PIG:

After the Hortonworks Sandbox virtual machine is set up to run Hadoop, we start the virtual machine. If everything goes well you will see the screen below:

The last screen on the VM VirtualBox will provide an address to initiate your Hortonworks Sandbox session.

On a web browser open the address to navigate to the Hortonworks Sandbox.



Select the Advanced options to access the Secure Shell (SSH) client with the following credentials:

Hue

URL: http://127.0.0.1:8000/

Username: hue 

Password:  1111

Downloading the Data

The data file comes from the site www.seanlahman.com. You can download it in zipped CSV form from: http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

Once you have the file, unzip it into a directory. We will be uploading just the Master.csv and Batting.csv files. Once the files are ready, follow the steps below to start a Pig session and run a script.

Step 1: Click on the Pig icon on the Hue homepage as shown below:



Step 2: Create a folder and upload the .csv files.

In Hue:-
  • Create a folder in file browser.
  • Go to file browser > new > folder name (eg: pig_prac)
  • Click on Upload and upload the .csv files into this folder.


Step 3: Give the script a title, write the code in the box as shown in the figure below, and click on the “EXECUTE” button to run the script.



Let's take a look at our script. As you can see, the syntax is a lot like SQL, which is essentially what Pig gives us: a SQL-like layer on top of Hadoop. Statements are applied to all rows by default, and we have powerful operators like GROUP and JOIN to group rows by a key and to build new data objects.
We can execute our code by clicking on the execute button at the top right of the composition area, which opens a new page.


View the results:


Conclusion:

In this blog we saw how to use Pig from the Hortonworks Sandbox virtual machine. Pig is used so that people who do not know Java can still use the parallelization features of Hadoop.

Saturday 21 November 2015

Map Reduce

Introduction:

'MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.' - Wikipedia.
A MapReduce program is composed of a Map() method that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system.
The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job. HDFS is the distributed file system on which the MapReduce program runs. A typical setup looks like this:

How does Map Reduce work?
Consider a simple word count problem where we need to count the number of times each word occurs across many documents. A document corpus typically contains thousands of documents or more, all of which have to be scanned. Performed serially, this task can take hours to run, but with Hadoop's parallel MapReduce processing it completes in a fraction of the time. Let's see how:

Step 1:
The master node assigns the different documents to different worker nodes, and each mapper generates key-value pairs.
For the example above, the key-value pairs are (Deer,1), (Bear,1) ... (Bear,1).
Step 2:
The shuffle step arranges all the key-value pairs so that pairs with the same key are brought together.
Step 3:
The reduce step then counts, for each word, the total number of times it occurs across all the documents, using the grouped pairs produced by the shuffle step.
As another example consider the following scenario:
Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city for the various measurement days. In this example, city is the key and temperature is the value.
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
Out of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times). Using the MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the five files and the mapper task goes through the data and returns the maximum temperature for each city. For example, the results produced from one mapper task for the data above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
As an analogy, we can think of map and reduce tasks as the way a census was conducted in Roman times, where the census bureau would dispatch its people to each city in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. There, the results from each city would be reduced to a single count (sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining the results (reducing) is much more efficient than sending a single person to count every person in the empire in a serial fashion.
Shown below is a typical Map Reduce Program:
Conclusion:
Though the MapReduce framework makes querying humongous datasets easier, MapReduce tasks must be written as acyclic dataflow programs, i.e. a stateless mapper followed by a stateless reducer, executed by a batch job scheduler. This paradigm makes repeated querying of datasets difficult and imposes limitations that are felt in fields such as machine learning, where iterative algorithms that revisit a single working set multiple times are the norm.

Friday 2 October 2015

Learning SAS by example-2

Introduction:

In this second part of Learning SAS by Example we are going to look at three powerful procedures in SAS: PROC FORMAT, PROC FREQ and PROC TABULATE.

1.
In the example above we use PROC FREQ to see counts of the data by Gender, BloodType and AgeGroup. The NOCUM and NOPERCENT options suppress the cumulative statistics and the percentages, so the output displays counts only.
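A minimal sketch of what such a PROC FREQ step might look like (assuming the Blood data set used in the later exercises, with the three variables named above):

proc freq data=Blood;
   /* NOCUM suppresses cumulative statistics, NOPERCENT suppresses percentages, */
   /* so the tables show frequency counts only                                  */
   tables Gender BloodType AgeGroup / nocum nopercent;
run;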

2.
Here we use PROC FORMAT to define the age groups and then use PROC FREQ to display the counts by age group.
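A rough sketch of the idea, assuming a numeric Age variable in the same data set; the format name and the cut points below are illustrative only, not the ones from the original screenshot:

proc format;
   value agefmt low-<30 = 'Under 30'
                30-<60  = '30 to 59'
                60-high = '60 and over';
run;

proc freq data=Blood;
   tables Age / nocum nopercent;
   format Age agefmt.;   /* PROC FREQ counts by the formatted (grouped) values */
run;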


Saturday 19 September 2015

Learning SAS by example

Program 1
Using IF and ELSE IF statements, compute two new variables as follows: Grade (numeric), with one value if Age is 12 and another if Age is 13. The quiz grades have numerical equivalents as follows: A = 95, B = 85, C = 75, D = 70, and F = 65. Using this information, compute a course grade (Course) as a weighted average of the Quiz (20%), Midterm (30%) and Final (50%).

Code:
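One possible solution is sketched below. The raw data values, the input variable names (Age, Quiz, Midterm, Final) and the Grade values of 6 and 8 are illustrative assumptions, not taken from the original post:

data school;
   input Age Quiz $ Midterm Final;
   /* numeric grade level based on Age (6 and 8 are placeholder values) */
   if Age = 12 then Grade = 6;
   else if Age = 13 then Grade = 8;
   /* numeric equivalent of the letter quiz grade */
   if Quiz = 'A' then QuizGrade = 95;
   else if Quiz = 'B' then QuizGrade = 85;
   else if Quiz = 'C' then QuizGrade = 75;
   else if Quiz = 'D' then QuizGrade = 70;
   else if Quiz = 'F' then QuizGrade = 65;
   /* weighted average: 20% quiz, 30% midterm, 50% final */
   Course = .2*QuizGrade + .3*Midterm + .5*Final;
datalines;
12 A 92 95
13 B 88 88
12 C 78 75
;

proc print data=school noobs;
run;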

Output:
Learning:
Use of IF-ELSE in SAS.

Program 2

You have the following seven values for temperatures for each day of the week, starting with Monday: 70, 72, 74, 76, 77, 78, and 85. Create a temporary SAS data set (Temperatures) with a variable (Day) equal to Mon, Tue, Wed, Thu, Fri, Sat, and Sun and a variable called Temp equal to the listed temperature values. Use a DO loop to create the Day variable.

Code:
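A sketch of one possible solution, using an iterative DO loop over the list of day names (the temperature values are the ones given in the problem):

data Temperatures;
   /* iterative DO over a list of character values creates Day */
   do Day = 'Mon','Tue','Wed','Thu','Fri','Sat','Sun';
      input Temp @;   /* single trailing @ holds the line for the next value */
      output;         /* one observation per day */
   end;
datalines;
70 72 74 76 77 78 85
;

proc print data=Temperatures noobs;
run;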

Output

Learning:
When this program executes, Day is first set to Mon, the first value in the iterative DO range. All the statements up to the END statement are executed and Day is then advanced to the next value in the range (for a numeric range, the default increment is 1). SAS then tests if the new value of Day is still within the range (for a numeric range, the upper limit is the value after the keyword TO). If it is, the statements in the DO group execute again.

Program 3:
You have daily temperatures for each hour of the day for two cities (Dallas and Houston). The 48 temperature values are strung out in several lines like this:
80 81 82 83 84 84 87 88 89 89
91 93 93 95 96 97 99 95 92 90 88
86 84 80 78 76 77 78
80 81 82 82 86
88 90 92 92 93 96 94 92 90
88 84 82 78 76 74
The first 24 values represent temperatures from Hour 1 to Hour 24 for Dallas and the next 24 values represent temperatures for Hour 1 to Hour 24 for Houston. Using the appropriate DO loops, create a data set (Temperature) with 48 observations, each observation containing the variables City, Hour, and Temp.
Note: For this problem, you will need to use a single trailing @ on your INPUT
statement (see Chapter 21, Section 21.11 for an explanation).

Code:
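A sketch of one possible solution using nested DO loops and a single trailing @, as the note suggests (the temperature values are the ones listed above):

data Temperature;
   length City $ 7;                 /* so 'Houston' is not truncated */
   do City = 'Dallas','Houston';    /* outer loop: the two cities */
      do Hour = 1 to 24;            /* inner loop: 24 hourly readings */
         input Temp @;              /* single trailing @ holds the line */
         output;
      end;
   end;
datalines;
80 81 82 83 84 84 87 88 89 89
91 93 93 95 96 97 99 95 92 90 88
86 84 80 78 76 77 78
80 81 82 82 86
88 90 92 92 93 96 94 92 90
88 84 82 78 76 74
;

proc print data=Temperature noobs;
run;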
Output:
Learnings:
To solve this problem, use a trailing @ at the end of the first INPUT statement. This is an instruction to “hold the line” for another INPUT statement in the same DATA step. By “holding the line,” we mean leaving the pointer at the present position and not advancing to the next record. The single trailing @ holds the line until another INPUT statement (without a trailing @) is encountered further down in the DATA step, or the end of the DATA step is reached.


 Problem 4
You have several lines of data, consisting of a subject number and two dates (date of birth and visit date). The subject starts in column 1 (and is 3 bytes long), the date of birth starts in column 4 and is in the form mm/dd/yyyy, and the visit date starts in column 14 and is in the form ddmmmyyyy (see sample lines below). Read the following lines of data to create a temporary SAS data set called Dates. Format both dates using the DATE9. format. Include the subject's age at the time of the visit in this data set.

0011021195011Nov2006
0020102195525May2005
0031225200525Dec2006

Code:
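One possible solution is sketched below. It reads the dates as they actually appear in the sample lines (an 8-column date of birth starting in column 4 and a 9-column visit date starting in column 12, even though the problem text says column 14), and the age calculation shown is just one reasonable choice:

data Dates;
   input Subj $ 1-3
         @4  DOB   mmddyy8.
         @12 Visit date9.;
   Age = int(yrdif(DOB, Visit, 'actual'));   /* age in whole years at the visit */
   format DOB Visit date9.;
datalines;
0011021195011Nov2006
0020102195525May2005
0031225200525Dec2006
;

proc print data=Dates noobs;
run;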




Output:











Learnings:
SAS does not normally store dates in any of these forms—it converts all of these dates into a single number—the number of days from January 1, 1960. Dates after January 1, 1960, are positive integers; dates before January 1, 1960, are negative integers. The first date (starting in column 4) is in the month-day-year form; the last date (starting in column 14) starts with the day of the month, a three-letter month abbreviation, and a four-digit year. Notice that some of the dates include separators between the values, while others do not. The date of birth (DOB) takes up 10 columns, so the informat is MMDDYY10.; the visit date takes up 9 columns, and the informat is DATE9.

Problem 5
A listing of the data file is:
IBM 5/21/2006 $80.0 10007/20/2006 $88.5
CSCO04/05/2005 $17.5 20009/21/2005 $23.6
MOT 03/01/2004 $14.7 50010/10/2006 $19.9
XMSR04/15/2006 $28.4 20004/15/2007 $12.7
BBY 02/15/2005 $45.2 10009/09/2006 $56.8
Create a SAS data set (call it Stocks) by reading the data from this file. Use formatted input. Compute several new variables as follows: TotalPur, TotalSell, and Profit. Print out the contents of this data set using PROC PRINT.

Code:
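A sketch of one possible solution using formatted input. The file name stocks.txt is illustrative, the column pointer positions assume the original raw file is column-aligned as in the listing (the blog listing may have lost some spacing), and the formulas for TotalPur, TotalSell and Profit are the obvious shares-times-price calculations:

data Stocks;
   infile 'stocks.txt';             /* illustrative file name */
   input @1  Symbol    $4.
         @5  PurDate   mmddyy10.
         @15 PurPrice  dollar6.1
         @21 Number    4.
         @25 SellDate  mmddyy10.
         @35 SellPrice dollar6.1;
   TotalPur  = Number * PurPrice;   /* cost of the purchase */
   TotalSell = Number * SellPrice;  /* proceeds of the sale */
   Profit    = TotalSell - TotalPur;
   format PurDate SellDate mmddyy10.
          PurPrice SellPrice TotalPur TotalSell Profit dollar10.2;
run;

proc print data=Stocks noobs;
run;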





Output:







Learnings:
The @ (at) signs in the INPUT statement are called column pointers—and they do just that. For example, @5 says to SAS, go to column 5. Following the variable names are SAS informats. Informats are built-in instructions that tell SAS how to read a data value. The choice of which informat to use is dictated by the data. Two of the most basic informats are w.d and $w. The w.d informat reads standard numeric values. The w tells SAS how many columns to read. The optional d tells SAS that there is an implied decimal point in the value. For example, if you have the number 123 and you read it with a 3.0 informat, SAS stores the value 123.0. If you read the same number with a 3.1 informat, SAS stores the value 12.3. If the number you are reading already has a decimal point in it (this counts as one of the columns to be read), SAS ignores the d portion of the informat. So, if you read the value 1.23 with a 4.1 informat, SAS stores a value of 1.23. The $w. informat tells SAS to read w columns of character data. In this program, Subj is read as character data and takes up three columns; values of Gender take up a single column. The MMDDYY10. informat tells SAS that the date you are reading is in the mm/dd/yyyy form. The DATE9. format, as you can see, prints dates as a two-digit day of the month, a three-character month abbreviation, and a four-digit year. This format helps avoid confusion between the month-day-year and day-month-year formats used in the United States and Europe, respectively. The DOLLAR6.1 format makes the Balance figures much easier to read.
This is a good place to mention that the COMMA w.d format is useful for displaying large numbers where you don’t need or want dollar signs.
           
 Problem 6
You have several lines of data, consisting of a subject number and two dates (date of birth and visit date). The subject starts in column 1 (and is 3 bytes long), the date of birth starts in column 4 and is in the form mm/dd/yyyy, and the visit date starts in column 14 and is in the form ddmmmyyyy (see sample lines below). Read the following lines of data to create a temporary SAS data set called Dates. Format both dates using the DATE9. format. Include the subject's age at the time of the visit in this data set.
0011021195011Nov2006
0020102195525May2005
0031225200525Dec2006

Code:

Output:











Learnings:
Same as the previous problem.


 Problem 7
Using the Hosp data set, compute the frequencies for the days of the week, months of the year, and year, corresponding to the admission dates (variable AdmitDate). Supply a format for the days of the week and months of the year. Use PROC FREQ to list these frequencies.

Code:
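A sketch of one approach: pull the weekday, month and year out of AdmitDate with the WEEKDAY, MONTH and YEAR functions, supply user-written formats for the first two, and run PROC FREQ (the format names and the intermediate data set name are my own choices):

proc format;
   value dayfmt 1='Sun' 2='Mon' 3='Tue' 4='Wed'
                5='Thu' 6='Fri' 7='Sat';
   value monfmt 1='Jan'  2='Feb'  3='Mar'  4='Apr'
                5='May'  6='Jun'  7='Jul'  8='Aug'
                9='Sep' 10='Oct' 11='Nov' 12='Dec';
run;

data admit;
   set Hosp;                     /* the Hosp data set with AdmitDate */
   Day   = weekday(AdmitDate);   /* 1 = Sunday through 7 = Saturday */
   Month = month(AdmitDate);
   Year  = year(AdmitDate);
run;

proc freq data=admit;
   tables Day Month Year / nocum nopercent;
   format Day dayfmt. Month monfmt.;
run;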











Output:

























Learnings:
SAS does not normally store dates in any of these forms—it converts all of these dates into a single number—the number of days from January 1, 1960. Dates after January 1, 1960, are positive integers; dates before January 1, 1960, are negative integers. The first date (starting in column 4) is in the month-day-year form; the last date (starting in column 14) starts with the day of the month, a three-letter month abbreviation, and a four-digit year. Notice that some of the dates include separators between the values, while others do not. The date of birth (DOB) takes up 10 columns, so the informat is MMDDYY10.; the visit date takes up 9 columns, and the informat is DATE9.


Program 8
Use the following data. If there is a missing value for the day, substitute the 15th of the month.

25 12 2005
. 5 2002
12 8 2006

Code:
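A sketch of one possible solution, following the Learning note below (the MISSING function tests the day, and MDY gets 15 as its day argument when the day is missing):

data Dates;
   input Day Month Year;
   if missing(Day) then Date = mdy(Month, 15, Year);  /* substitute the 15th */
   else Date = mdy(Month, Day, Year);
   format Date date9.;
datalines;
25 12 2005
. 5 2002
12 8 2006
;

proc print data=Dates noobs;
run;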


Output:









Learnings:
There are occasions where you have a missing value for the day of the month but still want to compute an approximate date. Many people use the 15th of the month to substitute for a missing Day value. You can use the Month data set from the previous section to demonstrate how this is done. Here the MISSING function tests if there is a missing value for the variable Day. If so, the number 15 is used as the second argument to the MDY function. The resulting listing shows the 15th of the month for the date in the second observation.


Program 9
Using the SAS data set Blood, create two temporary SAS data sets called Subset_A and Subset_B. Include in both of these data sets a variable called Combined equal to .001 times WBC plus RBC. Subset_A should consist of observations from Blood where Gender is equal to Female and BloodType is equal to AB. Subset_B should consist of all observations from Blood where Gender is equal to Female, BloodType is equal to AB, and Combined is greater than or equal to 14.

Code:
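A sketch of one possible solution, assuming the Blood data set is available:

data Subset_A Subset_B;
   set Blood;
   Combined = .001*WBC + RBC;
   if Gender = 'Female' and BloodType = 'AB' then do;
      output Subset_A;                           /* all Female / AB observations */
      if Combined ge 14 then output Subset_B;    /* only those with Combined >= 14 */
   end;
run;

proc print data=Subset_A noobs;
run;
proc print data=Subset_B noobs;
run;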


Output:













Program 10
Look at the following program and determine the storage length of each of the variables:
data storage;
length A $ 4 B $ 4;
Name = 'Goldstein';
AandB = A || B;
Cat = cats(A,B);
if Name = 'Smith' then Match = 'No';
else Match = 'Yes';
Substring = substr(Name,5,2);
run;

A _________________
B _________________
Name _________________
AandB _________________
Cat _________________
Match _________________
Substring _________________

Code:
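One way to check the storage lengths yourself is to run the DATA step above and then look at the descriptor portion with PROC CONTENTS; the expected lengths noted in the comment follow from the SAS length rules rather than from the original post:

proc contents data=storage varnum;
run;

/* Expected storage lengths:
   A = 4, B = 4       declared by the LENGTH statement
   Name = 9           length of 'Goldstein'
   AandB = 8          4 + 4 from the || concatenation
   Cat = 200          default length for CATS in a DATA step
   Match = 2          set by the first assignment the compiler sees ('No')
   Substring = 9      SUBSTR returns the length of its first argument (Name) */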


Output: