Sunday, 20 December 2015

HIVE- HQL for Hadoop

Introduction:
 


HIVE uses Hadoop Query Language to work around the complexity of writing Hadoop programs.The query language is similar to SQL and hence people with SQL background have a workaround to query Hadoop. In this article we are going to do basic data processing task.
Steps:
  1. We have several files of baseball statistics that we will upload into Hive.
  2. Do some simple operations with them.
  3. Find the player with the highest runs for each year.
  4. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
This file has all the statistics from 1871–2011 and contains more than 90,000 rows.
Load input file:
After unzipping the files into a directory we need to upload the Master.csv and Batting.csv into a file browser in Hive as shown below.

In the Hue interface is a button called “Hive” and inside Hive there are query options like “Query Editor”, “My Queries” and “Tables” etc.
On the left there is a “query editor”. In the case of Hive we operate on the Apache Hadoop data store.  
Create empty table and load data in Hive
In “Table” we select “Create a new table from a file”, which will opens the “file browser”, where we select “batting.csv” file and we will name the new table as “temp_batting”.



Next we load the contents from ‘Batting.csv’ into temp_batting table, through the following command which need to be executed through the Query Editor
LOAD DATA INPATH ‘/user/admin/Batting.csv’ OVERWRITE INTO TABLE temp_batting;
Once data has been loaded, the file (batting.csv) will be deleted by HIVE, and will no longer be seen in the file browser.











Now that we have loaded the data, we have to verify the same. We execute the following command, which shows the first 100 rows from the table.
SELECT * from temp_batting LIMIT 100;

Create a batting table and transfer data from the temporary table to batting table
We extract the contents of temp_batting into a new table called ‘batting’ with the following columns:
a)  player_id    b)  year    c)  runs
Next object is to create the ‘batting’ table and insert in it from ‘temp_batting’ (player_id, year and run) using regular expression.
create table batting (player_id STRING, year INT, runs INT);
 insert overwrite table batting 
SELECT  regexp_extract(col_value, ‘^(?:([^,]*),?){1}’, 1) player_id,    
                regexp_extract(col_value, ‘^(?:([^,]*),?){2}’, 1) year,    
                regexp_extract(col_value, ‘^(?:([^,]*),?){9}’, 1) run 
from temp_batting 

Query to show the highest score per year:
We do a “group by” in ‘batting’ by year, so that we have the highest scores by year. 
    SELECT year, max(runs) FROM batting GROUP BY year 

Who scored the maximum runs, year-wise:
We execute final query which will show the player who scored the maximum runs in a year.
    SELECT a.year, a.player_id, a.runs from batting a 
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b 
    ON (a.year = b.year AND a.runs = b.runs) ;

1 comment: