Saturday 18 July 2015

GoogleVis package in R

Introduction:

In 2006 Hans Rosling gave an inspiring talk at TED about social and economic developments in the world over the last 50 years, which challenged the views and perceptions of many listeners. Rosling had used extensive data analysis to reach his conclusions. To visualize his talk, he and his team at Gapminder had developed animated bubble charts.


Rosling's presentation popularized the idea and use of interactive charts, and as a result the software 
behind Gapminder was bought by Google and integrated as motion charts into their Visualisation APIone year later.
"googleVis is an R package providing an interface between R and Google Charts. The functions of 
the package allow the user to visualize data with the Google Chart Tools without uploading their 
data to Google.
The output of googleVis functions is html code that contains the data and references to JavaScript 
functions hosted by Google. To view the output a browser with Flash and Internet connection is 
required, the actual chart is rendered in the browser.
Source: google
Data visualizations are an integral part of the Analytics process as it helps to convey important information from a multitude of data easily and attractively.The output of the various functions in the package can be embedded into a web page.In this article we will look at a few basic functions in the googleVis package for visualizing data. 
1.TreeMap:

A tree map is a visual representation of a data tree, where each node can have zero or more children,and one parent (except for the root, which has no parents). Each node is displayed as a rectangle,sized and colored according to assigned values .Sizes and colors are valued relative to all other nodes in thegraph. The default behavior is to move down the tree when a user left-clicks a node, and to move back up the tree when a user right-clicks the graph.We will study the package googleVis and it’s dependencies,the packages it imports and enhances via the gvisTreeMap function.For this lets see the dependencies graph depicting the relationships.

library(“sqldf”)
library("googleVis")
tag <- "googleVis"
plot(makeDepGraph(tag, includeBasePkgs=FALSE, suggests=TRUE, enhances=TRUE), 
     legendPosEdge = c(-1, 1), legendPosVertex = c(1, 1), vertex.size=20)


As you can see in the above graph,googleVis is the parent for knitr and knitr imports 7 other packages.googleVis has no parent and has 23 linkages.Similarly the values for all the packages like stable,R6 etc are calculated and arranged in the below format:

Each record is a node and can have only one parent.The values in the column Val are incoming links + 1.So for the packages like RCCP,R6 etc. it is 1. Knitr has 7 imports and hence Val is 7+1. Fac is a column containing random numbers to generate colors for the tree.
Tree <- gvisTreeMap(pkg,idvar = "Package",parentvar = "Parent",
                    sizevar = "Val",color = “Fac”)
plot(Tree)

Attributes of the function:

data-The dataframe(pkg) used as an input to the function needs to have atleast 4 columns.Each record represents one node and each node has one or more parent nodes.
idvar- column name of data describing the ID for each node.This value is displayed as the node header.
parentvar-column name of data that match to entries in idvar. If this is a root node, its value is NA. Only one root is allowed per treemap.
sizevar -column name of data with positive values to define the size of maps. Any positive value is allowed.This value determines the size of the node, computed relative to all other nodes.

colorvar- column name of data with values to define range of color.The value is used to calculate a color for this node. The color value is first recomputed on a scale from minColorValue to maxColorValue,and then the node is assigned a color from the gradient between minColor and maxColor.


The Output:

TreeMapID1024753d785a
Data: pkg • Chart ID: TreeMapID1024753d785agoogleVis-0.5.8
R version 3.2.0 (2015-04-16) • Google Terms of UseDocumentation and Data Policy


For the charts 2,3 & 4 we will be using the below dataset os

2.MotionChart:
A motion chart is a dynamic chart to explore several indicators over time.Here we will explore the variation of Sales,Profit and expenses of the three items across the years.
M <- gvisMotionChart(os,idvar="Item",timevar="Year")
plot(M)
Attributes of the function gvisMotionChart:
data- a data.frame. The data has to have at least four columns with subject name (idvar), time (timevar) and two columns of numeric values. Further columns,numeric and character/factor are optional. The combination of idvar and timevar has to describe a unique row. The column names of the idvar and timevar have to be specified.
idvar- column name of data with the subject to be analysed.

timevar - column name of data which shows the time dimension. The information has to be either numeric, of class Date or a character .

Output:

MotionChartID102473f95523
Data: os • Chart ID: MotionChartID102473f95523googleVis-0.5.8
R version 3.2.0 (2015-04-16) • Google Terms of UseDocumentation and Data Policy


3.Bar chart:

For the bar chart we combine the expenses,sales and profit for the y-axis and plot its variation              across locations.When we hover above each bar,the corresponding values are displayed.


Attributes of the function gvisBarChart:
xvar- name of the character column which contains the category labels for the x-axes.
yvar-a vector of column names of the numerical variables to be plotted.Each column is                        displayed as a separate bar/column.      

Output:

BarChartID102478ce3fa2
Data: data • Chart ID: BarChartID102478ce3fa2googleVis-0.5.8
R version 3.2.0 (2015-04-16) • Google Terms of UseDocumentation and Data Policy

4.Pie Chart:
An interactive pie chart which shows the values of the variable which is specified while creating the chart.
pie_data <- sqldf("select Item,sum(Sales) as sales from os group by Item")
pie <- gvisPieChart(pie_data,labelvar = 'Item',numvar = 'Sales')
plot(pie)
Attributes of the function gvisPieChart:
labelvar -Name of the character column which contains the category labels for the slice labels.
numvara vector of column names of the numerical variables of the slice values.The output of the above code:

PieChartID10247222555f
Data: pie_data • Chart ID: PieChartID10247222555fgoogleVis-0.5.8
R version 3.2.0 (2015-04-16) • Google Terms of UseDocumentation and Data Policy

5.Geo Map(World):
A geo map is a map of a country, continent, or region map, with colours and values assigned to specific regions. Values are displayed as a colour scale, and optional hover-ext for regions can be specified.
For our understanding we will be using the distribution of Mac Donalds stores across the world.There are 113 locations(Country/Territory) where the stores are located,but for our example here will consider only 21 such locations.



map_world <- gvisGeoMap(Mac_Donalds, locationvar = "Country", 
                   numvar="Numberofoutlets",
                   options=list(dataMode="regions"))                  

plot(map_world)

Attributes of the function gvisGeoMap:
data -data.frame. The data has to have at least two columns with location name (locationvar), value to be mapped to location (numvar) and an optional variable to display any text while the mouse hovers over the location (hovervar).
locationvar- column name of data with the geo locations to be analysed. The locations can be provide in two formats:
Format 1 latitude:longitude.
Format 2 Address, country name, region name locations, or US metropolitan area codes.
If we use regions other than US we need to specify the region in options.
numvar -column name of data with the numeric value displayed when the user hovers
over this region.
hovervar- column name of data with the additional string text displayed when the user
hovers over this region.
options -list of configuration options. 
The Output:

GeoMapID102430ee7e50
Data: Mac_Donalds • Chart ID: GeoMapID102430ee7e50googleVis-0.5.8
R version 3.2.0 (2015-04-16) • Google Terms of UseDocumentation and Data Policy



7.Geo Map(For a specific country):
For this part we will use the distribution of screens of PVR cinemas across the country.



The output :

PVR_SCREENS
Data: pvr • Chart ID: PVR_SCREENSgoogleVis-0.5.8
R version 3.2.0 (2015-04-16) • Google Terms of UseDocumentation and Data Policy




Notice the differences between this graph and the preceding one:
A region is specified here.This has to be done when the states are in a country other than the US.
A hovervar variable is specified here.This will show us the name of the state when we place the point above the markers.
The datamode here is markers.The "markers" style displays a circle, sized and colored to indicate a value,over the regions.
Conclusion:
There are many packages in R which can be used to create interactive plots,but displaying them over the browser is what makes GoogleVis special.Have you used GoogleVis in any other ways,please feel free to make suggestions.
Reference: official google vis vignette

Sunday 5 July 2015

Basic Structural aspects of RDBMS systems(SQL)

In the age of Big Data where storing unstructured data has brought to the forefront NoSQL systems like MongoDB,RDBMS systems are slowly loosing their relevance.However,they are still largely used for structured data storage and manipulation.In this article we will take a look at some basic structural aspects of SQL databases like relationships,foreign key constraints .
DatabaseDiagram:



A database diagram is used to depict the relationships between related tables in a database.As can be seen in the above diagram Employees,Territories and EmployeeTerritories are related by the keys Employee ID and TerritoryID.
Primary Key & Foreign Key:
In the diagram the EmployeeID key is a primary key in Employee table but a foreign key in Employee Territories.
Definition:
"The primary key of a relational table uniquely identifies each record in the table. It can either be a normal attribute that is guaranteed to be unique (such as Social Security Number in a table with no more than one record per person) or it can be generated by the DBMS (such as a globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may consist of a single attribute or multiple attributes in combination."
Well that is a big definition but let's try and understand that with an example.
In the above database diagram the Key EmployeeID is a primary key and is used to uniquely identify records in the table.
As can be seen here the key EmployeeID is unique for each row.
Now lets see what a foreign key is.
Here,all the keys are foreign keys,which means that they are primary keys in some other table/s and are used here in this way to create a mapping of records in multiple tables as was depicted in the database diagram.This is done primarily to avoid data redundancy and is called normalization,which has to be dealt separately.
Schema:
"database schema (/ˈski.mə/ skee-mə) of a database system is its structure described in a formal language supported by the database management system (DBMS) and refers to the organization of data as a blueprint of how a database is constructed (divided into database tables in the case of Relational Databases). The formal definition of databaseschema is a set of formulas (sentences) called integrity constraints imposed on a database. These integrity constraints ensure compatibility between parts of the schema. All constraints are expressible in the same language. A database can be considered a structure in realization of the database language.[1] The states of a created conceptual schema are transformed into an explicit mapping, the database schema. This describes how real world entities are modeled in the database."-Source: Wiki.
A schema imposes a structure on the database which is a mainstay of RDBMS systems.It imposes constraints which prevent accidental deletion of records which are related.For example if we try to delete the records where EmployeeID = 1 in table Employees we will get the following error message:

This is done because if we delete a primary key record,the foreign key in another table/s which references it becomes invalid.This is a strong structural imposition on the data.

Conclusion:
In this article we have seen some of the basics on how data is structured in a RDBMS system like SQL and some of it's salient features.In a future post I will try to share some basic operations in SQL like record addition,delete,update,retrieval etc.Please feel free to comment.