Sunday, 9 August 2015

Working with csv files in Python

Introduction:

In this post we will see how to work with CSV files in Python. For this purpose we will use a flight data set.
Save the data as a CSV file and then read it using the csv module in Python.
Number of airports in each city:
Let's try and find out the number of airports in each city.
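
A minimal sketch of this count, assuming each record has the layout ID, airport name, city, country (so the city is at index 2). The sample rows and the helper name airports_per_city are made up for illustration; they are not taken from the real data set.

```python
import csv
import io
from collections import Counter

# Made-up sample rows in the assumed layout: ID, airport name, city, country
SAMPLE = '''1,"Goroka","Goroka","Papua New Guinea"
2,"Sydney Intl","Sydney","Australia"
3,"Sydney Bankstown","Sydney","Australia"
4,"Melbourne Intl","Melbourne","Australia"
'''

def airports_per_city(csv_file):
    """Count how many airport records mention each city."""
    counts = Counter()
    for row in csv.reader(csv_file):
        counts[row[2]] += 1  # index 2 is assumed to hold the city name
    return counts

city_counts = airports_per_city(io.StringIO(SAMPLE))
for city, number in sorted(city_counts.items()):
    print(city, number)
```

With a real file, the io.StringIO object would be replaced by something like open("airports.dat", newline="", encoding="utf-8").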

Now let's see how to get the output for a specific country, say Australia.

Explanation of the above code:
1.) We create an empty dictionary called Airport.
2.) After that, each record is read into a list called line.
3.) Then, in the if statement, we check the third column (containing the country name) and the first column (the airport name). If the dictionary already has the country name as a key, we append the airport to that key's value; otherwise we create a new key for the new country.
4.) Finally, we print the airports in Australia from the Airport dictionary.
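
The steps above can be sketched roughly as follows. This is an illustration, not the original code: it assumes the airport name is at index 1 and the country at index 3 (counting from zero), and the sample rows are made up.

```python
import csv
import io

# Made-up sample rows in the assumed layout: ID, airport name, city, country
SAMPLE = '''1,"Goroka","Goroka","Papua New Guinea"
2,"Sydney Intl","Sydney","Australia"
3,"Melbourne Intl","Melbourne","Australia"
'''

def airports_by_country(csv_file):
    """Map each country name to the list of its airport names."""
    airport = {}
    for line in csv.reader(csv_file):
        name, country = line[1], line[3]  # assumed column positions
        if country in airport:
            airport[country].append(name)   # country already seen
        else:
            airport[country] = [name]       # first airport for this country
    return airport

airport = airports_by_country(io.StringIO(SAMPLE))
for name in airport["Australia"]:
    print(name)
```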

Airline Route Histogram:
Let's see how to plot a histogram showing the distribution of flight route lengths. We need to do the following:

1.) Read in the airport file and create a dictionary mapping the unique ID of each airport to its latitude and longitude.

2.) Read in the routes file and get the IDs of the source and destination airports. Using the latitude and longitude of the two airports, we calculate the length of each route and append it to a list of all route lengths.

Now, in order to measure the distance, we need a new module called "geo_distance".

Code:

All the code that has been used:

Explanation :

The airport file (airports.dat) is used to build a dictionary mapping each unique airport ID to its geographical coordinates.
The routes file (routes.dat) is used to get the IDs of the source and destination airports. We then look up the latitude and longitude of each airport by its ID. Finally, using these coordinates, we calculate the length of the route and append it to distances = [], a list of all route lengths.
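
As a rough sketch of this procedure: the geo_distance module's interface is not shown in this post, so the code below swaps in a hypothetical haversine helper, great_circle_km, in its place. The column positions (airport ID at index 0, latitude and longitude at indices 6 and 7, source and destination IDs at indices 3 and 5) are assumptions, and the sample rows are made up.

```python
import csv
import io
import math

# Made-up rows. Assumed airport layout: ID, name, city, country, two codes, lat, lon
AIRPORTS = '''1,"Equator West","Nowhere","Testland","AAA","AAAA",0.0,0.0
2,"Equator East","Elsewhere","Testland","BBB","BBBB",0.0,90.0
'''
# Assumed route layout: airline, airline ID, source, source ID, dest, dest ID, ...
ROUTES = '''ZZ,1,AAA,1,BBB,2,,0,XXX
ZZ,1,AAA,1,CCC,999,,0,XXX
'''

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km (a stand-in for geo_distance)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def read_coordinates(airport_file):
    """Map airport ID -> (latitude, longitude)."""
    coords = {}
    for row in csv.reader(airport_file):
        coords[row[0]] = (float(row[6]), float(row[7]))
    return coords

def route_lengths(route_file, coords):
    """Return the length of every route whose two airports are both known."""
    distances = []
    for row in csv.reader(route_file):
        source_id, dest_id = row[3], row[5]
        if source_id in coords and dest_id in coords:
            lat1, lon1 = coords[source_id]
            lat2, lon2 = coords[dest_id]
            distances.append(great_circle_km(lat1, lon1, lat2, lon2))
    return distances

coords = read_coordinates(io.StringIO(AIRPORTS))
distances = route_lengths(io.StringIO(ROUTES), coords)
print(distances)
```

Routes that refer to an airport ID missing from the dictionary are skipped here; that is a design choice of this sketch rather than something the post specifies.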

Now let's draw a histogram showing the distribution of different flight lengths.
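
A minimal sketch of such a histogram with matplotlib. The distances below are made-up values, kilometres are assumed as the unit, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up route lengths (km assumed), standing in for the real distances list
distances = [350.0, 820.0, 1200.0, 1500.0, 2400.0, 5600.0, 9100.0]

fig, ax = plt.subplots()
# hist() sorts the values into equal-width bins and draws one bar per bin
bin_counts, bin_edges, _ = ax.hist(distances, bins=10)
ax.set_xlabel("Route length (km)")
ax.set_ylabel("Number of routes")
ax.set_title("Distribution of route lengths")
fig.savefig("route_lengths.png")
```

In an interactive session the backend line can be dropped and plt.show() used instead of saving to a file.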
Output : 





Data Visualization in Python

Introduction:

In the previous section we did some basic data processing to find out some details about our data. In this section we will see how to use Python's data visualization capabilities to communicate results in a clearer and more interpretable way. We will be using the matplotlib library.

matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shell (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.

Bar Plot of Radish Votes:
Let's see how to generate a bar plot of the data. We will display the number of votes for each radish variety.

Code:

This gives us the following plot:
One look at the plot and we know that Champion is the favorite radish variety. This is the power of visualizations. We have used a lot of code here, so let's try to understand it.
We are importing two libraries, matplotlib and numpy. While matplotlib is modelled on how charting is done in MATLAB, numpy provides a lot of numeric functions that come in quite handy.
In the for loop we build lists that are easy for matplotlib to interpret. names contains the labels for the x-axis, while votes contains the actual data for plotting.

We create a range of indexes for the X values in the graph, one entry for each entry in the "counts" dictionary (i.e. len(counts)), numbered 0, 1, 2, 3, etc. This will spread out the graph bars evenly across the X axis on the plot.
np.arange is a NumPy function like the range() function in Python, only the result it produces is a "NumPy array".
plt.xticks() specifies a range of values to use as labels ("ticks") for the X axis.
x + 0.5 is a special expression because x is a NumPy array: it adds 0.5 to every element of x in one step. NumPy arrays have some special capabilities that normal lists or range() objects don't have.
The above part is from the OpenTechSchool Python tutorial.
To add labels to the x-axis and y-axis we can use:
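
A sketch of the bar plot described above with axis labels added through plt.xlabel() and plt.ylabel(). The vote counts are made up and the output file name is arbitrary. Recent matplotlib versions (2.0 and later) centre each bar on its x value by default, so this sketch places the ticks at x rather than x + 0.5.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up vote counts, standing in for the real "counts" dictionary
counts = {"Champion": 5, "Red King": 1, "Plum Purple": 1}

# Split the dictionary into two parallel lists that matplotlib can use
names = []
votes = []
for name, vote_count in counts.items():
    names.append(name)
    votes.append(vote_count)

x = np.arange(len(counts))      # one position per variety: 0, 1, 2
plt.bar(x, votes)               # bar heights are the vote counts
plt.xticks(x, names)            # label each bar with its variety name
plt.xlabel("Radish variety")    # label for the x-axis
plt.ylabel("Number of votes")   # label for the y-axis
plt.savefig("radish_votes.png")
```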
Conclusion:
Here we have seen the power of visualization in Python using the matplotlib library.

Data Munging in Python

Introduction:
Along with R and SAS, Python is one of the most powerful tools for data mining. The journey to learning any programming language begins with learning its basics. The first step in any data mining project is to import the data and then begin analysing it.
In this post we will see how to do some basic operations in Python using the data set called radishsurvey.txt.

Scenario : We will try to figure out a person's choice in the variety of radish.

Introduction :
The file is a text file which contains 300 lines of raw data from a survey, where each line consists of a name and a radish variety.

What are we looking for :

Our purpose is to find out some useful information from this data, such as:
  1.    Popularity of radish varieties.
  2.    Did anyone vote twice?
For the tasks at hand we will need to create an array so that we can store information in a reusable manner. So let's see how to do that.

Array :
First of all, we will process the raw data and store the information as strings in two different variables: name and vote.

Now that we have our variables, let's create an array and store the values in it:
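
A minimal sketch of both steps, assuming each line has the form "name - variety" with " - " as the separator (the exact file format is not shown in this post). The sample lines and the helper name parse_votes are made up.

```python
import io

# Made-up lines in the assumed "name - variety" format
SAMPLE = """Voter One - Champion
Voter Two - Sicily Giant
Voter Three - Champion
"""

def parse_votes(lines):
    """Split each line into name and vote and collect the pairs in a list."""
    records = []
    for line in lines:
        line = line.strip()          # drop the trailing newline
        if not line:
            continue                 # skip blank lines
        parts = line.split(" - ")    # parts[0] is the name, parts[1] the variety
        name, vote = parts[0], parts[1]
        records.append((name, vote))
    return records

records = parse_votes(io.StringIO(SAMPLE))
for name, vote in records:
    print(name, "voted for", vote)
```

With the real file, io.StringIO(SAMPLE) would be replaced by open("radishsurvey.txt").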

Checking for duplicates:
Data often contains duplicates, which have to be removed for efficient data manipulation. Let's see how to do that:


The split function separates the name and the radish variety: line[0] stores the name and line[1] the radish variety. As we can see, many varieties, such as 'Sicily Giant', occur more than once in the data.
Let's create a dictionary to store the counts of each radish variety and then examine the output.
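
One way to sketch this counting step, using a made-up list of votes in place of the survey file:

```python
# Made-up votes, including inconsistent spellings of the same variety
votes = ["Champion", "Sicily Giant", "sicily giant", "Champion", "Red King"]

counts = {}
for vote in votes:
    if vote not in counts:
        counts[vote] = 1      # first time we see this exact spelling
    else:
        counts[vote] += 1     # seen before: increment its count

for variety, number in counts.items():
    print(variety, number)
```

Because the dictionary keys are compared as exact strings, differently spelled versions of one variety are counted separately.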
As we can see, there are some anomalies here too. Sicily Giant appears as both 'sicily giant' and 'Sicily Giant'. Hence we will have to convert all the entries to a common case and then take the counts.

Data cleaning:
So here we have capitalised all names and also removed extra spaces from them. The names are then stored one by one in voted; if a name occurs again, we know that the person has voted more than once, which we treat as a fraud case.
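
The cleaning and duplicate check can be sketched as below. This is an illustration under assumptions, not the original function: the helper names clean and count_votes are hypothetical, the records are made up, and ignoring a repeated voter's later votes is a choice made for this sketch.

```python
def clean(text):
    """Collapse extra whitespace and convert to a common case."""
    return " ".join(text.split()).capitalize()

def count_votes(records):
    """Count one vote per person; report anyone who tries to vote again."""
    counts = {}
    voted = []   # cleaned names of everyone who has voted so far
    for name, vote in records:
        name = clean(name)
        vote = clean(vote)
        if name in voted:
            print(name, "has already voted; ignoring this vote")
            continue
        voted.append(name)
        counts[vote] = counts.get(vote, 0) + 1
    return counts, voted

# Made-up records with inconsistent case and spacing
records = [
    ("Voter One", "Champion"),
    ("voter  one ", "Red King"),       # same person, voting a second time
    ("Voter Two", "sicily  giant"),
    ("Voter Three", "Sicily Giant"),
]

counts, voted = count_votes(records)
print(counts)
```

Note that capitalize() lowercases everything after the first letter, so 'Sicily Giant' is stored as 'Sicily giant'.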


Conclusion:
Let's see what we get finally using the function that we defined above:


So, as we can see from the above output, Champion is the most popular variety, and Red King and Plum Purple are the least popular.
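
For reference, the most and least popular varieties can be read off a counts dictionary along these lines. The numbers below are made up for illustration and are not the survey results.

```python
# Made-up counts, standing in for the dictionary produced by the counting step
counts = {"Champion": 5, "Sicily giant": 3, "Red king": 1, "Plum purple": 1}

highest = max(counts.values())
lowest = min(counts.values())

# Collect every variety that ties for the highest or lowest count
most_popular = [variety for variety, n in counts.items() if n == highest]
least_popular = [variety for variety, n in counts.items() if n == lowest]

print("Most popular:", most_popular)
print("Least popular:", least_popular)
```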