Sunday 9 August 2015

Data Munging in Python

Introduction:
Along with R and SAS, Python is one of the most powerful tools for data mining. The journey to learning any programming language begins with learning its basics. The first step in any data mining project is to import the data and then begin analysing it.
In this post we will see how to do some basic operations in Python using a data set called radishsurvey.txt.

Scenario : We will try to find out which radish variety each person chose in a survey.

About the data :
The file is a text file containing 300 lines of raw survey data, where each line consists of a name and a radish variety.

What are we looking for :

Our purpose is to extract some useful information from this data, such as:
  1.    Popularity of radish varieties.
  2.    Did anyone vote twice?
For these tasks we will need to create an array (a list, in Python terms) so that we can store the information in a reusable form. So let's see how to do that.

Array :
First of all we will process the raw data and store the information as strings in two different variables: name and vote.
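
A minimal sketch of this step. The post does not show the exact layout of a line, so the " - " separator and the sample line below are assumptions made for illustration:

```python
# Hypothetical sample line; the real data lives in radishsurvey.txt.
# The " - " separator between name and variety is an assumption.
sample_line = "Jane Doe - Plum Purple\n"

# Drop the trailing newline, then split on the assumed separator.
parts = sample_line.strip().split(" - ")
name = parts[0]  # the voter's name
vote = parts[1]  # the radish variety they chose

print(name + " voted for " + vote)
```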

Now that we have our variables, let's create an array and store the values in it:
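
One way this could look, using two Python lists as the "arrays". The sample lines and the " - " separator are assumptions, not the actual contents of radishsurvey.txt:

```python
# Hypothetical lines standing in for the contents of radishsurvey.txt.
sample_lines = [
    "Jane Doe - Champion\n",
    "John Roe - Sicily Giant\n",
    "Mary Major - Champion\n",
]

names = []  # one entry per survey line
votes = []  # the chosen variety, in the same order as names

for line in sample_lines:
    parts = line.strip().split(" - ")  # assumed separator
    names.append(parts[0])
    votes.append(parts[1])

print(names)
print(votes)
```

With the real file, the loop could iterate over `open("radishsurvey.txt")` instead of `sample_lines`, since a file object yields one line at a time.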

Checking for duplicates:
Many times there are duplicates in the data, which have to be dealt with before the data can be analysed reliably. Let's see how to do that:


The split function separates the name from the radish variety, so that line[0] stores the name and line[1] stores the radish variety. As we can see, many varieties, such as 'Sicily Giant', occur more than once in the data.
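
The sketch below illustrates the split step on a small file. It writes a hypothetical three-line sample to a temporary file so that it runs without radishsurvey.txt; the sample names and the " - " separator are assumptions:

```python
import os
import tempfile

# Hypothetical sample data in the assumed "Name - Variety" layout.
sample_text = (
    "Jane Doe - Sicily Giant\n"
    "John Roe - Champion\n"
    "Mary Major - Sicily Giant\n"
)

varieties = []
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "radishsurvey_sample.txt")
    with open(path, "w") as out_file:
        out_file.write(sample_text)

    with open(path) as survey_file:
        for raw_line in survey_file:
            line = raw_line.strip().split(" - ")
            # line[0] is the name, line[1] is the radish variety
            varieties.append(line[1])

# The same variety can show up on several lines.
print(varieties)
print(varieties.count("Sicily Giant"))
```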
Let's create a dictionary to store the count of each radish variety and then examine the output.
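
A minimal sketch of counting with a dictionary. The list of votes is a hypothetical sample, chosen to include the same variety written in two different cases:

```python
# Hypothetical votes; the real ones come from radishsurvey.txt.
votes = ["Champion", "Sicily Giant", "sicily giant", "Champion", "Red King"]

counts = {}  # maps variety name -> number of votes seen so far
for vote in votes:
    if vote not in counts:
        counts[vote] = 1   # first time we see this spelling
    else:
        counts[vote] += 1  # seen before, so add one more vote

for variety in counts:
    print(variety + ": " + str(counts[variety]))
```

Because dictionary keys are compared exactly, differently cased spellings end up under separate keys.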
As we can see, there are some anomalies here too. Sicily Giant appears as both 'sicily giant' and 'Sicily Giant'. Hence we will have to convert all the names to a common case and then take the counts.
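
A sketch of normalising the names before counting. The helper name `clean_variety` and the sample votes are hypothetical, and using `str.title()` for the common case is one possible choice rather than necessarily the one used in the original code:

```python
# Hypothetical votes with inconsistent case and stray spaces.
raw_votes = ["Sicily Giant", "sicily giant", " Champion", "champion  ", "Red  King"]


def clean_variety(text):
    """Collapse extra spaces and put the name into title case."""
    # split() with no arguments drops leading, trailing and repeated spaces.
    words = text.split()
    return " ".join(words).title()


counts = {}
for vote in raw_votes:
    variety = clean_variety(vote)
    counts[variety] = counts.get(variety, 0) + 1

print(counts)
```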

Data cleaning:
So here we have capitalised all the names and removed extra spaces from them. Each voter's name is then stored in voted, and if a name occurs again we know that the person has voted more than once, which we treat as a fraud case.
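
The check described above can be sketched as follows. The sample lines, the " - " separator and the helper `clean` are assumptions for illustration:

```python
# Hypothetical survey lines; the third is the first voter again,
# written with different case and an extra space.
sample_lines = [
    "Jane Doe - Champion",
    "John Roe - Sicily Giant",
    "jane  doe - Red King",
]


def clean(text):
    """Collapse extra spaces and put the text into title case."""
    return " ".join(text.split()).title()


voted = []          # names we have already seen
repeat_voters = []  # names that show up a second time

for line in sample_lines:
    parts = line.strip().split(" - ")  # assumed separator
    name = clean(parts[0])
    if name in voted:
        repeat_voters.append(name)
    else:
        voted.append(name)

print(repeat_voters)
```

For a file of 300 lines a list is fast enough; for much larger data a set would make the `name in voted` check quicker.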


Conclusion:
Let's see what we finally get using the function that we defined above:


So, as we can see from the above output, Champion is the most popular variety, and Red King and Plum Purple are the least popular.
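
To show how such a ranking can be read off the counts, here is a sketch that puts the steps together. The function names `clean` and `count_votes`, the " - " separator and the sample lines are all hypothetical; the sample is not the survey data. Skipping a repeat voter's later votes is also an assumption about how fraud cases are handled:

```python
def clean(text):
    """Collapse extra spaces and put the text into title case."""
    return " ".join(text.split()).title()


def count_votes(lines):
    """Count one vote per person, ignoring any repeat votes."""
    counts = {}
    voted = set()
    for line in lines:
        parts = line.strip().split(" - ")  # assumed separator
        name = clean(parts[0])
        variety = clean(parts[1])
        if name in voted:
            continue  # assumed policy: only the first vote counts
        voted.add(name)
        counts[variety] = counts.get(variety, 0) + 1
    return counts


# Hypothetical sample, not the real survey results.
sample_lines = [
    "Jane Doe - Champion",
    "John Roe - champion",
    "Mary Major - Champion",
    "Ann Other - Red King",
    "Sam Smith - Plum Purple",
    "Jane Doe - Red King",  # repeat vote, ignored
]

counts = count_votes(sample_lines)
highest = max(counts.values())
lowest = min(counts.values())

# Several varieties can tie, so collect every name at each extreme.
most_popular = sorted(v for v, n in counts.items() if n == highest)
least_popular = sorted(v for v, n in counts.items() if n == lowest)

print("Most popular:", most_popular)
print("Least popular:", least_popular)
```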









