This article is like a crash course, that list downs the things that you would need most frequently while working exploring and manipulating datasets. It will be easy, if you’ve already taken a basic course on Python. Don’t worry you can start for now without any experience.
Let’s start
Step 0: Install Anaconda
We’ll be using Python 3 in this article. After installing, fire up your Spyder(IDE) or Jupyter(notebook) to start coding
Libraries
Apart from basic operations, you would need libraries for most of the functions like reading data, summarizing etc. . The most frequently used libraries in data science are
Numpy
Pandas
Scipy
Scikit-Learn
Matplotlib
To download libraries either use command prompt OR anaconda prompt . You can also download them from Jupyter/IPython notebook
#From command prompt
pip install pandas
#If you don't have root access on your Linux/Unix terminal
sudo pip install pandas
#From Jupyter
!pip install pandas
You need to import these libraries before you use them
import pandas as pd
import numpy as np
Read data and explore
To read a csv file we will be using, read_csv from pandas. This we give us a data frame.
df = pd.read_csv('http://winterolympicsmedals.com/medals.csv')
Let’s look at the data we’ve just read. It’s shape,columns, head etc.
#Table shape
df.shape #(Rows,Columns)
(2311, 8)
Returns a tuple with number of rows and columns. (Ha.. Tuple another data type)
#Column names
df.columns
Index(['Year', 'City', 'Sport', 'Discipline', 'NOC', 'Event', 'Event gender',
'Medal'],
dtype='object')
Columns returns an index of column names
Let’s print your data, but not all of it
df.head()
Year | City | Sport | Discipline | NOC | Event | Event gender | Medal | |
---|---|---|---|---|---|---|---|---|
0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
1 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
2 | 1924 | Chamonix | Skating | Figure skating | AUT | pairs | X | Gold |
3 | 1924 | Chamonix | Bobsleigh | Bobsleigh | BEL | four-man | M | Bronze |
4 | 1924 | Chamonix | Ice Hockey | Ice Hockey | CAN | ice hockey | M | Gold |
If you want to choose how many you rows you want to see
#Specify number of rows
df.head(2)
Year | City | Sport | Discipline | NOC | Event | Event gender | Medal | |
---|---|---|---|---|---|---|---|---|
0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
1 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
Tip : Use tail() to get last rows, head(-n) to get (#rows - n) observations
Short summary of the data
#Get the description of all numeric columns
df.describe()
Year | |
---|---|
count | 2311.000000 |
mean | 1980.361748 |
std | 22.089091 |
min | 1924.000000 |
25% | 1968.000000 |
50% | 1988.000000 |
75% | 1998.000000 |
max | 2006.000000 |
One Shot
shape
columns
head()
tail()
describe()
Subsetting & Indexing
Chasing for the required columns and rows we want, is the daily routine. As I see it there are 3 main ways we subset i.e. Index, Name and Logical.
In python we will be using two functions for this
loc - works using labels
iloc - works using position - only integers
Let’s start with some thing simple, choosing a row. We will use iloc now
#In python indexing starts with 0, unlike in R
df.iloc[0,:]
Year 1924
City Chamonix
Sport Skating
Discipline Figure skating
NOC AUT
Event individual
Event gender M
Medal Silver
Name: 0, dtype: object
This returns a single row as a Series(?)
When you want multiple rows, you specify the required positions as a list(?) or range.
df.iloc[[0,1,2],:] # [0,1,2] is a list
Year | City | Sport | Discipline | NOC | Event | Event gender | Medal | |
---|---|---|---|---|---|---|---|---|
0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
1 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
2 | 1924 | Chamonix | Skating | Figure skating | AUT | pairs | X | Gold |
Using a range of numbers
df.iloc[0:3,:] # Note not 0:2 we're using 0:3
Year | City | Sport | Discipline | NOC | Event | Event gender | Medal | |
---|---|---|---|---|---|---|---|---|
0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
1 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
2 | 1924 | Chamonix | Skating | Figure skating | AUT | pairs | X | Gold |
Note: Another way to do whatever we’ve done before is
df.iloc[0]
df.iloc[[0,1,2]]
df.iloc[0:3]
Now Let’s subset the columns
df.iloc[:,0] # To select the first column and all rows
df.iloc[:,0:3] # To select the first 3 columns and all rows
df.iloc[0:3,0:3] # To select the first 3 columns and first 3 rows
Year | City | Sport | |
---|---|---|---|
0 | 1924 | Chamonix | Skating |
1 | 1924 | Chamonix | Skating |
2 | 1924 | Chamonix | Skating |
But we won’t be using column numbers most of the times, we use column names. So how do we do it
df['Year'] #Single Column
df[['Year','Sport']] #Multiple Columns
When we want a combination of rows and column names , One way is
df.iloc[0:3][['Year','Sport']] # OR df[['Year','Sport']].iloc[0:3] (potayto, potahto...)
Year | Sport | |
---|---|---|
0 | 1924 | Skating |
1 | 1924 | Skating |
2 | 1924 | Skating |
Oh we’re done with most of the subsetting, we haven’t used loc function. May be we just don’t need it. Hmmm, Let’s see can we use loc to do what we’ve done just now
df.loc[0:3,['Year','Sport']] # Just note that when we used 0:3 we got more rows than when used iloc
Year | Sport | |
---|---|---|
0 | 1924 | Skating |
1 | 1924 | Skating |
2 | 1924 | Skating |
3 | 1924 | Bobsleigh |
Yeah we can, but there is just one catch. As i said at the start loc works with labels.
Labels are the column names and row names (or Indexes). Let’s see what that means.
df.columns #Column names
Index(['Year', 'City', 'Sport', 'Discipline', 'NOC', 'Event', 'Event gender',
'Medal'],
dtype='object')
df.index #Row names
RangeIndex(start=0, stop=2311, step=1)
There is generally no problem, when we’re subsetting columns labels using loc. The problem is with row labels. Although in most of the cases the row labels (Index) are just a numbers from 0 to #rows, in some cases they are not.
To check this we’will take a subset of data and change their row labels and see how that effects our results.
df1 = df.iloc[0:5]
df1.shape
(5, 8)
df1.index
RangeIndex(start=0, stop=5, step=1)
df1.loc[0:3,]
Year | City | Sport | Discipline | NOC | Event | Event gender | Medal | |
---|---|---|---|---|---|---|---|---|
0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
1 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
2 | 1924 | Chamonix | Skating | Figure skating | AUT | pairs | X | Gold |
3 | 1924 | Chamonix | Bobsleigh | Bobsleigh | BEL | four-man | M | Bronze |
Let’s change the index and run the same command.
df1.index = [0,33,11,44,3]
df1.index
Int64Index([0, 33, 11, 44, 3], dtype='int64')
df1.loc[0:3,]
Year | City | Sport | Discipline | NOC | Event | Event gender | Medal | |
---|---|---|---|---|---|---|---|---|
0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
33 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
11 | 1924 | Chamonix | Skating | Figure skating | AUT | pairs | X | Gold |
44 | 1924 | Chamonix | Bobsleigh | Bobsleigh | BEL | four-man | M | Bronze |
3 | 1924 | Chamonix | Ice Hockey | Ice Hockey | CAN | ice hockey | M | Gold |
As we can see in the output, it is subsetting the data till it finds the corresponding label. i.e. after we changed the row labels, now in the new data the label 3 occurs in the 5th row. So, it prints 5 rows
One Shot
df.loc[0]
One Row
df.loc[[1,3,4]]
Selective Rows
df.loc[0:3]
Consecutive Rows
df['Year']
One Column
df[['Year','Sport']]
Selective Columns
df.loc[,'Year':'Sport']
Consecutive Columns
df.iloc[0:3][['Year','Sport']]
Rows and columns
Till now we’ve done the subsetting using index and names. Now let’s see how to do conditional subsetting.
From our data, we will subset the data with City Chamonix
df_sub = df[df['City']=='Chamonix']
df_sub.head()
Year | City | Sport | Discipline | NOC | Event | Event gender | Medal | |
---|---|---|---|---|---|---|---|---|
0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
1 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
2 | 1924 | Chamonix | Skating | Figure skating | AUT | pairs | X | Gold |
3 | 1924 | Chamonix | Bobsleigh | Bobsleigh | BEL | four-man | M | Bronze |
4 | 1924 | Chamonix | Ice Hockey | Ice Hockey | CAN | ice hockey | M | Gold |
If you want to choose specific columns from this data, you can either do like this
df_sub = df[df['City']=='Chamonix'][['Year','Sport']]
df_sub.head()
Year | Sport | |
---|---|---|
0 | 1924 | Skating |
1 | 1924 | Skating |
2 | 1924 | Skating |
3 | 1924 | Bobsleigh |
4 | 1924 | Ice Hockey |
OR like this
df_sub = df.loc[df['City']=='Chamonix',['Year','Sport']]
df_sub.head()
Year | Sport | |
---|---|---|
0 | 1924 | Skating |
1 | 1924 | Skating |
2 | 1924 | Skating |
3 | 1924 | Bobsleigh |
4 | 1924 | Ice Hockey |
I think you got the gist of subsetting .
We will be continuing our learning in the next part, which will be ready soon