This article is like a crash course, that list downs the things that you would need most frequently while working exploring and manipulating datasets. It will be easy, if you’ve already taken a basic course on Python. Don’t worry you can start for now without any experience.

Let’s start

Step 0: Install Anaconda

Anaconda

We’ll be using Python 3 in this article. After installing, fire up your Spyder(IDE) or Jupyter(notebook) to start coding

Libraries

Apart from basic operations, you would need libraries for most of the functions like reading data, summarizing etc. . The most frequently used libraries in data science are

Numpy
Pandas
Scipy
Scikit-Learn
Matplotlib

To download libraries either use command prompt OR anaconda prompt . You can also download them from Jupyter/IPython notebook

#From command prompt
pip install pandas

#If you don't have root access on your Linux/Unix terminal
sudo pip install pandas

#From Jupyter
!pip install pandas

You need to import these libraries before you use them

import pandas as pd
import numpy as np


Read data and explore

To read a csv file we will be using, read_csv from pandas. This we give us a data frame.

df = pd.read_csv('http://winterolympicsmedals.com/medals.csv')

Let’s look at the data we’ve just read. It’s shape,columns, head etc.

#Table shape
df.shape #(Rows,Columns)
(2311, 8)

Returns a tuple with number of rows and columns. (Ha.. Tuple another data type)


#Column names
df.columns
Index(['Year', 'City', 'Sport', 'Discipline', 'NOC', 'Event', 'Event gender',
       'Medal'],
      dtype='object')

Columns returns an index of column names


Let’s print your data, but not all of it

df.head()
Year City Sport Discipline NOC Event Event gender Medal
0 1924 Chamonix Skating Figure skating AUT individual M Silver
1 1924 Chamonix Skating Figure skating AUT individual W Gold
2 1924 Chamonix Skating Figure skating AUT pairs X Gold
3 1924 Chamonix Bobsleigh Bobsleigh BEL four-man M Bronze
4 1924 Chamonix Ice Hockey Ice Hockey CAN ice hockey M Gold


If you want to choose how many you rows you want to see

#Specify number of rows
df.head(2)
Year City Sport Discipline NOC Event Event gender Medal
0 1924 Chamonix Skating Figure skating AUT individual M Silver
1 1924 Chamonix Skating Figure skating AUT individual W Gold


Tip : Use tail() to get last rows, head(-n) to get (#rows - n) observations


Short summary of the data

#Get the description of all numeric columns
df.describe()
Year
count 2311.000000
mean 1980.361748
std 22.089091
min 1924.000000
25% 1968.000000
50% 1988.000000
75% 1998.000000
max 2006.000000


One Shot

shape
columns
head()
tail()
describe()


Subsetting & Indexing

Chasing for the required columns and rows we want, is the daily routine. As I see it there are 3 main ways we subset i.e. Index, Name and Logical.
In python we will be using two functions for this

loc - works using labels
iloc - works using position - only integers


Let’s start with some thing simple, choosing a row. We will use iloc now

#In python indexing starts with 0, unlike in R
df.iloc[0,:]
Year                      1924
City                  Chamonix
Sport                  Skating
Discipline      Figure skating
NOC                        AUT
Event               individual
Event gender                 M
Medal                   Silver
Name: 0, dtype: object

This returns a single row as a Series(?)
When you want multiple rows, you specify the required positions as a list(?) or range.

df.iloc[[0,1,2],:] # [0,1,2] is a list
Year City Sport Discipline NOC Event Event gender Medal
0 1924 Chamonix Skating Figure skating AUT individual M Silver
1 1924 Chamonix Skating Figure skating AUT individual W Gold
2 1924 Chamonix Skating Figure skating AUT pairs X Gold


Using a range of numbers

df.iloc[0:3,:] # Note not 0:2 we're using 0:3
Year City Sport Discipline NOC Event Event gender Medal
0 1924 Chamonix Skating Figure skating AUT individual M Silver
1 1924 Chamonix Skating Figure skating AUT individual W Gold
2 1924 Chamonix Skating Figure skating AUT pairs X Gold


Note: Another way to do whatever we’ve done before is

df.iloc[0]
df.iloc[[0,1,2]]
df.iloc[0:3]


Now Let’s subset the columns

df.iloc[:,0] # To select the first column and all rows
df.iloc[:,0:3] # To select the first 3 columns and all rows
df.iloc[0:3,0:3] # To select the first 3 columns and first 3 rows
Year City Sport
0 1924 Chamonix Skating
1 1924 Chamonix Skating
2 1924 Chamonix Skating

But we won’t be using column numbers most of the times, we use column names. So how do we do it

df['Year'] #Single Column
df[['Year','Sport']] #Multiple Columns

When we want a combination of rows and column names , One way is

df.iloc[0:3][['Year','Sport']] # OR df[['Year','Sport']].iloc[0:3] (potayto, potahto...)
Year Sport
0 1924 Skating
1 1924 Skating
2 1924 Skating


Oh we’re done with most of the subsetting, we haven’t used loc function. May be we just don’t need it. Hmmm, Let’s see can we use loc to do what we’ve done just now

df.loc[0:3,['Year','Sport']] # Just note that when we used 0:3 we got more rows than when used iloc
Year Sport
0 1924 Skating
1 1924 Skating
2 1924 Skating
3 1924 Bobsleigh

Yeah we can, but there is just one catch. As i said at the start loc works with labels.

Labels are the column names and row names (or Indexes). Let’s see what that means.

df.columns #Column names
Index(['Year', 'City', 'Sport', 'Discipline', 'NOC', 'Event', 'Event gender',
       'Medal'],
      dtype='object')


df.index #Row names
RangeIndex(start=0, stop=2311, step=1)

There is generally no problem, when we’re subsetting columns labels using loc. The problem is with row labels. Although in most of the cases the row labels (Index) are just a numbers from 0 to #rows, in some cases they are not.

To check this we’will take a subset of data and change their row labels and see how that effects our results.

df1 = df.iloc[0:5]
df1.shape
(5, 8)


df1.index
RangeIndex(start=0, stop=5, step=1)


df1.loc[0:3,]
Year City Sport Discipline NOC Event Event gender Medal
0 1924 Chamonix Skating Figure skating AUT individual M Silver
1 1924 Chamonix Skating Figure skating AUT individual W Gold
2 1924 Chamonix Skating Figure skating AUT pairs X Gold
3 1924 Chamonix Bobsleigh Bobsleigh BEL four-man M Bronze


Let’s change the index and run the same command.

df1.index = [0,33,11,44,3]
df1.index
Int64Index([0, 33, 11, 44, 3], dtype='int64')


df1.loc[0:3,]
Year City Sport Discipline NOC Event Event gender Medal
0 1924 Chamonix Skating Figure skating AUT individual M Silver
33 1924 Chamonix Skating Figure skating AUT individual W Gold
11 1924 Chamonix Skating Figure skating AUT pairs X Gold
44 1924 Chamonix Bobsleigh Bobsleigh BEL four-man M Bronze
3 1924 Chamonix Ice Hockey Ice Hockey CAN ice hockey M Gold

As we can see in the output, it is subsetting the data till it finds the corresponding label. i.e. after we changed the row labels, now in the new data the label 3 occurs in the 5th row. So, it prints 5 rows


One Shot

df.loc[0] One Row

df.loc[[1,3,4]] Selective Rows

df.loc[0:3] Consecutive Rows

df['Year'] One Column

df[['Year','Sport']] Selective Columns

df.loc[,'Year':'Sport'] Consecutive Columns

df.iloc[0:3][['Year','Sport']] Rows and columns

Till now we’ve done the subsetting using index and names. Now let’s see how to do conditional subsetting.

From our data, we will subset the data with City Chamonix

df_sub = df[df['City']=='Chamonix']
df_sub.head()
Year City Sport Discipline NOC Event Event gender Medal
0 1924 Chamonix Skating Figure skating AUT individual M Silver
1 1924 Chamonix Skating Figure skating AUT individual W Gold
2 1924 Chamonix Skating Figure skating AUT pairs X Gold
3 1924 Chamonix Bobsleigh Bobsleigh BEL four-man M Bronze
4 1924 Chamonix Ice Hockey Ice Hockey CAN ice hockey M Gold

If you want to choose specific columns from this data, you can either do like this

df_sub = df[df['City']=='Chamonix'][['Year','Sport']]
df_sub.head()
Year Sport
0 1924 Skating
1 1924 Skating
2 1924 Skating
3 1924 Bobsleigh
4 1924 Ice Hockey

OR like this

df_sub = df.loc[df['City']=='Chamonix',['Year','Sport']]
df_sub.head()
Year Sport
0 1924 Skating
1 1924 Skating
2 1924 Skating
3 1924 Bobsleigh
4 1924 Ice Hockey

I think you got the gist of subsetting .

We will be continuing our learning in the next part, which will be ready soon