Learn Python in 30 minutes for Data scientists

This article is like a crash course, that list downs the things that you would need most frequently while working exploring and manipulating datasets. It will be easy, if you’ve already taken a basic course on Python. Don’t worry you can start for now without any experience.

Let’s start

Step 0: Install Anaconda

Anaconda

We’ll be using Python 3 in this article. After installing, fire up your Spyder(IDE) or Jupyter(notebook) to start coding

Libraries

Apart from basic operations, you would need libraries for most of the functions like reading data, summarizing etc. . The most frequently used libraries in data science are

Numpy
Pandas
Scipy
Scikit-Learn
Matplotlib

To download libraries either use command prompt OR anaconda prompt . You can also download them from Jupyter/IPython notebook

#From command prompt
pip install pandas

#If you don't have root access on your Linux/Unix terminal
sudo pip install pandas

#From Jupyter
!pip install pandas

You need to import these libraries before you use them

import pandas as pd
import numpy as np

Read data and explore

To read a csv file we will be using, read_csv from pandas. This we give us a data frame.

df = pd.read_csv('http://winterolympicsmedals.com/medals.csv')

Let’s look at the data we’ve just read. It’s shape,columns, head etc.

#Table shape
df.shape #(Rows,Columns)

(2311, 8)

Returns a tuple with number of rows and columns. (Ha.. Tuple another data type)

#Column names
df.columns

Index(['Year', 'City', 'Sport', 'Discipline', 'NOC', 'Event', 'Event gender',
       'Medal'],
      dtype='object')

Columns returns an index of column names

Let’s print your data, but not all of it

df.head()

	Year	City	Sport	Discipline	NOC	Event	Event gender	Medal
0	1924	Chamonix	Skating	Figure skating	AUT	individual	M	Silver
1	1924	Chamonix	Skating	Figure skating	AUT	individual	W	Gold
2	1924	Chamonix	Skating	Figure skating	AUT	pairs	X	Gold
3	1924	Chamonix	Bobsleigh	Bobsleigh	BEL	four-man	M	Bronze
4	1924	Chamonix	Ice Hockey	Ice Hockey	CAN	ice hockey	M	Gold

If you want to choose how many you rows you want to see

#Specify number of rows
df.head(2)

	Year	City	Sport	Discipline	NOC	Event	Event gender	Medal
0	1924	Chamonix	Skating	Figure skating	AUT	individual	M	Silver
1	1924	Chamonix	Skating	Figure skating	AUT	individual	W	Gold

Tip : Use tail() to get last rows, head(-n) to get (#rows - n) observations

Short summary of the data

#Get the description of all numeric columns
df.describe()

	Year
count	2311.000000
mean	1980.361748
std	22.089091
min	1924.000000
25%	1968.000000
50%	1988.000000
75%	1998.000000
max	2006.000000

One Shot

shape
columns
head()
tail()
describe()

Subsetting & Indexing

Chasing for the required columns and rows we want, is the daily routine. As I see it there are 3 main ways we subset i.e. Index, Name and Logical.
In python we will be using two functions for this

loc - works using labels
iloc - works using position - only integers

Let’s start with some thing simple, choosing a row. We will use iloc now

#In python indexing starts with 0, unlike in R
df.iloc[0,:]

Year                      1924
City                  Chamonix
Sport                  Skating
Discipline      Figure skating
NOC                        AUT
Event               individual
Event gender                 M
Medal                   Silver
Name: 0, dtype: object

This returns a single row as a Series(?)
When you want multiple rows, you specify the required positions as a list(?) or range.

df.iloc[[0,1,2],:] # [0,1,2] is a list

	Year	City	Sport	Discipline	NOC	Event	Event gender	Medal
0	1924	Chamonix	Skating	Figure skating	AUT	individual	M	Silver
1	1924	Chamonix	Skating	Figure skating	AUT	individual	W	Gold
2	1924	Chamonix	Skating	Figure skating	AUT	pairs	X	Gold

Using a range of numbers

df.iloc[0:3,:] # Note not 0:2 we're using 0:3

	Year	City	Sport	Discipline	NOC	Event	Event gender	Medal
0	1924	Chamonix	Skating	Figure skating	AUT	individual	M	Silver
1	1924	Chamonix	Skating	Figure skating	AUT	individual	W	Gold
2	1924	Chamonix	Skating	Figure skating	AUT	pairs	X	Gold

Note: Another way to do whatever we’ve done before is

df.iloc[0]
df.iloc[[0,1,2]]
df.iloc[0:3]

Now Let’s subset the columns

df.iloc[:,0] # To select the first column and all rows

df.iloc[:,0:3] # To select the first 3 columns and all rows

df.iloc[0:3,0:3] # To select the first 3 columns and first 3 rows

	Year	City	Sport
0	1924	Chamonix	Skating
1	1924	Chamonix	Skating
2	1924	Chamonix	Skating

But we won’t be using column numbers most of the times, we use column names. So how do we do it

df['Year'] #Single Column

df[['Year','Sport']] #Multiple Columns

When we want a combination of rows and column names , One way is

df.iloc[0:3][['Year','Sport']] # OR df[['Year','Sport']].iloc[0:3] (potayto, potahto...)

	Year	Sport
0	1924	Skating
1	1924	Skating
2	1924	Skating

Oh we’re done with most of the subsetting, we haven’t used loc function. May be we just don’t need it. Hmmm, Let’s see can we use loc to do what we’ve done just now

df.loc[0:3,['Year','Sport']] # Just note that when we used 0:3 we got more rows than when used iloc

	Year	Sport
0	1924	Skating
1	1924	Skating
2	1924	Skating
3	1924	Bobsleigh

Yeah we can, but there is just one catch. As i said at the start loc works with labels.

Labels are the column names and row names (or Indexes). Let’s see what that means.

df.columns #Column names

Index(['Year', 'City', 'Sport', 'Discipline', 'NOC', 'Event', 'Event gender',
       'Medal'],
      dtype='object')

df.index #Row names

RangeIndex(start=0, stop=2311, step=1)

There is generally no problem, when we’re subsetting columns labels using loc. The problem is with row labels. Although in most of the cases the row labels (Index) are just a numbers from 0 to #rows, in some cases they are not.

To check this we’will take a subset of data and change their row labels and see how that effects our results.

df1 = df.iloc[0:5]
df1.shape

(5, 8)

df1.index

RangeIndex(start=0, stop=5, step=1)

df1.loc[0:3,]

	Year	City	Sport	Discipline	NOC	Event	Event gender	Medal
0	1924	Chamonix	Skating	Figure skating	AUT	individual	M	Silver
1	1924	Chamonix	Skating	Figure skating	AUT	individual	W	Gold
2	1924	Chamonix	Skating	Figure skating	AUT	pairs	X	Gold
3	1924	Chamonix	Bobsleigh	Bobsleigh	BEL	four-man	M	Bronze

Let’s change the index and run the same command.

df1.index = [0,33,11,44,3]
df1.index

Int64Index([0, 33, 11, 44, 3], dtype='int64')

df1.loc[0:3,]

	Year	City	Sport	Discipline	NOC	Event	Event gender	Medal
0	1924	Chamonix	Skating	Figure skating	AUT	individual	M	Silver
33	1924	Chamonix	Skating	Figure skating	AUT	individual	W	Gold
11	1924	Chamonix	Skating	Figure skating	AUT	pairs	X	Gold
44	1924	Chamonix	Bobsleigh	Bobsleigh	BEL	four-man	M	Bronze
3	1924	Chamonix	Ice Hockey	Ice Hockey	CAN	ice hockey	M	Gold

As we can see in the output, it is subsetting the data till it finds the corresponding label. i.e. after we changed the row labels, now in the new data the label 3 occurs in the 5th row. So, it prints 5 rows

One Shot

df.loc[0] One Row

df.loc[[1,3,4]] Selective Rows

df.loc[0:3] Consecutive Rows

df['Year'] One Column

df[['Year','Sport']] Selective Columns

df.loc[,'Year':'Sport'] Consecutive Columns

df.iloc[0:3][['Year','Sport']] Rows and columns

Till now we’ve done the subsetting using index and names. Now let’s see how to do conditional subsetting.

From our data, we will subset the data with City Chamonix

df_sub = df[df['City']=='Chamonix']
df_sub.head()

	Year	City	Sport	Discipline	NOC	Event	Event gender	Medal
0	1924	Chamonix	Skating	Figure skating	AUT	individual	M	Silver
1	1924	Chamonix	Skating	Figure skating	AUT	individual	W	Gold
2	1924	Chamonix	Skating	Figure skating	AUT	pairs	X	Gold
3	1924	Chamonix	Bobsleigh	Bobsleigh	BEL	four-man	M	Bronze
4	1924	Chamonix	Ice Hockey	Ice Hockey	CAN	ice hockey	M	Gold

If you want to choose specific columns from this data, you can either do like this

df_sub = df[df['City']=='Chamonix'][['Year','Sport']]
df_sub.head()

	Year	Sport
0	1924	Skating
1	1924	Skating
2	1924	Skating
3	1924	Bobsleigh
4	1924	Ice Hockey

OR like this

df_sub = df.loc[df['City']=='Chamonix',['Year','Sport']]
df_sub.head()

	Year	Sport
0	1924	Skating
1	1924	Skating
2	1924	Skating
3	1924	Bobsleigh
4	1924	Ice Hockey

I think you got the gist of subsetting .

We will be continuing our learning in the next part, which will be ready soon

Learn Python in 30 minutes for Data scientists - Part 1

This blog will help you start working on data using Python as quickly as possible. It is also useful for a quick refresh

Let’s start

Step 0: Install Anaconda

Libraries

Read data and explore

Subsetting & Indexing