2 Introduction to Data Analysis and Visualization

Python provides very intuitive and powerful tools for data analysis. In this lesson, we’ll learn the concepts and see how to apply them to get insights from data.

# show graphs inline in the notebook, with some config to make them look better

from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (8,4)
plt.style.use('ggplot')

%matplotlib inline

2.1 Numpy

Numpy is the numerical computation library in Python, and it is the basis for most scientific computing and data science tools in Python. It is not only a building block: many of these tools also inherit the numpy API for vector operations and data selection. Understanding numpy well helps us use those tools better.

While you may not deal with numpy arrays directly, this knowledge will be useful in working with most data analysis tools in Python.

In this section, we’ll learn about the following things about numpy arrays:

creation of multi-dimensional arrays
reshaping them
vector operations
indexing

import numpy as np

Numpy is a library that provides a multi-dimensional array interface with a very elegant API.

Let’s start by creating a simple 1-dimensional array. Unlike lists in Python, all the elements of a numpy array are of the same type.

x = np.array([1, 2, 3, 4, 5])

array([1, 2, 3, 4, 5])

x.shape

(5,)

x.dtype

dtype('int64')

Every array in numpy has a dtype and a shape. The dtype indicates the datatype of each element in the array and shape indicates the length of the array in each dimension as a tuple.
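Because every element shares one dtype, numpy converts mixed inputs to a common type when it creates an array. The following is a small sketch of that behavior; the variable names are only for illustration.

```python
import numpy as np

# Mixing ints and floats: everything is converted to float64
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)             # float64

# shape has one entry per dimension; ndim is the number of dimensions
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape, grid.ndim)   # (2, 3) 2

# astype returns a new array converted to the requested dtype
# (float to int conversion drops the fractional part)
as_int = mixed.astype(np.int64)
print(as_int.tolist())         # [1, 2, 3]
```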

x = np.array([0.1, 0.2, 0.3])

x.dtype

dtype('float64')

# create a float64 array with given numbers
x = np.array([1, 2, 3, 4], dtype=np.float64)

array([1., 2., 3., 4.])

x.dtype

dtype('float64')

We can create a two-dimensional array as well.

d = np.array([
    [1, 2, 3],
    [4, 5, 6]])

array([[1, 2, 3],
       [4, 5, 6]])

d.shape

(2, 3)

d.dtype

dtype('int64')

2.1.1 Utilities for creating arrays

Numpy provides some utilities for creating arrays.

Create an array of zeros.

np.zeros(4)

array([0., 0., 0., 0.])

np.zeros((2, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]])

# create a 1-d array of size 6 and reshape it into a 2-d array of size 2x3
np.zeros(6).reshape(2, 3)

array([[0., 0., 0.],
       [0., 0., 0.]])

# create a 1-d array of size 24 and reshape it into a 3-d array of size 2x3x4
np.zeros(24).reshape(2, 3, 4)

array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]])

You can also create an array of ones.

np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

x = np.ones(10)

x.dtype

dtype('float64')

x = np.ones(10, dtype=np.int8)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)

x.dtype

dtype('int8')

We can also create a range of numbers using np.arange, which works like range but returns a numpy array and also accepts non-integer steps.

np.arange(6)

array([0, 1, 2, 3, 4, 5])

np.arange(1, 2, 0.1)

array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])

np.arange(1, 2, 0.1).reshape(2, 5)

array([[1. , 1.1, 1.2, 1.3, 1.4],
       [1.5, 1.6, 1.7, 1.8, 1.9]])

2.1.1.1 Problem: Create a 3-d numpy array

Create a numpy array of shape (5, 4, 3) with all zeros.

2.1.2 Vector operations

The most interesting part of numpy arrays is the vector operations.

x = np.arange(1, 5)

array([1, 2, 3, 4])

When we use arithmetic operations on numpy arrays, those operations work on each element.

x + 10

array([11, 12, 13, 14])

x * 2

array([2, 4, 6, 8])

# python lists work differently
[1, 2, 3, 4] * 2

[1, 2, 3, 4, 1, 2, 3, 4]

x * x

array([ 1,  4,  9, 16])

x ** x

array([  1,   4,  27, 256])

2 * x / 3

array([0.66666667, 1.33333333, 2.        , 2.66666667])

How do we compute the sum of squares of all numbers below one million?

x = np.arange(1000000)
np.sum(x*x)

333332833333500000
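As a sanity check, the sum of squares of 0, 1, ..., n-1 has the closed form (n-1)n(2n-1)/6. The sketch below compares that formula with the numpy result. It requests dtype=np.int64 explicitly, on the assumption that the default integer type on some platforms may be 32-bit, where x*x would overflow without any warning.

```python
import numpy as np

n = 1_000_000

# Request 64-bit integers explicitly so the squares cannot overflow
x = np.arange(n, dtype=np.int64)
total = int(np.sum(x * x))

# Closed form for 0^2 + 1^2 + ... + (n-1)^2, using Python's exact integers
expected = (n - 1) * n * (2 * n - 1) // 6

print(total == expected)   # True
```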

Numpy is fast because its core computation engine is written in C.

%%timeit
x = np.arange(1000000)
np.sum(x*x)

3.95 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Compare this with a plain Python implementation.

%%timeit
x = range(1000000)
sum([i * i for i in x])

115 ms ± 5.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Notice that numpy is about 30X faster than the pure python version in this case.

`%%timeit` is an IPython cell magic, available in Jupyter notebooks, that measures the time taken to execute a block of code.

It executes the block of code multiple times and reports the mean time along with standard deviation across runs.
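Outside a notebook, the standard library's timeit module can make a similar measurement. This is a rough sketch: the two function names are made up for illustration, and the absolute timings depend on the machine.

```python
import timeit
import numpy as np

N = 1_000_000

def sum_squares_numpy():
    x = np.arange(N, dtype=np.int64)
    return int(np.sum(x * x))

def sum_squares_python():
    return sum(i * i for i in range(N))

# Run each function 3 times and report the average time per run
runs = 3
t_numpy = timeit.timeit(sum_squares_numpy, number=runs) / runs
t_python = timeit.timeit(sum_squares_python, number=runs) / runs

print(f"numpy:  {t_numpy * 1000:.2f} ms per run")
print(f"python: {t_python * 1000:.2f} ms per run")
```

Unlike `%%timeit`, this version does not pick the number of runs automatically, so `runs` has to be chosen by hand.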

2.1.2.1 Example: Computing Euclidean Distance

The Euclidean distance between two vectors is defined as:

\(E(p,q) = \sqrt{\Sigma_{i=1}^{n}{(p_{i}-q_{i})^2}}\)

Write a function euclidian_distance to compute the Euclidean distance between two vectors specified as numpy arrays.

def euclidian_distance(p, q):
    d = p-q
    total = np.sum(d*d)
    return np.sqrt(total)

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 5.0, 6.0])

np.sqrt(np.sum((p-q)**2))

5.196152422706632

euclidian_distance(p, q)

5.196152422706632

euclidian_distance(p, p)

0.0
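The same value can be cross-checked with np.linalg.norm, which for a 1-d input returns the length of the vector (the 2-norm) by default. A short sketch:

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 5.0, 6.0])

# The Euclidean distance is the length of the difference vector
by_formula = np.sqrt(np.sum((p - q) ** 2))
by_norm = np.linalg.norm(p - q)

print(np.isclose(by_formula, by_norm))   # True
```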

2.1.2.2 Problem: Manhattan Distance

Write a function manhattan_distance to compute the Manhattan distance between two vectors.

The Manhattan distance is defined as:

\(M(p,q) = \Sigma_{i=1}^{n}{|p_{i}-q_{i}|}\)

For more info see: https://en.wikipedia.org/wiki/Taxicab_geometry

>>> manhattan_distance(np.array([0,0]), np.array([3, 4]))
7

Hint: See numpy.abs.

x = np.array([-1.0, 0.5, -0.5])

array([-1. ,  0.5, -0.5])

np.abs(x)

array([1. , 0.5, 0.5])

def manhattan_distance(p, q):
    ...

p = np.array([0, 0, 0, 0])
q = np.array([1, -2, 3, 4])

manhattan_distance(p, q) # 10

2.1.3 Indexing and Slicing

Numpy provides interesting ways to select individual elements and parts of the array to enable operations on a subset of an array.

In the following examples, we are going to use the variable x for a 1-d array and the variable d for a 2-d array.

x = np.arange(1, 5, 0.5, dtype=np.float64)

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

d = x.reshape(2, 4)

array([[1. , 1.5, 2. , 2.5],
       [3. , 3.5, 4. , 4.5]])

We can access elements from a 1-d array just like a list.

x[0]

1.0

x[1]

1.5

When dealing with multi-dimensional arrays, we can specify a value for each dimension.

array([[1. , 1.5, 2. , 2.5],
       [3. , 3.5, 4. , 4.5]])

d[0, 0]

1.0

d[0, 1]

1.5

d[1, 3]

4.5

We can also slice numpy arrays.

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

x[:4]

array([1. , 1.5, 2. , 2.5])

x[4:]

array([3. , 3.5, 4. , 4.5])

The same rule applies to multi-dimensional arrays too, except that we need to specify a slice for both dimensions.

d = np.arange(0, 12, 0.5).reshape(4, 6)

array([[ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5],
       [ 3. ,  3.5,  4. ,  4.5,  5. ,  5.5],
       [ 6. ,  6.5,  7. ,  7.5,  8. ,  8.5],
       [ 9. ,  9.5, 10. , 10.5, 11. , 11.5]])

# top left corner
d[:2, :3]

array([[0. , 0.5, 1. ],
       [3. , 3.5, 4. ]])

# bottom right corner?
d[2:, 3:]

array([[ 7.5,  8. ,  8.5],
       [10.5, 11. , 11.5]])

# Row 0
d[0, :]

array([0. , 0.5, 1. , 1.5, 2. , 2.5])

# Row 0
d[0]

array([0. , 0.5, 1. , 1.5, 2. , 2.5])

# Column 0
d[:, 0]

array([0., 3., 6., 9.])
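One detail worth knowing: a basic slice of a numpy array is a view on the same data, not a copy, so writing through the slice changes the original array. A small sketch, with variable names made up for illustration:

```python
import numpy as np

d = np.arange(0, 12, 0.5).reshape(4, 6)

# A basic slice shares memory with the original array
corner = d[:2, :3]
corner[0, 0] = 100.0
print(d[0, 0])      # 100.0

# Use .copy() to get independent data
safe = d[:2, :3].copy()
safe[0, 0] = -1.0
print(d[0, 0])      # 100.0 (the edit to the copy did not reach d)
```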

Another interesting feature of numpy is boolean indexing.

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

x > 2

array([False, False, False,  True,  True,  True,  True,  True])

When we evaluate a boolean expression on an array, it is also considered as a vector operation and we get an array back.

The interesting thing is we can use that to select only the elements where there is True in the index.

For example, the following expression returns all the elements which are greater than 2.

x[x > 2]

array([2.5, 3. , 3.5, 4. , 4.5])

We could also use this to operate on just those elements. For example, say we want to double all the numbers that are greater than 2.

x[x > 2] *= 2

array([1. , 1.5, 2. , 5. , 6. , 7. , 8. , 9. ])
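Conditions can be combined with & (and), | (or) and ~ (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The sketch below also shows np.where, which builds a new array instead of modifying the existing one.

```python
import numpy as np

x = np.arange(1, 5, 0.5)

# Elements strictly between 2 and 4
mask = (x > 2) & (x < 4)
print(x[mask].tolist())      # [2.5, 3.0, 3.5]

# np.where leaves x untouched: values are doubled where the
# condition holds and copied unchanged elsewhere
doubled = np.where(x > 2, x * 2, x)
print(doubled.tolist())      # [1.0, 1.5, 2.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```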

It is boring and confusing to understand indexing just by staring at numbers. Let’s take an example of a gray-scale image and see how these operations affect it.

2.1.4 Example: Gray Scale Image

We’ll use a sample image from scipy.

We are going to use matplotlib to display the image.

import matplotlib.pyplot as plt
import scipy.datasets

face = scipy.datasets.face(gray=True)

face

array([[114, 130, 145, ..., 119, 129, 137],
       [ 83, 104, 123, ..., 118, 134, 146],
       [ 68,  88, 109, ..., 119, 134, 145],
       ...,
       [ 98, 103, 116, ..., 144, 143, 143],
       [ 94, 104, 120, ..., 143, 142, 142],
       [ 94, 106, 119, ..., 142, 141, 140]], dtype=uint8)

face.shape

(768, 1024)

The shape is approximately 800x1000. We’ll use these numbers as an approximation to make our computation easy.

# function to show an image using matplotlib
def show(img):
    plt.imshow(img, cmap=plt.cm.gray)

show(face)

How to negate the image?

Each value in the array is between 0 and 255. What would we get if we try 255 - face?

show(255 - face)
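A caveat when doing arithmetic on images: the array is uint8, so results are also stored in 8 bits and wrap around past 255. 255 - face is safe because the result always stays within 0 to 255, but adding to a pixel value may not be. The sketch below uses a tiny made-up array rather than the real image.

```python
import numpy as np

pixels = np.array([0, 100, 200, 250], dtype=np.uint8)

# Negation stays within 0..255, so it is safe in uint8
print((255 - pixels).tolist())    # [255, 155, 55, 5]

# Brightening by 100 overflows for the larger values and wraps around
print((pixels + 100).tolist())    # [100, 200, 44, 94]

# One way around it: compute in a wider type, clip, then convert back
brighter = np.clip(pixels.astype(np.int16) + 100, 0, 255).astype(np.uint8)
print(brighter.tolist())          # [100, 200, 255, 255]
```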

How to get the top-half of the face?

show(face[:400, :]) # a bit more than half, just lazy to compute height/2

How to get the bottom half?

show(face[400:, :])

How to get the left half?

How to get the right half?

Skip 100 pixels on all sides

show(face[100:-100, 100:-100])

How to flip the image horizontally (mirror left to right)?

show(face[:, ::-1])

And flip vertically (upside down)…

show(face[::-1, :])
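To see which axis each slice reverses, it helps to try them on a tiny array. Reversing the second axis mirrors the columns (the same as np.fliplr), while reversing the first axis turns the rows upside down (the same as np.flipud). A sketch:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Reverse the columns: left and right swap
print(a[:, ::-1].tolist())    # [[3, 2, 1], [6, 5, 4]]

# Reverse the rows: top and bottom swap
print(a[::-1, :].tolist())    # [[4, 5, 6], [1, 2, 3]]
```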

Add 10px border

We’ll just replace the 10 pixels on all sides with black.

face1 = face.copy() # make a copy because we are modifying it

face1[:10, :] = 0
face1[-10:, :] = 0
face1[:, :10] = 0
face1[:, -10:] = 0

show(face1)

Can you try adding a 10px black outer border and another 10px white inner border?

Make the image sharp

Turn all pixel values less than 200 to 0, so that only the brightest areas remain.

face1 = face.copy()

face1[face1 < 200] = 0

show(face1)

2.2 Pandas

Pandas is a library for working with tabular data in Python. In other words, Pandas is a spreadsheet tool for hackers.

Pandas mainly has two classes Series and DataFrame. A DataFrame represents a tabular dataset and Series represents a column.
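To make the two classes concrete, here is a sketch that builds a tiny DataFrame by hand, using three rows from the sample dataset used in this section, and pulls one column out as a Series.

```python
import pandas as pd

# A DataFrame built from a dict of columns
small = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "region": ["Asia", "Europe", "Africa"],
    "GDPperCapita": [2848, 863, 1531],
})

# Selecting a single column gives a Series
gdp = small["GDPperCapita"]

print(type(small).__name__)   # DataFrame
print(type(gdp).__name__)     # Series
print(small.shape)            # (3, 3)
```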

import pandas as pd

Let’s start with a sample dataset.

df = pd.read_csv("shared/un-min.csv")

Let’s see the first few rows of the dataframe.

df.head()

	country	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
0	Afghanistan	Asia	45.0	46.0	154	2848
1	Albania	Europe	68.0	74.0	32	863
2	Algeria	Africa	67.5	70.3	44	1531
3	Angola	Africa	44.9	48.1	124	355
4	Argentina	America	69.6	76.8	22	8055

len(df)

df.columns

Index(['country', 'region', 'lifeMale', 'lifeFemale', 'infantMortality',
       'GDPperCapita'],
      dtype='object')

df.dtypes

country             object
region              object
lifeMale           float64
lifeFemale         float64
infantMortality      int64
GDPperCapita         int64
dtype: object

Pandas automatically infers the datatype of each column when reading a CSV file.

All the categorical columns with string values will be of type object.

df.describe()

	lifeMale	lifeFemale	infantMortality	GDPperCapita
count	188.000000	188.000000	188.000000	188.000000
mean	63.526064	68.309043	44.308511	5890.595745
std	9.820235	11.085095	38.896964	8917.273130
min	36.000000	39.100000	3.000000	36.000000
25%	57.275000	58.625000	12.000000	426.500000
50%	66.500000	71.950000	30.500000	1654.500000
75%	70.675000	76.250000	71.250000	6730.500000
max	77.400000	82.900000	169.000000	42416.000000

2.2.1 Accessing individual columns

Columns can be accessed using . or []. The . notation works only when the column name doesn’t have space or other special characters.

df.infantMortality

0      154
1       32
2       44
3      124
4       22
      ... 
183     37
184     80
185     19
186    103
187     68
Name: infantMortality, Length: 188, dtype: int64

df["infantMortality"]
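The difference between the two notations matters as soon as a column name is not a valid Python identifier. A sketch with a made-up column name that contains a space:

```python
import pandas as pd

# "life male" is a hypothetical column name used only for illustration
t = pd.DataFrame({"life male": [45.0, 68.0], "region": ["Asia", "Europe"]})

# Bracket notation works for any column name
print(t["life male"].tolist())            # [45.0, 68.0]

# Dot notation only works for names that are valid identifiers
print(t.region.tolist())                  # ['Asia', 'Europe']

# A list of names selects several columns and returns a DataFrame
print(t[["region", "life male"]].shape)   # (2, 2)
```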

The columns are like numpy arrays and we can do vector operations on them.

For example, we can compute the average life expectancy across males and females by taking the mean of the two columns.

(df.lifeMale + df.lifeFemale)/2

0      45.50
1      71.00
2      68.90
3      46.50
4      73.20
       ...  
183    67.25
184    57.90
185    72.55
186    42.95
187    48.50
Length: 188, dtype: float64

How many regions are there?

df.region.head()

0       Asia
1     Europe
2     Africa
3     Africa
4    America
Name: region, dtype: object

df.region.unique()

array(['Asia', 'Europe', 'Africa', 'America', 'Oceania'], dtype=object)

df.region.nunique()

df.region.value_counts()

Africa     53
Asia       46
Europe     40
America    35
Oceania    14
Name: region, dtype: int64

2.2.2 Index

Pandas supports having a row index, and it is handy to set one when there is a column with a unique value for each row.

df.set_index("country", inplace=True)

df.head()

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Afghanistan	Asia	45.0	46.0	154	2848
Albania	Europe	68.0	74.0	32	863
Algeria	Africa	67.5	70.3	44	1531
Angola	Africa	44.9	48.1	124	355
Argentina	America	69.6	76.8	22	8055

Most methods on dataframes return a new dataframe instead of modifying the original. Passing `inplace=True` changes that behavior and updates the dataframe in place.

Instead of using `inplace=True`, we could also do the following, but that would be more confusing.

    df = df.set_index("country")
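The difference between the two styles can be seen on a small frame. This sketch uses two rows from the dataset. Without inplace=True, set_index leaves the original untouched and returns a new DataFrame.

```python
import pandas as pd

t = pd.DataFrame({"country": ["Albania", "Algeria"],
                  "GDPperCapita": [863, 1531]})

# Returns a new DataFrame; t itself still has the default 0, 1 index
indexed = t.set_index("country")
print(list(t.columns))         # ['country', 'GDPperCapita']
print(list(indexed.columns))   # ['GDPperCapita']

# With an index in place, rows can be looked up by label
print(indexed.loc["Albania", "GDPperCapita"])   # 863
```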

We can reset the index by calling reset_index method. Again, we need to pass inplace=True if we want to modify the dataframe.

df.reset_index(inplace=True)

df.head()

Let’s put the index back for the rest of our exploration.

df.set_index("country", inplace=True)

df.head()

We can access the index using df.index.

df.index

Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
       ...
       'United.States', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
       'Viet.Nam', 'Yemen', 'Yugoslavia', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=188)

df.index[:5]

Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina'], dtype='object', name='country')

2.2.3 Visualization

“A picture is worth a thousand words”

How is the wealth distributed across the world?

df.hist("GDPperCapita")

array([[<Axes: title={'center': 'GDPperCapita'}>]], dtype=object)

# we can also call histogram on the column
df.GDPperCapita.hist()

df.boxplot("GDPperCapita")

<Axes: >

Does wealth correlate with health?

df.plot(kind="scatter", x="GDPperCapita", y="infantMortality")

<Axes: xlabel='GDPperCapita', ylabel='infantMortality'>
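Besides the plot, Series.corr gives a single number (the Pearson correlation coefficient by default) for how strongly two columns move together. The sketch below uses only the five rows shown in the df.head() output, so the value it prints says nothing about the full dataset. With the real data, the same call would be made on the columns of df.

```python
import pandas as pd

# The first five rows of the dataset, copied from the df.head() output
sample = pd.DataFrame({
    "GDPperCapita": [2848, 863, 1531, 355, 8055],
    "infantMortality": [154, 32, 44, 124, 22],
})

# Pearson correlation coefficient, always between -1 and 1
r = sample["GDPperCapita"].corr(sample["infantMortality"])
print(round(r, 2))
```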

2.2.3.1 Problem: Plot lifeMale vs. lifeFemale

2.2.4 Sorting values

A lot of times we want to order the data by a column to see the top few values.

What are the top 10 richest countries?

df.sort_values("GDPperCapita")

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Sudan	Africa	53.6	56.4	71	36
Mozambique	Africa	45.5	48.4	110	77
Ethiopia	Africa	48.4	51.6	107	96
Eritrea	Africa	49.1	52.1	98	96
Dem.Rep.of.the.Congo	Africa	51.3	54.5	89	117
...	...	...	...	...	...
Denmark	Europe	73.0	78.3	7	33191
Norway	Europe	74.8	80.6	5	33734
Luxembourg	Europe	73.1	79.7	6	35109
Japan	Asia	76.9	82.9	4	41718
Switzerland	Europe	75.3	81.8	5	42416

188 rows × 5 columns

df.sort_values("GDPperCapita", ascending=False).head()

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Switzerland	Europe	75.3	81.8	5	42416
Japan	Asia	76.9	82.9	4	41718
Luxembourg	Europe	73.1	79.7	6	35109
Norway	Europe	74.8	80.6	5	33734
Denmark	Europe	73.0	78.3	7	33191

df.sort_values("GDPperCapita", ascending=False).head().index

Index(['Switzerland', 'Japan', 'Luxembourg', 'Norway', 'Denmark'], dtype='object', name='country')
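For top-N questions like this, pandas also has nlargest, which avoids sorting by hand. A sketch on a few rows taken from the outputs above:

```python
import pandas as pd

t = pd.DataFrame(
    {"GDPperCapita": [36, 42416, 77, 41718, 35109]},
    index=["Sudan", "Switzerland", "Mozambique", "Japan", "Luxembourg"],
)

# Equivalent here to sort_values("GDPperCapita", ascending=False).head(3)
top3 = t.nlargest(3, "GDPperCapita")
print(list(top3.index))    # ['Switzerland', 'Japan', 'Luxembourg']
```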

(df.GDPperCapita
 .sort_values(ascending=False)
 .head(20)
 .plot(kind="bar"))

<Axes: xlabel='country'>

2.2.4.1 Problem: What are the 10 poorest countries?

2.2.4.2 Problem: What are the 10 countries with the highest infantMortality?

2.2.5 Group By

Group By is a powerful tool to summarize values on a column.

In this example we have data by country, what if we want to summarize the data by the region?

df.head()

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Afghanistan	Asia	45.0	46.0	154	2848
Albania	Europe	68.0	74.0	32	863
Algeria	Africa	67.5	70.3	44	1531
Angola	Africa	44.9	48.1	124	355
Argentina	America	69.6	76.8	22	8055

df.groupby("region").mean()

	lifeMale	lifeFemale	infantMortality	GDPperCapita
region
Africa	52.052830	55.286792	86.320755	1217.641509
America	69.082857	74.474286	26.657143	5080.085714
Asia	65.373913	69.439130	43.782609	5453.195652
Europe	70.362500	77.545000	11.575000	12860.050000
Oceania	67.464286	72.092857	24.642857	7131.785714

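Notice that the column we grouped by has become the index. In older pandas versions, non-numeric columns are silently left out of the result. From pandas 2.0 onwards they raise an error instead, so pass numeric_only=True to mean() when the dataframe has other string columns.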
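mean is only one of the possible summaries. Selecting a column before aggregating makes the intent explicit, and agg can compute several statistics at once. A sketch on six rows taken from the outputs in this section:

```python
import pandas as pd

t = pd.DataFrame({
    "region": ["Asia", "Europe", "Africa", "Africa", "America", "Europe"],
    "infantMortality": [154, 32, 44, 124, 22, 7],
    "GDPperCapita": [2848, 863, 1531, 355, 8055, 33191],
})

# Select the column first, then compute several statistics per region
by_region = t.groupby("region")["GDPperCapita"].agg(["mean", "max", "count"])

print(by_region.loc["Africa", "mean"])    # 943.0
print(by_region.loc["Europe", "max"])     # 33191
```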

df_region = df.groupby("region").mean()

df_region.GDPperCapita.plot(kind="bar")

<Axes: xlabel='region'>

2.2.6 Selecting Rows

A lot of times, we want to work on a subset of the dataset to drill down for specific insights.

The vector operations that we learnt in numpy come in handy here.

df_africa = df[df.region == "Africa"]

df_africa.head()

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Algeria	Africa	67.5	70.3	44	1531
Angola	Africa	44.9	48.1	124	355
Benin	Africa	52.4	57.2	84	391
Botswana	Africa	48.9	51.7	56	3640
Burkina.Faso	Africa	45.1	47.0	97	165

df_africa.plot(kind="scatter", x="GDPperCapita", y="infantMortality")

<Axes: xlabel='GDPperCapita', ylabel='infantMortality'>

Let’s select poor countries.

df.GDPperCapita.hist()

<Axes: >

df_poor = df[df.GDPperCapita < 1000]

df_poor.head()

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Albania	Europe	68.0	74.0	32	863
Angola	Africa	44.9	48.1	124	355
Armenia	Europe	67.2	74.0	25	354
Azerbaijan	Asia	66.5	74.5	33	321
Bangladesh	Asia	58.1	58.2	78	280

len(df_poor)

len(df)

Which regions have more poor countries?

df_poor.region.value_counts()

Africa     39
Asia       20
Europe      6
America     6
Oceania     2
Name: region, dtype: int64

2.2.6.1 Problem: Find the poor countries in Europe

Find the countries in Europe that have GDPperCapita less than 1000.

df[(df.GDPperCapita < 1000) & (df.region == "Europe")]

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Albania	Europe	68.0	74.0	32	863
Armenia	Europe	67.2	74.0	25	354
Belarus	Europe	64.4	74.8	15	994
Bosnia	Europe	70.5	75.9	13	271
Moldova	Europe	63.5	71.5	26	383
Ukraine	Europe	63.6	74.0	18	694
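Each condition has to be wrapped in parentheses, because & binds more tightly than < and ==. DataFrame.query is an alternative that takes the condition as a string. A sketch on four rows from the outputs above:

```python
import pandas as pd

t = pd.DataFrame(
    {"region": ["Europe", "Africa", "Europe", "Europe"],
     "GDPperCapita": [863, 355, 33191, 694]},
    index=["Albania", "Angola", "Denmark", "Ukraine"],
)

# Boolean masks combined with & (and); | would mean "or"
masked = t[(t.GDPperCapita < 1000) & (t.region == "Europe")]

# The same selection written as a query string
queried = t.query("GDPperCapita < 1000 and region == 'Europe'")

print(list(masked.index))       # ['Albania', 'Ukraine']
print(masked.equals(queried))   # True
```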

2.2.6.2 Drilling down Further

Let’s look at the wealth vs health graph again.

df.plot(kind="scatter", x="GDPperCapita", y="infantMortality")

Which are the countries that are rich, but not doing well on health?

df[(df.GDPperCapita > 1000) & (df.infantMortality > 100)]

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Afghanistan	Asia	45.0	46.0	154	2848
Liberia	Africa	50.0	53.0	153	1124

Which are the countries that are not that rich, but are doing well on health?

df[(df.GDPperCapita < 500) & (df.infantMortality < 50)]

	region	lifeMale	lifeFemale	infantMortality	GDPperCapita
country
Armenia	Europe	67.2	74.0	25	354
Azerbaijan	Asia	66.5	74.5	33	321
Bosnia	Europe	70.5	75.9	13	271
Georgia	Asia	68.5	76.7	23	343
Korea.Dem.Peoples.Rep	Asia	68.9	75.1	22	271
Kyrgyzstan	Asia	63.4	71.9	39	331
Moldova	Europe	63.5	71.5	26	383
Nicaragua	America	65.8	70.6	44	464
Uzbekistan	Asia	64.3	70.7	43	435
Viet.Nam	Asia	64.9	69.6	37	270

2.2.7 Problem: Find the countries which have a large gap between lifeMale and lifeFemale

For reference look at the scatter plot of lifeMale vs. lifeFemale.

df.plot(kind="scatter", x="lifeMale", y="lifeFemale")

df.plot?