import seaborn as sns
3 Advanced Visualizations using Seaborn
While matplotlib provides simple visualization charts that are easy to generate, the seaborn
library provides more sophisticated charts that are often handy for presenting complex data from bioinformatics.
3.1 Seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Seaborn groups its charts into three categories, each with its own figure-level function: relplot (relational), displot (distributions) and catplot (categorical).
From Seaborn Tutorial.
3.1.1 Palmer Penguins
In this section, we’ll use the Palmer Penguins dataset, which is available through seaborn's load_dataset function (the data is downloaded from seaborn's online data repository on first use).
Each row records a penguin's species, its island in the Palmer Archipelago, its size (flipper length, body mass, bill dimensions), and its sex.
df = sns.load_dataset("penguins")
df.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
df.shape
(344, 7)
While seaborn can ignore missing values when plotting, it is generally a good idea to clean the data by removing missing values before starting any exploration.
# number of missing values in each column
df.isna().sum()
species 0
island 0
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
dtype: int64
# drop the rows with missing values (modifies df in place)
df.dropna(inplace=True)
# number of rows and columns in the data after dropping rows with missing values
df.shape
(333, 7)
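The cleaning step above can also be tried on a tiny hand-made frame, which makes it easier to see what each call does. The sketch below is only an illustration: the `toy` frame and its values are made up and are not part of the penguins data. It also shows that `dropna()` without `inplace=True` returns a new frame, and that the `subset` argument limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame; the values are invented for illustration.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 47.5, 49.0],
    "sex": ["Male", None, "Female", None],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value.
# Without inplace=True the original frame is left untouched.
clean = toy.dropna()

# Drop rows only when the bill length is missing, keeping rows
# where just the sex is unknown.
partial = toy.dropna(subset=["bill_length_mm"])

print(missing_per_column)
print(clean.shape, partial.shape)
```

Whether to drop all incomplete rows or only those missing the columns being plotted is a judgment call; the second option keeps more data.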
3.1.2 Scatterplot
df.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
The scatter plot allows visualizing two dimensions. More dimensions can be added to a scatter plot by mapping them to color, size and style.
# scatterplot of bill length vs bill depth
sns.scatterplot(df, x="bill_length_mm", y="bill_depth_mm")
<Axes: xlabel='bill_length_mm', ylabel='bill_depth_mm'>
The scatterplot is a special kind of relplot, so we can get the same output with relplot as well.
When we specify both x and y arguments to relplot, it draws a scatter plot by default.
sns.relplot(df, x="bill_length_mm", y="bill_depth_mm")
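One practical difference between the two calls is what they return. The sketch below, which uses a synthetic stand-in frame (`toy`, with made-up values) rather than the real penguins data, suggests that `scatterplot` draws onto a matplotlib Axes and returns it, while `relplot` creates its own figure and returns a `FacetGrid`.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for two penguins columns (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, size=40),
    "bill_depth_mm": rng.normal(17, 2, size=40),
})

# Axes-level function: draws onto the Axes we pass in and returns it.
fig, ax = plt.subplots()
ax_out = sns.scatterplot(data=toy, x="bill_length_mm", y="bill_depth_mm", ax=ax)

# Figure-level function: builds its own figure and returns a FacetGrid.
g = sns.relplot(data=toy, x="bill_length_mm", y="bill_depth_mm", kind="scatter")

print(type(ax_out).__name__)
print(type(g).__name__)
```

The returned `FacetGrid` exposes the underlying Axes as `g.ax` when there is a single panel, which is how matplotlib calls can still be applied to a relplot.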
# scatterplot of bill length vs bill depth
# with color by the species
sns.scatterplot(df,
                x="bill_length_mm",
                y="bill_depth_mm",
                hue="species")
<Axes: xlabel='bill_length_mm', ylabel='bill_depth_mm'>
# scatterplot of bill length vs bill depth
# with color by the species and style by sex
sns.scatterplot(df,
                x="bill_length_mm",
                y="bill_depth_mm",
                hue="species",
                style="sex")
<Axes: xlabel='bill_length_mm', ylabel='bill_depth_mm'>
# scatterplot of bill length vs bill depth
# with color by the species, style by sex and size by body mass
sns.scatterplot(df, x="bill_length_mm", y="bill_depth_mm",
                hue="species", style="sex", size="body_mass_g")
<Axes: xlabel='bill_length_mm', ylabel='bill_depth_mm'>
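With three extra encodings on one chart, the default marker sizes can be hard to tell apart. A possible refinement, sketched here on a synthetic frame (`toy`, with made-up values standing in for the penguins columns), is to pass `sizes=(min, max)` so that body mass is spread over a wider range of marker sizes.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the penguins columns (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 20),
    "sex": np.tile(["Male", "Female"], 30),
    "bill_length_mm": rng.normal(44, 5, size=60),
    "bill_depth_mm": rng.normal(17, 2, size=60),
    "body_mass_g": rng.normal(4200, 800, size=60),
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(
    data=toy,
    x="bill_length_mm", y="bill_depth_mm",
    hue="species",        # color encodes species
    style="sex",          # marker shape encodes sex
    size="body_mass_g",   # marker size encodes body mass
    sizes=(20, 200),      # smallest and largest marker size
    ax=ax,
)
```

Encoding five variables at once is rarely easy to read; it is usually worth checking whether facets would communicate the same thing more clearly.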
Seaborn allows drawing lines on the graphs using matplotlib primitives.
g = sns.relplot(data=df, x="bill_length_mm", y="bill_depth_mm", hue='species')
# draw a line using start and end points
g.ax.axline(xy1=(30, 13), xy2=(60, 19), color="g", dashes=(5, 2))
# draw a line using start point and slope
g.ax.axline(xy1=(35, 13), slope=.6, color="r", dashes=(5, 2))
<matplotlib.lines._AxLine at 0x7f6f9e001f30>
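The two ways of calling `axline` describe the same kind of line: a pair of points fixes a slope, and a point plus that slope gives the same line back. The sketch below uses plain matplotlib, with the two points taken from the first call above, to compute the slope and draw both forms on one Axes; the axis limits are arbitrary choices for the illustration.

```python
import matplotlib.pyplot as plt

# The two points used for the green line above.
p1, p2 = (30, 13), (60, 19)

# Slope implied by the two points: rise over run.
slope = (p2[1] - p1[1]) / (p2[0] - p1[0])

fig, ax = plt.subplots()
ax.set_xlim(30, 60)   # arbitrary limits for the illustration
ax.set_ylim(12, 22)

# Same line specified in two ways; they should overlap exactly.
line_a = ax.axline(xy1=p1, xy2=p2, color="g", dashes=(5, 2))
line_b = ax.axline(xy1=p1, slope=slope, color="k", linewidth=0.5)

print(slope)
```

Unlike a line drawn with `plot`, an `axline` extends across the whole Axes regardless of the axis limits.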
3.2 Distributions
sns.displot(df, x="flipper_length_mm", kind="hist")
# axes-level equivalent:
# sns.histplot(df, x="flipper_length_mm")
sns.displot(df, x="flipper_length_mm", kind="kde")
# axes-level equivalent:
# sns.kdeplot(df, x="flipper_length_mm")
The displot function allows grouping by color using the hue argument.
sns.displot(df, x="flipper_length_mm", kind="kde", hue="species")
We can stack multiple distributions on top of each other.
sns.displot(df, x="flipper_length_mm", kind="kde",
            hue="species", multiple="stack")
We could do the same with histograms.
sns.displot(df, x="flipper_length_mm", kind="hist",
            hue="species", multiple="stack")
3.3 Categorical plots
Catplots allow visualizing categorical data. The default view is a scatter plot (a strip plot) with a small jitter added so that overlapping points stay visible.
sns.catplot(df, x="species", y="bill_length_mm")
A slightly better looking version of that is a swarm plot.
sns.catplot(df, x="species", y="bill_length_mm", kind="swarm")
We can add another dimension using hue.
sns.catplot(df, x="species", y="bill_length_mm", kind="swarm", hue="sex")
We could even flip the axes, if we want.
sns.catplot(df, x="bill_length_mm", y="species", kind="swarm", hue="sex")
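Two optional arguments are often useful with swarm plots: `order` fixes the order of the categories on the axis, and `dodge=True` separates the hue groups instead of mixing them in a single swarm. The following is a minimal sketch on a synthetic frame (`toy`, with made-up values); the particular category order is an arbitrary choice for the illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the penguins columns (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 20),
    "sex": np.tile(["Male", "Female"], 30),
    "bill_length_mm": rng.normal(44, 5, size=60),
})

g = sns.catplot(
    data=toy,
    x="species", y="bill_length_mm",
    kind="swarm",
    hue="sex",
    dodge=True,                                # one swarm per sex
    order=["Gentoo", "Chinstrap", "Adelie"],   # explicit category order
)
```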
3.4 Comparing Distributions
The boxplot and violinplot, both kinds of catplot, allow comparing distributions.
sns.boxplot(df, y='bill_length_mm')
<Axes: ylabel='bill_length_mm'>
sns.violinplot(df, y='bill_length_mm')
<Axes: ylabel='bill_length_mm'>
Both these plots allow splitting the distribution by a categorical column.
sns.violinplot(df, y='bill_length_mm', x="species")
<Axes: xlabel='species', ylabel='bill_length_mm'>
We could add another dimension using hue.
sns.violinplot(df, y='bill_length_mm', x="species", hue="sex")
<Axes: xlabel='species', ylabel='bill_length_mm'>
We could use the space better by splitting the violin when the hue variable has only two categories.
sns.violinplot(df, y='bill_length_mm', x="species", hue="sex", split=True)
<Axes: xlabel='species', ylabel='bill_length_mm'>
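The box in a box plot spans the first to third quartile, with a line at the median. As a rough numeric cross-check of what such a chart shows, the same summary can be computed with pandas. The frame below is hypothetical, with values chosen so the quartiles are easy to verify by hand.

```python
import pandas as pd

# Hypothetical measurements, five per group, chosen for easy arithmetic.
toy = pd.DataFrame({
    "species": ["Adelie"] * 5 + ["Gentoo"] * 5,
    "bill_length_mm": [36.0, 38.0, 39.0, 40.0, 42.0,
                       45.0, 46.0, 47.0, 49.0, 50.0],
})

# One row per species, one column per quantile (0.25, 0.5, 0.75).
quartiles = (
    toy.groupby("species")["bill_length_mm"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

Comparing the medians and the widths of the interquartile ranges across groups is the same comparison the eye makes between boxes.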
3.4.1 Combining multiple views on the data
The jointplot and pairplot functions plot both relationships and distributions in a single graph.
sns.jointplot(df, x="bill_length_mm", y="bill_depth_mm", height=3)
sns.jointplot(df, x="bill_length_mm", y="bill_depth_mm", hue="species", height=4)
The pairplot shows relations between all the numerical columns in a single grid.
sns.pairplot(data=df, hue="species")
3.4.2 Showing multiple charts
Seaborn allows showing a grid of charts for displaying more information.
sns.relplot(df, x="bill_length_mm", y="bill_depth_mm", col="species", height=3)
sns.relplot(df, x="bill_length_mm", y="bill_depth_mm", hue="sex",
            col="species", height=3)
sns.relplot(df, x="bill_length_mm", y="bill_depth_mm",
            col="species", row="sex", height=3)
When there are too many categories to fit in a single row, we can specify col_wrap to wrap the columns onto multiple rows.
sns.relplot(df, x="bill_length_mm", y="bill_depth_mm", hue="sex",
            col="species", col_wrap=2, height=3)
This functionality is similar to facet_wrap in R's ggplot2.
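The grid returned by `relplot` can be adjusted after it is drawn. The sketch below, on a synthetic frame (`toy`, with made-up values), wraps three facets into two columns, shortens the panel titles with `set_titles`, and looks up individual panels through `axes_dict`.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the penguins columns (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 20),
    "sex": np.tile(["Male", "Female"], 30),
    "bill_length_mm": rng.normal(44, 5, size=60),
    "bill_depth_mm": rng.normal(17, 2, size=60),
})

g = sns.relplot(
    data=toy,
    x="bill_length_mm", y="bill_depth_mm",
    hue="sex",
    col="species", col_wrap=2,   # three facets laid out two per row
    height=3,
)

# Use just the category name as each panel title.
g.set_titles(col_template="{col_name}")

# axes_dict maps each facet value to its matplotlib Axes.
for name, ax in g.axes_dict.items():
    print(name, ax.get_title())
```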
3.4.3 Multiple Charts in a grid
import matplotlib.pyplot as plt
f, axs = plt.subplots(1, 2, figsize=(8, 3))
sns.scatterplot(df, x="bill_length_mm", y="bill_depth_mm", hue="species", ax=axs[0])
sns.histplot(df, x="species", hue="species", ax=axs[1])
<Axes: xlabel='species', ylabel='Count'>