To find signals in data, we must learn to reduce the noise - not just the noise that resides in the data, but also the noise that resides in us. It is nearly impossible for noisy minds to perceive anything but noise in data.
- Stephen Few, Data Visualization Consultant and Author
The most tried-and-true, mature plotting library in Python is called Matplotlib. It began with a mission of replicating Matlab's plotting functionality in Python, so if you're familiar with Matlab you may notice some syntactic similarities.
Matplotlib is traditionally imported like this:
import matplotlib.pyplot as plt
This means: import Matplotlib's pyplot submodule under the name plt.
While Matplotlib is powerful and stable, the rise of Python's use within data science led to the development of a more data scientist-friendly library, called Seaborn.
Seaborn allows the user to describe graphics using clearer and less verbose function calls, but uses Matplotlib to generate the plots.
This approach has the added benefit of allowing the user to "drop down" to Matplotlib to make fine adjustments to their plots if needed.
Seaborn is traditionally imported like this:
import seaborn as sns
Fun Fact!
Seaborn is allegedly named after West Wing character Sam Seaborn, whose full name is Samuel Norman Seaborn (S.N.S. -- the package import nickname).
So, altogether between Matplotlib and Seaborn, your importing code will usually look something like:
import matplotlib.pyplot as plt
import seaborn as sns
In practice, a good strategy is to try to make your plots with Seaborn and switch to Matplotlib only when you discover you need more flexibility than Seaborn provides. Today we'll strictly be using Seaborn, but it's important to understand Seaborn's relationship with Matplotlib to make sense of advice and code you find on the internet (e.g. Stack Overflow answers).
The first thing we need to do is import our libraries.
# DataFrame capabilities
import pandas as pd
# Visualization capabilities
import matplotlib.pyplot as plt
import seaborn as sns
Now let's import some data for plotting.
planes = pd.read_csv('../data/planes.csv')
flights = pd.read_csv('../data/flights.csv')
A very common need in data science is being able to see the distribution of a variable. Histograms, density plots, box plots, and violin plots are the most popular ways to do so.
Seaborn can create histograms and density plots using the histplot function.
sns.histplot(data=planes, x='seats')
<Axes: xlabel='seats', ylabel='Count'>
Note
Note the annoying <Axes: xlabel='seats', ylabel='Count'> printed above the previous plot. We can suppress this output by adding a ; after the plotting function call.
sns.histplot(data=planes, x='seats');
By adding an argument, kde=True, you can overlay a density estimate on the histogram. This is very useful for visualizing continuous distributions.
sns.histplot(data=planes, x='seats', kde=True);
Note
histplot, like all Seaborn plotting functions, supports a wide variety of customizations using various arguments. We won't cover those, but refer to the Seaborn docs to learn more.
Should you prefer a box plot, use the boxplot function.
sns.boxplot(data=planes, x='seats');
Or a violin plot:
sns.violinplot(data=planes, x='seats');
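These functions can also compare a distribution across groups if you pass a categorical column as x and the numeric column as y. The sketch below uses a small made-up DataFrame (toy_flights and its values are hypothetical) so that it runs on its own.

```python
import pandas as pd
import seaborn as sns

# Made-up delays (in minutes) for illustration only.
toy_flights = pd.DataFrame({
    'origin': ['EWR', 'EWR', 'EWR', 'JFK', 'JFK', 'JFK', 'LGA', 'LGA', 'LGA'],
    'dep_delay': [2, 15, -3, 40, 5, -1, 0, 7, 22],
})

# x picks the grouping column, y the numeric column to summarize.
# One box is drawn per distinct value of 'origin'.
ax = sns.boxplot(data=toy_flights, x='origin', y='dep_delay');
```

Swapping sns.boxplot for sns.violinplot with the same arguments should give the grouped violin version.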
Scatter plots are used to see the relationship between two variables. They plot one variable on the x-axis and another on the y-axis, and use points to show where the records of the data occur.
Seaborn provides the scatterplot function for making scatter plots.
Simply pass in two columns of data: the first (x) goes on the x-axis and the second (y) on the y-axis.
sns.scatterplot(data=flights, x='dep_delay', y='arr_delay');
Adding a hue argument allows you to color points differently based on a categorical variable.
sns.scatterplot(data=flights, x='dep_delay', y='arr_delay', hue='carrier');
/usr/local/anaconda3/envs/uc-python/lib/python3.11/site-packages/IPython/core/events.py:89: UserWarning: Creating legend with loc="best" can be slow with large amounts of data. func(*args, **kwargs) /usr/local/anaconda3/envs/uc-python/lib/python3.11/site-packages/IPython/core/pylabtools.py:152: UserWarning: Creating legend with loc="best" can be slow with large amounts of data. fig.canvas.print_figure(bytes_io, **kw)
Note
This plot may take a while to render when you run it -- there are a lot of flights in our data, and it takes Python a while to assign them all coordinates and colors.
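If the wait gets annoying while you are exploring, one common workaround is to plot a random sample of the rows instead of all of them. The sketch below is only an illustration: it builds a synthetic stand-in for the flights data (random numbers, hypothetical name toy_flights), and the sample size of 2,000 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the flights data (random values, illustration only).
rng = np.random.default_rng(0)
n_rows = 50_000
toy_flights = pd.DataFrame({
    'dep_delay': rng.normal(10, 30, n_rows),
    'carrier': rng.choice(['AA', 'B6', '9E'], n_rows),
})
toy_flights['arr_delay'] = toy_flights['dep_delay'] + rng.normal(0, 10, n_rows)

# Draw a random subset of rows; random_state makes the sample reproducible.
sample = toy_flights.sample(n=2_000, random_state=42)

# Same scatterplot call as before, just with far fewer points to draw.
ax = sns.scatterplot(data=sample, x='dep_delay', y='arr_delay', hue='carrier');
```

A sample usually preserves the overall shape of the relationship, but rare outliers may disappear, so it is worth plotting the full data once before drawing conclusions.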
Line plots are useful for observing the change in a value over time.
We can use Seaborn's lineplot function to see how departure delay changes throughout the day.
## You can ignore this wrangling if you want.
# Subset our data to just one day.
cond = (flights['year'] == 2013) & (flights['month'] == 1) & (flights['day'] == 1)
jan1_flights = flights[cond]
# Group by time and get the mean delay.
jan1_flights = jan1_flights.groupby('dep_time', as_index=False)['dep_delay'].mean()
jan1_flights.head()
|   | dep_time | dep_delay |
|---|---|---|
| 0 | 517.0 | 2.0 |
| 1 | 533.0 | 4.0 |
| 2 | 542.0 | 2.0 |
| 3 | 544.0 | -1.0 |
| 4 | 554.0 | -5.0 |
# Make a line plot with dep_time on the x-axis and dep_delay on the y-axis.
sns.lineplot(data=jan1_flights, x='dep_time', y='dep_delay');
Bar plots are often used to display values across several groups.
Seaborn supports them with the barplot function.
Let's plot the number of flights from each origin airport.
# Get the number of flights from each origin
flight_counts = flights.groupby('origin', as_index=False)['flight'].count()
flight_counts = flight_counts.rename(columns={'flight': 'n_flights'})
flight_counts
|   | origin | n_flights |
|---|---|---|
| 0 | EWR | 120835 |
| 1 | JFK | 111279 |
| 2 | LGA | 104662 |
sns.barplot(data=flight_counts, x='origin', y='n_flights');
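When the bar heights are simply row counts, Seaborn's countplot function can do the counting for you, so the groupby step becomes optional. Here is a minimal sketch on a tiny made-up DataFrame (toy_flights is a hypothetical name).

```python
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only: one row per flight.
toy_flights = pd.DataFrame({'origin': ['EWR', 'JFK', 'EWR', 'LGA', 'JFK', 'EWR']})

# countplot counts the rows for each value of x and draws one bar per value.
ax = sns.countplot(data=toy_flights, x='origin')

# The same counts computed with pandas, handy for checking the bar heights.
counts = toy_flights['origin'].value_counts()
```

The barplot approach is still the one to reach for when the bar height is something other than a count, such as a mean you computed yourself.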
If the hue argument is also specified, you get a grouped bar chart: the x argument defines the groups and the hue argument defines the bars within each group.
# This time, get flights by origin and carrier
flight_counts = flights.groupby(['origin', 'carrier'], as_index=False)['flight'].count()
flight_counts = flight_counts.rename(columns={'flight': 'n_flights'})
# For simplicity, let's narrow down to just a few carriers.
flight_counts = flight_counts[flight_counts['carrier'].isin(['AA', 'B6', '9E'])]
flight_counts
|   | origin | carrier | n_flights |
|---|---|---|---|
| 0 | EWR | 9E | 1268 |
| 1 | EWR | AA | 3487 |
| 3 | EWR | B6 | 6557 |
| 12 | JFK | 9E | 14651 |
| 13 | JFK | AA | 13783 |
| 14 | JFK | B6 | 42076 |
| 22 | LGA | 9E | 2541 |
| 23 | LGA | AA | 15459 |
| 24 | LGA | B6 | 6002 |
sns.barplot(data=flight_counts, x='origin', y='n_flights', hue='carrier');
Do you see a pattern in the functions to create Seaborn plots?
Single variable plots:
sns.plot_type(data=my_data, x=variable)
Multi-variable plots:
sns.plot_type(data=my_data, x=x_variable, y=y_variable)
sns.plot_type(data=my_data, x=x_variable, y=y_variable, hue=hue_variable)
The plots we've created above are functional but lack the nice touches we'd want in a graphic we were planning to share.
Seaborn plotting functions return a Matplotlib Axes object, whose .set method lets users add a custom title and override the default axis labels.
sns.barplot(data=flight_counts, x='origin', y='n_flights').set(title='Number of Flights by Origin');
Note
It's common to break a call across several lines when it becomes this long, purely for readability.
sns.barplot(
    data=flight_counts,
    x='origin',
    y='n_flights'
).set(
    title='Number of Flights by Origin',
    xlabel='Flight Origin',
    ylabel='Number of Flights'
);
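An equivalent style, which some people find easier to read, is to store the returned Axes in a variable and customize it in a second step. This sketch rebuilds the three per-origin counts shown earlier in a small DataFrame so that it runs on its own. The names origin_counts and ax are just illustrative choices.

```python
import pandas as pd
import seaborn as sns

# Per-origin flight counts, copied from the table earlier in this section.
origin_counts = pd.DataFrame({
    'origin': ['EWR', 'JFK', 'LGA'],
    'n_flights': [120835, 111279, 104662],
})

# Step 1: draw the plot and keep a reference to the Matplotlib Axes.
ax = sns.barplot(data=origin_counts, x='origin', y='n_flights')

# Step 2: customize the title and axis labels on that Axes.
ax.set(
    title='Number of Flights by Origin',
    xlabel='Flight Origin',
    ylabel='Number of Flights'
);
```

Keeping the Axes in a variable also makes it easy to apply further adjustments later without rewriting the plotting call.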
You've now seen many types of plots. Using the flights data, make 3 different graphics to explore relationships or distributions that you think might be interesting.
If you have extra time, you can try to customize your plot with axis labels, titles, and more. Take a look at the documentation using the question mark syntax we saw at the beginning of the day (e.g. sns.barplot?).
Unfortunately we don't have time to venture further into the Python data visualization ecosystem, but we want to provide you some background on other tools you are likely to hear about.
Are there questions before we move on?