“Every second of every day, our senses bring in way too much data than we can possibly process in our brains.”
- Peter Diamandis, Founder of the X-Prize for human-AI collaboration
pandas
library that is similar to a table or datasetpd.read_csv()
function - there are parameters for different optionsimport pandas as pd
planes_df = pd.read_csv('../data/planes.csv')
with
statement and respective functions:load()
function from the json
libraryload()
function from the pickle
libraryDimension 1: We may only want to consider certain variables.
For example, we may only care about the year
and engines
variables:
We call this selecting columns/variables -- this is similar to SQL's SELECT
or R's dplyr package's select()
.
Dimension 2: We may only want to consider certain cases.
For example, we may only care about the cases where the manufacturer is Embraer.
We call this filtering or slicing -- this is similar to SQL's WHERE
or R's dplyr package's filter()
or slice()
.
And we can combine these two options to subset in both dimensions -- the year
and engines
variables where the manufacturer is Embraer:
In the previous example, we want to do two things using planes_df
:
year
and engines
variablesBut we also want to return a new DataFrame -- not just highlight certain cells.
In other words, we want to turn this:
planes_df.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
Into this:
So we really have a third need: return the resulting DataFrame so we can continue our analysis:
year
and engines
variablesRecall that the subsetting of variables/columns is called selecting variables/columns.
In a simple example, we can select a single variable using bracket subsetting notation:
planes_df['year'].head()
0 2004.0 1 1998.0 2 1999.0 3 1999.0 4 2002.0 Name: year, dtype: float64
Notice the head()
method also works on planes_df['year']
to return the first five elements.
Question
What is the data type of planes_df['year']?
This returns pandas.core.series.Series
, referred to simply as a "Series", rather than a DataFrame.
type(planes_df['year'])
pandas.core.series.Series
This is okay -- the Series is a popular data structure in Python. Recall:
list
This visual of a Series and DataFrame may be helpful:
Series can be useful for other things, and we will cover that later -- but for now, we are interested in returning a DataFrame rather than a series.
We can select a single variable and return a DataFrame by still using bracket subsetting notation, but this time we will pass a list
of variables names:
planes_df[['year']].head()
year | |
---|---|
0 | 2004.0 |
1 | 1998.0 |
2 | 1999.0 |
3 | 1999.0 |
4 | 2002.0 |
And we can see that we've returned a DataFrame:
type(planes_df[['year']].head())
pandas.core.frame.DataFrame
Question
What's another advantage of this passing a list?
Passing a list into the bracket subsetting notation allows us to select multiple variables at once:
planes_df[['year', 'engines']].head()
year | engines | |
---|---|---|
0 | 2004.0 | 2 |
1 | 1998.0 | 2 |
2 | 1999.0 | 2 |
3 | 1999.0 | 2 |
4 | 2002.0 | 2 |
In another example, assume we are interested in the model
of plane, number of seats
and engine
type:
planes_df[['model', 'seats', 'engine']].head()
model | seats | engine | |
---|---|---|---|
0 | EMB-145XR | 55 | Turbo-fan |
1 | A320-214 | 182 | Turbo-fan |
2 | A320-214 | 182 | Turbo-fan |
3 | A320-214 | 182 | Turbo-fan |
4 | EMB-145LR | 55 | Turbo-fan |
1. ______ is a common term for subsetting DataFrame variables.
2. What type of object is a DataFrame column?
3. What will be returned by the following code?
planes_df['type', 'model']
When we subset cases/records/rows we primarily use two names: slicing and filtering, but these are not the same:
Remember that all DataFrames have an Index:
planes_df.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
We can slice cases/rows using the values in the Index and bracket subsetting notation. It's common practice to use .loc
to slice cases/rows:
planes_df.loc[0:5]
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
5 | N105UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
Note
Note that since this is not "indexing", the last element is inclusive.
We can also pass a list
of Index values:
planes_df.loc[[0, 2, 4, 6, 8]]
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
6 | N107US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
8 | N109UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
We can filter cases/rows using a logical sequence equal in length to the number of rows in the DataFrame.
Continuing our example, assume we want to determine whether each case's manufacturer
is Embraer. We can use the manufacturer
Series and a logical equivalency test to find the result for each row:
planes_df['manufacturer'] == 'EMBRAER'
0 True 1 False 2 False 3 False 4 True ... 3317 False 3318 False 3319 False 3320 False 3321 False Name: manufacturer, Length: 3322, dtype: bool
We can use this resulting logical sequence to test filter cases -- rows that are True
will be returned while those that are False
will be removed:
planes_df[planes_df['manufacturer'] == 'EMBRAER'].head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
10 | N11106 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
11 | N11107 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
12 | N11109 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
This also works with .loc
:
planes_df.loc[planes_df['manufacturer'] == 'EMBRAER'].head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
10 | N11106 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
11 | N11107 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
12 | N11109 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
Any conditional test can be used to filter DataFrame rows:
planes_df.loc[planes_df['year'] > 2002].head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
15 | N11121 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
16 | N11127 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
17 | N11137 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
18 | N11140 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
And multiple conditional tests can be combined using logical operators:
planes_df.loc[(planes_df['year'] > 2002) & (planes_df['year'] < 2004)].head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
15 | N11121 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
16 | N11127 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
17 | N11137 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
18 | N11140 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
19 | N11150 | 2003.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
Note
Note that each condition is wrapped in parentheses -- this is required.
What's the difference between slicing cases and filtering cases?
Fill in the blanks to fix the following code to find planes that have more than three engines:
planes_df.loc[______['______'] > 3]
If we want to select variables and filter cases at the same time, we have a few options:
We can use what we've previously learned to select variables and filter cases in multiple steps:
planes_df_filtered = planes_df.loc[planes_df['manufacturer'] == 'EMBRAER']
planes_df_filtered_and_selected = planes_df_filtered[['year', 'engines']]
planes_df_filtered_and_selected.head()
year | engines | |
---|---|---|
0 | 2004.0 | 2 |
4 | 2002.0 | 2 |
10 | 2002.0 | 2 |
11 | 2002.0 | 2 |
12 | 2002.0 | 2 |
This is a good way to learn how to select and filter independently, and it also reads very clearly.
However, we can also do both selecting and filtering in a single step with .loc
:
planes_df.loc[planes_df['manufacturer'] == 'EMBRAER', ['year', 'engines']].head()
year | engines | |
---|---|---|
0 | 2004.0 | 2 |
4 | 2002.0 | 2 |
10 | 2002.0 | 2 |
11 | 2002.0 | 2 |
12 | 2002.0 | 2 |
This option is more succinct and also reduces programming time.
Subset planes_df
to only include planes made by Boeing and the seats
and model
variables.
Are there any questions before we move on?