During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time.
- Wes McKinney, the creator of Pandas, in his book Python for Data Analysis
describe()
and set_index()
..loc
accessor.It's common to want to modify a column of a DataFrame, or sometimes even to create a new column. Let's take a look at our planes data again.
import pandas as pd
planes = pd.read_csv('../data/planes.csv')
planes.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
Suppose we wanted to know the total capacity of each plane, including the crew.
We have data on how many seats each plane has (in the seats
column), but that only includes paying passengers.
seats = planes['seats']
seats.head()
0 55 1 182 2 182 3 182 4 55 Name: seats, dtype: int64
For simplicity, let's say a full flight crew is always 5 people.
Series objects allow us to perform addition with the regular +
syntax –- in this case, seats + 5
.
capacity = seats + 5
capacity.head()
0 60 1 187 2 187 3 187 4 60 Name: seats, dtype: int64
Question
What happens if you switch the order? (i.e. 5 + seats)? Does this make sense?
So we've create a new series, capacity
, with the total capacity of the plane.
Right now it's totally separate from our original planes
DataFrame, but we can make it a column of planes
using the assignment syntax with the column reference syntax.
df['new_column_name'] = new_column_series
planes['capacity'] = capacity
planes.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | capacity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan | 60 |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan | 187 |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan | 187 |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan | 187 |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan | 60 |
Note that planes
now has a "capacity" column at the end.
Note
Also note that in the code above, the column name goes in quotes within the bracket syntax, while the values that will become the column -- the Series we're using -- are on the right side of the statement, without any brackets or quotes.
This sequence of operations can be expressed as a single line:
# Create a capacity column filled with the values in the seats column added with 5.
planes['capacity'] = planes['seats'] + 5
From a mathematical perspective, what we're doing here is adding a scalar -- a single value -- to a vector -- a series of values (aka a Series
).
Other vector-scalar math is supported as well.
# Subtraction
(planes['seats'] - 12).head()
0 43 1 170 2 170 3 170 4 43 Name: seats, dtype: int64
# Multiplication
(planes['seats'] * 10).head()
0 550 1 1820 2 1820 3 1820 4 550 Name: seats, dtype: int64
# Exponentiation
(planes['seats'] ** 2).head()
0 3025 1 33124 2 33124 3 33124 4 3025 Name: seats, dtype: int64
Create a new column named "first_class_seats" that is 1/5 of the total seats on the plane. You will have some results with fractional seats; don't worry about this.
What if we discovered a systematic error in our data? Perhaps we find out that the "engines" column is only the number of engines per wing -- so the total number of engines is actually double the value in that column.
We could create a new column, "real_engine_count" or "total_engines". But we're not going to need the original "engines" column, and leaving it could cause confusion for others looking at our data.
A better solution would be to replace the original column with the new, recalculated, values. We can do so using the same syntax as for creating a new column.
planes.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | capacity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan | 60 |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan | 187 |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan | 187 |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan | 187 |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan | 60 |
# Multiply the engines column by 2, and then overwrite the original data.
planes['engines'] = planes['engines'] * 2
planes.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | capacity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 4 | 55 | NaN | Turbo-fan | 60 |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 4 | 55 | NaN | Turbo-fan | 60 |
So far we've only seen vector-scalar math. But vector-vector math is supported as well.
Let's look at a toy example of creating a column that contains the number of seats per engine.
seats_per_engine = planes['seats'] / planes['engines']
seats_per_engine.head()
0 13.75 1 45.50 2 45.50 3 45.50 4 13.75 dtype: float64
planes['seats_per_engine'] = seats_per_engine
planes.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | capacity | seats_per_engine | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 4 | 55 | NaN | Turbo-fan | 60 | 13.75 |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 | 45.50 |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 | 45.50 |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 | 45.50 |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 4 | 55 | NaN | Turbo-fan | 60 | 13.75 |
You can combine vector-vector and vector-scalar calculations in arbitrarily complex ways.
planes['nonsense'] = (planes['year'] + 12) * planes['engines'] + planes['seats'] - 9
planes.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | capacity | seats_per_engine | nonsense | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 4 | 55 | NaN | Turbo-fan | 60 | 13.75 | 8110.0 |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 | 45.50 | 8213.0 |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 | 45.50 | 8217.0 |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 4 | 182 | NaN | Turbo-fan | 187 | 45.50 | 8217.0 |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 4 | 55 | NaN | Turbo-fan | 60 | 13.75 | 8102.0 |
Create a new column in the planes DataFrame, "technology_index", that is calculated with the formula:
technology_index = (year-1900) / 4 + engines * 2
Note
Remember order of operations!
For simplicity, we started with mathematical operations. However, pandas supports string operations as well.
We can use +
to concatenate strings, with both vectors and scalars.
summary = 'Tailnum is ' + planes['tailnum'] + ' and Model is ' + planes['model']
summary.head()
0 Tailnum is N10156 and Model is EMB-145XR 1 Tailnum is N102UW and Model is A320-214 2 Tailnum is N103US and Model is A320-214 3 Tailnum is N104UW and Model is A320-214 4 Tailnum is N10575 and Model is EMB-145LR dtype: object
More complex string operations are possible using methods available through the .str
accessor.
We won't cover them in detail, so refer to the documentation if you're interested.
# Make the manufacturer field lowercase.
lowercase_mfctr = planes['manufacturer'].str.lower()
lowercase_mfctr.head()
0 embraer 1 airbus industrie 2 airbus industrie 3 airbus industrie 4 embraer Name: manufacturer, dtype: object
# Get the length of the tail number.
tailnum_len = planes['tailnum'].str.len()
tailnum_len.head()
0 6 1 6 2 6 3 6 4 6 Name: tailnum, dtype: int64
One fairly common situation in data wrangling is needing to convert one set of values to another, where there is a one-to-one correspondence between the values currently in the column and the new values that should replace them. This operation can be described as "mapping one set of values to another".
Let's look at an example of this.
airlines = pd.read_csv('../data/airlines.csv')
# Keep just the first 5 rows for this example.
airlines = airlines.iloc[:5]
airlines
carrier | name | |
---|---|---|
0 | 9E | Endeavor Air Inc. |
1 | AA | American Airlines Inc. |
2 | AS | Alaska Airlines Inc. |
3 | B6 | JetBlue Airways |
4 | DL | Delta Air Lines Inc. |
Suppose we learn that there is a mistake in the carrier codes and they should be updated.
We can express this mapping of old values to new values using a Python dictionary.
# Only specify the values we want to replace; don't include the ones that should stay the same.
value_mapping = {'9E': 'PE',
'B6': 'BB'}
Pandas provides a handy method on Series, .replace
, that accepts this value mapping and updates the Series accordingly.
We can use it to create a new column, "updated_carrier", with the proper carrier code values.
airlines['updated_carrier'] = airlines['carrier'].replace(value_mapping)
airlines
carrier | name | updated_carrier | |
---|---|---|---|
0 | 9E | Endeavor Air Inc. | PE |
1 | AA | American Airlines Inc. | AA |
2 | AS | Alaska Airlines Inc. | AS |
3 | B6 | JetBlue Airways | BB |
4 | DL | Delta Air Lines Inc. | DL |
If you are a SQL user, this workflow may look familiar to you;
it's quite similar to a CASE WHEN
statement in SQL.
../data/weather.csv
) and store it in a variable called weather
.5
and its value would be May
. Store it in a variable called month_mapping
..replace
method to overwrite the current month column with the month names as strings, using your newly created mapping.apply
Method and Beyond¶If you can think of a way to express a new column as a combination of other columns and constants, it can be created using Python and Pandas. However, column calculations beyond the above are outside the scope of this training.
If you wish to learn more, take a look at the DataFrame.apply
method.
Are there any questions before we move on?