Manipulating and Creating Columns¶

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time.

- Wes McKinney, the creator of Pandas, in his book Python for Data Analysis

Applied Review¶

Data Structures and DataFrames¶

  • We use DataFrames to represent tables in Python.
  • Python also supports other data structures for storing information that isn't tabular. Examples include lists and dictionaries.
  • DataFrames have many methods, or functions that access their internal data. Some examples we saw were describe() and set_index().
  • DataFrames are composed of Series, 1-D data structures (like a vector). DataFrame columns can be thought of as Series.

Importing Data¶

  • Python can read in data from CSVs, JSON files, and pickle files with just a few lines of code.

Selecting and Filtering Data¶

  • Python's pandas library supports limiting rows (via filtering and slicing), as well as selecting columns.
  • All of these operations use the bracket operators, but row syntax includes the .loc accessor.

Calculations Using Columns¶

It's common to want to modify a column of a DataFrame, or sometimes even to create a new column. Let's take a look at our planes data again.

In [1]:
import pandas as pd

planes = pd.read_csv('../data/planes.csv')
planes.head()
Out[1]:
tailnum year type manufacturer model engines seats speed engine
0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NaN Turbo-fan
1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan
2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan
3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan
4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NaN Turbo-fan

Suppose we wanted to know the total capacity of each plane, including the crew. We have data on how many seats each plane has (in the seats column), but that only includes paying passengers.

In [2]:
seats = planes['seats']
seats.head()
Out[2]:
0     55
1    182
2    182
3    182
4     55
Name: seats, dtype: int64

For simplicity, let's say a full flight crew is always 5 people. Series objects allow us to perform addition with the regular + syntax –- in this case, seats + 5.

In [3]:
capacity = seats + 5
capacity.head()
Out[3]:
0     60
1    187
2    187
3    187
4     60
Name: seats, dtype: int64

Question

What happens if you switch the order? (i.e. 5 + seats)? Does this make sense?

So we've create a new series, capacity, with the total capacity of the plane. Right now it's totally separate from our original planes DataFrame, but we can make it a column of planes using the assignment syntax with the column reference syntax.

df['new_column_name'] = new_column_series
In [4]:
planes['capacity'] = capacity
planes.head()
Out[4]:
tailnum year type manufacturer model engines seats speed engine capacity
0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NaN Turbo-fan 60
1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan 187
2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan 187
3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan 187
4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NaN Turbo-fan 60

Note that planes now has a "capacity" column at the end.

Note

Also note that in the code above, the column name goes in quotes within the bracket syntax, while the values that will become the column -- the Series we're using -- are on the right side of the statement, without any brackets or quotes.

This sequence of operations can be expressed as a single line:

In [5]:
# Create a capacity column filled with the values in the seats column added with 5.
planes['capacity'] = planes['seats'] + 5

From a mathematical perspective, what we're doing here is adding a scalar -- a single value -- to a vector -- a series of values (aka a Series). Other vector-scalar math is supported as well.

In [6]:
# Subtraction
(planes['seats'] - 12).head()
Out[6]:
0     43
1    170
2    170
3    170
4     43
Name: seats, dtype: int64
In [7]:
# Multiplication
(planes['seats'] * 10).head()
Out[7]:
0     550
1    1820
2    1820
3    1820
4     550
Name: seats, dtype: int64
In [8]:
# Exponentiation
(planes['seats'] ** 2).head()
Out[8]:
0     3025
1    33124
2    33124
3    33124
4     3025
Name: seats, dtype: int64

Your Turn¶

Create a new column named "first_class_seats" that is 1/5 of the total seats on the plane. You will have some results with fractional seats; don't worry about this.

Overwriting Columns¶

What if we discovered a systematic error in our data? Perhaps we find out that the "engines" column is only the number of engines per wing -- so the total number of engines is actually double the value in that column.

We could create a new column, "real_engine_count" or "total_engines". But we're not going to need the original "engines" column, and leaving it could cause confusion for others looking at our data.

A better solution would be to replace the original column with the new, recalculated, values. We can do so using the same syntax as for creating a new column.

In [9]:
planes.head()
Out[9]:
tailnum year type manufacturer model engines seats speed engine capacity
0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NaN Turbo-fan 60
1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan 187
2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan 187
3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NaN Turbo-fan 187
4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NaN Turbo-fan 60
In [10]:
# Multiply the engines column by 2, and then overwrite the original data.
planes['engines'] = planes['engines'] * 2
In [11]:
planes.head()
Out[11]:
tailnum year type manufacturer model engines seats speed engine capacity
0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 4 55 NaN Turbo-fan 60
1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187
2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187
3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187
4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 4 55 NaN Turbo-fan 60

Calculating Values Based on Multiple Columns¶

So far we've only seen vector-scalar math. But vector-vector math is supported as well.

Let's look at a toy example of creating a column that contains the number of seats per engine.

In [12]:
seats_per_engine = planes['seats'] / planes['engines']
seats_per_engine.head()
Out[12]:
0    13.75
1    45.50
2    45.50
3    45.50
4    13.75
dtype: float64
In [13]:
planes['seats_per_engine'] = seats_per_engine
planes.head()
Out[13]:
tailnum year type manufacturer model engines seats speed engine capacity seats_per_engine
0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 4 55 NaN Turbo-fan 60 13.75
1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187 45.50
2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187 45.50
3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187 45.50
4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 4 55 NaN Turbo-fan 60 13.75

You can combine vector-vector and vector-scalar calculations in arbitrarily complex ways.

In [14]:
planes['nonsense'] = (planes['year'] + 12) * planes['engines'] + planes['seats'] - 9
planes.head()
Out[14]:
tailnum year type manufacturer model engines seats speed engine capacity seats_per_engine nonsense
0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 4 55 NaN Turbo-fan 60 13.75 8110.0
1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187 45.50 8213.0
2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187 45.50 8217.0
3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 4 182 NaN Turbo-fan 187 45.50 8217.0
4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR 4 55 NaN Turbo-fan 60 13.75 8102.0

Your Turn¶

Create a new column in the planes DataFrame, "technology_index", that is calculated with the formula:

technology_index = (year-1900) / 4 + engines * 2

Note

Remember order of operations!

Non-numeric Column Operations¶

For simplicity, we started with mathematical operations. However, pandas supports string operations as well.

We can use + to concatenate strings, with both vectors and scalars.

In [16]:
summary = 'Tailnum is ' + planes['tailnum'] + ' and Model is ' + planes['model']
summary.head()
Out[16]:
0    Tailnum is N10156 and Model is EMB-145XR
1     Tailnum is N102UW and Model is A320-214
2     Tailnum is N103US and Model is A320-214
3     Tailnum is N104UW and Model is A320-214
4    Tailnum is N10575 and Model is EMB-145LR
dtype: object

More complex string operations are possible using methods available through the .str accessor. We won't cover them in detail, so refer to the documentation if you're interested.

In [17]:
# Make the manufacturer field lowercase.
lowercase_mfctr = planes['manufacturer'].str.lower()
lowercase_mfctr.head()
Out[17]:
0             embraer
1    airbus industrie
2    airbus industrie
3    airbus industrie
4             embraer
Name: manufacturer, dtype: object
In [18]:
# Get the length of the tail number.
tailnum_len = planes['tailnum'].str.len()
tailnum_len.head()
Out[18]:
0    6
1    6
2    6
3    6
4    6
Name: tailnum, dtype: int64

More Complex Column Manipulation¶

Replacing Values¶

One fairly common situation in data wrangling is needing to convert one set of values to another, where there is a one-to-one correspondence between the values currently in the column and the new values that should replace them. This operation can be described as "mapping one set of values to another".

Let's look at an example of this.

In [19]:
airlines = pd.read_csv('../data/airlines.csv')
# Keep just the first 5 rows for this example.
airlines = airlines.iloc[:5]
airlines
Out[19]:
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.

Suppose we learn that there is a mistake in the carrier codes and they should be updated.

  • 9E should be PE
  • B6 should be BB
  • The other codes should stay as they are.

We can express this mapping of old values to new values using a Python dictionary.

In [20]:
# Only specify the values we want to replace; don't include the ones that should stay the same.
value_mapping = {'9E': 'PE',
                 'B6': 'BB'}

Pandas provides a handy method on Series, .replace, that accepts this value mapping and updates the Series accordingly. We can use it to create a new column, "updated_carrier", with the proper carrier code values.

In [21]:
airlines['updated_carrier'] = airlines['carrier'].replace(value_mapping)
airlines
Out[21]:
carrier name updated_carrier
0 9E Endeavor Air Inc. PE
1 AA American Airlines Inc. AA
2 AS Alaska Airlines Inc. AS
3 B6 JetBlue Airways BB
4 DL Delta Air Lines Inc. DL

If you are a SQL user, this workflow may look familiar to you; it's quite similar to a CASE WHEN statement in SQL.

Your Turn¶

  1. Open the weather CSV (path: ../data/weather.csv) and store it in a variable called weather.
  2. Take a look at the "month" column. Observe that its values are numeric, not strings. How do you think these values relate to months of the year?
  3. Create a mapping from each number to the corresponding month name, as a dictionary. For example, one of the keys would be 5 and its value would be May. Store it in a variable called month_mapping.
  4. Use the .replace method to overwrite the current month column with the month names as strings, using your newly created mapping.

The apply Method and Beyond¶

If you can think of a way to express a new column as a combination of other columns and constants, it can be created using Python and Pandas. However, column calculations beyond the above are outside the scope of this training.

If you wish to learn more, take a look at the DataFrame.apply method.

Questions¶

Are there any questions before we move on?