Advanced Pandas¶

Learning Objectives¶

  1. Join and merge data frames together.
  2. Learn about the difference between views and copies.
  3. Improve performance with vectorization and parallelization.
  4. Plot data with pandas.

Joining Data Frames¶

It's quite common to have multiple data frames that need to be joined together. Consider the example of one "fact" table and multiple "dimension" tables, all stored separately:

In [1]:
import pandas as pd

fact = (
    pd.DataFrame({
        'dim_id': [1, 2, 3], 
        'val_x': ['x1', 'x2', 'x3']
    })
)
dim = (
    pd.DataFrame({
        'dim_id': [1, 2, 4], 
        'val_y': ['y1', 'y2', 'y4']
    })
)
In [2]:
fact
Out[2]:
dim_id val_x
0 1 x1
1 2 x2
2 3 x3
In [3]:
dim
Out[3]:
dim_id val_y
0 1 y1
1 2 y2
2 4 y4

Inner Join¶

An inner join matches pairs of observations whenever their keys are equal. We can do an inner join with the merge method:

In [4]:
inner = (
    fact.merge(
        dim,
        how="inner",
        on="dim_id"
    )
)
inner
Out[4]:
dim_id val_x val_y
0 1 x1 y1
1 2 x2 y2

Notice that the non-matching dim_ids are not represented here.

Outer Join¶

An inner join keeps only the observations that appear in both tables. Often, however, we want to retain all observations from at least one of the tables. For this we can use an outer join. There are three main types of outer joins:

  • A left join keeps all observations in the left table (here, fact).
  • A right join keeps all observations in the right table (here, dim).
  • A full join keeps all observations in both tables.

Left Outer Join¶

With a left join we retain all observations from fact, and we add the columns from dim. Rows in fact where there is no matching key value in dim will have NaN values in the new columns. We can perform a left (outer) join by specifying how="left":

In [5]:
left = (
    fact.merge(
        dim,
        how="left",
        on="dim_id"
    )
)
left
Out[5]:
dim_id val_x val_y
0 1 x1 y1
1 2 x2 y2
2 3 x3 NaN

Right Outer Join¶

A right join is just a flipped left join where we retain all observations from dim, and we add columns from fact. Similar to a left join, rows in dim where there is no matching key value in fact will have NaN values in the new columns:

In [6]:
right = (
    fact.merge(
        dim,
        how="right",
        on="dim_id"
    )
)
right
Out[6]:
dim_id val_x val_y
0 1 x1 y1
1 2 x2 y2
2 4 NaN y4

Full Outer Join¶

We can also perform a full outer join where we keep all observations in fact and dim. This join will match observations where the key variable(s) have matching information in both tables and then fill in non-matching values as NaN.

In [7]:
outer = (
    fact.merge(
        dim,
        how="outer",
        on="dim_id"
    )
)
outer
Out[7]:
dim_id val_x val_y
0 1 x1 y1
1 2 x2 y2
2 3 x3 NaN
3 4 NaN y4

Cross Join¶

The final type of join that pandas offers is a cross join, chosen with the parameter how="cross". This constructs the Cartesian product of the two data frames and should be used sparingly, as it can result in very large data frames. Note that you do not need to specify on here, as there are no join keys for a cross join (attempting to do so will result in an error).

Note further that when the two data frames share a column name, pandas will append the suffix "_x" to the column from the first (left) data frame and "_y" to the column from the second (right) in the result.

In [8]:
cross = (
    fact.merge(
        dim,
        how="cross"
    )
)
cross
Out[8]:
dim_id_x val_x dim_id_y val_y
0 1 x1 1 y1
1 1 x1 2 y2
2 1 x1 4 y4
3 2 x2 1 y1
4 2 x2 2 y2
5 2 x2 4 y4
6 3 x3 1 y1
7 3 x3 2 y2
8 3 x3 4 y4
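
If you'd rather control those suffixes yourself, merge() accepts a suffixes parameter. A minimal sketch, reusing the data frames above:

# Customize the suffixes applied to overlapping column
# names; by default pandas uses ("_x", "_y"):
cross_custom = fact.merge(
    dim,
    how="cross",
    suffixes=("_fact", "_dim")
)
# The shared "dim_id" column now appears as
# "dim_id_fact" and "dim_id_dim".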

Differing Keys¶

In the above cases fact and dim have been joined on the same column name. If you wish to join on columns that are named differently, you can do so with left_on and right_on:

In [9]:
fact_renamed = (
    fact.rename(
        columns={"dim_id": "fact_id"}
    )
)

merge_different_cols = (
    fact_renamed.merge(
        dim,
        how="inner",
        left_on="fact_id",
        right_on="dim_id"
    )
)
merge_different_cols
Out[9]:
fact_id val_x dim_id val_y
0 1 x1 1 y1
1 2 x2 2 y2

As you can see, both join columns are retained in the result.
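
If you'd rather not keep the redundant key column, one option is simply to drop it afterwards; a minimal sketch:

# Drop the now-redundant right-hand key column:
merge_different_cols.drop(columns="dim_id")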

Multiple Keys¶

If you wish to join on multiple columns at once, simply pass a list of column names to the on parameter (or to left_on and right_on):

In [10]:
# `assign()` creates a new column and
# returns a new data frame with that column:
fact_new_col = fact.assign(
    dim_id2=[1, 2, 3]
)
dim_new_col = dim.assign(
    dim_id2=[1, 2, 3]
)

fact_new_col.merge(
    dim_new_col,
    how="inner",
    on=["dim_id", "dim_id2"]
)
Out[10]:
dim_id val_x dim_id2 val_y
0 1 x1 1 y1
1 2 x2 2 y2

join() vs merge()¶

Finally, you might encounter a second way to join two data frames: with the (appropriately named) join() method. This differs from merge() in that it joins on the index by default rather than on one or more named columns:

In [11]:
fact_index = fact.set_index("dim_id")
dim_index = dim.set_index("dim_id")

# Remember, these new data frames are now
# indexed by "dim_id" rather than row
# number:
fact_index
Out[11]:
val_x
dim_id
1 x1
2 x2
3 x3
In [12]:
inner_join = (
    fact_index
    .join(
        # no "on" argument
        dim_index,
        how="inner"
    )
)

inner_join
Out[12]:
val_x val_y
dim_id
1 x1 y1
2 x2 y2

In practice there is little difference between join() and merge(): they perform equivalently in most cases and offer the same kinds of join. It's mostly a question of whether you're joining on the index or on columns.
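
Note, too, that merge() can itself join on indexes via the left_index and right_index parameters, so the two methods overlap almost entirely. A minimal sketch using the indexed data frames from above:

# The same inner join expressed with merge(), matching
# on the index of both data frames:
fact_index.merge(
    dim_index,
    how="inner",
    left_index=True,
    right_index=True
)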

Views vs. Copies¶

In pandas, a distinction that will probably trip everyone up at least once is that between views (looking at part of an existing object) and copies (making a new copy of an object in memory). This is a bit abstract, and even the pandas docs state the following:

...it’s very hard to predict whether [an operation] will return a view or a copy.

I recommend reading the linked section of the docs for more information.

It's relatively short and could save you some effort.

The main takeaway is that the most common warning you'll encounter in pandas is the SettingWithCopyWarning. Pandas raises it to signal that you might not be doing what you think you're doing, or that the operation you're performing may behave unpredictably.

As an example, say the number of seats on this particular plane was recorded incorrectly. Instead of 55 seats it should actually be 60 seats.

In [13]:
planes = pd.read_csv("../data/planes.csv")

tailnum_of_interest = planes['tailnum'] == 'N10156'
planes[tailnum_of_interest]
Out[13]:
tailnum year type manufacturer model engines seats speed engine
0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NaN Turbo-fan

Let's say we naively try to set the number of seats by filtering and selecting this element in our data frame with the following bracket notation:

In [14]:
planes[tailnum_of_interest]['seats'] = 60
/var/folders/j3/v1318ng94fvdpq7kzr0hq9kw0000gn/T/ipykernel_6733/1344183120.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  planes[tailnum_of_interest]['seats'] = 60

As you can see, we get a SettingWithCopyWarning. Was the data frame changed?

In [15]:
planes[tailnum_of_interest]['seats']
Out[15]:
0    55
Name: seats, dtype: int64

Long story short, subsetting with a boolean mask like this creates a copy:

In [16]:
planes[tailnum_of_interest]._is_view
Out[16]:
False

So when you go to set the "seats" column, you're not operating on the original data.

Note: don't rely on ._is_view; it is a private attribute and as such could be removed at any time. I only use it here for illustrative purposes.

The way to avoid this is to never chain accessors when setting data. Instead, use a single accessor with .loc:

In [17]:
planes.loc[
    tailnum_of_interest,
    "seats"
] = 60

planes[tailnum_of_interest]['seats']
Out[17]:
0    60
Name: seats, dtype: int64

See the linked documentation for more details.

Optional Note: If you check planes.loc[tailnum_of_interest, "seats"], you'll see that it results in a copy as well, not a view. What gives? This comes down to the way assignment works in Python; pandas has overridden it such that, when a .loc expression is used as the left-hand side of an assignment statement, the assignment is applied to the original data rather than to a copy. This can't work when you've chained multiple accessors.
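
Conversely, if you do want an independent subset to work with, make the copy explicit with .copy(); assignments to that new object won't touch the original data and won't raise the warning. A minimal sketch:

# Take an explicit, independent copy of the subset;
# modifying it does not affect `planes` and does not warn:
subset = planes[tailnum_of_interest].copy()
subset["seats"] = 60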

Vectorization and Parallelization¶

Pandas is relatively well optimized, with performance-critical portions written in the compiled language Cython, but there's still room for improvement on occasion.

Vectorization¶

Under the hood, a pandas data frame is backed by multidimensional arrays from the numerical library NumPy. Let's assume we have an expensive function we wish to apply to each row of a data frame. For illustration we'll use a simple sum of squares:

In [18]:
def sum_squares(x, y):
    # Assume this is a much larger function,
    # with complex, branching logic.
    return x**2 + y**2

We can easily use this function with pandas:

In [19]:
%%time
planes.apply(
    lambda row: sum_squares(
        row["seats"], row["engines"]
    ),
    axis=1
).head(5)
CPU times: user 16.3 ms, sys: 571 µs, total: 16.9 ms
Wall time: 16.9 ms
Out[19]:
0     3604
1    33128
2    33128
3    33128
4     3029
dtype: int64

But we can eke out some performance by dipping down into NumPy:

In [20]:
import numpy as np
In [21]:
%%time
np_result = np.vectorize(sum_squares)(
    planes["seats"], planes["engines"]
)
np_result[:5]
CPU times: user 1.72 ms, sys: 82 µs, total: 1.8 ms
Wall time: 1.8 ms
Out[21]:
array([ 3604, 33128, 33128, 33128,  3029])

The vectorize() function takes one function and converts it into a function that operates on NumPy arrays (or pandas Series, which are backed by NumPy arrays). In this case -- though this is a very silly example -- we get almost 10x the performance by using NumPy.

Although the result is a NumPy array rather than a pandas series, you can still assign it to a column as if it were a series:

In [22]:
planes["silly_col"] = np_result

Fastest of all is to simply apply arithmetic operators to the individual series:

In [23]:
%%time
(planes["seats"]**2 + planes["engines"]**2).head(5)
CPU times: user 346 µs, sys: 27 µs, total: 373 µs
Wall time: 367 µs
Out[23]:
0     3604
1    33128
2    33128
3    33128
4     3029
dtype: int64

But this won't work for more complex logic.

In general, there are a ton of useful functions in NumPy that work as you'd expect on pandas series and data frames. If you need to do numerical processing or linear algebra, I would encourage you to read a guide, such as this one.
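
That said, a fair amount of branching logic can still be vectorized with functions like np.where() and np.select(). A minimal sketch (the seat-count thresholds and labels here are made up purely for illustration):

# Vectorized branching: categorize planes by seat count
# without applying a Python function row by row.
conditions = [
    planes["seats"] >= 200,
    planes["seats"] >= 100,
]
choices = ["large", "medium"]
plane_size = np.select(conditions, choices, default="small")
plane_size[:5]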

Parallelization¶

Another way to speed up code is to use parallel processing. Under the hood, pandas is largely single-threaded, but for some tasks we can split a series or data frame into parts and operate on multiple portions simultaneously.

Note: I want you to get a sense for how this is possible, but if you don't follow all of the Python code here that's fine. Just be aware of the capability so you can research it further.

First, we'll import from the multiprocessing module built into Python. This allows us to distribute a function over a sequence of elements in parallel:

In [24]:
from multiprocessing import Pool

We need a function to apply. As above, this would ordinarily be a more complex function, but we'll keep it simple:

In [25]:
def sum_squares_df(df):
    df["silly_col"] = (
        df["seats"]**2 + 
        df["engines"]**2
    )
    return df

However, due to a quirk of how Jupyter and the multiprocessing module interact, the function needs to live in its own file, so we import it from there:

In [26]:
import sum_squares

This sets up a pool of workers for parallel processing:

In [27]:
cores = 4
pool = Pool(cores)

Now we need to split the data frame up into 4 portions, one for each core. For this we'll use the array_split() function from NumPy, which does exactly what it says, in this case returning a list of 4 data frames:

In [28]:
planes_split = np.array_split(planes, cores)

Then we distribute the function over each data frame chunk:

In [29]:
split_work = pool.map(
    sum_squares.sum_squares_df,
    planes_split
)

Finally, we can combine the results back into one data frame, using the concat() function:

In [30]:
df = pd.concat(split_work)

We'll also need to clean up after ourselves, because we've opened multiple processes:

In [31]:
pool.close()
pool.join()
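
Alternatively, Pool can be used as a context manager, which takes care of shutting down the worker processes when the block exits. A minimal sketch of the same workflow:

# The pool cleans up its workers automatically when
# the `with` block exits:
with Pool(cores) as pool:
    split_work = pool.map(
        sum_squares.sum_squares_df,
        planes_split
    )
df = pd.concat(split_work)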

Although this is cumbersome, we could convert it to a reusable function:

In [32]:
def parallel_apply(
    df, func, cores
):
    pool = Pool(cores)
    splits = np.array_split(df, cores)
    split_work = pool.map(
        func, splits
    )
    df = pd.concat(split_work)
    pool.close()
    pool.join()
    
    return df
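
Using it then looks something like this (reusing the imported sum_squares module from above):

# Run the row-wise computation across 4 worker processes
# and get back a single, recombined data frame:
result = parallel_apply(
    planes,
    sum_squares.sum_squares_df,
    cores=4
)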

Something to note here is how much extra work we're doing to apply a function in parallel:

  • Split the data frame into N pieces
  • Fork N new Python processes
  • Run the function N times
  • Concatenate the results back into one data frame
  • Clean up

In other words: you should only do this if the code you're parallelizing is sufficiently time-intensive to make it worthwhile. Otherwise you could easily end up taking more time than you would if you'd remained single-threaded.

Note: In general, if you have a substantial amount of data and wish to leverage multiple cores or CPUs, you might be better served investigating a library that is built from the ground up to allow for distributed processing. Dask and PySpark are two possible alternatives in this space.
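
For a rough flavor, here is a minimal Dask sketch of the same computation (assuming Dask is installed; this is illustrative rather than a tuned setup):

import dask.dataframe as dd

# Partition the pandas data frame into chunks that Dask
# can process in parallel:
ddf = dd.from_pandas(planes, npartitions=4)

# Operations are lazy until .compute() is called:
ddf["silly_col"] = ddf["seats"]**2 + ddf["engines"]**2
result = ddf.compute()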

Plotting in Pandas¶

The Python visualization ecosystem is extensive. You'll come across various libraries such as Matplotlib, Seaborn, Plotly and others. (See the PyViz website for more information on plotting capabilities in Python.)

However, since we are focusing on pandas here, we'll stick with the plotting functionality built into pandas itself (which uses Matplotlib as a backend).

Fair warning: as we don't have much time, this will be a pretty whirlwind tour.

For this we'll use a sample of the Complete Journey data provided externally by 84.51˚:

In [33]:
sales = pd.read_csv("../data/cj_sample.csv")

# We need to convert a string to a Timestamp
# with `to_datetime()`:
sales["transaction_timestamp"] = (
    pd.to_datetime(
        sales["transaction_timestamp"]
    )
)

Plotting a Series¶

We can visualize the distribution of a Series with a histogram:

In [34]:
sales["sales_value"].plot.hist();

It's highly skewed. To get a more illustrative plot we can filter out the zero values and use a log scale with more bins:

In [35]:
sales[
    sales["sales_value"] > 0
]["sales_value"].plot.hist(
    log=True, bins=30
);

We can also view a line plot over time, using the "transaction_timestamp" column. Pandas is quite good at dealing with dates and times, though in this case we need to set the time column to be the index:

In [36]:
sales_indexed = (
    sales.set_index("transaction_timestamp")
    ["sales_value"]
)

(sales_indexed
 .plot.line(figsize=(10,4)));

That's not a particularly engaging plot. Fortunately, since we have a datetime index we can use resample() to group by the day and take the sum:

In [37]:
(
    sales_indexed
    .resample('D')
    .sum()
    .plot.line(figsize=(10,4))
);

We can even be reasonably fancy here. Let's say we wanted a plot of how total daily sales are distributed across the days of the week:

In [38]:
day_order = [
    'Monday', 'Tuesday', 'Wednesday',
    'Thursday', 'Friday', 'Saturday',
    'Sunday'
]

total_sales_by_weekday = (
    sales_indexed
    .resample('D')
    .sum()
    .rename(lambda idx: idx.day_name())
    .groupby('transaction_timestamp')
    .quantile([.25, .5, .75])
    .unstack()
    .reindex(day_order)
)
In [39]:
total_sales_by_weekday.plot.line(
    title='Median and IQR of total sales by DOW',
    figsize=(10,4)
);

Line by line, we:

  1. Resample by day
  2. Compute total sales for each day
  3. Extract weekday from the index; to do this we use rename() which, when passed a function of one argument, will apply that function to the index
  4. Group by the index, now weekday names
  5. Compute quantiles for each group
  6. Since this is a multi-indexed Series, turn it into a data frame with unstack(); this takes the inner index level (the quantile labels) and pivots it into columns
  7. Use reindex() to force the index into the customary daily ordering

For reference, the final data frame we use to plot looks like this:

In [40]:
(
    sales_indexed
    .resample('D')
    .sum()
    .rename(lambda idx: idx.day_name())
    .groupby('transaction_timestamp')
    .quantile([.25, .5, .75])
    .unstack()
    .reindex(day_order)
)
Out[40]:
0.25 0.50 0.75
transaction_timestamp
Monday 634.9300 686.720 732.2500
Tuesday 574.1600 650.735 732.3175
Wednesday 568.6600 622.830 705.5575
Thursday 562.6550 634.630 709.5600
Friday 611.8275 677.125 739.8250
Saturday 769.0600 867.095 954.2750
Sunday 862.2900 948.430 990.9600

Other Types of Plot¶

There are a number of different plots we can use here, including:

  • box and whisker plots with .plot.box()
  • kernel density plots with .plot.kde()
  • bar plots with .plot.bar()
  • horizontal bar plots with .plot.barh()
  • pie charts with .plot.pie() (but please don't)
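
For instance, a box-and-whisker view of the same sales values is only a one-method change (filtering to smaller transactions so the outliers don't dominate the plot); a minimal sketch:

# Box plot of sales values for smaller transactions:
sales[sales["sales_value"] < 20]["sales_value"].plot.box();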

Plotting a Data Frame¶

Once you know the basics of plotting a Series, it's easy to extend them to plotting multiple columns of a data frame. For example, let's say we wanted to plot sales versus quantity sold across all transactions:

In [41]:
sales.plot.scatter(
    x='quantity', 
    y='sales_value', 
    title='Sales versus quantity', 
    figsize=(8,4)
);

We can use the same plotting methods with a data frame as we used for a series. For example, say we wanted to view total sales for the top five departments:

In [42]:
dept_sales = (
    sales.groupby("department", as_index=False)
    .agg({"sales_value": "sum"})
    .sort_values("sales_value", ascending=False)
    .head(5)
)
In [43]:
(dept_sales
 .sort_values('sales_value')
 .plot.barh(
      x='department', 
      y='sales_value', 
      color='red'
));

We can even plot multiple lines, in just the same manner as we plotted a series. Say we wanted to plot sales over time for various discount types:

In [44]:
total_daily_discounts = (
    sales[sales["department"] == "GROCERY"]
    .set_index('transaction_timestamp')
    .loc[:, ['retail_disc', 'coupon_disc', 'coupon_match_disc']]
    .resample('D')
    .sum()
)

total_daily_discounts.head()
Out[44]:
retail_disc coupon_disc coupon_match_disc
transaction_timestamp
2017-01-01 50.70 0.0 0.0
2017-01-02 82.02 0.4 0.4
2017-01-03 68.34 1.0 0.0
2017-01-04 71.44 1.4 0.4
2017-01-05 56.66 0.0 0.0

If we apply .plot.line() to this data frame then by default it will plot each of the numeric variables on the same plot:

In [45]:
total_daily_discounts.plot.line(
    logy=True, figsize=(10, 4)
);
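
If you'd rather give each variable its own panel instead of sharing the axes, the subplots parameter handles that; a minimal sketch:

# One panel per column instead of overlaying the lines:
total_daily_discounts.plot.line(
    subplots=True,
    figsize=(10, 6)
);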

Questions¶

Are there any questions before moving on?