Modular Code, Part 2¶

  • In our coverage of modular code, we talked about abstracting reusable code chunks into their own functions

    • And, in turn, grouping those functions together into separate modules
    • We created a function that splits a data set into its features (a DataFrame) and target (a Series)
  • In our discussion of feature engineering, we showed how one might make a "preprocessor": a column transformer that one-hot encodes categorical features and applies standard scaling to numeric columns

    • We then chained this preprocessor together with a logistic regression model in order to form a scikit-learn pipeline
  • We might use the same approach in preprocessing other datasets, so let's move that logic to its own function and add it to our personal module

Writing a Preprocessor Function¶

Sometimes it's easiest to write a function's signature (its name and parameters) before actually writing its body.

Our function will return a column transformer that we can use in pipelines. Its only parameter, at least for right now, will be the features DataFrame.

One possible function signature looks like this:

def make_preprocessor(features):
    ...

Now that we have our signature, we can fill in the body. In this case, we can reuse the code we wrote in the feature engineering section.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numeric_preprocessor, numeric_columns)
])

Can we just put all of that code into our function without any changes?

In [1]:
def make_preprocessor(features):
    from sklearn.compose import ColumnTransformer

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])

Discussion

Does anyone see any issues with this?
In [2]:
import pandas as pd
fake_features = pd.read_csv('../data/planes.csv')
In [3]:
preprocessor = make_preprocessor(fake_features)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/9w/9m3mzyd96fbdm8q4sy2pjpdw0000gn/T/ipykernel_61981/3965947682.py in <module>
----> 1 preprocessor = make_preprocessor(fake_features)

/var/folders/9w/9m3mzyd96fbdm8q4sy2pjpdw0000gn/T/ipykernel_61981/2727407406.py in make_preprocessor(features)
      3 
      4     preprocessor = ColumnTransformer([
----> 5         ('one-hot-encoder', categorical_preprocessor, categorical_columns),
      6         ('standard_scaler', numeric_preprocessor, numeric_columns)
      7     ])

NameError: name 'categorical_preprocessor' is not defined

Our code is missing some context: categorical_preprocessor, categorical_columns, numeric_preprocessor, and numeric_columns aren't defined anywhere the function can see them.

Here's an updated version in which we assign to those variables before using them.

In [4]:
def make_preprocessor(features):
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    
    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
    numeric_preprocessor = StandardScaler()
    
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])

Things run without error now!

In [5]:
preprocessor = make_preprocessor(fake_features)

But there are a couple of other issues.

What does our resulting preprocessor object look like?

In [6]:
preprocessor
In [7]:
type(preprocessor)
Out[7]:
NoneType
  • We need to remember to return a value -- otherwise we can't get anything useful out of the function.

  • Generally, Python best practice is to import libraries outside functions.

All imports, even if they're to be used in different functions, are usually placed at the top of the Python module.

Let's make those changes...

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_preprocessor(features):
    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
    numeric_preprocessor = StandardScaler()
    
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    
    return preprocessor

And then make sure it works...

In [9]:
preprocessor = make_preprocessor(fake_features)
preprocessor
Out[9]:
ColumnTransformer(transformers=[('one-hot-encoder',
                                 OneHotEncoder(handle_unknown='ignore'),
                                 Index(['tailnum', 'type', 'manufacturer', 'model', 'engine'], dtype='object')),
                                ('standard_scaler', StandardScaler(),
                                 Index(['year', 'engines', 'seats', 'speed'], dtype='object'))])
In [10]:
type(preprocessor)
Out[10]:
sklearn.compose._column_transformer.ColumnTransformer

Now that our function is ready, we can add it to our module! Reopen my_module.py and add our imports to the top and our new function at the end:

In [11]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def get_features_and_target(csv_file, target_col):
    '''Split a CSV into a DF of numeric features and a target column.'''
    adult_census = pd.read_csv(csv_file)
    
    raw_features = adult_census.drop(columns=target_col)
    numeric_features = raw_features.select_dtypes(np.number)
    feature_cols = numeric_features.columns.values

    features = adult_census[feature_cols]
    target = adult_census[target_col]
    return (features, target)

def make_preprocessor(features):
    '''Create a column transformer that applies sensible preprocessing procedures.'''
    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
    numeric_preprocessor = StandardScaler()
    
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor

Our functions can work together now...

In [12]:
import my_module

features, target = my_module.get_features_and_target(
    csv_file='../data/adult-census.csv',
    target_col='class',
)

# Drop education-num as discussed before, because it's redundant.
features = features.drop('education-num', axis=1)

preprocessor = my_module.make_preprocessor(features)

And we could make this preprocessor part of a scikit-learn pipeline, as we saw before:

In [13]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# If we want a logistic regression
model = make_pipeline(preprocessor, LogisticRegression())
# or perhaps we prefer a random forest? (our features are all numeric here, so no encoding or scaling is needed)
#model = make_pipeline(RandomForestClassifier())

If we were even more ambitious, we could build a function that just took features and a model class (such as LogisticRegression) and returned a pipeline. But that wouldn't simplify things much beyond what we already have, so we'll leave that as an exercise you can try if you want to experiment more with modularizing your code.
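For the curious, here's one possible sketch of such a function. The name make_model_pipeline, its parameters, and the default model are illustrative choices, not part of our module:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

def make_model_pipeline(features, model_class=LogisticRegression):
    '''Sketch: chain our preprocessor with a model class in one pipeline.'''
    # Reuse the make_preprocessor function defined above
    preprocessor = make_preprocessor(features)
    # Instantiate the model class and place it after the preprocessor
    return make_pipeline(preprocessor, model_class())

# e.g. make_model_pipeline(features) or make_model_pipeline(features, RandomForestClassifier)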

We can use our pipeline on real data, just as we did before.

In [14]:
from sklearn.model_selection import train_test_split

# one small addition: the target column is encoded as a string in our data so we need to convert to 1s and 0s.
target = target.str.contains('>50K').astype(int)

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)

# fit our model
_ = model.fit(X_train, y_train)

# score on test set
model.score(X_test, y_test)
Out[14]:
0.7988698714274015

Discussion

What if we wanted to make our function more flexible, such that users could determine what kind of categorical and numeric encoding schemes should be used?
In [15]:
def make_preprocessor(features):
    '''Create a column transformer that applies sensible preprocessing procedures.'''
    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
    numeric_preprocessor = StandardScaler()
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns
    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor

One approach would be to add "categorical_preprocessor" and "numeric_preprocessor" parameters...

In [16]:
def make_preprocessor(features, categorical_preprocessor, numeric_preprocessor):
    '''Create a column transformer that applies sensible preprocessing procedures.'''
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns
    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor

This allows us to specify the precise transformations we want:

In [17]:
# Will work the same as the original
preprocessor = make_preprocessor(
    fake_features,
    categorical_preprocessor=OneHotEncoder(handle_unknown="ignore"),
    numeric_preprocessor=StandardScaler(),
)
In [18]:
from sklearn.preprocessing import Normalizer, OrdinalEncoder
# Uses different strategies
preprocessor = make_preprocessor(
    fake_features,
    categorical_preprocessor=OrdinalEncoder(),
    numeric_preprocessor=Normalizer(),
)

But this is a bit cumbersome: we now have to pass both preprocessors explicitly every time, even when we just want the defaults:

In [19]:
preprocessor = make_preprocessor(fake_features)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/9w/9m3mzyd96fbdm8q4sy2pjpdw0000gn/T/ipykernel_61981/3965947682.py in <module>
----> 1 preprocessor = make_preprocessor(fake_features)

TypeError: make_preprocessor() missing 2 required positional arguments: 'categorical_preprocessor' and 'numeric_preprocessor'

It would be nicer if these arguments were optional, and defaulted to the original choices...

In [20]:
def make_preprocessor(features, categorical_preprocessor=None, numeric_preprocessor=None):
    '''Create a column transformer that applies sensible preprocessing procedures.'''
    
    if categorical_preprocessor is None:
        categorical_preprocessor = OneHotEncoder(handle_unknown='ignore')
    if numeric_preprocessor is None:
        numeric_preprocessor = StandardScaler()
        
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns
    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor
In [21]:
preprocessor = make_preprocessor(fake_features)
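A note on the design choice: why default to None instead of putting OneHotEncoder(...) and StandardScaler() right in the signature? Python evaluates default argument values once, when the function is defined, so a mutable default object ends up shared across every call that omits that argument. The None sentinel sidesteps this by building fresh objects inside the function. Here is a minimal, generic illustration (the function names are made up for the example):

def append_bad(item, items=[]):
    # The default list is created once, at definition time,
    # so every call that omits `items` mutates that same list.
    items.append(item)
    return items

def append_good(item, items=None):
    # A None sentinel lets us create a fresh list on each call.
    if items is None:
        items = []
    items.append(item)
    return items

append_bad(1); append_bad(2)    # both calls return the same list: [1, 2]
append_good(1); append_good(2)  # independent lists: [1] and [2]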

Your Turn

Update your my_module.py file to reflect the changes we made above. Try testing the new version with the code below:

In [22]:
import my_module
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

features, target = my_module.get_features_and_target(
    csv_file='../data/adult-census.csv',
    target_col='class',
)
features = features.drop('education-num', axis=1)
target = target.str.contains('>50K').astype(int)

preprocessor = my_module.make_preprocessor(features, numeric_preprocessor=Normalizer())
model = make_pipeline(preprocessor, LogisticRegression())

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)

_ = model.fit(X_train, y_train)
model.score(X_test, y_test)
Out[22]:
0.7806076488412087

Remember GitHub?¶

We always commit significant code updates to GitHub, so let's stop now and commit and push our changes.
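In a terminal (or in notebook cells prefixed with !), the sequence might look something like this; the commit message and branch name are just examples:

git add my_module.py
git commit -m "Add make_preprocessor to my_module"
git push origin main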

Questions¶

Are there any questions before we move on?