In our coverage of modular code, we talked about abstracting reusable code chunks into their own functions.
In our discussion of feature engineering, we showed how one might make a "preprocessor": a column transformer that one-hot encodes categorical features and applies standard scaling to numeric columns.
Sometimes it's easiest to write a function's signature -- its name and parameters -- before actually writing its body.
Our function is going to give us a column transformer that we can use in pipelines. The only parameter will be the features DataFrame (at least for now).
One possible function signature looks like this:
def make_preprocessor(features):
    ...
Now that we have our definition, we can add code to it. In this case, we can reuse the code we wrote in the feature engineering section.
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numeric_preprocessor, numeric_columns)
])
Can we just put all of that code into our function without any changes?
def make_preprocessor(features):
    from sklearn.compose import ColumnTransformer

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
Discussion
Does anyone see any issues with this?

import pandas as pd
fake_features = pd.read_csv('../data/planes.csv')
preprocessor = make_preprocessor(fake_features)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/9w/9m3mzyd96fbdm8q4sy2pjpdw0000gn/T/ipykernel_61981/3965947682.py in <module>
----> 1 preprocessor = make_preprocessor(fake_features)

/var/folders/9w/9m3mzyd96fbdm8q4sy2pjpdw0000gn/T/ipykernel_61981/2727407406.py in make_preprocessor(features)
      3
      4     preprocessor = ColumnTransformer([
----> 5         ('one-hot-encoder', categorical_preprocessor, categorical_columns),
      6         ('standard_scaler', numeric_preprocessor, numeric_columns)
      7     ])

NameError: name 'categorical_preprocessor' is not defined
Our code is missing some context: categorical_preprocessor, categorical_columns, numeric_preprocessor, and numeric_columns aren't defined yet.
Here's an updated version in which we assign to those variables before using them.
def make_preprocessor(features):
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
    numeric_preprocessor = StandardScaler()
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
Things run without error now!
preprocessor = make_preprocessor(fake_features)
But there are a couple of other issues.
What does our resulting preprocessor object look like?
preprocessor
type(preprocessor)
NoneType
We need to remember to return a value -- otherwise we can't get anything useful out of the function.
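To see why, here's a tiny standalone example (separate from our module) of a function that computes a value but never returns it:

def add(a, b):
    total = a + b  # computed, but never handed back to the caller

result = add(1, 2)
print(result)  # prints None -- a function with no return statement returns None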
Generally, Python best practice is to import libraries outside of functions: all imports, even ones used only by certain functions, are placed together at the top of the Python module.
Let's make those changes...
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_preprocessor(features):
    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
    numeric_preprocessor = StandardScaler()
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor
And then make sure it works...
preprocessor = make_preprocessor(fake_features)
preprocessor
ColumnTransformer(transformers=[('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'), Index(['tailnum', 'type', 'manufacturer', 'model', 'engine'], dtype='object')), ('standard_scaler', StandardScaler(), Index(['year', 'engines', 'seats', 'speed'], dtype='object'))])
type(preprocessor)
sklearn.compose._column_transformer.ColumnTransformer
Now that our function is ready, we can add it to our module!
Reopen my_module.py and add our imports to the top and our new function at the end:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
def get_features_and_target(csv_file, target_col):
    '''Split a CSV into a DF of numeric features and a target column.'''
    adult_census = pd.read_csv(csv_file)
    raw_features = adult_census.drop(columns=target_col)
    numeric_features = raw_features.select_dtypes(np.number)
    feature_cols = numeric_features.columns.values
    features = adult_census[feature_cols]
    target = adult_census[target_col]
    return (features, target)

def make_preprocessor(features):
    '''Create a column transformer that applies sensible preprocessing procedures.'''
    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
    numeric_preprocessor = StandardScaler()
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor
Our functions can work together now...
import my_module

features, target = my_module.get_features_and_target(
    csv_file='../data/adult-census.csv',
    target_col='class',
)
# Drop education-num as discussed before, because it's redundant.
features = features.drop('education-num', axis=1)
preprocessor = my_module.make_preprocessor(features)
And we could make this preprocessor part of a scikit-learn pipeline, as we saw before:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# If we want a logistic regression...
model = make_pipeline(preprocessor, LogisticRegression())
# ...or perhaps we prefer a random forest? (Our target is a class label,
# so we use the classifier variant and keep the preprocessor in the pipeline.)
# model = make_pipeline(preprocessor, RandomForestClassifier())
If we were even more ambitious, we could build a function that just took features and a model class (such as LogisticRegression) and returned a pipeline. But that wouldn't simplify things much beyond what we already have, so we'll leave that as an exercise you can try if you want to experiment more with modularizing your code.
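If you do want to try it, a minimal sketch might look like this (make_model_pipeline is a name we're inventing here, not something defined in my_module):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

import my_module

def make_model_pipeline(features, model_class):
    '''Build a pipeline from our standard preprocessor and a model class.'''
    preprocessor = my_module.make_preprocessor(features)
    return make_pipeline(preprocessor, model_class())

# e.g., equivalent to the logistic regression pipeline above:
# model = make_model_pipeline(features, LogisticRegression)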
We can use our pipeline on real data, just as we did before.
from sklearn.model_selection import train_test_split
# one small addition: the target column is encoded as a string in our data so we need to convert to 1s and 0s.
target = target.str.contains('>50K').astype(int)
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)
# fit our model
_ = model.fit(X_train, y_train)
# score on test set
model.score(X_test, y_test)
0.7988698714274015
Discussion
What if we wanted to make our function more flexible, such that users could determine what kind of categorical and numeric encoding schemes should be used?

def make_preprocessor(features):
    '''Create a column transformer that applies sensible preprocessing procedures.'''
    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
    numeric_preprocessor = StandardScaler()
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor
One approach would be to add "categorical_preprocessor" and "numeric_preprocessor" parameters...
def make_preprocessor(features, categorical_preprocessor, numeric_preprocessor):
    '''Create a column transformer that applies the given preprocessors to categorical and numeric columns.'''
    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor
This allows us to specify the precise transformations we want:
# Will work the same as the original
preprocessor = make_preprocessor(
    fake_features,
    categorical_preprocessor=OneHotEncoder(handle_unknown="ignore"),
    numeric_preprocessor=StandardScaler(),
)
from sklearn.preprocessing import Normalizer, OrdinalEncoder

# Uses different strategies
preprocessor = make_preprocessor(
    fake_features,
    categorical_preprocessor=OrdinalEncoder(),
    numeric_preprocessor=Normalizer(),
)
But this is a bit cumbersome - we have to specify all three arguments every time:
preprocessor = make_preprocessor(fake_features)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/9w/9m3mzyd96fbdm8q4sy2pjpdw0000gn/T/ipykernel_61981/3965947682.py in <module>
----> 1 preprocessor = make_preprocessor(fake_features)

TypeError: make_preprocessor() missing 2 required positional arguments: 'categorical_preprocessor' and 'numeric_preprocessor'
It would be nicer if these arguments were optional, and defaulted to the original choices...
def make_preprocessor(features, categorical_preprocessor=None, numeric_preprocessor=None):
    '''Create a column transformer that applies sensible preprocessing procedures.'''
    if categorical_preprocessor is None:
        categorical_preprocessor = OneHotEncoder(handle_unknown='ignore')
    if numeric_preprocessor is None:
        numeric_preprocessor = StandardScaler()

    numeric_columns = features.select_dtypes(exclude=object).columns
    categorical_columns = features.select_dtypes(include=object).columns

    preprocessor = ColumnTransformer([
        ('one-hot-encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numeric_preprocessor, numeric_columns)
    ])
    return preprocessor
preprocessor = make_preprocessor(fake_features)
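Why default to None and create the encoders inside the function, rather than writing categorical_preprocessor=OneHotEncoder(handle_unknown='ignore') directly in the signature? Python evaluates default values just once, when the function is defined, so a default object ends up shared by every call that relies on it. Here's a standalone illustration of that pitfall (unrelated to our module):

def append_to(item, target=[]):
    # the default list is created once, at definition time, and reused
    target.append(item)
    return target

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- the same list object as the first call!

Defaulting to None and building the object inside the function guarantees each call gets a fresh instance.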
Your Turn
Update your my_module.py file to reflect the changes we made above. Try testing out the new version with the code below:
import my_module
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

features, target = my_module.get_features_and_target(
    csv_file='../data/adult-census.csv',
    target_col='class',
)
features = features.drop('education-num', axis=1)
target = target.str.contains('>50K').astype(int)

preprocessor = my_module.make_preprocessor(features, numeric_preprocessor=Normalizer())
model = make_pipeline(preprocessor, LogisticRegression())

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)
_ = model.fit(X_train, y_train)
model.score(X_test, y_test)
0.7806076488412087
We always commit significant code updates to GitHub, so let's stop now and push our changes.
Are there any questions before we move on?