Data preprocessing and engineering techniques generally refer to the addition, deletion, or transformation of data.
Identifying data engineering needs can take significant effort and requires you to spend substantial time understanding your data...
"Live with your data before you plunge into modeling" - Leo Breiman
In this module we introduce how to scale numeric features, encode categorical features, and combine preprocessing and modeling steps into a single pipeline.
Let's go ahead and import a couple of required libraries along with our data.
Note
We will import additional libraries and functions as we proceed, introducing each at the point where it is used, as that provides better learning context.
import pandas as pd
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')
# import data
adult_census = pd.read_csv('../data/adult-census.csv')
# separate feature & target data
target = adult_census['class']
features = adult_census.drop(columns='class')
Typically, data types fall into two categories: numerical and categorical.
features.dtypes
age                int64
workclass         object
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
dtype: object
We can separate categorical and numerical variables using their data types. There are a few ways to do this; here, we make use of the `make_column_selector` helper to select the corresponding columns.
from sklearn.compose import make_column_selector as selector
# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)
# results in a list containing relevant column names
numerical_columns
['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
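For reference, the complementary selector gives the object-typed (categorical) column names; based on the dtypes above this should be workclass, education, marital-status, occupation, relationship, race, sex, and native-country:

# the complementary list of categorical (object dtype) column names
categorical_columns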
Scikit-learn works "out of the box" with numeric features. However, some algorithms make assumptions regarding the distribution of our features.
We see that our numeric features span across different ranges:
numerical_features = features[numerical_columns]
numerical_features.describe()
| | age | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
| count | 48842.000000 | 48842.000000 | 48842.000000 | 48842.000000 | 48842.000000 |
| mean | 38.643585 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
| std | 13.710510 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
| min | 17.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 28.000000 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
| 50% | 37.000000 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
| 75% | 48.000000 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
| max | 90.000000 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
Normalizing our features so that they have mean = 0 and standard deviation = 1 helps ensure they align with such algorithm assumptions.
Tip
Whether or not a machine learning model requires scaling of the features depends on the model family. Linear models such as logistic regression generally benefit from scaling the features, while other models such as tree-based models (e.g. decision trees, random forests) do not need such preprocessing (but will not suffer from it).
We can apply such a normalization using the scikit-learn transformer called `StandardScaler`.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(numerical_features)
StandardScaler()
The `fit` method for transformers is similar to the `fit` method for predictors. The main difference is that the former has a single argument (the feature matrix), whereas the latter has two arguments (the feature matrix and the target).
In this case, the algorithm needs to compute the mean and standard deviation for each feature and store them into some NumPy arrays. Here, these statistics are the model states.
Note
The fact that the model states of this scaler are arrays of means and standard deviations is specific to the StandardScaler. Other scikit-learn transformers will compute different statistics and store them as model states, in the same fashion.
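As a quick illustration (not part of the original workflow), a different transformer such as MinMaxScaler stores per-feature minimums and maximums as its model states:

# illustration only: MinMaxScaler stores different fitted statistics as its model states
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()
minmax_scaler.fit(numerical_features)
# per-feature minimums and maximums learned during fit; these should match
# the min and max rows of the describe() output above
minmax_scaler.data_min_, minmax_scaler.data_max_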
We can inspect the computed means and standard deviations.
scaler.mean_
array([ 38.64358544, 10.07808853, 1079.06762622, 87.50231358, 40.42238238])
scaler.scale_
array([1.37103696e+01, 2.57094644e+00, 7.45194277e+03, 4.03000427e+02, 1.23913172e+01])
Once we have called the `fit` method, we can perform the data transformation by calling the `transform` method.
numerical_features_scaled = scaler.transform(numerical_features)
numerical_features_scaled
array([[-0.99512893, -1.19725891, -0.14480353, -0.2171271 , -0.03408696],
       [-0.04694151, -0.41933527, -0.14480353, -0.2171271 ,  0.77292975],
       [-0.77631645,  0.74755018, -0.14480353, -0.2171271 , -0.03408696],
       ...,
       [ 1.41180837, -0.41933527, -0.14480353, -0.2171271 , -0.03408696],
       [-1.21394141, -0.41933527, -0.14480353, -0.2171271 , -1.64812038],
       [ 0.97418341, -0.41933527,  1.87131501, -0.2171271 , -0.03408696]])
Let's illustrate the internal mechanism of the `transform` method and put it in perspective with what we already saw with predictors.

The `transform` method for transformers is similar to the `predict` method for predictors. It applies a predefined transformation function that uses the model states and the input data. However, instead of outputting predictions, the job of the `transform` method is to output a transformed version of the input data.
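As a small sanity check (a sketch assuming NumPy is imported as np), we can reproduce the transformation manually from the stored model states:

import numpy as np

# manually apply the transformation function using the fitted model states
manual_scaled = (numerical_features - scaler.mean_) / scaler.scale_
# should be True: transform() produces exactly this result
np.allclose(manual_scaled, scaler.transform(numerical_features))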
Finally, the `fit_transform` method is a shorthand that successively calls `fit` and then `transform`.
# fitting and transforming in one step
scaler.fit_transform(numerical_features)
array([[-0.99512893, -1.19725891, -0.14480353, -0.2171271 , -0.03408696],
       [-0.04694151, -0.41933527, -0.14480353, -0.2171271 ,  0.77292975],
       [-0.77631645,  0.74755018, -0.14480353, -0.2171271 , -0.03408696],
       ...,
       [ 1.41180837, -0.41933527, -0.14480353, -0.2171271 , -0.03408696],
       [-1.21394141, -0.41933527, -0.14480353, -0.2171271 , -1.64812038],
       [ 0.97418341, -0.41933527,  1.87131501, -0.2171271 , -0.03408696]])
Notice that the mean of all the columns is close to 0 and the standard deviation in all cases is close to 1:
numerical_features = pd.DataFrame(
numerical_features_scaled,
columns=numerical_columns
)
numerical_features.describe()
| | age | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
| count | 4.884200e+04 | 4.884200e+04 | 4.884200e+04 | 4.884200e+04 | 4.884200e+04 |
| mean | 2.281092e-16 | -9.208746e-17 | 1.047440e-17 | -1.018345e-17 | 4.466169e-17 |
| std | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 |
| min | -1.578629e+00 | -3.531030e+00 | -1.448035e-01 | -2.171271e-01 | -3.181452e+00 |
| 25% | -7.763164e-01 | -4.193353e-01 | -1.448035e-01 | -2.171271e-01 | -3.408696e-02 |
| 50% | -1.198790e-01 | -3.037346e-02 | -1.448035e-01 | -2.171271e-01 | -3.408696e-02 |
| 75% | 6.824334e-01 | 7.475502e-01 | -1.448035e-01 | -2.171271e-01 | 3.694214e-01 |
| max | 3.745808e+00 | 2.303397e+00 | 1.327438e+01 | 1.059179e+01 | 4.727312e+00 |
We can easily combine sequential operations with a scikit-learn `Pipeline`, which chains together operations and can be used like any other classifier or regressor. The helper function `make_pipeline` will create a `Pipeline`: it takes as arguments the successive transformations to perform, followed by the classifier or regressor model.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model
Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
Let's divide our data into train and test sets and then apply and score our logistic regression model:
from sklearn.model_selection import train_test_split
# split our data into train & test
X_train, X_test, y_train, y_test = train_test_split(
numerical_features, target, random_state=123
)
# fit our pipeline model
model.fit(X_train, y_train)
# score our model on the test data
model.score(X_test, y_test)
0.8135287855212513
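As a side note (illustrative, not required for the workflow), the fitted steps of a pipeline built with make_pipeline can be inspected through named_steps; the step names default to the lowercased class names:

# inspect the fitted estimators inside the pipeline
model.named_steps['standardscaler'].mean_        # means learned by the scaler
model.named_steps['logisticregression'].coef_    # coefficients of the fitted model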
Unfortunately, Scikit-learn does not accept categorical features in their raw form. Consequently, we need to transform them into numerical representations.
The following presents typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.
The most intuitive strategy is to encode each category with a different number. The `OrdinalEncoder` will transform the data in such a manner. We will start by encoding a single column to understand how the encoding works.
from sklearn.preprocessing import OrdinalEncoder
# let's illustrate with the 'education' feature
education_column = features[["education"]]
encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[ 1.], [11.], [ 7.], ..., [11.], [11.], [11.]])
We see that each category in `"education"` has been replaced by a numeric value. We can check the mapping between the categories and the numerical values by inspecting the fitted attribute `categories_`.
encoder.categories_
[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate', ' HS-grad', ' Masters', ' Preschool', ' Prof-school', ' Some-college'], dtype=object)]
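If we ever need to map the integer codes back to the original labels, the encoder provides an inverse_transform method (shown here purely as an illustration):

# recover the original category labels from the first few encoded values
encoder.inverse_transform(education_encoded[:3])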
However, be careful when applying this encoding strategy: using this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3... for instance).
By default, `OrdinalEncoder` uses a lexicographical strategy to map string category labels to integers. This strategy is arbitrary and often meaningless. For instance, suppose the dataset has a categorical variable named `"size"` with categories such as "S", "M", "L", "XL". We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as `0, 1, 2, 3`.
However, the lexicographical strategy used by default would map the labels
"S", "M", "L", "XL" to 2, 1, 0, 3, by following the alphabetical order.
The `OrdinalEncoder` class accepts a `categories` argument to pass the categories in the expected order explicitly (`categories[i]` holds the categories expected in the ith column).
ed_levels = [
' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th',
' 12th', ' HS-grad', ' Prof-school', ' Some-college', ' Assoc-acdm',
' Assoc-voc', ' Bachelors', ' Masters', ' Doctorate'
]
encoder = OrdinalEncoder(categories=[ed_levels])
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[ 6.], [ 8.], [11.], ..., [ 8.], [ 8.], [ 8.]])
encoder.categories_
[array([' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th', ' HS-grad', ' Prof-school', ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Masters', ' Doctorate'], dtype=object)]
`OneHotEncoder` is an alternative encoder that converts each categorical level into a new column. We will start by encoding a single feature (e.g. `"education"`) to illustrate how the encoding works.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Viewing this as a data frame provides a more intuitive illustration:
feature_names = encoder.get_feature_names_out(
input_features=["education"]
)
pd.DataFrame(education_encoded, columns=feature_names).head(5)
| | education_ 10th | education_ 11th | education_ 12th | education_ 1st-4th | education_ 5th-6th | education_ 7th-8th | education_ 9th | education_ Assoc-acdm | education_ Assoc-voc | education_ Bachelors | education_ Doctorate | education_ HS-grad | education_ Masters | education_ Preschool | education_ Prof-school | education_ Some-college |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Let's apply this encoding to all the categorical features:
# one-hot encode all features
cat_features_encoded = encoder.fit_transform(
features[categorical_columns]
)
# view as a data frame
columns_encoded = encoder.get_feature_names_out(
categorical_columns
)
pd.DataFrame(cat_features_encoded, columns=columns_encoded).head(3)
| | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | workclass_ Without-pay | education_ 10th | ... | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |

3 rows × 102 columns
Warning
One-hot encoding can significantly increase the number of features in our data. In this case we went from 8 features to 102! If you have a data set with many categorical variables and those categorical variables in turn have many unique levels, the number of features can explode. In these cases you may want to explore ordinal encoding or some other alternative.
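To see where those 102 columns come from, we can count the unique levels of each categorical feature (a quick check using pandas; the sum of these counts gives the total number of one-hot encoded columns):

# number of unique levels per categorical feature
features[categorical_columns].nunique()
# total number of columns produced by one-hot encoding
features[categorical_columns].nunique().sum()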
Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).
Tip
In general, `OneHotEncoder` is the encoding strategy used when the downstream models are linear models, while `OrdinalEncoder` is often a good strategy with tree-based models.
Using an `OrdinalEncoder` will output ordinal categories. This means that there is an order in the resulting categories (e.g. `0 < 1 < 2`). The impact of violating this ordering assumption really depends on the downstream models. Linear models will be impacted by misordered categories, while tree-based models will not.
You can still use an `OrdinalEncoder` with linear models, but you need to be sure that the original categories have a meaningful order and that the encoded integers follow that same order.
One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use `OneHotEncoder` in such cases even if the original categories do not have a given order.
Now let's look at how to combine some of these tasks so we can preprocess both numeric and categorical data.
First, let's get our train & test data established:
# drop the duplicated column `"education-num"` as stated in the data exploration notebook
features = features.drop(columns='education-num')
# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)
# split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(
features, target, random_state=123
)
Scikit-learn provides a `ColumnTransformer` class that sends specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables.
We have already defined the columns based on their data type. We now create our `ColumnTransformer`, which requires three values for each transformer: a name, the transformer itself, and the columns it should be applied to.
First, let's create the preprocessors for the numerical and categorical parts.
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
Now, we create the transformer and associate each of these preprocessors with their respective columns.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
('one-hot-encoder', categorical_preprocessor, categorical_columns),
('standard_scaler', numerical_preprocessor, numerical_columns)
])
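Note that, by default, any column not listed in one of the transformers is dropped by the ColumnTransformer. If you want to keep such columns untouched, you can pass remainder='passthrough'; the variant below is a hypothetical sketch, not part of the original workflow:

# hypothetical variant: pass through any columns not handled by a transformer
preprocessor_keep_all = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)
], remainder='passthrough')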
We can take a minute to represent graphically the structure of a `ColumnTransformer` within a pipeline:
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']), ('standard_scaler', StandardScaler(), ['age', 'capital-gain', 'capital-loss', 'hours-per-week'])])), ('logisticregression', LogisticRegression(max_iter=500))])
# fit our model
_ = model.fit(X_train, y_train)
# score on test set
model.score(X_test, y_test)
0.8503808041929408
Unfortunately, we only have time to scratch the surface of feature engineering in this workshop. However, this module should provide you with a strong foundation of how to apply the more common feature preprocessing tasks.
Tip
Scikit-learn provides many feature engineering options. Learn more here: https://scikit-learn.org/stable/modules/preprocessing.html
In this module we learned how to:
- normalize numeric features with `StandardScaler`,
- encode categorical features with `OrdinalEncoder` and `OneHotEncoder`, and
- combine preprocessing and modeling steps with `ColumnTransformer` and `make_pipeline`.