Feature Engineering¶

Objective¶

Data preprocessing and engineering techniques generally refer to the addition, deletion, or transformation of data.

Identifying data engineering needs can take significant effort and requires you to spend substantial time understanding your data...

"Live with your data before you plunge into modeling" - Leo Breiman

In this module we introduce:

  • an example of preprocessing numerical features,
  • two common ways to preprocess categorical features,
  • using a scikit-learn pipeline to chain preprocessing and model training.

Basic prerequisites¶

Let's import a couple of required libraries and load our data.

Note

We will import additional libraries and functions as we proceed, introducing each at the point where it is used to provide better learning context.

In [1]:
import pandas as pd

# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')

# import data
adult_census = pd.read_csv('../data/adult-census.csv')

# separate feature & target data
target = adult_census['class']
features = adult_census.drop(columns='class')

Selection based on data types¶

Typically, data types fall into two categories:

  • Numeric: a quantity represented by a real or integer number.
  • Categorical: a discrete value, typically represented by string labels (but not only) taken from a finite list of possible choices.
In [2]:
features.dtypes
Out[2]:
age                int64
workclass         object
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

We can use these data types to separate the categorical and numerical variables.

There are a few ways we can do this. Here, we make use of the make_column_selector helper to select the corresponding columns.

In [3]:
from sklearn.compose import make_column_selector as selector

# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)

# results in a list containing relevant column names
numerical_columns
Out[3]:
['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

Preprocessing numerical data¶

Scikit-learn works "out of the box" with numeric features. However, some algorithms make assumptions regarding the distribution of our features.

We can see that our numeric features span very different ranges:

In [4]:
numerical_features = features[numerical_columns]
numerical_features.describe()
Out[4]:
age education-num capital-gain capital-loss hours-per-week
count 48842.000000 48842.000000 48842.000000 48842.000000 48842.000000
mean 38.643585 10.078089 1079.067626 87.502314 40.422382
std 13.710510 2.570973 7452.019058 403.004552 12.391444
min 17.000000 1.000000 0.000000 0.000000 1.000000
25% 28.000000 9.000000 0.000000 0.000000 40.000000
50% 37.000000 10.000000 0.000000 0.000000 40.000000
75% 48.000000 12.000000 0.000000 0.000000 45.000000
max 90.000000 16.000000 99999.000000 4356.000000 99.000000

Normalizing our features so that they have mean = 0 and standard deviation = 1 helps ensure our features align with these algorithm assumptions.

Tip

Here are some reasons for scaling features:

  • Models that rely on the distance between a pair of samples, for instance k-nearest neighbors, should be trained on normalized features to make each feature contribute approximately equally to the distance computations.
  • Many models such as logistic regression use a numerical solver (based on gradient descent) to find their optimal parameters. This solver converges faster when the features are scaled.

Whether or not a machine learning model requires normalization of the features depends on the model family. Linear models such as logistic regression generally benefit from scaling the features, while other models such as tree-based models (e.g. decision trees, random forests) do not need such preprocessing (but will not suffer from it).
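To make the distance argument concrete, here is a minimal sketch using made-up numbers (not taken from the census data): a feature measured in the thousands dominates a raw Euclidean distance, whereas after standardizing both features contribute on a comparable scale.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical observations: [age, capital-gain]. The raw capital-gain values
# are thousands of times larger than age, so they dominate the raw distance.
toy = np.array([
    [38.0,    0.0],
    [39.0, 5000.0],
    [52.0,    0.0],
])

raw_distance = np.linalg.norm(toy[0] - toy[1])
scaled = StandardScaler().fit_transform(toy)
scaled_distance = np.linalg.norm(scaled[0] - scaled[1])

print(raw_distance)     # ~5000: driven almost entirely by capital-gain
print(scaled_distance)  # both features now contribute comparably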

We can apply such normalization using a scikit-learn transformer called StandardScaler.

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(numerical_features)
Out[5]:
StandardScaler()

The fit method for transformers is similar to the fit method for predictors. The main difference is that the former has a single argument (the feature matrix), whereas the latter has two arguments (the feature matrix and the target).

Transformer fit diagram

In this case, the algorithm needs to compute the mean and standard deviation for each feature and store them into some NumPy arrays. Here, these statistics are the model states.

Note

The fact that the model states of this scaler are arrays of means and standard deviations is specific to the StandardScaler. Other scikit-learn transformers will compute different statistics and store them as model states, in the same fashion.

We can inspect the computed means and standard deviations.

In [6]:
scaler.mean_
Out[6]:
array([  38.64358544,   10.07808853, 1079.06762622,   87.50231358,
         40.42238238])
In [7]:
scaler.scale_
Out[7]:
array([1.37103696e+01, 2.57094644e+00, 7.45194277e+03, 4.03000427e+02,
       1.23913172e+01])

Once we have called the fit method, we can perform data transformation by calling the method transform.

In [8]:
numerical_features_scaled = scaler.transform(numerical_features)
numerical_features_scaled
Out[8]:
array([[-0.99512893, -1.19725891, -0.14480353, -0.2171271 , -0.03408696],
       [-0.04694151, -0.41933527, -0.14480353, -0.2171271 ,  0.77292975],
       [-0.77631645,  0.74755018, -0.14480353, -0.2171271 , -0.03408696],
       ...,
       [ 1.41180837, -0.41933527, -0.14480353, -0.2171271 , -0.03408696],
       [-1.21394141, -0.41933527, -0.14480353, -0.2171271 , -1.64812038],
       [ 0.97418341, -0.41933527,  1.87131501, -0.2171271 , -0.03408696]])

Let's illustrate the internal mechanism of the transform method and put it into perspective with what we already saw with predictors.

Transformer transform diagram

The transform method for transformers is similar to the predict method for predictors. It applies a predefined transformation function to the input data using the model states. However, instead of outputting predictions, its job is to output a transformed version of the input data.
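As a quick sanity check, we can reproduce the transformation by hand from the stored model states; for StandardScaler this amounts to subtracting scaler.mean_ and dividing by scaler.scale_ (a small sketch reusing the objects defined above):

import numpy as np

# Manually apply the transformation function using the stored model states
manual = (numerical_features - scaler.mean_) / scaler.scale_

# Should agree with the transformer's own output
np.allclose(manual, scaler.transform(numerical_features))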

Finally, the method fit_transform is a shorthand method to call successively fit and then transform.

Transformer fit_transform diagram

In [9]:
# fitting and transforming in one step
scaler.fit_transform(numerical_features)
Out[9]:
array([[-0.99512893, -1.19725891, -0.14480353, -0.2171271 , -0.03408696],
       [-0.04694151, -0.41933527, -0.14480353, -0.2171271 ,  0.77292975],
       [-0.77631645,  0.74755018, -0.14480353, -0.2171271 , -0.03408696],
       ...,
       [ 1.41180837, -0.41933527, -0.14480353, -0.2171271 , -0.03408696],
       [-1.21394141, -0.41933527, -0.14480353, -0.2171271 , -1.64812038],
       [ 0.97418341, -0.41933527,  1.87131501, -0.2171271 , -0.03408696]])

Notice that the mean of all the columns is close to 0 and the standard deviation in all cases is close to 1:

In [10]:
numerical_features = pd.DataFrame(
    numerical_features_scaled,
    columns=numerical_columns
)

numerical_features.describe()
Out[10]:
age education-num capital-gain capital-loss hours-per-week
count 4.884200e+04 4.884200e+04 4.884200e+04 4.884200e+04 4.884200e+04
mean 2.281092e-16 -9.208746e-17 1.047440e-17 -1.018345e-17 4.466169e-17
std 1.000010e+00 1.000010e+00 1.000010e+00 1.000010e+00 1.000010e+00
min -1.578629e+00 -3.531030e+00 -1.448035e-01 -2.171271e-01 -3.181452e+00
25% -7.763164e-01 -4.193353e-01 -1.448035e-01 -2.171271e-01 -3.408696e-02
50% -1.198790e-01 -3.037346e-02 -1.448035e-01 -2.171271e-01 -3.408696e-02
75% 6.824334e-01 7.475502e-01 -1.448035e-01 -2.171271e-01 3.694214e-01
max 3.745808e+00 2.303397e+00 1.327438e+01 1.059179e+01 4.727312e+00

Model pipelines¶

We can easily combine sequential operations with a scikit-learn Pipeline, which chains together operations and is used as any other classifier or regressor. The helper function make_pipeline will create a Pipeline: it takes as arguments the successive transformations to perform, followed by the classifier or regressor model.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())
model
Out[11]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
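For reference, the same pipeline can be written with the Pipeline class directly when you want to choose the step names yourself; make_pipeline simply derives the names from the class names. A minimal sketch:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Equivalent pipeline with explicit step names
model_explicit = Pipeline(steps=[
    ('standardscaler', StandardScaler()),
    ('logisticregression', LogisticRegression()),
])

# Individual steps can later be retrieved by name, e.g.
# model_explicit.named_steps['logisticregression']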

Let's divide our data into train and test sets and then apply and score our logistic regression model:

In [12]:
from sklearn.model_selection import train_test_split

# split our data into train & test
X_train, X_test, y_train, y_test = train_test_split(
    numerical_features, target, random_state=123
)

# fit our pipeline model
model.fit(X_train, y_train)

# score our model on the test data
model.score(X_test, y_test)
Out[12]:
0.8135287855212513

Preprocessing categorical data¶

Unfortunately, Scikit-learn does not accept categorical features in their raw form. Consequently, we need to transform them into numerical representations.

The following presents typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.

Encoding ordinal categories¶

The most intuitive strategy is to encode each category with a different number. The OrdinalEncoder transforms the data in such a manner. We will start by encoding a single column to understand how the encoding works.

In [13]:
from sklearn.preprocessing import OrdinalEncoder

# let's illustrate with the 'education' feature
education_column = features[["education"]]

encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(education_column)
education_encoded
Out[13]:
array([[ 1.],
       [11.],
       [ 7.],
       ...,
       [11.],
       [11.],
       [11.]])

We see that each category in "education" has been replaced by a numeric value. We can check the mapping between the categories and the numerical values by inspecting the fitted attribute categories_.

In [14]:
encoder.categories_
Out[14]:
[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object)]

However, be careful when applying this encoding strategy: using this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3... for instance).

By default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. This strategy is arbitrary and often meaningless. For instance, suppose the dataset has a categorical variable named "size" with categories such as "S", "M", "L", "XL". We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels "S", "M", "L", "XL" to 2, 1, 0, 3, by following the alphabetical order.
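A quick sketch with a hypothetical "size" column (not part of the census data) shows this default behavior:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with a natural order S < M < L < XL
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL"]})

# Default alphabetical mapping: L=0, M=1, S=2, XL=3 -- not the natural order
OrdinalEncoder().fit_transform(sizes).ravel()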

The OrdinalEncoder class accepts a categories argument to pass categories in the expected ordering explicitly (categories[i] holds the categories expected in the ith column).

In [15]:
ed_levels = [
    ' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', 
    ' 12th', ' HS-grad', ' Prof-school', ' Some-college', ' Assoc-acdm', 
    ' Assoc-voc', ' Bachelors', ' Masters', ' Doctorate'
]

encoder = OrdinalEncoder(categories=[ed_levels])
education_encoded = encoder.fit_transform(education_column)
education_encoded
Out[15]:
array([[ 6.],
       [ 8.],
       [11.],
       ...,
       [ 8.],
       [ 8.],
       [ 8.]])
In [16]:
encoder.categories_
Out[16]:
[array([' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th',
        ' 11th', ' 12th', ' HS-grad', ' Prof-school', ' Some-college',
        ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Masters',
        ' Doctorate'], dtype=object)]

Encoding nominal categories¶

OneHotEncoder is an alternative encoder that creates a new binary (0/1) column for each categorical level.

We will start by encoding a single feature (e.g. "education") to illustrate how the encoding works.

In [17]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
education_encoded = encoder.fit_transform(education_column)
education_encoded
Out[17]:
array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Viewing this as a data frame provides a more intuitive illustration:

In [18]:
feature_names = encoder.get_feature_names_out(
    input_features=["education"]
)
pd.DataFrame(education_encoded, columns=feature_names).head(5)
Out[18]:
education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th education_ 7th-8th education_ 9th education_ Assoc-acdm education_ Assoc-voc education_ Bachelors education_ Doctorate education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

Let's apply this encoding to all the categorical features:

In [19]:
# one-hot encode all features
cat_features_encoded = encoder.fit_transform(
    features[categorical_columns]
)

# view as a data frame
columns_encoded = encoder.get_feature_names_out(
    categorical_columns
)
pd.DataFrame(cat_features_encoded, columns=columns_encoded).head(3)
Out[19]:
workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked workclass_ Private workclass_ Self-emp-inc workclass_ Self-emp-not-inc workclass_ State-gov workclass_ Without-pay education_ 10th ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

3 rows × 102 columns

Warning

One-hot encoding can significantly increase the number of features in our data. In this case we went from 8 features to 102! If you have a data set with many categorical variables and those categorical variables in turn have many unique levels, the number of features can explode. In these cases you may want to explore ordinal encoding or some other alternative.
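One way to anticipate this blow-up before encoding is to check the cardinality (number of unique levels) of each categorical column, as in this small sketch reusing the objects defined above:

# Number of unique levels per categorical feature; their sum equals the
# number of columns produced by one-hot encoding (102 here)
features[categorical_columns].nunique()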

Choosing an encoding strategy¶

Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).

Tip

In general OneHotEncoder is the encoding strategy used when the downstream models are linear models while OrdinalEncoder is often a good strategy with tree-based models.

Using an OrdinalEncoder will output ordinal categories. This means that there is an order in the resulting categories (e.g. 0 < 1 < 2). The impact of violating this ordering assumption depends on the downstream models: linear models will be impacted by misordered categories while tree-based models will not.

You can still use an OrdinalEncoder with linear models but you need to be sure that:

  • the original categories (before encoding) have an ordering;
  • the encoded categories follow the same ordering as the original categories.

One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use OneHotEncoder in such cases even if the original categories do not have a given order.
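For illustration, a minimal sketch of the tree-based pairing described above, assuming a RandomForestClassifier as the downstream model (the handle_unknown settings guard against categories that appear only at prediction time):

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Intended for the categorical columns, e.g. features[categorical_columns];
# the arbitrary integer ordering is generally acceptable for tree-based models.
tree_model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    RandomForestClassifier(random_state=123),
)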

Using numerical and categorical variables together¶

Now let's look at how to combine some of these tasks so we can preprocess both numeric and categorical data.

First, let's get our train & test data established:

In [20]:
# drop the duplicated column `"education-num"` as stated in the data exploration notebook
features = features.drop(columns='education-num')

# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)

# split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=123
)

Scikit-learn provides a ColumnTransformer class which will send specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together.

We first define the columns depending on their data type:

  • one-hot encoding will be applied to the categorical columns, and
  • standard scaling will be applied to the numerical columns.

We then create our ColumnTransformer by specifying three values:

  1. the preprocessor name,
  2. the transformer, and
  3. the columns.

First, let's create the preprocessors for the numerical and categorical parts.

In [21]:
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

Now, we create the transformer and associate each of these preprocessors with their respective columns.

In [22]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)
])
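As an aside, scikit-learn also provides a make_column_transformer helper that auto-generates the step names, analogous to how make_pipeline relates to Pipeline. A minimal sketch using the preprocessors defined above:

from sklearn.compose import make_column_transformer

# Equivalent transformer with auto-generated step names
preprocessor_alt = make_column_transformer(
    (categorical_preprocessor, categorical_columns),
    (numerical_preprocessor, numerical_columns),
)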

We can take a minute to represent graphically the structure of a ColumnTransformer:

columntransformer diagram

In [23]:
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model
Out[23]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['workclass', 'education',
                                                   'marital-status',
                                                   'occupation', 'relationship',
                                                   'race', 'sex',
                                                   'native-country']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['age', 'capital-gain',
                                                   'capital-loss',
                                                   'hours-per-week'])])),
                ('logisticregression', LogisticRegression(max_iter=500))])
In [24]:
# fit our model
_ = model.fit(X_train, y_train)

# score on test set
model.score(X_test, y_test)
Out[24]:
0.8503808041929408

Wrapping up¶

Unfortunately, we only have time to scratch the surface of feature engineering in this workshop. However, this module should provide you with a strong foundation for applying the more common feature preprocessing tasks.

Tip

Scikit-learn provides many feature engineering options. Learn more here: https://scikit-learn.org/stable/modules/preprocessing.html

In this module we learned how to:

  • normalize numerical features with StandardScaler,
  • ordinal and one-hot encode categorical features with OrdinalEncoder and OneHotEncoder, and
  • chain feature preprocessing and model training steps together with ColumnTransformer and make_pipeline.