First model with scikit-learn¶

Objective¶

In this module, we present how to build predictive models on tabular datasets, with only numerical features.

In particular we will highlight:

  • the scikit-learn API: .fit(X, y)/.predict(X)/.score(X, y);
  • how to evaluate the generalization performance of a model with a train-test split.

Data¶

We will use the same dataset "adult_census" described in the previous module. For more details about the dataset see http://www.openml.org/d/1590.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../data/adult-census.csv")

Separating features from target¶

Scikit-learn expects our features ($X$) to be kept separate from our target ($y$).

Numerical data is the most natural type of data used in machine learning and can (often) be directly fed into predictive models. Consequently, for this module we will use a subset of the original data with only the numerical columns.

In [2]:
import numpy as np

# identify the target column and the numerical feature columns
target_col = "class"
feature_col = (
    adult_census.drop(columns=target_col)
    .select_dtypes(np.number).columns.values
)
In [3]:
target = adult_census[target_col]
target
Out[3]:
0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object
In [4]:
features = adult_census[feature_col]
features
Out[4]:
        age  education-num  capital-gain  capital-loss  hours-per-week
0        25              7             0             0              40
1        38              9             0             0              50
2        28             12             0             0              40
3        44             10          7688             0              40
4        18             10             0             0              30
...     ...            ...           ...           ...             ...
48837    27             12             0             0              38
48838    40              9             0             0              40
48839    58              9             0             0              40
48840    22              9             0             0              20
48841    52              9         15024             0              40

48842 rows × 5 columns

Question

What type of object is the target data set?
What type of object is the feature data set?

Fit a model¶

We will build a classification model using the "K-nearest neighbors" strategy. To predict the target of a new sample, a k-nearest neighbors model looks at its k closest samples in the training set and predicts the majority target among those samples.
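
To make the voting idea concrete, here is a small toy sketch (an aside, not part of the original notebook) with a single feature and k=3; the class predicted for a new point is simply the most common class among its three nearest training samples:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: one numerical feature and two classes
toy_X = np.array([[1], [2], [3], [10], [11], [12]])
toy_y = np.array(["low", "low", "low", "high", "high", "high"])

toy_model = KNeighborsClassifier(n_neighbors=3)
toy_model.fit(toy_X, toy_y)

# the 3 nearest training samples to 2.5 are 2, 3 and 1, all labelled "low"
toy_model.predict([[2.5]])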

Note

We use a K-nearest neighbors model here. However, be aware that it is seldom useful in practice. We use it because it is an intuitive algorithm. In future modules, we will introduce alternative algorithms.

In [6]:
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')
In [7]:
from sklearn.neighbors import KNeighborsClassifier

# 1. define the algorithm
model = KNeighborsClassifier()

# 2. fit the model
model.fit(features, target)
Out[7]:
KNeighborsClassifier()

Learning can be represented as follows:

Predictor fit diagram

The method fit is based on two important elements: (i) a learning algorithm and (ii) some model state. The model state can be used later to either predict (for classifiers and regressors) or transform data (for transformers).
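
As a quick aside (not shown in the original notebook), scikit-learn stores this model state in attributes whose names end with an underscore; they only exist once fit has been called:

# fitted attributes (names ending in "_") hold the learned model state
model.classes_        # the target classes seen during fit
model.n_features_in_  # the number of features seen during fit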

Note

Here and later, we use the names features and target to be explicit. In the scikit-learn documentation, the data is commonly named X and the target is commonly called y.

Make predictions¶

Let's use our model to make some predictions on the same dataset. To predict, a model uses a prediction function that combines the input data with the model state.

Predictor predict diagram

In [8]:
target_predicted = model.predict(features)
target_predicted
Out[8]:
array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'],
      dtype=object)

...and we could even check if the predictions agree with the real targets:

In [9]:
# check whether the first 5 predictions match the true targets
target[:5] == target_predicted[:5]
Out[9]:
0     True
1     True
2    False
3     True
4     True
Name: class, dtype: bool

Note

Here, we see that our model makes a mistake when predicting the third observation.

To get a better assessment, we can compute the average success rate.

In [10]:
(target == target_predicted).mean()
Out[10]:
0.8479791982310306
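
As an aside (not in the original notebook), this manual computation is equivalent to using scikit-learn's accuracy_score helper:

from sklearn.metrics import accuracy_score

# same result as (target == target_predicted).mean()
accuracy_score(target, target_predicted)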

Warning!

But, can this evaluation be trusted, or is it too good to be true?

Train-test data split¶

When building a machine learning model, it is important to evaluate the trained model on data that was not used to fit it, as generalization is our primary concern -- meaning we want a rule that generalizes to new data.

Correct evaluation is easily done by leaving out a subset of the data when training the model and using it afterwards for model evaluation.

The data used to fit a model is called training data while the data used to assess a model is called testing data.

Scikit-learn provides the helper function sklearn.model_selection.train_test_split which is used to automatically split the dataset into two subsets.

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, 
    target, 
    random_state=123, 
    test_size=0.25,
    stratify=target
)

Tip

In scikit-learn, setting the random_state parameter allows us to get deterministic results whenever a random number generator is involved. In the train_test_split case, the randomness comes from shuffling the data, which decides how the dataset is split into a train and a test set.

As your target becomes more imbalanced, it becomes increasingly important to use the stratify parameter so that the class proportions are preserved in both the train and test sets.
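
As a quick sanity check (an aside, not part of the original notebook), rerunning the split with the same random_state produces exactly the same partition:

# repeat the split with identical parameters
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
    features,
    target,
    random_state=123,
    test_size=0.25,
    stratify=target,
)

# True: the split is fully reproducible
X_train.equals(X_train_2)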

Your Turn

  1. How many observations are in your train and test data sets?

  2. What is the proportion of response values in your y_train and y_test?

Instead of computing the predictions and then manually computing the average success rate, we can use the score method. For classifiers, this method returns the mean accuracy on the given data and labels.

Predictor score diagram

In [12]:
# 1. define the algorithm
model = KNeighborsClassifier()

# 2. fit the model
model.fit(X_train, y_train)

# 3. score our model on test data
accuracy = model.score(X_test, y_test)

print(f'The test accuracy using {model.__class__.__name__} is {accuracy:.2%}')
The test accuracy using KNeighborsClassifier is 82.59%

Important!

If we compare this with the accuracy of about 84.8% obtained earlier by (wrongly) evaluating the model on the same data it was trained on, we see that that evaluation was indeed optimistic compared to the score obtained on a held-out test set.

This illustrates the importance of always testing the generalization performance of predictive models on a different set than the one used to train these models.
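
To see this gap directly, here is a short sketch (not part of the original notebook) that scores the same fitted model on both the training data and the test data:

# compare the (optimistic) training accuracy with the held-out test accuracy
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f'train accuracy: {train_accuracy:.3f}')
print(f'test accuracy:  {test_accuracy:.3f}')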

Wrapping up¶

In this module we learned how to:

  • fit a predictive machine learning algorithm (k-nearest neighbors) on a training dataset;
  • evaluate its generalization performance on the testing data;
  • use the scikit-learn API .fit(X, y) (to train a model), .predict(X) (to make predictions) and .score(X, y) (to evaluate a model).

Your Turn

Scikit-learn provides a logistic regression algorithm, which is another type of algorithm for making binary classification predictions. This algorithm is available at sklearn.linear_model.LogisticRegression.

Fill in the blanks below to import the LogisticRegression class, define the algorithm, fit the model, and score on the test data.

In [ ]:
# 1. import the LogisticRegression class
from sklearn.linear_model import __________

# 2. define the algorithm
model = __________

# 3. fit the model
model.fit(______, ______)

# 4. score our model on test data
model.score(______, ______)