In this module, we present how to build predictive models on tabular datasets, with only numerical features.
In particular, we will highlight the scikit-learn API: .fit(X, y), .predict(X), and .score(X, y).
We will use the same dataset "adult_census" described in the previous module. For more details about the dataset, see http://www.openml.org/d/1590.
import pandas as pd
adult_census = pd.read_csv("../data/adult-census.csv")
Scikit-learn expects the features ($X$) to be kept separate from the target ($y$).
Numerical data is the most natural type of data used in machine learning and can (often) be directly fed into predictive models. Consequently, for this module we will use a subset of the original data with only the numerical columns.
import numpy as np
# create column names of interest
target_col = "class"
feature_col = (
adult_census.drop(columns=target_col)
.select_dtypes(np.number).columns.values
)
target = adult_census[target_col]
target
0 <=50K
1 <=50K
2 >50K
3 >50K
4 <=50K
...
48837 <=50K
48838 >50K
48839 <=50K
48840 <=50K
48841 >50K
Name: class, Length: 48842, dtype: object
features = adult_census[feature_col]
features
| | age | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
| 0 | 25 | 7 | 0 | 0 | 40 |
| 1 | 38 | 9 | 0 | 0 | 50 |
| 2 | 28 | 12 | 0 | 0 | 40 |
| 3 | 44 | 10 | 7688 | 0 | 40 |
| 4 | 18 | 10 | 0 | 0 | 30 |
| ... | ... | ... | ... | ... | ... |
| 48837 | 27 | 12 | 0 | 0 | 38 |
| 48838 | 40 | 9 | 0 | 0 | 40 |
| 48839 | 58 | 9 | 0 | 0 | 40 |
| 48840 | 22 | 9 | 0 | 0 | 20 |
| 48841 | 52 | 9 | 15024 | 0 | 40 |
48842 rows × 5 columns
Question
What type of object is the target data set?
What type of object is the feature data set?
We will build a classification model using the "K-nearest neighbors" strategy. To predict the target of a new sample, a k-nearest neighbors model looks at the k closest samples in the training set and predicts the majority target among them.
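To make the majority-vote idea concrete, here is a minimal from-scratch sketch using only the standard library. The function name `knn_predict` and the toy data are hypothetical and chosen for illustration only; this is not how scikit-learn implements the algorithm internally.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, new_point, k=5):
    """Predict the label of new_point by majority vote among its k nearest training points."""
    # Euclidean distance from new_point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest training samples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: (age, hours-per-week) -> income class
train_points = [(25, 40), (30, 38), (22, 20), (45, 50), (50, 60), (48, 55)]
train_labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

prediction = knn_predict(train_points, train_labels, (47, 52), k=3)
print(prediction)
```

The three training points closest to (47, 52) all carry the label ">50K", so that is the predicted class.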
Note
We use a k-nearest neighbors model here because it is an intuitive algorithm. However, be aware that it is seldom useful in practice. In future modules, we will introduce alternative algorithms.
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')
from sklearn.neighbors import KNeighborsClassifier
# 1. define the algorithm
model = KNeighborsClassifier()
# 2. fit the model
model.fit(features, target)
KNeighborsClassifier()
Learning can be represented as follows:
The method fit is based on two important elements: (i) a learning algorithm and (ii) a model state. The model state can be used later to either predict (for classifiers and regressors) or transform data (for transformers).
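The split between a learning algorithm and a stored model state can be sketched with a toy estimator. `MajorityClassifier` below is a hypothetical class written for this illustration; it only mimics the shape of the scikit-learn API and is not part of the library.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learning algorithm: count the labels and keep the most frequent one.
        # The trailing underscore mimics scikit-learn's naming of learned state.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Prediction function: combines the input data with the stored state.
        return [self.majority_ for _ in X]


toy_features = [[25], [38], [28]]
toy_target = ["<=50K", "<=50K", ">50K"]

toy_model = MajorityClassifier().fit(toy_features, toy_target)
print(toy_model.majority_)            # the model state learned by fit
print(toy_model.predict([[44], [18]]))
```

Here the model state is a single value; for a k-nearest neighbors model, the state is essentially the training set itself.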
Note
Here and later, we use the names features and target to be explicit. In the scikit-learn documentation, the features are commonly named X and the target is commonly named y.
Let's use our model to make some predictions on the same dataset. To predict, a model uses a prediction function that combines the input data with the model state.
target_predicted = model.predict(features)
target_predicted
array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'], dtype=object)
...and we could even check if the predictions agree with the real targets:
# check whether the first 5 predictions match the true targets
target[:5] == target_predicted[:5]
0 True
1 True
2 False
3 True
4 True
Name: class, dtype: bool
Note
Here, we see that our model makes a mistake when predicting the third observation.
To get a better assessment, we can compute the average success rate.
(target == target_predicted).mean()
0.8479791982310306
Warning!
But, can this evaluation be trusted, or is it too good to be true?
When building a machine learning model, it is important to evaluate the trained model on data that was not used to fit it, as generalization is our primary concern -- meaning we want a rule that generalizes to new data.
Correct evaluation is easily done by leaving out a subset of the data when training the model and using it afterwards for model evaluation.
The data used to fit a model is called training data while the data used to assess a model is called testing data.
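The sketch below shows why a score computed on the training data can be misleading. It uses made-up random features and labels that are unrelated to each other, and it assumes a 1-nearest-neighbor model (`n_neighbors=1`), which is a different setting from the default used elsewhere in this module. The data is split by simple slicing, purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # random features
y = rng.integers(0, 2, size=1000)     # random labels, unrelated to X

# Manual split: first half for training, second half for testing
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_accuracy = (model.predict(X_train) == y_train).mean()
test_accuracy = (model.predict(X_test) == y_test).mean()
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

With one neighbor, each training point is its own nearest neighbor, so the training accuracy is perfect even though there is nothing to learn. The test accuracy should stay close to chance level, which is the honest estimate.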
Scikit-learn provides the helper function sklearn.model_selection.train_test_split, which is used to automatically split the dataset into two subsets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features,
target,
random_state=123,
test_size=0.25,
stratify=target
)
Tip
In scikit-learn, setting the random_state parameter gives deterministic results whenever a random number generator is used. In the train_test_split case, the randomness comes from shuffling the data, which decides how the dataset is split into a train and a test set.
The more imbalanced your target is, the more important it becomes to use the stratify parameter, which keeps the class proportions the same in the train and test sets.
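As a rough illustration of both parameters, the sketch below uses a made-up imbalanced target (10% positive class) rather than the census data. The helper `split` is a hypothetical wrapper introduced only to avoid repeating the call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 samples, 10% of them in the positive class
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 100 + [0] * 900)


def split(seed):
    # Hypothetical helper: stratified 75/25 split with a fixed seed
    return train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)


X_train_a, X_test_a, y_train_a, y_test_a = split(seed=0)
X_train_b, X_test_b, y_train_b, y_test_b = split(seed=0)

# Same random_state -> identical split
same_split = np.array_equal(X_test_a, X_test_b)
print("identical splits:", same_split)

# stratify=y -> class proportions preserved in both subsets
print("positive rate in train:", y_train_a.mean())
print("positive rate in test: ", y_test_a.mean())
```

Calling the split twice with the same seed yields the same test set, and the positive rate in both subsets stays close to the 10% of the full dataset.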
Your Turn
How many observations are in your train and test data sets?
What is the proportion of response values in your y_train and y_test?
Instead of computing the predictions and then manually computing the average success rate, we can use the method score. For classifiers, this method returns the mean accuracy on the given data.
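To check that score is a shortcut for the manual computation, the sketch below compares three ways of obtaining the accuracy on made-up data. The synthetic features and the labeling rule are assumptions chosen only so the example runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # made-up labeling rule

# Train on the first 150 samples, evaluate on the remaining 50
model = KNeighborsClassifier().fit(X[:150], y[:150])
X_test, y_test = X[150:], y[150:]

via_score = model.score(X_test, y_test)
via_manual = (model.predict(X_test) == y_test).mean()
via_metric = accuracy_score(y_test, model.predict(X_test))
print(via_score, via_manual, via_metric)
```

All three values agree, since a classifier's score method predicts on the given data and computes the accuracy for us.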
# 1. define the algorithm
model = KNeighborsClassifier()
# 2. fit the model
model.fit(X_train, y_train)
# 3. score our model on test data
accuracy = model.score(X_test, y_test)
print(f'The test accuracy using {model.__class__.__name__} is {accuracy:.2%}')
The test accuracy using KNeighborsClassifier is 82.59%
Important!
If we compare with the accuracy obtained by wrongly evaluating the model on the training set, we find that this evaluation was indeed optimistic compared to the score obtained on a held-out test set.
This illustrates the importance of always testing the generalization performance of predictive models on a different set than the one used to train these models.
In this module we learned how to use .fit(X, y) (to train a model), .predict(X) (to make predictions) and .score(X, y) (to evaluate a model).
Your Turn
Scikit-learn provides a logistic regression algorithm, which is another type of algorithm for making binary classification predictions. This algorithm is available at sklearn.linear_model.LogisticRegression.
Fill in the blanks below to import the LogisticRegression class, define the algorithm, fit the model, and score on the test data.
# 1. import the LogisticRegression class
from sklearn.linear_model import __________
# 2. define the algorithm
model = __________
# 3. fit the model
model.fit(______, ______)
# 4. score our model on test data
model.score(______, ______)