In this notebook, we will look at the necessary steps required before any machine learning takes place. It involves:

* loading the data with pandas;
* looking at the variables it contains, in particular differentiating between numerical and categorical variables;
* visualizing the distribution of the variables to gain some insights into the dataset.
We will use data from the 1994 US census that we downloaded from OpenML.
You can look at the OpenML webpage to learn more about this dataset: http://www.openml.org/d/1590
The dataset is available as a CSV (Comma-Separated Values) file and we will use pandas to read it.
import pandas as pd
adult_census = pd.read_csv("../data/adult-census.csv")
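To check that the file loaded as expected, we can peek at the first few rows, a routine sanity check:

# Display the first five rows of the dataset
adult_census.head()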
Note
The goal with this data is to predict whether a person earns over 50K a year based on heterogeneous data such as age, employment, education, family information, etc. The column named class is our target variable (i.e., the variable which we want to predict).
The two possible classes are <=50K (low-revenue) and >50K (high-revenue).
Consequently, this is called a binary classification problem.
target_column = 'class'
adult_census[target_column].value_counts()
<=50K    37155
>50K     11687
Name: class, dtype: int64
Note
Classes are slightly imbalanced, meaning there are more samples in one class than in the other. Class imbalance happens often in practice and may need special techniques when building a predictive model.
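To see the imbalance as proportions rather than raw counts, we can pass normalize=True to value_counts, a small sketch using the same pandas API as above:

# Class frequencies as a fraction of the whole dataset
adult_census[target_column].value_counts(normalize=True)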
The other columns represent pieces of information (aka "features") that may be useful in predicting the target variable.
In the field of machine learning and descriptive statistics, commonly used equivalent terms are "variable", "attribute", or "covariate".
features = adult_census.drop(columns='class')
features.head()
|   | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |
import numpy as np
numeric_columns = (
features.select_dtypes(include=np.number).columns.values
)
categorical_columns = (
features.drop(columns=numeric_columns).columns.values
)
print(f'''
There are {features.shape[0]} observations and {features.shape[1]} features.
Numeric features: {', '.join(numeric_columns)}.
Categorical features: {', '.join(categorical_columns)}.
''')
There are 48842 observations and 13 features.
Numeric features: age, education-num, capital-gain, capital-loss, hours-per-week.
Categorical features: workclass, education, marital-status, occupation, relationship, race, sex, native-country.
Before building a predictive model, it is a good idea to look at the data:
adult_census.hist(figsize=(20, 14));
Tip
In the previous cell, we used the pattern func();, where the semicolon follows the plotting call. We do this to avoid showing the ugly <matplotlib.axes object ...> output.
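An equivalent way to suppress that output is to bind the return value to a throwaway name; a minor stylistic alternative:

import matplotlib.pyplot as plt

# Assigning the returned axes to "_" also hides the text representation
_ = adult_census.hist(figsize=(20, 14))
plt.show()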
We can already make a few comments about some of the variables:

* age: there are not that many points for age > 70. The dataset description does indicate that retired people have been filtered out (hours-per-week > 0);
* education-num: peaks at 10 and 13; it is hard to tell what these correspond to without looking much further. We'll do that later in this notebook;
* hours-per-week: peaks at 40; this was very likely the standard number of working hours at the time of the data collection;
* capital-gain and capital-loss: values are close to zero.

For categorical variables, we can look at the distribution of values:
adult_census['sex'].value_counts()
Male      32650
Female    16192
Name: sex, dtype: int64
adult_census['education'].value_counts()
HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64
Question
What do you think the difference is between education and education-num?
pd.crosstab(
index=adult_census['education'],
columns=adult_census['education-num']
)
| education \ education-num | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10th | 0 | 0 | 0 | 0 | 0 | 1389 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11th | 0 | 0 | 0 | 0 | 0 | 0 | 1812 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12th | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 657 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1st-4th | 0 | 247 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5th-6th | 0 | 0 | 509 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7th-8th | 0 | 0 | 0 | 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9th | 0 | 0 | 0 | 0 | 756 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Assoc-acdm | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1601 | 0 | 0 | 0 | 0 |
| Assoc-voc | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2061 | 0 | 0 | 0 | 0 | 0 |
| Bachelors | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8025 | 0 | 0 | 0 |
| Doctorate | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 594 |
| HS-grad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15784 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Masters | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2657 | 0 | 0 |
| Preschool | 83 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Prof-school | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 834 | 0 |
| Some-college | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10878 | 0 | 0 | 0 | 0 | 0 | 0 |

Each education category maps to exactly one education-num value: the two columns carry the same information, one as a string and one as a number.
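If you prefer a programmatic check over reading the table, a minimal sketch with groupby and nunique confirms the mapping:

# Number of distinct education-num values per education level;
# a maximum of 1 means the mapping is one-to-one
n_values = adult_census.groupby("education")["education-num"].nunique()
print(n_values.max())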
Another way to inspect the data is to do a pairplot and show how each variable differs according to our target, i.e. class. Plots along the diagonal show the distribution of individual variables for each class. The plots on the off-diagonal can reveal interesting interactions between variables.
import seaborn as sns
# We will plot a subset of the data to keep the plot readable and make the
# plotting faster
n_samples_to_plot = 5000
columns = ['age', 'education-num', 'hours-per-week']
sns.pairplot(data=adult_census[:n_samples_to_plot], vars=columns,
hue=target_column, plot_kws={'alpha': 0.2},
height=3, diag_kind='hist', diag_kws={'bins': 30});
By looking at the previous plots, we could create some hand-written rules that predict whether someone has a high or low income.
Your Turn
Looking at the previous pairplot, what types of decision rules would you suggest?
For instance, we could focus on the combination of the hours-per-week and age features.
import matplotlib.pyplot as plt

ax = sns.scatterplot(
    x="age", y="hours-per-week", data=adult_census[:n_samples_to_plot],
    hue="class", alpha=0.5,
)

# Candidate decision thresholds, drawn as dashed lines
age_limit = 27
plt.axvline(x=age_limit, ymin=0, ymax=1, color="black", linestyle="--")

hours_per_week_limit = 40
plt.axhline(
    y=hours_per_week_limit, xmin=0.18, xmax=1, color="black", linestyle="--"
)

# Label the regions delimited by the two thresholds
plt.annotate("<=50K", (17, 25), rotation=90, fontsize=35)
plt.annotate("<=50K", (35, 20), fontsize=35)
plt.annotate("???", (45, 60), fontsize=35);
Machine learning models will work similarly.
However, ML models choose the "best" decision rules based on the data, without human intervention or inspection.
ML models are extremely helpful when creating rules by hand is not straightforward, for example because the data are high-dimensional (many features) or because there are no simple and obvious rules that separate the two classes, as in the top-right region of the previous plot.
To sum up, the important thing to remember is that, in a machine-learning setting, a model automatically creates the "rules" from the data in order to make predictions on new, unseen data. This is where we will turn our attention in future modules.
In this module we have:

* loaded the data from a CSV file using pandas;
* differentiated between the numerical and categorical variables in the dataset;
* inspected the data with pandas and seaborn. Data inspection can allow you to decide whether using machine learning is appropriate for your data and to highlight potential peculiarities in your data.