For all Case Study 2 questions we will be using the heart.csv data provided with this case study. Along with the heart.csv data, I have provided a heart_data_dictionary.csv file that provides a description of each column. As you answer the lab questions, it may be beneficial to reference this data dictionary.
import pandas as pd
heart = pd.read_csv("../data/heart.csv")
heart
| | age | sex | chest_pain | rest_bp | chol | fbs | rest_ecg | max_hr | exang | old_peak | slope | ca | thal | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | Male | typical | 145 | 233 | 1 | left ventricular hypertrophy | 150 | 0 | 2.3 | 3 | 0.0 | fixed | 0 |
| 1 | 67 | Male | asymptomatic | 160 | 286 | 0 | left ventricular hypertrophy | 108 | 1 | 1.5 | 2 | 3.0 | normal | 1 |
| 2 | 67 | Male | asymptomatic | 120 | 229 | 0 | left ventricular hypertrophy | 129 | 1 | 2.6 | 2 | 2.0 | reversable | 1 |
| 3 | 37 | Male | nonanginal | 130 | 250 | 0 | normal | 187 | 0 | 3.5 | 3 | 0.0 | normal | 0 |
| 4 | 41 | Female | nontypical | 130 | 204 | 0 | left ventricular hypertrophy | 172 | 0 | 1.4 | 1 | 0.0 | normal | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45 | Male | typical | 110 | 264 | 0 | normal | 132 | 0 | 1.2 | 2 | 0.0 | reversable | 1 |
| 299 | 68 | Male | asymptomatic | 144 | 193 | 1 | normal | 141 | 0 | 3.4 | 2 | 2.0 | reversable | 1 |
| 300 | 57 | Male | asymptomatic | 130 | 131 | 0 | normal | 115 | 1 | 1.2 | 2 | 1.0 | reversable | 1 |
| 301 | 57 | Female | nontypical | 130 | 236 | 0 | left ventricular hypertrophy | 174 | 0 | 0.0 | 2 | 1.0 | normal | 1 |
| 302 | 38 | Male | nonanginal | 138 | 175 | 0 | normal | 173 | 0 | 0.0 | 1 | NaN | normal | 0 |
303 rows × 14 columns
Filter the heart data for all observations where the person is 50 years or older. How many observations are there?
heart[heart["age"] >= 50].shape
(216, 16)
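To see what the filter is doing under the hood, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its values are invented for illustration and are not taken from heart.csv.

```python
import pandas as pd

# Hypothetical toy data, not the real heart.csv
toy = pd.DataFrame({
    "age": [45, 50, 63, 38],
    "sex": ["Male", "Female", "Male", "Female"],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows
mask = toy["age"] >= 50
older = toy[mask]

# .shape is (rows, columns), so the first element is the observation count
n_rows, n_cols = older.shape
print(n_rows, n_cols)
```

Because `.shape` also reports the column count, `len(older)` or `mask.sum()` are handy when only the number of observations is needed.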
Using the original heart data, filter for those observations that are male and 50 years or older. How many observations are there?
heart[
    (heart["age"] >= 50) & (heart["sex"] == "Male")
].shape
(143, 16)
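When combining conditions, each comparison needs its own parentheses because `&` binds more tightly than `>=` and `==`. The sketch below, again on invented toy data, shows the mask form next to what should be an equivalent `DataFrame.query` call.

```python
import pandas as pd

# Hypothetical toy data, not the real heart.csv
toy = pd.DataFrame({
    "age": [45, 50, 63, 38, 71],
    "sex": ["Male", "Female", "Male", "Female", "Male"],
})

# Boolean-mask form: parenthesize every comparison before joining with &
by_mask = toy[(toy["age"] >= 50) & (toy["sex"] == "Male")]

# The same filter written as a query string
by_query = toy.query("age >= 50 and sex == 'Male'")

print(len(by_mask), by_mask.equals(by_query))
```

Which form to use is mostly a matter of taste; the mask form is what the rest of this case study uses.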
Using the original heart data, filter for those observations that are female, 50 years or younger, and have the disease (disease = 1). Select chest_pain, chol, and max_hr columns. How many rows and columns are in the resulting DataFrame?
heart.loc[
    (heart["sex"] == "Female") &
    (heart["age"] <= 50) &
    (heart["disease"] == 1),
    ["chest_pain", "chol", "max_hr"]
].shape
(1, 3)
Are there any missing values in this data? If so, which columns? For these columns, fill the missing values with the value that appears most often (aka "mode"). This is a multi-step process and it would be worth reviewing the .fillna() docs.
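# Names of the columns that contain at least one missing value
cols_miss = list(
    heart.isnull().sum()
    [(heart.isnull().sum() > 0)].index
)
# mode() returns a DataFrame; its first row holds the most frequent value per column
print(heart[cols_miss].mode().iloc[0])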
ca         0.0
thal    normal
Name: 0, dtype: object
heart = heart.fillna(heart[cols_miss].mode().iloc[0])
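The multi-step process can be traced on a tiny example. This is only a sketch on made-up values (the `toy` frame is hypothetical), but it follows the same steps: find the columns with missing values, take the first row of `.mode()`, and pass that Series to `.fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real heart.csv
toy = pd.DataFrame({
    "ca": [0.0, 0.0, 2.0, np.nan],
    "thal": ["normal", None, "normal", "fixed"],
    "chol": [233, 286, 229, 250],
})

# Step 1: columns that contain at least one missing value
missing_counts = toy.isnull().sum()
cols_miss = missing_counts[missing_counts > 0].index.tolist()

# Step 2: mode() returns a DataFrame (ties add extra rows), so keep the first row
fill_values = toy[cols_miss].mode().iloc[0]

# Step 3: fillna with a Series fills each column using the value under its label
filled = toy.fillna(fill_values)
print(filled.isnull().sum().sum())
```

Taking `.iloc[0]` matters because a column with tied modes would otherwise yield more than one candidate fill value.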
Create a new column called risk that is equal to $ \frac{age}{{rest\_bp} + chol + {max\_hr}} $. What is the mean of this risk column?
heart = heart.assign(
    risk=heart["age"] / (
        heart["rest_bp"] + heart["chol"] + heart["max_hr"]
    )
)
heart["risk"].mean()
0.10426734916395465
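As a hand check of the formula, the risk for the first row of the table shown earlier (age 63, rest_bp 145, chol 233, max_hr 150) can be worked out directly. The variable names here are just for illustration.

```python
# Values copied from row 0 of the heart table displayed above
age, rest_bp, chol, max_hr = 63, 145, 233, 150

# risk = age / (rest_bp + chol + max_hr) = 63 / 528
risk_row0 = age / (rest_bp + chol + max_hr)
print(round(risk_row0, 4))  # 0.1193
```

A single-row value near 0.12 is in the same range as the column mean reported above, which is a reasonable sign the formula was applied as intended.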
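Replace the values in the rest_ecg column so that normal remains normal, left ventricular hypertrophy becomes lvh, and ST-T wave abnormality becomes stt_wav_abn.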
Hint: one of the original values may have an extra space at the end of the name!
How many observations fall into each of the new rest_ecg categories?
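# "normal" already carries the desired label, so this assignment changes nothing
heart.loc[heart["rest_ecg"] == "normal", "rest_ecg"] = "normal"
# Note the trailing space in the original value
heart.loc[
    heart["rest_ecg"] == "left ventricular hypertrophy ",
    "rest_ecg"
] = "lvh"
heart.loc[
    heart["rest_ecg"] == "ST-T wave abnormality",
    "rest_ecg"
] = "stt_wav_abn"
# Count the rows that now carry one of the three new labels
heart["rest_ecg"].isin(["normal", "lvh", "stt_wav_abn"]).sum()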
303
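The check above confirms that all 303 rows carry one of the three new labels, but it does not break the count out per category. One possible way to get per-category counts, sketched here on an invented Series rather than the real column, is to strip whitespace, recode with a dictionary, and call `.value_counts()`.

```python
import pandas as pd

# Hypothetical toy values; the second entry has a trailing space on purpose
toy = pd.Series(
    [
        "normal",
        "left ventricular hypertrophy ",
        "ST-T wave abnormality",
        "normal",
    ],
    name="rest_ecg",
)

# "normal" is left out of the mapping because it keeps its label
recode = {
    "left ventricular hypertrophy": "lvh",
    "ST-T wave abnormality": "stt_wav_abn",
}

# Strip stray whitespace first so the keys match regardless of trailing spaces
cleaned = toy.str.strip().replace(recode)

# One row per category with the number of observations in each
counts = cleaned.value_counts()
print(counts)
```

Stripping before recoding avoids having to hard-code the trailing space in the lookup key.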
To use age groups, we'll need to make a new column from the existing age column.
def categorize_age(age):
    # Label each age with the upper bound of its ten-year group
    if age <= 40:
        return 40
    elif age <= 50:
        return 50
    elif age <= 60:
        return 60
    elif age <= 70:
        return 70
    else:
        return 80

heart = heart.assign(
    age_group=heart["age"].apply(categorize_age)
)
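An alternative to the row-by-row function is `pd.cut`, which bins a whole column at once. The sketch below uses made-up ages and bin edges chosen to mirror the `<=` thresholds above; it is offered as an option, not as the approach this case study requires.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([29, 40, 41, 50, 60, 61, 70, 77])

# Bins are right-inclusive by default, so (40, 50] maps to 50, mirroring `age <= 50`
age_group = pd.cut(
    ages,
    bins=[-np.inf, 40, 50, 60, 70, np.inf],
    labels=[40, 50, 60, 70, 80],
)

print(age_group.tolist())  # [40, 40, 50, 50, 60, 70, 70, 80]
```

Note that `pd.cut` returns a categorical column; if plain integers are needed downstream, `age_group.astype(int)` should convert it.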
What is the mean resting blood pressure for males and females?
heart.groupby("sex").agg({"rest_bp": "mean"})
| sex | rest_bp |
|---|---|
| Female | 133.340206 |
| Male | 130.912621 |
What are the mean and median cholesterol levels for males and females?
heart.groupby("sex").agg({
    "chol": ["mean", "median"]
})
| sex | chol (mean) | chol (median) |
|---|---|---|
| Female | 261.752577 | 254.0 |
| Male | 239.601942 | 235.0 |
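Passing a list of functions to `.agg()` produces two-level column labels, which is why the output above has `chol` stacked over `mean` and `median`. If flat column names are preferred, named aggregation is one option. The sketch below uses invented numbers, and the output names `chol_mean` and `chol_median` are my own choice.

```python
import pandas as pd

# Hypothetical toy data, not the real heart.csv
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Male"],
    "chol": [200, 300, 210, 240, 270],
})

# Named aggregation: new_column=(source_column, function)
summary = toy.groupby("sex").agg(
    chol_mean=("chol", "mean"),
    chol_median=("chol", "median"),
)
print(summary)
```

The result has one row per sex and single-level column names, which tends to be easier to sort or merge later.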
Which age group has the highest median cholesterol level for males?
heart[heart["sex"] == "Male"].groupby(
    "age_group", as_index=False
).agg(
    {"chol": "median"}
).sort_values(
    "chol", ascending=False
).iloc[0]["age_group"]
80.0
Compute the mean risk value (the risk column was created in problem 2 of the "Manipulating data" section) for each age group and sex. Which sex and age group has the highest average risk value?
heart.groupby(
    ["age_group", "sex"], as_index=False
).agg(
    {"risk": "mean"}
).sort_values(
    "risk", ascending=False
).iloc[0]
age_group          80
sex            Female
risk         0.150236
Name: 8, dtype: object
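Sorting and taking the first row works; a shorter route to the same answer is `.idxmax()` on the grouped means, which returns the index label of the largest value. This is a sketch on invented numbers, not the real heart data.

```python
import pandas as pd

# Hypothetical toy data, not the real heart.csv
toy = pd.DataFrame({
    "age_group": [50, 50, 60, 60, 60],
    "sex": ["Male", "Female", "Male", "Female", "Female"],
    "risk": [0.10, 0.12, 0.11, 0.15, 0.13],
})

# Mean risk per (age_group, sex) pair, indexed by that pair
mean_risk = toy.groupby(["age_group", "sex"])["risk"].mean()

# idxmax returns the (age_group, sex) label of the largest mean
top_group = mean_risk.idxmax()
print(top_group, mean_risk.max())
```

Here the two Female rows in age group 60 average to 0.14, so that pair is reported as the top group.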