For all Case Study 2 questions we will be using the heart.csv data provided with this case study. Along with the heart.csv data, I have provided a heart_data_dictionary.csv file that provides a description of each column. As you answer the lab questions, it may be beneficial to reference this data dictionary.
import pandas as pd
heart = pd.read_csv("../data/heart.csv")
heart
| | age | sex | chest_pain | rest_bp | chol | fbs | rest_ecg | max_hr | exang | old_peak | slope | ca | thal | disease |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | Male | typical | 145 | 233 | 1 | left ventricular hypertrophy | 150 | 0 | 2.3 | 3 | 0.0 | fixed | 0 |
1 | 67 | Male | asymptomatic | 160 | 286 | 0 | left ventricular hypertrophy | 108 | 1 | 1.5 | 2 | 3.0 | normal | 1 |
2 | 67 | Male | asymptomatic | 120 | 229 | 0 | left ventricular hypertrophy | 129 | 1 | 2.6 | 2 | 2.0 | reversable | 1 |
3 | 37 | Male | nonanginal | 130 | 250 | 0 | normal | 187 | 0 | 3.5 | 3 | 0.0 | normal | 0 |
4 | 41 | Female | nontypical | 130 | 204 | 0 | left ventricular hypertrophy | 172 | 0 | 1.4 | 1 | 0.0 | normal | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
298 | 45 | Male | typical | 110 | 264 | 0 | normal | 132 | 0 | 1.2 | 2 | 0.0 | reversable | 1 |
299 | 68 | Male | asymptomatic | 144 | 193 | 1 | normal | 141 | 0 | 3.4 | 2 | 2.0 | reversable | 1 |
300 | 57 | Male | asymptomatic | 130 | 131 | 0 | normal | 115 | 1 | 1.2 | 2 | 1.0 | reversable | 1 |
301 | 57 | Female | nontypical | 130 | 236 | 0 | left ventricular hypertrophy | 174 | 0 | 0.0 | 2 | 1.0 | normal | 1 |
302 | 38 | Male | nonanginal | 138 | 175 | 0 | normal | 173 | 0 | 0.0 | 1 | NaN | normal | 0 |
303 rows × 14 columns
Filter the heart data for all observations where the person is 50 years or older. How many observations are there?
heart[heart["age"] >= 50].shape
(216, 16)
Using the original heart data, filter for those observations that are male and 50 years or older. How many observations are there?
heart[
    (heart["age"] >= 50) & (heart["sex"] == "Male")
].shape
(143, 16)
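As a minimal sketch of why each condition above sits in its own parentheses: `&` binds more tightly than comparison operators such as `>=` and `==`, so without the parentheses Python would evaluate the expression in the wrong order. The `toy` DataFrame below is hypothetical stand-in data (not the heart data), and `.query()` is shown only as an equivalent alternative.

```python
import pandas as pd

# Hypothetical stand-in data, NOT heart.csv
toy = pd.DataFrame({
    "age": [45, 50, 62, 71],
    "sex": ["Male", "Female", "Male", "Female"],
})

# Each comparison is parenthesized before combining with &
mask = (toy["age"] >= 50) & (toy["sex"] == "Male")
by_mask = toy[mask]

# The same filter expressed as a query string
by_query = toy.query("age >= 50 and sex == 'Male'")

# .shape is a (rows, columns) tuple
print(by_mask.shape)
```

Both approaches should select the same rows; the mask form is the one used in the solutions here.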
Using the original heart data, filter for those observations that are female, 50 years or younger, and have the disease (disease = 1). Select the chest_pain, chol, and max_hr columns. How many rows and columns are in the resulting DataFrame?
heart.loc[
    (heart["sex"] == "Female") &
    (heart["age"] <= 50) &
    (heart["disease"] == 1),
    ["chest_pain", "chol", "max_hr"]
].shape
(1, 3)
Are there any missing values in this data? If so, which columns? For these columns, fill the missing values with the value that appears most often (aka "mode"). This is a multi-step process and it would be worth reviewing the .fillna() docs.
cols_miss = list(
    heart.isnull().sum()
    [(heart.isnull().sum() > 0)].index
)
print(heart[cols_miss].mode().iloc[0])
ca         0.0
thal    normal
Name: 0, dtype: object
heart = heart.fillna(heart[cols_miss].mode().iloc[0])
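The steps above (find the columns with missing values, take each column's mode, fill) can be sketched on a small example. The `toy` DataFrame and its values are hypothetical, chosen only so the result is easy to check by eye; `columns[...isnull().any()]` is an alternative way to list the affected columns, not the approach used in the solution.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, NOT heart.csv
toy = pd.DataFrame({
    "ca": [0.0, 0.0, 2.0, np.nan],
    "thal": ["normal", None, "normal", "fixed"],
    "chol": [233, 286, 229, 250],
})

# Columns that contain at least one missing value
cols_miss = toy.columns[toy.isnull().any()].tolist()

# mode() returns a DataFrame (ties would produce extra rows);
# row 0 holds the first mode of each column
fill_values = toy[cols_miss].mode().iloc[0]

# fillna with a Series fills each column using the matching index label
filled = toy.fillna(fill_values)

print(cols_miss)
print(filled.isnull().sum().sum())
```

Taking `.iloc[0]` matters because `.mode()` can return more than one row when values tie for most frequent.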
Create a new column called risk that is equal to $ \frac{age}{{rest\_bp} + chol + {max\_hr}} $. What is the mean of this risk column?
heart = heart.assign(
    risk=heart["age"] / (
        heart["rest_bp"] + heart["chol"] + heart["max_hr"]
    )
)
heart["risk"].mean()
0.10426734916395465
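As a quick sanity check of the formula, the risk value for a single observation can be worked out by hand. The numbers below are taken from row 0 of the heart data shown at the top (age 63, rest_bp 145, chol 233, max_hr 150); the variable name `risk_row0` is just for illustration.

```python
# Values from row 0 of the heart data displayed earlier
age, rest_bp, chol, max_hr = 63, 145, 233, 150

# risk = age / (rest_bp + chol + max_hr) = 63 / 528
risk_row0 = age / (rest_bp + chol + max_hr)

print(round(risk_row0, 4))  # 0.1193
```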
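Replace the values in the rest_ecg column so that:
"normal" stays "normal"
"left ventricular hypertrophy" becomes "lvh"
"ST-T wave abnormality" becomes "stt_wav_abn"
Hint: one of the original values may have an extra space at the end of the name!
How many observations fall into each of the new rest_ecg categories?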
heart.loc[heart["rest_ecg"] == "normal", "rest_ecg"] = "normal"
heart.loc[
    heart["rest_ecg"] == "left ventricular hypertrophy ",
    "rest_ecg"
] = "lvh"
heart.loc[
    heart["rest_ecg"] == "ST-T wave abnormality",
    "rest_ecg"
] = "stt_wav_abn"
heart["rest_ecg"].isin(["normal", "lvh", "stt_wav_abn"]).sum()
303
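The check above confirms that all 303 rows landed in one of the three new categories, but it does not break the total down per category. One way to get the per-category counts is `.value_counts()`. The sketch below uses a hypothetical four-value Series rather than the heart data, and it swaps in `.str.strip()` plus `.map()` as an alternative to the three `.loc` assignments; the `recode` dictionary name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in values, NOT heart.csv; note the trailing space
toy = pd.Series(
    ["normal", "left ventricular hypertrophy ", "normal", "ST-T wave abnormality"],
    name="rest_ecg",
)

recode = {
    "normal": "normal",
    "left ventricular hypertrophy": "lvh",
    "ST-T wave abnormality": "stt_wav_abn",
}

# Strip stray whitespace first so the dictionary keys match exactly
recoded = toy.str.strip().map(recode)

# Number of observations in each new category
counts = recoded.value_counts()
print(counts)
```

Note that `.map()` turns any value missing from the dictionary into NaN, so it is worth confirming there are no unexpected missing values afterwards.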
To use age groups, we'll need to make a new column from the existing age column.
def categorize_age(age):
    if age <= 40:
        return 40
    elif age <= 50:
        return 50
    elif age <= 60:
        return 60
    elif age <= 70:
        return 70
    else:
        return 80

heart = heart.assign(
    age_group=heart["age"].apply(categorize_age)
)
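The same binning can also be sketched with `pd.cut`, which avoids the row-by-row `.apply()`. This is an alternative, not the approach used above; the `ages` Series is hypothetical, and the bin edges are chosen to mirror the `<=` comparisons in `categorize_age` (each bin is closed on the right, which is `pd.cut`'s default).

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including values that sit exactly on a bin edge
ages = pd.Series([29, 40, 41, 50, 60, 61, 70, 77])

# Bins are (-inf, 40], (40, 50], (50, 60], (60, 70], (70, inf)
age_group = pd.cut(
    ages,
    bins=[-np.inf, 40, 50, 60, 70, np.inf],
    labels=[40, 50, 60, 70, 80],
).astype(int)

print(age_group.tolist())
```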
What is the mean resting blood pressure for males and females?
heart.groupby("sex").agg({"rest_bp": "mean"})
sex | rest_bp |
---|---|
Female | 133.340206 |
Male | 130.912621 |
What are the mean and median cholesterol levels for males and females?
heart.groupby("sex").agg({
    "chol": ["mean", "median"]
})
sex | chol mean | chol median |
---|---|---|
Female | 261.752577 | 254.0 |
Male | 239.601942 | 235.0 |
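Passing a list of functions to `.agg()` produces two-level column labels (`chol` over `mean` and `median`). If flat column names are preferred, pandas' named aggregation is one option. The sketch below uses hypothetical data, and the output names `chol_mean` and `chol_median` are illustrative choices.

```python
import pandas as pd

# Hypothetical stand-in data, NOT heart.csv
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Male"],
    "chol": [200, 300, 180, 220, 260],
})

# Named aggregation: new_column=(source_column, function)
summary = toy.groupby("sex").agg(
    chol_mean=("chol", "mean"),
    chol_median=("chol", "median"),
)

print(summary)
```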
Which age group has the highest median cholesterol level for males?
heart[heart["sex"] == "Male"].groupby(
    "age_group", as_index=False
).agg(
    {"chol": "median"}
).sort_values(
    "chol", ascending=False
).iloc[0]["age_group"]
80.0
Compute the mean risk value (the risk column was created in problem 2 of the "Manipulating data" section) for each age group and sex. Which gender and age group has the highest average risk value?
heart.groupby(
    ["age_group", "sex"], as_index=False
).agg(
    {"risk": "mean"}
).sort_values(
    "risk", ascending=False
).iloc[0]
age_group          80
sex            Female
risk         0.150236
Name: 8, dtype: object
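Sorting and taking the first row works; an alternative sketch is to keep the group keys in the index and ask for the label of the maximum with `.idxmax()`. The data below is hypothetical and only illustrates the pattern.

```python
import pandas as pd

# Hypothetical stand-in data, NOT heart.csv
toy = pd.DataFrame({
    "age_group": [50, 50, 60, 60, 60],
    "sex": ["Male", "Female", "Male", "Female", "Female"],
    "risk": [0.10, 0.12, 0.09, 0.15, 0.17],
})

# Mean risk per (age_group, sex); the result is a Series with a MultiIndex
mean_risk = toy.groupby(["age_group", "sex"])["risk"].mean()

# idxmax returns the index label of the largest value, here an (age_group, sex) tuple
top_group = mean_risk.idxmax()

print(top_group, mean_risk.max())
```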