Case Study 2

For all Case Study 2 questions, we will use the heart.csv data provided with this case study. I have also provided a heart_data_dictionary.csv file that describes each column; you may find it helpful to reference this data dictionary as you answer the lab questions.

In [1]:
import pandas as pd

heart = pd.read_csv("../data/heart.csv")
In [50]:
heart
Out[50]:
     age     sex    chest_pain  rest_bp  chol  fbs                      rest_ecg  max_hr  exang  old_peak  slope   ca        thal  disease
  0   63    Male       typical      145   233    1  left ventricular hypertrophy     150      0       2.3      3  0.0       fixed        0
  1   67    Male  asymptomatic      160   286    0  left ventricular hypertrophy     108      1       1.5      2  3.0      normal        1
  2   67    Male  asymptomatic      120   229    0  left ventricular hypertrophy     129      1       2.6      2  2.0  reversable        1
  3   37    Male    nonanginal      130   250    0                        normal     187      0       3.5      3  0.0      normal        0
  4   41  Female    nontypical      130   204    0  left ventricular hypertrophy     172      0       1.4      1  0.0      normal        0
...  ...     ...           ...      ...   ...  ...                           ...     ...    ...       ...    ...  ...         ...      ...
298   45    Male       typical      110   264    0                        normal     132      0       1.2      2  0.0  reversable        1
299   68    Male  asymptomatic      144   193    1                        normal     141      0       3.4      2  2.0  reversable        1
300   57    Male  asymptomatic      130   131    0                        normal     115      1       1.2      2  1.0  reversable        1
301   57  Female    nontypical      130   236    0  left ventricular hypertrophy     174      0       0.0      2  1.0      normal        1
302   38    Male    nonanginal      138   175    0                        normal     173      0       0.0      1  NaN      normal        0

303 rows × 14 columns

Subsetting data

  1. Filter the heart data for all observations where the person is 50 years or older. How many observations are there?
  2. Using the original heart data, filter for those observations that are male and 50 years or older. How many observations are there?
  3. Using the original heart data, filter for those observations that are female, 50 years or younger, and have the disease (disease = 1). Select chest_pain, chol, and max_hr columns. How many rows and columns are in the resulting DataFrame?

Question 1

Filter the heart data for all observations where the person is 50 years or older. How many observations are there?

In [96]:
heart[heart["age"] >= 50].shape
Out[96]:
(216, 16)

Question 2

Using the original heart data, filter for those observations that are male and 50 years or older. How many observations are there?

In [97]:
heart[
    (heart["age"] >= 50) & (heart["sex"] == "Male")
].shape
Out[97]:
(143, 16)

Question 3

Using the original heart data, filter for those observations that are female, 50 years or younger, and have the disease (disease = 1). Select chest_pain, chol, and max_hr columns. How many rows and columns are in the resulting DataFrame?

In [53]:
heart.loc[
    (heart["sex"] == "Female") &
    (heart["age"] <= 50) &
    (heart["disease"] == 1),
    ["chest_pain", "chol", "max_hr"]
].shape
Out[53]:
(1, 3)
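
All three filters in this section rely on boolean masks combined with `&`. As a minimal sketch, the made-up frame below (the `toy` data is hypothetical and only mimics a few heart.csv columns) shows why each comparison needs its own parentheses and how `.loc` takes a row mask and a column list together:

```python
import pandas as pd

# Hypothetical toy data, only mimicking a few heart.csv columns.
toy = pd.DataFrame({
    "age": [45, 52, 50, 61, 38],
    "sex": ["Female", "Male", "Female", "Female", "Male"],
    "disease": [1, 1, 0, 1, 0],
    "chol": [210, 250, 199, 280, 175],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than == and <=.
mask = (toy["sex"] == "Female") & (toy["age"] <= 50) & (toy["disease"] == 1)

# .loc accepts the row mask and a list of columns in one step.
subset = toy.loc[mask, ["chol"]]
n_rows, n_cols = subset.shape
```

Only the first toy row satisfies all three conditions, so `subset` has one row and one column.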

Manipulating data

  1. Are there any missing values in this data? If so, which columns? For these columns, fill the missing values with the value that appears most often (aka "mode"). This is a multi-step process and it would be worth reviewing the .fillna() docs.

  2. Create a new column called risk that is equal to $ \frac{age}{rest\_bp + chol + max\_hr} $. What is the mean of this risk column?

  3. Replace the values in the rest_ecg column so that:

    • normal = normal
    • left ventricular hypertrophy = lvh
    • ST-T wave abnormality = stt_wav_abn

    Hint: one of the original values may have an extra space at the end of the name!

    How many observations fall into each of the new rest_ecg categories?

Question 1

Are there any missing values in this data? If so, which columns? For these columns, fill the missing values with the value that appears most often (aka "mode"). This is a multi-step process and it would be worth reviewing the .fillna() docs.

In [54]:
# Count missing values per column, then keep the columns that have any.
n_missing = heart.isnull().sum()
cols_miss = n_missing[n_missing > 0].index.tolist()

print(heart[cols_miss].mode().iloc[0])
ca         0.0
thal    normal
Name: 0, dtype: object
In [55]:
heart = heart.fillna(heart[cols_miss].mode().iloc[0])
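
The multi-step process can be traced on a small example. The sketch below uses a hypothetical four-row frame (`toy`, with invented values) to show the three pieces: finding the columns with missing values, taking the first mode of each, and passing that Series to `.fillna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each of two columns.
toy = pd.DataFrame({
    "ca": [0.0, 1.0, np.nan, 0.0],
    "thal": ["normal", None, "fixed", "normal"],
    "age": [63, 67, 41, 57],
})

# Step 1: count missing values per column and keep columns that have any.
n_missing = toy.isnull().sum()
cols_miss = n_missing[n_missing > 0].index.tolist()

# Step 2: .mode() returns a DataFrame (ties add extra rows);
# row 0 holds the first mode of each column.
modes = toy[cols_miss].mode().iloc[0]

# Step 3: fillna with a Series fills each column using the value
# whose label matches the column name.
filled = toy.fillna(modes)
```

Because `modes` is labeled by column name, columns without missing values (here `age`) are left untouched.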

Question 2

Create a new column called risk that is equal to $ \frac{age}{rest\_bp + chol + max\_hr} $. What is the mean of this risk column?

In [59]:
heart = heart.assign(
    risk=heart["age"] / (
        heart["rest_bp"] + heart["chol"] + heart["max_hr"]
    )
)
heart["risk"].mean()
Out[59]:
0.10426734916395465
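
As a quick hand check of the formula, the sketch below recomputes risk for the first two rows of the heart data as displayed at the top of this case study. The frame name `sample` is an assumption made for illustration:

```python
import pandas as pd

# First two rows of the heart data, copied from the displayed output.
sample = pd.DataFrame({
    "age": [63, 67],
    "rest_bp": [145, 160],
    "chol": [233, 286],
    "max_hr": [150, 108],
})

# Column arithmetic is element-wise, so the formula is applied
# to every row at once.
denominator = sample["rest_bp"] + sample["chol"] + sample["max_hr"]
risk = sample["age"] / denominator
```

For the first row this is 63 / (145 + 233 + 150) = 63 / 528, or roughly 0.119.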

Question 3

Replace the values in the rest_ecg column so that:

  • normal = normal
  • left ventricular hypertrophy = lvh
  • ST-T wave abnormality = stt_wav_abn

Hint: one of the original values may have an extra space at the end of the name!

How many observations fall into each of the new rest_ecg categories?

In [99]:
heart.loc[heart["rest_ecg"] == "normal", "rest_ecg"] = "normal"
heart.loc[
    heart["rest_ecg"] == "left ventricular hypertrophy ",
    "rest_ecg"
] = "lvh"
heart.loc[
    heart["rest_ecg"] == "ST-T wave abnormality",
    "rest_ecg"
] = "stt_wav_abn"

# Sanity check: count the rows that now carry one of the three new labels.
heart["rest_ecg"].isin(["normal", "lvh", "stt_wav_abn"]).sum()
Out[99]:
303
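
To get a count for each category rather than a single total, one option is to strip whitespace, map the labels, and call `.value_counts()`. This is a sketch on invented labels (the `ecg` Series is hypothetical, with a trailing space that mimics the hint), not a rerun on heart.csv:

```python
import pandas as pd

# Hypothetical labels; the trailing space on one value mimics the hint.
ecg = pd.Series([
    "normal",
    "left ventricular hypertrophy ",
    "ST-T wave abnormality",
    "normal",
    "left ventricular hypertrophy ",
    "normal",
])

mapping = {
    "normal": "normal",
    "left ventricular hypertrophy": "lvh",
    "ST-T wave abnormality": "stt_wav_abn",
}

# Strip stray whitespace first so the keys match regardless of
# which label carries the extra space.
recoded = ecg.str.strip().map(mapping)

# value_counts gives one count per category.
counts = recoded.value_counts()
```

Using `.map()` has one side effect worth knowing: any label missing from `mapping` becomes NaN, which makes unexpected values easy to spot with `recoded.isna().sum()`.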

Summarizing data

  1. What is the mean resting blood pressure for males and females?
  2. What are the mean and median cholesterol levels for males and females?
  3. Which age group has the largest median cholesterol level for males?
  4. Compute the mean risk value (the risk column was created in problem 2 of the "Manipulating data" section) for each age group and sex. Which sex and age group has the highest average risk value?

To use age groups, we'll need to make a new column from the existing age column.

In [2]:
def categorize_age(age):
    if age <= 40:
        return 40
    elif age <= 50:
        return 50
    elif age <= 60:
        return 60
    elif age <= 70:
        return 70
    else:
        return 80
        
heart = heart.assign(
    age_group=heart["age"].apply(categorize_age)
)
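
An alternative to the hand-written function is `pd.cut`, which bins a numeric column in one vectorized call. The sketch below is an assumption about how the same groups could be built, shown on made-up ages rather than the heart data:

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to hit every bin and the bin edges.
ages = pd.Series([29, 40, 41, 50, 60, 61, 70, 77])

# Bins are right-inclusive by default, matching the <= checks
# in categorize_age: (-inf, 40], (40, 50], ..., (70, inf).
binned = pd.cut(
    ages,
    bins=[-np.inf, 40, 50, 60, 70, np.inf],
    labels=[40, 50, 60, 70, 80],
)

# pd.cut returns a categorical; cast back to plain integers
# so the column sorts and groups like the original version.
age_group = binned.astype(int)
```

The integer cast assumes there are no missing ages; with missing values the categorical would need to be handled before casting.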

Question 1

What is the mean resting blood pressure for males and females?

In [72]:
heart.groupby("sex").agg({"rest_bp": "mean"})
Out[72]:
           rest_bp
sex
Female  133.340206
Male    130.912621

Question 2

What are the mean and median cholesterol levels for males and females?

In [74]:
heart.groupby("sex").agg({
    "chol": ["mean", "median"]
})
Out[74]:
              chol
              mean  median
sex
Female  261.752577   254.0
Male    239.601942   235.0

Question 3

Which age group has the largest median cholesterol level for males?

In [90]:
heart[heart["sex"] == "Male"].groupby(
    "age_group", as_index=False
).agg(
    {"chol": "median"}
).sort_values(
    "chol", ascending=False
).iloc[0]["age_group"]
Out[90]:
80.0

Question 4

Compute the mean risk value (the risk column was created in problem 2 of the "Manipulating data" section) for each age group and sex. Which sex and age group has the highest average risk value?

In [95]:
heart.groupby(
    ["age_group", "sex"], as_index=False
).agg(
    {"risk": "mean"}
).sort_values(
    "risk", ascending=False
).iloc[0]
Out[95]:
age_group          80
sex            Female
risk         0.150236
Name: 8, dtype: object
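
Sorting and taking the first row works; a shorter route to the same answer is `.idxmax()` on the grouped means. The sketch below uses a hypothetical frame (`toy`, with invented risk values) to show the pattern:

```python
import pandas as pd

# Hypothetical data; the risk values are invented for illustration.
toy = pd.DataFrame({
    "age_group": [50, 50, 60, 60, 60, 70],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "risk": [0.10, 0.08, 0.12, 0.14, 0.09, 0.11],
})

# Mean risk per (age_group, sex); the result is a Series
# indexed by both grouping columns.
mean_risk = toy.groupby(["age_group", "sex"])["risk"].mean()

# idxmax returns the index label (here a tuple) of the largest mean.
top_group = mean_risk.idxmax()
top_value = mean_risk.max()
```

In this toy data the two 60-year-group males average (0.12 + 0.14) / 2 = 0.13, so `top_group` is the `(60, "Male")` pair.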