Case Study 2

All Case Study 2 questions use the heart.csv data provided with this case study. A companion file, heart_data_dictionary.csv, describes each column; refer to it as you answer the lab questions.

In [1]:

import pandas as pd

heart = pd.read_csv("../data/heart.csv")

In [50]:

heart

Out[50]:

	age	sex	chest_pain	rest_bp	chol	fbs	rest_ecg	max_hr	exang	old_peak	slope	ca	thal	disease
0	63	Male	typical	145	233	1	left ventricular hypertrophy	150	0	2.3	3	0.0	fixed	0
1	67	Male	asymptomatic	160	286	0	left ventricular hypertrophy	108	1	1.5	2	3.0	normal	1
2	67	Male	asymptomatic	120	229	0	left ventricular hypertrophy	129	1	2.6	2	2.0	reversable	1
3	37	Male	nonanginal	130	250	0	normal	187	0	3.5	3	0.0	normal	0
4	41	Female	nontypical	130	204	0	left ventricular hypertrophy	172	0	1.4	1	0.0	normal	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
298	45	Male	typical	110	264	0	normal	132	0	1.2	2	0.0	reversable	1
299	68	Male	asymptomatic	144	193	1	normal	141	0	3.4	2	2.0	reversable	1
300	57	Male	asymptomatic	130	131	0	normal	115	1	1.2	2	1.0	reversable	1
301	57	Female	nontypical	130	236	0	left ventricular hypertrophy	174	0	0.0	2	1.0	normal	1
302	38	Male	nonanginal	138	175	0	normal	173	0	0.0	1	NaN	normal	0

303 rows × 14 columns

Subsetting data

1. Filter the heart data for all observations where the person is 50 years or older. How many observations are there?
2. Using the original heart data, filter for those observations that are male and 50 years or older. How many observations are there?
3. Using the original heart data, filter for those observations that are female, 50 years or younger, and have the disease (disease = 1). Select chest_pain, chol, and max_hr columns. How many rows and columns are in the resulting DataFrame?

Question 1

Filter the heart data for all observations where the person is 50 years or older. How many observations are there?

In [96]:

# Count the rows that pass the filter (the first element of .shape).
heart[heart["age"] >= 50].shape[0]

Out[96]:

216
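As a quick sanity check on a filter like this, the boolean mask itself can be counted. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not the heart data) and shows three counting idioms that should agree.

```python
import pandas as pd

# Hypothetical toy data, not the heart.csv file.
toy = pd.DataFrame({"age": [45, 50, 63, 38, 57]})

mask = toy["age"] >= 50             # boolean Series, True where age >= 50

n_from_shape = toy[mask].shape[0]   # rows in the filtered frame
n_from_len = len(toy[mask])         # same count via len()
n_from_sum = int(mask.sum())        # each True counts as 1

print(n_from_shape, n_from_len, n_from_sum)  # 3 3 3
```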

Question 2

Using the original heart data, filter for those observations that are male and 50 years or older. How many observations are there?

In [97]:

# Wrap each condition in parentheses before combining them with &.
heart[
    (heart["age"] >= 50) & (heart["sex"] == "Male")
].shape[0]

Out[97]:

143
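The parentheses around each condition matter because `&` binds more tightly than `>=` and `==` in Python. The following sketch, on made-up data (`toy` is hypothetical), shows the combined mask next to an equivalent `DataFrame.query` call.

```python
import pandas as pd

# Hypothetical toy data, not the heart.csv file.
toy = pd.DataFrame({
    "age": [45, 50, 63, 38, 57],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
})

# Each comparison gets its own parentheses before being combined with &.
older_males = toy[(toy["age"] >= 50) & (toy["sex"] == "Male")]

# DataFrame.query expresses the same filter as a string expression.
older_males_q = toy.query("age >= 50 and sex == 'Male'")

print(len(older_males))  # 1
```

Either form is fine; the mask version is the one used throughout this case study.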

Question 3

Using the original heart data, filter for those observations that are female, 50 years or younger, and have the disease (disease = 1). Select chest_pain, chol, and max_hr columns. How many rows and columns are in the resulting DataFrame?

In [53]:

heart.loc[
    (heart["sex"] == "Female") &
    (heart["age"] <= 50) &
    (heart["disease"] == 1),
    ["chest_pain", "chol", "max_hr"]
].shape

Out[53]:

(1, 3)
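To see how `.loc` takes a row mask and a column list at the same time, here is a minimal sketch on made-up data (`toy` is hypothetical). The first element of `.shape` comes from the mask, the second from the column list.

```python
import pandas as pd

# Hypothetical toy data, not the heart.csv file.
toy = pd.DataFrame({
    "age": [45, 50, 63, 38],
    "sex": ["Female", "Female", "Male", "Female"],
    "disease": [1, 0, 1, 1],
    "chol": [204, 250, 233, 175],
    "max_hr": [172, 187, 150, 173],
})

# Row selector: all three conditions must hold.
rows = (toy["sex"] == "Female") & (toy["age"] <= 50) & (toy["disease"] == 1)

# Column selector: a plain list of column names.
subset = toy.loc[rows, ["chol", "max_hr"]]

print(subset.shape)  # (2, 2)
```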

Manipulating data

1. Are there any missing values in this data? If so, which columns? For these columns, fill the missing values with the value that appears most often (aka "mode"). This is a multi-step process and it would be worth reviewing the .fillna() docs.
2. Create a new column called risk that is equal to $\frac{\text{age}}{\text{rest\_bp} + \text{chol} + \text{max\_hr}}$. What is the mean of this risk column?
3. Replace the values in the rest_ecg column so that:
   - normal = normal
   - left ventricular hypertrophy = lvh
   - ST-T wave abnormality = stt_wav_abn

   Hint: one of the original values may have an extra space at the end of the name!

   How many observations fall into each of the new rest_ecg categories?

Question 1

Are there any missing values in this data? If so, which columns? For these columns, fill the missing values with the value that appears most often (aka "mode"). This is a multi-step process and it would be worth reviewing the .fillna() docs.

In [54]:

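# Count missing values per column and keep the columns that have any.
n_missing = heart.isnull().sum()
cols_miss = list(n_missing[n_missing > 0].index)

# Most frequent value (mode) of each column with missing data.
print(heart[cols_miss].mode().iloc[0])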

ca         0.0
thal    normal
Name: 0, dtype: object

In [55]:

heart = heart.fillna(heart[cols_miss].mode().iloc[0])
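The steps above can be sketched on a small made-up frame (`toy` is hypothetical) to show why `.mode().iloc[0]` is needed: `mode()` returns a DataFrame, and `fillna` accepts a Series whose labels are matched against column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({
    "ca": [0.0, 3.0, 0.0, np.nan],
    "thal": ["normal", None, "fixed", "normal"],
})

# mode() returns a DataFrame (ties would add rows); row 0 holds the first mode per column.
modes = toy.mode().iloc[0]

# A Series passed to fillna fills each column with the value at the matching label.
filled = toy.fillna(modes)

print(filled.isna().sum().sum())  # 0
```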

Question 2

Create a new column called risk that is equal to $\frac{\text{age}}{\text{rest\_bp} + \text{chol} + \text{max\_hr}}$. What is the mean of this risk column?

In [59]:

heart = heart.assign(
    risk=heart["age"] / (
        heart["rest_bp"] + heart["chol"] + heart["max_hr"]
    )
)
heart["risk"].mean()

Out[59]:

0.10426734916395465
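As a worked check of the formula, the sketch below recomputes risk for a single observation, using the values from row 0 of the heart data shown at the top of this case study (age 63, rest_bp 145, chol 233, max_hr 150). The name `row0` is only for illustration.

```python
import pandas as pd

# Values copied from row 0 of the heart data displayed earlier.
row0 = pd.DataFrame({"age": [63], "rest_bp": [145], "chol": [233], "max_hr": [150]})

# risk = age / (rest_bp + chol + max_hr) = 63 / 528
row0 = row0.assign(
    risk=row0["age"] / (row0["rest_bp"] + row0["chol"] + row0["max_hr"])
)

print(round(row0["risk"].iloc[0], 4))  # 0.1193
```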

Question 3

Replace the values in the rest_ecg column so that:

- normal = normal
- left ventricular hypertrophy = lvh
- ST-T wave abnormality = stt_wav_abn

Hint: one of the original values may have an extra space at the end of the name!

How many observations fall into each of the new rest_ecg categories?

In [99]:

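# "normal" already carries its target label, so this assignment is a no-op.
heart.loc[heart["rest_ecg"] == "normal", "rest_ecg"] = "normal"
# Note the trailing space in the match string (see the hint above).
heart.loc[
    heart["rest_ecg"] == "left ventricular hypertrophy ",
    "rest_ecg"
] = "lvh"
heart.loc[
    heart["rest_ecg"] == "ST-T wave abnormality",
    "rest_ecg"
] = "stt_wav_abn"

# Count the observations in each recoded category.
heart["rest_ecg"].value_counts()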

Out[99]:
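A more defensive variant, sketched here on made-up values rather than the heart data, strips whitespace before recoding so it does not matter which raw value carries the stray space. The Series `ecg` is hypothetical.

```python
import pandas as pd

# Hypothetical values; the trailing space on the first entry mimics the hint.
ecg = pd.Series([
    "left ventricular hypertrophy ",
    "normal",
    "ST-T wave abnormality",
    "normal",
])

# Strip surrounding whitespace, then map the long names to short codes.
# Values missing from the mapping ("normal") are left unchanged.
recoded = ecg.str.strip().replace({
    "left ventricular hypertrophy": "lvh",
    "ST-T wave abnormality": "stt_wav_abn",
})

# One row per category with its number of observations.
counts = recoded.value_counts()
print(counts)
```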

Summarizing data

1. What is the mean resting blood pressure for males and females?
2. What are the mean and median cholesterol levels for males and females?
3. Which age group has the largest median cholesterol level for males?
4. Compute the mean risk value (the risk column was created in Question 2 of the "Manipulating data" section) for each age group and sex. Which sex and age group has the highest average risk value?

To use age groups, we first need to derive a new age_group column from the existing age column. Each group is labelled by the upper bound of its bracket: 40 means age 40 or younger, 50 means older than 40 and up to 50, and so on, with 80 covering everyone older than 70.

In [2]:

def categorize_age(age):
    """Return the upper bound of the age bracket (80 for anyone over 70)."""
    if age <= 40:
        return 40
    elif age <= 50:
        return 50
    elif age <= 60:
        return 60
    elif age <= 70:
        return 70
    else:
        return 80
        
heart = heart.assign(
    age_group=heart["age"].apply(categorize_age)
)
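An alternative to the row-by-row `apply` is `pd.cut`, which bins a whole column at once. This is a sketch of that alternative, not the approach used in this notebook; the `ages` Series is made up, and the bin edges and labels are chosen to reproduce the same brackets as `categorize_age`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including values that sit exactly on a bracket edge.
ages = pd.Series([29, 40, 41, 50, 63, 70, 77])

# Bins are right-inclusive by default, so (40, 50] matches "40 < age <= 50".
age_group = pd.cut(
    ages,
    bins=[-np.inf, 40, 50, 60, 70, np.inf],
    labels=[40, 50, 60, 70, 80],
).astype(int)  # convert the categorical labels back to plain integers

print(age_group.tolist())  # [40, 40, 50, 50, 70, 70, 80]
```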

Question 1

What is the mean resting blood pressure for males and females?

In [72]:

heart.groupby("sex").agg({"rest_bp": "mean"})

Out[72]:

	rest_bp
sex
Female	133.340206
Male	130.912621

Question 2

What are the mean and median cholesterol levels for males and females?

In [74]:

heart.groupby("sex").agg({
    "chol": ["mean", "median"]
})

Out[74]:

	chol
	mean	median
sex
Female	261.752577	254.0
Male	239.601942	235.0
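The dictionary form of `.agg` produces two-level column names (chol / mean, chol / median). If flat column names are preferred, named aggregation is one option. The sketch below uses made-up data (`toy` is hypothetical), and the output names `chol_mean` and `chol_median` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the heart.csv file.
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Male"],
    "chol": [200, 260, 180, 240, 300],
})

# Named aggregation: new_column=(source_column, aggregation).
summary = toy.groupby("sex").agg(
    chol_mean=("chol", "mean"),
    chol_median=("chol", "median"),
)

print(summary)
```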

Question 3

Which age group has the largest median cholesterol level for males?

In [90]:

heart[heart["sex"] == "Male"].groupby(
    "age_group", as_index=False
).agg(
    {"chol": "median"}
).sort_values(
    "chol", ascending=False
).iloc[0]["age_group"]

Out[90]:

80.0
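The result prints as 80.0 rather than 80 because `.iloc[0]` pulls one row out as a Series, and mixing the integer age_group with the float median upcasts both to float. A shorter route that keeps the integer label is `idxmax` on the grouped medians, sketched here on made-up data (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data, not the heart.csv file.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Male", "Female"],
    "age_group": [50, 50, 60, 60, 70, 70],
    "chol": [210, 230, 250, 270, 240, 400],
})

# Median cholesterol per age group, males only; the index is age_group.
medians = toy[toy["sex"] == "Male"].groupby("age_group")["chol"].median()

# idxmax returns the index label of the largest value.
print(medians.idxmax())  # 60
```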

Question 4

Compute the mean risk value (the risk column was created in Question 2 of the "Manipulating data" section) for each age group and sex. Which sex and age group has the highest average risk value?

In [95]:

heart.groupby(
    ["age_group", "sex"], as_index=False
).agg(
    {"risk": "mean"}
).sort_values(
    "risk", ascending=False
).iloc[0]

Out[95]:

age_group          80
sex            Female
risk         0.150236
Name: 8, dtype: object
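With two grouping keys, the same question can also be answered with `idxmax`, which returns the winning (age_group, sex) pair as a tuple. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the heart.csv file.
toy = pd.DataFrame({
    "age_group": [50, 50, 60, 60],
    "sex": ["Female", "Male", "Female", "Male"],
    "risk": [0.10, 0.09, 0.14, 0.12],
})

# Mean risk per (age_group, sex); the result is indexed by both keys.
mean_risk = toy.groupby(["age_group", "sex"])["risk"].mean()

# idxmax on a two-level index returns a tuple of labels.
top = mean_risk.idxmax()
print(top, mean_risk.max())
```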