planes
data set as a DataFrame:import pandas as pd
planes_df = pd.read_csv('../data/planes.csv')
DataFrame['SeriesName']
DataFrame.sum()
DataFrame.describe()
DataFrame.agg({'VariableName': ['sum', 'mean']})
planes_df.agg({
'year': ['mean', 'median'],
'seats': ['mean', 'max']
})
year | seats | |
---|---|---|
mean | 2000.48401 | 154.316376 |
median | 2001.00000 | NaN |
max | NaN | 450.000000 |
Note
We will primarily use the .agg() method moving forward.
Question
Why might we be interested in grouping by a variable?
Question
What are common grouped aggregation metrics used in your industry/organization?
DataFrame.sum()
DataFrame.describe()
DataFrame.agg({'VariableName': ['sum', 'mean']})
DataFrame.groupby()
method and passing a variable name:planes_df.groupby('model')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x111554950>
groupby()
method is just setting the group - you can see the changed DataFrame class:type(planes_df.groupby('manufacturer'))
pandas.core.groupby.generic.DataFrameGroupBy
(
planes_df.groupby('manufacturer')
.agg({'seats': ['mean', 'max']}).head()
)
seats | ||
---|---|---|
mean | max | |
manufacturer | ||
AGUSTA SPA | 8.000000 | 8 |
AIRBUS | 221.202381 | 379 |
AIRBUS INDUSTRIE | 187.402500 | 379 |
AMERICAN AIRCRAFT INC | 2.000000 | 2 |
AVIAT AIRCRAFT INC | 2.000000 | 2 |
(
planes_df.groupby('manufacturer')
.agg({'seats': ['mean', 'max']}).head()
)
seats | ||
---|---|---|
mean | max | |
manufacturer | ||
AGUSTA SPA | 8.000000 | 8 |
AIRBUS | 221.202381 | 379 |
AIRBUS INDUSTRIE | 187.402500 | 379 |
AMERICAN AIRCRAFT INC | 2.000000 | 2 |
AVIAT AIRCRAFT INC | 2.000000 | 2 |
(
planes_df.groupby('manufacturer')
.agg({'seats': ['mean', 'max']}).index
)
Index(['AGUSTA SPA', 'AIRBUS', 'AIRBUS INDUSTRIE', 'AMERICAN AIRCRAFT INC', 'AVIAT AIRCRAFT INC', 'AVIONS MARCEL DASSAULT', 'BARKER JACK L', 'BEECH', 'BELL', 'BOEING', 'BOMBARDIER INC', 'CANADAIR', 'CANADAIR LTD', 'CESSNA', 'CIRRUS DESIGN CORP', 'DEHAVILLAND', 'DOUGLAS', 'EMBRAER', 'FRIEDEMANN JON', 'GULFSTREAM AEROSPACE', 'HURLEY JAMES LARRY', 'JOHN G HESS', 'KILDALL GARY', 'LAMBERT RICHARD', 'LEARJET INC', 'LEBLANC GLENN T', 'MARZ BARRY', 'MCDONNELL DOUGLAS', 'MCDONNELL DOUGLAS AIRCRAFT CO', 'MCDONNELL DOUGLAS CORPORATION', 'PAIR MIKE E', 'PIPER', 'ROBINSON HELICOPTER CO', 'SIKORSKY', 'STEWART MACO'], dtype='object', name='manufacturer')
pandas
, and probably how pandas
wants to be usedas_index = False
parameter/argument to groupby()
:(
planes_df.groupby('manufacturer', as_index = False)
.agg({'seats': ['mean', 'max']}).head()
)
manufacturer | seats | ||
---|---|---|---|
mean | max | ||
0 | AGUSTA SPA | 8.000000 | 8 |
1 | AIRBUS | 221.202381 | 379 |
2 | AIRBUS INDUSTRIE | 187.402500 | 379 |
3 | AMERICAN AIRCRAFT INC | 2.000000 | 2 |
4 | AVIAT AIRCRAFT INC | 2.000000 | 2 |
groupby()
method:(
planes_df.groupby(['manufacturer', 'year'], as_index = False)
.agg({'seats': ['mean', 'max']}).head()
)
manufacturer | year | seats | ||
---|---|---|---|---|
mean | max | |||
0 | AGUSTA SPA | 2001.0 | 8.000000 | 8 |
1 | AIRBUS | 2002.0 | 173.800000 | 200 |
2 | AIRBUS | 2003.0 | 174.966667 | 200 |
3 | AIRBUS | 2004.0 | 217.000000 | 379 |
4 | AIRBUS | 2005.0 | 197.000000 | 379 |
1. What is meant by "find the minimum number of seats on a plane by year"?
2. Fix the below code to find the minimum number of seats on a plane by year:
planes_df.groupby('_____').agg({'_____': ['min']})
3. What is the Index of the result?
Are there any questions before we move on?