list
structure is a very important one-dimensional data structure[]
or by calling the list()
functiona_list = [1, 2, 3]
a_list
[1, 2, 3]
list
can be subset using single-bracket notation -- note that they index from 0a_list[0]
1
dict
structure is a popular way of representing key-value pairs{}
or by calling the dict()
functiona_dict = {
'key1': 'value1',
'key2': 2,
'key3': [1, 2, 3]
}
dict
can be subset using single-bracket notation and a key value:a_dict['key1']
'value1'
pandas
is Important...¶pandas
Python package is the most popular choice of working with tabular data in Pythonpandas
basics in the Introduction to Python for Data Science classpandas
is commonly imported as:import pandas as pd
pandas
DataFrame¶pandas
DataFrame class is the most popular choice to represent tabular data in PythonDataFrame.head()
is a method that shows the first 5 rows of a DataFrameDataFrame.shape
is an attribute that displays the number of rows and number of variables of a DataFramepd.read_csv()
function:planes_df = pd.read_csv('../data/planes.csv')
planes_df.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
pd.read_csv()
, like sep
, to customize the importing based on your dataplanes_df.head(2)
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
planes_df['model'].head()
0 EMB-145XR 1 A320-214 2 A320-214 3 A320-214 4 EMB-145LR Name: model, dtype: object
planes_df[['model']].head(3)
model | |
---|---|
0 | EMB-145XR |
1 | A320-214 |
2 | A320-214 |
planes_df[['model', 'engines']].head(3)
model | engines | |
---|---|---|
0 | EMB-145XR | 2 |
1 | A320-214 | 2 |
2 | A320-214 | 2 |
True
will be returned and rows where the logical condition is False
will be removedplanes_df['year'].head()
0 2004.0 1 1998.0 2 1999.0 3 1999.0 4 2002.0 Name: year, dtype: float64
(planes_df['year'] >= 2000).head()
0 True 1 False 2 False 3 False 4 True Name: year, dtype: bool
planes_df.loc[planes_df['year'] >= 2000].head(2)
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
planes_df['type'] = planes_df['type'].str.upper()
planes_df['type'].head()
0 FIXED WING MULTI ENGINE 1 FIXED WING MULTI ENGINE 2 FIXED WING MULTI ENGINE 3 FIXED WING MULTI ENGINE 4 FIXED WING MULTI ENGINE Name: type, dtype: object
planes_df['seats_and_crew'] = planes_df['seats'] + 5
planes_df[['seats', 'seats_and_crew']].head()
seats | seats_and_crew | |
---|---|---|
0 | 55 | 60 |
1 | 182 | 187 |
2 | 182 | 187 |
3 | 182 | 187 |
4 | 55 | 60 |
planes_df['seats'].mean()
154.31637567730283
planes_df.mean(numeric_only=True)
year 2000.484010 engines 1.995184 seats 154.316376 speed 236.782609 seats_and_crew 159.316376 dtype: float64
agg()
method provides more flexibility when summarizing DataFramesagg()
accepts a dict
argument with column-name keys and a list
of summary functions as valuesplanes_df.agg({
'seats': ['mean'],
'engines': ['max', 'min']
})
seats | engines | |
---|---|---|
mean | 154.316376 | NaN |
max | NaN | 4.0 |
min | NaN | 1.0 |
groupby()
method(
planes_df.groupby('manufacturer', as_index = False)
.agg({'seats': ['mean']}).head()
)
manufacturer | seats | |
---|---|---|
mean | ||
0 | AGUSTA SPA | 8.000000 |
1 | AIRBUS | 221.202381 |
2 | AIRBUS INDUSTRIE | 187.402500 |
3 | AMERICAN AIRCRAFT INC | 2.000000 |
4 | AVIAT AIRCRAFT INC | 2.000000 |
concat()
function is used to union a list of DataFramesdf_1 = pd.DataFrame({'a': [100, 200]})
df_2 = pd.DataFrame({'a': [300, 400]})
pd.concat([df_1, df_2]).reset_index(drop = True)
a | |
---|---|
0 | 100 |
1 | 200 |
2 | 300 |
3 | 400 |
merge()
function is used to join DataFramesflights_df = pd.read_csv('../data/flights.csv')
pd.merge(flights_df, planes_df, how = 'left', on = 'tailnum').head(2)
year_x | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | ... | time_hour | year_y | type | manufacturer | model | engines | seats | speed | engine | seats_and_crew | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | ... | 2013-01-01 05:00:00 | 1999.0 | FIXED WING MULTI ENGINE | BOEING | 737-824 | 2.0 | 149.0 | NaN | Turbo-fan | 154.0 |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | ... | 2013-01-01 05:00:00 | 1998.0 | FIXED WING MULTI ENGINE | BOEING | 737-824 | 2.0 | 149.0 | NaN | Turbo-fan | 154.0 |
2 rows × 28 columns
Are there any questions before we move on?