list structure is a very important one-dimensional data structure[] or by calling the list() functiona_list = [1, 2, 3]
a_list
[1, 2, 3]
list can be subset using single-bracket notation -- note that they index from 0a_list[0]
1
dict structure is a popular way of representing key-value pairs{} or by calling the dict() functiona_dict = {
'key1': 'value1',
'key2': 2,
'key3': [1, 2, 3]
}
dict can be subset using single-bracket notation and a key value:a_dict['key1']
'value1'
pandas is Important...¶pandas Python package is the most popular choice of working with tabular data in Pythonpandas basics in the Introduction to Python for Data Science classpandas is commonly imported as:import pandas as pd
pandas DataFrame¶pandas DataFrame class is the most popular choice to represent tabular data in PythonDataFrame.head() is a method that shows the first 5 rows of a DataFrameDataFrame.shape is an attribute that displays the number of rows and number of variables of a DataFramepd.read_csv() function:planes_df = pd.read_csv('../data/planes.csv')
planes_df.head()
| tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
| 1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
| 2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
| 3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
| 4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
pd.read_csv(), like sep, to customize the importing based on your data
planes_df.head(2)
| tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
| 1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
planes_df['model'].head()
0 EMB-145XR 1 A320-214 2 A320-214 3 A320-214 4 EMB-145LR Name: model, dtype: object
planes_df[['model']].head(3)
| model | |
|---|---|
| 0 | EMB-145XR |
| 1 | A320-214 |
| 2 | A320-214 |
planes_df[['model', 'engines']].head(3)
| model | engines | |
|---|---|---|
| 0 | EMB-145XR | 2 |
| 1 | A320-214 | 2 |
| 2 | A320-214 | 2 |
True will be returned and rows where the logical condition is False will be removedplanes_df['year'].head()
0 2004.0 1 1998.0 2 1999.0 3 1999.0 4 2002.0 Name: year, dtype: float64
(planes_df['year'] >= 2000).head()
0 True 1 False 2 False 3 False 4 True Name: year, dtype: bool
planes_df.loc[planes_df['year'] >= 2000].head(2)
| tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
| 4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
planes_df['type'] = planes_df['type'].str.upper()
planes_df['type'].head()
0 FIXED WING MULTI ENGINE 1 FIXED WING MULTI ENGINE 2 FIXED WING MULTI ENGINE 3 FIXED WING MULTI ENGINE 4 FIXED WING MULTI ENGINE Name: type, dtype: object
planes_df['seats_and_crew'] = planes_df['seats'] + 5
planes_df[['seats', 'seats_and_crew']].head()
| seats | seats_and_crew | |
|---|---|---|
| 0 | 55 | 60 |
| 1 | 182 | 187 |
| 2 | 182 | 187 |
| 3 | 182 | 187 |
| 4 | 55 | 60 |
planes_df['seats'].mean()
154.31637567730283
planes_df.mean(numeric_only=True)
year 2000.484010 engines 1.995184 seats 154.316376 speed 236.782609 seats_and_crew 159.316376 dtype: float64
agg() method provides more flexibility when summarizing DataFramesagg() accepts a dict argument with column-name keys and a list of summary functions as valuesplanes_df.agg({
'seats': ['mean'],
'engines': ['max', 'min']
})
| seats | engines | |
|---|---|---|
| mean | 154.316376 | NaN |
| max | NaN | 4.0 |
| min | NaN | 1.0 |
groupby() method(
planes_df.groupby('manufacturer', as_index = False)
.agg({'seats': ['mean']}).head()
)
| manufacturer | seats | |
|---|---|---|
| mean | ||
| 0 | AGUSTA SPA | 8.000000 |
| 1 | AIRBUS | 221.202381 |
| 2 | AIRBUS INDUSTRIE | 187.402500 |
| 3 | AMERICAN AIRCRAFT INC | 2.000000 |
| 4 | AVIAT AIRCRAFT INC | 2.000000 |
concat() function is used to union a list of DataFramesdf_1 = pd.DataFrame({'a': [100, 200]})
df_2 = pd.DataFrame({'a': [300, 400]})
pd.concat([df_1, df_2]).reset_index(drop = True)
| a | |
|---|---|
| 0 | 100 |
| 1 | 200 |
| 2 | 300 |
| 3 | 400 |
merge() function is used to join DataFramesflights_df = pd.read_csv('../data/flights.csv')
pd.merge(flights_df, planes_df, how = 'left', on = 'tailnum').head(2)
| year_x | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | ... | time_hour | year_y | type | manufacturer | model | engines | seats | speed | engine | seats_and_crew | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | ... | 2013-01-01 05:00:00 | 1999.0 | FIXED WING MULTI ENGINE | BOEING | 737-824 | 2.0 | 149.0 | NaN | Turbo-fan | 154.0 |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | ... | 2013-01-01 05:00:00 | 1998.0 | FIXED WING MULTI ENGINE | BOEING | 737-824 | 2.0 | 149.0 | NaN | Turbo-fan | 154.0 |
2 rows × 28 columns
Are there any questions before we move on?