integers (int
): Whole numbers, positive or negative or zero
3
, 0
, -531
floats (float
): Decimal numbers
3.14
, 0.0004
, -878.482
strings (str
): Arbitrary text
"hello"
, 'my name is ethan'
, ""
booleans (bool
): Logical values True
and False
True
, False
-- that's itlists (list
): Ordered, 1-dimensional sequences of objects
dictionaries (dict
): Mappings from "keys" to "values", good for looking up entries by their key
The Pandas package is the backbone of data analysis in Python
Pandas is all about DataFrames, objects that store tabular data
The funny name is short for Panel Data
Easy to read in data from common formats (CSV, JSON, SQL databases)
pd.read_csv
is what we'll use most often
import pandas as pd
planes_df = pd.read_csv('../data/planes.csv')
df.head()
is usually the place to start -- returns the first 5 rows
planes_df.head()
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
Pandas can save data in most of the formats it supports importing from.
Here we usually use df.to_csv
df.to_csv('data/my_data.csv', index=False)
Before you do much else, you need to be able to get at pieces of a DataFrame that you're interested in.
This usually means limiting to certain columns, to certain rows, or both at the same time.
Subsetting data by its columns is often called selecting
The syntax to select a single column is df[column_name]
This returns a series object, a 1-dimensional Pandas object
Series are a lot like Python lists, except all the data in them is usually of the same type
Using single brackets returns a Series; using double brackets returns a DataFrame
Selecting multiple columns must be done with double brackets
You can even use double brackets with a single column if you don't want a Series
planes_df['seats']
0 55 1 182 2 182 3 182 4 55 ... 3317 100 3318 142 3319 100 3320 142 3321 142 Name: seats, Length: 3322, dtype: int64
planes_df[['seats', 'tailnum']]
seats | tailnum | |
---|---|---|
0 | 55 | N10156 |
1 | 182 | N102UW |
2 | 182 | N103US |
3 | 182 | N104UW |
4 | 55 | N10575 |
... | ... | ... |
3317 | 100 | N997AT |
3318 | 142 | N997DL |
3319 | 100 | N998AT |
3320 | 142 | N998DL |
3321 | 142 | N999DN |
3322 rows × 2 columns
"Indexing" is the word we use for subsetting rows based on their location or row label.
Most things in Python index from 0.
DataFrames have row indexes, as we've discussed before.
You can think of them as row labels.
By default, they're just integers from 0 to (number_of_rows - 1).
df.loc
and brackets.planes_df.loc[2]
tailnum N103US year 1999.0 type Fixed wing multi engine manufacturer AIRBUS INDUSTRIE model A320-214 engines 2 seats 182 speed NaN engine Turbo-fan Name: 2, dtype: object
Like selecting a single column, indexing a single row returns a Series, not a DataFrame.
But using a slice for an index returns a DataFrame...
planes_df.loc[3:6]
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
5 | N105UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
6 | N107US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
df.loc[row_index, columns]
# Row indices 3-6, columns "seats" and "tailnum"
planes_df.loc[3:6, ['seats', 'tailnum']]
seats | tailnum | |
---|---|---|
3 | 182 | N104UW |
4 | 55 | N10575 |
5 | 182 | N105UW |
6 | 182 | N107US |
Filtering means limiting rows based on a condition of the data
This is also done with df.loc
, but you pass in an expression describing which rows to keep.
# This syntax is a little clunky; the DataFrame name is specified twice.
planes_df.loc[planes_df['engines'] > 2]
tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
---|---|---|---|---|---|---|---|---|---|
603 | N281AT | NaN | Fixed wing multi engine | AIRBUS INDUSTRIE | A340-313 | 4 | 375 | NaN | Turbo-jet |
1037 | N381AA | 1956.0 | Fixed wing multi engine | DOUGLAS | DC-7BF | 4 | 102 | 232.0 | Reciprocating |
2109 | N670US | 1990.0 | Fixed wing multi engine | BOEING | 747-451 | 4 | 450 | NaN | Turbo-jet |
2706 | N840MQ | 1974.0 | Fixed wing multi engine | CANADAIR LTD | CF-5D | 4 | 2 | NaN | Turbo-jet |
2764 | N854NW | 2004.0 | Fixed wing multi engine | AIRBUS | A330-223 | 3 | 379 | NaN | Turbo-fan |
2771 | N856NW | 2004.0 | Fixed wing multi engine | AIRBUS | A330-223 | 3 | 379 | NaN | Turbo-fan |
2931 | N905FJ | 1986.0 | Fixed wing multi engine | AVIONS MARCEL DASSAULT | MYSTERE FALCON 900 | 3 | 12 | NaN | Turbo-fan |
Like indexing, filtering can be combined with selecting in the .loc
brackets.
df.loc[row_filter, columns]
planes_df.loc[planes_df['seats'] == 139, ['seats', 'tailnum']]
seats | tailnum | |
---|---|---|
1813 | 139 | N600TR |
2131 | 139 | N675MC |
2402 | 139 | N762NC |
2432 | 139 | N767NC |
2472 | 139 | N774NC |
2483 | 139 | N777NC |
2492 | 139 | N779NC |
2503 | 139 | N782NC |
Generally:
Columns: df[columns]
Rows: df.loc[rows]
Both: df.loc[rows, columns]
df.describe()
planes_df.describe()
year | engines | seats | speed | |
---|---|---|---|---|
count | 3252.000000 | 3322.000000 | 3322.000000 | 23.000000 |
mean | 2000.484010 | 1.995184 | 154.316376 | 236.782609 |
std | 7.193425 | 0.117593 | 73.654974 | 149.759794 |
min | 1956.000000 | 1.000000 | 2.000000 | 90.000000 |
25% | 1997.000000 | 2.000000 | 140.000000 | 107.500000 |
50% | 2001.000000 | 2.000000 | 149.000000 | 162.000000 |
75% | 2005.000000 | 2.000000 | 182.000000 | 432.000000 |
max | 2013.000000 | 4.000000 | 450.000000 | 432.000000 |
Series (remember, individual columns are Series objects) offer lots of summary options.
Usually they're invoked as df[column].SUMMARY()
and return a single, scalar value.
df[column].mean()
df[column].max()
df[column].min()
df[column].quantile(q=0.5) # Median
planes_df['year'].min()
1956.0
df[column].nunique() # Number of unique values
df[column].value_counts() # Number of occurrences of each value, descending
planes_df['engine'].nunique()
6
planes_df['type'].value_counts()
Fixed wing multi engine 3292 Fixed wing single engine 25 Rotorcraft 5 Name: type, dtype: int64
value_counts
is an exception to the rule -- it doesn't return a single number, but instead a Series.f-strings let you interpolate, or dynamically insert, strings in other strings
instr1 = 'Brad'
instr2 = 'Ethan'
n_days = 4
my_str = f'{instr1} and {instr2} teach a {n_days}-day Python workshop.'
print(my_str)
Brad and Ethan teach a 4-day Python workshop.
You can easily make histograms and scatterplots with DataFrames.
planes_df.hist();
planes_df.plot.scatter('seats', 'year');
Functions are reusable code blocks that can be "called" from other code. They typically return a value, which the calling code can use.
def compute_triangle_area(base, height):
return base * height / 2
my_triangle_base = 3
my_triangle_height = 4
print(compute_triangle_area(my_triangle_base, my_triangle_height))
6.0
We use Conda to manage environments in Python.