The Python source distribution has long maintained the philosophy of "batteries included" -- having a rich and versatile standard library which is immediately available, without making the user download separate packages. This gives the Python language a head start in many projects.
- PEP 206
So far we've seen several data types that Python offers out-of-the-box. However, to keep things organized, some Python functionality is stored in standalone packages, or libraries of code. You'll also hear the word "module"; strictly speaking, a module is a single file of Python code while a package is a collection of modules, but the two terms are often used interchangeably in discussions of Python.
For example, functionality related to the operating system -- such as creating files and folders -- is stored in a package called os.
To use the tools in os, we import the package.
import os
Once we import it, we gain access to everything inside. With Jupyter's autocomplete, we can view what's available.
# Move your cursor to the end of the line below and press tab.
os.
Some packages, like os, are bundled with every Python install; downloading Python guarantees you'll have these packages.
Collectively, this group of packages is known as the standard library.
Other packages must be downloaded separately, either because they're too specialized to bundle with every install or because they evolve too quickly to be tied to Python's release schedule.
The DataFrame type that we saw earlier is part of one such package, called pandas (short for panel data).
Since pandas is specific to data science and is still rapidly evolving, it is not part of the standard library.
We can download packages like pandas from the internet via PyPI, the Python Package Index. Fortunately, since we are using Binder today, that has been handled for us and pandas is already installed.
It's possible to import packages under an alias, or a nickname. The community has adopted certain conventions for aliases for common packages; while following them isn't mandatory, it's highly recommended, as it makes your code easier for others to understand.
pandas is conventionally imported under the alias pd.
import pandas as pd
# Importing pandas has given us access to the DataFrame, accessible as pd.DataFrame
pd.DataFrame
pandas.core.frame.DataFrame
Question
What is the type of pd? Guess before you run the code below.
type(pd)
module
Third-party packages unlock a huge range of functionality that isn't available in native Python; much of Python's data science capabilities come from a handful of packages outside the standard library, numpy and pandas among them.
We won't have time to touch on most of these in this training, but if you're interested in one, google it!
Question
Import the numpy library, listed above. Give it the alias "np". Then try autocompleting np. to explore what's inside.
We've seen it a few times already, but now it's time to discuss it explicitly: things inside packages can be accessed with dot-notation.
Dot notation looks like this:
pd.Series
or
import numpy as np
np.array
You can read this as "the array variable, within the numpy library".
Packages can contain pretty much anything that's legal in Python; if it's code, it can be in a package.
This flexibility is part of the reason that Python's package ecosystem is so expansive and powerful.
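As an illustrative sketch, the os package we imported earlier already contains several kinds of things -- plain variables, functions, and even other packages nested inside it:

```python
import os

# A plain variable: the path separator string ('/' on Linux and macOS).
print(os.sep)

# A function: returns the current working directory as a string.
print(os.getcwd())

# A nested package (a "submodule"): os.path, for working with file paths.
print(os.path.join('data', 'planes.csv'))
```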
As you may have noticed already, occasionally we run code using parentheses. The feature that permits this in Python is the function -- a code snippet wrapped up into a single name.
For example, take the type function we saw above.
type(x)
type does some complex things under the hood -- it looks at the variable inside the parentheses, determines what type of thing it is, and then returns that type to the user.
x = 7
type(x)
int
But the beauty of type, and of all functions, is that you (as the user) don't need to know all the complex code that's necessary to figure out that x is an int -- you just need to remember that there's a type function to do that for you.
Functions make you much more powerful, as they unlock lots of functionality within a simple interface.
# Get the first few rows of the planes data.
planes.head()
# Read in the planes.csv file.
pd.read_csv('../data/planes.csv')
The variables within the parens are called function arguments, or simply arguments.
Above, the string '../data/planes.csv' is the argument to the pd.read_csv function.
Functions are integral to using Python, because it's much more efficient to use pre-written code than to always write your own.
If you ever do want to write your own function -- perhaps to share with others, or to make it easier to reuse your work -- it's fairly simple to do so, but beyond the scope of this training.
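Just to give a flavor of what that looks like, here is a minimal, hypothetical example of defining and calling your own function (the name and conversion factor are our own choices, not from the lesson):

```python
# Define a small function that converts feet to meters.
def feet_to_meters(feet):
    # 1 foot is defined as exactly 0.3048 meters.
    return feet * 0.3048

# Call it the same way we call type() or pd.read_csv().
feet_to_meters(100)
```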
Dot-notation, which we discussed in relation to packages, has another use -- accessing things inside of objects.
What's an object? Basically, a variable that contains other data or functionality inside of it that is exposed to users.
For example, DataFrames are objects.
df
| | first_name | last_name |
|---|---|---|
| 0 | Ethan | Swan |
| 1 | Brad | Boehmke |
df.shape
(2, 2)
df.describe()
| | first_name | last_name |
|---|---|---|
| count | 2 | 2 |
| unique | 2 | 2 |
| top | Ethan | Swan |
| freq | 1 | 1 |
You can see that DataFrames have a shape variable and a describe function inside of them, both accessible through dot notation.
Note
Variables inside an object are often called attributes and functions inside objects are called methods.
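This attribute/method distinction applies to built-in Python objects too, not just DataFrames. As a small sketch, a complex number object has both:

```python
c = 3 + 4j

# An attribute: the real part of the number, accessed without parentheses.
c.real

# A method: a function inside the object, called with parentheses.
c.conjugate()
```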
One of the great things about Python is that its creators really cared about internal consistency.
What that means to us, as users, is that syntax is consistent and predictable -- even across different uses that would appear to be different at first.
Dot notation reveals something kind of cool about Python: packages are just like other objects, and the variables inside them are just attributes and methods!
This standardization across packages and objects helps us remember a single, intuitive syntax that works for many different things.
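To make that concrete, here's a quick sketch using the standard library's math package -- the same dot notation gives us an attribute (math.pi) and a function (math.sqrt), just as it did on DataFrames:

```python
import math

# A package is itself an object of type 'module'.
type(math)

# math.pi is an attribute of that object...
math.pi

# ...and math.sqrt is a function inside it.
math.sqrt(16)
```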
As we saw above, DataFrames are a type of Python object, so let's use them to explore the new Python features we've learned.
Using the read_csv function from the pandas package to read in a DataFrame
df = pd.read_csv('../data/airlines.csv')
Using the type function to determine the type of df
type(df)
pandas.core.frame.DataFrame
Using the head method of the DataFrame to view some of its rows
df.head()
| | carrier | name |
|---|---|---|
| 0 | 9E | Endeavor Air Inc. |
| 1 | AA | American Airlines Inc. |
| 2 | AS | Alaska Airlines Inc. |
| 3 | B6 | JetBlue Airways |
| 4 | DL | Delta Air Lines Inc. |
Examining the columns attribute of the DataFrame to see the names of its columns.
df.columns
Index(['carrier', 'name'], dtype='object')
Inspecting the shape attribute to find the dimensions (rows and columns) of the DataFrame.
df.shape
(16, 2)
Calling the describe method to get a summary of the data in the DataFrame.
df.describe()
| | carrier | name |
|---|---|---|
| count | 16 | 16 |
| unique | 16 | 16 |
| top | 9E | Endeavor Air Inc. |
| freq | 1 | 1 |
Now let's combine them: using the type function to determine what df.describe holds.
type(df.describe)
method
Question
Does this result make sense? What would happen if you added parens? i.e. type(df.describe())
Spend some time using autocomplete to explore the methods and attributes of the df object we used above.
Remember from the Jupyter lesson that you can use a question mark to see the documentation for a function or method (e.g. df.describe?).
Now that we understand objects and functions better, let's look more at DataFrames.
Accessing an individual column of a DataFrame can be done by passing the column name as a string, in brackets.
carrier_column = df['carrier']
carrier_column
0     9E
1     AA
2     AS
3     B6
4     DL
5     EV
6     F9
7     FL
8     HA
9     MQ
10    OO
11    UA
12    US
13    VX
14    WN
15    YV
Name: carrier, dtype: object
Individual columns are pandas Series objects.
type(carrier_column)
pandas.core.series.Series
How are Series different from DataFrames? For one thing, they're one-dimensional. They also have some methods of their own, like the to_list method -- which doesn't make sense to have on DataFrames.
carrier_column.shape
(16,)
df.shape
(16, 2)
carrier_column.to_list()
['9E', 'AA', 'AS', 'B6', 'DL', 'EV', 'F9', 'FL', 'HA', 'MQ', 'OO', 'UA', 'US', 'VX', 'WN', 'YV']
df.to_list()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[23], line 1
----> 1 df.to_list()
...
AttributeError: 'DataFrame' object has no attribute 'to_list'
It's important to be familiar with Series because they are fundamentally the core of DataFrames. Not only are columns represented as Series, but so are rows!
# Fetch the first row of the DataFrame
first_row = df.loc[0]
first_row
carrier                   9E
name       Endeavor Air Inc.
Name: 0, dtype: object
type(first_row)
pandas.core.series.Series
Whenever you select individual columns or rows, you'll get Series objects.
First, let's create our own Series object from scratch -- they don't need to come from a DataFrame.
# Pass a list in as an argument and it will be converted to a Series.
s = pd.Series([10, 20, 30, 40, 50])
s
0    10
1    20
2    30
3    40
4    50
dtype: int64
There are three things to notice about this Series: its values, its data type, and its index.
Values are fairly self-explanatory; we chose them in our input list.
Data types are also straightforward.
Series are always homogeneous, holding only integers, floats, or generic Python objects (called just object).
Because a Python object is general enough to contain any other type, any Series holding strings or other non-numeric data will typically default to be of type object.
For example, going back to our carriers DataFrame, note that the carrier column is of type object.
df['carrier']
0     9E
1     AA
2     AS
3     B6
4     DL
5     EV
6     F9
7     FL
8     HA
9     MQ
10    OO
11    UA
12    US
13    VX
14    WN
15    YV
Name: carrier, dtype: object
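To see this homogeneity rule in action, here's a small sketch with made-up values showing how pandas picks a dtype:

```python
import pandas as pd

# All integers: pandas chooses a numeric dtype (int64).
numeric = pd.Series([1, 2, 3])

# Mixing in a string forces the generic object dtype.
mixed = pd.Series([1, 2, 'three'])

print(numeric.dtype)
print(mixed.dtype)
```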
Indexes are more interesting. Every Series has an index, a way to reference each element. The index of a Series is a lot like the keys of a dictionary: each index element corresponds to a value in the Series, and can be used to look up that element.
# Our index is a range from 0 (inclusive) to 5 (exclusive).
s.index
RangeIndex(start=0, stop=5, step=1)
s
0    10
1    20
2    30
3    40
4    50
dtype: int64
s[3]
40
In our example, the index is just the integers 0-4, so right now it looks no different than referencing elements of a regular Python list. But indexes can be changed to something different -- like the letters a-e, for example.
s.index = ['a', 'b', 'c', 'd', 'e']
s
a    10
b    20
c    30
d    40
e    50
dtype: int64
Now to look up the value 40, we reference 'd'.
s['d']
40
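The dictionary analogy can be made concrete. This sketch, using the same made-up numbers, shows the two lookups side by side:

```python
import pandas as pd

# A dict maps keys to values...
d = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
print(d['d'])

# ...and a Series with a string index behaves the same way.
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(s['d'])
```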
We saw earlier that rows of a DataFrame are Series. In such cases, the flexibility of Series indexes comes in handy; the index is set to the DataFrame column names.
df.head()
| | carrier | name |
|---|---|---|
| 0 | 9E | Endeavor Air Inc. |
| 1 | AA | American Airlines Inc. |
| 2 | AS | Alaska Airlines Inc. |
| 3 | B6 | JetBlue Airways |
| 4 | DL | Delta Air Lines Inc. |
# Note that the index is ['carrier', 'name']
first_row = df.loc[0]
first_row
carrier                   9E
name       Endeavor Air Inc.
Name: 0, dtype: object
This is particularly handy because it means you can extract individual elements based on a column name.
first_row['carrier']
'9E'
It's not just Series that have indexes! DataFrames have them too. Take a look at the carrier DataFrame again and note the bold numbers on the left.
df.head()
| | carrier | name |
|---|---|---|
| 0 | 9E | Endeavor Air Inc. |
| 1 | AA | American Airlines Inc. |
| 2 | AS | Alaska Airlines Inc. |
| 3 | B6 | JetBlue Airways |
| 4 | DL | Delta Air Lines Inc. |
These numbers are an index, just like the one we saw on our example Series. And DataFrame indexes support similar functionality.
# Our index is a range from 0 (inclusive) to 16 (exclusive).
df.index
RangeIndex(start=0, stop=16, step=1)
When loading in a DataFrame, the default index will always be 0 to N-1, where N is the number of rows in your DataFrame.
This is called a RangeIndex.
Selecting individual rows by their index is done with the .loc accessor.
An accessor is an attribute designed specifically to help users reference something else (like rows within a DataFrame).
# Get the row at index 4 (the fifth row).
df.loc[4]
carrier                      DL
name       Delta Air Lines Inc.
Name: 4, dtype: object
As with Series, DataFrames support reassigning their index.
However, with DataFrames it often makes sense to change one of your columns into the index.
This is analogous to a primary key in relational databases: a way to rapidly look up rows within a table.
In our case, maybe we will often use the carrier code (carrier) to look up the full name of the airline.
In that case, it would make sense to set the carrier column as our index.
df = df.set_index('carrier')
df.head()
| carrier | name |
|---|---|
| 9E | Endeavor Air Inc. |
| AA | American Airlines Inc. |
| AS | Alaska Airlines Inc. |
| B6 | JetBlue Airways |
| DL | Delta Air Lines Inc. |
Now the RangeIndex has been replaced with a more meaningful index, and it's possible to look up rows of the table by passing a carrier code to the .loc accessor.
df.loc['UA']
name    United Air Lines Inc.
Name: UA, dtype: object
Caution!
Pandas does not require that indexes have unique values (that is, no duplicates), although many relational databases do have that requirement for a primary key. This means that it is *possible* to create a non-unique index, but it's highly inadvisable. Having duplicate values in your index can cause unexpected results when you refer to rows by index, since multiple rows may match a single index value. Don't do it if you can help it!
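To see why duplicate index values are trouble, here's a small sketch with made-up data where two entries share the same label:

```python
import pandas as pd

# Two entries share the index value 'x'.
dup = pd.Series([1, 2, 3], index=['x', 'x', 'y'])

# Looking up the unique label 'y' returns a single value...
print(dup['y'])

# ...but looking up 'x' silently returns a whole Series of matches,
# so code expecting one value may break in surprising ways.
print(dup['x'])
```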
When starting to work with a DataFrame, it's often a good idea to determine what column makes sense as your index and to set it immediately.
This will make your code nicer -- by letting you directly look up values with the index -- and also make your selections and filters faster, because Pandas is optimized for operations by index.
If you want to change the index of your DataFrame later, you can always reset_index (and then assign a new one).
df.head()
| carrier | name |
|---|---|
| 9E | Endeavor Air Inc. |
| AA | American Airlines Inc. |
| AS | Alaska Airlines Inc. |
| B6 | JetBlue Airways |
| DL | Delta Air Lines Inc. |
df = df.reset_index()
df.head()
| | carrier | name |
|---|---|---|
| 0 | 9E | Endeavor Air Inc. |
| 1 | AA | American Airlines Inc. |
| 2 | AS | Alaska Airlines Inc. |
| 3 | B6 | JetBlue Airways |
| 4 | DL | Delta Air Lines Inc. |
The cell below has code to load in the first 100 rows of the airports data as airports.
The data contains the airport code, airport name, and some basic facts about the airport location.
airports = pd.read_csv('../data/airports.csv')
airports = airports.loc[0:99]  # .loc slicing includes both endpoints, so this is 100 rows
airports.head()
| | faa | name | lat | lon | alt | tz | dst | tzone |
|---|---|---|---|---|---|---|---|---|
| 0 | 04G | Lansdowne Airport | 41.130472 | -80.619583 | 1044 | -5 | A | America/New_York |
| 1 | 06A | Moton Field Municipal Airport | 32.460572 | -85.680028 | 264 | -6 | A | America/Chicago |
| 2 | 06C | Schaumburg Regional | 41.989341 | -88.101243 | 801 | -6 | A | America/Chicago |
| 3 | 06N | Randall Airport | 41.431912 | -74.391561 | 523 | -5 | A | America/New_York |
| 4 | 09J | Jekyll Island Airport | 31.074472 | -81.427778 | 11 | -5 | A | America/New_York |
Question
Take a look at airports. What column would make a good index? Use the set_index method to set it. (Remember, you can view the documentation with a question mark, e.g. airports.set_index?.)
Are there any questions before we move on?