Packages, Modules, Methods, and Functions¶

The Python source distribution has long maintained the philosophy of "batteries included" -- having a rich and versatile standard library which is immediately available, without making the user download separate packages. This gives the Python language a head start in many projects.

- PEP 206

Applied Review¶

Python and Jupyter Overview¶

  • We're working with Python through Jupyter, the most common IDE for data science.

Fundamentals¶

  • Python's common atomic, or basic, data types are:
    • Integers
    • Floats (decimals)
    • Strings
    • Booleans
  • These simple types can be combined to form more complex types, including:
    • Lists: Ordered collections
    • Dictionaries: Key-value pairs
    • DataFrames: Tabular datasets

Packages (aka Modules)¶

So far we've seen several data types that Python offers out-of-the-box. However, to keep things organized, some Python functionality is stored in standalone packages, or libraries of code. Strictly speaking, a module is a single file of Python code while a package is a collection of modules, but in everyday conversation the two words are used almost interchangeably; you will hear both in discussions of Python.

For example, functionality related to the operating system -- such as creating files and folders -- is stored in a package called os. To use the tools in os, we import the package.

In [1]:
import os

Once we import it, we gain access to everything inside. With Jupyter's autocomplete, we can view what's available.

In [ ]:
# Move your cursor to the end of the line below and press Tab.
os.

Some packages, like os, are bundled with every Python install; downloading Python guarantees you'll have these packages. Collectively, this group of packages is known as the standard library.
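For example, the math package is part of the standard library, so it can be imported and used right away, with nothing extra to install:

# math ships with every Python install, so this always works.
import math
math.sqrt(16)   # 4.0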

Other packages must be downloaded separately, either because

  • they aren't sufficiently popular to merit inclusion in the standard library
  • or they change too quickly for the maintainers of Python to keep up

The DataFrame type that we saw earlier is part of one such package called pandas (short for Panel Data). Since pandas is specific to data science and is still rapidly evolving, it is not part of the standard library.

We can download packages like pandas from PyPI, the Python Package Index, usually with a tool called pip. Fortunately, since we are using Binder today, that has been handled for us and pandas is already installed.
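If you're working on your own computer later, installing a package from PyPI looks something like this in a Jupyter cell (a sketch -- some environments use conda or another installer instead):

# Install pandas from PyPI; this only needs to be done once per environment.
!pip install pandas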

It's possible to import packages under an alias, or a nickname. The community has adopted certain conventions for aliases for common packages; while following them isn't mandatory, it's highly recommended, as it makes your code easier for others to understand.

pandas is conventionally imported under the alias pd.

In [3]:
import pandas as pd
In [4]:
# Importing pandas has given us access to the DataFrame, accessible as pd.DataFrame
pd.DataFrame
Out[4]:
pandas.core.frame.DataFrame

Question

What is the type of pd? Guess before you run the code below.

In [5]:
type(pd)
Out[5]:
module

Third-party packages unlock a huge range of functionality that isn't available in native Python; much of Python's data science functionality comes from a handful of packages outside the standard library:

  • pandas
  • numpy (numerical computing)
  • scikit-learn (modeling)
  • scipy (scientific computing)
  • matplotlib (graphing)

We won't have time to touch on most of these in this training, but if you're interested in one, google it!

Your Turn¶

  1. Import the numpy library, listed above. Give it the alias "np".
  2. Using autocomplete, determine what variable or function inside the numpy library starts with "asco". Hint: remember you'll need to preface the variable name with the package alias, e.g. np.asco

Dot Notation with Packages¶

We've seen it a few times already, but now it's time to discuss it explicitly: things inside packages can be accessed with dot-notation.

Dot notation looks like this:

pd.Series

or

import numpy as np
np.array

You can read this as "the array variable, within the numpy package".

Packages can contain pretty much anything that's legal in Python; if it's code, it can be in a package.

This flexibility is part of the reason that Python's package ecosystem is so expansive and powerful.
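For example, numpy (imported as np, as in the snippet above) contains plain values, functions, and even whole types, all reachable with the same dot notation:

np.pi       # a number stored in the package
np.sqrt     # a function for computing square roots
np.ndarray  # a type (class) defined by the package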

Functions¶

As you may have noticed already, occasionally we run code using parentheses. The feature that permits this in Python is functions -- code snippets wrapped up into a single name.

For example, take the type function we saw above.

type(x)

type does some complex things under the hood -- it looks at the variable inside the parentheses, determines what type of thing it is, and then returns that type to the user.

In [6]:
x = 7
type(x)
Out[6]:
int

But the beauty of type, and of all functions, is that you (as the user) don't need to know all the complex code that's necessary to figure out that x is an int -- you just need to remember that there's a type function to do that for you.

Functions make you much more powerful, as they unlock lots of functionality within a simple interface.

# Get the first few rows of the planes data.
planes.head()
# Read in the planes.csv file.
pd.read_csv('../data/planes.csv')

The values within the parentheses are called function arguments, or simply arguments.

Above, the string '../data/planes.csv' is the argument to the pd.read_csv function.
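Functions can also take more than one argument. For example, Python's built-in round function accepts both the number to round and how many decimal places to keep:

# Round to two decimal places.
round(3.14159, 2)   # 3.14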

Functions are integral to using Python, because it's much more efficient to use pre-written code than to always write your own.

If you ever do want to write your own function -- perhaps to share with others, or to make it easier to reuse your work -- it's fairly simple to do so, but beyond the scope of this training.

Objects and Dot Notation¶

Dot-notation, which we discussed in relation to packages, has another use -- accessing things inside of objects.

What's an object? Basically, a variable that has other data and functionality stored inside of it, which users can access.

For example, DataFrames are objects.

In [8]:
df
Out[8]:
first_name last_name
0 Ethan Swan
1 Brad Boehmke
In [9]:
df.shape
Out[9]:
(2, 2)
In [10]:
df.describe()
Out[10]:
first_name last_name
count 2 2
unique 2 2
top Ethan Swan
freq 1 1

You can see that DataFrames have a shape variable and a describe function inside of them, both accessible through dot notation.

Note

Variables inside an object are often called attributes and functions inside objects are called methods.

On Consistency and Language Design¶

One of the great things about Python is that its creators really cared about internal consistency.

What that means for us, as users, is that syntax is consistent and predictable -- even across features that appear quite different at first.

Dot notation reveals something kind of cool about Python: packages are just like other objects, and the variables inside them are just attributes and methods!

This standardization across packages and objects helps us remember a single, intuitive syntax that works for many different things.
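As a quick illustration, the same type function we've been using on ordinary variables works just as well on a package and on the things stored inside it:

type(pd)           # module
type(pd.read_csv)  # function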

Functions, Objects, and Methods in the Context of DataFrames¶

As we saw above, DataFrames are a type of Python object, so let's use them to explore the new Python features we've learned.

Using the read_csv function from the pandas package to read in a DataFrame

In [11]:
df = pd.read_csv('../data/airlines.csv')

Using the type function to determine the type of df

In [12]:
type(df)
Out[12]:
pandas.core.frame.DataFrame

Using the head method of the DataFrame to view some of its rows

In [13]:
df.head()
Out[13]:
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.

Examining the columns attribute of the DataFrame to see the names of its columns.

In [14]:
df.columns
Out[14]:
Index(['carrier', 'name'], dtype='object')

Inspecting the shape attribute to find the dimensions (rows and columns) of the DataFrame.

In [15]:
df.shape
Out[15]:
(16, 2)

Calling the describe method to get a summary of the data in the DataFrame.

In [16]:
df.describe()
Out[16]:
carrier name
count 16 16
unique 16 16
top 9E Endeavor Air Inc.
freq 1 1

Now let's combine them: using the type function to determine what kind of thing df.describe is.

In [17]:
type(df.describe)
Out[17]:
method

Question

Does this result make sense? What would happen if you added parens? i.e. type(df.describe())

Your Turn¶

Spend some time using autocomplete to explore the methods and attributes of the df object we used above. Remember from the Jupyter lesson that you can use a question mark to see the documentation for a function or method (e.g. df.describe?).

Deeper Dive on DataFrames¶

Now that we understand objects and functions better, let's look more at DataFrames.

What Are DataFrames Made of?¶

Accessing an individual column of a DataFrame can be done by passing the column name as a string, in brackets.

In [18]:
carrier_column = df['carrier']
carrier_column
Out[18]:
0     9E
1     AA
2     AS
3     B6
4     DL
5     EV
6     F9
7     FL
8     HA
9     MQ
10    OO
11    UA
12    US
13    VX
14    WN
15    YV
Name: carrier, dtype: object

Individual columns are pandas Series objects.

In [19]:
type(carrier_column)
Out[19]:
pandas.core.series.Series

How are Series different from DataFrames?

  • They're always 1-dimensional
  • They have different attributes than DataFrames
    • For example, Series have a to_list method -- which doesn't make sense to have on DataFrames
  • They don't print in the pretty format of DataFrames, but in plain text (see above)
In [20]:
carrier_column.shape
Out[20]:
(16,)
In [21]:
df.shape
Out[21]:
(16, 2)
In [22]:
carrier_column.to_list()
Out[22]:
['9E',
 'AA',
 'AS',
 'B6',
 'DL',
 'EV',
 'F9',
 'FL',
 'HA',
 'MQ',
 'OO',
 'UA',
 'US',
 'VX',
 'WN',
 'YV']
In [23]:
df.to_list()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[23], line 1
----> 1 df.to_list()

File /usr/local/anaconda3/envs/uc-python/lib/python3.11/site-packages/pandas/core/generic.py:5989, in NDFrame.__getattr__(self, name)
   5982 if (
   5983     name not in self._internal_names_set
   5984     and name not in self._metadata
   5985     and name not in self._accessors
   5986     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987 ):
   5988     return self[name]
-> 5989 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'to_list'

It's important to be familiar with Series because they are the building blocks of DataFrames. Not only are columns represented as Series, but so are rows!

In [24]:
# Fetch the first row of the DataFrame
first_row = df.loc[0]
first_row
Out[24]:
carrier                   9E
name       Endeavor Air Inc.
Name: 0, dtype: object
In [25]:
type(first_row)
Out[25]:
pandas.core.series.Series

Whenever you select individual columns or rows, you'll get Series objects.

What Can You Do with a Series?¶

First, let's create our own Series object from scratch -- they don't need to come from a DataFrame.

In [26]:
# Pass a list in as an argument and it will be converted to a Series.
s = pd.Series([10, 20, 30, 40, 50])
s
Out[26]:
0    10
1    20
2    30
3    40
4    50
dtype: int64

There are 3 things to notice about this Series:

  • The values (10, 20, 30...)
  • The dtype, short for data type.
  • The index (0, 1, 2...)

Values¶

Values are fairly self-explanatory; we chose them in our input list.

dtype¶

Data types are also fairly straightforward.

Series are always homogeneous: every element shares a single dtype, such as integer (int64), float (float64), or generic Python object (called simply object).

Because a Python object is general enough to contain any other type, a Series holding strings or other non-numeric data will typically fall back to the object dtype.

For example, going back to our carriers DataFrame, note that the carrier column is of type object.

In [28]:
df['carrier']
Out[28]:
0     9E
1     AA
2     AS
3     B6
4     DL
5     EV
6     F9
7     FL
8     HA
9     MQ
10    OO
11    UA
12    US
13    VX
14    WN
15    YV
Name: carrier, dtype: object
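To see how the dtype follows the values, we can build a couple of small Series of our own (hypothetical values, not from our data):

pd.Series([1.5, 2.5, 3.5])   # dtype: float64
pd.Series([1, 'two', 3.0])   # mixed values fall back to dtype: object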

Index¶

Indexes are more interesting. Every Series has an index, a way to reference each element. The index of a Series is a lot like the keys of a dictionary: each index element corresponds to a value in the Series, and can be used to look up that element.
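If it helps, compare with a plain Python dictionary, where keys play the same lookup role (a small, made-up example):

d = {'a': 10, 'b': 20}
d['b']   # 20, looked up by its key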

In [29]:
# Our index is a range from 0 (inclusive) to 5 (exclusive).
s.index
Out[29]:
RangeIndex(start=0, stop=5, step=1)
In [30]:
s
Out[30]:
0    10
1    20
2    30
3    40
4    50
dtype: int64
In [31]:
s[3]
Out[31]:
40

In our example, the index is just the integers 0-4, so right now it looks no different than referencing elements of a regular Python list. But indexes can be changed to something else -- like the letters a-e, for example.

In [32]:
s.index = ['a', 'b', 'c', 'd', 'e']
s
Out[32]:
a    10
b    20
c    30
d    40
e    50
dtype: int64

Now to look up the value 40, we reference 'd'.

In [33]:
s['d']
Out[33]:
40

We saw earlier that rows of a DataFrame are Series. In such cases, the flexibility of Series indexes comes in handy; the index is set to the DataFrame column names.

In [34]:
df.head()
Out[34]:
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.
In [35]:
# Note that the index is ['carrier', 'name']
first_row = df.loc[0]
first_row
Out[35]:
carrier                   9E
name       Endeavor Air Inc.
Name: 0, dtype: object

This is particularly handy because it means you can extract individual elements based on a column name.

In [36]:
first_row['carrier']
Out[36]:
'9E'

DataFrame Indexes¶

It's not just Series that have indexes! DataFrames have them too. Take a look at the carrier DataFrame again and note the bold numbers on the left.

In [37]:
df.head()
Out[37]:
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.

These numbers are an index, just like the one we saw on our example Series. And DataFrame indexes support similar functionality.

In [38]:
# Our index is a range from 0 (inclusive) to 16 (exclusive).
df.index
Out[38]:
RangeIndex(start=0, stop=16, step=1)

When loading in a DataFrame, the default index will always be 0 to N-1, where N is the number of rows in your DataFrame. This is called a RangeIndex.

Selecting individual rows by their index is done with the .loc accessor. An accessor is an attribute designed specifically to help users reference something else (like rows within a DataFrame).

In [39]:
# Get the row at index 4 (the fifth row).
df.loc[4]
Out[39]:
carrier                      DL
name       Delta Air Lines Inc.
Name: 4, dtype: object

As with Series, DataFrames support reassigning their index.

However, with DataFrames it often makes sense to change one of your columns into the index.

This is analogous to a primary key in relational databases: a way to rapidly look up rows within a table.

In our case, maybe we will often use the carrier code (carrier) to look up the full name of the airline. In that case, it would make sense to set the carrier column as our index.

In [40]:
df = df.set_index('carrier')
df.head()
Out[40]:
name
carrier
9E Endeavor Air Inc.
AA American Airlines Inc.
AS Alaska Airlines Inc.
B6 JetBlue Airways
DL Delta Air Lines Inc.

Now the RangeIndex has been replaced with a more meaningful index, and it's possible to look up rows of the table by passing carrier code to the .loc accessor.

In [41]:
df.loc['UA']
Out[41]:
name    United Air Lines Inc.
Name: UA, dtype: object

Caution!

pandas does not require that indexes have unique values (that is, no duplicates), although many relational databases do require that of a primary key. This means that it is *possible* to create a non-unique index, but it's highly inadvisable: when multiple rows share the same index value, referring to rows by index can produce unexpected results. Don't do it if you can help it!
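Here's a small, hypothetical illustration of the problem -- with a duplicated index value, a single .loc lookup quietly returns multiple rows:

# Two rows share the index value 'a' (don't do this in practice!).
dupes = pd.DataFrame({'value': [1, 2, 3]}, index=['a', 'a', 'b'])
dupes.loc['a']   # returns two rows, not one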

When starting to work with a DataFrame, it's often a good idea to determine what column makes sense as your index and to set it immediately.

This will make your code nicer -- by letting you directly look up values with the index -- and also make your selections and filters faster, because pandas is optimized for operations by index.

If you want to change the index of your DataFrame later, you can always reset_index (and then assign a new one).

In [42]:
df.head()
Out[42]:
name
carrier
9E Endeavor Air Inc.
AA American Airlines Inc.
AS Alaska Airlines Inc.
B6 JetBlue Airways
DL Delta Air Lines Inc.
In [43]:
df = df.reset_index()
df.head()
Out[43]:
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.

Your Turn¶

The cell below has code to load in the first hundred or so rows of the airports data as airports. The data contains the airport code, airport name, and some basic facts about the airport location.

In [44]:
airports = pd.read_csv('../data/airports.csv')
airports = airports.loc[0:100]
airports.head()
Out[44]:
faa name lat lon alt tz dst tzone
0 04G Lansdowne Airport 41.130472 -80.619583 1044 -5 A America/New_York
1 06A Moton Field Municipal Airport 32.460572 -85.680028 264 -6 A America/Chicago
2 06C Schaumburg Regional 41.989341 -88.101243 801 -6 A America/Chicago
3 06N Randall Airport 41.431912 -74.391561 523 -5 A America/New_York
4 09J Jekyll Island Airport 31.074472 -81.427778 11 -5 A America/New_York
  1. What kind of index is the current index of airports?
  2. Is this a good choice for the DataFrame's index? If not, what column or columns would be a better candidate?
  3. If you chose a different column to be the index, make it your index using airports.set_index().
  4. Using your new index, look up "Pittsburgh-Monroeville Airport", code 4G0. What is its altitude?
  5. Reset your index in case you want to make a different column your index in the future.

Questions¶

Are there any questions before we move on?