The Python source distribution has long maintained the philosophy of "batteries included" -- having a rich and versatile standard library which is immediately available, without making the user download separate packages. This gives the Python language a head start in many projects.
- PEP 206
So far we've seen several data types that Python offers out-of-the-box. However, to keep things organized, some Python functionality is stored in standalone packages, or libraries of code. The word "module" is generally synonymous with "package," and "library"; you will hear all three in discussions of Python.
If you want clearer definitions, the three can be thought of this way:
pip install package_name
pip list
or conda list
import module_name
For example, functionality related to the operating system -- such as creating files and folders -- is stored in a built-in library called os.
To use the tools in os, we import the package.
import os
Once we import it, we gain access to everything inside. With VSCode's autocomplete, we can view what's available.
# Move your cursor to the end of the line below and press Tab.
os.
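As a small illustration of what os offers, the sketch below creates a folder and a file inside a temporary directory, then lists the folder's contents. The folder and file names are made up for this example, and tempfile is used only so that nothing is left behind on your machine.

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as scratch_dir:
    # Hypothetical folder name, chosen only for this illustration.
    folder_path = os.path.join(scratch_dir, "example_folder")
    os.mkdir(folder_path)

    # Create a small text file inside the new folder.
    file_path = os.path.join(folder_path, "notes.txt")
    with open(file_path, "w") as f:
        f.write("hello")

    # Record what os can tell us while the directory still exists.
    folder_contents = os.listdir(folder_path)
    folder_exists = os.path.isdir(folder_path)

print(folder_contents)
print(folder_exists)
```

Every name used here (os.path.join, os.mkdir, os.listdir, os.path.isdir) is reached through the imported os module.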
Some libraries, like os, are bundled with every Python install; downloading Python guarantees you'll have these libraries.
Collectively, this group of libraries is known as the standard library.
Other packages must be downloaded separately, typically because they serve a specialized field or are still changing quickly.
One very commonly-used data science package is called pandas (short for Panel Data).
Since pandas is specific to data science and is still rapidly evolving, it is not part of the standard library.
Note: We'll cover pandas in more detail in later modules.
We can download packages like pandas from the internet using a website called PyPI, the Python Package Index. Fortunately, since we are using a pre-built conda environment today, that has been handled for us and pandas is already installed.
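If you're ever unsure whether a package is available in your environment, one way to check without importing it is sketched below. The helper name is_installed is hypothetical (it is not part of Python); it simply wraps the standard library's importlib.util.find_spec.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    # find_spec returns None when no such top-level module can be found.
    return importlib.util.find_spec(module_name) is not None


print(is_installed("os"))      # part of the standard library
print(is_installed("pandas"))  # depends on your environment
```

If the second line prints False, pandas would need to be installed before it can be imported.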
It's possible to import modules under an alias, or a nickname. The community has adopted certain conventions for aliases for common packages; while following them isn't mandatory, it's highly recommended, as it makes your code easier for others to understand.
pandas is conventionally imported under the alias pd.
import pandas as pd
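An alias doesn't create a copy; it is just a second name for the same module. Here is a minimal sketch using the standard library's datetime module, chosen so it runs without any extra installs (dt is a common alias for it, though not a universal one):

```python
import datetime
import datetime as dt  # the same module, bound to a shorter name

# Both names refer to one and the same module.
same_module = dt is datetime
print(same_module)

# Anything reachable through the full name is reachable through the alias.
same_class = dt.date is datetime.date
print(same_class)
```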
Importing pandas has given us access to the DataFrame, accessible as pd.DataFrame.
pd.DataFrame
pandas.core.frame.DataFrame
Question:
What is the type of pd? Guess before you run the code below.
type(pd)
Third-party packages unlock a huge range of functionality that isn't available in native Python; many of Python's data science capabilities come from a handful of packages outside the standard library:
We won't have time to touch on most of these in this training, but if you're interested in one, google it!
Import the numpy library, listed above. Give it the alias "np".
np.asco
We've seen it a few times already, but now it's time to discuss it explicitly: things inside modules can be accessed with dot-notation.
Dot notation looks like this:
pd.Series
or
import numpy as np
np.array
You can read this as "the array variable, within the NumPy library".
Packages can contain pretty much anything that's legal in Python; if it's code, it can be in a package.
This flexibility is part of the reason that Python's package ecosystem is so expansive and powerful.
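To make that concrete, here is a short sketch using two standard-library modules. It reaches a variable, a function, and a class, all through the same dot notation (the particular names picked are just examples):

```python
import math
import datetime

print(math.pi)        # a variable: a float constant stored in the module
print(math.sqrt(16))  # a function: called with parentheses
print(datetime.date)  # a class: can be used to create new objects

# Confirm what kind of thing each name refers to.
pi_is_float = isinstance(math.pi, float)
sqrt_is_callable = callable(math.sqrt)
date_is_class = isinstance(datetime.date, type)
```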
Dot-notation has another use -- accessing things inside of objects.
What's an object? Basically, a variable that contains other data or functionality inside of it that is exposed to users.
For example, DataFrames are objects.
Note: We'll cover pandas and DataFrames in far more detail in later modules.
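The df used in the following cells isn't constructed in this section. One way it could be built is sketched here, assuming a plain dictionary of lists as the input; the original construction isn't shown, so this is only a guess that reproduces the displayed table.

```python
import pandas as pd

# Assumed construction: each dictionary key becomes a column name,
# and each list holds that column's values, one per row.
df = pd.DataFrame(
    {
        "first_name": ["Ethan", "Brad", "Jay", "Gus"],
        "last_name": ["Swan", "Boehmke", "Cunningham", "Powers"],
    }
)
```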
df
| | first_name | last_name |
|---|---|---|
| 0 | Ethan | Swan |
| 1 | Brad | Boehmke |
| 2 | Jay | Cunningham |
| 3 | Gus | Powers |
df.shape
(4, 2)
df.describe()
| | first_name | last_name |
|---|---|---|
| count | 4 | 4 |
| unique | 4 | 4 |
| top | Ethan | Swan |
| freq | 1 | 1 |
You can see that DataFrames have a shape variable and a describe function inside of them, both accessible through dot notation.
Note: Variables inside an object are often called attributes and functions inside objects are called methods.
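The same distinction shows up on objects from the standard library. A minimal sketch using a date object (the date itself is arbitrary):

```python
from datetime import date

d = date(2024, 1, 15)  # arbitrary example date

# Attribute: data stored on the object, accessed without parentheses.
year = d.year
print(year)

# Method: a function attached to the object, called with parentheses.
iso_text = d.isoformat()
print(iso_text)
```

A quick rule of thumb: if you need parentheses to get a result, you are calling a method; if not, you are reading an attribute.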
Using the math library, find a function f such that f(3) => 3, but f(3.2) => 4 and f(3.7) => 4.
One of the great things about Python is that its creators really cared about internal consistency.
What that means to us, as users, is that syntax is consistent and predictable -- even across uses that appear unrelated at first.
Dot notation reveals something kind of cool about Python: packages are just like other objects, and the variables inside them are just attributes and methods!
This standardization across packages and objects helps us remember a single, intuitive syntax that works for many different things.
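One way to see this for yourself is sketched below. It uses the built-in getattr and hasattr functions, which work on any object, and applies them to both a module and a string:

```python
import math
import types

# An imported module is itself an object, of the module type.
module_is_object = isinstance(math, types.ModuleType)
print(module_is_object)

# getattr looks up a name inside an object; math.pi is shorthand for this.
same_value = getattr(math, "pi") == math.pi
print(same_value)

# hasattr works the same way on a module and on an ordinary string.
print(hasattr(math, "sqrt"))
print(hasattr("some text", "upper"))
```

All four lines print True: the lookup rules are identical whether the thing on the left of the dot is a package or any other object.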
Are there any questions before we move on?