Everything in Python is an object, and every object has a type.
Let's review the most important ones.
Integers – Whole Numbers
i = 3
i
3
Floats – Decimal Numbers
f = 3.4
f
3.4
Strings – Bits of Text
s = 'python'
s
'python'
Lists – Ordered collections of other Python objects
l = ['a', 'b', 'c']
l
['a', 'b', 'c']
Dictionaries – Collections of key-value pairs, which let you easily look up the value for a given key
d = {'a': 1,
     'b': 2,
     'z': 26}
d
{'a': 1, 'b': 2, 'z': 26}
DataFrames – Tabular datasets, part of the Pandas library
import pandas as pd
df = pd.DataFrame([(1, 2), (3, 4)], columns=['x', 'y'])
df
   x  y
0  1  2
1  3  4
The type Function
You can use the type function to determine the type of an object.
x = [1, 2, 3]
type(x)
list
x = 'hello'
type(x)
str
Packages (a term generally synonymous with modules or libraries) are extensions for Python that provide additional useful code.
Some are included in every Python install ("standard library"), while others (like Pandas, matplotlib, and more) need to be installed separately ("third party packages").
The DataFrame type, a staple of data science, comes from the Pandas package.
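For example, the standard library's math module works out of the box, while Pandas must first be installed (e.g. with pip install pandas):
# math is part of the standard library -- no installation needed
import math
math.sqrt(16)
4.0
# Pandas is a third-party package and must be installed before importing
import pandas as pd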
Functions are executable Python code stored in a name, just like a regular variable.
You can call a function by putting parentheses after its name, optionally including arguments to it (e.g. myfunction(argument_1, argument_2)).
Well-named functions can help to simplify your code and make it much more readable.
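As a minimal sketch, here's a hypothetical to_celsius function being defined and then called:
# Define a function that converts a temperature from Fahrenheit to Celsius
def to_celsius(fahrenheit):
    return (fahrenheit - 32) * 5 / 9

# Call it by putting parentheses (and an argument) after its name
to_celsius(212)
100.0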
Python objects (that's everything in Python, remember?) come with attributes, or internal information accessible through dot syntax:
myobject.attribute
Attributes can be handy when you want to learn more about an object.
df.shape
(2, 2)
Some attributes actually hold functions, in which case we call them methods.
df.describe()
              x         y
count  2.000000  2.000000
mean   2.000000  3.000000
std    1.414214  1.414214
min    1.000000  2.000000
25%    1.500000  2.500000
50%    2.000000  3.000000
75%    2.500000  3.500000
max    3.000000  4.000000
When you extract individual rows or columns of DataFrames, you get a 1-dimensional dataset called a Series.
Series look like lists but their data must be all of the same type, and they provide similar (though subtly different) functionality to DataFrames.
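For instance, pulling a single column out of the df from earlier (using the selection syntax covered below) gives a Series:
df['x']
0    1
1    3
Name: x, dtype: int64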
Importing data is the process of taking data on disk and moving it into memory, where Python can do its work.
Reading CSVs will likely be one of the most common ways you import data.
To do so, use Pandas' read_csv function, passing the name of your file as an argument.
import pandas as pd
data = pd.read_csv('myfile.csv')
Though they are less common in data science, JSON and pickle files may come up in your work as well.
These are slightly more complicated to import, but it's still very doable.
JSON:
import json
with open('myfile.json', 'r') as f:
    data = json.load(f)
Pickle:
import pickle
with open('myfile.pickle', 'rb') as f:
    data = pickle.load(f)
There are three primary ways of subsetting data: selection, slicing, and filtering.
Selection is done with brackets. Pass a single column name (as a string) or a list of column names.
# The column "mycolumn", as a Series
df['mycolumn']
# The columns "column_1" and "column_2", as a DataFrame
df[['column_1', 'column_2']]
Note
If you pass a list, the returned value will be a DataFrame. If you pass a single column name, it will be a Series.
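You can verify this with the type function from earlier (assuming df actually has these columns):
type(df['mycolumn'])
pandas.core.series.Series
type(df[['column_1', 'column_2']])
pandas.core.frame.DataFrame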
Slicing is typically done with the .loc accessor and brackets. Pass in a row index or a range of row indices.
# The fifth (zero-indexing!) row, as a Series
df.loc[4]
# The second, third, and fourth rows, as a DataFrame
df.loc[1:3]
Note
If you pass a range of indices, the returned value will be a DataFrame. Otherwise it will be a Series. Note that unlike regular Python slicing, .loc ranges include both endpoints, which is why 1:3 above returns three rows.
DataFrames can be filtered by passing a condition in brackets.
# Keep rows where `condition` is true
df[condition]
Conditions are things like tests of equality, assertions that one value is greater than another, etc.
# Keep rows where the value in "mycolumn" is equal to 5
df[df['mycolumn'] == 5]
# Keep rows where mycolumn is less than 3 OR greater than 10
df[ (df['mycolumn'] < 3) | (df['mycolumn'] > 10) ]
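Conditions can also be combined with & (AND); note that each condition must be wrapped in parentheses:
# Keep rows where mycolumn is between 3 and 10, inclusive
df[ (df['mycolumn'] >= 3) & (df['mycolumn'] <= 10) ]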
Using .loc, it's possible to do selecting and filtering all in one step.
# Filter down to rows where column_a is equal to 5,
# and select column_b and column_c from those rows
df.loc[df['column_a'] == 5, ['column_b', 'column_c']]
It's possible to perform calculations using columns.
df['mycolumn'] + 7
df['mycolumn'] * 4 - 3
It's also possible to perform calculations based on values in multiple columns.
df['column_a'] / df['column_b']
Generally you'll want to save the calculated values in a new column, which you can do with the familiar assignment syntax.
df['e'] = df['m'] * (df['c'] ** 2)
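For instance, with the small df from above, a hypothetical total column could be created like so:
# A new column holding the sum of x and y
df['total'] = df['x'] + df['y']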
Lots of string functionality can be found within the .str accessor.
# Convert the strings in mycolumn to all caps
df['mycolumn'].str.upper()
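Another handy .str method is contains, which returns a boolean Series and so pairs naturally with filtering:
# Keep rows where the text in mycolumn contains the letter "a"
df[df['mycolumn'].str.contains('a')]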
In some cases you may need to convert certain values into others. This is a good case for the .map method of Series.
Pass in a dictionary whose keys are the elements to be converted and whose values are the desired new values.
df
   x  y
0  1  2
1  3  4
df['x'] = df['x'].map({1: 11, 3: 33})
df
    x  y
0  11  2
1  33  4
Exercises
1. Import the weather data (weather.csv) from the data folder of our repository. Store it in a variable called weather.
2. Filter the data down to rows where it rained (precip > 0).
3. Calculate a new value from the data: wind_speed / 2 + visib.
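One possible solution sketch, assuming the file lives at data/weather.csv and has precip, wind_speed, and visib columns:
import pandas as pd

# 1. Import the weather data
weather = pd.read_csv('data/weather.csv')

# 2. Keep only the rows where it rained
rainy = weather[weather['precip'] > 0]

# 3. A derived value combining wind speed and visibility
weather['wind_speed'] / 2 + weather['visib']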
Are there any questions before we move on?