Review of Week 1¶

Fundamentals¶

Data Types¶

Everything in Python is an object, and every object has a type.

Let's review the most important ones.

Integers – Whole Numbers

In [1]:
i = 3
i
Out[1]:
3

Floats – Decimal Numbers

In [2]:
f = 3.4
f
Out[2]:
3.4

Strings – Bits of Text

In [3]:
s = 'python'
s
Out[3]:
'python'

Lists – Ordered collections of other Python objects

In [4]:
l = ['a', 'b', 'c']
l
Out[4]:
['a', 'b', 'c']

Dictionaries – A collection of key-value pairs, which let you easily look up the value for a given key

In [5]:
d = {'a': 1,
     'b': 2,
     'z': 26}
d
Out[5]:
{'a': 1, 'b': 2, 'z': 26}

DataFrames – Tabular datasets, part of the Pandas library

In [6]:
import pandas as pd
df = pd.DataFrame([(1, 2), (3, 4)], columns=['x', 'y'])
df
Out[6]:
   x  y
0  1  2
1  3  4

The type Function¶

You can use the type function to determine the type of an object.

In [7]:
x = [1, 2, 3]
type(x)
Out[7]:
list
In [8]:
x = 'hello'
type(x)
Out[8]:
str

Packages, Modules, and Functions¶

Packages¶

Packages (a term often used interchangeably with modules or libraries) are collections of reusable code that extend Python.

Some are included in every Python install ("standard library"), while others (like Pandas, matplotlib, and more) need to be installed separately ("third party packages").

The DataFrame type, a staple of data science, comes from the Pandas package.
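As a quick illustration of the difference, a standard-library module can be imported right away, while a third-party package like Pandas must be installed first (for example with pip install pandas):

```python
# math ships with every Python install (standard library)
import math
math.sqrt(16)  # 4.0

# pandas is a third-party package and must be installed separately
# (e.g. `pip install pandas`) before it can be imported
import pandas as pd
pd.DataFrame  # the DataFrame type lives in this package
```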

Functions¶

Functions are reusable blocks of executable Python code bound to a name, just like a regular variable.

You can call a function by putting parentheses after its name, and optionally including arguments to it (e.g. myfunction(argument_1, argument_2)).

Well-named functions can help to simplify your code and make it much more readable.
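For instance, here is a minimal sketch of defining and then calling a function (the name greet is just an illustration, not something from the lesson's code):

```python
# Define a function and bind it to the name `greet`
def greet(name):
    return 'Hello, ' + name + '!'

# Call it by putting parentheses after its name, passing one argument
greet('Python')  # 'Hello, Python!'
```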

Attributes and Methods¶

Python objects (that's everything in Python, remember?) come with attributes, or internal information accessible through dot syntax:

myobject.attribute

Attributes can be handy when you want to learn more about an object.

In [9]:
df.shape
Out[9]:
(2, 2)

Some attributes actually hold functions, in which case we call them methods.

In [10]:
df.describe()
Out[10]:
              x         y
count  2.000000  2.000000
mean   2.000000  3.000000
std    1.414214  1.414214
min    1.000000  2.000000
25%    1.500000  2.500000
50%    2.000000  3.000000
75%    2.500000  3.500000
max    3.000000  4.000000

DataFrames and Series¶

When you extract individual rows or columns of DataFrames, you get a 1-dimensional dataset called a Series.

Series look like lists, but all of their elements must share a single type, and they offer functionality similar (though subtly different) to that of DataFrames.
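To make this concrete, extracting a column from the df built earlier yields a Series whose elements all share one dtype:

```python
import pandas as pd

df = pd.DataFrame([(1, 2), (3, 4)], columns=['x', 'y'])

# Extracting one column gives a 1-dimensional Series
col = df['x']
type(col)  # pandas.core.series.Series

# All of a Series' elements share a single dtype (an integer type here)
col.dtype
```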

Importing Data¶

Importing data is the process of reading data from disk into memory, where Python can work with it.

Reading CSVs will likely be one of the most common ways you import data.

To do so, use Pandas' read_csv function, passing the name of your file as an argument.

import pandas as pd
data = pd.read_csv('myfile.csv')

Though they are less common in data science, JSON and pickle files may come up in your work as well.

These are slightly more complicated to import, but it's still very doable.

JSON:

import json
with open('myfile.json', 'r') as f:
    data = json.load(f)

Pickle:

import pickle
with open('myfile.pickle', 'rb') as f:
    data = pickle.load(f)

Subsetting and Filtering¶

There are three primary ways of subsetting data:

  • Selecting - Including certain columns of the data while excluding others
  • Slicing - Including only certain rows based on their position (index) in the DataFrame
  • Filtering - Including only certain rows with data that meets some criterion

Selecting¶

Selection is done with brackets. Pass a single column name (as a string) or a list of column names.

# The column "mycolumn", as a Series
df['mycolumn']

# The columns "column_1" and "column_2", as a DataFrame
df[['column_1', 'column_2']]

Note

If you pass a list, the returned value will be a DataFrame. If you pass a single column name, it will be a Series.
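A quick check with the type function (reusing the small df from earlier) makes the difference concrete:

```python
import pandas as pd

df = pd.DataFrame([(1, 2), (3, 4)], columns=['x', 'y'])

# A single column name returns a Series
type(df['x'])    # pandas.core.series.Series

# A list of names (even a one-element list) returns a DataFrame
type(df[['x']])  # pandas.core.frame.DataFrame
```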

Slicing¶

Slicing is typically done with the .loc accessor and brackets. Pass in a single row label or a range of row labels. (With the default integer index, labels line up with row positions.)

# The fifth (zero-indexing!) row, as a Series
df.loc[4]

# The second, third, and fourth rows, as a DataFrame
# (unlike list slicing, .loc includes both endpoints)
df.loc[1:3]

Note

If you pass a range of indices, the returned value will be a DataFrame. Otherwise it will be a Series.
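A small sketch of both behaviors, using a throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]})

# A range of labels returns a DataFrame; note that .loc includes
# BOTH endpoints, unlike regular Python list slicing
df.loc[1:2]       # the rows labeled 1 and 2
len(df.loc[1:2])  # 2

# A single label returns a Series
df.loc[1]
```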

Filtering¶

DataFrames can be filtered by passing a condition in brackets.

# Keep rows where `condition` is true
df[condition]

Conditions are boolean expressions: tests of equality, comparisons like greater-than or less-than, and so on.

# Keep rows where the value in "mycolumn" is equal to 5
df[df['mycolumn'] == 5]

# Keep rows where mycolumn is less than 3 OR greater than 10
df[(df['mycolumn'] < 3) | (df['mycolumn'] > 10)]

Selecting and Filtering Together¶

Using .loc, it's possible to do selecting and filtering all in one step.

# Filter down to rows where column_a is equal to 5,
# and select column_b and column_c from those rows
df.loc[df['column_a'] == 5, ['column_b', 'column_c']]

Manipulating Columns¶

Numeric Calculations¶

It's possible to perform calculations using columns.

df['mycolumn'] + 7
df['mycolumn'] * 4 - 3

It's also possible to perform calculations based on values in multiple columns.

df['column_a'] / df['column_b']

Generally you'll want to save the calculated values in a new column, which you can do with ordinary assignment syntax.

df['e'] = df['m'] * (df['c'] ** 2)

String Manipulations¶

Lots of string functionality can be found within the .str accessor.

# Convert the strings in mycolumn to all caps
df['mycolumn'].str.upper()

Mapping Values¶

In some cases you may need to convert some values to other values.

This is a good case for the .map method of Series.

Pass in a dictionary whose keys are the elements to be converted and whose values are the desired new values. (Elements without a matching key become NaN, so include every value you want to keep.)

In [11]:
df
Out[11]:
   x  y
0  1  2
1  3  4
In [12]:
df['x'] = df['x'].map({1: 11, 3: 33})
df
Out[12]:
    x  y
0  11  2
1  33  4

Practice¶

  1. Load the weather data (weather.csv) from the data folder of our repository. Store it in a variable called weather.
  2. Keep only the rows that have precipitation (i.e. precip > 0).
  3. Create a new column, "air_hazard_rating", that is wind_speed / 2 + visib.
  4. Keep only the "origin" and "time" columns.

Questions¶

Are there any questions before we move on?