Advanced Python for Data Science¶

Gus Powers and Jay Cunningham

January 2023

Introductions¶

Gus Powers¶

Lead Data Scientist at 84.51°

  • Creating and maintaining data science tools for internal use
  • Python, Bash (shell), & R

Academic

  • BS, Chemistry, Thomas More College
  • MS, Chemistry, University of Cincinnati
  • MS, Business Analytics, University of Cincinnati

Contact

  • GitHub: augustopher
  • LinkedIn: August Powers
  • Email: guspowers0@gmail.com

Jay Cunningham¶

Lead Data Scientist at 84.51°

  • Researching and developing forecasting models
  • Machine learning, Python

Academic

  • BA, Mathematics, University of Kentucky
  • MA, Economics, University of North Carolina (Greensboro)

Contact

  • LinkedIn: Jay Cunningham
  • Email: james@notbadafterall.com

Your Turn¶

We'll go around the room. Please share:

  • Your name
  • Your job or field
  • How you use Python now or would like to in the future

Course¶

Course Objectives¶

The following are the primary learning objectives of this course:

  • Develop an intuition for the machine learning workflow and Python tooling.
  • Build familiarity with common software engineering tooling and methodologies for implementing a machine learning project.
  • Gain hands-on experience with the tools and processes discussed through applied case study work.

Course Agenda¶

Day  Topic                               Time
1    Introductions                       12:45 - 1:00
     Setting the Stage                   1:00 - 1:15
     Git & version control               1:15 - 2:00
     Break                               2:00 - 2:15
     EDA & Our First scikit-learn Model  2:15 - 3:45
     Q&A                                 3:45 - 4:15
2    Q&A                                 12:45 - 1:00
     Modular Code                        1:00 - 2:00
     Feature Engineering                 2:00 - 3:00
     Break                               3:00 - 3:15
     Case Study, pt. 1                   3:15 - 4:00
     Q&A                                 4:00 - 4:15
3    Q&A                                 12:45 - 1:00
     Model Evaluation & Selection        1:00 - 2:15
     Break                               2:15 - 2:30
     More on Modular Code                2:30 - 3:15
     Unit Tests                          3:15 - 4:00
     Q&A                                 4:00 - 4:15
4    Q&A                                 12:45 - 1:00
     More on Unit Tests                  1:00 - 1:30
     ML lifecycle management             1:30 - 2:30
     Break                               2:30 - 2:45
     Case Study, pt. 2                   2:45 - 3:45
     Case Study Review, pt. 2 and Q&A    3:45 - 4:15

Course Philosophy¶

Beginners typically need the instructor to make connections and solve problems for them.

Why is this code not running? What types of real world problems could I use this package for?

But we believe that you, as intermediate to advanced users, are more capable of seeing those connections yourselves. Instead of diving into details and working through small code examples, this advanced workshop takes a slightly different approach. We will:

  • Give you an overview of the tools you might need to solve a problem. We can't teach you machine learning in just two days, but we can give you a foundation. And as experienced coders, you'll be able to fill in the details yourselves when the time comes to use these tools.
  • Explain more of the intuition behind tools and techniques. Beginners can't yet see the forest for the trees -- they are caught up in small problems and not yet ready to understand the big picture. In this class we will talk more about the general design patterns of Python and its libraries, in a way that should help you learn them instead of simply memorizing functions.
  • Expect you to help yourself. We'll still be here to answer questions and help with hard problems, but the mark of an experienced programmer is that they consult references often (Google, documentation, etc.) and can find answers there. You'll need to do that during this course and afterward when you apply the techniques we discuss.

Prerequisites¶

Python¶

  • If you're attending this class, it's assumed you're comfortable with the material covered in the Introduction to Python for Data Science and Intermediate Python for Data Science classes.
  • At a very high level, those courses covered:
    • Importing data into and exporting data out of Python, via Pandas
    • Wrangling data in Python with Pandas
    • Basics of visualization with Seaborn
    • Control flow
    • Writing functions
    • Conda environments
    • Running Python outside of Jupyter notebooks
    • Basics of modeling with scikit-learn

Jupyter¶

  • If you're attending this class, it's assumed you're comfortable with launching and using Python via Jupyter Notebooks -- and ideally outside of Jupyter as well.
  • Course materials (slides, case studies, etc.) will be in Jupyter Notebooks, but you're free to use your IDE of choice when completing exercises and case studies.

Technology Setup¶

  • Unlike our other courses, Advanced Python is not designed with Binder in mind.
  • This means that you'll need to use your personal laptop to run today's code.
  • Why? We're going to be working with bigger data and more computationally intensive algorithms, for which Binder is not well equipped.
    • In an industry setting, using these techniques would best be done on a server, not a personal computer.
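
Before running the heavier examples, it can help to know what your laptop offers. The sketch below uses only the standard library to report the Python version, operating system, and CPU count. The function name `describe_machine` is our own, chosen for illustration, and the report leaves out memory, which the standard library does not expose in a portable way.

```python
import os
import platform
import sys


def describe_machine():
    """Summarize the Python version, operating system, and CPU count."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "system": platform.system(),
        "cpu_count": os.cpu_count(),  # can be None if it cannot be determined
    }


for key, value in describe_machine().items():
    print(f"{key}: {value}")
```

The CPU count matters because many scikit-learn estimators accept an `n_jobs` argument that controls how many cores they use.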

Anaconda¶

  • Anaconda is one of the easiest ways to install Python 3 and Jupyter together.
  • If you have not yet installed Anaconda, please follow the directions in the course README.
  • Be sure that all Python packages listed in the environment.yaml are installed. See here for instructions on creating a Conda environment from an environment.yaml file.
  • This Anaconda installation will not be able to natively display the course content as slides, but I recommend using it for completing exercises and the case studies.
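
One way to confirm the environment is ready is to check, from inside the activated environment, that each package can be imported. The sketch below is a minimal example using only the standard library. The names in `REQUIRED` are assumptions drawn from the prerequisites above, not the actual contents of the course's environment.yaml, so replace them with the packages listed in that file.

```python
from importlib.util import find_spec

# Hypothetical list: replace with the import names of the packages in
# the course's environment.yaml. Import names can differ from install
# names (scikit-learn, for example, is imported as "sklearn").
REQUIRED = ["pandas", "seaborn", "sklearn"]


def missing_packages(import_names):
    """Return the top-level import names that cannot be found."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

If anything is reported as missing, the usual cause is that the course environment was not created or not activated before starting Python.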

JupyterLab¶

  • If you took the introductory and/or intermediate courses, you may have used Jupyter Notebooks to write Python.
  • JupyterLab is Project Jupyter's next-generation interface. It opens the same notebook files and adds features such as a file browser, terminals, and a tabbed multi-document layout.
  • The classic Notebook interface has not gone away, but most new development now happens in JupyterLab.
  • I recommend using JupyterLab today even if you haven't used it before -- it comes packaged with Anaconda and should feel very familiar!

Course Materials¶

  • All of the material for this course can be reached from the GitHub repository.
  • The repository contains the slides and notebooks.
  • You should download the material -- available via this link -- and open it via Anaconda Navigator and Jupyter Notebooks/Lab.

Slides are Notebooks¶

  • We'll be showing the material in slide format most of the time.
  • These slides contain the same content as your notebooks, so you can follow along and run cells as we go.
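
Slides and notebooks can share content because a notebook is a JSON file whose cells may carry slideshow metadata. The sketch below is an illustration, not the course's own tooling: it builds a tiny, made-up notebook in memory and lists each cell's slide type. It assumes the common convention, used by nbconvert's slides exporter and by RISE, of storing the type under `metadata.slideshow.slide_type`. Cells without that key are reported as `-`.

```python
import json

# A made-up, minimal notebook for illustration. Real notebooks carry
# more fields, such as kernel metadata and cell outputs.
EXAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {"slideshow": {"slide_type": "slide"}},
            "source": ["# Course Objectives"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print('hello')"],
            "outputs": [],
            "execution_count": None,
        },
    ],
})


def slide_types(notebook_json):
    """Return (cell_type, slide_type) for every cell in a notebook."""
    notebook = json.loads(notebook_json)
    return [
        (
            cell["cell_type"],
            cell.get("metadata", {}).get("slideshow", {}).get("slide_type", "-"),
        )
        for cell in notebook["cells"]
    ]


for cell_type, slide_type in slide_types(EXAMPLE_NOTEBOOK):
    print(f"{cell_type:<8} {slide_type}")
```

To inspect a real notebook, read the `.ipynb` file's text and pass it to `slide_types` in place of `EXAMPLE_NOTEBOOK`.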

Source Code¶

  • Source code for the training can be found on GitHub.
  • This repository is public, so you can clone (download) it and refer to the materials at any point in the future.

Questions¶

Are there any questions before moving on?