Introduction to Programming, Web Scraping, and Data Cleaning in Python

Wednesday, June 3, 2020

All Day

A four-week workshop on the Python programming language, web scraping, and data cleaning, intended to increase the data fluency of graduate students and postdocs.

Weekdays of June 1-26, from 10am to 3pm

The Center for Computation and Visualization (CCV) is offering an introductory workshop on the Python programming language, web scraping, and data cleaning, targeted toward graduate students and postdocs in the humanities and social sciences. Python is a versatile computer programming language, useful in many contexts including collecting, cleaning, analyzing, and visualizing data. No experience with coding is required. Faculty and staff members can register if space is available.

In the first two weeks of the workshop, we will learn Python basics with topics covering:

  • variables and container types such as lists, dictionaries, and arrays,
  • control flow techniques such as if and while statements, for loops, and list comprehensions,
  • functions to manipulate data,
  • simple speed and memory profiling,
  • and visualizations using matplotlib.
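
To give a sense of the first two weeks, here is a minimal sketch that touches several of the topics above: a dictionary, a function built on a list comprehension, a for loop with an if statement, and simple timing. The data and the function name `long_documents` are made up for illustration and are not taken from the workshop materials.

```python
import time

# Hypothetical sample data: word counts for three made-up documents.
word_counts = {"doc_a": 120, "doc_b": 45, "doc_c": 300}


def long_documents(counts, threshold):
    """Return the names of documents with more than `threshold` words."""
    return [name for name, n in counts.items() if n > threshold]


# A for loop with an if statement, accumulating a running total.
total = 0
for name, n in word_counts.items():
    if n > 100:
        total += n

# Simple speed profiling with a wall-clock timer.
start = time.perf_counter()
squares = [i * i for i in range(100_000)]
elapsed = time.perf_counter() - start

print(long_documents(word_counts, 100))
print(total)
print(f"Built {len(squares)} squares in {elapsed:.4f} seconds")
```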

In the third week, we will use web scraping to collect data from the web. Web scraping lets us automate data collection from websites with differing underlying formats. Topics will include:

  • the basics of web page structures (HTML, CSS),
  • inspecting the page source underlying a web page using developer tools in Google Chrome,
  • the fundamentals of scraping several different web pages of varying complexity,
  • controlling crawl rates and monitoring the scraping loop,
  • and scraping a multi-page web query into a dataframe.
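
The scraping loop described above might look roughly like the sketch below. To stay self-contained it parses made-up HTML strings with Python's standard-library `html.parser` instead of fetching live pages; the page contents, the `title` class name, and the helper names `TitleParser` and `scrape_pages` are all assumptions for illustration. The workshop itself may use different libraries.

```python
import time
from html.parser import HTMLParser

# Made-up HTML standing in for two pages of query results.
PAGES = [
    '<ul><li class="title">First result</li><li class="title">Second result</li></ul>',
    '<ul><li class="title">Third result</li><li class="other">Ad</li></ul>',
]


class TitleParser(HTMLParser):
    """Collect the text of every <li class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "li" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


def scrape_pages(pages, delay_seconds=0.1):
    """Parse each page in turn, pausing between pages to limit the crawl rate."""
    rows = []
    for page_number, html in enumerate(pages, start=1):
        parser = TitleParser()
        parser.feed(html)
        rows.extend({"page": page_number, "title": t} for t in parser.titles)
        print(f"Page {page_number}: {len(parser.titles)} titles")  # monitor progress
        time.sleep(delay_seconds)  # pause so a real server is not overloaded
    return rows


rows = scrape_pages(PAGES)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to obtain a dataframe with one row per scraped item.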

In the fourth week, attendees will learn to clean and process the scraped data using the pandas library. Topics will include:

  • reading in and manipulating the scraped data from CSV and Excel files and SQL databases,
  • filtering and modifying the scraped data,
  • visualizing the data,
  • and calculating summary statistics.
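
As a rough illustration of these steps, the sketch below reads a small table, filters out incomplete rows, adds a derived column, and computes a summary statistic with pandas. The CSV text and the column names (`title`, `price`, `year`) are invented for the example; in practice the data would come from a file or database.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file of scraped results.
csv_text = """title,price,year
Book A,12.50,2018
Book B,,2019
Book C,8.00,2018
"""

df = pd.read_csv(io.StringIO(csv_text))

# Filter: drop rows with a missing price.
clean = df.dropna(subset=["price"]).copy()

# Modify: add a column derived from an existing one.
clean["price_cents"] = (clean["price"] * 100).round().astype(int)

# Summary statistics: mean price per year.
mean_by_year = clean.groupby("year")["price"].mean()

print(clean)
print(mean_by_year)
```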

The workshop runs June 1-26. Each day will consist of a morning lecture (10am-noon), lunch, and hands-on exercises (1pm-3pm). Participants will have the opportunity to work on their own data-intensive projects with help from CCV data scientists during the third and fourth weeks. The workshop is supported by the Data Science Initiative’s NSF-TRIPODS grant.

Space is limited to 20 people.