Empirical Research in the Age of AI

Course Overview

This half-course focuses on the often-invisible infrastructure of empirical research: data collection, conceptualization, and validation. Most PhD programs don't teach these skills systematically, yet researchers spend tremendous effort on these exact tasks—determining whether needed data exists, assessing collection feasibility, building robust pipelines, and validating results.

We're at an inflection point: AI is making powerful data tools dramatically more accessible, rapidly accelerating the hidden work behind empirical research. I'll share approaches from my own research practice, and you'll apply them directly to your projects during the course. The goal isn't just learning tools, but understanding how modern big data technologies (web scraping, databases, data warehouses, LLMs, cloud clusters) reshape what questions we can feasibly answer. By the end of the course, I hope you will have a strong start on your own data pipeline for a setting that interests you.

Because this is a one-off course, I'm tailoring it to your interests. Whether you're collecting social media data, administrative records, or text corpora, we'll focus on making you more efficient at the data work that actually consumes your research time.

Prerequisites

Students should have:

Learning Objectives

By course completion, you will:

  1. Assess data availability and collection feasibility during research conceptualization
  2. Build production-quality data pipelines using modern infrastructure
  3. Master exploratory data analysis and validation techniques
  4. Use LLMs as research assistants for data labeling, extraction, and validation (see the sketch after this list)
  5. Navigate the ecosystem of big data tools (APIs, databases, cloud computing)
  6. Develop systematic approaches to the "hidden" work of empirical research
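
To give a flavor of objective 4, here is a minimal sketch of LLM-assisted data labeling. Everything in it is illustrative rather than course material: it assumes the OpenAI Python SDK with an API key configured, and the model name, sentiment task, and label set are hypothetical stand-ins for whatever your own project needs.

```python
# Minimal sketch of LLM-assisted data labeling (objective 4).
# Assumptions: the OpenAI Python SDK (openai>=1.0) is installed and
# OPENAI_API_KEY is set; the sentiment task, model name, and label set
# are illustrative only, not prescribed by the course.
from openai import OpenAI

client = OpenAI()

LABELS = {"positive", "negative", "neutral"}  # hypothetical label set


def label_text(text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model for a single label; batching and retries omitted."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the sentiment of the user's text. "
                    "Reply with exactly one word: positive, negative, or neutral."
                ),
            },
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep labeling output as stable as possible
    )
    label = response.choices[0].message.content.strip().lower()
    # Validate the output: model text is data to be checked, not ground truth.
    if label not in LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return label


if __name__ == "__main__":
    print(label_text("The new data pipeline saved us weeks of manual work."))
```

The final validation check reflects a running theme of the course: LLM-generated labels are another data source to be validated, not an answer key.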