This half-course focuses on the often-invisible infrastructure of empirical research: data collection, conceptualization, and validation. Most PhD programs don't teach these skills systematically, yet researchers spend tremendous effort on exactly these tasks: determining whether the data they need exists, assessing the feasibility of collecting it, building robust pipelines, and validating the results.
We're at an inflection point: AI is making powerful data tools dramatically more accessible, rapidly accelerating this hidden work behind research. I'll share approaches from my own research practice, and you'll apply them directly to your projects during the course. The goal isn't just learning tools, but understanding how modern big data technologies (web scraping, databases, data warehouses, LLMs, cloud clusters) reshape what questions we can feasibly answer. By the end of the course, I hope you'll have a strong start on your own data pipeline for a setting that interests you.
Since this is a one-off course, I'm tailoring it to your interests. Whether you're collecting social media data, administrative records, or text corpora, we'll focus on making you more efficient at the data work that actually consumes your research time.
Students should have:
By course completion, you will: