Prediction, Inference, and Causality (Fall 2024)

QTM 285-1 with David A. Hirshberg

Description

This class is a modern, mathematically rigorous introduction to statistical modeling and data-driven decision-making that provides a foundation for upper-level classes in the department. We will focus on prediction (using data we have to tell us something about data we don't), statistical inference (characterizing the uncertainty we have about the accuracy of these predictions), and causal inference (understanding what the relationships we see in the data tell us about the impact of actions we might take). While the class will emphasize the intuitive and mathematical foundations of these concepts, we will also cover the implementation of these techniques in applications using the statistical programming language R.

This class will be accepted in place of QTM 220 as a prerequisite and there is substantial overlap in content between the two. However, there will be a greater emphasis on the precision with which we move from stories and intuition to formal mathematical reasoning (and back) in this course. Data visualization, including sketching by hand and plotting in R, will be emphasized as a tool for making this connection. Being precise about how and why our methods work makes it easier to adapt them to answer new questions and work with new types of data. We'll practice this important skill often, e.g. by looking into how to make predictions that tell us about what happens in 90% of cases rather than in the typical case (quantile regression) and how to compare different treatments for a terminal illness using time-to-event data (survival analysis).

Meeting Times
Class will meet on Mondays and Wednesdays from 4:30-5:15 in PAIS 230 and Fridays 2:30-3:20 across the hall in PAIS 220.

I will hold office hours on Mondays from 2:00-4:00 in PAIS 583 excepting university holidays.

Goals
By the end of this course, students will be able to do the following.
  1. Formulate research questions in plain language and mathematical terms, expressing clearly what they want to know (the estimand) and what group they want to know it about (the population). For questions about causality, this'll involve potential outcomes, a formalism for thinking about populations that differ in some way---e.g. in who received what treatment---from the population that actually exists.
  2. Estimate the estimand by training a predictive model using a random sample from the population and appropriately combining its predictions, relying---where necessary---on randomization of treatment to get estimates with meaningful causal interpretations.
  3. Characterize the error of this estimate, using confidence intervals to quantify uncertainty due to random sampling and discussing, in plain language, possible sources of bias and what both mean for the accuracy of their estimate and the coverage of their confidence interval.
  4. Interpret others' work in the same terms, e.g. identifying the estimand, population, and sample; describing the predictive model used and how its predictions are combined to form an estimate; summarizing their claims about accuracy; finding possible sources of error; and describing how these errors might affect the validity of the study's conclusions in mathematical terms and in plain language.
  5. Demonstrate proficiency in statistics-relevant programming skills in the R language, including data simulation, modeling, and visualization.
  6. Demonstrate proficiency in the skills of statistical communication with both expert and lay audiences, including the use of mathematical terminology or analogies as appropriate and sketches and approximations to capture the essence of what's happening in a complex analysis.
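
A minimal sketch of goals 1 through 3 on a made-up example, written in Python for illustration (the class itself uses R). The population, the sample size, and the function name `mean_with_ci` are all hypothetical. The interval uses the normal approximation with z = 1.96, which is only one of the calibration approaches listed on the schedule.

```python
import math
import random
import statistics


def mean_with_ci(sample, z=1.96):
    """Point estimate of a population mean with a normal-approximation interval."""
    n = len(sample)
    estimate = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return estimate, (estimate - z * standard_error, estimate + z * standard_error)


# Hypothetical population: 10,000 binary outcomes, 30% of which are 1.
rng = random.Random(285)
population = [1] * 3000 + [0] * 7000

# Goal 1: the estimand is the population mean, i.e. the proportion of ones.
estimand = statistics.fmean(population)

# Goal 2: estimate it from a random sample (drawn with replacement).
sample = rng.choices(population, k=400)

# Goal 3: attach an interval that quantifies sampling uncertainty.
estimate, (lower, upper) = mean_with_ci(sample)
print(f"estimand={estimand:.3f} estimate={estimate:.3f} interval=({lower:.3f}, {upper:.3f})")
```

In a real analysis we would not get to see the estimand. It is computed here only so the estimate and interval can be compared against it.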

Background Knowledge
While this class is designed to be fairly self-contained, it's expected that you have some familiarity with probability, calculus, and linear algebra. Throughout the semester, we will be talking about random variables and their distributions, calculating expected values and variances, and doing all of this, at times, while conditioning on other random variables. We'll review these topics in detail when they come up, but if you've not seen them before or have forgotten most of what you've learned, our coverage may be a bit fast-paced. Toward the end of the semester, we'll have a few classes that'll draw on your calculus and linear algebra skills, e.g. where we'll need to take a derivative or an integral or work out a basis for a vector space. We'll do a quick review when these come up and try to make sure that, even if your skills are a bit rusty, the intuition is solid enough that you can follow what's happening.

You'll need some basic R programming skills, too. You'll be reading and reusing code rather than writing it from scratch, so if you're not familiar with R but have experience programming in another language, you should be fine. Like with the math skills, we'll review as we go.

Formally, the prerequisites are the same as for QTM 220: either QTM 210, Econ 220, or Math 361 and 362 for probability; Math 210 or 211 for calculus; Math 221 for linear algebra; and QTM 150 for R programming. Or equivalents. If you're missing some of these but willing to work as we go to fill in the gaps, I'm happy to have you in class and will do my best to help you succeed. If you're interested but not sure if you're prepared, reach out and we'll talk.

Should You Take This Class or QTM 220?

The core content and organization of this class and QTM 220 are similar, but this one will be a bit more mathematically rigorous and a bit broader in scope. In particular, in addition to using the homework to get some practice working with the methods we're covering in lecture, we'll use it to explore some variations on them. The intention is to develop skills and confidence that'll help you tackle new problems by adapting what you already know. As a result, this class will probably be a bit more work than QTM 220 at times.

If you're planning to learn more about machine learning or causal inference methods or work in an area that uses them, you'll probably find that there are some gaps between what you'll need there and what you'd learn in QTM 220. Or in this class. Gaps are inevitable. What's intended is that, with this one, you'll have a bit of a head start. You'll have the kind of understanding and practice that makes it easy to identify the gaps and to work through them. The time you'd save then might be worth the extra time you'd put in now taking this class. It might be more fun, too.

Tentative Schedule

Week 0
W Aug 28 Introduction
F Aug 30 Probability Background: Sampling
Homework
Week 1
M Sep 2 No Class. Labor Day.
W Sep 4 Point and Interval Estimates
F Sep 6 Probability Background: Random Variables
Homework
Week 2
M Sep 9 Calibrating Interval Estimates with Binary Observations
W Sep 11 Calibrating Interval Estimates with Binary Observations (continued)
F Sep 13 Probability Background: Expectations
Homework
Week 3
M Sep 16 Calibrating Interval Estimates using the Bootstrap
W Sep 18 Normal Approximation and Sample Size Calculation
F Sep 20 Conditional Expectations
Homework
Week 4
M Sep 23 Comparing Two Groups
W Sep 25 Comparing Two Groups (continued)
F Sep 27 Review
Practice Exam
Week 5
M Sep 30 Practice Exam Review
W Oct 2 Midterm 1
F Oct 4 The Midterm 1 Solution
Homework
Week 6
M Oct 7 Summarizing Trends involving Many Groups
W Oct 9 Inference for Complicated Summaries
F Oct 11 A Game of Telephone
No Homework. Enjoy Your Fall Break.
Week 7
M Oct 14 No Class. Fall Break.
W Oct 16 Multivariate Analysis and Adjustment
F Oct 18 Multivariate Analysis and Adjustment (continued)
Homework
Week 8
M Oct 21 Inference for Complicated Summaries (continued)
W Oct 23 Potential Outcomes and Causality
F Oct 25 Potential Outcomes and Causality (continued)
Week 9
M Oct 28 Cancelled
W Oct 30 Least Squares Regression in Linear Models
F Nov 1 Least Squares in R
Homework
Week 10
M Nov 4 Cancelled.
W Nov 6 No Class. There was an Election Yesterday.
F Nov 8 No Class. There was an Election 3 Days Ago.
Week 11
M Nov 11 The Behavior of Least Squares Predictors
W Nov 13 Misspecification and Inference
F Nov 15 Misspecification and Averaging
Week 12
M Nov 18 Inverse Probability Weighting
W Nov 20 Application: Profit vs. Outcomes in Heart Attack Patients
F Nov 22 Discussion of Profit vs. Outcomes Papers
Week 13
M Nov 25 Case Study
W Nov 27 No Class. Almost Thanksgiving Recess.
F Nov 29 No Class. Thanksgiving Recess.
Week 14
M Dec 2 TBD
W Dec 4 Trees
F Dec 6 Image Denoising
Take-Home Midterm. Posted Monday at end of class; Due Wednesday at 11:59 PM.
Extra Office Hours Tuesday and Wednesday from 2-3:45
Week 15
M Dec 9 Wrap Up and Review

Practices

Canvas Site

Readings, assignments, announcements, and course information are available through our course Canvas site. When I need to communicate with everyone in the course, for example, to amend an assignment or reschedule my office hours, I will make an announcement on Canvas. You can adjust your Canvas notification settings to receive these announcements however you prefer. To contact me, please use email or come to office hours. Please do not message me via Canvas inbox or reply to comments on assignments on Canvas or Gradescope, as it's likely that I won't see those messages for a while. If feedback you receive via Canvas or Gradescope is unclear or you would like to discuss comments, please come to office hours or make an appointment to meet with me.

Communication

I prefer to speak with students in real time rather than via email. That helps us get to know each other better and tends to lead to more efficient communication. The office hours listed above are set aside entirely for you; you don't need to make an appointment and can come and go as you please. I prefer that you attend in person, but if you can't make it to campus, I'm happy to talk via Zoom. Join my meeting; I'll be there. If you'd like to meet outside of these hours, please email me to set up an appointment. I'll do my best to respond to emails within 48 hours.

Attendance

I expect you to attend the majority of our class meetings. Even Fridays. That said, schedule conflicts and illness happen. For my sake and your classmates', please do not come to class or office hours sick. I will record the lectures and post recordings soon after class. There is no need to explain your absences, but please try to inform me of them in advance of class meetings.

Please bring a laptop with an updated version of R to every meeting, as you'll need it for some in-class activities and quizzes. It may be worth bringing a tablet, too, as the lecture slides on this website embed a little whiteboard app. You can take notes, do calculations, and draw sketches right on top of them. That's what I'm doing when I'm presenting the slides.

Homework

Homework will be assigned weekly except for the weeks of and the weeks before our midterm exams. These will be a mix of calculations by hand and computer, often accompanied by sketches or plots to illustrate what's going on visually; structured data analysis tasks; and communication tasks that address the methods and results of some data analysis (your own or somebody else's) using a mix of sketching and writing. Collaboration on homework is encouraged. I prefer that each student write and turn in solutions in their own words, and think that it is often best that this writing is done separately, with collaboration limited to discussion of problems, sketching solutions on a whiteboard, etc. This will help you and me understand where you're at in terms of your proficiency with the material.

Each homework assignment, as well as a solution to the previous week's, will be posted Thursday at midnight and due the following Thursday at 11:59 PM. Assignments will be posted on and submitted via Canvas. So that I can post solutions and provide feedback without delay, late work will not be accepted.

The use of Large Language Models, e.g. GPT-4, to assist you in writing your homework solutions is encouraged. I use one to help me write almost everything, including the materials for this class. In particular, I use the GitHub Copilot extension for VSCode, and I'm happy to help you get that set up on your computer if you'd like. But I'll warn you that doing this well will involve a lot of editing. The perspective we take in this class is very different from the ones you'll find in most of the text these models were trained on. As a result, they tend to respond to my questions with what is, when you work through all the qualifications and jargon, a nonanswer. I expect real answers, in the terminology we use in class, that you're prepared to explain.

Please submit your work as a single PDF or HTML file with answers to each question in order and clearly labeled. And please try to keep your submissions concise. In particular, include code only if it's asked for explicitly and plots only if asked for explicitly or you're using them to illustrate a point you're making in your answer. In that case, your text should refer to the plot explicitly and explain where to look and what we're meant to see. In short, write your answers knowing that somebody is going to read them and would prefer not to work harder than necessary to understand what you're saying. I'm not asking for beautiful formatting. It's fine, for example, to write answers by hand, take photos, and stick them in a PDF as long as everything is legible, labeled, and in the right order.

Accessibility and Accommodations
As the instructor of this course I endeavor to provide an inclusive learning environment. I want every student to succeed. The Department of Accessibility Services (DAS) works with students who have disabilities to provide reasonable accommodations. It is your responsibility to request accommodations. In order to receive consideration for reasonable accommodations, you must register with the DAS here. Accommodations cannot be retroactively applied so you need to contact DAS as early as possible and contact me as early as possible in the semester to discuss the plan for implementation of your accommodations. For additional information about accessibility and accommodations, please contact the Department of Accessibility Services at (404) 727-9877 or accessibility@emory.edu.

Writing Center
Tutors in the Emory Writing Center and the ESL Program are available to support Emory College students as they work on any type of writing assignment, at any stage of the composing process. Tutors can assist with a range of projects, from traditional papers and presentations to websites and other multimedia projects. Writing Center and ESL tutors take a similar approach as they work with students on concerns including idea development, structure, use of sources, grammar, and word choice. They do not proofread for students. Instead, they discuss strategies and resources students can use as they write, revise, and edit their own work. Students who are non-native speakers of English are welcome to visit either Writing Center tutors or ESL tutors. All other students in the college should see Writing Center tutors. Learn more, view hours, and make appointments by visiting the websites of the ESL Program and the Writing Center. Please review the Writing Center’s tutoring policies before your visit.

Honor Code
The Honor Code is in effect throughout the semester. By taking this course, you affirm that it is a violation of the code to cheat on exams, to plagiarize, to deviate from the teacher's instructions about collaboration on work that is submitted for grades, to give false information to a faculty member, and to undertake any other form of academic misconduct. You agree that the instructor is entitled to move you to another seat during examinations, without explanation. You also affirm that if you witness others violating the code you have a duty to report them to the honor council.

Assessment

Final grades will be a weighted average of scores on Quizzes (5%), Homework (25%), Two Midterms (20% each), and a Final Exam (30%).
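
As a rough illustration of the weighted average, here is a minimal Python sketch. The weights come from the sentence above. The dictionary keys, the function name `weighted_average`, and the example scores are hypothetical, and the result is the average before any curve is applied.

```python
# Weights from the syllabus; the two midterms count 20% each.
WEIGHTS = {
    "quizzes": 0.05,
    "homework": 0.25,
    "midterm_1": 0.20,
    "midterm_2": 0.20,
    "final_exam": 0.30,
}


def weighted_average(scores):
    """Combine component scores (each on a 0-100 scale) into one pre-curve average."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {
    "quizzes": 90,
    "homework": 85,
    "midterm_1": 70,
    "midterm_2": 75,
    "final_exam": 80,
}
print(round(weighted_average(example), 2))
```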

Quizzes
Quizzes will typically occur at the beginning of class meetings (not all of them) and will not be announced in advance. These give you a chance to check your understanding of the material we've covered recently and me a chance to do the same and adjust my plans to incorporate some additional review if necessary. They're meant to be quick and low-stress, so the questions will be easy and the grading lenient. Think of this 5% as a sort of participation grade that asks you to keep up with the material as well as be present. To keep illness and other conflicts from affecting your grade much, I'll drop the lowest 25% of your quiz scores when calculating your final grade.

Homework
We'll have homework weekly as described above. Your homework score will be the average of your scores on the homework assignments after dropping the lowest three. This is meant to allow some slack for busy weeks, illness, or other circumstances that might keep you from doing your best work. I will not be granting extensions, but if you find yourself burning through this slack due to circumstances that make it difficult to complete multiple assignments, talk to me. We can work something out. I'm here to help you learn, not to make your life difficult.
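
The two drop rules (the lowest 25% of quiz scores and the lowest three homework scores) can be sketched in Python as follows. The function names are hypothetical. Rounding down when 25% of the quiz count is not a whole number is an assumption of this sketch, not something stated above.

```python
import math


def drop_lowest(scores, n_drop):
    """Average the scores that remain after removing the n_drop lowest."""
    kept = sorted(scores)[n_drop:]
    if not kept:
        raise ValueError("no scores left after dropping")
    return sum(kept) / len(kept)


def homework_score(scores):
    # The syllabus drops the lowest three homework scores.
    return drop_lowest(scores, 3)


def quiz_score(scores):
    # The syllabus drops the lowest 25% of quiz scores. Rounding down
    # to a whole number of quizzes is an assumption made here.
    return drop_lowest(scores, math.floor(0.25 * len(scores)))


# Hypothetical scores: two missed homeworks and one weak one are dropped.
print(homework_score([100, 0, 50, 90, 80, 0]))
print(quiz_score([10, 0, 8, 9]))
```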

Midterms
There will be two midterm exams during lecture meeting times listed on the course schedule. These exams will be closed-notes, but I will provide whatever formulas I think are relevant and will provide additional ones if you ask for them during the exam. Collaboration is not allowed on the midterms. Nor is the use of any electronic devices other than a calculator.

Final Exam
The final will be held during the assigned exam time and may be either in-person or remote. We will discuss and then vote on the format in class toward the end of the semester.

The Curve
Grades will be curved at the end of the semester. While I can't tell you the exact curve in advance because there's a lot to learn about what is and isn't challenging when we write a new exam, averages above 75% in my classes have historically been curved to an A or an A- and averages above 60% to a B or a B+. I'll provide additional guidance on the interpretation of your exam scores during the semester. You are not in competition with your classmates for a limited number of A's. I'm not curving to get any particular distribution of grades. I'm curving so I can ask interesting questions to help you learn and solidify your understanding instead of calibrating the exam so it's easy enough to get 90% of them correct. This is a tradition in upper-level math classes and, while it can take a little getting used to, I think it's ultimately a good one.

Appeals
In order to appeal a grade, students must submit a written appeal by email no sooner than 24 hours after receipt of the graded assessment and no more than 2 weeks after the grade was posted/returned. I may regrade the sub-portion of the assessment being appealed and the regrade is final. The regrade may result in a higher, lower, or the same score. Please limit appeals to cases where you believe I've made a mistake or misunderstood what you've written.

Incompletes
Incomplete grades are now handled by Emory’s Office of Undergraduate Education, with permission of instructors. The College’s general policy on Incompletes can be found here; further questions can be directed to your OUE Academic Advisor. There must be an agreement between the instructor and the student prior to the end of the course for approval of an Incomplete, in addition to the approval from OUE.