“Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses,” Naoki Egami, MIT | Center for the Study of American Politics

Event time:

Thursday, April 16, 2026 - 12:00pm to 1:15pm

Location:

Institution for Social and Policy Studies, Room A002

77 Prospect Street

New Haven, CT 06511

Speaker:

Naoki Egami, Associate Professor (with tenure) of Political Science, Massachusetts Institute of Technology

Event description:

QUANTITATIVE RESEARCH METHODS WORKSHOP

Abstract: Social scientists use automated annotation methods, such as supervised machine learning and, more recently, large language models (LLMs), that can predict labels and generate text-based variables. While such predicted text-based variables are often analyzed as if they were observed without errors, we show that ignoring prediction errors in the automated annotation step leads to substantial bias and invalid confidence intervals in downstream analyses, even if the accuracy of the automated annotations is high, e.g., above 90%. We propose a framework of design-based supervised learning (DSL) that can provide valid statistical estimates, even when predicted variables contain non-random prediction errors. DSL employs a doubly robust procedure to combine predicted labels and a smaller number of expert annotations. DSL allows scholars to apply advances in LLMs to social science research while maintaining statistical validity. We illustrate its general applicability using two applications where the outcome and independent variables are text-based. This paper is conditionally accepted at the American Journal of Political Science.
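As a rough illustration of the idea in the abstract, the sketch below estimates a simple proportion by combining automated labels for every document with expert labels for a random subset. It is a simplified example for a mean only, not the paper's procedure or software. The function names (`design_adjusted_mean`, `simulate`), the error rates, and the 10% annotation probability are all assumptions made for this example. The correction relies on the researcher knowing the probability `pi` with which each document was selected for expert annotation.

```python
import random


def design_adjusted_mean(predicted, expert, sampled, pi):
    """Average the pseudo-outcome  y_hat - (R / pi) * (y_hat - y).

    predicted: automated labels for every document
    expert:    expert labels (only read where sampled is True)
    sampled:   True where a document was drawn for expert annotation
    pi:        known probability of being drawn (set by the researcher)
    """
    total = 0.0
    for y_hat, y, r in zip(predicted, expert, sampled):
        # The prediction error is observed only on the annotated subset;
        # dividing by pi reweights it to stand in for the full corpus.
        correction = (y_hat - y) / pi if r else 0.0
        total += y_hat - correction
    return total / len(predicted)


def simulate(n=20000, pi=0.1, seed=0):
    """Return (true mean, naive mean of predictions, design-adjusted mean)."""
    rng = random.Random(seed)
    truth, predicted, expert, sampled = [], [], [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0
        # Hypothetical non-random errors: 20% of positives are missed and
        # 5% of negatives are flagged, so overall accuracy is about 90%.
        if y == 1:
            y_hat = 0 if rng.random() < 0.2 else 1
        else:
            y_hat = 1 if rng.random() < 0.05 else 0
        r = rng.random() < pi
        truth.append(y)
        predicted.append(y_hat)
        sampled.append(r)
        expert.append(y if r else None)
    true_mean = sum(truth) / n
    naive = sum(predicted) / n
    adjusted = design_adjusted_mean(predicted, expert, sampled, pi)
    return true_mean, naive, adjusted


if __name__ == "__main__":
    true_mean, naive, adjusted = simulate()
    print(f"true={true_mean:.3f} naive={naive:.3f} adjusted={adjusted:.3f}")
```

In this simulated setup the naive average of the predicted labels is systematically too low even though the labels are about 90% accurate, while the adjusted estimate is centered on the true proportion at the cost of a larger variance.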

Naoki Egami is an Associate Professor (with tenure) of Political Science at the Massachusetts Institute of Technology. He is also a faculty affiliate of the Statistics and Data Science Center at the Institute for Data, Systems, and Society (IDSS). Egami specializes in political methodology and develops statistical methods for questions in political science and the social sciences. Specifically, he works on causal inference and machine learning methods. His current research programs focus on three areas: (1) External Validity, (2) Machine Learning and AI for the Social Sciences, and (3) Causal Inference with Network and Spatial Data.

His work has appeared or is forthcoming in various academic journals in political science, statistics, and computer science, such as American Political Science Review, American Journal of Political Science, Journal of the American Statistical Association, Journal of the Royal Statistical Society (Series B), NeurIPS, and Proceedings of the National Academy of Sciences (PNAS). In 2025, Egami received the Emerging Scholar Award from the Society for Political Methodology, which “honors a young researcher, within ten years of their degree, who is making notable contributions to the field of political methodology.” Before joining MIT, Egami was an Assistant Professor at Columbia University from 2020 to 2025. He received a Ph.D. from Princeton University (2020) and a B.A. from the University of Tokyo (2015).

The Quantitative Research Methods Workshop is open to the Yale community. To receive regular announcements and invitations to attend, please subscribe at this link.