
Machine Learning Methods for Estimating Heterogeneous Causal Effects

Susan Athey    Guido W. Imbens

First Draft: October 2013

This Draft: April 2015



Abstract: In this paper we study the problems of estimating heterogeneity in causal effects in experimental or observational studies and conducting inference about the magnitude of the differences in treatment effects across subsets of the population. In applications, our method provides a data-driven approach to determine which subpopulations have large or small treatment effects and to test hypotheses about the differences in these effects. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a pre-analysis plan, without concern about invalidating inference due to multiple testing. In most of the literature on supervised machine learning (e.g. regression trees, random forests, LASSO, etc.), the goal is to build a model of the relationship between a unit's attributes and an observed outcome. A prominent role in these methods is played by cross-validation, which compares predictions to actual outcomes in test samples in order to select the level of complexity of the model that provides the best predictive power. Our method is closely related, but it differs in that it is tailored for predicting causal effects of a treatment rather than a unit's outcome. The challenge is that the "ground truth" for a causal effect is not observed for any individual unit: we observe the unit with the treatment, or without the treatment, but not both at the same time. Thus, it is not obvious how to use cross-validation to determine whether a causal effect has been accurately predicted. We propose several novel cross-validation criteria for this problem and demonstrate through simulations the conditions under which they perform better than standard methods for the problem of causal effects. We then apply the method to a large-scale field experiment re-ranking results on a search engine.

Keywords: Potential Outcomes, Heterogeneous Treatment Effects, Causal Inference, Supervised Machine Learning, Cross-Validation

1 Introduction

In this paper we study two closely related problems: first, estimating heterogeneity by features in causal effects in experimental or observational studies, and second, conducting inference about the magnitude of the differences in treatment effects across subsets of the population. Causal effects, in the Rubin Causal Model or potential outcome framework that we use here (Rubin, 1976, 1978; Imbens and Rubin, 2015), are comparisons between outcomes we observe and counterfactual outcomes we would have observed under a different regime or treatment. We introduce a method that provides a data-driven approach to select subpopulations with different average treatment effects and to test hypotheses about the differences between the effects in different subpopulations. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a pre-analysis plan, without concern about invalidating inference due to multiple testing.
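
The comparison between observed and counterfactual outcomes can be written compactly. The following is a sketch in standard potential-outcome notation; the symbols $Y_i(0)$, $Y_i(1)$, $W_i$, $X_i$ and $\tau$ are assumptions introduced here for illustration and are not taken from the text above. $Y_i(1)$ and $Y_i(0)$ denote unit $i$'s outcomes with and without the treatment, $W_i \in \{0, 1\}$ indicates whether the treatment was received, and $X_i$ collects the unit's attributes.

```latex
\tau_i = Y_i(1) - Y_i(0), \qquad
Y_i^{\mathrm{obs}} = W_i \, Y_i(1) + (1 - W_i) \, Y_i(0), \qquad
\tau(x) = \mathbb{E}\left[ Y_i(1) - Y_i(0) \mid X_i = x \right].
```

Because only $Y_i^{\mathrm{obs}}$ is available for each unit, $\tau_i$ itself is never observed; heterogeneity is instead described through conditional averages such as $\tau(x)$.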

Our approach is tailored for applications where there may be many attributes of a unit relative to the number of units observed, and where the functional form of the relationship between treatment effects and the attributes of units is not known. We build on methods from supervised machine learning (see Hastie, Tibshirani, and Friedman (2011) for an overview). This literature provides a variety of very effective methods for a closely related problem, the problem of predicting outcomes as a function of covariates in similar environments. The most popular methods (e.g. regression trees, random forests, LASSO, support vector machines, etc.) entail building a model of the relationship between attributes and outcomes, with a penalty parameter that penalizes model complexity. To select the optimal level of complexity (the one that maximizes predictive power without "overfitting"), the methods rely on cross-validation. The cross-validation approach compares a set of models with varying values of the complexity penalty, and selects the value of the complexity parameter for which out-of-sample predictions best match the data using a criterion such as mean squared error (MSE). This method works well because in the test sample, the "ground truth" is known: we observe each unit's outcome, so that we can easily assess the performance of the model.
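
The standard procedure described above can be sketched in a few lines. This is a minimal illustration rather than the method proposed in this paper: ridge regression stands in for the penalized methods named above only because it has a closed-form solution, and the synthetic data, the penalty grid, and the function names (`ridge_fit`, `cv_mse`, `select_penalty`) are all hypothetical choices made for the example.

```python
import numpy as np


def ridge_fit(X, y, penalty):
    """Closed-form ridge coefficients; a larger penalty gives a simpler model."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


def cv_mse(X, y, penalty, n_folds=5, seed=0):
    """Average out-of-sample MSE over the folds for one penalty value."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        # Fit on every unit outside the held-out fold.
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        beta = ridge_fit(X[train_mask], y[train_mask], penalty)
        # In the held-out fold the outcome is observed, so the
        # prediction error can be computed directly.
        residuals = y[test_idx] - X[test_idx] @ beta
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))


def select_penalty(X, y, penalties, n_folds=5, seed=0):
    """Return the penalty with the lowest cross-validated MSE, plus all scores."""
    scores = {p: cv_mse(X, y, p, n_folds, seed) for p in penalties}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical synthetic data: 30 attributes, only 3 of which matter.
data_rng = np.random.default_rng(42)
n_units, n_attrs = 200, 30
X = data_rng.normal(size=(n_units, n_attrs))
beta_true = np.zeros(n_attrs)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + data_rng.normal(size=n_units)

penalties = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
best_penalty, scores = select_penalty(X, y, penalties)
print("selected penalty:", best_penalty)
```

The step that matters for what follows is the residual `y[test_idx] - X[test_idx] @ beta`: it can be computed only because each held-out unit's outcome is observed.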