Machine Learning for Estimating Heterogeneous Causal Effects (Full Text), with Machine Learning Open Course Materials

Machine Learning Methods for Estimating Heterogeneous Causal Effects

Susan Athey, Guido W. Imbens

First Draft: October 2013

This Draft: April 2015


Abstract: In this paper we study the problems of estimating heterogeneity in causal effects in experimental or observational studies and conducting inference about the magnitude of the differences in treatment effects across subsets of the population. In applications, our method provides a data-driven approach to determine which subpopulations have large or small treatment effects and to test hypotheses about the differences in these effects. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a pre-analysis plan, without concern about invalidating inference due to multiple testing. In most of the literature on supervised machine learning (e.g., regression trees, random forests, LASSO, etc.), the goal is to build a model of the relationship between a unit's attributes and an observed outcome. A prominent role in these methods is played by cross-validation, which compares predictions to actual outcomes in test samples in order to select the level of complexity of the model that provides the best predictive power. Our method is closely related, but it differs in that it is tailored for predicting causal effects of a treatment rather than a unit's outcome. The challenge is that the "ground truth" for a causal effect is not observed for any individual unit: we observe the unit with the treatment, or without the treatment, but not both at the same time. Thus, it is not obvious how to use cross-validation to determine whether a causal effect has been accurately predicted. We propose several novel cross-validation criteria for this problem and demonstrate through simulations the conditions under which they perform better than standard methods for the problem of causal effects. We then apply the method to a large-scale field experiment re-ranking results on a search engine.

Keywords: Potential Outcomes, Heterogeneous Treatment Effects, Causal Inference, Supervised Machine Learning, Cross-Validation

1 Introduction

In this paper we study two closely related problems: first, estimating heterogeneity by features in causal effects in experimental or observational studies, and second, conducting inference about the magnitude of the differences in treatment effects across subsets of the population. Causal effects, in the Rubin Causal Model or potential outcome framework that we use here (Rubin, 1976, 1978; Imbens and Rubin, 2015), are comparisons between outcomes we observe and counterfactual outcomes we would have observed under a different regime or treatment. We introduce a method that provides a data-driven approach to select subpopulations with different average treatment effects and to test hypotheses about the differences between the effects in different subpopulations. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a pre-analysis plan, without concern about invalidating inference due to multiple testing.
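
To fix notation, the setup just described can be stated compactly. The display below is a standard formalization in potential-outcomes notation, together with one well-known device for recovering a noisy "ground truth" in randomized experiments; it is our summary, not a verbatim excerpt from the paper.

% Unit i has potential outcomes Y_i(0) and Y_i(1); W_i in {0,1} is the treatment.
\[
Y_i^{\mathrm{obs}} = W_i \, Y_i(1) + (1 - W_i) \, Y_i(0),
\qquad
\tau(x) = \mathbb{E}\bigl[\, Y_i(1) - Y_i(0) \mid X_i = x \,\bigr].
\]
% Only one of Y_i(0), Y_i(1) is observed per unit, so tau(x) has no unit-level
% ground truth. In a completely randomized experiment with treatment
% probability p, however, the transformed outcome
\[
Y_i^{*} = Y_i^{\mathrm{obs}} \cdot \frac{W_i - p}{p(1 - p)}
\qquad\text{satisfies}\qquad
\mathbb{E}\bigl[\, Y_i^{*} \mid X_i = x \,\bigr] = \tau(x),
\]
% so Y_i^{*} can serve as an unbiased, if noisy, target when evaluating
% predicted treatment effects.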

Our approach is tailored for applications where there may be many attributes of a unit relative to the number of units observed, and where the functional form of the relationship between treatment effects and the attributes of units is not known. We build on methods from supervised machine learning (see Hastie, Tibshirani, and Friedman (2011) for an overview). This literature provides a variety of very effective methods for a closely related problem, the problem of predicting outcomes as a function of covariates in similar environments. The most popular methods (e.g., regression trees, random forests, LASSO, support vector machines, etc.) entail building a model of the relationship between attributes and outcomes, with a penalty parameter that penalizes model complexity. To select the optimal level of complexity (the one that maximizes predictive power without "overfitting"), the methods rely on cross-validation. The cross-validation approach compares a set of models with varying values of the complexity penalty, and selects the value of the complexity parameter for which out-of-sample predictions best match the data using a criterion such as mean squared error (MSE). This method works well because in the test sample the "ground truth" is known: we observe each unit's outcome, so we can easily assess the performance of the model.
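
As a concrete illustration of the two criteria, the following minimal simulation sketch (our own, with illustrative variable names and a made-up data-generating process; not code from the paper) cross-validates a regression tree once against the observed outcome and once against the transformed outcome defined above:

# Minimal sketch (illustrative, not the paper's code): compare cross-validated
# MSE against the observed outcome Y with cross-validated MSE against the
# transformed outcome Y*, for regression trees of varying depth.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, p_treat = 2000, 0.5                      # sample size, treatment probability
X = rng.normal(size=(n, 3))                 # covariates
W = rng.binomial(1, p_treat, size=n)        # randomized treatment indicator
tau = X[:, 0]                               # true heterogeneous effect tau(x)
Y = X[:, 1] + W * tau + rng.normal(size=n)  # observed outcome

# Transformed outcome: under randomization, E[Y* | X = x] = tau(x), so Y*
# provides a noisy ground truth for predicted causal effects.
Y_star = Y * (W - p_treat) / (p_treat * (1 - p_treat))

def cv_mse(target, max_depth, n_splits=5):
    """Out-of-sample MSE of a depth-limited regression tree for `target`."""
    errors = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X[train], target[train])
        errors.append(np.mean((target[test] - tree.predict(X[test])) ** 2))
    return float(np.mean(errors))

for depth in (1, 2, 3, 5, 8):
    print(f"depth={depth}: outcome CV-MSE={cv_mse(Y, depth):.3f}, "
          f"transformed-outcome CV-MSE={cv_mse(Y_star, depth):.3f}")

The depth that minimizes the first criterion need not minimize the second: a model can predict outcomes well while predicting treatment effects poorly, which is the gap the paper's proposed cross-validation criteria are designed to close.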