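Busso et al. (REST 2014): Finite Sample Properties of Propensity Score Matching and Propensity Score Weighting

A paper comparing matching and weighting estimators that are based on the propensity score.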

Busso, M., DiNardo, J., & McCrary, J. (2014). "New Evidence on the Finite Sample Properties of Propensity Score Reweighting and Matching Estimators." Review of Economics and Statistics 96(5): 885–897.

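A number of earlier studies have pointed out that weighting performs poorly relative to matching. For example, Frolich (2004) reports that weighting turned out to be the worst estimator, even when compared with the simplest matching method. The abstract of Frolich (2004) reads as follows.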

The finite-sample properties of matching and weighting estimators, often used for estimating average treatment effects, are analyzed. Potential and feasible precision gains relative to pair matching are examined. Local linear matching (with and without trimming), k-nearest-neighbor matching, and particularly the weighting estimators performed worst. Ridge matching, on the other hand, leads to an approximately 25% smaller MSE than does pair matching. In addition, ridge matching is least sensitive to the design choice. [abstract]

Hirano et al. (2003), on the other hand, reach a conclusion that differs from that of Frolich (2004), and Busso et al. (2014) summarize this point as follows.

In a recent article in the Review of Economics and Statistics, Frolich (2004) uses simulation to examine the finite sample properties of various propensity score matching estimators and compares them to those of a particular reweighting estimator. To the best of our knowledge, this is the only paper in the literature to explicitly compare the finite sample performance of propensity score matching and reweighting. The topic is an important one, both because large sample theory is currently only available for some matching estimators and because there can be meaningful discrepancies between large and small sample performance. Summarizing his findings regarding the mean squared error of the various estimators studied, Frolich (2004, p. 86) states that the “the weighting estimator turned out to be the worst of all [estimators considered]... it is far worse than pair matching in all of the designs”. This conclusion is at odds with some of the conclusions from the large sample literature. For example, Hirano et al. (2003) show that reweighting can be asymptotically efficient in a particular sense. This juxtaposition of conclusions motivated us to re-examine the evidence.

Against these earlier claims, Busso et al. (2014) argue that the conclusion of Frolich (2004) is wrong.

We conclude that reweighting is a much more effective approach to estimating average treatment effects than is suggested by the analysis in Frolich (2004). In particular, we conclude that in finite samples an appropriate reweighting estimator nearly always outperforms pair matching. Reweighting typically has bias on par with that of pair matching, yet much smaller variance. Moreover, in DGPs where overlap is good, reweighting not only outperforms pair matching, but is competitive with the most sophisticated matching estimators discussed in the literature.
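To get a feel for the claim that reweighting has bias on par with pair matching but smaller variance, here is a minimal Monte Carlo sketch. It is not the simulation design of Busso et al. (2014): the DGP (a single normal covariate, a logit treatment rule with slope 0.5, a constant effect `tau`), the sample size, and all function names are my own assumptions for illustration. The target is the effect on the treated, the propensity score is estimated by a hand-written logit, reweighting uses normalized weights p/(1-p) on the controls, and pair matching is nearest-neighbor on the estimated score with replacement.

```python
import numpy as np

def fit_logit(X, d, n_iter=25):
    """Fitted propensity scores from a logit estimated by Newton-Raphson.
    X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (d - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-X @ beta))

def att_reweighting(y, d, p):
    """Normalized reweighting: controls get weight p/(1-p), rescaled to sum to one."""
    w = p[d == 0] / (1.0 - p[d == 0])
    return y[d == 1].mean() - np.sum(w * y[d == 0]) / np.sum(w)

def att_pair_matching(y, d, p):
    """Each treated unit is paired with the control closest in propensity score
    (one neighbor, with replacement)."""
    pt, pc = p[d == 1], p[d == 0]
    nearest = np.abs(pt[:, None] - pc[None, :]).argmin(axis=1)
    return np.mean(y[d == 1] - y[d == 0][nearest])

def simulate(n=200, reps=500, tau=1.0, seed=0):
    """Bias and variance of (reweighting, pair matching) under an assumed toy DGP."""
    rng = np.random.default_rng(seed)
    est = np.empty((reps, 2))
    for r in range(reps):
        x = rng.normal(size=n)
        true_p = 1.0 / (1.0 + np.exp(-0.5 * x))    # assumed: good overlap
        d = (rng.uniform(size=n) < true_p).astype(int)
        y = x + tau * d + rng.normal(size=n)       # assumed: constant effect tau
        p_hat = fit_logit(np.column_stack([np.ones(n), x]), d)
        est[r] = att_reweighting(y, d, p_hat), att_pair_matching(y, d, p_hat)
    return est.mean(axis=0) - tau, est.var(axis=0)

if __name__ == "__main__":
    bias, var = simulate()
    print("bias (reweighting, pair matching):    ", bias)
    print("variance (reweighting, pair matching):", var)
```

This toy design has good overlap and a correctly specified logit, so it only speaks to the favorable case described in the quotation. Designs with poor overlap or a misspecified propensity score would need a separate check.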

The authors consider it good news that weighting is not so bad after all.

This is an important finding because reweighting is simple to implement, and standard errors are readily obtained using two-step method of moments calculations. In contrast, sophisticated matching estimators involve more complicated programming, and standard errors are only available for some of the matching estimators used in the literature (Abadie and Imbens 2006, 2008, 2010).

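As described above, Busso et al. (REST 2014) run simulations across a variety of data generating processes (DGPs), and their conclusions are interesting. However, as Stuart (2010) shows, applied econometricians who actually estimate causal effects with propensity scores have a great many variations to choose from, so I am also curious how simulations covering those variations would turn out (for example, a comparison of matching and weighting within the simulation framework of Busso et al. (2014) when the propensity score is estimated by Bayesian logit or BART). When overlap is good, the weighting estimator is said to be competitive with the most sophisticated matching estimators, but I had thought it was the other way around (or rather, I have a feeling that is how I was taught). As an aside, Rubin, one of the originators of the propensity score, clearly favors matching, and I felt I need to sort out for myself where this debate currently stands.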