2015-07-22

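Atanasov and Black (WP 2015): Shock-Based IV Revisited

Shock-based IVs that exploit events such as regulatory changes are used in many studies because they are believed to satisfy two of the assumptions an IV must meet: independence (as-if random assignment to treatment) and the exclusion restriction (only through). This paper re-analyzes the shock-based IVs of prior studies to test whether they really satisfy those assumptions.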

Atanasov, V. and B. Black. 2015. “The Trouble with Instruments: Re-Examining Shock-Based IV Designs.” SSRN Working Paper

The abstract follows.

Credible causal inference in accounting and finance research often comes from “natural” experiments. These natural experiments generate “shocks” which can be exploited using various research designs, including difference-in-differences (DiD), instrumental variables based on the shock (shock based IV), and regression discontinuity (RD). There is much to be said for shock-based designs. Moreover, if one must use IV, shock-based IV designs are highly likely to be preferred to non-shock IV designs. But shock-based IV remains problematic. Often, a near-equivalent DiD design is available, and is usually preferable. We illustrate the problems with shock-based IV by re-analyzing three recent, high-quality papers. None of the IVs in these papers turn out to be valid. For Desai and Dharmapala’s (REStat 2009) study of the interaction between tax shelter opportunities and corporate governance, their first stage fails when we impose a balanced sample of firms with data both before and after the shock. For Duchin, Matsusaka and Ozbas’s (DMO) (JFE 2010) study of the effect of board independence on firm performance, their first stage also fails when we balance treated and control firms on the pre-shock proportion of independent directors. For Iliev’s (JF 2010) RD/IV study of the cost of compliance with SOX § 404, we use combined DiD/RD and principal strata methods, and find cost estimates somewhat below his RD estimate, and well below his RD/IV estimate. The principal problem is that Iliev’s IV does not, for subtle reasons, satisfy the core “only through” condition (exclusion restriction) for a valid instrument. We discuss common themes that emerge from our re-analysis, including the fragility of IV compared to other shock-based designs; the need for covariate balance between treated and control firms; and the difficulty in satisfying the only-through condition. Our results suggest that even for shock-based designs, the scope for IV methods is very limited.

The paper examines the IVs used in three papers: Desai and Dharmapala (REStat 2009), Duchin, Matsusaka and Ozbas (JFE 2010), and Iliev (JF 2010).

Desai and Dharmapala (REStat 2009)

Desai and Dharmapala (2009, below D&D) study how corporate governance mediates the effect of tax shelter opportunities on firm value. Their shock is 1996 Treasury regulations that simplified taxation for small private firms. As an unintended side effect, these rules increased tax shelter opportunities for multinational firms. D&D use this shock, interacted with measures of the firm’s need to shelter income, as instruments for “book-tax gap” (a proxy for tax sheltering). They find that greater sheltering opportunities increase firm value, but only for firms with high institutional ownership (a proxy for corporate governance).

Problem: independence is clearly violated, in that covariates are not balanced between treatment and control. Once covariate balance is corrected, the first stage is no longer significant.

Duchin, Matsusaka and Ozbas (JFE 2010)

Duchin, Matsusaka and Ozbas (2010, below, DMO) study the effect of board independence on firm value and profitability. Their instrument for a change in board independence is whether a firm had to add independent directors to its audit committee to meet a 1999 New York Stock Exchange (NYSE) and NASDAQ requirement that audit committees consist entirely of independent directors (“Audit Committee Shock”). DMO find that a higher proportion of independent directors is value-neutral overall, but positive (negative) for firms with low (high) information costs. Over 2000-2005, firms in the top quartile of information cost that increase board independence by 10% (the amount predicted by their instrument) suffer a 3.0% drop in ROA relative to bottom-quartile firms; a 24% relative drop in Tobin’s q; and 31% lower cumulative share returns.

Iliev (JF 2010)

Iliev (2010) studies the cost of compliance with § 404 of the Sarbanes-Oxley Act (SOX) for firms near the compliance threshold (public float of $75M), using a combined regression discontinuity (RD) and IV design. His RD design exploits the discontinuity at $75M in float between firms which do (don’t) need to comply with SOX § 404. Iliev finds that some firms manipulate their float to stay below the $75M threshold, and uses IV to address this manipulation.

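Problem: the strength of RDD is that it creates an environment resembling an RCT, yet Iliev runs no covariate balance check. The re-analysis runs one and finds that covariates are not balanced between treatment and control. Re-analyzing with DiD/RD (DiD on the sample inside the RD bandwidth) instead of Iliev's RD/IV design, the effect remains significant but the original estimate turns out to be too large. As the reason, Atanasov and Black (WP 2015) suggest that the IV Iliev adopts may not satisfy the exclusion restriction.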
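To make "DiD on the sample inside the RD bandwidth" concrete, here is a minimal sketch. It is not the estimator from Atanasov and Black; the field names, the toy firms, and the bandwidth of 20 are all assumptions made up for illustration.

```python
from statistics import mean

def did_within_bandwidth(firms, cutoff, bandwidth):
    """Difference-in-differences using only firms whose running variable
    lies within `bandwidth` of `cutoff`.

    Each firm is a dict with hypothetical keys:
      'running' - running variable (e.g. public float in $M)
      'pre'     - outcome before the shock
      'post'    - outcome after the shock
    Firms at or above the cutoff are treated as subject to the rule.
    """
    window = [f for f in firms if abs(f["running"] - cutoff) <= bandwidth]
    treated = [f for f in window if f["running"] >= cutoff]
    control = [f for f in window if f["running"] < cutoff]
    if not treated or not control:
        raise ValueError("need firms on both sides of the cutoff")
    change_t = mean(f["post"] - f["pre"] for f in treated)
    change_c = mean(f["post"] - f["pre"] for f in control)
    return change_t - change_c

# Made-up data, not from any of the papers discussed.
firms = [
    {"running": 60, "pre": 1.0, "post": 1.2},
    {"running": 70, "pre": 1.1, "post": 1.3},
    {"running": 80, "pre": 1.4, "post": 2.1},
    {"running": 90, "pre": 1.5, "post": 2.2},
    {"running": 300, "pre": 5.0, "post": 9.0},  # outside the window, ignored
]
estimate = did_within_bandwidth(firms, cutoff=75, bandwidth=20)
```

Differencing each firm against its own pre-shock level is what lets this design tolerate some imbalance in levels between the two sides of the cutoff, which a pure RD comparison of post-shock outcomes would not.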

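This is something Black mentioned in person: when he asked Iliev why he had not used DiD/RD, Iliev said he had not known about it. Black also urges researchers never to skip covariate checks when using IV or RDD.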

2015-07-20

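Busso et al. (REST 2014): Finite Sample Properties of Propensity Score Matching and Propensity Score Weighting

A paper comparing matching and weighting estimators based on the propensity score.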

Busso, M., DiNardo, J., & McCrary, J. (2014). "New Evidence on the Finite Sample Properties of Propensity Score Reweighting and Matching Estimators." Review of Economics and Statistics 96(5): 885–897.

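Various prior studies have pointed out that weighting performs worse than matching. For example, Frolich (2004) reports that weighting is the worst estimator, even compared with the simplest matching method. The abstract of Frolich (2004) follows.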

The finite-sample properties of matching and weighting estimators, often used for estimating average treatment effects, are analyzed. Potential and feasible precision gains relative to pair matching are examined. Local linear matching (with and without trimming), k-nearest-neighbor matching, and particularly the weighting estimators performed worst. Ridge matching, on the other hand, leads to an approximately 25% smaller MSE than does pair matching. In addition, ridge matching is least sensitive to the design choice. [abstract]

Hirano et al. (2003), on the other hand, reach a different conclusion from Frolich (2004), and Busso et al. (2014) summarize the point as follows.

In a recent article in the Review of Economics and Statistics, Frolich (2004) uses simulation to examine the finite sample properties of various propensity score matching estimators and compares them to those of a particular reweighting estimator. To the best of our knowledge, this is the only paper in the literature to explicitly compare the finite sample performance of propensity score matching and reweighting. The topic is an important one, both because large sample theory is currently only available for some matching estimators and because there can be meaningful discrepancies between large and small sample performance. Summarizing his findings regarding the mean squared error of the various estimators studied, Frolich (2004, p. 86) states that the “the weighting estimator turned out to be the worst of all [estimators considered]... it is far worse than pair matching in all of the designs”. This conclusion is at odds with some of the conclusions from the large sample literature. For example, Hirano et al. (2003) show that reweighting can be asymptotically efficient in a particular sense. This juxtaposition of conclusions motivated us to re-examine the evidence.

Against these earlier claims, the paper argues that the conclusion of Frolich (2004) is wrong.

We conclude that reweighting is a much more effective approach to estimating average treatment effects than is suggested by the analysis in Frolich (2004). In particular, we conclude that in finite samples an appropriate reweighting estimator nearly always outperforms pair matching. Reweighting typically has bias on par with that of pair matching, yet much smaller variance. Moreover, in DGPs where overlap is good, reweighting not only outperforms pair matching, but is competitive with the most sophisticated matching estimators discussed in the literature.

That weighting is not so bad after all is, they say, good news.

This is an important finding because reweighting is simple to implement, and standard errors are readily obtained using two-step method of moments calculations. In contrast, sophisticated matching estimators involve more complicated programming, and standard errors are only available for some of the matching estimators used in the literature (Abadie and Imbens 2006, 2008, 2010).

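As described above, Busso et al. (REST 2014) run simulations over a variety of data generating processes (DGPs), and their conclusions are interesting. However, as Stuart (2010) shows, applied econometricians face many variations when actually estimating causal effects with propensity scores, so simulations covering those variations would also be of interest (for example, comparing matching and weighting within the Busso et al. (2014) simulation framework when the propensity score is estimated by Bayes logit or BART). When overlap holds, the weighting estimator is said to be competitive with many matching estimators; I had thought it was the other way around (or rather, I believe that is how I was taught). As an aside, Rubin, the father of the propensity score, clearly favors matching, and I feel I need to sort out for myself where the field is heading on this.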

2015-07-20

Hainmueller et al. (JOP 2015): Do Incumbents Have an Advantage? The External Validity of RDD

A paper examining the external validity of the regression discontinuity design (RDD).

Hainmueller, J., Hall, A. and J. Snyder. 2015. "Assessing the External Validity of Election RD Estimates: An Investigation of the Incumbency Advantage." Journal of Politics 77(3): 707-720.

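Lee (2008) used RDD to show that incumbents have an advantage in elections (incumbency effects), and that paper is often cited in expositions of RDD. The internal validity of RDD has generally been shown to be very high (Buddelmeyer and Skoufias 2004; Cook et al. 2008; Berk et al. 2010; Shadish et al. 2011), but its external validity is regarded as leaving room for further verification. That is, because the estimate is a local average effect for the subpopulation near the cutoff, it is not obvious that the same effect holds away from the cutoff. Hainmueller et al. (2015) therefore ask whether the conclusion of Lee (2008) has external validity.

The method follows Angrist and Rokkanen (2013). Angrist and Rokkanen (2013) is an IZA discussion paper with the same content as Angrist and Rokkanen (JASA 2016), to be published in JASA in 2016. The argument consists of the following four points.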

First, we motivate our study in several ways. We discuss theoretical reasons to expect a larger, smaller, or equal effect away from the RD threshold, we present the results of a survey of political scientists that shows widespread disagreement over whether the effect ought to be larger or smaller away from the threshold, and we present descriptive evidence that obtaining an estimate even in relatively small windows around the RD threshold can markedly increase the estimate’s coverage, and thus its pertinence. Second, we lay out the technical details of the method. Third, we apply the method to U.S. statewide offices, presenting the results of the validity tests and the estimates of the incumbency advantage away from the threshold. Finally, we conclude.

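Building on this, they use the conditional independence assumption (CIA) ${ \displaystyle E[Y_{i, t+1}(D_{i, t})|V_{i, t}, X_{i, t}]=E[Y_{i, t+1}(D_{i, t})|X_{i, t}] }$ to widen the bandwidth beyond that of Lee (2008) (a window of 15 percentage points, that is, estimation in less competitive districts), and find that the incumbency effects are weak or absent. Whether the treatment effect is properly estimated with the wider window depends on whether the CIA holds, and the paper also checks the CIA. There has been much debate over the external validity of RDD, so I learned a lot. Angrist and Rokkanen (JASA 2016), which discusses the methodology more formally, will probably be cited often.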

2015-07-17

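Primo et al. (2007): Clustered Standard Errors vs Multilevel Modeling

Multilevel modeling is often used in sociology and psychology, but many people have probably been told, at a seminar with economists present, "Couldn't you just cluster the standard errors?" I have both said and heard this line, so I am noting down a paper and a blog post that sum up the issue well.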

Primo, D. M., Jacobsmeier, M. L., & Milyo, J. 2007. "Estimating the Impact of State Policies and Institutions with Mixed-Level Data." State Politics & Policy Quarterly, 7(4), 446–459.

I learned of this paper through Andrew Gelman's blog, which also carries a brief explanation and comments.

Primo et al. estimate the following three models.

(1) least-squares estimation ignoring state clustering,

(2) least squares estimation ignoring state clustering, with standard errors corrected using cluster information,

(3) multilevel modeling
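The difference between (1) and (2) can be seen in the simplest possible case, the standard error of a mean. The sketch below is my own illustration, not the specification in Primo et al.: an intercept-only model, the plain sandwich formula with no small-sample correction, and made-up data in which outcomes are identical within each hypothetical state.

```python
from math import sqrt

def naive_and_cluster_se(values, clusters):
    """Standard error of the sample mean computed two ways.

    naive   : treats all observations as independent
    cluster : sums residuals within each cluster before squaring
              (sandwich estimator for an intercept-only model,
              no small-sample correction)
    """
    n = len(values)
    mean = sum(values) / n
    resid = [v - mean for v in values]
    naive = sqrt(sum(r * r for r in resid) / (n - 1) / n)
    by_cluster = {}
    for r, g in zip(resid, clusters):
        by_cluster[g] = by_cluster.get(g, 0.0) + r
    cluster = sqrt(sum(s * s for s in by_cluster.values())) / n
    return naive, cluster

# Toy data: three hypothetical states, four identical observations each.
values = [1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9]
states = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
naive_se, cluster_se = naive_and_cluster_se(values, states)
```

With perfectly correlated observations inside each state, the twelve rows carry the information of roughly three, so the clustered standard error comes out well above the naive one. Model (3) would instead estimate the between-state variance as a parameter, which is what gives it state-specific estimates.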

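Gelman's comments on his blog are excellent, so I note his summary verbatim (incidentally, Gelman writes that (3) cannot be done in Stata, but that was the situation in 2007; today it is easily estimated with the mixed command).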

1. One big advantage of multilevel modeling, beyond the cluster-standard-error approach recommended in this paper, is that it gives separate estimates for the individual states. Primo et al. minimize this issue by focusing on global questions–“Do voter registration laws affect turnout? Do legislators in states with term limits behave differently than legislators in states with no term limits”–and in their example they focus on p-values rather than point estimate or estimates of variation. Thus, in the examples they look at, multilevel modeling doesn’t have such a big comparative advantage.

2. Another advantage of multilevel modeling comes with unbalanced data–in their context, different sample sizes in different states.

3. I agree that it’s frustrating when software doesn’t work, and I agree with Primo et al. completely that it’s better to go with a reasonable method that runs, rather than trying to use a fancier approach that doesn’t work on your computer. That said, I think their abstract would’ve been clearer if they had simply said, “Stata couldn’t fit our multilevel model,” rather than vaguer claims about “large datasets or many cross-level interactions.”

4. I’d like to get their data and try to fit their model in R. It might very well crash in R also–we’ve had some difficulties with lmer()–in which case it would be useful to figure out what’s going on and how to get it to work.

5. I’d recommend displaying their Table 1 as a graph. (John K. also wrote a paper on this for political scientists.)

6. I completely disagree with their statement on page 456 that cluster-adjusted standard errors “requires fewer assumptions” than hierarchical linear modeling. As Tukey emphasized, methods are just methods. A method can be motivated by an assumption but it doesn’t “require” the assumption. For a simple example, least squares is maximum likelihood for a model with normally distributed errors. But if the errors have a different distribution, least squares is still least squares: it did not “require” the assumption. To go to the next step, classical least squares (which is what Primo et al. recommend for their point estimation) is simply multilevel modeling with group-level variance parameters set to zero. Thus, their estimate requires more assumptions than the multilevel estimate.

7. But, to conclude, I’m not criticizing their choice of clustered standard errors for their example. It’s not a bad idea to use a method that you’re comfortable with. Beyond that, it can be extremely helpful to fit complete-pooling and no-pooling models as a way of understanding multilevel data structures. (See here for more of my pluralistic thinking on this topic.) I hope that as more people read our book, they’ll become more comfortable with multilevel models. But what I really hope is that the software will improve (maybe I have to do some of the work on this) so we can actually fit such models, especially varying-intercept, varying-slope models with lots of predictors and nonnested levels.

2015-07-17

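Imbens (JEL 2010): It Is Only LATE, and Yet It Is LATE

A paper rebutting Deaton (2010) and Heckman and Urzua (2009), who harshly criticized the rapidly growing trend in the social sciences toward analyses using experiments and natural experiments. This exchange, which includes the reduced form versus structural estimation debate, is known to everyone in econometrics circles but is probably little known among sociologists in Japan. Yet what Deaton (2009) [hereafter Deaton] and Heckman and Urzua (2009) [hereafter HU] say includes things a sociologist might well say, so I think the exchange deserves wider circulation in sociology too. Indeed, the sociologist Stephen Morgan puts Deaton (2009), Heckman and Urzua (2009), and Imbens (2010) on the reading list for his causal inference lectures. (For an exchange in which a sociologist is directly involved, I recommend Sobel (SM 2005) and Heckman (SM 2006), between the sociologist Sobel and the economist Heckman.)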

Imbens, G. W. 2010. "Better LATE Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009)." Journal of Economic Literature 48(2): 399-423.

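What Deaton (2009) and Heckman and Urzua (2009) criticize is the LATE obtained from IV estimation (IVE) that exploits RCTs, natural experiments, and the like, which is becoming standard in development economics. Put briefly, their criticism comes down to this: stop doing only effects of causes and do causes of effects; what good is it to know the LATE? (Heckman and Vytlacil (2005) propose LIV and MTE.) Sociologists will probably make similar criticisms five years from now. To this, Imbens replies very clearly that effects of causes and causes of effects are complementary, that knowing the LATE brings many benefits, and asks what the critics of LATE would do instead.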

"Problems of identification and interpretation are swept under the rug and replaced by ‘an effect’ identified by IV that is often very difficult to interpret as an answer to an interesting economic question" (HU, p. 20).

"The LATE may, or may not, be a parameter of interest . . . and in general, there is no reason to suppose that it will be . . . I find it hard to make any sense of the LATE" (Deaton, p. 10).

"futility of trying to avoid thinking about how and why things work" (Deaton, p. 14).

In response, Imbens states at the outset:

By emphasizing internal validity and study design, this literature has shown the importance of looking for clear and exogenous sources of variation in potential causes. In contrast to what Deaton and HU suggest, this issue of data quality and study design is distinct from the choice between more or less structural or theory driven models and estimation methods.

Rather, what concerns Imbens is the following.

In my opinion, the main concern with the current trend toward credible causal inference in general, and toward randomized experiments in particular, is that it may lead researchers to avoid questions where randomization is difficult, or even conceptually impossible, and natural experiments are not available.

Deaton and HU argue that more effort should go into theory-driven research, but Imbens says that RCTs, IV, and RDD emerged precisely because that direction has limits. Specifically, Section 2 draws on the structural estimation paper of LaLonde (1986), and Section 4, The Benefits of Randomized Experiments, and Section 5, Instrumental Variables, Local Average Treatment Effects, and Regression Discontinuity Designs, argue how much recent causal inference has contributed to empirical work. Section 5 concerns observational data, so I summarize it briefly below.

5. Instrumental Variables, Local Average Treatment Effects, and Regression Discontinuity Designs

There are two reasons why IV and RDD are considered second best after experiments:

First, they rely on additional assumptions and, second, they have less external validity. Often, however, such evaluations are all we have.

Reviewing the assumptions of IV, which covers fuzzy RDD:

The first key assumption is that draft eligibility is exogenous. Since it was actually randomly assigned, this is true by design in this case. The second is that there is no direct effect of the instrument, the lottery number, on the outcome. This is what Angrist, Imbens, and Rubin (1996) call the exclusion restriction. This is a substantive assumption that may well be violated. See Angrist (1990) and Angrist, Imbens, and Rubin (1996) for discussions of potential violations. The third assumption is what IA call monotonicity, which requires that any man who would serve if not draft eligible, would also serve if draft eligible. In this setting, monotonicity, or as it is sometimes called “no-defiers,” seems a very reasonable assumption.

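The last assumption means that there are no defiers, contrarians who respond in the opposite direction to the instrument rather than moving with it (IV admits four subpopulations: compliers, defiers, never-takers, and always-takers). Deaton says nobody is interested in the LATE (the estimand for compliers), but that depends on the purpose of the analysis, and Imbens says so too. Behind this lies the question of which population one wants to analyze, and to explain the point Imbens draws on Manski's partial identification. He further notes that the LATE arose in the first place because causal effects cannot be identified for the whole population, and states the following.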
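To see why the IV estimate is an effect for compliers only, here is a small sketch of the Wald estimator on a made-up population with no defiers. The four people and their potential outcomes are hypothetical; the point is only that the never-taker's large treatment effect (25) never enters the estimate.

```python
from statistics import mean

def wald_late(z, d, y):
    """Wald estimator: intention-to-treat effect on y divided by the
    first-stage effect of the instrument z on treatment take-up d."""
    y1 = mean(yi for zi, yi in zip(z, y) if zi == 1)
    y0 = mean(yi for zi, yi in zip(z, y) if zi == 0)
    d1 = mean(di for zi, di in zip(z, d) if zi == 1)
    d0 = mean(di for zi, di in zip(z, d) if zi == 0)
    if d1 == d0:
        raise ValueError("instrument does not move treatment (no first stage)")
    return (y1 - y0) / (d1 - d0)

# Hypothetical population with no defiers. Each tuple is
# (type, untreated outcome, treated outcome).
people = [
    ("complier", 10, 14), ("complier", 12, 16),
    ("always-taker", 20, 21), ("never-taker", 5, 30),
]
z, d, y = [], [], []
for zi in (0, 1):                       # every person appears under both z values
    for kind, y_untreated, y_treated in people:
        di = 1 if kind == "always-taker" or (kind == "complier" and zi == 1) else 0
        z.append(zi)
        d.append(di)
        y.append(y_treated if di else y_untreated)

late = wald_late(z, d, y)
```

Both compliers have a treatment effect of 4, and that is what the estimator recovers here; always-takers and never-takers contribute the same outcome under either value of the instrument, so they cancel out of the numerator.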

Again, researchers do not necessarily set out to estimate the average for these particular subpopulations but, in the face of the lack of internal validity of estimates for other subpopulations, they justifiably choose to report estimates for them.

2015-07-15

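Hill et al. (MBR 2011): Propensity Score Methods with High-Dimensional Covariates

A paper examining how propensity score methods should be used when there are many covariates. As Stuart (2010), introduced in the previous post, also shows, there are many kinds of matching methods and many ways to do causal inference with propensity scores. The paper examines which combination of methods is preferable in a high-dimensional setting.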

Hill, J., Weiss, C., and Zhai, F. 2011. "Challenges With Propensity Score Strategies in a High-Dimensional Setting and a Potential Alternative." Multivariate Behavioral Research 46: 477-513.

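The goal here is to estimate the effect of grade retention on achievement, using 236 covariates. All 236 are pretreatment variables, that is, information from before retention was decided, and the paper takes on the difficulties of propensity score analysis in high dimensions. Data sets this rich in pretreatment covariates are rare. The final sample size is 6,900, of which 233 are in the treatment group, that is, were retained.

Estimating the propensity score
The most common way to estimate the propensity score is logit or probit regression. But logit or probit is not always adequate. With many covariates (and here, also a small treated sample), propensity scores from logit or probit pile up near 0 or 1, and the possibility of overfitting also makes comparison difficult. Bayesian logit suffers from the same problem. The paper estimates six models, Logit, Probit, Bayes Logit, Bayes Probit, Generalized Boosted Models (GBM: McCaffrey et al. 2004), and Bayesian Additive Regression Trees (BART: Chipman et al. 2007), and confirms that BART overfits least. BART is explained in Hill and Su (2013).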

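Which propensity score analysis to adopt
Once the propensity score is estimated, there are several variations in how to use it. Representative ones are matching, stratification, and weighting, and which to adopt depends on the purpose of the analysis, since what they estimate is ATE, ATC, and ATT, respectively. To see which method achieves the best balance (note that "no significant difference in covariates between treatment and control" alone does not establish balance; QQ balance is examined here), nine patterns are tried (four nearest-neighbor matching patterns, two optimal matching patterns, full matching, and two IPTW patterns). BART appears to give the best balance and IPTW the worst. The weakness of weighting arises when propensity scores pile up near 0 or 1, so this is hardly surprising.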

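Estimating the treatment effect
The ultimate goal is the effect of the treatment on the outcome (here, the ATT), and again an analysis method has to be chosen. If treatment and control are perfectly balanced, simply comparing means yields an unbiased estimate. This is the procedure used in actual experiments. In addition to this difference-in-means test, Hill et al. try two regression models (with just test scores, with all covariates).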

We can use the balance summaries, however, to discriminate between methods. Figure 5 plots treatment effect estimates for third-grade reading test scores and 95% confidence intervals for each of the three analysis choices (difference in means, regression on test scores, regression on all covariates) for each of six propensity score strategies that met a set of balance criterion for the full set of covariates (std.mn < .08, std.max < .5, std.over.1 < .4), a set of balance criterion applying to all of the continuous covariates (medQQ.max < .2, maxQQ.mean < .08, maxQQ.over.1 < .3 ), and a set of balance criterion for the full set of covariates plus quadratic terms (std.mn < .1, std.max < 1, std.over.1 < .3).

In other words, there are 18 estimates in all. Which one, then, should we report?

It is somewhat difficult to further distinguish between these methods because they represent trade-offs in some criteria over others (and some variables over others). However, the range of estimates that they yield, although narrower than the range for the full set of strategies, is still nontrivial.

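A case for BART
What Hill et al. recommend in this paper is BART. I have not used it yet and do not know it well, but it is a nonparametric estimator for which Chipman et al. (2011) appears to be the seminal paper.

The main thrust of the paper is the case for BART, but to summarize: causal inference with propensity scores requires an appropriate choice of model and estimator at each stage of the analysis. Specifically, there are choices at four stages: (1) which model to use to estimate the propensity score, (2) which type of matching or weighting algorithm to use, (3) which method to use for the balance check and what criterion counts as balanced, and (4) which model to use to analyze the outcome. Four stages allow a great many combinations, yet most papers do not state why they adopted their particular method. What Hill et al. show is that the choice of method produces non-negligible differences in results. BART is then recommended, specifically in the following passage.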

We have presented an alternative estimation approach for this setting that relies on the BART algorithm that eliminates this complexity. This strategy has been demonstrated in previous work (Hill, 2011) to have equal or superior performance compared with some common propensity score strategies in a variety of settings. In this example the point estimate of the effect of the treatment on the treated produced by BART lies near the center of the estimate corresponding to the subset of propensity score approaches that achieve the best balance with these data. More research needs to be done to determine if there are scenarios in which BART may not perform as well. However, it appears to be a potentially promising alternative to propensity score matching, at least in situations with a large number of covariates, and at a minimum is worthy of further investigation and comparison. (p.505)

2015-07-07

Stuart (SS 2010): Matching Methods for Causal Inference, a Review and a Look Forward

An overview of the history and methods of propensity score matching.

Stuart, E. A. (2010). "Matching Methods for Causal Inference: A Review and a Look Forward." Statistical Science 25(1): 1–21.

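Matching methods are used in many fields, including economics, sociology, political science, epidemiology, pharmacy, and medicine, but reviews of matching that cut across disciplines are apparently surprisingly few, and the author's motivation is to provide a comprehensive one.

After explaining "strongly ignorable treatment assignment" and SUTVA, the introduction divides analysis by matching into four steps: (1) Defining Closeness, (2) Matching Methods, (3) Diagnosing Matching, and (4) Analysis of the Outcome.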

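(1) Defining Closeness
Defining closeness has two parts: choosing covariates and measuring distance.
In the counterfactual framework ignorability must hold, so covariate selection is very important. In theory, every variable related to both the outcome and the treatment must be included. In practice, however, it is known that not every variable should be used; Chapter 4 of Hoshino's book explains this well. Stuart's point here is that, in propensity score matching, including variables that are in fact unrelated to the outcome is not much of a problem (it slightly increases variance). The serious problem is rather excluding a potential covariate strongly related to the outcome, so researchers in this field are said to be liberal about including variables that "may be related to the outcome." It is further stressed that variables affected by the treatment are harmful. In practice these are guarded against as post-treatment variables: including them removes the part of the treatment effect that runs through them, so the total effect of the treatment is underestimated.
As for how to measure distance, four options are given: (1) Exact, (2) Mahalanobis, (3) Propensity Score, and (4) Linear Propensity Score. Exact matching, (1), is the method Chapin used in sociology, and matches people whose variables are exactly the same. It is known to be ideal in many respects (Imai, King and Stuart 2008), but when the covariates ${ \displaystyle X }$ are high-dimensional it yields larger bias than inexact matching that retains more of the sample (Rosenbaum and Rubin 1985), because the exactly matched sample becomes very small. This "curse of dimensionality" also afflicts Mahalanobis matching. Mahalanobis matching works well with few covariates (8 or fewer) (Rubin 1979), but bias grows with more covariates, and also when they are not normally distributed. The breakthrough in matching had to wait for Rosenbaum and Rubin (1983), where the propensity score $e_i(X_i)=P(T_i=1|X_i)$ appears. Whether two units are matched is decided by whether the difference in their propensity scores is below a given caliper. Rosenbaum and Rubin (1985) propose a caliper of 0.25 SD. More recently there is apparently the prognostic score (Hansen 2008), but it is not covered in detail and I do not know it well either. Incidentally, Hansen taught Causal Inference at ICPSR last year. Logistic regression is the usual way to estimate the propensity score, but nonparametric estimators such as CART and GBM also perform well. The part on model diagnostics says something important, so I quote it verbatim below.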

The model diagnostics when estimating propensity scores are not the standard model diagnostics for logistic regression or CART. With propensity score estimation, concern is not with the parameter estimates of the model, but rather with the resulting balance of the covariates (Augurzky and Schmidt, 2001). Because of this, standard concerns about collinearity do not apply. Similarly, since they do not use covariate balance as a criterion, model fit statistics identifying classification ability (such as the c-statistic) or stepwise selection models are not helpful for variable selection (Rubin, 2004; Brookhart et al., 2006; Setoguchi et al., 2008). [p.7]

It says not to rely too heavily on the c-statistic. As a reminder to myself, I want to stress that what matters is whether the covariates are balanced in the matched sample.

Research indicates that misestimation of the propensity score (e.g., excluding a squared term that is in the true model) is not a large problem, and that treatment effect estimates are more biased when the outcome model is misspecified than when the propensity score model is misspecified (Drake, 1993; Dehejia and Wahba, 1999, 2002; Zhao, 2004). This may in part be because the propensity score is used only as a tool to get covariate balance—the accuracy of the model is less important as long as balance is obtained. [p.7]

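(2) Matching Methods
First comes nearest neighbor matching, which always estimates the ATT. It matches the treated and control units that are closest in distance. Among 1:1, 1:many, and many:many matching, 1:1 is probably the simplest. 1:1 matching often discards control units and so leaves much of the control data unused, which is seen as a problem (reduced power). However: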

However, the reduction in power is often minimal, for two main reasons. First, in a two-sample comparison of means, the precision is largely driven by the smaller group size (Cohen, 1988). So if the treatment group stays the same size, and only the control group decreases in size, the overall power may not actually be reduced very much (Ho et al., 2007). Second, the power increases when the groups are more similar because of the reduced extrapolation and higher precision that is obtained when comparing groups that are similar versus groups that are quite different (Snedecor and Cochran, 1980).

Setting a caliper is one way to avoid poor matches, but then some treated units may go unmatched; Rosenbaum and Rubin (1985) describe this as a trade-off.
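The trade-off is easy to see in a toy example. The following is only a sketch under my own assumptions: greedy 1:1 matching without replacement, scores that are already estimated, and a caliper of 0.25 times the SD of the pooled raw scores (whether the SD should be taken on the raw or the logit scale is a choice I am not settling here). The unit ids and scores are made up.

```python
from statistics import pstdev

def caliper_match(treated, control, caliper_sd=0.25):
    """Greedy 1:1 nearest-neighbour matching on a score, without replacement.

    `treated` and `control` map unit ids to (already estimated) propensity
    scores. A pair is kept only if the score gap is within
    caliper_sd * SD of all scores. Returns (pairs, unmatched_treated_ids).
    """
    caliper = caliper_sd * pstdev(list(treated.values()) + list(control.values()))
    available = dict(control)
    pairs, unmatched = [], []
    for t_id, t_score in treated.items():
        if not available:
            unmatched.append(t_id)
            continue
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
        else:
            unmatched.append(t_id)   # dropped: no control close enough
    return pairs, unmatched

# Hypothetical scores.
treated = {"t1": 0.30, "t2": 0.55, "t3": 0.95}
control = {"c1": 0.28, "c2": 0.50, "c3": 0.10, "c4": 0.70}
pairs, unmatched = caliper_match(treated, control)
```

Here t3 has no control within the caliper and is dropped, so the matched sample no longer represents all treated units. Without the caliper it would have been paired with c4, a poor match.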
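As noted above, nearest neighbor matching does not use every unit, whereas stratification, full matching, and weighting basically use all of them. The basic idea of these methods is to give every unit a weight between 0 and 1. Weighting methods include IPW, which applies $\frac{T_i}{e_i}+\frac{1-T_i}{1-e_i}$; weighting by the odds, which applies $T_i+(1-T_i)\frac{e_i}{1-e_i}$ (Hirano, Imbens and Ridder 2003); and kernel weighting (Heckman, Ichimura and Todd 1997). The weakness of weighting is that the variance becomes large when propensity scores take extreme values (close to 0 or 1).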

If the model is correctly specified and thus the weights are correct, then the large variance is appropriate. However, a worry is that some of the extreme weights may be related more to the estimation procedure than to the true underlying probabilities. [p.10]

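Doubly robust IPW has been proposed for this problem, and Hoshino's book explains it clearly.
Another important issue is common support. It arises when the propensity scores of the treated and control groups do not overlap, and several remedies are presented.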
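The two weighting formulas in step (2) can be written out directly. This is a minimal sketch with made-up units; the function names are mine, and the weighted difference in means is just one simple way to use the weights, not a recommendation from Stuart.

```python
def ate_weight(t, e):
    """Inverse probability weight T/e + (1-T)/(1-e)."""
    return t / e + (1 - t) / (1 - e)

def att_odds_weight(t, e):
    """Weighting by the odds: T + (1-T) * e/(1-e)."""
    return t + (1 - t) * e / (1 - e)

def weighted_mean_difference(t, y, w):
    """Weighted mean of y among treated minus weighted mean among controls."""
    num1 = sum(wi * yi for ti, yi, wi in zip(t, y, w) if ti == 1)
    den1 = sum(wi for ti, wi in zip(t, w) if ti == 1)
    num0 = sum(wi * yi for ti, yi, wi in zip(t, y, w) if ti == 0)
    den0 = sum(wi for ti, wi in zip(t, w) if ti == 0)
    return num1 / den1 - num0 / den0

# Hypothetical units: treatment indicator, propensity score, outcome.
t = [1, 1, 0, 0, 0]
e = [0.8, 0.5, 0.5, 0.2, 0.98]   # the last control has an extreme score
y = [3.0, 2.0, 1.0, 0.5, 0.0]

w_ate = [ate_weight(ti, ei) for ti, ei in zip(t, e)]
w_att = [att_odds_weight(ti, ei) for ti, ei in zip(t, e)]
ate_hat = weighted_mean_difference(t, y, w_ate)
att_hat = weighted_mean_difference(t, y, w_att)
```

The control unit with a score of 0.98 receives a weight of about 50 under the first scheme and 49 under the second, so the control-group mean is almost entirely that one observation. This is the variance problem with extreme scores in miniature.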

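(3) Diagnosing Matching
Stuart says this diagnosis is the most important part of matching. Writing is getting harder, so I will be brief: covariates are balanced when $p(X|T=1)=p(X|T=0)$ holds in the matched pairs or groups, and this must be diagnosed. The methods fall broadly into two kinds: (a) numerical checks and (b) graphical checks. For the former, Rubin (2001) recommends: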

1. The standardized difference of means of the propensity score.
2. The ratio of the variances of the propensity score in the treated and control groups.
3. For each covariate, the ratio of the variance of the residuals orthogonal to the propensity score in the treated and control groups.

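For the latter, plotting the distribution of propensity scores in each group before and after matching is recommended, along with a plot of standardized differences of means. The point is that the standardized differences of means should be smaller after matching.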
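The first two numerical checks are simple to compute. The sketch below is an illustration with made-up scores; dividing by the treated-group standard deviation is one common convention and an assumption on my part, and the third check (residual variances) is not implemented.

```python
from statistics import mean, variance
from math import sqrt

def standardized_mean_difference(x_treated, x_control):
    """Difference in means divided by the treated-group standard deviation."""
    return (mean(x_treated) - mean(x_control)) / sqrt(variance(x_treated))

def variance_ratio(x_treated, x_control):
    """Ratio of sample variances, treated over control."""
    return variance(x_treated) / variance(x_control)

# Hypothetical propensity scores.
treated = [0.6, 0.7, 0.8]
control_all = [0.1, 0.2, 0.3, 0.6, 0.8]   # before matching
control_matched = [0.6, 0.8, 0.65]        # after matching

smd_before = standardized_mean_difference(treated, control_all)
smd_after = standardized_mean_difference(treated, control_matched)
ratio_after = variance_ratio(treated, control_matched)
```

In this toy case the standardized difference falls from about 3 to under 0.2 after matching, which is the pattern the plots are meant to show covariate by covariate.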

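(4) Analysis of the Outcome
The final goal of matching is to analyze the effect of the treatment on the outcome, and this step is where that analysis begins. The paper covers analysis after k:1 matching, analysis after stratification and full matching, and variance estimation, which I omit.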

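The last part discusses open issues for matching methods, but I have run out of energy and will add notes later.

In sum, the paper lays out the concrete steps of causal inference with matching, introduces representative methods at each step, and explains them plainly, so it should be very useful to readers with some familiarity with matching. That the author is Stuart, well known in this field, is another reason to read it.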

2015-04-23

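Moretti (2013): Your Income Is Determined by "Where You Live"

I finished Moretti's book, another one that had been sitting in my unread pile. I had only read his labor economics papers, but he is also a prominent economist in urban economics.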

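Your Income Is Determined by "Where You Live": The Urban Economics of Jobs and Innovation (年収は「住むところ」で決まる 雇用とイノベーションの都市経済学)

Author: Enrico Moretti, Yosuke Yasuda (commentary), Chiaki Ikemura
Publisher: President Inc. (プレジデント社)
Release date: 2014/04/23
Format: book (softcover)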

The theme of Moretti's book is as the title says: it shows that where one lives has a very large effect on income. The following passage appears near the beginning.

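In today's advanced countries, disparities by place of residence have become larger than those by social class. Of course, globalization and technological progress cannot be held back, and in an economy strongly shaped by these two forces, more educated workers undoubtedly have the advantage over less educated ones. But how these two forces affect jobs and pay depends more on where a person lives than on what skills that person has. p.22

The effect of "where you live" on income is larger than we think. What kind of place, then, raises incomes? Moretti's answer is the innovation sector.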

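The innovation sector accounts for only a small share of the labor market, but it creates far more jobs than that in a region and shapes the local economy. ... (omitted) ... When I studied 11 million workers in 320 U.S. metropolitan areas, I found that each new high-tech job in a city eventually creates five additional jobs outside high tech in that area. p.83

The multiplier effect of the high-tech sector is apparently enormous. The multiplier result is published in Moretti (AER 2010), which I want to read later. It is one example of the unexpected effects of a city having a thick labor market, and the key point is that the benefits reach not only the highly educated, such as college graduates, but also high school graduates and others.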

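What deserves particular attention is that the less skilled a person is, the larger, on the whole, the benefit of living in a city with many college graduates. p.135

This is pointed out in Moretti (2004). In more detail: as the number of college graduates in a city rises, the pay of college graduates there rises, but not by much. The rise in pay for high school graduates, in contrast, is four times that of college graduates, and for high school dropouts five times. This surprised me.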

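Given that high-tech industries concentrate in certain cities, why do firms cluster there? Chapter 4 examines this "force of attraction." The question is clearly posed.

The locations of America's innovation hubs appear at first glance to have no necessity to them. In traditional industries, where an industry flourished was usually tied closely to natural resources. ..... By contrast, the geographic distribution of innovation clusters is hard to explain. p.160

The question is examined from many angles, and the interesting part is the discussion of how knowledge spreads. In an age of fast communication and air fares cheaper than they used to be, is there any need to value geographic proximity? The answer is that there very much is. Jaffe et al. (QJE 1993) examine citations of prior art in patents and trace how innovations spread.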

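What Jaffe and his coauthors found was surprising. When filing patents, inventors tended to cite the work of nearby inventors rather than that of inventors far away. The content of granted patents is open to anyone, so there is no reason for citations to be shaped by geography. For example, the probability that an inventor in Durham, North Carolina, knows of a patent originating in Durham should be no different from that for a patent originating elsewhere. Yet in practice, Durham inventors filing patents are far more likely to cite patents by other Durham inventors than prior patents by inventors in other cities. p.185

Moretti, a researcher himself, says he understands this well. It is a little long, but I quote it.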

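I keep in touch with distant colleagues by phone and e-mail, but truly good ideas usually come when least expected: over lunch with a colleague, or while chatting in the office kitchen. The reason is simple. Phone and e-mail are well suited to transmitting information and are extremely effective for moving a research project forward once its core idea has been found, but they are not the best means of generating new creative ideas. ...... Deciding in advance when to call a distant colleague and planning to come up with a new idea at that moment is an absurd notion. I think most researchers would agree. One reason academics spend so much time deciding whom to hire is that their own productivity depends on which colleagues they spend their time with. p.187

That being surrounded by productive people makes one more productive is something I often hear from people who did their Ph.D. abroad. On this point the book introduces Azoulay et al. (QJE 2010), which removes selection bias with an interesting and clever idea. This too is long, but I quote it.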

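They examined how collaborating with a superstar researcher affects the quality of medical researchers' work. Establishing causality here is not easy, because self-selection bias may be at work. Superstars want to work with able researchers, so even if their collaborators are highly productive, that may be simply because those people were good to begin with rather than because knowledge spread from the superstar. To remove this bias, Azoulay and his coauthors came up with a clever approach: they examined how collaborators' productivity changed before and after a superstar's sudden death (they found 112 such cases). Although the collaborators' own circumstances had not changed, "the quality-adjusted publication rate showed a lasting decline of 5 to 8%." Geographic proximity among researchers seems to benefit not only the number of papers published but also their quality. p.188

As shown above, the book presents many interesting results, the story is exciting, and it never bores as a read. Moretti is an economist, but urban economics is closely related to urban sociology, so sociologists too should gain much from it. The book's selling point is that Moretti, with many publications in top journals, builds his story on a solid scholarly basis.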

2015-04-21

Goldthorpe (ESR 2001): Causation, Statistics, and Sociology

An essay in which the British sociologist Goldthorpe sorts out the concept of causation and argues how causal analysis ought to be done in sociology.

Goldthorpe, J. H. (2001). “Causation, Statistics, and Sociology.” European Sociological Review, 17(1), 1–20.

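Goldthorpe first points out, citing Bernert (1983), that the concept of causation has not been adequately examined in sociology. After touching briefly on probabilism and determinism, he classifies the concepts of causation that took shape under the influence of statistics into the following three types.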

1．Causation as Robust Dependence

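We hear everywhere that correlation is not causation. As the next step in that story, an introductory undergraduate sociology lecture, for example, would bring up Lazarsfeld's elaboration. Elaboration simply means that when X and Y appear related in a cross-tabulation, conditioning on a new variable Z makes the relationship between X and Y disappear. In that case the dependence between X and Y is clearly not robust. As an extension of this exercise, multivariate analysis sometimes adds variables as controls. In econometrics, meanwhile, attention went to Granger causality, famous from Granger (ECTA 1969). Granger causality concerns time series, and Okimoto (2010), a well-known time-series econometrics textbook, gives the following definition.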

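Compare the forecast of future x based only on current and past values of x with the forecast of future x based on current and past values of both x and y. If the latter has the smaller MSE, Granger causality from y_t to x_t is said to exist. (p.80)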
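The definition can be made concrete with a small simulation. The following is only a sketch under loud assumptions: the data-generating process, the coefficients, the single lag, and all function names are hypothetical choices of mine, not Okimoto's or Granger's procedure, and the comparison uses a simple train/test split rather than a formal test.

```python
import random

def simulate(n, seed=0):
    """Simulate two zero-mean series in which lagged y enters x (by construction)."""
    rng = random.Random(seed)
    x, y = [0.0], [0.0]
    for _ in range(n - 1):
        x_next = 0.3 * x[-1] + 0.8 * y[-1] + rng.gauss(0, 1)
        y_next = 0.5 * y[-1] + rng.gauss(0, 1)
        x.append(x_next)
        y.append(y_next)
    return x, y

def fit_own_past(target, own):
    """Least-squares slope for target ~ own (no intercept; series are zero-mean)."""
    return sum(o * t for o, t in zip(own, target)) / sum(o * o for o in own)

def fit_with_other(target, own, other):
    """Least-squares slopes for target ~ own + other via the 2x2 normal equations."""
    s_oo = sum(o * o for o in own)
    s_pp = sum(p * p for p in other)
    s_op = sum(o * p for o, p in zip(own, other))
    s_ot = sum(o * t for o, t in zip(own, target))
    s_pt = sum(p * t for p, t in zip(other, target))
    det = s_oo * s_pp - s_op * s_op
    a = (s_ot * s_pp - s_pt * s_op) / det
    b = (s_pt * s_oo - s_ot * s_op) / det
    return a, b

def forecast_mse(x, y, split):
    """Out-of-sample one-step MSE for x: (using past x only, using past x and y)."""
    own, other, target = x[:-1], y[:-1], x[1:]
    a_r = fit_own_past(target[:split], own[:split])
    a_u, b_u = fit_with_other(target[:split], own[:split], other[:split])
    test = list(zip(own[split:], other[split:], target[split:]))
    mse_r = sum((t - a_r * o) ** 2 for o, p, t in test) / len(test)
    mse_u = sum((t - a_u * o - b_u * p) ** 2 for o, p, t in test) / len(test)
    return mse_r, mse_u

x, y = simulate(2000)
mse_without_y, mse_with_y = forecast_mse(x, y, split=1000)
print(mse_without_y, mse_with_y)  # the second value should be the smaller one here
```

Because y enters the equation for x in the simulation, adding lagged y lowers the forecast error for x. Swapping the roles, `forecast_mse(y, x, split=1000)`, should show almost no improvement, since x never enters the equation for y.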

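As Granger himself stated, Granger causality focuses on predicting the future, but whether it is Lazarsfeld's elaboration or a Granger causality test, the aim is to check whether the dependence between X and Y is robust. Goldthorpe says that in the sociology of roughly 1960 to 1980, this pursuit of robust dependence was regarded as causal analysis. Incidentally, this overlaps with what Morgan and Winship (2014) call "the age of regression."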
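Lazarsfeld's elaboration, as described above, can also be shown with a toy cross-tabulation. The cell counts below are hypothetical numbers I made up so that X and Y look related in the marginal table while the relationship vanishes within each level of Z.

```python
# Hypothetical cell counts keyed by (z, x, y); chosen purely for illustration.
counts = {
    (1, 1, 1): 64, (1, 1, 0): 16,
    (1, 0, 1): 16, (1, 0, 0): 4,
    (0, 1, 1): 4,  (0, 1, 0): 16,
    (0, 0, 1): 16, (0, 0, 0): 64,
}

def share_y(counts, x, z=None):
    """P(Y=1 | X=x), optionally also conditioning on Z=z."""
    cells = {k: n for k, n in counts.items()
             if k[1] == x and (z is None or k[0] == z)}
    total = sum(cells.values())
    return sum(n for k, n in cells.items() if k[2] == 1) / total

def association(counts, z=None):
    """Difference in P(Y=1) between X=1 and X=0, overall or within a Z stratum."""
    return share_y(counts, 1, z) - share_y(counts, 0, z)

print(association(counts))        # marginal table: X and Y look related
print(association(counts, z=1))   # within Z=1: no difference
print(association(counts, z=0))   # within Z=0: no difference
```

In this constructed table Z drives both X and Y, so the marginal association is entirely produced by Z and is not robust to conditioning.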

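Analyses of this kind are no longer regarded as causal analysis (inference) today, and I am somewhat doubtful about including this as one of the three types of causal analysis. Still, hearing Goldthorpe explain that there were (and are) times and communities in which such analyses counted as causal analysis does broaden one's sense of the history.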

2．Causation as Consequential Manipulation

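This is a more contemporary topic than robust dependence above: causal analysis here means observing a causal effect by manipulating the variable of interest. In other words, the RCT of experimental research. In an RCT, random assignment means that covariates other than the treatment should not differ systematically between the treatment group and the control group (with a sufficiently large sample, by the law of large numbers), so the difference between the two groups can be interpreted as the average causal effect. This way of thinking is simple and powerful (the covariates are balanced by design rather than controlled for after the fact), so compared with robust dependence it is accepted as causal analysis by statisticians as well. That is only natural. Goldthorpe barely mentions it, but besides RCTs the consequential manipulation family also includes quasi-experiments such as IV and RDD. Not only in this section but throughout the essay, the literature Goldthorpe cites is so old that it feels quite out of place, and anyone who follows today's causal inference even a little may think, "Isn't that framing of the problem dated??" That aside, as Goldthorpe notes, sociology has both supporters and opponents of the manipulation approach. The supporters of course include Sobel. The opponent he names is Stanley Lieberson. As for Goldthorpe himself, he says "I'm going a different way!", and that way is the next one, Causation as Generative Process.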
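The RCT logic in this section (random assignment balances covariates, so the difference in group means estimates the average causal effect) can be checked in a small simulation. This is a sketch with hypothetical numbers: the sample size, the covariate's effect of 1.5, and the true effect of 2.0 are all assumptions made for illustration.

```python
import random
from statistics import mean

def simulate_rct(n=20000, true_effect=2.0, seed=1):
    """Return rows of (covariate, treated, outcome) under coin-flip assignment."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        covariate = rng.gauss(0, 1)       # a pre-treatment characteristic
        treated = rng.random() < 0.5      # assignment is independent of the covariate
        outcome = 1.5 * covariate + true_effect * treated + rng.gauss(0, 1)
        rows.append((covariate, treated, outcome))
    return rows

def diff_in_means(rows, index):
    """Treated-minus-control difference in the mean of column `index`."""
    treated = [r[index] for r in rows if r[1]]
    control = [r[index] for r in rows if not r[1]]
    return mean(treated) - mean(control)

rows = simulate_rct()
covariate_gap = diff_in_means(rows, 0)   # should be close to 0 (balance)
effect_estimate = diff_in_means(rows, 2) # should be close to the true effect, 2.0
print(covariate_gap, effect_estimate)
```

The covariate strongly affects the outcome, yet nothing needs to be controlled for: randomization alone makes the two groups comparable on average.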

3．Causation as Generative Process

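I think it is inappropriate to adopt robust dependence as a category of present-day causal analysis, so I set it aside (it is useful as a category for telling the history). But by what mechanism did a causal effect confirmed through manipulation arise? Goldthorpe's argument is that thinking about this is what sociology should take on in causal analysis, and he calls it the Generative Process. The Generative Process could also be called the causal mechanism. Uncovering the Generative Process enriches the results of the consequential manipulation analysis, and to that extent, Goldthorpe says, the two kinds of analysis are complementary.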

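That was a rather brief summary, but Goldthorpe champions the Generative Process. I fully agree that Consequential Manipulation and Generative Process are complementary. Goldthorpe does not touch on this in the essay, but I think sociology, compared with other social sciences, has accumulated theory and qualitative analysis (not necessarily formalized) across many subfields, so that accumulated body of theory and qualitative findings should have much to contribute to the Generative Process. Concretely, I have in mind something like the crucial subject-matter input that Goldthorpe treats as a necessary step in building hypotheses about the Generative Process. I think analysis of the Generative Process will advance when quantitative and qualitative researchers team up, and this essay by the venerable Goldthorpe left me wanting to start by adopting that attitude myself.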

2015-04-13

Xie (2007): Duncan's Way: The Demographic Approach in Sociology

I have been hearing about this paper from people around me lately, so I read it.

Xie, Yu. 2007. "Otis Dudley Duncan's Legacy: The Demographic Approach to Quantitative Reasoning in Social Science." Research in Social Stratification and Mobility, 25(2): 141-156.

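The name Duncan may bring to mind the Japanese comedian and impressions of Beat Takeshi, but the Duncan here is the sociologist. In sociology he is famous for his contributions to quantitative analysis, and The American Occupational Structure, coauthored with Blau, is about as famous as a classic of social mobility and stratification can be.
The paper gives a brief history of quantitative sociology and presents Duncan's research stance with examples. I summarize it following the table of contents.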
1．Population thinking versus typological thinking
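According to Ernst Mayr, typological thinking, which has an affinity with the essentialism that dates back to Plato, stands in opposition to population thinking, and the latter was introduced by Darwin. Following Quetelet, also known as the proponent of social physics, it was Galton (Darwin's cousin) who made statistics a tool for thinking about society. Both used statistics as a tool, but they differed in the following respect: Quetelet focused on the stability of the average in a population, whereas Galton not only had little interest in the average but did not even regard it as stable, and was instead interested in the whole distribution. Galton later proposed the concepts of regression and correlation. The difference between typological thinking and population thinking reportedly also shows up in how each treats error. Of typological thinking, Xie writes: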

In typological thinking, deviations from the mean are simply "errors," with the mean approaching the true cause. That is, the true cause is constant, but what we actually observe is contaminated by measurement error.

Of population thinking, by contrast, he writes:

In population thinking, deviations are the reality of substantive importance; the mean is just one property of a population. Variance is another, equally important, property.

The paper also notes that Duncan was particular about the difference between mean and average. Incidentally, I wondered how typological thinking and population thinking relate to the relationship between methodological individualism and holism.
2．Duncan as a population thinker
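This section argues that Duncan remained a population thinker throughout. Passages from emails he sent to the author, Xie, are quoted, which makes it fun to read. Incidentally, Duncan himself apparently regarded his 1984 Notes on Social Measurement: Historical and Critical as the major work of his life (in an exchange with Xie on September 27, 2004).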

Notes on Social Measurement: Historical and Critical

Author: Otis Dudley Duncan
Publisher: Russell Sage Foundation
Release date: 1984/05/01
Format: Hardcover

On population thinking and Darwin, Duncan writes the following in Notes on Social Measurement.

Darwin's emphasis on the variation among individuals in any natural population and the heritability of such variation actually provides the general conceptual framework for psychometrics and makes clear its affiliation with the population sciences. (Psychophysics, by contrast, has usually taken a typologically oriented interest in the species norm. . .and has only grudgingly conceded the existence of interindividual variation, regarding it as a nuisance rather than a primary object of inquiry.) (p. 200)

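Xie states that Duncan felt it was pointless to look for universal causal laws in society and believed that social science is a population science.
Further, to underline Duncan's position, Xie ties typological thinking and population thinking to the Gaussian approach and the Galtonian approach, respectively. The two approaches are formulated as follows.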

Gaussian approach (typological thinking): Observed data = constant model + measurement error

Galtonian approach (population thinking): Observed data = systematic (between-group) variability + remaining (within-group) variability
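One way to read the Galtonian formulation numerically is the familiar split of total variation into a between-group part and a within-group part. The sketch below uses hypothetical data and hypothetical function names; it illustrates the arithmetic only and is not taken from Xie's paper.

```python
from statistics import mean

def decompose(groups):
    """Split the total sum of squares into between-group and within-group parts."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    # Systematic part: how far each group mean sits from the grand mean.
    between = sum(len(values) * (mean(values) - grand_mean) ** 2
                  for values in groups.values())
    # Remaining part: how far individuals sit from their own group mean.
    within = sum((v - mean(values)) ** 2
                 for values in groups.values() for v in values)
    total = sum((v - grand_mean) ** 2 for v in all_values)
    return between, within, total

# Hypothetical outcome values for two groups.
groups = {"a": [1.0, 2.0, 3.0], "b": [4.0, 6.0, 8.0]}
between, within, total = decompose(groups)
print(between, within, total)  # 24.0 10.0 34.0
```

The same numbers fit the Gaussian reading if the within-group part is labeled measurement error around a constant, which is why the difference between the two approaches lies in interpretation rather than in the algebra.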

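As David Freedman pointed out to Xie, the distinction above is not clear-cut (statistically the two are the same formulation) and is a matter of interpretation; I note this at the outset. Those named here as being in the Gaussian approach (typological thinking) camp are the sociologist Blalock and the statistician Freedman just mentioned, while Duncan is named as being in the Galtonian approach (population thinking) camp. Whereas Blalock, author of Causal Inferences in Nonexperimental Research, aimed at broadly applicable causal laws, Duncan flatly disagreed:

The stress on the populational as opposed to the typological approach is valuable. I was totally unable to get it across to H. Blalock.

What is interesting is the exchange of letters between Freedman and Duncan. Freedman criticized the use of path analysis in the Blau and Duncan book, and Duncan conceded in a letter, in effect, "that may indeed have been a mistake." They apparently exchanged letters many times, but their positions seem to have been the classic Gaussian approach versus Galtonian approach, and the gap was never bridged. Incidentally, Freedman reportedly wrote to Duncan, "Your distinction between the Gaussian and Galtonian regression traditions seems right."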
3．Duncan's influence on quantitative reasoning in social science
This section describes how the rapprochement between demography and sociology was chiefly Duncan's contribution.
4．Dissatisfaction with statistical sociology
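Duncan apparently had deep dissatisfaction with quantitative analysis in sociology, because in a great many studies the regression coefficients of a path analysis were interpreted as if they were causal relationships. Duncan, who is credited with introducing path analysis to sociology, used it not so much to identify causal relationships as to focus on the structure (correlations) among variables and to raise questions, yet the technique spread in ways that departed from his intent. Duncan also seems to have been on friendly terms with the econometrician Goldberger, and he describes the difference between sociology and economics as follows.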

Sociologists appear to be most interested in an "inductive" strategy with respect to models, holding to the somewhat forlorn hope that it will be possible to "discover" the right model through data analysis ...... Economists, I take it, have somewhat more confidence in their theories which have a status of a priori information with respect to their models, and therefore are more concerned with efficient estimation.

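Hmm.... I wonder how far sociologists and economists would agree with this classification, or distinction. In any case, what this section stresses is that Duncan was genuinely angry at sociologists who use statistical methods carelessly (a tendency he derided as statisticism). And above all: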

quantitative tools should not be used to discover universal laws that would describe or explain the behaviors of all individuals. He totally rejected such endeavors as meaningless. He believed that all quantitative analysis can do is to summarize empirical patterns of between-group differences while temporarily ignoring within-group individual differences.

This, reportedly, was Duncan's creed.
5．The key problem: population heterogeneity
This section confirms that Duncan was absorbed in the Rasch model and that he placed the weight of his work on analyzing population heterogeneity. In Duncan's words:

In the little thinking I do these days about the old battles I fought, it has increasingly seemed to me that one of two or three cardinal problems that social science has not yet come to grips with is precisely this issue of heterogeneity... The ubiquity of heterogeneity means that for the most part we substitute actuarial probabilities for the true individual probabilities, and therefore we generate mainly descriptively accurate but theoretically empty and prognostically useless statistics.
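Duncan's contrast between actuarial probabilities and true individual probabilities can be illustrated with a toy population. The two latent types and their probabilities below are hypothetical values I chose for illustration; nothing here comes from Duncan or Xie.

```python
from statistics import mean

# Hypothetical population: half the individuals have a true event probability
# of 0.1 and the other half 0.9, but the types are unobserved.
individual_probs = [0.1] * 50 + [0.9] * 50

# The actuarial probability is the group-level rate an analyst would report.
actuarial_prob = mean(individual_probs)

# Average distance between each individual's true probability and that rate.
mean_abs_gap = mean([abs(p - actuarial_prob) for p in individual_probs])

print(actuarial_prob, mean_abs_gap)
```

The group rate of about 0.5 describes the population accurately, yet no individual in it has a probability anywhere near 0.5, which is one reading of "descriptively accurate but prognostically useless."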
6．Conclusion
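It is clear that Xie holds Duncan in great affection. To sum up, Duncan's contribution to quantitative sociology is said to be that he made population thinking, introduced by Darwin and developed by Galton, known across sociology. That is, according to Duncan, the important task of quantitative sociology is to describe systematic patterns of population variability.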

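In his later years Duncan concentrated on population heterogeneity, but he reportedly ended up disappointed with quantitative sociology. I do not know in what sense Xie uses "disappointed," but I suspect Duncan's disappointment would have been eased considerably had he been able to study today's statistical methods. Certainly, as Heckman (2001) says, population heterogeneity is a challenge not only for causal inference but for applied quantitative analysis in general. But it is not as though nothing has been done, and anyone who claimed that current quantitative tools can say nothing about population heterogeneity would simply be laughed at for not having done the reading. I understand Duncan's orientation, but rather than pulling out the Gaussian approach versus Galtonian approach schema and going on about "I'm on this side, you're on that side" (such interpretation does matter, of course), I feel that today's sociologists should keep that to themselves to some degree and become able to compete on the methodology of quantitative analysis. There are more than a few cases in which territory that belongs to sociology has been taken over by other fields because sociology lagged in methodology, and that is frustrating. Finally, Duncan also cracked this joke.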

Economists reason correctly from false premises; sociologists reason incorrectly from true premises. Thus they create two complementary bodies of ignorance. (Duncan to Yu Xie, June 28, 2003)