
King and Roberts (PA 2015): don't use robust standard errors uncritically

Asymptotic theory rarely comes up in quantitative-methods lectures in sociology, and I think that is a problem. The econometrics frontier is still producing plenty of results on asymptotic properties, and while following all of them is hard, the basics ought to be common knowledge. When the asymptotic approximation that justifies an estimate does not hold, for instance, the robust standard error estimate can come out too small. The point that blindly reaching for robust or clustered standard errors is not enough is also covered in Chapter 8 of Mostly Harmless Econometrics, and this paper is in the same vein.

King, G. and Roberts, M. E. 2015. "How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It." Political Analysis 23: 159-179.

The abstract is as follows.

“Robust standard errors” are used in a vast array of scholarship to correct standard errors for model misspecification. However, when misspecification is bad enough to make classical and robust standard errors diverge, assuming that it is nevertheless not so bad as to bias everything else requires considerable optimism. And even if the optimism is warranted, settling for a misspecified model, with or without robust standard errors, will still bias estimators of all but a few quantities of interest. The resulting cavernous gap between theory and practice suggests that considerable gains in applied statistics may be possible. We seek to help researchers realize these gains via a more productive way to understand and use robust standard errors; a new general and easier-to-use “generalized information matrix test” statistic that can formally assess misspecification (based on differences between robust and classical variance estimates); and practical illustrations via simulations and real examples from published research. How robust standard errors are used needs to change, but instead of jettisoning this popular tool we show how to use it to provide effective clues about model misspecification, likely biases, and a guide to considerably more reliable, and defensible, inferences. Accompanying this article is software that implements the methods we describe.

The paper's centerpiece is the proposed "generalized information matrix test," which formally tests whether model misspecification lies behind the robust standard errors. Robust standard errors are apparently used everywhere in political science as well; according to the authors,

Among all articles between 2009 and 2012 that used some type of regression analysis published in the American Political Science Review, 66% reported robust standard errors. In International Organization, the figure is 73%, and in American Journal of Political Science, it is 45%.

That is how widespread they are. Plenty of users in sociology, too, probably apply them without much reflection, but caution is needed whenever the classical and robust standard errors differ substantially. The Breusch-Pagan and White tests once used to detect heteroskedasticity let this problem slip through (that is, they cannot distinguish model misspecification from mere heteroskedasticity). As King and Roberts put it:

However, they are often used in applications as a default setting, without justification (sometimes even as an effort to inoculate oneself from criticism), and without regard to the serious consequences their use implies about the likely misspecification in the rest of one’s model. Moreover, a model for which robust and classical standard error estimates differ is direct confirmation of misspecification that extends beyond what the procedure corrects, which means that some estimates drawn from it will be biased—often in a way that can be fixed but not merely by using robust standard errors.

First, consider the success case for robust standard errors. Suppose we estimate by maximum likelihood a linear regression that assumes homoskedastic normal errors, while the data-generating process is in fact heteroskedastic. The coefficient estimator remains consistent and unbiased, though not efficient, and robust standard errors then give valid inference. Against this benefit, King and Roberts point out two problems.

First, even if the functional form, independence, and other specification assumptions of this regression are correct, only certain quantities of interest can be consistently estimated. For example, if the dependent variable is the Democratic proportion of the two-party vote, we can consistently estimate a regression coefficient, but not the probability that the Democrat wins, the variation in vote outcome, risk ratios, vote predictions with confidence intervals, or other quantities. In general, computing quantities of interest from a model, such as by simulation, requires not only valid point estimates and a variance matrix, but also the veracity of the model’s complete stochastic component (King, Tomz, and Wittenberg 2000; Imai, King, and Lau 2008).
Second, if robust and classical standard errors diverge—which means the author acknowledges that one part of his or her model is wrong—then why should readers believe that all the other parts of the model that have not been examined are correctly specified? We normally prefer theories that come with measures of many validated observable implications; when one is shown to be inconsistent with the evidence, the validity of the whole theory is normally given more scrutiny, if not rejected (King, Keohane, and Verba 1994). Statistical modeling works the same way: each of the standard diagnostic tests evaluates an observable implication of the statistical model. The more these observable implications are evaluated, the better, since each one makes the theory vulnerable to being proven wrong. This is how science progresses. According to the contrary philosophy of science implied by the most common use of robust standard errors, if it looks like a duck and smells like a duck, it is just possible that it could be a beautiful blue-crested falcon.

King and Roberts are not saying that robust standard errors should never be used. Their position is that when the classical and robust standard errors differ substantially, misspecification is likely, so one should run basic diagnostics on the data at hand and choose an appropriate estimation method accordingly. As they state in their conclusion, the upshot is this.

Robust standard errors should be treated not as a way to avoid reviewer criticism or as a magical cure-all. They are neither. They should instead be used for their fundamental contribution—as an excellent model diagnostic procedure. We strongly echo what the best data analysts have been saying for decades: use all the standard diagnostic tests; be sure that your model actually fits the data; seek out as many observable implications as you can observe from your model. And use all these diagnostic evaluation procedures to respecify your model. If you have succeeded in choosing a better model, your robust and classical standard errors should now approximately coincide.

The evidence that the GIM (generalized information matrix) test has guided you to a better model is that the classical and robust standard errors, which once diverged sharply, now approximately coincide.