Why Do You Think They Call It Data Mining?

Megan McArdle has a great little post on spurious correlations. This is an obsessive interest of mine. A simpler way to make the same basic point is that when we say we are “95% certain that a relationship between X and Y is non-random”, we are relying on a test that a purely random relationship will still pass about 1 time in 20; hence, if I try 20 random potential correlates, I should expect to find one that is “significant”.
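
To make the arithmetic concrete, here is a minimal sketch in Python. It is an illustration, not anything from McArdle's post. It tests 20 pure-noise variables against a pure-noise outcome and counts how many clear the 5% bar. The sample size, the number of repetitions, the Fisher z approximation used for the p-value, and all the names are my own assumptions.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05        # significance threshold ("95% certain")
N_CORRELATES = 20   # unrelated candidate variables tried per search
N_OBS = 50          # observations per variable (assumed)
N_RUNS = 500        # how many times the whole search is repeated (assumed)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Two-sided p-value for r, using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_significant(rng):
    """Test N_CORRELATES noise variables against one noise outcome."""
    y = [rng.gauss(0, 1) for _ in range(N_OBS)]
    hits = 0
    for _ in range(N_CORRELATES):
        x = [rng.gauss(0, 1) for _ in range(N_OBS)]
        if p_value(pearson_r(x, y), N_OBS) < ALPHA:
            hits += 1
    return hits


# What the arithmetic predicts when nothing real is going on.
expected_hits = ALPHA * N_CORRELATES                  # 20 * 0.05 = 1
prob_at_least_one = 1 - (1 - ALPHA) ** N_CORRELATES   # about 0.64

# What a seeded simulation of pure noise produces.
rng = random.Random(0)
counts = [count_significant(rng) for _ in range(N_RUNS)]
mean_hits = sum(counts) / N_RUNS
share_with_hit = sum(c > 0 for c in counts) / N_RUNS

print(f"expected 'significant' correlates per search: {expected_hits:.2f}")
print(f"simulated average per search:                 {mean_hits:.2f}")
print(f"chance of at least one (theory):              {prob_at_least_one:.2f}")
print(f"share of searches with at least one (sim):    {share_with_hit:.2f}")
```

The simulated figures should land close to the theoretical ones: roughly one “significant” correlate per search on average, and at least one in a bit under two-thirds of searches, even though every variable is noise by construction.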

In fact, such a procedure is a violation of one of the little-understood conditions for such a significance test: the hypothesis is supposed to be specified before looking at the data, not selected because it happened to pass. This is why “data mining” was a pejorative term among statisticians 30 years ago, despite having been translated subsequently into a positive commercial term through the magic of the marketplace.

In complex systems, such as those involving human beings, this problem of spurious correlation is not an annoyance to be minimized and tolerated, but is really the central question for empirical economics and social science. Hold-out tests can reduce some of this problem, but the only real solution (though even this is imperfect) is to conduct random assignment experiments.
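
As a rough illustration of what a hold-out test buys you, here is a minimal Python sketch. It mines 100 pure-noise variables for the one that best fits the first half of the data, then re-tests that winner on the second half, which the search never saw. The 50/50 split, the number of candidates, the number of repetitions, and all the names are my own assumptions.

```python
import math
import random

N_CANDIDATES = 100  # pure-noise variables searched (assumed)
N_TRAIN = 50        # observations used for the search (assumed)
N_HOLDOUT = 50      # observations held back for the re-test (assumed)
N_RUNS = 40         # how many times the experiment is repeated (assumed)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mine_once(rng):
    """Pick the best-fitting noise variable in-sample, then re-test it."""
    n = N_TRAIN + N_HOLDOUT
    y = [rng.gauss(0, 1) for _ in range(n)]
    best_train_r, best_holdout_r = 0.0, 0.0
    for _ in range(N_CANDIDATES):
        x = [rng.gauss(0, 1) for _ in range(n)]
        train_r = pearson_r(x[:N_TRAIN], y[:N_TRAIN])
        if abs(train_r) > abs(best_train_r):
            best_train_r = train_r
            # Score the current winner on data the search never touched.
            best_holdout_r = pearson_r(x[N_TRAIN:], y[N_TRAIN:])
    return abs(best_train_r), abs(best_holdout_r)


rng = random.Random(1)
results = [mine_once(rng) for _ in range(N_RUNS)]
mean_train = sum(t for t, _ in results) / N_RUNS
mean_holdout = sum(h for _, h in results) / N_RUNS

print(f"mean |r| of the mined winner, in-sample: {mean_train:.2f}")
print(f"mean |r| of the same winner, hold-out:   {mean_holdout:.2f}")
```

The mined winner looks impressive in-sample and shrinks back toward zero on the hold-out half, because the search selected it for fitting noise. This only helps as long as the hold-out data stays untouched: if I keep searching until something also passes the hold-out test, I am mining that half too.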