Why Do You Think They Call It Data Mining?

Megan McArdle has a great little post on spurious correlations. This is an obsessive interest of mine. A simpler way to make the same basic point is that when we say we are “95% certain that a relationship between X and Y is non-random”, that means there is a 1-in-20 chance that it is random; hence, if I try 20 random potential correlates, I should expect to find one that is “significant”.

In fact, such a procedure is a violation of one of the little-understood conditions for such a significance test. This is why “data mining” was a pejorative term among statisticians 30 years ago, despite having been translated subsequently into a positive commercial term through the magic of the marketplace.
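To put rough numbers on the 20-correlates point, here is a minimal sketch. It assumes that none of the 20 candidates has any real relationship to the outcome and that the 20 tests are independent, in which case each p-value is uniformly distributed between 0 and 1. The function names and parameters are hypothetical, chosen only for illustration.

```python
import random

ALPHA = 0.05        # significance threshold (the "95% certain" level)
N_CANDIDATES = 20   # number of unrelated candidate correlates tried


def analytic_summary(alpha=ALPHA, n=N_CANDIDATES):
    """Expected false positives, and chance of at least one, for independent tests."""
    expected = alpha * n
    p_at_least_one = 1 - (1 - alpha) ** n
    return expected, p_at_least_one


def simulate(trials=20_000, alpha=ALPHA, n=N_CANDIDATES, seed=0):
    """Assumption: under the null each p-value is uniform on [0, 1]."""
    rng = random.Random(seed)
    total_hits = 0
    trials_with_hit = 0
    for _ in range(trials):
        # Count how many of the n null tests fall below the threshold.
        hits = sum(rng.random() < alpha for _ in range(n))
        total_hits += hits
        trials_with_hit += hits > 0
    return total_hits / trials, trials_with_hit / trials


expected, p_any = analytic_summary()
sim_mean, sim_any = simulate()
print(f"analytic:  expected hits = {expected:.2f}, P(at least one) = {p_any:.3f}")
print(f"simulated: mean hits     = {sim_mean:.2f}, P(at least one) = {sim_any:.3f}")
```

Under these assumptions the expected number of “significant” findings is 20 × 0.05 = 1, and the chance of finding at least one is 1 − 0.95^20, or roughly 64%.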

In complex systems, such as those involving human beings, this problem of spurious correlation is not an annoyance to be minimized and tolerated, but is really the central question for empirical economics and social science. Hold-out tests can reduce some of this problem, but the only real solution (though even this is imperfect) is to conduct random assignment experiments.
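As a rough illustration of what a hold-out test buys you, the sketch below generates an outcome and 20 candidate predictors that are all pure noise, picks the candidate with the strongest correlation on the first half of the data, and then re-measures that same candidate on the held-out second half. Everything here (the sample sizes, the helper names, the use of Gaussian noise) is an assumption made for the example, not a description of any real data set.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def mine_then_hold_out(n_obs=200, n_candidates=20, seed=0):
    """Return (|r| of the best candidate on the mined half, its |r| on the hold-out half)."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    candidates = [[rng.gauss(0, 1) for _ in range(n_obs)]
                  for _ in range(n_candidates)]
    half = n_obs // 2
    # "Mine" the first half: keep whichever noise series correlates best.
    train_r = [abs(pearson(c[:half], y[:half])) for c in candidates]
    best = max(range(n_candidates), key=lambda i: train_r[i])
    # Check the winner on data it was not selected on.
    test_r = abs(pearson(candidates[best][half:], y[half:]))
    return train_r[best], test_r


# Average over many independent runs so the comparison is not a fluke of one seed.
results = [mine_then_hold_out(seed=s) for s in range(100)]
avg_train = sum(r[0] for r in results) / len(results)
avg_test = sum(r[1] for r in results) / len(results)
print(f"mean |r| on mined half:    {avg_train:.3f}")
print(f"mean |r| on hold-out half: {avg_test:.3f}")
```

On average the mined correlation shrinks toward what noise alone would produce once it is checked on the hold-out half. The hold-out check helps only as long as the hold-out data are not themselves reused to choose among candidates.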

Could you say more about this? A lot more?

— back40 · Feb 26, 09:15 PM · #

You know, I used to work for a hedge fund that focused on statistical arbitrage in the equities markets. Which pretty much always sounded like data mining to me. But they’ve got a 20-year positive return history, so I guess it’s real. Or necromancy.

— Noah Millman · Feb 26, 09:27 PM · #

I simply don’t understand how someone who has such a fear of false correlation could be such a rabid advocate of private school vouchers, an issue which is as dominated by selection error as I can possibly imagine.

— Freddie · Feb 26, 10:15 PM · #

Freddie: I assume McArdle supports vouchers on general ideological principles. I don’t think there’s a good enough data set for anyone to draw strong empirical conclusions – in either direction – from the voucher programs that have been tried so far.

— Noah Millman · Feb 26, 10:34 PM · #

Or, in fact, if you try 20, you should expect to find more than one…

— Sanjay · Feb 26, 11:09 PM · #

Jim
You bring up an interesting point, and it’s one that I struggle with currently as a social science grad student. Implicit in what you are addressing here is also the tension between addressing the ‘complexity’ of a social system and focusing with a narrow eye on a piece of the system one can address with random assignment in a lab setting (thus decreasing the complexity/realism). Even now as I write this, I am conducting a lab on attitudes and the failure to disregard information in a decision-making environment. It is an interesting topic, and we study it cleanly to find correlation and causality, but it’s also not nearly as relevant in a complex system once you add all the other variables at work in a ‘real’ decision-making environment.

I wonder what you think of the complex systems software and the ability to use simulation to address these questions (I am thinking here most specifically of system dynamics by Jay Forrester, but there are obviously others).

Peter

— Peter Boumgarden · Feb 27, 12:27 AM · #

Noah:

As you know, the great thing about models in securities markets is that you can back-test, then dry trade, then run live and measure return vs. expectation and (if you’re not really leveraged) stop trading when, for some unexplainable reason, the algorithm stops working. You can then put it back in the “lab”, see if you can modify it to back-test well in the new environment, and start the whole cycle over again. As long as you have a portfolio of such model types, you can make money fairly consistently. Said differently, you can test experimentally whether or not you have reliable predictiveness (again, subject to that tricky “not being really leveraged” condition; otherwise you’re in Taleb country).

Freddie:

Reihan forwarded me your email about my Andrew Sullivan post on school choice; sorry I haven’t been able to respond yet. The impact of school choice on program participants is a subject on which some of the most certain conclusions about causal impact can be made, since we have repeated random assignment trials due to the lottery selection of participants. A couple of excellent papers that review specific experiments in detail and reference numerous other such studies are:

http://www.ksg.harvard.edu/pepg/PDF/Papers/PEPG_03-14.pdf
(This is a classic paper from the Harvard group that looked at the NYC experiment)

http://www.ksg.harvard.edu/pepg/PDF/Papers/PEPG_03-15.pdf
(This looks at a national experiment with 40,000 participating families across the US.)

Peter:

I wrote about exactly this trade-off between comprehensive-but-unreliable structural equation models and reliable-but-ungeneralizable “behavioral economics” models in the following guest post for Andrew Sullivan:

http://andrewsullivan.theatlantic.com/the_daily_dish/2008/02/freaks-and-geek.html

— Jim Manzi · Feb 27, 02:41 AM · #

I’m still waiting, Jim, for an explanation of the mechanism through which private schools are supposed to be better educating students.

— Freddie · Feb 27, 03:56 AM · #

However, I should say that that’s a discussion for another time.

— Freddie · Feb 27, 04:47 AM · #

Why do you need to know the mechanism? If they’re demonstrably better, that should be good enough, right?

I mean, we don’t really know the mechanism by which gravity works, but I still stick to the ground. We’ve also got a very incomplete understanding (to be generous) of how the human brain forms memories, but somehow we still manage (mostly) to remember who we are and what we’re doing.

Sometimes it’s enough for a phenomenon to be observable, even if it’s not currently explainable.

— TW Andrews · Feb 28, 03:27 PM · #

If they’re demonstrably better, that should be good enough, right?

First of all, no. Second of all, they’re not demonstrably better. Thirdly, it’s notoriously difficult to determine causal relationships in education. If you can’t describe the process through which your supposedly better method works, it becomes impossible.

— Freddie · Feb 29, 01:40 AM · #
