Has Medical Science Discovered Anything Useful?
Yes.
David Freedman has a very interesting article in The Atlantic that has generated a lot of justified interest in the blogosphere. Freedman describes how hard it is to develop reliable medical knowledge, saying of medical scientific studies that “you have to wonder whether they prove anything at all. Indeed, given the breadth of the potential problems raised at the meeting, can any medical-research studies be trusted?”
The article talks a whole lot about the role that researcher bias, skewed incentives to gain funding and so forth play in making even the conclusions of large-scale randomized control trials (RCTs) suspect. The hero of the article, celebrated medical meta-researcher Dr. John Ioannidis, is quoted as saying:
“The studies were biased,” he says. “Sometimes they were overtly biased. Sometimes it was difficult to see the bias, but it was there.” Researchers headed into their studies wanting certain results—and, lo and behold, they were getting them. We think of the scientific process as being objective, rigorous, and even ruthless in separating out what is true from what we merely wish to be true, but in fact it’s easy to manipulate results, even unintentionally or unconsciously. “At every step in the process, there is room to distort results, a way to make a stronger claim or to select what is going to be concluded,” says Ioannidis. “There is an intellectual conflict of interest that pressures researchers to find whatever it is that is most likely to get them funded.”
This kind of thing happens, as I have often written. But as always, the question about the claims that such a process produces is “compared to what?” That is, does our imperfect scientific knowledge of medicine allow us to make better decisions than we would make in the absence of this information, or not? After all, medical research is not the only scientific field in which funding pressure creates researcher biases, yet we still seem to be able to build functioning airplanes and mobile phones.
The author answers this question with too broad a brush. What’s so striking to me about the facts asserted in the article is that, even if we accept the author’s claims, certain identifiable categories of medical knowledge seem quite reliable, while others seem worse than useless.
First, consider the differences by research methodology. The article claims that 80 percent (!) of non-randomized studies turn out to be wrong, as compared to “as much as” 10 percent of large randomized trials. Sufficiently-large randomized experiments, as I have often argued, are not some nerdy nice-to-have when evaluating theories in a sufficiently complex environment; they are a requirement. Being wrong 80 percent of the time is literally worse than just flipping a coin. On the other hand, if I had a series of ailments, and had to make a series of decisions about treatments for them, I would happily rely on a method that was right at least 90 percent of the time in preference to relying on some combination of my intuition, what my brother-in-law experienced and what I discovered on Google. Well-executed RCTs really do create useful scientific knowledge.
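For the simple binary treat/don’t-treat case, the comparison can be sketched in a few lines of code. The 80 and 10 percent figures are the article’s; the simulation itself is just a toy illustration.

```python
import random

random.seed(0)
N = 100_000  # number of simulated binary treat/don't-treat decisions

def fraction_correct(p_right, trials=N):
    """Always follow a source that is right with probability p_right on
    each independent binary decision; return the fraction of correct calls."""
    return sum(random.random() < p_right for _ in range(trials)) / trials

coin = fraction_correct(0.50)            # flipping a coin
non_randomized = fraction_correct(0.20)  # article: 80% of such studies are wrong
large_rct = fraction_correct(0.90)       # article: "as much as" 10% are wrong

assert non_randomized < coin < large_rct
```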
Second, consider the examples of subsequently-refuted findings that are identified in the article. Here is a representative paragraph:
“Of course, medical-science “never minds” are hardly secret. And they sometimes make headlines, as when in recent years large studies or growing consensuses of researchers concluded that mammograms, colonoscopies, and PSA tests are far less useful cancer-detection tools than we had been told; or when widely prescribed antidepressants such as Prozac, Zoloft, and Paxil were revealed to be no more effective than a placebo for most cases of depression; or when we learned that staying out of the sun entirely can actually increase cancer risks; or when we were told that the advice to drink lots of water during intense exercise was potentially fatal; or when, last April, we were informed that taking fish oil, exercising, and doing puzzles doesn’t really help fend off Alzheimer’s disease, as long claimed. Peer-reviewed studies have come to opposite conclusions on whether using cell phones can cause brain cancer, whether sleeping more than eight hours a night is healthful or dangerous, whether taking aspirin every day is more likely to save your life or cut it short, and whether routine angioplasty works better than pills to unclog heart arteries.”
Do you see a common theme here? Though not exclusively, these tend very strongly to be long-term, behaviorally oriented interventions. Consider the classical therapies that were evaluated during the heroic phase of clinical trials in the mid-20th century – things like the streptomycin trials in Britain or the polio vaccine trials in the US. These situations can be characterized by acute conditions that cause death or obvious loss of function within a short period, and that are addressed by treatments that apply a chemical to the body. As we shade from this kind of problem to those characterized by conditions that affect people over many years, often in subjective ways, and are addressed by lifestyle changes or daily dosages of vitamins and so on, we are shading from medicine as classically conceived to something that is analytically much more like social science. This latter end of the spectrum is where the most common and severe problems with reliable determinations of the causal effectiveness of interventions arise. This is not necessarily because researchers in these areas are less honest than those in other fields, but because the problem is inherently harder. Among other issues, signal-to-noise is worse, the relevant measurement period becomes years and decades rather than weeks and months, and the causal mechanism often becomes subtly entangled with many lifestyle behaviors. In such situations, RCTs are often impractical, and even when they can be done, the integrated complexity of the causal mechanisms means that replications are much more likely to fail because unobserved context differences turn out to be relevant in determining the success or failure of the treatment. The problem isn’t always the researchers; sometimes the problem is the problem.
Nice article, Jim. One pedantic quibble – being wrong 80% of the time isn’t worse than flipping a coin if there is more than one choice.
If I take a multiple choice test with four answers per question and I get 40% of the questions right, I guess it’s literally true that I did worse than flipping a coin would do on a two-answer question, but the implication — that I would have been better off guessing — is not right.
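To put numbers on it (a four-option test, purely for illustration):

```python
# Four-option multiple-choice test, purely for illustration.
options = 4
p_guess = 1 / options   # expected score from random guessing: 25%
p_method = 0.40         # a method that is "wrong 60% of the time"

assert p_method > p_guess   # still better than guessing among four options...
assert p_method < 0.50      # ...even though it loses to a two-option coin flip
```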
Sorry for the quibble, but couldn’t stop myself.
— J Mann · Oct 20, 04:33 PM · #
Even following the advice of a study that has a 90% chance of being wrong might be the best course of action. After all, following up on what you find through Google or on random anecdotes probably has around a 99.9% chance of being wrong.
— Bryan · Oct 20, 04:34 PM · #
Bryan, it’s hard to say whether you should follow the advice of a study that you know is 90% likely to be wrong. I’d have to ask the cost of following that advice versus the predicted benefit, compared with the cost-benefit of your alternative, and the number of alternatives.
If you told me that your test revealed that I had cancer, but that your test is wrong 90% of the time for both false positives and negatives, I guess the logical conclusion is that there is a 10% chance that I have cancer and a 90% chance that I don’t. What I do with that information depends on the cost benefits. (And the answer would be to get a more accurate test, if available).
— J Mann · Oct 20, 04:39 PM · #
Great post Jim. Hal Lewis agrees, sometimes the problem is the problem:
http://wattsupwiththat.com/2010/10/16/hal-lewis-my-resignation-from-the-american-physical-society/#more-26117
— Arminius · Oct 20, 04:41 PM · #
“If you told me that your test revealed that I had cancer, but that your test is wrong 90% of the time for both false positives and negatives, I guess the logical conclusion is that there is a 10% chance that I have cancer and a 90% chance that I don’t.”
To be pedantic, your “logical conclusion” is wrong. For a diagnostic test with known false positive/false negative rates, the probability that you have a disease given a positive test depends on the frequency the disease occurs in the underlying population.
See:
http://www.math.hmc.edu/funfacts/ffiles/30002.6.shtml
for a quick explanation of why this is the case.
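A quick sketch with the numbers from the comment quoted above; the 1 percent prevalence is a made-up figure purely for illustration:

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = prevalence * sensitivity               # diseased, tests positive
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy, tests positive
    return true_pos / (true_pos + false_pos)

# A test that is wrong 90% of the time in both directions, as in the
# comment above, applied to a hypothetical disease with 1% prevalence:
print(posterior_given_positive(0.01, sensitivity=0.10, specificity=0.10))
# about 0.001 -- nowhere near the 10% the raw error rate suggests

# Even a test that is right 90% of the time both ways yields only ~8%:
print(posterior_given_positive(0.01, sensitivity=0.90, specificity=0.90))
# about 0.083
```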
— ratufa · Oct 21, 02:53 PM · #
The statin Lipitor was marketed as a wonder drug based on a Scandinavian study where the Lipitor group had a 30% lower total death rate (heart attacks, strokes, cancer, lightning strikes, car accidents, and ski jumping catastrophes) over 5 years than the control group getting placebos.
Is there something about long term trials that makes randomized tests not work as well? Do smarter people tend to figure out they’re getting the placebo and thus drop out of the sample, leaving only a bunch of people in the control group more likely to screw up and die?
— Steve Sailer · Oct 22, 04:59 AM · #
ratufa / J Mann / Bryan
The definition of “wrong” under discussion from the article (as the article clearly stated) was not that within-study analysis estimated that some test or treatment had some likelihood of success or failure, but rather that a given study should be rejected in toto as being inadequate for consideration because of biased design or execution. So a more laborious restatement of what I said would be more like: if faced with making treatment decisions for a sequence of ten ailments, and if in possession of the results of 10 large-scale randomized trials, each of which evaluated at least one potential treatment of interest for one of these ailments, and if I knew that 9 of these 10 RCTs were “valid” and one was “invalid” but I didn’t know which one, and if faced with two possible alternatives: (1) incorporate the information produced by all 10 trials into my decision logic, acting as if all 10 are valid knowledge – which as all of you say will have various downstream decision logic complications – or (2) ignore the information produced by all 10 trials, then I would choose alternative (1). Alternatively, if I knew that 8 of these 10 produced invalid knowledge, then I would choose alternative (2).
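A toy version of that restatement (all numbers illustrative; I’m treating an “invalid” trial as one that points to the wrong treatment, and ignoring the trials as a coin flip per ailment):

```python
def expected_correct(n_trials, n_invalid, follow_trials):
    """Expected number of correct treatment choices across n_trials ailments."""
    if follow_trials:
        # A valid trial identifies the right treatment; an invalid
        # (biased) one points to the wrong treatment.
        return (n_trials - n_invalid) * 1.0
    return n_trials * 0.5  # ignore all trials and guess on each ailment

# 1 invalid trial out of 10: use the trials (alternative 1).
assert expected_correct(10, 1, True) > expected_correct(10, 1, False)  # 9.0 > 5.0
# 8 invalid trials out of 10: ignore them (alternative 2).
assert expected_correct(10, 8, True) < expected_correct(10, 8, False)  # 2.0 < 5.0
```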
Steve,
It’s a great question, and something I’ve spent some time looking into. Modern, large-scale trials go to seemingly ridiculous lengths to avoid exactly this problem, but I suppose it’s impossible to know whether some people have pierced the blinding. You can do things like pre-match test and control subjects and eliminate test individuals from the analysis as their “matched” controls exit, but of course you are then relying on matching and have lost a lot of the power created by randomization.
The bigger driver is probably that over very long periods of time signal-to-noise degrades: very long-term treatments rarely have the kind of massive “signal” that, say, streptomycin does; compliance issues become more significant; drop-outs become a larger portion of the test and control populations, exacerbated by bias in who drops out (even if not conscious, as in your question); and people die or get sick for all kinds of other environmental and co-morbidity reasons, which brings down sample size and creates yet another source of potential bias between test and control.
— Jim Manzi · Oct 22, 07:46 AM · #