Still on Strike Against Campaign Blogging: Why It’s Plausible that Epistatic Interactions Are Complex and Central

I’ve previously blogged here about an article I wrote for National Review that argued that we lack the ability to reduce most mental states to genetic causes reliably using Genome-Wide Association Studies (GWAS). One important reason (I argued) that this is so hard is that genes interact (this is termed epistatic interaction), so the computational problem becomes daunting. If you read through about the first 20 or so comments, you can see a very collegial exchange between Razib and me on the question of how severe epistatic interactions really are.

At one extreme, if genes do not interact at all, and each individual gene has a simple linear relationship to the outcome, then barring other complexities, a GWAS should work fine. One could then represent the relationship between a vector of genes and the outcome essentially in the form of a linear equation: A 1 X 1 + A 2 X 2 + …+ A N X N , where you have N relevant genes, A i = a constant and X i = the gene state for the i th gene. At the other extreme, if each possible combination of genes has an impact on the physical outcome that bears no discernible relationship to any other combination of genes, then a GWAS is unlikely to determine causality for any outcome that depends on many genes. There is no functional form that can represent this relationship; you would just have a huge lookup table that provides the value of the outcome for all possible combinations of relevant gene states. Intermediate conditions between these two extremes would be defined by some structure that is more complex than purely linear relationships with no interaction terms.

Here’s why I think it is plausible (thought this is hardly a proof) that the nature of genetic evolution indicates that the actual functioning of the human genome is closer to the complex end of this spectrum than the simple end. (The following will only make sense if you read through my long post that describes how the genetic operators of selection, crossover and mutation work, using the example of somebody needing to find the combination of switch settings that maximizes output at a chemical plant).

Now assume that there is a second, identical chemical plant next door to this one. A different person, who happens to have a chemical engineering degree, walks into this one and is tasked with finding the best possible combination of switch settings. He reads the label for each switch, scribbles some calculations in his notepad, and confidently flips each switch to a specific setting and announces that the has the plant operating at maximum output.

A third identical plant sits next to this one. A third person walks into this one, and simply starts randomly flipping switches as fast as he possibly can, and recording the output for each tested combination.

Who will get the plant operating at the highest throughput fastest? Well, if the chemical engineer is right, he will win, but if he’s wrong, he’s likely to come in last if the other two guys get to experiment for a while. The first guy is likely to win if there is some underlying structure to be found between various combinations of switches. But if there is no structure to be found, then all of the calculations that the first guy does between flipping switches will slow him down, with no offsetting benefit, and the third guy has the best odds of getting the best solution fastest.

In the AI business, a search algorithm that makes stronger assumptions that let it home in on the right answer faster if the assumptions are right, but tends to hose you if the assumptions are wrong, is termed “greedy”. The second guy is using the limit case of greediness: he simply asserts the answer with no experimental testing. The third guy is using the limit case of non-greediness: he simply tries random combinations. The first guy is using an intermediate case, termed, not shockingly a “genetic algorithm” (GA).

There are other intermediate cases that are more or less greedy than a GA. If, for example, a fourth guy assumed not only that there is some underlying structure, but that this structure involved no interaction terms (i.e., no epistatic interactions), then he could use a search algorithm that would find the best answer much, much faster than the GA. There is a huge sub-field of AI devoted to developing and testing many classes of search algorithms with various levels of greediness, and attempting to demonstrate the types of problems for which each method is more or less appropriate. In this context, GAs are considered to be a very non-greedy algorithm, appropriate for cases in which the underlying structure of the data is highly complicated and/or opaque to the investigator.

Now imagine that we have a line of 1,000,000 identical chemical plants. A different person is escorted into each plant and given the task of finding the best switch setting as fast as possible. Some would use very greedy algorithms, some less so. Some would use GAs, and others would use other approaches. What would determine which algorithm would win would be the degree to which there is some underlying structure in the data. Even if there were a simple structure that just happened to be opaque to the vast majority of observers, a small number of the many tested algorithms would find the structure.

In this way, operating at a higher level of abstraction, this million-plant experiment would be random search of the space of possible algorithms. We could in fact imagine that a super-agent sitting on top of these million experiments could start to combine algorithmic results in various ways, some of which might be analogous to a GA-like evolving set of rules for combining estimates from different algorithms. This group of a million plants might just be one of a million groups of a million plants each, which could then be combined using analogous methods, and so on. We could call this “turtles all the way up”.

We could call this process of competing algorithms struggling to find the best solution as fast as possible “meta-evolution”. That is, each potential search method must compete for survival. The fact that the algorithm that has won this (idealized) competition in the real world has the form of a GA seems to indicate that there is some structure to the relationship between gene vectors and physical outcomes, but that it is much more complex that simple linear combinations without interaction terms, otherwise nature never would have evolved the evolutionary algorithm with all of its computational overhead. If epistatic interactions were not central, meta-evoltuion should have killed off evolution as we know it a long, long time ago.

UPDATE: See a very informed discussion of this by Razib at GNXP.