Case 5

Although the activities of a cell are largely mediated by proteins, cells also use and produce a whole range of "small" molecules¹ known as metabolites. The host of metabolites in a cell is known collectively as the metabolome, and the study of that is metabolomics².

Now, cells are scarily complicated assemblies of self-managing machinery, and while we have a good idea how many individual pieces work, a lot of the processes are understood in only the sketchiest ways. Metabolomics constitutes one of a number of different perspectives on the problem, casting the cell as a metabolite factory³ -- a somewhat limited view, but one that can add useful pieces to the jigsaw puzzle.

Determining the functions of genes, as discussed before, is a hit-and-miss affair, typically done by knocking genes out one by one⁴ and seeing what happens. Often the functional consequences of a deletion are difficult to identify, and this is where metabolomics can help.

About three quarters of the genes in E. coli have had (at least some of) their functions identified by the deletion technique, and half of those are involved in metabolic control, so it seems plausible⁵ that a similar proportion of the unknown genes may have metabolic consequences. If it were possible to get a decent measure of a gene's effect on the metabolome, that could go some way toward working out what the gene is for⁶. As it turns out, it is possible, to some extent.

The problem has a number of components. To begin with, there's a whole biochemical obstacle course to be negotiated in order to culture the bacteria and extract the metabolome from in amongst all the other stuff down there -- extracellular material, membrane, all those clumsy big molecules -- over which let us draw a discreet veil.

The resulting soup of molecules then needs to be separated, identified and measured. This is no mean feat even once -- there are hundreds of different metabolites -- but to provide useful information it has to be done a lot of times, so the process mustn't be too slow.

The chosen technique in this case is⁷ capillary electrophoresis: the solution is drawn through a very fine glass tube by an electric field. The field affects the molecules differently according to their charge and size, so different species wind up going through the tube at different times, although the separation isn't absolute. En route, their absorption spectra are measured over a range from visible to near-UV.
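To make the charge-and-size dependence concrete, here is a back-of-envelope sketch. It is a textbook idealisation, not the lab's actual model: each molecule is treated as a charged sphere whose speed is set by the electric force balancing Stokes drag, and electro-osmotic flow (which matters a good deal in real capillaries) is ignored. The tube length, voltage, radii and charges are all made-up illustrative values.

```python
import math

ELEMENTARY_CHARGE = 1.602e-19  # coulombs (approximate)
WATER_VISCOSITY = 1.0e-3       # Pa*s, roughly water at room temperature (assumed)

def mobility(charge_number, radius_m, viscosity=WATER_VISCOSITY):
    """Mobility of a charged sphere: electric force balanced against Stokes drag."""
    charge = abs(charge_number) * ELEMENTARY_CHARGE
    return charge / (6 * math.pi * viscosity * radius_m)

def migration_time(charge_number, radius_m, length_m=0.5, voltage=30_000.0):
    """Seconds to travel the tube, assuming a uniform field and no bulk flow."""
    field = voltage / length_m  # volts per metre
    speed = mobility(charge_number, radius_m) * field
    return length_m / speed

# Hypothetical species: (charge number, radius in metres)
species = {
    "small, doubly charged": (2, 0.3e-9),
    "small, singly charged": (1, 0.3e-9),
    "large, singly charged": (1, 0.6e-9),
}
arrival = {name: migration_time(z, r) for name, (z, r) in species.items()}
order = sorted(arrival, key=arrival.get)
for name in order:
    print(f"{name}: {arrival[name]:.0f} s")
```

In this toy model, doubling the charge halves the transit time and doubling the radius doubles it, which is the whole basis of the separation. Real molecules with similar charge-to-size ratios arrive together, hence "the separation isn't absolute".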

The output from this is not, alas, a neat list of all the metabolites and their quantities but a whopping great table of absorption coefficients for different wavelengths over time. Getting from the latter to the former is another lengthy saga that we'll skip breezily over with only the following notelets: spectra are mapped into a space of known absorptions whose basis is neither orthogonal nor spanning; fractal compression is used to reduce the data size; efficient lookup algorithms are possible using differences from the basis, but there isn't yet a complete spectral library in which to look.
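For a flavour of what mapping onto a non-orthogonal basis involves, here is a minimal sketch. It assumes only that absorbances of a mixture add, and it uses plain least squares, which is my stand-in and not necessarily what the real pipeline does. The two "known" spectra and the function name `unmix` are invented for illustration.

```python
# Hypothetical absorption spectra of two known metabolites at four wavelengths.
# The numbers are made up; they do not come from any real spectral library.
SPECTRUM_A = [1.0, 0.8, 0.2, 0.0]
SPECTRUM_B = [0.0, 0.4, 0.9, 1.0]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def unmix(measured, spec_a, spec_b):
    """Least-squares amounts of two overlapping (non-orthogonal) spectra."""
    # Normal equations: [[aa, ab], [ab, bb]] . [amount_a, amount_b] = [am, bm]
    aa, ab, bb = dot(spec_a, spec_a), dot(spec_a, spec_b), dot(spec_b, spec_b)
    am, bm = dot(spec_a, measured), dot(spec_b, measured)
    det = aa * bb - ab * ab  # non-zero unless the two spectra are proportional
    amount_a = (am * bb - bm * ab) / det
    amount_b = (bm * aa - am * ab) / det
    return amount_a, amount_b

# A mixture of 2 parts A and 0.5 parts B, plus nothing else.
mixture = [2.0 * a + 0.5 * b for a, b in zip(SPECTRUM_A, SPECTRUM_B)]
amount_a, amount_b = unmix(mixture, SPECTRUM_A, SPECTRUM_B)
print(round(amount_a, 6), round(amount_b, 6))
```

Because the spectra overlap, you can't just project onto each one separately; the cross term `ab` has to be accounted for. And because the basis doesn't span, a real measurement will generally leave a residual that no combination of known spectra explains.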

Having gone through all that, the real question becomes: how can we relate the metabolome measurements to gene function?

Metabolites participate in convoluted sequences of chemical reactions known as metabolic pathways. That, in ateleological essence, is what they're for. It's where they come from and where they're going. The rates of those reactions depend on the concentrations of the reactants, but also on the presence of appropriate proteins, serving as catalysts, and hence on the genes. Which -- oh, it sounds so simple! -- is how the genome controls metabolism.

If we abstract away all the cellular machinery, we can view the cell metabolomically as a network of transformations of small molecules. Each metabolite is connected to others by the reactions they have in common and each reaction has an associated rate. These rates, and the consequent chemical concentrations, are constrained by basic conservation laws: atoms can't magically appear or disappear. The network is dynamic -- always in flux -- but settles overall into a steady state where the reactions balance out⁸, and the cell hums along harmoniously.
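A toy version of that picture, as a sketch only: a made-up three-reaction pathway (source to A, A to B, B to sink) with simple mass-action rates, nothing like a real metabolic map. The stoichiometry table encodes the conservation constraint, and the steady state is where production and consumption of each metabolite cancel.

```python
# Toy pathway (hypothetical): source -> A -> B -> sink
# Rows are metabolites (A, B); columns are reactions (uptake, A->B, B->sink).
STOICHIOMETRY = [
    [1, -1, 0],   # A: made by uptake, consumed by A->B
    [0, 1, -1],   # B: made by A->B, consumed by B->sink
]

def rates(conc, v_in, k1, k2):
    """Reaction rates: constant uptake, then mass-action steps (assumed kinetics)."""
    a, b = conc
    return [v_in, k1 * a, k2 * b]

def net_change(conc, v_in, k1, k2):
    """Stoichiometry times rates: the net rate of change of each metabolite."""
    v = rates(conc, v_in, k1, k2)
    return [sum(s * r for s, r in zip(row, v)) for row in STOICHIOMETRY]

def relax_to_steady_state(v_in, k1, k2, dt=0.01, steps=20000):
    """Crude Euler integration from an empty cell until things settle."""
    conc = [0.0, 0.0]
    for _ in range(steps):
        change = net_change(conc, v_in, k1, k2)
        conc = [c + dt * dc for c, dc in zip(conc, change)]
    return conc

steady = relax_to_steady_state(v_in=1.0, k1=2.0, k2=0.5)
residual = net_change(steady, v_in=1.0, k1=2.0, k2=0.5)
print([round(c, 4) for c in steady])
```

At the end, every reaction is still running at the same non-zero rate, but the concentrations have stopped changing: in flux, yet steady.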

If we modify the black box mechanics of the network by deleting a (metabolically relevant) gene, there will be some corresponding change in the reaction rates, leading to a change in the steady state. The concentrations of the metabolites -- the stuff we can measure -- will be different.
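Here is a sketch of that effect on an invented branched network: A is fed at a constant rate and drains down two routes, to B (needing enzyme 1) and to C (needing enzyme 2), each of which is then cleared. The rate constants, the mass-action kinetics and the idea that a deletion simply zeroes one rate constant are all simplifying assumptions of mine.

```python
def steady_state(v_in, k1, k2, k3, k4):
    """Steady-state concentrations for the toy network:
    source -> A (rate v_in), A -> B (k1), A -> C (k2), B -> sink (k3), C -> sink (k4).
    """
    a = v_in / (k1 + k2)   # influx balances total drain from A
    b = k1 * a / k3        # production of B balances its clearance
    c = k2 * a / k4        # likewise for C
    return {"A": a, "B": b, "C": c}

# Wild type: both branches equally active (hypothetical values).
wild_type = steady_state(v_in=1.0, k1=1.0, k2=1.0, k3=1.0, k4=1.0)

# Deletion of the gene for enzyme 1: the A -> B reaction stops occurring.
knockout = steady_state(v_in=1.0, k1=0.0, k2=1.0, k3=1.0, k4=1.0)

changes = {m: knockout[m] - wild_type[m] for m in wild_type}
print(changes)
```

No new reaction appears; the existing flux just reroutes. B vanishes, while A and C both rise, and the total outflow still matches the inflow.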

Of course, turning the measured concentrations into reaction rates -- reconstructing the perturbed network from experimental data -- is a rather hairy inverse problem. It would probably be impossible were it not that the actual pathways have already been mapped, so we know pretty much what the network topology must be. Deleting a gene is never going to add new reactions: it's only going to rebalance the ones already there, perhaps in some cases reducing them to non-occurrence; and the conservation constraints must always remain.

Identifying the location of the perturbation -- which is to say, the principal effects of deleting the gene -- uses a technique called co-response analysis, based on determining the response and control coefficients (basically, the sensitivities of the concentrations to one another and to the reaction rates, calculated as partial derivatives) of different units (sub-nets) of the overall network.
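As a sketch of the ingredients, and assuming the usual metabolic control analysis definitions as I understand them: the response coefficient of a concentration X to a parameter p is the log-log derivative d ln X / d ln p, and a co-response coefficient is the ratio of two such responses to the same perturbation. The branched toy network and its rate constants below are invented, and the derivatives are estimated by finite differences instead of analytically.

```python
import math

def steady_state(params):
    """Toy network: source -> A, A -> B (k1), A -> C (k2), B -> sink (k3), C -> sink (k4)."""
    a = params["v_in"] / (params["k1"] + params["k2"])
    return {
        "A": a,
        "B": params["k1"] * a / params["k3"],
        "C": params["k2"] * a / params["k4"],
    }

def response_coefficients(params, name, h=1e-4):
    """Estimate d ln(X) / d ln(p) for each metabolite by a central difference."""
    up = dict(params)
    up[name] = params[name] * math.exp(h)
    down = dict(params)
    down[name] = params[name] * math.exp(-h)
    high, low = steady_state(up), steady_state(down)
    return {m: (math.log(high[m]) - math.log(low[m])) / (2 * h) for m in high}

# Hypothetical wild-type parameters.
params = {"v_in": 1.0, "k1": 1.0, "k2": 1.0, "k3": 1.0, "k4": 1.0}

responses = response_coefficients(params, "k1")
co_response_a_b = responses["A"] / responses["B"]
print({m: round(r, 4) for m, r in responses.items()}, round(co_response_a_b, 4))
```

The appeal of the ratio is that it can be formed from measured concentrations alone, without knowing how hard the perturbation actually hit: here a nudge to enzyme 1 moves A and B in opposite directions by equal proportions, whatever the size of the nudge.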

This is considerably more analytic and considerably less statistical than I was expecting, and it seems at first sight that there are a number of inadequately justified steps to the argument, but I am assured that the technique has been comprehensively tested against a variety of known and unknown data sources and can persuasively locate network perturbations in both real and simulated data sets with a high degree of confidence.

1 All molecules are small, obviously, but some are considerably smaller than others. The big ones, as far as biology is concerned, are the polymers -- proteins and nucleic acids -- which can contain millions of atoms. Pretty much everything else is considered "small".
2 If you're thinking that's a pig-ugly word, well, you're not alone. It's also arguably a pretty artificial division, but we humans do cherish our territorial boundaries.
3 Other points of view include proteomics (the cell is a protein factory), genomics (it's a gene factory, which is tantamount to being a cell factory) and, with a name clearly invented just to outdo metabolomics in the ugly stakes, transcriptomics (an RNA factory). All of these assertions are, of course, absolutely true -- in a shadows-on-the-cave-wall kind of way.
4 In this context, that means physically changing the DNA rather than suppressing its products as in the earlier RNA interference discussion. The two techniques work to different ends. Besides, the organism under consideration today will be E. coli, which is a prokaryote and thus lacks the eukaryotic machinery exploited by RNAi. (There have been reports of an analogous process in prokaryotes, but I don't think it has been put to equivalent use.)
5 Whether a gene's function is known is obviously not a random variable, so there's no basis for making anything but the most shoulder-shrugging claims about this.
6 #include <stddisclaimer.h>
7 Mass spectrometry is also used, to cross-check the electrophoresis results, but it doesn't scale: mass spectrometers are bulky and extremely expensive, so it's not feasible to run hundreds of them in parallel; capillaries, on the other hand, are cheap as chips.
8 Approximately. One of the potential issues with this whole scheme is the steady state assumption, but I'm told it is biologically relevant.