## Something awesome

May 4, 2014 in Jason's Blog

This is hilarious. Thanks Ivan.

December 18, 2013 in Jason's Blog, Labwide Announcements

Svante Pääbo's lab has just published a detailed analysis of the newer high-resolution Neanderthal genome sequences! This is very exciting and worth a look; someone, maybe me, should present this paper in lab meeting in January.

See the paper here.

August 14, 2013 in Jason's Blog

Friend of the lab Jesse Bloom is featured in this nice Scientific American article about how individually neutral mutations can interact to be non-neutral together. This is why coevolution matters in medical genomics, and is a motivating factor in our work on modelling coevolution and in testing it's possible impact on modulating the effect of human-disease-causing mutations present in other species (in @shams thesis). This is a nice blog post and is worth a few minutes to read.

June 12, 2013 in Jason's Blog

@nathan has been working on a Perl package to simulate sequence evolution under a continuous-time Markov chain. So far we are only considering a single branch in a phylogenetic tree, which is fine for now. This work will connect with @aaron's work soon, where we will compare the expected number of substitutions (and the estimated variance) for each substitution type to counts that we observe after running a series of simulations.

@nathan is currently producing some nice graphs that demonstrate two fundamental principles of sequence evolution (and of the dynamics of Markov chains):

1) The more time that has elapsed (i.e., the longer the branch), the greater the probability of "multiple hits" at the same position. In other words, longer branches lead to a greater chance that multiple substitution events will have occurred at the same position. A consequence of this is that if we simply count differences between ancestral and descendent sequences, we will be underestimating how many changes actually occurred.

2) The actual number of simulated substitution events along a branch should be linear with branch length. Since we are simulating substitution histories directly, we can keep track of the number of events easily. If the rates matrix of the CTMC model is scaled so that branchlength refers to the expected number of substitutions per site, the relationship between branchlength and the average number of actual events should be 1:1.

1) The simulation works by first drawing a random sequence from the stationary distribution of the CTMC. The script also supports drawing a starting sequence from some other distribution, which we will explore later.

2) For each position in the starting sequence, simulate evolution under the CTMC along a given branch length. We terminate the simulation when we hit or exceed the end of the branch:

a) Given a branch length , set the "current time" .

b) Given the current state , simulate a time to the next event, . For a CTMC, the distribution of "waiting times" follows an exponential distribution having a rate parameter , where . We use a Perl library to sample a time from such an exponential distribution.

c) if then STOP. Otherwise set .

d) Draw a new state from the discrete distribution given by .

e) Call the new state, , and go to step b.

The simulation thus ends when the branch length is exceeded, and the final state at the end of the branch is the last state sampled.

Sampling both the initial sequence and the new state conditional on the old state can be achieved using a simple algorithm that calculates the inverse CDF by summing and normalizing. It turns out that this algorithm is known as the Roulette wheel algorithm, and has a worst-case time complexity of . There exists, however, a better algorithm which @nathan found out about: Vose's alias algorithm. We will be implementing this algorithm later on. Although it's setup time is slightly longer than the roulette wheel, it can sample random variates in a fixed time. Some example MATLAB code for the alias method can be found here. Some C++ code is here.

It is also worth mentioning that the "standard method" for simulating sequence evolution under a CTMC involves instead calculating the time-dependent substitution probabilities, , by exponentiating the rates matrix: . This is a computationally expensive matrix calculation (@nathan is almost done implementing a Perl version of this algorithm). By calculating the matrix, one can simulate (sample) a state at the end of the branch *without simulating the process*. This method will be slower, and perhaps much slower, than the method described above. This is especially true when the matrix exponential computation is large, such as for models with a large state space (including codon substitution models, Markov-modulated Markov models, and models that consider pairwise coevolution and site interdependence).

- @nathan will be implementing a "mutation-selection" type model soon, which separately accounts for fitness constraints and mutational preferences. We will be using this model to explore the relationship between fitness, observed frequency (in a sequence alignment), and dwell time (the amount of time spent in each state along the tree). This work will be fundamental to understanding the pros and cons of different approaches for estimating fitness values for variant impact prediction.

- @nathan and @aaron will be working together soon to compare analytical and simulation results, focusing on how well the mean and variance (see above) characterize the CTMC

- @jason will be contributing code that calculates similar quantities but assuming a small number of substitutions occur in a small period of time. @aaron will be using this code to better quantify information loss and error using this approach compared to using the "full" CTMC

June 10, 2013 in Jason's Blog

@aaron and I have started looking at the work presented in the math-heavy paper by Minin and Suchard (2008) "Counting Labeled transitions in continuous-time Markov models of evolution". J. Math. Biol. 56:391-412.

The paper is significant as it presents several unifying developments that allow the "factorial moments" of a continuous-time Markov chain (CTMC) to be calculated along a branch. This is useful because it is a "simulation free" way to determine statistics of substitution histories, and thus can be developed into a framework for statistical inference (which they do in other papers). The two questions we are most interested in are:

**1) How efficient is this approach compared to simulation**; and

**2) How much information and precision do we lose by assuming that short intervals of time have no more than one substitution per site?**

The latter question is important because some of our lab's core methodology for rapidly performing Bayesian MCMC works by conditioning on a small number events per short time interval (see below).

The methods presented allow the expected number of substitution events of each type to be estimated using the restricted factorial moments of the CTMC. Some of the progress they made was achieved by assuming reversibility of the underlying process, which is a common but not always biologically-realistic assumption. Their results nicely generalize previously presented results from my postdoctoral lab and others (including notably from Asger Hobolth and Adam Siepel). For example, their approach generalizes results from our Krishnan et al. (2004), paper.

The main expression we are interested in is their calculation of the expected number of substitutions of a particular type (e.g., ). This can be estimated using:

where . is a matrix with zeros everywhere, except at the th element, which is one. is the instantaneous rates matrix for the state transitions of interest, the set (discussed below), and is the number of states in our evolutionary model (e.g., 4 for nucleotide substitution). In addition:

if and

if , where is a vector of the real eigenvalues for .

The set can be the entire set of transitions, or some subset such as only non-synonymous substitutions for a codon substitution model (64x64 state space). In our experiments, we will likely want to calculate the expected number of substitutions for **each transition type** separately.

The spectral decomposition required to calculate and can be performed in time (e.g., de Koning et al., 2010).

The calculation of for all can be performed in time, since any requires two matrix multiplications, each, for each .

The calculation of is for a given set of transition types. If my current understanding is correct, this will have to be done for to get results for each transition-type separately, and thus requires a staggering operations when performed in this way.

Even worse, in the context of a phylogenetic model, these computations must be performed along each branch of the tree. Thus **the total time complexity** is minimally for branches (retaining the lower-order terms for transparency).