by shams

Tree Annotator Thingy

August 13, 2014 in Labwide Announcements, Shams' Blog

It's 11:55 PM on a hazy Sunday night. The snow begins to fall over a slumbering Calgary. Only you, a handsome, dashing young scientist, are awake. Surrounded by nothing but darkness, protected only by the glow of your computer screen, you are in HMRB 253 working on getting your results presentable to impress your supervisor tomorrow morning. For the past month, you've been categorizing the variants that impact the Cheesy virus. You've also looked at the conservation of variants across a set of vertebrate species. You want a phylogenetic tree that shows your supervisor the distribution of amino acids at your position of interest.

So you have a tree that looks something like this: ENST00000003084-rs1800123.tree
What you want is something that looks like this: ENST00000003084-rs1800123-state.tree

Lucky for you, the former Shams Bhuiyan wrote a script to do just that! BOOM, your butt is saved. Here's what you'll need:

  1. A Newick tree of your phylogeny. Something that looks a lot like this:
    An example Newick format tree for the all-vertebrate tree I displayed above
  2. You will also need an input file of the variants you're interested in. The columns should be in this order: alignment file name, variant position in the alignment, the variant you're interested in, and a variant identifier (e.g. rsID).
  3. As I am sure you have figured out by now, you will also need to give the script a path to a folder with all your sequence alignments (FASTA format, please).
  4. And of course, the actual script! It is located in the Dropbox: de Koning Lab (1)/Public/Scripts/Shams/

All of these inputs need to be hardcoded into the script. I've tried my best to make it obvious where to hardcode them. You will also need to change the array @Wanted_Species to whatever your species list is. Perhaps some day I will turn them into command line arguments.
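
To make the variant input file from item 2 concrete, here is a minimal Python sketch of reading it. This is not the script itself (which is written with BioPerl); the whitespace delimiter, the `.fasta` extension, and the position and residue in the example row are all assumptions made for illustration. Only the transcript and rsID come from the example tree above.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    alignment_file: str  # column 1: alignment file name
    position: int        # column 2: variant position in the alignment
    variant: str         # column 3: the variant you're interested in
    identifier: str      # column 4: variant identifier (e.g. rsID)


def parse_variants(lines: Iterable[str]) -> List[Variant]:
    """Parse rows of four whitespace-separated columns (delimiter is assumed)."""
    variants = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        aln, pos, var, ident = fields
        variants.append(Variant(aln, int(pos), var, ident))
    return variants


# Hypothetical row: position 123 and residue A are made up.
example_rows = ["ENST00000003084.fasta\t123\tA\trs1800123"]
parsed = parse_variants(example_rows)
```

Whether positions count from 0 or 1 is not stated in the post, so check the comments in the script before trusting either.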

How it works:

  1. Open and read the alignment file corresponding to a given variant.
  2. For a given species in the alignment file, record the amino acid state at the position of interest.
  3. Find that species' node on the tree and append the amino acid state to the node.
  4. After all the nodes have been annotated, output a Newick format tree titled [alignment file name]-[variant identifier].tree in the directory trees.
  5. Output a Nexus format tree titled [alignment file name]-[variant identifier].tree in the directory nexus.
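
Steps 1 to 4 can be sketched in a few lines of Python. Again, this is only an illustration under assumptions, not the actual script: the `_` separator between species name and state, the 1-based position, the toy three-species alignment, and the helper names are all hypothetical. It also only handles plain Newick without quoted names.

```python
import re
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Return {species: aligned sequence} from FASTA-formatted text."""
    seqs: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def states_at(seqs: Dict[str, str], position: int) -> Dict[str, str]:
    """Amino acid state per species at a 1-based alignment position (assumed)."""
    return {name: seq[position - 1] for name, seq in seqs.items()
            if len(seq) >= position}


def annotate_leaves(newick: str, states: Dict[str, str], sep: str = "_") -> str:
    """Append each species' state to its leaf name in a Newick string."""
    def tag(match):
        name = match.group(0)
        return name + sep + states[name] if name in states else name
    # A leaf name follows "(" or "," and is followed by ":", "," or ")".
    return re.sub(r"(?<=[(,])[^(),:;]+(?=[:,)])", tag, newick)


def output_name(alignment_name: str, identifier: str) -> str:
    """File name pattern from step 4."""
    return f"{alignment_name}-{identifier}.tree"


# Toy data, not from the post.
fasta = ">human\nMKA\n>mouse\nMRA\n>rat\nM-A\n"
tree = "(human:0.1,(mouse:0.2,rat:0.2):0.3);"
annotated = annotate_leaves(tree, states_at(read_fasta(fasta), 2))
```

Species missing from the alignment are simply left unannotated here, which matches the note below that not every alignment contains every species.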

There are two important things to note about the script! These are specific to PAML reconstructions:

    1. The script only appends ancestral information as far up as the lowest common ancestor. Consider the tree mentioned earlier. It has 57 species as leaves. However, not all alignment files will have all 57 species. As such, not all ancestral reconstructions will have reliable data for all the ancestral nodes. The script considers all the species for which sequence data is given and uses BioPerl to determine their lowest common ancestor.
    2. Ancestral nodes where all descendants are gapped will also have gaps. Ancestral reconstructions will not have gaps. However, sometimes their children will have gaps in the alignment at the position of interest. As a result, the script uses a post-order traversal of the tree to check whether any ancestral node has only gapped children. If both children are gapped, then the ancestral node will be marked as gapped as well. That probably was confusing, so hopefully the following example illustrates it best:

So, as you can see, the ancestral node has a gap when both of its descendants have gaps at that position as well. However, when only one child has a gap and the other has an amino acid state for that position, the node is annotated based on the ancestral reconstruction for that node.
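
Both notes can be sketched on a toy tree in Python. This is a minimal illustration, not the script's code: the script relies on BioPerl for the lowest common ancestor, whereas the `Node` class, the function names, and the toy species and states below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

GAP = "-"


@dataclass
class Node:
    name: str
    state: Optional[str] = None  # leaf: alignment state; internal: reconstruction
    children: List["Node"] = field(default_factory=list)


def leaves_under(node: Node) -> Set[str]:
    """Names of all leaves in the subtree rooted at node."""
    if not node.children:
        return {node.name}
    names: Set[str] = set()
    for child in node.children:
        names |= leaves_under(child)
    return names


def lowest_common_ancestor(node: Node, wanted: Set[str]) -> Node:
    """Deepest node whose subtree still contains every wanted species.

    Assumes wanted is non-empty and every name in it is a leaf under node.
    """
    for child in node.children:
        if wanted <= leaves_under(child):
            return lowest_common_ancestor(child, wanted)
    return node


def propagate_gaps(node: Node) -> Optional[str]:
    """Post-order traversal: children are resolved before their parent."""
    if not node.children:
        return node.state
    child_states = [propagate_gaps(child) for child in node.children]
    if all(state == GAP for state in child_states):
        node.state = GAP  # override the reconstruction: every child is gapped
    return node.state


# Toy tree: anc1 has two gapped children, anc2 has only one.
anc1 = Node("anc1", "K", [Node("A", GAP), Node("B", GAP)])
anc2 = Node("anc2", "R", [Node("C", GAP), Node("D", "R")])
root = Node("root", "K", [anc1, anc2])
propagate_gaps(root)
```

Because the traversal is post-order, an internal node that has just been marked as gapped counts as a gapped child for its own parent, so gaps can climb more than one level when a whole clade is gapped.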

I think that pretty much covers the overall framework of the program. The script itself has a lot of documentation on what's going on in each section, so you can keep your hair well-sculpted. Feel free to ask me questions!

by arnab

Ever wondered what the slowest-evolving organism is?

July 15, 2014 in Labwide Announcements

The honour goes to a relatively little-known and endangered organism called the elephant shark. There is more to this organism, however, than meets the eye. It reveals novel insights about the mechanism of bone formation and the origin of adaptive immunity. Not only that, its relatively small genome size (~1 Gb) also makes it useful for studying the evolution of genes. Want to know more? Why not! Let's discuss this fantastic organism further in our next journal club meeting, scheduled for July 24th. For reference see


by aaron

stochastic tunnelling

July 10, 2014 in Labwide Announcements

Sorry for the delay.

Here's the paper: Genetics 2004 Iwasa. Have a glance; it's pretty straightforward.


I'll let you know the room as soon as I can. July 11 at 1:30.



by arnab

Why Tibetans are adapted to higher altitudes... and learn more about DNA introgression and incomplete ancestral lineage sorting

July 4, 2014 in Labwide Announcements

As modern humans migrated out of Africa, they encountered many new environmental conditions, including greater temperature extremes, different pathogens and higher altitudes. These diverse environments are likely to have acted as agents of natural selection and to have led to local adaptations. One of the most celebrated examples in humans is the adaptation of Tibetans to the hypoxic environment of the high-altitude Tibetan plateau [1-3]. A hypoxia pathway gene, EPAS1, was previously identified as having the most extreme signature of positive selection in Tibetans [4-10], and was shown to be associated with differences in haemoglobin concentration at high altitude. Re-sequencing the region around EPAS1 in 40 Tibetan and 40 Han individuals, we find that this gene has a highly unusual haplotype structure that can only be convincingly explained by introgression of DNA from Denisovan or Denisovan-related individuals into humans. Scanning a larger set of worldwide populations, we find that the selected haplotype is only found in Denisovans and in Tibetans, and at very low frequency among Han Chinese. Furthermore, the length of the haplotype, and the fact that it is not found in any other populations, makes it unlikely that the haplotype sharing between Tibetans and Denisovans was caused by incomplete ancestral lineage sorting rather than introgression. Our findings illustrate that admixture with other hominin species has provided genetic variation that helped humans to adapt to new environments.

by arnab

FORGE Canada, Amer. Jour. Hum. Genet.

June 5, 2014 in Labwide Announcements

Of possible interest:

Inherited monogenic disease has an enormous impact on the well-being of children and their families. Over half of the children living with one of these conditions are without a molecular diagnosis because of the rarity of the disease, the marked clinical heterogeneity, and the reality that there are thousands of rare diseases for which causative mutations have yet to be identified. It is in this context that in 2010 a Canadian consortium was formed to rapidly identify mutations causing a wide spectrum of pediatric-onset rare diseases by using whole-exome sequencing. The FORGE (Finding of Rare Disease Genes) Canada Consortium brought together clinicians and scientists from 21 genetics centers and three science and technology innovation centers from across Canada. From nation-wide requests for proposals, 264 disorders were selected for study from the 371 submitted; disease-causing variants (including in 67 genes not previously associated with human disease; 41 of these have been genetically or functionally validated, and 26 are currently under study) were identified for 146 disorders over a 2-year period. Here, we present our experience with four strategies employed for gene discovery and discuss FORGE’s impact in a number of realms, from clinical diagnostics to the broadening of the phenotypic spectrum of many diseases to the biological insight gained into both disease states and normal human development. Lastly, on the basis of this experience, we discuss the way forward for rare-disease genetic discovery both in Canada and internationally.

by nathan

Journal Fight Club - June 5th

June 2, 2014 in Labwide Announcements, Nathan's Blog

Hello everyone,

My journal fight club talk will be on June 5th at 2pm in room G737. 

Biomed Central link to the article:


- Nathan

Bring it.


by shams

Journal Fight Club for May 27th

May 24, 2014 in Labwide Announcements, Shams' Blog

Hello Everyone,

So my journal fight is scheduled for May 29th at 2 pm in OB1509. The paper of choice is:

Why Human Disease-Associated Residues Appear as the Wild-Type in Other Species: Genome-Scale Structural Evidence for the Compensation Hypothesis.

Pubmed link:

by arnab

Sewall Wright’s Seven Generalizations about Populations

May 22, 2014 in Labwide Announcements

(1)  The variations of most characters are affected by a great many loci (the multifactor hypothesis).

(2)  In general, each gene replacement has effects on many characters (the principle of universal pleiotropy).

(3)  Each of the innumerable possible alleles at any locus has a unique array of differential effects on taking account of pleiotropy (uniqueness of alleles).

(4)  The dominance relation of two alleles is not an attribute of them but of the whole genome and the environment. Dominance may differ for each pleiotropic effect and is in general easily modifiable (relativity of dominance).

(5)  The effects of multiple loci on a character in general involve much nonadditive interaction (universality of interaction effects).

(6)  Both ontogenetic and phylogenetic homology depend on calling into play similar chains of gene-controlled reactions under similar developmental conditions (homology).

(7)  The contributions of measurable characters to overall selective value usually involve interaction effects of the most extreme sort because of the usually intermediate position of the optimum grade, a situation that implies the existence of innumerable different selective peaks (multiple selective peaks).


by chenzhe

Journal fightclub Friday May 23

May 20, 2014 in Labwide Announcements

High performance computing

The topic will be parallel programming paradigms, covering shared-memory (OpenMP), distributed-memory (MPI), and hybrid (OpenMP + MPI) parallelism. NO specific paper will be discussed.

Room G750 is booked for 3pm.

by ivan

Journal fightclub Tuesday May 20

May 15, 2014 in Labwide Announcements

Here is the paper I will be presenting: KODAMA

The paper describes what is essentially a statistical pipeline that includes iterative cross-validation of the results. The authors argue that this produces very accurate (high-confidence) classifications. I suppose one argument against this is that cross-validation may give you a "false sense of security", since you are resampling the same dataset multiple times.

The project website has a tutorial and a few example datasets if you want to play with it:

Please comment to let me know if you want me to just present the paper (more theoretical explanation), or prepare a little run-through tutorial (more practical explanation).

