Profile photo of shams

by

Tree Annotator Thingy

August 13, 2014 in Labwide Announcements, Shams' Blog

It's 11:55 PM on a hazy Sunday night. The snow begins to fall over a slumbering Calgary. Only you, a handsome, dashing young scientist, is awake. Surrounded by nothing but darkness, protected only by the glow of your computer screen, you are in HMRB 253 working on getting your results presentable to impress your supervisor tomorrow morning. For the past month, you've been categorizing the variants that impact the Cheesy virus. You've also looked at the conservation of variants across a set of vertebrate species. You want to have a phylogenetic tree that shows your supervisor the distribution of amino acids in your position of interest.

So you have a tree that looks something like this: ENST00000003084-rs1800123.tree
What you want is something that look like this: ENST00000003084-rs1800123-state.tree

Lucky for you, the former Shams Bhuiyan wrote a script to do just that! BOOM, your butt is saved. Here's what you'll need:

  1. A Newick tree of your phylogeny. Something that looks a lot like thisAn example newick format tree for the all vertebrate tree I displayed above
    An example Newick format tree for the all vertebrate tree I displayed above
  2.  You will also need an input file of the variants you're interested in. The columns should be in this order: alignment file name, variant position in alignment, the variant you're interested in, and a variant identifier (e.g rsID);
  3. As I am sure you have figured out by now, you will also need to give the script a path to a folder with all your sequence alignments (fasta format please)
  4. And of course, the actual script (findstate.pl)! This is located in the Dropbox:de Koning Lab (1)/Public/Scripts/Shams/

All of these inputs need to be hardcoded into the script. I've tried my best to make it obvious where you hard code these inputs in. You will also need to change the array @Wanted_Species to whatever your species list  Perhaps some day I will turn them into command line arguments.

How it works:

  1. Open and read the alignment file corresponding to a given variant
  2. For a given species in the alignment file, record the amino acid state for the position of interest.
  3. Find the given species node on the tree and append the node with the amino acid state.
  4. After all the nodes have been appended, output a newick format tree that's titled [alignment file name]-[variant identifier].tree in directory trees
  5. Output a nexus format tree that's title [alignment file name]-[variant identifier].tree in directory nexus

There are two important things to note about the script! These are specific towards PAML reconstructions:

    1. The script only appends ancestral information as far up to the lowest common ancestor. Consider the tree mentioned earlier. It has 57 species as leaves. However, not all alignment files will have all 57 species. As such, not all ancestral reconstructions will have reliable data for all the ancestral node. The script considers all the species that are given by sequence data and uses BioPerl to draw upon what is the lowest common ancestor.
    2. Ancestral nodes where all descendants are gapped will also have gaps. Ancestral reconstructions will not have gaps. However, sometimes their children will have gaps in the alignment at the position of interest. As a result, the script uses a post-order traversal of the tree to check whether or not any ancestral node have only gapped children. If both children are gapped, then the ancestral node will be appended as gapped as well. That probably was confusing, so hopefully the following example illustrates it best:
So as you can see the ancestral node has a gap when both of its descendants have gaps at that position as well. However when only one child has a gap and the other has an amino acid state for that position, then the node will be appended based on the ancestral reconstruction for that node.

So as you can see the ancestral node has a gap when both of its descendants have gaps at that position as well. However when only one child has a gap and the other has an amino acid state for that position, then the node will be appended based on the ancestral reconstruction for that node.

I think that pretty much covers the overall framework of the program. The script itself has a lot of documentation of what's going on in each section that you can keep your hair well-sculpted. Feel free to ask me questions!

Leave a reply

Your email address will not be published. Required fields are marked *

Skip to toolbar