You are browsing the archive for Script.

Profile photo of shams

by shams

Tree Annotator Thingy

August 13, 2014 in Labwide Announcements, Shams' Blog

It's 11:55 PM on a hazy Sunday night. The snow begins to fall over a slumbering Calgary. Only you, a handsome, dashing young scientist, is awake. Surrounded by nothing but darkness, protected only by the glow of your computer screen, you are in HMRB 253 working on getting your results presentable to impress your supervisor tomorrow morning. For the past month, you've been categorizing the variants that impact the Cheesy virus. You've also looked at the conservation of variants across a set of vertebrate species. You want to have a phylogenetic tree that shows your supervisor the distribution of amino acids in your position of interest.

So you have a tree that looks something like this: ENST00000003084-rs1800123.tree
What you want is something that look like this: ENST00000003084-rs1800123-state.tree

Lucky for you, the former Shams Bhuiyan wrote a script to do just that! BOOM, your butt is saved. Here's what you'll need:

  1. A Newick tree of your phylogeny. Something that looks a lot like thisAn example newick format tree for the all vertebrate tree I displayed above
    An example Newick format tree for the all vertebrate tree I displayed above
  2.  You will also need an input file of the variants you're interested in. The columns should be in this order: alignment file name, variant position in alignment, the variant you're interested in, and a variant identifier (e.g rsID);
  3. As I am sure you have figured out by now, you will also need to give the script a path to a folder with all your sequence alignments (fasta format please)
  4. And of course, the actual script (findstate.pl)! This is located in the Dropbox:de Koning Lab (1)/Public/Scripts/Shams/

All of these inputs need to be hardcoded into the script. I've tried my best to make it obvious where you hard code these inputs in. You will also need to change the array @Wanted_Species to whatever your species list  Perhaps some day I will turn them into command line arguments.

How it works:

  1. Open and read the alignment file corresponding to a given variant
  2. For a given species in the alignment file, record the amino acid state for the position of interest.
  3. Find the given species node on the tree and append the node with the amino acid state.
  4. After all the nodes have been appended, output a newick format tree that's titled [alignment file name]-[variant identifier].tree in directory trees
  5. Output a nexus format tree that's title [alignment file name]-[variant identifier].tree in directory nexus

There are two important things to note about the script! These are specific towards PAML reconstructions:

    1. The script only appends ancestral information as far up to the lowest common ancestor. Consider the tree mentioned earlier. It has 57 species as leaves. However, not all alignment files will have all 57 species. As such, not all ancestral reconstructions will have reliable data for all the ancestral node. The script considers all the species that are given by sequence data and uses BioPerl to draw upon what is the lowest common ancestor.
    2. Ancestral nodes where all descendants are gapped will also have gaps. Ancestral reconstructions will not have gaps. However, sometimes their children will have gaps in the alignment at the position of interest. As a result, the script uses a post-order traversal of the tree to check whether or not any ancestral node have only gapped children. If both children are gapped, then the ancestral node will be appended as gapped as well. That probably was confusing, so hopefully the following example illustrates it best:
So as you can see the ancestral node has a gap when both of its descendants have gaps at that position as well. However when only one child has a gap and the other has an amino acid state for that position, then the node will be appended based on the ancestral reconstruction for that node.

So as you can see the ancestral node has a gap when both of its descendants have gaps at that position as well. However when only one child has a gap and the other has an amino acid state for that position, then the node will be appended based on the ancestral reconstruction for that node.

I think that pretty much covers the overall framework of the program. The script itself has a lot of documentation of what's going on in each section that you can keep your hair well-sculpted. Feel free to ask me questions!

Profile photo of shams

by shams

Scriptin' Adventures: Producing a List of Desired Files

June 12, 2013 in Shams' Blog

When I was in high school, I did a lot of theatre, including writing and directing my own show. The scripting about to discuss is extremely different from the scripts I wrote back then.

So at this point I have written a few Perl scripts and a bash script and I thought I would share them with you, along with the joys and tears I had with them as well as where I sought help from...other than Jason.

Perl Scripts

First, a disclaimer: this is my first time coding in Perl and its also my first exposure to certain programming concepts like regular expression. Also note that I am a lazy programmer so learn more from my mistakes than from my successes. The first script I wrote here was to determine if a FASTA file had enough sequences for our study. Essentially, I had 18, 191 FASTA files in my directory. I wanted only the FASTA files that had 10 or more sequences. My first step was this:

grep -c ">" *.fasta > fastafiles.txt

grep() is a nifty function that grabs what ever you want. The -c tells the computer to count whatever you're grepping, in this case the ">" sign. In FASTA files, this sign is used to indicate the introduction of a new species.

So what does this give me? A text file (created using the > fastafile.txt) that says:

ENSG00000000003.unaligned.fasta:47
ENSG00000000005.unaligned.fasta:50
ENSG00000000419.unaligned.fasta:47
ENSG00000000457.unaligned.fasta:54
ENSG00000000460.unaligned.fasta:39
ENSG00000000938.unaligned.fasta:37
ENSG00000000971.unaligned.fasta:6

18184 more lines of this.

So now I needed a way to grab the file names that had more than 10 sequences. And I can tell you right now that was the wrong way to think about it. In reality, the better way to think of it was to remove the files that had less than 10 sequences (thank you @Nathan for this).  So I ended up with a neat script:

  1. @files = glob("fastafiles.txt");
  2. foreach $file (@files) {
  3.     open(IN, $file);
  4.     while ($line = <IN>) {
  5.     @ducks =split(/:/, $line);
  6.     if (@ducks[1] > 10){
  7.         print "@ducks[0]\n";
  8.     }
  9.     }
  10.     close $file;
  11. }

The first line reads in the text file in question into the array @files. Then the for loop reads in the text file line by line. The while loop takes each line and separates by ":" symbol using the split() function, moving it into the array @ducks. Thus, each line turns into an array where the 0th element is the file name (i.e ENSG00000000003.unaligned.fasta) and 1st element is the number of sequences (i.e 47) in that file. Thus I could create a conditional (line 6) of only printing out the 0th element where the corresponding 1st element is greater than 10. This provided me with terminal print out of the desired files which I simply transferred into a text file.

Now what I was doing wrong was when I started programming this, I thought the best approach was to remove undesired elements from @files array. That  I found to be more programmatically and computationally difficult. Picking out the elements you want and putting it into a new array was a much easier solution.

Skip to toolbar