You are browsing the archive for Shams’ Blog.

Profile photo of shams

by shams

Tree Annotator Thingy

August 13, 2014 in Labwide Announcements, Shams' Blog

It's 11:55 PM on a hazy Sunday night. The snow begins to fall over a slumbering Calgary. Only you, a handsome, dashing young scientist, is awake. Surrounded by nothing but darkness, protected only by the glow of your computer screen, you are in HMRB 253 working on getting your results presentable to impress your supervisor tomorrow morning. For the past month, you've been categorizing the variants that impact the Cheesy virus. You've also looked at the conservation of variants across a set of vertebrate species. You want to have a phylogenetic tree that shows your supervisor the distribution of amino acids in your position of interest.

So you have a tree that looks something like this: ENST00000003084-rs1800123.tree
What you want is something that look like this: ENST00000003084-rs1800123-state.tree

Lucky for you, the former Shams Bhuiyan wrote a script to do just that! BOOM, your butt is saved. Here's what you'll need:

  1. A Newick tree of your phylogeny. Something that looks a lot like thisAn example newick format tree for the all vertebrate tree I displayed above
    An example Newick format tree for the all vertebrate tree I displayed above
  2.  You will also need an input file of the variants you're interested in. The columns should be in this order: alignment file name, variant position in alignment, the variant you're interested in, and a variant identifier (e.g rsID);
  3. As I am sure you have figured out by now, you will also need to give the script a path to a folder with all your sequence alignments (fasta format please)
  4. And of course, the actual script (findstate.pl)! This is located in the Dropbox:de Koning Lab (1)/Public/Scripts/Shams/

All of these inputs need to be hardcoded into the script. I've tried my best to make it obvious where you hard code these inputs in. You will also need to change the array @Wanted_Species to whatever your species list  Perhaps some day I will turn them into command line arguments.

How it works:

  1. Open and read the alignment file corresponding to a given variant
  2. For a given species in the alignment file, record the amino acid state for the position of interest.
  3. Find the given species node on the tree and append the node with the amino acid state.
  4. After all the nodes have been appended, output a newick format tree that's titled [alignment file name]-[variant identifier].tree in directory trees
  5. Output a nexus format tree that's title [alignment file name]-[variant identifier].tree in directory nexus

There are two important things to note about the script! These are specific towards PAML reconstructions:

    1. The script only appends ancestral information as far up to the lowest common ancestor. Consider the tree mentioned earlier. It has 57 species as leaves. However, not all alignment files will have all 57 species. As such, not all ancestral reconstructions will have reliable data for all the ancestral node. The script considers all the species that are given by sequence data and uses BioPerl to draw upon what is the lowest common ancestor.
    2. Ancestral nodes where all descendants are gapped will also have gaps. Ancestral reconstructions will not have gaps. However, sometimes their children will have gaps in the alignment at the position of interest. As a result, the script uses a post-order traversal of the tree to check whether or not any ancestral node have only gapped children. If both children are gapped, then the ancestral node will be appended as gapped as well. That probably was confusing, so hopefully the following example illustrates it best:
So as you can see the ancestral node has a gap when both of its descendants have gaps at that position as well. However when only one child has a gap and the other has an amino acid state for that position, then the node will be appended based on the ancestral reconstruction for that node.

So as you can see the ancestral node has a gap when both of its descendants have gaps at that position as well. However when only one child has a gap and the other has an amino acid state for that position, then the node will be appended based on the ancestral reconstruction for that node.

I think that pretty much covers the overall framework of the program. The script itself has a lot of documentation of what's going on in each section that you can keep your hair well-sculpted. Feel free to ask me questions!

Profile photo of shams

by shams

Journal Fight Club for May 27th

May 24, 2014 in Labwide Announcements, Shams' Blog

Hello Everyone,

So my journal fight is scheduled for May 29th at 2 pm for OB1509. The journal of choice is:

Why Human Disease-Associated Residues Appear as the Wild-Type in Other Species: Genome-Scale Structural Evidence for the Compensation Hypothesis.

Pubmed link: http://www.ncbi.nlm.nih.gov/pubmed/24723421

Profile photo of shams

by shams

Polar Bear Genome

May 10, 2014 in Shams' Blog

Hey guys!

So it looks like they recently published a reference genome for a polar bear. There's a bit of an interesting discussion over the genes found to be under positive selection on the polar bear lineage

Here's the article:
http://www.cell.com/cell/abstract/S0092-8674%2814%2900488-7

Toodles
-Shams

Profile photo of shams

by shams

Branch Lengths

July 23, 2013 in Shams' Blog

Click on the image for a clearer image.graph

primates

prirod

prirodcet

eutherians

mammals

birdies

Froggies

AlignmentQualityvsTreeLengthforAllGenesandAllSpecies

It is interesting to note that 31 genes scored less 0.7 in alignment quality on the primate taxa set. This is noted because this set had the least diverse species list for any taxa set which explained why the average alignment score was the highest. These 31 genes were put through DAVID (http://david.abcc.ncifcrf.gov/home.jsp)  Gene Classification. Only two gene groups were found:

Gene Group 1	Enrichment Score: 4.096470438908425
ENSEMBL_GENE_ID	Gene Name
ENSG00000167676	KIAA1881
ENSG00000196156	keratin associated protein 4-3
ENSG00000163207	involucrin
ENSG00000169876	mucin 17, cell surface associated

Gene Group 2	Enrichment Score: 0.7784370651705931
ENSEMBL_GENE_ID	Gene Name
ENSG00000090612	zinc finger protein 10
ENSG00000171970	zinc finger protein 57
ENSG00000169926	Kruppel-like factor 13
ENSG00000196247	zinc finger protein 107

The remaining low scoring genes are:

ID	Gene Name	Species
ENSG00000197182	ENSG00000197182	Homo sapiens	
ENSG00000213799	ENSG00000213799	Homo sapiens	
ENSG00000136002	Rho guanine nucleotide exchange factor (GEF) 4	Homo sapiens	
ENSG00000198250	anthrax toxin receptor-like	Homo sapiens	
ENSG00000203721	chromosome 1 open reading frame 98	Homo sapiens	
ENSG00000180525	chromosome 10 open reading frame 108	Homo sapiens	
ENSG00000168746	chromosome 20 open reading frame 62	Homo sapiens	
ENSG00000134365	complement factor H-related 4	Homo sapiens	
ENSG00000179979	cysteine-rich PAK1 inhibitor	Homo sapiens	
ENSG00000198324	family with sequence similarity 109, member A	Homo sapiens	
ENSG00000143520	filaggrin family member 2	Homo sapiens	
ENSG00000165948	interferon, alpha-inducible protein 27-like 1	Homo sapiens	
ENSG00000187026	keratin associated protein 21-2	Homo sapiens	
ENSG00000204482	leukocyte specific transcript 1	Homo sapiens	
ENSG00000111877	minichromosome maintenance complex component 9	Homo sapiens	
ENSG00000181240	solute carrier family 25, member 41	Homo sapiens	
ENSG00000168454	thioredoxin domain containing 2 (spermatozoa)	Homo sapiens	
ENSG00000120949	tumor necrosis factor receptor superfamily, member 8	Homo sapiens	
ENSG00000150991	ubiquitin C	Homo sapiens	
ENSG00000165424	zinc finger, CCHC domain containing 24	Homo sapiens
Profile photo of shams

by shams

Step 2: Preliminary Data on Alignment Confidence as Ancestery Deepens

July 8, 2013 in Shams' Blog

The following is the Mean Residue Pair score data for each species set extracted from Guidance (see previous blog post). The data was organized by RStudio. Each pair of residues in the base MSA that is identically
aligned in the perturbed MSA is given a score of 1; allother residue pairs are given the score 0.

Primates:

Min.         1st Qu.  Median       Mean    3rd Qu.     Max.
0.1985  0.9765    0.9940       0.9805  0.9996    1.0000

Histogram of Mean Residue Pair Score for primates

Histogram of Mean Residue Pair Score for primates

Primates and Rodents

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2283  0.9730  0.9895  0.9772  0.9976  1.0000

Histogram of Mean Residue Pair score of Primate and Rodents Multiple Sequence Alignments by Guidance

Histogram of Mean Residue Pair score of Primate and Rodents Multiple Sequence Alignments by Guidance

Primates, Rodents, Cetartiodactyl and Carnivores

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2222  0.9671  0.9839  0.9724  0.9938  1.0000

Histogram of Mean Residue Pair Scores

Histogram of Mean Residue Pair Scores

Eutherians

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2181  0.9591  0.9776  0.9665  0.9903  1.0000

 

Histogram of Euthereans

Histogram of Euthereans

All Mammals

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.06789 0.95180 0.97340 0.96070 0.98740 1.00000

Mean Residue Pair Scores - Histogram

Mean Residue Pair Scores - Histogram

All Mammals and Birds

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.0759  0.9446  0.9696  0.9553  0.9850  1.0000

Histogram of Mean Res Pair Scores up to Birds species

Histogram of Mean Res Pair Scores up to Birds species

All Mammals, Birds and Xenopus

Min. 1st Qu.      Median       Mean         3rd Qu.    Max.
0.06684 0.94030 0.96720 0.95210 0.98380 1.00000

Histogram of species set up to Xenopus Mean Res Pair Score

Histogram of species set up to Xenopus Mean Res Pair Score

All Vertebrae

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2787  0.9164  0.9546  0.9371  0.9764  1.0000

Histogram of ALL species Mean Res Pair score

Histogram of ALL species Mean Res Pair score

 

Looking at the data right now, the median mean res pair score decreases as species sets become more divergent. The histograms show an increased frequency of scores less than 1 as species sets diverge. This data supports our initial hypothesis.

 

 

 

Profile photo of shams

by shams

Step 1 (which will probably break down to many steps): Running Guidance and Extracting Data

July 8, 2013 in Shams' Blog

So at some time in May,  Jason gave me 18189 files. Each file was a fasta file that contained one to one orthologous human genes that can be found across 54 vertebrae species. The species were the following:

Homo_sapiens
Pan_troglodytes
Gorrilla_gorilla
Pongo_abelii
Nomascus_leucogenys
Macaca_mulatta
Papio_hamadryas
Callithrix_jacchus
Tarsius_syrichta
Microcebus_murinus
Otolemur_garnettii
Tupaia_belangeri
Mus_musculus
Rattus_norvegicus
Dipodomys_ordii
Cavia_porcellus
Ictidomys_tridecemlineatus
Oryctolagus_cuniculus
Ochotona_prineceps
Vicugna_pacos
Tursiops_truncatus
Bos_taurus
Ovis_aries
Sus_scrofa
Equus_caballus
Felis_catus
Ailuropoda_melanoleuca
Mustela_putorius_furo
Canis_familiaris
Myotis_lucifugus
Pteropus_vampyrus
Erinaceus_europaeus
Sorex_araneus
Loxodonta_africana
Procavia_capensis
Echinops_telfairi
Dasypus_novemcinctus
Choloepus_hoffmanni
Monodelphis_domestica
Macropus_eugenii
Sarcophilus_harrisii
Ornithorhynchus_anatinus
Gallus_gallus
Meleagris_gallopavo
Anas_platyrhynchos
Taeniopygia_guttata
Anolis_carolinensis
Pelodiscus_sinensis
Xenopus_tropicalis
Latimeria_chalumnae
Oreochromis_niloticus
Tetraodon_nigroviridis
Takifugu_rubripes
Xiphophorus_maculatus
Oryzias_latipes
Gasterosteus_aculeatus
Gadus_morhua
Danio_rerio

Some of the FASTA files contains all the aforementioned species while others only included Homo_Sapiens and Gorilla_Gorilla.

So these FASTA files were used to measure the decrease in alignment quality as evolutionary ancestry deepens and moves further away from humans. So multiple sequence alignments were first ran on the orthologs found in (1) primates (only files with 10 or more taxas), then the (2) primates and rodents, then (3) primates, rodents and certariodactyls/carnivores, (4)primates, rodents, certariodactyls/carnivores and eutherians, (5) primates, rodents, certariodactyls/carnivores, eutherians and remaining mammals, (6) primates, rodents, certariodactyls/carnivores, eutherians, remaining mammals and birds, (7) primates, rodents, certariodactyls/carnivores, eutherians, remaining mammals, birds and xenopus, (8), primates, rodents, certariodactyls/carnivores, eutherians, remaining mammals, birds, xenopus and fishes.

The software used to determine alignment quality was GUIDANCE. The scores produced are based on the robustness of the MSA to guide tree uncertainty, which relies on a bootstrap approach. To run this program on the large number of files for 8 different sets, I created the following bash script:

  1. #! /bin/bash
  2. # Job script for PBS/Torque
  3. # de Koning Lab, University of Calgary
  4. # May 2013
  5. # Set some PBS variables (working directory, job name, location of out and error streams, etc.)
  6. #PBS -d /media/Data2/Working/secondguidanceruns
  7. #PBS -N guidancesequencing
  8. #PBS -o logs/guidanceruns.out
  9. #PBS -e logs/guidanceruns.err
  10. # Limits/resource requests for Westgrid systems, which demand it:
  11. #PBS -l walltime=2500:00:00
  12. ## PBS -l procs=1
  13. ## PBS -l mem=1900mb
  14. # Run how many analyses this time? Good idea to adjust this to keep total # of tasks in job array under 2000
  15. num=553
  16. # listfile containing list of data files to process
  17. listfile=guidancefastafiles.txt
  18. #Testing condition
  19. #job=`expr $1 - 1`
  20. # Set the job number from the PBS environment variable for a job array
  21. job=`expr ${PBS_ARRAYID} - 1`
  22. PATH=/home/shams:$PATH
  23. export PATH
  24. # Starting job, ending job, etc.
  25. start=`expr $job \\* $num + 1`
  26. end=`expr $start + $num`
  27. last=`cat $listfile |grep -c fasta`
  28. echo "last = $last end = $end start =$start job = $job num = $num"
  29. if [ $end -gt $last ]
  30. then
  31.         end=`expr $last + 1`
  32. fi
  33. # Loop through each analysis
  34. for ((i=$start; i<$end; i++))
  35. do
  36.     # Get the datafile filename
  37.         file=`cat $listfile | head -n $i |tail -n 1`
  38.         echo "Processing file $i: $file..."
  39.     name=`echo "$file" | cut -d'.' -f1`
  40.     # Now run the analysis for this data file (run a MAFFT MSA via Guidance on codon sequences)
  41.     perl /home/shams/guidance.v1.3.1/www/Guidance/guidance.pl --seqFile $file --msaProgram MAFFT --seqType codon --outDir output/$name --bootstraps 100
  42.         #echo "$i: $file"
  43. done

I then proceeded to write the following script (getMeanResScores.pl). This script was intended to descend into each of the output files and retrieve the mean pair wise residue score file for each MSA done. Please note that this post is merely to inform which script was used where. I further, in depth description/analysis of each script will appear as their own blog post.

  1. #! /usr/bin/perl
  2. use File::Copy;
  3. #Shamsuddin Bhuiyan
  4. #dekoning lab
  5. #June 11th, 2013
  6. #
  7. #
  8. #Iterate through the files of a given directory to grab multiple files of the same name, rename that file and then save to a file
  9. #
  10. my @dirs = grep { -d } glob './*';
  11. pop (@dirs);
  12. #print "@dirs \n";
  13. @missing = ();
  14. foreach $dirs (@dirs){
  15.         if (-e "$dirs/MSA.MAFFT.Guidance_msa.scr"){
  16. #opendir ($DIR, $dirs) or die "$!";
  17.         #print "$dirs \n";
  18.         print "copy(\"$dirs/MSA.MAFFT.Guidance_msa.scr\", \"MeanResPairs/$dirs.Guidance_msa.scr \n";
  19.         copy("$dirs/MSA.MAFFT.Guidance_msa.scr", "MeanResPairs/$dirs.Guidance_msa.scr") or die;
  20.         }
  21.         else{
  22.         push(@missing, $dirs);
  23.         }
  24. }
  25. open (FILE, ">> missing.txt");
  26. print FILE "@missing /n";
  27. close (FILE);

I ran the above script for each species set. Once the mean residue pair score files were separated for a species set, I ran another script over all the mean residue pair score files. This script was used to open each of the files up, grab the geneID and the mean residue pair score for the MSA done and then add to a table.

  1. #! /usr/bin/perl
  2. # Shamsuddin Bhuiyan
  3. # de koning lab
  4. # June 11th, 2013
  5. #Grab mean res pair score from guidance_msa.scr file and write to file
  6. #Grab files
  7. @files = glob("*.scr");
  8. #read in each file
  9. foreach $file (@files) {
  10.     open(IN, $file);
  11.     while ($line = <IN>) {
  12.         #if the line matches the mean res pair score indicator then split it by white spaces and grab the 1nth element of that line which should be the score you want
  13.         #add it to an array
  14.         if ($line =~ "#MEAN_RES_PAIR_SCORE") {
  15.                 @ducks =split(/ /, $line);
  16.                 print "$file @ducks[1]\n";
  17.         }
  18.     }
  19.     close $file;
  20. }

Once the scores were organized to a table, the data was fed into R to produce summary data and histograms.

Profile photo of shams

by shams

Scriptin' Adventures: Producing a List of Desired Files

June 12, 2013 in Shams' Blog

When I was in high school, I did a lot of theatre, including writing and directing my own show. The scripting about to discuss is extremely different from the scripts I wrote back then.

So at this point I have written a few Perl scripts and a bash script and I thought I would share them with you, along with the joys and tears I had with them as well as where I sought help from...other than Jason.

Perl Scripts

First, a disclaimer: this is my first time coding in Perl and its also my first exposure to certain programming concepts like regular expression. Also note that I am a lazy programmer so learn more from my mistakes than from my successes. The first script I wrote here was to determine if a FASTA file had enough sequences for our study. Essentially, I had 18, 191 FASTA files in my directory. I wanted only the FASTA files that had 10 or more sequences. My first step was this:

grep -c ">" *.fasta > fastafiles.txt

grep() is a nifty function that grabs what ever you want. The -c tells the computer to count whatever you're grepping, in this case the ">" sign. In FASTA files, this sign is used to indicate the introduction of a new species.

So what does this give me? A text file (created using the > fastafile.txt) that says:

ENSG00000000003.unaligned.fasta:47
ENSG00000000005.unaligned.fasta:50
ENSG00000000419.unaligned.fasta:47
ENSG00000000457.unaligned.fasta:54
ENSG00000000460.unaligned.fasta:39
ENSG00000000938.unaligned.fasta:37
ENSG00000000971.unaligned.fasta:6

18184 more lines of this.

So now I needed a way to grab the file names that had more than 10 sequences. And I can tell you right now that was the wrong way to think about it. In reality, the better way to think of it was to remove the files that had less than 10 sequences (thank you @Nathan for this).  So I ended up with a neat script:

  1. @files = glob("fastafiles.txt");
  2. foreach $file (@files) {
  3.     open(IN, $file);
  4.     while ($line = <IN>) {
  5.     @ducks =split(/:/, $line);
  6.     if (@ducks[1] > 10){
  7.         print "@ducks[0]\n";
  8.     }
  9.     }
  10.     close $file;
  11. }

The first line reads in the text file in question into the array @files. Then the for loop reads in the text file line by line. The while loop takes each line and separates by ":" symbol using the split() function, moving it into the array @ducks. Thus, each line turns into an array where the 0th element is the file name (i.e ENSG00000000003.unaligned.fasta) and 1st element is the number of sequences (i.e 47) in that file. Thus I could create a conditional (line 6) of only printing out the 0th element where the corresponding 1st element is greater than 10. This provided me with terminal print out of the desired files which I simply transferred into a text file.

Now what I was doing wrong was when I started programming this, I thought the best approach was to remove undesired elements from @files array. That  I found to be more programmatically and computationally difficult. Picking out the elements you want and putting it into a new array was a much easier solution.

Profile photo of shams

by shams

Hello World!

June 10, 2013 in Shams' Blog

Hello World!

As we are already more than a month into the summer, there will be a few blog posts coming out at once to catch people up on the happenings here in the de Koning lab. In this first blog post I hope to introduce myself and the project that I have undertaken for the summer.

My name is Shamsuddin (Shams) Bhuiyan and I am entering my fourth year of the Faculty of Medicine's Bioinformatics program here at the U of C. I got introduce to Dr. de Koning and his work via the Bioinformatics club lab tour. My background in research outside of my degree's curriculum has been one summer in Christian Jacob's LINDSAY Virtual Human project where I developed an augmented reality tool for the iOS component of the project. Entering into the de Koning lab this summer, my relevant skills include programming (Python, Java, Objective C, Assembly, algorithm design and analysis), an understanding of certain biological fields (genetics, molecular biology, biochemistry, physiology and anatomy), a bit of experience with bioinformatics (alignments, MSA, building phylogenies, BLAST) and some coursework in math (calculus, linear and discrete). I hope to not only improve in some of those areas but also gain a bit of background on statistics within the scope of bioinformatics. Gaining some experience in statistics was what pushed me to sign on to the de Koning lab and undertake this specific project.

Project Title:

 

Project title: P
-->Finding the Limits of Statistical Inference in Comparative Genomics

Introduction: Biologists established phylogenetic trees as a tool for evolutionary studies in order to homology in organismal sequence data. Phylogram provide important data in their branches, where the length of the branch is relative to the amount of evolutionary change occurring from a common ancestor. Longer branches indicate a higher change in character with both nucleotide and amino acid sequences. This increased amount of character change means more data and more data usually indicates increased statistical power. The increase of statistical power allows evolutionary biologists to make statistical inferences in comparative genomics especially since many ancestral genomes are unavailable to modern researchers. Thus it makes sense to go deep into ancestral lines of species to get more data. However, the problem arises that alignment data in these deep ancestries become less reliable (i.e the confidence scores of the alignments are lower).

Objective: The objective of my project is to find the trade off between statistical power and alignment quality based on branch length. How long should a branch be before alignment quality decreases significantly? What branch length is long enough to achieve significant statistical power? To investigate the objective, I propose the use of the MAFFT (Multiple Alignment using Fast Fourier Transform) tool of the GUIDANCE alignment software, Rstudio, bash scripting, Perl scripting, parametric bootstrap analysis and the monster of a machine Ozymandias. I hypothesize that as statistical power increases through the increasing of branch length, alignment quality of sequences will decrease. To test this hypothesis, I propose the following aims:

Specific Aim #1: To investigate the effects that branch length has upon alignment quality in phylograms. Using a bash script, a large amount of vertebrae sequences will be aligned through the GUIDANCE software via the MAFFT tool. Alignment quality will be based on GUIDANCE scores. This alignment will be done on Ozymandias.

Specific Aim #2: To investigate the effects that branch length has upon the statistical power in inferences made in a phylogram.

Specific Aim #3: To determine the trade off between alignment quality and statistical power in terms of branch length of phylograms.

Skip to toolbar