Multiple Sequence Alignment and Phylogenomics tutorial
A phylogeny can be built from one sequence (like the COI or 16s), or several, like all variants of the 16s (in ecoli there are 7), or all the shared single orthologous genes (BUSCO) for example. Regardless, the foundation of all phylogenies is a multiple sequence alignment, where we correlate the number of sequence mismatches and gaps to evolutionary distance. Many softwares and pipelines exist to accomplish this. Here we use mafft, clipkit, and iqtree. At the end are two methods of viewing your tree once it is generated: Figtree and RStudio.
Single gene
The first step is to gather all the gene sequences you are interested in aligning. We will use the rrsA 16s sequence as a continuation from the BLAST tutorial. If you have completed this tutorial, then you should have a directory called blast_loop with 5 ecoli genomes and the directory called blast_out.
cd to blast_out and you should have 5 output files that end in _blast.out
more or less one of these files. It should look something like:
NC_000913.3:4035531-4037072 NZ_CP024138.1 99.676 1542 5 0 1 1542 3848577 3847036 0.0 2820
NC_000913.3:4035531-4037072 NZ_CP024138.1 99.546 1542 6 1 1 1542 548767 550307 0.0 2808
NC_000913.3:4035531-4037072 NZ_CP024138.1 98.962 1542 16 0 1 1542 4852712 4851171 0.0 2760
NC_000913.3:4035531-4037072 NZ_CP024138.1 98.898 1542 17 0 1 1542 4567895 4566354 0.0 2754
NC_000913.3:4035531-4037072 NZ_CP024138.1 98.898 1542 17 0 1 1542 4736107 4734566 0.0 2754
NC_000913.3:4035531-4037072 NZ_CP024138.1 98.898 1542 16 1 1 1542 1258286 1259826 0.0 2752
NC_000913.3:4035531-4037072 NZ_CP024138.1 98.768 1542 17 2 1 1542 4527455 4525916 0.0 2741
Note: For nematodes, the ribosomal gene typically used for phylogenies is the 18s sequence (it's actually 18s, 5.8s, and 28s of the rDNA, but we'll say 18s for simplicity) taken from a previous nematode phylogenomics paper The nematode 18s sequence is in the file /home/data/jfierst_classroom/msaPractice/18s.fasta
The BLAST output tells us which regions of the genomes match the sequence of interest (columns 9 and 10), but they don't give us the actual sequnces, so we need to make a bed file and use bedtools getfasta to extract these fasta sequences.
cd back into the blast_loop directory
Make your script:
vi bedtools.sh
hit [i] for insert, then copy/paste the following code. The code is commented so that you know what each line is doing.
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_bedtools.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load bedtools2-2.27.1-gcc-8.2.0-bxmhnwb #load the software
#loop makes bed file from blast output and extracts sequence from fasta file
while read -r line; do
#take the first line from each blast output file and cut columns 2,9,10 to make a bed file
head -1 ./blast_out/${line}_blast.out | cut -f 2,9,10 > ${line}.bed
#check if column 2 of bed file is less than column 3, if not switch them (if will cause an error if anitsense)
awk '{if ($2 > $3) {temp = $2; $2 = $3; $3 = temp} print}' ${line}.bed > tmpfile && mv tmpfile ${line}.bed
#check to make sure bed files are tab delimited
sed -i 's/ /\t/g' ${line}.bed
#use bedtools getfasta to get sequnce from genome
bedtools getfasta -fi ${line} -bed ${line}.bed -fo ${line}.tmp
done < list.txt
#concatenate all sequences together into a single file
cat *.tmp > cat_rrsA.fasta
#remove intermediate files
rm *.fai
rm *.tmp
rm *.bed
hit [Esc] and type :wq and then hit [enter]
sbatch bedtools.sh
You can find more information on bedtools getfasta here: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
This script runs finished in seconds and your output is the file cat_rrsA.fasta. more or less the file. What does it look like?
These names (lines beginning with ">") are going to be annoying later on. Let's fix it!
sed -i 's/contig_1:138094-139635/lab_strain/g' cat_rrsA.fasta
sed -i 's/NC_000913.3:4035531-4037072/ASM584/g' cat_rrsA.fasta
sed -i 's/NC_002695.2:3449868-3451409/ASM886/g' cat_rrsA.fasta
sed -i 's/NC_007946.1:4202676-4204217/ASM132/g' cat_rrsA.fasta
sed -i 's/NZ_CP024138.1:3847036-3848577/ASM285/g' cat_rrsA.fasta
Make your script:
vi mafft.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_mafft.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load mafft-7.221-gcc-8.2.0-y6cgezm #load the software
linsi --thread 16 --maxiterate 1000 cat_rrsA.fasta > rrsA_aligned.fasta
hit [Esc] and type :wq and then hit [enter]
sbatch mafft.sh
The code finishes in seconds and the output is rrsA_aligned.fasta. more or less rrsA_aligned.fasta. What does it look like? Do you see rows of -------------? What does ---- indicate?
We need to "trim" our alignments because gaps may exist in the alignment which are not informative. Specifically, we remove all gaps present in 90% of sequences, thereby only keeping the gaps which may be informative when building the phylogeny. We use a software called ClipKit to do this, with the option smart-gap.
First, this software isn't available as a module on the HPC, so we must install it with mamba.
module load mamba/23.1.0-4
conda create -n clipkit
source activate clipkit
mamba install bioconda::clipkit
Type clipkit -h and the manual should appear if you've installed it correctly.
Make your script:
vi clipkit.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_clipkit.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load mamba/23.1.0-4
source activate clipkit
clipkit -m smart-gap rrsA_aligned.fasta
hit [Esc] and type :wq and then hit [enter]
sbatch clipkit.sh
The code runs in ~30 seconds on the ecoli data. The output file is rrsA_aligned.fasta.clipkit. more or less the file. Do you see a difference between this file and the rrsA_aligned.fasta file?
Make the script:
vi iqtree.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_iqtree.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load iqtree-2-gcc-8.2.0
iqtree2 -s rrsA_aligned.fasta.clipkit -m MFP -bb 1000 -alrt 1000 -nt AUTO
hit [Esc] and type :wq and then hit [enter]
sbatch iqtree.sh
Here, we are running iqtree with the with the MFP option which will test a bunch of phylogenetic models on your data, calculate AIC, AICc, and BIC scores, then choose the model with the best (smallest) BIC score (this can be changed by adding -AIC or -AICc. -bb specifies the number of bootstrap replicates. -alrt is similar to -bb and specifies bootstrap replicates when calculating branch support. Finally, -nt AUTO allows the program to determine the best number of cores to use for efficiency.
The code runs for ~5minutes with the ecoli data. Many files will be output. The rrsA_aligned.fasta.clipkit.treefile is the nexus file which you can download to your personal computer and view in Figtree or Rstudio as described in the visualization section below. If you type head rrsA_aligned.fasta.clipkit.treefile, then you should see:
(contig_1_138094-139635:0.0006492102,(NC_000913.3_4035531-4037072:0.0000324674,(NC_002695.2_3449868-3451409:0.0000010000,NZ_CP024138.1_3847036-3848577:0.001
5950342)100/100:1.1955503250)0/20:0.0019178427,NC_007946.1_4202676-4204217:0.0000010000);
iqtree can be a difficult program to grasp as it has many options, models, and parameters. See the documentation of more information: http://www.iqtree.org/doc/Tutorial#first-running-example
Multi-gene
There are two ways to infer a species tree from many genes: concatenation of sequences or concatenation of trees. What I mean by this is that we can do multiple sequence alignment of many genes, concatenate those sequences together, and then calculate the maximum likelihood tree, OR we can do a multiple sequence alignment for many genes, calculate a maximum likelihood tree for each gene, and then find the species tree by concatenating the gene trees. Science is headed toward the latter, however we here will do the easier option of concatenating the sequences and then finding the maximum-likelihood tree. For a tutorial concatenating gene trees, see [].
A multi-gene tree with sequence concatenation follows much of the same steps as a single gene-tree. However, instead of one trimmed MSA, we will generate several (one per gene). It is important that all MSAs have the same headers (lines beginning with >). All alignments are then concatenated with AMAS, which generates two files needed to run IQTREE, a concatenated msa and a partition file.
For this tutorial we will use the 7 variants of 16s rDNA found in E.coli to demonstrate the process of building a phylogeny from a multi-gene MSA.
cd to your working directory (for me this is /home/data/jfierst_classroom/tori)
Make a directory to keep your work in (stay organized!)
mkdir multigene_phylogeny
cd multigene_phylogeny
Get the data:
cp -R /home/data/jfierst_classroom/msaPractice/genomes ./.
cp -R /home/data/jfierst_classroom/msaPractice/rrs ./.
ls -Rlath
You should see that within the /rrs/ directory there are 7 fasta sequences. Within the /genomes/ directory there is a directory per species, each containing a fna file (the genome for that species).
Make list of genes and list of species (preparation for loops).
Do not copy/paste the code block, do it line by line.
cd genomes #move into the genomes directory
ls * > ./../species.txt #list the directories into species.txt (species.txt will be located in the multigene_phylogeny directory)
cd ../rrs #move back a directory and into the rrs directory
ls * > ./../genes.txt #list the files into genes.txt (genes.txt will be located in the multigene_phylogeny directory)
cd .. #move back into the multigene_phylogeny directory
more species.txt
What does species.txt look like? Right, there's been a problem. We want a list of species without all the directory information. Lets fix it!
Do not copy/paste the code block, do it line by line.
awk 'NR%3==1' species.txt > temp #prints the first of every three lines and saves the first to temp
sed -i 's/://g' temp #removes ':' symbol
mv temp species.txt #moves temp to species.txt
more species.txt
Have we fixed it?
species.txt should look like:
ASM132
ASM285
ASM584
ASM886
lab_strain
And genes.txt should look like:
rrsA.fasta
rrsB.fasta
rrsC.fasta
rrsD.fasta
rrsE.fasta
rrsG.fasta
rrsH.fasta
Make the script:
vi blast.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_blast.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load blast-plus-2.11.0 #load software
mkdir -p blast_out #make output directory if it doesn't already exist
#A loop within a loop, the first one chooses a gene sequence from genes.txt and the second one goes through the list of species
while read -r gene; do
while read -r species; do
blastn -query ./rrs/${gene} -subject ./genomes/${species}/*.fna -outfmt 6 -out ./blast_out/${gene}_${species}_blast.out
done < species.txt
done < genes.txt
hit [Esc] and type :wq and then hit [enter]
Submit the script:
sbatch blast.sh
Your output is in blast_out. It should be 35 files (7 genes * 5 species).
Make your script:
vi bedtools.sh
hit [i] for insert, then copy/paste the following code. The code is commented so that you know what each line is doing.
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_bedtools.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load bedtools2-2.27.1-gcc-8.2.0-bxmhnwb #load the software
mkdir -p concatenated_sequences
#loop makes bed file from blast output and extracts sequence from fasta file
for file in ./blast_out/*; do
#take the first line from each blast output file and cut columns 2,9,10 to make a bed file
head -1 ${file} | cut -f 2,9,10 > ${file}.bed
#check if column 2 of bed file is less than column 3, if not switch them (if will cause an error if anitsense)
awk '{if ($2 > $3) {temp = $2; $2 = $3; $3 = temp} print}' ${file}.bed > tmpfile && mv tmpfile ${file}.bed
#check to make sure bed files are tab delimited
sed -i 's/ /\t/g' ${file}.bed
done
#use bedtools getfasta to get sequnce from genome
while read -r species; do
while read -r gene; do
bedtools getfasta -fi ./genomes/${species}/*.fna -bed ./blast_out/${gene}_${species}*.bed -fo ./blast_out/${gene}_${species}.tmp
done < genes.txt
done < species.txt
#fix header lines (lines with >, so that it is just the species name)
while read -r species; do
for file in ./blast_out/*${species}*.tmp; do
sed "s/^>.*/>${species}/" ${file} > ${file}.fixed
done
done < species.txt
#concatenate all sequences per gene together into a single file
while read -r gene; do
cat ./blast_out/*${gene}*.tmp.fixed > ./concatenated_sequences/cat_${gene}
done < genes.txt
#remove intermediate files
rm ./blast_out/*.tmp
rm ./blast_out/*.bed
rm ./blast_out/*.fixed
hit [Esc] and type :wq and then hit [enter]
sbatch bedtools.sh
Your output is in the directory concatenated_sequences. cd into the directory and ls. You should see 7 files corresponding the each gene. head a file, what do you see? If you type grep ">" *, then you should see that all lines beginning with ">" are simplified to the strain names. Remember to cd back to multigene_phylogeny before continuing.
Make your script:
vi mafft.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_mafft.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load mafft-7.221-gcc-8.2.0-y6cgezm #load the software
for file in ./concatenated_sequences/*; do
linsi --thread 16 --maxiterate 1000 ${file} > ${file}.aligned
done
hit [Esc] and type :wq and then hit [enter]
sbatch mafft.sh
vi clipkit.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_clipkit.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load mamba/23.1.0-4
source activate clipkit
for file in ./concatenated_sequences/*.aligned; do
clipkit -m smart-gap ${file}
done
#remove intermediate files
cd concatenated_sequences
rm *.fasta
rm *.aligned
cd ..
hit [Esc] and type :wq and then hit [enter]
sbatch clipkit.sh
Get the software:
module load mamba/23.1.0-4
conda create -n AMAS
source activate AMAS
mamba install bioconda::amas
Make the script:
vi AMAS.sh
Hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_AMAS.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
# get partition file with AMAS
module load mamba/23.1.0-4
source activate AMAS
python3 /home/[your_username]/.conda/envs/AMAS/bin/AMAS.py concat -c 40 -f fasta -d dna --part-format raxml -i ./concatenated_sequences/*
hit [Esc] and type :wq and then hit [enter]
Submit the script:
sbatch AMAS.sh
The output is 2 files, concatenated.out and partitions.txt. These files are located in the multigene_phylogeny directory. ls. Do you see them? more or less the these 2 files. What do they look like?
Make the script:
vi iqtree.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_iqtree.log
#SBATCH --mail-user=username@fiu.edu #use your own email instead
#SBATCH --mail-type=ALL
module load iqtree-2-gcc-8.2.0
iqtree2 -s concatenated.out -spp partitions.txt -m MFP+MERGE -bb 1000 -alrt 1000 -nt AUTO
hit [Esc] and type :wq and then hit [enter]
sbatch iqtree.sh
Here, I am running IQTREE with 1000 ultrafast bootstraps, 1000 non-parametric bootstraps when calculating branch support, AUTO allocate cores for efficiency, model MFP+MERGE (takes forever, but considers the FreeRate heterogeneity and full partitioning when finding the best model). More information about running IQTREE with multi-gene alignments is located here. You can even mix data types!
Your output is several files. Here is what each output file contains:
Analysis results written to:
IQ-TREE report: partitions.txt.iqtree
Maximum-likelihood tree: partitions.txt.treefile
Likelihood distances: partitions.txt.mldist
Best partitioning scheme: partitions.txt.best_scheme.nex
in RAxML format: partitions.txt.best_scheme
Ultrafast bootstrap approximation results written to:
Split support values: partitions.txt.splits.nex
Consensus tree: partitions.txt.contree
Screen log file: partitions.txt.log
For visualization we will move forward with partitions.txt.treefile. head the file. It should look like:
(ASM132:0.0001143634,((ASM285:6.3936111178,ASM886:0.0000131456)100/100:14.7610316400,ASM584:0.0000147020)100/100:3.1080925270,lab_strain:0.0056631617);
Choosing which genes to base your phylogeny off of can be difficult. There are the typical mitochondrial genes like COI (or some people do whole mitochondrial alignment), or ribosomal subunits like 16s, 18s, etc. Another option is to use your BUSCO outputs. An example of building a phylogeny based on BUSCO shared single copy orthologs can be found here
Visualization
The files you want to download are *.treefile. If you did the single gene tutorial and want to download that treefile, it is located in /home/data/jfierst_classroom/[username]/blast_loop/*.treefile. If you did the multigene tutorial, then that treefile is located in /home/data/jfierst_classroom/[username]/multigene_phylogeny/*.treefile
To download a file from the web browser:
Go to the HPC dashboard tab that is currently open on your laptop.
Click on the Files button.
Next, click on the Home Directory button.
Navigate to the file and check the white box next to it.
Finally, click the Download button in the top-right corner to download the file.
To download a file from command line
Type 'exit' and hit [enter]. This will log you out of the hpc
Type 'sftp username@hpclogin01.fiu.edu' and hit [enter]
Enter your password
navagate to the file
Type 'get file'
Type 'exit' and hit [enter] to log out of sftp
You can install FigTree on your local machine by downloading it here
Once you have FigTree, navigate to the upper left corner and click on file, then click open
Find your treefile and double click, it should create a tree like: 
If you navigate to Tip Labels on the left ribbon, and click the arrow to expand the selection, you should see font size. Increase this. Otherwise, play around with the options. If you click on a branch or label you can annotate it, color it, rotate the node, etc.
To install RStudio, you must first install R, then RStudio. You can do so here
A simple tree can be plotted with the following code:
install.packages("ape")
library(ape)
setwd("C:/Users/torie") #set to your directory containing the treefile
multigene_tree <- read.tree("partitions.txt.treefile")
plot(multigene_tree)
There may be more of a learning curve with R, but in the long run, it allows you to calculate statistics about trait evolution across the tree and customize your tree in alot more ways. It's also easily reproducable.