Bootcamp/6.MultipleSequenceAlignment.md at main · FierstLab/Bootcamp

FIU Genomics & Bioinformatics Analysis CURE BSC3990L Genomics

Multiple Sequence Alignment and Phylogenomics tutorial

A phylogeny can be built from one sequence (like the COI or 16s), or several, like all variants of the 16s (in ecoli there are 7), or all the shared single orthologous genes (BUSCO) for example. Regardless, the foundation of all phylogenies is a multiple sequence alignment, where we correlate the number of sequence mismatches and gaps to evolutionary distance. Many softwares and pipelines exist to accomplish this. Here we use mafft, clipkit, and iqtree. At the end are two methods of viewing your tree once it is generated: Figtree and RStudio.

Single gene

Extract Sequences

The first step is to gather all the gene sequences you are interested in aligning. We will use the rrsA 16s sequence as a continuation from the BLAST tutorial. If you have completed this tutorial, then you should have a directory called blast_loop with 5 ecoli genomes and the directory called blast_out.

cd to blast_out and you should have 5 output files that end in _blast.out

more or less one of these files. It should look something like:

NC_000913.3:4035531-4037072     NZ_CP024138.1   99.676  1542    5       0       1       1542    3848577 3847036 0.0     2820
NC_000913.3:4035531-4037072     NZ_CP024138.1   99.546  1542    6       1       1       1542    548767  550307  0.0     2808
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.962  1542    16      0       1       1542    4852712 4851171 0.0     2760
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.898  1542    17      0       1       1542    4567895 4566354 0.0     2754
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.898  1542    17      0       1       1542    4736107 4734566 0.0     2754
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.898  1542    16      1       1       1542    1258286 1259826 0.0     2752
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.768  1542    17      2       1       1542    4527455 4525916 0.0     2741

Note: For nematodes, the ribosomal gene typically used for phylogenies is the 18s sequence (it's actually 18s, 5.8s, and 28s of the rDNA, but we'll say 18s for simplicity) taken from a previous nematode phylogenomics paper The nematode 18s sequence is in the file /home/data/jfierst_classroom/msaPractice/18s.fasta

The BLAST output tells us which regions of the genomes match the sequence of interest (columns 9 and 10), but they don't give us the actual sequnces, so we need to make a bed file and use bedtools getfasta to extract these fasta sequences.

cd back into the blast_loop directory

Make your script:

vi bedtools.sh

hit [i] for insert, then copy/paste the following code. The code is commented so that you know what each line is doing.

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_bedtools.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load bedtools2-2.27.1-gcc-8.2.0-bxmhnwb #load the software

#loop makes bed file from blast output and extracts sequence from fasta file
while read -r line; do

#take the first line from each blast output file and cut columns 2,9,10 to make a bed file
  head -1 ./blast_out/${line}_blast.out | cut -f 2,9,10 > ${line}.bed

#check if column 2 of bed file is less than column 3, if not switch them (if will cause an error if anitsense)
  awk '{if ($2 > $3) {temp = $2; $2 = $3; $3 = temp} print}' ${line}.bed > tmpfile && mv tmpfile ${line}.bed

#check to make sure bed files are tab delimited
  sed -i 's/ /\t/g' ${line}.bed

#use bedtools getfasta to get sequnce from genome
  bedtools getfasta -fi ${line} -bed ${line}.bed -fo ${line}.tmp

done < list.txt

#concatenate all sequences together into a single file
cat *.tmp > cat_rrsA.fasta

#remove intermediate files
rm *.fai
rm *.tmp
rm *.bed

hit [Esc] and type :wq and then hit [enter]

sbatch bedtools.sh

You can find more information on bedtools getfasta here: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

This script runs finished in seconds and your output is the file cat_rrsA.fasta. more or less the file. What does it look like?

These names (lines beginning with ">") are going to be annoying later on. Let's fix it!

sed -i 's/contig_1:138094-139635/lab_strain/g' cat_rrsA.fasta
sed -i 's/NC_000913.3:4035531-4037072/ASM584/g' cat_rrsA.fasta
sed -i 's/NC_002695.2:3449868-3451409/ASM886/g' cat_rrsA.fasta
sed -i 's/NC_007946.1:4202676-4204217/ASM132/g' cat_rrsA.fasta
sed -i 's/NZ_CP024138.1:3847036-3848577/ASM285/g' cat_rrsA.fasta

Align Sequences

Make your script:

vi mafft.sh

hit [i] for insert and copy/paste the following:

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_mafft.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load mafft-7.221-gcc-8.2.0-y6cgezm #load the software

linsi --thread 16 --maxiterate 1000 cat_rrsA.fasta > rrsA_aligned.fasta

hit [Esc] and type :wq and then hit [enter]

sbatch mafft.sh

The code finishes in seconds and the output is rrsA_aligned.fasta. more or less rrsA_aligned.fasta. What does it look like? Do you see rows of -------------? What does ---- indicate?

Trim Alignments

We need to "trim" our alignments because gaps may exist in the alignment which are not informative. Specifically, we remove all gaps present in 90% of sequences, thereby only keeping the gaps which may be informative when building the phylogeny. We use a software called ClipKit to do this, with the option smart-gap.

First, this software isn't available as a module on the HPC, so we must install it with mamba.

module load mamba/23.1.0-4

conda create -n clipkit

source activate clipkit

mamba install bioconda::clipkit

Type clipkit -h and the manual should appear if you've installed it correctly.

Make your script:

vi clipkit.sh

hit [i] for insert and copy/paste the following:

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_clipkit.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load mamba/23.1.0-4

source activate clipkit

clipkit -m smart-gap rrsA_aligned.fasta

hit [Esc] and type :wq and then hit [enter]

sbatch clipkit.sh

The code runs in ~30 seconds on the ecoli data. The output file is rrsA_aligned.fasta.clipkit. more or less the file. Do you see a difference between this file and the rrsA_aligned.fasta file?

Build the Tree

Make the script:

vi iqtree.sh

hit [i] for insert and copy/paste the following:

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_iqtree.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL


module load iqtree-2-gcc-8.2.0

iqtree2 -s rrsA_aligned.fasta.clipkit -m MFP -bb 1000 -alrt 1000 -nt AUTO

hit [Esc] and type :wq and then hit [enter]

sbatch iqtree.sh

Here, we are running iqtree with the with the MFP option which will test a bunch of phylogenetic models on your data, calculate AIC, AICc, and BIC scores, then choose the model with the best (smallest) BIC score (this can be changed by adding -AIC or -AICc. -bb specifies the number of bootstrap replicates. -alrt is similar to -bb and specifies bootstrap replicates when calculating branch support. Finally, -nt AUTO allows the program to determine the best number of cores to use for efficiency.

The code runs for ~5minutes with the ecoli data. Many files will be output. The rrsA_aligned.fasta.clipkit.treefile is the nexus file which you can download to your personal computer and view in Figtree or Rstudio as described in the visualization section below. If you type head rrsA_aligned.fasta.clipkit.treefile, then you should see:

(contig_1_138094-139635:0.0006492102,(NC_000913.3_4035531-4037072:0.0000324674,(NC_002695.2_3449868-3451409:0.0000010000,NZ_CP024138.1_3847036-3848577:0.001
5950342)100/100:1.1955503250)0/20:0.0019178427,NC_007946.1_4202676-4204217:0.0000010000);

iqtree can be a difficult program to grasp as it has many options, models, and parameters. See the documentation of more information: http://www.iqtree.org/doc/Tutorial#first-running-example

Multi-gene

There are two ways to infer a species tree from many genes: concatenation of sequences or concatenation of trees. What I mean by this is that we can do multiple sequence alignment of many genes, concatenate those sequences together, and then calculate the maximum likelihood tree, OR we can do a multiple sequence alignment for many genes, calculate a maximum likelihood tree for each gene, and then find the species tree by concatenating the gene trees. Science is headed toward the latter, however we here will do the easier option of concatenating the sequences and then finding the maximum-likelihood tree. For a tutorial concatenating gene trees, see [].

A multi-gene tree with sequence concatenation follows much of the same steps as a single gene-tree. However, instead of one trimmed MSA, we will generate several (one per gene). It is important that all MSAs have the same headers (lines beginning with >). All alignments are then concatenated with AMAS, which generates two files needed to run IQTREE, a concatenated msa and a partition file.

For this tutorial we will use the 7 variants of 16s rDNA found in E.coli to demonstrate the process of building a phylogeny from a multi-gene MSA.

Get the Data

cd to your working directory (for me this is /home/data/jfierst_classroom/tori)

Make a directory to keep your work in (stay organized!)

mkdir multigene_phylogeny

cd multigene_phylogeny

Get the data:

cp -R /home/data/jfierst_classroom/msaPractice/genomes ./.
cp -R /home/data/jfierst_classroom/msaPractice/rrs ./.

ls -Rlath

You should see that within the /rrs/ directory there are 7 fasta sequences. Within the /genomes/ directory there is a directory per species, each containing a fna file (the genome for that species).

Make list of genes and list of species (preparation for loops).

Do not copy/paste the code block, do it line by line.

cd genomes  #move into the genomes directory
ls * > ./../species.txt  #list the directories into species.txt (species.txt will be located in the multigene_phylogeny directory)
cd ../rrs  #move back a directory and into the rrs directory
ls * > ./../genes.txt  #list the files into genes.txt (genes.txt will be located in the multigene_phylogeny directory)
cd ..  #move back into the multigene_phylogeny directory

more species.txt

What does species.txt look like? Right, there's been a problem. We want a list of species without all the directory information. Lets fix it!

Do not copy/paste the code block, do it line by line.

awk 'NR%3==1' species.txt > temp  #prints the first of every three lines and saves the first to temp
sed -i 's/://g' temp #removes ':' symbol
mv temp species.txt #moves temp to species.txt

more species.txt

Have we fixed it?

species.txt should look like:

ASM132
ASM285
ASM584
ASM886
lab_strain

And genes.txt should look like:

rrsA.fasta
rrsB.fasta
rrsC.fasta
rrsD.fasta
rrsE.fasta
rrsG.fasta
rrsH.fasta

BLAST to Find Sequences

Make the script:

vi blast.sh

hit [i] for insert and copy/paste the following:

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_blast.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load blast-plus-2.11.0  #load software

mkdir -p blast_out    #make output directory if it doesn't already exist

#A loop within a loop, the first one chooses a gene sequence from genes.txt and the second one goes through the list of species 
while read -r gene; do
  while read -r species; do
    blastn -query ./rrs/${gene} -subject ./genomes/${species}/*.fna -outfmt 6 -out ./blast_out/${gene}_${species}_blast.out
  done < species.txt
done < genes.txt

hit [Esc] and type :wq and then hit [enter]

Submit the script:

sbatch blast.sh

Your output is in blast_out. It should be 35 files (7 genes * 5 species).

Extract Sequences

Make your script:

vi bedtools.sh

hit [i] for insert, then copy/paste the following code. The code is commented so that you know what each line is doing.

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_bedtools.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load bedtools2-2.27.1-gcc-8.2.0-bxmhnwb #load the software

mkdir -p concatenated_sequences

#loop makes bed file from blast output and extracts sequence from fasta file
for file in ./blast_out/*; do

#take the first line from each blast output file and cut columns 2,9,10 to make a bed file
  head -1 ${file} | cut -f 2,9,10 > ${file}.bed

#check if column 2 of bed file is less than column 3, if not switch them (if will cause an error if anitsense)
  awk '{if ($2 > $3) {temp = $2; $2 = $3; $3 = temp} print}' ${file}.bed > tmpfile && mv tmpfile ${file}.bed

#check to make sure bed files are tab delimited
  sed -i 's/ /\t/g' ${file}.bed

done

#use bedtools getfasta to get sequnce from genome
while read -r species; do
  while read -r gene; do
    bedtools getfasta -fi ./genomes/${species}/*.fna -bed ./blast_out/${gene}_${species}*.bed -fo ./blast_out/${gene}_${species}.tmp
  done < genes.txt
done < species.txt

#fix header lines (lines with >, so that it is just the species name)
while read -r species; do
  for file in ./blast_out/*${species}*.tmp; do
    sed "s/^>.*/>${species}/" ${file} > ${file}.fixed
  done
done < species.txt

#concatenate all sequences per gene together into a single file
while read -r gene; do
  cat ./blast_out/*${gene}*.tmp.fixed > ./concatenated_sequences/cat_${gene}
done < genes.txt

#remove intermediate files
rm ./blast_out/*.tmp
rm ./blast_out/*.bed
rm ./blast_out/*.fixed

hit [Esc] and type :wq and then hit [enter]

sbatch bedtools.sh

Your output is in the directory concatenated_sequences. cd into the directory and ls. You should see 7 files corresponding the each gene. head a file, what do you see? If you type grep ">" *, then you should see that all lines beginning with ">" are simplified to the strain names. Remember to cd back to multigene_phylogeny before continuing.

Align Sequences

Make your script:

vi mafft.sh

hit [i] for insert and copy/paste the following:

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_mafft.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load mafft-7.221-gcc-8.2.0-y6cgezm #load the software

for file in ./concatenated_sequences/*; do
  linsi --thread 16 --maxiterate 1000 ${file} > ${file}.aligned
done

hit [Esc] and type :wq and then hit [enter]

sbatch mafft.sh

Trim Alignments

vi clipkit.sh

hit [i] for insert and copy/paste the following:

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_clipkit.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load mamba/23.1.0-4

source activate clipkit

for file in ./concatenated_sequences/*.aligned; do
  clipkit -m smart-gap ${file}
done

#remove intermediate files
cd concatenated_sequences
rm *.fasta   
rm *.aligned
cd ..

hit [Esc] and type :wq and then hit [enter]

sbatch clipkit.sh

Concatenate Alignments

Get the software:

module load mamba/23.1.0-4

conda create -n AMAS

source activate AMAS

mamba install bioconda::amas

Make the script:

vi AMAS.sh

Hit [i] for insert and copy/paste the following:

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_AMAS.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

# get partition file with AMAS
module load mamba/23.1.0-4
source activate AMAS

python3 /home/[your_username]/.conda/envs/AMAS/bin/AMAS.py concat -c 40 -f fasta -d dna --part-format raxml -i ./concatenated_sequences/*

hit [Esc] and type :wq and then hit [enter]

Submit the script:

sbatch AMAS.sh

The output is 2 files, concatenated.out and partitions.txt. These files are located in the multigene_phylogeny directory. ls. Do you see them? more or less the these 2 files. What do they look like?

Build the Tree

Make the script:

vi iqtree.sh

hit [i] for insert and copy/paste the following:

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_iqtree.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL


module load iqtree-2-gcc-8.2.0

iqtree2 -s concatenated.out -spp partitions.txt -m MFP+MERGE -bb 1000 -alrt 1000 -nt AUTO

hit [Esc] and type :wq and then hit [enter]

sbatch iqtree.sh

Here, I am running IQTREE with 1000 ultrafast bootstraps, 1000 non-parametric bootstraps when calculating branch support, AUTO allocate cores for efficiency, model MFP+MERGE (takes forever, but considers the FreeRate heterogeneity and full partitioning when finding the best model). More information about running IQTREE with multi-gene alignments is located here. You can even mix data types!

Your output is several files. Here is what each output file contains:

Analysis results written to:
  IQ-TREE report:                partitions.txt.iqtree
  Maximum-likelihood tree:       partitions.txt.treefile
  Likelihood distances:          partitions.txt.mldist
  Best partitioning scheme:      partitions.txt.best_scheme.nex
           in RAxML format:      partitions.txt.best_scheme

Ultrafast bootstrap approximation results written to:
  Split support values:          partitions.txt.splits.nex
  Consensus tree:                partitions.txt.contree
  Screen log file:               partitions.txt.log

For visualization we will move forward with partitions.txt.treefile. head the file. It should look like:

(ASM132:0.0001143634,((ASM285:6.3936111178,ASM886:0.0000131456)100/100:14.7610316400,ASM584:0.0000147020)100/100:3.1080925270,lab_strain:0.0056631617);

Choosing which genes to base your phylogeny off of can be difficult. There are the typical mitochondrial genes like COI (or some people do whole mitochondrial alignment), or ribosomal subunits like 16s, 18s, etc. Another option is to use your BUSCO outputs. An example of building a phylogeny based on BUSCO shared single copy orthologs can be found here

Visualization

Download the file to your local machine

The files you want to download are *.treefile. If you did the single gene tutorial and want to download that treefile, it is located in /home/data/jfierst_classroom/[username]/blast_loop/*.treefile. If you did the multigene tutorial, then that treefile is located in /home/data/jfierst_classroom/[username]/multigene_phylogeny/*.treefile

To download a file from the web browser:

Go to the HPC dashboard tab that is currently open on your laptop.

Click on the Files button.

Next, click on the Home Directory button.

Navigate to the file and check the white box next to it.

Finally, click the Download button in the top-right corner to download the file.

To download a file from command line

Type 'exit' and hit [enter]. This will log you out of the hpc

Type 'sftp username@hpclogin01.fiu.edu' and hit [enter]

Enter your password

navagate to the file

Type 'get file'

Type 'exit' and hit [enter] to log out of sftp

FigTree

You can install FigTree on your local machine by downloading it here

Once you have FigTree, navigate to the upper left corner and click on file, then click open

Find your treefile and double click, it should create a tree like:

If you navigate to Tip Labels on the left ribbon, and click the arrow to expand the selection, you should see font size. Increase this. Otherwise, play around with the options. If you click on a branch or label you can annotate it, color it, rotate the node, etc.

RStudio

To install RStudio, you must first install R, then RStudio. You can do so here

A simple tree can be plotted with the following code:

install.packages("ape")

library(ape)

setwd("C:/Users/torie") #set to your directory containing the treefile

multigene_tree <- read.tree("partitions.txt.treefile")
plot(multigene_tree)

There may be more of a learning curve with R, but in the long run, it allows you to calculate statistics about trait evolution across the tree and customize your tree in alot more ways. It's also easily reproducable.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIU Genomics & Bioinformatics Analysis CURE BSC3990L Genomics

Extract Sequences

Align Sequences

Trim Alignments

Build the Tree

Get the Data

BLAST to Find Sequences

Extract Sequences

Align Sequences

Trim Alignments

Concatenate Alignments

Build the Tree

Download the file to your local machine

FigTree

RStudio

FilesExpand file tree

6.MultipleSequenceAlignment.md

Latest commit

History

6.MultipleSequenceAlignment.md

File metadata and controls

FIU Genomics & Bioinformatics Analysis CURE BSC3990L Genomics

Extract Sequences

Align Sequences

Trim Alignments

Build the Tree

Get the Data

BLAST to Find Sequences

Extract Sequences

Align Sequences

Trim Alignments

Concatenate Alignments

Build the Tree

Download the file to your local machine

FigTree

RStudio