Bootcamp/5.ntBLAST.md at main · FierstLab/Bootcamp

FIU Genomics & Bioinformatics Analysis CURE BSC3990L Genomics

Nucleotide BLAST tutorial

BLAST stands for Basic Local Alignment Search Tool. There are different types of blast: blastn, blastx, blastp, tblastn, etc. Here, we will focus on blastn, which does nucleotide to nucleotide searches.

This tutorial continues with the ecoli data from step 4.Genome_assembly.md with the flye assembly. The assembly was renamed to ecoli.fasta. The first step of this current tutorial has you copy the fasta to your working directory, however you could continue with the assembly you yourself made (named assembly.fasta) and should get the same results.

Before beginning, remember to log on to the HPC (see previous tutorial if you don't remember how). Once you have logged on, cd to your working directory, for me this will be /home/data/jfierst_classroom/tori

BLAST a sequence against your genome

This type of BLAST is useful if you want to find a specific sequence in your genome. For example, you may want to extract a ribosomal gene to build a phylogeny, or perhaps you have an interest in genome defense and want to find if there are any PIWI orthologs. Or maybe your interest is about metabolic genes, immune genes, chemosensation, etc. The possibilities are as creative as you are, but being able to search any genome for any sequence of interest is an essential skill in bioinformatics.

Get the data

cp /home/data/jfierst_classroom/blastPractice/ecoli.fasta ./.
cp /home/data/jfierst_classroom/blastPractice/rrsA.fasta ./.

Make your script:

vi blast.sh

Hit [i] for insert and copy/paste the following script.

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_blast.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load blast-plus-2.11.0 #load software

blastn -query rrsA.fasta -subject ecoli.fasta -outfmt 6 -out blast.out #run blastn, specify table format output

Hit [esc], type ":wq" and then hit [enter] to save your script.

Query is the sequence you want to find while subject is the sequence you're searching.

Submit your script with:

sbatch blast.sh

This should only take a minute or two to run. Use more or less to check output_blast.log and blast.out for results or errors.

There are many options you can use to alter the script above. To see these, type:

module load blast-plus-2.11.0
blastn -help

Your output file (blast.out) should look like:

NC_000913.3:4035531-4037072     NZ_CP024138.1   99.676  1542    5       0       1       1542    3848577 3847036 0.0     2820
NC_000913.3:4035531-4037072     NZ_CP024138.1   99.546  1542    6       1       1       1542    548767  550307  0.0     2808
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.962  1542    16      0       1       1542    4852712 4851171 0.0     2760
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.898  1542    17      0       1       1542    4567895 4566354 0.0     2754
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.898  1542    17      0       1       1542    4736107 4734566 0.0     2754
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.898  1542    16      1       1       1542    1258286 1259826 0.0     2752
NC_000913.3:4035531-4037072     NZ_CP024138.1   98.768  1542    17      2       1       1542    4527455 4525916 0.0     2741

The output looks like this because in our script we specify the table format -outfmt 6. The default table format reports:

Query id (rrsA.fasta)
Subject id (ecoli.fasta)
% identity
alignment length
mismatches
gap openings
query start
query end
subject start
subject end
e-value
bit score

You can get more information about the table format at: https://www.metagenomics.wiki/tools/blast/blastn-output-format-6

BLAST a sequence against many genomes

In coding, a loop is when we tell the computer to repeat a command a specified amount of times. Here, we repeat the the same blast command we ran before, but this time for each line in list.txt.

To prepare the loop, first obtain all genomes you want to search against and save their filenames in list.txt.

make a directory and cd into it:

mkdir blast_loop
cd blast_loop

Get the genomes:

cp /home/data/jfierst_classroom/blastPractice/*.fna ./.

Save filenames to list.txt:

ls *.fna > list.txt

Now we are ready to make our loop.

vi blast.sh

Hit [i] for insert and copy paste the following script.

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_blast.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load blast-plus-2.11.0  #load software

mkdir -p blast_out    #make output directory if it doesn't already exist

#loop searches for the sequence in sequence.fasta in all genomes listed in list.txt and outputs the results from each genome in separate files in blast_out 
while read -r line; do
  blastn -query ./../rrsA.fasta -subject ${line} -outfmt 6 -out ./blast_out/${line}_blast.out
done < list.txt

Hit [esc], then type ":wq" and hit [enter] to save the script.

submit the script:

sbatch blast.sh

Your results are in a directory called blast_out

BLAST your genome against the nucleotide database

BLASTing your entire genome can give you information about potential contamination and species identification. However, it can be computationally expensive and slow. Additionally, if it is a de novo assembly of an organism previously not sequenced, some sequences may not hit to anything in the BLAST nucleotide database.

#!/bin/bash

#SBATCH --account acc_jfierst_classroom
#SBATCH --partition highmem1
#SBATCH --qos highmem1
#SBATCH --output=output_blast.log
#SBATCH --mail-user=username@fiu.edu    #use your own email instead
#SBATCH --mail-type=ALL

module load blast-plus-2.11.0  #load software

export BLASTDB='/home/data/jfierst_classroom/blastPractice/nt_db/'  #export path to blast nt database

#run blast
blastn -query ecoli.fasta -db nt -culling_limit 5 -evalue 1e-25 \
    -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" \
    -out blast.out

This takes several hours, even with the small ecoli genome we are practicing on.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIU Genomics & Bioinformatics Analysis CURE BSC3990L Genomics

FilesExpand file tree

5.ntBLAST.md

Latest commit

History

5.ntBLAST.md

File metadata and controls

FIU Genomics & Bioinformatics Analysis CURE BSC3990L Genomics