Genome Annotation tutorial
Note: This tutorial demonstrates eukaryotic de novo genome annotation without RNA data. Although we have been using E.coli in our previous tutorials, it is not possible with some of the softwares used in this pipeline. Why? because some rely on databases specific to eukaryotes. So, we will be demonstrating with c.elegans. Because we are using c.elegans, this is not a quick and short tutorial. Some of these programs can take several hours.
-
Bacterial genome annotation has its own specialized softwares which are likely to perform better. If you are doing prokaryotic genome annotation, try looking into Prokka.
-
De novo annotation is more difficult than if you had a reference genome. For gene annotation when a reference is present, try looking into Liftoff.
-
The more data the better. For this course we did not do RNA sequencing, however, it is recommended to provide more evidence for annotation softwares. If you have RNA data, the process is very similar. The only difference is that after repeat masking, you have to align your RNA data to your genome. A few popular tools exist for this; try looking into STAR or HISAT2. You can find more information on running STAR or HISAT2 on the FIU HPC here, under protein annotation.
Masking Repeats
Masking the repeat regions of the genome makes protein annotation easier, and many softwares ask for a 'softmasked' version of a genome. If not provided, they often make one for you before beginning annotation. 'Softmasking' a genome means to replace repeat regions with lowercase letters, so instead of ATGCGC, you might see ATgcgc. 'Hardmasking' a genome means to replace repeat regions with 'N', so instead of ATGCGC, you might see ATNNNN. Here, we softmask our genome with repeatmodeler and repeatmasker.
cd to your working directory. For me this is /home/data/jfierst_classroom/tori
mkdir annotation
cd annotation
There are 2 ways to get the data (if you are not affiliated with the class do the second option):
Option #1
cp /home/data/jfierst_classroom/annotationPractice/elegans.fasta ./.
Option #2
Do not copy/paste teh entire code block but go line by line
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/985/GCF_000002985.6_WBcel235/GCF_000002985.6_WBcel235_genomic.fna.gz
gunzip GCF_000002985.6_WBcel235_genomic.fna.gz
awk '/^>/ {print; next} {print toupper($0)}' GCF_000002985.6_WBcel235_genomic.fna > elegans.fasta
First we use RepeatModeler/Masker to find repetitive regions and create whats called a 'repeat library'.
Install repeatModeler/Masker with conda:
module load mamba/23.1.0-4
conda create -n repeatmodeler
source activate repeatmodeler
mamba install -c bioconda repeatmodeler
Make sure it installed properly. Type BuildDatabase and the manual for that command should appear. If there is an error that says command not found, make sure you did the above installation correctly.
Make the script:
vi repeatmodeler.sh
Hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --qos highmem1
#SBATCH --partition highmem1
#SBATCH --output=out_repeatmodeler.log
#SBATCH --mail-user=username@email.com #use your own email
#SBATCH --mail-type=ALL
module load mamba/23.1.0-4
source activate repeatmodeler
#Build the database
BuildDatabase -name ELEGANS elegans.fasta
#Run RepeatModeler for de novo repeat identification and characterization. Takes long time.
RepeatModeler -threads 8 -database ELEGANS
#Use the queryRepeatDatabase.pl script inside RepeatMasker/util to extract Rhabditida repeats
python /home/[username]/.conda/envs/repeatmodeler/share/RepeatMasker/famdb.py families -f fasta_acc -ad --curated 'rhabditida' > Rhabditida.repeatmasker
#Combine the files to create a library of de novo and known repeats
cat RM*/consensi.fa.classified Rhabditida.repeatmasker > Elegans.repeats
Hit [Esc], then type :wq and hit [Enter]
Your output will be Elegans.repeats. This takes about 1-2 days to complete on c.elegans
Now, mask the repeats from the library you just generated.
Make the script:
vi repeatmasker.sh
Hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --qos highmem1
#SBATCH --partition highmem1
#SBATCH --output=out_repeatmasker.log
#SBATCH --mail-user=username@email.com #use your own email
#SBATCH --mail-type=ALL
module load mamba/23.1.0-4
source activate repeatmodeler
#Mask the genome of known repeats
RepeatMasker -lib Elegans.repeats -pa 8 -xsmall -nolow elegans.fasta
Hit [Esc], then type :wq and hit [Enter]
-nolow / -l(ow)
With the option -nolow or -l(ow) only interspersed repeats are masked. Other repeats, which are less complex, like simple tandem repeats and low complexity (polypurine, AT-rich) regions are skipped with the -nolow option. By default all repeats are masked. For database searches the default setting is recommended, but sometimes, e.g. when using the masked sequence to predict the presence of exons, it may be better to skip the low complexity masking.
-xsmall
Returns repetitive regions in lowercase (soft masking) instead of replacing with N's (hard masking). Non-repeat regions remain in uppercase.
-pa
Stands for parallel, meaning it runs the program on 8 sequences at a time.
This will take about 40 minutes to complete. The output of RepeatMasker is elegans.fasta.masked. more or less the file. You should notice that it now has a mix of lowercase and uppercase letters.
Protein Prediction
Many softwares exist for protein prediction, most of which use machine learning methods like CNNs, LSTMs, and HMMs. Those used today include BRAKER, GALBA, Tiberius, and Helixer. Some older ones include MAKER or FunAnnotate. BRAKER has 3 versions: BRAKER1 is genome + RNA, BRAKER2 is genome + protein, and BRAKER3 is genome + RNA + protien. GALBA is like BRAKER2 but specifically code to deal with issues unique to large genomes (>500Gb). Tiberius is the newest annotator but has only been trained (modeled) on mammalian data. Helixer has more available models but is more particular about GPU requirements. Below, we use BRAKER2 for gene annotation, although notice from the images that these may not be the best annotations. Rather, it's what we can do with what we have.
https://github.com/Gaius-Augustus/BRAKER
Install the software using a singularity container:
module load singularity-3.8.7
singularity build braker3.sif docker://teambraker/braker3:latest
This may take several minutes.
singularity exec braker3.sif braker.pl
If installed correctly, you should get the braker help menu.
You can also test your installation with their test data:
singularity exec -B $PWD:$PWD braker3.sif cp /opt/BRAKER/example/singularity-tests/test1.sh .
export BRAKER_SIF=/your/path/to/braker3.sif # may need to modify
bash test1.sh
This may run for about 15 minutes, creating an output directory called test1. The main output you are looking for is a file called braker.gtf. This file contains the final protein predictions made by BRAKER. Of course make sure to check your log files for any errors or warnings.
If the program is working, continue.
Get the protein data:
cp /home/data/jfierst_classroom/GCF_000002985.6_WBcel235_translated_cds.faa ./.
Make the script:
vi singularity_braker.sh
Hit [i] for insert and copy/paste the following.
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --qos highmem1
#SBATCH --partition highmem1
#SBATCH --mail-user=your@email.com #input your email
#SBATCH --mail-type=ALL
module load singularity-3.8.7
module load proxy
export BRAKER_SIF=/your/path/to/braker3.sif #input your path
wd=output_braker2
# remove output directory if it already exists
if [ -d $wd ]; then
rm -r $wd
fi
singularity exec -B ${PWD}:${PWD} ${BRAKER_SIF} braker.pl \
--genome=elegans.fasta.masked \ #change genome if you are doing this with different data
--prot_seq=GCF_000002985.6_WBcel235_translated_cds.faa \
--workingdir=${wd} \
--GENEMARK_PATH=${ETP}/gmes --threads 8 --softmasking --busco_lineage nematoda_odb10 &> output_singularity_braker2.log
hit [Esc] and type :wq then hit [Enter]
Submit the script:
sbatch singularity_braker.sh
Note: all files must be in the same working directory when using a singularity container. The genome, protein, and sif file should all be in your working directory.
This takes 1-2 days to run.
You can check the status of your job with:
squeue --me
Once complete, you should see a few different output files in the output_braker2 directory, namely:
braker.gtf is a table format of predictions (exon, intron, CDS) that includes the union of geneMark and Augustus predictions. augustus.hints.gtf is very similar but doesn't include geneMark hints. Using augustus.hints.gft may reduce false positives if BRAKER was run in ES mode.
braker.aa is the predicted protein sequences in fasta format.
braker.codingseq is the nucleotide sequence of predicted proteins in fasta format.
Lets have a look:
more or less or head braker.gtf
more or less or head braker.codingseq
more or less or head braker.aa
How many genes are there?
cat braker.gtf | cut -f 3 | grep -c "gene"
How many sequences are in braker.aa?
grep -c ">" braker.aa
Why might there be a difference?
Quality Check your predictions
Like assemblies, your results can vary depending on the program used and input data. Thus it is possible to generate multiple prediction sets for the same genome. If you want to compare them, you may want to quality check.
AGAT is used to covert the .gtf output file from BRAKER, to a .gff3 file format (this format can also be generated directly from BRAKER if you use the --gff3 flag). AGAT also calculates some basic statistics such as gene count, average length of the gene, average intron length, percent of genome which is genic, percent of genome covered by introns or exons, etc.
Install the software:
module load mamba/23.1.0-4
conda create -n AGAT
source activate AGAT
mamba install -c bioconda agat
Make the script:
vi agat.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --qos highmem1
#SBATCH --partition highmem1
#SBATCH --output=out_agat.log
#SBATCH --mail-user=your@email.com
#SBATCH --mail-type=ALL
#load software
module load mamba/23.1.0-4
source activate AGAT
#convert gtf to gff file format
/home/[username]/.conda/envs/AGAT/bin/agat_convert_sp_gxf2gxf.pl \
-g ./output_braker2/braker.gtf \
-o braker2.gff3
#get statistics
/home/[username]/.conda/envs/AGAT/bin/agat_sp_statistics.pl \
--gff braker2.gff3 \
-f ./../elegans.fasta.masked \
-o AGATstats.txt
hit [Esc] and then type :wq and hit [Enter]
Submit the script:
sbatch agat.sh
This is completed in about 5 minutes.
more or less AGATstats.txt
How many genes do you have?
How many transcripts?
How many genes are overlapping?
What percent of the genome is genic?
Remember we've run Busco before, back in the GenomeAssembly Tutorial, it is the same concept, except here we use the predicted proteins as input and change -m nucl to -m prot
Make the script:
vi busco.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --qos=highmem1
#SBATCH --partition=highmem1
#SBATCH --output=out_busco.log
#SBATCH -n 12
#SBATCH --mail-user=your_email@fiu.edu # Please use your own email
#SBATCH --mail-type=ALL
module load busco/5.4.7
busco -c 4 -m prot -i ./output_braker2/braker.aa -o busco_out --offline --lineage_dataset /home/data/jfierst_classroom/nematoda_odb10
hit [Esc] and type :wq then hit [Enter]
Submit the script:
sbatch busco.sh
This should complete in 5-10 minutes.
How does this BUSCO result compare to your assembly BUSCO result?
Functional Annotation
InterProScan scans databases to predict protein function and domains. The input to this is a fasta format of predicted protein sequences, either nucleotide or amino acid works.
First, braker.aa has * characters at the end of each sequence. We have to remove these before input into InterProScan or you will get an error.
cd output_braker2
sed 's/\*//g' braker.aa > clean_braker.aa
cd ..
Make the script:
vi interpro.sh
Hit [i] for insert and copy/past the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --qos highmem1
#SBATCH --partition highmem1
#SBATCH -n 8
#SBATCH --output=out_interpro.log
#SBATCH --mail-user=youremail@fiu.edu
#SBATCH --mail-type=ALL
module load interproscan/5.68.100.0
/home/applications/interproscan/5.68.100.0/interproscan.sh -i ./output_braker2/clean_braker.aa -dp -goterms
hit [Esc] and type :wq then hit [Enter]
-dp deactivates the use of precalculated match lookup, which is more computationally expensive, but more thorough.
-goterms specifies that we want goterm matches output.
See the InterProScan documentation for more information.
Submit the script:
sbatch interpro.sh
This will take about 8 hours to run. The output will be clean_braker.aa.tsv, clean_braker.aa.xml, and clean_braker.aa.gff3. These are all just different file formats of the same information.
Let's say we are interested in GPCR proteins. This is generally a fast evolving protein family in nematodes, having alot to do with chemosensation, which is the main way nematodes navigate their environments.
Try:
grep --color=ALWAYS "GPCR" clean_braker.aa.gff3 | head
Are there GPCR proteins that were predicted?
Try:
grep "GPCR" clean_braker.aa.gff3 | cut -f 1 | sort | uniq | wc -l
How many genes are defined as GPCR proteins?
Repeat Annotation
There are many softwares for repeat annotation. Extensive de novo repeat annotator (EDTA) was made by plant biologists and is sort of the standard in the field at the moment. However, many other softwares have been released in recent years, such as EarlGrey (benchmarked with D.melanogaster) and TransposonUltimate (benchmarked with C.elegans). Below we demonstrate EDTA.
First, we are going to extract the coding sequences from our gene annotation. TEs often contain their own genes which can get mixed up with the "native" genes. Hopefully by including coding sequences, we can tell the two apart.
Find your braker.gtf file
Filter the table for only CDS lines:
awk '$3=="CDS"' braker.gtf > braker_cds.gtf
Make a bed file:
cat braker_cds.gtf | cut -f 1,4,5 | sed 's/ /\t/g' > braker_cds.bed
Unwrap the fasta:
awk '/^>/ {print (NR==1?"":"\n") $0; next} {printf "%s", $0} END {print ""}' elegans.fasta.masked > unwrapped_elegans.fasta.masked
Replace spaces with underscores in the fasta header (header and bed file first column must match)
sed -i 's/ /_/g' elegans.fasta.masked
Use bedtools getfasta to retrieve the sequences using the bed file coordinates.
Make the script:
vi bedtools.sh
hit [i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --qos highmem1
#SBATCH --partition highmem1
#SBATCH --output=out_bedtools.log
#SBATCH --mail-user=youremail@fiu.edu
#SBATCH --mail-type=ALL
module load bedtools2-2.27.1-gcc-8.2.0-bxmhnwb
#check coordinate order (start location should be a smaller number than the stop location)
awk '{if ($2 > $3) {temp = $2; $2 = $3; $3 = temp} print}' braker_cds.bed > tmpfile && mv tmpfile braker_cds.bed
#check to make sure bed files are tab delimited
sed -i 's/ /\t/g' braker_cds.bed
#use bedtools getfasta to get sequnce from genome
bedtools getfasta -fi ./../unwrapped_elegans.fasta.masked -bed braker_cds.bed -fo elegans.cds.fa
Run the script:
sbatch bedtools.sh
This runs in seconds and your output is elegans.cds.fa. This will be used in EDTA.
Get the software:
module load mamba/23.1.0-4
conda create -n EDTA
source activate EDTA
mamba install -c conda-forge -c bioconda edta
Be patient, this will take a few minutes. If you think your screen is stuck hit [Enter] a few times.
Make the script:
vi EDTA.sh
hit[i] for insert and copy/paste the following:
#!/bin/bash
#SBATCH --account acc_jfierst_classroom
#SBATCH --qos highmem1
#SBATCH --partition highmem1
#SBATCH --output=out_EDTA.log
#SBATCH --mail-user=youremail@fiu.edu
#SBATCH --mail-type=ALL
module load mamba/23.1.0-4
source activate EDTA
/home/[username]/.conda/envs/EDTA/share/EDTA/EDTA.pl \
--genome elegans.fasta.masked \
--cds elegans.cds.fa \
--rmlib Elegans.repeats\
--overwrite 1 \
--anno 1 \
--threads 10
hit [Esc] and type :wq then hit [Enter]
Most of these options are self explanatory (genome==your softmasked genome sequence)
--rmlib == repeatmodeler library
--overwrite 1 == overwrite previous results if found
--anno 1 == annotate the TEs
For more information see the EDTA manual here
Submit the script:
sbatch EDTA.sh

