Skip to content

tRNA taxonomy v226#2492

Open
FlorianTrigodet wants to merge 2 commits intomasterfrom
tRNA_taxonomy_v226
Open

tRNA taxonomy v226#2492
FlorianTrigodet wants to merge 2 commits intomasterfrom
tRNA_taxonomy_v226

Conversation

@FlorianTrigodet
Copy link
Contributor

New version of the tRNA taxonomy using GTDB v226.
I ran a quick test and it worked fine. I can also see that the database version's check are working.
I also noticed that in the original commit for the first and only version of the tRNA taxonomy database in anvi'o that we would make a README file showing how we generated the data. But, alas, no such README was ever committed.

I will write a short one explaining how I got the data and how I transformed it.

@FlorianTrigodet
Copy link
Contributor Author

WELL. I just pushed the README.txt to the master branch 🤦

@FlorianTrigodet
Copy link
Contributor Author

Content of the README:

We used the genome collection of GlobDB v226 (https://globdb.org), which itself contains all genome representatives of GTDB v226.
GlobDB already provides contigs-databases with annotated tRNA, through the command anvi-scan-trnas.

We made an external-genomes-txt file with all the GTDB genomes present in GlobDB and used anvi-get-sequences-for-hmm-hits to extract the sequences for all tRNAs:

anvi-get-sequences-for-hmm-hits -e external-genomes.txt --hmm-sources Transfer_RNAs -o all_tRNAs.fa

The headers of that fasta file contains the information about the anticodon and the genome's id:

>Ile_GAT___Transfer_RNAs___4068027d3b59b533bad5e550e79d768a08c97ce5fbceb337f7a6171e bin_id:GCA_000008085|source:Transfer_RNAs|e_value:92.8|contig:hash112713c5_GCA_000008085_000000000001|gene_callers_id:hash112713c5_612|start:7000|stop:7074|length:74

Using that information, we can create as many fasta file as anticodon and only keep the genome's name as the defline:

mkdir -p ANTICODON_SEARCH_DATABASES

awk -v outdir="ANTICODON_SEARCH_DATABASES" '
    /^>/ {
        # parse anticodon and genome id from header
        match($0, />[^_]+_([A-Z]{3})___.*bin_id:([^|]+)/, m)
        anticodon = m[1]
        genome = m[2]
        # print sequence under genome ID
        if (anticodon != "" && genome != "") {
            fname = outdir "/" anticodon
            print ">" genome >> fname
        }
        next
    }
    {
        print >> fname
    }
' all_tRNAs.fa

# gzip all outputs
gzip ANTICODON_SEARCH_DATABASES/*

Finally, we can use the taxonomy file provided in GlobDB to construct the table mapping the genome's name/accession to a taxonomy:

grep "^>" all_tRNAs.fa | sed -E 's/.*bin_id:([^|]+).*/\1/' | sort -u > genome_ids.txt
grep -Ff genome_ids.txt globdb_r226_taxonomy.tsv > ACCESSION_TO_TAXONOMY.txt
gzip ACCESSION_TO_TAXONOMY.txt && rm genome_ids.txt

@ivagljiva
Copy link
Contributor

WELL. I just pushed the README.txt to the master branch 🤦

LOL 😂

@meren
Copy link
Member

meren commented Nov 10, 2025

Thank you very much for this, @FlorianTrigodet!

I think it would be great if we could translate your README.txt into something that could go into the help docs for anvi-estimate-trna-taxonomy under a section, say, "Notes for Advanced Users and Programmers" where we describe how the underlying data for this can be updated.

@meren
Copy link
Member

meren commented Nov 10, 2025

Plus, what do you mean by that you tested and it worked? Does it mean you run the old and new database on a few contigs-db files do you get reasonably comparable results? :)

@FlorianTrigodet
Copy link
Contributor Author

I ran the old an new version on a metagenomic assembly (human gut) and from a quick glance at a few dozen annotations (out of >1200 annotation for AUG) the results are relatively comparable.
I cannot do a simple diff because GTDB changed quite a few phylum name since v89.
Here is the file that compare the previous and new annotation for each tRNA CAT (start codon AUG):
tRNA_CAT_v89_vs_v226.txt

As for moving the readme to the documentation, should we move it to anvi-setup-trna-taxonomy or anvi-run-trna-taxonomy?

@meren
Copy link
Member

meren commented Nov 11, 2025

I had suggested anvi-estimate-trna-taxonomy, but I think anvi-setup-trna-taxonomy would be much better to keep that information I guess :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants