Prompted by a discussion question to summarize the consensus in this issue.
The latest iteration of the ingest workflow for segmented viruses is the oropouche ingest workflow, implemented in nextstrain/oropouche#18. At a high level, the oropouche ingest workflow does the following:
- download metadata and sequences from NCBI
- run through the usual curation pipeline
- align sequences to segment references via nextclade run to separate sequences by segment
- merge metadata and Nextclade outputs
- transform metadata from one row per accession to one row per strain, where a single strain is linked to multiple segment sequences
- output 1 metadata.tsv + N segment FASTAs
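The row-per-accession to row-per-strain step above can be sketched roughly as follows. This is only an illustration, not the code in nextstrain/oropouche#18: the function name group_by_strain, the input columns accession, strain and segment, and the output columns accession_<segment> and n_segments are all assumptions made for the example.

```python
def group_by_strain(rows, segments):
    """Collapse one-row-per-accession records into one row per strain.

    rows: iterable of dicts with 'accession', 'strain' and 'segment' keys
          (hypothetical column names for this sketch).
    segments: list of segment names, e.g. ["L", "M", "S"].
    """
    by_strain = {}
    for row in rows:
        segment = row["segment"]
        if segment not in segments:
            continue  # skip records that were not assigned to a known segment
        strain = row["strain"]
        # Create the strain-level row on first sight, with one empty
        # accession column per segment.
        merged = by_strain.setdefault(
            strain,
            {"strain": strain, **{f"accession_{s}": "" for s in segments}},
        )
        # Simplification: if a strain has several accessions for the same
        # segment, the last one seen wins.
        merged[f"accession_{segment}"] = row["accession"]

    # Count how many segments are linked to each strain.
    for merged in by_strain.values():
        merged["n_segments"] = sum(1 for s in segments if merged[f"accession_{s}"])
    return list(by_strain.values())


if __name__ == "__main__":
    example = [
        {"accession": "A1", "strain": "s1", "segment": "L"},
        {"accession": "A2", "strain": "s1", "segment": "M"},
        {"accession": "A3", "strain": "s2", "segment": "S"},
    ]
    for strain_row in group_by_strain(example, ["L", "M", "S"]):
        print(strain_row)
```

A real workflow would also need to decide which accession's metadata (date, location, etc.) to carry over to the strain row; the sketch leaves that out.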
TODOs
Previous content
Related to https://github.com/nextstrain/private/issues/102 and #50
Writing out current methods for ingesting segmented viruses. These are slightly different from gene-specific builds like RSV or measles because the upstream records in NCBI GenBank are per segment rather than per genome.
Avian flu example
The segment and strain fields are pretty standardized for the recent H5N1 outbreak, so we are able to directly use that metadata to match segments of the same metadata record.
- Pull segment/strain name from NCBI Virus.
- All data is processed through the usual curation pipeline.
- Then split into segment metadata + sequences
- Finally, loop through the segment metadata to add an n_segments column tracking how many segments are linked to the metadata record. There is no "merging" of segment metadata; we just use the metadata for the HA segment.
This results in 1 metadata TSV + 8 FASTAs where the segment sequences are linked to the metadata via a unique strain.
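A minimal sketch of the n_segments step, assuming each segment's metadata has already been reduced to a list of strain names. The function name add_n_segments and the input shapes are hypothetical and are not taken from the avian flu workflow.

```python
from collections import Counter


def add_n_segments(ha_metadata, segment_strains):
    """Add an n_segments column to the HA metadata rows.

    ha_metadata: list of dicts, each with a 'strain' key.
    segment_strains: dict mapping segment name -> iterable of strain names
                     found in that segment's metadata (assumed input shape).
    """
    counts = Counter()
    for strains in segment_strains.values():
        # Use a set so a strain listed twice in one segment counts once.
        counts.update(set(strains))
    # No merging of segment metadata: only the HA rows are kept.
    return [{**row, "n_segments": counts[row["strain"]]} for row in ha_metadata]


if __name__ == "__main__":
    ha_rows = [{"strain": "A/x"}, {"strain": "A/y"}]
    per_segment = {"ha": ["A/x", "A/y"], "na": ["A/x"], "pb2": ["A/x", "A/x"]}
    for row in add_n_segments(ha_rows, per_segment):
        print(row)
```

Because only HA rows are returned, a strain with no HA record is dropped from the metadata even if its other segments are present, which matches the "use the metadata for the HA segment" choice described above.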
Lassa example
I had originally thought we could replicate the avian flu ingest in lassa, but the lack of standardized segment and strain fields makes this difficult. @j23414 has implemented an alternative method in nextstrain/lassa#12.
- Workflow uses the usual datasets download and curation pipeline.
- Use nextclade run to align records to L and S reference sequences to separate the segment sequences.
- Use augur filter to subset the metadata by segment sequences, using accession as the metadata id column.
- The original metadata + sequences files that contain all records are kept as a record of samples that failed to align to either L or S.
This results in 3 metadata TSVs + 3 FASTAs. The metadata/sequences with all records are only used for debugging purposes. The phylogenetic workflow would start from the L/S metadata.tsv + sequences.fasta, where the records are linked by a unique accession.
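The subsetting logic can be illustrated in plain Python. This is a sketch of the idea only: the actual workflow uses nextclade run and augur filter, and the function name split_by_segment and its inputs (a set of aligned accessions per segment) are assumptions for the example.

```python
def split_by_segment(metadata, aligned):
    """Subset metadata rows by which segment each accession aligned to.

    metadata: list of dicts, each with an 'accession' key.
    aligned: dict mapping segment name -> set of accessions that aligned
             to that segment's reference (assumed to be derived from the
             alignment output).
    Returns (subsets, unmatched): per-segment metadata rows, plus the rows
    that aligned to no segment, kept for debugging.
    """
    subsets = {
        segment: [row for row in metadata if row["accession"] in accessions]
        for segment, accessions in aligned.items()
    }
    matched = set().union(*aligned.values()) if aligned else set()
    unmatched = [row for row in metadata if row["accession"] not in matched]
    return subsets, unmatched


if __name__ == "__main__":
    all_rows = [{"accession": "X1"}, {"accession": "X2"}, {"accession": "X3"}]
    per_segment, failed = split_by_segment(all_rows, {"L": {"X1"}, "S": {"X2"}})
    print(per_segment)
    print(failed)
```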
I toyed with the idea of creating strain names for lassa, but the lack of data makes it difficult to follow our usual pattern of <location>/<sample_id>/<year>. The lack of linked BioSample records also prevents us from using the BioSample accession to link segments.
Other viruses?
I think we got lucky with the H5N1 data. Other segmented viruses are highly likely to have less standardized data, like lassa. The default method for ingesting segmented virus data from NCBI should probably follow the lassa example.