ingest: How to handle segmented viruses

_Prompted by a [discussion question](https://discussion.nextstrain.org/t/consensus-on-how-to-deal-with-segmented-viruses/1994) to summarize the consensus in this issue._

The latest iteration of the ingest workflow for segmented viruses is the oropouche ingest workflow, implemented in https://github.com/nextstrain/oropouche/pull/18. At a high level, the oropouche ingest workflow does the following:

1. download metadata and sequences from NCBI
2. run through the usual curation pipeline
3. align sequences to segment references via `nextclade run` to separate sequences by segment
4. merge metadata and Nextclade outputs
5. transform metadata from one row per accession to one row per strain, where a single strain is linked to multiple segment sequences
6. output 1 metadata.tsv + N segment FASTAs
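
Step 5 is the least obvious part, so here is a minimal sketch of the reshaping in plain Python. It is not the code from the oropouche PR: the function name `collapse_to_strains`, the `accession_<segment>` output columns, and the example accessions and strain names are all hypothetical, and it assumes each input row already carries `accession`, `strain`, and `segment` fields after the merge in step 4.

```python
from collections import defaultdict


def collapse_to_strains(rows, segments):
    """Collapse one-row-per-accession records into one row per strain.

    Hypothetical helper: each output row has a `strain` column plus one
    `accession_<segment>` column per segment (empty if that segment is missing).
    """
    by_strain = defaultdict(dict)
    for row in rows:
        segment = row["segment"]
        if segment not in segments:
            continue  # skip records that were not assigned to a known segment
        record = by_strain[row["strain"]]
        record.setdefault("strain", row["strain"])
        # Keep the first accession seen for each strain/segment pair.
        record.setdefault(f"accession_{segment}", row["accession"])

    # Fill in empty values so every row has the same columns.
    for record in by_strain.values():
        for segment in segments:
            record.setdefault(f"accession_{segment}", "")
    return list(by_strain.values())


# Made-up records for a three-segment virus (L, M, S).
rows = [
    {"accession": "ACC_1", "strain": "strain_a", "segment": "L"},
    {"accession": "ACC_2", "strain": "strain_a", "segment": "M"},
    {"accession": "ACC_3", "strain": "strain_a", "segment": "S"},
    {"accession": "ACC_4", "strain": "strain_b", "segment": "S"},
]
per_strain = collapse_to_strains(rows, ["L", "M", "S"])
```

Here `strain_a` ends up with all three accession columns filled, while `strain_b` only has `accession_S`. How to resolve conflicting metadata between segments of the same strain is a separate decision that this sketch sidesteps by keeping only the linking columns.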

### TODOs 

- [ ] add docs for segmented viruses in [Creating an ingest workflow](https://docs.nextstrain.org/en/latest/tutorials/creating-a-pathogen-repo/creating-an-ingest-workflow.html)

---

<details>
<summary> Previous content </summary>

Related to https://github.com/nextstrain/private/issues/102 and https://github.com/nextstrain/pathogen-repo-guide/issues/50

Writing out current methods for ingesting segmented viruses. These are slightly different from gene-specific builds like RSV or measles because the upstream records in NCBI GenBank are per segment rather than per genome.

### Avian flu example

The `segment` and `strain` fields are pretty standardized for the recent H5N1 outbreak, so we are able to directly use that metadata to match segments of the same metadata record. 

1. [Pull segment/strain name from NCBI Virus](https://github.com/nextstrain/avian-flu/blob/74b95ff842c0931cd85dbb90d21344f2190aa55a/ingest/build-configs/ncbi/bin/ncbi-virus-url#L65-L66). 
2. All data is processed through the usual curation pipeline.
3. Then [split into segment metadata + sequences](https://github.com/nextstrain/avian-flu/blob/74b95ff842c0931cd85dbb90d21344f2190aa55a/ingest/build-configs/ncbi/rules/curate.smk#L108)
4. Finally, loop through the segment metadata [to add an `n_segments` column](https://github.com/nextstrain/avian-flu/blob/74b95ff842c0931cd85dbb90d21344f2190aa55a/ingest/rules/merge_segment_metadata.smk#L7) tracking how many segments are linked to the metadata record. There is no "merging" of segment metadata; we just use the metadata for the HA segment.

This results in 1 metadata TSV + 8 FASTAs where the segment sequences are linked to the metadata via a unique `strain`. 
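
As a rough illustration of step 4, the sketch below counts how many segments each `strain` appears in and attaches that count to the HA metadata. It is not the linked Snakemake rule: the function name `add_n_segments`, the input shapes, and the strain names are hypothetical, and only three of the eight segments are shown to keep the example short.

```python
from collections import Counter


def add_n_segments(ha_metadata, strains_by_segment):
    """Return HA metadata rows with an added `n_segments` column.

    `strains_by_segment` maps a segment name to the strain names found in
    that segment's metadata (hypothetical input shape).
    """
    counts = Counter()
    for strains in strains_by_segment.values():
        # Use a set so a strain duplicated within one segment counts once.
        counts.update(set(strains))
    return [{**row, "n_segments": counts[row["strain"]]} for row in ha_metadata]


# Made-up strain names.
strains_by_segment = {
    "ha": ["strain_1", "strain_2"],
    "na": ["strain_1"],
    "pb2": ["strain_1", "strain_2"],
}
ha_metadata = [{"strain": "strain_1"}, {"strain": "strain_2"}]
with_counts = add_n_segments(ha_metadata, strains_by_segment)
```

Because the HA rows are used as the base, a strain with no HA record would not appear in the output at all, which matches the "no merging" behavior described above.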
 
### Lassa example 

I had originally thought we could replicate the avian flu ingest in lassa, but the [lack of standardized segment](https://github.com/nextstrain/lassa/pull/12#issuecomment-2251439504) and strain fields makes this difficult. @j23414 has implemented an alternative method in https://github.com/nextstrain/lassa/pull/12.

1. Workflow uses the usual `datasets download` and curation pipeline. 
2. Use `nextclade run` to align records to L and S reference sequences to separate the segment sequences. 
3. Use `augur filter` to subset the metadata by segment sequences, using `accession` as the metadata id column. 
4. The original metadata + sequences files that contain _all_ records are kept as a record of samples that failed to align to either L or S. 

This results in 3 metadata TSVs + 3 FASTAs. The metadata/sequences with _all_ records are only used for debugging purposes. The phylogenetic workflow would start from the L/S metadata.tsv + sequences.fasta, where the records are linked by a unique `accession`.  
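
To make the subsetting logic concrete, here is a small plain-Python stand-in for steps 3 and 4. The actual workflow uses `augur filter` for this; the helper names (`read_fasta_ids`, `subset_by_segment`) and the accessions below are hypothetical, and the sketch assumes the per-segment FASTAs produced from the `nextclade run` alignments use the accession as the record id.

```python
def read_fasta_ids(fasta_text):
    """Return the set of record ids (first word of each FASTA header line)."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def subset_by_segment(metadata, ids_by_segment):
    """Split metadata rows by segment, keyed on `accession`.

    Returns (subsets, unassigned), where `unassigned` holds the rows whose
    accession is in none of the segment FASTAs.
    """
    subsets = {
        segment: [row for row in metadata if row["accession"] in ids]
        for segment, ids in ids_by_segment.items()
    }
    assigned = set().union(*ids_by_segment.values())
    unassigned = [row for row in metadata if row["accession"] not in assigned]
    return subsets, unassigned


# Made-up data: ACC_4 failed to align to either segment reference.
l_fasta = ">ACC_1\nACGT\n>ACC_3\nACGT\n"
s_fasta = ">ACC_2\nACGT\n"
metadata = [{"accession": f"ACC_{i}"} for i in range(1, 5)]

subsets, unassigned = subset_by_segment(
    metadata, {"L": read_fasta_ids(l_fasta), "S": read_fasta_ids(s_fasta)}
)
```

In this example the L subset holds `ACC_1` and `ACC_3`, the S subset holds `ACC_2`, and `ACC_4` is the kind of record that only shows up in the "all records" files kept for debugging.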

I [toyed with the idea of creating strain names](https://github.com/nextstrain/lassa/pull/12#issuecomment-2251520598) for lassa, but the lack of data makes it difficult to follow our usual pattern of `<location>/<sample_id>/<year>`. The lack of linked BioSample records also prevents us from using the BioSample accession to link segments. 

### Other viruses?

I think we got lucky with the H5N1 data. Other segmented viruses are highly likely to have less standardized data, as lassa does. The default method for ingesting segmented virus data from NCBI should probably follow the lassa example. 

</details>

ingest: How to handle segmented viruses #59
