phylogenetic: generic pattern for additional inputs by joverlee521 · Pull Request #91 · nextstrain/pathogen-repo-guide

joverlee521 · 2025-08-18T21:11:17Z

Description of proposed changes

Documents the pattern for supporting multiple inputs and points to the zika and avian-flu workflows as examples.

Related issue(s)

Resolves #72

Open questions

Should single inputs be run through augur read-file to ensure inputs are always decompressed? Yup.
Should we use Snakemake modules for merge_inputs.smk? Nope.

subrepo:
  subdir: "shared/vendored"
  merged: "2d063cf"
upstream:
  origin: "https://github.com/nextstrain/shared"
  branch: "main"
  commit: "2d063cf"
git-subrepo:
  version: "0.4.6"
  origin: "https://github.com/ingydotnet/git-subrepo"
  commit: "110b9eb"

Copied from <https://github.com/nextstrain/zika/blob/4e1636b3684733379114739cec33ceae9e9f24bc/phylogenetic/rules/merge_inputs.smk>. Will make edits in subsequent commits to generalize the functions and documentation.

Use the `rules` variable in the `input_*` functions to ensure that the returned merged paths stay in sync with the outputs of the `merge_*` rules. This makes it easier to adapt the template to use wildcards.

Technically, `_gather_inputs`, `input_metadata`, and `input_sequences` are generalized enough to vendor as shared functions. However, I think it's easier to understand how the inputs are handled when they are colocated with the `merge_*` rules. If we used Snakemake modules to import the `merge_*` rules, this whole file could be vendored, but that's a bigger change with unknown pitfalls, so I'll punt on it for now.

Based on PR feedback <#91 (comment)>

Also updates docs for other rules/*.smk files to point to the `input_*` functions as inputs.

Instead of maintaining a custom rule to copy over example data, the CI build config can just directly use the `inputs` param to define the paths to the example data.

The `input_metadata` and `input_sequences` input functions expect one or more inputs, so verify that the config-defined inputs include at least one metadata entry and at least one sequences entry. Without the check, the `augur merge` commands fail loudly, but it's not very obvious that the failure is due to an invalid config input.
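As a rough illustration of that kind of check, here is a minimal plain-Python sketch. The function name `validate_inputs`, the error wording, and the assumed config shape (a list of entries with optional `metadata` and `sequences` keys) are hypothetical and not taken from the actual `merge_inputs.smk`.

```python
def validate_inputs(inputs: list[dict]) -> None:
    """Raise early if no entry provides metadata or no entry provides sequences.

    Assumes (hypothetically) that each entry is a dict that may carry
    "metadata" and/or "sequences" keys pointing at file paths.
    """
    has_metadata = any(entry.get("metadata") for entry in inputs)
    has_sequences = any(entry.get("sequences") for entry in inputs)

    # Collect the kinds of input that are entirely absent from the config.
    missing = [
        kind
        for kind, present in (("metadata", has_metadata), ("sequences", has_sequences))
        if not present
    ]
    if missing:
        raise ValueError(
            "Config `inputs` must define at least one "
            + " and one ".join(missing)
            + " file."
        )


# Hypothetical example config with a single, complete input.
example_inputs = [
    {
        "name": "example",
        "metadata": "example_data/metadata.tsv",
        "sequences": "example_data/sequences.fasta",
    }
]
validate_inputs(example_inputs)  # passes silently
```

Failing at config-parse time like this points the user at the config rather than at a downstream `augur merge` error.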

I didn't want to link to augur.io.open_file and then to xopen, since those seem like implementation details that are not relevant to the end user. Instead, just be clear and list the supported compression formats. Motivated by wanting docs for the pathogen-repo-guide changes in nextstrain/pathogen-repo-guide#91.

joverlee521 · 2025-09-22T18:54:54Z

+def strip_compression_ext(input: str) -> str:
+    """Return `input` without a trailing compression extension, if it has one."""
+    expected_compression_extensions = {
+        ".gz",
+        ".bz2",
+        ".xz",
+        ".zst",
+    }
+    input_path = Path(input)
+    return (
+        str(input_path.with_suffix(""))
+        if input_path.suffix in expected_compression_extensions
+        else input
+    )
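For comparison, here is a string-based sketch of the same idea, assuming the same four extensions. It is not the code from the PR; it avoids `pathlib` so the rest of the string is returned untouched, which could matter if an input were a URL-like string, since `pathlib` treats a repeated `/` as a redundant separator.

```python
# Same extensions as the hunk above; the tuple name is illustrative.
COMPRESSION_EXTENSIONS = (".gz", ".bz2", ".xz", ".zst")


def strip_compression_ext(path: str) -> str:
    """Drop one trailing compression extension, leaving the rest of the string as-is."""
    for ext in COMPRESSION_EXTENSIONS:
        if path.endswith(ext):
            return path[: -len(ext)]
    return path


print(strip_compression_ext("data/metadata.tsv.gz"))
print(strip_compression_ext("data/sequences.fasta"))
```

Either way the extension list is still hardcoded, which is the concern raised in the comment that follows.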


I'm not really happy with hardcoding the compression extensions here to strip the extensions, but I'm not sure how else to preserve the original file path for the decompress_metadata and decompress_sequences rules.

Why try to preserve the original file path? The juice doesn't seem worth the squeeze.

I think it's better to name the outputs with short constant names like most other rules in the workflow. Currently, remote inputs get stored in results/.snakemake/… which is weird:

    augur subsample \
      --sequences results/.snakemake/storage/s3_unsigned/nextstrain-data/files/workflows/WNV/sequences.fasta \
      --metadata results/.snakemake/storage/s3_unsigned/nextstrain-data/files/workflows/WNV/metadata.tsv \

Example of short constant names:

    augur subsample \
      --sequences results/sequences_decompressed.fasta \
      --metadata results/metadata_decompressed.tsv \

We could also preserve the input name to make it more dynamic, but I don't think it's necessary since that information isn't retained in subsequent filenames:

    augur subsample \
      --sequences results/sequences_ncbi.fasta \
      --metadata results/metadata_ncbi.tsv \
      … 
      --output-sequences results/lineage-1A/sequences_filtered.fasta \
      --output-metadata results/lineage-1A/metadata_filtered.tsv

I'd have the decompress and merge rules produce the same output files, e.g. results/metadata.tsv and results/sequences.fasta. This means the data is in a consistent place across builds (useful when inspecting results, for example). It also simplifies usage of those files downstream in the workflow, as they can be referenced by path directly (as per our guidelines) instead of by the pre-defined input functions.

The simplest way to achieve this is by dynamically defining the rules based on the config rather than dynamically defining the inputs and outputs to different sets of static rules. For example,

    if len(_input_metadata) == 1:
        rule decompress_metadata:
            output: "results/metadata.tsv"
            …
    else:
        rule merge_metadata:
            output: "results/metadata.tsv"
            …
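The selection logic in that suggestion can be sketched in plain Python, outside Snakemake, to make the key property explicit: both branches yield the same output path. The helper name `choose_metadata_rule` and the constant `METADATA_OUTPUT` are hypothetical; only the rule names and the output path come from the comment above.

```python
METADATA_OUTPUT = "results/metadata.tsv"  # identical for both branches


def choose_metadata_rule(input_metadata: list[str]) -> tuple[str, str]:
    """Return (rule name, output path) for the given list of metadata inputs."""
    if not input_metadata:
        raise ValueError("at least one metadata input is required")
    # One input only needs decompressing; two or more need merging.
    rule_name = "decompress_metadata" if len(input_metadata) == 1 else "merge_metadata"
    return rule_name, METADATA_OUTPUT


print(choose_metadata_rule(["data/metadata.tsv.zst"]))
print(choose_metadata_rule(["data/a.tsv", "data/b.tsv"]))
```

Because the output path never changes, downstream rules can reference it directly regardless of how many inputs the config defines.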

I kept the original file path to keep wildcards support for single inputs, but I see now that it doesn't work as nicely because the log and benchmark paths still need to be edited (or wrapped in strip_compression_ext as well).

I see the dynamically defined rules can simplify things and we've done a form of this in ncov-ingest. However, from a user-experience standpoint, things are easier to understand/debug when two workflow paths use different file paths. Rather than having to dig through the logs to see how results/metadata.tsv was produced, it's easier to know what happened seeing the files results/metadata_merged.tsv vs results/<input_path>.

> I kept the original file path to keep wildcards support for single inputs

Oh, hmm…

> Rather than having to dig through the logs to see how results/metadata.tsv was produced, it's easier to know what happened seeing the files results/metadata_merged.tsv vs results/<input_path>.

I can understand this perspective, but it also seems like if you're iterating on the same analysis and start with one input and then re-run with 2+ inputs (or vice versa), you're gonna end up with results/metadata.tsv and results/metadata_merged.tsv at the same time. And that seems more confusing than maybe having to (mock gasp) read the logs to verify if a merge happened or not.

Switched to the dynamic rules approach in 17b7dab. With this change, wildcards for single inputs no longer just work and the file paths for both merge_* and decompress_* rules need to be updated, but I think that's ~fine.

Note that the dynamic rules didn't work out as well for avian-flu; see nextstrain/avian-flu@f5c9949.

Ensures that we support the same compression formats for single inputs and multiple inputs across the board. Includes a link to the `augur read-file` docs which shows the exact compression formats that are supported.

…ple inputs

Always produce the same output files for single or multiple inputs so that downstream rules can use the paths directly. When using wildcards in inputs, the file paths in both sets of rules will then need to be updated accordingly.

Based on feedback from @tsibley <#91 (comment)>

victorlin

Tested on WNV and works great. Added some minor comments.

Based on review from @victorlin

Use the standard rules added to pathogen-repo-guide in nextstrain/pathogen-repo-guide#91. The config parameters remain unchanged, so this should not affect outside users. Downstream rules are now expected to use "results/metadata.tsv" and "results/sequences.fasta" as inputs instead of the `input_*` functions.

Allow users to define multiple inputs via config, following the standardized multiple input support implemented in the pathogen repo guide.¹

Resolves #106
Resolves #82

¹ <nextstrain/pathogen-repo-guide#91>

joverlee521 added 2 commits August 18, 2025 11:33

git subrepo pull (merge) shared/vendored

713be21

phylogenetic: include shared *.smk files

d899fdb

genehack approved these changes Aug 19, 2025

jameshadfield reviewed Aug 20, 2025

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

jameshadfield reviewed Aug 20, 2025

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

joverlee521 added 5 commits August 22, 2025 11:21

phylogenetic: copy merge_inputs.smk from zika

66cca7e

merge_inputs: generalize docs

9dc453f

merge_inputs: add docs for compression support

adbf1af

phylogenetic: include merge_inputs rules file

1bc57a0

joverlee521 force-pushed the user-data-pattern branch from 6e7b178 to 1bc57a0 on August 22, 2025 20:41

phylogenetic/ci: update config to use inputs param

42b74e6

joverlee521 mentioned this pull request Aug 27, 2025

phylo: Use multiple inputs to include PPX RESTRICTED data nextstrain/rsv#97

Closed

joverlee521 mentioned this pull request Sep 5, 2025

read-file/write-file: Add supported compression formats to docs nextstrain/augur#1881

Merged

4 tasks

victorlin mentioned this pull request Sep 15, 2025

Minor issue: missing -r in copying ingest to phylo nextstrain/ebola#26

Closed

joverlee521 force-pushed the user-data-pattern branch from 8ae0b19 to 37c82c7 on September 22, 2025 18:34

joverlee521 commented Sep 22, 2025

tsibley reviewed Sep 22, 2025

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

phylo/merge_inputs: decompress single inputs through augur read-file

fed74be

joverlee521 force-pushed the user-data-pattern branch from 37c82c7 to fed74be on September 22, 2025 21:56

This was referenced Sep 23, 2025

Update workflows for nextstrain run #93

Merged

Write blog for multiple inputs config nextstrain/nextstrain.org#1231

Closed

This was referenced Sep 23, 2025

Update input/output docs nextstrain/WNV#104

Merged

Standardize multiple inputs nextstrain/WNV#110

Merged

joverlee521 mentioned this pull request Sep 23, 2025

Add/update multiple input support for existing pathogens nextstrain/public#25

Open

19 tasks

victorlin approved these changes Sep 24, 2025

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

Comment thread phylogenetic/rules/prepare_sequences.smk

phylo/merge_inputs: small docs & error message edits

8b25198

victorlin approved these changes Sep 24, 2025

joverlee521 merged commit cb87e1a into main Sep 24, 2025
1 check passed

joverlee521 deleted the user-data-pattern branch September 24, 2025 22:03

joverlee521 mentioned this pull request Sep 24, 2025

Add blog post for standardized multiple inputs nextstrain/nextstrain.org#1233

Merged

3 tasks

joverlee521 mentioned this pull request Oct 8, 2025

phylogenetic: Add standardized multiple inputs nextstrain/rsv#108

Merged

2 tasks

j23414 mentioned this pull request Oct 8, 2025

phylogenetic: Add standardized multiple input support nextstrain/mumps#47

Open

joverlee521 mentioned this pull request Jan 27, 2026

phylo: Add multiple inputs support nextstrain/measles#107

Merged

3 tasks

joverlee521 Sep 22, 2025

tsibley Sep 23, 2025

victorlin Sep 23, 2025

tsibley Sep 23, 2025

joverlee521 Sep 23, 2025

tsibley Sep 23, 2025 (edited)

joverlee521 Sep 23, 2025

joverlee521 Sep 26, 2025

5 participants
