Skip to content

phylogenetic: generic pattern for additional inputs#91

Merged
joverlee521 merged 12 commits into
mainfrom
user-data-pattern
Sep 24, 2025
Merged

phylogenetic: generic pattern for additional inputs#91
joverlee521 merged 12 commits into
mainfrom
user-data-pattern

Conversation

@joverlee521
Copy link
Copy Markdown
Contributor

@joverlee521 joverlee521 commented Aug 18, 2025

Description of proposed changes

Documents the pattern for supporting multiple inputs and points to the zika and avian-flu workflows as examples.

Related issue(s)

Resolves #72

Open questions

subrepo:
  subdir:   "shared/vendored"
  merged:   "2d063cf"
upstream:
  origin:   "https://github.com/nextstrain/shared"
  branch:   "main"
  commit:   "2d063cf"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Use `rules` variable for `input_*` functions to ensure that the returned merged 
path stay in-sync with the `merge_*` rules outputs. This makes it easier to 
adapt the template to use wildcards.

Technically `_gather_inputs`, `input_metadata`, and `input_sequences` are 
generalized enough to vendor as shared functions. However, I think it's easier 
to understand how the inputs are handled when they are colocated with the 
`merge_*` rules. If we use Snakemake modules to use the `merge_*` rules, then 
this whole file can be vendored, but that's a bigger change with unknown 
pitfalls so I'll punt on it for now.
Also updates docs for other rules/*.smk files to point to the 
`input_*` functions as inputs.
Instead of maintaining a custom rule to copy over example data, the CI build 
config can just directly use the `inputs` param to define the paths to the 
example data.
The `input_metadata` and `input_sequences` input functions expect one or 
more inputs, so add verification that config defined inputs have at least one 
metadata and sequences. Without the check, the `augur merge` commands fail 
loudly but it's not very obvious that it's due to an invalid config input.
joverlee521 added a commit to nextstrain/augur that referenced this pull request Sep 5, 2025
I didn't want to have a link to augur.io.open_file and then link to xopen since
that seem like implementation details that are not relevant to the end user. 
Just be clear and list out the supported compression formats.

Motivated by wanting docs for the pathogen-repo-guide changes in 
nextstrain/pathogen-repo-guide#91
joverlee521 added a commit to nextstrain/augur that referenced this pull request Sep 8, 2025
I didn't want to have a link to augur.io.open_file and then link to xopen since
that seem like implementation details that are not relevant to the end user. 
Just be clear and list out the supported compression formats.

Motivated by wanting docs for the pathogen-repo-guide changes in 
nextstrain/pathogen-repo-guide#91
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Comment on lines +106 to +118
def strip_compression_ext(input: str) -> str:
expected_compression_extensions = {
".gz",
".bz2",
".xz",
".zst",
}
input_path= Path(input)
return (
str(input_path.with_suffix(""))
if input_path.suffix in expected_compression_extensions
else input
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really happy with hardcoding the compression extensions here to strip the extensions, but I'm not sure how else to preserve the original file path for the decompress_metadata and decompress_sequences rules.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why try to preserve the original file path? The juice doesn't seem worth the squeeze.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to name the outputs with short constant names like most other rules in the workflow. Currently, remote inputs get stored in results/.snakemake/… which is weird:

augur subsample \
    --sequences results/.snakemake/storage/s3_unsigned/nextstrain-data/files/workflows/WNV/sequences.fasta \
    --metadata results/.snakemake/storage/s3_unsigned/nextstrain-data/files/workflows/WNV/metadata.tsv \

Example of short constant names:

augur subsample \
    --sequences results/sequences_decompressed.fasta \
    --metadata results/metadata_decompressed.tsv \

We could also preserve the input name to make it more dynamic, but I don't think it's necessary since that information isn't retained in subsequent filenames:

augur subsample \
    --sequences results/sequences_ncbi.fasta \
    --metadata results/metadata_ncbi.tsv \
    …
    --output-sequences results/lineage-1A/sequences_filtered.fasta \
    --output-metadata results/lineage-1A/metadata_filtered.tsv

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think to have the decompress vs. merge rules produce the same output file, e.g. results/metadata.tsv and results/sequences.fasta. This means the data's in a consistent place across builds (useful when inspecting results, for example). It also simplifies usage of those files downstream in the workflow, as they can be referenced by path directly (as per our guidelines) instead of by the pre-defined input functions.

The simplest way to achieve this is by dynamically defining the rules based on the config rather than dynamically defining the inputs and outputs to different sets of static rules. For example,

if len(_input_metadata) == 1:
    rule decompress_metadata:
        output: "results/metadata.tsv"else:
    rule merge_metadata:
        output: "results/metadata.tsv"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the original file path to keep wildcards support for single inputs, but seeing now that doesn't work as nicely because the log and benchmark paths still need to be edited (or log and benchmark paths need to be wrapped in strip_compression_ext as well).


I see the dynamically defined rules can simplify things and we've done a form of this in ncov-ingest. However, from a user experience, things are easier to understand/debug when two workflow paths use different file paths. Rather than having to dig through the logs to see how results/metadata.tsv was produced, it's easier to know what happened seeing the files results/metadata_merged.tsv vs results/<input_path>.

Copy link
Copy Markdown
Contributor

@tsibley tsibley Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the original file path to keep wildcards support for single inputs

Oh, hmm…

Rather than having to dig through the logs to see how results/metadata.tsv was produced, it's easier to know what happened seeing the files results/metadata_merged.tsv vs results/<input_path>.

I can understand this perspective, but it also seems like if you're iterating on the same analysis and start with one input and then re-run with 2+ inputs (or vice versa), you're gonna end up with results/metadata.tsv and results/metadata_merged.tsv at the same time. And that seems more confusing than maybe having to (mock gasp) read the logs to verify if a merge happened or not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to the dynamic rules approach in 17b7dab. With this change, wildcards for single inputs no longer just work and the files paths for both merge_* and decompress_* rules need to be updated, but I think thats ~fine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the dynamic rules didn't work out as well avian-flu, see nextstrain/avian-flu@f5c9949.

Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Ensures that we support the same compression formats for single inputs and 
multiple inputs across the board. Includes a link to the `augur read-file` docs
which shows the exact compression formats that are supported.
…ple inputs

Always produce the same output files for single or multiple inputs so that 
downstream rules can use the paths directly. When using wildcards in inputs, 
the file paths in both sets of rules will then need to be updated accordingly. 

Based on feedback from @tsibley
<#91 (comment)>
Copy link
Copy Markdown
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on WNV and works great. Added some minor comments.

Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Comment thread phylogenetic/rules/prepare_sequences.smk
@joverlee521 joverlee521 merged commit cb87e1a into main Sep 24, 2025
1 check passed
@joverlee521 joverlee521 deleted the user-data-pattern branch September 24, 2025 22:03
joverlee521 added a commit to nextstrain/zika that referenced this pull request Sep 24, 2025
Use the standard rules add to pathogen-repo-guide in 
nextstrain/pathogen-repo-guide#91.

The config parameters remain unchanged so this should not affect outside users. 
Downstream rules are now expected to use "results/metadata.tsv" and 
"results/sequences.fasta" as inputs instead of the `input_*` functions.
joverlee521 added a commit to nextstrain/measles that referenced this pull request Jan 27, 2026
Allow users to define multiple inputs via config, following the
standardized multiple input support implemented in the pathogen repo
guide.¹

Resolves #106
Resolves #82

¹ <nextstrain/pathogen-repo-guide#91>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Provide a generic pattern for including additional user data alongside curated data

5 participants