phylogenetic: generic pattern for additional inputs#91
Conversation
subrepo: subdir: "shared/vendored" merged: "2d063cf" upstream: origin: "https://github.com/nextstrain/shared" branch: "main" commit: "2d063cf" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"
Copied from <https://github.com/nextstrain/zika/blob/4e1636b3684733379114739cec33ceae9e9f24bc/phylogenetic/rules/merge_inputs.smk> Will make edits in subsequent commits to generalize the functions and documentation.
Use `rules` variable for `input_*` functions to ensure that the returned merged path stay in-sync with the `merge_*` rules outputs. This makes it easier to adapt the template to use wildcards. Technically `_gather_inputs`, `input_metadata`, and `input_sequences` are generalized enough to vendor as shared functions. However, I think it's easier to understand how the inputs are handled when they are colocated with the `merge_*` rules. If we use Snakemake modules to use the `merge_*` rules, then this whole file can be vendored, but that's a bigger change with unknown pitfalls so I'll punt on it for now.
Based on PR feedback <#91 (comment)>
Also updates docs for other rules/*.smk files to point to the `input_*` functions as inputs.
6e7b178 to
1bc57a0
Compare
Instead of maintaining a custom rule to copy over example data, the CI build config can just directly use the `inputs` param to define the paths to the example data.
The `input_metadata` and `input_sequences` input functions expect one or more inputs, so add verification that config defined inputs have at least one metadata and sequences. Without the check, the `augur merge` commands fail loudly but it's not very obvious that it's due to an invalid config input.
I didn't want to have a link to augur.io.open_file and then link to xopen since that seem like implementation details that are not relevant to the end user. Just be clear and list out the supported compression formats. Motivated by wanting docs for the pathogen-repo-guide changes in nextstrain/pathogen-repo-guide#91
I didn't want to have a link to augur.io.open_file and then link to xopen since that seem like implementation details that are not relevant to the end user. Just be clear and list out the supported compression formats. Motivated by wanting docs for the pathogen-repo-guide changes in nextstrain/pathogen-repo-guide#91
8ae0b19 to
37c82c7
Compare
| def strip_compression_ext(input: str) -> str: | ||
| expected_compression_extensions = { | ||
| ".gz", | ||
| ".bz2", | ||
| ".xz", | ||
| ".zst", | ||
| } | ||
| input_path= Path(input) | ||
| return ( | ||
| str(input_path.with_suffix("")) | ||
| if input_path.suffix in expected_compression_extensions | ||
| else input | ||
| ) |
There was a problem hiding this comment.
I'm not really happy with hardcoding the compression extensions here to strip the extensions, but I'm not sure how else to preserve the original file path for the decompress_metadata and decompress_sequences rules.
There was a problem hiding this comment.
Why try to preserve the original file path? The juice doesn't seem worth the squeeze.
There was a problem hiding this comment.
I think it's better to name the outputs with short constant names like most other rules in the workflow. Currently, remote inputs get stored in results/.snakemake/… which is weird:
augur subsample \
--sequences results/.snakemake/storage/s3_unsigned/nextstrain-data/files/workflows/WNV/sequences.fasta \
--metadata results/.snakemake/storage/s3_unsigned/nextstrain-data/files/workflows/WNV/metadata.tsv \
Example of short constant names:
augur subsample \
--sequences results/sequences_decompressed.fasta \
--metadata results/metadata_decompressed.tsv \
We could also preserve the input name to make it more dynamic, but I don't think it's necessary since that information isn't retained in subsequent filenames:
augur subsample \
--sequences results/sequences_ncbi.fasta \
--metadata results/metadata_ncbi.tsv \
…
--output-sequences results/lineage-1A/sequences_filtered.fasta \
--output-metadata results/lineage-1A/metadata_filtered.tsv
There was a problem hiding this comment.
I'd think to have the decompress vs. merge rules produce the same output file, e.g. results/metadata.tsv and results/sequences.fasta. This means the data's in a consistent place across builds (useful when inspecting results, for example). It also simplifies usage of those files downstream in the workflow, as they can be referenced by path directly (as per our guidelines) instead of by the pre-defined input functions.
The simplest way to achieve this is by dynamically defining the rules based on the config rather than dynamically defining the inputs and outputs to different sets of static rules. For example,
if len(_input_metadata) == 1:
rule decompress_metadata:
output: "results/metadata.tsv"
…
else:
rule merge_metadata:
output: "results/metadata.tsv"
…There was a problem hiding this comment.
I kept the original file path to keep wildcards support for single inputs, but seeing now that doesn't work as nicely because the log and benchmark paths still need to be edited (or log and benchmark paths need to be wrapped in strip_compression_ext as well).
I see the dynamically defined rules can simplify things and we've done a form of this in ncov-ingest. However, from a user experience, things are easier to understand/debug when two workflow paths use different file paths. Rather than having to dig through the logs to see how results/metadata.tsv was produced, it's easier to know what happened seeing the files results/metadata_merged.tsv vs results/<input_path>.
There was a problem hiding this comment.
I kept the original file path to keep wildcards support for single inputs
Oh, hmm…
Rather than having to dig through the logs to see how
results/metadata.tsvwas produced, it's easier to know what happened seeing the filesresults/metadata_merged.tsvvsresults/<input_path>.
I can understand this perspective, but it also seems like if you're iterating on the same analysis and start with one input and then re-run with 2+ inputs (or vice versa), you're gonna end up with results/metadata.tsv and results/metadata_merged.tsv at the same time. And that seems more confusing than maybe having to (mock gasp) read the logs to verify if a merge happened or not.
There was a problem hiding this comment.
Switched to the dynamic rules approach in 17b7dab. With this change, wildcards for single inputs no longer just work and the files paths for both merge_* and decompress_* rules need to be updated, but I think thats ~fine.
There was a problem hiding this comment.
Note the dynamic rules didn't work out as well avian-flu, see nextstrain/avian-flu@f5c9949.
Ensures that we support the same compression formats for single inputs and multiple inputs across the board. Includes a link to the `augur read-file` docs which shows the exact compression formats that are supported.
37c82c7 to
fed74be
Compare
…ple inputs Always produce the same output files for single or multiple inputs so that downstream rules can use the paths directly. When using wildcards in inputs, the file paths in both sets of rules will then need to be updated accordingly. Based on feedback from @tsibley <#91 (comment)>
victorlin
left a comment
There was a problem hiding this comment.
Tested on WNV and works great. Added some minor comments.
Based on review from @victorlin
Use the standard rules add to pathogen-repo-guide in nextstrain/pathogen-repo-guide#91. The config parameters remain unchanged so this should not affect outside users. Downstream rules are now expected to use "results/metadata.tsv" and "results/sequences.fasta" as inputs instead of the `input_*` functions.
Allow users to define multiple inputs via config, following the standardized multiple input support implemented in the pathogen repo guide.¹ Resolves #106 Resolves #82 ¹ <nextstrain/pathogen-repo-guide#91>
Description of proposed changes
Documents the pattern for supporting multiple inputs and points to the zika and avian-flu workflows as examples.
Related issue(s)
Resolves #72
Open questions
augur read-fileto ensure inputs are always decompressed? - yup