nextstrain · joverlee521 · Sep 24, 2025 · Aug 18, 2025 · Aug 18, 2025 · Aug 22, 2025
diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile
@@ -20,6 +20,10 @@ rule all:
         auspice_json="",
 
 
+# Shared Snakemake files with generic functions are shared across pathogens
+include: "../shared/vendored/snakemake/config.smk"
+include: "../shared/vendored/snakemake/remote_files.smk"
+
 # These rules are imported in the order that they are expected to run.
 # Each Snakefile will have documented inputs and outputs that should be kept as
 # consistent interfaces across pathogen repos. This allows us to define typical
@@ -30,6 +34,7 @@ rule all:
 # If there are build specific customizations, they should be added with the
 # custom_rules imported below to ensure that the core workflow is not complicated
 # by build specific rules.
+include: "rules/merge_inputs.smk"
 include: "rules/prepare_sequences.smk"
 include: "rules/construct_phylogeny.smk"
 include: "rules/annotate_phylogeny.smk"

diff --git a/phylogenetic/build-configs/ci/config.yaml b/phylogenetic/build-configs/ci/config.yaml
@@ -1,7 +1,7 @@
 # This configuration file contains the custom configurations parameters
 # for the CI workflow to run with the example data.
 
-# Custom rules to run as part of the CI automated workflow
-# The paths should be relative to the phylogenetic directory.
-custom_rules:
-  - build-configs/ci/copy_example_data.smk
+inputs:
+  - name: example_data
+    metadata: "example_data/metadata.tsv"
+    sequences: "example_data/sequences.fasta"
diff --git a/phylogenetic/build-configs/ci/copy_example_data.smk b/phylogenetic/build-configs/ci/copy_example_data.smk
diff --git a/phylogenetic/rules/annotate_phylogeny.smk b/phylogenetic/rules/annotate_phylogeny.smk
@@ -3,7 +3,7 @@ This part of the workflow creates additonal annotations for the phylogenetic tre
 
 REQUIRED INPUTS:
 
-    metadata            = data/metadata.tsv
+    metadata            = results/metadata.tsv
     prepared_sequences  = results/prepared_sequences.fasta
     tree                = results/tree.nwk
 

diff --git a/phylogenetic/rules/construct_phylogeny.smk b/phylogenetic/rules/construct_phylogeny.smk
@@ -3,7 +3,7 @@ This part of the workflow constructs the phylogenetic tree.
 
 REQUIRED INPUTS:
 
-    metadata            = data/metadata.tsv
+    metadata            = results/metadata.tsv
     prepared_sequences  = results/prepared_sequences.fasta
 
 OUTPUTS:

diff --git a/phylogenetic/rules/export.smk b/phylogenetic/rules/export.smk
@@ -4,7 +4,7 @@ export a Nextstrain dataset.
 
 REQUIRED INPUTS:
 
-    metadata        = data/metadata.tsv
+    metadata        = results/metadata.tsv
     tree            = results/tree.nwk
     branch_lengths  = results/branch_lengths.json
     node_data       = results/*.json

diff --git a/phylogenetic/rules/merge_inputs.smk b/phylogenetic/rules/merge_inputs.smk
@@ -0,0 +1,184 @@
+"""
+This part of the workflow merges inputs based on what is defined in the config.
+
+OUTPUTS:
+
+    metadata  = results/metadata.tsv
+    sequences = results/sequences.fasta
+
+The config dict is expected to have a top-level `inputs` list that defines the
+separate inputs' name, metadata, and sequences. Optionally, the config can have
+a top-level `additional-inputs` list that is used to define additional data that
+are combined with the default inputs:
+
+```yaml
+inputs:
+    - name: default
+      metadata: <path-or-url>
+      sequences: <path-or-url>
+
+additional_inputs:
+    - name: private
+      metadata: <path-or-url>
+      sequences: <path-or-url>
+```
+
+Supports any of the compression formats that are supported by `augur read-file`,
+see <https://docs.nextstrain.org/projects/augur/page/usage/cli/read-file.html>
+
+NOTE: The included rules are written for workflows that do not use wildcards
+for defining inputs such as zika. You will need to edit the rules to support wildcards
+
+1. If your workflow needs wildcards for both metadata and sequences,
+e.g. serotypes for dengue, then you will need to edit the `output`, `log`, and
+`benchmark` paths of the metadata and sequences rules.
+The wildcards can then be directly used in the config for inputs:
+
+```yaml
+inputs:
+    - name: default
+      metadata: https://data.nextstrain.org/files/workflows/dengue/metadata_{serotype}.tsv.zst
+      sequences: https://data.nextstrain.org/files/workflows/dengue/sequences_{serotype}.fasta.zst
+
+```
+
+2. If your workflow only needs wildcards for sequences, e.g. segments for influenza,
+then you will only need to edit the paths for the sequences rules.
+The wildcards can then be directly used in the config for inputs:
+
+```yaml
+inputs:
+    - name: default
+      metadata: s3://nextstrain-data-private/files/workflows/avian-flu/metadata.tsv.zst
+      sequences: s3://nextstrain-data-private/files/workflows/avian-flu/{segment}/sequences.fasta.zst
+```
+"""
+from pathlib import Path
+
+
+def _gather_inputs():
+    all_inputs = [*config['inputs'], *config.get('additional_inputs', [])]
+
+    if len(all_inputs)==0:
+        raise InvalidConfigError("Config must define at least one element in config.inputs or config.additional_inputs lists")
+    if not all([isinstance(i, dict) for i in all_inputs]):
+        raise InvalidConfigError("All of the elements in config.inputs and config.additional_inputs lists must be dictionaries. "
+            "If you've used a command line '--config' double check your quoting.")
+    if len({i['name'] for i in all_inputs})!=len(all_inputs):
+        raise InvalidConfigError("Names of inputs (config.inputs and config.additional_inputs) must be unique")
+    if not all(['name' in i and ('sequences' in i or 'metadata' in i) for i in all_inputs]):
+        raise InvalidConfigError("Each input (config.inputs and config.additional_inputs) must have a 'name' and 'metadata' and/or 'sequences'")
+    if not any(['metadata' in i for i in all_inputs]):
+        raise InvalidConfigError("At least one input must have 'metadata'")
+    if not any (['sequences' in i for i in all_inputs]):
+        raise InvalidConfigError("At least one input must have 'sequences'")
+
+    available_keys = set(['name', 'metadata', 'sequences'])
+    if any([len(set(el.keys())-available_keys)>0 for el in all_inputs]):
+        raise InvalidConfigError(f"Each input (config.inputs and config.additional_inputs) can only include keys of {', '.join(available_keys)}")
+
+    return {el['name']: {k:(v if k=='name' else path_or_url(v)) for k,v in el.items()} for el in all_inputs}
+
+input_sources = _gather_inputs()
+_input_metadata = [info['metadata'] for info in input_sources.values() if info.get('metadata', None)]
+_input_sequences = [info['sequences'] for info in input_sources.values() if info.get('sequences', None)]
+
+
+if len(_input_metadata) == 1:
+
+    rule decompress_metadata:
+        """
+        This rule is invoked when there is a single metadata input to
+        ensure that we have a decompressed input for downstream rules to match
+        the output of rule.merge_metadata.
+        """
+        input:
+            metadata = _input_metadata[0],
+        output:
+            metadata = "results/metadata.tsv",
+        log:
+            "logs/decompress_metadata.txt",
+        benchmark:
+            "benchmarks/decompress_metadata.txt",
+        shell:
+            r"""
+            exec &> >(tee {log:q})
+
+            augur read-file {input.metadata:q} > {output.metadata:q}
+            """
+
+else:
+
+    rule merge_metadata:
+        """
+        This rule is invoked when there are multiple defined metadata inputs
+        (config.inputs + config.additional_inputs)
+        """
+        input:
+            **{name: info['metadata'] for name,info in input_sources.items() if info.get('metadata', None)}
+        params:
+            metadata = lambda w, input: list(map("=".join, input.items())),
+            id_field = config['strain_id_field'],
+        output:
+            metadata = "results/metadata.tsv"
+        log:
+            "logs/merge_metadata.txt",
+        benchmark:
+            "benchmarks/merge_metadata.txt"
+        shell:
+            r"""
+            exec &> >(tee {log:q})
+
+            augur merge \
+                --metadata {params.metadata:q} \
+                --metadata-id-columns {params.id_field:q} \
+                --output-metadata {output.metadata:q}
+            """
+
+
+if len(_input_sequences) == 1:
+
+    rule decompress_sequences:
+        """
+        This rule is invoked when there is a single sequences input to
+        ensure that we have a decompressed input for downstream rules to match
+        the output of rule.merge_sequences.
+        """
+        input:
+            sequences = _input_sequences[0],
+        output:
+            sequences = "results/sequences.fasta",
+        log:
+            "logs/decompress_sequences.txt",
+        benchmark:
+            "benchmarks/decompress_sequences.txt",
+        shell:
+            r"""
+            exec &> >(tee {log:q})
+
+            augur read-file {input.sequences:q} > {output.sequences:q}
+            """
+
+else:
+
+    rule merge_sequences:
+        """
+        This rule is invoked when there are multiple defined sequences inputs
+        (config.inputs + config.additional_inputs)
+        """
+        input:
+            **{name: info['sequences'] for name,info in input_sources.items() if info.get('sequences', None)}
+        output:
+            sequences = "results/sequences.fasta",
+        log:
+            "logs/merge_sequences.txt",
+        benchmark:
+            "benchmarks/merge_sequences.txt"
+        shell:
+            r"""
+            exec &> >(tee {log:q})
+
+            augur merge \
+                --sequences {input:q} \
+                --output-sequences {output.sequences:q}
+            """
diff --git a/phylogenetic/rules/prepare_sequences.smk b/phylogenetic/rules/prepare_sequences.smk
@@ -3,8 +3,8 @@ This part of the workflow prepares sequences for constructing the phylogenetic t
 
 REQUIRED INPUTS:
 
-    metadata    = data/metadata.tsv
-    sequences   = data/sequences.fasta
+    metadata    = results/metadata.tsv
+    sequences   = results/sequences.fasta
     reference   = ../shared/reference.fasta
 
 OUTPUTS:

diff --git a/shared/vendored/.github/workflows/ci.yaml b/shared/vendored/.github/workflows/ci.yaml
@@ -11,5 +11,5 @@ jobs:
   shellcheck:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
       - uses: nextstrain/.github/actions/shellcheck@master
diff --git a/shared/vendored/.github/workflows/pre-commit.yaml b/shared/vendored/.github/workflows/pre-commit.yaml
@@ -7,7 +7,7 @@ jobs:
   pre-commit:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
       - uses: actions/setup-python@v5
         with:
           python-version: "3.12"

diff --git a/shared/vendored/.gitrepo b/shared/vendored/.gitrepo
@@ -6,7 +6,7 @@
 [subrepo]
 	remote = https://github.com/nextstrain/shared
 	branch = main
-	commit = 82880c8b026d4f317e3f3d9b1f8bb4db226ea01e
-	parent = d914b98862f08486b76121bf5f385938ff3152f7
+	commit = 2d063cf2bae0cfc91d70fda2c36f1451656e5757
+	parent = f3c792c2a4d6ebec3a70c3d65b61258208789c67
 	method = merge
 	cmdver = 0.4.6
diff --git a/shared/vendored/README.md b/shared/vendored/README.md
@@ -86,6 +86,7 @@ approach to "ingest" has been discussed in various internal places, including:
 
 Scripts for supporting workflow automation that don’t really belong in any of our existing tools.
 
+- [assign-colors](scripts/assign-colors) - Generate colors.tsv for augur export based on ordering, color schemes, and what exists in the metadata. Used in the phylogenetic or nextclade workflows.
 - [notify-on-diff](scripts/notify-on-diff) - Send Slack message with diff of a local file and an S3 object
 - [notify-on-job-fail](scripts/notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
 - [notify-on-job-start](scripts/notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
@@ -97,6 +98,7 @@ Scripts for supporting workflow automation that don’t really belong in any of
 - [trigger-on-new-data](scripts/trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`
   A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.
 
+
 NCBI interaction scripts that are useful for fetching public metadata and sequences.
 
 - [fetch-from-ncbi-entrez](scripts/fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
@@ -122,6 +124,8 @@ Potential Nextstrain CLI scripts
 Snakemake workflow functions that are shared across many pathogen workflows that don’t really belong in any of our existing tools.
 
 - [config.smk](snakemake/config.smk) - Shared functions for parsing workflow configs.
+- [remote_files.smk](snakemake/remote_files.smk) - Exposes the `path_or_url` function which will use Snakemake's storage plugins to download/upload files to remote providers as needed.
+
 
 ## Software requirements