5 changes: 5 additions & 0 deletions phylogenetic/Snakefile
@@ -20,6 +20,10 @@ rule all:
auspice_json="",


+# Shared Snakemake files with generic functions are shared across pathogens
+include: "../shared/vendored/snakemake/config.smk"
+include: "../shared/vendored/snakemake/remote_files.smk"
+
# These rules are imported in the order that they are expected to run.
# Each Snakefile will have documented inputs and outputs that should be kept as
# consistent interfaces across pathogen repos. This allows us to define typical
@@ -30,6 +34,7 @@ rule all:
# If there are build specific customizations, they should be added with the
# custom_rules imported below to ensure that the core workflow is not complicated
# by build specific rules.
+include: "rules/merge_inputs.smk"
include: "rules/prepare_sequences.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
8 changes: 4 additions & 4 deletions phylogenetic/build-configs/ci/config.yaml
@@ -1,7 +1,7 @@
# This configuration file contains the custom configuration parameters
# for the CI workflow to run with the example data.

-# Custom rules to run as part of the CI automated workflow
-# The paths should be relative to the phylogenetic directory.
-custom_rules:
-  - build-configs/ci/copy_example_data.smk
+inputs:
+  - name: example_data
+    metadata: "example_data/metadata.tsv"
+    sequences: "example_data/sequences.fasta"
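With a single entry under `inputs`, the count-based branching in `rules/merge_inputs.smk` would be expected to define the decompress rules rather than the merge rules. A minimal sketch of that selection, assuming a plain dict in place of Snakemake's `config`; `ci_config` and `select_rules` are hypothetical names used only for illustration:

```python
# Hypothetical stand-in for the parsed CI config. In the workflow itself,
# Snakemake provides `config` and paths are wrapped by `path_or_url`.
ci_config = {
    "inputs": [
        {
            "name": "example_data",
            "metadata": "example_data/metadata.tsv",
            "sequences": "example_data/sequences.fasta",
        }
    ]
}


def select_rules(config):
    """Return the rule names that the count-based branching would define."""
    all_inputs = [*config.get("inputs", []), *config.get("additional_inputs", [])]
    n_metadata = sum(1 for i in all_inputs if i.get("metadata"))
    n_sequences = sum(1 for i in all_inputs if i.get("sequences"))
    return (
        "decompress_metadata" if n_metadata == 1 else "merge_metadata",
        "decompress_sequences" if n_sequences == 1 else "merge_sequences",
    )


print(select_rules(ci_config))
```

Adding a second input that carries only `metadata` would switch the first element to `merge_metadata` while the sequences side stays on the decompress rule.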
17 changes: 0 additions & 17 deletions phylogenetic/build-configs/ci/copy_example_data.smk

This file was deleted.

2 changes: 1 addition & 1 deletion phylogenetic/rules/annotate_phylogeny.smk
@@ -3,7 +3,7 @@ This part of the workflow creates additional annotations for the phylogenetic tre

REQUIRED INPUTS:

-metadata = data/metadata.tsv
+metadata = results/metadata.tsv
prepared_sequences = results/prepared_sequences.fasta
tree = results/tree.nwk

2 changes: 1 addition & 1 deletion phylogenetic/rules/construct_phylogeny.smk
@@ -3,7 +3,7 @@ This part of the workflow constructs the phylogenetic tree.

REQUIRED INPUTS:

-metadata = data/metadata.tsv
+metadata = results/metadata.tsv
prepared_sequences = results/prepared_sequences.fasta

OUTPUTS:
2 changes: 1 addition & 1 deletion phylogenetic/rules/export.smk
@@ -4,7 +4,7 @@ export a Nextstrain dataset.

REQUIRED INPUTS:

-metadata = data/metadata.tsv
+metadata = results/metadata.tsv
tree = results/tree.nwk
branch_lengths = results/branch_lengths.json
node_data = results/*.json
184 changes: 184 additions & 0 deletions phylogenetic/rules/merge_inputs.smk
@@ -0,0 +1,184 @@
"""
This part of the workflow merges inputs based on what is defined in the config.

OUTPUTS:

metadata = results/metadata.tsv
sequences = results/sequences.fasta

The config dict is expected to have a top-level `inputs` list that defines the
separate inputs' name, metadata, and sequences. Optionally, the config can have
a top-level `additional_inputs` list that is used to define additional data that
are combined with the default inputs:

```yaml
inputs:
  - name: default
    metadata: <path-or-url>
    sequences: <path-or-url>

additional_inputs:
  - name: private
    metadata: <path-or-url>
    sequences: <path-or-url>
```

Inputs can use any of the compression formats supported by `augur read-file`;
see <https://docs.nextstrain.org/projects/augur/page/usage/cli/read-file.html>

NOTE: The included rules are written for workflows, such as zika, that do not use
wildcards for defining inputs. You will need to edit the rules to support wildcards:

1. If your workflow needs wildcards for both metadata and sequences,
e.g. serotypes for dengue, then you will need to edit the `output`, `log`, and
`benchmark` paths of the metadata and sequences rules.
The wildcards can then be directly used in the config for inputs:

```yaml
inputs:
  - name: default
    metadata: https://data.nextstrain.org/files/workflows/dengue/metadata_{serotype}.tsv.zst
    sequences: https://data.nextstrain.org/files/workflows/dengue/sequences_{serotype}.fasta.zst

```

2. If your workflow only needs wildcards for sequences, e.g. segments for influenza,
then you will only need to edit the paths for the sequences rules.
The wildcards can then be directly used in the config for inputs:

```yaml
inputs:
  - name: default
    metadata: s3://nextstrain-data-private/files/workflows/avian-flu/metadata.tsv.zst
    sequences: s3://nextstrain-data-private/files/workflows/avian-flu/{segment}/sequences.fasta.zst
```
"""
from pathlib import Path


def _gather_inputs():
    all_inputs = [*config['inputs'], *config.get('additional_inputs', [])]

    if len(all_inputs)==0:
        raise InvalidConfigError("Config must define at least one element in config.inputs or config.additional_inputs lists")
    if not all([isinstance(i, dict) for i in all_inputs]):
        raise InvalidConfigError("All of the elements in config.inputs and config.additional_inputs lists must be dictionaries. "
            "If you've used a command line '--config' double check your quoting.")
    # Check that every input has a 'name' before comparing names, so that a
    # missing name raises InvalidConfigError instead of KeyError.
    if not all(['name' in i and ('sequences' in i or 'metadata' in i) for i in all_inputs]):
        raise InvalidConfigError("Each input (config.inputs and config.additional_inputs) must have a 'name' and 'metadata' and/or 'sequences'")
    if len({i['name'] for i in all_inputs})!=len(all_inputs):
        raise InvalidConfigError("Names of inputs (config.inputs and config.additional_inputs) must be unique")
    if not any(['metadata' in i for i in all_inputs]):
        raise InvalidConfigError("At least one input must have 'metadata'")
    if not any(['sequences' in i for i in all_inputs]):
        raise InvalidConfigError("At least one input must have 'sequences'")

    available_keys = set(['name', 'metadata', 'sequences'])
    if any([len(set(el.keys())-available_keys)>0 for el in all_inputs]):
        # Sort the keys so the error message is the same on every run.
        raise InvalidConfigError(f"Each input (config.inputs and config.additional_inputs) can only include keys of {', '.join(sorted(available_keys))}")

    return {el['name']: {k:(v if k=='name' else path_or_url(v)) for k,v in el.items()} for el in all_inputs}

input_sources = _gather_inputs()
_input_metadata = [info['metadata'] for info in input_sources.values() if info.get('metadata', None)]
_input_sequences = [info['sequences'] for info in input_sources.values() if info.get('sequences', None)]
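The validation and gathering steps above can be tried outside Snakemake with a small standalone sketch. It makes several assumptions: `InvalidConfigError` here is a local stand-in for the class the workflow gets from the shared Snakemake files, paths stay plain strings instead of going through `path_or_url`, and `gather_inputs`, `merge_metadata_args`, and the example paths are hypothetical names chosen for illustration. The sketch checks for a `name` key before comparing names, so a missing name surfaces as a config error. The second function mirrors how the `params.metadata` lambda in the `merge_metadata` rule further down builds `NAME=PATH` strings.

```python
class InvalidConfigError(Exception):
    """Local stand-in for the error class used by the workflow."""


ALLOWED_KEYS = {"name", "metadata", "sequences"}


def gather_inputs(config):
    """Validate the input lists and return a {name: input} mapping."""
    all_inputs = [*config["inputs"], *config.get("additional_inputs", [])]

    if not all_inputs:
        raise InvalidConfigError("define at least one input")
    if not all(isinstance(i, dict) for i in all_inputs):
        raise InvalidConfigError("every input must be a dictionary")
    # Require 'name' before reading it, so a missing name is reported as a
    # config error rather than a KeyError.
    if not all("name" in i and ("metadata" in i or "sequences" in i) for i in all_inputs):
        raise InvalidConfigError("each input needs a name plus metadata and/or sequences")
    if len({i["name"] for i in all_inputs}) != len(all_inputs):
        raise InvalidConfigError("input names must be unique")
    if any(set(i) - ALLOWED_KEYS for i in all_inputs):
        raise InvalidConfigError(f"allowed keys: {', '.join(sorted(ALLOWED_KEYS))}")
    if not any("metadata" in i for i in all_inputs):
        raise InvalidConfigError("at least one input must have metadata")
    if not any("sequences" in i for i in all_inputs):
        raise InvalidConfigError("at least one input must have sequences")

    # Assumption: paths are kept as strings here; the workflow wraps them.
    return {i["name"]: dict(i) for i in all_inputs}


def merge_metadata_args(sources):
    """Build NAME=PATH strings from the inputs that define metadata."""
    named = {name: info["metadata"] for name, info in sources.items() if info.get("metadata")}
    return list(map("=".join, named.items()))


example = {
    "inputs": [
        {"name": "default", "metadata": "data/default.tsv", "sequences": "data/default.fasta"}
    ],
    "additional_inputs": [{"name": "private", "metadata": "data/private.tsv"}],
}
sources = gather_inputs(example)
print(merge_metadata_args(sources))
# → ['default=data/default.tsv', 'private=data/private.tsv']
```

Because the `private` input has no `sequences`, this example has two metadata sources but only one sequences source, which is the case where the metadata side merges and the sequences side only decompresses.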


if len(_input_metadata) == 1:

    rule decompress_metadata:
        """
        This rule is invoked when there is a single metadata input to
        ensure that we have a decompressed input for downstream rules to match
        the output of rule.merge_metadata.
        """
        input:
            metadata = _input_metadata[0],
        output:
            metadata = "results/metadata.tsv",
        log:
            "logs/decompress_metadata.txt",
        benchmark:
            "benchmarks/decompress_metadata.txt",
        shell:
            r"""
            exec &> >(tee {log:q})

            augur read-file {input.metadata:q} > {output.metadata:q}
            """

else:

    rule merge_metadata:
        """
        This rule is invoked when there are multiple defined metadata inputs
        (config.inputs + config.additional_inputs)
        """
        input:
            **{name: info['metadata'] for name,info in input_sources.items() if info.get('metadata', None)}
        params:
            metadata = lambda w, input: list(map("=".join, input.items())),
            id_field = config['strain_id_field'],
        output:
            metadata = "results/metadata.tsv"
        log:
            "logs/merge_metadata.txt",
        benchmark:
            "benchmarks/merge_metadata.txt"
        shell:
            r"""
            exec &> >(tee {log:q})

            augur merge \
                --metadata {params.metadata:q} \
                --metadata-id-columns {params.id_field:q} \
                --output-metadata {output.metadata:q}
            """


if len(_input_sequences) == 1:

    rule decompress_sequences:
        """
        This rule is invoked when there is a single sequences input to
        ensure that we have a decompressed input for downstream rules to match
        the output of rule.merge_sequences.
        """
        input:
            sequences = _input_sequences[0],
        output:
            sequences = "results/sequences.fasta",
        log:
            "logs/decompress_sequences.txt",
        benchmark:
            "benchmarks/decompress_sequences.txt",
        shell:
            r"""
            exec &> >(tee {log:q})

            augur read-file {input.sequences:q} > {output.sequences:q}
            """

else:

    rule merge_sequences:
        """
        This rule is invoked when there are multiple defined sequences inputs
        (config.inputs + config.additional_inputs)
        """
        input:
            **{name: info['sequences'] for name,info in input_sources.items() if info.get('sequences', None)}
        output:
            sequences = "results/sequences.fasta",
        log:
            "logs/merge_sequences.txt",
        benchmark:
            "benchmarks/merge_sequences.txt"
        shell:
            r"""
            exec &> >(tee {log:q})

            augur merge \
                --sequences {input:q} \
                --output-sequences {output.sequences:q}
            """
4 changes: 2 additions & 2 deletions phylogenetic/rules/prepare_sequences.smk
@@ -3,8 +3,8 @@ This part of the workflow prepares sequences for constructing the phylogenetic t

REQUIRED INPUTS:

-metadata = data/metadata.tsv
-sequences = data/sequences.fasta
+metadata = results/metadata.tsv
+sequences = results/sequences.fasta
reference = ../shared/reference.fasta

OUTPUTS:
2 changes: 1 addition & 1 deletion shared/vendored/.github/workflows/ci.yaml
@@ -11,5 +11,5 @@ jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
-- uses: actions/checkout@v4
+- uses: actions/checkout@v5
- uses: nextstrain/.github/actions/shellcheck@master
2 changes: 1 addition & 1 deletion shared/vendored/.github/workflows/pre-commit.yaml
@@ -7,7 +7,7 @@ jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
-- uses: actions/checkout@v4
+- uses: actions/checkout@v5
- uses: actions/setup-python@v5
with:
python-version: "3.12"
4 changes: 2 additions & 2 deletions shared/vendored/.gitrepo
@@ -6,7 +6,7 @@
[subrepo]
remote = https://github.com/nextstrain/shared
branch = main
-commit = 82880c8b026d4f317e3f3d9b1f8bb4db226ea01e
-parent = d914b98862f08486b76121bf5f385938ff3152f7
+commit = 2d063cf2bae0cfc91d70fda2c36f1451656e5757
+parent = f3c792c2a4d6ebec3a70c3d65b61258208789c67
method = merge
cmdver = 0.4.6
4 changes: 4 additions & 0 deletions shared/vendored/README.md
@@ -86,6 +86,7 @@ approach to "ingest" has been discussed in various internal places, including:

Scripts for supporting workflow automation that don’t really belong in any of our existing tools.

+- [assign-colors](scripts/assign-colors) - Generate colors.tsv for augur export based on ordering, color schemes, and what exists in the metadata. Used in the phylogenetic or nextclade workflows.
- [notify-on-diff](scripts/notify-on-diff) - Send Slack message with diff of a local file and an S3 object
- [notify-on-job-fail](scripts/notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
- [notify-on-job-start](scripts/notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
@@ -97,6 +98,7 @@ Scripts for supporting workflow automation that don’t really belong in any of
- [trigger-on-new-data](scripts/trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`
A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.

+
NCBI interaction scripts that are useful for fetching public metadata and sequences.

- [fetch-from-ncbi-entrez](scripts/fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
@@ -122,6 +124,8 @@ Potential Nextstrain CLI scripts
Snakemake workflow functions that are shared across many pathogen workflows that don’t really belong in any of our existing tools.

- [config.smk](snakemake/config.smk) - Shared functions for parsing workflow configs.
+- [remote_files.smk](snakemake/remote_files.smk) - Exposes the `path_or_url` function which will use Snakemake's storage plugins to download/upload files to remote providers as needed.
+

## Software requirements
