Sync#17

Merged
DOH-LAF2303 merged 34 commits into master from sync
Dec 10, 2025

DOH-LAF2303 commented Dec 10, 2025

Description of proposed changes

Pulls in upstream changes, fixes custom_rules to fit them, and changes the build to a 6y tree instead of an all-time tree.

Related issue(s)

Checklist

  • Checks pass
  • Update changelog

joverlee521 and others added 30 commits September 13, 2023 17:00
Realized through nextstrain#37 that the
ingest pipeline does _not_ trigger the rebuild. The rebuild is just
scheduled to run after the ingest workflow. Removing all parameters and
references to trigger in this commit so that it does not confuse anyone
else in the future.

Keeping the schedule as-is since it's been working fine and we are
planning to shift pathogen workflows in the future to be able to
go from ingest to a build within a single run without going through
triggers and S3 interactions.
This prevents 16 unnecessary duplicate runs with the same inputs and
outputs.
This prevents 18 unnecessary duplicate runs with the same inputs and
outputs.
This prevents 16 unnecessary duplicate runs with the same inputs and
outputs.
ingest: Remove parameters related to trigger
Simplify the workflow using similar changes to 
<nextstrain/zika#83>

Also removes extra options:

* `PAT_GITHUB_DISPATCH` is not used in the ingest workflow since it is _not_
triggering the downstream phylo workflow.

* `--printshellcmds` is a default flag already included in `nextstrain build`
<https://github.com/nextstrain/cli/blob/7252a9b0d9b6e628500f9e2b991cc16a929f2879/nextstrain/cli/command/build.py#L209>
Add input `trial_name` as a way to start trial runs that deploy to staging.

Motivated by recent comment
<nextstrain#110 (comment)>
I noticed in a recent run of the ingest workflow that ~6min was spent waiting 
for the Batch job to start, so I was curious whether this workflow can run within
GH Actions using the docker runtime. My trial run took 9min23s¹ compared to the 
previous run on AWS Batch, which took 15m21s.² 

Just going to use the docker runtime since it's faster and free.

¹ <https://github.com/nextstrain/rsv/actions/runs/18957867813>
² <https://github.com/nextstrain/rsv/actions/runs/18881097661>
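
For reference, the difference between the two runs quoted above can be worked out with a small sketch. The `parse_duration` helper is hypothetical and only handles the two duration spellings used in this message.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '9min23s' or '15m21s' to seconds."""
    match = re.fullmatch(r"(\d+)m(?:in)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


docker_run = parse_duration("9min23s")  # trial run with the docker runtime
batch_run = parse_duration("15m21s")    # previous run on AWS Batch
saved = batch_run - docker_run

# Prints: saved 5m58s (39% of the AWS Batch run)
print(f"saved {saved // 60}m{saved % 60}s ({saved / batch_run:.0%} of the AWS Batch run)")
```

The saving of about 6 minutes roughly matches the time that was spent waiting for the Batch job to start.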
…/vendored

subrepo:
  subdir:   "shared/vendored"
  merged:   "bfbbb68"
upstream:
  origin:   "https://github.com/nextstrain/shared"
  branch:   "main"
  commit:   "bfbbb68"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Same vendored scripts are now available in shared/vendored
Updated the filepaths for the rules copied from pathogen-repo-guide 
to support the `a_or_b` wildcard. It's not clear to me why the workflow uses 
`a_or_b` instead of a `subtype` wildcard to match the config param `subtypes`, 
but I'm not going to make the changes to consolidate them here.

Removes snakemake_rules/download.smk since the multiple input support uses 
the Snakemake storage plugins to handle remote files. This bumps the minimum 
Snakemake version to 8.0.0.
Defines `inputs` to ensure that we only use OPEN PPX data as example data. 
Moves chores.smk so that it is only included through `custom_rules`, because the 
default workflow would otherwise run into this exception:

```
CyclicGraphException in rule decompress_metadata in file "/nextstrain/build/workflow/snakemake_rules/merge_inputs.smk", line 62:
Cyclic dependency on rule decompress_metadata.
```
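
As a toy illustration of what a cyclic dependency between rules means, here is a minimal depth-first search over a rule graph. This is not Snakemake's resolution logic, and the edges below are invented for the example: the commit message does not say which rules form the real cycle, and `update_example_data` is used only as a plausible stand-in.

```python
def find_cycle(graph):
    """Return a list of rules forming a cycle, or None if the graph is acyclic.

    `graph` maps each rule name to the rules it depends on.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    stack = []

    def visit(rule):
        state[rule] = IN_PROGRESS
        stack.append(rule)
        for dep in graph.get(rule, ()):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # We came back to a rule that is still being resolved.
                return stack[stack.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[rule] = DONE
        return None

    for rule in graph:
        if state.get(rule, UNSEEN) == UNSEEN:
            cycle = visit(rule)
            if cycle:
                return cycle
    return None


# Hypothetical edges: each rule needs the other's output.
rules = {
    "decompress_metadata": ["update_example_data"],
    "update_example_data": ["decompress_metadata"],
}
print(find_cycle(rules))
```

Keeping the chore rules out of the default include list removes them from the graph, so a cycle like the one sketched here cannot form in the default workflow.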
Updated example data using the updated chore config

```
nextstrain build . update_example_data --configfile config/chores.yaml
```
Reorganized to match our usual phylogenetic READMEs and added new 
instructions for using the `inputs` and `additional_inputs` params for 
configuring workflow inputs.
Since the workflow now expects data at `results/{a_or_b}/*`, the 
built-in copy example data command in pathogen-repo-ci.yaml@v0 no longer works.
Instead of updating the v0 workflow, just use the config to start from specific 
example_data inputs. 
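
A minimal sketch of the layout implied by `results/{a_or_b}/*`, showing why a flat copy of example data would not land where the workflow looks. The wildcard values `a` and `b` and the file names are assumptions made for illustration, not taken from the workflow.

```python
from pathlib import PurePosixPath

# Assumed values of the `a_or_b` wildcard.
SUBTYPES = ["a", "b"]


def expected_paths(filenames, subtypes=SUBTYPES, root="results"):
    """List the per-subtype paths matching `results/{a_or_b}/<file>`."""
    return [
        PurePosixPath(root) / subtype / name
        for subtype in subtypes
        for name in filenames
    ]


# Hypothetical example data file names.
for path in expected_paths(["metadata.tsv", "sequences.fasta"]):
    print(path)
```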

This should _not_ require any changes in augur/docker-base/conda-base since 
those CI workflows are already using this CI config:

<https://github.com/nextstrain/augur/blob/677d535eda13d370d4099558e0cca29db9abcafd/.github/workflows/ci.yaml#L268>
<https://github.com/nextstrain/docker-base/blob/9ec2845e06e331877eae5f446fd4adc56cd33d9f/.github/workflows/ci.yml#L219>
<https://github.com/nextstrain/conda-base/blob/9048d8410e7b3a1a7098dd5c498234a489b8ab0b/.github/workflows/ci.yaml#L148>
Simplifies the ingest/Snakefile so that it is easy to see what the outputs 
of the workflow are, and moves the upload process into a Nextstrain automation 
build config.
Updated to match the pathogen-repo-guide at 
<https://github.com/nextstrain/pathogen-repo-guide/tree/4784a831fc78bf1cdc416824b26ce36ad4f5bcc2/ingest/build-configs/nextstrain-automation>

This simplifies the upload config and makes it easier to understand which files 
are uploaded to S3 as `*_with_restricted`.
Multiple input sources are expected to be defined in the phylo workflow 
going forward, so we no longer need to support them here. The recent switch
to PPX data also made it obvious that multiple sources don't work well 
when the curations are pretty different.
Extract "OPEN" and "RESTRICTED" data into separate files that are uploaded to 
S3 separately. This will reduce the amount of duplicate data that we host on S3.

Outside of the changes in the workflow, we should delete the previously uploaded
"*_with_restricted" files from S3 so that they are not confused with the new 
"*_restricted" files added here.
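
The split described above can be sketched roughly as follows. This is not the ingest workflow's code: the `dataUseTerms` column name and the sample records are assumptions made for illustration.

```python
import csv
import io


def split_by_data_use_terms(tsv_text, column="dataUseTerms"):
    """Group metadata rows into OPEN and RESTRICTED lists.

    The column name is an assumption, not taken from the workflow.
    """
    groups = {"OPEN": [], "RESTRICTED": []}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        terms = row[column]
        if terms not in groups:
            raise ValueError(f"unexpected data use terms: {terms!r}")
        groups[terms].append(row)
    return groups


# Invented records, for illustration only.
sample = (
    "accession\tdataUseTerms\n"
    "SEQ_1\tOPEN\n"
    "SEQ_2\tRESTRICTED\n"
    "SEQ_3\tOPEN\n"
)
groups = split_by_data_use_terms(sample)
print({terms: [row["accession"] for row in rows] for terms, rows in groups.items()})
```

Because each record ends up in exactly one group, the two uploads never overlap, whereas a combined `*_with_restricted` file repeats every OPEN record.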
Since the previous commit separates the OPEN and RESTRICTED files on S3, 
update the phylo config to start from these multiple inputs.
DOH-LAF2303 merged commit bb1bb99 into master on Dec 10, 2025
1 check passed
4 participants