
2026 updates#64

Merged
jameshadfield merged 19 commits into main from 2026-updates
Jun 29, 2026

@jameshadfield jameshadfield commented Jun 24, 2026

Member

Replaces the various previous phylo workflows with a single phylogenetic workflow which produces these builds:

  • ebov/all-outbreaks: subsamples genomes across outbreaks to present an overview of the known genomic history of Ebola virus (EBOV), formerly Zaïre ebolavirus. Outbreaks are classified using Nextclade.
    This dataset is kept up-to-date on nextstrain.org/ebola/all-outbreaks

  • ebov/west-africa-2014: a work-in-progress workflow which produces a single analysis of the West Africa outbreak. Note: this is NOT the workflow which produced nextstrain.org/ebola/ebov-2013

  • bdbv/all-outbreaks: this dataset is kept up-to-date on nextstrain.org/ebola/bdbv

  • bdbv/2026: work-in-progress

  • sudv/all-outbreaks: this dataset is kept up-to-date on nextstrain.org/ebola/sudv

See commit messages for more details & design intentions

Closes #60
Closes #61
Closes #37
Closes #27
Closes #29
Closes #17

How to run

  1. Run ingest locally: `cd ingest; snakemake --cores 2 -pf`
  2. Run the phylo workflow, using locally ingested PPX open & restricted data, to produce the 5 builds described above:
         cd phylogenetic
         snakemake --cores 4 --configfile defaults/config-local-inputs.yaml -pf

Open questions / work to do:

  • Choose a better name for bdbv/2026?
  • bdbv/2026 timetree is not great with the current 16 sequences. Perhaps we can revisit this when more genomes are available?
  • We use (PPX) accession as the metadata ID column. Is this the right decision for adding private data? It means metadata TSVs will need to have an accession column. Issue: Improve ability to use private data #66
  • West African timetree is terrible (this predates this PR). I think we need a custom TreeTime script which generates the clock distribution from the non-relapse cases. Issue: Improve ability to use private data #66
  • Use auspice config overlays rather than duplicating so much across builds
  • Add colours for open/restricted data into auspice configs so they're consistent across datasets
  • Understand the CI workflow and example_data, neither of which have been updated in this PR
  • Add README details on adding private data
  • Nextstrain run testing
  • (Before merge) switch from s3://nextstrain-staging to s3://nextstrain-data

@jameshadfield jameshadfield linked an issue Jun 24, 2026 that may be closed by this pull request
@jameshadfield jameshadfield mentioned this pull request Jun 24, 2026
8 tasks

@victorlin victorlin left a comment

Member

Comment thread phylogenetic/rules/config.smk
Comment thread phylogenetic/Snakefile
Comment on lines +17 to +18
# certain commands (but not all) use SEARCH_PATHS
SEARCH_PATHS = [workflow.basedir, os.getcwd()]

@victorlin victorlin Jun 24, 2026

Member

Re: 6f02065

Why set custom SEARCH_PATHS instead of using the default AUGUR_SEARCH_PATHS from config.smk?

@jameshadfield jameshadfield Jun 24, 2026

Member Author

I'll revisit this. Part of this was me not understanding what parts of augur use AUGUR_SEARCH_PATHS at the moment -- it's automatically set via the include:, and subsample uses it, but resolve_config_path doesn't yet. The other part was noticing that we can only specify a single fallback in the vendored resolve_config_path (i.e. analysis directory + single defaults_dir), and I wanted to use consistent search paths everywhere; this led me to defining SEARCH_PATHS = [workflow.basedir, os.getcwd()] for subsample and resolve_config_path(fname, workflow.basedir).

(As I type that I think the order of my SEARCH_PATHS is inverted!)
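Why the order of the list matters can be seen in a minimal sketch of first-match lookup. This is an illustration only, not the code in augur or in the vendored `resolve_config_path`; the helper name `first_existing` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(fname: str, search_paths: Iterable[str]) -> Optional[Path]:
    """Return the first `search_path / fname` that exists, or None.

    Hypothetical helper: earlier entries in `search_paths` take priority.
    """
    for base in search_paths:
        candidate = Path(base) / fname
        if candidate.exists():
            return candidate
    return None
```

With first-match semantics, `[workflow.basedir, os.getcwd()]` would prefer the workflow's own copy of a file over one in the analysis directory, which is presumably what the "inverted" remark refers to.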

Member

We should make resolve_config_path use AUGUR_SEARCH_PATHS. Can you try with my draft of that at nextstrain/measles@84014d3? It's a patch on the shared config.smk file so you should be able to copy/paste into the vendored copy here.

Member Author

I applied this patch and am going to merge it. I'm a little hesitant to be using this in production when we're considering things like

FIXME: rename to NEXTSTRAIN_SEARCH_PATHS?

but I also see the value in having real-world usage of functionality to give us confidence in the direction.

I added a couple of extra bits:

  1. In the shared config.smk:
         # For simplicity, ensure search paths are unique (e.g. often the CWD == workflow.basedir)
         search_paths = [p for idx, p in enumerate(search_paths) if search_paths.index(p) == idx]
  2. In Ebola's main Snakefile:
         print("Relative filepaths will be searched for using the `AUGUR_SEARCH_PATHS`"
               " env variable, which has the following directories:"
               "\n\t" + "\n\t".join(os.environ["AUGUR_SEARCH_PATHS"].split(':')) + "\n",
               file=sys.stderr)
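For reference, the order-preserving de-duplication in the first snippet can also be written with `dict.fromkeys`, which keeps first occurrences and avoids the repeated `index()` scans. This is a sketch of the same idea, not the code that was merged; both function names are hypothetical, and the `:` separator is taken from the `print` call above.

```python
from typing import List


def unique_in_order(paths: List[str]) -> List[str]:
    """Drop repeated entries, keeping the first occurrence of each."""
    return list(dict.fromkeys(paths))


def search_paths_from_env(value: str) -> List[str]:
    """Split a colon-separated search-path string, skipping empty entries."""
    return unique_in_order([p for p in value.split(":") if p])
```

Entries are compared as plain strings, so `/a/b` and `/a/b/` would still count as two different paths unless they are normalised first.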

Member Author

Also, have you considered preventing relative paths from searching above (i.e. "../") these paths? Here's an example which I currently use in ebola:

  1. We have a config-defined path of ../shared/zaire/reference.gb, added when I was using resolve_config_path(p, workflow.basedir). (Now, with your more extensive AUGUR_SEARCH_PATHS I could just use shared/zaire/reference.gb.)
  2. In an external analysis directory we're going to start by looking for 'shared' in the analysis directory's parent folder. I don't think we should be doing this?
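One way such a guard could look, as a rough sketch: resolve the candidate and check that it still sits under the search path it was joined to. The function name `stays_within` is hypothetical and nothing like it is claimed to exist in augur or the shared `config.smk`.

```python
from pathlib import Path


def stays_within(base: str, relative: str) -> bool:
    """True if `base / relative` resolves to a location inside `base`.

    Hypothetical guard: `..` segments are collapsed by `resolve()`, so a
    path that climbs out of `base` no longer has `base` among its parents.
    """
    base_path = Path(base).resolve()
    target = (base_path / relative).resolve()
    return target == base_path or base_path in target.parents
```

Under this check `shared/zaire/reference.gb` would be accepted for any search path, while `../shared/zaire/reference.gb` would be rejected because it resolves into the parent folder.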

@joverlee521 joverlee521 left a comment

Contributor

Dropping first pass of review, will need to think on how resolve_config_path should work with the top level shared directory for nextstrain run.

Comment thread ingest/rules/curate.smk
Comment thread ingest/rules/nextclade.smk
Comment thread phylogenetic/rules/construct_phylogeny.smk Outdated
Comment thread phylogenetic/rules/construct_phylogeny.smk Outdated
alignment = "results/{build}/aligned.fasta"
tree = "results/{species}/{build}/tree.nwk",
alignment = "results/{species}/{build}/subsampled.fasta", # unmasked
annotation = lambda w: config['ancestral'][f"{w.species}/{w.build}"]['annotation'],

Contributor

Ran into MissingInputException when testing with nextstrain run. I tried wrapping this in resolve_config_path, but then ran into InvalidConfigError. The config path would need to be updated to ../../shared/{species}/reference.gb, which is not intuitive at all...

Member Author

Thanks! Will sort out.

Member Author

Fixed up by using resolve_config_path(<path>, workflow.basedir)({}) - the error you ran into is one of my motivations for what's being discussed in this other thread.

P.S. Be careful using {species} as there are mismatches around - including here - where the species wildcard is (e.g.) "sudv" but the directory is "shared/sudan".
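Until the names are unified, an explicit lookup is one way to avoid formatting `{species}` straight into a path. The sketch below is illustrative only: the helper and table are hypothetical, the `sudv` entry comes from this comment, the `ebov` entry is inferred from the `shared/zaire` paths quoted in this PR, and `bdbv` is left out because its directory name is not stated here.

```python
# Wildcard value -> directory name under `shared/`. Only pairs that can be
# read off this thread are listed; any other entry would be a guess.
SHARED_DIR_FOR_SPECIES = {
    "ebov": "zaire",  # inferred from ../shared/zaire/reference.gb
    "sudv": "sudan",  # stated in the comment above
}


def shared_reference_path(species: str) -> str:
    """Build the shared reference path for a species wildcard (hypothetical helper)."""
    try:
        directory = SHARED_DIR_FOR_SPECIES[species]
    except KeyError:
        raise ValueError(f"No shared directory known for species {species!r}") from None
    return f"shared/{directory}/reference.gb"
```

Failing loudly on an unknown species surfaces the mismatch at config time rather than as a MissingInputException later in the run.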

Contributor

Right, now I'm remembering why we do all the magic in resolve_config_path for users. With the default config parameter being ../shared/zaire/reference.gb, the user has to put their reference outside of their analysis directory or override the config parameter with a custom config. It doesn't "just work" for them to have all of the config files in their analysis directory.

The path that would be nice for the users (but definitely not intuitive for authors) is ebov/reference.gb and the workflow uses resolve_config_path(<path>, Path(workflow.basedir)/../shared).

(The mismatch of species in the config filenames is definitely confusing and should be fixed to be all ebov/sudv/bdbv.)

Member Author

I applied Victor's patch from above which now scans multiple directories, including the base ebola (repo root) directory, so we could slightly simplify these paths going forward.

joverlee521 added a commit to nextstrain/seasonal-flu that referenced this pull request Jun 26, 2026
Optionally use the `inputs` param to define multiple inputs for the
workflow, backwards compatible with existing profiles. Mainly motivated
by the need for defining inputs for the forthcoming open builds, but
it's nice to support multiple inputs!

Use of `inputs.lineage` inspired by ebola's `inputs.species` added in
<nextstrain/ebola#64>

Note this does _not_ support inputs for titer data. That has too much
complexity with lineage/center/passage/assay that I did not want to
block this work.

Since we are running ingest-to-phylo workflows daily these are
unnecessary.

and related (empty!) example data files. We can add these back if
we add a working phylo CI workflow, however with ingest-to-phylo
running daily we already have quick feedback when a workflow breaks.

The phylo workflows will be refactored to start from multiple inputs
which motivated these two big changes. Having partitioned
open/restricted data sources forces us to use multiple inputs in our
automated builds which improves the reliability of that feature.
Allowing injection of private data (via multiple inputs) in phylo
workflows also makes it cleaner to run nextclade (for clade annotation)
_after_ the data merging so that all data has the annotations. The
dataset sizes are such that this is fast, even for the c. 4,000 EBOV
genomes.

Extends the multiple inputs approach to supporting multiple species
(whose inputs are handled independently). This begins the first portion
of the workflow (gathering inputs, alignment) which is done per-species.

The default config is S3 PPX-open data only, following our guidelines
<nextstrain/public#36>. When running locally
(development, CI) you can add `--configfile defaults/config-local-inputs.yaml`
to easily use local OPEN & RESTRICTED data.

External analysis directories are supported for the default S3 inputs.
To use locally ingested data you'll need to essentially copy
`defaults/config-local-inputs.yaml` as a local `config.yaml`
override and update the paths.

Adds per-build subsampling configs. Previous builds' filtering configs
are replicated here in the newer subsampling YAML format. Build
'bdbv/2026' is new.

NOTE: We write out small-multiple subsampling configs (i.e.
per-build-pair) so that snakemake can know (via file contents hash) if
there are any changes for that specific subsampling job. The alternate
approach of writing out the entire (all-builds) `run_config.yaml` means
that any config changes re-run all the subsampling jobs.
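The effect described in this note can be illustrated with a small sketch that hashes each build's config separately. SHA-256 over sorted JSON is used here purely as a stand-in for whatever change detection Snakemake applies to the written files, and the build settings shown are made-up values.

```python
import hashlib
import json
from typing import Any, Dict


def content_hash(config: Any) -> str:
    """Stable SHA-256 of a JSON-serialisable config (keys sorted)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def per_build_hashes(all_builds: Dict[str, Any]) -> Dict[str, str]:
    """Hash each build's subsampling config independently."""
    return {build: content_hash(cfg) for build, cfg in all_builds.items()}


# Hypothetical config values, for illustration only.
before = {
    "ebov/all-outbreaks": {"max_sequences": 500},
    "bdbv/2026": {"max_sequences": 50},
}
after = {
    "ebov/all-outbreaks": {"max_sequences": 500},
    "bdbv/2026": {"max_sequences": 60},
}

old, new = per_build_hashes(before), per_build_hashes(after)
changed = [build for build in after if old[build] != new[build]]
print(changed)  # only the edited build differs
```

Hashing the whole `before` and `after` dictionaries instead would report a change for every build, which is the re-run-everything behaviour the note wants to avoid.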

I didn't exhaustively add per-build include/exclude text files as most
would be empty files. It should be straightforward to add new files (and
update the config) when the need arises.

Implements masking (optional), tree construction, re-rooting (optional) and refine steps.

Adds the various "annotation" functionality (ancestral, sampling year, traits)
found across builds into the new canonical workflow and exports these
as an Auspice dataset JSON.

The main functional change is to now consistently use `augur ancestral` to reconstruct
nextclade-translated AA sequences rather than `augur translate`.

Colors taken directly from https://github.com/nextstrain/ebola/blob/0a9401b6e0d4220cdbc3dcb564a3085c1e518864/config/colors.tsv

More colors need to be added for missing countries / divisions, and this also flags up some mis-spellings of division (e.g. 'Montserrado' is also spelt 'Montesserrado' and 'Monstserrado')
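Spelling variants like these could be flagged automatically with the standard library's `difflib`. The sketch below is only a suggestion, not something this commit does; the helper name and the 0.85 cutoff are assumptions, and 'Montserrado' is taken as the intended spelling.

```python
import difflib
from typing import Iterable, List


def likely_misspellings(canonical: str, observed: Iterable[str], cutoff: float = 0.85) -> List[str]:
    """Return observed names that are close to, but not equal to, `canonical`."""
    candidates = sorted(set(observed) - {canonical})
    # `n` must be positive, so fall back to 1 when there are no candidates.
    return difflib.get_close_matches(
        canonical, candidates, n=len(candidates) or 1, cutoff=cutoff
    )
```

For example, `likely_misspellings("Montserrado", divisions)` would pick up both variants mentioned above if they appear in the metadata's division column, while leaving unrelated names alone.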
Leverages the functionality of `augur export v2` to provide multiple JSONs
which are merged together, combined with a new (and hopefully intuitive)
UI where we can write an overlay JSON within the main config YAML.

The motivation is to stop maintaining a bunch of nearly-identical files.
I explored encoding the entire auspice-config-json within the config-yaml,
leveraging YAML anchors heavily, but it was still too verbose.
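The overlay idea can be pictured as a recursive merge of a small per-build dictionary onto a shared base. The sketch below shows one plausible set of merge rules under that assumption; it is not a description of how `augur export v2` combines its inputs, and the function name and example values are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict


def apply_overlay(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overlay` merged in.

    Nested dicts are merged key by key; any other value (including lists)
    in the overlay replaces the base value outright.
    """
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical values, for illustration only.
base = {"title": "Ebola", "display_defaults": {"color_by": "country", "map_triplicate": False}}
overlay = {"title": "BDBV", "display_defaults": {"color_by": "outbreak"}}
print(apply_overlay(base, overlay))
```

Only the keys that differ per build need to be written in the overlay; everything else is inherited from the shared base, which is the duplication this commit aims to remove.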
All functionality has been shifted to the main `phylogenetic/Snakefile`
workflow

The rename_jsons_to_reflect_urls config & rule can be removed once we
choose a new URL structure, but it's convenient for now.

Note that this dataset is a WIP and not yet deployed (uploaded)
so we have more scope to change the name again before we start
automating it.

From <nextstrain/shared#76>

(Adding via a patch commit simply to allow the ebola PR to be merged now)

Following feedback / discussion in <#64 (comment)>
and based on <nextstrain/measles@84014d3>
with changes.
@jameshadfield jameshadfield merged commit b235832 into main Jun 29, 2026
5 checks passed
@jameshadfield jameshadfield deleted the 2026-updates branch June 29, 2026 02:00

3 participants