2026 updates by jameshadfield · Pull Request #64 · nextstrain/ebola

jameshadfield · 2026-06-24T03:56:05Z

Replaces the various previous phylo workflows with a single phylogenetic workflow which produces these builds:

- ebov/all-outbreaks: subsamples genomes across outbreaks to present an overview of the known genomic history of Ebola virus (EBOV), formerly Zaïre ebolavirus. Outbreaks are classified using Nextclade. This dataset is kept up-to-date on nextstrain.org/ebola/all-outbreaks
- ebov/west-africa-2014: a work-in-progress workflow which produces a single analysis of the West Africa outbreak. Note: this is NOT the workflow which produced nextstrain.org/ebola/ebov-2013
- bdbv/all-outbreaks: This dataset is kept up-to-date on nextstrain.org/ebola/bdbv
- bdbv/2026: work-in-progress
- sudv/all-outbreaks: This dataset is kept up-to-date on nextstrain.org/ebola/sudv

See commit messages for more details & design intentions

Closes #60
Closes #61
Closes #37
Closes #27
Closes #29
Closes #17

How to run

Run ingest locally: `cd ingest; snakemake --cores 2 -pf`
Run the phylo workflow, using locally ingested PPX open & restricted data, to produce the 5 builds described above:

cd phylogenetic
snakemake --cores 4 --configfile defaults/config-local-inputs.yaml -pf

Open questions / work to do:

victorlin

Reviewed 765d880...6f02065

victorlin · 2026-06-24T20:59:35Z

+# certain commands (but not all) use SEARCH_PATHS
+SEARCH_PATHS = [workflow.basedir, os.getcwd()]


Re: 6f02065

Why set custom SEARCH_PATHS instead of using the default AUGUR_SEARCH_PATHS from config.smk?

I'll revisit this. Part of this was me not understanding what parts of augur use AUGUR_SEARCH_PATHS at the moment -- it's automatically set via the include:, and subsample uses it, but resolve_config_path doesn't yet. The other part was noticing that we can only specify a single fallback in the vendored resolve_config_path (i.e. analysis directory + single defaults_dir), and I wanted to use consistent search paths everywhere; this led me to defining SEARCH_PATHS = [workflow.basedir, os.getcwd()] for subsample and resolve_config_path(fname, workflow.basedir).

(As I type that I think the order of my SEARCH_PATHS is inverted!)
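To illustrate why the order matters, here is a minimal sketch of first-match-wins path resolution. The helper `find_in_search_paths` is hypothetical (it is not augur's code, nor the vendored `resolve_config_path`), and it assumes that the first directory containing the file wins, so the analysis directory would need to come before `workflow.basedir` for user files to take precedence:

```python
import tempfile
from pathlib import Path


def find_in_search_paths(relative_path, search_paths):
    """Return the first existing match for relative_path, or None.

    Hypothetical helper: assumes first-match-wins semantics.
    """
    for directory in search_paths:
        candidate = Path(directory) / relative_path
        if candidate.exists():
            return candidate
    return None


# Two temporary directories stand in for the analysis directory (CWD)
# and the workflow directory (workflow.basedir).
with tempfile.TemporaryDirectory() as analysis_dir, \
        tempfile.TemporaryDirectory() as workflow_dir:
    (Path(workflow_dir) / "config.yaml").write_text("source: defaults\n")
    (Path(analysis_dir) / "config.yaml").write_text("source: user override\n")

    # Analysis directory first: the user's file takes precedence.
    preferred = find_in_search_paths("config.yaml", [analysis_dir, workflow_dir])
    # Inverted order: the workflow default shadows the user's file.
    inverted = find_in_search_paths("config.yaml", [workflow_dir, analysis_dir])

    preferred_text = preferred.read_text()
    inverted_text = inverted.read_text()
```

With `[workflow.basedir, os.getcwd()]`, a file present in both places would resolve to the workflow's copy under this assumption, which is presumably the inversion noted above.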

We should make resolve_config_path use AUGUR_SEARCH_PATHS. Can you try with my draft of that at nextstrain/measles@84014d3? It's a patch on the shared config.smk file so you should be able to copy/paste into the vendored copy here.

I applied this patch and am going to merge it. I'm a little hesitant to be using this in production when we're considering things like

FIXME: rename to NEXTSTRAIN_SEARCH_PATHS?

but I also see the value in having real-world usage of functionality to give us confidence in the direction.

I added a couple of extra bits:

In the shared config.smk

# For simplicity, ensure search paths are unique (e.g. often the CWD == workflow.basedir)
search_paths = [p for idx,p in enumerate(search_paths) if search_paths.index(p)==idx]
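As a small standalone check of the de-duplication above, the sketch below runs the same list comprehension on made-up directory names and compares it with an order-preserving alternative based on `dict.fromkeys`. The helper name `unique_in_order` and the example paths are illustrative assumptions, not part of the shared `config.smk`:

```python
def unique_in_order(paths):
    """Drop repeated entries while keeping the first occurrence of each."""
    return list(dict.fromkeys(paths))


# Made-up directories; "/analysis" appears twice, as when CWD == workflow.basedir.
search_paths = ["/analysis", "/repo/phylogenetic", "/analysis", "/repo"]

# The list-comprehension form used in the patch (quadratic, fine for short lists).
via_index = [p for idx, p in enumerate(search_paths) if search_paths.index(p) == idx]

# Equivalent order-preserving form using dict keys (Python 3.7+).
via_dict = unique_in_order(search_paths)
```

Both forms compare entries as plain strings, so `/analysis` and `/analysis/` would still count as different paths unless they are normalised first.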

In Ebola's main snakefile:

print(
    "Relative filepaths will be searched for using the `AUGUR_SEARCH_PATHS`"
    " env variable, which has the following directories:"
    "\n\t" + "\n\t".join(os.environ["AUGUR_SEARCH_PATHS"].split(':')) + "\n",
    file=sys.stderr,
)

Also, have you considered preventing relative paths searching above (i.e. "../") these paths? Here's an example which I currently use in ebola:

We have a config-defined path of ../shared/zaire/reference.gb, added when I was using resolve_config_path(p, workflow.basedir). (Now, with your more extensive AUGUR_SEARCH_PATHS I could just use shared/zaire/reference.gb).

In an external analysis directory we're going to start by looking for 'shared' in the analysis directory's parent folder. I don't think we should be doing this?
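One possible way to express the restriction being asked about is to resolve the candidate and check that it still sits inside the search directory. This is only a sketch of the idea: `stays_within` is a hypothetical helper, not something that exists in augur or in the shared `config.smk`, and `/tmp/analysis` is a made-up analysis directory:

```python
from pathlib import Path


def stays_within(search_dir, relative_path):
    """True if search_dir/relative_path resolves to a location inside search_dir."""
    base = Path(search_dir).resolve()
    target = (base / relative_path).resolve()
    return target == base or base in target.parents


# The simplified path stays inside the (made-up) analysis directory ...
inside = stays_within("/tmp/analysis", "shared/zaire/reference.gb")
# ... while the "../" form escapes to the parent folder.
escapes = stays_within("/tmp/analysis", "../shared/zaire/reference.gb")
```

Because `resolve()` follows symlinks, a symlink inside the directory that points elsewhere would also be treated as escaping; whether that is desirable is a separate design question.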

joverlee521

Dropping first pass of review, will need to think on how resolve_config_path should work with the top level shared directory for nextstrain run.

joverlee521 · 2026-06-25T00:08:25Z

-        alignment = "results/{build}/aligned.fasta"
+        tree = "results/{species}/{build}/tree.nwk",
+        alignment = "results/{species}/{build}/subsampled.fasta", # unmasked
+        annotation = lambda w: config['ancestral'][f"{w.species}/{w.build}"]['annotation'],


Ran into MissingInputException when testing with nextstrain run. I tried wrapping this in resolve_config_path, but then ran into InvalidConfigError. The config path would need to be updated to ../../shared/{species}/reference.gb, which is not intuitive at all...

Thanks! Will sort out.

Fixed up by using resolve_config_path(<path>, workflow.basedir)({}) - the error you ran into is one of my motivations for what's being discussed in this other thread.

P.S. Be careful using {species} as there are mismatches around - including here - where the species wildcard is (e.g.) "sudv" but the directory is "shared/sudan".

Right, now I'm remembering why we do all the magic in resolve_config_path for users. With the default config parameter being ../shared/zaire/reference.gb, the user has to put their reference outside of their analysis directory or override the config parameter with a custom config. It doesn't "just work" for them to have all of the config files in their analysis directory.

The path that would be nice for the users (but definitely not intuitive for authors) is ebov/reference.gb and the workflow uses resolve_config_path(<path>, Path(workflow.basedir)/../shared).

(The mismatch of species in the config filenames is definitely confusing and should be fixed to be all ebov/sudv/bdbv.)
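Until the names are unified, the mismatch could be made explicit with a lookup instead of interpolating `{species}` directly into a path. The sketch below is hypothetical: the mapping holds only the two pairs that can be read off the paths in this thread (`ebov` with `shared/zaire`, `sudv` with `shared/sudan`), and the `reference.gb` filename for each directory is assumed from the pattern discussed above:

```python
# Only the pairs mentioned in this thread; other species are deliberately
# left out rather than guessed.
SPECIES_TO_SHARED_DIR = {
    "ebov": "zaire",
    "sudv": "sudan",
}


def shared_reference_path(species):
    """Build the shared reference path for a species wildcard value."""
    try:
        directory = SPECIES_TO_SHARED_DIR[species]
    except KeyError:
        raise ValueError(f"No shared directory known for species {species!r}") from None
    return f"shared/{directory}/reference.gb"


sudv_path = shared_reference_path("sudv")
```

Failing loudly on an unknown species avoids silently building a path such as `shared/sudv/reference.gb` that does not exist.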

I applied Victor's patch from above which now scans multiple directories, including the base ebola (repo root) directory, so we could slightly simplify these paths going forward.

Optionally use the `inputs` param to define multiple inputs for the workflow, backwards compatible with existing profiles. Mainly motivated by the need for defining inputs for the forthcoming open builds, but it's nice to support multiple inputs! Use of `inputs.lineage` inspired by ebola's `inputs.species` added in <nextstrain/ebola#64>. Note this does _not_ support inputs for titer data. That has too much complexity with lineage/center/passage/assay, and I did not want it to block this work.

The phylo workflows will be refactored to start from multiple inputs which motivated these two big changes. Having partitioned open/restricted data sources forces us to use multiple inputs in our automated builds which improves the reliability of that feature. Allowing injection of private data (via multiple inputs) in phylo workflows also makes it cleaner to run nextclade (for clade annotation) _after_ the data merging so that all data has the annotations. The dataset sizes are such that this is fast, even for the c. 4,000 EBOV genomes.

Extends the multiple inputs approach to supporting multiple species (whose inputs are handled independently). This begins the first portion of the workflow (gathering inputs, alignment) which is done per-species. The default config is S3 PPX-open data only, following our guidelines <nextstrain/public#36>. When running locally (development, CI) you can add `--configfile defaults/config-local-inputs.yaml` to easily use local OPEN & RESTRICTED data. External analysis directories are supported for the default S3 inputs. To use locally ingested data you'll need to essentially copy `defaults/config-local-inputs.yaml` as a local `config.yaml` override and update the paths.

Adds per-build subsampling configs. Previous builds' filtering configs are replicated here in the newer subsampling YAML format. Build 'bdbv/2026' is new. NOTE: We write out small-multiple subsampling configs (i.e. per-build-pair) so that snakemake can know (via file contents hash) if there are any changes for that specific subsampling job. The alternate approach of writing out the entire (all-builds) `run_config.yaml` means that any config changes re-run all the subsampling jobs. I didn't exhaustively add per-build include/exclude text files as most would be empty files. It should be straightforward to add new files (and update the config) when the need arises.
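The per-build approach can be sketched as follows. This is an illustration under stated assumptions rather than the workflow's actual rule: `write_per_build_configs` is a hypothetical helper, the settings are placeholders, and JSON is used only to keep the example dependency-free (the workflow itself writes YAML):

```python
import json
import tempfile
from pathlib import Path


def write_per_build_configs(subsample_configs, out_dir):
    """Write one file per build so a change to one build touches only its file."""
    out_dir = Path(out_dir)
    written = {}
    for build, config in subsample_configs.items():
        # "ebov/all-outbreaks" -> <out_dir>/ebov/all-outbreaks.json
        path = out_dir / f"{build}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(config, indent=2, sort_keys=True) + "\n")
        written[build] = path
    return written


# Placeholder settings, not the real subsampling configs.
configs = {
    "ebov/all-outbreaks": {"samples": {"all": {"max_sequences": 100}}},
    "bdbv/2026": {"samples": {"all": {"max_sequences": 50}}},
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_per_build_configs(configs, tmp)
    before = {build: path.read_bytes() for build, path in paths.items()}

    # Change only one build, then rewrite every file.
    configs["bdbv/2026"]["samples"]["all"]["max_sequences"] = 60
    paths = write_per_build_configs(configs, tmp)
    after = {build: path.read_bytes() for build, path in paths.items()}

    # Only the edited build's file has different contents.
    changed = sorted(build for build in configs if before[build] != after[build])
```

Since only the edited build's file changes, a content-based rerun check would leave the other builds' subsampling jobs untouched, whereas a single all-builds file would change for every job.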

Adds the various "annotation" functionality (ancestral, sampling year, traits) found across builds into the new canonical workflow and exports these as an Auspice dataset JSON. The main functional change is to now consistently use `augur ancestral` to reconstruct nextclade-translated AA sequences rather than `augur translate`.

Colors taken directly from <https://github.com/nextstrain/ebola/blob/0a9401b6e0d4220cdbc3dcb564a3085c1e518864/config/colors.tsv>. More colors need to be added for missing countries / divisions, and this also flags up some misspellings of division (e.g. 'Montserrado' is also spelt 'Montesserrado' and 'Monstserrado').

Leverages the functionality of `augur export v2` to provide multiple JSONs which are merged together, combined with a new (and hopefully intuitive) UI where we can write an overlay JSON within the main config YAML. The motivation is to stop maintaining a bunch of nearly-identical files. I explored encoding the entire auspice-config-json within the config-yaml, leveraging YAML anchors heavily, but it was still too verbose.
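The overlay idea can be pictured as a recursive merge of a small build-specific dictionary onto a shared base config. The sketch below is an assumption-laden illustration: `apply_overlay` is a hypothetical helper, the merge rule (dicts merge key by key, everything else is replaced) is chosen for the example and may differ from how `augur export v2` combines its inputs, and the titles are placeholders:

```python
import copy


def apply_overlay(base, overlay):
    """Recursively merge overlay into a copy of base.

    Dicts are merged key by key; any other value in overlay replaces
    the value in base. This rule is assumed for illustration.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared settings that would otherwise be repeated in near-identical files.
base_config = {
    "title": "Base title",
    "display_defaults": {"color_by": "country", "map_triplicate": False},
}
# A build-specific overlay only states what differs.
overlay = {
    "title": "Build-specific title",
    "display_defaults": {"color_by": "division"},
}

result = apply_overlay(base_config, overlay)
```

Here `result` keeps `map_triplicate` from the base while taking `title` and `color_by` from the overlay, and `base_config` itself is left unmodified.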

jameshadfield linked an issue Jun 24, 2026 that may be closed by this pull request

phylogenetic: Support multiple inputs #60

Closed

jameshadfield mentioned this pull request Jun 24, 2026

Add 2025 outbreak dataset #19

Closed

8 tasks

victorlin reviewed Jun 24, 2026

joverlee521 mentioned this pull request Jun 25, 2026

remote_files: Add nextstrain-staging to PUBLIC_BUCKETS nextstrain/shared#76

Open

2 tasks

joverlee521 reviewed Jun 25, 2026

jameshadfield force-pushed the 2026-updates branch from 70d96da to 7fd3f3a Compare June 25, 2026 04:15

jameshadfield added 19 commits June 29, 2026 11:06

update gitignore

a3edffe

[CI] remove daily ingest CI runs

5dd3d4a

Since we are running ingest-to-phylo workflows daily these are unnecessary.

[CI] remove unused phylo CI

9356f47

and related (empty!) example data files. We can add these back if we add a working phylo CI workflow, however with ingest-to-phylo running daily we already have quick feedback when a workflow breaks.

[phylo] Add multiple inputs rule file

ac8c2c1

Taken without changes from <https://github.com/nextstrain/measles/blob/7f46d96d65012861edab29a5eb5b2775102dddc1/phylogenetic/rules/merge_inputs.smk>

[phylo] align with nextclade, add metadata columns

5e1dd6f

[phylo] trees

6a571c8

Implements masking (optional), tree construction, re-rooting (optional) and refine steps.

[phylo] remove old workflows

097f61c

All functionality has been shifted to the main `phylogenetic/Snakefile` workflow

[phylo] signal nextstrain run compatibility

37976e7

[phylo] Upload auspice datasets

c4f590e

The rename_jsons_to_reflect_urls config & rule can be removed once we choose a new URL structure, but it's convenient for now.

[phylo] Use "bdbv/drc-uganda-2026"

3f80f06

Note that this dataset is a WIP and not yet deployed (uploaded) so we have more scope to change the name again before we start automating it.

Update README

cf8343c

[shared] Add nextstrain-staging to PUBLIC_BUCKETS

205e453

From <nextstrain/shared#76> (Adding via a patch commit simply to allow the ebola PR to be merged now)

[shared] switch to AUGUR_SEARCH_PATHS

f8e2abd

Following feedback / discussion in <#64 (comment)> and based on <nextstrain/measles@84014d3> with changes.

jameshadfield force-pushed the 2026-updates branch from 7fd3f3a to f8e2abd Compare June 29, 2026 01:51

jameshadfield merged commit b235832 into main Jun 29, 2026
5 checks passed

jameshadfield deleted the 2026-updates branch June 29, 2026 02:00

jameshadfield mentioned this pull request Jun 29, 2026

Improve ability to use private data #66

Open
