2026 updates #64
    # certain commands (but not all) use SEARCH_PATHS
    SEARCH_PATHS = [workflow.basedir, os.getcwd()]
Re: 6f02065
Why set custom SEARCH_PATHS instead of using the default AUGUR_SEARCH_PATHS from config.smk?
I'll revisit this. Part of this was me not understanding what parts of augur use AUGUR_SEARCH_PATHS at the moment -- it's automatically set via the include:, and subsample uses it, but resolve_config_path doesn't yet. The other part was noticing that we can only specify a single fallback in the vendored resolve_config_path (i.e. analysis directory + single defaults_dir), and I wanted to use consistent search paths everywhere; this led me to defining SEARCH_PATHS = [workflow.basedir, os.getcwd()] for subsample and resolve_config_path(fname, workflow.basedir).
(As I type that I think the order of my SEARCH_PATHS is inverted!)
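A minimal sketch of why the order of the search paths matters, assuming a first-match-wins lookup. `resolve_in_search_paths` is a hypothetical helper written for illustration; it is not the vendored `resolve_config_path`, and the two temporary directories only stand in for the analysis directory (CWD) and `workflow.basedir`.

```python
import os
import tempfile
from pathlib import Path


def resolve_in_search_paths(fname, search_paths):
    """Return the first existing match for a relative `fname`, else raise.

    Hypothetical helper for illustration only.
    """
    if os.path.isabs(fname):
        return fname
    for directory in search_paths:
        candidate = Path(directory) / fname
        if candidate.exists():
            return str(candidate)
    searched = [str(p) for p in search_paths]
    raise FileNotFoundError(f"{fname!r} not found in any of: {searched}")


# Two throwaway directories stand in for the analysis directory (CWD)
# and the workflow directory (workflow.basedir).
with tempfile.TemporaryDirectory() as analysis_dir, \
        tempfile.TemporaryDirectory() as workflow_dir:
    for d in (analysis_dir, workflow_dir):
        Path(d, "reference.gb").write_text(f"from {d}")

    # Analysis directory first: the user's copy overrides the workflow default.
    resolved = resolve_in_search_paths("reference.gb", [analysis_dir, workflow_dir])
    analysis_wins = Path(resolved).parent == Path(analysis_dir)

    # Inverted order: the workflow default shadows the user's copy.
    inverted = resolve_in_search_paths("reference.gb", [workflow_dir, analysis_dir])
    workflow_wins = Path(inverted).parent == Path(workflow_dir)
```

With `[workflow.basedir, os.getcwd()]` the workflow's own file is found first, so a user's file of the same name in an external analysis directory would never be picked up.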
We should make resolve_config_path use AUGUR_SEARCH_PATHS. Can you try with my draft of that at nextstrain/measles@84014d3? It's a patch on the shared config.smk file so you should be able to copy/paste into the vendored copy here.
I applied this patch and am going to merge it. I'm a little hesitant to be using this in production when we're considering things like `FIXME: rename to NEXTSTRAIN_SEARCH_PATHS?` but I also see the value in having real-world usage of functionality to give us confidence in the direction.
I added a couple of extra bits:
- In the shared config.smk:

      # For simplicity, ensure search paths are unique (e.g. often the CWD == workflow.basedir)
      search_paths = [p for idx, p in enumerate(search_paths) if search_paths.index(p) == idx]

- In Ebola's main snakefile:

      print("Relative filepaths will be searched for using the `AUGUR_SEARCH_PATHS`"
            " env variable, which has the following directories:"
            "\n\t" + "\n\t".join(os.environ["AUGUR_SEARCH_PATHS"].split(':')) + "\n",
            file=sys.stderr)
Also, have you considered preventing relative paths from searching above (i.e. "../") these paths? Here's an example which I currently use in ebola:
- We have a config-defined path of ../shared/zaire/reference.gb, added when I was using resolve_config_path(p, workflow.basedir). (Now, with your more extensive AUGUR_SEARCH_PATHS, I could just use shared/zaire/reference.gb.)
- In an external analysis directory we're going to start by looking for 'shared' in the analysis directory's parent folder. I don't think we should be doing this?
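One way such a guard could look, as a rough sketch: a purely lexical check that rejects relative paths which climb above the directory they are resolved against. `escapes_search_path` is a hypothetical name, the check does not follow symlinks, and nothing here is taken from the shared config.smk.

```python
from pathlib import PurePosixPath


def escapes_search_path(relative_path):
    """True if a relative path climbs above its starting directory via '..'.

    Lexical check only: symlinks are not resolved. Hypothetical helper.
    """
    depth = 0
    for part in PurePosixPath(relative_path).parts:
        if part == "..":
            depth -= 1
            if depth < 0:
                # We have stepped outside the directory we started in.
                return True
        else:
            depth += 1
    return False


# The config path discussed above would be rejected ...
blocked = escapes_search_path("../shared/zaire/reference.gb")
# ... while the simplified path stays inside the search directory.
allowed = not escapes_search_path("shared/zaire/reference.gb")
```

A path such as `shared/../shared/zaire/reference.gb` still passes, since it never leaves the starting directory.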
joverlee521 left a comment
Dropping first pass of review, will need to think on how resolve_config_path should work with the top level shared directory for nextstrain run.
    alignment = "results/{build}/aligned.fasta"
    tree = "results/{species}/{build}/tree.nwk",
    alignment = "results/{species}/{build}/subsampled.fasta", # unmasked
    annotation = lambda w: config['ancestral'][f"{w.species}/{w.build}"]['annotation'],
Ran into MissingInputException when testing with nextstrain run. I tried wrapping this in resolve_config_path, but then ran into InvalidConfigError. The config path would need to be updated to ../../shared/{species}/reference.gb, which is not intuitive at all...
Thanks! Will sort out.
Fixed up by using resolve_config_path(<path>, workflow.basedir)({}) - the error you ran into is one of my motivations for what's being discussed in this other thread.
P.S. Be careful using {species} as there are mismatches around - including here - where the species wildcard is (e.g.) "sudv" but the directory is "shared/sudan".
Right, now I'm remembering why we do all the magic in resolve_config_path for users. With the default config parameter being ../shared/zaire/reference.gb, the user has to put their reference outside of their analysis directory or override the config parameter with a custom config. It doesn't "just work" for them to have all of the config files in their analysis directory.
The path that would be nice for the users (but definitely not intuitive for authors) is ebov/reference.gb and the workflow uses resolve_config_path(<path>, Path(workflow.basedir)/../shared).
(The mismatch of species in the config filenames is definitely confusing and should be fixed to be all ebov/sudv/bdbv.)
I applied Victor's patch from above which now scans multiple directories, including the base ebola (repo root) directory, so we could slightly simplify these paths going forward.
70d96da to 7fd3f3a
Optionally use the `inputs` param to define multiple inputs for the workflow, backwards compatible with existing profiles. Mainly motivated by the need for defining inputs for the forthcoming open builds, but it's nice to support multiple inputs! Use of `inputs.lineage` inspired by ebola's `inputs.species` added in <nextstrain/ebola#64> Note this does _not_ support inputs for titer data. That has too much complexity with lineage/center/passage/assay that I did not want to block this work.
Since we are running ingest-to-phylo workflows daily these are unnecessary.
and related (empty!) example data files. We can add these back if we add a working phylo CI workflow, however with ingest-to-phylo running daily we already have quick feedback when a workflow breaks.
The phylo workflows will be refactored to start from multiple inputs which motivated these two big changes. Having partitioned open/restricted data sources forces us to use multiple inputs in our automated builds which improves the reliability of that feature. Allowing injection of private data (via multiple inputs) in phylo workflows also makes it cleaner to run nextclade (for clade annotation) _after_ the data merging so that all data has the annotations. The dataset sizes are such that this is fast, even for the c. 4,000 EBOV genomes.
Extends the multiple inputs approach to supporting multiple species (whose inputs are handled independently). This begins the first portion of the workflow (gathering inputs, alignment) which is done per-species. The default config is S3 PPX-open data only, following our guidelines <nextstrain/public#36>. When running locally (development, CI) you can add `--configfile defaults/config-local-inputs.yaml` to easily use local OPEN & RESTRICTED data. External analysis directories are supported for the default S3 inputs. To use locally ingested data you'll need to essentially copy `defaults/config-local-inputs.yaml` as a local `config.yaml` override and update the paths.
Adds per-build subsampling configs. Previous builds' filtering configs are replicated here in the newer subsampling YAML format. Build 'bdbv/2026' is new. NOTE: We write out small-multiple subsampling configs (i.e. per-build-pair) so that snakemake can know (via file contents hash) if there are any changes for that specific subsampling job. The alternate approach of writing out the entire (all-builds) `run_config.yaml` means that any config changes re-run all the subsampling jobs. I didn't exhaustively add per-build include/exclude text files as most would be empty files. It should be straightforward to add new files (and update the config) when the need arises.
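A rough sketch of the per-build idea, under stated assumptions: the helper names, the output filename, and the `max_sequences` setting are all invented for illustration, and JSON from the standard library stands in for the YAML the workflow writes. Skipping the write when nothing changed is an extra assumption of this sketch, not something the commit describes.

```python
import json
import tempfile
from pathlib import Path


def write_if_changed(path, text):
    """Write `text` to `path` only when the contents differ; return True if written."""
    path = Path(path)
    if path.exists() and path.read_text() == text:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return True


def write_per_build_configs(subsampling, out_dir):
    """Write one small config file per build; return the builds whose file changed."""
    changed = []
    for build, build_config in subsampling.items():
        text = json.dumps(build_config, indent=2, sort_keys=True)
        # Hypothetical layout: <out_dir>/<species>/<build>/subsample_config.json
        target = Path(out_dir) / build / "subsample_config.json"
        if write_if_changed(target, text):
            changed.append(build)
    return changed


# Build names follow the PR; the settings are invented placeholders.
subsampling = {
    "ebov/all-outbreaks": {"max_sequences": 500},
    "bdbv/2026": {"max_sequences": 100},
}
with tempfile.TemporaryDirectory() as out_dir:
    first_run = write_per_build_configs(subsampling, out_dir)
    subsampling["bdbv/2026"]["max_sequences"] = 200
    second_run = write_per_build_configs(subsampling, out_dir)
```

Editing one build's settings touches only that build's file, so only the matching subsampling job has a changed input.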
Implements masking (optional), tree construction, re-rooting (optional) and refine steps.
Adds the various "annotation" functionality (ancestral, sampling year, traits) found across builds into the new canonical workflow and exports these as an Auspice dataset JSON. The main functional change is to now consistently use `augur ancestral` to reconstruct nextclade-translated AA sequences rather than `augur translate`.
Colors taken directly from <https://github.com/nextstrain/ebola/blob/0a9401b6e0d4220cdbc3dcb564a3085c1e518864/config/colors.tsv>. More colors need to be added for missing countries / divisions. This also flags up some misspellings of division names (e.g. 'Montserrado' is also spelt 'Montesserrado' and 'Monstserrado').
Leverages the functionality of `augur export v2` to provide multiple JSONs which are merged together, combined with a new (and hopefully intuitive) UI where we can write an overlay JSON within the main config YAML. The motivation is to stop maintaining a bunch of nearly-identical files. I explored encoding the entire auspice-config-json within the config-yaml, leveraging YAML anchors heavily, but it was still too verbose.
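To illustrate the overlay idea only, here is a generic recursive merge in which the overlay's values win. This is an assumption-laden sketch: `overlay` is a hypothetical helper, the example values are invented, and the merge rules `augur export v2` actually applies (for lists in particular) may differ.

```python
import copy


def overlay(base, patch):
    """Recursively merge `patch` onto a copy of `base`; patch values win.

    Generic illustration, not augur's merge implementation.
    """
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Invented example values; only the general shape matters here.
base_config = {
    "title": "Base title",
    "display_defaults": {"color_by": "country", "map_triplicate": False},
}
build_overlay = {
    "title": "Build-specific title",
    "display_defaults": {"color_by": "outbreak"},
}
merged_config = overlay(base_config, build_overlay)
```

Each build then needs only the handful of keys that differ from the shared base, rather than a near-identical copy of the whole file.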
All functionality has been shifted to the main `phylogenetic/Snakefile` workflow
The rename_jsons_to_reflect_urls config & rule can be removed once we choose a new URL structure, but it's convenient for now.
Note that this dataset is a WIP and not yet deployed (uploaded) so we have more scope to change the name again before we start automating it.
From <nextstrain/shared#76> (Adding via a patch commit simply to allow the ebola PR to be merged now)
Following feedback / discussion in <#64 (comment)> and based on <nextstrain/measles@84014d3> with changes.
7fd3f3a to f8e2abd
Replaces the various previous phylo workflows with a single phylogenetic workflow which produces these builds:
- ebov/all-outbreaks: subsamples genomes across outbreaks to present an overview of the known genomic history of Ebola virus (EBOV), formerly Zaïre ebolavirus. Outbreaks are classified using Nextclade. This dataset is kept up-to-date on nextstrain.org/ebola/all-outbreaks
- ebov/west-africa-2014: a work-in-progress workflow which produces a single analysis of the West Africa outbreak. Note: this is NOT the workflow which produced nextstrain.org/ebola/ebov-2013
- bdbv/all-outbreaks: this dataset is kept up-to-date on nextstrain.org/ebola/bdbv
- bdbv/2026: work-in-progress
- sudv/all-outbreaks: this dataset is kept up-to-date on nextstrain.org/ebola/sudv
See commit messages for more details & design intentions
Closes #60
Closes #61
Closes #37
Closes #27
Closes #29
Closes #17
How to run
    cd ingest; snakemake --cores 2 -pf
    cd phylogenetic; snakemake --cores 4 --configfile defaults/config-local-inputs.yaml -pf

Open questions / work to do:
- accession column
- Issue: Improve ability to use private data #66