From 212da4f5fd4154ab5d6b03e56f2b70502d45159a Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 10:21:09 +0000 Subject: [PATCH 1/2] Initial plan From d5787bd035f0181ef7d97452d7f889129d34a5cc Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 10:33:54 +0000 Subject: [PATCH 2/2] docs: add PyMC and inversion refactor plan Agent-Logs-Url: https://github.com/openghg/openghg_inversions/sessions/693f8709-71e5-4449-8146-8b12710bd409 Co-authored-by: brendan-m-murphy <11130776+brendan-m-murphy@users.noreply.github.com> --- PYMC_AND_INVERSION_REFACTOR_PLAN.md | 253 ++++++++++++++++++++++++++++ 1 file changed, 253 insertions(+) create mode 100644 PYMC_AND_INVERSION_REFACTOR_PLAN.md diff --git a/PYMC_AND_INVERSION_REFACTOR_PLAN.md b/PYMC_AND_INVERSION_REFACTOR_PLAN.md new file mode 100644 index 00000000..3b9aad9e --- /dev/null +++ b/PYMC_AND_INVERSION_REFACTOR_PLAN.md @@ -0,0 +1,253 @@ +# PyMC and inversion pipeline refactor plan + +Draft wiki content for [issue #392](https://github.com/openghg/openghg_inversions/issues/392). + +This page is intended to do three things: + +1. summarise what has already changed in `openghg_inversions`, +2. show how the main refactor threads fit together, +3. give a short map from legacy/temporary paths to their intended replacements. + +## Executive summary + +The current work is best understood as two umbrella refactors plus two related future capability tracks: + +- [#370](https://github.com/openghg/openghg_inversions/issues/370): refactor PyMC model construction so the current fixed-basis inversion model is built from reusable functions and canonical xarray inputs; +- [#360](https://github.com/openghg/openghg_inversions/issues/360): refactor basis-function handling and output plumbing so basis objects can be retained, saved, and eventually used directly in post-processing; +- [#359](https://github.com/openghg/openghg_inversions/pull/359) and [#337](https://github.com/openghg/openghg_inversions/issues/337): high-resolution inner-domain / 6 km support, which currently collides with the refactors above and should be aligned with them rather than merged independently; +- [#338](https://github.com/openghg/openghg_inversions/issues/338): incorporation of CO2 / multi-sector work, likely drawing on . + +The main PyMC refactor is already well advanced: + +- Stage A ([#378](https://github.com/openghg/openghg_inversions/pull/378)) added regression coverage and extracted model construction from `inferpymc`. +- Stage B ([#380](https://github.com/openghg/openghg_inversions/pull/380)) let `inferpymc` consume xarray-based inversion inputs. +- Stage C ([#382](https://github.com/openghg/openghg_inversions/pull/382)) introduced `openghg_inversions.models` and reusable model components. +- Stage D ([#389](https://github.com/openghg/openghg_inversions/pull/389)) made the new builder the active runtime path. +- Stage E follow-up tidy-up ([#391](https://github.com/openghg/openghg_inversions/pull/391)) is making the modern/legacy boundary explicit rather than mixing the two concerns. + +The main basis/output refactor is partly complete: + +- `BasisFunctions` / `BasisOperator` were introduced in [#358](https://github.com/openghg/openghg_inversions/pull/358). +- Wrapper/runtime retention and DataTree save support were added in [#367](https://github.com/openghg/openghg_inversions/pull/367). +- A compatibility post-processing output mode was added in [#361](https://github.com/openghg/openghg_inversions/pull/361). +- `fixedbasisMCMC` output handling was cleaned up in [#390](https://github.com/openghg/openghg_inversions/pull/390). + +The practical short-term direction is: + +- keep shrinking the legacy `inferpymc` compatibility surface, +- add a parallel modern inversion pathway that works directly with canonical inputs and `InferenceData`, +- avoid landing major new model variants only on the legacy path, +- prioritise incorporation of multi-sector code from the CO2 work once the modern path is ready. + +## How the workstreams fit together + +### 1. PyMC model refactor (`#370`) + +The goal of [#370](https://github.com/openghg/openghg_inversions/issues/370) is not to add a new scientific model directly. It is to separate model construction, sampling, and compatibility logic so the current fixed-basis model can be rebuilt cleanly and future models can be added with less duplication. + +Completed and in-progress stages: + +| Stage | Issue / PR | Status | Main result | +| --- | --- | --- | --- | +| A | [#371](https://github.com/openghg/openghg_inversions/issues/371) / [#378](https://github.com/openghg/openghg_inversions/pull/378) | done | Added regression tests and extracted `build_inferpymc_model(...)` from `inferpymc(...)`. | +| B | [#372](https://github.com/openghg/openghg_inversions/issues/372) / [#380](https://github.com/openghg/openghg_inversions/pull/380) | done | Added dataset/xarray-native inputs for `inferpymc(...)`. | +| C | [#373](https://github.com/openghg/openghg_inversions/issues/373) / [#382](https://github.com/openghg/openghg_inversions/pull/382) | done | Introduced `openghg_inversions.models` with reusable coords, priors, and component helpers. | +| D | [#374](https://github.com/openghg/openghg_inversions/issues/374) / [#389](https://github.com/openghg/openghg_inversions/pull/389) | done | Switched the active runtime to the new builder and dataset-first path. | +| E | [#375](https://github.com/openghg/openghg_inversions/issues/375) / [#391](https://github.com/openghg/openghg_inversions/pull/391) | open PR | Tidies the modern/legacy boundary and keeps `InferenceData` central. | + +Key architectural direction after Stage D/E: + +- canonical inversion inputs come from `make_inv_inputs(...)`, +- model construction should use canonical dims/coords (`region`, `bc_region`, etc.), +- modern sampling should work in terms of `InferenceData`, +- legacy naming and dict-shaped results should be confined to explicit compatibility adapters. + +### 2. Basis functions and output refactor (`#360`) + +[Issue #360](https://github.com/openghg/openghg_inversions/issues/360) is the parallel basis/output workstream. It matters because future multi-sector and inner-domain workflows need richer basis objects than the legacy flat-basis path. + +Completed and planned slices: + +| Slice | PR / issue | Status | Main result | +| --- | --- | --- | --- | +| Basis objects introduced | [#358](https://github.com/openghg/openghg_inversions/pull/358) | done | Added `BasisFunctions` and `BasisOperator`, including DataTree support and multi-source-aware operators. | +| Wrapper retention / dual-format save | [#367](https://github.com/openghg/openghg_inversions/pull/367) | done | `basis_functions_wrapper` can now optionally return basis objects and save DataTree artifacts. | +| Legacy-format post-processing reproduction | [#361](https://github.com/openghg/openghg_inversions/pull/361) | done | New post-processing code can reproduce the legacy HBMCMC-style output format. | +| `fixedbasisMCMC` output cleanup | [#376](https://github.com/openghg/openghg_inversions/issues/376) / [#390](https://github.com/openghg/openghg_inversions/pull/390) | done | Split output handling into clearer stages without changing inversion math. | +| Operator-backed post-processing default | `#360` PR-4/5 | not done | Still to come; current post-processing still retains legacy reconstruction in places. | + +### 3. High-resolution inner-region work (`#359`, `#337`) + +The current wiki page on high-resolution inner regions is here: + +- + +That note recommends **not** merging [PR #359](https://github.com/openghg/openghg_inversions/pull/359) as-is. The main reasons are architectural: + +- it cuts across the basis refactor in `#360`, +- it adds new model logic to a PyMC path that is already being replaced by `#370`, +- it overloads one-domain assumptions instead of introducing explicit inner/outer structure. + +The recommendation from that note remains sound: + +- continue with forward-model validation and clean data-structure work, +- align final inversion integration with the new builder path, +- give the inner state vector and basis objects their own explicit representation. + +### 4. CO2 / multi-sector integration (`#338` and RHIME fork) + +[Issue #338](https://github.com/openghg/openghg_inversions/issues/338) is still the key placeholder for integrating the CO2 pipeline and multi-sector inversions. This is also the main reason the current refactors matter. + +The external codebase that should inform this work is: + +- + +The expected landing zone in `openghg_inversions` is: + +- basis handling that can represent multiple sector-specific basis partitions, +- model building that can add multiple flux/source components without reworking `inferpymc` monolithically, +- post-processing that can consume canonical datasets / `InferenceData` rather than hard-coded legacy arrays. + +## Related stub / overview issues + +These issues are still useful as signposts, but most of the concrete progress has now happened in `#360` and `#370`. + +| Issue | Current interpretation | +| --- | --- | +| [#337](https://github.com/openghg/openghg_inversions/issues/337) | pipeline support for high-spatial-resolution footprints and inner domains | +| [#338](https://github.com/openghg/openghg_inversions/issues/338) | integrate CO2 / multi-sector inversion workflow | +| [#339](https://github.com/openghg/openghg_inversions/issues/339) | break `hbmcmc` functionality into more reusable pieces; recent comment notes that progress has effectively moved into `#370` | +| [#340](https://github.com/openghg/openghg_inversions/issues/340) | basis-function updates, including shape constraints and helper classes | + +## 2026 change log for this refactor area + +This is a selective summary of Brendan Murphy PRs since the start of 2026, grouped by relevance. + +### Directly relevant to the PyMC / basis / pipeline redesign + +| PR | Summary | +| --- | --- | +| [#356](https://github.com/openghg/openghg_inversions/pull/356) | Reimplemented `make_inv_inputs(...)` using newer xarray-based methods and preserved legacy behaviour with tests. | +| [#358](https://github.com/openghg/openghg_inversions/pull/358) | Added `BasisFunctions` / `BasisOperator` abstractions. | +| [#361](https://github.com/openghg/openghg_inversions/pull/361) | Reproduced legacy HBMCMC output format using the newer post-processing pipeline. | +| [#365](https://github.com/openghg/openghg_inversions/pull/365) | Fixed the monthly `sigma_freq_index` indexing bug when months are missing. | +| [#367](https://github.com/openghg/openghg_inversions/pull/367) | Added basis wrapper object retention and dual-format basis artifact saving. | +| [#378](https://github.com/openghg/openghg_inversions/pull/378) | Stage A of the PyMC refactor. | +| [#380](https://github.com/openghg/openghg_inversions/pull/380) | Stage B of the PyMC refactor. | +| [#382](https://github.com/openghg/openghg_inversions/pull/382) | Stage C of the PyMC refactor. | +| [#389](https://github.com/openghg/openghg_inversions/pull/389) | Stage D of the PyMC refactor. | +| [#390](https://github.com/openghg/openghg_inversions/pull/390) | Output cleanup linked to the basis/post-processing workstream. | +| [#391](https://github.com/openghg/openghg_inversions/pull/391) | Stage E follow-up tidy-up; open at time of writing. | + +### Ancillary 2026 PRs reviewed + +These are not central to the roadmap below, but they were reviewed as part of this issue because they may affect surrounding workflow or release context: + +- [#349](https://github.com/openghg/openghg_inversions/pull/349): dependency pins so `uv` installs work, +- [#350](https://github.com/openghg/openghg_inversions/pull/350): release `v0.6.0`, +- [#352](https://github.com/openghg/openghg_inversions/pull/352): merge release changes back into `devel`, +- [#353](https://github.com/openghg/openghg_inversions/pull/353): PARIS output bugfix for in situ data, +- [#355](https://github.com/openghg/openghg_inversions/pull/355): earlier `make_inv_inputs(...)` extraction, +- [#363](https://github.com/openghg/openghg_inversions/pull/363): superseded precursor to `#367`. + +## Legacy-to-replacement map + +This table is intentionally pragmatic rather than formal. Some legacy paths are not fully deprecated yet, but this is the current direction of travel. + +| Legacy / temporary concept | Current or intended replacement | Notes | +| --- | --- | --- | +| Monolithic `inferpymc(...)` model construction | `build_inferpymc_model(...)` + reusable helpers in `openghg_inversions.models` | Stages A-C established this split. | +| NumPy-array-first `inferpymc(...)` inputs | dataset-first `inv_inputs` from `make_inv_inputs(...)` | Stage D makes the dataset-first path the main runtime. | +| Temporary Stage C `model_builder="legacy"` vs `"components"` switch | single builder path from `openghg_inversions.models` | The dual-builder experiment was transitional and removed in Stage D. | +| Legacy trace renaming inside model construction | explicit compatibility renaming in `_rename_trace_for_legacy_inferpymc(...)` | This is the main Stage E clean-up. | +| Transitional sampling wrapper object | direct `InferenceData` return from the modern sampling helper | Stage E removes the wrapper. | +| `inferpymc_postprocessouts` as the only way to get legacy-format HBMCMC outputs | new post-processing compatibility path (`hbmcmc_postprocessing`, `make_legacy_hbmcmc_output_from_postprocessing`) | Added in `#361`; old path can be retired later. | +| Ad hoc flat-basis helper logic as the only basis representation | `BasisFunctions` / `BasisOperator` | Flat basis files still exist for compatibility. | +| Legacy flat basis save format as the only persisted form | dual support for legacy flat format and DataTree artifacts | Legacy is still the default writer for now. | +| `fixedbasisMCMC` as the place where all runtime concerns are mixed together | clearer separation of input prep, inference, artifact creation, and post-processing | `#390` is structural cleanup, not the final state. | + +## What is still intentionally legacy + +Even after the recent refactors, the following still exist mainly for compatibility: + +- `inferpymc(...)` still returns the legacy dict-like result shape, +- `fixedbasisMCMC(...)` still drives that compatibility path, +- legacy post-processing still expects `xouts`, `sigouts`, `Ytrace`, `YBCtrace`, and related fields in some places, +- `step1`, `step2`, and custom `"convergence"` metadata still survive for compatibility, +- basis handling still falls back to legacy reconstruction paths in post-processing, +- `sigma_freq_index` is still created in `make_inv_inputs(...)` instead of living entirely inside sigma-component logic. + +## Priority open loops + +The main open loops from the issue discussion, PR reviews, and the notes on [#391](https://github.com/openghg/openghg_inversions/pull/391) are: + +1. **Add a parallel modern inversion pathway** + - build from canonical inversion inputs, + - sample directly to `InferenceData`, + - create modern outputs directly, + - keep `inferpymc(...)` as a compatibility wrapper rather than the main conceptual API. + +2. **Finish uncoupling modern post-processing from legacy output shapes** + - reduce dependence on `xouts`, `sigouts`, `Ytrace`, `YBCtrace`, etc., + - move toward dataset-based post-processing inputs, + - simplify `InversionOutput` further. + +3. **Complete the basis/output migration** + - make operator-backed basis handling primary in post-processing, + - keep legacy save/read only as a temporary compatibility path, + - ensure this works for multi-source/multi-sector basis layouts. + +4. **Define how future model families expose inputs and outputs** + - likely via semantic component roles or lightweight manifests, + - especially important for alternative likelihoods and multi-sector models. + +5. **Decide what should be serialised for model reconstruction** + - canonical inversion inputs, + - model-building arguments, + - `InferenceData.constant_data`, + - or an explicit serialised model spec. + +6. **Move CO2 multi-sector work into the modern architecture** + - this is a key scientific priority, + - and is one of the main reasons not to keep extending only the legacy path. + +7. **Align high-resolution inner-domain work with the new abstractions** + - separate outer and inner basis/state representations cleanly, + - avoid implementing dual-domain logic only in the legacy path. + +## Suggested near-term sequencing + +The current work seems to cluster naturally into the following order: + +### Next PR / near-term + +- merge or finish the Stage E tidy-up in [#391](https://github.com/openghg/openghg_inversions/pull/391), +- document the modern builder/sampling layer, +- start a minimal parallel modern inversion entrypoint, +- continue reducing downstream legacy naming assumptions. + +### Soon after + +- continue the `#360` post-processing migration so basis objects become the primary path, +- simplify `InversionOutput`, +- decide where coordinate restoration should live and make modern traces more self-contained. + +### After that + +- land multi-sector support based on the RHIME CO2 work, +- rework high-resolution inner-domain support on top of the new model/basis structure, +- formalise serialization / reproducibility expectations. + +## Practical planning notes + +- Avoid adding major new scientific capabilities only to the legacy `inferpymc(...)` compatibility path. +- Treat parity with legacy outputs as a migration tool, not a long-term design constraint. +- Prefer canonical names and explicit adapters over hidden renaming. +- Keep the modern posterior object as `InferenceData`. +- Use the CO2 multi-sector work as the main test of whether the new abstractions are actually sufficient. + +## Short status summary + +If a single status line is needed: + +> The PyMC refactor has largely completed the move from monolithic model construction toward reusable builder/components, the basis/output refactor has established the new basis abstractions but not yet made them the default post-processing path, and the next major milestone should be a modern inversion pathway that can absorb CO2 multi-sector and inner-domain work without further expanding the legacy compatibility layer.