[WIP] [LGR] flow: parallel LGR INIT output from gathered simulator transmissibilities#7211
arturcastiel wants to merge 2 commits into
|
jenkins build this please |
@akva2 did I screw something up? I cannot get jenkins going |
| if (simulator.vanguard().grid().comm().rank() == 0) { | ||
| // Parallel LGR: the output path (computeTrans_) indexes trans on the GLOBAL refined | ||
| // (equil) grid, which the coarse per-rank globalTrans_ cannot answer (out_of_range). | ||
| // Decide on equilGrid().maxLevel() (the GLOBAL refinement) -- NOT grid().maxLevel(), |
Just a comment, I haven't looked at all the changes deeply, only a glance.
grid.maxLevel() coincides across ranks, with potentially empty level grids on those ranks that do not contain any refined cell. maxLevel is determined by the length of data_ or distributed_data_ (CpGrid::currentData()); see https://github.com/OPM/opm-grid/blob/5e213cf68d07a973fb2a9c2a6bdaea55a6f61af2/opm/grid/cpgrid/CpGrid.cpp#L767-L775
In a quick look at tests/cpgrid/lgr, I couldn't find a proof of this, but I'm 99% sure it's like that. In any case, in the simulators equilGrid holds the global view of the grid and might be needed here for other reasons.
If I find time, I'll add a tiny test, just to double-check grid.maxLevel() behavior on a distributed grid. I'll keep you updated!
Thank you, this is more of a proof of concept. Let me know if you think I am heading in the right direction.
@aritorto new comments will be highly appreciated.
I added a few lines to an existing test (OPM/opm-grid#1046) to illustrate the behavior of grid.maxLevel() on a distributed grid. I hope this helps!
I'll try to take a deeper look at some point; for now, this was only about the comment on grid.maxLevel().
Nice that parallel output for LGRs is moving forward : )
@aritorto parallel lgr is moving forward just like brazil in the world cup.
Thanks a lot @aritorto — you're right, and this was really helpful.
I traced it through CpGrid::maxLevel() (just currentData().size() - 2) and
refineAndUpdateGrid: the level grids are pushed one per entry of
cells_per_dim_vec (the global LGR spec), unconditionally — a rank that owns
no refined cells still gets the (empty) level grids — and nothing prunes them
afterwards. So grid().maxLevel() really is identical on every rank, and your
opm-grid#1046 test makes that explicit. My earlier comment claiming it "would
be 0 on rank 0" was simply wrong.
I've reworded the comment: it no longer makes that claim. It now justifies
gating on equilGrid().maxLevel() as the global (undistributed) view — the
authoritative refinement depth, and the same grid the refined transmissibility
is built over. I kept equilGrid() for clarity/consistency rather than switching
to grid().maxLevel(), since they're equivalent here.
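A toy model of that argument, not the opm-grid implementation: ToyRankGrid and refine are hypothetical names, and the storage layout (level-0 grid, one grid per LGR, then the leaf view) is an assumption inferred from the size() - 2 formula quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one rank's grid storage. Assumed layout:
// entry 0 is the level-0 grid, entries 1..n are the LGR level grids,
// and the last entry is the leaf view. Each entry only records how many
// cells this rank owns on that grid.
struct ToyRankGrid {
    std::vector<std::size_t> cellsPerGrid;

    int maxLevel() const
    {
        return static_cast<int>(cellsPerGrid.size()) - 2;
    }
};

// One level grid is pushed per entry of the global LGR spec, whether or
// not this rank owns any refined cells on it.
ToyRankGrid refine(std::size_t level0Cells,
                   const std::vector<std::size_t>& ownedRefinedCellsPerLgr)
{
    ToyRankGrid g;
    g.cellsPerGrid.push_back(level0Cells);
    for (std::size_t owned : ownedRefinedCellsPerLgr) {
        g.cellsPerGrid.push_back(owned); // pushed even when owned == 0
    }
    g.cellsPerGrid.push_back(0); // leaf-view placeholder; its size is not modelled
    return g;
}
```

In this model a rank that owns no refined cells still reports the same maxLevel() as every other rank, because the number of entries depends only on the length of the global LGR spec.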
|
jenkins build this please |
blattms left a comment
Nice start. I only took a very short look. I hope my comments help a bit.
Please note that the current code does the computation on the root rank, because globalTrans_ is used for partitioning. Hence it is already there and reusing it is cheap.
For parallel LGRs, recomputing it on the root rank might not be the most efficient way as we could hopefully also do this in parallel.
Hence in the long run, we can hopefully compute the values in parallel or reuse already existing ones that the simulator has computed. Of course then we need to gather these and maybe skip some of them. That might be complicated, too.
| // Refined transmissibility over the global (equil) refined leaf grid — built lazily on the I/O rank | ||
| // for parallel LGR INIT output (TRANS/NNC). Separate from globalTrans_ (kept coarse, for domain | ||
| // decomposition). See refinedGlobalTransmissibility(); reset by releaseGlobalTransmissibilities(). | ||
| std::unique_ptr<TransmissibilityType> refinedGlobalTrans_; |
As this is only used once, maybe we should not store it?
(globalTrans_ is there because it is computed before partitioning already, then later reused for INIT)
It might even be possible to overwrite/reset globalTrans_
Resolved by making the question moot: with the gather-based reuse there is no whole-grid refined transmissibility any more — the member, the method, and the global-props switch it needed are all gone (the vanguard headers are back to their previous state). In parallel LGR runs the writer receives no whole-grid transmissibility object at all; the coarse globalTrans_ keeps its one job (partitioning + non-LGR parallel INIT) untouched.
| @@ -0,0 +1,469 @@ | |||
| #!/bin/bash | |||
Do you really need an extra script for this? What is different from the existing one?
A data file for a test can easily be stored somewhere else or you could use one from opm-tests.
not really, fair point, this is some noise from my agents, it will be fixed
Fair, and done — the bespoke script is removed entirely. Coverage is restored with existing infrastructure instead:
- run-parallel-regressionTest.sh gained a backward-compatible -m init mode (dry run, EGRID+INIT compare); the default is unchanged for every existing caller.
- add_test_compare_parallel_simulation gained an optional COMPARE_MODE parameter.
- The test reuses the existing SPE1CASE1_CARFIN deck from opm-tests (already used by the spe1case1_carfin smoke tests): no new deck, no new script.
| -- +-------+-------+-------+ | ||
| -- | (1,2) | (2,2) | (3,2) | | ||
| -- +-------+-------+-------+ | ||
| -- |(1,1) | (2,1) | (3,1) | <- LGR1 host (INJ) |
Is this setup used because larger LGRs spanning multiple ranks do not work?
In general I would prefer an LGR that gets split between ranks. That would test more things. As this is run with 4 ranks maybe use 3-4 host cells?
I can add another test to check that.
The replacement deck covers this: SPE1CASE1_CARFIN has two 2×2×3 LGRs (12 host cells each) on a 10×10×3 grid, and the test run confirms they really do get split between ranks. From the 4-rank run's log: the level-0 load balance owns 78/69/72/81 cells per rank, and during local refinement rank 0 marks 9 of the 24 LGR host cells ("9 elements have been marked (in 0 rank)" vs 24 on the global view). Since each LGR has 12 host cells, no LGR fits inside rank 0 — at least one straddles a rank boundary, which is exactly the case you wanted exercised. The serial-vs-parallel compare passes with that split.
|
jenkins build this please |
|
@blattms Thanks — you were right, and this is now implemented rather than deferred. The branch no longer recomputes anything on the root rank: the values the simulator already computed in parallel (its own distributed transmissibilities) are reused for the INIT output. Each rank walks its interior cells and contributes its connections keyed by (level, level-Cartesian index) — a key that is identical on the distributed grid and the I/O rank's global view — and a one-shot gatherv brings them to the I/O rank, where the existing output walk just looks values up. Rank-boundary duplicates are the "skip some" you predicted: same-level ones arrive twice with identical values (benign), level-crossing ones are contributed only from the smaller-level side, exactly once. What remains on the I/O rank is the topology walk over the global grid and the file write itself — the single-writer floor shared by all output paths (the I/O rank already holds the global grid + global field properties for every INIT static and for EGRID). Per-connection work on the root drops from compute-everything to index-and-copy-everything. Validated with the serial-vs-parallel INIT/EGRID regression added in this PR (all TRAN* and NNC arrays compare against serial). |
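The keying and "skip some" rules described above can be sketched as follows. This is a minimal illustration under assumptions, not the code in the PR: CellId, ConnKey, makeKey, contributes, and record are hypothetical names, and a single four-field key is used for both connection kinds for brevity, whereas the PR describes same-level keys as (level, min, max).

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical cell identifier: refinement level plus level-Cartesian index.
struct CellId {
    int level;
    int cartIdx;
};

// Hypothetical key: (levelA, indexA, levelB, indexB), ordered so that both
// sides of a connection produce the same key.
using ConnKey = std::tuple<int, int, int, int>;

ConnKey makeKey(CellId a, CellId b)
{
    // Smaller level first; within one level, smaller index first.
    if (a.level > b.level || (a.level == b.level && a.cartIdx > b.cartIdx)) {
        std::swap(a, b);
    }
    return ConnKey{a.level, a.cartIdx, b.level, b.cartIdx};
}

// Same-level connections may be contributed from either side (duplicates
// carry identical values); level-crossing ones only from the smaller level.
bool contributes(CellId inside, CellId outside)
{
    return inside.level <= outside.level;
}

void record(std::map<ConnKey, double>& table,
            CellId inside, CellId outside, double trans)
{
    if (contributes(inside, outside)) {
        table.emplace(makeKey(inside, outside), trans); // first record wins
    }
}
```

A lookup with table.at(makeKey(a, b)) throws std::out_of_range for a connection nobody contributed, which is one simple way to make a missing key fail loudly.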
f1387a1 to 091f743
|
jenkins build this please |
…ties
A parallel run with LGRs cannot write a correct INIT file: the output path queries transmissibilities on the global refined grid, which the coarse per-rank globalTrans_ cannot answer. Fix it by reusing the values the simulation itself already computed in parallel.
Each rank walks its interior leaf cells and records every connection it owns from its own (distributed) simulator transmissibilities, keyed by level-Cartesian indices: same-level connections as (level, min, max), level-crossing ones as (smaller level, its index, larger level, its index). The keys are geometrically canonical -- identical on the distributed grid and the I/O rank's global view -- so the existing output walk (computeTrans_ / exportNncStructure_) looks the values up directly; a missing key is a hard error.
The records are gathered on the I/O rank once, with a plain counts/displacements gatherv (new helper gatherLgrOutputTrans in LgrOutputTransGather.hpp). Same-level rank-boundary connections arrive from both owner ranks with identical values (either record is equally valid); level-crossing connections are contributed exactly once, from the smaller-level side. The local transmissibilities are finished before the INIT extract in this branch; finishTransmissibilities() is idempotent, so the later call is a no-op.
Nothing is recomputed on the I/O rank, there is no global-property switch, and no whole-grid transmissibility object is stored -- in parallel LGR runs the writer receives none at all. Serial runs and parallel runs without LGRs are unchanged: with empty gathered maps the writer queries the whole-grid transmissibility object exactly as before.
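The counts/displacements bookkeeping behind such a gather can be illustrated without MPI. The sketch below is an assumption-laden stand-in, not gatherLgrOutputTrans: Record, displacements, and gatherOnRoot are hypothetical names, and the per-rank send buffers are modelled as a vector of vectors instead of real MPI messages.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat record, as one rank might send it.
struct Record {
    int levelA, idxA, levelB, idxB;
    double trans;
};

// Displacement of each rank's slot in the receive buffer:
// the exclusive prefix sum of the per-rank record counts.
std::vector<int> displacements(const std::vector<int>& counts)
{
    std::vector<int> displs(counts.size(), 0);
    for (std::size_t r = 1; r < counts.size(); ++r) {
        displs[r] = displs[r - 1] + counts[r - 1];
    }
    return displs;
}

// Stand-in for the gather on the I/O rank: every rank's records are copied
// into the contiguous slot given by its displacement.
std::vector<Record> gatherOnRoot(const std::vector<std::vector<Record>>& perRank)
{
    std::vector<int> counts;
    int total = 0;
    for (const auto& records : perRank) {
        counts.push_back(static_cast<int>(records.size()));
        total += counts.back();
    }
    const std::vector<int> displs = displacements(counts);

    std::vector<Record> recv(static_cast<std::size_t>(total));
    for (std::size_t r = 0; r < perRank.size(); ++r) {
        std::copy(perRank[r].begin(), perRank[r].end(), recv.begin() + displs[r]);
    }
    return recv;
}
```

A rank with nothing to contribute simply has count 0 and shares its displacement with the next rank. Duplicate same-level records from two ranks sit side by side in the buffer and can be collapsed when the lookup table is built.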
Add a serial-vs-parallel regression for the parallel LGR INIT output using existing infrastructure:
- run-parallel-regressionTest.sh gains a backward-compatible "-m <mode>" flag (default "summary" keeps the current behaviour for every existing caller; "init" does a dry run and compares EGRID+INIT only, ignoring the parallel-only MPI_RANK keyword).
- add_test_compare_parallel_simulation gains an optional COMPARE_MODE parameter that forwards "-m init" and names the test compareParallelInitSim_<sim>+<case>.
- The test is registered against the existing SPE1CASE1_CARFIN deck (opm-tests/lgr), which has two 12-host-cell LGRs on a 10x10x3 grid. Under the default 4-rank partition the LGR host cells land on multiple ranks (rank 0 marks 9 of the 24 host cells during local refinement), so the compare exercises rank-boundary-straddling LGRs.
The registration sits before the opm_set_test_driver switch to run-comparison.sh so it picks up the run-parallel-regressionTest.sh driver that understands "-m".
091f743 to 1d926f7
|
jenkins build this please |
A parallel (MPI) flow run with LGRs currently cannot write a correct INIT file: the output path queries transmissibilities on the global refined grid, which the coarse per-rank globalTrans_ cannot answer, and the run dies. This PR makes the parallel LGR INIT output (TRANX/TRANY/TRANZ + all NNC arrays) correct, by reusing the transmissibilities the simulation itself already computed in parallel.
Approach. Each rank walks its interior leaf cells and records every connection it owns from its own (distributed) simulator transmissibilities, keyed by level-Cartesian indices: same-level connections as (level, min, max), level-crossing ones as (smaller level, its index, larger level, its index). These keys are geometrically canonical: identical on the distributed grid and the I/O rank's global view. A one-shot gatherv brings the records to the I/O rank, where the existing output walk over the global grid looks the values up directly (new helper gatherLgrOutputTrans in opm/simulators/flow/LgrOutputTransGather.hpp). Same-level rank-boundary connections arrive twice with identical values; level-crossing ones are contributed exactly once, from the smaller-level side. Nothing is recomputed anywhere; there is no global-property switch, and no whole-grid transmissibility object is stored: in parallel LGR runs the writer receives none at all. Serial runs and parallel runs without LGRs are unchanged.
Testing. New regression compareParallelInitSim_flow+SPE1CASE1_CARFIN: the existing run-parallel-regressionTest.sh driver gained a backward-compatible -m init mode (dry run, EGRID+INIT compare, serial vs 4-rank parallel), registered through a new optional COMPARE_MODE parameter of add_test_compare_parallel_simulation, on the existing SPE1CASE1_CARFIN deck from opm-tests. That deck has two 2×2×3 LGRs (12 host cells each) on a 10×10×3 grid, and the run log confirms the LGR host cells land on multiple ranks (rank 0 marks 9 of the 24 host cells during local refinement), so the compare exercises rank-boundary-straddling LGRs.
Scope. INIT output only. Summary (LW*/LC*/LB*), restart write, restart read, and RFT for parallel LGR runs are separate work.