Feat: Add auto memory detection for self-energy calculation by AsymmetryChou · Pull Request #47 · DeePTB-Lab/dpnegf

AsymmetryChou · 2025-11-30T07:00:17Z

Implement automatic memory detection to optimize the number of parallel workers for self-energy calculations based on available system memory. This enhancement improves resource management and prevents memory overflow during computations. For issue #48.

Summary by CodeRabbit

New Features
- Automatic memory-aware parallelization for self-energy calculations: the system estimates per-worker memory and selects a safe number of parallel workers to avoid over-committing RAM.
Bug Fixes
- Fewer crashes and out-of-memory failures during large or parallel self-energy runs by adapting worker counts to available memory.
Tests
- Added comprehensive tests validating memory estimation, fallback behavior, and safe worker selection across scenarios.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-11-30T07:00:33Z

Walkthrough

Adds memory-aware parallelization to lead self-energy computation: new helpers estimate per-worker memory and compute a safe worker count using psutil/CPU info; compute_all_self_energy samples one k-point to derive safe_n_jobs and uses it for all Parallel calls, logging any adjustments.

Changes

Cohort / File(s)	Summary
Memory-aware parallelization `dpnegf/negf/lead_property.py`	Adds `_estimate_worker_memory(lead_L, lead_R, kpoint=None, temp_allocation_factor=3.0)` and `_get_safe_n_jobs(lead_L, lead_R, requested_n_jobs=-1, max_memory_fraction=0.7, min_workers=1, kpoint=None)`; compute_all_self_energy now samples the first k-point to estimate per-worker memory, computes `safe_n_jobs`, replaces direct `n_jobs` with `safe_n_jobs` in both small- and large-batch `Parallel` paths, and logs when the requested worker count is adjusted.
Tests — auto memory detection `dpnegf/tests/test_auto_memory.py`	Adds tests for `_estimate_worker_memory` and `_get_safe_n_jobs`, covering matrix-size scaling, temp allocation factor, NumPy vs PyTorch tensors, k-point handling, failure fallbacks, requested n_jobs edge cases, CPU/memory constraint scenarios, mixed-success lead retrievals, and integration-style checks for small/large systems and stability of estimates.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Review _estimate_worker_memory() math (units, base overhead, per-lead accounting).
Validate _get_safe_n_jobs() handling of requested_n_jobs semantics, psutil-based memory caps, and CPU count edge cases.
Confirm compute_all_self_energy() replaces all Parallel invocations and logs adjustments.
Check new tests for realism and sufficient coverage of edge cases.

Poem

🐰 I hop through matrices, soft and spry,
Counting bytes beneath the sky.
Psutil whispers how many can play,
I pick safe workers, nibble delays away.
Joyful parallel, no crash today. 🥕✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: adding automatic memory detection for self-energy calculations, which directly aligns with the core objective of optimizing parallel workers based on available system memory.
Docstring Coverage	✅ Passed	Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

dpnegf/negf/lead_property.py (3)
507-514: Narrow the exception type and log at a higher level when fallback is used.

Catching a bare Exception masks unexpected errors (e.g., KeyboardInterrupt is a BaseException, but other unexpected issues like MemoryError could be silently swallowed). Consider catching specific exceptions like (KeyError, ValueError, FileNotFoundError, IOError) that are expected from I/O or lookup failures.

Also, if the fallback path is taken, the estimation may be incomplete if the cached attributes don't exist either. Consider logging at warning level when relying on fallback to alert users.
-        except Exception as e:
-            log.debug(f"Could not estimate memory from {lead.tab}: {e}")
+        except (KeyError, ValueError, FileNotFoundError, IOError, RuntimeError) as e:
+            log.warning(f"Could not estimate memory from {lead.tab} via get_hs_lead, "
+                        f"using fallback: {e}")
             # Fallback: check cached matrices on the lead object
516-520: Handle edge case when matrix estimation fails for both leads.

If get_hs_lead fails for both leads and no cached attributes exist, matrix_bytes remains 0, causing a severe underestimate (only 300 MB base overhead). This could lead to spawning too many workers and causing OOM.

Consider adding a minimum safeguard or returning a sentinel value to trigger conservative behavior:
     # Total estimate: base overhead + scaled computation memory
     computation_memory = matrix_bytes * temp_allocation_factor
     total_estimate = base_overhead + int(computation_memory)
 
+    if matrix_bytes == 0:
+        log.warning("Could not estimate computation memory from lead matrices. "
+                    "Using conservative fallback estimate.")
+        # Conservative fallback: assume 1GB per worker if we can't measure
+        total_estimate = max(total_estimate, 1024 * 1024 * 1024)
+
     return total_estimate
560-571: Consider matching joblib's semantics for n_jobs=0 and negative values.

In joblib, n_jobs=0 raises an error, and negative values like -2 mean "all CPUs minus 1". The current else branch (line 570-571) treats all non-positive values (except -1) as "use max_workers", which diverges from joblib conventions.

If matching joblib semantics is intended, consider:
     if requested_n_jobs == -1:
         return max_workers
     elif requested_n_jobs > 0:
         if requested_n_jobs > max_workers:
             log.warning(f"Requested n_jobs={requested_n_jobs} may exceed available memory. "
                        f"Limiting to {max_workers} workers "
                        f"(available: {available_memory / 1e9:.1f} GB, "
                        f"est. per worker: {memory_per_worker / 1e9:.1f} GB)")
             return max_workers
         return requested_n_jobs
+    elif requested_n_jobs == 0:
+        raise ValueError("n_jobs=0 is not valid. Use -1 for auto or a positive integer.")
+    else:
+        # Negative values other than -1: joblib interprets as (cpu_count + 1 + n_jobs)
+        effective_jobs = max(cpu_count + 1 + requested_n_jobs, min_workers)
+        return min(effective_jobs, max_workers)
-    else:
-        return max_workers

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between faafb4e and 44564b9.

📒 Files selected for processing (1)

dpnegf/negf/lead_property.py (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

dpnegf/negf/lead_property.py (1)

dpnegf/negf/negf_hamiltonian_init.py (1)

get_hs_lead (857-949)

🪛 Ruff (0.14.6)

dpnegf/negf/lead_property.py

507-507: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: run-tests

🔇 Additional comments (3)

dpnegf/negf/lead_property.py (3)

613-621: Good integration of memory-aware parallelization.

The logic correctly samples the first k-point for memory estimation and logs adjustments clearly. One minor edge case: if kpoints_grid is empty, sample_kpoint becomes None, which is handled by the fallback in _estimate_worker_memory. However, the subsequent loop would be a no-op anyway, so this is not a practical concern.

623-634: LGTM: Consistent use of safe_n_jobs in both execution paths.

Both the small-batch (line 624) and large-batch (line 631) execution paths now correctly use safe_n_jobs, ensuring memory constraints are respected throughout the computation.

13-13: Verify psutil is declared in project dependencies.

The new psutil import introduces an external dependency. Ensure it's added to the project's setup.py, requirements.txt, or pyproject.toml to prevent import errors in fresh environments.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

dpnegf/negf/lead_property.py (2)

458-517: Narrow the broad except Exception in _estimate_worker_memory.

Catching a blind Exception here makes it easy to hide real programming errors (e.g., shape/dtype bugs in get_hs_lead) and is what Ruff BLE001 is flagging. Consider restricting this to the specific exceptions you expect from get_hs_lead (e.g., AssertionError, OSError, RuntimeError, ValueError) and either re-raising unexpected ones or logging and re-raising, so only genuinely “estimation-related” failures fall back to the 100 MB heuristic.

633-652: Memory‑aware n_jobs integration into compute_all_self_energy looks good.

Using a representative k‑point to estimate per‑worker memory and then routing both the small‑ and large‑batch Parallel calls through safe_n_jobs is consistent and should prevent obviously over‑aggressive parallelism. If you find in practice that safe_n_jobs is often 1, you might optionally short‑circuit to a non‑parallel path to avoid joblib overhead, but the current implementation is perfectly acceptable.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44564b9 and 7909e7f.

📒 Files selected for processing (1)

dpnegf/negf/lead_property.py (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

dpnegf/negf/lead_property.py (1)

dpnegf/negf/negf_hamiltonian_init.py (1)

get_hs_lead (857-949)

🪛 Ruff (0.14.6)

dpnegf/negf/lead_property.py

507-507: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: run-tests

🔇 Additional comments (2)

dpnegf/negf/lead_property.py (2)

13-13: Ensure psutil is declared as a runtime dependency.

Since psutil is imported at module import time, missing the package will cause an ImportError before any of these utilities can be used. Verify that psutil is listed in the appropriate dependency file(s) (e.g., requirements.txt, pyproject.toml, or setup.cfg), or guard with an optional import if not strictly required at runtime.

520-590: Fix non‑int requested_n_jobs handling and remove redundant assignment.

Two issues in this function:

The type check for requested_n_jobs logs a warning and sets final_worker = min_workers, but lacks an early return. Execution then flows into subsequent comparisons (requested_n_jobs == -1, requested_n_jobs > 0), which will raise TypeError for non-int values like None or "4", contradicting the intent of graceful fallback.

max_workers_by_memory is assigned twice with identical expressions (lines 556–557), which is redundant.

Add an early return after the type validation and remove the duplicate assignment to ensure invalid inputs are handled gracefully and the code remains clean.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

dpnegf/negf/lead_property.py (1)

458-517: Consider narrowing the broad Exception catch in _estimate_worker_memory

Catching Exception at Line 507 will also swallow unexpected programming errors, not just data/IO issues, and silently fall back to 100 MB per lead. If you know the set of likely failures from get_hs_lead, it would be safer to catch those specific exception types (e.g., I/O- or shape-related) or log-and-re‑raise unknown ones, while still keeping the 100 MB fallback for expected issues.

dpnegf/tests/test_auto_memory.py (1)

329-344: Strengthen the non-integer n_jobs test once helper behavior is clarified

test_handles_non_integer_n_jobs currently only checks result >= 1 and documents a known bug in _get_safe_n_jobs where the min_workers safeguard can be overwritten. If you adopt a clearer contract in _get_safe_n_jobs (e.g., early‑return min_workers or coercing to int), this test can be tightened to assert the exact expected value and ensure that behavior doesn’t regress.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7909e7f and e8a494f.

📒 Files selected for processing (2)

dpnegf/negf/lead_property.py (3 hunks)
dpnegf/tests/test_auto_memory.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

dpnegf/tests/test_auto_memory.py (1)

dpnegf/negf/lead_property.py (2)

_estimate_worker_memory (458-517)

_get_safe_n_jobs (520-590)

dpnegf/negf/lead_property.py (1)

dpnegf/negf/negf_hamiltonian_init.py (1)

get_hs_lead (857-949)

🪛 Ruff (0.14.6)

dpnegf/tests/test_auto_memory.py

25-25: Unused method argument: kpoint

(ARG002)

25-25: Unused method argument: tab

(ARG002)

25-25: Unused method argument: v

(ARG002)

27-27: Avoid specifying long messages outside the exception class

(TRY003)

dpnegf/negf/lead_property.py

507-507: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: run-tests

🔇 Additional comments (2)

dpnegf/negf/lead_property.py (1)

633-655: Memory-aware n_jobs wiring in compute_all_self_energy looks correct

safe_n_jobs is computed once from a representative k-point and then used consistently for all Parallel calls (Lines 644 and 651), with clear logging when auto-detection is used or when user input is constrained by memory (Lines 637–640). This matches the stated goal of avoiding memory over-commitment while respecting the requested n_jobs where possible.

dpnegf/tests/test_auto_memory.py (1)

64-416: Test coverage for memory estimation and worker selection is thorough

The combination of unit and integration tests here exercises base overhead, matrix-size scaling, fallback paths, CPU/memory constraints, and various n_jobs conventions. This gives good confidence that the auto-memory logic behaves as intended across realistic scenarios.

coderabbitai · 2025-11-30T07:46:33Z

+def _get_safe_n_jobs(lead_L, lead_R, requested_n_jobs=-1, max_memory_fraction=0.7, min_workers=1, kpoint=None):
+    """
+    Calculate safe number of parallel workers based on available system memory.
+
+    Parameters
+    ----------
+    lead_L, lead_R : LeadProperty
+        Lead objects for memory estimation.
+    requested_n_jobs : int
+        User-requested n_jobs. -1 means auto-detect.
+    max_memory_fraction : float
+        Maximum fraction of available memory to use. Default 0.7.
+    min_workers : int
+        Minimum number of workers to use. Default 1.
+    kpoint : array-like, optional
+        A sample k-point for fetching Hamiltonian matrices to estimate memory.
+
+    Returns
+    -------
+    int
+        Safe number of parallel workers.
+    """
+    cpu_count = os.cpu_count()
+    if cpu_count is None or cpu_count < 1:
+        cpu_count = 1
+        log.warning("os.cpu_count() returned None or invalid value. Defaulting to 1 CPU core.")
+
+    available_memory = psutil.virtual_memory().available
+    memory_per_worker = _estimate_worker_memory(lead_L, lead_R, kpoint=kpoint)
+
+    # Calculate max workers that fit in available memory
+    if memory_per_worker <= 0:
+        log.warning(f"Memory estimation returned non-positive value. Using min_workers={min_workers}.")
+        return min_workers
+
+    # Calculate max workers that fit in available memory
+    max_workers_by_memory = int((available_memory * max_memory_fraction) / memory_per_worker)
+    max_workers_by_memory = int((available_memory * max_memory_fraction) / memory_per_worker)
+    max_workers_by_memory = max(max_workers_by_memory, min_workers)
+
+    # Cap by CPU count
+    max_workers = min(max_workers_by_memory, cpu_count)
+
+    safe_n_worker = 0
+    # check requested_n_jobs is a number
+    if not isinstance(requested_n_jobs, int):
+        log.warning(f"Requested n_jobs={requested_n_jobs} is not an integer. \n"
+                    f"Using min_workers={min_workers}.")
+        safe_n_worker = min_workers
+
+    if requested_n_jobs == -1:
+        safe_n_worker = max_workers
+    elif requested_n_jobs == 0:
+        log.warning(f"Requested n_jobs=0 is invalid. Using min_workers={min_workers}.")
+        safe_n_worker = min_workers
+    elif requested_n_jobs > 0:
+        if requested_n_jobs > max_workers:
+            log.warning(f"Requested n_jobs={requested_n_jobs} may exceed available memory. "
+                       f"Limiting to {max_workers} workers "
+                       f"(available: {available_memory / 1e9:.1f} GB, "
+                       f"est. per worker: {memory_per_worker / 1e9:.1f} GB)")
+            safe_n_worker = max_workers
+        else:
+            safe_n_worker = requested_n_jobs
+    else:
+        # Negative values other than -1: joblib interprets as (cpu_count + 1 + n_jobs)
+        effective_n_jobs = max(cpu_count + 1 + requested_n_jobs, min_workers)
+        safe_n_worker = min(effective_n_jobs, max_workers)
+
+    log.info(f"Estimated safe n_jobs={safe_n_worker} based on available memory.")
+    return safe_n_worker


⚠️ Potential issue | 🟡 Minor

Fix duplicated max_workers_by_memory assignment and make non-integer requested_n_jobs handling consistent

Lines 556–557 compute max_workers_by_memory twice with the same expression; one of these assignments can be removed with no change in behavior.

The non-integer requested_n_jobs path (Lines 565–568) sets safe_n_worker = min_workers but then continues into the later conditionals, which can overwrite that value and may return a non-int worker count. That contradicts the docstring return type and can make downstream usage harder to reason about.

It would be clearer to either:

return min_workers immediately after logging for non‑integer requested_n_jobs, or

coerce once via requested_n_jobs = int(requested_n_jobs) and document that behavior.

To enforce the contract, you could also cast once at the end, e.g. return int(safe_n_worker).

🤖 Prompt for AI Agents

In dpnegf/negf/lead_property.py around lines 520–590, remove the duplicated assignment to max_workers_by_memory (lines 556–557) so it is computed only once; for non-integer requested_n_jobs, after logging immediately return min_workers (do not continue into later branches) to avoid overwriting the value or returning a non-int; and finally ensure the function returns an int by casting safe_n_worker to int (e.g., return int(safe_n_worker)) before returning.

Feat: add auto memory detection for self_energy calculation

44564b9

AsymmetryChou changed the title ~~Add auto memory detection for self-energy calculation~~ Feat: Add auto memory detection for self-energy calculation Nov 30, 2025

coderabbitai Bot reviewed Nov 30, 2025

View reviewed changes

Comment thread dpnegf/negf/lead_property.py

AsymmetryChou added 3 commits November 30, 2025 15:09

Fix: handle non-positive memory estimation in worker calculation

f593b51

Fix: improve memory estimation fallback in worker calculation

f3eafaa

Fix: enhance job handling in memory estimation for parallel calculations

7909e7f

coderabbitai Bot reviewed Nov 30, 2025

View reviewed changes

AsymmetryChou added 3 commits November 30, 2025 15:39

Add unit tests for automatic memory detection in lead_property.py

6c54c98

Fix: update worker memory estimation variable name

f81cbfe

update run.ipynb

e8a494f

AsymmetryChou requested a review from Lonya0 November 30, 2025 07:44

coderabbitai Bot reviewed Nov 30, 2025

View reviewed changes

Lonya0 approved these changes Nov 30, 2025

View reviewed changes

AsymmetryChou merged commit 4366d0e into DeePTB-Lab:main Dec 1, 2025
2 checks passed

AsymmetryChou mentioned this pull request Dec 1, 2025

Memorgy Overload when treating systems with large PL #48

Closed

coderabbitai Bot mentioned this pull request May 13, 2026

self energy and e3 model fix #52

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: Add auto memory detection for self-energy calculation#47

Feat: Add auto memory detection for self-energy calculation#47
AsymmetryChou merged 7 commits into
DeePTB-Lab:mainfrom
AsymmetryChou:auto_mem

AsymmetryChou commented Nov 30, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Nov 30, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Nov 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AsymmetryChou commented Nov 30, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Nov 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Nov 30, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AsymmetryChou commented Nov 30, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Nov 30, 2025 •

edited

Loading