
Feat: Add auto memory detection for self-energy calculation #47

Merged
AsymmetryChou merged 7 commits into DeePTB-Lab:main from AsymmetryChou:auto_mem on Dec 1, 2025

Conversation

@AsymmetryChou
Contributor

@AsymmetryChou AsymmetryChou commented Nov 30, 2025

Implement automatic memory detection to optimize the number of parallel workers for self-energy calculations based on available system memory. This enhancement improves resource management and prevents memory overflow during computations. For issue #48.

Summary by CodeRabbit

  • New Features

    • Automatic memory-aware parallelization for self-energy calculations: the system estimates per-worker memory and selects a safe number of parallel workers to avoid over-committing RAM.
  • Bug Fixes

    • Fewer crashes and out-of-memory failures during large or parallel self-energy runs by adapting worker counts to available memory.
  • Tests

    • Added comprehensive tests validating memory estimation, fallback behavior, and safe worker selection across scenarios.


@coderabbitai

coderabbitai Bot commented Nov 30, 2025

Walkthrough

Adds memory-aware parallelization to lead self-energy computation: new helpers estimate per-worker memory and compute a safe worker count using psutil/CPU info; compute_all_self_energy samples one k-point to derive safe_n_jobs and uses it for all Parallel calls, logging any adjustments.
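
The worker cap described in the walkthrough can be sketched roughly as follows. This is an illustration only, not the PR's code: `cap_workers` and its parameter names are hypothetical, and the available-memory and per-worker figures are passed in as plain integers, whereas the PR obtains them from psutil and `_estimate_worker_memory`. The 0.7 memory fraction and the minimum of one worker mirror the defaults listed in the change summary.

```python
import os


def cap_workers(available_bytes, per_worker_bytes, cpu_count=None,
                max_memory_fraction=0.7, min_workers=1):
    """Return how many workers fit in a memory budget, capped by CPU count.

    Hypothetical helper for illustration; not part of dpnegf.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    if per_worker_bytes <= 0:
        return min_workers  # no usable estimate: stay conservative
    budget = available_bytes * max_memory_fraction
    by_memory = max(int(budget // per_worker_bytes), min_workers)
    return min(by_memory, cpu_count)


GIB = 1024 ** 3
# 16 GiB free, 2 GiB per worker, 8 cores: 70% of 16 GiB holds 5 workers.
print(cap_workers(16 * GIB, 2 * GIB, cpu_count=8))  # 5
```

With plenty of memory the CPU count becomes the binding limit, and with too little memory the result falls back to `min_workers` rather than zero.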

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Memory-aware parallelization (dpnegf/negf/lead_property.py) | Adds _estimate_worker_memory(lead_L, lead_R, kpoint=None, temp_allocation_factor=3.0) and _get_safe_n_jobs(lead_L, lead_R, requested_n_jobs=-1, max_memory_fraction=0.7, min_workers=1, kpoint=None); compute_all_self_energy now samples the first k-point to estimate per-worker memory, computes safe_n_jobs, replaces direct n_jobs with safe_n_jobs in both small- and large-batch Parallel paths, and logs when the requested worker count is adjusted. |
| Tests: auto memory detection (dpnegf/tests/test_auto_memory.py) | Adds tests for _estimate_worker_memory and _get_safe_n_jobs, covering matrix-size scaling, temp allocation factor, NumPy vs PyTorch tensors, k-point handling, failure fallbacks, requested n_jobs edge cases, CPU/memory constraint scenarios, mixed-success lead retrievals, and integration-style checks for small/large systems and stability of estimates. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review _estimate_worker_memory() math (units, base overhead, per-lead accounting).
  • Validate _get_safe_n_jobs() handling of requested_n_jobs semantics, psutil-based memory caps, and CPU count edge cases.
  • Confirm compute_all_self_energy() replaces all Parallel invocations and logs adjustments.
  • Check new tests for realism and sufficient coverage of edge cases.

Poem

🐰 I hop through matrices, soft and spry,
Counting bytes beneath the sky.
Psutil whispers how many can play,
I pick safe workers, nibble delays away.
Joyful parallel, no crash today. 🥕✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding automatic memory detection for self-energy calculations, which directly aligns with the core objective of optimizing parallel workers based on available system memory. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%. |

@AsymmetryChou AsymmetryChou changed the title Add auto memory detection for self-energy calculation Feat: Add auto memory detection for self-energy calculation Nov 30, 2025

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
dpnegf/negf/lead_property.py (3)

507-514: Narrow the exception type and log at a higher level when fallback is used.

Catching a blanket Exception masks unexpected errors. KeyboardInterrupt is unaffected because it derives from BaseException, but failures such as MemoryError would be silently swallowed. Consider catching only the exceptions expected from I/O or lookup failures, such as (KeyError, ValueError, FileNotFoundError, IOError).

Also, if the fallback path is taken, the estimation may be incomplete if the cached attributes don't exist either. Consider logging at warning level when relying on fallback to alert users.

-        except Exception as e:
-            log.debug(f"Could not estimate memory from {lead.tab}: {e}")
+        except (KeyError, ValueError, FileNotFoundError, IOError, RuntimeError) as e:
+            log.warning(f"Could not estimate memory from {lead.tab} via get_hs_lead, "
+                        f"using fallback: {e}")
             # Fallback: check cached matrices on the lead object

516-520: Handle edge case when matrix estimation fails for both leads.

If get_hs_lead fails for both leads and no cached attributes exist, matrix_bytes remains 0, causing a severe underestimate (only 300 MB base overhead). This could lead to spawning too many workers and causing OOM.

Consider adding a minimum safeguard or returning a sentinel value to trigger conservative behavior:

     # Total estimate: base overhead + scaled computation memory
     computation_memory = matrix_bytes * temp_allocation_factor
     total_estimate = base_overhead + int(computation_memory)
 
+    if matrix_bytes == 0:
+        log.warning("Could not estimate computation memory from lead matrices. "
+                    "Using conservative fallback estimate.")
+        # Conservative fallback: assume 1GB per worker if we can't measure
+        total_estimate = max(total_estimate, 1024 * 1024 * 1024)
+
     return total_estimate
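
To make the arithmetic concrete, here is a small sketch of the estimate with the suggested floor applied. It is an approximation under stated assumptions: `estimate_worker_bytes` is a hypothetical name, matrices are described by shape only and assumed to hold complex128 values (16 bytes each), and the base overhead is taken as 300 MiB, which may not match the exact constant in `_estimate_worker_memory`.

```python
def estimate_worker_bytes(matrix_shapes, bytes_per_element=16,
                          temp_allocation_factor=3.0,
                          base_overhead=300 * 1024 ** 2,
                          floor_when_unknown=1024 ** 3):
    """Estimate per-worker memory from a list of (rows, cols) matrix shapes.

    Hypothetical stand-in; the 300 MiB overhead and 16-byte elements are
    assumptions made for this example.
    """
    matrix_bytes = sum(rows * cols * bytes_per_element
                       for rows, cols in matrix_shapes)
    total = base_overhead + int(matrix_bytes * temp_allocation_factor)
    if matrix_bytes == 0:
        # Nothing could be measured: fall back to a conservative floor.
        total = max(total, floor_when_unknown)
    return total


# Two 2000 x 2000 matrices: 128,000,000 bytes, tripled, plus the overhead.
print(estimate_worker_bytes([(2000, 2000), (2000, 2000)]))  # 698572800
# No measurable matrices: the 1 GiB floor applies.
print(estimate_worker_bytes([]))  # 1073741824
```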

560-571: Consider matching joblib's semantics for n_jobs=0 and negative values.

In joblib, n_jobs=0 raises an error, and negative values count back from the CPU total, so -2 means "all CPUs but one". The current else branch (lines 570-571) treats every non-positive value other than -1 as "use max_workers", which diverges from joblib conventions.

If matching joblib semantics is intended, consider:

     if requested_n_jobs == -1:
         return max_workers
     elif requested_n_jobs > 0:
         if requested_n_jobs > max_workers:
             log.warning(f"Requested n_jobs={requested_n_jobs} may exceed available memory. "
                        f"Limiting to {max_workers} workers "
                        f"(available: {available_memory / 1e9:.1f} GB, "
                        f"est. per worker: {memory_per_worker / 1e9:.1f} GB)")
             return max_workers
         return requested_n_jobs
+    elif requested_n_jobs == 0:
+        raise ValueError("n_jobs=0 is not valid. Use -1 for auto or a positive integer.")
+    else:
+        # Negative values other than -1: joblib interprets as (cpu_count + 1 + n_jobs)
+        effective_jobs = max(cpu_count + 1 + requested_n_jobs, min_workers)
+        return min(effective_jobs, max_workers)
-    else:
-        return max_workers
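
For reference, joblib's documented convention for negative values (cpu_count + 1 + n_jobs workers, never fewer than one) can be written as a standalone function. The name `resolve_n_jobs` is hypothetical and the function does not call joblib; it only mirrors the rule so the mapping is easy to check by hand.

```python
def resolve_n_jobs(n_jobs, cpu_count):
    """Map a joblib-style n_jobs value to a concrete worker count.

    Hypothetical helper that mirrors joblib's convention without importing it.
    """
    if n_jobs == 0:
        raise ValueError("n_jobs=0 is not valid")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on; never below 1.
        return max(cpu_count + 1 + n_jobs, 1)
    return n_jobs


print(resolve_n_jobs(-1, 8))  # 8
print(resolve_n_jobs(-2, 8))  # 7
```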
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between faafb4e and 44564b9.

📒 Files selected for processing (1)
  • dpnegf/negf/lead_property.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dpnegf/negf/lead_property.py (1)
dpnegf/negf/negf_hamiltonian_init.py (1)
  • get_hs_lead (857-949)
🪛 Ruff (0.14.6)
dpnegf/negf/lead_property.py

507-507: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: run-tests
🔇 Additional comments (3)
dpnegf/negf/lead_property.py (3)

613-621: Good integration of memory-aware parallelization.

The logic correctly samples the first k-point for memory estimation and logs adjustments clearly. One minor edge case: if kpoints_grid is empty, sample_kpoint becomes None, which is handled by the fallback in _estimate_worker_memory. However, the subsequent loop would be a no-op anyway, so this is not a practical concern.


623-634: LGTM: Consistent use of safe_n_jobs in both execution paths.

Both the small-batch (line 624) and large-batch (line 631) execution paths now correctly use safe_n_jobs, ensuring memory constraints are respected throughout the computation.


13-13: Verify psutil is declared in project dependencies.

The new psutil import introduces an external dependency. Ensure it's added to the project's setup.py, requirements.txt, or pyproject.toml to prevent import errors in fresh environments.

Comment thread dpnegf/negf/lead_property.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
dpnegf/negf/lead_property.py (2)

458-517: Narrow the broad except Exception in _estimate_worker_memory.

Catching a blind Exception here makes it easy to hide real programming errors (e.g., shape/dtype bugs in get_hs_lead) and is what Ruff BLE001 is flagging. Consider restricting this to the specific exceptions you expect from get_hs_lead (e.g., AssertionError, OSError, RuntimeError, ValueError) and either re-raising unexpected ones or logging and re-raising, so only genuinely “estimation-related” failures fall back to the 100 MB heuristic.


633-652: Memory‑aware n_jobs integration into compute_all_self_energy looks good.

Using a representative k‑point to estimate per‑worker memory and then routing both the small‑ and large‑batch Parallel calls through safe_n_jobs is consistent and should prevent obviously over‑aggressive parallelism. If you find in practice that safe_n_jobs is often 1, you might optionally short‑circuit to a non‑parallel path to avoid joblib overhead, but the current implementation is perfectly acceptable.
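
The optional short-circuit mentioned above might look like the sketch below. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library as a stand-in for joblib's `Parallel`, so it shows the control flow only; `run_tasks` is a hypothetical name and not a function in dpnegf.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(func, items, n_workers):
    """Apply func to each item, skipping the pool when one worker is allowed.

    Hypothetical helper; a thread pool stands in for joblib here.
    """
    if n_workers <= 1:
        # Serial path: no pool start-up or task dispatch overhead.
        return [func(item) for item in items]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(func, items))


def square(x):
    return x * x


print(run_tasks(square, [1, 2, 3], n_workers=1))  # [1, 4, 9]
print(run_tasks(square, [1, 2, 3], n_workers=4))  # [1, 4, 9]
```

Both paths return results in the same order, so callers do not need to know which one ran.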

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 44564b9 and 7909e7f.

📒 Files selected for processing (1)
  • dpnegf/negf/lead_property.py (3 hunks)
🔇 Additional comments (2)
dpnegf/negf/lead_property.py (2)

13-13: Ensure psutil is declared as a runtime dependency.

Since psutil is imported at module import time, missing the package will cause an ImportError before any of these utilities can be used. Verify that psutil is listed in the appropriate dependency file(s) (e.g., requirements.txt, pyproject.toml, or setup.cfg), or guard with an optional import if not strictly required at runtime.
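
If psutil were treated as optional, the guarded import could take a shape like the one below. This is a sketch under assumptions, not what the PR does: `available_memory_bytes` is a hypothetical helper, and the fallback relies on POSIX `sysconf` names that are not defined on every platform, in which case the function returns None so the caller can pick a conservative default.

```python
import os

try:
    import psutil  # optional dependency in this sketch
except ImportError:
    psutil = None


def available_memory_bytes():
    """Return available RAM in bytes, or None if it cannot be determined.

    Hypothetical helper for illustration.
    """
    if psutil is not None:
        return int(psutil.virtual_memory().available)
    try:
        # POSIX fallback; os.sysconf is missing on Windows and the names
        # below are not defined on every Unix.
        value = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return value if value > 0 else None


mem = available_memory_bytes()
print("unknown" if mem is None else f"{mem / 1e9:.1f} GB available")
```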


520-590: Fix non‑int requested_n_jobs handling and remove redundant assignment.

Two issues in this function:

  1. The type check for requested_n_jobs logs a warning and sets final_worker = min_workers, but lacks an early return. Execution then flows into subsequent comparisons (requested_n_jobs == -1, requested_n_jobs > 0), which will raise TypeError for non-int values like None or "4", contradicting the intent of graceful fallback.

  2. max_workers_by_memory is assigned twice with identical expressions (lines 556–557), which is redundant.

Add an early return after the type validation and remove the duplicate assignment to ensure invalid inputs are handled gracefully and the code remains clean.

@AsymmetryChou AsymmetryChou requested a review from Lonya0 November 30, 2025 07:44

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
dpnegf/negf/lead_property.py (1)

458-517: Consider narrowing the broad Exception catch in _estimate_worker_memory

Catching Exception at Line 507 will also swallow unexpected programming errors, not just data/IO issues, and silently fall back to 100 MB per lead. If you know the set of likely failures from get_hs_lead, it would be safer to catch those specific exception types (e.g., I/O- or shape-related) or log-and-re‑raise unknown ones, while still keeping the 100 MB fallback for expected issues.

dpnegf/tests/test_auto_memory.py (1)

329-344: Strengthen the non-integer n_jobs test once helper behavior is clarified

test_handles_non_integer_n_jobs currently only checks result >= 1 and documents a known bug in _get_safe_n_jobs where the min_workers safeguard can be overwritten. If you adopt a clearer contract in _get_safe_n_jobs (e.g., early‑return min_workers or coercing to int), this test can be tightened to assert the exact expected value and ensure that behavior doesn’t regress.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 7909e7f and e8a494f.

📒 Files selected for processing (2)
  • dpnegf/negf/lead_property.py (3 hunks)
  • dpnegf/tests/test_auto_memory.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
dpnegf/tests/test_auto_memory.py (1)
dpnegf/negf/lead_property.py (2)
  • _estimate_worker_memory (458-517)
  • _get_safe_n_jobs (520-590)
dpnegf/negf/lead_property.py (1)
dpnegf/negf/negf_hamiltonian_init.py (1)
  • get_hs_lead (857-949)
🪛 Ruff (0.14.6)
dpnegf/tests/test_auto_memory.py

25-25: Unused method argument: kpoint

(ARG002)


25-25: Unused method argument: tab

(ARG002)


25-25: Unused method argument: v

(ARG002)


27-27: Avoid specifying long messages outside the exception class

(TRY003)

dpnegf/negf/lead_property.py

507-507: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
dpnegf/negf/lead_property.py (1)

633-655: Memory-aware n_jobs wiring in compute_all_self_energy looks correct

safe_n_jobs is computed once from a representative k-point and then used consistently for all Parallel calls (Lines 644 and 651), with clear logging when auto-detection is used or when user input is constrained by memory (Lines 637–640). This matches the stated goal of avoiding memory over-commitment while respecting the requested n_jobs where possible.

dpnegf/tests/test_auto_memory.py (1)

64-416: Test coverage for memory estimation and worker selection is thorough

The combination of unit and integration tests here exercises base overhead, matrix-size scaling, fallback paths, CPU/memory constraints, and various n_jobs conventions. This gives good confidence that the auto-memory logic behaves as intended across realistic scenarios.

Comment on lines +520 to +590
def _get_safe_n_jobs(lead_L, lead_R, requested_n_jobs=-1, max_memory_fraction=0.7, min_workers=1, kpoint=None):
    """
    Calculate safe number of parallel workers based on available system memory.

    Parameters
    ----------
    lead_L, lead_R : LeadProperty
        Lead objects for memory estimation.
    requested_n_jobs : int
        User-requested n_jobs. -1 means auto-detect.
    max_memory_fraction : float
        Maximum fraction of available memory to use. Default 0.7.
    min_workers : int
        Minimum number of workers to use. Default 1.
    kpoint : array-like, optional
        A sample k-point for fetching Hamiltonian matrices to estimate memory.

    Returns
    -------
    int
        Safe number of parallel workers.
    """
    cpu_count = os.cpu_count()
    if cpu_count is None or cpu_count < 1:
        cpu_count = 1
        log.warning("os.cpu_count() returned None or invalid value. Defaulting to 1 CPU core.")

    available_memory = psutil.virtual_memory().available
    memory_per_worker = _estimate_worker_memory(lead_L, lead_R, kpoint=kpoint)

    # Calculate max workers that fit in available memory
    if memory_per_worker <= 0:
        log.warning(f"Memory estimation returned non-positive value. Using min_workers={min_workers}.")
        return min_workers

    # Calculate max workers that fit in available memory
    max_workers_by_memory = int((available_memory * max_memory_fraction) / memory_per_worker)
    max_workers_by_memory = int((available_memory * max_memory_fraction) / memory_per_worker)
    max_workers_by_memory = max(max_workers_by_memory, min_workers)

    # Cap by CPU count
    max_workers = min(max_workers_by_memory, cpu_count)

    safe_n_worker = 0
    # check requested_n_jobs is a number
    if not isinstance(requested_n_jobs, int):
        log.warning(f"Requested n_jobs={requested_n_jobs} is not an integer. \n"
                    f"Using min_workers={min_workers}.")
        safe_n_worker = min_workers

    if requested_n_jobs == -1:
        safe_n_worker = max_workers
    elif requested_n_jobs == 0:
        log.warning(f"Requested n_jobs=0 is invalid. Using min_workers={min_workers}.")
        safe_n_worker = min_workers
    elif requested_n_jobs > 0:
        if requested_n_jobs > max_workers:
            log.warning(f"Requested n_jobs={requested_n_jobs} may exceed available memory. "
                        f"Limiting to {max_workers} workers "
                        f"(available: {available_memory / 1e9:.1f} GB, "
                        f"est. per worker: {memory_per_worker / 1e9:.1f} GB)")
            safe_n_worker = max_workers
        else:
            safe_n_worker = requested_n_jobs
    else:
        # Negative values other than -1: joblib interprets as (cpu_count + 1 + n_jobs)
        effective_n_jobs = max(cpu_count + 1 + requested_n_jobs, min_workers)
        safe_n_worker = min(effective_n_jobs, max_workers)

    log.info(f"Estimated safe n_jobs={safe_n_worker} based on available memory.")
    return safe_n_worker

⚠️ Potential issue | 🟡 Minor

Fix duplicated max_workers_by_memory assignment and make non-integer requested_n_jobs handling consistent

  • Lines 556–557 compute max_workers_by_memory twice with the same expression; one of these assignments can be removed with no change in behavior.
  • The non-integer requested_n_jobs path (Lines 565–568) sets safe_n_worker = min_workers but then continues into the later conditionals, which can overwrite that value and may return a non-int worker count. That contradicts the docstring return type and can make downstream usage harder to reason about.
  • It would be clearer to either:
    • return min_workers immediately after logging for non‑integer requested_n_jobs, or
    • coerce once via requested_n_jobs = int(requested_n_jobs) and document that behavior.
  • To enforce the contract, you could also cast once at the end, e.g. return int(safe_n_worker).
🤖 Prompt for AI Agents
In dpnegf/negf/lead_property.py around lines 520–590, remove the duplicated
assignment to max_workers_by_memory (lines 556–557) so it is computed only once;
for non-integer requested_n_jobs, after logging immediately return min_workers
(do not continue into later branches) to avoid overwriting the value or
returning a non-int; and finally ensure the function returns an int by casting
safe_n_worker to int (e.g., return int(safe_n_worker)) before returning.
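
One way the requested fix could look is sketched below: the non-integer case returns immediately, so later comparisons never see an invalid value, and every path returns an int. This is an assumption-laden illustration rather than the merged code. `select_n_jobs` is a hypothetical name, `max_workers` and `cpu_count` are passed in instead of being measured, logging is omitted, and rejecting bool values is an extra choice the review does not ask for.

```python
def select_n_jobs(requested_n_jobs, max_workers, cpu_count, min_workers=1):
    """Pick a worker count, returning early on invalid input.

    Hypothetical helper showing the early-return structure only.
    """
    # bool is a subclass of int, so it is rejected explicitly here.
    if isinstance(requested_n_jobs, bool) or not isinstance(requested_n_jobs, int):
        return min_workers  # early return: nothing below can overwrite this
    if requested_n_jobs == 0:
        return min_workers
    if requested_n_jobs == -1:
        return max_workers
    if requested_n_jobs > 0:
        return min(requested_n_jobs, max_workers)
    # Other negative values follow joblib's (cpu_count + 1 + n_jobs) rule.
    effective = max(cpu_count + 1 + requested_n_jobs, min_workers)
    return min(effective, max_workers)


print(select_n_jobs(None, max_workers=4, cpu_count=8))  # 1
print(select_n_jobs("4", max_workers=4, cpu_count=8))   # 1
print(select_n_jobs(16, max_workers=4, cpu_count=8))    # 4
```

Because each branch returns directly, there is no shared `safe_n_worker` variable left to be overwritten, which also makes the non-integer test straightforward to tighten to an exact expected value.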

@AsymmetryChou AsymmetryChou merged commit 4366d0e into DeePTB-Lab:main Dec 1, 2025
2 checks passed

2 participants