Feat: Add auto memory detection for self-energy calculation#47
Conversation
WalkthroughAdds memory-aware parallelization to lead self-energy computation: new helpers estimate per-worker memory and compute a safe worker count using psutil/CPU info; compute_all_self_energy samples one k-point to derive safe_n_jobs and uses it for all Parallel calls, logging any adjustments. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
dpnegf/negf/lead_property.py (3)
507-514: Narrow the exception type and log at a higher level when fallback is used.Catching a bare
Exceptionmasks unexpected errors (e.g.,KeyboardInterruptis aBaseException, but other unexpected issues likeMemoryErrorcould be silently swallowed). Consider catching specific exceptions like(KeyError, ValueError, FileNotFoundError, IOError)that are expected from I/O or lookup failures.Also, if the fallback path is taken, the estimation may be incomplete if the cached attributes don't exist either. Consider logging at
warninglevel when relying on fallback to alert users.- except Exception as e: - log.debug(f"Could not estimate memory from {lead.tab}: {e}") + except (KeyError, ValueError, FileNotFoundError, IOError, RuntimeError) as e: + log.warning(f"Could not estimate memory from {lead.tab} via get_hs_lead, " + f"using fallback: {e}") # Fallback: check cached matrices on the lead object
516-520: Handle edge case when matrix estimation fails for both leads.If
get_hs_leadfails for both leads and no cached attributes exist,matrix_bytesremains 0, causing a severe underestimate (only 300 MB base overhead). This could lead to spawning too many workers and causing OOM.Consider adding a minimum safeguard or returning a sentinel value to trigger conservative behavior:
# Total estimate: base overhead + scaled computation memory computation_memory = matrix_bytes * temp_allocation_factor total_estimate = base_overhead + int(computation_memory) + if matrix_bytes == 0: + log.warning("Could not estimate computation memory from lead matrices. " + "Using conservative fallback estimate.") + # Conservative fallback: assume 1GB per worker if we can't measure + total_estimate = max(total_estimate, 1024 * 1024 * 1024) + return total_estimate
560-571: Consider matching joblib's semantics forn_jobs=0and negative values.In joblib,
n_jobs=0raises an error, and negative values like-2mean "all CPUs minus 1". The current else branch (line 570-571) treats all non-positive values (except -1) as "use max_workers", which diverges from joblib conventions.If matching joblib semantics is intended, consider:
if requested_n_jobs == -1: return max_workers elif requested_n_jobs > 0: if requested_n_jobs > max_workers: log.warning(f"Requested n_jobs={requested_n_jobs} may exceed available memory. " f"Limiting to {max_workers} workers " f"(available: {available_memory / 1e9:.1f} GB, " f"est. per worker: {memory_per_worker / 1e9:.1f} GB)") return max_workers return requested_n_jobs + elif requested_n_jobs == 0: + raise ValueError("n_jobs=0 is not valid. Use -1 for auto or a positive integer.") + else: + # Negative values other than -1: joblib interprets as (cpu_count + 1 + n_jobs) + effective_jobs = max(cpu_count + 1 + requested_n_jobs, min_workers) + return min(effective_jobs, max_workers) - else: - return max_workers
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dpnegf/negf/lead_property.py(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dpnegf/negf/lead_property.py (1)
dpnegf/negf/negf_hamiltonian_init.py (1)
get_hs_lead(857-949)
🪛 Ruff (0.14.6)
dpnegf/negf/lead_property.py
507-507: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: run-tests
🔇 Additional comments (3)
dpnegf/negf/lead_property.py (3)
613-621: Good integration of memory-aware parallelization.The logic correctly samples the first k-point for memory estimation and logs adjustments clearly. One minor edge case: if
kpoints_gridis empty,sample_kpointbecomesNone, which is handled by the fallback in_estimate_worker_memory. However, the subsequent loop would be a no-op anyway, so this is not a practical concern.
623-634: LGTM: Consistent use ofsafe_n_jobsin both execution paths.Both the small-batch (line 624) and large-batch (line 631) execution paths now correctly use
safe_n_jobs, ensuring memory constraints are respected throughout the computation.
13-13: Verifypsutilis declared in project dependencies.The new
psutilimport introduces an external dependency. Ensure it's added to the project'ssetup.py,requirements.txt, orpyproject.tomlto prevent import errors in fresh environments.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
dpnegf/negf/lead_property.py (2)
458-517: Narrow the broadexcept Exceptionin_estimate_worker_memory.Catching a blind
Exceptionhere makes it easy to hide real programming errors (e.g., shape/dtype bugs inget_hs_lead) and is what Ruff BLE001 is flagging. Consider restricting this to the specific exceptions you expect fromget_hs_lead(e.g.,AssertionError,OSError,RuntimeError,ValueError) and either re-raising unexpected ones or logging and re-raising, so only genuinely “estimation-related” failures fall back to the 100 MB heuristic.
633-652: Memory‑awaren_jobsintegration intocompute_all_self_energylooks good.Using a representative k‑point to estimate per‑worker memory and then routing both the small‑ and large‑batch
Parallelcalls throughsafe_n_jobsis consistent and should prevent obviously over‑aggressive parallelism. If you find in practice thatsafe_n_jobsis often 1, you might optionally short‑circuit to a non‑parallel path to avoid joblib overhead, but the current implementation is perfectly acceptable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dpnegf/negf/lead_property.py(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dpnegf/negf/lead_property.py (1)
dpnegf/negf/negf_hamiltonian_init.py (1)
get_hs_lead(857-949)
🪛 Ruff (0.14.6)
dpnegf/negf/lead_property.py
507-507: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: run-tests
🔇 Additional comments (2)
dpnegf/negf/lead_property.py (2)
13-13: Ensurepsutilis declared as a runtime dependency.Since
psutilis imported at module import time, missing the package will cause anImportErrorbefore any of these utilities can be used. Verify thatpsutilis listed in the appropriate dependency file(s) (e.g.,requirements.txt,pyproject.toml, orsetup.cfg), or guard with an optional import if not strictly required at runtime.
520-590: Fix non‑intrequested_n_jobshandling and remove redundant assignment.Two issues in this function:
The type check for
requested_n_jobslogs a warning and setsfinal_worker = min_workers, but lacks an earlyreturn. Execution then flows into subsequent comparisons (requested_n_jobs == -1,requested_n_jobs > 0), which will raiseTypeErrorfor non-int values likeNoneor"4", contradicting the intent of graceful fallback.
max_workers_by_memoryis assigned twice with identical expressions (lines 556–557), which is redundant.Add an early return after the type validation and remove the duplicate assignment to ensure invalid inputs are handled gracefully and the code remains clean.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
dpnegf/negf/lead_property.py (1)
458-517: Consider narrowing the broadExceptioncatch in_estimate_worker_memoryCatching
Exceptionat Line 507 will also swallow unexpected programming errors, not just data/IO issues, and silently fall back to 100 MB per lead. If you know the set of likely failures fromget_hs_lead, it would be safer to catch those specific exception types (e.g., I/O- or shape-related) or log-and-re‑raise unknown ones, while still keeping the 100 MB fallback for expected issues.dpnegf/tests/test_auto_memory.py (1)
329-344: Strengthen the non-integern_jobstest once helper behavior is clarified
test_handles_non_integer_n_jobscurrently only checksresult >= 1and documents a known bug in_get_safe_n_jobswhere themin_workerssafeguard can be overwritten. If you adopt a clearer contract in_get_safe_n_jobs(e.g., early‑returnmin_workersor coercing toint), this test can be tightened to assert the exact expected value and ensure that behavior doesn’t regress.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
dpnegf/negf/lead_property.py(3 hunks)dpnegf/tests/test_auto_memory.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
dpnegf/tests/test_auto_memory.py (1)
dpnegf/negf/lead_property.py (2)
_estimate_worker_memory(458-517)_get_safe_n_jobs(520-590)
dpnegf/negf/lead_property.py (1)
dpnegf/negf/negf_hamiltonian_init.py (1)
get_hs_lead(857-949)
🪛 Ruff (0.14.6)
dpnegf/tests/test_auto_memory.py
25-25: Unused method argument: kpoint
(ARG002)
25-25: Unused method argument: tab
(ARG002)
25-25: Unused method argument: v
(ARG002)
27-27: Avoid specifying long messages outside the exception class
(TRY003)
dpnegf/negf/lead_property.py
507-507: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: run-tests
🔇 Additional comments (2)
dpnegf/negf/lead_property.py (1)
633-655: Memory-awaren_jobswiring incompute_all_self_energylooks correct
safe_n_jobsis computed once from a representative k-point and then used consistently for allParallelcalls (Lines 644 and 651), with clear logging when auto-detection is used or when user input is constrained by memory (Lines 637–640). This matches the stated goal of avoiding memory over-commitment while respecting the requestedn_jobswhere possible.dpnegf/tests/test_auto_memory.py (1)
64-416: Test coverage for memory estimation and worker selection is thoroughThe combination of unit and integration tests here exercises base overhead, matrix-size scaling, fallback paths, CPU/memory constraints, and various
n_jobsconventions. This gives good confidence that the auto-memory logic behaves as intended across realistic scenarios.
| def _get_safe_n_jobs(lead_L, lead_R, requested_n_jobs=-1, max_memory_fraction=0.7, min_workers=1, kpoint=None): | ||
| """ | ||
| Calculate safe number of parallel workers based on available system memory. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| lead_L, lead_R : LeadProperty | ||
| Lead objects for memory estimation. | ||
| requested_n_jobs : int | ||
| User-requested n_jobs. -1 means auto-detect. | ||
| max_memory_fraction : float | ||
| Maximum fraction of available memory to use. Default 0.7. | ||
| min_workers : int | ||
| Minimum number of workers to use. Default 1. | ||
| kpoint : array-like, optional | ||
| A sample k-point for fetching Hamiltonian matrices to estimate memory. | ||
|
|
||
| Returns | ||
| ------- | ||
| int | ||
| Safe number of parallel workers. | ||
| """ | ||
| cpu_count = os.cpu_count() | ||
| if cpu_count is None or cpu_count < 1: | ||
| cpu_count = 1 | ||
| log.warning("os.cpu_count() returned None or invalid value. Defaulting to 1 CPU core.") | ||
|
|
||
| available_memory = psutil.virtual_memory().available | ||
| memory_per_worker = _estimate_worker_memory(lead_L, lead_R, kpoint=kpoint) | ||
|
|
||
| # Calculate max workers that fit in available memory | ||
| if memory_per_worker <= 0: | ||
| log.warning(f"Memory estimation returned non-positive value. Using min_workers={min_workers}.") | ||
| return min_workers | ||
|
|
||
| # Calculate max workers that fit in available memory | ||
| max_workers_by_memory = int((available_memory * max_memory_fraction) / memory_per_worker) | ||
| max_workers_by_memory = int((available_memory * max_memory_fraction) / memory_per_worker) | ||
| max_workers_by_memory = max(max_workers_by_memory, min_workers) | ||
|
|
||
| # Cap by CPU count | ||
| max_workers = min(max_workers_by_memory, cpu_count) | ||
|
|
||
| safe_n_worker = 0 | ||
| # check requested_n_jobs is a number | ||
| if not isinstance(requested_n_jobs, int): | ||
| log.warning(f"Requested n_jobs={requested_n_jobs} is not an integer. \n" | ||
| f"Using min_workers={min_workers}.") | ||
| safe_n_worker = min_workers | ||
|
|
||
| if requested_n_jobs == -1: | ||
| safe_n_worker = max_workers | ||
| elif requested_n_jobs == 0: | ||
| log.warning(f"Requested n_jobs=0 is invalid. Using min_workers={min_workers}.") | ||
| safe_n_worker = min_workers | ||
| elif requested_n_jobs > 0: | ||
| if requested_n_jobs > max_workers: | ||
| log.warning(f"Requested n_jobs={requested_n_jobs} may exceed available memory. " | ||
| f"Limiting to {max_workers} workers " | ||
| f"(available: {available_memory / 1e9:.1f} GB, " | ||
| f"est. per worker: {memory_per_worker / 1e9:.1f} GB)") | ||
| safe_n_worker = max_workers | ||
| else: | ||
| safe_n_worker = requested_n_jobs | ||
| else: | ||
| # Negative values other than -1: joblib interprets as (cpu_count + 1 + n_jobs) | ||
| effective_n_jobs = max(cpu_count + 1 + requested_n_jobs, min_workers) | ||
| safe_n_worker = min(effective_n_jobs, max_workers) | ||
|
|
||
| log.info(f"Estimated safe n_jobs={safe_n_worker} based on available memory.") | ||
| return safe_n_worker |
There was a problem hiding this comment.
Fix duplicated max_workers_by_memory assignment and make non-integer requested_n_jobs handling consistent
- Lines 556–557 compute
max_workers_by_memorytwice with the same expression; one of these assignments can be removed with no change in behavior. - The non-integer
requested_n_jobspath (Lines 565–568) setssafe_n_worker = min_workersbut then continues into the later conditionals, which can overwrite that value and may return a non-intworker count. That contradicts the docstring return type and can make downstream usage harder to reason about. - It would be clearer to either:
return min_workersimmediately after logging for non‑integerrequested_n_jobs, or- coerce once via
requested_n_jobs = int(requested_n_jobs)and document that behavior.
- To enforce the contract, you could also cast once at the end, e.g.
return int(safe_n_worker).
🤖 Prompt for AI Agents
In dpnegf/negf/lead_property.py around lines 520–590, remove the duplicated
assignment to max_workers_by_memory (lines 556–557) so it is computed only once;
for non-integer requested_n_jobs, after logging immediately return min_workers
(do not continue into later branches) to avoid overwriting the value or
returning a non-int; and finally ensure the function returns an int by casting
safe_n_worker to int (e.g., return int(safe_n_worker)) before returning.
Implement automatic memory detection to optimize the number of parallel workers for self-energy calculations based on available system memory. This enhancement improves resource management and prevents memory overflow during computations. For issue #48.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.