Fix NCU profiling under Buck PAR via KERNEL_PROFILER_PYTHON (#136)
Closed
wychi wants to merge 1 commit into
@wychi has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105739069.
Summary:
NCU profiling subprocesses failed inside a Buck PAR with:

    ImportError: .../platform010/lib/python3.12/lib-dynload/
        _posixsubprocess.cpython-312-x86_64-linux-gnu.so:
        undefined symbol: _Py_NoneStruct
    ModuleNotFoundError: No module named 'torch'

Root cause: inside a PAR, sys.executable points at the statically linked
native-main binary. Re-exec'ing it from a subprocess skips the env setup
that _bootstrap.sh performs (LD_LIBRARY_PATH for CUDA, LD_PRELOAD for
the allocator, PYTHONPATH/FB_PAR_* for the embedded import system), so
the spawned Python falls back to the system platform010 stdlib, whose
lib-dynload .so files are ABI-incompatible with the static libpython.
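
To check whether a spawned process is missing the bootstrap environment, a small diagnostic along the following lines may help. This is only a sketch: the variable names are taken from the list in the summary above, the actual set exported by _bootstrap.sh may differ, and `missing_bootstrap_env` is a hypothetical helper that is not part of this PR.

```python
import os
from typing import List, Mapping, Optional

# Assumption: these names come from the PR summary, not from _bootstrap.sh itself.
EXPECTED_VARS = ("LD_LIBRARY_PATH", "LD_PRELOAD", "PYTHONPATH")
EXPECTED_PREFIX = "FB_PAR_"


def missing_bootstrap_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the expected variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    missing = [name for name in EXPECTED_VARS if not env.get(name)]
    # The summary only names the FB_PAR_* family, so check for any such key.
    if not any(key.startswith(EXPECTED_PREFIX) for key in env):
        missing.append(EXPECTED_PREFIX + "*")
    return missing


if __name__ == "__main__":
    print("missing bootstrap variables:", missing_bootstrap_env())
```

Running this inside the profiled subprocess and seeing most or all of the names reported as missing would be consistent with the root cause described above.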
Fix is split along the OSS/fb boundary:
* OSS ncu_profiler.profile_triton_kernel honors a generic env-var
  override KERNEL_PROFILER_PYTHON, falling back to sys.executable.
  No Meta/buck/PAR knowledge in OSS code.
* fb utils/fb/internal_env.setup_internal_environment detects the
  PAR via "#native-main#" in sys.executable.name and sets the env
  var to <BASE_DIR>/_bootstrap.sh, which rebuilds the full env from
  $0 before exec'ing the native main.
  Respects an explicit user override.
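
The two halves of the fix can be sketched roughly as follows. This is an illustrative approximation, not the code in the diff: `resolve_profiler_python`, `build_ncu_command`, and `maybe_point_at_bootstrap` are hypothetical names, the `base_dir` argument stands in for however <BASE_DIR> is actually derived, and the ncu command line is reduced to the `-o` output flag.

```python
import os
import sys
from pathlib import Path
from typing import List, Mapping, MutableMapping, Optional

ENV_VAR = "KERNEL_PROFILER_PYTHON"


def resolve_profiler_python(env: Optional[Mapping[str, str]] = None) -> str:
    """OSS side: prefer the env-var override, else fall back to sys.executable."""
    env = os.environ if env is None else env
    override = env.get(ENV_VAR, "").strip()
    return override or sys.executable


def build_ncu_command(script: str, output: str,
                      env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build an ncu invocation that launches `script` with the resolved interpreter."""
    return ["ncu", "-o", output, resolve_profiler_python(env), script]


def maybe_point_at_bootstrap(executable: str, base_dir: str,
                             env: MutableMapping[str, str]) -> None:
    """Internal side: inside a PAR, route subprocesses through _bootstrap.sh.

    `base_dir` is an assumption standing in for <BASE_DIR>.
    """
    if ENV_VAR in env:
        return  # an explicit user override always wins
    if "#native-main#" in Path(executable).name:
        env[ENV_VAR] = str(Path(base_dir) / "_bootstrap.sh")


if __name__ == "__main__":
    # "kernel_script.py" and "profile_out" are placeholder arguments.
    maybe_point_at_bootstrap(sys.executable, os.getcwd(), os.environ)
    print(build_ncu_command("kernel_script.py", "profile_out"))
```

Keeping the override generic means the OSS function only needs to know one environment variable name, while the PAR detection stays entirely in the internal setup step.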
Differential Revision: D105739069