Summary
PTODSL currently mixes compilation/build work into the launch path, which makes it hard to measure kernel execution time accurately from Python.
I would like PTODSL launch to accept an optional profile object. When this object is provided, the generated launch path should insert profiling/statistics code immediately around the real kernel invocation, so that after launch finishes we can read kernel performance data from profile without counting PTODSL tracing / PTOAS / Bisheng build overhead.
Motivation / use case
Today, the first PTODSL launch may include both:
- JIT specialization/tracing on the kernel[grid, stream] path
- Native build / shared-library generation on the first actual call path
From the current code:
- KernelHandle.__getitem__() calls self.compile() before returning the launch handle (ptodsl/ptodsl/_jit.py, lines 294-297)
- KernelCompiler.compile() may trace a new specialization and build it via compiled.build() when it is not cached (ptodsl/ptodsl/_kernel_compilation.py, lines 81-119)
- LaunchHandle._ensure_launch_fn() lazily calls build_native_library() on first launch (ptodsl/ptodsl/_runtime/launch.py, lines 134-149)
- build_native_library() may run ptoas, Bisheng compile, and link steps when the native cache is cold (ptodsl/ptodsl/_runtime/native_build.py, lines 178-233)
Because some of this work happens on the launch path, it is difficult to use PTODSL itself to answer a simple question: how long did the kernel actually run?
In practice, users can manually warm up with .compile() and extra launches, but that is easy to forget, is not ergonomic, and still does not provide a standard PTODSL-side profiling interface.
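To make the problem concrete, here is a toy model of a lazily building launch handle, timed from the caller's side the way a Python user would do it today. Everything in it is a stand-in: ToyLaunchHandle, timed_call, and the sleep that simulates the ptoas / Bisheng / link work are hypothetical and are not PTODSL code.

```python
import time


class ToyLaunchHandle:
    """Hypothetical stand-in for a launch handle that builds lazily on first call."""

    def __init__(self, build_seconds=0.05):
        self._build_seconds = build_seconds
        self._launch_fn = None  # stays None until the first launch (cold cache)

    def _ensure_launch_fn(self):
        if self._launch_fn is None:
            # Simulates the one-time native build that sits on the launch path.
            time.sleep(self._build_seconds)
            self._launch_fn = lambda x: x + 1  # simulated kernel
        return self._launch_fn

    def __call__(self, x):
        return self._ensure_launch_fn()(x)


def timed_call(handle, x):
    """Time the whole call from outside, including whatever the launch path does."""
    start = time.perf_counter()
    handle(x)
    return time.perf_counter() - start


handle = ToyLaunchHandle()
cold = timed_call(handle, 1)  # includes the simulated build
warm = timed_call(handle, 1)  # only the simulated kernel
print(f"cold={cold * 1e3:.2f} ms, warm={warm * 1e3:.2f} ms")
```

In this model the cold measurement is dominated by the simulated build rather than the kernel, which is the effect described above.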
Proposed API / behavior
Suggested direction:
profile = pto.LaunchProfile()
compiled[grid, stream](arg0, arg1, ..., profile=profile)
# or: kernel[grid, stream](..., profile=profile)
When profile is present:
- PTODSL should keep the existing compile/cache behavior unchanged
- The generated launch code should insert timing/stat collection immediately before and after the real kernel invocation point
- The measured result should reflect kernel runtime as closely as possible, rather than Python-side tracing / compilation / native build overhead
- The collected data should be readable from the profile object after launch completes
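A minimal sketch of the behavior listed above, using a toy launch handle rather than real PTODSL internals. ToyProfile, ToyLaunchHandle, and the start_ns / end_ns attribute names are assumptions made for illustration. The sketch also times with a host clock, which only works here because the simulated kernel is synchronous. For an asynchronous device launch, a real implementation would presumably need device-side timestamps or a stream synchronization, which this sketch does not model.

```python
import time


class ToyProfile:
    """Hypothetical profile object; real field names are not decided in this issue."""

    def __init__(self):
        self.start_ns = None
        self.end_ns = None

    @property
    def elapsed_ns(self):
        return self.end_ns - self.start_ns


class ToyLaunchHandle:
    """Hypothetical launch handle with a lazy, simulated native build."""

    def __init__(self, kernel_fn, build_seconds=0.05):
        self._kernel_fn = kernel_fn
        self._build_seconds = build_seconds
        self._launch_fn = None

    def _ensure_launch_fn(self):
        if self._launch_fn is None:
            time.sleep(self._build_seconds)  # simulated build, cold cache only
            self._launch_fn = self._kernel_fn
        return self._launch_fn

    def __call__(self, *args, profile=None):
        # Build/cache behavior is unchanged and stays outside the timed region.
        launch_fn = self._ensure_launch_fn()
        if profile is None:
            return launch_fn(*args)  # no behavior change without a profile
        profile.start_ns = time.perf_counter_ns()
        try:
            return launch_fn(*args)  # the real invocation point
        finally:
            profile.end_ns = time.perf_counter_ns()


handle = ToyLaunchHandle(lambda a, b: a + b)
profile = ToyProfile()
result = handle(2, 3, profile=profile)  # cold launch; build time is excluded
print(result, profile.elapsed_ns)
```

Because the timestamps are taken after _ensure_launch_fn() returns, even the first (cold) launch reports only the time spent in the invocation itself.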
Possible data fields could include:
- kernel elapsed time
- launch timestamp / end timestamp
- optional cache-hit or compile/build metadata as separate fields, if that is useful
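One possible shape for these fields, written as a plain dataclass. The class name LaunchProfileSketch and every field name below are hypothetical; the issue only proposes pto.LaunchProfile() and does not fix a schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaunchProfileSketch:
    """Hypothetical field layout for a launch profile object."""

    # Timestamps taken immediately around the kernel invocation.
    start_ns: Optional[int] = None
    end_ns: Optional[int] = None
    # Optional metadata kept separate from the kernel timing.
    specialization_cache_hit: Optional[bool] = None
    native_cache_hit: Optional[bool] = None
    build_ns: Optional[int] = None

    @property
    def elapsed_ns(self) -> Optional[int]:
        """Kernel elapsed time, or None if no launch has been recorded yet."""
        if self.start_ns is None or self.end_ns is None:
            return None
        return self.end_ns - self.start_ns


p = LaunchProfileSketch(
    start_ns=1_000, end_ns=4_500, native_cache_hit=False, build_ns=2_000_000
)
print(p.elapsed_ns)
```

Keeping build_ns and the cache-hit flags as separate fields means compile/build cost can still be reported without ever being folded into elapsed_ns.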
Backward compatibility:
- No behavior change when profile is omitted
- Existing launch syntax should keep working
Alternatives considered
- Ask users to always call .compile() manually and then do multiple warm-up launches before measuring
- Add a separate explicit prepare() / build() API and require users to split preparation from measurement themselves
Those approaches help, but they still leave PTODSL without a built-in, user-facing profiling surface attached to launch itself.
Additional context
The key point is not only “add profiling”, but “profile around the real kernel call site”. Otherwise the reported numbers will still be polluted by first-launch compilation/build work and will be hard to compare across kernels.