[Feature] Add optional PTODSL launch profile object to measure kernel runtime without compile/build overhead #893

Description

@Zhendong404

Summary

PTODSL currently mixes compilation/build work into the launch path, which makes it hard to measure kernel execution time accurately from Python.

I would like PTODSL launch to accept an optional profile object. When this object is provided, the generated launch path should insert profiling/statistics code immediately around the real kernel invocation. After the launch finishes, kernel performance data can then be read from profile without counting PTODSL tracing / PTOAS / Bisheng build overhead.

Motivation / use case

Today, the first PTODSL launch may include both:

  • JIT specialization/tracing on the kernel[grid, stream] path
  • Native build / shared-library generation on the first actual call path

From the current code:

  • KernelHandle.__getitem__() calls self.compile() before returning the launch handle (ptodsl/ptodsl/_jit.py, lines 294-297)
  • KernelCompiler.compile() may trace a new specialization and call compiled.build() on it when that specialization is not cached (ptodsl/ptodsl/_kernel_compilation.py, lines 81-119)
  • LaunchHandle._ensure_launch_fn() lazily calls build_native_library() on first launch (ptodsl/ptodsl/_runtime/launch.py, lines 134-149)
  • build_native_library() may run ptoas, Bisheng compile, and link steps when the native cache is cold (ptodsl/ptodsl/_runtime/native_build.py, lines 178-233)

Because some of this work happens on the launch path, it is difficult to use PTODSL itself to answer a simple question: how long did the kernel actually run?

In practice, users can manually warm up with .compile() and extra launches, but that is easy to forget, is not ergonomic, and still does not provide a standard PTODSL-side profiling interface.
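For reference, the manual workaround looks roughly like the sketch below. It does not use PTODSL: LazyKernel is a hypothetical stand-in for a launch handle that builds lazily on its first call, and time_after_warmup is an illustrative helper, not an existing API. The point is only that the caller has to remember the warm-up step on their own.

```python
import time


class LazyKernel:
    """Hypothetical stand-in for a launch handle with a lazy first-call build."""

    def __init__(self):
        self.build_count = 0
        self._launch_fn = None

    def _build(self):
        # Stands in for tracing / ptoas / Bisheng compile / link work.
        self.build_count += 1
        time.sleep(0.01)
        return lambda x: x * 2  # stands in for the real kernel entry point

    def __call__(self, x):
        if self._launch_fn is None:
            self._launch_fn = self._build()
        return self._launch_fn(x)


def time_after_warmup(fn, *args, warmup=2, iters=5):
    """Return per-call wall-clock times in seconds, measured after warm-up calls."""
    for _ in range(warmup):
        fn(*args)  # the first warm-up call absorbs the one-time build cost
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return samples


kernel = LazyKernel()
samples = time_after_warmup(kernel, 21)
```

If the warm-up loop is dropped, the first sample silently includes the build time, which is exactly the failure mode described above.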

Proposed API / behavior

Suggested direction:

profile = pto.LaunchProfile()
compiled[grid, stream](arg0, arg1, ..., profile=profile)
# or: kernel[grid, stream](..., profile=profile)

When profile is present:

  • PTODSL should keep the existing compile/cache behavior unchanged
  • The generated launch code should insert timing/stat collection immediately before and after the real kernel invocation point
  • The measured result should reflect kernel runtime as closely as possible, rather than Python-side tracing / compilation / native build overhead
  • The collected data should be readable from the profile object after launch completes
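A minimal sketch of where the timing would sit, assuming a host-side timer and a mocked launch handle. Only the names LaunchProfile, LaunchHandle, and _ensure_launch_fn come from this issue or the current code; the bodies, the start_ns / end_ns / elapsed_ns fields, and the build_seconds parameter are hypothetical. A real device kernel is typically launched asynchronously on a stream, so an actual implementation would likely need stream synchronization or device-side events around the call rather than a plain host clock.

```python
import time


class LaunchProfile:
    """Hypothetical profile object; only the class name comes from the proposal."""

    def __init__(self):
        self.start_ns = None
        self.end_ns = None

    @property
    def elapsed_ns(self):
        if self.start_ns is None or self.end_ns is None:
            return None
        return self.end_ns - self.start_ns


class LaunchHandle:
    """Mock launch handle: lazy build first, then the real call."""

    def __init__(self, kernel_fn, build_seconds=0.2):
        self._kernel_fn = kernel_fn
        self._build_seconds = build_seconds
        self._launch_fn = None

    def _ensure_launch_fn(self):
        if self._launch_fn is None:
            time.sleep(self._build_seconds)  # stands in for a cold native build
            self._launch_fn = self._kernel_fn
        return self._launch_fn

    def __call__(self, *args, profile=None):
        # Build/caching happens first, outside the timed region.
        launch_fn = self._ensure_launch_fn()
        if profile is None:
            return launch_fn(*args)  # unchanged path when no profile is given
        profile.start_ns = time.perf_counter_ns()
        try:
            return launch_fn(*args)
        finally:
            profile.end_ns = time.perf_counter_ns()


handle = LaunchHandle(lambda a, b: a + b)
profile = LaunchProfile()
result = handle(2, 3, profile=profile)  # first call: build runs, but is not timed
plain = handle(1, 2)                    # no profile: same result path as today
```

Even on this cold first call, profile.elapsed_ns covers only the kernel_fn invocation, because the timestamps are taken after _ensure_launch_fn() returns.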

Possible data fields could include:

  • kernel elapsed time
  • launch timestamp / end timestamp
  • optional cache-hit or compile/build metadata as separate fields, if that is useful
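One possible shape for these fields, as a sketch only. The class name LaunchProfileData and every field name below are hypothetical; nothing here is defined by PTODSL today. Elapsed time is derived from the two timestamps, and the cache/build metadata is kept in separate fields so it never mixes into the kernel number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaunchProfileData:
    """Hypothetical field layout for the data a profile object could expose."""

    launch_timestamp_ns: Optional[int] = None  # taken just before the kernel call
    end_timestamp_ns: Optional[int] = None     # taken just after the kernel call
    cache_hit: Optional[bool] = None           # optional compile/build metadata
    build_time_ns: Optional[int] = None        # reported separately from kernel time

    @property
    def kernel_elapsed_ns(self) -> Optional[int]:
        """Kernel elapsed time, or None if the launch has not been recorded."""
        if self.launch_timestamp_ns is None or self.end_timestamp_ns is None:
            return None
        return self.end_timestamp_ns - self.launch_timestamp_ns

    @property
    def kernel_elapsed_ms(self) -> Optional[float]:
        elapsed = self.kernel_elapsed_ns
        return None if elapsed is None else elapsed / 1_000_000


# Illustrative values only, not measurements.
record = LaunchProfileData(
    launch_timestamp_ns=1_000_000,
    end_timestamp_ns=3_500_000,
    cache_hit=False,
    build_time_ns=900_000_000,
)
```

With this layout, a cold launch can report a large build_time_ns while kernel_elapsed_ms stays comparable across kernels.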

Backward compatibility:

  • No behavior change when profile is omitted
  • Existing launch syntax should keep working

Alternatives considered

  • Ask users to always call .compile() manually and then do multiple warm-up launches before measuring
  • Add a separate explicit prepare() / build() API and require users to split preparation from measurement themselves

Those approaches help, but they still leave PTODSL without a built-in, user-facing profiling surface attached to launch itself.

Additional context

The key point is not only “add profiling”, but “profile around the real kernel call site”. Otherwise the reported numbers will still be polluted by first-launch compilation/build work and will be hard to compare across kernels.
