[Feature] Add optional PTODSL launch profile object to measure kernel runtime without compile/build overhead

## Summary

PTODSL currently mixes compilation/build work into the launch path, which makes it hard to measure kernel execution time accurately from Python.

I would like the PTODSL launch call to accept an optional `profile` object. When it is provided, the generated launch path should insert profiling/statistics code immediately around the real kernel invocation, so that after the launch finishes, kernel performance data can be read from `profile` without counting PTODSL tracing, PTOAS, or Bisheng build overhead.

## Motivation / use case

Today, the first PTODSL launch may include both:

- JIT specialization/tracing on the `kernel[grid, stream]` path
- Native build / shared-library generation on the first actual call path

From the current code:

- `KernelHandle.__getitem__()` calls `self.compile()` before returning the launch handle (`ptodsl/ptodsl/_jit.py`, lines 294-297)
- `KernelCompiler.compile()` may trace and `compiled.build()` a new specialization when it is not cached (`ptodsl/ptodsl/_kernel_compilation.py`, lines 81-119)
- `LaunchHandle._ensure_launch_fn()` lazily calls `build_native_library()` on first launch (`ptodsl/ptodsl/_runtime/launch.py`, lines 134-149)
- `build_native_library()` may run `ptoas`, Bisheng compile, and link steps when the native cache is cold (`ptodsl/ptodsl/_runtime/native_build.py`, lines 178-233)

Because some of this work happens on the launch path, it is difficult to use PTODSL itself to answer a simple question: how long did the kernel actually run?

In practice, users can warm up manually with `.compile()` and extra launches, but that is easy to forget, unergonomic, and still does not provide a standard PTODSL-side profiling interface.

## Proposed API / behavior

Suggested direction:

```python
profile = pto.LaunchProfile()
compiled[grid, stream](arg0, arg1, ..., profile=profile)
# or: kernel[grid, stream](..., profile=profile)
```

When `profile` is present:

- PTODSL should keep the existing compile/cache behavior unchanged
- The generated launch code should insert timing/stat collection immediately before and after the real kernel invocation point
- The measured result should reflect kernel runtime as closely as possible, rather than Python-side tracing / compilation / native build overhead
- The collected data should be readable from the `profile` object after launch completes
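To illustrate the intended placement of the timing window, here is a minimal host-side sketch. `FakeLaunchHandle` is a hypothetical stand-in, not the real `LaunchHandle`; the sleeps simulate build and kernel work, and the attribute names on `profile` are illustrative only. The point is that the lazy build completes before the window opens.

```python
import time
from types import SimpleNamespace


class FakeLaunchHandle:
    """Hypothetical stand-in for a launch handle with a lazy native build."""

    def __init__(self):
        self._launch_fn = None

    def _ensure_launch_fn(self):
        if self._launch_fn is None:
            time.sleep(0.05)  # simulated cold build
            self._launch_fn = lambda *args: time.sleep(0.002)  # simulated kernel
        return self._launch_fn

    def __call__(self, *args, profile=None):
        launch_fn = self._ensure_launch_fn()  # build cost stays outside the window
        if profile is None:
            launch_fn(*args)  # unchanged path when no profile is passed
            return
        profile.start_ns = time.perf_counter_ns()
        launch_fn(*args)
        profile.end_ns = time.perf_counter_ns()
        profile.elapsed_ns = profile.end_ns - profile.start_ns


profile = SimpleNamespace()
handle = FakeLaunchHandle()

total_start = time.perf_counter_ns()
handle(1, 2, profile=profile)  # first call: includes the simulated build
total_ns = time.perf_counter_ns() - total_start

print(f"whole call: {total_ns / 1e6:.2f} ms, profiled window: {profile.elapsed_ns / 1e6:.2f} ms")
```

If launches are asynchronous on the stream, host timestamps around the call would capture submission rather than execution, so a real implementation would presumably need device-side events or a synchronization point inside the window.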

Possible data fields could include:

- kernel elapsed time
- launch timestamp / end timestamp
- optional cache-hit or compile/build metadata as separate fields, if that is useful
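As a rough sketch of what such an object could hold, the dataclass below mirrors the fields listed above. All field names are assumptions made for illustration; nothing here exists in PTODSL today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaunchProfile:
    """Hypothetical result container; field names are illustrative only."""

    launch_timestamp_ns: Optional[int] = None
    end_timestamp_ns: Optional[int] = None
    cache_hit: Optional[bool] = None      # kept separate from kernel timing
    build_time_ns: Optional[int] = None   # kept separate from kernel timing

    @property
    def kernel_elapsed_ns(self) -> Optional[int]:
        """Elapsed kernel time, or None if the launch has not been recorded."""
        if self.launch_timestamp_ns is None or self.end_timestamp_ns is None:
            return None
        return self.end_timestamp_ns - self.launch_timestamp_ns


p = LaunchProfile(launch_timestamp_ns=1_000, end_timestamp_ns=4_500, cache_hit=True)
print(p.kernel_elapsed_ns)  # 3500
```

Keeping build metadata in separate fields lets a caller see whether a launch was cold without letting that cost leak into the kernel number.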

Backward compatibility:

- No behavior change when `profile` is omitted
- Existing launch syntax should keep working

## Alternatives considered

- Ask users to always call `.compile()` manually and then do multiple warm-up launches before measuring
- Add a separate explicit `prepare()` / `build()` API and require users to split preparation from measurement themselves

Those approaches help, but they still leave PTODSL without a built-in, user-facing profiling surface attached to launch itself.

## Additional context

The key point is not only “add profiling”, but “profile around the real kernel call site”. Otherwise the reported numbers will still be polluted by first-launch compilation/build work and will be hard to compare across kernels.

