[Perf] Add qd.field_array(N, dtype) for indexed @qd.dataclass fields #712
Draft
hughperkins wants to merge 1 commit into
Adds a new ``qd.field_array(N, dtype)`` annotation for ``@qd.dataclass`` that
exposes a logical N-element array as ``obj.r[i]`` while storing N individually
named synthetic scalar fields (``_r0..._r{N-1}``) under the hood. For Python-int
indices (including ``qd.static(range(N))``-unrolled loop variables), the AST
transformer rewrites ``obj.r[i]`` directly to ``obj._r{i}``, so the generated
LLVM IR / PTX is byte-identical to a hand-rolled named-field struct.
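To make the rewrite concrete, here is a minimal sketch on plain Python source using the standard ``ast`` module (Python 3.9+ assumed for ``ast.unparse``). It is not the code in ``lang/ast/ast_transformer.py``: ``FieldArrayRewriter``, ``rewrite`` and the ``groups`` mapping are hypothetical names, and the sketch only handles literal int indices, leaving everything else untouched.

```python
import ast


class FieldArrayRewriter(ast.NodeTransformer):
    """Hypothetical sketch: turn `<expr>.<group>[<int literal>]` into `<expr>._<group><int>`."""

    def __init__(self, groups):
        # groups maps a group name to its element count, e.g. {"r": 4}.
        self.groups = groups

    def visit_Subscript(self, node):
        self.generic_visit(node)  # rewrite nested expressions first
        base, index = node.value, node.slice
        is_group = isinstance(base, ast.Attribute) and base.attr in self.groups
        is_int = (
            isinstance(index, ast.Constant)
            and isinstance(index.value, int)
            and not isinstance(index.value, bool)
        )
        # Anything that is not an in-range literal index is left as-is here.
        if not (is_group and is_int and 0 <= index.value < self.groups[base.attr]):
            return node
        field = ast.Attribute(
            value=base.value,
            attr=f"_{base.attr}{index.value}",
            ctx=node.ctx,  # keep Load/Store so assignments still work
        )
        return ast.copy_location(field, node)


def rewrite(source, groups):
    """Parse source, apply the rewrite, and return the resulting source text."""
    tree = FieldArrayRewriter(groups).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(rewrite("obj.r[2] = obj.r[0] + 1", {"r": 4}))
```

After the rewrite no subscript node remains for the static case, which is the property the description relies on when it says the output matches a named-field struct.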
Motivation: today's idiomatic ``r: qd.types.vector(N, dtype)`` group field
leaves an alloca that LLVM SROA can't decompose once register pressure crosses
a threshold (e.g. two concurrent tiles in a Cholesky+TRSM kernel), causing
runtime regressions via local-memory spills. The named-field cascade pattern
avoids the spill but balloons source size (32-way ``if k == N: self.rN = val``
write cascades duplicated at every callsite). ``field_array`` collapses those
cascades to one AST node per callsite while preserving the named-field IR.
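For readers who have not seen the cascade pattern, the following plain-Python sketch shows its shape with 4 fields instead of 32. ``TileNamed`` and ``set_r`` are hypothetical names chosen for illustration; the real kernels are ``qd`` code, not ordinary Python classes.

```python
class TileNamed:
    """Hypothetical named-field struct with a 4-way write cascade (32-way in the PR)."""

    def __init__(self):
        self.r0 = 0.0
        self.r1 = 0.0
        self.r2 = 0.0
        self.r3 = 0.0

    def set_r(self, k, val):
        # One branch per named field. In the pattern described above,
        # a block like this is repeated at every callsite.
        if k == 0:
            self.r0 = val
        if k == 1:
            self.r1 = val
        if k == 2:
            self.r2 = val
        if k == 3:
            self.r3 = val


tile = TileNamed()
tile.set_r(2, 5.0)
```

With ``field_array``, a statically indexed write of this kind is a single ``obj.r[i] = val`` expression in the source.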
Changes:
- ``lang/struct.py``: ``FieldArray`` type wrapper, ``field_array(count, dtype)``
constructor, expansion in ``StructType.__init__`` (synthetic field names plus
``_field_groups`` metadata), propagation in ``StructType.__call__``,
``_FieldArrayRef`` transient proxy.
- ``lang/impl.py``: preserve ``_qd_field_groups`` across the ``Struct`` rewrap
in ``expr_init``.
- ``lang/ast/ast_transformer.py``: ``build_Attribute`` returns a
``_FieldArrayRef`` for group access; ``build_Subscript`` resolves it to a
direct field reference for Python-int indices.
- ``tests/python/test_field_array.py``: 5 tests covering construction, static
Python-int index, ``qd.static`` loop-variable index, runtime-index rejection
(with a clear error), and static-index out-of-bounds rejection.
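The expansion step listed for ``lang/struct.py`` can be pictured roughly as follows. This is a plain-Python sketch under assumptions: ``FieldArray`` and ``field_array`` reuse the names from the change list, but their bodies, the helper ``expand_fields``, and the shape of the group metadata (group name mapped to a list of synthetic field names) are guesses for illustration, not the actual ``StructType.__init__`` logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldArray:
    """Type wrapper recording the element count and element dtype."""
    count: int
    dtype: type


def field_array(count, dtype):
    # Hypothetical validation; the real constructor may check differently.
    if not isinstance(count, int) or count <= 0:
        raise ValueError(f"field_array count must be a positive int, got {count!r}")
    return FieldArray(count, dtype)


def expand_fields(annotations):
    """Return (flat_fields, field_groups) for a dict of struct annotations."""
    flat_fields = {}
    field_groups = {}
    for name, annotation in annotations.items():
        if isinstance(annotation, FieldArray):
            # r: field_array(3, float) becomes _r0, _r1, _r2.
            synthetic = [f"_{name}{i}" for i in range(annotation.count)]
            for field_name in synthetic:
                flat_fields[field_name] = annotation.dtype
            field_groups[name] = synthetic
        else:
            flat_fields[name] = annotation
    return flat_fields, field_groups


flat, groups = expand_fields({"n": int, "r": field_array(3, float)})
print(flat)
print(groups)
```

Keeping the group metadata separate from the flat field list means the struct itself only ever holds scalar fields, while the transformer can still look up which names belong to ``r``.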
Runtime-int indexing is intentionally rejected with a friendly error pointing
at ``qd.static``; existing cascade helpers continue to handle the runtime case
by spelling out the ``_rN`` fields directly. Runtime-int support is left as a
small follow-up.
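The accept/reject behaviour described here might look like the sketch below. ``resolve_field_name`` and ``RuntimeInt`` are hypothetical, as are the exception types and message wording; only the idea (static ints resolve to ``_r{i}``, anything else is rejected with a pointer to ``qd.static``, and out-of-range static indices are rejected) comes from the description.

```python
class RuntimeInt:
    """Hypothetical stand-in for an index whose value is only known at kernel run time."""


def resolve_field_name(group, count, index):
    """Map a compile-time index into a synthetic field name, or raise."""
    if isinstance(index, bool) or not isinstance(index, int):
        # Assumed wording; the real error text is not shown in the PR.
        raise TypeError(
            f"{group}[...] requires a compile-time Python int index; "
            f"unroll the loop with qd.static(range({count})) to index it"
        )
    if not 0 <= index < count:
        raise IndexError(f"{group}[{index}] is out of range for a field_array of {count}")
    return f"_{group}{index}"


# Static indices, as produced by an unrolled loop, resolve to field names.
names = [resolve_field_name("r", 4, i) for i in range(4)]
print(names)
```

Rejecting at resolution time keeps the failure at the offending line of kernel source rather than surfacing later as a missing attribute.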
Verified on a ``field_array`` port of Genesis ``_tile32.py``: PTX byte-identical to
the named-field S1 baseline (modulo the per-session-nonce comment) on both
``chol_kernel`` and ``chol_trsm_kernel``; zero local-memory spills (S1: 0/0,
FA: 0/0, F4-A vector-field variant: 42/97); 25% compile-time reduction on the
single-tile harness (5.60s -> 4.19s, 3-run mean). Source dropped from 1068 to
515 lines (-52%). Full writeup in perso_hugh/doc/qd_field_array_2026may23.md.
All 201 tests in test_py_dataclass.py + test_complex_struct.py + test_struct.py
continue to pass; the 5 new tests pass in 1.76s total.
Collaborator, Author:
Need some user-facing doc.
Collaborator, Author:
(Also, might want to revisit the name)