Upstream contrib/main sync test by avinciguerra07 · Pull Request #6 · cornelisnetworks/libfabric

avinciguerra07 · 2026-06-10T16:48:35Z

No description provided.

Drop the lock so memory monitor doesn't risk a deadlock. Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com> (cherry picked from commit ec4b693)

Drop the memory-monitor lock before performing memory operations and pre-allocate the rbnode outside the lock, so the cache insert path no longer holds mm_lock across allocation. Also avoid freeing a still-valid MR entry when the cache is full or a subscribe fails. As part of this, the ofi_rbnode_alloc and ofi_rbnode_free helpers are now public (the thin ofi_rbnode_new/ ofi_rbnode_del wrappers are removed) and the single-statement if is de-braced, per ofiwg maintainer review feedback. Signed-off-by: Armando Vinciguerra <armando.vinciguerra@cornelisnetworks.com>

FI_PEEK | FI_CLAIM removes the unexpected FIRST packet from the UE queue and stores it for a later FI_CLAIM receive. That receive must perform the same multi-packet eager bookkeeping as the normal unexpected-match path so remaining NTH packets can complete the message. Process pending MP eager UE packets after completing the claimed FIRST packet and queue the receive until all fragments arrive. Signed-off-by: Ken Raffenetti <ken.raffenetti@cornelisnetworks.com>

HFISVC pins memory regions eagerly at registration time. Two inputs that the upper libfabric layer accepts but HFISVC cannot service were producing kernel-side failures or silent truncation: 1. iov_base == NULL: hfi1_mem_region_pin returns -EFAULT, which surfaces as a crash from bulksvc_user.c:user_mr_record_create_ pinned_and_insert(). 2. iov_len > UINT32_MAX: the HFISVC cmd_mr_open API takes a uint32_t length, so larger sizes are silently truncated and only the low 4 GiB of the requested range is pinned. Reject both up front with FI_EOPNOTSUPP when use_hfisvc is set. Signed-off-by: Ken Raffenetti <ken.raffenetti@cornelisnetworks.com>

The default value of FI_OPX_RZV_MIN_PAYLOAD_BYTES is selected at build time based on the HMEM backend OPX was configured with (see OPX_RZV_MIN_PAYLOAD_BYTES_DEFAULT in prov/opx/include/rdma/opx/fi_opx_hfi1.h): - 4096 for CUDA-enabled builds - 8192 for AMD ROCR-enabled builds - 16385 for host-only builds (which effectively keeps payloads that fit in multi-packet eager off the rendezvous path) Previously the man page only listed "Defaults to 16385", which is correct only for host-only builds and gave no indication that GPU-enabled builds use a much lower rendezvous threshold. Spell the three defaults out so users tuning RZV behavior on CUDA/ROCR systems do not have to read the source. Documentation only; no functional change. Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com> Co-authored-by: Ken Raffenetti <raffenet@users.noreply.github.com> (cherry picked from commit 8611ecb)

Keep OPENED notify as observation-only for STL-77956 path by removing byte_counter decrement and rzv_comp free from MR notify handling. ICQ completion remains the owner for decrement/finalization. Signed-off-by: Andriy Kot <andriy.kot@cornelisnetworks.com> (cherry picked from commit c8baf82)

* prov/opx: Initialize deferred HFISVC receive contexts Deferred HMEM rendezvous receives enqueue work before the deferred worker submits RDMA reads. Initialize the matched receive context before enqueueing so the ICQ completion drains niov to zero instead of decrementing the posted-receive sentinel (-1). For truncated receives, keep the counted context pending until scratch reads complete, then report FI_ETRUNC through the CQ error path. * prov/opx: Limit deferred HFISVC context fix scope Keep the STL-77956 deferred receive context initialization focused on the non-truncated dmabuf path while preserving the existing HFISVC truncation behavior for follow-up work.\n\nThe deferred path now initializes the matched receive context before enqueueing only when the OPX MR is still opening and the receive is not truncated. * prov/opx: Format deferred HFISVC context initialization Apply the OPX provider formatting alignment to the deferred receive context initialization added for STL-77956. --------- Signed-off-by: Andriy Kot <andriy.kot@cornelisnetworks.com>

Multi-packet eager sends would only allow SYSTEM memory on the belief that RZV was more performant. That does not appear to be the case, so allow GPU memory to use the multi-packet eager path. Lowered send threshold default from 4096 to 512. cuda_gdrcopy_from_dev (CPU pulls from GPU BAR1 write-combining mapping) saturates due to uncached PCIe round-trips per cacheline. Above ~512B, ofi_cudaMemcpy(D2H) wins over gdrcopy. Below ~512B, gdrcopy's lower fixed cost still beats the cudaMemcpy CUDA-driver floor. The receive-side knob (OPX_HMEM_DEV_REG_RECV_THRESHOLD) is unchanged: CPU writes to GPU BAR1 via gdr_copy_to_dev hit the fast WC store path. Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com> (cherry picked from commit b44d45b)

avinciguerra07 force-pushed the upstream-contrib/main-sync-test branch 3 times, most recently from 0685738 to 7d3ac3d Compare June 10, 2026 20:09

bcernohous and others added 8 commits June 10, 2026 21:13

prov/opx: Drop mm lock before memory operations

5d80655

Drop the lock so memory monitor doesn't risk a deadlock. Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com> (cherry picked from commit ec4b693)

avinciguerra07 force-pushed the upstream-contrib/main-sync-test branch from 7d3ac3d to bf40ab9 Compare June 11, 2026 01:22

avinciguerra07 closed this Jun 11, 2026

avinciguerra07 deleted the upstream-contrib/main-sync-test branch June 12, 2026 00:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upstream contrib/main sync test#6

Upstream contrib/main sync test#6
avinciguerra07 wants to merge 8 commits into
mainfrom
upstream-contrib/main-sync-test

avinciguerra07 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

Conversation

avinciguerra07 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants