Skip to content

Upstream contrib/main sync test#6

Closed
avinciguerra07 wants to merge 8 commits into
mainfrom
upstream-contrib/main-sync-test
Closed

Upstream contrib/main sync test#6
avinciguerra07 wants to merge 8 commits into
mainfrom
upstream-contrib/main-sync-test

Conversation

@avinciguerra07

Copy link
Copy Markdown

No description provided.

@avinciguerra07 avinciguerra07 force-pushed the upstream-contrib/main-sync-test branch 3 times, most recently from 0685738 to 7d3ac3d Compare June 10, 2026 20:09
bcernohous and others added 8 commits June 10, 2026 21:13
Drop the lock so memory monitor doesn't risk a deadlock.

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
(cherry picked from commit ec4b693)
Drop the memory-monitor lock before performing memory operations and
pre-allocate the rbnode outside the lock, so the cache insert path no longer
holds mm_lock across allocation. Also avoid freeing a still-valid MR entry when
the cache is full or a subscribe fails. As part of this, the ofi_rbnode_alloc
and ofi_rbnode_free helpers are now public (the thin ofi_rbnode_new/
ofi_rbnode_del wrappers are removed) and the single-statement if is de-braced,
per ofiwg maintainer review feedback.

Signed-off-by: Armando Vinciguerra <armando.vinciguerra@cornelisnetworks.com>
FI_PEEK | FI_CLAIM removes the unexpected FIRST packet from the UE
queue and stores it for a later FI_CLAIM receive.  That receive must
perform the same multi-packet eager bookkeeping as the normal
unexpected-match path so remaining NTH packets can complete the
message.

Process pending MP eager UE packets after completing the claimed FIRST
packet and queue the receive until all fragments arrive.

Signed-off-by: Ken Raffenetti <ken.raffenetti@cornelisnetworks.com>
HFISVC pins memory regions eagerly at registration time. Two inputs
that the upper libfabric layer accepts but HFISVC cannot service were
producing kernel-side failures or silent truncation:

  1. iov_base == NULL: hfi1_mem_region_pin returns -EFAULT, which
     surfaces as a crash from bulksvc_user.c:user_mr_record_create_
     pinned_and_insert().
  2. iov_len > UINT32_MAX: the HFISVC cmd_mr_open API takes a uint32_t
     length, so larger sizes are silently truncated and only the low
     4 GiB of the requested range is pinned.

Reject both up front with FI_EOPNOTSUPP when use_hfisvc is set.

Signed-off-by: Ken Raffenetti <ken.raffenetti@cornelisnetworks.com>
The default value of FI_OPX_RZV_MIN_PAYLOAD_BYTES is selected at build
time based on the HMEM backend OPX was configured with (see
OPX_RZV_MIN_PAYLOAD_BYTES_DEFAULT in prov/opx/include/rdma/opx/fi_opx_hfi1.h):
  - 4096  for CUDA-enabled builds
  - 8192  for AMD ROCR-enabled builds
  - 16385 for host-only builds (which effectively keeps payloads that fit
          in multi-packet eager off the rendezvous path)
Previously the man page only listed "Defaults to 16385", which is correct
only for host-only builds and gave no indication that GPU-enabled builds
use a much lower rendezvous threshold. Spell the three defaults out so
users tuning RZV behavior on CUDA/ROCR systems do not have to read the
source.
Documentation only; no functional change.

Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
Co-authored-by: Ken Raffenetti <raffenet@users.noreply.github.com>
(cherry picked from commit 8611ecb)
Keep OPENED notify as observation-only for STL-77956 path by removing byte_counter decrement and rzv_comp free from MR notify handling. ICQ completion remains the owner for decrement/finalization.

Signed-off-by: Andriy Kot <andriy.kot@cornelisnetworks.com>
(cherry picked from commit c8baf82)
* prov/opx: Initialize deferred HFISVC receive contexts

Deferred HMEM rendezvous receives enqueue work before the deferred worker submits RDMA reads. Initialize the matched receive context before enqueueing so the ICQ completion drains niov to zero instead of decrementing the posted-receive sentinel (-1).

For truncated receives, keep the counted context pending until scratch reads complete, then report FI_ETRUNC through the CQ error path.

* prov/opx: Limit deferred HFISVC context fix scope

Keep the STL-77956 deferred receive context initialization focused on the non-truncated dmabuf path while preserving the existing HFISVC truncation behavior for follow-up work.\n\nThe deferred path now initializes the matched receive context before enqueueing only when the OPX MR is still opening and the receive is not truncated.

* prov/opx: Format deferred HFISVC context initialization

Apply the OPX provider formatting alignment to the deferred receive context initialization added for STL-77956.

---------

Signed-off-by: Andriy Kot <andriy.kot@cornelisnetworks.com>
Multi-packet eager sends would only allow SYSTEM memory on the belief
that RZV was more performant.  That does not appear to be the case, so
allow GPU memory to use the multi-packet eager path.

Lowered send threshold default from 4096 to 512.  cuda_gdrcopy_from_dev
(CPU pulls from GPU BAR1 write-combining mapping) saturates due to
uncached PCIe round-trips per cacheline.  Above ~512B,
ofi_cudaMemcpy(D2H) wins over gdrcopy.  Below ~512B, gdrcopy's lower
fixed cost still beats the cudaMemcpy CUDA-driver floor. The receive-side knob
(OPX_HMEM_DEV_REG_RECV_THRESHOLD) is unchanged: CPU writes to GPU BAR1 via
gdr_copy_to_dev hit the fast WC store path.

Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
(cherry picked from commit b44d45b)
@avinciguerra07 avinciguerra07 force-pushed the upstream-contrib/main-sync-test branch from 7d3ac3d to bf40ab9 Compare June 11, 2026 01:22
@avinciguerra07 avinciguerra07 deleted the upstream-contrib/main-sync-test branch June 12, 2026 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants