fix(ai): serialize sentence-transformer encoding to prevent GPU races #1182

Open
ksaurabhAparavi wants to merge 1 commit into rocketride-org:develop from ksaurabhAparavi:fix/RR-1169-sentence-transformer-concurrency

ksaurabhAparavi commented Jun 8, 2026

Contributor

Summary

  • Serialize both the wrapper encode() and raw shared-model access so concurrent inference on the shared NomicBert model does not race or trigger intermittent tensor size mismatches.
  • Add a CUDA reproducer and focused regression coverage.

⚠️ Reviewer note

Conflict-resolved by keeping upstream's packages/ai/tests/conftest.py (the downstream commit's conftest additions were dropped). The core fix in sentence_transformers.py applied cleanly. Please confirm the added tests run under upstream's conftest in CI.

Testing

  • CI (./builder test) — relying on GitHub Actions; not runnable in the contributor's local shell (engine build / Maven / torch unavailable). Static checks (compile, no conflict markers) pass.

Linked Issue

Fixes #1169

coderabbitai Bot commented Jun 8, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 61db1aaf-2a01-4c86-a0f7-5069281f9746

📥 Commits

Reviewing files that changed from the base of the PR and between efecb7e and bd4c627.

📒 Files selected for processing (4)
  • packages/ai/src/ai/common/models/transformers/sentence_transformers.py
  • packages/ai/tests/ai/common/models/transformers/__init__.py
  • packages/ai/tests/ai/common/models/transformers/reproduce_sentence_transformer_origin.py
  • packages/ai/tests/ai/common/models/transformers/test_sentence_transformers.py

📝 Walkthrough

The PR serializes the local inference path of SentenceTransformer.encode() with a mutex lock, so concurrent calls can no longer interleave their preprocess, inference, and postprocess steps. It also includes a unit test that verifies the serialization and a standalone GPU reproducer script for manual testing.

Changes

Concurrent encode serialization

| Layer / File(s) | Summary |
| --- | --- |
| Serialization lock implementation: packages/ai/src/ai/common/models/transformers/sentence_transformers.py | threading module imported; self._encode_lock created in __init__; _encode_local() acquires lock around batched preprocess → inference → postprocess, serializing concurrent local encodes. |
| Unit test verification: packages/ai/tests/ai/common/models/transformers/test_sentence_transformers.py | Monkeypatches SentenceTransformer loader and pipeline; spawns concurrent encode() calls via ThreadPoolExecutor; asserts inference executes serially (max 1 simultaneous active call); validates NumPy array output shape (4, 1). |
| Manual reproducer and test infrastructure: packages/ai/tests/ai/common/models/transformers/__init__.py, reproduce_sentence_transformer_origin.py | Standalone GPU reproducer generates variable-length synthetic batches, logs encode events with thread/worker identity, and supports sequential/concurrent execution modes via CLI flags; test package __init__.py header added. |
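
For orientation, the per-instance locking summarized in the table can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `EncoderSketch` and its `_preprocess`, `_inference`, and `_postprocess` helpers are hypothetical stand-ins, and only the names `_encode_lock` and `_encode_local` come from the summary above.

```python
import threading


class EncoderSketch:
    """Hypothetical stand-in for the wrapper; not the real SentenceTransformer class."""

    def __init__(self):
        # One lock per wrapper instance, created at construction time.
        self._encode_lock = threading.Lock()

    def _preprocess(self, texts):
        # Placeholder for tokenization (assumption for illustration).
        return [t.strip().lower() for t in texts]

    def _inference(self, tokens):
        # Placeholder for the model forward pass (assumption for illustration).
        return [[float(len(t))] for t in tokens]

    def _postprocess(self, vectors):
        # Placeholder for output conversion (assumption for illustration).
        return vectors

    def _encode_local(self, texts):
        # All three stages run under the lock, so concurrent callers on the
        # same instance cannot interleave any part of the pipeline.
        with self._encode_lock:
            tokens = self._preprocess(texts)
            vectors = self._inference(tokens)
            return self._postprocess(vectors)
```

Because the lock spans all three stages, two threads calling `_encode_local` on the same instance run strictly one after the other, including the stages that do not touch the model.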

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • jmaionchi
  • stepmikhaylov
  • Rod-Christensen

Poem

🐰 A lock on the encoder so fair,
No more shall the threads interfere,
Sequential encode, now pristine,
GPU tensors stay serene—
One thread at a time, we declare!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding serialization to sentence-transformer encoding to prevent GPU race conditions. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions Bot added the module:ai (AI/ML modules) label Jun 8, 2026

github-actions Bot commented Jun 8, 2026

🤖 Internal: Discord sync marker

Auto-managed by the Discord notification workflow. Stores the linked Discord message ID. Do not edit or delete.

Serialize both the wrapper encode() and raw shared-model access so concurrent
inference on the shared NomicBert model does not race or trigger intermittent
tensor size mismatches during instance processing. Adds a CUDA reproducer and
focused regression coverage.

Fixes rocketride-org#1169
ksaurabhAparavi force-pushed the fix/RR-1169-sentence-transformer-concurrency branch from bdedea6 to bd4c627 June 8, 2026 11:51

stepmikhaylov left a comment

Collaborator

Requesting changes. Good diagnosis and repro, but the lock is too coarse.

Issue: Wrapping all of SentenceTransformer._encode_local in a per-instance self._encode_lock serializes the entire encode call — tokenization and postprocess included — when only the GPU forward pass needs protection. There's already an established pattern for this in the same package.

Fix: Adopt the WhisperLoader approach (audio/whisper.py): a class-level _model_locks registry keyed by id(model), acquired inside inference() around the forward pass only (see _get_model_lock). Apply the same to SentenceTransformerLoader.inference(), keyed on id(actual_model) (after the model_obj unwrap). This confines the critical section to the unsafe operation and lets tokenization/postprocess overlap. Once it's in the loader, drop the now-redundant self._encode_lock.
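
As a rough illustration of the pattern described above, a class-level lock registry keyed by `id(model)` might look like the sketch below. It is not the actual WhisperLoader or SentenceTransformerLoader code: `LoaderSketch`, the `_registry_lock` guard, the whitespace tokenizer, and the `getattr`-based unwrap are assumptions made for the example. Only `_model_locks`, `_get_model_lock`, `inference`, and `model_obj` are names taken from the comment.

```python
import threading


class LoaderSketch:
    """Hypothetical loader; mirrors the names in the review, not the real code."""

    _model_locks = {}                    # id(model) -> threading.Lock
    _registry_lock = threading.Lock()    # assumed guard for the registry itself

    @classmethod
    def _get_model_lock(cls, model):
        # Create the per-model lock on first use; later calls reuse it.
        with cls._registry_lock:
            key = id(model)
            if key not in cls._model_locks:
                cls._model_locks[key] = threading.Lock()
            return cls._model_locks[key]

    @classmethod
    def inference(cls, model, texts):
        # Assumed unwrap: use model.model_obj when present, else the model.
        actual_model = getattr(model, "model_obj", model)

        # Tokenization (a placeholder here) runs outside the lock, so it can
        # overlap with other threads.
        tokens = [t.split() for t in texts]

        # Only the forward pass is inside the critical section.
        with cls._get_model_lock(actual_model):
            raw = actual_model(tokens)

        # Postprocessing also runs outside the lock.
        return [[float(x)] for x in raw]
```

Since the lock is looked up from the model object rather than stored on a wrapper instance, every caller that reaches the same model through `inference()` shares one lock. One caveat with keying on `id()`: Python only guarantees that ids are unique among objects alive at the same time, so a registry like this assumes models stay loaded or that entries are cleaned up when a model is released.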

Additional point: the current fix also doesn't account for remote mode — self._encode_lock only covers the local wrapper path, while the static SentenceTransformerLoader.inference() path is left unsynchronized. Moving the lock into the loader resolves this too. (The commit message mentions "raw shared-model access," but only the wrapper is locked — please update it to match the final scope.)

Test: Keep the coverage but retarget it — the current test monkeypatches inference wholesale, which would replace the lock along with it. Stub only the GPU forward and assert serialization through the real inference(). The reproducer is fine to keep as-is.
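
One possible shape for the retargeted test is sketched below. It is an assumption-laden illustration rather than the project's test: the module-level `inference()` and `_forward_lock` stand in for the real loader and its per-model lock, and `ForwardProbe` is a hypothetical stub for the GPU forward pass. The point is that the locking code stays real while only the forward call is replaced.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the loader's per-model lock (assumption for illustration).
_forward_lock = threading.Lock()


def inference(model, batch):
    """Hypothetical 'real' inference: only the forward pass is locked."""
    with _forward_lock:
        return model(batch)


class ForwardProbe:
    """Stub forward pass that records how many calls overlap."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the counters below
        self.active = 0
        self.max_active = 0

    def __call__(self, batch):
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # widen the window in which overlap could occur
        with self._guard:
            self.active -= 1
        return [len(item) for item in batch]


def run_concurrent_encodes(workers=4):
    """Call inference() from several threads; return peak overlap and results."""
    probe = ForwardProbe()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda i: inference(probe, ["x" * i]), range(1, workers + 1))
        )
    return probe.max_active, results
```

A test built this way would assert that the peak overlap returned by `run_concurrent_encodes()` is 1. If the lock inside `inference()` were removed, the probe would be expected to observe overlapping calls, which is what makes the assertion meaningful.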

Development

Successfully merging this pull request may close these issues.

Concurrent shared sentence-transformer inference races (tensor size mismatches)