fix(ai): serialize sentence-transformer encoding to prevent GPU races#1182
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThe PR adds thread serialization to ChangesConcurrent encode serialization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🤖 Internal: Discord sync markerAuto-managed by the Discord notification workflow. Stores the linked Discord message ID. Do not edit or delete. |
Serialize both the wrapper encode() and raw shared-model access so concurrent inference on the shared NomicBert model does not race or trigger intermittent tensor size mismatches during instance processing. Adds a CUDA reproducer and focused regression coverage. Fixes rocketride-org#1169
bdedea6 to
bd4c627
Compare
stepmikhaylov
left a comment
There was a problem hiding this comment.
Requesting changes. Good diagnosis and repro, but the lock is too coarse.
Issue: Wrapping all of SentenceTransformer._encode_local in a per-instance self._encode_lock serializes the entire encode call — tokenization and postprocess included — when only the GPU forward pass needs protection. There's already an established pattern for this in the same package.
Fix: Adopt the WhisperLoader approach (audio/whisper.py): a class-level _model_locks registry keyed by id(model), acquired inside inference() around the forward pass only (see _get_model_lock). Apply the same to SentenceTransformerLoader.inference(), keyed on id(actual_model) (after the model_obj unwrap). This confines the critical section to the unsafe operation and lets tokenization/postprocess overlap. Once it's in the loader, drop the now-redundant self._encode_lock.
Additional point: the current fix also doesn't account for remote mode — self._encode_lock only covers the local wrapper path, while the static SentenceTransformerLoader.inference() path is left unsynchronized. Moving the lock into the loader resolves this too. (The commit message mentions "raw shared-model access," but only the wrapper is locked — please update it to match the final scope.)
Test: Keep the coverage but retarget it — the current test monkeypatches inference wholesale, which would replace the lock along with it. Stub only the GPU forward and assert serialization through the real inference(). The reproducer is fine to keep as-is.
Summary
encode()and raw shared-model access so concurrent inference on the shared NomicBert model does not race or trigger intermittent tensor size mismatches.Conflict-resolved by keeping upstream's
packages/ai/tests/conftest.py(the downstream commit's conftest additions were dropped). The core fix insentence_transformers.pyapplied cleanly. Please confirm the added tests run under upstream's conftest in CI.Testing
./builder test) — relying on GitHub Actions; not runnable in the contributor's local shell (engine build / Maven / torch unavailable). Static checks (compile, no conflict markers) pass.Linked Issue
Fixes #1169