feat: add Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B support#605
n24q02m wants to merge 5 commits into qdrant:main
Conversation
Add native support for Qwen3 embedding and reranker models:
- Qwen3TextEmbedding: last-token pooling, MRL (32-1024 dims), instruction-aware
- Qwen3CrossEncoder: causal LM yes/no logit scoring, chat-template formatting
- last_token_pool() utility for causal embedding models
- LAST_TOKEN pooling type in PoolingType enum
- Graceful handling of missing special_tokens_map.json in preprocessor_utils
- Fix pad_token_id=null and dict pad_token in tokenizer config

ONNX models hosted at:
- n24q02m/Qwen3-Embedding-0.6B-ONNX
- n24q02m/Qwen3-Reranker-0.6B-ONNX

Closes qdrant#528
Closes qdrant#529
Related to qdrant#530
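For orientation, a minimal sketch of what a last-token pooling helper along these lines could look like; the argument names, shapes, and the left/right padding check are assumptions for illustration, not the exact code in this PR.

```python
import numpy as np


def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padding token for each sequence.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    """
    # Left padding: every sequence ends with a real token, so the final position is correct.
    if attention_mask[:, -1].all():
        return hidden_states[:, -1]
    # Right padding: the last real token sits at (number of real tokens - 1).
    last_indices = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_indices]
```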
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository settings, or manage reviews with CodeRabbit's review commands.
📝 Walkthrough

Adds Qwen3 support and last-token pooling: introduces PoolingType.LAST_TOKEN; implements a last_token_pool utility handling left/right padding; makes tokenizer padding defensive and adds logging in preprocessor_utils; adds Qwen3TextEmbedding and Qwen3CrossEncoder (with workers, model registries, ONNX inference and last-token postprocessing); changes onnx_embed to normalize float16→float32 outputs; adds HuggingFace post-download verification for required files; and updates tests with canonical Qwen3 vectors and reranker scores.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed (warning)
- Add logging warning when pad_token_id defaults to 0
- Hoist input_names computation out of per-text loop
- Add dim parameter validation (1 <= dim <= max_dim)
- Add batch_size warning when non-1 value is ignored
- Add docstrings to all public/internal methods for coverage
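A rough sketch of the validation and warning described in this commit; the function names and message wording are assumptions for illustration only.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_dim(dim: Optional[int], max_dim: int) -> int:
    """Validate an MRL truncation dimension against the model's full output dimension."""
    if dim is None:
        return max_dim
    if not 1 <= dim <= max_dim:
        raise ValueError(f"dim must satisfy 1 <= dim <= {max_dim}, got {dim}")
    return dim


def check_batch_size(batch_size: int) -> None:
    """Warn when a non-1 batch_size is requested but will be ignored."""
    if batch_size != 1:
        logger.warning("batch_size=%s is ignored; inference runs with batch_size=1", batch_size)
```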
- Register Q4F16 variants for Qwen3-Embedding and Qwen3-Reranker
- Add float16-to-float32 cast after ONNX inference for Q4F16 outputs
- Fix snapshot_download cache bug: verify model_file exists in cached snapshot before returning (prevents stale cache hit when multiple variants share the same HF repo)
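A sketch of the cache-verification idea from this commit, assuming huggingface_hub's snapshot_download; the helper name, the local-first lookup, and the fallback strategy are illustrative assumptions rather than the PR's exact code.

```python
import os

from huggingface_hub import snapshot_download


def download_snapshot_with_check(repo_id: str, model_file: str, cache_dir: str) -> str:
    """Return a snapshot directory that actually contains the requested model_file."""
    try:
        # Prefer whatever is already cached locally.
        snapshot_dir = snapshot_download(repo_id=repo_id, cache_dir=cache_dir, local_files_only=True)
    except Exception:
        snapshot_dir = ""
    if not snapshot_dir or not os.path.isfile(os.path.join(snapshot_dir, model_file)):
        # Stale cache hit: another variant sharing the same repo may have been cached
        # without this file, so fall back to a real download.
        snapshot_dir = snapshot_download(repo_id=repo_id, cache_dir=cache_dir)
    return snapshot_dir
```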
Adds tests to ensure that left-padding and last-token pooling correctly handle batch inference without losing positional context for short strings.
Self-Review & Batching Proof

I've added two new tests. (Note on ONNX weights: I have uploaded the exported ONNX architectures for these models at n24q02m/Qwen3-Embedding-0.6B-ONNX and n24q02m/Qwen3-Reranker-0.6B-ONNX.)
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/test_text_cross_encoder.py (1)
164-172: Remove the redundant in-function `import numpy as np` and use `next(iter(...))` for the single-element read.
`numpy` is already imported at module scope (line 4), so the duplicate import at line 171 is unnecessary. Additionally, Ruff flags line 164 (RUF015): materialising a full list only to take `[0]` is wasteful; use `next(iter(...))` instead (this also aligns with the fix above where `.score` is dropped).

♻️ Proposed refactor
```diff
-        single_result = list(model.rerank(query, [short_doc]))[0].score
+        single_result = next(iter(model.rerank(query, [short_doc])))
         # Infer short string mixed in a batch with a very long string
         batch_results = list(model.rerank(query, [long_doc, short_doc]))
-        batch_result_short = batch_results[1].score
+        batch_result_short = batch_results[1]
         # Ensure the score is exactly the same, proving causal LM logit selection is precise
-        import numpy as np
         assert np.allclose(single_result, batch_result_short, atol=1e-4)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_text_cross_encoder.py` around lines 164 - 172, Replace the redundant in-function "import numpy as np" with the module-level numpy and avoid materialising a list for the single-element result: use next(iter(model.rerank(query, [short_doc]))) to get the single_result (and keep accessing .score as done for batch_results[1].score), referencing model.rerank, single_result, batch_results, short_doc and long_doc; remove the local import and update the first assignment accordingly so the subsequent np.allclose assertion uses the same values without creating an intermediate list.

tests/test_text_onnx_embeddings.py (1)
233-233: Redundant `list()` materialisation; prefer `next(iter(...))`.
`list(model.embed([short_text]))[0]` fully materialises the generator before indexing. Since only the first (and only) element is needed, use the more efficient form flagged by Ruff (RUF015):

♻️ Proposed fix
```diff
-        single_result = list(model.embed([short_text]))[0]
+        single_result = next(iter(model.embed([short_text])))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_text_onnx_embeddings.py` at line 233, The test currently materialises the entire generator via list(model.embed([short_text]))[0], which is inefficient; change the extraction to use next(iter(...)) so only the first element is consumed. Locate the call to model.embed([...]) in the test (where single_result and short_text are used) and replace the list(...) indexing with next(iter(model.embed([short_text]))) to avoid full materialisation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-168: The test is incorrectly accessing .score on values
returned by TextCrossEncoder.rerank / OnnxTextCrossEncoder.rerank which yield
raw floats; update the two uses in tests/test_text_cross_encoder.py (the
variables single_result and batch_result_short) to treat rerank results as
floats (e.g., assign the direct float from list(model.rerank(...))[0] and use
the second item directly from list(model.rerank(...))[1]) or convert to an array
with np.array(list(...)) before indexing — remove any .score attribute access so
the assertions operate on float values.
In `@tests/test_text_onnx_embeddings.py`:
- Line 240: Remove the redundant in-function import "import numpy as np" in
tests/test_text_onnx_embeddings.py (the local import inside the test) because
numpy is already imported at the module level; delete that line so the test uses
the top-level numpy name instead and avoid shadowing the existing np binding.
- Around line 225-241: The test is not exercising left-padding because
Qwen3TextEmbedding.embed forces batch_size=1 so sequences are tokenized
individually; fix by replacing the current integration-style check with a unit
test that directly verifies last_token_pool behavior: create synthetic hidden
states and an attention_mask representing left-padded and non-padded sequences,
call last_token_pool(hidden, attention_mask) and assert the pooled vectors match
the expected last-token vectors; refer to Qwen3TextEmbedding.embed (to avoid
relying on its batching behavior) and the last_token_pool function to locate
code to test.
---
Nitpick comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-172: Replace the redundant in-function "import numpy as np"
with the module-level numpy and avoid materialising a list for the
single-element result: use next(iter(model.rerank(query, [short_doc]))) to get
the single_result (and keep accessing .score as done for
batch_results[1].score), referencing model.rerank, single_result, batch_results,
short_doc and long_doc; remove the local import and update the first assignment
accordingly so the subsequent np.allclose assertion uses the same values without
creating an intermediate list.
In `@tests/test_text_onnx_embeddings.py`:
- Line 233: The test currently materialises the entire generator via
list(model.embed([short_text]))[0], which is inefficient; change the extraction
to use next(iter(...)) so only the first element is consumed. Locate the call to
model.embed([...]) in the test (where single_result and short_text are used) and
replace the list(...) indexing with next(iter(model.embed([short_text]))) to
avoid full materialisation.
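The prompt above asks for a direct unit test of last_token_pool with synthetic inputs. A minimal sketch of such a test, assuming last_token_pool(hidden_states, attention_mask) is importable from fastembed.common.utils and takes (batch, seq_len, dim) hidden states with a (batch, seq_len) mask (the signature and argument order are assumptions):

```python
import numpy as np

from fastembed.common.utils import last_token_pool  # location per the PR description


def test_last_token_pool_handles_left_and_right_padding() -> None:
    # Two sequences, three positions, two dims; every position gets a distinct vector.
    hidden = np.arange(12, dtype=np.float32).reshape(2, 3, 2)

    # Left padding: row 0 has a pad at position 0, so its last real token is position 2.
    left_mask = np.array([[0, 1, 1], [1, 1, 1]])
    pooled = last_token_pool(hidden, left_mask)
    assert np.allclose(pooled, hidden[:, -1])

    # Right padding: row 0 has a pad at position 2, so its last real token is position 1.
    right_mask = np.array([[1, 1, 0], [1, 1, 1]])
    pooled = last_token_pool(hidden, right_mask)
    assert np.allclose(pooled[0], hidden[0, 1])
    assert np.allclose(pooled[1], hidden[1, 2])
```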
tests/test_text_onnx_embeddings.py (Outdated)

```python
        batch_result_short = batch_results[1]

        # Ensure the vector is exactly the same, proving left-padding last-token pooling is precise
        import numpy as np
```
Redundant in-function import — numpy already imported at module level.
`import numpy as np` on line 240 is shadowing the already-present top-level import on line 5. Remove it.
♻️ Proposed fix

```diff
-        import numpy as np
         assert np.allclose(single_result, batch_result_short, atol=1e-4)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-        import numpy as np
         assert np.allclose(single_result, batch_result_short, atol=1e-4)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_text_onnx_embeddings.py` at line 240, Remove the redundant
in-function import "import numpy as np" in tests/test_text_onnx_embeddings.py
(the local import inside the test) because numpy is already imported at the
module level; delete that line so the test uses the top-level numpy name instead
and avoid shadowing the existing np binding.
n24q02m force-pushed the branch from b4c8898 to deaf664.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 155-163: The test test_qwen3_reranker_left_padding_batch
unconditionally pulls Qwen/Qwen3-Reranker-0.6B; wrap it with the same CI guard
used elsewhere by calling should_test_model(model_name) (or using
pytest.mark.skipif with that predicate) and skip the test early when it returns
False so CI won't download the large model; locate the model_name variable and
the with model_cache(...) block in test_qwen3_reranker_left_padding_batch and
add the skip/guard before entering model_cache.
---
Duplicate comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-169: The test is still accessing .score on a bare float from
model.rerank; update the code that assigns batch_result_short to use the float
directly (e.g., batch_result_short = batch_results[1]) instead of
batch_results[1].score so it matches the earlier fix to single_result =
next(iter(model.rerank(...))); ensure any subsequent assertions compare the
float values (batch_result_short) rather than expecting an object with a .score
attribute.
```python
def test_qwen3_reranker_left_padding_batch(model_cache) -> None:
    '''Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch.'''
    model_name = "Qwen/Qwen3-Reranker-0.6B"
    query = "Testing Qwen"
    short_doc = "This is a short doc."
    long_doc = "This is a significantly longer string that will force the shorter string to be padded with `<pad>` tokens on the left side during the tokenization phase. The embedding pooling must ignore these left padding tokens."

    with model_cache(model_name) as model:
        # Infer short string alone
```
No CI skip guard — will unconditionally download Qwen3-Reranker-0.6B on every CI run.
test_rerank uses should_test_model to gate large-model downloads; test_qwen3_reranker_left_padding_batch has no such guard and will attempt to download the full Qwen3-Reranker-0.6B model on every run, including lightweight CI pipelines.
💡 Proposed fix
```diff
 def test_qwen3_reranker_left_padding_batch(model_cache) -> None:
-    '''Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch.'''
+    """Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch."""
     model_name = "Qwen/Qwen3-Reranker-0.6B"
+    is_ci = os.getenv("CI")
+    is_manual = os.getenv("GITHUB_EVENT_NAME") == "workflow_dispatch"
+
+    model_desc = next(
+        (m for m in TextCrossEncoder._list_supported_models() if m.model == model_name), None
+    )
+    if model_desc is None or not should_test_model(model_desc, model_name, is_ci, is_manual):
+        pytest.skip(f"Skipping {model_name} — not selected for this run")
+
     query = "Testing Qwen"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_text_cross_encoder.py` around lines 155 - 163, The test
test_qwen3_reranker_left_padding_batch unconditionally pulls
Qwen/Qwen3-Reranker-0.6B; wrap it with the same CI guard used elsewhere by
calling should_test_model(model_name) (or using pytest.mark.skipif with that
predicate) and skip the test early when it returns False so CI won't download
the large model; locate the model_name variable and the with model_cache(...)
block in test_qwen3_reranker_left_padding_batch and add the skip/guard before
entering model_cache.
I've addressed the feedback from @coderabbitai.
Summary
Add native support for Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B models, including both INT8 and Q4F16 quantization variants.
These are the first causal-LM-based embedding and reranker models in fastembed, using fundamentally different architectures from existing BERT-family models.
Closes #528
Closes #529
Related to #530
What's Added
Qwen3TextEmbedding (`fastembed/text/qwen3_embedding.py`)
- Last-token pooling for the causal LM architecture
- MRL truncation via the `dim=` parameter (32-1024 dims)
- Instruction-aware queries in the `Instruct: {task}\nQuery: {text}` format
- `batch_size=1` (causal LM ONNX graph limitation)

Qwen3CrossEncoder (`fastembed/rerank/cross_encoder/qwen3_cross_encoder.py`)
- Causal LM yes/no logit scoring with chat-template formatting

Shared Changes
- `last_token_pool()` utility in `fastembed/common/utils.py`
- `LAST_TOKEN` pooling type added to the `PoolingType` enum
- `preprocessor_utils.py`: graceful handling of missing `special_tokens_map.json` + fix for `pad_token_id: null` and dict `pad_token` in tokenizer configs
- `model_management.py`: fix snapshot cache verification -- verify the requested `model_file` exists in the cached snapshot before returning (prevents stale cache hits when multiple variants share the same HF repo)
- `onnx_text_model.py`: cast float16 outputs to float32 after inference (required for the Q4F16 variant)

Models
| Model | ONNX file |
| --- | --- |
| `Qwen/Qwen3-Embedding-0.6B` | `onnx/model_quantized.onnx` |
| `Qwen/Qwen3-Embedding-0.6B-Q4F16` | `onnx/model_q4f16.onnx` |
| `Qwen/Qwen3-Reranker-0.6B` | `onnx/model_quantized.onnx` |
| `Qwen/Qwen3-Reranker-0.6B-Q4F16` | `onnx/model_q4f16.onnx` |

Tests
- Canonical vectors for `Qwen/Qwen3-Embedding-0.6B` in `test_text_onnx_embeddings.py`
- Canonical reranker scores for `Qwen/Qwen3-Reranker-0.6B` in `test_text_cross_encoder.py`

ONNX Models
Pre-converted ONNX models hosted on HuggingFace:
- n24q02m/Qwen3-Embedding-0.6B-ONNX
- n24q02m/Qwen3-Reranker-0.6B-ONNX
Conversion pipeline:
- `torch.onnx.export` (opset 21) + `onnxruntime.quantization.quantize_dynamic` (QInt8)
- `MatMulNBitsQuantizer` (4-bit, block_size=128, symmetric) + FP16 cast

Usage
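The rendered usage snippet did not survive here; the sketch below shows how these models would be called through the standard fastembed API. Model names come from the table above; the exact import path for TextCrossEncoder and the returned types are assumptions based on the rest of this PR.

```python
from fastembed import TextEmbedding
from fastembed.rerank.cross_encoder import TextCrossEncoder

# Embedding: last-token pooling and instruction formatting happen inside the model wrapper.
embedder = TextEmbedding(model_name="Qwen/Qwen3-Embedding-0.6B")
vectors = list(embedder.embed(["hello world"]))  # generator of np.ndarray, materialised here

# Reranking: rerank yields one raw float score per document.
reranker = TextCrossEncoder(model_name="Qwen/Qwen3-Reranker-0.6B")
scores = list(reranker.rerank(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
))
```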
Verification
Canonical values verified against the ONNX model outputs:
- Embedding: `[-0.0223, 0.0187, -0.0145, -0.0854, 0.0122]` (first 5 dims, "hello world")
- Reranker: `[0.9945, 0.0164]` ("What is the capital of France?" vs Paris/Berlin docs)

INT8 and Q4F16 variants both produce correct semantic rankings.
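As a sanity check, the embedding canonical value above could be verified roughly like this; the tolerance and comparison style are assumptions, and the test suite uses its own canonical tables.

```python
import numpy as np

from fastembed import TextEmbedding

model = TextEmbedding(model_name="Qwen/Qwen3-Embedding-0.6B")
vector = next(iter(model.embed(["hello world"])))
canonical_first_5 = np.array([-0.0223, 0.0187, -0.0145, -0.0854, 0.0122])
assert np.allclose(vector[:5], canonical_first_5, atol=1e-3)  # tolerance is an assumption
```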
Checklist
- Unit tests added (`last_token_pool`)
- `ruff check` passes
- `ruff format` passes
- `mypy` passes (strict mode)