Skip to content

Comments

feat: add Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B support#605

Open
n24q02m wants to merge 5 commits intoqdrant:mainfrom
n24q02m:feat/qwen3-support
Open

feat: add Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B support#605
n24q02m wants to merge 5 commits intoqdrant:mainfrom
n24q02m:feat/qwen3-support

Conversation

@n24q02m
Copy link

@n24q02m n24q02m commented Feb 13, 2026

Summary

Add native support for Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B models, including both INT8 and Q4F16 quantization variants.

These are the first causal-LM-based embedding and reranker models in fastembed, using fundamentally different architectures from existing BERT-family models.

Closes #528
Closes #529
Related to #530

What's Added

Qwen3TextEmbedding (fastembed/text/qwen3_embedding.py)

  • Last-token pooling instead of CLS/mean pooling (causal LM architecture)
  • Matryoshka Representation Learning (MRL): truncate embeddings to any dimension 32-1024 via dim= parameter
  • Instruction-aware: queries use Instruct: {task}\nQuery: {text} format
  • Static ONNX batch constraint: hardcoded batch_size=1 (causal LM ONNX graph limitation)
  • Two variants: INT8 (default) and Q4F16 (smaller, INT4 weights + FP16 activations)

Qwen3CrossEncoder (fastembed/rerank/cross_encoder/qwen3_cross_encoder.py)

  • Causal LM yes/no logit scoring instead of traditional relevance head
  • Chat-template formatting with system/user/assistant turns
  • Extracts last-token logits for "yes"/"no" tokens, applies softmax -> P(yes)
  • One-at-a-time processing (static batch=1 ONNX constraint)
  • Two variants: INT8 (default) and Q4F16

Shared Changes

  • last_token_pool() utility in fastembed/common/utils.py
  • LAST_TOKEN pooling type added to PoolingType enum
  • preprocessor_utils.py: graceful handling of missing special_tokens_map.json + fix for pad_token_id: null and dict pad_token in tokenizer configs
  • model_management.py: fix snapshot cache verification -- verify requested model_file exists in cached snapshot before returning (prevents stale cache hit when multiple variants share the same HF repo)
  • onnx_text_model.py: cast float16 outputs to float32 after inference (required for Q4F16 variant)

Models

Model Name Variant File Size
Qwen/Qwen3-Embedding-0.6B INT8 (default) onnx/model_quantized.onnx 573 MB
Qwen/Qwen3-Embedding-0.6B-Q4F16 Q4F16 onnx/model_q4f16.onnx 517 MB
Qwen/Qwen3-Reranker-0.6B INT8 (default) onnx/model_quantized.onnx 573 MB
Qwen/Qwen3-Reranker-0.6B-Q4F16 Q4F16 onnx/model_q4f16.onnx 518 MB

Tests

  • Canonical vector values for Qwen/Qwen3-Embedding-0.6B in test_text_onnx_embeddings.py
  • Canonical score values for Qwen/Qwen3-Reranker-0.6B in test_text_cross_encoder.py
  • All existing tests unaffected

ONNX Models

Pre-converted ONNX models hosted on HuggingFace:

Conversion pipeline:

  • INT8: torch.onnx.export (opset 21) + onnxruntime.quantization.quantize_dynamic (QInt8)
  • Q4F16: MatMulNBitsQuantizer (4-bit, block_size=128, symmetric) + FP16 cast

Usage

from fastembed import TextEmbedding, TextCrossEncoder

# Embedding (INT8 default)
model = TextEmbedding("Qwen/Qwen3-Embedding-0.6B")
embeddings = list(model.embed(["Hello world"]))               # 1024-dim
embeddings_256 = list(model.embed(["Hello world"], dim=256))  # MRL

# Embedding (Q4F16 -- smaller model)
model_q4 = TextEmbedding("Qwen/Qwen3-Embedding-0.6B-Q4F16")

# Query with instruction
query_emb = list(model.query_embed("What is machine learning?"))

# Reranker
reranker = TextCrossEncoder("Qwen/Qwen3-Reranker-0.6B")
scores = list(reranker.rerank("What is AI?", ["AI is...", "Pizza is..."]))

Verification

Canonical values verified against the ONNX model outputs:

  • Embedding: [-0.0223, 0.0187, -0.0145, -0.0854, 0.0122] (first 5 dims, "hello world")
  • Reranker: [0.9945, 0.0164] ("What is the capital of France?" vs Paris/Berlin docs)

INT8 and Q4F16 variants both produce correct semantic rankings.

Checklist

  • Model implementation files (INT8 + Q4F16 variants)
  • Registry integration (TextEmbedding, TextCrossEncoder)
  • Utility functions (last_token_pool)
  • Canonical test values
  • Snapshot cache fix for multi-variant model downloads
  • Float16 output handling
  • ruff check passes
  • ruff format passes
  • mypy passes (strict mode)

Add native support for Qwen3 embedding and reranker models:

- Qwen3TextEmbedding: last-token pooling, MRL (32-1024 dims), instruction-aware
- Qwen3CrossEncoder: causal LM yes/no logit scoring, chat-template formatting
- last_token_pool() utility for causal embedding models
- LAST_TOKEN pooling type in PoolingType enum
- Graceful handling of missing special_tokens_map.json in preprocessor_utils
- Fix pad_token_id=null and dict pad_token in tokenizer config

ONNX models hosted at:
- n24q02m/Qwen3-Embedding-0.6B-ONNX
- n24q02m/Qwen3-Reranker-0.6B-ONNX

Closes qdrant#528
Closes qdrant#529
Related to qdrant#530
@coderabbitai
Copy link

coderabbitai bot commented Feb 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Qwen3 support and last-token pooling: introduces PoolingType.LAST_TOKEN; implements last_token_pool utility handling left/right padding; makes tokenizer padding defensive and adds logging in preprocessor_utils; adds Qwen3TextEmbedding and Qwen3CrossEncoder (with workers, model registries, ONNX inference and last-token postprocessing); changes onnx_embed to normalize float16→float32 outputs; adds HuggingFace post-download verification for required files; and updates tests with canonical Qwen3 vectors and reranker scores.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • joein
  • tbung
  • dancixx
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title accurately and concisely summarizes the main changes: adding support for Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B models.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, detailing new models, implementation specifics, usage examples, and verification steps.
Linked Issues check ✅ Passed The PR successfully addresses both linked issues: #528 adds Qwen3 embedding model support, and #529 implements last-token pooling for causal LM models with proper left/right padding handling.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the PR objectives: new model implementations, utility functions, registry updates, and infrastructure fixes for model management and output handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

- Add logging warning when pad_token_id defaults to 0
- Hoist input_names computation out of per-text loop
- Add dim parameter validation (1 <= dim <= max_dim)
- Add batch_size warning when non-1 value is ignored
- Add docstrings to all public/internal methods for coverage
- Register Q4F16 variants for Qwen3-Embedding and Qwen3-Reranker
- Add float16-to-float32 cast after ONNX inference for Q4F16 outputs
- Fix snapshot_download cache bug: verify model_file exists in cached
  snapshot before returning (prevents stale cache hit when multiple
  variants share the same HF repo)
Adds tests to ensure that left-padding and last-token pooling correctly handles batch inference without losing positional context for short strings.
@n24q02m
Copy link
Author

n24q02m commented Feb 20, 2026

Self-Review & Batching Proof

I've added two new tests: test_qwen3_left_padding_batch and test_qwen3_reranker_left_padding_batch.
Since Qwen3 is a Causal LM, it strictly requires Left Padding when running inference on batched inputs of varying lengths. The newly added tests successfully prove that my last_token_pool implementation correctly fetches the true last token index (ignoring the <pad> tokens), making batch processing completely safe.

(Note on ONNX Weights: I have uploaded the exported ONNX architectures for these models at n24q02m/Qwen3-Embedding-0.6B-ONNX and n24q02m/Qwen3-Reranker-0.6B-ONNX. Please feel free to fork them into the official qdrant/ HuggingFace org, and I will update the URL references here accordingly!)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/test_text_cross_encoder.py (1)

164-172: Remove the redundant in-function import numpy as np and use next(iter(...)) for the single-element read.

numpy is already imported at module scope (line 4), so the duplicate import at line 171 is unnecessary. Additionally, Ruff flags line 164 (RUF015): materialising a full list only to take [0] is wasteful — use next(iter(...)) instead (this also aligns with the fix above where .score is dropped).

♻️ Proposed refactor
-        single_result = list(model.rerank(query, [short_doc]))[0].score
+        single_result = next(iter(model.rerank(query, [short_doc])))
         
         # Infer short string mixed in a batch with a very long string
         batch_results = list(model.rerank(query, [long_doc, short_doc]))
-        batch_result_short = batch_results[1].score
+        batch_result_short = batch_results[1]
         
         # Ensure the score is exactly the same, proving causal LM logit selection is precise
-        import numpy as np
         assert np.allclose(single_result, batch_result_short, atol=1e-4)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_text_cross_encoder.py` around lines 164 - 172, Replace the
redundant in-function "import numpy as np" with the module-level numpy and avoid
materialising a list for the single-element result: use
next(iter(model.rerank(query, [short_doc]))) to get the single_result (and keep
accessing .score as done for batch_results[1].score), referencing model.rerank,
single_result, batch_results, short_doc and long_doc; remove the local import
and update the first assignment accordingly so the subsequent np.allclose
assertion uses the same values without creating an intermediate list.
tests/test_text_onnx_embeddings.py (1)

233-233: Redundant list() materialisation; prefer next(iter(...)).

list(model.embed([short_text]))[0] fully materialises the generator before indexing. Since only the first (and only) element is needed, use the more efficient form flagged by Ruff (RUF015):

♻️ Proposed fix
-        single_result = list(model.embed([short_text]))[0]
+        single_result = next(iter(model.embed([short_text])))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_text_onnx_embeddings.py` at line 233, The test currently
materialises the entire generator via list(model.embed([short_text]))[0], which
is inefficient; change the extraction to use next(iter(...)) so only the first
element is consumed. Locate the call to model.embed([...]) in the test (where
single_result and short_text are used) and replace the list(...) indexing with
next(iter(model.embed([short_text]))) to avoid full materialisation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-168: The test is incorrectly accessing .score on values
returned by TextCrossEncoder.rerank / OnnxTextCrossEncoder.rerank which yield
raw floats; update the two uses in tests/test_text_cross_encoder.py (the
variables single_result and batch_result_short) to treat rerank results as
floats (e.g., assign the direct float from list(model.rerank(...))[0] and use
the second item directly from list(model.rerank(...))[1]) or convert to an array
with np.array(list(...)) before indexing — remove any .score attribute access so
the assertions operate on float values.

In `@tests/test_text_onnx_embeddings.py`:
- Line 240: Remove the redundant in-function import "import numpy as np" in
tests/test_text_onnx_embeddings.py (the local import inside the test) because
numpy is already imported at the module level; delete that line so the test uses
the top-level numpy name instead and avoid shadowing the existing np binding.
- Around line 225-241: The test is not exercising left-padding because
Qwen3TextEmbedding.embed forces batch_size=1 so sequences are tokenized
individually; fix by replacing the current integration-style check with a unit
test that directly verifies last_token_pool behavior: create synthetic hidden
states and an attention_mask representing left-padded and non-padded sequences,
call last_token_pool(hidden, attention_mask) and assert the pooled vectors match
the expected last-token vectors; refer to Qwen3TextEmbedding.embed (to avoid
relying on its batching behavior) and the last_token_pool function to locate
code to test.

---

Nitpick comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-172: Replace the redundant in-function "import numpy as np"
with the module-level numpy and avoid materialising a list for the
single-element result: use next(iter(model.rerank(query, [short_doc]))) to get
the single_result (and keep accessing .score as done for
batch_results[1].score), referencing model.rerank, single_result, batch_results,
short_doc and long_doc; remove the local import and update the first assignment
accordingly so the subsequent np.allclose assertion uses the same values without
creating an intermediate list.

In `@tests/test_text_onnx_embeddings.py`:
- Line 233: The test currently materialises the entire generator via
list(model.embed([short_text]))[0], which is inefficient; change the extraction
to use next(iter(...)) so only the first element is consumed. Locate the call to
model.embed([...]) in the test (where single_result and short_text are used) and
replace the list(...) indexing with next(iter(model.embed([short_text]))) to
avoid full materialisation.

batch_result_short = batch_results[1]

# Ensure the vector is exactly the same, proving left-padding last-token pooling is precise
import numpy as np
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Redundant in-function import — numpy already imported at module level.

import numpy as np on line 240 is shadowing the already-present top-level import on line 5. Remove it.

♻️ Proposed fix
-        import numpy as np
         assert np.allclose(single_result, batch_result_short, atol=1e-4)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import numpy as np
assert np.allclose(single_result, batch_result_short, atol=1e-4)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_text_onnx_embeddings.py` at line 240, Remove the redundant
in-function import "import numpy as np" in tests/test_text_onnx_embeddings.py
(the local import inside the test) because numpy is already imported at the
module level; delete that line so the test uses the top-level numpy name instead
and avoid shadowing the existing np binding.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 155-163: The test test_qwen3_reranker_left_padding_batch
unconditionally pulls Qwen/Qwen3-Reranker-0.6B; wrap it with the same CI guard
used elsewhere by calling should_test_model(model_name) (or using
pytest.mark.skipif with that predicate) and skip the test early when it returns
False so CI won't download the large model; locate the model_name variable and
the with model_cache(...) block in test_qwen3_reranker_left_padding_batch and
add the skip/guard before entering model_cache.

---

Duplicate comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-169: The test is still accessing .score on a bare float from
model.rerank; update the code that assigns batch_result_short to use the float
directly (e.g., batch_result_short = batch_results[1]) instead of
batch_results[1].score so it matches the earlier fix to single_result =
next(iter(model.rerank(...))); ensure any subsequent assertions compare the
float values (batch_result_short) rather than expecting an object with a .score
attribute.

Comment on lines +155 to +163
def test_qwen3_reranker_left_padding_batch(model_cache) -> None:
'''Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch.'''
model_name = "Qwen/Qwen3-Reranker-0.6B"
query = "Testing Qwen"
short_doc = "This is a short doc."
long_doc = "This is a significantly longer string that will force the shorter string to be padded with `<pad>` tokens on the left side during the tokenization phase. The embedding pooling must ignore these left padding tokens."

with model_cache(model_name) as model:
# Infer short string alone
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

No CI skip guard — will unconditionally download Qwen3-Reranker-0.6B on every CI run.

test_rerank uses should_test_model to gate large-model downloads; test_qwen3_reranker_left_padding_batch has no such guard and will attempt to download the full Qwen3-Reranker-0.6B model on every run, including lightweight CI pipelines.

💡 Proposed fix
 def test_qwen3_reranker_left_padding_batch(model_cache) -> None:
-    '''Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch.'''
+    """Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch."""
     model_name = "Qwen/Qwen3-Reranker-0.6B"
+    is_ci = os.getenv("CI")
+    is_manual = os.getenv("GITHUB_EVENT_NAME") == "workflow_dispatch"
+
+    model_desc = next(
+        (m for m in TextCrossEncoder._list_supported_models() if m.model == model_name), None
+    )
+    if model_desc is None or not should_test_model(model_desc, model_name, is_ci, is_manual):
+        pytest.skip(f"Skipping {model_name} — not selected for this run")
+
     query = "Testing Qwen"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_text_cross_encoder.py` around lines 155 - 163, The test
test_qwen3_reranker_left_padding_batch unconditionally pulls
Qwen/Qwen3-Reranker-0.6B; wrap it with the same CI guard used elsewhere by
calling should_test_model(model_name) (or using pytest.mark.skipif with that
predicate) and skip the test early when it returns False so CI won't download
the large model; locate the model_name variable and the with model_cache(...)
block in test_qwen3_reranker_left_padding_batch and add the skip/guard before
entering model_cache.

@n24q02m
Copy link
Author

n24q02m commented Feb 20, 2026

I've addressed the feedback from @coderabbitai.
Instead of doing an integration-level inference which forces batch_size=1 via the embedding loop generator, I have replaced the test with a direct unit test test_qwen3_left_padding_batch_unit() that injects a synthetic left-padded hidden_states tensor directly into the last_token_pool utility function. This proves beyond doubt that the padding extraction math is perfectly accurate for batched inputs.

@coderabbitai
Copy link

coderabbitai bot commented Feb 20, 2026

Tip

For best results, initiate chat on the files or code changes.

@n24q02m Thank you for addressing the feedback! The new unit test test_qwen3_left_padding_batch_unit() is a much cleaner approach that directly validates the last_token_pool logic with synthetic left-padded data. This provides clear, isolated proof that the padding extraction math handles batched inputs correctly. Great improvement! 🎉


If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Last token pooling for causal embedding models [Model]: Qwen/Qwen3-Embedding-0.6B-GGUF

1 participant