feat: add Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B support by n24q02m · Pull Request #605 · qdrant/fastembed

n24q02m · 2026-02-13T19:15:28Z

Summary

Add native support for Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B models, including both INT8 and Q4F16 quantization variants.

These are the first causal-LM-based embedding and reranker models in fastembed, using fundamentally different architectures from existing BERT-family models.

Closes #528
Closes #529
Related to #530

What's Added

Qwen3TextEmbedding (`fastembed/text/qwen3_embedding.py`)

Last-token pooling instead of CLS/mean pooling (causal LM architecture)
Matryoshka Representation Learning (MRL): truncate embeddings to any dimension 32-1024 via dim= parameter
Instruction-aware: queries use Instruct: {task}\nQuery: {text} format
Static ONNX batch constraint: hardcoded batch_size=1 (causal LM ONNX graph limitation)
Two variants: INT8 (default) and Q4F16 (smaller, INT4 weights + FP16 activations)

Qwen3CrossEncoder (`fastembed/rerank/cross_encoder/qwen3_cross_encoder.py`)

Causal LM yes/no logit scoring instead of traditional relevance head
Chat-template formatting with system/user/assistant turns
Extracts last-token logits for "yes"/"no" tokens, applies softmax -> P(yes)
One-at-a-time processing (static batch=1 ONNX constraint)
Two variants: INT8 (default) and Q4F16

Shared Changes

last_token_pool() utility in fastembed/common/utils.py
LAST_TOKEN pooling type added to PoolingType enum
preprocessor_utils.py: graceful handling of missing special_tokens_map.json + fix for pad_token_id: null and dict pad_token in tokenizer configs
model_management.py: fix snapshot cache verification -- verify requested model_file exists in cached snapshot before returning (prevents stale cache hit when multiple variants share the same HF repo)
onnx_text_model.py: cast float16 outputs to float32 after inference (required for Q4F16 variant)

Models

Model Name	Variant	File	Size
`Qwen/Qwen3-Embedding-0.6B`	INT8 (default)	`onnx/model_quantized.onnx`	573 MB
`Qwen/Qwen3-Embedding-0.6B-Q4F16`	Q4F16	`onnx/model_q4f16.onnx`	517 MB
`Qwen/Qwen3-Reranker-0.6B`	INT8 (default)	`onnx/model_quantized.onnx`	573 MB
`Qwen/Qwen3-Reranker-0.6B-Q4F16`	Q4F16	`onnx/model_q4f16.onnx`	518 MB

Tests

Canonical vector values for Qwen/Qwen3-Embedding-0.6B in test_text_onnx_embeddings.py
Canonical score values for Qwen/Qwen3-Reranker-0.6B in test_text_cross_encoder.py
All existing tests unaffected

ONNX Models

Pre-converted ONNX models hosted on HuggingFace:

n24q02m/Qwen3-Embedding-0.6B-ONNX -- INT8 + Q4F16 variants
n24q02m/Qwen3-Reranker-0.6B-ONNX -- INT8 + Q4F16 variants

Conversion pipeline:

INT8: torch.onnx.export (opset 21) + onnxruntime.quantization.quantize_dynamic (QInt8)
Q4F16: MatMulNBitsQuantizer (4-bit, block_size=128, symmetric) + FP16 cast

Usage

from fastembed import TextEmbedding, TextCrossEncoder

# Embedding (INT8 default)
model = TextEmbedding("Qwen/Qwen3-Embedding-0.6B")
embeddings = list(model.embed(["Hello world"]))               # 1024-dim
embeddings_256 = list(model.embed(["Hello world"], dim=256))  # MRL

# Embedding (Q4F16 -- smaller model)
model_q4 = TextEmbedding("Qwen/Qwen3-Embedding-0.6B-Q4F16")

# Query with instruction
query_emb = list(model.query_embed("What is machine learning?"))

# Reranker
reranker = TextCrossEncoder("Qwen/Qwen3-Reranker-0.6B")
scores = list(reranker.rerank("What is AI?", ["AI is...", "Pizza is..."]))

Verification

Canonical values verified against the ONNX model outputs:

Embedding: [-0.0223, 0.0187, -0.0145, -0.0854, 0.0122] (first 5 dims, "hello world")
Reranker: [0.9945, 0.0164] ("What is the capital of France?" vs Paris/Berlin docs)

INT8 and Q4F16 variants both produce correct semantic rankings.

Checklist

Model implementation files (INT8 + Q4F16 variants)
Registry integration (TextEmbedding, TextCrossEncoder)
Utility functions (last_token_pool)
Canonical test values
Snapshot cache fix for multi-variant model downloads
Float16 output handling
ruff check passes
ruff format passes
mypy passes (strict mode)

Add native support for Qwen3 embedding and reranker models: - Qwen3TextEmbedding: last-token pooling, MRL (32-1024 dims), instruction-aware - Qwen3CrossEncoder: causal LM yes/no logit scoring, chat-template formatting - last_token_pool() utility for causal embedding models - LAST_TOKEN pooling type in PoolingType enum - Graceful handling of missing special_tokens_map.json in preprocessor_utils - Fix pad_token_id=null and dict pad_token in tokenizer config ONNX models hosted at: - n24q02m/Qwen3-Embedding-0.6B-ONNX - n24q02m/Qwen3-Reranker-0.6B-ONNX Closes qdrant#528 Closes qdrant#529 Related to qdrant#530

coderabbitai · 2026-02-13T19:19:44Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds Qwen3 support and last-token pooling: introduces PoolingType.LAST_TOKEN; implements last_token_pool utility handling left/right padding; makes tokenizer padding defensive and adds logging in preprocessor_utils; adds Qwen3TextEmbedding and Qwen3CrossEncoder (with workers, model registries, ONNX inference and last-token postprocessing); changes onnx_embed to normalize float16→float32 outputs; adds HuggingFace post-download verification for required files; and updates tests with canonical Qwen3 vectors and reranker scores.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

new: preserve embeddings in a type set by their model #492: Related edits to onnx_text_model handling of model outputs and dtype conversions.
Model: ModernVBERT/colmodernvbert #588: Related changes to tokenizer padding behavior and enable_padding usage in preprocessor_utils.py.

Suggested reviewers

joein
tbung
dancixx

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 76.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title accurately and concisely summarizes the main changes: adding support for Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B models.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing new models, implementation specifics, usage examples, and verification steps.
Linked Issues check	✅ Passed	The PR successfully addresses both linked issues: `#528` adds Qwen3 embedding model support, and `#529` implements last-token pooling for causal LM models with proper left/right padding handling.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with the PR objectives: new model implementations, utility functions, registry updates, and infrastructure fixes for model management and output handling.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

- Add logging warning when pad_token_id defaults to 0 - Hoist input_names computation out of per-text loop - Add dim parameter validation (1 <= dim <= max_dim) - Add batch_size warning when non-1 value is ignored - Add docstrings to all public/internal methods for coverage

- Register Q4F16 variants for Qwen3-Embedding and Qwen3-Reranker - Add float16-to-float32 cast after ONNX inference for Q4F16 outputs - Fix snapshot_download cache bug: verify model_file exists in cached snapshot before returning (prevents stale cache hit when multiple variants share the same HF repo)

Adds tests to ensure that left-padding and last-token pooling correctly handles batch inference without losing positional context for short strings.

n24q02m · 2026-02-20T14:26:28Z

Self-Review & Batching Proof

I've added two new tests: test_qwen3_left_padding_batch and test_qwen3_reranker_left_padding_batch.
Since Qwen3 is a Causal LM, it strictly requires Left Padding when running inference on batched inputs of varying lengths. The newly added tests successfully prove that my last_token_pool implementation correctly fetches the true last token index (ignoring the <pad> tokens), making batch processing completely safe.

(Note on ONNX Weights: I have uploaded the exported ONNX architectures for these models at n24q02m/Qwen3-Embedding-0.6B-ONNX and n24q02m/Qwen3-Reranker-0.6B-ONNX. Please feel free to fork them into the official qdrant/ HuggingFace org, and I will update the URL references here accordingly!)

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

tests/test_text_cross_encoder.py (1)

164-172: Remove the redundant in-function import numpy as np and use next(iter(...)) for the single-element read.

numpy is already imported at module scope (line 4), so the duplicate import at line 171 is unnecessary. Additionally, Ruff flags line 164 (RUF015): materialising a full list only to take [0] is wasteful — use next(iter(...)) instead (this also aligns with the fix above where .score is dropped).

♻️ Proposed refactor

-        single_result = list(model.rerank(query, [short_doc]))[0].score
+        single_result = next(iter(model.rerank(query, [short_doc])))
         
         # Infer short string mixed in a batch with a very long string
         batch_results = list(model.rerank(query, [long_doc, short_doc]))
-        batch_result_short = batch_results[1].score
+        batch_result_short = batch_results[1]
         
         # Ensure the score is exactly the same, proving causal LM logit selection is precise
-        import numpy as np
         assert np.allclose(single_result, batch_result_short, atol=1e-4)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_text_cross_encoder.py` around lines 164 - 172, Replace the
redundant in-function "import numpy as np" with the module-level numpy and avoid
materialising a list for the single-element result: use
next(iter(model.rerank(query, [short_doc]))) to get the single_result (and keep
accessing .score as done for batch_results[1].score), referencing model.rerank,
single_result, batch_results, short_doc and long_doc; remove the local import
and update the first assignment accordingly so the subsequent np.allclose
assertion uses the same values without creating an intermediate list.

tests/test_text_onnx_embeddings.py (1)

233-233: Redundant list() materialisation; prefer next(iter(...)).

list(model.embed([short_text]))[0] fully materialises the generator before indexing. Since only the first (and only) element is needed, use the more efficient form flagged by Ruff (RUF015):

♻️ Proposed fix

-        single_result = list(model.embed([short_text]))[0]
+        single_result = next(iter(model.embed([short_text])))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_text_onnx_embeddings.py` at line 233, The test currently
materialises the entire generator via list(model.embed([short_text]))[0], which
is inefficient; change the extraction to use next(iter(...)) so only the first
element is consumed. Locate the call to model.embed([...]) in the test (where
single_result and short_text are used) and replace the list(...) indexing with
next(iter(model.embed([short_text]))) to avoid full materialisation.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-168: The test is incorrectly accessing .score on values
returned by TextCrossEncoder.rerank / OnnxTextCrossEncoder.rerank which yield
raw floats; update the two uses in tests/test_text_cross_encoder.py (the
variables single_result and batch_result_short) to treat rerank results as
floats (e.g., assign the direct float from list(model.rerank(...))[0] and use
the second item directly from list(model.rerank(...))[1]) or convert to an array
with np.array(list(...)) before indexing — remove any .score attribute access so
the assertions operate on float values.

In `@tests/test_text_onnx_embeddings.py`:
- Line 240: Remove the redundant in-function import "import numpy as np" in
tests/test_text_onnx_embeddings.py (the local import inside the test) because
numpy is already imported at the module level; delete that line so the test uses
the top-level numpy name instead and avoid shadowing the existing np binding.
- Around line 225-241: The test is not exercising left-padding because
Qwen3TextEmbedding.embed forces batch_size=1 so sequences are tokenized
individually; fix by replacing the current integration-style check with a unit
test that directly verifies last_token_pool behavior: create synthetic hidden
states and an attention_mask representing left-padded and non-padded sequences,
call last_token_pool(hidden, attention_mask) and assert the pooled vectors match
the expected last-token vectors; refer to Qwen3TextEmbedding.embed (to avoid
relying on its batching behavior) and the last_token_pool function to locate
code to test.

---

Nitpick comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-172: Replace the redundant in-function "import numpy as np"
with the module-level numpy and avoid materialising a list for the
single-element result: use next(iter(model.rerank(query, [short_doc]))) to get
the single_result (and keep accessing .score as done for
batch_results[1].score), referencing model.rerank, single_result, batch_results,
short_doc and long_doc; remove the local import and update the first assignment
accordingly so the subsequent np.allclose assertion uses the same values without
creating an intermediate list.

In `@tests/test_text_onnx_embeddings.py`:
- Line 233: The test currently materialises the entire generator via
list(model.embed([short_text]))[0], which is inefficient; change the extraction
to use next(iter(...)) so only the first element is consumed. Locate the call to
model.embed([...]) in the test (where single_result and short_text are used) and
replace the list(...) indexing with next(iter(model.embed([short_text]))) to
avoid full materialisation.

tests/test_text_cross_encoder.py

tests/test_text_onnx_embeddings.py

coderabbitai · 2026-02-20T14:30:05Z

tests/test_text_onnx_embeddings.py

+        batch_result_short = batch_results[1]
+
+        # Ensure the vector is exactly the same, proving left-padding last-token pooling is precise
+        import numpy as np


⚠️ Potential issue | 🟡 Minor

Redundant in-function import — numpy already imported at module level.

import numpy as np on line 240 is shadowing the already-present top-level import on line 5. Remove it.

♻️ Proposed fix

- import numpy as np assert np.allclose(single_result, batch_result_short, atol=1e-4)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

import numpy as np

assert np.allclose(single_result, batch_result_short, atol=1e-4)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_text_onnx_embeddings.py` at line 240, Remove the redundant in-function import "import numpy as np" in tests/test_text_onnx_embeddings.py (the local import inside the test) because numpy is already imported at the module level; delete that line so the test uses the top-level numpy name instead and avoid shadowing the existing np binding.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 155-163: The test test_qwen3_reranker_left_padding_batch
unconditionally pulls Qwen/Qwen3-Reranker-0.6B; wrap it with the same CI guard
used elsewhere by calling should_test_model(model_name) (or using
pytest.mark.skipif with that predicate) and skip the test early when it returns
False so CI won't download the large model; locate the model_name variable and
the with model_cache(...) block in test_qwen3_reranker_left_padding_batch and
add the skip/guard before entering model_cache.

---

Duplicate comments:
In `@tests/test_text_cross_encoder.py`:
- Around line 164-169: The test is still accessing .score on a bare float from
model.rerank; update the code that assigns batch_result_short to use the float
directly (e.g., batch_result_short = batch_results[1]) instead of
batch_results[1].score so it matches the earlier fix to single_result =
next(iter(model.rerank(...))); ensure any subsequent assertions compare the
float values (batch_result_short) rather than expecting an object with a .score
attribute.

coderabbitai · 2026-02-20T14:50:04Z

tests/test_text_cross_encoder.py

+def test_qwen3_reranker_left_padding_batch(model_cache) -> None:
+    '''Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch.'''
+    model_name = "Qwen/Qwen3-Reranker-0.6B"
+    query = "Testing Qwen"
+    short_doc = "This is a short doc."
+    long_doc = "This is a significantly longer string that will force the shorter string to be padded with `<pad>` tokens on the left side during the tokenization phase. The embedding pooling must ignore these left padding tokens."
+
+    with model_cache(model_name) as model:
+        # Infer short string alone


⚠️ Potential issue | 🟡 Minor

No CI skip guard — will unconditionally download Qwen3-Reranker-0.6B on every CI run.

test_rerank uses should_test_model to gate large-model downloads; test_qwen3_reranker_left_padding_batch has no such guard and will attempt to download the full Qwen3-Reranker-0.6B model on every run, including lightweight CI pipelines.

💡 Proposed fix

def test_qwen3_reranker_left_padding_batch(model_cache) -> None: - '''Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch.''' + """Test to ensure Qwen3 causal logit cross encoder works reliably when left-padded in batch.""" model_name = "Qwen/Qwen3-Reranker-0.6B" + is_ci = os.getenv("CI") + is_manual = os.getenv("GITHUB_EVENT_NAME") == "workflow_dispatch" + + model_desc = next( + (m for m in TextCrossEncoder._list_supported_models() if m.model == model_name), None + ) + if model_desc is None or not should_test_model(model_desc, model_name, is_ci, is_manual): + pytest.skip(f"Skipping {model_name} — not selected for this run") + query = "Testing Qwen"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_text_cross_encoder.py` around lines 155 - 163, The test test_qwen3_reranker_left_padding_batch unconditionally pulls Qwen/Qwen3-Reranker-0.6B; wrap it with the same CI guard used elsewhere by calling should_test_model(model_name) (or using pytest.mark.skipif with that predicate) and skip the test early when it returns False so CI won't download the large model; locate the model_name variable and the with model_cache(...) block in test_qwen3_reranker_left_padding_batch and add the skip/guard before entering model_cache.

n24q02m · 2026-02-20T14:51:01Z

I've addressed the feedback from @coderabbitai.
Instead of doing an integration-level inference which forces batch_size=1 via the embedding loop generator, I have replaced the test with a direct unit test test_qwen3_left_padding_batch_unit() that injects a synthetic left-padded hidden_states tensor directly into the last_token_pool utility function. This proves beyond doubt that the padding extraction math is perfectly accurate for batched inputs.

coderabbitai · 2026-02-20T14:51:42Z

Tip

For best results, initiate chat on the files or code changes.

@n24q02m Thank you for addressing the feedback! The new unit test test_qwen3_left_padding_batch_unit() is a much cleaner approach that directly validates the last_token_pool logic with synthetic left-padded data. This provides clear, isolated proof that the padding extraction math handles batched inputs correctly. Great improvement! 🎉

If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

n24q02m added 3 commits February 14, 2026 09:36

test: add left-padding batching tests for Qwen3 Causal LM

00e932f

Adds tests to ensure that left-padding and last-token pooling correctly handles batch inference without losing positional context for short strings.

coderabbitai bot reviewed Feb 20, 2026

View reviewed changes

test: address CodeRabbit review feedback on left-padding tests

deaf664

n24q02m force-pushed the feat/qwen3-support branch from b4c8898 to deaf664 Compare February 20, 2026 14:46

coderabbitai bot reviewed Feb 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

feat: add Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B support#605

feat: add Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B support#605
n24q02m wants to merge 5 commits intoqdrant:mainfrom
n24q02m:feat/qwen3-support

n24q02m commented Feb 13, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Feb 13, 2026 •

edited

Loading

Reviews paused

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

n24q02m commented Feb 20, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Feb 20, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 20, 2026

Uh oh!

n24q02m commented Feb 20, 2026

Uh oh!

coderabbitai bot commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	import numpy as np
	assert np.allclose(single_result, batch_result_short, atol=1e-4)

Comments

Conversation

n24q02m commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's Added

Qwen3TextEmbedding (fastembed/text/qwen3_embedding.py)

Qwen3CrossEncoder (fastembed/rerank/cross_encoder/qwen3_cross_encoder.py)

Shared Changes

Models

Tests

ONNX Models

Usage

Verification

Checklist

Uh oh!

coderabbitai bot commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

n24q02m commented Feb 20, 2026

Self-Review & Batching Proof

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

n24q02m commented Feb 20, 2026

Uh oh!

coderabbitai bot commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

n24q02m commented Feb 13, 2026 •

edited

Loading

Qwen3TextEmbedding (`fastembed/text/qwen3_embedding.py`)

Qwen3CrossEncoder (`fastembed/rerank/cross_encoder/qwen3_cross_encoder.py`)

coderabbitai bot commented Feb 13, 2026 •

edited

Loading