[CPU] Enable BF16 dynamic quantization path for compressed FullyConnected #35726
Pull request overview
Extends the Intel CPU plugin’s compressed FullyConnected (weights decompression) path to allow BF16 activations to use the oneDNN dynamic quantization implementation (previously limited to F32), and adds functional coverage for the new BF16 scenario.
Changes:
- Enable BF16 as a supported activation type for compressed FullyConnected on x86_64.
- Extend dynamic-quantization eligibility checks in the oneDNN FC primitive to accept BF16 sources (with ISA gating).
- Add a BF16-specific MatMul-weights-decompression test fixture and instantiate new dyn-quant BF16 test cases.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/x64/matmul_weights_decompression.cpp | Adds BF16 dyn-quant test instantiation and a BF16-specific additional-config filter. |
| src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/classes/matmul_weights_decompression.hpp | Introduces a BF16-derived test class and a shared setup helper taking data precision. |
| src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/classes/matmul_weights_decompression.cpp | Refactors setup to parameterize network precision and adds BF16 test execution. |
| src/plugins/intel_cpu/src/nodes/fullyconnected.cpp | Enables BF16 in getSupportedCompressedActivationsTypes() for x86_64. |
| src/plugins/intel_cpu/src/nodes/executors/dnnl/dnnl_fullyconnected_primitive.cpp | Updates dynamic-quantization gating to allow BF16 sources with additional ISA checks. |
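The two source changes summarized above can be pictured with a minimal sketch. Every name here (`Precision`, `IsaCaps`, `supportedCompressedActivations`, `isDynQuantEligible`) is a hypothetical stand-in rather than the plugin's real code, the non-x86_64 branch is a placeholder, and the sketch assumes avx512_core_vnni is the ISA requirement while ignoring every other eligibility condition the real check may apply.

```cpp
#include <vector>

// Hypothetical stand-ins for illustration; not the plugin's real types.
enum class Precision { F32, BF16, F16 };

struct IsaCaps {
    bool avx512_core_vnni = false;  // assumed ISA requirement for the BF16 path
};

// Activation types accepted by the compressed FullyConnected path.
// Assumption: F32 alone on other architectures (placeholder), F32 + BF16 on x86_64.
inline std::vector<Precision> supportedCompressedActivations(bool isX86_64) {
    std::vector<Precision> types{Precision::F32};
    if (isX86_64) {
        types.push_back(Precision::BF16);
    }
    return types;
}

// Dynamic-quantization eligibility by source precision only.
// F32 qualifies as before; BF16 additionally needs the ISA check.
inline bool isDynQuantEligible(Precision src, const IsaCaps& caps) {
    switch (src) {
    case Precision::F32:
        return true;
    case Precision::BF16:
        return caps.avx512_core_vnni;
    default:
        return false;
    }
}
```

With these definitions, `isDynQuantEligible(Precision::BF16, caps)` returns true only when `caps.avx512_core_vnni` is set, while F32 sources stay eligible regardless of the ISA flag.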
rkazants
left a comment
@liubo-intel, @yuxu42, is this needed for Qwen3.5?
Hi @rkazants: This PR is mainly intended to let NVL benefit from the more efficient avx512_core_vnni instruction set on the BF16 dynamic-quant path. As far as I know, it is outside the scope of the Qwen3.5 enablement effort.
@liubo-intel, could you please create a dedicated oneDNN fork PR to facilitate the review process?
- Gate BF16 dyn-quant entry on AMX-capable HW (two layers: node-level getSupportedCompressedActivationsTypes + primitive-level useDynamicQuantizationImpl), since AMX BF16 TMUL handles long prompts (prefill) more efficiently than VNNI int8 dyn-quant.
- Drive the BF16 dyn-quant test through the inference_precision hint on an f32 IR; remove the MatmulWeightsDecompressionBF16 subclass and decompression_precisions_bf16.
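A rough sketch of the two-layer gate and the hint-driven precision described in this commit message follows. All names are hypothetical, and the sketch rests on two assumptions that the text does not spell out: that "gate on AMX-capable HW" means the BF16 dyn-quant path is skipped when AMX BF16 is available, and that the inference_precision hint turns an f32 IR's activations into BF16.

```cpp
// Hypothetical names throughout; a model of the gating layers, not plugin code.
enum class Precision { F32, BF16 };

struct CpuCaps {
    bool avx512_core_vnni = false;
    bool amx_bf16 = false;
};

// Assumption: a bf16 hint overrides the activation precision of an f32 IR.
inline Precision resolveActivationPrecision(Precision irPrecision, Precision hint) {
    return (irPrecision == Precision::F32 && hint == Precision::BF16) ? Precision::BF16
                                                                      : irPrecision;
}

// Shared predicate. Assumed reading of the gate: use the VNNI dyn-quant path
// for BF16 only when AMX BF16 is absent, since AMX is preferred for prefill.
inline bool bf16DynQuantAllowed(const CpuCaps& caps) {
    return caps.avx512_core_vnni && !caps.amx_bf16;
}

// Layer 1 (node level): is this activation type accepted for compressed FC?
inline bool nodeAcceptsCompressedActivation(Precision act, const CpuCaps& caps) {
    return act == Precision::F32 || bf16DynQuantAllowed(caps);
}

// Layer 2 (primitive level): does the primitive pick the dyn-quant implementation?
inline bool primitiveUsesDynQuant(Precision src, const CpuCaps& caps) {
    return src == Precision::F32 || bf16DynQuantAllowed(caps);
}
```

Routing both layers through one predicate is a design choice of this sketch: it keeps the node-level and primitive-level decisions from disagreeing on the same hardware.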
54f8ed1 to ebdc717
oneDNN fork PR: openvinotoolkit/oneDNN#310
maxnick
left a comment
In general LGTM. Please apply the comment in the corresponding oneDNN PR.
Details:
oneDNN fork PR: openvinotoolkit/oneDNN#310
Tickets: