
Conversation

@mengweiguo (Contributor) commented Dec 1, 2025

Description

The benefits of prefill-chunk handling in NPUW:

  • Support long context
  • Performance improvement

Note:

  1. NPUW PR: [NPUW] Support prefill-chunk for text-embedding model openvino#33076
  2. The tests have been verified to work with both the NPUW and GenAI updates.

CVS-177453

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings December 1, 2025 07:21
@github-actions github-actions bot added the category: llm_bench, no-match-files, and category: RAG labels Dec 1, 2025
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds support for NPUW (Neural Processing Unit Workload) optimization for text embedding models, enabling long context support and performance improvements through prefill-chunk handling.

Key changes:

  • Added NPU-specific compilation path for text embedding models with dynamic shapes
  • Introduced new configuration parameter emb_pad_to_max_length to control padding behavior (see the sketch after this list)
  • Refactored NPU compilation logic to support both LLM and text embedding model types
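
As a rough illustration of the second item above, the new option would plausibly surface in tools/llm_bench/benchmark.py as a command-line flag along these lines. The flag name and help text are assumptions inferred from the parameter name, not confirmed by this thread:

import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag; check tools/llm_bench/benchmark.py for the exact
# spelling and semantics actually added by this PR.
parser.add_argument(
    "--emb_pad_to_max_length",
    action="store_true",
    help="Pad embedding inputs to the maximum length "
         "(useful for static-shape NPU compilation).",
)
args = parser.parse_args()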

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File summary:

  • tools/llm_bench/llm_bench_utils/ov_utils.py — Fixed parameter name and moved padding configuration to support the new padding control
  • tools/llm_bench/llm_bench_utils/model_utils.py — Added mapping for the new emb_pad_to_max_length parameter
  • tools/llm_bench/benchmark.py — Added command-line argument for embedding padding control
  • src/cpp/src/utils.hpp — Declared new function for NPU text embedding compilation
  • src/cpp/src/utils.cpp — Refactored NPU compilation logic and added text embedding-specific configuration
  • src/cpp/src/rag/text_embedding_pipeline.cpp — Implemented NPU compilation path for dynamic text embedding models


Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.



@mengweiguo mengweiguo force-pushed the qwen3-embedding-pr branch 2 times, most recently from 1b7e667 to 3120bdf, December 2, 2025 03:16
Copilot AI review requested due to automatic review settings December 2, 2025 03:16
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.



@mengweiguo mengweiguo force-pushed the qwen3-embedding-pr branch 2 times, most recently from f713558 to a591398, December 2, 2025 10:14
Copilot AI review requested due to automatic review settings December 2, 2025 10:14
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.



@mengweiguo mengweiguo changed the title from "Support NPUW for text-embedding models" to "[NPU] Support NPUW for text-embedding models" Dec 3, 2025
@mengweiguo mengweiguo marked this pull request as ready for review December 3, 2025 05:47
Copilot AI review requested due to automatic review settings December 3, 2025 05:47
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.



@Wovchena Wovchena requested a review from as-suvorov December 3, 2025 07:20
@as-suvorov (Collaborator)

@mengweiguo You applied formatting to the updated files. We have several PRs in flight to enable formatting, and I'm a bit concerned that your changes may not match our formatting rules.
If you want to apply formatting to the updated files, please align your PR with the pre-commit rules in #3008. Otherwise, please revert the formatting changes.

@mengweiguo (Contributor, Author)

Otherwise please revert formatting changes.

Reverted. Thanks.

pooling_type = kwargs.get("emb_pooling_type")
max_length = kwargs.get("emb_max_length")
-padding_side = kwargs.get("embedding_padding_side")
+padding_side = kwargs.get("emb_padding_side")
Collaborator:

@sbalandi the argument names were not aligned, so the option was lost. Consider reviewing the arguments and potentially introducing types.
@mengweiguo thanks!
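
One way to act on the suggestion above is to collect the emb_* options into a typed config, so a mismatched kwarg name fails loudly instead of silently dropping an option. A minimal sketch, not the code merged in this PR (the pad_to_max_length default is an assumption):

from dataclasses import dataclass
from typing import Optional

@dataclass
class EmbeddingOptions:
    # Field names mirror the emb_* kwargs quoted above.
    pooling_type: Optional[str] = None
    max_length: Optional[int] = None
    padding_side: Optional[str] = None
    pad_to_max_length: bool = False

def parse_embedding_options(kwargs: dict) -> EmbeddingOptions:
    return EmbeddingOptions(
        pooling_type=kwargs.get("emb_pooling_type"),
        max_length=kwargs.get("emb_max_length"),
        padding_side=kwargs.get("emb_padding_side"),
        pad_to_max_length=kwargs.get("emb_pad_to_max_length", False),
    )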

@as-suvorov (Collaborator)

@mengweiguo we also need tests for NPUW. Could you please implement a simple test at: https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/test_rag.py

@as-suvorov as-suvorov self-requested a review December 4, 2025 13:55
@mengweiguo (Contributor, Author)

could you please implement simple test

Sure, will do that.

Copilot AI review requested due to automatic review settings December 5, 2025 06:26
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.



Copilot AI review requested due to automatic review settings December 7, 2025 13:06
@github-actions github-actions bot added the category: GGUF label Dec 7, 2025
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.



@mengweiguo (Contributor, Author)

Tests on the NPU have been added. Because the outputs from the NPU and CPU do not match exactly, I increased the threshold for the NPU tests.

@pytest.mark.xfail(condition=(sys.platform == "darwin"), reason="Ticket - 174635")
def test_qwen3_embedding_npu(emb_model, config):
    # Use the entire TEXT_DATASET as a single document instead of chunks
    flat_documents = [TEXT_DATASET]
Collaborator:

Should we test multibatch as well?

@mengweiguo (Contributor, Author):

NPU doesn't support multi-batch.

@as-suvorov (Collaborator) commented Dec 8, 2025:

Is it an NPUW post-processing limitation?
We have a test for NPUW with CPU fallback which worked with multibatch: https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/test_rag.py#L504

Collaborator:

I also forgot to mention: CI has no real NPU device, so we have to use CPU fallback.

@mengweiguo (Contributor, Author):

  1. The batch is unrolled or detached before going into NPUW. I also need a large chunk size (>200) to enable prefill-chunk in NPUW, so I take the entire prompt as a single input.
  2. Got it. I will switch to CPU fallback. Thanks.

Collaborator:

  1. I didn't get this. The TextEmbeddingPipeline public API supports passing multiple documents to the embed method, and these documents can be of any length. What would happen if several small documents were passed to the embedding pipeline on the NPU device?

@mengweiguo (Contributor, Author):

The NPU plugin will try to unroll or detach the batch: during inference, we iterate over the documents and feed them to the NPU one by one, so this is not true batch processing.
The documents can be of any length; I just need a long chunk to cover my test case.
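
In other words, something like the following happens conceptually. The pipeline type and embed_documents method follow the openvino_genai TextEmbeddingPipeline API; treat this as an illustrative sketch, not the plugin's actual code:

def embed_documents_unrolled(pipeline, documents):
    # The "batch" is unrolled: one NPU inference per document rather
    # than a single batched tensor.
    embeddings = []
    for doc in documents:
        embeddings.extend(pipeline.embed_documents([doc]))
    return embeddings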

@mengweiguo (Contributor, Author):

Changed. I introduced a threshold parameter to make the tests adaptable to different scenarios.
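
A minimal sketch of what such a threshold-based check could look like, assuming NPU and CPU embeddings are compared by cosine similarity (function and parameter names here are illustrative, not necessarily those used in test_rag.py):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assert_embeddings_close(npu_embeddings, cpu_embeddings, threshold: float = 0.99):
    # NPU and CPU outputs are close but not bit-exact, so compare
    # against a configurable similarity threshold instead of equality.
    for npu_vec, cpu_vec in zip(npu_embeddings, cpu_embeddings):
        sim = cosine_similarity(np.asarray(npu_vec), np.asarray(cpu_vec))
        assert sim >= threshold, f"similarity {sim} below threshold {threshold}"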

Copilot AI review requested due to automatic review settings December 15, 2025 02:17
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.



@mengweiguo (Contributor, Author)

@as-suvorov hi, sorry to interrupt you. I have an update on post-processing.

  1. The post-processing operations cannot be appended to the embedding model when it enters NPUW. Previously, I duplicated the post-processing code within NPUW, but that approach is difficult to maintain and makes extending post-processing types in the future challenging.
  2. To address this, I've created a separate post-processing model to handle these operations in GenAI. We have the flow below:
input Token -> Embedding model -> output Token  -> post model -> final output 

With this change, post-processing is now centralized in a single location. Could you please review it once more? Thanks!
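
For intuition, the post model conceptually computes something like the following, shown here in numpy terms. The actual implementation builds these operations as a separate OpenVINO model in C++, and the pooling variant depends on the configured pooling type; masked mean pooling below is one plausible example:

import numpy as np

def post_process(hidden_states: np.ndarray,   # [1, max_prompt_size, hidden_dim]
                 attention_mask: np.ndarray,  # [1, max_prompt_size]
                 normalize: bool = True) -> np.ndarray:
    # Masked mean pooling: padded slots contribute nothing because
    # their mask entries are zero.
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    pooled = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)
    if normalize:
        pooled = pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
    return pooled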

auto output_shape = output_node.get_partial_shape();
auto input_param = std::make_shared<ov::op::v0::Parameter>(
    output_node.get_element_type(),
    ov::PartialShape{1, max_prompt_size, output_shape[2]});
set_node_name(input_param, "input_ids");
Collaborator:

Suggested change:
-set_node_name(input_param, "input_ids");
+set_node_name(input_param, "embedding_hidden_state");

std::copy_n(m_attention_mask.data<int64_t>(),
            m_attention_mask.get_size(),
            attention_mask_tensor.data<int64_t>());
if (m_attention_mask.get_size() < attention_mask_tensor.get_size()) {
Collaborator:

When is this the case?

@mengweiguo (Contributor, Author) commented Dec 17, 2025:

NPUW can support long context with prefill-chunk.
For example, an 8K-token input may not work on the NPU, so we split the 8K input into eight chunks of 1K each and reshape the model to a 1K input; eight iterations then complete the inference.
For the case of max_length = 8K and a real input of 7.5K, the embedding output from NPUW is 8K and the input to the post model is also 8K (attention_mask_tensor = 8K), while the real mask is only 7.5K. So we need to fill the dirty slots with 0.

This also explains why the post ops can't be appended to the embedding model: the runtime model does not match the original static model.

I think the post model does not affect asynchronous inference for the embedding model. The post model currently uses synchronous inference because it is small enough.
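
A rough sketch of the prefill-chunk flow described above, with the mask tail zero-filled so padded slots do not leak into post-processing. Here run_chunk stands in for the per-chunk NPU inference; all names are illustrative:

import numpy as np

def chunked_prefill(run_chunk, input_ids: np.ndarray, chunk_size: int = 1024):
    # Pad the prompt to a multiple of the chunk size, e.g. 7.5K real
    # tokens padded to 8K for eight 1K-chunk iterations.
    pad = (-len(input_ids)) % chunk_size
    padded_ids = np.pad(input_ids, (0, pad))
    mask = np.zeros(len(padded_ids), dtype=np.int64)
    mask[:len(input_ids)] = 1  # real tokens get 1; the "dirty" tail stays 0
    hidden_chunks = []
    for start in range(0, len(padded_ids), chunk_size):
        sl = slice(start, start + chunk_size)
        hidden_chunks.append(run_chunk(padded_ids[sl], mask[sl]))
    # Full-length hidden states plus the mask the post model needs.
    return np.concatenate(hidden_chunks, axis=0), mask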

@as-suvorov (Collaborator)

as-suvorov commented Dec 17, 2025

I have an update on post-processing. […] Could you please review it once more? Thanks!

  1. Why can't the post-processing operations be appended for NPUW? This two-model split can make things a bit more complex for future features; we might want to use an async infer queue for the embedding pipeline.
  2. Nitpick: there are no tokens after the embedding model: input Token -> Embedding model -> embeddings -> post model -> final output
