[VLM] Optimize video frame preprocessing for LLaVA-NeXT-Video-7B on GPU #3097
base: master
Conversation
Pull request overview
This PR optimizes video frame preprocessing for the LLaVA-NeXT-Video-7B model by implementing GPU-accelerated preprocessing using OpenVINO operations instead of CPU-based preprocessing. The change provides significant performance improvements, reducing first token latency from ~15s (CPU) to ~845ms (GPU with OV preprocessing).
Key changes:
- Added OpenVINO-based preprocessing model that performs resize, crop, and normalization on GPU
- Implemented environment variable control to switch between CPU and GPU preprocessing
- Refactored preprocessing logic to support both CPU (preprocess_frames_cpp) and GPU (preprocess_frames_ov) paths; a minimal sketch of the switch follows this list
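A minimal sketch of that environment-variable switch, assuming the naming from the final revision in this thread (the last iteration reads `VISION_PREPROCESS` and defaults to the OV path; earlier iterations used `VIDEO_PREPROCESS`):

```cpp
#include <cstdlib>
#include <string>

// Sketch only: the OV (GPU) path is the default unless the env var is set to "CPP".
bool can_use_ov_preprocess() {
    const char* env = std::getenv("VISION_PREPROCESS");
    return !(env && std::string(env) == "CPP");
}

// Illustrative call site; the method names match the two paths listed above.
// ov::Tensor pixel_values = can_use_ov_preprocess()
//     ? preprocess_frames_ov(frames)   // GPU: resize/crop/normalize inside the OV graph
//     : preprocess_frames_cpp(frames); // CPU: original preprocessing path
```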
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/llava_next_video/classes.hpp | Added new methods for CPU and GPU preprocessing, added preprocessing model infrastructure and use flag |
| src/cpp/src/visual_language/llava_next_video/classes.cpp | Implemented OpenVINO preprocessing model creation and GPU-accelerated frame preprocessing logic |
Force-pushed from 8037bec to 89aa9b7
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Force-pushed from 89aa9b7 to c6480b2
Force-pushed from c6480b2 to 212b086
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Force-pushed from b6c840a to ae8bcbd
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
    return sliced;
}

std::shared_ptr<ov::Model> create_video_preprocess_model(const ProcessorConfig& config) {
Can we apply the same OV preprocessing to images as well (in parent class)?
Preprocessing for images and videos is similar, but the normalization formula and the functions used differ and need verification. Once confirmed, they will likely be applied consistently. Currently the focus is on video frame preprocessing, so after validation this can be addressed in a follow-up PR.
Force-pushed from ae8bcbd to c99b28d
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Force-pushed from 082ab54 to 8862644
@yatarkan Could you review this PR again?
ov::Shape concat_shape = preprocessed_frames[0].get_shape();
concat_shape[0] = preprocessed_frames.size();
ov::Tensor concatenated_frames = ov::Tensor(preprocessed_frames[0].get_element_type(), concat_shape);

float* frames_data = concatenated_frames.data<float>();
for (size_t i = 0; i < preprocessed_frames.size(); i++) {
    memcpy(frames_data, preprocessed_frames[i].data(), preprocessed_frames[i].get_byte_size());
    frames_data += ov::shape_size(preprocessed_frames[i].get_shape());
}
Does it make sense to move tensor concatenation to preprocess_frames so it can also be optimized with OV processing?
I moved tensor concatenation into each preprocess_frames function, and for OV processing I modified it to handle the concatenated frames as a single batch.
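For illustration, a hedged sketch of what stacking decoded frames into a single batched tensor for the OV path might look like (the shapes and the helper name are assumptions, not the PR's exact code):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include "openvino/runtime/tensor.hpp"

// Stack N decoded frames into one batched tensor so the OV preprocessing graph can
// handle the whole clip in a single inference instead of a per-frame loop.
// Assumes every frame tensor has the same NHWC shape and u8 element type.
ov::Tensor batch_frames(const std::vector<ov::Tensor>& frames) {
    ov::Shape batched_shape = frames.at(0).get_shape();  // per-frame shape, e.g. {1, H, W, 3}
    batched_shape[0] = frames.size();                    // -> {N, H, W, 3}
    ov::Tensor batched(frames.at(0).get_element_type(), batched_shape);
    auto* dst = static_cast<uint8_t*>(batched.data());
    for (const auto& frame : frames) {
        std::memcpy(dst, frame.data(), frame.get_byte_size());
        dst += frame.get_byte_size();
    }
    return batched;
}
```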
auto preprocess_model = create_video_preprocess_model(m_processor_config);
auto compiled_preprocess = utils::singleton_core().compile_model(preprocess_model, device, properties);
m_ireq_queue_preprocess = std::make_unique<CircularBufferQueue<ov::InferRequest>>(
    compiled_preprocess.get_property(ov::optimal_number_of_infer_requests),
    [&compiled_preprocess]() -> ov::InferRequest {
        return compiled_preprocess.create_infer_request();
    });
In similar PRs we had a pattern where preprocessing was patched into the vision encoder model (see example).
I would follow the same approach if it does not conflict or interfere with image input preprocessing.
Following the example's approach, I removed the dedicated preprocessing model and updated the code to patch the preprocessing pipeline directly into the vision encoder.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Force-pushed from aac54d3 to e493f3a
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
@yatarkan I've applied the final comments and CI has passed. Could you review the PR once more?
# Disable OV preprocessing for video in tests to avoid input parameter conflicts
# The integrated preprocessing model changes the vision encoder inputs from a single
# 'pixel_values' parameter to multiple parameters (video_frames, resize_target_size,
# crop_height, crop_width), which conflicts with image encoding that still expects
# the original 'pixel_values' input
os.environ["VIDEO_PREPROCESS"] = "CPP"
Since the default is OV preprocessing (not CPP), users will face issues when running the model with image input. This breaks expected functionality and behavior.
As noted in #3097 (comment), patching the vision encoder model makes sense if there are no conflicts with image processing, but there are. So let's not follow the "vision encoder patching" approach until preprocessing is used and aligned for both images and videos.
The implementation has been updated so that preprocessing is aligned for both images and videos. The WWB results for image input will be added to the description.
std::vector<ov::genai::EncodedVideo> encoded_videos;
for (const auto video: videos) {
    std::vector<ov::Tensor> frames = to_single_image_tensors({video});
    auto vision_encoder = std::static_pointer_cast<VisionEncoderLLaVANextVideo>(m_vision_encoder);
Should be moved outside the for-loop?
moved
    memcpy(frames_data, prepprocessed_frames[i].data(), prepprocessed_frames[i].get_byte_size());
    frames_data += ov::shape_size(prepprocessed_frames[i].get_shape());
}
auto config = vision_encoder->get_processor_config();
Should be moved outside the for-loop?
moved
std::vector<ov::genai::EncodedVideo> InputsEmbedderLLaVANextVideo::encode_videos(const std::vector<ov::Tensor>& videos) {
    std::vector<ov::genai::EncodedVideo> encoded_videos;
    for (const auto video: videos) {
Actually I agree with some of the Copilot comments/suggestions even though they were marked as resolved.
Taking into account that patching the vision encoder conflicts with image preprocessing, I suggest the following flow:
for (const auto video: videos) {
    ImageSize original_size = ...;
    ImageSize target_size = get_resize_target_size(original_size, config); // utility function, to be reused in preprocess_clip_image_llava_next_video(...)
    size_t num_frames = video.get_shape().at(0);
    ov::Tensor pixel_values;
    size_t num_video_tokens;
    if (vision_encoder->get_use_ov_preprocess()) {
        // We don't need here to split video tensor into vector of single frames, ov_video_preprocess_model will handle batched frames
        ov::Tensor target_size_tensor(ov::element::i64, {2});
        target_size_tensor.data<int64_t>()[0] = target_size.height;
        target_size_tensor.data<int64_t>()[1] = target_size.width;
        ov::Tensor crop_size_tensor(ov::element::i64, {2});
        crop_size_tensor.data<int64_t>()[0] = config.crop_size_height;
        crop_size_tensor.data<int64_t>()[1] = config.crop_size_width;
        // Pass video, target size, crop size to ov_video_preprocess_model -> pixel_values for vision_encoder model
        ov_video_preprocess_model.set_input_tensor(0, video);
        ov_video_preprocess_model.set_input_tensor(1, target_size_tensor);
        ov_video_preprocess_model.set_input_tensor(2, crop_size_tensor); // both crop_height and crop_width as for target_size_tensor input
        ov_video_preprocess_model.infer();
        pixel_values = ov_video_preprocess_model.get_output_tensor();
        num_video_tokens = vision_encoder->get_num_video_tokens(target_size, num_frames); // utility method, to be reused in preprocess_frames(...)
    } else {
        // Follow original CPP flow
        std::vector<ov::Tensor> frames = to_single_image_tensors({video});
        // Preprocess and concatenate preprocessed frames to single tensor -> pixel_values for vision_encoder model
        pixel_values = vision_encoder->preprocess_frames(frames); // concatenate preprocessed frames to single tensor inside
        num_video_tokens = vision_encoder->get_num_video_tokens(target_size, num_frames);
    }
    vision_encoder_infer_request.set_tensor("pixel_values", pixel_values);
    // ...
}
I applied changes to support preprocessing for both video and image, refactored the code by extracting helper functions for the duplicated logic, and cleaned up the overall code.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/cpp/src/visual_language/llava_next_video/classes.cpp:415
The variable searched_pos is declared but never used in the function. This appears to be dead code that should be removed.
size_t searched_pos = 0;
Force-pushed from 2229b5f to 46e1de9
size_t orig_height,
size_t orig_width,
Why not use the ov::genai::ImageSize struct instead, e.g. replace
    size_t orig_height,
    size_t orig_width,
with
    ImageSize original_size,
And for the return value:
std::pair<int64_t, int64_t> calculate_resize_dimensions(
    size_t orig_height,
    size_t orig_width,
    int target_shortest_edge) {
config.size_shortest_edge has type size_t, while the parameter is declared as int.
size_t orig_height,
size_t orig_width,
Let's use ov::genai::ImageSize struct, here and in other places where possible
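A hedged sketch of what the helper could look like after adopting ov::genai::ImageSize for both the argument and the return value; the body assumes the usual shortest-edge scaling and is illustrative, not the PR's exact implementation:

```cpp
#include <algorithm>
#include <cmath>

// Resize so that the shortest edge matches target_shortest_edge while keeping the aspect ratio.
ov::genai::ImageSize calculate_resize_dimensions(const ov::genai::ImageSize& original_size,
                                                 size_t target_shortest_edge) {
    const size_t shortest = std::min(original_size.height, original_size.width);
    const double scale = static_cast<double>(target_shortest_edge) / static_cast<double>(shortest);
    return ov::genai::ImageSize{
        static_cast<size_t>(std::round(original_size.height * scale)),
        static_cast<size_t>(std::round(original_size.width * scale))};
}
```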
size_t num_video_tokens = ((config.crop_size_height / m_patch_size) *
                           (config.crop_size_width / m_patch_size) / 4) * num_frames;
Seems that num_video_tokens can be calculated outside of the preprocess_frames_ov/preprocess_frames_cpp methods, so this would remove the duplication.
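A sketch of the shared helper this comment implies; the formula is taken from the hunk above, while the exact signature and placement are assumptions (the division by 4 presumably reflects the 2×2 pooling of video frame features):

```cpp
// Tokens per clip: (patches per cropped frame / 4) * number of frames.
size_t get_num_video_tokens(const ProcessorConfig& config, size_t patch_size, size_t num_frames) {
    return ((config.crop_size_height / patch_size) *
            (config.crop_size_width / patch_size) / 4) * num_frames;
}
```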
auto [concatenated_frames, num_video_tokens] = vision_encoder->get_use_ov_preprocess()
    ? vision_encoder->preprocess_frames_ov(frames)
    : vision_encoder->preprocess_frames_cpp(frames);
Seems that preprocess_frames_ov is not really needed. If we calculate num_video_tokens outside the preprocess_frames_* methods, frame concatenation is applicable only to CPP processing.
For OV processing we can pass the video tensor to set_preprocess_parameters below instead of concatenated_frames.
}

bool can_use_ov_preprocess() {
    const char* env = std::getenv("VISION_PREPROCESS");
Could you please align the env var name with the other models (qwen2vl, phi3_vision)?
Actually I find your name more suitable, as it relates to both image and video inputs, so I would prefer replacing IMAGE_PREPROCESS with VISION_PREPROCESS.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
namespace ov::genai {

namespace {
Copilot AI · Dec 22, 2025
The create_bicubic_resize function lacks documentation explaining its purpose, parameters, and the significance of the bicubic resize configuration. Add a docstring explaining that this creates a bicubic resize operation for NHWC format inputs, the meaning of the cube_coeff value (-0.5 for Catmull-Rom), and why ASYMMETRIC coordinate transformation mode is used.
// Creates a bicubic resize operation for NHWC-formatted inputs.
//
// Parameters:
//  - input: Input tensor in NHWC layout (N: batch, H: height, W: width, C: channels).
//  - target_size: 1D tensor with two elements: [new_height, new_width] used with
//    ShapeCalcMode::SIZES to define the output spatial size.
//
// The interpolation is configured to:
//  - Operate on the spatial axes [1, 2] corresponding to H and W in NHWC.
//  - Use cubic interpolation with cube_coeff = -0.5f, which corresponds to the
//    Catmull-Rom bicubic kernel (a = -0.5) and is chosen to match CPU preprocessing.
//  - Use CoordinateTransformMode::ASYMMETRIC so that source and target coordinates
//    are mapped without half-pixel offsets, aligning with common preprocessing
//    behavior in vision models and OpenVINO-based CPU pipelines.
    return std::make_shared<v11::Interpolate>(input_f32, target_size, axes, attrs);
}
Copilot AI · Dec 22, 2025
The create_mean_scale function lacks documentation explaining the normalization formula being implemented and why the conversion logic differs from the context example. Add a docstring explaining the per-channel normalization formula: (x/255.0 - mean) / std.
/**
 * Builds an OpenVINO subgraph that applies per-channel image normalization.
 *
 * The implemented formula matches the original mean_scale() preprocessing logic:
 *
 *   y[c] = ( x[c] / 255.0f - image_mean[c] ) / image_std[c]
 *
 * where:
 *  - x is the input pixel value in the range [0, 255] when provided as uint8,
 *  - image_mean[c] and image_std[c] are channel-wise mean and std values taken
 *    from ProcessorConfig::image_mean and ProcessorConfig::image_std,
 *  - the operation is performed per channel c with broadcasting for NHWC
 *    tensors using constants of shape [1, 1, 1, 3].
 *
 * Unlike some context examples that always start from uint8 tensors, this helper
 * accepts either u8 or f32 input:
 *  - if the input is u8, it is first converted to f32 to faithfully reproduce
 *    the original mean_scale() behavior: float(x) / 255.0f;
 *  - if the input is already f32 (e.g., pre-scaled elsewhere), it is used
 *    directly to avoid redundant conversions while still applying the same
 *    (x/255.0 - mean) / std normalization formula via OV ops.
 */
    return result;
}
Copilot AI · Dec 22, 2025
The create_channels_first function lacks documentation explaining the transpose operation. Add a docstring indicating this converts from NHWC to NCHW layout.
/// Creates a transpose node that converts an input tensor from NHWC to NCHW layout.
    auto transpose_order = v0::Constant::create(ov::element::i64, ov::Shape{4}, std::vector<int64_t>{0, 3, 1, 2});
    return std::make_shared<v1::Transpose>(input_nhwc, transpose_order);
}
Copilot AI · Dec 22, 2025
The create_center_crop function lacks documentation explaining its purpose and the crop calculation logic. Add a docstring explaining that this performs center cropping by calculating start positions as (dimension - crop_size) / 2.
/**
 * Perform a center crop on the spatial dimensions of an NHWC input tensor.
 *
 * The requested crop size (height, width) is taken from {@code crop_size}, and
 * the crop region is positioned at the center of the input by computing the
 * starting coordinates as:
 *   start_y = (H - crop_height) / 2
 *   start_x = (W - crop_width) / 2
 * where H and W are the input tensor height and width. The function then
 * slices the input tensor using these start positions and the given crop size.
 */
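A hedged sketch of such a center crop expressed with OV ops, assuming NHWC input and a 1D i64 crop_size input of [crop_height, crop_width]; the start positions are computed from the runtime shape so the crop also works for dynamic input sizes (illustrative, not necessarily the PR's exact code):

```cpp
#include <memory>
#include <vector>
#include "openvino/op/add.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/divide.hpp"
#include "openvino/op/gather.hpp"
#include "openvino/op/shape_of.hpp"
#include "openvino/op/slice.hpp"
#include "openvino/op/subtract.hpp"

std::shared_ptr<ov::Node> create_center_crop(const ov::Output<ov::Node>& input_nhwc,
                                             const ov::Output<ov::Node>& crop_size) {
    // Current spatial size [H, W] taken from the runtime shape of the input (axes 1 and 2).
    auto shape = std::make_shared<ov::op::v3::ShapeOf>(input_nhwc, ov::element::i64);
    auto hw_idx = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{2}, std::vector<int64_t>{1, 2});
    auto axis0 = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{}, std::vector<int64_t>{0});
    auto hw = std::make_shared<ov::op::v8::Gather>(shape, hw_idx, axis0);
    // start = (HW - crop) / 2, stop = start + crop
    auto two = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, std::vector<int64_t>{2});
    auto diff = std::make_shared<ov::op::v1::Subtract>(hw, crop_size);
    auto start = std::make_shared<ov::op::v1::Divide>(diff, two);
    auto stop = std::make_shared<ov::op::v1::Add>(start, crop_size);
    auto step = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{2}, std::vector<int64_t>{1, 1});
    auto crop_axes = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{2}, std::vector<int64_t>{1, 2});
    return std::make_shared<ov::op::v8::Slice>(input_nhwc, start, stop, step, crop_axes);
}
```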
    const char* env = std::getenv("VISION_PREPROCESS");
    return !(env && std::string(env) == "CPP");
}
Copilot AI · Dec 22, 2025
The patch_preprocess_into_vision_encoder_model function lacks documentation explaining its purpose and the preprocessing pipeline it creates. Add a docstring describing that this integrates bicubic resize, center crop, normalization, and channel transpose operations into the vision encoder model.
/**
 * @brief Integrates a preprocessing pipeline into a vision encoder model.
 *
 * This function wraps the provided @p vision_encoder_model with an OpenVINO subgraph
 * that performs the image preprocessing typically done on the CPU. The new model
 * exposes three inputs:
 *  - input_frames: concatenated image/video frames in NHWC uint8 format
 *  - resize_target_size: target spatial size [height, width] for bicubic resize
 *  - crop_size: center crop size [height, width]
 *
 * The injected preprocessing pipeline consists of:
 *  1. Bicubic resize (Interpolate with CUBIC mode) to @p resize_target_size.
 *  2. Center crop to @p crop_size.
 *  3. Per-channel normalization using mean/scale parameters from @p config.
 *  4. Channel transpose from NHWC to NCHW (channels-first layout).
 *
 * The output of this pipeline is connected to the original encoder's first input
 * (typically "pixel_values"), so that the returned model directly accepts raw
 * uint8 frames and produces the same outputs as the original encoder.
 *
 * @param vision_encoder_model Original vision encoder model to be patched.
 * @param config Processor configuration providing normalization parameters and sizes.
 * @return A new model with preprocessing integrated into the encoder graph.
 */
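To make the described patching concrete, here is a hedged sketch of wiring the preprocessing subgraph in front of the encoder's pixel_values input, reusing the helpers discussed earlier; parameter handling and names are assumptions, and the PR's actual implementation may differ:

```cpp
#include <memory>
#include "openvino/core/model.hpp"
#include "openvino/op/parameter.hpp"

std::shared_ptr<ov::Model> patch_preprocess_into_vision_encoder_model(
        const std::shared_ptr<ov::Model>& vision_encoder_model,
        const ProcessorConfig& config) {
    // New inputs replacing "pixel_values": raw u8 NHWC frames plus resize/crop sizes.
    const auto dyn = ov::Dimension::dynamic();
    auto input_frames = std::make_shared<ov::op::v0::Parameter>(ov::element::u8,
                                                                ov::PartialShape{dyn, dyn, dyn, 3});
    auto resize_target_size = std::make_shared<ov::op::v0::Parameter>(ov::element::i64, ov::Shape{2});
    auto crop_size = std::make_shared<ov::op::v0::Parameter>(ov::element::i64, ov::Shape{2});

    // Preprocessing pipeline: bicubic resize -> center crop -> normalization -> NHWC->NCHW.
    auto resized = create_bicubic_resize(input_frames, resize_target_size);
    auto cropped = create_center_crop(resized, crop_size);
    auto normalized = create_mean_scale(cropped, config);
    auto nchw = create_channels_first(normalized);

    // Rewire consumers of the original pixel_values parameter (assumed to be input 0)
    // to the output of the preprocessing subgraph, then swap the model's parameters.
    auto pixel_values = vision_encoder_model->get_parameters().at(0);
    for (auto& consumer : pixel_values->output(0).get_target_inputs()) {
        consumer.replace_source_output(nchw->output(0));
    }
    vision_encoder_model->remove_parameter({pixel_values});
    vision_encoder_model->add_parameters({input_frames, resize_target_size, crop_size});
    vision_encoder_model->validate_nodes_and_infer_types();
    return vision_encoder_model;
}
```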
ov::Tensor image_newline;
size_t searched_pos = 0;
std::vector<ov::Tensor> image_embeds;
Copilot AI · Dec 22, 2025
The variable image_newline is declared but never used in the get_inputs_embeds function. Remove this unused variable declaration.
Force-pushed from 922fe4f to c6dcf71
Signed-off-by: Andrew Park <[email protected]>
Description
Optimize video frame preprocessing for the LLaVA-NeXT-Video-7B model on GPU by creating an OpenVINO preprocessing model that moves preprocessing operations from CPU to GPU.
Ticket: CVS-177558
Average 1st token latency (1280x720 5s video (32 frames) + 100 input tokens -> generate 128 tokens)
WWB results with video input (--model-type visual-video-text):
WWB results with image input (--model-type visual-text):
Checklist: