Conversation

@peterchen-intel (Collaborator)

Description

CVS-###

Fixes #(issue)

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings November 30, 2025 11:36
@github-actions bot added labels on Nov 30, 2025: category: continuous batching, category: LLM, category: sampling, category: speculative decoding, category: GHA, category: LLM samples, category: CPP API, no-match-files, category: GGUF, category: VLM samples
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces support for EAGLE3 speculative decoding, a variant of speculative decoding in which the draft model consumes hidden states from the target model to improve inference performance. The key changes include the following (a usage sketch follows the list):

  • Implementation of EAGLE3 speculative decoding algorithm with hidden state handling
  • Python test infrastructure updates to support EAGLE3 models
  • Model transformation passes for extracting hidden states from specific layers
  • CI/CD workflow updates to test EAGLE3 functionality
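
To make the user-facing flow concrete, here is a minimal sketch of wiring a draft model into the openvino_genai Python API. The `ov_genai.draft_model(...)` call mirrors the one added in this PR's benchmark_vlm.py; the model paths, device, and `num_assistant_tokens` value are illustrative assumptions rather than code from this PR:

```python
# Hedged sketch: enabling speculative decoding with a separately converted
# draft model through the openvino_genai Python API. Paths, device, and
# num_assistant_tokens are illustrative assumptions; the EAGLE3-specific
# hidden-state plumbing added in this PR is internal to the pipeline.
import openvino_genai as ov_genai

main_model_path = "Qwen3-1.7B"          # converted target model (assumption)
draft_model_path = "qwen3_1.7b_eagle3"  # converted EAGLE3 draft model (assumption)

draft = ov_genai.draft_model(draft_model_path, "CPU")
pipe = ov_genai.LLMPipeline(main_model_path, "CPU", draft_model=draft)

config = ov_genai.GenerationConfig()
config.max_new_tokens = 64
config.num_assistant_tokens = 5  # tokens proposed by the draft model per step

print(pipe.generate("Why is the sky blue?", config))
```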

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tests/python_tests/utils/hugging_face.py | Added EAGLE3 model detection and conditional tokenizer handling |
| tests/python_tests/test_continuous_batching.py | Added EAGLE3 speculative decoding test cases |
| tests/python_tests/samples/test_speculative_decoding_lm.py | Refactored test structure and added EAGLE3 test class |
| tests/python_tests/samples/conftest.py | Added EAGLE3 model conversion configuration |
| src/cpp/src/utils.hpp | Removed ModelDesc struct and helper functions (moved to a separate file) |
| src/cpp/src/utils.cpp | Changed the default prefix caching setting and moved ModelDesc functions |
| src/cpp/src/speculative_decoding/update_request_structs.hpp | Added hidden_states field to the GeneratedSequence struct |
| src/cpp/src/speculative_decoding/speculative_decoding_stateful.hpp | Added model_desc.hpp include |
| src/cpp/src/speculative_decoding/speculative_decoding_impl.hpp | Introduced the generate_common template function and GenerateStrategy struct |
| src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp | Refactored generate() to use generate_common with a strategy pattern |
| src/cpp/src/speculative_decoding/speculative_decoding_eagle3_impl.hpp | New file implementing EAGLE3 decoding with model transformations |
| src/cpp/src/speculative_decoding/speculative_decoding_eagle3_impl.cpp | New file containing EAGLE3 implementation details |
| src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp | Added an EAGLE3-specific pipeline implementation class |
| src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp | Added hidden state handling and EAGLE3 pipeline support |
| src/cpp/src/sequence_group.hpp | Added hidden state storage and manipulation methods to Sequence |
| src/cpp/src/sampling/sampler.hpp | Added draft-to-target token mapping support (see the sketch after this table) |
| src/cpp/src/sampling/sampler.cpp | Implemented token offset computation for the EAGLE3 draft model |
| src/cpp/src/model_desc.hpp | New file defining the ModelDesc struct |
| src/cpp/src/model_desc.cpp | New file implementing ModelDesc utility functions |
| src/cpp/src/llm/pipeline.cpp | Added EAGLE3 runtime info handling and model initialization |
| src/cpp/src/continuous_batching/pipeline_impl.hpp | Added a step counter for embeddings handling |
| src/cpp/src/continuous_batching/pipeline_impl.cpp | Updated embedding append logic for EAGLE3 |
| src/cpp/src/continuous_batching/pipeline.cpp | Added EAGLE3 pipeline instantiation logic |
| src/cpp/src/continuous_batching/model_runner.hpp | Added hidden state management infrastructure |
| src/cpp/src/continuous_batching/model_runner.cpp | Implemented hidden state import/export during inference |
| src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp | Added EAGLE3 forward declarations and friend classes |
| samples/python/visual_language_chat/benchmark_vlm.py | Added draft model support (appears to be test/debug code) |
| .github/workflows/windows.yml | Added EAGLE3 test workflow step |
| .github/workflows/manylinux_2_28.yml | Added EAGLE3 test workflow step |
| .github/workflows/linux.yml | Added EAGLE3 test workflow step |
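
To illustrate the draft-to-target token mapping noted for sampler.hpp and sampler.cpp above: in EAGLE3-style setups the draft model typically samples over a reduced vocabulary, and an offset table maps each draft token id back into the target model's vocabulary. This is a hedged sketch of that idea; the table name `d2t` and its contents are illustrative assumptions, not code from this PR:

```python
# Hedged sketch of draft-to-target token mapping. The draft model's reduced
# vocabulary is mapped back to the target vocabulary via a per-token offset
# table (called d2t here, an assumed name). Table contents are invented.
import numpy as np

d2t = np.array([0, 3, 7, 2], dtype=np.int64)  # offsets for a 4-token draft vocab (assumption)

def draft_to_target(draft_token_id: int) -> int:
    """Map a draft-vocabulary token id to a target-vocabulary id."""
    return draft_token_id + int(d2t[draft_token_id])

assert draft_to_target(1) == 4  # draft id 1 + offset 3 -> target id 4
```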

Comment on lines +241 to +243
hf_tokenizer.save_pretrained(models_path)
# convert tokenizers as well
convert_and_save_tokenizer(hf_tokenizer, models_path)

Copilot AI Nov 30, 2025

The change from str(models_path) on line 241 to models_path is inconsistent with line 243, which still passes str(models_path) to convert_and_save_tokenizer. For consistency, either both calls should use str(models_path) or both should use models_path directly.

Suggested change
hf_tokenizer.save_pretrained(models_path)
# convert tokenizers as well
convert_and_save_tokenizer(hf_tokenizer, models_path)
hf_tokenizer.save_pretrained(str(models_path))
# convert tokenizers as well
convert_and_save_tokenizer(hf_tokenizer, str(models_path))
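
A small, hedged side note on the suggestion above: Hugging Face tokenizers' save_pretrained accepts both str and os.PathLike, so either spelling should work and the concern is purely stylistic. The models_path value below is an illustrative assumption:

```python
# Hedged sketch: save_pretrained accepts both str and os.PathLike, so a
# pathlib.Path can be passed directly. models_path is an assumption.
from pathlib import Path
from transformers import AutoTokenizer

models_path = Path("./models/Qwen3-1.7B")  # illustrative assumption
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")
hf_tokenizer.save_pretrained(models_path)       # Path works
hf_tokenizer.save_pretrained(str(models_path))  # str works too
```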

@pytest.mark.parametrize(
"convert_model, convert_draft_model, sample_args",
[
pytest.param("Qwen3-1.7B", "qwen3_1.7b_eagle3", test_prompt, marks=pytest.mark.skip(reason='CVS-171947, CVS-171943, CVS-174959')),

Copilot AI Nov 30, 2025

The test is marked as skipped with multiple CVS ticket references. Consider adding a more descriptive reason that explains what these tickets cover and what must be fixed before the test can be enabled.

Suggested change
pytest.param("Qwen3-1.7B", "qwen3_1.7b_eagle3", test_prompt, marks=pytest.mark.skip(reason='CVS-171947, CVS-171943, CVS-174959')),
pytest.param(
"Qwen3-1.7B",
"qwen3_1.7b_eagle3",
test_prompt,
marks=pytest.mark.skip(
reason=(
"Skipped due to known issues: "
"CVS-171947 (model conversion failure), "
"CVS-171943 (output mismatch), "
"CVS-174959 (runtime error). "
"Enable this test after resolving these issues."
)
),
),

SchedulerConfig default_config;
default_config.max_num_batched_tokens = std::numeric_limits<size_t>::max(); // don't limit total batch size
default_config.enable_prefix_caching = true; // for better TTFT in chat scenarios
default_config.enable_prefix_caching = false; // for better TTFT in chat scenarios

Copilot AI Nov 30, 2025

The comment states 'for better TTFT in chat scenarios' but the code is disabling prefix caching, which typically would hurt TTFT performance. The comment appears to contradict the implementation. Please clarify or correct the comment to explain why disabling prefix caching improves TTFT.

Suggested change
default_config.enable_prefix_caching = false; // for better TTFT in chat scenarios
default_config.enable_prefix_caching = false; // disables prefix caching; in some chat scenarios, this may reduce memory usage or avoid stale cache issues, but may hurt TTFT performance
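
For reference, a minimal sketch of how a scheduler configuration with prefix caching is constructed and passed through the openvino_genai Python bindings (the `scheduler_config=` keyword appears elsewhere in this PR; the model path and device here are illustrative assumptions):

```python
# Hedged sketch: constructing a SchedulerConfig with prefix caching explicitly
# enabled and passing it to a pipeline, to make the setting under discussion
# concrete. Model path and device are illustrative assumptions.
import openvino_genai as ov_genai

scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.enable_prefix_caching = True  # reuse KV-cache blocks across requests sharing a prompt prefix

pipe = ov_genai.LLMPipeline("Qwen3-1.7B", "CPU", scheduler_config=scheduler_config)
```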

result.removed_tokens_cnt = remove_tokens_from_sequence(running_sequence, min_generated_tokens, logit_processor);

//if (!m_is_validation_mode_enabled)
//std::cout << "removed " << result.removed_tokens_cnt << " tokens from sequence " << std::endl;

Copilot AI Nov 30, 2025

Commented-out debug code should be removed before merging to production.

Suggested change
//std::cout << "removed " << result.removed_tokens_cnt << " tokens from sequence " << std::endl;

} else if (_is_hs_internal()) {
// fill hidden_state_data with m_hidden_states
if (hidden_state_data) {
OPENVINO_ASSERT(num_scheduled_tokens == 1, "unexpected num_scheduled_tokens in speculative drafting stage in eagle3 mode");

Copilot AI Nov 30, 2025

Replace 'drafting' with 'decoding' in the error message.

Suggested change
OPENVINO_ASSERT(num_scheduled_tokens == 1, "unexpected num_scheduled_tokens in speculative drafting stage in eagle3 mode");
OPENVINO_ASSERT(num_scheduled_tokens == 1, "unexpected num_scheduled_tokens in speculative decoding stage in eagle3 mode");

Comment on lines +82 to +85
print("draft_model_path=", draft_model_path)
print("device=", device)
draft_model = ov_genai.draft_model(str(draft_model_path), device)
#pipe = ov_genai.VLMPipeline(models_path, device, scheduler_config=scheduler_config)

Copilot AI Nov 30, 2025

Debug print statements and commented-out code should be removed before merging. These appear to be temporary testing code that shouldn't be in production.

Suggested change
print("draft_model_path=", draft_model_path)
print("device=", device)
draft_model = ov_genai.draft_model(str(draft_model_path), device)
#pipe = ov_genai.VLMPipeline(models_path, device, scheduler_config=scheduler_config)
draft_model = ov_genai.draft_model(str(draft_model_path), device)

run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}
timeout: 90
- name: 'EAGLE3 speculative decoding tests'
cmd: |

Copilot AI Nov 30, 2025

Installing from a personal GitHub fork using a specific commit hash is fragile and not maintainable. Consider either: 1) merging these changes into the official optimum-intel repository and using a tagged release, or 2) documenting why this fork is necessary and when it can be removed.

Suggested change
cmd: |
cmd: |
# TODO: Using a personal fork of optimum-intel for EAGLE3 speculative decoding tests.
# Reason: [Add explanation here, e.g., "Required for feature X not yet merged upstream. See PR #123."]
# Remove this and use official optimum-intel release when upstream PR is merged.

run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}
timeout: 90
- name: 'EAGLE3 speculative decoding tests'
cmd: |

Copilot AI Nov 30, 2025

Installing from a personal GitHub fork using a specific commit hash is fragile and not maintainable. Consider either: 1) merging these changes into the official optimum-intel repository and using a tagged release, or 2) documenting why this fork is necessary and when it can be removed.

Suggested change
cmd: |
cmd: |
# FIXME: Using a personal fork of optimum-intel for EAGLE3 speculative decoding tests.
# Reason: Required changes are not yet merged upstream. See https://github.com/huggingface/optimum-intel/pull/<PR_NUMBER> (replace with actual PR/issue link).
# Remove this and use official optimum-intel release once changes are merged.

run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}
timeout: 90
- name: 'EAGLE3 speculative decoding tests'
cmd: |

Copilot AI Nov 30, 2025

Installing from a personal GitHub fork using a specific commit hash is fragile and not maintainable. Consider either: 1) merging these changes into the official optimum-intel repository and using a tagged release, or 2) documenting why this fork is necessary and when it can be removed.

Suggested change
cmd: |
cmd: |
# FIXME: Installing from a personal fork is fragile. This is required because the official optimum-intel does not yet support EAGLE3 speculative decoding.
# Remove this and use the official optimum-intel release once https://github.com/huggingface/optimum-intel/pull/XXX is merged and released.


Labels

category: continuous batching, category: CPP API, category: GGUF, category: GHA, category: LLM samples, category: LLM, category: sampling, category: speculative decoding, category: VLM samples, do_not_merge, do_not_review, no-match-files
