Testing #3087
Conversation
Signed-off-by: fishbell <[email protected]>
Signed-off-by: xiping.yan <[email protected]>
2: bell's update, Signed-off-by: xiping.yan <[email protected]>
Pull request overview
This PR introduces support for EAGLE3 speculative decoding, a new variant of speculative decoding that improves inference performance. The key changes include:
- Implementation of EAGLE3 speculative decoding algorithm with hidden state handling
- Python test infrastructure updates to support EAGLE3 models
- Model transformation passes for extracting hidden states from specific layers
- CI/CD workflow updates to test EAGLE3 functionality
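For context, speculative decoding in openvino.genai is driven through the public draft_model argument, and this PR wires EAGLE3 into that path. Below is a minimal usage sketch under stated assumptions: the model paths are placeholders, and the claim that the EAGLE3 pipeline variant is selected automatically from the draft model's runtime info is inferred from the pipeline.cpp changes, not confirmed.

```python
import openvino_genai as ov_genai

# Placeholder paths to converted OpenVINO models, not paths from this PR.
main_model_path = "path/to/target_model"
draft_model_path = "path/to/eagle3_draft_model"

# Standard speculative-decoding entry point: pass the draft model at
# pipeline construction time. Per this PR, the EAGLE3 variant is
# presumably chosen when the draft model carries EAGLE3 runtime info.
pipe = ov_genai.LLMPipeline(
    main_model_path,
    "CPU",
    draft_model=ov_genai.draft_model(draft_model_path, "CPU"),
)

config = ov_genai.GenerationConfig()
config.num_assistant_tokens = 5   # draft tokens proposed per step
config.max_new_tokens = 128
print(pipe.generate("What is OpenVINO?", config))
```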
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| tests/python_tests/utils/hugging_face.py | Added EAGLE3 model detection and conditional tokenizer handling |
| tests/python_tests/test_continuous_batching.py | Added EAGLE3 speculative decoding test cases |
| tests/python_tests/samples/test_speculative_decoding_lm.py | Refactored test structure and added EAGLE3 test class |
| tests/python_tests/samples/conftest.py | Added EAGLE3 model conversion configuration |
| src/cpp/src/utils.hpp | Removed ModelDesc struct and helper functions (moved to separate file) |
| src/cpp/src/utils.cpp | Changed default prefix caching setting and moved ModelDesc functions |
| src/cpp/src/speculative_decoding/update_request_structs.hpp | Added hidden_states field to GeneratedSequence struct |
| src/cpp/src/speculative_decoding/speculative_decoding_stateful.hpp | Added model_desc.hpp include |
| src/cpp/src/speculative_decoding/speculative_decoding_impl.hpp | Introduced generate_common template function and GenerateStrategy struct |
| src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp | Refactored generate() to use generate_common with strategy pattern |
| src/cpp/src/speculative_decoding/speculative_decoding_eagle3_impl.hpp | New file implementing EAGLE3 decoding with model transformations |
| src/cpp/src/speculative_decoding/speculative_decoding_eagle3_impl.cpp | New file containing EAGLE3 implementation details |
| src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.hpp | Added EAGLE3-specific pipeline implementation class |
| src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp | Added hidden state handling and EAGLE3 pipeline support |
| src/cpp/src/sequence_group.hpp | Added hidden state storage and manipulation methods to Sequence |
| src/cpp/src/sampling/sampler.hpp | Added draft-to-target token mapping support |
| src/cpp/src/sampling/sampler.cpp | Implemented token offset computation for EAGLE3 draft model |
| src/cpp/src/model_desc.hpp | New file defining ModelDesc struct |
| src/cpp/src/model_desc.cpp | New file implementing ModelDesc utility functions |
| src/cpp/src/llm/pipeline.cpp | Added EAGLE3 runtime info handling and model initialization |
| src/cpp/src/continuous_batching/pipeline_impl.hpp | Added step counter for embeddings handling |
| src/cpp/src/continuous_batching/pipeline_impl.cpp | Updated embedding append logic for EAGLE3 |
| src/cpp/src/continuous_batching/pipeline.cpp | Added EAGLE3 pipeline instantiation logic |
| src/cpp/src/continuous_batching/model_runner.hpp | Added hidden state management infrastructure |
| src/cpp/src/continuous_batching/model_runner.cpp | Implemented hidden state import/export during inference |
| src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp | Added EAGLE3 forward declarations and friend classes |
| samples/python/visual_language_chat/benchmark_vlm.py | Added draft model support (appears to be test/debug code) |
| .github/workflows/windows.yml | Added EAGLE3 test workflow step |
| .github/workflows/manylinux_2_28.yml | Added EAGLE3 test workflow step |
| .github/workflows/linux.yml | Added EAGLE3 test workflow step |
Review comments
```python
hf_tokenizer.save_pretrained(models_path)
# convert tokenizers as well
convert_and_save_tokenizer(hf_tokenizer, models_path)
```
Copilot (AI) commented on Nov 30, 2025:
The conversion from str(models_path) on line 241 to just models_path is inconsistent with line 243, which still uses str(models_path) in the convert_and_save_tokenizer call. For consistency, either both calls should use str(models_path) or both should use models_path directly.
Suggested change:
```diff
-hf_tokenizer.save_pretrained(models_path)
-# convert tokenizers as well
-convert_and_save_tokenizer(hf_tokenizer, models_path)
+hf_tokenizer.save_pretrained(str(models_path))
+# convert tokenizers as well
+convert_and_save_tokenizer(hf_tokenizer, str(models_path))
```
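A hedged alternative to the suggestion above: normalize the path once at the top of the function so both calls see the same type. This is a sketch, assuming the convert_and_save_tokenizer helper from the surrounding test utilities; it is not code from this PR.

```python
from pathlib import Path

def save_tokenizers(hf_tokenizer, models_path):
    # Normalize once; str() keeps compatibility with callers passing Path or str.
    models_path = str(Path(models_path))
    hf_tokenizer.save_pretrained(models_path)
    # convert tokenizers as well (helper assumed from the surrounding file)
    convert_and_save_tokenizer(hf_tokenizer, models_path)
```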
```python
@pytest.mark.parametrize(
    "convert_model, convert_draft_model, sample_args",
    [
        pytest.param("Qwen3-1.7B", "qwen3_1.7b_eagle3", test_prompt, marks=pytest.mark.skip(reason='CVS-171947, CVS-171943, CVS-174959')),
```
Copilot (AI) commented on Nov 30, 2025:
The test is marked as skipped with multiple CVS ticket references. Consider adding a more descriptive reason explaining what issues these tickets represent and what needs to be fixed before the test can be enabled.
| pytest.param("Qwen3-1.7B", "qwen3_1.7b_eagle3", test_prompt, marks=pytest.mark.skip(reason='CVS-171947, CVS-171943, CVS-174959')), | |
| pytest.param( | |
| "Qwen3-1.7B", | |
| "qwen3_1.7b_eagle3", | |
| test_prompt, | |
| marks=pytest.mark.skip( | |
| reason=( | |
| "Skipped due to known issues: " | |
| "CVS-171947 (model conversion failure), " | |
| "CVS-171943 (output mismatch), " | |
| "CVS-174959 (runtime error). " | |
| "Enable this test after resolving these issues." | |
| ) | |
| ), | |
| ), |
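If several tests end up sharing these ticket-based skips, a small helper keeps the reasons consistent. This is a hypothetical sketch, not code from the PR; the ticket descriptions mirror the suggestion above and should be checked against the actual trackers.

```python
import pytest

# Hypothetical mapping from tracking tickets to short descriptions; the
# descriptions below echo the suggestion above and are not verified.
KNOWN_ISSUES = {
    "CVS-171947": "model conversion failure",
    "CVS-171943": "output mismatch",
    "CVS-174959": "runtime error",
}

def skip_for(*tickets):
    """Build a skip mark with a descriptive, ticket-linked reason."""
    details = ", ".join(f"{t} ({KNOWN_ISSUES.get(t, 'see tracker')})" for t in tickets)
    return pytest.mark.skip(reason=f"Skipped due to known issues: {details}")
```

The parametrization then reads marks=skip_for("CVS-171947", "CVS-171943", "CVS-174959") everywhere the skip applies.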
```diff
 SchedulerConfig default_config;
 default_config.max_num_batched_tokens = std::numeric_limits<size_t>::max(); // don't limit total batch size
-default_config.enable_prefix_caching = true; // for better TTFT in chat scenarios
+default_config.enable_prefix_caching = false; // for better TTFT in chat scenarios
```
Copilot (AI) commented on Nov 30, 2025:
The comment states 'for better TTFT in chat scenarios' but the code is disabling prefix caching, which typically would hurt TTFT performance. The comment appears to contradict the implementation. Please clarify or correct the comment to explain why disabling prefix caching improves TTFT.
Suggested change:
```diff
-default_config.enable_prefix_caching = false; // for better TTFT in chat scenarios
+default_config.enable_prefix_caching = false; // disables prefix caching; in some chat scenarios, this may reduce memory usage or avoid stale cache issues, but may hurt TTFT performance
```
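Because this change flips a library-wide default, callers that rely on prefix caching for chat TTFT may want to set the flag explicitly rather than inherit whichever default ships. A minimal sketch, assuming the Python SchedulerConfig binding and that LLMPipeline accepts a scheduler_config keyword; model_path is a placeholder.

```python
import openvino_genai as ov_genai

model_path = "path/to/ov_model"  # placeholder, not a path from this PR

# Pin the behavior explicitly so a changed library default cannot
# silently alter chat TTFT characteristics.
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.enable_prefix_caching = True

pipe = ov_genai.LLMPipeline(model_path, "CPU", scheduler_config=scheduler_config)
```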
```cpp
result.removed_tokens_cnt = remove_tokens_from_sequence(running_sequence, min_generated_tokens, logit_processor);

//if (!m_is_validation_mode_enabled)
//std::cout << "removed " << result.removed_tokens_cnt << " tokens from sequence " << std::endl;
```
Copilot (AI) commented on Nov 30, 2025:
Commented-out debug code should be removed before merging to production.
Suggested change:
```diff
-//std::cout << "removed " << result.removed_tokens_cnt << " tokens from sequence " << std::endl;
```
```cpp
} else if (_is_hs_internal()) {
    // fill hidden_state_data with m_hidden_states
    if (hidden_state_data) {
        OPENVINO_ASSERT(num_scheduled_tokens == 1, "unexpected num_scheduled_tokens in speculative drafting stage in eagle3 mode");
```
Copilot (AI) commented on Nov 30, 2025:
Suggested changing 'drafting' to 'decoding' in the error message.
Suggested change:
```diff
-OPENVINO_ASSERT(num_scheduled_tokens == 1, "unexpected num_scheduled_tokens in speculative drafting stage in eagle3 mode");
+OPENVINO_ASSERT(num_scheduled_tokens == 1, "unexpected num_scheduled_tokens in speculative decoding stage in eagle3 mode");
```
| print("draft_model_path=", draft_model_path) | ||
| print("device=", device) | ||
| draft_model = ov_genai.draft_model(str(draft_model_path), device) | ||
| #pipe = ov_genai.VLMPipeline(models_path, device, scheduler_config=scheduler_config) |
Copilot (AI) commented on Nov 30, 2025:
Debug print statements and commented-out code should be removed before merging. These appear to be temporary testing code that shouldn't be in production.
| print("draft_model_path=", draft_model_path) | |
| print("device=", device) | |
| draft_model = ov_genai.draft_model(str(draft_model_path), device) | |
| #pipe = ov_genai.VLMPipeline(models_path, device, scheduler_config=scheduler_config) | |
| draft_model = ov_genai.draft_model(str(draft_model_path), device) |
```yaml
    run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}
    timeout: 90
  - name: 'EAGLE3 speculative decoding tests'
    cmd: |
```
Copilot (AI) commented on Nov 30, 2025:
Installing from a personal GitHub fork using a specific commit hash is fragile and not maintainable. Consider either: 1) merging these changes into the official optimum-intel repository and using a tagged release, or 2) documenting why this fork is necessary and when it can be removed.
Suggested change:
```diff
 cmd: |
+  # TODO: Using a personal fork of optimum-intel for EAGLE3 speculative decoding tests.
+  # Reason: [Add explanation here, e.g., "Required for feature X not yet merged upstream. See PR #123."]
+  # Remove this and use official optimum-intel release when upstream PR is merged.
```
```yaml
    run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}
    timeout: 90
  - name: 'EAGLE3 speculative decoding tests'
    cmd: |
```
Copilot (AI) commented on Nov 30, 2025:
Installing from a personal GitHub fork using a specific commit hash is fragile and not maintainable. Consider either: 1) merging these changes into the official optimum-intel repository and using a tagged release, or 2) documenting why this fork is necessary and when it can be removed.
Suggested change:
```diff
 cmd: |
+  # FIXME: Using a personal fork of optimum-intel for EAGLE3 speculative decoding tests.
+  # Reason: Required changes are not yet merged upstream. See https://github.com/huggingface/optimum-intel/pull/<PR_NUMBER> (replace with actual PR/issue link).
+  # Remove this and use official optimum-intel release once changes are merged.
```
```yaml
    run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}
    timeout: 90
  - name: 'EAGLE3 speculative decoding tests'
    cmd: |
```
Copilot (AI) commented on Nov 30, 2025:
Installing from a personal GitHub fork using a specific commit hash is fragile and not maintainable. Consider either: 1) merging these changes into the official optimum-intel repository and using a tagged release, or 2) documenting why this fork is necessary and when it can be removed.
Suggested change:
```diff
 cmd: |
+  # FIXME: Installing from a personal fork is fragile. This is required because the official optimum-intel does not yet support EAGLE3 speculative decoding.
+  # Remove this and use the official optimum-intel release once https://github.com/huggingface/optimum-intel/pull/XXX is merged and released.
```
Signed-off-by: xipingya <[email protected]>
Description
CVS-###
Fixes #(issue)
Checklist: