
Conversation

@sbalandi
Contributor

Description

It was found that llm_bench reports very low throughput for speculative decoding. This happens because llm_bench calculates the latency of every token after the first from the PerfMetrics raw_metrics.m_new_token_times, which do not take the batch into account. raw_metrics.m_durations are also derived from raw_metrics.m_new_token_times, but those durations are divided by the batch size, which significantly skews the calculations for speculative decoding.
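A minimal sketch of the problematic pattern described above (hypothetical code, not the actual llm_bench implementation): per-token durations are computed as deltas between consecutive new-token timestamps. With speculative decoding, several accepted draft tokens can share one timestamp, so zero-length deltas alternate with large step-level deltas and the derived throughput is misleading.

```python
# Hypothetical sketch, not the actual llm_bench code: derive per-token
# durations from new-token timestamps, as described in the PR.

def durations_from_token_times(new_token_times, batch_size):
    """Per-token durations as deltas between consecutive timestamps,
    divided by the batch size."""
    return [
        (cur - prev) / batch_size
        for prev, cur in zip(new_token_times, new_token_times[1:])
    ]

# With speculative decoding, one step may emit several accepted tokens
# that are recorded with the same timestamp (t=0.25 below), producing
# zero-length "latencies" mixed with full step-level deltas.
times = [0.0, 0.25, 0.25, 0.25, 0.5]  # 3 tokens accepted at t=0.25
print(durations_from_token_times(times, batch_size=1))
# → [0.25, 0.0, 0.0, 0.25]
```

The values 0.25 and 0.5 are chosen to be exactly representable in binary floating point, so the deltas above are exact.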

CVS-174513

Fixes #(issue)

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

@github-actions github-actions bot added the category: llm_bench Label for tool/llm_bench folder label Dec 11, 2025

@AsyaPronina AsyaPronina left a comment


Thanks a lot!!

@sbalandi sbalandi added this pull request to the merge queue Dec 17, 2025
Merged via the queue into openvinotoolkit:master with commit 2c35439 Dec 17, 2025
97 checks passed
