[GuideLLM Refactor] Updates and Fixes for benchmark outputs, schemas, and stats calculations #442
base: main
Conversation
Pull Request Overview
This PR performs a significant refactoring to reorganize the codebase's statistics, schemas, and benchmarking components. The main changes consolidate utilities into schemas, remove deprecated presentation modules, and restructure the test suite to align with the new organization.
Key changes:
- Moved statistics utilities from utils/statistics.py to schemas/statistics.py and updated to use probability density functions (PDFs) instead of cumulative distribution functions (CDFs)
- Relocated pydantic utilities from utils/pydantic_utils.py to schemas/base.py
- Removed deprecated presentation modules (injector.py, data_models.py, builder.py)
- Complete rewrite of statistics tests to use parametrized fixtures and broader distribution coverage
- Added new console utilities for formatted table printing
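To make the PDF-vs-CDF distinction concrete, here is a minimal, hypothetical sketch of computing a percentile from a discrete PDF (value/probability pairs) by accumulating it into a CDF. The helper name and shapes are invented for illustration; this is not the guidellm implementation.

```python
import numpy as np

def percentile_from_pdf(values: np.ndarray, probs: np.ndarray, q: float) -> float:
    """Illustrative only: q-th percentile from a discrete PDF.

    Sorts the support, accumulates probabilities into a CDF, and finds
    the first value whose cumulative mass reaches q percent.
    """
    order = np.argsort(values)
    values, probs = values[order], probs[order]
    cdf = np.cumsum(probs) / probs.sum()  # normalize so the CDF ends at 1.0
    idx = int(np.searchsorted(cdf, q / 100.0))
    return float(values[min(idx, len(values) - 1)])

# Uniform mass over 1..5 as a toy distribution
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
pr = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
percentile_from_pdf(vals, pr, 50)  # -> 3.0
```

Storing the PDF keeps the per-bucket mass available for weighted metrics, while any CDF-style query can still be recovered with one cumulative sum, as above.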
Reviewed Changes
Copilot reviewed 59 out of 61 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| tests/unit/utils/test_statistics.py | Complete rewrite with parametrized fixtures testing multiple probability distributions |
| tests/unit/utils/test_pydantic_utils.py | Updated import path from utils to schemas |
| tests/unit/presentation/* | Removed deprecated presentation test files |
| tests/unit/mock_benchmark.py | Updated class name from BenchmarkSchedulerStats to BenchmarkSchedulerMetrics |
| src/guidellm/utils/statistics.py | Deleted - moved to schemas/statistics.py |
| src/guidellm/schemas/statistics.py | New file with refactored statistics using PDF-based approach |
| src/guidellm/schemas/base.py | New file consolidating pydantic utilities |
| src/guidellm/utils/console.py | Enhanced with table printing capabilities and improved documentation |
| src/guidellm/utils/functions.py | Added safe_format_number utility |
| Multiple schema files | Updated imports and restructured benchmark schemas |
Comments suppressed due to low confidence (3)
src/guidellm/data/preprocessors/formatters.py:44
- This class does not call RequestFormatter.__init__ during initialization. (GenerativeTextCompletionsRequestFormatter.__init__ may be missing a call to a base class __init__)
class GenerativeTextCompletionsRequestFormatter(RequestFormatter):
src/guidellm/data/preprocessors/formatters.py:118
- This class does not call RequestFormatter.__init__ during initialization. (GenerativeChatCompletionsRequestFormatter.__init__ may be missing a call to a base class __init__)
class GenerativeChatCompletionsRequestFormatter(RequestFormatter):
src/guidellm/data/preprocessors/formatters.py:307
- This class does not call RequestFormatter.__init__ during initialization. (GenerativeAudioTranscriptionRequestFormatter.__init__ may be missing a call to a base class __init__)
class GenerativeAudioTranscriptionRequestFormatter(RequestFormatter):
Ran this benchmark:

guidellm benchmark \
  --target http://vllm-standalone-granite-3-2b.llmd.svc.cluster.local \
  --data "prompt_tokens=4096,prompt_tokens_stdev=512,prompt_tokens_min=2048,prompt_tokens_max=6144,output_tokens=512,output_tokens_stdev=128,output_tokens_min=1,output_tokens_max=1024" \
  --max-seconds 10 \
  --profile concurrent \
  --rate 10

And got the following error: ...
File "/root/guidellm/src/guidellm/benchmark/benchmarker.py", line 161, in run
benchmark = benchmark_class.compile(
accumulator=accumulator,
scheduler_state=scheduler_state,
)
File "/root/guidellm/src/guidellm/benchmark/schemas/generative/benchmark.py", line 134, in compile
metrics=GenerativeMetrics.compile(accumulator),
~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
File "/root/guidellm/src/guidellm/benchmark/schemas/generative/metrics.py", line 797, in compile
incomplete = accumulator.incomplete.get_within_range(start_time, end_time)
File "/root/guidellm/src/guidellm/benchmark/schemas/generative/accumulator.py", line 623, in get_within_range
if (stats.request_end_time >= start_time)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/guidellm/src/guidellm/schemas/request_stats.py", line 81, in request_end_time
raise ValueError("resolve_end timings should be set but is None.")
ValueError: resolve_end timings should be set but is None.

Edit: Seems to be due to max-seconds constraint.
Aside from the max-seconds bug, here are some high-level notes:
- Needs signoff
- Needs cleanup. To retroactively run pre-commit on only the changes:
  pre-commit run --from $(git merge-base main@{u} HEAD) --to HEAD
- Only glanced over the accumulator and statistics code. Will do a more in-depth review time permitting, but don't block on it.
- Is warmup/cooldown only seconds now and not sometimes a percent? Setting --warmup .1 results in a table entry of Warm Sec: 0.1.
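To illustrate the warmup question, here is one hypothetical interpretation in which values below 1 are treated as a fraction of the run duration rather than absolute seconds. This is a sketch of one possible behavior, not what guidellm actually does; the function name and threshold are invented.

```python
# Hypothetical: interpret --warmup as a fraction of max_seconds when it
# is in (0, 1), and as absolute seconds otherwise. Under this rule,
# --warmup .1 on a 10 s run would mean 1.0 s of warmup, not 0.1 s.
def resolve_warmup(warmup: float, max_seconds: float) -> float:
    if 0 < warmup < 1:
        return warmup * max_seconds  # fractional interpretation
    return warmup  # absolute seconds

resolve_warmup(0.1, 10.0)  # -> 1.0
resolve_warmup(5.0, 10.0)  # -> 5.0
```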
bfc4b2f to 4a951c4
…or the latest state of refactor Signed-off-by: Mark Kurtz <[email protected]>
Signed-off-by: Mark Kurtz <[email protected]>
Signed-off-by: Mark Kurtz <[email protected]>
…ons, metrics, and outputs Signed-off-by: Mark Kurtz <[email protected]>
Signed-off-by: Mark Kurtz <[email protected]>
Signed-off-by: Mark Kurtz <[email protected]>
Signed-off-by: Mark Kurtz <[email protected]>
Co-authored-by: Samuel Monson <[email protected]> Signed-off-by: Mark Kurtz <[email protected]> Signed-off-by: Mark Kurtz <[email protected]>
Signed-off-by: Mark Kurtz <[email protected]>
73e7539 to 6c7f133
Signed-off-by: Mark Kurtz <[email protected]>
I can confirm that the max-seconds error is fixed. I'll do more testing tomorrow, but it works for me.
The new CLI table outputs are a little busy to look at, but work. Adding more horizontal padding may help, but isn't necessary.
self._add_field(
    headers,
    values,
    "Run Info",
    "Requests",
    json.dumps(benchmark.config.requests),
)
self._add_field(
    headers, values, "Run Info", "Backend", json.dumps(benchmark.config.backend)
)
self._add_field(
    headers,
    values,
    "Run Info",
    "Environment",
    json.dumps(benchmark.config.environment),
)
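For context on what lands in the CSV cells, here is a small sketch of the serialization the snippet performs. The config shape is invented for illustration; only the json.dumps pattern mirrors the reviewed code.

```python
import json

# A nested config object (shape is hypothetical) collapses into a single
# CSV cell as a compact JSON string: machine-recoverable via json.loads,
# but not pretty-printed for human readers.
backend = {"type": "openai_http", "target": "http://localhost:8000"}
cell = json.dumps(backend)
# cell == '{"type": "openai_http", "target": "http://localhost:8000"}'
```

The round-trip property (json.loads(cell) returns the original dict) is what keeps the CSV lossless even though the rendering is dense.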
The requests and environment values are not the prettiest, but they work and don't exclude any information.
The format doesn't preserve the formatting of the original input the way main does right now, but if that's acceptable then I'm okay with proceeding.
Yeah, I was of the opinion that for the CSV side, this is more for one-off, manual debugging in spreadsheets if someone needs it. If they want actual access to the data to query/manipulate/act on it, then they should be interacting with either the JSON file or loading it into a Python object. I could be moved the other way to try and format it better, but I think I would consider that a follow-up to this PR.
Signed-off-by: Mark Kurtz <[email protected]>
Signed-off-by: Mark Kurtz <[email protected]>
This doesn't have anything obvious that I am going to block on. We just need to fix that test dependency.
LGTM. Either land #440 first for the test fixes, or ignore.
Summary
Details
Test Plan
Related Issues
Use of AI
## WRITTEN BY AI ##