
Conversation

@Elnifio
Contributor

@Elnifio Elnifio commented Nov 7, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Added deployment recipes for Qwen-3 models with support for disaggregated and aggregated inference modes
    • Implemented persistent volume-based model caching mechanism
    • Added automated model downloading from Hugging Face
    • Introduced performance benchmarking capabilities for model evaluation

@Elnifio Elnifio requested review from a team as code owners November 7, 2025 07:35
@copy-pr-bot

copy-pr-bot bot commented Nov 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions bot commented Nov 7, 2025

👋 Hi Elnifio! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to catch errors quickly. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Nov 7, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

Introduces Kubernetes manifests for deploying two Qwen model variants with disaggregated and aggregated inference strategies. Adds PersistentVolumeClaims for model caching, model download Jobs, DynamoGraphDeployment configurations with TensorRT-LLM backends, and aiperf benchmarking Jobs supporting both parallel prefill/decode and single-node inference patterns.

Changes

  • Model Cache Setup (recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-cache.yaml, recipes/qwen3-235b-a22b/model-cache/model-cache.yaml): New PersistentVolumeClaim manifests defining 300Gi of storage with ReadWriteMany access mode and a placeholder storageClassName.
  • Model Download Jobs (recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml, recipes/qwen3-235b-a22b/model-cache/model-download.yaml): New Kubernetes Job manifests for downloading models from Hugging Face using hf_transfer, mounting persistent volumes at /model-store, with HF_TOKEN sourced from secrets.
  • Disaggregated Deployment (recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml): DynamoGraphDeployment with separate prefill and decode worker components, ConfigMaps for the prefill.yaml and decode.yaml configurations, and a shared model-cache PVC. Includes a frontend router and TensorRT-LLM prefill/decode workers with GPU specifications and node affinity rules.
  • Aggregated Deployment (recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml): DynamoGraphDeployment with a single aggregated TrtllmWorker component and an agg.yaml ConfigMap. Specifies 4 GPU allocations, pod anti-affinity, and model-cache mounting for unified inference.
  • Disaggregated Benchmarking (recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/perf.yaml): Kubernetes Job running aiperf benchmarks with inline bash functions for model readiness polling, concurrency calculation, artifact management, and performance profiling against the disaggregated endpoint.
  • Aggregated Benchmarking (recipes/qwen3-235b-a22b/trtllm/agg/perf.yaml): Kubernetes Job for aiperf benchmarking with bash helper functions (wait_for_model_ready, run_perf), dynamic concurrency computation, input_config.json generation, and artifact collection for the aggregated deployment.
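A readiness-polling helper of the kind named in the benchmarking summaries (wait_for_model_ready) might be sketched as follows. This is an illustrative assumption, not the recipe's actual script: the endpoint path (/v1/models), poll interval, and timeout default are all hypothetical.

```shell
# Sketch of a wait_for_model_ready-style helper (function name from the
# summary above; URL path, interval, and timeout are illustrative).
wait_for_model_ready() {
  url="$1"
  timeout="${2:-300}"          # seconds to wait before giving up
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Assume the frontend answers /v1/models with HTTP 200 once serving.
    if curl -sf "${url}/v1/models" > /dev/null 2>&1; then
      echo "model ready after ${elapsed}s"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "timed out waiting for ${url}" >&2
  return 1
}
```

In the recipes, a helper like this would gate the aiperf run so benchmarking only starts after the deployment is serving.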

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • DynamoGraphDeployment resources: Verify correct component hierarchy, volumeMounting references, and environment variable mappings for both disaggregated and aggregated modes.
  • Prefill/decode configuration split: Ensure prefill.yaml and decode.yaml ConfigMap contents align with TensorRT-LLM disaggregation expectations and tensor parallel sizing.
  • Bash scripting in perf.yaml: Review wait_for_model_ready polling logic, aiperf argument construction, concurrency calculations (TOTAL_CONCURRENCY = CONCURRENCY_PER_GPU × DEPLOYMENT_GPU_COUNT), and input_config.json schema.
  • Volume and secret wiring: Validate PVC mounting paths, hf-token-secret envFrom sourcing, and storage class compatibility across both model variants.
  • Node affinity and pod anti-affinity: Confirm placement rules for prefill workers, decode workers, and benchmarking pods across cluster topology.
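The concurrency calculation flagged in the third bullet can be sketched as a small shell fragment. Variable names mirror the review comment (TOTAL_CONCURRENCY = CONCURRENCY_PER_GPU × DEPLOYMENT_GPU_COUNT); the default values are illustrative, and the perf.yaml scripts remain authoritative.

```shell
# Sketch of the TOTAL_CONCURRENCY computation named in the review bullet.
# Defaults (8 per GPU, 4 GPUs) are illustrative assumptions.
CONCURRENCY_PER_GPU="${CONCURRENCY_PER_GPU:-8}"
DEPLOYMENT_GPU_COUNT="${DEPLOYMENT_GPU_COUNT:-4}"
TOTAL_CONCURRENCY=$((CONCURRENCY_PER_GPU * DEPLOYMENT_GPU_COUNT))
echo "aiperf concurrency: ${TOTAL_CONCURRENCY}"
```

Scaling concurrency with GPU count keeps the benchmark load proportional to deployment size, which is what a reviewer would check here.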

Poem

🐰 Two Qwens hop into Kubernetes with flair,
One splits its brains—prefill and decode with care,
The other stands tall, aggregated and strong,
With caches and benchmarks to carry along,
Models download, storage persists with grace,
Performance unfolds in each deployment's space!

Pre-merge checks

❌ Failed checks (1 warning)
Description check: ⚠️ Warning. The PR description contains only template placeholders with no substantive content, missing all required details about changes, implementation specifics, and related issues. Resolution: fill in all sections with concrete details: provide an overview of the new recipes, describe the specific changes (e.g., FP8 and other precision variants), identify key files for review, and reference the actual GitHub issue number instead of the placeholder.
✅ Passed checks (2 passed)
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
Title check: ✅ Passed. The title accurately describes the main change: adding Qwen3-235B-A22B recipes in both aggregated and disaggregated configurations (agg and disagg folders are present in the file summary).

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (2)
recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml (1)

15-44: Add securityContext to restrict privilege escalation and enforce non-root execution.

Same security concern as the model-download job in recipes/qwen3-235b-a22b/model-cache/model-download.yaml. The container runs as root without explicit restrictions on privilege escalation (CKV_K8S_20, CKV_K8S_23).

Apply the same securityContext fix:

      spec:
        restartPolicy: Never
        containers:
          - name: model-download
            image: python:3.10-slim
            command: ["sh", "-c"]
+           securityContext:
+             allowPrivilegeEscalation: false
+             runAsNonRoot: true
+             runAsUser: 1000
            envFrom:
              - secretRef:
                  name: hf-token-secret
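Applied in full, the patched pod spec would look roughly like the sketch below. The image, command, secret name, and mount path follow this review thread; the volume names and the choice of UID 1000 are assumptions, and the UID must have write access to the model-store volume.

```yaml
# Sketch of the download Job's pod spec with the suggested securityContext
# applied. Volume names are illustrative; runAsUser 1000 is an assumption.
spec:
  restartPolicy: Never
  containers:
    - name: model-download
      image: python:3.10-slim
      command: ["sh", "-c"]
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        runAsUser: 1000
      envFrom:
        - secretRef:
            name: hf-token-secret   # provides HF_TOKEN for gated models
      volumeMounts:
        - name: model-store
          mountPath: /model-store   # backed by the model-cache PVC
  volumes:
    - name: model-store
      persistentVolumeClaim:
        claimName: model-cache
```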
recipes/qwen3-235b-a22b/trtllm/agg/perf.yaml (1)

141-142: Eliminate privileged mode and run as non-root; justify if strictly necessary.

Same critical security concern as the disaggregated perf Job (file recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/perf.yaml). The container runs with privileged: true (CKV_K8S_16 – HIGH), which is a significant security risk.

Apply the same fix:

        securityContext:
-         privileged: true
+         allowPrivilegeEscalation: false
+         runAsNonRoot: true
+         runAsUser: 1000
+         capabilities:
+           drop:
+             - ALL

Verify aiperf functionality with these restrictions and adjust capabilities only if specific syscalls fail.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and 9dfa886.

📒 Files selected for processing (8)
  • recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-cache.yaml (1 hunks)
  • recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml (1 hunks)
  • recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml (1 hunks)
  • recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/perf.yaml (1 hunks)
  • recipes/qwen3-235b-a22b/model-cache/model-cache.yaml (1 hunks)
  • recipes/qwen3-235b-a22b/model-cache/model-download.yaml (1 hunks)
  • recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml (1 hunks)
  • recipes/qwen3-235b-a22b/trtllm/agg/perf.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.
📚 Learning: 2025-10-24T04:21:08.751Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.

Applied to files:

  • recipes/qwen3-235b-a22b/model-cache/model-cache.yaml
  • recipes/qwen3-235b-a22b/model-cache/model-download.yaml
  • recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-cache.yaml
  • recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml
  • recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml
🪛 Checkov (3.2.334)
recipes/qwen3-235b-a22b/model-cache/model-download.yaml

[medium] 3-44: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 3-44: Minimize the admission of root containers

(CKV_K8S_23)

recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/perf.yaml

[medium] 3-153: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 3-153: Container should not be privileged

(CKV_K8S_16)


[medium] 3-153: Minimize the admission of root containers

(CKV_K8S_23)

recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml

[medium] 3-44: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 3-44: Minimize the admission of root containers

(CKV_K8S_23)

recipes/qwen3-235b-a22b/trtllm/agg/perf.yaml

[medium] 3-153: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 3-153: Container should not be privileged

(CKV_K8S_16)


[medium] 3-153: Minimize the admission of root containers

(CKV_K8S_23)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4179/merge) by Elnifio.
recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml

[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook (trailing-whitespace) during pre-commit run.

recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml

[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook (trailing-whitespace) during pre-commit run.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml (1)

8-59: Engine configuration strategy for disaggregation is well-tuned.

The prefill.yaml and decode.yaml configs show thoughtful tuning for the disaggregated inference pattern:

  • Prefill worker: Low batch size (2), tensor parallelism (2), optimized for throughput of short batches
  • Decode worker: Higher batch size (512), tensor parallelism (4), optimized for sustained token generation

GPU allocation aligns with workload characteristics. The fp8 dtype and memory settings are appropriate for the Qwen3-235B model size.
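The tuning described above corresponds roughly to engine configs of the following shape. The values are taken from the bullet points in this comment; the field names are assumptions in the style of TensorRT-LLM engine configs, and the recipe's actual prefill.yaml and decode.yaml are authoritative.

```yaml
# prefill.yaml (sketch; values from the review summary, field names assumed)
tensor_parallel_size: 2
max_batch_size: 2        # small batches, throughput-oriented prefill
---
# decode.yaml (sketch)
tensor_parallel_size: 4
max_batch_size: 512      # large batches for sustained token generation
```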

Signed-off-by: Elnifio <[email protected]>
@Elnifio Elnifio force-pushed the add-qwen3-235b-recipe branch from 1095b8f to d289661 on November 7, 2025 19:44
@Elnifio Elnifio changed the title Adds Qwen3-235B-A22B recipes for both precisions docs: Adds Qwen3-235B-A22B recipes for both precisions Nov 7, 2025
@github-actions github-actions bot added the docs label Nov 7, 2025
@Elnifio Elnifio changed the title docs: Adds Qwen3-235B-A22B recipes for both precisions docs: Adds Qwen3-235B-A22B-FP8 recipes for agg and disagg Nov 13, 2025
Signed-off-by: Elnifio <[email protected]>
@Elnifio Elnifio enabled auto-merge (squash) November 13, 2025 19:58
@Elnifio
Contributor Author

Elnifio commented Nov 13, 2025

/ok to test 684ab61

@Elnifio Elnifio merged commit 956f93f into ai-dynamo:main Nov 13, 2025
24 of 28 checks passed
Elnifio added a commit that referenced this pull request Nov 14, 2025
daiyaanarfeen pushed a commit that referenced this pull request Nov 14, 2025
nv-tusharma pushed a commit that referenced this pull request Nov 14, 2025
chandlj pushed a commit to chandlj/dynamo that referenced this pull request Nov 18, 2025

Labels

docs · external-contribution (Pull request is from an external contributor) · size/XL
