-
Notifications
You must be signed in to change notification settings - Fork 592
Fix for moe on sm110 #2190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix for moe on sm110 #2190
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughThe changes update SM110 CUTLASS kernel dispatch logic by switching epilogue scheduling from AUTO to TMA strategy and relaxing SM version compatibility checks to allow SM100 configurations to execute on SM110 devices. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @jhalabi-nv, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces critical fixes and optimizations for Mixture of Experts (MoE) General Matrix Multiply (GEMM) operations specifically targeting SM110 architectures. The changes focus on enhancing performance by utilizing Tensor Memory Accelerator (TMA) for epilogue scheduling, improving compatibility by allowing SM100 configurations to run on SM110, and refining the kernel dispatch logic to ensure SM110 devices leverage the most appropriate specialized kernels for efficient execution. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to add support for MoE kernels on the SM110 architecture. The changes involve updating kernel candidate configurations and modifying dispatch logic. Specifically, it allows SM100 kernel configurations to be used on SM110 devices and adjusts compile-time kernel selection for SM110. My review has identified a significant logical inconsistency between the runtime configuration generation and the compile-time kernel specializations for SM110, which results in SM110-specific code paths being unreachable. I've provided comments with high and medium severity to highlight these issues and suggest a more consistent approach for SM110 support.
| TLLM_CHECK_WITH_INFO( | ||
| (inputs.gemm_config.sm_version / 10 == sm_ / 10) || | ||
| // allow sm100 configs to run on sm110 as well | ||
| (inputs.gemm_config.sm_version / 10 == 10 && sm_ / 10 == 11), | ||
| "Using SM %d configuration for SM %d device", | ||
| inputs.gemm_config.sm_version, sm_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There appears to be a logical inconsistency in how SM110 support is being added. This change allows SM100-kernels to run on SM110 devices, which is necessary because get_candidate_configs_sm110 in cutlass_heuristic.cpp generates configurations with sm_version=100 instead of 110.
However, other changes in moe_gemm_template_dispatch_tma_ws.h introduce compile-time logic for SM110-specific kernels. These kernels will never be dispatched because the runtime configuration will always be for SM100.
This makes the SM110-specific code paths dead code and the overall approach confusing. A more robust solution would be:
- Update
get_candidate_configs_sm110to generate configs withsm_version=110. - Add a dispatch case for
sm_version=110indispatchMoeGemmSelectTileShapeTmaWarpSpecialized. - This would make this explicit check for SM100 on SM110 unnecessary.
If the intention is to use SM100 kernels on SM110 for now, the changes for SM110-specific kernels in moe_gemm_template_dispatch_tma_ws.h should probably be in a separate, future PR to avoid confusion.
| } | ||
|
|
||
| if constexpr (Arch::kMinComputeCapability >= 100 && Arch::kMinComputeCapability < 120) { | ||
| if constexpr (Arch::kMinComputeCapability >= 100 && Arch::kMinComputeCapability < 110) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change, along with the one on line 207, creates a specific compile-time path for SM110. However, as noted in my main comment on moe_gemm_template_dispatch.h, the current runtime logic seems configured to use SM100 kernels on SM110 devices. This makes this new SM110-specific path unreachable. If these changes are preparatory for full SM110 kernel support in a future update, it might be better to introduce them then to avoid having dead code in the repository.
| selected_func(hopper_input, num_experts, multi_processor_count, stream, occupancy, | ||
| workspace_size, cluster_shape_cute, cluster_shape_cute_fallback); | ||
| } else if constexpr (Arch::kMinComputeCapability >= 120 || Arch::kMinComputeCapability == 90) { | ||
| } else if constexpr (Arch::kMinComputeCapability >= 120 || Arch::kMinComputeCapability == 90 || Arch::kMinComputeCapability == 110) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the change on line 179, this adds SM110 to this kernel dispatch path. As mentioned in my other comments, this path appears to be unreachable with the current configuration generation logic, which defaults to using SM100 kernels on SM110 hardware. This might be dead code until the configuration generation and dispatch logic are updated to handle sm_version=110.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
797-803: Consider improving clarity of the SM version compatibility check.The added condition correctly allows SM100 configs to run on SM110 devices as an interim solution. However, the magic numbers and division logic could be more explicit.
Consider extracting the major version logic into named constants or a helper function:
// At file or namespace scope constexpr int SM_MAJOR_VERSION_100 = 10; constexpr int SM_MAJOR_VERSION_110 = 11; inline int getSmMajorVersion(int sm_version) { return sm_version / 10; } // In the check: TLLM_CHECK_WITH_INFO( (getSmMajorVersion(inputs.gemm_config.sm_version) == getSmMajorVersion(sm_)) || // Allow SM100 configs to run on SM110 as interim solution (getSmMajorVersion(inputs.gemm_config.sm_version) == SM_MAJOR_VERSION_100 && getSmMajorVersion(sm_) == SM_MAJOR_VERSION_110), "Using SM %d configuration for SM %d device", inputs.gemm_config.sm_version, sm_);Also, verify that the TMA epilogue schedule pinning in
cutlass_heuristic.cppproperly coordinates with this dispatch logic:#!/bin/bash # Verify the flow from config generation to dispatch for SM110 rg -n "get_candidate_configs_sm110|EpilogueScheduleType::TMA" --type=cpp -A3 -B3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp(2 hunks)csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
🔇 Additional comments (2)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)
572-580: LGTM! Consistent TMA epilogue schedule applied to all SM110 configs.The change ensures all generated SM110 configurations use
EpilogueScheduleType::TMAinstead ofAUTO, which is consistent with the FAST_BUILD path at line 536 and addresses the issue where AUTO wasn't being resolved by the dispatch function.
533-537: Confirm TMA is the only valid epilogue schedule for SM110 MOE kernels.The change from
EpilogueScheduleType::AUTOtoEpilogueScheduleType::TMApins the epilogue schedule for FAST_BUILD on SM110. Please verify that TMA is indeed the only supported epilogue schedule for SM110 MOE kernels by checking the SM110-specific dispatch logic and any constraints documented in the codebase.
|
/bot run |
yzh119
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, should be ready to merge once gitlab CI passed.
📌 Description
SM100 configs are generated for SM110 devices, but of the valid MOE kernels for SM110, only TMA Epilogue Scheduling is supported. In the generated configs,
EpilogueScheduleType::AUTOis not handled well in thegetDispatchFunctionForSM100dispatch function (it is never resolved to a concrete value, either TMA or NO_SMEM), so all configs are rejected.This PR pins the generated configs to
EpilogueScheduleType::TMAfor SM110 devices. It also allows SM100 configs to run on SM110 devices.Future work would include generating SM110 configs specifically for SM110 devices, rather than rely on SM100 configs.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commitby runningpip install pre-commit(or used your preferred method).pre-commit install.pre-commit run --all-filesand fixed any reported issues.🧪 Tests
unittest, etc.).Reviewer Notes
Summary by CodeRabbit
Bug Fixes
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.