
Conversation


@jhalabi-nv jhalabi-nv commented Dec 9, 2025

📌 Description

SM100 configs are generated for SM110 devices, but among the valid MOE kernels for SM110, only TMA epilogue scheduling is supported. In the generated configs, EpilogueScheduleType::AUTO is not handled by the getDispatchFunctionForSM100 dispatch function: it is never resolved to a concrete value (either TMA or NO_SMEM), so all configs are rejected.

This PR pins the generated configs to EpilogueScheduleType::TMA for SM110 devices. It also allows SM100 configs to run on SM110 devices.

Future work includes generating SM110 configs specifically for SM110 devices, rather than relying on SM100 configs.
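The failure mode can be sketched in isolation. The snippet below is a minimal model, not the actual TensorRT-LLM code: the enum, struct, and function names (dispatchForSm110, makeSm110Config) are hypothetical stand-ins that mirror how an unresolved AUTO schedule causes every config to be rejected, while pinning TMA yields a dispatchable config.

```cpp
#include <optional>

// Hypothetical stand-ins for the real TensorRT-LLM types; names and shapes
// are illustrative only, not the actual FlashInfer/TensorRT-LLM API.
enum class EpilogueScheduleType { AUTO, TMA, NO_SMEM };

struct CutlassGemmConfig {
  EpilogueScheduleType epilogue_schedule;
};

// Models the SM110 situation described above: only TMA epilogue scheduling
// maps to a valid MOE kernel; AUTO is never resolved to a concrete value,
// so it falls through and the config is rejected.
std::optional<int> dispatchForSm110(CutlassGemmConfig const& config) {
  switch (config.epilogue_schedule) {
    case EpilogueScheduleType::TMA:
      return 0;  // index of a valid kernel
    case EpilogueScheduleType::NO_SMEM:
    case EpilogueScheduleType::AUTO:
    default:
      return std::nullopt;  // rejected: no concrete kernel selected
  }
}

// The fix in this PR, in miniature: pin generated SM110 configs to TMA
// instead of leaving them at AUTO.
CutlassGemmConfig makeSm110Config() {
  return CutlassGemmConfig{EpilogueScheduleType::TMA};
}
```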

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Fixed GPU configuration validation to support latest hardware generations.
  • Refactor

    • Optimized kernel execution scheduling for improved performance on newer devices.


Contributor

coderabbitai bot commented Dec 9, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The changes update SM110 CUTLASS kernel dispatch logic by switching epilogue scheduling from AUTO to TMA strategy and relaxing SM version compatibility checks to allow SM100 configurations to execute on SM110 devices.

Changes

Cohort / File(s) Summary
Epilogue Schedule Type Update
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
Modified get_candidate_configs_sm110 to use EpilogueScheduleType::TMA instead of AUTO in FAST_BUILD path and per-tile config construction for CutlassGemmConfig instances.
SM Version Compatibility Relaxation
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
Relaxed SM version validation in MoeGemmRunner::dispatchToArch to allow SM100 configurations to run on SM110 devices by accepting both exact major/minor matches and a fallback condition.
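The relaxed check can be modeled as a small predicate. isConfigCompatible below is a hypothetical name; the real comparison lives inside a TLLM_CHECK_WITH_INFO in MoeGemmRunner::dispatchToArch, with SM versions encoded as in the source (100 for SM100, 110 for SM110) and compared by major version (value / 10).

```cpp
// Sketch (assumed shape, not the real API) of the relaxed compatibility
// check: exact major-version match, plus a fallback that lets SM100
// configs run on SM110 devices.
bool isConfigCompatible(int config_sm_version, int device_sm_version) {
  bool same_major = config_sm_version / 10 == device_sm_version / 10;
  // Fallback: allow SM100 configs to run on SM110 as well.
  bool sm100_on_sm110 =
      config_sm_version / 10 == 10 && device_sm_version / 10 == 11;
  return same_major || sm100_on_sm110;
}
```

Note the fallback is one-directional: SM110 configs are not accepted on SM100 devices.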

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Changes are confined to two specific functions with straightforward parameter updates and conditional logic modifications
  • SM architecture compatibility logic is localized and easy to verify
  • No complex branching or multi-file dependencies introduced
  • Primary concern: ensure TMA scheduling is appropriate for SM110 and SM100/110 compatibility doesn't introduce correctness issues

Possibly related PRs

Suggested reviewers

  • djmmoss
  • cyx-6
  • wenscarl
  • bkryu
  • yzh119

Poem

🐰 TMA schedules now bloom, so bright and keen,
On SM110's stage, a GPU dream,
With backward compat, SM100 hops along,
The kernels dance faster—our code grows strong!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title 'Fix for moe on sm110' directly addresses the main change: fixing MOE kernel execution on SM110 devices by adjusting epilogue scheduling and validation logic.
  • Description check: ✅ Passed. The description includes a clear explanation of the problem, the solution, and notes on future work. It covers the main sections but lacks content in the Related Issues and Reviewer Notes sections.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Contributor

Summary of Changes

Hello @jhalabi-nv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces critical fixes and optimizations for Mixture of Experts (MoE) General Matrix Multiply (GEMM) operations specifically targeting SM110 architectures. The changes focus on enhancing performance by utilizing Tensor Memory Accelerator (TMA) for epilogue scheduling, improving compatibility by allowing SM100 configurations to run on SM110, and refining the kernel dispatch logic to ensure SM110 devices leverage the most appropriate specialized kernels for efficient execution.

Highlights

  • Epilogue Scheduling Optimization: Updated the get_candidate_configs_sm110 function to explicitly use EpilogueScheduleType::TMA for all CutlassGemmConfig instances, including the fast build configuration, for SM110 architectures. This change likely leverages Tensor Memory Accelerator (TMA) for improved performance in the epilogue phase of GEMM operations.
  • SM100/SM110 Compatibility: Modified the moe_gemm_template_dispatch.h file to allow configurations designed for SM100 devices to run on SM110 devices. This ensures broader compatibility and flexibility for MoE GEMM operations across these architectures.
  • Refined Dispatch Logic for SM110: Adjusted the dispatchMoeGemmFinalDispatchTmaWarpSpecialized function in moe_gemm_template_dispatch_tma_ws.h to correctly categorize SM110 within the compute capability checks. This change ensures that SM110 devices are routed to the appropriate specialized kernel dispatch paths, separating them from SM100 and grouping them with SM90 or SM120+ for certain kernel selections.
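The routing described in the last highlight can be sketched with if constexpr. KernelPath and selectPath are invented names; the capability boundaries follow the description above, with SM100 separated out and SM110 grouped with SM90 and SM120+ for this particular kernel selection.

```cpp
// Illustrative sketch of compile-time routing on the minimum compute
// capability; the real logic branches on Arch::kMinComputeCapability
// inside the TMA warp-specialized dispatch. Path names are hypothetical.
enum class KernelPath { Sm100Specialized, GenericTmaWs };

template <int MinComputeCapability>
constexpr KernelPath selectPath() {
  if constexpr (MinComputeCapability >= 100 && MinComputeCapability < 110) {
    // SM100-only specialization.
    return KernelPath::Sm100Specialized;
  } else {
    // SM90, SM110, and SM120+ share the other path for this selection.
    return KernelPath::GenericTmaWs;
  }
}
```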
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to add support for MoE kernels on the SM110 architecture. The changes involve updating kernel candidate configurations and modifying dispatch logic. Specifically, it allows SM100 kernel configurations to be used on SM110 devices and adjusts compile-time kernel selection for SM110. My review has identified a significant logical inconsistency between the runtime configuration generation and the compile-time kernel specializations for SM110, which results in SM110-specific code paths being unreachable. I've provided comments with high and medium severity to highlight these issues and suggest a more consistent approach for SM110 support.

Comment on lines 799 to 804
```cpp
TLLM_CHECK_WITH_INFO(
    (inputs.gemm_config.sm_version / 10 == sm_ / 10) ||
        // allow sm100 configs to run on sm110 as well
        (inputs.gemm_config.sm_version / 10 == 10 && sm_ / 10 == 11),
    "Using SM %d configuration for SM %d device", inputs.gemm_config.sm_version, sm_);
```

Severity: high

There appears to be a logical inconsistency in how SM110 support is being added. This change allows SM100 kernels to run on SM110 devices, which is necessary because get_candidate_configs_sm110 in cutlass_heuristic.cpp generates configurations with sm_version=100 instead of 110.

However, other changes in moe_gemm_template_dispatch_tma_ws.h introduce compile-time logic for SM110-specific kernels. These kernels will never be dispatched because the runtime configuration will always be for SM100.

This makes the SM110-specific code paths dead code and the overall approach confusing. A more robust solution would be:

  1. Update get_candidate_configs_sm110 to generate configs with sm_version=110.
  2. Add a dispatch case for sm_version=110 in dispatchMoeGemmSelectTileShapeTmaWarpSpecialized.
  3. This would make this explicit check for SM100 on SM110 unnecessary.

If the intention is to use SM100 kernels on SM110 for now, the changes for SM110-specific kernels in moe_gemm_template_dispatch_tma_ws.h should probably be in a separate, future PR to avoid confusion.
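The suggested direction above can be sketched with hypothetical names (GemmConfig, getCandidateConfigsSm110, dispatchBySmVersion): tag SM110 configs with their own version and dispatch on that tag directly, which would make the SM100-on-SM110 fallback unnecessary.

```cpp
#include <vector>

// Hypothetical stand-in for the real config type.
struct GemmConfig {
  int sm_version;
};

// Step 1 of the suggestion: generate configs tagged sm_version = 110
// rather than reusing sm_version = 100 configs on SM110 devices.
std::vector<GemmConfig> getCandidateConfigsSm110() {
  return {GemmConfig{110}};
}

// Step 2: add an explicit dispatch case for sm_version == 110.
int dispatchBySmVersion(GemmConfig const& config) {
  switch (config.sm_version) {
    case 100: return 100;  // SM100 kernel path
    case 110: return 110;  // SM110 kernel path
    default:  return -1;   // unsupported
  }
}
```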

```diff
  }

- if constexpr (Arch::kMinComputeCapability >= 100 && Arch::kMinComputeCapability < 120) {
+ if constexpr (Arch::kMinComputeCapability >= 100 && Arch::kMinComputeCapability < 110) {
```

Severity: medium

This change, along with the one on line 207, creates a specific compile-time path for SM110. However, as noted in my main comment on moe_gemm_template_dispatch.h, the current runtime logic seems configured to use SM100 kernels on SM110 devices. This makes this new SM110-specific path unreachable. If these changes are preparatory for full SM110 kernel support in a future update, it might be better to introduce them then to avoid having dead code in the repository.

```diff
      selected_func(hopper_input, num_experts, multi_processor_count, stream, occupancy,
                    workspace_size, cluster_shape_cute, cluster_shape_cute_fallback);
- } else if constexpr (Arch::kMinComputeCapability >= 120 || Arch::kMinComputeCapability == 90) {
+ } else if constexpr (Arch::kMinComputeCapability >= 120 || Arch::kMinComputeCapability == 90 || Arch::kMinComputeCapability == 110) {
```

Severity: medium

Similar to the change on line 179, this adds SM110 to this kernel dispatch path. As mentioned in my other comments, this path appears to be unreachable with the current configuration generation logic, which defaults to using SM100 kernels on SM110 hardware. This might be dead code until the configuration generation and dispatch logic are updated to handle sm_version=110.

@jhalabi-nv jhalabi-nv marked this pull request as ready for review December 10, 2025 01:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

797-803: Consider improving clarity of the SM version compatibility check.

The added condition correctly allows SM100 configs to run on SM110 devices as an interim solution. However, the magic numbers and division logic could be more explicit.

Consider extracting the major version logic into named constants or a helper function:

```cpp
// At file or namespace scope
constexpr int SM_MAJOR_VERSION_100 = 10;
constexpr int SM_MAJOR_VERSION_110 = 11;

inline int getSmMajorVersion(int sm_version) { return sm_version / 10; }

// In the check:
TLLM_CHECK_WITH_INFO(
    (getSmMajorVersion(inputs.gemm_config.sm_version) == getSmMajorVersion(sm_)) ||
        // Allow SM100 configs to run on SM110 as interim solution
        (getSmMajorVersion(inputs.gemm_config.sm_version) == SM_MAJOR_VERSION_100 &&
         getSmMajorVersion(sm_) == SM_MAJOR_VERSION_110),
    "Using SM %d configuration for SM %d device", inputs.gemm_config.sm_version, sm_);
```

Also, verify that the TMA epilogue schedule pinning in cutlass_heuristic.cpp properly coordinates with this dispatch logic:

```bash
#!/bin/bash
# Verify the flow from config generation to dispatch for SM110
rg -n "get_candidate_configs_sm110|EpilogueScheduleType::TMA" --type=cpp -A3 -B3
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fe01a2 and fa5886d.

📒 Files selected for processing (2)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2 hunks)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files:

  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
🔇 Additional comments (2)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)

572-580: LGTM! Consistent TMA epilogue schedule applied to all SM110 configs.

The change ensures all generated SM110 configurations use EpilogueScheduleType::TMA instead of AUTO, which is consistent with the FAST_BUILD path at line 536 and addresses the issue where AUTO wasn't being resolved by the dispatch function.


533-537: Confirm TMA is the only valid epilogue schedule for SM110 MOE kernels.

The change from EpilogueScheduleType::AUTO to EpilogueScheduleType::TMA pins the epilogue schedule for FAST_BUILD on SM110. Please verify that TMA is indeed the only supported epilogue schedule for SM110 MOE kernels by checking the SM110-specific dispatch logic and any constraints documented in the codebase.

Collaborator

yzh119 commented Dec 10, 2025

/bot run


@yzh119 yzh119 left a comment


LGTM, should be ready to merge once gitlab CI passed.

@flashinfer-bot
Copy link
Collaborator

GitLab MR !188 has been created, and the CI pipeline #39986560 is currently running. I'll report back once the pipeline job completes.
