@Kepontry Kepontry commented Nov 7, 2025

Summary

This PR extends AddSwKernelInstructionPrefetchPass so that SHAVE kernel instructions can still be prefetched after the first SHAVE task when the initial slack is insufficient.

Currently, instruction prefetching is skipped if the slack before the first SHAVE task is insufficient. This limitation is suboptimal when initial insertion slots (tiles) are limited or L2 cache capacity is constrained.

Based on the observation that SHAVE utilization is often low, we propose this change to prefetch opportunistically later in the schedule. This approach has demonstrated a ~3% performance gain on models such as Qwen2-1.5b and Qwen3-0.6b.

Target Platform For Release Notes

  • NPU37XX
  • NPU40XX
  • NONE (Not included in release notes)

Classification of this Pull Request

  • Maintenance
  • BUG
  • Feature

Implementation Details

  • The new logic searches for insertion gaps that begin at a "non-saturated" point (where num_shave_tasks < available_shave_count).
  • The gap ends at either a "saturated" point or the kernel designated for prefetching.
  • The prefetch operation is inserted at the tile3 task of the identified insertion point.
  • The minimal insertion slack required is set to 50K cycles.
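
The gap search described in the list above can be sketched as follows. This is a minimal illustration in Python, not the pass's actual C++ code: SchedulePoint, find_insertion_gap, and the flat list of schedule points are hypothetical names introduced here, and the tile3 selection is left out. Only the non-saturated/saturated rule and the 50K-cycle minimum slack come from the description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Minimal insertion slack from the PR description (50K cycles).
MIN_INSERTION_SLACK_CYCLES = 50_000


@dataclass(frozen=True)
class SchedulePoint:
    """Hypothetical view of one point in the schedule."""
    start_cycle: int      # cycle at which this point starts
    num_shave_tasks: int  # SHAVE tasks running at this point


def find_insertion_gap(
    points: Sequence[SchedulePoint],
    available_shave_count: int,
    target_idx: int,
    min_slack: int = MIN_INSERTION_SLACK_CYCLES,
) -> Optional[Tuple[int, int]]:
    """Return (gap_start_idx, gap_end_idx) of the first usable gap, or None.

    A gap opens at a non-saturated point and closes at the next saturated
    point or at the kernel designated for prefetching (target_idx).
    """
    gap_start: Optional[int] = None
    for idx in range(target_idx + 1):
        point = points[idx]
        saturated = point.num_shave_tasks >= available_shave_count
        closes_gap = saturated or idx == target_idx
        if gap_start is not None and closes_gap:
            slack = point.start_cycle - points[gap_start].start_cycle
            if slack >= min_slack:
                return gap_start, idx
            gap_start = None  # gap too short, keep searching
        if gap_start is None and not closes_gap:
            gap_start = idx
    return None
```

With two available SHAVEs, a schedule whose first idle window lasts 20K cycles and whose second lasts 80K cycles would yield the second window, since only that one meets the minimum slack.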

Additional Fixes & Enhancements

  • Corrected an issue in insertDummyKernelOpBeforeFirstKernelTask where the clusterIdx was not being used during tile assignment.
  • Expanded Prefetching: Enriched the "kernel kind" logic to allow more types of kernels to be prefetched.

We also noted that the previous 250K-cycle threshold is overly conservative for our platform (Ultra 258V). Our analysis shows that prefetching provides benefits even with a slack as low as 170K cycles.

@Kepontry Kepontry requested a review from a team as a code owner November 7, 2025 08:18
@DariaMityagina
Contributor

DariaMityagina commented Nov 7, 2025

@Kepontry hello! Thanks for your PR!

Could you please ensure your changes include test coverage by adding tests to https://github.com/openvinotoolkit/npu_compiler/blob/develop/tests/lit/NPU/dialect/VPUIP/passes/add_sw_kernel_instruction_prefetch_40XX.mlir and maybe some functional tests?

@DariaMityagina
Contributor

@Kepontry could you please look into this comment? Thank you!

@Kepontry
Author

Apologies for the delay; I missed the email notification for this thread. I am currently working on adding the test. Could you provide some guidance or documentation on how to use the lit test framework within the NPU compiler?

@Kepontry
Author

Functional test added.

@Maxim-Doronin
Collaborator

Hi @Kepontry! Please adhere to the clang-format guidelines. You will find the automatically fixed code style in the job logs: https://github.com/openvinotoolkit/npu_compiler/actions/runs/19637113134/job/56230572067?pr=199

I also noticed that some LIT tests failed. Could you please verify if these failures are due to your changes?
https://github.com/openvinotoolkit/npu_compiler/actions/runs/19632108154/job/56223833881

cc @DariaMityagina

@Kepontry
Author

Hi @Maxim-Doronin, the LIT test failure is caused by DummySWKernelsForInstructionPrefetchReservedMemory not being found. I can reproduce this error by setting the minimum-shave-start-time-for-prefetch threshold to 5 in the default_hw_mode_40XX test, so the problem predates this PR. I suspect that createSWKernelInstructionPrefetchReserveMemForDummyKernelsPass is not called in the VPU pipeline, but I am not entirely sure. I would appreciate your help verifying this.

vpux-opt --split-input-file --init-compiler="vpu-arch=NPU40XX compilation-mode=DefaultHW allow-custom-values=true" --mlir-elide-elementsattrs-if-larger 8 --default-hw-mode-vpuip="function-outlining='naive'" --add-sw-kernel-instruction-prefetch="minimum-shave-start-time-for-prefetch=5" default_hw_mode_40XX.mlir

@DariaMityagina
Contributor

Hello @Kepontry!

Thanks for adding the tests and sharing your findings regarding pre-commit failures!
I'll check them locally and get back to you.

@DariaMityagina
Contributor

DariaMityagina commented Nov 27, 2025

@Kepontry I managed to reproduce the issue:

vpux-opt --split-input-file --init-compiler="vpu-arch=NPU40XX compilation-mode=DefaultHW allow-custom-values=true" --mlir-elide-elementsattrs-if-larger 8 --default-hw-mode-vpuip="function-outlining='naive'" --add-sw-kernel-instruction-prefetch="minimum-shave-start-time-for-prefetch=5" default_hw_mode_40XX.mlir

->

Cannot find DummySWKernelsForInstructionPrefetchReservedMemory!

Will research it a bit and get back!

In the meantime, could you please share with us why you set this particular value?
--add-sw-kernel-instruction-prefetch="minimum-shave-start-time-for-prefetch=5"

@Kepontry
Author

Hi @DariaMityagina, thanks for your help with this issue. Because this PR enables prefetching regardless of the first SHAVE task's start-time threshold, it exposed existing bugs in some test cases. These bugs were previously hidden because there was not enough slack to trigger the prefetch logic. I lowered the threshold to force prefetch insertion, which confirms that these test cases also fail without this PR.

@Kepontry
Author

Kepontry commented Dec 9, 2025

Hi @DariaMityagina,

I have fixed the failed tests. The root cause was that setDummySwKernelsForInstructionPrefetchReservedMemory is normally invoked during the VPU pipeline. Since these tests target the VPUIP pipeline in isolation, the required memory attribute was missing from the input module.

I fixed this by manually adding the DummySWKernelsForInstructionPrefetchReservedMemory resource to the MLIR files, following the pattern used by the tests in the tests/lit/NPU/dialect/VPUIP/passes directory. I have also resolved the clang-format issues.

@DariaMityagina
Contributor

Thanks a lot for the updates!
Let's wait for the precommit results. In the meantime, we'll do another round of reviews.

@Kepontry
Author

The failure log shows that the assertion addr + size <= _totalSize failed. However, when I run the following command in my local environment, the test passes.

vpux-opt --split-input-file --init-compiler="vpu-arch=NPU37XX compilation-mode=DefaultHW allow-custom-values=true" --mlir-elide-elementsattrs-if-larger 8 --default-hw-mode-vpuip="function-outlining='naive'" default_hw_mode_repeating_blocks.mlir | FileCheck default_hw_mode_repeating_blocks.mlir

I suspect this issue is related to the recently inserted MLIR code.

    config.Resources {activity_factor = 0.078934384661980161 : f64} 2 of @NCE at 1.700000e+03 MHz {
        builtin.module @ReservedMemory {
            module @DummySWKernelsForInstructionPrefetchReservedMemory {
                config.MemoryResource 8 bytes of @CMX_NN offset 1474552
            }
        }
        config.MemoryResource 1326182 bytes of @CMX_NN_FragmentationAware
        config.MemoryResource 1473536 bytes of @CMX_NN {config.bandwidth = 64 : i64, config.derateFactor = 1.000000e+00 : f64}
        config.ExecutorResource 2 of @SHAVE_ACT
        config.ExecutorResource 1 of @DPU
    }

I am currently uncertain whether the root cause involves the specific values (1326182 or 1473536) or the reserved memory allocation. Since the code runs on both NPU37XX and NPU40XX, I made modifications according to the implementation found in feasible_allocation.mlir.

    config.Resources 1 of @NCE at 1.300000e+03 MHz {
        builtin.module @ReservedMemory {
            module @DmaProfilingReservedMemory {
                config.MemoryResource 512 bytes of @CMX_NN offset 0
            }
        }
    }

I hope this resolves the issue. Alternatively, could you provide the scripts needed to reproduce this failure in the CI environment?

@Kepontry
Author

Kepontry commented Dec 15, 2025

Fixed the failing tests on NPU37XX by adjusting the offset of the reserved memory.

Similar changes were applied to the NPU40XX tests.
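
For reference, the assertion addr + size <= _totalSize is a plain bounds check on the reserved region. A minimal sketch of that arithmetic, assuming _totalSize corresponds to the CMX_NN size declared in the test module (this mapping is an assumption, not something confirmed in this thread); reserved_region_fits is a hypothetical helper, and the second offset below is only an illustrative in-range value, not the offset actually used in the fix:

```python
def reserved_region_fits(offset: int, size: int, total_size: int) -> bool:
    """Mirror the shape of the check `addr + size <= _totalSize`."""
    return offset >= 0 and size >= 0 and offset + size <= total_size


# Values taken from the MLIR snippet posted earlier in this thread.
CMX_NN_BYTES = 1_473_536   # config.MemoryResource ... of @CMX_NN
RESERVED_OFFSET = 1_474_552
RESERVED_SIZE = 8

# 1_474_552 + 8 = 1_474_560, which exceeds 1_473_536, so the check fails
# under the assumption stated above.
fits_original = reserved_region_fits(RESERVED_OFFSET, RESERVED_SIZE, CMX_NN_BYTES)

# Illustrative only: placing the 8 bytes at the very end of the declared
# region satisfies the check.
fits_at_end = reserved_region_fits(CMX_NN_BYTES - RESERVED_SIZE, RESERVED_SIZE, CMX_NN_BYTES)
```

Under that assumption, an 8-byte region at offset 1474552 ends at 1474560, past the declared 1473536 bytes, which would be consistent with the assertion firing.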

Note: I'm not yet sure why the add_sw_kernel_instruction_prefetch_40XX.mlir test passes, as it uses the same offset as the failing ones.

@Kepontry
Author

Hi @DariaMityagina, prefetching for the TopK kernel was still problematic due to its complexity, so I decided to remove it for now. Local tests are passing. I also fixed a segfault in the logging logic.

@Kepontry
Author

Kepontry commented Dec 17, 2025

Hi @DariaMityagina, since TopK is no longer supported for prefetching, I replaced it with a Convert kernel in the failing test (add_sw_kernel_instruction_prefetch_mid_execution_40XX.mlir). All tests should pass now. Thanks for your patience.

@DariaMityagina
Contributor

Hello @Kepontry! Great! Thank you!

@Kepontry
Author

Hi @DariaMityagina, I refactored the code as requested: I moved the variables into the class and added comments. Thanks for pointing these out.

@DariaMityagina
Contributor

Hello @Kepontry! Thank you!
We'll perform additional tests to verify the changes and get back to you shortly.
