Releases: iree-org/wave
v3.9.1
Release v3.9
Wave v3.9 Release
1. New Ops and Kernel Features
1.1 Fine-grained Pipeline Control using wave.schedule Construct
Introduces a new wave.schedule language feature enabling explicit control of pipeline staging. Authors can group operators into stages and control pipelining behavior. A ScheduleRegionGraph and ScheduleContext were added to support schedule tracing. (#333)
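As an illustration of the staging idea only (not the `wave.schedule` API from #333), the self-contained sketch below shows what explicit two-stage pipelining means: the author decides that loads for the next tile overlap with compute on the current one. The names `pipelined`, `load`, and `compute` are hypothetical.

```python
# Illustrative analogy only; these names are hypothetical, not the
# wave.schedule surface from #333.
def pipelined(n_tiles, load, compute):
    """Two-stage software pipeline: load tile i+1 while computing tile i."""
    tile = load(0)  # prologue: run stage 0 for the first tile
    results = []
    for i in range(n_tiles):
        nxt = load(i + 1) if i + 1 < n_tiles else None  # stage 0, iteration i+1
        results.append(compute(tile))                   # stage 1, iteration i
        tile = nxt
    return results

print(pipelined(3, load=lambda i: i * 10, compute=lambda t: t + 1))  # [1, 11, 21]
```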
1.2 Float Remainder Op (remf)
Adds remf to the Wave dialect, implementing floating-point remainder semantics matching the arith dialect. (#279)
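`arith.remf` (like LLVM's `frem`) follows C `fmod`-style semantics, where the result takes the sign of the dividend; a quick Python illustration of that behavior:

```python
import math

# fmod-style remainder: the result keeps the sign of the dividend (lhs).
assert math.fmod(7.5, 2.0) == 1.5
assert math.fmod(-7.5, 2.0) == -1.5  # sign follows the dividend...
assert math.fmod(7.5, -2.0) == 1.5   # ...not the divisor
```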
1.3 Tensor Load Enhancements
- Basic Tensor Load Op Integration - Adds initial tensor load support with a data mover integrated into single-wave GEMM. (#379)
- Unaligned Shape Support - Tensor load now supports unaligned shapes via computed descriptor dims and refined local bounds logic; stride computation moved entirely to handlers. (#399)
- Shared Memory Padding - Tensor load ops now support padded shared-memory allocations; padding is preserved or dropped based on backend capability. (#408)
- Tensor Load Multicast - Introduces multicast optimization for tensor loads to share a single load across workgroups in a cluster. (#437)
- Tensor Waitcnt - Adds support for tensor-level waitcnt logic required for correct tensor load behavior. (#383)
1.4 Multi-Wave Execution and Support
- Support added for executing a single 16×16×16 MMA across multiple waves and workgroups, with lit and e2e tests. (MI350 supported; MI25x not). (#442)
- Extends TDM (Tensor Data Movement) op support by using `WaveConstraint` to compute per-wave tile sizes. (#463)
2. Compiler & Backend Enhancements
2.1 New ASM Backend (Experimental)
- Adds an experimental assembly backend lowering MLIR → AMD GCN ISA directly. Includes instruction support, expression simplification, tests, and documentation. Currently supports a copy example and is runnable only with the wave runtime (no VMFB). (#356)
- Adds lowering for a 16×16×16 MMA into the ASM backend, staging lhs/rhs through shared memory. Includes lit and e2e tests. (#404)
- Replaces hardcoded loop scheduling with latency-driven scheduling. Adds a dedicated ticketing class for vmcnt/lgkmcnt placement. (#428)
- Documentation added for the ASM backend, its capabilities, and workflow. (#356)
2.2 Optimized Memory Waitcnt for Async BF16 PP GEMM
Adds memory_counter_wait op and integrates it to optimize waitcnt placement in async BF16 pipelined GEMM. (#436)
2.3 Ping-Pong GatherToLDS for F16 GEMM
Adds a GatherToLDS ping-pong pipeline implementation, fixes dot-slicing bugs, and cleans waitcnt emission after upstream LLVM fixes. (#431)
3. Runtime / Integration Improvements
3.1 Wave as a TorchDynamo Custom Backend
Wave kernels can now be selected via:
torch.compile(MyModel, backend="wave")
Currently replaces torch.mm with Wave GEMM kernels; other ops fall back to eager execution. (#396)
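A minimal usage sketch (the `MyModel` module is a hypothetical example; a GPU visible to PyTorch and the backend registration from #396 are assumed):

```python
import torch

class MyModel(torch.nn.Module):
    def forward(self, a, b):
        return torch.mm(a, b)  # currently intercepted and lowered to a Wave GEMM

compiled = torch.compile(MyModel(), backend="wave")

a = torch.randn(128, 256, device="cuda", dtype=torch.float16)
b = torch.randn(256, 64, device="cuda", dtype=torch.float16)
out = compiled(a, b)  # ops other than torch.mm run in eager mode
```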
Change Log
Git History
What's Changed
- Bump version to 3.8.0 by @sa-faizal in #362
- import the second batch of the wave dialect commits by @ftynse in #338
- [Runtime] t.dlpack fallback support for pytorch compatibility by @Megan0704-1 in #363
- Update pytorch rocm requirements from 6.3 to 6.4 by @Megan0704-1 in #364
- Add wave.schedule for more fine grained control by @harsh-nod in #333
- Forward codegen info for read/write to wave attributes by @tyb0807 in #355
- Fix typos in gather_to_shared by @ftynse in #365
- Add asm backend for compiler by @harsh-nod in #356
- Remove vector.splat by @tgymnich in #366
- Switch emitter to use only upstream dialects by @Hardcode84 in #359
- [Water] Fix Include Cycle by @tgymnich in #369
- Backport: Replace deprecated op vector.splat with vector.broadcast (#66) by @tgymnich in #372
- [Synchronization] split barriers support in add_shared_memory_barriers pass by @Megan0704-1 in #351
- Fix documentation by @harsh-nod in #374
- Combine WaveExprAttr and WaveExpressionAttr by @tgymnich in #373
- fix duplicated wave prefix by @tgymnich in #376
- [debugging] debug_log extra_iteration_dimensions by @willghatch in #375
- Broader support for Water emission by @martin-luecke in #371
- Water Diagnostic Serialization by @tgymnich in #378
- improve lit compatibility for sharktank_integration test by @willghatch in #381
- Simplify hardware transpose index calculation by @harsh-nod in #382
- Bump IREE requirement pins to their latest versions by @raikonenfnu in #385
- Bump the github-actions group across 1 directory with 3 updates by @dependabot[bot] in #384
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #387
- Tensor waitcnt support by @Megan0704-1 in #383
- [Wave] Add remf op by @Megan0704-1 in #279
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #392
- Tensor load op support by @Megan0704-1 in #379
- Fix gather-to-lds tail padding calculations by @Hardcode84 in #393
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #397
- Use correct offset for base element of transposed-load operation by @ashay in #388
- Unaligned shapes support with tensor load op by @Hardcode84 in #399
- Standalone examples by @panditsa in #367
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #405
- [Compiler][NFC] Annotate GEMM operation on schedule by @raikonenfnu in #400
- Update wmma codegen for gfx1250 by @Megan0704-1 in #407
- Move wave index attribute representation to DictArrayAttr by @martin-luecke in #386
- Keep all the dims for scan-op by @panditsa in #368
- Updated requirements such that wave-lang picks the right version of iree dependencies by @xintin in #395
- bumped wave-lang to 3.8.1 by @sa-faizal in #413
- Tensor load padding support by @Hardcode84 in #408
- [NFC] xfail CI machine specific failure by @raikonenfnu in #412
- Add sample MMA lowering for asm backend by @harsh-nod in #404
- Add lowering for unary wave ops by @martin-luecke in #424
- Rewrite more reads and writes with gather-to-lds operations by @ashay in #377
- [water] Fix HardwareConstraints verifiers by @tgymnich in #421
- Add Wave as a custom dynamo backend by @nithinsubbiah in #396
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #430
- Add latency based scheduling for asm backend by @harsh-nod in #428
- Manual shared-memory management in GEMM by @panditsa in #391
- Add wave.mma MLIR lowering by @martin-luecke in #429
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #432
- [water] add C-bindings for constraints by @tgymnich in #420
- [water] remove wg_constraint parameter from wave_constraint by @tgymnich in #425
- [water] handle device constraint by @tgymnich in #426
- [Compiler] Add PP for GatherToLDS based F16 GEMM by @raikonenfnu in #431
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #433
- [Compiler][Gemm] Optimize memory waitcnt for Async BF16 PP GEMM by @raikonenfnu in #436
- [water] Lower wave constraints to MLIR by @tgymnich in #422
- gitignore water artifacts by @Hardcode84 in #439
- Add -Wno-macro-redefined to cmake by @ftynse in #447
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #438
- GEMM two_pp_cluster scheduling by @panditsa in #435
- wait tensorcnt support by @Megan0704-1 in #406
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in https://github.com/iree-org/wav...
v3.8.2
Release wheels for v3.8.2.
v3.8.1.post1
Release wheels for v3.8.1.post1.
Release v3.8.0
Wave v3.8 Release
New Website – Launched a new website to help users learn more about Wave and explore its features: https://www.wave-lang.com/
New Ops and Kernels
- TopkOp Implementation — Implements TopkOp, modeled after ReduceOp, using iterative reduction and masking (similar to sglang's moe_topk_softmax kernel); see the PyTorch sketch after this list. #277
- RoundOp Added — Introduced RoundOp for Wave kernels. #283
- Implement Broadcasting for SelectOp — Generalized broadcast handling from binary ops to any-arity ops, used by Topk and similar kernels. #251
- [WMMA] v_wmma_f32_16x16x16_f16 Type Support — Added RDNA MMA type for 16×16×16 mixed-precision matrix operations. #306
- Binary Ops Lowering — Added lowering support for binary operations to expand backend compatibility. ftynse/water#36
- Distributed GEMM Across Multi-GPU Devices — Enables distributed matrix multiplication with per-dimension device partitioning via DeviceConstraint. #302
- Dynamic AtomicOp Indexing — Adds mapping_dynamic_vals for runtime-computed indices in atomic operations (used in MoE alignment). #269
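The reduce-and-mask formulation behind TopkOp can be sketched in plain PyTorch (illustrative only, not the Wave implementation; `iterative_topk` is a hypothetical name): take the running max, record it, then mask the winner out so the next pass finds the next-largest element.

```python
import torch

def iterative_topk(x: torch.Tensor, k: int):
    # Reduce-and-mask: each pass is a max-reduction followed by masking
    # the winner, mirroring the TopkOp formulation described above.
    work = x.clone()
    values, indices = [], []
    for _ in range(k):
        v, i = work.max(dim=-1)
        values.append(v)
        indices.append(i)
        work.scatter_(-1, i.unsqueeze(-1), float("-inf"))
    return torch.stack(values, dim=-1), torch.stack(indices, dim=-1)

vals, idxs = iterative_topk(torch.tensor([[3.0, 1.0, 4.0, 1.5]]), k=2)
# vals == [[4.0, 3.0]], idxs == [[2, 0]]
```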
Kernel Optimization
- Linearize Shared Memory Accesses — Linearizes shared-memory reads/writes to improve register reuse and enable better common subexpression detection (illustrated in the sketch after this list). #275
- Scalarize Packed Math — Added option to scalarize packed addf/mulf near MFMA ops to avoid hardware-induced performance penalties. #274
- Hardware Transpose Support (gfx950+) — Enables amdgpu.transpose_load for native hardware transpose on MI350+ GPUs. #285
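Why linearization helps, in a self-contained sketch (illustrative only, not the compiler pass): folding a 2-D access into one flat offset exposes `row * stride` as a common subexpression that can be computed once and kept in a register across column accesses.

```python
# Flat stand-in for a 16 x 64 shared-memory tile.
STRIDE = 64
smem = list(range(16 * STRIDE))

row, col0, col1 = 3, 5, 9
base = row * STRIDE        # common subexpression, computed once
x0 = smem[base + col0]     # both accesses reuse the same base
x1 = smem[base + col1]
assert (x0, x1) == (smem[row * STRIDE + col0], smem[row * STRIDE + col1])
```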
Compiler Enhancements
- Normal Forms Framework — Introduces WaveNormalFormAttr for enforcing IR invariants and managing pass pre-/post-conditions at fine granularity. ftynse/water#41, ftynse/water#57
- Dataflow-Based Shape Inference — Adds forward/backward dataflow analysis for shape inference, ensuring convergence and conflict detection. ftynse/water#20
- Hyperparameter and Index Mapping Attributes — Adds WaveHyperparameterAttr and WaveIndexMappingAttr for symbol–value mappings and affine index modeling. ftynse/water#23, ftynse/water#30, ftynse/water#42
- Lowering Pipeline Setup + RegisterOp Lowering — Establishes type converter (wave.tensor → vector/memref) and first lowering pattern (RegisterOp). ftynse/water#28
- Wave Dialect GEMM Representation — Adds high-level and lowered GEMM kernels using new MMA type variants. ftynse/water#27
- Wave Dialect Initialization — Created base dialect structure with symbol attributes, tensor types, symbolic shapes, and minimal MMA op. ftynse/water#17
- Wave Dialect MLIR Converter & Emitter Smoketest — Added converter and water emitter to translate Wave kernel traces into MLIR Wave dialect. #273
- Partition Gathers/Scatters Pass — Moved gather/scatter decomposition into a standalone pass for cleaner read/write handlers. #259
Scheduling / Runtime Improvements
- Standalone Multi-Device Runtime Wrapper — Introduces MultiDeviceLaunchable, a minimal API for executing IREE models across multiple GPUs. #222
- Full Trace Preservation After Compilation — WaveKernel now stores full post-pass traces for accurate Wave dialect emission and debugging. #298
Documentation / Developer Experience
- Normal Forms Documentation — Added detailed explanation of normal form concepts, invariants, and pass conditions. ftynse/water#57
- Wave Dialect Python Bindings — Added C API, Python bindings (water_mlir.dialects.wave), and CI integration for build verification. ftynse/water#19
- Debug Env Var for Location Control — Added environment variable to dynamically control debug location levels for easier inspection. #289
Change Log
Git History
What's Changed
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #272
- hip-iree-target parsing non-colon fix by @lamikr in #271
- Remove some unused TK remains. by @Hardcode84 in #268
- Bump the github-actions group with 3 updates by @dependabot[bot] in #276
- Change ROCm7 [docker installation] with TheRock in mi35x CI runner by @sa-faizal in #263
- Replace target-backends with target-device argument by @panditsa in #221
- Scalarize packed math by @Hardcode84 in #274
- [Wave] Linearize shared memory accesses by @raikonenfnu in #275
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #281
- Multibuffer liveness analysis by @Hardcode84 in #250
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #286
- add round op by @saladpalad in #283
- Capture locations for Iterate and Conditional by @willghatch in #290
- Add trace to WaveKernel even when only compile to MLIR by @tyb0807 in #294
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #293
- [Dist.] Standalone multi-device runtime wrapper by @panditsa in #222
- Store full trace after all compilation passes in WaveKernel by @tyb0807 in #298
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #299
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #303
- Bump the github-actions group with 2 updates by @dependabot[bot] in #300
- Add a dedicated pass to partition gathers/scatters by @Hardcode84 in #259
- add debug env var to set default location level by @willghatch in #289
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #305
- Integrate Water sources from a separate repository by @ftynse in #288
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #307
- [CI] Add GitHub Actions workflow: self-hosted RDNA4 by @Megan0704-1 in #295
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #309
- Bump actions/setup-python from 5.6.0 to 6.0.0 in the github-actions group by @dependabot[bot] in #310
- implement broadcasting for SelectOp by @willghatch in #251
- add locations to placeholder and output nodes based on kernel location by @willghatch in #291
- Fix version mismatch after iree-bump by @Megan0704-1 in #319
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #321
- Fixes `group_size_min` in the reordered gemm by @xintin in #316
- [debugging] propagate locations by @willghatch in #297
- [debugging] add location propagation to replace_all_uses_with by @willghatch in #296
- Remove migration notice from README by @harsh-nod in #317
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #324
- add location_check_pass by @willghatch in #292
- [debugging] fix minor nit in test case by @willghatch in #323
- Add support for hardware transpose operation by @ashay in #285
- [CI] Update installation of user space libraries for RDNA4-CI by @Megan0704-1 in #325
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #326
- Added C-style modulo/remainder operator by @nirmie in #322
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #328
- Use dynamic values for AtomicOps by @panditsa in #269
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #331
- [WMMA] v_wmma_f32_16x16x16_f16 type support by @Megan0704-1 in #306
- Bump IREE requirement ...
Wave Release v3.7.0
Highlights in this release
Starting with this release, Wave will adopt a new versioning scheme to align with Shark-AI and IREE release versions. This change is intended to improve cross-project compatibility and simplify dependency management across the ecosystem.
- Previous Wave version: v1.0.2
- New version format: v3.7.0 (aligned with Shark-AI/IREE)
This change does not affect functionality or backwards compatibility, but version numbers going forward will reflect the aligned release cadence.
New Operators and Kernels
- Reciprocal of square root operator (#187)
- Sinh operator (#112)
- RMSNorm Kernel (#100)
- Scatter add operation (#56)
Documentation
- Buffer Loads, Stores, and L1 Cache Swizzling (https://wave-lang.readthedocs.io/en/latest/wave/buffer_and_swizzle.html)
- Debugger use instructions (https://wave-lang.readthedocs.io/en/latest/wave/debugging.html)
- Matrix Addition Example (https://github.com/iree-org/wave/blob/9accd0bc13384c3aecae4154fdd07d04a81069ff/examples/jupyter/matrix_addition.ipynb)
- Convolution 2D docs (https://wave-lang.readthedocs.io/en/latest/wave/conv.html)
- Thread trace documentation (https://wave-lang.readthedocs.io/en/latest/wave/trace.html)
Kernel Optimizations
- Compatibility with tensors from HAL sub-allocation, and scaling in BroadcastOp (#220, #219)
- Changes to Speculative decode kernel (#75, #72)
Compiler Enhancements
- Added conditional barriers to attention schedule (#223)
- Extract dimensions from the input tensors instead of passing them as arguments (#183)
- Compilation time boost (#157)
- Enable async kernel execution with iree runtime and fully switch to Launchable (#11)
- Checks to validate constraints for workgroup and wavefronts (#134)
- New API for IndexMapping (#141, #151)
- Build aplp in setup.py (#103)
- Introduced APIs to support distributed workloads; implementation is ongoing (#191, #245)
Hardware Bring-up
- Introduced FP16 and BF16 CDNA4 (double-rate) MFMA types (#261)
- Unaligned shapes support in gather_to_shared pass (#60)
- Gather to lds swizzling (#149)
- Add scaled_dim/scaled_gemm support for gather_to_lds (#80)
Scheduling Improvements
- Add 4-stage prefetch + multi-buffering schedule for attention (#214)
- Rotating pointers for multi-buffering (#207)
- Add schedule reordering support for BMK,NK->BMN (#20)
- SchedulingType.FOUR_STAGE: GEMM Full Software Pipelining with Initiation Interval 1, Via Multibuffering (#77)
- Include XCD reordering to template and tweaked PingPong Schedule (#76)
Misc/General Updates
- Speculative decode benchmarking (#136)
- Support unaligned and unconstrained shapes in expansion (#155)
- Improve barrier placement pass (#68)
- Enable buffer load for dynamic cases (#79)
Integration
- Wave is now integrated into SGLang as a separate attention backend (sgl-project/sglang#8660)
- MXFP4 GEMM, extend attention kernel, and decode attention kernel integration into SHARK Tank (nod-ai/amd-shark-ai#1777, nod-ai/amd-shark-ai#2140, nod-ai/amd-shark-ai#1957)
Change Log
Git History
- disable writing to cache when there are debug_log operations by @willghatch in #99
- Rename Turbine to Wave by @tgymnich in #110
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #116
- Build with `uv` in GitHub Actions by @paulzzy in #114
- fix signature of debug_log_write by @willghatch in #118
- Fix race condition in release job by @paulzzy in #117
- [Wave] Pad shared memory when total size is not divisible by `GatherToLDS` size by @Hardcode84 in #113
- [Wave] Remove spammy warning by @Hardcode84 in #120
- rename debug_log_write to debug_log by @willghatch in #121
- Migrate MI300 Capacity to new MI325 Capacity. by @deedongala in #131
- Restore build_tools/update_iree_requirement_pins.py by @paulzzy in #130
- modifying prefetch scheduling to have a more appropriate schedule by @bodhisaha in #122
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #132
- Updated usage instructions by @sa-faizal in #73
- Fix logit_cap calculation in paged attention test by @nithinsubbiah in #135
- Fix missing python-version warning by @tgymnich in #66
- Add initial RMSNorm kernel by @adedespirlet in #100
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #137
- [Wave] Add sinh op by @panditsa in #112
- add mappings to debug_log by @willghatch in #128
- [Wave] Cleanup `elements_per_thread` by @Hardcode84 in #138
- Support unaligned and unconstrained shapes in expansion by @nithinsubbiah in #94
- [Wave] Add pass profiling and optimize `expand_graph` pass by @Hardcode84 in #139
- pytorch 2.8 by @tgymnich in #12
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #145
- Re-enable `manylinux` builds with `cibuildwheel` by @paulzzy in #123
- Add devcontainer by @tgymnich in #146
- [WAVE] Convolution 2D docs by @badgerbroch in #107
- NFC move some reordering logic for debug logs by @willghatch in #119
- Support for TK CI manual dispatch. by @xintin in #148
- [Wave] Introduce new API for IndexMapping by @harsh-nod in #141
- [Wave] Build aplp in setup.py by @harsh-nod in #103
- [Wave] More tests cleanup by @Hardcode84 in #30
- Revert "Support unaligned and unconstrained shapes in expansion" by @raikonenfnu in #154
- [Wave] Added bool to float casting by @badgerbroch in #133
- Support and build for Windows by @paulzzy in #115
- [Wave] Optimize `subs_idxc` by @Hardcode84 in #144
- Remove unused code and rename some variables by @harsh-nod in #158
- Update version number to be consistent with pypi by @harsh-nod in #160
- [Wave] More cleanups by @Hardcode84 in #159
- Add clang-format to pre-commit by @Hardcode84 in #156
- add basic checks to validate constraints for workgroup and wavefronts by @ashay in #134
- SchedulingType.FOUR_STAGE: GEMM Full Software Pipelining with Initiation Interval 1, Via Multibuffering by @SourishW in #77
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #161
- Bump pypa/cibuildwheel from 3.1.2 to 3.1.3 in the github-actions group by @dependabot[bot] in #162
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #164
- [Wave] Fix tail-padded memref allocs when `minimize_shared_allocs` is disabled by @Hardcode84 in #150
- [Wave] Gather to lds swizzling by @Hardcode84 in #149
- [debugging] add printer and handler args to debug_log by @willghatch in #153
- [debugging] implement an html generation debug_log viewer by @willghatch in #165
- Install Rust when building `manylinux` wheels by @paulzzy in #167
- [debugging] add dark theme to html_view...
Release v1.0.1
What's Changed
- Warn on leak instead of raising a runtime error by @tgymnich in #65
- Disable large and slow shapes for testAttentionBackward by @tgymnich in #64
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #70
- [Wave] Unaligned shapes support in `gather_to_shared` pass by @Hardcode84 in #60
- [Wave] Update README with quickstart instructions by @harsh-nod in #62
- [Wave] Make thread trace documentation visible in docs by @harsh-nod in #71
- [Wave] fix tutorial ref to tkl by @willghatch in #55
- [Wave] fix gitignore and lit test for gather_to_shared by @raikonenfnu in #78
- added XCD reordering to template and tweaked PingPong Schedule by @bodhisaha in #76
- [Wave] add scaled_dim/scaled_gemm support for gather_to_lds by @raikonenfnu in #80
- [Wave] Enable buffer load for dynamic cases by @raikonenfnu in #79
- [Wave] Improve barrier placement pass by @Hardcode84 in #68
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #81
- [Wave] Remove torch from default requirement by @raikonenfnu in #82
- [Wave] Enable > 4GB bufferOps using resetOffset by @raikonenfnu in #84
- Add scatter_add operation by @adedespirlet in #56
- [Wave] Fix prefetch scheduling for `GatherToLDS` by @Hardcode84 in #85
- [Wave] add debug_log_write op by @willghatch in #74
- [WAVE] Updated wave speculative decode as per the latest flashinfer kernel updates by @xintin in #72
- [WAVE] Merge two kernels into one in wave speculative decode by @xintin in #75
- [Wave] Register CustomOp to wave_lang S.T Wave+Turbine can live together by @raikonenfnu in #86
- Python 3.13 Support by @tgymnich in #89
- NFC: Fix type annotations for return values in attention kernels by @ftynse in #90
- Recover location info from MLIR after Water roundtrip by @ftynse in #83
- MoE kernel by @tyb0807 in #13
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #98
- Temporary fix for math library loading while using cache by @yichiche in #96
- add constraints to the cache key by @ashay in #91
- support dynamic shapes in debug_log_write by @willghatch in #102
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator[bot] in #108
- Deprecate vector_d.splat by @tgymnich in #109
- Bump minimum torch version to 2.6 by @paulzzy in #104
- Remove unused file with unused `def_library` by @paulzzy in #111
New Contributors
- @tgymnich made their first contribution in #65
- @iree-pr-automator[bot] made their first contribution in #70
- @bodhisaha made their first contribution in #76
- @adedespirlet made their first contribution in #56
- @ftynse made their first contribution in #90
- @yichiche made their first contribution in #96
- @paulzzy made their first contribution in #104
Full Changelog: v1.0.0-beta.1...v1.0.1
v1.0.0-beta.1
Version 1.0.0-beta.1
dev-wheels
Automatic nightly release of wave-lang python wheels.