Commits
36 commits
3b92772
Utilize multiple GPUs in tuningRunner.
mirza-halilcevic Dec 18, 2025
31dca63
tuningRunner.py refactoring
mirza-halilcevic Dec 19, 2025
4e7c234
Merge branch 'develop' into mgpu-tuning
mirza-halilcevic Dec 19, 2025
5e0338e
Overlap compilation and benchmarking of perf configs in rocmlir-tunin…
mirza-halilcevic Dec 19, 2025
4d71e33
Merge remote-tracking branch 'origin/mgpu-tuning' into mgpu-tuning
mirza-halilcevic Dec 19, 2025
eb8dff6
Newline at end of file.
mirza-halilcevic Dec 19, 2025
70d021d
Implement persistence logic.
mirza-halilcevic Dec 21, 2025
735fb2b
Fix thread allocation bug in tuning-driver.
mirza-halilcevic Dec 21, 2025
d36e62d
Simplify implementation and optimize for edge cases.
mirza-halilcevic Dec 22, 2025
e01d4b1
Merge remote-tracking branch 'origin/develop' into mgpu-tuning
mirza-halilcevic Dec 23, 2025
a3c15e8
Merge branch 'develop' into mgpu-tuning
mirza-halilcevic Dec 24, 2025
cf2046e
Address review comments:
mirza-halilcevic Dec 25, 2025
8d6dbc3
Merge remote-tracking branch 'origin/develop' into mgpu-tuning
mirza-halilcevic Dec 25, 2025
df88112
Merge remote-tracking branch 'origin/mgpu-tuning' into mgpu-tuning
mirza-halilcevic Dec 25, 2025
af53625
Improve gpus argument parsing and improve graceful shutdown.
mirza-halilcevic Dec 26, 2025
e8d29d1
Implement OutputFileWriter and DebugFileWriter.
mirza-halilcevic Dec 26, 2025
95f771b
Fix output parsing during shutdown.
mirza-halilcevic Dec 26, 2025
e6a985b
Keep track of commit hash for each tuning run.
mirza-halilcevic Dec 26, 2025
2c48a92
Remove semicolons.
mirza-halilcevic Dec 26, 2025
34fffb6
Add debug info.
mirza-halilcevic Dec 26, 2025
b6d7bea
Fix progress bar output.
mirza-halilcevic Dec 26, 2025
e3e8add
Add GPU ID to debug info.
mirza-halilcevic Dec 26, 2025
90414e7
Fix stderr deadlock.
mirza-halilcevic Dec 26, 2025
c8d7bd8
Add --num-compile-threads argument.
mirza-halilcevic Dec 28, 2025
1c55901
Merge remote-tracking branch 'origin/develop' into mgpu-tuning
mirza-halilcevic Dec 29, 2025
305f5df
Improvements.
mirza-halilcevic Dec 30, 2025
9dded74
Merge branch 'develop' into mgpu-tuning
mirza-halilcevic Jan 5, 2026
e1f04bb
Implement NUMA-awareness and better encapsulation.
mirza-halilcevic Jan 6, 2026
1ef64bb
Merge remote-tracking branch 'origin/develop' into mgpu-tuning
mirza-halilcevic Jan 6, 2026
f93b4cd
Improve argument parsing.
mirza-halilcevic Jan 6, 2026
a620622
Merge branch 'develop' into mgpu-tuning
mirza-halilcevic Jan 6, 2026
94bef8b
Add option to wait for compiles.
mirza-halilcevic Jan 8, 2026
aa99327
Merge remote-tracking branch 'origin/develop' into mgpu-tuning
mirza-halilcevic Jan 8, 2026
924536d
Merge remote-tracking branch 'origin/develop' into mgpu-tuning
mirza-halilcevic Jan 8, 2026
9709f5b
Fix mempolicy call.
mirza-halilcevic Jan 8, 2026
8f4e6b2
Improve debug info.
mirza-halilcevic Jan 8, 2026
71 changes: 71 additions & 0 deletions mlir/tools/rocmlir-tuning-driver/ConcurrentQueue.h
@@ -0,0 +1,71 @@
//===- ConcurrentQueue.h - Simple MPMC queue --------------------*- C++ -*-===//
//
// Part of the rocMLIR Project, under the Apache License v2.0 with LLVM
// Exceptions. See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef ROCMLIR_TUNING_DRIVER_CONCURRENT_QUEUE_H
#define ROCMLIR_TUNING_DRIVER_CONCURRENT_QUEUE_H

#include "llvm/Support/Compiler.h"

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>

namespace rocmlir::tuningdriver {

template <typename T>
class ConcurrentQueue {
public:
template <typename U>
bool push(U &&item) {
if (LLVM_UNLIKELY(done.load(std::memory_order_relaxed)))
return false; // Early exit if terminated

{
std::lock_guard<std::mutex> lock(mtx);
if (LLVM_UNLIKELY(done.load(std::memory_order_relaxed)))
return false; // Double-check after acquiring the lock

queue.emplace(std::forward<U>(item));
}

cv.notify_one();
return true;
}

bool pop(T &item) {
std::unique_lock<std::mutex> lock(mtx);
cv.wait(lock, [this] {
return !queue.empty() || done.load(std::memory_order_relaxed);
});

if (LLVM_UNLIKELY(queue.empty()))
return false;

item = std::move(queue.front());
queue.pop();
return true;
}

void terminate() {
done.store(true, std::memory_order_relaxed);
cv.notify_all();
}

bool isTerminated() const { return done.load(std::memory_order_relaxed); }

private:
std::queue<T> queue;
std::mutex mtx;
std::condition_variable cv;
std::atomic<bool> done{false};
};

} // namespace rocmlir::tuningdriver

#endif // ROCMLIR_TUNING_DRIVER_CONCURRENT_QUEUE_H
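For orientation, here is a minimal, self-contained sketch (not part of this change) of how the queue is meant to be used: several producer threads push results while a single consumer drains them, and the last producer to finish calls terminate() so that the consumer's pop() returns false once the queue is empty. In the driver change below, the producers are the compilation worker threads and the consumer is the sequential benchmarking loop; the item type and the thread/item counts here are illustrative only.

// Illustrative usage of ConcurrentQueue (not part of this PR): producers push
// items concurrently; the consumer pops until the queue is terminated and empty.
#include "ConcurrentQueue.h"

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  rocmlir::tuningdriver::ConcurrentQueue<int> queue;
  constexpr unsigned numProducers = 4;
  std::atomic<unsigned> activeProducers{numProducers};

  std::vector<std::thread> producers;
  for (unsigned p = 0; p < numProducers; ++p) {
    producers.emplace_back([&queue, &activeProducers, p] {
      for (int i = 0; i < 8; ++i)
        if (!queue.push(static_cast<int>(p) * 100 + i))
          break; // Queue was terminated early; stop producing.
      // The last producer to finish signals that no more items will arrive.
      if (activeProducers.fetch_sub(1, std::memory_order_acq_rel) == 1)
        queue.terminate();
    });
  }

  // pop() blocks until an item is available or terminate() has been called;
  // it returns false only once the queue is terminated *and* empty, so all
  // pushed items are still consumed.
  int item;
  while (queue.pop(item))
    std::printf("%d\n", item);

  for (auto &t : producers)
    t.join();
  return 0;
}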
95 changes: 55 additions & 40 deletions mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp
Contributor
Unrelated, but it'd be nice to refactor the for (unsigned iterIdx = 0; iterIdx < numTuningIterations; ++iterIdx) loop, because I think we are launching threads for every iteration, right?

Contributor Author
Good point. Will do.

Contributor
We can do this in another PR if that's OK, since it's unrelated to this change.
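For reference, one possible shape of that refactor, sketched here purely for illustration: the worker pool is created once and fed (iteration, config) tasks through the ConcurrentQueue added in this PR, instead of spawning threads inside the per-iteration loop. Task and processTask are hypothetical stand-ins for the real per-config work, and the counts are placeholders.

// Illustrative sketch only (not this PR's implementation): persistent workers
// created once, reused across all tuning iterations.
#include "ConcurrentQueue.h"

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

struct Task {
  unsigned iterIdx;       // Tuning iteration this task belongs to.
  std::size_t configIdx;  // Index of the perf config to process.
};

// Hypothetical stand-in for the real per-config compile/benchmark work.
static void processTask(const Task &task) {
  std::printf("iteration %u, config %zu\n", task.iterIdx, task.configIdx);
}

int main() {
  // Placeholder sizes; in the driver these come from the tuning space.
  constexpr unsigned numTuningIterations = 3;
  constexpr std::size_t numConfigs = 16;
  const unsigned numThreads =
      std::max(1u, std::thread::hardware_concurrency());

  rocmlir::tuningdriver::ConcurrentQueue<Task> taskQueue;

  // Workers are created once, before the iteration loop.
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < numThreads; ++i)
    workers.emplace_back([&taskQueue] {
      Task task;
      while (taskQueue.pop(task))
        processTask(task);
    });

  // The per-iteration loop only enqueues work; it no longer launches threads.
  for (unsigned iterIdx = 0; iterIdx < numTuningIterations; ++iterIdx)
    for (std::size_t cfg = 0; cfg < numConfigs; ++cfg)
      taskQueue.push(Task{iterIdx, cfg});

  // No more work: workers drain the remaining tasks and exit.
  taskQueue.terminate();
  for (auto &w : workers)
    w.join();
  return 0;
}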

@@ -58,7 +58,9 @@

// Utilities to allocate buffers
#include "../utils/performance/common/benchmarkUtils.h"

#include "CacheFlush.h"
#include "ConcurrentQueue.h"

#include <hip/hip_runtime.h>

@@ -160,6 +162,11 @@ static llvm::cl::opt<unsigned> numCompileThreads(
llvm::cl::desc("Number of parallel compilation threads (0 = auto)"),
llvm::cl::value_desc("thread count"), llvm::cl::init(0));

static llvm::cl::opt<bool> waitForCompiles(
"wait-for-compiles",
llvm::cl::desc("Wait for all compilations to finish before benchmarking"),
llvm::cl::init(false));

// Ripped out of JitRunner.cpp
static OwningOpRef<ModuleOp> parseMLIRInput(StringRef inputFilename,
MLIRContext *context) {
@@ -276,6 +283,7 @@ struct BenchmarkParams {
rock::TuningParamSetKind tuningSpaceKind;
const unsigned numCompileThreads;
std::string benchmarkConfig;
bool waitForCompiles;
};

enum class CompilationStatus {
@@ -740,7 +748,7 @@ static LogicalResult runTuningLoop(ModuleOp source) {
const BenchmarkParams benchmarkParams = {
numIterations, warmupIterations, useMedian, trimPercent,
sleepUs, showStats, showAllMeasurements, tuningSpaceKind,
numCompileThreads, benchmarkConfig};
numCompileThreads, benchmarkConfig, waitForCompiles};

unsigned numTuningIterations =
rock::getNumberOfIterations(benchmarkParams.tuningSpaceKind);
@@ -827,10 +835,8 @@ static LogicalResult runTuningLoop(ModuleOp source) {
}

// PHASE 3: Parallel compilation phase using pre-initialized resources
std::vector<CompilationResult> compilationResults(configs.size());
ConcurrentQueue<CompilationResult> compilationResults;
std::mutex outputMutex; // For thread-safe console output
std::atomic<bool> compilationFailed{
false}; // Flag to signal early termination

// Compile a single config using pre-initialized thread resources
auto compileConfig = [&](size_t idx,
Expand Down Expand Up @@ -876,7 +882,6 @@ static LogicalResult runTuningLoop(ModuleOp source) {
auto tunedFunc = sourceCopy->lookupSymbol<func::FuncOp>(fnName);
if (!tunedFunc) {
result.status = CompilationStatus::CompilationFailed;
compilationFailed.store(true, std::memory_order_relaxed);
return result;
}
result.blockSizes.push_back(
@@ -891,7 +896,6 @@
llvm::errs() << "Backend pipeline failed for config: "
<< result.perfConfig << "\n";
result.status = CompilationStatus::CompilationFailed;
compilationFailed.store(true, std::memory_order_relaxed);
return result;
}

@@ -901,7 +905,6 @@
sourceCopy->lookupSymbol<gpu::BinaryOp>(fnName + "_module");
if (!binary) {
result.status = CompilationStatus::CompilationFailed;
compilationFailed.store(true, std::memory_order_relaxed);
return result;
}
result.hipModules.push_back(
@@ -920,53 +923,65 @@
// compilation times vary dramatically between configs (NotApplicable is
// fast, full compilation is slow). Dynamic work stealing provides better
// load balancing by allowing fast threads to pick up more work.
{
std::atomic<size_t> nextIdx{0};
std::atomic<unsigned> nextThreadId{0};

auto worker = [&]() {
// Each worker gets assigned a unique thread ID for its resources
unsigned myThreadId =
nextThreadId.fetch_add(1, std::memory_order_relaxed);
ThreadResources &myRes = threadResources[myThreadId];

while (true) {
if (compilationFailed.load(std::memory_order_relaxed))
break;
std::atomic<size_t> nextIdx{0};
std::atomic<unsigned> nextThreadId{0};
std::atomic<size_t> activeThreads{numThreads};
auto worker = [&] {
// Each worker gets assigned a unique thread ID for its resources
unsigned myThreadId =
nextThreadId.fetch_add(1, std::memory_order_relaxed);
ThreadResources &myRes = threadResources[myThreadId];

while (true) {
size_t idx = nextIdx.fetch_add(1, std::memory_order_relaxed);
if (idx >= configs.size())
break;

if (compilationResults.isTerminated())
break; // Avoid unnecessary work

if (!compilationResults.push(compileConfig(idx, myRes)))
break; // Queue terminated
}

size_t idx = nextIdx.fetch_add(1, std::memory_order_relaxed);
if (idx >= configs.size())
break;
if (activeThreads.fetch_sub(1, std::memory_order_acq_rel) == 1) {
// Last thread - signal termination
compilationResults.terminate();
}
};

compilationResults[idx] = compileConfig(idx, myRes);
}
};
std::vector<std::thread> threads;
threads.reserve(numThreads);
for (unsigned i = 0; i < numThreads; ++i) {
threads.emplace_back(worker);
}

std::vector<std::thread> threads;
threads.reserve(numThreads);
for (unsigned i = 0; i < numThreads; ++i) {
threads.emplace_back(worker);
auto threadCleanup = llvm::make_scope_exit([&] {
// In case of early termination, signal all threads to stop
compilationResults.terminate();
for (auto &t : threads) {
t.join();
}
});

if (benchmarkParams.waitForCompiles) {
for (auto &t : threads) {
t.join();
}
}

// Check if any compilation failed and terminate early
if (compilationFailed.load(std::memory_order_relaxed)) {
llvm::errs()
<< "Compilation failed for one or more configs. Terminating.\n";
return failure();
threads.clear();
}

int64_t validResults = 0;
// Sequential benchmarking phase (must be sequential for accurate timing)
// Note: Due to early exit on compilation failures, only NotApplicable and
// Success statuses are possible here.
for (const auto &result : compilationResults) {
CompilationResult result;
while (compilationResults.pop(result)) {
llvm::outs() << result.perfConfig << "\t";

if (result.status == CompilationStatus::CompilationFailed) {
llvm::errs() << "Compilation failed\n";
return failure();
}

if (result.status == CompilationStatus::NotApplicable) {
llvm::outs() << "N/A\n";
continue;
4 changes: 2 additions & 2 deletions mlir/utils/jenkins/Jenkinsfile
@@ -1179,10 +1179,10 @@ PY
stage("Tune Fusion") {
dir('build') {
// Tune resnet50
sh """python3 ./bin/tuningRunner.py --abort-on-error --op fusion --test_dir ../mlir/test/fusion/resnet50-e2e/ -o tuning_fusion_${CHIP}.tsv"""
sh """python3 ./bin/tuningRunner.py --quiet --abort-on-error --op fusion --test-dir ../mlir/test/fusion/resnet50-e2e/ -o tuning_fusion_${CHIP}.tsv"""

// Tune bert
sh """python3 ./bin/tuningRunner.py --abort-on-error --op fusion --test_dir ../mlir/test/xmir/bert-torch-tosa-e2e/ -o tuning_fusion_${CHIP}.tsv"""
sh """python3 ./bin/tuningRunner.py --quiet --abort-on-error --op fusion --test-dir ../mlir/test/xmir/bert-torch-tosa-e2e/ -o tuning_fusion_${CHIP}.tsv"""
}
sh 'rm -f build/CMakeCache.txt'
}
8 changes: 4 additions & 4 deletions mlir/utils/jenkins/Jenkinsfile.downstream
@@ -150,14 +150,14 @@ pipeline {
dir('build') {
timeout(time: 60, activity: true, unit: 'MINUTES') {
// Tune gemms, fail if the DB is not created
sh """python3 ./bin/tuningRunner.py --abort-on-error \
sh """python3 ./bin/tuningRunner.py --quiet --abort-on-error \
--operation gemm \
--configs_file=../mlir/utils/jenkins/ci-configs/selected-gemm-configs \
--configs-file=../mlir/utils/jenkins/ci-configs/selected-gemm-configs \
--output=tuning_gemm.tsv
[ -f tuning_gemm.tsv ]"""
sh """python3 ./bin/tuningRunner.py --abort-on-error \
sh """python3 ./bin/tuningRunner.py --quiet --abort-on-error \
--operation conv \
--configs_file=../mlir/utils/jenkins/ci-configs/selected-conv-configs \
--configs-file=../mlir/utils/jenkins/ci-configs/selected-conv-configs \
--output=tuning_conv.tsv
[ -f tuning_conv.tsv ]"""
}
7 changes: 4 additions & 3 deletions mlir/utils/performance/perfRunner.py
@@ -529,15 +529,16 @@ def from_command_line(cls, argv, arch, num_cu):
datatype = 'bf8_fp8'
elif argv[0] == 'convbf8_bf8':
datatype = 'bf8_bf8'
else:
raise ValueError(f"Unknown conv datatype: {argv[0]}")

try:
# TBD:
# implement -m ?
# implement -t ?
opts, _ = getopt.getopt(argv[1:], "F:f:I:O:n:c:H:W:k:y:x:p:q:l:j:u:v:g:m:t:")
except getopt.GetoptError:
print('getopt error')
sys.exit(1)
except getopt.GetoptError as e:
raise ValueError(f"Invalid conv config: {e}")

for opt, arg in opts:
if opt == '-F':