@amd-bartgips amd-bartgips commented Nov 5, 2025

Motivation

Draft PR with changes that make it easier to gather data for training our 3D heuristics.
It does not need to be merged at the moment.

Some small changes to the Dockerfile. These should probably be moved elsewhere so as not to dirty this one.

The main changes are in how the distributor and workers interact when collecting and running jobs.
We run the distributor and workers on isolated nodes, where the only communication happens implicitly through the database.
For that reason, a distributor should be less greedy: instead of grabbing all jobs from the database, it claims only a single batch.
Only when that batch is nearing completion does it query the database again to fetch a new batch of jobs.

This makes it easier to run multiple nodes in parallel without requiring a single central distributor and direct communication between the nodes.
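The batch-claiming behavior described above can be sketched as a small tracker. This is an illustrative sketch, not the PR's code: the class name `BatchTracker` and its methods are hypothetical, while `claimed_job_ids`, `completed_job_ids`, and `progress_factor` mirror the instance variables listed in the Technical Details below.

```python
class BatchTracker:
    """Tracks claimed/completed jobs and signals when to fetch a new batch.

    Hypothetical sketch of the distributor-side logic: with the default
    progress_factor of 0.25, a refetch is triggered once 75% of the
    currently claimed batch has completed.
    """

    def __init__(self, progress_factor=0.25):
        self.progress_factor = progress_factor
        self.claimed_job_ids = set()
        self.completed_job_ids = set()

    def claim(self, job_ids):
        # Record a freshly fetched batch of job IDs.
        self.claimed_job_ids.update(job_ids)

    def complete(self, job_id):
        # Mark one job as finished (e.g. from a worker result).
        self.completed_job_ids.add(job_id)

    def should_fetch_more(self):
        # Fetch when nothing is claimed yet, or when the remaining
        # fraction of the batch drops to progress_factor or below.
        if not self.claimed_job_ids:
            return True
        remaining = len(self.claimed_job_ids) - len(self.completed_job_ids)
        return remaining <= self.progress_factor * len(self.claimed_job_ids)
```

With the default factor, a distributor holding 100 jobs would not refetch after 74 completions (26 remaining) but would after 75 (25 remaining).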

Technical Details

miopen_lib.py:

  • extract_job_id_from_context() - A new method to extract job IDs from MIOpen celery task contexts

mituna_interface.py:

  • Job Tracking: Added instance variables for tracking claimed and completed jobs:

    • self.claimed_job_ids = set()
    • self.completed_job_ids = set()
    • self.progress_factor = 0.25 (determines when to grab more jobs; with the default, a new batch is fetched once 75% of the current batch is finished)
  • Polling Optimization:

    • Configurable TUNA_POLL_INTERVAL (default 5 seconds)
    • Added "Progress-aware polling" with shorter intervals and smarter enqueuing
  • Job Completion Tracking:

    • Modified parse_result() to track completed jobs
    • Added abstract method extract_job_id_from_context() that subclasses must implement
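The relationship between the abstract hook and completion tracking might look roughly like this. Only the method names `extract_job_id_from_context()` and `parse_result()` and the two tracking sets come from the PR text; the surrounding class structure and the MIOpen subclass body are illustrative assumptions.

```python
from abc import ABC, abstractmethod


class TuningInterface(ABC):
    """Hypothetical base class standing in for mituna_interface."""

    def __init__(self):
        # Instance variables named in the PR description.
        self.claimed_job_ids = set()
        self.completed_job_ids = set()
        self.progress_factor = 0.25

    @abstractmethod
    def extract_job_id_from_context(self, context):
        """Subclasses map a celery task context to a job ID."""

    def parse_result(self, context, result):
        # Completion tracking: record the job ID before handling the result.
        job_id = self.extract_job_id_from_context(context)
        if job_id is not None:
            self.completed_job_ids.add(job_id)
        return result


class MIOpenInterface(TuningInterface):
    def extract_job_id_from_context(self, context):
        # Assumed shape: the MIOpen celery task context carries the job ID
        # under a "job_id" key. The real extraction logic may differ.
        return context.get("job_id")
```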

Test Plan

Test Result

Submission Checklist

…e reliability

- Replace individual UPDATE queries with bulk UPDATE for job state changes
- Add retry logic with configurable max attempts for database operations
- Implement consecutive empty fetch tracking to prevent infinite loops
- Add proper error handling and recovery for database session failures
- Track enqueued jobs to prevent duplicate processing
- Add configurable TUNA_MAX_EMPTY_FETCHES environment variable
- Improve logging for better observability of enqueue process

This optimization significantly reduces database round-trips when updating
multiple job states and makes the enqueue process more resilient to
transient failures.
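The two central ideas of the commit, a single bulk UPDATE for many job state changes plus bounded retries, can be sketched as follows. This is a minimal illustration using sqlite3, not the project's actual database layer; the table name `job`, the function name, and the backoff scheme are assumptions.

```python
import sqlite3
import time


def bulk_update_state(conn, job_ids, new_state, max_attempts=3):
    """Set the state of many jobs in one UPDATE, retrying on transient errors.

    Illustrative sketch: one parameterized statement replaces N individual
    UPDATE queries, cutting database round-trips; a bounded retry loop
    recovers from transient failures instead of giving up immediately.
    """
    placeholders = ",".join("?" * len(job_ids))
    sql = f"UPDATE job SET state = ? WHERE id IN ({placeholders})"
    for attempt in range(1, max_attempts + 1):
        try:
            with conn:  # one transaction, committed on success
                conn.execute(sql, [new_state, *job_ids])
            return True
        except sqlite3.OperationalError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            time.sleep(0.1 * attempt)  # simple linear backoff before retrying
    return False
```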