@amd-bartgips amd-bartgips commented Nov 5, 2025

Motivation

Draft PR with changes that make it easier to gather data for training our 3D heuristics.
It does not need to be merged at the moment.

Some small changes to the Dockerfile. These should probably be moved elsewhere so as not to dirty this one.

The main changes are in how the distributor and workers interact when collecting and running jobs.
We run the distributor and workers on isolated nodes, where the only communication happens implicitly through the database.
For that reason, a distributor should be less greedy: instead of grabbing all jobs from the database, it claims only a single batch.
Only when that batch is nearing completion does it query the database again to fetch a new batch of jobs.

This makes it easier to run multiple nodes in parallel without requiring a single central distributor and direct communication between the nodes.
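The batch-claiming behavior described above can be sketched as a small tracker. This is an illustrative sketch, not the PR's code: the class name `BatchTracker` and its methods are hypothetical, while `claimed_job_ids`, `completed_job_ids`, and `progress_factor` mirror the instance variables listed in the Technical Details below.

```python
class BatchTracker:
    """Tracks claimed/completed jobs and signals when to fetch a new batch.

    Hypothetical sketch of the distributor-side logic: with the default
    progress_factor of 0.25, a refetch is triggered once 75% of the
    currently claimed batch has completed.
    """

    def __init__(self, progress_factor=0.25):
        self.progress_factor = progress_factor
        self.claimed_job_ids = set()
        self.completed_job_ids = set()

    def claim(self, job_ids):
        # Record a freshly fetched batch of job IDs.
        self.claimed_job_ids.update(job_ids)

    def complete(self, job_id):
        # Mark one job as finished (e.g. from a worker result).
        self.completed_job_ids.add(job_id)

    def should_fetch_more(self):
        # Fetch when nothing is claimed yet, or when the remaining
        # fraction of the batch drops to progress_factor or below.
        if not self.claimed_job_ids:
            return True
        remaining = len(self.claimed_job_ids) - len(self.completed_job_ids)
        return remaining <= self.progress_factor * len(self.claimed_job_ids)
```

With the default factor, a distributor holding 100 jobs would not refetch after 74 completions (26 remaining) but would after 75 (25 remaining).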

Technical Details

miopen_lib.py:

  • extract_job_id_from_context() - A new method to extract job IDs from MIOpen celery task contexts

mituna_interface.py:

  • Job Tracking: Added instance variables for tracking claimed and completed jobs:

    • self.claimed_job_ids = set()
    • self.completed_job_ids = set()
    • self.progress_factor = 0.25 (determines when to grab more jobs; with the default, a new batch is fetched once 75% of the current batch is finished)
  • Polling Optimization:

    • Configurable TUNA_POLL_INTERVAL (default 5 seconds)
    • Added "Progress-aware polling" with shorter intervals and smarter enqueuing
  • Job Completion Tracking:

    • Modified parse_result() to track completed jobs
    • Added abstract method extract_job_id_from_context() that subclasses must implement
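The relationship between the abstract hook and completion tracking might look roughly like this. Only the method names `extract_job_id_from_context()` and `parse_result()` and the two tracking sets come from the PR text; the surrounding class structure and the MIOpen subclass body are illustrative assumptions.

```python
from abc import ABC, abstractmethod


class TuningInterface(ABC):
    """Hypothetical base class standing in for mituna_interface."""

    def __init__(self):
        # Instance variables named in the PR description.
        self.claimed_job_ids = set()
        self.completed_job_ids = set()
        self.progress_factor = 0.25

    @abstractmethod
    def extract_job_id_from_context(self, context):
        """Subclasses map a celery task context to a job ID."""

    def parse_result(self, context, result):
        # Completion tracking: record the job ID before handling the result.
        job_id = self.extract_job_id_from_context(context)
        if job_id is not None:
            self.completed_job_ids.add(job_id)
        return result


class MIOpenInterface(TuningInterface):
    def extract_job_id_from_context(self, context):
        # Assumed shape: the MIOpen celery task context carries the job ID
        # under a "job_id" key. The real extraction logic may differ.
        return context.get("job_id")
```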

Test Plan

Test Result

Submission Checklist

…e reliability

- Replace individual UPDATE queries with bulk UPDATE for job state changes
- Add retry logic with configurable max attempts for database operations
- Implement consecutive empty fetch tracking to prevent infinite loops
- Add proper error handling and recovery for database session failures
- Track enqueued jobs to prevent duplicate processing
- Add configurable TUNA_MAX_EMPTY_FETCHES environment variable
- Improve logging for better observability of enqueue process

This optimization significantly reduces database round-trips when updating
multiple job states and makes the enqueue process more resilient to
transient failures.
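The two central ideas of the commit, a single bulk UPDATE for many job state changes plus bounded retries, can be sketched as follows. This is a minimal illustration using sqlite3, not the project's actual database layer; the table name `job`, the function name, and the backoff scheme are assumptions.

```python
import sqlite3
import time


def bulk_update_state(conn, job_ids, new_state, max_attempts=3):
    """Set the state of many jobs in one UPDATE, retrying on transient errors.

    Illustrative sketch: one parameterized statement replaces N individual
    UPDATE queries, cutting database round-trips; a bounded retry loop
    recovers from transient failures instead of giving up immediately.
    """
    placeholders = ",".join("?" * len(job_ids))
    sql = f"UPDATE job SET state = ? WHERE id IN ({placeholders})"
    for attempt in range(1, max_attempts + 1):
        try:
            with conn:  # one transaction, committed on success
                conn.execute(sql, [new_state, *job_ids])
            return True
        except sqlite3.OperationalError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            time.sleep(0.1 * attempt)  # simple linear backoff before retrying
    return False
```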