Conversation


@stop1one stop1one commented Oct 1, 2025

Description

Related Issues

Fixes #316 #374

Summary of Changes

This PR fixes two problems in distributed training when run_test=True (a combined sketch of both changes follows the list below):

  1. Synchronization before testing

    • Added torch.distributed.barrier() when args.distributed is enabled, right after training and before testing.
    • Ensures all processes wait until rank 0 has finished saving checkpoint_best_total.pth before moving on to testing.
  2. Correct checkpoint loading in DDP

    • Changed from model.load_state_dict(best_state_dict) to model_without_ddp.load_state_dict(best_state_dict).
    • Prevents key mismatch errors when loading checkpoints in distributed environments.
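
A minimal sketch of how the two changes sit together at the end of the training entry point. This is a hedged illustration rather than the exact diff: names such as args.run_test, args.distributed, output_dir, model_without_ddp, and the "model" key in the checkpoint follow common DETR-style training scripts and are assumptions here.

import torch
import torch.distributed as dist

# End of training: rank 0 has just saved checkpoint_best_total.pth.
if args.run_test:
    if args.distributed:
        # Fix 1: make every rank wait until rank 0 has finished writing
        # the best checkpoint before any rank tries to read it.
        dist.barrier()

    checkpoint = torch.load(output_dir / "checkpoint_best_total.pth", map_location="cpu")
    best_state_dict = checkpoint["model"]

    # Fix 2: load into the unwrapped module rather than the DDP wrapper,
    # so the checkpoint keys (saved without the "module." prefix) match.
    model_without_ddp.load_state_dict(best_state_dict)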

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this change been tested? Please provide a test case or example of how you tested the change.

train.py :

from rfdetr import RFDETRLarge

model = RFDETRLarge()

model.train(
    dataset_dir=<DATASET_PATH>,
    epochs=30,
    batch_size=8,
    grad_accum_steps=2,
    lr=1e-4,
    output_dir=<OUTPUT_PATH>,
)

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py
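
For newer PyTorch releases, the same launch can also be expressed with torchrun, which sets the rendezvous environment variables by default, so the --use_env flag is not needed:

torchrun --nproc_per_node=4 train.py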

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


Development

Successfully merging this pull request may close these issues.

Distributed Training Fails at End: FileNotFoundError and State Dict Mismatch Issues
