Conversation


@stop1one stop1one commented Oct 1, 2025

Description

Related Issues

Fixes #316 #374

Summary of Changes

This PR fixes two problems in distributed training when run_test=True (a combined sketch of both changes follows the list below):

  1. Synchronization before testing

    • Added torch.distributed.barrier() when args.distributed is enabled, right after training and before testing.
    • Ensures all processes wait until rank 0 has finished saving checkpoint_best_total.pth before moving on to testing.
  2. Correct checkpoint loading in DDP

    • Changed from model.load_state_dict(best_state_dict) to model_without_ddp.load_state_dict(best_state_dict).
    • Prevents key mismatch errors when loading checkpoints in distributed environments.
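
A minimal sketch of how the two changes sit together at the end of the training entry point. This is a hedged illustration rather than the exact diff: names such as args.run_test, args.distributed, output_dir, model_without_ddp, and the "model" key in the checkpoint follow common DETR-style training scripts and are assumptions here.

import torch
import torch.distributed as dist

# End of training: rank 0 has just saved checkpoint_best_total.pth.
if args.run_test:
    if args.distributed:
        # Fix 1: make every rank wait until rank 0 has finished writing
        # the best checkpoint before any rank tries to read it.
        dist.barrier()

    checkpoint = torch.load(output_dir / "checkpoint_best_total.pth", map_location="cpu")
    best_state_dict = checkpoint["model"]

    # Fix 2: load into the unwrapped module rather than the DDP wrapper,
    # so the checkpoint keys (saved without the "module." prefix) match.
    model_without_ddp.load_state_dict(best_state_dict)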

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this change been tested? Please provide a test case or example of how you tested the change.

train.py :

from rfdetr import RFDETRLarge

model = RFDETRLarge()

model.train(
    dataset_dir=<DATASET_PATH>,
    epochs=30,
    batch_size=8,
    grad_accum_steps=2,
    lr=1e-4,
    output_dir=<OUTPUT_PATH>,
)

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py
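
For newer PyTorch releases, the same launch can also be expressed with torchrun, which sets the rendezvous environment variables by default, so the --use_env flag is not needed:

torchrun --nproc_per_node=4 train.py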

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


Development

Successfully merging this pull request may close these issues.

Distributed Training Fails at End: FileNotFoundError and State Dict Mismatch Issues
