@DAKSH-coder26 DAKSH-coder26 commented Dec 11, 2025

Fix BatchProgress Inconsistency When Resuming From Mid-Epoch Checkpoint (Fixes #19367)

What does this PR do?

This PR fixes incorrect synchronization of batch_progress when resuming training from a mid-epoch checkpoint, as described in issue #19367.

Before this fix, resuming from a checkpoint saved before completing an epoch caused:

  • batch_progress.total_ready and total_completed to desynchronize from the restored global_step
  • current_ready / current_completed to become inconsistent or invalid
  • Validation, checkpoint intervals, or internal loop schedules to fire at the wrong steps
  • Off-by-one errors in resumed training
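A concrete illustration of the desynchronization, using assumed numbers (4 train batches per epoch, checkpoint saved at `global_step = 5`, i.e. one batch into the second epoch):

```python
# Assumed scenario: 4 train batches per epoch, checkpoint saved at
# global_step = 5, i.e. one batch into the second epoch.
global_step = 5
limit_train_batches = 4

# What the counters should read after a correct restore:
expected_total_completed = global_step                          # 5
expected_current_completed = global_step % limit_train_batches  # 1

# Before the fix, the restored loop could instead report stale values
# (e.g. total_completed = 0, current_completed = 0), so validation and
# checkpoint intervals keyed on these counters fired at the wrong steps.
```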

Summary of the Fix

The PR modifies CheckpointConnector.resume_start() to:

  • Realign batch_progress.total_ready and total_completed to be at least the restored global_step
  • Recompute current_ready and current_completed using
    global_step % limit_train_batches when available
  • Fall back safely when dataloader size or fields differ across Lightning versions
  • Ensure behavior remains unchanged when no checkpoint is provided

A new test verifies correct behavior and prevents regressions.
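The realignment strategy can be sketched outside Lightning with a simplified stand-in for the progress tracker (`BatchProgress` here is a minimal dataclass, and `realign_batch_progress` is a hypothetical helper mirroring the logic described above, not the actual PR code):

```python
from dataclasses import dataclass


@dataclass
class BatchProgress:
    # Simplified stand-in for Lightning's batch-progress tracker;
    # field names mirror the counters discussed in the PR.
    total_ready: int = 0
    total_completed: int = 0
    current_ready: int = 0
    current_completed: int = 0


def realign_batch_progress(progress, global_step, limit_train_batches=None):
    """Realign counters with the restored global_step (hypothetical helper).

    Mirrors the PR's strategy: totals are raised to at least global_step,
    and the per-epoch counters are recomputed modulo the epoch length when
    it is known, falling back to the totals otherwise.
    """
    progress.total_ready = max(progress.total_ready, global_step)
    progress.total_completed = max(progress.total_completed, global_step)

    if limit_train_batches:
        # Epoch length known: recompute the mid-epoch position.
        in_epoch = global_step % limit_train_batches
        progress.current_ready = in_epoch
        progress.current_completed = in_epoch
    else:
        # Fallback when the dataloader size is unavailable.
        progress.current_ready = progress.total_ready
        progress.current_completed = progress.total_completed
    return progress
```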

Tests Added

A new file:
tests/test_resume_batch_progress.py

This test:

  1. Trains to global_step = 5
  2. Saves a mid-epoch checkpoint
  3. Resumes from the checkpoint
  4. Ensures:
    • batch_progress exists in the expected form
    • total_completed ≥ global_step
    • current_completed ≤ total_completed
    • All counters are non-negative and consistent

The test failed before the fix and passes after it.
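The invariants the test asserts can be sketched independently of Lightning (hypothetical `check_resume_invariants` helper over a plain dict of counters, assumed field names):

```python
def check_resume_invariants(batch_progress, global_step):
    """Invariants asserted after resuming (hypothetical helper).

    batch_progress is a dict holding the four counters the test inspects.
    """
    counters = ("total_ready", "total_completed",
                "current_ready", "current_completed")
    # All counters exist and are non-negative.
    for key in counters:
        assert key in batch_progress and batch_progress[key] >= 0, key
    # Totals keep pace with the restored global step.
    assert batch_progress["total_completed"] >= global_step
    # Per-epoch counters never exceed the running totals.
    assert batch_progress["current_completed"] <= batch_progress["total_completed"]
    assert batch_progress["current_ready"] <= batch_progress["total_ready"]
    return True
```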

Fixes #19367


Before submitting
  • Was this discussed/agreed via a GitHub issue? (issue #19367: Potential off-by-one error when resuming training from a mid-epoch checkpoint)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing?
  • No documentation changes were needed
  • Added the necessary tests
  • Verified new and existing tests pass locally
  • Listed breaking changes (none introduced)
  • Updated the CHANGELOG

PR review

Anyone in the community is welcome to review the PR.

Reviewer checklist
  • Is this pull request ready for review?
  • Check that all items from Before submitting are resolved
  • Ensure the title and description clearly explain the PR
  • Apply appropriate labels and milestones

📚 Documentation preview 📚: https://pytorch-lightning--21411.org.readthedocs.build/en/21411/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Dec 11, 2025
@DAKSH-coder26 DAKSH-coder26 marked this pull request as draft December 11, 2025 22:17
@DAKSH-coder26 (Author) commented:
Hi maintainers, this PR is ready for review from my side.

Some required checks (schema, required-jobs) are stuck in the "Expected" state because the old Lit CI platform has been shut down.
Could you please trigger/approve the workflows or override the stuck checks?

Thanks!

@DAKSH-coder26 DAKSH-coder26 marked this pull request as ready for review December 12, 2025 18:48
