Conversation

@levendlee
Member

What does this PR do?

Fixes # (issue).

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in a GitHub issue, there's a high chance it will not be merged.

ngoyal2707 and others added 21 commits March 29, 2024 15:12
This commit works with a 4-GPU run on the SMALL model with FSDP and PP
enabled.
- Clean up flatten and non_flatten parameter generation logic.
- Avoid checking that the `main_grad` attribute is all zeros.
- Clean up amax and scale update logic. Amax and scale should be
  updated for the weights as well, so the update should happen at the
  forward of each microbatch.
- Consolidate the `cast_params` and `all_gather` streams.
…kresearch/fairscale into shikaili_fp8_allgather_no_pp_fix
@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 20, 2024.
@awgu left a comment

Thanks @levendlee for the great work! I left some comments for my own learning.

        and all(_is_te_module_with_weights(info[1]) for info in p._param_infos))
    if fused_wgard_accumulation:
        if getattr(p, "main_grad", None) is None:
            p.main_grad = torch.empty_like(p, dtype=torch.float32)

For my understanding, why empty_like instead of zeros_like?
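The distinction the reviewer is asking about: `zeros_like` pays for a fill kernel, while `empty_like` leaves the buffer uninitialized, which is safe only if the first gradient write *assigns* into `main_grad` rather than accumulates. A minimal sketch of that invariant (plain PyTorch, not the PR's actual code path):

```python
import torch

p = torch.nn.Parameter(torch.randn(4, 4, dtype=torch.bfloat16))

# zeros_like initializes the buffer; empty_like skips the fill kernel.
zeroed = torch.zeros_like(p, dtype=torch.float32)
uninit = torch.empty_like(p, dtype=torch.float32)  # contents are arbitrary
assert zeroed.sum().item() == 0.0

# Simulate the first microbatch: an *assignment* into the uninitialized
# buffer vs. an *accumulation* into the zeroed one — results now agree,
# so empty_like is correct iff the first write is a copy, not an add.
grad = torch.ones_like(p, dtype=torch.float32)
uninit.copy_(grad)
zeroed.add_(grad)
assert torch.equal(uninit, zeroed)
```

If any code path accumulates into `main_grad` before a full overwrite, `zeros_like` would be required.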

"""Update Amax and scales associated with FP8 parameters."""
if params is None:
params = self.params
with torch.cuda.stream(self._streams["fp32_to_fp16"]):

Curious: why did you use the "all_gather" stream instead of the "fp32_to_fp16" stream?
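For context on the stream question: the point of running the cast on a side stream is to overlap it with other work, with the consumer stream waiting on the producer before reading the result. A hypothetical sketch (stream names and the `* 2` stand-in are illustrative, not the PR's code), guarded so it degrades to a plain CPU cast without CUDA:

```python
import torch

if torch.cuda.is_available():
    cast_stream = torch.cuda.Stream()
    all_gather_stream = torch.cuda.Stream()
    x = torch.randn(1024, device="cuda")
    with torch.cuda.stream(cast_stream):
        y = x.to(torch.float16)          # stand-in for the FP8 cast
    # The consumer stream must wait on the producer before reading y,
    # otherwise it may observe a partially written buffer.
    all_gather_stream.wait_stream(cast_stream)
    with torch.cuda.stream(all_gather_stream):
        z = y * 2                        # stand-in for the all-gather
    torch.cuda.synchronize()
else:
    z = torch.randn(1024).to(torch.float16) * 2  # CPU fallback, no streams
```

Reusing one stream for both cast and all-gather (as the commit message's "consolidate streams" suggests) trades some overlap for not needing this cross-stream synchronization.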

    self.has_full_params = False

    if self.fp8_all_gather:
        self._update_amax_and_scale_fwd(is_first_microbatch_fwd=is_first_microbatch_fwd)

For my understanding, is there a reason that this is not done together with _cast_params_for_all_gather? (For example, could this call be delayed a few lines to below where _cast_params_for_all_gather is called?)
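Background for why the update runs at every microbatch forward: FP8 delayed scaling derives the next scale from a running amax history, so stale updates mean a stale scale. A simplified pure-Python sketch of the idea (the window size, margin, and function names are illustrative assumptions, not Transformer Engine's actual implementation):

```python
# Hypothetical sketch of FP8 delayed scaling.
FP8_E4M3_MAX = 448.0  # largest representable magnitude in E4M3

def update_amax_and_scale(amax_history, new_amax, margin=0):
    # Record the latest observed amax, then derive the scale from the
    # maximum over a bounded history window.
    amax_history.append(new_amax)
    amax = max(amax_history[-16:])
    return (FP8_E4M3_MAX / amax) / (2 ** margin)

history = []
s1 = update_amax_and_scale(history, 2.0)  # 448 / 2  -> 224.0
s2 = update_amax_and_scale(history, 8.0)  # amax grew -> scale shrinks to 56.0
```

Because each microbatch's forward can change the observed amax, deferring the update (e.g., into `_cast_params_for_all_gather`) would be correct only if nothing between the two points reads the scale.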


@torch.no_grad()
def _rebuild_full_params(
    self, force_full_precision: bool = False, wait_for_all_gather: bool = True
) -> Optional[List[Tuple[torch.Tensor, bool]]]:

For fp8_all_gather=True, what happens when this method is called without the TE autocast context?
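The failure mode behind this question is that autocast-style contexts publish state globally, and code called outside the context silently sees it disabled. A simplified stand-in (this is not Transformer Engine's `fp8_autocast` implementation, just an illustration of the pattern):

```python
from contextlib import contextmanager

_FP8_STATE = {"enabled": False}

@contextmanager
def fp8_autocast_sketch(enabled=True):
    # Modules read global state at call time, so calls made outside the
    # context observe it disabled (or stale), which is the hazard here.
    prev = _FP8_STATE["enabled"]
    _FP8_STATE["enabled"] = enabled
    try:
        yield
    finally:
        _FP8_STATE["enabled"] = prev

def rebuild_full_params_sketch():
    # A real fp8_all_gather rebuild would consult this flag and either
    # fall back to the non-FP8 path or hit stale FP8 metadata.
    return _FP8_STATE["enabled"]

with fp8_autocast_sketch():
    inside = rebuild_full_params_sketch()   # True
outside = rebuild_full_params_sketch()      # False
```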

# All-gather full parameters. This will also transfer FP32 parameters to
# ``self.compute_dtype`` (e.g., FP16 if *mixed_precision* is ``True``).
self._rebuild_full_params()
self.module.has_unflatten_views = getattr(self.module, "has_unflatten_views", False)

Why do we need this?
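For readers unfamiliar with the idiom on the quoted line: `getattr` with a default lazily initializes a flag that may not exist yet, and reassigning it pins the attribute so later code can read it unconditionally. A minimal sketch:

```python
class Module:
    pass

m = Module()
# Read a flag that may not exist yet, defaulting to False, and pin it
# on the module; a second pass leaves an already-set value untouched.
m.has_unflatten_views = getattr(m, "has_unflatten_views", False)
assert m.has_unflatten_views is False

m.has_unflatten_views = True
m.has_unflatten_views = getattr(m, "has_unflatten_views", False)
assert m.has_unflatten_views is True
```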
