FP8 support for Allreduce #646
… fp8 type with HIP
Please refer to this for AMD FP8: https://github.com/ROCm/rccl/blob/e738c03e39f681b5f38409788c781a7ffd3cc73e/src/device/reduce_kernel.h#L440-L443
… fp8 elements; skip special FP8 cal_vectors_helper
…_fp8x2_e5m2 and __fp8x4_e5m2
…p8_e4m3 and __fp8_e5m2. Fix the issue of data validation for __fp8_e5m2 with HIP. Add missing handling of sizeof(T) == 1 for FP8 in allreduceAllPairs and allreduce7.
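To illustrate why a `sizeof(T) == 1` case needs its own handling, here is a minimal sketch of how element counts map to 32-bit words when a buffer is processed word by word. It does not reproduce the code in `allreduceAllPairs` or `allreduce7`; `Fp8Storage`, `elemsPerWord`, and `wordsFor` are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 1-byte FP8 storage type (not an mscclpp type).
struct Fp8Storage {
  uint8_t bits;
};

// Number of elements of type T that fit in one 32-bit word.
// A 1-byte element type packs four elements per word, so code that only
// handles sizeof(T) == 2 and sizeof(T) == 4 would miss the FP8 case.
template <typename T>
constexpr std::size_t elemsPerWord() {
  static_assert(sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4,
                "unsupported element size");
  return sizeof(uint32_t) / sizeof(T);
}

// Number of 32-bit words needed to hold nelems elements of type T,
// rounding up so that a partial tail word is still counted.
template <typename T>
constexpr std::size_t wordsFor(std::size_t nelems) {
  return (nelems + elemsPerWord<T>() - 1) / elemsPerWord<T>();
}
```

With this helper, 134217728 one-byte elements and 67108864 two-byte elements both occupy 33554432 words, which matches the equal byte sizes used in the benchmark rows of the PR description.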
We need to move our reduction functions into a unified place. We can change this in another PR.
…CUDA_ARCH__ < 1000
Add FP8 support for Allreduce on both NVIDIA and AMD platforms.
Add new data types: fp8_e4m3 and fp8_e5m2.
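For readers unfamiliar with the two formats, the following is a host-side sketch of how an 8-bit pattern maps to a float. It assumes the OCP FP8 encodings (E4M3 with bias 7, no infinities, and NaN only at the all-ones pattern; E5M2 with bias 15 and IEEE-style infinities and NaNs). Some platforms use variants with a different bias and special-value layout, which this sketch does not cover. The function names are hypothetical and are not part of MSCCLPP, CUDA, or HIP.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode one 8-bit float with expBits exponent bits and manBits mantissa bits.
// ieeeSpecials selects IEEE-style inf/NaN handling (assumed for E5M2);
// otherwise only the all-ones exponent and mantissa pattern is NaN (assumed for E4M3).
inline float decodeFp8(uint8_t bits, int expBits, int manBits, bool ieeeSpecials) {
  const int bias = (1 << (expBits - 1)) - 1;
  const int expMax = (1 << expBits) - 1;
  const int manMax = (1 << manBits) - 1;
  const int sign = bits >> 7;
  const int exp = (bits >> manBits) & expMax;
  const int man = bits & manMax;

  float value;
  if (ieeeSpecials && exp == expMax) {
    value = (man == 0) ? std::numeric_limits<float>::infinity()
                       : std::numeric_limits<float>::quiet_NaN();
  } else if (!ieeeSpecials && exp == expMax && man == manMax) {
    value = std::numeric_limits<float>::quiet_NaN();
  } else if (exp == 0) {
    // Subnormal: no implicit leading one.
    value = std::ldexp(static_cast<float>(man), 1 - bias - manBits);
  } else {
    // Normal: implicit leading one added to the mantissa.
    value = std::ldexp(static_cast<float>(man + (1 << manBits)), exp - bias - manBits);
  }
  return sign ? -value : value;
}

inline float decodeE4M3(uint8_t bits) { return decodeFp8(bits, 4, 3, false); }
inline float decodeE5M2(uint8_t bits) { return decodeFp8(bits, 5, 2, true); }
```

Under these assumptions E4M3 trades range for precision (largest finite value 448), while E5M2 has a wider range (largest finite value 57344) with one fewer mantissa bit.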
Allreduce performance
1. NVIDIA H100
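nccl-tests with MSCCLPP

| Backend | size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSCCLPP | 134217728 | 67108864 | half | sum | -1 | 496.6 | 270.25 | 472.94 | 0 | 499.2 | 268.85 | 470.49 | 0 |
| MSCCLPP | 134217728 | 67108864 | bfloat16 | sum | -1 | 497.2 | 269.94 | 472.39 | 0 | 497.4 | 269.83 | 472.20 | 0 |
| NCCL | 134217728 | 134217728 | f8e4m3 | sum | -1 | 735.0 | 182.60 | 319.55 | 0 | 709.0 | 189.32 | 331.31 | 0 |
| NCCL | 134217728 | 134217728 | f8e5m2 | sum | -1 | 734.6 | 182.70 | 319.73 | 0 | 709.2 | 189.26 | 331.21 | 0 |

NCCL rows: using MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce".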
2. AMD MI300
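rccl-tests with MSCCLPP

| Backend | size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSCCLPP | 134217728 | 134217728 | fp8_e4m3 | sum | -1 | 763.0 | 175.91 | 307.83 | 0 | 761.9 | 176.16 | 308.28 | 0 |
| MSCCLPP | 134217728 | 134217728 | fp8_e5m2 | sum | -1 | 762.7 | 175.97 | 307.94 | 0 | 762.2 | 176.09 | 308.15 | 0 |
| MSCCLPP | 134217728 | 67108864 | half | sum | -1 | 752.6 | 178.33 | 312.09 | 0 | 752.9 | 178.27 | 311.97 | 0 |
| RCCL | 134217728 | 134217728 | fp8_e4m3 | sum | -1 | 838.3 | 160.11 | 280.20 | 0 | 839.3 | 159.92 | 279.86 | 0 |
| RCCL | 134217728 | 134217728 | fp8_e5m2 | sum | -1 | 837.4 | 160.28 | 280.49 | 0 | 838.9 | 159.99 | 279.98 | 0 |

RCCL rows: using MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce".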