
@seagater
Contributor

@seagater seagater commented Oct 13, 2025

Add FP8 support for Allreduce on both NVIDIA and AMD platforms.
Add new data types: fp8_e4m3 and fp8_e5m2.
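
As a rough illustration of what the two new data types encode, here is a minimal Python sketch that decodes a single FP8 byte. It assumes the OCP 8-bit floating point layouts (E4M3 with bias 7 and no infinities, E5M2 with bias 15 and IEEE-style inf/NaN). The PR description does not say which variant each platform uses, and some hardware uses variants with a different bias and NaN encoding, so the constants below are assumptions. The function names are hypothetical and are not part of MSCCLPP.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int, ieee_special: bool) -> float:
    """Decode one FP8 byte laid out as sign | exponent | mantissa (assumed OCP layout)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if ieee_special and exp == exp_max:
        # Assumed E5M2 behaviour: all-ones exponent means inf (mantissa 0) or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_special and exp == exp_max and man == man_max:
        # Assumed E4M3 behaviour: only the all-ones pattern is NaN, there is no inf.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading one.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte: int) -> float:
    """Hypothetical helper: 4 exponent bits, 3 mantissa bits, bias 7."""
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_special=False)


def decode_e5m2(byte: int) -> float:
    """Hypothetical helper: 5 exponent bits, 2 mantissa bits, bias 15."""
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_special=True)


if __name__ == "__main__":
    print(decode_e4m3(0x7E), decode_e5m2(0x7B))  # largest finite values under these assumptions
```

Under these assumptions E4M3 trades range for precision (largest finite value 448) while E5M2 trades precision for range (largest finite value 57344), which is why both types are commonly offered side by side.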

Allreduce performance

1. NVIDIA H100
nccl-tests with MSCCLPP

mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=/home/qinghuazhou/mscclpp_allreduce_fp8/build/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1K -e 64K -f 2 -d f8e4m3 -G 1 -w 10 -n 100 -c 1
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=/home/qinghuazhou/mscclpp_allreduce_fp8/build/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1K -e 64K -f 2 -d f8e5m2 -G 1 -w 10 -n 100 -c 1
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=/home/qinghuazhou/mscclpp_allreduce_fp8/build/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1K -e 128M -f 2 -d half -G 1 -w 10 -n 100 -c 1
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=/home/qinghuazhou/mscclpp_allreduce_fp8/build/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1K -e 128M -f 2 -d bfloat16 -G 1 -w 10 -n 100 -c 1

#                                                         out-of-place                      in-place
#   size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#    (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

MSCCLPP:

    1024          1024    f8e4m3     sum      -1     5.09    0.20    0.35      0     5.10    0.20    0.35      0
    2048          2048    f8e4m3     sum      -1     5.46    0.37    0.66      0     5.36    0.38    0.67      0
    4096          4096    f8e4m3     sum      -1     5.52    0.74    1.30      0     5.55    0.74    1.29      0
    8192          8192    f8e4m3     sum      -1     5.61    1.46    2.56      0     5.60    1.46    2.56      0
   16384         16384    f8e4m3     sum      -1     5.99    2.73    4.78      0     5.86    2.80    4.90      0
   32768         32768    f8e4m3     sum      -1     7.52    4.36    7.63      0     7.49    4.37    7.65      0
   65536         65536    f8e4m3     sum      -1     7.73    8.48   14.83      0     7.58    8.64   15.13      0


    1024          1024    f8e5m2     sum      -1     5.06    0.20    0.35      0     5.12    0.20    0.35      0
    2048          2048    f8e5m2     sum      -1     5.36    0.38    0.67      0     5.34    0.38    0.67      0
    4096          4096    f8e5m2     sum      -1     5.51    0.74    1.30      0     5.57    0.74    1.29      0
    8192          8192    f8e5m2     sum      -1     5.61    1.46    2.56      0     5.59    1.47    2.57      0
   16384         16384    f8e5m2     sum      -1     5.89    2.78    4.87      0     5.87    2.79    4.88      0
   32768         32768    f8e5m2     sum      -1     7.52    4.36    7.63      0     7.49    4.37    7.65      0
   65536         65536    f8e5m2     sum      -1     7.59    8.64   15.12      0     7.60    8.62   15.09      0


    1024           512      half     sum      -1     5.25    0.20    0.34      0     5.04    0.20    0.36      0
    2048          1024      half     sum      -1     5.36    0.38    0.67      0     5.19    0.39    0.69      0
    4096          2048      half     sum      -1     5.51    0.74    1.30      0     5.41    0.76    1.32      0
    8192          4096      half     sum      -1     5.61    1.46    2.56      0     5.49    1.49    2.61      0
   16384          8192      half     sum      -1     5.91    2.77    4.85      0     5.78    2.84    4.96      0
   32768         16384      half     sum      -1     7.47    4.38    7.67      0     7.35    4.46    7.80      0
   65536         32768      half     sum      -1     7.59    8.63   15.11      0     7.49    8.75   15.31      0
  131072         65536      half     sum      -1     6.56   19.97   34.94      0     6.50   20.18   35.31      0
  262144        131072      half     sum      -1     7.03   37.29   65.26      0     6.88   38.08   66.65      0
  524288        262144      half     sum      -1     7.87   66.60  116.55      0     7.86   66.75  116.80      0
 1048576        524288      half     sum      -1     9.93  105.62  184.83      0     9.96  105.28  184.24      0
 2097152       1048576      half     sum      -1    13.71  152.93  267.63      0    13.63  153.83  269.20      0
 4194304       2097152      half     sum      -1    21.94  191.17  334.54      0    21.95  191.10  334.42      0
 8388608       4194304      half     sum      -1    37.51  223.65  391.39      0    37.59  223.15  390.52      0
16777216       8388608      half     sum      -1    68.51  244.89  428.56      0    68.39  245.31  429.28      0
33554432      16777216      half     sum      -1    130.4  257.35  450.36      0    130.3  257.50  450.62      0
67108864      33554432      half     sum      -1    253.1  265.12  463.96      0    252.7  265.56  464.72      0
134217728      67108864      half     sum      -1    496.6  270.25  472.94      0    499.2  268.85  470.49      0

    1024           512  bfloat16     sum      -1     5.22    0.20    0.34      0     5.14    0.20    0.35      0
    2048          1024  bfloat16     sum      -1     5.37    0.38    0.67      0     5.27    0.39    0.68      0
    4096          2048  bfloat16     sum      -1     5.54    0.74    1.29      0     5.48    0.75    1.31      0
    8192          4096  bfloat16     sum      -1     5.59    1.47    2.56      0     5.49    1.49    2.61      0
   16384          8192  bfloat16     sum      -1     5.92    2.77    4.84      0     5.79    2.83    4.95      0
   32768         16384  bfloat16     sum      -1     7.51    4.36    7.63      0     7.38    4.44    7.77      0
   65536         32768  bfloat16     sum      -1     7.59    8.64   15.12      0     7.49    8.75   15.32      0
  131072         65536  bfloat16     sum      -1     6.57   19.95   34.92      0     6.48   20.24   35.42      0
  262144        131072  bfloat16     sum      -1     7.19   36.45   63.79      0     6.87   38.18   66.81      0
  524288        262144  bfloat16     sum      -1     7.90   66.33  116.08      0     7.82   67.04  117.32      0
 1048576        524288  bfloat16     sum      -1     9.95  105.34  184.34      0     9.96  105.25  184.18      0
 2097152       1048576  bfloat16     sum      -1    13.60  154.23  269.90      0    13.64  153.73  269.02      0
 4194304       2097152  bfloat16     sum      -1    21.88  191.68  335.43      0    21.95  191.08  334.39      0
 8388608       4194304  bfloat16     sum      -1    37.57  223.27  390.72      0    37.69  222.56  389.47      0
16777216       8388608  bfloat16     sum      -1    68.37  245.38  429.41      0    68.28  245.70  429.97      0
33554432      16777216  bfloat16     sum      -1    130.3  257.55  450.71      0    130.1  257.96  451.43      0
67108864      33554432  bfloat16     sum      -1    252.7  265.56  464.72      0    252.3  266.04  465.56      0
134217728      67108864  bfloat16     sum      -1    497.2  269.94  472.39      0    497.4  269.83  472.20      0
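
For readers unfamiliar with the algbw and busbw columns, the following Python sketch recomputes them for one row of the tables above. It follows the nccl-tests convention as I understand it: algbw is message size divided by time, and for allreduce busbw is algbw scaled by 2(n-1)/n. The rank count n = 8 is an assumption taken from `mpirun -np 8` in the commands; the function names are hypothetical.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved per second, divided by 1e9."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for allreduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Row taken from the MSCCLPP f8e4m3 table: 65536 bytes, 7.73 us out-of-place.
    alg = algbw_gbps(65536, 7.73)
    bus = allreduce_busbw_gbps(alg, n_ranks=8)  # n_ranks=8 is assumed from -np 8
    print(f"algbw={alg:.2f} GB/s busbw={bus:.2f} GB/s")
```

The result agrees with the reported 8.48 and 14.83 GB/s up to rounding, since the time column in the table is itself rounded to two decimals.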

NCCL:
(Using MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce")

    1024          1024    f8e4m3     sum      -1    15.15    0.07    0.12      0    15.24    0.07    0.12      0
    2048          2048    f8e4m3     sum      -1    15.44    0.13    0.23      0    15.41    0.13    0.23      0
    4096          4096    f8e4m3     sum      -1    15.59    0.26    0.46      0    15.55    0.26    0.46      0
    8192          8192    f8e4m3     sum      -1    15.78    0.52    0.91      0    15.77    0.52    0.91      0
   16384         16384    f8e4m3     sum      -1    16.46    1.00    1.74      0    16.42    1.00    1.75      0
   32768         32768    f8e4m3     sum      -1    19.48    1.68    2.94      0    19.45    1.68    2.95      0
   65536         65536    f8e4m3     sum      -1    19.60    3.34    5.85      0    19.57    3.35    5.86      0
  131072        131072    f8e4m3     sum      -1    19.94    6.57   11.50      0    19.90    6.59   11.53      0
  262144        262144    f8e4m3     sum      -1    20.13   13.02   22.79      0    20.04   13.08   22.89      0
  524288        524288    f8e4m3     sum      -1    21.99   23.84   41.72      0    20.69   25.34   44.34      0
 1048576       1048576    f8e4m3     sum      -1    30.98   33.85   59.23      0    30.94   33.89   59.31      0
 2097152       2097152    f8e4m3     sum      -1    42.66   49.16   86.02      0    42.55   49.29   86.26      0
 4194304       4194304    f8e4m3     sum      -1    51.87   80.86  141.50      0    51.24   81.86  143.25      0
 8388608       8388608    f8e4m3     sum      -1    80.77  103.85  181.75      0    79.23  105.88  185.29      0
16777216      16777216    f8e4m3     sum      -1    125.7  133.47  233.58      0    122.7  136.71  239.25      0
33554432      33554432    f8e4m3     sum      -1    214.1  156.72  274.26      0    216.2  155.17  271.55      0
67108864      67108864    f8e4m3     sum      -1    375.0  178.95  313.16      0    368.6  182.06  318.61      0
134217728     134217728    f8e4m3     sum      -1    735.0  182.60  319.55      0    709.0  189.32  331.31      0

    1024          1024    f8e5m2     sum      -1    15.14    0.07    0.12      0    15.19    0.07    0.12      0
    2048          2048    f8e5m2     sum      -1    15.41    0.13    0.23      0    15.35    0.13    0.23      0
    4096          4096    f8e5m2     sum      -1    15.53    0.26    0.46      0    15.50    0.26    0.46      0
    8192          8192    f8e5m2     sum      -1    15.76    0.52    0.91      0    15.74    0.52    0.91      0
   16384         16384    f8e5m2     sum      -1    16.42    1.00    1.75      0    16.38    1.00    1.75      0
   32768         32768    f8e5m2     sum      -1    19.44    1.69    2.95      0    19.40    1.69    2.96      0
   65536         65536    f8e5m2     sum      -1    19.58    3.35    5.86      0    19.53    3.36    5.87      0
  131072        131072    f8e5m2     sum      -1    19.92    6.58   11.52      0    19.88    6.59   11.54      0
  262144        262144    f8e5m2     sum      -1    20.10   13.04   22.83      0    20.01   13.10   22.93      0
  524288        524288    f8e5m2     sum      -1    21.97   23.86   41.76      0    20.64   25.40   44.46      0
 1048576       1048576    f8e5m2     sum      -1    30.91   33.92   59.36      0    30.88   33.95   59.42      0
 2097152       2097152    f8e5m2     sum      -1    42.52   49.32   86.30      0    42.43   49.43   86.51      0
 4194304       4194304    f8e5m2     sum      -1    51.77   81.02  141.79      0    51.07   82.13  143.73      0
 8388608       8388608    f8e5m2     sum      -1    80.11  104.71  183.24      0    79.42  105.62  184.83      0
16777216      16777216    f8e5m2     sum      -1    126.3  132.86  232.50      0    124.1  135.17  236.54      0
33554432      33554432    f8e5m2     sum      -1    215.8  155.50  272.13      0    215.6  155.64  272.37      0
67108864      67108864    f8e5m2     sum      -1    374.9  178.99  313.23      0    367.3  182.72  319.77      0
134217728     134217728    f8e5m2     sum      -1    734.6  182.70  319.73      0    709.2  189.26  331.21      0

2. AMD MI300
rccl-tests with MSCCLPP

#                                                         out-of-place                      in-place
#   size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#    (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

MSCCLPP:

    1024          1024  fp8_e4m3     sum      -1     5.38    0.19    0.33      0     5.52    0.19    0.32      0
    2048          2048  fp8_e4m3     sum      -1     5.44    0.38    0.66      0     5.51    0.37    0.65      0
    4096          4096  fp8_e4m3     sum      -1     5.54    0.74    1.29      0     5.60    0.73    1.28      0
    8192          8192  fp8_e4m3     sum      -1     5.95    1.38    2.41      0     6.08    1.35    2.36      0
   16384         16384  fp8_e4m3     sum      -1     6.50    2.52    4.41      0     6.56    2.50    4.37      0
   32768         32768  fp8_e4m3     sum      -1     9.00    3.64    6.37      0     9.10    3.60    6.30      0
   65536         65536  fp8_e4m3     sum      -1     9.35    7.01   12.26      0     9.45    6.94   12.14      0
  131072        131072  fp8_e4m3     sum      -1    11.72   11.18   19.56      0    11.89   11.02   19.28      0
  262144        262144  fp8_e4m3     sum      -1    12.37   21.19   37.09      0    12.51   20.95   36.66      0
  524288        524288  fp8_e4m3     sum      -1    13.96   37.56   65.72      0    14.04   37.36   65.37      0
 1048576       1048576  fp8_e4m3     sum      -1    19.13   54.81   95.92      0    19.34   54.22   94.89      0
 2097152       2097152  fp8_e4m3     sum      -1    24.55   85.43  149.51      0    24.55   85.43  149.50      0
 4194304       4194304  fp8_e4m3     sum      -1    37.25  112.59  197.03      0    37.30  112.44  196.77      0
 8388608       8388608  fp8_e4m3     sum      -1    61.36  136.70  239.23      0    61.75  135.84  237.72      0
16777216      16777216  fp8_e4m3     sum      -1    109.3  153.53  268.68      0    109.5  153.19  268.09      0
33554432      33554432  fp8_e4m3     sum      -1    200.7  167.22  292.64      0    201.6  166.42  291.23      0
67108864      67108864  fp8_e4m3     sum      -1    388.9  172.56  301.98      0    389.5  172.32  301.55      0
134217728     134217728  fp8_e4m3     sum      -1    763.0  175.91  307.83      0    761.9  176.16  308.28      0

    1024          1024  fp8_e5m2     sum      -1     5.33    0.19    0.34      0     5.40    0.19    0.33      0
    2048          2048  fp8_e5m2     sum      -1     5.42    0.38    0.66      0     5.51    0.37    0.65      0
    4096          4096  fp8_e5m2     sum      -1     5.54    0.74    1.29      0     5.62    0.73    1.28      0
    8192          8192  fp8_e5m2     sum      -1     5.95    1.38    2.41      0     6.07    1.35    2.36      0
   16384         16384  fp8_e5m2     sum      -1     6.48    2.53    4.43      0     6.57    2.49    4.36      0
   32768         32768  fp8_e5m2     sum      -1     8.96    3.66    6.40      0     9.03    3.63    6.35      0
   65536         65536  fp8_e5m2     sum      -1     9.32    7.03   12.30      0     9.43    6.95   12.16      0
  131072        131072  fp8_e5m2     sum      -1    11.73   11.18   19.56      0    11.89   11.03   19.30      0
  262144        262144  fp8_e5m2     sum      -1    12.34   21.25   37.19      0    12.51   20.96   36.67      0
  524288        524288  fp8_e5m2     sum      -1    13.99   37.47   65.57      0    14.07   37.26   65.21      0
 1048576       1048576  fp8_e5m2     sum      -1    19.14   54.79   95.89      0    19.34   54.21   94.86      0
 2097152       2097152  fp8_e5m2     sum      -1    24.49   85.64  149.87      0    24.55   85.42  149.48      0
 4194304       4194304  fp8_e5m2     sum      -1    37.25  112.59  197.03      0    37.23  112.65  197.14      0
 8388608       8388608  fp8_e5m2     sum      -1    61.32  136.81  239.41      0    61.69  135.97  237.95      0
16777216      16777216  fp8_e5m2     sum      -1    109.2  153.60  268.79      0    109.6  153.12  267.97      0
33554432      33554432  fp8_e5m2     sum      -1    200.8  167.12  292.47      0    201.6  166.44  291.27      0
67108864      67108864  fp8_e5m2     sum      -1    389.1  172.48  301.84      0    389.3  172.40  301.70      0
134217728     134217728  fp8_e5m2     sum      -1    762.7  175.97  307.94      0    762.2  176.09  308.15      0

    1024           512      half     sum      -1     5.26    0.19    0.34      0     5.28    0.19    0.34      0
    2048          1024      half     sum      -1     5.37    0.38    0.67      0     5.44    0.38    0.66      0
    4096          2048      half     sum      -1     5.58    0.73    1.29      0     5.55    0.74    1.29      0
    8192          4096      half     sum      -1     5.92    1.38    2.42      0     6.00    1.36    2.39      0
   16384          8192      half     sum      -1     6.44    2.54    4.45      0     6.48    2.53    4.42      0
   32768         16384      half     sum      -1     8.77    3.74    6.54      0     8.87    3.69    6.46      0
   65536         32768      half     sum      -1     9.16    7.15   12.52      0     9.24    7.09   12.41      0
  131072         65536      half     sum      -1     9.79   13.39   23.44      0    10.01   13.09   22.91      0
  262144        131072      half     sum      -1    12.09   21.69   37.95      0    12.28   21.35   37.36      0
  524288        262144      half     sum      -1    13.76   38.10   66.67      0    13.86   37.82   66.19      0
 1048576        524288      half     sum      -1    19.17   54.70   95.73      0    19.33   54.24   94.91      0
 2097152       1048576      half     sum      -1    24.11   87.00  152.24      0    24.13   86.92  152.10      0
 4194304       2097152      half     sum      -1    36.58  114.66  200.65      0    36.45  115.07  201.37      0
 8388608       4194304      half     sum      -1    60.72  138.14  241.75      0    61.02  137.46  240.56      0
16777216       8388608      half     sum      -1    107.7  155.76  272.59      0    108.2  154.99  271.24      0
33554432      16777216      half     sum      -1    197.6  169.85  297.24      0    198.3  169.20  296.11      0
67108864      33554432      half     sum      -1    384.0  174.76  305.84      0    384.8  174.41  305.21      0
134217728      67108864      half     sum      -1    752.6  178.33  312.09      0    752.9  178.27  311.97      0

RCCL:
(Using MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce")

    1024          1024  fp8_e4m3     sum      -1    26.27    0.04    0.07      0    26.35    0.04    0.07      0
    2048          2048  fp8_e4m3     sum      -1    26.46    0.08    0.14      0    26.53    0.08    0.14      0
    4096          4096  fp8_e4m3     sum      -1    26.79    0.15    0.27      0    26.88    0.15    0.27      0
    8192          8192  fp8_e4m3     sum      -1    27.39    0.30    0.52      0    27.47    0.30    0.52      0
   16384         16384  fp8_e4m3     sum      -1    28.35    0.58    1.01      0    28.38    0.58    1.01      0
   32768         32768  fp8_e4m3     sum      -1    28.41    1.15    2.02      0    28.45    1.15    2.02      0
   65536         65536  fp8_e4m3     sum      -1    28.98    2.26    3.96      0    29.09    2.25    3.94      0
  131072        131072  fp8_e4m3     sum      -1    29.94    4.38    7.66      0    30.01    4.37    7.64      0
  262144        262144  fp8_e4m3     sum      -1    30.22    8.68   15.18      0    30.32    8.65   15.13      0
  524288        524288  fp8_e4m3     sum      -1    51.27   10.23   17.89      0    51.77   10.13   17.72      0
 1048576       1048576  fp8_e4m3     sum      -1    52.55   19.96   34.92      0    52.69   19.90   34.83      0
 2097152       2097152  fp8_e4m3     sum      -1    55.21   37.99   66.48      0    55.32   37.91   66.34      0
 4194304       4194304  fp8_e4m3     sum      -1    74.59   56.23   98.40      0    74.66   56.18   98.31      0
 8388608       8388608  fp8_e4m3     sum      -1    97.11   86.39  151.18      0    97.26   86.25  150.94      0
16777216      16777216  fp8_e4m3     sum      -1    143.2  117.16  205.03      0    143.5  116.90  204.57      0
33554432      33554432  fp8_e4m3     sum      -1    249.3  134.61  235.57      0    249.6  134.44  235.28      0
67108864      67108864  fp8_e4m3     sum      -1    448.0  149.80  262.15      0    448.4  149.65  261.89      0
134217728     134217728  fp8_e4m3     sum      -1    838.3  160.11  280.20      0    839.3  159.92  279.86      0

    1024          1024  fp8_e5m2     sum      -1    25.83    0.04    0.07      0    25.90    0.04    0.07      0
    2048          2048  fp8_e5m2     sum      -1    25.96    0.08    0.14      0    26.05    0.08    0.14      0
    4096          4096  fp8_e5m2     sum      -1    26.27    0.16    0.27      0    26.37    0.16    0.27      0
    8192          8192  fp8_e5m2     sum      -1    26.96    0.30    0.53      0    27.05    0.30    0.53      0
   16384         16384  fp8_e5m2     sum      -1    27.87    0.59    1.03      0    27.97    0.59    1.03      0
   32768         32768  fp8_e5m2     sum      -1    28.02    1.17    2.05      0    28.10    1.17    2.04      0
   65536         65536  fp8_e5m2     sum      -1    28.96    2.26    3.96      0    29.09    2.25    3.94      0
  131072        131072  fp8_e5m2     sum      -1    30.06    4.36    7.63      0    30.12    4.35    7.61      0
  262144        262144  fp8_e5m2     sum      -1    30.27    8.66   15.15      0    30.37    8.63   15.11      0
  524288        524288  fp8_e5m2     sum      -1    50.78   10.32   18.07      0    51.24   10.23   17.91      0
 1048576       1048576  fp8_e5m2     sum      -1    51.95   20.18   35.32      0    52.10   20.13   35.22      0
 2097152       2097152  fp8_e5m2     sum      -1    54.73   38.32   67.05      0    54.82   38.25   66.94      0
 4194304       4194304  fp8_e5m2     sum      -1    74.15   56.56   98.99      0    74.22   56.51   98.90      0
 8388608       8388608  fp8_e5m2     sum      -1    96.55   86.88  152.05      0    96.70   86.75  151.80      0
16777216      16777216  fp8_e5m2     sum      -1    142.7  117.61  205.82      0    143.0  117.35  205.36      0
33554432      33554432  fp8_e5m2     sum      -1    248.6  134.98  236.22      0    248.9  134.79  235.89      0
67108864      67108864  fp8_e5m2     sum      -1    447.1  150.10  262.68      0    447.7  149.91  262.34      0
134217728     134217728  fp8_e5m2     sum      -1    837.4  160.28  280.49      0    838.9  159.99  279.98      0
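
To make the comparison between the MSCCLPP and NCCL/RCCL fallback runs easier to read, here is a small Python sketch that computes the latency ratio for a few rows. The timings are copied from the out-of-place time column of the f8e4m3 / fp8_e4m3 tables above (1024 B and 65536 B rows only); the dictionary layout and function name are hypothetical, and the ratio is just baseline time divided by MSCCLPP time.

```python
# (platform, size in bytes) -> (MSCCLPP time in us, fallback NCCL/RCCL time in us),
# copied from the out-of-place columns of the FP8 E4M3 tables above.
TIMES_US = {
    ("NVIDIA H100", 1024): (5.09, 15.15),
    ("NVIDIA H100", 65536): (7.73, 19.60),
    ("AMD MI300", 1024): (5.38, 26.27),
    ("AMD MI300", 65536): (9.35, 28.98),
}


def speedup(baseline_us: float, mscclpp_us: float) -> float:
    """Latency ratio: values above 1 mean the MSCCLPP run was faster."""
    return baseline_us / mscclpp_us


if __name__ == "__main__":
    for (platform, size), (mscclpp_us, baseline_us) in TIMES_US.items():
        ratio = speedup(baseline_us, mscclpp_us)
        print(f"{platform:12s} {size:6d} B  {ratio:.2f}x")
```

For these rows the ratio is roughly 2.5x to 3x on the H100 and roughly 3x to 5x on the MI300.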

Contributor

@Binyang2014 Binyang2014 left a comment


@Binyang2014
Contributor

We need to move our reduction functions into a unified place. We can do this in another PR.

@azure-pipelines

There was an error handling pipeline event 79c0b7b6-6c02-4298-8105-6a2b4742057f.

@seagater seagater merged commit a38c2ee into main Oct 27, 2025
14 checks passed
@seagater seagater deleted the qinghuazhou/allreduce_fp8 branch October 27, 2025 21:51