According to Figure 8 of the article "Offloading communication control logic in GPU accelerated applications", the KI model is faster than the SA model. However, when I run the libmp benchmark mp_pingpong_all on my Ubuntu machine with a P4 GPU and an mlx5 NIC, the KI latency is almost double that of SA. Were the results in the article obtained with something other than the libmp benchmarks? If so, which test samples did the article use?