Inductor may pad mm for some shapes, i.e. replace a matmul with zeros, cat, matmul, and slice. On A100 with fp16 fp_GPT2, this feature can improve performance by ~30%. Check whether it is useful on XPU. https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/pad_mm.py
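For reference, a minimal sketch of the shape-padding idea (not Inductor's actual pass; `padded_mm` and the alignment `multiple` are assumptions for illustration): pad the operands with zeros via cat so the output dimensions hit a hardware-friendly multiple, run the matmul on the padded tensors, then slice back to the original output shape. The zero rows/columns contribute nothing to the valid region of the product, so the result is unchanged.

```python
import torch

def padded_mm(a: torch.Tensor, b: torch.Tensor, multiple: int = 8) -> torch.Tensor:
    # Hypothetical sketch of the pad_mm transformation: zeros -> cat -> matmul -> slice.
    m, k = a.shape
    _, n = b.shape
    pad_m = (multiple - m % multiple) % multiple
    pad_n = (multiple - n % multiple) % multiple
    if pad_m:
        # Append zero rows so the M dimension is a multiple of `multiple`.
        a = torch.cat([a, a.new_zeros(pad_m, k)], dim=0)
    if pad_n:
        # Append zero columns so the N dimension is a multiple of `multiple`.
        b = torch.cat([b, b.new_zeros(k, pad_n)], dim=1)
    out = a @ b
    # Slice away the padded rows/columns to recover the original output shape.
    return out[:m, :n]

# Usage: an awkward fp16 shape that benefits from alignment on tensor-core-style hardware.
a = torch.randn(1023, 512, dtype=torch.float16)
b = torch.randn(512, 2047, dtype=torch.float16)
torch.testing.assert_close(padded_mm(a, b), a @ b)
```

Whether the padded kernel is actually faster depends on the backend's matmul implementation, which is exactly what needs to be measured on XPU.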