Hi, thanks for your great work on this project!
I have a question regarding the Heuristic-GPU exploring configuration in matmul.py.
I noticed that the loop_order is set to knm, and I’m trying to understand the reasoning behind this choice.
In the NVIDIA CUTLASS documentation, the k dimension is in the inner loop, while in your code it is the outer loop.
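To make the comparison concrete, here is a minimal sketch of the two orders as I understand them. This is my own illustration in plain Python, not code from matmul.py or CUTLASS. The function names are hypothetical, and I am assuming that knm means k is the outermost loop, then n, with m innermost.

```python
def matmul_mnk(A, B):
    """k innermost: each C[m][n] is completed as one dot product."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0
            for k in range(K):
                acc += A[m][k] * B[k][n]
            C[m][n] = acc
    return C


def matmul_knm(A, B):
    """k outermost (assumed meaning of knm): each k step adds a
    rank-1 update to all of C, so every C[m][n] is revisited K times."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for k in range(K):
        for n in range(N):
            for m in range(M):
                C[m][n] += A[m][k] * B[k][n]
    return C
```

Both orders compute the same result; they differ only in when each element of C is accumulated, which is why I am curious about the performance trade-off.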
Is there a specific performance consideration or hardware constraint that makes the knm order preferable? Or did you run an exhaustive search and find that knm performs best on typical GPU architectures?
I would really appreciate it if you could provide some insights or point me to relevant documentation.
Thanks again for maintaining this project!