Why are `l2_loop_order` and `l1_loop_order` in Heuristic-GPU set to `knm`? #12

Hi, thanks for your great work on this project!

I have a question regarding the Heuristic-GPU exploring configuration in [matmul.py](https://github.com/PrincetonUniversity/LLMCompass/blob/main/software_model/matmul.py#L558).
I noticed that the loop_order is set to `knm`, and I’m trying to understand the reasoning behind this choice.

In the [NVIDIA CUTLASS documentation](https://docs.nvidia.com/cutlass/media/docs/cpp/efficient_gemm.html), the k dimension is in the inner loop, while in your code it is set as the outer loop.
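
To make the question concrete, here is a minimal Python sketch of how I read the two orders. It assumes that the loop-order string lists the loops from outermost to innermost, which is my reading and may not match the actual implementation. `tile_order` and `output_tile_switches` are hypothetical helpers written for illustration only; they are not functions from LLMCompass or CUTLASS.

```python
from itertools import product


def tile_order(loop_order, m_tiles, n_tiles, k_tiles):
    """Yield (m, n, k) tile indices for a given loop order.

    Assumption: loop_order is read from outermost to innermost loop,
    so "knm" means k is the outermost loop and m is the innermost.
    """
    extents = {"m": m_tiles, "n": n_tiles, "k": k_tiles}
    # itertools.product varies its last argument fastest, so the first
    # character of loop_order behaves as the outermost loop.
    ranges = [range(extents[dim]) for dim in loop_order]
    for idx in product(*ranges):
        pos = dict(zip(loop_order, idx))
        yield pos["m"], pos["n"], pos["k"]


def output_tile_switches(loop_order, m_tiles, n_tiles, k_tiles):
    """Count how often the active output tile (m, n) changes between steps."""
    steps = [(m, n) for m, n, _ in tile_order(loop_order, m_tiles, n_tiles, k_tiles)]
    return sum(1 for prev, cur in zip(steps, steps[1:]) if prev != cur)


knm = list(tile_order("knm", 2, 2, 2))
mnk = list(tile_order("mnk", 2, 2, 2))

# Both orders visit the same set of tiles, only in a different sequence.
print(knm[:4])
print(mnk[:4])
print(output_tile_switches("knm", 2, 2, 2), output_tile_switches("mnk", 2, 2, 2))
```

With k outermost, every output tile is revisited once per k step, so partial sums have to be kept or written back between visits. With k innermost, each output tile is finished before the next one starts. This difference is what prompted my question.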

Is there a specific performance consideration or hardware constraint that makes the `knm` order preferable? Or did you run an exhaustive search and find that `knm` is the best choice on typical GPU architectures?

I would really appreciate it if you could provide some insights or point me to relevant documentation.
Thanks again for maintaining this project!
