The code could benefit from the following changes:
A CLI that replaces the following compile-time constants and provides the features listed below them:
#define M 4096 // Rows of A, Rows of C
#define N 8192 // Columns of B, Columns of C
#define K 1024 // Columns of A, Rows of B
#define NGPUS 2 // Number of GPUs to use for computation
- variable sizes of M, N, K
- number of GPUs to use
- error checking for matrix dimensions from CLI values
- device detection to verify that the requested number of GPUs is actually available (e.g., 4 are requested but only 2 are present)
- a boolean flag for whether to run the host matmul; when it is off, skip the correctness assertion and the CPU timing. This host path should be pulled into its own function.
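
The items above could be sketched roughly as follows. This is a minimal illustration in plain C, not the project's actual code: the `Config` struct, the `parse_config` and `parse_positive` helpers, and the `-m`, `-n`, `-k`, `-g` and `--skip-host` flag names are all hypothetical. The available device count is passed in by the caller so the parsing logic stays testable without a GPU; in the CUDA program that value would presumably come from `cudaGetDeviceCount`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical run-time replacement for the M, N, K and NGPUS macros. */
typedef struct {
    size_t m;      /* rows of A, rows of C */
    size_t n;      /* columns of B, columns of C */
    size_t k;      /* columns of A, rows of B */
    int ngpus;     /* number of GPUs to use */
    bool run_host; /* run the host matmul, check the result, time the CPU */
} Config;

/* Parse a strictly positive decimal integer; reject junk, overflow and <= 0. */
static bool parse_positive(const char *s, long *out) {
    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v <= 0) return false;
    *out = v;
    return true;
}

/* True when a * b would not fit in size_t (a is always non-zero here). */
static bool mul_overflows(size_t a, size_t b) { return b > SIZE_MAX / a; }

/*
 * Fill cfg from argv. available_gpus is supplied by the caller (assumption:
 * the real program would obtain it from cudaGetDeviceCount()).
 * Returns true on success; on failure writes a message into err.
 */
bool parse_config(int argc, char **argv, int available_gpus,
                  Config *cfg, char *err, size_t errlen) {
    assert(cfg != NULL && err != NULL && errlen > 0);
    err[0] = '\0';
    /* Defaults mirror the current #define values. */
    *cfg = (Config){.m = 4096, .n = 8192, .k = 1024, .ngpus = 2, .run_host = true};
    long gpus = cfg->ngpus;

    for (int i = 1; i < argc; i++) {
        const char *flag = argv[i];
        if (strcmp(flag, "--skip-host") == 0) { cfg->run_host = false; continue; }
        bool known = strcmp(flag, "-m") == 0 || strcmp(flag, "-n") == 0 ||
                     strcmp(flag, "-k") == 0 || strcmp(flag, "-g") == 0;
        if (!known) {
            snprintf(err, errlen, "unknown option '%s'", flag);
            return false;
        }
        if (i + 1 >= argc) {
            snprintf(err, errlen, "missing value for %s", flag);
            return false;
        }
        long v;
        if (!parse_positive(argv[++i], &v)) {
            snprintf(err, errlen, "%s expects a positive integer, got '%s'", flag, argv[i]);
            return false;
        }
        switch (flag[1]) {
        case 'm': cfg->m = (size_t)v; break;
        case 'n': cfg->n = (size_t)v; break;
        case 'k': cfg->k = (size_t)v; break;
        default:  gpus = v; break; /* -g */
        }
    }

    /* Device check: also covers the default when -g is not given. */
    if (gpus > available_gpus) {
        snprintf(err, errlen, "requested %ld GPUs but only %d available", gpus, available_gpus);
        return false;
    }
    cfg->ngpus = (int)gpus;

    /* Dimension check: element counts of A (m*k), B (k*n), C (m*n) must fit in size_t. */
    if (mul_overflows(cfg->m, cfg->k) || mul_overflows(cfg->k, cfg->n) ||
        mul_overflows(cfg->m, cfg->n)) {
        snprintf(err, errlen, "matrix element count overflows size_t");
        return false;
    }
    return true;
}
```

A caller would invoke `parse_config(argc, argv, device_count, &cfg, err, sizeof err)` at the top of `main`, print `err` and exit on failure, and then use `cfg.run_host` to decide whether to call the single function that runs the host matmul, times it, and asserts correctness. Any further constraint, such as requiring a dimension to divide evenly across GPUs, depends on how the existing code partitions the work and is deliberately left out of this sketch.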