I'm comparing host and device matrix multiplication on a large matrix, but my host matmul is a naive implementation that I wrote only to check correctness.
It would be nice to optimize the host matrix multiplication so that the performance comparison is fairer.
This is my current function:
// Host function for matrix multiplication: C (m x n) = A (m x k) * B (k x n), row-major
void matMulHost(const int* A, const int* B, int* C, int m, int n, int k) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int p = 0; p < k; p++) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
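
One low-effort change that might help, shown here only as a sketch: in the function above, the innermost loop reads B with a stride of n elements (B[p * n + j] with p varying), which tends to be cache-unfriendly for large matrices. Reordering the loops to i-p-j makes the innermost loop walk both B and C with unit stride. The name matMulHostReordered is hypothetical, and the layout is assumed to be the same row-major layout as matMulHost.

```cpp
#include <algorithm>  // std::fill
#include <cstddef>    // std::size_t

// Sketch (hypothetical name): computes the same C = A * B as matMulHost,
// but with the loops ordered i-p-j instead of i-j-p.
// Assumes row-major storage: A is m x k, B is k x n, C is m x n.
void matMulHostReordered(const int* A, const int* B, int* C, int m, int n, int k) {
    // C is accumulated in place, so it has to start at zero.
    std::fill(C, C + static_cast<std::size_t>(m) * static_cast<std::size_t>(n), 0);

    for (int i = 0; i < m; i++) {
        for (int p = 0; p < k; p++) {
            // A[i][p] is constant for the whole inner loop.
            const int a = A[i * k + p];
            // Inner loop touches B[p][0..n) and C[i][0..n) contiguously.
            for (int j = 0; j < n; j++) {
                C[i * n + j] += a * B[p * n + j];
            }
        }
    }
}
```

It takes the same arguments as matMulHost, so it can be swapped in directly and checked against the original output. Whether it is actually faster depends on the matrix sizes, the compiler, and the optimization flags, so it is worth timing both versions. Cache blocking (tiling) and multithreading would be further steps beyond this sketch.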