Optimized CPU MatMul for TensorParallelism/02_matmul_tp/matmul_tp_big.cu #18

I compare a host (CPU) matmul against the device (GPU) matmul on a large matrix. However, my host matmul is a naive triple loop that was written only to check correctness.

It would be great to optimize the host matrix multiplication so that the performance comparison is fairer.

This is my current function:
```c++
// Host function for matrix multiplication
void matMulHost(const int* A, const int* B, int* C, int m, int n, int k) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int p = 0; p < k; p++) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```
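One possible starting point is sketched below: a loop-reordered, cache-blocked version with the same signature. The name `matMulHostTiled` and the tile size `TILE = 64` are assumptions for illustration, not something already in the repo, and the tile size would need tuning on the actual machine. The idea is to make the innermost loop walk `B` and `C` along rows (contiguous in row-major layout) instead of striding down a column of `B`, and to work on small tiles so the data is reused while it is still in cache.

```c++
#include <algorithm>  // std::fill, std::min
#include <cassert>
#include <cstddef>    // std::size_t

// Hypothetical tiled variant of matMulHost (same arguments, same result).
// A is m x k, B is k x n, C is m x n, all row-major.
void matMulHostTiled(const int* A, const int* B, int* C, int m, int n, int k) {
    assert(m >= 0 && n >= 0 && k >= 0);
    constexpr int TILE = 64;  // assumed tile size; tune for the target cache

    // C is accumulated in place, so it has to start at zero.
    std::fill(C, C + static_cast<std::size_t>(m) * n, 0);

    for (int ii = 0; ii < m; ii += TILE) {
        const int iEnd = std::min(ii + TILE, m);
        for (int pp = 0; pp < k; pp += TILE) {
            const int pEnd = std::min(pp + TILE, k);
            for (int jj = 0; jj < n; jj += TILE) {
                const int jEnd = std::min(jj + TILE, n);
                // Multiply one tile of A by one tile of B into a tile of C.
                for (int i = ii; i < iEnd; i++) {
                    for (int p = pp; p < pEnd; p++) {
                        const int a = A[i * k + p];  // reused across the j loop
                        // Contiguous access to both B and C rows.
                        for (int j = jj; j < jEnd; j++) {
                            C[i * n + j] += a * B[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

It can be called exactly like `matMulHost`, which makes it easy to validate against the naive version before timing it. Parallelizing the outer `ii` loop across threads and compiling with optimizations enabled would be natural next steps, but those are not shown here.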
