Optimized CPU MatMul for TensorParallelism/02_matmul_tp/matmul_tp_big.cu #18

@dkennetzoracle

I compare host matmul against device matmul for a large matrix; however, my host matmul is a naive implementation that I only wrote to check correctness.

It would be very cool to optimize the host matrix multiplication so the host vs. device perf comparison is fairer.

This is my current function:

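// Host function for matrix multiplication (naive triple loop, correctness reference).
// Row-major layout: A is m x k, B is k x n, C is m x n.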
void matMulHost(const int* A, const int* B, int* C, int m, int n, int k) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int p = 0; p < k; p++) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
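
One possible starting point, as a sketch rather than a tuned implementation: the inner loop above reads B with a stride of n, which tends to be cache-unfriendly for large matrices. Reordering the loops to i-p-j makes the innermost accesses to B and C contiguous, and tiling keeps the working set small. The function name `matMulHostTiled` and the tile size of 64 are assumptions for illustration and are not from the repository; the best tile size depends on the target CPU's caches.

```cpp
#include <algorithm>  // std::min, std::fill
#include <cassert>    // assert
#include <cstddef>    // std::size_t

// Sketch of a cache-blocked host matmul with the same contract as matMulHost:
// row-major, A is m x k, B is k x n, C is m x n.
// Hypothetical name; TILE = 64 is an untuned assumption.
void matMulHostTiled(const int* A, const int* B, int* C, int m, int n, int k) {
    assert(m >= 0 && n >= 0 && k >= 0);
    constexpr int TILE = 64;

    // C is accumulated into across p-tiles, so it must start at zero.
    std::fill(C, C + static_cast<std::size_t>(m) * n, 0);

    for (int ii = 0; ii < m; ii += TILE) {
        const int iEnd = std::min(ii + TILE, m);
        for (int pp = 0; pp < k; pp += TILE) {
            const int pEnd = std::min(pp + TILE, k);
            for (int jj = 0; jj < n; jj += TILE) {
                const int jEnd = std::min(jj + TILE, n);
                for (int i = ii; i < iEnd; i++) {
                    int* cRow = C + static_cast<std::size_t>(i) * n;
                    for (int p = pp; p < pEnd; p++) {
                        // A[i][p] is reused across the whole j range.
                        const int a = A[static_cast<std::size_t>(i) * k + p];
                        const int* bRow = B + static_cast<std::size_t>(p) * n;
                        // Contiguous reads of B and writes of C; this loop
                        // is a candidate for compiler auto-vectorization.
                        for (int j = jj; j < jEnd; j++) {
                            cRow[j] += a * bRow[j];
                        }
                    }
                }
            }
        }
    }
}
```

The existing `matMulHost` can stay as the reference to validate this against. Compiling the host code with optimizations enabled, parallelizing the outer `ii` loop across threads, or calling a BLAS library would likely matter as well, but those are outside this sketch.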
