42.3 Matrix Multiplication (Blocked GEMM)

Naïve matrix multiply is memory‑bound; blocking (tiling) improves cache locality.


42.3.1 Naïve GEMM (C ← A·B)

void gemmNaive(float[] A, int M, int K, float[] B, int N, float[] C) {
  for (int i = 0; i < M; i++) {
    for (int k = 0; k < K; k++) {
      float aik = A[i*K + k];
      for (int j = 0; j < N; j++) C[i*N + j] += aik * B[k*N + j];
    }
  }
}

42.3.2 Blocked GEMM

void gemmBlocked(float[] A, int M, int K, float[] B, int N, float[] C, int BS) {
  for (int ii = 0; ii < M; ii += BS) {
    for (int kk = 0; kk < K; kk += BS) {
      for (int jj = 0; jj < N; jj += BS) {
        int iMax = Math.min(ii + BS, M);
        int kMax = Math.min(kk + BS, K);
        int jMax = Math.min(jj + BS, N);
        for (int i = ii; i < iMax; i++) {
          for (int k = kk; k < kMax; k++) {
            float aik = A[i*K + k];
            for (int j = jj; j < jMax; j++) C[i*N + j] += aik * B[k*N + j];
          }
        }
      }
    }
  }
}

Choose BS (block size) based on cache; e.g., 32 or 64 often works well.


42.3.3 Validation

Compare gemmNaive vs gemmBlocked on random inputs and assert small relative error.

float relError(float[] x, float[] y) {
  float num = 0f, den = 0f;
  for (int i = 0; i < x.length; i++) { float d = x[i] - y[i]; num += Math.abs(d); den += Math.abs(y[i]); }
  return num / Math.max(den, 1e-6f);
}