42.3 Matrix Multiplication (Blocked GEMM)
Naïve matrix multiply is memory‑bound; blocking (tiling) improves cache locality.
42.3.1 Naïve GEMM (C ← A·B)
void gemmNaive(float[] A, int M, int K, float[] B, int N, float[] C) {
for (int i = 0; i < M; i++) {
for (int k = 0; k < K; k++) {
float aik = A[i*K + k];
for (int j = 0; j < N; j++) C[i*N + j] += aik * B[k*N + j];
}
}
}
42.3.2 Blocked GEMM
void gemmBlocked(float[] A, int M, int K, float[] B, int N, float[] C, int BS) {
for (int ii = 0; ii < M; ii += BS) {
for (int kk = 0; kk < K; kk += BS) {
for (int jj = 0; jj < N; jj += BS) {
int iMax = Math.min(ii + BS, M);
int kMax = Math.min(kk + BS, K);
int jMax = Math.min(jj + BS, N);
for (int i = ii; i < iMax; i++) {
for (int k = kk; k < kMax; k++) {
float aik = A[i*K + k];
for (int j = jj; j < jMax; j++) C[i*N + j] += aik * B[k*N + j];
}
}
}
}
}
}
Choose BS (block size) based on cache; e.g., 32 or 64 often works well.
42.3.3 Validation
Compare gemmNaive vs gemmBlocked on random inputs and assert small relative error.
float relError(float[] x, float[] y) {
float num = 0f, den = 0f;
for (int i = 0; i < x.length; i++) { float d = x[i] - y[i]; num += Math.abs(d); den += Math.abs(y[i]); }
return num / Math.max(den, 1e-6f);
}