42.5 Performance and Benchmarks

Use JMH to measure what blocking and layout changes actually gain, and tune parameters such as block size to your own hardware; results from one machine rarely transfer unchanged to another.


42.5.1 JMH Outline: GEMM

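// Requires org.openjdk.jmh:jmh-core and jmh-generator-annprocess
import org.openjdk.jmh.annotations.*;
import java.util.Random;
import java.util.concurrent.TimeUnit;

// A single 512x512x512 GEMM call takes milliseconds, so average time per
// call is easier to read here than throughput in ops/ms.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 1)       // let the JIT compile the hot loops first
@Measurement(iterations = 5, time = 1)
@Fork(1)
@State(Scope.Thread)
public class GemmBench {
  // Non-final fields, so the JIT cannot treat the dimensions as constants.
  int M = 512, K = 512, N = 512;
  float[] A, B, C1, C2;

  @Setup
  public void setup() {
    A = new float[M * K]; B = new float[K * N];
    C1 = new float[M * N]; C2 = new float[M * N];
    Random r = new Random(42);          // fixed seed for repeatable inputs
    for (int i = 0; i < A.length; i++) A[i] = r.nextFloat();
    for (int i = 0; i < B.length; i++) B[i] = r.nextFloat();
  }

  // Returning the output array hands the result to JMH, so the
  // computation cannot be discarded as dead code.
  @Benchmark public float[] naive()   { gemmNaive(A, M, K, B, N, C1); return C1; }

  // Assumes gemmBlocked overwrites C rather than accumulating into it;
  // the last argument is the block size.
  @Benchmark public float[] blocked() { gemmBlocked(A, M, K, B, N, C2, 64); return C2; }

  // include gemmNaive/gemmBlocked implementations or import them
}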
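
The benchmark above calls gemmNaive and gemmBlocked without showing them. The following is a minimal sketch of what they might look like, assuming row-major storage (A is M x K, B is K x N, C is M x N) and the argument order used in the benchmark calls. The class name GemmSketch and the parameter name bs are illustrative and not taken from the text; the kernels used elsewhere in the chapter may differ.

```java
// Hypothetical helper class; signatures are inferred from the benchmark calls.
final class GemmSketch {
  private GemmSketch() {}

  /** C = A * B with the textbook i-j-k loop order; overwrites C. */
  public static void gemmNaive(float[] A, int M, int K, float[] B, int N, float[] C) {
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        float sum = 0f;
        // B is read with stride N here, which is what blocking tries to mitigate.
        for (int k = 0; k < K; k++) sum += A[i * K + k] * B[k * N + j];
        C[i * N + j] = sum;
      }
    }
  }

  /** C = A * B computed tile by tile with square tiles of side bs; overwrites C. */
  public static void gemmBlocked(float[] A, int M, int K, float[] B, int N, float[] C, int bs) {
    java.util.Arrays.fill(C, 0, M * N, 0f);   // results are accumulated per tile below
    for (int ii = 0; ii < M; ii += bs) {
      int iMax = Math.min(ii + bs, M);        // clamp so sizes need not divide evenly
      for (int kk = 0; kk < K; kk += bs) {
        int kMax = Math.min(kk + bs, K);
        for (int jj = 0; jj < N; jj += bs) {
          int jMax = Math.min(jj + bs, N);
          for (int i = ii; i < iMax; i++) {
            for (int k = kk; k < kMax; k++) {
              float a = A[i * K + k];
              // Unit-stride walk over one row of the B tile and one row of the C tile.
              for (int j = jj; j < jMax; j++) C[i * N + j] += a * B[k * N + j];
            }
          }
        }
      }
    }
  }
}
```

Because the two kernels add terms in a different order, their float results can differ in the last bits; compare them with a small tolerance rather than exact equality before trusting any timing difference.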

42.5.2 Tuning

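  • Pick block sizes so that the tiles of A, B, and C in use at once fit in the target cache level (L2 or L3), and benchmark a few candidates rather than assuming one size is best
  • Avoid false sharing when parallelizing: give each thread its own contiguous region of C to write, so that no two threads write to the same cache line
  • Warm up before measuring so that JIT compilation and cold caches do not distort microbenchmark results; JMH's warmup iterations do this for you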
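
As a rough starting point for the first item, one common rule of thumb (an assumption here, not a rule stated in the text) is that three square tiles, one each from A, B, and C, should fit in the target cache: 3 * bs^2 * elementBytes <= cacheBytes. The sketch below turns that into arithmetic. BlockSizeSketch and its method names are hypothetical, and the cache sizes in the usage note are example values to be replaced with those of your own CPU.

```java
// Hypothetical helper for estimating a starting block size; not a substitute for benchmarking.
final class BlockSizeSketch {
  private BlockSizeSketch() {}

  /** Bytes occupied by three bs x bs tiles (one each from A, B, and C). */
  public static long footprintBytes(int bs, int elemBytes) {
    return 3L * bs * bs * elemBytes;
  }

  /**
   * Largest block size that is a multiple of `multiple` and whose three tiles
   * fit in cacheBytes. Returns `multiple` itself if the cache is too small
   * for even one such tile set.
   */
  public static int maxBlockSize(long cacheBytes, int elemBytes, int multiple) {
    int bs = (int) Math.floor(Math.sqrt((double) cacheBytes / (3.0 * elemBytes)));
    return Math.max(multiple, (bs / multiple) * multiple);
  }
}
```

For float elements (4 bytes), footprintBytes(64, 4) is 49,152 bytes (48 KiB), so the block size 64 used in the benchmark overflows a 32 KiB L1 data cache but fits comfortably in a 256 KiB L2. With those example sizes, maxBlockSize(32 * 1024, 4, 8) gives 48 and maxBlockSize(256 * 1024, 4, 8) gives 144. Treat these as upper bounds to start a sweep from, since other data competes for the same cache.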

42.5.3 Vector API Bonus

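If the Vector API is available (the incubating jdk.incubator.vector module, enabled with --add-modules jdk.incubator.vector), use it in the innermost loop of the blocked kernel. Blocking keeps the data in cache; vectorizing the inner loop then processes several floats per instruction, which raises arithmetic throughput further.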