42.5 Performance and Benchmarks

Use JMH (the Java Microbenchmark Harness) to measure the gains from blocking and layout changes, then tune parameters such as the block size to the cache sizes of your hardware.

42.5.1 JMH Outline: GEMM

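// Requires org.openjdk.jmh:jmh-core and jmh-generator-annprocess
import org.openjdk.jmh.annotations.*;
import java.util.Random;
import java.util.concurrent.TimeUnit;

// A 512x512x512 multiply typically takes longer than 1 ms, so average time
// per call is easier to read than a fractional ops/ms throughput figure.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)       // iteration and fork counts are illustrative
@Measurement(iterations = 5)
@Fork(1)
@State(Scope.Thread)
public class GemmBench {
  float[] A, B, C1, C2;
  int M = 512, K = 512, N = 512;

  @Setup
  public void setup() {
    A = new float[M * K];
    B = new float[K * N];
    C1 = new float[M * N];
    C2 = new float[M * N];
    Random r = new Random(42); // fixed seed keeps inputs reproducible
    for (int i = 0; i < A.length; i++) A[i] = r.nextFloat();
    for (int i = 0; i < B.length; i++) B[i] = r.nextFloat();
  }

  // Returning the output array hands the result to JMH, so the JIT cannot
  // treat the multiplication as dead code.
  @Benchmark public float[] naive() { gemmNaive(A, M, K, B, N, C1); return C1; }
  @Benchmark public float[] blocked() { gemmBlocked(A, M, K, B, N, C2, 64); return C2; }

  // include gemmNaive/gemmBlocked implementations or import them
}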
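
The benchmark above calls gemmNaive and gemmBlocked without defining them. The following is a minimal sketch of what they could look like, not necessarily the implementations used elsewhere in this chapter. It assumes row-major storage (A is M×K, B is K×N, C is M×N) and that both kernels overwrite C; the class name GemmKernels is hypothetical, and the parameter order simply mirrors the calls in GemmBench.

```java
import java.util.Arrays;

// Hypothetical helper class; signatures mirror the calls made in GemmBench.
final class GemmKernels {
  private GemmKernels() {}

  /** C = A * B using the textbook i-j-k loop order; all matrices are row-major. */
  static void gemmNaive(float[] a, int m, int k, float[] b, int n, float[] c) {
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        float sum = 0f;
        // Walks down a column of B, so consecutive reads are n floats apart.
        for (int p = 0; p < k; p++) sum += a[i * k + p] * b[p * n + j];
        c[i * n + j] = sum;
      }
    }
  }

  /** C = A * B computed tile by tile; bs is the tile edge length. */
  static void gemmBlocked(float[] a, int m, int k, float[] b, int n, float[] c, int bs) {
    Arrays.fill(c, 0, m * n, 0f); // tiles accumulate into C, so start from zero
    for (int ii = 0; ii < m; ii += bs) {
      int iMax = Math.min(ii + bs, m);
      for (int pp = 0; pp < k; pp += bs) {
        int pMax = Math.min(pp + bs, k);
        for (int jj = 0; jj < n; jj += bs) {
          int jMax = Math.min(jj + bs, n);
          // Inner i-p-j order reads B and writes C along contiguous rows.
          for (int i = ii; i < iMax; i++) {
            for (int p = pp; p < pMax; p++) {
              float aip = a[i * k + p];
              for (int j = jj; j < jMax; j++) c[i * n + j] += aip * b[p * n + j];
            }
          }
        }
      }
    }
  }
}
```

With these in place, GemmBench can reach the kernels through a static import of GemmKernels (both classes in the same named package), or the two methods can be copied into the benchmark class. Because gemmBlocked zeroes C on every call, repeated benchmark invocations do not accumulate stale results.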

42.5.2 Tuning

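Pick a block size whose working set fits in L2/L3 cache: three bs×bs float tiles (one each of A, B, and C) occupy 3·bs²·4 bytes, which is 48 KiB at bs = 64.
When parallelizing, have each thread write a distinct, contiguous region of C so that threads do not contend for the same cache lines (false sharing).
Run warmup iterations before measuring so that JIT compilation and cold caches do not distort microbenchmark results.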
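
The first two points can be sketched in code. The example below is illustrative only: the class GemmTuning and its method names are hypothetical, the cache size passed to maxBlockSize is something you would look up for your own CPU, and the parallel kernel uses a parallel IntStream as one simple way to distribute work (a dedicated thread pool would also do). It assumes the same row-major layout as the benchmark.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical tuning helpers; names are for illustration only.
final class GemmTuning {
  private GemmTuning() {}

  /** Bytes touched by one bs x bs tile each of A, B and C (a float is 4 bytes). */
  static long tileWorkingSetBytes(int bs) {
    return 3L * bs * bs * Float.BYTES;
  }

  /** Largest multiple of 8 (at least 8) whose three tiles fit in cacheBytes. */
  static int maxBlockSize(long cacheBytes) {
    int bs = (int) Math.sqrt(cacheBytes / (3.0 * Float.BYTES));
    return Math.max(8, bs - bs % 8);
  }

  /** C = A * B; each parallel task owns one disjoint band of bs rows of C. */
  static void gemmParallelRows(float[] a, int m, int k, float[] b, int n, float[] c, int bs) {
    int bands = (m + bs - 1) / bs;
    IntStream.range(0, bands).parallel().forEach(band -> {
      int iMin = band * bs;
      int iMax = Math.min(iMin + bs, m);
      // Only this task touches rows iMin..iMax-1 of C, so no two threads
      // write the same elements, and they can share a cache line only
      // where two bands meet.
      Arrays.fill(c, iMin * n, iMax * n, 0f);
      for (int pp = 0; pp < k; pp += bs) {
        int pMax = Math.min(pp + bs, k);
        for (int i = iMin; i < iMax; i++) {
          for (int p = pp; p < pMax; p++) {
            float aip = a[i * k + p];
            for (int j = 0; j < n; j++) c[i * n + j] += aip * b[p * n + j];
          }
        }
      }
    });
  }
}
```

For example, tileWorkingSetBytes(64) evaluates to 49152 bytes (48 KiB), and maxBlockSize(256 * 1024) suggests 144 for a 256 KiB cache. Treat such numbers as starting points and confirm them with the JMH benchmark, since the best block size also depends on associativity, prefetching, and what else shares the cache.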

42.5.3 Vector API Bonus

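If the Vector API (the incubator module jdk.incubator.vector, enabled with --add-modules jdk.incubator.vector) is available on your JDK, use it in the innermost loop of the blocked kernel so that each iteration processes several floats at once, increasing arithmetic throughput further.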
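
As a rough sketch of that combination, the kernel below keeps the tiled loop structure and replaces the innermost scalar loop with FloatVector operations. It assumes a JDK that ships the incubating Vector API and must be compiled and run with --add-modules jdk.incubator.vector. The class name GemmVector and method name gemmBlockedVector are hypothetical, and the layout is row-major as in the benchmark.

```java
import java.util.Arrays;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical class; compile and run with --add-modules jdk.incubator.vector
final class GemmVector {
  // Widest float vector shape the current hardware supports.
  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  private GemmVector() {}

  /** C = A * B, tiled with edge bs, with a vectorized innermost loop over j. */
  static void gemmBlockedVector(float[] a, int m, int k, float[] b, int n, float[] c, int bs) {
    Arrays.fill(c, 0, m * n, 0f); // tiles accumulate into C
    for (int ii = 0; ii < m; ii += bs) {
      int iMax = Math.min(ii + bs, m);
      for (int pp = 0; pp < k; pp += bs) {
        int pMax = Math.min(pp + bs, k);
        for (int jj = 0; jj < n; jj += bs) {
          int jMax = Math.min(jj + bs, n);
          // Last j index (exclusive) that full vectors can cover in this tile.
          int jVecEnd = jj + SPECIES.loopBound(jMax - jj);
          for (int i = ii; i < iMax; i++) {
            for (int p = pp; p < pMax; p++) {
              float aip = a[i * k + p];
              FloatVector va = FloatVector.broadcast(SPECIES, aip);
              int j = jj;
              for (; j < jVecEnd; j += SPECIES.length()) {
                FloatVector vb = FloatVector.fromArray(SPECIES, b, p * n + j);
                FloatVector vc = FloatVector.fromArray(SPECIES, c, i * n + j);
                va.fma(vb, vc).intoArray(c, i * n + j); // c += a[i][p] * b[p][j..]
              }
              // Scalar tail for columns that do not fill a whole vector.
              for (; j < jMax; j++) c[i * n + j] += aip * b[p * n + j];
            }
          }
        }
      }
    }
  }
}
```

Because fma rounds once per multiply-add, results can differ from the scalar kernels in the last bits, so compare outputs with a small tolerance. Whether this beats the JIT's own auto-vectorization of the scalar loop depends on the JDK version and CPU, which is exactly what an extra @Benchmark method in GemmBench can check.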