41.4 Performance and Benchmarks

Quantify speedups with microbenchmarks, and use profiling to confirm that the hot loop actually compiles to vector instructions rather than assuming it does.


41.4.1 JMH Outline

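// Requires org.openjdk.jmh:jmh-core and jmh-generator-annprocess.
// The Vector API lives in an incubator module: compile and run with
// --add-modules jdk.incubator.vector
import org.openjdk.jmh.annotations.*;
import jdk.incubator.vector.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(value = 1, jvmArgsAppend = "--add-modules=jdk.incubator.vector")
@State(Scope.Thread)
public class VecBench {
  static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;
  float[] a, b, c;

  // Runs once per trial (the default @Setup level), before warmup.
  @Setup public void setup() {
    a = new float[1_000_000];
    b = new float[1_000_000];
    c = new float[1_000_000];
    for (int i = 0; i < a.length; i++) { a[i] = i; b[i] = i * 2f; }
  }

  // Returning c hands the result to JMH, so the stores cannot be
  // treated as dead code.
  @Benchmark public float[] addVector() {
    int lanes = S.length();
    int ub = S.loopBound(a.length);  // largest multiple of lanes <= a.length
    for (int i = 0; i < ub; i += lanes) {
      FloatVector va = FloatVector.fromArray(S, a, i);
      FloatVector vb = FloatVector.fromArray(S, b, i);
      va.add(vb).intoArray(c, i);
    }
    // Scalar tail for the elements that do not fill a whole vector.
    for (int i = ub; i < a.length; i++) c[i] = a[i] + b[i];
    return c;
  }

  // Scalar baseline. The JIT may auto-vectorize a loop this simple.
  @Benchmark public float[] addScalar() {
    for (int i = 0; i < a.length; i++) c[i] = a[i] + b[i];
    return c;
  }
}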

41.4.2 Tuning Tips

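  • Keep inner loops simple; avoid branches that defeat vectorization, and express per-lane conditions with masks instead
  • Use the preferred species for the platform (SPECIES_PREFERRED); avoid mixing element types in one loop, since lane conversions add work
  • Keep the working set hot in cache where possible; the Vector API has no explicit prefetch operation, so sequential access relies on the hardware prefetcher
  • Measure with JMH; compare scalar vs vector implementations, keeping in mind that the JIT may auto-vectorize a simple scalar loop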
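
The advice about branches can be made concrete with a mask. The following is a minimal sketch, not a tuned implementation: the class name MaskedSelect and the operation itself (negative inputs become 0, all others are doubled) are made up for illustration. It computes the "else" result for every lane and then overwrites the lanes selected by the mask, so the hot loop contains no if/else. Like the benchmark above, it needs --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example class; the names are illustrative only.
public class MaskedSelect {
  static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

  // Scalar reference with a branch: negatives become 0, others are doubled.
  static void scalar(float[] in, float[] out) {
    for (int i = 0; i < in.length; i++) {
      if (in[i] < 0f) out[i] = 0f; else out[i] = in[i] * 2f;
    }
  }

  // Vector version: compute the doubled value for all lanes, then
  // replace the lanes where the input was negative with 0.
  static void masked(float[] in, float[] out) {
    int ub = S.loopBound(in.length);
    for (int i = 0; i < ub; i += S.length()) {
      FloatVector v = FloatVector.fromArray(S, in, i);
      VectorMask<Float> negative = v.compare(VectorOperators.LT, 0f);
      v.mul(2f).blend(0f, negative).intoArray(out, i);
    }
    // Scalar tail for the remaining elements.
    for (int i = ub; i < in.length; i++) {
      out[i] = in[i] < 0f ? 0f : in[i] * 2f;
    }
  }
}
```

Both paths are evaluated for every lane, so this pays off only when each path is cheap; whether it beats the branchy scalar loop on a given machine is something to measure, not assume.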

41.4.3 Pitfalls

  • Complex control flow (if/else in the hot loop) reduces gains; a masked version evaluates both paths for every lane
  • Gather/scatter can be memory-bound; prefer contiguous loads and stores when possible
  • Small arrays may not amortize loop setup and the scalar tail; batch work into larger chunks
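
To see why small arrays gain little, it helps to work out how many elements reach the vector loop at all. The sketch below is plain Java and does not use the Vector API. The class TailShare and its methods are hypothetical helpers, and the lane count of 8 is an assumption (256-bit vectors of float). Its loopBound mirrors what VectorSpecies.loopBound returns: the largest multiple of the lane count that does not exceed the length.

```java
// Hypothetical helper for illustration; not part of any library.
public class TailShare {
  // Largest multiple of `lanes` that is <= `length`.
  static int loopBound(int length, int lanes) {
    return length - (length % lanes);
  }

  // Fraction of the elements left to the scalar tail loop.
  static double tailFraction(int length, int lanes) {
    if (length == 0) return 0.0;
    return (double) (length - loopBound(length, lanes)) / length;
  }

  public static void main(String[] args) {
    int lanes = 8; // assumed lane count: 256-bit vectors of float
    for (int n : new int[] {5, 13, 100, 1_000_000}) {
      int vec = loopBound(n, lanes);
      System.out.printf("n=%d vector=%d tail=%d (%.1f%% scalar)%n",
          n, vec, n - vec, 100 * tailFraction(n, lanes));
    }
  }
}
```

With 8 lanes, a 5-element array never enters the vector loop, while a 1,000,000-element array has no tail at all. This arithmetic covers only the tail share; fixed per-call costs also weigh more heavily on small inputs and have to be measured.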