41.4 Performance and Benchmarks
Quantify speedups and verify vectorization with microbenchmarks and profiling.
41.4.1 JMH Outline
// Requires org.openjdk.jmh:jmh-core and jmh-generator-annprocess.
// jdk.incubator.vector is an incubator module: compile and run with
// --add-modules jdk.incubator.vector.
import org.openjdk.jmh.annotations.*;
import jdk.incubator.vector.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
// Pass the incubator module to the forked benchmark JVM as well.
@Fork(value = 1, jvmArgsAppend = "--add-modules=jdk.incubator.vector")
public class VecBench {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;
    float[] a, b, c;

    @Setup
    public void setup() {
        a = new float[1_000_000];
        b = new float[1_000_000];
        c = new float[1_000_000];
        for (int i = 0; i < a.length; i++) { a[i] = i; b[i] = i * 2f; }
    }

    @Benchmark
    public void addVector() {
        int lanes = S.length();
        int ub = S.loopBound(a.length); // largest multiple of lanes <= a.length
        // Main loop: one full vector per iteration.
        for (int i = 0; i < ub; i += lanes) {
            FloatVector va = FloatVector.fromArray(S, a, i);
            FloatVector vb = FloatVector.fromArray(S, b, i);
            va.add(vb).intoArray(c, i);
        }
        // Scalar tail for the elements that do not fill a whole vector.
        for (int i = ub; i < a.length; i++) c[i] = a[i] + b[i];
    }

    @Benchmark
    public void addScalar() {
        // Baseline; the JIT may auto-vectorize this simple loop on its own.
        for (int i = 0; i < a.length; i++) c[i] = a[i] + b[i];
    }
}
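Timings are only comparable if both kernels compute the same result, so it can be worth checking that before benchmarking. The following is a minimal sketch of such a check, not part of the benchmark itself: the class name AddCheck and the helper agree are hypothetical, and the kernels are restated as static methods that take their arrays as parameters. It assumes the JVM is started with --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

// Hypothetical helper class; names are illustrative only.
final class AddCheck {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

    // Same kernel shape as addVector in the benchmark: vector main loop plus scalar tail.
    static void addVector(float[] a, float[] b, float[] c) {
        int ub = S.loopBound(a.length);
        int i = 0;
        for (; i < ub; i += S.length()) {
            FloatVector va = FloatVector.fromArray(S, a, i);
            FloatVector vb = FloatVector.fromArray(S, b, i);
            va.add(vb).intoArray(c, i);
        }
        for (; i < a.length; i++) c[i] = a[i] + b[i];
    }

    // Scalar reference.
    static void addScalar(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i++) c[i] = a[i] + b[i];
    }

    // Runs both kernels on the same input of length n and compares the outputs.
    static boolean agree(int n) {
        float[] a = new float[n], b = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = i * 2f; }
        float[] v = new float[n], s = new float[n];
        addVector(a, b, v);
        addScalar(a, b, s);
        return Arrays.equals(v, s);
    }

    public static void main(String[] args) {
        // Sizes chosen to cover empty input, tail-only input, and main loop plus tail.
        for (int n : new int[] {0, 1, 7, 16, 1003}) {
            System.out.println("n=" + n + " agree=" + agree(n));
        }
    }
}
```

Lanewise addition performs the same floating-point operation per element as the scalar loop, so an exact comparison is appropriate here. Reductions such as sums reorder operations and would usually need a tolerance instead.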
41.4.2 Tuning Tips
- Keep inner loops simple; avoid branches that defeat vectorization
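- Use the platform's preferred species (SPECIES_PREFERRED); avoid mixing element types in one loop, since conversions between element sizes change the lane count
- Keep the working set cache-resident where possible; the Vector API has no explicit prefetch operation, so favor sequential access that hardware prefetchers can follow
- Measure with JMH; compare scalar and vector implementations on the same data, keeping in mind that the JIT may auto-vectorize a simple scalar loop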
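To see what the preferred species actually is on a given machine, its element type, lane count, and bit width can be printed directly. This is a small sketch under the assumption that the JVM is started with --add-modules jdk.incubator.vector; the class name SpeciesInfo and the helper describe are hypothetical.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper class; names are illustrative only.
final class SpeciesInfo {
    // Formats a species as "<element type> x <lanes> (<bits> bits)".
    static String describe(VectorSpecies<?> s) {
        return s.elementType().getSimpleName()
                + " x " + s.length()
                + " (" + s.vectorBitSize() + " bits)";
    }

    public static void main(String[] args) {
        // The preferred species depends on the CPU and JVM, so the output varies by machine.
        System.out.println("preferred float: " + describe(FloatVector.SPECIES_PREFERRED));
        System.out.println("preferred int:   " + describe(IntVector.SPECIES_PREFERRED));
        // A fixed-shape species for comparison: four 32-bit float lanes.
        System.out.println("fixed 128-bit:   " + describe(FloatVector.SPECIES_128));
    }
}
```

The lane count reported here is the stride of the vector loop, which gives a rough upper bound on the per-iteration gain over a scalar loop that is not vectorized.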
41.4.3 Pitfalls
- Complex control flow (if/else in the hot loop) reduces gains; lane masks can often replace such branches
- Gather/scatter can be memory-bound; prefer contiguous loads and stores when possible
- Small arrays may not amortize overhead; batch work into larger chunks
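As an illustration of the first pitfall, an if/else in the hot loop can often be expressed as a comparison that yields a lane mask, followed by a blend that picks one of two precomputed results per lane. The sketch below is one possible formulation, not the only one: the class name MaskedSelect, the method names, and the chosen operation (subtract where a > b, add otherwise) are assumptions made for the example. It assumes the JVM is started with --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

// Hypothetical example class; names and the chosen operation are illustrative only.
final class MaskedSelect {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

    // Scalar reference with a branch in the hot loop.
    static void branchy(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) c[i] = a[i] - b[i];
            else c[i] = a[i] + b[i];
        }
    }

    // Branch-free vector version: compute both results, then select per lane.
    static void masked(float[] a, float[] b, float[] c) {
        int ub = S.loopBound(a.length);
        int i = 0;
        for (; i < ub; i += S.length()) {
            FloatVector va = FloatVector.fromArray(S, a, i);
            FloatVector vb = FloatVector.fromArray(S, b, i);
            VectorMask<Float> gt = va.compare(VectorOperators.GT, vb);
            // blend takes lanes from its argument where the mask is set,
            // and keeps the receiver's lanes elsewhere.
            va.add(vb).blend(va.sub(vb), gt).intoArray(c, i);
        }
        // Scalar tail.
        for (; i < a.length; i++) c[i] = a[i] > b[i] ? a[i] - b[i] : a[i] + b[i];
    }

    public static void main(String[] args) {
        float[] a = {1f, 5f, 3f, 8f, 2f, 9f, 4f, 7f, 6f};
        float[] b = {4f, 2f, 3f, 1f, 6f, 0f, 5f, 7f, 8f};
        float[] c1 = new float[a.length], c2 = new float[a.length];
        branchy(a, b, c1);
        masked(a, b, c2);
        System.out.println("results match: " + Arrays.equals(c1, c2));
    }
}
```

The masked version does more arithmetic per element because both branches are evaluated, so whether it wins depends on how cheap the two sides are. Measuring both variants is the safest way to decide.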