26.4 Troubleshooting and Performance Optimization
Identifying and resolving GC-related performance issues requires systematic analysis and appropriate tools.
Memory Leaks
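// Detecting Memory Leaks
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

public class MemoryLeakDetection {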
public static void printLeakSymptoms() {
System.out.println("=== MEMORY LEAK SYMPTOMS ===");
System.out.println("\n--- INDICATORS ---");
System.out.println("✗ Heap usage increases steadily over time");
System.out.println("✗ Old Gen continuously grows");
System.out.println("✗ Frequent Full GCs");
System.out.println("✗ Full GCs don't reclaim much memory");
System.out.println("✗ Eventually OutOfMemoryError");
System.out.println("✗ Application restarts temporarily fix issue");
System.out.println("\n--- COMMON CAUSES ---");
System.out.println("1. Collections not cleared");
System.out.println(" - Static collections");
System.out.println(" - Caches without eviction");
System.out.println(" - Listeners not removed");
System.out.println("\n2. ThreadLocal leaks");
System.out.println(" - ThreadLocal not removed");
System.out.println(" - Thread pools holding references");
System.out.println("\n3. ClassLoader leaks");
System.out.println(" - Classes not unloaded");
System.out.println(" - Static references to application classes");
System.out.println("\n4. Native memory leaks");
System.out.println(" - JNI resources not freed");
System.out.println(" - Direct buffers not released");
}
// Example: Classic memory leak
public static class LeakyCache {
// ❌ LEAK: Static collection never cleared
private static final Map<String, Object> cache = new HashMap<>();
public void addToCache(String key, Object value) {
cache.put(key, value);
// Objects accumulate, never removed
}
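// ✅ FIXED (partially): WeakHashMap drops an entry once its key is no longer
// strongly referenced elsewhere. Interned keys such as String literals are
// never collected, and a value must not strongly reference its own key.
// For a general-purpose cache, prefer size- or time-based eviction.
private static final Map<String, Object> fixedCache =
new WeakHashMap<>();
public void addToFixedCache(String key, Object value) {
fixedCache.put(key, value);
// The entry becomes collectable once the key is only weakly reachable
}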
}
// Example: ThreadLocal leak
public static class ThreadLocalLeak {
// ❌ LEAK: ThreadLocal not cleaned up
private static final ThreadLocal<List<String>> cache =
ThreadLocal.withInitial(ArrayList::new);
public void addData(String data) {
cache.get().add(data);
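// On a pooled thread the list outlives the task and keeps growing
}
// ✅ FIXED: Clean up ThreadLocal
public void addDataSafely(String data) {
try {
cache.get().add(data);
// ... use the thread-local list for the rest of this task ...
} finally {
cache.remove(); // Drop the per-thread list before the thread returns to the pool
}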
}
}
}
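The "implement eviction" fix mentioned above can be sketched with the JDK alone. The following is a minimal illustration, not a production cache: the class name BoundedLruCache and the capacity parameter are hypothetical, and it relies on LinkedHashMap's access-order mode together with removeEldestEntry to cap the number of entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Size-bounded LRU cache: a sketch of the "implement eviction" fix. */
class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    // Hypothetical capacity chosen by the caller
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after each insertion; returning true evicts the least recently used entry
        return size() > maxEntries;
    }
}
```

LinkedHashMap is not thread-safe, so a shared instance would need external synchronization, for example Collections.synchronizedMap(new BoundedLruCache<>(10_000)).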
Diagnosing Memory Leaks
// Memory Leak Diagnosis Steps
public class LeakDiagnosis {
public static void printDiagnosisSteps() {
System.out.println("=== MEMORY LEAK DIAGNOSIS PROCESS ===");
System.out.println("\n--- STEP 1: CONFIRM LEAK ---");
System.out.println("1. Monitor heap over time:");
System.out.println(" jstat -gcutil <pid> 10000");
System.out.println(" Watch Old Gen utilization (O column, % of capacity)");
System.out.println(" O still rising after each Full GC → likely leak");
System.out.println("\n2. Check GC effectiveness:");
System.out.println(" If Full GC reclaims little memory → strong sign of a leak");
System.out.println("\n--- STEP 2: CAPTURE HEAP DUMPS ---");
System.out.println("1. Early heap dump (baseline):");
System.out.println(" jcmd <pid> GC.heap_dump /tmp/heap1.hprof");
System.out.println("\n2. Run application under load");
System.out.println(" Wait for memory to grow");
System.out.println("\n3. Late heap dump (after leak):");
System.out.println(" jcmd <pid> GC.heap_dump /tmp/heap2.hprof");
System.out.println("\n--- STEP 3: ANALYZE WITH ECLIPSE MAT ---");
System.out.println("1. Open heap dump in Eclipse MAT");
System.out.println(" Download: eclipse.org/mat");
System.out.println("\n2. Run Leak Suspects Report");
System.out.println(" Overview → Reports → Leak Suspects");
System.out.println(" Identifies objects consuming most memory");
System.out.println("\n3. Open Dominator Tree");
System.out.println(" Shows objects and their retained size");
System.out.println(" Retained size = memory freed if object collected");
System.out.println("\n4. Compare heap dumps");
System.out.println(" Compare baseline vs leak dump");
System.out.println(" Identify growing object types");
System.out.println("\n5. Find GC roots");
System.out.println(" Right-click object → Path to GC Roots");
System.out.println(" Shows why object can't be collected");
System.out.println("\n--- STEP 4: FIX LEAK ---");
System.out.println("Common fixes:");
System.out.println(" • Clear collections when done");
System.out.println(" • Remove event listeners");
System.out.println(" • Clean up ThreadLocal");
System.out.println(" • Use weak references for caches");
System.out.println(" • Implement cache eviction");
System.out.println(" • Fix ClassLoader lifecycle");
}
}
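Step 1 of the process above (confirming the leak) can also be done from inside the JVM. The sketch below is one possible approach under stated assumptions: HeapTrend and its method names are hypothetical, the after-GC figure comes from MemoryPoolMXBean.getCollectionUsage() (which may be null for pools that do not support it), and the trend is a plain least-squares slope over samples taken at a fixed interval.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

class HeapTrend {
    /** Bytes still in use after the most recent GC, summed over heap pools that report it. */
    static long liveBytesAfterLastGc() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
            if (afterGc != null) {
                total += afterGc.getUsed();
            }
        }
        return total;
    }

    /** Least-squares slope of equally spaced samples, in bytes per sample interval. */
    static double slope(long[] samples) {
        int n = samples.length;
        if (n < 2) {
            return 0.0;
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (long s : samples) {
            meanY += s;
        }
        meanY /= n;
        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (samples[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }
}
```

A slope that stays clearly positive over many hours of after-GC samples points the same way as a rising Old Gen column in jstat; a slope near zero suggests a stable live set.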
OutOfMemoryError Scenarios
// OutOfMemoryError Types
public class OutOfMemoryErrors {
public static void printOOMTypes() {
System.out.println("=== OUT OF MEMORY ERROR TYPES ===");
System.out.println("\n--- 1. HEAP SPACE ---");
System.out.println("Error: java.lang.OutOfMemoryError: Java heap space");
System.out.println("\nCauses:");
System.out.println(" • Heap too small for application");
System.out.println(" • Memory leak");
System.out.println(" • Large object allocation");
System.out.println("\nDiagnosis:");
System.out.println(" 1. Capture heap dump:");
System.out.println(" -XX:+HeapDumpOnOutOfMemoryError");
System.out.println(" -XX:HeapDumpPath=/tmp/heap.hprof");
System.out.println(" 2. Analyze with Eclipse MAT");
System.out.println(" 3. Identify largest objects");
System.out.println("\nSolutions:");
System.out.println(" • Increase heap: -Xmx8g");
System.out.println(" • Fix memory leak");
System.out.println(" • Optimize object usage");
System.out.println("\n--- 2. METASPACE ---");
System.out.println("Error: java.lang.OutOfMemoryError: Metaspace");
System.out.println("\nCauses:");
System.out.println(" • Too many classes loaded");
System.out.println(" • ClassLoader leak (classes not unloaded)");
System.out.println(" • Dynamic class generation");
System.out.println("\nDiagnosis:");
System.out.println(" 1. Monitor metaspace:");
System.out.println(" jstat -gc <pid>");
System.out.println(" Watch MU (metaspace used) against MC (committed capacity)");
System.out.println(" 2. Inspect class loaders and loaded classes:");
System.out.println(" jcmd <pid> VM.classloader_stats");
System.out.println(" jcmd <pid> GC.class_histogram");
System.out.println("\nSolutions:");
System.out.println(" • Raise the limit if one is set: -XX:MaxMetaspaceSize=512m");
System.out.println(" • Fix ClassLoader leak");
System.out.println(" • Reduce dynamic class generation");
System.out.println("\n--- 3. GC OVERHEAD LIMIT EXCEEDED ---");
System.out.println("Error: java.lang.OutOfMemoryError: GC overhead limit exceeded");
System.out.println("\nTriggered when:");
System.out.println(" • >98% of time spent in GC");
System.out.println(" • <2% of heap recovered");
System.out.println(" • Happened 5 consecutive times");
System.out.println("\nCauses:");
System.out.println(" • Heap too small");
System.out.println(" • Memory leak");
System.out.println(" • Application needs too much memory");
System.out.println("\nSolutions:");
System.out.println(" • Increase heap");
System.out.println(" • Fix memory leak");
System.out.println(" • Disable check (not recommended):");
System.out.println(" -XX:-UseGCOverheadLimit");
System.out.println("\n--- 4. DIRECT BUFFER MEMORY ---");
System.out.println("Error: java.lang.OutOfMemoryError: Direct buffer memory");
System.out.println("\nCauses:");
System.out.println(" • Too many DirectByteBuffer allocations");
System.out.println(" • Buffers not released");
System.out.println("\nSolutions:");
System.out.println(" • Increase limit: -XX:MaxDirectMemorySize=1g");
System.out.println(" • Pool and reuse direct buffers instead of reallocating");
System.out.println(" • Close channels and streams that own buffers (try-with-resources)");
System.out.println("\n--- 5. UNABLE TO CREATE NEW NATIVE THREAD ---");
System.out.println("Error: java.lang.OutOfMemoryError: unable to create new native thread");
System.out.println("\nCauses:");
System.out.println(" • Too many threads");
System.out.println(" • OS thread limit reached");
System.out.println(" • Insufficient native memory");
System.out.println("\nSolutions:");
System.out.println(" • Reduce thread count");
System.out.println(" • Use thread pools");
System.out.println(" • Increase OS limits (ulimit)");
System.out.println(" • Reduce heap size (more native memory)");
}
}
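For the direct buffer case above, the JVM exposes its own accounting through BufferPoolMXBean, which can be checked before raising -XX:MaxDirectMemorySize. The following is a small sketch assuming a HotSpot-style JVM that publishes a pool named "direct"; the class DirectBufferStats and its method names are hypothetical.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

class DirectBufferStats {
    /** Returns the "direct" buffer pool, or null if this JVM does not expose one. */
    private static BufferPoolMXBean directPool() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool;
            }
        }
        return null;
    }

    /** Estimated bytes currently held by direct buffers, or -1 if unavailable. */
    static long directMemoryUsed() {
        BufferPoolMXBean pool = directPool();
        return pool == null ? -1 : pool.getMemoryUsed();
    }

    /** Estimated number of live direct buffers, or -1 if unavailable. */
    static long directBufferCount() {
        BufferPoolMXBean pool = directPool();
        return pool == null ? -1 : pool.getCount();
    }
}
```

Logging these two values periodically shows whether direct memory grows without bound (buffers retained) or simply peaks above the configured limit (limit too small).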
Long GC Pauses
// Diagnosing Long GC Pauses
public class LongGCPauses {
public static void printPauseDiagnosis() {
System.out.println("=== DIAGNOSING LONG GC PAUSES ===");
System.out.println("\n--- SYMPTOMS ---");
System.out.println("✗ Application unresponsive periodically");
System.out.println("✗ Request timeouts");
System.out.println("✗ SLA violations");
System.out.println("✗ GC logs show long pause times");
System.out.println("\n--- COMMON CAUSES ---");
System.out.println("\n1. Full GC Events");
System.out.println(" Problem: Full GC pauses entire JVM");
System.out.println(" Duration: Seconds to tens of seconds");
System.out.println(" Check GC logs for 'Full GC' events");
System.out.println("\n2. Large Young Generation");
System.out.println(" Problem: More objects to process");
System.out.println(" Solution: Reduce Young Gen size");
System.out.println(" Flag: -XX:MaxNewSize=2g");
System.out.println("\n3. High Promotion Rate");
System.out.println(" Problem: Objects promoted too fast to Old Gen");
System.out.println(" Solution: Increase Survivor space");
System.out.println(" Flag: -XX:SurvivorRatio=6");
System.out.println("\n4. Humongous Objects (G1)");
System.out.println(" Problem: Large objects (at least 50% of region size)");
System.out.println(" Solution: Increase region size");
System.out.println(" Flag: -XX:G1HeapRegionSize=32m");
System.out.println("\n5. Reference Processing");
System.out.println(" Problem: Many weak/soft/phantom references");
System.out.println(" Check: JFR for reference processing time");
System.out.println(" Solution: Reduce reference usage");
System.out.println("\n6. NUMA Awareness");
System.out.println(" Problem: Cross-NUMA node access");
System.out.println(" Solution: Enable NUMA support");
System.out.println(" Flag: -XX:+UseNUMA");
}
public static void printPauseTuningSteps() {
System.out.println("\n=== PAUSE TIME TUNING PROCESS ===");
System.out.println("\n--- STEP 1: MEASURE BASELINE ---");
System.out.println("1. Enable detailed GC logging:");
System.out.println(" -Xlog:gc*:file=gc.log:time,uptime,level,tags");
System.out.println("\n2. Run under production load");
System.out.println("\n3. Analyze logs:");
System.out.println(" - Identify longest pauses");
System.out.println(" - Check pause frequency");
System.out.println(" - Note GC types (Minor, Major, Full)");
System.out.println("\n--- STEP 2: IDENTIFY BOTTLENECK ---");
System.out.println("Check what's taking time in GC:");
System.out.println("\n• Young GC too slow:");
System.out.println(" → Reduce Young Gen size");
System.out.println(" → Increase GC threads");
System.out.println("\n• Old GC too slow:");
System.out.println(" → Lower IHOP (start marking earlier)");
System.out.println(" → Increase heap size");
System.out.println("\n• Frequent Full GC:");
System.out.println(" → Increase heap");
System.out.println(" → Fix memory leak");
System.out.println(" → Start marking earlier (lower IHOP)");
System.out.println("\n--- STEP 3: TUNE PARAMETERS ---");
System.out.println("G1 tuning:");
System.out.println(" -XX:MaxGCPauseMillis=50 # Lower goal");
System.out.println(" -XX:InitiatingHeapOccupancyPercent=35 # Earlier marking");
System.out.println("\nOr switch collector:");
System.out.println(" -XX:+UseZGC # Pauses typically under 1 ms (JDK 16+)");
System.out.println("\n--- STEP 4: TEST AND VERIFY ---");
System.out.println("1. Run with new settings");
System.out.println("2. Measure pause times");
System.out.println("3. Check throughput impact");
System.out.println("4. Iterate if needed");
}
}
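The "analyze logs" step above can be partly automated. The sketch below assumes unified-logging output of the kind produced by -Xlog:gc, where each completed pause line contains the word Pause and ends with a duration in milliseconds; GcPauseLog, its method names, and the regular expression are illustrative assumptions, and log formats can differ between JDK versions and collectors.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcPauseLog {
    // Assumed line shape:
    // [12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 45.678ms
    private static final Pattern PAUSE =
            Pattern.compile("\\bPause\\b.*?(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Extracts pause durations (ms) from lines that report a completed pause. */
    static List<Double> pauseMillis(List<String> lines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        return pauses;
    }

    /** Nearest-rank percentile for p in (0, 100]; returns 0 for empty input. */
    static double percentile(List<Double> values, double p) {
        if (values.isEmpty()) {
            return 0.0;
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

Feeding it the lines of gc.log (for example via Files.readAllLines) gives the longest pause as percentile(pauses, 100) and a P99 figure as percentile(pauses, 99), which can then be compared before and after each tuning change.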
Heap Dump Analysis
// Heap Dump Analysis Techniques
public class HeapDumpAnalysis {
public static void printAnalysisTechniques() {
System.out.println("=== HEAP DUMP ANALYSIS ===");
System.out.println("\n--- CAPTURING HEAP DUMPS ---");
System.out.println("\n1. On OutOfMemoryError (automatic):");
System.out.println(" -XX:+HeapDumpOnOutOfMemoryError");
System.out.println(" -XX:HeapDumpPath=/tmp/");
System.out.println("\n2. Manual capture (jcmd):");
System.out.println(" jcmd <pid> GC.heap_dump /tmp/heap.hprof");
System.out.println("\n3. Manual capture (jmap):");
System.out.println(" jmap -dump:live,format=b,file=heap.hprof <pid>");
System.out.println(" 'live' option triggers GC first");
System.out.println("\n4. Programmatic capture:");
System.out.println(" HotSpotDiagnosticMXBean mbean =");
System.out.println(" ManagementFactory.getPlatformMXBean(");
System.out.println(" HotSpotDiagnosticMXBean.class);");
System.out.println(" mbean.dumpHeap(\"/tmp/heap.hprof\", true);");
System.out.println("\n--- ECLIPSE MAT ANALYSIS ---");
System.out.println("\n1. Leak Suspects Report");
System.out.println(" Automatically identifies likely leaks");
System.out.println(" Shows biggest memory consumers");
System.out.println("\n2. Histogram");
System.out.println(" Class → Object count → Shallow size → Retained size");
System.out.println(" Shallow: Object's own memory");
System.out.println(" Retained: Memory freed if object collected");
System.out.println("\n3. Dominator Tree");
System.out.println(" Objects sorted by retained size");
System.out.println(" Shows object retention hierarchy");
System.out.println("\n4. Path to GC Roots");
System.out.println(" Right-click object → Path to GC Roots");
System.out.println(" Shows why object can't be collected");
System.out.println(" Pick the variant that excludes:");
System.out.println(" - Weak references");
System.out.println(" - Soft references");
System.out.println(" - Phantom references");
System.out.println("\n5. OQL (Object Query Language)");
System.out.println(" SQL-like queries on heap");
System.out.println(" Example: SELECT * FROM java.lang.String");
System.out.println(" Example: SELECT s FROM java.lang.String s");
System.out.println(" WHERE s.value.@length > 1000");
System.out.println("\n--- COMMON PATTERNS ---");
System.out.println("\n1. Collection Leak:");
System.out.println(" Large HashMap/ArrayList with many entries");
System.out.println(" Fix: Clear collection, implement eviction");
System.out.println("\n2. String Duplication:");
System.out.println(" Many identical String objects");
System.out.println(" Fix: Use String.intern() or -XX:+UseStringDeduplication");
System.out.println("\n3. ClassLoader Leak:");
System.out.println(" Old ClassLoader not unloaded");
System.out.println(" Classes and static fields retained");
System.out.println(" Fix: Remove static references to app classes");
System.out.println("\n4. ThreadLocal Leak:");
System.out.println(" ThreadLocalMap entries not cleaned");
System.out.println(" Fix: Call ThreadLocal.remove()");
}
}
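The programmatic capture listed above is printed only as a fragment. A self-contained version might look like the following sketch. It assumes a HotSpot-based JVM (HotSpotDiagnosticMXBean lives in com.sun.management), and the class HeapDumper, its method names, and the temporary directory prefix are hypothetical. dumpHeap fails if the target file already exists, so the sketch writes to a fresh timestamped name.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

class HeapDumper {
    /** Writes an .hprof dump to target; liveOnly = true dumps only reachable objects. */
    static void dump(Path target, boolean liveOnly) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toAbsolutePath().toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Dumps into a new temporary directory and returns the path of the dump file. */
    static Path dumpToTempDir(boolean liveOnly) {
        try {
            Path dir = Files.createTempDirectory("heapdumps");
            Path target = dir.resolve("heap-" + System.currentTimeMillis() + ".hprof");
            dump(target, liveOnly);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real service, dump would more likely be wired to an admin endpoint or a low-memory notification, with the target directory on a volume that has at least as much free space as the heap.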
Performance Case Studies
// Real-World Performance Issues
public class PerformanceCaseStudies {
public static void printCaseStudy1() {
System.out.println("=== CASE STUDY 1: FREQUENT FULL GCs ===");
System.out.println("\n--- SYMPTOMS ---");
System.out.println("• Full GC every 10 minutes");
System.out.println("• Each Full GC takes 5-10 seconds");
System.out.println("• Application pauses noticeable to users");
System.out.println("\n--- DIAGNOSIS ---");
System.out.println("1. Analyzed GC logs: Old Gen filling up fast");
System.out.println("2. Checked jstat: High promotion rate (100MB/s)");
System.out.println("3. JFR profiling: Excessive allocation in request processing");
System.out.println("\n--- ROOT CAUSE ---");
System.out.println("Objects allocated in request handler surviving Minor GC");
System.out.println(" → Premature promotion to Old Gen");
System.out.println(" → Old Gen filling quickly");
System.out.println(" → Frequent Full GC");
System.out.println("\n--- SOLUTION ---");
System.out.println("1. Increased Young Gen size:");
System.out.println(" -XX:NewRatio=1 (from default 2)");
System.out.println("2. Increased Survivor space:");
System.out.println(" -XX:SurvivorRatio=6 (from 8)");
System.out.println("\n--- RESULTS ---");
System.out.println("✓ Full GC frequency: 10min → 4 hours");
System.out.println("✓ Promotion rate: 100MB/s → 20MB/s");
System.out.println("✓ Application latency improved significantly");
}
public static void printCaseStudy2() {
System.out.println("\n=== CASE STUDY 2: MEMORY LEAK ===");
System.out.println("\n--- SYMPTOMS ---");
System.out.println("• Heap usage steadily increasing");
System.out.println("• Full GCs not reclaiming memory");
System.out.println("• OutOfMemoryError after 48 hours");
System.out.println("\n--- DIAGNOSIS ---");
System.out.println("1. Captured heap dumps at 1hr and 24hr");
System.out.println("2. Analyzed with Eclipse MAT");
System.out.println("3. Dominator tree showed large cache Map");
System.out.println("4. Path to GC roots: Static field → Cache → millions of entries");
System.out.println("\n--- ROOT CAUSE ---");
System.out.println("Static cache Map without eviction policy");
System.out.println(" → Entries never removed");
System.out.println(" → Unbounded growth");
System.out.println(" → Eventually OOM");
System.out.println("\n--- SOLUTION ---");
System.out.println("Replaced HashMap with Caffeine cache:");
System.out.println(" Cache<String, Object> cache = Caffeine.newBuilder()");
System.out.println(" .maximumSize(10_000)");
System.out.println(" .expireAfterWrite(1, TimeUnit.HOURS)");
System.out.println(" .build();");
System.out.println("\n--- RESULTS ---");
System.out.println("✓ Heap usage stable");
System.out.println("✓ No more OOM errors");
System.out.println("✓ Application runs indefinitely");
}
public static void printCaseStudy3() {
System.out.println("\n=== CASE STUDY 3: LONG GC PAUSES (G1) ===");
System.out.println("\n--- SYMPTOMS ---");
System.out.println("• P99 pause time: 500ms (target: <100ms)");
System.out.println("• Request timeouts during GC");
System.out.println("• Using G1 with 32GB heap");
System.out.println("\n--- DIAGNOSIS ---");
System.out.println("1. GC logs showed Young GC pauses >400ms");
System.out.println("2. Large Young Gen (20GB)");
System.out.println("3. Many references to process");
System.out.println("\n--- ROOT CAUSE ---");
System.out.println("G1 trying to meet pause time goal");
System.out.println(" → But Young Gen too large");
System.out.println(" → Can't collect in target time");
System.out.println(" → Pauses exceed goal");
System.out.println("\n--- SOLUTION ---");
System.out.println("Switched to ZGC:");
System.out.println(" -XX:+UseZGC");
System.out.println(" -Xms32g -Xmx32g");
System.out.println("\n--- RESULTS ---");
System.out.println("✓ P99 pause time: 500ms → 0.5ms (1000x improvement)");
System.out.println("✓ No more request timeouts");
System.out.println("✓ Throughput slightly reduced (~5%) but acceptable");
}
}
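Case Study 1 quotes a promotion rate read from jstat. One way to arrive at such a figure is to take two jstat -gc samples and divide the growth of the OU column (Old Gen used, in KB) by the elapsed time. The helper below is a sketch of that arithmetic only; PromotionRate and its method names are hypothetical, and the result is meaningful only if no Old Gen collection ran between the two samples.

```java
class PromotionRate {
    /**
     * Average promotion rate in MB/s between two jstat -gc samples.
     * ouKbStart and ouKbEnd are OU column values (Old Gen used, in KB).
     */
    static double mbPerSecond(double ouKbStart, double ouKbEnd, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return (ouKbEnd - ouKbStart) / 1024.0 / elapsedSeconds;
    }

    /**
     * Rough time in seconds until Old Gen fills at the given rate.
     * ocKb is the OC column (Old Gen capacity, in KB).
     */
    static double secondsUntilFull(double ocKb, double ouKb, double mbPerSecond) {
        if (mbPerSecond <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (ocKb - ouKb) / 1024.0 / mbPerSecond;
    }
}
```

With made-up sample values, an OU that grows from 1,048,576 KB to 2,072,576 KB in 10 seconds corresponds to 100 MB/s; with 2 GB of Old Gen headroom left, that rate would fill the generation in about 20 seconds.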
GC Tuning Checklist
// Comprehensive Tuning Checklist
public class GCTuningChecklist {
public static void printChecklist() {
System.out.println("=== GC TUNING CHECKLIST ===");
System.out.println("\n--- BASELINE (DO THIS FIRST) ---");
System.out.println("☐ Enable GC logging");
System.out.println(" -Xlog:gc*:file=gc.log:time,uptime,tags");
System.out.println("☐ Set heap size (Xms = Xmx)");
System.out.println(" -Xms8g -Xmx8g");
System.out.println("☐ Enable heap dump on OOM");
System.out.println(" -XX:+HeapDumpOnOutOfMemoryError");
System.out.println("☐ Set heap dump path");
System.out.println(" -XX:HeapDumpPath=/var/log/heapdumps/");
System.out.println("☐ Disable explicit GC (only if no code relies on System.gc())");
System.out.println(" -XX:+DisableExplicitGC");
System.out.println("\n--- MONITORING ---");
System.out.println("☐ Set up JFR continuous recording");
System.out.println("☐ Monitor GC metrics (throughput, pause time)");
System.out.println("☐ Monitor heap usage trends");
System.out.println("☐ Monitor allocation rate");
System.out.println("☐ Monitor promotion rate");
System.out.println("☐ Set up alerting for Full GC");
System.out.println("\n--- ANALYSIS ---");
System.out.println("☐ Establish baseline metrics");
System.out.println("☐ Identify performance requirements");
System.out.println("☐ Determine primary constraint:");
System.out.println(" • Latency (pause time)");
System.out.println(" • Throughput");
System.out.println(" • Memory footprint");
System.out.println("\n--- COLLECTOR SELECTION ---");
System.out.println("☐ Start with G1 (default)");
System.out.println("☐ If pauses above 100ms are unacceptable → Consider ZGC/Shenandoah");
System.out.println("☐ If throughput critical (batch) → Consider Parallel");
System.out.println("☐ Test collector under load");
System.out.println("\n--- HEAP SIZING ---");
System.out.println("☐ Set Xms = Xmx (avoid resizing)");
System.out.println("☐ Allocate 25-50% system memory to heap");
System.out.println("☐ Leave memory for OS and other processes");
System.out.println("☐ Consider metaspace limit");
System.out.println("\n--- TUNING (G1) ---");
System.out.println("☐ Set pause time goal");
System.out.println(" -XX:MaxGCPauseMillis=100");
System.out.println("☐ If frequent Full GC:");
System.out.println(" • Increase heap");
System.out.println(" • Lower IHOP (-XX:InitiatingHeapOccupancyPercent=35)");
System.out.println("☐ If premature promotion:");
System.out.println(" • Increase survivor space");
System.out.println(" • Increase tenuring threshold");
System.out.println("\n--- VALIDATION ---");
System.out.println("☐ Test under production load");
System.out.println("☐ Verify pause times meet SLA");
System.out.println("☐ Verify throughput acceptable");
System.out.println("☐ Monitor for regressions");
System.out.println("☐ Document settings and rationale");
}
}
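The monitoring items in the checklist (GC throughput in particular) can be approximated in-process with GarbageCollectorMXBean. The sketch below is an assumption-laden starting point rather than a full monitoring setup: GcOverhead and its method names are hypothetical, and for concurrent collectors the reported collection time does not necessarily equal application pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcOverhead {
    /** Total milliseconds the JVM reports as spent in GC since startup. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if undefined for this collector
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Throughput as the percentage of elapsed time not spent in GC. */
    static double throughputPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 100.0;
        }
        return 100.0 * (1.0 - (double) gcMillis / uptimeMillis);
    }

    /** Throughput of the running JVM since startup. */
    static double currentThroughputPercent() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return throughputPercent(totalGcMillis(), uptime);
    }
}
```

Sampling totalGcMillis at a fixed interval and applying throughputPercent to the deltas gives a per-interval figure, which is usually more useful for alerting than the since-startup average.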
Best Practices
- Capture heap dumps: Enable -XX:+HeapDumpOnOutOfMemoryError.
- Analyze leaks systematically: Use Eclipse MAT for root cause analysis.
- Monitor proactively: Don't wait for OOM to investigate.
- Test fixes under load: Reproduce production conditions.
- Document changes: Record tuning decisions and results.
- Tune incrementally: Change one parameter at a time.
- Consider collector switch: G1 → ZGC for ultra-low latency.
- Profile allocation: Identify and optimize hotspots.
- Watch for Full GC: Investigate causes immediately.
- Use JFR continuously: Low overhead, valuable insights.