26.4 Troubleshooting and Performance Optimization

Identifying and resolving GC-related performance issues requires systematic analysis and appropriate tools.

Memory Leaks

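// Detecting Memory Leaks
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

public class MemoryLeakDetection {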

    public static void printLeakSymptoms() {
        System.out.println("=== MEMORY LEAK SYMPTOMS ===");

        System.out.println("\n--- INDICATORS ---");
        System.out.println("✗ Heap usage increases steadily over time");
        System.out.println("✗ Old Gen continuously grows");
        System.out.println("✗ Frequent Full GCs");
        System.out.println("✗ Full GCs don't reclaim much memory");
        System.out.println("✗ Eventually OutOfMemoryError");
        System.out.println("✗ Application restarts temporarily fix issue");

        System.out.println("\n--- COMMON CAUSES ---");
        System.out.println("1. Collections not cleared");
        System.out.println("   - Static collections");
        System.out.println("   - Caches without eviction");
        System.out.println("   - Listeners not removed");

        System.out.println("\n2. ThreadLocal leaks");
        System.out.println("   - ThreadLocal not removed");
        System.out.println("   - Thread pools holding references");

        System.out.println("\n3. ClassLoader leaks");
        System.out.println("   - Classes not unloaded");
        System.out.println("   - Static references to application classes");

        System.out.println("\n4. Native memory leaks");
        System.out.println("   - JNI resources not freed");
        System.out.println("   - Direct buffers not released");
    }

    // Example: Classic memory leak
    public static class LeakyCache {
        // ❌ LEAK: Static collection never cleared
        private static final Map<String, Object> cache = new HashMap<>();

        public void addToCache(String key, Object value) {
            cache.put(key, value);
            // Objects accumulate, never removed
        }

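        // ✅ FIXED (partially): WeakHashMap drops an entry once its key is no longer
        // strongly referenced elsewhere. Interned String keys (e.g. literals) never
        // qualify, so a size- or time-bounded cache is usually the safer fix.
        private static final Map<String, Object> fixedCache = 
            new WeakHashMap<>();

        public void addToFixedCache(String key, Object value) {
            fixedCache.put(key, value);
            // Entry becomes collectible once the key is only weakly reachable
        }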
    }

    // Example: ThreadLocal leak
    public static class ThreadLocalLeak {
        // ❌ LEAK: ThreadLocal not cleaned up
        private static final ThreadLocal<List<String>> cache = 
            ThreadLocal.withInitial(ArrayList::new);

        public void addData(String data) {
            cache.get().add(data);
            // In thread pool, thread reused, list keeps growing
        }

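        // ✅ FIXED: Remove the ThreadLocal when the unit of work (e.g. a request) ends
        public void runTask(Runnable task) {
            try {
                task.run(); // The task may call addData() any number of times
            } finally {
                cache.remove(); // Drop this thread's list before the thread returns to the pool
            }
        }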
    }
}
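The "implement eviction" fix mentioned in LeakyCache can be sketched with the JDK alone. The following is a minimal sketch, not a production cache: BoundedLruCache and its demo class are hypothetical names introduced here, and the class relies on LinkedHashMap's access-order mode and its removeEldestEntry hook to cap the number of entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-bounded cache that evicts the least recently used entry. */
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: iteration runs from least to most recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // invoked after each insertion; true evicts the eldest entry
    }
}

class BoundedLruCacheDemo {
    public static void main(String[] args) {
        Map<String, Object> cache = new BoundedLruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");     // touch "a" so that "b" becomes the eldest entry
        cache.put("c", 3);  // exceeds the limit and evicts "b"
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

Like HashMap, this class is not thread-safe. For a shared static cache it would need external synchronization, for example Collections.synchronizedMap, or a dedicated caching library.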

Diagnosing Memory Leaks

// Memory Leak Diagnosis Steps
public class LeakDiagnosis {

    public static void printDiagnosisSteps() {
        System.out.println("=== MEMORY LEAK DIAGNOSIS PROCESS ===");

        System.out.println("\n--- STEP 1: CONFIRM LEAK ---");
        System.out.println("1. Monitor heap over time:");
        System.out.println("   jstat -gcutil <pid> 10000");
        System.out.println("   Watch Old Gen utilization (O column, percent)");
        System.out.println("   O still climbing after each Full GC → likely leak");

        System.out.println("\n2. Check GC effectiveness:");
        System.out.println("   If Full GC reclaims little memory → leak, or live set too large for the heap");

        System.out.println("\n--- STEP 2: CAPTURE HEAP DUMPS ---");
        System.out.println("1. Early heap dump (baseline):");
        System.out.println("   jcmd <pid> GC.heap_dump /tmp/heap1.hprof");

        System.out.println("\n2. Run application under load");
        System.out.println("   Wait for memory to grow");

        System.out.println("\n3. Late heap dump (after leak):");
        System.out.println("   jcmd <pid> GC.heap_dump /tmp/heap2.hprof");

        System.out.println("\n--- STEP 3: ANALYZE WITH ECLIPSE MAT ---");
        System.out.println("1. Open heap dump in Eclipse MAT");
        System.out.println("   Download: eclipse.org/mat");

        System.out.println("\n2. Run Leak Suspects Report");
        System.out.println("   Overview → Reports → Leak Suspects");
        System.out.println("   Identifies objects consuming most memory");

        System.out.println("\n3. Open Dominator Tree");
        System.out.println("   Shows objects and their retained size");
        System.out.println("   Retained size = memory freed if object collected");

        System.out.println("\n4. Compare heap dumps");
        System.out.println("   Compare baseline vs leak dump");
        System.out.println("   Identify growing object types");

        System.out.println("\n5. Find GC roots");
        System.out.println("   Right-click object → Path to GC Roots");
        System.out.println("   Shows why object can't be collected");

        System.out.println("\n--- STEP 4: FIX LEAK ---");
        System.out.println("Common fixes:");
        System.out.println("  • Clear collections when done");
        System.out.println("  • Remove event listeners");
        System.out.println("  • Clean up ThreadLocal");
        System.out.println("  • Use weak references for caches");
        System.out.println("  • Implement cache eviction");
        System.out.println("  • Fix ClassLoader lifecycle");
    }
}
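Step 1 of the process above (confirming a leak from the post-GC trend) can also be approximated in code. This is a rough heuristic sketch, not a detector to rely on: LeakTrend, its method names, and the 25% growth threshold are assumptions made for illustration. It treats a series of "used after GC" samples as suspicious only if every sample exceeds the previous one and the total growth passes a threshold. The sample arrays in main are invented values.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class LeakTrend {
    /**
     * Returns true if every sample is larger than the previous one and the
     * total growth is at least minGrowthFraction of the first sample.
     */
    static boolean isSteadilyGrowing(long[] postGcUsedBytes, double minGrowthFraction) {
        if (postGcUsedBytes.length < 3) return false; // too few points to call it a trend
        for (int i = 1; i < postGcUsedBytes.length; i++) {
            if (postGcUsedBytes[i] <= postGcUsedBytes[i - 1]) return false;
        }
        long first = postGcUsedBytes[0];
        long last = postGcUsedBytes[postGcUsedBytes.length - 1];
        return (last - first) >= first * minGrowthFraction;
    }

    /** Sums the after-last-GC usage of all heap pools that report it; -1 if none do. */
    static long heapUsedAfterLastGc() {
        long total = 0;
        boolean seen = false;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (pool.getType() == MemoryType.HEAP && afterGc != null) {
                total += afterGc.getUsed();
                seen = true;
            }
        }
        return seen ? total : -1;
    }

    public static void main(String[] args) {
        long[] leaking = {400L << 20, 450L << 20, 520L << 20, 610L << 20}; // invented samples
        long[] healthy = {400L << 20, 380L << 20, 410L << 20, 395L << 20}; // invented samples
        System.out.println(isSteadilyGrowing(leaking, 0.25)); // prints true
        System.out.println(isSteadilyGrowing(healthy, 0.25)); // prints false
        System.out.println("Heap used after last GC: " + heapUsedAfterLastGc() + " bytes");
    }
}
```

In practice the samples would be collected periodically from heapUsedAfterLastGc() or from GC logs. A positive result is only a prompt to capture and compare heap dumps, not proof of a leak.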

OutOfMemoryError Scenarios

// OutOfMemoryError Types
public class OutOfMemoryErrors {

    public static void printOOMTypes() {
        System.out.println("=== OUT OF MEMORY ERROR TYPES ===");

        System.out.println("\n--- 1. HEAP SPACE ---");
        System.out.println("Error: java.lang.OutOfMemoryError: Java heap space");

        System.out.println("\nCauses:");
        System.out.println("  • Heap too small for application");
        System.out.println("  • Memory leak");
        System.out.println("  • Large object allocation");

        System.out.println("\nDiagnosis:");
        System.out.println("  1. Capture heap dump:");
        System.out.println("     -XX:+HeapDumpOnOutOfMemoryError");
        System.out.println("     -XX:HeapDumpPath=/tmp/heap.hprof");
        System.out.println("  2. Analyze with Eclipse MAT");
        System.out.println("  3. Identify largest objects");

        System.out.println("\nSolutions:");
        System.out.println("  • Increase heap: -Xmx8g");
        System.out.println("  • Fix memory leak");
        System.out.println("  • Optimize object usage");

        System.out.println("\n--- 2. METASPACE ---");
        System.out.println("Error: java.lang.OutOfMemoryError: Metaspace");

        System.out.println("\nCauses:");
        System.out.println("  • Too many classes loaded");
        System.out.println("  • ClassLoader leak (classes not unloaded)");
        System.out.println("  • Dynamic class generation");

        System.out.println("\nDiagnosis:");
        System.out.println("  1. Monitor metaspace:");
        System.out.println("     jstat -gc <pid>");
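        System.out.println("     Watch MC (metaspace committed) and MU (metaspace used)");
        System.out.println("  2. Inspect loaded classes and class loaders:");
        System.out.println("     jcmd <pid> GC.class_histogram");
        System.out.println("     jcmd <pid> VM.classloader_stats");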

        System.out.println("\nSolutions:");
        System.out.println("  • Raise the limit if one is set: -XX:MaxMetaspaceSize=512m");
        System.out.println("  • Fix ClassLoader leak");
        System.out.println("  • Reduce dynamic class generation");

        System.out.println("\n--- 3. GC OVERHEAD LIMIT EXCEEDED ---");
        System.out.println("Error: java.lang.OutOfMemoryError: GC overhead limit exceeded");

        System.out.println("\nTriggered when:");
        System.out.println("  • >98% of time spent in GC");
        System.out.println("  • <2% of heap recovered");
        System.out.println("  • Happened 5 consecutive times");

        System.out.println("\nCauses:");
        System.out.println("  • Heap too small");
        System.out.println("  • Memory leak");
        System.out.println("  • Application needs too much memory");

        System.out.println("\nSolutions:");
        System.out.println("  • Increase heap");
        System.out.println("  • Fix memory leak");
        System.out.println("  • Disable check (not recommended):");
        System.out.println("    -XX:-UseGCOverheadLimit");

        System.out.println("\n--- 4. DIRECT BUFFER MEMORY ---");
        System.out.println("Error: java.lang.OutOfMemoryError: Direct buffer memory");

        System.out.println("\nCauses:");
        System.out.println("  • Too many DirectByteBuffer allocations");
        System.out.println("  • Buffers not released");

        System.out.println("\nSolutions:");
        System.out.println("  • Increase limit: -XX:MaxDirectMemorySize=1g");
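        System.out.println("  • Pool and reuse direct buffers instead of allocating per request");
        System.out.println("  • Drop references promptly (ByteBuffer is not AutoCloseable;");
        System.out.println("    native memory is freed only after the buffer is collected)");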

        System.out.println("\n--- 5. UNABLE TO CREATE NEW NATIVE THREAD ---");
        System.out.println("Error: java.lang.OutOfMemoryError: unable to create new native thread");

        System.out.println("\nCauses:");
        System.out.println("  • Too many threads");
        System.out.println("  • OS thread limit reached");
        System.out.println("  • Insufficient native memory");

        System.out.println("\nSolutions:");
        System.out.println("  • Reduce thread count");
        System.out.println("  • Use thread pools");
        System.out.println("  • Increase OS limits (ulimit)");
        System.out.println("  • Reduce heap size (more native memory)");
    }
}
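Two of the error types above (Metaspace and Direct buffer memory) concern memory outside the Java heap, which heap-focused tools do not show. As a minimal sketch, the standard java.lang.management beans expose both. NonHeapUsage is a hypothetical helper name, and the pool names "direct" and "Metaspace" are the ones HotSpot uses, so other JVMs may name them differently.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.nio.ByteBuffer;

final class NonHeapUsage {
    /** Bytes currently held by direct ByteBuffers, or -1 if no "direct" pool is exposed. */
    static long directBufferBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    /** Bytes used by the pool named "Metaspace" (HotSpot naming), or -1 if absent. */
    static long metaspaceBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBufferBytes();
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MiB of off-heap memory
        long after = directBufferBytes();
        System.out.println("Direct buffers grew by " + (after - before) + " bytes");
        System.out.println("Metaspace used: " + metaspaceBytes() + " bytes");
        System.out.println("Buffer capacity: " + buffer.capacity()); // keeps the buffer reachable
    }
}
```

Sampling these values over time gives the same kind of trend for native memory that jstat gives for the heap.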

Long GC Pauses

// Diagnosing Long GC Pauses
public class LongGCPauses {

    public static void printPauseDiagnosis() {
        System.out.println("=== DIAGNOSING LONG GC PAUSES ===");

        System.out.println("\n--- SYMPTOMS ---");
        System.out.println("✗ Application unresponsive periodically");
        System.out.println("✗ Request timeouts");
        System.out.println("✗ SLA violations");
        System.out.println("✗ GC logs show long pause times");

        System.out.println("\n--- COMMON CAUSES ---");

        System.out.println("\n1. Full GC Events");
        System.out.println("   Problem: Full GC pauses entire JVM");
        System.out.println("   Duration: Seconds to tens of seconds");
        System.out.println("   Check GC logs for 'Full GC' events");

        System.out.println("\n2. Large Young Generation");
        System.out.println("   Problem: More live objects to copy per Young GC");
        System.out.println("   Solution: Reduce Young Gen size");
        System.out.println("   Flag: -XX:MaxNewSize=2g");

        System.out.println("\n3. High Promotion Rate");
        System.out.println("   Problem: Objects promoted too fast to Old Gen");
        System.out.println("   Solution: Increase Survivor space");
        System.out.println("   Flag: -XX:SurvivorRatio=6");

        System.out.println("\n4. Humongous Objects (G1)");
        System.out.println("   Problem: Large objects (>=50% of region size)");
        System.out.println("   Solution: Increase region size");
        System.out.println("   Flag: -XX:G1HeapRegionSize=32m");

        System.out.println("\n5. Reference Processing");
        System.out.println("   Problem: Many weak/soft/phantom references");
        System.out.println("   Check: JFR for reference processing time");
        System.out.println("   Solution: Reduce reference usage");

        System.out.println("\n6. NUMA Awareness");
        System.out.println("   Problem: Cross-NUMA node access");
        System.out.println("   Solution: Enable NUMA support");
        System.out.println("   Flag: -XX:+UseNUMA");
    }

    public static void printPauseTuningSteps() {
        System.out.println("\n=== PAUSE TIME TUNING PROCESS ===");

        System.out.println("\n--- STEP 1: MEASURE BASELINE ---");
        System.out.println("1. Enable detailed GC logging:");
        System.out.println("   -Xlog:gc*:file=gc.log:time,uptime,level,tags");

        System.out.println("\n2. Run under production load");

        System.out.println("\n3. Analyze logs:");
        System.out.println("   - Identify longest pauses");
        System.out.println("   - Check pause frequency");
        System.out.println("   - Note GC types (Minor, Major, Full)");

        System.out.println("\n--- STEP 2: IDENTIFY BOTTLENECK ---");
        System.out.println("Check what's taking time in GC:");

        System.out.println("\n• Young GC too slow:");
        System.out.println("  → Reduce Young Gen size");
        System.out.println("  → Increase GC threads");

        System.out.println("\n• Old GC too slow:");
        System.out.println("  → Lower IHOP (start marking earlier)");
        System.out.println("  → Increase heap size");

        System.out.println("\n• Frequent Full GC:");
        System.out.println("  → Increase heap");
        System.out.println("  → Fix memory leak");
        System.out.println("  → Relax an overly aggressive pause time goal");

        System.out.println("\n--- STEP 3: TUNE PARAMETERS ---");
        System.out.println("G1 tuning:");
        System.out.println("  -XX:MaxGCPauseMillis=50  # Lower goal");
        System.out.println("  -XX:InitiatingHeapOccupancyPercent=35  # Earlier marking");

        System.out.println("\nOr switch collector:");
        System.out.println("  -XX:+UseZGC  # Typically sub-millisecond pauses");

        System.out.println("\n--- STEP 4: TEST AND VERIFY ---");
        System.out.println("1. Run with new settings");
        System.out.println("2. Measure pause times");
        System.out.println("3. Check throughput impact");
        System.out.println("4. Iterate if needed");
    }
}
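The "analyze logs" step in the tuning process can be partly automated. The sketch below assumes unified-logging lines of the shape produced by -Xlog:gc, where a pause line ends in a duration such as "12.500ms". GcPauseStats is a hypothetical helper, and the sample lines in main are illustrative, not real measurements. It extracts the pause durations and reports nearest-rank percentiles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcPauseStats {
    // Matches lines such as "... GC(7) Pause Young (Normal) ... 45.678ms"
    private static final Pattern PAUSE =
        Pattern.compile("GC\\(\\d+\\) Pause .*?(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Extracts pause durations in milliseconds, sorted ascending. */
    static List<Double> pauseMillis(List<String> logLines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        Collections.sort(pauses);
        return pauses;
    }

    /** Nearest-rank percentile of an ascending list; p is in (0, 100]. */
    static double percentile(List<Double> sorted, double p) {
        if (sorted.isEmpty()) throw new IllegalArgumentException("no samples");
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    public static void main(String[] args) {
        List<String> sample = List.of( // illustrative lines, not real measurements
            "[1.234s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 3.250ms",
            "[2.345s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 52M->9M(256M) 12.500ms",
            "[3.456s][info][gc] GC(2) Concurrent Mark Cycle",
            "[4.567s][info][gc] GC(3) Pause Full (System.gc()) 80M->10M(256M) 95.000ms");
        List<Double> pauses = pauseMillis(sample);
        System.out.println("Max:    " + percentile(pauses, 100)); // prints Max:    95.0
        System.out.println("Median: " + percentile(pauses, 50));  // prints Median: 12.5
    }
}
```

For a real log, the lines could be read with Files.readAllLines. The regular expression may need adjusting if the log decorators or the collector's message format differ.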

Heap Dump Analysis

// Heap Dump Analysis Techniques
public class HeapDumpAnalysis {

    public static void printAnalysisTechniques() {
        System.out.println("=== HEAP DUMP ANALYSIS ===");

        System.out.println("\n--- CAPTURING HEAP DUMPS ---");

        System.out.println("\n1. On OutOfMemoryError (automatic):");
        System.out.println("   -XX:+HeapDumpOnOutOfMemoryError");
        System.out.println("   -XX:HeapDumpPath=/tmp/");

        System.out.println("\n2. Manual capture (jcmd):");
        System.out.println("   jcmd <pid> GC.heap_dump /tmp/heap.hprof");

        System.out.println("\n3. Manual capture (jmap):");
        System.out.println("   jmap -dump:live,format=b,file=heap.hprof <pid>");
        System.out.println("   'live' dumps only reachable objects (forces a Full GC first)");

        System.out.println("\n4. Programmatic capture:");
        System.out.println("   HotSpotDiagnosticMXBean mbean =");
        System.out.println("     ManagementFactory.getPlatformMXBean(");
        System.out.println("       HotSpotDiagnosticMXBean.class);");
        System.out.println("   mbean.dumpHeap(\"/tmp/heap.hprof\", true);");

        System.out.println("\n--- ECLIPSE MAT ANALYSIS ---");

        System.out.println("\n1. Leak Suspects Report");
        System.out.println("   Automatically identifies likely leaks");
        System.out.println("   Shows biggest memory consumers");

        System.out.println("\n2. Histogram");
        System.out.println("   Class → Object count → Shallow size → Retained size");
        System.out.println("   Shallow: Object's own memory");
        System.out.println("   Retained: Memory freed if object collected");

        System.out.println("\n3. Dominator Tree");
        System.out.println("   Objects sorted by retained size");
        System.out.println("   Shows object retention hierarchy");

        System.out.println("\n4. Path to GC Roots");
        System.out.println("   Right-click object → Path to GC Roots");
        System.out.println("   Shows why object can't be collected");
        System.out.println("   Pick an 'exclude ...' variant to ignore paths through:");
        System.out.println("     - Weak references");
        System.out.println("     - Soft references");
        System.out.println("     - Phantom references");

        System.out.println("\n5. OQL (Object Query Language)");
        System.out.println("   SQL-like queries on heap");
        System.out.println("   Example: SELECT * FROM java.lang.String");
        System.out.println("   Example: SELECT s FROM java.lang.String s");
        System.out.println("            WHERE s.value.@length > 1000");
        System.out.println("            (the String.count field no longer exists on modern JDKs)");

        System.out.println("\n--- COMMON PATTERNS ---");

        System.out.println("\n1. Collection Leak:");
        System.out.println("   Large HashMap/ArrayList with many entries");
        System.out.println("   Fix: Clear collection, implement eviction");

        System.out.println("\n2. String Duplication:");
        System.out.println("   Many identical String objects");
        System.out.println("   Fix: Use String.intern() or G1 deduplication (-XX:+UseStringDeduplication)");

        System.out.println("\n3. ClassLoader Leak:");
        System.out.println("   Old ClassLoader not unloaded");
        System.out.println("   Classes and static fields retained");
        System.out.println("   Fix: Remove static references to app classes");

        System.out.println("\n4. ThreadLocal Leak:");
        System.out.println("   ThreadLocalMap entries not cleaned");
        System.out.println("   Fix: Call ThreadLocal.remove()");
    }
}
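The programmatic capture option listed above is printed only as text, so here is a runnable sketch of it. It assumes a HotSpot-based JVM, since HotSpotDiagnosticMXBean lives in com.sun.management. HeapDumper and the temporary directory name are hypothetical choices for the example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumper {
    /**
     * Writes a heap dump to the given file. The file must not already exist;
     * recent JDKs also require the name to end in ".hprof".
     */
    static void dump(Path file, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(file.toString(), liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdumps"); // hypothetical location
        Path file = dir.resolve("heap-" + System.currentTimeMillis() + ".hprof");
        dump(file, true); // true = only reachable objects, like jmap's "live" option
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

A dump pauses the application while it is written and can be as large as the heap, so a call like this is best placed behind an admin-only trigger rather than on a request path.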

Performance Case Studies

// Real-World Performance Issues
public class PerformanceCaseStudies {

    public static void printCaseStudy1() {
        System.out.println("=== CASE STUDY 1: FREQUENT FULL GCs ===");

        System.out.println("\n--- SYMPTOMS ---");
        System.out.println("• Full GC every 10 minutes");
        System.out.println("• Each Full GC takes 5-10 seconds");
        System.out.println("• Application pauses noticeable to users");

        System.out.println("\n--- DIAGNOSIS ---");
        System.out.println("1. Analyzed GC logs: Old Gen filling up fast");
        System.out.println("2. Checked jstat: High promotion rate (100MB/s)");
        System.out.println("3. JFR profiling: Excessive allocation in request processing");

        System.out.println("\n--- ROOT CAUSE ---");
        System.out.println("Objects allocated in request handler surviving Minor GC");
        System.out.println("  → Premature promotion to Old Gen");
        System.out.println("  → Old Gen filling quickly");
        System.out.println("  → Frequent Full GC");

        System.out.println("\n--- SOLUTION ---");
        System.out.println("1. Increased Young Gen size:");
        System.out.println("   -XX:NewRatio=1 (from default 2)");
        System.out.println("2. Increased Survivor space:");
        System.out.println("   -XX:SurvivorRatio=6 (from 8)");

        System.out.println("\n--- RESULTS ---");
        System.out.println("✓ Full GC frequency: 10min → 4 hours");
        System.out.println("✓ Promotion rate: 100MB/s → 20MB/s");
        System.out.println("✓ Application latency improved significantly");
    }

    public static void printCaseStudy2() {
        System.out.println("\n=== CASE STUDY 2: MEMORY LEAK ===");

        System.out.println("\n--- SYMPTOMS ---");
        System.out.println("• Heap usage steadily increasing");
        System.out.println("• Full GCs not reclaiming memory");
        System.out.println("• OutOfMemoryError after 48 hours");

        System.out.println("\n--- DIAGNOSIS ---");
        System.out.println("1. Captured heap dumps at 1hr and 24hr");
        System.out.println("2. Analyzed with Eclipse MAT");
        System.out.println("3. Dominator tree showed large cache Map");
        System.out.println("4. Path to GC roots: Static field → Cache → millions of entries");

        System.out.println("\n--- ROOT CAUSE ---");
        System.out.println("Static cache Map without eviction policy");
        System.out.println("  → Entries never removed");
        System.out.println("  → Unbounded growth");
        System.out.println("  → Eventually OOM");

        System.out.println("\n--- SOLUTION ---");
        System.out.println("Replaced HashMap with Caffeine cache:");
        System.out.println("  Cache<String, Object> cache = Caffeine.newBuilder()");
        System.out.println("    .maximumSize(10_000)");
        System.out.println("    .expireAfterWrite(1, TimeUnit.HOURS)");
        System.out.println("    .build();");

        System.out.println("\n--- RESULTS ---");
        System.out.println("✓ Heap usage stable");
        System.out.println("✓ No more OOM errors");
        System.out.println("✓ Application runs indefinitely");
    }

    public static void printCaseStudy3() {
        System.out.println("\n=== CASE STUDY 3: LONG GC PAUSES (G1) ===");

        System.out.println("\n--- SYMPTOMS ---");
        System.out.println("• P99 pause time: 500ms (target: <100ms)");
        System.out.println("• Request timeouts during GC");
        System.out.println("• Using G1 with 32GB heap");

        System.out.println("\n--- DIAGNOSIS ---");
        System.out.println("1. GC logs showed Young GC pauses >400ms");
        System.out.println("2. Large Young Gen (20GB)");
        System.out.println("3. Many references to process");

        System.out.println("\n--- ROOT CAUSE ---");
        System.out.println("G1 trying to meet pause time goal");
        System.out.println("  → But Young Gen too large");
        System.out.println("  → Can't collect in target time");
        System.out.println("  → Pauses exceed goal");

        System.out.println("\n--- SOLUTION ---");
        System.out.println("Switched to ZGC:");
        System.out.println("  -XX:+UseZGC");
        System.out.println("  -Xms32g -Xmx32g");

        System.out.println("\n--- RESULTS ---");
        System.out.println("✓ P99 pause time: 500ms → 0.5ms (1000x improvement)");
        System.out.println("✓ No more request timeouts");
        System.out.println("✓ Throughput slightly reduced (~5%) but acceptable");
    }
}
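Before/after comparisons like those in the case studies need numbers taken from the running JVM. As a minimal sketch, the standard GarbageCollectorMXBean counters give collection counts and accumulated collection time, from which a rough GC overhead figure follows. GcOverhead is a hypothetical helper. For concurrent collectors some beans report cycle time rather than pause time, so the result is only an approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverhead {
    /** Percentage of elapsed time attributed to GC. */
    static double overheadPercent(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) throw new IllegalArgumentException("elapsedMillis must be positive");
        return 100.0 * gcMillis / elapsedMillis;
    }

    /** Accumulated collection time over all collectors; undefined values (-1) are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) total += time;
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-28s count=%d time=%dms%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC overhead since start: %.2f%%%n",
            overheadPercent(totalGcMillis(), Math.max(uptime, 1)));
    }
}
```

Taking two snapshots and dividing the difference in GC time by the interval gives the overhead for that window, which is usually more informative than the since-startup average.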

GC Tuning Checklist

// Comprehensive Tuning Checklist
public class GCTuningChecklist {

    public static void printChecklist() {
        System.out.println("=== GC TUNING CHECKLIST ===");

        System.out.println("\n--- BASELINE (DO THIS FIRST) ---");
        System.out.println("☐ Enable GC logging");
        System.out.println("  -Xlog:gc*:file=gc.log:time,uptime,tags");
        System.out.println("☐ Set heap size (Xms = Xmx)");
        System.out.println("  -Xms8g -Xmx8g");
        System.out.println("☐ Enable heap dump on OOM");
        System.out.println("  -XX:+HeapDumpOnOutOfMemoryError");
        System.out.println("☐ Set heap dump path");
        System.out.println("  -XX:HeapDumpPath=/var/log/heapdumps/");
        System.out.println("☐ Consider disabling explicit GC (System.gc() calls)");
        System.out.println("  -XX:+DisableExplicitGC  (avoid if NIO direct buffers are used heavily)");

        System.out.println("\n--- MONITORING ---");
        System.out.println("☐ Set up JFR continuous recording");
        System.out.println("☐ Monitor GC metrics (throughput, pause time)");
        System.out.println("☐ Monitor heap usage trends");
        System.out.println("☐ Monitor allocation rate");
        System.out.println("☐ Monitor promotion rate");
        System.out.println("☐ Set up alerting for Full GC");

        System.out.println("\n--- ANALYSIS ---");
        System.out.println("☐ Establish baseline metrics");
        System.out.println("☐ Identify performance requirements");
        System.out.println("☐ Determine primary constraint:");
        System.out.println("  • Latency (pause time)");
        System.out.println("  • Throughput");
        System.out.println("  • Memory footprint");

        System.out.println("\n--- COLLECTOR SELECTION ---");
        System.out.println("☐ Start with G1 (default)");
        System.out.println("☐ If pauses above ~100ms are unacceptable → Consider ZGC/Shenandoah");
        System.out.println("☐ If throughput critical (batch) → Consider Parallel");
        System.out.println("☐ Test collector under load");

        System.out.println("\n--- HEAP SIZING ---");
        System.out.println("☐ Set Xms = Xmx (avoid resizing)");
        System.out.println("☐ Allocate 25-50% system memory to heap");
        System.out.println("☐ Leave memory for OS and other processes");
        System.out.println("☐ Consider metaspace limit");

        System.out.println("\n--- TUNING (G1) ---");
        System.out.println("☐ Set pause time goal");
        System.out.println("  -XX:MaxGCPauseMillis=100");
        System.out.println("☐ If frequent Full GC:");
        System.out.println("  • Increase heap");
        System.out.println("  • Lower IHOP (-XX:InitiatingHeapOccupancyPercent=35)");
        System.out.println("☐ If premature promotion:");
        System.out.println("  • Increase survivor space");
        System.out.println("  • Increase tenuring threshold (-XX:MaxTenuringThreshold)");

        System.out.println("\n--- VALIDATION ---");
        System.out.println("☐ Test under production load");
        System.out.println("☐ Verify pause times meet SLA");
        System.out.println("☐ Verify throughput acceptable");
        System.out.println("☐ Monitor for regressions");
        System.out.println("☐ Document settings and rationale");
    }
}
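Part of the baseline section of the checklist can be verified at startup. The sketch below is one possible approach, with BaselineFlagCheck as a hypothetical helper name. It inspects the JVM's input arguments and reports baseline items that appear to be missing. The comparison of -Xms and -Xmx is purely textual, so equivalent values written in different units (8g versus 8192m) would be flagged.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class BaselineFlagCheck {
    /** Returns a human-readable finding for each baseline item that appears to be missing. */
    static List<String> findings(List<String> jvmArgs) {
        List<String> findings = new ArrayList<>();
        String xms = valueOf(jvmArgs, "-Xms");
        String xmx = valueOf(jvmArgs, "-Xmx");
        if (xms == null || xmx == null) {
            findings.add("Set both -Xms and -Xmx");
        } else if (!xms.equalsIgnoreCase(xmx)) {
            findings.add("-Xms (" + xms + ") differs from -Xmx (" + xmx + ")");
        }
        if (!jvmArgs.contains("-XX:+HeapDumpOnOutOfMemoryError")) {
            findings.add("Enable -XX:+HeapDumpOnOutOfMemoryError");
        }
        if (jvmArgs.stream().noneMatch(a -> a.startsWith("-Xlog:gc"))) {
            findings.add("Enable GC logging with -Xlog:gc*");
        }
        return findings;
    }

    /** Value of the last argument starting with the prefix, or null if there is none. */
    private static String valueOf(List<String> args, String prefix) {
        String value = null;
        for (String arg : args) {
            if (arg.startsWith(prefix)) value = arg.substring(prefix.length());
        }
        return value;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> findings = findings(jvmArgs);
        if (findings.isEmpty()) System.out.println("Baseline flags look complete");
        findings.forEach(f -> System.out.println("WARN: " + f));
    }
}
```

Logging these findings once at startup makes configuration drift between environments visible without anyone having to inspect launch scripts.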

Best Practices

  • Capture heap dumps: Enable -XX:+HeapDumpOnOutOfMemoryError.
  • Analyze leaks systematically: Use Eclipse MAT for root cause analysis.
  • Monitor proactively: Don't wait for OOM to investigate.
  • Test fixes under load: Reproduce production conditions.
  • Document changes: Record tuning decisions and results.
  • Tune incrementally: Change one parameter at a time.
  • Consider collector switch: G1 → ZGC for ultra-low latency.
  • Profile allocation: Identify and optimize hotspots.
  • Watch for Full GC: Investigate causes immediately.
  • Use JFR continuously: Low overhead, valuable insights.