Diagnostics & Troubleshooting

Troubleshooting Methodology

JVM troubleshooting follows a cycle of “Monitor → Locate → Analyze → Resolve → Verify”:

Detect anomaly (monitoring alert / user report)
    │
    ▼
Locate problem (tool-based investigation)
    │
    ▼
Analyze cause (log/dump analysis)
    │
    ▼
Implement fix (parameter adjustment / code fix)
    │
    ▼
Verify results (compare against baseline)

High CPU Usage Troubleshooting

Investigation Steps

1. top: find the Java process with high CPU usage
2. top -Hp <pid>: find the high-CPU thread
3. printf '%x\n' <tid>: convert the thread ID to hexadecimal
4. jstack <pid> | grep <hex_tid>: view the thread stack (the hex ID appears in the dump as nid=0x...)
5. Analyze hotspot methods

Detailed Procedure

# 1. Find the Java process
top -c
# PID  USER   %CPU  %MEM  COMMAND
# 12345 app    98.2  15.3  java -jar app.jar

# 2. Find the high-CPU thread
top -Hp 12345
# PID    USER   %CPU  COMMAND
# 12350  app    95.1  java -jar app.jar

# 3. Convert thread ID to hexadecimal
printf '%x\n' 12350
# 303e

# 4. View thread stack
jstack 12345 | grep -A 30 "0x303e"
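
When shell access to the host is not available, a similar picture can be obtained from inside the JVM through the standard java.lang.management API. The following is a minimal sketch, not a replacement for top + jstack: the class name TopCpuThreads is hypothetical, the CPU figure is the time accumulated since the thread started (not a current percentage as in top), and the thread IDs are Java thread IDs rather than the OS thread IDs that jstack prints as nid.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class TopCpuThreads {

    /** One report row: a thread snapshot and the CPU time it has consumed so far. */
    static final class Usage {
        final ThreadInfo info;
        final long cpuNanos;

        Usage(ThreadInfo info, long cpuNanos) {
            this.info = info;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Returns up to n live threads ordered by accumulated CPU time, highest first. */
    static List<Usage> top(int n) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Usage> rows = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return rows;                                  // CPU timing unavailable on this JVM
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);           // -1 if the thread has already died
            ThreadInfo info = mx.getThreadInfo(id, 30);   // null if dead; at most 30 stack frames
            if (cpu >= 0 && info != null) {
                rows.add(new Usage(info, cpu));
            }
        }
        rows.sort((a, b) -> Long.compare(b.cpuNanos, a.cpuNanos));
        return rows.subList(0, Math.min(n, rows.size()));
    }

    public static void main(String[] args) {
        for (Usage u : top(3)) {
            System.out.printf("%s (java id %d) cpu=%d ms%n",
                    u.info.getThreadName(), u.info.getThreadId(), u.cpuNanos / 1_000_000);
            for (StackTraceElement frame : u.info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

To approximate a current rate rather than a lifetime total, call top twice a few seconds apart and compare the cpuNanos values per thread.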

Common Causes

Cause	Stack Characteristic	Solution
Infinite loop	The same method stays at the top of the stack across consecutive dumps	Fix code logic
Regex backtracking	`java.util.regex.Pattern`	Use possessive quantifiers or atomic groups
Frequent GC	`VM Thread` or GC threads have high CPU	Increase heap / optimize GC
Cryptographic operations	`sun.security.*`	Use caching / async
Serialization	`java.io.ObjectOutputStream`	Optimize serialization strategy
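
The regex row can be illustrated with a small sketch. The patterns below are hypothetical examples, not taken from any real application: nested quantifiers such as (a+)+ can force the engine to try a very large number of ways to split a run of characters, while a possessive quantifier (a++) never gives characters back once it has consumed them.

```java
import java.util.regex.Pattern;

final class PossessiveRegexDemo {
    // Backtracking-prone: on a long run of 'a' without a trailing 'b',
    // the engine can retry many different splits of the run before giving up.
    static final Pattern GREEDY = Pattern.compile("(a+)+b");

    // Possessive inner quantifier: a++ keeps everything it matched, so a failed match ends quickly.
    static final Pattern POSSESSIVE = Pattern.compile("(a++)+b");

    static boolean matchesSafely(String input) {
        return POSSESSIVE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String almost = "a".repeat(40);                 // worst case for GREEDY: no trailing 'b'
        System.out.println(matchesSafely("aaab"));      // prints true
        System.out.println(matchesSafely(almost));      // prints false
        // GREEDY.matcher(almost).matches() is deliberately not called here,
        // because it may run for a very long time.
    }
}
```

A possessive quantifier changes what the pattern can match in some cases, so the rewritten expression should be checked against the inputs the original was meant to accept.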

Quick Investigation with Arthas

# Directly find the top 3 threads by CPU usage
thread -n 3

# View specific thread stack
thread <thread-id>

# Trace method execution time
trace com.example.Service method

Memory Leak Investigation

Investigation Steps

1. Confirm that heap usage keeps growing after each GC (jstat / jmap / JFR)
2. Export a heap dump (jmap -dump:format=b,file=<file> <pid>, or start the JVM with -XX:+HeapDumpOnOutOfMemoryError)
3. Analyze the dump with MAT
4. Find the leaking object and its reference chain to GC Roots
5. Fix the code
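
The dump in step 2 can also be triggered from inside the application, which is useful when jmap cannot attach. The sketch below assumes a HotSpot-based JVM, since it relies on com.sun.management.HotSpotDiagnosticMXBean; the class name HeapDumper and the file name manual_dump.hprof are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumper {

    /**
     * Writes an .hprof heap dump of the current JVM to target.
     * liveOnly = true dumps only reachable objects, similar to jmap -dump:live.
     */
    static Path dump(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Files.deleteIfExists(target);                 // dumpHeap does not overwrite an existing file
        bean.dumpHeap(target.toString(), liveOnly);   // the file name should end with .hprof
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path out = dump(Path.of("manual_dump.hprof"), true);
        System.out.println("Wrote " + out.toAbsolutePath() + " (" + Files.size(out) + " bytes)");
    }
}
```

Writing a dump pauses the application and the file can be as large as the heap, so in production this is usually placed behind an admin-only endpoint or a signal handler rather than called routinely.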

Analyzing Heap Dumps with MAT

MAT (Memory Analyzer Tool) is a heap analysis tool provided by Eclipse.

1. Leak Suspects Report

After opening the dump, MAT automatically generates a leak suspects report identifying potential leak points:

Problem Suspect 1:
  The class java.util.concurrent.ConcurrentHashMap$Node occupies 256 MB (50%) of the heap.
  The class is loaded by org.apache.catalina.loader.WebappClassLoader.
  Keywords: java.util.concurrent.ConcurrentHashMap$Node

Problem Suspect 2:
  Thread main keeps local variables with total size 128 MB (25%).

2. Dominator Tree

View the object dominator tree, sorted by Retained Size:

Class                                    | Shallow Heap | Retained Heap
─────────────────────────────────────────|──────────────|──────────────
java.util.concurrent.ConcurrentHashMap   |        48 B  |      256 MB
  └─ java.util.concurrent.ConcurrentHashMap$Node[] | 1 MB |  255 MB
       └─ [0] ConcurrentHashMap$Node      |       32 B  |     10 MB
            └─ value: byte[10485760]      |   10 MB     |     10 MB
       └─ [1] ConcurrentHashMap$Node      |       32 B  |     20 MB
       └─ ...

Shallow Heap (shallow size): Memory occupied by the object itself
Retained Heap (retained size): Total memory that would be freed if the object were garbage collected (the object plus all objects it dominates)

3. GC Roots Reference Chain

Right-click object → Path To GC Roots → exclude weak/soft references:

Thread main
  └─ local variable: ConcurrentHashMap
       └─ [47] ConcurrentHashMap$Node
            └─ value: byte[10485760]   ← Leaking object

Common Memory Leak Scenarios

Scenario	Cause	Detection Method
Uncleared collections	Map/List keeps adding without removing	MAT: find collection reference chain
ThreadLocal leak	Pooled threads are reused, so a value set without remove() stays reachable	Check ThreadLocalMap
Unregistered listeners	Registered but never unregistered	Search addListener/removeListener
Unclosed resources	Connections/streams never closed	Related objects in Finalizer queue
Unbounded cache	Custom cache with no size limit	Check cache-related object count
Inner class references	Non-static inner class holds outer class reference	MAT: view outer class references
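
For the unbounded cache row, one common remedy is to cap the number of entries. The sketch below is one possible approach, assuming a least-recently-used eviction policy is acceptable; the class name BoundedLruCache is hypothetical. It builds on LinkedHashMap.removeEldestEntry from the JDK and is not thread-safe on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true);        // accessOrder = true: least recently used entry comes first
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insertion; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedLruCache<String, Integer> cache = new BoundedLruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");                // "a" becomes the most recently used entry
        cache.put("c", 3);             // exceeds the limit, so "b" is evicted
        System.out.println(cache.keySet());   // prints [a, c]
    }
}
```

For concurrent access, the map can be wrapped with Collections.synchronizedMap, or a dedicated caching library can be used instead.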

ThreadLocal Leak Example

// Wrong: ThreadLocal not cleaned in thread pool
ExecutorService pool = Executors.newFixedThreadPool(10);
pool.execute(() -> {
    threadLocal.set(largeObject);
    // Missing threadLocal.remove(): the pooled thread is reused, so largeObject stays reachable
});

// Correct: Clean up in finally block
pool.execute(() -> {
    try {
        threadLocal.set(largeObject);
        // Business logic
    } finally {
        threadLocal.remove();  // Must clean up
    }
});

OOM Failure Handling

OOM Types and Investigation Paths

OOM Message	Area	Investigation Direction
`Java heap space`	Heap	Memory leak or insufficient heap
`Metaspace`	Metaspace	Too many classes loaded (dynamic proxies/CGLIB)
`GC overhead limit exceeded`	Heap	Poor GC reclaim efficiency (98% time in GC, reclaiming < 2%)
`Direct buffer memory`	Direct memory	NIO off-heap memory overflow
`unable to create new native thread`	System memory / OS limits	Too many threads, or an OS thread/process limit (e.g. ulimit -u) reached
`Requested array size exceeds VM limit`	Heap	Array allocation too large

Heap OOM Investigation Flow

OOM: Java heap space
    │
    ├── Sudden OOM?
    │   ├── Yes → Check for bulk data loading
    │   │        → Increase -Xmx
    │   └── No (gradual growth) → Memory leak
    │
    ▼
Export heap dump for analysis
    │
    ├── Abnormal object count → Check code creating those objects
    │
    ├── Oversized collections → Check collection operation logic
    │
    └── Too many Classes/ClassLoaders → Check dynamic class generation

Metaspace OOM Investigation

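# View loaded class count
jstat -class <pid>

# Per-class metadata statistics (older JDKs only; requires -XX:+UnlockDiagnosticVMOptions)
jcmd <pid> GC.class_stats | head -20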

# View class histogram
jcmd <pid> GC.class_histogram | head -30

# Common causes
# 1. CGLIB/Spring AOP generating large numbers of proxy classes
# 2. JSPs frequently recompiled
# 3. Groovy/Scala scripts repeatedly compiled
# 4. ClassLoader leaks (web app hot deployment)

# Solutions
# - Increase metaspace: -XX:MaxMetaspaceSize=512m
# - Investigate class leaks: -XX:+TraceClassLoading -XX:+TraceClassUnloading (JDK 8),
#   or -Xlog:class+load=info,class+unload=info (JDK 9+)
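
The loaded class count can also be tracked from inside the application, for example to log it periodically and spot a steady climb. This is a minimal sketch using the standard ClassLoadingMXBean and MemoryPoolMXBean interfaces; the class name MetaspaceWatch is hypothetical, and matching the pool by the name "Metaspace" is an assumption that holds for HotSpot but may not for other JVMs.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class MetaspaceWatch {

    /** Returns a one-line summary of class loading counters and Metaspace usage. */
    static String snapshot() {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        StringBuilder sb = new StringBuilder(String.format(
                "loaded=%d totalLoaded=%d unloaded=%d",
                cl.getLoadedClassCount(), cl.getTotalLoadedClassCount(), cl.getUnloadedClassCount()));
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();          // null if the pool is no longer valid
            if (usage != null && pool.getName().equals("Metaspace")) {
                // getMax() returns -1 when no limit (-XX:MaxMetaspaceSize) is set
                sb.append(String.format(" metaspaceUsedKB=%d metaspaceMaxBytes=%d",
                        usage.getUsed() / 1024, usage.getMax()));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

A loaded count that keeps rising while the unloaded count stays flat is the pattern to look for when a ClassLoader leak or runaway proxy generation is suspected.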

Thread Deadlock Investigation

Detecting Deadlocks

# Method 1: jstack automatic detection
jstack <pid>
# Output at the end will report:
# Found one Java-level deadlock:
# =============================
# "Thread-1":
#   waiting to lock monitor... (a java.lang.Object),
#   which is held by "Thread-0"
# "Thread-0":
#   waiting to lock monitor... (a java.lang.Object),
#   which is held by "Thread-1"

# Method 2: Arthas
thread -b

# Method 3: JMX
jconsole → Threads → Detect Deadlock
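
The check that jconsole performs over JMX is also available programmatically through ThreadMXBean.findDeadlockedThreads(). The sketch below shows one way to use it, for example from a health check; the class name DeadlockProbe and the demo threads are hypothetical, and the demo threads are daemons so that the JVM can still exit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

final class DeadlockProbe {

    /** Returns the threads currently stuck in a lock cycle, or an empty array if there are none. */
    static ThreadInfo[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();          // null when no deadlock exists
        return ids == null ? new ThreadInfo[0] : mx.getThreadInfo(ids, true, true);
    }

    /** Demo only: starts two daemon threads that take two locks in opposite order. */
    static void startDeadlockedPair() {
        Object lockA = new Object();
        Object lockB = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startDaemon("demo-1", lockA, lockB, bothHoldFirstLock);
        startDaemon("demo-2", lockB, lockA, bothHoldFirstLock);
    }

    private static void startDaemon(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await();                        // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this lock and waits for ours
                }
            }
        }, name);
        t.setDaemon(true);                                // lets the JVM exit despite the deadlock
        t.start();
    }

    public static void main(String[] args) throws InterruptedException {
        startDeadlockedPair();
        Thread.sleep(500);                                // give both threads time to block
        for (ThreadInfo info : findDeadlocks()) {
            System.out.printf("%s waits for %s held by %s%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}
```

Like jstack, this only detects threads blocked in a cycle on monitors or ownable synchronizers; threads that spin or retry without blocking are not reported.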

Deadlock Types

1. Classic Deadlock (two threads waiting for each other)

// Thread 1
synchronized(lockA) {
    synchronized(lockB) { /* ... */ }
}

// Thread 2
synchronized(lockB) {
    synchronized(lockA) { /* ... */ }
}

2. Livelock (threads keep retrying but never make progress; jstack does not report it as a deadlock)

while (true) {
    if (tryLock()) {
        // Operation fails, release lock, retry immediately
        unlock();
    }
}

3. Database Deadlock

-- View deadlocks
SHOW ENGINE INNODB STATUS;

Preventing Deadlocks

Acquire locks in a fixed, consistent order
Use tryLock(timeout) instead of lock()
Reduce lock granularity
Use java.util.concurrent utility classes instead of explicit locking
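
The tryLock(timeout) advice can be sketched as follows. This is an illustrative helper under stated assumptions, not a general-purpose utility: the class name TimedLocking and the method withBothLocks are hypothetical, and the caller is expected to decide what to do (retry later, fail the request) when false is returned.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

final class TimedLocking {

    /**
     * Runs action while holding both locks. Returns false if either lock cannot be
     * acquired within the timeout; in that case no lock remains held.
     */
    static boolean withBothLocks(Lock first, Lock second, long timeoutMillis, Runnable action)
            throws InterruptedException {
        if (!first.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            if (!second.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false;                  // give up instead of waiting forever
            }
            try {
                action.run();
                return true;
            } finally {
                second.unlock();
            }
        } finally {
            first.unlock();                    // always released, including on timeout
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Lock lockA = new ReentrantLock();
        Lock lockB = new ReentrantLock();
        boolean done = withBothLocks(lockA, lockB, 200, () -> System.out.println("critical section"));
        System.out.println("completed: " + done);
    }
}
```

Because a thread that times out releases its first lock, two threads that take the locks in opposite order can no longer block each other forever; if both retry immediately, however, they can still livelock, so a randomized back-off between attempts is advisable.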

Class Loading Conflict Investigation

Common Errors

Error	Cause
`ClassNotFoundException`	Class not found on the classpath when loaded by name (Class.forName, loadClass)
`NoClassDefFoundError`	Class was present at compile time but is missing at runtime, or its static initializer failed
`ClassCastException`	Same-named class loaded by different ClassLoaders
`LinkageError`	Class version mismatch

Investigation Methods

# View class loading source
jcmd <pid> VM.system_properties | grep java.class.path

# Find classes with Arthas
sc -d com.example.MyClass
# classLoaderHash  classLoaderName  loading source

# Find all ClassLoaders
classloader -t

# View which jar a class is loaded from
classloader -c <hash> -r com/example/MyClass.class
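
When Arthas is not installed, the same question (which loader and which jar a class came from) can be answered with a few lines of code placed in the application or run in a debugger. This is a minimal sketch; the class name ClassOrigin is hypothetical. Classes loaded by the bootstrap loader report a null ClassLoader and have no code source.

```java
import java.security.CodeSource;

final class ClassOrigin {

    /** Describes which ClassLoader defined the class and where its bytes came from. */
    static String describe(Class<?> type) {
        ClassLoader loader = type.getClassLoader();       // null means the bootstrap loader
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "<no code source>"
                : source.getLocation().toString();
        return type.getName()
                + " loader=" + (loader == null ? "bootstrap" : loader.toString())
                + " location=" + location;
    }

    public static void main(String[] args) {
        System.out.println(describe(String.class));
        System.out.println(describe(ClassOrigin.class));
    }
}
```

Calling describe on two objects that fail a cast with ClassCastException despite having the same class name usually shows two different loaders or two different jar locations.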

Maven Dependency Conflicts

# View dependency tree
mvn dependency:tree -Dverbose | grep conflicting-lib

<!-- Exclude the conflicting dependency (placed inside the <dependency> that pulls it in) -->
<exclusions>
    <exclusion>
        <groupId>com.example</groupId>
        <artifactId>conflicting-lib</artifactId>
    </exclusion>
</exclusions>

Emergency Handling Procedures

Application Unresponsive

# 1. Check if process is alive
ps aux | grep java

# 2. Check CPU usage
top -Hp <pid>

# 3. Export thread dumps (3 consecutive times, 5 seconds apart)
jstack <pid> > dump1.txt
sleep 5
jstack <pid> > dump2.txt
sleep 5
jstack <pid> > dump3.txt

# 4. Export heap dump (optional; the JVM is paused while the dump is written)
jcmd <pid> GC.heap_dump filename=emergency_dump.hprof

# 5. If jstack doesn't work, use forced mode (JDK 8; on newer JDKs use: jhsdb jstack --pid <pid>)
jstack -F <pid> > forced_dump.txt

# 6. If completely unresponsive, generate core dump
gcore <pid>

Frequent Full GC Emergency

# 1. Confirm Full GC frequency
jstat -gcutil <pid> 1000 10

# 2. Export heap histogram
jcmd <pid> GC.class_histogram > histo.txt

# 3. If memory leak confirmed, export dump
jcmd <pid> GC.heap_dump filename=fullgc_dump.hprof

# 4. Temporary mitigation: increase heap + switch collector
# Modify startup parameters and restart
java -Xmx4g -XX:+UseG1GC -jar app.jar
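
The GC frequency from step 1 can also be read inside the application through the standard GarbageCollectorMXBean interface, which is convenient for exporting to a monitoring system. A minimal sketch follows; the class name GcCounters is hypothetical, and the collector names it prints depend on which collector the JVM is running.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcCounters {

    /** Returns one line per collector with its cumulative collection count and time. */
    static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are totals since JVM start; -1 means the collector does not report them
            sb.append(String.format("%s: count=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Two samples one second apart; the difference in count is the GC frequency
        System.out.print(snapshot());
        Thread.sleep(1000);
        System.out.print(snapshot());
    }
}
```

As with jstat -gcutil, a single reading says little; the useful signal is how fast the old-generation collector's count and time grow between samples.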

Summary

This chapter covered troubleshooting methods for common JVM failures: locating hot threads with top and jstack when CPU usage is high; analyzing heap dumps with MAT to find memory leaks; investigating OOM errors according to the memory area named in the message; and detecting deadlocks with jstack or Arthas. These methods help identify and resolve JVM issues in production quickly. The next chapter will systematically cover JVM tuning parameters.
