Diagnostics & Troubleshooting

Troubleshooting Methodology

JVM troubleshooting follows a cycle of “Monitor → Locate → Analyze → Resolve → Verify”:

Detect anomaly (monitoring alert / user report)
    │
    ▼
Locate problem (tool-based investigation)
    │
    ▼
Analyze cause (log/dump analysis)
    │
    ▼
Implement fix (parameter adjustment / code fix)
    │
    ▼
Verify results (compare against baseline)

High CPU Usage Troubleshooting

Investigation Steps

1. top — Find the Java process with high CPU usage
2. top -Hp <pid> — Find the high-CPU thread
3. printf '%x\n' <tid> — Convert to hexadecimal
4. jstack <pid> | grep <hex_tid> — View thread stack
5. Analyze hotspot methods

Detailed Procedure

# 1. Find the Java process
top -c
# PID  USER   %CPU  %MEM  COMMAND
# 12345 app    98.2  15.3  java -jar app.jar

# 2. Find the high-CPU thread
top -Hp 12345
# PID    USER   %CPU  COMMAND
# 12350  app    95.1  java -jar app.jar

# 3. Convert thread ID to hexadecimal
printf '%x\n' 12350
# 303e

# 4. View the thread's stack (jstack prints the native thread ID in hex as nid=0x...)
jstack 12345 | grep -A 30 "nid=0x303e"
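
When a thread dump has already been saved to a file, the same lookup can be done offline. The following is a minimal sketch, assuming the dump uses the usual jstack layout (a quoted thread name on the header line, a blank line after each entry); the class name NidLookup and the sample dump text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NidLookup {
    /** Formats a decimal native thread ID the way jstack prints it, e.g. 12350 -> "nid=0x303e". */
    static String toNid(long tid) {
        return "nid=0x" + Long.toHexString(tid);
    }

    /** Returns the lines of the thread entry whose header carries the given nid, or an empty list. */
    static List<String> findThread(String dump, long tid) {
        String nid = toNid(tid);
        List<String> block = new ArrayList<>();
        boolean inBlock = false;
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"")) {
                if (inBlock) break;
                // Match the whole token so that 0x303e does not also match 0x303ef
                inBlock = (line + " ").contains(nid + " ");
            } else if (line.trim().isEmpty() && inBlock) {
                break; // a blank line ends the entry
            }
            if (inBlock) block.add(line);
        }
        return block;
    }

    public static void main(String[] args) {
        // Hypothetical two-thread dump, shortened for illustration
        String dump = "\"worker\" #12 prio=5 nid=0x303e runnable\n"
                + "   java.lang.Thread.State: RUNNABLE\n"
                + "\tat com.example.Service.loop(Service.java:42)\n"
                + "\n"
                + "\"other\" #13 prio=5 nid=0x1f waiting\n";
        for (String line : findThread(dump, 12350)) {
            System.out.println(line);
        }
    }
}
```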

Common Causes

Cause                    | Stack Characteristic                          | Solution
─────────────────────────|───────────────────────────────────────────────|────────────────────────────────────────────
Infinite loop            | Method stack repeatedly calls the same method | Fix code logic
Regex backtracking       | java.util.regex.Pattern                       | Use possessive quantifiers or atomic groups
Frequent GC              | VM Thread or GC threads have high CPU         | Increase heap / optimize GC
Cryptographic operations | sun.security.*                                | Use caching / async
Serialization            | java.io.ObjectOutputStream                    | Optimize serialization strategy
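
To illustrate the regex row, here is a small sketch contrasting a nested greedy quantifier with its possessive form. The patterns and the class name PossessiveRegex are illustrative assumptions, not taken from any real application. The greedy pattern is deliberately never run against the long non-matching input, because that is exactly the case that can pin a CPU.

```java
import java.util.regex.Pattern;

public class PossessiveRegex {
    // Nested greedy quantifiers: on a non-matching input the engine may try
    // a very large number of ways to split the run of 'a' characters.
    static final Pattern GREEDY = Pattern.compile("(a+)+b");

    // Possessive inner quantifier (a++): once the run of 'a' is consumed it is
    // never given back, so a failed match is reported without that search.
    static final Pattern POSSESSIVE = Pattern.compile("(a++)+b");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    /** Builds a string of n 'a' characters followed by the given suffix. */
    static String runOfA(int n, String suffix) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append('a');
        return sb.append(suffix).toString();
    }

    public static void main(String[] args) {
        // Both forms accept the same matching input
        System.out.println(matches(GREEDY, "aaaab"));
        System.out.println(matches(POSSESSIVE, "aaaab"));
        // Only the possessive form is tried on the long non-matching input
        System.out.println(matches(POSSESSIVE, runOfA(1000, "c")));
    }
}
```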

Quick Investigation with Arthas

# Directly find the top 3 threads by CPU usage
thread -n 3

# View specific thread stack
thread <thread-id>

# Trace method execution time
trace com.example.Service method
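
If Arthas cannot be attached, a rough in-process approximation of the "top threads" view is available through the standard ThreadMXBean. This is only a sketch under assumptions: it ranks threads by cumulative CPU time since they started rather than by usage over a sampling interval, and the class name TopCpuThreads is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopCpuThreads {
    /** Thread name paired with its cumulative CPU time in nanoseconds. */
    static final class Usage {
        final String name;
        final long cpuNanos;

        Usage(String name, long cpuNanos) {
            this.name = name;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Returns up to n live threads, ordered by cumulative CPU time, highest first. */
    static List<Usage> top(int n) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Usage> all = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return all; // CPU time measurement is not available on this JVM
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);      // -1 if the thread is no longer alive
            ThreadInfo info = mx.getThreadInfo(id);  // null if the thread is no longer alive
            if (cpu >= 0 && info != null) {
                all.add(new Usage(info.getThreadName(), cpu));
            }
        }
        all.sort(Comparator.comparingLong((Usage u) -> u.cpuNanos).reversed());
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        for (Usage u : top(3)) {
            System.out.println(u.name + " " + (u.cpuNanos / 1_000_000) + " ms");
        }
    }
}
```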

Memory Leak Investigation

Investigation Steps

1. Confirm memory is continuously growing (jstat/jmap/JFR)
2. Export heap dump (jmap -dump or -XX:+HeapDumpOnOutOfMemoryError)
3. Analyze dump with MAT
4. Find the leaking object and GC Roots reference chain
5. Fix the code
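
Step 2 can also be triggered from inside the application, for example from an admin endpoint. The sketch below assumes a HotSpot-based JVM, where com.sun.management.HotSpotDiagnosticMXBean is available; the class name HeapDumper and the file naming scheme are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HeapDumper {
    /**
     * Writes a heap dump into dir and returns the file path.
     * liveOnly = true dumps only reachable objects, similar to jmap -dump:live.
     */
    static Path dump(Path dir, boolean liveOnly) {
        HotSpotDiagnosticMXBean mx =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // The target file must not exist yet; the .hprof suffix is what MAT expects
        Path file = dir.resolve("heap-" + System.nanoTime() + ".hprof");
        try {
            mx.dumpHeap(file.toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"));
        Path file = dump(dir, true);
        System.out.println(file + " (" + file.toFile().length() + " bytes)");
    }
}
```

As with jmap, dumping pauses the application while the file is written, so this is best reserved for instances that are already misbehaving.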

Analyzing Heap Dumps with MAT

MAT (Memory Analyzer Tool) is a heap analysis tool provided by Eclipse.

1. Leak Suspects Report

After opening the dump, MAT automatically generates a leak suspects report identifying potential leak points:

Problem Suspect 1:
  The class java.util.concurrent.ConcurrentHashMap$Node occupies 256 MB (50%) of the heap.
  The class is loaded by org.apache.catalina.loader.WebappClassLoader.
  Keywords: java.util.concurrent.ConcurrentHashMap$Node

Problem Suspect 2:
  Thread main keeps local variables with total size 128 MB (25%).

2. Dominator Tree

View the object dominator tree, sorted by Retained Size:

Class                                    | Shallow Heap | Retained Heap
─────────────────────────────────────────|──────────────|──────────────
java.util.concurrent.ConcurrentHashMap   |        48 B  |      256 MB
  └─ java.util.concurrent.ConcurrentHashMap$Node[] | 1 MB |  255 MB
       └─ [0] ConcurrentHashMap$Node      |       32 B  |     10 MB
            └─ value: byte[10485760]      |   10 MB     |     10 MB
       └─ [1] ConcurrentHashMap$Node      |       32 B  |     20 MB
       └─ ...
  • Shallow Heap (shallow size): Memory occupied by the object itself, excluding the objects it references
  • Retained Heap (retained size): Total memory that would be freed if the object were garbage collected (including all objects it dominates)

3. GC Roots Reference Chain

Right-click object → Path To GC Roots → exclude weak/soft references:

Thread main
  └─ local variable: ConcurrentHashMap
       └─ [47] ConcurrentHashMap$Node
            └─ value: byte[10485760]   ← Leaking object

Common Memory Leak Scenarios

Scenario               | Cause                                                   | Detection Method
───────────────────────|─────────────────────────────────────────────────────────|────────────────────────────────────
Uncleared collections  | Map/List keeps adding without removing                  | MAT: find collection reference chain
ThreadLocal leak       | Thread pool reuse causes ThreadLocal not to be released | Check ThreadLocalMap
Unregistered listeners | Registered but never unregistered                       | Search addListener/removeListener
Unclosed resources     | Connections/streams not close()’d                       | Related objects in Finalizer queue
Unbounded cache        | Custom cache with no size limit                         | Check cache-related object count
Inner class references | Non-static inner class holds outer class reference      | MAT: view outer class references
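
For the unbounded cache scenario, one common remedy is to cap the entry count. The following is a minimal sketch built on LinkedHashMap, assuming least-recently-used eviction is acceptable; the class name BoundedCache is hypothetical, and the map is not thread-safe on its own (wrap it with Collections.synchronizedMap or use a caching library in concurrent code).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the eldest entry
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");       // "a" becomes the most recently used entry
        cache.put("c", "3");  // exceeds the limit, so "b" is evicted
        System.out.println(cache.keySet());
    }
}
```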

ThreadLocal Leak Example

// Wrong: ThreadLocal not cleaned in thread pool
ExecutorService pool = Executors.newFixedThreadPool(10);
pool.execute(() -> {
    threadLocal.set(largeObject);
    // Forgot remove! Thread is reused, largeObject cannot be collected
});

// Correct: Clean up in finally block
pool.execute(() -> {
    try {
        threadLocal.set(largeObject);
        // Business logic
    } finally {
        threadLocal.remove();  // Must clean up
    }
});
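
The effect of a missing remove() can be observed directly: a value set by one task is still visible to the next task that runs on the same pooled thread. The sketch below is a self-contained demonstration under that assumption, using a single-thread pool so that both tasks are guaranteed to share a thread; the class and field names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalReuse {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    /** Runs two tasks on one pooled thread and returns what the second task reads. */
    static String secondTaskSees(boolean cleanUp) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Runnable first = () -> {
                try {
                    CONTEXT.set("request-1 data");
                    // Business logic would run here
                } finally {
                    if (cleanUp) {
                        CONTEXT.remove();
                    }
                }
            };
            Callable<String> second = CONTEXT::get;
            pool.submit(first).get();
            return pool.submit(second).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without remove(): " + secondTaskSees(false));
        System.out.println("with remove():    " + secondTaskSees(true));
    }
}
```

Besides holding memory, the stale value is also a correctness risk, since the second task may act on data that belongs to an earlier request.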

OOM Failure Handling

OOM Types and Investigation Paths

OOM Message                           | Area                      | Investigation Direction
──────────────────────────────────────|───────────────────────────|──────────────────────────────────────────────────────────────
Java heap space                       | Heap                      | Memory leak or insufficient heap
Metaspace                             | Metaspace                 | Too many classes loaded (dynamic proxies/CGLIB)
GC overhead limit exceeded            | Heap                      | Poor GC reclaim efficiency (98% time in GC, reclaiming < 2%)
Direct buffer memory                  | Direct memory             | NIO off-heap memory overflow
unable to create new native thread    | Native memory / OS limits | Too many threads, or a per-process or per-user thread limit reached
Requested array size exceeds VM limit | Heap                      | Array allocation too large

Heap OOM Investigation Flow

OOM: Java heap space
    │
    ├── Sudden OOM?
    │   ├── Yes → Check for bulk data loading
    │   │        → Increase -Xmx
    │   └── No (gradual growth) → Memory leak
    │
    ▼
Export heap dump for analysis
    │
    ├── Abnormal object count → Check code creating those objects
    │
    ├── Oversized collections → Check collection operation logic
    │
    └── Too many Classes/ClassLoaders → Check dynamic class generation

Metaspace OOM Investigation

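# View loaded class count
jstat -class <pid>
# View per-ClassLoader class counts (GC.class_stats is no longer available in recent JDKs)
jcmd <pid> VM.classloader_stats | head -20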

# View class histogram
jcmd <pid> GC.class_histogram | head -30

# Common causes
# 1. CGLIB/Spring AOP generating large numbers of proxy classes
# 2. JSPs frequently recompiled
# 3. Groovy/Scala scripts repeatedly compiled
# 4. ClassLoader leaks (web app hot deployment)

# Solutions
# - Increase metaspace: -XX:MaxMetaspaceSize=512m
# - Investigate class leaks (JDK 9+): -Xlog:class+load=info,class+unload=info
#   On JDK 8: -XX:+TraceClassLoading -XX:+TraceClassUnloading
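
The same figures can be read from inside the JVM, which is convenient for exporting them to a monitoring system. This is a sketch that assumes a HotSpot JVM, where the relevant memory pool is named "Metaspace"; other JVMs may use a different pool name, in which case the method returns -1. The class name MetaspaceStats is hypothetical.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceStats {
    /** Number of classes currently loaded. */
    static int loadedClasses() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Used bytes of the pool named "Metaspace" (HotSpot naming), or -1 if no such pool exists. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1 : usage.getUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.println("currently loaded: " + classes.getLoadedClassCount());
        System.out.println("total loaded:     " + classes.getTotalLoadedClassCount());
        System.out.println("unloaded:         " + classes.getUnloadedClassCount());
        System.out.println("metaspace used:   " + metaspaceUsedBytes() + " bytes");
    }
}
```

A loaded-class count that keeps climbing while the unloaded count stays flat is consistent with the ClassLoader leak and repeated-compilation causes listed above.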

Thread Deadlock Investigation

Detecting Deadlocks

# Method 1: jstack automatic detection
jstack <pid>
# Output at the end will report:
# Found one Java-level deadlock:
# =============================
# "Thread-1":
#   waiting to lock monitor... (a java.lang.Object),
#   which is held by "Thread-0"
# "Thread-0":
#   waiting to lock monitor... (a java.lang.Object),
#   which is held by "Thread-1"

# Method 2: Arthas
thread -b

# Method 3: JMX
jconsole → Threads → Detect Deadlock
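
The JMX check behind Method 3 is also available programmatically through ThreadMXBean.findDeadlockedThreads(), which makes it possible to run deadlock detection from a health check. The sketch below creates a deliberate deadlock between two daemon threads and then detects it; the class, method, and thread names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class DeadlockDetector {
    /** Names of the threads currently in a deadlock cycle (empty if there is none). */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
        List<String> names = new ArrayList<>();
        if (ids == null) return names;
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) names.add(info.getThreadName());
        }
        return names;
    }

    /** Polls until a deadlock is reported or the timeout expires. */
    static List<String> awaitDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> names = deadlockedThreadNames();
        while (names.isEmpty() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            names = deadlockedThreadNames();
        }
        return names;
    }

    /** Starts two daemon threads that take the same two monitors in opposite order. */
    static void startDeadlock() {
        Object lockA = new Object();
        Object lockB = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        startDaemon("demo-1", lockA, lockB, bothHoldFirst);
        startDaemon("demo-2", lockB, lockA, bothHoldFirst);
    }

    private static void startDaemon(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this monitor
                }
            }
        }, name);
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    public static void main(String[] args) {
        startDeadlock();
        System.out.println(awaitDeadlock(5000));
    }
}
```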

Deadlock Types

1. Classic Deadlock (two threads waiting for each other)

// Thread 1
synchronized(lockA) {
    synchronized(lockB) { /* ... */ }
}

// Thread 2
synchronized(lockB) {
    synchronized(lockA) { /* ... */ }
}

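2. Livelock (threads keep retrying but never make progress)

// Pseudocode: no thread is ever blocked, so jstack reports no deadlock
while (true) {
    if (tryLock()) {
        // Operation fails because another thread holds what this one needs;
        // release the lock and retry immediately, with no backoff
        unlock();
    }
}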

3. Database Deadlock (MySQL/InnoDB example)

-- View the most recent deadlock (LATEST DETECTED DEADLOCK section of the output)
SHOW ENGINE INNODB STATUS;

Preventing Deadlocks

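  • Acquire locks in a fixed, globally consistent order
  • Use tryLock(timeout) instead of lock(), and back off when the lock is not acquired
  • Narrow lock scope: hold locks briefly and avoid acquiring one lock while holding another
  • Use concurrent utility classes (java.util.concurrent) instead of explicit locking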
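
The first two points can be combined in a small helper. This is a sketch under assumptions rather than a complete solution: the class name TimedLocking and the method withBoth are hypothetical, and callers are still expected to pass the locks in a consistent order.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLocking {
    /**
     * Runs the action while holding both locks. Returns false, holding nothing,
     * if either lock cannot be acquired within timeoutMillis or the thread is interrupted.
     */
    static boolean withBoth(Lock first, Lock second, long timeoutMillis, Runnable action) {
        try {
            if (!first.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false;
            }
            try {
                if (!second.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    return false; // give up instead of waiting forever
                }
                try {
                    action.run();
                    return true;
                } finally {
                    second.unlock();
                }
            } finally {
                first.unlock();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        Lock lockA = new ReentrantLock();
        Lock lockB = new ReentrantLock();
        boolean done = withBoth(lockA, lockB, 200, () -> System.out.println("holding both locks"));
        System.out.println("completed: " + done);
    }
}
```

When the helper returns false, the caller should wait for a short, preferably randomized, interval before retrying; retrying immediately can turn a would-be deadlock into the livelock described earlier.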

Class Loading Conflict Investigation

Common Errors

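Error                  | Cause
───────────────────────|──────────────────────────────────────────────────────────────────────────────────────
ClassNotFoundException | Class not found on classpath when loaded by name (Class.forName, loadClass)
NoClassDefFoundError   | Class existed at compile time but not at runtime, or its static initialization failed
ClassCastException     | Same-named class loaded by different ClassLoaders
LinkageError           | Class version mismatch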

Investigation Methods

# View the classpath the JVM was started with
jcmd <pid> VM.system_properties | grep java.class.path

# Find classes with Arthas
sc -d com.example.MyClass
# classLoaderHash  classLoaderName  loading source

# Find all ClassLoaders
classloader -t

# View which jar a class is loaded from
classloader -c <hash> -r com/example/MyClass.class
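
Without Arthas, the defining ClassLoader and the code location of a class can be logged from the application itself. A minimal sketch follows, assuming the class of interest is reachable from the code doing the logging; the class name ClassOrigin is hypothetical. Classes loaded by the bootstrap loader report a null ClassLoader and usually have no code source.

```java
import java.security.CodeSource;

public class ClassOrigin {
    /** Describes which ClassLoader defined the class and where its bytes came from. */
    static String describe(Class<?> type) {
        ClassLoader loader = type.getClassLoader(); // null for bootstrap classes
        CodeSource source = type.getProtectionDomain().getCodeSource(); // may be null
        String from = (source == null || source.getLocation() == null)
                ? "<no code source>"
                : source.getLocation().toString();
        return type.getName()
                + " loaded by " + (loader == null ? "bootstrap" : loader.toString())
                + " from " + from;
    }

    public static void main(String[] args) {
        System.out.println(describe(String.class));
        System.out.println(describe(ClassOrigin.class));
    }
}
```

Logging this for both sides of a failing cast shows whether the two same-named classes come from different jars or different ClassLoaders.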

Maven Dependency Conflicts

# View dependency tree
mvn dependency:tree -Dverbose | grep conflicting-lib

# Exclude conflicting dependency (pom.xml, inside the <dependency> that pulls it in)
<exclusions>
    <exclusion>
        <groupId>com.example</groupId>
        <artifactId>conflicting-lib</artifactId>
    </exclusion>
</exclusions>

Emergency Handling Procedures

Application Unresponsive

# 1. Check if process is alive
ps aux | grep java

# 2. Check CPU usage
top -Hp <pid>

# 3. Export thread dumps (3 consecutive times, 5 seconds apart)
jstack <pid> > dump1.txt
sleep 5
jstack <pid> > dump2.txt
sleep 5
jstack <pid> > dump3.txt

# 4. Export heap dump (optional; the JVM is paused while the dump is written, which can take a long time on a large heap)
jcmd <pid> GC.heap_dump filename=emergency_dump.hprof

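# 5. If jstack doesn't work, use forced mode (JDK 8)
jstack -F <pid> > forced_dump.txt
# On JDK 9+, where jstack no longer accepts -F, use jhsdb instead
jhsdb jstack --pid <pid> > forced_dump.txt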

# 6. If completely unresponsive, generate core dump
gcore <pid>
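
The reason for taking several dumps is to separate threads that are merely busy from threads that are stuck: a thread that sits in the same frame in every dump is a suspect. The sketch below shows that comparison using in-process snapshots from ThreadMXBean. It is illustrative only, since a truly unresponsive application may not be able to run it, and the class name StuckThreads is hypothetical. An unchanged top frame is a hint rather than proof, because idle pool threads also wait in the same place.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StuckThreads {
    /** Maps thread ID to "name @ top frame" for every live thread that has a stack. */
    static Map<Long, String> snapshot() {
        Map<Long, String> top = new HashMap<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length > 0) {
                top.put(info.getThreadId(), info.getThreadName() + " @ " + stack[0]);
            }
        }
        return top;
    }

    /** Keeps only the entries that are identical in every snapshot. */
    static Map<Long, String> unchanged(List<Map<Long, String>> snapshots) {
        Map<Long, String> result = new HashMap<>(snapshots.get(0));
        for (Map<Long, String> s : snapshots) {
            result.entrySet().removeIf(e -> !e.getValue().equals(s.get(e.getKey())));
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Map<Long, String>> snapshots = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            snapshots.add(snapshot());
            Thread.sleep(1000); // use 5000 to mirror the 5-second spacing above
        }
        for (String entry : unchanged(snapshots).values()) {
            System.out.println(entry);
        }
    }
}
```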

Frequent Full GC Emergency

# 1. Confirm Full GC frequency
jstat -gcutil <pid> 1000 10

# 2. Export heap histogram
jcmd <pid> GC.class_histogram > histo.txt

# 3. If memory leak confirmed, export dump
jcmd <pid> GC.heap_dump filename=fullgc_dump.hprof

# 4. Temporary mitigation: increase heap + switch collector
# Modify startup parameters and restart
java -Xmx4g -XX:+UseG1GC -jar app.jar
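
The counters that jstat reports are also exposed through GarbageCollectorMXBean, which allows an application to log or alert on GC activity itself. This is a sketch under assumptions: collector names depend on the JVM and the selected collector, so the code does not try to tell young from full collections, and the class name GcStats is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    /** Total number of collections across all collectors since JVM start. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            if (count > 0) total += count;
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if undefined for this collector
            if (millis > 0) total += millis;
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
        System.out.println("total GC time (ms): " + totalCollectionMillis());
    }
}
```

Sampling these totals periodically and comparing consecutive values gives the same trend that jstat -gcutil shows at a fixed interval.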

Summary

This chapter covered troubleshooting methods for common JVM failures: high CPU usage (top plus jstack to locate hot threads), memory leaks (heap dump analysis with MAT), OOM errors (a separate investigation path for each error type and memory area), and deadlocks (jstack or Arthas). Mastering these methods enables quick identification and resolution of JVM issues in production. The next chapter will systematically cover JVM tuning parameters.
