Introduction: The night the whole ops team pulled an all-nighter
"Brother Fan! The response time of the online service has hit 10 seconds!" At 1 a.m., intern Xiao Li was almost in tears.
On the monitoring screen, the JVM heap memory curve shot up like a rocket: the freshly expanded 16 GB of heap was eaten up within 30 minutes.
I gritted my teeth and slammed the table: "Dig through every line of code that went live in the past week!"
The first pit: a static collection becomes a perpetual motion machine
▌ Crash code (real project snippet)
// Caching users' AI chat history → the wrong way!
public class ChatHistoryCache {
    private static Map<Long, List<String>> cache = new HashMap<>();

    public static void addMessage(Long userId, String msg) {
        cache.computeIfAbsent(userId, k -> new ArrayList<>()).add(msg); // entries are added but never removed
    }
}
▌ The crash scene
- When user traffic surged, cached entries only went in and never came out; memory blew up within 48 hours
- Capturing live instances with Arthas:
  vmtool --action getInstances -c 4614556e
  showed a Map whose size was in the tens of millions
- MAT analysis: HashMap$Node objects accounted for 82% of the heap
▌ The right way
// Replace it with a Guava cache that has an expiration time
private static Cache<Long, List<String>> cache = CacheBuilder.newBuilder()
        .expireAfterAccess(1, TimeUnit.HOURS)   // time unit assumed; the original left it blank
        .maximumSize(10000)
        .build();
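To see the fix end to end, here is a minimal sketch of the whole class on top of that cache (assuming Guava is on the classpath; the 1-hour window and the ExecutionException handling are illustrative choices):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class ChatHistoryCache {
    // Double insurance: expiry after last access plus a hard cap on entries
    private static final Cache<Long, List<String>> cache = CacheBuilder.newBuilder()
            .expireAfterAccess(1, TimeUnit.HOURS)
            .maximumSize(10000)
            .build();

    public static void addMessage(Long userId, String msg) {
        try {
            // get(key, loader) creates the list lazily; stale users are evicted automatically
            // (for concurrent writers, wrap the list with Collections.synchronizedList)
            cache.get(userId, () -> new ArrayList<>()).add(msg);
        } catch (ExecutionException e) {
            throw new IllegalStateException("failed to initialize chat history", e);
        }
    }
}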
The second pit: a lambda that forgets to close the file stream
▌ Fatal code (processing AI model files)
// Loading local model files → the wrong way!
public void loadModels(List<File> files) {
    files.forEach(file -> {
        try {
            InputStream is = new FileInputStream(file); // never closed!
            parseModel(is);
        } catch (IOException e) { /*...*/ }
    });
}
▌ Strange symptoms
- After running for three days, the service suddenly reported "Too many open files"
- Linux troubleshooting:
  lsof -p <PID> | grep deleted
  revealed a large number of unreleased file handles
- JVM monitoring:
  jcmd <PID> VM.native_memory
  the number of file descriptors had passed 10,000
▌ Rescue plan
// The right way: try-with-resources closes the stream automatically
files.forEach(file -> {
    try (InputStream is = new FileInputStream(file)) { // stream is closed automatically
        parseModel(is);
    } catch (IOException e) { /*...*/ }
});
The third pit: a Spring event listener turns into a squatter
▌ Pitfall code (message notification module)
// Listening for the AI-processing-complete event → the wrong way!
@Component
public class NotifyService {
    @EventListener
    public void handleAiEvent(AICompleteEvent event) {
        // Wrong: the method reference hands the external service a strong reference to this bean
        // ("externalService" and its register() method are assumed from context)
        externalService.register(this::sendNotification);
    }
}
▌ Memory curve
- Each time the event fired, the listener object was strongly referenced by the external service and never released
- MAT analysis: the number of NotifyService instances grew linearly over time
- GC logs: old-generation occupancy climbed by 5% per week
▌ A trick to avoid pitfalls
// Decouple with a weak reference
public void handleAiEvent(AICompleteEvent event) {
    WeakReference<NotifyService> weakRef = new WeakReference<>(this);
    // the callback registered with the external service only holds the weak reference
    externalService.register(() -> {
        NotifyService service = weakRef.get();
        if (service != null) service.sendNotification(event);
    });
}
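Pulled together, a self-contained sketch of the listener with this decoupling; the ExternalNotifyGateway type, its register(Runnable) method, and sendNotification() are assumptions filled in from context:

import java.lang.ref.WeakReference;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component
public class NotifyService {

    // Hypothetical long-lived gateway that keeps every callback registered with it
    private final ExternalNotifyGateway externalService;

    public NotifyService(ExternalNotifyGateway externalService) {
        this.externalService = externalService;
    }

    @EventListener
    public void handleAiEvent(AICompleteEvent event) {
        // The callback captures only a weak reference, so the gateway can no longer pin this bean
        WeakReference<NotifyService> weakRef = new WeakReference<>(this);
        externalService.register(() -> {
            NotifyService service = weakRef.get();
            if (service != null) {
                service.sendNotification(event);
            }
        });
    }

    private void sendNotification(AICompleteEvent event) {
        // ... actual notification logic ...
    }
}

// Minimal shape of the assumed gateway
interface ExternalNotifyGateway {
    void register(Runnable callback);
}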
The fourth pit: zombie tasks in the thread pool
▌ Problem code (asynchronous processing of AI requests)
// Async thread pool configuration → the wrong way!
@Bean
public Executor asyncExecutor() {
    return new ThreadPoolExecutor(10, 10,
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>()); // unbounded queue!
}
▌ Disaster scene
- When requests burst, the queue piled up 500,000 tasks, each one holding an AI response object
- Heap dump: byte[] accounted for 90% of memory, all of it pending response data
- Monitoring: the queue_size metric stayed high and never came down
▌ Correct configuration
// Set a queue upper limit plus a rejection policy
new ThreadPoolExecutor(10, 50,
        60L, TimeUnit.SECONDS,
        new ArrayBlockingQueue<>(1000),
        new ThreadPoolExecutor.CallerRunsPolicy()); // the original elided the policy; CallerRunsPolicy shown as one reasonable choice
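Wired back into the original @Bean, the corrected configuration might look like this (pool sizes and the back-pressure policy are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AsyncConfig {

    @Bean
    public Executor asyncExecutor() {
        return new ThreadPoolExecutor(
                10, 50,                                     // core / max threads
                60L, TimeUnit.SECONDS,                      // idle threads above core are reclaimed after 60s
                new ArrayBlockingQueue<>(1000),             // bounded queue: the backlog can no longer grow unchecked
                new ThreadPoolExecutor.CallerRunsPolicy()); // overflow runs on the caller thread, applying back-pressure
    }
}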
The fifth pit: the ghost in the MyBatis connection pool
▌ Fatal code (query user conversation history)
public List<ChatRecord> getHistory(Long userId) {
    SqlSession session = sqlSessionFactory.openSession();
    try {
        return session.selectList("queryHistory", userId);
    } finally {
        // session.close() forgotten → the connection pool is gradually exhausted
    }
}
▌ Evidence of the leak
- The Druid monitoring panel showed the number of active connections stuck at the maximum
- Log error: Cannot get connection from pool, timeout 30000ms
- Heap analysis: the number of SqlSession instances grew abnormally
▌ The right way
// Use try-with-resources to close the session automatically
try (SqlSession session = sqlSessionFactory.openSession()) {
    return session.selectList("queryHistory", userId);
}
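If the project already uses mybatis-spring, an even safer option is to let the framework own the session entirely; a sketch with an assumed mapper interface (table and column names are illustrative):

import java.util.List;
import org.apache.ibatis.annotations.Mapper;
import org.apache.ibatis.annotations.Select;

@Mapper
public interface ChatRecordMapper {
    // mybatis-spring opens and closes the SqlSession around every call,
    // so there is no handle left for application code to forget to close
    @Select("SELECT * FROM chat_record WHERE user_id = #{userId}")
    List<ChatRecord> queryHistory(Long userId);
}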
The sixth pit: the gentle trap of a third-party library
▌ Problem code (caching user preferences)
// Incorrect configuration when using Ehcache (2.x API)
CacheConfiguration config = new CacheConfiguration();
config.setName("user_prefs");
config.setMaxEntriesLocalHeap(10000); // only the entry count is capped; no expiration time is set!
▌ Memory Symptoms
- GC logs showed the old generation growing 3% per week
- Arthas monitoring:
  watch <class-pattern> getCachedUser
  returned objects had been alive for more than 7 days
- Load testing triggered an OOM, and the heap was full of UserPreference objects
▌ Correct configuration
config.setTimeToLiveSeconds(3600);               // entries expire after 1 hour
config.setDiskExpiryThreadIntervalSeconds(60);   // interval of the expiry check thread
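The snippet above targets the Ehcache 2.x API; for projects on Ehcache 3, the same double insurance is expressed through the builder API. A sketch, assuming Ehcache 3 is on the classpath:

import java.time.Duration;
import org.ehcache.config.CacheConfiguration;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.ExpiryPolicyBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;

public class UserPrefCacheConfig {
    static CacheConfiguration<Long, UserPreference> userPrefs() {
        return CacheConfigurationBuilder
                .newCacheConfigurationBuilder(Long.class, UserPreference.class,
                        ResourcePoolsBuilder.heap(10000))                                  // capacity limit
                .withExpiry(ExpiryPolicyBuilder.timeToLiveExpiration(Duration.ofHours(1))) // expiration time
                .build();
    }
}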
The seventh pit: ThreadLocal not cleaned up after use
▌ Fatal code (passing user context)
public class UserContextHolder {
    private static final ThreadLocal<User> currentUser = new ThreadLocal<>();

    public static void set(User user) {
        currentUser.set(user);
    }
    // missing a remove() method!
}
▌ Memory exception
- With thread pool reuse, stale user data piles up in each thread's ThreadLocal
- MAT analysis: User objects were held by strong references from ThreadLocalMap and could not be released
- Monitoring: each thread held an average of 50 stale User objects
▌ Repair plan
// It must be cleaned up after use!
public static void remove() {
    currentUser.remove();
}

// Force the cleanup in an aspect
@Around("execution(* *..*.*(..))")
public Object clearContext(ProceedingJoinPoint pjp) throws Throwable {
    try {
        return pjp.proceed();
    } finally {
        UserContextHolder.remove(); // the key step!
    }
}
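For reference, a minimal sketch of the holder class in full; the get() accessor is an assumed addition, since the original only shows set():

public class UserContextHolder {
    private static final ThreadLocal<User> currentUser = new ThreadLocal<>();

    public static void set(User user) {
        currentUser.set(user);
    }

    public static User get() {
        return currentUser.get();
    }

    // Always call this in a finally block once the request is finished,
    // otherwise pooled threads keep the previous user's data alive
    public static void remove() {
        currentUser.remove();
    }
}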
Ultimate troubleshooting toolbox
1. Arthas three-combo
# Real-time monitoring of GC situation
dashboard -n 5 -i 2000
# Track the frequency of suspicious method calls
trace <class-pattern> addCacheEntry -n 10
# Dynamically modify log level (no restart required)
logger --name ROOT --level debug
2. Three tricks for MAT analysis
- Dominator Tree: Expose the memory devourer
- Path to GC Roots: Follow the clues to find the murderer
- OQL black magic (the target class names were elided in the original; substitute your own suspect class):
  SELECT * FROM <suspect-class> WHERE size > 10000
  SELECT toString(msg) FROM <suspect-class> msg WHERE toString(msg) LIKE "%OOM%"
3. On-call firefighting command kit
# Quickly view the heap memory distribution
jhsdb jmap --heap --pid <PID>
# Rank objects by instance count
jmap -histo:live <PID> | head -n 20
# Force trigger Full GC (use with caution!)
jcmd <PID> GC.run
Twelve military rules for leak prevention
- Every cache must carry double insurance: an expiration time plus a capacity limit
- Triple protection for IO operations:
  try (InputStream is = ...) {      // first layer: try-with-resources
      useStream(is);
  } catch (IOException e) {         // second layer: catch and log
      log.error("IO exception", e);
  } finally {                       // third layer: clean up temp files
      cleanupTempFiles();
  }
- Four principles for thread pools:
  - No unbounded queues
  - No unreasonable core pool sizes
  - Never ignore the rejection policy
  - Don't keep large objects inside queued tasks
- Three checks for Spring components:
  - Check event listener reference chains
  - Check collections held inside singleton beans
  - Check the thread pool configuration behind @Async
- Two verifications for third-party libraries:
  - Verify the connection pool's return/close mechanism
  - Verify the cache's default configuration
- Code review focus points:
  - Every collection declared static
  - Every close()/release() call site
  - Every place an inner class holds a reference to its enclosing object
Ops veteran Lao Fan's pit-avoidance diary
2024-03-20, 2 a.m.
"Xiao Wang, do you know why I have so little hair left?
Back then someone stored the user session in a ThreadLocal and never cleaned it up.
When 100,000 users were online at the same time,
that memory leaked faster than a barber's clippers!"
Self-test: can you spot where this code will leak?
// Dangerous code! Find the three leaks
public class ModelLoader {
    private static List<Model> loadedModels = new ArrayList<>();

    public void load(String path) throws IOException {
        // file loading reconstructed; the original calls were garbled
        Model model = new Model(Files.readAllBytes(Paths.get(path)));
        loadedModels.add(model);
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(() -> model.refresh(), 1, 1, TimeUnit.HOURS); // refresh() assumed
    }
}
The answer is revealed:
- The static collection has no cleanup mechanism
- A scheduled executor is created per call and never shut down
- The scheduled task holds a strong reference to the Model and keeps it alive
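For completeness, a minimal sketch of one way to plug all three leaks (Guava is assumed for the bounded cache, and Model.refresh() is the same hypothetical method as above):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ModelLoader {
    // Bounded, expiring cache instead of an ever-growing static list
    private static final Cache<String, Model> loadedModels = CacheBuilder.newBuilder()
            .maximumSize(100)
            .expireAfterAccess(1, TimeUnit.HOURS)
            .build();

    // One shared scheduler for all models, shut down with the application
    private static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void load(String path) throws IOException {
        Model model = new Model(Files.readAllBytes(Paths.get(path))); // file loading as assumed above
        loadedModels.put(path, model);
        // Look the model up by key inside the task so the lambda does not pin the object forever
        // (for long-lived apps, also keep the ScheduledFuture and cancel it when the model is evicted)
        scheduler.scheduleAtFixedRate(() -> {
            Model m = loadedModels.getIfPresent(path);
            if (m != null) m.refresh();                               // refresh() assumed
        }, 1, 1, TimeUnit.HOURS);
    }

    public static void shutdown() {
        scheduler.shutdown(); // call this when the application stops
    }
}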