双十一期间查询服务某节点 OOM 原因分析

雨翔河

首页列表关于

2019-11-13 07:29

> 背景： 因为双十一期间，物流在双十一之后的压力是巨大的，在之前我们的某ES集群已经扩了三个数据节点，今天11.12号发现集群的机械硬盘的机器根本扛不住，
  决定临时就两个压力大的节点进行迁移到ssd机器，迁移是没有问题的，但是迁移完成后检查应用日志发现了一个惊天的秘密。

问题：  查询服务其中一个节点在凌晨1点30分宕机，直到1点35分恢复，挂了足足有5分钟.

查询服务是查询数据库和es的，收拢了大量的查询操作在里面，还有少量的定时任务。

这个服务一直以来都是很稳定的，如果不是查日志根本就没有发现这个问题（运维的监控没有做好）。

随意的看了下当时java进程的jstat信息：

```
2019-11-13 01:32:31 实例名称: query_web_01 实例PID号: 126944 jstat详细信息
 S0C        S1C    S0U    S1U      EC       EU        OC         OU          MC       MU        CCSC   CCSU     YGC     YGCT   FGC    FGCT     GCT   
209664.0 209664.0  0.0   209664.0 1677824.0 1677824.0 2097152.0  2097152.0  130304.0 123811.5 16640.0 15320.4   1674   49.591  21    134.999  184.590
209664.0 209664.0  0.0   209664.0 1677824.0 1677824.0 2097152.0  2097152.0  130304.0 123811.5 16640.0 15320.4   1674   49.591  21    134.999  184.590
209664.0 209664.0  0.0   209664.0 1677824.0 1677824.0 2097152.0  2097152.0  130304.0 123811.5 16640.0 15320.4   1674   49.591  21    134.999  184.590
209664.0 209664.0  0.0   209664.0 1677824.0 1677824.0 2097152.0  2097152.0  130304.0 123811.5 16640.0 15320.4   1674   49.591  21    134.999  184.590
```

这不看不要紧，一看吓一跳，fullGC整整21次，直接吓尿了。老年代被打爆了，这多大数据量扔老年代。

赶紧看下gc日志：

```
2019-11-13T01:26:33.292+0800: 85314.005: [GC (Allocation Failure) 2019-11-13T01:26:33.293+0800: 85314.006: [ParNew: 1682405K->3733K(1887488K), 0.0211158 secs] 2093981K->415336K(3984640K), 0.0221292 secs] [Times: user=0.07 sys=0.00, real=0.02 secs] 
Heap after GC invocations=1671 (full 0):
 par new generation   total 1887488K, used 3733K [0x00000006c0000000, 0x0000000740000000, 0x0000000740000000)
  eden space 1677824K,   0% used [0x00000006c0000000, 0x00000006c0000000, 0x0000000726680000)
  from space 209664K,   1% used [0x0000000733340000, 0x00000007336e56d8, 0x0000000740000000)
  to   space 209664K,   0% used [0x0000000726680000, 0x0000000726680000, 0x0000000733340000)
 concurrent mark-sweep generation total 2097152K, used 411602K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 124006K, capacity 129820K, committed 129920K, reserved 1163264K
  class space    used 15367K, capacity 16490K, committed 16512K, reserved 1048576K
}
{Heap before GC invocations=1671 (full 0):
 par new generation   total 1887488K, used 1681557K [0x00000006c0000000, 0x0000000740000000, 0x0000000740000000)
  eden space 1677824K, 100% used [0x00000006c0000000, 0x0000000726680000, 0x0000000726680000)
  from space 209664K,   1% used [0x0000000733340000, 0x00000007336e56d8, 0x0000000740000000)
  to   space 209664K,   0% used [0x0000000726680000, 0x0000000726680000, 0x0000000733340000)
 concurrent mark-sweep generation total 2097152K, used 411602K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 124205K, capacity 130050K, committed 130304K, reserved 1163264K
  class space    used 15384K, capacity 16523K, committed 16640K, reserved 1048576K
2019-11-13T01:30:01.585+0800: 85522.298: [GC (Allocation Failure) 2019-11-13T01:30:01.586+0800: 85522.299: [ParNew: 1681557K->209664K(1887488K), 2.5644538 secs] 2093160K->830666K(3984640K), 2.5652375 secs] [Times: user=4.41 sys=0.84, real=2.57 secs] 
Heap after GC invocations=1672 (full 0):
 par new generation   total 1887488K, used 209664K [0x00000006c0000000, 0x0000000740000000, 0x0000000740000000)
  eden space 1677824K,   0% used [0x00000006c0000000, 0x00000006c0000000, 0x0000000726680000)
  from space 209664K, 100% used [0x0000000726680000, 0x0000000733340000, 0x0000000733340000)
  to   space 209664K,   0% used [0x0000000733340000, 0x0000000733340000, 0x0000000740000000)
 concurrent mark-sweep generation total 2097152K, used 621002K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 124205K, capacity 130050K, committed 130304K, reserved 1163264K
  class space    used 15384K, capacity 16523K, committed 16640K, reserved 1048576K
}
{Heap before GC invocations=1672 (full 0):
 par new generation   total 1887488K, used 1887488K [0x00000006c0000000, 0x0000000740000000, 0x0000000740000000)
  eden space 1677824K, 100% used [0x00000006c0000000, 0x0000000726680000, 0x0000000726680000)
  from space 209664K, 100% used [0x0000000726680000, 0x0000000733340000, 0x0000000733340000)
  to   space 209664K,   0% used [0x0000000733340000, 0x0000000733340000, 0x0000000740000000)
 concurrent mark-sweep generation total 2097152K, used 621002K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 124205K, capacity 130050K, committed 130304K, reserved 1163264K
  class space    used 15384K, capacity 16523K, committed 16640K, reserved 1048576K
2019-11-13T01:30:07.176+0800: 85527.889: [GC (Allocation Failure) 2019-11-13T01:30:07.176+0800: 85527.889: [ParNew: 1887488K->209664K(1887488K), 3.2933791 secs] 2508490K->2023127K(3984640K), 3.2939436 secs] [Times: user=7.50 sys=1.33, real=3.30 secs] 
Heap after GC invocations=1673 (full 0):
 par new generation   total 1887488K, used 209664K [0x00000006c0000000, 0x0000000740000000, 0x0000000740000000)
  eden space 1677824K,   0% used [0x00000006c0000000, 0x00000006c0000000, 0x0000000726680000)
  from space 209664K, 100% used [0x0000000733340000, 0x0000000740000000, 0x0000000740000000)
  to   space 209664K,   0% used [0x0000000726680000, 0x0000000726680000, 0x0000000733340000)
 concurrent mark-sweep generation total 2097152K, used 1813463K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 124205K, capacity 130050K, committed 130304K, reserved 1163264K
  class space    used 15384K, capacity 16523K, committed 16640K, reserved 1048576K
}
2019-11-13T01:30:10.479+0800: 85531.192: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1813463K(2097152K)] 2041237K(3984640K), 0.1026301 secs] [Times: user=0.20 sys=0.05, real=0.10 secs] 
2019-11-13T01:30:10.582+0800: 85531.295: [CMS-concurrent-mark-start]
2019-11-13T01:30:12.708+0800: 85533.422: [CMS-concurrent-mark: 2.120/2.127 secs] [Times: user=3.69 sys=0.18, real=2.13 secs] 
2019-11-13T01:30:12.709+0800: 85533.422: [CMS-concurrent-preclean-start]
2019-11-13T01:30:12.720+0800: 85533.433: [CMS-concurrent-preclean: 0.011/0.011 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-11-13T01:30:12.720+0800: 85533.433: [CMS-concurrent-abortable-preclean-start]
2019-11-13T01:30:12.720+0800: 85533.433: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2019-11-13T01:30:12.729+0800: 85533.443: [GC (CMS Final Remark) [YG occupancy: 1408088 K (1887488 K)]2019-11-13T01:30:12.730+0800: 85533.443: [Rescan (parallel) , 0.2314041 secs]2019-11-13T01:30:12.961+0800: 85533.674: [weak refs processing, 0.0060704 secs]2019-11-13T01:30:12.967+0800: 85533.680: [class unloading, 1.9673187 secs]2019-11-13T01:30:14.935+0800: 85535.648: [scrub symbol table, 0.0265757 secs]2019-11-13T01:30:14.961+0800: 85535.674: [scrub string table, 0.0025693 secs][1 CMS-remark: 1813463K(2097152K)] 3221551K(3984640K), 2.2559640 secs] [Times: user=1.04 sys=0.01, real=2.25 secs] 
2019-11-13T01:30:14.986+0800: 85535.699: [CMS-concurrent-sweep-start]
{Heap before GC invocations=1673 (full 1):
 par new generation   total 1887488K, used 1887488K [0x00000006c0000000, 0x0000000740000000, 0x0000000740000000)
  eden space 1677824K, 100% used [0x00000006c0000000, 0x0000000726680000, 0x0000000726680000)
  from space 209664K, 100% used [0x0000000733340000, 0x0000000740000000, 0x0000000740000000)
  to   space 209664K,   0% used [0x0000000726680000, 0x0000000726680000, 0x0000000733340000)
 concurrent mark-sweep generation total 2097152K, used 1532169K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 124206K, capacity 130114K, committed 130304K, reserved 1163264K
  class space    used 15384K, capacity 16523K, committed 16640K, reserved 1048576K
2019-11-13T01:30:15.934+0800: 85536.647: [GC (Allocation Failure) 2019-11-13T01:30:15.934+0800: 85536.647: [ParNew: 1887488K->1887488K(1887488K), 0.0000468 secs]2019-11-13T01:30:15.934+0800: 85536.647: [CMS2019-11-13T01:30:16.748+0800: 85537.461: [CMS-concurrent-sweep: 1.686/1.762 secs] [Times: user=2.13 sys=0.18, real=1.76 secs] 
 (concurrent mode failure): 1532169K->2097152K(2097152K), 8.2907813 secs] 3419657K->2894498K(3984640K), [Metaspace: 124206K->124206K(1163264K)], 8.2978841 secs] [Times: user=8.22 sys=0.07, real=8.30 secs] 
Heap after GC invocations=1674 (full 2):
 par new generation   total 1887488K, used 797346K [0x00000006c0000000, 0x0000000740000000, 0x0000000740000000)
  eden space 1677824K,  47% used [0x00000006c0000000, 0x00000006f0aa8ae0, 0x0000000726680000)
  from space 209664K,   0% used [0x0000000733340000, 0x0000000733340000, 0x0000000740000000)
  to   space 209664K,   0% used [0x0000000726680000, 0x0000000726680000, 0x0000000733340000)
 concurrent mark-sweep generation total 2097152K, used 2097152K [0x0000000740000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 123808K, capacity 129436K, committed 130304K, reserved 1163264K
  class space    used 15318K, capacity 16404K, committed 16640K, reserved 1048576K
}
2019-11-13T01:30:26.237+0800: 85546.950: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2097152K(2097152K)] 3829729K(3984640K), 0.5307804 secs] [Times: user=1.01 sys=0.00, real=0.53 secs] 
```

从上面可以看出来，1：30之前一切都是正常的，从1：30开始就出现了对象特别多的情况，直接打爆老年代。

因为老年代不够，fullGC一直在进行，结果进行了也没用，因为量太大，OOM了。

是什么导致了这么多对象突然增加，赶紧去看了下当时的jstack和java进程内CPU和内存占用高线程对比：

```
2019-11-13 01:32:31 实例名称: query_web_01 实例PID号: 126944 top10线程资源消耗情况
   XXX XXX     XX  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                        
127462 xxx     20   0 8153m 4.6g  12m S  1.8 59.8   0:14.87 java              
127656 xxx     20   0 8153m 4.6g  12m S  1.8 59.8   3:45.26 java              
127683 xxx     20   0 8153m 4.6g  12m S  1.8 59.8   5:12.37 java              
127703 xxx     20   0 8153m 4.6g  12m S  1.8 59.8   5:13.03 java              
127827 xxx     20   0 8153m 4.6g  12m S  1.8 59.8   3:26.99 java              
126944 xxx     20   0 8153m 4.6g  12m S  0.0 59.8   0:00.00 java              
126952 xxx     20   0 8153m 4.6g  12m S  0.0 59.8   0:41.23 java              
126956 xxx     20   0 8153m 4.6g  12m S  0.0 59.8   0:38.67 java              
126957 xxx     20   0 8153m 4.6g  12m S  0.0 59.8   0:35.93 java   
```

这内存占用好吓人，将 `127462` 转为十六进制为 `1f1e6` 搜索当时的jstack日志：

```
"DEFAULT.omitPostDataSyncJob_Scheduler_Worker-1" #xxxx prio=5 os_prio=0 tid=xxxxxx nid=0x1f1e6 runnable [0x000xxxx]
   java.lang.Thread.State: RUNNABLE
	at com.mysql.cj.protocol.a.NativePacketPayload.readBytes(NativePacketPayload.java:558)
	at com.mysql.cj.protocol.a.NativePacketPayload.readBytes(NativePacketPayload.java:508)
	at com.mysql.cj.protocol.a.TextRowFactory.createFromMessage(TextRowFactory.java:66)
	at com.mysql.cj.protocol.a.TextRowFactory.createFromMessage(TextRowFactory.java:42)
	at com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:87)
	at com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:42)
	at com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1679)
	at com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:87)
	at com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:48)
	at com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1692)
	at com.mysql.cj.protocol.a.NativeProtocol.readAllResults(NativeProtocol.java:1746)
	at com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:1035)
	at com.mysql.cj.NativeSession.execSQL(NativeSession.java:1185)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:952)
	- locked <0x00xxxxxxx> (a com.mysql.cj.jdbc.ConnectionImpl)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery(ClientPreparedStatement.java:1025)
	- locked <0x00xxxxxxx> (a com.mysql.cj.jdbc.ConnectionImpl)
	at com.xxx.query.service.AbsDBHelper.batchProduce(AbsDBHelper.java:86)
	at com.xxx.query.job.OmitPostDataSyncJob.process(OmitPostDataSyncJob.java:53)
```

这是个定时任务的类，用处是处理前一整天所有的数据进行处理，每天凌晨1：30准时执行，从这里我们可以看到也可以猜测， `setFetchSize` 没有起作用，直接这条sql查询出了所有的数据加载到了内存，导致了OOM。
为此我们专门查了下这个类的日志，代码的逻辑在执行前和执行结束后都有打印日志，但是我查了最近一个礼拜的应用日志，只有执行前的日志，没有结束的日志，由此推断应用其实每天凌晨1：30都在挂。（细思极恐，这种事情研发居然完全不知情！！！）
运维的脚本监控到应用长期不可访问（因为FullGC），就会kill掉进程，重新启动。

查到这里了问题就很明显了，下面我们来分析下导致OOM的代码

```
...
conn = getConn();
String sql = "select * from test";
pstst = conn.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
int fetchSize = 500;
pstst.setFetchSize(fetchSize);
rs = pstst.executeQuery();
...
```

因为双十一期间的这几天的量本来就是一年最高的，这个表的数据量极其的大，从这里我们可以看出是 `setFetchSize` 没有生效，导致所有的数据全部加载出来了，实践证明也的确是这样。

这里的查询我们用的mysql服务端版本是 5.7.27 ，客户端的driver版本是 8.0.12 。
查了下mysql官方的文档。

官方文档：  
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html

> Another alternative is to use cursor-based streaming to retrieve a set number of rows each time. This can be done by setting the connection property useCursorFetch to true, and then calling setFetchSize(int) with int being the desired number of rows to be fetched each time:

```
conn = DriverManager.getConnection("jdbc:mysql://localhost/?useCursorFetch=true", "user", "s3cr3t");
stmt = conn.createStatement();
stmt.setFetchSize(100);
rs = stmt.executeQuery("SELECT * FROM your_table_here");
```

如果你要用 `setFetchSize` ，要在mysql的连接那里设置 `useCursorFetch=true` , 我们的这个应用是没有设置的。

解决方案：

先停掉这个定时任务，放弃使用游标，改为 `limit` 操作。

### 总结：

这种应用大半夜的宕机对于研发人员来说是很难受的，但是最难受的是应用节点宕机了，研发人员根本不知道，因为没有任何消息通知到研发人员，这里面隐藏的问题会像一个埋的很深的雷，随时出事情，关键是给我整的早上才下班离开公司。

类型：工作标签：oom,java,jdbc