The strange case of one ES node's CPU climbing to 100%
2020-06-14 07:20
> This is a technical problem that, from the day it happened until now, I have not managed to solve at the root, and it remains a very big question mark in my mind. Written: Sunday 2020-06-14, 14:00

### The problem has since been solved. The troubleshooting process and the fix are at the very end of this article; that part was written 2020-06-20 15:26.

Problem 1: Around 10:00 on the morning of 2020-06-04 (peak hours), one node in cluster A, let's call it node 37, saw its CPU climb to 100% and stay there. The search queue backed up and a large number of search requests were rejected. Scaling out resolved the immediate problem (we added five new nodes afterwards), but even in the expanded cluster node 37's everyday CPU usage remained far higher than the other nodes', roughly 10x higher. In other words the other nodes sat idle watching the show, although after the expansion the CPU at least never hit 100% again.

Problem 2: Around 10:00 on the morning of 2020-06-12 (peak hours), one node in cluster B, let's call it node 2, saw its CPU climb to 100% and stay there, again backing up the search queue and causing a large number of search requests to be rejected. Scaling out resolved it, but node 2's everyday CPU usage remained far higher than the other nodes', again roughly 10x. We traced it to a huge surge in query volume from a third-party interface; because time was short, we circuit-broke that third-party query interface so that anything above N QPS got an error response immediately. Node 2's CPU dropped straight back to normal, while the other nodes sat idle throughout.

#### For problem 1, during troubleshooting we captured the threads that were consuming the most CPU inside node 37's ES process at the time:

```
"elasticsearch[logsearch-datanode-36][index][T#14]" #273 daemon prio=5 os_prio=0 tid=0x00007ef8e8084800 nid=0x6f62 runnable [0x00007ef4d37eb000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444) at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419) at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298) at java.lang.ThreadLocal.get(ThreadLocal.java:163) at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78) at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81) at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123) at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140) at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032) at org.elasticsearch.index.engine.InternalEngine.innerDelete(InternalEngine.java:582) - locked <0x00000000c3c36900> (a java.lang.Object) at org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:558) at org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:641) at org.elasticsearch.action.delete.TransportDeleteAction.executeDeleteRequestOnPrimary(TransportDeleteAction.java:143) at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:136) at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:54) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae17720> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#13]" #272 daemon prio=5 os_prio=0 tid=0x00007ef8f4085000 nid=0x6f61 runnable [0x00007ef4d38ec000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444) at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419) at 
java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298) at java.lang.ThreadLocal.get(ThreadLocal.java:163) at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78) at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81) at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123) at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140) at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032) at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501) - locked <0x00000001619253d8> (a java.lang.Object) at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457) at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601) at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836) at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae10650> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#12]" #271 daemon prio=5 os_prio=0 tid=0x00007ef8f8083800 nid=0x6f60 runnable [0x00007ef4d39ed000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444) at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419) at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298) at java.lang.ThreadLocal.get(ThreadLocal.java:163) at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78) at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81) at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123) at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140) at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032) at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501) - locked <0x0000000156f6a138> (a java.lang.Object) at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457) at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601) 
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836) at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66) at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291) at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae1b570> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#11]" #270 daemon prio=5 os_prio=0 tid=0x00007ef8e0003000 nid=0x6f5f runnable [0x00007ef4d3aee000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.set(ThreadLocal.java:467) at java.lang.ThreadLocal$ThreadLocalMap.access$100(ThreadLocal.java:298) at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:184) at java.lang.ThreadLocal.get(ThreadLocal.java:170) at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:481) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55) at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:450) at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601) at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836) at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) at 
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae15d78> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#10]" #269 daemon prio=5 os_prio=0 tid=0x00007ef910083800 nid=0x6f5e runnable [0x00007ef4d3bee000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444) at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419) at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298) at java.lang.ThreadLocal.get(ThreadLocal.java:163) at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78) at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81) at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123) at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140) at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032) at org.elasticsearch.index.engine.InternalEngine.innerDelete(InternalEngine.java:582) - locked <0x000000016c7c8a98> (a java.lang.Object) at org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:558) at org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:641) at org.elasticsearch.action.delete.TransportDeleteAction.executeDeleteRequestOnPrimary(TransportDeleteAction.java:143) at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:136) at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:54) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae19da0> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#9]" #267 daemon prio=5 os_prio=0 tid=0x00007ef8f0082800 nid=0x6f5c runnable [0x00007ef4d3df0000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444) at 
java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419) at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298) at java.lang.ThreadLocal.get(ThreadLocal.java:163) at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78) at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81) at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123) at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140) at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032) at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501) - locked <0x0000000156f71bf8> (a java.lang.Object) at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457) at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601) at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836) at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae178c8> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#8]" #266 daemon prio=5 os_prio=0 tid=0x00007ef8e8082800 nid=0x6f5b runnable [0x00007ef4d3ef1000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444) at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419) at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298) at java.lang.ThreadLocal.get(ThreadLocal.java:163) at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:481) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55) at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:450) at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601) at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836) at 
org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae19f48> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#7]" #265 daemon prio=5 os_prio=0 tid=0x00007ef8f4083000 nid=0x6f5a runnable [0x00007ef4d3ff3000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.set(ThreadLocal.java:467) at java.lang.ThreadLocal$ThreadLocalMap.access$100(ThreadLocal.java:298) at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:184) at java.lang.ThreadLocal.get(ThreadLocal.java:170) at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78) at org.apache.lucene.index.CodecReader.getNumericDocValues(CodecReader.java:143) at org.apache.lucene.index.FilterLeafReader.getNumericDocValues(FilterLeafReader.java:430) at org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.<init>(PerThreadIDAndVersionLookup.java:77) at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:83) at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123) at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140) at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032) at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501) - locked <0x00000001aa0732e0> (a java.lang.Object) at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457) at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601) at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836) at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66) at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at 
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291) at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae1b718> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#6]" #264 daemon prio=5 os_prio=0 tid=0x00007ef8f8082000 nid=0x6f59 runnable [0x00007ef4d40f4000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444) at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419) at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298) at java.lang.ThreadLocal.get(ThreadLocal.java:163) at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78) at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81) at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123) at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140) at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032) at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501) - locked <0x00000000c3c36b70> (a java.lang.Object) at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457) at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601) at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836) at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 
Locked ownable synchronizers: - <0x000000008ae15f20> (a java.util.concurrent.ThreadPoolExecutor$Worker) "elasticsearch[logsearch-datanode-36][index][T#5]" #263 daemon prio=5 os_prio=0 tid=0x00007ef8e0001000 nid=0x6f58 runnable [0x00007ef4d41f5000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444) at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419) at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298) at java.lang.ThreadLocal.get(ThreadLocal.java:163) at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78) at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81) at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123) at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140) at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032) at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501) - locked <0x0000000156f69b08> (a java.lang.Object) at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457) at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601) at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836) at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170) at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66) at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291) at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000008ae1a0f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
```

At first I suspected the ThreadLocalMap inside ThreadLocal, so I went back and brushed up on how ThreadLocalMap works. When a thread-local value is stored, the map allocates an array for that thread; the slot for a value is chosen by the ThreadLocal's hash, and on a collision the next slot is tried. That is how set works. get also locates the slot by the ThreadLocal's hash and, on a collision, probes the following slots; if it hits a slot whose key is already null, that entry has been garbage-collected and a cleanup is triggered, sweeping stale entries forward from that point. This is the so-called heuristic cleanup: no dedicated thread does it, it relies on lookups stumbling over stale slots.

My original suspicion was that the problem lay here. When there are too many collisions, or the array has grown large (heavy collisions cause it to be resized), every collision means probing further along the array, and that loop burns CPU. Could the array be nearly, but not quite, full, so that lookups kept probing in a loop for the ThreadLocal? To check, I started 20 threads locally and ran about 4 billion operations through them (a minimal sketch of this test is included at the end of this section). Out of every ten million get-and-set operations, only three or four took more than 100 ms, and those slow ones stayed around 200 ms, which in theory has no impact on performance.

#### Problem 2

Next I wondered whether uneven data distribution was to blame. We checked how much data each shard held: the shards were almost identical in size, differing by only about 1 GB, which is negligible.

We then suspected the system's kernel-level configuration: that machine's CPU was set to power-saving mode, so we switched it to performance mode, but that did not make much of a difference either.
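For reference, here is roughly what that local ThreadLocal stress test looked like. This is a minimal sketch written for this post, not the exact code we ran: the 20 threads, the roughly 4 billion total operations, the per-10-million reporting interval and the 100 ms threshold come from the description above, while the way ThreadLocals are created and discarded (to force collisions and stale entries) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Rough sketch of the ThreadLocalMap stress test described above.
 * Assumptions (not from the original notes): entries are churned by creating
 * and dropping ThreadLocal instances, and a "slow" operation means a single
 * set+get pair taking >= 100 ms as measured with System.nanoTime().
 */
public class ThreadLocalProbeTest {

    static final int THREADS = 20;                    // matches the post: 20 threads
    static final long OPS_PER_THREAD = 200_000_000L;  // 20 * 200M = ~4 billion ops in total
    static final long REPORT_EVERY = 10_000_000L;     // report per 10 million ops, as in the post

    public static void main(String[] args) throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < THREADS; t++) {
            Thread worker = new Thread(ThreadLocalProbeTest::run, "probe-" + t);
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    static void run() {
        // Keep a rotating set of ThreadLocals alive and drop old ones, so the
        // per-thread ThreadLocalMap accumulates stale entries and the
        // heuristic (expunge-on-miss) cleanup described above has work to do.
        List<ThreadLocal<Long>> live = new ArrayList<>();
        long slowOps = 0;
        for (long i = 1; i <= OPS_PER_THREAD; i++) {
            long start = System.nanoTime();

            ThreadLocal<Long> tl = new ThreadLocal<>();
            tl.set(i);   // insert: hashes into the map, linear-probes on collision
            tl.get();    // query: same probing path, may trigger cleanup of stale slots
            live.add(tl);
            if (live.size() > 4096) {
                live.subList(0, 2048).clear(); // drop references so map entries go stale
            }

            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs >= 100) {
                slowOps++;
            }
            if (i % REPORT_EVERY == 0) {
                System.out.printf("%s: %,d ops, %d ops >= 100 ms so far%n",
                        Thread.currentThread().getName(), i, slowOps);
            }
        }
    }
}
```

Either way, a handful of 100-200 ms outliers per ten million operations is consistent with the conclusion above that ThreadLocalMap probing was not the bottleneck.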
At this point we found another oddity: inbound traffic was the same on every node, yet that one machine's outbound traffic was 10x that of the other nodes. Could some query whose shards all happened to live on node 2 be returning an enormous result set? But a rough pass over the queries did not turn up anything with a particularly huge volume.

In the end we guessed it was the machines themselves: every other machine was fine, and only those two nodes in those two clusters misbehaved. Both clusters run a fairly old ES version, 2.3.4, and the problem nodes are physical machines with high specs.

Noting all this down; we will look at how to solve it properly later.

### Problem solved, written 2020-06-20 15:26

1. On the evening of June 16 we took machine 36, the high-CPU node, offline, and the cluster became stable and healthy.
2. On paper the decommissioned physical machine 36 has very high specs, more than 60 GB of RAM and 32 CPU cores, yet removing this high-CPU node fixed the cluster. In the other cluster we likewise took the high-CPU node 02 offline, and that cluster has been very stable as well.
3. My guess at the cause: either these two nodes have hardware or OS problems, or ES 2.3.4 copes poorly with heterogeneous machines inside one cluster (nodes whose hardware specs differ).

### Full timeline of the incident

- 2020-06-04: the courier cluster ran into high CPU load; nodes 37 and 36 both showed very high CPU while the other nodes sat idle.
- 2020-06-04 10:00 to 12:00, impact: parcels had been placed in lockers but FC管家 could not find the records. No serious impact: writes to ES were simply slow, Kafka messages backed up, and once things recovered the backlog was written automatically.
- Around 2020-06-04 13:16, node 37 suddenly dropped back to normal; node 36 stayed at high CPU.
- Around 2020-06-04 14:30, node 36 instantly returned to normal.
- Around 2020-06-04 17:10, node 36's CPU rose above 90% again; node 37 was slightly elevated but still healthy.
- 2020-06-04 17:10 to 18:30, impact: parcels had been picked up but FC管家 still showed them as not picked up. Again no serious impact: ES writes were slow, Kafka backed up, and the backlog drained automatically after recovery.
- Around 2020-06-04 18:55, node 36's CPU dropped instantly from its peak back to normal.
- 2020-06-04 20:00, we began scaling out, adding 6 physical machines; the expansion finished around 22:40.
- 2020-06-05 10:00, everything normal; node 36 was slightly elevated but within a normal range, around 20% CPU load, clearly higher than the other nodes.
- Around 2020-06-12 10:00, the consumer cluster's post_consumer_data_02 node, which we will call node 189, started showing high CPU load.
- Around 2020-06-12 10:20, customer service and the company-wide group reported problems with WeChat Pay Score orders; the cause was that WeChat Pay Score queries use the consumer cluster's indices.
- Around 2020-06-12 10:30, the open platform reported that large numbers of queries had become slow or returned nothing; engineering traced it to a surge in queries from a third party (Huawei) whose traffic lands on the consumer cluster.
- Around 2020-06-12 10:45, we urgently circuit-broke the third party's highest-traffic query interface.
- 2020-06-12 10:52, node 189's CPU fell from 99% to 8% in an instant and the business recovered; the problem reports that kept coming in afterwards were simply lagging behind the actual recovery.
- 2020-06-12 11:00, we discussed query limits with the third party, but the cause of node 189's CPU spike was still unknown.
- 2020-06-13, Saturday: kept the third-party interface circuit-broken and rate-limited.
- 2020-06-14, Sunday: kept the third-party interface circuit-broken and rate-limited.
- 2020-06-15, Monday evening: keeping the rate limit would have drawn complaints from third parties such as PDD and HW, so that evening we switched the third-party query interface from the consumer cluster to the courier cluster. Since the courier cluster had been expanded on the night of 2020-06-04, we routed the traffic to this stronger cluster and removed the rate limit.
- 2020-06-16, Tuesday 10:19, node 36 in the courier cluster started climbing while the other nodes sat idle; the customer-service system, the ticket system and FC管家 saw delivery queries lag, slow down or return nothing.
- 2020-06-16 11:30, rate-limited the third-party query interface again; node 36's CPU recovered instantly.
- Around 2020-06-16 13:04, migrated shards off node 37, moving two of its shards to other nodes; the effect was not very noticeable.
- 2020-06-16 13:44, removed the rate limit on the third-party query interface.
- Around 2020-06-16 15:37, node 36's CPU started climbing again with violent swings. We decided to remove all virtual-machine nodes from the courier ES cluster and to take the badly misbehaving node 36 offline.
- Around 2020-06-16 23:50, started preparing the courier ES cluster migration: remove all virtual-machine nodes and the problem node 36.
- 2020-06-17 04:27, the courier cluster's 14 data nodes, 2 master nodes and 2 client nodes were all running on physical machines, and machine 36 had been removed from the courier ES cluster.
- 2020-06-17 10:00, the courier cluster was fine through peak hours; node 37 is still a bit weaker than the others (the worst node now that 36 is gone), but it is in a healthy, controllable state.
- Around 2020-06-17 22:55, took node 189, the previously problematic node in the consumer cluster, offline.
- From 2020-06-18 until now, the clusters have been stable under monitoring; node 37 in the courier cluster runs at roughly twice the CPU load of the other nodes, but it remains healthy and under control.

### Conclusion

Node 36 in the courier cluster and node 189 in the consumer cluster have OS or hardware problems. Node 37 has similar issues, but is in better shape than the truly bad ones.
Category: Work
Tags: elasticsearch, linux