雨翔河
首页
列表
关于
网络故障引起的 kafka 自身的 BUG
2019-09-11 13:41
2019-09-03 17:06:25 机房网络出现一分钟波动,交换机问题导致kafka集群相互之间偶尔失联。 kafka日志如下所示: ``` [2019-09-03 17:06:25,610] WARN Attempting to send response via channel for which there is no open connection, connection id xxxxx (kafka.network.Processor) [2019-09-03 17:06:31,906] INFO Unable to read additional data from server sessionid 0x46b0xxxx027, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2019-09-03 17:06:32,076] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2019-09-03 17:06:32,609] INFO Opening socket connection to server xxxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2019-09-03 17:06:33,810] WARN Client session timed out, have not heard from server in 1796ms for sessionid 0x46bxxxx40027 (org.apache.zookeeper.ClientCnxn) [2019-09-03 17:06:33,810] INFO Client session timed out, have not heard from server in 1796ms for sessionid 0x46b03bxxx027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2019-09-03 17:06:34,942] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2019-09-03 17:06:36,059] INFO [Partition opStaffCancelPost-18 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition) [2019-09-03 17:06:36,059] WARN Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx027 (org.apache.zookeeper.ClientCnxn) [2019-09-03 17:06:36,059] INFO Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx0027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2019-09-03 17:06:36,382] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient) [2019-09-03 17:06:37,305] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2019-09-03 17:06:38,507] WARN Client session timed out, have not heard from server in 2135ms for sessionid 0x46bxxxx0027 (org.apache.zookeeper.ClientCnxn) ``` 短暂的波动网络持续了1分钟左右,之后网络恢复。 本来对于高可用的kafka集群来说应该也是可以自动恢复的,但是事与愿违。 接着是 `kafka-manager` 的监控出现异常,大量的 `topic` 全部都没有 `broker` ,且消息的 `offset` 不再变化。 手动使用 `kafka client`连接上去看也是很奇怪,看不到消息进来消费,但是应用生产和消费却是正常的。 在kafka的日志里看到如下: ``` [2019-09-03 17:11:37,572] INFO [Partition opStaffCancelPost-18 broker=180] Cached zkVersion [50] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) [2019-09-03 17:11:37,572] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition) [2019-09-03 17:11:37,574] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Cached zkVersion [48] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) [2019-09-03 17:11:37,574] INFO [Partition __consumer_offsets-42 broker=180] Shrinking ISR from 180,181,182 to 180,181 (kafka.cluster.Partition) [2019-09-03 17:11:37,576] INFO [Partition __consumer_offsets-42 broker=180] Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) ``` 第二天发现监控还是这样,情况有点不对(这时候应用本身使用kafka生产和消费是正常的,但是各种监控数据却说它异常),那套kafka集群还是处于有问题的状态,上午11点多开始我们手动重启节点,这个时候再次出现故障,整套kafka集群连接失败生产不了消息,其中kafka日志如下: ``` [2019-09-04 11:59:12,862] INFO [Partition routeStaffPostQueue-15 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition) [2019-09-04 11:59:12,864] INFO [Partition routeStaffPostQueue-15 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) [2019-09-04 11:59:12,864] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition) [2019-09-04 11:59:12,865] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Cached zkVersion [41] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) [2019-09-04 11:59:12,866] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition) [2019-09-04 11:59:12,867] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) [2019-09-04 11:59:12,867] INFO [Partition routeStaffCancelQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition) [2019-09-04 11:59:12,870] INFO [Partition routeStaffCancelQueue-5 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) ``` 原因: 网络出现问题的时候,当kafka的controller和zk的会话过期了且失去了控制权,这个时候这个僵尸controller在短时间内还在继续更新zk和向broker发送 LeaderAndIsrRequests 。当这种情况发生的时候,其他的broker还没有更新leader信息和isr,导致后续需要更新的时候在zk上更新失败。 kafka官方已经确认了这个BUG,且在 `KAFKA-5642`通过合适的处理zk会话过期事件修复了这个问题。 可以参考kafka官方对这一事件的处理,和我们遇到的问题是一样的: https://issues.apache.org/jira/browse/KAFKA-2729 我们使用的kakfa是1.0.0,官方的修复版本是在 1.1.0,所以如果还处于1.1.0以下版本的kafka用户,一定要注意下这个问题,可以调整下连接zk的超时时间,让超时时间多续几秒钟,要么就升级kafka版本。 ``` zookeeper.connection.timeout.ms=10000 zookeeper.session.timeout.ms=10000 ``` 另外插一句,所谓的高可用,也并不是想象中的那么高可用,很多东西在PPT上写的好,实际上操作起来因为开发人员的原因或者别的什么因素,很多看起来很美好的东西并不是真的完美的,包括所谓的数据库异地双活。。。
类型:工作
标签:kafka,bug
Copyright © 雨翔河
我与我周旋久
独孤影
开源实验室