网络故障引起的 kafka 自身的 BUG

雨翔河

首页列表关于

2019-09-11 13:41

2019-09-03 17:06:25

机房网络出现一分钟波动，交换机问题导致kafka集群相互之间偶尔失联。

kafka日志如下所示：

```
[2019-09-03 17:06:25,610] WARN Attempting to send response via channel for which there is no open connection, connection id xxxxx (kafka.network.Processor)
[2019-09-03 17:06:31,906] INFO Unable to read additional data from server sessionid 0x46b0xxxx027, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:32,076] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:32,609] INFO Opening socket connection to server xxxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] WARN Client session timed out, have not heard from server in 1796ms for sessionid 0x46bxxxx40027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] INFO Client session timed out, have not heard from server in 1796ms for sessionid 0x46b03bxxx027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:34,942] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] INFO [Partition opStaffCancelPost-18 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:06:36,059] WARN Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] INFO Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx0027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,382] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:37,305] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:38,507] WARN Client session timed out, have not heard from server in 2135ms for sessionid 0x46bxxxx0027 (org.apache.zookeeper.ClientCnxn)
```

短暂的波动网络持续了1分钟左右，之后网络恢复。

本来对于高可用的kafka集群来说应该也是可以自动恢复的，但是事与愿违。

接着是 `kafka-manager` 的监控出现异常，大量的 `topic` 全部都没有 `broker` ，且消息的 `offset` 不再变化。

手动使用 `kafka client`连接上去看也是很奇怪，看不到消息进来消费，但是应用生产和消费却是正常的。

在kafka的日志里看到如下：
```
[2019-09-03 17:11:37,572] INFO [Partition opStaffCancelPost-18 broker=180] Cached zkVersion [50] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,572] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Cached zkVersion [48] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition __consumer_offsets-42 broker=180] Shrinking ISR from 180,181,182 to 180,181 (kafka.cluster.Partition)
[2019-09-03 17:11:37,576] INFO [Partition __consumer_offsets-42 broker=180] Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
```

第二天发现监控还是这样，情况有点不对（这时候应用本身使用kafka生产和消费是正常的，但是各种监控数据却说它异常），那套kafka集群还是处于有问题的状态，上午11点多开始我们手动重启节点，这个时候再次出现故障，整套kafka集群连接失败生产不了消息，其中kafka日志如下：

```
[2019-09-04 11:59:12,862] INFO [Partition routeStaffPostQueue-15 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition routeStaffPostQueue-15 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,865] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Cached zkVersion [41] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,866] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-04 11:59:12,867] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,867] INFO [Partition routeStaffCancelQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,870] INFO [Partition routeStaffCancelQueue-5 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
```

原因：

网络出现问题的时候，当kafka的controller和zk的会话过期了且失去了控制权，这个时候这个僵尸controller在短时间内还在继续更新zk和向broker发送 LeaderAndIsrRequests 。当这种情况发生的时候，其他的broker还没有更新leader信息和isr，导致后续需要更新的时候在zk上更新失败。

kafka官方已经确认了这个BUG，且在 `KAFKA-5642`通过合适的处理zk会话过期事件修复了这个问题。

可以参考kafka官方对这一事件的处理,和我们遇到的问题是一样的：

https://issues.apache.org/jira/browse/KAFKA-2729

我们使用的kakfa是1.0.0，官方的修复版本是在 1.1.0,所以如果还处于1.1.0以下版本的kafka用户，一定要注意下这个问题，可以调整下连接zk的超时时间，让超时时间多续几秒钟，要么就升级kafka版本。

```
zookeeper.connection.timeout.ms=10000
zookeeper.session.timeout.ms=10000
```

另外插一句，所谓的高可用，也并不是想象中的那么高可用，很多东西在PPT上写的好，实际上操作起来因为开发人员的原因或者别的什么因素，很多看起来很美好的东西并不是真的完美的，包括所谓的数据库异地双活。。。

类型：工作标签：kafka,bug