Skip to content

Split brain recovery 'too fast' #65

@evdherberg

Description

@evdherberg

After

Recovering from split-brain; firing leftCluster()/joinedCluster() events

the ClusterListener proceeds to fireLeftClusterAndWaitToComplete. This is not guaranteed to wait for leftCluster handling on other nodes. Possibly this results in interleaving of execution of various left/joinCluster handlers.

2021.10.14 09:56:41 �[33mWARN �[m [hz.openfire.event-2]: org.jivesoftware.openfire.plugin.util.cache.ClusterListener - Recovering from split-brain; firing leftCluster()/joinedCluster() events
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.priority-generic-operation.thread-0]: com.hazelcast.internal.partition.InternalPartitionService - [172.60.0.20]:5701 [openfire] [3.12.5] Applied partition state update with version: 1808
2021.10.14 09:56:41 �[32mINFO �[m [hz.openfire.cached.thread-4]: com.hazelcast.nio.tcp.TcpIpConnector - [172.60.0.20]:5701 [openfire] [3.12.5] Connecting to /172.60.0.10:5701, timeout: 10000, bind-any: true
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-4]: com.hazelcast.nio.tcp.TcpIpEndpointManager - [172.60.0.20]:5701 [openfire] [3.12.5] Established socket connection between /172.60.0.20:40609 and /172.60.0.10:5701
2021.10.14 09:56:41 �[32mINFO �[m [hz.openfire.IO.thread-in-0]: com.hazelcast.nio.tcp.TcpIpConnection - [172.60.0.20]:5701 [openfire] [3.12.5] Initialized new cluster connection between /172.60.0.20:5701 and /172.60.0.10:55053
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.event-2]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Firing leftCluster() event
2021.10.14 09:56:41 �[32mINFO �[m [hz.openfire.event-2]: org.jivesoftware.openfire.cluster.ClusterManager - Firing left cluster event for this node
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.event-2]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Waiting for leftCluster() event to be called [timeout=30 seconds]
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-2]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'join' presence stanzas on the local cluster node for 0 occupant(s) of cluster node b6a7aa53-e796-4d23-b2b1-b80650fe7181 that is now part of our cluster.
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-2]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'join' presence stanzas on the local cluster node for 0 occupant(s) of cluster node b6a7aa53-e796-4d23-b2b1-b80650fe7181 that is now part of our cluster.
2021.10.14 09:56:41 �[32mINFO �[m [ClusterManager events dispatcher]: org.jivesoftware.util.cache.CacheFactory - Clustering stopped; cache migration complete
2021.10.14 09:56:41 �[32mINFO �[m [hz.openfire.IO.thread-in-1]: com.hazelcast.nio.tcp.TcpIpConnection - [172.60.0.20]:5701 [openfire] [3.12.5] Initialized new cluster connection between /172.60.0.20:40609 and /172.60.0.10:5701
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Restoring cache content for cache 'Routing Servers Cache' by adding all outgoing server routes that are connected to the local cluster node.
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Restoring cache content for cache 'Routing Components Cache' by adding all component routes that are connected to the local cluster node.
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Restoring cache content for cache 'Routing Users Cache', 'Routing AnonymousUsers Cache' and 'Routing User Sessions' by adding all client routes that are connected to the local cluster node.
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Adding client route user2@xmpp.localhost.example/jsxc-fc5ea58c
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Adding client route ClientRoute{nodeID=a7206884-1c97-4915-a969-68d5c33dce18, available=true} to users cache under key user2@xmpp.localhost.example/jsxc-fc5ea58c
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.OccupantManager - Replaced occupants with arguments null Occupant{roomName='muc1', nickname='user1', realJID=user1@xmpp.localhost.example/jsxc-7d2c6fd3, lastActive=2021-10-14T09:56:41.642849Z} 73230fd6-f29d-488e-aae4-775607a41246
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'join' presence stanzas on the local cluster node for 1 occupant(s) of cluster node 73230fd6-f29d-488e-aae4-775607a41246 that is now part of our cluster.
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.MUCPersistenceManager - Attempting to load room 'muc1' from the database.
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.OccupantManager - Replaced occupants with arguments Occupant{roomName='muc1', nickname='user1', realJID=user1@xmpp.localhost.example/jsxc-7d2c6fd3, lastActive=2021-10-14T09:56:41.642849Z} null 73230fd6-f29d-488e-aae4-775607a41246
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.OccupantManager - Replaced occupants with arguments null Occupant{roomName='muc1', nickname='user1', realJID=user1@xmpp.localhost.example/jsxc-7d2c6fd3, lastActive=2021-10-14T09:56:41.652447Z} 73230fd6-f29d-488e-aae4-775607a41246
2021.10.14 09:56:41 �[32mINFO �[m [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.OccupantManager - We received a copy of local MUC occupants from node 73230fd6-f29d-488e-aae4-775607a41246, but we already had this information. This hints at a possible inefficient sharing of data across the cluster.
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'join' presence stanzas on the local cluster node for 1 occupant(s) of cluster node 73230fd6-f29d-488e-aae4-775607a41246 that is now part of our cluster.
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Cluster event: service conference.xmpp.localhost.example left a cluster - going to restore 0 rooms
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.OccupantManager - Reset occupants because we left the cluster
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Restoring cache content for cache 'MUC Service 'conference' Rooms' after we left the cluster, by adding all MUC Rooms that are known to the local node.
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.group.GroupJID - Parsing JID from string: admin@xmpp.localhost.example
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.FMUCHandler - (room: 'muc1@conference.xmpp.localhost.example'): Changing outbound join configuration. Existing: null, New: null
2021.10.14 09:56:41 �[33mWARN �[m [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - A remote cluster node (73230fd6-f29d-488e-aae4-775607a41246) tells us that user user1@xmpp.localhost.example/jsxc-7d2c6fd3 is supposed to be an occupant (using nickname 'user1') of a room named 'muc1' but the data in the cluster cache does not indicate that this is true.
2021.10.14 09:56:41 �[33mWARN �[m [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - A remote cluster node (73230fd6-f29d-488e-aae4-775607a41246) tells us that user user1@xmpp.localhost.example/jsxc-7d2c6fd3 is supposed to be an occupant (using nickname 'user1') of a room named 'muc1' but the data in the cluster cache does not indicate that this is true.
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Finished sending 'join' presence stanzas on the local cluster node.
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Finished sending 'join' presence stanzas on the local cluster node.
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'leave' presence stanzas on the local cluster node for 1 occupant(s) of one or more cluster nodes that are no longer part of our cluster.
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza now being sent: <presence type="unavailable" to="user2@xmpp.localhost.example/jsxc-fc5ea58c" from="muc1@conference.xmpp.localhost.example/user1"><x xmlns="http://jabber.org/protocol/muc#user"><item role="none" jid="user1@xmpp.localhost.example/jsxc-7d2c6fd3"/></x></presence>
2021.10.14 09:56:41 �[33mWARN �[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Unable to find MUCRole for recipient 'user2@xmpp.localhost.example/jsxc-fc5ea58c' in room muc1@conference.xmpp.localhost.example while broadcasting 'leave' presence for occupants on disconnected cluster node(s).
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Finished sending 'leave' presence stanzas on the local cluster node.
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Cluster event: service conference.xmpp.localhost.example left a cluster - going to restore 2 rooms
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.OccupantManager - Reset occupants because we left the cluster
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Restoring cache content for cache 'MUC Service 'conference' Rooms' after we left the cluster, by adding all MUC Rooms that are known to the local node.
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'leave' presence stanzas on the local cluster node for 0 occupant(s) of one or more cluster nodes that are no longer part of our cluster.
2021.10.14 09:56:41 �[32mINFO �[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (a7206884-1c97-4915-a969-68d5c33dce18/xmpp2.localhost.example) has left the cluster
2021.10.14 09:56:41 �[32mINFO �[m [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Sending message to admins: The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience (enabled=true)
2021.10.14 09:56:41 �[36mDEBUG�[m [TaskEngine-pool-12]: org.jivesoftware.openfire.XMPPServer - Sending message to admin [jid=admin@xmpp.localhost.example, message=The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience]
2021.10.14 09:56:41 �[36mDEBUG�[m [TaskEngine-pool-12]: org.jivesoftware.openfire.spi.RoutingTableImpl - Unable to route packet. No session is available so store offline. <message from="xmpp.localhost.example" to="admin@xmpp.localhost.example"><body>The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience</body></message> 
2021.10.14 09:56:41 �[36mDEBUG�[m [TaskEngine-pool-12]: org.jivesoftware.openfire.spi.RoutingTableImpl - Failed to route packet to JID: admin@xmpp.localhost.example packet: <message from="xmpp.localhost.example" to="admin@xmpp.localhost.example"><body>The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience</body></message>
2021.10.14 09:56:41 �[36mDEBUG�[m [TaskEngine-pool-12]: org.jivesoftware.openfire.MessageRouter - Message sent to unreachable address: <message from="xmpp.localhost.example" to="admin@xmpp.localhost.example"><body>The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience</body></message>
2021.10.14 09:56:41 �[36mDEBUG�[m [ClusterManager events dispatcher]: org.jivesoftware.util.cache.CacheFactory - CacheFactory is not listening for cluster events anymore because it left the cluster
2021.10.14 09:56:41 �[36mDEBUG�[m [hz.openfire.event-2]: org.jivesoftware.openfire.plugin.util.cache.ClusterListener - Firing joinedCluster() event```

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions