I have a cluster problem and no clue what is causing it.
After some years of running, I had to stop one firewall node (SRX550); this was node0. After the reboot its interfaces were down (both in fpc0 and in fpc3), so I took it offline until the hardware could be replaced.
Later I tried starting the firewall node without any cables attached and the interfaces came up normally, so I tried to put it back into the cluster.
When it started, it immediately became the active node on RG0, but the reth interfaces remained down (with all the ge interfaces up), so I turned it off again. No preemption is configured, so the interfaces remained active on the other node, node1.
After that I discovered that RG0 on node1 shows an error (GR); this is probably why node0 took mastership when I plugged it back in.
Now node0 is turned off, I still have this GR (GRES monitoring) error, and the firewall is working.
I would like to bring node0 back into service, but first I want to clear this GR error.
When I check show chassis cluster information detail I can see gres-not-ready ....
user@firewall-node1> show chassis cluster status
Monitor Failure codes:
    CS  Cold Sync monitoring        FL  Fabric Connection monitoring
    GR  GRES monitoring             HW  Hardware monitoring
    IF  Interface monitoring        IP  IP monitoring
    LB  Loopback monitoring         MB  Mbuf monitoring
    NH  Nexthop monitoring          NP  NPC monitoring
    SP  SPU monitoring              SM  Schedule monitoring
    CF  Config Sync monitoring      RE  Relinquish monitoring

Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  0        lost           n/a     n/a      n/a
node1  255      primary        no      yes      GR

Redundancy group: 1 , Failover count: 0
node0  0        lost           n/a     n/a      n/a
node1  0        primary        no      no       CS
user@firewall-node1> show chassis cluster information detail
Configured mode: active-active
Operational mode: active-active
Heartbeat interval: 1000 ms
Heartbeat threshold: 3
Control link recovery: Disabled
Fabric link down timeout: 66 sec
Node health information:
Local node health: Not healthy
Remote node health: Healthy
Redundancy group: 0, Threshold: 255, Monitoring failures: gres-not-ready
Please help me clear this GR error.
node1 is not in a healthy state. I think this is because a split-brain scenario occurred. Also, the RG0 priority on node1 is 255, which means there was a manual failover. Reset it with "request chassis cluster failover reset redundancy-group 0". You will have to reboot node1 to recover from the unhealthy state.
First reboot node1 and power on node0 at the same time, so that downtime is reduced and the kernel state is not synced from node1 to node0.
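If you want to check for these two conditions without reading the table by eye, here is a rough Python sketch that scans saved "show chassis cluster status" output for a leftover manual failover (priority 255 with Manual = yes) and for any monitor-failure codes. It assumes the column order shown in the output above; parse_cluster_status and findings are hypothetical helper names for illustration, not part of any Juniper tool or API.

```python
import re


def parse_cluster_status(text):
    """Collect per-node rows from saved 'show chassis cluster status' text.

    Assumes the columns: Node Priority Status Preempt Manual Monitor-failures.
    """
    rows = []
    group = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Redundancy group:\s*(\d+)", line)
        if m:
            group = int(m.group(1))
            continue
        fields = line.split()
        # Node rows look like: node1 255 primary no yes GR
        if group is not None and len(fields) >= 6 and re.fullmatch(r"node\d+", fields[0]):
            rows.append({
                "group": group,
                "node": fields[0],
                "priority": int(fields[1]),
                "status": fields[2],
                "preempt": fields[3],
                "manual": fields[4],
                # "None" and "n/a" mean no failure codes are reported
                "failures": [f for f in fields[5:] if f not in ("None", "n/a")],
            })
    return rows


def findings(rows):
    """Return human-readable notes for rows that need attention."""
    notes = []
    for r in rows:
        if r["priority"] == 255 and r["manual"] == "yes":
            notes.append(f"RG{r['group']} {r['node']}: manual failover still set (priority 255)")
        if r["failures"]:
            notes.append(f"RG{r['group']} {r['node']}: monitor failures {', '.join(r['failures'])}")
    return notes


# Sample taken from the output posted above.
SAMPLE = """\
Redundancy group: 0 , Failover count: 0
node0 0 lost n/a n/a n/a
node1 255 primary no yes GR
Redundancy group: 1 , Failover count: 0
node0 0 lost n/a n/a n/a
node1 0 primary no no CS
"""

for note in findings(parse_cluster_status(SAMPLE)):
    print(note)
```

On the output posted above this flags the manual failover on RG0 and the GR and CS codes on node1. It only reads text you have already captured; it does not talk to the device.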
Thanks for the reply. Yesterday I restarted it and that cleared the error.
I left the other node turned off.
In the end I found that jumbo frames were not enabled on the switch that carries the HA link between the nodes; that caused the original error.