Ok, seeing as this cost me 3+ hours of my life that I will never get back, I thought I would take another hour to record the simple solution to an annoying problem.
I have two SRX340s on 18.4R3-S2 code. The problem isn't necessarily related to this code version, but I thought I'd mention it anyway.
They are configured to be in HA. All of the configuration is correct:
=====================================================================================================
user@host> show configuration | match "chassis cluster|fab" | display set | except desc
set chassis cluster control-link-recovery
set chassis cluster reth-count 4
set chassis cluster redundancy-group 0 node 0 priority 100
set chassis cluster redundancy-group 0 node 1 priority 1
set chassis cluster redundancy-group 1 node 0 priority 150
set chassis cluster redundancy-group 1 node 1 priority 100
set chassis cluster redundancy-group 1 gratuitous-arp-count 5
set chassis cluster redundancy-group 1 hold-down-interval 300
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/8 weight 90
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/9 weight 90
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/10 weight 90
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/11 weight 90
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/8 weight 90
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/9 weight 90
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/10 weight 90
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/11 weight 90
set interfaces fab0 fabric-options member-interfaces ge-0/0/0
set interfaces fab1 fabric-options member-interfaces ge-5/0/0
=====================================================================================================
All of the relevant interfaces are physically up:
user@host> show interfaces terse | match "fab|ge-5/0/0|ge-0/0/0|inter"
Interface Admin Link Proto Local Remote
ge-0/0/0 up up
ge-0/0/0.0 up up aenet --> fab0.0
ge-5/0/0 up up
ge-5/0/0.0 up up aenet --> fab1.0
fab0 up up
fab0.0 up up inet 30.17.0.200/24
fab1 up up
fab1.0 up up inet 30.18.0.200/24
swfab0 up down
swfab1 up down
However, there are errors in the chassis cluster status, namely the FL (Fabric Connection monitoring) failure.
It is also worth noting that other monitoring failures are flagged, even though all of the prerequisites to pass those checks are met:
{primary:node0}
user@host> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring RE Relinquish monitoring
IS IRQ storm
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 100 primary no no None
node1 0 secondary no no FL
Redundancy group: 1 , Failover count: 1
node0 0 primary no no CS
node1 0 secondary no no IF CS FL
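As an aside, rather than decoding the failure-code abbreviations from the legend, "show chassis cluster interfaces" lists the control and fabric links and their child interfaces with their status directly. I haven't included that output here, but it's a quick cross-check:
{primary:node0}
user@host> show chassis cluster interfaces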
The SRXs are also telling you that the fab links (the data-plane HA ports) are physically up, but their monitored status is down:
{primary:node0}[edit]
user@host# run show chassis cluster information
node0:
--------------------------------------------------------------------------
Redundancy Group Information:
Redundancy Group 0 , Current State: primary, Weight: 255
Time From To Reason
Jan 19 14:28:57 hold secondary Hold timer expired
Jan 19 14:28:58 secondary primary Better priority (100/1)
Redundancy Group 1 , Current State: primary, Weight: 0
Time From To Reason
Jan 19 14:28:58 hold secondary Hold timer expired
Jan 19 14:28:59 secondary primary Remote yield (0/0), due to IF CS CF failures
Chassis cluster LED information:
Current LED color: Amber
Last LED change reason: Monitored objects are down
Control port tagging:
Disabled
Failure Information:
Coldsync Monitoring Failure Information:
Statistics:
Coldsync Total SPUs: 1
Coldsync completed SPUs: 0
Coldsync not complete SPUs: 1
Fabric-link Failure Information:
Fabric Interface: fab0
Child interface Physical / Monitored Status
ge-0/0/0 Up / Down
node1:
--------------------------------------------------------------------------
Redundancy Group Information:
Redundancy Group 0 , Current State: secondary, Weight: 0
Time From To Reason
Jan 19 14:28:47 hold secondary Hold timer expired
Redundancy Group 1 , Current State: secondary, Weight: -255
Time From To Reason
Jan 19 14:28:48 hold secondary Hold timer expired
Chassis cluster LED information:
Current LED color: Amber
Last LED change reason: Monitored objects are down
Control port tagging:
Disabled
Failure Information:
Coldsync Monitoring Failure Information:
Statistics:
Coldsync Total SPUs: 1
Coldsync completed SPUs: 0
Coldsync not complete SPUs: 1
Fabric-link Failure Information:
Fabric Interface: fab1
Child interface Physical / Monitored Status
ge-5/0/0 Up / Down
{primary:node0}[edit]
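If you want to dig deeper at this point, jsrpd (the daemon that owns chassis clustering) logs redundancy-group and link state changes to the syslog, so something like this can show whether it ever saw the fabric link as up. This is a generic suggestion rather than output I saved:
{primary:node0}
user@host> show log messages | match jsrpd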
When you check the fabric link statistics, you notice that the probes are one-way: they are sent but never received. I'm not sure if it's important, but the same behaviour can be seen if you run this command from the secondary node:
user@host> show chassis cluster statistics
Control link statistics:
Control link 0:
Heartbeat packets sent: 160
Heartbeat packets received: 160
Heartbeat packet errors: 0
Fabric link statistics:
Child link 0
Probes sent: 318
Probes received: 0
*rest of the output is somewhat irrelevant and omitted for brevity*
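Another sanity check, which I'll mention for completeness even though it isn't captured above, is to look at the raw counters on the fabric child interface itself to confirm that frames really are leaving (and arriving) on the wire:
user@host> show interfaces ge-0/0/0 statistics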
The following troubleshooting steps yielded no difference (the CLI for some of them is sketched just after this list):
- Rebooting both devices.
- Rebooting a single device.
- Performing a commit full.
- Logging onto the secondary and performing a "request chassis cluster configuration-synchronize".
- Changing the physical cable to a different port and moving the configuration.
- Swapping the physical cable completely with a new one.
- Ensuring that the MTU of every port is 9014 or less (not 9192).
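For reference, this is roughly what those steps look like at the CLI (the prompts are approximate, and the MTU check would be repeated for each fabric and monitored port):
{primary:node0}[edit]
user@host# commit full

{secondary:node1}
user@host> request chassis cluster configuration-synchronize

{primary:node0}
user@host> show interfaces ge-0/0/0 | match MTU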
The way to fix this is simply to force a failover of the cluster on Redundancy Group 0. The command is "request chassis cluster failover redundancy-group 0 node 1 force" (the force parameter is needed because, in the current state, the secondary node wouldn't normally be allowed to become the master; you are essentially saying "please fail over to an unhealthy node").
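In full, with a status check afterwards to confirm that node1 has taken over RG0:
{primary:node0}
user@host> request chassis cluster failover redundancy-group 0 node 1 force

user@host> show chassis cluster status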
I suspect that, for some reason, both devices were playing the master role despite what "show chassis cluster status" was saying.
Absolutely bizarre issue.
After the failover, everything worked as expected because, of course, the actual configuration and interface status were completely fine. I was able to fail back later without issue, and node0 is the master again.
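For the fail back, the sequence is along these lines; this is the standard way to clear the manual-failover flag that the forced failover sets and then move RG0 back to node0, rather than a capture of exactly what I typed:
user@host> request chassis cluster failover reset redundancy-group 0
user@host> request chassis cluster failover redundancy-group 0 node 0
user@host> show chassis cluster status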
Hope this helps someone in the future, seeing as this closed out prematurely.
------------------------------
Kindest Regards,
Purplezorz
------------------------------