Security

SRX HA Cluster - Redundancy Group 1 - Fabric Link Physically Up, Monitored Status Down


    Posted 01-19-2021 13:20
    Ok, seeing as this cost me 3+ hours of my life that I will never get back, I thought I would take another hour to record the simple solution to an annoying problem.

    I have two SRX340s on 18.4R3-S2 code. The problem isn't necessarily related to this code version, but I thought I'd mention it anyway. 

    They are configured to be in HA. All of the configuration is correct:

    =====================================================================================================
    user@host> show configuration | match "chassis cluster|fab" | display set | except desc
    set chassis cluster control-link-recovery
    set chassis cluster reth-count 4
    set chassis cluster redundancy-group 0 node 0 priority 100
    set chassis cluster redundancy-group 0 node 1 priority 1
    set chassis cluster redundancy-group 1 node 0 priority 150
    set chassis cluster redundancy-group 1 node 1 priority 100
    set chassis cluster redundancy-group 1 gratuitous-arp-count 5
    set chassis cluster redundancy-group 1 hold-down-interval 300
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/8 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/9 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/10 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/11 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/8 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/9 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/10 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/11 weight 90
    set interfaces fab0 fabric-options member-interfaces ge-0/0/0
    set interfaces fab1 fabric-options member-interfaces ge-5/0/0
    =====================================================================================================


    All of the relevant interfaces are physically up:

    user@host> show interfaces terse | match "fab|ge-5/0/0|ge-0/0/0|inter"
    Interface               Admin Link Proto    Local                 Remote
    ge-0/0/0                up    up
    ge-0/0/0.0              up    up   aenet    --> fab0.0
    ge-5/0/0                up    up
    ge-5/0/0.0              up    up   aenet    --> fab1.0
    fab0                    up    up
    fab0.0                  up    up   inet     30.17.0.200/24
    fab1                    up    up
    fab1.0                  up    up   inet     30.18.0.200/24
    swfab0                  up    down
    swfab1                  up    down

    But there are errors under the chassis cluster status, namely the FL (Fabric Connection monitoring) error.
    It is also worth noting that other monitoring failures (such as IF and CS) are reported, even though all of the prerequisites for those checks to pass are met:


    {primary:node0}
    user@host> show chassis cluster status
    Monitor Failure codes:
        CS  Cold Sync monitoring        FL  Fabric Connection monitoring
        GR  GRES monitoring             HW  Hardware monitoring
        IF  Interface monitoring        IP  IP monitoring
        LB  Loopback monitoring         MB  Mbuf monitoring
        NH  Nexthop monitoring          NP  NPC monitoring
        SP  SPU monitoring              SM  Schedule monitoring
        CF  Config Sync monitoring      RE  Relinquish monitoring
        IS  IRQ storm
    
    Cluster ID: 1
    Node   Priority Status               Preempt Manual   Monitor-failures
    
    Redundancy group: 0 , Failover count: 1
    node0  100      primary              no      no       None
    node1  0        secondary            no      no       FL
    
    Redundancy group: 1 , Failover count: 1
    node0  0        primary              no      no       CS
    node1  0        secondary            no      no       IF CS FL


    And the SRXs are telling you that the fab interfaces (the data-plane HA ports) are physically up, but their monitored status is down:

    {primary:node0}[edit]
    user@host# run show chassis cluster information
    node0:
    --------------------------------------------------------------------------
    Redundancy Group Information:
    
        Redundancy Group 0 , Current State: primary, Weight: 255
    
            Time            From                 To                   Reason
            Jan 19 14:28:57 hold                 secondary            Hold timer expired
            Jan 19 14:28:58 secondary            primary              Better priority (100/1)
    
        Redundancy Group 1 , Current State: primary, Weight: 0
    
            Time            From                 To                   Reason
            Jan 19 14:28:58 hold                 secondary            Hold timer expired
            Jan 19 14:28:59 secondary            primary              Remote yield (0/0), due to IF CS CF  failures
    
    Chassis cluster LED information:
        Current LED color: Amber
        Last LED change reason: Monitored objects are down
    Control port tagging:
        Disabled
    
    Failure Information:
    
        Coldsync Monitoring Failure Information:
            Statistics:
                Coldsync Total SPUs: 1
                Coldsync completed SPUs: 0
                Coldsync not complete SPUs: 1
    
        Fabric-link Failure Information:
            Fabric Interface: fab0
              Child interface   Physical / Monitored Status
              ge-0/0/0              Up   / Down
    
    node1:
    --------------------------------------------------------------------------
    Redundancy Group Information:
    
        Redundancy Group 0 , Current State: secondary, Weight: 0
    
            Time            From                 To                   Reason
            Jan 19 14:28:47 hold                 secondary            Hold timer expired
    
        Redundancy Group 1 , Current State: secondary, Weight: -255
    
            Time            From                 To                   Reason
            Jan 19 14:28:48 hold                 secondary            Hold timer expired
    
    Chassis cluster LED information:
        Current LED color: Amber
        Last LED change reason: Monitored objects are down
    Control port tagging:
        Disabled
    
    Failure Information:
    
        Coldsync Monitoring Failure Information:
            Statistics:
                Coldsync Total SPUs: 1
                Coldsync completed SPUs: 0
                Coldsync not complete SPUs: 1
    
        Fabric-link Failure Information:
            Fabric Interface: fab1
              Child interface   Physical / Monitored Status
              ge-5/0/0              Up   / Down
    
    {primary:node0}[edit]

    When you check the fab link statistics, you notice that they are one-way: probes are sent but never received. I'm not sure if it's important, but the same behaviour can be seen if you run this command from the secondary node:

    user@host> show chassis cluster statistics
    Control link statistics:
        Control link 0:
            Heartbeat packets sent: 160
            Heartbeat packets received: 160
            Heartbeat packet errors: 0
    Fabric link statistics:
        Child link 0
            Probes sent: 318
            Probes received: 0
    *rest of the output is largely irrelevant and omitted for brevity*
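
    A quick way to confirm the one-way behaviour is ongoing (rather than just a stale counter) is to zero the counters and watch them again. These are standard Junos operational commands; run them on each node:

    user@host> clear chassis cluster statistics
    user@host> show chassis cluster statistics | find "Fabric"

    If "Probes sent" keeps climbing while "Probes received" stays at 0 on both nodes, the probes are being transmitted but never processed by the peer.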

    The following troubleshooting steps made no difference:

    • Rebooting both devices.
    • Rebooting a single device.
    • Performing a commit full.
    • Logging onto the secondary and running "request chassis cluster configuration-synchronize".
    • Moving the physical cable (and the configuration) to a different port.
    • Swapping the physical cable for a brand-new one.
    • Ensuring that the MTU of every port is 9014 or less (not 9192).
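
    On the MTU point: you can quickly check what the fabric child interfaces and the fab interface itself are actually using (standard Junos commands; the interface names are from the config above):

    user@host> show interfaces ge-0/0/0 | match MTU
    user@host> show interfaces fab0 | match MTU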


    The way to fix this is to simply force a failover of the cluster on Redundancy Group 0.

    The command is "request chassis cluster failover redundancy-group 0 node 1 force". The force parameter is needed because, in the current state, the secondary node wouldn't normally be allowed to become the master; you're essentially saying "please fail over to an unhealthy node".
    I suspect that, for some reason, both devices were playing the master role despite what "show chassis cluster status" was saying.
    Absolutely bizarre issue.
    After the failover, everything worked as expected because, of course, the actual configuration and interface status were completely fine. I was able to fail back later without issue, and node0 is the master again.
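
    For completeness, the full sequence looked roughly like this (standard Junos commands; after a manual failover, the reset clears the manual mastership flag before another failover is allowed):

    {primary:node0}
    user@host> request chassis cluster failover redundancy-group 0 node 1 force

    *verify with "show chassis cluster status", then later, to fail back to node0:*

    user@host> request chassis cluster failover reset redundancy-group 0
    user@host> request chassis cluster failover redundancy-group 0 node 0
    user@host> request chassis cluster failover reset redundancy-group 0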

    Hope this helps someone in the future, seeing as this closed out prematurely.

    ------------------------------
    Kindest Regards,

    Purplezorz
    ------------------------------