Security

SRX HA Cluster - Redundancy Group 1 - Fabric Link Physically Up, Monitored Status Down


    Posted 01-19-2021 13:20
    Ok, seeing as this cost me 3+ hours of my life that I will never get back, I thought I would take another hour to record the simple solution to an annoying problem.

    I have two SRX340s on 18.4R3-S2 code. The problem isn't necessarily related to this code version, but I thought I'd mention it anyway. 

    They are configured to be in HA. All of the configuration is correct:

    =====================================================================================================
    user@host> show configuration | match "chassis cluster|fab" | display set | except desc
    set chassis cluster control-link-recovery
    set chassis cluster reth-count 4
    set chassis cluster redundancy-group 0 node 0 priority 100
    set chassis cluster redundancy-group 0 node 1 priority 1
    set chassis cluster redundancy-group 1 node 0 priority 150
    set chassis cluster redundancy-group 1 node 1 priority 100
    set chassis cluster redundancy-group 1 gratuitous-arp-count 5
    set chassis cluster redundancy-group 1 hold-down-interval 300
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/8 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/9 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/10 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/11 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/8 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/9 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/10 weight 90
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/11 weight 90
    set interfaces fab0 fabric-options member-interfaces ge-0/0/0
    set interfaces fab1 fabric-options member-interfaces ge-5/0/0
    =====================================================================================================


    All of the relevant interfaces are physically up:

    user@host> show interfaces terse | match "fab|ge-5/0/0|ge-0/0/0|inter"
    Interface               Admin Link Proto    Local                 Remote
    ge-0/0/0                up    up
    ge-0/0/0.0              up    up   aenet    --> fab0.0
    ge-5/0/0                up    up
    ge-5/0/0.0              up    up   aenet    --> fab1.0
    fab0                    up    up
    fab0.0                  up    up   inet     30.17.0.200/24
    fab1                    up    up
    fab1.0                  up    up   inet     30.18.0.200/24
    swfab0                  up    down
    swfab1                  up    down

    But there are errors under the chassis cluster status, namely the FL (Fabric Connection monitoring) error.
    It is also worth noting that other monitoring failures (such as IF and CS) are reported, even though all of the prerequisites for those checks to pass are met:


    {primary:node0}
    user@host> show chassis cluster status
    Monitor Failure codes:
        CS  Cold Sync monitoring        FL  Fabric Connection monitoring
        GR  GRES monitoring             HW  Hardware monitoring
        IF  Interface monitoring        IP  IP monitoring
        LB  Loopback monitoring         MB  Mbuf monitoring
        NH  Nexthop monitoring          NP  NPC monitoring
        SP  SPU monitoring              SM  Schedule monitoring
        CF  Config Sync monitoring      RE  Relinquish monitoring
        IS  IRQ storm
    
    Cluster ID: 1
    Node   Priority Status               Preempt Manual   Monitor-failures
    
    Redundancy group: 0 , Failover count: 1
    node0  100      primary              no      no       None
    node1  0        secondary            no      no       FL
    
    Redundancy group: 1 , Failover count: 1
    node0  0        primary              no      no       CS
    node1  0        secondary            no      no       IF CS FL


    And the SRXs are telling you that the fab interfaces (the data-plane HA ports) are physically up, but their monitored status is down:

    {primary:node0}[edit]
    user@host# run show chassis cluster information
    node0:
    --------------------------------------------------------------------------
    Redundancy Group Information:
    
        Redundancy Group 0 , Current State: primary, Weight: 255
    
            Time            From                 To                   Reason
            Jan 19 14:28:57 hold                 secondary            Hold timer expired
            Jan 19 14:28:58 secondary            primary              Better priority (100/1)
    
        Redundancy Group 1 , Current State: primary, Weight: 0
    
            Time            From                 To                   Reason
            Jan 19 14:28:58 hold                 secondary            Hold timer expired
            Jan 19 14:28:59 secondary            primary              Remote yield (0/0), due to IF CS CF  failures
    
    Chassis cluster LED information:
        Current LED color: Amber
        Last LED change reason: Monitored objects are down
    Control port tagging:
        Disabled
    
    Failure Information:
    
        Coldsync Monitoring Failure Information:
            Statistics:
                Coldsync Total SPUs: 1
                Coldsync completed SPUs: 0
                Coldsync not complete SPUs: 1
    
        Fabric-link Failure Information:
            Fabric Interface: fab0
              Child interface   Physical / Monitored Status
              ge-0/0/0              Up   / Down
    
    node1:
    --------------------------------------------------------------------------
    Redundancy Group Information:
    
        Redundancy Group 0 , Current State: secondary, Weight: 0
    
            Time            From                 To                   Reason
            Jan 19 14:28:47 hold                 secondary            Hold timer expired
    
        Redundancy Group 1 , Current State: secondary, Weight: -255
    
            Time            From                 To                   Reason
            Jan 19 14:28:48 hold                 secondary            Hold timer expired
    
    Chassis cluster LED information:
        Current LED color: Amber
        Last LED change reason: Monitored objects are down
    Control port tagging:
        Disabled
    
    Failure Information:
    
        Coldsync Monitoring Failure Information:
            Statistics:
                Coldsync Total SPUs: 1
                Coldsync completed SPUs: 0
                Coldsync not complete SPUs: 1
    
        Fabric-link Failure Information:
            Fabric Interface: fab1
              Child interface   Physical / Monitored Status
              ge-5/0/0              Up   / Down
    
    {primary:node0}[edit]

    When you check the fab link statistics, you notice that they are one-way: probes are sent but never received. I'm not sure if it's important, but the same behaviour can be seen if you run this command from the secondary node:

    user@host> show chassis cluster statistics
    Control link statistics:
        Control link 0:
            Heartbeat packets sent: 160
            Heartbeat packets received: 160
            Heartbeat packet errors: 0
    Fabric link statistics:
        Child link 0
            Probes sent: 318
            Probes received: 0
    *rest of the output is largely irrelevant and omitted for brevity*
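
    A quick way to confirm the one-way behaviour is ongoing (rather than just a stale counter) is to zero the counters and watch them again. These are standard Junos operational commands; run them on each node:

    user@host> clear chassis cluster statistics
    user@host> show chassis cluster statistics | find "Fabric"

    If "Probes sent" keeps climbing while "Probes received" stays at 0 on both nodes, the probes are being transmitted but never processed by the peer.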

    The following troubleshooting steps made no difference:

    • Rebooting both devices.
    • Rebooting a single device.
    • Performing a commit full.
    • Logging onto the secondary and running "request chassis cluster configuration-synchronize".
    • Moving the physical cable (and the configuration) to a different port.
    • Swapping the physical cable for a brand-new one.
    • Ensuring that the MTU of every port is 9014 or less (not 9192).
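
    On the MTU point: you can quickly check what the fabric child interfaces and the fab interface itself are actually using (standard Junos commands; the interface names are from the config above):

    user@host> show interfaces ge-0/0/0 | match MTU
    user@host> show interfaces fab0 | match MTU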


    The way to fix this is to simply force a failover of the cluster on Redundancy Group 0.

    The command is "request chassis cluster failover redundancy-group 0 node 1 force". The force parameter is needed because, in the current state, the secondary node wouldn't normally be allowed to become the master; you're essentially saying "please fail over to an unhealthy node".
    I suspect that, for some reason, both devices were playing the master role despite what "show chassis cluster status" was saying.
    Absolutely bizarre issue.
    After the failover, everything worked as expected because, of course, the actual configuration and interface status were completely fine. I was able to fail back later without issue, and node0 is the master again.
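
    For completeness, the full sequence looked roughly like this (standard Junos commands; after a manual failover, the reset clears the manual mastership flag before another failover is allowed):

    {primary:node0}
    user@host> request chassis cluster failover redundancy-group 0 node 1 force

    *verify with "show chassis cluster status", then later, to fail back to node0:*

    user@host> request chassis cluster failover reset redundancy-group 0
    user@host> request chassis cluster failover redundancy-group 0 node 0
    user@host> request chassis cluster failover reset redundancy-group 0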

    Hope this helps someone in the future, seeing as this closed out prematurely.

    ------------------------------
    Kindest Regards,

    Purplezorz
    ------------------------------