Security






srx 1500 cluster issue: received probe packet is ALWAYS zero on the primary node - fabric link is physically up, monitor status down

  • 1.  srx 1500 cluster issue: received probe packet is ALWAYS zero on the primary node - fabric link is physically up, monitor status down

    Posted 01-16-2022 15:10
    Dear Security community,
    I'm facing a really strange issue with an SRX1500 cluster: both nodes seem to have lost communication with each other and traffic has stopped being processed. All services, such as IPsec VPNs and BGP sessions, are currently down.

    After checking, I noticed that the received probe count on node0 (the primary) is always 0, and that the fabric link fab0 is physically up but shows as Down in the monitored status. I even disabled fabric link monitoring with "set chassis cluster no-fabric-monitoring" and then rebooted node0, but the issue remains. I have also forced failover between the nodes many times and rebooted both nodes several times, still the same behaviour. Please find below all the information regarding the cluster:

    Software version: JUNOS Software Release [15.1X49-D180.2]
    Cluster configuration (I have deactivated the interface monitoring for now to troubleshoot further):

    @sn-dx-node0> show configuration chassis cluster 

    no-fabric-monitoring;

    reth-count 128;

    redundancy-group 0 {

        node 0 priority 100;

        node 1 priority 1;

    }

    redundancy-group 1 {

        node 0 priority 100;

        node 1 priority 1;

        inactive: interface-monitor {

            xe-0/0/16 weight 255;

            xe-7/0/16 weight 255;

        }

    }

    redundancy-group 2 {

        node 0 priority 100;

        node 1 priority 1;

        inactive: interface-monitor {

            ge-0/0/12 weight 255;

            ge-7/0/12 weight 255;

        }

    }
    Cluster status:

    Cluster ID: 1

    Node   Priority Status         Preempt Manual   Monitor-failures

    Redundancy group: 0 , Failover count: 1

    node0  255      primary        no      yes      None           

    node1  1        secondary      no      yes      None           

    Redundancy group: 1 , Failover count: 1

    node0  255      primary        no      yes      HW             

    node1  1        secondary      no      yes      None           

    Redundancy group: 2 , Failover count: 1

    node0  255      primary        no      yes      HW             

    node1  1        secondary      no      yes      None

    Cluster statistics:

    {primary:node0}

    @sn-dx-node0> show chassis cluster statistics 

    Control link statistics:

        Control link 0:

        Heartbeat packets sent: 70200

        Heartbeat packets received: 70207

        Heartbeat packet errors: 0

    Fabric link statistics:

        Child link 0

        Probes sent: 159815

        Probes received: 0

        Child link 1

        Probes sent: 0

        Probes received: 0

    Cluster interfaces:

    msaidani@sn-dx-node0> show chassis cluster interfaces 

    Control link status: Up

    Control interfaces: 

        Index   Interface   Monitored-Status   Internal-SA   Security

        0       em0         Up                 Disabled      Disabled  

    Fabric link status: Down

    Fabric interfaces: 

        Name    Child-interface    Status                    Security

                                   (Physical/Monitored)

        fab0    ge-0/0/11          Up   / Down               Disabled   

        fab0   

        fab1    ge-7/0/11          Up   / Up                 Disabled   

        fab1   

    Redundant-ethernet Information:     

        Name         Status      Redundancy-group

        reth0        Down        Not configured   

        reth1        Up          1                

        reth2        Down        2                

    Interface ge-0/0/11 status:

    msaidani@sn-dx-node0> show interfaces ge-0/0/11 

    Physical interface: ge-0/0/11, Enabled, Physical link is Up

      Interface index: 323, SNMP ifIndex: 523

      Link-level type: 64, MTU: 9014, LAN-PHY mode, Link-mode: Full-duplex,

      Speed: 1000mbps, BPDU Error: None, MAC-REWRITE Error: None,

      Loopback: Disabled, Source filtering: Disabled, Flow control: Enabled,

      Auto-negotiation: Enabled, Remote fault: Online

      Device flags   : Present Running

      Interface flags: SNMP-Traps Internal: 0x4000

      Link flags     : None

      CoS queues     : 8 supported, 8 maximum usable queues

      Current address: c0:bf:a7:a5:30:30, Hardware address: c0:bf:a7:a5:2f:0b

      Last flapped   : 2022-01-15 20:06:49 UTC (19:34:08 ago)

      Input rate     : 0 bps (0 pps)

      Output rate    : 2264 bps (1 pps)

      Active alarms  : None

      Active defects : None

      Interface transmit statistics: Disabled

      Logical interface ge-0/0/11.0 (Index 77) (SNMP ifIndex 572)

        Flags: Up SNMP-Traps 0x4000 Encapsulation: ENET2

        Input packets : 123

        Output packets: 170252

        Security: Zone: Null

        Protocol aenet, AE bundle: fab0.0   Link Index: 0

    Troubleshooting actions taken so far:

    • Rebooting both devices several times.
    • Rebooting a single device.
    • Applying "set chassis cluster no-fabric-monitoring" and then rebooting node0.
    • Running "request chassis cluster failover redundancy-group 0 node 0 force".
    • Logging onto the secondary and running "request chassis cluster configuration-synchronize".
    • Moving the fabric cable to a different port and updating the configuration accordingly.
    • Replacing the fabric cable with a new one.


    As a last resort I'm thinking about disabling the cluster and rebuilding it from scratch, but before doing that I wanted to check with you in case you have other suggestions.
    Thank you in advance.




    ------------------------------
    Maroua Saidani
    ------------------------------


  • 2.  RE: srx 1500 cluster issue: received probe packet is ALWAYS zero on the primary node - fabric link is physically up, monitor status down

    Posted 01-17-2022 05:39
    Hi Maroua,

    The cluster status output shows a hardware monitoring failure (HW) on node0. Please check for any active alarms (show chassis alarms) and core dumps (show system core-dumps) on node0. Also note that node0's cluster priority is 255: when you perform a manual failover, the priority is set to 255, and it should be cleared afterwards with the "request chassis cluster failover reset redundancy-group <group-number>" command. Otherwise it will prevent automatic failover when there is a monitoring failure and cause an outage.
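    With the redundancy groups in your configuration, clearing the manual failover flag would look something like this (run on either node):

        {primary:node0}
        @sn-dx-node0> request chassis cluster failover reset redundancy-group 0
        @sn-dx-node0> request chassis cluster failover reset redundancy-group 1
        @sn-dx-node0> request chassis cluster failover reset redundancy-group 2

    Afterwards, "show chassis cluster status" should show "Manual" as "no" again for each group.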

    Thanks.
    Nellikka


  • 3.  RE: srx 1500 cluster issue: received probe packet is ALWAYS zero on the primary node - fabric link is physically up, monitor status down

    Posted 01-17-2022 07:16
    Hi Nellikka,
    Thank you so much for your reply. 
    Yes, I forced the failover as one of the troubleshooting steps, and there is already an outage right now: no services work even when the automatic failover to node1 happens.
    I'm also aware of that HW failure and I'm in the process of checking/replacing my SFPs, cables, etc., but does it have an impact on the fab0 issue? That is what I cannot figure out and wanted to check with you.
    Here is "show chassis alarms" on node0:

    @sn-dx-node0> show chassis alarms 

    node0:

    --------------------------------------------------------------------------

    1 alarms currently active

    Alarm time               Class  Description

    2022-01-15 20:19:52 UTC  Major  FPC 0 Major Errors

    And below are the core dumps:

    @sn-dx-node0> show system core-dumps    

    node0:

    --------------------------------------------------------------------------

    -rw-rw----  1 root  wheel   65404663 Jul 18  2019 /var/crash/vmcore.0.gz

    -rw-rw----  1 root  wheel   52874807 Jul 18  2019 /var/crash/vmcore.1.gz

    -rw-rw----  1 root  wheel   64108763 Jul 22  2019 /var/crash/vmcore.2.gz

    /var/tmp/*core*: No such file or directory

    /var/tmp/pics/*core*: No such file or directory

    /var/crash/kernel.*: No such file or directory

    /var/jails/rest-api/tmp/*core*: No such file or directory

    /tftpboot/corefiles/*core*: No such file or directory

    total files: 3

    /var/crash/corefiles:

    total blocks: 12

    total files: 0


    Thank you in advance 

     



    ------------------------------
    Maroua Saidani
    ------------------------------