
Flaky SRX Cluster cycles for no reason; 12.3X48-D45.6; SRX210HE2; vlans w/out family ethernet-switching

  • 1.  Flaky SRX Cluster cycles for no reason; 12.3X48-D45.6; SRX210HE2; vlans w/out family ethernet-switching

    Posted 04-10-2017 06:03

    Hi, folks:

     

    Scratching my head on this one.  Have a cluster of 2 x SRX210HE2.  Cluster has been only marginally stable for roughly 2 years.  Just upgraded to 12.3X48-D45.6 and got motivated to really look into the problem.  Control and fab interfaces are direct connect; 3' of (overkill) shielded Cat 7.  Have changed cables, so it's not that.

     

    Everything is fine for hours/days, then suddenly jsrpd thinks the fabric interface went down and initiates a failover.  The failover churns for a while, cold-synchs, then switches back.  This happens fairly regularly, hours to days in periodicity, with no specific driving event (maybe a traffic spike or some other network event, but nothing obvious or scheduled that I've been able to correlate).

     

    Interestingly, these 'hiccups' don't appear to be noted in show chassis cluster information.  Weird.

     

    Clustertrace and messages files from both nodes are attached for one of these complete cycles that happened last night.  The -node1 files are from the backup node; the filenames without annotation are from node0, the primary.

     

    I enabled cluster traceoptions at debug level and I get:

     

    **

    hours/days of normal:

    Apr  9 22:01:46 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29

    Apr  9 22:01:48 [jsrpd_service_incoming_hb]: HB Received len=137

     

    **

    Then, an event that starts like this (again, no rhyme or reason, just "happens"):

    Apr  9 22:01:50 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29

    Apr  9 22:01:51 [jsrpd_service_incoming_hb]: HB Received len=137

    Apr  9 22:01:51 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29

    Apr  9 22:01:53 [jsrpd_pfe_rx_ipc_msg_handler]: Received fabric monitor status

    Apr  9 22:01:53 [jsrpd_pfe_fabmon_status_update]: Received fabric monitor link status from node0 PFE

    Apr  9 22:01:53 [jsrpd_pfe_fabmon_status_update]: lnk_idx:0 name(ge-0/0/1) ifd_state(curr:1, new:1), pfe_state(curr:1, new:0)

    Apr  9 22:01:53 [jsrpd_set_fab_chld_lnk_state]: State of lnk-0 of fab0 changed from UP to DOWN

    Apr  9 22:01:53 [jsrpd_snmp_interface_state_trap]: Successfully sent jnxJsChClusterIntfTrap trap with severity minor to inform that fab 0 lnk- 0 -  ge-0/0/1 state changed from UP to DOWN on cluster 1; reason: status changed

    Apr  9 22:01:53 [jsrpd_sigusr2_handler]: received SIGUSR2, pid 1678

    Apr  9 22:01:53 [jsrpd_sigusr2_handler]: received SIGUSR2, pid 1678

    Apr  9 22:01:53 [jsrpd_pfe_fabmon_status_update]: Fabric link current state: UP link up count:0

    Apr  9 22:01:53 [jsrpd_pfe_fabmon_status_update]: Fabric link state change detected current state: UP fabric link up count:0

    Apr  9 22:01:53 [jsrpd_pfe_fabmon_status_update]: Control link status UP, Fabric monitoring: Enabled

    Apr  9 22:01:53 [fsm_set_link_status_timer]: link status timer started, delay=1000

    Apr  9 22:01:53 [jsrpd_fabric_info_dump]: HA Fabric Info: After fabric status is updated

    Apr  9 22:01:53 [jsrpd_fabric_info_dump]:   node0: fab0 is Active with 1 child (AggId: 130)

    Apr  9 22:01:53 [jsrpd_fabric_info_dump]:    fab0: ge-0/0/1 (0/0/1) is Active : ifd_state: Up pfe_state: Down

    Apr  9 22:01:53 [jsrpd_fabric_info_dump]:    fab1:  (0/0/0) is Inactive : ifd_state: Down pfe_state: Down

    Apr  9 22:01:53 [jsrpd_fabric_info_dump]:   node1: fab1 is Active with 1 child (AggId: 150)

    Apr  9 22:01:53 [jsrpd_fabric_info_dump]:    fab0: ge-2/0/1 (2/0/1) is Active : ifd_state: Up pfe_state: Up

    Apr  9 22:01:53 [jsrpd_fabric_info_dump]:    fab1:  (0/0/0) is Inactive : ifd_state: Down pfe_state: Down

    Apr  9 22:01:53 [jsrpd_pfe_fabmon_status_update]: jsrpd_pfe_fabmon_status_update: fab child link state change detected

    Apr  9 22:01:53 [get_fpc_frm_intf]: printing fpc_num 0

    Apr  9 22:01:53 [jsrpd_ssam_fab_update]: SSAM_BLOB_CONTENTS: intf name: ge-0/0/1 old_mon_state: 1 old_link_state: 1 new_mon_state: 1 new_link_state: 0

    Apr  9 22:01:53 [get_fpc_frm_intf]: printing fpc_num

    Apr  9 22:01:53 [jsrpd_ssam_fab_update]: SSAM_BLOB_CONTENTS: intf name:  old_mon_state: 0 old_link_state: 0 new_mon_state: 0 new_link_state: 0

    Apr  9 22:01:53 [get_fpc_frm_intf]: printing fpc_num 2

     

    **

    The syslog messages file shows this when the event starts:

    Apr  9 22:01:50  r0 PERF_MON: RTPERF_CPU_THRESHOLD_EXCEEDED: FPC 0 PIC 0 CPU utilization exceeds threshold, current value=100

    Apr  9 22:02:05  r0 jsrpd[1678]: JSRPD_HA_HEALTH_WEIGHT_LOW: Detected cluster1-Node1-RG1's health weight(0) low, send out SNMP trap

    Apr  9 22:02:05  r0 jsrpd[1678]: JSRPD_HA_HEALTH_WEIGHT_LOW: Detected cluster1-Node1-RG2's health weight(0) low, send out SNMP trap

    Apr  9 22:02:06  r0 /kernel: STP: STP IPC op 1 (ForwardingState) failed, err 1 (Unknown)

    Apr  9 22:02:06  r0 last message repeated 8 times

    Apr  9 22:02:25  r0 jsrpd[1678]: JSRPD_HA_CONTROL_LINK_DOWN: HA control link monitor status is marked down

    Apr  9 22:02:27  r0 /kernel: STP: STP IPC op 1 (ForwardingState) failed, err 1 (Unknown)

    Apr  9 22:02:27  r0 last message repeated 7 times

     

    **

    It churns away for about 3 minutes, downing the reths, cold-synching, and doing other aerobics, then eventually recovers:

    Apr  9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: jsrpd_pfe_fabmon_update_lnk_status: lnk_idx:0, link_state(curr:1, new:1)

    Apr  9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: jsrpd_pfe_fabmon_update_lnk_status: lnk_idx:1, link_state(curr:0, new:0)

    Apr  9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: jsrpd_pfe_fabmon_update_lnk_status: lnk_idx:2, link_state(curr:1, new:255)

    Apr  9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: State of lnk-0 of fab1 remains UP

    Apr  9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: jsrpd_pfe_fabmon_update_lnk_status: lnk_idx:3, link_state(curr:0, new:255)

    Apr  9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: State of lnk-1 of fab1 remains DOWN

    Apr  9 22:03:38 [jsrpd_fabric_info_dump]: HA Fabric Info: After fabric child status is updated

    Apr  9 22:03:38 [jsrpd_fabric_info_dump]:   node0: fab0 is Active with 1 child (AggId: 130)

    Apr  9 22:03:38 [jsrpd_fabric_info_dump]:    fab0: ge-0/0/1 (0/0/1) is Active : ifd_state: Up pfe_state: Up

    Apr  9 22:03:38 [jsrpd_fabric_info_dump]:    fab1:  (0/0/0) is Inactive : ifd_state: Down pfe_state: Down

    Apr  9 22:03:38 [jsrpd_fabric_info_dump]:   node1: fab1 is Active with 1 child (AggId: 158)

    Apr  9 22:03:38 [jsrpd_fabric_info_dump]:    fab0: ge-2/0/1 (2/0/1) is Active : ifd_state: Up pfe_state: Up

    Apr  9 22:03:38 [jsrpd_fabric_info_dump]:    fab1:  (0/0/0) is Inactive : ifd_state: Down pfe_state: Down

    Apr  9 22:03:38 [fsm_process_link_status_event]: control link UP/UP fabric link UP/UP

    Apr  9 22:03:38 [fsm_process_link_status_event]: no change to both control (UP) and fabric (UP) link status

    Apr  9 22:03:38 [jsrpd_service_incoming_hb]: HB Received len=137

    Apr  9 22:03:38 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29

    Apr  9 22:03:38 [jsrpd_snmp_health_weight_trap]: Successfully sent an snmp-trap due to change of weight.cluster1-node1-RG1: Healthy(255)

    Apr  9 22:03:38 [jsrpd_snmp_health_weight_trap]: Successfully sent an snmp-trap due to change of weight.cluster1-node1-RG2: Healthy(255)

    Apr  9 22:03:39 [jsrpd_service_incoming_hb]: HB Received len=137

    Apr  9 22:03:39 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29

    Apr  9 22:03:40 [jsrpd_service_incoming_hb]: HB Received len=137

    Apr  9 22:03:40 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29

    Apr  9 22:03:41 [jsrpd_service_incoming_hb]: HB Received len=137

     

    **

    Aside from these bounces, the cluster functions normally and is otherwise unremarkable.

     

    Perhaps a clue: as shown above, each event opens with the RTPERF_CPU_THRESHOLD_EXCEEDED message, followed shortly by the JSRPD_HA_HEALTH_WEIGHT_LOW traps.

     

    However, I wasn't able to find any documentation about the cluster automatically degrading when CPU hits 100.  Also, if that were the case, I'd expect saner reporting of the event in show chassis cluster information.  From everything I know about SRX, occasionally hitting 100% CPU is not unusual and hardly seems like cause to bounce the cluster.
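    In case it helps anyone dig in, these are the standard operational commands I've been watching around these events:

     

    @r0> show chassis cluster status

    @r0> show chassis cluster statistics

    @r0> show chassis cluster interfaces

     

    (statistics shows the control/fabric heartbeat and probe counters; interfaces shows the control and fabric link status; status output is pasted further down.)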

     

    Ideas?! Help?!

     

    Thanks in advance!

    Jeff

     

     

    @r0> show configuration chassis        

    cluster {

        traceoptions {

            file clustertrace size 100m files 5;

            flag all;

            level debug;

        }

        reth-count 2;

        redundancy-group 0 {

            node 0 priority 100;

            node 1 priority 1;

        }

        redundancy-group 1 {

            node 0 priority 100;

            node 1 priority 1;

            interface-monitor {

                fe-0/0/2 weight 255;

                fe-2/0/2 weight 255;

            }

        }

        redundancy-group 2 {

            node 0 priority 100;

            node 1 priority 1;

            interface-monitor {

                ge-0/0/0 weight 255;

                ge-2/0/0 weight 255;

            }

        }

    }

     

    Nothing seems to be crashing:

    @r0> show system core-dumps 

    node0:

    --------------------------------------------------------------------------

    /var/crash/*core*: No such file or directory

    /var/tmp/*core*: No such file or directory

    /var/tmp/pics/*core*: No such file or directory

    /var/crash/kernel.*: No such file or directory

    /tftpboot/corefiles/*core*: No such file or directory

     

    node1:

    --------------------------------------------------------------------------

    /var/crash/*core*: No such file or directory

    /var/tmp/*core*: No such file or directory

    /var/tmp/pics/*core*: No such file or directory

    /var/crash/kernel.*: No such file or directory

    /tftpboot/corefiles/*core*: No such file or directory

     

    @r0> show chassis routing-engine         

    node0:

    --------------------------------------------------------------------------

    Routing Engine status:

        Temperature                 58 degrees C / 136 degrees F

        Total memory              2048 MB Max   881 MB used ( 43 percent)

          Control plane memory    1072 MB Max   536 MB used ( 50 percent)

          Data plane memory        976 MB Max   342 MB used ( 35 percent)

        CPU utilization:

          User                       6 percent

          Background                 0 percent

          Kernel                     8 percent

          Interrupt                  0 percent

          Idle                      87 percent

        Model                          RE-SRX210HE2

        Serial ID                      ACKT2857

        Start time                     2017-04-08 21:05:40 CDT

        Uptime                         1 day, 10 hours, 48 minutes, 1 second

        Last reboot reason             0x20:power-button soft power off

        Load averages:                 1 minute   5 minute  15 minute

                                           0.33       0.17       0.11

     

    node1:

    --------------------------------------------------------------------------

    Routing Engine status:

        Temperature                 55 degrees C / 131 degrees F

        Total memory              2048 MB Max   840 MB used ( 41 percent)

          Control plane memory    1072 MB Max   493 MB used ( 46 percent)

          Data plane memory        976 MB Max   342 MB used ( 35 percent)

        CPU utilization:

          User                       4 percent

          Background                 0 percent

          Kernel                     7 percent

          Interrupt                  0 percent

          Idle                      89 percent

        Model                          RE-SRX210HE2

        Serial ID                      ACKP5227

        Start time                     2017-04-08 21:28:36 CDT

        Uptime                         1 day, 10 hours, 25 minutes, 4 seconds

        Last reboot reason             0x20:power-button soft power off

        Load averages:                 1 minute   5 minute  15 minute

                                           0.20       0.05       0.02

     

    @r0> show security monitoring fpc 0      

    node0:

    --------------------------------------------------------------------------

    FPC 0

      PIC 0

        CPU utilization          :   21 %

        Memory utilization       :   35 %

        Current flow session     : 5581

        Current flow session IPv4: 5581

        Current flow session IPv6:    0

        Max flow session         : 65536

    Total Session Creation Per Second (for last 96 seconds on average):  322

    IPv4  Session Creation Per Second (for last 96 seconds on average):  322

    IPv6  Session Creation Per Second (for last 96 seconds on average):    0

     

    node1:

    --------------------------------------------------------------------------

    FPC 0

      PIC 0

        CPU utilization          :    8 %

        Memory utilization       :   35 %

        Current flow session     : 6839

        Current flow session IPv4: 6839

        Current flow session IPv6:    0

        Max flow session         : 65536

    Total Session Creation Per Second (for last 96 seconds on average):  287

    IPv4  Session Creation Per Second (for last 96 seconds on average):  287

    IPv6  Session Creation Per Second (for last 96 seconds on average):    0

     

    @r0> show chassis cluster status 

    Monitor Failure codes:

        CS  Cold Sync monitoring        FL  Fabric Connection monitoring

        GR  GRES monitoring             HW  Hardware monitoring

        IF  Interface monitoring        IP  IP monitoring

        LB  Loopback monitoring         MB  Mbuf monitoring

        NH  Nexthop monitoring          NP  NPC monitoring              

        SP  SPU monitoring              SM  Schedule monitoring

        CF  Config Sync monitoring

     

    Cluster ID: 1

    Node   Priority Status         Preempt Manual   Monitor-failures

     

    Redundancy group: 0 , Failover count: 3

    node0  100      primary        no      no       None           

    node1  1        secondary      no      no       None           

     

    Redundancy group: 1 , Failover count: 3

    node0  100      primary        no      no       None           

    node1  1        secondary      no      no       None           

     

    Redundancy group: 2 , Failover count: 3

    node0  100      primary        no      no       None           

    node1  1        secondary      no      no       None           

     

    @r0> show chassis cluster information 

    node0:

    --------------------------------------------------------------------------

    Redundancy Group Information:

     

        Redundancy Group 0 , Current State: primary, Weight: 255

     

    Time            From           To             Reason

            Apr  8 21:10:22 hold           secondary      Hold timer expired

            Apr  8 21:10:38 secondary      primary        Only node present

            Apr  8 21:40:32 primary        secondary-hold Manual failover

            Apr  8 21:45:32 secondary-hold secondary      Ready to become secondary

            Apr  8 21:47:21 secondary      primary        Remote is in secondary hold

     

        Redundancy Group 1 , Current State: primary, Weight: 255

     

    Time            From           To             Reason

            Apr  8 21:10:23 hold           secondary      Hold timer expired

            Apr  8 21:10:38 secondary      primary        Only node present

            Apr  8 21:38:38 primary        secondary-hold Manual failover

            Apr  8 21:38:39 secondary-hold secondary      Ready to become secondary

            Apr  8 21:46:56 secondary      primary        Remote is in secondary hold

     

        Redundancy Group 2 , Current State: primary, Weight: 255

     

    Time            From           To             Reason

            Apr  8 21:10:23 hold           secondary      Hold timer expired

            Apr  8 21:10:38 secondary      primary        Only node present

            Apr  8 21:38:50 primary        secondary-hold Manual failover

            Apr  8 21:38:51 secondary-hold secondary      Ready to become secondary

            Apr  8 21:45:05 secondary      primary        Remote is in secondary hold

     

    Chassis cluster LED information:

        Current LED color: Green

        Last LED change reason: No failures

    Control port tagging:

        Disabled

     

    node1:

    --------------------------------------------------------------------------

    Redundancy Group Information:           

     

        Redundancy Group 0 , Current State: secondary, Weight: 255

     

    Time            From           To             Reason

            Apr  8 21:33:08 hold           secondary      Hold timer expired

            Apr  8 21:40:32 secondary      primary        Remote is in secondary hold

            Apr  8 21:47:21 primary        secondary-hold Manual failover

            Apr  8 21:52:21 secondary-hold secondary      Ready to become secondary

     

        Redundancy Group 1 , Current State: secondary, Weight: 255

     

    Time            From           To             Reason

            Apr  8 21:33:08 hold           secondary      Hold timer expired

            Apr  8 21:38:37 secondary      primary        Remote is in secondary hold

            Apr  8 21:46:56 primary        secondary-hold Manual failover

            Apr  8 21:46:57 secondary-hold secondary      Ready to become secondary

     

        Redundancy Group 2 , Current State: secondary, Weight: 255

     

    Time            From           To             Reason

            Apr  8 21:33:08 hold           secondary      Hold timer expired

            Apr  8 21:38:49 secondary      primary        Remote is in secondary hold

            Apr  8 21:45:05 primary        secondary-hold Manual failover

            Apr  8 21:45:06 secondary-hold secondary      Ready to become secondary

     

    Chassis cluster LED information:

        Current LED color: Green

        Last LED change reason: No failures

    Control port tagging:

        Disabled

     

     

     

     

    Attachment(s)

    syslog.9apr.txt (16K)
    syslog-node1.9apr.txt (2K)
    clustertrace.9apr.txt (311K)


  • 2.  RE: Flaky SRX Cluster cycles for no reason; 12.3X48-D45.6; SRX210HE2; vlans w/out family ethernet-switching

    Posted 04-13-2017 17:33

    Ping?  Any takers?  Juniper engineers: is this level of instability normal and/or acceptable?

     

    I'm seeing one of these bounces roughly every 24 hours, with no particular pattern in timing or other periodicity.  At each event, the first syslog message on the primary cluster node is:

     

    Apr 11 00:17:54  r0 PERF_MON: RTPERF_CPU_THRESHOLD_EXCEEDED: FPC 0 PIC 0 CPU utilization exceeds threshold, current value=100

     

    So, the question: is there an automatic cluster node degrade initiated at some load threshold?  My overall load is OK, but I don't know whether the occasional "100" load is a cause or a result of the cluster issues.

     

    jschmidt@r0> show chassis routing-engine 

    node0:

    --------------------------------------------------------------------------

    Routing Engine status:

        Temperature                 57 degrees C / 134 degrees F

        Total memory              2048 MB Max   922 MB used ( 45 percent)

          Control plane memory    1072 MB Max   579 MB used ( 54 percent)

          Data plane memory        976 MB Max   342 MB used ( 35 percent)

        CPU utilization:

          User                       6 percent

          Background                 0 percent

          Kernel                     7 percent

          Interrupt                  0 percent

          Idle                      87 percent

        Model                          RE-SRX210HE2

        Serial ID                      ACKT2857

        Start time                     2017-04-08 21:05:40 CDT

        Uptime                         4 days, 22 hours, 25 minutes, 14 seconds

        Last reboot reason             0x20:power-button soft power off

        Load averages:                 1 minute   5 minute  15 minute

                                           0.11       0.18       0.16

     

    node1:

    --------------------------------------------------------------------------

    Routing Engine status:

        Temperature                 54 degrees C / 129 degrees F

        Total memory              2048 MB Max   881 MB used ( 43 percent)

          Control plane memory    1072 MB Max   536 MB used ( 50 percent)

          Data plane memory        976 MB Max   342 MB used ( 35 percent)

        CPU utilization:

          User                       4 percent

          Background                 0 percent

          Kernel                     6 percent

          Interrupt                  0 percent

          Idle                      90 percent

        Model                          RE-SRX210HE2

        Serial ID                      ACKP5227

        Start time                     2017-04-08 21:28:36 CDT

        Uptime                         4 days, 22 hours, 2 minutes, 19 seconds

        Last reboot reason             0x20:power-button soft power off

        Load averages:                 1 minute   5 minute  15 minute

                                           0.13       0.07       0.02

     



  • 3.  RE: Flaky SRX Cluster cycles for no reason; 12.3X48-D45.6; SRX210HE2; vlans w/out family ethernet-switching

     
    Posted 04-13-2017 18:28

    Hi,

     

    No, SRX clusters are not normally as unstable as you describe. I would assume that the high CPU utilisation on the FPC is causing the failover. If the FPC is reaching 100%, I would look at upgrading to a larger SRX that can handle the load.

     

    To try to avoid the failover, look at tuning the heartbeat and failover parameters; see https://www.juniper.net/techpubs/en_US/junos/topics/concept/chassis-cluster-control-link-heartbeat-understanding.html
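    For example, relaxing control-link detection to the loosest supported values would look something like this (illustrative only; check the ranges supported on your platform and release before committing):

     

    set chassis cluster heartbeat-interval 2000

    set chassis cluster heartbeat-threshold 8

     

    That tolerates 8 missed heartbeats at 2000 ms each (16 seconds) before the peer is declared down, which can ride out short control-plane stalls.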

     

    Tim



  • 4.  RE: Flaky SRX Cluster cycles for no reason; 12.3X48-D45.6; SRX210HE2; vlans w/out family ethernet-switching

    Posted 04-14-2017 07:58

    Thanks for the response!

     

    I have looked at the heartbeat parameters, including the article you referenced.  However, my problem isn't that I'm missing heartbeats or that the control link is failing; it's that (I think) the SRX is initiating a failover when the CPU hits 100 (I'm not sure whether that 100 is a percentage or some other unit).  I haven't been able to find any reference to a cluster node failing over due to CPU load, nor any of the parameters I'd expect to go with it, such as the load threshold that triggers the failover or how long the CPU spike must persist before failover kicks in.  I would expect all of these sorts of parameters to be documented if there is a CPU-triggered failover.  Ideas or pointers on that?

     

    Thx,
    Jeff