SRX


SRX220H2 - Cluster Issues (secondary node flapping): High CPU JSRPD

  • 1.  SRX220H2 - Cluster Issues (secondary node flapping): High CPU JSRPD

    Posted 05-14-2018 10:04

    I have a branch office with a cluster of SRX220H2s that recently started exhibiting flapping on the secondary node.  Every 5-10 minutes, the secondary node is kicked out of the cluster, then re-added several minutes later, before the cycle starts over.  We've tried hard-rebooting the secondary node to see whether it would join and stay in the cluster, but that hasn't helped.
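
    The flapping is visible in the redundancy-group state transitions, which can be watched from either node with the standard status commands:

    show chassis cluster status
    show chassis cluster statistics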

     

    Additionally, I've noticed that the control-plane CPU on the primary node is consistently at 100%, with the jsrpd process consuming an inordinate amount of resources.  We have a number of essentially identical branch clusters elsewhere, none of which show jsrpd consuming high resources.  I know that process handles the cluster messaging between nodes.  Checking the jsrpd logs, I'm seeing something very unusual:

     

     

    May 14 16:55:04 TCP-S: accepted client connection.
    May 14 16:55:04 TCP-S: TCP client from 130.16.0.1/56547 connected
    May 14 16:55:04 TCP-S: TCP peer closed connection
    May 14 16:55:04 last message repeated 100 times (hit threshold of (100))
    May 14 16:55:04 last message repeated 200 times (hit threshold of (200))
    May 14 16:55:04 last message repeated 300 times (hit threshold of (300))
    May 14 16:55:04 last message repeated 400 times (hit threshold of (400))
    May 14 16:55:04 last message repeated 500 times (hit threshold of (500))
    May 14 16:55:04 last message repeated 600 times (hit threshold of (600))
    May 14 16:55:05 last message repeated 700 times (hit threshold of (700))
    May 14 16:55:05 last message repeated 800 times (hit threshold of (800))
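
    As a side note, jsrpd tracing can be turned up under chassis cluster if more detail is needed; the syntax is roughly as below (the trace file name here is arbitrary):

    set chassis cluster traceoptions file jsrpd-trace
    set chassis cluster traceoptions flag all
    show log jsrpd-trace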
    

    Here's the system process extensive command output:

     

     

    show system processes extensive
    node0:
    --------------------------------------------------------------------------
    last pid: 47616;  load averages:  1.28,  1.26,  1.42  up 431+22:43:27    16:59:15
    140 processes: 19 running, 108 sleeping, 2 zombie, 11 waiting
    
    Mem: 210M Active, 149M Inact, 1036M Wired, 145M Cache, 112M Buf, 432M Free
    Swap:
    
      PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
     1403 root        5  76    0   996M 58812K RUN    0    ??? 102.20% flowd_octeon_hm
     1406 root        1 139    0 14096K  7032K RUN    0 727.7H 76.66% jsrpd
       22 root        1 171   52     0K    16K RUN    0 7574.2  0.00% idle: cpu0
       23 root        1 -20 -139     0K    16K RUN    0 118.8H  0.00% swi7: clock
        5 root        1 -16    0     0K    16K rtfifo 0  42.7H  0.00% rtfifo_kern_recv
       25 root        1 -40 -159     0K    16K WAIT   0  40.4H  0.00% swi2: netisr 0
     1413 root        1  76    0 12452K  5768K select 0  33.9H  0.00% license-check
    

    show chassis cluster interfaces:

    Control link status: Up
    
    Control interfaces:
        Index   Interface        Status   Internal-SA
        0       fxp1             Up       Disabled
    
    Fabric link status: Up
    
    Fabric interfaces:
        Name    Child-interface    Status
                                   (Physical/Monitored)
        fab0    ge-0/0/5           Up   / Up
        fab0
        fab1    ge-3/0/5           Up   / Up
        fab1
    
    Redundant-ethernet Information:
        Name         Status      Redundancy-group
        reth0        Up          1
        reth1        Up          1
        reth2        Up          1
    
    Redundant-pseudo-interface Information:
        Name         Status      Redundancy-group
        lo0          Up          0
    
    Interface Monitoring:
        Interface         Weight    Status    Redundancy-group
        ge-3/0/0          255       Down      1
        ge-0/0/0          255       Up        1
    
    {primary:node0}

    Last 100 lines of show log chassisd:

    show log chassisd | last 100
    May 14 16:39:58 SCC: pseudo_create_devs_swfab: Skipping creation of swfab1, since fabric presence is set to true
    May 14 16:39:58 SCC: lcc_detach_interfaces_not_online lcc 1
    May 14 16:39:58 CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(3)
    May 14 16:39:58 CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
    May 14 16:39:58 CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(5)
    May 14 16:40:06 SCC: pfpc ready fpc 3 i2c 1897
    May 14 16:40:06 SCC: fpc 3 clean, bringing online
    May 14 16:40:06 SCC: lcc_send_fpc_online_cmd_generic:  lcc 1 fpc 0
    May 14 16:40:06 SCC: pic_online_req for fpc 3, pic 0  lcc_slot 1 in lcc_recv_pic_online_req
    May 14 16:40:06 SCC: lcc_send_pic_online_ack: On Switch-chassis: fpc 3 pic 0 pic_type 0x669 msg_len 20 tlv_len 0
    May 14 16:40:06 SCC: From SCC send: fru 13361152 lcc_slot 1 online ack to LCC
    May 14 16:40:06 SCC: From Switch-Chassis send: fpc 3 pic 0 online ack to LCC
    May 14 16:40:08 SCC: lcc_recv_pic_attach: pic attach pic 0, flags 0x0, portcount 8, fpc 3
    May 14 16:40:08 SCC: pic_set_online: i2c 0x669 pic 0 fpc 3 state 5 in_issu 0
    May 14 16:40:08 SCC:  pic_type=1641 pic_slot=0 fpc_slot=3 pic_i2c_id=1641
    
    May 14 16:40:08 SCC: fpc slot 3 pic_present 0x0 => 0x1
    May 14 16:40:08 SCC: FPC 3 PIC 0, attaching clean
    May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 0
    
    May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 0
    May 14 16:40:08 SCC: Created pic for ge-3/0/0
    
    May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 1
    
    May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 1
    May 14 16:40:08 SCC: Created pic for ge-3/0/1
    
    May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 2
    
    May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 2
    May 14 16:40:08 SCC: Created pic for ge-3/0/2
    
    May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 3
    
    May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 3
    May 14 16:40:08 SCC: Created pic for ge-3/0/3
    
    May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 4
    
    May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 4
    May 14 16:40:08 SCC: Created pic for ge-3/0/4
    
    May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 5
    
    May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 5
    May 14 16:40:08 SCC: Created pic for ge-3/0/5
    
    May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 6
    
    May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 6
    May 14 16:40:08 SCC: Created pic for ge-3/0/6
    
    May 14 16:40:08 SCC: Creating pic entry, baseport 0, nports 8, port 7
    
    May 14 16:40:08 SCC: create_pic_entry: pic i2c 0x669, hw qs 8 supported qs 8, flags 0x0, pic port 7
    May 14 16:40:08 SCC: Created pic for ge-3/0/7
    
    May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/0
    May 14 16:40:08 SCC: ifdev_create entered ge-3/0/0
    May 14 16:40:08 SCC: ge-3/0/0: large delay buffer cleared
    May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
    May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/1
    May 14 16:40:08 SCC: ifdev_create entered ge-3/0/1
    May 14 16:40:08 SCC: ge-3/0/1: large delay buffer cleared
    May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
    May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/2
    May 14 16:40:08 SCC: ifdev_create entered ge-3/0/2
    May 14 16:40:08 SCC: ge-3/0/2: large delay buffer cleared
    May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
    May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/3
    May 14 16:40:08 SCC: ifdev_create entered ge-3/0/3
    May 14 16:40:08 SCC: ge-3/0/3: large delay buffer cleared
    May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
    May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/4
    May 14 16:40:08 SCC: ifdev_create entered ge-3/0/4
    May 14 16:40:08 SCC: ge-3/0/4: large delay buffer cleared
    May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
    May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/5
    May 14 16:40:08 SCC: ifdev_create entered ge-3/0/5
    May 14 16:40:08 SCC: ge-3/0/5: large delay buffer cleared
    May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
    May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/6
    May 14 16:40:08 SCC: ifdev_create entered ge-3/0/6
    May 14 16:40:08 SCC: ge-3/0/6: large delay buffer cleared
    May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
    May 14 16:40:08 CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for ge-3/0/7
    May 14 16:40:08 SCC: ifdev_create entered ge-3/0/7
    May 14 16:40:08 SCC: ge-3/0/7: large delay buffer cleared
    May 14 16:40:08 SCC: fpc_is_q_neompc: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: fpc_is_q_sangria: no valid ideeprom for slot 3
    May 14 16:40:08 SCC: PIC (fpc 3 pic 0) message operation: add. ifd count 8, flags 0x3 in mesg
    May 14 16:40:08 LCC: ignoring PIC message on LCC
    

    For the moment, I've disabled the switch ports facing the second node (node1) that keeps flapping, just so I don't keep seeing it go on and off, but I can re-enable them if needed.

    Any thoughts are appreciated!

     



  • 2.  RE: SRX220H2 - Cluster Issues (secondary node flapping): High CPU JSRPD

    Posted 05-14-2018 18:37

    Hi,

     

    Please provide the output of the following command:

    show chassis cluster information detail

     

    Is node0's CPU back to normal after disabling the node1 interfaces on the switch?

     



  • 3.  RE: SRX220H2 - Cluster Issues (secondary node flapping): High CPU JSRPD

    Posted 05-15-2018 04:00

    The control-plane CPU for node0 did not return to normal, so I ended up turning the switch ports back on and permitting connectivity between the two nodes.  The cluster reformed and has actually been stable, but I'm still seeing 100% control-plane CPU on node0, much of it still consumed by the jsrpd process.  I'm tempted to forcibly restart the jsrpd process on node0, but I don't know what effect that would have on the cluster or on the operational status of node0.
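
    For reference, the control-plane load on both nodes can also be checked in one shot; in a cluster this reports each node's RE CPU utilization:

    show chassis routing-engine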

     

    Here's the output of show chassis cluster information detail:

     

    show chassis cluster information detail
    node0:
    --------------------------------------------------------------------------
    Redundancy mode:
        Configured mode: active-active
        Operational mode: active-active
    Cluster configuration:
        Heartbeat interval: 1000 ms
        Heartbeat threshold: 3
        Control link recovery: Disabled
        Fabric link down timeout: 66 sec
    Node health information:
        Local node health: Healthy
        Remote node health: Healthy
    
    Redundancy group: 0, Threshold: 255, Monitoring failures: none
        Events:
            Mar  8 18:16:20.013 : hold->secondary, reason: Hold timer expired
            Jul 20 01:42:14.787 : secondary->primary, reason: Remote yield (100/0)
    
    Redundancy group: 1, Threshold: 255, Monitoring failures: none
        Events:
            Mar  8 18:22:52.767 : hold->secondary, reason: Hold timer expired
            Mar  8 18:22:55.679 : secondary->primary, reason: Better priority (100/1)
    
    Redundancy group: 2, Threshold: 255, Monitoring failures: none
        Events:
            Mar  8 18:28:36.929 : hold->secondary, reason: Hold timer expired
            Mar  8 18:28:40.658 : secondary->primary, reason: Better priority (100/1)
    Control link statistics:
        Control link 0:
            Heartbeat packets sent: 37289640
            Heartbeat packets received: 37179519
            Heartbeat packet errors: 0
            Duplicate heartbeat packets received: 0
        Control recovery packet count: 0
        Sequence number of last heartbeat packet sent: 37289640
        Sequence number of last heartbeat packet received: 65685
    Fabric link statistics:
        Child link 0
            Probes sent: 74763886
            Probes received: 74261735
        Child link 1
            Probes sent: 0
            Probes received: 0
    Switch fabric link statistics:
        Probe state : DOWN
        Probes sent: 0
        Probes received: 0
        Probe recv errors: 0
        Probe send errors: 0
        Probe recv dropped: 0
        Sequence number of last probe sent: 0
        Sequence number of last probe received: 0
    
    Chassis cluster LED information:
        Current LED color: Green
        Last LED change reason: No failures
    Control port tagging:
        Disabled
    
    Cold Synchronization:
        Status:
            Cold synchronization completed for: N/A
            Cold synchronization failed for: N/A
            Cold synchronization not known for: N/A
            Current Monitoring Weight: 0
    
        Statistics:
            Number of cold synchronization completed: 0
            Number of cold synchronization failed: 0
    
        Events:
            Mar  8 18:20:04.632 : Cold sync for PFE  is RTO sync in process
            Mar  8 18:20:05.450 : Cold sync for PFE  is Post-req check in process
            Mar  8 18:20:07.439 : Cold sync for PFE  is Completed
    
    Loopback Information:
    
        PIC Name        Loopback        Nexthop     Mbuf
        -------------------------------------------------
                        Success         Success     Success
    
    Interface monitoring:
        Statistics:
            Monitored interface failure count: 110
    
        Events:
            May 14 16:00:47.618 : Interface ge-3/0/0 monitored by rg 1, changed state from Up to Down
            May 14 16:04:49.508 : Interface ge-3/0/0 monitored by rg 1, changed state from Down to Up
            May 14 16:08:41.523 : Interface ge-3/0/0 monitored by rg 1, changed state from Up to Down
            May 14 16:12:42.731 : Interface ge-3/0/0 monitored by rg 1, changed state from Down to Up
            May 14 16:16:29.162 : Interface ge-3/0/0 monitored by rg 1, changed state from Up to Down
            May 14 16:20:31.862 : Interface ge-3/0/0 monitored by rg 1, changed state from Down to Up
            May 14 16:36:14.067 : Interface ge-3/0/0 monitored by rg 1, changed state from Up to Down
            May 14 16:40:14.408 : Interface ge-3/0/0 monitored by rg 1, changed state from Down to Up
            May 14 16:46:42.724 : Interface ge-3/0/0 monitored by rg 1, changed state from Up to Down
            May 14 22:39:28.700 : Interface ge-3/0/0 monitored by rg 1, changed state from Down to Up
    
    Fabric monitoring:
        Status:
            Fabric Monitoring: Enabled
            Activation status: Active
            Fabric Status reported by data plane: Up
            JSRPD internal fabric status: Up
    
    Fabric link events:
            May 14 16:40:10.483 : Fabric link fab0 is down
            May 14 16:40:10.487 : Child ge-0/0/5 of fab0 is down
            May 14 16:40:10.493 : Child ge-3/0/5 of fab1 is up
            May 14 16:40:12.742 : Fabric link fab0 is up
            May 14 16:40:12.753 : Child ge-0/0/5 of fab0 is up
            May 14 16:40:13.458 : Fabric link fab1 is up
            May 14 16:40:13.473 : Child ge-3/0/5 of fab1 is up
            May 14 16:40:16.529 : Child link-0 of fab0 is up, pfe notification
            May 14 16:40:16.684 : Child link-0 of fab1 is up, pfe notification
            May 14 16:40:17.573 : Fabric link up, link status timer
    
    Control link status: Up
        Server information:
            Server status : Running
            Server connected to None
        Client information:
            Client status : Inactive
            Client connected to None
    Control port tagging:
        Disabled
    
    Control link events:
            May 14 16:04:33.419 : Control link fxp1 is up
            May 14 16:08:27.790 : Control link down, link status timer
            May 14 16:08:41.583 : Control link fxp1 is up
            May 14 16:12:25.540 : Control link fxp1 is up
            May 14 16:16:16.535 : Control link down, link status timer
            May 14 16:16:29.293 : Control link fxp1 is up
            May 14 16:20:14.217 : Control link fxp1 is up
            May 14 16:36:00.062 : Control link down, link status timer
            May 14 16:36:14.143 : Control link fxp1 is up
            May 14 16:39:58.684 : Control link fxp1 is up
    
    Hardware monitoring:
        Status:
            Activation status: Enabled
            Redundancy group 0 failover for hardware faults: Enabled
            Hardware redundancy group 0 errors: 0
            Hardware redundancy group 1 errors: 0
    
    Schedule monitoring:
        Status:
            Activation status: Disabled
            Schedule slip detected: None
            Timer ignored: No
    
        Statistics:
            Total slip detected count: 3510
            Longest slip duration: 9(s)
    
        Events:
            May 15 10:32:01.782 : Detected schedule slip
            May 15 10:33:01.972 : Cleared schedule slip
            May 15 10:37:04.209 : Detected schedule slip
            May 15 10:38:04.528 : Cleared schedule slip
            May 15 10:42:06.585 : Detected schedule slip
            May 15 10:43:06.675 : Cleared schedule slip
            May 15 10:47:08.831 : Detected schedule slip
            May 15 10:48:08.890 : Cleared schedule slip
            May 15 10:52:10.837 : Detected schedule slip
            May 15 10:53:10.993 : Cleared schedule slip
    
    node1:
    --------------------------------------------------------------------------
    Redundancy mode:
        Configured mode: active-active
        Operational mode: active-active
    Cluster configuration:
        Heartbeat interval: 1000 ms
        Heartbeat threshold: 3
        Control link recovery: Disabled
        Fabric link down timeout: 66 sec
    Node health information:
        Local node health: Healthy
        Remote node health: Healthy
    
    Redundancy group: 0, Threshold: 255, Monitoring failures: none
        Events:
            May 14 16:30:59.516 : hold->secondary, reason: Hold timer expired
    
    Redundancy group: 1, Threshold: 255, Monitoring failures: none
        Events:
            May 14 16:30:59.761 : hold->secondary, reason: Hold timer expired
    
    Redundancy group: 2, Threshold: 255, Monitoring failures: none
        Events:
            May 14 16:30:59.781 : hold->secondary, reason: Hold timer expired
    Control link statistics:
        Control link 0:
            Heartbeat packets sent: 65686
            Heartbeat packets received: 64764
            Heartbeat packet errors: 0
            Duplicate heartbeat packets received: 0
        Control recovery packet count: 0
        Sequence number of last heartbeat packet sent: 65686
        Sequence number of last heartbeat packet received: 37289641
    Fabric link statistics:
        Child link 0
            Probes sent: 131398
            Probes received: 131397
        Child link 1
            Probes sent: 0
            Probes received: 0
    Switch fabric link statistics:
        Probe state : DOWN
        Probes sent: 0
        Probes received: 0
        Probe recv errors: 0
        Probe send errors: 0
        Probe recv dropped: 0
        Sequence number of last probe sent: 0
        Sequence number of last probe received: 0
    
    Chassis cluster LED information:
        Current LED color: Green
        Last LED change reason: No failures
    Control port tagging:
        Disabled
    
    Cold Synchronization:
        Status:
            Cold synchronization completed for: N/A
            Cold synchronization failed for: N/A
            Cold synchronization not known for: N/A
            Current Monitoring Weight: 0
    
        Statistics:
            Number of cold synchronization completed: 0
            Number of cold synchronization failed: 0
    
        Events:
            May 14 16:31:50.807 : Cold sync for PFE  is RTO sync in process
            May 14 16:31:54.200 : Cold sync for PFE  is Post-req check in process
            May 14 16:31:56.205 : Cold sync for PFE  is Completed
    
    Loopback Information:
    
        PIC Name        Loopback        Nexthop     Mbuf
        -------------------------------------------------
                        Success         Success     Success
    
    Interface monitoring:
        Statistics:
            Monitored interface failure count: 1
    
        Events:
            May 14 16:31:16.111 : Interface ge-0/0/0 monitored by rg 1, changed state from Down to Up
            May 14 16:31:50.756 : Interface ge-3/0/0 monitored by rg 1, changed state from Down to Up
            May 14 16:38:19.150 : Interface ge-3/0/0 monitored by rg 1, changed state from Up to Down
            May 14 22:31:04.275 : Interface ge-3/0/0 monitored by rg 1, changed state from Down to Up
    
    Fabric monitoring:
        Status:
            Fabric Monitoring: Enabled
            Activation status: Active
            Fabric Status reported by data plane: Up
            JSRPD internal fabric status: Up
    
    Fabric link events:
            May 14 16:31:45.984 : Child ge-0/0/5 of fab0 is down
            May 14 16:31:46.310 : Child ge-3/0/5 of fab1 is down
            May 14 16:31:46.408 : Child ge-3/0/5 of fab1 is up
            May 14 16:31:47.850 : Fabric monitoring suspension is revoked by remote node
            May 14 16:31:48.986 : Fabric link fab0 is up
            May 14 16:31:48.996 : Child ge-0/0/5 of fab0 is up
            May 14 16:31:49.737 : Fabric link fab1 is up
            May 14 16:31:49.745 : Child ge-3/0/5 of fab1 is up
            May 14 16:31:52.416 : Child link-0 of fab1 is up, pfe notification
            May 14 16:31:53.431 : Fabric link up, link status timer
    
    Control link status: Up
        Server information:
            Server status : Inactive
            Server connected to None
        Client information:
            Client status : Inactive
            Client connected to None
    Control port tagging:
        Disabled
    
    Control link events:
            May 14 16:30:27.927 : Control link fxp1 is down
            May 14 16:30:29.828 : Control link fxp1 is down
            May 14 16:30:37.315 : Control link fxp1 is up
            May 14 16:31:14.583 : Control link fxp1 is up
            May 14 16:31:25.891 : Control link fxp1 is up
    
    Hardware monitoring:
        Status:
            Activation status: Enabled
            Redundancy group 0 failover for hardware faults: Enabled
            Hardware redundancy group 0 errors: 0
            Hardware redundancy group 1 errors: 0
    
    Schedule monitoring:
        Status:
            Activation status: Disabled
            Schedule slip detected: None
            Timer ignored: No
    
        Statistics:
            Total slip detected count: 1
            Longest slip duration: 3(s)
    
        Events:
            May 14 16:30:36.691 : Detected schedule slip
            May 14 16:31:37.251 : Cleared schedule slip
    
    {primary:node0}
    


  • 4.  RE: SRX220H2 - Cluster Issues (secondary node flapping): High CPU JSRPD

    Posted 05-15-2018 20:42

    Hi,

    As per the output provided, I see a lot of jsrpd scheduler slips on node0. This may be because of high RE CPU on node0. You could try failing over all RGs to node1 and observing the RE CPU on node0.

     

        Statistics:
            Total slip detected count: 3510
            Longest slip duration: 9(s)

    Are the control link and fabric links connected via a switch? I see a lot of flaps on May 14.
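
    For reference, a manual failover of the data-plane RGs to node1 would look like this (and similarly for redundancy-group 0 if you also want to move RE mastership); the reset afterwards clears the manual-failover flag, which otherwise pins the new primary's priority at 255:

    request chassis cluster failover redundancy-group 1 node 1
    request chassis cluster failover redundancy-group 2 node 1
    request chassis cluster failover reset redundancy-group 1
    request chassis cluster failover reset redundancy-group 2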



  • 5.  RE: SRX220H2 - Cluster Issues (secondary node flapping): High CPU JSRPD
    Best Answer

    Posted 05-21-2018 05:15
    Just to update and close: we still could not determine the root cause, but we were able to resolve the issue by rebooting node0. The cluster reformed, and the jsrpd process utilization returned to normal levels, along with the overall control-plane CPU utilization.