Hi, folks:
Scratching my head on this one. I have a cluster of 2 x SRX210HE2. The cluster has been only marginally stable for roughly two years. I just upgraded to 12.3X48-D45.6 and got motivated to really dig into the problem. Control and fab interfaces are direct connect: 3' of (overkill) Cat 7 shielded. I've already swapped cables, so it's not that.
Everything is fine for hours or days, then suddenly jsrpd decides the fabric interface went down and initiates a failover. The failover churns for a while, cold-syncs, then switches back. This happens fairly regularly, with a periodicity of hours to days, and no specific driving event (it may be a traffic spike or some other network event, but nothing obvious or scheduled that I've been able to correlate).
Interestingly, these 'hiccups' don't appear to be recorded in show chassis cluster information. Weird.
The clustertrace and messages logs from both nodes are attached for one complete cycle that happened last night. The -node1 files are from the backup node; the filenames without annotation are from node0, the primary.
I enabled cluster traceoptions at debug level, and here's what I see:
**
Hours/days of normal operation look like this:
Apr 9 22:01:46 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29
Apr 9 22:01:48 [jsrpd_service_incoming_hb]: HB Received len=137
**
Then an event starts, like this (again, no rhyme or reason; it just happens):
Apr 9 22:01:50 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29
Apr 9 22:01:51 [jsrpd_service_incoming_hb]: HB Received len=137
Apr 9 22:01:51 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29
Apr 9 22:01:53 [jsrpd_pfe_rx_ipc_msg_handler]: Received fabric monitor status
Apr 9 22:01:53 [jsrpd_pfe_fabmon_status_update]: Received fabric monitor link status from node0 PFE
Apr 9 22:01:53 [jsrpd_pfe_fabmon_status_update]: lnk_idx:0 name(ge-0/0/1) ifd_state(curr:1, new:1), pfe_state(curr:1, new:0)
Apr 9 22:01:53 [jsrpd_set_fab_chld_lnk_state]: State of lnk-0 of fab0 changed from UP to DOWN
Apr 9 22:01:53 [jsrpd_snmp_interface_state_trap]: Successfully sent jnxJsChClusterIntfTrap trap with severity minor to inform that fab 0 lnk- 0 - ge-0/0/1 state changed from UP to DOWN on cluster 1; reason: status changed
Apr 9 22:01:53 [jsrpd_sigusr2_handler]: received SIGUSR2, pid 1678
Apr 9 22:01:53 [jsrpd_sigusr2_handler]: received SIGUSR2, pid 1678
Apr 9 22:01:53 [jsrpd_pfe_fabmon_status_update]: Fabric link current state: UP link up count:0
Apr 9 22:01:53 [jsrpd_pfe_fabmon_status_update]: Fabric link state change detected current state: UP fabric link up count:0
Apr 9 22:01:53 [jsrpd_pfe_fabmon_status_update]: Control link status UP, Fabric monitoring: Enabled
Apr 9 22:01:53 [fsm_set_link_status_timer]: link status timer started, delay=1000
Apr 9 22:01:53 [jsrpd_fabric_info_dump]: HA Fabric Info: After fabric status is updated
Apr 9 22:01:53 [jsrpd_fabric_info_dump]: node0: fab0 is Active with 1 child (AggId: 130)
Apr 9 22:01:53 [jsrpd_fabric_info_dump]: fab0: ge-0/0/1 (0/0/1) is Active : ifd_state: Up pfe_state: Down
Apr 9 22:01:53 [jsrpd_fabric_info_dump]: fab1: (0/0/0) is Inactive : ifd_state: Down pfe_state: Down
Apr 9 22:01:53 [jsrpd_fabric_info_dump]: node1: fab1 is Active with 1 child (AggId: 150)
Apr 9 22:01:53 [jsrpd_fabric_info_dump]: fab0: ge-2/0/1 (2/0/1) is Active : ifd_state: Up pfe_state: Up
Apr 9 22:01:53 [jsrpd_fabric_info_dump]: fab1: (0/0/0) is Inactive : ifd_state: Down pfe_state: Down
Apr 9 22:01:53 [jsrpd_pfe_fabmon_status_update]: jsrpd_pfe_fabmon_status_update: fab child link state change detected
Apr 9 22:01:53 [get_fpc_frm_intf]: printing fpc_num 0
Apr 9 22:01:53 [jsrpd_ssam_fab_update]: SSAM_BLOB_CONTENTS: intf name: ge-0/0/1 old_mon_state: 1 old_link_state: 1 new_mon_state: 1 new_link_state: 0
Apr 9 22:01:53 [get_fpc_frm_intf]: printing fpc_num
Apr 9 22:01:53 [jsrpd_ssam_fab_update]: SSAM_BLOB_CONTENTS: intf name: old_mon_state: 0 old_link_state: 0 new_mon_state: 0 new_link_state: 0
Apr 9 22:01:53 [get_fpc_frm_intf]: printing fpc_num 2
**
The syslog messages file shows this when the event starts:
Apr 9 22:01:50 r0 PERF_MON: RTPERF_CPU_THRESHOLD_EXCEEDED: FPC 0 PIC 0 CPU utilization exceeds threshold, current value=100
Apr 9 22:02:05 r0 jsrpd[1678]: JSRPD_HA_HEALTH_WEIGHT_LOW: Detected cluster1-Node1-RG1's health weight(0) low, send out SNMP trap
Apr 9 22:02:05 r0 jsrpd[1678]: JSRPD_HA_HEALTH_WEIGHT_LOW: Detected cluster1-Node1-RG2's health weight(0) low, send out SNMP trap
Apr 9 22:02:06 r0 /kernel: STP: STP IPC op 1 (ForwardingState) failed, err 1 (Unknown)
Apr 9 22:02:06 r0 last message repeated 8 times
Apr 9 22:02:25 r0 jsrpd[1678]: JSRPD_HA_CONTROL_LINK_DOWN: HA control link monitor status is marked down
Apr 9 22:02:27 r0 /kernel: STP: STP IPC op 1 (ForwardingState) failed, err 1 (Unknown)
Apr 9 22:02:27 r0 last message repeated 7 times
**
It churns away for about three minutes, downing the reths, cold-syncing, and performing other aerobics, then eventually recovers:
Apr 9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: jsrpd_pfe_fabmon_update_lnk_status: lnk_idx:0, link_state(curr:1, new:1)
Apr 9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: jsrpd_pfe_fabmon_update_lnk_status: lnk_idx:1, link_state(curr:0, new:0)
Apr 9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: jsrpd_pfe_fabmon_update_lnk_status: lnk_idx:2, link_state(curr:1, new:255)
Apr 9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: State of lnk-0 of fab1 remains UP
Apr 9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: jsrpd_pfe_fabmon_update_lnk_status: lnk_idx:3, link_state(curr:0, new:255)
Apr 9 22:03:38 [jsrpd_pfe_fabmon_update_lnk_status]: State of lnk-1 of fab1 remains DOWN
Apr 9 22:03:38 [jsrpd_fabric_info_dump]: HA Fabric Info: After fabric child status is updated
Apr 9 22:03:38 [jsrpd_fabric_info_dump]: node0: fab0 is Active with 1 child (AggId: 130)
Apr 9 22:03:38 [jsrpd_fabric_info_dump]: fab0: ge-0/0/1 (0/0/1) is Active : ifd_state: Up pfe_state: Up
Apr 9 22:03:38 [jsrpd_fabric_info_dump]: fab1: (0/0/0) is Inactive : ifd_state: Down pfe_state: Down
Apr 9 22:03:38 [jsrpd_fabric_info_dump]: node1: fab1 is Active with 1 child (AggId: 158)
Apr 9 22:03:38 [jsrpd_fabric_info_dump]: fab0: ge-2/0/1 (2/0/1) is Active : ifd_state: Up pfe_state: Up
Apr 9 22:03:38 [jsrpd_fabric_info_dump]: fab1: (0/0/0) is Inactive : ifd_state: Down pfe_state: Down
Apr 9 22:03:38 [fsm_process_link_status_event]: control link UP/UP fabric link UP/UP
Apr 9 22:03:38 [fsm_process_link_status_event]: no change to both control (UP) and fabric (UP) link status
Apr 9 22:03:38 [jsrpd_service_incoming_hb]: HB Received len=137
Apr 9 22:03:38 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29
Apr 9 22:03:38 [jsrpd_snmp_health_weight_trap]: Successfully sent an snmp-trap due to change of weight.cluster1-node1-RG1: Healthy(255)
Apr 9 22:03:38 [jsrpd_snmp_health_weight_trap]: Successfully sent an snmp-trap due to change of weight.cluster1-node1-RG2: Healthy(255)
Apr 9 22:03:39 [jsrpd_service_incoming_hb]: HB Received len=137
Apr 9 22:03:39 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29
Apr 9 22:03:40 [jsrpd_service_incoming_hb]: HB Received len=137
Apr 9 22:03:40 [fsm_process_hb]: Received TLV hb packet total_len=117 RG info length = 29
Apr 9 22:03:41 [jsrpd_service_incoming_hb]: HB Received len=137
**
Aside from these bounces, the cluster functions normally and is otherwise unremarkable.
Perhaps a clue: the syslog messages file shows this right as the event begins:
Apr 9 22:01:50 r0 PERF_MON: RTPERF_CPU_THRESHOLD_EXCEEDED: FPC 0 PIC 0 CPU utilization exceeds threshold, current value=100
Apr 9 22:02:05 r0 jsrpd[1678]: JSRPD_HA_HEALTH_WEIGHT_LOW: Detected cluster1-Node1-RG1's health weight(0) low, send out SNMP trap
Apr 9 22:02:05 r0 jsrpd[1678]: JSRPD_HA_HEALTH_WEIGHT_LOW: Detected cluster1-Node1-RG2's health weight(0) low, send out SNMP trap
However, I couldn't find any documentation saying the cluster automatically degrades its health weight when flow CPU hits 100%. And if that were the case, I'd expect the event to be reported more sanely in show chassis cluster information. From everything I know about SRX, occasionally hitting 100% CPU is not unusual and hardly seems like cause to bounce the cluster.
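My working theory is that a flow CPU spike causes the PFE to miss fabric probes, which jsrpd then interprets as a fabric child-link failure. Next time it recurs, these are the standard commands I plan to run on both nodes to check (I'm assuming the probe counters will show a gap during the event, but that's a guess until I catch one live):

```
show chassis cluster statistics      ## control/fabric heartbeat and probe counters
show chassis cluster interfaces      ## fabric child-link status as jsrpd sees it
show chassis cluster status          ## current monitor-failure codes, if any
```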
Ideas?! Help?!
Thanks in advance!
Jeff
@r0> show configuration chassis
cluster {
    traceoptions {
        file clustertrace size 100m files 5;
        flag all;
        level debug;
    }
    reth-count 2;
    redundancy-group 0 {
        node 0 priority 100;
        node 1 priority 1;
    }
    redundancy-group 1 {
        node 0 priority 100;
        node 1 priority 1;
        interface-monitor {
            fe-0/0/2 weight 255;
            fe-2/0/2 weight 255;
        }
    }
    redundancy-group 2 {
        node 0 priority 100;
        node 1 priority 1;
        interface-monitor {
            ge-0/0/0 weight 255;
            ge-2/0/0 weight 255;
        }
    }
}
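As a possible stopgap while I chase the root cause, I'm considering disabling fabric monitoring so a transient probe loss can't trigger the failover. My understanding (I haven't committed this yet, and I believe the knob is hidden or absent in some releases, so please correct me if it doesn't apply to 12.3X48) is that on branch SRX it would be:

```
set chassis cluster no-fabric-monitoring
commit
```

Obviously that trades away real fabric-failure detection, so I'd rather understand why the probes are being missed in the first place.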
Nothing seems to be crashing:
@r0> show system core-dumps
node0:
--------------------------------------------------------------------------
/var/crash/*core*: No such file or directory
/var/tmp/*core*: No such file or directory
/var/tmp/pics/*core*: No such file or directory
/var/crash/kernel.*: No such file or directory
/tftpboot/corefiles/*core*: No such file or directory
node1:
--------------------------------------------------------------------------
/var/crash/*core*: No such file or directory
/var/tmp/*core*: No such file or directory
/var/tmp/pics/*core*: No such file or directory
/var/crash/kernel.*: No such file or directory
/tftpboot/corefiles/*core*: No such file or directory
@r0> show chassis routing-engine
node0:
--------------------------------------------------------------------------
Routing Engine status:
Temperature 58 degrees C / 136 degrees F
Total memory 2048 MB Max 881 MB used ( 43 percent)
Control plane memory 1072 MB Max 536 MB used ( 50 percent)
Data plane memory 976 MB Max 342 MB used ( 35 percent)
CPU utilization:
User 6 percent
Background 0 percent
Kernel 8 percent
Interrupt 0 percent
Idle 87 percent
Model RE-SRX210HE2
Serial ID ACKT2857
Start time 2017-04-08 21:05:40 CDT
Uptime 1 day, 10 hours, 48 minutes, 1 second
Last reboot reason 0x20:power-button soft power off
Load averages: 1 minute 5 minute 15 minute
0.33 0.17 0.11
node1:
--------------------------------------------------------------------------
Routing Engine status:
Temperature 55 degrees C / 131 degrees F
Total memory 2048 MB Max 840 MB used ( 41 percent)
Control plane memory 1072 MB Max 493 MB used ( 46 percent)
Data plane memory 976 MB Max 342 MB used ( 35 percent)
CPU utilization:
User 4 percent
Background 0 percent
Kernel 7 percent
Interrupt 0 percent
Idle 89 percent
Model RE-SRX210HE2
Serial ID ACKP5227
Start time 2017-04-08 21:28:36 CDT
Uptime 1 day, 10 hours, 25 minutes, 4 seconds
Last reboot reason 0x20:power-button soft power off
Load averages: 1 minute 5 minute 15 minute
0.20 0.05 0.02
@r0> show security monitoring fpc 0
node0:
--------------------------------------------------------------------------
FPC 0
PIC 0
CPU utilization : 21 %
Memory utilization : 35 %
Current flow session : 5581
Current flow session IPv4: 5581
Current flow session IPv6: 0
Max flow session : 65536
Total Session Creation Per Second (for last 96 seconds on average): 322
IPv4 Session Creation Per Second (for last 96 seconds on average): 322
IPv6 Session Creation Per Second (for last 96 seconds on average): 0
node1:
--------------------------------------------------------------------------
FPC 0
PIC 0
CPU utilization : 8 %
Memory utilization : 35 %
Current flow session : 6839
Current flow session IPv4: 6839
Current flow session IPv6: 0
Max flow session : 65536
Total Session Creation Per Second (for last 96 seconds on average): 287
IPv4 Session Creation Per Second (for last 96 seconds on average): 287
IPv6 Session Creation Per Second (for last 96 seconds on average): 0
@r0> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 3
node0 100 primary no no None
node1 1 secondary no no None
Redundancy group: 1 , Failover count: 3
node0 100 primary no no None
node1 1 secondary no no None
Redundancy group: 2 , Failover count: 3
node0 100 primary no no None
node1 1 secondary no no None
@r0> show chassis cluster information
node0:
--------------------------------------------------------------------------
Redundancy Group Information:
Redundancy Group 0 , Current State: primary, Weight: 255
Time From To Reason
Apr 8 21:10:22 hold secondary Hold timer expired
Apr 8 21:10:38 secondary primary Only node present
Apr 8 21:40:32 primary secondary-hold Manual failover
Apr 8 21:45:32 secondary-hold secondary Ready to become secondary
Apr 8 21:47:21 secondary primary Remote is in secondary hold
Redundancy Group 1 , Current State: primary, Weight: 255
Time From To Reason
Apr 8 21:10:23 hold secondary Hold timer expired
Apr 8 21:10:38 secondary primary Only node present
Apr 8 21:38:38 primary secondary-hold Manual failover
Apr 8 21:38:39 secondary-hold secondary Ready to become secondary
Apr 8 21:46:56 secondary primary Remote is in secondary hold
Redundancy Group 2 , Current State: primary, Weight: 255
Time From To Reason
Apr 8 21:10:23 hold secondary Hold timer expired
Apr 8 21:10:38 secondary primary Only node present
Apr 8 21:38:50 primary secondary-hold Manual failover
Apr 8 21:38:51 secondary-hold secondary Ready to become secondary
Apr 8 21:45:05 secondary primary Remote is in secondary hold
Chassis cluster LED information:
Current LED color: Green
Last LED change reason: No failures
Control port tagging:
Disabled
node1:
--------------------------------------------------------------------------
Redundancy Group Information:
Redundancy Group 0 , Current State: secondary, Weight: 255
Time From To Reason
Apr 8 21:33:08 hold secondary Hold timer expired
Apr 8 21:40:32 secondary primary Remote is in secondary hold
Apr 8 21:47:21 primary secondary-hold Manual failover
Apr 8 21:52:21 secondary-hold secondary Ready to become secondary
Redundancy Group 1 , Current State: secondary, Weight: 255
Time From To Reason
Apr 8 21:33:08 hold secondary Hold timer expired
Apr 8 21:38:37 secondary primary Remote is in secondary hold
Apr 8 21:46:56 primary secondary-hold Manual failover
Apr 8 21:46:57 secondary-hold secondary Ready to become secondary
Redundancy Group 2 , Current State: secondary, Weight: 255
Time From To Reason
Apr 8 21:33:08 hold secondary Hold timer expired
Apr 8 21:38:49 secondary primary Remote is in secondary hold
Apr 8 21:45:05 primary secondary-hold Manual failover
Apr 8 21:45:06 secondary-hold secondary Ready to become secondary
Chassis cluster LED information:
Current LED color: Green
Last LED change reason: No failures
Control port tagging:
Disabled