Dear community,
yesterday we upgraded an SRX1500 cluster in our production environment to a new software version, following Juniper's instructions for upgrading a cluster with minimal downtime (Article ID: KB17947). The upgrade was from version 15.1X49-D150.2 to 18.4R3.3.
Everything went smoothly. After finishing the upgrade we had node0 running as secondary for both RGs (RG0 and RG1) and node1 as primary. We performed a manual failover of both RGs to node0, and after that everything was still OK. But when we checked the NTP status on both nodes, we found that the secondary node1 is not able to synchronize time (it should do so from the primary node over the control link). The SRX is the time provider for the systems behind it, and NTP is critical for our applications. So if a failover to this node ever happened, it would take time for it to synchronize NTP, which would affect our systems.
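For reference, this is roughly how we checked it on each node (nothing more exotic than the standard NTP show commands):

admin@FW02_SRX> show ntp status
admin@FW02_SRX> show ntp associations

On node0 these report a synchronized state; on node1 they show no synchronization.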
So the decision was made to simply restart the secondary node. After the node came up, checking the cluster status shows this:
admin@FW02_SRX> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring RE Relinquish monitoring
IS IRQ storm
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 0
node0 254 primary no no None
node1 1 secondary no no None
Redundancy group: 1 , Failover count: 0
node0 200 primary no no None
node1 0 secondary no no HW
For redundancy group 1, node1 should have priority 100, but here it shows 0. We also see a Monitor-failures status of HW.
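For context, the RG1 priorities are configured with just the standard knobs (quoting from memory, so treat this as a sketch rather than a paste):

set chassis cluster redundancy-group 1 node 0 priority 200
set chassis cluster redundancy-group 1 node 1 priority 100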
And here is the output of another command:
admin@FW02_SRX> show chassis cluster information
node0:
--------------------------------------------------------------------------
Redundancy Group Information:
Redundancy Group 0 , Current State: primary, Weight: 255
Time From To Reason
Jan 26 19:50:07 hold secondary Hold timer expired
Jan 26 20:05:33 secondary primary Remote is in secondary hold
Redundancy Group 1 , Current State: primary, Weight: 255
Time From To Reason
Jan 26 19:50:08 hold secondary Hold timer expired
Jan 26 20:06:15 secondary primary Remote is in secondary hold
Chassis cluster LED information:
Current LED color: Green
Last LED change reason: No failures
node1:
--------------------------------------------------------------------------
Redundancy Group Information:
Redundancy Group 0 , Current State: secondary, Weight: 255
Time From To Reason
Jan 26 21:44:40 hold secondary Hold timer expired
Redundancy Group 1 , Current State: secondary, Weight: 0
Time From To Reason
Jan 26 21:44:41 hold secondary Hold timer expired
Chassis cluster LED information:
Current LED color: Amber
Last LED change reason: Monitored objects are down
Failure Information:
Hardware Monitoring Failure Information:
Redundancy Group Report Time
RG1+ Jan 26 21:44:13.296
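We also looked at the more verbose variant of the same command (not pasted here to keep this readable); as far as we can tell it only repeats the same hardware-monitoring entry for RG1 on node1:

admin@FW02_SRX> show chassis cluster information detail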
I checked other information (interfaces, etc.) and everything looks OK. While checking the logs I don't see what could cause this problem (to be honest, I don't know the SRX internals very deeply yet). Maybe this entry from the chassisd log is related:
Jan 26 21:44:11 LCC: ch_jdaf_client_monitor_reconnect.17011 ch_lcm_reconnects 1
Jan 26 21:44:13 LCC: ch_info_local_hw_error_blob_update: HWMon rg0_num_errors 0, rg1_num_errors 1
Jan 26 21:44:13 LCC: ch_info_local_hw_error_blob_set: HWMon got RG0 0 flag RG1 1 flag
Jan 26 21:44:13 LCC: ch_srxtvp_ha_failover_on_coredump: srxpfe coredump started
Jan 26 21:44:13 LCC: ch_jdaf_lcm_fru_info_cb: FRU info recieved!
Jan 26 21:44:13 LCC: cbd_tvp_jdaf_info_update
Jan 26 21:44:13 LCC: cbd_tvp_jdaf_info_update: Data rcvd for 1 cbd
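The srxpfe coredump line caught our eye, so we also checked whether a core file was actually written (this is the standard command; we are not sure yet whether the core is the cause or just a symptom):

admin@FW02_SRX> show system core-dumps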
And we also see this in the logs:
Jan 26 21:44:08 Successfully opened the hwdb dynamic database
Jan 26 21:44:08 Successfully created the root node of hwdb dynamic database
Jan 26 21:44:08 Successfully created hwdb handle 0x922af30 for FRU FPC
Jan 26 21:44:08 Successfully created hwdb handle 0x922af40 for FRU CB
Jan 26 21:44:08 Successfully created hwdb handle 0x922af50 for FRU Routing Engine
Jan 26 21:44:08 Successfully created hwdb handle 0x922af60 for FRU PEM
Jan 26 21:44:08 Successfully created hwdb handle 0x922af70 for FRU SCB
Jan 26 21:44:08 Successfully created hwdb handle 0x922af00 for FRU FAN
Jan 26 21:44:08 Initializing Cooling Zones!!
Jan 26 21:44:08 Initializing Zone: 0!!
Jan 26 21:44:08 I2C read error for slot 0

Any ideas what the problem could be? Where else should we look? BTW, we have restarted this node a couple of times and the issue remains.
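For completeness, each restart was nothing fancier than a plain reboot issued on node1 itself:

admin@FW02_SRX> request system reboot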
Thank you.
------------------------------
RAMUNAS DAUKSA
------------------------------