SRX1500 CHASSIS CLUSTER indicates HW issue


1. SRX1500 CHASSIS CLUSTER indicates HW issue

m00nfx
Posted 01-27-2021 07:14

Dear community,

Yesterday we upgraded the SRX1500 cluster in our production environment to a new software version, following Juniper's instructions for upgrading a cluster with minimal downtime (Article ID: KB17947). The upgrade was from version 15.1X49-D150.2 to 18.4R3.3.

Everything went smoothly. After the upgrade, node0 was running as secondary for both RGs (RG0 and RG1) and node1 as primary. We performed a manual failover of both RGs to node0, and after that everything was still OK. But when we checked NTP status on both nodes, we found that the secondary, node1, was not able to synchronize time (it should do so from the primary node over the control link). The SRX is a time provider for the systems behind it, and NTP is critical for our applications. If a failover to this node happened, it would take time for it to synchronize NTP, which would affect our systems.

So the decision was made to simply restart the secondary node. After the node came up, we checked the cluster status and saw this:

admin@FW02_SRX> show chassis cluster status
Monitor Failure codes:
    CS  Cold Sync monitoring        FL  Fabric Connection monitoring
    GR  GRES monitoring             HW  Hardware monitoring
    IF  Interface monitoring        IP  IP monitoring
    LB  Loopback monitoring         MB  Mbuf monitoring
    NH  Nexthop monitoring          NP  NPC monitoring
    SP  SPU monitoring              SM  Schedule monitoring
    CF  Config Sync monitoring      RE  Relinquish monitoring
    IS  IRQ storm

Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  254      primary        no      no       None
node1  1        secondary      no      no       None

Redundancy group: 1 , Failover count: 0
node0  200      primary        no      no       None
node1  0        secondary      no      no       HW

For redundancy group 1, node1 should have priority 100, but here it shows 0. We also see HW in the Monitor-failures column.
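For anyone who wants to check this condition automatically, here is a minimal Python sketch that parses the text of `show chassis cluster status` and reports rows where the priority is 0 or the Monitor-failures column is not `None`. The function name `find_unhealthy_nodes` and the regular expressions are my own assumptions based only on the output pasted in this thread; other Junos versions may lay the output out differently.

```python
import re

# Sample copied from the output in this thread (legend omitted).
STATUS_TEXT = """\
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  254      primary        no      no       None
node1  1        secondary      no      no       None

Redundancy group: 1 , Failover count: 0
node0  200      primary        no      no       None
node1  0        secondary      no      no       HW
"""

# Assumed line shapes, inferred from the sample above.
RG_RE = re.compile(r"^Redundancy group:\s*(\d+)")
NODE_RE = re.compile(r"^(node\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$")


def find_unhealthy_nodes(text):
    """Return (rg, node, priority, failures) for rows with priority 0
    or a Monitor-failures value other than 'None'."""
    problems = []
    current_rg = None
    for raw in text.splitlines():
        line = raw.strip()
        rg_match = RG_RE.match(line)
        if rg_match:
            current_rg = int(rg_match.group(1))
            continue
        node_match = NODE_RE.match(line)
        if node_match and current_rg is not None:
            node, priority, _status, _preempt, _manual, failures = node_match.groups()
            priority = int(priority)
            if priority == 0 or failures != "None":
                problems.append((current_rg, node, priority, failures))
    return problems


if __name__ == "__main__":
    for rg, node, priority, failures in find_unhealthy_nodes(STATUS_TEXT):
        print(f"RG{rg} {node}: priority={priority}, monitor-failures={failures}")
```

On the sample above this flags only node1 in redundancy group 1, which matches what we see by eye.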

And here is the output of another command:

admin@FW02_SRX> show chassis cluster information

node0:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: primary, Weight: 255

        Time            From        To          Reason
        Jan 26 19:50:07 hold        secondary   Hold timer expired
        Jan 26 20:05:33 secondary   primary     Remote is in secondary hold

    Redundancy Group 1 , Current State: primary, Weight: 255

        Time            From        To          Reason
        Jan 26 19:50:08 hold        secondary   Hold timer expired
        Jan 26 20:06:15 secondary   primary     Remote is in secondary hold

Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures

node1:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: secondary, Weight: 255

        Time            From        To          Reason
        Jan 26 21:44:40 hold        secondary   Hold timer expired

    Redundancy Group 1 , Current State: secondary, Weight: 0

        Time            From        To          Reason
        Jan 26 21:44:41 hold        secondary   Hold timer expired

Chassis cluster LED information:
    Current LED color: Amber
    Last LED change reason: Monitored objects are down

Failure Information:

    Hardware Monitoring Failure Information:
        Redundancy Group    Report Time
        RG1+                Jan 26 21:44:13.296

I checked other information (interfaces, etc.) and everything looks OK. Looking through the logs, I don't see what could cause this problem (to be honest, I don't know the internals of the SRX very deeply yet). Maybe these lines from the chassisd log are related:

Jan 26 21:44:11 LCC: ch_jdaf_client_monitor_reconnect.17011 ch_lcm_reconnects 1
Jan 26 21:44:13 LCC: ch_info_local_hw_error_blob_update: HWMon rg0_num_errors 0, rg1_num_errors 1
Jan 26 21:44:13 LCC: ch_info_local_hw_error_blob_set: HWMon got RG0 0 flag RG1 1 flag
Jan 26 21:44:13 LCC: ch_srxtvp_ha_failover_on_coredump: srxpfe coredump started
Jan 26 21:44:13 LCC: ch_jdaf_lcm_fru_info_cb: FRU info recieved!
Jan 26 21:44:13 LCC: cbd_tvp_jdaf_info_update
Jan 26 21:44:13 LCC: cbd_tvp_jdaf_info_update: Data rcvd for 1 cbd

and also this:

Jan 26 21:44:08 Successfully opened the hwdb dynamic database
Jan 26 21:44:08 Successfully created the root node of hwdb dynamic database
Jan 26 21:44:08 Successfully created hwdb handle 0x922af30 for FRU FPC
Jan 26 21:44:08 Successfully created hwdb handle 0x922af40 for FRU CB
Jan 26 21:44:08 Successfully created hwdb handle 0x922af50 for FRU Routing Engine
Jan 26 21:44:08 Successfully created hwdb handle 0x922af60 for FRU PEM
Jan 26 21:44:08 Successfully created hwdb handle 0x922af70 for FRU SCB
Jan 26 21:44:08 Successfully created hwdb handle 0x922af00 for FRU FAN
Jan 26 21:44:08 Initializing Cooling Zones!!
Jan 26 21:44:08 Initializing Zone: 0!!
Jan 26 21:44:08 I2C read error for slot 0

Any ideas what the problem is? Where else should we look? BTW, we have restarted this node a couple of times and the issue remains.

Thank you.

------------------------------
RAMUNAS DAUKSA
------------------------------
2. RE: SRX1500 CHASSIS CLUSTER indicates HW issue

kronicklez
Posted 01-27-2021 11:43

Hi,

Based on your output, it looks like a core dump was created, so you need to open a JTAC case.

Jan 26 21:44:13 LCC: ch_srxtvp_ha_failover_on_coredump: srxpfe coredump started
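If you have a saved copy of the chassisd log, a line like the one above can be picked out with a short script. This is only a rough Python sketch: the helper name `flag_lines` and the three patterns are assumptions taken from the lines posted in this thread, not an official list of Junos messages, and the I2C read error is included only as possibly relevant.

```python
import re

# A few lines copied verbatim from the logs posted in this thread.
CHASSISD_LOG = """\
Jan 26 21:44:11 LCC: ch_jdaf_client_monitor_reconnect.17011 ch_lcm_reconnects 1
Jan 26 21:44:13 LCC: ch_info_local_hw_error_blob_update: HWMon rg0_num_errors 0, rg1_num_errors 1
Jan 26 21:44:13 LCC: ch_srxtvp_ha_failover_on_coredump: srxpfe coredump started
Jan 26 21:44:13 LCC: ch_jdaf_lcm_fru_info_cb: FRU info recieved!
Jan 26 21:44:08 I2C read error for slot 0
"""

# Hypothetical patterns, chosen from the messages seen above.
PATTERNS = {
    "coredump": re.compile(r"coredump", re.IGNORECASE),
    # Matches only a non-zero HWMon error counter, e.g. "rg1_num_errors 1".
    "hwmon_errors": re.compile(r"rg\d+_num_errors\s+[1-9]\d*"),
    "i2c_error": re.compile(r"I2C read error"),
}


def flag_lines(log_text):
    """Return (label, line) pairs for every log line matching a pattern."""
    hits = []
    for raw in log_text.splitlines():
        line = raw.strip()
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line))
    return hits


if __name__ == "__main__":
    for label, line in flag_lines(CHASSISD_LOG):
        print(f"[{label}] {line}")
```

The flagged lines are a starting point for the JTAC case, not a diagnosis.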

Thanks
