SRX1500 CHASSIS CLUSTER indicates HW issue

  • 1.  SRX1500 CHASSIS CLUSTER indicates HW issue

    Posted 01-27-2021 07:14
    Dear community,

    Yesterday we upgraded the SRX1500 cluster in our production environment to a new software version, following Juniper's instructions for upgrading a cluster with minimal downtime (Article ID: KB17947). The upgrade was from version 15.1X49-D150.2 to 18.4R3.3.

    Everything went smoothly. After the upgrade, node0 was running as secondary for both RGs (RG0 and RG1) and node1 as primary. We performed a manual failover of both RGs to node0, and everything was still OK. But when we checked NTP status on both nodes, we found that the secondary node1 was unable to synchronize time (it should synchronize from the primary node over the control link). The SRX is the time provider for the systems behind it, and NTP is critical for our applications. If a failover to this node happened, it would take time for the node to synchronize NTP, which would affect our systems.

    So we decided simply to restart the secondary node. After the node came back up, we checked the cluster status and saw this:

    admin@FW02_SRX> show chassis cluster status 
    Monitor Failure codes:
        CS  Cold Sync monitoring        FL  Fabric Connection monitoring
        GR  GRES monitoring             HW  Hardware monitoring
        IF  Interface monitoring        IP  IP monitoring
        LB  Loopback monitoring         MB  Mbuf monitoring
        NH  Nexthop monitoring          NP  NPC monitoring              
        SP  SPU monitoring              SM  Schedule monitoring
        CF  Config Sync monitoring      RE  Relinquish monitoring
        IS  IRQ storm
     
    Cluster ID: 1
    Node   Priority Status               Preempt Manual   Monitor-failures
     
    Redundancy group: 0 , Failover count: 0
    node0  254      primary              no      no       None           
    node1  1        secondary            no      no       None           
     
    Redundancy group: 1 , Failover count: 0
    node0  200      primary              no      no       None           
    node1  0        secondary            no      no       HW     

    For redundancy group 1, node1 should have priority 100, but here it shows 0. We also see HW in the Monitor-failures column.
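
    The same check can be scripted. Below is a minimal Python sketch, assuming the status output has been saved as text in the format shown above. The function name `find_unhealthy_nodes` and the `SAMPLE` string are illustrative only and are not part of any Juniper tool.

```python
import re

# Rows copied from the "show chassis cluster status" output above.
SAMPLE = """\
Redundancy group: 0 , Failover count: 0
node0  254      primary              no      no       None
node1  1        secondary            no      no       None

Redundancy group: 1 , Failover count: 0
node0  200      primary              no      no       None
node1  0        secondary            no      no       HW
"""

# "Redundancy group: <n>" header lines.
RG_RE = re.compile(r"^Redundancy group:\s*(\d+)")
# Node rows: name, priority, status, preempt, manual, monitor-failures.
NODE_RE = re.compile(r"^(node\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$")


def find_unhealthy_nodes(text):
    """Return (rg, node, priority, failures) for rows with priority 0
    or with anything other than "None" in the Monitor-failures column."""
    problems = []
    rg = None
    for raw in text.splitlines():
        line = raw.strip()
        header = RG_RE.match(line)
        if header:
            rg = int(header.group(1))
            continue
        row = NODE_RE.match(line)
        if row and rg is not None:
            node = row.group(1)
            priority = int(row.group(2))
            failures = row.group(6)
            if priority == 0 or failures != "None":
                problems.append((rg, node, priority, failures))
    return problems


if __name__ == "__main__":
    for rg, node, priority, failures in find_unhealthy_nodes(SAMPLE):
        print(f"RG{rg} {node}: priority={priority}, monitor-failures={failures}")
```

    Against the output above, this flags only node1 in redundancy group 1.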

    And here the output of another command:

    admin@FW02_SRX> show chassis cluster information 
    node0:
    --------------------------------------------------------------------------
    Redundancy Group Information:
     
        Redundancy Group 0 , Current State: primary, Weight: 255
     
    Time            From                 To                   Reason
            Jan 26 19:50:07 hold                 secondary            Hold timer expired
            Jan 26 20:05:33 secondary            primary              Remote is in secondary hold
     
        Redundancy Group 1 , Current State: primary, Weight: 255
     
    Time            From                 To                   Reason
            Jan 26 19:50:08 hold                 secondary            Hold timer expired
            Jan 26 20:06:15 secondary            primary              Remote is in secondary hold
     
    Chassis cluster LED information:
        Current LED color: Green
        Last LED change reason: No failures
     
    node1:
    --------------------------------------------------------------------------
    Redundancy Group Information:
     
        Redundancy Group 0 , Current State: secondary, Weight: 255
     
    Time            From                 To                   Reason
            Jan 26 21:44:40 hold                 secondary            Hold timer expired
     
        Redundancy Group 1 , Current State: secondary, Weight: 0
     
    Time            From                 To                   Reason
            Jan 26 21:44:41 hold                 secondary            Hold timer expired
     
    Chassis cluster LED information:
        Current LED color: Amber
        Last LED change reason: Monitored objects are down
     
    Failure Information:
     
        Hardware Monitoring Failure Information:
            Redundancy Group   Report Time  
            RG1+               Jan 26 21:44:13.296 

    I checked other information (interfaces, etc.) and everything looks OK. In the logs I can't see what could be causing this problem (to be honest, I don't know the internals of the SRX very deeply yet). Maybe these lines from the chassisd log are related:

    Jan 26 21:44:11 LCC: ch_jdaf_client_monitor_reconnect.17011 ch_lcm_reconnects 1
    Jan 26 21:44:13 LCC: ch_info_local_hw_error_blob_update: HWMon rg0_num_errors 0, rg1_num_errors 1
    Jan 26 21:44:13 LCC: ch_info_local_hw_error_blob_set: HWMon got RG0 0 flag RG1 1 flag
    Jan 26 21:44:13 LCC: ch_srxtvp_ha_failover_on_coredump: srxpfe coredump started
    Jan 26 21:44:13 LCC: ch_jdaf_lcm_fru_info_cb: FRU info recieved!
    Jan 26 21:44:13 LCC: cbd_tvp_jdaf_info_update
    Jan 26 21:44:13 LCC: cbd_tvp_jdaf_info_update: Data rcvd for 1 cbd

    and also this:

    Jan 26 21:44:08 Successfully opened the hwdb dynamic database
    Jan 26 21:44:08 Successfully created the root node of hwdb dynamic database
    Jan 26 21:44:08 Successfully created hwdb handle 0x922af30 for FRU FPC
    Jan 26 21:44:08 Successfully created hwdb handle 0x922af40 for FRU CB
    Jan 26 21:44:08 Successfully created hwdb handle 0x922af50 for FRU Routing Engine
    Jan 26 21:44:08 Successfully created hwdb handle 0x922af60 for FRU PEM
    Jan 26 21:44:08 Successfully created hwdb handle 0x922af70 for FRU SCB
    Jan 26 21:44:08 Successfully created hwdb handle 0x922af00 for FRU FAN
    Jan 26 21:44:08 Initializing Cooling Zones!!
    Jan 26 21:44:08 Initializing Zone: 0!!
    Jan 26 21:44:08 I2C read error for slot 0
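
    To pick such lines out of a longer chassisd log, a simple keyword scan may help. This is a minimal Python sketch, assuming the log is available as plain text. The pattern set and the name `flag_lines` are my own choices based only on the excerpts above, not an official list of failure signatures.

```python
import re

# A few lines copied verbatim from the chassisd excerpts above.
LOG = """\
Jan 26 21:44:13 LCC: ch_info_local_hw_error_blob_update: HWMon rg0_num_errors 0, rg1_num_errors 1
Jan 26 21:44:13 LCC: ch_srxtvp_ha_failover_on_coredump: srxpfe coredump started
Jan 26 21:44:13 LCC: ch_jdaf_lcm_fru_info_cb: FRU info recieved!
Jan 26 21:44:08 Initializing Zone: 0!!
Jan 26 21:44:08 I2C read error for slot 0
"""

# Assumed patterns of interest, derived only from the lines shown in this post.
PATTERNS = {
    "hwmon_errors": re.compile(r"rg\d+_num_errors [1-9]\d*"),  # non-zero error count
    "coredump": re.compile(r"coredump", re.IGNORECASE),
    "i2c_error": re.compile(r"I2C read error"),
}


def flag_lines(text):
    """Return (label, line) pairs for every log line matching a pattern."""
    hits = []
    for line in text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line))
    return hits


if __name__ == "__main__":
    for label, line in flag_lines(LOG):
        print(f"[{label}] {line}")
```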



    Any ideas what the problem is? Where else should we look? BTW, we have restarted this node a couple of times and the issue remains.

    Thank you.






    ------------------------------
    RAMUNAS DAUKSA
    ------------------------------


  • 2.  RE: SRX1500 CHASSIS CLUSTER indicates HW issue

    Posted 01-27-2021 11:43
    Hi,

    Based on your output, it looks like a core dump was created, so you need to open a JTAC case.

    Jan 26 21:44:13 LCC: ch_srxtvp_ha_failover_on_coredump: srxpfe coredump started

    Thanks