SRX

node1 goes from hold to secondary to disabled

  • 1.  node1 goes from hold to secondary to disabled

    Posted 06-10-2020 04:21

    After upgrading a pair of SRX320s to 15.1X49-D210, I cannot get the cluster to reform.

     

    The primary node comes up ok but I cannot get the secondary online.

     

    I've tried doing the following on the secondary:

    set chassis cluster cluster-id 0 node 0 reboot

    ...

    load factory-defaults

    set chassis cluster cluster-id 1 node 1 reboot

     

    But on the primary, the status goes "lost -> hold -> secondary -> disabled".

     

    On the secondary, the only hint is in the chassisd log file:

     

    LCC: send: fpc 0 pic 0 online ack
    LCC: pic attach pic 0, flags 0x0, portcount 58, fpc 0
    LCC: pic_set_online: i2c 0x689 pic 0 fpc 0 state 3 in_issu 0
    LCC: pic_type=1673 pic_slot=0 fpc_slot=0 pic_i2c_id=1673
    LCC: hwdb: entry for pic 1673 at slot 0 in fpc 0 inserted
    LCC: FPC 0 PIC 0, attaching clean
    LCC: not in vc mode
    LCC: Forwarding pic attach to FWDD fpc 0, pic 0
    LCC: Got a pic attach ack from fwdd fpc 0pic 0
    LCC: FWDD pic attach ack recd fpc 0, pic 0
    LCC: pic_copy_port_info:Got SFP Rev= , Pno=NON-JNPR, Sno=PG54Q4Q
    LCC: SIGWINCH handler
    LCC: Node entering disabled state
    CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Chassis cluster disable
    LCC: fpc_down slot 0 reason Chassis cluster disable cargs 0xfa6120
    LCC: fpc_srxsme_disconnect slot is 0
    LCC: fpc_offline_now - slot 0, reason: Chassis cluster disable, error OK transition state 1
    CHASSISD_SNMP_TRAP3: ENTITY trap generated: entStateOperDisabled (entPhysicalIndex 7, entStateAdmin 3, entStateAlarm 0)
    LCC: fpc_offline_now - slot 0, is_resync_ready cleared
    LCC: mic_get_mic_slot: clp1: fpc_slot=0, pic_slot=0, i2c=0x689
    LCC: hwdb: entry for fpc 1929 at slot 0 deleted
    CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 1 offline: Chassis cluster disable
    LCC: fpc_down slot 1 reason Removal cargs 0x0
    LCC: fpc_offline_now - slot 1, reason: Chassis cluster disable, error OK transition state 1
    CHASSISD_SNMP_TRAP3: ENTITY trap generated: entStateOperDisabled (entPhysicalIndex 8, entStateAdmin 1, entStateAlarm 0)
    LCC: fpc_srxsme_is_mpim_present: slot 1, FPC not present
    LCC: fpc_srxsme_init: slot 1, FPC not detected
    CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 2 offline: Chassis cluster disable
    LCC: fpc_down slot 2 reason Removal cargs 0x0
    LCC: fpc_offline_now - slot 2, reason: Chassis cluster disable, error OK transition state 1
    CHASSISD_SNMP_TRAP3: ENTITY trap generated: entStateOperDisabled (entPhysicalIndex 9, entStateAdmin 1, entStateAlarm 0)
    LCC: fpc_srxsme_is_mpim_present: slot 2, FPC not present
    LCC: fpc_srxsme_init: slot 2, FPC not detected
    ...
    LCC: Unable to read FPC 6 ID EEPROM
    LCC: I2C read error for slot 6
    ...

    There's also an error in jam_chassisd, but the path it complains about (/usr/sbin/jam) is not present on either SRX:

    jam_dso_find_open.776:dir: /usr/sbin/jam
    jam_dso_find_open.799:Failed to Open Dir /usr/sbin/jam
    jam_get_db_attribute.1013:DB Get failed for chasd.lc.modelinfo.711-062269 with ret 3
    jam_get_modelnumstr.1176:Got model num str for partno: 711-062269
    jam_dso_find_open.776:dir: /usr/sbin/jam
    jam_dso_find_open.799:Failed to Open Dir /usr/sbin/jam
    jam_get_db_attribute.1013:DB Get failed for chasd.lc.modelinfo.711-062269 with ret 3
    jam_get_modelnumstr.1176:Got model num str for partno: 711-062269
    jam_get_db_attribute.1011 ERR:DB Get failed for chasd.lc.modelinfo. with error 3
    jam_get_modelnumstr.1176:Got model num str for partno:
    jam_dso_find_open.776:dir: /usr/sbin/jam
    jam_dso_find_open.799:Failed to Open Dir /usr/sbin/jam
    jam_get_db_attribute.1013:DB Get failed for chasd.lc.modelinfo.711-062269 with ret 3
    jam_get_modelnumstr.1176:Got model num str for partno: 711-062269

    So I'm a bit confused about what to do next. Is the unit actually faulty?
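
    One way to read the chassisd excerpt above is to check whether the FPC offline notices come with a "Node entering disabled state" line. In the excerpt they do, which suggests (but does not prove) that the FPCs are being taken offline because the cluster disabled the node, rather than because of a hardware fault. Below is a minimal Python sketch of that check. The function name and the sample are illustrative only; the two patterns are copied from the log lines in this post and may be worded differently on other releases.

```python
import re

# Patterns taken from the chassisd lines quoted in this post (assumption:
# other Junos releases may format these messages differently).
DISABLED_RE = re.compile(r"Node entering disabled state")
OFFLINE_RE = re.compile(r"CHASSISD_FRU_OFFLINE_NOTICE: Taking (FPC \d+) offline: (.+)")


def summarize_chassisd(log_text):
    """Return (node_disabled, [(fru, reason), ...]) for a chassisd log excerpt."""
    node_disabled = False
    offline = []
    for line in log_text.splitlines():
        if DISABLED_RE.search(line):
            node_disabled = True
        match = OFFLINE_RE.search(line)
        if match:
            offline.append((match.group(1), match.group(2).strip()))
    return node_disabled, offline


# A few lines copied from the excerpt above.
SAMPLE = """\
LCC: SIGWINCH handler
LCC: Node entering disabled state
CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Chassis cluster disable
LCC: fpc_down slot 0 reason Chassis cluster disable cargs 0xfa6120
CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 1 offline: Chassis cluster disable
"""

if __name__ == "__main__":
    disabled, offline = summarize_chassisd(SAMPLE)
    print("node disabled:", disabled)
    for fru, reason in offline:
        print(fru, "->", reason)
```

    If every offline reason is "Chassis cluster disable", the next place to look is why the cluster disabled the node, not the FPC hardware.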



  • 2.  RE: node1 goes from hold to secondary to disabled

    Posted 06-10-2020 04:33

    When the secondary is out of the cluster, all of the ge interfaces show up correctly as being up:

     

    root> show interfaces terse | match ge-
    ge-0/0/0                up    up
    ge-0/0/1                up    up
    ge-0/0/2                up    up
    ge-0/0/3                up    up
    ge-0/0/4                up    up
    ge-0/0/5                up    up
    ge-0/0/6                up    up
    ge-0/0/7                up    down
    ge-0/0/8                up    down
    ge-0/0/9                up    down

    So I'm not concerned about that.

     



  • 3.  RE: node1 goes from hold to secondary to disabled

     
    Posted 06-10-2020 04:54

    Hello Baldwizard,

     

    Greetings!

     

    As per the description, I understand that the secondary node is not coming online.

     

    Could you share the output of the following commands?

     

    > show chassis alarms no-forwarding

    > show chassis cluster status

    > show chassis cluster statistics
    > show chassis cluster information
    > show log jsrpd

     

    Also, please check the KB article below to verify that the chassis cluster nodes are configured and up on J-Series and SRX:

    https://kb.juniper.net/InfoCenter/index?page=content&id=KB15439&actp=METADATA

     

    Best Regards,

    Lingabasappa H

     


    #SRX
    #cluster


  • 4.  RE: node1 goes from hold to secondary to disabled

    Posted 06-10-2020 05:07

    Interesting that there's one alarm:

     

    > show chassis alarms no-forwarding
    1 alarms currently active
    Alarm time Class Description
    2020-06-10 21:41:02 EST Major NSD fails to restart because subcomponents fail

     



  • 5.  RE: node1 goes from hold to secondary to disabled

    Posted 06-10-2020 05:15

    Hello baldwizard,

    Regarding this alarm:

     

    > show chassis alarms no-forwarding
    1 alarms currently active
    Alarm time Class Description
    2020-06-10 21:41:02 EST Major NSD fails to restart because subcomponents fail

     

    Starting in Junos OS Releases 12.3X48-D85, 15.1X49-D180, and 19.2R1, a system alarm is triggered when the Network Security Process (NSD) is unable to restart due to the failure of one or more NSD subcomponents. Log entries about the NSD alarm are saved in the messages log. The alarm is cleared automatically when NSD restarts successfully. The show chassis alarms and show system alarms commands display the following description when NSD is unable to restart: "NSD fails to restart because subcomponents fail".
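
    If you collect this output from both nodes, the alarm can be picked out with a short script. The Python sketch below parses the alarm lines as they appear in the output pasted in this thread. The regular expression and function names are my own, and the column layout is assumed from that one sample, so treat it as a starting point rather than a general parser.

```python
import re

# Assumed layout, based only on the sample in this thread:
# "<date> <time> <tz> <class> <description>"
ALARM_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\S+)\s+"
    r"(?P<cls>Major|Minor)\s+(?P<desc>.+)$"
)


def parse_alarms(output):
    """Return one dict per alarm line found in 'show chassis alarms' output."""
    alarms = []
    for line in output.splitlines():
        match = ALARM_RE.match(line.strip())
        if match:
            alarms.append(match.groupdict())
    return alarms


def has_nsd_alarm(output):
    """True if the NSD restart alarm described above is present."""
    return any("NSD fails to restart" in a["desc"] for a in parse_alarms(output))


# Output copied from the earlier post in this thread.
SAMPLE = """\
1 alarms currently active
Alarm time Class Description
2020-06-10 21:41:02 EST Major NSD fails to restart because subcomponents fail
"""

if __name__ == "__main__":
    for alarm in parse_alarms(SAMPLE):
        print(alarm["cls"], "|", alarm["desc"])
    print("NSD alarm present:", has_nsd_alarm(SAMPLE))
```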

     

    Please go through the documents below:

    https://www.juniper.net/documentation/en_US/junos/topics/concept/security-alarm-overview.html

     

    https://www.juniper.net/documentation/en_US/junos/information-products/topic-collections/release-notes/15.1x49-d180/junos-release-notes-15.1X49-D180.pdf

     

    I hope this helps. Please mark my post as "Accept as solution" if that has answered your query.

     

    Kudos are always appreciated!

     



  • 6.  RE: node1 goes from hold to secondary to disabled

     
    Posted 06-10-2020 05:20

    Hello Baldwizard,

     

    > show chassis alarms no-forwarding
    1 alarms currently active
    Alarm time Class Description
    2020-06-10 21:41:02 EST Major NSD fails to restart because subcomponents fail

     

    To clear the above alarm, please run the command below in a safe maintenance window:

     

    > restart network-security

     

    I suspect that the daemon got stuck and it needs to be restarted, but restarting the process could impact your traffic for a short period of time.

     

    I hope this helps. Please mark this post "Accept as solution" if this answers your query.

     

    Kudos are always appreciated!

     

    Best Regards,

    Lingabasappa H


    #Network-Security
    #Daemon


  • 7.  RE: node1 goes from hold to secondary to disabled

    Posted 06-10-2020 05:54

    Hi baldwizard,

     

    Firstly, please verify active alarms on both nodes. 

     

    From the alarm output you pasted, I see that the active alarm reports an NSD failure caused by a subcomponent failure.

     

    Please note that the NSD process handles all security-related config and pushes them into the PFE. Since you are seeing these alerts, I suspect that the daemon might have gotten stuck and it needs to be restarted, but please keep in mind that restarting the process could impact your traffic for a short period of time.

     

    To restart the daemon:

    > restart network-security

     

    If the above command does not solve the issue, please restart the device:

    > request system reboot

     

    Please take precautions when rebooting the node. You might not want to reboot a node that is currently primary.

     

    Hope this helps 🙂

     

    Please mark "Accepted Solution" if this helps you solve your query.

    Kudos are always appreciated!



  • 8.  RE: node1 goes from hold to secondary to disabled
    Best Answer

    Posted 06-10-2020 05:04

    Ok, this appears to be because there was an interface configuration present on the non-cluster member for one of the HA interfaces, ge-0/0/0. I found that deep in a log file but that wasn't visible!

     

    /var/log/dcd

     

    I needed to do a "delete interfaces ge-0/0/0" from the non-cluster state of the secondary (it then only had the root password in its local configuration) and then reboot.
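
    For anyone hitting the same thing, the check I did by hand can be sketched in Python: take the standalone configuration in "set" format (for example from "show configuration | display set") and flag any interface statements on ports that the cluster will claim. The function name and the sample configuration are hypothetical, and the default port list only contains ge-0/0/0 because that is the HA interface from this thread. Which ports are reserved depends on the platform, so adjust the list for your device.

```python
# Port taken from this thread; other reserved ports vary by platform
# (assumption: pass your own tuple via ha_ports if needed).
HA_PORTS = ("ge-0/0/0",)


def find_ha_port_config(set_config, ha_ports=HA_PORTS):
    """Return the 'set interfaces <port> ...' lines that touch an HA port."""
    hits = []
    for line in set_config.splitlines():
        tokens = line.split()
        if (
            len(tokens) >= 3
            and tokens[0] == "set"
            and tokens[1] == "interfaces"
            and tokens[2] in ha_ports
        ):
            hits.append(line.strip())
    return hits


# Hypothetical standalone configuration, for illustration only.
SAMPLE_CONFIG = """\
set system root-authentication encrypted-password "..."
set interfaces ge-0/0/0 unit 0 family inet dhcp
set interfaces ge-0/0/2 unit 0 family inet address 192.0.2.1/24
"""

if __name__ == "__main__":
    for hit in find_ha_port_config(SAMPLE_CONFIG):
        print("conflicts with HA port:", hit)
```

    Any line it reports is a candidate for deletion before enabling cluster mode and rebooting.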



  • 9.  RE: node1 goes from hold to secondary to disabled

     
    Posted 06-10-2020 05:13

    Hello Baldwizard,

     

    Thanks for the reply.

     

    Did deleting the ge-0/0/0 interface configuration on the secondary node while it was in the non-cluster state, followed by a reboot, resolve the issue?

     

    Please mark the solution to your query as accepted if it answered your question.

    This would enable others to find the right solution for the same/similar queries on the forum.

     

    I hope this helps. Please mark my post as "Accept as solution" if that has answered your query.

     

    Kudos are always appreciated!

     

    Best Regards,

    Lingabasappa H



  • 10.  RE: node1 goes from hold to secondary to disabled

    Posted 06-10-2020 05:07

    Hello Baldwizard,

    Greetings!

     

    Kindly provide the output of the commands below:

    show chassis cluster status
    show chassis fpc pic-status
    show chassis alarms
    show log jsrpd
    show chassis cluster information no-forwarding

    Meanwhile, you can go through the documents below; they will be beneficial for troubleshooting:

    https://kb.juniper.net/InfoCenter/index?page=content&id=KB20641&actp=METADATA

    https://kb.juniper.net/InfoCenter/index?page=content&id=KB15421&actp=METADATA 

    Please mark "Accept as solution" if this answers your query. 

     

    Kudos are appreciated too.


