
SRX 240 HA cluster lost its secondary unit

  • 1.  SRX 240 HA cluster lost its secondary unit

    Posted 12-10-2018 03:42

    Hi,

     

    We have an SRX 240 HA cluster, and the secondary unit seems to be lost. We can't connect to it via SSH, only through its console port.

    show chassis cluster status reports it as lost.

     

     

    Cluster ID: 1
    Node   Priority Status         Preempt Manual   Monitor-failures
    
    Redundancy group: 0 , Failover count: 9
    node0  200      primary        no      no       None
    node1  0        lost           n/a     n/a      n/a
    
    Redundancy group: 1 , Failover count: 1
    node0  0        primary        no      no       CS
    node1  0        lost           n/a     n/a      n/a
    
    Redundancy group: 2 , Failover count: 1
    node0  0        primary        no      no       CS
    node1  0        lost           n/a     n/a      n/a
    
    Redundancy group: 3 , Failover count: 1
    node0  0        primary        no      no       CS
    node1  0        lost           n/a     n/a      n/a
    
    Redundancy group: 4 , Failover count: 3
    node0  0        primary        no      no       CS
    node1  0        lost           n/a     n/a      n/a
    
    

     

    When we console onto the secondary device, we see that it can't even see its own interfaces:

     

     

    show interfaces terse
    Interface               Admin Link Proto    Local                 Remote
    fxp0                    up    up
    fxp0.0                  up    up   inet     192.168.x.y/29
    fxp1                    up    up
    fxp1.0                  up    up   inet     130.16.0.1/2
                                       tnp      0x2100001
    fxp2                    up    up
    fxp2.0                  up    up   tnp      0x2100001
    gre                     up    up
    ipip                    up    up
    lo0                     up    up
    lsi                     up    up
    mtun                    up    up
    pimd                    up    up
    pime                    up    up
    tap                     up    up

     

    The primary device can see its own interfaces, but not the secondary's. The control link seems to be working, but the fabric links are not.

     

     show chassis cluster control-plane statistics
    Control link statistics:
        Control link 0:
            Heartbeat packets sent: 5184071
            Heartbeat packets received: 4956136
            Heartbeat packet errors: 0
    Fabric link statistics:
        Child link 0
            Probes sent: 883891
            Probes received: 0
        Child link 1
            Probes sent: 530051
            Probes received: 0
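
    Since heartbeats pass but fabric probes go unanswered on both child links, one way to narrow it down (standard Junos operational commands; the ge-0/0/2 and ge-0/0/3 interfaces are the fab0 members from our config) would be:

    show interfaces ge-0/0/2 terse
    show interfaces ge-0/0/3 terse
    show chassis cluster interfaces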
    

    We've checked the cabling a few times and everything is okay; nobody has touched it since it was installed last year.

    The software image is the same on both devices.

     

    Model: srx240h2
    JUNOS Software Release [12.3X48-D45.6]
    

    When we disabled clustering on the secondary device, it could see its interfaces after the reboot.

    We thought the problem was with the secondary unit, so we replaced it with another SRX 240. After enabling clustering on it and loading the config onto it, the problem still occurs.
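
    For completeness, clustering on the replacement was enabled with the standard operational-mode command (cluster-id matching our existing cluster):

    set chassis cluster cluster-id 1 node 1 reboot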

     

    Also, when we are on the secondary, we see these error messages:

     

    mgmtfw01-b mgmtfw01-b CMLC: Chassis Manager terminated
    
    Message from syslogd@mgmtfw01-b at Dec 10 12:25:09  ...
    mgmtfw01-b mgmtfw01-b CMLC: Chassis Manager terminated
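
    The surrounding context for those messages can be pulled from the logs on the secondary (standard Junos show log commands):

    show log messages | match CMLC
    show log chassisd | last 50
    show log jsrpd | last 50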

     

    Has anyone seen this kind of behavior? 

     

    Config of the cluster:

     

    set groups node0 system host-name mgmtfw01-a
    set groups node0 interfaces fxp0 unit 0 family inet address 192.168.
    set groups node1 system host-name mgmtfw01-b
    set groups node1 interfaces fxp0 unit 0 family inet address 192.168.
    set apply-groups "${node}"
    set chassis cluster control-link-recovery
    set chassis cluster reth-count 10
    set chassis cluster redundancy-group 1 node 0 priority 200
    set chassis cluster redundancy-group 1 node 1 priority 100
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/14 weight 128
    set chassis cluster redundancy-group 1 interface-monitor ge-0/0/15 weight 128
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/15 weight 128
    set chassis cluster redundancy-group 1 interface-monitor ge-5/0/14 weight 128
    set chassis cluster redundancy-group 2 node 0 priority 200
    set chassis cluster redundancy-group 2 node 1 priority 100
    set chassis cluster redundancy-group 2 interface-monitor ge-0/0/13 weight 255
    set chassis cluster redundancy-group 2 interface-monitor ge-5/0/13 weight 255
    set chassis cluster redundancy-group 0 node 0 priority 200
    set chassis cluster redundancy-group 0 node 1 priority 100
    set chassis cluster redundancy-group 3 node 0 priority 200
    set chassis cluster redundancy-group 3 node 1 priority 100
    set chassis cluster redundancy-group 3 interface-monitor ge-0/0/11 weight 128
    set chassis cluster redundancy-group 3 interface-monitor ge-0/0/12 weight 128
    set chassis cluster redundancy-group 3 interface-monitor ge-5/0/11 weight 128
    set chassis cluster redundancy-group 3 interface-monitor ge-5/0/12 weight 128
    set chassis cluster redundancy-group 4 node 0 priority 200
    set chassis cluster redundancy-group 4 node 1 priority 100
    set chassis cluster redundancy-group 4 interface-monitor ge-0/0/10 weight 255
    set chassis cluster redundancy-group 4 interface-monitor ge-5/0/10 weight 255
    set interfaces ge-0/0/10 gigether-options redundant-parent reth4
    set interfaces ge-0/0/11 gigether-options redundant-parent reth3
    set interfaces ge-0/0/12 gigether-options redundant-parent reth3
    set interfaces ge-0/0/13 gigether-options redundant-parent reth2
    set interfaces ge-0/0/14 gigether-options redundant-parent reth1
    set interfaces ge-0/0/15 gigether-options redundant-parent reth1
    set interfaces ge-5/0/10 gigether-options redundant-parent reth4
    set interfaces ge-5/0/11 gigether-options redundant-parent reth3
    set interfaces ge-5/0/12 gigether-options redundant-parent reth3
    set interfaces ge-5/0/13 gigether-options redundant-parent reth2
    set interfaces ge-5/0/14 gigether-options redundant-parent reth1
    set interfaces ge-5/0/15 gigether-options redundant-parent reth1
    set interfaces fab0 fabric-options member-interfaces ge-0/0/2
    set interfaces fab0 fabric-options member-interfaces ge-0/0/3
    set interfaces fab1 fabric-options member-interfaces ge-5/0/2
    set interfaces fab1 fabric-options member-interfaces ge-5/0/3
    set interfaces reth1 vlan-tagging
    set interfaces reth1 gratuitous-arp-reply
    set interfaces reth1 redundant-ether-options redundancy-group 1
    set interfaces reth1 redundant-ether-options minimum-links 1
    set interfaces reth1 redundant-ether-options lacp active
    set interfaces reth1 redundant-ether-options lacp periodic slow
    set interfaces reth2 gratuitous-arp-reply
    set interfaces reth2 redundant-ether-options redundancy-group 2
    set interfaces reth2 unit 0 description 
    set interfaces reth2 unit 0 family inet address 
    set interfaces reth3 gratuitous-arp-reply
    set interfaces reth3 redundant-ether-options redundancy-group 3
    set interfaces reth3 redundant-ether-options minimum-links 1
    set interfaces reth3 redundant-ether-options lacp active
    set interfaces reth3 redundant-ether-options lacp periodic slow
    set interfaces reth3 unit 0 description ****
    set interfaces reth3 unit 0 family inet address 
    set interfaces reth4 vlan-tagging
    set interfaces reth4 gratuitous-arp-reply
    set interfaces reth4 redundant-ether-options redundancy-group 4
    

    Thanks!

     

     

     



  • 2.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-10-2018 03:52

    Please share the output of the following commands from the secondary, captured in the problem state:

    show version
    show chassis alarms
    show system core-dumps
    show chassis routing-engine | no-more
    show chassis fpc pic-status
    show chassis fpc detail | no-more
    show chassis cluster status
    show chassis cluster interfaces | no-more
    show chassis cluster information detail | no-more

     

     



  • 3.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-10-2018 04:02

    Thanks for the reply. Here are the requested outputs.

     

    me@mgmtfw01-b> show version
    node0:
    --------------------------------------------------------------------------
    Hostname: mgmtfw01-a
    Model: srx240h2
    JUNOS Software Release [12.3X48-D45.6]
    
    node1:
    --------------------------------------------------------------------------
    Hostname: mgmtfw01-b
    Model: srx240h2
    JUNOS Software Release [12.3X48-D45.6]
    
    {secondary:node1}
    me@mgmtfw01-b> show chassis alarms
    node0:
    --------------------------------------------------------------------------
    No alarms currently active
    
    node1:
    --------------------------------------------------------------------------
    No alarms currently active
    
    {secondary:node1}
    me@mgmtfw01-b> show system core-dumps
    node0:
    --------------------------------------------------------------------------
    /var/crash/*core*: No such file or directory
    /var/tmp/*core*: No such file or directory
    /var/tmp/pics/*core*: No such file or directory
    /var/crash/kernel.*: No such file or directory
    /tftpboot/corefiles/*core*: No such file or directory
    
    node1:
    --------------------------------------------------------------------------
    /var/crash/*core*: No such file or directory
    /var/tmp/*core*: No such file or directory
    /var/tmp/pics/*core*: No such file or directory
    /var/crash/kernel.*: No such file or directory
    /tftpboot/corefiles/*core*: No such file or directory
    
    {secondary:node1}
    me@mgmtfw01-b> show chassis routing-engine | no-more
    node0:
    --------------------------------------------------------------------------
    Routing Engine status:
        Temperature                 39 degrees C / 102 degrees F
        CPU temperature             39 degrees C / 102 degrees F
        Total memory              2048 MB Max  1126 MB used ( 55 percent)
          Control plane memory    1072 MB Max   557 MB used ( 52 percent)
          Data plane memory        976 MB Max   566 MB used ( 58 percent)
        CPU utilization:
          User                      91 percent
          Background                 0 percent
          Kernel                     9 percent
          Interrupt                  0 percent
          Idle                       1 percent
        Model                          RE-SRX240H2
        Serial ID                      ACMX4357
        Start time                     2018-10-11 11:15:51 CEST
        Uptime                         60 days, 2 hours, 42 minutes, 5 seconds
        Last reboot reason             Router rebooted after a normal shutdown.
        Load averages:                 1 minute   5 minute  15 minute
                                           1.01       1.15       1.19
    
    node1:
    --------------------------------------------------------------------------
    Routing Engine status:
        Temperature                 40 degrees C / 104 degrees F
        CPU temperature             39 degrees C / 102 degrees F
        Total memory              2048 MB Max   389 MB used ( 19 percent)
          Control plane memory    1072 MB Max   386 MB used ( 36 percent)
          Data plane memory        976 MB Max     0 MB used (  0 percent)
        CPU utilization:
          User                      20 percent
          Background                 0 percent
          Kernel                    74 percent
          Interrupt                  0 percent
          Idle                       6 percent
        Model                          RE-SRX240H2
        Serial ID                      ACMZ8364
        Start time                     2018-12-10 10:21:21 CET
        Uptime                         2 hours, 13 minutes, 45 seconds
        Last reboot reason             Router rebooted after a normal shutdown.
        Load averages:                 1 minute   5 minute  15 minute
                                           1.46       1.83       1.83
    
    {secondary:node1}
    me@mgmtfw01-b> show chassis fpc pic-status
    node0:
    --------------------------------------------------------------------------
    Slot 0   Online       FPC
      PIC 0  Online       16x GE Base PIC
    
    node1:
    --------------------------------------------------------------------------
    Slot 0   Present      FPC
    
    {secondary:node1}
    me@mgmtfw01-b> show chassis fpc detail | no-more
    node0:
    --------------------------------------------------------------------------
    Slot 0 information:
      State                               Online
      Total CPU DRAM                      ---- CPU less FPC ----
      Start time                          2018-12-05 09:46:59 CET
      Uptime                              5 days, 3 hours, 11 minutes, 23 seconds
    
    node1:
    --------------------------------------------------------------------------
    Slot 0 information:
      State                               Present
      Total CPU DRAM                      ---- CPU less FPC ----
    
    {secondary:node1}
    me@mgmtfw01-b> show chassis cluster status
    Monitor Failure codes:
        CS  Cold Sync monitoring        FL  Fabric Connection monitoring
        GR  GRES monitoring             HW  Hardware monitoring
        IF  Interface monitoring        IP  IP monitoring
        LB  Loopback monitoring         MB  Mbuf monitoring
        NH  Nexthop monitoring          NP  NPC monitoring
        SP  SPU monitoring              SM  Schedule monitoring
        CF  Config Sync monitoring
    
    Cluster ID: 1
    Node   Priority Status         Preempt Manual   Monitor-failures
    
    Redundancy group: 0 , Failover count: 0
    node0  200      primary        no      no       None
    node1  0        secondary      no      no       CF
    
    Redundancy group: 1 , Failover count: 0
    node0  0        primary        no      no       CS
    node1  0        secondary      no      no       IF CS CF
    
    Redundancy group: 2 , Failover count: 0
    node0  0        primary        no      no       CS
    node1  0        secondary      no      no       IF CS CF
    
    Redundancy group: 3 , Failover count: 0
    node0  0        primary        no      no       CS
    node1  0        secondary      no      no       IF CS CF
    
    Redundancy group: 4 , Failover count: 0
    node0  0        primary        no      no       CS
    node1  0        secondary      no      no       IF CS CF
    
    {secondary:node1}
    me@mgmtfw01-b> show chassis cluster interfaces | no-more
    Control link status: Up
    
    Control interfaces:
        Index   Interface   Monitored-Status   Internal-SA
        0       fxp1        Up                 Disabled
    
    Fabric link status: Down
    
    Fabric interfaces:
        Name    Child-interface    Status
                                   (Physical/Monitored)
        fab0
        fab0
        fab1
        fab1
    
    Redundant-pseudo-interface Information:
        Name         Status      Redundancy-group
        lo0          Up          0
    
    Interface Monitoring:
        Interface         Weight    Status    Redundancy-group
        ge-5/0/14         128       Down      1
        ge-5/0/15         128       Down      1
        ge-0/0/15         128       Down      1
        ge-0/0/14         128       Down      1
        ge-5/0/13         255       Down      2
        ge-0/0/13         255       Down      2
        ge-5/0/12         128       Down      3
        ge-5/0/11         128       Down      3
        ge-0/0/12         128       Down      3
        ge-0/0/11         128       Down      3
        ge-5/0/10         255       Down      4
        ge-0/0/10         255       Down      4
    
    {secondary:node1}
    me@mgmtfw01-b> show chassis cluster information detail | no-more
    node0:
    --------------------------------------------------------------------------
    Redundancy mode:
        Configured mode: active-active
        Operational mode: active-active
    Cluster configuration:
        Heartbeat interval: 1000 ms
        Heartbeat threshold: 3
        Control link recovery: Enabled
        Fabric link down timeout: 66 sec
    Node health information:
        Local node health: Not healthy
        Remote node health: Not healthy
    
    Redundancy group: 0, Threshold: 255, Monitoring failures: none
        Events:
            Oct 11 11:15:13.709 : hold->secondary, reason: Hold timer expired
            Oct 25 15:39:14.955 : secondary->primary, reason: Only node present
            Dec  5 09:39:39.985 : primary->secondary-hold, reason: Manual failover
            Dec  5 09:39:49.702 : secondary-hold->primary, reason: Only node present
            Dec  5 09:41:40.678 : primary->secondary-hold, reason: Manual failover
            Dec  5 09:42:05.439 : secondary-hold->primary, reason: Only node present
            Dec  5 09:43:12.740 : primary->secondary-hold, reason: Manual failover
            Dec  5 09:43:38.498 : secondary-hold->primary, reason: Only node present
            Dec  5 09:45:16.073 : primary->secondary-hold, reason: Manual failover
            Dec  5 09:45:42.458 : secondary-hold->primary, reason: Only node present
    
    Redundancy group: 1, Threshold: 0, Monitoring failures: cold-sync-monitoring
        Events:
            Oct 11 11:15:13.773 : hold->secondary, reason: Hold timer expired
            Oct 25 15:39:14.907 : secondary->ineligible, reason: Fabric link down
            Oct 25 15:39:15.106 : ineligible->primary, reason: Only node present
    
    Redundancy group: 2, Threshold: 0, Monitoring failures: cold-sync-monitoring
        Events:
            Oct 11 11:15:13.812 : hold->secondary, reason: Hold timer expired
            Oct 25 15:39:14.911 : secondary->ineligible, reason: Fabric link down
            Oct 25 15:39:15.138 : ineligible->primary, reason: Only node present
    
    Redundancy group: 3, Threshold: 0, Monitoring failures: cold-sync-monitoring
        Events:
            Oct 11 11:15:13.849 : hold->secondary, reason: Hold timer expired
            Oct 25 15:39:14.916 : secondary->ineligible, reason: Fabric link down
            Oct 25 15:39:15.142 : ineligible->primary, reason: Only node present
    
    Redundancy group: 4, Threshold: 0, Monitoring failures: cold-sync-monitoring
        Events:
            Oct 11 11:15:13.888 : hold->secondary, reason: Hold timer expired
            Oct 11 17:32:32.836 : secondary->primary, reason: Remote is in secondary hold
            Oct 25 15:39:14.917 : primary->ineligible, reason: Fabric link down
            Oct 25 15:39:15.169 : ineligible->primary, reason: Only node present
    Control link statistics:
        Control link 0:
            Heartbeat packets sent: 5185608
            Heartbeat packets received: 4956247
            Heartbeat packet errors: 0
            Duplicate heartbeat packets received: 0
        Control recovery packet count: 0
        Sequence number of last heartbeat packet sent: 5185628
        Sequence number of last heartbeat packet received: 5231
    Fabric link statistics:
        Child link 0
            Probes sent: 886973
            Probes received: 0
        Child link 1
            Probes sent: 533133
            Probes received: 0
    Switch fabric link statistics:
        Probe state : DOWN
        Probes sent: 0
        Probes received: 0
        Probe recv errors: 0
        Probe send errors: 0
        Probe recv dropped: 0
        Sequence number of last probe sent: 0
        Sequence number of last probe received: 0
    
    Chassis cluster LED information:
        Current LED color: Amber
        Last LED change reason: Monitored objects are down
    Control port tagging:
        Disabled
    
    Cold Synchronization:
        Status:
            Cold synchronization completed for: N/A
            Cold synchronization failed for: N/A
            Cold synchronization not known for: N/A
            Current Monitoring Weight: 255
    
        Progress:
            CS Prereq               0 of 1 SPUs completed
               1. if_state sync          1 SPUs completed
               2. fabric link            0 SPUs completed
               3. policy data sync       1 SPUs completed
               4. cp ready               0 SPUs completed
               5. VPN data sync          0 SPUs completed
               6. Dynamic addr sync      0 SPUs completed
            CS RTO sync             0 of 1 SPUs completed
            CS Postreq              0 of 1 SPUs completed
    
        Statistics:
            Number of cold synchronization completed: 0
            Number of cold synchronization failed: 0
    
        Events:
            Oct 11 11:17:27.928 : Cold sync for PFE  is RTO sync in process
            Oct 11 11:17:27.929 : Cold sync for PFE  is Post-req check in process
            Oct 11 11:17:27.936 : Cold sync for PFE  is Completed
            Dec  5 09:54:17.411 : Cold sync for PFE  is Not complete
    
    Loopback Information:
    
        PIC Name        Loopback        Nexthop     Mbuf
        -------------------------------------------------
                        Success         Success     Success
    
    Interface monitoring:
        Statistics:
            Monitored interface failure count: 303
    
        Events:
            Dec  6 12:50:19.901 : Interface ge-0/0/14 monitored by rg 1, changed state from Down to Up
            Dec  6 12:50:22.364 : Interface ge-0/0/15 monitored by rg 1, changed state from Down to Up
            Dec  6 12:50:36.740 : Interface ge-0/0/11 monitored by rg 3, changed state from Up to Down
            Dec  6 12:50:36.855 : Interface ge-0/0/12 monitored by rg 3, changed state from Up to Down
            Dec  6 12:50:39.944 : Interface ge-0/0/11 monitored by rg 3, changed state from Down to Up
            Dec  6 12:50:40.046 : Interface ge-0/0/12 monitored by rg 3, changed state from Down to Up
            Dec  6 12:50:44.296 : Interface ge-0/0/14 monitored by rg 1, changed state from Up to Down
            Dec  6 12:50:46.808 : Interface ge-0/0/15 monitored by rg 1, changed state from Up to Down
            Dec  6 12:50:48.643 : Interface ge-0/0/14 monitored by rg 1, changed state from Down to Up
            Dec  6 12:50:49.966 : Interface ge-0/0/15 monitored by rg 1, changed state from Down to Up
    
    Fabric monitoring:
        Status:
            Fabric Monitoring: Enabled
            Activation status: Suspended by local node and other node
            Fabric Status reported by data plane: Down
            JSRPD internal fabric status: Down
    
    Fabric link events:
            Dec 10 12:54:42.405 : Fabric link fab1 is down
            Dec 10 12:54:42.429 : Fabric link fab1 is down
            Dec 10 12:54:42.450 : Fabric link fab1 is deleted
            Dec 10 12:54:42.488 : Fabric link fab0 is up
            Dec 10 12:55:39.375 : Fabric link fab1 is down
            Dec 10 12:55:39.412 : Fabric link fab1 is down
            Dec 10 12:55:39.465 : Fabric link fab1 is down
            Dec 10 12:55:39.510 : Fabric link fab0 is up
            Dec 10 12:55:39.549 : Fabric link fab1 is down
            Dec 10 12:55:39.575 : Fabric link fab1 is down
    
    Control link status: Up
        Server information:
            Server status : Connected
            Server connected to 130.16.0.1/52793
        Client information:
            Client status : Inactive
            Client connected to None
    Control port tagging:
        Disabled
    
    Control link events:
            Dec 10 12:43:01.319 : Control link up, link status timer
            Dec 10 12:43:33.079 : Control link fxp1 is up
            Dec 10 12:48:27.022 : Control link down, link status timer
            Dec 10 12:48:39.021 : Control link fxp1 is up
            Dec 10 12:49:04.788 : Control link up, link status timer
            Dec 10 12:49:36.445 : Control link fxp1 is up
            Dec 10 12:54:30.529 : Control link down, link status timer
            Dec 10 12:54:42.497 : Control link fxp1 is up
            Dec 10 12:55:08.238 : Control link up, link status timer
            Dec 10 12:55:39.522 : Control link fxp1 is up
    
    Hardware monitoring:
        Status:
            Activation status: Enabled
            Redundancy group 0 failover for hardware faults: Enabled
            Hardware redundancy group 0 errors: 0
            Hardware redundancy group 1 errors: 0
    
    Schedule monitoring:
        Status:
            Activation status: Disabled
            Schedule slip detected: None
            Timer ignored: No
    
        Statistics:
            Total slip detected count: 31
            Longest slip duration: 25(s)
    
        Events:
            Dec  7 10:57:17.079 : Detected schedule slip
            Dec  7 10:58:17.170 : Cleared schedule slip
            Dec  7 12:23:33.217 : Detected schedule slip
            Dec  7 12:24:33.330 : Cleared schedule slip
            Dec  8 07:07:46.401 : Detected schedule slip
            Dec  8 07:08:46.859 : Cleared schedule slip
            Dec  8 11:52:31.165 : Detected schedule slip
            Dec  8 11:53:31.260 : Cleared schedule slip
            Dec  9 03:51:24.219 : Detected schedule slip
            Dec  9 03:52:24.317 : Cleared schedule slip
    
    Configuration Synchronization:
        Status:
            Activation status: Enabled
            Last sync operation: Auto-Sync
            Last sync result: Succeeded
            Last sync mgd messages:
                mgd: rcp: /config/juniper.conf: No such file or directory
                Non-existant dump device /dev/bo0s1b
                mgd: commit complete
    
        Events:
            Oct 11 11:15:35.406 : Auto-Sync: In progress. Attempt: 1
            Oct 11 11:18:22.218 : Auto-Sync: Clearing mgd. Attempt: 1
            Oct 11 11:18:31.062 : Auto-Sync: Succeeded. Attempt: 1
    
    Cold Synchronization Progress:
        CS Prereq               0 of 1 SPUs completed
           1. if_state sync          1 SPUs completed
           2. fabric link            0 SPUs completed
           3. policy data sync       1 SPUs completed
           4. cp ready               0 SPUs completed
           5. VPN data sync          0 SPUs completed
           6. Dynamic addr sync      0 SPUs completed
        CS RTO sync             0 of 1 SPUs completed
        CS Postreq              0 of 1 SPUs completed
    
     Command history:
            Dec  5 09:44:15.890 : Manual failover of RG-0 to node0
            Dec  5 09:44:28.871 : Manual failover reset of RG-0
            Dec  5 09:44:33.187 : Manual failover of RG-0 to node0
            Dec  5 09:45:02.494 : Manual failover of RG-0 to node0
            Dec  5 09:45:38.155 : Manual failover reset of RG-0
            Dec  5 09:45:52.176 : Manual failover of RG-0 to node0
            Dec  5 15:28:26.029 : Manual failover reset of RG-4
            Dec  5 15:28:39.491 : Manual failover reset of RG-3
    
    node1:
    --------------------------------------------------------------------------
    Redundancy mode:
        Configured mode: active-active
        Operational mode: unknown
    Cluster configuration:
        Heartbeat interval: 1000 ms
        Heartbeat threshold: 3
        Control link recovery: Enabled
        Fabric link down timeout: 66 sec
    Node health information:
        Local node health: Not healthy
        Remote node health: Not healthy
    
    Redundancy group: 0, Threshold: 0, Monitoring failures: config-sync-monitoring
        Events:
            Dec 10 10:24:57.009 : hold->secondary, reason: Hold timer expired
    
    Redundancy group: 1, Threshold: -511, Monitoring failures: interface-monitoring, cold-sync-monitoring, config-sync-monitoring
        Events:
            Dec 10 10:24:57.102 : hold->secondary, reason: Hold timer expired
    
    Redundancy group: 2, Threshold: -510, Monitoring failures: interface-monitoring, cold-sync-monitoring, config-sync-monitoring
        Events:
            Dec 10 10:24:57.585 : hold->secondary, reason: Hold timer expired
    
    Redundancy group: 3, Threshold: -511, Monitoring failures: interface-monitoring, cold-sync-monitoring, config-sync-monitoring
        Events:
            Dec 10 10:24:57.622 : hold->secondary, reason: Hold timer expired
    
    Redundancy group: 4, Threshold: -510, Monitoring failures: interface-monitoring, cold-sync-monitoring, config-sync-monitoring
        Events:
            Dec 10 10:24:57.671 : hold->secondary, reason: Hold timer expired
    Control link statistics:
        Control link 0:
            Heartbeat packets sent: 5211
            Heartbeat packets received: 4757
            Heartbeat packet errors: 0
            Duplicate heartbeat packets received: 0
        Control recovery packet count: 0
        Sequence number of last heartbeat packet sent: 5237
        Sequence number of last heartbeat packet received: 5185634
    Fabric link statistics:
        Child link 0
            Probes sent: 0
            Probes received: 0
        Child link 1
            Probes sent: 0
            Probes received: 0
    Switch fabric link statistics:
        Probe state : DOWN
        Probes sent: 0
        Probes received: 0
        Probe recv errors: 0
        Probe send errors: 0
        Probe recv dropped: 0
        Sequence number of last probe sent: 0
        Sequence number of last probe received: 0
    
    Chassis cluster LED information:
        Current LED color: Amber
        Last LED change reason: Monitored objects are down
    Control port tagging:
        Disabled
    
    Cold Synchronization:
        Status:
            Cold synchronization completed for: N/A
            Cold synchronization failed for: N/A
            Cold synchronization not known for: N/A
            Current Monitoring Weight: 255
    
        Progress:
            CS Prereq               0 of 1 SPUs completed
               1. if_state sync          0 SPUs completed
               2. fabric link            0 SPUs completed
               3. policy data sync       0 SPUs completed
               4. cp ready               0 SPUs completed
               5. VPN data sync          0 SPUs completed
               6. Dynamic addr sync      0 SPUs completed
            CS RTO sync             0 of 1 SPUs completed
            CS Postreq              0 of 1 SPUs completed
    
        Statistics:
            Number of cold synchronization completed: 0
            Number of cold synchronization failed: 0
    
    Loopback Information:
    
        PIC Name        Loopback        Nexthop     Mbuf
        -------------------------------------------------
                        Success         Success     Success
    
    Interface monitoring:
        Statistics:
            Monitored interface failure count: 0
    
    Fabric monitoring:
        Status:
            Fabric Monitoring: Enabled
            Activation status: Suspended by local node and other node
            Fabric Status reported by data plane: Down
            JSRPD internal fabric status: Down
    
    Fabric link events:
            Dec 10 11:30:58.765 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 11:31:06.997 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 11:31:13.229 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 11:37:02.814 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 11:37:10.999 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 11:43:05.846 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 11:43:14.009 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 11:49:09.882 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 11:49:18.072 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
            Dec 10 12:34:13.129 : Fabric monitoring is suspended due to USPIPC CONNECTION failure
    
    Control link status: Up
        Server information:
            Server status : Inactive
            Server connected to None
        Client information:
            Client status : Connected
            Client connected to 129.16.0.1/62845
    Control port tagging:
        Disabled
    
    Control link events:
            Dec 10 11:37:43.889 : Control link up, link status timer
            Dec 10 11:43:05.929 : Control link fxp1 is down
            Dec 10 11:43:05.929 : Control link down, flowd is down
            Dec 10 11:43:14.746 : Control link fxp1 is up
            Dec 10 11:43:47.390 : Control link up, link status timer
            Dec 10 11:49:09.964 : Control link fxp1 is down
            Dec 10 11:49:09.965 : Control link down, flowd is down
            Dec 10 11:49:19.221 : Control link fxp1 is up
            Dec 10 11:49:50.982 : Control link up, link status timer
            Dec 10 12:34:13.119 : Control link fxp1 is up
    
    Hardware monitoring:
        Status:
            Activation status: Enabled
            Redundancy group 0 failover for hardware faults: Enabled
            Hardware redundancy group 0 errors: 0
            Hardware redundancy group 1 errors: 0
    
    Schedule monitoring:
        Status:
            Activation status: Disabled
            Schedule slip detected: None
            Timer ignored: No
    
        Statistics:
            Total slip detected count: 16
            Longest slip duration: 2578(s)
    
        Events:
            Dec 10 11:31:11.706 : Detected schedule slip
            Dec 10 11:32:13.095 : Cleared schedule slip
            Dec 10 11:37:15.779 : Detected schedule slip
            Dec 10 11:38:17.021 : Cleared schedule slip
            Dec 10 11:43:17.973 : Detected schedule slip
            Dec 10 11:44:19.238 : Cleared schedule slip
            Dec 10 11:49:22.655 : Detected schedule slip
            Dec 10 11:50:24.368 : Cleared schedule slip
            Dec 10 12:34:13.090 : Detected schedule slip
            Dec 10 12:35:13.349 : Cleared schedule slip
    
    Configuration Synchronization:
        Status:
            Activation status: Enabled
            Last sync operation: Auto-Sync
            Last sync result: In progress
            Last sync mgd messages:
                mgd: rcp: /config/juniper.conf: No such file or directory
    
        Events:
            Dec 10 10:25:23.643 : Auto-Sync: In progress. Attempt: 1
            Dec 10 12:34:13.078 : Auto-Sync: Retry needed. Attempt: 1
            Dec 10 12:34:18.930 : Auto-Sync: In progress. Attempt: 2
    
    Cold Synchronization Progress:
        CS Prereq               0 of 1 SPUs completed
           1. if_state sync          0 SPUs completed
           2. fabric link            0 SPUs completed
           3. policy data sync       0 SPUs completed
           4. cp ready               0 SPUs completed
           5. VPN data sync          0 SPUs completed
           6. Dynamic addr sync      0 SPUs completed
        CS RTO sync             0 of 1 SPUs completed
        CS Postreq              0 of 1 SPUs completed
    
    {secondary:node1}
    


  • 4.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-10-2018 04:12

The RE CPU utilization is more than 95% on both nodes, and the FPC is not online on node1. You may have to check for the cause of the high RE CPU.

    Please share the output:

    show security flow status

    show system process extensive | no-more



  • 5.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-10-2018 07:14

    Hi,

     

    The outputs as requested:

     

    me@mgmtfw01-b> show security flow status
    node0:
    --------------------------------------------------------------------------
      Flow forwarding mode:
        Inet forwarding mode: flow based
        Inet6 forwarding mode: drop
        MPLS forwarding mode: drop
        ISO forwarding mode: drop
        Enhanced route scaling mode: Disabled
      Flow trace status
        Flow tracing status: off
      Flow session distribution
        Distribution mode: RR-based
      Flow ipsec performance acceleration: off
      Flow packet ordering
        Ordering mode: Hardware
    
    node1:
    --------------------------------------------------------------------------
      Flow forwarding mode:
        Inet forwarding mode: none (reboot needed to change to flow based)
        Inet6 forwarding mode: drop
        MPLS forwarding mode: none (reboot needed to change to drop)
        ISO forwarding mode: drop
        Enhanced route scaling mode: Disabled
      Flow trace status
        Flow tracing status: off
      Flow session distribution
        Distribution mode: RR-based
      Flow ipsec performance acceleration: off
      Flow packet ordering
        Ordering mode: Hardware
    
    {secondary:node1}
    me@mgmtfw01-b> show system processes extensive | no-more
    node0:
    --------------------------------------------------------------------------
    last pid: 41674;  load averages:  1.12,  1.19,  1.16  up 60+05:57:17    16:12:38
    149 processes: 18 running, 118 sleeping, 1 zombie, 12 waiting
    
    Mem: 228M Active, 135M Inact, 1089M Wired, 255M Cache, 112M Buf, 266M Free
    Swap:
    
    
      PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
    19056 root           7  76    0  1026M 86948K select 0 398.5H 290.09% flowd_octeon_hm
    19903 root           1 139    0 15080K  5644K RUN    0  96.6H 74.41% eventd
       22 root           1 171   52     0K    16K RUN    0 1059.6  0.00% idle: cpu0
       23 root           1 -20 -139     0K    16K WAIT   0 752:25  0.00% swi7: clock
       25 root           1 -40 -159     0K    16K WAIT   0 552:54  0.00% swi2: netisr 0
     1715 root           1  76    0 15556K  6692K select 0 430:29  0.00% rtlogd
     1720 root           1  76    0 14412K  6252K select 0 337:01  0.00% license-check
        5 root           1 -16    0     0K    16K rtfifo 0 323:41  0.00% rtfifo_kern_recv
     1710 root           1  76    0 17828K  5576K select 0 108:14  0.00% shm-rtsdbd
     1711 root           1  76    0 16108K  7900K select 0  74:01  0.00% jsrpd
       26 root           1 -16    0     0K    16K -      0  49:23  0.00% yarrow
     1696 root           1  76    0  3348K  1428K select 0  41:55  0.00% bslockd
       52 root           1 -16    0     0K    16K psleep 0  41:33  0.00% vmkmemdaemon
    19046 root           1  76    0 25896K 16992K select 0  34:19  0.00% snmpd
     1716 root           1  76    0 19720K  8692K select 0  33:12  0.00% utmd
     1719 root           3  76    0 16608K  5484K select 0  23:03  0.00% wland
    19042 root           1  76    0 33300K 14248K select 0  22:59  0.00% mib2d
    22605 root           1   4    0     0K    16K proxy_ 0  22:55  0.00% peerproxy02100001
        2 root           1  -8    0     0K    16K -      0  15:40  0.00% g_event
       19 root           1 171   52     0K    16K RUN    3  15:38  0.00% idle: cpu3
       20 root           1 171   52     0K    16K RUN    2  15:29  0.00% idle: cpu2
       42 root           1  20    0     0K    16K syncer 0  15:05  0.00% syncer
     1807 root           1  76    0     0K    16K select 0  13:26  0.00% peerproxy01100001
    19022 root           1  76    0 22580K 10208K select 0  13:07  0.00% l2ald
       43 root           1  20    0     0K    16K vnlrum 0  13:03  0.00% vnlru_mem
        3 root           1  -8    0     0K    16K -      0  12:39  0.00% g_up
        4 root           1  -8    0     0K    16K -      0  12:35  0.00% g_down
    19051 root           1  76    0 21264K  6344K select 0  11:03  0.00% bdbrepd
     1718 root           1  76    0  7700K  5592K select 0  10:43  0.00% ntpd
    19054 root           1  76    0   129M 20648K select 0  10:03  0.00% chassisd
    19047 root           1  76    0 30068K 10676K select 0   7:35  0.00% pfed
     1808 root           1   4    0     0K    16K proxy_ 0   7:07  0.00% peerproxy02100001
     1695 root           1  76    0  2320K   916K select 0   7:05  0.00% watchdog
    19024 root           1  76    0 24588K 10840K select 0   6:08  0.00% cosd
     1714 root           1  76    0 43256K  9944K select 0   5:53  0.00% idpd
       21 root           1 171   52     0K    16K RUN    1   5:13  0.00% idle: cpu1
     1751 root           1  76    0 15280K  6956K select 0   5:03  0.00% bfdd
     1701 root           1  76    0 14940K  5384K select 0   4:32  0.00% craftd
    19108 root           8   8    0 82796K  7616K nanslp 0   4:26  0.00% ipfd
    19038 root           1   8    0 29848K  5440K nanslp 0   4:12  0.00% wmic
    19045 root           1  76    0 14348K  5824K select 0   4:07  0.00% alarmd
    19040 root           1   4    0 11500K  5248K kqread 0   3:32  0.00% mcsnoopd
    19021 root           1   4    0 57076K 27744K kqread 0   2:50  0.00% rpd
    19043 root           1  76    0 31388K 12424K select 0   2:46  0.00% kmd
    18987 root           1  76    0 17880K  8568K select 0   2:23  0.00% ppmd
    19023 root           1  76    0 15768K  7544K select 0   2:23  0.00% rmopd
    19044 root           1  76    0 39992K  9772K select 0   2:16  0.00% dcd
       45 root           1 -16    0     0K    16K sdflus 0   2:03  0.00% softdepflush
    19052 root           1  76    0 29968K 15008K select 0   2:02  0.00% nsd
       40 root           1 171   52     0K    16K pgzero 0   1:49  0.00% pagezero
    19053 root           1  76    0 27076K 10292K select 0   1:45  0.00% smid
    19097 nobody         1   4    0 10496K  1484K kqread 0   1:35  0.00% webapid
    19049 root           1  76    0  9196K  3736K select 0   1:34  0.00% irsd
       32 root           1   8    0     0K    16K dwcint 0   1:26  0.00% dwc0
       44 root           1  -4    0     0K    16K vlruwt 0   1:25  0.00% vnlru
    18989 root           1  76    0 17836K  7872K select 0   1:24  0.00% lacpd
       41 root           1 -16    0     0K    16K psleep 0   1:20  0.00% bufdaemon
       50 root           1 -16    0     0K    16K psleep 0   1:16  0.00% vmuncachedaemon
    19032 root           1  76    0 16820K  6728K select 0   1:02  0.00% pkid
     1702 root           1  76    0 46924K 25700K select 0   0:53  0.00% mgd
     1705 root           1  76    0  6936K  2148K select 0   0:49  0.00% inetd
     1437 root           1   8    0  2720K  1160K nanslp 0   0:44  0.00% cron
       30 root           1 -28 -147     0K    16K WAIT   0   0:30  0.00% swi5: cambio
       82 root           1  -8    0     0K    16K mdwait 0   0:25  0.00% md1
    19101 nobody         1  76    0 10884K  5080K select 0   0:25  0.00% httpd
    19048 root           1  76    0 27668K 10524K select 0   0:22  0.00% dfwd
        6 root           1   8    0     0K    16K -      0   0:20  0.00% kqueue taskq
    19034 root           1  76    0 24048K  8376K select 0   0:17  0.00% smihelperd
        9 root           1 -16    0     0K    16K psleep 0   0:17  0.00% pagedaemon
    40859 root           1  77    0 56296K 23532K select 0   0:15  0.00% mgd
       47 root           1  -8    0     0K    16K select 0   0:13  0.00% if_pfe_listen
    19036 root           3  79    0 18940K  6292K ucond  0   0:11  0.00% syshmd
    19025 root           1  76    0 10264K  3812K select 0   0:10  0.00% pppd
       37 root           1 -36 -155     0K    16K WAIT   0   0:10  0.00% swi3: ip6opt ipopt
    40840 root           1  76    0 10708K  3676K select 0   0:07  0.00% sshd
      396 root           1  -8    0     0K    16K mdwait 0   0:07  0.00% md2
    40858 root           1  76    0 55636K 19600K select 0   0:07  0.00% cli
    19033 root           1  76    0 15276K  5760K select 0   0:06  0.00% httpd-gk
        1 root           1   8    0  1596K   892K wait   0   0:06  0.00% init
    19035 root           1  76    0 12336K  4176K select 0   0:05  0.00% nstraced
    19037 root           1  76    0  9656K  2964K select 0   0:04  0.00% smtpd
    19901 root           1  76    0  3332K  1308K select 0   0:01  0.00% usbd
       36 root           1   8    0     0K    16K usbevt 0   0:01  0.00% usb1
     1697 root           1  77    0  3656K  1536K select 0   0:01  0.00% tnetd
       33 root           1   8    0     0K    16K usbevt 0   0:01  0.00% usb0
     1713 root           1  76    0 21684K  6448K select 0   0:01  0.00% appsecured
     1712 root           2   4    0 21172K  5736K select 0   0:01  0.00% appidd
     1721 root           1   4    0 11848K  3308K select 0   0:01  0.00% sdxd
    19031 root           1  82    0 16784K  5684K select 0   0:00  0.00% wwand
    19030 root           1  83    0 12644K  3884K select 0   0:00  0.00% sendd
    19028 root           1  77    0 11336K  3508K select 0   0:00  0.00% oamd
       31 root           1 -48 -167     0K    16K WAIT   0   0:00  0.00% swi0: uart
    19039 root           1  20    0 10244K  3712K pause  0   0:00  0.00% webapid
     1547 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md4
    19029 root           1  93    0 12132K  3236K select 0   0:00  0.00% mplsoamd
    19050 root           1  76    0  9548K  2936K select 0   0:00  0.00% relayd
       59 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md0
    40844 root           1  20    0  5056K  3116K pause  0   0:00  0.00% csh
        7 root           1   8    0     0K    16K -      0   0:00  0.00% thread taskq
    19849 root           1   5    0  4556K  1832K ttyin  0   0:00  0.00% login
    41674 root           1  79    0 24628K  2124K CPU0   0   0:00  0.00% top
    41673 root           1  77    0 46956K  4720K select 0   0:00  0.00% mgd
     1532 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md3
        8 root           1   8    0     0K    16K -      0   0:00  0.00% mastership taskq
        0 root           1  -8    0     0K     0K WAIT   0   0:00  0.00% swapper
       49 root           1   4    0     0K    16K purge_ 0   0:00  0.00% kern_pir_proc
       51 root           1  -8    0     0K    16K select 0   0:00  0.00% if_pic_listen0
       54 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 0
       53 root           1   4    0     0K    16K dump_r 0   0:00  0.00% kern_dump_proc
       46 root           1  76    0     0K    16K sleep  0   0:00  0.00% netdaemon
       56 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 2
       57 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 3
       55 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 1
       35 root           1   8    0     0K    16K dwcint 0   0:00  0.00% dwc1
       34 root           1   8    0     0K    16K usbtsk 0   0:00  0.00% usbtask
       10 root           1 -16    0     0K    16K ktrace 0   0:00  0.00% ktrace
       28 root           1 -12 -131     0K    16K WAIT   0   0:00  0.00% swi9: +
       29 root           1 -12 -131     0K    16K WAIT   0   0:00  0.00% swi9: task queue
       27 root           1 -16 -135     0K    16K WAIT   0   0:00  0.00% swi8: +
       24 root           1 -24 -143     0K    16K WAIT   0   0:00  0.00% swi6: vm
       38 root           1 -32 -151     0K    16K WAIT   0   0:00  0.00% swi4: ip6mismatch+
       39 root           1 -44 -163     0K    16K WAIT   0   0:00  0.00% swi1: ipfwd
       15 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu7
       14 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu8
       13 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu9
       12 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu10
       11 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu11
       17 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu5
       18 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu4
       16 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu6
    
    node1:
    --------------------------------------------------------------------------
    last pid:  2746;  load averages:  0.92,  0.97,  0.97  up 0+05:28:55    15:49:46
    121 processes: 18 running, 91 sleeping, 1 zombie, 11 waiting
    
    Mem: 140M Active, 90M Inact, 1061M Wired, 227M Cache, 112M Buf, 454M Free
    Swap:
    
    
      PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
     2723 root           7  76    0  1026M 86940K select 0  16:56 367.53% flowd_octeon_hm
     1623 root           1  76    0 15940K  7588K select 0  42:11  0.00% jsrpd
       22 root           1 171   52     0K    16K RUN    0  19:56  0.00% idle: cpu0
       20 root           1 171   52     0K    16K RUN    2  11:15  0.00% idle: cpu2
       19 root           1 171   52     0K    16K RUN    3  11:13  0.00% idle: cpu3
       21 root           1 171   52     0K    16K RUN    1   3:42  0.00% idle: cpu1
     1659 root           1  76    0 22208K  6800K select 0   2:42  0.00% bdbrepd
       23 root           1 -20 -139     0K    16K RUN    0   2:18  0.00% swi7: clock
     1656 root           1  76    0  9832K  3940K select 0   2:10  0.00% ksyncd
     1266 root           1  76    0 15080K  5032K select 0   1:35  0.00% eventd
     1633 root           1  76    0 14412K  6056K select 0   1:17  0.00% license-check
     1660 root           1  76    0 27008K  9624K select 0   1:10  0.00% smid
        5 root           1 -16    0     0K    16K rtfifo 0   1:05  0.00% rtfifo_kern_recv
     1666 root           1  76    0 23224K  8816K select 0   0:41  0.00% pfed
     1667 root           1  76    0 33352K 12076K select 0   0:40  0.00% mib2d
     1630 root           1  76    0 19548K  8280K select 0   0:39  0.00% utmd
       25 root           1 -40 -159     0K    16K WAIT   0   0:29  0.00% swi2: netisr 0
     1622 root           1  76    0 17720K  3656K select 0   0:28  0.00% shm-rtsdbd
      386 root           1  -8    0     0K    16K mdwait 0   0:17  0.00% md2
     1669 root           1  76    0 14344K  5500K select 0   0:16  0.00% alarmd
       82 root           1  -8    0     0K    16K mdwait 0   0:15  0.00% md1
        3 root           1  -8    0     0K    16K -      0   0:12  0.00% g_up
     1608 root           1  76    0  3348K  1392K select 0   0:10  0.00% bslockd
       52 root           1 -16    0     0K    16K psleep 0   0:09  0.00% vmkmemdaemon
     1668 root           1  76    0 19040K  9652K select 0   0:09  0.00% snmpd
     1629 root           1  76    0 15532K  6460K select 0   0:08  0.00% rtlogd
       26 root           1 -16    0     0K    16K -      0   0:07  0.00% yarrow
     1627 root           1  76    0 43268K  9680K select 0   0:07  0.00% idpd
        1 root           1   8    0  1596K   892K wait   0   0:07  0.00% init
     1655 root           1  76    0 15920K  6344K select 0   0:06  0.00% ppmd
        4 root           1  -8    0     0K    16K -      0   0:06  0.00% g_down
     1663 root           4  76    0 81740K  4924K select 0   0:05  0.00% ipfd
     1632 root           3  76    0 16604K  5088K select 0   0:04  0.00% wland
     1664 root           1  76    0  9196K  3428K select 0   0:04  0.00% irsd
        2 root           1  -8    0     0K    16K -      0   0:04  0.00% g_event
       42 root           1  20    0     0K    16K syncer 0   0:04  0.00% syncer
     1670 root           1  76    0 39288K  8508K select 0   0:04  0.00% dcd
       43 root           1  20    0     0K    16K vnlrum 0   0:03  0.00% vnlru_mem
     1614 root           1  76    0 46924K 25700K select 0   0:03  0.00% mgd
     2531 root           1  76    0 50536K 25720K select 0   0:02  0.00% mgd
     1631 root           1  76    0  7640K  4748K select 0   0:02  0.00% ntpd
       40 root           1 171   52     0K    16K pgzero 0   0:02  0.00% pagezero
     1613 root           1  76    0 14936K  5408K select 0   0:02  0.00% craftd
       32 root           1   8    0     0K    16K dwcint 0   0:02  0.00% dwc0
     2727 root           1  76    0   128M 18556K select 0   0:02  0.00% chassisd
     1607 root           1  76    0  2320K   916K select 0   0:02  0.00% watchdog
     1657 root           1  76    0 15216K  6828K select 0   0:01  0.00% bfdd
     1665 root           1  76    0 26572K  8824K select 0   0:01  0.00% dfwd
     1624 root           2  82    0 21216K  5532K select 0   0:01  0.00% appidd
     1625 root           1  76    0 21684K  6260K select 0   0:01  0.00% appsecured
     1662 root           1   8    0 26620K  9936K nanslp 0   0:01  0.00% nsd
       30 root           1 -28 -147     0K    16K WAIT   0   0:01  0.00% swi5: cambio
     1654 root           1   8    0 27344K  7172K nanslp 0   0:01  0.00% kmd
     2737 budaim         1   6    0 55624K 19116K ttywri 0   0:01  0.00% cli
     1658 root           1  76    0 17452K  5352K select 0   0:01  0.00% lacpd
       45 root           1 -16    0     0K    16K sdflus 0   0:00  0.00% softdepflush
       41 root           1 -16    0     0K    16K psleep 0   0:00  0.00% bufdaemon
       44 root           1  -4    0     0K    16K vlruwt 0   0:00  0.00% vnlru
       50 root           1 -16    0     0K    16K psleep 0   0:00  0.00% vmuncachedaemon
     1634 root           1  95    0 11836K  3128K select 0   0:00  0.00% sdxd
     1349 root           1   8    0  2720K  1160K nanslp 0   0:00  0.00% cron
     1661 root           1  89    0  9548K  2912K select 0   0:00  0.00% relayd
     1617 root           1  87    0  6936K  2292K select 0   0:00  0.00% inetd
     2738 root           1  76    0 46976K  7192K select 0   0:00  0.00% mgd
       59 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md0
     2736 root           1   8    0  4828K  1812K wait   0   0:00  0.00% login
     1239 root           1  76    0  3332K  1288K select 0   0:00  0.00% usbd
       49 root           1   4    0     0K    16K purge_ 0   0:00  0.00% kern_pir_proc
     2533 root           1   4    0  6280K  2052K sbwait 0   0:00  0.00% rcp
     1609 root           1  77    0  3656K  1536K select 0   0:00  0.00% tnetd
        9 root           1 -16    0     0K    16K psleep 0   0:00  0.00% pagedaemon
       31 root           1 -48 -167     0K    16K WAIT   0   0:00  0.00% swi0: uart
     2746 root           1  81    0 24596K  1968K CPU0   0   0:00  0.00% top
     2745 root           1  77    0 46956K  4680K select 0   0:00  0.00% mgd
        7 root           1   8    0     0K    16K -      0   0:00  0.00% thread taskq
     1444 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md3
       36 root           1   8    0     0K    16K usbevt 0   0:00  0.00% usb1
        0 root           1  -8    0     0K     0K WAIT   0   0:00  0.00% swapper
     1459 root           1  -8    0     0K    16K mdwait 0   0:00  0.00% md4
       33 root           1   8    0     0K    16K usbevt 0   0:00  0.00% usb0
       51 root           1  -8    0     0K    16K select 0   0:00  0.00% if_pic_listen0
       47 root           1  -8    0     0K    16K select 0   0:00  0.00% if_pfe_listen
       53 root           1   4    0     0K    16K dump_r 0   0:00  0.00% kern_dump_proc
       46 root           1  76    0     0K    16K sleep  0   0:00  0.00% netdaemon
       54 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 0
       55 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 1
       56 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 2
       57 root           1   8    0     0K    16K -      0   0:00  0.00% nfsiod 3
       35 root           1   8    0     0K    16K dwcint 0   0:00  0.00% dwc1
        6 root           1   8    0     0K    16K -      0   0:00  0.00% kqueue taskq
       34 root           1   8    0     0K    16K usbtsk 0   0:00  0.00% usbtask
        8 root           1   8    0     0K    16K -      0   0:00  0.00% mastership taskq
       10 root           1 -16    0     0K    16K ktrace 0   0:00  0.00% ktrace
       29 root           1 -12 -131     0K    16K WAIT   0   0:00  0.00% swi9: task queue
       28 root           1 -12 -131     0K    16K WAIT   0   0:00  0.00% swi9: +
       27 root           1 -16 -135     0K    16K WAIT   0   0:00  0.00% swi8: +
       24 root           1 -24 -143     0K    16K WAIT   0   0:00  0.00% swi6: vm
       38 root           1 -32 -151     0K    16K WAIT   0   0:00  0.00% swi4: ip6mismatch+
       37 root           1 -36 -155     0K    16K WAIT   0   0:00  0.00% swi3: ip6opt ipopt
       39 root           1 -44 -163     0K    16K WAIT   0   0:00  0.00% swi1: ipfwd
       15 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu7
       14 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu8
       13 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu9
       16 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu6
       11 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu11
       17 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu5
       18 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu4
       12 root           1 171   52     0K    16K CPU0   0   0:00  0.00% idle: cpu10
    


  • 6.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-10-2018 08:38

The 'eventd' process, which is responsible for logging, is among the top processes contributing to the high RE CPU. You may have to check your security policy logging configuration and disable it to reduce CPU utilization.

     

Find out which policy is getting the most hits and disable logging on it if it is enabled.

    show security policies hit-count descending

    clear security policies hit-count <----------------- Reset the count and check again

    show security policies hit-count descending

Also check your syslog configuration and fine-tune it if required (show configuration system syslog).

Once the RE CPU is back to normal, check the cluster status; if it is still not OK, you may have to reboot the secondary node.
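
    If a busy policy does turn out to have logging enabled, turning it off is a one-line change per policy. A sketch, assuming hypothetical zone and policy names (substitute your own):

    configure
    delete security policies from-zone trust to-zone untrust policy allow-any then log session-init   <--- hypothetical names
    delete security policies from-zone trust to-zone untrust policy allow-any then log session-close
    commit
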



  • 7.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-11-2018 01:00

Well, I reviewed the policies, but we need the logging, so I couldn't turn it off. 

A reboot was issued, but the problem is still there. 



  • 8.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-11-2018 02:05

You have to reduce the RE CPU utilization to stabilize the cluster. To do this, enable stream-mode logging to offload the RE: https://kb.juniper.net/InfoCenter/index?page=content&id=KB16224&actp=METADATA

And if possible, remove logging at session-init if it is configured. The session-close log carries a session summary that also tells you when the session started.
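
    For reference, a minimal stream-mode sketch along those lines (the stream name and the x.x.x.x addresses are placeholders; logging stays on session-close only):

    set security log mode stream
    set security log source-address x.x.x.x
    set security log stream traffic-log format sd-syslog
    set security log stream traffic-log host x.x.x.x
    set security policies from-zone trust to-zone untrust policy allow-any then log session-close   <--- hypothetical names
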

     

     



  • 9.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-13-2018 08:00

    Hi,

     

The problem was caused by an MTU mismatch on the links between the chassis members (which run via switches). It is solved now, and everything works fine.

You mentioned syslog settings, session-init specifically. May I have your exact logging configuration? I tried using only session-close logs, but I don't find them very useful, because working out a session's start time as the timestamp minus "elapsed-time" is a little difficult to read.
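
    The arithmetic itself is just "close timestamp minus elapsed-time". A minimal Python sketch (the log fragment and timestamp format are hypothetical; real RT_FLOW_SESSION_CLOSE messages carry many more fields):

    ```python
    from datetime import datetime, timedelta
    import re

    def session_start(close_timestamp: str, log_line: str) -> datetime:
        """Recover a session's start time from a session-close log entry:
        start = close timestamp minus the elapsed-time field (seconds)."""
        elapsed = int(re.search(r'elapsed-time="(\d+)"', log_line).group(1))
        closed = datetime.strptime(close_timestamp, "%Y-%m-%d %H:%M:%S")
        return closed - timedelta(seconds=elapsed)

    # hypothetical session-close fragment
    line = 'RT_FLOW_SESSION_CLOSE ... elapsed-time="90" ...'
    print(session_start("2018-12-13 08:00:00", line))  # 90 s before close
    ```

    
    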

    My recent configuration is:

     

    set security log mode stream
    set security log source-address x.x.x.x
    set security log stream traffic3 format sd-syslog
    set security log stream traffic3 category all
    set security log stream traffic3 host x.x.x.x
    set security log stream traffic3 host port xxxx

     

    Thank you very much in advance!



  • 10.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-13-2018 08:16
    Thanks for the update. Are you still seeing high CPU caused by eventd?


  • 11.  RE: SRX 240 HA cluster lost its secondary unit

    Posted 12-13-2018 08:44

    I found a strange thing in our configuration:

     

    set security log mode stream
    set security log source-address x.x.x.x
    set security log stream traffic
    set security log stream traffic2
    set security log stream traffic3 host x.x.x.x
    set security log stream traffic3 host port xxx

     

Streams "traffic" and "traffic2" were set, but without any server. I deleted them (along with session-init logging), and CPU usage decreased to 0%. After that I enabled session-init logging again. Now CPU usage is about 6%, but it is currently outside working hours. We will see tomorrow. 🙂
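
    For anyone hitting the same thing, the cleanup presumably came down to removing the two orphan streams (names from the config above):

    configure
    delete security log stream traffic
    delete security log stream traffic2
    commit
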