Dear Security community,
I'm facing a really strange issue with an SRX1500 cluster: the two nodes seem to have lost communication with each other, and traffic is no longer being processed. All services, including the IPsec VPNs and BGP sessions, are currently down. After checking, I noticed that the number of fabric probes received on node0 (the primary) is always 0, and that fab0 is physically up but shows a monitored status of Down. I even disabled fabric link monitoring with "set chassis cluster no-fabric-monitoring" and then rebooted node0, but the issue persists. I have forced a failover between the nodes many times and rebooted both nodes several times, and the behaviour stays the same. Please find all the information about the cluster below:
Software version: JUNOS Software Release [15.1X49-D180.2]
Cluster configuration (I have deactivated the monitoring for now to troubleshoot further):
@sn-dx-node0> show configuration chassis cluster
no-fabric-monitoring;
reth-count 128;
redundancy-group 0 {
    node 0 priority 100;
    node 1 priority 1;
}
redundancy-group 1 {
    node 0 priority 100;
    node 1 priority 1;
    inactive: interface-monitor {
        xe-0/0/16 weight 255;
        xe-7/0/16 weight 255;
    }
}
redundancy-group 2 {
    node 0 priority 100;
    node 1 priority 1;
    inactive: interface-monitor {
        ge-0/0/12 weight 255;
        ge-7/0/12 weight 255;
    }
}
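The fab0/fab1 member interfaces are not part of the chassis cluster stanza above. For reference, a sketch of how they would be expected to look under [edit interfaces], assuming each fabric link has only the child interface listed by show chassis cluster interfaces (ge-0/0/11 on node0, ge-7/0/11 on node1), is:

```
interfaces {
    /* node0 side of the fabric link (assumed from the fab0 child interface) */
    fab0 {
        fabric-options {
            member-interfaces {
                ge-0/0/11;
            }
        }
    }
    /* node1 side of the fabric link (assumed from the fab1 child interface) */
    fab1 {
        fabric-options {
            member-interfaces {
                ge-7/0/11;
            }
        }
    }
}
```

The actual output of "show configuration interfaces fab0" and "show configuration interfaces fab1" on this cluster may differ from this sketch.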
Cluster status:
Cluster ID: 1
Node    Priority  Status     Preempt  Manual  Monitor-failures

Redundancy group: 0 , Failover count: 1
node0   255       primary    no       yes     None
node1   1         secondary  no       yes     None

Redundancy group: 1 , Failover count: 1
node0   255       primary    no       yes     HW
node1   1         secondary  no       yes     None

Redundancy group: 2 , Failover count: 1
node0   255       primary    no       yes     HW
node1   1         secondary  no       yes     None
Cluster statistics:
{primary:node0}
@sn-dx-node0> show chassis cluster statistics
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 70200
        Heartbeat packets received: 70207
        Heartbeat packet errors: 0
Fabric link statistics:
    Child link 0
        Probes sent: 159815
        Probes received: 0
    Child link 1
        Probes sent: 0
        Probes received: 0
Cluster interfaces:
msaidani@sn-dx-node0> show chassis cluster interfaces
Control link status: Up

Control interfaces:
    Index   Interface   Monitored-Status   Internal-SA   Security
    0       em0         Up                 Disabled      Disabled

Fabric link status: Down

Fabric interfaces:
    Name    Child-interface    Status                  Security
                               (Physical/Monitored)
    fab0    ge-0/0/11          Up   / Down             Disabled
    fab0
    fab1    ge-7/0/11          Up   / Up               Disabled
    fab1

Redundant-ethernet Information:
    Name         Status      Redundancy-group
    reth0        Down        Not configured
    reth1        Up          1
    reth2        Down        2
Interface ge-0/0/11 status:
msaidani@sn-dx-node0> show interfaces ge-0/0/11
Physical interface: ge-0/0/11, Enabled, Physical link is Up
  Interface index: 323, SNMP ifIndex: 523
  Link-level type: 64, MTU: 9014, LAN-PHY mode, Link-mode: Full-duplex,
  Speed: 1000mbps, BPDU Error: None, MAC-REWRITE Error: None,
  Loopback: Disabled, Source filtering: Disabled, Flow control: Enabled,
  Auto-negotiation: Enabled, Remote fault: Online
  Device flags   : Present Running
  Interface flags: SNMP-Traps Internal: 0x4000
  Link flags     : None
  CoS queues     : 8 supported, 8 maximum usable queues
  Current address: c0:bf:a7:a5:30:30, Hardware address: c0:bf:a7:a5:2f:0b
  Last flapped   : 2022-01-15 20:06:49 UTC (19:34:08 ago)
  Input rate     : 0 bps (0 pps)
  Output rate    : 2264 bps (1 pps)
  Active alarms  : None
  Active defects : None
  Interface transmit statistics: Disabled

  Logical interface ge-0/0/11.0 (Index 77) (SNMP ifIndex 572)
    Flags: Up SNMP-Traps 0x4000 Encapsulation: ENET2
    Input packets : 123
    Output packets: 170252
    Security: Zone: Null
    Protocol aenet, AE bundle: fab0.0   Link Index: 0
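Because node0 reports 159815 probes sent on child link 0 but none received, and ge-0/0/11 shows almost no input, one way to check whether any probe ever arrives after a change is to reset the counters and read them again. A sketch using standard Junos operational commands follows; the order is only a suggestion, and the same checks would presumably be repeated on node1 for fab1 and ge-7/0/11:

```
{primary:node0}
msaidani@sn-dx-node0> clear chassis cluster statistics
msaidani@sn-dx-node0> show chassis cluster statistics
msaidani@sn-dx-node0> show chassis cluster interfaces
msaidani@sn-dx-node0> show interfaces ge-0/0/11 extensive
```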
Troubleshooting actions taken so far:
- Rebooted both devices several times.
- Rebooted a single device.
- Configured "set chassis cluster no-fabric-monitoring", then rebooted node0.
- Ran "request chassis cluster failover redundancy-group 0 node 0 force".
- Logged in to the secondary and ran "request chassis cluster configuration-synchronize".
- Moved the physical cable to a different port and updated the configuration to match.
- Replaced the physical cable with a new one.
As a last resort, I'm thinking about disabling the cluster and rebuilding it from scratch, but before doing that I wanted to check whether you have any other suggestions.
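For reference, the rebuild would roughly amount to the following operational-mode commands, shown only as a sketch: it assumes cluster ID 1 is kept (as in the cluster status above), and user@node0 / user@node1 are placeholder prompts for the two devices. Each command reboots the device it is run on:

```
user@node0> set chassis cluster disable reboot
user@node1> set chassis cluster disable reboot

user@node0> set chassis cluster cluster-id 1 node 0 reboot
user@node1> set chassis cluster cluster-id 1 node 1 reboot
```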
Thank you in advance
------------------------------
Maroua Saidani
------------------------------