This message was posted by a user wishing to remain anonymous
Hello guys,
I have a very strange issue on my infrastructure, even though almost nothing has been changed.
We have some floor switches and one cluster of core switches. The topology looks roughly like this:
floor3switch (10.0.80.13) <------->4xcluster core switch (10.0.80.10) <-------> another floor switch (10.0.80.12)
|
|
SRX 10.0.80.1
All of the above devices belong to the same management VLAN (vlan-id 80). When I ping the switches from my PC, everything works fine except for floor3switch. The irb interface is configured correctly with vlan-id 80 (it has not been changed since 2018).
Suddenly this week, our monitoring tool lost contact with the switch for 5 minutes, and then connectivity came back just as suddenly. This has kept happening ever since.
I tried to access the switch from my PC, but got nothing. So I logged in to another switch on the same VLAN (let's call it floor2switch) and tried to ping floor3switch (same VLAN 80), without success. So the problem is not routing-related, since connectivity fails within the same VLAN.
I started tcpdump on floor3switch and on the core switch, and it seems that floor3switch is broadcasting ARP requests for its gateway (10.0.80.1) and for every switch I tell it to ping.
The ARP requests are broadcast successfully and reach their destinations, but the replies never make it back to floor3. Then connectivity suddenly returns for 5 minutes and I can ping my gateway, but still not all of the floor switches. I have to leave the ping running for a while before ARP resolves for each of the other devices.
The tcpdump on floor3switch is below:
tcpdump -ne -i irb not port 161
17:15:09.252177 Out 38:4f:49:xx:xx:20 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 80, p 0, ethertype ARP, arp who-has 10.0.80.12 tell 10.0.80.13
The tcpdump on floor2switch is below:
tcpdump -ne -i irbp
17:40:34.760738 Out 7c:e2:ca:xx:xx:90 > 38:4f:49:xx:xx:20, ethertype 802.1Q (0x8100), length 50: vlan 80, p 0, ethertype ARP, arp reply 10.0.80.12 is-at 7c:e2:ca:67:05:90
The tcpdump on the core switch is below:
tcpdump -ne -i ae3 not port 22
17:36:22.226116 In 38:4f:49:xx:xx:20 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 80, p 0, ethertype ARP, arp who-has 10.0.80.12 tell 10.0.80.13
As you can see from the above, floor3switch sends the ARP request to the core switch, and the core switch forwards it to floor2. floor2 sends the (unicast) ARP reply to floor3switch, but floor3switch never receives it.
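To make the comparison between capture points less error-prone, here is a minimal Python sketch that summarizes which ARP events for a given IP were seen on each device. It only assumes the `tcpdump -ne` line format shown in the three captures above; the names `parse_arp`, `summarize`, and the device labels are made up for illustration.

```python
import re

# Matches the ARP part of a `tcpdump -ne` line in the format shown above.
ARP_RE = re.compile(
    r"^\S+\s+(?P<dir>In|Out)\s+.*"
    r"arp (?:who-has (?P<target>\S+) tell \S+"
    r"|reply (?P<owner>\S+) is-at \S+)"
)


def parse_arp(line):
    """Return (kind, direction, ip) for an ARP line, or None if it does not match."""
    m = ARP_RE.search(line.strip())
    if m is None:
        return None
    if m.group("target") is not None:
        return ("request", m.group("dir"), m.group("target"))
    return ("reply", m.group("dir"), m.group("owner"))


def summarize(captures, ip):
    """Map each device name to the set of (kind, direction) ARP events seen for ip."""
    seen = {}
    for device, lines in captures.items():
        events = set()
        for line in lines:
            parsed = parse_arp(line)
            if parsed is not None and parsed[2] == ip:
                events.add(parsed[:2])
        seen[device] = events
    return seen


# The three capture lines from this post, keyed by hypothetical device labels.
captures = {
    "floor3switch": [
        "17:15:09.252177 Out 38:4f:49:xx:xx:20 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 80, p 0, ethertype ARP, arp who-has 10.0.80.12 tell 10.0.80.13",
    ],
    "floor2switch": [
        "17:40:34.760738 Out 7c:e2:ca:xx:xx:90 > 38:4f:49:xx:xx:20, ethertype 802.1Q (0x8100), length 50: vlan 80, p 0, ethertype ARP, arp reply 10.0.80.12 is-at 7c:e2:ca:67:05:90",
    ],
    "core": [
        "17:36:22.226116 In 38:4f:49:xx:xx:20 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 80, p 0, ethertype ARP, arp who-has 10.0.80.12 tell 10.0.80.13",
    ],
}

for device, events in summarize(captures, "10.0.80.12").items():
    print(device, sorted(events))
```

With longer captures from each box, the device where the reply shows up as "Out" on one side but never as "In" on the next hop is where I would look first.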
floor3> show arp no-resolve
MAC Address        Address      Interface        Flags
00:10:db:ff:10:01  10.0.80.1    irb.80 [ae3.0]   none
The ARP entries for floor2 and the other floor switches are not present.
floor02> show arp no-resolve
MAC Address        Address      Interface        Flags
00:10:db:ff:10:01  10.0.80.1    irb.80 [ae16.0]  none  --> SRX
44:aa:xx:xx:xx     10.0.80.10   irb.80 [ae16.0]  none  --> Core-sw
38:4f:49:xx:xx:20  10.0.80.13   irb.80 [ae16.0]  none  --> Floor3
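The comparison of the two ARP tables can also be scripted. This is only a rough sketch: it assumes the IP address is the second whitespace-separated column of `show arp no-resolve`, as in the output above, and `arp_ips` / `missing_neighbors` are hypothetical helper names.

```python
import ipaddress


def arp_ips(show_arp_output):
    """Collect the IP addresses found in the second column of the ARP table."""
    found = set()
    for line in show_arp_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            found.add(str(ipaddress.ip_address(fields[1])))
        except ValueError:
            continue  # header, prompt, or any other non-entry line
    return found


def missing_neighbors(show_arp_output, expected, own_ip):
    """Return the expected IPs (minus our own) that have no ARP entry."""
    return sorted(set(expected) - arp_ips(show_arp_output) - {own_ip})


# Management addresses mentioned in this post.
expected = ["10.0.80.1", "10.0.80.10", "10.0.80.12", "10.0.80.13"]

floor3_arp = """\
MAC Address Address Interface Flags
00:10:db:ff:10:01 10.0.80.1 irb.80 [ae3.0] none
"""

floor2_arp = """\
MAC Address Address Interface Flags
00:10:db:ff:10:01 10.0.80.1 irb.80 [ae16.0] none -->SRX
44:aa:xx:xx:xx 10.0.80.10 irb.80 [ae16.0] none --> Core-sw
38:4f:49:xx:xx:20 10.0.80.13 irb.80 [ae16.0] none -->Floor3
"""

print("floor3 missing:", missing_neighbors(floor3_arp, expected, "10.0.80.13"))
print("floor2 missing:", missing_neighbors(floor2_arp, expected, "10.0.80.12"))
```

Running this periodically against each switch would show whether the missing entries line up with the 5-minute outages.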
Also, RSTP is globally enabled on floor3, but there are no logs indicating a potential loop.
All ports show the same status:
show interfaces |match "BPDU Error"
Speed: Auto, Duplex: Auto, BPDU Error: None, Loop Detect PDU Error: None,
Thank you for your time.