I have a very strange issue on my infrastructure, even though almost nothing has been changed.
We have some floor switches and one cluster of core switches. The diagram is something like the below:
floor3switch (10.0.80.13) <------->4xcluster core switch (10.0.80.10) <-------> another floor switch (10.0.80.12)
All of the above infrastructure belongs to the same management VLAN (vlan-id 80). When I ping the switches from my PC, everything works except floor3switch. The irb interface is configured correctly with vlan-id 80 (it has not been changed since 2018).
Suddenly this week, our monitoring tool lost contact with the switch for 5 minutes, and then connectivity was restored. This has kept happening ever since.
I tried to access the switch from my PC, but got nothing. So I logged in to another switch on the same VLAN (let's call it floor2switch) and tried to ping floor3switch (same VLAN 80), without success. So the problem is not routing related, because connectivity fails within the same VLAN. I started tcpdump on floor3switch and on the core switch, and it seems that floor3sw is broadcasting ARP requests for its gateway (10.0.80.1) and for every switch I tell it to ping. The ARP requests are broadcast successfully and reach their destinations, but the replies never get back to floor3. Then connectivity suddenly comes back for about 5 minutes, and I can ping my gateway but still cannot ping all of the floor switches. I have to leave a ping running for a while to get ARP resolved for each of the other devices.
tcpdump on my floor3switch is below.
tcpdump -ne -i irb not port 161
17:15:09.252177 Out 38:4f:49:xx:xx:20 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 80, p 0, ethertype ARP, arp who-has 10.0.80.12 tell 10.0.80.13
tcpdump on my floor2switch is below.
tcpdump -ne -i irb
17:40:34.760738 Out 7c:e2:ca:xx:xx:90 > 38:4f:49:xx:xx:20, ethertype 802.1Q (0x8100), length 50: vlan 80, p 0, ethertype ARP, arp reply 10.0.80.12 is-at 7c:e2:ca:67:05:90
tcpdump on my core-switch is below.
tcpdump -ne -i ae3 not port 22
17:36:22.226116 In 38:4f:49:xx:xx:20 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 80, p 0, ethertype ARP, arp who-has 10.0.80.12 tell 10.0.80.13
As you can see from the above, floor3sw sends the ARP request to the core switch, and the core switch forwards it to floor2. floor2 sends the ARP reply (unicast) to floor3, but floor3 never receives it.
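To make the request/reply comparison above repeatable across captures, here is a minimal Python sketch. It assumes the input lines have the same `tcpdump -ne` shape as the captures in this post (timestamp, then In/Out, then the ARP text); the function names `summarize_arp` and `unanswered` are hypothetical, not part of any tool.

```python
import re

# Patterns for the ARP portion of a `tcpdump -ne` line, as in the captures above.
ARP_REQ = re.compile(r"arp who-has (\S+) tell (\S+)")
ARP_REP = re.compile(r"arp reply (\S+) is-at (\S+)")
# Junos tcpdump prints the direction (In/Out) right after the timestamp.
DIRECTION = re.compile(r"^\S+\s+(In|Out)\s")


def summarize_arp(lines):
    """Split capture lines into ARP requests and ARP replies.

    Requests are (direction, target_ip, sender_ip);
    replies are (direction, ip, mac).
    """
    requests, replies = [], []
    for line in lines:
        m_dir = DIRECTION.match(line)
        direction = m_dir.group(1) if m_dir else "?"
        m = ARP_REQ.search(line)
        if m:
            requests.append((direction, m.group(1), m.group(2)))
            continue
        m = ARP_REP.search(line)
        if m:
            replies.append((direction, m.group(1), m.group(2)))
    return requests, replies


def unanswered(requests, replies):
    """Targets of outgoing requests for which no reply was seen coming In."""
    answered = {ip for direction, ip, _mac in replies if direction == "In"}
    return sorted({target for direction, target, _sender in requests
                   if direction == "Out" and target not in answered})
```

Fed with a capture taken on floor3's irb, a non-empty result from `unanswered` would match the symptom described here: requests leave the switch, but the replies never arrive.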
floor3> show arp no-resolve
MAC Address       Address     Interface       Flags
00:10:db:ff:10:01 10.0.80.1   irb.80 [ae3.0]  none
The ARP entries for floor2 and the other floor switches are not present.
floor02> show arp no-resolve
MAC Address       Address     Interface       Flags
00:10:db:ff:10:01 10.0.80.1   irb.80 [ae16.0] none --> SRX
44:aa:xx:xx:xx    10.0.80.10  irb.80 [ae16.0] none --> Core-sw
38:4f:49:xx:xx:20 10.0.80.13  irb.80 [ae16.0] none --> Floor3
Also, RSTP is globally enabled on floor3, but no logs point to a potential loop.
All ports show the same status:
show interfaces |match "BPDU Error"
Speed: Auto, Duplex: Auto, BPDU Error: None, Loop Detect PDU Error: None,
Thank you for your time.
This sounds like a problem on the physical interface(s) that form the LAG (ae) interface. One suggestion: check MAC address learning on each port, end to end, and compare the state when it works with the state when it does not. Another suggestion is to shut down the interfaces in the LAG on floor3switch one by one and see how it behaves.
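The working-versus-failing comparison can be sketched in Python as a diff of two MAC table snapshots. This is only an illustration: it assumes you have already turned each snapshot into a `{mac: interface}` dictionary by hand or with your own parser, and both the function name `diff_mac_tables` and the member interface names in the usage example are hypothetical.

```python
def diff_mac_tables(working, failing):
    """Compare two {mac: interface} snapshots of a switch's MAC table.

    Returns MACs that disappeared, MACs learned on a different
    interface, and MACs that only appear in the failing snapshot.
    """
    missing = sorted(mac for mac in working if mac not in failing)
    moved = {mac: (working[mac], failing[mac])
             for mac in working
             if mac in failing and working[mac] != failing[mac]}
    new = sorted(mac for mac in failing if mac not in working)
    return {"missing": missing, "moved": moved, "new": new}


# Usage with made-up interface names (assumptions, not from the thread):
working = {"38:4f:49:xx:xx:20": "ae3.0", "00:10:db:ff:10:01": "ae1.0"}
failing = {"38:4f:49:xx:xx:20": "ge-0/0/5.0"}
print(diff_mac_tables(working, failing))
```

A MAC that shows up under "moved" or "missing" only during the outage would point at the hop where the unicast ARP reply gets misdirected.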
I recently had connectivity issues with a new setup. After looking further, I found an ARP entry on the switch for an interface with the MAC of a different device, one external to the switch infrastructure. I have no idea how that happened. I did a run clear arp to clear the table, and I also pulled one of the uplinks in the ae. I am now rechecking my config.
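A check for this kind of stray entry can be sketched in Python, assuming you keep an inventory of the MAC each management IP should have. The function name `find_unexpected_arp` and the inventory are hypothetical; the entries are (MAC, IP) pairs copied from show arp output.

```python
def find_unexpected_arp(entries, expected):
    """Flag ARP entries whose MAC differs from the inventory.

    entries:  iterable of (mac, ip) pairs taken from the ARP table
    expected: {ip: mac} inventory of known devices (an assumption)
    Returns a list of (ip, seen_mac, expected_mac) mismatches.
    """
    return [(ip, mac, expected[ip])
            for mac, ip in entries
            if ip in expected and mac.lower() != expected[ip].lower()]


# Usage: the second MAC is a made-up value standing in for a foreign device.
inventory = {"10.0.80.1": "00:10:db:ff:10:01"}
arp_table = [("00:10:DB:FF:10:01", "10.0.80.1")]
print(find_unexpected_arp(arp_table, inventory))
```

IPs that are not in the inventory are ignored rather than flagged, so the result only lists addresses you know about that resolved to the wrong MAC.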