Original Message:
Sent: 09-30-2025 04:48
From: EMTSU
Subject: Incomprehensible behavior with all EX2300s at the site after planned power outage
Were you able to get to the bottom of this?
Original Message:
Sent: 07-18-2025 10:34
From: TacticalDonut164
Subject: Incomprehensible behavior with all EX2300s at the site after planned power outage
Hoping to get some help here with a very confusing problem we are having.
I have a ticket open with JTAC and have worked with a few different engineers on this without any success.
To give some context, this site is really big; it's basically three sites in one, so I'll refer to them as site 1 (prefix 1-), site 2 (2-), and site 3 (3-).
I hope the topology below helps to clarify this setup (obviously IPs and names are not accurate):
On Saturday, July 12th, site 3 had a scheduled power outage starting at 8:00 AM MDT. As requested, I scheduled their six IDFs (3-AS1 through 3-AS6) to power off at 7:00 AM MDT.
Beginning at 8:55 AM CDT (7:55 AM MDT, i.e. right around when the power outage started; they may have started early), every single EX2300 series switch at the site went down simultaneously:
This included one switch at site 1 and five switches at site 2. Once the maintenance was over, three switches at site 3 never came back up. The only unusual thing about the maintenance is that someone screwed it up and took 3-CR (site 3's core) down as well; it came back up a bit later.
If you log into any of the site's core switches and try to ping the 2300s, you get this:
1-CR> ping 1-AS4
PING 1-as4.company.com (10.0.0.243): 56 data bytes
64 bytes from 10.0.0.243: icmp_seq=1 ttl=64 time=4792.365 ms
64 bytes from 10.0.0.243: icmp_seq=2 ttl=64 time=4691.200 ms
64 bytes from 10.0.0.243: icmp_seq=13 ttl=64 time=4808.979 ms
64 bytes from 10.0.0.243: icmp_seq=14 ttl=64 time=4713.175 ms
^C
--- 1-as4.company.com ping statistics ---
22 packets transmitted, 4 packets received, 81% packet loss
round-trip min/avg/max/stddev = 4691.200/4751.430/4808.979/50.196 ms
It is completely impossible to remote into any of these; we have to work with the site to get console access.
On sessions with JTAC, we determined that the CPU is not high, there is no problem with heap or storage, and all transit traffic continues to flow perfectly normally. Usually onsite IT will actually be plugged into the impacted switch during our meeting with no problems at all. Everything looks completely normal from a user standpoint, thankfully.
- We have tried rebooting the switch, with no success.
- Then we tried upgrading the code from 21.something to 23.4R2-S4 (which produced a PoE Short Circuit alarm), with no success.
- I tried to add another IRB in a different subnet, with no success.
- We put two computers on that switch in the management VLAN (i.e. the 10.0.0.0/24 segment), statically assigned IPs, and both computers could ping each other with sub-10ms response times.
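For anyone who wants to reproduce the health checks we ran with JTAC (CPU, heap/memory, storage, and host-bound ICMP counters), they were along the lines of the following standard Junos operational commands (output omitted):

```
> show chassis routing-engine         # RE CPU and memory utilization
> show system processes extensive     # per-process CPU usage
> show system storage                 # filesystem usage
> show system statistics icmp         # ICMP handled by the host path
```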
There is one exception to most of these findings: 2-AS3, the switch highlighted in yellow.
- On Saturday night, you could ping it. One of my colleagues was able to SCP into it to upgrade firmware. I could not get into it except via Telnet on a jump server.
- Mist could see it, but attempting to upgrade via Mist returned a connectivity error.
- The next morning, I could no longer ping it. I could still get in with Telnet only on that jump server.
- I added a new IRB in a different subnet. After committing the changes I could ping that IP but still not do anything else with it.
- The morning after that, I could no longer ping the new IP either.
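For reference, adding the test IRB looked roughly like this (the VLAN name, VLAN ID, addresses, and access port here are placeholders, not our real config):

```
set vlans MGMT-TEST vlan-id 999 l3-interface irb.999
set interfaces irb unit 999 family inet address 10.0.9.10/24
set interfaces ge-0/0/47 unit 0 family ethernet-switching vlan members MGMT-TEST
commit confirmed 5
```

`commit confirmed` auto-rolls back after 5 minutes unless you confirm, which is the safe way to commit on a switch you can barely reach.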
If you try to ping it from up here at the HQ, you get:
HQ-CR> ping 2-AS3
PING 2-as3.company.com (10.0.0.234): 56 data bytes
64 bytes from 10.0.0.234: icmp_seq=0 ttl=62 time=95.480 ms
64 bytes from 10.0.0.234: icmp_seq=1 ttl=62 time=91.539 ms
64 bytes from 10.0.0.234: icmp_seq=2 ttl=62 time=97.411 ms
64 bytes from 10.0.0.234: icmp_seq=3 ttl=62 time=81.785 ms
If you try to ping the HQ core from 2-AS3, you get:
2-AS3> ping 10.0.1.254
PING 10.0.1.254 (10.0.1.254): 56 data bytes
64 bytes from 10.0.1.254: icmp_seq=0 ttl=62 time=4763.407 ms
64 bytes from 10.0.1.254: icmp_seq=1 ttl=62 time=4767.519 ms
64 bytes from 10.0.1.254: icmp_seq=3 ttl=62 time=4767.144 ms
64 bytes from 10.0.1.254: icmp_seq=4 ttl=62 time=4763.674 ms
^C
--- 10.0.1.254 ping statistics ---
11 packets transmitted, 4 packets received, 63% packet loss
round-trip min/avg/max/stddev = 4763.407/4765.436/4767.519/1.902 ms
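Since we've been hand-copying these summaries from a lot of switches, here is a minimal sketch that parses the BSD-style ping summary lines shown above into loss percentage and average RTT, so reachability can be logged over time. The regexes assume the exact summary format in this post; adjust them if your ping prints differently.

```python
import re

def parse_ping_summary(text):
    """Parse BSD-style ping output into (loss %, avg RTT ms).

    Expects the summary lines ping prints, e.g.:
      22 packets transmitted, 4 packets received, 81% packet loss
      round-trip min/avg/max/stddev = 4691.200/4751.430/4808.979/50.196 ms
    avg RTT is None if no replies came back.
    """
    counts = re.search(r"(\d+) packets transmitted, (\d+) packets received", text)
    if not counts:
        raise ValueError("no ping summary found")
    sent, recv = int(counts.group(1)), int(counts.group(2))
    loss_pct = 100.0 * (sent - recv) / sent if sent else 0.0
    rtt = re.search(r"= [\d.]+/([\d.]+)/", text)
    avg_ms = float(rtt.group(1)) if rtt else None
    return loss_pct, avg_ms

sample = """\
--- 1-as4.company.com ping statistics ---
22 packets transmitted, 4 packets received, 81% packet loss
round-trip min/avg/max/stddev = 4691.200/4751.430/4808.979/50.196 ms
"""
loss_pct, avg_ms = parse_ping_summary(sample)
print(round(loss_pct, 1), avg_ms)   # 81.8 4751.43
```

Running it against the 1-AS4 output above recovers the 81.8% loss and the ~4.75 s average RTT; feeding it periodic captures makes it easy to see exactly when a switch's reachability changes.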
It's not something with the WAN, the Internet, or the EdgeConnect, because with the exception of this switch, you get these terrible response times even when pinging from the core, which is in the same subnet, so it is literally just switch-to-switch traffic.
1-CR> show route forwarding-table destination 1-AS4
Routing table: default.inet
Internet:
Destination Type RtRef Next hop Type Index NhRef Netif
10.0.0.243/32 dest 0 44:aa:50:XX:XX:XX ucst 1817 1 ae4.0
1-CR> show interfaces ae4 descriptions
Interface Admin Link Description
ae4 up up 1-AS4
So I am unsure what's going on here. We have looked and looked. There doesn't seem to be a loop or a storm. Onsite IT doesn't have access to any of these switches, so they could not have made any changes.
The power outage is the only cause I can think of, because it is the only change we approved and it went through the change advisory board. I'm not saying shadow IT didn't do something stupid, but consider the timing: the switches went down right at the start of the maintenance...
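For completeness, the loop/storm checks amounted to roughly the following (standard Junos operational commands; ae4 is just an example interface):

```
> show spanning-tree interface                    # blocked or flapping ports
> show interfaces ae4 extensive | match "error|drops"
> show ethernet-switching table summary           # MAC table size/churn
> show log messages | match "STP|storm|loop"
```

Nothing in any of these stood out on the affected switches.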
I just have no idea. If I can get some suggestions to bring into our next meeting with JTAC, that would be great.
Thanks!