Switching


Ask questions and share experiences about EX and QFX portfolios and all switching solutions across your data center, campus, and branch locations.
  • 1.  Incomprehensible behavior with all EX2300s at the site after planned power outage

    Posted 07-18-2025 10:35
    Edited by TacticalDonut164 07-18-2025 10:46

    Hoping to get some help here with a very confusing problem we are having.

    I have a ticket open with JTAC and have worked with a few different engineers on this without any success.

    To give some context, this site is really big, it's basically three sites in one. So let's just say site 1 (1-), site 2 (2-), site 3 (3-).

    I hope the topology below helps to clarify this setup (obviously IPs and names are not accurate):

    On Saturday, July 12th, site 3 had a scheduled power outage starting at 8:00 AM MDT. As requested, I scheduled their six IDFs (3-AS1 through 3-AS6) to power off at 7:00 AM MDT.

    Beginning at 8:55 AM CDT (7:55 AM MDT, i.e. right around when the power outage started, they may have started early), every single EX2300 series switch at the site went down simultaneously:

    This included one switch at site 1, and five switches at site 2. Once the maintenance was over, three switches at site 3 never came back up. The only thing unusual about the maintenance is that someone screwed it up and took 3-CR (site 3's core) down as well before it came back up a bit later.

    If I log into any of the site's core switches and try to ping the 2300s, I get this:

    1-CR> ping 1-AS4
    PING 1-as4.company.com (10.0.0.243): 56 data bytes
    64 bytes from 10.0.0.243: icmp_seq=1 ttl=64 time=4792.365 ms
    64 bytes from 10.0.0.243: icmp_seq=2 ttl=64 time=4691.200 ms
    64 bytes from 10.0.0.243: icmp_seq=13 ttl=64 time=4808.979 ms
    64 bytes from 10.0.0.243: icmp_seq=14 ttl=64 time=4713.175 ms
    ^C
    --- 1-as4.company.com ping statistics ---
    22 packets transmitted, 4 packets received, 81% packet loss
    round-trip min/avg/max/stddev = 4691.200/4751.430/4808.979/50.196 ms
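    To track this behavior over time (the near-constant ~4.7 s RTT and the large gaps in icmp_seq are the interesting part), the ping summary can be parsed and logged programmatically. A minimal sketch, assuming BSD/Junos-style ping output like the capture above; the host names and sample text here are just the ones from this post:

```python
import re

def parse_ping_summary(output: str) -> dict:
    """Extract packet counts, loss %, and RTT stats from ping output."""
    stats = {}
    m = re.search(
        r"(\d+) packets transmitted, (\d+) packets received, (\d+)% packet loss",
        output,
    )
    if m:
        stats["sent"] = int(m.group(1))
        stats["received"] = int(m.group(2))
        stats["loss_pct"] = int(m.group(3))
    m = re.search(
        r"round-trip min/avg/max/stddev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms",
        output,
    )
    if m:
        (stats["rtt_min"], stats["rtt_avg"],
         stats["rtt_max"], stats["rtt_stddev"]) = map(float, m.groups())
    return stats

# Summary block copied from the 1-CR> ping 1-AS4 capture above:
sample = """--- 1-as4.company.com ping statistics ---
22 packets transmitted, 4 packets received, 81% packet loss
round-trip min/avg/max/stddev = 4691.200/4751.430/4808.979/50.196 ms"""

print(parse_ping_summary(sample))
```

    Run periodically (e.g. from cron on a jump server), this would show whether the loss and the ~4.7 s RTT are constant or come and go, which is worth bringing to the JTAC session.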

    It is completely impossible to remote into any of these; we have to work with the site to get console access.

    On sessions with JTAC, we determined that the CPU is not high, there is no problem with heap or storage, and all transit traffic continues to flow perfectly normally. Usually onsite IT will actually be plugged into the impacted switch during our meeting with no problems at all. Everything looks completely normal from a user standpoint, thankfully.

    • We have tried rebooting the switch, with no success.
    • Then we tried upgrading the code from 21.something to 23.4R2-S4 (which produced a PoE Short Circuit alarm), with no success.
    • I tried to add another IRB in a different subnet, with no success.
    • We put two computers on that switch in the management VLAN (i.e. the 10.0.0/24 segment), statically assigned IPs, and both computers could ping each other with sub-10ms response times.

    There is one exception to most of these findings: 2-AS3, the switch highlighted in yellow.

    • On Saturday night, you could ping it. One of my colleagues was able to SCP into it to upgrade firmware. I could not get into it except via Telnet on a jump server.
    • Mist could see it, but attempting to upgrade via Mist returned a connectivity error.
    • The next morning, I could no longer ping it. I could still get in with Telnet only on that jump server.
    • I added a new IRB in a different subnet. After committing the changes I could ping that IP but still not do anything else with it.
    • The morning after that, I could no longer ping the new IP either.

    If you try to ping it from up here at the HQ, you get:

    HQ-CR> ping 2-AS3
    PING 2-as3.company.com (10.0.0.234): 56 data bytes
    64 bytes from 10.0.0.234: icmp_seq=0 ttl=62 time=95.480 ms
    64 bytes from 10.0.0.234: icmp_seq=1 ttl=62 time=91.539 ms
    64 bytes from 10.0.0.234: icmp_seq=2 ttl=62 time=97.411 ms
    64 bytes from 10.0.0.234: icmp_seq=3 ttl=62 time=81.785 ms

    If you try to ping the HQ core from 2-AS3, you get:

    2-AS3> ping 10.0.1.254
    PING 10.0.1.254 (10.0.1.254): 56 data bytes
    64 bytes from 10.0.1.254: icmp_seq=0 ttl=62 time=4763.407 ms
    64 bytes from 10.0.1.254: icmp_seq=1 ttl=62 time=4767.519 ms
    64 bytes from 10.0.1.254: icmp_seq=3 ttl=62 time=4767.144 ms
    64 bytes from 10.0.1.254: icmp_seq=4 ttl=62 time=4763.674 ms
    ^C
    --- 10.0.1.254 ping statistics ---
    11 packets transmitted, 4 packets received, 63% packet loss
    round-trip min/avg/max/stddev = 4763.407/4765.436/4767.519/1.902 ms

    It's not something with the WAN or the INET or the EdgeConnect, because (with the exception of this one switch) you get these terrible response times even pinging from the core, which is in the same subnet, so it is literally just switch-to-switch traffic.

    1-CR> show route forwarding-table destination 1-AS4
    Routing table: default.inet
    Internet:
    Destination     Type RtRef Next hop           Type Index NhRef Netif
    10.0.0.243/32   dest     0 44:aa:50:XX:XX:XX  ucst  1817     1 ae4.0

    1-CR> show interfaces ae4 descriptions
    Interface  Admin Link Description
    ae4        up    up   1-AS4
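    When checking a lot of switches, it can help to pull the next hop and egress interface out of those forwarding-table lines mechanically and confirm each 2300 resolves onto the expected AE bundle. A minimal sketch; the field names are my own, assuming the standard Junos column layout shown above:

```python
def parse_fwd_entry(line: str) -> dict:
    """Split a Junos forwarding-table entry line into named fields.
    Assumed column order: Destination Type RtRef Next-hop Type Index NhRef Netif."""
    keys = ["destination", "type", "rtref", "next_hop",
            "nh_type", "index", "nhref", "netif"]
    return dict(zip(keys, line.split()))

# Entry line copied from the 1-CR> output above:
entry = parse_fwd_entry("10.0.0.243/32 dest 0 44:aa:50:XX:XX:XX ucst 1817 1 ae4.0")
print(entry["next_hop"], entry["netif"])
```

    If the MAC and Netif match the interface description for each affected switch, that at least rules out the core forwarding toward the wrong port.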

    So I am unsure as to what's going on here. We have looked and looked. There doesn't seem to be a loop or a storm. Onsite IT doesn't have access to any of these switches so they could not have made any changes to these.

    The power outage is the only thing I can think of, because it is the only change we approved and it went through the change advisory board. I'm not saying shadow IT didn't do something stupid, but consider also the timing of the switches going down right at the start of the maintenance...

    I just have no idea. If I can get some suggestions so I can bring those into our next meeting with JTAC that would be great.

    Thanks!



  • 2.  RE: Incomprehensible behavior with all EX2300s at the site after planned power outage

    This message was posted by a user wishing to remain anonymous
    Posted 07-21-2025 03:29

    We have had something similar on one of our EX2300-48Ps. Over the last few years it would operate fine but occasionally would not respond on its management IP, sometimes for a few hours, other times for a few days. This was mostly back on 20.4, and I think we had it a few times on 21.4, but it is not something we have seen since we upgraded to 23.4R2-S3.




  • 3.  RE: Incomprehensible behavior with all EX2300s at the site after planned power outage

    Posted 07-21-2025 15:18

    How good are you guys at setting your recovery snapshots after your code upgrades? The reason I ask is that you might have a software mismatch on one of your Virtual Chassis. I have seen some strange behavior with software mismatches within a Virtual Chassis when a member fails to boot and falls back to the recovery partition (which still held the previous version).

    Just throwing that out there as something to look out for.
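    Once you have `show version` from each VC member (via console, since these aren't remotely reachable), a mismatch check is trivial to automate. A minimal sketch; the member numbers and version strings below are hypothetical examples, not from the original post:

```python
def vc_version_mismatch(member_versions: dict) -> set:
    """Return the set of distinct Junos versions across VC members.
    More than one entry means the Virtual Chassis has a software mismatch."""
    return set(member_versions.values())

# Hypothetical per-member versions gathered from 'show version':
members = {0: "23.4R2-S4", 1: "23.4R2-S4", 2: "21.4R3"}  # member 2 booted from recovery
versions = vc_version_mismatch(members)
if len(versions) > 1:
    print("mismatch:", sorted(versions))
```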

     



    ------------------------------
    Hope this helps.

    Mark
    ------------------------------



  • 4.  RE: Incomprehensible behavior with all EX2300s at the site after planned power outage

    Posted 09-30-2025 04:48

    Were you able to get to the bottom of this?




  • 5.  RE: Incomprehensible behavior with all EX2300s at the site after planned power outage

    Posted 10-02-2025 10:43
    Edited by JOHN WILLIAMSON 10-02-2025 10:43

    As requested, I scheduled their six IDFs (3-AS1 through 3-AS6) to power off at 7:00 AM MDT.

    Beginning at 8:55 AM CDT (7:55 AM MDT, i.e. right around when the power outage started, they may have started early), every single EX2300 series switch at the site went down simultaneously. 

    That sounds like your shutdown command did not work. The switches should have all shut down just a minute or two after you sent it and not been up at all when the power was cut. The standard shutdown command should have shut down all members of any VC as well. The switches should have come back up when the power was back on, assuming the power was off long enough that any UPSes on the network racks shut off.

    I assume none of this is news to you, since you've been in contact with JTAC. 



    ------------------------------
    JOHN WILLIAMSON
    ------------------------------