Clients intermittently "hanging" using Mist AP32 and Juniper EX2300/EX4600

    Posted 02-21-2021 15:21
    At least twice a day, generally at busier times of the school day (around 10 and around 1), clients appear to have no internet for anywhere from a couple of minutes to ~5 minutes, sometimes up to 10. They show full bars on their Wi-Fi connection and may or may not have an IP address (sometimes self-assigned, but usually a valid IP), yet they can't use a web browser. Unfortunately, by the time the issue is reported, clients are working again, so I've not been able to do extensive testing from the client's perspective.

    Not all buildings are affected at the same time, though sometimes the internet "brownout" hits a second building within a minute or two of its first occurrence in another building.

    Default storm-control is enabled on all ports of my EX2300 edge switches and EX4600 core, and the Mist AP32 access points serve a mix of macOS/Windows/iOS/Android/Chromebook clients. The core is a dual EX4600 Virtual Chassis with 9 EX2300 edge VCs (1, 2, or 3 switches, depending on the size of the building), all connected back to the EX4600s via two 10G SFP+ fibre links in a 20G LACP aggregated interface. The EX4600 is running 19.3R2.9 and the EX2300s are running several different firmware versions from 18 to 19.3R2.9 to 20.2R1.10 to 20.4R1.12. DHCP is relayed by the EX4600 to a Windows DHCP server, and DNS comes from a couple of Linux BIND servers, both on different VLANs from the clients.

    All my EX2300 configs are done using interface-range statements like this one:
    set interfaces interface-range wap unit 0 family ethernet-switching storm-control default
    and the definition of EX2300 storm-control is boilerplate:
    set forwarding-options storm-control-profiles default all
    The Mist APs report that the DHCP and DNS servers aren't responding for periods of 5-10 minutes (sometimes longer), and clients show apparent 802.1X authentication timeouts or errors during this time. Since passwords are saved on the clients, I think the authentication errors are a red herring: the clients aren't fat-fingering the password, they're just having communication issues.

    No error messages in the Juniper switch logs.

    DHCP server is on VLAN 10 (relayed through EX4600 to other VLANs), DNS on VLAN 12, most clients are on VLAN 4 (BYOD network), teachers are on VLAN 3, roughly 1 AppleTV per classroom on VLAN 6.

    I haven't yet been able to capture traffic during the "brownout" event.
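
    Since the problem clears before anyone reports it, one thing I could do is leave a small probe running on a wired and a wireless client so I at least get exact start and end times to line up against the switch and Mist logs. This is only a rough sketch, not something I've deployed: the function names `dns_ok` and `watch`, the hostname `example.com`, and the 5-second interval are all placeholders I made up, and it only checks DNS resolution through the system resolver (client-side DNS caching could hide a short outage).

```python
import socket
import time
from datetime import datetime


def dns_ok(hostname="example.com"):
    """Return True if the system resolver can resolve hostname (placeholder name)."""
    try:
        socket.getaddrinfo(hostname, 80)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def watch(probe, samples, interval=5.0, sleep=time.sleep, now=datetime.now):
    """Call probe() `samples` times and return a list of (start, end) outage windows.

    `end` is None if the probe was still failing when sampling stopped.
    """
    outages = []
    start = None  # timestamp of the first failed probe in the current outage
    for i in range(samples):
        t = now()
        if probe():
            if start is not None:
                outages.append((start, t))  # outage ended at this successful probe
                start = None
        elif start is None:
            start = t  # first failure of a new outage
        if i < samples - 1:
            sleep(interval)
    if start is not None:
        outages.append((start, None))
    return outages
```

    Running something like `watch(dns_ok, samples=720, interval=5)` would cover about an hour. Comparing the windows from a wired client against those from a Wi-Fi client in the same building should hint at whether the brownout is on the wireless side or further upstream.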

    I'm wondering whether any of this sounds familiar to you.

    Is it possible it's some sort of unicast storm that's causing DNS and DHCP unicast packets to be undeliverable and resulting in the client experiencing this as "no internet"? Are there any gotchas with storm-control?

    I've heard that AppleTVs can still have issues with becoming the default gateway due to some weird long-standing bug with the Bonjour Sleep Proxy service, but the AppleTVs are all on a different VLAN from the clients, so this is probably not the cause. The problem is also affecting Windows laptops and Chromebooks just as much as Apple devices.

    Perhaps something else entirely?

    Any assistance would be much appreciated!