For the time being we have a relatively flat, mostly layer-2 network. The topology is going to change in the coming months, so don't yell at me about this design, please. It's a transition from an old design and other hardware to Juniper and, soon, a good layer-3/VLAN design. 🙂
We have many 3200s in closets, on fiber and copper back to a six-member 4200 VC at the core.
In the meantime, though, we've had a random problem for months that is infuriating and seemingly impossible to track down. We recently moved from 9.5-series code to 10.0R4.7 with no resolution of the issue.
In a nutshell, SOMETHING in our network will suddenly cause the sfid process on the core 4200 VC to skyrocket. This morning's incident hit about 22% CPU in top. While this is going on, the network as a whole becomes massively sluggish and we see tremendous (or complete) packet loss across segments (notably the path through our Cisco router to our DS3).
I'm not suggesting that the sfid process is the problem - but it's the primary symptom to look for when this happens.
Scenario: life is good, all is well... then BLAM... massive packet loss going everywhere and a seemingly down network from a user perspective. Nagios goes nuts as its ping checks against the 3200s start dropping packets, etc.
Enough control remains, though, that I can SSH into the 4200 VC at the core, run top, and confirm that sfid is indeed burning a very high percentage of CPU.
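For reference, these are the checks I run from the CLI when it hits (hostnames below are just examples; the per-process view is essentially top):

    aaron@core-vc> show chassis routing-engine
    aaron@core-vc> show system processes extensive | match sfid

or, from the shell:

    aaron@core-vc> start shell
    % top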
At this point it becomes a process of elimination: shutting down segments one at a time until the "offending segment" is taken offline. The instant the interface leading to that set of 3200s (on fiber to the VC) is disabled, traffic returns to normal, packet loss vanishes, and sfid sinks back down to its "normal" range of 2.5% - 4.5%.
While this is happening, the 3200s on the "offending" segment are fine in terms of sfid CPU use. It's only the VC at the core going crazy.
To this day we have NEVER been able to figure out what causes this. There is nothing in the logs, no packets we can capture, etc. It's as if the event is so low-level that it doesn't register - yet it completely eats up the VC until the "bad" segment is disabled and the offending device on that segment is dealt with.
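The one thing I can do while it's melting down - and I'm only guessing that it's some kind of broadcast/multicast flood, I haven't proven that - is watch the counters and live rates on the suspect uplink rather than trying to capture packets. Something like this, with ge-0/1/0 standing in for the fiber port feeding the segment, looking at the broadcast/multicast counters under MAC statistics and at the live packet rates:

    aaron@core-vc> show interfaces ge-0/1/0 extensive
    aaron@core-vc> monitor interface ge-0/1/0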
Thus far, the offending devices have usually been small, cheap, user-attached switches that we've been working to eliminate. In most cases it's a little 5-port NetGear (or something similar) that someone has attached, which works without a flaw for months or more - then is suddenly the culprit in one of these meltdowns. Power-cycle the NetGear (or similar device) and everything is instantly quiet again.
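As a stopgap, my working plan - untested here, and the syntax is from memory against our 10.0 code, so please check it against the docs for your release - is to put storm control and BPDU protection on the 3200 access ports where these little switches get plugged in, along these lines (ge-0/0/5 is a made-up example port):

    set ethernet-switching-options storm-control interface all bandwidth 1000
    set protocols rstp interface ge-0/0/5 edge
    set protocols rstp bpdu-block-on-edge

If anyone can confirm whether storm control would actually catch whatever these devices are doing to sfid, I'd love to hear it.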
Yes, this is all further argument for moving off the flat layer-2 topology, and that WILL be happening. But in the meantime, I really need (and WANT) to understand both what is happening and how to debug it.
I have a segment "on hold" right now that is causing the issue today, and I'm about to set out to locate which device on it is responsible. It's not a highly important segment, so in the meantime I can use it to my advantage for testing - if there are traceoptions or other techniques that could generate logs or other clues about what is going on and sending sfid through the roof, I can try them there.
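Specifically, what I'm planning to try on that test segment - flags are from memory, so correct me if these aren't the right ones - is turning on traceoptions for spanning tree and the Ethernet switching daemon on the VC, then flipping the interface and watching the files:

    set protocols rstp traceoptions file rstp-trace size 1m files 3
    set protocols rstp traceoptions flag all
    set ethernet-switching-options traceoptions file eswd-trace size 1m files 3
    set ethernet-switching-options traceoptions flag all

    aaron@core-vc> show log rstp-trace
    aaron@core-vc> show log eswd-trace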
If I enable that segment, sfid climbs and packet loss starts instantly on the VC; disable it and everything just as quickly settles back to normal.
When I enable/disable a segment, I'm simply disabling/enabling the fiber interface on the VC that serves the one or more 3200s on that branch of the network.
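For clarity, by "disabling" I just mean setting/removing the disable flag on the uplink port (ge-0/1/0 again being a stand-in for whichever fiber port feeds that segment):

    set interfaces ge-0/1/0 disable
    commit

    delete interfaces ge-0/1/0 disable
    commit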
Any thoughts? The fact that some device on our network can so randomly take it all down is, as you can imagine, tremendously dangerous and frustrating.
Open to all ideas...
- Aaron
#sfid#3200#4200#storm#VC