For the time being we have a relatively flat, mostly layer-2 network. The topology is going to change in the coming months, so don't yell at me about this design, please. It's a transition from an old design and other hardware to Juniper and, soon, a good layer-3/VLAN design. 🙂
We have many 3200's in closets on fiber and copper back to a six member 4200 VC at the core.
In the meantime, though, we've had a random problem for months that is infuriating and seemingly impossible to track down. We've moved from the 9.5 series code to 10.0R4.7 recently with no resolution of the issue.
In a nutshell, SOMETHING in our network will suddenly cause the sfid process on the core 4200 VC to skyrocket. This morning's incident hit about 22% of CPU when running top. During this time, the network as a whole becomes massively sluggish and we have tremendous (or complete) packet loss across segments (namely our Cisco to our DS3).
I'm not suggesting that the sfid process is the problem - but it's the primary symptom to look for when this happens.
Scenario: life is good, all is well... then BLAM... massive packet loss going everywhere and a seemingly down network from a user perspective. Nagios goes batty, losing packets as it does ping checks against the 3200's, etc.
Enough control remains, though, that I can ssh into the 4200 VC at the core, run top and confirm that sfid is indeed at very high numbers CPU-wise.
At this point, it's a process of elimination of shutting down segments until the "offending segment" is taken offline. The instant the interface leading to that set of 3200's (on fiber to the VC) is disabled, traffic is normal, packet loss vanishes and the sfid process sinks back down to its "normal" level of 2.5% - 4.5% fluctuations.
While this is happening, the 3200's on the "offending" segment are fine in terms of sfid CPU use. It's only the VC at the core going crazy.
To this day we have NEVER been able to figure out what causes this. Nothing in the logs, no packets can be captured, etc. It's like it's such a low-level occurrence that it doesn't register - but it totally eats up the VC until the "bad" segment is disabled and the offending device on that segment dealt with.
Thus far, the offending devices have often been small user-attached cheap switches that we've worked to eliminate. In most cases it will be a little 5 port NetGear switch or something similar that someone has attached that works without flaw for months or more - then is suddenly the culprit in one of these meltdowns. Reset power on the Netgear (or similar device) and everything is instantly quiet again.
Yes, this is all further argument for leaving the layer-2 topology and that WILL be happening. But in the meantime, I really need (and WANT) to understand both what is happening and how to debug it.
I have a segment currently "on hold" right now that is causing the issue today and I'm about to set out to locate what device in that segment is causing this issue. In the meantime, it's not a highly important segment and I can use it to my advantage in testing if there are traceoptions or other techniques I could use to possibly generate logs or other clues to the nature of what is going on and sending sfid through the roof.
If I enable that segment, sfid climbs and packet loss starts instantly on the VC... disable it and it just as quickly settles to normal.
When I enable/disable the segments, I'm simply disabling/enabling the fiber interface on the VC that serves one or more 3200's on that branch of the network.
Any thoughts? The fact that some device on our network can so randomly take it all down is, as you can imagine, tremendously dangerous and frustrating.
Open to all ideas...
Sorry to hear about Your network problems. A few points to clarify/check:
1/ are "cheap switches" attached to the core via untagged or tagged/trunk ports?
2/ what Spanning Tree flavour is in use if any?
3/ is there any non-IPv4 multicast running in the network? ES-IS or IS-IS perhaps?
4/ is native VLAN being trunked/used on trunks to carry actual payload traffic?
Thanks for the reply!
1 - The "cheap switches" are sometimes attached to access ports on the 3200's in the closets. Those ports would be on the default VLAN. The only ports on those 3200's going back to the core that are trunked are the fiber uplinks back to the core or any additional downstream 3200's in that same closet. That make sense?
2 - rstp with seemingly default options beyond a bridge-priority of 4k.
3 - I am not aware of any and nothing is configured as such intentionally right now.
4 - The untagged default VLAN is where hosts currently reside, so yes, it is what is trunked between the core and the uplinks to the 3200's in the closets. Those trunk ports also carry some other VLANs we have defined that are not yet all in use. There is one other major VLAN existing on the core VC that is somewhat odd and also temporary. It has a collection of hosts in it that are behind a transparent, bridging firewall that cross the default VLAN and that firewall VLAN. That means anything going in or out of those hosts to the main network is passing through the bridging firewall for policy filtering along the way (two ethernet cards in that machine - one on default, one in the VLAN). It looks bizarre since it does create duplicate entries for some MAC addresses that are seen in both the default and the firewall VLAN... but it has never seemed to be a source of trouble. It, too, is going away soon. I patently dislike that design but have had to live with it during the transition from old network to new.
Please don't hesitate to ask additional questions or for clarity if I didn't answer properly...
Thanks for Your reply, I have a few more points at this time:
1/ You said RSTP is used - how does it prevent loops on VLANs other than untagged one? Or do you have loop-free topology for tagged VLANs? Or are You in fact using VSTP or MSTP?
2/ when the segment is misbehaving, did you also observe the elevated pps (packet-per-second) rate on ports which are BLK by spanning-tree?
regarding possible t'shooting steps -
3/ use "mac-move-limit" to determine if there is indeed a L2 loop
4/ if You can afford playing with the misbehaving segment, I'd suggest inserting a separate inline tap or a switch capable of port-mirroring on the link which is currently disabled. Before you ask - yes, EX supports "analyzer" capability but I've seen EX (re)setting VLAN tags on analyzer output port to completely random values so please use a separate inline device if You can afford it. Still, You can use the EX "analyzer" feature to capture traffic if nothing else is available for inline capture.
Appreciate your help on this.
I gather you're feeling that a L2 loop is the primary possible cause here. That has been my feeling in the past, too, but I've yet to find the loop if it's happening. I'm not saying it isn't... I just haven't been able to see that actual scenario yet.
1 - None of the tagged VLANs are in use right now in the places where this is happening. While the tagged VLANs are declared on the trunk ports of all of the switches for future use - nothing is actually using those yet. All of the traffic and the hosts involved in these cases are on the default, untagged VLAN. Not a single port within the "offending" switches is assigned to anything but the default VLAN, basically. Does that answer the question?
In order to create a loop like this, I would assume it would involve one segment (or span) coming off the core crossing to another, separate segment... correct? A loop WITHIN one of the segments strikes me as something dealt with at the level of the 3200 where it is taking place. In that regard, for instance, I can assure you that both ends of an ethernet cable, for instance, plugged into one of the 3200's instantly BLKS, then declares one as BLK'd BKUP a moment later. The configuration in this regard is universal to our system.
So, in my remark about two segments crossing: if A is the core and B and C are EX-3200 offshoots of it, the loop would be created by something suddenly joining B and C. I would assume in that case that B and C would address this loop by seeing A available from two paths and BLK'ing one - correct?
Ultimately, I don't yet see how that can happen in our topology from a physical standpoint.
Also, looking back on my references to the "cheap switch" incidents... those were dead-ends, topology-wise. I'm referring to the core (A) connecting by fiber to a closet 3200 (B) and the NetGear connected off a single access (not trunk) port on B. When this happens, simply disconnecting that NetGear - which is in no way looped - causes the core (A) to return to normal. The 3200 (B) that is in between the core and this NetGear exhibits no real issues.
2 - I can't swear either way on that. I've battled this intermittently for so long that I've lost track of what I've checked. Next incident, once I know the segment, I'll check the pps stats of it in that crazy state and see if something suggests major traffic. And when I say next incident, I'm irked to say that the ideal test case I had today - once again - evaporated. I'll have to wait yet again for the next implosion and debug from there. The next might not be so convenient as this incident.
3 - Assuming this is a loop or something similar - this mac-move-limit sounds pretty useful. Could something OTHER than a move create this scenario? Something like a host flickering on and off rapidly in some manner? Our logs would show disconnects on ports, and I've never seen that kind of indicator (trust me, I've checked)... but I figured I'd ask.
4 - Yes, I can definitely insert a sniffer... that's not a problem. And I agree, I'd prefer it to be a passive, outside device in the path in case the nature of this causes the Juniper analyzer to utterly ignore or otherwise be blind to this unknown possible traffic.
Ultimately, what do we know about the sfid process? What does it represent? What kinds of activities cause it to work hardest? Knowing that, can we better theorize types of incidents - and maybe methods for logging them - that could trigger this?
In closing, I just want to emphasize the seemingly non-loop way in which a simple device like a cheap NetGear switch can sit on the END of a span and, in some manner, cause massive grief all the way back at the core. That, to me, is the big mystery. If it's a loop, I've not yet envisioned how it might happen. 😞
Any and all queries and thoughts help! 🙂
I tend not to think the issue is a switching loop. My understanding is that standard RSTP will be transparent to VLANs, that is, even if there are 20 VLANs on a trunk port, if RSTP decides that port will be disabled to prevent a loop, it will be disabled, and that will impact all VLANs on the port. It's essentially all or nothing for a port and VLANs aren't given any consideration. I've also seen a number of switching loops impacting EX4200s, and they handled it quite well, the CPU usage was not bad, just a crazy number of PPS on some of the ports. I'm not saying it certainly isn't a switching loop, but it doesn't seem like it to me.
I also don't see why the core VC composed of 6 4200s would be severely impacted, but the 3200s not at all if there were a loop.
Are any of the aggregation 3200s configured as VCs as well?
I haven't been able to find much on the sfid process, just a few references to it as the "software forwarding process" in some outstanding issues or release notes Juniper documents. You say you see it at 22% or so...I assume the total CPU usage you see in TOP is close to 100%? What other processes are big offenders? Do you see anything odd happen with the VC, like the RE and BACKUP RE changing roles or anything?
No, there are no 3200's as VCs (they are, in fact, capable of VC). I will refer to the closets as "stacks" since we frequently have more than one 3200 in any given closet (depending on building size or area served) - but even a stack is not the proper term since they are simply joined together with ethernet cables on ports defined as trunks. One unit in the "stack" has the fiber going back to the core 4200 VC. Had budgets allowed, our 80+ 3200's would have all been VC-capable 4200's... but this design more than works for us horsepower-wise.
Here's the thing... when sfid is even around 12% things are getting nasty. Pings across the core are dropped or hit insanely high roundtrip times. 22% is a real mess. Nothing else is coming close, though, in a CPU-usage sorted top list. So, no, the overall usage is not 100%. Everything else is likely single digits or less than 1%. But with an sfid in the high teens or 20's, forwarding of traffic is a total mess.
HOWEVER, things are still moving enough that my ssh session to the VC core is functional. And while it might feel a bit jerky or semi-slow at moments, it's totally usable. I'll have multiple ssh sessions open, one running top, another in CLI mode and another in the CLI's config mode so I can save time moving between actions under these conditions. It does not freeze me out. I can, with slight delays and slowness - but minimal - connect to the 3200's that are part of the offending segment and inspect and interact with them. They do NOT, however, have a high sfid. They look normal, but their initial blocking and, ultimately, the removal of whatever device is attached to them causing the mystery issue will calm down the core immediately.
The issue has happened again tonight and, once again, I've located the segment coming off a 3200 that is the source of the headache and blocked it WITHIN the 3200 while leaving the 3200 itself (and everything else it hosts) fully connected to the core. INSTANTLY pings through the core are normal and sfid plummets to normal. I'd like to analyze it deeper, but I'm still at a loss for what I'm looking for at this point that I might find via some as yet unknown debug/trace options and tools, etc.
It's really crazy stuff.
Whatever is going on, I've seen no real evidence that it's chronic to Juniper gear or JUNOS. What on earth is so different about our relatively simple network that we're seeing these issues and others have not? You're right - there is virtually no substantial reference to sfid. And while I don't accuse it of BEING the problem, it's clearly at least affected enough by whatever is happening to be a major clue here.
Let me emphasize, too, that this is HIGHLY random. It cropped up once months after our network was installed last summer and running very quietly. That was the first time I found a stupid $20 Netgear 5-port switch causing it. It blew my mind. It happened again some weeks or months later. I immediately asked if there was a similar Netgear on the problem segment. There was, we reset it and things were fine. I began referring to it as "the Netgear" problem. Those two incidents established the pattern and we had one or two more over the months. Nothing chronic - just random and annoying.
One night we had a thunderstorm and it happened. Took me quite a while, but I painfully did the segment by segment shutdown to find the offending one. That segment went to a set of 3200's that, in turn, went further downstream by fiber to another 3200 (not a common topology for us - but nothing odd about it). Ultimately that one was a little cheap D-Link brand switch in someone's office being used as a port multiplier where they had only one jack (same as the Netgear situations). I reset it and everything was fine.
My point here is that those cheap little mini switches are able to bring our core to its knees - and that's downstream behind two separate sets of 3200's - none of which are adversely affected in between.
What the heck!?
We upgraded to 10.0R4.7 recently after hearing that some wiggy sfid-related issues were resolved over the 9.x series. I didn't see the issue for a month or more and breathed a sigh of relief. Then yesterday morning it happened and now twice more today. Three occurrences in 24 hours out of the friggin' blue after another quiet period. I was stupid enough to think that, while we never understood the cause, at least it might be over with 10.x. Nope.
So here I am, trying as hard as I can to understand the WHY and HOW so I can make the changes needed to stop it. Otherwise, I'm not getting a decent night's sleep or a day off (and I was off yesterday traveling and ended up spending the morning on the phone remote-debugging that thing). But you all know this crap never happens until you're away and it's of the utmost inconvenience... like some cosmic conspiracy. Life of a network admin.
Aside from this driving me nuts, I really love our Juniper gear. I just need this bit out of my life. 🙂
Is it possible that there is some device on these switches that is still running IPX protocol? Some print servers etc. still have it enabled by default, and I have seen some crazy packet storm type issues coming from devices like that at times. You might be able to isolate the port that is having the issue on one of your switches, and span it to do some packet captures so you can better identify what is going on.
Thanks for Your replies.
In my experience, L2 loops are created not only by topology/cabling errors and/or STP issues but also when a device forwards a frame out of all of its ports, including the one it received it on. If the Netgear exhibits this behaviour, STP won't help but "mac-move-limit" will. Try it first with the "log" option to identify offending traffic.
But a single frame is not enough to create a large PPS rate/packet storm. Something has to be constantly adding these frames.
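To make the idea behind a MAC move limit concrete, here is a minimal Python sketch of the bookkeeping involved. It is not how JUNOS implements "mac-move-limit"; the class name, the defaults and the port labels are hypothetical and only illustrate why a device reflecting frames back up its own uplink would show up as the same source MAC hopping between ports.

```python
from collections import defaultdict, deque


class MacMoveDetector:
    """Flags a source MAC that is learned on a different port too often."""

    def __init__(self, limit=1, window=1.0):
        self.limit = limit                 # moves tolerated per window (assumed default)
        self.window = window               # window length in seconds (assumed default)
        self.last_port = {}                # (vlan, mac) -> port where the MAC was last seen
        self.moves = defaultdict(deque)    # (vlan, mac) -> timestamps of recent moves

    def observe(self, ts, vlan, mac, port):
        """Record one learned frame; return True if the move limit is exceeded."""
        key = (vlan, mac.lower())
        previous = self.last_port.get(key)
        self.last_port[key] = port
        if previous is None or previous == port:
            return False                   # first sighting or same port: not a move
        recent = self.moves[key]
        recent.append(ts)
        # Forget moves that fell out of the sliding window.
        while recent and ts - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.limit
```

A host that roams once stays under the limit, while a MAC bouncing between an uplink and an access port several times a second trips it immediately.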
SFID is "Software Forwarding Infrastructure Daemon" - a SW process which handles control traffic to/from EX itself and also transit non-IPv4/non-IPv6 multicast such as ISO (ES-IS, IS-IS) multicast, etc. Basically, any Ethernet frame whose dst MAC starts with 03 or 09. Some MSFT and HP server redundancy protocols also use this kind of multicast.
Given that You are seeing high CPU caused by SFID together with high PPS it is possible that non-IPv4/non-IPv6 multicast frames got looped between core VC and Netgear. That's why packet capture is important to determine whether this is indeed the case.
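As a quick way to apply that rule of thumb to MAC addresses pulled out of a capture, a small Python helper could look like the sketch below. It encodes only the "first octet is 03 or 09" heuristic stated in this post, not any documented Juniper behaviour, and the function names are made up for illustration.

```python
def first_octet(mac):
    """Return the first octet of a MAC written as aa:bb:cc:dd:ee:ff or aa-bb-..."""
    return int(mac.replace("-", ":").split(":")[0], 16)


def is_group_address(mac):
    """True for multicast/broadcast: the least significant bit of octet 0 is set."""
    return first_octet(mac) & 0x01 == 1


def likely_software_forwarded(mac):
    """Heuristic from the post above: destination MAC starts with 03 or 09."""
    return first_octet(mac) in (0x03, 0x09)
```

For example, likely_software_forwarded("09:00:2b:00:00:05") is True, while an IPv4 multicast MAC such as 01:00:5e:00:00:01 is a group address but does not match the heuristic.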
Here's the latest in my saga.
We had three sfid/core blowups in 24 hours (Wed/Thur). Friday morning I set up my "trap" since it came off the same segment every time and I assumed (right) that it would again.
I found Python source for a ping tool and modified it to ping with time stamps embedded and write those constantly to a file. I did this because pings across the core to our Cisco router (and other things) virtually cease when this is happening. I'll explain why I did this later.
I then took a netbook, installed Wireshark, grabbed a little (ironically) NetGear hub (NOT switch) and placed it inline between the Juniper Switch and the offending device. This way all traffic is, in theory, being captured before, during AND after one of these incidents.
The offending device (almost - keep reading) in this case is a Total Access 1200 (I think) DSL head unit. It's one of several of ours that provide DSL service through our phone system to residents on campus. It's on port ge-0/0/15 of a 3200 in the telecom building which is then connected by fiber to the core. We run our DSL as a bridged extension of our LAN so, for the most part, moving between the normal office ethernet and DSL in your home is invisible and seamless.
I had determined earlier that when things blow up in this latest round of incidents, blocking that DSL head unit connected to ge-0/0/15 of the 3200 instantly solves the problem. sfid CPU usage on the core drops and pings return.
So, Wireshark is now running 24/7 with a 1 TB hard drive and writing 50 MB continuous dumps (tshark, actually... the commandline version so it has lower overhead). My ping tool is running on the same laptop for the express purpose of giving me a correlating timestamp (once per second) for when pings stop. I can match that time within the Wireshark dump and start looking within that vicinity of things blowing up.
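For anyone wanting to reproduce the timestamped ping log, here is a rough Python sketch of the idea. It is not the tool used here: it shells out to the system ping with Linux-style flags (an assumption), and the function names and log format are invented for illustration.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping (Linux-style -c/-W flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def log_pings(host, path, count, probe=ping_once, interval_s=1.0,
              sleep=time.sleep, now=datetime.now):
    """Append one 'timestamp host ok|LOST' line per probe to the file at path."""
    with open(path, "a") as log:
        for _ in range(count):
            status = "ok" if probe(host) else "LOST"
            log.write(f"{now().isoformat(timespec='seconds')} {host} {status}\n")
            log.flush()                    # keep the file current if the laptop dies
            sleep(interval_s)


def loss_windows(path):
    """Return (first_lost, last_lost) timestamp pairs for each run of lost pings."""
    windows, start, last = [], None, None
    with open(path) as log:
        for line in log:
            parts = line.split()
            if len(parts) != 3:
                continue                   # skip malformed lines
            ts, _host, status = parts
            if status == "LOST":
                if start is None:
                    start = ts
                last = ts
            elif start is not None:
                windows.append((start, last))
                start = None
    if start is not None:
        windows.append((start, last))
    return windows
```

The pairs returned by loss_windows() give the time ranges to jump to in the capture files.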
The problem did not return (of course) until 1:30 AM this morning and alarms woke me up. It happened again sometime after 5 AM, 7 AM and just before noon when I shut down ge-0/0/15 entirely for the rest of the day.
This evening I re-enabled ge-0/0/15 again and within seconds things blew up again. This time I hopped on the DSL head unit before it became overwhelming and disabled all 24 DSL ports in that head unit. Things returned to normal. I then systematically re-enabled each DSL port one by one. When I hit #7, stuff hit the fan. I now know specifically which DSL span is causing the trouble, but I've not yet had time to visit that location to see what gear is connected on the other end.
ANYWAY... here's the deal on the packet dumps: I see NOTHING of interest. Perhaps I am overlooking something - but there are no storms, no strange bursts of packets, no odd spanning tree notices or heartbeats. NOTHING.
So here's the summary: some as-yet-unknown equipment sitting on the far end of a DSL modem on our campus connecting back to a DSL head unit which connects to a 3200 in the telco building which connects by fiber to the core is blowing up the ENTIRE campus. SFID goes batty, packet loss is rampant, users scream the network is down and I see absolutely nothing in the packet capture between the DSL head unit and the 3200 to which it is connected that tells me anything.
I set max-mac-move (or mac-max-move?) to "1 second" with logging for all vlans. I applied it to the core and to the 3200 closest to the offending device/DSL/span. Nothing. No log entries, etc.
I DID notice that sfid cpu increases for both the core AND the 3200 to which this DSL head unit is connected. That's either new or I had forgotten it from past incidents.
There are, however, no log entries on either that look odd. No spanning tree strangeness, blocking, BPDU errors, etc. Nothing. Everything LOOKS normal everywhere - but sfid is exploding and our core is going to its knees.
I will report back with what hardware I find on the remote end of that DSL span once I can get there.
In the meantime... any ideas?
Thanks for Your last post, very informative.
I take it as you have the packet captures from the time of the last incident.
Did you have time to look for the following:
1/ any non-IPv4/non-IPv6/non-ARP frames are showing up? Specifically, any non-IPv4/non-IPv6 multicast?
2/ any IPv4 multicast with TTL==0|1? (would be relevant only if You have L3 interfaces in that VLAN on your core VC and 3200 switch)
3/ any IPv4/IPv6 packets directed at the core VC and 3200 switch itself (again, would be relevant if You have L3 interfaces in that VLAN on VC|3200)
4/ what was the PPS rate at the time of incident, both uplink/3200 ge-0/0/15 input and downlink/3200 ge-0/0/15 output? If that's not available, what "show interfaces ge-0/0/15 media detail" says about relative # of unicast|broadcast|multicast frames assuming you haven't cleared the stats on this interface?
5/ any non-Ethernet II/non-Ethernet DIX encapsulated frames directed at the core VC|3200 switch itself, like 802.2 LLC/SNAP?
6/ any significant number of Slow Protocols (Ethertype 0x8809) frames in the captures?
Do you have any Microsoft Windows 2008 server NLB clusters in that segment of your network?
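Several of the checks above (1, 2, 5 and 6) can be run mechanically over raw Ethernet frames. The Python sketch below tags a single frame given as bytes. Reading frames out of the capture files is left out, the tag names are invented, and it does not check whether a frame is addressed to the switch itself, so treat it as a first-pass filter only.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETHERTYPE_VLAN = 0x8100
ETHERTYPE_IPV6 = 0x86DD
ETHERTYPE_SLOW = 0x8809   # Slow Protocols (check 6)
BROADCAST = b"\xff" * 6


def classify_frame(frame):
    """Return a set of tags describing why a frame might deserve a closer look."""
    if len(frame) < 14:
        return {"truncated"}
    dst = frame[0:6]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset = 14
    if ethertype == ETHERTYPE_VLAN and len(frame) >= 18:
        # Skip a single 802.1Q tag to reach the real type/length field.
        (ethertype,) = struct.unpack("!H", frame[16:18])
        offset = 18
    tags = set()
    is_multicast = bool(dst[0] & 0x01) and dst != BROADCAST
    if ethertype <= 1500:
        tags.add("802.3-llc")              # length field, so not Ethernet II (check 5)
    elif ethertype == ETHERTYPE_SLOW:
        tags.add("slow-protocols")
    elif ethertype == ETHERTYPE_IPV4:
        header = frame[offset:offset + 20]
        # Byte 8 of the IPv4 header is the TTL, bytes 16-19 the destination (check 2).
        if len(header) == 20 and 224 <= header[16] <= 239 and header[8] <= 1:
            tags.add("ipv4-multicast-low-ttl")
    elif ethertype not in (ETHERTYPE_IPV6, ETHERTYPE_ARP):
        tags.add("other-ethertype")        # non-IPv4/IPv6/ARP (check 1)
    if is_multicast and ethertype not in (ETHERTYPE_IPV4, ETHERTYPE_IPV6, ETHERTYPE_ARP):
        tags.add("non-ip-multicast")
    return tags
```

An empty set means the frame matched none of these checks; counting tags per second across a capture would show whether any of them spike at the time of an incident.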
Just a quick placeholder reply to say that I'm not dead. We did kill this particular incident, but another will undoubtedly occur since we still didn't finger the exact cause. I have all of the packet captures and will be doing the analysis ASAP based on the suggestions in the latest forum replies... I've just been swamped with other things the last week and itching to get back to this mystery.
Will report back as soon as I have a chance to dig into those captures and look at the specific theories suggested...
I have the same kind of issue on some EX2200s.
Did you manage to fix the issue or find other info regarding sfid?
At long last, I'm back with more incidents of our sfid toxicity and still no clear idea what's happening. We've had quite a few in the last week (for no apparent reason) and a TON of huge incidents tonight. Again, I find no real rhyme or reason thus far. Half the time when I shut down a segment to try to isolate a cause... it stops... immediately making you think that's the offending area... then it comes back again. By its very nature, it's almost impossible to find a true cause/effect scenario since every time you think you're zeroing in, you get proven wrong.
pkcpkc - no, I've not solved it... but I'm VERY intrigued to hear you're experiencing something similar. Can you elaborate? Any progress?
Driving me completely nuts...
Do you have dhcp-option82 configured and enabled?
Sorry to bump an old thread, but I stumbled across this when searching for solutions to a problem I was experiencing between an EX4200 and an F5 BIG-IP. My issue turned out to be a bridging loop caused by some bad design (in my opinion) on the F5. (http://support.f5.com/kb/en-us/solutions/public/5000/500/sol5566.html)
Anyway, I'm curious to know if you ever unearthed a cause for your symptom? Sounds like you did some very solid, systematic troubleshooting. Perhaps pointing to compounding factors/causes?
Hi amahler!
Have you found the RCA for your problem yet?
I have a similar problem running an EX4500+EX4200 VC (12.3R3).
CPU swings rapidly between high and low because of sfid.
I have checked spanning-tree, but it seems normal.
I've captured traffic and saw some strange packets:
- There are many packets with TTL = 1 and an IP destination which doesn't exist. Because TTL = 1, the switch has to reply with TTL exceeded (which may cause the high CPU). I've tried to discard these packets but without success, because the switch doesn't support filtering packets with ttl=1 yet 😞
- There are some broadcast ARP messages from the switch trying to find the MAC address of an IP destination which doesn't exist. I have tried to discard traffic to these IP destinations by configuring a static discard route.
Does anyone know a shell command to check the sfid process in detail?
Not that I can offer any help on the issue, but why are you not able to create the filter with that match condition?
set firewall family inet filter BLOCKTTL1 term TTL from ttl 1
set firewall family inet filter BLOCKTTL1 term TTL then accept
set firewall family inet filter BLOCKTTL1 term ACeetALTTL then accept
commit check
configuration check succeeds
Model: ex4200-24t
JUNOS Base OS boot [12.2R1.8]
I found this on a wiki, but you are on your own as to its usage and Juniper support.
"lcdd" from a shell (not the cli) connects you to various other parts of the switch, including the software forwarding infrastructure (sfid), chassis manager (chassism), and the virtual chassis system (vccpd). You don't need to be root to get into these."
For how long is the sfid running with high CPU?
How do you know that sfid has high CPU, unless you happen to be checking the processes at exactly that moment? Are you getting any alerts, or is there any service impact?
If you have enough time to connect to the switch and collect data while sfid is running with high CPU, it might be worth opening a JTAC case to collect specific data pertaining to sfid.
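One way to avoid relying on luck is to sample the process list periodically and flag sfid when it climbs. The Python sketch below only parses text that has already been collected. The column layout is assumed to be FreeBSD-style top output, and the thresholds are borrowed from figures quoted earlier in this thread (roughly 2.5% to 4.5% as normal, trouble from about 12%), so adjust both to what your switch actually shows.

```python
def sfid_cpu(top_output):
    """Return sfid's CPU percentage from top-style text, or None if it is not listed."""
    for line in top_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == "sfid":
            # Assumption: the CPU column is the only field ending in '%'.
            for field in fields:
                if field.endswith("%"):
                    return float(field.rstrip("%"))
    return None


def assess(cpu, nasty=12.0, normal_max=4.5):
    """Classify a reading using thresholds taken from the figures in this thread."""
    if cpu is None:
        return "unknown"
    if cpu >= nasty:
        return "incident"
    if cpu > normal_max:
        return "elevated"
    return "normal"


# Invented sample in the assumed layout, not output copied from a real switch.
SAMPLE = """\
  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
  812 root        1  96    0 52000K 21000K select  99:10 22.07% sfid
  790 root        1  96    0 30000K 12000K select  10:02  1.20% chassism
"""

print(assess(sfid_cpu(SAMPLE)))
```

Logging each reading with a timestamp would also answer the question of how long sfid stays high during an incident.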
Do not run such commands without clear instructions.
I have never tried to configure a filter matching on TTL, so I wonder whether it actually works even though it commits, or whether the packets simply do not have a TTL of 1.
If the filter works, get rid of these packets and see if the sfid CPU decreases.