Yeah... I've been through the guide several times looking for clues. Everything seems up to speed with the MNHA configuration.
As for JTAC, my reseller and local SE team have escalated this, and I hope for a response soon. Fingers crossed.
Original Message:
Sent: 05-04-2025 15:35
From: fb35523
Subject: SRX4100 traffic drops due to high CPU usage by nsd process during commits
We have several customers running MNHA, but I'm not sure whether that's with the SRX4100 or other models. The model type shouldn't really matter, though. Are you sure the MNHA parameters are set up as they should be and that the traffic path(s) between the nodes are OK? I guess you have already checked this guide:
https://www.juniper.net/documentation/us/en/software/junos/high-availability/topics/example/mnha-configuration-example.html
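Beyond the guide, a quick sanity check of the MNHA state on both nodes can rule out peering or link problems. A minimal sketch in Junos CLI (the peer address and count here are placeholders, and the exact output fields vary by release):

```
# On each node, check MNHA peer and SRG status:
show chassis high-availability information

# Verify the HA link path to the peer is clean (substitute your
# peer's actual HA link address for 192.0.2.2):
ping 192.0.2.2 rapid count 100
```

If both nodes show the peer as up and the SRG states are as expected, the problem is more likely in the commit handling itself than in the MNHA path.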
If you get no response from JTAC, make the urgency of the problem clear (with realistic expectations and requirements) in the case notes, then call them and ask to speak to the case owner. If that doesn't help, use the "escalate" button in the case portal. Surely your local Juniper SE team can help out too? I have a feeling I know who they are ;) There may also be special Junos release recommendations for MNHA, so have them check that!
Original Message:
Sent: 05-03-2025 06:09
From: vidar.stokke
Subject: SRX4100 traffic drops due to high CPU usage by nsd process during commits
Hey all.
We've involved JTAC on this issue, but we haven't heard anything from them yet, and we are in a hurry to find out what the cause might be.
The thing is that we have an SRX4100 where we see traffic drops during every commit. Even the simplest commits trigger the issue, e.g. changing the description on an interface. We observe that during commits, the nsd process is very eager and consumes almost 100% CPU for a short period of time. This window matches exactly the window in which we see packet drops.
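For anyone wanting to reproduce or correlate this, the observation above can be checked with standard Junos CLI commands (a sketch; field names and timing detail vary by release):

```
# From a second session, watch nsd's CPU share while a commit runs:
show system processes extensive | match nsd

# From configuration mode, show per-stage commit processing detail
# to see where the time goes:
commit | display detail
```

Comparing the timestamps from the two sessions should show whether the drop window lines up with the nsd spike.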
There are several things that we have observed:
- This issue is only seen on our SRX4100s configured with MNHA. We have several SRX4100s set up in chassis cluster without the issue.
- The SRX has almost 1000 policy rules, but we've removed all of them and this did not seem to solve anything. This count is also well below the number of rules the SRX4100 is capable of.
- We've removed several DNS-name address book entries that didn't resolve to anything, but this did not help.
- We've tried removing all SRG1+ groups except one, to see if that helped. No fix.
- This is not load-related, because this SRX has been taken out of the production network and does not currently handle any traffic.
We have seen a PR related to nsd, but it was fixed in 23.4R2-S4, which is the version we are currently running, and our problem does not match the triggers described in the PR.
I understand that it is hard for any of you to determine what the cause might be, but I am interested to hear whether anyone has seen similar issues and, most of all, whether anyone has pointers on how we can work around this temporarily until a fixed software version arrives.
------------------------------
Best regards
Vidar Stokke
------------------------------