Generally speaking, I really like working with the SRX. We use 210, 220, and 240 models throughout the company. It's trivially easy to set up tunnels with OSPF to do all kinds of neat inter-office connectivity, and working with JTAC is WAY better than Cisco TAC. (we have a Cisco phone system)
Five years ago, we bought 15 new SRXes from an authorized Juniper dealer, and each was installed in a separate geographic location.
I'm having GRAVE concerns about their reliability. In the past 2 years, 5 of the 15 have failed with a 6th one heading to the toilet.
6 failures out of 15... that's a 40% failure rate in 5 years. For the record, all are on APC UPSes of varying capacities, and utility power problems are extremely rare.
Is the SRX really this much of a failure-prone dog? Juniper Netscreens we bought circa 2005-06 are still running TODAY with no problems at all... which is why I was so anxious to adopt the SRX at new locations. But wow.... the problems never end.
Are we alone in this experience?
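For anyone who wants to compare fleets, here is a minimal Python sketch of the arithmetic behind the 40% figure. The function names are made up for illustration, and the annualized number assumes failures are spread evenly over the 5 years, which does not match this fleet (most units failed in the last 2 years), so treat it as a rough yardstick only.

```python
def cumulative_failure_rate(failed: int, total: int) -> float:
    """Fraction of the fleet that has failed so far."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def annualized_failure_rate(failed: int, total: int, years: float) -> float:
    """Constant yearly failure probability that would yield the observed
    cumulative rate after `years` (assumption: evenly spread failures)."""
    survival = 1.0 - cumulative_failure_rate(failed, total)
    return 1.0 - survival ** (1.0 / years)


fleet = cumulative_failure_rate(6, 15)       # 6 of 15 units, per the post
yearly = annualized_failure_rate(6, 15, 5)   # 5 years since purchase
print(f"cumulative: {fleet:.0%}, annualized: {yearly:.1%}")
```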
No - you're not the only one. We didn't have any jack or pushbutton issues, but loads of problems with bad blocks in NAND, which often lead to problems during upgrades (i.e. a change of boot partition). ISSU going haywire, systems responding extremely slowly after a config change (had to re-image the device). Or SRXes stuck in the bootloader for no reason - issuing a 'boot' then brings them up (we've had it with several SRX300s so far) - but of course that has to be done from the console, i.e. driving to the customer's site and doing it locally, since customers usually don't have a serial adapter, nor do they want to (or are able to) revive their equipment. Not to mention the extended downtime at the customer's site...
And it's not the SRXes alone - in the last few months, we had increasing problems with EX-switches too.
Corrupted filesystems (no power outage - NAND simply 'slowly dies' during regular operation within 2 years; JTAC tells me that's normal and we have to live with this). An update of a 9-chassis VC left 4 of the chassis at the boot prompt.
Spontaneous reboots after a simple commit, false emergency fire shutdowns due to a possible bug in the CPU temp sensor.
JUNOS quality has suffered massively - we ran into many bugs in the past - most of them 'confidential', i.e. we didn't even have a chance to work around them. To make things worse, many (not all!) JTAC engineers have a strange way of tackling problems: 'Please try to install a different JUNOS version in your production environment - we don't know if it will work (potluck), but hey - it's just half an hour of downtime (if you're lucky) and a drive to the customer's site (since you might lose network access to the devices and need console access) - it might cost you a few thousand bucks, but be honest - money is not an issue...' or 'Well, NAND problems are unavoidable - please check NAND on all your (200+) devices once a week to quickly identify problems...'
And I have the feeling that often they didn't even try once to actually install their recommended versions of Junos on the corresponding devices - more than once the recommendation didn't work at all on the device (too little memory). Funny things then happen (e.g. the system boots and forwards packets but doesn't NAT anymore - no error messages...).
I have already complained to Juniper multiple times, asking them to beef up their QA again - so far in vain.
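On the "check NAND on 200+ devices once a week" suggestion: nobody is going to do that by hand, so here is a rough Python sketch of how the check could be automated once the logs are already collected centrally. Everything in it is an assumption for illustration: the function name, the error patterns, and the sample log lines are made up and are not real Junos output, so substitute the strings your own devices actually log.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical error signatures; replace with what your devices really log.
DEFAULT_PATTERNS = (r"bad block", r"media unreadable", r"no boot media")


def flag_storage_errors(
    logs_by_host: Dict[str, str],
    patterns: Iterable[str] = DEFAULT_PATTERNS,
) -> Dict[str, List[str]]:
    """Return {hostname: [matching log lines]} for hosts with any match."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    flagged: Dict[str, List[str]] = {}
    for host, text in logs_by_host.items():
        hits = [
            line.strip()
            for line in text.splitlines()
            if any(rx.search(line) for rx in compiled)
        ]
        if hits:
            flagged[host] = hits
    return flagged


# Made-up sample data, not real device output.
sample_logs = {
    "branch-01": "boot ok\nkernel: bad block detected on flash\n",
    "branch-02": "boot ok\ncommit complete\n",
}
print(flag_storage_errors(sample_logs))
```

Fetching the logs from each device is deliberately left out; the sketch only covers the part that turns a pile of text into a short list of boxes to look at.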
Juniper has been my go-to vendor for over a decade-- but their reliability problems are killing us-- and rapidly changing my mind.
I hate to say it, but I'm taking another look at pfSense, because that will give me control over hardware quality. Last I saw, they didn't do routed IPSec, which was a show-stopper, and I really DON'T want to mess with Cisco PIX. Dealing with TAC for our Cisco phone system is a big enough nightmare. But none of our Cisco gear (switches, VPN gateway, phone system) has failed in any way.
Juniper is killing themselves with quality control problems. Maybe not on million-dollar carrier gear, but definitely on branch tier equipment.
Thanks a lot for the feedback.
To understand better, what is the model of the newly procured SRXes and what is the JUNOS version this fleet is running?
"Newly procured?" Per my original post, these were bought 4-5 years ago-- which is still fairly young in networking gear terms. (except the one with the reset button problem is just under 3 years old, problem started at age 2)
210HE, 220H, 240H
Some are on 12.1X46-D67, others are still at 12.1X44-D30.4.
That's another MAJOR complaint. ALL of our devices are still under PAID support, but there is NO JUNOS version we can run that mitigates vulnerabilities CVE-2016-10012, CVE-2016-10010, CVE-2015-6564 and CVE-2015-8325. The fix is 12.3X48-D55 but none of our devices can run that build, per JTAC, because they are not the newer H2 model. It is also impossible to disable SSL 3.0 and TLS 1.0 (per JTAC) because the builds that do that are also NOT able to run on our still-paid-supported gear. I put in an enhancement request for that, but haven't heard a thing. So I've had to disable nearly all external access on devices that are a long distance away.
Since I wrote the original post 3 months ago, we've had 2 additional failures. One crashed in service and on reboot couldn't find its boot device (flash failure). To replace it, I pulled a gently-used SRX off the shelf which had been removed from a shuttered location. The unit was running perfectly when it was gracefully shut down and brought back to the corporate server room for storage. When it was booted to replace the flash-failed unit mentioned above, the primary boot partition couldn't be read, so it booted to the backup partition. I tried to reformat the failed partition (req sys snap slice alt) but that failed with (can't remember the exact words) an error related to partition inaccessible or media unreadable-- something like that. So it smells like another flash failure.
Unfortunately, after looking at options, we had no choice but to buy more Juniper because of the effort involved in mixing another vendor into a production environment with so many tunnels. So we're getting a batch of SRX320 and 340 models.
I hope they're more reliable, because my confidence in Juniper is at an all-time low right now.
I am sad to hear you have had so many failures. This looks like an anomaly to me. Our experience with ten SRX240 boxes after ~5 years of working in the lab rack: zero failures. Are you monitoring device temperatures? Could they be too high?
Yes, all devices are kept in rooms with proper cooling and humidity. The failed colo router is in a premium colocation facility where temp, humidity, and power are rigorously maintained-- and we've reviewed the logs to verify. In our own on-premises telco/server rooms, we have dedicated cooling, and make extensive use of APC brand UPSes in various configurations.
There is nothing environmentally that would explain the failures. Additionally, each of these locations has other brands of equipment, from Cisco switches and voice gateways, to HP and Dell servers, to video surveillance, and many other types of gear. The only.... and I stress ONLY.... equipment failures we've had are the Juniper SRXes.
Corrupt or missing Flash. Front panel reset button that seems to frequently "push itself" (enough that I had to disable it in config). Ports that go bad for no apparent reason. One unit even randomly goes dark (as if losing power) for a few seconds then powers back on. (we replaced the power supply and cables on that one, but problem remained).
Many different types of failures, but only on our Juniper SRX devices. The Juniper branded Netscreens (NS-25, NS-50) bought in 2006 are still running perfectly with zero failures after 12 years.
We are an MSP in the Netherlands. We've been running Juniper for 10 years now, some SSG-5's for 8+ years. After many interface issues with the SRX-200/220, which otherwise ran reliably, we switched to SRX-300's. In the last 4 years we replaced ALL of our SRX-300's, all due to: no interface / overheating / crashing or general failure. We replaced 12 units and have made the hard decision to switch to Unifi equipment. A lot less functionality, but very capable as a generic router/firewall. I'm very displeased with how the SRX-300 has performed in recent years.
Refreshing regards, Chris - Lime Networks
Thank you for the feedback, and we are very sorry to hear about your reliability problems with the SRX300 series product line.
We had an issue with our internal storage component, which was causing random crashes, boot failures, reboots, etc.
Storage issues were noticed when excessive logging was written onto the disk.
Based on feedback from many of our customers, we have changed the storage component, which provides better IO speed and reliability.
Field response has been positive so far from customers who are using SRX300 series devices with the new storage component and a newer Junos version (15.1X49-D150 and above).
We would definitely like to help you and fix all the reliability issues that you have been experiencing.
Could you open up a JTAC support ticket to assist you better?
This is a major concern with the SRX3XX series line and yes there is a newer revision with a new storage/flash chip that is supposed to help. I'd be shocked if any of the SRX340's we use have lasted 6 months with some dying 3 months in. We have run 17.4 and 18.4 code running "ngfw" features and external logging.
Do note that the newer SRX revision with the new storage chip has minimum Junos requirements. There is a TSB covering this: TSB17581
There is also a KB article written about disabling some logging features to increase the lifespan of the storage chip.
I have also had some reliability problems with the SRX340 platform. We recently installed 6 units at 3 sites, all in a high availability clustered configuration, and in less than 9 months every single unit has been replaced. 100% failure rate in 9 months! In one case the replacement was DOA and had to be replaced. Other than the DOA replacement, we have not seen a replacement unit fail yet. Not sure what to read into that. Not sure exactly what is causing the problems. The devices fail, show a kernel panic error, and then will not boot. Juniper has been great in sending out replacements, but it is not clear that the underlying problems have been identified and fixed.
Hello, curious if you opened a JTAC case so they can take a look at these failures along with engineering, to avoid this reported issue in the future.
Would an active Jcare contract be required to have units that have suffered premature flash failure be looked at? I purchased 8 SRX300s about 2.5 years ago; they don't have active Jcare though, as it was seen to be cheaper to just have a spare on hand. I've now had to replace 2 of them because of this issue. Both of them were always problematic, with random crashes requiring a power cycle every 30-90 days. I got tired of it for one of them that was more critical, so I replaced it. The other one I kept in service, but on Friday it crashed again and now seems to have complete NAND failure. I got the unit booted back up from USB, but it does not even detect the da0 disk anymore. The other one that I replaced will still boot, but sometimes fails during boot.
The reliability of these units is very concerning, especially when we are using them in places like fire stations. When they fail our dispatchers are not able to deliver critical information to fire fighters.
I would expect that an active support contract would be required.
However, since you don't have support, you can always buy/upgrade the eUSB chip built into these branch SRXs that use the eUSB storage.
For ultimate durability you can go for an eUSB chip with SLC NAND flash (the current ATP chip is MLC). We ordered a replacement eUSB from the same manufacturer, ATP, but utilizing SLC, from Digikey; the SRX recognized the disk and we were able to install Junos. There are other manufacturers of the same style of eUSB storage chips.
The flash chip on the SRX300 series was changed to a different brand and type around June 2019, so a device manufactured after June 2019 will have way higher durability than previous batches. That is also why devices from this date (both new and RMA) will require at least Junos 15.1X49-D150, 17.4R3, 18.2R2 or 18.3R1, as previous releases do not have the needed driver.
This is mentioned here: https://kb.juniper.net/InfoCenter/index?page=content&id=TSB17581
An active support contract with hardware replacement is needed if the unit is more than a year old and out of warranty - but then the RMA goes through without any issues.
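Since an RMA unit may arrive with the new flash chip, it helps to check up front whether the release you plan to load meets the minimums quoted above. Below is a small Python sketch of such a check. The function names are hypothetical, the version parsing only handles the two string shapes seen in this thread (like 15.1X49-D150 and 17.4R3), and treating every train newer than 18.3 as fine is my assumption, not something the TSB text here confirms.

```python
import re
from typing import Optional, Tuple

# Minimum releases quoted in the post, keyed by release train.
MINIMUMS = {
    "15.1X49": 150,  # 15.1X49-D150
    "17.4": 3,       # 17.4R3
    "18.2": 2,       # 18.2R2
    "18.3": 1,       # 18.3R1
}

_X_RELEASE = re.compile(r"^(\d+\.\d+X\d+)-D(\d+)")
_R_RELEASE = re.compile(r"^(\d+\.\d+)R(\d+)")


def parse_junos(version: str) -> Optional[Tuple[str, int]]:
    """Split a version string into (train, build); None if unrecognised."""
    for rx in (_X_RELEASE, _R_RELEASE):
        match = rx.match(version.strip())
        if match:
            return match.group(1), int(match.group(2))
    return None


def supports_new_flash(version: str) -> Optional[bool]:
    """True/False when the quoted minimums decide it, None when unknown."""
    parsed = parse_junos(version)
    if parsed is None:
        return None
    train, build = parsed
    if train in MINIMUMS:
        return build >= MINIMUMS[train]
    if "X" not in train:
        major, minor = (int(part) for part in train.split("."))
        if (major, minor) > (18, 3):
            return True  # assumption: later trains include the driver
    return None  # train not covered by the post; check TSB17581


for v in ("15.1X49-D210", "15.1X49-D140", "17.4R2-S3", "18.4R3", "18.1R3"):
    print(v, supports_new_flash(v))
```

Returning None rather than guessing for trains the post does not mention (18.1, for example) is deliberate: those cases should be looked up in TSB17581 rather than inferred.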
18.4R3, 19.1R3, and 19.2R2 were released as well, and they should help reduce the amount of data being written to the eUSB chip. One major culprit that affected us was the user authentication table being written to the database (UserFW/UserID). This database has since been moved off the eUSB and is now stored in RAM.
We are a Juniper Reseller. We have had 7 SRX300's crash over a 6 month period. All documented. They start by locking up and after approximately 3 restarts, they fail.
We have seen 4 SRX300 fail recently. Console connection indicates that no boot media is available. Same behavior where they lock up, need reboot to work normally. After a few reboots, they will not boot. These devices are all less than 3 years old. Running 15.1X49-D210 and 220.