SRX

Significant SRX reliability problems

  • 1.  Significant SRX reliability problems

    Posted 12-06-2017 13:48

    Generally speaking, I really like working with the SRX.  We use 210, 220, and 240 models throughout the company.  It's trivially easy to set up tunnels with OSPF to do all kinds of neat inter-office connectivity, and working with JTAC is WAY better than Cisco TAC.  (we have a Cisco phone system)

     

    Five years ago, we bought 15 new SRXes from an authorized Juniper dealer, and each was installed in a separate geographic location.

     

    I'm having GRAVE concerns about their reliability.  In the past 2 years, 5 of the 15 have failed with a 6th one heading to the toilet.

    • One lost its flash-- no storage recognized at boot.  It will only boot from a USB stick.
    • Another suddenly came up with a huge number of flash errors, enough that we had to remove it from service-- and this one's in a very high quality colo facility (clean power always).
    • One has a "reset" button problem, such that it kept resetting itself to factory defaults randomly.  I had to set "config-button no-clear" as a workaround.
    • One randomly lost power internally several times a day... not an OS crash, but as in "all the lights blink off then back on".  (Power supply swap didn't help.)
    • One slowly lost its RJ-45 interfaces, one at a time.  I moved services to other interfaces as they failed, until one day...... the unit just crashed and never rebooted.
    • Another one is starting the "randomly loses power internally" issue, in the exact same way as the other one did.  I'm configuring its replacement today.

    6 failures out of 15... that's a 40% failure rate in 5 years.  For the record, all are on APC UPSes of varying capacities, and utility power problems are extremely rare.
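
    For anyone who wants to compare these numbers with other fleets, here is a minimal Python sketch of the arithmetic above. The annualized figure assumes a constant yearly failure probability across the 5 years, which is a simplification and not something the failure data establishes. The function names are illustrative only.

```python
def cumulative_failure_rate(failed, total):
    """Fraction of the fleet that has failed so far."""
    return failed / total


def annualized_failure_rate(cumulative, years):
    """Yearly failure probability that compounds to `cumulative` over `years`.

    Assumption: every unit has the same, constant chance of failing each year.
    """
    return 1 - (1 - cumulative) ** (1 / years)


# Figures from the post: 6 failed units out of 15, over 5 years.
cumulative = cumulative_failure_rate(6, 15)
annual = annualized_failure_rate(cumulative, 5)
print(f"cumulative: {cumulative:.0%}, annualized: {annual:.1%}")
```

    Under that assumption, 40% over 5 years works out to a little under 10% of units failing per year.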

    Is the SRX really this much of a failure-prone dog?  Juniper Netscreens we bought circa 2005-06 are still running TODAY with no problems at all... which is why I was so anxious to adopt the SRX at new locations.  But wow.... the problems never end.

    Are we alone in this experience?



  • 2.  RE: Significant SRX reliability problems

    Posted 12-07-2017 01:53

    Hi!

     

    No - you're not the only one. We didn't have any jack or push-button issues, but we have had loads of problems with bad blocks in NAND, which often lead to problems during upgrades (i.e. when the boot partition changes). We have also seen ISSU going haywire and systems responding extremely slowly after a config change (we had to re-image the device). Or SRXes stuck in the bootloader for no reason - issuing a 'boot' then brings them up (we've had it with several SRX300s so far) - but of course that has to be done from the console, i.e. driving to the customer's site and doing it locally, since customers usually don't have a serial adapter and don't want to, or aren't able to, revive their equipment. Not to mention the extended downtime at the customer's site...

     

    And it's not the SRXes alone - in the last few months, we had increasing problems with EX-switches too.

    Corrupted filesystems (no power outage - the NAND simply 'slowly dies' during regular operation within 2 years. JTAC tells me that's normal and we have to live with it). An update of a 9-chassis VC left 4 of the chassis at the boot prompt.

    Spontaneous reboots after a simple commit, false emergency fire shutdowns due to a possible bug in the CPU temp sensor.

    JUNOS quality has suffered massively - we ran into many bugs in the past, most of them 'confidential', i.e. we didn't even have a chance to work around them. To make things worse, many (not all!) JTAC engineers have a strange way of tackling problems ('please try to install a different JUNOS version in your production environment - we don't know if it will work (potluck), but hey, it's just half an hour of downtime (if you're lucky) and a drive to the customer's site (since you might lose network access to the devices and need console access) - it might cost you a few thousand bucks, but be honest, money is not an issue...') or ('well, NAND problems are unavoidable - please check NAND on all your (200+) devices once a week to quickly identify problems...').

    And I have the feeling that often, they didn't even try once to actually install their recommended versions of Junos on the corresponding devices - more than once, the recommended version didn't work at all on the device (too little memory). Funny things then happen (e.g. the system boots and forwards packets but doesn't NAT anymore - no error messages...).

    I have already complained to Juniper multiple times, asking them to beef up their QA again - so far in vain.

     

    Kai



  • 3.  RE: Significant SRX reliability problems

    Posted 12-07-2017 06:40

    Juniper has been my go-to vendor for over a decade-- but their reliability problems are killing us-- and rapidly changing my mind.

     

    I hate to say it, but I'm taking another look at pfSense, because that will give me control over hardware quality.  Last I saw, they didn't do routed IPSec, which was a show-stopper, and I really DON'T want to mess with Cisco PIX.  Dealing with TAC for our Cisco phone system is a big enough nightmare.  But none of our Cisco gear (switches, VPN gateway, phone system) has failed in any way.

     

    Juniper is killing themselves with quality control problems.  Maybe not on million-dollar carrier gear, but definitely on branch tier equipment.



  • 4.  RE: Significant SRX reliability problems

     
    Posted 12-07-2017 22:00

    Hello 

     

    Thanks a lot for the feedback.

     

    To understand better, what is the model of the newly procured SRXes and what is the JUNOS version this fleet is running?

     

    Regards,

     

    Vikas



  • 5.  RE: Significant SRX reliability problems

    Posted 12-08-2017 03:36

    "Newly procured?"  Per my original post, these were bought 4-5 years ago-- which is still fairly young in networking gear terms.  (except the one with the reset button problem is just under 3 years old, problem started at age 2)


    210HE, 220H, 240H

     

    Some are on 12.1X46-D67, others are still at 12.1X44-D30.4.  

     

    That's another MAJOR complaint.  ALL of our devices are still under PAID support, but there is NO JUNOS version we can run that mitigates vulnerabilities CVE-2016-10012, CVE-2016-10010, CVE-2015-6564 and CVE-2015-8325.  The fix is 12.3X48-D55 but none of our devices can run that build, per JTAC, because they are not the newer H2 model.  It is also impossible to disable SSL 3.0 and TLS 1.0 (per JTAC) because the builds that do that are also NOT able to run on our still-paid-supported gear.  I put in an enhancement request for that, but haven't heard a thing.  So I've had to disable nearly all external access on devices that are a long distance away.



  • 6.  RE: Significant SRX reliability problems

    Posted 03-06-2018 11:05

    Since I wrote the original post 3 months ago, we've had 2 additional failures.  One crashed in service and on reboot couldn't find boot device (flash failure). 

    To replace it, I pulled a gently-used SRX off the shelf which was removed from a shuttered location.  Unit was running perfectly when it was gracefully shut down and brought back to the corporate server room for storage.  When it was booted to replace the flash-failed unit mentioned above, the primary boot partition couldn't be read so it booted to backup partition.  I tried to reformat the failed partition (req sys snap slice alt) but that failed with (can't remember the exact words) an error related to partition inaccessible or media unreadable-- something like that.  So it smells like another flash failure.

    Unfortunately, after looking at options, we had no choice but to buy more Juniper because of the effort involved mixing another vendor into a production environment with so many tunnels.  So we're getting a batch of SRX320 and 340 models.

    I hope they're more reliable, because my confidence in Juniper is at an all-time low right now.



  • 7.  RE: Significant SRX reliability problems

     
    Posted 03-08-2018 07:00

    Hi

     

    I am sad to hear you have had so many failures. This looks like an anomaly to me. Our experience with ten SRX240 boxes after ~5 years of running in a lab rack: zero failures. Are you monitoring the devices' temperature? Could it be too high?

     



  • 8.  RE: Significant SRX reliability problems

    Posted 03-08-2018 09:59

    Yes, all devices are kept in rooms with proper cooling and humidity.  The failed colo router is in a premium colocation facility where temp, humidity, and power are rigorously maintained-- and we've reviewed the logs to verify.  In our own on-premises telco/server rooms, we have dedicated cooling, and make extensive use of APC brand UPSes in various configurations.

    There is nothing environmentally that would explain the failures.  Additionally, each of these locations has other brands of equipment, from Cisco switches and voice gateways, to HP and Dell servers, to video surveillance, and many other types of gear.  The only.... and I stress ONLY.... equipment failures we've had are the Juniper SRXes.

    Corrupt or missing Flash.  Front panel reset button that seems to frequently "push itself" (enough that I had to disable it in config).  Ports that go bad for no apparent reason.  One unit even randomly goes dark (as if losing power) for a few seconds then powers back on.  (We replaced the power supply and cables on that one, but the problem remained.)

    Many different types of failures, but only on our Juniper SRX devices.  The Juniper branded Netscreens (NS-25, NS-50) bought in 2006 are still running perfectly with zero failures after 12 years.



  • 9.  RE: Significant SRX reliability problems

    Posted 08-14-2019 00:37

    Hi, 

     

    We are an MSP in the Netherlands. We've been running Juniper for 10 years now, some SSG-5s for 8+ years. After many interface issues but otherwise reliable operation with the SRX-200/220, we switched to SRX-300s. In the last 4 years we replaced ALL of our SRX-300s, all due to: no interface / overheating / crashing or general failure. We replaced 12 units and have made the hard decision to switch to Unifi equipment. A lot less functionality, but very capable as a generic router/firewall. I'm very displeased with how the SRX-300 has performed in recent years.

     

    Refreshing regards,
    Chris -  Lime Networks



  • 10.  RE: Significant SRX reliability problems

     
    Posted 08-18-2019 21:28

    Hello Chris,

     

    Thank you for the feedback, and very sorry to hear about your reliability problems with the SRX300 series product line.

     

    We had an issue with our internal storage component, which was causing random crashes, boot failures, reboots, etc.

    Storage issues were noticed when excessive logging was written onto the disk.

     

    Based on feedback from many of our customers, we have changed the storage component, which provides better IO speed and reliability.

    Field response has been positive so far from customers who are using SRX300 series devices with the new storage component and a newer Junos version (15.1X49-D150 and above).

     

    We would definitely like to help you and fix all the reliability issues that you have been experiencing.

    Could you open a JTAC support ticket so we can assist you better?

     

    Regards,

    Raveen



  • 11.  RE: Significant SRX reliability problems

    Posted 08-22-2019 15:09

    This is a major concern with the SRX3XX series line and yes there is a newer revision with a new storage/flash chip that is supposed to help. I'd be shocked if any of the SRX340's we use have lasted 6 months with some dying 3 months in. We have run 17.4 and 18.4 code running "ngfw" features and external logging.

     

    Do note that the newer SRX revision with the new storage chip has minimum Junos requirements. There is a TSB covering this: TSB17581

     

    https://kb.juniper.net/InfoCenter/index?page=content&id=TSB17581&actp=SUBSCRIPTION&act=login

     

    There is also a KB article written about disabling some logging features to increase the lifespan of the storage chip.

     

    https://kb.juniper.net/InfoCenter/index?page=content&id=KB34534

     

     



  • 12.  RE: Significant SRX reliability problems

    Posted 09-25-2019 08:22

    I have also had some reliability problems with the SRX340 platform.  We recently installed 6 units at 3 sites, all in a high availability clustered configuration, and in less than 9 months every single unit has been replaced.  100% failure rate in 9 months!  In one case the replacement was DOA and had to be replaced.  Other than the DOA replacement, we have not seen a replacement unit fail yet.  Not sure what to read into that.  Not sure exactly what is causing the problems. The devices fail, show a kernel panic error, and then will not boot.  Juniper has been great in sending out replacements but it is not clear that the underlying problems have been identified and fixed.



  • 13.  RE: Significant SRX reliability problems

    Posted 09-26-2019 10:15

    Hello, curious whether you opened a JTAC case so they can look at these failures along with engineering and avoid this issue in the future.



  • 14.  RE: Significant SRX reliability problems

    Posted 01-22-2020 07:24

    Would an active Jcare contract be required to have units that have suffered premature flash failure be looked at?  I purchased 8 SRX300s about 2.5 years ago; they don't have active Jcare though, as it was seen to be cheaper to just have a spare on hand.  I've now had to replace 2 of them because of this issue.  Both of them were always problematic, with random crashes requiring a power cycle every 30-90 days.  I got tired of it for one of them that was more critical, so I replaced it.  The other one I kept in service, but on Friday it crashed again and now seems to have complete NAND failure.  I got the unit booted back up from USB, but it does not even detect the da0 disk anymore.  The one that I replaced can still boot, but it will sometimes fail during boot.

     

    The reliability of these units is very concerning, especially when we are using them in places like fire stations.  When they fail our dispatchers are not able to deliver critical information to fire fighters.



  • 15.  RE: Significant SRX reliability problems

    Posted 01-22-2020 09:58

    I would expect that an active support contract would be required.

     

    However, since you don't have support, you can always buy/upgrade the eUSB chip built into these branch SRXs that use the eUSB storage.

     

    For ultimate durability you can go for an eUSB chip with SLC NAND flash (the current ATP chip is MLC). We ordered a replacement eUSB from Digi-Key, made by the same manufacturer (ATP) but utilizing SLC; the SRX recognized the disk and we were able to install Junos. There are other manufacturers of the same style of eUSB storage chip.



  • 16.  RE: Significant SRX reliability problems

    Posted 01-22-2020 10:36

    The flash chip on the SRX300 series was changed to a different brand and type around June 2019, so a device manufactured after June 2019 will have much higher durability than previous batches. That is also why devices from this date (both new and RMA) will require at least Junos 15.1X49-D150, 17.4R3, 18.2R2 or 18.3R1, as previous releases do not have the needed driver.

     

    This is mentioned here: https://kb.juniper.net/InfoCenter/index?page=content&id=TSB17581

     

    An active support contract with hardware replacement is needed if the unit is more than a year old and out of warranty - but then the RMA goes through without any issues.
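
    As a rough illustration of the minimum-release rule above, here is a small Python sketch that checks a Junos version string against the four releases quoted in this post. The parsing and the function names are my own assumptions, not a Juniper tool. In particular, treating every train newer than 18.3 as supported, and every other unlisted train as unsupported, is an assumption; TSB17581 is the authoritative list.

```python
import re

# Minimum builds quoted in the post, keyed by release train.
MINIMUMS = {
    (15, 1, "X49"): 150,  # 15.1X49-D150
    (17, 4, ""): 3,       # 17.4R3
    (18, 2, ""): 2,       # 18.2R2
    (18, 3, ""): 1,       # 18.3R1
}

# Matches "15.1X49-D150" style and "17.4R3" style; trailing suffixes are ignored.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:(X\d+)-D(\d+)|R(\d+))")


def parse_junos_version(version):
    """Return ((major, minor, x_train), build) for a Junos version string."""
    match = VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized Junos version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    if match.group(3):
        return (major, minor, match.group(3)), int(match.group(4))
    return (major, minor, ""), int(match.group(5))


def meets_new_flash_minimum(version):
    """True if `version` is at or above the minimums listed in the post."""
    train, build = parse_junos_version(version)
    if train in MINIMUMS:
        return build >= MINIMUMS[train]
    # Assumption: trains newer than 18.3 carry the driver; others do not.
    return train[:2] > (18, 3)


for v in ("15.1X49-D210", "15.1X49-D140", "17.4R2", "18.4R3", "12.1X46-D67"):
    print(v, meets_new_flash_minimum(v))
```

    A check like this could be run against an inventory export before shipping an RMA replacement to a site that pins older code.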



  • 17.  RE: Significant SRX reliability problems

    Posted 01-22-2020 11:42

    18.4R3, 19.1R3, and 19.2R2 have been released as well, and they should help reduce the amount of data being written to the eUSB chip. One major culprit that affected us was the user authentication table being written to the database (UserFW/UserID). This database has since been moved off the eUSB and is now stored in RAM.



  • 18.  RE: Significant SRX reliability problems

    Posted 06-02-2020 05:20

    We are a Juniper Reseller. We have had 7 SRX300's crash over a 6 month period. All documented. They start by locking up and after approximately 3 restarts, they fail.

     



  • 19.  RE: Significant SRX reliability problems

    Posted 08-27-2020 06:02

    We have seen 4 SRX300s fail recently. The console connection indicates that no boot media is available. Same behavior where they lock up and need a reboot to work normally. After a few reboots, they will not boot. These devices are all less than 3 years old. Running 15.1X49-D210 and D220.



  • 20.  RE: Significant SRX reliability problems

    Posted 4 days ago
    I have an SRX345 with the same problem. This unit is only 2 years old...  The device has been rebooted several times and now it does not recognise the internal media.

    Is there any way to recover it? Or do we need to open a JTAC case? Is this a production defect?

    Thanks in advance!!

    ------------------------------
    DAVID BABIANO RODRIGUEZ
    ------------------------------



  • 21.  RE: Significant SRX reliability problems

    Posted 2 days ago
    I have had the same reliability issues with the SRX300 and SRX340 devices. Random reboots, lockups when attempting to reboot from cli or Space and lockups while operational requiring a power cycle to recover. I have since had all my SRX300 and SRX340 models replaced by JTAC using the RMA process. 

    To determine which units have the eUSB NAND storage that causes the reliability problems, run the CLI command "show chassis hardware detail no-forwarding". If the output contains the following, the unit has the older storage:

    Routing Engine          REV 0x09     650-065039      CV87654321      RE-SRX300
          da0       7671 MB   ATP CG eUSB          Nand Flash     <<<< ATP CG eUSB
     
    On the newer SRX300 series running the same cli command "show chassis hardware detail no-forwarding" we see the change in the Nand Flash storage used as below:

    Routing Engine          REV 0x14     650-065039      CV12345678      RE-SRX300
          da0       7640 MB   USB Flash Module     Nand Flash     <<<< USB Flash Module

    We no longer have ATP CG eUSB Nand Flash Storage which caused all the reliability problems.

    My SRX300 Branch Series devices are so much more reliable since having them all replaced by RMA.
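
    If you have many units to check, the comparison above can be scripted. Below is a minimal Python sketch that classifies the da0 line from saved "show chassis hardware detail no-forwarding" output. It only recognizes the two storage strings shown in this post; the regular expression, the function name, and the "old"/"new"/"unknown" labels are illustrative assumptions, and real output may differ in spacing or wording.

```python
import re

# da0 line: device name, size in MB, then the storage description,
# which ends at a run of two or more spaces or at end of line.
DA0_RE = re.compile(r"^\s*da0\s+(\d+)\s+MB\s+(.+?)(?:\s{2,}|$)", re.MULTILINE)

OLD_STORAGE = "ATP CG eUSB"       # older storage, per the post
NEW_STORAGE = "USB Flash Module"  # newer storage, per the post


def classify_storage(cli_output):
    """Return 'old', 'new', or 'unknown' based on the da0 description."""
    match = DA0_RE.search(cli_output)
    if match is None:
        return "unknown"
    description = match.group(2).strip()
    if description == OLD_STORAGE:
        return "old"
    if description == NEW_STORAGE:
        return "new"
    return "unknown"


# Sample text modeled on the two outputs quoted above.
old_unit = (
    "Routing Engine   REV 0x09   650-065039   CV87654321   RE-SRX300\n"
    "      da0       7671 MB   ATP CG eUSB          Nand Flash\n"
)
new_unit = (
    "Routing Engine   REV 0x14   650-065039   CV12345678   RE-SRX300\n"
    "      da0       7640 MB   USB Flash Module     Nand Flash\n"
)

print(classify_storage(old_unit))
print(classify_storage(new_unit))
```

    Anything reported as "unknown" (including a unit that no longer shows da0 at all) still needs a manual look.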




    ------------------------------
    Stuart
    ------------------------------