Switching


Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

  • 1.  Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

    Posted 12-02-2020 15:03
    Hello everybody,
        I have 200+ EX2300 switches running smoothly at various client sites. They just let traffic pass between 2 interfaces so I can keep an eye on it.
    Recently, I lost remote contact with one switch. All traffic kept flowing, but I had no access to my management VLAN.
    Plugging in at the console, I noticed the messages log file was completely filled with this:
    Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 brcm_pkt_buf_alloc:393 (buf alloc) failed allocating packet buffer
    Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 (buf alloc) failed allocating packet buffer
    Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 brcm_pkt_buf_alloc:393 (buf alloc) failed allocating packet buffer
    Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 (buf alloc) failed allocating packet buffer
    Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 brcm_pkt_buf_alloc:393 (buf alloc) failed allocating packet buffer
    Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 (buf alloc) failed allocating packet buffer


    So I replaced switch A with switch B, using the same configuration.
    Switch A is back to normal in the lab, and switch B in the field crashed after 2 days... with traffic still flowing. Its messages log is filled with the same lines.
    The client won't let me change the switch because all traffic keeps flowing perfectly :-)

    At this point, I welcome any idea as to what can put a switch in such a state.

    The configuration is basically the same on all 200+ switches.
    2 switches failed with the same problem at the same site.

    I opened a case with JTAC. They mention a "memory leak" but have not found a cause yet.

    Any clue?
    Michel



    ------------------------------
    Michel Lapointe
    ------------------------------


  • 2.  RE: Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

     
    Posted 12-02-2020 16:17
    Hello Michel,

    Although I do not know whether this is a known issue (it looks like a clear memory leak), I would suggest upgrading to a current JUNOS release (e.g. 19.4R3) and checking whether the issue still occurs. Sometimes this saves you from JTAC cases, which can take a very long time.

    ------------------------------
    ------------------------------
    If my answer provides the solution, please mark my post as "Accepted Solution".
    If you think my answer helps, please spend some Kudos
    ------------------------------



  • 3.  RE: Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

    Posted 12-02-2020 17:11
    Hello F1ght3r

    I am running 18.2R3-S3 on all my 200+ switches. None of them has behaved like that since the deployment started in February,
    so an image upgrade is on hold for now.
    I used the same config file on switches A and B, and it is a carbon copy of the other switches' config, except for the description and the irb interface IP address.
    2 switches acting like this must have something in common: either the environment or the config files.
    But what kind of environment or config would put a switch in such a state?
    The log messages are useless since they are filled with the same line.
    Michel

    ------------------------------
    Michel Lapointe
    ------------------------------
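When a messages file is flooded with one repeated line, as described above, collapsing identical messages can show whether anything else is buried in it. The following is a minimal Python sketch, assuming the log has been copied off the switch as plain text and that every line follows the "month day time host message" layout of the excerpt in the first post. The function name `summarize` is made up for illustration, and the sample only reuses lines quoted in this thread.

```python
import re
from collections import Counter

# Layout assumed from the log excerpt in the first post:
# "Nov 18 13:05:01 <hostname> <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def summarize(lines):
    """Count identical messages, ignoring the timestamp and hostname."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts


# Sample lines copied from the first post of this thread.
sample = [
    "Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 brcm_pkt_buf_alloc:393 (buf alloc) failed allocating packet buffer",
    "Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 (buf alloc) failed allocating packet buffer",
    "Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 brcm_pkt_buf_alloc:393 (buf alloc) failed allocating packet buffer",
    "Nov 18 13:05:01 ReseauBiblio-SitePrincipal fpc0 (buf alloc) failed allocating packet buffer",
]

# Print the most frequent messages first; rare ones end up at the bottom.
for message, count in summarize(sample).most_common():
    print(f"{count:6d}  {message}")
```

On a real file, the interesting entries would be the ones with low counts at the end of the listing.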



  • 4.  RE: Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

     
    Posted 12-03-2020 04:03
    Hello Michel,

    This depends heavily on the customer traffic. It is entirely possible that one customer on each switch sends specific packets that trigger the EX memory leak.
    The long-term resolution is to solve this together with JTAC. As a short-term attempt, you can install a current JUNOS release, which includes the newest software fixes, and check whether the issue still appears.




  • 5.  RE: Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

    Posted 12-03-2020 08:43
    Something like this has been posted before in the community:
    https://community.juniper.net/communities/community-home/digestviewer/viewthread?MID=73003
    I will look in my emails for a more detailed version, but in the past we had this issue with 18 where, just as you described, the PFE / data plane kept running but the control plane did not. A reboot fixes this temporarily.
    I've been running 19.4R on my 2300s, standalone and in VC, with no issues. Uptimes since the last upgrade are 276 to 294 days across 10 sites.

    SIDE NOTE: Due to the recent PR https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1491905
    I've been upgrading mine, as we have had years of seeing mcast kill our 2300s in the way it describes. We think we have been sidestepping this issue by having our wireless controller drop mcasts at the AP. We thought it was just due to the low capacity of the 2300s. But here's hoping this shows me I'm wrong and the 2300 can do a lot more.


  • 6.  RE: Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

    Posted 12-03-2020 09:28
    Hello tgreaser,
    Thanks for the answer.
    I checked the first link and, while scrolling down, it rang a bell. Then I saw myself in the discussion from 2019! It turns out this was related to the infamous PR1442376  https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1442376&actp=SUBSCRIPTION which causes 2300 switches to go zombie: absolutely no communication whatsoever, but traffic keeps flowing. That is not the case here, since I have console access and the selector on the right side of the switch still works, both of which are things PR1442376 affected. For a moment I was afraid that PR1442376 had reared its ugly head and that I would be forced to act on the promise I made at the time and kill myself :-)

    The second link looks very promising, though: it turns out I am running 18.2R3S3, which is affected by the issue. The problem occurs at a specific site, for a specific little network. We moved the traffic to another switch yesterday but left the old one running, and the same error messages keep coming out (see video). The PR mentions very low idle CPU as a symptom, so I'll go and check it this afternoon.

    If the new switch goes berserk again, I'll upgrade the old one to 18.2R3S5 (the recommended EX2300-C image), move the wiring back to it, and keep you posted.

    A video is worth a thousand words: I attached a clip of the old switch reacting to a monitor start messages command.
    Thanks again.
    Michel


    ------------------------------
    Michel Lapointe
    ------------------------------



  • 7.  RE: Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

    Posted 12-03-2020 15:33
    I had this happen once. It's a good example of the clear separation between the RE (control plane) and the PFE (data plane).

    In my case, there was no fancy fix. I rebooted the switch, then I put the JTAC recommended Junos on it. 
    Haven't had the problem reoccur since.


  • 8.  RE: Ex2300 keeps switching but ...fpc0 (buf alloc) failed allocating packet buffer

    Posted 12-03-2020 16:19
    Good evening Luke and Tgreaser,
         I am pursuing the https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1491905  possibility.  Here is what I found today:
    The old switch is still running after being disconnected, with the traffic now connected to the new switch. So I have the down switch still spitting messages and refusing to talk to anybody except via the console.
    The PR mentions...
    To check if the device has high CPU load due to this issue, the administrator can issue the following command:
    user@host> show chassis routing-engine
    Routing Engine status:
    ..
    Idle 2 percent
    
    the "Idle" value shows as low (2 % in the example above), and also the following command:
    user@host> show system processes summary
    ..
    PID USERNAME PRI NICE SIZE RES STATE TIME WCPU COMMAND
    11639 root 52 0 283M 11296K select 12:15 44.97% eventd
    11803 root 81 0 719M 239M RUN 251:12 31.98% fxpc{FXPC}
    
    the eventd and the fxpc processes might use higher WCPU percentage (respectively 44.97% and 31.98% in the above example).
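
The check quoted from the PR can be applied to captured CLI output with a short script. This is only a Python sketch under stated assumptions: the output of `show system processes summary` has been saved as text, process rows have the ten columns shown in this thread, and the 15 percent cut-off is an arbitrary value chosen for illustration (the PR text quoted here gives no threshold). The function names, and the decision to add up rows that belong to the same process, are hypothetical choices as well.

```python
WATCHED = ("eventd", "fxpc")  # the two processes named in the PR excerpt
THRESHOLD = 15.0  # percent; illustrative assumption, not taken from the PR


def parse_wcpu(output):
    """Return (command, wcpu_percent) pairs from processes-summary text."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Process rows start with a numeric PID and carry WCPU as "NN.NN%".
        if len(fields) >= 10 and fields[0].isdigit() and fields[-2].endswith("%"):
            rows.append((fields[-1], float(fields[-2].rstrip("%"))))
    return rows


def suspicious(output, threshold=THRESHOLD):
    """Sum WCPU per watched process and keep those at or above the threshold."""
    totals = {}
    for command, wcpu in parse_wcpu(output):
        name = command.split("{")[0].lower()  # "fxpc{FXPC}" -> "fxpc"
        if name in WATCHED:
            totals[name] = totals.get(name, 0.0) + wcpu
    return {name: pct for name, pct in totals.items() if pct >= threshold}


# The example rows quoted from the PR above.
pr_example = """\
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU COMMAND
11639 root 52 0 283M 11296K select 12:15 44.97% eventd
11803 root 81 0 719M 239M RUN 251:12 31.98% fxpc{FXPC}
"""

for name, pct in sorted(suspicious(pr_example).items()):
    print(f"{name}: {pct:.2f}% WCPU")
```

Lines that are not process rows (the header, the load-average line, the `...` placeholders) are skipped because their first field is not a numeric PID.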
    ​

    here is what I found on my good switch ...

    root@ReseauBiblio-SitePrincipal> show system processes summary
    ...
      PID USERNAME PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
       11 root     155 ki31     0K    16K RUN    129.6H  81.98% idle
    10156 root     -52   r0   727M   248M select 311:55   4.98% fxpc{fxpc}
    10156 root      52    0   727M   248M select 387:50   3.96% fxpc{fxpc}
       21 root     -16    -     0K    16K -      179:28   0.98% rand_harvestq
    10177 root      20    0   486M   131M select  64:30   0.98% authd

    {master:0}
    root@ReseauBiblio-SitePrincipal> show chassis routing-engine
    Routing Engine status:
    ....
        5 sec CPU utilization:
          User                       6 percent
          Background                 0 percent
          Kernel                     9 percent
          Interrupt                  1 percent
          Idle                      83 percent
    ...

    now here are the readings from the sick switch

    root@ReseauBiblio-SitePrincipal> show system processes summary
    last pid: 11709;  load averages:  3.33,  3.49,  3.47  up 5+16:26:53    15:28:45
    ...
      PID USERNAME PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
     4894 root     -52   r0   727M   249M select  56.9H  51.95% fxpc{fxpc}
     4894 root      78    0   727M   249M RUN     24.9H  21.97% fxpc{fxpc}
     4869 root      78    0   287M 12144K RUN     23.3H  19.97% eventd
       21 root     -16    -     0K    16K -      162:25   0.98% rand_harvestq

    {master:0}
    root@ReseauBiblio-SitePrincipal> show chassis routing-engine
    Routing Engine status:
      ...
        5 sec CPU utilization:
          User                      69 percent
          Background                 0 percent
          Kernel                    30 percent
          Interrupt                  1 percent
          Idle                       0 percent

      ...
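
To compare the two readings above without eyeballing them, the idle figure can be pulled out of saved `show chassis routing-engine` output. A small Python sketch, assuming the text contains an "Idle NN percent" line like the ones pasted here. It returns the first such line it finds, which in these excerpts is the 5 sec value. The helper name `idle_percent` is hypothetical.

```python
import re

# Matches a line such as "      Idle                      83 percent".
IDLE_RE = re.compile(r"^\s*Idle\s+(\d+)\s+percent", re.MULTILINE)


def idle_percent(output):
    """Return the first idle percentage found in the text, or None."""
    match = IDLE_RE.search(output)
    return int(match.group(1)) if match else None


# Excerpts copied from the two readings in this post.
good_switch = """\
    5 sec CPU utilization:
      User                       6 percent
      Background                 0 percent
      Kernel                     9 percent
      Interrupt                  1 percent
      Idle                      83 percent
"""

sick_switch = """\
    5 sec CPU utilization:
      User                      69 percent
      Background                 0 percent
      Kernel                    30 percent
      Interrupt                  1 percent
      Idle                       0 percent
"""

for label, text in (("good", good_switch), ("sick", sick_switch)):
    print(f"{label} switch idle: {idle_percent(text)} percent")
```

What counts as "low" is left to the reader; the PR example quoted earlier shows 2 percent.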

    I'll send this to JTAC later to see what they think. Right now, I am hoping for this switch to go down like the others did. I'll then install 18.2R3S5 on the old one, move the wiring back to it and, hopefully, call it problem solved.

    Then all I have to do is install the image on the other 229 switches... :-(

    Michel



    ------------------------------
    Michel Lapointe
    ------------------------------