    QFX5K Packet Buffer Architecture [Tech post]

    Posted 07-19-2023 16:41
    Edited by spuluka 10-10-2023 05:39


    On QFX5K platforms, every packet that enters the ingress pipeline is stored in the central MMU packet buffer before it egresses. Packet buffers are required for burst absorption, packet replication, and several other use cases. Without packet buffers, bursty traffic is tail dropped, which leads to poor network performance. An egress port can receive excess bursts in the following scenarios:

    ·      Multiple ingress ports sending traffic to single egress port.

    ·      Traffic from high speed ingress ports to low speed egress ports.

    Packet buffers are also essential for implementing congestion management features such as priority flow control (PFC), explicit congestion notification (ECN), and WRED.

    QFX5K data center switches have a shallow on-chip buffer. These switches use Broadcom's Trident and Tomahawk series ASICs. The shallow on-chip buffer provides low-latency packet forwarding. Because the buffer is limited, managing it across congested queues and giving each queue fair access is very important. This document explains the QFX5K buffer architecture in detail.

    Packet Buffers

    On QFX5K platforms, on-chip buffers are measured and allocated in units of cells.

    Cell size varies by platform (see the table below). A packet is stored in one or more cells, depending on its size. The first cell stores 64 bytes of packet metadata (except on the QFX5220, where the metadata is only 40 bytes), and the remaining space in that cell stores packet data. Subsequent cells store only packet data.


    Cell size (bytes)

    QFX5100, QFX5110, QFX5200, QFX5210




    QFX5220, QFX5130
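
    The cell layout described above can be sketched as a small calculation. This is only an illustration of the stated rule (metadata occupies the start of the first cell, later cells carry data only). The function name and the 208-byte cell size used in the loop are assumptions for demonstration, not values taken from this post.

```python
import math

def cells_for_packet(packet_bytes: int, cell_bytes: int, metadata_bytes: int = 64) -> int:
    """Cells needed for one packet: the first cell holds metadata plus data,
    every following cell holds data only."""
    if packet_bytes <= 0:
        raise ValueError("packet_bytes must be positive")
    if cell_bytes <= metadata_bytes:
        raise ValueError("cell must be larger than the metadata it carries")
    first_cell_payload = cell_bytes - metadata_bytes
    remaining = max(0, packet_bytes - first_cell_payload)
    return 1 + math.ceil(remaining / cell_bytes)

# Hypothetical cell size, chosen only to make the example concrete.
ASSUMED_CELL_BYTES = 208

for size in (64, 144, 145, 1500, 9000):
    print(size, "bytes ->", cells_for_packet(size, ASSUMED_CELL_BYTES), "cells")
```

    Because of the per-packet metadata, a stream of small packets consumes more cells per byte of payload than a stream of large packets.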



    Packets from an ingress port are placed in the MMU buffer until the egress port schedules them. When the egress queues are not congested, packets are dequeued from the MMU as soon as they arrive; in that case, per-port buffer utilization peaks at roughly 3 KB, depending on port load. Under congestion, packets continue to accumulate in the MMU buffers until the buffers are exhausted.

    Concept of Ingress and Egress buffers

    Buffer memory has separate ingress and egress accounting to make accept, drop, or pause decisions. Because the switch has a single pool of memory with separate ingress and egress accounting, the full amount of buffer memory is available from both the ingress and the egress perspective. Packets are accounted for as they enter and leave the switch, but there is no concept of a packet arriving at an ingress buffer and then being moved to an egress buffer. 

    The MMU tracks cell utilization from multiple perspectives to support multiple buffering goals. These goals typically include asserting flow control to avoid packet loss, tail dropping when a resource becomes congested, and randomly dropping packets so that stateful protocols can degrade gracefully. "Multiple perspectives" here mainly means the ingress and egress perspectives, also called ingress admission control and egress admission control.

    Both ingress admission control and egress admission control account for buffers against different entities: priority group (ingress), queue (egress), service pool, and port. Each of these entities has static and dynamic thresholds for buffer utilization. Once any ingress or egress buffer threshold is reached, further packets are dropped. Packets stored before the threshold was reached still egress normally.

    For lossy traffic, the desired behavior during congestion is tail drop on the egress queues, because ingress drops cause head-of-line blocking, which also affects uncongested ports. When the switch operates in store-and-forward mode, you cannot control where drops occur (ingress or egress) unless the ingress and egress buffer thresholds are configured properly. Configuring the egress buffer threshold lower than the ingress buffer threshold for lossy traffic ensures that drops happen at egress.
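
    A toy model can show why the relative order of the two thresholds matters. The sketch below is a deliberate simplification and not the MMU's actual algorithm: every packet is one cell, one priority group per ingress port, one queue per egress port, congested ports never drain, and uncongested ports forward immediately. All names and limits are hypothetical.

```python
from collections import Counter

def simulate(arrivals, congested_ports, pg_limit, queue_limit):
    """Admit one-cell packets against ingress (PG) and egress (queue) limits."""
    pg_cells = Counter()      # ingress accounting, per ingress port
    queue_cells = Counter()   # egress accounting, per egress port
    drops = Counter()         # keyed by (where the drop happened, egress port)
    delivered = Counter()     # packets forwarded by uncongested egress ports
    for in_port, out_port in arrivals:
        if pg_cells[in_port] + 1 > pg_limit:
            drops[("ingress", out_port)] += 1
            continue
        if queue_cells[out_port] + 1 > queue_limit:
            drops[("egress", out_port)] += 1
            continue
        if out_port in congested_ports:
            # The same cell is counted from both perspectives while it is held.
            pg_cells[in_port] += 1
            queue_cells[out_port] += 1
        else:
            delivered[out_port] += 1
    return drops, delivered

# One ingress port sends 20 packets to congested port A, then 5 to idle port B.
arrivals = [("in0", "A")] * 20 + [("in0", "B")] * 5

egress_lower = simulate(arrivals, {"A"}, pg_limit=100, queue_limit=10)
ingress_lower = simulate(arrivals, {"A"}, pg_limit=10, queue_limit=100)
print("egress threshold lower: ", dict(egress_lower[0]), dict(egress_lower[1]))
print("ingress threshold lower:", dict(ingress_lower[0]), dict(ingress_lower[1]))
```

    With the egress limit lower, only traffic to the congested port A is dropped and port B is unaffected. With the ingress limit lower, the priority group fills up and traffic to the idle port B is dropped as well, which is the head-of-line blocking the text warns about.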


    Dedicated and Shared Buffers

    Buffers are divided into two primary parts from both the ingress and egress perspectives:

    ·      Shared buffers are a global memory pool that the switch allocates dynamically to ports as needed, so the buffers are shared among the switch ports. Depending on the platform, 80 to 85% of the total buffer is allocated to the shared part. Shared buffers are further divided internally among different service pools.

    ·      Dedicated buffers are a memory pool statically divided among the switch ports. Each port receives a minimum guaranteed amount of buffer space that is not shared with other ports. Dedicated buffers make up 15 to 20% of the total buffer, depending on the platform.

    Dedicated buffers are statically allocated based on port speed (higher speed = more buffers) and are allocated even when the port is down. The default dedicated buffer allocation can be increased but not reduced. During congestion, dedicated buffers are used first; once they are exhausted, congested ports start using the dynamic shared buffer. Port-level dedicated buffers are divided among queues according to the egress scheduler "buffer-size" configuration. When the per-queue dedicated buffer is calculated from "buffer-size", any fractional value is rounded down to the nearest integer before it is programmed in hardware, to avoid over-allocation. Any port dedicated buffers left unused after allocation to all configured queues are given to the first configured queue.

    When a single queue is congested, it can use all of its dedicated buffers plus 50% of the shared buffer allocated to the service pool to which the queue is mapped. The remaining 50% is reserved for other queues. If two queues are congested, each queue can use its dedicated buffers plus 33% of the shared buffer.
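
    The per-queue split described above (proportional share, rounded down, leftovers to the first configured queue) can be sketched as follows. The function name, the queue names, and the 250-cell port allocation are hypothetical, and "first configured queue" is assumed here to mean the first entry in the mapping; the real ordering on the switch may be defined differently.

```python
def split_dedicated(port_cells: int, buffer_size_pct: dict[str, int]) -> dict[str, int]:
    """Divide a port's dedicated cells among queues by "buffer-size" percent."""
    if not buffer_size_pct:
        return {}
    # Integer floor division mirrors "round down to avoid over-allocation".
    alloc = {q: port_cells * pct // 100 for q, pct in buffer_size_pct.items()}
    # Whatever rounding (or a total below 100%) leaves over goes to the first queue.
    leftover = port_cells - sum(alloc.values())
    first_queue = next(iter(buffer_size_pct))
    alloc[first_queue] += leftover
    return alloc

# Hypothetical port with 250 dedicated cells and three configured queues.
print(split_dedicated(250, {"q0": 33, "q3": 33, "q7": 34}))
```

    In this example each 33% share is 82.5 cells and is rounded down to 82, so one cell is left over and lands on q0.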


    Example: Consider a QFX5120 switch with 32 MB of packet buffer (27 MB shared + 5 MB dedicated), and assume that all 27 MB of shared buffer is allocated to the lossy service pool. A single congested lossy queue on a 25G interface can use 52 KB of dedicated buffer + 13.5 MB of shared buffer. That is about 108 Mbit of data, which corresponds to a buffering time of roughly 4.3 milliseconds at 25 Gbps.

    The maximum shared buffer that a congested priority group (on ingress) or queue (on egress) can use is controlled by a dynamic alpha value. Alpha ensures fairness among the entities competing for the shared buffer. The maximum number of cells that a congested PG or queue can take from the shared pool is calculated as follows:
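
    The arithmetic in this example can be checked with a few lines. The sketch assumes decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes) and defines buffering time as the time needed to drain the full buffer at line rate. Note that the byte count has to be converted to bits before dividing by the 25 Gbps rate; dividing bytes directly by the bit rate would understate the result by a factor of eight.

```python
MB = 1_000_000  # assumption: decimal megabytes
KB = 1_000      # assumption: decimal kilobytes

def buffering_time_seconds(buffer_bytes: int, line_rate_bps: int) -> float:
    """Time to drain buffer_bytes at line_rate_bps (bits per second)."""
    return buffer_bytes * 8 / line_rate_bps

shared_pool_bytes = 27 * MB
dedicated_bytes = 52 * KB
shared_share_bytes = shared_pool_bytes // 2   # one congested queue: 50% of the pool

total_bytes = dedicated_bytes + shared_share_bytes
drain = buffering_time_seconds(total_bytes, 25_000_000_000)
print(f"{total_bytes} bytes buffered, about {drain * 1e3:.2f} ms at 25 Gbps")
```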

     Peak shared buffer = (shared pool size * alpha) / (1 + (alpha * number of competing queues))
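
    A direct transcription of this formula is shown below as a minimal sketch. The function name is hypothetical, alpha is the hardware alpha value (a plain ratio such as 1/2, 1, or 8), and the number of competing queues is assumed to include the queue being evaluated. With a hardware alpha of 1, the formula reproduces the 50% (one congested queue) and 33% (two congested queues) figures quoted earlier.

```python
def peak_shared(pool_size: float, alpha: float, competing: int) -> float:
    """Peak shared buffer one congested PG/queue can take from its pool."""
    if competing < 1:
        raise ValueError("competing must count at least the queue itself")
    return pool_size * alpha / (1 + alpha * competing)

pool = 27.0  # MB, the shared pool from the QFX5120 example
for alpha in (0.5, 1, 8):
    for n in (1, 2, 4):
        share = peak_shared(pool, alpha, n)
        print(f"alpha={alpha:<4} queues={n}  per-queue peak = {share:5.2f} MB "
              f"({100 * share / pool:.0f}% of pool)")
```

    A larger alpha lets a single congested queue take more of the pool, while a smaller alpha keeps more of the pool in reserve for queues that become congested later.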


    Default Alpha values:  

           Ingress lossy Alpha: 10      Ingress lossless Alpha: 7         Ingress alpha for Multicast traffic  : 10

           Egress lossy Alpha: 7         Egress lossless Alpha: 10         Egress Multicast Alpha  :  7 


    SW alpha values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] map to HW alpha values [1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8]
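
    The mapping above is a power-of-two scale centered on SW alpha 7, so it can be written either as a lookup table or as 2^(SW alpha - 7). The sketch below uses exact fractions; the names are hypothetical, and the default values are copied from the list above.

```python
from fractions import Fraction

# Index = SW alpha, value = HW alpha, as listed in the post.
HW_ALPHA = [Fraction(1, 128), Fraction(1, 64), Fraction(1, 32), Fraction(1, 16),
            Fraction(1, 8), Fraction(1, 4), Fraction(1, 2),
            Fraction(1), Fraction(2), Fraction(4), Fraction(8)]

def hw_alpha(sw_alpha: int) -> Fraction:
    """Translate a configured SW alpha (0..10) into the HW ratio."""
    if not 0 <= sw_alpha <= 10:
        raise ValueError("SW alpha must be between 0 and 10")
    return HW_ALPHA[sw_alpha]

DEFAULT_SW_ALPHA = {
    "ingress lossy": 10, "ingress lossless": 7, "ingress multicast": 10,
    "egress lossy": 7, "egress lossless": 10, "egress multicast": 7,
}

for name, sw in DEFAULT_SW_ALPHA.items():
    print(f"{name:18} SW alpha {sw:2} -> HW alpha {hw_alpha(sw)}")
```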

    Example: Max shared buffer for a lossy queue, with alpha 9 (all 100% shared buffer given to lossy):