Switching

  • 1.  QFX5K Packet Buffer Architecture [Tech post]

    Posted 07-19-2023 16:41
    Edited by spuluka 10-10-2023 05:39

    Overview

    On QFX5K platforms, every packet that enters the ingress pipeline is stored in the central MMU packet buffer before it egresses. Packet buffers are required for burst absorption, packet replication, and several other use cases. Without packet buffers, bursty traffic would be tail dropped, leading to poor network performance. An egress port can receive an excess burst in the following scenarios:

    ·      Multiple ingress ports sending traffic to a single egress port.

    ·      Traffic from high-speed ingress ports to low-speed egress ports.

    Packet buffers are also essential for implementing congestion management features such as Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and WRED.

    QFX5K data center switches have a shallow on-chip buffer. These switches use Broadcom's Trident and Tomahawk series ASICs. The shallow on-chip buffer provides low-latency packet forwarding, so managing this limited packet buffer among congested queues and providing fair buffer access is very important. This document explains the QFX5K buffer architecture in detail.

    Packet Buffers

    On QFX5K platforms, on-chip buffers are measured and allocated in units of cells.

    Cell size varies across platforms (see the table below). A packet is stored in one or more cells, depending on its size. The first cell stores 64 bytes of packet metadata (except on QFX5220, where the metadata is only 40 bytes), and the remaining space stores packet data. Subsequent cells store only packet data.

    Platforms                           | Cell size (bytes)
    QFX5100, QFX5110, QFX5200, QFX5210  | 208
    QFX5120                             | 256
    QFX5220, QFX5130                    | 254
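
    A minimal Python sketch of the cell accounting described above, assuming only the rules stated here (64 bytes of metadata in the first cell, 40 bytes on QFX5220); it is illustrative, not a model of the actual hardware:

        import math

        def cells_for_packet(packet_bytes, cell_size, metadata_bytes=64):
            """Return the number of MMU cells a single packet consumes."""
            first_cell_payload = cell_size - metadata_bytes   # packet data that fits in cell 1
            if packet_bytes <= first_cell_payload:
                return 1
            remaining = packet_bytes - first_cell_payload
            return 1 + math.ceil(remaining / cell_size)       # subsequent cells hold only packet data

        # Example: a 1500-byte packet on QFX5120 (256-byte cells)
        print(cells_for_packet(1500, 256))        # 7 cells
        # The same packet on QFX5220 (254-byte cells, 40-byte metadata)
        print(cells_for_packet(1500, 254, 40))    # 7 cells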

     

    Packets from an ingress port are placed in the MMU buffer before being scheduled by the egress port. With no congestion on the egress queues, packets are dequeued from the MMU as soon as they arrive, so at most ~3 KB of buffer utilization is seen per port, depending on port load. Under congestion, packets continue to be placed in MMU buffers until the buffers are exhausted.

    Concept of Ingress and Egress buffers

    Buffer memory has separate ingress and egress accounting to make accept, drop, or pause decisions. Because the switch has a single pool of memory with separate ingress and egress accounting, the full amount of buffer memory is available from both the ingress and the egress perspective. Packets are accounted for as they enter and leave the switch, but there is no concept of a packet arriving at an ingress buffer and then being moved to an egress buffer. 

    The MMU tracks cell utilization from multiple perspectives to support multiple buffering goals. These goals typically include asserting flow control to avoid packet loss, tail dropping when a resource becomes congested, or randomly dropping to allow stateful protocols to degrade gracefully. The multiple perspectives here are primarily the ingress and egress views, also referred to as ingress admission control and egress admission control.

    Buffers are accounted for by both ingress admission control and egress admission control against different entities: priority group (ingress), queue (egress), service pool, and port. Each of these entities has static and dynamic thresholds for buffer utilization. Once any ingress or egress buffer threshold is reached, further packets are dropped; packets already stored up to the threshold still egress.

    The desired behavior for lossy traffic is tail drops on the egress queues during congestion, because ingress drops cause head-of-line blocking, which also impacts uncongested ports. When the switch operates in store-and-forward mode, if the ingress and egress buffer thresholds are not configured properly, it is not possible to control where packet drops occur (ingress or egress). By configuring the egress buffer threshold lower than the ingress buffer threshold for lossy traffic, we ensure drops happen at egress.
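
    A toy Python sketch of this idea, assuming drops can be steered to egress simply by keeping the egress threshold below the ingress threshold; the byte counters and threshold values are made up for illustration:

        def admit(ingress_used, egress_used, packet_bytes,
                  ingress_threshold, egress_threshold):
            """Accept a packet only if both accounting views stay under their thresholds."""
            if egress_used + packet_bytes > egress_threshold:
                return "drop at egress (tail drop on the congested queue)"
            if ingress_used + packet_bytes > ingress_threshold:
                return "drop at ingress (head-of-line blocking risk)"
            return "accept"

        # With the egress threshold set lower than the ingress threshold,
        # a congested queue hits the egress limit first.
        print(admit(ingress_used=900_000, egress_used=900_000, packet_bytes=1500,
                    ingress_threshold=1_000_000, egress_threshold=900_000))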

     

    Dedicated and Shared Buffers

    Buffers are divided into two primary parts, from both the ingress and egress perspective:

    ·      Shared buffers are a global memory pool that the switch allocates dynamically to ports as needed, so the buffers are shared among the switch ports. Depending on the platform, 80-85% of the total buffer is allocated to the shared pool. The shared buffer is further divided among different service pools.

    ·      Dedicated buffers are a memory pool divided among the switch ports. Each port receives a minimum guaranteed amount of buffer space that is dedicated to that port and not shared with other ports (15-20% of the total buffer, depending on the platform).

    Dedicated buffers are statically allocated based on port speed (higher speed = more buffers), and they are allocated even when a port is down. The default dedicated buffer allocation can only be increased, not reduced. During congestion, dedicated buffers are used first; once they are exhausted, congested ports start using the dynamic shared buffer. Port-level dedicated buffers are divided among queues based on the egress scheduler "buffer-size" configuration. When the per-queue dedicated buffer is calculated from the scheduler "buffer-size" configuration, any fractional value is rounded down to the nearest whole cell before being programmed in hardware, to avoid over-allocation. Any port dedicated buffer left over after allocating dedicated buffers to all configured queues is assigned to the first configured queue, as sketched below.
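
    A small Python sketch of that per-queue split, under the rules stated above (round down, leftover to the first configured queue); the cell count and percentages are illustrative only:

        def split_dedicated(port_dedicated_cells, buffer_size_percent):
            """buffer_size_percent: {queue: percent} from the scheduler 'buffer-size' config."""
            alloc = {}
            for q, pct in buffer_size_percent.items():
                alloc[q] = int(port_dedicated_cells * pct / 100)   # round down to avoid over-allocation
            leftover = port_dedicated_cells - sum(alloc.values())
            first_q = next(iter(buffer_size_percent))              # unused cells go to the first configured queue
            alloc[first_q] += leftover
            return alloc

        # Example: 250 dedicated cells split 35/35/30 across three queues
        print(split_dedicated(250, {0: 35, 3: 35, 4: 30}))   # {0: 88, 3: 87, 4: 75}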

    When a single queue is congested, it can use all of its dedicated buffer plus 50% of the shared buffer allocated to the service pool to which the queue is mapped; the remaining 50% is reserved for other queues. If two queues are congested, each queue can use its dedicated buffer plus 33% of the shared buffer.

     

    Example – consider a QFX5120 switch with 32 MB of packet buffer (27 MB shared + 5 MB dedicated), and assume all 27 MB of shared buffer is allocated to the lossy service pool. A single congested lossy queue on a 25G interface can use 52 KB of dedicated buffer + 13.5 MB of shared buffer (a buffering time of 542 microseconds).

    The maximum shared buffer utilization by a congested priority group (ingress) or queue (egress) is controlled by the dynamic alpha value. Alpha is used to ensure fairness among entities competing for the shared buffer. The maximum number of cells a congested PG/queue can use from the shared pool is calculated as follows:

     Peak shared buffer = (shared pool size * alpha) / (1 + (alpha * number of competing queues))

     

    Default Alpha values:  

           Ingress lossy alpha: 10       Ingress lossless alpha: 7        Ingress multicast alpha: 10

           Egress lossy alpha: 7         Egress lossless alpha: 10        Egress multicast alpha: 7

     

    SW alpha values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] map to HW alpha values [1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8].

    Example: maximum shared buffer for a lossy queue with SW alpha 9 (HW alpha 4), with 100% of the shared buffer given to the lossy pool:

    max = (27 MB * 4) / (1 + (4*1)) = 21.6 MB
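
    A short Python sketch of this alpha-based calculation, using the SW-to-HW alpha mapping and the formula above; it reproduces the two worked numbers in this section:

        HW_ALPHA = [1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8]   # index = SW alpha 0..10

        def peak_shared_buffer(shared_pool_mb, sw_alpha, competing_queues):
            alpha = HW_ALPHA[sw_alpha]
            return (shared_pool_mb * alpha) / (1 + alpha * competing_queues)

        # SW alpha 9 (HW alpha 4), one congested queue, 27 MB lossy pool -> 21.6 MB
        print(peak_shared_buffer(27, 9, 1))
        # Default egress lossy SW alpha 7 (HW alpha 1) -> 13.5 MB, the 50% figure used earlier
        print(peak_shared_buffer(27, 7, 1))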

    Trade-off between shared and dedicated buffer

    The trade-off between shared buffer space and dedicated buffer space is:

    • Shared buffers provide better absorption of traffic bursts because there is a larger pool of dynamic buffers that ports can use as needed to handle the bursts. However, all flows that exhaust their dedicated buffer space compete for the shared buffer pool. A larger shared buffer pool means a smaller dedicated buffer pool, and therefore more competition for the shared buffer pool because more flows exhaust their dedicated buffer allocation. With too much shared buffer space, no single flow receives very much of it, because fairness must be maintained when many flows contend for that space.
    • Dedicated buffers provide guaranteed buffer space to each port. The larger the dedicated buffer pool, the less likely that congestion on one port affects traffic on another port, because the traffic does not need to use as much shared buffer space. However, less shared buffer space means less ability to dynamically absorb traffic bursts.

    For optimal burst absorption, the switch needs enough dedicated buffer space to avoid persistent competition for the shared buffer space. When fewer flows compete for the shared buffers, the flows that need shared buffer space to absorb bursts receive more of the shared buffer because fewer flows exhaust their dedicated buffer space.

    The default configuration and the configurations recommended for different traffic scenarios allocate 100 percent of the user-configurable memory space to the global shared buffer pool because the amount of space reserved for dedicated buffers provides enough space to avoid persistent competition for dynamic shared buffers. This results in fewer flows competing for the shared buffers, so the competing flows receive more of the buffer space.

    Service pools 

    Service pools provide isolation between different traffic classes (lossy, lossless, and multicast) by splitting the total available shared buffer per service pool. On the ingress side, traffic is mapped to priority groups (PGs). Each ingress port has 8 PGs, which are mapped to ingress service pools: 6 lossless PGs (0-5) and 2 lossy PGs (6 and 7).

    By default, all priorities are mapped to the lossy service pool. With PFC configured, each PFC-enabled priority is mapped to one of the lossless PGs.

    When a lossless PG is congested, it first uses its PG-min (dedicated buffer) and then shared buffer from the lossless service pool. Once the maximum threshold is reached, PFC is triggered for the mapped priority. Until the peer device reduces its traffic, in-flight packets use the headroom buffer allocated to the PG; these headroom buffers come from the headroom service pool. If the allocated headroom buffer is not enough to hold the in-flight packets, ingress drops occur and are counted as resource errors. When a lossy PG is congested, it uses shared buffer from the lossy service pool.

    On the egress side, traffic is mapped to egress queues based on the ingress traffic classification, and queues are mapped to service pools. By default, Q3 and Q4 are configured as lossless queues and mapped to the lossless service pool; the other unicast queues are mapped to the lossy service pool, and multicast queues are mapped to the multicast service pool (QFX5220 and QFX5130 platforms do not have a multicast service pool). A queue can be configured as lossless using the "no-loss" keyword; a maximum of 6 no-loss queues is supported. When a queue is congested, it first uses its Q-min (dedicated buffer) and then shared buffer from the corresponding service pool. Once the maximum threshold is reached, packets are tail dropped. Port-level dedicated buffers can be flexibly assigned to different queues using the "buffer-size" configuration under the scheduler hierarchy. The default mapping is sketched below.
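
    A minimal Python sketch of the default egress mappings described above; only the unicast queue defaults (Q3/Q4 lossless) and the multicast service pool rule are modeled, everything else is omitted:

        NO_MCAST_POOL = {"QFX5220", "QFX5130"}   # platforms without a multicast service pool

        def default_unicast_pool(queue):
            """Default egress service pool for unicast queues 0-7 (Q3/Q4 are lossless)."""
            return "lossless" if queue in (3, 4) else "lossy"

        def multicast_pool(platform):
            # On QFX5220/QFX5130, multicast shares the lossy pool; elsewhere it has its own pool.
            return "lossy" if platform in NO_MCAST_POOL else "multicast"

        print(default_unicast_pool(3), default_unicast_pool(0))      # lossless lossy
        print(multicast_pool("QFX5220"), multicast_pool("QFX5120"))  # lossy multicast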

    Packet pipeline based Buffering models

    On QFX5100, QFX5110, and QFX5120 platforms, buffer access is global: every port on these switches, regardless of which pipeline it belongs to, has access to the complete buffer. For example, QFX5100 has a 12 MB buffer that is global to all ports.

    QFX5200 and QFX5210 platforms have buffer access per cross-point (XPE). These platforms have 4 cross-points, so the total buffer is divided equally among them. For example, QFX5200 has 16 MB of total buffer, divided into 4 MB per cross-point. The cross-point for a packet is decided by its ingress and egress pipelines; if that packet experiences congestion, it can only use that cross-point's buffer. On these platforms, effective buffer utilization is achieved by spreading ports equally across all 4 pipelines. The following table gives the combinations for cross-point selection (see the sketch after the table).

    Ingress pipelines | Egress pipelines | XPE
    0 & 3             | 0 & 1            | XPE0
    0 & 3             | 2 & 3            | XPE1
    1 & 2             | 0 & 1            | XPE2
    1 & 2             | 2 & 3            | XPE3
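
    A Python sketch of the cross-point lookup in the table above (QFX5200/QFX5210); pipeline and XPE numbering are exactly as listed in the table:

        def xpe_for(ingress_pipe, egress_pipe):
            """Select the cross-point (XPE) from the ingress/egress pipeline pair."""
            ingress_group = 0 if ingress_pipe in (0, 3) else 1     # pipes 0 & 3 vs 1 & 2
            egress_group = 0 if egress_pipe in (0, 1) else 1       # pipes 0 & 1 vs 2 & 3
            return ingress_group * 2 + egress_group                # 0..3 -> XPE0..XPE3 per the table

        print(xpe_for(0, 1))   # 0 -> XPE0
        print(xpe_for(2, 3))   # 3 -> XPE3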

    QFX5220 and QFX5130 platforms have buffer access per ITM (ingress traffic manager). These platforms have 2 ITMs, ITM0 and ITM1: one set of 4 pipelines (0, 1, 6, and 7) is mapped to ITM0, and the other set (2, 3, 4, and 5) is mapped to ITM1. The total available buffer is divided equally between the two ITMs; for example, QFX5220 has a 64 MB buffer, so each ITM gets 32 MB. In this ITM-based architecture, the ITM (and therefore the buffer pool) is selected by the ingress pipeline on which a packet arrives. On these platforms, if an egress port is congested by two ingress ports that belong to different ITMs, we see maximum buffer utilization for that queue because buffers are used in both ITMs. These platforms also follow a uniform buffer model, where the same amount of shared buffer must be allocated to the ingress and egress partitions of a service pool: the ingress lossy buffer percentage must equal the egress lossy buffer percentage, and the ingress lossless percentage must equal the egress lossless percentage. The buffer allocated to the ingress headroom service pool is also reserved at egress, even though there is no headroom buffer accounting at egress. The dynamic threshold value (alpha) is used to achieve lossy and lossless behavior on these platforms.
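
    And a corresponding Python sketch of the pipe-to-ITM mapping on QFX5220/QFX5130, using the pipe sets and the per-ITM buffer size from the QFX5220 example above:

        ITM0_PIPES = {0, 1, 6, 7}
        ITM1_PIPES = {2, 3, 4, 5}

        def itm_for(ingress_pipe):
            """The buffer pool (ITM) is chosen purely by the ingress pipeline."""
            return 0 if ingress_pipe in ITM0_PIPES else 1

        # Two ingress ports on pipes 1 and 4 congesting one egress port
        # draw from ITM0 and ITM1 respectively (32 MB each on QFX5220).
        print(itm_for(1), itm_for(4))   # 0 1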

    By default, the egress lossy alpha (7) is configured lower than the ingress lossy alpha (10), and the ingress lossless alpha (7) is configured lower than the egress lossless alpha (10).

    Store and Forward & Cut-through mode of operation

    QFX5K switches operate by default in the store-and-forward (SF) MMU mode. In this mode the entire packet must be received in the MMU before it is scheduled, so the buffering duration varies with the source port speed and packet length. Most applications use SF mode. For applications that require low latency, cut-through (CT) mode can be configured; it provides lower latency than SF mode. In CT mode, packet transmission can start as soon as the first cell of a packet is received in the MMU, without waiting for the complete packet. The ingress and egress admission control checks explained in the sections above are not performed for CT-eligible traffic, and CT traffic also skips the scheduling block and goes directly to the egress pipeline. A CT eligibility check is performed for each packet based on various factors such as source port speed, destination port speed, type of traffic, and port type. If a packet fails CT eligibility, it falls back to SF mode.
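
    A deliberately simplified Python sketch of the fallback behavior only. The real eligibility check involves several platform-specific factors (the text above lists source/destination port speed, traffic type, and port type); the single speed comparison used here is an assumption for illustration, not the actual rule set:

        def forwarding_mode(src_speed_gbps, dst_speed_gbps, cut_through_enabled=True):
            """Illustrative only: assume CT needs the source port to be at least as fast
            as the destination port; anything that fails eligibility falls back to SF."""
            if cut_through_enabled and src_speed_gbps >= dst_speed_gbps:
                return "cut-through"
            return "store-and-forward"

        print(forwarding_mode(100, 100))   # cut-through
        print(forwarding_mode(25, 100))    # store-and-forward (fails eligibility, falls back)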

    QFX5K default shared buffer values 

    All buffer values are in KB, except the alpha rows and the MCQE row (MCQE is in cells).

     

    Parameter                | QFX5100 | QFX5110 | QFX5120 | QFX5130 | QFX5200 | QFX5210 | QFX5220
    Total                    |   12480 |   16640 |   32256 |  134792 |   16640 |   43264 |   65536
    Ingress dedicated        |    2912 |    4860 |    3928 |    7868 |    4860 |   14040 |    7868
    Ingress shared           |    9567 |   11779 |   28328 |  111263 |   11779 |   29224 |   44420
    Ingress lossy buffer     |    4400 |    5418 |   13030 |   77884 |    5418 |   13443 |   31094
    Ingress lossy alpha      |      10 |      10 |      10 |      10 |      10 |      10 |      10
    Ingress lossless buffer  |     861 |    1060 |    2549 |   22252 |    1060 |    2630 |    8884
    Ingress lossless alpha   |       7 |       7 |       7 |       7 |       7 |       7 |       7
    Ingress headroom buffer  |    4305 |    5300 |   12747 |   11126 |    5300 |   13150 |    4442
    Egress dedicated         |    3744 |    5408 |    4860 |   12739 |    5408 |   15184 |   12739
    Egress shared            |    8736 |   11232 |   27396 |  111263 |   11232 |   28080 |   44420
    Egress lossy buffer      |    2708 |    3481 |    8492 |   77884 |    3481 |    8704 |   31094
    Egress lossy alpha       |       7 |       7 |       7 |       7 |       7 |       7 |       7
    Egress lossless buffer   |    4368 |    5616 |   13698 |   22252 |    5616 |   14040 |    8884
    Egress lossless alpha    |      10 |      10 |      10 |      10 |      10 |      10 |      10
    Egress multicast buffer  |    1659 |    2134 |    5205 |      NA |    2134 |    5335 |      NA
    Egress multicast alpha   |       7 |       7 |       7 |      NA |       7 |       7 |      NA
    Multicast MCQE (cells)   |   49152 |   49152 |   24576 |   40960 |   32768 |   32768 |   27200

    Multicast Buffering

    On QFX5100/QFX5110/QFX5120/QFX5200/QFX5210 platforms, multicast traffic uses a separate multicast service pool at egress. In the ingress direction it uses the lossy service pool when no Ethernet pause / PFC is configured; with Ethernet pause / PFC configured, multicast traffic uses the lossless service pool at ingress. QFX5220/QFX5130 platforms have no separate service pool for multicast traffic, so it uses lossy service pool buffers at both ingress and egress. On all QFX5K platforms, multicast traffic has two different threshold criteria at egress: one based on cells (DB) and another based on entries (MCQE). Each replicated copy of a multicast packet uses a separate MCQE entry, whereas DB cells are allocated only for a single copy of the packet. For small packets with a high replication count, MCQE exhausts before DB; for large packets with a small replication count, DB cells exhaust before MCQE. The sketch below illustrates the two limits.
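
    A small Python sketch of the two egress accounting limits for multicast, following the rules above (each replicated copy consumes an MCQE entry, while DB cells are charged for a single copy; per-cell metadata is ignored for simplicity). The MCQE limit of 24576 is the QFX5120 value from the table; the other inputs are illustrative:

        import math

        def first_limit_hit(pkt_bytes, copies, buffered_pkts, cell_size, mcqe_limit, db_cell_limit):
            """Return which resource a set of buffered multicast packets exhausts first."""
            mcqe_per_pkt = copies                                   # one entry per replicated copy
            db_cells_per_pkt = math.ceil(pkt_bytes / cell_size)     # cells charged once per packet
            mcqe_used = buffered_pkts * mcqe_per_pkt
            db_used = buffered_pkts * db_cells_per_pkt
            if mcqe_used / mcqe_limit >= db_used / db_cell_limit:
                return "MCQE exhausts first"
            return "DB cells exhaust first"

        # Small packets, many copies -> MCQE runs out first
        print(first_limit_hit(128, 64, 1000, 256, 24576, 110000))
        # Large packets, few copies -> DB cells run out first
        print(first_limit_hit(9000, 2, 1000, 256, 24576, 110000))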

    Buffer monitoring

    QFX5K switches support queue depth monitoring (peak buffer occupancy) via the Jvision infrastructure. Queue depth data can be exported via native UDP or gRPC. The minimum export interval for Jvision is 2 seconds, so the peak buffer occupancy within each 2-second interval is exported. Only non-zero values are exported.

    QFX5K also supports buffer monitoring (peak buffer occupancy) via the CLI. The peak buffer used by a congested PG (ingress) or a congested queue (egress) can be monitored from the CLI.

    Buffer monitoring is enabled with the CLI command "set chassis fpc 0 traffic-manager buffer-monitor-enable".

    Dynamic buffer utilization can be displayed with the CLI commands below:

               show interfaces priority-group buffer-occupancy <interface name>

               show interfaces queue buffer-occupancy <interface name>

    The above commands display the peak buffer occupancy for the interval between their previous execution and the current execution. Values are displayed in bytes.



    ------------------------------
    Parthipan TS
    ------------------------------



  • 2.  RE: QFX5K Packet Buffer Architecture [Tech post]

    Posted 11-17-2023 09:44

    " Example – Lets consider QFX5120 switch with 32MB packet buffers (27MB shared + 5MB dedicated).  Let's assume all 27MB shared buffer allocated to lossy service pool. Single congested lossy queue on 25G interface can use 52 KB of dedicated buffer + 13.5 MB of shared buffer (buffering time of 542 microseconds). "

    In the above example, shouldn't it be 5 MB of dedicated buffer + 13.5 MB of shared buffer?




  • 3.  RE: QFX5K Packet Buffer Architecture [Tech post]

    Posted 11-21-2023 08:29

    Hi , 

    5 MB is the total amount of dedicated buffer. This 5 MB is divided among all available ports, so an individual port gets only a portion of the total dedicated buffer. On QFX5120, a 25G port gets 52 KB of dedicated buffer and a 100G port gets 208 KB. That is why the calculation uses 52 KB instead of 5 MB.

    Thanks,

    Parthipan 



    ------------------------------
    Parthipan TS
    ------------------------------