
Priority Flow Control (PFC) can be used in Ethernet fabrics to achieve lossless traffic—particularly important in AI/ML workloads and HPC—by pausing specific priority queues when congestion arises, avoiding costly retransmissions. The article details best practices for configuring PFC on Juniper QFX5K switches, handling buffer headroom, DSCP-based PFC, and mechanisms to detect and recover from PFC deadlocks.
Introduction
Ethernet is becoming the de facto standard for network infrastructure on AI/ML and HPC deployments. In AI/ML scenarios, where massive amounts of data are being transferred, retransmissions due to packet loss can significantly slow down model training or data processing. Lossless Ethernet minimizes this by preventing packet loss, ensuring smoother, more efficient data flow. This document aims to cover how Priority Flow Control (PFC) can help to achieve lossless Ethernet, PFC configuration best practices on Juniper’s QFX5K platforms, and finally discuss some of the drawbacks of PFC and how to solve those.
PFC (802.1Qbb) is a link-level flow control mechanism that helps to achieve a lossless Ethernet network. PFC applies flow control per priority, unlike Ethernet pause, which operates at the port level. When a PFC-enabled priority becomes congested on a switch's egress queue, the corresponding ingress port sends a PFC pause frame for that priority towards the peer.
Upon receiving a PFC pause frame, the peer device will slow down the traffic rate by pausing its egress queue. This can create back-pressure on the corresponding ingress port on the peer device and can generate PFC towards its peer. Thus, a chain of PFC-based flow control is achieved all the way from the congestion point to the traffic source. On the same switch, both lossless and lossy traffic can co-exist.
Note that a PFC frame generated by the congested switch is terminated at the peer switch itself; it is not forwarded on toward the traffic source.

Figure 1: PFC Network Level View
PFC Buffer Thresholds and PFC XOFF/XON Frames
Switches implement PFC functionality based on buffer thresholds, utilizing two key thresholds:
1. PFC XOFF Threshold: When the buffer utilization of a PFC-enabled priority exceeds this threshold, the switch interface generates a PFC XOFF frame to notify the peer to pause traffic for that priority. The XOFF frame carries a time quanta value, which specifies how long the peer should pause transmission.
- Maximum time quanta per PFC XOFF frame = 65,535
- 1 quantum = 512 bit times
- Maximum pause per PFC XOFF frame = 65,535 × 512 = 33,553,920 bit times
- Number of PFC XOFF frames required to keep an 800G port paused for one second = 800,000,000,000 ÷ 33,553,920 ≈ 23,842
2. PFC XON Threshold: When buffer utilization drops below this threshold, the switch generates a PFC XON frame, signaling the peer to resume transmission. In the XON frame, the time quanta value is set to 0.
A queue paused due to receiving a PFC XOFF frame can resume transmission under either of the following conditions:
- A PFC XON frame is received.
- The PFC XOFF time quanta expires.
In a typical PFC flow-control scenario, the peer link alternates between receiving PFC XOFF and XON frames to regulate traffic flow. However, in cases where a congested egress link cannot drain packets fast enough, the PFC XOFF timer may expire, prompting the switch to send a refresh PFC XOFF frame to the peer. This can lead to multiple consecutive PFC XOFF frames being sent before a PFC XON frame.
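The quanta arithmetic above can be sketched in a few lines. This is purely illustrative; the constants come from the 802.1Qbb quanta definition quoted earlier:

```python
# PFC pause-quanta arithmetic from the bullet points above (illustrative).
QUANTUM_BITS = 512      # one pause quantum = 512 bit times
MAX_QUANTA = 65_535     # largest quanta value one XOFF frame can carry

def pause_seconds_per_xoff(link_bps: int) -> float:
    """Real time one maxed-out XOFF frame pauses a link of the given speed."""
    return (MAX_QUANTA * QUANTUM_BITS) / link_bps

def xoff_frames_per_second(link_bps: int) -> float:
    """XOFF frames needed back-to-back to keep that link paused for 1 second."""
    return 1.0 / pause_seconds_per_xoff(link_bps)

print(round(xoff_frames_per_second(800_000_000_000)))  # ~23842 for an 800G port
```

Because the quanta are measured in bit times, the same XOFF frame pauses a faster link for less wall-clock time, which is why high-speed ports need a steady stream of refresh XOFF frames during sustained congestion.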
Figure 3: PFC XON Frame
Need for PFC Headroom
As discussed in the previous section, a PFC XOFF frame is generated when the buffer threshold exceeds the XOFF limit. Once generated, this PFC frame must travel across the wire, where the peer device detects the pause frame and responds by pausing the corresponding queue. This process introduces a time delay.
During this delay, any in-flight packets transmitted from the switch interface before the peer interface pauses its egress queue must be delivered without packet loss. To accommodate these in-flight packets, each PFC-enabled priority requires a dedicated headroom buffer. The required headroom buffer size depends on the PFC latency, which is defined as follows:
PFC latency = Local PFC transmission latency + Wire latency + Remote PFC reception latency
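As a rough illustration of how these latency terms translate into a buffer size, the sketch below converts a PFC latency budget into in-flight bytes. The latency inputs and the 2×MTU safety margin are placeholder assumptions for illustration, not QFX5K specifications:

```python
# Rough headroom estimate from the PFC latency formula above.
# The latency inputs and the 2*MTU margin are hypothetical assumptions.
def headroom_bytes(link_bps: float, local_tx_s: float, wire_s: float,
                   remote_rx_s: float, mtu_bytes: int) -> float:
    pfc_latency = local_tx_s + wire_s + remote_rx_s   # formula from the text
    in_flight = link_bps * pfc_latency / 8            # bytes on the wire meanwhile
    return in_flight + 2 * mtu_bytes                  # margin for frames mid-transmission

# e.g. a 400G link with a 1 us + 2 us + 1 us latency budget and 9216-byte MTU
print(headroom_bytes(400e9, 1e-6, 2e-6, 1e-6, 9216))
```

The key takeaway is the linear scaling: doubling either the link speed or the cable length (wire latency) roughly doubles the headroom a lossless priority needs.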
QFX5K Implementation of PFC
PFC TX Implementation (Local Congestion)
QFX5K platforms have shallow buffers. Around 30% of the buffers are dedicated, and the remaining 70% are shared. To implement PFC functionality effectively, QFX5K platforms maintain a separate lossless service pool, and traffic must be mapped to that pool to receive PFC treatment. This is achieved using the three configuration entities below.
1. Lossless Forwarding Class
2. Classifier
3. Congestion Notification Profile
Step 1:
Configure a forwarding class as lossless by associating it with one of the egress queues and specifying the “no-loss” keyword.
set class-of-service forwarding-classes class rdma queue-num 3 no-loss
Step 2:
Configure an IEEE 802.1p or DSCP classifier to map the intended lossless traffic into the no-loss forwarding class configured in Step 1. This classifier must be attached to all ingress ports where lossless traffic is expected.
set class-of-service classifiers ieee-802.1 rdma-cls forwarding-class rdma loss-priority low code-points 011
set class-of-service interfaces xe-0/0/30 unit 0 classifiers ieee-802.1 rdma-cls
Step 3:
Configure an input congestion notification profile (CNP) and specify the traffic priorities for which PFC must be enabled. This CNP must also be attached to all the ingress ports.
set class-of-service congestion-notification-profile cnp1 input ieee-802.1 code-point 011 pfc
set class-of-service interfaces et-0/0/30 congestion-notification-profile cnp1
Unless CNP is configured for a priority, traffic for that priority is treated as lossy and utilizes only the lossy service pool shared buffers.
For PFC to be enabled on a port:
- All PFC-enabled code points must be present in the classifier.
- These code points must be mapped to a no-loss forwarding class.
- If these conditions are not met, PFC will not be enabled on the port.
Ingress Priority Groups
- Each port supports 8 priority groups (PGs), ranging from 0 to 7.
- By default, all traffic is mapped to PG 7, which is lossy.
- PGs 0 to 5 are designated for lossless traffic.
- PG 6 is unused. No traffic will be mapped to PG 6.
- When PFC is configured for a priority, it is assigned to one of the available lossless PGs, which are mapped to the lossless service pool.
Egress Queues
- Each port has 12 egress queues (0 to 11) or 10 egress queues (0 to 9) based on the platform. All the platforms support 8 unicast queues (0 to 7). The remaining are multicast queues. PFC is supported only on unicast queues.
- Based on the classifier, lossless priority traffic is mapped to one of the lossless egress queues, which are also associated with the lossless service pool.
In Ingress, lossless traffic is mapped to lossless shared buffers as below.
Figure 4: Lossless Traffic Mapping in ingress
In Egress, lossless traffic is mapped to lossless shared buffers as below.
Figure 5: Lossless Traffic Mapping in egress
Since there are only six lossless Priority Groups (PGs) available at ingress, configuring PFC for more than six priorities means some priorities must share the same PG, leading to fate sharing.
For example, if PFC is enabled for eight priorities (0 to 7):
- Priorities 0 to 5 are mapped to PGs 0 to 5, respectively.
- Priority 6 is mapped to PG0, and Priority 7 is mapped to PG1.
- As a result, Priority 0 and Priority 6 share PG0, while Priority 1 and Priority 7 share PG1.
On QFX5K platforms, access to the shared buffer is managed by a factor called the Dynamic Threshold, which ensures fairness among competing entities.
The PFC XOFF threshold is determined by both the PG’s dynamic shared buffer and the PG’s dedicated buffer:
PFC XOFF threshold = PG dynamic shared buffer + PG dedicated buffer
The PG dynamic shared buffer is calculated as:
PG dynamic shared buffer = (lossless service pool size * dynamic threshold) / (1 + (dynamic threshold * number of competing PGs))
- The default dynamic threshold for the lossless service pool is 7.
- This can be increased up to 9 for greater shared buffer access, but at the cost of momentary unfairness for newly congested priorities.
- Setting the ingress lossless dynamic threshold (alpha) to 10 is not recommended, as it could cause egress drops before PFC is triggered.
- Dynamic threshold can be configured at the global service pool level via “class-of-service shared-buffer” hierarchy or individual priority group level via “class-of-service dynamic-threshold-profile” hierarchy.
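The two formulas above can be transcribed directly. The numbers below are illustrative, not taken from any specific platform:

```python
# Direct transcription of the PG shared-buffer and XOFF-threshold formulas above.
def pg_dynamic_shared(pool_size_kb: float, alpha: float, competing_pgs: int) -> float:
    """PG dynamic shared buffer = (pool * alpha) / (1 + alpha * competing PGs)."""
    return (pool_size_kb * alpha) / (1 + alpha * competing_pgs)

def pfc_xoff_threshold(pool_size_kb: float, alpha: float,
                       competing_pgs: int, pg_dedicated_kb: float) -> float:
    """PFC XOFF threshold = PG dynamic shared buffer + PG dedicated buffer."""
    return pg_dynamic_shared(pool_size_kb, alpha, competing_pgs) + pg_dedicated_kb

# A single congested PG with the default alpha of 7 can claim 7/8 of the pool:
print(pg_dynamic_shared(16000, 7, 1))  # 14000.0
```

Note how the denominator grows with the number of competing PGs: as more priorities congest simultaneously, each PG's share, and therefore its XOFF trigger point, shrinks automatically.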
PFC XOFF Triggering Process
When a lossless packet enters a switch, it is placed in a buffer and accounted for under both the ingress PG threshold and the egress queue threshold.
- If the egress queue becomes congested, the number of packets waiting in the buffer increases.
- Eventually, this can lead to the ingress PG threshold exceeding the PFC XOFF limit.
- At this point, a PFC XOFF frame is generated and sent to the peer to control the flow of traffic.
Figure 6: PFC XOFF/XON Triggers
Once a lossless packet is admitted by ingress admission control, it cannot be dropped by any buffer-related threshold in the system. For lossless packets, therefore, only ingress PG thresholds matter; egress queue thresholds make no difference.
During PFC configuration, for each PFC-enabled code point, the switch calculates the required headroom buffer and allocates it to the corresponding Priority Group (PG). The headroom calculation takes a few configurable parameters, such as cable length, MTU, and MRU, as well as internal parameters, such as interface speed. Following is the headroom buffer allocation for some of the combinations for a single PFC-enabled lossless priority. These are approximate values; they can vary slightly between platforms due to differences in MMU cell size.
These headroom buffers allocated during PFC configuration are carved out from the global lossless headroom buffer space. Apart from the headroom buffer, there is a small amount of dedicated buffer (~3.7KB) per lossless priority group, also carved out from the global lossless headroom buffer space.
If the required headroom buffer plus dedicated buffer is not available in the lossless headroom pool, headroom allocation fails for the PFC-enabled priorities. This can lead to in-flight packet drops on the ingress port, which are accounted as “Resource errors” under the interface counters (show interfaces extensive <>).
It is essential to estimate the overall headroom requirement, based on how many priorities and how many ports PFC will be enabled on, and to reserve that amount of buffer under the lossless headroom pool.
For example, if PFC needs to be enabled on a single priority across 10 X 800G ports with default MRU, MTU, and Cable length, then 458KB * 10 = 4580KB needs to be reserved for the headroom pool.
In certain scenarios, minor drops may be seen for lossless traffic, accounted as “Resource errors” on the ingress ports. This can indicate insufficient headroom buffers. In such a scenario, the headroom buffer for the given lossless priority can be increased by raising the default cable-length and MRU parameters under the congestion notification profile configuration.
On platforms that have 2 Ingress Traffic Managers (ITM), the global headroom pool buffer is equally divided between ITMs. Port-level headroom is carved out from the corresponding ITM headroom pool, where that port belongs.
PFC Rx Implementation (Remote Congestion)
To implement PFC Rx, the user must configure an output CNP by specifying the PFC priority and the queue that needs to be paused upon reception of those PFC priority packets. This queue is called flow-control-queue. The user can map multiple flow-control-queues with the same PFC priority. In that case, when the PFC frame is received for that priority, all the associated queues will be flow controlled.
This output CNP must be applied to all the egress interfaces where PFC frames are expected to be received. Upon receiving PFC frames, MAC will instruct the MMU block to stop scheduling the packets to the corresponding flow control queues. Flow control queues mapped in the output CNP must be lossless queues to cause ingress back pressure. Otherwise, we would see tail drops on the flow control queues, which is not expected for lossless.
Configuration
set class-of-service congestion-notification-profile cnp1 output ieee-802.1 code-point 011 flow-control-queue 3
set class-of-service interfaces et-0/0/31 congestion-notification-profile cnp1
Note that under the same CNP, a user can specify both input and output PFC priorities.
The MAC identifies a PFC frame based on Ethertype 0x8808, opcode 0x0101, and destination MAC 01:80:C2:00:00:01. All of these conditions must be met for a packet to be detected as a valid PFC pause frame. After notifying the MMU to pause the queue, the MAC drops the PFC packet so that it does not reach the forwarding stage of the pipeline.
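These match conditions can be expressed as a short check. This is an illustrative software model of the logic, not how the MAC hardware is implemented:

```python
# Illustrative model of the PFC-frame match conditions described above.
PFC_DMAC = bytes.fromhex("0180C2000001")   # 01:80:C2:00:00:01
PFC_ETHERTYPE = 0x8808                     # MAC control Ethertype
PFC_OPCODE = 0x0101                        # priority-based flow control opcode

def is_pfc_frame(frame: bytes) -> bool:
    """Check the three conditions a valid PFC pause frame must satisfy."""
    if len(frame) < 16:                    # too short to hold DMAC+SMAC+type+opcode
        return False
    dmac = frame[0:6]
    ethertype = int.from_bytes(frame[12:14], "big")
    opcode = int.from_bytes(frame[14:16], "big")
    return dmac == PFC_DMAC and ethertype == PFC_ETHERTYPE and opcode == PFC_OPCODE
```

A frame failing any single condition (for example, opcode 0x0001, which is ordinary link-level pause) is not treated as PFC.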
DSCP-based PFC
Protocols such as Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2) require lossless behavior for traffic across Layer 3 connections to Layer 2 Ethernet subnetworks.
Traditionally, PFC can be used to prevent traffic loss when congestion occurs on Layer 2 or Layer 3 interfaces for VLAN-tagged traffic by selectively pausing traffic on any of eight priorities corresponding to IEEE 802.1p code points in the VLAN headers of incoming traffic on an interface.
However, untagged traffic—traffic without VLAN tagging—cannot be examined for IEEE 802.1p code points on which to pause traffic.
To trigger PFC on a DSCP value, the DSCP value must be mapped explicitly in the configuration to a PFC priority to use in the PFC pause frames sent to the peer when congestion occurs for that code point. You can map traffic on a DSCP value to a PFC priority when you define the no-loss forwarding class with which you want to classify DSCP-based PFC traffic.
The forwarding class must also be mapped to an output queue with no-loss behavior. A DSCP classifier (instead of an IEEE 802.1p classifier) is also required to specify that incoming traffic with the above-configured DSCP value belongs to the no-loss forwarding class. Any DSCP values for which DSCP-based PFC is enabled on an interface must be specified in either the default DSCP classifier or in a user-defined DSCP classifier associated with the interface.
DSCP-based PFC must be applied only for untagged traffic; otherwise, behavior is undefined.
Sample Configuration
set interfaces xe-0/0/1 unit 0 family inet address 10.1.1.2/24
set class-of-service forwarding-classes class fc1 queue-num 1 no-loss
set class-of-service forwarding-classes class fc1 pfc-priority 3
set class-of-service congestion-notification-profile dpfc-cnp input dscp code-point 110000 pfc
set class-of-service classifiers dscp dpfc forwarding-class fc1 loss-priority low code-points 110000
set class-of-service interfaces xe-0/0/1 congestion-notification-profile dpfc-cnp
set class-of-service interfaces xe-0/0/1 classifiers dscp dpfc
Custom PFC XON Configuration
By default, on QFX5K platforms, the PFC XON threshold is calculated as:
PFC XON = PFC XOFF - 20 cells
(a cell is the basic unit of buffer memory)
Based on the user requirement, this XON offset value can be configured via the following CLI command
set class-of-service congestion-notification-profile cnp1 input ieee-802.1 code-point 011 pfc xon 100
Note: PFC XON and MRU values configured under the congestion notification profile are properties of the Ingress port Priority Group. Junos CLI doesn't have any equivalent construct for ingress port Priority Group, unlike the queue, which has a forwarding-class construct in CLI. Codepoint (priority) to Priority Group mapping happens in PFE based on CNP configuration using internal logic.
As there are only 6 lossless priority groups per ingress port, more than one IEEE 802.1p / DSCP code point can be mapped to the same PG (fate sharing). If code points mapped to the same PG are configured with different XON values, the XON value associated with the highest code point is programmed in hardware for that priority group.
This is because entries in the CNP profile are processed and programmed in ascending code-point order. If code points mapped to the same PG are configured with different MRU values, the MRU value associated with the lowest code point is used to calculate the PFC headroom for that PG.
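This resolution order can be modeled in a few lines. The data layout is illustrative; in reality this programming happens inside the PFE:

```python
# Model of the fate-sharing resolution described above: CNP entries are
# programmed in ascending code-point order, so among code points sharing a PG
# the highest code point's XON value wins (last write), while the lowest code
# point's MRU is used for headroom. The tuple layout here is illustrative.
def resolve_pg_params(entries):
    """entries: [(code_point, xon, mru), ...] all mapped to the same PG."""
    ordered = sorted(entries)       # ascending code-point programming order
    xon = ordered[-1][1]            # last write (highest code point) wins
    mru = ordered[0][2]             # lowest code point's MRU sizes headroom
    return xon, mru

print(resolve_pg_params([(0, 50, 1500), (6, 100, 9216)]))  # (100, 1500)
```

In the example, priorities 0 and 6 share a PG: the PG is programmed with priority 6's XON of 100 but its headroom is sized from priority 0's MRU of 1500.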
DCBX
The DCBX protocol is used to exchange PFC capability information between PFC-enabled peers. DCBX controls the operational state of PFC on each interface where PFC is enabled by configuration. DCBX ensures that the PFC feature is enabled on a port only if both the local switch and the peer device are capable of supporting PFC and are provisioned consistently.
DCBX is an optional configuration for PFC functionality.
Note: As of now, DCBX is supported only for IEEE 802.1p-based PFC. It is not supported for DSCP-based PFC.
PFC in EVPN VxLAN Fabric
In an EVPN VxLAN environment, if the ingress traffic is VLAN tagged, use an IEEE 802.1p-based classifier and PFC on the UNI interface at the ingress leaf. After packets are encapsulated with VxLAN headers, the 802.1p bits are not exposed on spine and egress leaf devices for classification, so spine and egress leaf devices must use a DSCP-based classifier and PFC.
Figure 7: PFC implementation in an EVPN-VXLAN Fabric
Note: If the incoming traffic is not VLAN tagged, use the DSCP classifier and the DSCP CNP in all places.
Sample Configurations
Ingress & Egress Leafs
set class-of-service classifiers dscp nni-cls forwarding-class fc3 loss-priority low code-points 011111
set class-of-service classifiers ieee-802.1 uni-cls forwarding-class fc3 loss-priority low code-points 011
set class-of-service forwarding-classes class fc3 queue-num 3
set class-of-service forwarding-classes class fc3 no-loss
set class-of-service forwarding-classes class fc3 pfc-priority 3
set class-of-service congestion-notification-profile dscp-cnp input dscp code-point 011111 pfc
set class-of-service congestion-notification-profile dscp-cnp output ieee-802.1 code-point 011 flow-control-queue 3
set class-of-service congestion-notification-profile ieee-cnp input ieee-802.1 code-point 011 pfc
set class-of-service congestion-notification-profile ieee-cnp output ieee-802.1 code-point 011 flow-control-queue 3
set class-of-service rewrite-rules dscp nni-rw forwarding-class fc3 loss-priority low code-point 011111
UNI (Access Port)
set class-of-service interfaces et-0/0/44 congestion-notification-profile ieee-cnp
set class-of-service interfaces et-0/0/44 unit 0 classifiers ieee-802.1 uni-cls
NNI (Network Port)
set class-of-service interfaces et-0/0/45 congestion-notification-profile dscp-cnp
set class-of-service interfaces et-0/0/45 unit 0 classifiers dscp nni-cls
set class-of-service interfaces et-0/0/45 unit 0 rewrite-rules dscp nni-rw
Spine
set class-of-service classifiers dscp nni-cls forwarding-class fc3 loss-priority low code-points 011111
set class-of-service forwarding-classes class fc3 queue-num 3
set class-of-service forwarding-classes class fc3 no-loss
set class-of-service forwarding-classes class fc3 pfc-priority 3
set class-of-service congestion-notification-profile dscp-cnp input dscp code-point 011111 pfc
set class-of-service congestion-notification-profile dscp-cnp output ieee-802.1 code-point 011 flow-control-queue 3
set class-of-service interfaces et-0/0/* unit 0 classifiers dscp nni-cls
set class-of-service interfaces et-0/0/* congestion-notification-profile dscp-cnp
PFC Deadlock Detection and Recovery
One of the side effects of PFC in the network is PFC deadlock. PFC deadlock is a network state in which congestion occurs on multiple switches simultaneously (due to a loop or other causes), the interface buffer usage of each switch exceeds its threshold, the switches wait for each other to release resources, and data flows on all switches are permanently blocked. The following are some common scenarios in which PFC deadlock can occur.
- A misbehaving NIC sends continuous PFC frames, which can propagate the congestion all the way back to the source and potentially halt the entire network.
- Cyclic Buffer Dependency, which can cause a PFC loop and eventual deadlock.
The picture below explains the scenario where a single misbehaving NIC halts the network.

Figure 8: A Faulty NIC Causing a PFC Deadlock
In an IP Clos architecture, this PFC deadlock can be seen during events such as a link flap or a device reboot, which cause ECMP to redistribute traffic onto different paths. Although a Clos network has no routing loops, a link failure can create a flow loop, producing a cyclic buffer dependency and, potentially, a PFC deadlock.
Cyclic Buffer Dependency occurs when multiple switches (or ports) create a loop of PFC dependencies, such that:
- Switch A pauses Switch B (because A’s buffer for a priority is full),
- Switch B pauses Switch C, and
- Switch C pauses Switch A.
Now, each switch is waiting for the other to resume transmission — no one can send packets anymore. This results in a PFC deadlock — a network-wide traffic stall.
Figure 9: Cyclic Buffer Dependency
QFX5K platforms solve the PFC deadlock issue through the PFC watchdog feature. This is a reactive mechanism to detect and recover from PFC deadlock.
Detection
The detection mechanism periodically monitors PFC-enabled queues to identify situations where the downstream device continuously asserts PFC pause frames, while the queue still holds packets waiting for transmission. If such a stall is observed, the queue is transitioned into a mitigation state.
The system must detect this stalled condition within a defined detection time window. In this implementation, detection time is determined by two configurable parameters:
- poll-interval: The frequency (in milliseconds) at which queues are polled.
- detection: The number of consecutive detection windows to consider before declaring a stall.
The total detection time is calculated as:
Detection Time = Poll-Interval × Detection
This defines how often the PFC Watchdog collects data for each PFC-enabled queue.
The poll interval can be configured as 1 ms, 10 ms, or 100 ms, while the detection count can range from 2 to 15. By combining these parameters, the PFC deadlock detection window can be adjusted from 2 ms to 1500 ms, allowing flexible tuning based on network performance and sensitivity requirements.
| Poll Interval (ms) | Min Detection Window (ms) | Max Detection Window (ms) |
|---|---|---|
| 1 | 2 | 15 |
| 10 | 20 | 150 |
| 100 | 200 | 1500 |
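The detection-window arithmetic reduces to a single multiplication over the allowed parameter ranges:

```python
# Detection-window arithmetic for the PFC watchdog parameters above.
VALID_POLL_MS = (1, 10, 100)

def detection_window_ms(poll_interval_ms: int, detection: int) -> int:
    """Detection Time = Poll-Interval x Detection, within the valid ranges."""
    assert poll_interval_ms in VALID_POLL_MS, "poll-interval must be 1, 10, or 100 ms"
    assert 2 <= detection <= 15, "detection count must be 2..15"
    return poll_interval_ms * detection

print(detection_window_ms(1, 2), detection_window_ms(100, 15))  # 2 1500
```

The extremes of the valid ranges give the 2 ms to 1500 ms tuning window quoted above.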
To detect PFC deadlock, hardware maintains a countdown timer that monitors the duration of the PFC XOFF state for each {egress port, PFC priority} pair; there is one timer counter per PFC priority on each port. The counter decrements while the associated {egress port, PFC priority} is in the PFC XOFF state, and is reset to the configured PFC deadlock detection timer when it is in the PFC XON state. If the timer decrements to zero, hardware generates an interrupt to indicate the deadlock condition.
This happens when any of the interrupt status bits becomes 1 and is not masked. Subsequent deadlock events are suppressed in hardware until software services the initial interrupt. This timer counter corresponds to the “detection” parameter in the configuration.
PFC XOFF and XON states are associated with the PFC priority level. A single PFC priority can be mapped to more than one egress queue on an egress port. If a PFC priority is mapped to two egress queues and a PFC storm is received on that priority, both egress queues will be in the PFC XOFF state. So when deadlock detection is enabled at the PFC priority level, it applies to both queues associated with that priority.
The PFC watchdog detection time has an error margin of one detection window due to hardware limitations. For example, when the poll-interval granularity is configured as 1 ms and the number of poll cycles ("detection") is 1, the watchdog may trigger anywhere between 0 ms and 1 ms; 0 ms is the worst case and 1 ms the best case. With detection configured as 2, the watchdog triggers between 1 ms and 2 ms.
The same behavior applies at 10 ms granularity (between 0 ms and 10 ms) and 100 ms granularity (between 0 ms and 100 ms) with a detection value of 1. Because of this error margin in the hardware timer design, false PFC watchdog detections are possible. The chance of false detection decreases with a higher detection value, such as 15, which is why the allowed detection configuration range is 2 to 15.
The recommended PFC watchdog detection timer is 200 ms, which is the default value. A very aggressive detection timer, in the range of 2 ms to 10 ms, might result in false PFC deadlock detection.
Mitigation
Once software receives a PFC deadlock detection interrupt from hardware, it places the particular {egress port, PFC priority} in the ignore-PFC-XOFF state in the MMU. While this state is active, the MMU scheduler ignores the PFC XOFF state and services all queues associated with that {egress port, PFC priority} as if they were eligible for service. If a queue is associated with multiple PFC priorities, all of those priorities must be in the ignore-PFC-XOFF state for the queue to be serviced.
This mitigation lasts for a finite duration known as the “recovery time”, configurable via the “recovery” knob under “pfc-watchdog”. Valid recovery timer values range from 100 ms to 1500 ms in steps of 100 ms (100, 200, 300, ..., 1500).
When the MMU places an {egress port, PFC priority} in the ignore-PFC-XOFF state, one of the following two actions is taken for the packets already queued on the deadlock-detected queue. This is configurable under the “watchdog-action” hierarchy.
Drop: This action will discard all the queued packets on the deadlock-detected queue during the recovery period. This will be the default action.
Forward: This action will forward all the queued packets on the deadlock-detected queue during the recovery period.
So, during the PFC watchdog recovery window, all PFC pause frames received on the watchdog-detected priority are ignored. Similarly, all data packets already queued or classified to the watchdog-detected queue are either dropped (watchdog action “drop”) or forwarded (watchdog action “forward”). These recovery actions apply only during the recovery window.
PFC deadlock detection and recovery events are logged in syslog by default.
Restoration
After the recovery time expires for a mitigated PFC queue, the queue returns to its nominal forwarding state. The ignore-PFC-XOFF state for the {egress port, PFC priority} is reset, and the scheduler again honors the PFC XOFF state when scheduling packets. The deadlock detection hardware countdown timer is reset and resumes monitoring for deadlock.
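The detect/mitigate/restore cycle can be summarized as a toy state machine. This is purely illustrative; the real logic lives in the MMU and PFE, and “ticks” here stand in for the recovery timer:

```python
# Toy model of the watchdog mitigate/restore cycle described above.
from enum import Enum

class WdState(Enum):
    MONITOR = "monitor"     # honoring PFC XOFF, watching for deadlock
    MITIGATE = "mitigate"   # ignore-PFC-XOFF active during recovery window

class PfcWatchdog:
    def __init__(self, recovery_ticks: int, action: str = "drop"):
        assert action in ("drop", "forward")   # the "watchdog-action" choices
        self.state = WdState.MONITOR
        self.recovery_ticks = recovery_ticks
        self.action = action
        self.ticks_left = 0

    def deadlock_interrupt(self):
        # HW interrupt: enter ignore-PFC-XOFF for the recovery window
        self.state = WdState.MITIGATE
        self.ticks_left = self.recovery_ticks

    def tick(self):
        # queued packets are dropped or forwarded (per self.action) while mitigating
        if self.state is WdState.MITIGATE:
            self.ticks_left -= 1
            if self.ticks_left == 0:           # restoration: honor XOFF again
                self.state = WdState.MONITOR
```

If another deadlock is detected after restoration, the cycle simply repeats, which is why a persistent root cause (such as a faulty NIC) still needs operator intervention.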
PFC Counters & Debug Commands
PFC Transmit and Receive counters are available at the interface level for each priority. PFC watchdog detection and recovery counters are also available as part of the same command.
Sample Output
root> show interfaces extensive et-0/0/0
<snip>
MAC Priority Flow Control Statistics:
Priority : 0 0 0
Priority : 1 0 0
Priority : 2 0 0
Priority : 3 0 0
Priority : 4 0 0
Priority : 5 0 0
Priority : 6 0 0
Priority : 7 0 0
Priority Flow Control Watchdog Statistics:
Queue Detected Recovered LastPacketDropCount TotalPacketDropCount
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
The CLI and CLI-PFE commands below can be used for more detailed PFC debugging.
Example Configuration
> show configuration | display set | match class-of | except groups
set class-of-service congestion-notification-profile cnp1 pfc-watchdog poll-interval 100
set class-of-service congestion-notification-profile cnp1 pfc-watchdog detection 10
set class-of-service congestion-notification-profile cnp1 pfc-watchdog recovery 1000
set class-of-service congestion-notification-profile cnp1 pfc-watchdog watchdog-action drop
set class-of-service congestion-notification-profile cnp1 input cable-length 10
set class-of-service congestion-notification-profile cnp1 input ieee-802.1 code-point 011 pfc
set class-of-service congestion-notification-profile cnp1 input ieee-802.1 code-point 011 mru 9k
set class-of-service congestion-notification-profile cnp1 input ieee-802.1 code-point 011 xon 100
set class-of-service congestion-notification-profile cnp1 output ieee-802.1 code-point 011 flow-control-queue 3
set class-of-service interfaces et-0/0/16 congestion-notification-profile cnp1
Show Commands
> show class-of-service interface et-0/0/16
Physical interface: et-0/0/16, Index: 1273
Maximum usable queues: 12, Queues in use: 5
Exclude aggregate overhead bytes: disabled
Logical interface aggregate statistics: disabled
Scheduler map: <default>
Congestion-notification: Enabled, Name: cnp1, Index: 1
Logical interface: et-0/0/16.0, Index: 1067
Object Name Type Index
Classifier ieee8021p-default ieee-802.1 2
> show class-of-service shared-buffer
Ingress:
Total Buffer : 169207 KB
Dedicated Buffer : 30847 KB
Shared Buffer : 84206 KB
Lossless : 16841 KB
Lossless Headroom : 8420 KB
Lossy : 58944 KB
Lossless dynamic threshold : 7
Lossy dynamic threshold : 10
Lossless Headroom Utilization:
Node Device Total Used Free
0 8420 KB 64 KB 8356 KB
ITM0 Headroom Utilization:
Total Used Free
4210 KB 0 KB 4210 KB
ITM1 Headroom Utilization:
Total Used Free
4210 KB 64 KB 4146 KB
Egress:
Total Buffer : 169207 KB
Dedicated Buffer : 47208 KB
Shared Buffer : 84206 KB
Lossless : 16841 KB
Lossy : 58944 KB
Lossy dynamic threshold : 7
root@ # cli-pfe
root@ pfe> show class-of-service ifd_index 1273
IEEE classifier : 2
IEEE Cls IFL Refcount : 1
DSCP classifier : -1
DSCP Cls IFL Refcount : 0
Fixed classifier : -1
Fixed Cls IFL Refcount : 0
IEEE rewrite : -1
IEEE Rewrite IFL Refcount : 0
DSCP rewrite : -1
DSCP Rewrite IFL Refcount : 0
Scheduler-map : 1
CNP : 1
DCN : 0
DECN : -1
Dynamic Threshold Profile : -1
Ingress Dedicated Buffers : 210
Egress Dedicated Buffers : 360
Queue0 : 18
Q0-gport : 0x24160000
Queue1 : 0
Q1-gport : 0x24160001
Queue2 : 0
Q2-gport : 0x24160002
Queue3 : 126
Q3-gport : 0x24160003
Queue4 : 126
Q4-gport : 0x24160004
Queue5 : 0
Q5-gport : 0x24160005
Queue6 : 0
Q6-gport : 0x24160006
Queue7 : 18
Q7-gport : 0x24160007
Queue8 : 72
Q8-gport : 0x30160000
Queue9 : 0
Q9-gport : 0x30160001
Queue10 : 0
Q10-gport : 0x30160002
Queue11 : 0
Q11-gport : 0x30160003
PG0 Headroom allocation : 261 PG0 Sw Alpha : 7
PG1 Headroom allocation : 0 PG1 Sw Alpha : 7
PG2 Headroom allocation : 0 PG2 Sw Alpha : 7
PG3 Headroom allocation : 0 PG3 Sw Alpha : 7
PG4 Headroom allocation : 0 PG4 Sw Alpha : 7
PG5 Headroom allocation : 0 PG5 Sw Alpha : 7
PG6 Headroom allocation : 0 PG6 Sw Alpha : 10
PG7 Headroom allocation : 0 PG7 Sw Alpha : 10
root@:pfe> show class-of-service congestion-notification-profile brief
CNP index In-entries Out-entries In-type Cable-length PRI-TO-PG(HW) PG-TO-PFC(HW) Output(HW)
1 1 1 IEEE 10 1 0 0
root@:pfe> show class-of-service congestion-notification-profile index 1
Code-Point Direction MRU Flow-control-Queue XON CNP-TYPE PG FC-ID
3 In , Out 9000 3 100 IEEE 0 1
root@:pfe> show class-of-service congestion-notification-profile pfc-watchdog-info
CNP Index PFC WD Status PFC WD Action PFC WD Poll_interval PFC WD Detection_interval PFC WD Recovery_interval
1 1 1 100 1000 1000
Conclusion
Priority Flow Control plays a vital role in achieving lossless Ethernet by providing granular control over traffic flow across different priorities. Through per-priority pause mechanisms, PFC enables data center networks to handle latency-sensitive and congestion-prone traffic, especially in environments using RDMA or storage workloads, without compromising the performance of other traffic classes.
However, as explored in this blog, enabling PFC is not a simple “turn-on-and-forget” feature. Its effectiveness depends on careful configuration, proper buffer management, and end-to-end consistency across the network path. Misconfigurations or hardware limitations can easily lead to head-of-line blocking or even PFC deadlocks, which can severely impact network performance.
By combining robust PFC monitoring, deadlock detection and recovery mechanisms, and the use of appropriate show and diagnostic commands, network operators can ensure PFC operates reliably and recovers gracefully from congestion scenarios.
In essence, PFC remains a powerful but sensitive tool: when implemented thoughtfully, it helps unlock the full potential of high-performance Ethernet fabrics designed for modern, data-intensive workloads.
Useful Links
Glossary
- AI: Artificial Intelligence
- CNP: Congestion Notification Profile
- DSCP: Differentiated Services Code Point
- HPC: High Performance Computing
- ML: Machine Learning
- MMU: Memory Management Unit
- PFC: Priority Flow Control
- PG: Priority Group
Acknowledgements
Thanks to Yong Han
Comments
If you want to reach out for comments, feedback, or questions, drop us a mail at:
Revision History
| Version | Author(s) | Date | Comments |
|---|---|---|---|
| 1 | Parthipan TS | November 2025 | Initial Publication |

#QFXSeries