RoCEv2: the RDMA over Converged Ethernet standard, version 2. It is based on RDMA.
RDMA (Remote Direct Memory Access): a data transfer method that allows data to be transferred from one computer's memory to another's without involving the host CPU. This approach reduces the overhead of the traditional TCP/IP stack and improves data transfer efficiency.
PAUSE frame: the most basic flow control technique is the Ethernet Pause mechanism defined in IEEE 802.3. When a downstream device finds that its receiving capacity is lower than the upstream device's transmitting rate, it proactively sends a Pause frame to the upstream device, asking it to stop transmitting and wait for a period of time before resuming.
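The "period of time" in a Pause frame is carried as a 16-bit count of pause quanta, where one quantum equals 512 bit times at the link speed. A minimal sketch of converting that field into wall-clock time (function name and example values are illustrative):

```python
def pause_duration_seconds(pause_time_quanta: int, link_speed_bps: int) -> float:
    """How long the upstream device must stop transmitting.

    One pause quantum = 512 bit times, per IEEE 802.3.
    """
    bit_time = 1.0 / link_speed_bps      # seconds per bit at this link speed
    return pause_time_quanta * 512 * bit_time

# The maximum pause_time (0xFFFF) on a 10 Gbps link:
d = pause_duration_seconds(0xFFFF, 10_000_000_000)
print(f"{d * 1e3:.2f} ms")  # 3.36 ms
```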
DCQCN algorithm: DCQCN (Data Center Quantized Congestion Notification) is currently the most widely used congestion control algorithm in RoCEv2 networks; it combines ideas from the QCN algorithm and the DCTCP algorithm.
How it works: in DCQCN, when a network device (e.g., a switch) detects that its buffer queue length exceeds a preset threshold, it sets a flag on the packet (the ECN bits) indicating the presence of congestion. When the sender receives congestion feedback, it adjusts its sending rate accordingly, usually by reducing it. DCQCN also includes a mechanism that lets the sender adjust its rate dynamically based on the amount of congestion feedback, rather than simply cutting the rate, so network resources are used efficiently.
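The sender-side rate decrease described above can be sketched as follows. This is a simplified illustration following the published DCQCN rate-decrease rule (the full algorithm also has timer- and byte-counter-driven rate-increase phases, omitted here); the class name and default for g are illustrative:

```python
class DcqcnSender:
    def __init__(self, line_rate_gbps: float, g: float = 1.0 / 256):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate, remembered for later recovery
        self.alpha = 1.0           # estimate of how congested the path is
        self.g = g                 # gain for the alpha update

    def on_cnp(self):
        """Congestion Notification Packet received: cut the rate."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No CNP during the update period: decay the congestion estimate."""
        self.alpha = (1 - self.g) * self.alpha

s = DcqcnSender(100.0)
s.on_cnp()
print(round(s.rc, 1))  # 50.0 -- the first CNP halves the rate (alpha starts at 1)
```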
QCN (Quantized Congestion Notification) algorithm: QCN provides a way to quantitatively inform the sender of the level of congestion in the network. In contrast to traditional TCP congestion control methods, QCN lets network equipment (e.g., switches) mark packets with specific flag bits when congestion is detected and feed quantified congestion information back to the sender. The sender can then adjust its sending rate based on the specific congestion level it receives, rather than relying only on loss-triggered retransmission or delay-based mechanisms.
Specifically, the QCN mechanism consists of the following:
- Quantization of congestion signals: when congestion is detected, network devices mark packets with different levels to indicate the current congestion state.
- Receiver feedback: the receiving end collects this marking information and sends it back to the sending end.
- Sender response: the sender adjusts its sending rate based on the feedback it receives to reduce congestion.
Head-of-line (HOL) blocking is a performance-limiting phenomenon in computer networks. It occurs when the first packet at the head of a queue is blocked, which in turn blocks all the packets queued behind it.
FCT usually stands for "Flow Completion Time": the time it takes for a data flow to be completely transmitted, from the first packet to the last.
"Backpressure" (also written back pressure) is a networking term for a flow control mechanism that prevents buffer overflow or packet loss: when the receiver cannot keep up with the data coming from the sender, it notifies the sender to slow down or temporarily stop sending.
ECN (Explicit Congestion Notification)
- Marking rather than dropping: when a router detects congestion, it does not simply drop the packet; instead it modifies the ECN field in the packet's IP header, typically setting it to "CE" (Congestion Experienced).
- Sender response: when the receiving end gets a packet carrying the CE mark, it reflects this in the ACK (acknowledgement) returned to the sender. Upon receiving such an ACK, the sender knows there is congestion in the network and reduces its sending rate accordingly.
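The two bullets above form a round trip: mark at the router, echo at the receiver, react at the sender. A minimal sketch, using the classic TCP-style window halving as the sender's reaction (function names are illustrative):

```python
CE = 0b11  # the "Congestion Experienced" codepoint of the 2-bit IP ECN field

def receiver_ack(ip_ecn_field: int) -> dict:
    """The receiver echoes a CE mark back to the sender in its ACK."""
    return {"ece": ip_ecn_field == CE}

def sender_on_ack(cwnd: float, ack: dict) -> float:
    """The sender halves its window on congestion, otherwise grows it."""
    return max(1.0, cwnd / 2) if ack["ece"] else cwnd + 1

cwnd = 10.0
cwnd = sender_on_ack(cwnd, receiver_ack(0b11))  # a CE-marked packet arrived
print(cwnd)  # 5.0
```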
CNP (Congestion Notification Packet): congestion notification is a method in which, after a forwarding device detects queue congestion, it sends ECN congestion-marked packets on toward the destination host. When the destination server receives an ECN congestion-marked packet, it sends a CNP (Congestion Notification Message) back to the source server to tell it to reduce its packet sending rate.
ECMP(Equal-cost multi-path)
ECMP is a hop-by-hop, flow-based load-balancing policy: when a router discovers multiple equal-cost optimal paths to the same destination, it updates the routing table to add multiple entries for that destination, each corresponding to a different next hop. These paths can be used simultaneously to forward data and increase bandwidth.
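Because ECMP is flow-based, routers typically hash the 5-tuple so that all packets of one flow take the same path (avoiding reordering). A sketch, with Python's hashlib standing in for the router's hardware hash:

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Pick one of the equal-cost next hops by hashing the flow's 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

hops = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
# The same flow always maps to the same path (no packet reordering):
a = ecmp_next_hop("192.168.1.5", "10.1.2.3", 17, 40000, 4791, hops)
b = ecmp_next_hop("192.168.1.5", "10.1.2.3", 17, 40000, 4791, hops)
assert a == b
print(a)
```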
The following are four common data center network topologies.
CLOS Networking
The core idea of a CLOS network is to build complex, large-scale networks with multiple small, low-cost units. A simple CLOS network is a three-level interconnection architecture that contains an input level, an intermediate level, and an output level.
In the following figure, m is the number of input ports of each submodule, n is the number of output ports of each submodule, and r is the number of submodules at each stage. With rearrangement of existing connections allowed, as long as r2 ≥ max(m1, n3) is satisfied, a non-blocking path can always be found from any input to any output (the rearrangeably non-blocking condition).
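The condition above is easy to check programmatically; a one-line sketch (parameter names mirror the text's m1, n3, r2):

```python
def is_rearrangeably_nonblocking(m1: int, n3: int, r2: int) -> bool:
    """Check the Clos condition r2 >= max(m1, n3).

    m1: input ports per first-stage module, n3: output ports per
    third-stage module, r2: number of middle-stage modules.
    """
    return r2 >= max(m1, n3)

print(is_rearrangeably_nonblocking(m1=4, n3=4, r2=4))  # True
print(is_rearrangeably_nonblocking(m1=8, n3=4, r2=6))  # False: too few middle modules
```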
Switch Fabric
Switch fabric refers to the internal structure of a switch that connects its input and output ports. The simplest fabric architecture is the Crossbar model: a matrix of crosspoint switches, where each crosspoint can be closed or open, and a scheduler controls the crosspoints to forward each input to a specific output. A Crossbar model is shown below:
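A toy model of the Crossbar just described: closing crosspoint (i, j) forwards input i to output j, and a conflict-free configuration closes at most one crosspoint per row and per column. Class and method names are illustrative:

```python
class Crossbar:
    def __init__(self, n: int):
        self.n = n          # an n x n matrix of crosspoint switches
        self.closed = {}    # input port -> output port for closed crosspoints

    def connect(self, inp: int, out: int) -> bool:
        """Close crosspoint (inp, out) if neither row nor column is in use."""
        if inp in self.closed or out in self.closed.values():
            return False    # conflict: that input or output is already busy
        self.closed[inp] = out
        return True

    def forward(self, inp: int, packet):
        """Deliver a packet from an input to its connected output, if any."""
        out = self.closed.get(inp)
        return (out, packet) if out is not None else None

xb = Crossbar(4)
assert xb.connect(0, 2)
assert not xb.connect(1, 2)   # output 2 is busy, so this request is blocked
print(xb.forward(0, "pkt"))   # (2, 'pkt')
```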
Fat-Tree Network Architecture
Each layer is built from the same commodity switches and provides the same aggregate bandwidth, which guarantees that the network has no bandwidth oversubscription (no bandwidth convergence between layers).
Leaf-Spine Architecture
A two-tier structure that simplifies the traditional three-tier CLOS structure. It is non-blocking and symmetric, and can provide high-performance intra-data-center communication.
Example: there are four ToRs (T1-T4), four leaves (L1-L4), and two spines (S1-S2). Each ToR represents a different IP subnet.
Explanation of each term:
- ToRs (Top-of-Rack switches): there are four ToR switches. A ToR switch usually sits at the top of a rack and connects the servers within that rack to the rest of the network.
- Leaves: there are four leaf nodes. In data center network architectures, "leaves" usually refers to the switches directly connected to the ToR switches, forming the edge of the network.
- Spines: there are two spine nodes. Spine nodes form the core of the network; they connect all the leaf nodes and provide high-bandwidth, low-latency data exchange paths.
RPC: Remote Procedure Call is a computer communication protocol that allows a program to call a procedure in another program or service located on a different computer or network environment as if it were a local call.
BDP stands for "Bandwidth-Delay Product": the product of the available bandwidth (the maximum amount of data that can be transmitted per unit of time) and the round-trip time (RTT, the time it takes a packet to travel from the sender to the receiver and back again).
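The BDP is the amount of data that must be "in flight" to keep the pipe full. A quick worked example (the link speed and RTT values are illustrative):

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-Delay Product, converted from bits to bytes."""
    return bandwidth_bps * rtt_seconds / 8

# A 100 Gbps data-center link with a 10 microsecond RTT:
print(f"{bdp_bytes(100e9, 10e-6):.0f} bytes")  # 125000 bytes, i.e. 125 KB in flight
```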
parking lot problem
The parking lot problem describes unfairness between flows that traverse different numbers of bottlenecks: just as vehicles at a lightly used entrance get into the parking lot faster (i.e., achieve higher throughput), vehicles at busier entrances wait longer (i.e., get lower throughput). This problem highlights the need to balance fairness and efficiency in network design and flow control mechanisms.
Congestion Control for Large-Scale RDMA Deployment
1. What was addressed?
DCQCN combines two mechanisms: blunt, fast PFC flow control to prevent packet loss in time, and fine-grained, slower end-to-end congestion control to adjust the sending rate and avoid persistently triggering PFC. Together they achieve low latency, high throughput, and low CPU utilization.
2. What is the core idea of the method?
3. Design details and evaluation .
Credit-Scheduled Delay-Bounded Congestion Control
1. What was addressed? ExpressPass solves the congestion control challenges in data center networks due to small round-trip times (RTTs), bursty traffic arrivals, and large numbers of concurrent flows (thousands).
Methodological core: the core idea of ExpressPass is to use credit packets to control congestion before data packets are sent, which bounds latency and enables fast convergence. The system gracefully handles bursty flow arrivals and avoids low link utilization when there are multiple bottlenecks. Credit packets are rate-limited so that they never exceed a small fraction (about 5%) of link capacity at any moment, leaving the remaining ~95% for data packets. With a carefully designed credit feedback loop, the system ensures high utilization, fairness, and fast convergence while limiting queue growth to prevent packet loss.
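The 5%/95% split above comes from sizing: each minimum-size credit packet entitles the sender to transmit one maximum-size data packet. A sketch under that assumption (the 84-byte credit frame and 1538-byte data frame are the sizes commonly cited for ExpressPass; function names are illustrative):

```python
def max_credits_per_second(link_bps: float, credit_frame_bytes: int = 84,
                           credit_fraction: float = 0.05) -> float:
    """How many credit packets per second fit in the ~5% credit budget."""
    return (link_bps * credit_fraction) / (credit_frame_bytes * 8)

def data_rate_bps(credits_per_second: float, mtu_bytes: int = 1538) -> float:
    """Each credit releases one maximum-size data packet."""
    return credits_per_second * mtu_bytes * 8

cps = max_credits_per_second(10e9)   # credit budget on a 10 Gbps link
print(f"{data_rate_bps(cps) / 1e9:.2f} Gbps of data enabled by the credits")
```

Because the credit channel is capped, data can never be released faster than the remaining ~95% of capacity, which is what bounds queue growth.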
Design Details:
- Credit and packet scheduling: ensure that packet scheduling is not disrupted by RTT differences across paths, which would cause queue backlogs.
- Fairness and multi-bottleneck handling: resolve multiple bottlenecks and ensure that all flows share bandwidth fairly while maintaining high link utilization.
Evaluation:
- Performance: ExpressPass converges up to 80 times faster than DCTCP on 10 Gbps links, and the advantage grows as link speed increases.
- Load balancing and flow completion time: significant performance improvements under heavy workloads, especially for small and medium flows, with large reductions in flow completion time compared to RCP, DCTCP, HULL, and DX.
Bolt
- What problem is solved? Bolt addresses the challenges posed to data center networks by bandwidth increasing to 200 Gbps and beyond, in particular the greater sensitivity of network transmissions to congestion caused by larger bandwidth-delay products (BDPs), and the correspondingly higher demands placed on congestion control (CC) algorithms.
- The core ideas of the method:
  Sub-RTT Control (SRC): responds to congestion faster than an RTT-based control loop, reducing the delay of control decisions.
  Proactive Ramp Up (PRU): anticipates upcoming flow completions and quickly takes up the released bandwidth to avoid underutilization.
  Supply Matching (SM): explicitly matches bandwidth demand and supply to maximize utilization.
- Design details and evaluation:
  Minimize feedback delay: congestion notifications are generated at the switch and reflected directly to the sender, reducing feedback time.
  Flow completion foreshadowing: the sender signals flow completion in advance to hide the ramp-up delay and avoid underutilization.
  Fast and stable: cwnd is updated each time feedback is received, at most once per packet, to resist observation noise.
Conclusion: precise congestion control at the sub-RTT level is feasible and significantly improves network performance.
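The "update cwnd on each feedback, at most once per packet" rule can be loosely sketched as below. This omits Bolt's PRU and SM terms entirely and uses a generic one-MTU-per-feedback step, so treat it purely as an illustration of per-packet-feedback window control, not Bolt's actual update rule:

```python
MTU = 1.0  # count cwnd in packets for simplicity

def update_cwnd(cwnd: float, congestion_signal: bool) -> float:
    """Move cwnd by at most one MTU per feedback packet.

    congestion_signal: True if this feedback carried a congestion
    notification generated at a switch (an SRC-style signal).
    """
    if congestion_signal:
        return max(MTU, cwnd - MTU)   # congestion: back off by one packet
    return cwnd + MTU / cwnd          # otherwise grow ~one MTU per window

cwnd = 10.0
cwnd = update_cwnd(cwnd, congestion_signal=True)
print(cwnd)  # 9.0
```

Bounding each step to one MTU is what makes the control loop robust to observation noise: a single spurious signal can shift the window by at most one packet.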