
RoCEv2 network card configuration steps for Kylin V10

2025-04-13 09:01:40

Here are the steps to configure RoCEv2 on Kylin Server V10:

Step 1: Confirm hardware and driver support

Before you begin, make sure the server hardware meets the requirements. RoCEv2 usually requires a Mellanox ConnectX series network card (one driven by mlx5, for example) with a current MLNX_OFED driver package installed. You can check the driver and the card with the following commands:

modinfo mlx5_core # View kernel module information
lspci | grep Mellanox # Confirm network card model

If you find that the driver is not loading correctly, you need to download the corresponding version of the driver from the Mellanox official website and install it.
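
Note that modinfo only describes the module file on disk; it succeeds whether or not the module is actually loaded. As a complement, here is a minimal sketch that checks a /proc/modules-style listing (the same data lsmod reads). The helper name is_module_loaded is made up for this example.

```shell
# Succeeds if the named module appears in a /proc/modules-style listing on stdin.
# The first field of each line is the module name.
is_module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage on a live system.
if [ -r /proc/modules ] && is_module_loaded mlx5_core < /proc/modules; then
    echo "mlx5_core is loaded"
else
    echo "mlx5_core is not loaded"
fi
```

If this reports that the module is not loaded while modinfo still finds it, the driver is installed but not active, which is a different problem from a missing driver package.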

Step 2: Switch the network card to RoCEv2 mode

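By default, RDMA may run in RoCEv1 mode, which is carried directly over Ethernet (Layer 2). RoCEv2 encapsulates RDMA traffic in UDP/IP so that it can be routed at Layer 3, and the card has to be switched to it. Use the cma_roce_mode tool to change the mode (assuming the RDMA device name is mlx5_1):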

cma_roce_mode -d mlx5_1 -p 1 -m 2

Here, -d selects the device, -p the port, and -m 2 enables RoCEv2. Afterwards, check the kernel log with dmesg | grep RDMA to confirm that the mode switch succeeded.
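
The kernel log is not always informative here, so an additional sanity check can help. The sketch below counts the entries of type "RoCE v2" in the port's GID table, read from sysfs under gid_attrs/types. It only confirms that the port exposes RoCE v2 GIDs; it does not prove which mode RDMA-CM picks by default. The function name count_rocev2_gids is hypothetical, and the device and port (mlx5_1, port 1) are the same assumptions as above.

```shell
# Count GID table entries whose type is "RoCE v2".
# $1 is a gid_attrs/types directory, one file per GID index.
count_rocev2_gids() {
    types_dir=$1
    count=0
    for f in "$types_dir"/*; do
        [ -f "$f" ] || continue
        # Unpopulated GID slots may fail to read; skip them.
        t=$(cat "$f" 2>/dev/null) || continue
        if [ "$t" = "RoCE v2" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Usage (prints 0 if the device or port does not exist).
echo "RoCE v2 GIDs on mlx5_1 port 1: $(count_rocev2_gids /sys/class/infiniband/mlx5_1/ports/1/gid_attrs/types)"
```

A count of 0 after configuring an IP address on the interface suggests that the driver or firmware is not offering RoCEv2 at all.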

Step 3: Configure flow control and priority

RoCEv2 is sensitive to packet loss and congestion, so it needs to be combined with DCQCN (Data Center Quantized Congestion Notification, an ECN-based congestion control scheme) and PFC (Priority Flow Control). Assuming the network interface name is ens1np0, set the following on the system:

  1. Turn on ECN for the RoCE priority
    Enable ECN for priority 3 (usually used for RoCE traffic) on both the notification point (roce_np) and the reaction point (roce_rp):
    echo 1 > /sys/class/net/ens1np0/ecn/roce_np/enable/3
    echo 1 > /sys/class/net/ens1np0/ecn/roce_rp/enable/3

  2. Mark CNP messages
    Set the DSCP value and 802.1p priority of congestion notification packets (CNP):
    echo 48 > /sys/class/net/ens1np0/ecn/roce_np/cnp_dscp # DSCP=48
    echo 6 > /sys/class/net/ens1np0/ecn/roce_np/cnp_802p_prio # 802.1p priority 6
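
The DSCP value and the 802.1p priority chosen for CNP are related. As a small worked example, the sketch below assumes the common default trust-DSCP mapping, in which the priority is the top three bits of the DSCP value (priority = DSCP / 8); a custom mapping on the card or switch would change the result. The function names are made up for this example.

```shell
# Priority under the assumed default mapping: top three bits of the 6-bit DSCP.
dscp_to_prio() {
    echo $(( $1 >> 3 ))
}

# DSCP occupies the upper six bits of the IP ToS / Traffic Class byte.
dscp_to_tos() {
    echo $(( $1 << 2 ))
}

for dscp in 24 26 48; do
    echo "DSCP $dscp -> priority $(dscp_to_prio "$dscp"), ToS byte $(dscp_to_tos "$dscp")"
done
```

Under this mapping DSCP 48 lands in priority 6, which agrees with the cnp_802p_prio value of 6, while RoCE data traffic would need a DSCP in the range 24 to 31 to land in priority 3.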

Step 4: Optimize network card queue scheduling

Use Mellanox's mlnx_qos tool to adjust the QoS policy so that RoCE traffic gets enough bandwidth. For example, give priority 3 a higher weight:

mlnx_qos -i ens1np0 --trust=dscp # Trust DSCP tag
mlnx_qos -i ens1np0 -f 0,0,0,1,0,0,0,0 # Enable PFC on priority 3
mlnx_qos -i ens1np0 -s ets,ets,ets,ets,ets,ets,strict,strict -t 10,10,10,50,10,10,0,0 # Queue weight allocation: eight traffic classes, ETS weights sum to 100, largest share on TC 3 (assumes priority 3 maps to TC 3)

The key to this step is to give the priority 3 queue (which carries RoCEv2 traffic) a larger share of bandwidth, so that other traffic does not crowd it out.
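
It is easy to mistype the -s and -t lists. A minimal sketch of a pre-check, assuming eight traffic classes, only the ets and strict algorithms, and the requirement that the ETS weights add up to 100; check_ets is a hypothetical helper, not part of mlnx_qos, and the example values are illustrative.

```shell
# check_ets TSA_LIST BW_LIST
# Prints "ok" and succeeds when both lists have eight comma-separated entries,
# every algorithm is "ets" or "strict", and the weights of the ETS classes sum to 100.
check_ets() {
    awk -v tsa="$1" -v bw="$2" 'BEGIN {
        n = split(tsa, s, ","); m = split(bw, w, ",")
        if (n != 8 || m != 8) { print "need 8 entries each, got " n " and " m; exit 1 }
        for (i = 1; i <= 8; i++) {
            if (s[i] == "ets") sum += w[i]
            else if (s[i] != "strict") { print "unknown algorithm: " s[i]; exit 1 }
        }
        if (sum != 100) { print "ETS weights sum to " sum ", expected 100"; exit 1 }
        print "ok"
    }'
}

# Six ETS classes with the largest share on TC 3, two strict classes.
check_ets ets,ets,ets,ets,ets,ets,strict,strict 10,10,10,50,10,10,0,0
```

Running the check before calling mlnx_qos catches a list with too many or too few entries, or weights that do not total 100.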

Step 5: Configure the switch side

If the server is connected to the switch, make sure the switch configuration is consistent with the network card. For example:

  • Enable DSCP-based PFC on the switch, with flow control on priority 3 (the RoCE data priority). Under the common default mapping of priority = DSCP / 8, DSCP=48 (the CNP marking set in Step 3) falls in priority 6, not priority 3.
  • Confirm that ECN is enabled on the switch and that its DSCP/802.1p mapping matches the server's.

The specific configuration commands vary by switch model; refer to the switch manufacturer's documentation.

Step 6: Verify the configuration

The last step is to test whether RoCEv2 works properly. The ib_send_bw tool is recommended for bandwidth testing.

Server:

ib_send_bw -d mlx5_1 --report_gbits -F -R

Client:

ib_send_bw -d mlx5_1 --report_gbits -F -R <Server IP>

If you see stable, high bandwidth (close to the link rate, such as 25 Gbps or 100 Gbps depending on the network card model), the configuration is successful. If you see packet loss or low bandwidth, check the network card statistics with ethtool -S ens1np0, or capture packets with Wireshark and analyze the ECN marks and CNP packets.
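
The output of ethtool -S is long, so it helps to filter it. Here is a minimal sketch that prints only the nonzero counters whose names match a pattern. The helper name nonzero_counters is made up, and the pattern below is only a guess at useful counters (discards, pause frames, buffer exhaustion); exact counter names vary with the driver version.

```shell
# Print nonzero "name: value" counters from `ethtool -S`-style input on stdin
# whose name matches the extended regular expression in $1.
nonzero_counters() {
    awk -v pat="$1" -F: '
        NF == 2 {
            name = $1; value = $2
            gsub(/[ \t]/, "", name); gsub(/[ \t]/, "", value)
            if (name ~ pat && value + 0 > 0) print name "=" value
        }'
}

# Usage on a live system (skipped when ethtool is not installed).
if command -v ethtool >/dev/null 2>&1; then
    ethtool -S ens1np0 2>/dev/null | nonzero_counters 'discard|pause|out_of_buffer' || true
fi
```

Pause counters that keep growing on priority 3 during a test suggest PFC is active and the path is congested, while growing discard counters suggest that lossless behaviour is not actually in effect.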

Things to note

  • Restart the network service: After the configuration is completed, it is recommended to restart the network service to make the settings take effect:
    systemctl restart NetworkManager # or traditional network service
  • Kernel parameters: If you use network card bonding, you need to configure miimon=100 mode=4 (802.3ad dynamic link aggregation) in /etc//.
  • Firmware upgrade: If you encounter compatibility issues, you may need to upgrade the network card firmware.

With the steps above, you should be able to deploy RoCEv2 on Kylin V10. If you run into problems, first check the driver version and whether the switch configuration matches the network card; these are the most common points of failure.