
Convert and deploy models on the RDK X5 at a glance — are you sure you don't want to learn it?


Author:SkyXZ

CSDN:SkyXZ~-CSDN Blog

Blog Park:SkyXZ - Blog Park

Host environment: WSL2-Ubuntu22.04+Cuda12.6, D-Robotics-OE 1.2.8, Ubuntu20.04 GPU Docker

End-side device environment: RDK X5-Server-3.1.0

After buying an RDK X5, have you only used it like a Raspberry Pi? Want to deploy deep learning models but have no idea where to start with the BPU? Finally found the OE delivery package and the Model Zoo, but don't know what they are for? I know you're in a hurry, but hold on! Follow this tutorial and in about 30 minutes you'll understand how to quantize and deploy models on the RDK and leave the beginner stage behind!!! First, the reference materials and documents for this tutorial:

  • D-Robotics RDK user manual: 1. Quick Start | RDK DOC
  • D-Robotics X5 algorithm toolchain: X5 algorithm toolchain version release
  • D-Robotics RDK Model Zoo introduction manual: 4.3.1 ModelZoo Overview | RDK DOC
  • D-Robotics RDK Model Zoo repository address: /D-Robotics/rdk_model_zoo

1. Introduction to algorithm tool chain and environment installation

The models we train on GPUs are usually stored and computed in floating point, because floating point offers higher precision and flexibility. For edge devices, however, the compute and storage that floating-point models require far exceed what the hardware can carry, so the AI accelerators on typical edge devices basically only support INT8 (the common precision for embedded processors) fixed-point models, and our X5 BPU is no exception. We therefore need to convert our trained floating-point models into fixed-point models; this process is called model quantization. D-Robotics has developed an algorithm toolchain for its processors that lets you quantize floating-point models into fixed-point models conveniently and quickly, and deploy them quickly on D-Robotics processors!!! Below we introduce how to install the algorithm toolchain.
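Before we get to the installation, here is a tiny NumPy sketch to build intuition for what INT8 quantization means numerically. It is only a toy illustration of symmetric scale quantization, not the actual algorithm the D-Robotics toolchain uses:

import numpy as np

# A fake float32 tensor standing in for real model weights/activations
x = np.random.randn(1000).astype(np.float32)

# Symmetric quantization: choose a scale so the largest magnitude maps to 127
scale = np.abs(x).max() / 127.0
q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)   # float32 -> int8
x_hat = q.astype(np.float32) * scale                          # dequantize to compare

print("scale:", scale)
print("max abs error:", np.abs(x - x_hat).max())  # small but non-zero: the cost of 8-bit precision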

Since the D-Robotics algorithm toolchain currently only runs on Linux, first make sure your development machine meets the requirements below and has WSL2 Ubuntu installed (for details see: Say goodbye to virtual machines! WSL2 installation and configuration tutorial!!! - SkyXZ - Blog Park) or Ubuntu in a virtual machine. Because the toolchain is shipped as a Docker image, the exact Ubuntu version does not matter much.

image-20250119024603630

(1) Install Docker and NVIDIA Container Toolkit

We first need to install Docker (see: Get Docker | Docker Docs) and the NVIDIA Container Toolkit (D-Robotics officially requires 1.13.1-1.13.5; for installation details see: Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.17.3 documentation). I will walk you through the whole process from the beginning. First comes Docker: we remove whatever Docker the system may already have and install the necessary supporting packages:

#If an old Docker is installed, remove it; if the command complains that nothing is installed, just ignore it
 sudo apt-get remove docker docker-engine containerd runc
 #Install the necessary dependencies
 sudo apt install apt-transport-https ca-certificates curl software-properties-common gnupg lsb-release

We assume no proxy is used, so all of our sources are domestic mirrors. After adding Alibaba's GPG key and Alibaba's APT source, we can install the latest Docker directly with APT.

# step 1 Add the Alibaba GPG key
 curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

 # step 2 Add the Alibaba Docker APT source
 echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

 # step 3 Update the package index
 sudo apt-get update

 # step 4 Install Docker
 sudo apt install docker-ce docker-ce-cli

 # step 5 Verify the Docker installation
 sudo docker version #View the Docker version
 sudo systemctl status docker #Check that Docker is running

image-20250119030911095

If docker version produces output and the service is running normally, Docker is installed. Next we add the current user to the docker group so that docker commands can be used without root or sudo:

  sudo groupadd docker
  sudo gpasswd -a ${USER} docker
  sudo service docker restart

But we are not done yet: when you run docker run hello-world, there is a good chance you will hit the following network error:

image-20250119031023140

This is because the Docker registry cannot currently be reached directly from mainland China, so we need third-party Docker mirrors. I have collected some commonly used mirror settings; you only need to add them to the /etc/docker/daemon.json file:

# step 1 Create or edit /etc/docker/daemon.json
 sudo nano /etc/docker/daemon.json
 # step 2 Copy and paste the following into the file, filling in the registry mirror addresses you want to use
 {
     "registry-mirrors": [
         "https://<mirror-address-1>",
         "https://<mirror-address-2>"
     ]
 }
 # step 3 Reload the configuration file and restart docker
 sudo systemctl daemon-reload
 sudo systemctl restart docker
 # step 4 Check the Docker configuration to confirm the mirrors took effect
 sudo docker info

image-20250119032119216

You can see that after running docker info the terminal lists the mirror addresses we just added. Now run docker run hello-world again and Docker successfully pulls the image and prints "Hello from Docker!"

image-20250119032356533

After installing Docker, let's install the NVIDIA Container Toolkit (if your computer has no GPU, or you are in a virtual machine such as VMware that cannot access the GPU, you can skip this step). This component is a set of tools provided by NVIDIA that lets Docker containers use the GPU and enables GPU acceleration. NVIDIA's documentation is very detailed, so we simply follow its installation steps.

Similar to the previous Docker, we need to add the official source of Nvidia. After adding it, we can directly use APT to install it.

# step 1 Configure the production repository
 curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
   && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
 # step 2 Update
 sudo apt-get update
 # step 3 Install with APT
 sudo apt-get install -y nvidia-container-toolkit #Without a proxy this step takes a bit longer

Then we configure the NVIDIA Container Runtime for Docker. This part is very simple and only requires two commands:

sudo nvidia-ctk runtime configure --runtime=docker #Use the nvidia-ctk command to modify the /etc/docker/daemon.json file
 sudo systemctl restart docker #Restart the Docker daemon process

Finally, enter the following command to verify whether our configuration is successful. If the following picture appears, it means that the Nvidia Container Toolkit installation is complete! ! !

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

image-20250119034046615

(2) Configure and use the D-Robotics algorithm toolchain

Okay, if the steps above all went smoothly, all of the pre-configuration is done! Now we can set up the algorithm toolchain itself. First download the RDK OE delivery package (the latest version as of this article is v1.2.8) and the corresponding Docker image:

# Download OE-v1.2.8 delivery package
 wget -c ftp://x5ftp@/OpenExplorer/v1.2.8_release/horizon_x5_open_explorer_v1.2.8-py310_20240926. --ftp-password=x5ftp@123$%

 # Choose the following CPU or GPU version of the Docker image to download, just choose one of the two.
 #Ubuntu20.04 CPU Docker image
 wget -c ftp://x5ftp@/OpenExplorer/v1.2.8_release/docker_openexplorer_ubuntu_20_x5_cpu_v1.2. --ftp-password=x5ftp@123$%
 #Ubuntu20.04 GPU Docker image
 wget -c ftp://x5ftp@/OpenExplorer/v1.2.8_release/docker_openexplorer_ubuntu_20_x5_gpu_v1.2. --ftp-password=x5ftp@123$%

 # X5 (Sunrise 5) algorithm toolchain user development documents (download as needed)
 wget -c ftp://x5ftp@/OpenExplorer/v1.2.8_release/x5_doc-v1.2. --ftp-password=x5ftp@123$%
 #Checksum (download and use on demand)
 wget -c ftp://x5ftp@/OpenExplorer/v1.2.8_release/ --ftp-password=x5ftp@123$%

Since the Docker image file is quite large, the download takes a while. Once it finishes, run ls and you should see the two files:

image-20250119115749661

We enter the following command to decompress:

tar -xvf horizon_x5_open_explorer_v1.2.8-py310_20240926. #Decompress the OE delivery package

After decompression we go into the OE package. Its structure is shown below: it is split into two large folders, package and samples. package mainly contains the board-side and host-side development environments for the RDK series; since we use the Docker image we can leave it alone. Let's focus on samples, which contains three folders. The third, model_zoo, is a soft link to ai_toolchain/model_zoo in the second folder. The first, ai_benchmark, is the AI benchmark sample package officially provided by D-Robotics for evaluating the performance and accuracy of classification, detection and segmentation models; it supports single-frame latency and dual-core multi-thread scheduling performance evaluation. With it we can check whether a model meets performance requirements and verify the accuracy of the quantized model, but generally speaking, if you use the official Yolo series without fine-tuning you don't need to pay much attention to this part.

image-20250119120312514

Then let's look at the highlight, the ai_toolchain folder. From the structure diagram below we can see that it mainly contains the model quantization and conversion examples, the model training examples, and the model runtime examples. We will introduce how to use it in Section 3.

image-20250119121127552

Having looked through the OE delivery package, we now import the Docker image. Since the image depends on the OE package at runtime, we first set the Docker mount paths, and then we can load the image from the tar package:

#Everyone can modify it according to their own path
 export version=v1.2.8
 export ai_toolchain_package_path=/path/OE/horizon_x5_open_explorer_v1.2.8-py310_20240926 #Please modify the path yourself
 export dataset_path=/path/OE/dataset #Please modify the path yourself. If there is no dataset, please create it yourself.
 #Import image
 docker load < docker_openexplorer_ubuntu_20_x5_gpu_v1.2.

image-20250119125258889
Since our image is fairly large, importing takes a while; just wait patiently. Then enter the following command to start the Docker container.

sudo docker run -it --rm --gpus all --shm-size=15g -v "$ai_toolchain_package_path":/open_explorer -v "$dataset_path":/data openexplorer/ai_toolchain_ubuntu_20_x5_gpu:v1.2.8-py310

Then run the hb_mapper command inside the container; if you see the following printout, our environment is fully installed~~

image-20250119125411789

        Small tip: you can add the following alias line to ~/.bashrc; afterwards, simply typing RDK_Ai_Toolchain in the terminal opens the toolchain, so you don't have to remember such a long command.

# Make sure the two export lines above (ai_toolchain_package_path and dataset_path) are also in ~/.bashrc
alias RDK_Ai_Toolchain='sudo docker run -it --rm --gpus all --shm-size=15g -v "$ai_toolchain_package_path":/open_explorer -v "$dataset_path":/data openexplorer/ai_toolchain_ubuntu_20_x5_gpu:v1.2.8-py310'

image-20250119125605455

        At this point, our D-Robotics toolchain environment is fully installed and configured!!!

2. Introduction to Model Zoo

For students who have just gotten an RDK board, I don't think we can skip the newly launched D-Robotics Model Zoo and jump straight into the RDK algorithm toolchain, so our X5 quantization, conversion and deployment tutorial starts with the Model Zoo. As the name implies, this is a "zoo of models": an open-source algorithm case repository maintained by the D-Robotics developer community. According to the official description, it contains a variety of models that can be deployed directly on the board (such as the Yolo series, FCOS, ResNet, PaddleOCR, etc.), covering scenarios including but not limited to image classification, object detection, semantic segmentation and natural language processing. The models are carefully selected and optimized, perform efficiently, and are provided as .bin models that have already been quantized and converted so they can run directly; C++/Python and Jupyter examples are provided for users as well.

So how do we use this repository? We first pull the Model Zoo from GitHub; its project structure is shown in the figure:

git clone https://github.com/D-Robotics/rdk_model_zoo #Pull the Model Zoo

image-20250119035153468

Under the top-level folder there are bilingual Chinese/English READMEs, the resource folder holding the README images, and the most important demos folder, which organizes all officially supported models by task: object detection (detect), classification (classification), keypoint detection (Pose), and so on. Taking detect as an example, open it and you will see the many officially supported model series; open the YOLOv5 folder and you will find the official C++/Jupyter deployment examples as well as the officially converted model files and the ptq quantization configuration files.

image-20250119133223538

image-20250119133307704

        I believe that everyone should have a basic understanding of Model Zoo by now. Next, we will use Yolov5-V2.0 as an example to introduce how to convert the model.

3. Model Quantization Example Tutorial

Now we officially start using the toolchain. We take the official YOLOv5 v2.0 as an example, walking through the model conversion while explaining the relevant concepts along the way. The process follows the official description in rdk_model_zoo/demos/detect/YOLOv5/README_cn.md from the D-Robotics Model Zoo. First, pull the official YOLOv5 v2.0 source code and download the official model weights:

git clone https://github.com/ultralytics/yolov5.git #Clone the repository
 cd yolov5 #Enter the repository
 git checkout v2.0 #Switch to the v2.0 tag
 git branch #Check: if "* (HEAD detached at v2.0)" appears, the switch is complete
 #I use the official 80-class weights for the demo. If you have a trained model of your own, skip this step and use your own weights.
 wget https://github.com/ultralytics/yolov5/releases/download/v2.0/yolov5s.pt -O yolov5s_tag2.0.pt #Download the official model weights

Our BPU expects 4-dimensional NHWC output, i.e. (batch_size, height, width, channels), while the YOLOv5 source code is written in PyTorch, whose outputs are NCHW, i.e. (batch_size, channels, height, width). We therefore need to modify the model's output head so that the trained .pt file has the correct output format when exported to ONNX. Open yolov5/models/yolo.py and locate the Detect head's forward function at around line 22. Since this change is only needed when exporting to ONNX and the original code should be kept for training, I recommend not deleting the original lines but commenting them out as shown in my screenshot. Modify the function as follows:

def forward(self, x):
    # Export-only change: move each head's output from NCHW to NHWC for the BPU
    return [self.m[i](x[i]).permute(0, 2, 3, 1).contiguous() for i in range(self.nl)]

image-20250119135304956

Then we use the model export script provided by YOLOv5. First copy it out of yolov5/models/:

cp ./models/export.py .

Then open this file. Since we only need to export an ONNX model, delete the TorchScript export section (around line 32) and the CoreML export section (around line 60), keeping only the ONNX export. In the ONNX export we also explicitly choose the opset version and add an onnx-simplifier pass to perform some graph optimization and constant folding:

PS: every ONNX operator (convolution, activation, matrix multiplication, etc.) is versioned, and the opset version specifies which ONNX operator set the exported model uses. The RDK series currently only supports opset 10 and opset 11, so we specify version 11.

try:
    import onnx
    from onnxsim import simplify

    print('\nStarting ONNX export with onnx %s...' % onnx.__version__)
    f = opt.weights.replace('.pt', '.onnx')  # filename
    model.fuse()  # only for ONNX
    torch.onnx.export(model, img, f, verbose=False, opset_version=11, input_names=['images'],
                      output_names=['small', 'medium', 'big'])
    # Checks
    onnx_model = onnx.load(f)  # load onnx model
    onnx.checker.check_model(onnx_model)  # check onnx model
    print(onnx.helper.printable_graph(onnx_model.graph))  # print a human readable model
    # simplify
    onnx_model, check = simplify(
        onnx_model,
        dynamic_input_shape=False,
        input_shapes=None)
    assert check, 'assert check failed'
    onnx.save(onnx_model, f)
    print('ONNX export success, saved as %s' % f)
except Exception as e:
    print('ONNX export failure: %s' % e)

        If you find these modifications troublesome, you can simply copy my modified version below and replace the original file's contents:

import argparse
import torch
from models.common import *
from utils import google_utils

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='./yolov5s.pt', help='weights path')
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='image size')
    parser.add_argument('--batch-size', type=int, default=1, help='batch size')
    opt = parser.parse_args()
    opt.img_size *= 2 if len(opt.img_size) == 1 else 1  # expand
    print(opt)
    img = torch.zeros((opt.batch_size, 3, *opt.img_size))  # image size(1,3,320,192) iDetection
    google_utils.attempt_download(opt.weights)
    model = torch.load(opt.weights, map_location=torch.device('cpu'))['model'].float()
    model.eval()
    model.model[-1].export = True  # set Detect() layer export=True
    y = model(img)  # dry run
    try:
        import onnx
        from onnxsim import simplify

        print('\nStarting ONNX export with onnx %s...' % onnx.__version__)
        f = opt.weights.replace('.pt', '.onnx')  # filename
        model.fuse()  # only for ONNX
        torch.onnx.export(model, img, f, verbose=False, opset_version=11, input_names=['images'],
                          output_names=['small', 'medium', 'big'])
        # Checks
        onnx_model = onnx.load(f)  # load onnx model
        onnx.checker.check_model(onnx_model)  # check onnx model
        print(onnx.helper.printable_graph(onnx_model.graph))  # print a human readable model
        # simplify
        onnx_model, check = simplify(
            onnx_model,
            dynamic_input_shape=False,
            input_shapes=None)
        assert check, 'assert check failed'
        onnx.save(onnx_model, f)
        print('ONNX export success, saved as %s' % f)
    except Exception as e:
        print('ONNX export failure: %s' % e)

After completing these changes we can export our trained .pt model to an ONNX model (this assumes you already have a working YOLOv5 Conda environment):

python3 export.py --weights ./yolov5s_tag2.0.pt #Replace with the path to your own .pt file

image-20250119143623886

Then we can start quantizing the model! Copy the exported .onnx model into our OE package:

cp ./yolov5s_tag2.0.onnx /path/to/OE #Adjust the source file name and destination path to your own setup

Tips: In order to standardize file management, I created a new Model folder in the OE package to uniformly manage my own model projects. I recommend that everyone adopt this method.

image-20250119143946683
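Before handing the ONNX file to the checker in the next step, you can also give it a quick sanity check from Python. This is an optional extra using the onnx package from the export environment; the file name below is my assumption, so use whatever your export actually produced:

import onnx

m = onnx.load("yolov5s_tag2.0.onnx")   # assumed file name; replace with your exported model
for opset in m.opset_import:            # the opset(s) the model declares; should report version 11
    print("domain:", opset.domain or "ai.onnx", "version:", opset.version)
print("outputs:", [o.name for o in m.graph.output])  # should list the three detection heads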

Next we start the officially provided algorithm toolchain Docker image and first check our ONNX model. Here we use the official D-Robotics command hb_mapper checker; its usage is as follows:

hb_mapper checker --model-type ${model_type} \
                     --march ${march} \
                     --proto ${proto} \
                     --model ${caffe_model/onnx_model} \
                     --input-shape ${input_node} ${input_shape} \
                     --output ${output}
 # --model-type is used to specify the model type for checking the input. Currently, only caffe or onnx is supported.
 # --march is used to specify the target D-Robotics processor. The available values are bernoulli2, bayes and bayes-e;
 # RDK X3 uses bernoulli2, RDK Ultra uses bayes, RDK X5 uses bayes-e
 # --proto This parameter is only valid when model-type specifies caffe, and the value is the prototxt file name of the Caffe model.
 # --model When model-type is specified as caffe, the value is the caffemodel file name of the Caffe model.
 # When model-type is specified as onnx, the value is the name of the ONNX model file
 # --input-shape optional parameter, explicitly specify the input shape of the model
 # The value is {input_name} {NxHxWxC/NxCxHxW}, and input_name and shape are separated by spaces.
 # For example, the model input name is data1 and the input shape is [1,224,224,3], then the configuration should be --input-shape data1 1x224x224x3
 # If the shape configured here is inconsistent with the shape information in the model, the configuration here shall prevail.

Following the official description of this command, we enter the command below to check our model. It produces a long output, and from it we can also see that the X5 BPU supports every operator in YOLOv5 v2.0, i.e. all of the model's computation can run on the X5 BPU.

#Modify --model parameters according to your own model path
 hb_mapper checker --model-type onnx --march bayes-e --model /path/to/model

image-20250119144853938

image-20250119144950513

If this step passes without problems, we can start converting the model. The D-Robotics toolchain uses a PTQ scheme and provides a similar command that automatically converts the floating-point model into a D-Robotics hybrid heterogeneous model; after this stage we obtain a model that can run on the D-Robotics processor. Let's look at the official description of the command first:

PS: PTQ (Post-Training Quantization) is a technique that converts an already trained model into a low-precision representation (such as 8-bit integers) without retraining, reducing the model's storage and compute overhead. Quantizing after training speeds up inference and shrinks the model while trying to preserve its accuracy.

# Disable fast-perf mode
 hb_mapper makertbin --config ${config_file} \
                       --model-type ${model_type}
 # Enable fast-perf mode
 hb_mapper makertbin --fast-perf --model ${caffe_model/onnx_model} --model-type ${model_type} \
                   --proto ${caffe_proto} \
                   --march ${march}
 # --help Display help information and exit
 # -c, --config The configuration file for model compilation is in yaml format, and the file name uses the .yaml suffix.
 # --model-type is used to specify the model type of conversion input. Currently, it supports setting caffe or onnx.
 # --fast-perf Turn on the fast-perf mode. After this mode is turned on, a bin model with the highest performance that can run on the board side will be generated during the conversion process.
 # If fast-perf mode is enabled, the following configuration is required
 # --model Caffe or ONNX floating point model file
 # --proto is used to specify the Caffe model prototxt file
 # --march BPU microarchitecture, if using RDK X3, set it to bernoulli2, if using RDK Ultra, set it to bayes, if using RDK X5, set it to bayes-e

We can see that this command requires a model-compilation configuration file. In it we configure the conversion-related parameters, such as the data preprocessing used by the original floating-point training framework, the mean values subtracted from images, the image scaling factor, compiler-related parameters and other required settings. If you use a model series from the D-Robotics Model Zoo, the official PTQ configuration file is already provided and stored in each model's folder, so you can use it directly. Generally we only need to adapt the onnx_model model path, the march architecture and the cal_data_dir calibration-set path to our own environment and board.

image-20250119150532203

image-20250119150610660
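If you prefer to patch those few fields from a script instead of editing the YAML by hand, here is a minimal sketch. It assumes PyYAML is available and that the field names model_parameters.onnx_model, model_parameters.march and calibration_parameters.cal_data_dir match the Model Zoo template you copied; double-check them against your own file:

import yaml

cfg_path = "yolov5_detect_bayese_640x640_nv12.yaml"   # the template copied from the Model Zoo
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# Only the fields that usually differ between setups (field names assumed from the template)
cfg["model_parameters"]["onnx_model"] = "./yolov5s_tag2.0.onnx"
cfg["model_parameters"]["march"] = "bayes-e"              # RDK X5
cfg["calibration_parameters"]["cal_data_dir"] = "./calibration_data"

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
print("patched", cfg_path)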

At this point some of you will ask: oops! What if the model I use is not in the Model Zoo? How do I write these parameters myself? Don't worry, D-Robotics also provides PTQ template files for the different devices and model formats (Caffe, ONNX). In the last part of the linked document 8.5 Algorithm tool chain class | RDK DOC there are model quantization yaml templates for RDK X3, RDK X5 and RDK Ultra; take what you need. Then you will ask: huh!? So what do the parameters in this YAML file actually do? How should I configure them? Don't be impatient! The "model conversion yaml configuration parameters" part of the official documentation, Detailed explanation of PTQ principles and steps | RDK DOC, explains them in great detail. Note, however, that all four parameter groups must be present in the configuration file; the individual parameters are divided into required and optional, and optional ones may be left unconfigured.

image-20250119152346544

Now back to the model conversion itself. As noted above, the YOLOv5 v2.0 we are using is included in the official Model Zoo, so we can directly use the PTQ configuration file the officials provide. First copy it from the Model Zoo into our OE package:

#Modify according to your own configuration. Copy YAML to docker in the OE package to access it. It is recommended to use the same path as the model.
 cp /path/to/demos/detect/YOLOv5/ptq_yamls/yolov5_detect_bayese_640x640_nv12.yaml /path/to/OE

image-20250119152804252

Then we modify the model path and the march architecture in the YAML file, change the output path as needed, and so on. This time, however, we notice that we also need to prepare calibration data, which is used to calibrate the model while converting it from floating point to fixed point. That is simple too: the calibration samples are just images from the training set or validation set you used to train the model, so we only need to copy roughly 100 of them into the OE package. The official configuration also offers a preprocess_on option that enables automatic processing of the calibration images; with it enabled, the toolchain uses skimage to read the calibration set and automatically resize it to the input node size. (Although this option is very convenient, I still recommend reading the official user manual and the examples in the OE package and writing your own preprocessing script; see the sketch after the screenshots below.)

image-20250119161652550

image-20250119161619190
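If you do write the calibration preprocessing yourself instead of relying on preprocess_on, a rough sketch might look like the following. It assumes a 640x640 RGB, NCHW training input and dumps each image as a raw uint8 file with tofile(); the exact layout and data type that cal_data_dir expects depend on your yaml settings (input_type_train, input_layout_train, etc.), so cross-check against the preprocessing scripts in the OE samples:

import os
import cv2
import numpy as np

src_dir = "./calibration_images"   # ~100 images taken from your training/validation set (assumed path)
dst_dir = "./calibration_data"     # the directory your yaml's cal_data_dir points to (assumed path)
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    img = cv2.imread(os.path.join(src_dir, name))
    if img is None:
        continue
    img = cv2.resize(img, (640, 640))              # match the model input size
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)     # assuming the model was trained on RGB images
    img = np.transpose(img, (2, 0, 1))             # HWC -> CHW, assuming input_layout_train: NCHW
    img.astype(np.uint8).tofile(os.path.join(dst_dir, os.path.splitext(name)[0] + ".rgb"))

print("calibration samples written to", dst_dir)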

After modifying and adding everything we need, we can start the model conversion. Enter the following command in the Docker environment and wait a while; if no error is reported, the conversion succeeded! A successful conversion produces an output folder containing our converted model~

hb_mapper makertbin --model-type onnx --config yolov5_detect_bayese_640x640_nv12.yaml

image-20250119161934623

Although our model has been converted, to be safe we should still inspect its structure and its inputs and outputs. First enter the following command; the toolchain will automatically generate a visualized structure diagram of our converted .bin model in the hb_perf_result folder:

hb_perf /path/to/model #Change to your own model path

image-20250119162407291

After checking that there are no errors, we can start to check the input and output of our model. Enter the following command. The tool chain will print the basic information of the input and output of our model.

hrt_model_exec model_info --model_file /path/to/model #Modify to your own model path

image-20250119162612575

        At this point, if there are no problems with the model's structure, input and output, it means that our model conversion is complete! ! !

4. Model deployment application examples

Next comes the model deployment step that everyone is most concerned and curious about!!! In the past the RDK only offered a C++ deployment interface, but with the latest release official Python inference examples are provided as well (a minimal Python sketch follows the reference link below)!!!

  • Reference manual:Model Inference Interface Description (TODO: Add C++ example) | RDK DOC
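As a taste of the Python route, here is a heavily trimmed sketch of what board-side inference can look like with the hobot_dnn module shipped with the RDK system image. Treat it as an outline rather than a working detector: check the exact API and NV12 details against the reference manual above, and note that the post-processing (decoding the three heads, NMS) is omitted entirely:

import cv2
import numpy as np
from hobot_dnn import pyeasy_dnn as dnn   # Python inference module on the RDK board (run this on the X5)

def bgr_to_nv12(bgr, w, h):
    # Resize an OpenCV BGR image and convert it to the NV12 buffer the BPU model expects
    yuv420p = cv2.cvtColor(cv2.resize(bgr, (w, h)), cv2.COLOR_BGR2YUV_I420).reshape(-1)
    y = yuv420p[: w * h]
    uv = yuv420p[w * h:].reshape(2, -1)
    return np.concatenate([y, uv.transpose(1, 0).reshape(-1)])  # Y plane followed by interleaved UV

models = dnn.load("./yolov5s_tag2.0_640x640_nv12.bin")   # assumed path; use your converted .bin
h, w = models[0].inputs[0].properties.shape[2:4]          # expect 640x640 for this model
nv12 = bgr_to_nv12(cv2.imread("test.jpg"), w, h)
outputs = models[0].forward(nv12)                         # three feature-map outputs for YOLOv5
print([o.buffer.shape for o in outputs])                  # post-processing (decode + NMS) not shown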

Since the Model Zoo already provides well-commented sample code for each model, you can take out your RDK X5 development board and use the official examples to test whether your model conversion succeeded before writing any code of your own. The corresponding code lives in the cpp folder inside each model's directory in the Model Zoo.

image-20250119182801474

Open the source file inside and simply modify the model path, the number of classes, the label-name macro definitions, and the path of the test image.

image-20250119194106362

image-20250119194122882

Then compile and run the resulting executable to see the detection results:

mkdir build && cd build 
cmake ..
make
./main

image-20250119194224431

image-20250119194250893

Pay attention! The above operations are all done on the board! You can also take a look at this file to understand the model deployment process.

(1) Complete Cmake

Next, I will use the single-class YOLOv5 v2.0 model that I trained and converted earlier as an example and walk you through C++ model inference deployment from scratch. The RDK model inference API is divided into six main categories: inference-library information, model loading and release, model information, model inference, model memory operations, and model preprocessing. These six categories of APIs also correspond to the six steps our inference code should follow. First, we create the CMake file.

#step 1 Set project and version minimum requirements
 cmake_minimum_required(VERSION 2.8)
 project(rdk_yolov8_detect)
 #step 2 Set C++ standards
 set(CMAKE_CXX_STANDARD 11)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 #step 3 Set the compilation type
 if(NOT CMAKE_BUILD_TYPE)
     set(CMAKE_BUILD_TYPE Release)
 endif()
 message(STATUS "Build type: ${CMAKE_BUILD_TYPE}")
 #step 4 Set compilation options
 set(CMAKE_CXX_FLAGS_DEBUG " -Wall -Werror -g -O0 ")
 set(CMAKE_C_FLAGS_DEBUG " -Wall -Werror -g -O0 ")
 set(CMAKE_CXX_FLAGS_RELEASE " -Wall -Werror -O3 ")
 set(CMAKE_C_FLAGS_RELEASE " -Wall -Werror -O3 ")
 # Dependency settings
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wl,-unresolved-symbols=ignore-in-shared-libs")
 #step 5 Add external dependency packages
 find_package(OpenCV REQUIRED)# OpenCV
 #step 6 Set the RDK BPU library path
 set(DNN_PATH "/usr/include/dnn") # BPU header file path
 set(DNN_LIB_PATH "/usr/lib/") # BPU library file path
 #step 7 Add header file path
 include_directories(
     ${DNN_PATH}
     ${OpenCV_INCLUDE_DIRS}
 )
 #step 8 Add library file path
 link_directories(
     ${DNN_LIB_PATH}
 )
 #step 9 Add source files
 add_executable(main
     main.cc  # assumed source file name; it must match the file created below
 )
 #step 10 Link dependent libraries
 target_link_libraries(main
     ${OpenCV_LIBS} # OpenCV library
     dnn # RDK BPU library
     pthread # thread library
     rt # real-time library
     dl # dynamic link library
 )
 #step 11 Installation target
 install(TARGETS main
     RUNTIME DESTINATION bin
 )

(2) Complete header file import and macro definition

After writing the CMake file we can start writing our C++ code!!! First create a main.cc source file (the name is my choice here; it just has to match the one listed in add_executable) and include the necessary headers:

// C/C++ Standard Libraries
 #include <iostream> //Input and output streams
 #include <vector> // vector container
 #include <algorithm> // algorithm library
 #include <chrono> // Time related functions
 #include <iomanip> // Input and output format control

 // Thrid Party Libraries
 #include <opencv2/opencv.hpp> // OpenCV main header file
 #include <opencv2/dnn/dnn.hpp> // OpenCV deep learning module

 // RDK BPU libDNN API
 #include "dnn/hb_dnn.h" //BPU basic functions
 #include "dnn/hb_dnn_ext.h" // BPU extension function
 #include "dnn/plugin/hb_dnn_layer.h" // BPU layer definition
 #include "dnn/plugin/hb_dnn_plugin.h" // BPU plug-in
 #include "dnn/hb_sys.h" // BPU system functions

Then, to keep the code tidy and configurable, we define the detection parameters as macros: model path, number of classes, confidence threshold and so on. We also add an error-checking macro so we can verify whether each API call succeeded. Finally, since the display requirements differ between debugging and normal runs, we add two more macros, DETECT_MODE and ENABLE_DRAW, which select single-image versus real-time inference and toggle the drawing and display functions respectively.

// Error checking macro
 #define RDK_CHECK_SUCCESS(value, errmsg) \
     do \
     { \
         auto ret_code = value; \
         if (ret_code != 0) \
         { \
             std::cout << errmsg << ", error code:" << ret_code; \
             return ret_code; \
         } \
     } while (0);
    
 //Default parameter definition
 #define DEFAULT_MODEL_PATH "/root/Deep_Learning/YOLOv5/models/tennis_detect_640x640_bayese_.bin" //Model path
 #define DEFAULT_CLASSES_NUM 1 //Model category
 #define DEFAULT_NMS_THRESHOLD 0.45f //NMS threshold, default 0.45
 #define DEFAULT_SCORE_THRESHOLD 0.25f // Confidence threshold, default 0.25
 #define DEFAULT_NMS_TOP_K 300 //Number of first K frames selected by NMS, default 300
 #define DEFAULT_FONT_SIZE 1.0f // Font size for drawing labels, default 1.0
 #define DEFAULT_FONT_THICKNESS 1.0f // Font thickness of drawing labels, default 1.0
 #define DEFAULT_LINE_SIZE 2.0f // Line width for drawing rectangular box, default 2.0
 #define DETECT_MODE 0 //Selection of inference mode 0 for single picture, 1 for real-time detection
 #define ENABLE_DRAW 0 // 1: enable drawing, 0: disable drawing

(3) BPU detection class encapsulation

We encapsulate the inference code in a BPU_Detect class with three main public interfaces: Init(), Detect() and Release(), which initialize the BPU and the model, run detection, and release resources respectively. To support these, we also create several internal helper functions: LoadModel(), GetModelInfo(), PreProcess(), Inference(), PostProcess(), DrawResults() and PrintResults(), used respectively to load the model, query model information, preprocess images, run inference, post-process, draw results and print formatted results.

class BPU_Detect {
     public:
     BPU_Detect(const std::string& model_path = DEFAULT_MODEL_PATH,
                  int classes_num = DEFAULT_CLASSES_NUM,
                  float nms_threshold = DEFAULT_NMS_THRESHOLD,
                  float score_threshold = DEFAULT_SCORE_THRESHOLD,
                  int nms_top_k = DEFAULT_NMS_TOP_K,
                  int d_mode = DETECT_MODE);
         ~BPU_Detect(); // Destructor
         bool Init(); // Initialize BPU and model
         bool Detect(const cv::Mat& input_img, cv::Mat& output_img); //Perform detection
         bool Release(); // Release resources
     private:
     bool LoadModel(); // Load model
         void GetModelInfo(); // Get model information
         bool PreProcess(const cv::Mat& input_img); // Image preprocessing
         bool Inference(); // Model inference
         bool PostProcess(); // Post-processing
         void DrawResults(cv::Mat& img); // Draw results
         void PrintResults() const; // Print detection results
 //Member variables (arranged according to constructor initialization order)
         std::string model_path_; //Model file path
         int classes_num_; // Number of categories
         float nms_threshold_; // NMS threshold
         float score_threshold_; // Confidence threshold
         int nms_top_k_; //The maximum number of frames retained by NMS
         bool is_initialized_; // Initialization status flag
         float font_size_; // draw text size
         float font_thickness_; // draw text thickness
        float line_size_; // draw line thickness
        // ...more private member variables are added in the sections below
};

We start by completing the constructor and destructor. The constructor takes all of our macro-defined defaults as parameters and sets up the small, medium and large anchors, while the destructor releases our resources.

PS: what are anchors? In computer vision, especially object detection, anchors are a set of predefined bounding boxes used to match objects in the input image. Their sizes, shapes and positions are usually fixed before training to handle targets of different scales. In short, an anchor can be regarded as a "reference box": it covers a region of the image in advance, and the model predicts the actual target's position and size relative to these predefined boxes.
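To make the "reference box" idea concrete, here is a tiny NumPy illustration of a classic anchor-based decode. This is a generic formulation for illustration only, not necessarily the exact decode used by YOLOv5 v2.0, which we will meet later in the post-processing code:

import numpy as np

anchor_w, anchor_h = 116.0, 90.0      # one of the "large" anchors listed in the code below
grid_x, grid_y, stride = 10, 7, 32    # the grid cell this prediction belongs to (example values)

# Raw network outputs for this anchor at this cell: offsets (tx, ty) and log-scales (tw, th)
tx, ty, tw, th = 0.3, -0.1, 0.2, 0.5

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
cx = (grid_x + sigmoid(tx)) * stride  # box center: the cell position plus a learned offset
cy = (grid_y + sigmoid(ty)) * stride
bw = anchor_w * np.exp(tw)            # box size: the anchor scaled by a learned factor
bh = anchor_h * np.exp(th)
print(f"decoded box: center=({cx:.1f}, {cy:.1f}), size=({bw:.1f} x {bh:.1f})")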

//Add private class member variables
 class BPU_Detect {
     private:
     std::vector<std::string> class_names_; // Category names
         std::vector<std::pair<double, double>> s_anchors_;
         std::vector<std::pair<double, double>> m_anchors_;
         std::vector<std::pair<double, double>> l_anchors_;
 }
 //Constructor implementation
 BPU_Detect::BPU_Detect(const std::string& model_path,
                           int classes_num,
                           float nms_threshold,
                           float score_threshold,
                           int nms_top_k)
     : model_path_(model_path),
       classes_num_(classes_num),
       nms_threshold_(nms_threshold),
       score_threshold_(score_threshold),
       nms_top_k_(nms_top_k),
       is_initialized_(false),
       font_size_(DEFAULT_FONT_SIZE),
       font_thickness_(DEFAULT_FONT_THICKNESS),
       line_size_(DEFAULT_LINE_SIZE) {
     class_names_ = {CLASSES_LIST}; // Initialize class names
     std::vector<float> anchors = {10.0, 13.0, 16.0, 30.0, 33.0, 23.0,
                                  30.0, 61.0, 62.0, 45.0, 59.0, 119.0,
                                  116.0, 90.0, 156.0, 198.0, 373.0, 326.0};//Initialize anchors
     //Set small, medium, large anchors
     for(int i = 0; i < 3; i++) {
         s_anchors_.push_back({anchors[i*2], anchors[i*2+1]});
         m_anchors_.push_back({anchors[i*2+6], anchors[i*2+7]});
         l_anchors_.push_back({anchors[i*2+12], anchors[i*2+13]});
     }
 }
 // Destructor implementation
 BPU_Detect::~BPU_Detect() {
     if(is_initialized_) {
         Release();
     }
 }

(4) Complete the private LoadModel() function

We then implement the private LoadModel() function. From the official API manual we can see that two loading methods are provided: loading from files and loading from memory. Comparing the two, hbDNNInitializeFromFiles is slower because of file I/O but the code is simple, and since the model file is stored separately it is better suited to development and debugging; hbDNNInitializeFromDDR reads directly from memory, so it is faster and suits embedded systems or scenarios that need fast loading, but the code is more complex and is closer to the way TensorRT loads models. The two APIs are described as follows:

/**
  * Creates and initializes Horizon DNN Networks from file list
  * @param[out] packedDNNHandle Horizon DNN handle, pointing to multiple models
  * @param[in] modelFileNames path to the model file
  * @param[in] modelFileCount number of model files
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNInitializeFromFiles(hbPackedDNNHandle_t *packedDNNHandle,
                                  char const **modelFileNames,
                                  int32_t modelFileCount);
 /**
  * Creates and initializes Horizon DNN Networks from memory
  * @param[out] packedDNNHandle Horizon DNN handle, pointing to multiple models
  * @param[in] modelData pointer to the model file
  * @param[in] modelDataLengths The length of model data
  * @param[in] modelDataCount The number of model data
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNInitializeFromDDR(hbPackedDNNHandle_t *packedDNNHandle,
                                const void **modelData,
                                int32_t *modelDataLengths,
                                int32_t modelDataCount);

Both APIs take the model as input and return the model handle through an hbPackedDNNHandle_t, so to use either one we first need an hbPackedDNNHandle_t private member variable packed_dnn_handle_. Since both loading methods are fairly common, we introduce how to use each of them:

        We start with the simpler FromFiles API. Since the model path was defined earlier as a macro, we first obtain a character pointer to the path string and then call the model loading API through our error-checking macro:

//Add a private class member variable
 class BPU_Detect {
     private:
         hbPackedDNNHandle_t packed_dnn_handle_; // packed model handle
 };
 // Method One FromFiles
 const char* model_file_name = model_path_.c_str(); //Get the file path character pointer
 RDK_CHECK_SUCCESS(
         hbDNNInitializeFromFiles(&packed_dnn_handle_, &model_file_name, 1),
         "Initialize model from file failed");//Call the model loading API

        Next we introduce how to load the model from memory. The core of this step is getting the file into memory: we open the model file with the C standard library, move the file pointer to the end to obtain the file size, then allocate memory for the model with malloc. After reading the model data into that memory and verifying that the whole file was read, we prepare the model-data array and length array and call the RDK API to initialize the model from memory. As you can see, this is considerably more work than the previous API:

//Add a private class member variable
 class BPU_Detect {
     private:
         hbPackedDNNHandle_t packed_dnn_handle_; // packed model handle
 };
 FILE* fp = fopen(model_path_.c_str(), "rb"); // Open the model file
 if (!fp) {
     std::cout << "Failed to open model file: " << model_path_ << std::endl;
     return false;
 }
 // Get file size:
 fseek(fp, 0, SEEK_END); // 1. Move the file pointer to the end
 size_t model_size = static_cast<size_t>(ftell(fp)); // 2. Get the current position (i.e. file size)
 fseek(fp, 0, SEEK_SET); // 3. Reset the file pointer to the beginning

 // Allocate memory for model data
 void* model_data = malloc(model_size);
 if (!model_data) {
     std::cout << "Failed to allocate memory for model data" << std::endl;
     fclose(fp);
     return false;
 }
 //Read model data into memory
 size_t read_size = fread(model_data, 1, model_size, fp);
 fclose(fp);

 // Verify that the file has been read completely
 if (read_size != model_size) {
     std::cout << "Failed to read model data, expected " << model_size
              << " bytes, but got " << read_size << " bytes" << std::endl;
     free(model_data);
     return false;
 }

 // Prepare model data array and length array
 const void* model_data_array[] = {model_data};
 int32_t model_data_length[] = {static_cast<int32_t>(model_size)};
 // Initialize the model from memory using the BPU API
 RDK_CHECK_SUCCESS(
     hbDNNInitializeFromDDR(&packed_dnn_handle_, model_data_array, model_data_length, 1),
     "Initialize model from DDR failed");

 // Release temporarily allocated memory
 free(model_data);

        That's it for our LoadModel()!!! We add one more macro definition to choose between the two loading methods; the complete code is as follows:

#define LOAD_FROM_DDR 0 // 0: Load model from file, 1: Load model from memory
 //Two implementations of loading models
 bool BPU_Detect::LoadModel() {
 #if LOAD_FROM_DDR
     //Read model data from file to memory
     auto read_start = std::chrono::high_resolution_clock::now();
     FILE* fp = fopen(model_path_.c_str(), "rb");
     if (!fp) {
         std::cout << "Failed to open model file: " << model_path_ << std::endl;
         return false;
     }
     // Get file size
     fseek(fp, 0, SEEK_END);
     size_t model_size = static_cast<size_t>(ftell(fp));
     fseek(fp, 0, SEEK_SET);
     // Allocate memory and read model data
     void* model_data = malloc(model_size);
     if (!model_data) {
         std::cout << "Failed to allocate memory for model data" << std::endl;
         fclose(fp);
         return false;
     }
     size_t read_size = fread(model_data, 1, model_size, fp);
     fclose(fp);
     if (read_size != model_size) {
         std::cout << "Failed to read model data, expected " << model_size
                  << " bytes, but got " << read_size << " bytes" << std::endl;
         free(model_data);
         return false;
     }
     //Load model from memory
     auto init_start = std::chrono::high_resolution_clock::now();
    
     const void* model_data_array[] = {model_data};
     int32_t model_data_length[] = {static_cast<int32_t>(model_size)};
     RDK_CHECK_SUCCESS(
         hbDNNInitializeFromDDR(&packed_dnn_handle_, model_data_array, model_data_length, 1),
         "Initialize model from DDR failed");
     // release memory
     free(model_data);
 #else
     //Load model from file
     const char* model_file_name = model_path_.c_str();
     RDK_CHECK_SUCCESS(
         hbDNNInitializeFromFiles(&packed_dnn_handle_, &model_file_name, 1),
         "Initialize model from file failed");
 #endif
     return true;
 }

(5) Complete the private GetModelInfo() function

We continue with our GetModelInfo() function, which retrieves the model's basic information: the model name list, the model handle, the input information and the output information. Consulting the official API manual, the APIs for obtaining this information are the following:

  • hbDNNGetModelNameList() is used to get the name list and count of the models pointed to by packedDNNHandle
/**
  * Get model names from given packed handle
  * @param[out] modelNameList model name list
  * @param[out] modelNameCount number of model names
  * @param[in] packedDNNHandle Horizon DNN handle, pointing to multiple models
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNGetModelNameList(char const ***modelNameList,
                               int32_t *modelNameCount,
                               hbPackedDNNHandle_t packedDNNHandle);
  • hbDNNGetModelHandle() obtains the handle of one model from the model list pointed to by packedDNNHandle; the returned dnnHandle can be used by the caller across functions and threads
/**
  * Get DNN Network handle from packed Handle with given model name
  * @param[out] dnnHandle DNN handle, pointing to a model
  * @param[in] packedDNNHandle DNN handle, pointing to multiple models
  * @param[in] modelName model name
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNGetModelHandle(hbDNNHandle_t *dnnHandle,
                             hbPackedDNNHandle_t packedDNNHandle,
                             char const *modelName);
  • hbDNNGetInputCount() is used to get the number of input tensors of the model pointed to by dnnHandle
/**
  * Get input count
  * @param[out] inputCount The number of model input tensors
  * @param[in] dnnHandle DNN handle, pointing to a model
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNGetInputCount(int32_t *inputCount, hbDNNHandle_t dnnHandle);
  • hbDNNGetInputName() is used to get the name of an input tensor of the model pointed to by dnnHandle
/**
  * Get model input name
  * @param[out] name The name of the model input tensor
  * @param[in] dnnHandle DNN handle, pointing to a model
  * @param[in] inputIndex The number of the model input tensor
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNGetInputName(char const **name, hbDNNHandle_t dnnHandle,
                           int32_t inputIndex);
  • hbDNNGetInputTensorProperties() is used to get the properties of a specific input tensor of the model pointed to by dnnHandle
/**
  * Get input tensor properties
  * @param[out] properties input tensor information
  * @param[in] dnnHandle DNN handle, pointing to a model
  * @param[in] inputIndex The number of the model input tensor
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNGetInputTensorProperties(hbDNNTensorProperties *properties,
                                       hbDNNHandle_t dnnHandle,
                                       int32_t inputIndex);
  • hbDNNGetOutputCount() is used to get the number of output tensors of the model pointed to by dnnHandle
/**
  * Get output count
  * @param[out] outputCount The number of model output tensors
  * @param[in] dnnHandle DNN handle, pointing to a model
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNGetOutputCount(int32_t *outputCount, hbDNNHandle_t dnnHandle);
  • hbDNNGetOutputName() is used to get the name of an output tensor of the model pointed to by dnnHandle
/**
  * Get model output name
  * @param[out] name The name of the model output tensor
  * @param[in] dnnHandle DNN handle, pointing to a model
  * @param[in] outputIndex The number of the model output tensor
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNGetOutputName(char const **name, hbDNNHandle_t dnnHandle,
                            int32_t outputIndex);
  • hbDNNGetOutputTensorProperties() is used to get the properties of a specific output tensor of the model pointed to by dnnHandle
/**
  * Get output tensor properties
  * @param[out] properties output tensor information
  * @param[in] dnnHandle DNN handle, pointing to a model
  * @param[in] outputIndex The number of the model output tensor
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNGetOutputTensorProperties(hbDNNTensorProperties *properties,
                                        hbDNNHandle_t dnnHandle,
                                        int32_t outputIndex);

In this step, however, we only need a handful of these APIs to obtain our model's basic information. First we use hbDNNGetModelNameList() to get the number of models packed into the .bin we loaded. Since we know we only converted YOLOv5, if the .bin turns out to contain more than one model, something went wrong with the conversion. So, following the API's requirements, we create two variables to receive the model list and the count, call the API, and then check that the count is correct. The code is as follows:

//Add a private class member variable
 class BPU_Detect {
     private:
         const char* model_name_; // model name
 };
 // Get the model name list and quantity
 const char** model_name_list; //Create model list variables
 int model_count = 0; //Create model packaging quantity variable
 RDK_CHECK_SUCCESS(
         hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle_),
         "hbDNNGetModelNameList failed");
 if(model_count > 1) {
     std::cout << "Model count: " << model_count << std::endl;
     std::cout << "Please check the model count!" << std::endl;
     return false;
 }
 model_name_ = model_name_list[0];

After checking that the model list is correct, we can obtain the dnnHandle, a handle the caller can use across functions and threads. Following the API's requirements we first create an hbDNNHandle_t private member variable, and then we can call the API directly:

//Add a private class member variable
 class BPU_Detect {
     private:
         hbDNNHandle_t dnn_handle_; //Model handle
 };
 // Get model handle
 RDK_CHECK_SUCCESS(
     hbDNNGetModelHandle(&dnn_handle_, packed_dnn_handle_, model_name_),
     "hbDNNGetModelHandle failed");

With the model handle created, we can obtain the input information. Two APIs are involved here: hbDNNGetInputCount, which returns the number of model inputs, and hbDNNGetInputTensorProperties, which returns the properties of an input tensor. Again, because we are using the YOLOv5 detection model, our model should have exactly one input; if there are several, our model is wrong. We also notice that hbDNNGetInputTensorProperties outputs an hbDNNTensorProperties structure. Looking at its definition, it is a nested structure containing the hbDNNTensorShape, hbDNNQuantiShift, hbDNNQuantiScale and hbDNNQuantiType structures, which together describe the input tensor precisely. The structure definitions and the meaning of each member are as follows:

typedef struct {
   int32_t dimensionSize[HB_DNN_TENSOR_MAX_DIMENSIONS];//Indicates the size of each dimension of the tensor. HB_DNN_TENSOR_MAX_DIMENSIONS indicates the maximum number of dimensions that the tensor can have.
   int32_t numDimensions;//The number of dimensions of the tensor, indicating how many dimensions the tensor is
 } hbDNNTensorShape;

 typedef struct {
   int32_t shiftLen;//Offset length during quantization, indicating the amount of offset data
   uint8_t *shiftData; //Pointer to offset data. These data are usually used to shift tensor data during the quantization process.
 } hbDNNQuantiShift;

 typedef struct {
   int32_t scaleLen;//The length of the scaling factor, indicating how many scaling factors there are
   float *scaleData; //Pointer to scaling factor data, usually used to adjust the size of tensor data during quantization
   int32_t zeroPointLen;//The length of the zero point, indicating the number of zero point data
   int8_t *zeroPointData;//Pointer to zero point data, which is used to adjust the zero point of the tensor during the quantization process
 } hbDNNQuantiScale;

 typedef enum {
   NONE, //No quantification
   SHIFT, //Use displacement quantization
   SCALE//Use scaling quantization
 } hbDNNQuantiType;

 typedef struct {
   hbDNNTensorShape validShape; //The valid shape of the tensor, indicating the true size of the tensor
   hbDNNTensorShape alignedShape; //The aligned shape of the tensor, indicating the aligned tensor size
   int32_t tensorLayout;//Tensor layout, indicating how data is organized in memory
   int32_t tensorType;//The data type of the tensor, indicating the data type of the elements in the tensor
   hbDNNQuantiShift shift;//Offset information in quantization
   hbDNNQuantiScale scale;//Scaling information in quantization
   hbDNNQuantiType quantiType; //Quantization type, indicating whether quantization uses displacement, scaling or no quantization
   int32_t quantizeAxis;//The axis of quantization, indicating in which dimension the quantization operation is applied
   int32_t alignedByteSize;//The aligned byte size, indicating the size of the tensor after alignment in memory
   int32_t stride[HB_DNN_TENSOR_MAX_DIMENSIONS];//The stride of each dimension represents the element interval of each dimension of the tensor and supports the maximum number of dimensions HB_DNN_TENSOR_MAX_DIMENSIONS
 } hbDNNTensorProperties;

After understanding these structures, we can define our variables based on their members. Because we know the model has a single input, that the input data should be NV12, that the data layout is NCHW, and that the valid shape of the input tensor should be (1, 3, H, W), the information returned by the API lets us both read the input size and perform some safety checks. So we first add the necessary private class member variables, then call the two APIs hbDNNGetInputCount and hbDNNGetInputTensorProperties to obtain the input information, and finally verify the number of inputs and the input tensor properties:

//Add private class member variables
 class BPU_Detect {
     private:
         //Model input parameters
         int input_h_;  // Input height
         int input_w_;  // Input width
         hbDNNTensorProperties input_properties_; // Input tensor properties
 };
 // Get input information
 int32_t input_count = 0;
 RDK_CHECK_SUCCESS(
     hbDNNGetInputCount(&input_count, dnn_handle_),
     "hbDNNGetInputCount failed");
 RDK_CHECK_SUCCESS(
     hbDNNGetInputTensorProperties(&input_properties_, dnn_handle_, 0),
     "hbDNNGetInputTensorProperties failed");
 /*-------------------------------- The following is a model safety check --------------------------------*/
 //Check the number of model inputs
 if(input_count > 1){
     std::cout << "Model input node is greater than 1, please check!" << std::endl;
     return false;
 }
 //Check the input type of the model
 if(input_properties_.tensorType == HB_DNN_IMG_TYPE_NV12){
     std::cout << "Input tensor type: HB_DNN_IMG_TYPE_NV12" << std::endl;
 }
 else{
     std::cout << "The input tensor type is not HB_DNN_IMG_TYPE_NV12, please check!" << std::endl;
     return false;
 }
 //Check the input data arrangement of the model
 if(input_properties_.tensorLayout == HB_DNN_LAYOUT_NCHW){
     std::cout << "Input tensor data layout: HB_DNN_LAYOUT_NCHW" << std::endl;
 }
 else{
     std::cout << "The input tensor data layout is not HB_DNN_LAYOUT_NCHW, please check!" << std::endl;
     return false;
 }
 // Check the valid shape of the model input Tensor data
 input_h_ = input_properties_.validShape.dimensionSize[2];
 input_w_ = input_properties_.validShape.dimensionSize[3];
 if (input_properties_.validShape.numDimensions == 4)
 {
     std::cout << "The input size is: (" << input_properties_.validShape.dimensionSize[0];
     std::cout << ", " << input_properties_.validShape.dimensionSize[1];
     std::cout << ", " << input_h_;
     std::cout << ", " << input_w_ << ")" << std::endl;
 }
 else
 {
     std::cout << "The input size is not (1,3,640,640), please check!" << std::endl;
     return false;
 }

After the input has been obtained and checked, our output can't be left behind! We use hbDNNGetOutputCount to get the number of outputs; since we know YOLOv5 should have three outputs, we can also sanity-check the model here. After getting the output count, we create a private class member output_tensors_ of type hbDNNTensor* and use it to allocate memory for the model's outputs:

//Add private class member variables
 class BPU_Detect {
     private:
         hbDNNTensor* output_tensors_; // Output tensor array
 };
 //Model output quantity check
 int32_t output_count = 0;
 RDK_CHECK_SUCCESS(
     hbDNNGetOutputCount(&output_count, dnn_handle_),
     "hbDNNGetOutputCount failed");
 //Allocate output tensor memory
 output_tensors_ = new hbDNNTensor[output_count];

There is still one very important step to complete here. Since YOLOv5 has 3 output heads, corresponding to 3 feature maps at different scales, we also need to make sure the outputs are processed in the order: small targets (8x downsampling) -> medium targets (16x downsampling) -> large targets (32x downsampling). To do this, we first define an output order array output_order_[3] and initialize it with the default order, then define the feature map size and channel count we expect for each scale. A nested for loop then walks over each expected output scale: if the actual feature map size and channel count of an output node match what we expect, we record that node as the correct output for this scale.

//Add private class member variables
 class BPU_Detect {
     private:
         int output_order_[3]; // Output order mapping
 };
 //Initialize default order
 output_order_[0] = 0; // Default 1st output
 output_order_[1] = 1; // Default 2nd output
 output_order_[2] = 2; // Default 3rd output
 // Define the desired output feature map size and number of channels
 int32_t expected_shapes[3][3] = {
     {H_8, W_8, 3 * (5 + classes_num_)}, // Small target feature map: H/8 x W/8
     {H_16, W_16, 3 * (5 + classes_num_)}, // Medium target feature map: H/16 x W/16
     {H_32, W_32, 3 * (5 + classes_num_)} // Large target feature map: H/32 x W/32
 };
 // Iterate through each desired output scale
 for(int i = 0; i < 3; i++) {
     // Traverse the actual output nodes
     for(int j = 0; j < 3; j++) {
         hbDNNTensorProperties output_properties;//Get the properties of the current output node
         RDK_CHECK_SUCCESS(
             hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, j),
             "Get output tensor properties failed");
         // Get the actual feature map size and number of channels
          int32_t actual_h = output_properties.validShape.dimensionSize[1];
          int32_t actual_w = output_properties.validShape.dimensionSize[2];
          int32_t actual_c = output_properties.validShape.dimensionSize[3];
         // If actual size and number of channels match expected
         if(actual_h == expected_shapes[i][0] &&
            actual_w == expected_shapes[i][1] &&
            actual_c == expected_shapes[i][2]) {
             output_order_[i] = j; // Record the correct output order
             break;
         }
     }
 }

        At this point our GetModelInfo() function is complete!!! The specific complete code is as follows:

// Get model information implementation
 bool BPU_Detect::GetModelInfo() {
     // Get the list of model names
     const char** model_name_list;
     int model_count = 0;
     RDK_CHECK_SUCCESS(
         hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle_),
         "hbDNNGetModelNameList failed");
     if(model_count > 1) {
         std::cout << "Model count: " << model_count << std::endl;
         std::cout << "Please check the model count!" << std::endl;
         return false;
     }
     model_name_ = model_name_list[0];
     // Get model handle
     RDK_CHECK_SUCCESS(
         hbDNNGetModelHandle(&dnn_handle_, packed_dnn_handle_, model_name_),
         "hbDNNGetModelHandle failed");
     // Get input information
     int32_t input_count = 0;
     RDK_CHECK_SUCCESS(
         hbDNNGetInputCount(&input_count, dnn_handle_),
         "hbDNNGetInputCount failed");
     RDK_CHECK_SUCCESS(
         hbDNNGetInputTensorProperties(&input_properties_, dnn_handle_, 0),
         "hbDNNGetInputTensorProperties failed");

     if(input_count > 1){
         std::cout << "Model input node is greater than 1, please check!" << std::endl;
         return false;
     }
     if(input_properties_.tensorType == HB_DNN_IMG_TYPE_NV12){
         std::cout << "Input tensor type: HB_DNN_IMG_TYPE_NV12" << std::endl;
     }
     else{
         std::cout << "The input tensor type is not HB_DNN_IMG_TYPE_NV12, please check!" << std::endl;
         return false;
     }
     if(input_properties_.tensorLayout == HB_DNN_LAYOUT_NCHW){
         std::cout << "Input tensor data layout: HB_DNN_LAYOUT_NCHW" << std::endl;
     }
     else{
         std::cout << "The input tensor data layout is not HB_DNN_LAYOUT_NCHW, please check!" << std::endl;
         return false;
     }
     // Get input size
     input_h_ = input_properties_.validShape.dimensionSize[2];
     input_w_ = input_properties_.validShape.dimensionSize[3];
     if (input_properties_.validShape.numDimensions == 4)
     {
         std::cout << "The input size is: (" << input_properties_.validShape.dimensionSize[0];
         std::cout << ", " << input_properties_.validShape.dimensionSize[1];
         std::cout << ", " << input_h_;
         std::cout << ", " << input_w_ << ")" << std::endl;
     }
     else
     {
         std::cout << "The input size is not (1,3,640,640), please check!" << std::endl;
         return false;
     }
     // Get the output information and adjust the output order
     int32_t output_count = 0;
     RDK_CHECK_SUCCESS(
         hbDNNGetOutputCount(&output_count, dnn_handle_),
         "hbDNNGetOutputCount failed");
     //Allocate output tensor memory
     output_tensors_ = new hbDNNTensor[output_count];
     // =============== Adjust output header sequence mapping ===============
     // YOLOv5 has 3 output heads, corresponding to 3 different scales of feature maps.
     // Need to ensure that the output order is: small target (8x downsampling) -> medium target (16x downsampling) -> large target (32x downsampling)
     //Initialize default order
     output_order_[0] = 0; // Default 1st output
     output_order_[1] = 1; // Default 2nd output
     output_order_[2] = 2; // Default 3rd output
     // Define the desired output feature map size and number of channels
     int32_t expected_shapes[3][3] = {
         {H_8, W_8, 3 * (5 + classes_num_)}, // Small target feature map: H/8 x W/8
         {H_16, W_16, 3 * (5 + classes_num_)}, // Medium target feature map: H/16 x W/16
         {H_32, W_32, 3 * (5 + classes_num_)} // Large target feature map: H/32 x W/32
     };
     // Iterate through each desired output scale
     for(int i = 0; i < 3; i++) {
         // Traverse the actual output nodes
         for(int j = 0; j < 3; j++) {
             // Get the properties of the current output node
             hbDNNTensorProperties output_properties;
             RDK_CHECK_SUCCESS(
                 hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, j),
                 "Get output tensor properties failed");
             // Get the actual feature map size and number of channels
             int32_t actual_h = output_properties.validShape.dimensionSize[1];
             int32_t actual_w = output_properties.validShape.dimensionSize[2];
             int32_t actual_c = output_properties.validShape.dimensionSize[3];

             // If actual size and number of channels match expected
             if(actual_h == expected_shapes[i][0] &&
                actual_w == expected_shapes[i][1] &&
                actual_c == expected_shapes[i][2]) {
                 //Record the correct output sequence
                 output_order_[i] = j;
                 break;
             }
         }
     }
     //Print out sequence mapping information
     std::cout << "\n============ Output Order Mapping ============" << std::endl;
     std::cout << "Small object (1/" << 8 << "): output[" << output_order_[0] << "]" << std::endl;
     std::cout << "Medium object (1/" << 16 << "): output[" << output_order_[1] << "]" << std::endl;
     std::cout << "Large object (1/" << 32 << "): output[" << output_order_[2] << "]" << std::endl;
     std::cout << "==========================================\  n" << std::endl;

     return true;
 }

(6) Complete the private PreProcess() function

Next we can complete the model's preprocessing function. Image preprocessing is really just image resizing plus format conversion, so this part is relatively simple and I will go through it a bit faster. We use the letterbox method for resizing. As we all know, OpenCV has a resize function that can directly change the image size, but because its implementation is rather crude, it changes the aspect ratio whenever the source and target sizes do not match, which distorts the image. For example, in the following case you can see that the image on the right has been distorted:

image-20250120142123344

When we use the LetterBox method instead, the picture is not distorted, because LetterBox keeps the aspect ratio of the original image and scales it proportionally: once the long side has been resized to the required length, the remaining part along the short side is filled with gray, so the aspect ratio of the original image is preserved.

image-20250120142307932

So next we use LetterBox to implement image preprocessing. The specific code is as follows. The core idea is to scale the image proportionally so that it fits the target size while keeping the original aspect ratio; to keep the image centered within the target dimensions, the empty areas are padded with a neutral color (such as (127, 127, 127)). This way we avoid distortion when the image is scaled and the aspect ratio stays unchanged.

//Add private class member variables
 class BPU_Detect {
     private:
         float x_scale_;        // X-direction scaling ratio
         float y_scale_;        // Y-direction scaling ratio
         int x_shift_;          // X-direction offset
         int y_shift_;          // Y-direction offset
         cv::Mat resized_img_;  // Scaled image
         hbDNNTensor input_tensor_; // Input tensor
 };
 //Use letterbox method for preprocessing
 x_scale_ = std::min(1.0f * input_h_ / input_img.rows, 1.0f * input_w_ / input_img.cols);
 y_scale_ = x_scale_;

 int new_w = input_img.cols * x_scale_;
 x_shift_ = (input_w_ - new_w) / 2;
 int x_other = input_w_ - new_w - x_shift_;

 int new_h = input_img.rows * y_scale_;
 y_shift_ = (input_h_ - new_h) / 2;
 int y_other = input_h_ - new_h - y_shift_;

 cv::resize(input_img, resized_img_, cv::Size(new_w, new_h));
 cv::copyMakeBorder(resized_img_, resized_img_, y_shift_, y_other,
                    x_shift_, x_other, cv::BORDER_CONSTANT, cv::Scalar(127, 127, 127));
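
As a quick sanity check of the letterbox math above, here is a tiny standalone snippet that plugs in a hypothetical 1920x1080 frame (the frame size is only an assumed example, not something used elsewhere in this tutorial) together with the 640x640 model input:

 #include <algorithm>
 #include <cstdio>

 int main() {
     // Hypothetical source frame fed into the 640x640 model input
     int img_w = 1920, img_h = 1080, input_w = 640, input_h = 640;
     float scale = std::min(1.0f * input_h / img_h, 1.0f * input_w / img_w); // ~0.333
     int new_w = img_w * scale;            // 640 -> no horizontal padding
     int new_h = img_h * scale;            // 360
     int x_shift = (input_w - new_w) / 2;  // 0
     int y_shift = (input_h - new_h) / 2;  // 140 -> gray bands above and below
     std::printf("scale=%.3f new=%dx%d shift=(%d,%d)\n", scale, new_w, new_h, x_shift, y_shift);
     return 0;
 }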

After the resizing is done, we use OpenCV to convert the image to YUV I420 format, which we will then repack into NV12:

//Convert to NV12 format
 cv::Mat yuv_mat;
 cv::cvtColor(resized_img_, yuv_mat, cv::COLOR_BGR2YUV_I420);

After completing the image operations above, we have to start preparing the input data for the model! Next we need to convert the processed image data into an input format the model can accept. In this step we first allocate memory for the input tensor and copy the processed image data (YUV) into that memory, so the model can access it correctly. This involves the API hbSysAllocCachedMem; let's take a look at its description and the structures involved:

/**
 * Allocate cachable system memory
 * @param[out] mem
 * @param[in] size
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbSysAllocCachedMem(hbSysMem *mem, uint32_t size);

typedef struct {
  hbSysMem sysMem[4];
  hbDNNTensorProperties properties;
} hbDNNTensor;

typedef struct {
  uint64_t phyAddr;
  void *virAddr;
  uint32_t memSize;
 } hbSysMem;

According to the API, we first need an hbSysMem structure, which describes a block of memory by its physical address (phyAddr), virtual address (virAddr) and size (memSize). Next we call hbSysAllocCachedMem to allocate memory for the input tensor. The allocated memory is cacheable, which means the hardware can access it directly while processing the data, without frequent swapping with main memory. hbDNNTensor is the structure that stores the whole tensor: it contains several hbSysMem blocks describing the different parts of the data, while hbDNNTensorProperties stores the tensor's attribute information, such as its shape, data type and quantization parameters.

We first allocate cached memory for the input tensor with hbSysAllocCachedMem. sysMem[0] is the block that will hold the YUV data, and its size is the memory needed by the YUV image, i.e. 3 * input_h_ * input_w_ / 2: in the YUV 4:2:0 layout the Y, U and V components are stored separately, the Y plane takes input_h_ * input_w_ bytes, and the U and V planes together take half of that. We then copy the processed YUV image data from yuv_mat into ynv12, where ynv12 is the virtual address of the memory we just allocated, and interleave the U and V components so that the buffer ends up in NV12 format, as the model's input requires. Finally, once the data is ready, we call hbSysFlushMem to flush the memory cache. The specific implementation code is as follows:

// Prepare to input tensor
 hbSysAllocCachedMem(&input_tensor_.sysMem[0], int(3 * input_h_ * input_w_ / 2));
 uint8_t* yuv = yuv_mat.ptr<uint8_t>();
 uint8_t* ynv12 = (uint8_t*)input_tensor_.sysMem[0].virAddr;
 // Calculate the height and width of the UV part, and the size of the Y part
 int uv_height = input_h_ / 2;
 int uv_width = input_w_ / 2;
 int y_size = input_h_ * input_w_;
 //Copy the Y component data to the input tensor
 memcpy(ynv12, yuv, y_size);
 // Get the UV component position in NV12 format
 uint8_t* nv12 = ynv12 + y_size;
 uint8_t* u_data = yuv + y_size;
 uint8_t* v_data = u_data + uv_height * uv_width;
 //Write U and V components alternately into NV12 format
 for(int i = 0; i < uv_width * uv_height; i++) {
     *nv12++ = *u_data++;
     *nv12++ = *v_data++;
 }
 //Clear the memory cache to ensure that the data is ready for use by the model
 hbSysFlushMem(&input_tensor_.sysMem[0], HB_SYS_MEM_CACHE_CLEAN);//Clear the cache to ensure data synchronization

        At this point our PreProcess() function is complete!!!
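
Putting the snippets above together, a complete PreProcess() could look like the sketch below. It is assembled purely from the code shown in this section (same member variables, same letterbox and NV12 logic); as in the snippets, the return value of hbSysAllocCachedMem is not checked, and memcpy assumes <cstring> is available (it is usually pulled in indirectly by the OpenCV headers):

 // Preprocessing implementation (assembled from the snippets above)
 bool BPU_Detect::PreProcess(const cv::Mat& input_img) {
     // Letterbox: scale proportionally and pad the borders with gray (127)
     x_scale_ = std::min(1.0f * input_h_ / input_img.rows, 1.0f * input_w_ / input_img.cols);
     y_scale_ = x_scale_;

     int new_w = input_img.cols * x_scale_;
     x_shift_ = (input_w_ - new_w) / 2;
     int x_other = input_w_ - new_w - x_shift_;

     int new_h = input_img.rows * y_scale_;
     y_shift_ = (input_h_ - new_h) / 2;
     int y_other = input_h_ - new_h - y_shift_;

     cv::resize(input_img, resized_img_, cv::Size(new_w, new_h));
     cv::copyMakeBorder(resized_img_, resized_img_, y_shift_, y_other,
                        x_shift_, x_other, cv::BORDER_CONSTANT, cv::Scalar(127, 127, 127));

     // BGR -> YUV I420, then repack into NV12 for the BPU input tensor
     cv::Mat yuv_mat;
     cv::cvtColor(resized_img_, yuv_mat, cv::COLOR_BGR2YUV_I420);

     hbSysAllocCachedMem(&input_tensor_.sysMem[0], int(3 * input_h_ * input_w_ / 2));
     uint8_t* yuv = yuv_mat.ptr<uint8_t>();
     uint8_t* ynv12 = (uint8_t*)input_tensor_.sysMem[0].virAddr;
     int uv_height = input_h_ / 2;
     int uv_width = input_w_ / 2;
     int y_size = input_h_ * input_w_;
     memcpy(ynv12, yuv, y_size);            // copy the Y plane
     uint8_t* nv12 = ynv12 + y_size;        // UV plane starts right after Y
     uint8_t* u_data = yuv + y_size;
     uint8_t* v_data = u_data + uv_height * uv_width;
     for(int i = 0; i < uv_width * uv_height; i++) {
         *nv12++ = *u_data++;               // interleave U and V -> NV12
         *nv12++ = *v_data++;
     }
     hbSysFlushMem(&input_tensor_.sysMem[0], HB_SYS_MEM_CACHE_CLEAN); // sync cache for the BPU
     return true;
 }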

(7) Complete the private Inference() function

We are now going to complete our inference part. After consulting the user manual, we can see that the inference part mainly needs the following two APIs: hbDNNInfer executes the model inference, while hbDNNWaitTaskDone waits for the inference task to complete or time out, i.e. it blocks until the task finishes or the specified timeout is exceeded.

/**
  *DNN inference
  * @param[out] taskHandle: return a pointer represent the task if success, otherwise nullptr
  Returns a pointer representing the task. If successful, returns a pointer to the task handle. On failure, returns nullptr.
  * @param[out] output: pointer to the output tensor array, the size of array should be equal to $(`hbDNNGetOutputCount`)
  Pointer to an array of output tensors. The size of the array should be equal to the number returned by hbDNNGetOutputCount.
  * @param[in] input: input tensor array, the size of array should be equal to $(`hbDNNGetInputCount`)
  Pointer to the input tensor array. The size of the array should be equal to the number returned by hbDNNGetInputCount.
  * @param[in] dnnHandle: pointer to the dnn handle
  DNN handle, used to identify the model used by the inference task
  * @param[in] inferCtrlParam: infer control parameters
  Inference control parameters, used to set some configuration items during the inference process (such as whether to use acceleration, inference mode, etc.)
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNInfer(hbDNNTaskHandle_t *taskHandle, hbDNNTensor **output,
                    hbDNNTensor const *input, hbDNNHandle_t dnnHandle,
                    hbDNNInferCtrlParam *inferCtrlParam);
 /**
  * Wait util task completed or timeout.
  * @param[in] taskHandle: pointer to the task
  * @param[in] timeout: timeout of milliseconds
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbDNNWaitTaskDone(hbDNNTaskHandle_t taskHandle, int32_t timeout);

Then we look at how the hbDNNInferCtrlParam *inferCtrlParam parameter is defined and how it is initialized:

#define HB_DNN_INITIALIZE_INFER_CTRL_PARAM(param) \
   { \
     (param)->bpuCoreId = HB_BPU_CORE_ANY; \
     (param)->dspCoreId = HB_DSP_CORE_ANY; \
     (param)->priority = HB_DNN_PRIORITY_LOWEST; \
     (param)->more = false; \
     (param)->customId = 0; \
     (param)->reserved1 = 0; \
     (param)->reserved2 = 0; \
   }
 typedef struct {
   int32_t bpuCoreId;  // BPU core ID, used to specify which BPU core the inference task runs on
   int32_t dspCoreId;  // DSP core ID, used to specify which DSP core the inference task runs on
   int32_t priority;   // Priority of the inference task
   int32_t more;       // Whether there are more inference tasks, usually set to false
   int64_t customId;   // Custom ID, which can be used to identify inference tasks
   int32_t reserved1;  // Reserved field, not used yet
   int32_t reserved2;  // Reserved field, not used yet
 } hbDNNInferCtrlParam;

After understanding the above, we can start writing our inference part! Let's first complete some preparatory work. We create an inference task handle task_handle_ of type hbDNNTaskHandle_t, which uniquely identifies an inference task and makes task management easier, and initialize it to nullptr so that it is empty before the inference task starts. For each output tensor we first get its properties and then allocate memory according to its aligned size (alignedByteSize). The allocation is again done with hbSysAllocCachedMem, which guarantees that every output tensor gets a cache buffer of the right size, so that subsequent processing does not run out of bounds or hit access errors. So our code is as follows:

//Add private class member variables
 class BPU_Detect {
     private:
         hbDNNTaskHandle_t task_handle_; // Inference task handle
 };
 //Initialize the task handle to nullptr
 task_handle_ = nullptr;
 //Initialize input tensor attributes
 input_tensor_.properties = input_properties_;
 // Get the output tensor attributes
 for(int i = 0; i < 3; i++) {
     hbDNNTensorProperties output_properties;
     RDK_CHECK_SUCCESS(
         hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, i),
         "Get output tensor properties failed");
     output_tensors_[i].properties = output_properties;

     // Allocate memory for output
     int out_aligned_size = output_properties.alignedByteSize;
     RDK_CHECK_SUCCESS(
         hbSysAllocCachedMem(&output_tensors_[i].sysMem[0], out_aligned_size),
         "Allocate output memory failed");
 }

After the preparatory work is done, we can finally run the inference!!! We first create an hbDNNInferCtrlParam control parameter and initialize it with the officially provided HB_DNN_INITIALIZE_INFER_CTRL_PARAM macro, then call hbDNNInfer to execute the inference, and finally use hbDNNWaitTaskDone to wait for the inference task to finish:

hbDNNInferCtrlParam infer_ctrl_param;
HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
RDK_CHECK_SUCCESS(
        hbDNNInfer(&task_handle_, &output_tensors_, &input_tensor_, dnn_handle_, &infer_ctrl_param),
        "Model inference failed");
RDK_CHECK_SUCCESS(
    hbDNNWaitTaskDone(task_handle_, 0),
    "Wait task done failed");

        At this point our Inference() function is complete!!! The specific complete code is as follows:

//Inference implementation
 bool BPU_Detect::Inference() {
     //Initialize the task handle to nullptr
     task_handle_ = nullptr;
    
     //Initialize input tensor attributes
     input_tensor_.properties = input_properties_;
    
     // Get the output tensor attributes
     for(int i = 0; i < 3; i++) {
         hbDNNTensorProperties output_properties;
         RDK_CHECK_SUCCESS(
             hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, i),
             "Get output tensor properties failed");
         output_tensors_[i].properties = output_properties;
        
         // Allocate memory for output
         int out_aligned_size = output_properties.alignedByteSize;
         RDK_CHECK_SUCCESS(
             hbSysAllocCachedMem(&output_tensors_[i].sysMem[0], out_aligned_size),
             "Allocate output memory failed");
     }
    
     hbDNNInferCtrlParam infer_ctrl_param;
     HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
    
     RDK_CHECK_SUCCESS(
         hbDNNInfer(&task_handle_, &output_tensors_, &input_tensor_, dnn_handle_, &infer_ctrl_param),
         "Model inference failed");
    
     RDK_CHECK_SUCCESS(
         hbDNNWaitTaskDone(task_handle_, 0),
         "Wait task done failed");
    
     return true;
 }

(8) Complete the private ProcessFeatureMap() function

We still need to complete the post-processing helper ProcessFeatureMap. This function extracts the detection bounding boxes and their scores from one of the network's output feature maps and stores them for the subsequent NMS (non-maximum suppression) step. First, we check the quantization type (quantiType) of the output tensor: if it is not NONE, we print an error message and return, because the processing here assumes the output is unquantized floating-point data; quantized data would have to be handled differently.

if (output_tensor.properties.quantiType != NONE) {
    std::cout << "Output tensor quantization type should be NONE!" << std::endl;
    return;
}

Then, to make sure the data we read from memory is up to date, we call the hbSysFlushMem function to invalidate the memory cache. This synchronizes the cache with main memory and prevents read/write inconsistencies caused by stale cached data:

/**
 * Flush cachable system memory
 * @param[in] mem
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbSysFlushMem(hbSysMem *mem, int32_t flag);

hbSysFlushMem(&output_tensor.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);

Then we take the data address of the output tensor from output_tensor.sysMem[0].virAddr and cast it to float*; this address points to the raw data produced by the model inference:

auto* raw_data = reinterpret_cast<float*>(output_tensor.sysMem[0].virAddr);

Next we use three nested for loops to traverse every position of the output feature map (height and width). Each position holds a set of prediction data, including the bounding box center coordinates, width and height, and the class scores, and each anchor in anchors represents one possible target shape:

for(int h = 0; h < height; h++) {
    for(int w = 0; w < width; w++) {
        for(const auto& anchor : anchors) {

For each location we first read the current prediction data (bounding box position, class scores, etc.) and filter on the objectness confidence at that location (cur_raw[4], the raw probability that an object exists there). If the confidence is below the preset threshold (conf_thres_raw), we skip this position:

if(cur_raw[4] < conf_thres_raw) continue;

Next we find the largest class probability among the scores of all classes (cur_raw[5] through cur_raw[classes_num_ + 4]), i.e. we determine which target class the current anchor is predicted to belong to:

int cls_id = 5;
int end = classes_num_ + 5;
for(int i = 6; i < end; i++) {
    if(cur_raw[i] > cur_raw[cls_id]) {
        cls_id = i;
    }
}

After finding the maximum class probability, we can compute the final score of the current anchor. The final score is the product of the sigmoid of the objectness confidence and the sigmoid of the maximum class score; detections whose final score is below score_threshold_ are filtered out:

float score = 1.0f / (1.0f + std::exp(-cur_raw[4])) * 
              1.0f / (1.0f + std::exp(-cur_raw[cls_id]));
if(score < score_threshold_) continue;

Finally we decode the concrete position and size of the bounding box. Applying the sigmoid function to the raw outputs, we recover the center coordinates (cur_raw[0], cur_raw[1]) and the width and height (cur_raw[2], cur_raw[3]) and convert them back to the scale of the input image, then save the resulting bounding box and score into the containers of the corresponding class: bboxes_ (which stores the positions of all detection boxes) and scores_ (which stores the corresponding scores):

float stride = input_h_ / height;
float center_x = ((1.0f / (1.0f + std::exp(-cur_raw[0]))) * 2 - 0.5f + w) * stride;
float center_y = ((1.0f / (1.0f + std::exp(-cur_raw[1]))) * 2 - 0.5f + h) * stride;
float bbox_w = std::pow((1.0f / (1.0f + std::exp(-cur_raw[2]))) * 2, 2) * anchor.first;  // anchor width
float bbox_h = std::pow((1.0f / (1.0f + std::exp(-cur_raw[3]))) * 2, 2) * anchor.second; // anchor height
float bbox_x = center_x - bbox_w / 2.0f;
float bbox_y = center_y - bbox_h / 2.0f;

cls_id -= 5; // convert from raw index (offset by 5) to the real class id
bboxes_[cls_id].push_back(cv::Rect2d(bbox_x, bbox_y, bbox_w, bbox_h));
scores_[cls_id].push_back(score);

        At this point our ProcessFeatureMap() function is complete!!! The specific complete code is as follows:

// Feature map processing auxiliary function
 void BPU_Detect::ProcessFeatureMap(hbDNNTensor& output_tensor,
                                   int height, int width,
                                   const std::vector<std::pair<double, double>>& anchors,
                                   float conf_thres_raw) {
     // Check the quantization type
     if (output_tensor.properties.quantiType != NONE) {
         std::cout << "Output tensor quantization type should be NONE!" << std::endl;
         return;
     }
    
     // refresh memory
     hbSysFlushMem(&output_tensor.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);
    
     //Get the output data pointer
     auto* raw_data = reinterpret_cast<float*>(output_tensor.sysMem[0].virAddr);
    
     // Traverse each position of the feature map
     for(int h = 0; h < height; h++) {
         for(int w = 0; w < width; w++) {
             for(const auto& anchor : anchors) {
                 // Get prediction data for the current location
                 float* cur_raw = raw_data;
                 raw_data += (5 + classes_num_);
                
                 // Conditional probability filtering
                 if(cur_raw[4] < conf_thres_raw) continue;
                
                 // Find the maximum class probability
                 int cls_id = 5;
                 int end = classes_num_ + 5;
                 for(int i = 6; i < end; i++) {
                     if(cur_raw[i] > cur_raw[cls_id]) {
                         cls_id = i;
                     }
                 }
                
                 // Calculate final score
                 float score = 1.0f / (1.0f + std::exp(-cur_raw[4])) *
                             1.0f / (1.0f + std::exp(-cur_raw[cls_id]));
                
                 // score filter
                 if(score < score_threshold_) continue;
                 cls_id -= 5;
                
                 // decode bounding box
                 float stride = input_h_ / height;
                 float center_x = ((1.0f / (1.0f + std::exp(-cur_raw[0]))) * 2 - 0.5f + w) * stride;
                 float center_y = ((1.0f / (1.0f + std::exp(-cur_raw[1]))) * 2 - 0.5f + h) * stride;
                 float bbox_w = std::pow((1.0f / (1.0f + std::exp(-cur_raw[2]))) * 2, 2) * anchor.first;  // anchor width
                 float bbox_h = std::pow((1.0f / (1.0f + std::exp(-cur_raw[3]))) * 2, 2) * anchor.second; // anchor height
                 float bbox_x = center_x - bbox_w / 2.0f;
                 float bbox_y = center_y - bbox_h / 2.0f;
                
                 //Save test results
                 bboxes_[cls_id].push_back(cv::Rect2d(bbox_x, bbox_y, bbox_w, bbox_h));
                 scores_[cls_id].push_back(score);
             }
         }
     }
 }

(9) Complete the private PostProcess() function

After the inference is completed, of course it's time for post-processing. Our post-processing is mainly divided into three steps: clearing the previous results, processing the output feature maps, and performing NMS (non-maximum suppression) for each category. Before each round of inference and post-processing starts, we first clear the previously stored detection results: bboxes_ stores the detected bounding boxes, scores_ stores the score of each bounding box, and indices_ stores the indices kept after NMS for each category. Then we resize these containers according to the number of classes of the detection task (classes_num_) so they can hold results for every class, and convert the preset score_threshold_ into its raw logit form conf_thres_raw. (PS: this conversion matches the model's output format: the model outputs raw, pre-sigmoid values, so converting the threshold once lets us filter candidates without applying the sigmoid to every one of them; a short numeric check follows the snippet below.) The specific code is as follows:

//Add private class member variables
 class BPU_Detect {
     private:
         // Storage of detection results
         std::vector<std::vector<cv::Rect2d>> bboxes_;  // Bounding boxes for each category
         std::vector<std::vector<float>> scores_;       // Score for each category
         std::vector<std::vector<int>> indices_;        // Indices kept after NMS
         // YOLOv5 anchors information
         std::vector<std::pair<double, double>> s_anchors_; // Small target anchors
         std::vector<std::pair<double, double>> m_anchors_; // Medium target anchors
         std::vector<std::pair<double, double>> l_anchors_; // Large target anchors
 };

 bboxes_.clear(); // Clear bounding boxes
 scores_.clear(); // Clear scores
 indices_.clear(); // Clear indexes

 bboxes_.resize(classes_num_); // Adjust the size of the bounding box array according to the number of classes
 scores_.resize(classes_num_); // Adjust the size of the score array according to the number of categories
 indices_.resize(classes_num_); // Adjust the size of the index array according to the number of categories

 float conf_thres_raw = -log(1 / score_threshold_ - 1);
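
To see concretely why thresholding on the raw value is equivalent to thresholding on the sigmoid score, here is a tiny standalone check (0.25 is simply the default DEFAULT_SCORE_THRESHOLD used in this tutorial):

 #include <cmath>
 #include <cstdio>

 int main() {
     float score_threshold = 0.25f;                                    // default score_threshold_
     float conf_thres_raw = -std::log(1.0f / score_threshold - 1.0f);  // ~ -1.0986
     // sigmoid is monotonic, so sigmoid(x) >= 0.25  <=>  x >= conf_thres_raw,
     // which lets us skip candidates without computing a sigmoid for each one
     float check = 1.0f / (1.0f + std::exp(-conf_thres_raw));          // ~ 0.25
     std::printf("conf_thres_raw = %.4f, sigmoid(conf_thres_raw) = %.4f\n", conf_thres_raw, check);
     return 0;
 }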

Since object detection tasks usually use multi-scale outputs, with each scale responsible for targets of a different size, we now call the ProcessFeatureMap helper we defined earlier to handle each of these feature maps:

// Process the output of three scales
 ProcessFeatureMap(output_tensors_[0], H_8, W_8, s_anchors_, conf_thres_raw);
 ProcessFeatureMap(output_tensors_[1], H_16, W_16, m_anchors_, conf_thres_raw);
 ProcessFeatureMap(output_tensors_[2], H_32, W_32, l_anchors_, conf_thres_raw);

Finally we use the cv::dnn::NMSBoxes function to suppress duplicate boxes based on the bounding box scores, their overlap (IoU), and the configured thresholds, and obtain the kept bounding box indices for each category (indices_):

for(int i = 0; i < classes_num_; i++) {
    cv::dnn::NMSBoxes(bboxes_[i], scores_[i], score_threshold_,
                       nms_threshold_, indices_[i], 1.f, nms_top_k_);
}

        At this point our PostProcess() function is complete!!! The specific complete code is as follows:

// Post-processing implementation
 bool BPU_Detect::PostProcess() {
     //Clear the last result
     bboxes_.clear();
     scores_.clear();
     indices_.clear();
    
     //Resize
     bboxes_.resize(classes_num_);
     scores_.resize(classes_num_);
     indices_.resize(classes_num_);
    
     float conf_thres_raw = -log(1 / score_threshold_ - 1);
    
     // Process the output of three scales
     ProcessFeatureMap(output_tensors_[0], H_8, W_8, s_anchors_, conf_thres_raw);
     ProcessFeatureMap(output_tensors_[1], H_16, W_16, m_anchors_, conf_thres_raw);
     ProcessFeatureMap(output_tensors_[2], H_32, W_32, l_anchors_, conf_thres_raw);
    
     // Perform NMS for each category
     for(int i = 0; i < classes_num_; i++) {
         cv::dnn::NMSBoxes(bboxes_[i], scores_[i], score_threshold_,
                          nms_threshold_, indices_[i], 1.f, nms_top_k_);
     }
    
     return true;
 }

(10) Complete the private DrawResults() function

Now let's complete our result-drawing helper function DrawResults! Since the requirements for drawing result boxes differ between development and debugging, we first add a macro definition to choose whether the result boxes should be drawn at all:

#define ENABLE_DRAW 0 // Drawing switch: 0-disable, 1-enable

Since this part is pure OpenCV, we won't describe it in detail: we only need to traverse the post-NMS detection results for each category. The only thing to note is that our images were preprocessed with the LetterBox method, so we must use x_shift_, y_shift_, x_scale_ and y_scale_ to transform the coordinates back into the original image space:

float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;

        Finally, the complete code of the DrawResults function is as follows:

// Drawing result implementation
 void BPU_Detect::DrawResults(cv::Mat& img) {
 #if ENABLE_DRAW
     for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
         if(!indices_[cls_id].empty()) {
             for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                 int idx = indices_[cls_id][i];
                 float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                 float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                 float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                 float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                 float score = scores_[cls_id][idx];
                
                 // draw bounding box
                 cv::rectangle(img, cv::Point(x1, y1), cv::Point(x2, y2),
                             cv::Scalar(255, 0, 0), line_size_);
                
                 // draw labels
                 std::string text = class_names_[cls_id] + ": " +
                                 std::to_string(static_cast<int>(score * 100)) + "%";
                 cv::putText(img, text, cv::Point(x1, y1 - 5),
                           cv::FONT_HERSHEY_SIMPLEX, font_size_,
                           cv::Scalar(0, 0, 255), font_thickness_, cv::LINE_AA);
             }
         }
     }
 #endif
     //Print test results
     PrintResults();
 }

(11) Complete the private PrintResults() function

There is only the PrintResults function left, and there is not much to say about it: we just use for loops to print the model's detection results in a tidy format. The only thing to know is that the data in indices_ is stored per category (cls_id), and each category holds the indices of all detection boxes kept by NMS for that category. The complete code is as follows:

//Print detection results implementation
 void BPU_Detect::PrintResults() const {
     //Print the overall information of the test results
     int total_detections = 0;
     for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
         total_detections += indices_[cls_id].size();
     }
     std::cout << "\n============ Detection Results ============" << std::endl;
     std::cout << "Total detections: " << total_detections << std::endl;
    
     for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
         if(!indices_[cls_id].empty()) {
             std::cout << "\nClass: " << class_names_[cls_id] << std::endl;
             std::cout << "Number of detections: " << indices_[cls_id].size() << std::endl;
             std::cout << "Details:" << std::endl;
            
             for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                 int idx = indices_[cls_id][i];
                 float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                 float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                 float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                 float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                 float score = scores_[cls_id][idx];
                
                 //Print detailed information of each detection frame
                 std::cout << " Detection " << i + 1 << ":" << std::endl;
                 std::cout << " Position: (" << x1 << ", " << y1 << ") to (" << x2 << ", " << y2 << ")" << std::  endl;
                 std::cout << " Confidence: " << std::fixed << std::setprecision(2) << score * 100 << "%" << std::endl;
             }
         }
     }
     std::cout << "========================================\n"  << std::endl;
 }

        At this point we have completed all the private auxiliary functions and can start completing the three public functions! ! !

(12) Complete the public Init() function

We first complete our initialization function. In the initialization phase we only need to load the model and obtain and check the model information, so we simply call our LoadModel function and our GetModelInfo function:

if(!LoadModel()) {
        std::cout << "Failed to load model!" << std::endl;
        return false;
    }
if(!GetModelInfo()) {
    std::cout << "Failed to get model info!" << std::endl;
    return false;
}

       Finally, we add the initialization flag and time output to complete our initialization function! ! ! The complete code is as follows:

// Initialization function implementation
 bool BPU_Detect::Init() {
     if(is_initialized_) {
         std::cout << "Already initialized!" << std::endl;
         return true;
     }
    
     auto init_start = std::chrono::high_resolution_clock::now();
    
     if(!LoadModel()) {
         std::cout << "Failed to load model!" << std::endl;
         return false;
     }
    
     if(!GetModelInfo()) {
         std::cout << "Failed to get model info!" << std::endl;
         return false;
     }
    
     is_initialized_ = true;
    
     auto init_end = std::chrono::high_resolution_clock::now();
     float init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
    
     std::cout << "\n============ Model Loading Time ============" << std::endl;
     std::cout << "Total init time: " << std::fixed << std::setprecision(2) << init_time << " ms" << std::endl;
     std::cout << "==========================================\n  " << std::endl;
    
     return true;
 }

(13) Complete the public Detect() function

Then we complete our Detect function. We first check whether initialization has succeeded:

if(!is_initialized_) {
        std::cout << "Please initialize first!" << std::endl;
        return false;
    }

Then we call the PreProcess preprocessing function, the Inference inference function and the PostProcess post-processing function in turn, and finally call our DrawResults function:

if(!PreProcess(input_img)) {
        return false;
    }
if(!Inference()) {
        return false;
    }
if(!PostProcess()) {
        return false;
    }

DrawResults(output_img);

        Finally, we add the timing output to complete our detection function!!! The complete code is as follows:

// Detection function implementation
 bool BPU_Detect::Detect(const cv::Mat& input_img, cv::Mat& output_img) {
     if(!is_initialized_) {
         std::cout << "Please initialize first!" << std::endl;
         return false;
     }
    
     auto total_start = std::chrono::high_resolution_clock::now();
    
 #if ENABLE_DRAW
     input_img.copyTo(output_img);
 #endif

     // Preprocessing time statistics
     auto preprocess_start = std::chrono::high_resolution_clock::now();
     if(!PreProcess(input_img)) {
         return false;
     }
     auto preprocess_end = std::chrono::high_resolution_clock::now();
     float preprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(preprocess_end - preprocess_start).count() / 1000.0f;
    
     //Inference time statistics
     auto infer_start = std::chrono::high_resolution_clock::now();
     if(!Inference()) {
         return false;
     }
     auto infer_end = std::chrono::high_resolution_clock::now();
     float infer_time = std::chrono::duration_cast<std::chrono::microseconds>(infer_end - infer_start).count() / 1000.0f;
    
     // Post-processing time statistics
     auto postprocess_start = std::chrono::high_resolution_clock::now();
     if(!PostProcess()) {
         return false;
     }
     auto postprocess_end = std::chrono::high_resolution_clock::now();
     float postprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(postprocess_end - postprocess_start).count() / 1000.0f;
    
     // Draw result time statistics
     auto draw_start = std::chrono::high_resolution_clock::now();
     DrawResults(output_img);
     auto draw_end = std::chrono::high_resolution_clock::now();
     float draw_time = std::chrono::duration_cast<std::chrono::microseconds>(draw_end - draw_start).count() / 1000.0f;
    
     //Total time statistics
     auto total_end = std::chrono::high_resolution_clock::now();
     float total_time = std::chrono::duration_cast<std::chrono::microseconds>(total_end - total_start).count() / 1000.0f;
    
     //Print time statistics
     std::cout << "\n============ Time Statistics ============" << std::endl;
     std::cout << "Preprocess time: " << std::fixed << std::setprecision(2) << preprocess_time << " ms" << std::endl;
     std::cout << "Inference time: " << std::fixed << std::setprecision(2) << infer_time << " ms" << std::endl;
     std::cout << "Postprocess time: " << std::fixed << std::setprecision(2) << postprocess_time << " ms" << std::endl;
     std::cout << "Draw time: " << std::fixed << std::setprecision(2) << draw_time << " ms" << std::endl;
     std::cout << "Total time: " << std::fixed << std::setprecision(2) << total_time << " ms" << std::endl;
     std::cout << "FPS: " << std::fixed << std::setprecision(2) << 1000.0f / total_time << std::endl;
     std::cout << "======================================\n" <<  std::endl;
    
     return true;
 }

(14) Complete the public Release() function

The last thing we need to complete is our resource release function. We first check whether the class has been initialized; if not, there is nothing to release:

if(!is_initialized_) {
        return true;
    }

Then we check whether our inference task has finished. If it has not, we need to release it with hbDNNReleaseTask; the description of this API is as follows:

/**
 * Release a task and its related resources. If the task has not been executed then it will be canceled,
 * and if the task has not been finished then it will be stopped.
 * This interface will return immediately, and all operations will run in the background
 * @param[in] taskHandle: pointer to the task
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNReleaseTask(hbDNNTaskHandle_t taskHandle);

So our code only needs to call this function and then set the task handle pointer to nullptr:

if(task_handle_) {
        hbDNNReleaseTask(task_handle_);
        task_handle_ = nullptr;
    }

Finally we use the hbSysFreeMem API to release the input memory, the output memory and the model in turn:

/**
  * Free mem
  * @param[in] mem
  * @return 0 if success, return defined error code otherwise
  */
 int32_t hbSysFreeMem(hbSysMem *mem);

 // Release input memory
 if(input_tensor_.sysMem[0].virAddr) {
     hbSysFreeMem(&(input_tensor_.sysMem[0]));
 }

 // Release output memory
 for(int i = 0; i < 3; i++) {
     if(output_tensors_ && output_tensors_[i].sysMem[0].virAddr) {
         hbSysFreeMem(&(output_tensors_[i].sysMem[0]));
     }
 }

 if(output_tensors_) {
     delete[] output_tensors_;
     output_tensors_ = nullptr;
 }

 // release model
 if(packed_dnn_handle_) {
     hbDNNRelease(packed_dnn_handle_);
     packed_dnn_handle_ = nullptr;
 }

        Finally, we add some details and complete our resource release function! ! ! The complete code is as follows:

//Release resource implementation
 bool BPU_Detect::Release() {
     if(!is_initialized_) {
         return true;
     }
    
     // Release task
     if(task_handle_) {
         hbDNNReleaseTask(task_handle_);
         task_handle_ = nullptr;
     }
    
     try {
         // Release input memory
         if(input_tensor_.sysMem[0].virAddr) {
             hbSysFreeMem(&(input_tensor_.sysMem[0]));
         }
        
         // Release output memory
         for(int i = 0; i < 3; i++) {
             if(output_tensors_ && output_tensors_[i].sysMem[0].virAddr) {
                 hbSysFreeMem(&(output_tensors_[i].sysMem[0]));
             }
         }
        
         if(output_tensors_) {
             delete[] output_tensors_;
             output_tensors_ = nullptr;
         }
        
         // release model
         if(packed_dnn_handle_) {
             hbDNNRelease(packed_dnn_handle_);
             packed_dnn_handle_ = nullptr;
         }
     } catch(const std::exception& e) {
         std::cout << "Exception during release: " << () << std::endl;
     }
    
     is_initialized_ = false;
     return true;
 }

(15) Implement the Main function

The tutorial is coming to an end here. Next we only need to implement the logic that calls the class and runs inference, and that completes this section's teaching. The current code logic has not been optimized and the inference has not reached its best performance yet; please look forward to the dedicated optimization tutorial coming in the new year!!!

Using this detection class is actually very simple: we only need to create a detector instance, initialize it, feed the image or frame to be detected into Detect(), and finally release the resources!!!

BPU_Detect detector;
 // Initialization
 if (!detector.Init()) {
     std::cout << "Failed to initialize detector" << std::endl;
     return -1;
 }
 // Perform detection
 if (!detector.Detect(input_img, output_img)) {
     std::cout << "Detection failed" << std::endl;
     return -1;
 }
 // Release resources
 detector.Release();

        Remember the single image and real-time detection macro definitions we added above? We add the judgment of this macro definition and some details to the main function. The complete code is as follows:

int main() {
     //Create detector instance
     BPU_Detect detector;
     // initialization
     if (!detector.Init()) {
         std::cout << "Failed to initialize detector" << std::endl;
         return -1;
     }
 #if DETECT_MODE == 0
     //Single picture detection mode
     std::cout << "Single image detection mode" << std::endl;
     //Read test image
     cv::Mat input_img = cv::imread("/path/to/img");
     if (input_img.empty()) {
         std::cout << "Failed to load image" << std::endl;
         return -1;
     }
     //Perform detection
     cv::Mat output_img;
 #if ENABLE_DRAW
     if (!detector.Detect(input_img, output_img)) {
         std::cout << "Detection failed" << std::endl;
         return -1;
     }
     // save results
     cv::imwrite("cpp_result.jpg", output_img);
 #else
     if (!detector.Detect(input_img, output_img)) {
         std::cout << "Detection failed" << std::endl;
         return -1;
     }
 #endif
 #else
     // Real-time detection mode
     std::cout << "Real-time detection mode" << std::endl;
     //Open camera
     cv::VideoCapture cap(0);
     if (!cap.isOpened()) {
         std::cout << "Failed to open camera" << std::endl;
         return -1;
     }
     cv::Mat frame, output_frame;
     while (true) {
         // read a frame
         cap >> frame;
         if (frame.empty()) {
             std::cout << "Failed to read frame" << std::endl;
             break;
         }
         //Execute detection
         if (!detector.Detect(frame, output_frame)) {
             std::cout << "Detection failed" << std::endl;
             break;
         }
 #if ENABLE_DRAW
         //display results
         cv::imshow("Real-time Detection", output_frame);
        
         // Press 'q' to exit
         if (cv::waitKey(1) == 'q') {
             break;
         }
 #endif
     }
 #if ENABLE_DRAW
     // Release the camera
     cap.release();
     cv::destroyAllWindows();
 #endif
 #endif
     // Release resources
     detector.Release();
     return 0;
 }

The complete code is for reference only

// Standard C++ library
 #include <iostream> //Input and output streams
 #include <vector> // vector container
 #include <algorithm> // algorithm library
 #include <chrono> // Time related functions
 #include <iomanip> // Input and output format control

 // OpenCV library
 #include <opencv2/opencv.hpp> // OpenCV main header file
 #include <opencv2/dnn.hpp>    // OpenCV deep learning module

 // Horizon RDK BPU API
 #include "dnn/hb_dnn.h" //BPU basic functions
 #include "dnn/hb_dnn_ext.h" // BPU extension function
 #include "dnn/plugin/hb_dnn_layer.h" // BPU layer definition
 #include "dnn/plugin/hb_dnn_plugin.h" // BPU plug-in
 #include "dnn/hb_sys.h" // BPU system functions

 // Error checking macro definition
 #define RDK_CHECK_SUCCESS(value, errmsg) \
     do \
     { \
         auto ret_code = value; \
         if (ret_code != 0) \
         { \
             std::cout << errmsg << ", error code:" << ret_code; \
             return ret_code; \
         } \
     } while (0);

 //Default parameter definitions related to models and detection
 #define DEFAULT_MODEL_PATH "/root/Deep_Learning/YOLOv5/models/tennis_detect_640x640_bayese_.bin" //Default model path
 #define DEFAULT_CLASSES_NUM 1 //Default number of categories
 #define CLASSES_LIST "tennis_ball" // Category name
 #define DEFAULT_NMS_THRESHOLD 0.45f // Non-maximum suppression threshold
 #define DEFAULT_SCORE_THRESHOLD 0.25f // Confidence threshold
 #define DEFAULT_NMS_TOP_K 300 //The maximum number of frames reserved by NMS
 #define DEFAULT_FONT_SIZE 1.0f // Drawing text size
 #define DEFAULT_FONT_THICKNESS 1.0f // Draw text thickness
 #define DEFAULT_LINE_SIZE 2.0f // Draw line thickness

 //Run mode selection
 #define DETECT_MODE 0 // Detection mode: 0-single picture, 1-real-time detection
 #define ENABLE_DRAW 0 // Drawing switch: 0-disable, 1-enable
 #define LOAD_FROM_DDR 1 //Model loading method: 0-load from file, 1-load from memory

 // Feature map scale definition (based on the multiple relationship of the input size)
 #define H_8 (input_h_ / 8) // 1/8 of the input height
 #define W_8 (input_w_ / 8) // 1/8 of the input width
 #define H_16 (input_h_ / 16) // 1/16 of the input height
 #define W_16 (input_w_ / 16) // 1/16 of the input width
 #define H_32 (input_h_ / 32) // 1/32 of the input height
 #define W_32 (input_w_ / 32) // 1/32 of the input width

 // BPU target detection class
 class BPU_Detect {
 public:
     //Constructor: initialize detector parameters
     // @param model_path: model file path
     // @param classes_num: Number of detection categories
     // @param nms_threshold: NMS threshold
     // @param score_threshold: Confidence threshold
     // @param nms_top_k: The maximum number of frames retained by NMS
     BPU_Detect(const std::string& model_path = DEFAULT_MODEL_PATH,
                  int classes_num = DEFAULT_CLASSES_NUM,
                  float nms_threshold = DEFAULT_NMS_THRESHOLD,
                  float score_threshold = DEFAULT_SCORE_THRESHOLD,
                  int nms_top_k = DEFAULT_NMS_TOP_K);
    
     // Destructor: release resources
     ~BPU_Detect();

     //Main functional interface
     bool Init(); // Initialize BPU and model
     bool Detect(const cv::Mat& input_img, cv::Mat& output_img); //Perform target detection
     bool Release(); // Release all resources

 private:
     // Internal utility function
     bool LoadModel(); // Load model file
     bool GetModelInfo(); // Get the input and output information of the model
     bool PreProcess(const cv::Mat& input_img); // Image preprocessing (resize and format conversion)
     bool Inference(); //Perform model inference
     bool PostProcess(); // Post-processing (NMS, etc.)
     void DrawResults(cv::Mat& img); // Draw detection results on the image
     void PrintResults() const; // Print detection results to the console

     // Feature map processing auxiliary function
     // @param output_tensor: output tensor
     // @param height, width: feature map size
     // @param anchors: anchor boxes corresponding to the scale
     // @param conf_thres_raw: original confidence threshold
     void ProcessFeatureMap(hbDNNTensor& output_tensor,
                           int height, int width,
                           const std::vector<std::pair<double, double>>& anchors,
                           float conf_thres_raw);

     //Member variables (arranged according to constructor initialization order)
     std::string model_path_; //Model file path
     int classes_num_; // Number of categories
     float nms_threshold_; // NMS threshold
     float score_threshold_; // Confidence threshold
     int nms_top_k_; // Maximum number of boxes kept by NMS
     bool is_initialized_; // Initialization status flag
     float font_size_; // draw text size
     float font_thickness_; // draw text thickness
     float line_size_; // draw line thickness
    
     // BPU related variables
     hbPackedDNNHandle_t packed_dnn_handle_; // Packed model handle
     hbDNNHandle_t dnn_handle_; // Model handle
     const char* model_name_; // model name
    
     // Input and output tensors
     hbDNNTensor input_tensor_; //Input tensor
     hbDNNTensor* output_tensors_; // Output tensor array
     hbDNNTensorProperties input_properties_; //Input tensor properties
    
     //Task related
     hbDNNTaskHandle_t task_handle_; // Inference task handle
    
     //Model input parameters
     int input_h_; //Input height
     int input_w_; // input width
    
     //Storage of detection results
     std::vector<std::vector<cv::Rect2d>> bboxes_; // Bounding boxes for each category
     std::vector<std::vector<float>> scores_; // Score for each category
     std::vector<std::vector<int>> indices_; // Index after NMS
    
     //Image processing parameters
     float x_scale_; //X-direction scaling ratio
     float y_scale_; // Y direction scaling ratio
     int x_shift_; // X direction offset
     int y_shift_; // Y direction offset
     cv::Mat resized_img_; // Scaled image
    
     // YOLOv5 anchors information
     std::vector<std::pair<double, double>> s_anchors_; // small target anchors
     std::vector<std::pair<double, double>> m_anchors_; // Medium target anchors
     std::vector<std::pair<double, double>> l_anchors_; // Large target anchors
    
     // Output processing
     int output_order_[3]; // Output order mapping
     std::vector<std::string> class_names_; // Category name list
 };

 //Constructor implementation
 BPU_Detect::BPU_Detect(const std::string& model_path,
                           int classes_num,
                           float nms_threshold,
                           float score_threshold,
                           int nms_top_k)
     : model_path_(model_path),
       classes_num_(classes_num),
       nms_threshold_(nms_threshold),
       score_threshold_(score_threshold),
       nms_top_k_(nms_top_k),
      is_initialized_(false),
      font_size_(DEFAULT_FONT_SIZE),
      font_thickness_(DEFAULT_FONT_THICKNESS),
      line_size_(DEFAULT_LINE_SIZE),
      packed_dnn_handle_(nullptr),
      dnn_handle_(nullptr),
      input_tensor_(),
      output_tensors_(nullptr),
      task_handle_(nullptr) {
    
     //Initialize category name
     class_names_ = {CLASSES_LIST};
    
     //Initialize anchors
     std::vector<float> anchors = {10.0, 13.0, 16.0, 30.0, 33.0, 23.0,
                                  30.0, 61.0, 62.0, 45.0, 59.0, 119.0,
                                  116.0, 90.0, 156.0, 198.0, 373.0, 326.0};
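     // These are the standard YOLOv5 COCO anchors (w, h pairs) for strides 8/16/32;
     // if the model was trained with custom anchors, replace the values above accordingly.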
    
     //Set small, medium, large anchors
     for(int i = 0; i < 3; i++) {
         s_anchors_.push_back({anchors[i*2], anchors[i*2+1]});
         m_anchors_.push_back({anchors[i*2+6], anchors[i*2+7]});
         l_anchors_.push_back({anchors[i*2+12], anchors[i*2+13]});
     }
 }

 // Destructor implementation
 BPU_Detect::~BPU_Detect() {
     if(is_initialized_) {
         Release();
     }
 }

 // Initialization function implementation
 bool BPU_Detect::Init() {
     if(is_initialized_) {
         std::cout << "Already initialized!" << std::endl;
         return true;
     }
    
     auto init_start = std::chrono::high_resolution_clock::now();
    
     if(!LoadModel()) {
         std::cout << "Failed to load model!" << std::endl;
         return false;
     }
    
     if(!GetModelInfo()) {
         std::cout << "Failed to get model info!" << std::endl;
         return false;
     }
    
     is_initialized_ = true;
    
     auto init_end = std::chrono::high_resolution_clock::now();
     float init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
    
     std::cout << "\n============ Model Loading Time ============" << std::endl;
     std::cout << "Total init time: " << std::fixed << std::setprecision(2) << init_time << " ms" << std::endl;
     std::cout << "==========================================\n  " << std::endl;
    
     return true;
 }

 //Load model implementation
 bool BPU_Detect::LoadModel() {
     // Record the starting point of the total loading time
     auto load_start = std::chrono::high_resolution_clock::now();

 #if LOAD_FROM_DDR
     // Used to record the time of reading model data from the file
     float read_time = 0.0f;
 #endif
     // Used to record the time of model initialization
     float init_time = 0.0f;
    
 #if LOAD_FROM_DDR
     // =============== Read model from file to memory ===============
     auto read_start = std::chrono::high_resolution_clock::now();
    
     //Open model file
     FILE* fp = fopen(model_path_.c_str(), "rb");
     if (!fp) {
         std::cout << "Failed to open model file: " << model_path_ << std::endl;
         return false;
     }
    
     // Get file size:
     fseek(fp, 0, SEEK_END); // 1. Move the file pointer to the end
     size_t model_size = static_cast<size_t>(ftell(fp)); // 2. Get the current position (i.e. file size)
     fseek(fp, 0, SEEK_SET); // 3. Reset the file pointer to the beginning
    
     // Allocate memory for model data
     void* model_data = malloc(model_size);
     if (!model_data) {
         std::cout << "Failed to allocate memory for model data" << std::endl;
         fclose(fp);
         return false;
     }
    
     //Read model data into memory
     size_t read_size = fread(model_data, 1, model_size, fp);
     fclose(fp);
    
     // Calculate file reading time
     auto read_end = std::chrono::high_resolution_clock::now();
     read_time = std::chrono::duration_cast<std::chrono::microseconds>(read_end - read_start).count() / 1000.0f;
    
     // Verify that the file has been read completely
     if (read_size != model_size) {
         std::cout << "Failed to read model data, expected " << model_size
                  << " bytes, but got " << read_size << " bytes" << std::endl;
         free(model_data);
         return false;
     }
    
     // =============== Initialize model from memory ===============
     auto init_start = std::chrono::high_resolution_clock::now();
    
     // Prepare model data array and length array
     const void* model_data_array[] = {model_data};
     int32_t model_data_length[] = {static_cast<int32_t>(model_size)};
    
     // Initialize the model from memory using the BPU API
     RDK_CHECK_SUCCESS(
         hbDNNInitializeFromDDR(&packed_dnn_handle_, model_data_array, model_data_length, 1),
         "Initialize model from DDR failed");
    
     // Release temporarily allocated memory
     free(model_data);
    
     // Calculate model initialization time
     auto init_end = std::chrono::high_resolution_clock::now();
     init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
    
 #else
     // =============== Initialize the model directly from the file ===============
     auto init_start = std::chrono::high_resolution_clock::now();
    
     // Get the model file path
     const char* model_file_name = model_path_.c_str();
    
     // Initialize model from file using BPU API
     RDK_CHECK_SUCCESS(
         hbDNNInitializeFromFiles(&packed_dnn_handle_, &model_file_name, 1),
         "Initialize model from file failed");
    
     // Calculate model initialization time
     auto init_end = std::chrono::high_resolution_clock::now();
     init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
 #endif

     // =============== Calculate and print total time statistics ===============
     auto load_end = std::chrono::high_resolution_clock::now();
     float total_load_time = std::chrono::duration_cast<std::chrono::microseconds>(load_end - load_start).count() / 1000.0f;

     //Print time statistics
     std::cout << "\n============ Model Loading Details ============" << std::endl;
 #if LOAD_FROM_DDR
     std::cout << "File reading time: " << std::fixed << std::setprecision(2) << read_time << " ms" << std::endl;
 #endif
     std::cout << "Model init time: " << std::fixed << std::setprecision(2) << init_time << " ms" << std::endl;
     std::cout << "Total loading time: " << std::fixed << std::setprecision(2) << total_load_time << " ms" << std::endl;
     std::cout << "==============================================  \n" << std::endl;

     return true;
 }

 // Get model information implementation
 bool BPU_Detect::GetModelInfo() {
     // Get the list of model names
     const char** model_name_list;
     int model_count = 0;
     RDK_CHECK_SUCCESS(
         hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle_),
         "hbDNNGetModelNameList failed");
     if(model_count > 1) {
         std::cout << "Model count: " << model_count << std::endl;
         std::cout << "Please check the model count!" << std::endl;
         return false;
     }
     model_name_ = model_name_list[0];
    
     // Get model handle
     RDK_CHECK_SUCCESS(
         hbDNNGetModelHandle(&dnn_handle_, packed_dnn_handle_, model_name_),
         "hbDNNGetModelHandle failed");
    
     // Get input information
     int32_t input_count = 0;
     RDK_CHECK_SUCCESS(
         hbDNNGetInputCount(&input_count, dnn_handle_),
         "hbDNNGetInputCount failed");
     RDK_CHECK_SUCCESS(
         hbDNNGetInputTensorProperties(&input_properties_, dnn_handle_, 0),
         "hbDNNGetInputTensorProperties failed");

     if(input_count > 1){
         std::cout << "Model input node is greater than 1, please check!" << std::endl;
         return false;
     }
     if(input_properties_.tensorType == HB_DNN_IMG_TYPE_NV12){
         std::cout << "Input tensor type: HB_DNN_IMG_TYPE_NV12" << std::endl;
     }
     else{
         std::cout << "The input tensor type is not HB_DNN_IMG_TYPE_NV12, please check!" << std::endl;
         return false;
     }
     if(input_properties_.tensorLayout == HB_DNN_LAYOUT_NCHW){
         std::cout << "Input tensor data layout: HB_DNN_LAYOUT_NCHW" << std::endl;
     }
     else{
         std::cout << "The input tensor data layout is not HB_DNN_LAYOUT_NCHW, please check!" << std::endl;
         return false;
     }
     // Get input size
     input_h_ = input_properties_.validShape.dimensionSize[2];
     input_w_ = input_properties_.validShape.dimensionSize[3];
     if (input_properties_.validShape.numDimensions == 4)
     {
         std::cout << "The input size is: (" << input_properties_.validShape.dimensionSize[0];
         std::cout << ", " << input_properties_.validShape.dimensionSize[1];
         std::cout << ", " << input_h_;
         std::cout << ", " << input_w_ << ")" << std::endl;
     }
     else
     {
         std::cout << "The input shape is not 4-dimensional (expected NCHW such as 1x3x640x640), please check!" << std::endl;
         return false;
     }
    
     // Get the output information and adjust the output order
     int32_t output_count = 0;
     RDK_CHECK_SUCCESS(
         hbDNNGetOutputCount(&output_count, dnn_handle_),
         "hbDNNGetOutputCount failed");
    
     //Allocate output tensor memory
     output_tensors_ = new hbDNNTensor[output_count](); // value-initialize so sysMem pointers start as nullptr
    
     // =============== Adjust output head order mapping ===============
     // YOLOv5 has 3 output heads, corresponding to 3 different scales of feature maps.
     // Need to ensure that the output order is: small target (8x downsampling) -> medium target (16x downsampling) -> large target (32x downsampling)
    
     //Initialize default order
     output_order_[0] = 0; // Default 1st output
     output_order_[1] = 1; // Default 2nd output
     output_order_[2] = 2; // Default 3rd output

     // Define the desired output feature map size and number of channels
     int32_t expected_shapes[3][3] = {
         {H_8, W_8, 3 * (5 + classes_num_)}, // Small target feature map: H/8 x W/8
         {H_16, W_16, 3 * (5 + classes_num_)}, // Medium target feature map: H/16 x W/16
         {H_32, W_32, 3 * (5 + classes_num_)} // Large target feature map: H/32 x W/32
     };
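     // For a 640x640 input with classes_num_ = 1 this expects NHWC shapes of
     // 80x80x18, 40x40x18 and 20x20x18, since each head outputs
     // 3 anchors x (4 box + 1 objectness + classes_num_) channels.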

     // Iterate through each desired output scale
     for(int i = 0; i < 3; i++) {
         // Traverse the actual output nodes
         for(int j = 0; j < 3; j++) {
             // Get the properties of the current output node
             hbDNNTensorProperties output_properties;
             RDK_CHECK_SUCCESS(
                 hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, j),
                 "Get output tensor properties failed");
            
             // Get the actual feature map size and number of channels
             int32_t actual_h = output_properties.validShape.dimensionSize[1];
             int32_t actual_w = output_properties.validShape.dimensionSize[2];
             int32_t actual_c = output_properties.validShape.dimensionSize[3];

             // If actual size and number of channels match expected
             if(actual_h == expected_shapes[i][0] &&
                actual_w == expected_shapes[i][1] &&
                actual_c == expected_shapes[i][2]) {
                 //Record the correct output sequence
                 output_order_[i] = j;
                 break;
             }
         }
     }

     //Print out sequence mapping information
     std::cout << "\n============ Output Order Mapping ============" << std::endl;
     std::cout << "Small object (1/" << 8 << "): output[" << output_order_[0] << "]" << std::endl;
     std::cout << "Medium object (1/" << 16 << "): output[" << output_order_[1] << "]" << std::endl;
     std::cout << "Large object (1/" << 32 << "): output[" << output_order_[2] << "]" << std::endl;
     std::cout << "==========================================\  n" << std::endl;

     return true;
 }

 // Detection function implementation
 bool BPU_Detect::Detect(const cv::Mat& input_img, cv::Mat& output_img) {
     if(!is_initialized_) {
         std::cout << "Please initialize first!" << std::endl;
         return false;
     }
    
     auto total_start = std::chrono::high_resolution_clock::now();
    
 #if ENABLE_DRAW
     input_img.copyTo(output_img);
 #endif

     // Preprocessing time statistics
     auto preprocess_start = std::chrono::high_resolution_clock::now();
     if(!PreProcess(input_img)) {
         return false;
     }
     auto preprocess_end = std::chrono::high_resolution_clock::now();
     float preprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(preprocess_end - preprocess_start).count() / 1000.0f;
    
     //Inference time statistics
     auto infer_start = std::chrono::high_resolution_clock::now();
     if(!Inference()) {
         return false;
     }
     auto infer_end = std::chrono::high_resolution_clock::now();
     float infer_time = std::chrono::duration_cast<std::chrono::microseconds>(infer_end - infer_start).count() / 1000.0f;
    
     // Post-processing time statistics
     auto postprocess_start = std::chrono::high_resolution_clock::now();
     if(!PostProcess()) {
         return false;
     }
     auto postprocess_end = std::chrono::high_resolution_clock::now();
     float postprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(postprocess_end - postprocess_start).count() / 1000.0f;
    
     // Draw result time statistics
     auto draw_start = std::chrono::high_resolution_clock::now();
     DrawResults(output_img);
     auto draw_end = std::chrono::high_resolution_clock::now();
     float draw_time = std::chrono::duration_cast<std::chrono::microseconds>(draw_end - draw_start).count() / 1000.0f;
    
     //Total time statistics
     auto total_end = std::chrono::high_resolution_clock::now();
     float total_time = std::chrono::duration_cast<std::chrono::microseconds>(total_end - total_start).count() / 1000.0f;
    
     //Print time statistics
     std::cout << "\n============ Time Statistics ============" << std::endl;
     std::cout << "Preprocess time: " << std::fixed << std::setprecision(2) << preprocess_time << " ms" << std::endl;
     std::cout << "Inference time: " << std::fixed << std::setprecision(2) << infer_time << " ms" << std::endl;
     std::cout << "Postprocess time: " << std::fixed << std::setprecision(2) << postprocess_time << " ms" << std::endl;
     std::cout << "Draw time: " << std::fixed << std::setprecision(2) << draw_time << " ms" << std::endl;
     std::cout << "Total time: " << std::fixed << std::setprecision(2) << total_time << " ms" << std::endl;
     std::cout << "FPS: " << std::fixed << std::setprecision(2) << 1000.0f / total_time << std::endl;
     std::cout << "======================================\n" <<  std::endl;
    
     return true;
 }

 // Preprocessing implementation
 bool BPU_Detect::PreProcess(const cv::Mat& input_img) {
     //Use letterbox method for preprocessing
     x_scale_ = std::min(1.0f * input_h_ / input_img.rows, 1.0f * input_w_ / input_img.cols);
     y_scale_ = x_scale_;
    
     int new_w = input_img.cols * x_scale_;
     x_shift_ = (input_w_ - new_w) / 2;
     int x_other = input_w_ - new_w - x_shift_;
    
     int new_h = input_img.rows * y_scale_;
     y_shift_ = (input_h_ - new_h) / 2;
     int y_other = input_h_ - new_h - y_shift_;
    
     cv::resize(input_img, resized_img_, cv::Size(new_w, new_h));
     cv::copyMakeBorder(resized_img_, resized_img_, y_shift_, y_other,
                        x_shift_, x_other, cv::BORDER_CONSTANT, cv::Scalar(127, 127, 127));
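     // x_scale_/x_shift_ (and the y counterparts) are kept so that boxes predicted on the
     // letterboxed input can later be mapped back to the original image:
     // x_orig = (x_letterbox - x_shift_) / x_scale_ (see DrawResults/PrintResults).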
    
     //Convert to NV12 format
     cv::Mat yuv_mat;
     cv::cvtColor(resized_img_, yuv_mat, cv::COLOR_BGR2YUV_I420);
    
     // Prepare the input tensor (NV12 data for the BPU)
     hbSysAllocCachedMem(&input_tensor_.sysMem[0], int(3 * input_h_ * input_w_ / 2));
     uint8_t* yuv = yuv_mat.ptr<uint8_t>();
     uint8_t* ynv12 = (uint8_t*)input_tensor_.sysMem[0].virAddr;
     // Calculate the height and width of the UV part, and the size of the Y part
     int uv_height = input_h_ / 2;
     int uv_width = input_w_ / 2;
     int y_size = input_h_ * input_w_;
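     // I420 stores Y, then a full U plane, then a full V plane; NV12 stores the same Y
     // plane followed by interleaved UV pairs, so U and V are re-interleaved below.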
     //Copy the Y component data to the input tensor
     memcpy(ynv12, yuv, y_size);
     // Get the UV component position in NV12 format
     uint8_t* nv12 = ynv12 + y_size;
     uint8_t* u_data = yuv + y_size;
     uint8_t* v_data = u_data + uv_height * uv_width;
     //Write U and V components alternately into NV12 format
     for(int i = 0; i < uv_width * uv_height; i++) {
         *nv12++ = *u_data++;
         *nv12++ = *v_data++;
     }
     //Clear the memory cache to ensure that the data is ready for use by the model
     hbSysFlushMem(&input_tensor_.sysMem[0], HB_SYS_MEM_CACHE_CLEAN);//Clear the cache to ensure data synchronization
     return true;
 }

 //Inference implementation
 bool BPU_Detect::Inference() {
     //Initialize the task handle to nullptr
     task_handle_ = nullptr;
    
     //Initialize input tensor attributes
     input_tensor_.properties = input_properties_;
    
     // Get the output tensor attributes
     for(int i = 0; i < 3; i++) {
         hbDNNTensorProperties output_properties;
         RDK_CHECK_SUCCESS(
             hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, i),
             "Get output tensor properties failed");
         output_tensors_[i].properties = output_properties;
        
         // Allocate memory for output
         int out_aligned_size = output_properties.alignedByteSize;
         RDK_CHECK_SUCCESS(
             hbSysAllocCachedMem(&output_tensors_[i].sysMem[0], out_aligned_size),
             "Allocate output memory failed");
     }
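     // NOTE: the inference task and output buffers are created on every call and only
     // released in Release(); for continuous real-time use, consider allocating the output
     // buffers once (e.g. during Init()) and releasing the task handle after each frame.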
    
     hbDNNInferCtrlParam infer_ctrl_param;
     HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
    
     RDK_CHECK_SUCCESS(
         hbDNNInfer(&task_handle_, &output_tensors_, &input_tensor_, dnn_handle_, &infer_ctrl_param),
         "Model inference failed");
    
     RDK_CHECK_SUCCESS(
         hbDNNWaitTaskDone(task_handle_, 0),
         "Wait task done failed");
    
     return true;
 }

 // Post-processing implementation
 bool BPU_Detect::PostProcess() {
     //Clear the last result
     bboxes_.clear();
     scores_.clear();
     indices_.clear();
    
     //Resize
     bboxes_.resize(classes_num_);
     scores_.resize(classes_num_);
     indices_.resize(classes_num_);
    
     float conf_thres_raw = -log(1 / score_threshold_ - 1);
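     // The raw outputs are logits, so instead of applying sigmoid everywhere we compare
     // the raw objectness against the inverse sigmoid of the threshold:
     // sigmoid(x) >= t  <=>  x >= -ln(1/t - 1), which is what conf_thres_raw holds.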
    
     // Process the output of three scales
     ProcessFeatureMap(output_tensors_[0], H_8, W_8, s_anchors_, conf_thres_raw);
     ProcessFeatureMap(output_tensors_[1], H_16, W_16, m_anchors_, conf_thres_raw);
     ProcessFeatureMap(output_tensors_[2], H_32, W_32, l_anchors_, conf_thres_raw);
    
     // Perform NMS for each category
     for(int i = 0; i < classes_num_; i++) {
         cv::dnn::NMSBoxes(bboxes_[i], scores_[i], score_threshold_,
                          nms_threshold_, indices_[i], 1.f, nms_top_k_);
     }
    
     return true;
 }

 //Print detection results implementation
 void BPU_Detect::PrintResults() const {
     // Print summary information for the detection results
     int total_detections = 0;
     for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
         total_detections += indices_[cls_id].size();
     }
     std::cout << "\n============ Detection Results ============" << std::endl;
     std::cout << "Total detections: " << total_detections << std::endl;
    
     for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
         if(!indices_[cls_id].empty()) {
             std::cout << "\nClass: " << class_names_[cls_id] << std::endl;
             std::cout << "Number of detections: " << indices_[cls_id].size() << std::endl;
             std::cout << "Details:" << std::endl;
            
             for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                 int idx = indices_[cls_id][i];
                 float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                 float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                 float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                 float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                 float score = scores_[cls_id][idx];
                
                 //Print detailed information of each detection frame
                 std::cout << " Detection " << i + 1 << ":" << std::endl;
                 std::cout << " Position: (" << x1 << ", " << y1 << ") to (" << x2 << ", " << y2 << ")" << std::  endl;
                 std::cout << " Confidence: " << std::fixed << std::setprecision(2) << score * 100 << "%" << std::endl;
             }
         }
     }
     std::cout << "========================================\n"  << std::endl;
 }

 // Drawing result implementation
 void BPU_Detect::DrawResults(cv::Mat& img) {
 #if ENABLE_DRAW
     for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
         if(!indices_[cls_id].empty()) {
             for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                 int idx = indices_[cls_id][i];
                 float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                 float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                 float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                 float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                 float score = scores_[cls_id][idx];
                
                 // draw bounding box
                 cv::rectangle(img, cv::Point(x1, y1), cv::Point(x2, y2),
                             cv::Scalar(255, 0, 0), line_size_);
                
                 // draw labels
                 std::string text = class_names_[cls_id] + ": " +
                                 std::to_string(static_cast<int>(score * 100)) + "%";
                 cv::putText(img, text, cv::Point(x1, y1 - 5),
                           cv::FONT_HERSHEY_SIMPLEX, font_size_,
                           cv::Scalar(0, 0, 255), font_thickness_, cv::LINE_AA);
             }
         }
     }
 #endif
     // Print the detection results to the console
     PrintResults();
 }

 // Feature map processing auxiliary function
 void BPU_Detect::ProcessFeatureMap(hbDNNTensor& output_tensor,
                                   int height, int width,
                                   const std::vector<std::pair<double, double>>& anchors,
                                   float conf_thres_raw) {
     // Check the quantization type
     if (output_tensor.properties.quantiType != NONE) {
         std::cout << "Output tensor quantization type should be NONE!" << std::endl;
         return;
     }
    
     // refresh memory
     hbSysFlushMem(&output_tensor.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);
    
     //Get the output data pointer
     auto* raw_data = reinterpret_cast<float*>(output_tensor.sysMem[0].virAddr);
    
     // Traverse each position of the feature map
     for(int h = 0; h < height; h++) {
         for(int w = 0; w < width; w++) {
             for(const auto& anchor : anchors) {
                 // Get prediction data for the current location
                 float* cur_raw = raw_data;
                 raw_data += (5 + classes_num_);
                
                 // Conditional probability filtering
                 if(cur_raw[4] < conf_thres_raw) continue;
                
                 // Find the maximum class probability
                 int cls_id = 5;
                 int end = classes_num_ + 5;
                 for(int i = 6; i < end; i++) {
                     if(cur_raw[i] > cur_raw[cls_id]) {
                         cls_id = i;
                     }
                 }
                
                 // Calculate final score
                 float score = 1.0f / (1.0f + std::exp(-cur_raw[4])) *
                             1.0f / (1.0f + std::exp(-cur_raw[cls_id]));
                
                 // score filter
                 if(score < score_threshold_) continue;
                 cls_id -= 5;
                
                 // decode bounding box
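                 // Standard YOLOv5 (v5.0+) decode:
                 //   cx = (2*sigmoid(tx) - 0.5 + grid_x) * stride
                 //   cy = (2*sigmoid(ty) - 0.5 + grid_y) * stride
                 //   w  = (2*sigmoid(tw))^2 * anchor_w
                 //   h  = (2*sigmoid(th))^2 * anchor_h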
                 float stride = input_h_ / height;
                 float center_x = ((1.0f / (1.0f + std::exp(-cur_raw[0]))) * 2 - 0.5f + w) * stride;
                 float center_y = ((1.0f / (1.0f + std::exp(-cur_raw[1]))) * 2 - 0.5f + h) * stride;
                 float bbox_w = std::pow((1.0f / (1.0f + std::exp(-cur_raw[2]))) * 2, 2) * anchor.first;
                 float bbox_h = std::pow((1.0f / (1.0f + std::exp(-cur_raw[3]))) * 2, 2) * anchor.second;
                 float bbox_x = center_x - bbox_w / 2.0f;
                 float bbox_y = center_y - bbox_h / 2.0f;
                
                 //Save test results
                 bboxes_[cls_id].push_back(cv::Rect2d(bbox_x, bbox_y, bbox_w, bbox_h));
                 scores_[cls_id].push_back(score);
             }
         }
     }
 }

 //Release resource implementation
 bool BPU_Detect::Release() {
     if(!is_initialized_) {
         return true;
     }
    
     // Release task
     if(task_handle_) {
         hbDNNReleaseTask(task_handle_);
         task_handle_ = nullptr;
     }
    
     try {
         // Release input memory
         if(input_tensor_.sysMem[0].virAddr) {
             hbSysFreeMem(&(input_tensor_.sysMem[0]));
         }
        
         // Release output memory
         for(int i = 0; i < 3; i++) {
             if(output_tensors_ && output_tensors_[i].sysMem[0].virAddr) {
                 hbSysFreeMem(&(output_tensors_[i].sysMem[0]));
             }
         }
        
         if(output_tensors_) {
             delete[] output_tensors_;
             output_tensors_ = nullptr;
         }
        
         // release model
         if(packed_dnn_handle_) {
             hbDNNRelease(packed_dnn_handle_);
             packed_dnn_handle_ = nullptr;
         }
     } catch(const std::exception& e) {
         std::cout << "Exception during release: " << () << std::endl;
     }
    
     is_initialized_ = false;
     return true;
 }

 //Modify main function
 int main() {
     //Create detector instance
     BPU_Detect detector;
    
     // initialization
     if (!detector.Init()) {
         std::cout << "Failed to initialize detector" << std::endl;
         return -1;
     }

 #if DETECT_MODE == 0
     //Single picture detection mode
     std::cout << "Single image detection mode" << std::endl;
    
     //Read test image
     cv::Mat input_img = cv::imread("/root/Deep_Learning/YOLOv5/imgs/tennis_1_frame_0001.jpg");
     if (input_img.empty()) {
         std::cout << "Failed to load image" << std::endl;
         return -1;
     }
    
     //Execute detection
     cv::Mat output_img;
 #if ENABLE_DRAW
     if (!detector.Detect(input_img, output_img)) {
         std::cout << "Detection failed" << std::endl;
         return -1;
     }
     // save results
     cv::imwrite("cpp_result.jpg", output_img);
 #else
     if (!detector.Detect(input_img, output_img)) {
         std::cout << "Detection failed" << std::endl;
         return -1;
     }
 #endif

 #else
     // Real-time detection mode
     std::cout << "Real-time detection mode" << std::endl;
    
     //Open camera
     cv::VideoCapture cap(0);
     if (!cap.isOpened()) {
         std::cout << "Failed to open camera" << std::endl;
         return -1;
     }
    
     cv::Mat frame, output_frame;
     while (true) {
         // read a frame
         cap >> frame;
         if (frame.empty()) {
             std::cout << "Failed to read frame" << std::endl;
             break;
         }
        
         //Execute detection
         if (!detector.Detect(frame, output_frame)) {
             std::cout << "Detection failed" << std::endl;
             break;
         }
        
 #if ENABLE_DRAW
         //display results
         cv::imshow("Real-time Detection", output_frame);
        
         // Press 'q' to exit
         if (cv::waitKey(1) == 'q') {
             break;
         }
 #endif
     }
    
 #if ENABLE_DRAW
     // Release the camera
     cap.release();
     cv::destroyAllWindows();
 #endif
 #endif
    
     // Release resources
     detector.Release();
    
     return 0;
 }