Tesseract is an open-source OCR (Optical Character Recognition) engine that converts text in images into machine-readable text. A colleague in my group once contributed RVV (RISC-V Vector) code to the project, so I decided to pull that code out and study it on its own.
The PR is here: Add RISC-V V support by hleft · Pull Request #4346 · tesseract-ocr/tesseract. Since space is limited, I only picked the assembly part to read.
static int DotProduct(const int8_t *u, const int8_t *v, int num) {
  int total = 0;
  asm __volatile__ (
    " .option arch, +v \n\t"
    " vsetvli t0,zero,e32,m8,ta,ma \n\t"
    " vmv.v.i v0,0 \n\t"
    "1: \n\t"
    " vsetvli t0,%[num],e8,m2,ta,ma \n\t"
    " vle8.v v16,0(%[u]) \n\t"
    " vle8.v v24,0(%[v]) \n\t"
    " sub %[num],%[num],t0 \n\t"
    " vwmul.vv v8,v24,v16 \n\t"
    " add %[u],%[u],t0 \n\t"
    " add %[v],%[v],t0 \n\t"
    " vsetvli zero,zero,e16,m4,tu,ma \n\t"
    " vwadd.wv v0,v0,v8 \n\t"
    " bnez %[num],1b \n\t"
    " vsetvli t0,zero,e32,m8,ta,ma \n\t"
    " vmv.s.x v8,zero \n\t"
    " vredsum.vs v0,v0,v8 \n\t"
    " vmv.x.s %[total],v0 \n\t"
    : [u] "+r" (u),
      [v] "+r" (v),
      [num] "+r" (num),
      [total] "+r" (total)
    :
    : "cc", "memory"
  );
  return total;
}
This function computes the dot product of two int8 vectors and is optimized with inline assembly. Besides raw RVV assembly, the same thing could be written with the intrinsics declared in riscv_vector.h. The raw assembly is what is used here, though, so let's read it in segments.
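To have something to compare the assembly against, here is a minimal scalar sketch of what I understand the function to compute. The name dot_product_scalar is my own and does not appear in the PR.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference (hypothetical helper, not from the PR):
 * sum of the element-wise products, accumulated in a 32-bit int. */
static int dot_product_scalar(const int8_t *u, const int8_t *v, int num) {
    assert(num >= 0);
    int total = 0;
    for (int i = 0; i < num; ++i) {
        /* Promote to int before multiplying so the product cannot overflow. */
        total += (int)u[i] * (int)v[i];
    }
    return total;
}
```

For u = {1, -2, 3, 4} and v = {5, 6, -7, 8} this gives 5 - 12 - 21 + 32 = 4.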
" vsetvli t0,zero,e32,m8,ta,ma \n\t"
" v0,0 \n\t"
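vsetvli configures the vector unit: the element width (e32), the register grouping (m8, i.e. eight registers per group), and the vector length. With zero as the source operand and a real destination register (t0), the vector length is set to the maximum the hardware supports for this configuration. The following instruction then initializes the whole v0 register group to 0; this group serves as the 32-bit accumulator.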
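As a rough sketch of what "maximum" means here: the RVV specification defines the largest element count for a configuration as VLMAX = LMUL * VLEN / SEW, where VLEN is the width of one vector register in bits. The helper below is my own (model_vlmax is a hypothetical name, and VLEN = 128 is only an example value, not something the PR assumes).

```c
#include <assert.h>

/* Hypothetical helper: VLMAX = LMUL * VLEN / SEW for integer LMUL. */
static unsigned model_vlmax(unsigned vlen_bits, unsigned lmul, unsigned sew_bits) {
    assert(sew_bits != 0);
    return lmul * vlen_bits / sew_bits;
}
```

With VLEN = 128, the three configurations that appear in this function (e32/m8, e8/m2 and e16/m4) all come out at 32 elements, so switching between them never changes how many elements fit in one pass.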
"1: \n\t"
" vsetvli t0,%[num],e8,m2,ta,ma \n\t"
" v16,0(%[u]) \n\t"
" v24,0(%[v]) \n\t"
" sub %[num],%[num],t0 \n\t"
The label 1: marks the start of the loop. The advantage of RVV is that the step size adjusts automatically as the loop runs. Suppose the length is 18 and the hardware handles 8 elements at a time. Traditional fixed-width SIMD has to split this into 8+8+2: the two blocks of 8 use vector instructions, but the remaining 2 need a hand-written scalar loop. With RVV there is no separate tail loop and no risk of running out of bounds. You only write the loop body, and on the last pass the hardware reduces the step size to 2 by itself. Here vsetvli takes num (the number of elements still to process) as its request and writes the number of elements it will actually handle in this pass to t0; that is the step size. e8 sets the element width to 8 bits, matching the int8_t pointers passed in as parameters. The two loads then read t0 elements from u and v into v16 and v24, and sub decreases num by the amount just consumed.
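The stepping can be sketched in plain C. This is only a model under a simplifying assumption: I treat vsetvli as returning min(remaining, VLMAX), whereas the specification also lets an implementation pick a smaller value when the remaining count lies between VLMAX and 2 * VLMAX. The names model_vsetvli and strip_mine_steps are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for `vsetvli t0, num, ...`: grant min(avl, vlmax). */
static size_t model_vsetvli(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* Records the step size of each pass; returns the number of passes taken. */
static size_t strip_mine_steps(size_t num, size_t vlmax,
                               size_t *steps, size_t max_steps) {
    assert(vlmax > 0);
    size_t count = 0;
    while (num != 0 && count < max_steps) {
        size_t vl = model_vsetvli(num, vlmax); /* step size for this pass */
        steps[count++] = vl;
        num -= vl;                             /* mirrors `sub num, num, t0` */
    }
    return count;
}
```

Under this model, num = 18 with a maximum of 8 per pass produces the steps 8, 8, 2 with no separate tail loop.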
" v8,v24,v16 \n\t"
" add %[u],%[u],t0 \n\t"
" add %[v],%[v],t0 \n\t"
The first instruction is a widening multiply: it multiplies the elements of the two register groups v24 and v16 pairwise and stores the 16-bit products in the v8 register group. The widening is needed because the product of two 8-bit numbers can need up to 16 bits. The two add instructions advance the pointers u and v by the number of elements just processed (each element is one byte, so t0 is also the byte offset).
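To convince myself that 16 bits are always enough, here is a small sketch that models one lane of the widening multiply and checks every possible pair of int8 inputs. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of one lane of a signed widening multiply: int8 x int8 -> int16. */
static int16_t widening_mul_i8(int8_t a, int8_t b) {
    return (int16_t)((int)a * (int)b);
}

/* Exhaustively confirms that no int8 x int8 product falls outside int16.
 * Returns 1 if every product fits, 0 otherwise. */
static int all_products_fit_int16(void) {
    for (int a = INT8_MIN; a <= INT8_MAX; ++a) {
        for (int b = INT8_MIN; b <= INT8_MAX; ++b) {
            int wide = a * b;
            if (wide < INT16_MIN || wide > INT16_MAX) {
                return 0;
            }
            assert(widening_mul_i8((int8_t)a, (int8_t)b) == wide);
        }
    }
    return 1;
}
```

The extreme cases are (-128) * (-128) = 16384 and (-128) * 127 = -16256, both well inside the int16 range of -32768 to 32767.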
" vsetvli zero,zero,e16,m4,tu,ma \n\t"
" v0,v0,v8 \n\t"
" bnez %[num],1b \n\t"
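Here vsetvli switches the element width to 16 bits, because the products from the multiplication above are 16-bit. With zero as both destination and source, it keeps the current vector length and only changes the type settings. Then v8 is accumulated into the v0 register group. Since the final return value is 32 bits, this is again a widening add: the 16-bit products are sign-extended and added to the 32-bit accumulators. The tu (tail undisturbed) setting matters on a short final pass, where the accumulator lanes beyond the current vector length must keep the partial sums from earlier passes. bnez jumps back to label 1 as long as num is not zero.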
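A small model of this accumulate step, as I read it: each active lane adds a sign-extended 16-bit product to its 32-bit accumulator, and lanes past the current vector length are left alone, which is how I understand the tu flag in the vsetvli above. The function name widening_accumulate is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a widening add into 32-bit accumulators with tail undisturbed:
 * only the first vl lanes are updated. */
static void widening_accumulate(int32_t *acc, const int16_t *prod, size_t vl) {
    assert(acc != NULL && prod != NULL);
    for (size_t i = 0; i < vl; ++i) {
        acc[i] += (int32_t)prod[i]; /* sign-extend 16 -> 32 bits, then add */
    }
    /* Lanes at index >= vl are deliberately not touched. */
}
```

Starting from accumulators {10, 20, 30, 40} and adding products {1, 2} with vl = 2 leaves {11, 22, 30, 40}: the last two partial sums survive the short pass.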
" vsetvli t0,zero,e32,m8,ta,ma \n\t"
" v8,zero \n\t"
" v0,v0,v8 \n\t"
" %[total],v0 \n\t"
Finally, vsetvli switches the element width back to 32 bits and sets the vector length to the maximum again. v8 is then cleared so that it can supply the zero starting value for the reduction, all elements of the v0 register group are summed into element 0 of v0, and that result is moved to the variable total. This completes the function.
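Putting the segments together, the whole function can be modeled in portable C as per-lane accumulators followed by one final reduction. This is only my sketch of the data flow, not the PR's code: dot_product_model and MODEL_LANES are hypothetical names, and the lane count of 8 is an arbitrary example.

```c
#include <assert.h>
#include <stdint.h>

enum { MODEL_LANES = 8 }; /* hypothetical number of elements per pass */

static int dot_product_model(const int8_t *u, const int8_t *v, int num) {
    assert(num >= 0);
    int32_t acc[MODEL_LANES] = {0};                 /* zeroed 32-bit accumulators */
    while (num != 0) {
        int vl = num < MODEL_LANES ? num : MODEL_LANES; /* step size for this pass */
        for (int i = 0; i < vl; ++i) {
            /* widening multiply: 8-bit inputs, 16-bit product */
            int16_t prod = (int16_t)((int)u[i] * (int)v[i]);
            /* widening add into the 32-bit lane; lanes >= vl stay untouched */
            acc[i] += (int32_t)prod;
        }
        num -= vl;                                  /* elements still to process */
        u += vl;                                    /* advance both pointers */
        v += vl;
    }
    int32_t total = 0;                              /* zero start for the reduction */
    for (int i = 0; i < MODEL_LANES; ++i) {
        total += acc[i];                            /* sum all lanes into one value */
    }
    return total;
}
```

For 18 elements with u[i] = i - 9 and v[i] = 2, the model takes passes of 8, 8 and 2 elements and returns -18, the same as a straightforward scalar loop.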
Seen this way, RVV has a real advantage over fixed-width SIMD in adjusting the step size automatically. But since the assembly is hard to read, next time I plan to find some code written with the RVV C intrinsics and dissect that.