1. Background
Some time ago the company launched a recommendation system implemented in Go. After it went live we found that, although the MMR layer does nothing but pure computation, its latency was outrageous. We located the problem with pprof and optimized it, which reduced latency a great deal, but we believed there was still room for improvement.
You can see that the daily average latency was 126ms, with a P95 of 360ms.
Most of the MMR layer's time is spent in the cosine similarity computation, for which we use the gonum library; on x86 platforms its underlying kernels are accelerated with the SSE instruction set. SSE is very old: an xmm register holds only two double-precision floating-point numbers, so only two double-precision operations can run in parallel at a time, whereas the AVX2 instruction set can run four in parallel, for a theoretical 2x speedup. We therefore decided to hand-write AVX2 assembly to replace the gonum implementation.
1.1 Cosine similarity algorithm
The cosine similarity is calculated as

cos(θ) = (a · b) / (‖a‖ × ‖b‖)
The corresponding code is

import "gonum.org/v1/gonum/floats"

func CosineSimilarity(a, b []float64) float64 {
	dotProduct := floats.Dot(a, b) // dot product of vectors a and b
	normA := floats.Norm(a, 2)     // L2 norm of vector a
	normB := floats.Norm(b, 2)     // L2 norm of vector b
	return dotProduct / (normA * normB)
}
2. Acceleration of Dot product calculations
Part of the assembly for gonum's dot product function DotUnitary is shown below:
TEXT ·DotUnitary(SB), NOSPLIT, $0
...
loop_uni:
// sum += x[i] * y[i] unrolled 4x.
MOVUPD 0(R8)(SI*8), X0
MOVUPD 0(R9)(SI*8), X1
MOVUPD 16(R8)(SI*8), X2
MOVUPD 16(R9)(SI*8), X3
MULPD X1, X0
MULPD X3, X2
ADDPD X0, X7
ADDPD X2, X8
ADDQ $4, SI // i += 4
SUBQ $4, DI // n -= 4
JGE loop_uni // if n >= 0 goto loop_uni
...
end_uni:
ADDPD X8, X7
MOVSD X7, X0
UNPCKHPD X7, X7
ADDSD X0, X7
MOVSD X7, sum+48(FP) // Return final sum.
RET
You can see that it uses xmm registers to compute two double-precision floating-point numbers in parallel, and also applies loop unrolling so that 4 elements are processed in each iteration.
We use the AVX2 instruction set to accelerate this by computing four double-precision floating-point numbers in parallel:
loop_uni:
// sum += x[i] * y[i] unrolled 8x.
VMOVUPD 0(R8)(SI*8), Y0 // Y0 = x[i:i+4]
VMOVUPD 0(R9)(SI*8), Y1 // Y1 = y[i:i+4]
VMOVUPD 32(R8)(SI*8), Y2 // Y2 = x[i+4:i+8]
VMOVUPD 32(R9)(SI*8), Y3 // Y3 = y[i+4:i+8]
VMOVUPD 64(R8)(SI*8), Y4 // Y4 = x[i+8:i+12]
VMOVUPD 64(R9)(SI*8), Y5 // Y5 = y[i+8:i+12]
VMOVUPD 96(R8)(SI*8), Y6 // Y6 = x[i+12:i+16]
VMOVUPD 96(R9)(SI*8), Y7 // Y7 = y[i+12:i+16]
VFMADD231PD Y0, Y1, Y8 // Y8 = Y0 * Y1 + Y8
VFMADD231PD Y2, Y3, Y9
VFMADD231PD Y4, Y5, Y10
VFMADD231PD Y6, Y7, Y11
ADDQ $16, SI // i += 16
CMPQ DI, SI
JG loop_uni // if len(x) > i goto loop_uni
You can see that each iteration uses 8 ymm registers at once, i.e. 16 numbers are processed per loop, and the VFMADD231PD instruction performs the multiply and accumulate as a single fused operation.
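The semantics of this unrolled loop can be modeled in pure Go (a sketch for illustration only; the names are ours, and a real FMA also avoids the intermediate rounding of a separate multiply and add):

```go
package main

import "fmt"

// dotUnrolled models the AVX2 kernel above: four independent accumulators
// (standing in for Y8–Y11), 16 elements consumed per iteration, a final
// horizontal reduction, and a scalar loop for the tail.
func dotUnrolled(x, y []float64) float64 {
	var acc [4]float64 // models Y8, Y9, Y10, Y11
	i := 0
	for ; i+16 <= len(x); i += 16 {
		for lane := 0; lane < 4; lane++ { // each "register" covers 4 elements
			base := i + 4*lane
			for k := 0; k < 4; k++ {
				acc[lane] += x[base+k] * y[base+k] // models VFMADD231PD
			}
		}
	}
	sum := acc[0] + acc[1] + acc[2] + acc[3] // horizontal reduction
	for ; i < len(x); i++ {                  // scalar tail
		sum += x[i] * y[i]
	}
	return sum
}

func main() {
	x := make([]float64, 20)
	y := make([]float64, 20)
	for i := range x {
		x[i], y[i] = 1, 2
	}
	fmt.Println(dotUnrolled(x, y)) // 20 ones · 20 twos = 40
}
```

The independent accumulators matter: they break the dependency chain between successive FMAs, letting the pipeline keep several in flight at once.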
Final Benchmark results:
BenchmarkDot          (8 numbers per loop)
BenchmarkDot-2         14994770    78.85 ns/op
BenchmarkDot16        (16 numbers per loop)
BenchmarkDot16-2       22867993    53.46 ns/op
BenchmarkGonumDot     (gonum dot product)
BenchmarkGonumDot-2     8264486   144.4 ns/op
You can see that for the dot product we get roughly a 2.7x performance improvement (144.4 / 53.46 ≈ 2.7).
3. Accelerating the L2 norm calculation
The gonum library does not compute the L2 norm in the conventional way, as sqrt(a1^2 + a2^2 + ... + aN^2); instead it uses the Netlib algorithm, which reduces the risk of overflow and underflow. Its Go source code is as follows:
func L2NormUnitary(x []float64) (norm float64) {
	var scale float64
	sumSquares := 1.0
	for _, v := range x {
		if v == 0 {
			continue
		}
		absxi := math.Abs(v)
		if math.IsNaN(absxi) {
			return math.NaN()
		}
		if scale < absxi {
			s := scale / absxi
			sumSquares = 1 + sumSquares*s*s
			scale = absxi
		} else {
			s := absxi / scale
			sumSquares += s * s
		}
	}
	if math.IsInf(scale, 1) {
		return math.Inf(1)
	}
	return scale * math.Sqrt(sumSquares)
}
The corresponding assembly is rather obscure, but the Go source shows that no parallelism is used: only one number is processed per loop iteration.
TEXT ·L2NormUnitary(SB), NOSPLIT, $0
...
loop:
MOVSD (X_)(IDX*8), ABSX // absxi = x[i]
...
Our optimized core code is as follows:
loop:
VMOVUPD 0(R8)(SI*8), Y0 // Y0 = x[i:i+4]
VMOVUPD 32(R8)(SI*8), Y1 // Y1 = x[i+4:i+8]
VMOVUPD 64(R8)(SI*8), Y2 // Y2 = x[i+8:i+12]
VMOVUPD 96(R8)(SI*8), Y3 // Y3 = x[i+12:i+16]
VMOVUPD 128(R8)(SI*8), Y4 // Y4 = x[i+16:i+20]
VMOVUPD 160(R8)(SI*8), Y5 // Y5 = x[i+20:i+24]
VMOVUPD 192(R8)(SI*8), Y6 // Y6 = x[i+24:i+28]
VMOVUPD 224(R8)(SI*8), Y7 // Y7 = x[i+28:i+32]
VFMADD231PD Y0, Y0, Y8 // Y8 = Y0 * Y0 + Y8
VFMADD231PD Y1, Y1, Y9
VFMADD231PD Y2, Y2, Y10
VFMADD231PD Y3, Y3, Y11
VFMADD231PD Y4, Y4, Y12
VFMADD231PD Y5, Y5, Y13
VFMADD231PD Y6, Y6, Y14
VFMADD231PD Y7, Y7, Y15
ADDQ $32, SI // i += 32
CMPQ DI, SI
JG loop // if len(x) > i goto loop
We went back to the conventional sum-of-squares algorithm so that the computation could be parallelized, processing 32 numbers at once in each loop iteration. The final Benchmark results:
BenchmarkAVX2L2Norm
BenchmarkAVX2L2Norm-2     29381442    40.99 ns/op
BenchmarkGonumL2Norm
BenchmarkGonumL2Norm-2     1822386   659.4 ns/op
You can see that we get roughly a 16x performance improvement (659.4 / 40.99 ≈ 16).
4. Summary
With this optimization, the cosine similarity calculation ends up with a (144.4 + 659.4 * 2) / (53.46 + 40.99 * 2) ≈ 10.8x performance improvement, which is quite significant. Compared with the failed first attempt described in "Remembering a Failed SIMD Instruction Optimization", this time was a real success, and we truly felt the power of SIMD.
We also learned a lot during this optimization.
AVX-512 Instruction Downclocking Issues
AVX-512 offers higher parallelism and therefore theoretically higher performance, but AVX-512 instructions can cause the CPU to downclock, so the industry uses them very cautiously. See, for example, the discussion in ByteDance's JSON parsing library sonic: bytedance/sonic/issues/319.
loop unrolling optimization
Doing more work in each loop iteration has several advantages:
- Reduced loop-control overhead: fewer loop-variable updates and conditional checks, and less chance of branch misprediction
- Increased instruction-level parallelism: more instructions can execute concurrently in the pipeline
However, in our actual Benchmarks, using more registers in a single iteration did keep improving performance, and I have not found any material explaining why or where the limit lies; I hope an expert in this area can enlighten me.