1. Background
Some time ago the company launched a recommendation system implemented in Go. After it went live we found that, although the MMR layer does nothing but pure computation, its latency was outrageous. We located the problem with pprof and optimized it, which reduced latency a great deal, but we believed there was still room for improvement.
You can see that the daily average latency was 126ms, with a P95 of 360ms.
Most of the MMR layer's time is spent in the cosine similarity computation, for which we use the gonum library; on x86 platforms its underlying kernels are accelerated with the SSE instruction set. SSE is very old: an xmm register holds only two double-precision floating-point numbers, so only two double-precision operations can run in parallel at a time, whereas the AVX2 instruction set can run four in parallel, for a theoretical 2x speedup. We therefore decided to hand-write AVX2 assembly to replace the gonum implementation.
1.1 Cosine similarity algorithm
The cosine similarity is calculated as

cos(θ) = (a · b) / (‖a‖ × ‖b‖)
The corresponding code is

import "gonum.org/v1/gonum/floats"

func CosineSimilarity(a, b []float64) float64 {
	dotProduct := floats.Dot(a, b) // dot product of vectors a and b
	normA := floats.Norm(a, 2)     // L2 norm of vector a
	normB := floats.Norm(b, 2)     // L2 norm of vector b
	return dotProduct / (normA * normB)
}
2. Acceleration of Dot product calculations
Part of the assembly for gonum's dot product function DotUnitary is shown below:
TEXT ·DotUnitary(SB), NOSPLIT, $0
...
loop_uni:
// sum += x[i] * y[i] unrolled 4x.
MOVUPD 0(R8)(SI*8), X0
MOVUPD 0(R9)(SI*8), X1
MOVUPD 16(R8)(SI*8), X2
MOVUPD 16(R9)(SI*8), X3
MULPD X1, X0
MULPD X3, X2
ADDPD X0, X7
ADDPD X2, X8
ADDQ $4, SI // i += 4
SUBQ $4, DI // n -= 4
JGE loop_uni // if n >= 0 goto loop_uni
...
end_uni:
ADDPD X8, X7
MOVSD X7, X0
UNPCKHPD X7, X7
ADDSD X0, X7
MOVSD X7, sum+48(FP) // Return final sum.
RET
You can see that it uses xmm registers to compute two double-precision floating-point numbers in parallel, and also applies loop unrolling so that 4 elements are processed in each iteration.
We use the AVX2 instruction set to accelerate this by computing four double-precision floating-point numbers in parallel:
loop_uni:
// sum += x[i] * y[i] unrolled 8x.
VMOVUPD 0(R8)(SI*8), Y0 // Y0 = x[i:i+4]
VMOVUPD 0(R9)(SI*8), Y1 // Y1 = y[i:i+4]
VMOVUPD 32(R8)(SI*8), Y2 // Y2 = x[i+4:i+8]
VMOVUPD 32(R9)(SI*8), Y3 // Y3 = y[i+4:i+8]
VMOVUPD 64(R8)(SI*8), Y4 // Y4 = x[i+8:i+12]
VMOVUPD 64(R9)(SI*8), Y5 // Y5 = y[i+8:i+12]
VMOVUPD 96(R8)(SI*8), Y6 // Y6 = x[i+12:i+16]
VMOVUPD 96(R9)(SI*8), Y7 // Y7 = y[i+12:i+16]
VFMADD231PD Y0, Y1, Y8 // Y8 = Y0 * Y1 + Y8
VFMADD231PD Y2, Y3, Y9
VFMADD231PD Y4, Y5, Y10
VFMADD231PD Y6, Y7, Y11
ADDQ $16, SI // i += 16
CMPQ DI, SI
JG loop_uni // if len(x) > i goto loop_uni
You can see that each iteration uses 8 ymm registers at once, i.e. 16 numbers are processed per loop, and the VFMADD231PD instruction performs the multiply and accumulate as a single fused operation.
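The semantics of this unrolled loop can be modeled in pure Go (a sketch for illustration only; the names are ours, and a real FMA also avoids the intermediate rounding of a separate multiply and add):

```go
package main

import "fmt"

// dotUnrolled models the AVX2 kernel above: four independent accumulators
// (standing in for Y8–Y11), 16 elements consumed per iteration, a final
// horizontal reduction, and a scalar loop for the tail.
func dotUnrolled(x, y []float64) float64 {
	var acc [4]float64 // models Y8, Y9, Y10, Y11
	i := 0
	for ; i+16 <= len(x); i += 16 {
		for lane := 0; lane < 4; lane++ { // each "register" covers 4 elements
			base := i + 4*lane
			for k := 0; k < 4; k++ {
				acc[lane] += x[base+k] * y[base+k] // models VFMADD231PD
			}
		}
	}
	sum := acc[0] + acc[1] + acc[2] + acc[3] // horizontal reduction
	for ; i < len(x); i++ {                  // scalar tail
		sum += x[i] * y[i]
	}
	return sum
}

func main() {
	x := make([]float64, 20)
	y := make([]float64, 20)
	for i := range x {
		x[i], y[i] = 1, 2
	}
	fmt.Println(dotUnrolled(x, y)) // 20 ones · 20 twos = 40
}
```

The independent accumulators matter: they break the dependency chain between successive FMAs, letting the pipeline keep several in flight at once.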
Final Benchmark results:
BenchmarkDot          (8 numbers per loop)
BenchmarkDot-2         14994770    78.85 ns/op
BenchmarkDot16        (16 numbers per loop)
BenchmarkDot16-2       22867993    53.46 ns/op
BenchmarkGonumDot     (gonum dot product)
BenchmarkGonumDot-2     8264486   144.4 ns/op
You can see that for the dot product we get roughly a 2.7x performance improvement (144.4 / 53.46 ≈ 2.7).
3. Accelerating the L2 norm calculation
The gonum library does not compute the L2 norm in the conventional way, as sqrt(a1^2 + a2^2 + ... + aN^2); instead it uses the Netlib algorithm, which reduces the risk of overflow and underflow. Its Go source code is as follows:
func L2NormUnitary(x []float64) (norm float64) {
	var scale float64
	sumSquares := 1.0
	for _, v := range x {
		if v == 0 {
			continue
		}
		absxi := math.Abs(v)
		if math.IsNaN(absxi) {
			return math.NaN()
		}
		if scale < absxi {
			s := scale / absxi
			sumSquares = 1 + sumSquares*s*s
			scale = absxi
		} else {
			s := absxi / scale
			sumSquares += s * s
		}
	}
	if math.IsInf(scale, 1) {
		return math.Inf(1)
	}
	return scale * math.Sqrt(sumSquares)
}
The corresponding assembly is rather obscure, but the Go source shows that no parallelism is used: only one number is processed per loop iteration.
TEXT ·L2NormUnitary(SB), NOSPLIT, $0
...
loop:
MOVSD (X_)(IDX*8), ABSX // absxi = x[i]
...
Our optimized core code is as follows:
loop:
VMOVUPD 0(R8)(SI*8), Y0 // Y0 = x[i:i+4]
VMOVUPD 32(R8)(SI*8), Y1 // Y1 = x[i+4:i+8]
VMOVUPD 64(R8)(SI*8), Y2 // Y2 = x[i+8:i+12]
VMOVUPD 96(R8)(SI*8), Y3 // Y3 = x[i+12:i+16]
VMOVUPD 128(R8)(SI*8), Y4 // Y4 = x[i+16:i+20]
VMOVUPD 160(R8)(SI*8), Y5 // Y5 = x[i+20:i+24]
VMOVUPD 192(R8)(SI*8), Y6 // Y6 = x[i+24:i+28]
VMOVUPD 224(R8)(SI*8), Y7 // Y7 = x[i+28:i+32]
VFMADD231PD Y0, Y0, Y8 // Y8 = Y0 * Y0 + Y8
VFMADD231PD Y1, Y1, Y9
VFMADD231PD Y2, Y2, Y10
VFMADD231PD Y3, Y3, Y11
VFMADD231PD Y4, Y4, Y12
VFMADD231PD Y5, Y5, Y13
VFMADD231PD Y6, Y6, Y14
VFMADD231PD Y7, Y7, Y15
ADDQ $32, SI // i += 32
CMPQ DI, SI
JG loop // if len(x) > i goto loop
We went back to the conventional sum-of-squares algorithm so that the computation could be parallelized, processing 32 numbers at once in each loop iteration. The final Benchmark results:
BenchmarkAVX2L2Norm
BenchmarkAVX2L2Norm-2     29381442    40.99 ns/op
BenchmarkGonumL2Norm
BenchmarkGonumL2Norm-2     1822386   659.4 ns/op
You can see that we get roughly a 16x performance improvement (659.4 / 40.99 ≈ 16).
4. Summary
With this optimization, the cosine similarity calculation ends up with a (144.4 + 659.4 * 2) / (53.46 + 40.99 * 2) ≈ 10.8x performance improvement, which is quite significant. Compared with the failed first attempt described in "Remembering a Failed SIMD Instruction Optimization", this time was a real success, and we truly felt the power of SIMD.
We also learned a lot during this optimization.
AVX-512 Instruction Downclocking Issues
AVX-512 offers higher parallelism and therefore theoretically higher performance, but AVX-512 instructions can cause the CPU to downclock, so the industry uses them very cautiously. See, for example, the discussion in ByteDance's JSON parsing library sonic: bytedance/sonic/issues/319.
loop unrolling optimization
Doing more work in each loop iteration has several advantages:
- Reduced loop-control overhead: fewer loop-variable updates and conditional checks, and less chance of branch misprediction
- Increased instruction-level parallelism: more instructions can execute concurrently in the pipeline
However, in our actual Benchmarks, using more registers in a single iteration did keep improving performance, and I have not found any material explaining why or where the limit lies; I hope an expert in this area can enlighten me.