
Common performance optimizations for Go slices


This article introduces some slice-related performance optimizations that are commonly useful in development. Given that the Go compiler does relatively little optimization on its own, the cost of optimizing usually has to be borne by developers themselves.

The optimization tools that will be presented in this article are the following ones:

  1. Pre-allocating memory when creating a slice
  2. Pre-allocating memory before manipulating a slice
  3. Sensible cap values in slice expressions
  4. Optimizing the addition of multiple zero-valued elements
  5. Loop unrolling
  6. Avoiding copy overhead in for-range loops
  7. Bounds check elimination
  8. Parallel processing of slices
  9. Reusing slice memory
  10. Deleting multiple elements efficiently
  11. Reducing GC scanning pressure

This article will not discuss cache hit rates or SIMD. I know both are also relevant to slice performance, but the former is something every qualified developer should already understand, and there are many excellent tutorials online that I need not repeat; the latter should generally not be considered unless the performance bottleneck really is raw data throughput, especially in Go. So neither topic is covered here.

One final reminder before we begin: performance bottlenecks should be located with tests and profiling, and the benefits and costs of any optimization should likewise be measured by performance testing. Remember not to apply these techniques mechanically.

This article is rather long, so I suggest picking out the parts that interest you and reading the rest when you have time.

Index of this article

  • Pre-allocate memory when creating a slice
  • Pre-allocate memory before manipulating a slice
  • Sensible cap values in slice expressions
  • Optimizing the addition of multiple zero-valued elements
  • Loop unrolling
  • Avoiding copy overhead in for-range loops
    • Avoiding copies
    • Avoiding conversion overhead when traversing strings
  • BCE: bounds check elimination
  • Parallel processing of slices
  • Reuse
  • Deleting multiple elements efficiently
    • Deleting all elements
    • Deleting head or tail elements
    • Deleting elements in the middle
  • Reducing GC scanning pressure
  • Summary

Pre-allocate memory when creating a slice

Pre-allocating memory is the most common optimization, and I'll explain it in two parts: at creation time and during use.

Allocating enough memory for the slice to be created ahead of time eliminates the performance loss associated with expanding the slice when elements are added later.

The specific practices are as follows:

s1 := make([]T, 0, preallocatedCount) // preallocate capacity for the expected number of elements

// Another, less common means of pre-allocation; here the element count must be a constant
var arr [elementCount]T
s2 := arr[:]

It's very simple code, and I won't do any performance testing.

As mentioned earlier, growing when appending elements costs performance in two ways. One is that growth has to recompute the slice's new cap, and the amount of computation increased after 1.19 adopted a more gradual growth strategy. The other is reallocating memory: if the slice can't grow in place, a new block of memory has to be allocated, the data moved over, and the original memory freed. The more elements you append, the more likely you are to hit this case, and it is a significant overhead.

The growth strategy used by slices can also be wasteful at times, as in the following example:

func main() {
	var a []int
	for i := 0; i < 2048; i++ {
		a = append(a, i)
	}
	fmt.Println(cap(a)) // go1.22: 2560
}

As you can see, we appended 2048 elements, but Go ended up allocating capacity for 2560, wasting space for over 500 elements.

Pre-allocations are not a silver bullet, though, and there are limited scenarios in which they can be applied:

Applicable Scenarios:

  1. Scenarios where you know exactly how many elements will be in the slice
  2. The number of elements is uncertain but strictly bounded within an interval [x, y]; in that case you can set the preallocation size to y (or y+1). Of course the gap between x and y shouldn't be too large: something like x=1, y=1000 obviously shouldn't be preallocated. The main criterion is the worst-case memory waste rate.

I don't recommend preallocation outside these two scenarios, because allocating memory itself has a performance cost, and preallocating in any other case will inevitably produce a lot of waste, whose cost will likely outweigh the cost of growing on demand.
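As a concrete illustration of the second scenario, here's a minimal sketch; the Item type, its fields, and the bound maxBatch are hypothetical names of mine, not from the original article, with the assumption that at most maxBatch items can ever be collected:

type Item struct {
	id    int64
	valid bool
}

// maxBatch is a known strict upper bound on the number of collected elements
const maxBatch = 64

func collectIDs(items []Item) []int64 {
	ids := make([]int64, 0, maxBatch) // preallocate for the worst case
	for i := range items {
		if items[i].valid {
			ids = append(ids, items[i].id)
		}
	}
	return ids
}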

Pre-allocated memory has another benefit: if the allocation size is a constant or a constant expression, there is a chance escape analysis will recognize it as small enough to be allocated on the stack, improving performance even further. The compiler implements this in the following code:

// /golang/go/blob/master/src/cmd/compile/internal/walk/#L412

// walkMakeSlice walks an OMAKESLICE node.
func walkMakeSlice(n *ir.MakeExpr, init *ir.Nodes) ir.Node {
	l := n.Len
	r := n.Cap
	if r == nil {
		r = safeExpr(l, init)
		l = r
	}
	t := n.Type()
	if t.Elem().NotInHeap() {
		base.Errorf("%v can't be allocated in Go; it is incomplete (or unallocatable)", t.Elem())
	}
	if n.Esc() == ir.EscNone {
		if why := escape.HeapAllocReason(n); why != "" {
			base.Fatalf("%v has EscNone, but %v", n, why)
		}
		// check whether the size is a constant
		i := typecheck.IndexConst(r)
		if i < 0 {
			base.Fatalf("walkExpr: invalid index %v", r)
		}

		// once the check passes, the slice is created as a temporary
		// variable and allocated on the stack
	}

	// the slice escaped: from here on the compiler generates the call
	// to mallocgc that allocates memory from the heap
}

Allocating memory on the stack is faster and puts a little less pressure on the GC, but where an object is allocated is not under our control; all we can do is create opportunities for it to be allocated on the stack.
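To see this in action, you can have the compiler report its escape analysis decisions with go build -gcflags=-m; a minimal sketch (the function names are mine, and the exact messages vary between Go versions):

func constSize() int {
	s := make([]int, 0, 64) // constant cap: reported as "make([]int, 0, 64) does not escape"
	for i := 0; i < 64; i++ {
		s = append(s, i)
	}
	return len(s)
}

func varSize(n int) int {
	s := make([]int, 0, n) // non-constant cap: reported as "make([]int, 0, n) escapes to heap"
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return len(s)
}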

Pre-allocate memory before manipulating a slice

Since the slices package entered the standard library, it has also been possible to preallocate memory when manipulating an existing slice.

Of course it was possible before, but you had to take some detours; if you're interested, look up how it used to be done.

See the results with a simple test:

func BenchmarkAppend(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := []int{1, 2, 3, 4, 5}
		for j := 0; j < 1024; j++ {
			s = append(s, j)
		}
	}
}

func BenchmarkAppendWithGrow(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := []int{1, 2, 3, 4, 5}
		s = slices.Grow(s, 1024)
		for j := 0; j < 1024; j++ {
			s = append(s, j)
		}
	}
}

Here are the results, compared with benchstat:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │              new               │
         │   sec/op    │   sec/op     vs base           │
Append-8   4.149µ ± 3%   1.922µ ± 5%  -53.69% (p=0.000 n=10)

         │      old      │               new              │
         │     B/op      │     B/op      vs base          │
Append-8   19.547Ki ± 0%   9.250Ki ± 0%  -52.68% (p=0.000 n=10)

         │    old     │            new             │
         │ allocs/op  │ allocs/op   vs base        │
Append-8   8.000 ± 0%   1.000 ± 0%  -87.50% (p=0.000 n=10)

Not only is it twice as fast, it also saves 50% of memory; the code using Grow needs just one allocation, compared with eight for the code without it.

The reason for the performance improvement is exactly the same as in the previous section: the overhead of multiple expansions is avoided.

Also the benefits of saving memory are there as in the previous section:

func main() {
	s1 := make([]int, 10, 50) // note that some capacity is already preallocated
	for i := 0; i < 1024; i++ {
		s1 = append(s1, i)
	}
	fmt.Println(cap(s1)) // 1280

	s2 := make([]int, 10, 50)
	s2 = slices.Grow(s2, 1024)
	for i := 0; i < 1024; i++ {
		s2 = append(s2, i)
	}
	fmt.Println(cap(s2)) // 1184
}

As the example shows, memory utilization is 80% for the former versus 86.5% for the latter. Although Grow also uses append's machinery to expand capacity, it makes fuller use of the allocated memory and avoids waste.

Also like the previous section, there are only two applicable scenarios for pre-assignment before use:

  1. Scenarios where you know exactly how many elements will be added to the slice
  2. The number of elements to be appended is uncertain but strictly bounded within an interval [x, y]; in that case you can set the preallocation size to y (or y+1)

Also, if you're concatenating multiple slices, it's best to use slices.Concat, because internally it preallocates enough memory with Grow, which is faster than using append directly. It's a living example of the optimization described in this section.
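For instance, a minimal sketch of concatenation with slices.Concat (available since Go 1.22); merge is my own example name:

func merge(a, b, c []int) []int {
	// a single allocation sized to len(a)+len(b)+len(c),
	// instead of repeated growth from chained appends
	return slices.Concat(a, b, c)
}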

Sensible cap values in slice expressions

/golang/go/pull/64835

Slice expressions in Go may take a third parameter, the cap, in a form like: slice[start:end:capEnd]

Notice I wrote capEnd rather than cap, because this parameter is not the capacity itself: it means the new slice can access the original array or slice up to (but not including) index capEnd. An example: slice[1:2:3] creates a new slice of length 2-1 = 1 that can access the original slice's elements up to index 3-1 = 2; in other words, the elements it can reach are those at index 1 and index 2, so its cap is 2.

Why add this parameter? Because you can limit the scope of the slice access to avoid accidentally changing the data.

Then what is cap when there's no third parameter? It is simply equivalent to cap(old slice) - start.
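A quick sketch of that default; the values in the comments follow directly from the rules above:

s := make([]int, 10, 20)
t := s[5:]     // len(t) = 5, cap(t) = 20 - 5 = 15
u := s[5:8:10] // len(u) = 3, cap(u) = 10 - 5 = 5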

What does this have to do with performance optimization? Look at an example:

func noop(s []int) int {
	return s[1] + s[2]
}

func BenchmarkSlice(b *testing.B) {
	slice := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	for i := 0; i < b.N; i++ {
		noop(slice[1:5])
	}
}

func BenchmarkSliceWithEqualCap(b *testing.B) {
	slice := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	for i := 0; i < b.N; i++ {
		noop(slice[1:5:5])
	}
}

Test results:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
BenchmarkSlice-8                1000000000               0.3263 ns/op          0 B/op          0 allocs/op
BenchmarkSliceWithEqualCap-8    1000000000               0.3015 ns/op          0 B/op          0 allocs/op

Comparing with benchstat, the code using slice[1:5:5] is on average about 3% faster.

There's actually a small Go optimization at work here: when the second and third arguments of a slice expression are the same, cap can directly reuse the previously computed length instead of being calculated separately. That's a few fewer memory accesses and one less subtraction.

If you don't believe me, take a look at the compiler's code:

// slice computes the slice v[i:j:k] and returns ptr, len, and cap of result.
// i,j,k may be nil, in which case they are set to their default value.
// v may be a slice, string or pointer to an array.
func (s *state) slice(v, i, j, k *ssa.Value, bounded bool) (p, l, c *ssa.Value) {
	t := v.Type
	var ptr, len, cap *ssa.Value
	switch {
	case t.IsSlice():
		ptr = s.newValue1(ssa.OpSlicePtr, types.NewPtr(t.Elem()), v)
		// compute the slice's len and cap
		len = s.newValue1(ssa.OpSliceLen, types.Types[types.TINT], v)
		cap = s.newValue1(ssa.OpSliceCap, types.Types[types.TINT], v)
	case t.IsString():
		// omitted, not important here
	case t.IsPtr():
		// likewise omitted
	default:
		s.Fatalf("bad type in slice %v\n", t)
	}

	// for s[:j:k], i defaults to 0
	if i == nil {
		i = s.constInt(types.Types[types.TINT], 0)
	}
	// for s[i:], j is set to len(s)
	if j == nil {
		j = len
	}
	three := true
	// for s[i:j:], k is set to cap(s)
	if k == nil {
		three = false
		k = cap
	}

	// bounds checks on i, j and k happen here

	// think of these simply as the arithmetic operators
	subOp := s.ssaOp(ir.OSUB, types.Types[types.TINT])
	mulOp := s.ssaOp(ir.OMUL, types.Types[types.TINT])
	andOp := s.ssaOp(ir.OAND, types.Types[types.TINT])

	// Calculate the length (rlen) and capacity (rcap) of the new slice.
	// For strings the capacity of the result is unimportant. However,
	// we use rcap to test if we've generated a zero-length slice.
	// Use length of strings for that.
	rlen := s.newValue2(subOp, types.Types[types.TINT], j, i)
	rcap := rlen
	if j != k && !t.IsString() {
		rcap = s.newValue2(subOp, types.Types[types.TINT], k, i)
	}

	// computing where the new slice's memory starts is omitted; not important here

	return rptr, rlen, rcap
}

Overall there's nothing difficult here. Every slice expression eventually reaches this function, which produces the corresponding opcodes; those opcodes receive some relatively simple optimization, and then the compiler uses them to generate a real runnable program.

Focus on the if j != k && !t.IsString() branch. The statement inside it, rcap = s.newValue2(subOp, ..., k, i), translated into ordinary Go code is roughly rcap = k - i; how the value of k is computed is described in the comments above. This means that if the second and third arguments of a slice expression have the same value and the operand is not a string, the length is reused directly as the cap without extra computation. One extra note: although I said "compute", rcap and rlen here are still only expressions; the real results are computed when the program runs. If you're interested, dig into the Go compiler yourself.

It's this small optimization that brings subtle performance improvements.

Of course, these are just details in code generation, and I wouldn't normally recommend this for that reason alone.

So the more important point is the safety mentioned earlier: limiting the range the slice can access and avoiding accidental changes to the data. On top of that there is no performance loss, and even a small gain; icing on the cake.

Applicable scenario: when the cap and length of a slice should in theory be equal, it's best to set both explicitly, for example: slice[i : j+2 : j+2].

That scenario probably accounts for about half of all cases; of course many others don't meet the requirement, so don't apply it rigidly. Everything is subject to performance testing.

Optimizing the addition of multiple zero-valued elements

There are also some tricks for appending zero values to a slice; see the test below:

func BenchmarkAppendZeros1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		slice := []int{}
		slice = append(slice, []int{0, 0, 0, 0, 0}...)
	}
}

// Optimized version
func BenchmarkAppendZeros2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		slice := []int{}
		slice = append(slice, make([]int, 5)...)
	}
}

Test results:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
              │     old     │             new              │
              │   sec/op    │   sec/op     vs base         │
AppendZeros-8   31.79n ± 2%   30.04n ± 2%  -5.50% (p=0.000 n=10)

              │    old     │           new            │
              │    B/op    │    B/op     vs base      │
AppendZeros-8   48.00 ± 0%   48.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │    old     │           new            │
              │ allocs/op  │ allocs/op   vs base      │
AppendZeros-8   1.000 ± 0%   1.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

One line of code, a 5% performance improvement, and no change in memory usage.

The secret remains in the compiler.

Whether it's append(s1, s2...) or append(s1, make([]T, length)...), the compiler has special handling for it.

The process for the former looks like this:

  1. Create s2 (if s2 is a slice literal)
  2. Check s1's cap, expand if not enough
  3. Copy the contents of s2 into s1

The process when using make looks like this:

  1. Check s1's cap, expand if not enough
  2. memclr the free memory for length elements after s1's existing elements (i.e. set that memory to all zeros)

The code is here:/golang/go/blob/master/src/cmd/compile/internal/walk/#L647

The secret to the performance improvement is that you don't have to create a temporary slice, and that memclr is faster because it does fewer and simpler things than copy.

And append(s1, make([]T, length)...) is arguably more readable too, a bonus.

Scenario: when you need to append a run of consecutive zero values to a slice.
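For example, a minimal sketch of zero-padding a buffer to a block boundary; padToBlock is my own helper name, not from the article:

func padToBlock(buf []byte, blockSize int) []byte {
	if rem := len(buf) % blockSize; rem != 0 {
		// compiles down to a capacity check plus memclr; no temporary slice is built
		buf = append(buf, make([]byte, blockSize-rem)...)
	}
	return buf
}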

Loop unrolling

Processing slice data with loops is also a common need. Compared with for-range, the ordinary loop form accesses data more flexibly, as the next section will show, and it isn't affected by 1.22's changes to the runtime behavior of range.

When it comes to loop optimization, loop unrolling is an unavoidable topic. As the name suggests, it takes a loop that would iterate n times and replaces it with one that processes m times more data per iteration, so that the total number of iterations drops to roughly n/m + 1.

Why is this faster? For one, there are far fewer loop jumps and fewer updates and comparisons of the loop condition. For another, modern CPUs have instruction pipelines that can run multiple instructions at the same time; if there are no data dependencies between them (no instruction needing another's result as input), unrolling the loop gives some instructions the opportunity to execute in parallel, increasing throughput.

However, this is usually not a programmer's concern, because how to unroll loops, and when they should or shouldn't be unrolled (unrolling affects whether the containing function can be inlined, among other things), are all things a compiler with a good optimization pipeline should handle.

What about Go, you ask? Naturally, no. Between runtime performance and compilation speed, Go chose compilation speed. It does compile fast; its optimizer, however, is nothing to write home about.

So you'll just have to write it yourself:

func loop(s []int) int {
	sum := 0
	for i := 0; i < len(s); i++ {
		sum += s[i]
	}
	return sum
}

func unroll4(s []int) int {
	sum := 0
	for i := 0; i < len(s); i += 4 {
		sum += s[i]
		sum += s[i+1]
		sum += s[i+2]
		sum += s[i+3]
	}
	return sum
}

func unroll8(s []int) int {
	sum := 0
	for i := 0; i < len(s); i += 8 {
		sum += s[i]
		sum += s[i+1]
		sum += s[i+2]
		sum += s[i+3]
		sum += s[i+4]
		sum += s[i+5]
		sum += s[i+6]
		sum += s[i+7]
	}
	return sum
}

func BenchmarkLoop(b *testing.B) {
	s := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
	for i := 0; i < b.N; i++ {
		loop(s)
	}
}

func BenchmarkUnroll4(b *testing.B) {
	s := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
	for i := 0; i < b.N; i++ {
		unroll4(s)
	}
}

func BenchmarkUnroll8(b *testing.B) {
	s := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
	for i := 0; i < b.N; i++ {
		unroll8(s)
	}
}

The test uses a slice of 32 ints. First, compare the plain loop with the version that processes four elements per iteration:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │              new               │
         │   sec/op    │   sec/op     vs base           │
Unroll-8   9.718n ± 3%   3.196n ± 2%  -67.11% (p=0.000 n=10)

         │    old     │            new             │
         │    B/op    │    B/op     vs base        │
Unroll-8   0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

         │    old     │            new             │
         │ allocs/op  │ allocs/op   vs base        │
Unroll-8   0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

That's a pretty big improvement of almost 67%. Next, compare with processing eight elements at a time:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │              new               │
         │   sec/op    │   sec/op     vs base           │
Unroll-8   9.718n ± 3%   2.104n ± 1%  -78.34% (p=0.000 n=10)

This time it's a 78% improvement; processing eight at a time is about 30% faster than processing four at a time.

For convenience I only handled data whose total length is an integer multiple of the per-iteration count; for other lengths you'd need something like a "Duff's device", which is troublesome to implement in Go, so I was lazy. Still, given the large improvement unrolling brings, if you've determined that a loop over a slice is a performance bottleneck, it's worth trying.
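If you'd rather not implement a Duff's device, a plain tail loop for the leftover elements works too; a minimal sketch (unroll4Any is my own name):

func unroll4Any(s []int) int {
	sum := 0
	i := 0
	// unrolled main loop over complete groups of four
	for ; i+4 <= len(s); i += 4 {
		sum += s[i]
		sum += s[i+1]
		sum += s[i+2]
		sum += s[i+3]
	}
	// tail loop for the 0-3 remaining elements
	for ; i < len(s); i++ {
		sum += s[i]
	}
	return sum
}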

Scenario: the slice's length is kept at a fixed value, and that length is an integer multiple of the amount of data processed per iteration.

Scenarios that require careful performance testing: if a single iteration has to do a lot of work with a long body, unrolling may well help little or even backfire. Too much code affects whether the current function and the functions it calls can be inlined, and it affects the escape analysis of local variables; the former amplifies function call overhead while interfering with branch prediction and pipelined execution, hurting performance, and the latter causes unnecessary escapes that reduce performance and increase heap memory usage.

Also, there's no need to feel bound to 4 or 2 or any particular multiple of elements per iteration: in theory any count gives a noticeable improvement, and tests bear this out; processing 3, 5 or 7 at a time works about as well as 4 or 8. In general the more you process per iteration the more obvious the improvement, but unroll too far and you land in the careful-testing scenario above. So I'd suggest not unrolling beyond 8 elements per iteration.

Avoiding copy overhead in for-range loops

The normal loop structure provides flexible access, but for traversing a slice I think most people's first choice is the for-range structure.

This section is not so much about performance optimization as it is about avoiding the performance pitfalls of "for-ranges".

Let's start with where the traps are.

There are really two traps, one you can basically avoid and the other depends on the situation. We'll start with the one we can avoid completely.

Avoiding copies

The first pitfall is that when range traverses a slice, it copies the data being traversed into the loop variable, and from 1.22 onward each iteration of the loop creates a new instance of that variable. If you don't pay attention, this not only degrades performance but can also sharply increase memory pressure. What we want is to avoid the overhead of unnecessary copies.

As an example, let's fill a slice with structs containing eight int64 fields and one string, then compare the performance with and without copying:

type Data struct {
	a, b, c, d, e, f, g, h int64
	text                   string
}

func generateData(n int) []Data {
	ret := make([]Data, 0, n)
	for i := range int64(n) {
		// the field values are arbitrary test data
		ret = append(ret, Data{
			a: i, b: i + 1, c: i + 2, d: i + 3,
			e: i + 4, f: i + 5, g: i + 6, h: i + 7,
			text: "Test",
		})
	}
	return ret
}

// Example that causes the data to be copied
func BenchmarkRanges1(b *testing.B) {
	data := generateData(100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tmp := int64(0)
		for _, v := range data { // each element is copied into the loop variable v
			tmp -= v.a - v.h // arbitrary arithmetic; it just has to read the element
		}
	}
}

// Example that avoids the copy
func BenchmarkRanges2(b *testing.B) {
	data := generateData(100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tmp := int64(0)
		for i := range data { // note these two lines
			v := &data[i]
			tmp -= v.a - v.h
		}
	}
}

Results:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │             new              │
         │   sec/op    │   sec/op     vs base         │
Ranges-8   33.51n ± 2%   32.63n ± 1%  -2.41% (p=0.000 n=10)

Using pointers or direct index access avoids the copy, and as the results show, the larger the struct, the more pronounced the performance difference. In addition, newer versions of Go changed range semantics from reusing the loop variable to creating a new one each round, which makes for-range loops with copy overhead somewhat slower still.

Scenario: you need to traverse every element, the individual elements are relatively large, and you explicitly don't need the data copied into the loop variable.

Avoiding conversion overhead when traversing strings

Strings may be a bit off-topic, but the point we're about to make is still somewhat related to slices.

The pitfall here is that range traverses a string by converting its contents to runes, which is an extra step with overhead, especially when the string contains only ASCII characters.

Write a simple example to see how much performance is lost:

func checkByte(s string) bool {
	for _, b := range []byte(s) {
		if b == '\n' {
			return true
		}
	}
	return false
}

func checkRune(s string) bool {
	for _, r := range s {
		if r == '\n' {
			return true
		}
	}
	return false
}

func BenchmarkRanges1(b *testing.B) {
	s := "abcdefghijklmnopqrstuvwxyz1234567890."
	for i := 0; i < b.N; i++ {
		checkRune(s)
	}
}

func BenchmarkRanges2(b *testing.B) {
	s := "abcdefghijklmnopqrstuvwxyz1234567890."
	for i := 0; i < b.N; i++ {
		checkByte(s)
	}
}

Here are the results:

goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
         │     old     │              new               │
         │   sec/op    │   sec/op     vs base           │
Ranges-8   36.07n ± 2%   23.95n ± 1%  -33.61% (p=0.000 n=10)

Converting the string to []byte first and then traversing improves performance by a third. In other words, if you don't notice this pitfall, you're giving up that much performance for nothing.

Moreover, converting a string to []byte like this doesn't need to allocate new memory: the data inside the string can be reused directly, provided the converted slice is not modified. Here we hand the slice straight to range, which doesn't modify it, so the conversion's overhead is eliminated.

This optimization has been there since 1.6; if you're curious, look at the compiler code: /golang/go/blob/master/src/cmd/compile/internal/walk/#L316 (the code actually contains other optimizations for this kind of conversion; for example, a []byte converted from a short enough string is allocated on the stack)

Of course, if you're dealing with characters outside of ASCII, such as Chinese characters, then this optimization won't work.

Scenario: the characters you need to iterate over and process are all within the ASCII range, for example strings containing only line breaks plus half-width English letters, digits and punctuation.

BCE: bounds check elimination

Bounds checking means checking, in scenarios such as indexing into a slice, using a slice expression, or creating a slice with make, whether the value of the operand exceeds the allowed limit and whether it would access memory out of bounds. The compiler adds these checks at the relevant locations using information available at compile time, and the checks run at runtime.

This feature is very important for the security of the program.

So does every occurrence of such an expression trigger a bounds check? The answer is no: a bounds check has to read the slice's length or cap and compare against it, and a failed check panics; the whole thing takes time and is not very friendly to branch prediction. Overall, adding a check to every expression that accesses a slice element would drag down performance.

Thus bounds check elimination appeared as a matter of course: in scenarios where the index obviously cannot be out of bounds, the check is entirely unnecessary.

How can you see where the compiler inserted checks? Use the following command: go build -gcflags='-d=ssa/check_bce'

Take unroll4 from the earlier section as an example:

$ go build -gcflags='-d=ssa/check_bce' 

# command-line-arguments
./:8:11: Found IsInBounds
./:9:11: Found IsInBounds
./:10:11: Found IsInBounds
./:11:11: Found IsInBounds

Currently you will see two kinds of output: IsInBounds and IsSliceInBounds. Both mark inserted bounds checks; what they check is roughly the same, with only minor differences. If you're interested, see how ssa generates code for both: /golang/go/blob/master/src/cmd/compile/internal/ssa/#L25798

So how do these checks get eliminated? The cases can be enumerated, but they're bound to change as the compiler evolves, so I won't list them all.

Since I'm not enumerating them, here's the rough general rule: if, before a slice access by index, the compiler can deduce from the preceding expressions that the index cannot be out of bounds, the check can be eliminated.

A few examples:

s1 := make([]T, 10)
_ = s1[9] // a constant index is checked at compile time, so no runtime check is inserted
_ = s1[i&6] // the index value is necessarily between 0 and 6; check eliminated

var s2 []int
_ = s2[:i] // check
_ = s2[:i] // repeated access; bounds check eliminated
_ = s2[:i+1] // check
_ = s2[:i+1] // repeated expression, already checked, so the check is eliminated

func f(s []int) int {
    if len(s) < 3 {
        panic("error")
    }

    return s[1] + s[2] // The previous if guarantees that these two accesses must not be out of bounds, so the check can be eliminated
}

// A common practice to avoid multiple bounds checks through temporary variables
func f2(s []int) int {
    tmp := s[:4:4] // Boundary checking here. It also makes use of the aforementioned sensible setting of the slice expression's cap to avoid extra overhead.
    a := tmp[2] // The check at tmp ensures that it won't be out of bounds here, so it won't be checked again.
    b := tmp[3] // same as above
    return a+b
}

I haven't listed all the examples; if you want to see them, look here.

Of course, there are also hidden scenarios where the check cannot be eliminated:

func f3(s []int, i int) {
    if i < len(s) {
        fmt.Println(s[i]) // can't be eliminated, because i is a signed integer and may be less than 0
    }
}

func f4(s []int, i int) {
    if 0 < i && i < len(s) {
        fmt.Println(s[i+2]) // can't be eliminated, because i is a signed integer: if i+2 overflows, the index would wrap around to a negative value
    }
}

With this knowledge in hand, our earlier unroll4 has four bounds checks per iteration, which isn't really necessary, so it can be changed to look like this:

func unroll4(s []int) int {
	sum := 0
	for i := 0; i < len(s); i += 4 {
		tmp := s[i : i+4 : i+4] // only one bounds check here
		sum += tmp[0]
		sum += tmp[1]
		sum += tmp[2]
		sum += tmp[3]
	}
	return sum
}

This still performs one check per iteration; could the check be eliminated completely?

func unroll4(s []int) int {
	sum := 0
	for len(s) >= 4 {
		sum += s[0]
		sum += s[1]
		sum += s[2]
		sum += s[3]
		s = s[4:] // skip past the four processed elements; since len(s) >= 4, no check is needed here either
	}
	return sum
}

This eliminates the checks completely, at the cost of one extra slice assignment.

However, my example here is just too simple: performance tests show the bounds check elimination brings no gain, and the fully eliminated version is actually marginally slower than the single-check version because of the extra slice assignment.

For an example with a more pronounced effect, see this blog.

Scenario: places where the result of len(slice) can be effectively exploited are worth trying BCE on.

In other cases, performance testing has to determine whether there's an improvement and how large it is. For changes like these, which neither add safety (as setting a slice expression's cap does) nor improve readability (as adding zero values in bulk with make does), I personally wouldn't bother unless there's a real performance bottleneck with no other way to optimize; otherwise, if the improvement is under 5%, I don't recommend making them.

Parallel processing of slices

Having covered loop unrolling, a further optimization built on it is parallel processing. Parallel here doesn't mean SIMD; it means parallelism that relies on goroutines.

The precondition for parallelizing is that processing the slice's elements must not depend on one another; something like the processing of s[1] relying on the result of processing s[0] rules it out.

Once you've determined that the slice's processing can be parallelized, you can write some parallel code, for example a parallel sum:

func Sum(s []int64) int64 {
	// assume the length of s is 4000
	var sum atomic.Int64
	var wg sync.WaitGroup
	// each goroutine processes 800 elements
	for i := 0; i < len(s); i += 800 {
		wg.Add(1)
		go func(ss []int64) {
			defer wg.Done()
			var ret int64
			for j := range ss {
				ret += ss[j]
			}
			sum.Add(ret)
		}(s[i : i+800])
	}
	wg.Wait()
	return sum.Load()
}

Very simple code. As with loop unrolling, you additionally need to take care of leftover elements when the length isn't an exact multiple of the chunk size.
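A sketch of handling arbitrary lengths by clamping the final chunk; SumChunked and chunk are my own names, the sync and sync/atomic packages are used as above, and the min builtin requires Go 1.21+:

func SumChunked(s []int64) int64 {
	const chunk = 800
	var sum atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < len(s); i += chunk {
		end := min(i+chunk, len(s)) // clamp the last chunk
		wg.Add(1)
		go func(ss []int64) {
			defer wg.Done()
			var ret int64
			for j := range ss {
				ret += ss[j]
			}
			sum.Add(ret)
		}(s[i:end])
	}
	wg.Wait()
	return sum.Load()
}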

In addition, creating and destroying goroutines and synchronizing the data all take time; if the slice has very few elements, processing them in parallel isn't worthwhile.

Scenario: the slice has many elements, the processing of elements can be parallelized without mutual interference, and, importantly, the Go program can use more than one CPU core, so the code can truly run "in parallel".

Reuse

Reusing slices is a common technique, and reusing []byte is the most common case of all.

Reuse can make use of sync.Pool, or it can look like the following:

buf := make([]byte, 1024)
for {
	read(buf)
	...
	// reuse
	buf = buf[:0]
}

Here buf = buf[:0] leaves the slice's cap unchanged while zeroing its length, so the slice's memory can be reused. Resetting the length to zero like this is also needed when returning slices to a sync.Pool.

Also, when using sync.Pool, take care that the pooled slices aren't too large, or they too will increase the burden on the GC. Generally speaking I wouldn't pool slices larger than 1M; but the size cap must ultimately be determined by the project's requirements and performance testing.
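A minimal sketch of pooling []byte buffers with sync.Pool; processBuffer and the 1024 starting capacity are my own choices:

var bufPool = sync.Pool{
	New: func() any {
		s := make([]byte, 0, 1024)
		return &s // store a pointer so Put doesn't cause an extra allocation
	},
}

func processBuffer() {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0] // reset the length before use
	// ... use buf ...
	*bp = buf
	bufPool.Put(bp)
}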

Scenario: your slice's memory can be used repeatedly (ideally reused directly without even cleaning, since cleaning discounts the optimization a little) and repeatedly creating slices has genuinely become a performance bottleneck.

Deleting multiple elements efficiently

Deleting elements is also a common requirement; here too we'll discuss it in three cases.

All three are covered by the standard library's slices package, so I recommend using the standard library rather than writing your own. This section therefore has no single applicability note; instead, each subsection gives advice for its specific scenarios.

Deleting all elements

If you don't intend to reuse the slice after deleting its elements, just set it to nil.

If you do want to reuse the memory, use the s := s[:0] trick mentioned in the Reuse section; but by itself that isn't enough: all the deleted elements should also be zeroed out to prevent memory leaks. Before 1.21 this is how we had to do it:

func deleteSlice[T any, S ~[]T](s S) S {
	var zero T
	for i := range s {
		s[i] = zero
	}
	return s[:0]
}

After 1.21 we have the clear built-in function, and the code can be drastically simplified:

func deleteSlice[T any, S ~[]T](s S) S {
	clear(s)
	return s[:0]
}

The two ways of writing it perform the same, because Go specifically optimizes writing zero values in a for-range loop; the effect is the same as calling clear directly:

func BenchmarkClear(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var a = [...]uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		clear(a[:])
	}
}

func BenchmarkForRange(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var a = [...]uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		for j := range a {
			a[j] = 0
		}
	}
}
goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
BenchmarkClear-8      	1000000000	         0.2588 ns/op	       0 B/op	       0 allocs/op
BenchmarkForRange-8   	1000000000	         0.2608 ns/op	       0 B/op	       0 allocs/op

But if the loop is not in for-range form, it doesn't get this optimization:

func BenchmarkClear(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var a = [...]uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		clear(a[:])
	}
}

func BenchmarkForLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var a = [...]uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		for j := 0; j < 20; j++ {
			a[j] = 0
		}
	}
}
goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz
BenchmarkClear-8     	1000000000	         0.2613 ns/op	       0 B/op	       0 allocs/op
BenchmarkForLoop-8   	173418799	         7.088 ns/op	       0 B/op	       0 allocs/op

The speed differs by an order of magnitude. If you're interested in the "loop write zero" optimization, you can see how it's implemented here: arrayclear. The optimization also works for maps.

We can also briefly compare clearing with setting to nil:

func BenchmarkDeleteWithClear(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		clear(a)
		a = a[:0]
	}
}

func BenchmarkDeleteWithSetNil(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		a = a[:] // prevent the compiler from treating a as unused
		a = nil
	}
}

From the results, it doesn't make much difference if it's just a deletion operation:

BenchmarkDeleteWithClear-8      1000000000               0.2592 ns/op          0 B/op          0 allocs/op
BenchmarkDeleteWithSetNil-8     1000000000               0.2631 ns/op          0 B/op          0 allocs/op

Which one to choose mainly depends on whether you'll reuse the slice's memory afterwards: if you need to reuse it, use clear; otherwise just set it to nil.

Deleting head or tail elements

Removing tail elements is the easiest and fastest: just s = s[:index]. Be careful not to forget to clear the deleted portion with clear.

The only downside of this method is that the memory of the deleted portion isn't freed. That usually doesn't hurt, since it can be reused when new elements are added; but if you won't reuse the memory and are sensitive to the waste, you'll have to allocate a new slice and copy over the elements you want to keep. Be aware that this is much slower and consumes more memory during the deletion, since the old and new slices have to exist at the same time.

There are more options for removing head elements. Two common ones (assuming we need to keep the relative order of the remaining elements): s = s[index+1:] or s = append(s[:0], s[index+1:]...)

The former creates a new slice whose underlying array starts at the original slice's element index+1. Note that although the underlying array is reused, the cap effectively shrinks, and the deleted portion of memory gets no further chance to be reused. This method requires zeroing the deleted elements before the deletion.

The latter doesn't create a new slice; it shifts the elements starting at index+1 to the head of the slice, so the head elements are deleted by being overwritten. With this scheme you don't need to clear elements actively; if you're worried about the space left at the tail after the move you can still choose to clear it, but generally it isn't necessary.

In theory the former wastes memory but performs better, though performance should always be proven by a benchmark:

func BenchmarkClearWithReSlice(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		// delete the first 7 elements
		clear(a[:7])
		a = a[7:]
	}
}

func BenchmarkClearWithAppend(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}
		// delete the first 7 elements by shifting
		a = append(a[:0], a[7:]...)
	}
}

The test results show that the first method is indeed faster:

BenchmarkClearWithReSlice-8     1000000000               0.2636 ns/op          0 B/op          0 allocs/op
BenchmarkClearWithAppend-8      100000000               10.82 ns/op            0 B/op          0 allocs/op

Append is an order of magnitude slower, and even though memmove has been optimized quite a bit, moving data around in memory is still slow.

In real applications, choose the appropriate scheme based on a combined consideration of memory efficiency and speed.

Deleting elements in the middle

To delete elements in the middle while keeping the relative order of the rest, the only approach is to shift the elements behind the deleted range forward to cover them:

s := append(s[:index], s[index+n:]...)

This method also doesn't require actively clearing the deleted elements, since they are overwritten. Using append instead of a for loop has two advantages besides the poor loop optimization mentioned earlier: cleaner code and the ability to take advantage of memmove.
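The standard library wraps this pattern as slices.Delete; a minimal usage sketch (note that since Go 1.22, slices.Delete also zeroes the now-unused tail elements for you):

s := []int{0, 1, 2, 3, 4, 5}
s = slices.Delete(s, 1, 3) // removes s[1:3]; s is now [0, 3, 4, 5]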

Since the approach is essentially the only one, there isn't much to compare, so I haven't benchmarked it.

Reducing GC scanning pressure

Simply put: try not to store large numbers of pointers, or structs containing pointers, in a slice. The more pointers there are, the more work the GC has to do when scanning objects, which eventually shows up as degraded performance.

More detailed explanations and performance tests can be found in this article.

Applicable scenario: when there's no special requirement and the elements aren't particularly large, prefer storing values over pointers.

As a tradeoff, if you choose to store values, you have to watch out for the overhead of the extra copying.
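A minimal sketch of the difference; the Point type is my own example:

type Point struct{ x, y int64 }

values := make([]Point, 1_000_000)    // pointer-free backing array: the GC doesn't scan its contents
pointers := make([]*Point, 1_000_000) // one pointer per element: the GC must follow each of them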

Summary

In my personal experience, the most frequently used optimizations are, in order: pre-allocating memory, avoiding the for-range pitfalls, slice reuse, and loop unrolling. They also bring the biggest improvements.

Until the compiler's optimizations become powerful enough, you'll have to apply these optimization tricks yourself.

Sometimes you can also exploit escape analysis rules for optimization, but as this article says, in the vast majority of cases you shouldn't even be considering escape analysis.

There is another path: contributing code to the Go compiler to improve the performance of what it compiles. There will be plenty of resistance, but I believe someone capable will get it done. That's also why I included compiler code in this post: as a brick tossed out to attract jade, as the saying goes.

And most important of all: whether locating performance problems or optimizing them, everything must be based on performance testing. Remember not to rely solely on "experience" or on inferences unsupported by facts.

Finally, I hope this article can be a handy tool when you optimize performance, rather than rote material to memorize for interviews.