
Golang's goroutines may be slowing you down | A timing analysis of merge sort with a million goroutines


Preface

This article analyzes the time overhead of three versions of merge sort, sequential and goroutine-based (with the length of the sorted slice ranging from 128 to 10,000,000).

Demo address for this issue: /BaiZe1998/go-learning

Past video explanations 📺 on Bilibili: ShirasawaTalk; WeChat official account: Shirazetalk


Thinking concurrency is always faster

  • Thread: the basic unit of OS scheduling, dispatched onto a CPU for execution. Thread switching is a heavyweight operation, because the OS must save the running thread's CPU context, move the thread back to the runnable state, and schedule another runnable thread onto the CPU.
  • Goroutine: a thread switch is an OS-level context switch across CPU cores, while a goroutine switch happens entirely within the Go runtime. A goroutine also starts with far less memory than a thread (a 2KB initial stack vs. roughly 2MB), and a goroutine context switch is 80-90% cheaper than a thread switch (see the sketch after this list for a rough memory measurement).
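
The 2KB figure is a goroutine's initial stack size; the runtime grows stacks on demand. As a rough illustration (my sketch, not from the article), you can compare the runtime's memory statistics before and after parking a large number of idle goroutines; exact numbers vary by Go version and platform:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Rough per-goroutine memory estimate: spawn many goroutines that park
// on a channel, then compare memory stats. An illustration, not a benchmark.
func main() {
	const n = 100000
	var before, after runtime.MemStats

	runtime.GC()
	runtime.ReadMemStats(&before)

	var wg sync.WaitGroup
	wg.Add(n)
	block := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() {
			wg.Done()
			<-block // park the goroutine so its stack stays allocated
		}()
	}
	wg.Wait()

	runtime.ReadMemStats(&after)
	fmt.Printf("~%d bytes per goroutine\n", (after.Sys-before.Sys)/n)
	close(block)
}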

🌟 GMP Model:

  • G: goroutine
    • Running: scheduled onto an M and executing
    • Runnable: waiting to be scheduled
    • Waiting: blocked for some reason
  • M: OS thread
  • P: CPU core
    • Each P has a local queue of Gs (a task queue)
    • All Ps share one global queue of Gs (a task queue)

Scheduling rule: every OS thread (M) must be scheduled onto a P in order to run, and every G runs on an M.
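
The number of Ps equals GOMAXPROCS, which defaults to the number of CPU cores. A minimal sketch to inspect this on your own machine:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reads the current value without changing it;
	// it is the number of Ps, i.e. how many Gs can run in parallel.
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0))
	fmt.Println("CPU cores:      ", runtime.NumCPU())
	// Ms (OS threads) are created by the runtime on demand.
}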

[Figure: the GMP model scheduling goroutines on a machine with a 4-core CPU]

🌟 The figure above shows a machine with a 4-core CPU scheduling goroutines:

Here P2 is idle: M3 has finished executing and released P2, and P2's local queue is empty, so there is no G it can schedule. At certain intervals, however, the Go runtime will steal Gs from the global queue and from other Ps' local queues to keep it busy (in the figure there are currently 6 runnable Gs).

Notably, before Go 1.14 goroutine scheduling was cooperative: a goroutine would only yield on blocking operations (I/O, channel, mutex, etc.). Since Go 1.14, a goroutine that has been running for more than 10ms is marked preemptible and can be preempted so that other goroutines get the P.
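
A sketch of the behavior change (my example, assuming a single P): before Go 1.14 a tight loop with no function calls could starve every other goroutine; since Go 1.14 the runtime's asynchronous preemption interrupts it:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // single P: only one G runs at a time

	go func() {
		for { // tight loop: no blocking, no function calls
		}
	}()

	time.Sleep(100 * time.Millisecond)
	// Before Go 1.14 this line might never print: the busy loop never
	// reaches a yield point, so main never gets the P back. Since
	// Go 1.14 the loop is preempted asynchronously and main resumes.
	fmt.Println("main was scheduled again")
}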

A merge sort case study

🌟 To show that more goroutines do not necessarily mean better performance, here are three versions of merge sort:


Main function:

package main

import (
	"fmt"
	"time"
)

func main() {
	size := []int{128, 512, 1024, 2048, 100000, 1000000, 10000000}
	sortVersion := []struct {
		name string
		sort func([]int)
	}{
		{"Mergesort V1", SequentialMergesortV1},
		{"Mergesort V2", SequentialMergesortV2},
		{"Mergesort V3", SequentialMergesortV3},
	}
	for _, s := range size {
		fmt.Printf("Testing Size: %d\n", s)
		o := GenerateSlice(s)
		for _, v := range sortVersion {
			s := make([]int, len(o))
			copy(s, o)
			start := time.Now()
			v.sort(s)
			elapsed := time.Since(start)
			fmt.Printf("%s: %s\n", v.name, elapsed)
		}
		fmt.Println()
	}
}
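
The main function relies on GenerateSlice, which the article does not show. A plausible sketch (the repo's implementation may differ):

// GenerateSlice fills a slice of length n with pseudo-random integers.
// Requires: import "math/rand"
func GenerateSlice(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = rand.Intn(n)
	}
	return s
}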

v1 implementation:

func SequentialMergesortV1(s []int) {
	if len(s) <= 1 {
		return
	}
	middle := len(s) / 2
	SequentialMergesortV1(s[:middle])
	SequentialMergesortV1(s[middle:])
	Merge(s, middle)
}

func Merge(s []int, middle int) {
	// ...
}
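
The merge step is elided above. For completeness, a minimal sketch of what Merge could look like, merging the two sorted halves s[:middle] and s[middle:] through a temporary buffer (my assumption, not the repo's code):

// Merge merges the two sorted halves s[:middle] and s[middle:] in place,
// using a temporary buffer of the same length.
func Merge(s []int, middle int) {
	tmp := make([]int, 0, len(s))
	i, j := 0, middle
	for i < middle && j < len(s) {
		if s[i] <= s[j] {
			tmp = append(tmp, s[i])
			i++
		} else {
			tmp = append(tmp, s[j])
			j++
		}
	}
	tmp = append(tmp, s[i:middle]...) // leftover of the left half
	tmp = append(tmp, s[j:]...)       // leftover of the right half
	copy(s, tmp)
}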

v2 implementation:

func SequentialMergesortV2(s []int) {
	if len(s) <= 1 {
		return
	}
	middle := len(s) / 2

	var wg sync.WaitGroup
	wg.Add(2)

	go func() {
		defer wg.Done()
		SequentialMergesortV2(s[:middle])
	}()
	go func() {
		defer wg.Done()
		SequentialMergesortV2(s[middle:])
	}()
	wg.Wait()
	Merge(s, middle)
}

v3 implementation:

const max = 2048

func SequentialMergesortV3(s []int) {
	if len(s) <= 1 {
		return
	}
	if len(s) < max {
		SequentialMergesortV1(s)
	} else {
		middle := len(s) / 2

		var wg sync.WaitGroup
		wg.Add(2)

		go func() {
			defer wg.Done()
			SequentialMergesortV3(s[:middle])
		}()
		go func() {
			defer wg.Done()
			SequentialMergesortV3(s[middle:])
		}()

		wg.Wait()
		Merge(s, middle)
	}
}

Time-consuming comparison:

Testing Size: 128
Mergesort V1: 10.583µs
Mergesort V2: 211.25µs
Mergesort V3: 7.5µs

Testing Size: 512
Mergesort V1: 34.208µs
Mergesort V2: 500.666µs
Mergesort V3: 60.291µs

Testing Size: 1024
Mergesort V1: 48.666µs
Mergesort V2: 643.583µs
Mergesort V3: 47.084µs

Testing Size: 2048
Mergesort V1: 94.666µs
Mergesort V2: 786.125µs
Mergesort V3: 77.667µs

Testing Size: 100000
Mergesort V1: 6.810917ms
Mergesort V2: 58.148291ms
Mergesort V3: 4.219375ms

Testing Size: 1000000
Mergesort V1: 62.053875ms
Mergesort V2: 558.586458ms
Mergesort V3: 37.99375ms

Testing Size: 10000000
Mergesort V1: 632.3875ms
Mergesort V2: 5.764997041s
Mergesort V3: 388.343584ms

Since creating and scheduling goroutines has its own overhead, and v2 spawns goroutines no matter how few elements remain, the many tiny sub-slices end up paying more for goroutine creation and scheduling than for the sorting itself. That is why v2 performs worse than the sequential v1.

In particular, the number of goroutines created by v2 runs into the millions once the slice length reaches 10,000,000.

On this machine, after tuning, the third version outperforms the first: it sorts in parallel only when a sub-slice has more than 2048 elements, and falls back to sequential sorting below that. But 2048 is a magic number that may vary from computer to computer. The point being demonstrated is that relying blindly on concurrency/parallelism does not necessarily improve performance; you have to account for the overhead of the concurrency itself.
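
Timing with time.Since in main is fine for a rough comparison, but Go's built-in benchmarking gives more stable numbers. A sketch of how the same comparison could be written as a standard benchmark (my addition; it assumes the sort functions and GenerateSlice live in the same package):

package main

import "testing"

// Run with: go test -bench=. -benchmem
func benchmarkSort(b *testing.B, sort func([]int), size int) {
	o := GenerateSlice(size)
	s := make([]int, len(o))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		copy(s, o) // restore the unsorted input outside the timed region
		b.StartTimer()
		sort(s)
	}
}

func BenchmarkV1_2048(b *testing.B) { benchmarkSort(b, SequentialMergesortV1, 2048) }
func BenchmarkV2_2048(b *testing.B) { benchmarkSort(b, SequentialMergesortV2, 2048) }
func BenchmarkV3_2048(b *testing.B) { benchmarkSort(b, SequentialMergesortV3, 2048) }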

Wrap-up

You can clone the project and write a merge sort that uses a goroutine pool, to further compare time and memory consumption.
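
As a starting point, here is a sketch of a v4 that caps the number of in-flight goroutines with a buffered channel used as a semaphore (the channel-based pool is my assumption, not the repo's implementation; requires import "sync" and reuses max and Merge from above):

const maxWorkers = 8 // hypothetical cap; tune per machine

var sem = make(chan struct{}, maxWorkers)

// SequentialMergesortV4 behaves like v3 but also bounds the total number
// of concurrent goroutines. When no semaphore slot is free, it sorts
// sequentially on the current goroutine instead of blocking.
func SequentialMergesortV4(s []int) {
	if len(s) <= 1 {
		return
	}
	if len(s) < max {
		SequentialMergesortV1(s)
		return
	}
	middle := len(s) / 2
	select {
	case sem <- struct{}{}: // a slot is free: sort the left half in parallel
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			SequentialMergesortV4(s[:middle])
		}()
		SequentialMergesortV4(s[middle:])
		wg.Wait()
	default: // pool exhausted: stay on the current goroutine
		SequentialMergesortV4(s[:middle])
		SequentialMergesortV4(s[middle:])
	}
	Merge(s, middle)
}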