0. Preface
Lecture 8 covered how a goroutine that runs for too long gets preempted. This lecture continues with preemption and looks at what happens when a goroutine spends too long in a system call.
1. Preempting long-running system calls
See the example below:
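package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// longSyscall blocks in select(2) for 5 seconds (Linux signature of syscall.Select).
func longSyscall() {
	timeout := syscall.NsecToTimeval(int64(5 * time.Second))
	fds := make([]syscall.FdSet, 1)
	if _, err := syscall.Select(0, &fds[0], nil, nil, &timeout); err != nil {
		fmt.Println("Error:", err)
	}
	fmt.Println("Select returned after timeout")
}

func main() {
	// Start one blocking goroutine per P.
	threads := runtime.GOMAXPROCS(0)
	for i := 0; i < threads; i++ {
		go longSyscall()
	}
	time.Sleep(8 * time.Second)
}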
Each longSyscall goroutine makes a system call that lasts 5 seconds. While the call is in progress, sysmon watches the P the goroutine was running on and retakes it if the system call takes too long.
Let's return to the sysmon thread to see how it retakes the P from a goroutine that has spent too long in a system call.
func sysmon() {
	...
	idle := 0 // how many cycles in succession we had not wokeup somebody
	delay := uint32(0)
	...
	for {
		if idle == 0 { // start with 20us sleep...
			delay = 20
		} else if idle > 50 { // start doubling the sleep after 1ms...
			delay *= 2
		}
		if delay > 10*1000 { // up to 10ms
			delay = 10 * 1000
		}
		usleep(delay)
		...
		// retake P's blocked in syscalls
		// and preempt long running G's
		if retake(now) != 0 {
			idle = 0
		} else {
			idle++
		}
		...
	}
}
As with a goroutine that runs for too long, sysmon calls retake to take the P back:
func retake(now int64) uint32 {
	n := 0
	lock(&allpLock)
	for i := 0; i < len(allp); i++ {
		pp := allp[i]
		if pp == nil {
			continue
		}
		pd := &pp.sysmontick
		s := pp.status
		sysretake := false
		if s == _Prunning || s == _Psyscall {
			// Preempt G if it's running for too long.
			t := int64(pp.schedtick)
			if int64(pd.schedtick) != t {
				pd.schedtick = uint32(t)
				pd.schedwhen = now
			} else if pd.schedwhen+forcePreemptNS <= now {
				// A P in _Prunning or _Psyscall has stayed on the same G for too long.
				// preemptone, covered in the lecture on runtime preemption, sets the goroutine's preemption flag.
				// For a goroutine in a system call the flag alone does not preempt it,
				// because the thread stays inside the system call.
				preemptone(pp)
				// In case of syscall, preemptone() doesn't
				// work, because there is no M wired to P.
				sysretake = true
			}
		}
		if s == _Psyscall {
			// Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
			// syscalltick counts system calls; it is incremented when a system call completes.
			t := int64(pp.syscalltick)
			if !sysretake && int64(pd.syscalltick) != t {
				// pd.syscalltick != pp.syscalltick: this is not the system call sysmon saw last time
				// but a different one, so record the new tick and time and check again later.
				pd.syscalltick = uint32(t)
				pd.syscallwhen = now
				continue
			}
			// On the one hand we don't want to retake Ps if there is no other work to do,
			// but on the other hand we want to retake them eventually
			// because they can prevent the sysmon thread from deep sleep.
			// P is retaken if any one of the following three conditions holds:
			// 1. P's local run queue has runnable goroutines
			// 2. there are no spinning M's and no idle P's (everyone is busy, so P should not sit in a syscall)
			// 3. at least 10ms have passed since sysmon first observed this system call
			if runqempty(pp) && sched.nmspinning.Load()+sched.npidle.Load() > 0 && pd.syscallwhen+10*1000*1000 > now {
				continue
			}
			// From here on, P is actually retaken.
			unlock(&allpLock)
			// Need to decrement number of idle locked M's
			// (pretending that one more is running) before the CAS.
			// Otherwise the M from which we retake can exit the syscall,
			// increment nmidle and report deadlock.
			incidlelocked(-1)
			if atomic.Cas(&pp.status, s, _Pidle) { // update P's status to _Pidle
				n++              // number of retaken P's + 1
				pp.syscalltick++ // system call count + 1
				handoffp(pp)     // hand P off to another M
			}
			incidlelocked(1)
			lock(&allpLock)
		}
	}
	unlock(&allpLock)
	return uint32(n)
}
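The three conditions in the comments above are easier to check when the skip test is isolated. The following is a minimal sketch, not runtime code: pSnapshot, shouldRetake and syscallGraceNS are hypothetical names introduced here, and the sketch only models the final `continue` test for a P that sysmon has already observed in the same system call.

```go
package main

import "fmt"

// syscallGraceNS is the 10ms constant from the quoted condition (10*1000*1000 ns).
const syscallGraceNS = 10 * 1000 * 1000

// pSnapshot is a hypothetical, simplified view of the values retake reads.
type pSnapshot struct {
	runqEmpty      bool  // runqempty(pp)
	spinningOrIdle int   // sched.nmspinning + sched.npidle
	syscallWhen    int64 // pd.syscallwhen, in nanoseconds
}

// shouldRetake returns false exactly when the quoted code would `continue`:
// no local work, someone else is free to pick up work, and less than 10ms elapsed.
func shouldRetake(p pSnapshot, now int64) bool {
	skip := p.runqEmpty && p.spinningOrIdle > 0 && p.syscallWhen+syscallGraceNS > now
	return !skip
}

func main() {
	base := pSnapshot{runqEmpty: true, spinningOrIdle: 1, syscallWhen: 0}
	fmt.Println(shouldRetake(base, 5*1000*1000))  // within 10ms, nothing to do
	fmt.Println(shouldRetake(base, 10*1000*1000)) // 10ms reached

	busy := base
	busy.spinningOrIdle = 0 // no spinning M and no idle P
	fmt.Println(shouldRetake(busy, 1000))

	queued := base
	queued.runqEmpty = false // goroutines are waiting on this P
	fmt.Println(shouldRetake(queued, 1000))
}
```

Only the first call reports false; each of the other three trips one of the numbered conditions.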
Let's step into handoffp:
// Hands off P from syscall or locked M.
// Always runs without a P, so write barriers are not allowed.
//
//go:nowritebarrierrec
func handoffp(pp *p) {
	// if it has local work, start it straight away
	// Bind P to another thread. The thread executing the system call no longer needs P,
	// and P's are scarce compared with threads, so freeing P puts it to better use.
	if !runqempty(pp) || sched.runqsize != 0 {
		startm(pp, false, false)
		return
	}
	...
	// no local work, check that there are no spinning/idle M's,
	// otherwise our help is not required
	if sched.nmspinning.Load()+sched.npidle.Load() == 0 && sched.nmspinning.CompareAndSwap(0, 1) { // TODO: fast atomic
		sched.needspinning.Store(0)
		startm(pp, true, false)
		return
	}
	...
	// Check whether the global run queue has any work to process.
	if sched.runqsize != 0 {
		unlock(&sched.lock)
		startm(pp, false, false)
		return
	}
	...
	// If none of the above applies, put P on the global idle list.
	pidleput(pp, 0)
	unlock(&sched.lock)
}
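The branch order in the excerpt can be summarized as a small decision function. This is a simplified sketch under stated assumptions: schedSnapshot and handoffDecision are hypothetical names, the branches elided with "..." in the excerpt are skipped, and the compare-and-swap on nmspinning is assumed to succeed.

```go
package main

import "fmt"

// schedSnapshot is a hypothetical, simplified view of the state handoffp consults.
type schedSnapshot struct {
	localRunqEmpty bool // runqempty(pp)
	globalRunqSize int  // sched.runqsize
	spinningMs     int  // sched.nmspinning
	idlePs         int  // sched.npidle
}

// handoffDecision names the call the quoted excerpt would end with.
func handoffDecision(s schedSnapshot) string {
	switch {
	case !s.localRunqEmpty || s.globalRunqSize != 0:
		// There is work to run right away: give P to an M.
		return "startm(pp, false, false)"
	case s.spinningMs+s.idlePs == 0:
		// Nobody is looking for work: start a spinning M to look for it.
		return "startm(pp, true, false)"
	default:
		// Others can already pick up new work: park P on the idle list.
		return "pidleput(pp, 0)"
	}
}

func main() {
	fmt.Println(handoffDecision(schedSnapshot{localRunqEmpty: false}))
	fmt.Println(handoffDecision(schedSnapshot{localRunqEmpty: true}))
	fmt.Println(handoffDecision(schedSnapshot{localRunqEmpty: true, idlePs: 2}))
}
```

Because the sketch works on one snapshot, the second check of the global run queue (made under sched.lock in the excerpt) collapses into the first case.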
This is how a goroutine that spends too long in a system call is preempted. Preempting here means taking away the P bound to the thread that is making the system call. It does not stop the thread from making the system call; only the P is released. (Because stackguard0 of this goroutine was set earlier by preemptone, the goroutine will still go through the same preemption path as a goroutine that runs for too long.)
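The effect of releasing the P can be observed from user code. The sketch below is an illustration under assumptions, not part of the original example: it targets Unix-like systems, uses a raw blocking read on a pipe instead of select, and runWhileBlocked is a hypothetical helper name. With GOMAXPROCS set to 1, the caller can only resume while the reader is stuck in read(2) if the single P has been handed off.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// runWhileBlocked reports whether the calling goroutine got to run while
// another goroutine was still blocked in read(2), with only one P available.
func runWhileBlocked() (bool, error) {
	prev := runtime.GOMAXPROCS(1) // a single P, so progress requires a handoff
	defer runtime.GOMAXPROCS(prev)

	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return false, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	started := make(chan struct{})
	done := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		close(started)
		for {
			// Raw blocking read: the thread stays in the kernel until a byte arrives.
			_, err := syscall.Read(fds[0], buf)
			if err != syscall.EINTR {
				done <- err
				return
			}
		}
	}()

	<-started // the reader now holds the only P and is about to block
	progressed := false
	select {
	case <-done:
		// The read already returned, which is not expected before the write below.
	default:
		progressed = true // we are running although the reader has not finished
	}
	if _, err := syscall.Write(fds[1], []byte{1}); err != nil {
		return false, err
	}
	return progressed, <-done
}

func main() {
	ok, err := runWhileBlocked()
	fmt.Println("ran while reader was blocked:", ok, "err:", err)
}
```

The thread inside read(2) is never interrupted; it returns only after the write supplies a byte.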
Let's look at a schematic to visualize the process more clearly:
After handoffp finishes, retake returns n, the number of P's it retook, to sysmon:
func sysmon() {
	...
	idle := 0 // how many cycles in succession we had not wokeup somebody
	delay := uint32(0)
	for {
		if idle == 0 { // start with 20us sleep...
			delay = 20 // idle == 0 means sysmon just did work, so it wakes up and checks every 20us
		} else if idle > 50 { // start doubling the sleep after 1ms...
			delay *= 2 // after more than 50 idle loops, sysmon doubles its sleep each time so it doesn't waste resources
		}
		if delay > 10*1000 { // up to 10ms
			delay = 10 * 1000 // the sleep can't grow without bound; the maximum is 10ms
		}
		usleep(delay)
		...
		if retake(now) != 0 {
			idle = 0 // at least one P was retaken, so sysmon is busy
		} else {
			idle++ // nothing was retaken, idle + 1
		}
		...
	}
	...
}
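To see how the sleep interval evolves when retake keeps returning 0, the schedule can be replayed outside the runtime. This is a minimal sketch: nextDelay is a hypothetical helper that copies the three branches quoted above, and the loop assumes sysmon never finds anything to retake.

```go
package main

import "fmt"

// nextDelay applies one iteration of the quoted schedule.
// idle is the number of consecutive loops without a retake; delay is in microseconds.
func nextDelay(idle int, delay uint32) uint32 {
	if idle == 0 { // start with 20us sleep
		delay = 20
	} else if idle > 50 { // start doubling the sleep after 1ms
		delay *= 2
	}
	if delay > 10*1000 { // up to 10ms
		delay = 10 * 1000
	}
	return delay
}

func main() {
	delay := uint32(0)
	for idle := 0; idle <= 60; idle++ {
		delay = nextDelay(idle, delay)
		if idle == 0 || idle >= 50 {
			fmt.Printf("idle=%d delay=%dus\n", idle, delay)
		}
	}
}
```

The delay stays at 20us through idle=50 (51 sleeps of 20us, about 1ms in total), doubles from idle=51 onward, and reaches the 10ms cap at idle=59. A single non-zero result from retake resets idle to 0 and brings the delay back to 20us.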
2. Summary
This lecture covered preemption caused by long system calls. The next lecture will continue with asynchronous preemption.