Go scheduler and preemption October 7, 2021 on tomleb's blog

I recently started doing a deep dive into the go internals, more specifically the scheduler. One of the first thing I came across is that go uses a cooperative work-stealing scheduler.

The cooperative part means that the program cooperates with the runtime to schedule or deschedule goroutines. I don’t have the expertise to go over the pros and cons of it, but one disadvantage with a fully cooperative language is that long computation can prevent other goroutines to be switched until finished. This was reported as Issue #11462 with the following code snippet taking a lot more than 100ms to print.

package main

import (
	"fmt"
	"runtime"
	"time"
)

//go:noinline
func cpuIntensive(p *int) {
	for i := 1; i <= 1000000000000000000; i++ {
		*p = i
	}
}

func main() {
	runtime.GOMAXPROCS(1)

	x := 0
	go cpuIntensive(&x)

	time.Sleep(100 * time.Millisecond)

	// printed only after cpuIntensive is completely finished
	fmt.Printf("x = %d.\n", x)
}

This program first limits the number of process to 1, and then starts a long computation goroutine. According to the issue, due to being cooperative, the code is supposed to finish the computation before printing even though we’re creating a goroutine. However, when I tried running the code (with version 1.16), I got output before the end of the computation. How come?

Turns out asynchronous preemption was added to go recently¹. Preemption means that a running goroutine can be interrupted (similar to how OS threads are interrupted by the OS) if it is safe to do so.

There are two interesting questions I wanted to find the answer for:

How does this asynchronous preemption works?
How does the runtime determine if a running goroutine can be preempted at its current instruction?

Let’s dig in!

How does asynchronous preemption works ?

Go uses a signal-based preemption model. It configures a signal handler for the SIGURG signal, which was chosen for various reasons. Upon receiving this signal, the goroutine will stop execution, leaving its place to the runtime (we don’t consider this being preempted in this context). The runtime will then decide if it is safe to suspend the goroutine to run another one. Alright, but who sends the SIGURG signal? Answer: sysmon.

At startup, the runtime creates a System Monitor (sysmon) thread. According to Go: sysmon, Runtime Monitoring, the sysmon thread is not taken into account by the runtime, which more importantly means that it is always running (concurrently with other OS threads). As you can see from the simplified code below, func sysmon() is basically an endless loop that sleeps and attempts to preempt goroutines (among other things) by sending the SIGURG signal.

func sysmon() {
	idle := 0 // how many cycles in succession we had not wokeup somebody
	delay := uint32(0)
	...
	for {
		...
		usleep(delay)

                ...

		now = nanotime()
                ...

		// retake P's blocked in syscalls
		// and preempt long running G's
		if retake(now) != 0 {
			idle = 0
		} else {
			idle++
		}
		...
	}
}

Here’s what the simplified retake() function looks like. It preempts goroutines every 10ms.

// forcePreemptNS is the time slice given to a G before it is
// preempted.
const forcePreemptNS = 10 * 1000 * 1000 // 10ms

func retake(now int64) uint32 {
	n := 0
	...
	for i := 0; i < len(allp); i++ {
		_p_ := allp[i]
		if _p_ == nil {
			// This can happen if procresize has grown
			// allp but not yet created new Ps.
			continue
		}
		pd := &_p_.sysmontick
		s := _p_.status
		sysretake := false
		if s == _Prunning || s == _Psyscall {
			// Preempt G if it's running for too long.
			t := int64(_p_.schedtick)
			if int64(pd.schedtick) != t {
				pd.schedtick = uint32(t)
				pd.schedwhen = now
			} else if pd.schedwhen+forcePreemptNS <= now {
				preemptone(_p_)
				// In case of syscall, preemptone() doesn't
				// work, because there is no M wired to P.
				sysretake = true
			}
		}
                ...
	}
        ...
}

For sake of completeness, if you follow the preemptone() source code, you will end up in preemptM() in signal_unix.go.

const sigPreempt = _SIGURG

...

func preemptM(mp *m) {
	...
		signalM(mp, sigPreempt)
	...
}

The following diagram illustrates the difference between go with and without async preemption. Without sysmon sending the signal, the goroutine with a long computation is never preempted and must finish before another one can run. However, with sysmon sending the signal, then the goroutine can be preempted and give CPU time for another goroutine. This is simplified and skips details but it gives an idea.

Scheduler comparison with and without preemption

Now that we have an idea of how, let’s find out in which circumstances the runtime is allowed to preempt a goroutine.

Determining preemptability of current function

Having no experience with compiler design, the fact that the runtime is able to determine the preemptability of a goroutine at the PC (program counter) level is fascinating to me.

At compile time, go adds some metadata into a table structure called the PCDATA table. More specifically, the PCDATA_UnsafePointer table. You can find more information about the internal design of this table at Go 1.2 Runtime Symbol Information, although I’m not sure how up to date it is.

When the SIGURG signal is received, the runtime will look at this table to verify whether the goroutine can be preempted and also what the PC will be once the goroutine will be rescheduled (the resumption PC). The following flowchart shows the logic to determine preemptability. This is defined in the isAsyncSafePoint() function.

If we go back to the initial program shown in this post, we can easily force the old behavior by making the go compiler mark cpuIntensive() as unsafe. One way to mark it as unsafe is to make it a nosplit function with a compiler directive. The code then becomes:

package main

import (
	"fmt"
	"runtime"
	"time"
)

//go:noinline
//go:nosplit
func cpuIntensive(p *int) {
	for i := 1; i <= 1000000000000000000; i++ {
		*p = i
	}
}

func main() {
	runtime.GOMAXPROCS(1)

	x := 0
	go cpuIntensive(&x)

	time.Sleep(100 * time.Millisecond)

	// printed only after cpuIntensive is completely finished
	fmt.Printf("x = %d.\n", x)
}

Now, if you build and run this code, the output will appear after the cpuIntensive() function has terminated.

go run main.go

It was added in the commit runtime: implement async scheduler preemption in go version 1.14. ↩︎

Contribute to the discussion in my public inbox by sending an email to ~tomleb/public-inbox@lists.sr.ht [mailing list etiquette]