Summary
The basic concepts of kernel mode, user mode, threads, processes, and concurrent programs will not be repeated here.
Native and user threads
- Native thread: a thread created in kernel mode that serves only kernel mode.
- User thread: a thread created by a user application that travels back and forth between user mode and kernel mode. Throwing an exception, for example, causes a CLR thread to switch from user mode to kernel mode and back again.
Clock Interrupt and Time Slice
Clock interrupts are generated by the motherboard's hardware timer and fire at a fixed interval (15.6 ms). Windows consumes these interrupts to drive multi-threaded task scheduling and timer tasks.
The operating system receives the interrupt and divides time into time slices on its own: each thread gets CPU time for one time slice, and when the slice runs out the operating system hands the CPU to the next thread.
On Windows client editions a time slice is 2 clock interrupts (15.6*2 = 31.2 ms).
On Windows server editions a time slice is 12 clock interrupts (15.6*12 = 187.2 ms), mainly for higher throughput.
This is why the book CLR via C# says Windows switches threads roughly every 30 ms.
When a thread's time slice runs out, the operating system hands a new time slice to another thread; this is how the "multi-threading" effect is achieved.
A single core can execute only one thread at a time.
seeing is believing
- How often is the interrupt triggered?
Use WinDbg in kernel mode and dump the nt!KeMaximumIncrement variable to check its value.
Note that the unit is 100 ns, so 156250*100/1000/1000 = 15.625 ms.
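To cross-check the same value from user mode, here is a minimal C# sketch (Windows only, not part of the original text) using the documented Win32 API GetSystemTimeAdjustment, whose second output parameter reports the clock interrupt interval in 100 ns units; the class name and structure are just for illustration.
using System;
using System.Runtime.InteropServices;

class ClockIntervalCheck
{
    // Real Win32 API: the second out parameter is the interval between clock interrupts, in 100 ns units.
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetSystemTimeAdjustment(out uint adjustment, out uint interval100Ns, out bool adjustmentDisabled);

    static void Main()
    {
        if (GetSystemTimeAdjustment(out _, out uint interval, out _))
        {
            // Typically prints 156250 * 100ns = 15.625 ms, matching nt!KeMaximumIncrement.
            Console.WriteLine($"{interval} * 100ns = {interval / 10_000.0} ms");
        }
    }
}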
Data structure of CPU cores under Windows
Windows allocates a _KPCR structure for each CPU core to record that core's current state, and extends it with a _KPRCB to record more information.
The key point is that CurrentThread, NextThread, and IdleThread (the idle thread) are stored there.
seeing is believing
Use the dt command to view these structures (e.g. dt nt!_KPCR, dt nt!_KPRCB).
The dt command is a very useful tool for displaying type information, primarily for viewing and analyzing the layout and contents of data structures.
Which thread is the CPU currently executing?
Use the !running command to see which threads are currently executing on each CPU core.
Its output is essentially a distilled view of _KPCR/_KPRCB.
Data structures for threads on Windows
Every thread owns the following elements, which are the unavoidable overhead of creating a thread (see the sketch after this list).
- Thread Kernel Object: every thread created by the OS is allocated data structures that carry its description. Windows allocates an _ETHREAD structure for each thread to record its current state, which includes the Thread Context.
- Thread Environment Block (TEB): a block of memory allocated in user mode, mainly holding the thread's exception chain, thread-local storage, and other information.
- User-Mode Stack: the stack space we usually talk about lives here; the famous OOM comes from here!
- Kernel-Mode Stack: for security isolation, a corresponding stack space exists in kernel mode; it is used when user-mode code calls into kernel-mode code.
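As a small illustration of the user-mode stack (a sketch, not the original author's code): the Thread constructor lets you request a stack size, and each thread gets its own user-mode stack; the 256 KB figure below is an arbitrary example.
using System;
using System.Threading;

class StackDemo
{
    static void Main()
    {
        // Each thread gets its own user-mode stack; here we request roughly 256 KB
        // instead of the default (the OS/CLR may round the value up).
        var small = new Thread(() => Console.WriteLine("running on a small stack"),
                               maxStackSize: 256 * 1024);
        small.Start();
        small.Join();
    }
}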
seeing is believing
- Thread Kernel Object: use the command dt nt!_ETHREAD
- TEB: use the command dt nt!_TEB
The nature of thread context switching
The essence of a context switch is to back up the outgoing thread's register values into that thread's context, and then load the incoming thread's context back into the registers.
A simple analogy: you and I take turns playing a game. When it is my turn I load my save; when it is your turn I save my progress, and you do the same.
The cost of thread switching
Context switching is pure overhead that yields no performance gain by itself, so one way to optimize a program is to reduce the number of context switches (see the rough measurement after the list below).
- Explicit cost: saving register values to memory and reading them back from memory. The more registers there are, the higher the cost; the AMD 7840HS processor, for example, has 17 registers in total.
- Implicit cost: if the switch is between threads of the same process, they share the process's user-mode virtual memory space, so after the switch the CPU cache may still score hits (e.g. on variables and code shared between the threads). If the switch is between threads of different processes, the user-mode virtual memory mapping is invalidated, which in turn invalidates the CPU cache.
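A rough way to feel this cost in C# (numbers vary by machine; this is an illustration, not a rigorous benchmark): the loop that keeps asking the OS to switch via Thread.Yield() is far slower than the plain loop doing the same number of iterations.
using System;
using System.Diagnostics;
using System.Threading;

class SwitchCost
{
    static void Main()
    {
        const int N = 1_000_000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++) { /* plain loop, no switching */ }
        Console.WriteLine($"empty loop : {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        // Thread.Yield() asks the OS to run another ready thread on this core, if any.
        for (int i = 0; i < N; i++) Thread.Yield();
        Console.WriteLine($"yield loop : {sw.ElapsedMilliseconds} ms");
    }
}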
seeing is believing
After all this theory, why don't we just look at the source code.
/* Main code entry */
PUBLIC KiSwapContext
.PROC KiSwapContext
/* Generate a KEXCEPTION_FRAME on the stack */
/* Core logic: back up all the registers. */
GENERATE_EXCEPTION_FRAME
/* Do the swap with the registers correctly setup */
/* Load the address of the new thread into the R8 register */
mov r8, gs:[PcCurrentThread] /* Pointer to the new thread */
call KiSwapContextInternal
/* Restore the registers from the KEXCEPTION_FRAME */
/* Restore the previously saved values to the CPU registers */
RESTORE_EXCEPTION_STATE
/* Return */
ret
.ENDP
MACRO(GENERATE_EXCEPTION_FRAME)
/* Allocate a KEXCEPTION_FRAME on the stack */
/* -8 because the last field is the return address */
sub rsp, KEXCEPTION_FRAME_LENGTH - 8
.allocstack (KEXCEPTION_FRAME_LENGTH - 8)
/* Save non-volatiles in KEXCEPTION_FRAME */
mov [rsp + ExRbp], rbp
.savereg rbp, ExRbp
mov [rsp + ExRbx], rbx
.savereg rbx, ExRbx
mov [rsp + ExRdi], rdi
.savereg rdi, ExRdi
mov [rsp + ExRsi], rsi
.savereg rsi, ExRsi
/* ... remaining registers omitted ... */
ENDM
MACRO(RESTORE_EXCEPTION_STATE)
/* Restore non-volatile registers */
mov rbp, [rsp + ExRbp]
mov rbx, [rsp + ExRbx]
mov rdi, [rsp + ExRdi]
mov rsi, [rsp + ExRsi]
mov r12, [rsp + ExR12]
mov r13, [rsp + ExR13]
mov r14, [rsp + ExR14]
mov r15, [rsp + ExR15]
movaps xmm6, [rsp + ExXmm6]
/* ... remaining registers omitted ... */
/* Clean stack and return */
add rsp, KEXCEPTION_FRAME_LENGTH - 8
ENDM
/reactos/reactos/blob/master/ntoskrnl/ke/amd64/
Thread Scheduling Model (Extremely Simplified Version)
As mentioned above, each logical core's _KPRCB data structure contains three scheduling-related attributes.
DeferredReadyListHead is a singly linked list, WaitListHead is a doubly linked list, and DispatcherReadyListHead is a two-dimensional array (an array of lists).
Simply put, on a thread switch the logical core picks the highest-priority thread from DispatcherReadyListHead according to thread priority. A thread that voluntarily gives up its time slice is placed in DeferredReadyListHead, and WaitListHead holds threads that are waiting for something to happen, such as an I/O operation completing, a signal, or a mutex being released. A toy sketch of this ready structure follows.
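The following toy C# sketch (purely illustrative, not the real Windows scheduler) mirrors the idea behind DispatcherReadyListHead: 32 ready queues indexed by priority, with the scheduler scanning from 31 down to 0.
using System;
using System.Collections.Generic;

class ToyCore
{
    // Index = priority (0..31); each slot is a FIFO queue of ready threads, like DispatcherReadyListHead.
    private readonly Queue<string>[] _ready = new Queue<string>[32];

    public ToyCore()
    {
        for (int i = 0; i < 32; i++) _ready[i] = new Queue<string>();
    }

    public void MakeReady(string thread, int priority) => _ready[priority].Enqueue(thread);

    // Pick the highest-priority ready thread, scanning from 31 down to 0.
    public string PickNext()
    {
        for (int p = 31; p >= 0; p--)
            if (_ready[p].Count > 0)
                return _ready[p].Dequeue();
        return "Idle"; // nothing ready: run the idle thread
    }
}

class Demo
{
    static void Main()
    {
        var core = new ToyCore();
        core.MakeReady("worker A", 8);
        core.MakeReady("worker B", 15);
        Console.WriteLine(core.PickNext()); // worker B: the higher priority wins
    }
}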
seeing is believing
Look directly at the source code
You can see that DispatcherReadyListHead has 32 entries, mainly because Windows defines thread priorities as 32 levels, 0-31.
/reactos/reactos/blob/master/sdk/include/ndk/amd64/
Thread Priority
Windows and Linux are preemptive operating systems, so a thread cannot be guaranteed to keep running. Thread priorities therefore exist to give the user some control over scheduling.
Each thread in Windows has a priority from 0 (lowest) to 31 (highest); ready threads are stored in DispatcherReadyListHead by priority, and when the OS hands out time slices it favors the higher-priority threads.
As long as there is always a runnable thread at priority 31, threads at priorities 0-30 will never get scheduled. This is called "thread starvation".
Linux uses the nice value to indicate priority, ranging from -20 to 19. The smaller the nice value, the higher the priority; the default nice value is 0.
C# Thread Structure Model
A C# thread sits on top of a CLR managed thread, and the CLR in turn hosts an OS thread, so they correspond one-to-one: C# thread, CLR thread, and OS thread, respectively.
Threads go through two phases during creation:
static void Main(string[] args)
{
    var testThread = new Thread(DoWork); // This stage only creates the Thread object in the CLR, not on the OS.
    testThread.Start();                  // The CLR now calls the system API to create the OS thread.
}

static void DoWork() { }
The CLR exposes 5 thread priorities: Lowest, BelowNormal, Normal, AboveNormal, and Highest.
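A minimal sketch of using these CLR priorities (the worker does nothing interesting; it only prints the priority it was given):
using System;
using System.Threading;

class PriorityDemo
{
    static void Main()
    {
        var worker = new Thread(() =>
            Console.WriteLine($"priority inside thread: {Thread.CurrentThread.Priority}"));
        worker.Priority = ThreadPriority.AboveNormal; // Lowest..Highest, Normal by default
        worker.Start();
        worker.Join();
    }
}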
Foreground and background threads
Note that this is only a CLR concept; there is no such concept at the OS level.
Foreground threads: for critical tasks; the process waits for all foreground threads to finish executing before it exits normally. A Thread is a foreground thread by default.
Background threads: for non-critical tasks; the process does not wait for background threads to finish executing and exits directly. ThreadPool threads are background threads by default.
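A minimal sketch of the difference (assuming a console app): with IsBackground = true the process exits immediately; comment that line out and Main waits about 10 seconds for the foreground thread.
using System;
using System.Threading;

class ForegroundBackground
{
    static void Main()
    {
        var worker = new Thread(() => Thread.Sleep(10_000));
        worker.IsBackground = true; // background: the process will not wait for this thread
        worker.Start();
        Console.WriteLine("Main is done; a background thread does not keep the process alive.");
    }
}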
Think about this: when a managed thread calls unmanaged code, and when unmanaged code calls back into managed code, which threads carry the call?
The former runs on whatever managed thread made the call (a Thread or a ThreadPool thread); the latter runs on a background thread, because the native thread has to be bound to a managed thread, and that managed thread is created by the thread pool.
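For the first case, here is a small sketch (demo scaffolding, not the original author's code): a P/Invoke call simply executes on whichever managed thread invokes it. GetCurrentThreadId is the real Win32 API that returns the caller's OS thread id.
using System;
using System.Runtime.InteropServices;
using System.Threading;

class InteropThreads
{
    [DllImport("kernel32.dll")]
    static extern uint GetCurrentThreadId(); // returns the OS thread id of the calling thread

    static void Main()
    {
        Console.WriteLine($"Main calls native code on OS thread {GetCurrentThreadId()}");

        var worker = new Thread(() =>
            Console.WriteLine($"Thread calls native code on OS thread {GetCurrentThreadId()}"));
        worker.Start();
        worker.Join();

        ThreadPool.QueueUserWorkItem(_ =>
            Console.WriteLine($"ThreadPool thread calls native code on OS thread {GetCurrentThreadId()}"));
        Thread.Sleep(500); // crude wait for the pool item (demo only)
    }
}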
Concurrent and Virtual Threads
.NET 9 does not support this feature yet!
/blogPost/59752c38-9c99-4641-9853-9cfa97bb2d29
Thread Local Storage (TLS)
TLS is used to implement thread-local variables: the data is segregated by thread, and modified values are visible only to the thread that modified them.
The underlying implementation is as follows:
The OS natively supports TLS; on Windows, for example, TLS slots are allocated, read, and written via TlsAlloc/TlsGetValue/TlsSetValue.
The OS uses a segment register (e.g., the GS register) to hold the address pointing to the thread's TLS data. Because the context-switch mechanism saves and restores this register per thread, each native thread can use GS independently to locate its own TLS.
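At the .NET level the same idea is exposed as ThreadLocal<T> (or [ThreadStatic]); a minimal sketch, assuming each thread should see only its own copy of the value:
using System;
using System.Threading;

class TlsDemo
{
    static readonly ThreadLocal<int> Counter = new ThreadLocal<int>(() => 0);

    static void Main()
    {
        void Work(string name)
        {
            Counter.Value++; // modifies only the calling thread's copy
            Console.WriteLine($"{name}: {Counter.Value}");
        }

        var t1 = new Thread(() => Work("t1"));
        var t2 = new Thread(() => Work("t2"));
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();

        Console.WriteLine($"main: {Counter.Value}"); // still 0 on the main thread
    }
}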
seeing is believing
Use the ~ command to derive the range of each thread stack
After using !teb to observe its memory layout, you can see that the memory pointed to by TLS Storage is accessed as BaseAddress + Offset.