Multithreading in C++ and related topics

Multithreading

Reference:/p/613630658

Platform differences: Linux vs. Windows, cross-platform solutions

On Linux, threads are traditionally created with pthread, while the C++11 standard provides <thread>, a good cross-platform solution.
There are some significant differences between std::thread and pthread in practical use. A typical example:
pthread_create is called to create a thread, whereas constructing a std::thread object creates and starts the thread directly.

Objectively speaking, std::thread is a much cleaner interface, while pthread is inevitably a bit crude. This article focuses on the std::thread style of multithreaded programming in C++.
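As a minimal sketch of the std::thread style, the helper below (sum_on_thread is a hypothetical name, not from any library) starts a worker thread simply by constructing a std::thread and then waits for it with join():

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `values` on a separate thread and returns the total.
long long sum_on_thread(const std::vector<int>& values) {
    long long total = 0;
    // Constructing the std::thread object starts the thread immediately.
    std::thread worker([&] {
        total = std::accumulate(values.begin(), values.end(), 0LL);
    });
    // join() waits for the worker, so reading `total` afterwards is safe.
    worker.join();
    return total;
}
```

With pthread the same task would need a free function taking a void* argument passed to pthread_create; here a lambda that captures its inputs is enough.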

Subthread Exit vs. Main Thread Exit
Citation:

/a0408152/article/details/129093394
/m0_56374992/article/details/119109979
detach separates the thread of execution from the std::thread object that represents it, allowing the thread to continue running on its own.
However, even if the child thread is detached, the main thread exiting still causes the child thread to exit.
The specific reason is that the main thread ends the process by returning from main or calling exit, and process exit terminates all of its threads.
This follows from the Linux process/thread model, a specific implementation of POSIX threads (pthread); see citation 2.
To prevent this, call pthread_exit(nullptr) in the main thread when you do not want to join the child thread there; this ends only the main thread and lets the remaining threads finish.
If it is only for testing purposes, you can also keep the main thread alive with a loop.

So, to restate what detach does: it separates the main thread from the child thread so that the two can run independently, but the child thread still cannot outlive the process.
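The following sketch illustrates both points under stated assumptions: the function name and the shared completion flag are hypothetical choices for illustration. After detach() the std::thread object no longer refers to the running thread (joinable() is false), so the caller needs some other signal before it is safe to let the process exit.

```cpp
#include <atomic>
#include <memory>
#include <thread>

// Hypothetical helper: detaches a worker, then waits for it via a shared flag.
// Returns true if the thread object was no longer joinable after detach().
bool detach_releases_thread_object() {
    // Shared ownership keeps the flag alive for whichever side finishes last.
    auto finished = std::make_shared<std::atomic<bool>>(false);
    std::thread worker([finished] { finished->store(true); });
    worker.detach();                          // the thread now runs independently
    bool joinable_after = worker.joinable();  // false: nothing left to join
    // Without this wait, returning from main could end the process
    // before the detached thread has done its work.
    while (!finished->load()) {
        std::this_thread::yield();
    }
    return !joinable_after;
}
```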

atomic operation

Note: Parts of this section and code are from Imperial College COMP60017 - L05 - Multi-core and Parallelism, Lluís Vilanova

A single statement such as a = b + 1 is not executed as one indivisible step. Compiled without optimization for x86-64, for example, it is broken down into the following three assembly instructions (load, add, store):

mov     eax, DWORD PTR [rbp-8]
add     eax, 1
mov     DWORD PTR [rbp-4], eax

Simulated compilation (and compilation optimization):/

It can be seen that, without any additional protection, a thread can be preempted (or another core can run) between the add and the final mov. Updates are therefore lost: when two parallel threads each increment the same variable 10,000 times, the final result is usually less than 20,000.
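The lost-update effect can be reproduced with a small sketch. The helper name is hypothetical, and std::atomic (introduced in the next section) is used here only so that each individual load and store is well-defined; the increment itself is deliberately split into two separate steps, mirroring the load/add/store sequence above.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: two threads each perform `iters` increments, but every
// increment is a separate load followed by a separate store.
int count_with_split_increment(int iters) {
    std::atomic<int> a{0};
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            int old = a.load();   // step 1: read the current value
            a.store(old + 1);     // step 2: write back (the other thread may have written in between)
        }
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    return a.load();
}
```

The result is never more than 2 * iters and on a multi-core machine is typically lower, because some increments overwrite each other.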

std::atomic

std::atomic is a class template in the C++ standard library (header <atomic>), not a keyword. It provides a set of 'atomic' operations for a specific variable.

In essence it provides critical-section protection for a single operation.

Specific examples of use are shown below:

// Use atomic operations on data shared across threads
#include <atomic>
#include <thread>
#include <iostream>
int main(int argc, char** argv) {
    int iters = 100000000; std::atomic<int> a = 0;
    std::thread t1([&](){ for (volatile int i = 0; i < iters; i++) a++; });
    std::thread t2([&](){ for (volatile int i = 0; i < iters; i++) a++; });
    t1.join(); t2.join();
    std::cout << "expected=" << iters*2 << " got=" << a << std::endl;
}

In C++, there are two ways of using atomics:

Declare the variable with an atomic type when defining it: std::atomic<int> a = 0;
Use the atomic_* family of free functions when operating on it (the variable must still be a std::atomic): atomic_fetch_add(&a, 1);

Reference:/w/cpp/atomic/atomic/compare_exchange

atomic_{load, store}: read/write
atomic_compare_exchange_{weak, strong}: compare-and-swap (CAS). Note that std::atomic provides two CAS operations:
compare_exchange_weak(T& expected, T desired)
compare_exchange_strong(T& expected, T desired)
Both first compare the value of the atomic variable with the first argument (expected):
- If they are equal, the atomic variable is set to the second argument (desired) and true is returned.
- If they are not equal, the first argument is overwritten with the current value of the atomic variable and false is returned.
  However, the weak version may fail spuriously: it can return false without performing the exchange even though the value equals expected, so it is normally called in a loop.
  In practice, especially at the application level, if you are not extremely sensitive to performance, always use the strong version.
atomic_fetch_{add, sub, or, xor, and}: arithmetic/logical operations
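The usual pattern for compare_exchange_weak is a retry loop. The sketch below builds an operation the standard library does not offer directly, an atomic multiply; atomic_multiply is a hypothetical helper name chosen for illustration.

```cpp
#include <atomic>

// Hypothetical helper: atomically multiplies `value` by `factor`
// and returns the value that was stored before the update.
int atomic_multiply(std::atomic<int>& value, int factor) {
    int expected = value.load();
    // On failure (including a spurious one) `expected` is refreshed with the
    // current value, so the loop simply retries with up-to-date data.
    while (!value.compare_exchange_weak(expected, expected * factor)) {
    }
    return expected;
}
```

Because the loop retries anyway, a spurious failure of the weak version costs only one extra iteration, which is why weak is the common choice inside loops and strong is the simpler choice for a one-shot attempt.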

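std::atomic<int> beings{0}, legs{0};
void enter_room(int nlegs) {
    // Each fetch_add is atomic on its own, but the pair of updates is not:
    // another thread may observe the new beings with the old legs.
    atomic_fetch_add(&beings, 1);
    atomic_fetch_add(&legs, nlegs);
}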

C++ memory order: memory order

CAS lock-free operation

The biggest advantage of lock-free operations over locking is performance, which can improve significantly.

Most modern CPUs provide mechanisms for atomic implementation of CAS at the hardware level.

Shared lock (read/write lock): multiple threads can read at the same time, but only one can write.

In the pthread library on Linux, read/write locks are implemented on top of CAS.
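For reference, the read/write lock behavior described above is available in standard C++ as std::shared_mutex. The sketch assumes C++17, and SharedValue is a hypothetical class for illustration; it shows how the lock is used, not how pthread implements it internally.

```cpp
#include <mutex>
#include <shared_mutex>

// Hypothetical example class: a value guarded by a read/write lock.
class SharedValue {
public:
    int read() const {
        // Shared mode: many readers may hold the lock at the same time.
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return value_;
    }
    void write(int v) {
        // Exclusive mode: blocks both readers and other writers.
        std::unique_lock<std::shared_mutex> lock(mutex_);
        value_ = v;
    }
private:
    mutable std::shared_mutex mutex_;
    int value_ = 0;
};
```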

The CAS operation, whose full name is Compare-and-swap, is an atomic operation.

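Its semantics can be written in C++ as follows. Note that this function by itself is not atomic; the real operation performs these steps as one indivisible hardware instruction: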

bool compare_and_swap(int *pAddr, int nExpected, int nNew)
{
    if(*pAddr == nExpected)
    {
        *pAddr = nNew;
        return true;
    }
    else
        return false;
}

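An increment operation built on CAS is shown here, using std::atomic's compare_exchange_strong as the CAS primitive:

#include <atomic>
#include <cstdint>

void atomic_inc(std::atomic<uint64_t>* addr) {
    bool swapped = false;
    while (not swapped) {
        auto old = addr->load();
        // Succeeds only if *addr still equals old; otherwise retry with a fresh value
        swapped = addr->compare_exchange_strong(old, old + 1);
    }
}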

thread synchronization

Critical section: variables that are only read and never written need no protection. Common synchronization primitives:

▪ Lock/mutex
▪ Semaphore
▪ Shared lock (aka, read/write lock)
▪ Condition variables
▪ Barrier
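As a minimal sketch of the first primitive in the list, the helper below (a hypothetical name) protects a plain int with std::mutex; std::lock_guard locks on construction and unlocks when it goes out of scope.

```cpp
#include <mutex>
#include <thread>

// Hypothetical helper: two threads increment a plain int under a mutex.
int count_with_mutex(int iters) {
    int counter = 0;
    std::mutex m;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            std::lock_guard<std::mutex> guard(m);  // locked here, unlocked at end of scope
            ++counter;                             // critical section
        }
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    return counter;
}
```

Unlike the unprotected increment, this always returns exactly 2 * iters, because only one thread at a time can be inside the critical section.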

Example: using condition variables (condition_variable)

Example: LeetCode 1117. Building H2O

In C++, condition_variable must be used together with unique_lock; in addition, the condition_variable_any class can work with any lock type.

Basic usage flow: mutex lock -> wait -> mutex unlock

The wait function blocks and releases the lock automatically, without the need to release it manually, and reacquires the lock before it returns.

Spurious wakeup: when notify_all() wakes up all the threads in the waiting state, some of them find that the condition they are waiting for has not been met.

Solution:

Use a while loop to determine the condition each time you are woken up.

while (g_deque.empty())
{
    g_cond.wait(lck);
}

Use a wait with a predicate condition.

cv.wait(lck, [this]{ return printo > 0; }); // cv stands for the class's condition_variable member; the lambda is inside a class, so this must be captured; this form is more convenient
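Putting the flow together (lock, wait with a predicate, automatic unlock), here is a minimal producer/consumer sketch. The function and variable names are hypothetical and only illustrate the pattern.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical helper: a producer pushes 1..n into a queue, a consumer sums them.
int produce_and_consume(int n) {
    std::deque<int> queue;
    std::mutex m;
    std::condition_variable cond;
    int sum = 0;

    std::thread consumer([&] {
        for (int received = 0; received < n; ++received) {
            std::unique_lock<std::mutex> lck(m);
            // The predicate is re-checked after every wakeup, so a spurious
            // wakeup just puts the thread back to sleep.
            cond.wait(lck, [&] { return !queue.empty(); });
            sum += queue.front();
            queue.pop_front();
        }
    });
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i) {
            {
                std::lock_guard<std::mutex> guard(m);
                queue.push_back(i);
            }
            cond.notify_one();  // notify after releasing the lock
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```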

asynchronous programming

Asynchronous programming: callback function callback

std::future and std::promise, new in C++11, pass a result from one thread to another.
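A minimal sketch of both styles follows; the helper names are hypothetical. The first function wires a promise/future pair by hand, the second lets std::async create the future.

```cpp
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper: computes x * x on another thread and hands the
// result back through a promise/future pair.
int square_via_promise(int x) {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    std::thread t([x](std::promise<int> prom) { prom.set_value(x * x); },
                  std::move(p));
    int result = f.get();  // blocks until set_value() has been called
    t.join();
    return result;
}

// The same computation with std::async, which creates the future for us.
int square_via_async(int x) {
    std::future<int> f = std::async(std::launch::async, [x] { return x * x; });
    return f.get();
}
```

Compared with a callback, the caller decides when to collect the result: it can do other work first and call get() only when the value is actually needed.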

More lightweight: coroutines

C++20 provides coroutines, which suspend and resume at appropriate points; they are stackless coroutines based on a state machine.

They have been criticized as a poor imitation of Go (which supports them natively), and I have never seen anyone use them in C++.

I used a similar idea in this project, where I maintained a VM state stream from kernel to userspace that supported live migration, although there it is a stackful coroutine: /mahiru23/intravisor/tree/syscall/src

This article on coroutines is a good read: /lizhaolong/p/

Lock Implementation

User-level

Acquire/lock → Loop until CAS from “released” to “acquired”
Release/unlock → Set value to “released”
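The two steps above can be sketched as a small spinlock. SpinLock and count_with_spinlock are hypothetical names; the lock state is a single std::atomic<bool> where false means "released" and true means "acquired".

```cpp
#include <atomic>
#include <thread>

// Hypothetical user-level lock: false = released, true = acquired.
class SpinLock {
public:
    void lock() {
        bool expected = false;
        // Loop until the CAS from "released" to "acquired" succeeds.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire)) {
            expected = false;  // a failed CAS overwrote `expected` with the current value
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
private:
    std::atomic<bool> locked{false};
};

// Two threads increment a plain int, each increment guarded by the spinlock.
int count_with_spinlock(int iters) {
    SpinLock lock;
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    return counter;
}
```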

Drawbacks:

Busy waiting: with two threads, while t1 holds the lock t2 loops repeatedly trying to acquire it, wasting CPU cycles.
Potential thread starvation: a waiting thread may wait indefinitely (a queue might solve this problem).

Here's an example of a semaphore that only uses user-level atomic operations with busy-waiting:

#include <iostream>
#include <atomic>
#include <thread>
#include <cstdlib>
#include <ctime>
#include <cstdint>
#include <unistd.h>  // usleep

class mysem {
public:
    mysem(uint32_t init_value);
    void acquire();
    void release();
private:
    std::atomic<uint32_t> counter;
};

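mysem::mysem(uint32_t init_value) {
    counter.store(init_value, std::memory_order_seq_cst);
}

void mysem::acquire() {
    uint32_t cur = counter.load(std::memory_order_seq_cst);
    while (true) {
        // Decrement only while the counter is positive; the CAS makes the
        // check and the decrement a single atomic step.
        if (cur > 0 && counter.compare_exchange_weak(cur, cur - 1, std::memory_order_seq_cst)) {
            return;
        }
        // busy-wait: reload and try again
        cur = counter.load(std::memory_order_seq_cst);
    }
}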

void mysem::release() {
    counter.fetch_add(1);
}

void random_work() {
    usleep((rand()%1000)*10);
}

int main(int argc, char**argv)
{
    srand(time(nullptr));
    mysem s(1);
    std::thread t1([&](){
        random_work();
        s.acquire();
        std::cout << 1; random_work(); std::cout << 1;
        s.release();
    });
    std::thread t2([&](){
        random_work();
        s.acquire();
        std::cout << 2; random_work(); std::cout << 2;
        s.release();
    });
    t1.join(); t2.join();
    std::cout << std::endl;
}

Kernel-level

The thread sleeps when it blocks and is woken up by the kernel
Waking threads in order ensures fairness among blocked threads
However, this approach is more expensive because it requires a syscall

Hybrid

On Linux, pthread_mutex_lock uses the Linux futex internally.

Busy-wait at user level for short waits, and let the kernel block threads that wait longer (busy-wait, then block).

Glibc's pthread implementation tries to predict how long a wait might take and can adapt the user-level busy-wait time dynamically.

A hybrid implementation using CAS + futex is given below:

#include <iostream>
#include <atomic>
#include <thread>
#include <cstdlib>
#include <ctime>
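#include <cstdint>
#include <unistd.h>       // usleep, syscall
#include <sys/syscall.h>  // SYS_futex
#include <linux/futex.h>  // FUTEX_WAIT, FUTEX_WAKE
#include <sys/time.h>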

class mysem {
public:
    mysem(uint32_t init_value);
    void acquire();
    void release();
private:
    std::atomic<uint32_t> counter;
};

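mysem::mysem(uint32_t init_value) {
    counter.store(init_value, std::memory_order_seq_cst);
}

void mysem::acquire() {
    // Assumes std::atomic<uint32_t> has the same layout as uint32_t (true on Linux/glibc)
    uint32_t* counter_ptr = reinterpret_cast<uint32_t*>(&counter);
    while (true) {
        // Phase 1: busy-wait briefly at user level
        for (int i = 0; i < 100; ++i) {
            uint32_t expected = counter.load(std::memory_order_seq_cst);
            if (expected > 0 && counter.compare_exchange_strong(expected, expected - 1, std::memory_order_seq_cst)) {
                return;
            }
        }
        // Phase 2: sleep in the kernel, but only if the counter is still 0;
        // after waking up (or if the value changed), go back and retry.
        syscall(SYS_futex, counter_ptr, FUTEX_WAIT, 0, nullptr, nullptr, 0);
    }
}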

void mysem::release() {
    counter.fetch_add(1);
    uint32_t* counter_ptr = reinterpret_cast<uint32_t*>(&counter);
    syscall(SYS_futex, counter_ptr, FUTEX_WAKE, 1, nullptr, nullptr, 0);
}

void random_work() {
    usleep((rand()%1000)*10);
}

int main(int argc, char**argv)
{
    srand(time(nullptr));
    mysem s(1);
    std::thread t1([&](){
        random_work();
        s.acquire();
        std::cout << 1; random_work(); std::cout << 1;
        s.release();
    });
    std::thread t2([&](){
        random_work();
        s.acquire();
        std::cout << 2; random_work(); std::cout << 2;
        s.release();
    });
    t1.join(); t2.join();
    std::cout << std::endl;
}

Thread safety

thread_local
Citation:/p/77585472

C++11 introduced thread_local, which declares a variable that is private to each thread: every thread gets its own copy.
Application Scenario: Multi-threaded Lock-free Programming
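A minimal sketch of the per-thread copy, with hypothetical names: each thread increments its own instance of tls_counter, so no lock or atomic is needed and neither thread sees the other's increments.

```cpp
#include <thread>

thread_local int tls_counter = 0;  // every thread gets its own copy, initialized to 0

// Hypothetical helper: increments the calling thread's copy `n` times and returns it.
int bump_local_counter(int n) {
    for (int i = 0; i < n; ++i) ++tls_counter;
    return tls_counter;
}

// Runs the helper on two new threads; each starts from its own fresh 0.
bool threads_have_private_counters() {
    int a = 0, b = 0;
    std::thread t1([&] { a = bump_local_counter(1000); });
    std::thread t2([&] { b = bump_local_counter(2000); });
    t1.join(); t2.join();
    return a == 1000 && b == 2000;
}
```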