1. Multi-processor memory organization
1.1 SMP/Centralized Shared Memory
- A centralized shared-memory multiprocessor is also called a symmetric shared-memory multiprocessor (SMP)
- Multiple processors are connected to a centralized memory -- because all processors see the same memory structure, this is uniform memory access (UMA)
- Shared memory means that all processors have access to the entire memory address space.
- Can centralized memory become a bandwidth bottleneck? -- Not if large caches are used and the number of processors is fewer than about 12.
The Main Memory in the diagram is the shared memory. Processors broadcast on the bus, so each processor can see the requests and responses of the other processors.
1.2 Distributed Memory Multiprocessor
For better scalability, memory is distributed among the processors -> distributed memory multiprocessors (DMM)
- If one processor can directly access another processor's local memory, the address space is shared -> distributed shared memory (DSM) multiprocessor
- If memory is strictly local, data must be communicated via messages -> computer clusters or multi-computer systems
- Non-uniform memory access (NUMA), because local memory has lower latency than remote memory
In the figure, memory is integrated separately into each processor chip: memory changes from shared memory to local memory, and the processors communicate with each other through the interconnection network.
2. Cache Coherence Protocols (CCPs)
2.1 Shared-Memory vs. Message-Passing
Shared-memory:
- Easy to understand programming model
- Communication is implicit, hardware handles protection
- The caches are controlled by hardware
As shown in the figure below, an application has a number of threads, each of which can read or write shared memory.
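As a minimal sketch of this model (assuming POSIX threads; the variable and function names are illustrative), two threads communicate implicitly through an ordinary shared variable:

#include <pthread.h>
#include <stdio.h>

int shared_a = 0;                      /* lives in the single shared address space */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *writer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    shared_a = 42;                     /* an ordinary store; no explicit send */
    pthread_mutex_unlock(&m);
    return NULL;
}

void *reader(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    printf("a = %d\n", shared_a);      /* an ordinary load; hardware keeps caches coherent */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Whether the reader prints 0 or 42 depends on scheduling; the lock only keeps the two accesses from racing. The point is that the communication itself is just loads and stores, with the hardware doing the rest.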
Message-passing:
- No cache coherence -> simpler hardware
- Explicit communication -> encourages the programmer to restructure code for bulk data transfers
- The sender can initiate a data transfer
As shown in the following figure, when thread 2 needs the latest value of A, it obtains it through the message-passing procedure.
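A minimal sketch of the same exchange with explicit messages, using MPI's send/receive as a stand-in for the SEND/RECEIVE primitives in the figure (run with two ranks, e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, a = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                   /* plays thread 1: owns the latest A */
        a = 42;
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit send */
    } else if (rank == 1) {            /* plays thread 2: needs the latest A */
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("got A = %d\n", a);     /* data arrives by message, not by load */
    }
    MPI_Finalize();
    return 0;
}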
2.2 SMPs (Symmetric Shared-Memory Multiprocessors)
- Centralized main memory plus multiple caches -> multiple copies of the same data can exist
- If every read of a datum returns the value most recently written to it, the system is cache coherent
- Cache incoherence example: Cache-A and Cache-B become inconsistent at time 3
2.3 Cache Coherence
- Essential question: if multiple processors cache the same block, how do we make sure they all see a consistent state?
P2 performs ld r2, x:
P1 performs ld r2, x:
P1 modifies the value of x:
P2 then reads the value of x; without cache coherence, it reads a stale value:
A memory system is coherent if:
- Write propagation: after P1 writes X, once enough time has elapsed, P2's read of X returns the value written by P1.
- Write serialization: writes by two processors to the same location are observed in the same order by all processors.
- The memory consistency model defines the notion of "enough time elapsed", i.e., how long a processor's operation takes to become visible to other processors, and defines the ordering of reads/writes with respect to other locations (roughly; more on this later).
2.4 Cache Coherence Protocol
Directory-based: a single location (the directory) tracks the sharing state of each memory block (used for DMMs)
Snooping: every cache holding a block keeps the sharing state of that block, and all cache controllers monitor the shared bus so they can update a block's state when necessary (used for SMPs)
- Write invalidate: before writing, the processor gains exclusive access to the block by invalidating all other copies
- Write update: when a processor writes, it updates the other shared copies of the block (both policies are sketched in code below)
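A minimal sketch of how a snooping controller might treat the two policies (the types and names are illustrative, not from a real implementation; the tag computation is deliberately simplified):

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

typedef struct {
    unsigned long tag;        /* identifies which block this line holds */
    line_state_t  state;
    int           data;
} cache_line_t;

/* Called when this cache snoops a write by another processor on the bus. */
void snoop_remote_write(cache_line_t *line, unsigned long addr, int new_data) {
    if (line->state == INVALID || line->tag != (addr >> 6))
        return;                        /* we hold no copy of this block */

    /* Write invalidate: drop our copy; our next read will miss and refetch. */
    line->state = INVALID;

    /* Write update (the alternative policy) would instead keep the copy
     * and refresh it in place:
     *     line->data = new_data;      // state stays SHARED
     */
    (void)new_data;
}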
3. Snoop-based cache coherence protocols
3.1 SMP Snoop-based Cache Coherence Protocol Flow
Example 1:
Example 2:
- P1 reads X:
    - X is not in cache 1, so P1 issues a read request on the bus.
    - Main memory responds, placing X into cache 1 and marking its state "shared".
- P2 reads X:
    - X is not in cache 2, so P2 issues a read request on the bus.
    - All caches snoop the request. Cache 1 sees it, but since it is only a read request, cache 1 takes no action.
    - Main memory responds again, placing X into cache 2 in the "shared" state.
- P1 writes X:
    - X is in the "shared" state in cache 1, which grants read access only, so P1 must gain exclusive access before writing.
    - P1 issues a write request on the bus; cache 2 sees it and sets its copy of X to "invalid".
    - Cache 1 changes the state of X to "modified", meaning it holds the only valid copy and may modify it.
- P2 reads X:
    - X is "invalid" in cache 2, so P2 issues another read request on the bus.
    - Cache 1 sees the request and, holding the only valid copy of X, downgrades its state to "shared" and supplies the data to cache 2.
    - X is placed in cache 2 in the "shared" state, and main memory is updated with the value of X to keep it consistent.
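The walkthrough above is exactly the three-state MSI protocol. A compact sketch of its transitions in C (illustrative only; a real controller also handles write-backs, replacements, and races):

typedef enum { INVALID, SHARED, MODIFIED } msi_t;

/* Next state for the local processor's own reads and writes. */
msi_t on_cpu_access(msi_t s, int is_write) {
    if (is_write)
        return MODIFIED;                /* miss or upgrade: invalidate others, gain exclusive access */
    return (s == INVALID) ? SHARED : s; /* read miss fetches the block in SHARED; hits keep state */
}

/* Next state when a request by another processor is snooped on the bus. */
msi_t on_bus_request(msi_t s, int is_write) {
    if (is_write)
        return INVALID;                 /* a remote write invalidates our copy */
    if (s == MODIFIED)
        return SHARED;                  /* remote read: supply the data, downgrade to SHARED */
    return s;                           /* remote read of a SHARED/INVALID line: no action */
}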
Example 3:
3.2 Design issues
- Invalidate: when one processor writes a block, copies of that block in the other processors' caches must be invalidated, so that every processor subsequently reads up-to-date data. Invalidation is triggered by snooping the bus.
- Find data: when a processor needs a block, it may not know whether the block currently lives in another processor's cache, so the system needs a mechanism to locate it. The snooping protocol locates data by having every cache monitor the requests on the bus.
- Writeback/Writethrough: two cache update strategies. Writethrough propagates every write directly to main memory, which simplifies the coherence problem but consumes more memory bandwidth. Writeback writes to main memory only when a block is replaced or invalidated, reducing memory traffic but complicating coherence (both are sketched in code after this list).
- Cache block states: in a coherence protocol each cache block carries a state, such as Modified, Shared, Exclusive, and Invalid in the MESI protocol. These states encode the access permissions and validity of each cached block, ensuring data consistency.
- Contention for tags: the snoop controller must check the cache tags on every bus transaction, while the processor also needs the tags for its own accesses, so the two can contend for access to the tag array.
- Enforcing write serialization: the order of write operations matters. Write serialization ensures that writes to the same block are observed in a single order, preventing different processors from reading inconsistent data.
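A minimal sketch of the two write policies from the list above, reusing the cache_line_t from the earlier sketch (illustrative; cache lookup and allocation are omitted):

void store_writethrough(cache_line_t *line, int v) {
    line->data = v;
    /* propagate immediately, e.g. memory_write(addr, v);
     * every store costs memory bandwidth, but memory is never stale */
}

void store_writeback(cache_line_t *line, int v) {
    line->data  = v;
    line->state = MODIFIED;   /* memory is updated only on eviction or invalidation */
}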
3.3 Examples of protocols
4. Directory-based cache coherence protocol
In a multiprocessor system, to achieve high scalability, physical memory is distributed among the processor nodes: each processor owns a portion of physical memory instead of all processors sharing one centralized memory. To manage these distributed memory regions, the directory is distributed as well, and each directory records the sharing state of its corresponding memory blocks.
- Distribution of physical memory: physical memory is distributed across the processor nodes, so each processor can access its local memory faster (lower latency than remote memory). This improves scalability, making it easier to add processors and memory.
- Distribution of directories: the memory in each processor node has a corresponding directory recording its sharing state, i.e., which other processors hold copies of each block in their caches. With this information, the system no longer requires every processor to snoop a bus for coherence traffic, reducing bandwidth and overhead.
- The physical address determines the memory location: the physical address of each memory block determines which node (its home node) it belongs to. A processor can locate the memory directly from the address without broadcasting a request.
- Scalable interconnects and message routing: processing nodes communicate over a scalable interconnect rather than a traditional bus. On a bus every request is broadcast, but on a scalable interconnect messages are routed directly from sender to receiver. Since messages are no longer broadcast, a processor cannot observe all requests and must rely on the directory to track the sharing state of memory blocks (a sketch of a directory entry follows).
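As a minimal sketch, a directory entry in a full-bit-vector organization might look like the following (the field names and the block-interleaving rule are assumptions for illustration):

#include <stdint.h>

#define MAX_NODES 64

typedef enum { UNCACHED, SHARED_STATE, EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;        /* sharing state of the memory block */
    uint64_t    sharers;      /* bit i set => node i holds a cached copy */
    int         owner;        /* meaningful only when state == EXCLUSIVE */
} dir_entry_t;

/* The home node of a block is a fixed function of its physical address,
 * so a request can be routed directly to it without any broadcast. */
int home_node(uint64_t paddr, int nnodes) {
    return (int)((paddr >> 6) % (uint64_t)nnodes);  /* e.g. interleave 64-byte blocks */
}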
4.1 DMM Directory-Based Cache Coherence Protocol
Example 1:
- When the 4th instruction is: P4: Rd X
- When the 4th instruction is: P4: Wr X
Example 2:
4.2 Cache Block Status
What are the different states a memory block can have in the directory?
- Note that we need to record which caches hold each block, so that invalidate messages can be sent to them.
- For efficiency, the state of the memory block is also stored in the cache.
- The directory acts as an arbiter: if multiple writes occur at the same time, the directory determines the order in which they happen.
4.3 Directory Operation
The directory reacts to requests as follows (condensed into a handler sketch after the list):
- If the memory block is in the uncached state:
    - Read miss: send data, set block to shared state
    - Write miss: send data, set block to exclusive state
- If the memory block is in the shared state:
    - Read miss: send data, add node to sharer list
    - Write miss: send data, invalidate sharers, set block to exclusive state
- If the memory block is in the exclusive state:
    - Read miss: fetch data from the current owner, write it back to memory, send data, set block to shared state, add node to sharer list
    - Data write-back: write data to memory, set block to uncached state
    - Write miss: fetch data from the current owner, write it back to memory, send data, record the new owner, block stays exclusive
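The handler sketch, reusing the dir_entry_t from the previous sketch (message sending is abstracted into comments):

void on_read_miss(dir_entry_t *e, int requester) {
    switch (e->state) {
    case UNCACHED:
        /* send data from memory to the requester */
        e->state   = SHARED_STATE;
        e->sharers = 1ULL << requester;
        break;
    case SHARED_STATE:
        /* send data from memory to the requester */
        e->sharers |= 1ULL << requester;
        break;
    case EXCLUSIVE:
        /* fetch the block from e->owner, write it back to memory, forward it */
        e->state   = SHARED_STATE;
        e->sharers = (1ULL << e->owner) | (1ULL << requester);
        break;
    }
}

void on_write_miss(dir_entry_t *e, int requester) {
    if (e->state == SHARED_STATE) {
        /* send an invalidate to every node whose bit is set in e->sharers */
    } else if (e->state == EXCLUSIVE) {
        /* fetch the block from e->owner and write it back to memory first */
    }
    /* send data to the requester, which becomes the sole owner */
    e->state   = EXCLUSIVE;
    e->sharers = 1ULL << requester;
    e->owner   = requester;
}

void on_writeback(dir_entry_t *e) {
    /* the owner evicts its dirty copy: write the data to memory */
    e->state   = UNCACHED;
    e->sharers = 0;
}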
5. Multi-threaded programming model
5.1 Ocean Kernel
Procedure Solve(A)
begin
    diff = done = 0;
    while (!done) do
        diff = 0;
        for i <- 1 to n do
            for j <- 1 to n do
                temp = A[i,j];
                A[i,j] <- 0.2 * (A[i,j] + neighbors);
                diff += abs(A[i,j] - temp);
            end for
        end for
        if (diff < TOL) then done = 1;
    end while
end procedure
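For concreteness, here is the same sequential sweep in runnable C, with the "neighbors" term expanded to the four adjacent grid points (the grid size and TOL are illustrative; the grid carries a padded boundary):

#include <math.h>

#define N   256                 /* illustrative grid size */
#define TOL 1e-3f               /* illustrative convergence threshold */

float A[N + 2][N + 2];          /* interior points 1..N; row/column 0 and N+1 are boundary */

void solve(void) {
    int done = 0;
    while (!done) {
        float diff = 0.0f;
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                float temp = A[i][j];
                /* average the point with its four neighbors, updating in place */
                A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                          + A[i][j - 1] + A[i][j + 1]);
                diff += fabsf(A[i][j] - temp);
            }
        }
        if (diff < TOL)
            done = 1;
    }
}

The in-place update matches the pseudocode: later points in a sweep already see their neighbors' new values.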
5.2 Parallel Programming in a Shared Address Space Model
int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
    read(n); read(nprocs);
    A <- G_MALLOC();
    initialize(A);
    CREATE(nprocs, Solve, A);    // create nprocs parallel tasks, each calling Solve(A)
    WAIT_FOR_END(nprocs);        // wait for all tasks to complete
end main
- Synchronization: BARRIER(bar1, nprocs) synchronizes the processes at each stage of the computation, ensuring that all processes reach the same point before moving on to the next step.
- Locking: LOCK(diff_lock) and UNLOCK(diff_lock) protect the updates to diff, so that multiple processors cannot modify it at the same time, ensuring thread safety (a pthreads rendering of these macros follows).
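These macros map naturally onto POSIX threads primitives. A minimal sketch of the correspondence (assuming pthreads; this is not the macro package's actual implementation):

#include <pthread.h>

pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER; /* LOCKDEC(diff_lock) */
pthread_barrier_t bar1;                                  /* BARDEC(bar1)       */

/* in main(), before CREATE: pthread_barrier_init(&bar1, NULL, nprocs); */

/* LOCK(diff_lock); diff += mydiff; UNLOCK(diff_lock); */
void add_to_diff(float *diff, float mydiff) {
    pthread_mutex_lock(&diff_lock);
    *diff += mydiff;
    pthread_mutex_unlock(&diff_lock);
}

/* BARRIER(bar1, nprocs); */
void barrier(void) {
    pthread_barrier_wait(&bar1);
}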
procedure Solve(A)
    int i, j, pid, done = 0;
    float temp, mydiff = 0;
    int mymin = 1 + (pid * n/nprocs);
    int mymax = mymin + n/nprocs - 1;
    while (!done) do
        mydiff = diff = 0;
        BARRIER(bar1, nprocs);    // synchronize
        for i <- mymin to mymax
            for j <- 1 to n do
                ...
            endfor
        endfor
        LOCK(diff_lock);
        diff += mydiff;           // accumulate each process's partial sum under the lock
        UNLOCK(diff_lock);
        BARRIER(bar1, nprocs);    // synchronize
        if (diff < TOL) then done = 1;
        BARRIER(bar1, nprocs);
    endwhile
5.3 Parallel Programming under the Message Passing Model
main()
    read(n); read(nprocs);
    CREATE(nprocs-1, Solve);
    Solve();
    WAIT_FOR_END(nprocs-1);
- Unlike the shared address space model, processes here communicate through explicit message send (SEND) and receive (RECEIVE) operations. This approach suits distributed-memory systems, where each processor has its own separate memory region.
- In the while (!done) loop, each process first uses SEND and RECEIVE to exchange boundary rows with its neighboring processes, so that the computation can use the latest data from the adjacent rows.
- Non-master processes (pid != 0) use SEND to pass their mydiff to the master process, then RECEIVE the termination flag done.
procedure Solve()
    int i, j, pid, nn = n/nprocs, done = 0;
    float temp, tempdiff, mydiff = 0;
    myA <- malloc(...);
    initialize(myA);
    while (!done) do
        mydiff = 0;
        if (pid != 0)
            SEND(&myA[1,0], n, pid-1, ROW);
        if (pid != nprocs-1)
            SEND(&myA[nn,0], n, pid+1, ROW);
        if (pid != 0)
            RECEIVE(&myA[0,0], n, pid-1, ROW);
        if (pid != nprocs-1)
            RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
        for i <- 1 to nn do
            for j <- 1 to n do
                ...
            endfor
        endfor
        if (pid != 0)
            SEND(mydiff, 1, 0, DIFF);
            RECEIVE(done, 1, 0, DONE);
        else
            for i <- 1 to nprocs-1 do
                RECEIVE(tempdiff, 1, *, DIFF);
                mydiff += tempdiff;
            endfor
            if (mydiff < TOL) done = 1;
            for i <- 1 to nprocs-1 do
                SEND(done, 1, i, DONE);
            endfor
        endif
    endwhile
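The SEND/RECEIVE primitives above correspond closely to blocking MPI point-to-point calls. A minimal sketch of the boundary-row exchange (the array layout, tag, and row length are illustrative):

#include <mpi.h>

#define N 256                    /* illustrative row length */

/* Each process owns nn = n/nprocs interior rows, stored in myA[0..nn+1]:
 * rows 0 and nn+1 are ghost copies of the neighbors' edge rows. */
void exchange_boundaries(float myA[][N], int nn, int pid, int nprocs) {
    const int ROW = 0;           /* message tag, as in the pseudocode */
    if (pid != 0)                /* send my first interior row upward */
        MPI_Send(myA[1], N, MPI_FLOAT, pid - 1, ROW, MPI_COMM_WORLD);
    if (pid != nprocs - 1)       /* send my last interior row downward */
        MPI_Send(myA[nn], N, MPI_FLOAT, pid + 1, ROW, MPI_COMM_WORLD);
    if (pid != 0)                /* receive the ghost row from above */
        MPI_Recv(myA[0], N, MPI_FLOAT, pid - 1, ROW, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    if (pid != nprocs - 1)       /* receive the ghost row from below */
        MPI_Recv(myA[nn + 1], N, MPI_FLOAT, pid + 1, ROW, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}

Note that issuing both blocking sends before the receives, exactly as the pseudocode does, can deadlock once messages exceed the MPI implementation's internal buffering; MPI_Sendrecv (or non-blocking MPI_Isend/MPI_Irecv) is the usual way to make this exchange safe.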