1. Multi-processor memory organization
1.1 SMP/Centralized Shared Memory
- A centralized shared-memory multiprocessor is also called a symmetric shared-memory multiprocessor (SMP)
- Multiple processors are connected to a centralized memory -- because all processors see the same memory structure, this is uniform memory access (UMA)
- Shared memory means that all processors have access to the entire memory address space.
- Can centralized memory become a bandwidth bottleneck? -- Not if large caches are used and the number of processors is fewer than about 12.
The Main Memory in the diagram is the shared memory. Processors broadcast on the bus, so each processor can see the requests and responses of the other processors.
1.2 Distributed Memory Multiprocessor
For better scalability, memory is distributed among the processors -> distributed memory multiprocessors (DMM)
- If one processor can directly access another processor's local memory, the address space is shared -> distributed shared memory (DSM) multiprocessor
- If memory is strictly local, data must be communicated via messages -> computer clusters or multi-computer systems
- Non-uniform memory access (NUMA), because local memory has lower latency than remote memory
In the figure, memory is integrated separately into each processor chip: memory changes from shared memory to local memory, and the processors communicate with each other through the interconnection network.
2. Cache Coherence Protocols (CCPs)
2.1 Shared-Memory vs. Message-Passing
Shared-memory:
- Easy to understand programming model
- Communication is implicit, hardware handles protection
- The caches are controlled by hardware
As shown in the figure below, an application has a number of threads, each of which can read or write shared memory.
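As a minimal sketch of this model (assuming POSIX threads; the variable and function names are illustrative), two threads communicate implicitly through an ordinary shared variable:

#include <pthread.h>
#include <stdio.h>

int shared_a = 0;                      /* lives in the single shared address space */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *writer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    shared_a = 42;                     /* an ordinary store; no explicit send */
    pthread_mutex_unlock(&m);
    return NULL;
}

void *reader(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    printf("a = %d\n", shared_a);      /* an ordinary load; hardware keeps caches coherent */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Whether the reader prints 0 or 42 depends on scheduling; the lock only keeps the two accesses from racing. The point is that the communication itself is just loads and stores, with the hardware doing the rest.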
Message-passing:
- No cache coherence -> simpler hardware
- Explicit communication -> encourages the programmer to restructure code for bulk data transfers
- The sender can initiate a data transfer
As shown in the following figure, when thread 2 needs the latest value of A, it obtains it through the message-passing procedure.
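A minimal sketch of the same exchange with explicit messages, using MPI's send/receive as a stand-in for the SEND/RECEIVE primitives in the figure (run with two ranks, e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, a = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                   /* plays thread 1: owns the latest A */
        a = 42;
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit send */
    } else if (rank == 1) {            /* plays thread 2: needs the latest A */
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("got A = %d\n", a);     /* data arrives by message, not by load */
    }
    MPI_Finalize();
    return 0;
}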
2.2 SMPs (Symmetric Shared-Memory Multiprocessors)
- Centralized main memory plus multiple caches -> multiple copies of the same data can exist
- If every read of a datum returns the value most recently written to it, the system is cache coherent
- Cache incoherence example: Cache-A and Cache-B become inconsistent at time 3
2.3 Cache Coherence
- Essential question: if multiple processors cache the same block, how do we make sure they all see a consistent state?
P2 performs ld r2, x:
P1 performs ld r2, x:
P1 modifies the value of x:
P2 then reads the value of x; without cache coherence, it reads a stale value:
A memory system is coherent if:
- Write propagation: after P1 writes X, once enough time has elapsed, P2's read of X returns the value written by P1.
- Write serialization: writes by two processors to the same location are observed in the same order by all processors.
- The memory consistency model defines the notion of "enough time elapsed", i.e., how long a processor's operation takes to become visible to other processors, and defines the ordering of reads/writes with respect to other locations (roughly; more on this later).
2.4 Cache Coherence Protocol
Directory-based: a single location (the directory) tracks the sharing state of each memory block (used for DMMs)
Snooping: every cache holding a block keeps the sharing state of that block, and all cache controllers monitor the shared bus so they can update a block's state when necessary (used for SMPs)
- Write invalidate: before writing, the processor gains exclusive access to the block by invalidating all other copies
- Write update: when a processor writes, it updates the other shared copies of the block (both policies are sketched in code below)
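A minimal sketch of how a snooping controller might treat the two policies (the types and names are illustrative, not from a real implementation; the tag computation is deliberately simplified):

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

typedef struct {
    unsigned long tag;        /* identifies which block this line holds */
    line_state_t  state;
    int           data;
} cache_line_t;

/* Called when this cache snoops a write by another processor on the bus. */
void snoop_remote_write(cache_line_t *line, unsigned long addr, int new_data) {
    if (line->state == INVALID || line->tag != (addr >> 6))
        return;                        /* we hold no copy of this block */

    /* Write invalidate: drop our copy; our next read will miss and refetch. */
    line->state = INVALID;

    /* Write update (the alternative policy) would instead keep the copy
     * and refresh it in place:
     *     line->data = new_data;      // state stays SHARED
     */
    (void)new_data;
}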
3. Snoop-based cache coherence protocols
3.1 SMP Snoop-based Cache Coherence Protocol Flow
Example 1:
Example 2:
- P1 reads X:
    - X is not in cache 1, so P1 issues a read request on the bus.
    - Main memory responds, placing X into cache 1 and marking its state "shared".
- P2 reads X:
    - X is not in cache 2, so P2 issues a read request on the bus.
    - All caches snoop the request. Cache 1 sees it, but since it is only a read request, cache 1 takes no action.
    - Main memory responds again, placing X into cache 2 in the "shared" state.
- P1 writes X:
    - X is in the "shared" state in cache 1, which grants read access only, so P1 must gain exclusive access before writing.
    - P1 issues a write request on the bus; cache 2 sees it and sets its copy of X to "invalid".
    - Cache 1 changes the state of X to "modified", meaning it holds the only valid copy and may modify it.
- P2 reads X:
    - X is "invalid" in cache 2, so P2 issues another read request on the bus.
    - Cache 1 sees the request and, holding the only valid copy of X, downgrades its state to "shared" and supplies the data to cache 2.
    - X is placed in cache 2 in the "shared" state, and main memory is updated with the value of X to keep it consistent.
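The walkthrough above is exactly the three-state MSI protocol. A compact sketch of its transitions in C (illustrative only; a real controller also handles write-backs, replacements, and races):

typedef enum { INVALID, SHARED, MODIFIED } msi_t;

/* Next state for the local processor's own reads and writes. */
msi_t on_cpu_access(msi_t s, int is_write) {
    if (is_write)
        return MODIFIED;                /* miss or upgrade: invalidate others, gain exclusive access */
    return (s == INVALID) ? SHARED : s; /* read miss fetches the block in SHARED; hits keep state */
}

/* Next state when a request by another processor is snooped on the bus. */
msi_t on_bus_request(msi_t s, int is_write) {
    if (is_write)
        return INVALID;                 /* a remote write invalidates our copy */
    if (s == MODIFIED)
        return SHARED;                  /* remote read: supply the data, downgrade to SHARED */
    return s;                           /* remote read of a SHARED/INVALID line: no action */
}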
Example 3:
3.2 Design issues
- Invalidate: when one processor writes a block, copies of that block in the other processors' caches must be invalidated, so that every processor subsequently reads up-to-date data. Invalidation is triggered by snooping the bus.
- Find data: when a processor needs a block, it may not know whether the block currently lives in another processor's cache, so the system needs a mechanism to locate it. The snooping protocol locates data by having every cache monitor the requests on the bus.
- Writeback/Writethrough: two cache update strategies. Writethrough propagates every write directly to main memory, which simplifies the coherence problem but consumes more memory bandwidth. Writeback writes to main memory only when a block is replaced or invalidated, reducing memory traffic but complicating coherence (both are sketched in code after this list).
- Cache block states: in a coherence protocol each cache block carries a state, such as Modified, Shared, Exclusive, and Invalid in the MESI protocol. These states encode the access permissions and validity of each cached block, ensuring data consistency.
- Contention for tags: the snoop controller must check the cache tags on every bus transaction, while the processor also needs the tags for its own accesses, so the two can contend for access to the tag array.
- Enforcing write serialization: the order of write operations matters. Write serialization ensures that writes to the same block are observed in a single order, preventing different processors from reading inconsistent data.
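A minimal sketch of the two write policies from the list above, reusing the cache_line_t from the earlier sketch (illustrative; cache lookup and allocation are omitted):

void store_writethrough(cache_line_t *line, int v) {
    line->data = v;
    /* propagate immediately, e.g. memory_write(addr, v);
     * every store costs memory bandwidth, but memory is never stale */
}

void store_writeback(cache_line_t *line, int v) {
    line->data  = v;
    line->state = MODIFIED;   /* memory is updated only on eviction or invalidation */
}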
3.3 Examples of protocols
4. Directory-based cache coherence protocol
In a multiprocessor system, to achieve high scalability, physical memory is distributed among the processor nodes: each processor owns a portion of physical memory instead of all processors sharing one centralized memory. To manage these distributed memory regions, the directory is distributed as well, and each directory records the sharing state of its corresponding memory blocks.
- Distribution of physical memory: physical memory is distributed across the processor nodes, so each processor can access its local memory faster (lower latency than remote memory). This improves scalability, making it easier to add processors and memory.
- Distribution of directories: the memory in each processor node has a corresponding directory recording its sharing state, i.e., which other processors hold copies of each block in their caches. With this information, the system no longer requires every processor to snoop a bus for coherence traffic, reducing bandwidth and overhead.
- The physical address determines the memory location: the physical address of each memory block determines which node (its home node) it belongs to. A processor can locate the memory directly from the address without broadcasting a request.
- Scalable interconnects and message routing: processing nodes communicate over a scalable interconnect rather than a traditional bus. On a bus every request is broadcast, but on a scalable interconnect messages are routed directly from sender to receiver. Since messages are no longer broadcast, a processor cannot observe all requests and must rely on the directory to track the sharing state of memory blocks (a sketch of a directory entry follows).
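As a minimal sketch, a directory entry in a full-bit-vector organization might look like the following (the field names and the block-interleaving rule are assumptions for illustration):

#include <stdint.h>

#define MAX_NODES 64

typedef enum { UNCACHED, SHARED_STATE, EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;        /* sharing state of the memory block */
    uint64_t    sharers;      /* bit i set => node i holds a cached copy */
    int         owner;        /* meaningful only when state == EXCLUSIVE */
} dir_entry_t;

/* The home node of a block is a fixed function of its physical address,
 * so a request can be routed directly to it without any broadcast. */
int home_node(uint64_t paddr, int nnodes) {
    return (int)((paddr >> 6) % (uint64_t)nnodes);  /* e.g. interleave 64-byte blocks */
}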
4.1 DMM Directory-Based Cache Coherence Protocol
Example 1:
- When the 4th instruction is: P4: Rd X
- When the 4th instruction is: P4: Wr X
Example 2:
4.2 Cache Block Status
What are the different states a memory block can have in the directory?
- Note that we need to record which caches hold each block, so that invalidate messages can be sent to them.
- For efficiency, the state of the memory block is also stored in the cache.
- The directory acts as an arbiter: if multiple writes occur at the same time, the directory determines the order in which they happen.
4.3 Directory Operation
The directory reacts to requests as follows (condensed into a handler sketch after the list):
- If the memory block is in the uncached state:
    - Read miss: send data, set block to shared state
    - Write miss: send data, set block to exclusive state
- If the memory block is in the shared state:
    - Read miss: send data, add node to sharer list
    - Write miss: send data, invalidate sharers, set block to exclusive state
- If the memory block is in the exclusive state:
    - Read miss: fetch data from the current owner, write it back to memory, send data, set block to shared state, add node to sharer list
    - Data write-back: write data to memory, set block to uncached state
    - Write miss: fetch data from the current owner, write it back to memory, send data, record the new owner, block stays exclusive
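The handler sketch, reusing the dir_entry_t from the previous sketch (message sending is abstracted into comments):

void on_read_miss(dir_entry_t *e, int requester) {
    switch (e->state) {
    case UNCACHED:
        /* send data from memory to the requester */
        e->state   = SHARED_STATE;
        e->sharers = 1ULL << requester;
        break;
    case SHARED_STATE:
        /* send data from memory to the requester */
        e->sharers |= 1ULL << requester;
        break;
    case EXCLUSIVE:
        /* fetch the block from e->owner, write it back to memory, forward it */
        e->state   = SHARED_STATE;
        e->sharers = (1ULL << e->owner) | (1ULL << requester);
        break;
    }
}

void on_write_miss(dir_entry_t *e, int requester) {
    if (e->state == SHARED_STATE) {
        /* send an invalidate to every node whose bit is set in e->sharers */
    } else if (e->state == EXCLUSIVE) {
        /* fetch the block from e->owner and write it back to memory first */
    }
    /* send data to the requester, which becomes the sole owner */
    e->state   = EXCLUSIVE;
    e->sharers = 1ULL << requester;
    e->owner   = requester;
}

void on_writeback(dir_entry_t *e) {
    /* the owner evicts its dirty copy: write the data to memory */
    e->state   = UNCACHED;
    e->sharers = 0;
}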
5. Multi-threaded programming model
5.1 Ocean Kernel
Procedure Solve(A)
begin
    diff = done = 0;
    while (!done) do
        diff = 0;
        for i <- 1 to n do
            for j <- 1 to n do
                temp = A[i,j];
                A[i,j] <- 0.2 * (A[i,j] + neighbors);
                diff += abs(A[i,j] - temp);
            end for
        end for
        if (diff < TOL) then done = 1;
    end while
end procedure
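For concreteness, here is the same sequential sweep in runnable C, with the "neighbors" term expanded to the four adjacent grid points (the grid size and TOL are illustrative; the grid carries a padded boundary):

#include <math.h>

#define N   256                 /* illustrative grid size */
#define TOL 1e-3f               /* illustrative convergence threshold */

float A[N + 2][N + 2];          /* interior points 1..N; row/column 0 and N+1 are boundary */

void solve(void) {
    int done = 0;
    while (!done) {
        float diff = 0.0f;
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                float temp = A[i][j];
                /* average the point with its four neighbors, updating in place */
                A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                          + A[i][j - 1] + A[i][j + 1]);
                diff += fabsf(A[i][j] - temp);
            }
        }
        if (diff < TOL)
            done = 1;
    }
}

The in-place update matches the pseudocode: later points in a sweep already see their neighbors' new values.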
5.2 Parallel Programming in a Shared Address Space Model
int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
    read(n); read(nprocs);
    A <- G_MALLOC();
    initialize(A);
    CREATE(nprocs, Solve, A);    // create nprocs parallel tasks, each calling Solve(A)
    WAIT_FOR_END(nprocs);        // wait for all tasks to complete
end main
- Synchronization: BARRIER(bar1, nprocs) synchronizes the processes at each stage of the computation, ensuring that all processes reach the same point before moving on to the next step.
- Locking: LOCK(diff_lock) and UNLOCK(diff_lock) protect the updates to diff, so that multiple processors cannot modify it at the same time, ensuring thread safety (a pthreads rendering of these macros follows).
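These macros map naturally onto POSIX threads primitives. A minimal sketch of the correspondence (assuming pthreads; this is not the macro package's actual implementation):

#include <pthread.h>

pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER; /* LOCKDEC(diff_lock) */
pthread_barrier_t bar1;                                  /* BARDEC(bar1)       */

/* in main(), before CREATE: pthread_barrier_init(&bar1, NULL, nprocs); */

/* LOCK(diff_lock); diff += mydiff; UNLOCK(diff_lock); */
void add_to_diff(float *diff, float mydiff) {
    pthread_mutex_lock(&diff_lock);
    *diff += mydiff;
    pthread_mutex_unlock(&diff_lock);
}

/* BARRIER(bar1, nprocs); */
void barrier(void) {
    pthread_barrier_wait(&bar1);
}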
procedure Solve(A)
    int i, j, pid, done = 0;
    float temp, mydiff = 0;
    int mymin = 1 + (pid * n/nprocs);
    int mymax = mymin + n/nprocs - 1;
    while (!done) do
        mydiff = diff = 0;
        BARRIER(bar1, nprocs);    // synchronize
        for i <- mymin to mymax
            for j <- 1 to n do
                ...
            endfor
        endfor
        LOCK(diff_lock);
        diff += mydiff;           // accumulate each process's partial sum under the lock
        UNLOCK(diff_lock);
        BARRIER(bar1, nprocs);    // synchronize
        if (diff < TOL) then done = 1;
        BARRIER(bar1, nprocs);
    endwhile
5.3 Parallel Programming under the Message Passing Model
main()
    read(n); read(nprocs);
    CREATE(nprocs-1, Solve);
    Solve();
    WAIT_FOR_END(nprocs-1);
- Unlike the shared address space model, processes here communicate through explicit message send (SEND) and receive (RECEIVE) operations. This approach suits distributed-memory systems, where each processor has its own separate memory region.
- In the while (!done) loop, each process first uses SEND and RECEIVE to exchange boundary rows with its neighboring processes, so that the computation can use the latest data from the adjacent rows.
- Non-master processes (pid != 0) use SEND to pass their mydiff to the master process, then RECEIVE the termination flag done.
procedure Solve()
    int i, j, pid, nn = n/nprocs, done = 0;
    float temp, tempdiff, mydiff = 0;
    myA <- malloc(...);
    initialize(myA);
    while (!done) do
        mydiff = 0;
        if (pid != 0)
            SEND(&myA[1,0], n, pid-1, ROW);
        if (pid != nprocs-1)
            SEND(&myA[nn,0], n, pid+1, ROW);
        if (pid != 0)
            RECEIVE(&myA[0,0], n, pid-1, ROW);
        if (pid != nprocs-1)
            RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
        for i <- 1 to nn do
            for j <- 1 to n do
                ...
            endfor
        endfor
        if (pid != 0)
            SEND(mydiff, 1, 0, DIFF);
            RECEIVE(done, 1, 0, DONE);
        else
            for i <- 1 to nprocs-1 do
                RECEIVE(tempdiff, 1, *, DIFF);
                mydiff += tempdiff;
            endfor
            if (mydiff < TOL) done = 1;
            for i <- 1 to nprocs-1 do
                SEND(done, 1, i, DONE);
            endfor
        endif
    endwhile
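The SEND/RECEIVE primitives above correspond closely to blocking MPI point-to-point calls. A minimal sketch of the boundary-row exchange (the array layout, tag, and row length are illustrative):

#include <mpi.h>

#define N 256                    /* illustrative row length */

/* Each process owns nn = n/nprocs interior rows, stored in myA[0..nn+1]:
 * rows 0 and nn+1 are ghost copies of the neighbors' edge rows. */
void exchange_boundaries(float myA[][N], int nn, int pid, int nprocs) {
    const int ROW = 0;           /* message tag, as in the pseudocode */
    if (pid != 0)                /* send my first interior row upward */
        MPI_Send(myA[1], N, MPI_FLOAT, pid - 1, ROW, MPI_COMM_WORLD);
    if (pid != nprocs - 1)       /* send my last interior row downward */
        MPI_Send(myA[nn], N, MPI_FLOAT, pid + 1, ROW, MPI_COMM_WORLD);
    if (pid != 0)                /* receive the ghost row from above */
        MPI_Recv(myA[0], N, MPI_FLOAT, pid - 1, ROW, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    if (pid != nprocs - 1)       /* receive the ghost row from below */
        MPI_Recv(myA[nn + 1], N, MPI_FLOAT, pid + 1, ROW, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}

Note that issuing both blocking sends before the receives, exactly as the pseudocode does, can deadlock once messages exceed the MPI implementation's internal buffering; MPI_Sendrecv (or non-blocking MPI_Isend/MPI_Irecv) is the usual way to make this exchange safe.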