1. Memory architecture and subsystems
1.1 How is access controlled?
Access control:
- The storage cell is accessed through an access transistor, which acts like a switch connecting or disconnecting the memory cell from the bitline.
- The access transistor is controlled by the wordline. When the wordline is active, the access transistor turns on, allowing data to flow from the memory cell onto the bitline, or from the bitline into the cell.
Structure of DRAM (Dynamic Random Access Memory):
- A DRAM memory cell consists of one capacitor and one transistor. The capacitor stores the data (1 or 0), while the transistor acts as the access switch.
- Because capacitors leak charge, DRAM requires periodic refreshes to maintain data integrity.
SRAM Architecture:
- An SRAM memory cell is built from cross-coupled inverters: two inverters connected back to back, forming a stable bistable structure.
- SRAM requires 4 transistors for storage and 2 transistors for access.
- Unlike DRAM, SRAM does not require periodic refreshing, because its circuit structure holds the data indefinitely until it is rewritten.
Differences:
- DRAM's cell is relatively simple and takes up little physical space, so it provides higher storage density, but it requires refresh circuitry.
- SRAM's cell is more complex and takes up more space, so its storage density is lower, but it is faster to access and needs no refresh.
1.2 Memory Architecture
DIMM: a printed circuit board with DRAM chips mounted on the front and back.
Rank: a group of DRAM chips that work together to respond to a request and keep the data bus full.
A 64-bit data bus requires eight ×8 DRAM chips or four ×16 DRAM chips, and so on.
Bank: the subset of a rank that is busy during one request.
Row buffer: holds the last row read from the bank (e.g., 8 KB); it behaves much like a cache.
Channel: each channel connects to the processor via a command bus (cmd bus), an address bus (addr bus), and a data bus, carrying commands, addresses, and data respectively. These buses let the processor interact with multiple memory modules in parallel, improving the system's parallelism.
Memory Controller (MemCtrl): the processor manages access to memory through a memory controller, which schedules memory requests and routes data through the channels to specific regions of memory.
Rank: each channel contains one or more ranks; a rank is made up of multiple banks, which can be accessed simultaneously.
Bank: each rank contains multiple banks, and only one bank within a rank is busy during a given access. Each bank can store and access data independently, letting the system process data in different banks in parallel, further improving memory efficiency and bandwidth.
This hierarchical structure lets the memory system access different banks and ranks in parallel, which improves memory bandwidth and data-processing efficiency. The design is widely used in DRAM to reduce latency and increase data throughput.
1.3 DRAM Array Access
Consider a \(16\,\mathrm{Mb}\) DRAM array, i.e., a \(4096\times4096\) array of bits.
Row Access Strobe (RAS): since this DRAM array has 4096 rows, the row address is \(\log_2 4096 = 12\) bits. During an access, the 12-bit row address arrives first.
Column Access Strobe (CAS): since this DRAM array has 4096 columns, the column address is also \(\log_2 4096 = 12\) bits. During an access, the 12-bit column address arrives after the row address.
When the row access strobe arrives, the DRAM reads an entire row (4096 bits) into the row buffer. When the column access strobe arrives afterwards, the column decoder selects the requested data from the row buffer and returns it to the CPU.
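To make the split concrete, here is a minimal sketch in Python (the helper name and the flat-bit-address view are assumptions for illustration):

```python
import math

def split_dram_address(addr: int, rows: int = 4096, cols: int = 4096):
    """Split a flat bit address into (row, column) for a rows x cols array."""
    row_bits = int(math.log2(rows))                    # log2(4096) = 12
    col_bits = int(math.log2(cols))                    # log2(4096) = 12
    row = (addr >> col_bits) & ((1 << row_bits) - 1)   # upper bits, sent first (RAS)
    col = addr & ((1 << col_bits) - 1)                 # lower bits, sent second (CAS)
    return row, col

# 16 Mb array = 4096 x 4096 bits, addressed by 12 + 12 = 24 bits
print(split_dram_address(0xABC123))  # (2748, 291): row 0xABC, column 0x123
```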
1.4 Organization and Operation of the Memory Bank
Read access order:
- Decode the row address and drive the word-line
- Select and drive the bit-lines - the entire row is read
- Amplify the row data with the sense amplifiers
- Decode the column address and select a subset of the row - send it to the outputs
- Precharge the bit-lines - ready for the next access
1.5 DRAM main memory
- Main memory is stored in DRAM cells, which have a much higher storage density
- DRAM cells lose their state over time - they must be refreshed periodically, hence the term dynamic memory
- DRAM has a long access time and a high energy overhead
1.6 DRAM vs. SRAM
DRAM:
- Slower access (capacitor)
- Higher density (1T-1C cell)
- Lower cost
- Needs refresh (costs power, performance, and circuitry)
- Manufacturing requires putting capacitors and logic together
SRAM:
- Faster access (no capacitor)
- Lower density (6T cell)
- Higher cost
- No refresh needed
- Manufacturing is compatible with the logic process (no capacitors)
1.7 Organizational structure of Rank
- DIMM, rank, bank, array -> form a hierarchy in the memory organization
- Due to electrical constraints, only a few DIMMs can be connected to a bus
- One DIMM can have 1~4 ranks.
- For energy efficiency, wide-output DRAM chips should be used - activating \(4\times 16\)-bit chips per request is better than activating \(16\times 4\)-bit chips.
- For high capacity, narrow-output DRAM chips should be used - because the number of ranks per channel is limited, using \(16\times 4\)-bit 2 Gb chips rather than \(4\times 16\)-bit 2 Gb chips increases the capacity per rank, as the arithmetic sketch below shows.
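A quick sanity check of the capacity claim (a sketch; both configurations fill the 64-bit data bus):

```python
def rank_capacity_gbit(chips_per_rank: int, chip_capacity_gbit: int) -> int:
    """A rank's capacity is the sum of the capacities of all its chips."""
    return chips_per_rank * chip_capacity_gbit

# Both configurations fill a 64-bit bus: 16 chips x 4 bits, or 4 chips x 16 bits.
print(rank_capacity_gbit(16, 2))  # 32 Gb (= 4 GB) per rank with x4 chips
print(rank_capacity_gbit(4, 2))   # 8 Gb  (= 1 GB) per rank with x16 chips
```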
1.8 Organization of Banks and Arrays
- A rank is divided into several banks (4 to 16) to improve parallelism within the rank; accesses to different banks can proceed in parallel.
- Ranks and banks provide memory-level parallelism: by spreading data across different ranks and banks, the memory system can handle multiple memory requests at the same time, enhancing its parallel-processing capability.
- A bank consists of multiple arrays (subarrays, tiles, mats).
- To maximize density, the arrays in a bank should be very large: to store more data in a limited area, the array within each bank is made very large. This means the rows in the array are very wide, so the row buffer is also very wide. For example, a \(64\,\mathrm{Bytes}\) memory request may actually read \(8\,\mathrm{KBytes}\) of data (called overfetch) to take full advantage of the wide row buffer.
- Each array provides one bit of data per cycle to the output pins: to achieve higher storage density, each array delivers only 1 bit per clock cycle to the output pins. This reduces the amount of data transferred at once, but increases the system's total storage density, which suits workloads that store large amounts of data.
This organization achieves high-density storage through multiple levels of division (ranks, banks, arrays) and also improves the parallelism and performance of the memory system by accessing multiple banks and ranks in parallel.
1.9 Row Buffers
- Each bank has a row buffer
- The row buffer acts as a cache inside DRAM
- Row buffer hit: about 20 ns access time (just move the data from the row buffer to the pins)
- Empty row buffer access: about 40 ns (first read the array, then move the data from the row buffer to the pins)
- Row buffer conflict: about 60 ns (first precharge the bit-lines, then read the new row, then move the data to the pins)
- On top of this, requests may wait in a queue (tens of ns) and incur address/command/data transfer delays (~10 ns); a toy latency model follows below
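Putting these figures together, a toy latency model could look like the following (the constants are the approximate values listed above; this is illustrative, not a real controller model):

```python
# Approximate latencies from the list above, in nanoseconds (illustrative).
ROW_HIT, ROW_EMPTY, ROW_CONFLICT = 20, 40, 60

def access_latency_ns(open_row, requested_row, queue_wait_ns=0, xfer_ns=10):
    """Estimate one access's latency from the bank's row-buffer state."""
    if open_row is None:
        core = ROW_EMPTY              # empty row buffer: array read + move to pins
    elif open_row == requested_row:
        core = ROW_HIT                # hit: just move data to pins
    else:
        core = ROW_CONFLICT           # conflict: precharge + read + move to pins
    return queue_wait_ns + core + xfer_ns

print(access_latency_ns(open_row=7, requested_row=7))   # 30: hit + transfer
print(access_latency_ns(open_row=7, requested_row=9))   # 70: conflict + transfer
```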
1.10 Building larger memories
We need larger memory arrays, but large arrays mean slow access.
How do we keep memory capacity large without making it very slow?
Idea: divide memory into smaller arrays and interconnect those arrays with the input/output buses.
- Large memories are therefore usually hierarchical array structures.
- DRAM hierarchy: Channel → Rank → Bank → Subarrays → Mats
2. DRAM subsystem organization
2.1 Generalized Memory Structure
The figure above shows a generic memory structure with the following main components:
- Memory Controller: responsible for managing reads and writes of memory data. The figure shows two memory controllers, each connected to memory modules through its own channel.
- Channel: the data-transfer path between a memory controller and memory. Multiple channels let multiple controllers access different memory modules, increasing memory bandwidth.
- DRAM (Dynamic Random Access Memory): DRAM modules are organized into multiple ranks, each of which contains several banks, to further increase parallelism and efficiency.
- Rank: a logically organized unit of memory consisting of multiple physical memory chips, for easy access management by the memory controller.
- Bank: a smaller unit of memory that supports multiple concurrent accesses. Each bank contains multiple rows and columns.
- Row and Column: the basic addressing units within a bank; a specific data location is identified by its row and column addresses.
- Cache Line: the basic unit of data read by the CPU - the size of the block of data read at one time. The cache-line design improves the efficiency of data transfer.
2.2 General principle: interleaving (banking)
Interleaving (banking)
- Problem: a single large memory array takes a long time to access and cannot serve multiple accesses in parallel.
- Goal: reduce the access latency of memory arrays and enable multiple parallel accesses.
- Idea: divide a large array into multiple independently accessible banks that can be accessed in the same cycle or in consecutive cycles.
- Each bank is smaller than the whole memory.
- Accesses to different banks can overlap in time.
- Access latency stays under control.
2.3 Memory Banking Example
- Memory is divided into several banks that can be accessed independently, enabling multiple accesses in parallel. Each bank can perform reads and writes on its own, reducing waiting time.
- Shared address and data bus: the banks share the address and data buses. This design reduces the number of pins on a memory chip, lowering hardware complexity and cost.
- One bank access completed per cycle: by accessing different banks in parallel, an access to some bank can be initiated and completed in every cycle, improving overall memory-access efficiency.
- Supports N concurrent accesses: if all N access requests target different banks, the system can service N concurrent memory accesses. In other words, if requests are spread evenly across banks, the memory bandwidth can be fully utilized (see the sketch after this list).
- The diagram shows 16 banks (Bank 0 to Bank 15).
- Each bank has its own Memory Data Register (MDR) and Memory Address Register (MAR), which hold the data being transferred and the address information.
- The banks are connected to the CPU via a data bus, which carries the data, and an address bus, which carries the address information.
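A minimal sketch of this interleaving, assuming the 16 banks and 64 B cache blocks of the example (the helper name is mine):

```python
NUM_BANKS = 16              # Bank 0 .. Bank 15, as in the diagram

def bank_of(addr: int, block_bytes: int = 64) -> int:
    """Low-order interleaving: consecutive blocks map to consecutive banks."""
    return (addr // block_bytes) % NUM_BANKS

# 16 consecutive cache blocks land in 16 different banks, so all 16
# accesses can be outstanding at once (the best case for N = 16).
print([bank_of(0x1000 + i * 64) for i in range(16)])
# [0, 1, 2, ..., 15] -- no two requests conflict on a bank
```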
2.4 DRAM subsystem
The processor in the middle of the figure is connected to the memory modules through two separate memory channels. Multiple memory modules (DIMMs, Dual In-line Memory Modules) can be plugged into each channel. The channel is responsible for transferring data between the memory and the processor.
2.4.1 Decomposing DIMMs (modules)
The diagram above shows the front and back of a DIMM:
- Rank 0: Located on the front of the DIMM, it consists of 8 chips.
- Rank 1: Located on the back of the DIMM, it also consists of 8 chips.
2.4.2 Decomposing Rank
Each memory channel includes:
- Addr/Cmd: the address and command signals used to send memory-access requests to the different ranks.
- CS (Chip Select): the selection signal that chooses which rank to operate on. In the figure, CS <0:1> is the control signal that selects Rank 0 or Rank 1.
- Data <0:63>: the data bus, providing a 64-bit data path for transferring data.
Internally, a rank is composed of multiple chips:
Rank 0 is broken down into multiple chips, from Chip 0 to Chip 7, each responsible for a portion of the data bits:
- Chip 0 is responsible for data bits 0 through 7 (<0:7>).
- Chip 1 is responsible for data bits 8 through 15 (<8:15>).
- And so on, up to Chip 7, which is responsible for data bits 56 through 63 (<56:63>).
64-bit data path: together the chips form the 64-bit data bus (Data <0:63>), each contributing the 8 data bits it is responsible for, enabling parallel data transfer. This design lets the chips within a rank work simultaneously, improving data-access efficiency.
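As a small illustration of this byte-slicing (a hypothetical helper; the lane assignment follows the chip list above):

```python
def chip_slices(word: int) -> list[int]:
    """Split a 64-bit word into the 8-bit slice each of the 8 chips drives.

    Chip 0 drives bits <0:7>, Chip 1 bits <8:15>, ..., Chip 7 bits <56:63>.
    """
    return [(word >> (8 * chip)) & 0xFF for chip in range(8)]

for chip, byte in enumerate(chip_slices(0x1122334455667788)):
    print(f"Chip {chip} drives bits <{8*chip}:{8*chip+7}>: 0x{byte:02X}")
# Chip 0 drives bits <0:7>: 0x88 ... Chip 7 drives bits <56:63>: 0x11
```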
2.4.3 Decomposing Chip
Internally, a memory chip is composed of multiple banks:
Chip 0: on the left side of the diagram, Chip 0 is responsible for an 8-bit slice of the data bus (<0:7>), meaning this chip only transmits bits 0 through 7 of the data.
Internal banks: in the enlarged view on the right, Chip 0 is further broken down into eight independent banks. Each bank can store and read data independently, allowing the chip to handle multiple data requests at once and increasing the parallelism of data transfers.
2.4.4 Decomposing Bank
Internally, a memory bank is composed of multiple arrays that together deliver 1 byte per access:
- A read first brings an entire row into the row buffer.
- Then the 1 byte (8 bits) selected by the column address is taken from the row buffer.
2.4.5 Digging Deeper: DRAM Bank Operations
Access 1 (Row 0, Column 0):
Step 1: Row address 0 arrives; the word-line is activated and the bit-lines are connected to the sense amplifiers.
Step 2: The sense amplifiers sense the contents of the row and capture the data into the row buffer.
Step 3: Column address 0 arrives and selects the column.
Step 4: Finally the data is read out.
Access 2 (Row 0, Column 1):
Because of a row buffer hit, the data is read out quickly.
Access 3 (Row 0, Column 85):
Again because of a row buffer hit, the data is read out quickly.
Access 4 (Row 1, Column 0):
A row buffer conflict occurs!
Step 1: Close the open row, i.e., precharge the bit-lines, so they are ready for the next access. This adds delay.
Step 2: Row address 1 arrives; the word-line is activated and the bit-lines are connected to the sense amplifiers.
Step 3: The sense amplifiers sense the contents of the row and capture the data into the row buffer.
Step 4: Column address 0 arrives and selects the column.
Step 5: Finally the data is read out.
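These four accesses can be classified mechanically with a one-bank, open-row model; a sketch, reusing the approximate latencies from section 1.9:

```python
def classify_accesses(accesses):
    """Trace accesses to one bank under an open-row policy."""
    open_row = None
    for row, col in accesses:
        if open_row == row:
            kind = "row buffer hit"       # ~20 ns
        elif open_row is None:
            kind = "empty row buffer"     # ~40 ns
        else:
            kind = "row buffer conflict"  # ~60 ns: precharge, then activate
        open_row = row                    # the accessed row is left open
        print(f"(Row {row}, Column {col}): {kind}")

classify_accesses([(0, 0), (0, 1), (0, 85), (1, 0)])
# (Row 0, Column 0):  empty row buffer
# (Row 0, Column 1):  row buffer hit
# (Row 0, Column 85): row buffer hit
# (Row 1, Column 0):  row buffer conflict
```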
2.5 DRAM Bank with internal Sub-Banks
The figure on the left (a) is the logical abstraction: a conventional view of a DRAM bank containing rows, a row decoder, and one large row buffer. In this abstraction, the bank appears to be a single unit with up to 32K rows.
The figure on the right (b) shows the actual physical implementation. A DRAM bank is in fact divided into subarrays, each with its own local row buffer, and each subarray consists of 512 rows. The structure from subarray 1 to subarray 64 is shown here. The subarrays are connected through a global decoder to the bank's global row buffer.
- Logical abstraction: logically, a bank is treated as a single monolithic structure in which all rows share one row buffer.
- Physical implementation: in practice, the bank is divided into multiple subarrays to improve access efficiency. Each subarray has its own local row buffer, allowing subarrays to process data in parallel for better parallelism and performance.
This design reduces wait times and improves DRAM access speed and parallelism by splitting the bank into multiple subarrays.
2.6 Example: Transferring a Cache Block
- It takes 8 I/O cycles to transfer a 64B cache block.
- During this process, the 8 columns are read sequentially.
(Steps 1-4 are shown in the accompanying figures.)
3. Memory controller
3.1 Open/Closed Page Policies
- If the access stream has locality, keep the row buffer open
  - Row buffer hits are cheap (open-page policy)
  - A row buffer miss is a bank conflict and is costly, because the precharge is then on the critical path
- If the access stream has little or no locality, precharge the bit-lines immediately after each access (close-page policy)
  - Almost every access misses the row buffer
  - But the precharge is usually not on the critical path
- Modern memory-controller policies fall somewhere in between (and are often proprietary)
3.2 Read and Write Operations
- Read and write operations use the same bus.
- When switching between reads and writes, the bus direction must be reversed; this takes time and leaves the bus idle.
- Therefore, writes are typically performed in bursts: a write buffer holds pending writes until a high water mark is reached.
- The write burst then continues until a low water mark is reached (see the sketch after this list).
- High Water Mark (HWM): a predetermined upper limit on buffer occupancy. When the amount of buffered data reaches this mark, the system triggers an action, such as starting to drain the writes to memory, to avoid a buffer overflow.
- Low Water Mark (LWM): a predetermined lower limit on buffer occupancy. When the amount of buffered data drops to this mark, the system ends the write burst and resumes other operations, so that the bus is not monopolized by writes.
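A minimal write-buffer sketch with the two watermarks (the class name, watermark values, and `issue` callback are assumptions for illustration):

```python
from collections import deque

class WriteBuffer:
    """Collect writes and drain them in bursts between two watermarks."""

    def __init__(self, high_water: int = 48, low_water: int = 16):
        self.pending = deque()
        self.high_water = high_water
        self.low_water = low_water
        self.draining = False          # True while a write burst is in progress

    def enqueue(self, write) -> None:
        self.pending.append(write)
        if len(self.pending) >= self.high_water:
            self.draining = True       # HWM reached: turn the bus around once

    def drain_step(self, issue) -> None:
        """Issue one buffered write per call while a burst is active."""
        if self.draining and self.pending:
            issue(self.pending.popleft())
            if len(self.pending) <= self.low_water:
                self.draining = False  # LWM reached: hand the bus back to reads
```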
3.3 Address Mapping Policy
- Consecutive cache lines can be placed in the same row to improve the row-buffer hit rate
- Consecutive cache lines can be placed in different banks/channels to improve parallelism
- Address mapping policy examples (decoded in the sketch below):
- row : rank : bank : channel : column : blkoffset
- row : column : rank : bank : channel : blkoffset
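The two mappings can be compared by decoding two consecutive cache-line addresses; in this sketch the field widths are illustrative assumptions, not values from the text:

```python
def decode(addr: int, fields):
    """Decode an address into named fields; fields are listed high -> low."""
    out = {}
    for name, bits in reversed(fields):       # peel off the low-order field first
        out[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return out

# Illustrative widths: 6-bit block offset, 10-bit column, 1-bit channel,
# 3-bit bank, 1-bit rank, and the remaining bits for the row.
MAP_A = [("row", 16), ("rank", 1), ("bank", 3), ("channel", 1),
         ("column", 10), ("blkoffset", 6)]    # row : rank : bank : channel : column : blkoffset
MAP_B = [("row", 16), ("column", 10), ("rank", 1), ("bank", 3),
         ("channel", 1), ("blkoffset", 6)]    # row : column : rank : bank : channel : blkoffset

for addr in (0x0000, 0x0040):                 # two consecutive 64 B cache lines
    a, b = decode(addr, MAP_A), decode(addr, MAP_B)
    print((a["channel"], a["bank"], a["column"]), (b["channel"], b["bank"]))
# Under MAP_A the two lines share a channel and bank, differing only in column
# (a row-buffer hit); under MAP_B the second line moves to another channel
# (more parallelism).
```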
3.4 Scheduling Strategy
- FCFS (first-come, first-served): execute the oldest read or write request in the queue that is ready to issue.
- Under FCFS, requests are executed strictly in arrival order: as soon as a request is ready (meets the issue conditions), it is executed. This approach is simple, but it may fail to exploit row-buffer hits, resulting in poor performance.
- FR-FCFS (First-Ready, First-Come-First-Served): prioritize row-buffer hits whenever possible.
- This policy first checks whether any queued request hits the row buffer (i.e., its target row is already open). If so, those requests are served first, since this avoids row-activation overhead and speeds up access. If there are no row-buffer hits, requests are processed first-come, first-served. A selection sketch follows below.
- Stall-Time Fair: prioritize row-buffer hits, unless some threads have been starved.
- This policy adds fairness on top of row-buffer-hit prioritization. When multiple threads compete for memory, it tries to favor row-buffer hits while still ensuring every thread gets a fair chance to access memory, so that no thread is delayed indefinitely. This helps balance resource allocation in a multithreaded environment.
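A sketch of FR-FCFS request selection (the queue layout and field names are assumptions):

```python
def fr_fcfs_pick(queue, open_rows):
    """FR-FCFS: prefer the oldest request that hits an open row; otherwise
    fall back to plain FCFS (the oldest request overall)."""
    for req in queue:                            # queue is in arrival order
        if open_rows.get(req["bank"]) == req["row"]:
            return req                           # oldest row-buffer hit
    return queue[0] if queue else None           # no hit: first come, first served

queue = [{"bank": 0, "row": 5},                  # oldest, but row 5 is not open
         {"bank": 1, "row": 3},                  # hits bank 1's open row
         {"bank": 0, "row": 7}]                  # would also hit, but is younger
open_rows = {0: 7, 1: 3}                         # currently open row per bank
print(fr_fcfs_pick(queue, open_rows))            # {'bank': 1, 'row': 3}
```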
3.5 Refresh
Each memory cell in DRAM (Dynamic Random Access Memory) represents its data by the charge stored on a capacitor. Since this charge gradually leaks away over time, periodic refreshes are required to replenish it and ensure that data is not lost.
- Refresh window: every DRAM cell must be refreshed within 64 milliseconds to prevent data loss from charge leakage.
- Automatic refresh: when a row is read or written, that row is refreshed as a side effect, which helps extend its retention.
- Impact of refresh commands: each refresh command refreshes a certain number of rows. The memory is temporarily unavailable while refreshing, which causes small delays.
- Refresh frequency: memory controllers typically issue a refresh command every 7.8 microseconds on average, spreading out the refresh burden and avoiding the performance impact of refreshing everything at once. The arithmetic behind the 7.8 µs figure is sketched below.
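The 7.8 µs figure follows directly from the 64 ms window if the device is covered by 8192 refresh commands per window (a common configuration, assumed here):

```python
RETENTION_MS = 64        # every cell must be refreshed within this window
REFRESH_CMDS = 8192      # typical: 8K refresh commands cover the whole device

interval_us = RETENTION_MS * 1000 / REFRESH_CMDS
print(f"average refresh interval: {interval_us:.1f} us")  # 7.8 us
```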