
DDCA -- Cache: Cache Architecture, Cache Operations


1. The Memory Hierarchy

1.1 Memory in modern systems

A modern memory system contains multiple levels of on-chip cache (L1, L2, L3) as well as DRAM main memory.

1.2 Limitations of Memory

The ideal memory requirements are as follows:

  • Zero latency
  • Unlimited capacity
  • Zero cost
  • Unlimited bandwidth
  • Zero power consumption

But these ideal requirements conflict with one another:

  • Larger capacity means longer latency: it takes longer to locate the data
  • Faster memory is more expensive
    • Memory technology: SRAM vs. DRAM vs. Disk vs. Tape
  • Higher bandwidth memory is more expensive
    • Needs more banks, more ports, a higher operating frequency, or a faster technology

1.3 Why use a memory hierarchy

We want the memory to be fast and large, and that can't be achieved with a single layer of memory.

Idea: use multiple levels of storage (the farther from the processor, the larger and slower the level) and ensure that most of the data the processor needs is kept in the faster level(s).

With good reference locality, a memory hierarchy can make memory appear both fast and large.

1.4 Memory hierarchy

From smallest and fastest to largest and slowest, the levels are: Register File, Cache, Main Memory, and Hard Disk.

1.5 Locality

  • One's recent past is a very good predictor of one's near future.
  • Temporal Locality: if you just did something, it is very likely that you will do the same thing again soon.
    • You are in this classroom today; it is therefore very likely that you will be in this classroom again, regularly.
  • Spatial Locality: if you did something, it is very likely that you will do something similar/related (in space).
    • Every time I find you in this classroom, you are probably sitting close to the same people, or in a nearby seat.

1.6 Memory locality

  • A "typical" program has a lot of locality in its memory references.
    • Largely because programs are built out of loops
  • Temporal: programs tend to reference the same memory location many times within a small window of time
  • Spatial: programs tend to reference nearby memory locations within a window of time
    • Instruction memory references → usually sequential/streaming (instructions execute sequentially)
    • Array/vector references → usually streaming/strided

1.7 Caching Basics

1.7.1 Exploiting temporal locality

  • Idea: store recently accessed data in an automatically managed fast memory (called a cache).
  • Anticipation: the same memory location will be accessed again soon.

1.7.2 Exploiting spatial locality

  • Idea: in the automatically managed fast memory, also store data at addresses adjacent to the most recently accessed address
    • Logically divide memory into equal-size blocks
    • Fetch the entire accessed block into the cache
  • Anticipation: memory locations adjacent to the currently accessed location will be accessed soon.

2. Cache hierarchy

2.1 Caches in pipelined designs

  • The cache needs to be tightly integrated with the pipeline

    • Ideally, data should be accessible within 1 cycle so that load-dependent operations do not stall
  • However, a high-frequency pipeline → the cache cannot be made too large

    • Yet we want both a large cache and a pipelined design
  • Idea: cache hierarchy

In such a hierarchy, the first-level cache (L1) design is driven mainly by the processor's clock frequency, so L1 is kept very small; the second-level cache (L2) is larger than L1 but also slower to access.

2.2 The most basic Cache structure

A cache consists of a Tag Store and a Data Store.

When the processor needs to query the cache, it sends an address to the Tag Store.

If the Tag Store indicates that the address is present in the cache, the data can be read from the Data Store.

If the Tag Store indicates that the address is not in the cache, the data must be fetched from main memory.

2.3 Analyzing the memory hierarchy design

2.3.1 Hierarchy latency analysis

  • A program's lookup at a given level either hits or misses (Hit/Miss), i.e., the data is in that level or it is not.

  • Assume level \(i\) of the memory hierarchy has a technology-intrinsic access time \(t_i\).

  • The actual perceived access time \(T_i\) is longer than \(t_i\).

  • Except at the outermost level of the hierarchy, a lookup of a given address has two outcomes:

    • On a hit (hit rate \(h_i\)), the access time is \(t_i\)
    • On a miss (miss rate \(m_i\)), the access time is \(t_i + T_{i+1}\)
    • \(h_i + m_i = 1\)
  • Therefore \(T_i = h_i \cdot t_i + m_i \cdot (t_i + T_{i+1})\), i.e., \(T_i = t_i + m_i \cdot T_{i+1}\)

2.3.2 Hierarchy design considerations

  • Recursive latency equation

    \[T_i = t_i + m_i \cdot T_{i+1} \]

  • Goal: achieve the desired \(T_1\) within the allowed cost

  • Ideally \(T_i \approx t_i\), although that is unattainable

  • Reduce the miss rate \(m_i\)

    • Increasing the capacity \(C_i\) lowers \(m_i\), but it also increases the intrinsic access time \(t_i\)
    • Lower \(m_i\) through smarter cache management (replace blocks predicted to be unneeded, prefetch blocks predicted to be needed)
  • Reduce the perceived access time of the next level, \(T_{i+1}\)

    • Use a faster next level, but watch out for increased cost
    • Introduce an intermediate level as a compromise

2.3.3 Example: Intel Pentium 4

  • 90nm P4, 3.6 GHz
  • L1 D-cache
    • \(C_1\) = 16 kB
    • \(t_1\) = 4 cyc int / 9 cycle fp
  • L2 D-cache
    • \(C_2\) = 1024 kB
    • \(t_2\) = 18 cyc int / 18 cyc fp
  • Main memory
    • \(t_3\) = ~ 50ns or 180 cyc

\(T_i = t_i + m_i\cdot T_{i+1}\)

  • if \(m_1=0.1, m_2=0.1\)
    • \(T_2 = t_2 + m_2 T_3 = 18 + 0.1 \times 180 = 36\) cyc
    • \(T_1 = t_1 + m_1 T_2 = 4 + 0.1 \times 36 = 7.6\) cyc
  • if \(m_1=0.01, m_2=0.01\)
    • \(T_1 = 4.2\) cyc, \(T_2 = 19.8\) cyc
  • if \(m_1=0.05, m_2=0.01\)
    • \(T_1 = 5.00\) cyc, \(T_2 = 19.8\) cyc
  • if \(m_1=0.01, m_2=0.50\)
    • \(T_1 = 5.08\) cyc, \(T_2 = 108\) cyc
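
To make the recursion concrete, the following is a minimal C sketch (not from the original lecture; the function and variable names are mine) that evaluates \(T_i = t_i + m_i \cdot T_{i+1}\) from the last level back toward L1, using the Pentium 4 numbers above.

```c
#include <stdio.h>

/* Evaluate T_i = t_i + m_i * T_(i+1), starting from the outermost level.
 * t[]: intrinsic access times in cycles (last entry is main memory),
 * m[]: miss rates of the cache levels (one fewer entry than t[]).      */
static double perceived_latency(const double *t, const double *m, int levels)
{
    double T = t[levels - 1];           /* main memory: T_n = t_n             */
    for (int i = levels - 2; i >= 0; i--)
        T = t[i] + m[i] * T;            /* apply T_i = t_i + m_i * T_(i+1)    */
    return T;                           /* T_1, the latency the core observes */
}

int main(void)
{
    double t[] = { 4.0, 18.0, 180.0 };  /* L1, L2, main memory (cycles) */
    double m[] = { 0.01, 0.01 };        /* miss rates of L1 and L2      */
    printf("T1 = %.2f cycles\n", perceived_latency(t, m, 3));
    return 0;
}
```

With \(m_1 = m_2 = 0.01\) this prints \(T_1 \approx 4.20\) cycles, matching the second case above.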

3. Cache fundamentals and operations

3.1 Caching

  • Any structure that "stores" data that has been used (or produced) before
    • to avoid repeating the long-latency operations needed to fetch/produce the data from scratch
    • e.g., a web cache
  • Most common use in processor design: an automatically managed memory structure
    • e.g., storing the most frequently or most recently accessed DRAM locations in fast SRAM to avoid paying the DRAM access latency repeatedly
    • Management decisions:
      • What data do we bring into the cache?
      • What data do we keep in the cache?

3.2 Basic Hardware Cache

  • We'll start with basic hardware cache examples and cache operations
  • We will then formalize some basic concepts of caching
  • Finally, we will look at ideas and innovations to improve cache performance

3.2.1 Access to the cache

  • Cache size: the total cache capacity is 64 bytes, organized as eight 8-byte blocks (8-byte words).

  • Byte address: the low 3 bits of the address are the byte offset within a block, since each block holds 8 bytes (\(2^3 = 8\)).

  • Index bits: there are 3 index bits, since there are 8 cache lines (\(2^3 = 8\)), each holding one 8-byte block.

Direct-mapped cache: each main memory block maps to exactly one location in the cache. The index bits of the address determine that location, so every address corresponds to a single, unique cache line.

3.2.2 Tag Array

When a main memory address maps to a cache line, the hardware compares the Tag portion of that address with the Tag stored in the tag array to decide whether the data is in the cache. If the tags match, the data is in the cache (hit); if they do not match, the data must be loaded from main memory (miss).

3.2.3 Increasing Block Size

Changing each line from 8 bytes of data to 32 bytes of data:

  • Cache size: the total cache capacity is now 256 bytes, organized as eight 32-byte blocks.

  • Byte address: the low 5 bits of the address are the byte offset within a block, since each block holds 32 bytes (\(2^5 = 32\)).

  • Index bits: there are still 3 index bits, since there are 8 cache lines (\(2^3 = 8\)), each now holding a 32-byte block.

Impact of larger cache lines

  • Smaller tag array: for the same data capacity, each line holds more data, so fewer lines (and fewer tags) are needed, which shrinks the tag array.
  • Fewer cache misses: because of spatial locality, a program that accesses one piece of data often accesses nearby data soon after. Larger lines bring in more neighboring data at once, lowering the miss rate of subsequent accesses.

3.2.4 Associativity

Structure of a 2-way set-associative cache:

  • Tag array: with two ways, each set can hold more than one block. In this example there are two ways (Way-1 and Way-2), so the tag array holds two tags per set, allowing each set to store two different blocks.
  • Data array: the data is likewise split into two ways (Way-1 and Way-2), each with its own data storage. Compared with a direct-mapped cache, a set-associative cache can hold more blocks with the same index, reducing cache conflicts.
  • Compare: on a lookup, both tags of the indexed set (Way-1 and Way-2) are compared with the tag of the requested address. If either way's tag matches, it is a cache hit and the data is read from that way.

Advantages and disadvantages of set associativity

  • Advantage: fewer cache conflicts, because multiple blocks can map to the same cache set, solving the conflict problem of direct-mapped caches.
  • Disadvantage: higher power consumption, because multiple tags (and possibly data blocks) are read in parallel for matching and selection.

3.2.5 Example

32 KB 4-way set-associative cache with a 32-byte line size, assuming 40-bit addresses

  • How many sets?

    \(32\,\text{KB} / (32\,\text{B} \times 4) = 32\,\text{KB} / 128\,\text{B} = 256\) sets

  • How many index bits, offset bits, tag bits?

    index bits (8): \(2^8 = 256\) sets

    offset bits (5): \(2^{5} = 32\) bytes

    tag bits (27): \(40 - 5 - 8 = 27\) bits

  • How large is the tag array?

    \(27\,\text{bits} \times 256 \times 4 = 27\,\text{Kbit}\)
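
As a sanity check on those bit widths, here is a minimal C sketch (the constants and the example address are mine, not from the original) that splits a 40-bit address into tag, index, and offset for this 32 KB, 4-way, 32-byte-line cache.

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5   /* 32-byte lines -> 5 offset bits          */
#define INDEX_BITS  8   /* 256 sets      -> 8 index bits           */
                        /* tag = 40 - 8 - 5 = 27 bits              */
int main(void)
{
    /* Example 40-bit byte address (arbitrary). */
    uint64_t addr = 0x12345678ABULL & ((1ULL << 40) - 1);

    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag = 0x%llx, index = %llu, offset = %llu\n",
           (unsigned long long)tag,
           (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```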

3.3 Caching Block/Line

  • Block (line): the unit of storage in the cache
    • Memory is logically divided into cache blocks that map to locations in the cache
    • A block is a fixed-size collection of data, containing the requested word, fetched from main memory and placed into the cache
  • On a reference:
    • HIT: if the block is in the cache, use the cached data instead of accessing memory
    • MISS: if the block is not in the cache, bring it into the cache
      • This may require kicking something else out of the cache (replacement)
  • Some important cache design decisions
    • Placement: where and how to place/find (index) a block in the cache?
    • Replacement: what data do we evict/replace to make room in the cache?
    • Granularity of management: large or small blocks?
    • Write policy: how do we handle writes? Write-back or write-through
    • Instructions/data: do we treat them separately?

3.4 Types of Cache Misses

  • Compulsory miss: occurs on the first access to a memory word; these are the misses even an infinite cache would incur
  • Capacity miss: occurs because the program touches many other words before re-touching the same word; these are the misses of a fully-associative cache of the same capacity
  • Conflict miss: occurs because two words map to the same location in the cache; these are the misses added when going from a fully-associative cache to a direct-mapped cache

Note: will a direct-mapped cache have more misses than a fully-associative cache of the same size? Generally yes, because of the added conflict misses.

3.5 Reduction in Miss Rate

  • Increase the cache block size: larger blocks reduce compulsory misses (misses on the first access to data) and reduce the miss penalty when there is spatial locality (since more related data is brought in at once). However, larger blocks also have downsides:

    • More traffic between cache levels

    • Possible wasted space (when only a small part of the block is actually accessed)

    • Possibly more conflict misses (for a fixed cache size, larger blocks mean fewer blocks, so other data may be displaced)

  • Increase the cache size: a larger cache reduces capacity misses and conflict misses because it can hold more data. However, a larger cache comes with an access-time penalty (bigger caches are usually somewhat slower to access).

  • Higher associativity: increasing the associativity (e.g., from 1-way to 2-way or more) reduces conflict misses, because multiple blocks can live in the same set, avoiding losses due to address conflicts.

    • Rule of thumb: the miss rate of a 2-way set-associative cache of capacity N/2 is comparable to that of a direct-mapped (1-way) cache of capacity N, but higher associativity consumes more energy.

3.6 More Caches Basics

  • The L1 cache is split into an instruction cache and a data cache, while L2 and L3 are unified caches: at the first level, instructions and data are kept in separate caches to improve access efficiency, while at L2 and L3 instructions and data share the same cache space (a unified cache).

  • The L1/L2 hierarchy can be inclusive, exclusive, or non-inclusive

    • Inclusive: the L2 cache contains everything that is in L1, so if a block is in L1 it must also be in L2. This is easy to manage but uses more L2 space.

    • Exclusive: data in L1 is not duplicated in L2; the two caches never store the same block, which makes more efficient use of the total cache capacity.

    • Non-inclusive: there is no strict containment relationship between L1 and L2; a block may be present in either one independently.

  • Write operations can use write-allocate or no-write-allocate

    • Write-allocate: on a write miss, the block is first loaded into the cache and then written.

    • No-write-allocate: on a write miss, the data is written directly to the next cache level or to main memory without being loaded into the cache.

  • Write operations can use write-back or write-through

    • Write-back: a write only updates the data in the cache; memory is updated only when the cache line is evicted, which reduces traffic on the memory bus.

    • Write-through: every write to the cache also updates memory, which simplifies consistency management but increases memory traffic.

  • Reads take priority over writes, and writes are usually buffered: to keep reads fast, read operations get higher priority, while writes are held in a buffer and performed later, so that they do not interfere with reads.

  • L1 accesses the tag and data arrays in parallel; L2/L3 access them serially

    • In the L1 cache, tag and data accesses happen in parallel to minimize access time.

    • In the L2 and L3 caches, tag and data accesses are serial: the tag is checked first, and the data array is read only on a hit, which reduces the complexity (and energy) of the access.

3.7 Tolerating the miss penalty

How to minimize the performance lost to waiting when a cache miss occurs, using a few strategies:

  • Out-of-order execution: while waiting for a cache miss to return, the processor can keep doing other useful work, improving overall efficiency. The processor does not have to stop and wait for the miss; it can continue processing other instructions. Out-of-order execution also lets the processor service multiple cache misses at the same time, maximizing resource utilization.
    • Multiple cache misses: the cache controller must be able to track several outstanding miss requests, which requires a non-blocking cache. A non-blocking cache allows the processor to keep issuing requests while waiting for the data of an earlier miss, instead of blocking all current operations.
  • Hardware prefetching: hardware loads data that is likely to be accessed in the future into a prefetch buffer ahead of time, reducing future waiting. This is especially effective when the program exhibits spatial or temporal locality, so the data that will be needed next can be anticipated.
    • Aggressive prefetching: although prefetching reduces waiting time, prefetching too much increases contention on the bus and hurts other memory accesses. The prefetching strategy therefore has to balance early data loading against competition for bus resources.

3.8 Prefetching

  • Hardware prefetching can be applied to any cache level: Prefetching is a hardware technique used to load data into the cache ahead of time before it is actually needed to minimize wait times. Prefetching can be applied at any cache level (e.g., L1, L2, L3).
  • Prefetching causes cache contamination: If prefetched data is not used immediately, it may take up spots in the cache, causing otherwise more useful data to be replaced, a situation known as cache pollution. To avoid contamination, prefetched data is usually placed in a separate prefetch buffer. In this way, prefetched data is not mixed with data in the main cache, but a parallel lookup of this buffer is required when accessing the cache to ensure that prefetched data is present.
  • Aggressive prefetching increases "coverage" but decreases "accuracy": Aggressive prefetching means prefetching as much data as possible, so that more potential data needs can be covered, known as "increasing coverage". However, this can also result in a lot of prefetched data not actually being used, thus wasting memory bandwidth, which is known as "reduced accuracy".
  • Timing of prefetching is important: The timing of prefetching data needs to be well-timed. Prefetching must be done early enough so that the data is loaded by the time it is needed, thus hiding the latency of memory accesses. However, if prefetching is done too early, the data may be evicted due to cache replacement policies before it can be used, leading to cache pollution or wasted resources.
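
To illustrate the prefetch-buffer idea, here is a toy next-line prefetcher sketch in C (entirely my own construction, with assumed names and sizes): on every demand miss to a block, it requests the following block into a small FIFO prefetch buffer, which is then probed in parallel with the cache as described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PF_ENTRIES  8
#define BLOCK_BYTES 64

/* Small FIFO prefetch buffer holding block addresses (0 = empty slot). */
static uint64_t prefetch_buffer[PF_ENTRIES];
static int pf_next;

static void issue_prefetch(uint64_t block_addr)
{
    prefetch_buffer[pf_next] = block_addr;      /* FIFO replacement */
    pf_next = (pf_next + 1) % PF_ENTRIES;
}

/* Called by the cache controller on a demand miss: exploit spatial
 * locality by requesting the next sequential block ahead of time.  */
static void on_demand_miss(uint64_t miss_addr)
{
    uint64_t next_block = (miss_addr / BLOCK_BYTES + 1) * BLOCK_BYTES;
    issue_prefetch(next_block);
}

/* Probed in parallel with the main cache lookup. */
static bool prefetch_buffer_hit(uint64_t addr)
{
    uint64_t block = addr - addr % BLOCK_BYTES;
    for (int i = 0; i < PF_ENTRIES; i++)
        if (prefetch_buffer[i] != 0 && prefetch_buffer[i] == block)
            return true;
    return false;
}

int main(void)
{
    on_demand_miss(0x1040);                     /* miss in block 0x1040 -> prefetch 0x1080 */
    return prefetch_buffer_hit(0x1088) ? 0 : 1; /* 0x1088 lies in the prefetched block     */
}
```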

4. Cache Abstraction and Structure

4.1 Cache Abstraction and Metrics

Given an address, the tag store is consulted to decide whether the requested address is in the cache; it signals Hit or Miss. On a hit, the data store supplies the data.

  • Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
  • Average memory access time (AMAT) = (hit-rate × hit-latency) + (miss-rate × miss-latency), where the miss latency includes the hit latency (the time spent checking this level before going to the next)
  • Is a smaller AMAT better?
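
As a worked example with assumed numbers (not from the original): suppose the hit rate is 0.9, the hit latency is 4 cycles, and a miss costs the 4-cycle lookup plus a 50-cycle access to the next level (54 cycles total). Then

\[ \text{AMAT} = 0.9 \times 4 + 0.1 \times (4 + 50) = 3.6 + 5.4 = 9\ \text{cycles}, \]

which is the same value the recursive form gives: \(T_1 = t_1 + m_1 T_2 = 4 + 0.1 \times 50 = 9\).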

4.2 Block and Cache Addressing

  • Memory is logically partitioned into fixed-size blocks
  • Each block maps to a location in the cache, determined by the index (set) bits in the address
    • These bits index into both the tag store and the data store
  • Cache access procedure:
  1. Use the index bits of the address to index into the tag store and the data store
  2. Check the valid bit in the tag store
  3. Compare the tag bits of the address with the tag stored in the tag store
  • If the block is in the cache (a cache hit), the stored tag must be valid and must match the tag of the address; a minimal sketch of this lookup follows below.
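
The following is a minimal C sketch of that lookup for the 64-byte, 8-line direct-mapped cache used earlier (only the tag store is modeled; all names are mine and the data store is omitted).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_LINES   8   /* 64-byte cache with 8-byte blocks */
#define OFFSET_BITS 3   /* 8-byte blocks                    */
#define INDEX_BITS  3   /* 8 lines                          */

struct tag_entry {
    bool     valid;
    uint32_t tag;
};

static struct tag_entry tag_store[NUM_LINES];

/* Returns true on a hit; on a miss, installs the new tag
 * (the data-store fill is not modeled here).              */
static bool cache_access(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    if (tag_store[index].valid && tag_store[index].tag == tag)
        return true;                    /* hit: the data store would supply the word */

    tag_store[index].valid = true;      /* miss: fetch the block and overwrite       */
    tag_store[index].tag   = tag;       /* whatever previously occupied this line    */
    return false;
}

int main(void)
{
    /* 0x00 and 0x40 share index 0 but have different tags, so they keep
     * evicting each other: all three accesses miss (prints 0 0 0).      */
    printf("%d %d %d\n", cache_access(0x00), cache_access(0x40), cache_access(0x00));
    return 0;
}
```

The main() above also previews the conflict pattern discussed in the next subsection: two addresses with the same index but different tags can never reside in a direct-mapped cache at the same time.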

4.3 Directly mapped caches: placement and access

  • Assuming byte-addressable memory: 256bytes, 8-byte blocks -> 32 blocks

  • Assumed cache: 64 bytes, 8 blocks

    • Direct mapping: A block can only go to one location.
  • Addresses with the same index contend for the same cache location

    • This results in conflict misses
  • Direct-mapped cache: two memory blocks that map to the same index cannot be in the cache at the same time

    • One index -> one entry
  • Interleaved accesses to multiple blocks that map to the same index can result in a 0% hit rate

    • Suppose addresses A and B have the same index bits but different tag bits
    • A, B, A, B, A, B, A, B, ... -> conflict in the cache index
    • All accesses are conflict misses

4.4 Set Associativity

4.4.1 2-way set associativity

  • Addresses 0 and 8 always conflict in the direct-mapped cache
  • Instead of one column of 8 blocks, use 2 columns of 4 blocks (2 ways, 4 sets)
  • One fewer index bit, one more tag bit
  • Set-associative memory
    • Accommodates conflicts better (fewer conflict misses)
    • More complex, slower access, larger tag store

4.4.2 4-way set associativity

  • Even lower likelihood of conflict misses
  • More tag comparators and wider data multiplexers; larger tag store

4.4.3 Full associativity

  • Fully-associative cache
    • A block can be placed in any cache location

4.4.4 Associativity (and trade-offs)

  • Associativity: how many blocks can map to the same index (set)?

  • Higher associativity

    • Higher hit rate
    • Slower cache access time (hit latency and data access latency)
    • More expensive hardware (more comparators)
  • Diminishing returns from higher associativity

4.5 Innovative Approaches to Reducing Miss Rates

  • Victim cache
  • Better replacement policies - pseudo-LRU, NRU, DRRIP
    • Insertion, promotion, victim selection
    • Prefetching, cache compression

4.5.1 Victim Cache

  • A direct-mapped cache suffers misses when multiple blocks map to the same location
  • The processor often tries to access data it recently discarded
    • All evicted blocks are placed in a small victim cache (4 or 8 entries)
    • The victim cache is checked before going to L2
  • This can be thought of as extra associativity for the few sets that experience the most conflicts; a small sketch follows below
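
A minimal sketch of the idea (my own construction, with assumed sizes and names): a tiny fully-associative, FIFO-managed victim cache that receives blocks evicted from L1 and is probed on an L1 miss before going to L2.

```c
#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4    /* victim caches are tiny: 4 or 8 entries */

struct vc_entry {
    bool     valid;
    uint64_t block_addr;
};

static struct vc_entry vc[VC_ENTRIES];
static int vc_next;     /* FIFO insertion pointer */

/* Called when L1 evicts a block. */
static void vc_insert(uint64_t block_addr)
{
    vc[vc_next].valid      = true;
    vc[vc_next].block_addr = block_addr;
    vc_next = (vc_next + 1) % VC_ENTRIES;
}

/* Called on an L1 miss, before going to L2: every entry is compared,
 * i.e., the victim cache behaves as extra (full) associativity.      */
static bool vc_probe(uint64_t block_addr)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc[i].valid && vc[i].block_addr == block_addr)
            return true;    /* recently evicted block found: swap it back into L1 */
    return false;
}

int main(void)
{
    vc_insert(0x100);               /* L1 evicts block 0x100           */
    return vc_probe(0x100) ? 0 : 1; /* a later miss to 0x100 hits here */
}
```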

4.6 Issues in set-associative caches

  • Think of each block in a set as having a "priority"
    • It indicates how important it is to keep the block in the cache
  • Key question: how do we determine/adjust block priorities?
  • There are three key decisions within a set:
    • Insertion, promotion, eviction (replacement)
  • Insertion: what happens to priorities on a cache fill?
    • Where to insert the incoming block, and whether to insert it at all
  • Promotion: what happens to priorities on a cache hit?
    • Whether and how to change a block's priority
  • Eviction/replacement: what happens to priorities on a cache miss?
    • Which block to evict, and how to adjust the remaining priorities

4.7 Eviction/Replacement Strategies

  • Which block in the set is replaced on a cache miss?
    • Any invalid block first
    • If all blocks are valid, consult the replacement policy
      • Random
      • FIFO
      • Least recently used (LRU) (how to implement?)
      • Not recently used (NRU)
      • Least frequently used
      • Least costly to re-fetch?
      • Why would memory accesses have different cost?
      • Hybrid replacement policies
      • Optimal replacement policy?

5. LRU

  • Idea: evict the least recently accessed block

  • Problem: need to keep track of the order in which blocks were accessed

  • Question: 2-way set-associative cache

    • What is required to implement LRU perfectly? 1 bit per set to record the access order
  • Question: 4-way set-associative cache

    • What is required to implement LRU perfectly?
    • How many different orderings are possible for the 4 blocks in the set? \(4\times 3\times 2\times 1 = 4! = 24\)
    • How many bits are needed to encode the LRU order of the blocks? At least \(\lceil \log_2 24 \rceil = 5\) bits
    • What logic is needed to determine the LRU victim?
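
One way to keep exact LRU state for a 4-way set is an age counter per way (2 bits each, 8 bits per set in total, which satisfies the \(\geq 5\)-bit minimum above). The sketch below is my own illustration, not the lecture's circuit.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 4

/* age 0 = most recently used, age WAYS-1 = least recently used */
static uint8_t age[WAYS] = { 0, 1, 2, 3 };

/* On an access to `way`, make it the MRU: every way that was
 * more recent than it ages by one, and its own age becomes 0.  */
static void touch(int way)
{
    uint8_t old = age[way];
    for (int w = 0; w < WAYS; w++)
        if (age[w] < old)
            age[w]++;
    age[way] = 0;
}

/* The LRU victim is the way with the largest age. */
static int lru_victim(void)
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (age[w] > age[victim])
            victim = w;
    return victim;
}

int main(void)
{
    touch(2); touch(0); touch(3);                  /* access ways 2, 0, 3      */
    printf("LRU victim = way %d\n", lru_victim()); /* way 1 was never touched  */
    return 0;
}
```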

5.1 Approximate LRU

  • Most modern processors do not implement "true LRU" (also known as "perfect LRU") in highly associative caches.

    • Because true LRU is complex
    • LRU is itself only an approximation that exploits predicted locality (i.e., it is not the optimal cache management policy anyway)
  • Example:

    • NRU, not recently used
    • DRRIP

5.2 Eviction/Replacement Strategies

  • NRU: each block in the set has one bit; when a block is accessed, its bit is cleared to 0; if all bits in the set become 0, they are all set back to 1; a block whose bit is 1 is a candidate for eviction (a minimal sketch follows below)

  • DRRIP: use multiple NRU-style bits (3 bits) per block and insert incoming blocks with a high value so that they are close to eviction; this is similar to inserting incoming blocks near the LRU end of the list instead of the MRU end
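
Following the NRU description above literally, a minimal sketch (my own, with an assumed 8-way set) looks like this; real implementations differ in details such as whether the just-accessed block keeps its bit cleared after a reset.

```c
#include <stdint.h>

#define WAYS 8

/* One bit per block in the set: 1 = eviction candidate, 0 = recently used. */
static uint8_t nru_bit[WAYS] = { 1, 1, 1, 1, 1, 1, 1, 1 };

/* On an access: clear the block's bit; if that makes all bits 0,
 * set them all back to 1, as described above.                     */
static void nru_touch(int way)
{
    nru_bit[way] = 0;

    for (int w = 0; w < WAYS; w++)
        if (nru_bit[w])
            return;             /* at least one candidate remains    */

    for (int w = 0; w < WAYS; w++)
        nru_bit[w] = 1;         /* everyone was recently used: reset */
}

/* On a miss: evict some block whose bit is 1. */
static int nru_victim(void)
{
    for (int w = 0; w < WAYS; w++)
        if (nru_bit[w])
            return w;
    return 0;                   /* unreachable given the reset above */
}

int main(void)
{
    nru_touch(3);               /* block 3 becomes "recently used"      */
    return nru_victim();        /* returns 0: first remaining candidate */
}
```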

6. Cache Policies

6.1 Tag Store

Each tag store entry can contain the following fields:

  • Valid bit
  • Tag
  • Replacement policy bits
  • Dirty bit
    • Indicates whether the block has been modified relative to memory (needed with a write-back policy)

6.2 Handling Write Operations

6.2.1 Write-back mode vs. write-through mode

  • When do we write the modified data in a cache to the next level?

    • Write-through: on every write
    • Write-back: when the block is evicted
  • Write-back: updating the cache does not immediately update memory

    • Multiple writes to the same block can be merged before eviction
      • Potentially saves bandwidth between cache levels and reduces energy
    • Needs an extra bit in the tag store indicating the block is "dirty/modified"
  • Write-through: when the CPU writes to the cache, it also writes to memory

    • Simpler
    • All cache levels are up to date. Consistency is simpler: there is no need to check the tag stores of caches closer to the processor for the presence of a block
    • Uses more bandwidth, since writes are not combined

6.2.2 Allocate on write miss vs. No-allocate on write miss

  • Do we allocate a cache block on a write miss (possibly evicting another block to make room)?
    • Allocate on write miss: yes
    • No-allocate on write miss: no
  • Allocate on write miss: the block containing the written address is first read from memory into the cache, and then the cache is written (a sketch combining this with write-back follows below)
    • Writes can be merged instead of sending each written word individually to the next level
    • Simpler, because write misses are handled the same way as read misses
    • Requires transferring the entire cache block
  • No-allocate: the written data is sent directly to the next level/memory without bringing the block into the cache
    • Conserves cache space if written blocks have little locality (which may improve the hit rate)
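
A minimal sketch combining write-allocate with write-back for a single line, assuming hypothetical helper names (fetch_block, writeback_block) that stand in for the next level of the hierarchy.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct line {
    bool     valid;
    bool     dirty;      /* the "modified" bit kept in the tag store */
    uint32_t tag;
    uint8_t  data[64];
};

/* Stand-ins for the interface to the next level / memory. */
static void fetch_block(struct line *l, uint32_t tag)
{
    l->tag = tag; l->valid = true; l->dirty = false;
}
static void writeback_block(const struct line *l)
{
    printf("writing back dirty block, tag 0x%x\n", l->tag);
}

static void handle_write(struct line *l, uint32_t tag, unsigned offset, uint8_t value)
{
    if (!l->valid || l->tag != tag) {   /* write miss                          */
        if (l->valid && l->dirty)
            writeback_block(l);         /* write-back: flush only on eviction  */
        fetch_block(l, tag);            /* write-allocate: bring the block in  */
    }
    l->data[offset] = value;            /* update only the cached copy ...     */
    l->dirty = true;                    /* ... and mark memory as stale        */
}

int main(void)
{
    struct line l = { 0 };
    handle_write(&l, 0x1, 0, 42);       /* miss -> allocate, then write        */
    handle_write(&l, 0x1, 1, 43);       /* hit  -> writes merge in the cache   */
    handle_write(&l, 0x2, 0, 44);       /* new tag -> dirty block written back */
    return 0;
}
```

Swapping the policy to write-through + no-allocate would instead forward every store to the next level and skip the fetch_block call.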

6.3 Subblocked/Sectored Cache

  • What if the processor writes an entire block over a short period of time? Is it really necessary to first bring the block from memory into the cache?

  • Why can't we write only part of a block, i.e., a subblock?

    • e.g., write 4 bytes of a 64-byte block
    • Problem: the valid and dirty bits are associated with the whole 64 bytes, not with each 4-byte piece
  • Idea: divide a block into subblocks (or sectors)

    • Each subblock (sector) has its own valid and dirty bits (see the sketch below)
    • Only one subblock (or a subset of subblocks) is allocated on a request

  • Pros:

    • No need to transfer the entire block into the cache (a write only needs to validate and update one subblock)
    • More freedom in transferring subblocks into the cache (the whole block need not be in the cache; how many subblocks should be brought in on a read?)
  • Cons:

    • More complex design; more valid and dirty bits
    • May not fully exploit spatial locality
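
A minimal data-structure sketch of such a sectored line (sizes and names are mine): one tag for a 64-byte block, with separate valid and dirty bits for each of four 16-byte subblocks.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SUBBLOCKS      4
#define SUBBLOCK_BYTES 16

struct sectored_line {
    uint32_t tag;
    bool     valid[SUBBLOCKS];   /* which subblocks actually hold data   */
    bool     dirty[SUBBLOCKS];   /* which subblocks must be written back */
    uint8_t  data[SUBBLOCKS][SUBBLOCK_BYTES];
};

/* A write that covers a whole subblock just validates and dirties it;
 * the rest of the 64-byte block never has to be fetched from memory.  */
static void write_full_subblock(struct sectored_line *l, unsigned sb,
                                const uint8_t src[SUBBLOCK_BYTES])
{
    memcpy(l->data[sb], src, SUBBLOCK_BYTES);
    l->valid[sb] = true;
    l->dirty[sb] = true;
}

int main(void)
{
    struct sectored_line line = { 0 };
    uint8_t payload[SUBBLOCK_BYTES] = { 0 };
    write_full_subblock(&line, 2, payload);   /* only subblock 2 becomes valid+dirty */
    return 0;
}
```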

6.4 Instruction Cache vs. Data Cache

  • Separate or Unified Management

  • Pros and cons of Unified:

    • Dynamically shared cache space: no over-provisioning that can happen with static partitioning (i.e., separate instruction and data caches)

    • Instructions and data can evict each other (i.e., there may not be enough room for both)

    • Instructions and data are accessed at different places in the pipeline. Where should a unified cache be placed for fast access? This is the main reason most processors use separate first-level caches.

  • The first-level cache is almost always split

    • Mainly because "instructions and data are accessed at different places in the pipeline; where would we place a unified cache for fast access?"
  • Higher-level caches are almost always unified

    • Mainly because of "dynamically shared cache space: no over-provisioning that can happen with static partitioning (i.e., separate instruction and data caches)"

6.5 Multi-level caches in pipelined designs

  • First-level cache (L1, instruction and data):
    • Design decisions are heavily influenced by the cycle time
    • Small, lower associativity; latency is critical
    • Tag store and data store are accessed in parallel
    • The instruction and data caches are also accessed in parallel
  • Second-level cache (L2):
    • Design decisions need to balance hit rate and access latency
    • Usually large capacity and higher associativity; latency is less important
    • Tag store and data store are accessed serially
      • Latency matters less; hit rate and energy efficiency matter more
  • Serial vs. parallel access between cache levels:
    • Serial: the L2 cache is accessed only on an L1 miss
    • The access stream seen by the second level differs from that seen by the first level
      • The first level acts as a filter (filtering out some temporal and spatial locality)
      • This can motivate different management policies


7. Improving cache performance

Relationship between cache parameters and miss/hit ratio

  • Cache size
  • Block size
  • Associativity
  • Eviction (replacement) policy
  • Insertion policy

7.1 Cache Size

  • Cache size: total capacity of data (not including the tags)

    • The larger it is, the more temporal locality can be exploited
    • But bigger is not always better
  • Too large a cache can adversely affect hit and miss latency

    • Smaller is faster => larger is slower
    • The access time may degrade the critical path of the pipeline
  • Too small a cache

    • Does not exploit temporal locality well
    • Useful data is replaced too frequently
  • Working set point: the point at which the entire set of data referenced by the application fits in the cache

7.2 Block size

  • Block size: the amount of data associated with one address tag

    • Not necessarily the unit of transfer between levels
      • Sub-blocking: a block divided into multiple pieces (each with its own valid/dirty bits)
  • Too small a block

    • Does not exploit spatial locality well
    • Higher tag overhead
  • Too large a block

    • Too few total blocks
    • Exploits less temporal locality
    • Wastes cache space and bandwidth/energy if there is little spatial locality

7.2.1 Large blocks: critical-word-first and subblocking

  • A large cache block may take a long time to fill

    • Fill the critical word of the cache line first
    • Restart the cache access before the fill completes
  • A large cache block can waste bus bandwidth

    • Divide the block into subblocks
    • Keep separate valid and dirty bits per subblock

7.3 Associativity

How many blocks can map to the same index (i.e., set)?

  • Higher associativity
    • Lower miss rate (fewer conflicts)
    • Higher hit latency and area cost
  • Lower associativity
    • Lower cost
    • Lower hit latency
    • Especially important for L1 caches

7.4 Classification of Cache Misses

  • Compulsory misses
    • The first reference to an address (block) always misses
    • Subsequent references should hit, unless the block has been evicted in the meantime
    • Solution: prefetching: predict which blocks will be needed soon
  • Capacity misses
    • The cache is too small to hold everything that is needed
    • Defined as the misses that would occur even in a fully-associative cache of the same capacity (with optimal replacement)
    • Solutions:
      • Use the cache space better: keep the blocks that will be referenced
      • Software management: divide the working set and the computation so that each "computation phase" fits in the cache
  • Conflict misses
    • Defined as any miss that is neither a compulsory miss nor a capacity miss
    • Solutions:
      • More associativity
      • Other ways of getting the effect of associativity without making the cache fully associative
        • Victim cache
        • Better, randomized indexing
        • Software hints?

7.5 How to Improve Cache Performance

  • Reduce the miss rate
    • More associativity
    • Alternative/enhanced ways of achieving associativity
    • Victim caching, hashing, pseudo-associativity, skewed associativity
    • Better replacement/insertion policies
    • Software approaches
  • Reduce the miss latency or miss cost
    • Multi-level caches
    • Critical word first
    • Subblocking/sectors
    • Better replacement/insertion policies
    • Non-blocking caches (multiple misses serviced in parallel)
    • Multiple accesses per cycle
    • Software approaches
  • Reduce the hit latency or hit cost