Paper Discussing

From SubfireWiki


Use-Based Register Caching with Decoupled Indexing

  1. What is the problem that is addressed?
    • Deep pipelines need more physical registers, which means a larger register file that takes longer to access, while we still want to maintain high clock frequencies. Register caching addresses this, but existing designs suffer from poor insertion and replacement decisions and from the need for a fully associative cache. With conventional replacement such as LRU, values kept in the cache may never be used again, while entries that will be needed end up being replaced. To solve these problems the authors propose a new use-based replacement policy and decoupled indexing to reduce conflicts.
  2. Briefly describe the solution proposed
    • Use-based register cache: The problem above can be solved by using the knowledge we have about the dependencies on each entry. If a register entry is to be used by a large number of instructions in the future, we should keep it in the cache. If the register in question is an output register of an in-flight instruction, it need not be cached because its value will be forwarded to all waiting instructions on completion.
      • Reduces insertion pollution: by not caching a register value when all of its predicted consumers are satisfied by the bypass network.
      • Replacement policy: selects register cache entries with the fewest remaining uses to lower the miss rate.
    • Decoupled Indexing: The point here is to use something other than the address of the physical register to decide which cache set a register should be in. One of a few different policies can be used: Round Robin, Minimum, and Filtered Round Robin. The benefit is that a set-associative cache will be filled more evenly than when indexing with the bits of the address. Thus, more of the physical registers that are currently in use can be in the cache at the same time.
      • Round robin simply puts the register in the next set, without looking at set usage
      • Minimum keeps track of how many registers are in use in each set and the register is put in the set with the smallest usage
      • Filtered round robin keeps track of the usage, but uses round robin on the sets that are below a threshold.
      • Benefit: Reducing register cache conflicts.
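The three set-selection policies above can be sketched as follows. This is a minimal illustration, not the paper's hardware: the class name `SetAllocator`, the `threshold` value, and the fallback to the least-used set when every set is at or above the threshold are assumptions made for the example.

```python
class SetAllocator:
    """Picks a register-cache set for each newly allocated physical register."""

    def __init__(self, num_sets, policy, threshold=2):
        self.num_sets = num_sets
        self.policy = policy
        self.threshold = threshold    # hypothetical cutoff for filtered round robin
        self.usage = [0] * num_sets   # live registers currently mapped to each set
        self.next_set = 0             # round-robin pointer

    def _advance(self):
        chosen = self.next_set
        self.next_set = (self.next_set + 1) % self.num_sets
        return chosen

    def _least_used(self):
        return min(range(self.num_sets), key=lambda s: self.usage[s])

    def allocate(self):
        if self.policy == "round_robin":
            chosen = self._advance()              # ignores set usage entirely
        elif self.policy == "minimum":
            chosen = self._least_used()           # set with the smallest usage
        elif self.policy == "filtered_round_robin":
            chosen = None
            for _ in range(self.num_sets):        # round robin over sets below threshold
                candidate = self._advance()
                if self.usage[candidate] < self.threshold:
                    chosen = candidate
                    break
            if chosen is None:                    # assumption: fall back to least-used set
                chosen = self._least_used()
        else:
            raise ValueError(f"unknown policy: {self.policy}")
        self.usage[chosen] += 1
        return chosen

    def free(self, set_index):
        """Called when the register mapped to set_index is released."""
        self.usage[set_index] -= 1


alloc = SetAllocator(num_sets=4, policy="filtered_round_robin", threshold=1)
alloc.usage = [1, 0, 1, 0]                        # sets 0 and 2 are already busy
print([alloc.allocate(), alloc.allocate()])       # skips the busy sets
```

In this sketch, plain round robin would have placed the first register in the already busy set 0, while the filtered variant steers it to a set that is still under the threshold.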
  3. What are the main limitations?
    • (1) Size of the predictor: accessing a large predictor might itself take two cycles.
    • (2) The full identifier for each value must be stored, which costs more transistors.
    • (3) Wire delays? complexity.
    • (4) Misprediction of use. Too low, and you evict something you need. Too high, and you keep a value that you won't use. Either way, misses increase.
    • (5) better allocation policies for decoupling? (theoretical maximum).
    • (6) two accesses to predictor each time, one to read and one to update count.
    • (7) If we over-predict the use counter for a value in the cache and the value is not used again, its counter never changes and the entry is not freed until retirement. This hurts performance and increases misses.
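Limitations (4) and (7) can be seen in a small model of the use-based policy. This is a simplified sketch under stated assumptions: the class `UseBasedRegisterCache`, its fully associative organization, and the way predicted uses are passed in are all hypothetical and only illustrate "skip insertion when the bypass covers all predicted uses" and "evict the entry with the fewest remaining uses".

```python
class UseBasedRegisterCache:
    """Toy register cache that tracks predicted remaining uses per entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.remaining = {}   # physical register -> predicted remaining uses

    def insert(self, preg, predicted_uses, uses_bypassed):
        remaining = predicted_uses - uses_bypassed
        if remaining <= 0:
            return False      # all predicted consumers were served by the bypass network
        if preg not in self.remaining and len(self.remaining) >= self.capacity:
            # Replacement: evict the entry with the fewest remaining uses.
            victim = min(self.remaining, key=self.remaining.get)
            del self.remaining[victim]
        self.remaining[preg] = remaining
        return True

    def read(self, preg):
        if preg not in self.remaining:
            return False      # miss: the value must come from the full register file
        self.remaining[preg] = max(0, self.remaining[preg] - 1)
        return True

    def release(self, preg):
        """Called when the physical register is freed at retirement."""
        self.remaining.pop(preg, None)


cache = UseBasedRegisterCache(capacity=2)
cache.insert("p1", predicted_uses=3, uses_bypassed=1)   # 2 real uses still to come
cache.insert("p2", predicted_uses=1, uses_bypassed=1)   # not cached at all
cache.insert("p3", predicted_uses=5, uses_bypassed=0)   # over-predicted, never read
cache.insert("p4", predicted_uses=1, uses_bypassed=0)   # forces a replacement
print(cache.read("p1"))   # p1 was evicted in favor of the dead value p3
```

Here the over-predicted entry `p3` keeps its high count because nothing ever reads it, so the useful entry `p1` is evicted instead and its next read misses.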

Silent Stores for Free

  1. What is the problem that is addressed?
    • Silent stores are stores where the value being replaced is the same as the replacement value.
    • Basic algorithm: read location, compare, then store if not equal.
      • Explicitly converting all stores to loads increases pressure on the available cache ports in the system and can potentially delay the issue of loads which are likely on the critical path.
      • Having a single instruction perform multiple data cache accesses (and potentially cause many data cache misses) will increase scheduler and control logic complexity.
      • Performing more cache accesses (an additional read for each non-silent store) can increase power consumption.
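The basic read, compare, then store sequence and its cost in extra cache accesses can be sketched as below. The `CountingMemory` class and `squashing_store` function are hypothetical names for this illustration, and unwritten locations are assumed to read as 0.

```python
class CountingMemory:
    """Toy memory that counts how many reads and writes reach it."""

    def __init__(self):
        self.cells = {}
        self.reads = 0
        self.writes = 0

    def load(self, addr):
        self.reads += 1
        return self.cells.get(addr, 0)   # assumption: untouched memory reads as 0

    def store(self, addr, value):
        self.writes += 1
        self.cells[addr] = value


def squashing_store(mem, addr, value):
    """Read the location, compare, and store only if the value differs.

    Returns True when the store was silent and therefore squashed.
    """
    if mem.load(addr) == value:
        return True
    mem.store(addr, value)
    return False


mem = CountingMemory()
stores = [(0x10, 5), (0x10, 5), (0x10, 7), (0x14, 0)]
squashed = sum(squashing_store(mem, addr, value) for addr, value in stores)
print(squashed, mem.reads, mem.writes)
```

Every store pays for a read in this scheme, including the non-silent ones, which is the extra port pressure and power cost that the bullets above describe.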
  2. Briefly describe the solution proposed
    • Use ECC, Read port stealing, LSQ
      • ECC: Change the ECC logic of a cache with 64-bit ECC words so that when 32 bits are written, they are compared with the old 32 bits that are read anyway to recompute the ECC for the whole 64-bit location. The result of this comparison is used to cancel the write.
        • This is a free change because ECC needs to compute the ECC for the entire 64 bit memory location when writing a regular sized 32 bit word.
      • Read port stealing: While a store is waiting in the LSQ, wait until a read port on the D-cache becomes free, then perform the read/compare without disrupting any other operations. Note that there is a problem with WAW dependencies here that must be handled somehow. Suppose A is at the memory location, instruction I changes it to B, and a later instruction J changes it back to A. If J is verified before I commits, J sees A and is dropped as a silent store, so after J's turn to commit the memory location holds B instead of A.
      • LSQ:
        • Handle WAW dependencies by eliminating later writes to the same location of the same data from the queue.
        • Handle WAR dependencies: when the earlier read occurs, use its result to compare with the pending write to see if it's going to be a silent store.
        • When you're reading a cache line, compare every entry on the line with any waiting write.
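The WAW problem with early store verifies, and one way of handling it in the LSQ, can be illustrated with a short model. This is a sketch under assumptions: the function `run_stores` and its `check_lsq` flag are hypothetical, all verifies are assumed to happen before any store commits, and the fix shown (compare against the youngest older queued store to the same address rather than against memory) is one plausible reading of the LSQ handling described above.

```python
def run_stores(memory, stores, check_lsq):
    """Verify every queued store early, then commit the non-silent ones in order.

    memory: dict mapping address -> value
    stores: list of (address, value) pairs in program order
    check_lsq: if True, an older queued store to the same address overrides
               the value read from memory when deciding whether a store is silent
    """
    memory = dict(memory)
    verdicts = []
    for i, (addr, value) in enumerate(stores):
        expected = memory.get(addr)            # value seen through the stolen read port
        if check_lsq:
            for older_addr, older_value in reversed(stores[:i]):
                if older_addr == addr:         # youngest older store to this address
                    expected = older_value
                    break
        verdicts.append(expected == value)     # True means "silent, squash it"
    for (addr, value), silent in zip(stores, verdicts):
        if not silent:
            memory[addr] = value               # commit in program order
    return memory


initial = {0x40: "A"}
program = [(0x40, "B"), (0x40, "A")]           # instruction I, then instruction J
print(run_stores(initial, program, check_lsq=False))   # J wrongly squashed
print(run_stores(initial, program, check_lsq=True))    # J kept, final value is A
```

With the queue check enabled, a later store of the same data to the same location as an older queued store is still recognized as silent, while a store that merely matches stale memory is not.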
  3. Why is a write more expensive than a read?
    • Detecting and squashing silent stores can have a number of beneficial effects: reducing the pressure on cache write ports, reducing the pressure on store queues or other microarchitectural structures that are used to track pending writes, reducing the need for store forwarding to dependent loads, and reducing both address and data bus traffic outside the processor chip. Many of these benefits are examined and quantified in [14].
  4. What are the main limitations?
    • Standard: However, there is also a complexity and microarchitectural resource utilization cost associated with detecting silent stores. Namely, to detect the fact that a store is silent, the prior value must first be read out from the memory location, compared to the new value, and then conditionally overwritten in a process called store squashing. The simple store squashing approach outlined in [14] simply issues each store instruction twice: first as a read followed by a compare, and later as a store if it is not silent. Though beneficial overall, it is clear that such a simplistic approach places additional pressure on cache ports, particularly when running programs with few silent stores. First, explicitly converting all stores to loads increases pressure on the available cache ports in the system and can potentially delay the issue of loads which are likely on the critical path. Second, having a single instruction perform multiple data cache accesses (and potentially cause many data cache misses) will increase scheduler and control logic complexity. Finally, performing more cache accesses (an additional read for each non-silent store) can increase power consumption. Therefore, we would like to find more efficient ways of squashing silent stores.
    • ECC: In comparison to standard store verifies (Section 2.1), we can see that store verifies carried out in ECC logic require no explicit load operation, but rather can simply be performed at commit, as illustrated in Figure 2. The drawbacks of this approach are that a store is squashed relatively late in the pipeline (at commit instead of during the execute stage), so it may not reduce pressure on write buffers; it cannot be removed early from the LSQ; and finally that it cannot capture ECC-word-aligned stores. Relative to standard store verifies, this method has the benefit of not delaying execution of load operations due to resource conflicts.
    • Read port stealing: However, it can create additional instruction scheduling difficulties because the policy for issuing a store verify is dependent on resource usage and not just program order or another static scheduling policy.
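The ECC limitation, that only stores narrower than the ECC word get a free comparison, can be sketched as follows. This is an illustrative model only: the class `EccCache` is hypothetical, the per-byte parity in `check_bits` is a stand-in for a real error-correcting code, and unwritten words are assumed to be 0.

```python
def check_bits(word):
    """Stand-in for real ECC: one parity bit per byte of a 64-bit word."""
    return tuple(bin((word >> (8 * i)) & 0xFF).count("1") & 1 for i in range(8))


class EccCache:
    """Toy cache whose ECC is computed over 64-bit words."""

    def __init__(self):
        self.words = {}         # 8-byte-aligned address -> (word, check bits)
        self.array_writes = 0   # writes that actually reach the data array

    def store32(self, addr, value):
        """Sub-word store: the old word is read anyway to recompute the ECC."""
        base = addr & ~7
        shift = 32 if addr & 4 else 0
        old_word, _ = self.words.get(base, (0, check_bits(0)))
        if (old_word >> shift) & 0xFFFFFFFF == value:
            return True         # silent: cancel the write at no extra read cost
        new_word = (old_word & ~(0xFFFFFFFF << shift)) | (value << shift)
        self.words[base] = (new_word, check_bits(new_word))
        self.array_writes += 1
        return False

    def store64(self, addr, value):
        """Full ECC-word store: no read is needed, so nothing is compared."""
        self.words[addr & ~7] = (value, check_bits(value))
        self.array_writes += 1
        return False


cache = EccCache()
print(cache.store32(0x100, 7))   # first write of 7 is not silent
print(cache.store32(0x100, 7))   # same value again is detected as silent
print(cache.store64(0x100, 7))   # same 64-bit value, but silence goes undetected
```

The last call rewrites an identical value yet still reaches the array, which mirrors the drawback noted above for ECC-word-aligned stores.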