Architecture: Spring '05
From SubfireWiki
Revision as of 20:00, 13 March 2006 by MichaelHead
Question 1
Question 2
Question 3 - Wire delay vs. Logic delay
- How do longer wire delays impact overall processor performance?
- The limiting factor in processor frequency is becoming wire delay. The clock period can no longer be shortened because the wires connecting components are causing the delay. This slows any access to data, particularly in caches, which may be far from the rest of the processor.
- Large FPU: can lead to 2-3 cycle delays between components
- CPI increases when the clock frequency is increased, because a fixed wire delay then spans more cycles
- Forwarding and execution may not be able to occur in back-to-back cycles
- What are some solutions that can be used to mitigate the performance impact of wire delays?
- Maximize cache hit ratios and branch prediction accuracy. Send more data in parallel (more read ports).
- "Pipeline wires" so that components at the ends of wires that take, say, 2 cycles can operate on other data while waiting for the new input.
- Have lots of local buffers and caches at each processor component instead of one large buffer.
- Limit register file reads to issue stage so that the RF's read ports can be connected to just one component, making the wires connecting them shorter.
- Add more pipeline stages
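The effect of clock frequency on wire latency, and the benefit of "pipelining wires", can be illustrated with a small sketch. The function names and the 600 ps wire delay used in the usage note are hypothetical, chosen only for illustration; this is a back-of-the-envelope model, not a description of any particular processor.

```python
import math

def wire_cycles(wire_delay_ps, clock_ghz):
    """Whole clock cycles a signal needs to cross a wire (at least one).

    Assumption: the wire delay is fixed in picoseconds, so raising the
    clock frequency makes the same wire cost more cycles.
    """
    period_ps = 1000.0 / clock_ghz
    return max(1, math.ceil(wire_delay_ps / period_ps))

def transfer_cycles(n_values, latency, pipelined):
    """Cycles to move n_values across a wire with the given cycle latency.

    Unpipelined: each value waits for the previous one to arrive.
    Pipelined: a new value enters the wire every cycle.
    """
    if n_values == 0:
        return 0
    if pipelined:
        return latency + (n_values - 1)
    return n_values * latency
```

Under these assumptions, a 600 ps wire costs `wire_cycles(600, 1)` = 1 cycle at 1 GHz but `wire_cycles(600, 4)` = 3 cycles at 4 GHz. Sending 8 values over a 2-cycle wire takes 16 cycles unpipelined and 9 cycles pipelined.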
Question 4 - Branch Predictors
- A slower predictor takes longer to make its prediction. This can introduce bubbles into the pipeline, because the fetch stage doesn't know what to fetch while it waits for the predictor. Could show an example of pipeline stages demonstrating that the FETCH stage must halt because a branch destination address has not yet been predicted.
- It's particularly bad for highly branched code with wide input (5-10 instructions between branches). Having multiple branches fetched in the same batch makes the problem even worse.
- The two could be combined so that we can quickly get a prediction from the faster predictor and fetch from the predicted address. A cycle or two later, we get the result of the slower predictor and fix up the fetched instructions based on the better prediction. (Discuss what it means to "fix up" the instructions, given time.)
- Use a bimodal predictor and, in parallel, a gshare/history-based predictor. On cycle 3, if the two disagree, roll back the pipeline.
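The combined scheme can be sketched as a toy simulation: a bimodal table gives the fast prediction, a gshare table gives the slower one, and a disagreement costs a few bubble cycles. Everything here is an assumption made for illustration: the class and function names, the 10-bit table size, the 2-bit counters starting at "weakly not taken", and the 2-cycle override penalty.

```python
class TwoBitTable:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 taken)."""

    def __init__(self, bits):
        self.mask = (1 << bits) - 1
        self.counters = [1] * (1 << bits)  # assumed start: weakly not taken

    def predict(self, index):
        return self.counters[index & self.mask] >= 2

    def update(self, index, taken):
        i = index & self.mask
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)


def simulate(trace, bits=10, override_penalty=2):
    """Run (pc, taken) pairs through a fast bimodal and a slow gshare predictor.

    The fast prediction steers fetch immediately. When the slow prediction
    arrives and disagrees, it overrides the fast one and we charge
    override_penalty bubble cycles (a hypothetical cost).
    """
    bimodal = TwoBitTable(bits)
    gshare = TwoBitTable(bits)
    history = 0  # global branch history register
    stats = {"fast_wrong": 0, "final_wrong": 0, "overrides": 0, "bubble_cycles": 0}
    for pc, taken in trace:
        fast = bimodal.predict(pc)
        g_index = pc ^ history  # gshare: PC XOR global history
        slow = gshare.predict(g_index)
        if slow != fast:
            stats["overrides"] += 1
            stats["bubble_cycles"] += override_penalty
        if fast != taken:
            stats["fast_wrong"] += 1
        if slow != taken:  # the slow predictor has the final say
            stats["final_wrong"] += 1
        bimodal.update(pc, taken)
        gshare.update(g_index, taken)
        history = ((history << 1) | int(taken)) & gshare.mask
    return stats
```

On a branch that strictly alternates taken and not taken, for example `simulate([(0x40, i % 2 == 0) for i in range(200)])`, the bimodal counter never settles while gshare learns the pattern once its history fills, so the override pays off despite the bubbles.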
Question 5 - Resolving the DRAM/processor clock gap
(368-372)
- Instead of adding more levels of cache, improve the cache mechanism used:
- Instead of a lookaside cache, use an in-line or backside cache.
- Also put the cache controller on the chip while leaving the cache itself off chip.
- See the solutions on 373 (write combining, streaming stores (needs ISA improvement, which requires compiler and architectural changes), explicit prefetching)
- Backside cache - off chip cache now on separate bus
- Pros
- Fewer conflicts with main memory bus, can be clocked higher than regular system bus
- Cons
- Misses are more complicated to handle
- Inline cache - CPU has point-to-point access with cache
- Pros
- Cache controller has access to the memory and a high-speed connection to the CPU. The CPU is easier to implement, since all memory access is done in one place.
- Cons
- Expensive
- Wave pipeline - logic delays are expected and used
- Pros
- High speed clock, no need for extra latches
- Cons
- ...
- Streaming Stores - don't bring in cache lines that are write-only
- Pros
- No need to waste cache space, no need to find a cache victim
- Cons
- Requires ISA and compiler support.
- Write Combining - batch up multiple writes
- Pros
- Fewer writes to main memory.
- Cons
- Requires ISA changes and Streaming Stores. Could be an extra step, requires an extra buffer (may be considered a cache for the purposes of the question)
- Multithreading
- Space out loads
- Load Prediction
- For prefetch
- VLIW
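Write combining, from the list above, can be sketched as a buffer that collects stores to the same line and sends them to memory as one bus transaction. This is a minimal model under stated assumptions: a single-entry buffer, a 64-byte line, byte-sized stores, and hypothetical names (`WriteCombiningBuffer`, `bus_writes`). Real hardware typically has several such buffers and its own flush rules.

```python
LINE_SIZE = 64  # assumed line size in bytes


class WriteCombiningBuffer:
    """Single-entry write-combining buffer for byte stores (illustrative only)."""

    def __init__(self, line_size=LINE_SIZE):
        self.line_size = line_size
        self.line_addr = None  # base address of the line being combined
        self.pending = {}      # offset within the line -> byte value
        self.bus_writes = []   # one entry per transaction sent to memory

    def store(self, addr, value):
        line = addr - addr % self.line_size
        # A store to a different line evicts whatever was being combined.
        if self.line_addr is not None and line != self.line_addr:
            self.flush()
        self.line_addr = line
        self.pending[addr - line] = value
        # A completely written line goes out as one full-line transaction.
        if len(self.pending) == self.line_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.bus_writes.append((self.line_addr, dict(self.pending)))
        self.line_addr = None
        self.pending = {}
```

Writing 128 consecutive bytes through this buffer produces two bus transactions instead of 128 separate writes. The same model also shows the cost noted above: the combining step needs an extra buffer, and partially filled lines must still be flushed explicitly.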
Question 6 - Simultaneous Multithreading
- Divide up independent streams of code and execute in parallel
- Talk about dependencies
- Use some extra threads for run-ahead or extra OS threads
- Hardware helper threads, which do auxiliary work on behalf of the main thread, can run separately
- Compiler help with making threads
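The point about dependencies can be made concrete with a toy issue model. Each thread here is a fully dependent chain of instructions, so on its own it leaves issue slots idle while waiting for results. Running several independent chains at once lets them fill each other's gaps. The function name, the fixed thread priority, and the latencies are all assumptions for illustration.

```python
def smt_cycles(threads, width):
    """Cycles until every thread's dependent chain has completed.

    threads: list of chains; each chain is a list of instruction latencies,
             and instruction i may issue only after instruction i-1 finishes.
    width:   issue slots per cycle, shared by all threads (lower-numbered
             threads get priority, a simplification).
    """
    next_idx = [0] * len(threads)  # next instruction to issue, per thread
    ready_at = [0] * len(threads)  # cycle when each thread's last result is ready
    cycle = 0
    while any(next_idx[t] < len(chain) for t, chain in enumerate(threads)):
        slots = width
        for t, chain in enumerate(threads):
            if slots == 0:
                break
            if next_idx[t] < len(chain) and ready_at[t] <= cycle:
                ready_at[t] = cycle + chain[next_idx[t]]
                next_idx[t] += 1
                slots -= 1
        cycle += 1
    return max(ready_at, default=0)
```

With a chain of four latency-3 instructions, one thread alone takes 12 cycles, so two such threads run back to back take 24. Sharing a single issue slot, `smt_cycles([chain, chain], 1)` finishes both in 13 cycles, because one thread issues while the other waits on its dependency.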
