John L. Hennessy, David A. Patterson1558605967, 9781558605961, 9780080502526
Table of contents :
And now for something completely different…….Page 1
FIGURE 1.1 Growth in microprocessor performance since the mid 1980s has been substantially highe………Page 4
Servers……Page 6
Embedded Computers……Page 7
The Task of a Computer Designer……Page 9
FIGURE 1.3 A summary of the three computing classes and their system characteristics. The total ………Page 10
FIGURE 1.4 Summary of some of the most important functional requirements an architect faces. The………Page 11
Scaling of Transistor Performance, Wires, and Power in Integrated Circuits……Page 13
The Impact of Time, Volume, Commodification, and Packaging……Page 15
FIGURE 1.5 Prices of six generations of DRAMs (from 16Kb to 64 Mb) over time in 1977 dollars, sh………Page 17
FIGURE 1.6 The price of an Intel Pentium III at a given frequency decreases over time as yield e………Page 18
Cost of an Integrated Circuit……Page 16
EXAMPLE Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side…….Page 19
FIGURE 1.8 Photograph of an 12-inch wafer containing NEC MIPS 4122 processors…….Page 20
EXAMPLE Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side, assuming a defe………Page 21
Cost Versus Price—Why They Differ and By How Much……Page 22
FIGURE 6.43 Code for a sense-reversing barrier using fetch-and-increment to do the counting…….Page 23
FIGURE 1.10 The components of price for a $1,000 PC. Each increase is shown along the bottom as ………Page 24
90.7u 12.9s 2:39 65%……Page 27
Choosing Programs to Evaluate Performance……Page 28
Benchmark Suites……Page 29
Desktop Benchmarks……Page 30
Data Dependences……Page 31
Server Benchmarks……Page 32
FIGURE 6.4 The distribution of execution time in the commercial workloads. The OLTP benchmark ha………Page 33
Reporting Performance Results……Page 34
FIGURE 6.10 The cache-coherence mechanism receives requests from both the processor and the bus ………Page 35
Comparing and Summarizing Performance……Page 36
Weighted Execution Time……Page 37
Normalized Execution Time and the Pros and Cons of Geometric Means……Page 38
FIGURE 1.17 Execution times from Figure 1.15 normalized to each machine. The arithmetic mean per………Page 40
Amdahl’s Law……Page 41
EXAMPLE Suppose that we are considering an enhancement to the processor of a server system used f………Page 42
EXAMPLE A common transformation required in graphics engines is square root. Implementations of f………Page 43
The CPU Performance Equation……Page 44
EXAMPLE Suppose we have made the following measurements:……Page 46
Measuring and Modeling the Components of the CPU Performance Equation……Page 47
3. By interpreting the program at the instruction set level, compiling instruction counts in the ………Page 48
Take Advantage of Parallelism……Page 49
Performance and Price-Performance for Desktop Systems……Page 50
Performance and Price-Performance for Transaction Processing Servers……Page 51
FIGURE 1.19 Performance and price-performance for seven systems are measured using SPEC CINT2000………Page 52
FIGURE 1.20 Performance and price-performance for seven systems are measured using SPEC CFP2000 ………Page 53
How much to speculate……Page 54
Performance and Price-Performance for Embedded Processors……Page 55
FIGURE 1.23 Price-performance (plotted as transactions per minute per $1000 of system cost) and ………Page 56
2. The amount of overhead per loop iteration is very high: two of out of five instructions (the D………Page 57
FIGURE 1.26 Relative price-performance for three of the five EEMBC benchmark suites on five diff………Page 58
FIGURE 1.27 Relative performance per watt for the five embedded processors. The power is measure………Page 60
FIGURE 1.28 A comparison of the performance of the Pentium 4 (P4) relative to the Pentium III (P………Page 61
FIGURE 1.29 The tuning parameters for the SPEC CFP2000 report on an AlphaServer DS20E Model 6/66………Page 63
1. Although a moderate range of scalability, up to a few hundred processors may be of interest, t………Page 65
FIGURE 1.32 Measurements of peak performance and actual performance for the Hitachi S810/20 and ………Page 66
SQRT(EXP(X)) = = EXP(X/2)……Page 68
3.7 [15] Suppose we have a deeply pipelined processor, for which we implement a branch-tar………Page 72
3.8 [10] Determine the improvement from branch folding for unconditional branches. Assume ………Page 73
3.10 [30] Implement a simulator to evaluate various branch prediction schemes. You can use………Page 74
g. [22] Using the MIPS code for SAXPY above, assume a speculative processor with the fu………Page 75
References……Page 77
c. [10] What percentage of vectorization is needed to achieve one-half the maximum speedup ………Page 78
1.6 [15] Assume that we have a machine that with a perfect cache behaves as given in Figur………Page 79
FIGURE 1.33 The frequency of floating-point operations in the Whetstone benchmark…….Page 80
1.9 [15/10/15/15/15] This exercise estimates the complete packaged cost of a microproc………Page 81
e. [15] The parameter a depends on the complexity of the process. Additional metal levels r………Page 82
The turning away from the conventional organization came in the middle 1960s, when the law of di………Page 83
b. [10] For the configuration without the coprocessor, we measure that F = 8 ¥ 106, Y = 50,………Page 84
Hardware Support for Preserving Exception Behavior……Page 85
4. A mechanism is provided to indicate that an instruction is speculative and the hardware buffer………Page 86
chap06.2001.pdf……Page 0
Exploiting Instruction Level Parallelism with Software Approaches……Page 167
FIGURE 3.1 The major techniques examined in Appendix A, chapter 3, or chapter 4 are shown togeth………Page 170
FIGURE 6.12 Cache-coherence state diagram with the state transitions induced by the local proces………Page 169
Data Dependence and Hazards……Page 171
2. An output dependence occurs when instruction i and instruction j write the same register or me………Page 173
Data Hazards……Page 174
FIGURE 6.29 State transition diagram for an individual cache block in a directory- based system……….Page 175
Dynamic Scheduling: The Idea……Page 178
2. Read operands—Wait until no data hazards, then read operands…….Page 180
Dynamic Scheduling Using Tomasulo’s Approach……Page 181
FIGURE 3.2 The basic structure of a MIPS floating point unit using Tomasulo’s algorithm. Instruc………Page 184
2. Execute—If one or more of the operands is not yet available, monitor the common data bus (CDB)………Page 183
3. Write result—When the result is available, write it on the CDB and from there into the registe………Page 185
EXAMPLE Show what the information tables look like for the following code sequence when only the ………Page 186
FIGURE 3.3 Reservation stations and register tags shown when all of the instructions have issued………Page 187
EXAMPLE Using the same code segment as the previous example (page 239), show what the status tabl………Page 188
Exercises……Page 189
FIGURE 3.5 Steps in the algorithm and what is required for each step. For the issuing instructi………Page 190
FIGURE 3.6 Two active iterations of the loop with no instruction yet completed. Entries in the m………Page 192
EXAMPLE Consider a loop branch whose behavior is taken nine times in a row, then not taken once. ………Page 195
FIGURE 3.7 The states in a two-bit prediction scheme. By using two bits rather than one, a branc………Page 196
FIGURE 3.8 Prediction accuracy of a 4096-entry two-bit prediction buffer for the SPEC89 benchmar………Page 197
FIGURE 3.9 Prediction accuracy of a 4096-entry two-bit prediction buffer versus an infinite buff………Page 199
Correlating Branch Predictors……Page 198
FIGURE 3.11 Behavior of a one-bit predictor initialized to not taken. T stands for taken, NT for………Page 200
FIGURE 3.13 The action of the one-bit predictor with one bit of correlation, initialized to not ………Page 201
FIGURE 3.14 A (2,2) branch-prediction buffer uses a two-bit global history to choose from among ………Page 202
EXAMPLE How many branch-selected entries are in a (2,2) predictor that has a total of 8K bits in ………Page 203
FIGURE 3.15 Comparison of two-bit predictors. A noncorrelating predictor for 4096 bits is first,………Page 204
An Example: the Alpha 21264 Branch Predictor……Page 205
FIGURE 3.16 The state transition diagram for a tournament predictor has four states correspondin………Page 206
FIGURE 3.17 The fraction of predictions coming from the local predictor for a tournament predict………Page 207
FIGURE 3.18 The misprediction rate for three different predictors on SPEC89 as the total number ………Page 208
FIGURE 3.19 A branch-target buffer. The PC of the instruction being fetched is matched against a………Page 209
FIGURE 3.20 The steps involved in handling an instruction with a branch-target buffer. If the PC………Page 211
FIGURE 3.21 Penalties for all possible combinations of whether the branch is in the buffer and w………Page 212
EXAMPLE Determine the total branch penalty for a branch-target buffer assuming the penalty cycles………Page 210
3. Instruction memory access and buffering: when fetching multiple instructions per cycle a varie………Page 213
FIGURE 3.22 Prediction accuracy for a return address buffer operated as a stack. The accuracy is………Page 214
FIGURE 3.23 There are five primary approaches in use for multiple-issue processors, and this tab………Page 216
Statically-Scheduled Superscalar Processors……Page 215
A Statically Scheduled Superscalar MIPS Processor……Page 217
FIGURE 3.24 Superscalar pipeline in operation. The integer and floating-point instructions are i………Page 218
Multiple Instruction Issue with Dynamic Scheduling……Page 220
EXAMPLE Consider the execution of the following simple loop, which adds a scalar in F2 to each el………Page 221
EXAMPLE Consider the execution of the same loop on two-issue processor, but, in addition, assume ………Page 222
FIGURE 3.26 Resource usage table for the example shown in Figure 3.25. The entry in each box sho………Page 223
FIGURE 3.28 Resource usage table for the example shown in Figure 3.27, using the same format as ………Page 224
3. The control hazard, which prevents us from starting the next L.D before we know whether the br………Page 225
3. Write result—When the result is available, write it on the CDB (with the ROB tag sent when the………Page 228
2. Execute—If one or more of the operands is not yet available, monitor the CDB (common data bus)………Page 227
EXAMPLE Assume the same latencies for the floating-point functional units as in earlier examples:………Page 229
FIGURE 3.30 At the time the MUL.D is ready to commit, only the two L.D instructions have committ………Page 231
EXAMPLE Consider the code example used earlier for Tomasulo’s algorithm and shown in Figure3.6 o………Page 232
FIGURE 3.31 Only the L.D and MUL.D instructions have committed, though all the others have compl………Page 233
FIGURE 3.32 Steps in the algorithm and what is required for each step. For the issuing instruct………Page 235
2. maintaining the program order for the computation of an effective address of a load with respe………Page 234
EXAMPLE Consider the execution of the following loop, which searches an array, on a two issue pro………Page 236
FIGURE 4.22 The energy performance of the processor and memory interface modules using two multi………Page 238
FIGURE 3.34 The time of issue, execution, and writing result for a dual-issue version of our pip………Page 239
Register renaming versus Reorder Buffers……Page 237
Speculating through multiple branches……Page 241
4. Memory-address alias analysis—All memory addresses are known exactly and a load can be moved b………Page 242
5. Provide enough replicated functional units to allow all the ready instructions to issue…….Page 244
FIGURE 3.36 The effects of reducing the size of the window. The window is the group of instructi………Page 246
FIGURE 3.37 The effect of window size shown by each application by plotting the average number o………Page 247
FIGURE 3.38 The effect of branch-prediction schemes. This graph shows the impact of going from a………Page 248
2. Tournament-based branch predictor—The prediction scheme uses a correlating two-bit predictor a………Page 249
5. None—No branch prediction is used, though jumps are still predicted. Parallelism is largely li………Page 250
The Effects of Finite Registers……Page 251
FIGURE 3.41 The effect of finite numbers of registers available for renaming. Both the number of………Page 252
The Effects of Imperfect Alias Analysis……Page 253
2. Inspection—This model examines the accesses to see if they can be determined not to interfere ………Page 254
3. None—All memory references are assumed to conflict…….Page 255
4. Register renaming with 64 additional integer and 64 additional FP registers, exceeding largest………Page 256
3. A speculative superscalar with a 64-entry window. It achieves one- half of the ideal issue rat………Page 258
FIGURE 3.46 The amount of parallelism available versus the window size for a variety of integer ………Page 259
1. A simple MIPS two-issue static pipe running at a clock rate of 1 GHz and achieving a pipeline ………Page 257
3. Overcoming the data flow limit: a recent proposed idea to boost ILP, which goes beyond the cap………Page 261
2. Speculating on multiple paths: this idea was discussed by Lam and Wilson in 1992 and explored ………Page 262
FIGURE 3.47 The Intel processors based on the P6 microarchitecture and their important differenc………Page 263
Performance of the Pentium Pro Implementation……Page 264
5. A data cache misses led to a stall because every reservation station or the reorder buffer was………Page 265
FIGURE 3.50 The number of instructions decoded each clock varies widely and depends upon a varie………Page 266
FIGURE 3.51 Stall cycles per instruction at decode time and the breakdown due to instruction str………Page 267
Data Cache Behavior……Page 268
Branch Performance and Speculation Costs……Page 269
Putting the Pieces Together: Overall Performance of the P6 Pipeline……Page 270
The Pentium III versus the Pentium 4……Page 271
FIGURE 3.56 The breakdown in how often 0, 1, 2, or 3 uops commit in a cycle. The average number ………Page 272
FIGURE 3.57 The actual CPI (shown as a line) is lower than the sum of the number of uop cycles p………Page 273
FIGURE 3.58 The performance of the Pentium 4 for four SPEC2000 benchmarks (two integer: gcc and ………Page 274
FIGURE 6.46 The SGI Origin 2000 uses an architecture that contains two processors per node and a………Page 277
Pitfall: Emphasizing an improvement in CPI by increasing issue rate while sacrificing clock rate ………Page 278
Pitfalls: Sometimes bigger and dumber is better…….Page 279
Practical Limitations on Exploiting More ILP……Page 281
FIGURE 3.60 The relative performance per Watt of the Pentium 4 is 15% to 40% less than the Penti………Page 283
Branch Prediction Schemes……Page 284
The Development of Multiple-Issue Processors……Page 285
Studies of ILP and Ideas to Increase ILP……Page 286
Recent Advanced Microprocessors……Page 287
References……Page 288
Exercises……Page 292
3.5 [15] Tomasulo’s algorithm also has a disadvantage versus the scoreboard: only one resu………Page 293
3.4 [12] A shortcoming of the scoreboard approach occurs when multiple functional units t………Page 71
FIGURE 3.63 Latencies for functional units, configuration 2…….Page 294
4.5 [20/22/22/22/22/25/25/25/20/22/22] In this Exercise, we will look at how a com………Page 295
a. [20] For this problem use the standard single-issue MIPS pipeline with the pipeline late………Page 296
d. [25] To further boost clock rates, a number of processors have added additional pipelini………Page 297
3.16 [Discussion] There is a subtle problem that must be considered when implementing Tom………Page 298
4.11 [15] Perform the same transformation (moving up the branch) as the example on page 26………Page 299
One of the surprises about IA-64 is that we hear no claims of high frequency, despite claims that………Page 301
4.10 Concluding Remarks 293……Page 168
Exercises 299……Page 302
for (i=1000; i>0; i=i–1) x[i] = x[i] + s;……Page 303
EXAMPLE Show how the loop would look on MIPS, both scheduled and unscheduled, including any stall………Page 304
EXAMPLE Show our loop unrolled so that there are four copies of the loop body, assuming R1 is in………Page 305
EXAMPLE Show the unrolled loop in the previous example after it has been scheduled for the pipeli………Page 306
5. Determine that the loads and stores in the unrolled loop can be interchanged by observing that………Page 307
EXAMPLE Show how the process of optimizing the loop overhead by unrolling the loop actually elimi………Page 308
EXAMPLE Unroll our example loop, eliminating the excess loop overhead, but using the same registe………Page 309
EXAMPLE Unroll and schedule the loop used in the earlier examples and shown on page 223…….Page 311
5. Unoptimized Wildfire with poor data placement: Wildfire with poor data placement and unintelli………Page 312
FIGURE 4.3 Misprediction rate on SPEC92 for a profile-based predictor varies widely but is gener………Page 314
FIGURE 4.4 Accuracy of a predict-taken strategy and a profile-based predictor for SPEC92 benchma………Page 315
The Basic VLIW Approach……Page 316
EXAMPLE Suppose we have a VLIW that could issue two memory references, two FP operations, and one………Page 317
FIGURE 4.5 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This ………Page 318
Detecting and Enhancing Loop-Level Parallelism……Page 319
2. S2 uses the value, A[i+1], computed by S1 in the same iteration…….Page 320
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior t………Page 321
for (i=6;i
Reviews
There are no reviews yet.