Arnowitz J., Berger N., Arent M. 0-12-088525-5
Table of contents:
Title Page……Page 4
Copyright Page……Page 5
Table of Contents……Page 6
Foreword……Page 12
Preface……Page 14
Acknowledgements……Page 18
About the Authors……Page 20
1 Preliminaries……Page 22
1.1 Fault Classification……Page 23
1.2 Types of Redundancy……Page 24
1.3 Basic Measures of Fault Tolerance……Page 25
1.3.1 Traditional Measures……Page 26
1.3.2 Network Measures……Page 27
1.4 Outline of This Book……Page 28
1.5 Further Reading……Page 30
References……Page 31
2.1 The Rate of Hardware Failures……Page 32
2.2 Failure Rate, Reliability, and Mean Time to Failure……Page 34
2.3 Canonical and Resilient Structures……Page 36
2.3.1 Series and Parallel Systems……Page 37
2.3.2 Non-Series/Parallel Systems……Page 38
2.3.3 M-of-N Systems……Page 41
2.3.5 Variations on N-Modular Redundancy……Page 44
2.3.6 Duplex Systems……Page 48
2.4.1 Poisson Processes……Page 51
2.4.2 Markov Models……Page 54
2.5 Fault-Tolerance Processor-Level Techniques……Page 57
2.5.1 Watchdog Processor……Page 58
2.5.2 Simultaneous Multithreading for Fault Tolerance……Page 60
2.6 Byzantine Failures……Page 62
2.6.1 Byzantine Agreement with Message Authentication……Page 67
2.8 Exercises……Page 69
References……Page 74
3 Information Redundancy……Page 76
3.1 Coding……Page 77
3.1.1 Parity Codes……Page 78
3.1.2 Checksum……Page 85
3.1.3 M-of-N Codes……Page 86
3.1.4 Berger Code……Page 87
3.1.5 Cyclic Codes……Page 88
3.1.6 Arithmetic Codes……Page 95
3.2.1 RAID Level 1……Page 100
3.2.2 RAID Level 2……Page 102
3.2.3 RAID Level 3……Page 103
3.2.4 RAID Level 4……Page 104
3.2.6 Modeling Correlated Failures……Page 105
3.3 Data Replication……Page 109
3.3.1 Voting: Non-Hierarchical Organization……Page 110
3.3.2 Voting: Hierarchical Organization……Page 116
3.3.3 Primary-Backup Approach……Page 117
3.4 Algorithm-Based Fault Tolerance……Page 120
3.5 Further Reading……Page 122
3.6 Exercises……Page 123
References……Page 127
4 Fault-Tolerant Networks……Page 130
4.1.1 Graph-Theoretical Measures……Page 131
4.1.2 Computer Networks Measures……Page 132
4.2.1 Multistage and Extra-Stage Networks……Page 133
4.2.2 Crossbar Networks……Page 140
4.2.3 Rectangular Mesh and Interstitial Mesh……Page 142
4.2.4 Hypercube Network……Page 145
4.2.5 Cube-Connected Cycles Networks……Page 149
4.2.6 Loop Networks……Page 151
4.2.7 Ad Hoc Point-to-Point Networks……Page 153
4.3 Fault-Tolerant Routing……Page 156
4.3.1 Hypercube Fault-Tolerant Routing……Page 157
4.3.2 Origin-Based Routing in the Mesh……Page 159
4.4 Further Reading……Page 162
4.5 Exercises……Page 163
References……Page 166
5 Software Fault Tolerance……Page 168
5.1 Acceptance Tests……Page 169
5.2.1 Wrappers……Page 170
5.2.2 Software Rejuvenation……Page 173
5.2.3 Data Diversity……Page 176
5.2.4 Software Implemented Hardware Fault Tolerance (SIHFT)……Page 178
5.3 N-Version Programming……Page 181
5.3.1 Consistent Comparison Problem……Page 182
5.3.2 Version Independence……Page 183
5.4.2 Success Probability Calculation……Page 190
5.4.3 Distributed Recovery Blocks……Page 192
5.6 Exception-Handling……Page 194
5.6.1 Requirements from Exception-Handlers……Page 195
5.6.2 Basics of Exceptions and Exception-Handling……Page 196
5.6.3 Language Support……Page 198
5.7.1 Jelinski-Moranda Model……Page 199
5.7.2 Littlewood-Verrall Model……Page 200
5.7.3 Musa-Okumoto Model……Page 201
5.8.1 Primary-Backup Approach……Page 203
5.8.2 The Circus Approach……Page 204
5.9 Further Reading……Page 205
5.10 Exercises……Page 207
References……Page 209
6 Checkpointing……Page 214
6.1 What Is Checkpointing?……Page 216
6.2 Checkpoint Level……Page 218
6.3 Optimal Checkpointing — An Analytical Model……Page 219
6.3.1 Time Between Checkpoints — A First-Order Approximation……Page 221
6.3.2 Optimal Checkpoint Placement……Page 222
6.3.3 Time Between Checkpoints — A More Accurate Model……Page 223
6.3.4 Reducing Overhead……Page 225
6.3.5 Reducing Latency……Page 226
6.4 Cache-Aided Rollback Error Recovery (CARER)……Page 227
6.5 Checkpointing in Distributed Systems……Page 228
6.5.1 The Domino Effect and Livelock……Page 230
6.5.2 A Coordinated Checkpointing Algorithm……Page 231
6.5.3 Time-Based Synchronization……Page 232
6.5.4 Diskless Checkpointing……Page 233
6.5.5 Message Logging……Page 234
6.6 Checkpointing in Shared-Memory Systems……Page 238
6.6.1 Bus-Based Coherence Protocol……Page 239
6.6.2 Directory-Based Protocol……Page 240
6.7 Checkpointing in Real-Time Systems……Page 241
6.9 Further Reading……Page 244
6.10 Exercises……Page 245
References……Page 247
7.1.1 Architecture……Page 250
7.1.3 Software……Page 254
7.1.4 Modifications to the NonStop Architecture……Page 256
7.2 Stratus Systems……Page 257
7.3 Cassini Command and Data Subsystem……Page 259
7.4 IBM G5……Page 262
7.5 IBM Sysplex……Page 263
7.6 Itanium……Page 265
7.7 Further Reading……Page 267
References……Page 268
8.1 Manufacturing Defects and Circuit Faults……Page 270
8.2 Probability of Failure and Critical Area……Page 272
8.3 Basic Yield Models……Page 274
8.3.1 The Poisson and Compound Poisson Yield Models……Page 275
8.3.2 Variations on the Simple Yield Models……Page 277
8.4 Yield Enhancement Through Redundancy……Page 279
8.4.1 Yield Projection for Chips with Redundancy……Page 280
8.4.2 Memory Arrays with Redundancy……Page 284
8.4.3 Logic Integrated Circuits with Redundancy……Page 291
8.4.4 Modifying the Floorplan……Page 293
8.5 Further Reading……Page 297
8.6 Exercises……Page 298
References……Page 302
9 Fault Detection in Cryptographic Systems……Page 306
9.1.1 Symmetric Key Ciphers……Page 307
9.1.2 Public Key Ciphers……Page 316
9.2 Security Attacks Through Fault Injection……Page 317
9.2.1 Fault Attacks on Symmetric Key Ciphers……Page 318
9.2.2 Fault Attacks on Public (Asymmetric) Key Ciphers……Page 319
9.3 Countermeasures……Page 320
9.3.2 Error-Detecting Codes……Page 321
9.3.3 Are These Countermeasures Sufficient?……Page 325
9.5 Exercises……Page 328
References……Page 329
10.1 Writing a Simulation Program……Page 332
10.2.1 Point Versus Interval Estimation……Page 336
10.2.2 Method of Moments……Page 337
10.2.3 Method of Maximum Likelihood……Page 339
10.2.4 The Bayesian Approach to Parameter Estimation……Page 343
10.2.5 Confidence Intervals……Page 345
10.3.1 Antithetic Variables……Page 349
10.3.2 Using Control Variables……Page 351
10.3.3 Stratified Sampling……Page 352
10.3.4 Importance Sampling……Page 354
10.4 Random Number Generation……Page 362
10.4.1 Uniformly Distributed Random Number Generators……Page 363
10.4.2 Testing Uniform Random Number Generators……Page 366
10.4.3 Generating Other Distributions……Page 370
10.5 Fault Injection……Page 376
10.5.1 Types of Fault Injection Techniques……Page 377
10.6 Further Reading……Page 379
10.7 Exercises……Page 380
References……Page 384
Index……Page 386