Spoken Language Processing: A Guide to Theory, Algorithm, and System Development
Xuedong Huang, Alex Acero, Hsiao-Wuen Hon
ISBN: 9780130226167, 0130226165
Table of contents:
Cover……Page 1
Table of Contents……Page 3
Foreword……Page 16
Preface……Page 19
Introduction……Page 21
1.1.1. Spoken Language Interface……Page 22
1.1.3. Knowledge Partners……Page 23
1.2.1. Automatic Speech Recognition……Page 24
1.2.2. Text-to-Speech Conversion……Page 26
1.2.3. Spoken Language Understanding……Page 27
1.3.2. Part II: Speech Processing……Page 29
1.3.5. Part V: Spoken Language Systems……Page 30
1.5. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 31
REFERENCES……Page 34
Spoken Language Structure……Page 36
2.1.1. Sound……Page 38
2.1.2. Speech Production……Page 41
2.1.3. Speech Perception……Page 45
2.2.1. Phonemes……Page 53
2.2.2. The Allophone: Sound and Context……Page 64
2.2.3. Speech Rate and Coarticulation……Page 66
2.3. SYLLABLES AND WORDS……Page 67
2.3.1. Syllables……Page 68
2.3.2. Words……Page 69
2.4. SYNTAX AND SEMANTICS……Page 74
2.4.1. Syntactic Constituents……Page 75
2.4.2. Semantic Roles……Page 80
2.4.3. Lexical Semantics……Page 81
2.4.4. Logical Form……Page 83
2.5. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 85
REFERENCES……Page 86
Probability, Statistics, and Information Theory……Page 89
3.1. PROBABILITY THEORY……Page 90
3.1.1. Conditional Probability and Bayes’ Rule……Page 91
3.1.2. Random Variables……Page 93
3.1.3. Mean and Variance……Page 95
3.1.4. Covariance and Correlation……Page 99
3.1.5. Random Vectors and Multivariate Distributions……Page 100
3.1.6. Some Useful Distributions……Page 101
3.1.6.3. Geometric Distributions……Page 103
3.1.7. Gaussian Distributions……Page 108
3.2. ESTIMATION THEORY……Page 114
3.2.1. Minimum/Least Mean Squared Error Estimation……Page 115
3.2.2. Maximum Likelihood Estimation……Page 120
3.2.3. Bayesian Estimation and MAP Estimation……Page 124
3.3.1. Level of Significance……Page 130
3.3.2. Normal Test (Z-Test)……Page 132
3.3.3. Goodness-of-Fit Test……Page 133
3.3.4. Matched-Pairs Test……Page 135
3.4.1. Entropy……Page 137
3.4.2. Conditional Entropy……Page 140
3.4.3. The Source Coding Theorem……Page 141
3.4.4. Mutual Information and Channel Coding……Page 143
3.5. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 145
REFERENCES……Page 147
Pattern Recognition……Page 149
4.1. BAYES DECISION THEORY……Page 150
4.1.1. Minimum-Error-Rate Decision Rules……Page 151
4.1.2. Discriminant Functions……Page 154
4.2. HOW TO CONSTRUCT CLASSIFIERS……Page 156
4.2.1. Gaussian Classifiers……Page 158
4.2.2. The Curse of Dimensionality……Page 160
4.2.3. Estimating the Error Rate……Page 162
4.2.4. Comparing Classifiers……Page 164
4.3.1. Maximum Mutual Information Estimation……Page 166
4.3.2. Minimum-Error-Rate Estimation……Page 172
4.3.3. Neural Networks……Page 174
4.4. UNSUPERVISED ESTIMATION METHODS……Page 179
4.4.1. Vector Quantization……Page 180
4.4.2. The EM Algorithm……Page 186
4.4.3. Multivariate Gaussian Mixture Density Estimation……Page 188
4.5. CLASSIFICATION AND REGRESSION TREES……Page 192
4.5.1. Choice of Question Set……Page 193
4.5.2. Splitting Criteria……Page 195
4.5.3. Growing the Tree……Page 197
4.5.4. Missing Values and Conflict Resolution……Page 198
4.5.5. Complex Questions……Page 199
4.5.6. The Right-Sized Tree……Page 201
4.6. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 206
REFERENCES……Page 208
Digital Signal Processing……Page 214
5.1. DIGITAL SIGNALS AND SYSTEMS……Page 215
5.1.1. Sinusoidal Signals……Page 216
5.1.3. Digital Systems……Page 219
5.2.1. The Fourier Transform……Page 222
5.2.2. Z-Transform……Page 224
5.2.3. Z-Transforms of Elementary Functions……Page 225
5.2.4. Properties of the Z and Fourier Transform……Page 228
5.3. DISCRETE-FREQUENCY TRANSFORMS……Page 229
5.3.1. The Discrete Fourier Transform (DFT)……Page 231
5.3.2. Fourier Transforms of Periodic Signals……Page 232
5.3.3. The Fast Fourier Transform (FFT)……Page 235
5.3.4. Circular Convolution……Page 240
5.3.5. The Discrete Cosine Transform (DCT)……Page 241
5.4.1. The Ideal Low-Pass Filter……Page 242
5.4.2. Window Functions……Page 243
5.4.3. FIR Filters……Page 245
5.4.4. IIR Filters……Page 251
5.5.1. Fourier Transform of Analog Signals……Page 255
5.5.2. The Sampling Theorem……Page 256
5.5.3. Analog-to-Digital Conversion……Page 258
5.5.4. Digital-to-Analog Conversion……Page 259
5.6. MULTIRATE SIGNAL PROCESSING……Page 260
5.6.1. Decimation……Page 261
5.6.2. Interpolation……Page 262
5.7.1. Two-Band Conjugate Quadrature Filters……Page 263
5.7.2. Multiresolution Filterbanks……Page 266
5.7.3. The FFT as a Filterbank……Page 268
5.7.4. Modulated Lapped Transforms……Page 270
5.8. STOCHASTIC PROCESSES……Page 272
5.8.1. Statistics of Stochastic Processes……Page 273
5.8.2. Stationary Processes……Page 276
5.8.3. LTI Systems with Stochastic Inputs……Page 279
5.8.4. Power Spectral Density……Page 280
5.9. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 282
REFERENCES……Page 284
Speech Signal Representations……Page 286
6.1. SHORT-TIME FOURIER ANALYSIS……Page 287
6.1.1. Spectrograms……Page 292
6.2. ACOUSTICAL MODEL OF SPEECH PRODUCTION……Page 294
6.2.2. Lossless Tube Concatenation……Page 295
6.2.3. Source-Filter Models of Speech Production……Page 299
6.3. LINEAR PREDICTIVE CODING……Page 301
6.3.1. The Orthogonality Principle……Page 302
6.3.2. Solution of the LPC Equations……Page 304
6.3.3. Spectral Analysis via LPC……Page 311
6.3.4. The Prediction Error……Page 312
6.3.5. Equivalent Representations……Page 314
6.4. CEPSTRAL PROCESSING……Page 317
6.4.1. The Real and Complex Cepstrum……Page 318
6.4.2. Cepstrum of Pole-Zero Filters……Page 319
6.4.3. Cepstrum of Periodic Signals……Page 322
6.4.4. Cepstrum of Speech Signals……Page 323
6.4.5. Source-Filter Separation via the Cepstrum……Page 324
6.5.1. The Bilinear Transform……Page 326
6.5.2. Mel-Frequency Cepstrum……Page 327
6.6. FORMANT FREQUENCIES……Page 329
6.6.1. Statistical Formant Tracking……Page 331
6.7.1. Autocorrelation Method……Page 334
6.7.2. Normalized Cross-Correlation Method……Page 337
6.7.4. Pitch Tracking……Page 340
6.8. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 342
REFERENCES……Page 343
Speech Coding……Page 347
7.1. SPEECH CODERS’ ATTRIBUTES……Page 348
7.2.1. Linear Pulse Code Modulation (PCM)……Page 350
7.2.2. µ-law and A-law PCM……Page 352
7.2.3. Adaptive PCM……Page 354
7.2.4. Differential Quantization……Page 355
7.3.1. Benefits of Masking……Page 358
7.3.2. Transform Coders……Page 360
7.3.4. Digital Audio Broadcasting (DAB)……Page 361
7.4.1. LPC Vocoder……Page 362
7.4.2. Analysis by Synthesis……Page 363
7.4.3. Pitch Prediction: Adaptive Codebook……Page 366
7.4.4. Perceptual Weighting and Postfiltering……Page 367
7.4.5. Parameter Quantization……Page 368
7.4.6. CELP Standards……Page 369
7.5. LOW-BIT RATE SPEECH CODERS……Page 371
7.5.2. Harmonic Coding……Page 372
7.5.3. Waveform Interpolation……Page 377
REFERENCES……Page 381
Hidden Markov Models……Page 385
8.1. THE MARKOV CHAIN……Page 386
8.2. DEFINITION OF THE HIDDEN MARKOV MODEL……Page 388
8.2.1. Dynamic Programming and DTW……Page 391
8.2.2. How to Evaluate an HMM – The Forward Algorithm……Page 393
8.2.3. How to Decode an HMM – The Viterbi Algorithm……Page 395
8.2.4. How to Estimate HMM Parameters – Baum-Welch Algorithm……Page 397
8.3.1. Continuous Mixture Density HMMs……Page 402
8.3.2. Semi-continuous HMMs……Page 404
8.4.1. Initial Estimates……Page 406
8.4.2. Model Topology……Page 407
8.4.4. Deleted Interpolation……Page 409
8.4.5. Parameter Smoothing……Page 411
8.4.6. Probability Representations……Page 412
8.5. HMM LIMITATIONS……Page 413
8.5.1. Duration Modeling……Page 414
8.5.2. First-Order Assumption……Page 416
8.6. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 417
REFERENCES……Page 419
Acoustic Modeling……Page 422
9.1. VARIABILITY IN THE SPEECH SIGNAL……Page 423
9.1.1. Context Variability……Page 424
9.1.3. Speaker Variability……Page 425
9.2. HOW TO MEASURE SPEECH RECOGNITION ERRORS……Page 426
9.3. SIGNAL PROCESSING—EXTRACTING FEATURES……Page 428
9.3.1. Signal Acquisition……Page 429
9.3.2. End-Point Detection……Page 430
9.3.3. MFCC and Its Dynamic Features……Page 432
9.3.4. Feature Transformation……Page 433
9.4. PHONETIC MODELING—SELECTING APPROPRIATE UNITS……Page 435
9.4.1. Comparison of Different Units……Page 436
9.4.2. Context Dependency……Page 437
9.4.3. Clustered Acoustic- Phonetic Units……Page 439
9.4.4. Lexical Baseforms……Page 443
9.5.1. Choice of HMM Output Distributions……Page 446
9.5.2. Isolated vs. Continuous Speech Training……Page 448
9.6. ADAPTIVE TECHNIQUES—MINIMIZING MISMATCHES……Page 451
9.6.1. Maximum a Posteriori (MAP)……Page 452
9.6.2. Maximum Likelihood Linear Regression (MLLR)……Page 455
9.6.3. MLLR and MAP Comparison……Page 457
9.6.4. Clustered Models……Page 459
9.7.1. Filler Models……Page 460
9.7.2. Transformation Models……Page 461
9.7.3. Combination Models……Page 463
9.8.1. Neural Networks……Page 464
9.8.2. Segment Models……Page 466
9.9. CASE STUDY: WHISPER……Page 471
9.10. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 472
REFERENCES……Page 475
Environmental Robustness……Page 482
10.1.1. Additive Noise……Page 483
10.1.2. Reverberation……Page 485
10.1.3. A Model of the Environment……Page 487
10.2. ACOUSTICAL TRANSDUCERS……Page 491
10.2.2. Directionality Patterns……Page 493
10.2.3. Other Transduction Categories……Page 501
10.3. ADAPTIVE ECHO CANCELLATION (AEC)……Page 502
10.3.1. The LMS Algorithm……Page 503
10.3.2. Convergence Properties of the LMS Algorithm……Page 504
10.3.4. Transform-Domain LMS Algorithm……Page 506
10.3.5. The RLS Algorithm……Page 507
10.4. MULTIMICROPHONE SPEECH ENHANCEMENT……Page 508
10.4.1. Microphone Arrays……Page 509
10.4.2. Blind Source Separation……Page 514
10.5.1. Spectral Subtraction……Page 519
10.5.2. Frequency-Domain MMSE from Stereo Data……Page 523
10.5.3. Wiener Filtering……Page 525
10.5.4. Cepstral Mean Normalization (CMN)……Page 526
10.5.6. The Use of Gaussian Mixture Models……Page 529
10.6. ENVIRONMENTAL MODEL ADAPTATION……Page 531
10.6.1. Retraining on Corrupted Speech……Page 532
10.6.2. Model Adaptation……Page 533
10.6.3. Parallel Model Combination……Page 535
10.6.4. Vector Taylor Series……Page 537
10.6.5. Retraining on Compensated Features……Page 541
10.7. MODELING NONSTATIONARY NOISE……Page 542
10.8. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 543
REFERENCES……Page 544
Language Modeling……Page 548
11.1. FORMAL LANGUAGE THEORY……Page 549
11.1.1. Chomsky Hierarchy……Page 550
11.1.2. Chart Parsing for Context-Free Grammars……Page 552
11.2.1. Probabilistic Context-Free Grammars……Page 557
11.2.2. N-gram Language Models……Page 561
11.3. COMPLEXITY MEASURE OF LANGUAGE MODELS……Page 563
11.4. N-GRAM SMOOTHING……Page 565
11.4.1. Deleted Interpolation Smoothing……Page 567
11.4.2. Backoff Smoothing……Page 568
11.4.3. Class n-grams……Page 574
11.4.4. Performance of n-gram Smoothing……Page 576
11.5.1. Cache Language Models……Page 577
11.5.2. Topic-Adaptive Models……Page 578
11.5.3. Maximum Entropy Models……Page 579
11.6.1. Vocabulary Selection……Page 581
11.6.2. N-gram Pruning……Page 583
11.6.3. CFG vs. n-gram Models……Page 584
11.7. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 587
REFERENCES……Page 589
Basic Search Algorithms……Page 594
12.1.1. General Graph Searching Procedures……Page 595
12.1.2. Blind Graph Search Algorithms……Page 600
12.1.3. Heuristic Graph Search……Page 603
12.2. SEARCH ALGORITHMS FOR SPEECH RECOGNITION……Page 610
12.2.1. Decoder Basics……Page 611
12.2.2. Combining Acoustic and Language Models……Page 612
12.2.4. Continuous Speech Recognition……Page 613
12.3.1. Search Space with FSM and CFG……Page 615
12.3.2. Search Space with the Unigram……Page 618
12.3.3. Search Space with Bigrams……Page 619
12.3.4. Search Space with Trigrams……Page 621
12.3.5. How to Handle Silences Between Words……Page 622
12.4. TIME-SYNCHRONOUS VITERBI BEAM SEARCH……Page 624
12.4.1. The Use of Beam……Page 626
12.4.2. Viterbi Beam Search……Page 627
12.5. STACK DECODING (A* SEARCH)……Page 628
12.5.1. Admissible Heuristics for Remaining Path……Page 631
12.5.2. When to Extend New Words……Page 633
12.5.3. Fast Match……Page 637
12.5.4. Stack Pruning……Page 640
12.5.5. Multistack Search……Page 641
12.6. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 642
REFERENCES……Page 643
Large Vocabulary Search Algorithms……Page 646
13.1.1. Lexical Tree……Page 647
13.1.2. Multiple Copies of Pronunciation Trees……Page 649
13.1.3. Factored Language Probabilities……Page 651
13.1.4. Optimization of Lexical Trees……Page 654
13.1.5. Exploiting Subtree Polymorphism……Page 657
13.1.6. Context-Dependent Units and Inter-Word Triphones……Page 659
13.2.1. Using Entire HMM as a State in Search……Page 660
13.2.2. Different Layers of Beams……Page 661
13.2.3. Fast Match……Page 662
13.3.1. N-Best Lists and Word Lattices……Page 664
13.3.2. The Exact N-Best Algorithm……Page 667
13.3.3. Word-Dependent N-Best and Word-Lattice Algorithm……Page 668
13.3.4. The Forward- Backward Search Algorithm……Page 671
13.3.5. One- Pass vs. Multipass Search……Page 674
13.4. SEARCH-ALGORITHM EVALUATION……Page 675
13.5. CASE STUDY—MICROSOFT WHISPER……Page 676
13.5.1. The CFG Search Architecture……Page 677
13.5.2. The N-Gram Search Architecture……Page 678
13.6. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 682
REFERENCES……Page 683
Text and Phonetic Analysis……Page 686
14.1. MODULES AND DATA FLOW……Page 687
14.1.1. Modules……Page 689
14.1.2. Data Flows……Page 691
14.1.3. Localization Issues……Page 693
14.2. LEXICON……Page 694
14.3. DOCUMENT STRUCTURE DETECTION……Page 695
14.3.1. Chapter and Section Headers……Page 697
14.3.2. Lists……Page 698
14.3.4. Sentences……Page 699
14.3.5. E-mail……Page 701
14.3.7. Dialog Turns and Speech Acts……Page 702
14.4. TEXT NORMALIZATION……Page 703
14.4.1. Abbreviations and Acronyms……Page 706
14.4.2. Number Formats……Page 708
14.4.3. Domain-Specific Tags……Page 714
14.4.4. Miscellaneous Formats……Page 715
14.5. LINGUISTIC ANALYSIS……Page 716
14.6. HOMOGRAPH DISAMBIGUATION……Page 719
14.7. MORPHOLOGICAL ANALYSIS……Page 721
14.8. LETTER-TO-SOUND CONVERSION……Page 723
14.9. EVALUATION……Page 726
14.10.1. Lexicon……Page 728
14.10.2. Text Analysis……Page 729
14.10.3. Phonetic Analysis……Page 730
REFERENCES……Page 731
Prosody……Page 734
15.1. THE ROLE OF UNDERSTANDING……Page 735
15.2. PROSODY GENERATION SCHEMATIC……Page 738
15.3.2. Emotion……Page 739
15.4. SYMBOLIC PROSODY……Page 740
15.4.1. Pauses……Page 742
15.4.2. Prosodic Phrases……Page 744
15.4.3. Accent……Page 745
15.4.4. Tone……Page 748
15.4.5. Tune……Page 752
15.4.6. Prosodic Transcription Systems……Page 754
15.5. DURATION ASSIGNMENT……Page 756
15.5.1. Rule-Based Methods……Page 757
15.6.1. Attributes of Pitch Contours……Page 758
15.6.2. Baseline F0 Contour Generation……Page 762
15.6.3. Parametric F0 Generation……Page 768
15.6.4. Corpus-Based F0 Generation……Page 772
15.7. PROSODY MARKUP LANGUAGES……Page 776
15.8. PROSODY EVALUATION……Page 778
15.9. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 779
REFERENCES……Page 781
Speech Synthesis……Page 784
16.1. ATTRIBUTES OF SPEECH SYNTHESIS……Page 785
16.2.1. Waveform Generation from Formant Values……Page 787
16.2.2. Formant Generation by Rule……Page 790
16.2.4. Articulatory Synthesis……Page 793
16.3. CONCATENATIVE SPEECH SYNTHESIS……Page 794
16.3.1. Choice of Unit……Page 795
16.3.2. Optimal Unit String: The Decoding Process……Page 799
16.3.3. Unit Inventory Design……Page 807
16.4.1. Synchronous Overlap and Add (SOLA)……Page 808
16.4.2. Pitch Synchronous Overlap and Add (PSOLA)……Page 809
16.4.3. Spectral Behavior of PSOLA……Page 811
16.4.4. Synthesis Epoch Calculation……Page 812
16.4.5. Pitch-Scale Modification Epoch Calculation……Page 814
16.4.6. Time-Scale Modification Epoch Calculation……Page 815
16.4.9. Epoch Detection……Page 817
16.4.10. Problems with PSOLA……Page 819
16.5.1. Prosody Modification of the LPC Residual……Page 821
16.5.2. Mixed Excitation Models……Page 822
16.5.3. Voice Effects……Page 823
16.6. EVALUATION OF TTS SYSTEMS……Page 824
16.6.1. Intelligibility Tests……Page 826
16.6.2. Overall Quality Tests……Page 829
16.6.4. Functional Tests……Page 831
16.6.5. Automated Tests……Page 832
16.7. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 833
REFERENCES……Page 836
Spoken Language Understanding……Page 840
17.1. WRITTEN VS. SPOKEN LANGUAGES……Page 842
17.1.1. Style……Page 843
17.1.2. Disfluency……Page 844
17.1.3. Communicative Prosody……Page 845
17.2. DIALOG STRUCTURE……Page 846
17.2.1. Units of Dialog……Page 847
17.2.2. Dialog (Speech) Acts……Page 848
17.2.3. Dialog Control……Page 853
17.3.1. Semantic Frames……Page 854
17.3.2. Conceptual Graphs……Page 859
17.4. SENTENCE INTERPRETATION……Page 860
17.4.1. Robust Parsing……Page 861
17.4.2. Statistical Pattern Matching……Page 865
17.5. DISCOURSE ANALYSIS……Page 867
17.5.1. Resolution of Relative Expression……Page 868
17.5.2. Automatic Inference and Inconsistency Detection……Page 871
17.6. DIALOG MANAGEMENT……Page 872
17.6.1. Dialog Grammars……Page 873
17.6.2. Plan-Based Systems……Page 875
17.6.3. Dialog Behavior……Page 879
17.7.1. Response Content Generation……Page 881
17.7.2. Concept-to-Speech Rendition……Page 885
17.8.1. Evaluation in the ATIS Task……Page 887
17.8.2. PARADISE Framework……Page 889
17.9.1. Semantic Representation……Page 892
17.9.2. Semantic Parser (Sentence Interpretation)……Page 894
17.9.3. Discourse Analysis……Page 895
17.9.4. Dialog Manager……Page 896
17.10. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 899
REFERENCES……Page 900
Applications and User Interfaces……Page 904
18.1. APPLICATION ARCHITECTURE……Page 905
18.2.1. Computer Command and Control……Page 906
18.2.2. Telephony Applications……Page 909
18.2.3. Dictation……Page 911
18.2.5. Handheld Devices……Page 914
18.2.7. Speaker Recognition……Page 915
18.3.1. General Principles……Page 916
18.3.2. Handling Errors……Page 921
18.3.3. Other Considerations……Page 925
18.3.4. Dialog Flow……Page 926
18.4. INTERNATIONALIZATION……Page 928
18.5. CASE STUDY—MIPAD……Page 929
18.5.1. Specifying the Application……Page 930
18.5.2. Rapid Prototyping……Page 932
18.5.3. Evaluation……Page 933
18.5.4. Iterations……Page 935
18.6. HISTORICAL PERSPECTIVE AND FURTHER READING……Page 936
REFERENCES……Page 938
Index……Page 941