Otis Gospodnetic, Erik Hatcher
ISBN: 9781932394283, 1-932394-28-1
Table of contents:
Lucene in Action (Manning)……Page 0
foreword……Page 18
preface……Page 20
acknowledgments……Page 23
Roadmap……Page 26
Why JUnit?……Page 28
JUnit primer……Page 29
About the authors……Page 33
About the title……Page 34
About the cover illustration……Page 35
Part 1 – Core Lucene……Page 36
Meet Lucene……Page 38
1.1 Evolution of information organization and access……Page 39
1.2 Understanding Lucene……Page 41
1.2.2 What Lucene can do for you……Page 42
1.2.3 History of Lucene……Page 44
1.3.1 What is indexing, and why is it important?……Page 45
1.4 Lucene in action: a sample application……Page 46
1.4.1 Creating an index……Page 47
1.4.2 Searching an index……Page 50
1.5 Understanding the core indexing classes……Page 53
1.5.3 Analyzer……Page 54
1.5.5 Field……Page 55
1.6 Understanding the core searching classes……Page 57
1.6.3 Query……Page 58
1.7.1 IR libraries……Page 59
1.7.2 Indexing and searching applications……Page 61
1.8 Summary……Page 62
Indexing……Page 63
2.1.1 Conversion to text……Page 64
2.1.2 Analysis……Page 65
2.2.1 Adding documents to an index……Page 66
2.2.2 Removing Documents from an index……Page 68
2.2.4 Updating Documents in an index……Page 71
2.3 Boosting Documents and Fields……Page 73
2.4 Indexing dates……Page 74
2.5 Indexing numbers……Page 75
2.6 Indexing Fields used for sorting……Page 76
2.7.1 Tuning indexing performance……Page 77
2.7.2 In-memory indexing: RAMDirectory……Page 83
2.7.3 Limiting Field sizes: maxFieldLength……Page 89
2.8 Optimizing an index……Page 91
2.9.1 Concurrency rules……Page 94
2.9.2 Thread-safety……Page 95
2.9.3 Index locking……Page 97
2.10 Debugging indexing……Page 101
2.11 Summary……Page 102
Adding search to your application……Page 103
3.1 Implementing a simple search feature……Page 104
3.1.1 Searching for a specific term……Page 105
3.1.2 Parsing a user-entered query expression: QueryParser……Page 107
3.2 Using IndexSearcher……Page 110
3.2.1 Working with Hits……Page 111
3.2.3 Reading indexes into memory……Page 112
3.3 Understanding Lucene scoring……Page 113
3.3.1 Lucene, you got a lot of ‘splainin’ to do!……Page 115
3.4 Creating queries programmatically……Page 116
3.4.1 Searching by term: TermQuery……Page 117
3.4.2 Searching within a range: RangeQuery……Page 118
3.4.3 Searching on a string: PrefixQuery……Page 119
3.4.4 Combining queries: BooleanQuery……Page 120
3.4.5 Searching by phrase: PhraseQuery……Page 122
3.4.6 Searching by wildcard: WildcardQuery……Page 125
3.4.7 Searching for similar terms: FuzzyQuery……Page 127
3.5 Parsing query expressions: QueryParser……Page 128
3.5.2 Boolean operators……Page 129
3.5.4 Field selection……Page 130
3.5.5 Range searches……Page 131
3.5.6 Phrase queries……Page 133
3.5.9 Boosting queries……Page 134
3.6 Summary……Page 135
Analysis……Page 137
4.1 Using analyzers……Page 139
4.1.1 Indexing analysis……Page 140
4.1.2 QueryParser analysis……Page 141
4.2 Analyzing the analyzer……Page 142
4.2.1 What’s in a token?……Page 143
4.2.2 TokenStreams uncensored……Page 144
4.2.3 Visualizing analyzers……Page 147
4.2.4 Filtering order can be important……Page 151
4.3.1 StopAnalyzer……Page 154
4.3.2 StandardAnalyzer……Page 155
4.4 Dealing with keyword fields……Page 156
4.5 “Sounds like” querying……Page 160
4.6 Synonyms, aliases, and words that mean the same……Page 163
4.6.1 Visualizing token positions……Page 169
4.7.1 Leaving holes……Page 171
4.7.2 Putting it together……Page 172
4.7.3 Hole lot of trouble……Page 173
4.8.1 Unicode and encodings……Page 175
4.8.2 Analyzing non-English languages……Page 176
4.8.3 Analyzing Asian languages……Page 177
4.9 Nutch analysis……Page 180
4.10 Summary……Page 182
Advanced search techniques……Page 184
5.1.1 Using a sort……Page 185
5.1.2 Sorting by relevance……Page 187
5.1.3 Sorting by index order……Page 188
5.1.5 Reversing sort order……Page 189
5.1.6 Sorting by multiple fields……Page 190
5.1.7 Selecting a sorting field type……Page 191
5.2 Using PhrasePrefixQuery……Page 192
5.3 Querying on multiple fields at once……Page 194
5.4 Span queries: Lucene’s new hidden gem……Page 196
5.4.1 Building block of spanning, SpanTermQuery……Page 198
5.4.2 Finding spans at the beginning of a field……Page 200
5.4.3 Spans near one another……Page 201
5.4.4 Excluding span overlap from matches……Page 203
5.4.5 Spanning the globe……Page 204
5.4.6 SpanQuery and QueryParser……Page 205
5.5.1 Using DateFilter……Page 206
5.5.2 Using QueryFilter……Page 208
5.5.3 Security filters……Page 209
5.5.4 A QueryFilter alternative……Page 211
5.5.6 Beyond the built-in filters……Page 212
5.6.1 Using MultiSearcher……Page 213
5.6.2 Multithreaded searching using ParallelMultiSearcher……Page 215
5.7 Leveraging term vectors……Page 220
5.7.1 Books like this……Page 221
5.7.2 What category?……Page 224
5.8 Summary……Page 228
Extending search……Page 229
6.1 Using a custom sort method……Page 230
6.1.1 Accessing values used in custom sorting……Page 235
6.2 Developing a custom HitCollector……Page 236
6.2.2 Using BookLinkCollector……Page 237
6.3.1 Customizing QueryParser’s behavior……Page 238
6.3.2 Prohibiting fuzzy and wildcard queries……Page 239
6.3.3 Handling numeric field-range queries……Page 240
6.3.4 Allowing ordered phrase queries……Page 243
6.4 Using a custom filter……Page 244
6.4.1 Using a filtered query……Page 247
6.5.1 Testing the speed of a search……Page 248
6.5.2 Load testing……Page 252
6.5.3 QueryParser again!……Page 253
6.6 Summary……Page 255
Part 2 – Applied Lucene……Page 256
Parsing common document formats……Page 258
7.1 Handling rich-text documents……Page 259
7.1.1 Creating a common DocumentHandler interface……Page 260
7.2 Indexing XML……Page 261
7.2.1 Parsing and indexing using SAX……Page 262
7.2.2 Parsing and indexing using Digester……Page 265
7.3 Indexing a PDF document……Page 270
7.3.1 Extracting text and indexing using PDFBox……Page 271
7.3.2 Built-in Lucene support……Page 274
7.4 Indexing an HTML document……Page 276
7.4.2 Using JTidy……Page 277
7.4.3 Using NekoHTML……Page 280
7.5 Indexing a Microsoft Word document……Page 283
7.5.1 Using POI……Page 284
7.5.2 Using TextMining.org’s API……Page 285
7.6 Indexing an RTF document……Page 287
7.7 Indexing a plain-text document……Page 288
7.8 Creating a document-handling framework……Page 289
7.8.1 FileHandler interface……Page 290
7.8.2 ExtensionFileHandler……Page 292
7.8.3 FileIndexer application……Page 295
7.8.4 Using FileIndexer……Page 297
7.8.5 FileIndexer drawbacks, and how to extend the framework……Page 298
7.9.1 Document-management systems and services……Page 299
7.10 Summary……Page 300
Tools and extensions……Page 302
8.1 Playing in Lucene’s Sandbox……Page 303
8.2.1 lucli: a command-line interface……Page 304
8.2.2 Luke: the Lucene Index Toolbox……Page 306
8.2.3 LIMO: Lucene Index Monitor……Page 314
8.3 Analyzers, tokenizers, and TokenFilters, oh my……Page 317
8.3.1 SnowballAnalyzer……Page 318
8.4 Java Development with Ant and Lucene……Page 319
8.4.1 Using the <index> task……Page 320
8.4.2 Creating a custom document handler……Page 321
8.5 JavaScript browser utilities……Page 325
8.5.1 JavaScript query construction and validation……Page 326
8.6 Synonyms from WordNet……Page 327
8.6.1 Building the synonym index……Page 329
8.6.2 Tying WordNet synonyms into an analyzer……Page 331
8.6.3 Calling on Lucene……Page 332
8.7 Highlighting query terms……Page 335
8.7.1 Highlighting with CSS……Page 336
8.7.2 Highlighting Hits……Page 338
8.8 Chaining filters……Page 339
8.9 Storing an index in Berkeley DB……Page 342
8.9.1 Coding to DbDirectory……Page 343
8.10 Building the Sandbox……Page 344
8.10.2 Ant in the Sandbox……Page 345
8.11 Summary……Page 346
Lucene ports……Page 347
9.1 Ports’ relation to Lucene……Page 348
9.2.2 API compatibility……Page 349
9.2.3 Unicode support……Page 351
9.3.1 API compatibility……Page 352
9.4 Plucene……Page 353
9.4.1 API compatibility……Page 354
9.5.1 API compatibility……Page 355
9.6 PyLucene……Page 357
9.6.4 Users……Page 358
9.7 Summary……Page 359
Case studies……Page 360
10.1 Nutch: “The NPR of search engines”……Page 361
10.1.1 More in depth……Page 362
10.1.2 Other Nutch features……Page 363
10.2 Using Lucene at jGuru……Page 364
10.2.1 Topic lexicons and document categorization……Page 365
10.2.2 Search database structure……Page 366
10.2.3 Index fields……Page 367
10.2.4 Indexing and content preparation……Page 368
10.2.5 Queries……Page 370
10.2.6 JGuruMultiSearcher……Page 374
10.2.7 Miscellaneous……Page 375
10.3.1 Why choose Lucene?……Page 376
10.3.2 SearchBlox architecture……Page 377
10.3.4 Language support……Page 378
10.4 Competitive intelligence with Lucene in XtraMind’s XM-InformationMinder™……Page 379
10.4.1 The system architecture……Page 382
10.4.2 How Lucene has helped us……Page 385
10.5 Alias-i: orthographic variation with Lucene……Page 386
10.5.1 Alias-i application architecture……Page 387
10.5.2 Orthographic variation……Page 389
10.5.3 The noisy channel model of spelling correction……Page 390
10.5.4 The vector comparison model of spelling variation……Page 391
10.5.5 A subword Lucene analyzer……Page 392
10.5.7 Mixing in context……Page 395
10.6 Artful searching at Michaels.com……Page 396
10.6.1 Indexing content……Page 397
10.6.2 Searching content……Page 402
10.6.3 Search statistics……Page 405
10.7.1 Building better search capability……Page 406
10.7.2 High-level infrastructure……Page 408
10.7.3 Building the index……Page 409
10.7.4 Searching the index……Page 412
10.7.5 Configuration: one place to rule them all……Page 414
10.7.6 Web tier: TheSeeeeeeeeeeeerverSide?……Page 418
10.7.7 Summary……Page 419
10.8 Conclusion……Page 420
Installing Lucene……Page 422
A.1 Binary installation……Page 423
A.2 Running the command-line demo……Page 424
A.3 Running the web application demo……Page 425
A.4 Building from source……Page 426
A.5 Troubleshooting……Page 427
Lucene index format……Page 428
B.1 Logical index view……Page 429
B.2.1 Understanding the multifile index structure……Page 430
B.2.2 Understanding the compound index structure……Page 434
B.2.3 Converting from one index structure to the other……Page 435
B.3.1 Calculating the number of open files……Page 436
B.3.2 Comparing performance……Page 437
B.4.1 Inside the index……Page 439
B.5 Summary……Page 442
Resources……Page 443
C.3 Term vectors……Page 444
C.7 Miscellaneous……Page 445
C.9.1 Conference papers……Page 446
C.9.2 U.S. Patents……Page 447
C……Page 450
H……Page 451
L……Page 452
P……Page 453
S……Page 454
W……Page 455
Z……Page 456