Kevin Hemenway, Tara Calishain
ISBN: 9780596005771, 0596005776
The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren’t enough. If you’ve ever wanted your data in a different form than it’s presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you. Spidering Hacks takes you to the next level in Internet data retrieval–beyond search engines–by showing you how to create sp…
Credits; About the Authors; Contributors; Acknowledgments; Kevin; Tara; Preface; Why Spidering Hacks?; How This Book Is Organized; How to Use This Book; Conventions Used in This Book; How to Contact Us; Got a Hack?; Walking Softly; A Crash Course in Spidering and Scraping; Why Spider?; Best Practices for You and Your Spider; Be Liberal in What You Accept; Don’t Limit Your Dataset; Don’t Reinvent the Wheel; Best Practices for You; Choose the most structured format available; If you must scrape HTML, do so sparingly; Use the right tool for the job; Don’t go where you’re not wanted; Choose a good identifier; Make information on your spider readily available; Don’t demand unlimited site access or support; Best Practices for Your Spider; Respect robots.txt; Go light on the bandwidth; Take just enough, and don’t take too often; Anatomy of an HTML Page; Anatomy of an HTML Page; Header Information with the H Tags; List Information with Special HTML Tags; Non-HTML Files; Registering Your Spider; Naming Your Spider; A Web Page About Your Spider; Places to Register Your Spider; Preempting Discovery; Making Contact; Making the Arguments for Your Spider;
Making Your Spider Easy to Find and Learn About; Considering Legal Issues; Keeping Your Spider Out of Sticky Situations; Bad Spider, No Biscuit!; Violating Copyright; Aggregating Data; Competitive Intelligence; Possible Consequences of Misbehaving Spiders; Tracking Legal Issues; Finding the Patterns of Identifiers; Arbitrary Classification Systems Within a Collection; Classification Systems that Use an Established Universal Taxonomy Within a Collection; Classification Systems that Identify Documents Across a Wide Number of Collections; Some Large Collections with ID Numbers; Assembling a Toolbox; Perl Modules; Resources You May Find Helpful; Installing Perl Modules; Example: Installing LWP; Unix and Mac OS X installation via CPAN; Unix and Mac OS X installation by hand; Windows installation via PPM; Simply Fetching with LWP::Simple; More Involved Requests with LWP::UserAgent; Adding HTTP Headers to Your Request; Posting Form Data with LWP; Authentication, Cookies, and Proxies; Authentication; Enabling Cookies; Using Proxies; Handling Relative and Absolute URLs; Secured Access and Browser Attributes; Other Browser Attributes; Respecting Your Scrapee’s Bandwidth; If-Modified-Since; ETags; Compressed Data; Respecting robots.txt; Adding Progress Bars to Your Scripts; The Code; Scraping with HTML::TreeBuilder; Hacking the Hack; Parsing with HTML::TokeParser; The Code; Running the Hack; See Also; WWW::Mechanize 101; Introducing WWW::Mechanize; Using Mech’s Navigation Tools; The Code; Running the Hack; Scraping with WWW::Mechanize; The Code; Running the Hack; In Praise of Regular Expressions; Using Modules to Parse HTML; Watching the Printers: Score One for Regular Expressions; The Code; Not Fragile, but Probably Not Permanent Either