Category Value
Available via http://dbpubs.stanford.edu/pub/2000-36
Next version(s) 2001-19
Submitted on 8th of December 2000
Author Raghavan, Sriram; Garcia-Molina, Hector
Title Crawling the Hidden Web
Date of publication 2000
Citation Raghavan, Sriram; Garcia-Molina, Hector. Crawling the Hidden Web,
Number of pages 25
Language English
Project Digital Libraries
Type Technical Report
Subject group Databases and the Web; Miscellaneous
Abstract Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high-quality content "hidden" behind search forms, in large searchable electronic databases. In this paper, we provide a framework for addressing the problem of extracting content from this hidden Web. At Stanford, we have built a task-specific hidden Web crawler called the Hidden Web Exposer (HiWE). We describe the architecture of HiWE and present a number of novel techniques that went into its design and implementation. We also present results from experiments we conducted to test and validate our techniques.
Keywords Crawling, Hidden Web, Content extraction, HTML Forms
Fulltext source
  • Postscript (ps, ps.gz, ps.zip)
  • PDF (pdf, pdf.gz, pdf.zip)
  • Plain text (text, text.gz, text.zip)
  • Management of the document by pubs@db.stanford.edu

    Stanford InfoLab Publication Server