
Category Value
Available via http://dbpubs.stanford.edu/pub/2001-19
Previous version 2000-36
Submitted on 18th of May 2001
Author Raghavan, Sriram; Garcia-Molina, Hector
Title Crawling the Hidden Web
Date of publication 2001
Published in To appear in VLDB 2001
Citation Raghavan, Sriram; Garcia-Molina, Hector. Crawling the Hidden Web, To appear in VLDB 2001
Number of pages 12
Language English
Project Digital Libraries
Type Conference or Journal Paper
Subject group Databases and the Web
Abstract Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content "hidden" behind search forms, in large searchable electronic databases. In this paper, we address the problem of designing a crawler capable of extracting content from this hidden Web. We introduce a generic operational model of a hidden Web crawler and describe how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. We also present results from experiments conducted to test and validate our techniques.
Fulltext source
  • Postscript (ps, ps.gz, ps.zip)
  • PDF (pdf, pdf.gz, pdf.zip)
  • Plain text (text, text.gz, text.zip)
  • Management of the document by pubs@db.stanford.edu



    Stanford InfoLab Publication Server