
Category Value
Available via http://dbpubs.stanford.edu/pub/2002-40
Submitted on 11th of July 2002
Author Arasu, Arvind; Garcia-Molina, Hector
Title Extracting Structured Data from Web Pages
Date of publication 1st of July 2002
Citation Arasu, Arvind; Garcia-Molina, Hector. Extracting Structured Data from Web Pages,
Number of pages 30
Language English
Project Stanford InfoLab; Database Group; Digital Libraries
Type Technical Report
Subject group Databases and the Web; Data Integration and Mediation
Abstract Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from the web pages without any learning examples or other similar human input. We formally define the notion of a template, and propose a model that describes how values are encoded into pages using a template. We present an extraction algorithm that uses sets of words that have a similar occurrence pattern in the input pages to construct the template. The constructed template is then used to extract values from the pages. We show experimentally that the extracted values make semantic sense in most cases.
Keywords Automatic Data Extraction
Fulltext source
  • Postscript (ps, ps.gz, ps.zip)
  • PDF (pdf, pdf.gz, pdf.zip)
  • Plain text (text, text.gz, text.zip)
  • Management of the document by pubs@db.stanford.edu

    Stanford InfoLab Publication Server