Category Value
Available via http://dbpubs.stanford.edu/pub/1998-31
Submitted on 26th of February 2000
Author Shivakumar, N.; Garcia-Molina, H.
Title Finding near-replicas of documents on the web
Date of publication 1998
Citation N. Shivakumar, H. Garcia-Molina: Finding near-replicas of documents on the web. Proceedings of Workshop on Web Databases (WebDB'98)
Language English
Project Digital Libraries
Type Conference or Journal Paper
Subject group Databases and the Web
Abstract We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers, web archivers and in the presentation of search results, among others. We report statistics on how common replication is on the web, and on the cost of computing the above information for a relatively large subset of the web: about 24 million web pages, which corresponds to about 150 Gigabytes of textual information.
Keywords SCAM, web experiments
Fulltext source
  • Postscript (ps, ps.gz, ps.zip)
  • PDF (pdf, pdf.gz, pdf.zip)
  • Plain text (text, text.gz, text.zip)
  • Management of the document by pubs@db.stanford.edu

    Stanford InfoLab Publication Server