Category Value
Available via http://dbpubs.stanford.edu/pub/1999-39
Submitted on 26th of February 2000
Author Cho, J.; Shivakumar, N.; Garcia-Molina, H.
Title Finding replicated web collections
Date of publication 1999
Citation J. Cho, N. Shivakumar, H. Garcia-Molina: Finding replicated web collections. In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), May 2000
Language English
Project Digital Libraries
Type Conference or Journal Paper
Subject group Databases and the Web
Abstract Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
Keywords Web, database, mirror, replica, copy detection, clustering
Fulltext source
  • Postscript (ps, ps.gz, ps.zip)
  • PDF (pdf, pdf.gz, pdf.zip)
  • Plain text (text, text.gz, text.zip)
  • Management of the document by pubs@db.stanford.edu

    Stanford InfoLab Publication Server