Stanford Digital Library Project Working Papers

SIDL-WP-1999-0107

Finding replicated web documents and collections

Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

cho@cs.stanford.edu

Abstract: Many web documents (such as JAVA FAQs) are being replicated on the Internet. In many cases, entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and the ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundred gigabytes of textual data. We also present two real-life case studies in which we used replication information to improve a crawler and a search engine. We report these results for a data set of 60 million web pages (about 400 gigabytes of HTML data) crawled from the web.
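To give a rough sense of what "identifying replicated documents" involves, the sketch below groups documents whose word-chunk fingerprints overlap heavily. This is only a minimal illustration of the general chunk-fingerprinting idea, not the algorithm described in the paper (which is designed to scale to tens of millions of pages); the function names, chunk size, and similarity threshold are illustrative assumptions.

```python
import hashlib

def fingerprint(text: str, chunk_size: int = 4) -> set:
    """Hash every sliding window of `chunk_size` words.

    Documents that share long runs of identical text share many fingerprints.
    """
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(max(1, len(words) - chunk_size + 1))]
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks}

def find_replica_groups(docs: dict, threshold: float = 0.8) -> list:
    """Group documents whose fingerprint sets have Jaccard overlap >= threshold."""
    prints = {url: fingerprint(text) for url, text in docs.items()}
    groups = []
    for url, fp in prints.items():
        for group in groups:
            rep = next(iter(group))  # compare against one representative of the group
            other = prints[rep]
            overlap = len(fp & other) / max(len(fp | other), 1)
            if overlap >= threshold:
                group.add(url)
                break
        else:
            groups.append({url})
    return groups

if __name__ == "__main__":
    # Toy example: two mirrored copies of a FAQ and one unrelated page.
    docs = {
        "site-a/java-faq.html": "Q: What is the JVM? A: The Java Virtual Machine runs bytecode.",
        "site-b/mirror/java-faq.html": "Q: What is the JVM? A: The Java Virtual Machine runs bytecode.",
        "site-c/linux-howto.html": "This document explains how to configure a Linux kernel.",
    }
    for group in find_replica_groups(docs):
        print(sorted(group))
```

At web scale, the pairwise comparison above would be replaced by sorting or hashing the fingerprints so that documents sharing chunks are brought together directly; the paper addresses that scalability problem.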


Note: Papers in this series are in development and are not in a final form for publication or general dissemination. They are subject to change. Please do not quote or further distribute them without explicit permission from the authors.
This paper was created on 02/01/1999 and last revised on 03/29/1999.

Author's Comments:

Status: PRIVATE

The full text of SIDL-WP-1999-0107 is available in PDF.

Revision History

Version  Format  Date       Comments
1        PS      3/29/1999
