title: Finding near-replicas of documents on the web
creator: Shivakumar, N.
creator: Garcia-Molina, H.
subject: Databases and the Web
description: We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers and web archivers, and in the presentation of search results, among other applications. We report statistics on how common replication is on the web, and on the cost of computing the above information for a relatively large subset of the web: about 24 million web pages, corresponding to about 150 gigabytes of textual information.
date: 1998
type: Conference or Workshop Item
type: NonPeerReviewed
format: application/pdf
identifier: http://ilpubs.stanford.edu:8090/325/1/1998-31.pdf
identifier: Shivakumar, N. and Garcia-Molina, H. (1998) Finding near-replicas of documents on the web. In: International Workshop on the Web and Databases (WebDB 1998), March 27-28, 1998, Valencia, Spain.
relation: http://ilpubs.stanford.edu:8090/325/