| Category | Value | ||
| Available via | http://dbpubs.stanford.edu/pub/1998-31 | ||
| Submitted on | 26th of February 2000 | ||
| Author | Shivakumar, N.; Garcia-Molina, H. | ||
| Title | Finding near-replicas of documents on the web | ||
| Date of publication | 1998 | ||
| Citation | N. Shivakumar, H. Garcia-Molina: Finding near-replicas of documents on the web. Proceedings of Workshop on Web Databases (WebDB'98) | ||
| Language | English | ||
| Project | Digital Libraries | ||
| Type | Conference or Journal Paper | ||
| Subject group | Databases and the Web | ||
| Abstract | We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers, web archivers and in the presentation of search results, among others. We report statistics on how common replication is on the web, and on the cost of computing the above information for a relatively large subset of the web: about 24 million web pages, which corresponds to about 150 Gigabytes of textual information. | ||
| Keywords | SCAM, web experiments | ||
| Fulltext source |
| Management of the document by | pubs@db.stanford.edu