title: Parallel Crawlers
creator: Cho, Junghoo
creator: Garcia-Molina, Hector
subject: Databases and the Web
description: In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
publisher: Stanford
date: 2002-02-19
type: Techreport
type: NonPeerReviewed
format: application/pdf
identifier: http://ilpubs.stanford.edu:8090/733/1/2002-9.pdf
identifier: Cho, Junghoo and Garcia-Molina, Hector (2002) Parallel Crawlers. Technical Report. Stanford.
relation: http://ilpubs.stanford.edu:8090/733/