Challenge (crawler)
Huge “URL” space
100M URLs × 40 bytes = 4 GB
Overhead per site
DNS lookup of site IPs
robots.txt
Site-based rules
Exclusion of certain sites or directories
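
The per-site overhead and site-based rules above can be sketched as a small cache that pays the DNS lookup and robots.txt parse once per host and then answers "may this URL be crawled?" for every later URL on that host. This is only an illustrative sketch, not a description of any particular crawler: the class name SiteCache, the fetch_robots callback, the excluded_hosts / excluded_prefixes parameters, and the "example-crawler" agent string are all hypothetical. The resolver and the robots.txt fetcher are passed in so the sketch can run without network access.

```python
import socket
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class SiteCache:
    """Hypothetical helper: do per-site work once per host, then reuse it."""

    def __init__(self, resolve=socket.gethostbyname, fetch_robots=None,
                 excluded_hosts=(), excluded_prefixes=(),
                 agent="example-crawler"):
        self.resolve = resolve                          # host -> IP string
        # Assumption: fetch_robots(host) returns robots.txt text ("" if none).
        self.fetch_robots = fetch_robots or (lambda host: "")
        self.excluded_hosts = set(excluded_hosts)       # whole sites to skip
        self.excluded_prefixes = tuple(excluded_prefixes)  # "host/dir/" entries
        self.agent = agent
        self.ip = {}      # host -> cached IP
        self.robots = {}  # host -> cached, already parsed robots.txt rules

    def _site(self, host):
        # First URL on a host pays for DNS and robots.txt; later ones hit the cache.
        if host not in self.ip:
            self.ip[host] = self.resolve(host)
            parser = RobotFileParser()
            parser.parse(self.fetch_robots(host).splitlines())
            self.robots[host] = parser
        return self.ip[host], self.robots[host]

    def allowed(self, url):
        parts = urlsplit(url)
        host = parts.hostname
        # Site-based rules are checked first, so excluded sites cost no lookup.
        if host is None or host in self.excluded_hosts:
            return False
        if (host + parts.path).startswith(self.excluded_prefixes):
            return False
        _, robots = self._site(host)
        return robots.can_fetch(self.agent, url)
```

A crawler loop would call allowed(url) before queuing each URL. With 100M URLs spread over far fewer sites, most calls should be served from the two dictionaries rather than from the network.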