Challenges
Huge information space
- Wide-area distribution
- URL space (to remember while crawling; frontier sketch after this list)
- Web content (to store)
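Remembering the URL space is typically split into a frontier (URLs still to fetch) and a seen set (URLs already discovered), so each page is queued at most once. A minimal sketch, assuming MD5 fingerprints to keep the seen set compact; the class and method names are illustrative:

```python
import hashlib
from collections import deque

class Frontier:
    """Queue of URLs to crawl plus a compact record of URLs already seen."""

    def __init__(self, seeds):
        self.queue = deque()
        self.seen = set()          # stores 16-byte digests, not full URLs
        for url in seeds:
            self.add(url)

    def _fingerprint(self, url):
        # Hashing gives a small, fixed memory cost per remembered URL.
        return hashlib.md5(url.encode("utf-8")).digest()

    def add(self, url):
        fp = self._fingerprint(url)
        if fp not in self.seen:    # queue each URL only once
            self.seen.add(fp)
            self.queue.append(url)

    def next(self):
        return self.queue.popleft() if self.queue else None
```

Large crawls outgrow an in-memory set; Bloom filters or on-disk tables are the usual next step, trading a small false-positive rate or extra I/O for bounded memory.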
Limited resources
- Disk
- Time
- Memory
- Bandwidth
- Server administrator tolerance (politeness sketch after this list)
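Bandwidth limits and administrator tolerance are usually respected with per-host politeness: space requests to any one host by a minimum delay. A minimal sketch, assuming a fixed delay; real crawlers often derive it from a robots.txt Crawl-delay directive or observed response times:

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last_hit = {}  # host -> timestamp of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_hit.get(host, 0.0) + self.delay
        if now < earliest:
            time.sleep(earliest - now)   # block until the host is "cool"
        self.last_hit[host] = time.monotonic()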
Continuous evolution
- More pages
- Pages change/disappear
- Mirror sites installed (duplicate content; fingerprint sketch after this list)
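Detecting changed pages and mirrored copies both reduce to comparing content fingerprints: an unchanged hash means the stored copy is still fresh, while the same hash under two URLs suggests a mirror. A minimal sketch using exact SHA-1 hashing; production systems often use near-duplicate methods such as shingling or simhash, and the names here are illustrative:

```python
import hashlib

def fingerprint(content: bytes) -> str:
    return hashlib.sha1(content).hexdigest()

stored = {}    # url -> fingerprint from the last crawl of that url
by_hash = {}   # fingerprint -> first url seen with this content

def classify(url: str, content: bytes) -> str:
    fp = fingerprint(content)
    if stored.get(url) == fp:
        return "unchanged"                       # no need to re-store
    stored[url] = fp
    if fp in by_hash and by_hash[fp] != url:
        return "duplicate of " + by_hash[fp]     # likely mirror
    by_hash.setdefault(fp, url)
    return "new or changed"
```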
Crawling issues
- Data ‘fiefdoms’: firewalls; access permissions; load controls
- Overhead per site: DNS lookups; processing robots.txt (robots sketch below)
- Parallelization (thread-pool sketch below)
- Ability to interrupt & restart (checkpoint sketch below)
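The per-site cost of fetching and parsing robots.txt is amortized by caching one parsed copy per host; Python's standard urllib.robotparser does the parsing. A sketch with a hypothetical user-agent name (DNS overhead is cut the same way, by a resolver cache, not shown):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robots_cache = {}   # host -> parsed robots.txt, fetched at most once

def allowed(url, agent="example-crawler"):   # agent name is illustrative
    host = urlparse(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()        # one network fetch per host
        except OSError:
            pass             # fetch failed: the empty parser denies by default
        _robots_cache[host] = rp
    return rp.can_fetch(agent, url)
```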
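Parallelization usually means many fetchers draining a shared frontier. A minimal thread-pool sketch; the fetch body is a stand-in, and a real crawler combines this with the politeness gate and robots check above so workers never gang up on one host:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Stand-in fetch; a real crawler adds retries, politeness delays,
    # and robots.txt checks before each request.
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

urls = ["https://example.com/", "https://example.org/"]  # illustrative seeds
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, body in pool.map(fetch, urls):
        print(url, len(body), "bytes")
```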
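Interrupt-and-restart comes down to checkpointing crawler state (frontier and seen set) so a crash or planned shutdown loses at most one checkpoint interval of work. A minimal sketch with pickle; the file name is illustrative, and the atomic rename keeps a crash from leaving a half-written checkpoint:

```python
import os
import pickle

CHECKPOINT = "crawl_state.pkl"   # illustrative path

def save_state(frontier, seen, path=CHECKPOINT):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"frontier": list(frontier), "seen": seen}, f)
    os.replace(tmp, path)        # atomic swap: old checkpoint stays valid

def load_state(path=CHECKPOINT):
    if not os.path.exists(path):
        return [], set()         # fresh crawl
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["frontier"], state["seen"]
```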