Architecture Components

Crawler
- the source of the Web data
- scalable, parallelizable, network-friendly
- continuous crawling
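
One way to read "network-friendly" is per-host politeness: never hit the same server more often than a fixed delay allows. The sketch below illustrates that idea only; the class name PoliteFrontier, the delay_seconds parameter, and the explicit now argument are assumptions made for the example and do not come from the system described here.

```python
import heapq
from urllib.parse import urlsplit


class PoliteFrontier:
    """Hypothetical URL frontier that spaces out requests to the same host."""

    def __init__(self, delay_seconds=1.0):
        self._delay = delay_seconds
        self._next_ok = {}  # host -> earliest time the next request may start
        self._heap = []     # (ready_time, insertion_order, url)
        self._seq = 0

    def add(self, url, now):
        """Queue a URL, scheduling it after any earlier URL for the same host."""
        host = urlsplit(url).hostname
        ready = max(now, self._next_ok.get(host, now))
        self._next_ok[host] = ready + self._delay
        heapq.heappush(self._heap, (ready, self._seq, url))
        self._seq += 1

    def pop_ready(self, now):
        """Return the next URL whose host may be contacted at time `now`, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None
```

Because the clock is passed in rather than read inside the class, several crawler workers could share one frontier and the schedule stays easy to test.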

Repository
- the storage of web data
- supports online updates
- maximizes storage efficiency (e.g. compression)
- supports random access and efficient serial (streaming) access
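
The repository requirements above can be illustrated with a toy append-only store: records are compressed, an offset index gives random access, and a scan streams pages serially. This is a minimal sketch under stated assumptions, not the actual repository design; the name PageStore, the record layout, and the in-memory buffer standing in for a disk file are all hypothetical.

```python
import io
import struct
import zlib


class PageStore:
    """Hypothetical append-only page store with compression and an offset index."""

    def __init__(self):
        self._log = io.BytesIO()  # stands in for an on-disk log file
        self._offsets = {}        # url -> offset of the newest record for that url

    def put(self, url, html):
        """Online update: append a compressed record; it supersedes older ones."""
        key = url.encode("utf-8")
        body = zlib.compress(html.encode("utf-8"))
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(struct.pack(">II", len(key), len(body)) + key + body)
        self._offsets[url] = offset

    def _read_at(self, offset):
        self._log.seek(offset)
        key_len, body_len = struct.unpack(">II", self._log.read(8))
        url = self._log.read(key_len).decode("utf-8")
        html = zlib.decompress(self._log.read(body_len)).decode("utf-8")
        return url, html

    def get(self, url):
        """Random access: a single seek through the offset index."""
        return self._read_at(self._offsets[url])[1]

    def scan(self):
        """Serial access: yield the newest version of each page in log order."""
        for _, offset in sorted(self._offsets.items(), key=lambda kv: kv[1]):
            yield self._read_at(offset)
```

Superseded records stay in the log in this sketch; a real store would need some form of compaction to reclaim that space.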

Feature index
- Storage of "processed" information (e.g. PageRank, forward links, Genre, reading level, ...)
- Easy to add new features
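
"Easy to add new features" suggests a design where each feature is a separately registered function rather than a fixed schema. The sketch below shows one possible shape for that; FeatureIndex, register, index, and lookup are illustrative names, and the word-count and link-count extractors are stand-ins for real features such as PageRank or reading level.

```python
class FeatureIndex:
    """Hypothetical per-page feature table with pluggable extractors."""

    def __init__(self):
        self._extractors = {}  # feature name -> function(page_text) -> value
        self._rows = {}        # url -> {feature name: value}

    def register(self, name, extractor):
        """Add a new feature without touching existing ones."""
        self._extractors[name] = extractor

    def index(self, url, text):
        """Compute every registered feature for one page and store the results."""
        row = self._rows.setdefault(url, {})
        for name, extractor in self._extractors.items():
            row[name] = extractor(text)

    def lookup(self, url, name):
        """Return a stored feature value, or None if it was never computed."""
        return self._rows.get(url, {}).get(name)


features = FeatureIndex()
features.register("word_count", lambda text: len(text.split()))
features.index("http://a.example/", "see http://b.example/ now")

# Adding a feature later is one register call plus a re-index of the pages.
features.register("link_count", lambda text: text.count("http://"))
features.index("http://a.example/", "see http://b.example/ now")
```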

Multicast engine
- Distribution of data across the Internet
- Subscription-based
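
The subscription model can be sketched as a small publish/subscribe dispatcher. This runs in a single process and only illustrates the subscription logic, not network-level multicast; the Distributor class and the topic strings are assumptions made for the example.

```python
from collections import defaultdict


class Distributor:
    """Hypothetical in-process stand-in for a subscription-based data feed."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register a client callback for one topic (e.g. a site or data type)."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, url, payload):
        """Deliver one item to every subscriber of the topic; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(url, payload)
        return len(callbacks)


received = []
feed = Distributor()
feed.subscribe("pages", lambda url, payload: received.append((url, payload)))
delivered = feed.publish("pages", "http://a.example/", "<p>one</p>")
ignored = feed.publish("features", "http://a.example/", {"word_count": 1})
```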

Crawler