Architecture Components
Crawler
- the source of the Web data
- scalable, parallelizable, network-friendly
- continuous crawling
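
One way to read "network-friendly" is a URL frontier that never hits the same host more often than a fixed delay allows. The sketch below is illustrative only: the class name `PoliteFrontier`, the `delay_seconds` parameter, and the explicit `now` argument are assumptions made for this example, not part of the described system.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Hypothetical URL frontier that spaces out requests to each host."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready = []                   # heap of (earliest_fetch_time, host)
        self.next_ok = {}                 # host -> earliest time of next fetch
        self.seen = set()                 # URLs already enqueued

    def add(self, url, now=0.0):
        """Enqueue a URL once; schedule its host if it was idle."""
        if url in self.seen:
            return
        self.seen.add(url)
        host = urlsplit(url).netloc
        was_idle = not self.queues[host]
        self.queues[host].append(url)
        if was_idle:
            when = max(now, self.next_ok.get(host, 0.0))
            heapq.heappush(self.ready, (when, host))

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.queues[host]:
            # Host still has work: reschedule it after the politeness delay.
            heapq.heappush(self.ready, (self.next_ok[host], host))
        return url
```

Because hosts are scheduled independently, several such frontiers could each own a disjoint set of hosts, which is one possible route to the parallelism the slide asks for.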
Repository
- the storage of web data
- supports online updates
- efficient use of storage (e.g. compression)
- random access and efficient serial (streaming) access
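
The repository requirements can be illustrated with an append-only log of compressed records plus an offset index. This is a minimal in-memory sketch under stated assumptions: `PageStore`, its record layout, and the use of `zlib` are choices made for the example, not the actual storage design.

```python
import io
import struct
import zlib


class PageStore:
    """Hypothetical page repository: compressed append-only log plus index."""

    def __init__(self):
        self.log = io.BytesIO()  # stands in for a file on disk
        self.index = {}          # url -> offset of the latest record

    def put(self, url, content):
        """Online update: append a new version and repoint the index."""
        key = url.encode("utf-8")
        body = zlib.compress(content)
        offset = self.log.seek(0, io.SEEK_END)
        # Record layout: two big-endian lengths, then key, then body.
        self.log.write(struct.pack(">II", len(key), len(body)))
        self.log.write(key)
        self.log.write(body)
        self.index[url] = offset

    def _read_at(self, offset):
        self.log.seek(offset)
        key_len, body_len = struct.unpack(">II", self.log.read(8))
        key = self.log.read(key_len).decode("utf-8")
        body = zlib.decompress(self.log.read(body_len))
        return key, body, offset + 8 + key_len + body_len

    def get(self, url):
        """Random access by URL through the offset index."""
        offset = self.index.get(url)
        return None if offset is None else self._read_at(offset)[1]

    def scan(self):
        """Serial access: stream the latest version of each page in log order."""
        end = self.log.seek(0, io.SEEK_END)
        offset = 0
        while offset < end:
            key, body, next_offset = self._read_at(offset)
            if self.index.get(key) == offset:  # skip superseded versions
                yield key, body
            offset = next_offset
```

In this sketch superseded versions stay in the log until some compaction step removes them, so update speed is bought with extra space.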
Feature index
- Storage of "processed" information (e.g. PageRank, forward links, genre, reading level, ...)
- Easy to add new features
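
"Easy to add new features" suggests a layout in which each feature is stored independently of the others. A minimal sketch, assuming one column per feature keyed by a page identifier (`FeatureIndex` and its method names are hypothetical):

```python
class FeatureIndex:
    """Hypothetical feature index: one independent column per feature."""

    def __init__(self):
        self.columns = {}  # feature name -> {page_id: value}

    def add_feature(self, name, values):
        """Register a new feature without touching existing columns."""
        if name in self.columns:
            raise ValueError(f"feature already exists: {name}")
        self.columns[name] = dict(values)

    def get(self, page_id, name, default=None):
        """Look up one feature value for one page."""
        return self.columns[name].get(page_id, default)

    def features_of(self, page_id):
        """Collect every feature that has a value for this page."""
        return {
            name: column[page_id]
            for name, column in self.columns.items()
            if page_id in column
        }
```

Adding a feature such as reading level is then a single `add_feature` call, and pages without a value for it are simply absent from that column.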
Multicast engine
- Distribution of data across the Internet
- Subscription-based
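
Subscription-based distribution can be sketched as a publish/subscribe loop in which each client registers a filter and receives only matching pages. The version below runs in a single process and ignores the network entirely; `MulticastEngine`, the predicate-based filter, and the callback interface are assumptions made for illustration.

```python
class MulticastEngine:
    """Hypothetical in-process stand-in for subscription-based distribution."""

    def __init__(self):
        self.subscribers = {}  # client name -> (predicate, callback)

    def subscribe(self, name, predicate, callback):
        """Register a client with a filter over (url, content) pairs."""
        self.subscribers[name] = (predicate, callback)

    def unsubscribe(self, name):
        self.subscribers.pop(name, None)

    def publish(self, url, content):
        """Deliver one page to every matching subscriber; return the count."""
        delivered = 0
        for predicate, callback in list(self.subscribers.values()):
            if predicate(url, content):
                callback(url, content)
                delivered += 1
        return delivered
```

A client interested only in one site might subscribe with a predicate such as `lambda url, content: "example.org" in url`.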