The Stanford WebBase project has been collecting topic-focused snapshots of Web sites. For example, we collected pages from 350 sites every day for several weeks after the Hurricane Katrina disaster. We also collect pages from government Web sites on a regular basis. All the resulting archives are available to the public via fast download streams.
In addition, the project examines how our archives can be explored by historians, sociologists, and public policy professionals.
WebBase was originally funded by the Digital Library Initiatives I and II. During that time the project focused on crawling, indexing, clustering, and searching technology. The company that is now Google spun out into the commercial sphere during this phase.
The focus on providing generally accessible, topically coherent archive snapshots was later supported by NSF Infrastructure grant EIA-0322975. An exploratory effort to learn what tools political scientists need to analyze large Web archives was supported by NSF grant IIS-0624725. The project collaborates with the California Digital Library (CDL) and with Stanford's Communications Department.
For a list of available content, please see our access instructions page. A Web page for initiating downloads is also available.