| Available via | http://dbpubs.stanford.edu/pub/2004-34 |
|
Submitted on |
9th of July 2004 |
|
Author |
Cho, Junghoo; Garcia-Molina, Hector; Haveliwala, Taher; Lam, Wang; Paepcke, Andreas; Raghavan, Sriram; Wesley, Gary |
|
Title |
Stanford WebBase Components and Applications |
|
Date of publication |
2004 |
|
Citation |
Cho, Junghoo; Garcia-Molina, Hector; Haveliwala, Taher; Lam, Wang; Paepcke, Andreas; Raghavan, Sriram; Wesley, Gary. Stanford WebBase Components and Applications.
|
Number of pages |
30 |
|
Language |
English |
|
Project |
Digital Libraries |
|
Type |
Technical Report |
|
Subject group |
Databases and the Web; Digital Libraries |
|
Abstract |
We describe the design and performance of WebBase, a tool for Web
research. The system includes a highly customizable crawler, a
repository for collected Web pages, an indexer for both text and
link-related page features, and a high-speed content distribution
facility. The distribution module enables researchers world-wide to
retrieve pages from WebBase, and stream them across the Internet at
high speed. The advantage for researchers is that they need not each
crawl the Web themselves before beginning their research. WebBase has been
used by scores of research and teaching organizations world-wide,
mostly for investigations into Web topology and linguistic content
analysis. After describing the system's architecture, we explain our
engineering decisions for each of the WebBase components and present
the corresponding performance measurements. |
|
Keywords |
WebBase; Web crawler; site crawling; hyperlink indexing; distribution |
|
Contact address |
Wang Lam |
|
Sponsored by |
This work was supported under NSF Grant CS98-92A DLI2. |
|
Fulltext source |
Postscript (ps, ps.gz, ps.zip); PDF (pdf, pdf.gz, pdf.zip) |
|
Management of the document by |
pubs@db.stanford.edu |