Fetching Web Pages from the WebBase Web Page Repository

The InfoLab was formerly the Database Group.
Gary Wesley <Gary at InfoLab.Stanford.Edu>
Updated November 19, 2008


This page describes how to retrieve Web pages from the Stanford WebBase Archive,
a World Wide Web page repository built as part of the Stanford Digital Libraries Project
by members of the Stanford InfoLab.

WB11 is down

WB2 will be down for some hours this week for a disk fix

Congratulations Mr Obama!!!

The Repository


This web repository holds over 117TB (uncompressed size as of September 2008) of web pages intended for research into topics such as web graph analysis and election or disaster press coverage (we have a workbench for press coverage analysis and coding).

The general text crawls are each about 0.5TB compressed (1.5TB uncompressed). Sizes below are compressed. General crawls use the same site list each time, so we now effectively have rudimentary time series data. Lists of sites with page counts are available via the "site list" links below. Our web crawler, or spider, is named WebVac (formerly Pita). We are working in cooperation with the Library of Congress and the California Digital Library.

See also: Building the client software. Architecture diagram. Technical report: Stanford WebBase Components and Applications.

We now have tools for computational sociology in our Web Sociologist's Workbench, which the Stanford Communication Department used for election coverage analysis. Picture of a sample screen. (The letter in each checkbox label is a keyboard shortcut.) Here is a 2007 report on our efforts.

A current project is duplicate detection, because newspaper crawls carry so many duplicates both within and across papers (wire stories, for example). This will be integrated into the
Web Sociologist's Workbench.
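
One simple way to flag such duplicates is to compare the sets of overlapping word sequences (shingles) in two pages. The sketch below is illustrative only and is not the method used in the Workbench; the function names, the 5-word shingle size, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Texts shorter than k words yield one shingle holding all of their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(pages, threshold=0.8, k=5):
    """pages maps URL -> page text (hypothetical input shape).

    Returns (url1, url2, similarity) for every pair at or above the threshold.
    """
    sets = {url: shingles(text, k) for url, text in pages.items()}
    urls = sorted(sets)
    pairs = []
    for i, u in enumerate(urls):
        for v in urls[i + 1:]:
            similarity = jaccard(sets[u], sets[v])
            if similarity >= threshold:
                pairs.append((u, v, similarity))
    return pairs
```

Comparing every pair is quadratic in the number of pages, so anything at crawl scale would need a cheaper candidate-selection step before this comparison.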

We have a collection of the links from each of the general crawls. These are available upon request via ftp.

We have a C++ tool that converts from our format to ARC version 1 format (used by the Internet Archive and Heritrix). We will develop one for WARC when it becomes an ISO and International Internet Preservation Consortium (IIPC) standard. County, city, state, and federal crawls through July 2008 have now been converted to ARC.

Wibbi:

If you don't want to bother with the client because you will not be building custom handlers, there is now a Web interface to the crawls. Several custom filters are available, such as "and" and "or". Wibbi gives slower throughput than our C++ client, even with no filtering. A browser limit on Windows and Linux (except in Opera and Firefox 2.0.0.1+) restricts each download to 4GB. Because the filters run on our server, you can filter more data than that, as long as the filtered result stays under the limit.

If you decide to use the data, please email Gary for our records (and our funders).
We would also appreciate hearing about any papers that come out of your usage.


The Crawls

General Monthly Crawls
US Government
State and Local Governments
Newspapers
Universities
California 2003 Governor Recall
2004 National Elections
2005 California Special Election
Hurricane Katrina aftermath
2006 Mid Term Elections
Virginia Tech shooting

General Crawls
2004   2005   2006   2007  2008


Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl

                                                
WB9        7003     119      343GB         1/2001         Text     general crawl    site list

WB12       7005      44      152GB         3/2002         Text     general crawl    site list (use 2002getpages.pl)

[unavailable]        50      500GB         4/2003         Text     general crawl    site list (has many 0-page sites)

WB1        7006      96      406GB         6/2003         Text     general crawl    site list

WB1        7008      96      423GB         8/2003         Text     general crawl    site list

WB1        7010     102      451GB        10/2003         Text     general crawl    site list

                             526GB
WB8        7012      36                   12/2003         Text     general crawl    site list
           7032      14                   12/2003         Image    general crawl    site list  

2004

WB1        7103      95      450GB         3/2004         Text     general crawl    site list
   
WB1        7114      6       447GB         4/2004         Image    general crawl    site list

                                178GB
[repairing]7105      17                    5/2004         Text     general crawl   site list
WB1        7115       8                    5/2004         Image    general crawl   site list

                                457GB
WB2        7107     11.5                   7/2004         Text     general crawl   site list
           7117      4.2                   7/2004         Image    general crawl   site list
           7127      0.02                  7/2004         Audio    general crawl   site list
           7137      2.3                   7/2004         Other    general crawl   site list

WB13       7108      45      388GB         8/2004         Text     general crawl   site list


                             474GB
WB3        7109      36                    9/2004         Text     general crawl   site list
WB3        7119       7                    9/2004         Image    general crawl   site list

WB4        7190     105       495GB       10/2004         Text     general crawl   site list

                             1561GB
[by special arrangement]
           7192     37                    12/2004         Text      general crawl  site list

           7193     14                    12/2004         Image     general crawl  site list
           7194     0.08                  12/2004         Audio     general crawl  site list
           7195     7.7                   12/2004         Other     general crawl  site list

             


Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl


2005                               

                              980GB
WB8        7601     27                     1/2005         Text       general crawl  site list

           7611     6                      1/2005         Image      general crawl  site list
           7621     0.04                   1/2005         Audio      general crawl  site list
           7631     3.5                    1/2005         Other      general crawl  site list 

The next crawl was deeper: it used a pagemax of 20k pages per site instead of the usual 10k:
WB1        7603     85        440GB        3/2005         Text       general crawl  site list

WB18       7489     0.48      192GB      3-5/2005         Audio      general audio  site list

WB3        7604     98        480GB        4/2005         Text       general crawl  site list 

WB3        7605     79        460GB        5/2005         Text       general crawl  site list
 

WB8        7606    101        503GB        6/2005         Text       general crawl  site list

                              487GB
WB16       7658     9.5                    8/2005         Text       general crawl  site list
           7668     3.4                                   Image                     site list
           7658     .02                                   Audio                     site list
           7678     2                                     Other                     site list       

WB3        7609     97        490GB        9/2005         Text       general crawl  site list

WB3        7610     97        508GB       10/2005         Text       general crawl  site list

WB18       7691     93        527GB       11/2005         Text       general crawl  site list

                              945GB
WB1        7612     20.7                  12/2005         Text       general crawl  site list
           7622      7                    12/2005         Image      general crawl  site list
           7632     0.04                  12/2005         Audio      general crawl  site list
           7642     4.5                   12/2005         Other      general crawl  site list
                                                                                      

                                                                                     


Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl


2006

WB1        7701     98        515GB        1/2006         Text       general crawl  site list

WB19       7702     93        490GB        2/2006         Text       general crawl  site list

WB1        7703     95        497GB        3/2006         Text       general crawl  site list

WB17       7704     92        493GB        4/2006         Text       general crawl  site list

WB17       7705     93        499GB        5/2006         Text       general crawl  site list    

WB10       7706     90        497GB        6/2006         Text       general crawl  site list

WB19       7707     92        501GB        7/2006         Text       general crawl  site list

WB10       7708     93        515GB        8/2006         Text       general crawl  site list

WB8        7709     90        502GB        9/2006         Text       general crawl  site list

WB1        7710     90        497GB        10/2006        Text       general crawl  site list

WB22       7730     10        353GB        10-11/2006     Image      general crawl  site list

WB16       7711     90        506GB        11/2006        Text       general crawl  site list

WB1        7712     90        511GB        12/2006        Text       general crawl  site list

WB1        7713     0.5       222GB        12/2006        Audio      general crawl  site list

                                                                                      

    


Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl


2007

WB2        7118     87        502GB        1/2007         Text       general crawl  site list

WB2        7106    103        590GB        2/2007         Text       general crawl  site list

WB14       7161    102        578GB        3/2007         Text       general crawl  site list

WB5        7163    100        578GB        4/2007         Text       general crawl  site list

WB15       7239     98        573GB        5/2007         Text       general crawl  site list

WB4        7260     98        590GB        6/2007         Text       general crawl  site list

WB1        7263     86        525GB        7/2007         Text       general crawl  site list

WB12       7262     87        514GB        8/2007         Text       general crawl  site list

WB13       7266     79        486GB        9/2007         Text       general crawl  site list

WB3        7272     78        492GB       10/2007         Text       general crawl  site list

WB19       7289     80        497GB       11/2007         Text       general crawl  site list

WB4        7291     79        494GB       12/2007         Text       general crawl  site list



Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl


2008

WB20        7320     81        498GB        1/2008         Text       general crawl  site list

WB7         7298     79        496GB        2/2008         Text       general crawl  site list

WB20        7299     80        507GB        3/2008         Text       general crawl  site list

WB7         7301     68        439GB        4/2008         Text       general crawl  site list

WB22        7476     66        500GB        5/2008         Text       general crawl  site list

WB6         7482     64        485GB        6/2008         Text       general crawl  site list

WB7         7483     75        498GB        7/2008         Text       general crawl  site list

WB2         7495     77        516GB        8/2008         Text       general crawl  site list

WB8         7518     76        522GB        9/2008         Text       general crawl  site list

WB22        7220     77        526GB       10/2008         Text       general crawl  site list

We also crawled a small subset of the general crawl list weekly from January to May 2006 (available on Wibbi):
around 2 million pages and 12.5GB per week, drawn from the highest-ranked sites.
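
The single-line rows in the general crawl tables above follow a regular layout, so the host and port for a given crawl can be pulled out mechanically. The following is a minimal sketch under that assumption; the regular expression, `parse_row`, and the field names are hypothetical, and rows that spread one crawl over several lines (no host, or the size on a line of its own) are simply skipped.

```python
import re

# Matches complete rows such as:
# WB1        7103      95      450GB         3/2004         Text     general crawl    site list
ROW = re.compile(
    r"^(?P<host>WB\d+)\s+(?P<port>\d+)\s+(?P<million_pages>\d*\.?\d+)\s+"
    r"(?P<size_gb>\d+)GB\s+(?P<date>[\d-]+/\d{4})\s+(?P<mimetype>\w+)\s+"
    r"(?P<crawl_type>.+?)\s+site list"
)


def parse_row(line):
    """Return a dict for a complete single-line table row, or None otherwise."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    row["port"] = int(row["port"])
    row["million_pages"] = float(row["million_pages"])
    row["size_gb"] = int(row["size_gb"])  # compressed size, per the note above
    return row
```

For example, `parse_row` applied to the 3/2004 text crawl row yields host `WB1` and port `7103`, which are the two values a client needs in order to connect.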



Specialized Crawls

University

Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl

WB1        7022      .28     1GB         11/2002         Text    U Cal@Berkeley site list

WB8        7050      .35    13GB          8/2003         All     Stanford University www.stanford.edu
[ we crawl 202 Stanford sites in our monthly text crawl ]

WB1        7300      .4      2GB         11/2004         Text    US CS        site list 

                                7.6GB
WB1        7440      .14                  1/2005         Text    U Cal@Berkeley site list
           7641      .07                  1/2005         Image   U Cal@Berkeley site list
           7492      .0001                1/2005         Audio   U Cal@Berkeley site list
           7443      .02                  1/2005         Other   U Cal@Berkeley site list

                              3GB
WB3        7060      .040   1.5GB         6/2005         Text    Stanford University site list
           7061      .038   125MB         6/2005         Image   Stanford University site list
           7062      60pgs                6/2005         Audio   Stanford University site list
           7063      .011   1.4GB         6/2005         Other   Stanford University site list

[ we crawl 202 Stanford sites in our monthly text crawl ]


Government

US Government
    .mil is in the general crawl


Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl


                               213GB
WB4        7567      4.3                  7/2003         Text     US Government  site list
                        270GB
WB3        7506      3.4                  6/2004         Text     US Government  site list
           7516      1.6                  6/2004         Image                   site list
           7516      .003                 6/2004         Audio                   site list
           7536      1.2                  6/2004         Other                   site list

                            274GB
[by request]7508   3.2                    8/2004         Text     US Government  site list
           7518      1.7                  8/2004         Image                   site list
           7538      1.2                  8/2004         Other                   site list

                            259GB
WB1        7509      2.8                  9/2004         Text     US Government  site list
           7519      1.5                  9/2004         Image                   site list
           7529      .006                 9/2004         Audio                   site list
           7539      1.1                  9/2004         Other                   site list

                            274GB
[by request]7570   2.9                    10/2004         Text    US Govt early Oct  site list
           7580      1.5                  10/2004         Image                      site list
           7590      2.2                  10/2004         Other                      site list

                            280GB
[by request]7573     3.0                  10/2004         Text   US Govt ,very late Oct site list
           7583      1.6                  10/2004         Image                         site list
           7563      0.004                10/2004         Audio                         site list
           7593      1.2                  10/2004         Other                         site list

                            283GB                          
WB8        7511      3.0                 11/2004    Text   US Govt+election, early Nov site list
           7521      1.6                 11/2004    Image                              site list
           7531      .004                11/2004    Audio                              site list
           7541      1.3                 11/2004    Other                              site list

                            277GB
[by request]7512      2.9                 12/2004         Text   US Government site list
            7522      1.5                 12/2004         Image                site list
            7532     .004                 12/2004         Audio                site list
            7542      1.2                 12/2004         Other                site list

Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl


2005                         
                                  274GB

[upon request]       3.0                  1/2005         Text   US Government, January site list
           7781      1.5                  1/2005         Image                         site list
           7791     .004                  1/2005         Audio                         site list
           7792      1.2                  1/2005         Other                         site list

                                  483GB
WB3        7644      2.5                  4/2005         Text  US Govt .gov + election site list
           7614      1.3                  4/2005         Image
           7624     .003                  4/2005         Audio
           7634      1.1                  4/2005         Other


The next 3 crawls: maximum of 20,000 pages per site, .gov only
                                    363GB
WB18       7607      4.0                  6-7/2005       Text   US .gov       site list
           7617      2.0                  6-7/2005       Image  US .gov       site list

           7627      .004                 6-7/2005       Audio  US .gov       site list
           7637      1.7                  6-7/2005       Other  US .gov       site list

                                   336GB (updated site list from LOC)
WB4        7799      3.3                  9/2005         Text   US .gov       site list
           7719      1.1                  9/2005         Image  US .gov       site list
           7729                           9/2005         Audio  US .gov       site list
           7739      1.4                  9/2005         Other  US .gov       site list

                                   233GB
WB8        8012      2.2                 12/2005         Text   US .gov       site list
           8022      1.1                 12/2005         Image  US .gov       site list
           8032      0.004               12/2005         Audio  US .gov       site list
           8042      1.0                 12/2005         Other  US .gov       site list

From here on, we crawl up to 150,000 pages per .gov site, to a depth of 12, on a quarterly basis.
For the crawls below, we have removed the ca.gov sites, which are the state sites for California.
The ca.gov pages amount to about 100GB per crawl and can be made available upon request. They are also
included in the state crawls.
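
The two limits described above (a cap on pages per site and a maximum link depth) can be illustrated with a breadth-first traversal. This is a sketch only, not WebVac's implementation: `crawl_site`, `get_links`, and the convention that the start page sits at depth 0 are assumptions, and the link graph is an in-memory stand-in for real page fetching.

```python
from collections import deque


def crawl_site(start_url, get_links, max_pages=150_000, max_depth=12):
    """Breadth-first crawl of one site, bounded by page count and link depth.

    get_links(url) is a hypothetical callable returning the same-site links
    found on a page. Returns the URLs in the order they were fetched.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth); the start page is depth 0
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


# Usage with a toy link graph standing in for a real site.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(crawl_site("a", lambda u: graph.get(u, []), max_depth=2))
```

Breadth-first order means that when the page cap is reached, the pages kept are the ones closest to the start page.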

2006                              484GB
WB2        8001      5.0                  3/2006          Text   US .gov         site list
           8011      2.7                  3/2006          Image  US .gov         site list
           8021      0.007                3/2006          Audio  US .gov         site list
           8031      1.9                  3/2006          Other  US .gov         site list
                                                                       
                                   658GB
WB14       8041      7.6                 6-7/2006         Text   US .gov         site list
           8051      3.6                 6-7/2006         Image  US .gov         site list
           8052      0.01                6-7/2006         Audio  US .gov         site list
           8053      3.0                 6-7/2006         Other  US .gov         site list

                                   726GB
WB1        7100      6.6                 9-10/2006        Text   US .gov         site list
           7101      3.0                                  Image                  site list
           7102      0.01                                 Audio                  site list
           7104      2.8                                  Other                  site list
                                                                    
                                    609GB
WB12       7149      7.6                  12/2006         Text   US .gov         site list
           7150      3.2                                  Image                  site list
           7151      0.01                                 Audio                  site list
           7152      2.9                                  Other                  site list
                                                                      
2007                              681GB
WB2        7157      8.1                  3/2007          Text   US .gov         site list
           7158      3.4                                  Image                  site list
           7159      0.01                                 Audio                  site list
           7160      3.1                                  Other                  site list

(We updated our list of sites here.)

                                     613GB
WB1        7255      7.0                  6/2007          Text   US .gov         site list
           7256      3.0                                  Image                  site list
           7257      0.01                                 Audio                  site list
           7258      2.8                                  Other                  site list
 
                                     636GB
WB6        7267      5.5                  9/2007          Text   US .gov         site list
           7268      2.7                                  Image                  site list
           7269      0.01                                 Audio                  site list
           7270      2.4                                  Other                  site list
   (From here on, California's ca.gov is crawled only as part of the state crawls.)

 
                                    629 GB
WB7        7292      5.4                 12/2007          Text   US .gov         site list
           7293      2.5                                  Image                  site list
           7295      0.01                                 Audio                  site list
           7296      2.3                                  Other                  site list
         
2008

                                    654 GB
WB7        7369      7.4                  3/2008          Text   US .gov         site list
           7370      3.4                                  Image                  site list
           7371      0.01                                 Audio                  site list
           7372      3.0                                  Other                  site list

                                    650 GB
WB6        7484      5.5                  6/2008          Text   US .gov         site list
           7485      2.5                                  Image                  site list
           7488      0.01                                 Audio                  site list
           7490      2.7                                  Other                  site list

                                    755 GB
WB2        7510      8.0                  9/2008          Text   US .gov         site list
           7513      3.2                                  Image                  site list
           7514      0.01                                 Audio                  site list
           7515      3.3                                  Other                  site list






 
State, County and Local Governments


Host       Port     Million pgs Size        Date         Mimetype   Type of web crawl


These site lists were compiled from http://www.statelocalgov.net

State                               210GB
WB13       7204      2.3                  5/2005         Text    State govt   site list
           7214      0.7                  5/2005         Image   State govt   site list
           7224     .005                  5/2005         Audio   State govt   site list
           7234      1.4                  5/2005         Other   State govt   site list

County                              90GB
WB8        7264      1.2                  5/2005         Text    County govt  site list
           7274      0.5                  5/2005         Image   County govt  site list
           7284     .060                  5/2005         Audio   County govt  site list   
           7294      0.5                  5/2005         Other   County govt  site list

City and town                       187GB
WB13       7664      2.5                  5/2005         Text    City govt    site list
           7674      1.2                  5/2005         Image   City govt    site list
           7684     .001                  5/2005         Audio   City govt    site list
           7694      1.0                  5/2005         Other   City govt    site list  

 
Post-Katrina crawl
State                               217GB

WB2        7465      2.1                  9/2005         Text    State govt   site list
           7466      0.7                  9/2005         Image   State govt   site list
           7467     .060                  9/2005         Audio   State govt   site list
           7468      1.3                  9/2005         Other   State govt   site list    


2006

State                              280GB
WB17       7365      2.0                  4/2006         Text    State govt    site list  
           7366      0.7                  4/2006         Image   State govt    site list
           7367     .006                  4/2006         Audio   State govt    site list
           7368      1.3                  4/2006         Other   State govt    site list
      
County                            115GB
WB1        7364      1.2                  4/2006         Text    County govt   site list
           7374      0.4                  4/2006         Image   County govt   site list
           7384     .002                  4/2006         Audio   County govt   site list
           7394      0.6                  4/2006         Other   County govt   site list

City and town                     237GB
WB16       7165      2.7                  4/2006         Text    City govt     site list
           7175      1.1                  4/2006         Image   City govt     site list
           7185      0.001                4/2006         Audio   City govt     site list
           7186      1.2                  4/2006         Other   City govt     site list


State                            251GB
WB15       7395      2.4                  9/2006         Text    State govt    site list
           7966      0.7                  9/2006         Image   State govt    site list
           7367     .008                  9/2006         Audio   State govt    site list
           7968      1.5                  9/2006         Other   State govt    site list

County                           126GB
WB1        7964      1.2                  9/2006         Text    County govt   site list
           7974      0.4                  9/2006         Image   County govt   site list
           7987     .002                  9/2006         Audio   County govt   site list
           7407      0.7                  9/2006         Other   County govt   site list

City and town                     258GB
WB18       7965      2.9                  9/2006         Text    City govt     site list
           7975      1.1                  9/2006         Image   City govt     site list
           7985     .002                  9/2006         Audio   City govt     site list
           7986      1.3                  9/2006         Other   City govt     site list


State                             263GB
WB4        7133      2.4                 12/2006         Text    State govt    site list
           7138      0.7                 12/2006         Image   State govt    site list
           7139     .008                 12/2006         Audio   State govt    site list
           7140      1.5                 12/2006         Other   State govt    site list

County                           129GB
WB1        7141      1.3                 12/2006         Text    County govt   site list
           7142      0.5                 12/2006         Image   County govt   site list
           7143     .002                 12/2006         Audio   County govt   site list
           7144      0.7                 12/2006         Other   County govt   site list

City and town                   270GB
WB1        7145      2.9                 12/2006         Text    City govt     site list
           7146      1.1                 12/2006         Image   City govt     site list
           7147     .002                 12/2006         Audio   City govt     site list
           7148      1.3                 12/2006         Other   City govt     site list

2007

State sites                        260GB

WB22       7246      2.4                 5/2007         Text    State govt     site list
           7247      0.7                 5/2007         Image   State govt     site list
           7248     .008                 5/2007         Audio   State govt     site list
           7249      1.5                 5/2007         Other   State govt     site list

County sites                       140GB

WB10       7242      1.3                 5/2007         Text    County govt    site list
           7243      0.5                 5/2007         Image   County govt    site list
           7244     .002                 5/2007         Audio   County govt    site list
           7245      0.7                 5/2007         Other   County govt    site list

City and town sites               279GB
WB11       7236      2.9                 5/2007         Text    City govt      site list
           7237      1.1                 5/2007         Image   City govt      site list
           7240     .002                 5/2007         Audio   City govt      site list
           7241      1.3                 5/2007         Other   City govt      site list     

(We updated the list of sites to be crawled here.)

State sites                        296 GB
WB8        7273      2.3                 10/2007         Text    State govt    site list
           7275      0.7                 10/2007         Image   State govt    site list
           7276      .01                 10/2007         Audio   State govt    site list
           7277      1.6                 10/2007         Other   State govt    site list

County sites                       143 GB
WB4        7278      1.4                 10/2007         Text    County govt   site list
           7280      0.4                 10/2007         Image   County govt   site list
           7281     .002                 10/2007         Audio   County govt   site list
           7282      0.8                 10/2007         Other   County govt   site list

City and town sites                301 GB
WB7        7283      3.1                 10/2007         Text    City govt     site list
           7285      1.1                 10/2007         Image   City govt     site list
           7286     .003                 10/2007         Audio   City govt     site list
           7287      1.5                 10/2007         Other   City govt     site list     

2008
State sites                        309GB
WB11       7436      2.2                 5/2008         Text    State govt     site list
           7437      0.7                 5/2008         Image   State govt     site list
           7438     .008                 5/2008         Audio   State govt     site list
           7439      1.6                 5/2008         Other   State govt     site list

County sites                       153GB
WB22       7456      1.4                 5/2008         Text    County govt    site list
           7457      0.4                 5/2008         Image   County govt    site list
           7458     .002                 5/2008         Audio   County govt    site list
           7459      0.8                 5/2008         Other   County govt    site list

City and town sites                327GB
WB19       7469      3.1                 5/2008         Text    City govt      site list
           7470      1.5                 5/2008         Image   City govt      site list
           7471     .004                 5/2008         Audio   City govt      site list
           7472      1.3                 5/2008         Other   City govt      site list 


State sites                       298GB

WB8        7544      2.2                 11/2008         Text    State govt     site list
           7545      0.7                 11/2008         Image   State govt     site list
           7546     .008                 11/2008         Audio   State govt     site list
           7547      1.6                 11/2008         Other   State govt     site list

County sites                     167GB
WB7        7548      1.4                 11/2008         Text    County govt    site list
           7549      0.4                 11/2008         Image   County govt    site list
           7550     .002                 11/2008         Audio   County govt    site list
           7551      0.8                 11/2008         Other   County govt    site list

City and town sites               324GB
WB1        7552      3.2                 11/2008         Text    City govt      site list
           7553      1.0                 11/2008         Image   City govt      site list
           7554     .004                 11/2008         Audio   City govt      site list
           7555      1.5                 11/2008         Other   City govt      site list


   


Host       Port     Million pgs             Date         Mimetype   Type of web crawl  


California 2003 Governor Recall

WB1        7081    .006                   9/26/03          All     California recall site list
WB1        7082    .008                   9/27/03            "     California recall site list
WB1        7083    .2        5GB          9/29/03            "     California circus w/county gov site list site list
WB1        7084    .05      1.3GB         9/30/03            "     California recall site list
WB1        7085    .05                    10/1/03            "     California recall site list
WB1        7086    .05                    10/2/03            "     California recall site list
WB1        7087    .05                    10/3/03            "     California recall site list
WB1        7088    .05                    10/4/03            "     California recall site list
WB1        7089    .05                    10/5/03            "     California recall site list
WB1        7090    .05                    10/6/03            "     California recall site list
WB1        7091    .05                    10/7/03            "     California recall site list
WB1        7092    .05    1.3GB           10/8/03            "     California recall site list

WB1        7094    .05                    10/10/03           "     California recall site list
WB1        7095    .05                    11/04/03           "     California recall site list
WB1        7096    .05                    12/12/03           "     California recall site list

 
2004 American Elections
Available via Wibbi

California  2005 Special Election

Available via ftp

Hurricane Katrina (August 29th 2005) aftermath, Rita and Wilma

( 3 of the 6 most intense Atlantic Hurricanes ever recorded ) 
9/03/05-10/29/05 news, gov and charities (available on Wibbi )
~400 sites crawled daily, 1GB increasing up to 30 GB/day, all mime types.

Also good for researching non-hurricane  press coverage on consecutive days,
for instance doing sociological analysis or topical analysis.
We do not filter by topic, though papers are only Gulf Coast regional press.
Newspaper crawls contain many archival stories and duplicates.

2006 Mid Term Elections

We did daily text crawls of the 40 largest US papers up through the week after election day.
These are about 2.5GB per day. (available on Wibbi )
Newspaper crawls contain many archival stories and duplicates.

Monthly newspaper crawls

There are over 140 US papers in our general monthly crawls,
made available as a separate collection on Wibbi.
560-740k pages per crawl. We could index earlier crawls for
you upon request.
Newspaper crawls contain many archival stories and duplicates.

Virginia Tech Shooting

We crawled regional news, college papers, psychiatric, supremacist,
gun control and European/Indian/Arabic/Korean news sites daily for a couple of weeks.
 (available on Wibbi )
Newspaper crawls contain many archival stories and duplicates.

2008 Presidential Primaries and Election

Crawling the 13 largest US newspapers , plus magazine and candidate sites. (available on Wibbi )
Newspaper crawls contain many archival stories and duplicates.

2008 Hurricane Ike

Crawling the 349 regional and news sites before and for a month after. ( available on Wibbi )


WebVac spider

WebVac crawls depth first, generally to a depth of 7 levels, and fetches a maximum of 10k pages per site.
We only follow links within the domain. Until 2007 our general policy was to gather a 1.5TB sample.
Now we crawl a larger stable (but gradually shrinking) list of sites until the list is done. We retry unavailables several times.
We pause 1 to ( almost always ) 10 seconds between pages, depending on IP address bottlenecks.
For the federal government crawls, we take up to 150,000 pages to 12 levels  over
a fairly static group of sites.
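The crawl policy above (depth first, a depth limit, a per-site page cap, links followed only within the domain) can be illustrated with a minimal sketch. This is not WebVac's code: the function name crawl_site and the get_links callback are hypothetical, the start page is assumed to count as level 0, and the pauses and retries are left out.

```python
from urllib.parse import urlparse


def crawl_site(start_url, get_links, max_depth=7, max_pages=10_000):
    """Depth-first crawl of one site, bounded by depth and page count.

    get_links(url) is a hypothetical callback returning the absolute URLs
    found on a page; it stands in for fetching and parsing.
    """
    domain = urlparse(start_url).netloc
    seen = set()
    order = []                      # pages in the order they were visited
    stack = [(start_url, 0)]        # (url, depth); start page is depth 0 (assumption)
    while stack and len(order) < max_pages:
        url, depth = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        if depth >= max_depth:
            continue                # do not expand pages at the depth limit
        # Push in reverse so the first link on the page is visited first.
        for link in reversed(get_links(url)):
            if urlparse(link).netloc == domain and link not in seen:
                stack.append((link, depth + 1))
    return order
```

Passing max_depth=12 and max_pages=150_000 would correspond to the federal government limits quoted above.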

 


 

Architecture

WebBase Architecture
Overall system screenshot


Client Software ( RunHandlers )

  • If you don't want to bother with the client because you will not be building custom handlers, there is now a  Web interface to get pieces of up to 4GB of the 2003-present crawls on Wibbi
  • These instructions assume Internet access to the machine hosting WebBase data and a  CVS checkout of the WebBase code or an ftp get.
  • We allow specification of machine, port, first site and last site for the stream (e.g. www.ibm.com). The scripts distribrequestor.pl and getpages.pl also take those arguments.  The webpage repository is organised by site, so offset means offset within the site.

  • RunHandlers is supported on 32-bit GNU/Linux and Solaris systems with GNU make (gmake), g++ ( <=  3.4.0), Perl 5.05+, and W3C's libwww.

    1. Fetch the latest WebBase client source code from ftp://db.stanford.edu/pub/webbase.
    2. Unroll the source code. For example, GNU tar can do this with

       > tar xfz webbase-client-????-??-??.tar.gz
    3. Follow the instructions in the source code's README.client.

       > chdir dli2/src/WebBase/ && more README.client
    Build everything:
    (Use a  32-bit Linux box)

    Make sure the library path includes W3C's libwww .
    This library must be installed by a  system administrator with root privileges.

    Make sure environment variable WEBBASE points to WebBase:
    setenv WEBBASE  [absolute path]/WebBase

    (1) Run GNU make:

       WebBase/> ./configure
       WebBase/> make client

    If you get:
       handlers/extract-hosts.h:27:21: WWWCore.h: No such file or directory
       handlers/extract-hosts.h:28:21: HTParse.h: No such file or directory
    your include path may be wrong.
    We expect the libwww headers to be in /usr/local/include/w3c-libwww/
    (e.g. /usr/local/include/w3c-libwww/WWWCore.h),
    so you may need to change this path in Makefile.in and configure.
    Rerun ./configure.

    To use later gcc versions, here's the hack.
    After running ./configure, do the following:

       1. Add -fpermissive to the CPPFLAGS on line 68 of the makefile.
       2. Comment out lines 34 and 35 in hashlookup/hashlookup.h:
             extern unsigned int hashlookup_error;
             extern unsigned int verbose_error;


    (2) Test your build.
         (a) Turn on cat-handler, which simply outputs what it receives.
                In inputs/webbase.conf, set
                CAT_ON = 1
         (b) Try RunHandlers on a  local example file:
                bin/RunHandlers inputs/webbase.conf \
               "file:///handlers/example-50-pages"
              [50 sample pages are printed]

    Now try the network version:

    Method 1:
     Run scripts/distribrequestor.pl to start a  distributor:
     (either chmod +x scripts/*.pl or invoke it with "perl")
    args: (must be in this order)
    # host
    # port
    # num pages
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)
     

    [example run:]
    WebBase/scripts> distribrequestor.pl wb1 7008 100
     distrib daemon returned 171.64.75.151 7160
     (use as ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100" )
    WebBase/scripts>
     Now you can invoke RunHandlers with the above info:
     ( cut and paste it from the echo)
    WebBase/scripts> ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100"
     will print back 100 sample pages.  All instances of RunHandlers connected to
     the above port will share the same pool of pages.  To get an independent
     stream, run distribrequestor.pl to get a  new port.
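
If you script Method 1, the cut-and-paste step can be automated. The sketch below assumes the reply line always has the form shown in the example run ("distrib daemon returned HOST PORT"). The function name runhandlers_command and its default paths are hypothetical and not part of the WebBase distribution.

```python
import re


def runhandlers_command(reply, num_pages,
                        conf="../inputs/webbase.conf",
                        binary="../bin/RunHandlers"):
    """Build the RunHandlers argument list from a distribrequestor.pl reply.

    Assumes the reply contains 'distrib daemon returned HOST PORT', as in
    the example run above.
    """
    match = re.search(r"distrib daemon returned\s+(\S+)\s+(\d+)", reply)
    if match is None:
        raise ValueError("unrecognised distributor reply: %r" % reply)
    host, port = match.group(1), int(match.group(2))
    url = "net://%s:%d/?numPages=%d" % (host, port, num_pages)
    return [binary, conf, url]
```

The returned list can be handed to subprocess.run without shell quoting, since the net:// URL travels as a single argument.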
     

    Method 2:
     You can also use our one-step script getpages.pl (no need to specify a  first site )
    (either chmod +x scripts/*.pl or invoke it with "perl")

    args: (must be in this order)
    # host
    # port
    # num pages
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)

    [example run:]
    WebBase/scripts> getpages.pl 2 wb1 7008 www.ibm.com www.ibm.com (only give me www.ibm.com)
     Starting getpages.pl using Perl 5.6.0
     Do you want to run
    /dfs/sole/6/gary/dli2/src/WebBase//bin/RunHandlers /dfs/sole/6/gary/dli2/src/WebBase//inputs/webbase.conf "net://171.64.75.151:7163/?numPages=2" now?(Y/N):
    WebBase/scripts> Y

    To get all of the page, set CAT_ON = 1 in the inputs/*.conf.

    If you get the ERROR:
    bin/RunHandlers: error while loading shared libraries: libwwwcore.so.0:
    cannot open shared object file: No such file or directory
    you don't have your library path set right.
    Try setting a variable called LD_LIBRARY_PATH in the shell where you're about to run the
    WebBase client.  For example, if you found your libwwwcore.so in
    /opt/somewhere/lib/libwwwcore.so, then you could tell your system:
    setenv LD_LIBRARY_PATH /opt/somewhere/lib

    Note on the output:

    This "junk" is just a separator, so that RunHandlers knows it is getting a new page:
    ==P=>>>>=i===<<<<=T===>=A===<=!Junghoo![...]  -- page separator
    URL: http://www.powa.org/ -- page URL
    Date: June 3, 2004                -- when crawled
    Position: 695                         -- bytes into the site so far
    DocId: 1                                 -- sequential page id within site
    HTTP/1.1 200 OK                -- response to our http request
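
If you post-process the cat-handler output yourself, the per-page header can be picked apart along these lines. This is a sketch under assumptions. It treats any line starting with the visible part of the separator as a page boundary (the rest of the separator is elided above and is not guessed here). It also assumes the four fields appear as "Name: value" lines before the HTTP status line. The name parse_pages is hypothetical.

```python
# Visible prefix of the page separator; the remainder is elided in the text above.
SEPARATOR_PREFIX = "==P=>>>>=i===<<<<=T===>=A===<=!Junghoo!"

# Header names from the note above, mapped to dictionary keys.
FIELDS = {"URL": "url", "Date": "date", "Position": "position", "DocId": "doc_id"}


def parse_pages(lines):
    """Return one dict of header fields per page found in the output lines."""
    pages = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(SEPARATOR_PREFIX):
            current = {}
            pages.append(current)
            continue
        if current is None or "status" in current:
            continue    # before the first page, or past the header block
        if line.startswith("HTTP/"):
            current["status"] = line
            continue
        name, sep, value = line.partition(": ")
        if sep and name in FIELDS:
            key = FIELDS[name]
            current[key] = int(value) if key in ("position", "doc_id") else value
    return pages
```

Everything after the HTTP status line (response headers and page body) is skipped, so a body line that happens to look like "URL: ..." is not mistaken for a header.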


    Death threat:
    If a distributor is inactive for a while, we may kill it so that we can reuse the resources.
    To restart at the same point you must start a new distributor at the offset where it left off
    ( + 1 to prevent getting the previous page again).
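
As a sketch of the restart arithmetic, the helper below builds the argument list for a new distribrequestor.pl request in the order listed under Method 1. It assumes that the last Position value you received is the offset where the stream left off, and that you are resuming within a single site. The name resume_args is hypothetical.

```python
def resume_args(host, port, num_pages, site, last_position):
    """Arguments for distribrequestor.pl to resume just past last_position.

    Order follows the Method 1 list: host, port, num pages, starting site,
    ending site, offset. Using the same site for start and end is an
    assumption made for this sketch.
    """
    offset = last_position + 1      # + 1 so the previous page is not sent again
    return [host, str(port), str(num_pages), site, site, str(offset)]
```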

    Putting out a  contract:
    If you are done, you can run  distribrelease.pl [remote-host] [host port] [stream port]
    from the same machine you requested on. We will immediately kill the distributor for you.
    We especially recommend this if you are running
    many requests in 1 day so that we do not run out of resources.

    If you specify firstSite/lastSite, please note that you can only use the root
    (e.g. www.ibm.com), not a page within the site (e.g. 01net.com/envoyerArticle/1 ),
    and don't include the http:// part.
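
If your site names come from a list of full URLs, a small helper can reduce each one to the root form described above. This is a sketch using Python's standard urllib.parse; the name site_root is hypothetical.

```python
from urllib.parse import urlsplit


def site_root(url_or_host):
    """Reduce a URL or host/path string to the bare site root.

    'http://www.ibm.com/us/' and '01net.com/envoyerArticle/1' become
    'www.ibm.com' and '01net.com'.
    """
    text = url_or_host.strip()
    if "://" not in text:
        text = "//" + text          # lets urlsplit treat the first part as the host
    host = urlsplit(text).hostname  # lower-cased, without port or path
    if not host:
        raise ValueError("no host found in %r" % url_or_host)
    return host
```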

    -------------------------------------------------------------------
     

    To create a  new webpage stream handler:

    You can use the other handlers in the distribution as templates.
    To add a  new handler, add the following to the appropriate places:
     * 1) #include "myhandler.h" into handlers/all_handlers.h
     * 2) handler.push_back(new MyHandler()); into handlers/all_handlers.h
             (following the template of the handlers already there)
     * 3) in Makefile, add entries for your segments to compile
             in the line: HANDLER_OBJS = jhandler.o [...]
     *opt)in Makefile, customize your build if necessary by adding a  line
               jhandler_CXXFLAGS = -Iyour-include-dir --your-switches [...]
              (following the template of the handlers already there)

    We also have a  one-button script called scripts/addHandler.pl that will
    prompt you for all your pieces and put them in place, without you having
    to do the above file surgery yourself.
     
     


     

    GLOSSARY


    WebVac - the WebBase web crawler or spider. Used to be called Pita.

    RunHandlers - (formerly "process") an executable that indexes a  stream,
                  file or repository.
                  Made up basically of a  feeder and one or more handlers.

    handler - the interface that any index-building piece of code must implement.
              The interface's main (only) method will provide a  page and associated
              metadata and the implementor of the method can do whatever he wants
              with it.

    feeder -  the interface for receiving a  page stream from any kind of source
              (directly from the repository, via Webcat, via network, etc.). The
              key method of the interface is "next" which advances the stream by one
              page. After calling next, various other methods can be used to get the
              associated metadata for the current page in the stream. Can also be used
              to build indexes if the index-building code is written to process page
              streams.

    distributor - a program that disseminates pages to multiple clients
               over the network, supporting session IDs, etc. A generalization of what
               Distributor.cc in Text-index/ does.

    offset - used in distributor requests to specify how many bytes past
             the beginning of the site the stream should start.

    DocId - DocId is computed within the download. If you download any portion of the crawl,
                   even from the middle,  it will begin with 0.  If you download all the crawl,
                   it will be monotonically increasing from start to end.
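
To make the feeder and handler entries concrete, here is a conceptual sketch of how they fit together. It is written in Python for illustration only; the real WebBase interfaces are C++, and every class and method name here other than "next" is hypothetical.

```python
class ListFeeder:
    """A feeder over an in-memory list of (url, body) pairs."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._index = -1

    def next(self):
        """Advance the stream by one page; False when it is exhausted."""
        if self._index + 1 >= len(self._pages):
            return False
        self._index += 1
        return True

    # Metadata accessors for the current page, valid after next() returns True.
    def url(self):
        return self._pages[self._index][0]

    def body(self):
        return self._pages[self._index][1]

    def doc_id(self):
        return self._index          # begins with 0 within this download


class SizeHandler:
    """A handler with a single method that receives a page and its metadata."""

    def __init__(self):
        self.sizes = {}

    def process(self, url, doc_id, body):
        self.sizes[url] = len(body)


def run_handlers(feeder, handlers):
    """Pull pages from the feeder and pass each one to every handler."""
    count = 0
    while feeder.next():
        for handler in handlers:
            handler.process(feeder.url(), feeder.doc_id(), feeder.body())
        count += 1
    return count
```

For example, run_handlers(ListFeeder([("http://www.powa.org/", "<html></html>")]), [SizeHandler()]) processes one page.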