edu.stanford.nlp.web
Class WebSearch

java.lang.Object
  |
  +--edu.stanford.nlp.web.WebSearch

public class WebSearch
extends Object

Implements a simple web search interface. Currently Google, Lycos, Altavista, and MSN searches are supported. All of these search engines can be searched by screen-scraping their human-viewable search results pages; in addition, Google can be searched via the Google Web API. An initialization file specifies which services are desired and, when the GoogleAPI is used, the license key. This initialization file is read as a standard Properties file. As well as specifying search services, it may contain any other properties, such as HTTP timeouts or HTTP proxy server specifications. Any such properties will be read and added to the System properties list. If no initialization file is provided, the default is a non-GoogleAPI Google search. More information about the Google Web API can be found at: http://www.google.com/apis/

Example initialization file: websearch.init, provided in the source directory.

 
  ## Initialization file for WebSearch java class
  #
  # The following determine which supported search engines should be used.
  # Services set to 'yes' are used for the web searches, and services set to
  # 'no' are not.
  # By default, a non-GoogleAPI Google search is performed if all services
  # are set to 'no'
  # The GoogleAPI option uses Google's Java API rather than a basic
  # socket/HTML parsing approach; if GoogleAPI is set to yes, the Google
  # option is ignored.  Using the GoogleAPI requires a license key, which
  # should be supplied below
 
  GoogleAPI = yes
  Google = no
  Lycos = yes
  Altavista = no
  MSN = no
 
  # The GoogleAPI license key should be provided below if GoogleAPI is set
  # to yes
  
  GoogleKey = 
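
Because the initialization file is read as a standard Properties file, its service switches can be inspected with java.util.Properties. The following is a minimal sketch under stated assumptions, not the class's actual loading code: the InitFileSketch class and its method names are hypothetical, and matching 'yes' case-insensitively is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; not part of edu.stanford.nlp.web.
class InitFileSketch {
    // Service keys as they appear in the example websearch.init.
    static final String[] SERVICES = {"GoogleAPI", "Google", "Lycos", "Altavista", "MSN"};

    // Parses initialization-file text using the standard Properties format.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Lists the services set to 'yes'. Assumption: matching is case-insensitive.
    static List<String> enabledServices(Properties props) {
        List<String> enabled = new ArrayList<>();
        for (String service : SERVICES) {
            String value = props.getProperty(service, "no").trim();
            if (value.equalsIgnoreCase("yes")) {
                enabled.add(service);
            }
        }
        // Per the file's comments, the Google option is ignored when GoogleAPI is yes.
        if (enabled.contains("GoogleAPI")) {
            enabled.remove("Google");
        }
        return enabled;
    }

    public static void main(String[] args) {
        String init = String.join("\n",
            "# comment lines start with '#'",
            "GoogleAPI = yes",
            "Google = no",
            "Lycos = yes",
            "Altavista = no",
            "MSN = no",
            "GoogleKey = ");
        System.out.println(enabledServices(parse(init)));
    }
}
```

Any other keys in the file (for example proxy settings) remain available through the same Properties object.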
 
  


Field Summary
 String currentURL
          When getNextPage() is called, the url that is accessed is copied here as a fully specified url, e.g., http://www.foo.bar/dir/
 double searchRank
          When currentURL changes, the searchRank is updated.
static int TT_BAD_SOCKET
          A constant indicating that there was a socket or other connection error that caused one or more searches to fail.
static int TT_SEARCH_ERRORS
          A constant indicating that there was a server problem with at least one of the search sites.
static int TT_SEARCH_OK
          A constant indicating that the search function successfully connected with every search site.
 
Constructor Summary
WebSearch()
          Constructor: uses the default initialization file websearch.init in the current directory for properties.
WebSearch(String initFile)
          Constructor: uses the initialization file specified.
 
Method Summary
 int doSearch(String keyword, int maxNumHits, boolean exact)
          Performs the actual web search.
 String getNextPage()
          Returns the full text of the next valid search result.
 int getTimeout()
          Returns how long the HTTP request timeout period is.
 Vector getURLList()
          Returns a new Vector list of urls obtained from the search engines.
 String getWebPage(String url)
          An internal convenience function, made public to give clients a quick and easy way to retrieve the text of a text/html page.
static void main(String[] argv)
           
 void setServiceRanking(double googleRank, double altavistaRank, double lycosRank, double msnRank)
          If more than one search service is used, then the results are ordered according to the order in which individual urls appear on each service's results page as well as each service's own rank.
 void setTimeout(int timeoutValue)
          Specifies how long before an HTTP request times out.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TT_SEARCH_OK

public static final int TT_SEARCH_OK
A constant indicating that the search function successfully connected with every search site.

See Also:
Constant Field Values

TT_SEARCH_ERRORS

public static final int TT_SEARCH_ERRORS
A constant indicating that there was a server problem with at least one of the search sites. Note this does NOT mean that zero search results were obtained, only that one of the search sites failed, and future search calls might obtain more results.

See Also:
Constant Field Values

TT_BAD_SOCKET

public static final int TT_BAD_SOCKET
A constant indicating that there was a socket or other connection error that caused one or more searches to fail. Again, this does not always mean that zero search results were obtained, though this sort of error indicates a "bigger" type of problem than a TT_SEARCH_ERRORS code.

See Also:
Constant Field Values

currentURL

public String currentURL
When getNextPage() is called, the url that is accessed is copied here as a fully specified url, e.g., http://www.foo.bar/dir/


searchRank

public double searchRank
When currentURL changes, the searchRank is updated.

Constructor Detail

WebSearch

public WebSearch()
Constructor: uses the default initialization file websearch.init in the current directory for properties. If this file is not found, then a basic Altavista search is performed.


WebSearch

public WebSearch(String initFile)
Constructor: uses the initialization file specified. If this file is not found or is malformed, then a basic Altavista search is performed.

Parameters:
initFile - the name of the initialization file
Method Detail

doSearch

public int doSearch(String keyword,
                    int maxNumHits,
                    boolean exact)
             throws IOException
Performs the actual web search. The results of any previous web searches are discarded. Search results are combined from the search services specified in the initialization file, and are stored in ranked order (see setServiceRanking).

Parameters:
keyword - a String specifying the search string. It can be one word such as bananas or multiple words such as fruit fly genome.
maxNumHits - specifies the maximum number of search results desired PER SEARCH SERVICE specified in the initialization file; values less than 1 default to 1.
exact - indicates whether surrounding quotes should be used for a multi-keyword search (note that this does not necessarily improve search results for all search engines, but can help).
Returns:
a status code TT_SEARCH_OK, TT_SEARCH_ERRORS, or TT_BAD_SOCKET. Neither of the two error codes guarantees that zero search results were obtained, however.
Throws:
IOException
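
The parameter rules above (values of maxNumHits below 1 become 1, and exact adds surrounding quotes to a multi-keyword search) can be illustrated with a small sketch. This is an assumption about how such normalization might look, not the class's actual code: QuerySketch and its methods are hypothetical names, and the URL-encoding step is added only to show what a quoted query looks like in a request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of edu.stanford.nlp.web.
class QuerySketch {
    // Values less than 1 default to 1, as documented for maxNumHits.
    static int effectiveMaxHits(int maxNumHits) {
        return Math.max(1, maxNumHits);
    }

    // Adds surrounding quotes only when exact is set and there are several keywords.
    static String queryText(String keyword, boolean exact) {
        String trimmed = keyword.trim();
        boolean multiWord = trimmed.split("\\s+").length > 1;
        return (exact && multiWord) ? "\"" + trimmed + "\"" : trimmed;
    }

    // Form-encodes the query text (spaces become '+', quotes become %22).
    static String encoded(String keyword, boolean exact) {
        return URLEncoder.encode(queryText(keyword, exact), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(queryText("fruit fly genome", true));
        System.out.println(encoded("fruit fly genome", true));
        System.out.println(effectiveMaxHits(0));
    }
}
```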

getURLList

public Vector getURLList()
Returns a new Vector list of urls obtained from the search engines. These are ordered from highest ranked (most relevant, according to the scoring metric described in setServiceRanking) to lowest ranked.

Returns:
a Vector of fully specified urls (type String)

setServiceRanking

public void setServiceRanking(double googleRank,
                              double altavistaRank,
                              double lycosRank,
                              double msnRank)
If more than one search service is used, then the results are ordered according to the order in which individual urls appear on each service's results page as well as each service's own rank. A url is given a score of (service rank) / (url result rank); for example, if the Google rank is .99, the first result url would get a score of .99, the second a score of .99/2, etc. These scores are additive, so that if the same url is returned by multiple search services, the scores with respect to each service are added. It is better not to give any two search services exactly the same rank.

Parameters:
googleRank - the rank given to Google results
altavistaRank - the rank given to AltaVista results
lycosRank - the rank given to Lycos results
msnRank - the rank given to MSN results
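
The scoring rule described above, (service rank) / (url result rank) summed across services, can be worked through in a few lines. This is a sketch of the arithmetic only: RankingSketch, its methods, the example urls, and the Lycos rank of 0.5 are illustrative assumptions, and the order of urls with equal scores is left unspecified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of edu.stanford.nlp.web.
class RankingSketch {
    // Adds serviceRank / position to each url's running score (positions start at 1).
    static void addScores(Map<String, Double> scores, double serviceRank, List<String> results) {
        for (int i = 0; i < results.size(); i++) {
            scores.merge(results.get(i), serviceRank / (i + 1), Double::sum);
        }
    }

    // Returns the urls ordered from highest to lowest combined score.
    static List<String> ranked(Map<String, Double> scores) {
        List<String> urls = new ArrayList<>(scores.keySet());
        urls.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return urls;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        // Google rank .99 as in the text; the Lycos rank of 0.5 is an assumed value.
        addScores(scores, 0.99, Arrays.asList("http://a.example/", "http://b.example/"));
        addScores(scores, 0.5, Arrays.asList("http://b.example/", "http://c.example/"));
        // a scores .99; b scores .99/2 + .5; c scores .5/2.
        for (String url : ranked(scores)) {
            System.out.println(url + " " + scores.get(url));
        }
    }
}
```

In this example the url returned by both services ends up ahead of the url that was first on Google alone, because its two partial scores add up.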

getNextPage

public String getNextPage()
                   throws Exception
Returns the full text of the next valid search result. Only text/html pages returning an HTTP server code of 200 are returned. The corresponding url is stored in currentURL. null is returned if there are no more search results to return. null can also indicate a widespread server or connection error.

Returns:
a String containing the full text of the web site, or null
Throws:
Exception
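
A client typically calls getNextPage() repeatedly until it returns null, reading currentURL after each call. The sketch below shows that loop against a hypothetical stand-in class, FakeSearch, which mirrors only the documented getNextPage() signature and currentURL field so that the example runs without the library; the real WebSearch would be used in its place after a call to doSearch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical example; FakeSearch is a stand-in, not the real WebSearch.
class NextPageLoopSketch {
    static class FakeSearch {
        String currentURL;
        private final Iterator<String[]> results;

        // Each entry is a {url, page text} pair.
        FakeSearch(List<String[]> urlAndText) {
            this.results = urlAndText.iterator();
        }

        // Mirrors the documented signature, including the checked exception.
        String getNextPage() throws Exception {
            if (!results.hasNext()) {
                return null; // no more search results
            }
            String[] next = results.next();
            currentURL = next[0];
            return next[1];
        }
    }

    // Reads pages until getNextPage() returns null, recording each url visited.
    static List<String> visitAll(FakeSearch search) {
        List<String> visited = new ArrayList<>();
        try {
            while (search.getNextPage() != null) {
                visited.add(search.currentURL);
            }
        } catch (Exception e) {
            // A real client would log or rethrow; here the urls read so far are kept.
        }
        return visited;
    }

    public static void main(String[] args) {
        FakeSearch search = new FakeSearch(Arrays.asList(
            new String[] {"http://www.foo.bar/dir/", "<html>first</html>"},
            new String[] {"http://www.foo.bar/other/", "<html>second</html>"}));
        System.out.println(visitAll(search));
    }
}
```

Because null can also signal a connection problem, a client that needs to tell the two cases apart should check the status code returned by doSearch as well.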

getWebPage

public String getWebPage(String url)
                  throws Exception
An internal convenience function, made public to give clients a quick and easy way to retrieve the text of a text/html page. If the url resolves to a "200 OK" page, it is returned as a String. All other cases return null, including malformed urls, server errors, http errors (404, etc.), and non-text/html content.

Returns:
the full text of the web page, or null
Throws:
Exception - This can be one of various exceptions, including java.io.IOException, java.net.UnknownHostException, java.net.SocketException, and java.net.SocketTimeoutException.
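
The behaviour described for getWebPage (return the text only for a "200 OK" text/html response, otherwise null) can be approximated with the standard java.net classes. This is a sketch under stated assumptions and not the class's implementation: PageFetchSketch, fetchHtml, and isHtml are hypothetical names, the body is decoded as UTF-8 regardless of the declared charset, and connection failures surface as IOException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of edu.stanford.nlp.web.
class PageFetchSketch {
    // True when the Content-Type header names text/html (parameters such as charset allowed).
    static boolean isHtml(String contentType) {
        return contentType != null
            && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    // Returns the page text for a 200 text/html response, or null otherwise.
    static String fetchHtml(String url, int timeoutMillis) throws IOException {
        URL target;
        try {
            target = new URI(url).toURL();
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            return null; // malformed url
        }
        URLConnection raw = target.openConnection();
        if (!(raw instanceof HttpURLConnection)) {
            return null; // not an http or https url
        }
        HttpURLConnection conn = (HttpURLConnection) raw;
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK
                    || !isHtml(conn.getContentType())) {
                return null; // http error or non-text/html content
            }
            try (InputStream in = conn.getInputStream()) {
                // Assumption: decode as UTF-8 rather than honoring the declared charset.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller might pass 4000 as timeoutMillis to match the documented default timeout.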

setTimeout

public void setTimeout(int timeoutValue)
Specifies how long before an HTTP request times out. Default is 4000 milliseconds.

Parameters:
timeoutValue - number of milliseconds for the timeout. Must be greater than or equal to 0; otherwise the value is not changed.

getTimeout

public int getTimeout()
Returns how long the HTTP request timeout period is.

Returns:
time in milliseconds allowed for an HTTP request

main

public static void main(String[] argv)


Stanford NLP Group