java.lang.Object
  |
  +--edu.stanford.nlp.web.WebSearch
Implements a simple web search interface. Currently Google, Lycos,
AltaVista, and MSN searches are supported. These search engines can
all be searched via screen-scraping human-viewable search results pages,
and in addition Google can be searched via the Google Web API.
An initialization file is
required to specify which services are desired, and in the case of
using the GoogleAPI, what the license key is. This initialization file
is read as a standard Properties file. As well as specifying search
services, it may contain any other properties, such as HTTP timeouts or
HTTP proxy server specifications. Any such properties will be read and
added to the System properties list.
If no initialization
file is provided, the default is a non-GoogleAPI Google search.
More information about the Google Web API can be found at:
http://www.google.com/apis/
Example initialization file: websearch.init, provided in the source directory.
## Initialization file for WebSearch java class
#
# The following determine which supported search engines should be used.
# Services set to 'yes' are used for the web searches, and services set to
# 'no' are not.
# By default, a non-GoogleAPI Google search is performed if all services
# are set to 'no'
# The GoogleAPI option uses Google's Java API rather than a basic
# socket/HTML parsing approach; if GoogleAPI is set to yes, the Google
# option is ignored. Using the GoogleAPI requires a license key, which
# should be supplied below
GoogleAPI = yes
Google = no
Lycos = yes
Altavista = no
MSN = no
# The GoogleAPI licence key should be provided below if GoogleAPI is set
# to yes
GoogleKey =
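Because the initialization file is a standard Properties file, its service switches can be read with java.util.Properties. The following is a minimal sketch of the selection rules described in the file's comments, not the library's actual code. The class InitFileSketch and its methods are hypothetical, and matching 'yes' case-insensitively is an assumption. Handling of GoogleKey and copying other properties into the System properties are left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper; not part of edu.stanford.nlp.web. */
final class InitFileSketch {

    /** Parses init-file text written in java.util.Properties syntax. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try (Reader in = new StringReader(text)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /**
     * Applies the documented rules: GoogleAPI overrides Google, and a plain
     * Google search is the default when no service is set to 'yes'.
     */
    static List<String> enabledServices(Properties props) {
        List<String> services = new ArrayList<>();
        if (isYes(props, "GoogleAPI")) {
            services.add("GoogleAPI");          // the Google option is ignored
        } else if (isYes(props, "Google")) {
            services.add("Google");
        }
        for (String name : new String[] {"Lycos", "Altavista", "MSN"}) {
            if (isYes(props, name)) {
                services.add(name);
            }
        }
        if (services.isEmpty()) {
            services.add("Google");             // documented default
        }
        return services;
    }

    /** Assumption: values are compared case-insensitively after trimming. */
    private static boolean isYes(Properties props, String key) {
        return "yes".equalsIgnoreCase(props.getProperty(key, "no").trim());
    }

    public static void main(String[] args) {
        String text = "GoogleAPI = yes\nGoogle = no\nLycos = yes\n"
                + "Altavista = no\nMSN = no\nGoogleKey =\n";
        System.out.println(enabledServices(parse(text)));
    }
}
```

With the example file above, this sketch selects GoogleAPI and Lycos.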
Field Summary
String | currentURL | When getNextPage() is called, the url that is accessed is copied here as a fully specified url, e.g., http://www.foo.bar/dir/
double | searchRank | When currentURL changes, the searchRank is updated.
static int | TT_BAD_SOCKET | A constant indicating that there was a socket or other connection error that caused one or more searches to fail.
static int | TT_SEARCH_ERRORS | A constant indicating that there was a server problem with at least one of the search sites.
static int | TT_SEARCH_OK | A constant indicating that the search function successfully connected with every search site.
Constructor Summary
WebSearch() | Constructor: uses the default initialization file websearch.init in the current directory for properties.
WebSearch(String initFile) | Constructor: uses the initialization file specified.
Method Summary
int | doSearch(String keyword, int maxNumHits, boolean exact) | Performs the actual web search.
String | getNextPage() | Returns the full text of the next valid search result.
int | getTimeout() | Returns how long the HTTP request timeout period is.
Vector | getURLList() | Returns a new Vector list of URLs obtained from the search engines.
String | getWebPage(String url) | An internal convenience function made public so that clients have a quick and easy way to retrieve the text of a text/html page.
static void | main(String[] argv) |
void | setServiceRanking(double googleRank, double altavistaRank, double lycosRank, double msnRank) | If more than one search service is used, then the results are ordered according to the order in which individual URLs appear on each service's results page as well as the services' own ranking.
void | setTimeout(int timeoutValue) | Specifies how long before an HTTP request times out.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
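Field Detail
public static final int TT_SEARCH_OK
    A constant indicating that the search function successfully connected with every search site.
public static final int TT_SEARCH_ERRORS
    A constant indicating that there was a server problem with at least one of the search sites.
public static final int TT_BAD_SOCKET
    A constant indicating that there was a socket or other connection error that caused one or more searches to fail. ... TT_SEARCH_ERRORS code.
public String currentURL
    When getNextPage() is called, the URL that is accessed is copied here as a fully specified URL, e.g., http://www.foo.bar/dir/
public double searchRank
    When currentURL changes, the searchRank is updated.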
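Constructor Detail
public WebSearch()
    Constructor: uses the default initialization file websearch.init in the current directory for properties.
public WebSearch(String initFile)
    Constructor: uses the initialization file specified.
    Parameters:
        initFile - the name of the initialization file
Method Detail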
public int doSearch(String keyword, int maxNumHits, boolean exact) throws IOException
    Performs the actual web search (... setServiceRanking).
    Parameters:
        keyword - a String specifying the search string. It can be one word such as bananas or multiple words such as fruit fly genome
        maxNumHits - specifies the maximum number of search results desired PER SEARCH SERVICE specified in the initialization file; values of maxNumHits less than 1 default to 1.
        exact - indicates whether surrounding quotes should be used for a multi-keyword search (note that this does not necessarily improve search results for all search engines, but can help ...)
    Returns:
        TT_SEARCH_OK, TT_SEARCH_ERRORS, or TT_BAD_SOCKET. Neither of the two error codes guarantees that zero search results were obtained, however.
    Throws:
        IOException
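The parameter rules for doSearch (values of maxNumHits below 1 fall back to 1, and exact adds surrounding quotes to a multi-keyword search) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class QuerySketch and its methods are hypothetical, and leaving a single-word keyword unquoted even when exact is true is an assumption drawn from the phrase "multi-keyword search".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper illustrating the doSearch parameter rules. */
final class QuerySketch {

    /** Values below 1 fall back to 1, as documented for maxNumHits. */
    static int clampMaxNumHits(int maxNumHits) {
        return Math.max(1, maxNumHits);
    }

    /**
     * Wraps a multi-word keyword in double quotes when exact is true.
     * Assumption: single-word keywords are never quoted.
     */
    static String buildQuery(String keyword, boolean exact) {
        String trimmed = keyword.trim();
        boolean multiWord = trimmed.split("\\s+").length > 1;
        return (exact && multiWord) ? "\"" + trimmed + "\"" : trimmed;
    }

    /** URL-encodes the query so it can be placed in a search URL. */
    static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(buildQuery("fruit fly genome", true)));
        System.out.println(encode(buildQuery("bananas", true)));
        System.out.println(clampMaxNumHits(0));
    }
}
```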
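public Vector getURLList()
    Returns a new Vector list of URLs obtained from the search engines, ordered from highest ranked (see setServiceRanking) to lowest ranked.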
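The ordering of the URL list depends both on where a URL appears on each service's results page and on the rank assigned to that service. This documentation does not give the scoring formula, so the sketch below only illustrates the idea under loud assumptions: a larger service rank means a more trusted service, a URL at zero-based position i contributes serviceRank / (i + 1), and scores for the same URL from different services are summed. The class RankMergeSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of rank-weighted merging; not the library's algorithm. */
final class RankMergeSketch {

    /** Adds one service's result list to the score table; earlier positions score higher. */
    static void addResults(Map<String, Double> scores, List<String> urls, double serviceRank) {
        for (int i = 0; i < urls.size(); i++) {
            double score = serviceRank / (i + 1);   // assumed formula
            scores.merge(urls.get(i), score, Double::sum);
        }
    }

    /** Returns the URLs ordered from highest to lowest combined score. */
    static List<String> ordered(Map<String, Double> scores) {
        List<String> urls = new ArrayList<>(scores.keySet());
        urls.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return urls;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        // First service has rank 1.0, second service has rank 3.0.
        addResults(scores, Arrays.asList("http://a.example/", "http://b.example/"), 1.0);
        addResults(scores, Arrays.asList("http://b.example/", "http://c.example/"), 3.0);
        System.out.println(ordered(scores));
    }
}
```

In this example the URL found by both services ends up first, because its two contributions are added together.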
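public void setServiceRanking(double googleRank, double altavistaRank, double lycosRank, double msnRank)
    If more than one search service is used, then the results are ordered according to the order in which individual URLs appear on each service's results page as well as the services' own ranking.
    Parameters:
        googleRank - the rank given to Google results
        altavistaRank - the rank given to AltaVista results
        lycosRank - the rank given to Lycos results
        msnRank - the rank given to MSN results
public String getNextPage() throws Exception
    Returns the full text of the next valid search result. The URL that is accessed is copied to currentURL.
    Returns:
        null is returned if there are no more search results to return. null can also indicate an epidemic server or connection error.
    Throws:
        Exception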
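A typical client calls doSearch once and then calls getNextPage until it returns null, reading currentURL after each call. The sketch below shows that loop against a hypothetical stand-in class, FakeSearch, which mirrors only the documented member names so that the example runs without the library. Its constant value and canned pages are placeholders, and unlike the real methods it declares no checked exceptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the doSearch / getNextPage loop. */
final class PageLoopSketch {

    /** Stand-in for WebSearch; serves canned pages instead of querying the web. */
    static class FakeSearch {
        static final int TT_SEARCH_OK = 0;      // placeholder value
        String currentURL;
        private final Deque<String[]> pages = new ArrayDeque<>();

        /** Each argument is a {url, pageText} pair. */
        FakeSearch(String[]... urlAndText) {
            pages.addAll(Arrays.asList(urlAndText));
        }

        int doSearch(String keyword, int maxNumHits, boolean exact) {
            return TT_SEARCH_OK;
        }

        /** Returns the next page text and updates currentURL, or null when done. */
        String getNextPage() {
            String[] next = pages.poll();
            if (next == null) {
                return null;
            }
            currentURL = next[0];
            return next[1];
        }
    }

    /** Runs a search and records each visited URL with its page length. */
    static List<String> visitAll(FakeSearch search, String keyword) {
        List<String> visited = new ArrayList<>();
        int status = search.doSearch(keyword, 5, true);
        if (status != FakeSearch.TT_SEARCH_OK) {
            // An error code does not guarantee zero results, so keep going.
            System.err.println("search reported status " + status);
        }
        String page;
        while ((page = search.getNextPage()) != null) {
            visited.add(search.currentURL + " (" + page.length() + " chars)");
        }
        return visited;
    }

    public static void main(String[] args) {
        FakeSearch search = new FakeSearch(
                new String[] {"http://www.foo.bar/dir/", "<html>hi</html>"});
        System.out.println(visitAll(search, "bananas"));
    }
}
```

Because null can also signal a server or connection error, a real client may want to check the status returned by doSearch before treating null as a normal end of results.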
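public String getWebPage(String url) throws Exception
    An internal convenience function made public so that clients have a quick and easy way to retrieve the text of a text/html page.
    Throws:
        Exception - This can be various sorts of exceptions, including java.io.IOException, java.net.UnknownHostException, java.net.SocketException, and java.net.SocketTimeoutException
public void setTimeout(int timeoutValue)
    Specifies how long before an HTTP request times out.
    Parameters:
        timeoutValue - number of milliseconds for the timeout. Should be greater than or equal to 0, otherwise the value is not changed.
public int getTimeout()
    Returns how long the HTTP request timeout period is.
public static void main(String[] argv)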
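The setTimeout contract (negative values leave the current setting unchanged) can be sketched together with one plausible way of applying the value to a connection. How WebSearch itself applies the timeout is not documented here, so the use of java.net.URLConnection below is an assumption, as are the class name TimeoutSketch, the open method, and the initial value of 0. In URLConnection a timeout of 0 means an infinite timeout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical sketch of the setTimeout / getTimeout contract. */
final class TimeoutSketch {

    // Assumed initial value; the real default is not documented here.
    private int timeoutMillis = 0;

    /** Stores the timeout only if it is greater than or equal to 0. */
    void setTimeout(int timeoutValue) {
        if (timeoutValue >= 0) {
            timeoutMillis = timeoutValue;
        }
    }

    int getTimeout() {
        return timeoutMillis;
    }

    /** Creates a connection object configured with the timeout; does not connect. */
    URLConnection open(String url) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TimeoutSketch sketch = new TimeoutSketch();
        sketch.setTimeout(5000);
        sketch.setTimeout(-1);   // ignored: negative values do not change the setting
        System.out.println(sketch.getTimeout());
        System.out.println(sketch.open("http://www.foo.bar/dir/").getReadTimeout());
    }
}
```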