edu.stanford.nlp.web
Class HTMLParser
java.lang.Object
  |
  +--javax.swing.text.html.HTMLEditorKit.ParserCallback
        |
        +--edu.stanford.nlp.web.HTMLParser
- Direct Known Subclasses:
- LocusLinkParser, USPDIParser
- public class HTMLParser
- extends HTMLEditorKit.ParserCallback
Parses an HTML document and returns its plain text (and title).
HTMLParser is mainly used through the parse(String url) method, which returns a String with the contents of an HTML page, without the tags. After calling parse, you can get the HTML title (the contents of the TITLE tag) by calling title().
Subclasses may override the handleText(), handleComment(), handleStartTag(), etc. methods so that parse(String url) returns something other than the text of the web page. (For example, a subclass might return only part of the text, or only the links.)
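The "only the links" case can be sketched as follows. This is a minimal illustration, not code from the Stanford library: it extends HTMLEditorKit.ParserCallback directly (so that it compiles with the JDK alone) and drives it with the JDK's ParserDelegator. The class name LinkCollector and the method collect are hypothetical. A real subclass of HTMLParser would override handleStartTag in the same way.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; not part of edu.stanford.nlp.web.
class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();

    // Called by the parser for every non-empty start tag.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrSet, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrSet.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Parses the given HTML source and returns the href values in document order.
    public List<String> collect(String html) {
        links.clear();
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), this, true);
        } catch (IOException e) {
            // A StringReader does not normally fail; rethrow unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(links);
    }
}
```

Calling new LinkCollector().collect(html) on a page returns the href attribute of each A element, exactly as written in the source (relative links are not resolved).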
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
textBuffer
protected StringBuffer textBuffer
title
protected String title
HTMLParser
public HTMLParser()
handleText
public void handleText(char[] data,
int pos)
- Overrides:
handleText
in class HTMLEditorKit.ParserCallback
handleStartTag
public void handleStartTag(HTML.Tag tag,
MutableAttributeSet attrSet,
int pos)
- Sets a flag if the start tag is the "TITLE" element start tag.
- Overrides:
handleStartTag
in class HTMLEditorKit.ParserCallback
handleEndTag
public void handleEndTag(HTML.Tag tag,
int pos)
- Sets a flag if the end tag is the "TITLE" element end tag.
- Overrides:
handleEndTag
in class HTMLEditorKit.ParserCallback
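The interplay of the three callbacks above (a flag set by the TITLE start tag, cleared by the TITLE end tag, and consulted when text arrives) might look like the sketch below. It is an assumption about how such a class could work, not the actual HTMLParser source: the class name PlainTextCallback, the inTitle field, and the parseHtml helper are hypothetical, and a StringBuilder stands in for the documented StringBuffer field.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in that mirrors the documented members of HTMLParser.
class PlainTextCallback extends HTMLEditorKit.ParserCallback {
    protected StringBuilder textBuffer = new StringBuilder();
    protected String title = "";
    private boolean inTitle = false; // assumed name for the "flag"

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrSet, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    // Text inside TITLE becomes the title; all other text is accumulated.
    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title = new String(data);
        } else {
            textBuffer.append(data).append(' ');
        }
    }

    // Resets state, runs the JDK parser over the reader, and returns the text.
    public String parse(Reader r) throws IOException {
        textBuffer = new StringBuilder();
        title = "";
        inTitle = false;
        new ParserDelegator().parse(r, this, true);
        return textBuffer.toString();
    }

    // Convenience wrapper for in-memory HTML (hypothetical helper).
    public String parseHtml(String html) {
        try {
            return parse(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public String title() {
        return title;
    }
}
```

As in the class description, title() is only meaningful after a parse call has completed; in this sketch each parse call resets both the buffer and the title.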
parse
public String parse(URL url)
throws IOException
- Throws:
IOException
parse
public String parse(String text)
throws IOException
- Throws:
IOException
parse
public String parse(Reader r)
throws IOException
- Throws:
IOException
title
public String title()
Stanford NLP Group