edu.stanford.nlp.web
Class HTMLParser

java.lang.Object
  |
  +--javax.swing.text.html.HTMLEditorKit.ParserCallback
        |
        +--edu.stanford.nlp.web.HTMLParser
Direct Known Subclasses:
LocusLinkParser, USPDIParser

public class HTMLParser
extends HTMLEditorKit.ParserCallback

Parses an HTML document and returns its plain text (and title). The main entry point is the parse(String url) method, which returns the contents of an HTML page as a String with the tags removed. After calling parse, you can retrieve the HTML title (the contents of the TITLE tag) by calling title(). Subclasses may override handleText(), handleComment(), handleStartTag(), and the other callback methods so that parse(String url) returns something other than the full text of the web page (for example, only part of the text, or only the links).
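
The following is a minimal sketch of the callback pattern described above, not the actual HTMLParser source. It assumes the JDK's javax.swing.text.html.parser.ParserDelegator as the driver; the class name TextExtractorSketch, the inTitle flag, and the choice to separate text chunks with a single space are all hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for HTMLParser: collects body text and the title.
class TextExtractorSketch extends HTMLEditorKit.ParserCallback {
    private final StringBuilder textBuffer = new StringBuilder();
    private String title = "";
    private boolean inTitle = false; // assumed flag, set between <TITLE> and </TITLE>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrSet, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title = new String(data);
        } else {
            // Separate chunks with a space so adjacent elements do not run together.
            textBuffer.append(data).append(' ');
        }
    }

    // Resets state, runs the JDK parser over the reader, and returns the tag-free text.
    public String parse(Reader r) throws IOException {
        textBuffer.setLength(0);
        title = "";
        new ParserDelegator().parse(r, this, true);
        return textBuffer.toString().trim();
    }

    public String title() {
        return title;
    }
}
```

With the real class, usage would presumably follow the same shape: construct an HTMLParser, call one of the parse methods, then call title().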


Field Summary
protected  StringBuffer textBuffer
           
protected  String title
           
 
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
 
Constructor Summary
HTMLParser()
           
 
Method Summary
 void handleEndTag(HTML.Tag tag, int pos)
          Sets a flag if the end tag is the "TITLE" element end tag.
 void handleStartTag(HTML.Tag tag, MutableAttributeSet attrSet, int pos)
          Sets a flag if the start tag is the "TITLE" element start tag.
 void handleText(char[] data, int pos)
           
 String parse(Reader r)
           
 String parse(String text)
           
 String parse(URL url)
           
 String title()
           
 
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError, handleSimpleTag
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

textBuffer

protected StringBuffer textBuffer

title

protected String title
Constructor Detail

HTMLParser

public HTMLParser()
Method Detail

handleText

public void handleText(char[] data,
                       int pos)
Overrides:
handleText in class HTMLEditorKit.ParserCallback

handleStartTag

public void handleStartTag(HTML.Tag tag,
                           MutableAttributeSet attrSet,
                           int pos)
Sets a flag if the start tag is the "TITLE" element start tag.

Overrides:
handleStartTag in class HTMLEditorKit.ParserCallback
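
As an illustration of overriding handleStartTag to return something other than the page text (the class description mentions returning only the links), here is a hedged sketch. It extends HTMLEditorKit.ParserCallback directly rather than HTMLParser so that it runs on the JDK alone; the class name LinkCollectorSketch and the collect method are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that records the href of every <A> start tag.
class LinkCollectorSketch extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrSet, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrSet.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Parses the document and returns a copy of the collected link targets.
    public List<String> collect(Reader r) throws IOException {
        links.clear();
        new ParserDelegator().parse(r, this, true);
        return new ArrayList<>(links);
    }
}
```

A subclass of HTMLParser doing the same thing would likely also need to call super.handleStartTag so that the TITLE flag is still maintained; that detail is an assumption about the implementation.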

handleEndTag

public void handleEndTag(HTML.Tag tag,
                         int pos)
Sets a flag if the end tag is the "TITLE" element end tag.

Overrides:
handleEndTag in class HTMLEditorKit.ParserCallback

parse

public String parse(URL url)
             throws IOException
Throws:
IOException

parse

public String parse(String text)
             throws IOException
Throws:
IOException

parse

public String parse(Reader r)
             throws IOException
Throws:
IOException

title

public String title()


Stanford NLP Group