edu.stanford.nlp.dbm
Class HTMLDocument

java.lang.Object
  |
  +--java.util.AbstractCollection
        |
        +--java.util.AbstractList
              |
              +--java.util.ArrayList
                    |
                    +--edu.stanford.nlp.dbm.BasicDocument
                          |
                          +--edu.stanford.nlp.dbm.HTMLDocument
All Implemented Interfaces:
Cloneable, Collection, Datum, Document, Featurizable, Labeled, List, RandomAccess, Serializable
Direct Known Subclasses:
LocusLinkDocument, USPDIDocument

public class HTMLDocument
extends BasicDocument

The HTMLDocument class implements the Document methods for an HTML-encoded document. The title() method returns the title of the HTML document, or an empty string if there is no TITLE tag. The text() method returns all text that is not part of a tag. Subclasses may override handleText(), handleComment(), handleStartTag(), and the other handler methods so that text() returns something other than the full text of the web page (for example, only part of the text, or only the links). The constructor for an HTML document takes a URL as its argument, not a string of HTML code.
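As an illustration of the "only the links" case, here is a minimal stand-alone sketch. It does not use the Stanford classes. The handler names above match those of javax.swing.text.html.HTMLEditorKit.ParserCallback in the JDK, so the sketch overrides handleStartTag() on that callback; whether HTMLDocument is actually built on it is an assumption. LinkCollector and linksIn are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for a subclass that keeps only the links of a page.
class LinkCollector extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();

    // Called once per opening tag; record the href of every anchor.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Parses an HTML string with the JDK's built-in parser and returns the hrefs found.
    static List<String> linksIn(String html) {
        LinkCollector collector = new LinkCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/a\">A</a> and "
                    + "<a href=\"/b\">B</a></body></html>";
        System.out.println(linksIn(html));
    }
}
```

Overriding handleText() or handleComment() in the same way would let a subclass keep or discard other parts of the page.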

See Also:
Serialized Form

Field Summary
protected  String parsedText
           
 
Fields inherited from class edu.stanford.nlp.dbm.BasicDocument
labels, originalText, title
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
HTMLDocument()
           
 
Method Summary
 String getParsedText()
          Returns the text of the document that was used to populate the words (i.e., with all tags stripped).
protected  void parse(String text)
          Parses the given HTML text so that only the true text is used to make the word list (i.e., all tags and other markup are stripped).
 
Methods inherited from class edu.stanford.nlp.dbm.BasicDocument
addLabel, asFeatures, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, label, labels, main, originalText, presentableText, setLabel, setLabels, setTitle, title
 
Methods inherited from class java.util.ArrayList
add, add, addAll, addAll, clear, clone, contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, removeRange, set, size, toArray, toArray, trimToSize
 
Methods inherited from class java.util.AbstractList
equals, hashCode, iterator, listIterator, listIterator, subList
 
Methods inherited from class java.util.AbstractCollection
containsAll, remove, removeAll, retainAll, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray
 

Field Detail

parsedText

protected String parsedText
Constructor Detail

HTMLDocument

public HTMLDocument()
Method Detail

parse

protected void parse(String text)
Parses the given HTML text so that only the true text is used to make the word list (i.e., all tags and other markup are stripped). Also takes the contents of the TITLE tag to be the document's title.

Overrides:
parse in class BasicDocument
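
The behavior described for parse can be approximated by the following naive, regex-based sketch. It is not the class's actual implementation, and NaiveHtmlStripper and its methods are hypothetical names. It assumes the TITLE contents are kept out of the body text, and it does not handle comments, scripts, or character entities.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified illustration of tag stripping plus title extraction.
class NaiveHtmlStripper {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the TITLE contents, or an empty string if there is no TITLE tag.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Drops the TITLE element (an assumption), removes every remaining tag,
    // and collapses runs of whitespace into single spaces.
    static String strip(String html) {
        String withoutTitle = TITLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutTitle).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                    + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(title(html)); // prints "Demo"
        System.out.println(strip(html)); // prints "Hello world"
    }
}
```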

getParsedText

public String getParsedText()
Returns the text of the document that was used to populate the words (i.e., with all tags stripped).



Stanford NLP Group