edu.stanford.nlp.dbm
Class HTMLDocument
java.lang.Object
  |
  +--java.util.AbstractCollection
        |
        +--java.util.AbstractList
              |
              +--java.util.ArrayList
                    |
                    +--edu.stanford.nlp.dbm.BasicDocument
                          |
                          +--edu.stanford.nlp.dbm.HTMLDocument
- All Implemented Interfaces:
- Cloneable, Collection, Datum, Document, Featurizable, Labeled, List, RandomAccess, Serializable
- Direct Known Subclasses:
- LocusLinkDocument, USPDIDocument
public class HTMLDocument
extends BasicDocument
The HTMLDocument class implements the Document methods for an HTML-encoded document.
The title() method returns the title of an HTML document, or an empty string if there is no TITLE tag.
The text() method returns all the text that is not a tag.
Subclasses may override the handleText(), handleComment(), handleStartTag(), etc. methods so that the text() method returns something other than the text of the web page. (For example, one may be interested in returning only part of the text, or only the links.)
The constructor for an HTML document takes as its argument a URL, not a string of HTML code.
- See Also:
- Serialized Form
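The hook names mentioned in the class description (handleText(), handleComment(), handleStartTag()) match those of the JDK class javax.swing.text.html.HTMLEditorKit.ParserCallback. Whether HTMLDocument is built on that class is an assumption; this page does not give the hook signatures. The minimal sketch below uses the JDK callback directly, without any Stanford classes, to illustrate the "only the links" case. LinkCollectorSketch, LinkCollector, and links() are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration; not part of edu.stanford.nlp.dbm.
class LinkCollectorSketch {

    /** Callback that records the href of every anchor tag and ignores all text. */
    static class LinkCollector extends HTMLEditorKit.ParserCallback {
        final List<String> links = new ArrayList<>();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }
    }

    /** Parses an HTML string and returns the link targets in document order. */
    static List<String> links(String html) {
        LinkCollector collector = new LinkCollector();
        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.org/a\">A</a>"
                + "<p>text <a href=\"b.html\">B</a></p></body></html>";
        System.out.println(links(html));
    }
}
```

In a subclass of HTMLDocument the same idea would presumably live in an overridden handleStartTag(), with the collected links feeding whatever text() returns; the exact override points are not documented on this page.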
Method Summary
String | getParsedText() | Returns the text of the document that was used to populate the words (i.e., with all tags stripped).
protected void | parse(String text) | Parses the given HTML text so that only true text is used to make the word list (i.e., all tags, etc. are stripped).
Methods inherited from class edu.stanford.nlp.dbm.BasicDocument |
addLabel, asFeatures, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, label, labels, main, originalText, presentableText, setLabel, setLabels, setTitle, title |
Methods inherited from class java.util.ArrayList |
add, add, addAll, addAll, clear, clone, contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, removeRange, set, size, toArray, toArray, trimToSize |
Methods inherited from interface java.util.List |
add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray |
parsedText
protected String parsedText
HTMLDocument
public HTMLDocument()
parse
protected void parse(String text)
- Parses the given HTML text so that only true text is used to make the word list (i.e., all tags, etc. are stripped). Also takes the contents of the TITLE tag to be the document's title.
- Overrides:
- parse in class BasicDocument
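The behavior described for parse(String text), stripping tags and taking the TITLE tag as the title, can be sketched with the JDK's own HTML parser. This is not the Stanford implementation, whose source is not shown on this page; it is a minimal illustration that assumes a callback-style parser. TagStripSketch, TextCollector, and collect() are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration; not part of edu.stanford.nlp.dbm.
class TagStripSketch {

    /** Callback that separates the TITLE contents from the remaining non-tag text. */
    static class TextCollector extends HTMLEditorKit.ParserCallback {
        private final StringBuilder title = new StringBuilder();
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle = false;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = true;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = false;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inTitle) {
                title.append(data);
            } else {
                // Separate successive text runs with a single space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        }

        /** Title text, or an empty string if no TITLE tag was seen. */
        String title() {
            return title.toString().trim();
        }

        /** All text found outside the TITLE tag, with tags removed. */
        String text() {
            return text.toString();
        }
    }

    /** Parses an HTML string and returns the populated collector. */
    static TextCollector collect(String html) {
        TextCollector collector = new TextCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }

    public static void main(String[] args) {
        TextCollector c = collect("<html><head><title>Demo</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>");
        System.out.println("title: " + c.title());
        System.out.println("text:  " + c.text());
    }
}
```

Keeping the title and body text in separate buffers mirrors the split between title() and getParsedText() on this page; how whitespace between text runs is normalized is a choice of this sketch, not something the page specifies.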
getParsedText
public String getParsedText()
- Returns the text of the document that was used to populate the words (i.e., with all tags stripped).
Stanford NLP Group