HTMLDocument (Stanford JavaNLP API)

edu.stanford.nlp.dbm
Class HTMLDocument

java.lang.Object
  |
  +--java.util.AbstractCollection
        |
        +--java.util.AbstractList
              |
              +--java.util.ArrayList
                    |
                    +--edu.stanford.nlp.dbm.BasicDocument
                          |
                          +--edu.stanford.nlp.dbm.HTMLDocument

All Implemented Interfaces: Cloneable, Collection, Datum, Document, Featurizable, Labeled, List, RandomAccess, Serializable

Direct Known Subclasses: LocusLinkDocument, USPDIDocument

public class HTMLDocument
extends BasicDocument

The HTMLDocument class implements the Document methods for an HTML-encoded document. The title() method returns the title of the HTML document, or an empty string if there is no TITLE tag. The text() method returns all the text that is not part of a tag. Subclasses may override handleText(), handleComment(), handleStartTag(), and related methods so that text() returns something other than the full text of the web page (for example, only part of the text, or only the links). The constructor for an HTML document takes as its argument a URL, not a string of HTML code.
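The idea of overriding a handler so that only the links are returned can be illustrated with a minimal sketch. This page does not show the signatures of HTMLDocument's own handler methods, so the sketch assumes nothing about them and instead uses the JDK's javax.swing.text.html.HTMLEditorKit.ParserCallback, which declares callbacks with the same names. The class LinkCollectorSketch and its links() method are hypothetical and are not part of edu.stanford.nlp.dbm.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper for illustration; not part of edu.stanford.nlp.dbm. */
final class LinkCollectorSketch {

    /** Returns the href value of every A tag in the given HTML, in document order. */
    static List<String> links(String html) {
        final List<String> hrefs = new ArrayList<>();

        // Override only the start-tag callback; text and comments are ignored.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };

        try {
            // The JDK's Swing HTML parser drives the callback as it reads the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/a\">A</a> and "
                + "<a href=\"/b\">B</a></body></html>";
        for (String href : links(html)) {
            System.out.println(href);
        }
    }
}
```

Because the callback overrides only handleStartTag(), all text and comments are dropped and only link targets survive, which is the kind of selective output the class description has in mind.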

See Also: Serialized Form

Field Summary

protected String parsedText


Fields inherited from class edu.stanford.nlp.dbm.BasicDocument

labels, originalText, title

Fields inherited from class java.util.AbstractList

modCount

Constructor Summary

HTMLDocument()


Method Summary

String getParsedText()
          Returns the text of the document that was used to populate the words (i.e., with all tags stripped).

protected void parse(String text)
          Parses the given HTML text so that only the actual text is used to build the word list (i.e., all tags and other markup are stripped).

Methods inherited from class edu.stanford.nlp.dbm.BasicDocument

addLabel, asFeatures, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, init, label, labels, main, originalText, presentableText, setLabel, setLabels, setTitle, title

Methods inherited from class java.util.ArrayList

add, add, addAll, addAll, clear, clone, contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, removeRange, set, size, toArray, toArray, trimToSize

Methods inherited from class java.util.AbstractList

equals, hashCode, iterator, listIterator, listIterator, subList

Methods inherited from class java.util.AbstractCollection

containsAll, remove, removeAll, retainAll, toString

Methods inherited from class java.lang.Object

finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface java.util.List

add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray

Field Detail

parsedText

protected String parsedText

Constructor Detail

HTMLDocument

public HTMLDocument()

Method Detail

parse

protected void parse(String text)

Parses the given HTML text so that only the actual text is used to build the word list (i.e., all tags and other markup are stripped). Also takes the contents of the TITLE tag as the document's title.

Overrides: parse in class BasicDocument
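The behavior described for parse (strip every tag, keep the remaining text, and take the TITLE contents as the title) can be sketched as follows. This is not the implementation of HTMLDocument.parse, whose source is not shown here; it is a standalone approximation that assumes the JDK's Swing HTML parser (ParserDelegator with an HTMLEditorKit.ParserCallback). The names TagStripSketch, Result, and strip() are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper for illustration; not part of edu.stanford.nlp.dbm. */
final class TagStripSketch {

    /** Holds the two pieces the parse description mentions: title and tag-free text. */
    static final class Result {
        final String title;
        final String text;

        Result(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /** Splits HTML into its TITLE contents and the remaining non-tag text. */
    static Result strip(String html) {
        final StringBuilder title = new StringBuilder();
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text inside TITLE becomes the title; everything else is body text.
                (inTitle ? title : text).append(data).append(' ');
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Result(normalize(title), normalize(text));
    }

    /** Collapses runs of whitespace and trims the ends. */
    private static String normalize(CharSequence s) {
        return s.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        Result r = strip("<html><head><title>Hi</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>");
        System.out.println("title: " + r.title);
        System.out.println("text:  " + r.text);
    }
}
```

When the input has no TITLE tag, the title in this sketch is the empty string, which matches the behavior the class description gives for title().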

getParsedText

public String getParsedText()

Returns the text of the document that was used to populate the words (i.e., with all tags stripped).