edu.stanford.nlp.dbm
Class BasicDocument

java.lang.Object
  |
  +--java.util.AbstractCollection
        |
        +--java.util.AbstractList
              |
              +--java.util.ArrayList
                    |
                    +--edu.stanford.nlp.dbm.BasicDocument
All Implemented Interfaces:
Cloneable, Collection, Datum, Document, Featurizable, Labeled, List, RandomAccess, Serializable
Direct Known Subclasses:
Context, CranDocument, HTMLDocument, MedlineDocument, OhsumedDocument, TypedTaggedDocument

public class BasicDocument
extends ArrayList
implements Document

Basic implementation of Document that should be suitable for most needs. BasicDocument is an ArrayList for storing words and performs tokenization during construction. Override parse(java.lang.String) to provide support for custom document formats or to do a custom job of tokenization. BasicDocument should only be used for documents that are small enough to store in memory.

See Also:
Serialized Form
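
The class description above recommends overriding parse(java.lang.String) for custom tokenization. The following is a minimal sketch of that pattern, not the actual BasicDocument source. SketchDocument, CsvSketchDocument and OverrideParseDemo are hypothetical stand-ins, and the whitespace tokenizer is an assumption made so the example runs without the Stanford library.

```java
import java.util.ArrayList;

// Hypothetical stand-in for BasicDocument: a list of words filled in by init().
class SketchDocument extends ArrayList<String> {
    // Tokenizes the text and returns this, mirroring the construct-then-init idiom.
    SketchDocument init(String text) {
        parse(text == null ? "" : text);
        return this;
    }

    // Assumed default tokenization: split on runs of whitespace.
    protected void parse(String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }
}

// Custom format: fields separated by commas instead of whitespace.
class CsvSketchDocument extends SketchDocument {
    @Override
    protected void parse(String text) {
        for (String field : text.split(",")) {
            String trimmed = field.trim();
            if (!trimmed.isEmpty()) {
                add(trimmed);
            }
        }
    }
}

class OverrideParseDemo {
    public static void main(String[] args) {
        System.out.println(new SketchDocument().init("to be or not"));
        System.out.println(new CsvSketchDocument().init("alpha, beta gamma ,delta"));
    }
}
```

Because init calls parse, a subclass only needs to replace the tokenization step and inherits every init variant unchanged.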

Field Summary
protected  List labels
          Label(s) for this document.
protected  String originalText
          original text of this document (may be null).
protected  String title
          title of this document (never null).
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
BasicDocument()
          Constructs a new (empty) BasicDocument.
 
Method Summary
 void addLabel(Label label)
          Adds the given Label to the List of labels for this Document if it is not null.
 Collection asFeatures()
          Returns this (the features are the list of words).
 BasicDocument init()
          Calls init((String)null,null,true)
 BasicDocument init(File textFile)
          Calls init(textFile,textFile.getCanonicalPath(),true)
 BasicDocument init(File textFile, boolean keepOriginalText)
          Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText)
 BasicDocument init(File textFile, String title)
          Calls init(textFile,title,true)
 BasicDocument init(File textFile, String title, boolean keepOriginalText)
          Inits a new BasicDocument by reading in the text from the given File.
 BasicDocument init(List words)
          Calls init(words,null)
 BasicDocument init(List words, String title)
          Inits a new BasicDocument with the given list of words and title.
 BasicDocument init(Reader textReader)
          Calls init(textReader,null,true)
 BasicDocument init(Reader textReader, boolean keepOriginalText)
          Calls init(textReader,null,keepOriginalText)
 BasicDocument init(Reader textReader, String title)
          Calls init(textReader,title,true)
 BasicDocument init(Reader textReader, String title, boolean keepOriginalText)
          Inits a new BasicDocument by reading in the text from the given Reader.
 BasicDocument init(String text)
          Calls init(text,null,true)
 BasicDocument init(String text, boolean keepOriginalText)
          Calls init(text,null,keepOriginalText)
 BasicDocument init(String text, String title)
          Calls init(text,title,true)
 BasicDocument init(String text, String title, boolean keepOriginalText)
          Inits a new BasicDocument with the given text contents and title.
 BasicDocument init(URL textURL)
          Calls init(textURL,textURL.toExternalForm(),true)
 BasicDocument init(URL textURL, boolean keepOriginalText)
          Calls init(textURL,textURL.toExternalForm(),keepOriginalText)
 BasicDocument init(URL textURL, String title)
          Calls init(textURL,title,true)
 BasicDocument init(URL textURL, String title, boolean keepOriginalText)
          Inits a new BasicDocument by reading in the text from the given URL.
 Label label()
          Returns the first label for this Document, or null if none have been set.
 Collection labels()
          Returns the complete List of labels for this Document.
static void main(String[] args)
          For internal debugging purposes only.
 String originalText()
          Returns the text originally used to construct this document, or null if there was no original text.
protected  void parse(String text)
          Tokenizes the given text to populate the list of Words this Document represents.
 String presentableText()
          Returns a "pretty" version of the words in this Document suitable for display.
 void setLabel(Label label)
          Removes all currently assigned Labels for this Document then adds the given Label.
 void setLabels(Collection labels)
          Removes all currently assigned labels for this Document then adds all of the given Labels.
 void setTitle(String title)
          Sets the title of this Document to the given title.
 String title()
          Returns the title of this document.
 
Methods inherited from class java.util.ArrayList
add, add, addAll, addAll, clear, clone, contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, removeRange, set, size, toArray, toArray, trimToSize
 
Methods inherited from class java.util.AbstractList
equals, hashCode, iterator, listIterator, listIterator, subList
 
Methods inherited from class java.util.AbstractCollection
containsAll, remove, removeAll, retainAll, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray
 

Field Detail

title

protected String title
title of this document (never null).


originalText

protected String originalText
original text of this document (may be null).


labels

protected final List labels
Label(s) for this document.

Constructor Detail

BasicDocument

public BasicDocument()
Constructs a new (empty) BasicDocument. Call one of the init methods to populate the document from the desired source.

Method Detail

init

public BasicDocument init(String text,
                          String title,
                          boolean keepOriginalText)
Inits a new BasicDocument with the given text contents and title. The text is tokenized using parse(java.lang.String) to populate the list of words ("" is used if text is null). If keepOriginalText is true, a reference to the original text is also retained so that originalText() returns the text passed to this method. Returns a reference to this BasicDocument for convenience (so init can be used like a constructor, but is inherited by subclasses).
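
The behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: InitSketchDocument and InitSketchDemo are hypothetical names, the whitespace tokenizer is assumed, and storing the passed-in reference (so that a null text leaves originalText null) is one possible reading of the description.

```java
import java.util.ArrayList;

// Hypothetical stand-in that mimics the documented init(String,String,boolean) contract.
class InitSketchDocument extends ArrayList<String> {
    protected String title = "";
    protected String originalText;

    InitSketchDocument init(String text, String title, boolean keepOriginalText) {
        this.title = (title == null) ? "" : title;        // title is never null
        this.originalText = keepOriginalText ? text : null; // assumption: keep the reference as given
        parse(text == null ? "" : text);                  // "" is used if text is null
        return this;                                      // allows new X().init(...)
    }

    // Convenience overload, as in "Calls init(text,null,true)".
    InitSketchDocument init(String text) {
        return init(text, null, true);
    }

    // Assumed tokenization: split on runs of whitespace.
    protected void parse(String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }

    String title() { return title; }

    String originalText() { return originalText; }
}

class InitSketchDemo {
    public static void main(String[] args) {
        InitSketchDocument doc = new InitSketchDocument().init("the quick brown fox", "fox", false);
        System.out.println(doc.size() + " words, title=" + doc.title());
    }
}
```

Returning this from init lets a caller construct and populate a document in one expression, and the same chain works for any subclass with a no-argument constructor.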


init

public BasicDocument init(String text,
                          String title)
Calls init(text,title,true)


init

public BasicDocument init(String text,
                          boolean keepOriginalText)
Calls init(text,null,keepOriginalText)


init

public BasicDocument init(String text)
Calls init(text,null,true)


init

public BasicDocument init()
Calls init((String)null,null,true)


init

public BasicDocument init(Reader textReader,
                          String title,
                          boolean keepOriginalText)
                   throws IOException
Inits a new BasicDocument by reading in the text from the given Reader.

Throws:
IOException
See Also:
init(String,String,boolean)
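
The See Also link suggests that the Reader variant ends up in init(String,String,boolean). One plausible way to get there is to drain the Reader into a String first, as sketched below. ReaderSketch and readAll are hypothetical names and are not part of the documented API. Whether the real method closes the Reader is not stated, so this sketch leaves it open.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderSketch {
    // Reads everything the Reader provides into one String (hypothetical helper).
    static String readAll(Reader textReader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = textReader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = readAll(new StringReader("line one\nline two"));
        // The resulting String could then be passed to a String-based init method.
        System.out.println(text);
    }
}
```

Buffering the whole input this way matches the class-level note that BasicDocument is intended only for documents small enough to hold in memory.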

init

public BasicDocument init(Reader textReader,
                          String title)
                   throws IOException
Calls init(textReader,title,true)

Throws:
IOException

init

public BasicDocument init(Reader textReader,
                          boolean keepOriginalText)
                   throws IOException
Calls init(textReader,null,keepOriginalText)

Throws:
IOException

init

public BasicDocument init(Reader textReader)
                   throws IOException
Calls init(textReader,null,true)

Throws:
IOException

init

public BasicDocument init(File textFile,
                          String title,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException
Inits a new BasicDocument by reading in the text from the given File.

Throws:
FileNotFoundException
IOException
See Also:
init(String,String,boolean)

init

public BasicDocument init(File textFile,
                          String title)
                   throws FileNotFoundException,
                          IOException
Calls init(textFile,title,true)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(File textFile,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException
Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(File textFile)
                   throws FileNotFoundException,
                          IOException
Calls init(textFile,textFile.getCanonicalPath(),true)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(URL textURL,
                          String title,
                          boolean keepOriginalText)
                   throws IOException
Inits a new BasicDocument by reading in the text from the given URL.

Throws:
IOException
See Also:
init(String,String,boolean)

init

public BasicDocument init(URL textURL,
                          String title)
                   throws FileNotFoundException,
                          IOException
Calls init(textURL,title,true)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(URL textURL,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException
Calls init(textURL,textURL.toExternalForm(),keepOriginalText)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(URL textURL)
                   throws FileNotFoundException,
                          IOException
Calls init(textURL,textURL.toExternalForm(),true)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(List words,
                          String title)
Inits a new BasicDocument with the given list of words and title.


init

public BasicDocument init(List words)
Calls init(words,null)


parse

protected void parse(String text)
Tokenizes the given text to populate the list of Words this Document represents. The default implementation uses a SimpleTokenizer and tokenizes the entirety of the text into words. Subclasses should override this method to parse documents in non-standard formats, and/or to pull the title of the document from the text. The given text may be empty ("") but will never be null.
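
As an illustration of pulling the title out of the text, the sketch below treats the first line as the title and tokenizes the rest. TitledSketchDocument is a hypothetical stand-in rather than a BasicDocument subclass, and both the first-line convention and the whitespace tokenizer are assumptions chosen for the example.

```java
import java.util.ArrayList;

// Hypothetical document whose parse() extracts a title from the first line.
class TitledSketchDocument extends ArrayList<String> {
    protected String title = "";

    TitledSketchDocument init(String text) {
        parse(text == null ? "" : text); // parse never receives null
        return this;
    }

    // Assumed format: "title\nbody". Text without a newline is all body.
    protected void parse(String text) {
        int newline = text.indexOf('\n');
        String body = text;
        if (newline >= 0) {
            title = text.substring(0, newline).trim();
            body = text.substring(newline + 1);
        }
        for (String token : body.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }

    String title() { return title; }
}

class TitledSketchDemo {
    public static void main(String[] args) {
        TitledSketchDocument doc = new TitledSketchDocument().init("Moby Dick\nCall me Ishmael");
        System.out.println(doc.title() + ": " + doc);
    }
}
```

A real subclass would presumably call setTitle(String) from inside parse instead of assigning a field directly.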


asFeatures

public Collection asFeatures()
Returns this (the features are the list of words).

Specified by:
asFeatures in interface Featurizable

label

public Label label()
Returns the first label for this Document, or null if none have been set.

Specified by:
label in interface Labeled
Returns:
One of the labels of the object (if there are multiple labels, preferably the primary label, if it exists). Returns null if there is no label.

labels

public Collection labels()
Returns the complete List of labels for this Document. This is an empty collection if none have been set.

Specified by:
labels in interface Labeled
Returns:
A Collection of the Object's labels. Returns an empty Collection if there are no labels.

setLabel

public void setLabel(Label label)
Removes all currently assigned Labels for this Document then adds the given Label. Calling setLabel(null) effectively clears all labels.

Specified by:
setLabel in interface Labeled

setLabels

public void setLabels(Collection labels)
Removes all currently assigned labels for this Document then adds all of the given Labels.

Specified by:
setLabels in interface Labeled

addLabel

public void addLabel(Label label)
Adds the given Label to the List of labels for this Document if it is not null.
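
The label methods above describe a small contract: label() returns the first label or null, setLabel replaces all labels, and addLabel ignores null. A minimal sketch of that contract follows. LabelSketch is a hypothetical class that uses String in place of Label, and its handling of a null collection or null entries in setLabels is an assumption, since the description does not cover those cases.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical holder mirroring the documented label semantics with String labels.
class LabelSketch {
    private final List<String> labels = new ArrayList<String>();

    // First label, or null if none have been set.
    String label() {
        return labels.isEmpty() ? null : labels.get(0);
    }

    // Complete list of labels; empty if none have been set.
    Collection<String> labels() {
        return labels;
    }

    // Clears existing labels, then adds the given one; setLabel(null) just clears.
    void setLabel(String label) {
        labels.clear();
        addLabel(label);
    }

    // Assumption: a null collection clears the labels and null entries are skipped.
    void setLabels(Collection<String> newLabels) {
        labels.clear();
        if (newLabels != null) {
            for (String label : newLabels) {
                addLabel(label);
            }
        }
    }

    // Adds the label only if it is not null.
    void addLabel(String label) {
        if (label != null) {
            labels.add(label);
        }
    }
}

class LabelSketchDemo {
    public static void main(String[] args) {
        LabelSketch doc = new LabelSketch();
        doc.addLabel("sports");
        doc.addLabel("news");
        System.out.println(doc.label() + " of " + doc.labels());
    }
}
```

Routing setLabel through addLabel is what makes setLabel(null) clear the labels without storing a null entry.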


title

public String title()
Returns the title of this document. The title may be empty ("") but will never be null.

Specified by:
title in interface Document

setTitle

public void setTitle(String title)
Sets the title of this Document to the given title. If the given title is null, sets the title to "".


originalText

public String originalText()
Returns the text originally used to construct this document, or null if there was no original text.


presentableText

public String presentableText()

Returns a "pretty" version of the words in this Document suitable for display. The default implementation returns each of the words in this Document separated by spaces. Specifically, each element that is a Word has its Word.word() printed, and other elements are skipped.

Subclasses that maintain additional information may wish to override this method.
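
The default behavior described above (print each Word's word(), skip everything else, separate with spaces) can be sketched as follows. WordSketch and PresentableSketch are hypothetical stand-ins for the library's Word class and for the method itself. The absence of a trailing space is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the library's Word class.
class WordSketch {
    private final String word;

    WordSketch(String word) { this.word = word; }

    String word() { return word; }
}

class PresentableSketch {
    // Joins the word() of every WordSketch element with single spaces; other elements are skipped.
    static String presentableText(List<?> elements) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Object element : elements) {
            if (element instanceof WordSketch) {
                if (!first) {
                    sb.append(' ');
                }
                sb.append(((WordSketch) element).word());
                first = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Object> items = new ArrayList<Object>();
        items.add(new WordSketch("hello"));
        items.add(Integer.valueOf(42)); // not a word, so it is skipped
        items.add(new WordSketch("world"));
        System.out.println(presentableText(items));
    }
}
```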


main

public static void main(String[] args)
For internal debugging purposes only. Creates and tests various instances of BasicDocument.



Stanford NLP Group