edu.stanford.nlp.ie.hmm
Class Corpus

java.lang.Object
  |
  +--java.util.AbstractCollection
        |
        +--java.util.AbstractList
              |
              +--edu.stanford.nlp.dbm.AbstractDataCollection
                    |
                    +--edu.stanford.nlp.ie.hmm.Corpus
All Implemented Interfaces:
Collection, DataCollection, List

public class Corpus
extends AbstractDataCollection

Class to handle a corpus of information extraction data. A Corpus is a set of documents, each document being a list of words. Corpus objects also track the counts of each word for doing emission estimation, and know which words are contained in which states according to the partial labeling of the source data.

The set of documents is read in as a single file, containing any number of tagged documents each separated (or followed) by the ENDOFDOC token, i.e., the string "ENDOFDOC" by itself on a line.

Within a document, one basically has natural language text, but any number of states can be partially labeled by enclosing sequences of words in an XML-style tag, such as <purchaser>First Wisconsin Corp</purchaser> said it plans to acquire <acquired>Shelard Bancshares Inc</acquired>. In a particular use, some tags will be treated as target tags and the tagging of those words will be returned by the tokenizer, while other tags will be ignored.
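
Based on the format described above, a minimal input file might look like the following (the document contents are illustrative; only the tag syntax and the ENDOFDOC convention come from this class's documentation):

```
<purchaser>First Wisconsin Corp</purchaser> said it plans to acquire
<acquired>Shelard Bancshares Inc</acquired>.
ENDOFDOC
A second document, which may contain no tags at all.
ENDOFDOC
```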

See Also:
TaggedStreamTokenizer

Field Summary
static int NUM_MARKUP_FORMS
           
 
Fields inherited from class edu.stanford.nlp.dbm.AbstractDataCollection
data, datamatrix, features, name
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
Corpus(String[] targets)
           
Corpus(String fileName, String targetField)
          Make a Corpus from a file.
Corpus(String fileName, String[] targets)
           
 
Method Summary
 int add(Datum d)
          Inserts a Datum into the Data Collection.
 boolean add(Object o)
           
 HashMap genStarter()
          Generates a hashtable of emission probabilities using Maximum Likelihood estimation with add-one smoothing for the words in the vocabulary.
 Object get(int i)
           
 String getTargetField()
          Get field we are extracting.
 String[] getTargetFields()
          Get fields we are extracting.
 HashMap getVocab()
           
static void incrementCount(HashMap v, String s)
           
 void isolateContext()
          Isolates context of all targetFields.
static void main(String[] args)
          Simply test what gets put into a corpus.
static double[] normalize(double[] fractions)
           
 void retainOnlyTarget(String tagName)
           
 int size()
           
 Corpus[] split(double start, double[] fractions)
          Divides the corpus into corpora of sizes specified by the fractions argument.
 Corpus splitFrom(double start, double fraction)
          Returns a corpus with a fraction of the documents starting from a specified point, expressed as a fraction into the corpus.
 Corpus[] splitRandom(double[] fractions, long seed)
          Divides the corpus into corpora of sizes specified by the fractions argument.
 Corpus splitRange(double start, double end)
          Return a corpus with the subset of documents that range from the fraction into the corpus to the end fraction of the corpus.
 Corpus splitRange(double s1, double e1, double s2, double e2)
           
 String toString()
          Returns a String representation of the DBM.
 int vocabSize()
           
 int wordCount()
           
 
Methods inherited from class edu.stanford.nlp.dbm.AbstractDataCollection
dataMatrix, features, name, toXMLString
 
Methods inherited from class java.util.AbstractList
add, addAll, clear, equals, hashCode, indexOf, iterator, lastIndexOf, listIterator, listIterator, remove, removeRange, set, subList
 
Methods inherited from class java.util.AbstractCollection
addAll, contains, containsAll, isEmpty, remove, removeAll, retainAll, toArray, toArray
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
add, addAll, addAll, clear, contains, containsAll, equals, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, subList, toArray, toArray
 

Field Detail

NUM_MARKUP_FORMS

public static final int NUM_MARKUP_FORMS
See Also:
Constant Field Values
Constructor Detail

Corpus

public Corpus(String fileName,
              String targetField)
Make a Corpus from a file. Convenience constructor for when there is only one target field.

Parameters:
fileName - file to read from
targetField - field that is being extracted

Corpus

public Corpus(String fileName,
              String[] targets)
Parameters:
fileName - file to read from

Corpus

public Corpus(String[] targets)
Method Detail

add

public int add(Datum d)
Description copied from interface: DataCollection
Inserts a Datum into the Data Collection. This assigns the Datum to the lowest unassigned index in a FileDataCollection and returns that index. Note: this allows duplicate objects to be stored under different indices.


add

public boolean add(Object o)
Specified by:
add in interface List
Overrides:
add in class AbstractList

get

public Object get(int i)
Specified by:
get in interface List
Overrides:
get in class AbstractDataCollection
Returns:
The ith document in the corpus

size

public int size()
Specified by:
size in interface List
Overrides:
size in class AbstractDataCollection
Returns:
The number of documents in this corpus

getVocab

public HashMap getVocab()
Returns:
A hashtable mapping all words observed to their frequencies

wordCount

public int wordCount()
Returns:
Number of word tokens in the corpus

vocabSize

public int vocabSize()
Returns:
Number of word types (distinct words) in the corpus

getTargetFields

public String[] getTargetFields()
Get fields we are extracting.


getTargetField

public String getTargetField()
Get field we are extracting.


splitRange

public Corpus splitRange(double start,
                         double end)
Return a corpus containing the subset of documents from the start fraction of the corpus to the end fraction. When rounding, the left-hand side is inclusive and the right-hand side exclusive (except when it is 1.0). So, if there are 200 documents, numbered 0 to 199, 0.2 to 0.3 gives you documents 40 through 59.
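
The rounding convention above can be sketched as follows (this is illustrative arithmetic under the stated convention, not the actual Corpus implementation; class and method names are made up):

```java
// Sketch of the documented rounding convention: left-inclusive,
// right-exclusive, except that an end of 1.0 includes the last document.
public class SplitRangeDemo {
    /** Returns {firstIndex, lastIndex} for the given fractional range. */
    static int[] range(int numDocs, double start, double end) {
        int lo = (int) Math.round(numDocs * start);
        int hi = (end == 1.0) ? numDocs : (int) Math.round(numDocs * end);
        return new int[]{lo, hi - 1};
    }

    public static void main(String[] args) {
        int[] r = range(200, 0.2, 0.3);
        System.out.println(r[0] + " " + r[1]); // prints "40 59"
    }
}
```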


splitRange

public Corpus splitRange(double s1,
                         double e1,
                         double s2,
                         double e2)

splitFrom

public Corpus splitFrom(double start,
                        double fraction)
Returns a corpus containing a fraction of the documents, starting from a specified point expressed as a fraction into the corpus. When rounding, the left-hand side is inclusive and the right-hand side exclusive.


split

public Corpus[] split(double start,
                      double[] fractions)
Divides the corpus into corpora of sizes specified by the fractions argument. fractions is normalized to sum to 1. The documents are assigned sequentially in chunks, starting from the specified start point (expressed as a fraction into the Corpus). So if you want 75% training, 10% validation, and 15% test, pass in a fractions array containing [.75, .10, .15] in that order.

Returns:
An array containing the new corpora
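
Because fractions is normalized before use, a minimal sketch of that normalization (an assumption mirroring the static normalize(double[]) helper in this class, not the actual source) is:

```java
// Illustrative normalization of a fractions array so it sums to 1.
public class NormalizeDemo {
    static double[] normalize(double[] fractions) {
        double sum = 0.0;
        for (double f : fractions) sum += f;
        double[] out = new double[fractions.length];
        for (int i = 0; i < fractions.length; i++) {
            out[i] = fractions[i] / sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // [3, 1, 1] normalizes to [0.6, 0.2, 0.2]
        double[] n = normalize(new double[]{3.0, 1.0, 1.0});
        System.out.println(n[0] + " " + n[1] + " " + n[2]); // prints "0.6 0.2 0.2"
    }
}
```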

splitRandom

public Corpus[] splitRandom(double[] fractions,
                            long seed)
Divides the corpus into corpora of sizes specified by the fractions argument. fractions is normalized to 1. Attempts to distribute the documents randomly among the new corpora. Specifying the same seed on the same machine should produce the same split.

Returns:
An array containing the new corpora
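
The seed-determinism claim can be illustrated with java.util.Random and a seeded shuffle of document indices (a sketch of the general technique; the class's actual randomization code may differ):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// A seeded shuffle is deterministic: the same seed yields the same
// permutation of document indices, and hence the same split.
public class SeededSplitDemo {
    static List<Integer> shuffledIndices(int numDocs, long seed) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < numDocs; i++) docs.add(i);
        Collections.shuffle(docs, new Random(seed));
        return docs;
    }

    public static void main(String[] args) {
        boolean same = shuffledIndices(5, 42L).equals(shuffledIndices(5, 42L));
        System.out.println(same); // prints "true"
    }
}
```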

normalize

public static double[] normalize(double[] fractions)

incrementCount

public static void incrementCount(HashMap v,
                                  String s)

genStarter

public HashMap genStarter()
Generates a hashtable of emission probabilities using Maximum Likelihood estimation with add-one smoothing for the words in the vocabulary.

Returns:
the resulting hashtable mapping words to probabilities
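
Add-one (Laplace) smoothed maximum-likelihood estimation, which genStarter() is documented to use, can be sketched as follows (method name, generics, and structure here are illustrative assumptions, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Each word's probability is (count + 1) / (totalTokens + vocabSize),
// so some probability mass is reserved and no estimate is zero.
public class AddOneDemo {
    static HashMap<String, Double> smoothedEmissions(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) total += c;
        int vocab = counts.size();
        HashMap<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), (e.getValue() + 1.0) / (total + vocab));
        }
        return probs;
    }
}
```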

retainOnlyTarget

public void retainOnlyTarget(String tagName)

isolateContext

public void isolateContext()
Isolates the context of all targetFields. That is, this goes through the documents in the corpus, and in each document, wherever a sequence of one or more words belonging to a target state is found, it is replaced with a single special target-state observation token.
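
The run-collapsing step described above can be sketched like this (a hedged illustration; the method signature, labeling representation, and token name are assumptions, not the Corpus code):

```java
import java.util.ArrayList;
import java.util.List;

// Collapses each run of target-state words into a single special token,
// leaving non-target words untouched.
public class IsolateContextDemo {
    static List<String> collapseTargets(List<String> words,
                                        List<Boolean> isTarget,
                                        String targetToken) {
        List<String> out = new ArrayList<>();
        boolean inRun = false;
        for (int i = 0; i < words.size(); i++) {
            if (isTarget.get(i)) {
                if (!inRun) out.add(targetToken); // one token per run
                inRun = true;
            } else {
                out.add(words.get(i));
                inRun = false;
            }
        }
        return out;
    }
}
```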


toString

public String toString()
Description copied from interface: DataCollection
Returns a String representation of the DBM.

Specified by:
toString in interface DataCollection
Overrides:
toString in class AbstractDataCollection

main

public static void main(String[] args)
Simply test what gets put into a corpus.
Usage: java edu.stanford.nlp.ie.hmm.Corpus corpusFile targetFields*



Stanford NLP Group