edu.stanford.nlp.ie.hmm
Class Corpus

java.lang.Object
  |
  +--java.util.AbstractCollection
        |
        +--java.util.AbstractList
              |
              +--edu.stanford.nlp.dbm.AbstractDataCollection
                    |
                    +--edu.stanford.nlp.ie.hmm.Corpus
All Implemented Interfaces:
Collection, DataCollection, List

public class Corpus
extends AbstractDataCollection

Class to handle a corpus of information extraction data. A Corpus is a set of documents, each document being a list of words. Corpus objects also track the counts of each word for doing emission estimation, and know which words are contained in which states according to the partial labeling of the source data.

The set of documents is read in as a single file, containing any number of tagged documents each separated (or followed) by the ENDOFDOC token, i.e., the string "ENDOFDOC" by itself on a line.

Within a document, one basically has natural language text, but any number of states can be partially labeled by enclosing sequences of words in an XML-style tag, such as <purchaser>First Wisconsin Corp</purchaser> said it plans to acquire <acquired>Shelard Bancshares Inc</acquired>. In a particular use, some tags will be treated as target tags and the tagging of those words will be returned by the tokenizer, while other tags will be ignored.
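
Based on the format described above, a minimal input file might look like the following (the document contents are illustrative; only the tag syntax and the ENDOFDOC convention come from this class's documentation):

```
<purchaser>First Wisconsin Corp</purchaser> said it plans to acquire
<acquired>Shelard Bancshares Inc</acquired>.
ENDOFDOC
A second document, which may contain no tags at all.
ENDOFDOC
```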

See Also:
TaggedStreamTokenizer

Field Summary
static int NUM_MARKUP_FORMS
           
 
Fields inherited from class edu.stanford.nlp.dbm.AbstractDataCollection
data, datamatrix, features, name
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
Corpus(String[] targets)
           
Corpus(String fileName, String targetField)
          Make a Corpus from a file.
Corpus(String fileName, String[] targets)
           
 
Method Summary
 int add(Datum d)
          Inserts a Datum into the Data Collection.
 boolean add(Object o)
           
 HashMap genStarter()
          Generates a hashtable of emission probabilities using Maximum Likelihood estimation with add-one smoothing for the words in the vocabulary.
 Object get(int i)
           
 String getTargetField()
          Get field we are extracting.
 String[] getTargetFields()
          Get fields we are extracting.
 HashMap getVocab()
           
static void incrementCount(HashMap v, String s)
           
 void isolateContext()
          Isolates context of all targetFields.
static void main(String[] args)
          Simply test what gets put into a corpus.
static double[] normalize(double[] fractions)
           
 void retainOnlyTarget(String tagName)
           
 int size()
           
 Corpus[] split(double start, double[] fractions)
          Divides the corpus into corpora of sizes specified by the fractions argument.
 Corpus splitFrom(double start, double fraction)
          Returns a corpus with a fraction of the documents starting from a specified point, expressed as a fraction into the corpus.
 Corpus[] splitRandom(double[] fractions, long seed)
          Divides the corpus into corpora of sizes specified by the fractions argument.
 Corpus splitRange(double start, double end)
          Return a corpus with the subset of documents that range from the fraction into the corpus to the end fraction of the corpus.
 Corpus splitRange(double s1, double e1, double s2, double e2)
           
 String toString()
          Returns a String representation of the DBM.
 int vocabSize()
           
 int wordCount()
           
 
Methods inherited from class edu.stanford.nlp.dbm.AbstractDataCollection
dataMatrix, features, name, toXMLString
 
Methods inherited from class java.util.AbstractList
add, addAll, clear, equals, hashCode, indexOf, iterator, lastIndexOf, listIterator, listIterator, remove, removeRange, set, subList
 
Methods inherited from class java.util.AbstractCollection
addAll, contains, containsAll, isEmpty, remove, removeAll, retainAll, toArray, toArray
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
add, addAll, addAll, clear, contains, containsAll, equals, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, subList, toArray, toArray
 

Field Detail

NUM_MARKUP_FORMS

public static final int NUM_MARKUP_FORMS
See Also:
Constant Field Values
Constructor Detail

Corpus

public Corpus(String fileName,
              String targetField)
Make a Corpus from a file. Convenience constructor for when there is only one target field.

Parameters:
fileName - file to read from
targetField - field that is being extracted

Corpus

public Corpus(String fileName,
              String[] targets)
Parameters:
fileName - file to read from

Corpus

public Corpus(String[] targets)
Method Detail

add

public int add(Datum d)
Description copied from interface: DataCollection
Inserts a Datum into the Data Collection. This assigns the Datum to the lowest unassigned index in a FileDataCollection and returns that index. Note: this allows duplicate objects to be stored under different indices.


add

public boolean add(Object o)
Specified by:
add in interface List
Overrides:
add in class AbstractList

get

public Object get(int i)
Specified by:
get in interface List
Overrides:
get in class AbstractDataCollection
Returns:
The ith document in the corpus

size

public int size()
Specified by:
size in interface List
Overrides:
size in class AbstractDataCollection
Returns:
The number of documents in this corpus

getVocab

public HashMap getVocab()
Returns:
A hashtable mapping all words observed to their frequencies

wordCount

public int wordCount()
Returns:
Number of word tokens in the corpus

vocabSize

public int vocabSize()
Returns:
Number of word types (distinct words) in the corpus

getTargetFields

public String[] getTargetFields()
Get fields we are extracting.


getTargetField

public String getTargetField()
Get field we are extracting.


splitRange

public Corpus splitRange(double start,
                         double end)
Return a corpus containing the subset of documents from the start fraction of the corpus to the end fraction. When rounding, the left-hand side is inclusive and the right-hand side exclusive (except when it is 1.0). So, if there are 200 documents, numbered 0 to 199, 0.2 to 0.3 gives you documents 40 through 59.
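
The rounding convention above can be sketched as follows (this is illustrative arithmetic under the stated convention, not the actual Corpus implementation; class and method names are made up):

```java
// Sketch of the documented rounding convention: left-inclusive,
// right-exclusive, except that an end of 1.0 includes the last document.
public class SplitRangeDemo {
    /** Returns {firstIndex, lastIndex} for the given fractional range. */
    static int[] range(int numDocs, double start, double end) {
        int lo = (int) Math.round(numDocs * start);
        int hi = (end == 1.0) ? numDocs : (int) Math.round(numDocs * end);
        return new int[]{lo, hi - 1};
    }

    public static void main(String[] args) {
        int[] r = range(200, 0.2, 0.3);
        System.out.println(r[0] + " " + r[1]); // prints "40 59"
    }
}
```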


splitRange

public Corpus splitRange(double s1,
                         double e1,
                         double s2,
                         double e2)

splitFrom

public Corpus splitFrom(double start,
                        double fraction)
Returns a corpus containing a fraction of the documents, starting from a specified point expressed as a fraction into the corpus. When rounding, the left-hand side is inclusive and the right-hand side exclusive.


split

public Corpus[] split(double start,
                      double[] fractions)
Divides the corpus into corpora of sizes specified by the fractions argument. fractions is normalized to sum to 1. The documents are assigned sequentially in chunks, starting from the specified start point (expressed as a fraction into the Corpus). So if you want 75% training, 10% validation, and 15% test, pass in a fractions array containing [.75, .10, .15] in that order.

Returns:
An array containing the new corpora
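
Because fractions is normalized before use, a minimal sketch of that normalization (an assumption mirroring the static normalize(double[]) helper in this class, not the actual source) is:

```java
// Illustrative normalization of a fractions array so it sums to 1.
public class NormalizeDemo {
    static double[] normalize(double[] fractions) {
        double sum = 0.0;
        for (double f : fractions) sum += f;
        double[] out = new double[fractions.length];
        for (int i = 0; i < fractions.length; i++) {
            out[i] = fractions[i] / sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // [3, 1, 1] normalizes to [0.6, 0.2, 0.2]
        double[] n = normalize(new double[]{3.0, 1.0, 1.0});
        System.out.println(n[0] + " " + n[1] + " " + n[2]); // prints "0.6 0.2 0.2"
    }
}
```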

splitRandom

public Corpus[] splitRandom(double[] fractions,
                            long seed)
Divides the corpus into corpora of sizes specified by the fractions argument. fractions is normalized to 1. Attempts to distribute the documents randomly among the new corpora. Specifying the same seed on the same machine should produce the same split.

Returns:
An array containing the new corpora
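
The seed-determinism claim can be illustrated with java.util.Random and a seeded shuffle of document indices (a sketch of the general technique; the class's actual randomization code may differ):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// A seeded shuffle is deterministic: the same seed yields the same
// permutation of document indices, and hence the same split.
public class SeededSplitDemo {
    static List<Integer> shuffledIndices(int numDocs, long seed) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < numDocs; i++) docs.add(i);
        Collections.shuffle(docs, new Random(seed));
        return docs;
    }

    public static void main(String[] args) {
        boolean same = shuffledIndices(5, 42L).equals(shuffledIndices(5, 42L));
        System.out.println(same); // prints "true"
    }
}
```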

normalize

public static double[] normalize(double[] fractions)

incrementCount

public static void incrementCount(HashMap v,
                                  String s)

genStarter

public HashMap genStarter()
Generates a hashtable of emission probabilities using Maximum Likelihood estimation with add-one smoothing for the words in the vocabulary.

Returns:
the resulting hashtable mapping words to probabilities
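
Add-one (Laplace) smoothed maximum-likelihood estimation, which genStarter() is documented to use, can be sketched as follows (method name, generics, and structure here are illustrative assumptions, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Each word's probability is (count + 1) / (totalTokens + vocabSize),
// so some probability mass is reserved and no estimate is zero.
public class AddOneDemo {
    static HashMap<String, Double> smoothedEmissions(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) total += c;
        int vocab = counts.size();
        HashMap<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), (e.getValue() + 1.0) / (total + vocab));
        }
        return probs;
    }
}
```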

retainOnlyTarget

public void retainOnlyTarget(String tagName)

isolateContext

public void isolateContext()
Isolates the context of all targetFields. That is, this goes through the documents in the corpus, and in each document, wherever a sequence of one or more words belonging to a target state is found, it is replaced with a single special target-state observation token.
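
The run-collapsing step described above can be sketched like this (a hedged illustration; the method signature, labeling representation, and token name are assumptions, not the Corpus code):

```java
import java.util.ArrayList;
import java.util.List;

// Collapses each run of target-state words into a single special token,
// leaving non-target words untouched.
public class IsolateContextDemo {
    static List<String> collapseTargets(List<String> words,
                                        List<Boolean> isTarget,
                                        String targetToken) {
        List<String> out = new ArrayList<>();
        boolean inRun = false;
        for (int i = 0; i < words.size(); i++) {
            if (isTarget.get(i)) {
                if (!inRun) out.add(targetToken); // one token per run
                inRun = true;
            } else {
                out.add(words.get(i));
                inRun = false;
            }
        }
        return out;
    }
}
```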


toString

public String toString()
Description copied from interface: DataCollection
Returns a String representation of the DBM.

Specified by:
toString in interface DataCollection
Overrides:
toString in class AbstractDataCollection

main

public static void main(String[] args)
Simply test what gets put into a corpus.
Usage: java edu.stanford.nlp.ie.hmm.Corpus corpusFile targetFields*



Stanford NLP Group