java.lang.Object
  |
  +--java.util.AbstractCollection
        |
        +--java.util.AbstractList
              |
              +--edu.stanford.nlp.dbm.AbstractDataCollection
                    |
                    +--edu.stanford.nlp.ie.hmm.Corpus
Class to handle a corpus of information extraction data. A Corpus is a set of documents, each document being a list of words. Corpus objects also track the counts of each word for doing emission estimation, and know which words are contained in which states according to the partial labeling of the source data.
The set of documents is read in as a single file, containing any number of tagged documents each separated (or followed) by the ENDOFDOC token, i.e., the string "ENDOFDOC" by itself on a line.
Within a document, the content is essentially natural language text, but any number of states can be partially labeled by enclosing sequences of words in an XML-style tag, such as:
<purchaser>First Wisconsin Corp</purchaser> said it plans to acquire <acquired>Shelard Bancshares Inc</acquired>.
In a particular use, some tags will be treated as target tags, and the tagging of those words will be returned by the tokenizer, while other tags will be ignored.
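The input format described above can be illustrated with a small standalone parser. This is only a sketch and not the library's TaggedStreamTokenizer: the names CorpusFormatSketch, LabeledWord, and parse are hypothetical, and it assumes that tags do not nest and that words are split on whitespace and tag boundaries only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical reader for the ENDOFDOC / XML-style tag format. */
final class CorpusFormatSketch {

    /** A word paired with its target tag, or null if it is untagged or its tag is ignored. */
    static final class LabeledWord {
        final String word;
        final String label;

        LabeledWord(String word, String label) {
            this.word = word;
            this.label = label;
        }
    }

    // Matches an opening tag, a closing tag, or a run of characters that is neither space nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<(/?)(\\w+)>|[^\\s<]+");

    /** Splits text into documents at ENDOFDOC lines and labels words inside target tags. */
    static List<List<LabeledWord>> parse(String text, List<String> targets) {
        List<List<LabeledWord>> docs = new ArrayList<>();
        List<LabeledWord> current = new ArrayList<>();
        String openTag = null; // assumption: tags do not nest
        for (String line : text.split("\\R")) {
            if (line.trim().equals("ENDOFDOC")) {
                docs.add(current);
                current = new ArrayList<>();
                openTag = null;
                continue;
            }
            Matcher m = TOKEN.matcher(line);
            while (m.find()) {
                if (m.group(2) != null) {
                    // A tag: remember it if it opens, forget it if it closes.
                    openTag = m.group(1).isEmpty() ? m.group(2) : null;
                } else {
                    // A word: keep the label only when the enclosing tag is a target.
                    boolean isTarget = openTag != null && targets.contains(openTag);
                    current.add(new LabeledWord(m.group(), isTarget ? openTag : null));
                }
            }
        }
        if (!current.isEmpty()) {
            docs.add(current); // last document not followed by ENDOFDOC
        }
        return docs;
    }
}
```

With "purchaser" as the only target, the words First, Wisconsin, and Corp in the example sentence would carry the label purchaser, while the words inside the acquired tag would be kept as unlabeled text.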
See Also: TaggedStreamTokenizer
Field Summary
static int NUM_MARKUP_FORMS
Fields inherited from class edu.stanford.nlp.dbm.AbstractDataCollection
    data, datamatrix, features, name
Fields inherited from class java.util.AbstractList
    modCount
Constructor Summary
Corpus(String[] targets)
Corpus(String fileName, String targetField)
    Make a Corpus from a file.
Corpus(String fileName, String[] targets)
Method Summary
int add(Datum d)
    Inserts a Datum into the Data Collection.
boolean add(Object o)
HashMap genStarter()
    Generates a hash table of emission probabilities using maximum likelihood estimation with add-one smoothing for the words in the vocabulary.
Object get(int i)
String getTargetField()
    Get field we are extracting.
String[] getTargetFields()
    Get fields we are extracting.
HashMap getVocab()
static void incrementCount(HashMap v, String s)
void isolateContext()
    Isolates context of all targetFields.
static void main(String[] args)
    Simply test what gets put into a corpus.
static double[] normalize(double[] fractions)
void retainOnlyTarget(String tagName)
int size()
Corpus[] split(double start, double[] fractions)
    Divides the corpus into corpora of sizes specified by the fractions argument.
Corpus splitFrom(double start, double fraction)
    Returns a corpus with a fraction of the documents starting from a specified point, expressed as a fraction into the corpus.
Corpus[] splitRandom(double[] fractions, long seed)
    Divides the corpus into corpora of sizes specified by the fractions argument.
Corpus splitRange(double start, double end)
    Returns a corpus with the subset of documents that ranges from the start fraction into the corpus to the end fraction of the corpus.
Corpus splitRange(double s1, double e1, double s2, double e2)
String toString()
    Returns a String representation of DBM.
int vocabSize()
int wordCount()
Methods inherited from class edu.stanford.nlp.dbm.AbstractDataCollection
    dataMatrix, features, name, toXMLString
Methods inherited from class java.util.AbstractList
    add, addAll, clear, equals, hashCode, indexOf, iterator, lastIndexOf, listIterator, listIterator, remove, removeRange, set, subList
Methods inherited from class java.util.AbstractCollection
    addAll, contains, containsAll, isEmpty, remove, removeAll, retainAll, toArray, toArray
Methods inherited from class java.lang.Object
    clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.util.List
    add, addAll, addAll, clear, contains, containsAll, equals, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, subList, toArray, toArray
Field Detail
public static final int NUM_MARKUP_FORMS
Constructor Detail
public Corpus(String fileName, String targetField)
    Parameters:
    fileName - file to read from
    targetField - field that is being extracted
public Corpus(String fileName, String[] targets)
    Parameters:
    fileName - file to read from
public Corpus(String[] targets)
Method Detail
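public int add(Datum d)
    Description copied from interface: DataCollection
    Inserts a Datum into the Data Collection.
public boolean add(Object o)
    Specified by: add in interface List
    Overrides: add in class AbstractList
public Object get(int i)
    Specified by: get in interface List
    Overrides: get in class AbstractDataCollection
public int size()
    Specified by: size in interface List
    Overrides: size in class AbstractDataCollection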
public HashMap getVocab()
public int wordCount()
public int vocabSize()
public String[] getTargetFields()
public String getTargetField()
public Corpus splitRange(double start, double end)
public Corpus splitRange(double s1, double e1, double s2, double e2)
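The fraction-based ranges used by splitRange can be pictured as index arithmetic over the document list. The sketch below is an assumption about the general idea, not the class's actual code: FractionRangeSketch and range are hypothetical names, the rounding rule is chosen here for illustration, and the four-argument form is read as the union of two ranges, which the source does not document.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: maps fractional positions in a list to index ranges. */
final class FractionRangeSketch {

    /** Returns the items from fraction start (inclusive) to fraction end (exclusive). */
    static <T> List<T> range(List<T> docs, double start, double end) {
        if (start < 0.0 || end > 1.0 || start > end) {
            throw new IllegalArgumentException("need 0 <= start <= end <= 1");
        }
        // Rounding (rather than truncating) avoids off-by-one results such as 0.29 * 100 = 28.99...
        int from = (int) Math.round(start * docs.size());
        int to = (int) Math.round(end * docs.size());
        return new ArrayList<>(docs.subList(from, to));
    }

    /** Assumed reading of the four-argument form: the first range followed by the second. */
    static <T> List<T> range(List<T> docs, double s1, double e1, double s2, double e2) {
        List<T> out = range(docs, s1, e1);
        out.addAll(range(docs, s2, e2));
        return out;
    }
}
```

Under this reading, a call such as range(docs, 0.0, 0.2, 0.8, 1.0) would keep the first and last fifth of the documents and leave out the middle, which is one way to build a training set around a held-out slice.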
public Corpus splitFrom(double start, double fraction)
public Corpus[] split(double start, double[] fractions)
public Corpus[] splitRandom(double[] fractions, long seed)
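A seeded random split of this kind is commonly done by shuffling a copy of the documents with a fixed seed and then cutting the shuffled list at cumulative fraction boundaries. The following sketch shows that approach under stated assumptions: RandomSplitSketch is a hypothetical name, the fractions are rescaled to sum to one, and the last part absorbs any rounding remainder. The real method may differ on all three points.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: a reproducible random partition by fractions. */
final class RandomSplitSketch {

    static <T> List<List<T>> splitRandom(List<T> docs, double[] fractions, long seed) {
        double total = 0.0;
        for (double f : fractions) {
            if (f < 0.0) {
                throw new IllegalArgumentException("fractions must be non-negative");
            }
            total += f;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("fractions must sum to a positive value");
        }

        // Shuffle a copy so the caller's list is untouched; the seed makes the order repeatable.
        List<T> shuffled = new ArrayList<>(docs);
        Collections.shuffle(shuffled, new Random(seed));

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int from = 0;
        for (int i = 0; i < fractions.length; i++) {
            cumulative += fractions[i] / total;
            // The final part takes everything that is left, so no document is dropped.
            int to = (i == fractions.length - 1)
                    ? shuffled.size()
                    : (int) Math.round(cumulative * shuffled.size());
            parts.add(new ArrayList<>(shuffled.subList(from, to)));
            from = to;
        }
        return parts;
    }
}
```

Calling it twice with the same seed yields the same partition, which is the property that makes a seed parameter useful for repeatable train/test experiments.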
public static double[] normalize(double[] fractions)
public static void incrementCount(HashMap v, String s)
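The source gives no description for incrementCount, but its name and signature suggest the usual word-count update on a map. A minimal sketch of that idiom follows; CountSketch is a hypothetical name, and storing counts as Integer values is an assumption, since the page does not say what the HashMap holds.

```java
import java.util.Map;

/** Illustrative only: the common "add one to this key's count" idiom. */
final class CountSketch {

    /** Adds one to the count for word, starting from one if the word is new. */
    static void incrementCount(Map<String, Integer> counts, String word) {
        counts.merge(word, 1, Integer::sum);
    }
}
```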
public HashMap genStarter()
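The summary for genStarter mentions maximum likelihood estimation with add-one smoothing over the vocabulary. In its textbook form, with N the total number of word tokens and V the vocabulary size, each word w receives probability (count(w) + 1) / (N + V). The sketch below implements that textbook formula only: AddOneSmoothingSketch and smoothedEmissions are hypothetical names, and whether the real method reserves extra probability mass for unseen words is not stated in the source.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: add-one (Laplace) smoothed estimates from word counts. */
final class AddOneSmoothingSketch {

    /** Returns P(w) = (count(w) + 1) / (N + V) for every word in the count map. */
    static Map<String, Double> smoothedEmissions(Map<String, Integer> counts) {
        long totalTokens = 0; // N: sum of all counts
        for (int c : counts.values()) {
            totalTokens += c;
        }
        int vocabSize = counts.size(); // V: number of distinct words

        Map<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), (e.getValue() + 1.0) / (totalTokens + vocabSize));
        }
        return probs;
    }
}
```

Because one is added to every count, a vocabulary word that was never observed still gets a small nonzero probability, and the estimates over the vocabulary sum to one.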
public void retainOnlyTarget(String tagName)
public void isolateContext()
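public String toString()
    Description copied from interface: DataCollection
    Returns a String representation of DBM.
    Specified by: toString in interface DataCollection
    Overrides: toString in class AbstractDataCollection
public static void main(String[] args)
    Simply test what gets put into a corpus.
    Usage: java edu.stanford.nlp.ie.hmm.Corpus corpusFile targetFields*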