java.lang.Object
  |
  +--edu.stanford.nlp.ie.hmm.HMM
Class for a Hidden Markov Model information extraction tool.
Overview: This class implements the training and use of the Hidden Markov Model extractor. It has a variety of constructors, all of which estimate the parameters of the HMM from the training corpus passed in. A default structure can be used, or the various Structure classes can be used to pass in an arbitrary structure.
Three types of HMM can be trained: regular, target, or context. A regular HMM is a self-contained full extractor. A target HMM emits only target sequences. A context HMM learns where in documents targets appear. A context HMM and a set of target HMMs can be combined to form a regular HMM using the merge constructor, HMM(HMM context, HMM[] targets).
Training: the training process has 3 phases.
The rest of these comments are notes on the implementation.
The HMM implements a probabilistic regular grammar with a start and an end state within the transition matrix (rather than having a start probability matrix and no end), so there are two additional states beyond the surface-visible ones. In the sequence model of the HMM, time 0 is when it is in the start state and time (numTimes-1) is when it is in the finish state. Emissions are state emissions at each time apart from the start and end. So we have:
Times:           0    1    2    ...    (numTimes-2)      (numTimes-1)
Document.get():  -    0    1    ...    (doc.size()-1)    -
State:           S                                       F
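The layout above implies that numTimes equals doc.size() + 2 and that the token emitted at time t is Document.get(t - 1). The following is a minimal sketch of that mapping. The class and method names (TimeIndexSketch, numTimes, docIndexForTime) are hypothetical and are not part of the HMM class.

```java
// Hypothetical helper illustrating the time/document index mapping.
// None of these names come from edu.stanford.nlp.ie.hmm.HMM.
class TimeIndexSketch {

    /** Number of time steps for a document of docSize tokens: start + tokens + finish. */
    static int numTimes(int docSize) {
        return docSize + 2;
    }

    /** Document index emitted at time t, or -1 for the non-emitting start and finish times. */
    static int docIndexForTime(int t, int docSize) {
        int times = numTimes(docSize);
        if (t < 0 || t >= times) {
            throw new IllegalArgumentException("time out of range: " + t);
        }
        if (t == 0 || t == times - 1) {
            return -1; // start state (S) or finish state (F): no emission
        }
        return t - 1;
    }

    public static void main(String[] args) {
        int docSize = 3;
        for (int t = 0; t < numTimes(docSize); t++) {
            System.out.println("time " + t + " -> doc index " + docIndexForTime(t, docSize));
        }
    }
}
```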
Emissions are initialized as MLE unigram estimates over the whole corpus for all states. During the basic forward-backward reestimation, emissions are calculated for states and for the shrunk state-type states. Only the actual states have unseen-word models. These models are built by counting singleton tokens in the training data as unseen (but this tends to work very badly for this kind of data, because terms are so bursty: if you see a company name once, you usually see it several times).
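As a rough illustration of the two ideas in this note, the sketch below computes MLE unigram estimates from token counts and treats words that occur exactly once as unseen. It is not the class's actual code: the names (UnigramSketch, countWords, mleProb, singletonMass) are hypothetical, and using the singleton fraction as the unseen-word mass is an assumption about how the counts might be turned into a probability.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not taken from edu.stanford.nlp.ie.hmm.HMM.
class UnigramSketch {

    /** Counts how many times each word occurs in the token stream. */
    static Map<String, Integer> countWords(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : tokens) {
            counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    /** MLE unigram estimate: count(word) / totalTokens; 0.0 for words never observed. */
    static double mleProb(Map<String, Integer> counts, int totalTokens, String word) {
        return counts.getOrDefault(word, 0) / (double) totalTokens;
    }

    /** A word counts as seen only if it occurred more than once in training. */
    static boolean wasSeen(Map<String, Integer> counts, String word) {
        return counts.getOrDefault(word, 0) > 1;
    }

    /** Assumed unseen-word estimate: fraction of training tokens that are singletons. */
    static double singletonMass(Map<String, Integer> counts, int totalTokens) {
        int singletons = 0;
        for (int c : counts.values()) {
            if (c == 1) {
                singletons++;
            }
        }
        return singletons / (double) totalTokens;
    }

    public static void main(String[] args) {
        String[] tokens = {"acme", "buys", "acme", "shares", "acme"};
        Map<String, Integer> counts = countWords(tokens);
        System.out.println("P(acme) = " + mleProb(counts, tokens.length, "acme"));
        System.out.println("wasSeen(buys) = " + wasSeen(counts, "buys"));
        System.out.println("unseen mass = " + singletonMass(counts, tokens.length));
    }
}
```

The example corpus also shows the burstiness problem mentioned above: a name such as "acme" that appears at all tends to appear several times, so singleton counts say little about how often new names turn up.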
A lot of code in this class explicitly initializes arrays to zero, but Java arrays are always initialized to zero on creation (JLS 4.5.5), so this code should be removed.
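For illustration, a freshly allocated numeric array already holds zeros in every element, so a zeroing loop placed directly after the allocation does nothing. The class and method names below are hypothetical.

```java
// Hypothetical demonstration; not taken from edu.stanford.nlp.ie.hmm.HMM.
class ArrayDefaultsSketch {

    /** Returns true if every element of the matrix is exactly 0.0. */
    static boolean allZero(double[][] m) {
        for (double[] row : m) {
            for (double v : row) {
                if (v != 0.0) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A state x time trellis straight from 'new': no explicit zeroing needed.
        double[][] trellis = new double[4][6];
        System.out.println("all zero after allocation: " + allZero(trellis));
    }
}
```

This only applies at creation. An array that is reused across iterations still has to be reset explicitly before it is filled again.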
Field Summary
protected double[][] | alpha
protected double[][] | beta
protected int | calcEmitStates
static int | CONTEXT_HMM
protected double | globalMaxChange
protected Corpus | heldDocs
protected int | hmmType
protected static int | MAX_ITER
protected static int | MAX_SHR_ITER
protected static int | MAX_WRONG | Maximum number of wrong guesses that can be made and still be counted as correct.
protected boolean | print
static int | REGULAR_HMM
protected double[] | scale
protected State[] | states
protected GeneralStructure | structure | Optional structure object to get transition and start probabilities from.
static int | TARGET_HMM
protected String[] | targetFields
protected edu.stanford.nlp.ie.hmm.PlainEmitMap[] | targetParents | Shrunk emissions for the union of all states of a particular type.
protected Corpus | trainDocs
protected edu.stanford.nlp.ie.hmm.PlainEmitMap | uniform
protected HashMap | vocab | Maps words to how many times they appeared in training.
protected HashMap | zeroTable
Constructor Summary
HMM(Corpus train)
HMM(GeneralStructure struc, Corpus train) | Build an HMM of a certain structure and train it on a certain corpus.
HMM(GeneralStructure struc, Corpus train, Corpus heldOut, int hmmType, boolean full) | Build an HMM using the given corpora for basic training and validation.
HMM(GeneralStructure struc, Corpus train, int hmmType)
HMM(GeneralStructure struc, Corpus train, int hmmType, boolean full) | Build an HMM.
HMM(HMM context, HMM[] targets) | Put together a learned top level cascaded HMM with individual HMMs for the different states.
Method Summary
HashMap | bestAnswers(Document doc, int[] stateSequence) | Returns a map from state type (Integer) -> List of Strings representing the best answer for that type.
String | extractFrom(String s) | Calls extractFrom(s,false).
String | extractFrom(String s, boolean print) | Returns the best guess of the extracted field from s as a String.
int[] | getLabelsForSequence(int[] sequence) | Returns the state type for each state in a (Viterbi) state sequence.
double | getMdlScore() | Computes the MDL score for this HMM structure on the training corpus.
protected void | initEmissions() | Sets up the initial emissions of a created HMM using the training corpus.
double | logLikelihood(Corpus trainDocs) | Calculate the log-likelihood of the passed-in corpus according to the model (stored in class variables).
static void | main(String[] args) | This is just for testing the forward and backward algorithms.
protected int | numParameters()
protected int | numTargetStates() | Helper method for merging calculation.
void | printProbs() | Prints transitions and states.
void | printStates(State[] states)
static void | printTransitions(State[] states)
void | printTrellis(String name, double[][] trellis) | Print out a complete trellis (a state x time double array).
void | printTrellis(String name, double[][] trellis, int fromTime, int toTime) | Print out a time slice of a trellis (a state x time double array).
void | printTrellis(String name, double[][] trellis, int fromTime, int toTime, int decimalPlaces) | Print out a time slice of a trellis (a state x time double array).
protected boolean | reestimate(boolean useScaling) | Do one iteration of Baum-Welch parameter re-estimation over the entire set of trainDocs.
String | toString() | Print out a representation of the HMM.
void | train() | Calls train(true).
void | train(boolean full) | Trains using Baum-Welch estimation with the training data.
int[] | viterbiSequence(Document doc) | Calculate a Viterbi alignment through the document from the start state to the end state, which precede and follow the state emission observations respectively.
protected boolean | wasSeen(String s) | Words that are seen only once in the training data are considered unseen for the purpose of estimating each state's probability of emitting an unseen word.
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
protected int calcEmitStates
protected State[] states
protected String[] targetFields
protected transient GeneralStructure structure
protected HashMap zeroTable
protected HashMap vocab
protected transient double[][] alpha
protected transient double[][] beta
protected transient double[] scale
protected transient Corpus trainDocs
protected transient Corpus heldDocs
protected transient double globalMaxChange
protected static final int MAX_ITER
protected static final int MAX_SHR_ITER
protected edu.stanford.nlp.ie.hmm.PlainEmitMap uniform
protected edu.stanford.nlp.ie.hmm.PlainEmitMap[] targetParents
protected static int MAX_WRONG
protected int hmmType
public static final int REGULAR_HMM
public static final int TARGET_HMM
public static final int CONTEXT_HMM
protected boolean print
Constructor Detail |
public HMM(Corpus train)
train - Training corpus
public HMM(GeneralStructure struc, Corpus train)
struc - A description of the HMM structure. If this is null, a default HMM structure is used.
public HMM(GeneralStructure struc, Corpus train, int hmmType)
public HMM(GeneralStructure struc, Corpus train, int hmmType, boolean full)
full - True means to also do shrinkage and unseen estimation on held out data
public HMM(GeneralStructure struc, Corpus train, Corpus heldOut, int hmmType, boolean full)
full - True means to also do shrinkage and unseen estimation on held out data
public HMM(HMM context, HMM[] targets)
Method Detail |
protected int numTargetStates()
protected void initEmissions()
protected boolean wasSeen(String s)
public void train()
public void train(boolean full)
public int[] viterbiSequence(Document doc)
public int[] getLabelsForSequence(int[] sequence)
public String extractFrom(String s)
public String extractFrom(String s, boolean print)
public HashMap bestAnswers(Document doc, int[] stateSequence)
protected boolean reestimate(boolean useScaling)
useScaling - True means to use scaling coefficients. This is vital for all but toy problems, as otherwise numerical underflow occurs.
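To show why scaling matters, here is a minimal sketch of the forward pass with and without per-time scaling coefficients. It is not this class's reestimate code: it uses a textbook HMM with a separate initial distribution rather than start and end states inside the transition matrix, and all names and parameter values are hypothetical.

```java
// Hypothetical sketch of the scaled forward algorithm; not taken from
// edu.stanford.nlp.ie.hmm.HMM. Uses an initial distribution instead of start/end states.
class ScaledForwardSketch {

    /** Plain forward algorithm: returns P(obs), which underflows to 0.0 on long sequences. */
    static double unscaledLikelihood(double[] init, double[][] trans, double[][] emit, int[] obs) {
        int n = init.length;
        double[] alpha = new double[n];
        for (int t = 0; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double p = 0.0;
                if (t == 0) {
                    p = init[j];
                } else {
                    for (int i = 0; i < n; i++) {
                        p += alpha[i] * trans[i][j];
                    }
                }
                next[j] = p * emit[j][obs[t]];
            }
            alpha = next;
        }
        double total = 0.0;
        for (double a : alpha) {
            total += a;
        }
        return total;
    }

    /** Scaled forward algorithm: normalizes alpha at each time and sums the log scale factors. */
    static double scaledLogLikelihood(double[] init, double[][] trans, double[][] emit, int[] obs) {
        int n = init.length;
        double[] alpha = new double[n];
        double logLik = 0.0;
        for (int t = 0; t < obs.length; t++) {
            double[] next = new double[n];
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                double p = 0.0;
                if (t == 0) {
                    p = init[j];
                } else {
                    for (int i = 0; i < n; i++) {
                        p += alpha[i] * trans[i][j];
                    }
                }
                next[j] = p * emit[j][obs[t]];
                sum += next[j];
            }
            for (int j = 0; j < n; j++) {
                next[j] /= sum; // rescale so the alphas at time t sum to 1
            }
            logLik += Math.log(sum); // log P(obs) is the sum of the log scale factors
            alpha = next;
        }
        return logLik;
    }

    public static void main(String[] args) {
        double[] init = {0.5, 0.5};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.9, 0.1}, {0.2, 0.8}};
        int[] obs = new int[2000];
        for (int t = 0; t < obs.length; t++) {
            obs[t] = t % 2;
        }
        System.out.println("unscaled P(obs)   = " + unscaledLikelihood(init, trans, emit, obs));
        System.out.println("scaled log P(obs) = " + scaledLogLikelihood(init, trans, emit, obs));
    }
}
```

On short sequences the two methods agree (the log of the unscaled likelihood matches the scaled log-likelihood), while on a sequence of a few thousand observations the unscaled product falls below the smallest representable double and only the scaled version remains usable.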
public void printProbs()
public static void printTransitions(State[] states)
public void printTrellis(String name, double[][] trellis)
public void printTrellis(String name, double[][] trellis, int fromTime, int toTime)
public void printTrellis(String name, double[][] trellis, int fromTime, int toTime, int decimalPlaces)
public void printStates(State[] states)
public double logLikelihood(Corpus trainDocs)
protected int numParameters()
public static void main(String[] args)
public double getMdlScore()
public String toString()
Overrides: toString in class Object