edu.stanford.nlp.ie.hmm
Class HMM

java.lang.Object
  |
  +--edu.stanford.nlp.ie.hmm.HMM
All Implemented Interfaces:
Serializable

public class HMM
extends Object
implements Serializable

Class for a Hidden Markov Model information extraction tool.

Overview: This class is where the actual training and use of the Hidden Markov Model extractor is implemented. It has a variety of constructors, all of which use the passed training corpus to estimate the parameters of the HMM. A default structure can be used, or an arbitrary structure can be passed in via the various Structure classes. Three types of HMMs can be trained: regular, target, or context. A regular HMM is a self-contained full extractor. A target HMM emits only target sequences. A context HMM learns where in documents targets appear. A context HMM and a set of target HMMs can be combined to form a regular HMM using the merge constructor.

Training: the training process has 3 phases:

  1. Regular Parameter Estimation using EM: Initial transitions are set either to hard-coded defaults or to values pulled from the Structure object. Initial emissions are estimated from simple word counts over the training corpus with add-one smoothing. Regular parameter estimation is implemented as described in Manning and Schütze (1999).
  2. Unseen Estimation: Uses held out data to estimate how likely each state is to emit an unseen word.
  3. Shrink Estimation: Uses held out data to estimate the shrink parameters; that is, the HMM learns the optimum compromise between the specific model, which has more structure but suffers more from sparsity, and the general model, which has less structure but pools more training data (see the sketch after this list).
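To make that compromise concrete, here is a minimal sketch of the interpolation idea, assuming a state's emission probability is linearly mixed with its shrinked per-type parent (the targetParents field below) and a uniform distribution (the uniform field), with the mixture weights being the shrink parameters tuned on held out data. The variable names are illustrative, not the class's actual internals:

 // Hedged sketch of shrinkage: mix the state-specific estimate with
 // progressively more general ones; the lambdas are the shrink parameters.
 double pShrunk = lambdaState   * pState       // this state's own counts (specific)
                + lambdaParent  * pTypeParent  // union of all states of this type
                + lambdaUniform * pUniform;    // uniform over the vocabulary (general)
 // where lambdaState + lambdaParent + lambdaUniform == 1.0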
When training an HMM, a portion of the data set passed in (currently 3/4) is used for training, and the remainder is used as a validation set for setting parameters, such as in the Shrink Estimation.
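For orientation, here is a hedged sketch of how the context, target, and merge constructors described above might be put together into a cascaded extractor. The constructors, type constants, and extractFrom are the ones documented on this page; how the Corpus objects are built is not covered here, and passing null to get a default structure is only documented for the two-argument constructor, so treat those details as assumptions:

 // A minimal sketch, assuming training Corpus objects are already in hand
 // and that the constructors train the model as described above.
 public static String sketchExtraction(Corpus contextTrain,
                                       Corpus[] targetTrain,
                                       String documentText) {
   // Learn where in documents the targets appear
   HMM context = new HMM(null, contextTrain, HMM.CONTEXT_HMM);
   // Learn one target HMM per field; each emits only target sequences
   HMM[] targets = new HMM[targetTrain.length];
   for (int i = 0; i < targets.length; i++) {
     targets[i] = new HMM(null, targetTrain[i], HMM.TARGET_HMM);
   }
   // Merge the context HMM and target HMMs into a regular, self-contained extractor
   HMM extractor = new HMM(context, targets);
   // Best guess of the extracted field, as a String
   return extractor.extractFrom(documentText);
 }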

The rest of these comments are notes on the implementation.

The HMM implements a probabilistic regular grammar with a start and an end state within the transition matrix (rather than having a start probability matrix and no end state), so there are two additional states beyond the surface-visible ones. In the sequence model of the HMM, time 0 is when it is in the start state and time (numTimes-1) is when it is in the finish state. Emissions are state emissions at each time apart from the start and end. So we have:

 Times:          0   1    2      ...     (numTimes-2)    (numTimes-1)
 Document.get(): -   0    1      ...     (doc.size()-1)  -
 State:          S                                       F
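Reading this table together with the viterbiSequence and getLabelsForSequence methods documented below, the indexing works out roughly as follows (a hedged illustration, not code from the class):

 int[] seq = hmm.viterbiSequence(doc);          // includes start and end: length doc.size() + 2
 // seq[0]                           -> start state S (emits nothing)
 // seq[t] for 1 <= t <= doc.size()  -> the state emitting doc.get(t - 1)
 // seq[seq.length - 1]              -> finish state F (emits nothing)
 int[] labels = hmm.getLabelsForSequence(seq);  // start/end stripped: length doc.size()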
 

Emissions are initialized as MLE unigram estimates over the whole corpus for all states. During the basic forward-backward reestimation, emissions are calculated both for individual states and for the shrinked per-state-type parents. Only the actual states have unseen word models. These models are built by counting singleton tokens in the training data as unseen (but this tends to work very badly for this kind of data, because terms are so bursty -- if you see a company name once, then you usually see it several times).
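A minimal sketch of that singleton-counting idea, using the vocab field documented below (a map from words to training counts); the method name here is hypothetical and this is not the class's actual wasSeen implementation:

 // Hedged sketch: a word counts as "unseen" for unseen-word estimation
 // unless it occurred more than once in the training data.
 protected boolean seenMoreThanOnce(String word) {
   Integer count = (Integer) vocab.get(word);  // vocab maps word -> training count
   return count != null && count.intValue() > 1;
 }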

There's lots of code in this class that initializes arrays to zero, but arrays are always initialized to zero on creation (JLS 4.5.5), so this code should disappear.
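For instance, an explicit zero-fill immediately after allocation (numStates and numTimes are placeholders here) is redundant:

 double[][] alpha = new double[numStates][numTimes];
 // Unnecessary: Java array elements already default to 0.0 on creation.
 for (int i = 0; i < alpha.length; i++) {
   for (int j = 0; j < alpha[i].length; j++) {
     alpha[i][j] = 0.0;
   }
 }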

See Also:
Serialized Form

Field Summary
protected  double[][] alpha
           
protected  double[][] beta
           
protected  int calcEmitStates
           
static int CONTEXT_HMM
           
protected  double globalMaxChange
           
protected  Corpus heldDocs
           
protected  int hmmType
           
protected static int MAX_ITER
           
protected static int MAX_SHR_ITER
           
protected static int MAX_WRONG
          Maximum number of wrong guesses that can be made while still counting the result as correct.
protected  boolean print
           
static int REGULAR_HMM
           
protected  double[] scale
           
protected  State[] states
           
protected  GeneralStructure structure
          Optional structure object to get transition and start probabilities from.
static int TARGET_HMM
           
protected  String[] targetFields
           
protected  edu.stanford.nlp.ie.hmm.PlainEmitMap[] targetParents
          Shrinked emissions for union of all states of a particular type.
protected  Corpus trainDocs
           
protected  edu.stanford.nlp.ie.hmm.PlainEmitMap uniform
           
protected  HashMap vocab
          maps words to how many times they appeared in training
protected  HashMap zeroTable
           
 
Constructor Summary
HMM(Corpus train)
           
HMM(GeneralStructure struc, Corpus train)
          Build an HMM of a certain structure and train it on a certain corpus.
HMM(GeneralStructure struc, Corpus train, Corpus heldOut, int hmmType, boolean full)
          Build an HMM using the given corpora for basic training and validation.
HMM(GeneralStructure struc, Corpus train, int hmmType)
           
HMM(GeneralStructure struc, Corpus train, int hmmType, boolean full)
          Build an HMM.
HMM(HMM context, HMM[] targets)
          Put together a learned top level cascaded HMM with individual HMMs for the different states.
 
Method Summary
 HashMap bestAnswers(Document doc, int[] stateSequence)
          Returns a map from state type (Integer) -> List of Strings representing best answer for that type.
 String extractFrom(String s)
          Calls extractFrom(s,false).
 String extractFrom(String s, boolean print)
          Returns the best guess of the extracted field from s as a String.
 int[] getLabelsForSequence(int[] sequence)
          Returns the state type for each state in a (viterbi) state sequence.
 double getMdlScore()
          Computes the MDL score for this HMM structure on the training corpus.
protected  void initEmissions()
          Sets up the initial emissions of a created HMM using the training corpus.
 double logLikelihood(Corpus trainDocs)
          Calculate the loglikelihood of the passed in corpus according to the model (stored in class variables).
static void main(String[] args)
          This is just for testing the forward and backward algorithms.
protected  int numParameters()
           
protected  int numTargetStates()
          Helper method for merging calculation.
 void printProbs()
          Prints transitions and states.
 void printStates(State[] states)
           
static void printTransitions(State[] states)
           
 void printTrellis(String name, double[][] trellis)
          Print out a complete trellis (a state x time double array).
 void printTrellis(String name, double[][] trellis, int fromTime, int toTime)
          Print out a time slice of a trellis (a state x time double array).
 void printTrellis(String name, double[][] trellis, int fromTime, int toTime, int decimalPlaces)
          Print out a time slice of a trellis (a state x time double array).
protected  boolean reestimate(boolean useScaling)
          Do one iteration of Baum-Welch parameter re-estimation over the entire set of trainDocs.
 String toString()
          Print out a representation of the HMM.
 void train()
          Calls train(true).
 void train(boolean full)
          Trains using Baum-Welch estimation with the training data.
 int[] viterbiSequence(Document doc)
          Calculate a Viterbi alignment through the document from the start state to the end state, which precede and follow the state emission observations, respectively.
protected  boolean wasSeen(String s)
          Words that are only seen once in the training data are considered unseen for the purpose of estimating each state's probability of emitting an unseen word.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

calcEmitStates

protected int calcEmitStates

states

protected State[] states

targetFields

protected String[] targetFields

structure

protected transient GeneralStructure structure
Optional structure object to get transition and start probabilities from. Useful for structure learning.


zeroTable

protected HashMap zeroTable

vocab

protected HashMap vocab
maps words to how many times they appeared in training


alpha

protected transient double[][] alpha

beta

protected transient double[][] beta

scale

protected transient double[] scale

trainDocs

protected transient Corpus trainDocs

heldDocs

protected transient Corpus heldDocs

globalMaxChange

protected transient double globalMaxChange

MAX_ITER

protected static final int MAX_ITER
See Also:
Constant Field Values

MAX_SHR_ITER

protected static final int MAX_SHR_ITER
See Also:
Constant Field Values

uniform

protected edu.stanford.nlp.ie.hmm.PlainEmitMap uniform

targetParents

protected edu.stanford.nlp.ie.hmm.PlainEmitMap[] targetParents
Shrinked emissions for union of all states of a particular type. The background parent is targetParents[0].


MAX_WRONG

protected static int MAX_WRONG
Maximum number of wrong guesses that can be made while still counting the result as correct.


hmmType

protected int hmmType

REGULAR_HMM

public static final int REGULAR_HMM
See Also:
Constant Field Values

TARGET_HMM

public static final int TARGET_HMM
See Also:
Constant Field Values

CONTEXT_HMM

public static final int CONTEXT_HMM
See Also:
Constant Field Values

print

protected boolean print
Constructor Detail

HMM

public HMM(Corpus train)
Parameters:
train - Training corpus

HMM

public HMM(GeneralStructure struc,
           Corpus train)
Build an HMM of a certain structure and train it on a certain corpus.

Parameters:
struc - A description of the HMM structure. If this is null, a default HMM structure is used.
train - Training corpus

HMM

public HMM(GeneralStructure struc,
           Corpus train,
           int hmmType)

HMM

public HMM(GeneralStructure struc,
           Corpus train,
           int hmmType,
           boolean full)
Build an HMM.

Parameters:
full - True means to also do shrinkage and unseen estimation on held out data

HMM

public HMM(GeneralStructure struc,
           Corpus train,
           Corpus heldOut,
           int hmmType,
           boolean full)
Build an HMM using the given corpora for basic training and validation.

Parameters:
full - True means to also do shrinkage and unseen estimation on held out data

HMM

public HMM(HMM context,
           HMM[] targets)
Put together a learned top level cascaded HMM with individual HMMs for the different states.

Method Detail

numTargetStates

protected int numTargetStates()
Helper method for merging calculation. Should only be called on CONTEXT_HMMs. Returns the number of dummy target states (detected simply as ones that are constant -- this could be fooled if there were other constant elements).


initEmissions

protected void initEmissions()
Sets up the initial emissions of a created HMM using the training corpus.


wasSeen

protected boolean wasSeen(String s)
Words that are only seen once in the training data are considered unseen for the purpose of estimating each state's probability of emitting an unseen word. This is for the initial round only. Now the unseen probability is estimated using held out data.


train

public void train()
Calls train(true).

See Also:
train(boolean)


train

public void train(boolean full)
Trains using Baum-Welch estimation with the training data.


viterbiSequence

public int[] viterbiSequence(Document doc)
Calculate a Viterbi alignment through the document from the start state to the end state, which precede and follow the state emission observations, respectively.

Returns:
The viterbi state sequence (including start and end states)

getLabelsForSequence

public int[] getLabelsForSequence(int[] sequence)
Returns the state type for each state in a (viterbi) state sequence. This strips the start and end so that it is of length (numTimes-2).


extractFrom

public String extractFrom(String s)
Calls extractFrom(s,false).

See Also:
extractFrom(String,boolean)


extractFrom

public String extractFrom(String s,
                          boolean print)
Returns the best guess of the extracted field from s as a String. Prints debug info if print is true.


bestAnswers

public HashMap bestAnswers(Document doc,
                           int[] stateSequence)
Returns a map from state type (Integer) -> List of Strings representing best answer for that type. The best answer is the one with highest probability of being generated by this HMM.
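A hedged usage sketch combining this with viterbiSequence; it assumes java.util imports and that the stateSequence argument is the raw Viterbi sequence including the start and end states:

 int[] seq = hmm.viterbiSequence(doc);
 HashMap answers = hmm.bestAnswers(doc, seq);
 for (Iterator it = answers.keySet().iterator(); it.hasNext(); ) {
   Integer type = (Integer) it.next();           // state type
   List best = (List) answers.get(type);         // best answer Strings for that type
   System.out.println(type + " -> " + best);
 }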


reestimate

protected boolean reestimate(boolean useScaling)
Do one iteration of Baum-Welch parameter re-estimation over the entire set of trainDocs.

Parameters:
useScaling - True means to use scaling coefficients. This is vital for all but toy problems, as otherwise numerical underflow occurs.
Returns:
Whether the parameters appear to have converged
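To see why scaling matters, here is a small self-contained sketch of the standard scaled forward recursion: at each time step the forward probabilities are renormalized to sum to 1, the normalizers are remembered, and the log-likelihood is recovered as the sum of their logs, so no product of probabilities ever underflows. The toy numbers and the explicit start distribution (rather than this class's start/end states) are assumptions for the demo, not the class's implementation:

 public class ScalingDemo {
   public static void main(String[] args) {
     double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};  // made-up transition probabilities
     double[][] emit  = {{0.9, 0.1}, {0.2, 0.8}};  // made-up emission probabilities
     double[] start   = {0.5, 0.5};                // toy initial distribution
     int[] obs = {0, 1, 1, 0, 1};                  // toy observation sequence

     int n = trans.length;
     double[] alpha = new double[n];
     double[] scale = new double[obs.length];

     // Time 0: initialize, then normalize and remember the scale factor
     for (int j = 0; j < n; j++) {
       alpha[j] = start[j] * emit[j][obs[0]];
       scale[0] += alpha[j];
     }
     for (int j = 0; j < n; j++) {
       alpha[j] /= scale[0];
     }

     // Later times: one scaled forward step per observation
     for (int t = 1; t < obs.length; t++) {
       double[] next = new double[n];
       for (int j = 0; j < n; j++) {
         for (int i = 0; i < n; i++) {
           next[j] += alpha[i] * trans[i][j];
         }
         next[j] *= emit[j][obs[t]];
         scale[t] += next[j];
       }
       for (int j = 0; j < n; j++) {
         next[j] /= scale[t];
       }
       alpha = next;
     }

     // Without scaling this would be a product of tiny numbers; with scaling
     // the log-likelihood is just the sum of the log scale factors.
     double logLik = 0.0;
     for (int t = 0; t < scale.length; t++) {
       logLik += Math.log(scale[t]);
     }
     System.out.println("log P(obs) = " + logLik);
   }
 }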

printProbs

public void printProbs()
Prints transitions and states.


printTransitions

public static void printTransitions(State[] states)

printTrellis

public void printTrellis(String name,
                         double[][] trellis)
Print out a complete trellis (a state x time double array). This sends it in a pretty-printed format to System.out.


printTrellis

public void printTrellis(String name,
                         double[][] trellis,
                         int fromTime,
                         int toTime)
Print out a time slice of a trellis (a state x time double array). This sends it in a pretty-printed format to System.out. Times from fromTime to toTime are printed, inclusive at both ends. If fromTime or toTime falls outside the trellis, it is adjusted to what is in the trellis.


printTrellis

public void printTrellis(String name,
                         double[][] trellis,
                         int fromTime,
                         int toTime,
                         int decimalPlaces)
Print out a time slice of a trellis (a state x time double array). This sends it in a pretty-printed format to System.out. Times from fromTime to toTime are printed, inclusive at both ends. If fromTime or toTime falls outside the trellis, it is adjusted to what is in the trellis.


printStates

public void printStates(State[] states)

logLikelihood

public double logLikelihood(Corpus trainDocs)
Calculate the loglikelihood of the passed in corpus according to the model (stored in class variables).


numParameters

protected int numParameters()
Returns:
the total number of parameters (emission and transition) in the HMM. Useful for structure learning.

main

public static void main(String[] args)
This is just for testing the forward and backward algorithms. You have to run it on a very short document, so that the unscaled version doesn't underflow.


getMdlScore

public double getMdlScore()
Computes the MDL score for this HMM structure on the training corpus. The MDL score is logLikelihood(docs) - log(docs.size())*numParameters()/2.

Returns:
The MDL score for this HMM structure on the training corpus
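As code inside the class, using the logLikelihood and numParameters methods documented on this page, the formula would amount to something like this sketch (not the actual implementation; docs.size() is assumed to give the number of documents in the corpus):

 double mdl = logLikelihood(trainDocs)
              - Math.log(trainDocs.size()) * numParameters() / 2.0;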

toString

public String toString()
Print out a representation of the HMM. This should be rewritten to utilize the better print methods.

Overrides:
toString in class Object
Returns:
HMM representation


Stanford NLP Group