edu.stanford.nlp.ie.hmm
Class HMM

java.lang.Object
  |
  +--edu.stanford.nlp.ie.hmm.HMM
All Implemented Interfaces:
Serializable

public class HMM
extends Object
implements Serializable

Class for a Hidden Markov Model information extraction tool.

Overview: This class is where the actual training and use of the Hidden Markov Model extractor is implemented. It has a variety of constructors, all of which use the passed training corpus to estimate the parameters of the HMM. A default structure can be used, or an arbitrary structure can be passed in via the various Structure classes. Three types of HMMs can be trained: regular, target, or context. A regular HMM is a self-contained full extractor. A target HMM emits only target sequences. A context HMM learns where in documents targets appear. A context HMM and a set of target HMMs can be combined to form a regular HMM using the merge constructor.

Training: the training process has 3 phases:

  1. Regular Parameter Estimation using EM: Initial transitions are set either to hard-coded defaults or to values pulled from the Structure object. Initial emissions are estimated from simple word counts over the training corpus with add-one smoothing. Regular parameter estimation is implemented as described in Manning and Schütze (1999).
  2. Unseen Estimation: Uses held out data to estimate how likely each state is to emit an unseen word.
  3. Shrink Estimation: Uses held out data to estimate the shrink parameters; that is, the HMM learns the optimum compromise between the specific model, which has more structure but suffers more from sparsity, and the general model, which has less structure but pools more training data (see the sketch after this list).
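To make that compromise concrete, here is a minimal sketch of the interpolation idea, assuming a state's emission probability is linearly mixed with its shrinked per-type parent (the targetParents field below) and a uniform distribution (the uniform field), with the mixture weights being the shrink parameters tuned on held out data. The variable names are illustrative, not the class's actual internals:

 // Hedged sketch of shrinkage: mix the state-specific estimate with
 // progressively more general ones; the lambdas are the shrink parameters.
 double pShrunk = lambdaState   * pState       // this state's own counts (specific)
                + lambdaParent  * pTypeParent  // union of all states of this type
                + lambdaUniform * pUniform;    // uniform over the vocabulary (general)
 // where lambdaState + lambdaParent + lambdaUniform == 1.0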
When training an HMM, a portion of the data set passed in (currently 3/4) is used for training, and the remainder is used as a validation set for setting parameters, such as in the Shrink Estimation.
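For orientation, here is a hedged sketch of how the context, target, and merge constructors described above might be put together into a cascaded extractor. The constructors, type constants, and extractFrom are the ones documented on this page; how the Corpus objects are built is not covered here, and passing null to get a default structure is only documented for the two-argument constructor, so treat those details as assumptions:

 // A minimal sketch, assuming training Corpus objects are already in hand
 // and that the constructors train the model as described above.
 public static String sketchExtraction(Corpus contextTrain,
                                       Corpus[] targetTrain,
                                       String documentText) {
   // Learn where in documents the targets appear
   HMM context = new HMM(null, contextTrain, HMM.CONTEXT_HMM);
   // Learn one target HMM per field; each emits only target sequences
   HMM[] targets = new HMM[targetTrain.length];
   for (int i = 0; i < targets.length; i++) {
     targets[i] = new HMM(null, targetTrain[i], HMM.TARGET_HMM);
   }
   // Merge the context HMM and target HMMs into a regular, self-contained extractor
   HMM extractor = new HMM(context, targets);
   // Best guess of the extracted field, as a String
   return extractor.extractFrom(documentText);
 }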

The rest of these comments are notes on the implementation.

The HMM implements a probabilistic regular grammar with a start and an end state within the transition matrix (rather than having a start probability matrix and no end state), so there are two additional states beyond the surface-visible ones. In the sequence model of the HMM, time 0 is when it is in the start state and time (numTimes-1) is when it is in the finish state. Emissions are state emissions at each time apart from the start and end. So we have:

 Times:          0   1    2      ...     (numTimes-2)    (numTimes-1)
 Document.get(): -   0    1      ...     (doc.size()-1)  -
 State:          S                                       F
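Reading this table together with the viterbiSequence and getLabelsForSequence methods documented below, the indexing works out roughly as follows (a hedged illustration, not code from the class):

 int[] seq = hmm.viterbiSequence(doc);          // includes start and end: length doc.size() + 2
 // seq[0]                           -> start state S (emits nothing)
 // seq[t] for 1 <= t <= doc.size()  -> the state emitting doc.get(t - 1)
 // seq[seq.length - 1]              -> finish state F (emits nothing)
 int[] labels = hmm.getLabelsForSequence(seq);  // start/end stripped: length doc.size()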
 

Emissions are initialized as MLE unigram estimates over the whole corpus for all states. During the basic forward-backward reestimation, emissions are calculated both for individual states and for the shrinked per-state-type parents. Only the actual states have unseen word models. These models are built by counting singleton tokens in the training data as unseen (but this tends to work very badly for this kind of data, because terms are so bursty -- if you see a company name once, then you usually see it several times).
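A minimal sketch of that singleton-counting idea, using the vocab field documented below (a map from words to training counts); the method name here is hypothetical and this is not the class's actual wasSeen implementation:

 // Hedged sketch: a word counts as "unseen" for unseen-word estimation
 // unless it occurred more than once in the training data.
 protected boolean seenMoreThanOnce(String word) {
   Integer count = (Integer) vocab.get(word);  // vocab maps word -> training count
   return count != null && count.intValue() > 1;
 }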

There's lots of code in this class that initializes arrays to zero, but arrays are always initialized to zero on creation (JLS 4.5.5), so this code should disappear.
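For instance, an explicit zero-fill immediately after allocation (numStates and numTimes are placeholders here) is redundant:

 double[][] alpha = new double[numStates][numTimes];
 // Unnecessary: Java array elements already default to 0.0 on creation.
 for (int i = 0; i < alpha.length; i++) {
   for (int j = 0; j < alpha[i].length; j++) {
     alpha[i][j] = 0.0;
   }
 }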

See Also:
Serialized Form

Field Summary
protected  double[][] alpha
           
protected  double[][] beta
           
protected  int calcEmitStates
           
static int CONTEXT_HMM
           
protected  double globalMaxChange
           
protected  Corpus heldDocs
           
protected  int hmmType
           
protected static int MAX_ITER
           
protected static int MAX_SHR_ITER
           
protected static int MAX_WRONG
          Maximum number of wrong guesses that can be made while still counting the result as correct.
protected  boolean print
           
static int REGULAR_HMM
           
protected  double[] scale
           
protected  State[] states
           
protected  GeneralStructure structure
          Optional structure object to get transition and start probabilities from.
static int TARGET_HMM
           
protected  String[] targetFields
           
protected  edu.stanford.nlp.ie.hmm.PlainEmitMap[] targetParents
          Shrinked emissions for union of all states of a particular type.
protected  Corpus trainDocs
           
protected  edu.stanford.nlp.ie.hmm.PlainEmitMap uniform
           
protected  HashMap vocab
          maps words to how many times they appeared in training
protected  HashMap zeroTable
           
 
Constructor Summary
HMM(Corpus train)
           
HMM(GeneralStructure struc, Corpus train)
          Build an HMM of a certain structure and train it on a certain corpus.
HMM(GeneralStructure struc, Corpus train, Corpus heldOut, int hmmType, boolean full)
          Build an HMM using the given corpora for basic training and validation.
HMM(GeneralStructure struc, Corpus train, int hmmType)
           
HMM(GeneralStructure struc, Corpus train, int hmmType, boolean full)
          Build an HMM.
HMM(HMM context, HMM[] targets)
          Put together a learned top level cascaded HMM with individual HMMs for the different states.
 
Method Summary
 HashMap bestAnswers(Document doc, int[] stateSequence)
          Returns a map from state type (Integer) -> List of Strings representing best answer for that type.
 String extractFrom(String s)
          Calls extractFrom(s,false).
 String extractFrom(String s, boolean print)
          Returns the best guess of the extracted field from s as a String.
 int[] getLabelsForSequence(int[] sequence)
          Returns the state type for each state in a (viterbi) state sequence.
 double getMdlScore()
          Computes the MDL score for this HMM structure on the training corpus.
protected  void initEmissions()
          Sets up the initial emissions of a created HMM using the training corpus.
 double logLikelihood(Corpus trainDocs)
          Calculate the loglikelihood of the passed in corpus according to the model (stored in class variables).
static void main(String[] args)
          This is just for testing the forward and backward algorithms.
protected  int numParameters()
           
protected  int numTargetStates()
          Helper method for merging calculation.
 void printProbs()
          Prints transitions and states.
 void printStates(State[] states)
           
static void printTransitions(State[] states)
           
 void printTrellis(String name, double[][] trellis)
          Print out a complete trellis (a state x time double array).
 void printTrellis(String name, double[][] trellis, int fromTime, int toTime)
          Print out a time slice of a trellis (a state x time double array).
 void printTrellis(String name, double[][] trellis, int fromTime, int toTime, int decimalPlaces)
          Print out a time slice of a trellis (a state x time double array).
protected  boolean reestimate(boolean useScaling)
          Do one iteration of Baum-Welch parameter re-estimation over the entire set of trainDocs.
 String toString()
          Print out a representation of the HMM.
 void train()
          Calls train(true).
 void train(boolean full)
          Trains using Baum-Welch estimation with the training data.
 int[] viterbiSequence(Document doc)
          Calculate a Viterbi alignment through the document from the start state to the end state, which precede and follow the state emission observations, respectively.
protected  boolean wasSeen(String s)
          Words that are only seen once in the training data are considered unseen for the purpose of estimating each state's probability of emitting an unseen word.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

calcEmitStates

protected int calcEmitStates

states

protected State[] states

targetFields

protected String[] targetFields

structure

protected transient GeneralStructure structure
Optional structure object to get transition and start probabilities from. Useful for structure learning.


zeroTable

protected HashMap zeroTable

vocab

protected HashMap vocab
maps words to how many times they appeared in training


alpha

protected transient double[][] alpha

beta

protected transient double[][] beta

scale

protected transient double[] scale

trainDocs

protected transient Corpus trainDocs

heldDocs

protected transient Corpus heldDocs

globalMaxChange

protected transient double globalMaxChange

MAX_ITER

protected static final int MAX_ITER
See Also:
Constant Field Values

MAX_SHR_ITER

protected static final int MAX_SHR_ITER
See Also:
Constant Field Values

uniform

protected edu.stanford.nlp.ie.hmm.PlainEmitMap uniform

targetParents

protected edu.stanford.nlp.ie.hmm.PlainEmitMap[] targetParents
Shrinked emissions for union of all states of a particular type. The background parent is targetParents[0].


MAX_WRONG

protected static int MAX_WRONG
Maximum number of wrong guesses that can be made while still counting the result as correct.


hmmType

protected int hmmType

REGULAR_HMM

public static final int REGULAR_HMM
See Also:
Constant Field Values

TARGET_HMM

public static final int TARGET_HMM
See Also:
Constant Field Values

CONTEXT_HMM

public static final int CONTEXT_HMM
See Also:
Constant Field Values

print

protected boolean print
Constructor Detail

HMM

public HMM(Corpus train)
Parameters:
train - Training corpus

HMM

public HMM(GeneralStructure struc,
           Corpus train)
Build an HMM of a certain structure and train it on a certain corpus.

Parameters:
struc - A description of the HMM structure. If this is null, a default HMM structure is used.
train - Training corpus

HMM

public HMM(GeneralStructure struc,
           Corpus train,
           int hmmType)

HMM

public HMM(GeneralStructure struc,
           Corpus train,
           int hmmType,
           boolean full)
Build an HMM.

Parameters:
full - True means to also do shrinkage and unseen estimation on held out data

HMM

public HMM(GeneralStructure struc,
           Corpus train,
           Corpus heldOut,
           int hmmType,
           boolean full)
Build an HMM using the given corpora for basic training and validation.

Parameters:
full - True means to also do shrinkage and unseen estimation on held out data

HMM

public HMM(HMM context,
           HMM[] targets)
Put together a learned top level cascaded HMM with individual HMMs for the different states.

Method Detail

numTargetStates

protected int numTargetStates()
Helper method for merging calculation. Should only be called on CONTEXT_HMMs. Returns the number of dummy target states (detected simply as ones that are constant -- this could be fooled if there were other constant elements).


initEmissions

protected void initEmissions()
Sets up the initial emissions of a created HMM using the training corpus.


wasSeen

protected boolean wasSeen(String s)
Words that are only seen once in the training data are considered unseen for the purpose of estimating each state's probability of emitting an unseen word. This is for the initial round only. Now the unseen probability is estimated using held out data.


train

public void train()
Calls train(true).

See Also:
train(boolean)


train

public void train(boolean full)
Trains using Baum-Welch estimation with the training data.


viterbiSequence

public int[] viterbiSequence(Document doc)
Calculate a Viterbi alignment through the document from the start state to the end state, which precede and follow the state emission observations, respectively.

Returns:
The viterbi state sequence (including start and end states)

getLabelsForSequence

public int[] getLabelsForSequence(int[] sequence)
Returns the state type for each state in a (viterbi) state sequence. This strips the start and end so that it is of length (numTimes-2).


extractFrom

public String extractFrom(String s)
Calls extractFrom(s,false).

See Also:
extractFrom(String,boolean)


extractFrom

public String extractFrom(String s,
                          boolean print)
Returns the best guess of the extracted field from s as a String. Prints debug info if print is true.


bestAnswers

public HashMap bestAnswers(Document doc,
                           int[] stateSequence)
Returns a map from state type (Integer) -> List of Strings representing best answer for that type. The best answer is the one with highest probability of being generated by this HMM.
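A hedged usage sketch combining this with viterbiSequence; it assumes java.util imports and that the stateSequence argument is the raw Viterbi sequence including the start and end states:

 int[] seq = hmm.viterbiSequence(doc);
 HashMap answers = hmm.bestAnswers(doc, seq);
 for (Iterator it = answers.keySet().iterator(); it.hasNext(); ) {
   Integer type = (Integer) it.next();           // state type
   List best = (List) answers.get(type);         // best answer Strings for that type
   System.out.println(type + " -> " + best);
 }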


reestimate

protected boolean reestimate(boolean useScaling)
Do one iteration of Baum-Welch parameter re-estimation over the entire set of trainDocs.

Parameters:
useScaling - True means to use scaling coefficients. This is vital for all but toy problems, as otherwise numerical underflow occurs.
Returns:
Whether the parameters appear to have converged
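To see why scaling matters, here is a small self-contained sketch of the standard scaled forward recursion: at each time step the forward probabilities are renormalized to sum to 1, the normalizers are remembered, and the log-likelihood is recovered as the sum of their logs, so no product of probabilities ever underflows. The toy numbers and the explicit start distribution (rather than this class's start/end states) are assumptions for the demo, not the class's implementation:

 public class ScalingDemo {
   public static void main(String[] args) {
     double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};  // made-up transition probabilities
     double[][] emit  = {{0.9, 0.1}, {0.2, 0.8}};  // made-up emission probabilities
     double[] start   = {0.5, 0.5};                // toy initial distribution
     int[] obs = {0, 1, 1, 0, 1};                  // toy observation sequence

     int n = trans.length;
     double[] alpha = new double[n];
     double[] scale = new double[obs.length];

     // Time 0: initialize, then normalize and remember the scale factor
     for (int j = 0; j < n; j++) {
       alpha[j] = start[j] * emit[j][obs[0]];
       scale[0] += alpha[j];
     }
     for (int j = 0; j < n; j++) {
       alpha[j] /= scale[0];
     }

     // Later times: one scaled forward step per observation
     for (int t = 1; t < obs.length; t++) {
       double[] next = new double[n];
       for (int j = 0; j < n; j++) {
         for (int i = 0; i < n; i++) {
           next[j] += alpha[i] * trans[i][j];
         }
         next[j] *= emit[j][obs[t]];
         scale[t] += next[j];
       }
       for (int j = 0; j < n; j++) {
         next[j] /= scale[t];
       }
       alpha = next;
     }

     // Without scaling this would be a product of tiny numbers; with scaling
     // the log-likelihood is just the sum of the log scale factors.
     double logLik = 0.0;
     for (int t = 0; t < scale.length; t++) {
       logLik += Math.log(scale[t]);
     }
     System.out.println("log P(obs) = " + logLik);
   }
 }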

printProbs

public void printProbs()
Prints transitions and states.


printTransitions

public static void printTransitions(State[] states)

printTrellis

public void printTrellis(String name,
                         double[][] trellis)
Print out a complete trellis (a state x time double array). This sends it in a pretty-printed format to System.out.


printTrellis

public void printTrellis(String name,
                         double[][] trellis,
                         int fromTime,
                         int toTime)
Print out a time slice of a trellis (a state x time double array). This sends it in a pretty-printed format to System.out. Times from fromTime to toTime are printed, inclusive at both ends. If fromTime or toTime falls outside the trellis, it is adjusted to what is in the trellis.


printTrellis

public void printTrellis(String name,
                         double[][] trellis,
                         int fromTime,
                         int toTime,
                         int decimalPlaces)
Print out a time slice of a trellis (a state x time double array). This sends it in a pretty-printed format to System.out. Times from fromTime to toTime are printed, inclusive at both ends. If fromTime or toTime falls outside the trellis, it is adjusted to what is in the trellis.


printStates

public void printStates(State[] states)

logLikelihood

public double logLikelihood(Corpus trainDocs)
Calculate the loglikelihood of the passed in corpus according to the model (stored in class variables).


numParameters

protected int numParameters()
Returns:
the total number of parameters (emission and transition) in the HMM. Useful for structure learning.

main

public static void main(String[] args)
This is just for testing the forward and backward algorithms. You have to run it on a very short document, so that the unscaled version doesn't underflow.


getMdlScore

public double getMdlScore()
Computes the MDL score for this HMM structure on the training corpus. The MDL score is logLikelihood(docs) - log(docs.size())*numParameters()/2.

Returns:
The MDL score for this HMM structure on the training corpus
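As code inside the class, using the logLikelihood and numParameters methods documented on this page, the formula would amount to something like this sketch (not the actual implementation; docs.size() is assumed to give the number of documents in the corpus):

 double mdl = logLikelihood(trainDocs)
              - Math.log(trainDocs.size()) * numParameters() / 2.0;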

toString

public String toString()
Print out a representation of the HMM. This should be rewritten to utilize the better print methods.

Overrides:
toString in class Object
Returns:
HMM representation


Stanford NLP Group