edu.stanford.nlp.ie.hmm(Stanford JavaNLP API)

Package edu.stanford.nlp.ie.hmm

A package implementing HMMs for the purpose of information extraction.

Interface Summary

EmitMap Interface to model a state's emission distribution.

GeneralStructure A simple interface for anything that has a State array.

HasType Something that implements the HasType interface knows about HMM target types.

Class Summary

AnswerChecker Utility class for checking whether words pulled from the HMM match the errors created from an AnswerConstructor.

AnswerChecker.Range Represents a range [from,to) (same semantics as substring).

AnswerConstructor Takes a Collection of TypedTaggedWords (or a Collection of Words and a list of integers for the corresponding types) and pulls out the strings of each type.

ContextTrainer Trains a context HMM on the contexts of the given target states, representing each target state as atomic.

Corpus Class to handle a corpus of information extraction data.

DiscriminativeHMMDiffFunction Interface to optimization package for discriminatively learning the structure of an HMM.

Extractor A command line information extraction tool built using the HMM and Corpus classes.

HMM Class for a Hidden Markov Model information extraction tool.

HMMSingleFieldExtractor An interface between the KAON extraction world, and extraction of a single field via an HMM information extractor.

HMMTester Programmatically tests the quality of an HMM on a Corpus.

MergeTrainer Main class for building a single HMM by combining multiple target HMMs and a context HMM.

MultiStructure Class to model an HMM context structure.

State Class to model a single state in an HMM.

Structure Class to model an HMM structure.

StructureLearner A class to learn HMM structures by stochastic optimization.

TargetTrainer Trains a small target HMM on target sequences only.

Tester Test a trained, serialized HMM on a (tagged) testing file.

Trainer Trains HMM and saves it as a serialized object.

TypedTaggedDocument Document whose words are TypedTaggedWord objects.

TypedTaggedWord A TypedTaggedWord object contains a word, its tag, and its type.

WordTypeStripper Appliable that sets the type of a TypedTaggedWord to 0.
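
The half-open convention noted for AnswerChecker.Range, [from,to), is the same one used by String.substring and List.subList: the start index is included and the end index is excluded. A minimal sketch of that convention follows. The HalfOpenRange class and its method names are hypothetical illustrations and are not the actual AnswerChecker.Range API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a half-open range [from, to).
// This is NOT the real AnswerChecker.Range class.
final class HalfOpenRange {
    final int from; // inclusive start index
    final int to;   // exclusive end index

    HalfOpenRange(int from, int to) {
        if (from < 0 || to < from) {
            throw new IllegalArgumentException("need 0 <= from <= to");
        }
        this.from = from;
        this.to = to;
    }

    // Number of positions covered; 'to' is excluded, so no +1 is needed.
    int length() {
        return to - from;
    }

    // True when index lies inside [from, to).
    boolean contains(int index) {
        return index >= from && index < to;
    }

    // Picks out the covered words, exactly as subList(from, to) would.
    List<String> slice(List<String> words) {
        return words.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Acme", "Corp", "bought", "Widget", "Inc");
        HalfOpenRange r = new HalfOpenRange(3, 5);
        // Covers the words at indices 3 and 4 only.
        System.out.println(r.slice(words));
    }
}
```

One consequence of this convention is that adjacent ranges such as [0,3) and [3,5) share no index, and an empty range is simply one with from equal to to.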

Package edu.stanford.nlp.ie.hmm Description

A package implementing HMMs for the purpose of information extraction. This work is based largely on work done by Freitag and McCallum. For more descriptions of ideas used, see:

Freitag and McCallum. "Information Extraction with HMMs and Shrinkage". 1999.
Freitag and McCallum. "Information Extraction with HMM Structures Learned by Stochastic Optimization". AAAI 2000.
Borkar, Deshmukh, Sarawagi. "Automatic segmentation of text into structured records". SIGMOD 2001.

The key classes are:

HMM: the actual Hidden Markov Model extractor. Uses shrinkage and an unseen word model.
Corpus: represents a sequence of documents to be used for training or testing the HMM. Training documents must be tagged.
EmitMap: this is the interface for any object that models the emission probability distribution of an HMM state. Concrete classes inherit from AbstractEmitMap:
- PlainEmitMap: just a straightforward hashtable mapping words to probabilities.
- ConstantEmitMap: always emits the same thing (used in context models).
- UnseenEmitMap: With probability seen, emits from the known distribution. With probability (1 - seen), emits from the unseen distribution, which is based on word feature counts of unseen words encountered in the unseen phase of training.
- ShrinkedEmitMap: optimizes three parameters over three emit maps over held out data. See Freitag and McCallum "Information Extraction with HMMs and Shrinkage". 1999.
Structure.java: Describes HMM structures, and can be used to initialize transition matrices.
FeatureMap.java: Used to give a feature-based representation of unknown words.
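
The emit maps listed above combine simpler distributions by weighting them. The following sketch shows only the arithmetic of such a mixture, under stated assumptions: the class EmitMixtureSketch and its methods are hypothetical and are not the package's EmitMap API, each distribution is reduced to a plain word-to-probability table (the real unseen distribution is described as being based on word features), and the weights are given by hand rather than optimized on held-out data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of mixing emission distributions.
// This is NOT the package's EmitMap API; all names are illustrative.
final class EmitMixtureSketch {

    // Probability of a word under a table-backed distribution; absent words get 0.
    static double prob(Map<String, Double> dist, String word) {
        return dist.getOrDefault(word, 0.0);
    }

    // Two-way mixture in the style described for UnseenEmitMap:
    // seen * P_known(word) + (1 - seen) * P_unseen(word).
    static double unseenMix(double seen, Map<String, Double> known,
                            Map<String, Double> unseen, String word) {
        return seen * prob(known, word) + (1.0 - seen) * prob(unseen, word);
    }

    // General linear interpolation: sum over i of weights[i] * P_i(word).
    // With three distributions this has the shape of a three-parameter mixture.
    static double interpolate(double[] weights, List<Map<String, Double>> dists, String word) {
        if (weights.length != dists.size()) {
            throw new IllegalArgumentException("one weight per distribution is required");
        }
        double p = 0.0;
        for (int i = 0; i < weights.length; i++) {
            p += weights[i] * prob(dists.get(i), word);
        }
        return p;
    }

    public static void main(String[] args) {
        Map<String, Double> known = new HashMap<>();
        known.put("bought", 0.5);
        known.put("acquired", 0.5);

        Map<String, Double> unseen = new HashMap<>();
        unseen.put("bought", 0.1);
        unseen.put("merged", 0.9);

        // A word missing from the known table still gets mass from the unseen table.
        System.out.println(unseenMix(0.8, known, unseen, "merged"));

        // The same result expressed as a general interpolation with weights 0.8 and 0.2.
        System.out.println(interpolate(new double[] {0.8, 0.2},
                Arrays.asList(known, unseen), "merged"));
    }
}
```

As long as the weights are non-negative and sum to 1, and each component is a proper distribution, the mixture is one as well. How the package actually fits the weights is not shown here.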

A variety of command line utility classes are available to build, train, and test HMM extractors:

Extractor: Given a (single-file) tagged corpus, trains an HMM on part of the corpus, and then tests the accuracy of this HMM on a held-out portion of the corpus. It does not save the HMM.
Trainer: Trains an HMM using the given tagged corpus and writes a serialized HMM object to a file.
Tester: Tests the given serialized HMM object on the given corpus.
StructureLearner: Uses structure search with F1 score on held out data as the scoring function to find the best structure for a given corpus.
TargetTrainer: Trains an HMM to emit target strings of the specified field. For example, an address HMM might have only 3 states, the first of which is likely to emit a number, the second a name, and the third a word like "St." or "Ave.". Use MergeTrainer to combine target HMMs and a ContextHMM into a single HMM.
ContextTrainer: Trains the context for a TargetHMM; it learns where in documents the target is likely to appear. Can be combined with a set of TargetTrainers to create a complete HMM.
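
StructureLearner is described above as scoring candidate structures by F1 on held-out data. For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts. The class F1Sketch is hypothetical, and how the package decides that an extracted string counts as correct is not specified here, so the counts are simply taken as inputs.

```java
// Hypothetical helper showing the standard F1 computation.
// This is NOT taken from StructureLearner; the matching criterion is left to the caller.
final class F1Sketch {

    // truePositives:  extracted answers that are correct
    // falsePositives: extracted answers that are wrong
    // falseNegatives: correct answers that were not extracted
    static double f1(int truePositives, int falsePositives, int falseNegatives) {
        if (truePositives == 0) {
            return 0.0; // avoids 0/0 when nothing correct was extracted
        }
        double precision = (double) truePositives / (truePositives + falsePositives);
        double recall = (double) truePositives / (truePositives + falseNegatives);
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // 8 correct extractions, 2 spurious, 8 missed: precision 0.8, recall 0.5.
        System.out.println(f1(8, 2, 8));
    }
}
```

Because the harmonic mean is pulled toward the smaller of the two values, a structure cannot score well by extracting very little (high precision, low recall) or by extracting nearly everything (the reverse).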

Data format: The input utilities work with a simple XML-like but not XML document structure which is described in the documentation of the class Corpus. Documents are all in one file, separated by the string ENDOFDOC on a line by itself. Within a document, fields for training are marked as XML-style elements.
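
To make the layout concrete, here is a minimal sketch of reading such a file. It relies only on the two facts stated above: documents end with a line containing just ENDOFDOC, and training fields are marked as XML-style elements. The exact tag syntax (here assumed to be <field>...</field> with the field name as the element name) is an assumption; the authoritative description is in the documentation of Corpus. The class CorpusFormatSketch and its methods are hypothetical and are not part of the package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for the corpus layout described above.
// This is NOT the Corpus class; the tag syntax is an assumption.
final class CorpusFormatSketch {

    // Splits corpus text into documents at lines consisting solely of ENDOFDOC.
    static List<String> splitDocuments(String corpusText) {
        List<String> docs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : corpusText.split("\\r?\\n")) {
            if (line.trim().equals("ENDOFDOC")) {
                addIfNonEmpty(docs, current);
                current.setLength(0);
            } else {
                current.append(line).append('\n');
            }
        }
        addIfNonEmpty(docs, current); // tolerate a final document without a separator
        return docs;
    }

    private static void addIfNonEmpty(List<String> docs, StringBuilder text) {
        String trimmed = text.toString().trim();
        if (!trimmed.isEmpty()) {
            docs.add(trimmed);
        }
    }

    // Returns the text inside each <field>...</field> element of one document.
    static List<String> fieldValues(String document, String field) {
        String quoted = Pattern.quote(field);
        Pattern p = Pattern.compile("<" + quoted + ">(.*?)</" + quoted + ">", Pattern.DOTALL);
        Matcher m = p.matcher(document);
        List<String> values = new ArrayList<>();
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String corpus = "Acme bought <acquired>Widget Inc</acquired> today.\n"
                + "ENDOFDOC\n"
                + "No deal here.\n"
                + "ENDOFDOC\n";
        for (String doc : splitDocuments(corpus)) {
            System.out.println(fieldValues(doc, "acquired"));
        }
    }
}
```

A regular expression is used instead of an XML parser because the format is described as XML-like but not XML, so a strict parser could reject valid corpus files.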

Use cases

Testing a single field extractor

Using cross validation, and a default structure:

java edu.stanford.nlp.ie.hmm.Extractor /u/nlp/data/iedata/acquisitions.txt acquired

Building a single field extractor from a target and context HMM, and testing it

One should be able to put together an HMM from parts and test it like this. However, this doesn't currently work (Oct 2002).

Making a target HMM with a fixed target structure

java edu.stanford.nlp.ie.hmm.TargetTrainer /u/nlp/data/iedata/acquisitions.txt acquired acquired-fixed.hmm

Making a target HMM with a learned target structure

Alternatively, one could learn a target HMM structure as below. This takes considerably longer, but it isn't so bad for a simple target HMM. It learns a much bigger HMM structure.

java edu.stanford.nlp.ie.hmm.TargetTrainer -sl /u/nlp/data/iedata/acquisitions.txt acquired acquired-learned.hmm

Making the context HMM

java edu.stanford.nlp.ie.hmm.ContextTrainer /u/nlp/data/iedata/acquisitions.txt acquired-context.hmm acquired
(This didn't seem to work in the older code -- but doing java edu.stanford.nlp.ie.hmm.ContextTrainer -cc /u/nlp/data/iedata/acquisitions.txt acquired seems sensible, so I think it shouldn't be too far from working.)

Gluing the HMMs together

java edu.stanford.nlp.ie.hmm.MergeTrainer -f acquired-merged.hmm acquired-context.hmm acquired-fixed.hmm

Testing the merged HMM

java edu.stanford.nlp.ie.hmm.Tester /u/nlp/data/iedata/acquisitions.txt acquired acquired-merged.hmm

Since: 1.4