Interface Summary | |
EmitMap | Interface to model a state's emission distribution. |
GeneralStructure | A simple interface for anything that has a State array. |
HasType | Something that implements the HasType interface
knows about HMM target types. |
Class Summary | |
AnswerChecker | Utility class for checking whether words pulled from the HMM match the
errors created from an AnswerConstructor. |
AnswerChecker.Range | Represents a range [from,to) (same semantics as substring). |
AnswerConstructor | Takes a Collection of TypedTaggedWords (or a Collection of Words and a list of integers for the corresponding types) and pulls out the strings of each type. |
ContextTrainer | Trains a context HMM on the contexts of the given target states, representing each target state as atomic. |
Corpus | Class to handle a corpus of information extraction data. |
DiscriminativeHMMDiffFunction | Interface to optimization package for discriminatively learning the structure of an HMM. |
Extractor | A command line information extraction tool built using the HMM
and Corpus classes. |
HMM | Class for a Hidden Markov Model information extraction tool. |
HMMSingleFieldExtractor | An interface between the KAON extraction world and extraction of a single field via an HMM information extractor. |
HMMTester | Programmatically tests the quality of an HMM on a Corpus. |
MergeTrainer | Main class for building a single HMM by combining multiple target HMMs and a context HMM. |
MultiStructure | Class to model an HMM context structure. |
State | Class to model a single state in an HMM. |
Structure | Class to model an HMM structure. |
StructureLearner | A class to learn HMM structures by stochastic optimization. |
TargetTrainer | Trains a small target HMM on target sequences only. |
Tester | Test a trained, serialized HMM on a (tagged) testing file. |
Trainer | Trains HMM and saves it as a serialized object. |
TypedTaggedDocument | Document whose words are TypedTaggedWord objects. |
TypedTaggedWord | A TypedTaggedWord object contains a word, its tag, and its type. |
WordTypeStripper | Appliable that sets the type of a TypedTaggedWord to 0. |
A package implementing HMMs for the purpose of information extraction. This work is based largely on work done by Freitag and McCallum. For more descriptions of ideas used, see:
The key classes are:
HMM: the actual Hidden Markov Model extractor. Uses shrinkage and an unseen word model.
Corpus: represents a sequence of documents to be used for training or testing the HMM. Training documents must be tagged.
EmitMap: the interface for any object that models the emission probability distribution of an HMM state. Concrete classes inherit from AbstractEmitMap:
  PlainEmitMap: just a straightforward hashtable mapping words to probabilities.
  ConstantEmitMap: always emits the same thing (used in context models).
  UnseenEmitMap: with probability seen, emits from another, known distribution. With probability (1 - seen), emits from the unseen distribution, which is based on word feature counts of unseen words observed in the unseen phase of training.
  ShrinkedEmitMap: optimizes three parameters over three emit maps on held-out data. See Freitag and McCallum, "Information Extraction with HMMs and Shrinkage", 1999.
Structure.java: describes HMM structures, and can be used to initialize transition matrices.
FeatureMap.java: used to give a feature-based representation of unknown words.
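The UnseenEmitMap description above amounts to a two-way mixture, P(w) = seen * P_known(w) + (1 - seen) * P_unseen(w). The sketch below illustrates that arithmetic, plus a three-way interpolation of the kind ShrinkedEmitMap appears to perform, under the assumption that its "three parameters" are mixture weights summing to 1. Every name here (EmissionMixtureSketch, interpolate, unseenMixture) and every number is hypothetical; this is not the package's implementation or API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from edu.stanford.nlp.ie.hmm.
final class EmissionMixtureSketch {

    // Linear interpolation: the sum over i of weights[i] * componentProbs[i].
    // Weights must be non-negative and sum to 1.
    static double interpolate(double[] weights, double[] componentProbs) {
        if (weights.length != componentProbs.length) {
            throw new IllegalArgumentException("need exactly one weight per component");
        }
        double weightSum = 0.0;
        double probability = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] < 0.0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            weightSum += weights[i];
            probability += weights[i] * componentProbs[i];
        }
        if (Math.abs(weightSum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("weights must sum to 1");
        }
        return probability;
    }

    // Two-component case: weight `seen` on the known distribution,
    // weight (1 - seen) on the unseen-word distribution.
    static double unseenMixture(double seen, double pKnown, double pUnseen) {
        return interpolate(new double[] {seen, 1.0 - seen}, new double[] {pKnown, pUnseen});
    }

    public static void main(String[] args) {
        // Toy word-to-probability table standing in for a known distribution.
        Map<String, Double> known = new HashMap<>();
        known.put("Inc.", 0.6);
        known.put("Corp.", 0.4);

        double seen = 0.9;     // made-up mixing weight
        double pUnseen = 0.01; // made-up probability from an unseen-word model

        for (String word : new String[] {"Inc.", "Acme"}) {
            double pKnown = known.getOrDefault(word, 0.0);
            System.out.println(word + " -> " + unseenMixture(seen, pKnown, pUnseen));
        }

        // Three-way interpolation with made-up weights and estimates
        // (assumption: this is the shape of the shrinkage combination).
        double[] lambdas = {0.5, 0.3, 0.2};
        double[] estimates = {0.2, 0.1, 0.05};
        System.out.println("three-way -> " + interpolate(lambdas, estimates));
    }
}
```

In the shrinkage setting the weights would be fitted on held-out data rather than fixed by hand; the fitting step is omitted here.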
A variety of command line utility classes are available to build, train, and test HMM extractors:
Extractor: given a (single-file) tagged corpus, trains an HMM on part of the corpus, and then tests the accuracy of this HMM on a left-out portion of the corpus. It does not save the HMM.
Trainer: trains an HMM using the given tagged corpus and writes a serialized HMM object to a file.
Tester: tests the given serialized HMM object on the given corpus.
StructureLearner: uses structure search, with F1 score on held-out data as the scoring function, to find the best structure for a given corpus.
TargetTrainer: trains an HMM to emit target strings of the specified field. For example, an address HMM might have only 3 states: the first likely to emit a number, the second a name, and the third a word like "St." or "Ave.". Use MergeTrainer to combine target HMMs and a context HMM into a single HMM.
ContextTrainer: trains the context for a target HMM; it learns where in documents the target is likely to appear. Its output can be combined with that of a set of TargetTrainers to create a complete HMM.
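StructureLearner is described as scoring candidate structures by F1 on held-out data. As a reminder of what that score computes, here is a minimal sketch of the standard F1 formula (the harmonic mean of precision and recall). The class name F1Sketch and the idea of counting extracted fields as true positives, false positives, and false negatives are assumptions; the package's own matching rules are not documented here.

```java
// Hypothetical sketch: not the evaluation code used by StructureLearner or HMMTester.
final class F1Sketch {

    // truePositives: extracted answers that match a gold answer
    // falsePositives: extracted answers that match nothing
    // falseNegatives: gold answers that were never extracted
    static double f1(int truePositives, int falsePositives, int falseNegatives) {
        int predicted = truePositives + falsePositives;
        int gold = truePositives + falseNegatives;
        if (truePositives == 0 || predicted == 0 || gold == 0) {
            return 0.0; // convention: no correct extractions means a score of 0
        }
        double precision = (double) truePositives / predicted;
        double recall = (double) truePositives / gold;
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Made-up counts: 8 correct, 2 spurious, 8 missed.
        System.out.println("F1 = " + f1(8, 2, 8));
    }
}
```

A structure search would call such a function once per candidate structure and keep the candidate with the highest held-out score.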
Data format:
The input utilities work with a simple document structure that is XML-like but not XML, described in the documentation of the class Corpus.
Documents are all in one file, separated by the string ENDOFDOC on a
line by itself. Within a document, fields for training are
marked as XML-style elements.
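To make the layout concrete, the sketch below splits a file's contents into documents at ENDOFDOC lines and pulls out the text of one tagged field. It is an illustration only: the class and method names are hypothetical, the exact tag syntax (here `<acquired>...</acquired>`) is an assumption based on "XML-style elements", and the authoritative description of the format is the Corpus class documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: this is not how the Corpus class reads its input.
final class CorpusFormatSketch {

    // Splits the text into documents at lines consisting of ENDOFDOC.
    // Assumption: surrounding whitespace on the separator line is tolerated.
    static List<String> splitDocuments(String fileContents) {
        List<String> documents = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : fileContents.split("\\R")) {
            if (line.trim().equals("ENDOFDOC")) {
                addIfNonBlank(documents, current.toString());
                current.setLength(0);
            } else {
                current.append(line).append('\n');
            }
        }
        // Keep a trailing document that lacks a final separator.
        addIfNonBlank(documents, current.toString());
        return documents;
    }

    private static void addIfNonBlank(List<String> documents, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            documents.add(trimmed);
        }
    }

    // Returns the text inside every <field>...</field> element of one document.
    static List<String> fieldValues(String document, String field) {
        String quoted = Pattern.quote(field);
        Pattern element = Pattern.compile("<" + quoted + ">(.*?)</" + quoted + ">", Pattern.DOTALL);
        List<String> values = new ArrayList<>();
        Matcher matcher = element.matcher(document);
        while (matcher.find()) {
            values.add(matcher.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up two-document corpus.
        String text = "A bought <acquired>Foo Corp</acquired> today.\nENDOFDOC\n"
                + "No fields here.\nENDOFDOC\n";
        for (String document : splitDocuments(text)) {
            System.out.println(fieldValues(document, "acquired"));
        }
    }
}
```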
Using cross validation, and a default structure:
    java edu.stanford.nlp.ie.hmm.Extractor /u/nlp/data/iedata/acquisitions.txt acquired
One should be able to put together an HMM from parts and test it like this. However, this doesn't currently work (Oct 2002).
    java edu.stanford.nlp.ie.hmm.TargetTrainer /u/nlp/data/iedata/acquisitions.txt acquired acquired-fixed.hmm
Alternatively, one could learn a target HMM structure as below. This
takes considerably longer, but it isn't so bad for a simple target
HMM. It learns a much bigger HMM structure.
    java edu.stanford.nlp.ie.hmm.TargetTrainer -sl /u/nlp/data/iedata/acquisitions.txt acquired acquired-learned.hmm
    java edu.stanford.nlp.ie.hmm.ContextTrainer /u/nlp/data/iedata/acquisitions.txt acquired-context.hmm acquired
(This didn't seem to work in the older code, but doing
    java edu.stanford.nlp.ie.hmm.ContextTrainer -cc /u/nlp/data/iedata/acquisitions.txt acquired
seems sensible, so I think it shouldn't be too far from working.)
    java edu.stanford.nlp.ie.hmm.MergeTrainer -f acquired-merged.hmm acquired-context.hmm acquired-fixed.hmm
    java edu.stanford.nlp.ie.hmm.Tester /u/nlp/data/iedata/acquisitions.txt acquired acquired-merged.hmm