PNPC (Stanford JavaNLP API)

edu.stanford.nlp.ie.pcfg
Class PNPC

java.lang.Object
  |
  +--edu.stanford.nlp.ie.pcfg.PNPC

All Implemented Interfaces: Serializable

public class PNPC
extends Object
implements Serializable

Statistical classifier of unseen proper noun phrases. Supports training and testing on data files. Uses an n-gram word-length model, an n-gram character model, and a word model.

See Also: Serialized Form

Field Summary

protected int[] categoryCounts


List categoryNames


static int[] charBinCutoffs


protected double charConvergenceMargin


protected double[][][] charInterpolationConstants


protected HashMap[] charSequenceCounts


protected HashMap[] charSequenceTotalsByLength


protected int[] charTotalCounts


protected HashMap[] charWordInterpolationConstants


static int cn


static char END_SYMBOL


protected List[] heldOutExamples


protected int heldOutPercent


static int[] lengthBinCutoffs


protected double lengthConvergenceMargin


protected double[][][] lengthInterpolationConstants


protected double lengthNormalization


protected double[] lengthNormalizations


protected HashMap[] lengthSequenceCounts


static int ln


protected int maxPriorBoost


protected int numCategories


protected int numCharWordSteps


protected int numExamples


protected double priorBoost


static Random rand


static char START_SYMBOL


protected HashMap[] wordCountsByLength


protected int[] wordTotalCounts


protected HashMap[] wordTotalsByLength


Constructor Summary

PNPC(List categoryNames, List trainingLines)
          Constructs a new PNPC which is trained on the given file.

Method Summary

protected void addCounts(String line, int category)
          Counts relevant statistics for the given example in its given category

protected void computeCharSequenceTotals()
          Computes the total probability of generating all words of a given length.

String generateLine(int category)
          Generates a novel example of the given category, starting with (cn-1) start symbols and ending with an end symbol.

String generateWord(int wordLength, String initialContext, char finalChar, int category)
          Randomly generates a word of the given length, starting with the given initial context, and ending with the given final char by sampling from the char n-gram model of the given category.

int getBestCategory(String line)
          Returns the category that generates the given line with the highest probability.

protected int getCharBin(String charSequence, int category)
          Returns the index of the appropriate EM interpolation parameter bin for the given char ngram.

protected int getCharBinCount()
          Returns the number of bins used for char EM interpolation.

double getEmpiricalProb(List lengthSequence, int category)
          Returns the empirical estimate of the probability of the last word length in the sequence given the sequence excluding that length, as observed within the given category.

double getEmpiricalProb(String charSequence, int category)
          Returns the empirical estimate of the probability of the last char in the sequence given the sequence excluding that char, as observed within the given category.

double getEmpiricalProb(String word, int wordLength, int category)
          Returns the empirical estimate of the probability of the given word given the word's length and the given category.

static String getEndMarkedString(String line)
          Returns the given line prepended with enough ' ' symbols to allow n-gram parsing.

protected int getHeldOutScore()
          Runs the classifier on the held-out examples and returns the number of correctly classified examples.

double getInterpolatedProb(List lengthSequence, int category)
          Returns a linearly interpolated estimate of the last length in the sequence given the rest of it.

double getInterpolatedProb(String charSequence, int category)
          Returns a linearly interpolated estimate of the last char in the sequence given the rest of it.

protected int getLengthBin(List lengthSequence, int category)
          Returns the index of the appropriate EM interpolation parameter bin for the given length ngram.

protected int getLengthBinCount()
          Returns the number of bins used for length EM interpolation.

int getNumCategories()
          Returns the number of different categories represented in this classifier.

double getPriorProb(int category)
          Returns the empirical a priori probability of each category, as observed in the training data (fraction of each category in the whole training data).

static String getPureString(String word)
          Prunes the first (cn-1) chars from the beginning of the word as well as the final char.

double getScore(String line, int category)
          Returns the score for the given example as scored in the given category.

static List getWordLengths(String line)
          Takes an end-marked string and returns a list of Integers for the length of each word.

static List getWordsWithContext(String line)
          Takes an end-marked string and returns a List of strings, one for each word in the line.

protected void incrementCount(HashMap map, Object key)
          Adds 1 to the count for the given key in the given map.

protected void incrementCountByLength(HashMap map, int length, Object key)
          Adds 1 to the count for the given key in the given map under the given length.

protected void initCounts()
          Initializes and zeroes all variables and counts before training.

protected void learnCharInterpolationConstants()
          Learns good weights for deleted interpolation in the char n-gram model via EM.

protected void learnCharWordInterpolationConstants()
          Computes the best interpolation weights for the char n-gram vs word model with a line search.

protected void learnLengthInterpolationConstants()
          Learns good weights for deleted interpolation in the length n-gram model via EM.

protected void learnLengthNormalizations()
          Learns a constant for each category to normalize word probabilities by length.

protected void learnPriorBoost()
          Sets the log-prior multiplier (priorBoost) to the best value on the held-out set.

protected void test(String testFilename)
          Runs the classifier on each line in the given test file and prints out the category with the highest score.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

categoryNames

public List categoryNames

ln

public static final int ln

See Also: Constant Field Values

cn

public static final int cn

See Also: Constant Field Values

START_SYMBOL

public static final char START_SYMBOL

See Also: Constant Field Values

END_SYMBOL

public static final char END_SYMBOL

See Also: Constant Field Values

rand

public static final Random rand

charBinCutoffs

public static final int[] charBinCutoffs

lengthBinCutoffs

public static final int[] lengthBinCutoffs

numCategories

protected int numCategories

numExamples

protected int numExamples

priorBoost

protected double priorBoost

lengthNormalization

protected double lengthNormalization

heldOutPercent

protected final int heldOutPercent

See Also: Constant Field Values

heldOutExamples

protected List[] heldOutExamples

charConvergenceMargin

protected final double charConvergenceMargin

See Also: Constant Field Values

lengthConvergenceMargin

protected final double lengthConvergenceMargin

See Also: Constant Field Values

numCharWordSteps

protected final int numCharWordSteps

See Also: Constant Field Values

maxPriorBoost

protected final int maxPriorBoost

See Also: Constant Field Values

categoryCounts

protected int[] categoryCounts

lengthSequenceCounts

protected HashMap[] lengthSequenceCounts

lengthInterpolationConstants

protected double[][][] lengthInterpolationConstants

wordTotalCounts

protected int[] wordTotalCounts

charSequenceCounts

protected HashMap[] charSequenceCounts

charInterpolationConstants

protected double[][][] charInterpolationConstants

charSequenceTotalsByLength

protected HashMap[] charSequenceTotalsByLength

charTotalCounts

protected int[] charTotalCounts

wordCountsByLength

protected HashMap[] wordCountsByLength

wordTotalsByLength

protected HashMap[] wordTotalsByLength

charWordInterpolationConstants

protected HashMap[] charWordInterpolationConstants

lengthNormalizations

protected double[] lengthNormalizations

Constructor Detail

PNPC

public PNPC(List categoryNames,
            List trainingLines)

Constructs a new PNPC which is trained on the given file. Number of categories is inferred from reading the training file. The first line of the training file should just be an integer, indicating the total number of categories in the training set. Each subsequent line should be of the format "# rest of example" (excluding quotes) where # is the category (1-n, don't use category 0) and "rest of example" is the full example line. Training is first performed on all but a held out set of data. Then various parameters are set on the held out data. Finally, the held out data is also trained on.
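
The labelled-line format described above ("# rest of example") can be illustrated with a small parser. This is only a sketch of how such a line might be split; the class TrainingLineSketch and its Example holder are hypothetical names and are not part of PNPC, and PNPC's own parsing may differ.

```java
/** Hypothetical helper, not part of PNPC: splits one labelled training line. */
final class TrainingLineSketch {

    /** Simple holder for a parsed example (illustrative only). */
    static final class Example {
        final int category;
        final String text;

        Example(int category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    /** Parses a line such as "3 Acme Holdings Inc." into category 3 and its text. */
    static Example parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected \"<category> <example>\": " + line);
        }
        int category = Integer.parseInt(trimmed.substring(0, space));
        if (category < 1) {
            // The documentation says categories run from 1 to n; 0 is not used.
            throw new IllegalArgumentException("categories are numbered from 1: " + line);
        }
        return new Example(category, trimmed.substring(space + 1).trim());
    }
}
```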

Method Detail

initCounts

protected void initCounts()

Initializes and zeroes all variables and counts before training.

addCounts

protected void addCounts(String line,
                         int category)

Counts relevant statistics for the given example in its given category.

incrementCount

protected void incrementCount(HashMap map,
                              Object key)

Adds 1 to the count for the given key in the given map. If the key has not been seen before, creates a new count of 1 for that key.

incrementCountByLength

protected void incrementCountByLength(HashMap map,
                                      int length,
                                      Object key)

Adds 1 to the count for the given key in the given map under the given length. Used for HashMaps that store sub-HashMaps by length. If either the length or the key has not been seen before, creates a new count of 1 for that key/length.
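
The two counting helpers above can be sketched as follows. The class name CountSketch and the use of generics and Map.merge are assumptions made for illustration; the documented methods take raw HashMap and Object arguments, and their actual implementation is not shown in this page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of flat and by-length counting; not the PNPC source. */
final class CountSketch {

    /** Adds 1 to the count for key, creating a count of 1 if the key is new. */
    static <K> void incrementCount(Map<K, Integer> map, K key) {
        map.merge(key, 1, Integer::sum);
    }

    /** Adds 1 to the count for key inside the sub-map stored under the given length. */
    static <K> void incrementCountByLength(Map<Integer, Map<K, Integer>> map, int length, K key) {
        // Create the sub-map for this length on first use, then count as usual.
        incrementCount(map.computeIfAbsent(length, len -> new HashMap<>()), key);
    }
}
```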

learnPriorBoost

protected void learnPriorBoost()

Sets the log-prior multiplier (priorBoost) to the best value on the held-out set. Specifically, classifies the held-out examples using different values, and keeps the one that leads to the best score.
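
The search described above amounts to trying candidate multipliers and keeping the best scorer on held-out data. The sketch below assumes integer candidates from 0 up to a maximum and a caller-supplied scoring function; the class PriorBoostSketch, the step size, and the tie-breaking rule are all assumptions, since the page does not state how candidates are enumerated.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch of picking a prior multiplier by held-out score. */
final class PriorBoostSketch {

    /**
     * Tries boost = 0..maxBoost and returns the value with the highest held-out score.
     * Ties are resolved in favour of the smallest boost (an arbitrary choice here).
     */
    static int bestBoost(int maxBoost, IntUnaryOperator heldOutScore) {
        int best = 0;
        int bestScore = heldOutScore.applyAsInt(0);
        for (int boost = 1; boost <= maxBoost; boost++) {
            int score = heldOutScore.applyAsInt(boost);
            if (score > bestScore) {
                bestScore = score;
                best = boost;
            }
        }
        return best;
    }
}
```

In PNPC terms, the scoring function would play the role of setting priorBoost and then calling getHeldOutScore().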

learnLengthInterpolationConstants

protected void learnLengthInterpolationConstants()

Learns good weights for deleted interpolation in the length n-gram model via EM. Learns separate weights based on the counts of the conditioning contexts. Starts by mixing a 1-gram and 0-gram, then mixes the 2-gram with the 1/0-mixture, and so on all the way up to the full n-gram.

learnCharInterpolationConstants

protected void learnCharInterpolationConstants()

Learns good weights for deleted interpolation in the char n-gram model via EM. Learns separate weights based on the counts of the conditioning contexts. Starts by mixing a 1-gram and 0-gram, then mixes the 2-gram with the 1/0-mixture, and so on all the way up to the full n-gram.
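
One stage of the mixing described above (a higher-order estimate against the lower-order mixture) can be fitted with a standard two-component EM update. The sketch below learns a single weight from held-out probabilities; it ignores the per-count bins, and the class name, the starting weight of 0.5, and the use of a margin on the weight change as the stopping rule are assumptions rather than details taken from PNPC.

```java
/** Hypothetical sketch: EM for one interpolation weight between two fixed estimates. */
final class InterpolationEmSketch {

    /**
     * high[i] and low[i] are the higher-order and lower-order probabilities that the
     * two models assign to held-out event i. Returns the weight w on the higher-order
     * model that EM settles on for the mixture w*high + (1-w)*low.
     */
    static double learnWeight(double[] high, double[] low, double margin, int maxIterations) {
        if (high.length == 0 || high.length != low.length) {
            throw new IllegalArgumentException("need equally sized, non-empty arrays");
        }
        double w = 0.5;
        for (int iter = 0; iter < maxIterations; iter++) {
            // E-step: posterior responsibility of the higher-order model for each event.
            double responsibility = 0.0;
            for (int i = 0; i < high.length; i++) {
                double numerator = w * high[i];
                double denominator = numerator + (1 - w) * low[i];
                if (denominator > 0) {
                    responsibility += numerator / denominator;
                }
            }
            // M-step: the new weight is the average responsibility.
            double next = responsibility / high.length;
            boolean converged = Math.abs(next - w) < margin;
            w = next;
            if (converged) {
                break;
            }
        }
        return w;
    }
}
```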

learnCharWordInterpolationConstants

protected void learnCharWordInterpolationConstants()

Computes the best interpolation weights for the char n-gram vs word model with a line search.

computeCharSequenceTotals

protected void computeCharSequenceTotals()

Computes the total probability of generating all words of a given length. Simply looks at all unigram length counts, and normalizes them into a probability distribution.

learnLengthNormalizations

protected void learnLengthNormalizations()

Learns a constant for each category to normalize word probabilities by length. Since word probabilities are calculated with an n-gram, longer words get unfair influence over short words. Thus we normalize the probabilities by taking the (k/l)'th root, where l is the word length (# chars) and k is a constant learned here for each category.

getHeldOutScore

protected int getHeldOutScore()

Runs the classifier on the held-out examples and returns the number of correctly classified examples. Useful for setting various category-neutral parameters of the model and seeing how they do. For example, used to set log-prior boost.

getBestCategory

public int getBestCategory(String line)

Returns the category that generates the given line with the highest probability. NOTE: Input lines should already be end-marked (e.g. run getEndMarkedString() before calling getBestCategory()).

getScore

public double getScore(String line,
                       int category)

Returns the score for the given example as scored in the given category. Essentially computes Log[P(line|category)*P(category)]. Higher scores mean the line is more likely to be generated from this category. NOTE: Input lines should already be end-marked (e.g. run getEndMarkedString() before calling getScore()).
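
The scoring rule above, together with the log-prior multiplier mentioned under learnPriorBoost, suggests a score of the form log P(line|category) + priorBoost * log P(category), with classification taking the argmax. The sketch below works on precomputed log-likelihoods; the class name, the 0-based category indices, and the exact way priorBoost enters the score are assumptions.

```java
/** Hypothetical sketch of log-space scoring and argmax classification. */
final class ScoreSketch {

    /** Log-likelihood plus a (possibly boosted) log-prior. */
    static double score(double logLikelihood, double prior, double priorBoost) {
        return logLikelihood + priorBoost * Math.log(prior);
    }

    /** Returns the 0-based index of the category with the highest score. */
    static int bestCategory(double[] logLikelihoods, double[] priors, double priorBoost) {
        int best = 0;
        double bestScore = score(logLikelihoods[0], priors[0], priorBoost);
        for (int c = 1; c < logLikelihoods.length; c++) {
            double s = score(logLikelihoods[c], priors[c], priorBoost);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }
}
```

With priorBoost set to 1 this reduces to the Log[P(line|category)*P(category)] form quoted above; larger values let the prior weigh more heavily against the likelihood.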

getInterpolatedProb

public double getInterpolatedProb(String charSequence,
                                  int category)

Returns a linearly interpolated estimate of the last char in the sequence given the rest of it. This function is called recursively in conjunction with getEmpiricalProb to build up the full equation:

  gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1)
  gIP(0) = 1/256

getEmpiricalProb

public double getEmpiricalProb(String charSequence,
                               int category)

Returns the empirical estimate of the probability of the last char in the sequence given the sequence excluding that char, as observed within the given category. For example, gEP("Inc.",2) returns P(.|I,n,c) as observed in category 2.
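
The recursion gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1) with gIP(0) = 1/256 can be sketched for a single category as below. This is a simplified illustration: it uses one fixed weight per n-gram order instead of the binned weights PNPC learns, and the class name and storage layout (two maps of n-gram and context counts) are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-category sketch of empirical and interpolated char n-gram estimates. */
final class CharInterpolationSketch {
    private final Map<String, Integer> ngramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final double[] weights; // weights[k] = w_k for k = 1..n; index 0 is unused

    CharInterpolationSketch(double[] weights) {
        this.weights = weights.clone();
    }

    /** Counts every k-gram (k = 1..n) ending at each position of the text. */
    void addCounts(String text) {
        int n = weights.length - 1;
        for (int end = 1; end <= text.length(); end++) {
            for (int k = 1; k <= n && k <= end; k++) {
                String gram = text.substring(end - k, end);
                ngramCounts.merge(gram, 1, Integer::sum);
                contextCounts.merge(gram.substring(0, k - 1), 1, Integer::sum);
            }
        }
    }

    /** count(sequence) / count(context), or 0 if the context was never seen. */
    double empiricalProb(String seq) {
        int context = contextCounts.getOrDefault(seq.substring(0, seq.length() - 1), 0);
        if (context == 0) {
            return 0.0;
        }
        return ngramCounts.getOrDefault(seq, 0) / (double) context;
    }

    /** Mixes the estimate at this order with the interpolated estimate one order lower. */
    double interpolatedProb(String seq) {
        if (seq.isEmpty()) {
            return 1.0 / 256; // gIP(0): uniform over 256 chars
        }
        double w = weights[seq.length()]; // seq must be at most n chars long
        // Back off by dropping the oldest char of the context.
        return w * empiricalProb(seq) + (1 - w) * interpolatedProb(seq.substring(1));
    }
}
```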

getInterpolatedProb

public double getInterpolatedProb(List lengthSequence,
                                  int category)

Returns a linearly interpolated estimate of the last length in the sequence given the rest of it. This function is called recursively in conjunction with getEmpiricalProb to build up the full equation:

  gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1)
  gIP(0) = 1/256

getEmpiricalProb

public double getEmpiricalProb(List lengthSequence,
                               int category)

Returns the empirical estimate of the probability of the last word length in the sequence given the sequence excluding that length, as observed within the given category. For example, gEP([0,2,5],2) returns P(5|0,2) as observed in category 2.

getEmpiricalProb

public double getEmpiricalProb(String word,
                               int wordLength,
                               int category)

Returns the empirical estimate of the probability of the given word given the word's length and the given category. For example, gEP("dog",3,2) returns P(word="dog"|length=3,category=2). If no words of the given length have been seen, returns prob=0.0. This is because the word model is mixed with an n-gram model, so it's important to know when the word model has nothing to contribute. NOTE: Yes, I realize passing in length is redundant, but it makes this method signature unique from gEP for the char n-gram.
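
The word model above conditions on word length and returns 0.0 when no word of that length has been seen. A minimal single-category sketch, with a hypothetical class name and storage layout, could look like this:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-category sketch of P(word | length). */
final class WordModelSketch {
    private final Map<Integer, Map<String, Integer>> countsByLength = new HashMap<>();
    private final Map<Integer, Integer> totalsByLength = new HashMap<>();

    /** Records one occurrence of the word under its length. */
    void add(String word) {
        countsByLength.computeIfAbsent(word.length(), len -> new HashMap<>())
                      .merge(word, 1, Integer::sum);
        totalsByLength.merge(word.length(), 1, Integer::sum);
    }

    /** count(word) / count(all words of that length), or 0.0 if the length is unseen. */
    double empiricalProb(String word) {
        int total = totalsByLength.getOrDefault(word.length(), 0);
        if (total == 0) {
            return 0.0; // the word model has nothing to contribute for this length
        }
        return countsByLength.get(word.length()).getOrDefault(word, 0) / (double) total;
    }
}
```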

getPriorProb

public double getPriorProb(int category)

Returns the empirical a priori probability of each category, as observed in the training data (fraction of each category in the whole training data).

getNumCategories

public int getNumCategories()

Returns the number of different categories represented in this classifier.

getEndMarkedString

public static String getEndMarkedString(String line)

Returns the given line prepended with enough ' ' symbols to allow n-gram parsing. Also adds a '^' to the end so a terminal n-gram can be counted. For example, if n=4, "A sentence" would be returned as "   A sentence^" (three leading spaces). Whitespace is trimmed from both ends of the line before end-marking is applied.

getPureString

public static String getPureString(String word)

Prunes the first (cn-1) chars from the beginning of the word as well as the final char. Inverse of getEndMarkedString().
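
End-marking and its inverse can be sketched as a pair of string helpers. The n-gram order is passed in as a parameter here because the value of cn is not given on this page; the class name and the symbol constants (' ' and '^', taken from the descriptions above) are illustrative assumptions.

```java
/** Hypothetical sketch of end-marking a line and stripping the markers again. */
final class EndMarkSketch {
    static final char START_SYMBOL = ' ';
    static final char END_SYMBOL = '^';

    /** Trims the line, prepends (n-1) start symbols and appends one end symbol. */
    static String endMark(String line, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n - 1; i++) {
            sb.append(START_SYMBOL);
        }
        return sb.append(line.trim()).append(END_SYMBOL).toString();
    }

    /** Drops the first (n-1) chars and the final char of an end-marked string. */
    static String pure(String marked, int n) {
        return marked.substring(n - 1, marked.length() - 1);
    }
}
```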

getWordLengths

public static List getWordLengths(String line)

Takes an end-marked string and returns a list of Integers for the length of each word. List includes (cn-1) starting 0's and one trailing 0.
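
A possible reading of the description above: strip the markers, split on whitespace, and pad the resulting length list with zeros. The sketch assumes words are separated by whitespace and takes the n-gram order as a parameter; the class name and the handling of empty lines are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of turning an end-marked line into a padded list of word lengths. */
final class WordLengthSketch {

    /** Returns (n-1) leading zeros, one length per word, and one trailing zero. */
    static List<Integer> wordLengths(String marked, int n) {
        List<Integer> lengths = new ArrayList<>();
        for (int i = 0; i < n - 1; i++) {
            lengths.add(0);
        }
        // Remove the (n-1) start symbols and the single end symbol.
        String body = marked.substring(n - 1, marked.length() - 1).trim();
        if (!body.isEmpty()) {
            for (String word : body.split("\\s+")) {
                lengths.add(word.length());
            }
        }
        lengths.add(0);
        return lengths;
    }
}
```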

getWordsWithContext

public static List getWordsWithContext(String line)

Takes an end-marked string and returns a List of strings, one for each word in the line. Each word has (cn-1) prefix chars and one suffix char (either a space or '^') for context. Thus each word is sort of "end-marked".

getCharBin

protected int getCharBin(String charSequence,
                         int category)

Returns the index of the appropriate EM interpolation parameter bin for the given char ngram. Specifically, looks at the count of the conditioning context (i.e. all but the last char) and returns a separate bin index depending on the size of the count, using charBinCutoffs (see top)
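
Binning by context count can be sketched as a lookup against a sorted cutoff array. The cutoff values used in the usage note below are invented for illustration (the actual contents of charBinCutoffs are not shown on this page), and the convention that a count equal to a cutoff falls in the lower bin is likewise an assumption.

```java
/** Hypothetical sketch of mapping a context count to an interpolation-weight bin. */
final class BinSketch {

    /** Returns the index of the first cutoff that the count does not exceed. */
    static int binFor(int contextCount, int[] cutoffs) {
        for (int i = 0; i < cutoffs.length; i++) {
            if (contextCount <= cutoffs[i]) {
                return i;
            }
        }
        return cutoffs.length; // counts above every cutoff share the last bin
    }

    /** One bin per cutoff plus one overflow bin. */
    static int binCount(int[] cutoffs) {
        return cutoffs.length + 1;
    }
}
```

For example, with illustrative cutoffs {0, 5, 50}, unseen contexts get their own bin 0, counts 1 to 5 share bin 1, counts 6 to 50 share bin 2, and anything larger lands in bin 3.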

getCharBinCount

protected int getCharBinCount()

Returns the number of bins used for char EM interpolation.

getLengthBin

protected int getLengthBin(List lengthSequence,
                           int category)

Returns the index of the appropriate EM interpolation parameter bin for the given length ngram. Specifically, looks at the count of the conditioning context (i.e. all but the last length) and returns a separate bin index depending on the size of the count, using lengthBinCutoffs (see top).

getLengthBinCount

protected int getLengthBinCount()

Returns the number of bins used for length EM interpolation.

test

protected void test(String testFilename)
             throws FileNotFoundException,
                    IOException

Runs the classifier on each line in the given test file and prints out the category with the highest score.

Throws: FileNotFoundException, IOException

generateWord

public String generateWord(int wordLength,
                           String initialContext,
                           char finalChar,
                           int category)

Randomly generates a word of the given length, starting with the given initial context, and ending with the given final char by sampling from the char n-gram model of the given category. Since it's unfair to force early termination, this method generates words of the given length until one naturally occurs with the final char. The word length does not include the initial context or the final char. Returns the generated word without the initial context, but with the final char.
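
The "generate until one naturally ends with the final char" strategy is a form of rejection sampling. The sketch below samples wordLength + 1 chars from a caller-supplied model and accepts an attempt only when the last sampled char equals finalChar. The class name, the BiFunction-based model interface, and the maxAttempts cap are assumptions added for illustration and safety; they are not part of the documented signature.

```java
import java.util.Random;
import java.util.function.BiFunction;

/** Hypothetical sketch of generating a fixed-length word by rejection sampling. */
final class GenerateWordSketch {

    /**
     * nextChar maps (history so far, random source) to the next sampled char.
     * Returns the word without the initial context but with the final char.
     */
    static String generateWord(int wordLength, String initialContext, char finalChar,
                               BiFunction<String, Random, Character> nextChar,
                               Random rand, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            StringBuilder history = new StringBuilder(initialContext);
            // Sample wordLength chars for the word plus one char that should close it.
            for (int i = 0; i <= wordLength; i++) {
                history.append(nextChar.apply(history.toString(), rand).charValue());
            }
            if (history.charAt(history.length() - 1) == finalChar) {
                return history.substring(initialContext.length());
            }
            // Otherwise the word did not end naturally at this length; try again.
        }
        throw new IllegalStateException("no word of length " + wordLength + " ended naturally");
    }
}
```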

generateLine

public String generateLine(int category)

Generates a novel example of the given category, starting with (cn-1) start symbols and ending with an end symbol. First generates a word-lengths list, then generates a word for each length.


Stanford NLP Group
