edu.stanford.nlp.ie.pcfg
Class PNPC

java.lang.Object
  |
  +--edu.stanford.nlp.ie.pcfg.PNPC
All Implemented Interfaces:
Serializable

public class PNPC
extends Object
implements Serializable

Statistical classifier of unseen proper noun phrases. Supports training and testing on data files. Uses an n-gram word-length model, an n-gram character model, and a word model.

See Also:
Serialized Form

Field Summary
protected  int[] categoryCounts
           
 List categoryNames
           
static int[] charBinCutoffs
           
protected  double charConvergenceMargin
           
protected  double[][][] charInterpolationConstants
           
protected  HashMap[] charSequenceCounts
           
protected  HashMap[] charSequenceTotalsByLength
           
protected  int[] charTotalCounts
           
protected  HashMap[] charWordInterpolationConstants
           
static int cn
           
static char END_SYMBOL
           
protected  List[] heldOutExamples
           
protected  int heldOutPercent
           
static int[] lengthBinCutoffs
           
protected  double lengthConvergenceMargin
           
protected  double[][][] lengthInterpolationConstants
           
protected  double lengthNormalization
           
protected  double[] lengthNormalizations
           
protected  HashMap[] lengthSequenceCounts
           
static int ln
           
protected  int maxPriorBoost
           
protected  int numCategories
           
protected  int numCharWordSteps
           
protected  int numExamples
           
protected  double priorBoost
           
static Random rand
           
static char START_SYMBOL
           
protected  HashMap[] wordCountsByLength
           
protected  int[] wordTotalCounts
           
protected  HashMap[] wordTotalsByLength
           
 
Constructor Summary
PNPC(List categoryNames, List trainingLines)
          Constructs a new PNPC which is trained on the given file.
 
Method Summary
protected  void addCounts(String line, int category)
          Counts relevant statistics for the given example in its given category.
protected  void computeCharSequenceTotals()
          Computes the total probability of generating all words of a given length.
 String generateLine(int category)
          Generates a novel example of the given category, starting with (cn-1) start symbols and ending with an end symbol.
 String generateWord(int wordLength, String initialContext, char finalChar, int category)
          Randomly generates a word of the given length, starting with the given initial context, and ending with the given final char by sampling from the char n-gram model of the given category.
 int getBestCategory(String line)
          Returns the category that generates the given line with the highest probability.
protected  int getCharBin(String charSequence, int category)
          Returns the index of the appropriate EM interpolation parameter bin for the given char ngram.
protected  int getCharBinCount()
          Returns the number of bins used for char EM interpolation.
 double getEmpiricalProb(List lengthSequence, int category)
          Returns the empirical estimate of the probability of the last word length in the sequence given the sequence excluding that length, as observed within the given category.
 double getEmpiricalProb(String charSequence, int category)
          Returns the empirical estimate of the probability of the last char in the sequence given the sequence excluding that char, as observed within the given category.
 double getEmpiricalProb(String word, int wordLength, int category)
          Returns the empirical estimate of the probability of the given word given the word's length and the given category.
static String getEndMarkedString(String line)
          Returns the given line prepended with enough ' ' symbols to allow n-gram parsing.
protected  int getHeldOutScore()
          Runs the classifier on the held-out examples and returns the number of correctly classified examples.
 double getInterpolatedProb(List lengthSequence, int category)
          Returns a linearly interpolated estimate of the last length in the sequence given the rest of it.
 double getInterpolatedProb(String charSequence, int category)
          Returns a linearly interpolated estimate of the last char in the sequence given the rest of it.
protected  int getLengthBin(List lengthSequence, int category)
          Returns the index of the appropriate EM interpolation parameter bin for the given length ngram.
protected  int getLengthBinCount()
          Returns the number of bins used for length EM interpolation.
 int getNumCategories()
          Returns the number of different categories represented in this classifier.
 double getPriorProb(int category)
          Returns the empirical a priori probability of the given category, as observed in the training data (fraction of the whole training data that belongs to that category).
static String getPureString(String word)
          Prunes the first (cn-1) chars from the beginning of the word as well as the final char.
 double getScore(String line, int category)
          Returns the score for the given example as scored in the given category.
static List getWordLengths(String line)
          Takes an end-marked string and returns a list of Integers for the length of each word.
static List getWordsWithContext(String line)
          Takes an end-marked string and returns a List of strings, one for each word in the line.
protected  void incrementCount(HashMap map, Object key)
          Adds 1 to the count for the given key in the given map.
protected  void incrementCountByLength(HashMap map, int length, Object key)
          Adds 1 to the count for the given key in the given map under the given length.
protected  void initCounts()
          Initializes and zeroes all variables and counts before training.
protected  void learnCharInterpolationConstants()
          Learns good weights for deleted interpolation in the char n-gram model via EM.
protected  void learnCharWordInterpolationConstants()
          Computes the best interpolation weights for the char n-gram vs word model with a line search.
protected  void learnLengthInterpolationConstants()
          Learns good weights for deleted interpolation in the length n-gram model via EM.
protected  void learnLengthNormalizations()
          Learns a constant for each category to normalize word probabilities by length.
protected  void learnPriorBoost()
          Sets the log-prior multiplier (priorBoost) to the best value on the held-out set.
protected  void test(String testFilename)
          Runs the classifier on each line in the given test file and prints out the category with the highest score.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

categoryNames

public List categoryNames

ln

public static final int ln
See Also:
Constant Field Values

cn

public static final int cn
See Also:
Constant Field Values

START_SYMBOL

public static final char START_SYMBOL
See Also:
Constant Field Values

END_SYMBOL

public static final char END_SYMBOL
See Also:
Constant Field Values

rand

public static final Random rand

charBinCutoffs

public static final int[] charBinCutoffs

lengthBinCutoffs

public static final int[] lengthBinCutoffs

numCategories

protected int numCategories

numExamples

protected int numExamples

priorBoost

protected double priorBoost

lengthNormalization

protected double lengthNormalization

heldOutPercent

protected final int heldOutPercent
See Also:
Constant Field Values

heldOutExamples

protected List[] heldOutExamples

charConvergenceMargin

protected final double charConvergenceMargin
See Also:
Constant Field Values

lengthConvergenceMargin

protected final double lengthConvergenceMargin
See Also:
Constant Field Values

numCharWordSteps

protected final int numCharWordSteps
See Also:
Constant Field Values

maxPriorBoost

protected final int maxPriorBoost
See Also:
Constant Field Values

categoryCounts

protected int[] categoryCounts

lengthSequenceCounts

protected HashMap[] lengthSequenceCounts

lengthInterpolationConstants

protected double[][][] lengthInterpolationConstants

wordTotalCounts

protected int[] wordTotalCounts

charSequenceCounts

protected HashMap[] charSequenceCounts

charInterpolationConstants

protected double[][][] charInterpolationConstants

charSequenceTotalsByLength

protected HashMap[] charSequenceTotalsByLength

charTotalCounts

protected int[] charTotalCounts

wordCountsByLength

protected HashMap[] wordCountsByLength

wordTotalsByLength

protected HashMap[] wordTotalsByLength

charWordInterpolationConstants

protected HashMap[] charWordInterpolationConstants

lengthNormalizations

protected double[] lengthNormalizations
Constructor Detail

PNPC

public PNPC(List categoryNames,
            List trainingLines)
Constructs a new PNPC which is trained on the given file. Number of categories is inferred from reading the training file. The first line of the training file should just be an integer, indicating the total number of categories in the training set. Each subsequent line should be of the format "# rest of example" (excluding quotes) where # is the category (1-n, don't use category 0) and "rest of example" is the full example line. Training is first performed on all but a held out set of data. Then various parameters are set on the held out data. Finally, the held out data is also trained on.

Method Detail

initCounts

protected void initCounts()
Initializes and zeroes all variables and counts before training.


addCounts

protected void addCounts(String line,
                         int category)
Counts relevant statistics for the given example in its given category.


incrementCount

protected void incrementCount(HashMap map,
                              Object key)
Adds 1 to the count for the given key in the given map. If the key has not been seen before, creates a new count of 1 for that key.


incrementCountByLength

protected void incrementCountByLength(HashMap map,
                                      int length,
                                      Object key)
Adds 1 to the count for the given key in the given map under the given length. Used for HashMaps that store sub-HashMaps by length. If either the length or the key has not been seen before, creates a new count of 1 for that key/length.
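
The two counting helpers described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the original signatures use raw HashMap, while this sketch assumes generic maps and the Java 8 Map.merge and Map.computeIfAbsent methods. CountSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

final class CountSketch {
    // Adds 1 to the count stored under key; an unseen key starts at 1.
    static void incrementCount(Map<Object, Integer> map, Object key) {
        map.merge(key, 1, Integer::sum);
    }

    // Adds 1 to the count stored under (length, key).
    // The per-length sub-map is created the first time a length is seen.
    static void incrementCountByLength(Map<Integer, Map<Object, Integer>> map,
                                       int length, Object key) {
        incrementCount(map.computeIfAbsent(length, k -> new HashMap<>()), key);
    }
}
```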


learnPriorBoost

protected void learnPriorBoost()
Sets the log-prior multiplier (priorBoost) to the best value on the held-out set. Specifically, classifies the held-out examples using different values, and keeps the one that leads to the best score.
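
The search described above can be sketched as a simple sweep over candidate values. The candidate set (integers from 0 to a maximum) and the tie-breaking rule are assumptions made for illustration; PriorBoostSearch and the heldOutScore callback are hypothetical stand-ins for the class's own fields and getHeldOutScore().

```java
import java.util.function.IntUnaryOperator;

final class PriorBoostSearch {
    // Tries every boost in [0, maxBoost] and returns the one whose held-out
    // score (number of correctly classified examples) is highest.
    // Ties keep the smallest boost, an arbitrary choice for this sketch.
    static int bestBoost(int maxBoost, IntUnaryOperator heldOutScore) {
        int best = 0;
        int bestScore = heldOutScore.applyAsInt(0);
        for (int boost = 1; boost <= maxBoost; boost++) {
            int score = heldOutScore.applyAsInt(boost);
            if (score > bestScore) {
                bestScore = score;
                best = boost;
            }
        }
        return best;
    }
}
```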


learnLengthInterpolationConstants

protected void learnLengthInterpolationConstants()
Learns good weights for deleted interpolation in the length n-gram model via EM. Learns separate weights based on the counts of the conditioning contexts. Starts by mixing a 1-gram and 0-gram, then mixes the 2-gram with the 1/0-mixture, and so on all the way up to the full n-gram.
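
One mixing step of the procedure above can be sketched with the standard EM update for a two-component mixture weight. This is a simplified illustration under stated assumptions: a single weight (no binning by context count), probabilities already evaluated on the held-out events, and a stopping rule based on the change in the weight. InterpolationEm, pHigh, pLow and margin are hypothetical names.

```java
final class InterpolationEm {
    // pHigh[i] and pLow[i] are the higher-order and lower-order model
    // probabilities of held-out event i. Returns the weight w on pHigh that
    // EM converges to for the mixture w*pHigh + (1-w)*pLow.
    static double learnWeight(double[] pHigh, double[] pLow, double margin) {
        double w = 0.5;
        if (pHigh.length == 0) {
            return w; // nothing to learn from
        }
        for (int iter = 0; iter < 1000; iter++) {
            double total = 0.0;
            for (int i = 0; i < pHigh.length; i++) {
                double high = w * pHigh[i];
                double mix = high + (1.0 - w) * pLow[i];
                if (mix > 0.0) {
                    total += high / mix; // E-step: share of event i owed to pHigh
                }
            }
            double next = total / pHigh.length; // M-step: average responsibility
            if (Math.abs(next - w) < margin) {
                return next;
            }
            w = next;
        }
        return w;
    }
}
```

Repeating this step with the learned mixture as the new lower-order model would give the 1/0, then 2/(1/0), and so on up to the full n-gram.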


learnCharInterpolationConstants

protected void learnCharInterpolationConstants()
Learns good weights for deleted interpolation in the char n-gram model via EM. Learns separate weights based on the counts of the conditioning contexts. Starts by mixing a 1-gram and 0-gram, then mixes the 2-gram with the 1/0-mixture, and so on all the way up to the full n-gram.


learnCharWordInterpolationConstants

protected void learnCharWordInterpolationConstants()
Computes the best interpolation weights for the char n-gram vs word model with a line search.


computeCharSequenceTotals

protected void computeCharSequenceTotals()
Computes the total probability of generating all words of a given length. Simply looks at all unigram length counts, and normalizes them into a probability distribution.


learnLengthNormalizations

protected void learnLengthNormalizations()
Learns a constant for each category to normalize word probabilities by length. Since word probabilities are calculated with an n-gram, longer words get unfair influence over short words. Thus we normalize the probabilities by taking the (k/l)'th root, where l is the word length (# chars) and k is a constant learned here for each category.


getHeldOutScore

protected int getHeldOutScore()
Runs the classifier on the held-out examples and returns the number of correctly classified examples. Useful for setting various category-neutral parameters of the model and seeing how they do. For example, used to set log-prior boost.


getBestCategory

public int getBestCategory(String line)
Returns the category that generates the given line with the highest probability. NOTE: Input lines should already be end-marked (e.g. run getEndMarkedString() before calling getBestCategory()).


getScore

public double getScore(String line,
                       int category)
Returns the score for the given example as scored in the given category. Essentially computes Log[P(line|category)*P(category)]. Higher scores mean the line is more likely to be generated from this category. NOTE: Input lines should already be end-marked (e.g. run getEndMarkedString() before calling getScore()).
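
The scoring rule and the argmax in getBestCategory() can be sketched as below. Applying priorBoost as a multiplier on the log prior is inferred from the learnPriorBoost() description and is an assumption, as are the 0-based category indices and the name ScoreSketch.

```java
final class ScoreSketch {
    // Log-space score: priorBoost * log P(category) + log P(line | category).
    static double score(double logLikelihood, double prior, double priorBoost) {
        return priorBoost * Math.log(prior) + logLikelihood;
    }

    // Returns the index of the category with the highest score.
    static int bestCategory(double[] logLikelihoods, double[] priors, double priorBoost) {
        int best = 0;
        double bestScore = score(logLikelihoods[0], priors[0], priorBoost);
        for (int c = 1; c < logLikelihoods.length; c++) {
            double s = score(logLikelihoods[c], priors[c], priorBoost);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }
}
```

With priorBoost set to 0 the prior is ignored and only the likelihood decides; larger values let frequent categories win closer calls.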


getInterpolatedProb

public double getInterpolatedProb(String charSequence,
                                  int category)
Returns a linearly interpolated estimate of the last char in the sequence given the rest of it. This function is called recursively in conjunction with getEmpiricalProb to build up the full equation:
    gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1)
    gIP(0) = 1/256
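
The recursion in this description can be written out directly. In the sketch below the empirical estimates and weights are passed in as arrays rather than looked up from counts and bins, which is a simplification for illustration; InterpolationRecursion and its parameter names are hypothetical.

```java
final class InterpolationRecursion {
    // empirical[k-1] is gEP for the order-k n-gram and weights[k-1] is w_k.
    // Order 0 is the uniform base case 1/256.
    static double interpolated(double[] empirical, double[] weights, int order) {
        if (order == 0) {
            return 1.0 / 256.0;
        }
        double w = weights[order - 1];
        return w * empirical[order - 1]
                + (1.0 - w) * interpolated(empirical, weights, order - 1);
    }
}
```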


getEmpiricalProb

public double getEmpiricalProb(String charSequence,
                               int category)
Returns the empirical estimate of the probability of the last char in the sequence given the sequence excluding that char, as observed within the given category. For example, gEP("Inc.",2) returns P(.|I,n,c) as observed in category 2.
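
An empirical estimate of this kind is a ratio of counts: the count of the full sequence divided by the count of its conditioning context. The sketch below assumes a single map holding counts for sequences of every length, and returns 0.0 for an unseen context; both choices, and the name EmpiricalProb, are assumptions for illustration.

```java
import java.util.Map;

final class EmpiricalProb {
    // Estimates P(last char | preceding chars) as count(seq) / count(context).
    // seq must be non-empty; an unseen context yields 0.0.
    static double empiricalProb(Map<String, Integer> counts, String seq) {
        String context = seq.substring(0, seq.length() - 1);
        int contextCount = counts.getOrDefault(context, 0);
        if (contextCount == 0) {
            return 0.0;
        }
        return counts.getOrDefault(seq, 0) / (double) contextCount;
    }
}
```

For example, if "Inc" was seen 4 times and "Inc." 3 times, the estimate of P(.|I,n,c) is 0.75.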


getInterpolatedProb

public double getInterpolatedProb(List lengthSequence,
                                  int category)
Returns a linearly interpolated estimate of the last length in the sequence given the rest of it. This function is called recursively in conjunction with getEmpiricalProb to build up the full equation:
    gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1)
    gIP(0) = 1/256


getEmpiricalProb

public double getEmpiricalProb(List lengthSequence,
                               int category)
Returns the empirical estimate of the probability of the last word length in the sequence given the sequence excluding that length, as observed within the given category. For example, gEP([0,2,5],2) returns P(5|0,2) as observed in category 2.


getEmpiricalProb

public double getEmpiricalProb(String word,
                               int wordLength,
                               int category)
Returns the empirical estimate of the probability of the given word given the word's length and the given category. For example, gEP("dog",3,2) returns P(word="dog"|length=3,category=2). If no words of the given length have been seen, returns prob=0.0. This is because the word model is mixed with an n-gram model, so it's important to know when the word model has nothing to contribute. NOTE: Yes, I realize passing in length is redundant, but it makes this method signature unique from gEP for the char n-gram.


getPriorProb

public double getPriorProb(int category)
Returns the empirical a priori probability of the given category, as observed in the training data (fraction of the whole training data that belongs to that category).


getNumCategories

public int getNumCategories()
Returns the number of different categories represented in this classifier.


getEndMarkedString

public static String getEndMarkedString(String line)
Returns the given line prepended with enough ' ' symbols to allow n-gram parsing. Also adds a '^' to the end so that a terminal n-gram can be counted. For example, if n=4, "A sentence" would be returned as "   A sentence^" (three leading spaces). Before applying end-marking, trims whitespace from both ends of the line.
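
End-marking and its inverse can be sketched as follows. The symbol values ' ' and '^' are taken from this description; passing n as a parameter (instead of reading the cn constant) and the names EndMarking, endMark and pure are assumptions for illustration.

```java
final class EndMarking {
    static final char START_SYMBOL = ' ';
    static final char END_SYMBOL = '^';

    // Trims the line, prepends (n-1) start symbols, and appends one end symbol.
    static String endMark(String line, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n - 1; i++) {
            sb.append(START_SYMBOL);
        }
        return sb.append(line.trim()).append(END_SYMBOL).toString();
    }

    // Inverse of endMark: drops the first (n-1) chars and the final char.
    static String pure(String marked, int n) {
        return marked.substring(n - 1, marked.length() - 1);
    }
}
```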


getPureString

public static String getPureString(String word)
Prunes the first (cn-1) chars from the beginning of the word as well as the final char. Inverse of getEndMarkedString().


getWordLengths

public static List getWordLengths(String line)
Takes an end-marked string and returns a list of Integers for the length of each word. List includes (cn-1) starting 0's and one trailing 0.


getWordsWithContext

public static List getWordsWithContext(String line)
Takes an end-marked string and returns a List of strings, one for each word in the line. Each word has (cn-1) prefix chars and one suffix char (either a space or '^') for context. Thus each word is sort of "end-marked".
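
One possible reading of this description is sketched below: each word keeps the (n-1) characters that precede it in the end-marked line, plus the single character that follows it. Whether the prefix comes from the preceding text or from fresh start symbols is not stated here, so treat this as an assumption. WordsWithContext and split are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class WordsWithContext {
    // marked is an end-marked line: (n-1) leading spaces, the text, then '^'.
    static List<String> split(String marked, int n) {
        List<String> words = new ArrayList<>();
        int endIndex = marked.length() - 1; // position of the end symbol
        int start = n - 1;                  // skip the leading start symbols
        while (start < endIndex) {
            int stop = marked.indexOf(' ', start);
            if (stop < 0) {
                stop = endIndex;            // last word runs up to the end symbol
            }
            if (stop > start) {
                // (n-1) chars of left context, the word, one char of right context
                words.add(marked.substring(start - (n - 1), stop + 1));
            }
            start = stop + 1;
        }
        return words;
    }
}
```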


getCharBin

protected int getCharBin(String charSequence,
                         int category)
Returns the index of the appropriate EM interpolation parameter bin for the given char ngram. Specifically, looks at the count of the conditioning context (i.e. all but the last char) and returns a separate bin index depending on the size of the count, using charBinCutoffs (see top)
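
Binning by context count can be sketched as a scan over a sorted cutoff array. The comparison rule (a count falls in the first bin whose cutoff it does not exceed) and the example cutoffs are assumptions; the actual values in charBinCutoffs are not listed here. BinSketch is a hypothetical name.

```java
final class BinSketch {
    // Returns the index of the first cutoff that contextCount does not exceed,
    // or cutoffs.length when it exceeds every cutoff. cutoffs must be ascending.
    static int bin(int contextCount, int[] cutoffs) {
        for (int i = 0; i < cutoffs.length; i++) {
            if (contextCount <= cutoffs[i]) {
                return i;
            }
        }
        return cutoffs.length;
    }

    // One bin per cutoff, plus one for counts above the largest cutoff.
    static int binCount(int[] cutoffs) {
        return cutoffs.length + 1;
    }
}
```

Separate bins let rarely seen contexts lean on lower-order estimates while frequently seen contexts trust their own counts.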


getCharBinCount

protected int getCharBinCount()
Returns the number of bins used for char EM interpolation.


getLengthBin

protected int getLengthBin(List lengthSequence,
                           int category)
Returns the index of the appropriate EM interpolation parameter bin for the given length ngram. Specifically, looks at the count of the conditioning context (i.e. all but the last length) and returns a separate bin index depending on the size of the count, using lengthBinCutoffs (see top)


getLengthBinCount

protected int getLengthBinCount()
Returns the number of bins used for length EM interpolation.


test

protected void test(String testFilename)
             throws FileNotFoundException,
                    IOException
Runs the classifier on each line in the given test file and prints out the category with the highest score.

Throws:
FileNotFoundException
IOException

generateWord

public String generateWord(int wordLength,
                           String initialContext,
                           char finalChar,
                           int category)
Randomly generates a word of the given length, starting with the given initial context, and ending with the given final char by sampling from the char n-gram model of the given category. Since it's unfair to force early termination, this method generates words of the given length until one naturally occurs with the final char. The word length does not include the initial context or the final char. Returns the generated word without the initial context, but with the final char.
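
The generate-until-accepted loop described above is a form of rejection sampling, sketched below. The n-gram model is replaced by a caller-supplied nextChar function, and the acceptance test also rejects candidates that contain the final char before the last position; both are assumptions for illustration, as is the name WordSampler. The loop only terminates if the sampler can actually produce the final char.

```java
import java.util.function.Function;

final class WordSampler {
    // nextChar samples one character given everything generated so far
    // (initial context included). The result has wordLength chars plus finalChar.
    static String generateWord(int wordLength, String initialContext, char finalChar,
                               Function<String, Character> nextChar) {
        while (true) {
            StringBuilder sb = new StringBuilder(initialContext);
            for (int i = 0; i <= wordLength; i++) {
                sb.append(nextChar.apply(sb.toString()));
            }
            String word = sb.substring(initialContext.length());
            // Accept only if finalChar first appears in the last position.
            if (word.indexOf(finalChar) == wordLength) {
                return word;
            }
        }
    }
}
```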


generateLine

public String generateLine(int category)
Generates a novel example of the given category, starting with (cn-1) start symbols and ending with an end symbol. First generates a word-lengths list, then generates a word for each length.



Stanford NLP Group