edu.stanford.nlp.ie.pnp
Class PnpClassifier

java.lang.Object
  |
  +--edu.stanford.nlp.ie.pnp.PnpClassifier
All Implemented Interfaces:
Serializable

public class PnpClassifier
extends Object
implements Serializable

Statistical classifier of unseen proper noun phrases. Supports training and testing on data files. Uses an n-gram word-length model, an n-gram character model, and a word model.

Standard usage:

See Also:
Serialized Form

Field Summary
static int[] charBinCutoffs
           
static int cn
           
static char END_SYMBOL
           
static int[] lengthBinCutoffs
           
static int ln
           
static Random rand
           
static char START_SYMBOL
           
 
Constructor Summary
PnpClassifier(String trainingFilename)
          Constructs a new PnpClassifier which is trained on the given file.
 
Method Summary
 String generateLine(int category)
          Generates a novel example of the given category, starting with (cn-1) start symbols and ending with an end symbol.
 String generateWord(int wordLength, String initialContext, char finalChar, int category)
          Randomly generates a word of the given length, starting with the given initial context, and ending with the given final char by sampling from the char n-gram model of the given category.
 int getBestCategory(String line)
          Returns the category that generates the given line with the highest probability.
 double getEmpiricalProb(List lengthSequence, int category)
          Returns the empirical estimate of the probability of the last word length in the sequence given the sequence excluding that length, as observed within the given category.
 double getEmpiricalProb(String charSequence, int category)
          Returns the empirical estimate of the probability of the last char in the sequence given the sequence excluding that char, as observed within the given category.
 double getEmpiricalProb(String word, int wordLength, int category)
          Returns the empirical estimate of the probability of the given word given the word's length and the given category.
static String getEndMarkedString(String line)
          Returns the given line prepended with enough ' ' symbols to allow n-gram parsing.
 double getInterpolatedProb(List lengthSequence, int category)
          Returns a linearly interpolated estimate of the last length in the sequence given the rest of it.
 double getInterpolatedProb(String charSequence, int category)
          Returns a linearly interpolated estimate of the last char in the sequence given the rest of it.
 double getLogProb(String line, int category)
          Computes and returns Log[P(line|category)].
 int getNumCategories()
          Returns the number of different categories represented in this classifier.
 double getPriorProb(int category)
          Returns the empirical a priori probability of the given category, as observed in the training data (the fraction of the whole training data that belongs to that category).
static String getPureString(String word)
          Prunes the first (cn-1) chars from the beginning of the word as well as the final char.
 double getScore(String line, int category)
          Returns the score for the given example as scored in the given category.
static List getWordLengths(String line)
          Takes an end-marked string and returns a list of Integers for the length of each word.
static List getWordsWithContext(String line)
          Takes an end-marked string and returns a List of strings, one for each word in the line.
static void main(String[] args)
          Trains and tests a PnpClassifier on the passed-in files.
protected  void test(String testFilename)
          Runs the classifier on each line in the given test file and prints out the category with the highest score.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ln

public static final int ln
See Also:
Constant Field Values

cn

public static final int cn
See Also:
Constant Field Values

START_SYMBOL

public static final char START_SYMBOL
See Also:
Constant Field Values

END_SYMBOL

public static final char END_SYMBOL
See Also:
Constant Field Values

rand

public static final Random rand

charBinCutoffs

public static final int[] charBinCutoffs

lengthBinCutoffs

public static final int[] lengthBinCutoffs
Constructor Detail

PnpClassifier

public PnpClassifier(String trainingFilename)
Constructs a new PnpClassifier which is trained on the given file. Number of categories is inferred from reading the training file. The first line of the training file should just be an integer, indicating the total number of categories in the training set. Each subsequent line should be of the format "# rest of example" (excluding quotes) where # is the category (1-n, don't use category 0) and "rest of example" is the full example line. Training is first performed on all but a held out set of data. Then various parameters are set on the held out data. Finally, the held out data is also trained on.
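
The file layout described above can be illustrated with a small parser. This is only a sketch under stated assumptions, not the class's actual reading code: TrainingFileSketch, Example, parseNumCategories and parseExamples are hypothetical names, blank lines are assumed to be skipped, and the category is assumed to be separated from the example text by a single space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PnpClassifier: parses the training-file layout described above.
final class TrainingFileSketch {
    /** One labeled example: a 1-based category and the example text. */
    static final class Example {
        final int category;
        final String text;

        Example(int category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    /** The first line holds only the total number of categories. */
    static int parseNumCategories(List<String> lines) {
        return Integer.parseInt(lines.get(0).trim());
    }

    /** Every later line has the form "# rest of example", where # is in 1..n. */
    static List<Example> parseExamples(List<String> lines) {
        int numCategories = parseNumCategories(lines);
        List<Example> examples = new ArrayList<>();
        for (String raw : lines.subList(1, lines.size())) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            int space = line.indexOf(' ');
            if (space < 0) {
                throw new IllegalArgumentException("Missing example text: " + line);
            }
            int category = Integer.parseInt(line.substring(0, space));
            if (category < 1 || category > numCategories) {
                throw new IllegalArgumentException("Category out of range: " + category);
            }
            examples.add(new Example(category, line.substring(space + 1).trim()));
        }
        return examples;
    }
}
```

For instance, the three lines "2", "1 Acme Inc." and "2 Aspirin" (made-up data) would yield two examples, one in category 1 and one in category 2.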

Method Detail

getBestCategory

public int getBestCategory(String line)
Returns the category that generates the given line with the highest probability. NOTE: Input lines should already be end-marked (e.g. run getEndMarkedString() before calling getBestCategory()).


getScore

public double getScore(String line,
                       int category)
Returns the score for the given example as scored in the given category. Essentially computes Log[P(line|category)*P(category)]. Higher scores mean the line is more likely to have been generated by this category. NOTE: Input lines should already be end-marked (e.g. run getEndMarkedString() before calling getScore()).
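
The relationship between the score, the log probability and the prior can be sketched as follows. This is an illustration only, assuming natural logarithms and assuming the per-category values have already been computed; ScoreSketch, score and bestCategory are hypothetical names, and index 0 of each array is left unused because categories run from 1 to n.

```java
// Hypothetical helper, not part of PnpClassifier: combines likelihood and prior in log space.
final class ScoreSketch {
    /** Log[P(line|category) * P(category)] = Log[P(line|category)] + Log[P(category)]. */
    static double score(double logLikelihood, double prior) {
        return logLikelihood + Math.log(prior); // assumption: natural log
    }

    /** Returns the 1-based category with the highest score; index 0 is ignored. */
    static int bestCategory(double[] logLikelihoods, double[] priors) {
        int best = 1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 1; c < logLikelihoods.length; c++) {
            double s = score(logLikelihoods[c], priors[c]);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }
}
```

Working in log space turns the product into a sum, which avoids underflow when a line is long and its probability is very small.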


getLogProb

public double getLogProb(String line,
                         int category)
Computes and returns Log[P(line|category)]. This is the probability of the given category generating the given line.


getInterpolatedProb

public double getInterpolatedProb(String charSequence,
                                  int category)
Returns a linearly interpolated estimate of the last char in the sequence given the rest of it. This function is called recursively in conjunction with getEmpiricalProb to build up the full equation:
gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1)
gIP(0) = 1/256
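
The recursion can be written out directly. The sketch below is not the class's implementation: InterpolationSketch and interpolated are hypothetical names, and the empirical estimates gEP(k) and weights w_k are assumed to be supplied as arrays indexed by context order k (index 0 unused). How the real weights are chosen is not covered here.

```java
// Hypothetical helper, not part of PnpClassifier: evaluates the recursion
// gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1), with gIP(0) = 1/256.
final class InterpolationSketch {
    static final double BASE = 1.0 / 256;

    /**
     * @param n         highest order to interpolate
     * @param empirical empirical[k] is gEP(k) for k = 1..n
     * @param weights   weights[k] is w_k for k = 1..n, each in [0, 1]
     */
    static double interpolated(int n, double[] empirical, double[] weights) {
        if (n == 0) {
            return BASE; // base case: uniform estimate
        }
        return weights[n] * empirical[n]
                + (1 - weights[n]) * interpolated(n - 1, empirical, weights);
    }
}
```

Because every level mixes in the level below it, the result is nonzero even when the highest-order empirical estimate is 0, as long as the weights are below 1.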


getEmpiricalProb

public double getEmpiricalProb(String charSequence,
                               int category)
Returns the empirical estimate of the probability of the last char in the sequence given the sequence excluding that char, as observed within the given category. For example, gEP("Inc.",2) returns P(.|I,n,c) as observed in category 2.
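
An empirical estimate of this kind is typically a ratio of counts. The sketch below illustrates that idea for a single category; EmpiricalProbSketch, observe and empiricalProb are hypothetical names, and returning 0.0 for an unseen context is an assumption rather than documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PnpClassifier: count-based estimate for one category.
final class EmpiricalProbSketch {
    private final Map<String, Integer> ngramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();

    /** Records one occurrence of an n-gram and of its context (all but the last char). */
    void observe(String ngram) {
        ngramCounts.merge(ngram, 1, Integer::sum);
        contextCounts.merge(ngram.substring(0, ngram.length() - 1), 1, Integer::sum);
    }

    /** count(sequence) / count(context), e.g. "Inc." gives P(.|I,n,c). */
    double empiricalProb(String charSequence) {
        String context = charSequence.substring(0, charSequence.length() - 1);
        int contextCount = contextCounts.getOrDefault(context, 0);
        if (contextCount == 0) {
            return 0.0; // assumption: an unseen context contributes nothing
        }
        return ngramCounts.getOrDefault(charSequence, 0) / (double) contextCount;
    }
}
```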


getInterpolatedProb

public double getInterpolatedProb(List lengthSequence,
                                  int category)
Returns a linearly interpolated estimate of the last length in the sequence given the rest of it. This function is called recursively in conjunction with getEmpiricalProb to build up the full equation:
gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1)
gIP(0) = 1/256


getEmpiricalProb

public double getEmpiricalProb(List lengthSequence,
                               int category)
Returns the empirical estimate of the probability of the last word length in the sequence given the sequence excluding that length, as observed within the given category. For example, gEP([0,2,5],2) returns P(5|0,2) as observed in category 2.


getEmpiricalProb

public double getEmpiricalProb(String word,
                               int wordLength,
                               int category)
Returns the empirical estimate of the probability of the given word given the word's length and the given category. For example, gEP("dog",3,2) returns P(word="dog"|length=3,category=2). If no words of the given length have been seen, returns prob=0.0. This is because the word model is mixed with an n-gram model, so it's important to know when the word model has nothing to contribute. NOTE: Yes, I realize passing in length is redundant, but it makes this method signature unique from gEP for the char n-gram.
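
The zero-on-unseen-length behavior can be illustrated with a count-based word model for a single category. This is a sketch only: WordModelSketch, observe and empiricalProb are hypothetical names, and the real class may store its counts differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PnpClassifier: P(word | length) for one category.
final class WordModelSketch {
    private final Map<String, Integer> wordCounts = new HashMap<>();
    private final Map<Integer, Integer> lengthCounts = new HashMap<>();

    /** Records one occurrence of a word and of its length. */
    void observe(String word) {
        wordCounts.merge(word, 1, Integer::sum);
        lengthCounts.merge(word.length(), 1, Integer::sum);
    }

    /** count(word) / count(words of that length); 0.0 if the length was never seen. */
    double empiricalProb(String word, int wordLength) {
        int total = lengthCounts.getOrDefault(wordLength, 0);
        if (total == 0) {
            return 0.0; // the word model has nothing to contribute for this length
        }
        return wordCounts.getOrDefault(word, 0) / (double) total;
    }
}
```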


getPriorProb

public double getPriorProb(int category)
Returns the empirical a priori probability of the given category, as observed in the training data (the fraction of the whole training data that belongs to that category).


getNumCategories

public int getNumCategories()
Returns the number of different categories represented in this classifier.


getEndMarkedString

public static String getEndMarkedString(String line)
Returns the given line prepended with enough ' ' symbols to allow n-gram parsing. Also adds a '^' to the end so a terminal n-gram can be counted. For example, if n=4, "Proper Noun" would be returned as "   Proper Noun^". Before applying end-marking, trims whitespace from both ends of line.


getPureString

public static String getPureString(String word)
Prunes the first (cn-1) chars from the beginning of the word as well as the final char. Inverse of getEndMarkedString(java.lang.String).
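
End-marking and its inverse can be sketched as a round trip. This is an illustration, not the class's code: EndMarkSketch, endMark and pure are hypothetical names, the n-gram order is passed in as a parameter n instead of being read from cn, and the symbol values ' ' and '^' are taken from the getEndMarkedString description.

```java
// Hypothetical helper, not part of PnpClassifier: end-marking and its inverse.
final class EndMarkSketch {
    static final char START_SYMBOL = ' '; // value taken from the description above
    static final char END_SYMBOL = '^';   // value taken from the description above

    /** Trims the line, prepends (n-1) start symbols, and appends one end symbol. */
    static String endMark(String line, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n - 1; i++) {
            sb.append(START_SYMBOL);
        }
        return sb.append(line.trim()).append(END_SYMBOL).toString();
    }

    /** Drops the first (n-1) chars and the final char of an end-marked string. */
    static String pure(String marked, int n) {
        return marked.substring(n - 1, marked.length() - 1);
    }
}
```

With n=4, endMark("Proper Noun", 4) gives "   Proper Noun^", and pure applied to that result gives back "Proper Noun".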


getWordLengths

public static List getWordLengths(String line)
Takes an end-marked string and returns a list of Integers for the length of each word. List includes (cn-1) starting 0's and one trailing 0. For example, the string "   Proper Noun^" would yield {0,0,0,6,4,0}.
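
The word-length list can be reproduced with a short sketch. It assumes words are separated by spaces and takes the n-gram order as a parameter n; WordLengthsSketch and wordLengths are hypothetical names, not the class's own code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PnpClassifier: word lengths of an end-marked string.
final class WordLengthsSketch {
    static List<Integer> wordLengths(String marked, int n) {
        List<Integer> lengths = new ArrayList<>();
        for (int i = 0; i < n - 1; i++) {
            lengths.add(0); // one leading 0 per start symbol
        }
        // Strip the (n-1) start symbols and the end symbol, then split on spaces.
        String pure = marked.substring(n - 1, marked.length() - 1);
        for (String word : pure.split(" +")) {
            if (!word.isEmpty()) {
                lengths.add(word.length());
            }
        }
        lengths.add(0); // one trailing 0 for the end symbol
        return lengths;
    }
}
```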


getWordsWithContext

public static List getWordsWithContext(String line)
Takes an end-marked string and returns a List of strings, one for each word in the line. Each word has (cn-1) prefix chars and one suffix char (either a space or '^') for context. Thus each word is sort of "end-marked". For example, the string "   Proper Noun^" would yield {"   Proper ","er Noun^"}.
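
The context windows in the example can be reproduced by scanning the end-marked string for words and slicing (n-1) chars before each word through one char after it. This is a sketch under the assumption that words are separated by single or repeated spaces; WordsWithContextSketch and wordsWithContext are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PnpClassifier: words with n-gram context.
final class WordsWithContextSketch {
    static List<String> wordsWithContext(String marked, int n) {
        List<String> words = new ArrayList<>();
        int end = marked.length() - 1; // index of the end symbol
        int i = n - 1;                 // first char after the start symbols
        while (i < end) {
            if (marked.charAt(i) == ' ') {
                i++; // skip separators between words
                continue;
            }
            int j = i;
            while (j < end && marked.charAt(j) != ' ') {
                j++; // advance to the space or end symbol that follows the word
            }
            // (n-1) prefix chars, the word itself, and one suffix char.
            words.add(marked.substring(i - (n - 1), j + 1));
            i = j + 1;
        }
        return words;
    }
}
```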


generateWord

public String generateWord(int wordLength,
                           String initialContext,
                           char finalChar,
                           int category)
Randomly generates a word of the given length, starting with the given initial context, and ending with the given final char by sampling from the char n-gram model of the given category. Since it's unfair to force early termination, this method generates words of the given length until one naturally occurs with the final char. The word length does not include the initial context or the final char. Returns the generated word without the initial context, but with the final char.
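
The generate-until-accepted loop described above is a form of rejection sampling, which can be sketched as follows. Everything here is illustrative: GenerateWordSketch and the CharSampler interface are hypothetical stand-ins for the char n-gram model, and the maxAttempts cap is an added safeguard that the description does not mention. Whether the real method also rejects candidates with separators inside the word is not stated, so this sketch checks only the final char.

```java
import java.util.Random;

// Hypothetical helper, not part of PnpClassifier: rejection sampling of one word.
final class GenerateWordSketch {
    /** Hypothetical stand-in for the char n-gram model of one category. */
    interface CharSampler {
        char sample(String context, Random rand);
    }

    static String generateWord(int wordLength, String initialContext, char finalChar,
                               CharSampler model, Random rand, int maxAttempts) {
        int contextLength = initialContext.length();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            StringBuilder sb = new StringBuilder(initialContext);
            // Sample wordLength chars plus one more for the final position.
            for (int i = 0; i <= wordLength; i++) {
                String context = sb.substring(sb.length() - contextLength);
                sb.append(model.sample(context, rand));
            }
            // Accept only if the final char arose naturally from sampling.
            if (sb.charAt(sb.length() - 1) == finalChar) {
                return sb.substring(contextLength); // drop the initial context
            }
        }
        throw new IllegalStateException("No sampled word ended with the requested final char");
    }
}
```

Rejecting and resampling, instead of overwriting the last char, keeps the accepted words distributed as the model would produce them given that they end in the requested char.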


generateLine

public String generateLine(int category)
Generates a novel example of the given category, starting with (cn-1) start symbols and ending with an end symbol. First generates a word-lengths list, then generates a word for each length.


test

protected void test(String testFilename)
             throws FileNotFoundException,
                    IOException
Runs the classifier on each line in the given test file and prints out the category with the highest score.

Throws:
FileNotFoundException
IOException

main

public static void main(String[] args)
Trains and tests a PnpClassifier on the passed-in files.

Usage: java PnpClassifier trainingFilename testFilename.

See Also:
PnpClassifier(String trainingFilename), test(String testFilename)


Stanford NLP Group