PnpClassifier (Stanford JavaNLP API)

edu.stanford.nlp.ie.pnp
Class PnpClassifier

java.lang.Object
  |
  +--edu.stanford.nlp.ie.pnp.PnpClassifier

All Implemented Interfaces: Serializable

public class PnpClassifier
extends Object
implements Serializable

Statistical classifier of unseen proper noun phrases. Supports training and testing on data files. Uses an n-gram word-length model, an n-gram character model, and a word model.

Standard usage:

To train a new PnpClassifier, call PnpClassifier(String trainingFilename).
To get the probability of generating a given string for a given category, call getLogProb(String line,int category).
To find the most probable category for a given string, call getBestCategory(String line).
To generate a novel string for a given category, call generateLine(int category).

See Also: Serialized Form

Field Summary

static int[] charBinCutoffs


static int cn


static char END_SYMBOL


static int[] lengthBinCutoffs


static int ln


static Random rand


static char START_SYMBOL


Constructor Summary

PnpClassifier(String trainingFilename)
          Constructs a new PnpClassifier which is trained on the given file.

Method Summary

String generateLine(int category)
          Generates a novel example of the given category, starting with (cn-1) start symbols and ending with an end symbol.

String generateWord(int wordLength, String initialContext, char finalChar, int category)
          Randomly generates a word of the given length, starting with the given initial context and ending with the given final char, by sampling from the char n-gram model of the given category.

int getBestCategory(String line)
          Returns the category that generates the given line with the highest probability.

double getEmpiricalProb(List lengthSequence, int category)
          Returns the empirical estimate of the probability of the last word length in the sequence given the sequence excluding that length, as observed within the given category.

double getEmpiricalProb(String charSequence, int category)
          Returns the empirical estimate of the probability of the last char in the sequence given the sequence excluding that char, as observed within the given category.

double getEmpiricalProb(String word, int wordLength, int category)
          Returns the empirical estimate of the probability of the given word given the word's length and the given category.

static String getEndMarkedString(String line)
          Returns the given line prepended with enough ' ' symbols to allow n-gram parsing.

double getInterpolatedProb(List lengthSequence, int category)
          Returns a linearly interpolated estimate of the last length in the sequence given the rest of it.

double getInterpolatedProb(String charSequence, int category)
          Returns a linearly interpolated estimate of the last char in the sequence given the rest of it.

double getLogProb(String line, int category)
          Computes and returns Log[P(line|category)].

int getNumCategories()
          Returns the number of different categories represented in this classifier.

double getPriorProb(int category)
          Returns the empirical a priori probability of the given category, as observed in the training data (the fraction of the whole training data that belongs to that category).

static String getPureString(String word)
          Prunes the first (cn-1) chars from the beginning of the word as well as the final char.

double getScore(String line, int category)
          Returns the score for the given example as scored in the given category.

static List getWordLengths(String line)
          Takes an end-marked string and returns a list of Integers for the length of each word.

static List getWordsWithContext(String line)
          Takes an end-marked string and returns a List of strings, one for each word in the line.

static void main(String[] args)
          Trains and tests a PnpClassifier on the passed-in files.

protected void test(String testFilename)
          Runs the classifier on each line in the given test file and prints out the category with the highest score.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

ln

public static final int ln

See Also: Constant Field Values

cn

public static final int cn

See Also: Constant Field Values

START_SYMBOL

public static final char START_SYMBOL

See Also: Constant Field Values

END_SYMBOL

public static final char END_SYMBOL

See Also: Constant Field Values

rand

public static final Random rand

charBinCutoffs

public static final int[] charBinCutoffs

lengthBinCutoffs

public static final int[] lengthBinCutoffs

Constructor Detail

PnpClassifier

public PnpClassifier(String trainingFilename)

Constructs a new PnpClassifier which is trained on the given file. The number of categories is read from the training file: its first line should contain only an integer giving the total number of categories in the training set. Each subsequent line should have the format "# rest of example" (excluding quotes), where # is the category (1 to n; don't use category 0) and "rest of example" is the full example line. Training is first performed on all but a held-out set of data. Then various parameters are set on the held-out data. Finally, the held-out data is also trained on.
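The training file layout described above can be illustrated with a small stand-alone reader. This is only a sketch of the format, not the constructor's actual parsing code: the class TrainingFileSketch, its Example type, the choice to skip blank lines, and the sample phrases in the usage note are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PnpClassifier: reads the documented layout
// (category count on line 1, then "category rest-of-example" on each line).
final class TrainingFileSketch {

    /** One labeled example: a category in 1..n and the example text. */
    static final class Example {
        final int category;
        final String text;

        Example(int category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    static List<Example> parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("missing category-count line");
        }
        int numCategories = Integer.parseInt(lines.get(0).trim());
        List<Example> examples = new ArrayList<>();
        for (String raw : lines.subList(1, lines.size())) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            int split = line.indexOf(' ');
            if (split < 0) {
                throw new IllegalArgumentException("no example text: " + line);
            }
            int category = Integer.parseInt(line.substring(0, split));
            if (category < 1 || category > numCategories) {
                throw new IllegalArgumentException("category out of range: " + line);
            }
            examples.add(new Example(category, line.substring(split + 1).trim()));
        }
        return examples;
    }
}
```

For instance, the made-up lines "2", "1 Acme Holdings Inc." and "2 Zyrtol" would parse into two examples, while a line starting with category 0 would be rejected.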

Method Detail

getBestCategory

public int getBestCategory(String line)

Returns the category that generates the given line with the highest probability. NOTE: Input lines should already be end-marked (e.g. run getEndMarkedString() before calling getBestCategory()).

getScore

public double getScore(String line,
                       int category)

Returns the score for the given example as scored in the given category. Essentially computes Log[P(line|category)*P(category)]. Higher scores mean the line is more likely to be generated from this category. NOTE: Input lines should already be end-marked (e.g. run getEndMarkedString() before calling getScore()).
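The relationship between the score, the log probability, and the prior can be sketched as follows. This is a minimal illustration of Log[P(line|category)*P(category)] followed by an argmax over categories; the class ScoreSketch and the array-based inputs are assumptions, and the real methods compute these quantities from the trained model rather than taking them as arguments.

```java
// Hypothetical illustration, not part of PnpClassifier.
final class ScoreSketch {

    /** Log[P(line|category) * P(category)], written as a sum of logs. */
    static double score(double logProbGivenCategory, double prior) {
        return logProbGivenCategory + Math.log(prior);
    }

    /**
     * Returns the category with the highest score. Both arrays are indexed
     * by category 1..n; index 0 is unused, mirroring the rule that
     * category 0 is not a valid category.
     */
    static int bestCategory(double[] logProbs, double[] priors) {
        int best = 1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 1; c < logProbs.length; c++) {
            double s = score(logProbs[c], priors[c]);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }
}
```

Because the prior enters the score, a category with a slightly lower Log[P(line|category)] can still win if it is much more common in the training data.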

getLogProb

public double getLogProb(String line,
                         int category)

Computes and returns Log[P(line|category)]. This is the probability of the given category generating the given line.

getInterpolatedProb

public double getInterpolatedProb(String charSequence,
                                  int category)

Returns a linearly interpolated estimate of the last char in the sequence given the rest of it. This function is called recursively in conjunction with getEmpiricalProb to build up the full equation:
gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1)
gIP(0) = 1/256
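The recursion can be sketched in a few lines. This is an illustration under stated assumptions, not the class's implementation: InterpolationSketch, its count maps, and the fixed weights array are hypothetical (the constructor documentation says parameters are set on held-out data), and each call is assumed to pass a sequence no longer than the highest order for which a weight is supplied.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1).
final class InterpolationSketch {
    private final Map<String, Integer> ngramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final double[] weights; // weights[n] is w_n; index 0 is unused

    InterpolationSketch(double[] weights) {
        this.weights = weights.clone();
    }

    /** Counts every n-gram of order 1..maxOrder, plus its (n-1)-char context. */
    void observe(String text) {
        int maxOrder = weights.length - 1;
        for (int end = 1; end <= text.length(); end++) {
            for (int n = 1; n <= maxOrder && n <= end; n++) {
                String gram = text.substring(end - n, end);
                ngramCounts.merge(gram, 1, Integer::sum);
                contextCounts.merge(gram.substring(0, n - 1), 1, Integer::sum);
            }
        }
    }

    /** gEP: P(last char | preceding chars); 0 if the context was never seen. */
    double empirical(String seq) {
        int ctx = contextCounts.getOrDefault(seq.substring(0, seq.length() - 1), 0);
        return ctx == 0 ? 0.0 : ngramCounts.getOrDefault(seq, 0) / (double) ctx;
    }

    /** gIP: mixes the order-n estimate with the lower orders, recursively. */
    double interpolated(String seq) {
        if (seq.isEmpty()) {
            return 1.0 / 256; // gIP(0): uniform over 256 possible chars
        }
        double w = weights[seq.length()];
        // Dropping the oldest context char moves from order n to order n-1.
        return w * empirical(seq) + (1 - w) * interpolated(seq.substring(1));
    }
}
```

The base case guarantees that every sequence gets a non-zero estimate as long as the weights are below 1, even when none of its contexts appeared in training.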

getEmpiricalProb

public double getEmpiricalProb(String charSequence,
                               int category)

Returns the empirical estimate of the probability of the last char in the sequence given the sequence excluding that char, as observed within the given category. For example, gEP("Inc.",2) returns P(.|I,n,c) as observed in category 2.

getInterpolatedProb

public double getInterpolatedProb(List lengthSequence,
                                  int category)

Returns a linearly interpolated estimate of the last length in the sequence given the rest of it. This function is called recursively in conjunction with getEmpiricalProb to build up the full equation:
gIP(n) = w_n*gEP(n) + (1-w_n)*gIP(n-1)
gIP(0) = 1/256

getEmpiricalProb

public double getEmpiricalProb(List lengthSequence,
                               int category)

Returns the empirical estimate of the probability of the last word length in the sequence given the sequence excluding that length, as observed within the given category. For example, gEP([0,2,5],2) returns P(5|0,2) as observed in category 2.

getEmpiricalProb

public double getEmpiricalProb(String word,
                               int wordLength,
                               int category)

Returns the empirical estimate of the probability of the given word given the word's length and the given category. For example, gEP("dog",3,2) returns P(word="dog"|length=3,category=2). If no words of the given length have been seen, returns prob=0.0. This is because the word model is mixed with an n-gram model, so it's important to know when the word model has nothing to contribute. NOTE: Passing in the length is redundant, but it keeps this method's signature distinct from the gEP overload for the char n-gram.

getPriorProb

public double getPriorProb(int category)

Returns the empirical a priori probability of the given category, as observed in the training data (the fraction of the whole training data that belongs to that category).

getNumCategories

public int getNumCategories()

Returns the number of different categories represented in this classifier.

getEndMarkedString

public static String getEndMarkedString(String line)

Returns the given line prepended with enough ' ' symbols to allow n-gram parsing. Also adds a '^' to the end so a terminal n-gram can be counted. For example, if n=4, "Proper Noun" would be returned as "   Proper Noun^" (three leading spaces). Before applying end-marking, trims whitespace from both ends of line.

getPureString

public static String getPureString(String word)

Prunes the first (cn-1) chars from the beginning of the word as well as the final char. Inverse of getEndMarkedString(java.lang.String).
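End-marking and its inverse can be sketched as a pair of static helpers. This is an illustration only: EndMarkSketch is a hypothetical class, and the constant values (order 4, ' ' as the start symbol, '^' as the end symbol) are taken from the examples in this page rather than from the actual constant field values.

```java
// Hypothetical illustration, not the PnpClassifier source.
final class EndMarkSketch {
    static final int CN = 4;               // assumed char n-gram order
    static final char START_SYMBOL = ' ';  // assumed from the documented example
    static final char END_SYMBOL = '^';    // assumed from the documented example

    /** Trims the line, prepends (CN-1) start symbols, appends one end symbol. */
    static String endMark(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < CN - 1; i++) {
            sb.append(START_SYMBOL);
        }
        return sb.append(line.trim()).append(END_SYMBOL).toString();
    }

    /** Inverse of endMark: drops the first (CN-1) chars and the final char. */
    static String pure(String marked) {
        return marked.substring(CN - 1, marked.length() - 1);
    }
}
```

Under these assumptions, endMark("Proper Noun") yields the phrase with three leading spaces and a trailing '^', and pure() applied to that result returns "Proper Noun" again.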

getWordLengths

public static List getWordLengths(String line)

Takes an end-marked string and returns a list of Integers for the length of each word. List includes (cn-1) starting 0's and one trailing 0. For example, the string "   Proper Noun^" (three leading spaces) would yield {0,0,0,6,4,0}.
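The word-length list can be reproduced with a short sketch. WordLengthSketch is a hypothetical class, CN = 4 is assumed from the example, and splitting on runs of spaces is an assumption about how words are delimited.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the PnpClassifier source.
final class WordLengthSketch {
    static final int CN = 4; // assumed char n-gram order

    /** Returns (CN-1) leading zeros, one length per word, and a trailing zero. */
    static List<Integer> wordLengths(String marked) {
        List<Integer> lengths = new ArrayList<>();
        for (int i = 0; i < CN - 1; i++) {
            lengths.add(0);
        }
        // Strip the start symbols and the end symbol before splitting.
        String body = marked.substring(CN - 1, marked.length() - 1);
        for (String word : body.split(" +")) {
            if (!word.isEmpty()) {
                lengths.add(word.length());
            }
        }
        lengths.add(0);
        return lengths;
    }
}
```

The zeros play the same role for the word-length n-gram model that the start and end symbols play for the char n-gram model.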

getWordsWithContext

public static List getWordsWithContext(String line)

Takes an end-marked string and returns a List of strings, one for each word in the line. Each word has (cn-1) prefix chars and one suffix char (either a space or '^') for context. Thus each word is sort of "end-marked". For example, the string "   Proper Noun^" (three leading spaces) would yield {"   Proper ","er Noun^"}.
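One way to extract each word together with its context is sketched below. WordsWithContextSketch is a hypothetical class, CN = 4 is assumed from the example, and the scanning loop is an assumption about how word boundaries are found.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the PnpClassifier source.
final class WordsWithContextSketch {
    static final int CN = 4; // assumed char n-gram order

    /** Each word is returned with (CN-1) preceding chars and one following char. */
    static List<String> wordsWithContext(String marked) {
        List<String> words = new ArrayList<>();
        int last = marked.length() - 1; // index of the end symbol
        int i = CN - 1;                 // skip the leading start symbols
        while (i < last) {
            if (marked.charAt(i) == ' ') {
                i++;
                continue;
            }
            int end = i;
            while (end < last && marked.charAt(end) != ' ') {
                end++;
            }
            // end now points at the suffix char: a space or the end symbol.
            words.add(marked.substring(i - (CN - 1), end + 1));
            i = end;
        }
        return words;
    }
}
```

Note that the context of a later word overlaps the previous word, so the second entry for the example begins with "er " taken from the end of "Proper".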

generateWord

public String generateWord(int wordLength,
                           String initialContext,
                           char finalChar,
                           int category)

Randomly generates a word of the given length, starting with the given initial context and ending with the given final char, by sampling from the char n-gram model of the given category. Since it's unfair to force early termination, this method generates words of the given length until one naturally occurs with the final char. The word length does not include the initial context or the final char. Returns the generated word without the initial context, but with the final char.
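The generate-until-it-fits strategy amounts to rejection sampling, which can be sketched as follows. Everything here is hypothetical: GenerateWordSketch, the sampler function that stands in for the category's char n-gram model, and the maxAttempts cap (the documented method describes no such limit). The sketch also does not check whether a space or end symbol appears inside the word.

```java
import java.util.Random;
import java.util.function.BiFunction;

// Hypothetical illustration of rejection sampling, not the PnpClassifier source.
final class GenerateWordSketch {

    /**
     * Samples wordLength chars plus one more, and accepts the candidate only
     * if that extra char is finalChar. The sampler maps (context, rand) to the
     * next char and stands in for the char n-gram model.
     */
    static String generateWord(int wordLength, String initialContext, char finalChar,
                               BiFunction<String, Random, Character> sampler,
                               Random rand, int maxAttempts) {
        int contextSize = initialContext.length();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            StringBuilder sb = new StringBuilder(initialContext);
            for (int i = 0; i <= wordLength; i++) {
                String context = sb.substring(sb.length() - contextSize);
                sb.append(sampler.apply(context, rand).charValue());
            }
            if (sb.charAt(sb.length() - 1) == finalChar) {
                // Drop the initial context but keep the final char.
                return sb.substring(contextSize);
            }
        }
        throw new IllegalStateException("no word ended with the final char");
    }
}
```

Rejecting whole candidates, rather than forcing the last char, keeps the accepted words distributed as the model would produce them conditioned on ending at that length.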

generateLine

public String generateLine(int category)

Generates a novel example of the given category, starting with (cn-1) start symbols and ending with an end symbol. First generates a word-lengths list, then generates a word for each length.

test

protected void test(String testFilename)
             throws FileNotFoundException,
                    IOException

Runs the classifier on each line in the given test file and prints out the category with the highest score.

Throws: FileNotFoundException, IOException

main

public static void main(String[] args)

Trains and tests a PnpClassifier on the passed-in files.

Usage: java PnpClassifier trainingFilename testFilename.

See Also: PnpClassifier(String trainingFilename), test(String testFilename)

Stanford NLP Group
