edu.stanford.nlp.trees
Class PennSentenceNormalizer

java.lang.Object
  |
  +--edu.stanford.nlp.trees.SentenceNormalizer
        |
        +--edu.stanford.nlp.trees.PennSentenceNormalizer
Direct Known Subclasses:
PennSentenceMrgNormalizer

public class PennSentenceNormalizer
extends SentenceNormalizer

A class for normalizing sentences read from the Penn tag directory. It knows about the quirks of Penn Treebank POS files, such as runs of equals signs and square brackets. It also interns strings. A Singleton.


Constructor Summary
PennSentenceNormalizer()
          Constructs a PennSentenceNormalizer object.
PennSentenceNormalizer(boolean divideOffTags, char tagDivider)
          Constructs a PennSentenceNormalizer object.
PennSentenceNormalizer(boolean divideOffTags, char tagDivider, boolean unescape, char escapeChar)
          Constructs a PennSentenceNormalizer object.
 
Method Summary
 boolean endSentenceToken(String token, String prev, String next)
          Returns true if this token represents the end of a sentence.
 Sentence normalizeSentence(Sentence sent, LabelFactory lf)
          Normalizes a sentence. This method assumes that the argument it is passed is the whole (linguistic) Sentence.
 String normalizeString(String word)
          Normalizes a word string as read (and possibly interns it).
 
Methods inherited from class edu.stanford.nlp.trees.SentenceNormalizer
eolIsSentenceEnd
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PennSentenceNormalizer

public PennSentenceNormalizer()
Constructs a PennSentenceNormalizer object.


PennSentenceNormalizer

public PennSentenceNormalizer(boolean divideOffTags,
                              char tagDivider)
Constructs a PennSentenceNormalizer object.

Parameters:
divideOffTags - true iff an unescaped tagDivider and all characters to the right of it should be cut off from words
tagDivider - The character that separates words from their tags

PennSentenceNormalizer

public PennSentenceNormalizer(boolean divideOffTags,
                              char tagDivider,
                              boolean unescape,
                              char escapeChar)
Constructs a PennSentenceNormalizer object.

Parameters:
divideOffTags - true iff an unescaped tagDivider and all characters to the right of it should be cut off from words
tagDivider - The character that separates words from their tags
unescape - true if words should be unescaped; unescaping is not implemented at present
escapeChar - The character used to escape a following character
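
The interaction of divideOffTags, tagDivider and escapeChar can be illustrated with a minimal standalone sketch. It does not use this class; TagDividerSketch and divideOffTag are hypothetical names, and the sketch assumes that the cut happens at the first unescaped divider and that escape characters are left in place (since unescaping is not implemented).

```java
// Hypothetical illustration only; not part of edu.stanford.nlp.trees.
final class TagDividerSketch {

    /**
     * Returns the token with the first unescaped tagDivider, and everything
     * to the right of it, removed. Escape characters are kept as-is.
     */
    static String divideOffTag(String token, char tagDivider, char escapeChar) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == escapeChar) {
                i++; // skip the escaped character; it cannot act as a divider
            } else if (c == tagDivider) {
                return token.substring(0, i);
            }
        }
        return token; // no unescaped divider found
    }

    public static void main(String[] args) {
        System.out.println(divideOffTag("Vinken/NNP", '/', '\\')); // prints Vinken
        System.out.println(divideOffTag("1\\/2/CD", '/', '\\'));   // prints 1\/2
    }
}
```

In the second call the slash inside the fraction is escaped, so only the final slash is treated as the word/tag boundary.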

Method Detail

normalizeString

public String normalizeString(String word)
Description copied from class: SentenceNormalizer
Normalizes a word string as read (and possibly interns it).

Overrides:
normalizeString in class SentenceNormalizer
Parameters:
word - The word to normalize
Returns:
The normalized form
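
As a rough illustration of what interning buys, the sketch below shows that two separately allocated strings become the same object once interned. InternSketch and normalizeAndIntern are hypothetical names, and trimming whitespace is only a stand-in for whatever normalization this class actually performs.

```java
// Hypothetical illustration only; not part of edu.stanford.nlp.trees.
final class InternSketch {

    /** Stand-in normalization (trim) followed by String.intern(). */
    static String normalizeAndIntern(String word) {
        return word.trim().intern();
    }

    public static void main(String[] args) {
        String a = normalizeAndIntern(new String("the"));
        String b = normalizeAndIntern(new String(" the "));
        // Both results refer to the single pooled "the" instance.
        System.out.println(a == b); // prints true
    }
}
```

Because interned strings share one instance, repeated word types in a large corpus cost one String object each, and callers can compare them by reference.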

normalizeSentence

public Sentence normalizeSentence(Sentence sent,
                                  LabelFactory lf)
Normalizes a sentence. This method assumes that the argument it is passed is the whole (linguistic) Sentence. It is normally implemented as a List-walking routine.

Overrides:
normalizeSentence in class SentenceNormalizer
Parameters:
sent - The sentence to be normalized
lf - the LabelFactory to create new Labels (if needed)
Returns:
The normalized Sentence
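
A List-walking routine of this kind might look like the sketch below. It works on a plain List of strings rather than on Sentence and LabelFactory, and it assumes that bracket tokens and runs of equals signs are the items to drop; ListWalkSketch, isArtifact and normalizeTokens are hypothetical names, not methods of this class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of edu.stanford.nlp.trees.
final class ListWalkSketch {

    /** Assumed artifacts: chunk brackets and lines made only of equals signs. */
    static boolean isArtifact(String token) {
        return token.equals("[") || token.equals("]") || token.matches("=+");
    }

    /** Walks the list once, dropping artifacts and interning what remains. */
    static List<String> normalizeTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!isArtifact(token)) {
                out.add(token.intern());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("=====", "[", "Pierre/NNP", "Vinken/NNP", "]", ",/,");
        System.out.println(normalizeTokens(raw));
    }
}
```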

endSentenceToken

public boolean endSentenceToken(String token,
                                String prev,
                                String next)
Returns true if this token represents the end of a sentence. This method perhaps does not belong in this class, but it seemed a good place for it, since other source-specific handling lives here. It is called on the token as read, prior to normalization. This is more useful, because it can detect things that are deleted during the normalization process.

Overrides:
endSentenceToken in class SentenceNormalizer
Parameters:
token - The String to be checked
prev - The previous token
next - The next token (lookahead)
Returns:
true if this token is a sentence end
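
The value of testing the raw token can be sketched as follows. The rules here are assumptions made for illustration, not the documented behavior of this class: a run of equals signs counts as a sentence end (an item that normalization would delete), and a tagged terminal punctuation mark counts as one unless the lookahead token is a closing quote. SentenceEndSketch is a hypothetical name, and prev is accepted only to mirror the signature.

```java
// Hypothetical illustration only; not part of edu.stanford.nlp.trees.
final class SentenceEndSketch {

    /** Decides on the raw, pre-normalization token; prev is unused in this sketch. */
    static boolean endSentenceToken(String token, String prev, String next) {
        if (token.matches("=+")) {
            return true; // assumed separator line, visible only before normalization
        }
        boolean terminal = token.equals("./.") || token.equals("?/.") || token.equals("!/.");
        // Assumed lookahead rule: a following closing quote stays in the same sentence.
        return terminal && !"''/''".equals(next);
    }

    public static void main(String[] args) {
        System.out.println(endSentenceToken("=====", "./.", "["));         // prints true
        System.out.println(endSentenceToken("./.", "said/VBD", "''/''")); // prints false
        System.out.println(endSentenceToken("./.", "said/VBD", null));    // prints true
    }
}
```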


Stanford NLP Group