edu.stanford.nlp.annotation
Class TaggedStreamTokenizer

java.lang.Object
  |
  +--edu.stanford.nlp.annotation.TaggedStreamTokenizer

public class TaggedStreamTokenizer
extends Object

TaggedStreamTokenizer is similar to java.io.StreamTokenizer, except that it is better suited to deal with documents containing html-style tags. Note that TaggedStreamTokenizer is not a subclass of StreamTokenizer.

TaggedStreamTokenizer allows the distinction to be made between "target" text and "background" text through the use of target tags. For instance, consider the following snippet:

 <html>
 Cows eat <color>green</color> grass.  Aliens eat <color>green</color> slime.  I am an alien cow.
 </html>
 


By specifying that <color> is a target tag, the tokenizer would return both instances of "green" as type TT_TARGET_WORD, while the other words outside the scope of the color tags would be of type TT_BACKGROUND_WORD.

As in StreamTokenizer, a call to nextToken() after initializing the tokenizer returns a TT code, which is also stored in the public ttype variable; the token text will be stored in sval, and if the token is within the scope of some target tag, attr will contain the name of that tag. TT_EOF is returned in the case that the end of the stream is reached.
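
The token loop described above can be pictured with a small stand-in. The sketch below is not the real TaggedStreamTokenizer and does not use the JavaCC lexer; the class name TargetScopeSketch, its regular expression, and the integer values of the TT constants are assumptions made only for illustration. It supports a single active target tag and discards all other html.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mimics the ttype/sval/attr protocol described above.
// It is NOT the real TaggedStreamTokenizer; the constant values are made up.
class TargetScopeSketch {
    static final int TT_EOF = -1;             // assumed value
    static final int TT_TARGET_WORD = 1;      // assumed value
    static final int TT_BACKGROUND_WORD = 2;  // assumed value

    int ttype = -10;     // initial value taken from the ttype field description
    String sval = null;  // text of the word token just read
    String attr = null;  // name of the target tag in scope, or null

    private final Matcher m;
    private final String startTag;
    private final String endTag;

    TargetScopeSketch(String text, String startTag, String endTag) {
        // A token is either a tag (<...>) or a run of letters and digits.
        this.m = Pattern.compile("<[^<>]*>|[A-Za-z0-9]+").matcher(text);
        this.startTag = startTag;
        this.endTag = endTag;
    }

    int nextToken() {
        while (m.find()) {
            String tok = m.group();
            if (tok.equalsIgnoreCase(startTag)) {   // entering target scope
                attr = startTag.substring(1, startTag.length() - 1);
                continue;
            }
            if (tok.equalsIgnoreCase(endTag)) {     // leaving target scope
                attr = null;
                continue;
            }
            if (tok.startsWith("<")) continue;      // other html is discarded in this sketch
            sval = tok;
            ttype = (attr != null) ? TT_TARGET_WORD : TT_BACKGROUND_WORD;
            return ttype;
        }
        sval = null;
        ttype = TT_EOF;
        return ttype;
    }

    // Collects "tagName:word" for every target word, using the usual read loop.
    static List<String> targetWords(String text, String startTag, String endTag) {
        TargetScopeSketch t = new TargetScopeSketch(text, startTag, endTag);
        List<String> out = new ArrayList<>();
        while (t.nextToken() != TT_EOF) {
            if (t.ttype == TT_TARGET_WORD) out.add(t.attr + ":" + t.sval);
        }
        return out;
    }
}
```

Calling targetWords on the snippet above with "<color>" and "</color>" collects the two occurrences of "green"; every other word comes back as a background word and is skipped by the loop.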

It is probably wise to run some sort of HTML validator or scrubber on input files before passing them into the tokenizer. The tokenizer is somewhat forgiving when dealing with broken html, but it ignores details such as the spacing inside of tags or whether attribute values are quoted; depending on your task, you may want to standardize these.

TaggedStreamTokenizer is built around a lexer coded using JavaCC, in the file HtmlLexer.jj. The following characteristics of the tokenizer are mainly built into the lexer, so modifications to these will probably require modifications to both the tokenizer and lexer files.

Updates 5/12/02: Removed the tag length restriction. The '<' and '>' symbols are allowed to appear in quoted literals within an html tag. Runaway tags are explicitly terminated when a '<' is encountered before a '>'.

Updates 5/28/02: Added tokenization of email addresses. Abbreviations are tokenized to include the trailing period.
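
One way to picture the 5/12/02 tag-scanning rules is the sketch below. It is an illustration only, not the code in HtmlLexer.jj; the class TagScanSketch and the method tagEnd are hypothetical names, and the treatment of single and double quotes is an assumption based on the note above.

```java
// Hypothetical illustration of the tag-scanning rules listed above.
// The real behavior lives in the JavaCC grammar, which is not shown here.
class TagScanSketch {
    /**
     * Given text and the index of a '<', returns the index just past the tag.
     * '<' and '>' inside single- or double-quoted literals are skipped.
     * A '<' seen outside quotes before any '>' ends the runaway tag early.
     */
    static int tagEnd(String s, int open) {
        char quote = 0;  // 0 means "not inside a quoted literal"
        for (int i = open + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;   // closing quote
            } else if (c == '"' || c == '\'') {
                quote = c;                   // opening quote
            } else if (c == '>') {
                return i + 1;                // normal end of tag
            } else if (c == '<') {
                return i;                    // runaway tag: stop before the new '<'
            }
        }
        return s.length();                   // unterminated tag runs to end of input
    }
}
```

With this rule, a tag such as <b followed directly by <i> ends just before the second '<', so the <i> tag is still recognized on its own.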


Field Summary
 String attr
          This field contains the name of the active target tag that is currently in scope, or null if no tag is in scope.
 String sval
          This field contains a string giving the characters of the word token just read.
static int TT_BACKGROUND_HTML
          A constant indicating that a non-target-tag html token has been read that is outside the scope of any active tags
static int TT_BACKGROUND_INACTIVE_TAG
          A constant indicating that an inactive target tag has been read that is outside the scope of any active target tags
static int TT_BACKGROUND_WORD
          A constant indicating that a word token has been read that is outside the scope of any active tags
static int TT_EOF
          A constant indicating that the end of the stream has been read
static int TT_TARGET_HTML
          A constant indicating that a non-target-tag html token has been read that is within the scope of an active tag
static int TT_TARGET_INACTIVE_TAG
          A constant indicating that an inactive target tag has been read that is within the scope of an active tag
static int TT_TARGET_TAG
          A constant indicating that an active start target tag has been read
static int TT_TARGET_TAG_END
          A constant indicating that an active end target tag has been read
static int TT_TARGET_WORD
          A constant indicating that a word token has been read that is within the scope of an active tag
 int ttype
          After a call to the nextToken method, this field contains the type of the token just read.
 
Constructor Summary
TaggedStreamTokenizer(InputStream i)
          Create a tokenizer that parses the given input stream.
TaggedStreamTokenizer(Reader r)
          Create a tokenizer that parses the given character stream.
 
Method Summary
 void addKeeperTag(String tag)
          Specifies that a tag should be returned as a token, overriding the DiscardHtml setting.
 void addTarget(String start, String end)
          Adds an active target tag
 void addTarget(String start, String end, boolean active)
          Specifies which tags should be considered target tags.
 boolean getDiscardComments()
          Returns the DiscardComments setting.
 boolean getDiscardHtml()
          Returns the DiscardHtml setting.
 boolean getDiscardInactiveTags()
          Returns the DiscardInactiveTags setting.
 boolean getDiscardScript()
          Returns the DiscardScript setting.
 boolean getDiscardTargetTags()
          Returns the DiscardTargetTags setting.
 String getKeeperCharacters()
          Returns the keeper string.
static void main(String[] argv)
          Test the TaggedStreamTokenizer by passing in an html filename argument.
 int nextToken()
          Generates the next token from the input stream of this tokenizer.
 void reset()
          Causes the currently in-scope tag to be removed from scope.
 void setDiscardComments(boolean val)
          Determines whether or not html-style comments are discarded.
 void setDiscardHtml(boolean val)
          Determines whether or not html is discarded -- both html tags and escape characters such as &nbsp;.
 void setDiscardInactiveTags(boolean val)
          Determines whether or not inactive tags are discarded.
 void setDiscardScript(boolean val)
          Determines whether or not script, server, applet, and style (i.e., cascading style sheets) code is discarded.
 void setDiscardTargetTags(boolean val)
          Determines whether or not active target tags are discarded.
 void setKeeperCharacters(String keepers)
          Determines which (non-html) delimiters to return as tokens.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TT_EOF

public static final int TT_EOF
A constant indicating that the end of the stream has been read

See Also:
Constant Field Values

TT_TARGET_TAG

public static final int TT_TARGET_TAG
A constant indicating that an active start target tag has been read

See Also:
Constant Field Values

TT_TARGET_TAG_END

public static final int TT_TARGET_TAG_END
A constant indicating that an active end target tag has been read

See Also:
Constant Field Values

TT_TARGET_WORD

public static final int TT_TARGET_WORD
A constant indicating that a word token has been read that is within the scope of an active tag

See Also:
Constant Field Values

TT_TARGET_HTML

public static final int TT_TARGET_HTML
A constant indicating that a non-target-tag html token has been read that is within the scope of an active tag

See Also:
Constant Field Values

TT_TARGET_INACTIVE_TAG

public static final int TT_TARGET_INACTIVE_TAG
A constant indicating that an inactive target tag has been read that is within the scope of an active tag

See Also:
Constant Field Values

TT_BACKGROUND_WORD

public static final int TT_BACKGROUND_WORD
A constant indicating that a word token has been read that is outside the scope of any active tags

See Also:
Constant Field Values

TT_BACKGROUND_HTML

public static final int TT_BACKGROUND_HTML
A constant indicating that a non-target-tag html token has been read that is outside the scope of any active tags

See Also:
Constant Field Values

TT_BACKGROUND_INACTIVE_TAG

public static final int TT_BACKGROUND_INACTIVE_TAG
A constant indicating that an inactive target tag has been read that is outside the scope of any active target tags

See Also:
Constant Field Values

attr

public String attr
This field contains the name of the active target tag that is currently in scope, or null if no tag is in scope.


sval

public String sval
This field contains a string giving the characters of the word token just read. The initial value of this field is null.


ttype

public int ttype
After a call to the nextToken method, this field contains the type of the token just read. The initial value of this field is -10.

Constructor Detail

TaggedStreamTokenizer

public TaggedStreamTokenizer(InputStream i)
Create a tokenizer that parses the given input stream. The tokenizer is initialized to the following default state:

Parameters:
i - an InputStream object

TaggedStreamTokenizer

public TaggedStreamTokenizer(Reader r)
Create a tokenizer that parses the given character stream. The tokenizer is initialized to the following default state:

Parameters:
r - a Reader object

Method Detail

addTarget

public void addTarget(String start,
                      String end,
                      boolean active)
Specifies which tags should be considered target tags. A pair of start and end target tags marks "target" text, i.e., words that will be of type TT_TARGET_WORD, so long as the tags are "active." The tags themselves are of type TT_TARGET_TAG and TT_TARGET_TAG_END respectively; these are returned if setDiscardTargetTags is passed a value of false. Non-active tags can be specified to distinguish them from the naturally occurring html tags in the document, for filtering purposes. These tags are of type TT_TARGET_INACTIVE_TAG or TT_BACKGROUND_INACTIVE_TAG, depending on whether a target tag is in scope or not, and are returned if false was passed to setDiscardInactiveTags.

Parameters:
start - a string representing the start tag, e.g., "<start>"
end - a string representing the corresponding end tag, e.g., "<end>" or "</start>"
active - true indicates that the tag is active, and will be interpreted as marking target text.
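
The relationship between tag kinds, scope, and token types can be summarized as a small lookup. This is a sketch assembled from the field descriptions in this class; the enum Kind and the method typeName are hypothetical, and the real class reports int constants rather than strings.

```java
// Hypothetical summary of the token types described in this class's field list.
// The enum and method names are illustrative; the real class uses int constants.
class TokenTypeSketch {
    enum Kind { WORD, HTML, ACTIVE_START_TAG, ACTIVE_END_TAG, INACTIVE_TAG }

    /** Maps a token kind and the current scope state to the name of a TT constant. */
    static String typeName(Kind kind, boolean inTargetScope) {
        switch (kind) {
            case ACTIVE_START_TAG:
                return "TT_TARGET_TAG";
            case ACTIVE_END_TAG:
                return "TT_TARGET_TAG_END";
            case INACTIVE_TAG:
                return inTargetScope ? "TT_TARGET_INACTIVE_TAG" : "TT_BACKGROUND_INACTIVE_TAG";
            case HTML:
                return inTargetScope ? "TT_TARGET_HTML" : "TT_BACKGROUND_HTML";
            default:  // WORD
                return inTargetScope ? "TT_TARGET_WORD" : "TT_BACKGROUND_WORD";
        }
    }
}
```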

addTarget

public void addTarget(String start,
                      String end)
Adds an active target tag

Parameters:
start - a string representing the start tag
end - a string representing the corresponding end tag

addKeeperTag

public void addKeeperTag(String tag)
Specifies that a tag should be returned as a token, overriding the DiscardHtml setting. The tag should be specified by name, without angle brackets. All instances of the tag, starting or closing, with or without attributes, case-insensitive, are returned as type TT_BACKGROUND_HTML or TT_TARGET_HTML. Comment tags may also be specified; pass in "!--" to allow both "<!--" and "-->" tags to be returned as tokens. Note that whether the comment body is tokenized is controlled separately by setDiscardComments.
In the case of script, applet, server, and style tags, note that the DiscardScript setting can be true, while still returning these tags.

Parameters:
tag - a string representing the tag name to be returned
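
The matching rule described above (by name, case-insensitive, start or closing tag, with or without attributes) can be sketched as follows. KeeperTagSketch, nameOf, and isKeeper are hypothetical names used for illustration; this is not how the lexer itself extracts tag names.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of keeper-tag matching; not the library's implementation.
class KeeperTagSketch {
    private final Set<String> keepers = new HashSet<>();

    // Tag names are stored in lower case so that matching is case-insensitive.
    void addKeeperTag(String name) {
        keepers.add(name.toLowerCase(Locale.ROOT));
    }

    /** Extracts the tag name from raw tag text such as "<A HREF=x>", "</a>", "<!--" or "-->". */
    static String nameOf(String rawTag) {
        if (rawTag.startsWith("<!--") || rawTag.equals("-->")) return "!--";
        int i = 1;                                         // skip the leading '<'
        if (i < rawTag.length() && rawTag.charAt(i) == '/') i++;  // skip '/' of a closing tag
        int start = i;
        while (i < rawTag.length()) {
            char c = rawTag.charAt(i);
            if (Character.isWhitespace(c) || c == '>' || c == '/') break;
            i++;
        }
        return rawTag.substring(start, i).toLowerCase(Locale.ROOT);
    }

    boolean isKeeper(String rawTag) {
        return keepers.contains(nameOf(rawTag));
    }
}
```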

setKeeperCharacters

public void setKeeperCharacters(String keepers)
Determines which (non-html) delimiters to return as tokens. All non-word characters are treated as delimiters, including the ones in the keeper set. However, specifying a keeper set causes these delimiters to be returned as one-character tokens. Special cases: delimiting white space characters are treated as one token by the lexer for efficiency; therefore, no white space characters can be specified in the keeper set. To change this, the grammar file needs to be modified. Ellipses (...) are returned as a single token, as is any string of two or more periods; these are returned by specifying a single period in the keeper string.

Parameters:
keepers - a String containing the non-word characters to return
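
The effect of the keeper string can be sketched on plain text as below. This is a simplified stand-in, assuming that "word characters" are letters and digits; it ignores html, email addresses, and abbreviations, all of which the real lexer handles. The class KeeperCharSketch and its tokenize method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of keeper-character handling on plain text.
class KeeperCharSketch {
    static List<String> tokenize(String text, String keepers) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // A run of word characters forms one word token.
                int j = i;
                while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) j++;
                out.add(text.substring(i, j));
                i = j;
            } else if (c == '.') {
                // A run of periods is a single token, returned only if '.' is a keeper.
                int j = i;
                while (j < text.length() && text.charAt(j) == '.') j++;
                if (keepers.indexOf('.') >= 0) out.add(text.substring(i, j));
                i = j;
            } else {
                // Any other delimiter is returned as a one-character token if it is
                // a keeper; white space is never returned.
                if (!Character.isWhitespace(c) && keepers.indexOf(c) >= 0) {
                    out.add(String.valueOf(c));
                }
                i++;
            }
        }
        return out;
    }
}
```

For example, tokenizing "Wait... what?! ok." with the keeper string ".?" keeps the ellipsis, the question mark, and the final period, and silently drops the exclamation mark.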

getKeeperCharacters

public String getKeeperCharacters()
Returns the keeper string.

Returns:
a String containing the non-word characters that are legal tokens

setDiscardComments

public void setDiscardComments(boolean val)
Determines whether or not html-style comments are discarded. Passing in a value of false causes the start and end comment sequences to be considered html (and returned based on what was passed into setDiscardHtml); then the comment body is parsed. Because the lexer is customized to recognize entire comment spans, this process is not optimized; making it more efficient would require rewriting portions of the grammar.
Note that the leading "<!--" and trailing "-->" tags are considered html, separate from the comment body; therefore, passing true into setDiscardComments does not prevent those tags from being returned as tokens.

Parameters:
val - true indicates that comments are discarded.

getDiscardComments

public boolean getDiscardComments()
Returns the DiscardComments setting.

Returns:
true if comments are discarded

setDiscardScript

public void setDiscardScript(boolean val)
Determines whether or not script, server, applet, and style (i.e., cascading style sheets) code is discarded.

Parameters:
val - true indicates that script is discarded.

getDiscardScript

public boolean getDiscardScript()
Returns the DiscardScript setting.

Returns:
true if script is discarded

setDiscardInactiveTags

public void setDiscardInactiveTags(boolean val)
Determines whether or not inactive tags are discarded.

Parameters:
val - true indicates that inactive tags are discarded.

getDiscardInactiveTags

public boolean getDiscardInactiveTags()
Returns the DiscardInactiveTags setting.

Returns:
true if inactive tags are discarded

setDiscardHtml

public void setDiscardHtml(boolean val)
Determines whether or not html is discarded -- both html tags and escape characters such as &nbsp;. If true is passed in, it overrides whatever value DiscardScript has; i.e., if all html is discarded, then all javascript and applet code is discarded also. The same is not true for comments.
This setting is overridden for any tag name passed into addKeeperTag; i.e., tags of those names are always returned as tokens.

Parameters:
val - true indicates that html is discarded.
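
The precedence among the discard settings can be written out as three small predicates. This is a reading of the descriptions in this document, not code from the tokenizer; DiscardRulesSketch and its method names are hypothetical, and the field defaults here are simply Java's defaults, not the library's.

```java
// Hypothetical sketch of how the discard settings interact, as described above.
class DiscardRulesSketch {
    boolean discardHtml;      // Java default (false), not necessarily the library default
    boolean discardScript;
    boolean discardComments;

    /** Keeper tags are always returned; other html tags only when html is not discarded. */
    boolean returnsHtmlTag(boolean isKeeperTag) {
        return isKeeperTag || !discardHtml;
    }

    /** Discarding all html also discards script code, whatever DiscardScript says. */
    boolean returnsScriptBody() {
        return !discardHtml && !discardScript;
    }

    /** Comment bodies depend only on DiscardComments, not on DiscardHtml. */
    boolean returnsCommentBody() {
        return !discardComments;
    }
}
```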

getDiscardHtml

public boolean getDiscardHtml()
Returns the DiscardHtml setting.

Returns:
true if html is discarded

setDiscardTargetTags

public void setDiscardTargetTags(boolean val)
Determines whether or not active target tags are discarded.

Parameters:
val - true indicates that active target tags are discarded.

getDiscardTargetTags

public boolean getDiscardTargetTags()
Returns the DiscardTargetTags setting.

Returns:
true if active target tags are discarded

reset

public void reset()
Causes the currently in-scope tag to be removed from scope. Thus, if the next token is a plain word, it will be of type TT_BACKGROUND_WORD.


nextToken

public int nextToken()
Generates the next token from the input stream of this tokenizer. The type of the token is returned in the ttype field. The string representation of the token is in the sval field, and an optional attribute tag is in the attr field.

Returns:
the value of the ttype field

main

public static void main(String[] argv)
                 throws Exception
Test the TaggedStreamTokenizer by passing in an html filename argument.

Throws:
Exception


Stanford NLP Group