TaggedStreamTokenizer (Stanford JavaNLP API)

edu.stanford.nlp.annotation
Class TaggedStreamTokenizer

java.lang.Object
  |
  +--edu.stanford.nlp.annotation.TaggedStreamTokenizer

public class TaggedStreamTokenizer
extends Object

TaggedStreamTokenizer is similar to java.io.StreamTokenizer, except that it is better suited to deal with documents containing html-style tags. Note that TaggedStreamTokenizer is not a subclass of StreamTokenizer.

TaggedStreamTokenizer allows the distinction to be made between "target" text and "background" text through the use of target tags. For instance, consider the following snippet:

 <html>
 Cows eat <color>green</color> grass.  Aliens eat <color>green</color> slime.  I am an alien cow.
 </html>

By specifying that <color> is a target tag, the tokenizer would return both instances of "green" as type TT_TARGET_WORD, while the other words outside the scope of the color tags would be of type TT_BACKGROUND_WORD.
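The target/background distinction can be illustrated with a small stand-alone sketch. This is a toy model, not the TaggedStreamTokenizer implementation or its API: the class name TargetScopeSketch, the method classify, and the string labels are hypothetical, and words are simplified to runs of ASCII letters and digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: NOT the TaggedStreamTokenizer implementation or its API.
final class TargetScopeSketch {
    // A token is either a whole tag such as <color>, or a run of ASCII letters/digits.
    private static final Pattern TOKEN = Pattern.compile("<[^<>]*>|[A-Za-z0-9]+");

    // Labels each word as target or background, depending on whether it
    // appears between startTag and endTag.
    static List<String> classify(String text, String startTag, String endTag) {
        List<String> labelled = new ArrayList<>();
        boolean inTarget = false;
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String tok = m.group();
            if (tok.equals(startTag)) {
                inTarget = true;                  // entering target scope
            } else if (tok.equals(endTag)) {
                inTarget = false;                 // leaving target scope
            } else if (!tok.startsWith("<")) {
                labelled.add((inTarget ? "TARGET_WORD " : "BACKGROUND_WORD ") + tok);
            }
            // Any other tag is silently dropped in this sketch.
        }
        return labelled;
    }
}
```

Calling classify("Cows eat <color>green</color> grass.", "<color>", "</color>") labels "green" as a target word and "Cows", "eat", and "grass" as background words.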

As in StreamTokenizer, a call to nextToken() after initializing the tokenizer returns a TT code, which is also stored in the public ttype variable; the token text will be stored in sval, and if the token is within the scope of some target tag, attr will contain the name of that tag. TT_EOF is returned in the case that the end of the stream is reached.

It is probably wise to run some sort of HTML validator or scrubber on input files before passing them into the tokenizer. The tokenizer is somewhat forgiving of broken html, but it ignores details such as the spacing inside of tags or whether quotes are used for attribute names; depending on your task, you may want to standardize these.

TaggedStreamTokenizer is built around a lexer coded using JavaCC, in the file HtmlLexer.jj. The following characteristics of the tokenizer are mainly built into the lexer, so modifications to these will probably require modifications to both the tokenizer and lexer files.

HTML tags must be well-formed, starting with a '<' and ending with a '>'. A tag is automatically closed off with a '>' if a '<' is encountered before a closing '>' -- this is new for this version. Tokenization should properly handle quoted literals inside of tags that include '<' or '>' characters.
There is no maximum length imposed for a tag.
HTML escape sequences must be well-formed to be recognized as html. Examples of well-formed escape sequences include &nbsp; and &#160;. Malformed escape sequences include &123; and &lt.
HTML comments must start with "<!--" and end with "-->".
Words can be single alphanumeric characters, sequences of alphanumeric characters (with exceptions, below), or single non-alphanumeric characters (which by default are discarded). The rules defining legal words are based on common patterns in American English language and CS jargon. They include:
- A word can include any number of '-' or '_' characters, with no restrictions placed on where these appear in the word, so long as no two such characters appear in sequence, or at the beginning or end of a word.
- A word can include at most one apostrophe, so long as it is flanked by alphanumeric characters.
- A word can include one or more periods, each of which must be flanked by at least one alphanumeric character. A period may also occur at the beginning of a word. Thus, ellipses are not included in words, but IP addresses and domain names are each considered one word.
- A word can end in a period if it is a single capital-letter abbreviation, or if the word is an abbreviation with more than one period (e.g., B.S.), or if the word is a recognized abbreviation (e.g., St., Mr., Calif.).
- A word can be an email address
- A sequence of digits only can include commas, but in this case they must constitute a well-formed number (like 34,456 but not 23,67)
- A sequence of digits can optionally have one or both of '-' and '%' at the beginning or end, respectively
- A number can include a '$' at the beginning, as long as neither '-' nor '%' appears
- A number can be in time format (hh:mm:ss or hh:mm or h:mm, etc), though no attempt is made at verifying that it represents a real time.
There is no support for multilingual character sets; accented letters, etc, are treated as delimiters.
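A few of the rules above can be approximated with regular expressions. The sketch below is only an illustration under stated assumptions: the class WordRuleSketch and its method names are hypothetical, the patterns are not taken from HtmlLexer.jj, and the real grammar may differ in edge cases (for example, the time pattern assumes one or two leading digits).

```java
import java.util.regex.Pattern;

// Approximations of some documented rules; NOT the patterns used in HtmlLexer.jj.
final class WordRuleSketch {
    // Escape sequence: '&', then a name or '#' plus digits, then ';'.
    private static final Pattern ESCAPE =
            Pattern.compile("&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);");
    // '-' or '_' allowed inside a word, never doubled, never leading or trailing.
    private static final Pattern HYPHENATED =
            Pattern.compile("[A-Za-z0-9]+([-_][A-Za-z0-9]+)*");
    // Comma-grouped number: 1-3 leading digits, then groups of exactly 3 digits.
    private static final Pattern COMMA_NUMBER =
            Pattern.compile("[0-9]{1,3}(,[0-9]{3})+");
    // Time format h:mm, hh:mm, or hh:mm:ss; values are not range-checked.
    private static final Pattern TIME =
            Pattern.compile("[0-9]{1,2}:[0-9]{2}(:[0-9]{2})?");

    static boolean isWellFormedEscape(String s) { return ESCAPE.matcher(s).matches(); }
    static boolean isHyphenatedWord(String s)   { return HYPHENATED.matcher(s).matches(); }
    static boolean isCommaNumber(String s)      { return COMMA_NUMBER.matcher(s).matches(); }
    static boolean isTime(String s)             { return TIME.matcher(s).matches(); }
}
```

Under these patterns, 34,456 is accepted as a number and 23,67 is rejected, &123; and &lt are rejected as escape sequences, and 99:99 is accepted as a time because no attempt is made to verify that it represents a real time.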

Updates 5/12/02: Removed the tag length restriction; '<' and '>' symbols are allowed to appear in quoted literals within an html tag; runaway tags are explicitly terminated when a '<' is encountered before a '>'.

Updates 5/28/02: Added tokenization of email addresses; abbreviations are tokenized to include the trailing period.

Field Summary

String attr
          This field contains the name of the active target tag that is currently in scope, or null if no tag is in scope.

String sval
          This field contains a string giving the characters of the word token just read.

static int TT_BACKGROUND_HTML
          A constant indicating that a non-target-tag html token has been read that is outside the scope of any active tags

static int TT_BACKGROUND_INACTIVE_TAG
          A constant indicating that an inactive target tag has been read that is outside the scope of any active target tags

static int TT_BACKGROUND_WORD
          A constant indicating that a word token has been read that is outside the scope of any active tags

static int TT_EOF
          A constant indicating that the end of the stream has been read

static int TT_TARGET_HTML
          A constant indicating that a non-target-tag html token has been read that is within the scope of an active tag

static int TT_TARGET_INACTIVE_TAG
          A constant indicating that an inactive target tag has been read that is within the scope of an active tag

static int TT_TARGET_TAG
          A constant indicating that an active start target tag has been read

static int TT_TARGET_TAG_END
          A constant indicating that an active end target tag has been read

static int TT_TARGET_WORD
          A constant indicating that a word token has been read that is within the scope of an active tag

int ttype
          After a call to the nextToken method, this field contains the type of the token just read.

Constructor Summary

TaggedStreamTokenizer(InputStream i)
          Create a tokenizer that parses the given character stream.

TaggedStreamTokenizer(Reader r)
          Create a tokenizer that parses the given character stream.

Method Summary

void addKeeperTag(String tag)
          Specifies that a tag should be returned as a token, overriding the DiscardHtml setting.

void addTarget(String start, String end)
          Adds an active target tag

void addTarget(String start, String end, boolean active)
          Specifies which tags should be considered target tags.

boolean getDiscardComments()
          Returns the DiscardComments setting.

boolean getDiscardHtml()
          Returns the DiscardHtml setting.

boolean getDiscardInactiveTags()
          Returns the DiscardInactiveTags setting.

boolean getDiscardScript()
          Returns the DiscardScript setting.

boolean getDiscardTargetTags()
          Returns the DiscardTargetTags setting.

String getKeeperCharacters()
          Returns the keeper string.

static void main(String[] argv)
          Test the TaggedStreamTokenizer by passing in an html filename argument.

int nextToken()
          Generates the next token from the input stream of this tokenizer.

void reset()
          Causes the currently in-scope tag to be removed from scope.

void setDiscardComments(boolean val)
          Determines whether or not html-style comments are discarded.

void setDiscardHtml(boolean val)
          Determines whether or not html is discarded -- both html tags and escape characters such as &nbsp;.

void setDiscardInactiveTags(boolean val)
          Determines whether or not inactive tags are discarded.

void setDiscardScript(boolean val)
          Determines whether or not script, server, applet, and style (i.e., cascading style sheets) code is discarded.

void setDiscardTargetTags(boolean val)
          Determines whether or not active target tags are discarded.

void setKeeperCharacters(String keepers)
          Determines which (non-html) delimiters to return as tokens.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

TT_EOF

public static final int TT_EOF

A constant indicating that the end of the stream has been read

See Also: Constant Field Values

TT_TARGET_TAG

public static final int TT_TARGET_TAG

A constant indicating that an active start target tag has been read

See Also: Constant Field Values

TT_TARGET_TAG_END

public static final int TT_TARGET_TAG_END

A constant indicating that an active end target tag has been read

See Also: Constant Field Values

TT_TARGET_WORD

public static final int TT_TARGET_WORD

A constant indicating that a word token has been read that is within the scope of an active tag

See Also: Constant Field Values

TT_TARGET_HTML

public static final int TT_TARGET_HTML

A constant indicating that a non-target-tag html token has been read that is within the scope of an active tag

See Also: Constant Field Values

TT_TARGET_INACTIVE_TAG

public static final int TT_TARGET_INACTIVE_TAG

A constant indicating that an inactive target tag has been read that is within the scope of an active tag

See Also: Constant Field Values

TT_BACKGROUND_WORD

public static final int TT_BACKGROUND_WORD

A constant indicating that a word token has been read that is outside the scope of any active tags

See Also: Constant Field Values

TT_BACKGROUND_HTML

public static final int TT_BACKGROUND_HTML

A constant indicating that a non-target-tag html token has been read that is outside the scope of any active tags

See Also: Constant Field Values

TT_BACKGROUND_INACTIVE_TAG

public static final int TT_BACKGROUND_INACTIVE_TAG

A constant indicating that an inactive target tag has been read that is outside the scope of any active target tags

See Also: Constant Field Values

attr

public String attr

This field contains the name of the active target tag that is currently in scope, or null if no tag is in scope.

sval

public String sval

This field contains a string giving the characters of the word token just read. The initial value of this field is null.

ttype

public int ttype

After a call to the nextToken method, this field contains the type of the token just read. The initial value of this field is -10.

Constructor Detail

TaggedStreamTokenizer

public TaggedStreamTokenizer(InputStream i)

Create a tokenizer that parses the given character stream. The tokenizer is initialized to the following default state:

All tags (active or inactive) are discarded
All html is discarded
All non-word characters are discarded

Parameters:

i - an InputStream object

TaggedStreamTokenizer

public TaggedStreamTokenizer(Reader r)

Create a tokenizer that parses the given character stream. The tokenizer is initialized to the following default state:

All tags (active or inactive) are discarded
All html is discarded
All non-word characters are discarded

Parameters:

r - a Reader object

Method Detail

addTarget

public void addTarget(String start,
                      String end,
                      boolean active)

Specifies which tags should be considered target tags. A pair of start and end target tags mark "target" text, i.e., words that will be of type TT_TARGET_WORD, so long as the tags are "active." The tags themselves are of type TT_TARGET_TAG and TT_TARGET_TAG_END respectively; these are returned if setDiscardTargetTags is passed a value of false. Non-active tags can be specified to distinguish them from the naturally occurring html tags in the document, for filtering purposes. These tags are of type TT_TARGET_INACTIVE_TAG or TT_BACKGROUND_INACTIVE_TAG, depending on whether a target tag is in scope or not, and are returned if false was passed to setDiscardInactiveTags.

Parameters: start - a string representing the start tag, e.g., "<start>"; end - a string representing the corresponding end tag, e.g., "<end>" or "</start>"; active - true indicates that the tag is active, and will be interpreted as marking target text.

addTarget

public void addTarget(String start,
                      String end)

Adds an active target tag

Parameters: start - a string representing the start tag; end - a string representing the corresponding end tag

addKeeperTag

public void addKeeperTag(String tag)

Specifies that a tag should be returned as a token, overriding the DiscardHtml setting. The tag should be specified by name, without angle brackets. All instances of the tag, starting or closing, with or without attributes, case-insensitive, are returned as type TT_BACKGROUND_HTML or TT_TARGET_HTML. Comment tags may also be specified; pass in "!--" to allow both "<!--" and "-->" tags to be returned as tokens. Note that whether the comment body is tokenized needs to be set in setDiscardComments.
In the case of script, applet, server, and style tags, note that the DiscardScript setting can be true, while still returning these tags.

Parameters: tag - a string representing the tag name to be returned

setKeeperCharacters

public void setKeeperCharacters(String keepers)

Determines which (non-html) delimiters to return as tokens. All non-word characters are treated as delimiters, including the ones in the keeper set. However, specifying a keeper set causes these delimiters to be returned as one-character tokens. Special cases: delimiting white space characters are treated as one token by the lexer for efficiency; therefore, no white space characters can be specified in the keeper set. To change this, the grammar file needs to be modified. Ellipses (...) are returned as a single token, as is any string of two or more periods; these are returned by specifying a single period in the keeper string.

Parameters: keepers - a String containing the non-word characters to return
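
The effect of a keeper string can be pictured with a simplified stand-alone model. This is not the tokenizer's implementation: the class KeeperSketch and its tokenize method are hypothetical, words are reduced to runs of ASCII letters and digits (ignoring the fuller word rules), and only the documented behaviors of keeper delimiters, skipped white space, and period runs are modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model of the keeper-character behavior; NOT the real lexer.
final class KeeperSketch {
    // Group 1: a word; group 2: a run of two or more periods; group 3: any
    // other single non-white-space character. White space never matches.
    private static final Pattern TOKEN =
            Pattern.compile("([A-Za-z0-9]+)|(\\.{2,})|(\\S)");

    static List<String> tokenize(String text, String keepers) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String tok = m.group();
            if (m.group(1) != null) {
                tokens.add(tok);                       // words are always returned
            } else if (keepers.indexOf(tok.charAt(0)) >= 0) {
                tokens.add(tok);                       // delimiter is in the keeper set
            }
            // Delimiters outside the keeper set are discarded.
        }
        return tokens;
    }
}
```

With an empty keeper string, tokenize("Wait... what?", "") yields only the two words; with the keeper string ".?", the ellipsis comes back as one token and the question mark as another.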

getKeeperCharacters

public String getKeeperCharacters()

Returns the keeper string.

Returns: a String containing the non-word characters that are legal tokens

setDiscardComments

public void setDiscardComments(boolean val)

Determines whether or not html-style comments are discarded. Passing in a value of false causes the start and end comment sequences to be considered html (and returned based on what was passed into setDiscardHtml); then the comment body is parsed. Because the lexer is customized to recognize entire comment spans, this process is not optimized; making it more efficient would require rewriting portions of the grammar.
Note that the leading "<!--" and trailing "-->" tags are considered html, separate from the comment body; therefore, passing true into setDiscardComments does not prevent those tags from being returned as tokens.

Parameters: val - true indicates that comments are discarded.

getDiscardComments

public boolean getDiscardComments()

Returns the DiscardComments setting.

Returns: true if comments are discarded

setDiscardScript

public void setDiscardScript(boolean val)

Determines whether or not script, server, applet, and style (i.e., cascading style sheets) code is discarded.

Parameters: val - true indicates that script is discarded.

getDiscardScript

public boolean getDiscardScript()

Returns the DiscardScript setting.

Returns: true if script is discarded

setDiscardInactiveTags

public void setDiscardInactiveTags(boolean val)

Determines whether or not inactive tags are discarded.

Parameters: val - true indicates that inactive tags are discarded.

getDiscardInactiveTags

public boolean getDiscardInactiveTags()

Returns the DiscardInactiveTags setting.

Returns: true if inactive tags are discarded

setDiscardHtml

public void setDiscardHtml(boolean val)

Determines whether or not html is discarded -- both html tags and escape characters such as &nbsp;. If true is passed in, it overrides whatever value DiscardScript has; i.e., if all html is discarded, then all javascript and applet code is discarded also. The same is not true for comments.
This setting is overridden for any tag name passed into addKeeperTag -- i.e., tags of those names are always returned as tokens.

Parameters: val - true indicates that html is discarded.
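
The way these settings interact can be summarized as a small decision model. It is a sketch of the documented precedence only, not code from the tokenizer: the class DiscardPolicySketch and its members are hypothetical names, and it covers ordinary html tags and script-style code but not target tags or comments.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Models the documented precedence between DiscardHtml, DiscardScript,
// and keeper tags; NOT the tokenizer's own code.
final class DiscardPolicySketch {
    private final Set<String> keeperTags = new HashSet<>();
    boolean discardHtml;
    boolean discardScript;

    DiscardPolicySketch(boolean discardHtml, boolean discardScript) {
        this.discardHtml = discardHtml;
        this.discardScript = discardScript;
    }

    // Keeper tags are matched by name, case-insensitively.
    void addKeeperTag(String name) {
        keeperTags.add(name.toLowerCase(Locale.ROOT));
    }

    // A keeper tag is always returned; any other html tag only if html is kept.
    boolean tagIsReturned(String name) {
        return keeperTags.contains(name.toLowerCase(Locale.ROOT)) || !discardHtml;
    }

    // Discarding all html implies discarding script code as well.
    boolean scriptCodeIsDiscarded() {
        return discardHtml || discardScript;
    }
}
```

In this model, a keeper tag such as "script" is still returned when discardHtml is true, even though the script code between the tags is discarded.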

getDiscardHtml

public boolean getDiscardHtml()

Returns the DiscardHtml setting.

Returns: true if html is discarded

setDiscardTargetTags

public void setDiscardTargetTags(boolean val)

Determines whether or not active target tags are discarded.

Parameters: val - true indicates that active target tags are discarded.

getDiscardTargetTags

public boolean getDiscardTargetTags()

Returns the DiscardTargetTags setting.

Returns: true if active target tags are discarded

reset

public void reset()

Causes the currently in-scope tag to be removed from scope. Thus, if the next token is a plain word, it will be of type TT_BACKGROUND_WORD.

nextToken

public int nextToken()

Generates the next token from the input stream of this tokenizer. The type of the token is returned in the ttype field. The string representation of the token is in the sval field, and an optional attribute tag is in the attr field.

Returns: the value of the ttype field

main

public static void main(String[] argv)
                 throws Exception

Test the TaggedStreamTokenizer by passing in an html filename argument.

Throws: Exception

Stanford NLP Group
