java.lang.Object
  +--edu.stanford.nlp.annotation.TaggedStreamTokenizer
TaggedStreamTokenizer is similar to java.io.StreamTokenizer, except that it is better suited to deal with documents containing html-style tags. Note that TaggedStreamTokenizer is not a subclass of StreamTokenizer.
TaggedStreamTokenizer allows the distinction to be made between "target" text and "background" text through the use of target tags. For instance, consider the following snippet:
<html> Cows eat <color>green</color> grass. Aliens eat <color>green</color> slime. I am an alien cow. </pre> </html>
As with StreamTokenizer, a call to nextToken() after initializing the tokenizer returns a TT code, which is also stored in the public ttype variable; the token text is stored in sval, and if the token is within the scope of some target tag, attr contains the name of that tag. TT_EOF is returned when the end of the stream is reached.
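The read loop described above can be sketched as follows. Because the real class is not on the classpath here, the sketch uses a hypothetical stand-in, MiniTokenizer, that mimics only the documented ttype / sval / attr contract for word tokens; its constant values, its regex-based notion of a "word", and the helper targetWords are assumptions made for illustration, not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TargetWordDemo {

    /** Hypothetical stand-in; NOT the real TaggedStreamTokenizer. */
    static class MiniTokenizer {
        // Constant values are arbitrary choices for this sketch.
        static final int TT_EOF = -1;
        static final int TT_TARGET_WORD = 1;
        static final int TT_BACKGROUND_WORD = 2;

        int ttype;     // type of the token just read
        String sval;   // text of the token just read
        String attr;   // name of the target tag in scope, or null

        private final Matcher matcher;
        private final String targetName;
        private String inScope = null;

        MiniTokenizer(String text, String targetName) {
            // Assumption: a tag looks like <name> or </name>; a word is a run of letters/digits.
            this.matcher = Pattern.compile("<(/?)([A-Za-z]+)>|([A-Za-z0-9]+)").matcher(text);
            this.targetName = targetName;
        }

        int nextToken() {
            while (matcher.find()) {
                if (matcher.group(3) != null) {            // a word token
                    sval = matcher.group(3);
                    attr = inScope;
                    ttype = (inScope != null) ? TT_TARGET_WORD : TT_BACKGROUND_WORD;
                    return ttype;
                }
                // A tag: target tags open or close the scope; all tags are discarded here.
                if (matcher.group(2).equals(targetName)) {
                    inScope = matcher.group(1).isEmpty() ? targetName : null;
                }
            }
            sval = null;
            attr = null;
            ttype = TT_EOF;
            return ttype;
        }
    }

    /** Collects "tag:word" for every word inside the given target tag. */
    static List<String> targetWords(String text, String tag) {
        MiniTokenizer tok = new MiniTokenizer(text, tag);
        List<String> out = new ArrayList<>();
        while (tok.nextToken() != MiniTokenizer.TT_EOF) {
            if (tok.ttype == MiniTokenizer.TT_TARGET_WORD) {
                out.add(tok.attr + ":" + tok.sval);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "<html> Cows eat <color>green</color> grass. "
                   + "Aliens eat <color>green</color> slime. I am an alien cow. </html>";
        System.out.println(targetWords(doc, "color"));
    }
}
```

With the real class, the same loop shape would presumably apply: register the tag with addTarget, call nextToken() until TT_EOF, and inspect ttype, sval, and attr after each call.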
TaggedStreamTokenizer is built around a lexer coded using JavaCC, in the file HtmlLexer.jj. The following characteristics of the tokenizer are mainly built into the lexer, so modifications to these will probably require modifications to both the tokenizer and lexer files.

and
. Mal-formed escape sequences include
&123;
and <
.
Field Summary
String | attr | This field contains the name of the active target tag that is currently in scope, or null if no tag is in scope.
String | sval | This field contains a string giving the characters of the word token just read.
static int | TT_BACKGROUND_HTML | A constant indicating that a non-target-tag html token has been read that is outside the scope of any active tags.
static int | TT_BACKGROUND_INACTIVE_TAG | A constant indicating that an inactive target tag has been read that is outside the scope of any active target tags.
static int | TT_BACKGROUND_WORD | A constant indicating that a word token has been read that is outside the scope of any active tags.
static int | TT_EOF | A constant indicating that the end of the stream has been read.
static int | TT_TARGET_HTML | A constant indicating that a non-target-tag html token has been read that is within the scope of an active tag.
static int | TT_TARGET_INACTIVE_TAG | A constant indicating that an inactive target tag has been read that is within the scope of an active tag.
static int | TT_TARGET_TAG | A constant indicating that an active start target tag has been read.
static int | TT_TARGET_TAG_END | A constant indicating that an active end target tag has been read.
static int | TT_TARGET_WORD | A constant indicating that a word token has been read that is within the scope of an active tag.
int | ttype | After a call to the nextToken method, this field contains the type of the token just read.
Constructor Summary
TaggedStreamTokenizer(InputStream i) | Create a tokenizer that parses the given character stream.
TaggedStreamTokenizer(Reader r) | Create a tokenizer that parses the given character stream.
Method Summary
void | addKeeperTag(String tag) | Specifies that a tag should be returned as a token, overriding the DiscardHtml setting.
void | addTarget(String start, String end) | Adds an active target tag.
void | addTarget(String start, String end, boolean active) | Specifies which tags should be considered target tags.
boolean | getDiscardComments() | Returns the DiscardComments setting.
boolean | getDiscardHtml() | Returns the DiscardHtml setting.
boolean | getDiscardInactiveTags() | Returns the DiscardInactiveTags setting.
boolean | getDiscardScript() | Returns the DiscardScript setting.
boolean | getDiscardTargetTags() | Returns the DiscardTargetTags setting.
String | getKeeperCharacters() | Returns the keeper string.
static void | main(String[] argv) | Test the TaggedStreamTokenizer by passing in an html filename argument.
int | nextToken() | Generates the next token from the input stream of this tokenizer.
void | reset() | Causes the currently in-scope tag to be removed from scope.
void | setDiscardComments(boolean val) | Determines whether or not html-style comments are discarded.
void | setDiscardHtml(boolean val) | Determines whether or not html is discarded -- both html tags and escape characters such as .
void | setDiscardInactiveTags(boolean val) | Determines whether or not inactive tags are discarded.
void | setDiscardScript(boolean val) | Determines whether or not script, server, applet, and style (i.e., cascading style sheets) code is discarded.
void | setDiscardTargetTags(boolean val) | Determines whether or not active target tags are discarded.
void | setKeeperCharacters(String keepers) | Determines which (non-html) delimiters to return as tokens.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
public static final int TT_EOF
public static final int TT_TARGET_TAG
public static final int TT_TARGET_TAG_END
public static final int TT_TARGET_WORD
public static final int TT_TARGET_HTML
public static final int TT_TARGET_INACTIVE_TAG
public static final int TT_BACKGROUND_WORD
public static final int TT_BACKGROUND_HTML
public static final int TT_BACKGROUND_INACTIVE_TAG
public String attr
public String sval
public int ttype
Constructor Detail
public TaggedStreamTokenizer(InputStream i)
i - an InputStream object
public TaggedStreamTokenizer(Reader r)
r - a Reader object
Method Detail
public void addTarget(String start, String end, boolean active)
Specifies which tags should be considered target tags.
start - a string representing the start tag, e.g., "<start>"
end - a string representing the corresponding end tag, e.g., "<end>" or "</start>"
active - true indicates that the tag is active, and will be interpreted as marking target text.
public void addTarget(String start, String end)
Adds an active target tag.
start - a string representing the start tag
end - a string representing the corresponding end tag
public void addKeeperTag(String tag)
Specifies that a tag should be returned as a token, overriding the DiscardHtml setting. See also setDiscardComments.
tag - a string representing the tag name to be returned
public void setKeeperCharacters(String keepers)
Determines which (non-html) delimiters to return as tokens. By default these delimiters are not returned; specifying a keeper set causes them to be returned as one-character tokens. Special cases: delimiting white space characters are treated as one token by the lexer for efficiency; therefore, no white space characters can be specified in the keeper set. To change this, the grammar file needs to be modified. Ellipses (...) are returned as a single token, as is any string of two or more periods; these are returned by specifying a single period in the keeper string.
keepers - a String containing the non-word characters to return
public String getKeeperCharacters()
Returns the keeper string.
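The keeper rules can be illustrated with a small standalone sketch. The class KeeperDemo and its tokenize method are hypothetical and do not use the real lexer; the sketch assumes that a "word" is a run of letters or digits, that non-keeper delimiters are simply dropped, and that a run of periods is emitted as one token whenever '.' is in the keeper string.

```java
import java.util.ArrayList;
import java.util.List;

class KeeperDemo {

    /** Hypothetical re-implementation of the documented keeper rules, for illustration only. */
    static List<String> tokenize(String text, String keepers) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Word: consume the whole run of letters/digits.
                int j = i;
                while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) {
                    j++;
                }
                out.add(text.substring(i, j));
                i = j;
            } else if (c == '.') {
                // A run of periods is one token, kept only if '.' is a keeper.
                int j = i;
                while (j < text.length() && text.charAt(j) == '.') {
                    j++;
                }
                if (keepers.indexOf('.') >= 0) {
                    out.add(text.substring(i, j));
                }
                i = j;
            } else {
                // Other delimiters: one-character token if kept; white space is never kept.
                if (!Character.isWhitespace(c) && keepers.indexOf(c) >= 0) {
                    out.add(String.valueOf(c));
                }
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Wait... what?!", ""));    // no keepers: words only
        System.out.println(tokenize("Wait... what?!", ".?"));  // periods and '?' kept
    }
}
```

In this sketch, passing ".?" keeps the ellipsis as a single "..." token and the question mark as a one-character token, while the exclamation mark is still dropped.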
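public void setDiscardComments(boolean val)
Determines whether or not html-style comments are discarded.
val - true indicates that comments are discarded.
public boolean getDiscardComments()
Returns the DiscardComments setting.
public void setDiscardScript(boolean val)
Determines whether or not script, server, applet, and style (i.e., cascading style sheets) code is discarded.
val - true indicates that script is discarded.
public boolean getDiscardScript()
Returns the DiscardScript setting.
public void setDiscardInactiveTags(boolean val)
Determines whether or not inactive tags are discarded.
val - true indicates that inactive tags are discarded.
public boolean getDiscardInactiveTags()
Returns the DiscardInactiveTags setting.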
public void setDiscardHtml(boolean val)
Determines whether or not html is discarded -- both html tags and escape characters such as . If true is passed in, it overrides whatever value DiscardScript has; i.e., if all html is discarded, then all javascript and applet code is discarded also. The same is not true for comments.
val - true indicates that html is discarded.
public boolean getDiscardHtml()
Returns the DiscardHtml setting.
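The interaction between the discard settings can be modelled in a few lines. DiscardSettings below is a hypothetical helper, not part of the library; it only encodes the stated rule that discarding html also discards script and applet code, while comments follow their own setting. How the real getters report overridden values is not specified here.

```java
class DiscardSettings {
    // Plain flags mirroring the documented setters (names are illustrative).
    boolean discardHtml;
    boolean discardScript;
    boolean discardComments;

    /** Script is dropped if requested directly or if all html is dropped. */
    boolean scriptIsDropped() {
        return discardHtml || discardScript;
    }

    /** Comments are governed only by their own setting. */
    boolean commentsAreDropped() {
        return discardComments;
    }

    public static void main(String[] args) {
        DiscardSettings s = new DiscardSettings();
        s.discardHtml = true;      // discard all html
        s.discardScript = false;   // overridden by discardHtml
        System.out.println("script dropped: " + s.scriptIsDropped());
        System.out.println("comments dropped: " + s.commentsAreDropped());
    }
}
```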
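public void setDiscardTargetTags(boolean val)
Determines whether or not active target tags are discarded.
val - true indicates that active target tags are discarded.
public boolean getDiscardTargetTags()
Returns the DiscardTargetTags setting.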
public void reset()
Causes the currently in-scope tag to be removed from scope. See TT_BACKGROUND_WORD.
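public int nextToken()
Generates the next token from the input stream of this tokenizer. The type of the token is returned and is also stored in the ttype field. The string representation of the token is in the sval field, and an optional attribute tag is in the attr field.
Returns the value of the ttype field.
public static void main(String[] argv) throws Exception
Test the TaggedStreamTokenizer by passing in an html filename argument.
Throws Exception.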