java.lang.Object
  +--edu.stanford.nlp.annotation.HtmlCleaner
HtmlCleaner removes various code elements (style, script, applet, and so on) from an HTML document.
HtmlCleaner is built on top of the HtmlParser package written by Quiotix, which compresses vertical white space and outputs an html document with consistent syntax (all html is lower case, even spacing in tags, no quotes for attributes except for file names and string literals). HtmlCleaner adds filters for comments, for meta and area tags, and for script, style, server, and applet tags along with the text that is contained between those tags, which generally do not add to the "semantics" of a web page. In addition, the user may specify any additional tags to filter out.
This is usually a necessary first step for processing html documents before passing them into TaggedStreamTokenizer, as the TST makes no attempt to fix spacing, etc. (though it can filter tags and comments).
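To make the default filtering described above concrete, here is a minimal stand-in sketch. It is not the HtmlCleaner implementation (which is built on the Quiotix parser); the class name DefaultIgnoreSketch, the method stripDefaults, and the use of regular expressions are assumptions made purely for illustration. It drops comments, drops script, style, server, and applet tags together with the text between them, and drops meta and area tags on their own.

```java
import java.util.regex.Pattern;

/**
 * Illustrative stand-in only: NOT the HtmlCleaner implementation.
 * Regular expressions are used here just to show the effect of the
 * default filters on a small, well-formed input.
 */
final class DefaultIgnoreSketch {

    // Paired tags whose enclosed text is dropped as well.
    private static final Pattern SPANNING = Pattern.compile(
            "(?is)<(script|style|server|applet)\\b[^>]*>.*?</\\1\\s*>");

    // Stand-alone tags that are dropped by themselves.
    private static final Pattern SINGLE =
            Pattern.compile("(?i)<(meta|area)\\b[^>]*>");

    // HTML comments, possibly spanning several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");

    /** Removes comments first, then spanning tags with their text, then single tags. */
    static String stripDefaults(String html) {
        String out = COMMENT.matcher(html).replaceAll("");
        out = SPANNING.matcher(out).replaceAll("");
        return SINGLE.matcher(out).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=x><style>p{color:red}</style></head>"
                + "<body><!-- note --><p>Hello</p><script>alert(1)</script></body></html>";
        System.out.println(stripDefaults(html));
    }
}
```

A real parser copes with malformed markup and nested constructs far better than these patterns do, which is one reason to use HtmlCleaner itself rather than a regex filter.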
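HtmlCleaner can be run in batch mode using default settings, with a shell script similar to the following. The variables are quoted so that html file names containing spaces are handled too, and the java command assumes that HtmlCleaner and the Quiotix parser are on the classpath.
#!/bin/ksh
#
DIR=docs
OUTDIR=docs/cleaned
# ------------
mkdir -p "$OUTDIR"    # make sure the destination directory exists
for FILE in "$DIR"/*.htm*
do
  [ -f "$FILE" ] || continue    # skip the unexpanded pattern when nothing matches
  echo "$FILE"
  FILEROOT="${FILE%.*}"    # strip the extension
  OUTFILE="$FILEROOT-c.html"
  java edu.stanford.nlp.annotation.HtmlCleaner "$FILE" > "$OUTFILE"
done
mv "$DIR"/*-c.html "$OUTDIR"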
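If the batch run is driven from Java rather than from the shell, the output naming used by the script can be reproduced as follows. This is only a sketch: the class name CleanedNameSketch and the method cleanedPath are hypothetical, and the step that actually opens the streams and calls HtmlCleaner is left out because it needs the library on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that mirrors the script's FILEROOT / OUTFILE naming. */
final class CleanedNameSketch {

    /**
     * Strips the last extension from the input file name (like ${FILE%.*}),
     * appends "-c.html", and places the result in outDir.
     */
    static Path cleanedPath(Path input, Path outDir) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String root = (dot >= 0) ? name.substring(0, dot) : name;
        return outDir.resolve(root + "-c.html");
    }

    public static void main(String[] args) {
        Path out = cleanedPath(Paths.get("docs", "my page.htm"),
                               Paths.get("docs", "cleaned"));
        System.out.println(out);
    }
}
```

Because the name is handled as a single Path rather than as a shell word, file names containing spaces need no special treatment here.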
Constructor Summary
HtmlCleaner(InputStream in, OutputStream os)
    Creates a new HtmlCleaner with default settings.
HtmlCleaner(Reader r, OutputStream os)
    Creates a new HtmlCleaner with default settings.
Method Summary
void addIgnore(String tagName, boolean spans)
    Specifies a type of html tag to ignore.
void clean()
    Cleans the html file specified in the constructor and writes the output to the output stream specified in the constructor.
static void main(String[] args)
    Runs HtmlCleaner with default settings on a specified file, printing the cleaned html to standard out.
void setDefaultIgnores(boolean val)
    The default html elements to ignore are: comments; script, style, server, and applet tags and the text within those tags; meta and area tags.
void setIgnoreComments(boolean val)
    Specifies whether comments should be ignored.
Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
public HtmlCleaner(InputStream in, OutputStream os)
    Creates a new HtmlCleaner with default settings.
    Parameters:
        in - the input stream of the html to be cleaned
public HtmlCleaner(Reader r, OutputStream os)
    Creates a new HtmlCleaner with default settings.
Method Detail
public void clean() throws IOException, com.quiotix.html.parser.ParseException
    Cleans the html file specified in the constructor and writes the output to the output stream specified in the constructor.
    Throws:
        IOException
        com.quiotix.html.parser.ParseException
public void setDefaultIgnores(boolean val)
    The default html elements to ignore are: comments; script, style, server, and applet tags and the text within those tags; meta and area tags.
    Parameters:
        val - true indicates the default html is ignored
public void setIgnoreComments(boolean val)
    Specifies whether comments should be ignored. setDefaultIgnores overrides this setting.
    Parameters:
        val - true indicates comments are ignored
public void addIgnore(String tagName, boolean spans)
    Specifies a type of html tag to ignore. [...] setDefaultIgnores, that tag will no longer be ignored. Only html-style tags are supported.
    Parameters:
        tagName - the name of the tag to be ignored, e.g. "table" or "font", without brackets or attributes.
        spans - true indicates the tag comes in start/end pairs, such as font or table. Pass in false for tags such as br.
public static void main(String[] args)
    Runs HtmlCleaner with default settings on a specified file, printing the cleaned html to standard out.
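The constructor and methods documented above might be combined as in the following sketch. It assumes that edu.stanford.nlp.annotation.HtmlCleaner and the Quiotix parser are on the classpath at run time. The class HtmlCleanerUsageSketch and the method cleanIfAvailable are hypothetical names, and reflection is used only so that the sketch compiles and runs without the library; with the library available, the equivalent direct calls are new HtmlCleaner(in, os), addIgnore("font", true), addIgnore("br", false), and clean().

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical usage sketch; relies only on the signatures documented above. */
final class HtmlCleanerUsageSketch {

    /**
     * Cleans the given html into out if HtmlCleaner can be loaded.
     * Returns false when the class is not on the classpath.
     */
    static boolean cleanIfAvailable(String html, OutputStream out) {
        try {
            Class<?> type = Class.forName("edu.stanford.nlp.annotation.HtmlCleaner");
            InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));

            // new HtmlCleaner(InputStream in, OutputStream os)
            Object cleaner = type.getConstructor(InputStream.class, OutputStream.class)
                                 .newInstance(in, out);

            // font comes in start/end pairs, br does not.
            type.getMethod("addIgnore", String.class, boolean.class)
                .invoke(cleaner, "font", true);
            type.getMethod("addIgnore", String.class, boolean.class)
                .invoke(cleaner, "br", false);

            // Writes the cleaned html to the output stream given to the constructor.
            type.getMethod("clean").invoke(cleaner);
            return true;
        } catch (ClassNotFoundException e) {
            return false; // library not available
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("HtmlCleaner call failed", e);
        }
    }

    public static void main(String[] args) {
        boolean ran = cleanIfAvailable("<p>Hi<br><font size=2>there</font></p>", System.out);
        if (!ran) {
            System.out.println("HtmlCleaner was not found on the classpath.");
        }
    }
}
```

In ordinary code the direct calls are preferable, with IOException and com.quiotix.html.parser.ParseException from clean() handled by the caller.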