edu.stanford.nlp.annotation
Class HtmlCleaner

java.lang.Object
  |
  +--edu.stanford.nlp.annotation.HtmlCleaner

public class HtmlCleaner
extends Object

HtmlCleaner removes various code elements (style, script, applet, and so on) from an HTML document. It is built on top of the HtmlParser package written by Quiotix, which compresses vertical white space and outputs an html document with consistent syntax (all html is lower case, spacing in tags is even, and attributes are unquoted except for file names and string literals). HtmlCleaner adds filters for comments, for meta and area tags, and for script, style, server, and applet tags along with the text contained between those tags; these elements generally do not add to the "semantics" of a web page. In addition, the user may specify any additional tags to filter out. Cleaning is usually a necessary first step before passing html documents into TaggedStreamTokenizer (TST), as the TST makes no attempt to fix spacing and similar irregularities (though it can filter tags and comments).

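HtmlCleaner can be run in batch mode using default settings, with a shell script similar to the following. The variables are quoted so that html file names containing spaces are passed to HtmlCleaner as a single argument. The script assumes that the HtmlCleaner class is on the classpath and that the output directory already exists.

 
 #!/bin/ksh
 # Cleans every .htm/.html file in $DIR and moves the results to $OUTDIR.
 
 DIR=docs
 OUTDIR=docs/cleaned
 
 # ------------
     for FILE in "$DIR"/*.htm*
     do
         echo "$FILE"
         # Strip the extension: docs/page.html becomes docs/page
         FILEROOT=${FILE%.*}
         OUTFILE="$FILEROOT-c.html"
         java edu.stanford.nlp.annotation.HtmlCleaner "$FILE" > "$OUTFILE"
     done
 
     mv "$DIR"/*-c.html "$OUTDIR"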
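
The output-naming step of the script (strip the extension, append -c.html, place the result in the output directory) can also be sketched in Java, for callers who prefer to drive the batch run from code. This is only an illustration: the class name CleanedNameSketch and the method cleanedPath are hypothetical and are not part of HtmlCleaner. Because the file name is handled as a Path rather than split by a shell, names containing spaces need no special treatment.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of HtmlCleaner: computes where the cleaned
// copy of an html file would be written, mirroring ${FILE%.*}-c.html.
final class CleanedNameSketch {

    static Path cleanedPath(Path input, Path outDir) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Drop the extension if there is one; a leading dot is not an extension.
        String root = (dot > 0) ? name.substring(0, dot) : name;
        return outDir.resolve(root + "-c.html");
    }

    public static void main(String[] args) {
        Path out = cleanedPath(Paths.get("docs", "my page.html"),
                               Paths.get("docs", "cleaned"));
        // The resulting file name is "my page-c.html".
        System.out.println(out.getFileName());
    }
}
```

The returned path could then be opened as the OutputStream handed to an HtmlCleaner constructor.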
 
 
 


Constructor Summary
HtmlCleaner(InputStream in, OutputStream os)
          Creates a new HtmlCleaner with default settings.
HtmlCleaner(Reader r, OutputStream os)
          Creates a new HtmlCleaner with default settings.
 
Method Summary
 void addIgnore(String tagName, boolean spans)
          Specifies a type of html tag to ignore.
 void clean()
          Cleans the html file specified in the constructor and writes the output to the output stream specified in the constructor.
static void main(String[] args)
          Runs HtmlCleaner with default settings on a specified file, printing the cleaned html to standard out.
 void setDefaultIgnores(boolean val)
          Specifies whether the default html elements are ignored. The defaults are: comments; script, style, server, and applet tags and the text within those tags; meta and area tags.
 void setIgnoreComments(boolean val)
          Specifies whether comments should be ignored.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlCleaner

public HtmlCleaner(InputStream in,
                   OutputStream os)
Creates a new HtmlCleaner with default settings.

Parameters:
in - the input stream of the html to be cleaned
os - the output stream to which the cleaned html is written

HtmlCleaner

public HtmlCleaner(Reader r,
                   OutputStream os)
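Creates a new HtmlCleaner with default settings.

Parameters:
r - the reader supplying the html to be cleaned
os - the output stream to which the cleaned html is written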

Method Detail

clean

public void clean()
           throws IOException,
                  com.quiotix.html.parser.ParseException
Cleans the html file specified in the constructor and writes the output to the output stream specified in the constructor.

Throws:
IOException
com.quiotix.html.parser.ParseException

setDefaultIgnores

public void setDefaultIgnores(boolean val)
Specifies whether the default html elements are ignored. The defaults are: comments; script, style, server, and applet tags and the text within those tags; meta and area tags.

Parameters:
val - true indicates that the default html elements are ignored

setIgnoreComments

public void setIgnoreComments(boolean val)
Specifies whether comments should be ignored. Passing true into setDefaultIgnores overrides this setting.

Parameters:
val - true indicates comments are ignored

addIgnore

public void addIgnore(String tagName,
                      boolean spans)
Specifies a type of html tag to ignore. If one of the six default ignored tags is passed in, and subsequently false is passed into setDefaultIgnores, that tag will no longer be ignored. Only html-style tags are supported.

Parameters:
tagName - the name of the tag to be ignored, e.g. "table" or "font", without brackets or attributes.
spans - true indicates the tag comes in start/end pairs, such as font or table. Pass in false for tags such as br.
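
The difference between the two values of spans can be illustrated with a small standalone sketch. It is not how HtmlCleaner works internally (HtmlCleaner is built on the Quiotix parser, whereas this sketch uses regular expressions), and the class SpanIgnoreSketch and its strip method are hypothetical. It also assumes that a spanning tag is dropped together with the text between its start and end tags, as the default script and style filters do; this page does not state that explicitly for user-added tags.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; HtmlCleaner itself uses the Quiotix parser.
final class SpanIgnoreSketch {

    // spans == false: remove each standalone tag such as <br>.
    // spans == true:  remove the start tag, the matching end tag, and
    //                 (by assumption) everything in between.
    static String strip(String html, String tagName, boolean spans) {
        String name = Pattern.quote(tagName);
        String open = "<" + name + "\\b[^>]*>";
        String regex = spans ? open + ".*?</" + name + "\\s*>" : open;
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                      .matcher(html)
                      .replaceAll("");
    }

    public static void main(String[] args) {
        // prints "line oneline two"
        System.out.println(strip("line one<br>line two", "br", false));
        // prints "<p>text</p>"
        System.out.println(strip("<p>text</p><script>var x = 1;</script>", "script", true));
    }
}
```

A regular expression of this kind does not handle nested tags of the same name, which is one reason a real cleaner relies on a parser instead.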

main

public static void main(String[] args)
Runs HtmlCleaner with default settings on a specified file, printing the cleaned html to standard out.



Stanford NLP Group