edu.stanford.nlp.ie.pcfg
Class TBMan

java.lang.Object
  |
  +--edu.stanford.nlp.ie.pcfg.TBMan

public class TBMan
extends Object

Treebank Manager. Handles all treebank operations.


Field Summary
static int lengthCutoff
          The maximum length of a sentence that the TBMan will parse.
static boolean parse
          Whether the TBMan parses sentences that are not already in the treebank.
static int portNumber
          The port on which the TBMan connects to the Parser by default.
 List tags
          A list of the tags in all the training and testing data.
 
Constructor Summary
TBMan(String inFN, String tbFN, double split)
          Calls TBMan(inFN, tbFN, split, null)
TBMan(String inFN, String tbFN, double split, String tag)
          Constructs a new TBMan.
 
Method Summary
static List GetSentences(List documents)
          Breaks a list of documents into sentences.
static List GetSentences(List documents, boolean headlines)
          Breaks a list of documents into sentences.
 List GetTestData(int seed)
          Gets the test data.
 List GetTrainingData(int seed)
          Gets the training data.
 List GetTrees(List sentences)
          Gets the parse trees for a list of sentences.
static void main(String[] args)
          Tests class functionality.
 Tree Parse(List sentence)
          Parses a sentence.
static void Preprocess(String inFN, String outFN)
          This reads in the Acquisitions data set as text and outputs the training data as headline / paragraph pairs.
 void WriteTB(String fn)
          Writes the treebank out to a file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

portNumber

public static int portNumber
The TBMan parses sentences as it reads them. To do so, it must connect to the Parser. By default, the TBMan connects to the Parser on this port.


parse

public static boolean parse
The TBMan parses sentences as it reads them. The TBMan also stores a list of sentences it has already parsed. The TBMan parses a sentence only if the sentence has not been parsed before AND "parse" is true AND the sentence length is less than or equal to "lengthCutoff".


lengthCutoff

public static int lengthCutoff
The TBMan parses sentences as it reads them. The TBMan also stores a list of sentences it has already parsed. The TBMan parses a sentence only if the sentence has not been parsed before AND "parse" is true AND the sentence length is less than or equal to "lengthCutoff".


tags

public List tags
A list of the tags in all the training and testing data, with one entry per distinct tag. Calculated in the constructor.

Constructor Detail

TBMan

public TBMan(String inFN,
             String tbFN,
             double split,
             String tag)
      throws IOException
Constructs a new TBMan. Reads and stores trees from the treebank file.

Parameters:
inFN - the name of the file containing training/test data. Input data is expected to be headlines and paragraphs on alternating lines.
tbFN - the name of the file containing parse trees. Ideally these are the parse trees for the sentences in the training/test data. If not, the TBMan updates the file tbFN as it parses new sentences.
split - the fraction (between 0 and 1) of headline/paragraph pairs to use as training data. The remaining headline/paragraph pairs are used as test data.
tag - the only tag to keep. All tags in the training/test data other than instances of "tag" are removed. If "tag" is null, no tags are removed.

TBMan

public TBMan(String inFN,
             String tbFN,
             double split)
      throws IOException
Calls TBMan(inFN, tbFN, split, null)

Method Detail

Preprocess

public static void Preprocess(String inFN,
                              String outFN)
                       throws IOException
This reads in the Acquisitions data set as text and outputs the training data as headline / paragraph pairs. Headlines and paragraphs are placed on alternating lines (i.e., all line breaks are removed from within a paragraph). This also translates from the SGML tag format (e.g., Amazon enclosed in purchaser tags) to my standard tag format (e.g., Amazon[{purchaser}]).
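
A minimal sketch of the two text transformations described above, assuming the SGML markup consists of simple, non-nested elements of the form <name>text</name>. The class and method names (TagFormatSketch, joinLines, toBracketFormat) are hypothetical and are not part of TBMan; the actual Preprocess method may handle the Acquisitions markup differently.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of TBMan: illustrates the text clean-up that
// Preprocess is described as performing.
class TagFormatSketch {
    // Assumption: tags are simple, non-nested SGML elements such as <purchaser>...</purchaser>.
    private static final Pattern SGML_TAG =
            Pattern.compile("<(\\w+)>(.*?)</\\1>", Pattern.DOTALL);

    // Collapses line breaks (and the whitespace around them) inside a paragraph into single spaces.
    static String joinLines(String paragraph) {
        return paragraph.trim().replaceAll("\\s*\\R\\s*", " ");
    }

    // Rewrites <tag>text</tag> as text[{tag}].
    static String toBracketFormat(String text) {
        return SGML_TAG.matcher(text).replaceAll("$2[{$1}]");
    }

    public static void main(String[] args) {
        String raw = "<purchaser>Amazon</purchaser> said it\nwill buy the company.";
        // prints: Amazon[{purchaser}] said it will buy the company.
        System.out.println(toBracketFormat(joinLines(raw)));
    }
}
```

Joining lines before converting tags means an element that spans a line break still ends up on a single output line.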

Throws:
IOException

GetTrainingData

public List GetTrainingData(int seed)
Gets the training data. Randomly divides the data into training and test data. The algorithm is: for each headline/paragraph pair, if ( random.nextDouble() <= split ) assign the pair to the training data; otherwise assign it to the test data. There is therefore no guarantee that size(training data) == (split / (1 - split)) * size(test data), but the two should be close.

Parameters:
seed - the random seed. Given the same data and the same seed, the algorithm always splits the data the same way.
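
The seeded split described above can be sketched as follows. This is an illustration only: SplitSketch and select are hypothetical names, and the use of java.util.Random seeded directly with the seed argument is an assumption about how TBMan generates its random numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of TBMan: illustrates the seeded random split
// described for GetTrainingData and GetTestData.
class SplitSketch {
    // Returns the training pairs if wantTraining is true, otherwise the test pairs.
    static <T> List<T> select(List<T> pairs, double split, int seed, boolean wantTraining) {
        Random random = new Random(seed);   // same seed and same data give the same split
        List<T> selected = new ArrayList<>();
        for (T pair : pairs) {
            boolean isTraining = random.nextDouble() <= split;
            if (isTraining == wantTraining) {
                selected.add(pair);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            pairs.add("pair-" + i);
        }
        List<String> train = select(pairs, 0.8, 42, true);
        List<String> test = select(pairs, 0.8, 42, false);
        // The two lists partition the data; the sizes are near 800 and 200 but not exact.
        System.out.println(train.size() + " training, " + test.size() + " test");
    }
}
```

Because both calls replay the same random sequence, each pair lands in exactly one of the two lists as long as the same seed is passed to both.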

GetTestData

public List GetTestData(int seed)
Gets the test data. Randomly divides the data into training and test data. The algorithm is: for each headline/paragraph pair, if ( random.nextDouble() <= split ) assign the pair to the training data; otherwise assign it to the test data. There is therefore no guarantee that size(training data) == (split / (1 - split)) * size(test data), but the two should be close.

Parameters:
seed - the random seed. Given the same data and the same seed, the algorithm always splits the data the same way.

GetSentences

public static List GetSentences(List documents,
                                boolean headlines)
Breaks a list of documents into sentences. A document is a headline/paragraph pair.

Parameters:
documents - a list of documents (each document is a list containing a headline and a paragraph, where a headline is a sentence and a paragraph is a list of sentences)
headlines - if true, returns only headlines; otherwise, returns only sentences from paragraphs
Returns:
a list of sentences (a list of strings)

GetSentences

public static List GetSentences(List documents)
Breaks a list of documents into sentences. A document is a headline/paragraph pair. Returns both headlines and sentences from paragraphs.

Parameters:
documents - a list of documents (each document is a list containing a headline and a paragraph, where a headline is a sentence and a paragraph is a list of sentences)
Returns:
a list of sentences (a list of strings)
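
The two GetSentences overloads above can be sketched as follows, under stated assumptions: each document is a two-element list whose first element is the headline and whose second element is the paragraph, and each sentence is represented as a String. SentenceSketch and its methods are hypothetical names, and the order in which the one-argument version interleaves headlines and paragraph sentences is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of TBMan: illustrates the document layout
// described for GetSentences.
class SentenceSketch {
    // Assumption: document.get(0) is the headline (a String) and
    // document.get(1) is the paragraph (a List of String sentences).
    @SuppressWarnings("unchecked")
    static List<String> getSentences(List<List<Object>> documents, boolean headlines) {
        List<String> sentences = new ArrayList<>();
        for (List<Object> document : documents) {
            if (headlines) {
                sentences.add((String) document.get(0));
            } else {
                sentences.addAll((List<String>) document.get(1));
            }
        }
        return sentences;
    }

    // Mirrors the one-argument overload: headlines and paragraph sentences together.
    @SuppressWarnings("unchecked")
    static List<String> getSentences(List<List<Object>> documents) {
        List<String> sentences = new ArrayList<>();
        for (List<Object> document : documents) {
            sentences.add((String) document.get(0));
            sentences.addAll((List<String>) document.get(1));
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Object> doc = Arrays.<Object>asList(
                "Amazon buys firm",
                Arrays.asList("Amazon said it will buy a firm.", "Terms were not disclosed."));
        List<List<Object>> documents = new ArrayList<>();
        documents.add(doc);
        System.out.println(getSentences(documents, true));   // headlines only
        System.out.println(getSentences(documents, false));  // paragraph sentences only
        System.out.println(getSentences(documents));         // both
    }
}
```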

GetTrees

public List GetTrees(List sentences)
Gets the parse trees for a list of sentences. Parses any sentences not in the treebank.


WriteTB

public void WriteTB(String fn)
             throws IOException
Writes the treebank out to a file.

Throws:
IOException

Parse

public Tree Parse(List sentence)
Parses a sentence. The sentence is parsed only if 1) its parse tree is not in the treebank AND 2) the static variable "parse" is set to true AND 3) the sentence length is less than or equal to the static variable "lengthCutoff". If the sentence is parsed, the entire treebank is written back to the treebank file.
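
A minimal sketch of the three-condition guard described above, with an in-memory map standing in for the treebank and a function standing in for the external Parser. All names here (ParseGuardSketch, parseSentence, parserCalls) are hypothetical. Measuring sentence length in words, representing trees as strings, and returning null for a sentence that is skipped are assumptions of this sketch, not documented TBMan behavior.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of TBMan: illustrates the guard described for Parse.
class ParseGuardSketch {
    static boolean parse = true;      // stands in for TBMan.parse
    static int lengthCutoff = 40;     // stands in for TBMan.lengthCutoff; 40 is an arbitrary value

    private final Map<List<String>, String> treebank = new HashMap<>();
    private final Function<List<String>, String> parser;  // stand-in for the external Parser
    int parserCalls = 0;

    ParseGuardSketch(Function<List<String>, String> parser) {
        this.parser = parser;
    }

    // Returns the stored tree if there is one; otherwise parses only when allowed.
    String parseSentence(List<String> sentence) {
        String tree = treebank.get(sentence);
        if (tree != null) {
            return tree;                       // already in the treebank, so no parse is needed
        }
        if (!parse || sentence.size() > lengthCutoff) {
            return null;                       // parsing disabled or sentence too long
        }
        parserCalls++;
        tree = parser.apply(sentence);
        treebank.put(sentence, tree);          // TBMan is described as rewriting the treebank file at this point
        return tree;
    }

    public static void main(String[] args) {
        ParseGuardSketch sketch =
                new ParseGuardSketch(words -> "(S " + String.join(" ", words) + ")");
        List<String> sentence = Arrays.asList("Amazon", "buys", "firm");
        System.out.println(sketch.parseSentence(sentence));
        System.out.println(sketch.parseSentence(sentence));   // served from the map
        System.out.println("parser calls: " + sketch.parserCalls);
    }
}
```

Checking the stored trees first keeps repeated requests for the same sentence from reaching the parser, which matches the documented intent of parsing each sentence at most once.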


main

public static void main(String[] args)
                 throws IOException
Tests class functionality. Outdated.

Throws:
IOException


Stanford NLP Group