TBMan (Stanford JavaNLP API)

edu.stanford.nlp.ie.pcfg
Class TBMan

java.lang.Object
  |
  +--edu.stanford.nlp.ie.pcfg.TBMan

public class TBMan
extends Object

Treebank Manager. Handles all treebank operations.

Field Summary

static int lengthCutoff
          The maximum length of a sentence that the TBMan will parse.

static boolean parse
          Whether the TBMan parses sentences that it has not parsed before.

static int portNumber
          The port on which the TBMan connects to the Parser by default.

List tags
          A list of the tags in all the training and test data.

Constructor Summary

TBMan(String inFN, String tbFN, double split)
          Calls TBMan(inFN, tbFN, split, null)

TBMan(String inFN, String tbFN, double split, String tag)
          Constructs a new TBMan.

Method Summary

static List GetSentences(List documents)
          Breaks a list of documents into sentences.

static List GetSentences(List documents, boolean headlines)
          Breaks a list of documents into sentences.

List GetTestData(int seed)
          Gets the test data.

List GetTrainingData(int seed)
          Gets the training data.

List GetTrees(List sentences)
          Gets the parse trees for a list of sentences.

static void main(String[] args)
          Tests class functionality.

Tree Parse(List sentence)
          Parses a sentence.

static void Preprocess(String inFN, String outFN)
          This reads in the Acquisitions data set as text and outputs the training data as headline / paragraph pairs.

void WriteTB(String fn)
          Writes the treebank out to a file.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

portNumber

public static int portNumber

The TBMan parses sentences as it reads them. To do so, it must connect to the Parser. By default, the TBMan connects to the Parser on this port number.

parse

public static boolean parse

The TBMan parses sentences as it reads them and stores a list of the sentences it has already parsed. It parses a sentence only if the sentence has not been parsed before AND parse is true AND the sentence length is less than or equal to lengthCutoff.

lengthCutoff

public static int lengthCutoff

TBMan

public TBMan(String inFN,
             String tbFN,
             double split,
             String tag)
      throws IOException

Constructs a new TBMan. Reads and stores trees from the treebank file.

Parameters:
    inFN - the name of the file containing the training/test data. The input data is expected to be headlines and paragraphs on alternating lines.
    tbFN - the name of the file containing parse trees, ideally the parse trees for the sentences in the training/test data. If it does not contain them, the TBMan updates the file tbFN as it parses new sentences.
    split - the fraction of headline/paragraph pairs to use as training data, given as a value between 0 and 1 (it is compared against random.nextDouble() when the data is divided). The remaining headline/paragraph pairs are used as test data.
    tag - if non-null, this constructor eliminates all tags in the training/test data except for instances of tag. If tag is null, this has no effect.
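
The effect of the tag parameter can be illustrated with a minimal sketch. It assumes tags appear in the bracket format word[{tag}] described under Preprocess, and that "eliminating" a tag means dropping the [{...}] marker while keeping the tagged word; neither detail is confirmed by this page. The class TagFilterSketch and the method keepOnly are hypothetical names, not part of TBMan.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFilterSketch {
    // Matches a bracket-style tag marker such as [{purchaser}] (assumed format).
    private static final Pattern TAG = Pattern.compile("\\[\\{(\\w+)\\}\\]");

    /** Drops every tag marker except those naming {@code tag}; null leaves the text unchanged. */
    static String keepOnly(String text, String tag) {
        if (tag == null) {
            return text; // mirrors "if tag is null, no effect"
        }
        Matcher m = TAG.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Keep the marker only when its name equals the requested tag.
            String replacement = m.group(1).equals(tag) ? m.group() : "";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepOnly("Amazon[{purchaser}] buys Foo[{acquired}]", "purchaser"));
    }
}
```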

TBMan

public TBMan(String inFN,
             String tbFN,
             double split)
      throws IOException

Calls TBMan(inFN, tbFN, split, null)

Method Detail

Preprocess

public static void Preprocess(String inFN,
                              String outFN)
                       throws IOException

Reads in the Acquisitions data set as text and outputs the training data as headline/paragraph pairs. Headlines and paragraphs are placed on alternating lines (i.e., all line breaks are removed from within a paragraph). This also translates from the SGML tag format (e.g., <purchaser>Amazon</purchaser>) to this package's standard tag format (e.g., Amazon[{purchaser}]).
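
The tag translation can be sketched with a regular expression. This is only an illustration: it assumes SGML tags of the simple form <name>text</name>, with no attributes and no nesting, which this page does not confirm. The class TagFormatSketch and the method toBracketTags are hypothetical names, not part of TBMan.

```java
import java.util.regex.Pattern;

class TagFormatSketch {
    // <name>text</name>; the backreference \1 requires the closing tag to match the opening tag.
    private static final Pattern SGML_TAG = Pattern.compile("<(\\w+)>(.*?)</\\1>");

    /** Rewrites each <name>text</name> span as text[{name}]. */
    static String toBracketTags(String text) {
        // In the replacement, $2 is the tagged text and $1 is the tag name.
        return SGML_TAG.matcher(text).replaceAll("$2[{$1}]");
    }

    public static void main(String[] args) {
        System.out.println(toBracketTags("<purchaser>Amazon</purchaser> said it will buy the company."));
    }
}
```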

Throws:
    IOException

GetTrainingData

public List GetTrainingData(int seed)

Gets the training data. The data is randomly divided into training and test data as follows:

    for each headline/paragraph pair:
        if (random.nextDouble() <= split)
            assign the pair to the training data
        else
            assign the pair to the test data

There is therefore no guarantee that size(training data) == (split / (1 - split)) * size(test data), but the two should be close.

Parameters:
    seed - the random seed. Given the same data and the same seed, the algorithm always splits the data the same way.
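
A minimal sketch of the seeded split follows. It assumes that the training and test halves are produced by replaying the same java.util.Random sequence from the same seed, which is one way to make GetTrainingData(seed) and GetTestData(seed) complementary; the actual TBMan implementation is not shown on this page. The class SplitSketch and the method select are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SplitSketch {
    /** Returns the training part (training == true) or the test part of docs. */
    static <T> List<T> select(List<T> docs, double split, int seed, boolean training) {
        Random random = new Random(seed); // same seed, same sequence of draws
        List<T> out = new ArrayList<>();
        for (T doc : docs) {
            // One draw per document decides which side it lands on.
            boolean isTraining = random.nextDouble() <= split;
            if (isTraining == training) {
                out.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add(i);
        }
        List<Integer> train = select(docs, 0.8, 42, true);
        List<Integer> test = select(docs, 0.8, 42, false);
        // The sizes are close to, but not exactly, an 80/20 division.
        System.out.println(train.size() + " training, " + test.size() + " test");
    }
}
```

Because each document gets an independent draw, the two sizes only approximate the requested ratio, but the two lists always partition the input when the same seed is used for both calls.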

GetTestData

public List GetTestData(int seed)

Gets the test data. The data is randomly divided into training and test data as follows:

    for each headline/paragraph pair:
        if (random.nextDouble() <= split)
            assign the pair to the training data
        else
            assign the pair to the test data

There is therefore no guarantee that size(training data) == (split / (1 - split)) * size(test data), but the two should be close.

Parameters:
    seed - the random seed. Given the same data and the same seed, the algorithm always splits the data the same way.

GetSentences

public static List GetSentences(List documents,
                                boolean headlines)

Breaks a list of documents into sentences. A document is a headline/paragraph pair.

Parameters:
    documents - a list of documents. Each document is a list containing a headline and a paragraph, where a headline is a sentence and a paragraph is a list of sentences.
    headlines - if true, returns only headlines; otherwise, returns only sentences from paragraphs.
Returns:
    a list of sentences (a list of strings)

GetSentences

public static List GetSentences(List documents)

Breaks a list of documents into sentences. A document is a headline/paragraph pair. Returns both headlines and sentences from paragraphs.

Parameters:
    documents - a list of documents. Each document is a list containing a headline and a paragraph, where a headline is a sentence and a paragraph is a list of sentences.
Returns:
    a list of sentences (a list of strings)
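
The document layout described for both GetSentences overloads can be sketched as follows. The sketch assumes each document is a two-element list whose first element is the headline string and whose second element is a list of sentence strings, and that the one-argument form emits each headline followed by its paragraph's sentences; the ordering is a guess. The class SentenceSketch and its lower-case method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SentenceSketch {
    /** Returns only headlines (headlines == true) or only paragraph sentences. */
    static List<String> getSentences(List<List<Object>> documents, boolean headlines) {
        List<String> out = new ArrayList<>();
        for (List<Object> doc : documents) {
            if (headlines) {
                out.add((String) doc.get(0));            // element 0: the headline
            } else {
                for (Object s : (List<?>) doc.get(1)) {  // element 1: the paragraph
                    out.add((String) s);
                }
            }
        }
        return out;
    }

    /** Returns headlines and paragraph sentences together, document by document. */
    static List<String> getSentences(List<List<Object>> documents) {
        List<String> out = new ArrayList<>();
        for (List<Object> doc : documents) {
            out.add((String) doc.get(0));
            for (Object s : (List<?>) doc.get(1)) {
                out.add((String) s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> doc = Arrays.<Object>asList("Headline one", Arrays.asList("First sentence.", "Second sentence."));
        List<List<Object>> docs = new ArrayList<>();
        docs.add(doc);
        System.out.println(getSentences(docs));
    }
}
```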

GetTrees

public List GetTrees(List sentences)

Gets the parse trees for a list of sentences, parsing any sentences that are not in the treebank.

WriteTB

public void WriteTB(String fn)
             throws IOException

Writes the treebank out to a file.

Throws:
    IOException

Parse

public Tree Parse(List sentence)

Parses a sentence. The sentence is parsed only if all of the following hold:
    1) the sentence's parse tree is not in the treebank;
    2) the static variable parse is set to true;
    3) the sentence length is less than or equal to the static variable lengthCutoff.
If the sentence is parsed, the entire treebank is written back to the treebank file.
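
The three conditions can be sketched as a small cache in front of a parser. This is an illustration under several assumptions that this page does not confirm: a sentence is a list of tokens and its length is the token count, trees are represented as strings, a sentence already in the treebank returns its stored tree, a sentence that fails the checks returns null, and the write-back is replaced by a counter. The class ParseGateSketch, the method parseSentence, and the default cutoff value are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ParseGateSketch {
    static boolean parse = true;
    static int lengthCutoff = 40; // hypothetical default; the real value is not documented

    private final Map<List<String>, String> treebank = new HashMap<>();
    private final Function<List<String>, String> parser; // stand-in for the remote Parser
    int writes = 0; // counts how often the treebank would be written back

    ParseGateSketch(Function<List<String>, String> parser) {
        this.parser = parser;
    }

    String parseSentence(List<String> sentence) {
        String stored = treebank.get(sentence);
        if (stored != null) {
            return stored; // condition 1 fails: the tree is already in the treebank
        }
        if (!parse || sentence.size() > lengthCutoff) {
            return null;   // condition 2 or 3 fails (return value is an assumption)
        }
        String tree = parser.apply(sentence);
        treebank.put(sentence, tree);
        writes++;          // the real class rewrites the whole treebank file here
        return tree;
    }
}
```

Writing the whole treebank after every new parse is simple and keeps the file current, at the cost of one full rewrite per newly parsed sentence.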

main

public static void main(String[] args)
                 throws IOException

Tests class functionality. Outdated.

Throws:
    IOException

Stanford NLP Group
