

5. The Library

The programming framework is supported by a library of interfaces, learning algorithms, and implementations of the building blocks described in Chapter 4. This chapter gives a general overview of each of those components.

The library is currently organized into five packages. lbjava.classify contains classes related to features and classification. lbjava.learn contains learner implementations and supporting classes. lbjava.infer contains inference algorithm implementations and internal representations for constraints and inference structures. lbjava.parse contains the Parser interface and some general purpose internal representation classes. Finally, lbjava.nlp contains some basic natural language processing internal representations and parsing routines. In the future, we plan to expand this library, adding more varieties of learners and domain specific parsers and internal representations.

5.1 lbjava.classify

The most important class in LBJava’s library is lbjava.classify.Classifier. This abstract class is the interface through which the application accesses the classifiers defined in the source file. However, the programmer will generally need to become familiar with only a few of the methods defined there.

One other class that may be of broad interest is the lbjava.classify.TestDiscrete class (discussed in Section 5.1.8), which can automate the performance evaluation of a discrete learning classifier on a labeled test set. The other classes in this package are designed mainly for internal use by LBJava’s compiler and can be safely ignored by the casual user. More advanced users who write their own learners or inference algorithms in the application, for instance, will need to become familiar with them.

5.1.1 lbjava.classify.Classifier

Every classifier declaration in an LBJava source file is translated by the compiler into a Java class that extends this class. When the programmer wants to call a classifier in the application, he creates an object of his classifier’s class using its zero-argument constructor and calls an appropriate method on that object. The appropriate method will most likely be one of the following four methods:

  • String discreteValue(Object): This method will only be overridden in the classifier’s implementation if its feature return type is discrete. Its return value is the value of the single feature this classifier returns.
  • double realValue(Object): This method will only be overridden in the classifier’s implementation if its feature return type is real. Its return value is the value of the single feature this classifier returns.
  • String[] discreteValueArray(Object): This method will only be overridden in the classifier’s implementation if its feature return type is discrete[]. Its return value contains the values of all the features this classifier returns.
  • double[] realValueArray(Object): This method will only be overridden in the classifier’s implementation if its feature return type is real[]. Its return value contains the values of all the features this classifier returns.

There is no method similar to the four above for accessing the values of features produced by a feature generator, since those values are meaningless without their associated names. When the programmer wants access to the actual features produced by any classifier (not just feature generators), the following non-static method is used. Note, however, that the main purpose of this method is for internal use by the compiler. (One circumstance where the programmer may be interested in this method is to print out the String representation of the returned FeatureVector.)

  • FeatureVector classify(Object): This method is overridden in every classifier implementation generated by the compiler. It returns a FeatureVector which may be iterated through to access individual features (see Section 5.1.3).

Every classifier implementation generated by the compiler overrides the following non-static member methods as well. They provide type information about the implemented classifier.

  • String getInputType(): This method returns a String containing the fully qualified name of the class this classifier expects as input.

  • String getOutputType(): This method returns a String containing the feature return type of this classifier. If the classifier is discrete and contains a list of allowable values, that list will not appear in the output of this method.

  • String[] allowableValues(): If the classifier is discrete and contains a list of allowable values, that list will be returned by this method. Otherwise, an array of length zero is returned. Learners that require a particular number of allowable values may return an array filled with "*" whose length indicates that number.

Finally, class Classifier provides a simple static method for testing the agreement of two classifiers. It’s convenient, for instance, when testing the performance of a learned classifier against an oracle classifier.

  • double test(Classifier, Classifier, Object[]): This static method returns the fraction of objects in the third argument that produced the same classifications from the two argument Classifiers.
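
The quantity returned by test can be illustrated with a minimal, self-contained sketch. It does not use the library: the class AgreementSketch, its agreement method, and the use of java.util.function.Function as a stand-in for a discrete classifier are all assumptions made for illustration.

```java
import java.util.function.Function;

class AgreementSketch {
    /** Fraction of examples on which classifiers a and b return the same value. */
    static double agreement(Function<Object, String> a,
                            Function<Object, String> b,
                            Object[] examples) {
        if (examples.length == 0) return 0.0; // assumption: empty input counts as 0
        int same = 0;
        for (Object example : examples) {
            if (a.apply(example).equals(b.apply(example))) same++;
        }
        return (double) same / examples.length;
    }

    public static void main(String[] args) {
        // Two hypothetical classifiers that disagree only on words of length 4.
        Function<Object, String> learned = o -> o.toString().length() > 3 ? "long" : "short";
        Function<Object, String> oracle = o -> o.toString().length() > 4 ? "long" : "short";
        Object[] words = {"a", "tree", "house", "elephant"};
        System.out.println(agreement(learned, oracle, words)); // prints 0.75
    }
}
```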

There are several other methods of this class described in the Javadoc documentation. They are omitted here since the programmer is not expected to need them.

5.1.2 lbjava.classify.Feature

This abstract class is part of the representation of the value produced by a classifier. In particular, the name of a feature, but not its value, is stored here. Classes derived from this class (described below) provide storage for the value of the feature. This class exists mainly for internal use by the compiler, and most programmers will not need to be familiar with it.

  • lbjava.classify.DiscreteFeature: The value of a feature returned by a discrete classifier is stored as a String in objects of this class.
  • lbjava.classify.DiscreteArrayFeature: The String value of a feature returned by a discrete[] classifier as well as its integer index into the array are stored in objects of this class.
  • lbjava.classify.RealFeature: The value of a feature returned by a real classifier is stored as a double in objects of this class.
  • lbjava.classify.RealArrayFeature: The double value of a feature returned by a real[] classifier as well as its integer index into the array are stored in objects of this class.

5.1.3 lbjava.classify.FeatureVector

FeatureVector is a linked-list-style container which stores features that function as labels separately from other features. It contains methods for iterating through the features and labels and adding more of either. Its main function is as the return value of the Classifier#classify(Object) method which is used internally by the compiler (see Section 5.1.1). Most programmers will not need to become intimately familiar with this class.

5.1.4 lbjava.classify.Score

This class represents the double score produced by a discrete learning classifier in association with one of its String prediction values. Both items are stored in an object of this class. This class is used internally by LBJava’s inference infrastructure, which will interpret the score as an indication of how much the learning classifier prefers the associated prediction value, higher scores indicating more preference.

5.1.5 lbjava.classify.ScoreSet

This is another class used internally by LBJava’s inference infrastructure. An object of this class is intended to contain one Score for each possible prediction value a learning classifier is capable of returning.

5.1.6 lbjava.classify.ValueComparer

This simple class derived from Classifier is used to convert a multi-valued discrete classifier into a Boolean classifier that returns true if and only if the multi-valued classifier evaluated to a particular value. ValueComparer is used internally by SparseNetworkLearner (see Section 5.2.6).
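
The conversion described here amounts to comparing the multi-valued output against one fixed value. The following is a minimal sketch of that idea only; ValueComparerSketch and isValue are hypothetical names, and Function and Predicate stand in for the library’s classifier types.

```java
import java.util.function.Function;
import java.util.function.Predicate;

class ValueComparerSketch {
    /** Wraps a multi-valued classifier so that it answers "did it return target?". */
    static Predicate<Object> isValue(Function<Object, String> multiValued, String target) {
        return example -> target.equals(multiValued.apply(example));
    }

    public static void main(String[] args) {
        // A hypothetical three-valued classifier over integers.
        Function<Object, String> sign = o -> {
            int n = (Integer) o;
            return n < 0 ? "negative" : n == 0 ? "zero" : "positive";
        };
        Predicate<Object> isNegative = isValue(sign, "negative");
        System.out.println(isNegative.test(-5)); // prints true
        System.out.println(isNegative.test(7));  // prints false
    }
}
```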

5.1.7 Vector Returners

The classes lbjava.classify.FeatureVectorReturner and lbjava.classify.LabelVectorReturner are used internally by the compiler to help implement the training procedure when the programmer specifies multiple training rounds (see Section 4.1.2.6). A feature vector returner is substituted as the learning classifier’s feature extraction classifier, and a label vector returner is substituted as the learning classifier’s labeler (see Section 5.2.1 to see how this substitution is performed). Each of them then expects the object received as input by the learning classifier to be a FeatureVector, which is not normally the case. However, as will be described in Section 5.4.4, the programmer may still be interested in these classes if he wishes to continue training a learning classifier for additional rounds on the same data without incurring the costs of performing feature extraction.

5.1.8 lbjava.classify.TestDiscrete

This class can be quite useful to quickly evaluate the performance of a newly learned classifier on labeled testing data. It operates either as a stand-alone program or as a class that may be imported into an application for more tailored use. In either case, it will automatically compute accuracy, precision, recall, and F1 scores for the learning classifier in question.

To use this class inside an application, simply instantiate an object of it using the no-argument constructor. Let’s call this object tester. Then, each time the learning classifier makes a prediction p for an object whose true label is l, make the call tester.reportPrediction(p, l). Once all testing objects have been processed, the printPerformance(java.io.PrintStream) method may be used to print a table of results, or the programmer may make use of the various other methods provided by this class to retrieve the computed statistics. More detailed usage of all these methods as well as the operation of this class as a stand-alone program is available in the on-line Javadoc.
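
To make the bookkeeping behind these statistics concrete, here is a minimal stand-in that is not the library class. Only the method name reportPrediction comes from the text above; DiscreteTesterSketch and the accessor names accuracy, precision, recall, and f1 are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class DiscreteTesterSketch {
    private final Map<String, Integer> predicted = new HashMap<>(); // times each value was predicted
    private final Map<String, Integer> labeled = new HashMap<>();   // times each value was the true label
    private final Map<String, Integer> correct = new HashMap<>();   // times a predicted value was right
    private int total = 0;
    private int totalCorrect = 0;

    /** Records one prediction p for an example whose true label is l. */
    void reportPrediction(String p, String l) {
        predicted.merge(p, 1, Integer::sum);
        labeled.merge(l, 1, Integer::sum);
        total++;
        if (p.equals(l)) {
            correct.merge(l, 1, Integer::sum);
            totalCorrect++;
        }
    }

    double accuracy() { return ratio(totalCorrect, total); }
    double precision(String v) { return ratio(correct.getOrDefault(v, 0), predicted.getOrDefault(v, 0)); }
    double recall(String v) { return ratio(correct.getOrDefault(v, 0), labeled.getOrDefault(v, 0)); }

    /** Harmonic mean of precision and recall for value v. */
    double f1(String v) {
        double p = precision(v), r = recall(v);
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    private static double ratio(int n, int d) { return d == 0 ? 0.0 : (double) n / d; }

    public static void main(String[] args) {
        DiscreteTesterSketch tester = new DiscreteTesterSketch();
        tester.reportPrediction("NN", "NN");
        tester.reportPrediction("NN", "VB");
        tester.reportPrediction("VB", "VB");
        tester.reportPrediction("VB", "VB");
        System.out.println(tester.accuracy());      // prints 0.75
        System.out.println(tester.precision("NN")); // prints 0.5
        System.out.println(tester.recall("NN"));    // prints 1.0
    }
}
```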

5.2 lbjava.learn

The programmer will want to familiarize himself with most of the classes in this package, in particular those that are derived from the abstract class lbjava.learn.Learner. These are the learners that may be selected from within an LBJava source file in association with a learning classifier expression.

5.2.1 lbjava.learn.Learner

Learner is an abstract class extending the abstract class Classifier (see Section 5.1.1). It acts as an interface between learning classifiers defined in an LBJava source file and applications that make on-line use of their learning capabilities. The class generated by the compiler when translating a learning classifier expression will always indirectly extend this class.

In addition to the methods inherited from Classifier, this class defines the following non-static, learning related methods. These are not the only methods defined in class Learner, and advanced users may be interested in perusing the Javadoc for descriptions of other methods.

  • void learn(Object): The programmer may call this method at any time from within the application to continue the training process given a single example object. The most common use of this method will be in conjunction with a supervised learning algorithm, in which case, of course, the true label of the example object must be accessible by the label classifier specified in the learning classifier expression in the source file. Note that changes made via this method will not persist beyond the current execution of the application unless the save() method (discussed below) is invoked.
  • void doneLearning(): Some learning algorithms (primarily off-line learning algorithms) save part of their computation until after all training objects have been observed. This method informs the learning algorithm that it is time to perform that part of the computation. When compile-time training is indicated in a learning classifier expression, the compiler will call this method after training is complete. Similarly, the programmer who performs on-line learning in his application may need to call this method as well, depending on the learning algorithm.
  • void forget(): The user may call this method from the application to reinitialize the learning classifier to the state at which it started before any training was performed. Note that changes made via this method will not persist beyond the current execution of the application unless the save() method (discussed below) is invoked.
  • void save(): As described in Section 4.1.1, the changes made while training a classifier on-line in the application are immediately visible everywhere in the application. These changes are not written back to disk unless the save() method is invoked. Once this method is invoked, changes that have been made from on-line learning will become visible to subsequent executions of applications that invoke this learning classifier. Please note that the save() method currently will not work when the classifier’s byte code is packed in a jar file.
  • lbjava.classify.ScoreSet scores(Object): This method is used internally by inference algorithms which interpret the scores in the returned ScoreSet (see Section 5.1.5) as indications of which predictions the learning classifier prefers and how much they are preferred.
  • lbjava.classify.Classifier getExtractor(): This method gives access to the feature extraction classifier used by this learning classifier.
  • void setExtractor(lbjava.classify.Classifier): Use this method to change the feature extraction classifier used by this learning classifier. Note that this change will be remembered during subsequent executions of the application if the save() method (described above) is later invoked.
  • lbjava.classify.Classifier getLabeler(): This method gives access to the classifier used by this learning classifier to produce labels for supervised learning.
  • void setLabeler(lbjava.classify.Classifier): Use this method to change the labeler used by this learning classifier. Note that this change will be remembered during subsequent executions of the application if the save() method (described above) is later invoked.
  • void write(java.io.PrintStream): This abstract method must be overridden by each extending learner implementation. A learning classifier derived from such a learner may then invoke this method to produce the learner’s internal representation in text form. Invoking this method does not make modifications to the learner’s internal representation visible to subsequent executions of applications that invoke this learning classifier like the save() method does.

In addition, the following static flag is declared in every learner output by the compiler.

  • public static boolean isTraining: The isTraining variable can be used by the programmer to determine if his learning classifier is currently being trained. This ability may be useful if, for instance, a feature extraction classifier for this learning classifier needs to alter its behavior depending on the availability of labeled training data. The compiler will automatically set this flag to true during off-line training, and it will be initialized to false in any application using the learning classifier. So, it becomes the programmer’s responsibility to make sure it is set appropriately if any additional on-line training is to be performed in the application.

5.2.2 lbjava.learn.LinearThresholdUnit

A linear threshold unit is a supervised, mistake driven learner for binary classification. The predictions made by such a learner are produced by computing a score for a given example object and then comparing that score to a predefined threshold. While learning, if the prediction does not match the label, the linear function that produced the score is updated. Linear threshold units form the basis of many other learning techniques.

Class LinearThresholdUnit is an abstract class defining a basic API for learners of this type. A non-abstract class extending it need only provide implementations of the following abstract methods.

  • void promote(Object): This method makes an appropriate modification to the linear function when a mistake is made on a positive example (i.e., when the computed score mistakenly fell below the predefined threshold).

  • void demote(Object): This method makes an appropriate modification to the linear function when a mistake is made on a negative example (i.e., when the computed score mistakenly rose above the predefined threshold).

When a learning classifier expression (see Section 4.1.2.6) employs a learner derived from this class, the specified label-producing classifier must be defined as discrete with a value list containing exactly two values. (See Section 4.1.1 for more information on value lists in feature return types.) The learner derived from this class will then learn to produce a higher score when the correct prediction is the second value in the value list.

5.2.3 lbjava.learn.SparsePerceptron

This learner extends class LinearThresholdUnit (see Section 5.2.2). It represents its linear function for score computation as a vector of weights corresponding to features. It has an additive update rule, meaning that it promotes and demotes by treating the collection of features associated with a training object as a vector and using vector addition. Finally, parameters such as its learning rate, its threshold, the thickness of its thick separator, and others described in the online Javadoc can be configured by the user.
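
The additive rule can be sketched in a few lines. The following is a simplified illustration under stated assumptions, not the library’s implementation: PerceptronSketch is a hypothetical class, features are represented as a name-to-strength map, and the thick separator is omitted.

```java
import java.util.HashMap;
import java.util.Map;

class PerceptronSketch {
    private final Map<String, Double> weights = new HashMap<>();
    private final double learningRate;
    private final double threshold;

    PerceptronSketch(double learningRate, double threshold) {
        this.learningRate = learningRate;
        this.threshold = threshold;
    }

    /** Dot product of the weight vector with the example's features. */
    double score(Map<String, Double> example) {
        double sum = 0.0;
        for (Map.Entry<String, Double> f : example.entrySet()) {
            sum += weights.getOrDefault(f.getKey(), 0.0) * f.getValue();
        }
        return sum;
    }

    boolean predict(Map<String, Double> example) {
        return score(example) >= threshold;
    }

    // Additive updates: add or subtract the scaled example vector.
    void promote(Map<String, Double> example) { add(example, learningRate); }
    void demote(Map<String, Double> example) { add(example, -learningRate); }

    private void add(Map<String, Double> example, double rate) {
        for (Map.Entry<String, Double> f : example.entrySet()) {
            weights.merge(f.getKey(), rate * f.getValue(), Double::sum);
        }
    }

    /** Mistake-driven learning: the weights change only when the prediction is wrong. */
    void learn(Map<String, Double> example, boolean label) {
        boolean prediction = predict(example);
        if (label && !prediction) promote(example);
        else if (!label && prediction) demote(example);
    }

    public static void main(String[] args) {
        PerceptronSketch p = new PerceptronSketch(1.0, 0.5);
        Map<String, Double> x = new HashMap<>();
        x.put("bias", 1.0);
        x.put("length", 2.0);
        System.out.println(p.predict(x)); // prints false: all weights start at zero
        p.learn(x, true);
        System.out.println(p.predict(x)); // prints true after one promotion
    }
}
```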

5.2.4 lbjava.learn.SparseAveragedPerceptron

Extended from SparsePerceptron (see Section 5.2.3), this learner computes an approximation of voted Perceptron by averaging the weight vectors obtained after processing each training example. Its configurable parameters are the same as those of SparsePerceptron; in particular, using this algorithm with a positive thickness for the thick separator can be quite effective.
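
As a rough sketch of the averaging idea, the code below keeps a running sum of the weight vector after every training example and predicts with the average. It is an illustration only: AveragedPerceptronSketch is a hypothetical class, it uses dense arrays instead of sparse vectors, and it omits the thick separator.

```java
class AveragedPerceptronSketch {
    private final double[] weights;  // current Perceptron weights
    private final double[] sum;      // sum of the weight vector after each example
    private int examples = 0;
    private final double learningRate;
    private final double threshold;

    AveragedPerceptronSketch(int dimensions, double learningRate, double threshold) {
        this.weights = new double[dimensions];
        this.sum = new double[dimensions];
        this.learningRate = learningRate;
        this.threshold = threshold;
    }

    private static double dot(double[] w, double[] x) {
        double s = 0.0;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return s;
    }

    void learn(double[] x, boolean label) {
        boolean prediction = dot(weights, x) >= threshold;
        if (prediction != label) {
            double rate = label ? learningRate : -learningRate;
            for (int i = 0; i < weights.length; i++) weights[i] += rate * x[i];
        }
        // Accumulate the weights after every example, whether or not they changed.
        for (int i = 0; i < weights.length; i++) sum[i] += weights[i];
        examples++;
    }

    double[] averagedWeights() {
        double[] avg = new double[sum.length];
        for (int i = 0; i < sum.length; i++) avg[i] = examples == 0 ? 0.0 : sum[i] / examples;
        return avg;
    }

    boolean predict(double[] x) {
        return dot(averagedWeights(), x) >= threshold;
    }

    public static void main(String[] args) {
        AveragedPerceptronSketch p = new AveragedPerceptronSketch(1, 1.0, 0.5);
        p.learn(new double[]{1.0}, true);   // mistake: weight becomes 1
        p.learn(new double[]{1.0}, true);   // correct: weight stays 1
        p.learn(new double[]{1.0}, false);  // mistake: weight becomes 0
        System.out.println(p.averagedWeights()[0]); // average of 1, 1, 0
    }
}
```

Because a weight vector that survives many examples is added to the sum many times, it dominates the average, which is the sense in which averaging approximates voting.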

5.2.5 lbjava.learn.SparseWinnow

This learner extends class LinearThresholdUnit (see Section 5.2.2). It represents its linear function for score computation as a vector of weights corresponding to features. It has a multiplicative update rule, meaning that it promotes and demotes by multiplying an individual weight in the weight vector by a function of the corresponding feature. Finally, parameters such as its learning rates, threshold, and others described in the online Javadoc can be configured by the user.
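
A minimal sketch of a multiplicative update follows, under simplifying assumptions that are not taken from the library: features are Boolean (an example is the set of its active feature names), every weight starts at 1, and the names WinnowSketch, alpha (promotion rate), and beta (demotion rate) are chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WinnowSketch {
    private final Map<String, Double> weights = new HashMap<>();
    private final double alpha;     // promotion rate, greater than 1
    private final double beta;      // demotion rate, between 0 and 1
    private final double threshold;

    WinnowSketch(double alpha, double beta, double threshold) {
        this.alpha = alpha;
        this.beta = beta;
        this.threshold = threshold;
    }

    /** Weights of features not yet seen default to 1. */
    double weight(String feature) { return weights.getOrDefault(feature, 1.0); }

    double score(Set<String> active) {
        double sum = 0.0;
        for (String f : active) sum += weight(f);
        return sum;
    }

    boolean predict(Set<String> active) { return score(active) >= threshold; }

    // Multiplicative updates: scale the weight of every active feature.
    void promote(Set<String> active) { scale(active, alpha); }
    void demote(Set<String> active) { scale(active, beta); }

    private void scale(Set<String> active, double factor) {
        for (String f : active) weights.put(f, weight(f) * factor);
    }

    void learn(Set<String> active, boolean label) {
        boolean prediction = predict(active);
        if (label && !prediction) promote(active);
        else if (!label && prediction) demote(active);
    }

    public static void main(String[] args) {
        WinnowSketch w = new WinnowSketch(2.0, 0.5, 3.0);
        Set<String> x = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(w.score(x)); // prints 2.0
        w.learn(x, true);               // score 2.0 is below the threshold, so promote
        System.out.println(w.score(x)); // prints 4.0
    }
}
```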

5.2.6 lbjava.learn.SparseNetworkLearner

SparseNetworkLearner is a multi-class learner, meaning that it can learn to distinguish among two or more discrete label values when classifying an object. It is not necessary to know which label values are possible when employing this learner (i.e., it is not necessary for the label producing classifier specified in a learning classifier expression to be declared with a value list in its feature return type). Values that were never observed during training will never be predicted.

This learner creates a new LinearThresholdUnit for each label value it observes and trains each independently to predict true when its associated label value is the correct classification. When making a prediction on a new object, it produces the label value corresponding to the LinearThresholdUnit producing the highest score. The LinearThresholdUnit used may be selected by the programmer; if none is specified, the default is SparsePerceptron.

SparseNetworkLearner is the default discrete learner; if the programmer does not include a with clause in a learning classifier expression (see Section 4.1.2.6) of discrete feature return type, this learner is invoked with default parameters.
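
The one-unit-per-label scheme can be sketched as follows. This is an illustration under assumptions, not the library’s code: NetworkLearnerSketch is a hypothetical class, each unit is a bare Perceptron-style weight map with a learning rate of 1 and a threshold of 0, and features are Boolean.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class NetworkLearnerSketch {
    // One weight vector (a linear threshold unit) per observed label value.
    private final Map<String, Map<String, Double>> units = new LinkedHashMap<>();

    private static double score(Map<String, Double> w, Set<String> features) {
        double sum = 0.0;
        for (String f : features) sum += w.getOrDefault(f, 0.0);
        return sum;
    }

    void learn(Set<String> features, String label) {
        units.computeIfAbsent(label, k -> new HashMap<>()); // new unit for a new label
        for (Map.Entry<String, Map<String, Double>> unit : units.entrySet()) {
            boolean target = unit.getKey().equals(label);   // each unit learns "is it my label?"
            boolean prediction = score(unit.getValue(), features) > 0.0;
            if (target != prediction) {
                double delta = target ? 1.0 : -1.0;
                for (String f : features) unit.getValue().merge(f, delta, Double::sum);
            }
        }
    }

    /** Returns the label whose unit scores highest, or null before any training. */
    String predict(Set<String> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> unit : units.entrySet()) {
            double s = score(unit.getValue(), features);
            if (s > bestScore) {
                bestScore = s;
                best = unit.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NetworkLearnerSketch learner = new NetworkLearnerSketch();
        learner.learn(new HashSet<>(Arrays.asList("fur", "barks")), "dog");
        learner.learn(new HashSet<>(Arrays.asList("fur", "meows")), "cat");
        System.out.println(learner.predict(new HashSet<>(Arrays.asList("barks")))); // prints dog
        System.out.println(learner.predict(new HashSet<>(Arrays.asList("meows")))); // prints cat
    }
}
```

Since units are created only when a label is first seen, a label that never occurs in training has no unit and can never be predicted, matching the behavior described above.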

5.2.7 lbjava.learn.NaiveBayes

Naïve Bayes is a multi-class learner that uses prediction value counts and feature counts given a particular prediction value to select the most likely prediction value. It is not mistake driven, as LinearThresholdUnits are. The scores returned by its scores(Object) method are directly interpretable as empirical probabilities. It also has a smoothing parameter configurable by the user for dealing with features that were never encountered during training.
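
The counting scheme can be sketched as below. Several details are assumptions made for illustration and may differ from the library: NaiveBayesSketch is a hypothetical class, features are Boolean, scores are kept as log probabilities, and the smoothing parameter is used directly as the probability of a feature never seen with a given label.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class NaiveBayesSketch {
    private final Map<String, Integer> labelCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    private int examples = 0;
    private final double smoothing; // assumed probability of an unseen feature

    NaiveBayesSketch(double smoothing) { this.smoothing = smoothing; }

    /** Counting only; no mistake-driven update is involved. */
    void learn(Set<String> features, String label) {
        examples++;
        labelCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = featureCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String f : features) counts.merge(f, 1, Integer::sum);
    }

    /** log P(label) + sum of log P(feature | label); label must have been observed. */
    double logScore(Set<String> features, String label) {
        int n = labelCounts.get(label);
        double log = Math.log((double) n / examples);
        Map<String, Integer> counts = featureCounts.get(label);
        for (String f : features) {
            int c = counts.getOrDefault(f, 0);
            log += Math.log(c == 0 ? smoothing : (double) c / n);
        }
        return log;
    }

    String predict(Set<String> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : labelCounts.keySet()) {
            double s = logScore(features, label);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch(1e-6);
        nb.learn(new HashSet<>(Arrays.asList("free", "money")), "spam");
        nb.learn(new HashSet<>(Arrays.asList("meeting")), "ham");
        nb.learn(new HashSet<>(Arrays.asList("free", "lunch")), "ham");
        System.out.println(nb.predict(new HashSet<>(Arrays.asList("money"))));   // prints spam
        System.out.println(nb.predict(new HashSet<>(Arrays.asList("meeting")))); // prints ham
    }
}
```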

5.2.8 lbjava.learn.StochasticGradientDescent

Gradient descent is a batch learning algorithm for function approximation in which the learner tries to follow the gradient of the error function to the solution of minimal error. This implementation is a stochastic approximation to gradient descent in which the approximated function is assumed to have linear form.

StochasticGradientDescent is the default real learner; if the programmer does not include a with clause in a learning classifier expression (see Section 4.1.2.6) of real feature return type, this learner is invoked with default parameters.
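
For a linear function and squared error, a stochastic gradient step adjusts each weight in proportion to the prediction error on a single example. The sketch below illustrates that update; SgdSketch, its dense-array representation, and the explicit bias term are assumptions for illustration and are not taken from the library.

```java
class SgdSketch {
    private final double[] weights;
    private double bias = 0.0;
    private final double learningRate;

    SgdSketch(int dimensions, double learningRate) {
        this.weights = new double[dimensions];
        this.learningRate = learningRate;
    }

    /** Linear prediction: w . x + bias. */
    double predict(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) sum += weights[i] * x[i];
        return sum;
    }

    /** One stochastic step on the squared error of a single example. */
    void learn(double[] x, double y) {
        double error = y - predict(x);
        for (int i = 0; i < weights.length; i++) weights[i] += learningRate * error * x[i];
        bias += learningRate * error;
    }

    public static void main(String[] args) {
        // Fit y = 2x + 1 from four noise-free points.
        SgdSketch sgd = new SgdSketch(1, 0.05);
        for (int epoch = 0; epoch < 1000; epoch++) {
            for (int x = 0; x <= 3; x++) sgd.learn(new double[]{x}, 2.0 * x + 1.0);
        }
        System.out.println(sgd.predict(new double[]{4.0})); // close to 9
    }
}
```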

5.2.9 lbjava.learn.Normalizer

A normalizer is a method that takes a set of scores as input and modifies those scores so that they obey particular constraints. Class Normalizer is an abstract class with a single abstract method normalize(lbjava.classify.ScoreSet) (see Section 5.1.5) which is implemented by extending classes to define this “normalization.” For example:

  • lbjava.learn.Sigmoid: This Normalizer simply replaces each score s_i in the given ScoreSet with 1 / (1 + e^{-s_i}). After normalization, each score will be greater than 0 and less than 1.
  • lbjava.learn.Softmax: This Normalizer replaces each score with the fraction of its exponential out of the sum of all scores’ exponentials. More precisely, each score s_i is replaced by exp(s_i) / \sum_j exp(s_j). After normalization, each score will be positive and they will sum to 1.
  • lbjava.learn.IdentityNormalizer: This Normalizer simply returns the same scores it was passed as input.
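
The two non-trivial normalizations can be written out on plain arrays. This sketch does not use ScoreSet or the library’s Normalizer classes; NormalizerSketch is a hypothetical class, the sigmoid uses the standard logistic form 1 / (1 + e^{-s}), and subtracting the maximum score inside softmax is an added numerical-stability step that leaves the result unchanged.

```java
class NormalizerSketch {
    /** Replaces each score s with 1 / (1 + e^{-s}), independently of the others. */
    static double[] sigmoid(double[] scores) {
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = 1.0 / (1.0 + Math.exp(-scores[i]));
        }
        return out;
    }

    /** Replaces each score s_i with exp(s_i) / sum_j exp(s_j). */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max); // shifting by max avoids overflow
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(new double[]{0.0})[0]);      // prints 0.5
        System.out.println(softmax(new double[]{1.0, 1.0})[0]); // prints 0.5
    }
}
```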

5.2.10 lbjava.learn.WekaWrapper

The WekaWrapper class is meant to wrap instances of learners from the WEKA library of learning algorithms. The lbjava.learn.WekaWrapper class converts between the internal representations of LBJava and WEKA on the fly, so that the more extensive set of algorithms contained within WEKA can be applied to projects written in LBJava.

The WekaWrapper class extends lbjava.learn.Learner, and carries all of the functionality that can be expected from a learner. A standard invocation of WekaWrapper could look something like this:

new WekaWrapper(new weka.classifiers.bayes.NaiveBayes())

Restrictions

  • It is crucial to note that WEKA learning algorithms do not learn online. Therefore, whenever the learn method of the WekaWrapper is called, no learning actually takes place. Rather, the input object is added to a collection of examples for the algorithm to learn once the doneLearning() method is called.
  • The WekaWrapper only supports features which are either discrete without a value list, discrete with a value list, or real. In WEKA, these correspond to weka.core.Attribute objects of type string, nominal, and numeric, respectively. In particular, array-producing classifiers and feature generators may not be used as features for a learning classifier learned with this class. See Section 4.1.1 for further discussion of classifier declarations.
  • When designing a learning classifier which will use a learning algorithm from WEKA, it is important to note that very few algorithms in the WEKA library support string attributes. In LBJava, this means that it will be very hard to find a learning algorithm which will learn using a discrete feature extractor which does not have a value list. That is, value lists should be provided for discrete feature extracting classifiers whenever possible.
  • Feature pre-extraction must be enabled in order to use the WekaWrapper class. Feature pre-extraction is enabled by using the preExtract clause in the LearningClassifierExpression (discussed in Section 4.1.2.6).

5.3 lbjava.infer

The lbjava.infer package contains many classes. The great majority of these classes form the internal representation of both propositional and first order constraint expressions and are used internally by LBJava’s inference infrastructure. Only the programmer who designs his own inference algorithm in terms of constraints needs to familiarize himself with these classes. Detailed descriptions of them are provided in the Javadoc.

There are a few classes, however, that are of broader interest. First, the Inference class is an abstract class from which all inference algorithms implemented for LBJava are derived. It is described below along with the particular algorithms that have already been implemented. Finally, the InferenceManager class is used internally by the library when applications using inference are running.

5.3.1 lbjava.infer.Inference

Inference is an abstract class from which all inference algorithms are derived. Executing an inference generally evaluates all the learning classifiers involved on the objects they have been applied to in the constraints, as well as picking new values for their predictions so that the constraints are satisfied. An object of this class keeps track of all the information necessary to perform inference in addition to the information produced by it. Once that inference has been performed, constrained classifiers access the results through this class’s interface to determine what their constrained predictions are. This is done through the valueOf(lbjava.learn.Learner, Object) method described below.

  • String valueOf(lbjava.learn.Learner, Object): The arguments to this method are objects representing a learning classifier and an object involved in the inference. Calling this method causes the inference algorithm to run, if it has not been run before. This method then returns the new prediction corresponding to the given learner and object after constraints have been resolved.

5.3.2 lbjava.infer.GLPK

This inference algorithm, which may be named in the with clause of the inference syntax, uses Integer Linear Programming (ILP) to maximize the expected number of correct predictions while respecting the constraints. Upon receiving the constraints represented as First Order Logic (FOL) formulas, this implementation first translates those formulas to a propositional representation. The resulting propositional expression is then translated to a set of linear inequalities by recursively translating subexpressions into sets of linear inequalities that bound newly created variables to take their place.

The number of linear inequalities and extra variables generated is linear in the depth of the tree formed by the propositional representation of the constraints. This tree is not binary; instead, nodes representing operators that are associative and commutative such as conjunction and disjunction have multiple children and are not allowed to have children representing the same operator (i.e., when they do, they are collapsed into the parent node). So both the number of linear inequalities and the number of extra variables created will be relatively low. However, the performance of any ILP algorithm is very sensitive to both these numbers, since ILP is NP-hard. On a 3 GHz machine, the programmer will still do well to keep both these numbers under 20,000 for any given instance of the inference problem.
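
As an illustration of how a subexpression can be replaced by a new variable bounded by linear inequalities, consider a disjunction z = x_1 ∨ ... ∨ x_n over 0/1 variables. A standard textbook encoding (not necessarily the one this implementation generates) uses the inequalities z ≥ x_i for every i together with z ≤ x_1 + ... + x_n. The hypothetical sketch below checks by enumeration that, for three inputs, these inequalities admit exactly the assignments in which z equals the disjunction.

```java
class DisjunctionEncodingSketch {
    /** True if (z, x) satisfies z >= x_i for all i and z <= sum of x_i. */
    static boolean satisfies(int z, int[] x) {
        int sum = 0;
        for (int xi : x) {
            if (z < xi) return false;
            sum += xi;
        }
        return z <= sum;
    }

    public static void main(String[] args) {
        boolean exact = true;
        for (int bits = 0; bits < 8; bits++) {
            int[] x = {bits & 1, (bits >> 1) & 1, (bits >> 2) & 1};
            int or = bits != 0 ? 1 : 0; // truth value of x_1 or x_2 or x_3
            for (int z = 0; z <= 1; z++) {
                // The inequalities should hold exactly when z equals the disjunction.
                if (satisfies(z, x) != (z == or)) exact = false;
            }
        }
        System.out.println(exact); // prints true
    }
}
```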

The resulting ILP problem is then solved by the GNU Linear Programming Kit (GLPK), a linear programming library written in C. This software must be downloaded and installed separately before installing LBJava, or the GLPK inference algorithm will be disabled. If LBJava has already been installed, it must be reconfigured and reinstalled (see Section 6.1) after installing GLPK.

5.4 lbjava.parse

This package contains the very simple Parser interface, implementers of which are used in conjunction with learning classifier expressions in an LBJava source file when off-line training is desired (see Section 4.1.2.6). It also contains some general purpose internal representations which may be of interest to a programmer who has not yet written the internal representations or parsers for the application.

5.4.1 lbjava.parse.Parser

The compiler is capable of automatically training a learning classifier given training data, so long as that training data comes in the form of objects ready to be passed to the learner’s learn(Object) method. Any class that implements the Parser interface can be utilized by the compiler to provide those training objects. This interface simply consists of a single method for returning another object:

  • Object next(): This is the only method that an implementing class needs to define. It returns the next training Object until no more are available, at which point it returns null.
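
The contract of next() can be shown with a small stand-in. The nested Parser interface below is a local placeholder that mirrors only the single method described above (the library’s interface may declare more); ParserSketch and ListParser are hypothetical names.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ParserSketch {
    /** Local stand-in for the interface described in the text. */
    interface Parser {
        Object next();
    }

    /** Serves examples from an in-memory list, then returns null forever. */
    static class ListParser implements Parser {
        private final Iterator<?> examples;

        ListParser(List<?> examples) { this.examples = examples.iterator(); }

        @Override
        public Object next() {
            return examples.hasNext() ? examples.next() : null;
        }
    }

    public static void main(String[] args) {
        Parser parser = new ListParser(Arrays.asList("first", "second"));
        // The usual consumption loop: stop at the first null.
        for (Object example = parser.next(); example != null; example = parser.next()) {
            System.out.println(example);
        }
    }
}
```

One consequence of using null as the end marker is that a parser cannot return null as a legitimate training object.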

5.4.2 lbjava.parse.LineByLine

This abstract class extends Parser but does not implement the next() method. It does, however, define a constructor that opens the file with the specified name and a readLine() method that fetches the next line of text from that file. Exceptions (as may result from not being able to open or read from the file) are automatically handled by printing an error message and exiting the application.

5.4.3 lbjava.parse.ChildrenFromVectors

This parser internally calls a user-specified Parser that returns LinkedVectors (see Section 5.4.6) and returns the LinkedChildren (see Section 5.4.5) of each vector one at a time through its next() method. One notable Parser that returns LinkedVectors is lbjava.nlp.WordSplitter, discussed in Section 5.5.2.

5.4.4 lbjava.parse.FeatureVectorParser

This parser is used internally by the compiler (and may be used by the programmer as well) to continue training the learning classifier after the first round of training without incurring the cost of feature extraction. See Section 4.1.2.6 for more information on LBJava’s behavior when the programmer specifies multiple training rounds. That section describes how lexicon and example files are produced, and these files become the input to FeatureVectorParser.

The objects produced by FeatureVectorParser will be FeatureVectors, which are not normally the input to any classifier, including the learning classifier we’d like to continue training. So, the programmer must first replace the learning classifier’s feature extractor with a FeatureVectorReturner and its labeler with a LabelVectorReturner (see Section 5.1.7) before calling learn(Object). After the new training objects have been exhausted, the original feature extractor and labeler must be restored before finally calling save().

For example, if a learning classifier named MyTagger has been trained for multiple rounds by the compiler, the lexicon and example file will be created with the names MyTagger.lex and MyTagger.ex respectively. Then the following code in an application will continue training the classifier for an additional round:

// Remember the original extractor and labeler, then substitute the vector returners.
MyTagger tagger = new MyTagger();
Classifier extractor = tagger.getExtractor();
tagger.setExtractor(new FeatureVectorReturner());
Classifier labeler = tagger.getLabeler();
tagger.setLabeler(new LabelVectorReturner());
// Train on each pre-extracted example until the parser returns null.
FeatureVectorParser parser = new FeatureVectorParser("MyTagger.ex", "MyTagger.lex");
for (Object vector = parser.next(); vector != null; vector = parser.next())
    tagger.learn(vector);
// Restore the original extractor and labeler before saving.
tagger.setExtractor(extractor);
tagger.setLabeler(labeler);
tagger.save();

5.4.5 lbjava.parse.LinkedChild

Together with LinkedVector, discussed next, this class forms the basis for a simple, general purpose internal representation for raw data. LinkedChild is an abstract class containing pointers to two other LinkedChildren, the “previous” one and the “next” one. It may also store a pointer to its parent, which is a LinkedVector. Constructors that set up all these links are also provided, simplifying the implementation of the parser.

5.4.6 lbjava.parse.LinkedVector

A LinkedVector contains any number of LinkedChildren and provides random access to them in addition to the serial access provided by their links. It also provides methods for insertion and removal of children. A LinkedVector is itself also a LinkedChild, so that hierarchies are easy to construct when sub-classing these two classes.
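
The relationship between the two classes can be sketched with simplified stand-ins. ChildNode and VectorNode below are hypothetical names, not the library classes; they model only the previous, next, and parent links, appending, and indexed access.

```java
import java.util.ArrayList;
import java.util.List;

class LinkedSketch {
    /** Stand-in for a child: linked to its neighbors and to its parent. */
    static class ChildNode {
        ChildNode previous;
        ChildNode next;
        VectorNode parent;
        final String text;

        ChildNode(String text) { this.text = text; }
    }

    /** Stand-in for a vector: a child that also holds an indexable list of children. */
    static class VectorNode extends ChildNode {
        private final List<ChildNode> children = new ArrayList<>();

        VectorNode(String text) { super(text); }

        /** Appends a child and wires up its previous, next, and parent links. */
        void add(ChildNode child) {
            if (!children.isEmpty()) {
                ChildNode last = children.get(children.size() - 1);
                last.next = child;
                child.previous = last;
            }
            child.parent = this;
            children.add(child);
        }

        ChildNode get(int index) { return children.get(index); } // random access
        int size() { return children.size(); }
    }

    public static void main(String[] args) {
        VectorNode document = new VectorNode("document");
        VectorNode sentence = new VectorNode("sentence");
        document.add(sentence); // a vector is itself a child, so hierarchies nest
        for (String word : new String[]{"The", "dog", "barks"}) sentence.add(new ChildNode(word));
        System.out.println(sentence.get(1).previous.text); // prints The
        System.out.println(sentence.get(1).next.text);     // prints barks
        System.out.println(sentence.parent.text);          // prints document
    }
}
```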

5.5 lbjava.nlp

The programmer of Natural Language Processing (NLP) applications may find the internal representations and parsing algorithms implemented in this package useful. There are representations of words, sentences, and documents, as well as parsers of some common file formats and algorithms for word and sentence segmentation.

5.5.1 Internal Representations

These classes may be used to represent the elements of a natural language document.

  • lbjava.nlp.Word: This simple representation of a word extends the LinkedChild class (see Section 5.4.5) and has space for its spelling and part of speech tag.
  • lbjava.nlp.Sentence: Objects of the Sentence class store only the full text of the sentence in a single String. However, a method is provided to heuristically split that text into Word objects contained in a LinkedVector.
  • lbjava.nlp.NLDocument: Extended from LinkedVector, this class has a constructor that takes the full text of a document as input. Using the methods in Sentence and SentenceSplitter, it creates a hierarchical representation of a natural language document in which Words are contained in LinkedVectors representing sentences which are contained in this LinkedVector.
  • lbjava.nlp.POS: This class may be used to represent a part of speech, but it is used more frequently to simply retrieve information about the various parts of speech made standard by the Penn Treebank project (Marcus, Santorini, & Marcinkiewicz, 1994).

5.5.2 Parsers

The classes listed in this section are all derived from class LineByLine (see Section 5.4.2). They all contain (at least) a constructor that takes a single String representing the name of a file as input. The objects they return are retrieved through the overridden next() method.

  • lbjava.nlp.SentenceSplitter: Use this Parser to separate sentences out from plain text. The class provides two constructors, one for splitting sentences out of a plain text file, and the other for splitting sentences out of plain text already stored in memory in a String[]. The user can then retrieve Sentences one at a time with the next() method, or all at once with the splitAll() method. The returned Sentences’ start and end fields represent offsets into the text they were extracted from. Every character in between those two offsets inclusive, including extra spaces, newlines, etc., is included in the Sentence as it appeared in the paragraph. (If the constructor taking a String[] as an argument is used, newline characters are inserted into the returned sentences to indicate transitions from one element of the array to the next.)
  • lbjava.nlp.WordSplitter: This parser takes the plain, unannotated Sentences (see Section 5.5.1) returned by another parser (e.g., SentenceSplitter) and splits them into Word objects. Entire sentences now represented as LinkedVectors (see Section 5.4.6) are then returned one at a time by calls to the next() method.
  • lbjava.nlp.ColumnFormat: This parser returns a String[] representing the rows of a file in column format. The input file is assumed to contain fields of non-whitespace characters separated by any amount of whitespace, one line of which is commonly used to represent a word in a corpus. This parser breaks a given line into one String per field, omitting all of the whitespace. A common usage of this class will be in extending it to create a new Parser that calls super.next() and creates a more interesting internal representation with the results.
  • lbjava.nlp.POSBracketToVector: Use this parser to return LinkedVector objects representing sentences given file names of POS bracket form files to parse. These files are expected to have one sentence per line, and the format of each line is as follows:
 (pos1 spelling1) (pos2 spelling2) ... (posn spellingn)

It is also expected that there will be exactly one space between a part of speech and the corresponding spelling and between a closing parenthesis and an opening parenthesis.
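
The two line formats described above are simple enough to parse with the standard library. The sketch below is an illustration only and does not reproduce the library’s parsers: NlpFormatSketch, parseBracketLine, and splitColumns are hypothetical names, and the regular expression assumes the exact single-space layout just described.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NlpFormatSketch {
    // One "(pos spelling)" pair with exactly one space between the two fields.
    private static final Pattern PAIR = Pattern.compile("\\((\\S+) (\\S+)\\)");

    /** Returns one {pos, spelling} array per word of a POS bracket form line. */
    static List<String[]> parseBracketLine(String line) {
        List<String[]> words = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            words.add(new String[]{m.group(1), m.group(2)});
        }
        return words;
    }

    /** Splits a column-format line into fields, discarding all whitespace. */
    static String[] splitColumns(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        for (String[] word : parseBracketLine("(DT The) (NN dog) (VBZ barks)")) {
            System.out.println(word[1] + "/" + word[0]);
        }
        System.out.println(String.join("|", splitColumns("  0  The   DT ")));
    }
}
```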