5. The Library
The programming framework is supported by a library of interfaces, learning algorithms, and implementations of the building blocks described in Chapter 4. This chapter gives a general overview of each of those components.
The library is currently organized into five packages. lbjava.classify contains classes related
to features and classification. lbjava.learn contains learner implementations and supporting
classes. lbjava.infer contains inference algorithm implementations and internal representations
for constraints and inference structures. lbjava.parse contains the Parser interface and some
general purpose internal representation classes. Finally, lbjava.nlp contains some basic natural
language processing internal representations and parsing routines. In the future, we plan to
expand this library, adding more varieties of learners and domain specific parsers and internal
representations.
5.1 lbjava.classify
The most important class in LBJava’s library is lbjava.classify.Classifier. This abstract class
is the interface through which the application accesses the classifiers defined in the source
file. However, the programmer will generally only need to become familiar with a few
of the methods defined there.
One other class that may be of broad interest is the lbjava.classify.TestDiscrete class
(discussed in Section 5.1.8), which can automate the performance evaluation of a discrete learning
classifier on a labeled test set. The other classes in this package are designed mainly for internal
use by LBJava’s compiler and can be safely ignored by the casual user. More advanced users, those
who write their own learners or inference algorithms in the application, for instance, will need to
become familiar with them.
5.1.1 lbjava.classify.Classifier
Every classifier declaration in an LBJava source file is translated by the LBJava compiler into a Java class that extends this class. When the programmer wants to call a classifier in the application, he creates an object of his classifier’s class using its zero argument constructor and calls an appropriate method on that object. The appropriate method will most likely be one of the following four methods:
- String discreteValue(Object): This method will only be overridden in the classifier’s implementation if its feature return type is discrete. Its return value is the value of the single feature this classifier returns.
- double realValue(Object): This method will only be overridden in the classifier’s implementation if its feature return type is real. Its return value is the value of the single feature this classifier returns.
- String[] discreteValueArray(Object): This method will only be overridden in the classifier’s implementation if its feature return type is discrete[]. Its return value contains the values of all the features this classifier returns.
- double[] realValueArray(Object): This method will only be overridden in the classifier’s implementation if its feature return type is real[]. Its return value contains the values of all the features this classifier returns.
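As a quick illustration of this calling convention, consider the following sketch. PosTagger, Word, and the tagging rule are all hypothetical stand-ins; the real class would be generated by the LBJava compiler and would extend lbjava.classify.Classifier.

```java
// Hedged sketch: PosTagger and Word are hypothetical stand-ins for a
// generated classifier class and its input type.
public class DiscreteValueDemo {
    static class Word {
        String spelling;
        Word(String spelling) { this.spelling = spelling; }
    }

    // A classifier whose feature return type is `discrete` overrides
    // discreteValue(Object) to return its single feature's value.
    static class PosTagger {
        public PosTagger() { }  // generated classifiers have zero-argument constructors

        public String discreteValue(Object example) {
            Word w = (Word) example;
            return w.spelling.endsWith("ing") ? "VBG" : "NN";
        }
    }

    public static void main(String[] args) {
        PosTagger tagger = new PosTagger();
        System.out.println(tagger.discreteValue(new Word("running")));
        System.out.println(tagger.discreteValue(new Word("dog")));
    }
}
```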
There is no method similar to the four above for accessing the values of features produced
by a feature generator, since those values are meaningless without their associated names. When
the programmer wants access to the actual features produced by any classifier (not just feature
generators), the following non-static method is used. Note, however, that the main purpose of
this method is for internal use by the compiler
(One circumstance where the programmer may be interested in this method is to print out the String representation
of the returned FeatureVector.)
FeatureVector classify(Object): This method is overridden in every classifier implementation generated by the compiler. It returns a FeatureVector which may be iterated through to access individual features (see Section 5.1.3).
Every classifier implementation generated by the compiler overrides the following non-static member methods as well. They provide type information about the implemented classifier.
- String getInputType(): This method returns a String containing the fully qualified name of the class this classifier expects as input.
- String getOutputType(): This method returns a String containing the feature return type of this classifier. If the classifier is discrete and declares a list of allowable values, that list will not appear in the output of this method.
- String[] allowableValues(): If the classifier is discrete and declares a list of allowable values, that list will be returned by this method. Otherwise, an array of length zero is returned. Learners that require a particular number of allowable values may return an array filled with "*" whose length indicates that number.
Finally, class Classifier provides a simple static method for testing the agreement of two
classifiers. It’s convenient, for instance, when testing the performance of a learned classifier
against an oracle classifier.
double test(Classifier, Classifier, Object[]): This static method returns the fraction of objects in the third argument that produced the same classifications from the two argument Classifiers.
There are several other methods of this class described in the Javadoc documentation. They are omitted here since the programmer is not expected to need them.
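To illustrate what the agreement test computes, here is a self-contained sketch. The lambdas stand in for a learned classifier and an oracle, and agreement mirrors (but is not) the library’s test method.

```java
import java.util.function.Function;

// Hedged sketch of what Classifier.test computes: the fraction of objects
// on which two classifiers agree. The classifiers here are toy lambdas.
public class AgreementDemo {
    static double agreement(Function<Object, String> a,
                            Function<Object, String> b, Object[] examples) {
        int same = 0;
        for (Object x : examples)
            if (a.apply(x).equals(b.apply(x))) same++;
        return (double) same / examples.length;
    }

    public static void main(String[] args) {
        Function<Object, String> learned = x -> ((Integer) x) % 2 == 0 ? "even" : "odd";
        Function<Object, String> oracle  = x -> ((Integer) x) < 3 ? "even" : "odd";
        Object[] data = { 0, 1, 2, 3 };
        // The two classifiers agree on 0, 2, and 3 but disagree on 1.
        System.out.println(agreement(learned, oracle, data));
    }
}
```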
5.1.2 lbjava.classify.Feature
This abstract class is part of the representation of the value produced by a classifier. In particular, the name of a feature, but not its value, is stored here. Classes derived from this class (described below) provide storage for the value of the feature. This class exists mainly for internal use by the compiler, and most programmers will not need to be familiar with it.
- lbjava.classify.DiscreteFeature: The value of a feature returned by a discrete classifier is stored as a String in objects of this class.
- lbjava.classify.DiscreteArrayFeature: The String value of a feature returned by a discrete[] classifier, as well as its integer index into the array, are stored in objects of this class.
- lbjava.classify.RealFeature: The value of a feature returned by a real classifier is stored as a double in objects of this class.
- lbjava.classify.RealArrayFeature: The double value of a feature returned by a real[] classifier, as well as its integer index into the array, are stored in objects of this class.
5.1.3 lbjava.classify.FeatureVector
FeatureVector is a linked-list-style container which stores features that function as labels separately
from other features. It contains methods for iterating through the features and labels and
adding more of either. Its main function is as the return value of the Classifier#classify(Object)
method which is used internally by the compiler (see Section 5.1.1). Most programmers will
not need to become intimately familiar with this class.
5.1.4 lbjava.classify.Score
This class represents the double score produced by a discrete learning classifier in association
with one of its String prediction values. Both items are stored in an object of this class.
This class is used internally by LBJava’s inference infrastructure, which interprets the score as
an indication of how much the learning classifier prefers the associated prediction value, higher
scores indicating stronger preference.
5.1.5 lbjava.classify.ScoreSet
This is another class used internally by LBJava’s inference infrastructure. An object of this class is
intended to contain one Score for each possible prediction value a learning classifier is capable
of returning.
5.1.6 lbjava.classify.ValueComparer
This simple class derived from Classifier is used to convert a multi-value discrete classifier
into a Boolean classifier that returns true if and only if the multi-valued classifier evaluated
to a particular value. ValueComparer is used internally by SparseNetworkLearner (see Section
5.2.6).
5.1.7 Vector Returners
The classes lbjava.classify.FeatureVectorReturner and
lbjava.classify.LabelVectorReturner are used internally by the compiler to help implement
the training procedure when the programmer specifies multiple training rounds (see Section
4.1.2.6). A feature vector returner is substituted as the learning classifier’s feature extraction
classifier, and a label vector returner is substituted as the learning classifier’s labeler (see Section
5.2.1 to see how this substitution is performed). Each of them then expects the object received as
input by the learning classifier to be a FeatureVector, which is not normally the case. However,
as will be described in Section 5.4.4, the programmer may still be interested in these classes if he
wishes to continue training a learning classifier for additional rounds on the same data without
incurring the costs of performing feature extraction.
5.1.8 lbjava.classify.TestDiscrete
This class can be quite useful to quickly evaluate the performance of a newly learned classifier on labeled testing data. It operates either as a stand-alone program or as a class that may be imported into an application for more tailored use. In either case, it will automatically compute accuracy, precision, recall, and F1 scores for the learning classifier in question.
To use this class inside an application, simply instantiate an object of it using the no-argument
constructor. Let’s call this object tester. Then, each time the learning classifier makes
a prediction p for an object whose true label is l, make the call tester.reportPrediction(p, l). Once all testing objects have been processed, the printPerformance(java.io.PrintStream)
method may be used to print a table of results, or the programmer may make use of the various
other methods provided by this class to retrieve the computed statistics. More detailed usage of
all these methods as well as the operation of this class as a stand-alone program is available in
the on-line Javadoc.
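The bookkeeping that TestDiscrete automates can be sketched as follows. This is an illustrative reimplementation, not the library class itself; the method names mimic the real API, but the internals are assumptions.

```java
import java.util.*;

// Hedged sketch of the statistics TestDiscrete computes: report
// (prediction, label) pairs, then derive accuracy and per-label
// precision, recall, and F1.
public class TestDiscreteSketch {
    Map<String, Integer> tp = new HashMap<>(), predicted = new HashMap<>(),
                         actual = new HashMap<>();
    int correct = 0, total = 0;

    void reportPrediction(String p, String l) {
        predicted.merge(p, 1, Integer::sum);
        actual.merge(l, 1, Integer::sum);
        if (p.equals(l)) { tp.merge(p, 1, Integer::sum); correct++; }
        total++;
    }

    double accuracy() { return (double) correct / total; }
    // Default denominator of 1 avoids division by zero for unseen labels.
    double precision(String v) { return (double) tp.getOrDefault(v, 0) / predicted.getOrDefault(v, 1); }
    double recall(String v) { return (double) tp.getOrDefault(v, 0) / actual.getOrDefault(v, 1); }
    double f1(String v) {
        double p = precision(v), r = recall(v);
        return p + r == 0 ? 0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        TestDiscreteSketch tester = new TestDiscreteSketch();
        String[][] pairs = { {"NN", "NN"}, {"VB", "NN"}, {"VB", "VB"}, {"NN", "NN"} };
        for (String[] pl : pairs) tester.reportPrediction(pl[0], pl[1]);
        System.out.println(tester.accuracy());  // 3 of 4 predictions correct
    }
}
```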
5.2 lbjava.learn
The programmer will want to familiarize himself with most of the classes in this package, in
particular those that are derived from the abstract class lbjava.learn.Learner. These are the
learners that may be selected from within an LBJava source file in association with a learning
classifier expression.
5.2.1 lbjava.learn.Learner
Learner is an abstract class extending the abstract class Classifier (see Section 5.1.1). It acts
as an interface between learning classifiers defined in an LBJava source file and applications that
make on-line use of their learning capabilities. The class generated by the LBJava compiler when
translating a learning classifier expression will always indirectly extend this class.
In addition to the methods inherited from Classifier, this class defines the following non-static,
learning related methods. These are not the only methods defined in class Learner, and
advanced users may be interested in perusing the Javadoc for descriptions of other methods.
- void learn(Object): The programmer may call this method at any time from within the application to continue the training process given a single example object. The most common use of this method will be in conjunction with a supervised learning algorithm, in which case, of course, the true label of the example object must be accessible by the label classifier specified in the learning classifier expression in the source file. Note that changes made via this method will not persist beyond the current execution of the application unless the save() method (discussed below) is invoked.
- void doneLearning(): Some learning algorithms (primarily off-line learning algorithms) save part of their computation until after all training objects have been observed. This method informs the learning algorithm that it is time to perform that part of the computation. When compile-time training is indicated in a learning classifier expression, the compiler will call this method after training is complete. Similarly, the programmer who performs on-line learning in his application may need to call this method as well, depending on the learning algorithm.
- void forget(): The user may call this method from the application to reinitialize the learning classifier to the state it was in before any training was performed. Note that changes made via this method will not persist beyond the current execution of the application unless the save() method (discussed below) is invoked.
- void save(): As described in Section 4.1.1, the changes made while training a classifier on-line in the application are immediately visible everywhere in the application. These changes are not written back to disk unless the save() method is invoked. Once this method is invoked, changes that have been made through on-line learning will become visible to subsequent executions of applications that invoke this learning classifier. Please note that the save() method currently will not work when the classifier’s byte code is packed in a jar file.
- lbjava.classify.ScoreSet scores(Object): This method is used internally by inference algorithms, which interpret the scores in the returned ScoreSet (see Section 5.1.5) as indications of which predictions the learning classifier prefers and how much they are preferred.
- lbjava.classify.Classifier getExtractor(): This method gives access to the feature extraction classifier used by this learning classifier.
- void setExtractor(lbjava.classify.Classifier): Use this method to change the feature extraction classifier used by this learning classifier. Note that this change will be remembered during subsequent executions of the application if the save() method (described above) is later invoked.
- lbjava.classify.Classifier getLabeler(): This method gives access to the classifier used by this learning classifier to produce labels for supervised learning.
- void setLabeler(lbjava.classify.Classifier): Use this method to change the labeler used by this learning classifier. Note that this change will be remembered during subsequent executions of the application if the save() method (described above) is later invoked.
- void write(java.io.PrintStream): This abstract method must be overridden by each extending learner implementation. A learning classifier derived from such a learner may then invoke this method to produce the learner’s internal representation in text form. Unlike the save() method, invoking this method does not make modifications to the learner’s internal representation visible to subsequent executions of applications that invoke this learning classifier.
In addition, the following static flag is declared in every learner output by the compiler.
- public static boolean isTraining: The isTraining variable can be used by the programmer to determine whether his learning classifier is currently being trained. This ability may be useful if, for instance, a feature extraction classifier for this learning classifier needs to alter its behavior depending on the availability of labeled training data. The compiler will automatically set this flag true during off-line training, and it will be initialized false in any application using the learning classifier. So, it becomes the programmer’s responsibility to make sure it is set appropriately if any additional on-line training is to be performed in the application.
5.2.2 lbjava.learn.LinearThresholdUnit
A linear threshold unit is a supervised, mistake driven learner for binary classification. The predictions made by such a learner are produced by computing a score for a given example object and then comparing that score to a predefined threshold. While learning, if the prediction does not match the label, the linear function that produced the score is updated. Linear threshold units form the basis of many other learning techniques.
Class LinearThresholdUnit is an abstract class defining a basic API for learners of this type.
A non-abstract class extending it need only provide implementations of the following abstract
methods.
- void promote(Object): This method makes an appropriate modification to the linear function when a mistake is made on a positive example (i.e., when the computed score mistakenly fell below the predefined threshold).
- void demote(Object): This method makes an appropriate modification to the linear function when a mistake is made on a negative example (i.e., when the computed score mistakenly rose above the predefined threshold).
When a learning classifier expression (see Section 4.1.2.6) employs a learner derived from this class, the specified label producing classifier must be defined as discrete with a value list containing exactly two values (See Section 4.1.1 for more information on value lists in feature return types.). The learner derived from this class will then learn to produce a higher score when the correct prediction is the second value in the value list.
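A minimal, self-contained sketch of a linear threshold unit with an additive update rule follows. The feature names, learning rate, and mistake-driven loop are illustrative; the real LinearThresholdUnit class is considerably more elaborate.

```java
import java.util.*;

// Hedged sketch of a linear threshold unit: a weight per feature, a
// threshold, and additive promote/demote updates on mistakes.
public class LtuSketch {
    Map<String, Double> weights = new HashMap<>();
    double threshold = 0, rate = 1;

    double score(Set<String> features) {
        double s = 0;
        for (String f : features) s += weights.getOrDefault(f, 0.0);
        return s;
    }

    // Called when the score mistakenly fell below the threshold on a
    // positive example: additively increase the active features' weights.
    void promote(Set<String> features) {
        for (String f : features) weights.merge(f, rate, Double::sum);
    }

    // Called when the score mistakenly met the threshold on a negative
    // example: additively decrease the active features' weights.
    void demote(Set<String> features) {
        for (String f : features) weights.merge(f, -rate, Double::sum);
    }

    // Mistake driven: update only when the prediction disagrees with the label.
    void learn(Set<String> features, boolean label) {
        boolean prediction = score(features) >= threshold;
        if (label && !prediction) promote(features);
        else if (!label && prediction) demote(features);
    }

    public static void main(String[] args) {
        LtuSketch ltu = new LtuSketch();
        for (int i = 0; i < 3; i++) {
            ltu.learn(new HashSet<>(Arrays.asList("suffix=ing")), true);
            ltu.learn(new HashSet<>(Arrays.asList("suffix=ed")), false);
        }
        System.out.println(ltu.score(new HashSet<>(Arrays.asList("suffix=ed"))));
    }
}
```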
5.2.3 lbjava.learn.SparsePerceptron
This learner extends class LinearThresholdUnit (see Section 5.2.2). It represents its linear
function for score computation as a vector of weights corresponding to features. It has an
additive update rule, meaning that it promotes and demotes by treating the collection of features
associated with a training object as a vector and using vector addition. Finally, parameters such
as its learning rate, threshold, the thick separator, and others described in the online Javadoc
can be configured by the user.
5.2.4 lbjava.learn.SparseAveragedPerceptron
Extended from SparsePerceptron (see Section 5.2.3), this learner computes an approximation of
voted Perceptron by averaging the weight vectors obtained after processing each training example.
Its configurable parameters are the same as those of SparsePerceptron, and, in particular, using
this algorithm in conjunction with a positive thickness for the thick separator can be particularly
effective.
5.2.5 lbjava.learn.SparseWinnow
This learner extends class LinearThresholdUnit (see Section 5.2.2). It represents its linear
function for score computation as a vector of weights corresponding to features. It has a multiplicative
update rule, meaning that it promotes and demotes by multiplying an individual weight
in the weight vector by a function of the corresponding feature. Finally, parameters such as its
learning rates, threshold, and others described in the online Javadoc can be configured by the
user.
5.2.6 lbjava.learn.SparseNetworkLearner
SparseNetworkLearner is a multi-class learner, meaning that it can learn to distinguish among
two or more discrete label values when classifying an object. It is not necessary to know which
label values are possible when employing this learner (i.e., it is not necessary for the label producing
classifier specified in a learning classifier expression to be declared with a value list in its
feature return type). Values that were never observed during training will never be predicted.
This learner creates a new LinearThresholdUnit for each label value it observes and trains
each independently to predict true when its associated label value is the correct classification.
When making a prediction on a new object, it produces the label value corresponding to the
LinearThresholdUnit producing the highest score. The LinearThresholdUnit used may be selected
by the programmer, or, if no specific learner is specified, the default is SparsePerceptron.
SparseNetworkLearner is the default discrete learner; if the programmer does not include a
with clause in a learning classifier expression (see Section 4.1.2.6) of discrete feature return type,
this learner is invoked with default parameters.
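The one-vs-all strategy described above can be sketched as follows. This is a toy reimplementation with Perceptron-style units and illustrative feature names, not the library class.

```java
import java.util.*;

// Hedged sketch of SparseNetworkLearner's strategy: one weight vector per
// observed label, each trained independently; prediction is the argmax.
public class NetworkSketch {
    // TreeMap keeps iteration order deterministic in this sketch.
    Map<String, Map<String, Double>> network = new TreeMap<>();

    double score(Map<String, Double> weights, Set<String> features) {
        double s = 0;
        for (String f : features) s += weights.getOrDefault(f, 0.0);
        return s;
    }

    // Each unit is trained independently to predict true exactly when its
    // label is the correct classification.
    void learn(Set<String> features, String label) {
        network.putIfAbsent(label, new HashMap<>());
        for (Map.Entry<String, Map<String, Double>> unit : network.entrySet()) {
            boolean positive = unit.getKey().equals(label);
            boolean prediction = score(unit.getValue(), features) >= 0;
            double update = positive && !prediction ? 1.0
                          : !positive && prediction ? -1.0 : 0.0;
            if (update != 0.0)
                for (String f : features) unit.getValue().merge(f, update, Double::sum);
        }
    }

    // Only labels observed during training can ever be predicted.
    String predict(Set<String> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> unit : network.entrySet()) {
            double s = score(unit.getValue(), features);
            if (s > bestScore) { bestScore = s; best = unit.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        NetworkSketch net = new NetworkSketch();
        for (int round = 0; round < 2; round++) {
            net.learn(new HashSet<>(Arrays.asList("ends-ing")), "VBG");
            net.learn(new HashSet<>(Arrays.asList("ends-s")), "NNS");
        }
        System.out.println(net.predict(new HashSet<>(Arrays.asList("ends-ing"))));
    }
}
```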
5.2.7 lbjava.learn.NaiveBayes
Naïve Bayes is a multi-class learner that uses prediction value counts and feature counts given
a particular prediction value to select the most likely prediction value. It is not mistake driven,
as LinearThresholdUnits are. The scores returned by its scores(Object) method are directly
interpretable as empirical probabilities. It also has a smoothing parameter configurable by the
user for dealing with features that were never encountered during training.
5.2.8 lbjava.learn.StochasticGradientDescent
Gradient descent is a batch learning algorithm for function approximation in which the learner tries to follow the gradient of the error function to the solution of minimal error. This implementation is a stochastic approximation to gradient descent in which the approximated function is assumed to have linear form.
StochasticGradientDescent is the default real learner; if the programmer does not include
a with clause in a learning classifier expression (see Section 4.1.2.6) of real feature return type,
this learner is invoked with default parameters.
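A self-contained sketch of the idea, fitting a one-weight linear function by stochastic gradient descent on squared error; the data set and learning rate are illustrative, not LBJava’s defaults.

```java
// Hedged sketch of stochastic gradient descent for a linear function:
// after each example, step the weight against the error gradient.
public class SgdSketch {
    static double train() {
        double w = 0, rate = 0.1;
        double[][] data = { {1, 2}, {2, 4}, {3, 6} };  // examples of y = 2x
        for (int epoch = 0; epoch < 100; epoch++)
            for (double[] xy : data) {
                double error = w * xy[0] - xy[1];  // prediction minus label
                w -= rate * error * xy[0];         // stochastic gradient step
            }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(train());  // converges toward 2.0
    }
}
```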
5.2.9 lbjava.learn.Normalizer
A normalizer is a method that takes a set of scores as input and modifies those scores so that they
obey particular constraints. Class Normalizer is an abstract class with a single abstract method
normalize(lbjava.classify.ScoreSet) (see Section 5.1.5) which is implemented by extending
classes to define this “normalization.” For example:
- lbjava.learn.Sigmoid: This Normalizer simply replaces each score s_i in the given ScoreSet with 1 / (1 + e^{s_i}). After normalization, each score will be greater than 0 and less than 1.
- lbjava.learn.Softmax: This Normalizer replaces each score with the fraction of its exponential out of the sum of all scores’ exponentials. More precisely, each score s_i is replaced by exp(s_i) / \sum_j exp(s_j). After normalization, each score will be positive, and all scores will sum to 1.
- lbjava.learn.IdentityNormalizer: This Normalizer simply returns the same scores it was passed as input.
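As an illustration of the softmax normalization above, here is a toy reimplementation over a plain score array (the library operates on ScoreSet objects instead).

```java
// Hedged sketch of softmax normalization: each score becomes its
// exponential's share of the sum of all exponentials.
public class NormalizerSketch {
    static double[] softmax(double[] scores) {
        double sum = 0;
        double[] out = new double[scores.length];
        for (double s : scores) sum += Math.exp(s);
        for (int i = 0; i < scores.length; i++)
            out[i] = Math.exp(scores[i]) / sum;
        return out;
    }

    public static void main(String[] args) {
        double[] p = softmax(new double[] { 1.0, 1.0 });
        // Equal scores normalize to equal probabilities.
        System.out.println(p[0] + " " + p[1]);
    }
}
```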
5.2.10 lbjava.learn.WekaWrapper
The WekaWrapper class is meant to wrap instances of learners from the WEKA library of learning
algorithms.
The lbjava.learn.WekaWrapper class converts between the internal representations
of LBJava and WEKA on the fly, so that the more extensive set of algorithms contained within
WEKA can be applied to projects written in LBJava.
The WekaWrapper class extends lbjava.learn.Learner, and carries all of the functionality that
can be expected from a learner. A standard invocation of WekaWrapper could look something
like this:
new WekaWrapper(new weka.classifiers.bayes.NaiveBayes())
Restrictions
- It is crucial to note that WEKA learning algorithms do not learn on-line. Therefore, whenever the learn method of the WekaWrapper is called, no learning actually takes place. Rather, the input object is added to a collection of examples for the algorithm to learn from once the doneLearning() method is called.
- The WekaWrapper only supports features which are either discrete without a value list, discrete with a value list, or real. In WEKA, these correspond to weka.core.Attribute objects of type String, Nominal, and Numerical, respectively. In particular, array producing classifiers and feature generators may not be used as features for a learning classifier learned with this class. See Section 4.1.1 for further discussion of classifier declarations.
- When designing a learning classifier which will use a learning algorithm from WEKA, it is important to note that very few algorithms in the WEKA library support String attributes. In LBJava, this means that it will be very hard to find a learning algorithm that will learn using a discrete feature extractor which does not have a value list. In other words, value lists should be provided for discrete feature extracting classifiers whenever possible.
- Feature pre-extraction must be enabled in order to use the WekaWrapper class. Feature pre-extraction is enabled by using the preExtract clause in the LearningClassifierExpression (discussed in Section 4.1.2.6).
5.3 lbjava.infer
The lbjava.infer package contains many classes. The great majority of these classes form the
internal representation of both propositional and first order constraint expressions and are used
internally by LBJava’s inference infrastructure. Only the programmer who designs his own inference
algorithm in terms of constraints needs to familiarize himself with these classes. Detailed
descriptions of them are provided in the Javadoc.
There are a few classes, however, that are of broader interest. First, the Inference class
is an abstract class from which all inference algorithms implemented for LBJava are derived. It
is described below along with the particular algorithms that have already been implemented.
Finally, the InferenceManager class is used internally by the LBJava library when applications
using inference are running.
5.3.1 lbjava.infer.Inference
Inference is an abstract class from which all inference algorithms are derived. Executing an
inference generally evaluates all the learning classifiers involved on the objects they have been
applied to in the constraints, as well as picking new values for their predictions so that the
constraints are satisfied. An object of this class keeps track of all the information necessary to
perform inference in addition to the information produced by it. Once that inference has been
performed, constrained classifiers access the results through this class’s interface to determine
what their constrained predictions are. This is done through the valueOf(lbjava.learn.Learner, Object)
method described below.
String valueOf(lbjava.learn.Learner, Object): The arguments to this method are objects representing a learning classifier and an object involved in the inference. Calling this method causes the inference algorithm to run, if it has not been run before. This method then returns the new prediction corresponding to the given learner and object after constraints have been resolved.
5.3.2 lbjava.infer.GLPK
This inference algorithm, which may be named in the with clause of the inference syntax,
uses Integer Linear Programming (ILP) to maximize the expected number of correct predictions
while respecting the constraints. Upon receiving the constraints represented as First Order Logic
(FOL) formulas, this implementation first translates those formulas to a propositional representation.
The resulting propositional expression is then translated to a set of linear inequalities by
recursively translating subexpressions into sets of linear inequalities that bound newly created
variables to take their place.
The number of linear inequalities and extra variables generated is linear in the depth of the tree formed by the propositional representation of the constraints. This tree is not binary; instead, nodes representing operators that are associative and commutative, such as conjunction and disjunction, have multiple children and are not allowed to have children representing the same operator (i.e., when they do, they are collapsed into the parent node). So both the number of linear inequalities and the number of extra variables created will be relatively low. However, the performance of any ILP algorithm is very sensitive to both these numbers, since ILP is NP-hard. On a 3 GHz machine, the programmer will still do well to keep both these numbers under 20,000 for any given instance of the inference problem.
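As a small worked illustration of this translation (the encoding below is a standard one; the exact inequalities generated may differ): a fresh 0/1 variable $y$ is introduced for each subexpression, and linear inequalities bind $y$ to the subexpression's truth value.

```latex
% Disjunction: y \equiv (a \lor b \lor c), with a, b, c, y \in \{0, 1\}.
% y = 0 forces a + b + c = 0; y = 1 forces at least one variable to 1.
y \le a + b + c \le 3y

% Conjunction: y \equiv (a \land b \land c).
% y = 1 forces a + b + c = 3; y = 0 forbids all three from being 1.
3y \le a + b + c \le 2 + y
```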
The resulting ILP problem is then solved by the GNU Linear Programming Kit (GLPK),
a linear programming library written in C. This software must be downloaded and installed
separately before installing LBJava, or the GLPK inference algorithm will be disabled. If LBJava has
already been installed, it must be reconfigured and reinstalled (see Section 6.1) after installing
GLPK.
5.4 lbjava.parse
This package contains the very simple Parser interface, implementers of which are used in
conjunction with learning classifier expressions in an LBJava source file when off-line training is
desired (see Section 4.1.2.6). It also contains some general purpose internal representations
which may be of interest to a programmer who has not yet written the internal representations
or parsers for the application.
5.4.1 lbjava.parse.Parser
The LBJava compiler is capable of automatically training a learning classifier given training data, so long as that training data comes in the form of objects ready to be passed to the learner’s learn(Object) method. Any class that implements the Parser interface can be utilized by the compiler to provide those training objects. This interface simply consists of a single method for returning another object:
Object next(): This is the only method that an implementing class needs to define. It returns the next training Object until no more are available, at which point it returns null.
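A minimal hypothetical implementation over an in-memory array illustrates the contract; real parsers typically read their training objects from a file instead.

```java
// Hedged sketch of the Parser contract: next() returns training objects
// one at a time, then null when the data is exhausted. In real use this
// class would implement lbjava.parse.Parser.
public class ArrayParser {
    private final Object[] examples;
    private int cursor = 0;

    public ArrayParser(Object[] examples) { this.examples = examples; }

    // Returns the next training object, or null when no more are available.
    public Object next() {
        return cursor < examples.length ? examples[cursor++] : null;
    }

    public static void main(String[] args) {
        ArrayParser parser = new ArrayParser(new Object[] { "a", "b" });
        for (Object o = parser.next(); o != null; o = parser.next())
            System.out.println(o);
    }
}
```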
5.4.2 lbjava.parse.LineByLine
This abstract class extends Parser but does not implement the next() method. It does, however,
define a constructor that opens the file with the specified name and a readLine() method that
fetches the next line of text from that file. Exceptions (as may result from not being able to
open or read from the file) are automatically handled by printing an error message and exiting
the application.
5.4.3 lbjava.parse.ChildrenFromVectors
This parser calls a user specified, LinkedVector (see Section 5.4.6) returning Parser internally
and returns the LinkedChildren (see Section 5.4.5) of that vector one at a time through its
next() method. One notable LinkedVector returning Parser is lbjava.nlp.WordSplitter discussed
in Section 5.5.2.
5.4.4 lbjava.parse.FeatureVectorParser
This parser is used internally by the compiler (and may be used by the programmer as well)
to continue training the learning classifier after the first round of training without incurring the
cost of feature extraction. See Section 4.1.2.6 for more information on LBJava’s behavior when the
programmer specifies multiple training rounds. That section describes how lexicon and example
files are produced, and these files become the input to FeatureVectorParser.
The objects produced by FeatureVectorParser will be FeatureVectors, which are not normally
the input to any classifier, including the learning classifier we’d like to continue training.
So, the programmer must first replace the learning classifier’s feature extractor with a
FeatureVectorReturner and its labeler with a LabelVectorReturner (see Section 5.1.7) before
calling learn(Object). After the new training objects have been exhausted, the original feature
extractor and labeler must be restored before finally calling save().
For example, if a learning classifier named MyTagger has been trained for multiple rounds by
the LBJava compiler, the lexicon and example files will be created with the names MyTagger.lex
and MyTagger.ex respectively. Then the following code in an application will continue training
the classifier for an additional round:
MyTagger tagger = new MyTagger();
Classifier extractor = tagger.getExtractor();
tagger.setExtractor(new FeatureVectorReturner());
Classifier labeler = tagger.getLabeler();
tagger.setLabeler(new LabelVectorReturner());
FeatureVectorParser parser = new FeatureVectorParser("MyTagger.ex", "MyTagger.lex");
for (Object vector = parser.next(); vector != null; vector = parser.next())
  tagger.learn(vector);
tagger.setExtractor(extractor);
tagger.setLabeler(labeler);
tagger.save();
5.4.5 lbjava.parse.LinkedChild
Together with LinkedVector discussed next, these two classes form the basis for a simple, general
purpose internal representation for raw data. LinkedChild is an abstract class containing pointers
to two other LinkedChildren, the “previous” one and the “next” one. It may also store a pointer
to its parent, which is a LinkedVector. Constructors that set up all these links are also provided,
simplifying the implementation of the parser.
5.4.6 lbjava.parse.LinkedVector
A LinkedVector contains any number of LinkedChildren and provides random access to them
in addition to the serial access provided by their links. It also provides methods for insertion
and removal of new children. A LinkedVector is itself also a LinkedChild, so that hierarchies
are easy to construct when sub-classing these two classes.
5.5 lbjava.nlp
The programmer of Natural Language Processing (NLP) applications may find the internal representations and parsing algorithms implemented in this package useful. There are representations of words, sentences, and documents, as well as parsers of some common file formats and algorithms for word and sentence segmentation.
5.5.1 Internal Representations
These classes may be used to represent the elements of a natural language document.
- lbjava.nlp.Word: This simple representation of a word extends the LinkedChild class (see Section 5.4.5) and has space for its spelling and part of speech tag.
- lbjava.nlp.Sentence: Objects of the Sentence class store only the full text of the sentence in a single String. However, a method is provided to heuristically split that text into Word objects contained in a LinkedVector.
- lbjava.nlp.NLDocument: Extended from LinkedVector, this class has a constructor that takes the full text of a document as input. Using the methods in Sentence and SentenceSplitter, it creates a hierarchical representation of a natural language document in which Words are contained in LinkedVectors representing sentences, which are contained in this LinkedVector.
- lbjava.nlp.POS: This class may be used to represent a part of speech, but it is used more frequently to simply retrieve information about the various parts of speech made standard by the Penn Treebank project (Marcus, Santorini, & Marcinkiewicz, 1994).
5.5.2 Parsers
The classes listed in this section are all derived from class LineByLine (see Section 5.4.2). They
all contain (at least) a constructor that takes a single String representing the name of a file as
input. The objects they return are retrieved through the overridden next() method.
- lbjava.nlp.SentenceSplitter: Use this Parser to separate sentences out from plain text. The class provides two constructors, one for splitting sentences out of a plain text file, and the other for splitting sentences out of plain text already stored in memory in a String[]. The user can then retrieve Sentences one at a time with the next() method, or all at once with the splitAll() method. The returned Sentences’ start and end fields represent offsets into the text they were extracted from. Every character between those two offsets inclusive, including extra spaces, newlines, etc., is included in the Sentence as it appeared in the paragraph. (If the constructor taking a String[] as an argument is used, newline characters are inserted into the returned sentences to indicate transitions from one element of the array to the next.)
- lbjava.nlp.WordSplitter: This parser takes the plain, unannotated Sentences (see Section 5.5.1) returned by another parser (e.g., SentenceSplitter) and splits them into Word objects. Entire sentences, now represented as LinkedVectors (see Section 5.4.6), are then returned one at a time by calls to the next() method.
- lbjava.nlp.ColumnFormat: This parser returns a String[] representing the rows of a file in column format. The input file is assumed to contain fields of non-whitespace characters separated by any amount of whitespace, one line of which is commonly used to represent a word in a corpus. This parser breaks a given line into one String per field, omitting all of the whitespace. A common usage of this class will be in extending it to create a new Parser that calls super.next() and creates a more interesting internal representation from the results.
- lbjava.nlp.POSBracketToVector: Use this parser to return LinkedVector objects representing sentences, given file names of POS bracket form files to parse. These files are expected to have one sentence per line, and the format of each line is as follows:
(pos1 spelling1) (pos2 spelling2) ... (posn spellingn)
It is also expected that there will be exactly one space between a part of speech and the corresponding spelling and between a closing parenthesis and an opening parenthesis.
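A hedged sketch of the string manipulation such parsing requires; this is an illustrative helper, not the library’s implementation, and it assumes the well-formed input just described.

```java
import java.util.*;

// Hedged sketch of parsing one line of POS bracket form: each
// "(pos spelling)" pair becomes a [pos, spelling] entry.
public class PosBracketSketch {
    static List<String[]> parseLine(String line) {
        List<String[]> words = new ArrayList<>();
        // Tokens are separated by exactly ") (" as described above.
        for (String token : line.split("\\) \\(")) {
            String stripped = token.replace("(", "").replace(")", "");
            words.add(stripped.split(" ", 2));  // [pos, spelling]
        }
        return words;
    }

    public static void main(String[] args) {
        for (String[] w : parseLine("(DT The) (NN dog) (VBZ runs)"))
            System.out.println(w[1] + "/" + w[0]);
    }
}
```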