3. Tutorial: 20 Newsgroups
We begin our discussion of the LBJava language with a tutorial that illustrates its most common usage. This tutorial is intended for a first time user of the language. It introduces the syntax for both single feature and feature generating hard-coded classifiers, as well as the syntax for declaring a learner. Next, it shows how the learner (or any other classifier) declared in the LBJava source code can be imported and used in the Java application. Finally, it discusses how to use the LBJava compiler on the command line to fully compile our learning based program. Throughout the tutorial, we'll be using the famous 20 Newsgroups corpus as our training data.
3.1 Setting Up
Suppose we want to classify newsgroup posts according to the newsgroup to
which each post is best suited. Such a classifier could be used in a
newsgroup client application to automatically suggest an appropriate
destination for a new post. It is plausible that these classifications could
be made as a function of the words that appear in each post. For example, the word
motor is likely to appear more often in rec.autos or rec.motorcycles than in
alt.atheism. However, we do not want to
manually invent these associations one at a time, so we turn to LBJava.
To use LBJava, we first need to decide on an object oriented internal
representation. In this case, it makes sense to define a class named Post
that stores the contents of a newsgroup post. The following snippet
shows a skeleton for such a class. There, we see that space has been
allocated for several fields that we might expect in a newsgroup post, namely
the From and Subject fields from the header and the body of the post.
We have chosen to represent the body as a two dimensional array; one dimension
for the lines in the body, and the other for the words in each line.
    public class Post {
        private String newsgroup;      // The label of the post.
        private String fromHeader;
        private String subjectHeader;
        private String[][] body;       // body[line][word]
    }
Finally, we have a field called newsgroup. It may seem counterintuitive
to include a field to store this information since it is exactly the
classification we aim to compute. LBJava's supervised learning algorithms will
need this information, however, since it labels the example object.
Furthermore, at test time, our newsgroup client application may fill this
field with the newsgroup in which the post was encountered or in which the
user intends to post it, and the learned classifier will simply ignore it at
that point.
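The LBJava code later in this tutorial calls the accessors bodySize(), lineSize(int), getBodyWord(int, int), and getNewsgroup() on Post. The following is a minimal sketch of how such accessors might sit on top of the fields above. The class name PostSketch and its string-based constructor are assumptions made for illustration; the Post class in the distribution builds itself from a file, and that code is not shown here.

```java
// Sketch only: PostSketch and its string-based constructor are assumptions
// made for illustration; the tutorial's real Post reads a file instead.
class PostSketch {
    private final String newsgroup;      // The label of the post.
    private final String fromHeader;
    private final String subjectHeader;
    private final String[][] body;       // body[line][word]

    PostSketch(String newsgroup, String from, String subject, String bodyText) {
        this.newsgroup = newsgroup;
        this.fromHeader = from;
        this.subjectHeader = subject;
        String[] lines = bodyText.split("\n");
        body = new String[lines.length][];
        for (int i = 0; i < lines.length; ++i) {
            String trimmed = lines[i].trim();
            // An empty line has zero words; otherwise split on whitespace.
            body[i] = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }
    }

    String getNewsgroup() { return newsgroup; }
    String getFromHeader() { return fromHeader; }
    String getSubjectHeader() { return subjectHeader; }
    int bodySize() { return body.length; }
    int lineSize(int line) { return body[line].length; }
    String getBodyWord(int line, int word) { return body[line][word]; }
}
```

Keeping the fields private and exposing only these accessors means a classifier never depends on how the body is stored.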
We'll also need to implement a parser that knows how to create Post
objects when given the raw data in a file or files. In LBJava, a parser is any
class that implements the lbjava.parse.Parser interface. This is a simple
interface that requires only three methods be defined. First, the next()
method takes no arguments and returns a single example Object
(of any type in general, but in this case, it will be a Post). The LBJava
compiler will call this method repeatedly to retrieve training example objects
until it returns null. Next, the reset() method rewinds the
parser back to the beginning of the raw data input it has been reading.
Finally, the close() method closes any streams the parser may have open
and frees any other system resources it may be using.
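To make the three-method contract concrete, here is a small self-contained sketch. ExampleParser is a local stand-in that mirrors the methods described above so the example compiles without the library; it is not the library's interface. ListParser and ParserDemo are likewise hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in mirroring the three methods described above; it is NOT the
// library's interface, just enough to make this sketch compile on its own.
interface ExampleParser {
    Object next();   // Returns the next example, or null when none remain.
    void reset();    // Rewinds to the first example.
    void close();    // Releases any resources held by the parser.
}

// Hypothetical parser that serves examples from an in-memory list.
class ListParser implements ExampleParser {
    private final List<?> examples;
    private int position = 0;

    ListParser(List<?> examples) { this.examples = examples; }

    public Object next() {
        if (position >= examples.size()) return null;  // No more examples.
        return examples.get(position++);
    }

    public void reset() { position = 0; }

    public void close() { /* Nothing to release for an in-memory list. */ }
}

class ParserDemo {
    // Drains a parser the way a training loop might: call next() until null.
    static int countExamples(ExampleParser parser) {
        int count = 0;
        for (Object example = parser.next(); example != null; example = parser.next())
            ++count;
        return count;
    }

    public static void main(String[] args) {
        ExampleParser parser = new ListParser(Arrays.asList("post1", "post2", "post3"));
        System.out.println(countExamples(parser));  // First pass over the data.
        parser.reset();
        System.out.println(countExamples(parser));  // Second pass after rewinding.
        parser.close();
    }
}
```

The reset() call is what makes several passes over the same data possible.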
The LBJava library comes with several parsers that read plain text. While it
does not include a parser for newsgroup posts, we can still make use of
lbjava.parse.LineByLine, which will at least take care of the boilerplate code
necessary to read text out of a file. This abstract class also provides
implementations of the reset() and close() methods. The NewsgroupParser class
in the following snippet simply extends it to
take advantage of that functionality; it won't be necessary to override reset()
or close(). NewsgroupParser takes as input a file
containing the names of other files, assuming that each of those files
represents a single newsgroup post. For brevity, we have hidden in Post's
constructor the code that actually does the work of filling the fields
of a Post object.
    import lbjava.parse.LineByLine;

    public class NewsgroupParser extends LineByLine {
        public NewsgroupParser(String file) { super(file); }

        public Object next() {
            String file = readLine();
            if (file == null) return null;  // No more examples.
            return new Post(file);
        }
    }
With Post and NewsgroupParser ready to go, we can now define in
the LBJava source code a hard-coded classifier that identifies which words appear
in each post and a learning classifier that categorizes each post based on
those words.
3.2 Classifier Declarations
Given the internal representation developed in the previous section, the LBJava
code below can be used to train a learned newsgroup
classifier. It involves a single feature extraction classifier named BagOfWords,
a label classifier named NewsgroupLabel to provide labels
during training, and a multi-class classifier named NewsgroupClassifier
that predicts a newsgroup label. It also assumes that the Post class
and the parser NewsgroupParser (or their source files) are available on
the CLASSPATH. To see the code in action, download the source
distribution (it includes the data and all the classes mentioned above)
and run ./train.sh, assuming that LBJava is already on your CLASSPATH. We'll now
take a closer look at how it works.
    discrete% BagOfWords(Post post) <- {
        for (int i = 0; i < post.bodySize(); ++i)
            for (int j = 0; j < post.lineSize(i); ++j) {
                String word = post.getBodyWord(i, j);
                if (word.length() > 0 && word.substring(0, 1).matches("[A-Za-z]"))
                    sense word;
            }
    }

    discrete NewsgroupLabel(Post post) <- { return post.getNewsgroup(); }

    discrete NewsgroupClassifier(Post post) <-
    learn NewsgroupLabel
        using BagOfWords
        from new NewsgroupParser("data/20news.train.shuffled") 40 rounds
        with SparseNetworkLearner {
            SparseAveragedPerceptron.Parameters p =
                new SparseAveragedPerceptron.Parameters();
            p.learningRate = .1;
            p.thickness = 3;
            baseLTU = new SparseAveragedPerceptron(p);
        }
    end
3.2.1 Hard-coded classifiers
An LBJava source file is a list of declarations. The simplest declaration in the
LBJava code above, NewsgroupLabel, fits on a single line. It consists of the
classifier's signature and a hard-coded classifier expression
separated by a left arrow indicating assignment. In the classifier's
signature, we see its return type (a single discrete feature) as well as its
input type (an object of type Post). All LBJava classifiers take a single
object (of any type) as input. It is up to the programmer to ensure that all
information pertinent to the classifiers is accessible from that object. The
return type, however, is not quite so restrictive. Returned features may be
either discrete or real, and a classifier may return either a
single feature (as NewsgroupLabel does) or multiple features (as BagOfWords
does, indicated by the % symbol in its declaration). When a classifier can
return multiple features, we call it a feature generator.
On the right hand side of the left arrow is placed a classifier expression.
There are many types of classifier expression, and the two most common are
displayed here. BagOfWords and NewsgroupLabel are defined
with hard-coded classifier expressions, while NewsgroupClassifier is
defined with a learning classifier expression. When hard-coding the behavior
of a classifier, the programmer has Java 1.4 syntax at his disposal to aid in
computing his features' values, plus some additional syntactic sugar to make
that type of computation easier.
For example, the sense statement in BagOfWords creates a feature which will
eventually be returned, but execution of the method continues so that multiple
features can be "sensed." Note that only feature generators can use the
sense statement, and only classifiers returning a single feature can use
Java's return statement (as NewsgroupLabel does).
After everything is said and done, we end up with two hard-coded classifiers.
One is a simple, one feature classifier that merely returns the value of the
Post.newsgroup field (via the getNewsgroup() method, since Post.newsgroup
is private). The other loops over all the words in the post
returning each as a separate feature.
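As a rough plain-Java analogy for what BagOfWords computes, the sketch below collects every word that passes the same filter into a list, with each add() call playing the role of a sense statement. This is an illustration only: the code LBJava generates represents features as its own objects rather than strings, and BagOfWordsSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class BagOfWordsSketch {
    // Collects every body word that starts with an ASCII letter, mirroring the
    // filter in BagOfWords. Adding to the list plays the role of "sense".
    static List<String> extract(String[][] body) {
        List<String> features = new ArrayList<>();
        for (String[] line : body)
            for (String word : line)
                if (word.length() > 0 && word.substring(0, 1).matches("[A-Za-z]"))
                    features.add(word);
        return features;
    }

    public static void main(String[] args) {
        String[][] body = { { "My", "motor", "stalls" }, { "42", "times", "!" } };
        System.out.println(extract(body));
    }
}
```

Tokens such as numbers and punctuation are skipped because their first character is not a letter.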
3.2.2 Learners
NewsgroupClassifier in the LBJava code above is not
specified in the usual, procedural way, but instead as the output of a
learning algorithm applied to data. The verbose learning classifier
expression syntax says that this classifier will learn to mimic an
oracle (the learn clause), using some feature extraction classifiers (the
using clause), from some example objects (the from clause), with a learning
algorithm (the with clause). The expression ends with the end keyword.
In this case, the oracle is NewsgroupLabel, the only feature
extraction classifier is BagOfWords, the example objects come from
NewsgroupParser, and the learning algorithm is SparseNetworkLearner.
We explore each of these ideas in more detail below.
- learn: We say that NewsgroupClassifier is trying to mimic NewsgroupLabel because it will attempt to return features with the same values and for the same example objects that NewsgroupLabel would have returned them. Note that the particular feature values being returned have not been mentioned; they are induced by the learning algorithm from the data. We need only make sure that the return type of the label classifier is appropriate for the selected learning algorithm.
- using: The argument to the using clause is a single classifier expression. As we can see from this example code, the name of a classifier qualifies. The only restriction is that this classifier expression must have an input type that allows it to take instances of NewsgroupClassifier's input type. LBJava also provides a comma operator for constructing a feature generator that simply returns all the features returned by the classifiers on either side of the comma. This way, we can include as many features as we want simply by listing classifiers separated by commas.
- from: The from clause supplies a data source by instantiating a parser. The objects returned by this parser's next() method must be instances of NewsgroupClassifier's input type. LBJava can then extract features via the using clause and train with the learning algorithm. This clause also gives the programmer the opportunity to iterate over the training data more than once. The optional rounds clause is part of the from clause, and it specifies how many times to iterate.
- with: The argument to the with clause names a learning algorithm (any class extending lbjava.learn.Learner accessible on the CLASSPATH) and allows the programmer to set its parameters. For example, learningRate and thickness are parameters of the SparseAveragedPerceptron learning algorithm, while baseLTU is a parameter of the SparseNetworkLearner learning algorithm.
From these elements, the LBJava compiler generates Java source code that performs feature extraction, applies that code on the example objects to create training examples, and trains our learner with them. The resulting learner is, in essence, a Java method that takes an example object as input and returns the predicted newsgroup in a string as output. Note that the code does not specify the possible newsgroup names or any other particulars about the content of our example objects. The only reason that this LBJava code results in a newsgroup classifier is that we give it training data that induces one. If we want a spam detector instead, we need only change data sources; the LBJava code need not change. (Of course, we may want to change the names of our classifiers in that case for clarity's sake.)
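To give a feel for what "learning to mimic an oracle" over several rounds can look like, here is a heavily simplified sketch of mistake-driven multi-class learning over string features. It keeps one sparse weight vector per label and predicts the highest-scoring label. It is an illustration under assumptions, not LBJava's SparseNetworkLearner or SparseAveragedPerceptron: there is no averaging and no thickness parameter, and TinyLearner and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified learner: one sparse weight vector per label.
class TinyLearner {
    private final Map<String, Map<String, Double>> weights = new HashMap<>();
    private final double learningRate;

    TinyLearner(double learningRate) { this.learningRate = learningRate; }

    // Sum of the label's weights over the features present in the example.
    private double score(String label, List<String> features) {
        Map<String, Double> w = weights.get(label);
        double sum = 0.0;
        for (String f : features) sum += w.getOrDefault(f, 0.0);
        return sum;
    }

    // Returns the highest-scoring label seen so far, or null before training.
    String predict(List<String> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : weights.keySet()) {
            double s = score(label, features);
            if (s > bestScore) { bestScore = s; best = label; }
        }
        return best;
    }

    // One online update: the labels themselves are discovered from the data.
    void learn(List<String> features, String trueLabel) {
        weights.putIfAbsent(trueLabel, new HashMap<>());
        double trueScore = score(trueLabel, features);
        String rival = null;
        double rivalScore = Double.NEGATIVE_INFINITY;
        for (String label : weights.keySet()) {
            if (label.equals(trueLabel)) continue;
            double s = score(label, features);
            if (s > rivalScore) { rivalScore = s; rival = label; }
        }
        // Mistake (ties count): promote the true label, demote the rival.
        if (rival != null && rivalScore >= trueScore) {
            adjust(trueLabel, features, learningRate);
            adjust(rival, features, -learningRate);
        }
    }

    private void adjust(String label, List<String> features, double delta) {
        Map<String, Double> w = weights.get(label);
        for (String f : features) w.merge(f, delta, Double::sum);
    }

    // Makes the given number of passes ("rounds") over the training data.
    void train(List<List<String>> examples, List<String> labels, int rounds) {
        for (int r = 0; r < rounds; ++r)
            for (int i = 0; i < examples.size(); ++i)
                learn(examples.get(i), labels.get(i));
    }

    public static void main(String[] args) {
        TinyLearner learner = new TinyLearner(0.1);
        learner.train(
            Arrays.asList(Arrays.asList("motor", "engine"), Arrays.asList("god", "faith")),
            Arrays.asList("rec.autos", "alt.atheism"), 2);
        System.out.println(learner.predict(Arrays.asList("motor")));
    }
}
```

As in the LBJava code, nothing in the sketch names the possible labels in advance; they come entirely from the training data.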
3.3 Using NewsgroupClassifier in a Java Program
Now that we’ve specified a learned classifier, the next step is to write a pure Java application that will use it once it’s been trained. This section first introduces the methods every automatically generated LBJava classifier makes available within pure Java code. These methods comprise a simple interface for predicting, online learning, and testing with a classifier.
3.3.1 Getting Started
We assume here that all learning will take place during the LBJava compilation phase, which we’ll discuss in Section 3.4. (It is also possible to learn online, i.e. while the application is running, which we’ll discuss in Section 3.3.3.) To gain access to the learned classifier within your Java program, simply instantiate an object of the classifier’s generated class, which has the same name as the classifier.
NewsgroupClassifier ngClassifier = new NewsgroupClassifier();
The classifier is now ready to make predictions on example objects. NewsgroupClassifier
was defined to take Post objects as input and to make a discrete prediction as output. Thus, if
we have a Post object available, we can retrieve NewsgroupClassifier’s prediction like this:
    Post post = ...
    String prediction = ngClassifier.discreteValue(post);
The prediction made by the classifier will be one of the string labels it observed during training. And that’s it! The programmer is now free to use the classifier’s predictions however s/he chooses.
There's one important technical point to be aware of here. The instance we just created of
class NewsgroupClassifier above does not actually contain the model that LBJava learned for us.
It is merely a "clone" object that contains internally a reference to the real classifier. Thus, if
our Java application creates instances of this class in different places and performs any operation
that modifies the behavior of the classifier (like online learning), all instances will appear to be
affected by the changes. For simple use cases, this will not be an issue, but see Section 3.3.4 for
details on gaining direct access to the model.
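The consequence of this design can be illustrated with a small plain-Java analogy. In the sketch below, every handle created with the no-argument constructor delegates to one shared map, so a change made through one handle is visible through the others, while a handle built with its own state is unaffected. This is not how LBJava is implemented internally; SharedModelHandle and its methods are hypothetical names chosen for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the "clone" idea: every handle created with the
// no-argument constructor delegates to one shared model object.
class SharedModelHandle {
    private static final Map<String, Double> SHARED = new HashMap<String, Double>();
    private final Map<String, Double> model;

    // Shares state with every other handle built this way.
    SharedModelHandle() { this.model = SHARED; }

    // Optionally gives this handle its own, independent state.
    SharedModelHandle(boolean independent) {
        this.model = independent ? new HashMap<String, Double>() : SHARED;
    }

    void setWeight(String feature, double w) { model.put(feature, w); }
    double getWeight(String feature) { return model.getOrDefault(feature, 0.0); }
}

class CloneDemo {
    public static void main(String[] args) {
        SharedModelHandle a = new SharedModelHandle();
        SharedModelHandle b = new SharedModelHandle();
        a.setWeight("motor", 1.5);
        System.out.println(b.getWeight("motor"));   // b sees the change made via a.
        SharedModelHandle c = new SharedModelHandle(true);
        System.out.println(c.getWeight("motor"));   // c has its own state.
    }
}
```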
3.3.2 Prediction Confidence
We’ve already seen how to get the prediction from a discrete valued classifier. This technique will work no matter how the classifier was defined; be it hard-coded, learned, or what have you. When the classifier is learned, it can go further than merely providing the prediction value it likes the best. In addition, it can provide a score for every possible prediction it chose amongst, thereby giving an indication of how confident the classifier is in its prediction. The prediction with the highest score is the one selected.
Scores are returned by a classifier in a ScoreSet object by calling the scores(Object) method,
passing in the same example object that you would have passed to discreteValue(Object).
Once you have a ScoreSet you can get the score for any particular prediction value using the
get(String) method, which returns a double. Alternatively, you can retrieve all scores in an
array and iterate over them, like this:
    ScoreSet scores = ngClassifier.scores(post);
    Score[] scoresArray = scores.toArray();
    for (Score score : scoresArray)
        System.out.println("prediction: " + score.value + ", score: " + score.score);
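Since the selected prediction is the one with the highest score, the gap between the top two scores gives one rough cue about confidence. The sketch below works on a plain map from label to score rather than on LBJava's ScoreSet, so that it stands alone; ScoreDemo, argmax, and margin are hypothetical names, and the margin is an uncalibrated heuristic, not a probability.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ScoreDemo {
    // Returns the label with the highest score, or null for an empty map.
    static String argmax(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet())
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        return best;
    }

    // Gap between the two highest scores; assumes at least two labels.
    static double margin(Map<String, Double> scores) {
        double first = Double.NEGATIVE_INFINITY;
        double second = Double.NEGATIVE_INFINITY;
        for (double s : scores.values()) {
            if (s > first) { second = first; first = s; }
            else if (s > second) { second = s; }
        }
        return first - second;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<String, Double>();
        scores.put("rec.autos", 2.5);
        scores.put("rec.motorcycles", 1.0);
        scores.put("alt.atheism", -3.0);
        System.out.println(argmax(scores) + " wins by " + margin(scores));
    }
}
```

A small margin suggests the classifier found two labels almost equally plausible for the example.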
Finally, LBJava also lets you define real valued classifiers which return doubles in the Java
application. If you have such a classifier, you can retrieve its prediction on an example object
by calling the realValue(Object) method:
double prediction = realClassifier.realValue(someExampleObject);
3.3.3 Learning
As mentioned above, most classifiers are learned during the LBJava phase of compilation (see
Section 3.4 below). In addition, a classifier generated by the LBJava compiler can also continue
learning from labeled examples in the Java application. Since NewsgroupClassifier takes a
Post object as input, we merely have to get our hands on such an object, stick the label in the
newsgroup field (since that’s where the NewsgroupLabel classifier will look for it), and pass it
to the classifier’s learn(Object) method.
Now that we know how to get our classifier to learn, let’s see how to make it forget.
The contents of a classifier can be completely cleared out by calling the forget() method.
After this method is called, the classifier returns to the state it was in before it observed any
training examples. One reason to forget everything a classifier has learned is to try new learning
algorithm parameters (e.g. learning rates, thresholds, etc.). All LBJava learning algorithms provide
an inner class named Parameters that contains default settings for all their parameters. Simply
instantiate such an object, overwrite the parameters that need to be updated, and call the
setParameters(Parameters) method. For example:
    ngClassifier.forget();
    SparseAveragedPerceptron.Parameters ltuParameters =
        new SparseAveragedPerceptron.Parameters();
    ltuParameters.thickness = 12;
    NewsgroupClassifier.Parameters parameters =
        new NewsgroupClassifier.Parameters();
    parameters.baseLTU = new SparseAveragedPerceptron(ltuParameters);
    ngClassifier.setParameters(parameters);
This particular example is complicated by the fact that our newsgroup classifier is learned
using SparseNetworkLearner, an algorithm that uses another learning algorithm with its own
parameters as a subroutine. But the technique is the same. At this point, the classifier is
re-initialized with a new thickness setting and is ready for new training examples.
3.3.4 Saving Your Work
If we’ve done any forget()ing and/or learn()ing within our Java application, we’ll probably
be interested in saving what we learned at some point. No problem; simply call the save()
method.
This operation overwrites the model and lexicon files that were originally generated by the
LBJava compiler. A model file stores the values of the learned parameters (not to be confused
with the manually set learning algorithm parameters mentioned above). A lexicon file stores
the classifier’s feature index, used for quick access to the learnable parameters when training for
multiple rounds. These files are written by the LBJava compiler and by the save() method (though
only initially; see below) in the same directory where the NewsgroupClassifier.class file is
written.
We may also wish to train several versions of our classifier; perhaps each version will use different
manually set parameters. But how can we do this if each instance of our NewsgroupClassifier
class is actually a "clone", merely pointing to the real classifier object? Easy: just use the
NewsgroupClassifier constructor that takes model and lexicon filenames as input:
NewsgroupClassifier c2 = new NewsgroupClassifier("myModel.lc", "myLexicon.lex");
This instance of our classifier is not a clone, simply by virtue of our chosen constructor.
It has its own completely independent learnable parameters. Furthermore, if myModel.lc and
myLexicon.lex exist, they will be read from disk into c2. If not, then calling this constructor
creates them. Either way, we can now train our classifier however we choose and then simply
call c2.save() to save everything into those files.
3.4 Compiling Our Learning Based Program with LBJava
Referring once again to this newsgroup classifier’s source distribution, we first examine our chosen directory structure starting from the root directory of the distribution.
$ ls
20news.LBJava class LBJava test.sh
README data src train.sh
$ ls src/dssi/news
NewsgroupParser.java NewsgroupPrediction.java Post.java
We see there is an LBJava source file 20news.LBJava in the root directory, and in src/dssi/news
we find plain Java source files implementing our internal representation (Post.java), a parser
that instantiates our internal representation (NewsgroupParser.java), and a program intended
to use our trained classifier to make predictions about newsgroups (NewsgroupPrediction.java).
Note that the LBJava source file and all these plain Java source files declare package dssi.news;.
The root directory also contains two directories class and LBJava which are initially empty. They
will be used to store all compiled Java class files and all Java source files generated by the LBJava
compiler respectively. Keeping all these files in separate directories is not a requirement, but
many developers find it useful to reduce clutter around the source files they are editing.
To compile the LBJava source file using all these directories as intended, we run the following command:
$ java -Xmx512m -cp $CLASSPATH:class LBJ2.Main \
-sourcepath src \
-gsp LBJava \
-d class \
20news.LBJava
This command runs the LBJava compiler on 20news.LBJava, generating a new Java source file
for each of the classifiers declared therein. Since 20news.LBJava mentions both the Post and
NewsgroupParser classes, their definitions (either compiled class files or their original source
files) must be available within a directory structure that mirrors their package names. We
have provided their source files using the -sourcepath src command line flag. The -gsp LBJava
(generated source path) flag tells LBJava to put the new Java source files it generates in the LBJava
directory, and the -d class flag tells LBJava to put class files in the class directory. For more
information on the LBJava compiler’s command line usage, see Chapter 6.
But the command does more than that; it also trains any learning classifiers on the specified training data, so that the compiled class files are ready to be used in new Java programs just like any other class can be. The fact that their implementations came from data is immaterial; the new Java program that uses these learned classifiers is agnostic to whether the functions it is calling are learned or hard-coded. Its output will look like this:
Generating code for BagOfWords
Generating code for NewsgroupLabel
Generating code for NewsgroupClassifier
Compiling generated code
Training NewsgroupClassifier
NewsgroupClassifier, pre-extract: 0 examples at Sun Mar 31 10:48...
NewsgroupClassifier, pre-extract: 16828 examples at Sun Mar 31 10:49...
NewsgroupClassifier: Round 1, 0 examples processed at Sun Mar 31 10:49...
NewsgroupClassifier: Round 1, 16828 examples processed at Sun Mar 31 10:49...
NewsgroupClassifier: Round 2, 0 examples processed at Sun Mar 31 10:49...
...
Writing NewsgroupClassifier
Compiling generated code
The compiler tells us which classifiers it is generating code for and which it is training.
Because we have specified progressOutput 20000 in NewsgroupClassifier’s specification (see
the distribution’s 20news.LBJava file), we also get messages updating us on the progress being
made during training. We can see here that the first stage of training is a “pre-extraction” stage
in which a feature index is compiled, and all Post objects in our training set are converted to
feature vectors based on the index. Then the classifier is trained over those vectors for 40 rounds.
The entire process should take under 2 minutes on a modern machine.
If you’re curious, you can also look at the files that have been generated:
$ ls LBJava/dssi/news
BagOfWords.java NewsgroupClassifier.java
NewsgroupClassifier.ex NewsgroupLabel.java
$ ls class/dssi/news
BagOfWords.class NewsgroupLabel.class
NewsgroupClassifier$Parameters.class NewsgroupParser.class
NewsgroupClassifier.class Post$1.class
NewsgroupClassifier.lc Post.class
NewsgroupClassifier.lex
The LBJava directory now contains a dssi/news subdirectory containing our classifier’s Java
implementations, as well as the pre-extracted feature vectors in the NewsgroupClassifier.ex
file. In the class/dssi/news directory, we find the class files compiled from all our hard-coded
and generated Java source files, as well as NewsgroupClassifier.lc and
NewsgroupClassifier.lex, which contain NewsgroupClassifier’s learned parameters and its feature
index (a.k.a. lexicon) respectively.
Finally, it’s time to compile NewsgroupPrediction.java, the program that calls our learned
classifier to make predictions about new posts.
$ javac -cp $CLASSPATH:class \
-sourcepath src \
-d class \
src/dssi/news/NewsgroupPrediction.java
Notice that the command line flags we gave to the LBJava compiler previously are very similar to those we give the Java compiler now. We can test out our new program like this:
$ java -Xmx512m -cp $CLASSPATH:class dssi.news.NewsgroupPrediction \
$(head data/20news.test.shuffled)
data/alt.atheism/53531: alt.atheism
data/talk.politics.mideast/76075: talk.politics.mideast
data/sci.med/59050: sci.med
data/rec.sport.baseball/104591: rec.sport.baseball
data/comp.windows.x/67088: comp.windows.x
data/rec.motorcycles/103131: rec.autos
data/sci.crypt/15215: sci.crypt
data/talk.religion.misc/84195: talk.religion.misc
data/sci.electronics/54094: sci.electronics
data/comp.os.ms-windows.misc/10793: comp.os.ms-windows.misc
Post rec.motorcycles/103131 was misclassified as rec.autos, but other than that, things are
going well.
3.5 Testing a Discrete Classifier
When a learned classifier returns discrete values, LBJava provides the handy TestDiscrete
class for measuring the classifier’s prediction performance. This class can be used either as a standalone
program or as a library for use inside a Java application. In either case, we’ll need to
provide TestDiscrete with the following three items:
- The classifier whose performance we’re measuring (e.g. NewsgroupClassifier).
- An oracle classifier that knows the true labels (e.g. NewsgroupLabel).
- A parser (i.e., any class implementing the Parser interface) that returns objects of our classifiers’ input type.
3.5.1 On the Command Line
If we’d like to use TestDiscrete on the command line, the parser must provide a constructor that
takes a single String argument (which could be, e.g., a file name) as input. NewsgroupClassifier
uses the NewsgroupParser parser, which meets this requirement, so we can test our classifier on
the command line like this:
$ java -Xmx512m -cp $CLASSPATH:class LBJ2.classify.TestDiscrete \
dssi.news.NewsgroupClassifier \
dssi.news.NewsgroupLabel \
dssi.news.NewsgroupParser \
data/20news.test
The output of this program is a table of the classifier’s performance statistics broken down
by label. For a given label $l$, the statistics are based on the quantity of examples with that gold
truth label $c_l$, the quantity of examples predicted to have that label by the classifier $\hat{c}_l$, and the
overlap of these two sets, denoted $c_l \wedge \hat{c}_l$ (i.e., the quantity of examples correctly predicted to
have that label). Based on these definitions, the table has the following columns:
- the label $l$,
- the classifier’s precision on $l$, $p_l = \frac{c_l \wedge \hat{c}_l}{\hat{c}_l} \times 100\%$,
- the classifier’s recall on $l$, $r_l = \frac{c_l \wedge \hat{c}_l}{c_l} \times 100\%$,
- the classifier’s F1 on $l$, $F_1(l) = \frac{2 p_l r_l}{p_l + r_l} \times 100\%$,
- the label count $c_l$, and
- the prediction count $\hat{c}_l$.
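The per-label columns follow directly from three counts. The sketch below computes them for one label; LabelStats is a hypothetical helper rather than part of LBJava, and returning 0 when a denominator is zero is an assumption of this sketch, not documented library behavior.

```java
// Hypothetical helper: per-label statistics from three counts, where
// gold is the label count, predicted is the prediction count, and
// correct is the overlap (examples correctly predicted to have the label).
class LabelStats {
    // Percentage of predictions of this label that were correct.
    static double precision(int correct, int predicted) {
        return predicted == 0 ? 0.0 : 100.0 * correct / predicted;
    }

    // Percentage of examples with this gold label that were found.
    static double recall(int correct, int gold) {
        return gold == 0 ? 0.0 : 100.0 * correct / gold;
    }

    // Harmonic mean of precision and recall, both already in percent.
    static double f1(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int gold = 16, predicted = 10, correct = 8;
        double p = precision(correct, predicted);
        double r = recall(correct, gold);
        System.out.println("P=" + p + " R=" + r + " F1=" + f1(p, r));
    }
}
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a simple average would.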
At the bottom of the table will always be the overall accuracy of the classifier. For the
NewsgroupClassifier, we get this output:
The TestDiscrete class also supports the notion of a null label, which is a label intended to
represent the absence of a prediction. The 20 Newsgroups task doesn’t make use of this concept,
but if our task were, e.g., named entity classification in which every phrase is potentially a named
entity, then the classifier will likely output a prediction we interpret as meaning “this phrase is
not a named entity.” In that case, we will also be interested in overall precision, recall, and F1
scores aggregated over the non-null labels. On the TestDiscrete command line, all arguments
after the four we’ve already seen are optional null labels. The output with a single null label
“O” might look like this (note the Overall row at the bottom):
3.5.2 In a Java Program
Alternatively, we can call TestDiscrete from within our Java application. This comes in handy
if our parser’s constructor isn’t so simple, or when we’d like to do further processing with the
performance numbers themselves. The simplest way to do so is to pass instances of our classifier,
labeler, and parser to TestDiscrete, like this:
    NewsgroupClassifier classifier = new NewsgroupClassifier();
    NewsgroupLabel oracle = new NewsgroupLabel();
    Parser parser = new NewsgroupParser("data/20news.test");
    TestDiscrete tester = TestDiscrete.testDiscrete(classifier, oracle, parser);
    tester.printPerformance(System.out);
This Java code does exactly the same thing as the command line above. We can also
exert more fine grained control over the computed statistics. Starting from a new instance of
TestDiscrete, we can call reportPrediction(String,String) every time we acquire both a
prediction value and a label. Then we can either call the printPerformance(PrintStream)
method to produce the standard output in table form or any of the methods whose names start
with get to retrieve individual statistics. The example code below retrieves the overall precision,
recall, F1, and accuracy measures in an array.
    TestDiscrete tester = new TestDiscrete();
    ...
    tester.reportPrediction(classifier.discreteValue(ngPost),
                            oracle.discreteValue(ngPost));
    ...
    double[] performance = tester.getOverallStats();
    System.out.println("Overall Accuracy: " + performance[3]);
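For readers curious how overall numbers of this kind can be accumulated from individual prediction and label pairs, here is a stand-alone sketch. OverallTally is a hypothetical class, not TestDiscrete, and the exact definitions it uses are assumptions: precision and recall are pooled over all labels not declared null, while accuracy counts every example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical tally, not LBJava's TestDiscrete: accumulates prediction/label
// pairs and pools counts over labels that are not declared null.
class OverallTally {
    private final Set<String> nullLabels;
    private int total, correct;               // Over all examples.
    private int gold, predicted, overlap;     // Over non-null labels only.

    OverallTally(String... nullLabels) {
        this.nullLabels = new HashSet<String>(Arrays.asList(nullLabels));
    }

    void reportPrediction(String prediction, String label) {
        boolean match = prediction.equals(label);
        ++total;
        if (match) ++correct;
        if (!nullLabels.contains(label)) ++gold;
        if (!nullLabels.contains(prediction)) {
            ++predicted;
            if (match) ++overlap;
        }
    }

    // Returns { precision, recall, F1, accuracy }, each as a percentage.
    double[] overallStats() {
        double p = predicted == 0 ? 0.0 : 100.0 * overlap / predicted;
        double r = gold == 0 ? 0.0 : 100.0 * overlap / gold;
        double f1 = p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
        double accuracy = total == 0 ? 0.0 : 100.0 * correct / total;
        return new double[] { p, r, f1, accuracy };
    }

    public static void main(String[] args) {
        OverallTally tally = new OverallTally("O");
        tally.reportPrediction("PER", "PER");
        tally.reportPrediction("O", "PER");
        tally.reportPrediction("LOC", "O");
        tally.reportPrediction("O", "O");
        System.out.println(Arrays.toString(tally.overallStats()));
    }
}
```

With no null labels declared, every prediction and every gold label is counted, so the pooled precision and recall both reduce to the accuracy.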

