This is a Python implementation of this paper: https://dl.acm.org/citation.cfm?id=2976304
Package content:
- test_synth.py - the test that runs the synthetic-data experiment
- kqbc.py - the Python functions converted from the MATLAB code
- matlab/KQBC.m - original MATLAB code from the paper
- matlab/hit_n_run.m - original MATLAB code from the paper
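The walk implemented in matlab/hit_n_run.m can be sketched in Python roughly as follows. This is a simplified, illustrative version of hit-and-run sampling over an explicit polytope {x : Ax <= b}; the original code walks the kernelized version space, so function and variable names here are assumptions, not the package's API.

```python
import numpy as np

def hit_and_run(A, b, x0, n_steps, rng=None):
    """Hit-and-run random walk inside the polytope {x : A @ x <= b}.

    Simplified sketch: from the current point, pick a uniform random
    direction, find the feasible segment along it, and jump to a
    uniformly random point on that segment.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)            # uniform random direction
        Ad = A @ d
        slack = b - A @ x                 # distance to each constraint
        with np.errstate(divide="ignore", invalid="ignore"):
            t = slack / Ad                # step sizes that hit each face
        t_max = t[Ad > 0].min()           # nearest face ahead
        t_min = t[Ad < 0].max()           # nearest face behind
        x = x + rng.uniform(t_min, t_max) * d
    return x
```

For a bounded polytope the segment [t_min, t_max] is always finite, so every step stays feasible.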
Data
Synthetic data
The synthetic data test executes the KQBC algorithm for learning a linear classifier in a d-dimensional space. The target classifier is the vector w∗ = (1, 0, . . . , 0), so the label of an instance x ∈ IRd is the sign of its first coordinate. The instances are normally distributed, N (µ = 0, Σ = Id).
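Generating this data can be sketched as follows (the function name is illustrative and not part of the package):

```python
import numpy as np

def make_synthetic(n, d, rng=None):
    """Draw n instances from N(0, I_d); label = sign of the first
    coordinate, i.e. the target classifier is w* = (1, 0, ..., 0)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, d))   # instances ~ N(0, I_d)
    y = np.sign(X[:, 0])              # label by the first coordinate
    y[y == 0] = 1                     # break (measure-zero) ties
    return X, y

X, y = make_synthetic(100, 5, rng=0)
```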
MNIST data
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. For our experiment the dataset was reduced to 2 classes (1, -1): one chosen digit versus all the rest, balanced by discarding most of the samples from the other classes. The final data is a subset of the original, consisting of 1000 training examples per class (2000 in total) and 200 test examples per class, built in the same manner.
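The relabeling-and-balancing step described above can be sketched like this (a hedged illustration: the function name is hypothetical, and `X`/`labels` are assumed to be NumPy arrays already loaded from MNIST):

```python
import numpy as np

def binarize_balanced(X, labels, target_digit, per_class, rng=None):
    """Relabel MNIST to +1 (target digit) / -1 (all other digits) and
    keep per_class randomly chosen samples from each side."""
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(labels == target_digit)
    neg = np.flatnonzero(labels != target_digit)
    pos = rng.choice(pos, per_class, replace=False)
    neg = rng.choice(neg, per_class, replace=False)
    idx = np.concatenate([pos, neg])
    rng.shuffle(idx)
    y = np.where(labels[idx] == target_digit, 1, -1)
    return X[idx], y
```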
Running the test:
Flags:
- steps - the number of repetitions of the experiment, for statistical significance
- dim - the dimension of the samples and classifiers
- samples - a list of training-set sizes to train on
- hnr_iter - the number of hit-and-run walks inside the polytope
Command line example:
python test_synth.py --steps 20 --dim 5 --hnr_iter 21 --plot
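The flags above could be wired up with `argparse` along these lines (a sketch; defaults are assumptions and the actual script may differ):

```python
import argparse

def parse_args(argv=None):
    """Parse the experiment's command-line flags (illustrative defaults)."""
    p = argparse.ArgumentParser(description="KQBC synthetic-data experiment")
    p.add_argument("--steps", type=int, default=20,
                   help="repetitions of the experiment")
    p.add_argument("--dim", type=int, default=5,
                   help="dimension of the samples and classifiers")
    p.add_argument("--samples", type=int, nargs="+", default=[10, 20, 50],
                   help="list of training-set sizes")
    p.add_argument("--hnr_iter", type=int, default=21,
                   help="number of hit-and-run walks inside the polytope")
    p.add_argument("--plot", action="store_true",
                   help="show result plots")
    return p.parse_args(argv)

args = parse_args(["--steps", "20", "--dim", "5", "--hnr_iter", "21", "--plot"])
```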
Dependencies:
Python
MATLAB
numpy
matplotlib
Results:
These plots show the error rate on the test set versus the number of samples the classifier was trained on.
- SVM - trained on k randomly chosen labeled samples
- KQBC - using the classifier (a point in the version space) after T steps (prediction via dot product)
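Prediction with such a version-space point can be sketched as follows, assuming a linear kernel so the dot products are explicit; the function name and the `alpha` representation (coefficients over the training set, as produced by the hit-and-run sampler) are assumptions:

```python
import numpy as np

def kqbc_predict(alpha, X_train, X_test):
    """Predict labels with a version-space point written as a linear
    combination of training instances: f(x) = sign(sum_i alpha_i <x_i, x>)."""
    K = X_test @ X_train.T        # linear-kernel dot products
    return np.sign(K @ alpha)
```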
Synthetic data (5/15 dimensions):
Log scale
Linear scale
MNIST data (only 2 categories - one class vs. all the rest):
Log scale
Linear scale
The plots show a clear exponential improvement when the classifier is extracted by the KQBC algorithm.