Data Analysis

Software

The following software is free and distributed under the terms of the GNU General Public License. It is written in ANSI C and should compile with any C compiler. If you have questions or comments, contact Stefan (stefan at exstrom dot com) or Richard (richard at exstrom dot com).

Gaussian Process Regression

gpr1.c

This program does Gaussian process regression. Given a set of training data, it predicts values at a specified set of points. The program uses the GNU Scientific Library (GSL), which you must install before compiling. Once GSL is installed you should be able to compile the program using:

gcc -o gpr1 gpr1.c -lgsl -lgslcblas -lm

You run the program on the command line like this:

gpr1 tfile NT NP xfile NX sv sh nv

where the command line parameters are defined as follows:

tfile = name of training file.
NT = number of training records.
NP = number of predictors in each training record.
xfile = name of test file (points to predict).
NX = number of test records.
sv = signal variance.
sh = signal length scale.
nv = noise variance.

For NP=3 each line (record) of the training file will look like the following:

x1 x2 x3 y

where x1 x2 x3 are the predictors and y is the observed value to be predicted. The test file has the same format but without the y value. The predictions are written to stdout. The sv, sh, and nv parameters are used in the covariance function. For more information and an example see this blog post.

Bayesian Classifier

class2kde.c

The program class2kde runs a Bayesian binary classifier on all the records in a training file. Each record in the training file is classified using the other records. The class file contains the true class for each record. The output for each record is the class 1 probability followed by the true class for the record. This program is useful for testing how well the record attributes determine the record class. The conditional probability densities are estimated using a kernel density estimator. A kernel density bandwidth file is needed for this. Details on arguments and file formats are in the code.

xclass2kde.c

The program xclass2kde runs a Bayesian binary classifier on all the records in an unclassified file. It needs a training file with records that have been classified. The classifications of the training records should be in a separate class file. The output for each record of the unclassified file is the probability that the class equals 1. The conditional probability densities are estimated using a kernel density estimator. A kernel density bandwidth file is needed for this. Details on arguments and file formats are in the code.

att1class2.c

The program att1class2 generates single attribute data that can be used to test a binary classifier. The attribute values for the two classes are drawn from two Normal distributions. The mean and standard deviation for the two distributions must be specified along with the probability of class 1. See the code for more documentation. The program uses the GNU Scientific Library. It must be installed before the program will compile. See the code for instructions on how to compile.

Awk Scripts

class2kde.awk

This script outputs statistics on the results of class2kde. See the code for more details.

xclass2kde.awk

This script outputs statistics on the results of xclass2kde. See the code for more details.

class2kde2roc.awk

This script converts sorted output of class2kde to ROC curve data. See the code for more details.
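
For readers unfamiliar with the conversion, the sketch below shows one way to turn sorted classifier output into ROC points. It is written in C rather than awk and is not the logic of class2kde2roc.awk. It assumes the records are sorted by descending class 1 probability and ignores ties; the function name roc_points is hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch, not taken from class2kde2roc.awk.
   cls[i] is the true class (0 or 1) of the i-th record after sorting by
   descending class 1 probability. Writes n+1 points into fpr[] and tpr[],
   starting at (0,0) and ending at (1,1). Each point corresponds to
   labelling the first i records as class 1 and the rest as class 0.
   Returns 0 on success, or -1 if either class is absent. */
int roc_points(const int *cls, size_t n, double *fpr, double *tpr)
{
  size_t i, npos = 0, nneg, tp = 0, fp = 0;

  for (i = 0; i < n; ++i)
    if (cls[i]) ++npos;
  nneg = n - npos;
  if (npos == 0 || nneg == 0) return -1;

  fpr[0] = 0.0;
  tpr[0] = 0.0;
  for (i = 0; i < n; ++i) {
    if (cls[i]) ++tp; else ++fp;            /* lower the threshold by one record */
    fpr[i + 1] = (double)fp / (double)nneg; /* false positive rate */
    tpr[i + 1] = (double)tp / (double)npos; /* true positive rate */
  }
  return 0;
}
```

Plotting tpr against fpr gives the ROC curve. A classifier that ranks every class 1 record above every class 0 record produces a curve that rises straight to (0,1) before moving right.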