Abrazolica

Gaussian Process Regression

posted on: Thursday, February 21st, 2013, 21:22

We just released a program for doing Gaussian process regression called gpr1. The program is written in C and you can find it on our data mining page. It is free and distributed under the GNU General Public License. Gaussian process regression is a powerful and flexible form of regression analysis that can be useful for modeling things like climate and financial markets. I'm just going to give an extremely brief overview of what the program does.

A Gaussian process is a stochastic process in which the value at each point has a Gaussian distribution and the values at any finite set of points are jointly distributed as a multivariate Gaussian random variable. In a regression problem you have the function values at a set of training points \((t_1,t_2,\ldots,t_m)\), represented by the vector:

\[ \vec{f}_T = \left(f(t_1), f(t_2), \ldots, f(t_m)\right)^T \]

and a set of points \((x_1,x_2,\ldots,x_n)\) where you want to predict the function, i.e., you want the vector:

\[ \vec{f}_X = \left(f(x_1), f(x_2), \ldots, f(x_n)\right)^T \]

The two vectors together, \(\left(\vec{f}_T, \vec{f}_X\right)\), are modeled as a multivariate Gaussian with zero mean and a covariance matrix given by:

\[ \left( \begin{array}{cc} K_{TT} & K_{TX} \\ K_{XT} & K_{XX} \\ \end{array} \right) \]

where \(K_{TT}\) is the \(m\) by \(m\) covariance matrix for the \(T\) points, \(K_{XX}\) is the \(n\) by \(n\) covariance matrix for the \(X\) points, and \(K_{XT}\) and \(K_{TX}=K_{XT}^T\) are the cross-covariance matrices between the \(T\) and \(X\) points. With this model, the distribution of \(\vec{f}_X\) conditioned on the value of \(\vec{f}_T\) is Gaussian with mean \(K_{XT}K_{TT}^{-1}\vec{f}_T\) and covariance matrix \(K_{XX}-K_{XT}K_{TT}^{-1}K_{TX}\). The mean is used as the predicted (most likely) value of \(\vec{f}_X\), and this is what gpr1 calculates. The one remaining detail is how the elements of the covariance matrices are calculated. There are many ways to do it, but gpr1 uses the following equation:

\[ k(x_i,x_j)=\sigma_s^2 \exp\left(-\frac{(x_i-x_j)^2}{2h^2}\right) + \sigma_n^2\delta_{ij} \]

where \(\sigma_s^2\) is called the signal variance, \(\sigma_n^2\) is called the noise variance, \(h\) is a length scale parameter, and \(\delta_{ij}\) is the Kronecker delta (1 when \(i=j\) and 0 otherwise). If there is no noise in the training points, set \(\sigma_n^2=0\). In general you will have to experiment with values for \(\sigma_s^2\) and \(h\) for the particular problem you're trying to solve. For a good discussion of this and anything else related to Gaussian processes, see the book Gaussian Processes for Machine Learning by Rasmussen and Williams.
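
To make the calculation concrete, here is a minimal Python sketch of the predictive mean \(K_{XT}K_{TT}^{-1}\vec{f}_T\). It is not the gpr1 source code: the function names (`se_kernel`, `cov_matrix`, `solve`, `gp_mean`) are made up for illustration, the linear system is solved with plain Gaussian elimination (a Cholesky factorization would be the more usual choice), and I'm assuming the noise variance is added only to the diagonal of \(K_{TT}\). Note that the exponent in the kernel is negative, so the covariance decays as two points move apart.

```python
import math


def se_kernel(xi, xj, sigma_s2, h):
    """Squared-exponential covariance between two scalar inputs (no noise term)."""
    return sigma_s2 * math.exp(-((xi - xj) ** 2) / (2.0 * h ** 2))


def cov_matrix(a, b, sigma_s2, h):
    """Covariance matrix with one row per point in a and one column per point in b."""
    return [[se_kernel(ai, bj, sigma_s2, h) for bj in b] for ai in a]


def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]  # augmented matrix [A | y]
    for c in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def gp_mean(t, f_t, x, sigma_s2, h, sigma_n2=0.0):
    """Predictive mean K_XT K_TT^{-1} f_T at the points x.

    Assumption: the noise variance sigma_n2 is added only to the
    diagonal of K_TT (the delta_ij term for the training points).
    """
    K_TT = cov_matrix(t, t, sigma_s2, h)
    for i in range(len(t)):
        K_TT[i][i] += sigma_n2
    alpha = solve(K_TT, f_t)              # alpha = K_TT^{-1} f_T
    K_XT = cov_matrix(x, t, sigma_s2, h)  # n by m cross-covariance
    return [sum(k * a for k, a in zip(row, alpha)) for row in K_XT]
```

For example, `gp_mean([0.0, 1.0, 2.0], [1.0, 3.0, 2.0], [0.5, 1.5], sigma_s2=1.0, h=1.0)` returns the predicted values at 0.5 and 1.5. With `sigma_n2=0` the curve passes through the training points, and far from any training point the prediction falls back to the zero prior mean.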

The figure below is a simple example of what you can do with this software. It is a plot of the mean high temperatures in Boulder, Colorado, for every day of the year starting on January 1. The points marked with red crosses are the mean values calculated by averaging the daily high temperatures in Boulder over the last 117 years. The source of the data is the NOAA Earth System Research Laboratory. These points are used as the training data set to calculate the regression curve, shown as a solid black line.

Figure: Daily mean high temperature in Boulder, Colorado, with Gaussian process regression curve.