What is a Gaussian Process and Why?
A Gaussian process is a concept that almost always comes up alongside Bayesian linear regression. Both approaches have a strong advantage when it comes to measuring the uncertainty of their predictions. A common argument is that knowing the uncertainty of a deep neural network’s (DNN) output is critical for handling wildly unrelated incoming data and reacting to it safely. Personally, I always preferred clear numbers over probabilities and distributions. Recently, however, I began to realize the importance of knowing how confident I can be about my DNN’s output and decided to give it a try. I’ve also followed this awesome implementation of the Gaussian process by Martin Krasser and made a simpler version in this Google Colab for anyone who wants to follow along in practice.
Introduction
Gaussian processes (GPs) are distributions over functions. The concept of a distribution over functions might be difficult to grasp at first, but it’s easier to think of it as finding a range of plausible functions given data. This may raise the question of how that differs from Bayesian linear regression. In my understanding, Bayesian linear regression fixes a basis function (e.g., identity, polynomial, or Gaussian) and tries to find the distribution of the weights. A GP, on the other hand, is a random (i.e., stochastic) process in which each function value is treated as a random variable, and any finite collection of these random variables follows a joint Gaussian distribution. Therefore, a GP is parameterized by a mean function and a covariance function, just like any other set of variables that follows a multivariate Gaussian distribution.
Mathematical Background
As we’ve discussed, GPs are parameterized by a mean function and a covariance function, which can be written as below:
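In standard notation, with mean function μ and covariance (kernel) function κ evaluated at the inputs, this is:

$$p(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, K), \qquad \mu_i = \mu(x_i), \quad K_{ij} = \kappa(x_i, x_j)$$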
This implies that the distribution of the random variable f given input data X follows a Gaussian distribution with mean function μ and covariance function κ.
For mathematical convenience, we usually set the mean function μ to 0, which can be achieved in practice by normalizing the data.
Since the definition of a GP tells us that the joint distribution of any finite number of its random variables is Gaussian, it follows that the joint distribution of the observed values and the predictions is also Gaussian. Since we are more interested in making predictions given new inputs, the GP posterior can be written as below:
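Writing f for the observed function values at training inputs X and f* for the predictions at new inputs X*, and using the shorthand K = κ(X, X), K* = κ(X, X*), K** = κ(X*, X*), the joint distribution and the resulting posterior take the standard form:

$$\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0},\; \begin{pmatrix} K & K_* \\ K_*^\top & K_{**} \end{pmatrix}\right)$$

$$p(\mathbf{f}_* \mid X_*, X, \mathbf{f}) = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \Sigma_*), \qquad \boldsymbol{\mu}_* = K_*^\top K^{-1}\mathbf{f}, \quad \Sigma_* = K_{**} - K_*^\top K^{-1} K_*$$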
If we want to take noisy function values into consideration, we assume the noise also follows a normal distribution with zero mean:
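That is, we observe y = f + ε with ε ~ N(0, σ_y²I), and the only change to the posterior is that K is replaced by K + σ_y²I:

$$\boldsymbol{\mu}_* = K_*^\top (K + \sigma_y^2 I)^{-1}\mathbf{y}, \qquad \Sigma_* = K_{**} - K_*^\top (K + \sigma_y^2 I)^{-1} K_*$$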
Implementation
The first step in a GP is to define a kernel, which in our case is the Gaussian, or Radial Basis Function (RBF), kernel. The kernel gives us a sense of how far apart the input points are from each other. In this way, predicted function values near the training data are similar to each other, and the closer a new input is to the training data, the lower its uncertainty should be. Here is an example of a Gaussian kernel below:
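A minimal NumPy sketch of such a kernel, following the standard RBF form κ(x, x′) = σ² exp(−‖x − x′‖² / (2l²)); the helper name rbf_kernel is just illustrative:

```python
import numpy as np

def rbf_kernel(X1, X2, l=1.0, sigma=1.0):
    """RBF (Gaussian) kernel: sigma^2 * exp(-||x1 - x2||^2 / (2 * l^2))."""
    # Pairwise squared Euclidean distances between rows of X1 and X2.
    sqdist = np.sum(X1**2, axis=1).reshape(-1, 1) + np.sum(X2**2, axis=1) - 2 * X1 @ X2.T
    return sigma**2 * np.exp(-0.5 / l**2 * sqdist)
```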
The vertical and horizontal variation parameters are σ and l, respectively (sigma and l in the sketch above).
Then we can draw random samples with zero mean and the covariance matrix computed by the kernel above. Here is an example with a 95% confidence interval:
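A sketch of drawing such prior samples, assuming the rbf_kernel above and matplotlib for plotting (a small jitter is added to the covariance for numerical stability):

```python
import matplotlib.pyplot as plt

X = np.linspace(-5, 5, 100).reshape(-1, 1)
mu = np.zeros(len(X))                                  # zero mean function
cov = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))         # prior covariance (with jitter)
samples = np.random.multivariate_normal(mu, cov, size=3)

std = np.sqrt(np.diag(cov))                            # per-point standard deviation
plt.fill_between(X.ravel(), mu - 1.96 * std, mu + 1.96 * std, alpha=0.2)  # 95% CI
plt.plot(X.ravel(), samples.T)
plt.show()
```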
The uncertainty, shown as the blue region in the plot above, is calculated as 1.96 (for a 95% confidence interval) times the square root of the variance, which is given by the diagonal elements of the covariance matrix.
Let’s assume that we have training data that follows a sine function. The posterior given the training data will be as below:
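A minimal sketch of the posterior computation, translating the equations above directly (it inverts K for simplicity, whereas a more careful implementation would use a Cholesky factorization):

```python
def posterior(X_s, X_train, Y_train, l=1.0, sigma=1.0, sigma_y=1e-8):
    """Posterior mean and covariance at test inputs X_s given (noisy) observations."""
    K = rbf_kernel(X_train, X_train, l, sigma) + sigma_y**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_s, l, sigma)
    K_ss = rbf_kernel(X_s, X_s, l, sigma)
    K_inv = np.linalg.inv(K)
    mu_s = K_s.T @ K_inv @ Y_train        # posterior mean
    cov_s = K_ss - K_s.T @ K_inv @ K_s    # posterior covariance
    return mu_s, cov_s

# Noisy training data drawn from a sine function
X_train = np.linspace(-4, 4, 10).reshape(-1, 1)
Y_train = np.sin(X_train).ravel() + 0.1 * np.random.randn(len(X_train))
mu_s, cov_s = posterior(X, X_train, Y_train, sigma_y=0.1)
```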
We can notice that with noise-free training data the posterior passes exactly through each sample point with zero uncertainty there, whereas with noisy training data the posterior mean is smoothed across the sample points.
Here are some examples of how the different parameters σ, l, and noise affect the result. Vertical and horizontal variation are controlled by σ and l, respectively: a bigger σ scales up the covariance values our kernel returns, so the uncertainty grows more sharply as we move away from the training data, while a bigger l makes points effectively closer to each other, so the uncertainty varies more smoothly along the horizontal axis. Also, the noise level, denoted sigma_y, controls how much the mean function is smoothed out between the training points as it increases.
Lastly, we need to find the optimal hyperparameters, σ, l, and the noise level in our case, by maximizing the log marginal likelihood below:
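For n training points with observations y, the log marginal likelihood takes the standard form:

$$\log p(\mathbf{y} \mid X) = -\frac{1}{2}\,\mathbf{y}^\top (K + \sigma_y^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log\left|K + \sigma_y^2 I\right| - \frac{n}{2}\log 2\pi$$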
In practice, we minimize the negative log marginal likelihood numerically and recompute the posterior with the optimized hyperparameters:
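A minimal sketch using scipy.optimize, reusing rbf_kernel, posterior, X_train, and Y_train from above (again, a more robust implementation would use a Cholesky decomposition rather than inverting K):

```python
from scipy.optimize import minimize

def nll(theta):
    """Negative log marginal likelihood as a function of (l, sigma, sigma_y)."""
    l, sigma, sigma_y = theta
    K = rbf_kernel(X_train, X_train, l, sigma) + sigma_y**2 * np.eye(len(X_train))
    return (0.5 * Y_train @ np.linalg.inv(K) @ Y_train
            + 0.5 * np.linalg.slogdet(K)[1]
            + 0.5 * len(X_train) * np.log(2 * np.pi))

res = minimize(nll, x0=[1.0, 1.0, 0.1], bounds=((1e-5, None),) * 3, method='L-BFGS-B')
l_opt, sigma_opt, sigma_y_opt = res.x
mu_s, cov_s = posterior(X, X_train, Y_train, l_opt, sigma_opt, sigma_y_opt)
```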
Conclusion
The Gaussian process is a very powerful non-parametric method that infers a distribution over functions directly. Even though this method has a major drawback when it comes to huge datasets, it still offers a very simple and neat way of making predictions at a desired confidence level. Also, knowing the uncertainty of an estimate is always a critical piece of information in real-life applications, and it’s becoming a more important research field even in deep learning.
In my opinion, the key concept in the Gaussian process is that it defines how we measure the correlation between two data points. From there, we get to know how confident we can be about a predicted function value compared to the data samples we’ve observed. We used the Gaussian kernel in our case, but there are many different types of kernels that can be used for different purposes.
Thank you very much for reading, and I hope this helps anyone who’s always wondered about the concept of the Gaussian process. I’m open to any comments or discussion, so please leave a note if you have any!
References
[1] Martin Krasser, Gaussian processes, 2018
[2] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
[3] Equivalence of Gaussian process and Bayesian linear regression, Cross Validated, https://stats.stackexchange.com/questions/444049/equivalence-of-gaussian-process-and-bayesian-linear-regression-by-inspecting-the
[4] A Visual Exploration of Gaussian Processes, Distill, 2019, https://distill.pub/2019/visual-exploration-gaussian-processes/