February 2, 2021

Time series forecasting with Spectral Mixture Kernels

Time series modelling is a fundamental yet difficult problem. Forecasting in particular is incredibly challenging and requires strong inductive biases to give good predictions. One powerful framework for encoding inductive biases is kernel functions used with Gaussian Processes (GPs). However, kernels require manual work to embed domain knowledge, which might not always be desirable. One might ask if we can learn kernel structures directly from the data, and indeed the answer is yes!

One powerful and general way to do this is to search over compositions of kernels, which might be familiar to you if you have ever heard of the Automatic Statistician project. Another approach, which we are going to look closer at in this article, is to use a particular kernel called a Spectral Mixture kernel (SM kernel) to learn patterns in the data through its frequency content. We are going to look at why it works, how to apply it using GPyTorch, and how it behaves when we fit it to different data sets to do forecasting.

With this article I am trying a slightly different approach when it comes to code. I will inline a few specific snippets but will not try to provide a comprehensive implementation here. Instead, you can find all the code needed to reproduce the experiments and plots on my GitHub.

Kernels and GPs

Before we dive in, let us briefly recap GPs and kernels. This article assumes familiarity with the concepts, but hopefully this section can serve as a reminder for some. Recall that a (positive definite) kernel \(k : \mathbb{X} \times \mathbb{X} \to \mathbb{R} \) is a function which, when evaluated on two sets of inputs \(x, x'\), induces a kernel matrix with entries \(k_{ij} = k(x_i, x'_j)\) which is positive definite. We can think of such a matrix as a way to model similarity between observations. If \(k_{ij}\) is large we consider \(x_i\) and \(x'_j\) to be similar, and if \(k_{ij}\) is small we consider them dissimilar. When used together with GPs we can use this to model the correlation between observations in function (observed) space, by using kernels as covariance functions in the GP. That is, we can mentally substitute \(x, x'\) for \(y, y'\) and the reasoning above still holds. Using this machinery we can impose a wide range of structural priors over our model.
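
As a small, self-contained illustration (not part of the experiments), here is roughly how one can evaluate a kernel matrix with GPyTorch; the RBF kernel and the inputs are arbitrary choices of mine.

import torch
import gpytorch as gp

# Evaluate an RBF kernel on a few 1D inputs to get a covariance matrix.
x = torch.linspace(0, 1, 5).unsqueeze(-1)
rbf = gp.kernels.RBFKernel()
K = rbf(x, x).evaluate()  # .to_dense() in newer GPyTorch versions
print(K.shape)  # torch.Size([5, 5])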

Figure 1: Some common kernel functions which represent different notions of similarity (or in our case correlation). We can for instance correlate observations if they have the same periodicity or if they have similar values.

As flexible as kernels are, we cannot rely on a notion of similarity with nearby observations when forecasting (there are no nearby observations in the future), so kernels such as the Matérn family are not as useful. However, periodic kernels are still as useful as ever. But how many periodic kernels should one use? We would prefer not to decide a priori, since this could limit our model. Fortunately, we don’t have to.

Spectral Mixture Kernel

The SM kernel allows us to learn all the frequencies in our data at once by learning its spectral density. In fact, it turns out that learning the spectral density itself is equivalent to learning a kernel. This is thanks to Bochner’s theorem, which connects kernel functions to their Fourier transform and is the key idea behind the SM kernel. It implies that any stationary kernel \(k(\tau) = k(x - x')\) is the Fourier transform of its spectral density \(S(s)\)

\[ k(\tau) = \int S(s) e^{2\pi i s^T \tau} ds, \]

and that the inverse relationship is also true

\[ S(s) = \int k(\tau) e^{-2\pi i s^T \tau} d\tau. \]

In other words: if we know the spectral density of a kernel, we know the kernel (and vice versa). This result strongly motivates the approach of the SM kernel. For the kernel to learn the spectral density we need some sort of model, but which to use? It is reasonable to assume a multimodal spectral density, with a few dominating frequencies. In light of this the SM kernel uses a Mixture of Gaussians to model the density

\[S(s) = \sum_{i=1}^Q w_i^2 \left[ \mathcal{N}(s \vert \mu_i, \sigma_i^2) + \mathcal{N}(s \vert - \mu_i, \sigma_i^2) \right]. \]

The symmetry around \(0\) is a technicality to make sure we learn real-valued functions, and the actual quantities of interest to us are \(w\), \(\mu\) and \(\sigma\). For the experiments we are going to use \(Q = 10 \) as in the original paper.
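
For reference, plugging this mixture into Bochner’s theorem gives the SM kernel in closed form (written here for one-dimensional inputs, with the constant from the symmetric pair absorbed into the weights):

\[ k_{\text{SM}}(\tau) = \sum_{i=1}^Q w_i^2 \exp\left(-2\pi^2 \sigma_i^2 \tau^2\right) \cos\left(2\pi \mu_i \tau\right). \]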

Training on toy data

Let us put the SM kernel to use on some toy data to get a feeling for how it works. The GPyTorch library comes with all the functionality we need, and we define a GP for exact inference with a constant (initially zero) mean prior as follows.

import torch
import gpytorch as gp
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.distributions import MultivariateNormal
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal


class GP(gp.models.ExactGP):
    def __init__(self, cov, train_x, train_y):
        super(GP, self).__init__(train_x, train_y, GaussianLikelihood())
        self.mean = gp.means.ConstantMean()  # constant mean, initialised at zero
        self.cov = cov

    def forward(self, x):
        return MultivariateNormal(self.mean(x), self.cov(x))

    def predict(self, x):
        """Returns the posterior mean together with
        the boundaries for the 95% credible intervals.
        """
        self.eval()
        with torch.no_grad():
            pred = self.likelihood(self(x))
            lower, upper = pred.confidence_region()

        return pred.mean, lower, upper

    def spectral_density(self, smk) -> MixtureSameFamily:
        """Returns the Mixture of Gaussians over the spectral density
        of the provided spectral mixture kernel."""
        mus = smk.mixture_means.detach().reshape(-1, 1)
        sigmas = smk.mixture_scales.detach().reshape(-1, 1)
        mix = Categorical(smk.mixture_weights.detach())
        comp = Independent(Normal(mus, sigmas), 1)
        return MixtureSameFamily(mix, comp)

We fit our models by maximising the marginal likelihood, which makes for tractable and stable training but unfortunately only gives point estimates of the hyperparameters and introduces the problem of bad local optima. To combat this we train several models with random restarts and pick the best one.
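
To make this concrete, here is a rough sketch of what such a training loop might look like (the optimiser, learning rate, number of iterations and number of restarts are illustrative choices, not necessarily the ones used for the plots below):

import torch
import gpytorch as gp

def train(model, train_x, train_y, n_iter=500, lr=0.1):
    """Fits the model by maximising the exact marginal log likelihood."""
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mll = gp.mlls.ExactMarginalLogLikelihood(model.likelihood, model)
    for _ in range(n_iter):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
    return loss.item()

def fit_with_restarts(make_model, train_x, train_y, n_restarts=5):
    """Trains several randomly initialised models and keeps the best one."""
    best_model, best_loss = None, float("inf")
    for _ in range(n_restarts):
        model = make_model()  # factory returning a freshly initialised GP
        loss = train(model, train_x, train_y)
        if loss < best_loss:
            best_model, best_loss = model, loss
    return best_model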

As for the data, we are going to fit the model to a sum of sinusoids sampled with a constant sampling period \(T = \frac{1}{14}\) and corrupted by some Gaussian noise:

\[ y(nT) = \sin(2\pi nT) + \sin(6\pi nT) + \sin(10\pi nT) + \epsilon\]

where \(n = 1,2,\dots,36\) and \(\epsilon \sim \mathcal{N}(0, 0.1)\). Note that we are careful to use a sampling frequency \(f_s = \frac{1}{T}= 14\) which is larger than twice the signal bandwidth of \(5\), i.e. above the Nyquist rate of \(10\), to make sure we can capture the entire spectral density. This is good to have in mind when applying the SM kernel to real data as well. The kernel we are going to use is, of course, the SM kernel \(\kappa_1(x,x') = \kappa_{\text{SM}}(x,x')\), which is already implemented for us in GPyTorch. And with that we can fit the model!
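
A minimal sketch of how the toy data and the model could be constructed, reusing the GP class from above (variable names are my own, and the noise term is interpreted as having variance 0.1):

import math
import torch
import gpytorch as gp

# Sum of three sinusoids sampled at f_s = 14, corrupted by Gaussian noise.
T = 1 / 14
n = torch.arange(1, 37).float()
train_x = n * T
train_y = (torch.sin(2 * math.pi * train_x)
           + torch.sin(6 * math.pi * train_x)
           + torch.sin(10 * math.pi * train_x)
           + math.sqrt(0.1) * torch.randn(train_x.shape))

# SM kernel with Q = 10 mixture components, as in the original paper.
smk = gp.kernels.SpectralMixtureKernel(num_mixtures=10)
smk.initialize_from_data(train_x, train_y)
model = GP(smk, train_x, train_y)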

Figure 2: The SM kernel successfully learns the frequency components in the data and extrapolates with high confidence.

We can clearly see that the model has recovered the correct spectral density, and only made use of three of the ten components in the Gaussian mixture. Plotting the model predictions also shows how powerful this is when forecasting. While the predictions degrade over time, the model captures the overall pattern in the data, despite the noise. We can also inspect the learned kernel to confirm that it has captured patterns actually found in the data.

Figure 3: The SM kernel contains patterns we find in the original signal after training.

Alright, we can confirm that the SM kernel allows us to recover frequency content from data. Cool! Let us put this to use on a slightly more realistic dataset.

Mauna Loa

No blog article with GPs in it would be complete without evaluating on the Mauna Loa data set, so it is time to do our part. The Mauna Loa data set contains a single time series of monthly measurements of carbon dioxide in the atmosphere from 1958 onward. There are a few missing observations, which we fill by quadratic interpolation, and we split the data into train/test at the year 1985. We also standardise both features and targets before fitting the model. The reason for standardising the target variable in addition to the input features is that our GP assumes a zero-mean function, and standardising makes this assumption reasonable (alternatively, we could of course use a non-zero prior mean). We fit the GP using the same kernel \(\kappa_1\) as before.
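
The preprocessing is only a few lines; here is a sketch under the assumption that the data sits in a pandas DataFrame `df` with a DatetimeIndex and a "co2" column (the actual loading code is on GitHub):

import torch
import pandas as pd

# Fill the few missing measurements by quadratic interpolation.
df["co2"] = df["co2"].interpolate(method="quadratic")

# Split into train/test at the year 1985.
train_df, test_df = df[df.index < "1985"], df[df.index >= "1985"]

# Standardise both features (time) and targets (CO2) with training statistics,
# so that the zero-mean assumption of the GP prior is reasonable.
t = torch.arange(len(df)).float()
t_train = t[: len(train_df)]
x_mean, x_std = t_train.mean(), t_train.std()
y_mean, y_std = train_df["co2"].mean(), train_df["co2"].std()

train_x = (t_train - x_mean) / x_std
train_y = (torch.tensor(train_df["co2"].values).float() - y_mean) / y_std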

Figure 4: The model successfully learns the periodic pattern of the data. However, it does not pick up on the exponential trend so when projecting into the future the prediction (very slowly) returns to the prior mean. As an aside, we see the tiny impact the Covid-19 pandemic has had on our emissions at the end of the test data. We still have a long way to go.

The model confidently extrapolates with the correct periodicity, but slowly reverts towards the prior mean. It is apparent it has not learned the overall trend in the data. This can be confirmed by inspecting the covariance matrix of the trained model.

Figure 5: The learned SM kernel resembles an RBF with a large lengthscale plus a periodic kernel, which is consistent with its predictions and ability to extrapolate.

It looks like the model has learned an RBF component and a periodic component. But the fit is far from great. What is going on here? Didn’t I claim that SM kernels are great for extrapolation? Remember that Bochner’s theorem only applies to stationary kernels. Hence we can only learn stationary kernel structures. Both the RBF and the periodic kernel happen to be stationary, so these pose no problems, but the rising trend in the data corresponds to a non-stationary kernel which cannot be captured by the SM kernel alone. At this point you might be curious about what the spectral density of real-world data looks like, so let us have a look!

Figure 6: The spectral density reflects the large stationary component together with the frequency components that make up the annual trend.

This spectral density is quite busy compared to that of the sum of sinusoids. We see a strong constant signal and a large smoothed-out body of mass which corresponds to the RBF-like structure, and then a set of frequencies corresponding to the annual pattern. Interestingly, the kernel has decomposed this pattern into quite a few relatively high-frequency components which together add up to the annual trend.

Composite kernels

What if we are not satisfied with the fit we got and want to improve the model? We probably reached the end of the road as far as the SM kernel goes, but nothing prevents us from adding additional structure to the model by combining it with other kernels. Let us see if we can improve the fit by using the composite kernel

\[ \kappa_2 = \kappa_{\text{SM}} + \kappa_{\text{Poly}} + \kappa_{\text{RBF}}. \]

\(\kappa_2\) adds a polynomial kernel to model the rising trend and an RBF kernel to model local changes. This lets the SM kernel focus on only learning the periodic contents in the data. Indeed, this kernel results in a much better fit, although it does overestimate the amount of CO2.
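
In GPyTorch, kernels can be combined with ordinary arithmetic, so a sketch of \(\kappa_2\) could look like the following (the polynomial degree and the use of ScaleKernel wrappers are my own illustrative choices):

import gpytorch as gp

# SM kernel for the periodic content, a polynomial kernel for the rising
# trend and an RBF kernel for local variations.
k_sm = gp.kernels.SpectralMixtureKernel(num_mixtures=10)
k_poly = gp.kernels.ScaleKernel(gp.kernels.PolynomialKernel(power=2))
k_rbf = gp.kernels.ScaleKernel(gp.kernels.RBFKernel())

kappa_2 = k_sm + k_poly + k_rbf
model = GP(kappa_2, train_x, train_y)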

Figure 7: Model fit using \(\kappa_2\). The model now tracks the rising trend and more accurately explains the data. However, it overshoots the true values, and the observed data barely makes it within the 95% confidence band.

Finally, let us inspect the covariance matrix and the spectral density of \(\kappa_2\). They look quite different to those of \(\kappa_1\).

Figure 8: The learned kernel now has polynomial and periodic structure, but no strong local structure.

Figure 9: Spectral density of \(\kappa_2\). Compared to \(\kappa_1\) it does not have a large constant component, which corresponds to the lack of local correlation in the covariance matrix.

Ending notes

In this blog post we have seen how the SM kernel works and got a feel for how it behaves in practice. We have seen that the SM kernel can approximate any stationary kernel from data, in particular periodic kernels, which is powerful when extrapolating. While we managed to fit a pretty nice model to the Mauna Loa data set, the SM kernel did not manage to capture the non-stationary trend, so we had to combine it with some additional kernels. Interestingly, the authors of the original paper claim to get a nice fit with a GP using only an SM kernel on the Mauna Loa data set. While the details in the paper are somewhat vague, this is surprising to me, since it should not be able to capture the increasing, non-stationary trend. Perhaps my reasoning is wrong here and if so, I would be happy to be corrected on this.

The theory behind the SM kernel rests on Bochner’s Theorem which assumes a stationary kernel. We saw that this can be limiting, but fortunately this constraint can be relaxed to handle non-stationary kernels as well. Finally, I would like to point out that there exists an errata sheet for the original paper which fixes a typo in the equation for the SM kernel, and thank @parskatt for helpful discussions.

Thank you for reading! I hope you found it interesting.

© Sebastian Callh 2020