Appendix C — Estimation and Prediction for Dynamic Spatio-Temporal Models

Estimation and prediction for linear dynamic spatio-temporal models (DSTMs) with Gaussian errors can sometimes be carried out using methods developed for state-space models in time series (when there are many more time points than spatial locations). In particular, after conditioning on parameter estimates, the hidden (state) process can be predicted using a Kalman filter or smoother, and the parameters can be estimated using an expectation-maximization (EM) algorithm or a Markov chain Monte Carlo (MCMC) algorithm. This appendix presents, first, a method-of-moments estimation approach that is common in vector autoregressive modeling in time series and, second, a detailed description of parameter estimation and prediction of the process in linear DSTMs with Gaussian errors using the Kalman filter, the Kalman smoother, and the EM algorithm.

C.1 Estimation in Vector Autoregressive Spatio-Temporal Models via the Method of Moments

In traditional vector autoregressive (VAR) time-series applications, the autoregressive process is assumed to correspond directly to the data-generating process (i.e., there is no separate data model and process model). In the spatio-temporal context this implies a model such as

\[ \mathbf{Z}_t = \mathbf{M}\mathbf{Z}_{t-1} + \boldsymbol{\eta}_t,\quad \boldsymbol{\eta}_t \sim \; Gau(\mathbf{0},\mathbf{C}_{\eta}), \tag{C.1}\]

for \(t=1,\ldots,T\), where we assume that \(\mathbf{Z}_0\) is known, and recall that \(\mathbf{Z}_t = (Z_t(\mathbf{s}_1),\ldots,Z_t(\mathbf{s}_m))'\). Estimates of the matrices \(\mathbf{M}\) and \(\mathbf{C}_{\eta}\) can be obtained via maximum likelihood, least squares, or the method of moments (see Lütkepohl, 2005, Chapter 3). We illustrate the latter here.

For simplicity, we assume \(\{\mathbf{Z}_t\}\) has mean zero and is second-order stationary in time. If we post-multiply both sides of Equation C.1 by \(\mathbf{Z}'_{t-1}\) and take expectations (noting that \(\boldsymbol{\eta}_t\) is independent of \(\mathbf{Z}_{t-1}\), so that \(E(\boldsymbol{\eta}_t \mathbf{Z}'_{t-1}) = \mathbf{0}\)), we get

\[ E(\mathbf{Z}_t \mathbf{Z}'_{t-1}) = \mathbf{M}E(\mathbf{Z}_{t-1} \mathbf{Z}'_{t-1}), \]

which we write as

\[ \mathbf{C}_z^{(1)} = \mathbf{M}\mathbf{C}_z^{(0)}. \tag{C.2}\]

Recall from Chapter 2 that \(\mathbf{C}_z^{(\tau)}\) is the lag-\(\tau\) spatial covariance matrix for \(\{\mathbf{Z}_t\}\). Now, Equation C.2 implies that

\[ \mathbf{M}= \mathbf{C}_z^{(1)} (\mathbf{C}_z^{(0)})^{-1}. \tag{C.3}\]

Similarly, if we post-multiply Equation C.1 by \(\mathbf{Z}'_t\) and take expectations, then, since \(E(\boldsymbol{\eta}_t \mathbf{Z}'_t) = E(\boldsymbol{\eta}_t(\mathbf{M}\mathbf{Z}_{t-1} + \boldsymbol{\eta}_t)') = \mathbf{C}_{\eta}\) and \(E(\mathbf{Z}_{t-1} \mathbf{Z}'_t) = \mathbf{C}_z^{(1)'}\), we can show that

\[ \mathbf{C}_{\eta} = \mathbf{C}_z^{(0)} - \mathbf{M}\mathbf{C}_z^{(1)'} = \mathbf{C}_z^{(0)} - \mathbf{C}_z^{(1)} (\mathbf{C}_z^{(0)})^{-1} \mathbf{C}_z^{(1)'}. \tag{C.4}\]

It follows that the method-of-moments estimators (where empirical moments are equated with theoretical moments) of Equation C.3 and Equation C.4 are given by

\[ \widehat{\mathbf{M}} = \widehat{\mathbf{C}}_z^{(1)} (\widehat{\mathbf{C}}_z^{(0)})^{-1}, \tag{C.5}\]

\[ \widehat{\mathbf{C}}_\eta = \widehat{\mathbf{C}}_z^{(0)} - \widehat{\mathbf{C}}_z^{(1)} (\widehat{\mathbf{C}}_z^{(0)})^{-1} \widehat{\mathbf{C}}_z^{(1)'}. \tag{C.6}\]

In Equation C.5 and Equation C.6, the empirical lag-\(\tau\) covariance matrices, \(\widehat{\mathbf{C}}_z^{(\tau)}\) (here \(\tau = 0, 1\)), are calculated as shown in Equation 2.4. Note that \(T\) needs to be larger than the dimension of \(\mathbf{Z}_t\) to ensure that \(\widehat{\mathbf{C}}_z^{(0)}\) is invertible.
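For concreteness, below is a minimal base-R sketch of the method-of-moments estimators in Equation C.5 and Equation C.6, assuming the data have already been mean-centered and are stored column-wise in an \(m \times T\) matrix. The function name `mom_var1()` is illustrative (it is not from the book's Labs or any package), and the divisor \(T\) used for the empirical covariances is one common convention that may differ slightly from Equation 2.4.

```r
## Minimal sketch: method-of-moments estimates (C.5)-(C.6) for the VAR(1)
## model (C.1). Z is an m x T matrix whose t-th column is the (mean-centered)
## data vector Z_t. Function and object names are illustrative.
mom_var1 <- function(Z) {
  T  <- ncol(Z)
  C0 <- tcrossprod(Z) / T                            # empirical lag-0 covariance
  C1 <- tcrossprod(Z[, 2:T], Z[, 1:(T - 1)]) / T     # empirical lag-1 covariance
  C0_inv   <- solve(C0)                              # requires T > m (see text)
  M_hat    <- C1 %*% C0_inv                          # Equation (C.5)
  Ceta_hat <- C0 - C1 %*% C0_inv %*% t(C1)           # Equation (C.6)
  list(M = M_hat, C_eta = Ceta_hat)
}
```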

As we have said throughout this book, we prefer to consider DSTMs that have separate data and process models; estimation for these models is described below in Section C.2. So, what is the benefit of the method-of-moments approach in the context of DSTMs? In cases where the signal-to-noise ratio is high, the estimators in Equation C.5 and Equation C.6 can provide reasonable estimates for exploratory data analysis; an example using method-of-moments estimation is given in Lab 5.3.

Specifically, assume that we project the spatial-mean-centered data onto orthogonal basis functions, \(\boldsymbol{\Phi}\): \(\boldsymbol{\alpha}_t = \boldsymbol{\Phi}' (\mathbf{Z}_t - \widehat{\boldsymbol{\mu}})\). We then assume that the projected data come from the model \(\boldsymbol{\alpha}_t = \mathbf{M}\boldsymbol{\alpha}_{t-\tau} + \boldsymbol{\eta}_t\), and we obtain estimates \(\widehat{\mathbf{M}}\) and \(\widehat{\mathbf{C}}_\eta\) based on the projected data. One can then produce forecasts such as \(\widehat{\boldsymbol{\alpha}}_{T+\tau} = \widehat{\mathbf{M}} \widehat{\boldsymbol{\alpha}}_T\), with estimated forecast covariance matrix \(\widehat{\mathbf{C}}_{\alpha} = \widehat{\mathbf{M}} \widehat{\mathbf{C}}_\alpha^{(0)} \widehat{\mathbf{M}}' + \widehat{\mathbf{C}}_\eta\), where \(\widehat{\mathbf{C}}_\alpha^{(0)}\) is the empirical estimate of \(E(\boldsymbol{\alpha}_t \boldsymbol{\alpha}_t')\). To obtain a forecast \(\widehat{\mathbf{Z}}_{T+\tau}\), one multiplies the forecast \(\widehat{\boldsymbol{\alpha}}_{T+\tau}\) by the basis-function matrix and adds back the spatial mean: \(\widehat{\mathbf{Z}}_{T+\tau} = \widehat{\boldsymbol{\mu}} + \boldsymbol{\Phi}\widehat{\boldsymbol{\alpha}}_{T+\tau}\). The forecast covariance matrix is then approximated by \(\widehat{\mathbf{C}}_{Z} = \boldsymbol{\Phi}\widehat{\mathbf{C}}_{\alpha} \boldsymbol{\Phi}'\), where we have ignored the truncation error and the measurement error incurred when projecting onto the basis functions. Although this procedure is somewhat ad hoc, it is simple and can give a quick forecast; a minimal code sketch is given below. More importantly, the resulting parameter estimates can be used as starting values for the state-space EM algorithm described in Section C.2; this is demonstrated in the second portion of Lab 5.3.
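As an illustration of the procedure just described, the following base-R sketch treats the lag-1 case (\(\tau = 1\)) and reuses the hypothetical `mom_var1()` helper above; `Z` is an \(m \times T\) data matrix and `Phi` an \(m \times p\) matrix of orthogonal basis functions, and all object names are illustrative.

```r
## Minimal sketch of the ad hoc basis-function forecasting procedure (tau = 1),
## reusing the mom_var1() helper sketched above. All names are illustrative.
mu_hat  <- rowMeans(Z)                               # empirical spatial mean
alpha   <- crossprod(Phi, Z - mu_hat)                # projected data (p x T)
fit     <- mom_var1(alpha)                           # estimates of M and C_eta
C0_a    <- tcrossprod(alpha) / ncol(alpha)           # empirical estimate of E(alpha_t alpha_t')
alpha_f <- fit$M %*% alpha[, ncol(alpha)]            # forecast of alpha_{T+1}
C_alpha <- fit$M %*% C0_a %*% t(fit$M) + fit$C_eta   # forecast covariance of alpha_{T+1}
Z_f     <- mu_hat + Phi %*% alpha_f                  # forecast of Z_{T+1}
C_Z     <- Phi %*% C_alpha %*% t(Phi)                # approx. forecast covariance (ignores truncation/measurement error)
```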

For completeness, note that when one makes the assumption that the initial spatial data vector \(\mathbf{Z}_0\) is known, it can be shown that, conditional on \(\mathbf{Z}_0\), maximum likelihood, least squares, and method-of-moments estimation all give equivalent estimates, \(\widehat{\mathbf{M}}\) and \(\widehat{\mathbf{C}}_\eta\) (see, for example, Harvey, 1993, Section 7.4).

C.2 Prediction and Estimation in Fully Parameterized Linear DSTMs

From the data model,

\[ \mathbf{Z}_t = \mathbf{H}_t \mathbf{Y}_t + \boldsymbol{\varepsilon}_t,\quad \boldsymbol{\varepsilon}_t \sim \; Gau(\mathbf{0},\mathbf{C}_{\epsilon,t}), \tag{C.7}\]

and from the process model,

\[ \mathbf{Y}_t = \mathbf{M}\mathbf{Y}_{t-1} + \boldsymbol{\eta}_t,\quad \boldsymbol{\eta}_t \sim \; Gau(\mathbf{0},\mathbf{C}_{\eta}), \tag{C.8}\]

we obtain a hierarchical model (HM). Note that, to simplify the exposition, we have assumed here that there is no additive offset in the data model and that the process has mean zero. If the parameter matrices are all known, prediction of the hidden process can be carried out via the Kalman filter and Kalman smoother. In practice, they are not known, and estimates are sometimes used in their place, which corresponds to an empirical hierarchical model (EHM) approach. Note that although in general \(\mathbf{M}\) could depend on time (and hence would be written as \(\mathbf{M}_t\)), we consider the simpler time-invariant case here.
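To make the model concrete, here is a minimal simulation sketch from the data model (Equation C.7) and the process model (Equation C.8) in base R; the dimensions and parameter values are arbitrary illustrative choices (they are not taken from the book's Labs), and the simulated objects are reused in the sketches below.

```r
## Minimal simulation sketch from the data model (C.7) and process model (C.8).
## Dimensions and parameter values are illustrative choices only.
set.seed(1)
n <- 10; m <- 15; T <- 100                           # state dim., no. of observation locations, no. of time points
M  <- 0.7 * diag(n)                                  # transition matrix (stable)
Cn <- 0.5 * diag(n)                                  # innovation covariance C_eta
Ce <- 0.2 * diag(m)                                  # measurement-error covariance C_epsilon
H  <- matrix(rnorm(m * n), m, n)                     # observation (mapping) matrix, time-invariant here
Y  <- matrix(0, n, T); Z <- matrix(0, m, T)
Y_prev <- rnorm(n)                                   # Y_0
for (t in 1:T) {
  Y[, t] <- M %*% Y_prev + t(chol(Cn)) %*% rnorm(n)  # process model (C.8)
  Z[, t] <- H %*% Y[, t] + t(chol(Ce)) %*% rnorm(m)  # data model (C.7)
  Y_prev <- Y[, t]
}
```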

Sequential Prediction of the Process via Kalman Filtering and Smoothing

In Chapter 1 we discussed the notions of smoothing, filtering, and forecasting. Before we show the filtering and smoothing distributions and algorithms, we need to define some notation and terms. In particular, let \(\mathbf{w}_{c:d} \equiv \{\mathbf{w}_c,\ldots,\mathbf{w}_d\}\), for the generic vector \(\mathbf{w}_t\) at times \(t \in \{c, c+1,\ldots,d-1,d\}\). Then we define the forecasting distribution to be the distribution of \(\mathbf{Y}_t\) given all of the observations that occur before time \(t\), namely, \([\mathbf{Y}_t | \mathbf{Z}_{1:t-1}]\). We also define the filtering distribution to be the distribution of \(\mathbf{Y}_t\) given all of the observations up to and including time \(t\), namely, \([\mathbf{Y}_t | \mathbf{Z}_{1:t}]\). Finally, we define the smoothing distribution to be the distribution of \(\mathbf{Y}_t\) given all the observations before, including, and after time \(t\), namely, \([\mathbf{Y}_t | \mathbf{Z}_{1:T}]\), for \(1 \leq t \leq T\).

The forecasting distribution is of most interest when one would like to predict the process one time step into the future; the filtering distribution is typically most useful when one seeks to “filter out” observation error from the true process as data arrive sequentially (e.g., in real time); and the smoothing distribution is most useful when one retrospectively wants to smooth out the observation errors at any time in the entire observation period. Now, consider the following notation for the conditional expectations of the forecast and filtering distributions, respectively: \(\mathbf{Y}_{t | t-1} \equiv E[\mathbf{Y}_t | \mathbf{Z}_{1:t-1}]\) and \(\mathbf{Y}_{t | t} \equiv E[\mathbf{Y}_t | \mathbf{Z}_{1:t}]\). Similarly, define the conditional covariance matrices for the forecast-error and filtering-error distributions, respectively, as \(\mathbf{P}_{t|t-1} \equiv E[(\mathbf{Y}_t - \mathbf{Y}_{t|t-1})(\mathbf{Y}_t - \mathbf{Y}_{t|t-1})' | \mathbf{Z}_{1:t-1}]\) and \(\mathbf{P}_{t|t} \equiv E[(\mathbf{Y}_t - \mathbf{Y}_{t|t})(\mathbf{Y}_t - \mathbf{Y}_{t|t})' | \mathbf{Z}_{1:t}]\).

In the case of linear Gaussian data models and process models given by Equation C.7 and Equation C.8, the forecast and filtering distributions can be found analytically by using standard conditional expectation/variance relationships and Bayes’ Rule, respectively. In particular, the forecast and filtering distributions are denoted, respectively, by

\[ \mathbf{Y}_t | \mathbf{Z}_{1:t-1} \sim \; Gau(\mathbf{Y}_{t | t-1},\mathbf{P}_{t | t-1}), \]

and

\[ \mathbf{Y}_t | \mathbf{Z}_{1:t} \sim \; Gau(\mathbf{Y}_{t | t},\mathbf{P}_{t | t}), \]

and they can be found through the famous Kalman filter algorithm given in Note C.1. Thus, given the initial conditions \(\mathbf{Y}_{0 | 0} \equiv \boldsymbol{\mu}_0\) and \(\mathbf{P}_{0 | 0} \equiv \mathbf{C}_0\) and the parameter matrices, \(\{\mathbf{H}_t\}_{t=1}^T\), \(\{\mathbf{C}_{\epsilon,t}\}_{t=1}^T\), \(\mathbf{M}\), and \(\mathbf{C}_{\eta}\), one can iterate sequentially between the forecast and filtering steps to obtain these distributions for all times \(t=1,\ldots,T\).

Note C.1: Kalman Filter
  • Set initial conditions: \(\mathbf{Y}_{0 | 0} = \boldsymbol{\mu}_0\) and \(\mathbf{P}_{0 | 0} = \mathbf{C}_0\)
  • for \(t = 1\) to \(T\) do
    1. Forecast distribution step:
      1. Obtain \(\mathbf{Y}_{t | t-1} = \mathbf{M}\mathbf{Y}_{t-1 | t-1}\)
      2. Obtain \(\mathbf{P}_{t | t-1} = \mathbf{C}_{\eta} + \mathbf{M}\mathbf{P}_{t-1 | t-1} \mathbf{M}'\)
    2. Filtering distribution step:
      1. Obtain the Kalman gain, \(\mathbf{K}_t \equiv \mathbf{P}_{t | t-1} \mathbf{H}'_t (\mathbf{H}_t \mathbf{P}_{t | t-1} \mathbf{H}'_t + \mathbf{C}_{\epsilon,t})^{-1}\)
      2. Obtain \(\mathbf{Y}_{t|t} = \mathbf{Y}_{t | t-1} + \mathbf{K}_t (\mathbf{Z}_t - \mathbf{H}_t \mathbf{Y}_{t | t-1})\)
      3. Obtain \(\mathbf{P}_{t | t} = (\mathbf{I}- \mathbf{K}_t \mathbf{H}_t)\mathbf{P}_{t | t-1}\)
    end for
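Below is a minimal base-R sketch of the Kalman filter in Note C.1, written for a time-invariant observation matrix and measurement-error covariance for brevity; the function and argument names are illustrative and not from any package.

```r
## Minimal sketch of the Kalman filter in Note C.1. Z is an m x T data matrix;
## H, Ce, M, Cn, mu0, and C0 correspond to H_t, C_epsilon, M, C_eta, mu_0, and C_0.
kalman_filter <- function(Z, H, Ce, M, Cn, mu0, C0) {
  n <- length(mu0); T <- ncol(Z)
  Y_fore <- Y_filt <- matrix(0, n, T)                # forecast and filtered means
  P_fore <- P_filt <- array(0, c(n, n, T))           # forecast and filtered covariances
  K_T <- NULL                                        # Kalman gain at t = T (needed later for the EM algorithm)
  Y_prev <- mu0; P_prev <- C0
  for (t in 1:T) {
    Y_fore[, t]   <- M %*% Y_prev                    # forecast distribution step
    P_fore[, , t] <- Cn + M %*% P_prev %*% t(M)
    K <- P_fore[, , t] %*% t(H) %*%
      solve(H %*% P_fore[, , t] %*% t(H) + Ce)       # Kalman gain
    Y_filt[, t]   <- Y_fore[, t] + K %*% (Z[, t] - H %*% Y_fore[, t])   # filtering step
    P_filt[, , t] <- (diag(n) - K %*% H) %*% P_fore[, , t]
    Y_prev <- Y_filt[, t]; P_prev <- P_filt[, , t]
    if (t == T) K_T <- K
  }
  list(Y_fore = Y_fore, P_fore = P_fore, Y_filt = Y_filt, P_filt = P_filt,
       K_T = K_T, mu0 = mu0, C0 = C0)
}
```

For example, with the simulated `Z`, `H`, `Ce`, `M`, and `Cn` from the sketch following Equation C.8, `kf <- kalman_filter(Z, H, Ce, M, Cn, mu0 = rep(0, n), C0 = diag(n))` returns the forecast and filtering means and covariances for \(t = 1,\ldots,T\).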

Recall that the smoothing distribution considers the distribution of the process at time \(t\) given all of the observations regardless of whether they come before, during, or after time \(t\). This smoothing distribution is denoted by

\[ \mathbf{Y}_t | \mathbf{Z}_{1:T} \sim \; Gau(\mathbf{Y}_{t | T}, \mathbf{P}_{t | T}) \]

and, if one saves the results from the Kalman filter, this can be obtained for all \(t\) by the Kalman smoother algorithm (also known as the Rauch–Tung–Striebel smoother) given in Note C.2.

Note C.2: Kalman Smoother
  • Obtain \(\{\mathbf{Y}_{t|t-1}, \mathbf{P}_{t|t-1}\}_{t=1}^T\) and \(\{\mathbf{Y}_{t|t}, \mathbf{P}_{t|t}\}_{t=0}^T\) from the Kalman filter algorithm (Note C.1).
  • for \(t = T-1\) down to \(0\) do
    1. Obtain \(\mathbf{J}_t \equiv \mathbf{P}_{t|t} \; \mathbf{M}' \; \mathbf{P}_{t+1|t}^{-1}\).
    2. Obtain \(\mathbf{Y}_{t|T} = \mathbf{Y}_{t|t} + \mathbf{J}_t (\mathbf{Y}_{t+1|T} - \mathbf{Y}_{t+1|t})\).
    3. Obtain \(\mathbf{P}_{t|T} = \mathbf{P}_{t|t} + \mathbf{J}_t (\mathbf{P}_{t+1|T} - \mathbf{P}_{t+1 | t})\mathbf{J}'_t\).
    end for
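A corresponding base-R sketch of the Kalman smoother in Note C.2 is given below; it takes the output of the hypothetical `kalman_filter()` function above and also returns the smoother gains \(\mathbf{J}_t\) (including \(\mathbf{J}_0\)) and the smoothed time-0 quantities, which are needed by the EM algorithm in Note C.3.

```r
## Minimal sketch of the Kalman smoother in Note C.2, using the output (kf) of
## the kalman_filter() sketch above and the transition matrix M.
kalman_smoother <- function(kf, M) {
  n <- nrow(kf$Y_filt); T <- ncol(kf$Y_filt)
  Y_s <- matrix(0, n, T); P_s <- array(0, c(n, n, T))
  J   <- array(0, c(n, n, T))                        # smoother gains J_t, t = 1, ..., T-1
  Y_s[, T] <- kf$Y_filt[, T]; P_s[, , T] <- kf$P_filt[, , T]
  for (t in (T - 1):1) {
    J[, , t]   <- kf$P_filt[, , t] %*% t(M) %*% solve(kf$P_fore[, , t + 1])
    Y_s[, t]   <- kf$Y_filt[, t] + J[, , t] %*% (Y_s[, t + 1] - kf$Y_fore[, t + 1])
    P_s[, , t] <- kf$P_filt[, , t] +
      J[, , t] %*% (P_s[, , t + 1] - kf$P_fore[, , t + 1]) %*% t(J[, , t])
  }
  ## smoothed quantities at t = 0, using the initial conditions mu0 and C0
  J0   <- kf$C0 %*% t(M) %*% solve(kf$P_fore[, , 1])
  Y0_s <- kf$mu0 + J0 %*% (Y_s[, 1] - kf$Y_fore[, 1])
  P0_s <- kf$C0 + J0 %*% (P_s[, , 1] - kf$P_fore[, , 1]) %*% t(J0)
  list(Y_smooth = Y_s, P_smooth = P_s, J = J, J0 = J0,
       Y0_smooth = Y0_s, P0_smooth = P0_s)
}
```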

Parameter Estimation via the EM Algorithm

The state-space approach discussed above in terms of the Kalman filter and smoother makes the assumption that the parameter matrices in the data and process models are known. This is unrealistic in most cases, and one must use the data to estimate these parameters; that is, the HM being used is an EHM. One of the most popular (and effective) ways to do this in the state-space time-series case is through the EM algorithm (recall the general EM algorithm presented in Note 4.4).

The state-space version of the EM algorithm was originally developed by Shumway & Stoffer (1982). In this setting, the observations \(\mathbf{Z}_{1:T}\) together with the unobservable latent process \(\mathbf{Y}_{0:T}\) constitute the “complete data,” with \(\mathbf{Y}_{0:T}\) playing the role of the “missing data.” Denote the parameters by \(\boldsymbol{\Theta}\equiv \{\boldsymbol{\mu}_0, \mathbf{C}_0, \mathbf{C}_\eta, \mathbf{C}_{\epsilon},\mathbf{M}\}\), where we typically assume that the observation matrices, \(\{\mathbf{H}_t\}\), are known. We assume that the initial distribution is given by \(\mathbf{Y}_{0} \sim \; Gau(\boldsymbol{\mu}_0, \mathbf{C}_0)\), and we further assume here (for simplicity) that \(\mathbf{C}_{\epsilon}\) is the \(m \times m\) measurement-error covariance matrix corresponding to all possible observation locations (thus, \(\mathbf{C}_{\epsilon,t} = \mathbf{C}_{\epsilon}\) and \(m_t = m\) for all \(t\); that is, there are no missing observations at any time point).

The EM algorithm is then based on the complete-data likelihood, \([\mathbf{Z}_{1:T}, \mathbf{Y}_{0:T} | \boldsymbol{\Theta}] = \left(\prod_{t=1}^T [\mathbf{Z}_t | \mathbf{Y}_t]\right) \left(\prod_{t=1}^T [\mathbf{Y}_t | \mathbf{Y}_{t-1}]\right) [\mathbf{Y}_0]\), which again makes use of the conditional independencies in the data model and the Markov property of the process model. The EM algorithm for a linear DSTM, presented in Note C.3, uses the Kalman smoother (Note C.2) to evaluate the conditional expectations required in the E-step, which are then used in the M-step updates. Note that, in addition to running the Kalman smoother at each iteration of the algorithm, we also have to obtain the so-called “lag-one smoother” covariance matrices, \(\mathbf{P}_{t,t-1|T} \equiv E((\mathbf{Y}_t - \mathbf{Y}_{t|T})(\mathbf{Y}_{t-1} - \mathbf{Y}_{t-1|T})' | \mathbf{Z}_{1:T})\), for \(t=T,T-1,\ldots,1\). These are computed by the so-called lag-one covariance smoother, which is part of the algorithm in Note C.3. Convergence can be assessed by monitoring changes in the parameter estimates and/or in the log complete-data likelihood (see Equation C.9 in Note C.3); in the linear DSTM case, one typically considers the latter because there is a large number of parameters. An example of using the EM algorithm for a linear DSTM is given in Lab 5.3.

Note C.3: Linear DSTM EM Algorithm
  • Choose initial condition covariance matrix, \(\mathbf{C}_0\)
  • Choose starting values: \(\widehat{\boldsymbol{\Theta}}^{(0)} =\{\widehat{\boldsymbol{\mu}}_0^{(0)}, \widehat{\mathbf{C}}_{\eta}^{(0)}, \widehat{\mathbf{C}}_{\epsilon}^{(0)}, \widehat{\mathbf{M}}^{(0)} \}\)
  • repeat \(i=1,2,\ldots\)
  1. E-step:
    • Use \(\widehat{\boldsymbol{\Theta}}^{(i-1)}\) in the Kalman smoother (Note C.2) to obtain \(\{\mathbf{Y}_{t|T}^{(i-1)}, \mathbf{P}_{t|T}^{(i-1)}\}\)

    • Use Kalman smoother output to obtain the lag-one covariance smoother estimates

    • Calculate \(\mathbf{P}^{(i-1)}_{T,T-1|T} = (\mathbf{I}- \mathbf{K}^{(i-1)}_T \mathbf{H}_T)\mathbf{M}^{(i-1)} \mathbf{P}^{(i-1)}_{T-1|T-1}\)

    • for \(t=T,T-1,\ldots,2\) do \[ \mathbf{P}^{(i-1)}_{t-1,t-2|T} = \mathbf{P}^{(i-1)}_{t-1|t-1} \mathbf{J}^{(i-1)'}_{t-2} + \mathbf{J}^{(i-1)}_{t-1}(\mathbf{P}^{(i-1)}_{t,t-1|T} - \mathbf{M}^{(i-1)} \mathbf{P}^{(i-1)}_{t-1|t-1})\mathbf{J}^{(i-1)'}_{t-2} \]

      end for

    • Calculate \(\mathbf{S}_{00} \equiv \sum_{t=1}^T (\mathbf{P}^{(i-1)}_{t-1 | T} + \mathbf{Y}^{(i-1)}_{t-1|T} \mathbf{Y}^{(i-1)'}_{t-1|T})\)

    • Calculate \(\mathbf{S}_{11} \equiv \sum_{t=1}^T (\mathbf{P}^{(i-1)}_{t | T} + \mathbf{Y}^{(i-1)}_{t|T} \mathbf{Y}^{(i-1)'}_{t|T})\)

    • Calculate \(\mathbf{S}_{10} \equiv \sum_{t=1}^T (\mathbf{P}^{(i-1)}_{t,t-1 | T} + \mathbf{Y}^{(i-1)}_{t|T} \mathbf{Y}^{(i-1)'}_{t-1|T})\)

  2. M-step:
    • Update: \(\widehat{\boldsymbol{\mu}}_0^{(i)} = \mathbf{Y}^{(i-1)}_{0|T}\)
    • Update: \(\widehat{\mathbf{M}}^{(i)} = \mathbf{S}_{10}\mathbf{S}_{00}^{-1}\)
    • Update: \(\widehat{\mathbf{C}}_{\eta}^{(i)} = (1/T)(\mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{S}_{00}^{-1} \mathbf{S}_{10}')\)
    • Update: \[ \widehat{\mathbf{C}}_{\epsilon}^{(i)} = \frac1T \sum_{t=1}^T ((\mathbf{Z}_t - \mathbf{H}_t \mathbf{Y}^{(i-1)}_{t|T})(\mathbf{Z}_t - \mathbf{H}_t \mathbf{Y}^{(i-1)}_{t|T})' + \mathbf{H}_t \mathbf{P}^{(i-1)}_{t|T} \mathbf{H}'_t) \]
    until convergence (typically based on differences in \(-2\ln(L(\boldsymbol{\Theta}| \mathbf{Z}_{1:T},\mathbf{Y}_{0:T}))\) across iterations, as calculated in Equation C.9):

\[ \small \begin{aligned} -2 \ln(L(\boldsymbol{\Theta}^{(i)} | \mathbf{Z}_{1:T},\mathbf{Y}^{(i)}_{0:T})) &= \ln(|\widehat{\mathbf{C}}^{(i)}_0|) + (\mathbf{Y}^{(i)}_{0|T} - \widehat{\boldsymbol{\mu}}^{(i)}_0)'(\widehat{\mathbf{C}}^{(i)}_0)^{-1}(\mathbf{Y}^{(i)}_{0|T} - \widehat{\boldsymbol{\mu}}^{(i)}_0) \\ &\quad + T \ln(|\widehat{\mathbf{C}}^{(i)}_{\eta}|) + \sum_{t=1}^T (\mathbf{Y}^{(i)}_{t|T} - \widehat{\mathbf{M}}^{(i)} \mathbf{Y}^{(i)}_{t-1|T})'(\widehat{\mathbf{C}}^{(i)}_{\eta})^{-1}(\mathbf{Y}^{(i)}_{t|T} - \widehat{\mathbf{M}}^{(i)} \mathbf{Y}^{(i)}_{t-1|T}) \\ &\quad + T \ln(|\widehat{\mathbf{C}}^{(i)}_{\epsilon}|) + \sum_{t=1}^T (\mathbf{Z}_t - \mathbf{H}_t \mathbf{Y}^{(i)}_{t|T})'(\widehat{\mathbf{C}}^{(i)}_{\epsilon})^{-1}(\mathbf{Z}_t - \mathbf{H}_t \mathbf{Y}^{(i)}_{t|T}). \end{aligned} \tag{C.9}\]
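To connect Note C.3 to code, the following base-R sketch implements a single EM iteration (the E-step sums and the M-step updates), reusing the hypothetical `kalman_filter()` and `kalman_smoother()` functions above. It assumes a time-invariant \(\mathbf{H}\), no missing data, and a fixed \(\mathbf{C}_0\); the names are illustrative, and this is a sketch rather than the Lab 5.3 implementation.

```r
## Minimal sketch of one iteration of the EM algorithm in Note C.3.
## Theta is a list with elements mu0, M, Cn (= C_eta), and Ce (= C_epsilon);
## Z is an m x T data matrix, H an m x n observation matrix, C0 is fixed.
em_step <- function(Z, H, Theta, C0) {
  n <- nrow(Theta$M); T <- ncol(Z)
  ## E-step: Kalman filter and smoother at the current parameter estimates
  kf <- kalman_filter(Z, H, Theta$Ce, Theta$M, Theta$Cn, Theta$mu0, C0)
  ks <- kalman_smoother(kf, Theta$M)
  Ys <- cbind(ks$Y0_smooth, ks$Y_smooth)             # smoothed means, t = 0, ..., T
  Ps <- array(0, c(n, n, T + 1))                     # smoothed covariances, t = 0, ..., T
  Ps[, , 1] <- ks$P0_smooth; Ps[, , 2:(T + 1)] <- ks$P_smooth
  ## lag-one covariance smoother: Pcc[, , t] holds P_{t,t-1|T}
  Pcc <- array(0, c(n, n, T))
  Pcc[, , T] <- (diag(n) - kf$K_T %*% H) %*% Theta$M %*% kf$P_filt[, , T - 1]
  for (t in T:2) {
    Jm2 <- if (t >= 3) ks$J[, , t - 2] else ks$J0    # J_{t-2} (J_0 when t = 2)
    Pcc[, , t - 1] <- kf$P_filt[, , t - 1] %*% t(Jm2) +
      ks$J[, , t - 1] %*% (Pcc[, , t] - Theta$M %*% kf$P_filt[, , t - 1]) %*% t(Jm2)
  }
  ## E-step sums S_00, S_11, S_10
  S00 <- S11 <- S10 <- matrix(0, n, n)
  for (t in 1:T) {
    S00 <- S00 + Ps[, , t]     + tcrossprod(Ys[, t])
    S11 <- S11 + Ps[, , t + 1] + tcrossprod(Ys[, t + 1])
    S10 <- S10 + Pcc[, , t]    + tcrossprod(Ys[, t + 1], Ys[, t])
  }
  ## M-step updates
  M_new  <- S10 %*% solve(S00)
  Cn_new <- (S11 - S10 %*% solve(S00) %*% t(S10)) / T
  Ce_new <- matrix(0, nrow(Z), nrow(Z))
  for (t in 1:T) {
    resid  <- Z[, t] - H %*% Ys[, t + 1]
    Ce_new <- Ce_new + tcrossprod(resid) + H %*% Ps[, , t + 1] %*% t(H)
  }
  list(mu0 = Ys[, 1], M = M_new, Cn = Cn_new, Ce = Ce_new / T)
}
```

Starting from initial values such as the method-of-moments estimates of Section C.1, one would call `Theta <- em_step(Z, H, Theta, C0)` repeatedly until the change in Equation C.9 (or in the estimates themselves) is negligible.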

Uncertainty estimates are less easily obtained for the parameter estimates than they are for the state-process predictions, but they can be obtained by considering the inverse of the associated asymptotic information matrix or by parametric bootstrap methods. Unfortunately, obtaining such uncertainty estimates, even for the state-process predictions, is not often done in practice and, as discussed in the comments motivating DSTMs in Section 5.2.3, it can be problematic because some transition matrices whose parameters lie within the joint confidence region may exhibit explosive behavior.

More flexible inference for DSTMs can be accomplished with a fully Bayesian hierarchical model (BHM); see Section 4.5.2, as well as Cressie & Wikle (2011, Chapter 8). These BHM implementations are often problem-specific, and they are often best implemented directly in R or in a so-called probabilistic programming language (e.g., Stan, WinBUGS, JAGS). For an example, see the Gibbs sampler MCMC algorithm (corresponding to a BHM) used to predict Mediterranean surface winds in Appendix E.

C.3 Estimation for Non-Gaussian and Nonlinear DSTMs

In principle, the filtering and smoothing methods presented in Section C.2 can be generalized to the setting of non-Gaussian and nonlinear DSTMs (e.g., particle filters and smoothers, ensemble Kalman filters; see Chapter 8 of Cressie & Wikle (2011)). However, in high-dimensional settings with deep BHMs that have complicated parameter-dependence structures, one typically has to consider fully Bayesian implementations. As mentioned above, these implementations are often programmed “from scratch” rather than with existing R packages. As an example, see the BHM based on a linear DSTM with Gaussian errors given in Appendix E.