Appendix B — General Smoothing Kernels
Consider data \(\{Z_i: i=1,\ldots,m\}\), which we can write as a vector, \(\mathbf{Z} = (Z_1,\ldots,Z_m)'\). Now, a homogeneously linear (smoothing) predictor for \(\mathbf{Z}\) can always be written as \(\widehat{\mathbf{Z}} = \mathbf{H}\mathbf{Z}\), where the \(i\)th row of the \(m \times m\) matrix \(\mathbf{H}\), sometimes referred to as the influence matrix, contains the smoothing weights for the prediction \(\widehat{Z}_i\); that is,
\[ \widehat{Z}_i = \sum_{j=1}^m h_{ij} Z_j, \]
where \(h_{ij}\) corresponds to the \((i,j)\)th element of \(\mathbf{H}\) and, by definition, the elements of \(\mathbf{H}\) do not depend on \(\mathbf{Z}\). Note that both the kernel and regression predictors given in Section 3.1 and Section 3.2, respectively, are linear predictors of this form. In the case of the kernel predictors, \(h_{ij}\) corresponds to the kernel evaluated at locations \(i\) and \(j\). For the regression case, \(\mathbf{H} = \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}'\) (sometimes called the “hat” matrix in books on regression). The difference is that, in general, under a kernel model, \(\mathbf{H}\) gives more weight to locations that are near to each other, whereas standard regression hat matrices do not necessarily do so, although so-called local linear regression approaches do (see, for example, James et al., 2013).
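To make the linear-predictor form concrete, the following minimal sketch in NumPy constructs both a kernel influence matrix and a regression hat matrix and applies each to the same data vector \(\mathbf{Z}\); the one-dimensional locations, Gaussian kernel, bandwidth, and quadratic trend are illustrative choices, not taken from the development above.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 50
s = np.linspace(0, 1, m)                                   # illustrative 1-D locations
Z = np.sin(2 * np.pi * s) + 0.2 * rng.standard_normal(m)   # illustrative noisy data

# Kernel smoother: h_ij is a Gaussian kernel in |s_i - s_j|, with rows
# normalized so that each row of H sums to one
bandwidth = 0.1                                            # illustrative bandwidth
K = np.exp(-0.5 * ((s[:, None] - s[None, :]) / bandwidth) ** 2)
H_kernel = K / K.sum(axis=1, keepdims=True)

# Regression hat matrix: H = X (X'X)^{-1} X' for a quadratic trend in s
X = np.column_stack([np.ones(m), s, s ** 2])
H_reg = X @ np.linalg.solve(X.T @ X, X.T)

# Both are linear predictors of the form Z_hat = H Z
Z_hat_kernel = H_kernel @ Z
Z_hat_reg = H_reg @ Z

# The regression version reproduces the ordinary least-squares fit
beta_ols, *_ = np.linalg.lstsq(X, Z, rcond=None)
assert np.allclose(Z_hat_reg, X @ beta_ols)
```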
There are several useful properties of the general linear smoothing matrix, \(\mathbf{H}\), used in the linear predictor. First, as we have noted, if one has \(m\) observations but they are statistically dependent, then there are effectively fewer than \(m\) degrees of freedom (that is, some of the information is redundant due to the dependence). Specifically, the effective degrees of freedom in the sample of \(m\) observations are given by the trace of the matrix \(\mathbf{H}\),
\[ df_{\mathrm{eff}} = \textrm{tr}(\mathbf{H}) = \sum_{i=1}^m h_{ii}. \]
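As a quick numerical check of this definition, the short sketch below (again illustrative NumPy code, with assumed locations and bandwidths) shows that \(\textrm{tr}(\mathbf{H})\) equals the number of regression columns for the hat matrix, and that it decreases for the kernel smoother as the bandwidth grows.

```python
import numpy as np

m = 50
s = np.linspace(0, 1, m)                     # illustrative 1-D locations

# Regression hat matrix: tr(H) equals the number of columns of X
X = np.column_stack([np.ones(m), s, s ** 2])
H_reg = X @ np.linalg.solve(X.T @ X, X.T)
print(np.trace(H_reg))                       # approximately 3.0

# Row-normalized Gaussian-kernel smoother: tr(H) falls as the bandwidth grows,
# since a wider kernel borrows more strength from neighboring observations
for bandwidth in (0.02, 0.1, 0.5):           # illustrative bandwidths
    K = np.exp(-0.5 * ((s[:, None] - s[None, :]) / bandwidth) ** 2)
    H_kernel = K / K.sum(axis=1, keepdims=True)
    print(bandwidth, np.trace(H_kernel))
```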
Another important property of linear predictors of this form is that we can obtain the LOOCV estimate (see Note 3.1) without having to refit the model for each held-out observation. That is, in the case of evaluating the MSPE, the LOOCV statistic is given by
\[ CV_{(m)} = \frac{1}{m} \sum_{i=1}^m (Z_i - \widehat{Z}_i^{(-i)})^2 = \frac{1}{m} \sum_{i=1}^m \left(\frac{Z_i - \widehat{Z}_i}{1 - h_{ii}} \right)^2, \tag{B.1}\]
and the so-called generalized cross-validation statistic is obtained by replacing each denominator \(1 - h_{ii}\) on the right-hand side of Equation B.1 by \((1 - \textrm{tr}(\mathbf{H})/m)\).
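The identity in Equation B.1 can be verified numerically. The sketch below (illustrative NumPy code with simulated regressors and data, not taken from the text) compares the shortcut on the right-hand side of Equation B.1 with a brute-force leave-one-out loop for an ordinary least-squares hat matrix, and also computes the generalized cross-validation statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 30, 3
X = rng.standard_normal((m, p))                               # simulated regressors
Z = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(m)   # simulated data

H = X @ np.linalg.solve(X.T @ X, X.T)               # hat matrix
Z_hat = H @ Z
h = np.diag(H)

# Right-hand side of Equation B.1 (no refitting required)
cv_shortcut = np.mean(((Z - Z_hat) / (1 - h)) ** 2)

# Brute-force LOOCV: drop observation i, refit, and predict Z_i
cv_brute = 0.0
for i in range(m):
    keep = np.arange(m) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Z[keep], rcond=None)
    cv_brute += (Z[i] - X[i] @ beta_i) ** 2
cv_brute /= m
assert np.isclose(cv_shortcut, cv_brute)

# Generalized cross-validation: replace each 1 - h_ii by 1 - tr(H)/m
gcv = np.mean(((Z - Z_hat) / (1 - np.trace(H) / m)) ** 2)
```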
In cases where regularization is considered in the context of the linear predictor (e.g., when we wish to shrink the parameters toward zero using a ridge regression (\(L_2\)-norm) penalty; see Note 3.4), we can write \(\mathbf{H} = \mathbf{X}(\mathbf{X}' \mathbf{X} + \mathbf{R})^{-1} \mathbf{X}'\) (with \(\mathbf{R} = \lambda \mathbf{I}\) in the ridge-regression case), and the effective degrees of freedom and LOOCV properties are still valid (see James et al., 2013). As discussed in Note 3.4, a lasso (\(L_1\)-norm) penalty can also be used for regularization, but in that case the predictor is not linear in the data, so the smoothing matrix has no closed form.
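The following sketch (again illustrative, with a simulated design and an arbitrary penalty value) forms the ridge-regularized influence matrix described above and applies the same trace and leave-one-out formulas to it.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 30, 5
X = rng.standard_normal((m, p))                          # simulated regressors
Z = X @ rng.standard_normal(p) + rng.standard_normal(m)  # simulated data

lam = 2.0                                                # illustrative ridge penalty
H_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

df_eff = np.trace(H_ridge)               # effective degrees of freedom, below p for lam > 0
Z_hat = H_ridge @ Z
h = np.diag(H_ridge)
cv_ridge = np.mean(((Z - Z_hat) / (1 - h)) ** 2)         # Equation B.1 applied to H_ridge
```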