Revisiting Whittaker-Henderson Smoothing

What is Whittaker-Henderson smoothing ?

Origin

Whittaker-Henderson (WH) smoothing is a gradation method aimed at correcting the effect of sampling fluctuations on an observation vector. It is applied to evenly-spaced discrete observations. Initially proposed by Whittaker (1922) for constructing mortality tables and further developed by the works of Henderson (1924), it remains one of the most popular methods among actuaries for constructing experience tables in life insurance. Extending to two-dimensional tables, it can be used for studying various risks, including but not limited to: mortality, disability, long-term care, lapse, mortgage default, and unemployment.

The one-dimensional case

Let y be a vector of observations and w a vector of positive weights, both of size n. The estimator associated with Whittaker-Henderson smoothing is given by:

$$ \hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}}\{F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) + R_{\lambda,q}(\boldsymbol{\theta})\} $$

where:

$F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = \underset{i = 1}{\overset{n}{\sum}} w_i(y_i - \theta_i)^2$ represents a fidelity criterion to the observations,
$R_{\lambda,q}(\boldsymbol{\theta}) = \lambda \underset{i = 1}{\overset{n - q}{\sum}} (\Delta^q\boldsymbol{\theta})_i^2$ represents a smoothness criterion.

In the latter expression, Δ^q denotes the forward difference operator of order q, such that for any i ∈ [1, n − q]:

$$ (\Delta^q\boldsymbol{\theta})_i = \underset{k = 0}{\overset{q}{\sum}} \begin{pmatrix}q \\ k\end{pmatrix}(- 1)^{q - k} \theta_{i + k}. $$

Let us define W = Diag(w), the diagonal matrix of weights, and D_n, q as the order q difference matrix of dimensions (n − q) × n, such that (D_n, qθ)_i = (Δ^qθ)_i for all i ∈ [1, n − q]. The most commonly used difference matrices of order 1 and 2 have the following forms:

$$ D_{n,1} = \begin{bmatrix} -1 & 1 & 0 & \ldots & 0 \\ 0 & -1 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & 0 & -1 & 1 \\ \end{bmatrix} \quad\text{and}\quad D_{n,2} = \begin{bmatrix} 1 & -2 & 1 & 0 & \ldots & 0 \\ 0 & 1 & -2 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & 0 & 1 & -2 & 1 \\ \end{bmatrix}. $$

The fidelity and smoothness criteria can be rewritten in matrix form as:

F(y, w, θ) = (y − θ)^TW(y − θ) and R_λ, q(θ) = λθ^TD_n, q^TD_n, qθ.

The associated estimator for smoothing becomes:

$$ \hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace $$

where P_λ = λD_n, q^TD_n, q.

The two-dimensional case

In the bidimensional case, let us consider a matrix Y of observations and a matrix Ω of non-negative weights, both of dimensions n_x × n_z. The estimator associated with the Whittaker-Henderson smoothing can be written as:

$$ \widehat{Y} = \underset{\Theta}{\text{argmin}}\{F(Y,\Omega, \Theta) + R_{\lambda,q}(\Theta)\} $$

where:

$F(Y,\Omega, \Theta) = \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z} \Omega_{i,j}(Y_{i,j} - \Theta_{i,j})^2$ represents a fidelity criterion to the observations,
$R_{\lambda,q}(\Theta) = \lambda_x \sum_{j = 1}^{n_z}\sum_{i = 1}^{n_x - q_x} (\Delta^{q_x}\Theta_{\bullet,j})_i^2 + \lambda_z \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z - q_z} (\Delta^{q_z}\Theta_{i,\bullet})_j^2$ is a smoothness criterion.

This latter criterion can be written as the sum of two one-dimensional regularization criteria, with orders q_x and q_z, applied respectively to all rows and all columns of Θ. It is also possible to adopt matrix notations by defining y = vec(Y), w = vec(Ω), and θ = vec(Θ) as the vectors obtained by stacking the columns of the matrices Y, Ω, and Θ, respectively. Additionally, let us denote W = Diag(w) and n = n_x × n_z. The fidelity and smoothness criteria can then be rewritten as linear combinations of the vectors y, w, and θ:

$$ \begin{aligned} F(\mathbf{y},\mathbf{w}, \boldsymbol{\theta}) &= (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \\ R_{\lambda,q}(\boldsymbol{\theta}) &= \boldsymbol{\theta}^{T}(\lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}) \boldsymbol{\theta}. \end{aligned} $$

and the estimator associated with the smoothing takes the same form as in the one-dimensional case:

except in this case P_λ = λ_xI_{n_z} ⊗ D_{n_x, q_x}^TD_{n_x, q_x} + λ_zD_{n_z, q_z}^TD_{n_z, q_z} ⊗ I_{n_x}.

Explicit solution

Whittaker-Henderson estimator has an explicit solution:

$$\hat{\mathbf{y}} = (W + P_{\lambda})^{-1}W\mathbf{y}.$$

Indeed, as a minimum, $\hat{\mathbf{y}}$ satisfies:

$$0 = \left.\frac{\partial}{\partial \boldsymbol{\theta}}\right|_{\hat{\mathbf{y}}}\left\lbrace(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\right\rbrace = - 2 X^TW(y - \hat{\mathbf{y}}) +2X^TP_{\lambda} \hat{\mathbf{y}}.$$

It follows that $(W + P_{\lambda})\hat{\boldsymbol{\theta}} = W\mathbf{y}$, and if W + P_λ is invertible, then $\hat{\mathbf{y}}$ is indeed a solution for the original equation. When λ ≠ 0, W + P_λ is invertible as long as w has q non-zero elements in the one-dimensional case, and Ω has at least q_x × q_z non-zero elements distributed over q_x different rows and q_z different columns in the two-dimensional case. These are sufficient conditions, which are always satisfied in practice and will not be demonstrated here.

How to use the package?

The WH package features two main functions WH_1d and WH_2d corresponding to the one-dimensional and two-dimensional cases respectively. Two arguments are mandatory for those functions:

The vector (or matrix in the two-dimension case) d corresponding to the number of observed events of interest by age (or by age and duration in the two-dimension case). d should have named elements (or rows and columns) for the model results to be extrapolated.
The vector (or matrix in the two-dimension case) ec corresponding to the portfolio central exposure by age (or by age and duration in the two-dimension case) whose dimensions should match those of d. The contribution of each individual to the portfolio central exposure corresponds to the time the individual was actually observed with corresponding age (and duration in the two-dimension cas). It always ranges from 0 to 1 and is affected by individuals leaving the portfolio, no matter the cause, as well as censoring and truncating phenomena.

Additional arguments may be supplied, whose description is given in the documentation of the functions.

The package also embed two fictive agregated datasets to illustrate how to use it:

portfolio_mortality contains the agregated number of deaths and associated central exposure by age for an annuity portfolio.
portfolio_LTC contains the agregated number of deaths and associated central exposure by age and duration (in years) since the onset of LTC for the annuitant database of a long-term care portfolio.

# One-dimensional case
d <- portfolio_mort$d
ec <- portfolio_mort$ec

WH_1d_fit <- WH_1d(d, ec)
Using outer iteration / Brent method

# Two-dimensional case
keep_age <- which(rowSums(portfolio_LTC$ec) > 5e2)
keep_duration <- which(colSums(portfolio_LTC$ec) > 1e3)

d  <- portfolio_LTC$d[keep_age, keep_duration]
ec <- portfolio_LTC$ec[keep_age, keep_duration]

WH_2d_fit <- WH_2d(d, ec)
Using performance iteration / Nelder-Mead method

Functions WH_1d and WH_2d output objects of class "WH_1d" and "WH_2d" to which additional functions (including generic S3 methods) may be applied:

The print function provides a glimpse of the fitted results

WH_1d_fit
An object fitted using the WH_1D function
Initial data contains 74 data points:
  Observation positions:  19  to  92 
Smoothing parameter selected: 13454 
Associated degrees of freedom: 7.5 
WH_2d_fit
An object fitted using the WH_2D function
Initial data contains 90 data points:
  First dimension:  75  to  89 
  Second dimension:  0  to  5 
Smoothing parameters selected: 301   2 
Associated degrees of freedom: 13

The plot function generates rough plots of the model fit, the associated standard deviation, the model residuals or the associated degrees of freedom. See the plot.WH_1d and plot.WH_2d functions help for more details.

plot(WH_1d_fit)

plot(WH_1d_fit, "res")

plot(WH_1d_fit, "edf")


plot(WH_2d_fit)

plot(WH_2d_fit, "std_y_hat")

The predict function generates an extrapolation of the model. It requires a newdata argument, a named list with one or two elements corresponding to the positions of the new observations. In the two-dimension case constraints are used so that the predicted values matches the fitted values for the initial observations (see Carballo, Durban, and Lee 2021 to understand why this is required).

WH_1d_fit |> predict(newdata = 18:99) |> plot()

WH_2d_fit |> predict(newdata = list(age = 50:99,
                                    duration = 0:19)) |> plot()

Finally the output_to_df function converts an "WH_1d" or "WH_2d" object into a data.frame. Information about the fit is discarded in the process. This function may be useful to produce better visualizations from the data, for example using the ggplot2 package.

WH_1d_df <- WH_1d_fit |> output_to_df()
WH_2d_df <- WH_2d_fit |> output_to_df()

Further WH smoothing theory

How to obtain credibility intervals?

From the explicit solution of WH smoothing, $\mathbb{E}(\hat{\mathbf{y}}) = (W + P_{\lambda})^{-1}W\mathbb{E}(\mathbf{y}) \ne \mathbb{E}(\mathbf{y})$ when λ ≠ 0. This implies that penalization introduces a smoothing bias, which prevents the construction of a confidence interval centered around 𝔼(y). Therefore, in this section, we turn to a Bayesian approach where smoothing can be interpreted more naturally.

Let us suppose that y|θ ∼ 𝒩(θ, W⁻) and θ ∼ 𝒩(0, P_λ⁻) where P_λ⁻ denotes the pseudo-inverse of the matrix P_λ. The Bayes’ formula allows us to express the posterior likelihood f(θ|y) associated with these choices in the following form:

$$\begin{aligned} f(\boldsymbol{\theta} | \mathbf{y}) &= \frac{f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta})}{f(y)} \\ &\propto f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta}) \\ &\propto \exp\left(- \frac{1}{2}(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\right)\exp\left(-\frac{1}{2}\boldsymbol{\theta}^TP_{\lambda} \boldsymbol{\theta}\right) \\ &\propto \exp\left(- \frac{1}{2}\left[(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right]\right). \end{aligned}$$

The mode of the posterior distribution, $\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\text{argmax}} [f(\boldsymbol{\theta} | \mathbf{y})]$, also known as the maximum a posteriori (MAP) estimate, coincides with the explicit solution $\hat{\mathbf{y}}$.

A second-order Taylor expansion of the log-posterior likelihood around $\hat{\mathbf{y}} = \hat{\boldsymbol{\theta}}$ gives us:

$$\ln f(\boldsymbol{\theta} | \mathbf{y}) = \ln f(\hat{\boldsymbol{\theta}} | \mathbf{y}) + \left.\frac{\partial \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}^{T}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T} \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$$

$$\text{where} \quad \left.\frac{\partial \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0 \quad \text{and} \quad \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = - (W + P_{\lambda}).$$

Note that this last derivative no longer depends on θ, and higher-order derivatives beyond 2 are all zero. The Taylor expansion allows for an exact computation of ln f(θ|y). By substituting the result back into the Taylor expansion, we obtain:

$$\begin{aligned} f(\boldsymbol{\theta} | \mathbf{y}) &\propto \exp\left[\ln f(\hat{\boldsymbol{\theta}} | \mathbf{y}) - \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \\ &\propto \exp\left[- \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \end{aligned}$$

which can immediately be recognized as the density of the $\mathcal{N}(\hat{\boldsymbol{\theta}},(W + P_{\lambda})^{- 1})$ distribution.

The assumption θ ∼ 𝒩(0, P_λ⁻) corresponds to a simple Bayesian formalization of the smoothness criterion. It reflects a (improper) prior belief of the modeler about the underlying distribution of the observation vector y.

The use of Whittaker-Henderson smoothing in the Bayesian framework and the construction of credible intervals are conditioned on the validity of the assumption y|θ ∼ 𝒩(θ, W⁻). The components of the observation vector should be independent and have known variances. The weight vector w used should contain the inverses of these variances. Under these assumptions, credible intervals at 100(1 − α)% for smoothing can be constructed and take the form:

$$\mathbb{E}(\mathbf{y}) | \mathbf{y} \in \left[\hat{\mathbf{y}} \pm \Phi\left(1 -\frac{\alpha}{2}\right)\sqrt{\textbf{diag}\left\lbrace(W + P_{\lambda})^{-1}\right\rbrace}\right]$$

with probability $1 -\frac{\alpha}{2}$ where $\hat{\mathbf{y}} = (W + P_{\lambda})^{- 1}W\mathbf{y}$ and Φ denotes the cumulative distribution function of the standard normal distribution. According to Marra and Wood (2012), these credible intervals provide satisfactory coverage of the corresponding frequentist confidence intervals and can therefore be used as practical substitutes.

What should I use as observations and weights?

The previous section highlighted the need to apply the Whittaker-Henderson smoothing to independent observation vectors y and weight vectors w corresponding to the inverses of the variances of the components of y in order to obtain a measure of uncertainty in the results. In this section, we propose, within the framework of duration models used for constructing experience tables, vectors y and w that satisfy these conditions.

Duration models framework: one-dimensional case

Consider the observation of m individuals in a longitudinal study subject to left truncation and right censoring phenomena. Suppose we want to estimate a single distribution that depends on only one continuous explanatory variable, denoted by x. For illustration purposes, we can consider it as a mortality distribution with the explanatory variable of interest x representing age. Such a distribution is fully characterized by one of the following quantities:

the cumulative distribution function F(x) or its complement, the survival function S(x) = 1 − F(x),
the associated probability density function $f(x) = - \frac{\text{d}}{\text{d}x}S(x)$,
the instantaneous hazard function $\mu(x) = - \frac{\text{d}}{\text{d}x}\ln S(x)$.

Suppose that the considered distribution depends on a vector of parameters θ that we want to estimate using maximum likelihood. The likelihood associated with the observation of the individuals can be written as follows:

$$ \mathcal{L}(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\prod}} \left[\frac{f(x_i + t_i,\boldsymbol{\theta})}{S(x_i,\boldsymbol{\theta})}\right]^{\delta_i}\left[\frac{S(x_i + t_i,\boldsymbol{\theta})}{S(x_i,\boldsymbol{\theta})}\right]^{1 - \delta_i} $$

Where x_i represents the age at the start of observation, t_i represents the observation duration, i.e., the time elapsed between the start and end dates of observation, and δ_i is the indicator of event observation, which takes the value 1 if the event of interest is observed and 0 if the observation is censored. We will not go into the details of calculating these three quantities, which should take into account the individual-specific information such as the subscription date, redemption date if applicable, as well as the global characteristics of the product, such as the presence of a deductible or medical selection phenomenon, and the choice of a restricted observation period due to data quality issues or delays in the reporting of observations. These factors typically lead to a narrower observation period than the actual presence period of individuals in the portfolio.

The various quantities introduced above are related by the following relationships:

$$ S(x) = \exp\left(\underset{u = 0}{\overset{x}{\int}}\mu(u)\text{d}u\right)\quad \text{and} \quad f(x) = \mu(x)S(x). $$

The maximization of the likelihood in that equation is equivalent to the maximization of the associated log-likelihood, which can be rewritten using only the instantaneous hazard function (also known as the mortality rate in the case of the death risk):

$$ \ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} \left[\delta_i \ln\mu(x_i + t_i,\boldsymbol{\theta}) - \underset{u = 0}{\overset{t_i}{\int}}\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right] $$

To discretize the problem, let us assume that the mortality rate is piecewise constant over one-year intervals between two integer ages. Formally, we have μ(x + ϵ) = μ(x) for all x ∈ ℕ and ϵ ∈ [0, 1[. Furthermore, if 1 denotes the indicator function, then for any 0 ≤ a < x_max, we have $\sum_{x = x_{\min}}^{x_{\max}} \mathbf{1}(x \leq a < x + 1) = 1$, where x_min = min (x) and x_max = max (x). The log-likelihood can be rewritten as:

$$ \begin{aligned} \ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} &\left[\underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \delta_i\mathbf{1}(x \le x_i + t_i < x + 1)\ln\mu(x_i + t_i,\boldsymbol{\theta})\right. \\ &- \left.\underset{u = 0}{\overset{t_i}{\int}}\underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \mathbf{1}(x \le x_i + u < x + 1)\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right]. \end{aligned} $$

The assumption of piecewise constant mortality rate implies that:

$$ \begin{aligned} \mathbf{1}(x \le x_i + t_i < x + 1) \ln\mu(x_i + t_i,\boldsymbol{\theta}) &= \mathbf{1}(x \le x_i + t_i < x + 1) \ln\mu(x,\boldsymbol{\theta})\quad\text{and}\\ \mathbf{1}(x \le x_i + u < x + 1)\mu(x_i + u,\boldsymbol{\theta}) &= \mathbf{1}(x \le x_i + u < x + 1) \ln\mu(x,\boldsymbol{\theta}). \end{aligned} $$

It is then possible to interchange the two summations to obtain the following expressions:

$$ \begin{aligned} \ell(\boldsymbol{\theta}) &= \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \left[\ln\mu(x,\boldsymbol{\theta}) d(x) - \mu(x,\boldsymbol{\theta}) e_c(x)\right] \quad \text{where} \\ d(x) & = \underset{i = 1}{\overset{m}{\sum}} \delta_i \mathbf{1}(x \le x_i + t_i < x + 1) \quad \text{and} \\ e_c(x) & = \underset{i = 1}{\overset{m}{\sum}}\underset{u = 0}{\overset{t_i}{\int}}\mathbf{1}(x \le x_i + u < x + 1)\text{d}u = \underset{i = 1}{\overset{m}{\sum}} \left[\min(t_i, x - x_i + 1) - \max(0, x - x_i)\right]^+ \end{aligned} $$

by denoting a⁺ = max (a, 0), where d(x) and e_c(x) correspond to the number of observed deaths between ages x and x + 1 and the sum of observation durations of individuals between these ages, respectively (the latter quantity is also known as central exposure to risk).

Extension to the two-dimensional case

The extension of the proposed approach to the two-dimensional framework requires only minor adjustments to the previous reasoning. Let z_min = min (z) and z_max = max (z). The piecewise constant assumption for the mortality rate needs to be extended to the second dimension. Formally, we now assume that μ(x + ϵ, z + ξ) = μ(x, z) for all pairs x, z ∈ ℕ and ϵ, ξ ∈ [0, 1[. The sums involving the variable x are then replaced by double sums considering all combinations of x and z. The log-likelihood is given by:

$$ \begin{aligned} \ell(\boldsymbol{\theta}) &= \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \underset{z = z_{\min}}{\overset{z_{\max}}{\sum}}\left[\ln\mu(x,z,\boldsymbol{\theta}) d(x,z) - \mu(x,z,\boldsymbol{\theta}) \mathbf{e_c}(x,z)\right] \quad \text{where} \\ d(x,z) & = \underset{i = 1}{\overset{m}{\sum}} \delta_i \mathbf{1}(x \le x_i + t_i < x + 1) \mathbf{1}(z \le z_i + t_i < z + 1) \quad \text{and}\\ \mathbf{e_c}(x,z) & = \underset{i = 1}{\overset{m}{\sum}}\underset{u = 0}{\overset{t_i}{\int}}\mathbf{1}(x \le x_i + u < x + 1)\mathbf{1}(z \le z_i + u < z + 1)\text{d}u \\ & = \underset{i = 1}{\overset{m}{\sum}} \left[\min(t_i, x + 1 - x_i, z + 1 - z_i) - \max(0, x - x_i, z - z_i)\right]^+ \end{aligned} $$

Likelihood equations

The choice μ(θ) = exp (θ), which corresponds to considering one parameter per observation, allows us to relate to the Whittaker-Henderson smoothing. Using the exponential function ensures positive values for the estimated mortality rate. The expressions of the likelihood in the unidimensional or two-dimensional case can then be written in a common vectorized form:

ℓ(θ) = θ^Td − exp (θ)^Te_c

where d and e_c represent the vectors of observed deaths and exposures to risk, respectively. The derivatives of the likelihood function for this model are given by:

$$\frac{\partial \ell}{\partial \boldsymbol{\theta}} = \left[\mathbf{d} -\exp(\boldsymbol{\theta}) \odot \mathbf{e_c}\right] \quad \text{and} \quad \frac{\partial^2 \ell}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^T} = - \text{Diag}(\exp(\boldsymbol{\theta}) \odot \mathbf{e_c}).$$

Note that these likelihood equations are exactly what we would obtain by assuming that the observed numbers of deaths, conditional on the observed exposures to risk e_c, follow Poisson distributions with parameters μ(θ) ⊙ e_c. The model presented here has many similarities with a Poisson GLM (John Ashworth Nelder and Wedderburn 1972). However, the initial assumptions are not the same for both models.

These likelihood equations have an explicit solution given by $\hat{\boldsymbol{\theta}} = \ln(\mathbf{d} / \mathbf{e}_c)$. This model, which treats each age independently, is known as the crude rates estimator. The properties of the maximum likelihood estimator imply that asymptotically $\hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\theta}, W_{\hat{\boldsymbol{\theta}}}^{-1})$, where $W_{\hat{\boldsymbol{\theta}}}$ is a diagonal matrix with elements $\exp(\hat{\boldsymbol{\theta}}) \odot \mathbf{e}_c = (\mathbf{d} / \mathbf{e}_c) \odot \mathbf{e}_c = \mathbf{d}$. It should be noted that the asymptotic nature and the validity of this approximation are related to the number of individuals m in the portfolio and not the size n of the vectors d and e_c.

Thus, we have shown that in the framework of duration models, using the crude rates estimator, asymptotically ln (d/e_c) ∼ 𝒩(ln μ, W⁻¹), where W = Diag(d). This justifies applying the Whittaker-Henderson smoothing to the observation vector y = ln (d/e_c) and weight vector w = d. We obtain the following associated credibility intervals:

$$\ln\boldsymbol{\mu} | \mathbf{d}, \mathbf{e_c} \in \left[\hat{\boldsymbol{\theta}} \pm \Phi\left(1 -\frac{\alpha}{2}\right)\sqrt{\textbf{diag}\left\lbrace(\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\right\rbrace}\right]$$

where $\hat{\boldsymbol{\theta}} = (\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\text{Diag}(\mathbf{d})[\ln(\mathbf{d}) - \ln \mathbf{e}_c]$. Results on μ can be obtained directly by exponentiating the above expression.

How is the optimal smoothing parameter determined ?

In the definition of WH smoothing, (y − θ)^TW(y − θ) represents a fidelity criterion to the observations, and θ^TP_λθ represents a smoothness criterion. The relative importance of these criteria is controlled by the parameter (or pair of parameters in the two-dimensional case) λ. The result of smoothing is highly sensitive to the chosen value of the smoothing parameter. Ideally, the selection of the smoothing parameter is based on the optimization of a statistical criterion, typically belonging to one of two major families.

On one hand, there are criteria based on the minimization of the model’s prediction error, among which the Akaike information criterion (AIC, Akaike 1973) and the generalized cross-validation (GCV, Wahba 1980) are included. The Bayesian information criterion (BIC, Schwarz 1978) has a similar form to AIC, although its theoretical justification is very different, and thus it can be considered part of this group.

On the other hand, there are criteria based on the maximization of a likelihood function known as the marginal likelihood. This type of criterion was introduced by Patterson H. D. (1971) in the Gaussian case, initially under the name of restricted likelihood (REML), and used by Anderssen and Bloomfield (1974) for the selection of smoothing parameters. Wahba (1985) and Kauermann (2005) show that criteria minimizing prediction error have the best asymptotic performance, but their convergence to the optimal smoothing parameters is slower. For finite sample sizes, criteria based on the maximization of a likelihood function are considered a more robust choice by many authors, such as Reiss and Todd Ogden (2009) or Wood (2011). For these reasons, we will favor the marginal likelihood as the selection criterion, especially since this choice naturally fits into the Bayesian framework introduced in the past sections.

Let us start from the notations and assumptions from last sections, namely y|θ ∼ 𝒩(θ, W⁻) and θ|λ ∼ 𝒩(0, P_λ⁻). In a purely Bayesian approach, it would be necessary to define a prior distribution on λ and then estimate the posterior distribution of each parameter vector using methods such as Markov Chain Monte Carlo. The empirical Bayesian approach we adopt seeks to find the value of λ that maximizes the marginal likelihood:

ℒ_norm^m(λ) = f(y|λ) = ∫f(y, θ|λ)dθ = ∫f(y|θ)f(θ|λ)dθ.

This corresponds to the maximum likelihood method applied to the smoothing parameter. Let us explicitly rewrite the expressions of f(y|θ) and f(θ|λ) introduced previously:

$$\begin{aligned} f(\mathbf{y} | \boldsymbol{\theta}) &= \sqrt{\frac{|W|_{+}}{(2\pi)^{n_*}}}\exp\left(- \frac{1}{2}(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\right) \\ f(\boldsymbol{\theta} |\lambda) &= \sqrt{\frac{|P_{\lambda}|_{+}}{(2\pi)^{p - q}}} \exp\left(- \frac{1}{2}\boldsymbol{\theta}^{T}P_{\lambda} \boldsymbol{\theta}\right) \end{aligned}$$

where |A|₊ denotes the product of the non-zero eigenvalues of A, n_* is the number of non-zero diagonal elements of W, and q is the number of zero eigenvalues of P_λ (q = q_x × q_z in the two-dimensional case). Based on the Taylor expansion performed to obtain credibility intervals, let us recall that:

$$ \ln f(\mathbf{y}, \boldsymbol{\theta} | \lambda) = \ln f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})^{T} (W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda}) $$

which leads to:

$$\begin{aligned} \mathcal{L}^m_\text{norm}(\lambda) &= \int \exp[\ln f(\mathbf{y}, \boldsymbol{\theta} | \lambda)]\text{d}\boldsymbol{\theta} \\ &= f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \int \exp\left[- \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda}) \right]\text{d}\boldsymbol{\theta} \\ &= f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \sqrt{\frac{(2\pi)^p}{|W + P_{\lambda}|}} \\ &= f_\mathbf{y}(\mathbf{y} | \hat{\boldsymbol{\theta}}_{\lambda}) f_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \sqrt{\frac{(2\pi)^p}{|W + P_{\lambda}|}} \\ &= \sqrt{\frac{|W|_{+}| P_{\lambda} |_{+}}{(2\pi)^{n_* - q}|W + P_{\lambda}|}} \exp\left(- \frac{1}{2}\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) + \hat{\boldsymbol{\theta}}_{\lambda}^T P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda}\right]\right). \end{aligned}$$

The associated log-likelihood can be expressed as follows:

$$\begin{aligned} \ell^m_\text{reg}(\lambda) = - \frac{1}{2}&\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) + \hat{\boldsymbol{\theta}}_{\lambda}^{T}P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda}\right. \\ &\left.- \ln|W|_{+} - \ln|P_{\lambda}|_{+} + \ln|W + P_{\lambda}| + (n_* - q)\ln(2\pi)\right]. \end{aligned}$$

Once the selection of λ has been determined using the estimator $\hat{\lambda} = \underset{\lambda}{\text{argmax}}\: \ell^m_\text{reg}(\lambda)$, the lack of an explicit solution to this equation forces us to resort to numerical methods for its resolution. The Newton algorithm could once again be employed here and is a robust choice. This approach was notably adopted by Wood (2011). However, explicitly calculating the derivatives of the likelihood ℓ_reg^m is rather difficult from an operational perspective. Instead, we will use general heuristics such as those provided by Brent (1973) and John A. Nelder and Mead (1965), which are applicable to any sufficiently smooth function. These heuristics do not require derivative calculations and are implemented in the optimize and optim functions of the statistical programming language R.

Advanced WH smoothing theory

See the upcoming paper available here

References

Akaike, Hirotsugu. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” In 2nd International Symposium on Information Theory, 1973.

Anderssen, RS, and Peter Bloomfield. 1974. “A Time Series Approach to Numerical Differentiation.” Technometrics 16 (1): 69–75.

Brent, Richard P. 1973. “Algorithms for Minimization Without Derivatives, Chap. 4.” Prentice-Hall, Englewood Cliffs, NJ.

Carballo, Alba, Maria Durban, and Dae-Jin Lee. 2021. “Out-of-Sample Prediction in Multidimensional p-Spline Models.” Mathematics 9 (15): 1761.

Henderson, Robert. 1924. “A New Method of Graduation.” Transactions of the Actuarial Society of America 25: 29–40.

Kauermann, Göran. 2005. “A Note on Smoothing Parameter Selection for Penalized Spline Smoothing.” Journal of Statistical Planning and Inference 127 (1-2): 53–69.

Marra, Giampiero, and Simon N Wood. 2012. “Coverage Properties of Confidence Intervals for Generalized Additive Model Components.” Scandinavian Journal of Statistics 39 (1): 53–74.

Nelder, John A, and Roger Mead. 1965. “A Simplex Method for Function Minimization.” The Computer Journal 7 (4): 308–13.

Nelder, John Ashworth, and Robert WM Wedderburn. 1972. “Generalized Linear Models.” Journal of the Royal Statistical Society: Series A (General) 135 (3): 370–84.

Patterson H. D., Thompson R. 1971. “Recovery of Inter-Block Information When Block Sizes Are Unequal.” Biometrika 58: 545–54.

Reiss, Philip T, and R Todd Ogden. 2009. “Smoothing Parameter Selection for a Class of Semiparametric Linear Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 505–23.

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics, 461–64.

Wahba, Grace. 1980. Spline Bases, Regularization, and Generalized Cross Validation for Solving Approximation Problems with Large Quantities of Noisy Data. University of WISCONSIN.

———. 1985. “A Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem.” The Annals of Statistics, 1378–1402.

Whittaker, Edmund T. 1922. “On a New Method of Graduation.” Proceedings of the Edinburgh Mathematical Society 41: 63–75.

Wood, Simon N. 2011. “Fast Stable Restricted Maximum Likelihood and Marginal Likelihood Estimation of Semiparametric Generalized Linear Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (1): 3–36.