Sparse GPs
GaussianProcesses.predict_f — Method
predict_f(gp::GPBase, X::Matrix{Float64}[; full_cov::Bool = false])
Return the posterior mean and variance of the Gaussian process gp at specific points, which are given as columns of the matrix X. If full_cov is true, the full covariance matrix is returned instead of only the variances.
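For example, a minimal usage sketch (the data, seed, and hyperparameter values below are illustrative, not taken from the package docs):

    using GaussianProcesses, Random
    Random.seed!(1)
    X = rand(2, 50)                                # 2 features × 50 observations (columns are points)
    y = sin.(X[1, :]) .+ 0.1 .* randn(50)
    gp = GP(X, y, MeanZero(), SE(0.0, 0.0), -2.0)  # GPE with a squared-exponential kernel
    Xtest = rand(2, 10)
    μ, σ² = predict_f(gp, Xtest)                   # means and marginal variances
    μ, Σ = predict_f(gp, Xtest; full_cov=true)     # full predictive covariance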
GaussianProcesses.DeterminTrainCondStrat — Type
Deterministic Training Conditional (DTC) covariance strategy.
GaussianProcesses.predictMVN — Method
Deterministic Training Conditional (DTC) multivariate normal predictions.
See Quiñonero-Candela and Rasmussen 2005, equation 20b.
    μ_DTC = μ_SoR
    Σ_DTC = Σxx - Qxx + Σ_SoR
where μ_SoR and Σ_SoR are the predictive mean and covariance
functions for the Subset of Regressors approximation.
GaussianProcesses.FullScaleApproxStrat — Type
Full Scale Approximation (FSA) covariance strategy.
GaussianProcesses.FullScalePDMat — Type
Positive Definite Matrix for the Full Scale Approximation.
Base.:\ — Method
We have
    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
where Λ is a block-diagonal matrix (each block given by σ²I + Kff - Qff restricted to that block)
The Woodbury matrix identity gives
    (A+UCV)⁻¹ = A⁻¹ - A⁻¹ U(C⁻¹ + V A⁻¹ U)⁻¹ V A⁻¹
which we use here with
    A ← Λ
    U = Kuf'
    V = Kuf
    C = Kuu⁻¹
which gives
    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf'(Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹
        = Λ⁻¹ - Λ⁻¹ Kuf'(        ΣQR       )⁻¹ Kuf Λ⁻¹
GaussianProcesses.dmll_kern! — Method
dmll_kern!(dmll::AbstractVector, kernel::Kernel, X::AbstractMatrix, cK::AbstractPDMat, kerneldata::KernelData, alpha::AbstractVector, Kuu, Kuf, Kuu⁻¹Kuf, Kuu⁻¹KufΣ⁻¹y, Σ⁻¹Kfu, ∂Kuu, ∂Kuf, covstrat::FullScaleApproxStrat)
Derivative of the log likelihood under the Full Scale Approximation (FSA).
Helpful reference: Vanhatalo, Jarno, and Aki Vehtari. "Sparse log Gaussian processes via MCMC for spatial epidemiology." In Gaussian processes in practice, pp. 73-89. 2007.
Generally, for a multivariate normal with zero mean
    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
                     ╰───────────────╯     ╰──────────╯
                             V                  T
where Σ = Kff + σ²I.
Notation: f is the observations, u is the inducing points.
          ∂X stands for ∂X/∂θ, where θ is a kernel hyperparameter.
In the case of the FSA approximation, we have Σ = Λ + Qff. The second component gives ∂(Qff) = ∂(Kfu Kuu⁻¹ Kuf), which is handled by the gradient function for the Subset of Regressors approximation, and so is not repeated here.
for ∂Λ we have (with `i` indexing each block)
Λi = Ki - Qi + σ²I
   = Ki - Kui' Kuu⁻¹ Kui + σ²
∂Λi = ∂Ki - Kui' ∂(Kuu⁻¹) Kui - 2 ∂Kui' Kuu⁻¹ Kui
    = ∂Ki + Kui' Kuu⁻¹ ∂(Kuu) Kuu⁻¹ Kui - 2 ∂Kui' Kuu⁻¹ Kui
GaussianProcesses.dmll_noise — Method
dmll_noise(gp::GPE, precomp::SoRPrecompute, covstrat::FullScaleApproxStrat)
∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
∂Σ = I for derivative wrt σ², so ∂logp(Y|θ) = 1/2 y' Σ⁻¹ Σ⁻¹ y - 1/2 tr(Σ⁻¹) = 1/2[ dot(α,α) - tr(Σ⁻¹) ]
We have
    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf'( ΣQR )⁻¹ Kuf Λ⁻¹
Use the identity tr(A'A) = dot(A,A) to get
    tr(Σ⁻¹) = tr(Λ⁻¹) - dot(Lk, Lk)
where Lk ≡ ΣQR^(-1/2) Kuf Λ⁻¹.
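A quick numeric sanity check of this trace shortcut, using a diagonal Λ for simplicity (the block-diagonal FSA case is analogous); all matrices here are random stand-ins, not package internals:

    using LinearAlgebra
    m, n = 5, 40                                 # inducing points, observations
    Kuf = randn(m, n)
    A = randn(m, m); Kuu = A*A' + m*I            # SPD stand-in for Kuu
    Λ = Diagonal(abs.(randn(n)) .+ 0.5)
    Σ = Kuf' * (Kuu \ Kuf) + Λ                   # dense Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
    ΣQR = Kuu + Kuf * (Λ \ Kuf')
    Lk = cholesky(Symmetric(ΣQR)).L \ (Kuf / Λ)  # Lk ≡ ΣQR^(-1/2) Kuf Λ⁻¹
    @assert isapprox(tr(inv(Σ)), tr(inv(Λ)) - dot(Lk, Lk); rtol=1e-8)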
GaussianProcesses.predictMVN — Method
predictMVN(xpred::AbstractMatrix, xtrain::AbstractMatrix, ytrain::AbstractVector, kernel::Kernel, meanf::Mean, logNoise::Real, alpha::AbstractVector, covstrat::FullScaleApproxStrat, Ktrain::FullScalePDMat)
See Quiñonero-Candela and Rasmussen 2005, equation 24b. Some derivations can be found below that are not spelled out in the paper.
Notation: Qab = Kau Kuu⁻¹ Kub
          ΣQR = Kuu + Kuf Λ⁻¹ Kuf'
          x: prediction (test) locations
          f: training (observed) locations
          u: inducing point locations
We have
    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
By Woodbury
    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf'(Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹
        = Λ⁻¹ - Λ⁻¹ Kuf'(        ΣQR       )⁻¹ Kuf Λ⁻¹
The predictive mean can be derived as follows (assuming a zero mean function for simplicity):
μ = (Qxf+Λxf) (Qff + Λff)⁻¹ y
  = Qxf (Qff + Λff)⁻¹ y   + Λxf (Qff + Λff)⁻¹ y
    ╰─────────────────╯
        same as FITC
  = Kxu ΣQR⁻¹ Kuf Λff⁻¹ y + Λxf (Qff + Λff)⁻¹ y
       ╰────────────────╯       ╰─────────────╯
          ≡ alpha_u                ≡ alpha
Similarly for the posterior predictive covariance:
Σ = Σxx - (Qxf+Λxf) (Qff + Λff)⁻¹ (Qxf+Λxf)'
  = Σxx - Kxu ΣQR⁻¹ Kuf Λ⁻¹ Qxf'                # substituting result from μ
  = Σxx - Kxu ΣQR⁻¹  Kuf Λ⁻¹ Kfu Kuu⁻¹ Kux      # definition of Qxf
  = Σxx - Kxu ΣQR⁻¹ (ΣQR - Kuu) Kuu⁻¹ Kux       # using definition of ΣQR
  = Σxx - Kxu Kuu⁻¹ Kux + Kxu ΣQR⁻¹ Kux         # expanding
  = Σxx - Qxx           + Kxu ΣQR⁻¹ Kux         # definition of Qxx
GaussianProcesses.trinv — Method
trinv(a::BlockDiagPDMat)
Trace of the inverse of a block-diagonal positive definite matrix.
This is obtained as the sum of the traces of the inverse of each block.
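A sketch of the same computation with plain dense blocks (BlockDiagPDMat is the package type; the cat-based embedding below is only for checking):

    using LinearAlgebra
    blocks = [let A = randn(k, k); A*A' + k*I end for k in (3, 4, 5)]
    trinv_blockwise = sum(tr(inv(B)) for B in blocks)   # sum of per-block traces
    M = cat(blocks...; dims=(1, 2))                     # dense block-diagonal embedding
    @assert isapprox(trinv_blockwise, tr(inv(M)); rtol=1e-8)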
GaussianProcesses.trinv — Method
trinv(pd::AbstractPDMat)
Trace of the inverse of a positive definite matrix.
GaussianProcesses.trinvAB — Method
trinvAB(A::FullScalePDMat, B::Diagonal)
Computes tr(A⁻¹ B) efficiently under the FSA approximation:
    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
Derivation:
    tr(Σ⁻¹ B) = tr[ (Λ⁻¹ - Λ⁻¹ Kuf'(Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹) B ]
              = tr(Λ⁻¹ B) - tr( Λ⁻¹ Kuf' ΣQR⁻¹ Kuf Λ⁻¹ B )
              = tr(Λ⁻¹ B) - tr( Λ⁻¹ Kuf' ΣQR^(-1/2) ΣQR^(-1/2) Kuf Λ⁻¹ B )
                                                    ╰────────────────╯
                                                        ≡ L'
              = tr(Λ⁻¹ B) - tr(L*L'*B)
              = tr(Λ⁻¹ B) - dot(L, B*L)
LinearAlgebra.logdet — Method
The matrix determinant lemma states that
    logdet(A+UWV') = logdet(W⁻¹ + V'A⁻¹U) + logdet(W) + logdet(A)
So for
    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
    logdet(Σ) = logdet(Kuu + Kuf Λ⁻¹ Kuf')       + logdet(Kuu⁻¹) + logdet(Λ)
              = logdet(        ΣQR             ) - logdet(Kuu)   + logdet(Λ)
LinearAlgebra.tr — Method
tr(a::FullScalePDMat)
Trace of the FSA approximation to the covariance matrix:
tr(Σ) = tr(Kuf' Kuu⁻¹ Kuf + Λ)
      = tr(Kuf' Kuu⁻¹ Kuf) + tr(Λ)
      = tr(Kuf' Kuu^{-1/2} Kuu^{-1/2} Kuf) + tr(Λ)
                          ╰──────────────╯
                             ≡  Lk
      = dot(Lk, Lk) + sum(diag(Λ))
GaussianProcesses.FullyIndepPDMat — Type
Positive Definite Matrix for the Fully Independent Training Conditional approximation.
GaussianProcesses.FullyIndepStrat — Type
Fully Independent Training Conditional (FITC) covariance strategy.
Base.:\ — Method
We have
    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
where Λ is a diagonal matrix (here given by σ²I + diag(Kff - Qff))
The Woodbury matrix identity gives
    (A+UCV)⁻¹ = A⁻¹ - A⁻¹ U(C⁻¹ + V A⁻¹ U)⁻¹ V A⁻¹
which we use here with
    A ← Λ
    U = Kuf'
    V = Kuf
    C = Kuu⁻¹
which gives
    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf'(Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹
        = Λ⁻¹ - Λ⁻¹ Kuf'(        ΣQR       )⁻¹ Kuf Λ⁻¹
GaussianProcesses.dmll_kern! — Method
dmll_kern!(dmll::AbstractVector, kernel::Kernel, X::AbstractMatrix, cK::AbstractPDMat, kerneldata::KernelData, alpha::AbstractVector, Kuu, Kuf, Kuu⁻¹Kuf, Kuu⁻¹KufΣ⁻¹y, Σ⁻¹Kfu, ∂Kuu, ∂Kuf, covstrat::FullyIndepStrat)
Derivative of the log likelihood under the Fully Independent Training Conditional (FITC) approximation.
Helpful reference: Vanhatalo, Jarno, and Aki Vehtari. "Sparse log Gaussian processes via MCMC for spatial epidemiology." In Gaussian processes in practice, pp. 73-89. 2007.
Generally, for a multivariate normal with zero mean
    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
                     ╰───────────────╯     ╰──────────╯
                             V                  T
where Σ = Kff + σ²I.
Notation: f is the observations, u is the inducing points.
          ∂X stands for ∂X/∂θ, where θ is a kernel hyperparameter.
In the case of the FITC approximation, we have Σ = Λ + Qff. The second component gives ∂(Qff) = ∂(Kfu Kuu⁻¹ Kuf), which is handled by the gradient function for the Subset of Regressors approximation, and so is not repeated here.
the ith element of diag(Kff-Qff) is
Λi = Kii - Qii + σ²
   = Kii - Kui' Kuu⁻¹ Kui + σ²
∂Λi = ∂Kii - Kui' ∂(Kuu⁻¹) Kui - 2 ∂Kui' Kuu⁻¹ Kui
    = ∂Kii + Kui' Kuu⁻¹ ∂(Kuu) Kuu⁻¹ Kui - 2 ∂Kui' Kuu⁻¹ Kui
GaussianProcesses.dmll_noise — Method
dmll_noise(gp::GPE, precomp::SoRPrecompute, covstrat::FullyIndepStrat)
∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
∂Σ = I for derivative wrt σ², so ∂logp(Y|θ) = 1/2 y' Σ⁻¹ Σ⁻¹ y - 1/2 tr(Σ⁻¹) = 1/2[ dot(α,α) - tr(Σ⁻¹) ]
We have
    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf'( ΣQR )⁻¹ Kuf Λ⁻¹
Use the identity tr(A'A) = dot(A,A) to get
    tr(Σ⁻¹) = tr(Λ⁻¹) - dot(Lk, Lk)
where Lk ≡ ΣQR^(-1/2) Kuf Λ⁻¹.
GaussianProcesses.predictMVN — Method
predictMVN(xpred::AbstractMatrix, xtrain::AbstractMatrix, ytrain::AbstractVector, kernel::Kernel, meanf::Mean, logNoise::Real, alpha::AbstractVector, covstrat::FullyIndepStrat, Ktrain::FullyIndepPDMat)
See Quiñonero-Candela and Rasmussen 2005, equation 24b. Some derivations can be found below that are not spelled out in the paper.
Notation: Qab = Kau Kuu⁻¹ Kub
          ΣQR = Kuu + Kuf Λ⁻¹ Kuf'
          x: prediction (test) locations
          f: training (observed) locations
          u: inducing point locations
We have
    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
By Woodbury
    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf'(Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹
        = Λ⁻¹ - Λ⁻¹ Kuf'(        ΣQR       )⁻¹ Kuf Λ⁻¹
The predictive mean can be derived as follows (assuming a zero mean function for simplicity):
μ = Qxf (Qff + Λ)⁻¹ y
  = Kxu Kuu⁻¹ Kuf [Λ⁻¹ - Λ⁻¹ Kuf' ΣQR⁻¹ Kuf Λ⁻¹] y # see Woodbury formula above.
  = Kxu Kuu⁻¹ [ΣQR - Kuf Λ⁻¹ Kfu] ΣQR⁻¹ Kuf Λ⁻¹ y # factoring out common terms
  = Kxu Kuu⁻¹ [Kuu] ΣQR⁻¹ Kuf Λ⁻¹ y               # using definition of ΣQR
  = Kxu ΣQR⁻¹ Kuf Λ⁻¹ y                           # same form as equation 16b, with Λ⁻¹ in place of σ⁻²
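This derivation can be checked numerically with random stand-in matrices (a sketch, not package code):

    using LinearAlgebra
    m, n, p = 4, 30, 6                       # inducing, training, test points
    Kuf = randn(m, n); Kux = randn(m, p)
    A = randn(m, m); Kuu = A*A' + m*I        # SPD stand-in for Kuu
    Λ = Diagonal(abs.(randn(n)) .+ 0.5)
    y = randn(n)
    Qff = Kuf' * (Kuu \ Kuf)                 # Qff = Kfu Kuu⁻¹ Kuf
    Qxf = Kux' * (Kuu \ Kuf)                 # Qxf = Kxu Kuu⁻¹ Kuf
    μ_dense = Qxf * ((Qff + Λ) \ y)          # Qxf (Qff + Λ)⁻¹ y
    ΣQR = Kuu + Kuf * (Λ \ Kuf')
    μ_fast = Kux' * (ΣQR \ (Kuf * (Λ \ y)))  # Kxu ΣQR⁻¹ Kuf Λ⁻¹ y
    @assert isapprox(μ_dense, μ_fast; rtol=1e-8)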
Similarly for the posterior predictive covariance:
Σ = Σxx - Qxf (Qff + Λ)⁻¹ Qxf'
  = Σxx - Kxu ΣQR⁻¹ Kuf Λ⁻¹ Qxf'                # substituting result from μ
  = Σxx - Kxu ΣQR⁻¹  Kuf Λ⁻¹ Kfu Kuu⁻¹ Kux      # definition of Qxf
  = Σxx - Kxu ΣQR⁻¹ (ΣQR - Kuu) Kuu⁻¹ Kux       # using definition of ΣQR
  = Σxx - Kxu Kuu⁻¹ Kux + Kxu ΣQR⁻¹ Kux         # expanding
  = Σxx - Qxx           + Kxu ΣQR⁻¹ Kux         # definition of Qxx
GaussianProcesses.trinvAB — Method
trinvAB(A::FullyIndepPDMat, B::Diagonal)
Computes tr(A⁻¹ B) efficiently under the FITC approximation:
    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
Derivation:
    tr(Σ⁻¹ B) = tr[ (Λ⁻¹ - Λ⁻¹ Kuf'(Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹) B ]
              = tr(Λ⁻¹ B) - tr( Λ⁻¹ Kuf' ΣQR⁻¹ Kuf Λ⁻¹ B )
              = tr(Λ⁻¹ B) - tr( Λ⁻¹ Kuf' ΣQR^(-1/2) ΣQR^(-1/2) Kuf Λ⁻¹ B )
                                                    ╰────────────────╯
                                                        ≡ L
              = tr(Λ⁻¹ B) - tr(L'*L*B)
              = tr(Λ⁻¹ B) - dot(L, L*B)
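As a sanity check of the final identity, with random stand-ins and diagonal Λ and B:

    using LinearAlgebra
    m, n = 4, 30
    Kuf = randn(m, n)
    A = randn(m, m); Kuu = A*A' + m*I           # SPD stand-in for Kuu
    Λ = Diagonal(abs.(randn(n)) .+ 0.5)
    B = Diagonal(abs.(randn(n)) .+ 0.1)
    Σ = Kuf' * (Kuu \ Kuf) + Λ
    ΣQR = Kuu + Kuf * (Λ \ Kuf')
    L = cholesky(Symmetric(ΣQR)).L \ (Kuf / Λ)  # L ≡ ΣQR^(-1/2) Kuf Λ⁻¹
    @assert isapprox(tr(inv(Σ) * B), tr(inv(Λ) * B) - dot(L, L * B); rtol=1e-8)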
GaussianProcesses.trinvAB — Method
trinvAB(A::AbstractPDMat, B)
Computes tr(A⁻¹ B).
LinearAlgebra.logdet — Method
The matrix determinant lemma states that
    logdet(A+UWV') = logdet(W⁻¹ + V'A⁻¹U) + logdet(W) + logdet(A)
So for
    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ
    logdet(Σ) = logdet(Kuu + Kuf Λ⁻¹ Kuf')       + logdet(Kuu⁻¹) + logdet(Λ)
              = logdet(        ΣQR             ) - logdet(Kuu)   + logdet(Λ)
LinearAlgebra.tr — Method
tr(a::FullyIndepPDMat)
Trace of the FITC approximation to the covariance matrix:
tr(Σ) = tr(Kuf' Kuu⁻¹ Kuf + Λ)
      = tr(Kuf' Kuu⁻¹ Kuf) + tr(Λ)
      = tr(Kuf' Kuu^{-1/2} Kuu^{-1/2} Kuf) + tr(Λ)
                          ╰──────────────╯
                             ≡  Lk
      = dot(Lk, Lk) + sum(diag(Λ))
GaussianProcesses.SubsetOfRegsPDMat — Type
Subset of Regressors sparse positive definite matrix.
Base.:\ — Method
We have
    Σ ≈ Kuf' Kuu⁻¹ Kuf + σ²I
By Woodbury
    Σ⁻¹ = σ⁻²I - σ⁻⁴ Kuf'(Kuu + σ⁻² Kuf Kuf')⁻¹ Kuf
        = σ⁻²I - σ⁻⁴ Kuf'(       ΣQR        )⁻¹ Kuf
GaussianProcesses.dmll_kern! — Method
dmll_kern!(dmll::AbstractVector, kernel::Kernel, X::AbstractMatrix, cK::SubsetOfRegsPDMat, kerneldata::KernelData, ααinvcKI::AbstractMatrix, covstrat::SubsetOfRegsStrategy)
Derivative of the log likelihood under the Subset of Regressors (SoR) approximation.
Helpful reference: Vanhatalo, Jarno, and Aki Vehtari. "Sparse log Gaussian processes via MCMC for spatial epidemiology." In Gaussian processes in practice, pp. 73-89. 2007.
Generally, for a multivariate normal with zero mean
    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
                     ╰───────────────╯     ╰──────────╯
                             V                  T
where Σ = Kff + σ²I.
Notation: f is the observations, u is the inducing points.
          ∂X stands for ∂X/∂θ, where θ is a kernel hyperparameter.
In the SoR approximation, we replace Kff with Qff = Kfu Kuu⁻¹ Kuf
∂Σ = ∂(Qff) = ∂(Kfu Kuu⁻¹ Kuf) = ∂(Kfu) Kuu⁻¹ Kuf + Kfu ∂(Kuu⁻¹) Kuf + Kfu Kuu⁻¹ ∂(Kuf)
where ∂(Kuu⁻¹) = -Kuu⁻¹ ∂(Kuu) Kuu⁻¹.
We also have the pre-computed α = Σ⁻¹ y, so V can now be computed efficiently, in O(nm²), by careful ordering of the matrix multiplication steps.
For T, we use the identity tr(AB) = dot(A', B):
    tr(Σ⁻¹ ∂Σ) = 2 tr(Σ⁻¹ ∂(Kfu) Kuu⁻¹ Kuf) + tr(Σ⁻¹ Kfu ∂(Kuu⁻¹) Kuf)
               = 2 dot((Σ⁻¹ ∂(Kfu))', Kuu⁻¹ Kuf) - tr(Σ⁻¹ Kfu Kuu⁻¹ ∂Kuu Kuu⁻¹ Kuf)
               = 2 dot((Σ⁻¹ ∂(Kfu))', Kuu⁻¹ Kuf) - dot((Σ⁻¹ Kfu)', Kuu⁻¹ ∂Kuu Kuu⁻¹ Kuf)
which again is computed in O(nm²).
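The dot-product identity itself is easy to verify numerically (a tiny sketch; the point is that tr(AB) never requires forming the full product):

    using LinearAlgebra
    A = randn(5, 7); B = randn(7, 5)
    @assert isapprox(tr(A * B), dot(A', B))   # tr(AB) = dot(A', B) = Σᵢⱼ Aⱼᵢ Bᵢⱼ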
GaussianProcesses.dmll_noise — Method
dmll_noise(gp::GPE, precomp::SoRPrecompute)
∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
∂Σ = I for derivative wrt σ², so ∂logp(Y|θ) = 1/2 y' Σ⁻¹ Σ⁻¹ y - 1/2 tr(Σ⁻¹) = 1/2[ dot(α,α) - tr(Σ⁻¹) ]
Σ⁻¹ = σ⁻²I - σ⁻⁴ Kuf'(Kuu + σ⁻² Kuf Kuf')⁻¹ Kuf = σ⁻²I - σ⁻⁴ Kuf'( ΣQR )⁻¹ Kuf
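A finite-difference sketch of this noise-derivative formula, with a random low-rank matrix standing in for Qff (names here are illustrative):

    using LinearAlgebra
    n = 20
    W = randn(n, 4); Qff = W * W'            # low-rank stand-in for Qff
    y = randn(n)
    logp(σ²) = -0.5 * (dot(y, (Qff + σ²*I) \ y) + logdet(Qff + σ²*I) + n*log(2π))
    σ² = 0.3
    α = (Qff + σ²*I) \ y
    g_analytic = 0.5 * (dot(α, α) - tr(inv(Qff + σ²*I)))
    h = 1e-6
    g_fd = (logp(σ² + h) - logp(σ² - h)) / (2h)   # central finite difference
    @assert isapprox(g_analytic, g_fd; rtol=1e-4)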
GaussianProcesses.get_alpha_u — Method
get_alpha_u(Ktrain::SubsetOfRegsPDMat, xtrain::AbstractMatrix, ytrain::AbstractVector, m::Mean)
Computes
    ΣQR⁻¹ Kuf Λ⁻¹ (y-μ)
GaussianProcesses.predictMVN — Method
See Quiñonero-Candela and Rasmussen 2005, equation 16b.
Some derivations can be found below that are not spelled out in the paper.
Notation: Qab = Kau Kuu⁻¹ Kub
          ΣQR = Kuu + σ⁻² Kuf Kuf'
          x: prediction (test) locations
          f: training (observed) locations
          u: inducing point locations
We have
    Σ ≈ Kuf' Kuu⁻¹ Kuf + σ²I
By Woodbury
    Σ⁻¹ = σ⁻²I - σ⁻⁴ Kuf'(Kuu + σ⁻² Kuf Kuf')⁻¹ Kuf
        = σ⁻²I - σ⁻⁴ Kuf'(       ΣQR        )⁻¹ Kuf
The predictive mean can be derived as follows (assuming a zero mean function for simplicity):
μ = Qxf (Qff + σ²I)⁻¹ y
  = Kxu Kuu⁻¹ Kuf [σ⁻²I - σ⁻⁴ Kuf' ΣQR⁻¹ Kuf] y   # see Woodbury formula above.
  = σ⁻² Kxu Kuu⁻¹ [ΣQR - σ⁻² Kuf Kfu] ΣQR⁻¹ Kuf y # factoring out common terms
  = σ⁻² Kxu Kuu⁻¹ [Kuu] ΣQR⁻¹ Kuf y               # using definition of ΣQR
  = σ⁻² Kxu ΣQR⁻¹ Kuf y                           # matches equation 16b
Similarly for the posterior predictive covariance:
Σ = Qxx - Qxf (Qff + σ²I)⁻¹ Qxf'
  = Qxx - σ⁻² Kxu ΣQR⁻¹ Kuf Qxf'                # substituting result from μ
  = Qxx - σ⁻² Kxu ΣQR⁻¹  Kuf Kfu    Kuu⁻¹ Kux   # definition of Qxf
  = Qxx -     Kxu ΣQR⁻¹ (ΣQR - Kuu) Kuu⁻¹ Kux   # using definition of ΣQR
  = Qxx - Kxu Kuu⁻¹ Kux + Kxu ΣQR⁻¹ Kux         # expanding
  = Qxx - Qxx           + Kxu ΣQR⁻¹ Kux         # definition of Qxx
  = Kxu ΣQR⁻¹ Kux                               # simplifying
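Putting the pieces together, a short sketch verifying both SoR formulas against their dense counterparts (random stand-in matrices, not package code):

    using LinearAlgebra
    m, n, p = 4, 30, 6                      # inducing, training, test points
    Kuf = randn(m, n); Kux = randn(m, p)
    A = randn(m, m); Kuu = A*A' + m*I       # SPD stand-in for Kuu
    σ² = 0.1
    y = randn(n)
    Qff = Kuf' * (Kuu \ Kuf)
    Qxf = Kux' * (Kuu \ Kuf)
    Qxx = Kux' * (Kuu \ Kux)
    μ_dense = Qxf * ((Qff + σ²*I) \ y)           # Qxf (Qff + σ²I)⁻¹ y
    Σ_dense = Qxx - Qxf * ((Qff + σ²*I) \ Qxf')
    ΣQR = Kuu + Kuf * Kuf' / σ²
    μ_fast = Kux' * (ΣQR \ (Kuf * y)) / σ²       # σ⁻² Kxu ΣQR⁻¹ Kuf y
    Σ_fast = Kux' * (ΣQR \ Kux)                  # Kxu ΣQR⁻¹ Kux
    @assert isapprox(μ_dense, μ_fast; rtol=1e-6)
    @assert isapprox(Σ_dense, Σ_fast; rtol=1e-6)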