Sparse GPs
GaussianProcesses.predict_f — Method

predict_f(gp::GPBase, X::Matrix{Float64}[; full_cov::Bool = false])

Return the posterior mean and variance of the Gaussian process gp at specific points, which are given as columns of the matrix X. If full_cov is true, the full covariance matrix is returned instead of only the variances.
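A minimal usage sketch (not taken from the package docs; it assumes the package's standard GP, MeanZero, and SE constructors and uses synthetic data):

```julia
using GaussianProcesses

# toy 1-D regression data (illustrative only)
x = 2π .* rand(20)
y = sin.(x) .+ 0.05 .* randn(20)

# exact GP with a squared-exponential kernel and log-noise -2.0
gp = GP(x, y, MeanZero(), SE(0.0, 0.0), -2.0)

# prediction points as columns of a d×n matrix
Xpred = reshape(collect(range(0.0, 2π, length=50)), 1, :)

μ, σ² = predict_f(gp, Xpred)                 # marginal variances
μ, Σ  = predict_f(gp, Xpred; full_cov=true)  # full covariance matrix
```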
GaussianProcesses.DeterminTrainCondStrat — Type

Deterministic Training Conditional (DTC) covariance strategy.
GaussianProcesses.predictMVN — Method

Deterministic Training Conditional (DTC) multivariate normal predictions.

See Quiñonero-Candela and Rasmussen 2005, equation 20b.

    μ_DTC = μ_SoR
    Σ_DTC = Σxx - Qxx + Σ_SoR

where μ_SoR and Σ_SoR are the predictive mean and covariance of the Subset of Regressors approximation.
GaussianProcesses.FullScaleApproxStrat — Type

Full Scale Approximation (FSA) covariance strategy.
GaussianProcesses.FullScalePDMat — Type

Positive Definite Matrix for the Full Scale Approximation.
Base.:\ — Method

We have

    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ

where Λ is a block-diagonal matrix (here given by σ²I plus the block diagonal of Kff - Qff).

The Woodbury matrix identity gives

    (A + UCV)⁻¹ = A⁻¹ - A⁻¹ U (C⁻¹ + V A⁻¹ U)⁻¹ V A⁻¹

which we use here with

    A ← Λ
    U ← Kuf'
    V ← Kuf
    C ← Kuu⁻¹

which gives

    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf' (Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹
        = Λ⁻¹ - Λ⁻¹ Kuf' (ΣQR)⁻¹ Kuf Λ⁻¹
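A quick numerical check of this Woodbury form, in plain LinearAlgebra with random matrices (Λ is taken diagonal here for simplicity; FullScalePDMat uses a block-diagonal Λ):

```julia
using LinearAlgebra

n, m = 8, 3
Kuf = randn(m, n)
Kuu = let A = randn(m, m); A * A' + m * I end   # SPD inducing-point covariance
Λ   = Diagonal(rand(n) .+ 0.5)

Σ    = Kuf' * (Kuu \ Kuf) + Λ                   # Kuf' Kuu⁻¹ Kuf + Λ
ΣQR  = Kuu + Kuf * (Λ \ Kuf')
Σinv = inv(Λ) - (Λ \ Kuf') * (ΣQR \ (Kuf / Λ))  # Woodbury form above

@assert Σinv ≈ inv(Σ)
```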
GaussianProcesses.dmll_kern! — Method

dmll_kern!(dmll::AbstractVector, kernel::Kernel, X::AbstractMatrix, cK::AbstractPDMat, kerneldata::KernelData, alpha::AbstractVector, Kuu, Kuf, Kuu⁻¹Kuf, Kuu⁻¹KufΣ⁻¹y, Σ⁻¹Kfu, ∂Kuu, ∂Kuf, covstrat::FullScaleApproxStrat)

Derivative of the log likelihood under the Full Scale Approximation (FSA).

Helpful reference: Vanhatalo, Jarno, and Aki Vehtari. "Sparse log Gaussian processes via MCMC for spatial epidemiology." In Gaussian Processes in Practice, pp. 73-89. 2007.

Generally, for a multivariate normal with zero mean:

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
                 ╰──────────────────╯       ╰─────────╯
                          V                      T

where Σ = Kff + σ²I.

Notation: f denotes the observations, u the inducing points. ∂X stands for ∂X/∂θ, where θ are the kernel hyperparameters.

In the case of the FSA approximation, we have Σ = Λ + Qff. The second component gives ∂(Qff) = ∂(Kfu Kuu⁻¹ Kuf), which is handled by the gradient function for the Subset of Regressors approximation, and so is not repeated here.
For ∂Λ we have (with `i` indexing each block):

    Λi  = Ki - Qi + σ²I
        = Ki - Kui' Kuu⁻¹ Kui + σ²I
    ∂Λi = ∂Ki - Kui' ∂(Kuu⁻¹) Kui - 2 ∂Kui' Kuu⁻¹ Kui
        = ∂Ki + Kui' Kuu⁻¹ ∂(Kuu) Kuu⁻¹ Kui - 2 ∂Kui' Kuu⁻¹ Kui
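The step ∂(Kuu⁻¹) = -Kuu⁻¹ ∂(Kuu) Kuu⁻¹ used above can be sanity-checked by finite differences (an illustrative sketch, not package code):

```julia
using LinearAlgebra

m  = 4
K  = let A = randn(m, m); A * A' + m * I end  # SPD stand-in for Kuu
dK = let B = randn(m, m); B + B' end          # symmetric perturbation ∂Kuu
ε  = 1e-6

fd = (inv(K + ε * dK) - inv(K - ε * dK)) / (2ε)  # finite-difference ∂(Kuu⁻¹)
an = -inv(K) * dK * inv(K)                       # analytic form

@assert isapprox(fd, an; rtol = 1e-4)
```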
GaussianProcesses.dmll_noise — Method

dmll_noise(gp::GPE, precomp::SoRPrecompute, covstrat::FullScaleApproxStrat)

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)

∂Σ = I for the derivative with respect to σ², so

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ Σ⁻¹ y - 1/2 tr(Σ⁻¹)
               = 1/2 [ dot(α, α) - tr(Σ⁻¹) ]

We have

    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf' (ΣQR)⁻¹ Kuf Λ⁻¹

Use the identity tr(A'A) = dot(A, A) to get

    tr(Σ⁻¹) = tr(Λ⁻¹) - dot(Lk, Lk)

where Lk ≡ ΣQR^(-1/2) Kuf Λ⁻¹.
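A numerical check of the tr(Σ⁻¹) shortcut (illustrative; a Cholesky factor plays the role of ΣQR^(-1/2)):

```julia
using LinearAlgebra

n, m = 8, 3
Kuf = randn(m, n)
Kuu = let A = randn(m, m); A * A' + m * I end
Λ   = Diagonal(rand(n) .+ 0.5)

Σ   = Kuf' * (Kuu \ Kuf) + Λ
ΣQR = Kuu + Kuf * (Λ \ Kuf')

Lk = cholesky(Symmetric(ΣQR)).L \ (Kuf / Λ)   # ΣQR^(-1/2) Kuf Λ⁻¹
@assert tr(inv(Σ)) ≈ tr(inv(Λ)) - dot(Lk, Lk)
```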
GaussianProcesses.predictMVN — Method

predictMVN(xpred::AbstractMatrix, xtrain::AbstractMatrix, ytrain::AbstractVector, kernel::Kernel, meanf::Mean, logNoise::Real, alpha::AbstractVector, covstrat::FullScaleApproxStrat, Ktrain::FullScalePDMat)

See Quiñonero-Candela and Rasmussen 2005, equation 24b. Some derivations that are not spelled out in the paper can be found below.
Notation: Qab = Kau Kuu⁻¹ Kub
ΣQR = Kuu + Kuf Λ⁻¹ Kuf'
x: prediction (test) locations
f: training (observed) locations
u: inducing point locations
We have

    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ

By Woodbury

    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf' (Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹
        = Λ⁻¹ - Λ⁻¹ Kuf' (ΣQR)⁻¹ Kuf Λ⁻¹
The predictive mean can be derived (assuming a zero mean function for simplicity):

    μ = (Qxf + Λxf) (Qff + Λff)⁻¹ y
      = Qxf (Qff + Λff)⁻¹ y + Λxf (Qff + Λff)⁻¹ y
        ╰───────────────────╯
            same as FITC
      = Kxu ΣQR⁻¹ Kuf Λff⁻¹ y + Λxf (Qff + Λff)⁻¹ y
        ╰────────────────────╯  ╰───────────────────╯
              ≡ alpha_u                ≡ alpha
Similarly for the posterior predictive covariance:
Σ = Σxx - (Qxf+Λxf) (Qff + Λff)⁻¹ (Qxf+Λxf)'
= Σxx - Kxu ΣQR⁻¹ Kuf Λ⁻¹ Qxf' # substituting result from μ
= Σxx - Kxu ΣQR⁻¹ Kuf Λ⁻¹ Kfu Kuu⁻¹ Kux # definition of Qxf
= Σxx - Kxu ΣQR⁻¹ (ΣQR - Kuu) Kuu⁻¹ Kux # using definition of ΣQR
= Σxx - Kxu Kuu⁻¹ Kux + Kxu ΣQR⁻¹ Kux # expanding
= Σxx - Qxx + Kxu ΣQR⁻¹ Kux # definition of Qxx
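The "same as FITC" step in the mean derivation above is an exact identity; here is a sketch checking it numerically with a generic SPD Λ standing in for Λff (in the FSA case Λ is block-diagonal):

```julia
using LinearAlgebra

n, m, p = 8, 3, 5
Kuf = randn(m, n); Kxu = randn(p, m)
Kuu = let A = randn(m, m); A * A' + m * I end
Λ   = let A = randn(n, n); A * A' + n * I end   # stand-in for Λff
y   = randn(n)

Qxf = Kxu * (Kuu \ Kuf)
Qff = Kuf' * (Kuu \ Kuf)
ΣQR = Kuu + Kuf * (Λ \ Kuf')

naive  = Qxf * ((Qff + Λ) \ y)            # O(n³) form
sparse = Kxu * (ΣQR \ (Kuf * (Λ \ y)))    # Kxu ΣQR⁻¹ Kuf Λ⁻¹ y

@assert naive ≈ sparse
```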
GaussianProcesses.trinv — Method

trinv(a::BlockDiagPDMat)

Trace of the inverse of a block-diagonal positive definite matrix. This is obtained as the sum of the traces of the inverses of the blocks.
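A sketch of the blockwise computation in plain Julia (hypothetical blocks):

```julia
using LinearAlgebra

# three SPD blocks of a block-diagonal matrix
blocks = [let A = randn(k, k); A * A' + k * I end for k in (3, 4, 2)]

# trace of the inverse, block by block
trinv_blockwise = sum(tr(inv(B)) for B in blocks)

# compare against assembling the full block-diagonal matrix
full = cat(blocks...; dims = (1, 2))
@assert trinv_blockwise ≈ tr(inv(full))
```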
GaussianProcesses.trinv — Method

trinv(pd::AbstractPDMat)

Trace of the inverse of a positive definite matrix.
GaussianProcesses.trinvAB — Method

trinvAB(A::FullScalePDMat, B::Diagonal)

Computes tr(A⁻¹ B) efficiently under the FSA approximation:

    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ

Derivation:

    tr(Σ⁻¹ B) = tr[ (Λ⁻¹ - Λ⁻¹ Kuf' (Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹) B ]
              = tr(Λ⁻¹ B) - tr( Λ⁻¹ Kuf' ΣQR⁻¹ Kuf Λ⁻¹ B )
              = tr(Λ⁻¹ B) - tr( Λ⁻¹ Kuf' ΣQR^(-1/2) ΣQR^(-1/2) Kuf Λ⁻¹ B )
                                                    ╰─────────────────────╯
                                                             ≡ L'
              = tr(Λ⁻¹ B) - tr(L L' B)
              = tr(Λ⁻¹ B) - dot(L, B L)
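Numerically checking the final identity (illustrative; L' is realized with a Cholesky factor of ΣQR):

```julia
using LinearAlgebra

n, m = 8, 3
Kuf = randn(m, n)
Kuu = let A = randn(m, m); A * A' + m * I end
Λ   = Diagonal(rand(n) .+ 0.5)
B   = Diagonal(rand(n))

Σ   = Kuf' * (Kuu \ Kuf) + Λ
ΣQR = Kuu + Kuf * (Λ \ Kuf')

Lt = cholesky(Symmetric(ΣQR)).L \ (Kuf / Λ)   # ΣQR^(-1/2) Kuf Λ⁻¹, i.e. L'
@assert tr(inv(Σ) * B) ≈ tr(inv(Λ) * B) - dot(Lt, Lt * B)
```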
LinearAlgebra.logdet — Method

The matrix determinant lemma states that

    logdet(A + UWV') = logdet(W⁻¹ + V' A⁻¹ U) + logdet(W) + logdet(A)

So for

    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ

we get

    logdet(Σ) = logdet(Kuu + Kuf Λ⁻¹ Kuf') + logdet(Kuu⁻¹) + logdet(Λ)
              = logdet(ΣQR) - logdet(Kuu) + logdet(Λ)
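A numerical check of this logdet decomposition (illustrative sketch with random matrices):

```julia
using LinearAlgebra

n, m = 8, 3
Kuf = randn(m, n)
Kuu = let A = randn(m, m); A * A' + m * I end
Λ   = Diagonal(rand(n) .+ 0.5)

Σ   = Kuf' * (Kuu \ Kuf) + Λ
ΣQR = Kuu + Kuf * (Λ \ Kuf')

@assert logdet(Symmetric(Σ)) ≈ logdet(Symmetric(ΣQR)) - logdet(Symmetric(Kuu)) + logdet(Λ)
```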
LinearAlgebra.tr — Method

tr(a::FullScalePDMat)

Trace of the FSA approximation to the covariance matrix:

    tr(Σ) = tr(Kuf' Kuu⁻¹ Kuf + Λ)
          = tr(Kuf' Kuu⁻¹ Kuf) + tr(Λ)
          = tr(Kuf' Kuu^(-1/2) Kuu^(-1/2) Kuf) + tr(Λ)
                               ╰───────────────╯
                                     ≡ Lk
          = dot(Lk, Lk) + sum(diag(Λ))
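And a check of this trace shortcut (Lk realized with a Cholesky factor of Kuu):

```julia
using LinearAlgebra

n, m = 8, 3
Kuf = randn(m, n)
Kuu = let A = randn(m, m); A * A' + m * I end
Λ   = Diagonal(rand(n) .+ 0.5)

Lk = cholesky(Symmetric(Kuu)).L \ Kuf       # Kuu^(-1/2) Kuf
@assert tr(Kuf' * (Kuu \ Kuf) + Λ) ≈ dot(Lk, Lk) + sum(diag(Λ))
```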
GaussianProcesses.FullyIndepPDMat — Type

Positive Definite Matrix for the Fully Independent Training Conditional (FITC) approximation.
GaussianProcesses.FullyIndepStrat — Type

Fully Independent Training Conditional (FITC) covariance strategy.
Base.:\ — Method

We have

    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ

where Λ is a diagonal matrix (here given by σ²I + diag(Kff - Qff)).

The Woodbury matrix identity gives

    (A + UCV)⁻¹ = A⁻¹ - A⁻¹ U (C⁻¹ + V A⁻¹ U)⁻¹ V A⁻¹

which we use here with

    A ← Λ
    U ← Kuf'
    V ← Kuf
    C ← Kuu⁻¹

which gives

    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf' (Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹
        = Λ⁻¹ - Λ⁻¹ Kuf' (ΣQR)⁻¹ Kuf Λ⁻¹
GaussianProcesses.dmll_kern! — Method

dmll_kern!(dmll::AbstractVector, kernel::Kernel, X::AbstractMatrix, cK::AbstractPDMat, kerneldata::KernelData, alpha::AbstractVector, Kuu, Kuf, Kuu⁻¹Kuf, Kuu⁻¹KufΣ⁻¹y, Σ⁻¹Kfu, ∂Kuu, ∂Kuf, covstrat::FullyIndepStrat)

Derivative of the log likelihood under the Fully Independent Training Conditional (FITC) approximation.

Helpful reference: Vanhatalo, Jarno, and Aki Vehtari. "Sparse log Gaussian processes via MCMC for spatial epidemiology." In Gaussian Processes in Practice, pp. 73-89. 2007.

Generally, for a multivariate normal with zero mean:

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
                 ╰──────────────────╯       ╰─────────╯
                          V                      T

where Σ = Kff + σ²I.

Notation: f denotes the observations, u the inducing points. ∂X stands for ∂X/∂θ, where θ are the kernel hyperparameters.

In the case of the FITC approximation, we have Σ = Λ + Qff. The second component gives ∂(Qff) = ∂(Kfu Kuu⁻¹ Kuf), which is handled by the gradient function for the Subset of Regressors approximation, and so is not repeated here.
The ith element of diag(Kff - Qff) gives

    Λi  = Kii - Qii + σ²
        = Kii - Kui' Kuu⁻¹ Kui + σ²
    ∂Λi = ∂Kii - Kui' ∂(Kuu⁻¹) Kui - 2 ∂Kui' Kuu⁻¹ Kui
        = ∂Kii + Kui' Kuu⁻¹ ∂(Kuu) Kuu⁻¹ Kui - 2 ∂Kui' Kuu⁻¹ Kui
GaussianProcesses.dmll_noise — Method

dmll_noise(gp::GPE, precomp::SoRPrecompute, covstrat::FullyIndepStrat)

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)

∂Σ = I for the derivative with respect to σ², so

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ Σ⁻¹ y - 1/2 tr(Σ⁻¹)
               = 1/2 [ dot(α, α) - tr(Σ⁻¹) ]

We have

    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf' (ΣQR)⁻¹ Kuf Λ⁻¹

Use the identity tr(A'A) = dot(A, A) to get

    tr(Σ⁻¹) = tr(Λ⁻¹) - dot(Lk, Lk)

where Lk ≡ ΣQR^(-1/2) Kuf Λ⁻¹.
GaussianProcesses.predictMVN — Method

predictMVN(xpred::AbstractMatrix, xtrain::AbstractMatrix, ytrain::AbstractVector, kernel::Kernel, meanf::Mean, logNoise::Real, alpha::AbstractVector, covstrat::FullyIndepStrat, Ktrain::FullyIndepPDMat)

See Quiñonero-Candela and Rasmussen 2005, equation 24b. Some derivations that are not spelled out in the paper can be found below.
Notation: Qab = Kau Kuu⁻¹ Kub
ΣQR = Kuu + Kuf Λ⁻¹ Kuf'
x: prediction (test) locations
f: training (observed) locations
u: inducing point locations
We have

    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ

By Woodbury

    Σ⁻¹ = Λ⁻¹ - Λ⁻¹ Kuf' (Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹
        = Λ⁻¹ - Λ⁻¹ Kuf' (ΣQR)⁻¹ Kuf Λ⁻¹
The predictive mean can be derived (assuming a zero mean function for simplicity):

    μ = Qxf (Qff + Λ)⁻¹ y
      = Kxu Kuu⁻¹ Kuf [Λ⁻¹ - Λ⁻¹ Kuf' ΣQR⁻¹ Kuf Λ⁻¹] y   # see Woodbury formula above
      = Kxu Kuu⁻¹ [ΣQR - Kuf Λ⁻¹ Kfu] ΣQR⁻¹ Kuf Λ⁻¹ y    # factoring out common terms
      = Kxu Kuu⁻¹ [Kuu] ΣQR⁻¹ Kuf Λ⁻¹ y                  # using definition of ΣQR
      = Kxu ΣQR⁻¹ Kuf Λ⁻¹ y                              # matches equation 24b
Similarly for the posterior predictive covariance:
Σ = Σxx - Qxf (Qff + Λ)⁻¹ Qxf'
= Σxx - Kxu ΣQR⁻¹ Kuf Λ⁻¹ Qxf' # substituting result from μ
= Σxx - Kxu ΣQR⁻¹ Kuf Λ⁻¹ Kfu Kuu⁻¹ Kux # definition of Qxf
= Σxx - Kxu ΣQR⁻¹ (ΣQR - Kuu) Kuu⁻¹ Kux # using definition of ΣQR
= Σxx - Kxu Kuu⁻¹ Kux + Kxu ΣQR⁻¹ Kux # expanding
= Σxx - Qxx + Kxu ΣQR⁻¹ Kux # definition of Qxx
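The whole chain of equalities for the predictive covariance can be verified numerically (illustrative sketch with random matrices and a diagonal Λ):

```julia
using LinearAlgebra

n, m, p = 8, 3, 5
Kuf = randn(m, n); Kxu = randn(p, m)
Kuu = let A = randn(m, m); A * A' + m * I end
Λ   = Diagonal(rand(n) .+ 0.5)
Σxx = let A = randn(p, p); A * A' + p * I end

Qxf = Kxu * (Kuu \ Kuf)
Qff = Kuf' * (Kuu \ Kuf)
Qxx = Kxu * (Kuu \ Kxu')
ΣQR = Kuu + Kuf * (Λ \ Kuf')

naive  = Σxx - Qxf * ((Qff + Λ) \ Qxf')   # first line of the derivation
sparse = Σxx - Qxx + Kxu * (ΣQR \ Kxu')   # last line

@assert naive ≈ sparse
```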
GaussianProcesses.trinvAB — Method

trinvAB(A::FullyIndepPDMat, B::Diagonal)

Computes tr(A⁻¹ B) efficiently under the FITC approximation:

    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ

Derivation:

    tr(Σ⁻¹ B) = tr[ (Λ⁻¹ - Λ⁻¹ Kuf' (Kuu + Kuf Λ⁻¹ Kuf')⁻¹ Kuf Λ⁻¹) B ]
              = tr(Λ⁻¹ B) - tr( Λ⁻¹ Kuf' ΣQR⁻¹ Kuf Λ⁻¹ B )
              = tr(Λ⁻¹ B) - tr( Λ⁻¹ Kuf' ΣQR^(-1/2) ΣQR^(-1/2) Kuf Λ⁻¹ B )
                                                    ╰─────────────────────╯
                                                             ≡ L
              = tr(Λ⁻¹ B) - tr(L' L B)
              = tr(Λ⁻¹ B) - dot(L, L B)
GaussianProcesses.trinvAB — Method

trinvAB(A::AbstractPDMat, B)

Computes tr(A⁻¹ B).
LinearAlgebra.logdet — Method

The matrix determinant lemma states that

    logdet(A + UWV') = logdet(W⁻¹ + V' A⁻¹ U) + logdet(W) + logdet(A)

So for

    Σ ≈ Kuf' Kuu⁻¹ Kuf + Λ

we get

    logdet(Σ) = logdet(Kuu + Kuf Λ⁻¹ Kuf') + logdet(Kuu⁻¹) + logdet(Λ)
              = logdet(ΣQR) - logdet(Kuu) + logdet(Λ)
LinearAlgebra.tr — Method

tr(a::FullyIndepPDMat)

Trace of the FITC approximation to the covariance matrix:

    tr(Σ) = tr(Kuf' Kuu⁻¹ Kuf + Λ)
          = tr(Kuf' Kuu⁻¹ Kuf) + tr(Λ)
          = tr(Kuf' Kuu^(-1/2) Kuu^(-1/2) Kuf) + tr(Λ)
                               ╰───────────────╯
                                     ≡ Lk
          = dot(Lk, Lk) + sum(diag(Λ))
GaussianProcesses.SubsetOfRegsPDMat — Type

Subset of Regressors (SoR) sparse positive definite matrix.
Base.:\ — Method

We have

    Σ ≈ Kuf' Kuu⁻¹ Kuf + σ²I

By Woodbury

    Σ⁻¹ = σ⁻²I - σ⁻⁴ Kuf' (Kuu + σ⁻² Kuf Kuf')⁻¹ Kuf
        = σ⁻²I - σ⁻⁴ Kuf' (ΣQR)⁻¹ Kuf
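A sketch of how this inverse is applied to a vector without ever forming the n×n matrix (hypothetical helper, costing O(nm² + m³) instead of O(n³)):

```julia
using LinearAlgebra

# hypothetical helper: apply Σ⁻¹ = (Kuf' Kuu⁻¹ Kuf + σ²I)⁻¹ to y via Woodbury
function sor_woodbury_solve(Kuu, Kuf, σ², y)
    ΣQR = Kuu + Kuf * Kuf' ./ σ²
    (y .- Kuf' * (ΣQR \ (Kuf * y)) ./ σ²) ./ σ²
end

n, m = 8, 3
Kuf = randn(m, n)
Kuu = let A = randn(m, m); A * A' + m * I end
σ², y = 0.3, randn(n)

@assert sor_woodbury_solve(Kuu, Kuf, σ², y) ≈ (Kuf' * (Kuu \ Kuf) + σ² * I) \ y
```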
GaussianProcesses.dmll_kern! — Method

dmll_kern!(dmll::AbstractVector, kernel::Kernel, X::AbstractMatrix, cK::SubsetOfRegsPDMat, kerneldata::KernelData, ααinvcKI::AbstractMatrix, covstrat::SubsetOfRegsStrategy)

Derivative of the log likelihood under the Subset of Regressors (SoR) approximation.

Helpful reference: Vanhatalo, Jarno, and Aki Vehtari. "Sparse log Gaussian processes via MCMC for spatial epidemiology." In Gaussian Processes in Practice, pp. 73-89. 2007.

Generally, for a multivariate normal with zero mean:

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)
                 ╰──────────────────╯       ╰─────────╯
                          V                      T

where Σ = Kff + σ²I.

Notation: f denotes the observations, u the inducing points. ∂X stands for ∂X/∂θ, where θ are the kernel hyperparameters.
In the SoR approximation, we replace Kff with Qff = Kfu Kuu⁻¹ Kuf:

    ∂Σ = ∂(Qff) = ∂(Kfu Kuu⁻¹ Kuf)
       = ∂(Kfu) Kuu⁻¹ Kuf + Kfu ∂(Kuu⁻¹) Kuf + Kfu Kuu⁻¹ ∂(Kuf)

with

    ∂(Kuu⁻¹) = -Kuu⁻¹ ∂(Kuu) Kuu⁻¹

We also have the pre-computed α = Σ⁻¹ y, so V can now be computed efficiently (in O(nm²)) by careful ordering of the matrix multiplication steps.
For T, we use the identity tr(AB) = dot(A', B):

    tr(Σ⁻¹ ∂Σ) = 2 tr(Σ⁻¹ ∂(Kfu) Kuu⁻¹ Kuf) + tr(Σ⁻¹ Kfu ∂(Kuu⁻¹) Kuf)
               = 2 dot((Σ⁻¹ ∂(Kfu))', Kuu⁻¹ Kuf) - tr(Σ⁻¹ Kfu Kuu⁻¹ ∂Kuu Kuu⁻¹ Kuf)
               = 2 dot((Σ⁻¹ ∂(Kfu))', Kuu⁻¹ Kuf) - dot((Σ⁻¹ Kfu)', Kuu⁻¹ ∂Kuu Kuu⁻¹ Kuf)

which again is computed in O(nm²).
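The trace identity itself is easy to check, and it is what keeps T at O(nm²): both arguments of dot have the small shape, so no n×n product is ever formed (illustrative check):

```julia
using LinearAlgebra

# tr(AB) = dot(A', B) for conformable matrices
A = randn(4, 6)
B = randn(6, 4)
@assert tr(A * B) ≈ dot(A', B)
```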
GaussianProcesses.dmll_noise — Method

dmll_noise(gp::GPE, precomp::SoRPrecompute)

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ ∂Σ Σ⁻¹ y - 1/2 tr(Σ⁻¹ ∂Σ)

∂Σ = I for the derivative with respect to σ², so

    ∂logp(Y|θ) = 1/2 y' Σ⁻¹ Σ⁻¹ y - 1/2 tr(Σ⁻¹)
               = 1/2 [ dot(α, α) - tr(Σ⁻¹) ]

with

    Σ⁻¹ = σ⁻²I - σ⁻⁴ Kuf' (Kuu + σ⁻² Kuf Kuf')⁻¹ Kuf
        = σ⁻²I - σ⁻⁴ Kuf' (ΣQR)⁻¹ Kuf
GaussianProcesses.get_alpha_u — Method

get_alpha_u(Ktrain::SubsetOfRegsPDMat, xtrain::AbstractMatrix, ytrain::AbstractVector, m::Mean)

Computes

    alpha_u = ΣQR⁻¹ Kuf Λ⁻¹ (y - μ)
GaussianProcesses.predictMVN — Method

See Quiñonero-Candela and Rasmussen 2005, equation 16b. Some derivations that are not spelled out in the paper can be found below.
Notation: Qab = Kau Kuu⁻¹ Kub
ΣQR = Kuu + σ⁻² Kuf Kuf'
x: prediction (test) locations
f: training (observed) locations
u: inducing point locations
We have
Σ ≈ Kuf' Kuu⁻¹ Kuf + σ²I
By Woodbury
Σ⁻¹ = σ⁻²I - σ⁻⁴ Kuf'(Kuu + σ⁻² Kuf Kuf')⁻¹ Kuf
= σ⁻²I - σ⁻⁴ Kuf'( ΣQR )⁻¹ Kuf
The predictive mean can be derived (assuming a zero mean function for simplicity)
μ = Qxf (Qff + σ²I)⁻¹ y
= Kxu Kuu⁻¹ Kuf [σ⁻²I - σ⁻⁴ Kuf' ΣQR⁻¹ Kuf] y # see Woodbury formula above.
= σ⁻² Kxu Kuu⁻¹ [ΣQR - σ⁻² Kuf Kfu] ΣQR⁻¹ Kuf y # factoring out common terms
= σ⁻² Kxu Kuu⁻¹ [Kuu] ΣQR⁻¹ Kuf y # using definition of ΣQR
= σ⁻² Kxu ΣQR⁻¹ Kuf y # matches equation 16b
Similarly for the posterior predictive covariance:
Σ = Qxx - Qxf (Qff + σ²I)⁻¹ Qxf'
= Qxx - σ⁻² Kxu ΣQR⁻¹ Kuf Qxf' # substituting result from μ
= Qxx - σ⁻² Kxu ΣQR⁻¹ Kuf Kfu Kuu⁻¹ Kux # definition of Qxf
= Qxx - Kxu ΣQR⁻¹ (ΣQR - Kuu) Kuu⁻¹ Kux # using definition of ΣQR
= Qxx - Kxu Kuu⁻¹ Kux + Kxu ΣQR⁻¹ Kux # expanding
= Qxx - Qxx + Kxu ΣQR⁻¹ Kux # definition of Qxx
= Kxu ΣQR⁻¹ Kux # simplifying
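Both final forms (the mean from equation 16b and the simplified covariance) can be checked against the naive Q-matrix expressions (illustrative sketch with random matrices):

```julia
using LinearAlgebra

n, m, p = 8, 3, 5
Kuf = randn(m, n); Kxu = randn(p, m)
Kuu = let A = randn(m, m); A * A' + m * I end
σ², y = 0.3, randn(n)

Qxf = Kxu * (Kuu \ Kuf)
Qff = Kuf' * (Kuu \ Kuf)
Qxx = Kxu * (Kuu \ Kxu')
ΣQR = Kuu + Kuf * Kuf' ./ σ²

# predictive mean: σ⁻² Kxu ΣQR⁻¹ Kuf y
@assert Kxu * (ΣQR \ (Kuf * y)) ./ σ² ≈ Qxf * ((Qff + σ² * I) \ y)

# predictive covariance: Kxu ΣQR⁻¹ Kux
@assert Kxu * (ΣQR \ Kxu') ≈ Qxx - Qxf * ((Qff + σ² * I) \ Qxf')
```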