Missing values
Missing values are a well-known problem in statistics. The simplest approach is just to delete all data cases that have missing values. Another approach is to repair things by filling in reasonable values. This is called imputation. Imputation strategies can be very sophisticated (and complex).

Statistical tools often have direct support for representing missing values. E.g. the R language has NA (not available), and the modeling language GAMS also has NA. Python, however, has no explicit support for missing values [2]. By convention, the special floating point value NaN (Not a Number) is used to indicate missing values in floating point data. A real disadvantage is that NaN is not available for data types other than floating point numbers. Alternatively, the object None can be used to indicate missing values, but as None is a Python object, that approach often carries a performance penalty. The numpy library has some facilities to deal with missing data through masked arrays (this is really different from NA in R and GAMS).
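To illustrate these conventions, here is a small numpy example (just an illustration, not part of the model code below):

import numpy as np

x = np.array([1.0, np.nan, 3.0])
print(np.mean(x))        # nan: NaN propagates through ordinary computations
print(np.nanmean(x))     # 2.0: NaN-aware variants skip the missing entry

# numpy's masked-array facility marks missing cells explicitly
xm = np.ma.masked_invalid(x)
print(xm.mean())         # 2.0: masked entries are ignored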
Imputing missing values in a covariance matrix
In [1] a semi-definite programming (SDP) model is proposed to deal with a covariance matrix with some missing values by imputation. The constraint to be added is that the covariance matrix should remain positive semi-definite (PSD). In theory a covariance matrix is always PSD, but in practice it can happen that it is not. The resulting model is stated as:
SDP Model: Imputation of Covariance Matrix
\[\begin{align} \text{minimize}\>& 0\\ \text{subject to}\>&\color{darkred}{\Sigma}_{i,j} = \widetilde{\color{darkblue}{\Sigma}}_{i,j} && (i,j)\notin \color{darkblue}M\\ & \color{darkred}{\Sigma} \succeq 0 \end{align} \]
The notation is taken from [1]. Here \(\widetilde{\Sigma}\) is the (original) covariance matrix with missing data in locations \((i,j)\in M\). The variable \(\Sigma\) is the new covariance matrix with missing data filled in such that \(\Sigma\) is positive semi-definite. This last condition is denoted by \(\Sigma \succeq 0\). In this model there is no objective, as indicated by minimizing zero.
CVXPY implementation
There is no code provided for this model in [1]. So let me give it a try. CVXPY does not have good support for things like \(\forall (i,j) \notin M\). I can see two approaches:
- Expand the constraint into scalar form. Essentially, a DIY approach.
- Use a binary data matrix \(M_{i,j} \in \{0,1\}\) indicating the missing values and write \[(e\cdot e^T-M) \circ \Sigma = \widetilde{\Sigma}_0\] where \(\circ\) is elementwise multiplication (a.k.a. Hadamard product), \(e\) is a column vector of ones of appropriate size, and \(\widetilde{\Sigma}_0\) is \(\widetilde{\Sigma}\) but with NaN's replaced by zeros.
In addition, let's add a regularizing objective: minimize sum of squares of \(\Sigma_{i,j}\).
The Python code for these two models is:
import numpy as np
import pandas as pd
import cvxpy as cp
#------------ data ----------------
cov = np.array([
[ 0.300457, -0.158889, 0.080241, -0.143750, 0.072844, -0.032968, 0.077836, 0.049272],
[-0.158889, 0.399624, np.nan, 0.109056, 0.082858, -0.045462, -0.124045, -0.132096],
[ 0.080241, np.nan, np.nan, -0.031902, -0.081455, 0.098212, 0.243131, 0.120404],
[-0.143750, 0.109056, -0.031902, 0.386109, -0.058051, 0.060246, 0.082420, 0.125786],
[ 0.072844, 0.082858, -0.081455, -0.058051, np.nan, np.nan, -0.119530, -0.054881],
[-0.032968, -0.045462, 0.098212, 0.060246, np.nan, 0.400641, 0.051103, 0.007308],
[ 0.077836, -0.124045, 0.243131, 0.082420, -0.119530, 0.051103, 0.543407, 0.121709],
[ 0.049272, -0.132096, 0.120404, 0.125786, -0.054881, 0.007308, 0.121709, 0.481395]
])
print("Covariance data with NaNs")
print(pd.DataFrame(cov))
M = 1*np.isnan(cov)
print("M (indicator for missing values)")
print(pd.DataFrame(M))
dim = np.shape(cov)
n = dim[0]
#----------- model 1 -----------------
Sigma = cp.Variable(dim, symmetric=True)
prob = cp.Problem(
    cp.Minimize(cp.sum_squares(Sigma)),
    [Sigma[i, j] == cov[i, j] for i in range(n) for j in range(n) if M[i, j] == 0] +
    [Sigma >> 0]
)
prob.solve(solver=cp.SCS,verbose=True)
print("Status:",prob.status)
print("Objective:",prob.value)
print(pd.DataFrame(Sigma.value))
#----------- model 2 -----------------
e = np.ones((n,1))
cov0 = np.nan_to_num(cov,copy=True)
prob2 = cp.Problem(
    # cp.Minimize(cp.trace(Sigma.T @ Sigma)),  <--- not recognized as convex
    cp.Minimize(cp.norm(Sigma, "fro")**2),
    [cp.multiply(e @ e.T - M, Sigma) == cov0,
     Sigma >> 0]
)
prob2.solve(solver=cp.SCS,verbose=True)
print("Status:",prob2.status)
print("Objective:",prob2.value)
print(pd.DataFrame(Sigma.value))
Notes:
- Model 1 has a (long) list of scalar constraints. The objective is \[\min\>\sum_{i,j} \Sigma_{i,j}^2\] Sorry for the possible confusion between the symbols for summation and covariance.
- CVXPY uses the notation Sigma >> 0 to indicate \(\Sigma \succeq 0\) (i.e. \(\Sigma\) should be positive semi-definite).
- We added the condition that \(\Sigma\) should be symmetric in the variable statement. This seems to be needed: without it, the solver may return a non-symmetric matrix. I suspect that in that case the matrix \(0.5(\Sigma+\Sigma^T)\), rather than \(\Sigma\) itself, is required to be positive semi-definite.
- Model 2 is an attempt to use matrix notation. The objective can be stated as \[\min\>\mathbf{tr}(\Sigma^T\Sigma)\] but that is not recognized as being convex. As an alternative I used the (squared) Frobenius norm: \[||A||_F =\sqrt{ \sum_{i,j} a_{i,j}^2}\]
- The function np.nan_to_num converts NaN values to zeros.
- The function cp.multiply performs elementwise multiplication (as opposed to matrix multiplication).
- I don't think we can easily pass only the upper-triangular part of the covariance matrix to the solver. For large problems this would save some effort (CPU time and memory).
- In a traditional optimization model we would have just \(|M|\) decision variables (corresponding to the missing values). Here, in the scalar model, we have \(n^2\) variables and \(n^2-|M|\) constraints. A sketch of such a smaller formulation is shown below.
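As an illustration of that last point, here is one way such a formulation could be coded (a sketch of my own, not taken from [1], reusing cov0, M and n from the code above):

# Sketch: only the missing (upper-triangular) entries are decision variables.
miss = [(i, j) for i in range(n) for j in range(i, n) if M[i, j]]
x = cp.Variable(len(miss))

S = cp.Constant(cov0)                     # known entries, NaNs replaced by 0
for k, (i, j) in enumerate(miss):
    E = np.zeros((n, n))
    E[i, j] = 1
    if i != j:
        E[j, i] = 1                       # keep the matrix symmetric
    S = S + x[k] * E

prob3 = cp.Problem(cp.Minimize(cp.sum_squares(x)), [S >> 0])
prob3.solve(solver=cp.SCS)
print(pd.DataFrame(S.value))

Note that this objective only penalizes the missing entries themselves (off-diagonal entries are counted once instead of twice), so the filled-in values may differ slightly from those of models 1 and 2.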
The results of models 1 and 2 are:
Covariance data with NaNs
0 1 2 3 4 5 6 \
0 0.300457 -0.158889 0.080241 -0.143750 0.072844 -0.032968 0.077836
1 -0.158889 0.399624 NaN 0.109056 0.082858 -0.045462 -0.124045
2 0.080241 NaN NaN -0.031902 -0.081455 0.098212 0.243131
3 -0.143750 0.109056 -0.031902 0.386109 -0.058051 0.060246 0.082420
4 0.072844 0.082858 -0.081455 -0.058051 NaN NaN -0.119530
5 -0.032968 -0.045462 0.098212 0.060246 NaN 0.400641 0.051103
6 0.077836 -0.124045 0.243131 0.082420 -0.119530 0.051103 0.543407
7 0.049272 -0.132096 0.120404 0.125786 -0.054881 0.007308 0.121709
7
0 0.049272
1 -0.132096
2 0.120404
3 0.125786
4 -0.054881
5 0.007308
6 0.121709
7 0.481395
M (indicator for missing values)
0 1 2 3 4 5 6 7
0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0
2 0 1 1 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 1 1 0 0
5 0 0 0 0 1 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
----------------------------------------------------------------------------
SCS v2.1.1 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 160
eps = 1.00e-04, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 10, rho_x = 1.00e-03
Variables n = 37, constraints m = 160
Cones: primal zero / dual free vars: 58
soc vars: 66, soc blks: 1
sd vars: 36, sd blks: 1
Setup time: 9.69e-03s
----------------------------------------------------------------------------
Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
0| 4.05e+19 7.57e+19 1.00e+00 -3.12e+19 1.92e+20 1.21e+20 1.53e-02
40| 2.74e-10 1.01e-09 4.51e-11 1.71e+00 1.71e+00 1.96e-17 1.88e-02
----------------------------------------------------------------------------
Status: Solved
Timing: Solve time: 1.89e-02s
Lin-sys: nnz in L factor: 357, avg solve time: 1.54e-06s
Cones: avg projection time: 1.52e-04s
Acceleration: avg step time: 1.66e-05s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 4.7898e-17, dist(y, K*) = 1.5753e-09, s'y/|s||y| = 3.7338e-12
primal res: |Ax + s - b|_2 / (1 + |b|_2) = 2.7439e-10
dual res: |A'y + c|_2 / (1 + |c|_2) = 1.0103e-09
rel gap: |c'x + b'y| / (1 + |c'x| + |b'y|) = 4.5078e-11
----------------------------------------------------------------------------
c'x = 1.7145, -b'y = 1.7145
============================================================================
Status: optimal
Objective: 1.714544257213233
0 1 2 3 4 5 6 \
0 0.300457 -0.158889 0.080241 -0.143750 0.072844 -0.032968 0.077836
1 -0.158889 0.399624 -0.084196 0.109056 0.082858 -0.045462 -0.124045
2 0.080241 -0.084196 0.198446 -0.031902 -0.081455 0.098212 0.243131
3 -0.143750 0.109056 -0.031902 0.386109 -0.058051 0.060246 0.082420
4 0.072844 0.082858 -0.081455 -0.058051 0.135981 -0.041927 -0.119530
5 -0.032968 -0.045462 0.098212 0.060246 -0.041927 0.400641 0.051103
6 0.077836 -0.124045 0.243131 0.082420 -0.119530 0.051103 0.543407
7 0.049272 -0.132096 0.120404 0.125786 -0.054881 0.007308 0.121709
7
0 0.049272
1 -0.132096
2 0.120404
3 0.125786
4 -0.054881
5 0.007308
6 0.121709
7 0.481395
----------------------------------------------------------------------------
SCS v2.1.1 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 162
eps = 1.00e-04, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 10, rho_x = 1.00e-03
Variables n = 38, constraints m = 168
Cones: primal zero / dual free vars: 64
soc vars: 68, soc blks: 2
sd vars: 36, sd blks: 1
Setup time: 1.02e-02s
----------------------------------------------------------------------------
Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
0| 3.67e+19 5.42e+19 1.00e+00 -2.45e+19 1.28e+20 1.04e+20 9.84e-03
40| 5.85e-10 1.47e-09 7.31e-10 1.71e+00 1.71e+00 8.09e-17 1.29e-02
----------------------------------------------------------------------------
Status: Solved
Timing: Solve time: 1.31e-02s
Lin-sys: nnz in L factor: 368, avg solve time: 2.56e-06s
Cones: avg projection time: 3.03e-05s
Acceleration: avg step time: 2.45e-05s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 4.4409e-16, dist(y, K*) = 1.5216e-09, s'y/|s||y| = 4.2866e-12
primal res: |Ax + s - b|_2 / (1 + |b|_2) = 5.8496e-10
dual res: |A'y + c|_2 / (1 + |c|_2) = 1.4729e-09
rel gap: |c'x + b'y| / (1 + |c'x| + |b'y|) = 7.3074e-10
----------------------------------------------------------------------------
c'x = 1.7145, -b'y = 1.7145
============================================================================
Status: optimal
Objective: 1.714544261472336
0 1 2 3 4 5 6 \
0 0.300457 -0.158889 0.080241 -0.143750 0.072844 -0.032968 0.077836
1 -0.158889 0.399624 -0.084196 0.109056 0.082858 -0.045462 -0.124045
2 0.080241 -0.084196 0.198446 -0.031902 -0.081455 0.098212 0.243131
3 -0.143750 0.109056 -0.031902 0.386109 -0.058051 0.060246 0.082420
4 0.072844 0.082858 -0.081455 -0.058051 0.135981 -0.041927 -0.119530
5 -0.032968 -0.045462 0.098212 0.060246 -0.041927 0.400641 0.051103
6 0.077836 -0.124045 0.243131 0.082420 -0.119530 0.051103 0.543407
7 0.049272 -0.132096 0.120404 0.125786 -0.054881 0.007308 0.121709
7
0 0.049272
1 -0.132096
2 0.120404
3 0.125786
4 -0.054881
5 0.007308
6 0.121709
7 0.481395
As a sanity check we can confirm that the eigenvalues of the solution matrix are non-negative:
w, v = np.linalg.eig(Sigma.value)   # w: eigenvalues, v: eigenvectors
print(w)
[9.46355900e-01 6.34465779e-01 2.35993549e-10 5.30366506e-02
1.69999646e-01 2.29670882e-01 4.36623248e-01 3.75907704e-01]
Practice: propagation of NAs and pairwise covariances
I don't think this is a practical way of dealing with missing values. First of all, missing values in the original data will propagate into the covariance matrix. A single NA in the data leads to lots of NAs in the covariance matrix.
---- 28 PARAMETER cov effect of a single NA in the data
j1 j2 j3 j4 j5 j6 j7 j8
j1 0.300457 -0.158889 NA -0.143750 0.072844 -0.032968 0.077836 0.049272
j2 -0.158889 0.399624 NA 0.109056 0.082858 -0.045462 -0.124045 -0.132096
j3 NA NA NA NA NA NA NA NA
j4 -0.143750 0.109056 NA 0.386109 -0.058051 0.060246 0.082420 0.125786
j5 0.072844 0.082858 NA -0.058051 0.354627 -0.129507 -0.119530 -0.054881
j6 -0.032968 -0.045462 NA 0.060246 -0.129507 0.400641 0.051103 0.007308
j7 0.077836 -0.124045 NA 0.082420 -0.119530 0.051103 0.543407 0.121709
j8 0.049272 -0.132096 NA 0.125786 -0.054881 0.007308 0.121709 0.481395
This propagation is the result of the fact that any operation on NA yields NA, i.e. \[ \mathbf{NA}+ x=\mathbf{NA}\cdot x = \mathbf{NA}\] When applying the standard formula for the covariance \[cov_{j,k} = \frac{1}{N-1} \sum_i (x_{i,j}-\mu_j)(x_{i,k}-\mu_k) \] we see this propagation in action. The above results were obtained with GAMS using this exact formula. This problem is of course difficult to fix after the fact in the covariance matrix: just too much damage has been done.
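The same propagation can be reproduced quickly in Python with np.cov (a small check with random example data):

import numpy as np

rng = np.random.default_rng(123)
X = rng.normal(size=(25, 4))
X[3, 2] = np.nan                 # a single missing observation in column 2

# np.cov applies the standard formula, so the NaN contaminates
# every covariance involving column 2
print(np.cov(X, rowvar=False))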
A second problem with our SDP model is that we are not staying close to reasonable values for missing correlations. The model only looks at the PSD constraint.
Basically, the answer is that we need to look more carefully at the original data.
A simple remedy is just to throw away the records with an NA. If you have lots of data and relatively few NAs, this is a reasonable approach. However, there is a trick we can use. Instead of throwing away a whole row of observations whenever it contains an NA, we inspect pairs of columns \((j,k)\) individually. For the two columns \(j\) and \(k\), throw away the observations with NAs in either of these columns and then calculate the covariance \(cov_{j,k}\). Repeat for all combinations \((j,k)\) with \(j \lt k\). In other words, \[cov_{j,k} = \frac{1}{N_{j,k}-1} \sum_{i \in I_{j,k}} (x_{i,j}-\mu_j)(x_{i,k}-\mu_k)\] where \(I_{j,k}\) is the set of observations with both columns \(j\) and \(k\) present, \(N_{j,k}=|I_{j,k}|\), and the means are also computed over \(I_{j,k}\). This means the number of usable observations varies between different combinations \((j,k)\). R has this pairwise covariance calculation built-in.
> cov(a)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.3269524261 -0.022335922 0.024062915 0.026460677 -0.0003735916 0.0021383397 0.0544727640 -0.0008417817
[2,] -0.0223359222 0.313444259 0.036135413 0.027115454 -0.0045955942 0.0286659334 0.0558610843 0.0222590384
[3,] 0.0240629148 0.036135413 0.308443182 0.003663338 0.0014232064 -0.0158431246 0.0308769925 -0.0177244600
[4,] 0.0264606771 0.027115454 0.003663338 0.322801448 0.0057221934 0.0175051722 0.0152804438 0.0034349411
[5,] -0.0003735916 -0.004595594 0.001423206 0.005722193 0.2920368646 -0.0069213567 0.0227153919 0.0163823701
[6,] 0.0021383397 0.028665933 -0.015843125 0.017505172 -0.0069213567 0.3095935603 0.0009359271 0.0506571760
[7,] 0.0544727640 0.055861084 0.030876993 0.015280444 0.0227153919 0.0009359271 0.3635080311 0.0322080200
[8,] -0.0008417817 0.022259038 -0.017724460 0.003434941 0.0163823701 0.0506571760 0.0322080200 0.2992700098
> a[2,3]=NA
> cov(a)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.3269524261 -0.022335922 NA 0.026460677 -0.0003735916 0.0021383397 0.0544727640 -0.0008417817
[2,] -0.0223359222 0.313444259 NA 0.027115454 -0.0045955942 0.0286659334 0.0558610843 0.0222590384
[3,] NA NA NA NA NA NA NA NA
[4,] 0.0264606771 0.027115454 NA 0.322801448 0.0057221934 0.0175051722 0.0152804438 0.0034349411
[5,] -0.0003735916 -0.004595594 NA 0.005722193 0.2920368646 -0.0069213567 0.0227153919 0.0163823701
[6,] 0.0021383397 0.028665933 NA 0.017505172 -0.0069213567 0.3095935603 0.0009359271 0.0506571760
[7,] 0.0544727640 0.055861084 NA 0.015280444 0.0227153919 0.0009359271 0.3635080311 0.0322080200
[8,] -0.0008417817 0.022259038 NA 0.003434941 0.0163823701 0.0506571760 0.0322080200 0.2992700098
> cov(a,use="pairwise")
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.3269524261 -0.022335922 0.024077969 0.026460677 -0.0003735916 0.0021383397 0.0544727640 -0.0008417817
[2,] -0.0223359222 0.313444259 0.036895996 0.027115454 -0.0045955942 0.0286659334 0.0558610843 0.0222590384
[3,] 0.0240779693 0.036895996 0.311573754 0.003377392 0.0013694087 -0.0162231609 0.0310202082 -0.0180617800
[4,] 0.0264606771 0.027115454 0.003377392 0.322801448 0.0057221934 0.0175051722 0.0152804438 0.0034349411
[5,] -0.0003735916 -0.004595594 0.001369409 0.005722193 0.2920368646 -0.0069213567 0.0227153919 0.0163823701
[6,] 0.0021383397 0.028665933 -0.016223161 0.017505172 -0.0069213567 0.3095935603 0.0009359271 0.0506571760
[7,] 0.0544727640 0.055861084 0.031020208 0.015280444 0.0227153919 0.0009359271 0.3635080311 0.0322080200
[8,] -0.0008417817 0.022259038 -0.018061780 0.003434941 0.0163823701 0.0506571760 0.0322080200 0.2992700098
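In Python, pandas offers the same pairwise behavior: DataFrame.cov computes covariances pairwise, excluding the missing values for each pair of columns. A small sketch with random example data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = pd.DataFrame(rng.normal(size=(200, 8)))
a.iloc[1, 2] = np.nan            # one missing observation in column 2

# pandas drops only the rows that are missing for the particular pair of
# columns, similar to R's cov(a, use="pairwise")
print(a.cov())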
The disadvantage with pairwise covariances is that it is possible (even in theory) that the resulting covariance matrix is not positive semi-definite. We can repair this with R's nearPD function. The next section discusses some strategies we can use to achieve the same as this function.
Finding a nearby positive definite matrix
An algorithm to find a positive definite matrix \(A\) that is close to the original matrix \(A_0\) is as follows:
- Form an eigen decomposition of \(A_0\) using something like \[(\lambda,V) := \mathbf{eig}(A_0)\]
- Replace negative eigenvalues: \[\lambda_i := \max(\lambda_i, 10^{-6})\]
- Reassemble the matrix: \[A := V \cdot \mathbf{diag}(\lambda) \cdot V^T\]
We can also formulate this as a Semi-Definite Program:
Closest Positive Semi-Definite Matrix
\[\begin{align} \min\>& ||\color{darkred}A - \color{darkblue}A_0|| \\ \text{subject to}\>& \color{darkred}A \succeq 0 \end{align} \]
The eigen decomposition approach will not find the exact optimal closest matrix in terms of squared deviations (partly because we replace negative eigenvalues by \(10^{-6}\) rather than by zero). Let's compare the results of these two approaches on a small problem. Here is the Python code:
import numpy as np
import pandas as pd
import cvxpy as cp
A0 = [[ 0.130457, -0.158889, 0.080241, -0.143750, 0.072844, -0.032968, 0.077836, 0.049272],
[-0.158889, 0.229624, -0.219293, 0.109056, 0.082858, -0.045462, -0.124045, -0.132096],
[ 0.080241, -0.219293, 0.308231, -0.031902, -0.081455, 0.098212, 0.243131, 0.120404],
[-0.143750, 0.109056, -0.031902, 0.216109, -0.058051, 0.060246, 0.082420, 0.125786],
[ 0.072844, 0.082858, -0.081455, -0.058051, 0.184627, -0.129507, -0.119530, -0.054881],
[-0.032968, -0.045462, 0.098212, 0.060246, -0.129507, 0.230641, 0.051103, 0.007308],
[ 0.077836, -0.124045, 0.243131, 0.082420, -0.119530, 0.051103, 0.373407, 0.121709],
[ 0.049272, -0.132096, 0.120404, 0.125786, -0.054881, 0.007308, 0.121709, 0.311395]]
#--------------- eigenvalue approach ------------------
# eigen decomposition
u,v= np.linalg.eig(A0)
print("Eigenvalues:")
print(np.sort(u))
# replace negative eigenvalues
u = np.maximum(1.0e-6,u)
# reassemble matrix
A = v @ np.diag(u) @ v.T
print("*** Sum of squared differences:",np.sum((A-A0)**2))
#--------------- SDP model ------------------
dim = np.shape(A0)
A = cp.Variable(dim,symmetric=True)
prob = cp.Problem(cp.Minimize(cp.norm(A - A0, "fro")**2),
                  [A >> 0])
SSE = prob.solve(solver=cp.SCS, verbose=True)
print("*** Sum of squared differences:",SSE)
I used as objective \[\min\> ||A-A_0||_F^2\] which is just the sum of squares. Obviously, I could have used the cp.sum_squares function instead. The difference between the two methods, in terms of the sum of squares, is very small:
Eigenvalues:
[-0.06620331 -0.022672 0.06166647 0.0944498 0.21259723 0.30736467
0.49401996 0.90326818]
*** Sum of squared differences: 0.004897075316543839
----------------------------------------------------------------------------
SCS v2.1.1 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 104
eps = 1.00e-04, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 10, rho_x = 1.00e-03
Variables n = 38, constraints m = 104
Cones: soc vars: 68, soc blks: 2
sd vars: 36, sd blks: 1
Setup time: 8.19e-03s
----------------------------------------------------------------------------
Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
0| 4.97e+19 4.58e+19 1.00e+00 -3.53e+19 2.93e+19 6.25e+19 1.27e-02
20| 1.92e-10 3.09e-10 6.99e-10 4.90e-03 4.90e-03 6.36e-17 1.41e-02
----------------------------------------------------------------------------
Status: Solved
Timing: Solve time: 1.42e-02s
Lin-sys: nnz in L factor: 246, avg solve time: 1.96e-06s
Cones: avg projection time: 4.11e-05s
Acceleration: avg step time: 2.01e-05s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 9.6655e-10, dist(y, K*) = 1.9088e-09, s'y/|s||y| = -1.7574e-12
primal res: |Ax + s - b|_2 / (1 + |b|_2) = 1.9164e-10
dual res: |A'y + c|_2 / (1 + |c|_2) = 3.0911e-10
rel gap: |c'x + b'y| / (1 + |c'x| + |b'y|) = 6.9872e-10
----------------------------------------------------------------------------
c'x = 0.0049, -b'y = 0.0049
============================================================================
*** Sum of squared differences: 0.004896897751827667
Conclusion
The model presented in [1] is interesting: it is not quite obvious how to implement it in matrix notation for use in CVXPY (and the code shown below the example in [1] is not directly related). However, it should be mentioned that better methods are available to address the underlying problem: how to handle missing values in a covariance matrix. Pairwise construction of the covariance matrix is an easy fix. In case a non-PSD covariance matrix is created, we can repair this in different ways.
References
- Semidefinite program, https://www.cvxpy.org/examples/basic/sdp.html
- Missing Data Functionality in NumPy, https://docs.scipy.org/doc/numpy-1.10.1/neps/missing-data.html
- Covariance matrix not positive definite in portfolio models, https://yetanothermathprogrammingconsultant.blogspot.com/2018/04/covariance-matrix-not-positive-definite.html
- Higham, Nick (2002) Computing the nearest correlation matrix - a problem from finance; IMA Journal of Numerical Analysis 22, 329–343