Wednesday, June 2, 2021

Total Least Squares: nonconvex optimization

Total Least Squares (TLS) is an alternative to OLS (Ordinary Least Squares). It is a form of orthogonal regression and also addresses the Errors-in-Variables (EIV) problem.


The standard OLS model is \[\color{darkblue}y = \color{darkblue}X\color{darkred}\beta + \color{darkred}\varepsilon\] where we minimize the sum-of-squares of the residuals \[\min ||\color{darkred}\varepsilon||_2^2\] We can interpret \(\color{darkred}\varepsilon\) as the error in \(\color{darkblue}y\).
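To make this concrete, here is a minimal R sketch (on a small synthetic data set, not the data set used below): OLS without a constant term via lm() and via the normal equations gives the same \(\color{darkred}\beta\).

## OLS sketch: lm() vs. the normal equations (synthetic data, no constant term)
set.seed(123)
n <- 25
X <- cbind(x1 = runif(n, 0, 7), x2 = runif(n, 0, 11))   # synthetic explanatory data
y <- drop(X %*% c(1.3, 0.8)) + rnorm(n)                 # synthetic response with noise
beta.lm <- coef(lm(y ~ X - 1))                          # OLS via lm(), no intercept
beta.ne <- solve(t(X) %*% X, t(X) %*% y)                # OLS via the normal equations
cbind(lm = beta.lm, normal.eq = beta.ne)                # identical estimates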


In TLS, we also allow for errors in \(\color{darkblue}X\). The model becomes \[\color{darkblue}y+\color{darkred}\varepsilon=(\color{darkblue}X+\color{darkred}E)\color{darkred}\beta\] Note that we made a sign change in \(\color{darkred}\varepsilon\). This is purely aesthetic: it makes the equation look more symmetric. The objective is \[\min \> ||\left(\color{darkred}\varepsilon \> \color{darkred}E\right)||_F\] i.e. the Frobenius norm of the matrix formed by \(\color{darkred}\varepsilon\) and \(\color{darkred}E\). The Frobenius norm is just \[||A||_F=\sqrt{\sum_{i,j}a_{i,j}^2}\] We can drop the square root from the objective: the solution remains the same, and we get rid of a non-linear function whose gradient is not defined at zero. The remaining problem is a non-convex quadratic problem, which can be solved with a global MINLP solver such as Baron or with a global quadratic solver like Gurobi.
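A tiny sanity check in R (with hypothetical random values, just to show how the objective is assembled): the squared Frobenius norm of the combined error matrix \(\left(\color{darkred}\varepsilon \> \color{darkred}E\right)\) is simply the sum of all squared entries.

## the squared Frobenius norm of (epsilon E) is just a sum of squares
eps <- rnorm(5)                                   # hypothetical errors in y
E   <- matrix(rnorm(10), nrow = 5, ncol = 2)      # hypothetical errors in X
c(norm(cbind(eps, E), type = "F")^2, sum(eps^2) + sum(E^2))  # two identical numbers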


One property of TLS is that it measures orthogonal distances instead of the vertical distances used in OLS:

Orthogonal distances for TLS (picture from [2])
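To make the picture precise: for a fixed \(\color{darkred}\beta\), the best error terms for observation \(i\) solve a small minimum-norm problem, and its optimal value is exactly the squared orthogonal distance from the data point to the fitted hyperplane: \[\min_{\color{darkred}\varepsilon_i,\color{darkred}E_{i,\cdot}}\left\{\color{darkred}\varepsilon_i^2+\sum_j \color{darkred}E_{i,j}^2 \,\middle|\, \color{darkblue}y_i+\color{darkred}\varepsilon_i=\sum_j\left(\color{darkblue}X_{i,j}+\color{darkred}E_{i,j}\right)\color{darkred}\beta_j\right\} = \frac{\left(\color{darkblue}y_i-\sum_j \color{darkblue}X_{i,j}\color{darkred}\beta_j\right)^2}{1+\sum_j\color{darkred}\beta_j^2}\] So the TLS objective is the sum of squared orthogonal distances, while OLS minimizes the numerator alone.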


Usually, linear TLS is solved using a Singular Value Decomposition (SVD). But here I want to try the following non-convex model:


Non-convex Quadratic Model
\[\begin{align}\min&\sum_i \color{darkred}\varepsilon_i^2+ \sum_{i,j}\color{darkred}E_{i,j}^2\\ & \color{darkblue}y_i + \color{darkred}\varepsilon_i = \sum_j \left(\color{darkblue}X_{i,j}+\color{darkred}E_{i,j}\right) \color{darkred}\beta_j && \forall i\end{align}\]
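For comparison, the classical SVD-based solution is only a few lines of R. This is a sketch (the function name tls.svd is mine) of the textbook recipe: take the right singular vector of the augmented matrix \(\left(\color{darkblue}X\;\color{darkblue}y\right)\) that belongs to the smallest singular value, and form \(\color{darkred}\beta\) from it.

## SVD-based TLS sketch (model without constant term): y + eps = (X+E)*beta
tls.svd <- function(X, y) {
  C <- cbind(X, y)            # augmented data matrix (X y)
  p <- ncol(C)
  v <- svd(C)$v[, p]          # right singular vector of the smallest singular value
  -v[1:(p - 1)] / v[p]        # beta (assumes the last component is nonzero)
}

Applied to the data frame of appendix 2, tls.svd(as.matrix(df[, c("X1", "X2")]), df$Y) should reproduce the same \(\color{darkred}\beta\) as the optimization model below.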


Example


Consider the small generated data set:


----     51 PARAMETER data  

              y          x1          x2

i1     5.658390    0.497808    3.435376
i2    10.206082    1.545745    9.413754
i3    11.735484    1.653134    9.476257
i4    10.994595    2.886785    9.694459
i5     8.178234    2.353039   10.770926
i6    10.586022    2.768120    5.760277
i7     4.481738    2.063961    2.004087
i8     7.495135    3.542960    3.440644
i9    10.959411    2.174741    6.331947
i10    4.809324    2.802416    1.620228
i11   13.290220    3.406511   10.506181
i12    9.774514    3.560376    5.563656
i13   12.084065    3.403917    8.248807
i14    7.992731    4.481498    2.519736
i15   12.111285    3.996363    7.249193
i16   14.487092    3.970683    9.410644
i17   11.062400    3.734251    7.364956
i18    7.592865    4.753497    4.695553
i19   11.244580    3.445085    6.214768
i20    7.837413    6.782433    6.367412
i21   12.199275    4.144486    7.570009
i22   13.872630    5.454476   10.463741
i23   10.461876    5.057793    4.034925
i24   14.713012    5.672384    7.914547
i25   10.751043    4.185621    7.648005


I tried the non-convex quadratic optimization model using GAMS with different global MINLP and QCP solvers. Most solvers found the optimal solution quite quickly but had enormous problems proving optimality. 

The solution for the model without a constant term looks like:


----     83 VARIABLE beta.L  parameter to estimate

x1 1.282931,    x2 0.844328

When we use the R model (reproduced in appendix 2 below), we need to explicitly remove the constant term from the model (that is the -1 in the formula). If we had kept the (implicit) constant term, we would have seen:

tls(Y ~ X1 + X2, data = df)
Error in tls(Y ~ X1 + X2, data = df) :
  Using total least squares requires model without intercept.


The model with the constant term removed produces:

> tls(Y ~ X1 + X2 - 1, data = df)

$coefficient
       X1        X2 
1.2829314 0.8443277 

$confidence.interval
   2.5% lower bound 97.5% upper bound
X1        0.6508868          1.914976
X2        0.5248924          1.163763

$sd.est
       X1        X2 
0.3224776 0.1629802 


Good to see we agree on the solution.
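As an extra consistency check (a sketch using the data frame df from appendix 2 and the orthogonal-distance expression above), the optimal objective value of the quadratic model should equal the sum of squared residuals divided by \(1+||\color{darkred}\beta||^2\):

b <- c(1.2829314, 0.8443277)          # the TLS estimate found above
X <- as.matrix(df[, c("X1", "X2")])
r <- df$Y - drop(X %*% b)             # ordinary (vertical) residuals
sum(r^2) / (1 + sum(b^2))             # should match the optimal objective z of the GAMS model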

I still have a few questions and remarks about this. 

  • What if we want a constant term? This is often modeled by a data column of ones. It seems to me there should be no error in this column.
  • In OLS we may add dummy variables to indicate a structural break. This is encoded by a data column with 0's and 1's. Like the constant term, there is no measurement error involved here.
  • These situations are easy to handle in the optimization model: simply fix the corresponding entries of \(\color{darkred}E\) to zero. How to handle this with a singular value decomposition-based TLS algorithm is described in [3].
  • This is a good benchmark model for non-convex MINLP/QCP solvers: small but very difficult.
  • The optimization model confirms my understanding of the underlying mathematical model. Such a model is often more precise and succinct than a verbal description.
  • The TLS model measures orthogonal distances, which depend on the scale of the variables (columns). One could preprocess the data with a normalization step (a small sketch follows this list).
  • Statistics is full of interesting optimization problems.
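Here is that normalization sketch in R (tls.scaled is a hypothetical helper of mine, reusing tls.svd from the earlier sketch): scale every column to unit standard deviation, estimate, and translate the coefficients back to the original units. Unlike OLS, where this round trip returns the same estimates, it generally changes the TLS solution.

## normalization sketch: unit-sd columns, TLS, then undo the scaling
tls.scaled <- function(X, y) {
  sx <- apply(X, 2, sd)                          # column scales for X
  sy <- sd(y)                                    # scale for y
  bs <- tls.svd(sweep(X, 2, sx, "/"), y / sy)    # TLS on the scaled data
  bs * sy / sx                                   # coefficients in the original units
}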

References


  1. Total least squares, https://en.wikipedia.org/wiki/Total_least_squares
  2. Ivan Markovsky, Sabine Van Huffel, Overview of total least squares methods, Signal Processing, Volume 87, Issue 10, October 2007, Pages 2283-2302, https://eprints.soton.ac.uk/263855/1/tls_overview.pdf
  3. P. de Groen, An Introduction to Total Least Squares, https://arxiv.org/pdf/math/9805076.pdf

Appendix 1: GAMS model


$ontext

   Total Least Squares as optimization problem

   Ivan Markovsky, Sabine Van Huffel, Overview of total least squares methods,
   Signal Processing, Volume 87, Issue 10, October 2007, Pages 2283-2302

   compare to R function tls (package tls)

$offtext

set
   i     'observations'           /i1*i25/
   j0    'statistical variables'  /y,x1*x2/
   j(j0) 'explanatory variables'  /x1*x2/
;


table data(i,j0)
           Y         X1       X2
i1    5.658390  0.4978076  3.435376
i2   10.206082  1.5457447  9.413754
i3   11.735484  1.6531337  9.476257
i4   10.994595  2.8867848  9.694459
i5    8.178234  2.3530392 10.770926
i6   10.586022  2.7681198  5.760277
i7    4.481738  2.0639606  2.004087
i8    7.495135  3.5429598  3.440644
i9   10.959411  2.1747406  6.331947
i10   4.809324  2.8024155  1.620228
i11  13.290220  3.4065109 10.506181
i12   9.774514  3.5603761  5.563656
i13  12.084065  3.4039173  8.248807
i14   7.992731  4.4814979  2.519736
i15  12.111285  3.9963628  7.249193
i16  14.487092  3.9706833  9.410644
i17  11.062400  3.7342514  7.364956
i18   7.592865  4.7534969  4.695553
i19  11.244580  3.4450848  6.214768
i20   7.837413  6.7824328  6.367412
i21  12.199275  4.1444857  7.570009
i22  13.872630  5.4544764 10.463741
i23  10.461876  5.0577928  4.034925
i24  14.713012  5.6723841  7.914547
i25  10.751043  4.1856209  7.648005
;


option decimals=6;
display data;

parameter
  y(i)
  X(i,j)
;
y(i) = data(i,'y');
X(i,j) = data(i,j);

variables
   beta(j)     'parameter to estimate'
   epsilon(i)  'error in y'
   E(i,j)      'error in X'
   z           'objective variable'
;

equations
   obj         'objective'
   fit(i)      'equation to fit'
;

obj..    z =e= sum(i, sqr(epsilon(i))) + sum((i,j), sqr(E(i,j)));
fit(i).. y(i) + epsilon(i) =e= sum(j, (X(i,j)+E(i,j))*beta(j));

beta.lo(j) = -10; beta.up(j) = 10;
epsilon.lo(i) = -10; epsilon.up(i) = 10;
E.lo(i,j) = -10; E.up(i,j) = 10;

model tls /all/;
option optcr=0, threads=8, reslim=1000, minlp=baron;
solve tls using minlp minimizing z;

display beta.l;



Appendix 2: R code

df <- read.table(text="
id      Y        X1       X2
i1    5.658390  0.4978076  3.435376
i2   10.206082  1.5457447  9.413754
i3   11.735484  1.6531337  9.476257
i4   10.994595  2.8867848  9.694459
i5    8.178234  2.3530392 10.770926
i6   10.586022  2.7681198  5.760277
i7    4.481738  2.0639606  2.004087
i8    7.495135  3.5429598  3.440644
i9   10.959411  2.1747406  6.331947
i10   4.809324  2.8024155  1.620228
i11  13.290220  3.4065109 10.506181
i12   9.774514  3.5603761  5.563656
i13  12.084065  3.4039173  8.248807
i14   7.992731  4.4814979  2.519736
i15  12.111285  3.9963628  7.249193
i16  14.487092  3.9706833  9.410644
i17  11.062400  3.7342514  7.364956
i18   7.592865  4.7534969  4.695553
i19  11.244580  3.4450848  6.214768
i20   7.837413  6.7824328  6.367412
i21  12.199275  4.1444857  7.570009
i22  13.872630  5.4544764 10.463741
i23  10.461876  5.0577928  4.034925
i24  14.713012  5.6723841  7.914547
i25  10.751043  4.1856209  7.648005",
header=T)

library(tls)
tls(Y ~ X1 + X2 - 1, data = df)

3 comments:

  1. I tried the model with some solvers, with the following results:

    Excel Solver GRG nonlinear: 5 seconds to optimality
    Bonmin via OpenSolver: < 1 second to optimality
    Couenne via OpenSolver: Finds optimal solution in < 1 second, but then very slow to converge towards proving optimality
    Octeract: Finds optimal solution in presolve (< 3.5 seconds), but then very slow to converge towards proving optimality

    In each case, I started with all variables set to zero.

    Replies
    1. Excel/GRG and Bonmin are local solvers, so "optimal" means locally optimal. Proving (global) optimality is surprisingly difficult for this small model.

  2. That's why I tried Octeract, which claims "guaranteed global optimal solutions for any complex optimisation problem, turning insights into decisions in no time". After 10 hours, Octeract had bounds on the objective function of 18.95 and 27.84, with little change in the lower bound in the last hour. As you say, surprisingly difficult for such a small problem.
