## Monday, April 15, 2024

### LP in statistics: The Dantzig Selector

Lots of statistical procedures are based on an underlying optimization problem. Least squares regression and maximum likelihood estimation are two obvious examples. In a few cases, linear programming is used. Some examples are:

- Least absolute deviation (LAD) regression [1]
- Chebyshev regression [2]
- Quantile regression [3]

Here is another regression example that uses linear programming.

We want to estimate a sparse vector $$\color{darkred}\beta$$ from the linear model $$\color{darkblue}y=\color{darkblue}X\color{darkred}\beta+\color{darkred}e$$ where the number of observations $$n$$ (rows in $$\color{darkblue}X$$) is (much) smaller than the number of coefficients $$p$$ to estimate (columns in $$\color{darkblue}X$$) [4]: $$p \gg n$$. This is an alternative to the well-known Lasso method [5].

In traditional regression applications, it was safe to assume or require $$n\gt p$$. Nowadays, we see more and more very large data sets where this does not hold anymore. Think machine learning and data science. That means we need methods that can handle these cases.

The method, for a given threshold $$\color{darkblue}\delta$$ (a tuning parameter), will calculate:

Dantzig Selector
\begin{align}\min \>& ||\color{darkred}\beta||_1 \\ \text{subject to } & ||\color{darkblue}X^T(\color{darkblue}y-\color{darkblue}X\color{darkred}\beta)||_\infty \le \color{darkblue}\delta \end{align}

Here, $$||v||_1 = \sum_i |v_i|$$ and $$||v||_\infty = \max_i |v_i|$$. The resulting model is an LP.

The name Dantzig Selector is a reference to George Dantzig, the inventor of the Simplex method for linear programming. "Selector" indicates this method is about best subset variable selection: select a subset of the $$p$$ independent variables.

Note that if $$\color{darkblue}\delta=0$$ we solve a standard OLS (Ordinary Least Squares) problem. The constraint becomes: $$\color{darkblue}X^T(\color{darkblue}y-\color{darkblue}X\color{darkred}\beta)=0$$ which corresponds to the so-called normal equations: $$\color{darkblue}X^T \color{darkblue}X\color{darkred}\beta=\color{darkblue}X^T \color{darkblue}y$$. This system of linear equations, when solved for $$\color{darkred}\beta$$, gives us a least squares solution.
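This reduction is easy to check numerically. The sketch below (NumPy, with synthetic data of my own choosing) solves the normal equations directly and confirms that the result coincides with a least-squares fit:

```python
import numpy as np

# With delta = 0 the Dantzig constraint X^T (y - X beta) = 0 is exactly
# the system of normal equations, so its solution is the OLS estimate.
rng = np.random.default_rng(42)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

# Solve the normal equations X^T X beta = X^T y directly ...
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
# ... and compare with the least-squares solution.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_ne, beta_ls)
# The constraint residual X^T (y - X beta) indeed vanishes:
assert np.allclose(X.T @ (y - X @ beta_ne), 0)
```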

The intuition here is that $$\color{darkblue}\delta$$ models the trade-off between the full OLS estimate and a sparser solution.

Note that the more popular Lasso approach creates an unconstrained model using a penalty term:

Lasso
\begin{align}\min\>&||\color{darkblue}y-\color{darkblue}X\color{darkred}\beta||_2^2 + \color{darkblue}\lambda ||\color{darkred}\beta||_1 \end{align}
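To make the penalized form concrete: the objective above can be minimized by proximal gradient descent. Below is a minimal ISTA sketch (the function name and step-size choice are mine; note that packages such as glmnet scale the squared-error term by the sample size, so their lambda lives on a different scale):

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize ||y - X beta||_2^2 + lam * ||beta||_1 by proximal
    gradient descent (ISTA). Step size 1/L with L the Lipschitz
    constant of the smooth part, 2 * lambda_max(X^T X)."""
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)   # gradient of the squared loss
        z = beta - grad / L               # plain gradient step
        # proximal step: soft-thresholding induced by the L1 penalty
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0)
    return beta

# Synthetic demo: a sparse true coefficient vector plus noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = X @ np.array([1.5, 0, 0, -2.0, 0]) + 0.1 * rng.standard_normal(30)
print(np.round(lasso_ista(X, y, lam=5.0), 3))  # larger lam -> sparser beta
```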

To understand what the function dantzig.delta in the R package GDSARM is doing, I tried to replicate their example:

```
> library(GDSARM)
   V1 V2 V3 V4 V5 V6 V7    V8
1   1  1 -1  1  1  1 -1 6.058
2   1 -1  1  1  1 -1 -1 4.733
3  -1  1  1  1 -1 -1 -1 4.625
4   1  1  1 -1 -1 -1  1 5.899
5   1  1 -1 -1 -1  1 -1 7.000
6   1 -1 -1 -1  1 -1  1 5.752
7  -1 -1 -1  1 -1  1  1 5.682
8  -1 -1  1 -1  1  1 -1 6.607
9  -1  1 -1  1  1 -1  1 5.818
10  1 -1  1  1 -1  1  1 5.917
11 -1  1  1 -1  1  1  1 5.863
12 -1 -1 -1 -1 -1 -1 -1 4.809
> #scale and center X and y
> scaleX = base::scale(X, center= TRUE, scale = TRUE)
> scaleY = base::scale(Y, center= TRUE, scale = FALSE)
>
> maxDelta = max(abs(t(scaleX)%*%matrix(scaleY, ncol=1)))
> # Dantzig Selector on 4 equally spaced delta values between 0 and maxDelta
> dantzig.delta(scaleX, scaleY, delta = seq(0,maxDelta,length.out=4))
                 V1         V2         V3         V4         V5         V6         V7
0                0.17016091 0.1534495 -0.1283823 -0.2695593 0.07824791 0.4779302 0.09565567
1.75241074956335 0.01085084 0.0000000  0.0000000 -0.1102492 0.00000000 0.3186201 0.00000000
3.5048214991267  0.00000000 0.0000000  0.0000000  0.0000000 0.00000000 0.1593101 0.00000000
5.25723224869005 0.00000000 0.0000000  0.0000000  0.0000000 0.00000000 0.0000000 0.00000000
> 
```

When we increase delta, the estimates become sparser. The value for maxDelta is derived from plugging in $$\color{darkred}\beta=0$$ in $$||\color{darkblue}X^T(\color{darkblue}y-\color{darkblue}X\color{darkred}\beta)||_\infty$$. This gives $$\mathtt{maxDelta}:= ||\color{darkblue}X^T\color{darkblue}y||_\infty$$ and corresponds to the last row in the results. Note that this data set has $$n \gt p$$, so it is not really the original target for this method.
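The maxDelta computation is easy to replicate outside R. Here is a NumPy sketch using the data from the session above:

```python
import numpy as np

# The design and response from the GDSARM example.
data = np.array([
    [ 1,  1, -1,  1,  1,  1, -1, 6.058],
    [ 1, -1,  1,  1,  1, -1, -1, 4.733],
    [-1,  1,  1,  1, -1, -1, -1, 4.625],
    [ 1,  1,  1, -1, -1, -1,  1, 5.899],
    [ 1,  1, -1, -1, -1,  1, -1, 7.000],
    [ 1, -1, -1, -1,  1, -1,  1, 5.752],
    [-1, -1, -1,  1, -1,  1,  1, 5.682],
    [-1, -1,  1, -1,  1,  1, -1, 6.607],
    [-1,  1, -1,  1,  1, -1,  1, 5.818],
    [ 1, -1,  1,  1, -1,  1,  1, 5.917],
    [-1,  1,  1, -1,  1,  1,  1, 5.863],
    [-1, -1, -1, -1, -1, -1, -1, 4.809],
])
X, y = data[:, :7], data[:, 7]

# Center and scale X (R's scale() uses the n-1 standard deviation);
# center y, exactly as in the R session.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
ys = y - y.mean()

# maxDelta = ||X^T y||_inf: the smallest delta at which beta = 0 is feasible.
max_delta = np.abs(Xs.T @ ys).max()
print(max_delta)  # about 5.257, matching the last row of the R output
```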

It helps me better understand statistical techniques when I reimplement the underlying optimization problem, rather than just running a black-box function. The GAMS model (listed in the appendix) is not very complex. We have to linearize absolute values in two places: in the objective and in the constraint. In this case, things are convex, so linearizing the absolute values is not very difficult.

The objective is $$\min \sum_j |\color{darkred}\beta_j|$$, which can be modeled as \begin{align}\min &\sum_j \color{darkred}b_j\\ & \color{darkred}b_j \ge -\color{darkred}\beta_j\\&\color{darkred}b_j\ge\color{darkred}\beta_j \\ & \color{darkred}b_j\ge 0\end{align}

The constraint $$||\color{darkred}v||_\infty \le \color{darkblue}\delta$$ can be written as $$-\color{darkblue}\delta \le\color{darkred}v_j \le \color{darkblue}\delta$$ for all $$j$$.
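Both linearizations can be sanity-checked on a tiny made-up problem with scipy.optimize.linprog (the instance is purely illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: minimize |x1| + |x2| subject to x1 + 2*x2 >= 2.
# LP variables are ordered [x1, x2, b1, b2], with b_j >= |x_j|
# enforced by the two linear bounds b_j >= x_j and b_j >= -x_j.
c = [0, 0, 1, 1]                       # objective: b1 + b2
A_ub = [
    [-1, -2,  0,  0],                  # x1 + 2*x2 >= 2
    [ 1,  0, -1,  0],                  # x1 <= b1
    [-1,  0, -1,  0],                  # -x1 <= b1
    [ 0,  1,  0, -1],                  # x2 <= b2
    [ 0, -1,  0, -1],                  # -x2 <= b2
]
b_ub = [-2, 0, 0, 0, 0]
bounds = [(None, None), (None, None), (0, None), (0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.fun, res.x[:2])  # optimal objective is 1, attained at x = (0, 1)
```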

So my implementation is:

Dantzig Selector Implementation
\begin{align}\min & \sum_j \color{darkred}b_j\\ & \color{darkred}b_j \ge -\color{darkred}\beta_j \\ & \color{darkred}b_j \ge \color{darkred}\beta_j \\ & \color{darkred}r_i = \color{darkblue}y_i - \sum_j \color{darkblue}X_{i,j}\cdot \color{darkred}\beta_j \\ & \color{darkred}{Xr}_j = \sum_i \color{darkblue}X_{i,j} \cdot \color{darkred}r_i \\ & \color{darkred}{Xr}_j \in [-\color{darkblue}\delta,\color{darkblue}\delta] \\& \color{darkred}b_j \ge 0 \end{align}

A disadvantage here is that I have $$\color{darkblue}X$$ twice in the model, leading to two large dense blocks of constraints. An advantage is that I don't form $$\color{darkblue}X^T\color{darkblue}X$$ which can lead to numerical issues. It is customary to center and scale the data. When I run my GAMS model, I can indeed reproduce the R results:

```
----    158 PARAMETER results  estimations

          delta          V1          V2          V3          V4          V5          V6          V7

OLS                   0.170       0.153      -0.128      -0.270       0.078       0.478       0.096
k1                    0.170       0.153      -0.128      -0.270       0.078       0.478       0.096
k2        1.752       0.011                              -0.110                   0.319
k3        3.505                                                                   0.159
k4        5.257
```

I added the OLS solution for completeness.
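As an extra cross-check, the same LP is easy to sketch in Python with scipy.optimize.linprog. Unlike the GAMS model, this sketch forms $$\color{darkblue}X^T\color{darkblue}X$$ directly, which is fine at this size:

```python
import numpy as np
from scipy.optimize import linprog

# The 12-run design and response from the R session above.
data = np.array([
    [ 1,  1, -1,  1,  1,  1, -1, 6.058],
    [ 1, -1,  1,  1,  1, -1, -1, 4.733],
    [-1,  1,  1,  1, -1, -1, -1, 4.625],
    [ 1,  1,  1, -1, -1, -1,  1, 5.899],
    [ 1,  1, -1, -1, -1,  1, -1, 7.000],
    [ 1, -1, -1, -1,  1, -1,  1, 5.752],
    [-1, -1, -1,  1, -1,  1,  1, 5.682],
    [-1, -1,  1, -1,  1,  1, -1, 6.607],
    [-1,  1, -1,  1,  1, -1,  1, 5.818],
    [ 1, -1,  1,  1, -1,  1,  1, 5.917],
    [-1,  1,  1, -1,  1,  1,  1, 5.863],
    [-1, -1, -1, -1, -1, -1, -1, 4.809],
])
X, y = data[:, :7], data[:, 7]

# Center and scale X (n-1 standard deviation, as in R's scale()); center y.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
ys = y - y.mean()
p = Xs.shape[1]

G, Xty = Xs.T @ Xs, Xs.T @ ys
I, Z = np.eye(p), np.zeros((p, p))

# LP variables stacked as [beta, b], with b_j >= |beta_j|.
c = np.r_[np.zeros(p), np.ones(p)]    # minimize sum_j b_j
A_ub = np.block([
    [ G, Z],      #  X'X beta <=  X'y + delta
    [-G, Z],      # -X'X beta <= -X'y + delta
    [ I, -I],     #  beta_j <= b_j
    [-I, -I],     # -beta_j <= b_j
])
bounds = [(None, None)] * p + [(0, None)] * p

max_delta = np.abs(Xty).max()
betas = []
for delta in np.linspace(0, max_delta, 4):
    b_ub = np.r_[Xty + delta, -Xty + delta, np.zeros(2 * p)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    betas.append(res.x[:p])
    print(round(delta, 3), np.round(res.x[:p], 3))
```

At delta = 0 this returns the OLS estimates, and at delta = maxDelta everything is zero, matching the first and last rows of the table above.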

### References

2. Linear Programming and Chebyshev Regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-chebyshev-regression.html
3. Median, quantiles and quantile regression as linear programming problems, https://yetanothermathprogrammingconsultant.blogspot.com/2021/06/median-quantiles-and-quantile.html
4. Emmanuel Candes and Terence Tao, “The Dantzig Selector: Statistical Estimation When p Is Much Larger than n.” The Annals of Statistics, vol. 35, no. 6, 2007, pp. 2313–51.
5. Robert Tibshirani, “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, 1996, pp. 267–88.

### Appendix: GAMS Model

```
$onText
The Dantzig Selector for sparse Regression

Compare results of R package GDSARM with an LP model.
$offText

* for ordering of results
set dummy /delta, OLS/;

*------------------------------------------------
* data
*------------------------------------------------

Sets
   i  'rows'
   j0 'all columns'
;

Table data(i<,j0<)
   V1 V2 V3 V4 V5 V6 V7    V8
1   1  1 -1  1  1  1 -1 6.058
2   1 -1  1  1  1 -1 -1 4.733
3  -1  1  1  1 -1 -1 -1 4.625
4   1  1  1 -1 -1 -1  1 5.899
5   1  1 -1 -1 -1  1 -1 7.000
6   1 -1 -1 -1  1 -1  1 5.752
7  -1 -1 -1  1 -1  1  1 5.682
8  -1 -1  1 -1  1  1 -1 6.607
9  -1  1 -1  1  1 -1  1 5.818
10  1 -1  1  1 -1  1  1 5.917
11 -1  1  1 -1  1  1  1 5.863
12 -1 -1 -1 -1 -1 -1 -1 4.809
;
display i,j0,data;

*------------------------------------------------
* scale and center
*------------------------------------------------

set j(j0) 'independent variables: columns for X';
j(j0) = ord(j0) < card(j0);

parameters
   mean(j0) 'column means of X'
   sd(j0)   'column standard deviations of X (n-1 version)'
   ymean    'mean of y'
   X(i,j0)  'centered and scaled independent variables'
   y(i)     'centered dependent variable'
;
mean(j) = sum(i, data(i,j))/card(i);
sd(j)   = sqrt(sum(i, sqr(data(i,j)-mean(j)))/(card(i)-1));
X(i,j)  = (data(i,j)-mean(j))/sd(j);
ymean   = sum(i, data(i,'V8'))/card(i);
y(i)    = data(i,'V8') - ymean;

*------------------------------------------------
* delta values
*------------------------------------------------

set k 'delta values' /k1*k4/;
parameters maxdelta, delta(k);
maxdelta = smax(j, abs(sum(i, X(i,j)*y(i))));
delta(k) = maxdelta*(ord(k)-1)/(card(k)-1);
display maxdelta, delta;

*------------------------------------------------
* model
*------------------------------------------------

* the Dantzig Selector constraint is
*   -delta <= Xr(j) <= delta
* that is done in the solve loop

variables
   z        'objective'
   beta(j0) 'estimates'
   r(i)     'residuals'
   Xr(j0)   'X^T*r'
;
positive variable absb(j0) 'absolute values';

equations
   obj        'objective'
   bound1(j0) 'absolute value bound (obj)'
   bound2(j0) 'absolute value bound (obj)'
   fit(i)     'linear model'
   defXr(j0)  'evaluate X^T*r'
;

obj..       z =e= sum(j, absb(j));
bound1(j).. absb(j) =g= -beta(j);
bound2(j).. absb(j) =g= beta(j);
fit(i)..    r(i) =e= y(i) - sum(j, X(i,j)*beta(j));
defXr(j)..  Xr(j) =e= sum(i, X(i,j)*r(i));

model ds /obj,fit,bound1,bound2,defXr/;

* OLS model for comparison: minimize the sum of squared residuals
variable sse 'sum of squared residuals';
equation defsse 'OLS objective';
defsse.. sse =e= sum(i, sqr(r(i)));
model ols /defsse,fit/;

*------------------------------------------------
* run regressions
*------------------------------------------------

parameter results(*,*) 'estimations';

solve ols using qcp minimizing sse;
results('OLS',j) = beta.l(j);

results(k,'delta') = delta(k);
loop(k,
   Xr.lo(j) = -delta(k);
   Xr.up(j) = delta(k);
   solve ds using lp minimizing z;
   results(k,j) = beta.l(j);
);
display results;
```