Lots of statistical procedures are based on an underlying optimization problem. Least squares regression and maximum likelihood estimation are two obvious examples. In a few cases, linear programming is used. Some examples are:
- Least absolute deviation (LAD) regression [1]
- Chebyshev regression [2]
- Quantile regression [3]
In this post, I look at another technique that leads to an LP model: the Dantzig Selector [4].

Dantzig Selector |
---|
\[\begin{align}\min \>& ||\color{darkred}\beta||_1 \\ \text{subject to } & ||\color{darkblue}X^T(\color{darkblue}y-\color{darkblue}X\color{darkred}\beta)||_\infty \le \color{darkblue}\delta \end{align}\] |
Here, \(||v||_1 = \sum_i |v_i| \) and \(||v||_\infty = \max_i |v_i|\). The resulting model is an LP.
The name Dantzig Selector is a reference to George Dantzig, the inventor of the Simplex method for linear programming. "Selector" indicates that this method is about best-subset variable selection: selecting a subset of the \(p\) independent variables.
Note that if \(\color{darkblue}\delta=0\) we solve a standard OLS (Ordinary Least Squares) problem. The constraint becomes: \(\color{darkblue}X^T(\color{darkblue}y-\color{darkblue}X\color{darkred}\beta)=0\) which corresponds to the so-called normal equations: \(\color{darkblue}X^T \color{darkblue}X\color{darkred}\beta=\color{darkblue}X^T \color{darkblue}y\). This system of linear equations, when solved for \(\color{darkred}\beta\), gives us a least squares solution.
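As a quick sanity check (a small illustration on synthetic data, not part of the GDSARM example below), we can verify in R that solving the normal equations directly reproduces the coefficients from lm():

set.seed(1)
X <- matrix(rnorm(60), nrow = 20)              # toy data: 20 observations, 3 predictors
y <- drop(X %*% c(1, 0, -2) + rnorm(20))

beta_normal <- solve(t(X) %*% X, t(X) %*% y)   # solve the normal equations X'X beta = X'y
beta_lm     <- coef(lm(y ~ X - 1))             # OLS without intercept, for comparison
cbind(beta_normal, beta_lm)                    # both columns contain the same estimates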
The intuition here is that \(\color{darkblue}\delta\) models the trade-off between the full OLS estimate and a sparser solution.
Note that the more popular Lasso approach [5] creates an unconstrained model using a penalty term:
Lasso |
---|
\[\begin{align}\min\>&||\color{darkblue}y-\color{darkblue}X\color{darkred}\beta||_2^2 + \color{darkblue}\lambda ||\color{darkred}\beta||_1 \end{align}\] |
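As an aside (a minimal sketch, not part of the original example), a Lasso fit in R could use the glmnet package. Note that glmnet scales the least-squares loss by \(1/(2n)\), so its lambda differs from the \(\color{darkblue}\lambda\) above by a constant factor:

library(glmnet)

set.seed(1)
X <- matrix(rnorm(200), nrow = 50)                  # toy data: 50 observations, 4 predictors
y <- drop(X %*% c(2, 0, 0, -1) + rnorm(50))

fit <- glmnet(X, y, alpha = 1, intercept = FALSE)   # alpha = 1 selects the Lasso (L1) penalty
coef(fit, s = 0.5)                                  # coefficients at one (arbitrary) penalty value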
To understand what the function dantzig.delta in the R package GDSARM is doing, I tried to replicate their example:
> library(GDSARM)
> data(dataHamadaWu)
> print(dataHamadaWu)
   V1 V2 V3 V4 V5 V6 V7    V8
1   1  1 -1  1  1  1 -1 6.058
2   1 -1  1  1  1 -1 -1 4.733
3  -1  1  1  1 -1 -1 -1 4.625
4   1  1  1 -1 -1 -1  1 5.899
5   1  1 -1 -1 -1  1 -1 7.000
6   1 -1 -1 -1  1 -1  1 5.752
7  -1 -1 -1  1 -1  1  1 5.682
8  -1 -1  1 -1  1  1 -1 6.607
9  -1  1 -1  1  1 -1  1 5.818
10  1 -1  1  1 -1  1  1 5.917
11 -1  1  1 -1  1  1  1 5.863
12 -1 -1 -1 -1 -1 -1 -1 4.809
> X = dataHamadaWu[,-8]
> Y = dataHamadaWu[,8]
> #scale and center X and y
> scaleX = base::scale(X, center= TRUE, scale = TRUE)
> scaleY = base::scale(Y, center= TRUE, scale = FALSE)
>
> maxDelta = max(abs(t(scaleX)%*%matrix(scaleY, ncol=1)))
> # Dantzig Selector on 4 equally spaced delta values between 0 and maxDelta
> dantzig.delta(scaleX, scaleY, delta = seq(0,maxDelta,length.out=4))
                         V1        V2         V3         V4         V5        V6         V7
0                0.17016091 0.1534495 -0.1283823 -0.2695593 0.07824791 0.4779302 0.09565567
1.75241074956335 0.01085084 0.0000000  0.0000000 -0.1102492 0.00000000 0.3186201 0.00000000
3.5048214991267  0.00000000 0.0000000  0.0000000  0.0000000 0.00000000 0.1593101 0.00000000
5.25723224869005 0.00000000 0.0000000  0.0000000  0.0000000 0.00000000 0.0000000 0.00000000
>
When we increase delta, the estimates become sparser. The value for maxDelta is derived by plugging \(\color{darkred}\beta_j=0\) into \(||\color{darkblue}X^T(\color{darkblue}y-\color{darkblue}X\color{darkred}\beta)||_\infty\). This gives: \[\mathtt{maxDelta}:= ||\color{darkblue}X^T\color{darkblue}y||_\infty\] which corresponds to the last row of the results. Note that this data set has \(n \gt p\), so it is not really the original target for this method (which was designed for \(p \gg n\)).
It helps me better understand statistical techniques when I reimplement the underlying optimization problem, rather than just running a black-box function. The GAMS model (listed in the appendix) is not very complex. We have to linearize the absolute values in two places: in the objective and in the constraint. In this case, things are convex, so linearizing the absolute values is not very difficult.
The objective is \(\min \sum_j |\color{darkred}\beta_j|\), which can be modeled as \[\begin{align}\min &\sum_j \color{darkred}b_j\\ & \color{darkred}b_j \ge -\color{darkred}\beta_j\\&\color{darkred}b_j\ge\color{darkred}\beta_j \\ & \color{darkred}b_j\ge 0\end{align}\]
The constraints \(||\color{darkred}v||_\infty \le \color{darkblue}\delta\) can be written as: \[ -\color{darkblue}\delta \le\color{darkred}v_j \le \color{darkblue}\delta\]
So my implementation is:
Dantzig Selector Implementation |
---|
\[\begin{align}\min & \sum_j \color{darkred}b_j\\ & \color{darkred}b_j \ge -\color{darkred}\beta_j \\ & \color{darkred}b_j \ge \color{darkred}\beta_j \\ & \color{darkred}r_i = \color{darkblue}y_i - \sum_j \color{darkblue}X_{i,j}\cdot \color{darkred}\beta_j \\ & \color{darkred}{Xr}_j = \sum_i \color{darkblue}X_{i,j} \cdot \color{darkred}r_i \\ & \color{darkred}{Xr}_j \in [-\color{darkblue}\delta,\color{darkblue}\delta] \\& \color{darkred}b_j \ge 0 \end{align}\] |
A disadvantage here is that I have \(\color{darkblue}X\) twice in the model, leading to two large, dense blocks of constraints. An advantage is that I avoid forming the product \(\color{darkblue}X^T\color{darkblue}X\), which can be a source of numerical issues. It is customary to center and scale the data. When I run my GAMS model, I can indeed reproduce the R results:
----    158 PARAMETER results  estimations

            delta          V1          V2          V3          V4          V5          V6          V7

OLS                     0.170       0.153      -0.128      -0.270       0.078       0.478       0.096
k1                      0.170       0.153      -0.128      -0.270       0.078       0.478       0.096
k2          1.752       0.011                              -0.110                   0.319
k3          3.505                                                                   0.159
k4          5.257
I added the OLS solution for completeness.
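For completeness, the same LP can also be solved directly from R. The sketch below is my own (it is not part of GDSARM); it uses the lpSolve package and a variable-splitting formulation \(\color{darkred}\beta=\color{darkred}u-\color{darkred}v\) with \(\color{darkred}u,\color{darkred}v\ge 0\) instead of the auxiliary variables \(\color{darkred}b_j\), and, unlike the GAMS model, it forms \(\color{darkblue}X^T\color{darkblue}X\) explicitly:

library(lpSolve)

# Dantzig selector as an LP:
#    min sum(u+v)  s.t.  -delta <= X'y - X'X(u-v) <= delta,  u, v >= 0
# lpSolve variables are nonnegative by default; beta is recovered as u - v.
dantzig_lp <- function(X, y, delta) {
  p   <- ncol(X)
  XtX <- t(X) %*% X
  Xty <- as.vector(t(X) %*% y)
  A   <- rbind(cbind(XtX, -XtX),    #  X'X (u - v) <= X'y + delta
               cbind(-XtX, XtX))    # -X'X (u - v) <= delta - X'y
  sol <- lp("min",
            objective.in = rep(1, 2 * p),
            const.mat    = A,
            const.dir    = rep("<=", 2 * p),
            const.rhs    = c(Xty + delta, delta - Xty))
  sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]
}

# with scaleX and scaleY from the R session above, e.g.
#   dantzig_lp(scaleX, scaleY, 1.7524)
# can be compared with the corresponding row of dantzig.delta's output.

Degenerate solutions (ties between equally good vertices) could in principle cause small differences in the coefficients, but the objective value should match.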
References
1. Linear Programming and LAD Regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html
2. Linear Programming and Chebyshev Regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-chebyshev-regression.html
3. Median, quantiles and quantile regression as linear programming problems, https://yetanothermathprogrammingconsultant.blogspot.com/2021/06/median-quantiles-and-quantile.html
4. Emmanuel Candes and Terence Tao, "The Dantzig Selector: Statistical Estimation When p Is Much Larger than n," The Annals of Statistics, vol. 35, no. 6, 2007, pp. 2313–51.
5. Robert Tibshirani, "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B (Methodological), vol. 58, no. 1, 1996, pp. 267–88.
Appendix: GAMS Model
$onText
The Dantzig Selector for sparse regression.
Compare results of R package GDSARM with an LP model.
$offText
* for ordering of results
set dummy /delta, OLS/;
*------------------------------------------------
* data
*------------------------------------------------
Sets i 'rows', j0 'all columns';

Table data(i<,j0<)
     V1  V2  V3  V4  V5  V6  V7      V8
  1   1   1  -1   1   1   1  -1   6.058
  2   1  -1   1   1   1  -1  -1   4.733
  3  -1   1   1   1  -1  -1  -1   4.625
  4   1   1   1  -1  -1  -1   1   5.899
  5   1   1  -1  -1  -1   1  -1   7.000
  6   1  -1  -1  -1   1  -1   1   5.752
  7  -1  -1  -1   1  -1   1   1   5.682
  8  -1  -1   1  -1   1   1  -1   6.607
  9  -1   1  -1   1   1  -1   1   5.818
 10   1  -1   1   1  -1   1   1   5.917
 11  -1   1   1  -1   1   1   1   5.863
 12  -1  -1  -1  -1  -1  -1  -1   4.809
;
display i,j0,data;
*------------------------------------------------
* scale and center
*------------------------------------------------
set j(j0) 'independent variables: columns for X';
j(j0) = ord(j0)<card(j0);

parameters
   n          'number of rows'
   mean(j0)
   sd(j0)     'standard deviation'
   scaled_data(i,j0)
;
n = card(i);
mean(j0) = sum(i,data(i,j0))/n;

* center
scaled_data(i,j0) = data(i,j0)-mean(j0);
* scale (only for X)
sd(j) = sqrt(sum(i,sqr(scaled_data(i,j)))/(n-1));
scaled_data(i,j) = scaled_data(i,j)/sd(j);
display scaled_data;

*------------------------------------------------
* extract X,y
*------------------------------------------------
parameters
   X(i,j0) 'independent variables'
   y(i)    'dependent variable'
;
X(i,j) = scaled_data(i,j);
y(i) = scaled_data(i,'v8');
display X,y;
*------------------------------------------------
* Tuning parameter
* maxdelta and delta
*------------------------------------------------
scalar MaxDelta;
MaxDelta = smax(j,abs(sum(i,X(i,j)*Y(i))));
display MaxDelta;

* create equally spaced values 0..MaxDelta
set k /k1*k4/;
parameter delta(k);
delta(k) = MaxDelta*(ord(k)-1) / (card(k)-1);
display delta;
*------------------------------------------------
* OLS regression
*------------------------------------------------
variable
   beta(j0) 'estimators'
   z        'objective'
   r(i)     'residuals'
;
equations
   fit(i) 'linear model'
   qobj   'quadratic objective'
;
qobj..   z =e= sum(i, sqr(r(i)));
fit(i).. r(i) =e= y(i) - sum(j,X(i,j)*beta(j));
model ols /qobj,fit/;
*------------------------------------------------
* Dantzig selector LP model
* we need to set bounds for
*    Xr(j) >= -delta
*    Xr(j) <= delta
* that is done in the solve loop
*------------------------------------------------
variable Xr(j0) 'X^T*r';
positive variable absb(j0) 'absolute values';

equations
   obj        'objective'
   bound1(j0) 'absolute value bound (obj)'
   bound2(j0) 'absolute value bound (obj)'
   fit(i)     'linear model'
   defXr(j0)  'evaluate X^T*r'
;
obj..       z =e= sum(j,absb(j));
bound1(j).. absb(j) =g= -beta(j);
bound2(j).. absb(j) =g= beta(j);
defXr(j)..  Xr(j) =e= sum(i,X(i,j)*r(i));
model ds /obj,fit,bound1,bound2,defXr/;
*------------------------------------------------
* run regressions
*------------------------------------------------
parameter results(*,*) 'estimations';
solve ols using qcp minimizing z;
results('OLS',j) = beta.l(j);
results(k,'delta') = delta(k);
loop(k,
   Xr.lo(j) = -delta(k);
   Xr.up(j) = delta(k);
   solve ds using lp minimizing z;
   results(k,j) = beta.l(j);
);
display results;