## Friday, November 10, 2017

### Linear Programming and Chebyshev Regression

The LAD (Least Absolute Deviation) or $$\ell_1$$ regression problem (minimize the sum of the absolute values of the residuals) is often discussed in Linear Programming textbooks: it has a few interesting LP formulations [1]. Chebyshev or $$\ell_{\infty}$$ regression is a little bit less well-known. Here we minimize the maximum (absolute) residual.

\bbox[lightcyan,10px,border:3px solid darkblue] {\begin{align}\min_{\beta}\>&\max_i \> |r_i|\\ &y-X\beta = r\end{align}}

As in [1],  $$\beta$$ are coefficients to estimate and  $$X,y$$ are data. $$r$$ are the residuals.
Some obvious and less obvious formulations are:
• Variable splitting:\begin{align}\min\>&z\\& z \ge r^+_i + r^-_i\\&r^+_i - r^-_i =y_i –\sum_j X_{i,j}\beta_j\\&r^+_i, r^-_i\ge 0\end{align} With variable splitting we use two non-negative variables $$r^+_i - r^-_i$$ to describe a value $$r_i$$ that can be positive or negative. We need to make sure that one of them is zero in order for $$r^+_i + r^-_i$$ to be equal to the absolute value $$|r_i|$$. Here we have an interesting case, as we are only sure of this for the particular index $$i$$ that has the largest absolute deviation (i.e. where $$|r_i|=z$$). In cases where $$|r_i|<z$$ we actually do not  enforce $$r^+_i \cdot r^-_i = 0$$. Indeed, when looking at the solution I see lots of cases where $$r^+_i > 0, r^-_i > 0$$. Effectively those $$r^+_i, r^-_i$$ have no clear physical interpretation. This is very different from the LAD regression formulation [1] where we require all $$r^+_i \cdot r^-_i = 0$$.
• Bounding:\begin{align}\min\>&z\\ & –z \le y_i –\sum_j X_{i,j}\beta_j \le z\end{align}Here $$z$$ can be left free or you can make it a non-negative variable (it will be non-negative automatically). Note that there are actually two constraints here: $$–z \le y_i –\sum_j X_{i,j}\beta_j$$ and $$y_i –\sum_j X_{i,j}\beta_j \le z$$. This model contains the data twice, leading to a large number of nonzero elements in the LP matrix.
• Sparse bounding: This model tries to remedy the disadvantage of the standard bounding model by introducing extra free variables $$d$$ and extra equations:\begin{align}\min\>&z\\ & –z \le d_i \le z\\& d_i = y_i –\sum_j X_{i,j}\beta_j \end{align} Note again that $$–z \le d_i \le z$$ is actually two constraints: $$–z \le d_i$$ and $$d_i \le z$$.
• Dual:\begin{align}\max\>&\sum_i y_i(d_i+e_i)\\&\sum_i X_{i,j}(d_i+e_i) = 0 \perp \beta_j\\&\sum_i (d_i-e_i)=1\\&d_i\ge 0, e_i\le 0\end{align}The estimates $$\beta$$ can be found to be the duals of equation $$X^T(d+e)=0$$.
We use the same synthetic data as in [1] with $$m=5,000$$ cases, and $$n=100$$ coefficients. Some timings with Cplex (default LP method) yield the following results (times are in seconds):

Opposed to the LAD regression example in [1], the bounding formulation is very fast here. The dual formulation remains doing very good.
##### Historical Note
The use of this minimax principle goes back to Euler (1749) [3,4].

 Leonhard Euler (1707-1783)
The term Chebyshev regression is related to the Chebyshev distance (a different name for the $$\ell_{\infty}$$ metric).

 Pafnuty Lvovich Chebyshev (1821-1894)