Peter Huber, Swiss statistician
Traditional regression, well-known and widely used, is based on a least-squares objective: deviations are penalized quadratically. A disadvantage of this approach is that large deviations (say, due to outliers) receive a very large penalty. As a result, these outliers can affect the results dramatically.
One alternative regression method that tries to remedy this is L1- or LAD-regression (LAD = Least Absolute Deviation) [1]. Here I want to discuss a method that sits somewhere between L1 regression and least squares: Huber or M-regression [2]. Huber regression uses a loss function that is quadratic for small deviations and linear for large ones, defined in a piecewise fashion. The regression problem can be stated as the following non-linear optimization problem:
NLP Model

\[\begin{align} \min_{\color{darkred}\beta,\color{darkred}\varepsilon} &\sum_i \rho(\color{darkred}\varepsilon_i) \\ &\color{darkblue}y_i = \sum_j \color{darkblue}x_{i,j}\color{darkred}\beta_j +\color{darkred}\varepsilon_i \end{align} \] where \[\rho(\color{darkred}\varepsilon_i) = \begin{cases}\color{darkred}\varepsilon^2_i & \text{if $|\color{darkred}\varepsilon_i | \le \color{darkblue}k$} \\ 2\color{darkblue}k |\color{darkred}\varepsilon_i |-\color{darkblue}k^2 & \text{otherwise}\end{cases} \]
The piecewise linear and quadratic Huber loss function \(\rho(x)\)
The picture shows the Huber loss function \(\rho(x)\) with \(\color{darkblue}k = 1.5\). For \(x \in [-1.5, 1.5]\) the loss is quadratic (following the red line), while for deviations with \(|x| > 1.5\) the penalty is linear (following the green line). The Huber function is continuous and differentiable (but not twice continuously differentiable).
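To make this concrete, evaluating the definition above at a small and at a large deviation (with \(\color{darkblue}k = 1.5\)) gives \[\rho(1) = 1^2 = 1 \qquad\text{and}\qquad \rho(3) = 2\cdot 1.5\cdot 3 - 1.5^2 = 6.75,\] compared to \(3^2 = 9\) under a pure least-squares penalty.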
Huber regression requires you to provide the parameter \(\color{darkblue}k\), sometimes called the robustification parameter. There are methods that try to determine \(\color{darkblue}k\) automatically; such methods are called adaptive or tuning-free. Here I assume \(\color{darkblue}k\) is provided exogenously.
Below I will briefly describe four different mathematical programming formulations: an NLP, an LCP, and two different QP models.
Solve as NLP
We can directly solve this as an NLP (non-linear programming problem). In GAMS this can look like:
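Below is a minimal sketch, using synthetic random data and placeholder names (the identifiers i, j, x, y, k, beta, e, and z are my own choices, not necessarily those of the original listing). The piecewise loss is written as a single expression using abs, min, and max, which is what forces the DNLP model type:

```
* Huber regression as a DNLP model: a minimal sketch with synthetic data
sets i 'observations' /i1*i25/
     j 'regressors'   /x0*x3/;

parameters x(i,j) 'independent variables'
           y(i)   'dependent variable';
x(i,j)    = uniform(-1,1);
x(i,'x0') = 1;
y(i)      = sum(j, ord(j)*x(i,j)) + normal(0,0.1);
* inject one outlier so the robust loss matters
y('i1')   = y('i1') + 10;

scalar k 'robustification parameter' /1.5/;

variables beta(j) 'coefficients to estimate'
          e(i)    'residuals'
          z       'objective';

equations obj    'Huber loss objective'
          fit(i) 'definition of the residuals';

* rho(e) = sqr(e) if |e| <= k, and 2*k*|e| - sqr(k) otherwise,
* written as one expression with abs/min/max
obj..    z =e= sum(i, sqr(min(abs(e(i)),k)) + 2*k*max(abs(e(i))-k,0));
fit(i).. y(i) =e= sum(j, x(i,j)*beta(j)) + e(i);

model huberNLP /all/;
solve huberNLP minimizing z using dnlp;
```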
Usually, non-smooth functions such as absolute values are dangerous in NLP models. To warn you about this, GAMS requires you to solve this model as a DNLP model rather than as an NLP model. In this case, the danger is limited, as the objective is convex, continuous, and differentiable. The second derivatives are not continuous, however.
This is the only version that is really intuitive. The other formulations below do work just fine, but I find them not at all obvious. If there is an intuitive interpretation, I would love to hear it.
Solve as LCP
In [3,4] it is proposed to form the first-order optimality conditions of this problem and solve the resulting LCP (linear complementarity problem). The LCP looks like:
LCP Model

\[\begin{align} \color{darkred}\lambda_i +\color{darkblue}y_i - \sum_j\color{darkblue}X_{i,j} \color{darkred}\beta_j &= \color{darkred}s^+_i - \color{darkred}s^-_i && \forall i \\ \sum_i \color{darkblue}X_{i,j}\color{darkred}\lambda_i & =0&& \forall j \\ \color{darkred}\lambda_i + \color{darkblue}k \ge 0 &\perp \color{darkred}s^+_i \ge 0 && \forall i \\ -\color{darkred}\lambda_i + \color{darkblue}k \ge 0 &\perp \color{darkred}s^-_i \ge 0 && \forall i \\ \color{darkred}s^+_i,\color{darkred}s^-_i &\ge 0 \end{align}\]
Here \(\perp\) indicates complementarity between a constraint and a variable. Note that this model is a system of equations and complementarity conditions: there is no objective.
In GAMS the equations look like:
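A minimal sketch with synthetic data and placeholder names (the same kind of data setup as in the NLP sketch); the model statement lists the equation-variable complementarity pairs:

```
* Huber regression as an LCP, solved as an MCP: a minimal sketch
sets i 'observations' /i1*i25/
     j 'regressors'   /x0*x3/;

parameters x(i,j), y(i);
x(i,j)    = uniform(-1,1);
x(i,'x0') = 1;
y(i)      = sum(j, ord(j)*x(i,j)) + normal(0,0.1);

scalar k 'robustification parameter' /1.5/;

variables lambda(i), beta(j);
positive variables splus(i), sminus(i);

equations resid(i)  'paired with lambda'
          normeq(j) 'paired with beta'
          comp1(i)  'paired with splus'
          comp2(i)  'paired with sminus';

resid(i)..  lambda(i) + y(i) - sum(j, x(i,j)*beta(j)) =e= splus(i) - sminus(i);
normeq(j).. sum(i, x(i,j)*lambda(i)) =e= 0;
comp1(i)..  lambda(i) + k =g= 0;
comp2(i)..  -lambda(i) + k =g= 0;

* equation.variable pairs define the complementarity structure
model huberMCP / resid.lambda, normeq.beta, comp1.splus, comp2.sminus /;
solve huberMCP using mcp;
```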
GAMS does not have a dedicated LCP solver, so instead we use a more general MCP (mixed complementarity problem) solver.
We know we can solve LCPs in different ways [5]. A QP formulation is the next candidate.
Solve as QP
In [6], the following convex QP formulation is proposed:
QP Model

\[\begin{align}\min\> & 0.5 \sum_i \color{darkred}\varepsilon_i^2 + \color{darkblue}k \sum_i \color{darkred}t_i \\ & -\color{darkred}t_i\le \sum_j \color{darkblue}X_{i,j} \color{darkred}\beta_j - \color{darkblue}y_i + \color{darkred}\varepsilon_i \le \color{darkred}t_i && \forall i\\ & \color{darkred}t_i\ge 0\end{align}\]
This looks a bit like normal regression with a regularization term.
This formulation can be fed into a standard QP solver.
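A minimal sketch with synthetic data and placeholder names. It follows the displayed model literally: \(\color{darkred}t_i\) is a positive variable and the sandwiched expression is written out twice; the notes below discuss both of these choices:

```
* Huber regression as a primal QP: a minimal sketch with synthetic data
sets i 'observations' /i1*i25/
     j 'regressors'   /x0*x3/;

parameters x(i,j), y(i);
x(i,j)    = uniform(-1,1);
x(i,'x0') = 1;
y(i)      = sum(j, ord(j)*x(i,j)) + normal(0,0.1);

scalar k 'robustification parameter' /1.5/;

variables beta(j), e(i), z;
positive variable t(i);

equations obj, lobnd(i), upbnd(i);

obj..      z =e= 0.5*sum(i, sqr(e(i))) + k*sum(i, t(i));
lobnd(i).. sum(j, x(i,j)*beta(j)) - y(i) + e(i) =g= -t(i);
upbnd(i).. sum(j, x(i,j)*beta(j)) - y(i) + e(i) =l= t(i);

model huberQP /all/;
solve huberQP minimizing z using qcp;
```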
Notes:
- We can make \(\color{darkred}t_i\) a free variable. The model will automatically make it non-negative. In my experience using a positive variable can help the solver a bit.
- We should get rid of the duplicate expression by introducing a new variable and a new equality constraint.
- A variant of this model [4] is based on variable splitting: \[\begin{align}\min\> & 0.5 \sum_i \color{darkred}\varepsilon_i^2 + \color{darkblue}k \sum_i \left(\color{darkred}s^+_i+\color{darkred}s^-_i\right) \\ & \color{darkred}\varepsilon_i + \sum_j \color{darkblue}X_{i,j} \color{darkred}\beta_j - \color{darkblue}y_i +\color{darkred}s^+_i - \color{darkred}s^-_i = 0 && \forall i\\ & \color{darkred}s^+_i,\color{darkred}s^-_i\ge 0\end{align}\]
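A minimal sketch of this variant, again with synthetic data and placeholder names; compared to the primal QP sketch above, only the objective and the equality constraint change:

```
* Variable-splitting variant of the QP: a minimal sketch with synthetic data
sets i 'observations' /i1*i25/
     j 'regressors'   /x0*x3/;

parameters x(i,j), y(i);
x(i,j)    = uniform(-1,1);
x(i,'x0') = 1;
y(i)      = sum(j, ord(j)*x(i,j)) + normal(0,0.1);

scalar k 'robustification parameter' /1.5/;

variables beta(j), e(i), z;
positive variables splus(i), sminus(i);

equations obj, fit(i);

obj..    z =e= 0.5*sum(i, sqr(e(i))) + k*sum(i, splus(i) + sminus(i));
fit(i).. e(i) + sum(j, x(i,j)*beta(j)) - y(i) + splus(i) - sminus(i) =e= 0;

model huberSplit /all/;
solve huberSplit minimizing z using qcp;
```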
Solve as dual QP
A dual formulation [6] is:
Dual QP Model

\[\begin{align} \min & \frac{\color{darkblue}k}{2}\sum_i \color{darkred}u_i^2 + \sum_i \color{darkblue}y_i \color{darkred}u_i \\ &\sum_i \color{darkblue}X_{i,j}\color{darkred}u_i = 0 \perp \color{darkred}\beta_j && \forall j \\ & \color{darkred}u_i \in [-1,1]\end{align}\]
The GAMS code can look like:
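A minimal sketch with synthetic data and placeholder names; I keep the equation name edual so the remark below about its marginals applies directly (the sign of the recovered coefficients may depend on the solver's dual convention):

```
* Huber regression as a dual QP: a minimal sketch with synthetic data
sets i 'observations' /i1*i25/
     j 'regressors'   /x0*x3/;

parameters x(i,j), y(i);
x(i,j)    = uniform(-1,1);
x(i,'x0') = 1;
y(i)      = sum(j, ord(j)*x(i,j)) + normal(0,0.1);

scalar k 'robustification parameter' /1.5/;

variables u(i), z;
u.lo(i) = -1;
u.up(i) =  1;

equations obj, edual(j);

obj..      z =e= 0.5*k*sum(i, sqr(u(i))) + sum(i, y(i)*u(i));
edual(j).. sum(i, x(i,j)*u(i)) =e= 0;

model huberDual /all/;
solve huberDual minimizing z using qcp;

* recover the coefficient estimates from the duals (marginals) of edual
parameter betahat(j) 'estimated coefficients';
betahat(j) = edual.m(j);
display betahat;
```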
Note that we can retrieve the estimates for \(\color{darkred}\beta\) by inspecting the duals (marginals) of equation edual.
Other tools
CVXPY and CVXR have a built-in function huber(), which allows you to do Huber regression directly from these modeling tools. In [7] a great example is used to illustrate Huber regression:
Example from [7] |
The problem description contains a small error:
The function \(\phi(u)\) should use \(|u|\) instead of \(u\): the absolute value is missing.
R has the function rlm from the MASS package.
Conclusion
The Huber regression problem, a form of robust regression, can be solved in different ways. I have presented here four different implementations (NLP, LCP, and primal and dual QP). The quadratic formulations are likely the best candidates for large problems.
It is not so obvious that we can formulate compact standard QP models for this problem. This is interesting not only for robust regression problems but also in other cases where we want to add a penalty term that is between linear and quadratic.
References
1. Linear Programming and L1 Regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html
2. Huber loss, https://en.wikipedia.org/wiki/Huber_loss
3. Michael Gertz and Stephen Wright, Object-oriented Software for Quadratic Programming, Tech. Report ANL/MCS-P891-1000, Argonne National Laboratory, Mathematics and Computer Science Division, 2001.
4. Stephen J. Wright, On reduced convex QP formulations of monotone LCPs, Mathematical Programming 90 (2001), 459–473.
5. Solving linear complementarity problems without an LCP solver, https://yetanothermathprogrammingconsultant.blogspot.com/2021/05/solving-linear-complementarity-problems.html
6. Olvi L. Mangasarian and David R. Musicant, Robust Linear and Support Vector Regression, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 9, 2000.
7. Huber Regression, https://cvxr.rbind.io/cvxr_examples/cvxr_huber-regression/