Saturday, June 12, 2021

Median, quantiles and quantile regression as linear programming problems

Quantile regression is a bit of an exotic type of regression [1,2,3]. It can be seen as a generalization of \(\ell_1\) or LAD regression [4], and just as LAD regression we can formulate and solve it as an LP.

First I want to discuss some preliminaries: how to find the median and the quantiles of a data vector \(\color{darkblue}y\). That will give us the tools to formulate the quantile regression problem as an LP. The reason for adding these preliminary steps is to develop some intuition about how Quantile Regression problems are defined. I found that most papers just "define" the underlying optimization problem, without much justification. I hope to show with these small auxiliary models how we arrive at the Quantile Regression model. Along the way, we encounter some interesting titbits. I'll discuss a few details that papers typically glance over or even skip, but I find fascinating.

Problem 1: Finding the median as an optimization problem


The (sample) median is the "middle observation".  Conceptually this can be done quite simply: sort the data and pick the middle observation. For an even number of observations, we have two middle observations. Often we use the average of these two. Here, in our models, we are a little bit more relaxed about this: allow something in between the two middle observations. Roughly, the median can be defined here as a number \(\color{darkred}m\) such that half of the data is below and half the data is above \(\color{darkred}m\). 

Suppose we have data \(\color{darkblue}y_i\). Then the following non-linear optimization problem can find the median \(\color{darkred}m\):

NLP model for finding the median
\[\min\>\sum_i \left|\color{darkblue}y_i - \color{darkred}m\right|\]


  • You may want to think a little bit about how this model indeed finds the median (or rather a median with 50% of the data below and 50% above).
  • This model looks deceptively simple. Actually, this is a nasty non-differentiable NLP problem. Many general-purpose NLP solvers may have a really difficult time with it. Most of them will not be able to establish (local) optimality. In the box below are some of the frightening messages you may encounter. Standard NLP solvers expect smoothness: functions and gradients should be continuous. Ignoring this may be a bad idea. 
  • This is a great example of why we always need to reformulate absolute values. Absolute values are wolves in sheep's clothing and they will eat you alive. 
  • Note that this model works also if the \(\color{darkblue}y_i\) are endogenous instead of just exogenous data.  
  • Why not just sort the data? This is related to the previous point. Sorting data is easy, sorting decision variables is a different thing altogether and much, much more complicated. 


----     11 PARAMETER y  data

i1 0.172,    i2 0.843,    i3 0.550,    i4 0.301,    i5 0.292


**** SOLVER STATUS     4 Terminated By Solver
**** MODEL STATUS      7 Feasible Solution
**** OBJECTIVE VALUE                0.9297


                           LOWER          LEVEL          UPPER         MARGINAL

---- VAR m                 -INF            0.3011        +INF            1.0000  NOPT
---- VAR z                 -INF            0.9297        +INF             .

  m  median
  z  objective



Scary messages from different solvers:


Conopt:
 ** Feasible solution. Convergence too slow. The change in objective
    has been less than 3.0000E-12 for 20 consecutive iterations


MINOS:
 EXIT - The current point cannot be improved.


SNOPT:
 EXIT - Current point cannot be improved.


IPOPT:
 EXIT: Restoration Failed!
 Final point is feasible: scaled constraint violation (0) is below tol (1e-08) and unscaled constraint violation (0) is below constr_viol_tol (0.0001).


Knitro:
  EXIT: Primal feasible solution estimate cannot be improved; desired accuracy
        in dual feasibility could not be achieved.


We can reformulate absolute values in a linear fashion, so we end up with an LP. There are (at least) two ways to do this:



LP model 1 for finding the median
\[\begin{align}\min&\sum_i\color{darkred}e_i \\ & \color{darkred}e_i\ge\color{darkblue}y_i -\color{darkred}m&& \forall i \\ & \color{darkred}e_i \ge -(\color{darkblue}y_i  -\color{darkred}m) && \forall i \\ &\color{darkred}e_i\ge 0\end{align}\]

or using variable splitting:

LP model 2 for finding the median
\[\begin{align}\min&\sum_i \left(\color{darkred}e^+_i+\color{darkred}e^-_i\right) \\ & \color{darkred}e^+_i-\color{darkred}e^-_i =\color{darkblue}y_i -\color{darkred}m && \forall i \\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align}\]

For the last model it may be instructive to show the errors \(\color{darkred}e^+_i,\color{darkred}e^-_i\):

----     69 VARIABLE e2.L  absolute value, formulation 2

             +           -

i1       0.129
i2                   0.542
i3                   0.249
i5       0.009

For every error \(i\) we see that only one of \(\color{darkred}e^+_i,\color{darkred}e^-_i\) will be nonzero. This is what we would expect. Note that there is one error that is zero.

These models will not have the same problems as the NLP model. They solve easily to optimality. In the next paragraphs, we will see how LP Model 2 is really the building block for Quantile Regression. 

Even though this model is just a piece of the puzzle, we can learn a lot from it. 

Problem 2: Finding a quantile as an optimization problem


We can generalize the concept of a median to quantiles. E.g., a 0.75 quantile gives us the number such that 75% of the data is below (and 25% is above).  Here is an optimization model for this:


NLP model for finding the \(\tau\)-th quantile 
\[\min\>\sum_{i|\color{darkblue}y_i\ge\color{darkred}q} \color{darkblue}\tau\left|\color{darkblue}y_i - \color{darkred}q \right| +\sum_{i|\color{darkblue}y_i\lt\color{darkred}q} (1-\color{darkblue}\tau)\left|\color{darkblue}y_i-\color{darkred}q \right| \]


This is essentially a weighted version of our median model, with weights \(\color{darkblue}\tau\) and \((1-\color{darkblue}\tau)\). We don't even try to solve this as stated. Note that the summations themselves are already difficult. However, when we take a step back we see that the first summation is over the positive parts of \(|\color{darkblue}y-\color{darkred}q|\) and the second one over the negative parts. Splitting absolute values into positive and negative parts is something we already know how to do: variable splitting


LP model for finding the \(\tau\)-th quantile 
\[\begin{align}\min&\sum_i \left( \color{darkblue}\tau \cdot  \color{darkred}e^+_i +(1-\color{darkblue}\tau) \cdot \color{darkred}e^-_i \right) \\ &\color{darkred}e^+_i -\color{darkred}e^-_i = \color{darkblue}y_i - \color{darkred}q && \forall i \\ &\color{darkred}e^-_i,\color{darkred}e^+_i\ge 0\end{align} \]


Let's try this on some random data:


----     15 PARAMETER y  data

i1  25.457,    i2  85.894,    i3  59.534,    i4  37.102,    i5  36.299,    i6  30.165,    i7  41.485,    i8  87.064
i9  16.040,    i10 55.019,    i11 99.831,    i12 62.086,    i13 99.202,    i14 78.603,    i15 21.762,    i16 67.575
i17 24.357,    i18 32.507,    i19 70.204,    i20 49.182,    i21 42.373,    i22 41.630,    i23 21.834,    i24 23.509
i25 63.020


----     15 SET t  quantile levels

0   ,    0.25,    0.5 ,    0.75,    1   


----     52 PARAMETER quantiles  Solution

0    16.040,    0.25 30.165,    0.5  42.373,    0.75 67.575,    1    99.831


Here we solved 5 LPs for quantile levels 0, 0.25, 0.5, 0.75, and finally 1.

When we do the same in R, we see identical results:



On some data sets, we may see slight differences between the optimization model and the R code. This is because R has 9 different types of quantiles! To more precisely match the optimization model use type=1:



Problem 3: Quantile regression


Finally, let's look at quantile regression. The underlying optimization model looks like [1]:


NLP model for quantile regression 
\[\begin{align}\min&\sum_{i|\color{darkred}e_i\ge 0} \color{darkblue}\tau|\color{darkred}e_i| +\sum_{i|\color{darkred}e_i\lt 0} (1-\color{darkblue}\tau) |\color{darkred}e_i| \\ & \color{darkred}e  = \color{darkblue}y-\color{darkblue}X\cdot\color{darkred}\beta\end{align}\]

We can linearize this with our familiar tools:

LP model for quantile regression 
\[\begin{align}\min&\sum_i \left( \color{darkblue}\tau \cdot\color{darkred}e^+_i + (1-\color{darkblue}\tau) \cdot \color{darkred}e^-_i \right)\\ & \color{darkred}e^+_i - \color{darkred}e^-_i = \color{darkblue}y_i-\sum_j \color{darkblue}X_{i,j}\cdot\color{darkred}\beta_j && \forall i\\ &\color{darkred}e^+_i,\color{darkred}e^-_i\ge 0\end{align}\]

When we run this model on the data from [3], we get:


----     96 PARAMETER estimates  

         suppins         age       white      female      totchr   intercept

0.25     453.444      16.083     338.083      16.056     782.472   -1412.889
0.5      687.222      35.111     632.889    -260.556    1332.833   -2252.556
0.75     708.409      87.364     801.682    -554.591    2855.318   -4512.045


The analysis in [3] focuses on the variable totchr. Indeed the coefficient for this variable changes a lot for different quantile levels. The R code from [3] produces:


> results <- rq(Y ~ X, data=mydata, tau=c(0.25, 0.5, 0.75))
> summary(results)

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.25

Coefficients:
            Value       Std. Error  t value     Pr(>|t|)   
(Intercept) -1412.88889   433.20179    -3.26150     0.00112
Xsuppins      453.44444    75.05348     6.04162     0.00000
Xtotchr       782.47222    37.55769    20.83388     0.00000
Xage           16.08333     6.19162     2.59760     0.00943
Xfemale        16.05556    72.20278     0.22237     0.82404
Xwhite        338.08333    71.51522     4.72743     0.00000

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.5

Coefficients:
            Value       Std. Error  t value     Pr(>|t|)   
(Intercept) -2252.55556   846.23023    -2.66187     0.00781
Xsuppins      687.22222   137.29264     5.00553     0.00000
Xtotchr      1332.83333    74.77913    17.82360     0.00000
Xage           35.11111    11.29450     3.10869     0.00190
Xfemale      -260.55556   150.46285    -1.73169     0.08343
Xwhite        632.88889   243.05734     2.60387     0.00926

Call: rq(formula = Y ~ X, tau = c(0.25, 0.5, 0.75), data = mydata)

tau: [1] 0.75

Coefficients:
            Value       Std. Error  t value     Pr(>|t|)   
(Intercept) -4512.04545  2350.56284    -1.91956     0.05501
Xsuppins      708.40909   375.76929     1.88522     0.05950
Xtotchr      2855.31818   196.12587    14.55860     0.00000
Xage           87.36364    30.98410     2.81963     0.00484
Xfemale      -554.59091   378.71501    -1.46440     0.14319
Xwhite        801.68182   370.96108     2.16109     0.03077


We see our LP model can reproduce the R results.

There is also a dual version of this model [1]. It looks like:

Dual LP model for quantile regression 
\[\begin{align}\max&\sum_i \color{darkblue}y_i \cdot \color{darkred}d_i\\ & \sum_i \color{darkblue}X_{i,j} \cdot \color{darkred}d_i = 0 \perp \color{darkred} \beta_j && \forall j\\ & \color{darkred}d_i \in [\color{darkblue}\tau-1,\color{darkblue}\tau] \end{align}\]

The duals of the constraint form the estimates \(\color{darkred}\beta_j\). Notice the bounds on the variable \(\color{darkred}d_i\). When we solve this, we see:


----    131 PARAMETER estimates2  from dual

         suppins         age       white      female      totchr   intercept

0.25     453.444      16.083     338.083      16.056     782.472   -1412.889
0.5      687.222      35.111     527.556    -260.556    1332.833   -2147.222
0.75     708.409      87.364     801.682    -554.591    2855.318   -4512.045

Conclusion


Quantile regression is based on a rather simple, but non-obvious LP model. I have tried to informally derive this model using small LP models for the calculation of the median and the quantiles of a data vector. After this, the LP model for the quantile regression problem is rather obvious.

Explicitly formulating and implementing the LP model helps to make things a bit less of a black box. After this exercise, running quantile regressions in say R is more transparent: you know what you are doing.

References


  1. Koenker, R., and Bassett, G. W. (1978). “Regression Quantiles.” Econometrica 46:33–50.
  2. Quantile Regressionhttps://en.wikipedia.org/wiki/Quantile_regression
  3. Ani Katchova, Quantile Regression, https://sites.google.com/site/econometricsacademy/econometrics-models/quantile-regression, this is a very good introduction with examples
  4. Linear Programming and L1 Regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html
  5. Median, https://en.wikipedia.org/wiki/Median
  6. Quantilehttps://en.wikipedia.org/wiki/Quantile.  Note: in case you thought you knew what quantiles are, R supports 9 different types of them.


Appendix A: GAMS code for non-smooth NLP model to calculate the median


Try this model with different NLP solvers to see what happens.

To have GAMS accept a function like abs() inside an NLP model, we need to declare the model as a DNLP. This is a warning that trouble may be ahead.

Note: this is not a good way to calculate the median.


$ontext

  
Find median using a non-smooth NLP model

  
This model will have difficulties due to non-differentiability
  
of the abs() function.

  
References:
     
http://yetanothermathprogrammingconsultant.blogspot.com/2021/06/median-quantiles-and-quantile.html

  
erwin@amsterdamoptimization.com

$offtext

set i /i1*i5/;

parameter y(i) 'data';
y(i) = uniform(0,1);
display y;

variable
   m
'median'
   z
'objective'
;

equation sumabs;

sumabs.. z =e=
sum(i, abs(y(i)-m));

* initial value
m.l = 0.5;

model median /sumabs/;
option dnlp=conopt;
solve median minimizing z using dnlp;

display m.l;


Appendix B: GAMS code for finding the median as an LP


This is a better way to calculate the median. Note that for even-sized datasets it may pick any number between the two middle points. (The usual definition would take the average of these two points.)  Both LP formulations for the absolute value (bounding and variable splitting) are implemented here. 

$ontext

  
Find Median using an LP model

  
Use two formulations for the absolute values


  
References:
     
http://yetanothermathprogrammingconsultant.blogspot.com/2021/06/median-quantiles-and-quantile.html

  
erwin@amsterdamoptimization.com

$offtext

set
  i
/i1*i5/
;

parameter y(i) 'data';
y(i) = uniform(0,1);
display y;


*-------------------------------------------------------------
* formulation 1: bounding
*-------------------------------------------------------------

variables
   m   
'median'
   z   
'objective'
;
positive variables
   e(i)
'absolute value, formulation 1'
;

equations
   obj1     
'objective, formulation 1'
   bound1(i)
   bound2(i)
;

obj1.. z =e=
sum(i, e(i));
bound1(i).. e(i) =g= m-y(i);
bound2(i).. e(i) =g= -(m-y(i));

model medianLP1 /obj1,bound1,bound2/;
solve medianLP1 minimizing z using lp;

display m.l;

*-------------------------------------------------------------
* formulation 2: variable splitting
*-------------------------------------------------------------
set
  pm
/'+','-'/
;

positive variables
   e2(i,pm)
'absolute value, formulation 2'
;

equations
   obj2     
'objective, formulation 2'
   split(i) 
'variable splitting'
;

obj2.. z =e=
sum((i,pm), e2(i,pm));
split(i).. e2(i,
'+')-e2(i,'-') =e= m-y(i);

model medianLP2 /obj2,split/;
solve medianLP2 minimizing z using lp;

display m.l;

display e2.l;


Appendix C: GAMS code for finding quantiles.


This LP model finds different quantiles of a (random) data vector.


$ontext

  
find quantiles using an LP model

  
References:
     
http://yetanothermathprogrammingconsultant.blogspot.com/2021/06/median-quantiles-and-quantile.html

  
erwin@amsterdamoptimization.com


$offtext

set
  i
'observations' /i1*i25/
  t
'quantile levels' /'0','0.25','0.5','0.75','1'/
;

parameter y(i) 'data';
y(i) = uniform(10,100);

display y,t;

*-------------------------------------------------------------
* variable splitting LP Model
*-------------------------------------------------------------

set
  pm
/'+','-'/
;

scalar tau 'quantile level';

positive variables e(i,pm) 'absolute value';

variable
   z
'objective variable'
   q
'quantile'
;


equations
   obj     
'objective'
   split(i) 
'variable splitting'
;

obj.. z =e=
sum(i, tau*e(i,'+') + (1-tau)*e(i,'-'));
split(i).. e(i,
'+')-e(i,'-') =e= y(i)-q;

model quantileLP /obj,split/;


*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------

parameter quantiles(t) 'Solution';

loop(t,
   tau = t.val;
  
solve quantileLP minimizing z using lp;
   quantiles(t) = q.l;
);
display quantiles;


Appendix D: GAMS code for quantile regression


The data file used in this model is from [3]. It looks like (partial view):

----     34 PARAMETER data  all data

              totexp     ltotexp     suppins         age       white      female      totchr

93193020       3.000       1.099       1.000      69.000       1.000
72072017       6.000       1.792       1.000      65.000       1.000       1.000
25296013       9.000       2.197                  85.000       1.000       1.000
23628011      14.000       2.639                  76.000       1.000       1.000
95041014      18.000       2.890                  71.000       1.000       1.000       1.000
25090018      20.000       2.996                  81.000       1.000       1.000
96478017      24.000       3.178                  74.000       1.000
96815023      25.000       3.219       1.000      83.000       1.000                   1.000
93118011      29.000       3.367                  80.000       1.000       1.000
24669013      30.000       3.401                  73.000       1.000       1.000       3.000
93317018      30.000       3.401       1.000      78.000       1.000
97173010      31.000       3.434       1.000      69.000       1.000
. . .


Both the primal and the dual LP are implemented.

$ontext

  
Quantile regression optimization problem as LP
  
Both primal and dual models.

  
Data from:
     
https://sites.google.com/site/econometricsacademy/econometrics-models/quantile-regression

  
References:
     
http://yetanothermathprogrammingconsultant.blogspot.com/2021/06/median-quantiles-and-quantile.html

  
erwin@amsterdamoptimization.com


$offtext


*-------------------------------------------------------------
* data from csv file
*-------------------------------------------------------------

$set csv quantile_health.csv
$set gdx quantile_health.gdx

sets
   i
'observations'
   j0
'column names in csv file'
;

parameter data(i,*) 'all data';

$call csv2gdx %csv% output=%gdx% id=data useHeader=T index=1 values=(2..8)

$gdxin %gdx%
$loaddc i=Dim1 j0=Dim2 data

display i,j0,data;

*-------------------------------------------------------------
* setup of regression data y,X
*-------------------------------------------------------------

set
  j
'independent variables' /intercept,suppins,totchr,age,female,white/
;

parameter
   y(i)  
'dependent variable'
   X(i,j)
'independent variables'
;

y(i) = data(i,
'totexp');
X(i,j) = data(i,j);
X(i,
'intercept') = 1;


*-------------------------------------------------------------
* quantile regression LP model (primal)
*-------------------------------------------------------------

set
  pm
/'+','-'/
;

scalar tau 'quantile';

positive variables e(i,pm) 'absolute value';

variables
   z
'objective variable'
   beta(j)
'estimates'
;

equations
   obj     
'objective'
   split(i) 
'variable splitting'
;

obj.. z =e=
sum(i, tau*e(i,'+') + (1-tau)*e(i,'-'));
split(i).. e(i,
'+')-e(i,'-') =e= y(i)-sum(j,X(i,j)*beta(j));

model quantileLP /obj,split/;


*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------

set q 'quantile levels' /'0.25','0.5','0.75'/;

parameter estimates(q,j);

loop(q,
   tau = q.val;
  
solve quantileLP minimizing z using lp;
   estimates(q,j) = beta.l(j);
);

display estimates;

*-------------------------------------------------------------
* quantile regression LP model (dual)
*-------------------------------------------------------------

variable d(i) 'variables of dual problem';

equations
   obj2       
'objective of dual'
   eqdual(j)  
'constraint of dual problem'
;

obj2.. z =e=
sum(i, y(i)*d(i));
eqdual(j)..
sum(i, x(i,j)*d(i)) =e= 0;

* still missing are the bounds on d.
* we populate them in the solve loop below

model quantileDual /obj2,eqdual/;

*-------------------------------------------------------------
* solve loop
*-------------------------------------------------------------

parameter estimates2(q,j) 'from dual';

loop(q,
   tau = q.val;
   d.lo(i) = tau-1;
   d.up(i) = tau;
  
solve quantileDual maximizing z using lp;
   estimates2(q,j) = eqdual.m(j);
);

display estimates2;

4 comments:

  1. I did something very similar to "LP model 1 for finding the median" in my bachelor's thesis. I don't think I realized that you could also formulate it by splitting variables.

    ReplyDelete
    Replies
    1. In most cases it does not really matter which formulation you choose. Here we use the variable splitting approach so we can put different weights on the positive and negative deviation in the subsequent models.

      Delete
  2. The NLP model for finding the median is not differentiable, that is of course correct. But it is convex in m and it has a subgradient defined at every point. A subgradient solver will reliably find the global minimum for that function.

    ReplyDelete
    Replies
    1. Thanks. Indeed, the standard solvers I have access to are all gradient based. I could not find off-the-shelf subgradient solvers (I assume users develop them for their problem at hand).

      Delete