Yet Another Math Programming Consultant: Mixture models as math programming problem

We can formulate a linear least squares regression model as an optimization problem. This is not how these problems are solved in statistical packages. Often, they use a QR decomposition. A real optimization formulation can be useful when we need to add unusual constraints that the statistical package does not support directly, or when we need to optimize a special maximum-likelihood function.

Here is another example of a least-squares regression problem where we can benefit from mathematical programming techniques.

Data

Statistical problems typically start with a data set:

----     79 PARAMETER data  

                 x           y

case1       20.202      85.162
case2        0.507       2.103
case3       26.961      55.969
case4       49.985      44.690
case5       15.129      86.515
case6       17.417      79.866
case7       33.064      56.328
case8       31.691      29.422
case9       32.209      64.021
case10      96.398      85.191
case11      99.360      68.235
case12      36.990      57.516
case13      37.289      25.884
case14      77.198      56.157
case15      39.668      58.398
case16      91.310      66.205
case17      11.958      93.742
case18      73.548      28.178
case19       5.542       5.788
case20      57.630      60.830
case21       5.141      53.988
case22       0.601      42.559
case23      40.123      61.928
case24      51.988      42.984
case25      62.888      58.308
case26      22.575       4.414
case27      39.612      67.282
case28      27.601      56.445
case29      15.237       0.218
case30      93.632      11.896
case31      42.266      60.515
case32      13.466      51.721
case33      38.606      65.392
case34      37.463      16.978
case35      26.848      74.588
case36      94.837      -0.803
case37      18.894      60.060
case38      29.751      14.005
case39       7.455      60.066
case40      40.135      62.898

It looks like there is not a single regression line, but rather three of them hidden in this data set. So we should find something like:

The question is: how can we estimate the intercept and slope of these three lines?

The problem has to handle two things at the same time:

assignment of points $i$ to a line $k$
perform $k$ regressions

We'll do these things simultaneously as they affect each other. When we separate these decisions, we have a heuristic.

I'll discuss different methods:

Mathematical programming models to find the overall least squares solution.
A heuristic to find a good assignment, similar to the one used in $k$-means clustering.
Models for an $\ell_1$ fit, similar to LAD regression.
Statistical methods that use a maximum likelihood approach: find the regression lines that make this pattern of points most likely.

Least squares optimization models

The simple statistical model we are going to use is: \[ {\color{darkblue}y} = {\color{darkred}\alpha}_1 + {\color{darkred}\alpha}_2 {\color{darkblue}x} + {\color{darkred}\varepsilon} \]

A standard least squares problem formulated as a Quadratic Programming (QP) can look like:

Least Squares
\[\begin{align} \min\> & {\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_1 + {\color{darkred}\alpha}_2 {\color{darkblue}x}_i +{\color{darkred}r}_i && \forall i \end{align}\]

Here ${\color{darkblue}y}_i, {\color{darkblue}x}_i$ is data and ${\color{darkred}r}_i$ is the residual.

This optimization model can also be collapsed into an unconstrained problem. I prefer this version as it can be a bit more numerically stable. (Nerdy detail: I have seen cases where the unconstrained version became non-convex when the Q matrix was no longer positive definite. This cannot happen with this formulation.)

If we want to estimate three different lines $k$, then the high-level mathematical model can be:

High-level Model
\[\begin{align} \min\>&{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i && \forall i,k\| {\color{darkred}\delta}_{i,k}=1 \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \end{align}\]

Here \[{\color{darkred}\delta}_{i,k} = \begin{cases}1 & \text{if point $i$ is assigned to line $k$}\\ 0 & \text{otherwise} \end{cases}\] In the model, we enforce that each point $i$ is assigned to exactly one line $k$.

This model finds the assignment of points to lines, ${\color{darkred}\delta}_{i,k}$, and the regression coefficients, ${\color{darkred}\alpha}_{k,j}$, simultaneously.

The most straightforward implementation is using indicator constraints:

MIQP with indicator constraints
\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ &{\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i&& \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j} \> \mathbf{free}\end{align}\]

MIQP with indicator constraints

\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ &{\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i&& \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j} \> \mathbf{free}\end{align}\]

When we solve this, we get the following optimal solution:

----    149 VARIABLE ssq.L                 =      932.536  objective: sum of squares

----    149 VARIABLE delta.L  assignment of point to line

             line1       line2       line3

case1        1.000
case2                    1.000
case3                                1.000
case4        1.000
case5        1.000
case6        1.000
case7                                1.000
case8                    1.000
case9        1.000
case10                   1.000
case11                               1.000
case12                               1.000
case13                   1.000
case14                   1.000
case15                               1.000
case16                               1.000
case17       1.000
case18       1.000
case19                   1.000
case20                               1.000
case21                               1.000
case22                               1.000
case23       1.000
case24                   1.000
case25                               1.000
case26                   1.000
case27       1.000
case28                               1.000
case29                   1.000
case30       1.000
case31       1.000
case32                               1.000
case33       1.000
case34                   1.000
case35       1.000
case36       1.000
case37                               1.000
case38                   1.000
case39                               1.000
case40       1.000


----    149 VARIABLE alpha.L  coefficients to estimate for each line

        intercept       slope

line1     102.897      -1.033
line2      -6.663       0.879
line3      51.738       0.158


----    151 PARAMETER results  compare models

              obj        time       nodes      status

indic     932.536       5.344  122054.000          ok

Note that instead of index $j\in\{1,2\}$, I used in the actual model the more descriptive set $j\in\{\text{intercept},\text{slope}\}$. This helps not only with the model itself but also with presenting the results more meaningfully.

There is one simple improvement we can make. In the previous model, we can name any of the lines, line1, line2, or line3. We have six possible naming schemes. It makes sense to tighten this down. We can order, say, the intercepts: \[ {\color{darkred}\alpha}_{k,1} \le {\color{darkred}\alpha}_{k+1,1} \] If we add this symmetry breaker, we see:

----    165 VARIABLE alpha.L  coefficients to estimate for each line

        intercept       slope

line1      -6.663       0.879
line2      51.738       0.158
line3     102.897      -1.033


----    167 PARAMETER results  compare models

                  obj        time       nodes      status

indic         932.536       5.344  122054.000          ok
indic+ord     932.536       2.584   45519.000          ok

Notice the ordering of the ${\color{darkred}\alpha}_{j,k}$ coefficients. Taking away this freedom from the solver really helps in performance. The proven optimal solution is found in about half the time. The number of nodes was reduced even more. These simple symmetry-breaking constraints can be surprisingly effective. Solvers often have built-in measures to handle symmetries in the model, but I find that manually adding ordering constraints can still be highly beneficial. A second advantage is that the solution looks a bit more polished.

It is possible, but rather dangerous, to implement this model using plain binary variables. A possible approach could be:

MIQP with big-M constraints
\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i + {\color{darkred}s}_{i,k} && \forall i,k \\ & -{\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) \le {\color{darkred}s}_{i,k} \le {\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) && \forall i,k\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j}, {\color{darkred}s}_{i,k} \> \mathbf{free} \\& {\color{darkblue}M}=1000 \end{align}\]

MIQP with big-M constraints

\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i + {\color{darkred}s}_{i,k} && \forall i,k \\ & -{\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) \le {\color{darkred}s}_{i,k} \le {\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) && \forall i,k\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j}, {\color{darkred}s}_{i,k} \> \mathbf{free} \\& {\color{darkblue}M}=1000 \end{align}\]

It is difficult to find good values for ${\color{darkblue}M}$ (or ${\color{darkblue}M}_{i,k}$ if individually valued). The main problem with these big-M's is that they apply to $(i,k)$ (point/line) pairs that can be very far apart. If the slope is large, the big-M's may become very large. Unfortunately, the big-M's are determined not by the selected, and thus close, line/point combinations, but by the possibly far-away lines not selected for point $i$. Things really work against us here. In our case, the slopes are small, so ${\color{darkblue}M}=1000$ seems safe, but with other data, this may not hold.

An alternative to indicator constraints is to use SOS1 sets (Special Ordered Sets of Type 1). A SOS1-based model can look like:

MIQP with SOS1 constraints
\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i +{\color{darkred}{\mathit{slack}}}_{i,k} && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ & ({\color{darkred}{\mathit{slack}}}_{i,k},{\color{darkred}\delta}_{i,k}) \in {\mathbf{SOS1}}\\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j},{\color{darkred}{\mathit{slack}}}_{i,k}\> \mathbf{free}\end{align}\]

MIQP with SOS1 constraints

\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i +{\color{darkred}{\mathit{slack}}}_{i,k} && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ & ({\color{darkred}{\mathit{slack}}}_{i,k},{\color{darkred}\delta}_{i,k}) \in {\mathbf{SOS1}}\\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j},{\color{darkred}{\mathit{slack}}}_{i,k}\> \mathbf{free}\end{align}\]

The construct $ ({\color{darkred}{\mathit{slack}}}_{i,k},{\color{darkred}\delta}_{i,k}) \in {\mathbf{SOS1}}$ says only one of the pair $({\color{darkred}{\mathit{slack}}}_{i,k},{\color{darkred}\delta}_{i,k})$ can be nonzero. To be precise, both can be zero. I.e., we have a bunch of SOS1 sets with just two members.

There is also a nonconvex formulation. Its advantage is that we don't use any exotic MIP features, just plain, straightforward math. No need to explain SOS sets or indicator constraints.

nonconvex quadratic model
\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ &{\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}\rho}_{i,k} && \forall i,k \\ & {\color{darkred}r}_i = \sum_k {\color{darkred}\rho}_{i,k}{\color{darkred}\delta}_{i,k} && \forall i \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ & \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j},{\color{darkred}\rho}_{i,k} \> \mathbf{free}\end{align}\]

nonconvex quadratic model

\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ &{\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}\rho}_{i,k} && \forall i,k \\ & {\color{darkred}r}_i = \sum_k {\color{darkred}\rho}_{i,k}{\color{darkred}\delta}_{i,k} && \forall i \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ & \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j},{\color{darkred}\rho}_{i,k} \> \mathbf{free}\end{align}\]

Some performance figures are here:

----    248 PARAMETER results  compare models

                      obj        time       nodes      status

indic             932.536       5.344  122054.000          ok
indic+ord         932.536       2.584   45519.000          ok
bigM+ord          932.536       1.993   30033.000          ok
SOS1+ord          932.536       3.494   56992.000          ok
nonconvex+ord     932.536       2.921   33761.000          ok

I kept the ordering constraint in all subsequent models. We see the bigM model is the fastest, but because of its unreliability, I would not recommend using this unless you have a good idea about the maximum possible residuals. I like that the more natural, no-gimmicks, nonconvex formulation is doing quite well here.

Heuristic

An improvement heuristic can be stated as:

Form a random assignment ${\color{darkblue}\delta}_{i,k}$ assigning each point $i$ to a random line $k$. If it is not a valid assignment (e.g., when a line has fewer than 2 points), try again.
Perform a regression for each line $k$.
Reassign each point to its closest line as measured by the residual. If not different from the previous assignment, stop.
Go to step 2.

This can get caught in a local minimum. So we repeat this several times and report the best solution.

This is very much how a $k$-means clustering heuristic [1] works.

I coded this up in GAMS, and used 10 trials. Instead of using a special-purpose least-squares algorithm, I used a simple continuous QP (Quadratic Programming) model. This model solves three regressions (one for each line) in one swoop, and is just a version of the indicator model, but now with exogenous assignments:

QP regression model
\[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i&& \forall i,k\|{\color{darkblue}\delta}_{i,k}=1 \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j} \> \mathbf{free}\end{align}\]

----    360 PARAMETER results  compare models

                      obj        time       nodes      status      numreg

indic             932.536       5.344  122054.000          ok
indic+ord         932.536       2.584   45519.000          ok
bigM+ord          932.536       1.993   30033.000          ok
SOS1+ord          932.536       3.494   56992.000          ok
nonconvex+ord     932.536       2.921   33761.000          ok
heuristic         932.536       1.084                             183.000

We were lucky and found the optimal solution. We needed to do 183 regressions for this. This actually meant: 61 invocations of the QP solver as each QP solve actually performs 3 regressions at once. Obviously, we only know about optimality by also solving formal optimization models. When we run only the heuristic, we don't know.

In general, we can verify the optimality of the heuristic solution by passing it to the solver. The simplest would be to add bounds on the objective: I want to find better solutions than I already have. A more sophisticated approach is to pass the integer variables of the heuristic solution to the solver via the MIPSTART option. This will allow the solver to branch towards that solution and then continue. Note that it is not guaranteed that solving with added bounds or with a complete integer solution is faster than solving from scratch. Sometimes solving from scratch is better. Unfortunately, there is no good way to predict this, so we'll end up with trying things out.

Here I tried to feed assignment variables ${\color{darkblue}\delta}_{i,k}$ found by the heuristic to the indicator model using MIPSTART. We would see:

----    410 PARAMETER results  compare models

                           obj        time       nodes      status      numreg

indic                  932.536       5.344  122054.000          ok
indic+ord              932.536       2.584   45519.000          ok
bigM+ord               932.536       1.993   30033.000          ok
SOS1+ord               932.536       3.494   56992.000          ok
nonconvex+ord          932.536       2.921   33761.000          ok
heuristic              932.536       1.084                             183.000
indic+ord+mipstart     932.536       2.223   41995.000          ok

It is just marginally faster. The improvement was so minimal that we did not make up for the heuristic's solution time. Note that for larger instances (i.e., more lines), things may look very different.

I also want to mention another reason to implement a heuristic. If the solver fails or has really trouble making progress, the availability of a good solution calculated using very different methods is an enormous asset. We can always return a good solution. This is kind of insurance against poor solver performance. Obviously, insurance is not free, but still worthwhile.

Note that when using the ordering constraint, we need to sort the solution found by the heuristic, since the heuristic does not care much about symmetry, and the QP models have no ordering constraint.

L1 models

Instead of minimizing squared residuals, we can also choose to minimize the absolute values of residuals. Standard statistics uses a least squares approach. A well-known issue with this method is that outliers can distort the results due to their large weight in the optimization model. Methods for robust optimization that are less sensitive to outliers. One such method is LAD (Least Absolute Deviation) or $\ell_1$ regression. A high-level model can look like:

High-level LAD Model
\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \|{\color{darkred}r}_i\|\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i && \forall i,k\| {\color{darkred}\delta}_{i,k}=1 \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \end{align}\]

Using indicator constraints and variable splitting to linearize the absolute value, we get the following:

Indicators and variable splitting MIP
\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}^+_i - {\color{darkred}r}^-_i && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}r}^+_i,{\color{darkred}r}^-_i \ge 0 \\ & {\color{darkred}\alpha}_{k,j}\>{\mathbf{free}} \end{align}\]

Indicators and variable splitting MIP

\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}^+_i - {\color{darkred}r}^-_i && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}r}^+_i,{\color{darkred}r}^-_i \ge 0 \\ & {\color{darkred}\alpha}_{k,j}\>{\mathbf{free}} \end{align}\]

We can add to any of these models the ordering constraint as a symmetry breaker: \[ {\color{darkred}\alpha}_{k,1} \le {\color{darkred}\alpha}_{k+1,1} \] We order the lines here by the intercept. This is, in general, a good idea: performance will increase, and as a byproduct, the solution looks better.

We can also use a different linearization:

Indicators and bounding MIP
\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i {\color{darkred}\rho}_i\\ & {\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i && \forall i,k \\ & -{\color{darkred}\rho}_i \le{\color{darkred}r}_i\le{\color{darkred}\rho}_i && \forall i\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}\rho}_i \ge 0 \\ & {\color{darkred}r}_i\>{\mathbf{free}} \end{align}\]

Indicators and bounding MIP

\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i {\color{darkred}\rho}_i\\ & {\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i && \forall i,k \\ & -{\color{darkred}\rho}_i \le{\color{darkred}r}_i\le{\color{darkred}\rho}_i && \forall i\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}\rho}_i \ge 0 \\ & {\color{darkred}r}_i\>{\mathbf{free}} \end{align}\]

Instead of indicator constraints, we can use big-M constraints:

Big-M MIP
\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}^+_i - {\color{darkred}r}^-_i +{\color{darkred}s}_{i,k} && \forall i,k \\ & -{\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) \le {\color{darkred}s}_{i,k} \le {\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) && \forall i,k\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}r}^+_i,{\color{darkred}r}^-_i \ge 0 \\& {\color{darkred}s}_{i,k}\>{\mathbf{free}} \\& {\color{darkblue}M} = 1000 \end{align}\]

Big-M MIP

\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}^+_i - {\color{darkred}r}^-_i +{\color{darkred}s}_{i,k} && \forall i,k \\ & -{\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) \le {\color{darkred}s}_{i,k} \le {\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) && \forall i,k\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}r}^+_i,{\color{darkred}r}^-_i \ge 0 \\& {\color{darkred}s}_{i,k}\>{\mathbf{free}} \\& {\color{darkblue}M} = 1000 \end{align}\]

Obviously, we have the same problem as discussed earlier: there is no easy way to find good values for the big-M constants.

A SOS1 approach could be:

SOS1 MIP
\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i +{\color{darkred}s}_{i,k} && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & ({\color{darkred}\delta}_{i,k},{\color{darkred}s}_{i,k}) \in \>{\mathbf{SOS1}} \\ & {\color{darkred}s}_{i,k}\>{\mathbf{free}} \end{align}\]

SOS1 MIP

\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i +{\color{darkred}s}_{i,k} && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & ({\color{darkred}\delta}_{i,k},{\color{darkred}s}_{i,k}) \in \>{\mathbf{SOS1}} \\ & {\color{darkred}s}_{i,k}\>{\mathbf{free}} \end{align}\]

We can also formulate a non-convex quadratically constrained model using either of the linearizations. Here is one using variable splitting:

Nonconvex MIQCP
\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}\rho}_{i,k} && \forall i,k \\ & {\color{darkred}r}^+_i -{\color{darkred}r}^-_i = \sum_k {\color{darkred}\rho}_{i,k}{\color{darkred}\delta}_{i,k} \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\}\\ & {\color{darkred}r}^+_i ,{\color{darkred}r}^-_i \ge 0 \\ & \\ & {\color{darkred}\rho}_{i,k}\>{\mathbf{free}} \end{align}\]

Nonconvex MIQCP

\[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}\rho}_{i,k} && \forall i,k \\ & {\color{darkred}r}^+_i -{\color{darkred}r}^-_i = \sum_k {\color{darkred}\rho}_{i,k}{\color{darkred}\delta}_{i,k} \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\}\\ & {\color{darkred}r}^+_i ,{\color{darkred}r}^-_i \ge 0 \\ & \\ & {\color{darkred}\rho}_{i,k}\>{\mathbf{free}} \end{align}\]

The heuristic discussed in the previous section can be adapted to our $\ell_1$ objective by replacing the OLS estimation with an LAD estimation. This is just an LP. Again, we can estimate the coefficients of the $k$ lines in one swoop using the following LP:

LP LAD regression model
\[\begin{align} \min\> &{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right) \\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}^+_i -{\color{darkred}r}^-_i && \forall i,k\|{\color{darkblue}\delta}_{i,k}=1 \\ & {\color{darkred}r}^+_i,{\color{darkred}r}^-_i \ge 0 \\ & {\color{darkred}\alpha}_{k,j} \> {\mathbf{free}}\end{align}\]

LP LAD regression model

Here is a comparison:

----    510 PARAMETER results  compare models

                                     obj        time       nodes      status     numreg

indic+split                      143.163    2654.639 9.234285E+7          ok
indic+split+ord                  143.163     634.161 2.462763E+7          ok
indic+bnd                        143.163      74.529 2599880.000          ok
indic+bnd+ord                    143.163      82.795 2936017.000          ok
bigM+split+ord                   143.163      18.893  148133.000          ok
bigM+bnd+ord                     143.163      21.704  238697.000          ok
sos1+split+ord                   143.163     184.504 7827034.000          ok
sos1+bnd+ord                     143.163     108.860 3783221.000          ok
nonconvex+split+ord              143.163      77.621 1755781.000          ok
nonconvex+bnd+ord                143.163      70.933 1492861.000          ok
heuristic                        143.163       0.563                            189.000
nonconvex+split+ord+mipstart     143.163      79.865 1755781.000          ok

Notes:

I am pretty sure the optimal objective is 143.163.
The performance of these linear MIP models is worse than our least squares models from the previous section. One would expect the opposite.
The ordering constraint can help a lot (see the first two results). But see lines 3 and 4 for a counterexample. Better models can be slower at times.
The fastest models are the bigM ones. That is unfortunate: bigM's are not a good idea as we have no good way of estimating the bigM constants.
The nonconvex quadratic problems are solved quite fast compared to some linear models.
The last solve (starting with the heuristic solution) is slower than just solving from scratch. That can happen sometimes.
For comparison: this solution has a sum of squares objective value of 936.577. The optimal least-squares solution from the previous section has an objective value of 932.536.
LAD regression results are, in general, not unique. Least squares solutions are unique.

An optimal solution for the estimated coefficients is:

----    170 VARIABLE alpha.L  coefficients to estimate for each line

        intercept       slope

line1      -6.425       0.950
line2      51.680       0.159
line3     104.206      -1.034

Statistical Mixture Models

Here is how we can solve our problem using the flexmix tool in R for mixture modeling.

library(flexmix)
#> Loading required package: lattice

df <- read.table(text="
case           x           y
case1       20.202      85.162
case2        0.507       2.103
case3       26.961      55.969
case4       49.985      44.690
case5       15.129      86.515
case6       17.417      79.866
case7       33.064      56.328
case8       31.691      29.422
case9       32.209      64.021
case10      96.398      85.191
case11      99.360      68.235
case12      36.990      57.516
case13      37.289      25.884
case14      77.198      56.157
case15      39.668      58.398
case16      91.310      66.205
case17      11.958      93.742
case18      73.548      28.178
case19       5.542       5.788
case20      57.630      60.830
case21       5.141      53.988
case22       0.601      42.559
case23      40.123      61.928
case24      51.988      42.984
case25      62.888      58.308
case26      22.575       4.414
case27      39.612      67.282
case28      27.601      56.445
case29      15.237       0.218
case30      93.632      11.896
case31      42.266      60.515
case32      13.466      51.721
case33      38.606      65.392
case34      37.463      16.978
case35      26.848      74.588
case36      94.837      -0.803
case37      18.894      60.060
case38      29.751      14.005
case39       7.455      60.066
case40      40.135      62.898
",header=T)

set.seed(123)

fit <- stepFlexmix(
  y ~ x,
  data = df,
  k = 3,
  nrep = 10
)
#> 3 : * * * * * * * * * * 

summary(fit)
#> 
#> Call:
#> stepFlexmix(y ~ x, data = df, k = 3, nrep = 10)
#> 
#>        prior size post>0 ratio
#> Comp.1 0.281   11     17 0.647
#> Comp.2 0.326   14     19 0.737
#> Comp.3 0.393   15     25 0.600
#> 
#> 'log Lik.' -154.6783 (df=11)
#> AIC: 331.3566   BIC: 349.9342
parameters(fit)
#>                      Comp.1     Comp.2     Comp.3
#> coef.(Intercept) -6.3644114 102.651247 52.6898762
#> coef.x            0.8796386  -1.033734  0.1512594
#> sigma             7.4150033   4.123961  4.3338367

Compare these coefficients with our least squares solution:

----    162 VARIABLE alpha.L  coefficients to estimate for each line

        intercept       slope

line1      -6.663       0.879
line2      51.738       0.158
line3     102.897      -1.033

The flexmix solution has a total sum of squares objective of 941.9742, compared with 932.536 for our least-squares models. Note that this is not a shortcoming of flexmix: it is using a different model with a different objective.

Notes:

flexmix is using a maximum likelihood approach: it finds values that maximize the likelihood of the observed data. So it is a probabilistic approach.
It uses the EM (Expectation-Maximization) algorithm.

Conclusions

The problem of fitting multiple linear regression models simultaneously is not so simple. We need to find simultaneously:

the assignment of points to lines
the value of the regression coefficients

We do this by minimizing the overall least squares norm of the residuals. An alternative is to use an LAD regression scheme, in which we minimize the absolute values of the residuals. In both cases, we can construct a $k$-means-like clustering heuristic. Surprisingly, the linear MIP models for the LAD version are more difficult than the MIQCP models for the least-squares case. Usually, MIP models are easier to solve.

In general, we should not use binary variables with big-M constraints. We don't know in advance what good, tight values the big M constants should have. One approach is to use exotic modeling tools like indicator or SOS1 constraints. A more natural formulation can be stated, resulting in a nonconvex quadratic model.

Statistical packages look at this problem a bit differently. In software for mixture modeling, a maximum-likelihood approach is typically used.

References

1 comment:

Andy WJune 22, 2026 at 7:47 AM
Your heuristic is also very similar to how the EM algorithm works. You set the weights, fit the weighted regression, and then move the individual point weights to the weighted regression that is closest. Repeat that until convergence.

I would update the post to make clear that mixture modelling in statistical packages are fuzzy assignments (a big difference with the LP approach with binary assignment variables). So a singular observation has a probability it is associated with one of the groups.

I have used non-linear solvers for this as well, https://andrewpwheeler.com/2020/04/29/using-pytorch-to-estimate-group-based-traj-models/. Just piping it all into pytorch and having the machine figure it out. (That is another way to get soft constraints.)

Monday, June 22, 2026

Mixture models as math programming problem