Here is another example of a least-squares regression problem where we can benefit from mathematical programming techniques.
Data
Statistical problems typically start with a data set:
---- 79 PARAMETER data x y case1 20.202 85.162 case2 0.507 2.103 case3 26.961 55.969 case4 49.985 44.690 case5 15.129 86.515 case6 17.417 79.866 case7 33.064 56.328 case8 31.691 29.422 case9 32.209 64.021 case10 96.398 85.191 case11 99.360 68.235 case12 36.990 57.516 case13 37.289 25.884 case14 77.198 56.157 case15 39.668 58.398 case16 91.310 66.205 case17 11.958 93.742 case18 73.548 28.178 case19 5.542 5.788 case20 57.630 60.830 case21 5.141 53.988 case22 0.601 42.559 case23 40.123 61.928 case24 51.988 42.984 case25 62.888 58.308 case26 22.575 4.414 case27 39.612 67.282 case28 27.601 56.445 case29 15.237 0.218 case30 93.632 11.896 case31 42.266 60.515 case32 13.466 51.721 case33 38.606 65.392 case34 37.463 16.978 case35 26.848 74.588 case36 94.837 -0.803 case37 18.894 60.060 case38 29.751 14.005 case39 7.455 60.066 case40 40.135 62.898
The question is: how can we estimate the intercept and slope of these three lines?
The problem has to handle two things at the same time:
- assignment of points \(i\) to a line \(k\)
- perform \(k\) regressions
We'll do these things simultaneously as they affect each other. When we separate these decisions, we have a heuristic.
I'll discuss different methods:
- Mathematical programming models to find the overall least squares solution.
- A heuristic to find a good assignment, similar to the one used in \(k\)-means clustering.
- Models for an \(\ell_1\) fit, similar to LAD regression.
- Statistical methods that use a maximum likelihood approach: find the regression lines that make this pattern of points most likely.
Least squares optimization models
The simple statistical model we are going to use is: \[ {\color{darkblue}y} = {\color{darkred}\alpha}_1 + {\color{darkred}\alpha}_2 {\color{darkblue}x} + {\color{darkred}\varepsilon} \]
A standard least squares problem formulated as a Quadratic Programming (QP) can look like:
| Least Squares |
|---|
| \[\begin{align} \min\> & {\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_1 + {\color{darkred}\alpha}_2 {\color{darkblue}x}_i +{\color{darkred}r}_i && \forall i \end{align}\] |
Here \({\color{darkblue}y}_i, {\color{darkblue}x}_i\) is data and \({\color{darkred}r}_i\) is the residual.
This optimization model can also be collapsed into an unconstrained problem. I prefer this version as it can be a bit more numerically stable. (Nerdy detail: I have seen cases where the unconstrained version became non-convex when the Q matrix was no longer positive definite. This cannot happen with this formulation.)
If we want to estimate three different lines \(k\), then the high-level mathematical model can be:
| High-level Model |
|---|
| \[\begin{align} \min\>&{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i && \forall i,k| {\color{darkred}\delta}_{i,k}=1 \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \end{align}\] |
Here \[{\color{darkred}\delta}_{i,k} = \begin{cases}1 & \text{if point $i$ is assigned to line $k$}\\ 0 & \text{otherwise} \end{cases}\] In the model, we enforce that each point \(i\) is assigned to exactly one line \(k\).
This model finds the assignment of points to lines, \({\color{darkred}\delta}_{i,k}\), and the regression coefficients, \({\color{darkred}\alpha}_{k,j}\), simultaneously.
The most straightforward implementation is using indicator constraints:
| MIQP with indicator constraints |
|---|
| \[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ &{\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i&& \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j} \> \mathbf{free}\end{align}\] |
When we solve this, we get the following optimal solution:
---- 149 VARIABLE ssq.L = 932.536 objective: sum of squares ---- 149 VARIABLE delta.L assignment of point to line line1 line2 line3 case1 1.000 case2 1.000 case3 1.000 case4 1.000 case5 1.000 case6 1.000 case7 1.000 case8 1.000 case9 1.000 case10 1.000 case11 1.000 case12 1.000 case13 1.000 case14 1.000 case15 1.000 case16 1.000 case17 1.000 case18 1.000 case19 1.000 case20 1.000 case21 1.000 case22 1.000 case23 1.000 case24 1.000 case25 1.000 case26 1.000 case27 1.000 case28 1.000 case29 1.000 case30 1.000 case31 1.000 case32 1.000 case33 1.000 case34 1.000 case35 1.000 case36 1.000 case37 1.000 case38 1.000 case39 1.000 case40 1.000 ---- 149 VARIABLE alpha.L coefficients to estimate for each line intercept slope line1 102.897 -1.033 line2 -6.663 0.879 line3 51.738 0.158 ---- 151 PARAMETER results compare models obj time nodes status indic 932.536 5.344 122054.000 ok
Note that instead of index \(j\in\{1,2\}\), I used in the actual model the more descriptive set \(j\in\{\text{intercept},\text{slope}\}\). This helps not only with the model itself but also with presenting the results more meaningfully.
There is one simple improvement we can make. In the previous model, we can name any of the lines, line1, line2, or line3. We have six possible naming schemes. It makes sense to tighten this down. We can order, say, the intercepts: \[ {\color{darkred}\alpha}_{k,1} \le {\color{darkred}\alpha}_{k+1,1} \] If we add this symmetry breaker, we see:
---- 165 VARIABLE alpha.L coefficients to estimate for each line intercept slope line1 -6.663 0.879 line2 51.738 0.158 line3 102.897 -1.033 ---- 167 PARAMETER results compare models obj time nodes status indic 932.536 5.344 122054.000 ok indic+ord 932.536 2.584 45519.000 ok
Notice the ordering of the \({\color{darkred}\alpha}_{j,k}\) coefficients. Taking away this freedom from the solver really helps in performance. The proven optimal solution is found in about half the time. The number of nodes was reduced even more. These simple symmetry-breaking constraints can be surprisingly effective. Solvers often have built-in measures to handle symmetries in the model, but I find that manually adding ordering constraints can still be highly beneficial. A second advantage is that the solution looks a bit more polished.
It is possible, but rather dangerous, to implement this model using plain binary variables. A possible approach could be:
| MIQP with big-M constraints |
|---|
| \[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i + {\color{darkred}s}_{i,k} && \forall i,k \\ & -{\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) \le {\color{darkred}s}_{i,k} \le {\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) && \forall i,k\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j}, {\color{darkred}s}_{i,k} \> \mathbf{free} \\& {\color{darkblue}M}=1000 \end{align}\] |
It is difficult to find good values for \({\color{darkblue}M}\) (or \({\color{darkblue}M}_{i,k}\) if individually valued). The main problem with these big-M's is that they apply to \((i,k)\) (point/line) pairs that can be very far apart. If the slope is large, the big-M's may become very large. Unfortunately, the big-M's are determined not by the selected, and thus close, line/point combinations, but by the possibly far-away lines not selected for point \(i\). Things really work against us here. In our case, the slopes are small, so \({\color{darkblue}M}=1000\) seems safe, but with other data, this may not hold.
An alternative to indicator constraints is to use SOS1 sets (Special Ordered Sets of Type 1). A SOS1-based model can look like:
| MIQP with SOS1 constraints |
|---|
| \[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i +{\color{darkred}{\mathit{slack}}}_{i,k} && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ & ({\color{darkred}{\mathit{slack}}}_{i,k},{\color{darkred}\delta}_{i,k}) \in {\mathbf{SOS1}}\\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j},{\color{darkred}{\mathit{slack}}}_{i,k}\> \mathbf{free}\end{align}\] |
The construct \( ({\color{darkred}{\mathit{slack}}}_{i,k},{\color{darkred}\delta}_{i,k}) \in {\mathbf{SOS1}}\) says only one of the pair \(({\color{darkred}{\mathit{slack}}}_{i,k},{\color{darkred}\delta}_{i,k})\) can be nonzero. To be precise, both can be zero. I.e., we have a bunch of SOS1 sets with just two members.
There is also a nonconvex formulation. Its advantage is that we don't use any exotic MIP features, just plain, straightforward math. No need to explain SOS sets or indicator constraints.
| nonconvex quadratic model |
|---|
| \[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ &{\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}\rho}_{i,k} && \forall i,k \\ & {\color{darkred}r}_i = \sum_k {\color{darkred}\rho}_{i,k}{\color{darkred}\delta}_{i,k} && \forall i \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ & \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j},{\color{darkred}\rho}_{i,k} \> \mathbf{free}\end{align}\] |
Some performance figures are here:
---- 248 PARAMETER results compare models obj time nodes status indic 932.536 5.344 122054.000 ok indic+ord 932.536 2.584 45519.000 ok bigM+ord 932.536 1.993 30033.000 ok SOS1+ord 932.536 3.494 56992.000 ok nonconvex+ord 932.536 2.921 33761.000 ok
I kept the ordering constraint in all subsequent models. We see the bigM model is the fastest, but because of its unreliability, I would not recommend using this unless you have a good idea about the maximum possible residuals. I like that the more natural, no-gimmicks, nonconvex formulation is doing quite well here.
Heuristic
An improvement heuristic can be stated as:
- Form a random assignment \({\color{darkblue}\delta}_{i,k}\) assigning each point \(i\) to a random line \(k\). If it is not a valid assignment (e.g., when a line has fewer than 2 points), try again.
- Perform a regression for each line \(k\).
- Reassign each point to its closest line as measured by the residual. If not different from the previous assignment, stop.
- Go to step 2.
This can get caught in a local minimum. So we repeat this several times and report the best solution.
This is very much how a \(k\)-means clustering heuristic [1] works.
I coded this up in GAMS, and used 10 trials. Instead of using a special-purpose least-squares algorithm, I used a simple continuous QP (Quadratic Programming) model. This model solves three regressions (one for each line) in one swoop, and is just a version of the indicator model, but now with exogenous assignments:
| QP regression model |
|---|
| \[\begin{align} \min\> &{\color{darkred}{\mathit{ssq}}} = \sum_i {\color{darkred}r}_i^2\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i&& \forall i,k|{\color{darkblue}\delta}_{i,k}=1 \\ &{\color{darkred}r}_i, {\color{darkred}\alpha}_{k,j} \> \mathbf{free}\end{align}\] |
---- 360 PARAMETER results compare models obj time nodes status numreg indic 932.536 5.344 122054.000 ok indic+ord 932.536 2.584 45519.000 ok bigM+ord 932.536 1.993 30033.000 ok SOS1+ord 932.536 3.494 56992.000 ok nonconvex+ord 932.536 2.921 33761.000 ok heuristic 932.536 1.084 183.000
We were lucky and found the optimal solution. We needed to do 183 regressions for this. This actually meant: 61 invocations of the QP solver as each QP solve actually performs 3 regressions at once. Obviously, we only know about optimality by also solving formal optimization models. When we run only the heuristic, we don't know.
In general, we can verify the optimality of the heuristic solution by passing it to the solver. The simplest would be to add bounds on the objective: I want to find better solutions than I already have. A more sophisticated approach is to pass the integer variables of the heuristic solution to the solver via the MIPSTART option. This will allow the solver to branch towards that solution and then continue. Note that it is not guaranteed that solving with added bounds or with a complete integer solution is faster than solving from scratch. Sometimes solving from scratch is better. Unfortunately, there is no good way to predict this, so we'll end up with trying things out.
Here I tried to feed assignment variables \({\color{darkblue}\delta}_{i,k}\) found by the heuristic to the indicator model using MIPSTART. We would see:
---- 410 PARAMETER results compare models obj time nodes status numreg indic 932.536 5.344 122054.000 ok indic+ord 932.536 2.584 45519.000 ok bigM+ord 932.536 1.993 30033.000 ok SOS1+ord 932.536 3.494 56992.000 ok nonconvex+ord 932.536 2.921 33761.000 ok heuristic 932.536 1.084 183.000 indic+ord+mipstart 932.536 2.223 41995.000 ok
It is just marginally faster. The improvement was so minimal that we did not make up for the heuristic's solution time. Note that for larger instances (i.e., more lines), things may look very different.
I also want to mention another reason to implement a heuristic. If the solver fails or has really trouble making progress, the availability of a good solution calculated using very different methods is an enormous asset. We can always return a good solution. This is kind of insurance against poor solver performance. Obviously, insurance is not free, but still worthwhile.
Note that when using the ordering constraint, we need to sort the solution found by the heuristic, since the heuristic does not care much about symmetry, and the QP models have no ordering constraint.
L1 models
Instead of minimizing squared residuals, we can also choose to minimize the absolute values of residuals. Standard statistics uses a least squares approach. A well-known issue with this method is that outliers can distort the results due to their large weight in the optimization model. Methods for robust optimization that are less sensitive to outliers. One such method is LAD (Least Absolute Deviation) or \(\ell_1\) regression. A high-level model can look like:
| High-level LAD Model |
|---|
| \[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i |{\color{darkred}r}_i|\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i && \forall i,k| {\color{darkred}\delta}_{i,k}=1 \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \end{align}\] |
Using indicator constraints and variable splitting to linearize the absolute value, we get the following:
| Indicators and variable splitting MIP |
|---|
| \[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}^+_i - {\color{darkred}r}^-_i && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}r}^+_i,{\color{darkred}r}^-_i \ge 0 \\ & {\color{darkred}\alpha}_{k,j}\>{\mathbf{free}} \end{align}\] |
We can add to any of these models the ordering constraint as a symmetry breaker: \[ {\color{darkred}\alpha}_{k,1} \le {\color{darkred}\alpha}_{k+1,1} \] We order the lines here by the intercept. This is, in general, a good idea: performance will increase, and as a byproduct, the solution looks better.
We can also use a different linearization:
| Indicators and bounding MIP |
|---|
| \[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i {\color{darkred}\rho}_i\\ & {\color{darkred}\delta}_{i,k}=1 \implies {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i && \forall i,k \\ & -{\color{darkred}\rho}_i \le{\color{darkred}r}_i\le{\color{darkred}\rho}_i && \forall i\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}\rho}_i \ge 0 \\ & {\color{darkred}r}_i\>{\mathbf{free}} \end{align}\] |
Instead of indicator constraints, we can use big-M constraints:
| Big-M MIP |
|---|
| \[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}^+_i - {\color{darkred}r}^-_i +{\color{darkred}s}_{i,k} && \forall i,k \\ & -{\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) \le {\color{darkred}s}_{i,k} \le {\color{darkblue}M}(1-{\color{darkred}\delta}_{i,k}) && \forall i,k\\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & {\color{darkred}r}^+_i,{\color{darkred}r}^-_i \ge 0 \\& {\color{darkred}s}_{i,k}\>{\mathbf{free}} \\& {\color{darkblue}M} = 1000 \end{align}\] |
Obviously, we have the same problem as discussed earlier: there is no easy way to find good values for the big-M constants.
A SOS1 approach could be:
| SOS1 MIP |
|---|
| \[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}_i +{\color{darkred}s}_{i,k} && \forall i,k \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\} \\ & ({\color{darkred}\delta}_{i,k},{\color{darkred}s}_{i,k}) \in \>{\mathbf{SOS1}} \\ & {\color{darkred}s}_{i,k}\>{\mathbf{free}} \end{align}\] |
We can also formulate a non-convex quadratically constrained model using either of the linearizations. Here is one using variable splitting:
| Nonconvex MIQCP |
|---|
| \[\begin{align} \min\>&{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right)\\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}\rho}_{i,k} && \forall i,k \\ & {\color{darkred}r}^+_i -{\color{darkred}r}^-_i = \sum_k {\color{darkred}\rho}_{i,k}{\color{darkred}\delta}_{i,k} \\ &\sum_k {\color{darkred}\delta}_{i,k} = 1 && \forall i \\ &{\color{darkred}\delta}_{i,k} \in \{0,1\}\\ & {\color{darkred}r}^+_i ,{\color{darkred}r}^-_i \ge 0 \\ & \\ & {\color{darkred}\rho}_{i,k}\>{\mathbf{free}} \end{align}\] |
The heuristic discussed in the previous section can be adapted to our \(\ell_1\) objective by replacing the OLS estimation with an LAD estimation. This is just an LP. Again, we can estimate the coefficients of the \(k\) lines in one swoop using the following LP:
| LP LAD regression model |
|---|
| \[\begin{align} \min\> &{\color{darkred}{\mathit{sad}}} = \sum_i \left({\color{darkred}r}^+_i +{\color{darkred}r}^-_i\right) \\ & {\color{darkblue}y}_i = {\color{darkred}\alpha}_{k,1} + {\color{darkred}\alpha}_{k,2} {\color{darkblue}x}_i + {\color{darkred}r}^+_i -{\color{darkred}r}^-_i && \forall i,k|{\color{darkblue}\delta}_{i,k}=1 \\ & {\color{darkred}r}^+_i,{\color{darkred}r}^-_i \ge 0 \\ & {\color{darkred}\alpha}_{k,j} \> {\mathbf{free}}\end{align}\] |
Here is a comparison:
---- 510 PARAMETER results compare models obj time nodes status numreg indic+split 143.163 2654.639 9.234285E+7 ok indic+split+ord 143.163 634.161 2.462763E+7 ok indic+bnd 143.163 74.529 2599880.000 ok indic+bnd+ord 143.163 82.795 2936017.000 ok bigM+split+ord 143.163 18.893 148133.000 ok bigM+bnd+ord 143.163 21.704 238697.000 ok sos1+split+ord 143.163 184.504 7827034.000 ok sos1+bnd+ord 143.163 108.860 3783221.000 ok nonconvex+split+ord 143.163 77.621 1755781.000 ok nonconvex+bnd+ord 143.163 70.933 1492861.000 ok heuristic 143.163 0.563 189.000 nonconvex+split+ord+mipstart 143.163 79.865 1755781.000 ok
Notes:
- I am pretty sure the optimal objective is 143.163.
- The performance of these linear MIP models is worse than our least squares models from the previous section. One would expect the opposite.
- The ordering constraint can help a lot (see the first two results). But see lines 3 and 4 for a counterexample. Better models can be slower at times.
- The fastest models are the bigM ones. That is unfortunate: bigM's are not a good idea as we have no good way of estimating the bigM constants.
- The nonconvex quadratic problems are solved quite fast compared to some linear models.
- The last solve (starting with the heuristic solution) is slower than just solving from scratch. That can happen sometimes.
- For comparison: this solution has a sum of squares objective value of 936.577. The optimal least-squares solution from the previous section has an objective value of 932.536.
- LAD regression results are, in general, not unique. Least squares solutions are unique.
An optimal solution for the estimated coefficients is:
---- 170 VARIABLE alpha.L coefficients to estimate for each line intercept slope line1 -6.425 0.950 line2 51.680 0.159 line3 104.206 -1.034
Statistical Mixture Models
Here is how we can solve our problem using the flexmix tool in R for mixture modeling.
library(flexmix)
#> Loading required package: lattice
df <- read.table(text="
case x y
case1 20.202 85.162
case2 0.507 2.103
case3 26.961 55.969
case4 49.985 44.690
case5 15.129 86.515
case6 17.417 79.866
case7 33.064 56.328
case8 31.691 29.422
case9 32.209 64.021
case10 96.398 85.191
case11 99.360 68.235
case12 36.990 57.516
case13 37.289 25.884
case14 77.198 56.157
case15 39.668 58.398
case16 91.310 66.205
case17 11.958 93.742
case18 73.548 28.178
case19 5.542 5.788
case20 57.630 60.830
case21 5.141 53.988
case22 0.601 42.559
case23 40.123 61.928
case24 51.988 42.984
case25 62.888 58.308
case26 22.575 4.414
case27 39.612 67.282
case28 27.601 56.445
case29 15.237 0.218
case30 93.632 11.896
case31 42.266 60.515
case32 13.466 51.721
case33 38.606 65.392
case34 37.463 16.978
case35 26.848 74.588
case36 94.837 -0.803
case37 18.894 60.060
case38 29.751 14.005
case39 7.455 60.066
case40 40.135 62.898
",header=T)
set.seed(123)
fit <- stepFlexmix(
y ~ x,
data = df,
k = 3,
nrep = 10
)
#> 3 : * * * * * * * * * *
summary(fit)
#>
#> Call:
#> stepFlexmix(y ~ x, data = df, k = 3, nrep = 10)
#>
#> prior size post>0 ratio
#> Comp.1 0.281 11 17 0.647
#> Comp.2 0.326 14 19 0.737
#> Comp.3 0.393 15 25 0.600
#>
#> 'log Lik.' -154.6783 (df=11)
#> AIC: 331.3566 BIC: 349.9342
parameters(fit)
#> Comp.1 Comp.2 Comp.3
#> coef.(Intercept) -6.3644114 102.651247 52.6898762
#> coef.x 0.8796386 -1.033734 0.1512594
#> sigma 7.4150033 4.123961 4.3338367Compare these coefficients with our least squares solution:
---- 162 VARIABLE alpha.L coefficients to estimate for each line intercept slope line1 -6.663 0.879 line2 51.738 0.158 line3 102.897 -1.033
The flexmix solution has a total sum of squares objective of 941.9742, compared with 932.536 for our least-squares models. Note that this is not a shortcoming of flexmix: it is using a different model with a different objective.
Notes:
- flexmix is using a maximum likelihood approach: it finds values that maximize the likelihood of the observed data. So it is a probabilistic approach.
- It uses the EM (Expectation-Maximization) algorithm.
Conclusions
The problem of fitting multiple linear regression models simultaneously is not so simple. We need to find simultaneously:
- the assignment of points to lines
- the value of the regression coefficients
We do this by minimizing the overall least squares norm of the residuals. An alternative is to use an LAD regression scheme, in which we minimize the absolute values of the residuals. In both cases, we can construct a \(k\)-means-like clustering heuristic. Surprisingly, the linear MIP models for the LAD version are more difficult than the MIQCP models for the least-squares case. Usually, MIP models are easier to solve.
In general, we should not use binary variables with big-M constraints. We don't know in advance what good, tight values the big M constants should have. One approach is to use exotic modeling tools like indicator or SOS1 constraints. A more natural formulation can be stated, resulting in a nonconvex quadratic model.
Statistical packages look at this problem a bit differently. In software for mixture modeling, a maximum-likelihood approach is typically used.
Your heuristic is also very similar to how the EM algorithm works. You set the weights, fit the weighted regression, and then move the individual point weights to the weighted regression that is closest. Repeat that until convergence.
ReplyDeleteI would update the post to make clear that mixture modelling in statistical packages are fuzzy assignments (a big difference with the LP approach with binary assignment variables). So a singular observation has a probability it is associated with one of the groups.
I have used non-linear solvers for this as well, https://andrewpwheeler.com/2020/04/29/using-pytorch-to-estimate-group-based-traj-models/. Just piping it all into pytorch and having the machine figure it out. (That is another way to get soft constraints.)