In [1], I discussed some models for a small two-line regression problem. Here I want to dive into a more challenging problem.
Data generation process
I started with the following five lines: \[\begin{align}&f_1(x)=200-3x\\&f_2(x)=150-x\\&f_3(x)=10\\&f_4(x)=-50+2x\\&f_5(x)=-120\end{align}\] To generate points, I used \[\begin{align}&x_i \sim U(0,100)\\ & k_i \sim U\{1,5\} \\ & \epsilon_i \sim N(0,15)\\ & y_i = a_{k_i} + b_{k_i} \cdot x_i + \epsilon_i \end{align}\] Here \(a_k\) and \(b_k\) are the constant term and slope of line \(f_k\), and the normal distribution is parameterized by its standard deviation.
This corresponds to the following picture:
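For reference, here is a minimal R sketch of this generation recipe. It is not the code that produced the data shown below; the variable names and the seed are my own:

# Minimal sketch of the data generation recipe (not the original code)
set.seed(123)
n <- 100
a <- c(200, 150, 10, -50, -120)      # constant terms of f_1..f_5
b <- c( -3,  -1,  0,   2,    0)      # slopes of f_1..f_5
x <- runif(n, 0, 100)                # x_i ~ U(0,100)
k <- sample(1:5, n, replace = TRUE)  # k_i ~ U{1,5}
eps <- rnorm(n, 0, 15)               # eps_i ~ N(0,15), sd = 15
y <- a[k] + b[k] * x + eps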
Now, we drop the lines and just keep the points. Together with \(K=5\) (the number of lines), this is all the data we give to the model:
---- 32 PARAMETER x  data

i1   5.141, i2   0.601, i3  40.123, i4  51.988, i5  62.888, i6  22.575, i7  39.612
i8  27.601, i9  15.237, i10 93.632, i11 42.266, i12 13.466, i13 38.606, i14 37.463
i15 26.848, i16 94.837, i17 18.894, i18 29.751, i19  7.455, i20 40.135, i21 10.169
i22 38.389, i23 32.409, i24 19.213, i25 11.237, i26 59.656, i27 51.145, i28  4.507
i29 78.310, i30 94.575, i31 59.646, i32 60.734, i33 36.251, i34 59.407, i35 67.985
i36 50.659, i37 15.925, i38 65.689, i39 52.388, i40 12.440, i41 98.672, i42 22.812
i43 67.565, i44 77.678, i45 93.245, i46 20.124, i47 29.714, i48 19.723, i49 24.635
i50 64.648, i51 73.497, i52  8.544, i53 15.035, i54 43.419, i55 18.694, i56 69.269
i57 76.297, i58 15.481, i59 38.938, i60 69.543, i61 84.581, i62 61.272, i63 97.597
i64  2.689, i65 18.745, i66  8.712, i67 54.040, i68 12.686, i69 73.400, i70 11.323
i71 48.835, i72 79.560, i73 49.205, i74 53.356, i75  1.062, i76 54.387, i77 45.113
i78 97.533, i79 18.385, i80 16.353, i81  2.463, i82 17.782, i83  6.132, i84  1.664
i85 83.565, i86 60.166, i87  2.702, i88 19.609, i89 95.071, i90 33.554, i91 59.426
i92 25.919, i93 64.063, i94 15.525, i95 46.002, i96 39.334, i97 80.546, i98 54.099
i99 39.072, i100 55.782

---- 32 PARAMETER y  data

i1  192.922, i2  -122.916, i3    21.992, i4    70.273, i5    91.422, i6  109.528
i7  129.230, i8  -140.463, i9   180.013, i10   24.614, i11 -108.546, i12  10.211
i13 -148.679, i14  24.443, i15  122.773, i16  164.505, i17  164.359, i18 132.833
i19  -19.868, i20  15.210, i21  155.782, i22  127.528, i23  122.108, i24 149.691
i25   -0.143, i26 -120.337, i27  94.868, i28  -39.217, i29   96.683, i30  46.506
i31   35.514, i32  22.589, i33   97.811, i34 -113.523, i35   88.707, i36  94.006
i37    9.232, i38  73.173, i39   26.600, i40    0.429, i41    3.943, i42 119.583
i43  101.666, i44 -38.179, i45   39.630, i46  137.980, i47    4.808, i48  18.422
i49  -10.888, i50  83.361, i51   65.726, i52  -39.174, i53    0.179, i54 126.818
i55  139.561, i56 -18.520, i57  105.170, i58    8.206, i59   95.491, i60  87.668
i61  -64.973, i62  12.739, i63  -13.237, i64  -70.810, i65  136.727, i66 178.416
i67    7.676, i68 -51.541, i69   78.491, i70  143.387, i71  115.522, i72  66.261
i73   52.631, i74 -100.755, i75 151.703, i76   47.462, i77  127.759, i78 -65.191
i79   -6.051, i80 137.610, i81  130.662, i82  128.182, i83  148.801, i84 -23.712
i85  -64.504, i86  27.111, i87  150.200, i88  171.403, i89   72.107, i90 -105.643
i91 -112.767, i92 112.218, i93  102.355, i94    5.007, i95   86.645, i96 -127.749
i97  -30.270, i98  79.430, i99   78.474, i100  -7.158
The questions we want to address are:
- Can we use an optimization model to find the best fit for five different lines?
- Do the regression lines we find this way correspond to the original functions used in the data generation process?
The model has to make two decisions:
- Assign points to lines.
- Fitting: find the coefficients (constant term and slope) of each line.
I expected this not to be too difficult. That turned out to be wrong.
We have \(5\times 100=500\) binary variables to model the assignment. That may seem small for a good MIP solver, but remember that \(2^{500}\) is a big number [3]:
Note: a better measure of size would be \(5^{100}\) (the number of possible assignments of points to lines), which is a bit smaller: 7888609052210118054117285652827862296732064351090230047702789306640625 (70 decimal digits).
Solution process
To make things a bit easier, I used a linear MIP model based on LAD (least absolute deviations) regression [2] instead of a least-squares objective:
Linear LAD Model

\[\begin{align} \min\>&\sum_i \color{darkred}z_i\\ & -\color{darkred}z_i \le \color{darkred}r_i \le \color{darkred}z_i &&\forall i && \text{absolute value}\\ & \color{darkred}r_i = \color{darkblue}y_i - \color{darkred}\alpha_{0,k} - \color{darkred}\alpha_{1,k}\cdot\color{darkblue}x_i +\color{darkred}s_{i,k} &&\forall i,k && \text{fit (possibly relaxed)}\\ & -\color{darkblue}M \cdot(1-\color{darkred}\delta_{i,k}) \le \color{darkred}s_{i,k} \le \color{darkblue}M \cdot(1-\color{darkred}\delta_{i,k}) && \forall i,k && \text{big-M constraint}\\ & \sum_k\color{darkred}\delta_{i,k} = 1 && \forall i && \text{pick one regression line}\\ & \color{darkred}\alpha_{0,k} \le \color{darkred}\alpha_{0,k+1} && \forall k \>\text{except last}&&\text{order by constant term} \\ & \color{darkred}\delta_{i,k} \in \{0,1\} \end{align} \]
The big-M constraints implement the implication \[\color{darkred}\delta_{i,k}=1\Rightarrow \color{darkred}s_{i,k}=0 \] i.e., if \(\color{darkred}\delta_{i,k}=1\), then point \(i\) is used in fitting line \(k\). This model could not be solved to proven optimality: the best bound started out very badly (at zero) and moved very slowly. In the end, I used a heuristic to find a good solution and then used the MIP model to improve upon it via a MIP start.
The heuristic looked like:
1. If the time or iteration limit is reached: stop.
2. Generate a random assignment and evaluate the objective using an L1-norm regression for each of the five lines.
3. For each point \(i\), check whether the objective improves if we assign it to a different line; if so, reassign it.
4. If no further improvement is possible, go to step 1.
This only looks at one point at a time, so it will not deliver optimal solutions. Hence we restart with a fresh random assignment when we can no longer improve the current solution. In step 3 I initially use a cheap test: keep the lines fixed and only compare residuals, without rerunning the regressions. A more expensive test compares the residuals after rerunning the regressions; this test is used once the cheap test no longer produces reassignments. A small R sketch of the cheap improvement step is shown below.
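Here is a minimal R sketch of the cheap single-point improvement step. It is not the original implementation: the function name improve_once is mine, it assumes an l1fit-style LAD routine (e.g. from the L1pack package, as in the GA fitness function later on), and degenerate assignments with too few points per line are not handled.

# Minimal sketch of the cheap improvement step (lines kept fixed).
# Assumes x, y (length n) and an assignment vector with values in 1..K.
library(L1pack)   # provides l1fit (LAD regression)

improve_once <- function(assign, x, y, K) {
  # fit each line on its currently assigned points
  coef <- matrix(0, K, 2)
  for (k in 1:K) {
    f <- l1fit(x[assign == k], y[assign == k])
    coef[k, ] <- f$coefficients        # (intercept, slope)
  }
  changed <- FALSE
  for (i in seq_along(x)) {
    # absolute residual of point i with respect to every (fixed) line
    res <- abs(y[i] - coef[, 1] - coef[, 2] * x[i])
    best <- which.min(res)
    if (best != assign[i]) {           # move point i to its closest line
      assign[i] <- best
      changed <- TRUE
    }
  }
  list(assign = assign, changed = changed)
}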
The results were:
- Heuristic: obj=957.1482
- MIP improvement: obj=946.787
The picture below is from this solution.
The final LAD estimates are:
---- 473 PARAMETER est  estimates

                a0          a1
line1     -123.012       0.160
line2      -66.229       2.246
line3       -0.078       0.041
line4      155.979      -1.123
line5      211.432      -3.213
We can compare this against the coefficients we used to generate the data. As the model orders the lines by their constant term, line1 through line5 correspond to \(f_5, f_4, f_3, f_2, f_1\) respectively, and the estimates are reasonably close to the true coefficients.
Sidenote: checking the heuristic
Remember, our heuristic tries to move single points to another line; when it finishes, it delivers a solution where no such move is profitable. We can check whether the heuristic is correct by solving a slightly different version of the MIP model: restrict the assignment variables \(\color{darkred}\delta_{i,k}\) so that at most one point is reassigned. If this model stops without an improvement, we have some confidence in the solution our heuristic produces. The modeling involved is as follows: \[\begin{align}&\color{darkred}d_{i,k} \ge \color{darkred}\delta_{i,k}-\color{darkblue}\delta^0_{i,k} \\ & \color{darkred}d_{i,k} \ge \color{darkblue}\delta^0_{i,k}-\color{darkred}\delta_{i,k} \\ & \sum_{i,k}\color{darkred}d_{i,k}\le 2 \\ & \color{darkred}d_{i,k}\in [0,1]\end{align}\] Here \(\color{darkblue}\delta^0_{i,k}\) is the solution found by the heuristic, and \(\color{darkred}d_{i,k}=1\) if we change an assignment relative to it. Note that reassigning one point flips two entries of \(\color{darkred}\delta_{i,k}\) (one from 1 to 0, one from 0 to 1), hence the budget of 2.
GA heuristic
I also tried the GA algorithm in R. This heuristic only allows real-valued or binary variables. I used a real-valued vector of length 100 (one entry per data point) with bounds \([1,5.9999]\). The fitness function used this algorithm:
- Use the floor function to convert the reals to integers between 1 and 5.
- Perform an L1 regression on each line.
- The objective is minus the sum of all absolute residuals (the GA maximizes fitness, so we return the negative).
fitness <- function(rx) {
  # rx are n=100 real numbers in [1,K+0.999]
  ix <- floor(rx)              # ix are n=100 integers in [1,K]
  obj <- 0
  for (k in 1:K) {
    cnt <- length(ix[ix==k])
    if (cnt>2) {
      r <- l1fit(x[ix==k], y[ix==k])
      obj <- obj + sum(abs(r$residuals))
    }
  }
  return(-obj)
}
The simplicity of this is very attractive. Also, it is easy to change the fitness function to work with a least-squares fit.
Note that sometimes the GA algorithm generated an assignment with fewer than two points on a line. In that case, we don't run a regression (it would fail) but take the residuals for that line to be zero. With exactly two points assigned to a line, we can also skip the regression: the line passes through both points, so the residuals are zero.
As a final test, I provided our best solution (obj=946.787) as a starting point and let the GA algorithm run:
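The ga() call itself is not shown in the post; here is a hedged sketch using the GA package, where sol0 is a hypothetical length-100 vector holding the best known assignment (the suggestions argument injects it into the initial population):

library(GA)
# sol0: hypothetical vector of length 100 with values in 1..5 (best known assignment)
res <- ga(type = "real-valued",
          fitness = fitness,
          lower = rep(1, 100), upper = rep(5.9999, 100),
          suggestions = matrix(sol0, nrow = 1),   # start from the best known solution
          popSize = 100, maxiter = 1000)
summary(res)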
No improvement was found.
Conclusions
- This problem was much more challenging than I expected.
- The MIP model based on LAD regression had real trouble, but in the end it was a good tool to improve the heuristic solution. Sometimes you have to be happy with just good solutions. Looking at the picture, the final solution matches the original lines quite well.
- I'll keep thinking about this a bit to see if I can come up with a better way to do this.
- Is there a way to find proven optimal solutions for this problem? Maybe even for the least-squares problem? I suspect the combinatorics are just difficult here.
References
1. A strange regression problem: multi-line fitting, http://yetanothermathprogrammingconsultant.blogspot.com/2021/02/a-strange-regression-problem.html
2. Linear Programming and LAD regression, https://yetanothermathprogrammingconsultant.blogspot.com/2017/11/lp-and-lad-regression.html
3. Wolfram Alpha, 2^500, https://www.wolframalpha.com/input/?i=2%5E500
Appendix
The best solution I found:
---- 216 PARAMETER sol4b  line number, best solution found

i1  5, i2  1, i3  2, i4  2, i5  4, i6  4, i7  4, i8  1, i9  5, i10 3, i11 1
i12 3, i13 1, i14 2, i15 5, i16 2, i17 5, i18 4, i19 3, i20 2, i21 4, i22 4
i23 4, i24 5, i25 3, i26 1, i27 4, i28 2, i29 2, i30 4, i31 5, i32 5, i33 5
i34 1, i35 2, i36 4, i37 3, i38 2, i39 5, i40 3, i41 3, i42 4, i43 2, i44 5
i45 4, i46 4, i47 3, i48 3, i49 2, i50 4, i51 4, i52 2, i53 3, i54 4, i55 4
i56 5, i57 2, i58 3, i59 5, i60 2, i61 5, i62 5, i63 3, i64 2, i65 4, i66 5
i67 3, i68 2, i69 4, i70 4, i71 4, i72 4, i73 5, i74 1, i75 4, i76 2, i77 4
i78 5, i79 3, i80 4, i81 4, i82 4, i83 4, i84 3, i85 5, i86 5, i87 4, i88 5
i89 4, i90 1, i91 1, i92 4, i93 4, i94 3, i95 4, i96 1, i97 5, i98 4, i99 5
i100 3

---- 216 PARAMETER a  estimated coefficients, best solution found

         const.term       slope      sum|r|
line1      -123.012       0.160     105.780
line2       -66.229       2.246     154.042
line3        -0.078       0.041     156.192
line4       155.979      -1.123     324.395
line5       211.432      -3.213     206.378
total                               946.787
I will see if I can script up an example of using mixture models sometime with this example. See https://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf for R, or I have an example using python and back-propagation here: https://github.com/apwheele/Blog_Code/blob/master/Using%20Pytorch%20for%20group%20based%20trajectory%20models.ipynb. (Although that is a little different; in that example I have repeated observations that should be assigned the same latent group.)
A difference with mixture models as opposed to this formulation is that they are fuzzy. So a point is not 100% assigned to a group. Could be good or bad I imagine -- areas here where the lines cross, for example, should have a higher posterior probability of belonging to multiple groups.
For both your solution and the mixture model approach, a main limitation is that you need to specify the number of groups to search for up front. So people typically look at 1, 2, 3 etc. and then do various fit statistics to see which one produces the best results.
Thanks. Good point about the relation to mixture models/EM algorithm.
Here is an R example; smaller sample sizes did not spit out 5 groups even when I asked, but offhand it did a pretty decent job recovering the models.
################################################
library("flexmix")
library('ggplot2')
set.seed(10)
# Generate simulated data
n <- 1000
x <- runif(n,0,100)
error <- rnorm(n,0,15)
m1 <- 200 - 3*x
m2 <- 150 - x
m3 <- 10
m4 <- -50 + 2*x
m5 <- -120
u <- sample(1:5,n,replace=TRUE)
mods <- data.frame(m1,m2,m3,m4,m5)
y <- mods[cbind(1:n,u)] + error
dat <- data.frame(x,y,u)
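# Fit a mixture of 5 linear regressions (flexmix estimates this via the EM algorithm)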
mix_mod <- flexmix(y ~ x, data = dat, k = 5)
dat$mix_assign <- clusters(mix_mod)
pprobs <- posterior(mix_mod)
dat$prob <- pprobs[cbind(1:n,dat$mix_assign)]
# Overlap
table(u, dat$mix_assign)
# True vs Observed predictions
p <- ggplot(data = dat, aes(x = x, y = y, size=prob, fill=as.factor(u))) +
geom_point(shape=21, alpha=0.8) + facet_wrap(~mix_assign)
p
# Showing the coefficients
rm1 <- refit(mix_mod)
summary(rm1)
################################################
That plot is fascinating.
For an exact algorithm, how about Benders decomposition with a MILP master problem that assigns points to curves and independent LP subproblems that solve the regression for each curve? Because each subproblem is always feasible, only optimality cuts are needed.
Have to try this out.
Playing with Benders decomposition (Cplex makes this really easy) showed that the Benders cuts were really weak.