There are two issues we can see from the plot. First, we can expect a better fit if we use a piecewise linear function with two segments. Second, the standard regression allows negative predictions for \(\hat{Y}\) when we have small \(X\) values.
The breakpoint is estimated to be \(\bar{x}=1.5\). The plots seem convincing.
Let's see what happens if we use some more advanced mathematical optimization models. The linear model is easily coded as a QP problem:
\[\begin{align} \min \>& \text{SSQ} = \sum_i e_i^2 \\ & y_i = a + b x_i + e_i \end{align} \] where \(x,y\) are data.
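Solving this QP is of course equivalent to solving the normal equations. A minimal R sketch (assuming the data vectors `x` and `y` are already loaded; this is an illustration, not the code used for the plots):

```r
# Ordinary least squares as the solution of the normal equations X'X beta = X'y.
# Assumes numeric vectors x and y are already in the workspace.
X <- cbind(1, x)                        # design matrix with an intercept column
beta <- solve(t(X) %*% X, t(X) %*% y)   # beta = (a, b)
SSQ <- sum((y - X %*% beta)^2)          # sum of squared residuals
```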
This purely linear fit is of course identical to the standard least-squares result:
Now let's see if we can tackle a piecewise linear function with two segments. For this we introduce a binary variable \(\delta_i \in \{0,1\}\) to indicate whether an observation belongs to the first or the second segment. A high-level model can look like:
\[\begin{align} \min & \>\text{SSQ} = \sum_i e_i^2 && (1)\\ & e_i = (1-\delta_i) r_{1,i} + \delta_i r_{2,i} &&(2)\\ &r_{k,i} = y_i - a_k - b_k x_i && (3)\\ & x_i \le \bar{x} \Rightarrow \delta_i = 0 &&(4)\\& x_i \gt \bar{x} \Rightarrow \delta_i = 1 && (5)\\ & \bar{y} = a_k + b_k \bar{x}&& (6)\end{align}\]
Here \((\bar{x}, \bar{y})\) is the break point (to be determined by the model).
If we delete the last constraint, we can formulate the problem as a convex MIQP. This requires linearization of the constraints (2), (4) and (5).
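One possible big-M linearization (a sketch, not necessarily the tightest formulation; \(M\) is a constant large enough to bound the residuals \(r_{k,i}\) and the range of the \(x_i\)):

\[\begin{align} & -M\delta_i \le e_i - r_{1,i} \le M\delta_i \\ & -M(1-\delta_i) \le e_i - r_{2,i} \le M(1-\delta_i)\\ & x_i - \bar{x} \le M\delta_i \\ & \bar{x} - x_i \le M(1-\delta_i) \end{align}\]

When \(\delta_i=0\) the first pair of constraints forces \(e_i=r_{1,i}\) and the third row forces \(x_i\le\bar{x}\); when \(\delta_i=1\) the roles are reversed. The strict inequality in (5) is relaxed at \(x_i=\bar{x}\), where either assignment is allowed.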
Adding the last constraint \(\bar{y} = a_k + b_k \bar{x}\) gives us a non-convex MINLP. After solving it we see:
Interestingly, these results are very different from [1]. A more formal optimization model can make a lot of difference compared to a more ad-hoc method. The pictures from [1] indicate we are not optimal with respect to the total sum of squares: the continuous version of the segmented model with a breakpoint at 1.68 as suggested in [1] has SSQ = 0.00235, which is larger than our SSQ = 0.00206459 with a breakpoint at 1.8135.
The two rightmost observations are probably outliers that should be removed from the model.
Update: Alternative formulations
If the \(x_i\) are sorted, i.e. \(x_{i+1} \ge x_i\), we can replace the implications
\[\begin{align} & x_i \le \bar{x} \Rightarrow \delta_i = 0 &&(4)\\& x_i \gt \bar{x} \Rightarrow \delta_i = 1 && (5) \end{align}\] by \[\delta_{i+1} \ge \delta_{i}\] I.e., we switch just once from 0 to 1. Of course the sorting is no problem: we can sort the data before doing the optimization.
If, furthermore, we restrict the breakpoint \(\bar{x}\) to one of the data points, additional reformulations are possible (see the comments below). This can make the model convex. I think for the problem at hand we can add this restriction without much impact on the sum of squares. It may also be a very good model to solve first: it will provide an excellent starting point for the non-convex model. (I used the discontinuous version to generate a good starting point for the non-convex MINLP, but surely this restricted version must be better.)
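As an illustration of that restricted version, here is a small R sketch (hypothetical code, assuming data vectors `x` and `y`). For a fixed \(\bar{x}\) the continuous two-segment fit is an ordinary least-squares problem with the extra regressor \(\max(x_i-\bar{x},0)\), so the optimum over the data-point breakpoints is found by enumeration:

```r
# Enumerate candidate breakpoints restricted to the interior data points.
# For fixed xbar the continuous two-segment model is linear in (a, b, c):
#   y = a + b*x + c*max(x - xbar, 0) + e
xs <- sort(unique(x))
best <- list(ssq = Inf)
for (xbar in xs[-c(1, length(xs))]) {
  z <- pmax(x - xbar, 0)               # extra regressor for the second segment
  fit <- lm(y ~ x + z)                 # ordinary least squares for this xbar
  ssq <- sum(residuals(fit)^2)
  if (ssq < best$ssq) best <- list(ssq = ssq, xbar = xbar, fit = fit)
}
best$xbar   # breakpoint restricted to a data point
best$ssq    # corresponding sum of squares
```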
In [2] the problem is stated as the regression: \[y = a + b x + c (x-\bar{x})_{\mathbf{+}}\] where \((x)_{\mathbf{+}} = \max(x,0)\). Here \(c\) is the difference in slope between the two segments. This leads to a simpler optimization problem. The term \(z_i = \max(x_i-\bar{x},0)\) can be linearized using a big-M formulation. The model can look like: \[\begin{align} \min \>& \sum_i e_i^2 \\ & y_i = a + b x_i + c z_i + e_i\\ & z_i = \max(x_i-\bar{x},0) \end{align}\] As we still have a multiplication of two variables \(c z_i\), we have not made much progress.
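For completeness, the big-M linearization of the \(\max\) term mentioned above could look like this (a sketch; \(\delta_i\) is a binary variable and \(M\) a constant bounding \(|x_i-\bar{x}|\)):

\[\begin{align} & z_i \ge x_i - \bar{x}\\ & z_i \ge 0 \\ & z_i \le x_i - \bar{x} + M(1-\delta_i)\\ & z_i \le M \delta_i \end{align}\]

This removes the \(\max\), but the bilinear term \(c z_i\) is untouched, so the model remains non-convex.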
R package `segmented`
The R package segmented [2] is actually doing slightly better than the ad-hoc approach of [1]. Like the previous Baron solution it puts the last two observations into the second segment, but then moves \(\bar{x}\) somewhat:
```
> lm.out <- lm(y~x)
> summary(lm.out)

Call:
lm(formula = y ~ x)

Residuals:
      Min        1Q    Median        3Q       Max
-0.018152 -0.002764  0.000568  0.002367  0.035213

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.013224   0.002454   -5.39 8.06e-07 ***
x            0.021558   0.002047   10.53 2.26e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.007519 on 74 degrees of freedom
Multiple R-squared:  0.5999, Adjusted R-squared:  0.5945
F-statistic:   111 on 1 and 74 DF,  p-value: 2.257e-16

> SSQ <- sum(lm.out$residuals^2)
> SSQ
[1] 0.004183108
> seg.out <- segmented(lm.out, seg.Z = ~x)
> summary(seg.out)

	***Regression Model with Segmented Relationship(s)***

Call: segmented.lm(obj = lm.out, seg.Z = ~x)

Estimated Break-Point(s):
   Est. St.Err
  1.821  0.097

Meaningful coefficients of the linear terms:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.008816   0.001826  -4.828 7.52e-06 ***
x            0.016833   0.001564  10.763  < 2e-16 ***
U1.x         0.152178   0.063636   2.391       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.005359 on 72 degrees of freedom
Multiple R-Squared: 0.8022,  Adjusted R-squared: 0.794

Convergence attained in 4 iterations with relative change 2.835799e-16

> SSQ2 <- sum(seg.out$residuals^2)
> SSQ2
[1] 0.00206802
```
This is very close to what Baron produced (Baron is just slightly better).
References
1. Sandra E. Ryan and Laurie S. Porth, A tutorial on the piecewise regression approach applied to bedload transport data, Gen. Tech. Rep. RMRS-GTR-189, Fort Collins, CO: U.S. Department of Agriculture, Forest Service, https://www.fs.fed.us/rm/pubs/rmrs_gtr189.pdf
2. Muggeo, V.M.R. (2008), segmented: an R package to fit regression models with broken-line relationships, R News 8/1, 20–25.