Sunday, March 18, 2018

Piecewise linear regression

In [1] a data set is presented, where a piecewise or segmented linear regression approach can help. A linear fit looks like:

There are two issues we can see from the plot: we can expect a better fit if we can use a piecewise linear function with 2 segments. Secondly the standard regression will allow negative predictions for \(\hat{Y}\) when we have small \(X\) values.

The breakpoint is estimated to be \(\bar(x)=1.5\). The plots seem convincing.

These results were produced using SAS.

Let's see what happens if we use some more advanced mathematical optimization models. The linear model is easily coded as QP problem:

\[\begin{align} \min \>& \text{SSQ} = \sum_i e_i^2 \\ & y_i = a + b x_i + e_i \end{align} \] where \(x,y\) are data.

This purely linear fit is of course identical:

Now let's see if we can tackle a piecewise linear function with two segments. For this we introduce a binary variable \(\delta_i \in \{0,1\}\) to indicate if an observation belongs to the first or the second segment. A high-level model can look like:

\[\begin{align} \min & \>\text{SSQ} = \sum_i e_i^2 && (1)\\ & e_i = (1-\delta_i) r_{1,i} + \delta_i r_{2,i} &&(2)\\ &r_{k,i} = y_i - a_k - b_k x_i && (3)\\ & x_i \le \bar{x} \Rightarrow \delta_i = 0  &&(4)\\& x_i \gt \bar{x} \Rightarrow \delta_i = 1 && (5)\\ & \bar{y} = a_k + b_k \bar{x}&& (6)\end{align}\]

Here \((\bar{x}, \bar{y})\) is the break point (to be determined by the model).

If we delete the last constraint, we can formulate the problem as a convex MIQP. This requires linearization of the constraints (2), (4) and (5).

Adding the last constraint \(\bar{y} = a_k + b_k \bar{x}\) gives us a non-convex MINLP. After solving it we see:

Interestingly these results are very different from [1]. A more formal optimization model can make a lot of difference compared to a more ad-hoc method. The pictures from [1] indicate we are not optimal with respect to the total Sum of Squares. The continuous version of the segmented model with a breakpoint at 1.68 as suggested in [1] has: SSQ=0.00235. This is larger than our SSQ=0.00206459 with a breakpoint at 1.8135.

The two rightmost observations are probably outliers that should be removed from the model.

Update: Alternative formulations 

If the \(x_i\) are sorted: \(x_{i+1} \ge x_i\), we can replace the implications 
\[\begin{align} & x_i \le \bar{x} \Rightarrow \delta_i = 0  &&(4)\\& x_i \gt \bar{x} \Rightarrow \delta_i = 1 && (5) \end{align}\]  by \[\delta_{i+1} \ge \delta_{i}\] I.e. we switch once from 0 to 1, Of course the sorting is no problem: we can sort the data before doing the optimization.

If furthermore we restrict the breakpoint \(\bar{x}\) to one of the data points, additional reformulations are possible. See the comments below. This can make the model convex. I think for the problem at hand we can add this restriction without much impact on the sum of squares. Also this may be a very good model to solve first: it will be an excellent starting point for the non-convex model. (I used the discontinuous version to generate a good starting point for the non-convex MINLP, but surely this version must be better).

In [2] the problem is stated as the regression: \[y = a + b x + c (x-\bar{x}) _{\mathbf{+}}\] where \(x_{\mathbf{+}} = \max(x,0)\). \(c\) is now the difference in slope for the second segment. This leads to a simpler optimization problem. The term \(x-\bar{x}\) can be linearized using a big-M formulation. The model can look like: \[\begin{align} \min \>& \sum_i e_i^2 \\ & y_i  = a + b x_i + c z_i + e_i\\ & z_i = \max(x_i-\bar{x},0) \end{align}\] As we have a multiplication of two variables \(c z_i\), we have not made much progress,

R package `segmented`

The R package segmented[2] is actually doing slightly better. Like the previous Baron solution it puts the last two observations in to the second segment, but then moves \(\bar{x}\) somewhat:

> lm.out <- lm(y~x)
> summary(lm.out)

lm(formula = y ~ x)

      Min        1Q    Median        3Q       Max 
-0.018152 -0.002764  0.000568  0.002367  0.035213 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.013224   0.002454   -5.39 8.06e-07 ***
x            0.021558   0.002047   10.53 2.26e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.007519 on 74 degrees of freedom
Multiple R-squared:  0.5999, Adjusted R-squared:  0.5945 
F-statistic:   111 on 1 and 74 DF,  p-value: 2.257e-16

> SSQ <- sum(lm.out$residuals^2)
[1] 0.004183108
> seg.out<-segmented(lm.out,seg.Z = ~x)
> summary(seg.out)

 ***Regression Model with Segmented Relationship(s)***

segmented.lm(obj = lm.out, seg.Z = ~x)

Estimated Break-Point(s):
   Est. St.Err 
 1.821  0.097 

Meaningful coefficients of the linear terms:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.008816   0.001826  -4.828 7.52e-06 ***
x            0.016833   0.001564  10.763  < 2e-16 ***
U1.x         0.152178   0.063636   2.391       NA    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.005359 on 72 degrees of freedom
Multiple R-Squared: 0.8022,  Adjusted R-squared: 0.794 

Convergence attained in 4 iterations with relative change 2.835799e-16 
> SSQ2 <- sum(seg.out$residuals^2)
> SSQ2
[1] 0.00206802

This is very close to what Baron produced (Baron is just a slightly bit better).


  1. Sandra E. Ryan; Laurie S. Porth, A tutorial on the piecewise regression approach applied to bedload transport data, Gen. Tech. Rep. RMRS-GTR-189. Fort Collins, CO: U.S. Department of Agriculture, Forest Service,
  2. Muggeo, V.M.R. (2008) Segmented: an R package to fit regression models with broken-line relationships. R News 8/1, 20–25.