Yet Another Math Programming Consultant: Optimal double exponential smoothing

Holt-Winters Double Exponential Smoothing method is a well-known method for smoothing data with a trend [1]. In [2] an Excel implementation was demonstrated. The formulas (from [2]) are: \[\begin{align} &u_1 = y_1 \\ & v_1 = 0 \\& u_t = \alpha y_t + (1-\alpha) (u_{t-1}+v_{t-1})&& t\gt 1 \\ &v_t = \beta (u_t - u_{t-1}) + (1-\beta)v_{t-1}&& t\gt 1 \\ & \hat{y}_{t+1} = u_{t}+v_{t} \end{align}\] The symbols are:

\(y_t\) is the data,
\(u_t\) and \(v_t\) are estimates of the smoothed value and the trend,
\(\alpha\) and \(\beta\) are parameters. We have \(0 \le \alpha \le 1\) and \(0 \le \beta \le 1\).
\(\hat{y}_t\) is the estimate for \(y\).

In [2] the MAE (Mean Absolute Error) is minimized. This is \[ \min \>\mathit{MAE}= \frac{\displaystyle\sum_{t\gt1} |\hat{y}_t - y_t|}{T-1} \] where \(T\) is the number of time periods. Of course, in practice a more popular measure is to minimize the Mean Squared Error. One reason for choosing a least absolute value criterion is that this provides robustness when outliers are present.

Data

I use the small data set from [2]:

----     23 PARAMETER y  data

t1   3.000,    t2   5.000,    t3   9.000,    t4  20.000,    t5  12.000,    t6  17.000,    t7  22.000,    t8  23.000
t9  51.000,    t10 41.000,    t11 56.000,    t12 75.000,    t13 60.000,    t14 75.000,    t15 88.000

Model

Often, in statistical software, the MSE version of the problem is modeled as a black-box function of just \(\alpha\) and \(\beta\) and solved using say a BFGS-B method (usually without analytic gradients, so the solver will use finite-differences). They typically ignore that we can end up in a local minimum (although most routines allow us to specify a starting point for \(\alpha\) and \(\beta\)). Here I want to state an explicit non-convex quadratic model that we can solve with different types of solvers. This approach will give us a much larger model, with variables \(\alpha\), \(\beta\), \(u\), \(v\), and \(\hat{y}\). In addition, when using an MAE objective, we need extra variables to linearize the absolute value.

Non-convex Quadratic Model
\[\begin{align}\min\>&\color{darkred}{\mathit{MAE}}=\sum_{t\gt 1} \color{darkred}{\mathit{abserr}}_t \\ & \color{darkred}u_1 = \color{darkblue}y_1 \\ & \color{darkred}v_1 = 0 \\ & \color{darkred}u_t = \color{darkred}\alpha \cdot \color{darkblue}y_t + (1-\color{darkred}\alpha) \cdot (\color{darkred}u_{t-1}+\color{darkred}v_{t-1})&& t\gt 1\\& \color{darkred}v_t = \color{darkred}\beta \cdot (\color{darkred}u_t - \color{darkred}u_{t-1}) + (1-\color{darkred}\beta)\cdot \color{darkred}v_{t-1}&& t\gt 1 \\ & \color{darkred}{\hat{y}}_{t+1} = \color{darkred}u_{t}+\color{darkred}v_{t}\\ & -\color{darkred}{\mathit{abserr}}_t \le \color{darkred}{\hat{y}}_{t}- \color{darkblue}y_t \le \color{darkred}{\mathit{abserr}}_t && t\gt 1\\ & \color{darkred}\alpha \in [0,1] \\ & \color{darkred}\beta \in [0,1]\\ & \color{darkred}{\mathit{abserr}}_t \ge 0\end{align}\]

Non-convex Quadratic Model

\[\begin{align}\min\>&\color{darkred}{\mathit{MAE}}=\sum_{t\gt 1} \color{darkred}{\mathit{abserr}}_t \\ & \color{darkred}u_1 = \color{darkblue}y_1 \\ & \color{darkred}v_1 = 0 \\ & \color{darkred}u_t = \color{darkred}\alpha \cdot \color{darkblue}y_t + (1-\color{darkred}\alpha) \cdot (\color{darkred}u_{t-1}+\color{darkred}v_{t-1})&& t\gt 1\\& \color{darkred}v_t = \color{darkred}\beta \cdot (\color{darkred}u_t - \color{darkred}u_{t-1}) + (1-\color{darkred}\beta)\cdot \color{darkred}v_{t-1}&& t\gt 1 \\ & \color{darkred}{\hat{y}}_{t+1} = \color{darkred}u_{t}+\color{darkred}v_{t}\\ & -\color{darkred}{\mathit{abserr}}_t \le \color{darkred}{\hat{y}}_{t}- \color{darkblue}y_t \le \color{darkred}{\mathit{abserr}}_t && t\gt 1\\ & \color{darkred}\alpha \in [0,1] \\ & \color{darkred}\beta \in [0,1]\\ & \color{darkred}{\mathit{abserr}}_t \ge 0\end{align}\]

The first thing to do is to see if we can reproduce the results for \(\alpha=0.4\) and \(\beta=0.7\) as they are reported in [1]. We do this by fixing the variables \(\alpha\) and \(\beta\) and throw the model into my favorite NLP solver: CONOPT. We only need to fix these two variables, as all other levels will be pinned down by this. This fixed problem solves very fast:

   Iter Phase Ninf   Infeasibility   RGmax    NSB   Step InItr MX OK
      0   0        7.8250000000E+02 (Input point)
 
                   Pre-triangular equations:   42
                   Post-triangular equations:  29
 
      1   0        0.0000000000E+00 (After pre-processing)
      2   0        0.0000000000E+00 (After scaling)
 
 ** Feasible solution. Value of objective =    7.40965962794
 
   Iter Phase Ninf     Objective     RGmax    NSB   Step InItr MX OK
      4   3        7.4096596279E+00 0.0E+00     0
 
 ** Optimal solution. There are no superbasic variables.

Basically, CONOPT recovers the optimal solution during preprocessing. A few more iterations are needed, probably to get the basis right. The objective value is the same as reported in [2]. With this little experiment, we have learned a lot. Most likely our constraints and objective are correctly formulated.

Solution

We are now ready to try to solve the problem. When using \(\alpha=0.4\) and \(\beta=0.7\) as starting point, CONOPT quickly finds a very good solution. A solution report can look like:

----     75 PARAMETER results  

              y           u           v        yhat         |e|

t1      3.00000     3.00000
t2      5.00000     3.42764     0.37003     3.00000     2.00000
t3      9.00000     4.91004     1.33254     3.79767     5.20233
t4     20.00000     9.18421     3.87786     6.24258    13.75742
t5     12.00000    12.83498     3.68136    13.06207     1.06207
t6     17.00000    16.61976     3.77085    16.51634     0.48366
t7     22.00000    20.73473     4.06861    20.39061     1.60939
t8     23.00000    24.41775     3.73497    24.80334     1.80334
t9     51.00000    33.03795     7.96205    28.15271    22.84729
t10    41.00000    41.00000     7.96205    41.00000
t11    56.00000    50.46691     9.26417    48.96205     7.03795
t12    75.00000    62.99591    12.08915    59.73109    15.26891
t13    60.00000    71.85955     9.29819    75.08506    15.08506
t14    75.00000    79.84108     8.15892    81.15774     6.15774
t15    88.00000    88.00000     8.15892    88.00000


----     76 VARIABLE alpha.L               =      0.21382  
            VARIABLE beta.L                =      0.86528  
            VARIABLE MAE.L                 =      6.59394

Conopt found this solution very quickly:

   Iter Phase Ninf   Infeasibility   RGmax    NSB   Step InItr MX OK
      0   0        7.8250000000E+02 (Input point)
 
                   Pre-triangular equations:   1
                   Post-triangular equations:  5
                   Definitional equations:     39
 
      1   0        1.0173523479E+02 (After pre-processing)
      2   0        3.6729535718E+00 (After scaling)
     10   0    12  3.3409268983E+00               0.0E+00      F  F
 
 ** Feasible solution. Value of objective =    7.07090607760
 
   Iter Phase Ninf     Objective     RGmax    NSB   Step InItr MX OK
     16   3        6.5939399026E+00 2.1E+00     3 1.0E+00    3 T  T
     18   3        6.5939398862E+00 0.0E+00     0
 
 ** Optimal solution. There are no superbasic variables.

When we compare this with the Excel solution from [2], we see the CONOPT solution is actually better.

Optimal Excel Solver Solution (from [2])

What we see is:

Variable	Excel solution [2]	Conopt solution
\(\alpha\)	0.271817	0.21382
\(\beta\)	0.598161	0.86528
MAE	6.740208	6.59394

This is typical for non-convex problems: local NLP solvers may converge to a local optimum. One way to get a better handle on this is to use different starting points. Another way is to try a global solver. Using global solvers, we can show that indeed Conopt found the global optimum.

Starting points

Here we try quickly a few local solvers, with different starting points. The QCP (Quadratically Constrained Program) is non-convex. This means we should be prepared for local optimal. When different solvers starting from different initial points converge to the same solution, we have some (limited) confidence that this point is a global optimum. Note that when solving as a general-purpose NLP model, the solver does not know the model is quadratic. The local solver also does not know whether the optimal solution it finds is just a local optimum or a global one. It is noted that with this extended QCP formulation, the solver is provided with exact gradients.

Solver	Starting point	Objective	Iterations
Conopt	\(\alpha=0.4,\beta=0.7\)	6.5939	18
Conopt	\(\alpha=0,\beta=0\)	6.5939	29
Conopt	\(\alpha=1,\beta=1\)	6.5939	17
Minos	\(\alpha=0.4,\beta=0.7\)	6.5939	8
Minos	\(\alpha=0,\beta=0\)	6.5939	14
Minos	\(\alpha=1,\beta=1\)	6.5939	16
Snopt	\(\alpha=0.4,\beta=0.7\)	6.5939	20
Snopt	\(\alpha=0,\beta=0\)	6.5939	14
Snopt	\(\alpha=1,\beta=1\)	6.5939	17
Ipopt	\(\alpha=0.4,\beta=0.7\)	6.5939	23
Ipopt	\(\alpha=0,\beta=0\)	6.5939	14
Ipopt	\(\alpha=1,\beta=1\)	6.5939	26

Local solvers have in general no problem in finding the same optimal solution. This is some evidence this is the global optimum.

Note: I was a bit lazy and left \(u_t\) and \(v_t\) at their default initial point of zero.

A global approach

The first thing I did, was to throw the above model unmodified at some global NLP solvers. This was a big disappointment for a small model like this:

Solver	Objective	Time	Gap	Note
Baron	6.5939	3600	95%	Time limit
Couenne	6.5939	3600	78%	Time limit
Antigone	6.5939	184	0%	Optimal
Gurobi	6.5946	218	0%	Optimal

Note: The runs with solvers Baron/Couenne/Antigone were done on a somewhat slow laptop. Gurobi ran on a faster (virtual) machine. So timings are not directly comparable between these solvers but are just indicative.

Looking again at our model, we can see that the model is way more non-linear than it needs to be. To fix this, we make the model even larger (more variables, more equations). But also less non-linear. The idea is that we replace terms like \[(x+y)z=xz+yz\] by \[\begin{align} & w=x+y \\ & wz \end{align}\] When we do this for our model, we get:

Non-convex Quadratic Model v2
\[\begin{align}\min\>&\color{darkred}{\mathit{MAE}}=\sum_{t\gt 1} \color{darkred}{\mathit{abserr}}_t \\ & \color{darkred}u_1 = \color{darkblue}y_1 \\ & \color{darkred}v_1 = 0 \\ & \color{darkred}w_t = \color{darkred}u_t+\color{darkred}v_t\\ & \color{darkred}u_t = \color{darkred}\alpha \cdot \color{darkblue}y_t + \color{darkred}w_{t-1} -\color{darkred}\alpha \cdot \color{darkred}w_{t-1}&& t\gt 1\\ & \color{darkred}x_t = \color{darkred}u_t - \color{darkred}u_{t-1} -\color{darkred}v_{t-1} && t\gt 1 \\& \color{darkred}v_t = \color{darkred}\beta \cdot \color{darkred}x_t + \color{darkred}v_{t-1}&& t\gt 1 \\ & \color{darkred}{\hat{y}}_{t+1} = \color{darkred}u_{t}+\color{darkred}v_{t}\\ & -\color{darkred}{\mathit{abserr}}_t \le \color{darkred}{\hat{y}}_{t}- \color{darkblue}y_t \le \color{darkred}{\mathit{abserr}}_t && t\gt 1\\ & \color{darkred}\alpha \in [0,1] \\ & \color{darkred}\beta \in [0,1]\\ & \color{darkred}{\mathit{abserr}}_t \ge 0\end{align}\]

Non-convex Quadratic Model v2

\[\begin{align}\min\>&\color{darkred}{\mathit{MAE}}=\sum_{t\gt 1} \color{darkred}{\mathit{abserr}}_t \\ & \color{darkred}u_1 = \color{darkblue}y_1 \\ & \color{darkred}v_1 = 0 \\ & \color{darkred}w_t = \color{darkred}u_t+\color{darkred}v_t\\ & \color{darkred}u_t = \color{darkred}\alpha \cdot \color{darkblue}y_t + \color{darkred}w_{t-1} -\color{darkred}\alpha \cdot \color{darkred}w_{t-1}&& t\gt 1\\ & \color{darkred}x_t = \color{darkred}u_t - \color{darkred}u_{t-1} -\color{darkred}v_{t-1} && t\gt 1 \\& \color{darkred}v_t = \color{darkred}\beta \cdot \color{darkred}x_t + \color{darkred}v_{t-1}&& t\gt 1 \\ & \color{darkred}{\hat{y}}_{t+1} = \color{darkred}u_{t}+\color{darkred}v_{t}\\ & -\color{darkred}{\mathit{abserr}}_t \le \color{darkred}{\hat{y}}_{t}- \color{darkblue}y_t \le \color{darkred}{\mathit{abserr}}_t && t\gt 1\\ & \color{darkred}\alpha \in [0,1] \\ & \color{darkred}\beta \in [0,1]\\ & \color{darkred}{\mathit{abserr}}_t \ge 0\end{align}\]

This version of the model behaves better:

Solver	Obj	Time	Gap	Note
Baron	6.5939	60	0%	Optimal
Couenne	6.5939	3600	49%	Time limit
Antigone	6.5939	157	0%	Optimal
Gurobi	6.5939	157	0%	Optimal

This reformulation is a big help for Baron, but also helps Gurobi delivering more accurate solutions.

How difficult is this problem?

We saw that Conopt (and other local NLP solvers) found the global optimal solution without any problems. Of course, Conopt does not know whether it is a local or a global solution. So the question arises how difficult is this problem? As it turns out, the problem is actually very well-behaved. Any good local NLP solver will likely find the global optimum, even when feeding it our non-convex QCP problem. The following plot will illuminate this:

Conclusion

The problem of finding optimal parameters \(\alpha\) and \(\beta\) for the Holt-Winters double exponential smoothing method leads to a non-convex quadratic optimization problem. Local solvers may deliver solutions that are only locally optimal (and thus not necessarily globally optimal). When trying to solve the global problem, it may help to reformulate the model in order to minimize the number of quadratic terms.

In practice, the problem is quite well-behaved, and a good NLP solver should be able to find the global optimum solution quite easily. It looks like Excel's solver is just stopping a bit prematurely. Interestingly quadratic solvers like Cplex and Gurobi are not as fast or suitable as a general-purpose local NLP solver for this problem.

The GAMS source of the first model can be found in [4].

References

Exponential Smoothing, https://en.wikipedia.org/wiki/Exponential_smoothing
Holt's Linear Trend, https://www.real-statistics.com/time-series-analysis/basic-time-series-forecasting/holt-linear-trend/
Handanhal V. Ravinder, Determining The Optimal Values Of Exponential Smoothing Constants – Does Solver Really Work?, American Journal Of Business Education, Volume 6, Number 3, 2013, https://files.eric.ed.gov/fulltext/EJ1054363.pdf. This paper shows some computational experience with Excel on these types of models.
How to setup a minimization problem in GAMS, https://stackoverflow.com/questions/63738161/how-to-setup-a-minimization-problem-in-gams

Yet Another Math Programming Consultant

Saturday, September 5, 2020

Optimal double exponential smoothing