The Holt-Winters double exponential smoothing method is a well-known technique for smoothing data with a trend [1]. In [2], an Excel implementation is demonstrated. The formulas (from [2]) are: \[\begin{align} &u_1 = y_1 \\ & v_1 = 0 \\& u_t = \alpha y_t + (1-\alpha) (u_{t-1}+v_{t-1})&& t\gt 1 \\ &v_t = \beta (u_t - u_{t-1}) + (1-\beta)v_{t-1}&& t\gt 1 \\ & \hat{y}_{t+1} = u_{t}+v_{t} \end{align}\] The symbols are:
- \(y_t\) is the data,
- \(u_t\) and \(v_t\) are estimates of the smoothed value and the trend,
- \(\alpha\) and \(\beta\) are parameters. We have \(0 \le \alpha \le 1\) and \(0 \le \beta \le 1\).
- \(\hat{y}_t\) is the one-step-ahead forecast (estimate) of \(y_t\).
In [2] the MAE (Mean Absolute Error) is minimized: \[ \min \>\mathit{MAE}= \frac{\displaystyle\sum_{t\gt1} |\hat{y}_t - y_t|}{T-1} \] where \(T\) is the number of time periods. Of course, in practice the more popular criterion is to minimize the Mean Squared Error (MSE). One reason for choosing a least absolute value criterion is that it provides robustness when outliers are present.
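To make the recursions concrete, here is a minimal Python sketch of the smoother and the MAE computation. The function name `holt_mae` and the layout are mine (not from [2]); the recursions follow the formulas above.

```python
import numpy as np

def holt_mae(alpha, beta, y):
    """Holt double exponential smoothing: returns (u, v, yhat, MAE).

    u[0] = y[0], v[0] = 0, and yhat[t] = u[t-1] + v[t-1] for t > 0.
    The MAE is averaged over the T-1 periods that have a forecast.
    """
    T = len(y)
    u = np.empty(T)
    v = np.empty(T)
    yhat = np.full(T, np.nan)          # no forecast for the first period
    u[0], v[0] = y[0], 0.0
    for t in range(1, T):
        yhat[t] = u[t - 1] + v[t - 1]  # one-step-ahead forecast
        u[t] = alpha * y[t] + (1 - alpha) * (u[t - 1] + v[t - 1])
        v[t] = beta * (u[t] - u[t - 1]) + (1 - beta) * v[t - 1]
    mae = np.mean(np.abs(yhat[1:] - y[1:]))
    return u, v, yhat, mae

# the small data set used below
y = np.array([3, 5, 9, 20, 12, 17, 22, 23, 51, 41, 56, 75, 60, 75, 88], dtype=float)
```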
Data
I use the small data set from [2]:
```
----     23 PARAMETER y  data

t1   3.000,    t2   5.000,    t3   9.000,    t4  20.000,    t5  12.000,    t6  17.000,    t7  22.000,    t8  23.000
t9  51.000,    t10 41.000,    t11 56.000,    t12 75.000,    t13 60.000,    t14 75.000,    t15 88.000
```
Model
Often, in statistical software, the MSE version of the problem is modeled as a black-box function of just \(\alpha\) and \(\beta\) and solved with, say, L-BFGS-B (usually without analytic gradients, so the solver falls back on finite differences). Such implementations typically ignore that we can end up in a local minimum (although most routines let us specify a starting point for \(\alpha\) and \(\beta\)). Here I want to state an explicit non-convex quadratic model that we can solve with different types of solvers. This approach gives us a much larger model, with variables \(\alpha\), \(\beta\), \(u\), \(v\), and \(\hat{y}\). In addition, when using an MAE objective, we need extra variables to linearize the absolute value.
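As an illustration of that black-box approach (not of the model developed below), a minimal SciPy sketch could look as follows. It reuses the `y` array defined earlier, minimizes the MSE, and relies on L-BFGS-B's finite-difference gradients.

```python
from scipy.optimize import minimize

def holt_mse(params, y):
    """Black-box objective: MSE of the one-step-ahead forecasts."""
    alpha, beta = params
    u, v = y[0], 0.0
    sse = 0.0
    for t in range(1, len(y)):
        yhat = u + v
        sse += (yhat - y[t]) ** 2
        u_new = alpha * y[t] + (1 - alpha) * (u + v)
        v = beta * (u_new - u) + (1 - beta) * v
        u = u_new
    return sse / (len(y) - 1)

res = minimize(holt_mse, x0=[0.4, 0.7], args=(y,), method="L-BFGS-B",
               bounds=[(0.0, 1.0), (0.0, 1.0)])  # gradients via finite differences
print(res.x, res.fun)
```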
Non-convex Quadratic Model |
---|
\[\begin{align}\min\>&\color{darkred}{\mathit{MAE}}=\frac{1}{T-1}\sum_{t\gt 1} \color{darkred}{\mathit{abserr}}_t \\ & \color{darkred}u_1 = \color{darkblue}y_1 \\ & \color{darkred}v_1 = 0 \\ & \color{darkred}u_t = \color{darkred}\alpha \cdot \color{darkblue}y_t + (1-\color{darkred}\alpha) \cdot (\color{darkred}u_{t-1}+\color{darkred}v_{t-1})&& t\gt 1\\& \color{darkred}v_t = \color{darkred}\beta \cdot (\color{darkred}u_t - \color{darkred}u_{t-1}) + (1-\color{darkred}\beta)\cdot \color{darkred}v_{t-1}&& t\gt 1 \\ & \color{darkred}{\hat{y}}_{t+1} = \color{darkred}u_{t}+\color{darkred}v_{t}\\ & -\color{darkred}{\mathit{abserr}}_t \le \color{darkred}{\hat{y}}_{t}- \color{darkblue}y_t \le \color{darkred}{\mathit{abserr}}_t && t\gt 1\\ & \color{darkred}\alpha \in [0,1] \\ & \color{darkred}\beta \in [0,1]\\ & \color{darkred}{\mathit{abserr}}_t \ge 0\end{align}\] |
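
A possible Pyomo rendition of this model is sketched below. This is an illustration only (the actual GAMS source is in [4]); the variable and constraint names are mine, and a local NLP solver such as Ipopt is assumed to be installed.

```python
import pyomo.environ as pyo

T = range(1, len(y) + 1)            # periods 1..15, matching the math above
Tgt1 = [t for t in T if t > 1]      # periods that have a forecast
ydata = {t: float(y[t - 1]) for t in T}

m = pyo.ConcreteModel()
m.alpha = pyo.Var(bounds=(0, 1), initialize=0.4)
m.beta = pyo.Var(bounds=(0, 1), initialize=0.7)
m.u = pyo.Var(T)
m.v = pyo.Var(T)
m.yhat = pyo.Var(Tgt1)
m.abserr = pyo.Var(Tgt1, within=pyo.NonNegativeReals)

m.u1 = pyo.Constraint(expr=m.u[1] == ydata[1])
m.v1 = pyo.Constraint(expr=m.v[1] == 0)
m.udef = pyo.Constraint(Tgt1, rule=lambda m, t:
    m.u[t] == m.alpha * ydata[t] + (1 - m.alpha) * (m.u[t - 1] + m.v[t - 1]))
m.vdef = pyo.Constraint(Tgt1, rule=lambda m, t:
    m.v[t] == m.beta * (m.u[t] - m.u[t - 1]) + (1 - m.beta) * m.v[t - 1])
m.ydef = pyo.Constraint(Tgt1, rule=lambda m, t:
    m.yhat[t] == m.u[t - 1] + m.v[t - 1])
m.errlo = pyo.Constraint(Tgt1, rule=lambda m, t:
    -m.abserr[t] <= m.yhat[t] - ydata[t])
m.errup = pyo.Constraint(Tgt1, rule=lambda m, t:
    m.yhat[t] - ydata[t] <= m.abserr[t])
m.obj = pyo.Objective(expr=sum(m.abserr[t] for t in Tgt1) / (len(y) - 1))

pyo.SolverFactory("ipopt").solve(m)  # any local NLP solver available to Pyomo
print(pyo.value(m.alpha), pyo.value(m.beta), pyo.value(m.obj))
```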
The first thing to do is to check whether we can reproduce the results for \(\alpha=0.4\) and \(\beta=0.7\) as reported in [2]. We do this by fixing the variables \(\alpha\) and \(\beta\) and throwing the model into my favorite NLP solver: CONOPT. We only need to fix these two variables; all other levels are then pinned down. This fixed problem solves very fast:
```
   Iter Phase Ninf   Infeasibility   RGmax    NSB   Step InItr MX OK
      0   0        7.8250000000E+02 (Input point)

                    Pre-triangular equations:    42
                    Post-triangular equations:   29

      1   0        0.0000000000E+00 (After pre-processing)
      2   0        0.0000000000E+00 (After scaling)

 ** Feasible solution. Value of objective =    7.40965962794

   Iter Phase Ninf     Objective     RGmax    NSB   Step InItr MX OK
      4   3        7.4096596279E+00 0.0E+00     0

 ** Optimal solution. There are no superbasic variables.
```
Basically, CONOPT recovers the optimal solution during preprocessing. A few more iterations are needed, probably to get the basis right. The objective value is the same as reported in [2]. This little experiment teaches us a lot: most likely, our constraints and objective are formulated correctly.
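The same sanity check can be reproduced with the `holt_mae` sketch from earlier:

```python
# evaluate the recursions at the fixed parameters from [2]
u, v, yhat, mae = holt_mae(0.4, 0.7, y)
print(round(mae, 5))   # expected to be close to 7.40966, the objective in the CONOPT log
```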
Solution
We are now ready to try to solve the problem. When using \(\alpha=0.4\) and \(\beta=0.7\) as starting point, CONOPT quickly finds a very good solution. A solution report can look like:
```
----     75 PARAMETER results

             y          u          v       yhat        |e|

t1     3.00000    3.00000
t2     5.00000    3.42764    0.37003    3.00000    2.00000
t3     9.00000    4.91004    1.33254    3.79767    5.20233
t4    20.00000    9.18421    3.87786    6.24258   13.75742
t5    12.00000   12.83498    3.68136   13.06207    1.06207
t6    17.00000   16.61976    3.77085   16.51634    0.48366
t7    22.00000   20.73473    4.06861   20.39061    1.60939
t8    23.00000   24.41775    3.73497   24.80334    1.80334
t9    51.00000   33.03795    7.96205   28.15271   22.84729
t10   41.00000   41.00000    7.96205   41.00000
t11   56.00000   50.46691    9.26417   48.96205    7.03795
t12   75.00000   62.99591   12.08915   59.73109   15.26891
t13   60.00000   71.85955    9.29819   75.08506   15.08506
t14   75.00000   79.84108    8.15892   81.15774    6.15774
t15   88.00000   88.00000    8.15892   88.00000

----     76 VARIABLE alpha.L               =      0.21382
            VARIABLE beta.L                =      0.86528
            VARIABLE MAE.L                 =      6.59394
```
Conopt found this solution very quickly:
```
   Iter Phase Ninf   Infeasibility   RGmax    NSB   Step InItr MX OK
      0   0        7.8250000000E+02 (Input point)

                    Pre-triangular equations:    1
                    Post-triangular equations:   5
                    Definitional equations:      39

      1   0        1.0173523479E+02 (After pre-processing)
      2   0        3.6729535718E+00 (After scaling)
     10   0    12  3.3409268983E+00               0.0E+00             F  F

 ** Feasible solution. Value of objective =    7.07090607760

   Iter Phase Ninf     Objective     RGmax    NSB   Step InItr MX OK
     16   3        6.5939399026E+00 2.1E+00     3 1.0E+00     3    T  T
     18   3        6.5939398862E+00 0.0E+00     0

 ** Optimal solution. There are no superbasic variables.
```
When we compare this with the Excel solution from [2], we see the CONOPT solution is actually better.
Optimal Excel Solver Solution (from [2])
What we see is:
Variable | Excel solution [2] | Conopt solution |
---|---|---|
\(\alpha\) | 0.271817 | 0.21382 |
\(\beta\) | 0.598161 | 0.86528 |
MAE | 6.740208 | 6.59394 |
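
Using the `holt_mae` sketch from before, we can confirm the difference between the two solutions:

```python
# evaluate both reported parameter sets; should approximately reproduce the MAE values above
for label, (a, b) in {"Excel [2]": (0.271817, 0.598161),
                      "CONOPT":    (0.21382, 0.86528)}.items():
    *_, mae = holt_mae(a, b, y)
    print(f"{label:10s}  alpha={a:.6f}  beta={b:.6f}  MAE={mae:.5f}")
```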
This is typical for non-convex problems: local NLP solvers may converge to a local optimum. One way to get a better handle on this is to use different starting points. Another way is to try a global solver. Using global solvers, we can show that indeed Conopt found the global optimum.
Starting points
Here we quickly try a few local solvers with different starting points. The QCP (Quadratically Constrained Program) is non-convex. This means we should be prepared for local optima. When different solvers starting from different initial points converge to the same solution, we have some (limited) confidence that this point is a global optimum. Note that when solving this as a general-purpose NLP model, the solver does not know the model is quadratic. A local solver also does not know whether the optimal solution it finds is just a local optimum or the global one. Note further that, with this extended QCP formulation, the solver works with exact gradients.
Solver | Starting point | Objective | Time (seconds) | Iterations |
---|---|---|---|---|
Conopt | \(\alpha=0.4,\beta=0.7\) | 6.5939 | 0 | 18 |
Conopt | \(\alpha=0,\beta=0\) | 6.5939 | 0 | 29 |
Conopt | \(\alpha=1,\beta=1\) | 6.5939 | 0 | 17 |
Minos | \(\alpha=0.4,\beta=0.7\) | 6.5939 | 0 | 8 |
Minos | \(\alpha=0,\beta=0\) | 6.5939 | 0 | 14 |
Minos | \(\alpha=1,\beta=1\) | 6.5939 | 0 | 16 |
Snopt | \(\alpha=0.4,\beta=0.7\) | 6.5939 | 0 | 20 |
Snopt | \(\alpha=0,\beta=0\) | 6.5939 | 0 | 14 |
Snopt | \(\alpha=1,\beta=1\) | 6.5939 | 0 | 17 |
Ipopt | \(\alpha=0.4,\beta=0.7\) | 6.5939 | 0 | 23 |
Ipopt | \(\alpha=0,\beta=0\) | 6.5939 | 0 | 14 |
Ipopt | \(\alpha=1,\beta=1\) | 6.5939 | 0 | 26 |
In general, the local solvers have no problem finding the same optimal solution. This gives some evidence that this is indeed the global optimum.
Note: I was a bit lazy and left \(u_t\) and \(v_t\) at their default initial point of zero.
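A simple multistart experiment along these lines can also be done with the earlier Python sketch, here with a derivative-free method since the MAE objective is not differentiable everywhere (bounds support for Nelder-Mead requires a reasonably recent SciPy):

```python
from scipy.optimize import minimize

starts = [(0.4, 0.7), (0.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
best = None
for x0 in starts:
    # Nelder-Mead: derivative-free, since the MAE is non-smooth
    r = minimize(lambda p: holt_mae(p[0], p[1], y)[3], x0,
                 method="Nelder-Mead", bounds=[(0, 1), (0, 1)])
    print(x0, "->", r.x.round(5), round(r.fun, 5))
    if best is None or r.fun < best.fun:
        best = r
print("best:", best.x.round(5), round(best.fun, 5))
```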
A global approach
The first thing I did was to throw the above model, unmodified, at some global NLP solvers. For such a small model, the results were a big disappointment:
Solver | Objective | Time (seconds) | Gap | Note |
---|---|---|---|---|
Baron | 6.5939 | 3600 | 95% | Time limit |
Couenne | 6.5939 | 3600 | 78% | Time limit |
Antigone | 6.5939 | 184 | 0% | Optimal |
Gurobi | 6.5946 | 218 | 0% | Optimal |
Note: The runs with solvers Baron/Couenne/Antigone were done on a somewhat slow laptop. Gurobi ran on a faster (virtual) machine. So timings are not directly comparable between these solvers but are just indicative.
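In terms of the Pyomo sketch above, trying the global solvers amounts to little more than switching the solver (assuming Couenne and Gurobi are installed and reachable through Pyomo; the `NonConvex` option tells Gurobi to accept the non-convex quadratic constraints):

```python
# global solvers on the same model object m (illustration only)
pyo.SolverFactory("couenne").solve(m, tee=True)

# Gurobi as a global solver for the non-convex QCP (Gurobi 9+)
pyo.SolverFactory("gurobi").solve(m, options={"NonConvex": 2}, tee=True)
```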
To help the global solvers, we can reformulate the model: introduce intermediate variables \(w_t = u_t+v_t\) and \(x_t = u_t-u_{t-1}-v_{t-1}\), so that each nonlinear constraint contains just a single bilinear term (\(\alpha \cdot w_{t-1}\) or \(\beta \cdot x_t\)) instead of two or three.

Non-convex Quadratic Model v2 |
---|
\[\begin{align}\min\>&\color{darkred}{\mathit{MAE}}=\frac{1}{T-1}\sum_{t\gt 1} \color{darkred}{\mathit{abserr}}_t \\ & \color{darkred}u_1 = \color{darkblue}y_1 \\ & \color{darkred}v_1 = 0 \\ & \color{darkred}w_t = \color{darkred}u_t+\color{darkred}v_t\\ & \color{darkred}u_t = \color{darkred}\alpha \cdot \color{darkblue}y_t + \color{darkred}w_{t-1} -\color{darkred}\alpha \cdot \color{darkred}w_{t-1}&& t\gt 1\\ & \color{darkred}x_t = \color{darkred}u_t - \color{darkred}u_{t-1} -\color{darkred}v_{t-1} && t\gt 1 \\& \color{darkred}v_t = \color{darkred}\beta \cdot \color{darkred}x_t + \color{darkred}v_{t-1}&& t\gt 1 \\ & \color{darkred}{\hat{y}}_{t+1} = \color{darkred}u_{t}+\color{darkred}v_{t}\\ & -\color{darkred}{\mathit{abserr}}_t \le \color{darkred}{\hat{y}}_{t}- \color{darkblue}y_t \le \color{darkred}{\mathit{abserr}}_t && t\gt 1\\ & \color{darkred}\alpha \in [0,1] \\ & \color{darkred}\beta \in [0,1]\\ & \color{darkred}{\mathit{abserr}}_t \ge 0\end{align}\] |
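
In the Pyomo sketch, this reformulation could look like the following (again an illustration, not the GAMS source from [4]):

```python
# intermediate variables for the reformulated recursions
m.w = pyo.Var(T)
m.x = pyo.Var(Tgt1)
m.wdef = pyo.Constraint(T, rule=lambda m, t: m.w[t] == m.u[t] + m.v[t])
m.xdef = pyo.Constraint(Tgt1, rule=lambda m, t:
    m.x[t] == m.u[t] - m.u[t - 1] - m.v[t - 1])

# recursions rewritten so each constraint has a single bilinear term
m.udef2 = pyo.Constraint(Tgt1, rule=lambda m, t:
    m.u[t] == m.alpha * ydata[t] + m.w[t - 1] - m.alpha * m.w[t - 1])
m.vdef2 = pyo.Constraint(Tgt1, rule=lambda m, t:
    m.v[t] == m.beta * m.x[t] + m.v[t - 1])

# deactivate the original recursions from the first formulation
m.udef.deactivate()
m.vdef.deactivate()
```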
This version of the model behaves better:
Solver | Objective | Time (seconds) | Gap | Note |
---|---|---|---|---|
Baron | 6.5939 | 60 | 0% | Optimal |
Couenne | 6.5939 | 3600 | 49% | Time limit |
Antigone | 6.5939 | 157 | 0% | Optimal |
Gurobi | 6.5939 | 157 | 0% | Optimal |
This reformulation is a big help for Baron, and it also helps Gurobi deliver a more accurate solution.
How difficult is this problem?
We saw that Conopt (and other local NLP solvers) found the globally optimal solution without any problems. Of course, Conopt does not know whether it has found a local or a global solution. So the question arises: how difficult is this problem? As it turns out, the problem is actually very well-behaved. Any good local NLP solver will likely find the global optimum, even when we feed it our non-convex QCP problem. A plot of the MAE as a function of \(\alpha\) and \(\beta\) illustrates this:
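A rough way to produce such a picture is to evaluate the MAE on a grid of \((\alpha,\beta)\) values with the `holt_mae` sketch from earlier (matplotlib assumed):

```python
import matplotlib.pyplot as plt

grid = np.linspace(0.0, 1.0, 101)
# rows correspond to beta, columns to alpha
Z = np.array([[holt_mae(a, b, y)[3] for a in grid] for b in grid])
plt.contourf(grid, grid, Z, levels=30)
plt.colorbar(label="MAE")
plt.xlabel("alpha")
plt.ylabel("beta")
plt.title("MAE as a function of alpha and beta")
plt.show()
```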
Conclusion
The problem of finding optimal parameters \(\alpha\) and \(\beta\) for the Holt-Winters double exponential smoothing method leads to a non-convex quadratic optimization problem. Local solvers may deliver solutions that are only locally optimal (and thus not necessarily globally optimal). When trying to solve the global problem, it may help to reformulate the model in order to minimize the number of quadratic terms.
In practice, the problem is quite well-behaved, and a good NLP solver should be able to find the globally optimal solution without much trouble. It looks like Excel's solver just stops a bit prematurely. Interestingly, quadratic solvers like Cplex and Gurobi are not as fast or as suitable for this problem as a general-purpose local NLP solver.
The GAMS source of the first model can be found in [4].
References
1. Exponential Smoothing, https://en.wikipedia.org/wiki/Exponential_smoothing
2. Holt's Linear Trend, https://www.real-statistics.com/time-series-analysis/basic-time-series-forecasting/holt-linear-trend/
3. Handanhal V. Ravinder, Determining The Optimal Values Of Exponential Smoothing Constants – Does Solver Really Work?, American Journal of Business Education, Volume 6, Number 3, 2013, https://files.eric.ed.gov/fulltext/EJ1054363.pdf. This paper shows some computational experience with Excel on these types of models.
4. How to setup a minimization problem in GAMS, https://stackoverflow.com/questions/63738161/how-to-setup-a-minimization-problem-in-gams