Largest rectangle in a histogram |
There are a number of algorithms to attack this problem, some of them very efficient, e.g. \(O(n)\). [1,2]
Here we try to find out if we can formulate this as an optimization model.
A small data set
Here is a small data set. It corresponds to the picture above.
---- 13 SET i bars i1, i2, i3, i4, i5, i6, i7 ---- 13 PARAMETER hist histogram data i1 6.000, i2 2.000, i3 5.000, i4 4.000, i5 5.000, i6 1.000, i7 6.000 ---- 13 PARAMETER maxhist = 6.000 heighest bar
A quadratic model
We start with our central binary variable \[\mathit{select}_i = \begin{cases} 1 & \text{if bar $i$ is selected to be part of the rectangle} \\ 0 & \text{otherwise}\end{cases}\] In addition, we want to know if an unselected bar is left or right of the selected bars. For this we use a binary variable \[\mathit{lr}_i = \begin{cases} 0 & \text{if unselected bar $i$ is to the left of the selected bars} \\ 1& \text{if unselected bar $i$ is to the right of the selected bars} \\ \text{unrestricted} & \text{if $\mathit{select}_i=1$}\end{cases}\] For the above picture, this looks like:
Note that the variable \(\mathit{lr}_i\) can be either 0 or 1 when the corresponding bar is selected.
The variables to describe the rcctangular area are: \(\mathit{start}\), \(\mathit{width}\) and \(\mathit{height}\).
The following implications can be formed. If bar \(i\) is not selected and to the left, then \(\mathit{start}\gt i\): \[\mathit{select}_i=0 \>{\bf{and}}\> \mathit{lr}_i=0 \Rightarrow \mathit{start}\ge i+1\] Similarly, if bar \(i\) is not selected and to the right, then \(\mathit{start}+\mathit{width}-1 \lt i\). Formally: \[\mathit{select}_i=0 \>{\bf{and}}\> \mathit{lr}_i=1 \Rightarrow \mathit{start}+\mathit{width}-1\le i-1\] Furthermore, we also have the implication that limits the height: if bar \(i\) is selected, then \(\mathit{height}\le \mathit{hist}_i\). These implications can be implemented as big-M constraints. Finally, we can link \(\mathit{select}_i\) to the width, by saying: \[\sum_i \mathit{select}_i = \mathit{width}\] The complete model can look like:
Quadratic Model |
---|
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\color{darkred}{\mathit{width}}\cdot\color{darkred}{\mathit{height}}\\ & \color{darkred}{\mathit{start}} \ge\color{darkblue}i+1-\color{darkblue}n\cdot \color{darkred}{\mathit{select}}_i -\color{darkblue}n\cdot\color{darkred}{\mathit{lr}}_i&& \forall i \\ & \color{darkred}{\mathit{start}}+\color{darkred}{\mathit{width}} \le \color{darkblue} i+\color{darkblue}n\cdot \color{darkred}{\mathit{select}}_i+\color{darkblue}n\cdot (1-\color{darkred}{\mathit{lr}}_i) && \forall i \\ & \sum_i \color{darkred}{\mathit{select}}_i = \color{darkred}{\mathit{width}} \\ & \color{darkred}{\mathit{height}} \le \color{darkblue} {\mathit{hist}}_i + (\color{darkblue}{\mathit{maxhist}} - \color{darkblue} {\mathit{hist}}_i) \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\ & \color{darkred}{\mathit{start}}+\color{darkred}{\mathit{width}}-1 \le \color{darkblue}n \\ & \color{darkred}{\mathit{select}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{lr}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{start}} \in [1,\color{darkblue}n]\\ & \color{darkred}{\mathit{width}} \in [1,\color{darkblue}n]\\ & \color{darkred}{\mathit{height}} \ge 0\\ & \color{darkblue}n = {\bf{card}}(i) \end{align}\] |
This is a non-convex MIQP (Mixed Integer Quadratic Programming) model and can be solved with solvers like Baron, Couenne, Antigone, Cplex, and Gurobi. Note that Cplex and Gurobi need a special option to allow these models to be solved: optimalityTarget=3 for Cplex and nonConvex=2 for Gurobi.
Results
After running the model we see:
---- 68 VARIABLE select.L bar is selected i3 1.000, i4 1.000, i5 1.000 ---- 68 VARIABLE lr.L unselected bars are left (lr=0) or right (lr=1) of selection i5 1.000, i6 1.000, i7 1.000 ---- 68 VARIABLE start.L = 3.000 first selected bar VARIABLE width.L = 3.000 number of selected bars VARIABLE height.L = 4.000 shortest selected bar VARIABLE area.L = 12.000 width*height
Note that zero values are not printed.
A larger data set
To test if this formulation work with a larger data set, I generated some random data:
---- 12 SET i bars i1 , i2 , i3 , i4 , i5 , i6 , i7 , i8 , i9 , i10 , i11 , i12 , i13 i14 , i15 , i16 , i17 , i18 , i19 , i20 , i21 , i22 , i23 , i24 , i25 , i26 i27 , i28 , i29 , i30 , i31 , i32 , i33 , i34 , i35 , i36 , i37 , i38 , i39 i40 , i41 , i42 , i43 , i44 , i45 , i46 , i47 , i48 , i49 , i50 , i51 , i52 i53 , i54 , i55 , i56 , i57 , i58 , i59 , i60 , i61 , i62 , i63 , i64 , i65 i66 , i67 , i68 , i69 , i70 , i71 , i72 , i73 , i74 , i75 , i76 , i77 , i78 i79 , i80 , i81 , i82 , i83 , i84 , i85 , i86 , i87 , i88 , i89 , i90 , i91 i92 , i93 , i94 , i95 , i96 , i97 , i98 , i99 , i100 ---- 12 PARAMETER hist histogram data i1 17.175, i2 84.327, i3 55.038, i4 30.114, i5 29.221, i6 22.405, i7 34.983 i8 85.627, i9 6.711, i10 50.021, i11 99.812, i12 57.873, i13 99.113, i14 76.225 i15 13.069, i16 63.972, i17 15.952, i18 25.008, i19 66.893, i20 43.536, i21 35.970 i22 35.144, i23 13.149, i24 15.010, i25 58.911, i26 83.089, i27 23.082, i28 66.573 i29 77.586, i30 30.366, i31 11.049, i32 50.238, i33 16.017, i34 87.246, i35 26.511 i36 28.581, i37 59.396, i38 72.272, i39 62.825, i40 46.380, i41 41.331, i42 11.770 i43 31.421, i44 4.655, i45 33.855, i46 18.210, i47 64.573, i48 56.075, i49 76.996 i50 29.781, i51 66.111, i52 75.582, i53 62.745, i54 28.386, i55 8.642, i56 10.251 i57 64.125, i58 54.531, i59 3.152, i60 79.236, i61 7.277, i62 17.566, i63 52.563 i64 75.021, i65 17.812, i66 3.414, i67 58.513, i68 62.123, i69 38.936, i70 35.871 i71 24.303, i72 24.642, i73 13.050, i74 93.345, i75 37.994, i76 78.340, i77 30.003 i78 12.548, i79 74.887, i80 6.923, i81 20.202, i82 0.507, i83 26.961, i84 49.985 i85 15.129, i86 17.417, i87 33.064, i88 31.691, i89 32.209, i90 96.398, i91 99.360 i92 36.990, i93 37.289, i94 77.198, i95 39.668, i96 91.310, i97 11.958, i98 73.548 i99 5.542, i100 57.630 ---- 12 PARAMETER maxhist = 99.812 heighest bar
When we solve this we see the following results:
---- 67 VARIABLE select.L selection of bar i10 1.000, i11 1.000, i12 1.000, i13 1.000, i14 1.000, i15 1.000, i16 1.000, i17 1.000 i18 1.000, i19 1.000, i20 1.000, i21 1.000, i22 1.000, i23 1.000, i24 1.000, i25 1.000 i26 1.000, i27 1.000, i28 1.000, i29 1.000, i30 1.000, i31 1.000, i32 1.000, i33 1.000 i34 1.000, i35 1.000, i36 1.000, i37 1.000, i38 1.000, i39 1.000, i40 1.000, i41 1.000 i42 1.000, i43 1.000 ---- 67 VARIABLE lr.L unselected bars are left (lr=0) or right (lr=1) of selection i10 1.000, i11 1.000, i13 1.000, i14 1.000, i16 1.000, i19 1.000, i24 1.000, i30 1.000 i35 1.000, i36 1.000, i37 1.000, i40 1.000, i42 1.000, i44 1.000, i45 1.000, i46 1.000 i47 1.000, i48 1.000, i49 1.000, i50 1.000, i51 1.000, i52 1.000, i53 1.000, i54 1.000 i55 1.000, i56 1.000, i57 1.000, i58 1.000, i59 1.000, i60 1.000, i61 1.000, i62 1.000 i63 1.000, i64 1.000, i65 1.000, i66 1.000, i67 1.000, i68 1.000, i69 1.000, i70 1.000 i71 1.000, i72 1.000, i73 1.000, i74 1.000, i75 1.000, i76 1.000, i77 1.000, i78 1.000 i79 1.000, i80 1.000, i81 1.000, i82 1.000, i83 1.000, i84 1.000, i85 1.000, i86 1.000 i87 1.000, i88 1.000, i89 1.000, i90 1.000, i91 1.000, i92 1.000, i93 1.000, i94 1.000 i95 1.000, i96 1.000, i97 1.000, i98 1.000, i99 1.000, i100 1.000 ---- 67 VARIABLE start.L = 10.000 first selected bar VARIABLE width.L = 34.000 number of selected bars VARIABLE height.L = 11.049 shortest selected bar VARIABLE area.L = 375.674 width*height
Cplex needed 200 seconds for this. We can help Cplex a bit by adding the ordering constraint: \[\mathit{lr}_i \ge \mathit{lr}_{i-1}\] This has a significant impact:
original formulation | with ordering constraint | |
---|---|---|
Objective | 375.674 | 375.674 |
Time (s) | 210 | 24 |
Nodes | 95,830 | 6,040 |
Iterations | 2,190,751 | 153,255 |
A simpler model
There is a simpler approach. Instead of a scalar variable \(\mathit{start}\) we use an indexed variable, indicating where we switch from \(\mathit{select}_{i-1}=0\) to \(\mathit{select}_{i}=1\). In our picture:
We want contiguous selected bars, or in other words: only a single start variable can be 1. This can be modeled as \[\begin{align} & \mathit{start}_i \ge \mathit{select}_i-\mathit{select}_{i-1} \\ &\sum_i \mathit{start}_i \le 1 \end{align}\] The first inequality can be interpreted as the following implication: \[\mathit{select}_{i-1}=0 \>{\bf{and}}\>\mathit{select}_{i}=1 \Rightarrow \mathit{start}_i=1 \] Note, that \(\mathit{start}_i\) is unrestricted in all other cases. This formulation is not uncommon in machine scheduling models where we want to limit the number start-ups.
With this we can state the following quadratic model:
Quadratic Model |
---|
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\sum_i \color{darkred}{\mathit{selected}}_i \cdot \color{darkred}{\mathit{height}}\\ & \color{darkred}{\mathit{start}}_i \ge \color{darkred}{\mathit{select}}_i - \color{darkred}{\mathit{select}}_{i-1} && \forall i \\ & \sum_i \color{darkred}{\mathit{start}}_i \le 1 \\ & \color{darkred}{\mathit{select}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{height}} \le \color{darkblue} {\mathit{hist}}_i + (\color{darkblue}{\mathit{maxhist}} - \color{darkblue} {\mathit{hist}}_i) \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\ & \color{darkred}{\mathit{start}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{height}} \ge 0\\ \end{align}\] |
This model can be linearized as we multiply a binary variable with a continuous variable [3]. The multiplication \[\begin{align} &y = \delta \cdot x \\ & \delta \in \{0,1\} \\ & x \in [0,U] \end{align}\] can be linearized as: \[\begin{align} & y \le U \cdot \delta\\ & y \le x \\& y \ge x - U \cdot (1-\delta) \\ & \delta \in \{0,1\} \\ & x,y \in [0,U] \end{align}\]
After a little bit of algebra (we have the complication that the variable \(\mathit{height}\) can exceed \(\mathit{hist}_i\) for non-selected bars), we arrive at the following linear model:
Linear Model |
---|
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\sum_i \color{darkred}{\mathit{adjHeight}}_i \\ & \color{darkred}{\mathit{start}}_i \ge \color{darkred}{\mathit{select}}_i - \color{darkred}{\mathit{select}}_{i-1} && \forall i \\ & \sum_i \color{darkred}{\mathit{start}}_i \le 1 \\ & \color{darkred}{\mathit{adjHeight}}_i \le \color{darkred}{\mathit{select}}_i \cdot \color{darkblue} {\mathit{hist}}_i && \forall i \\ & \color{darkred}{\mathit{adjHeight}}_i \le \color{darkred}{\mathit{height}} && \forall i \\ & \color{darkred}{\mathit{adjHeight}}_i \ge \color{darkred}{\mathit{height}} - \color{darkblue}{\mathit{maxhist}} \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\ & \color{darkred}{\mathit{select}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{start}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{height}} \ge 0\\ & \color{darkred}{\mathit{adjHeight}}_i \ge 0 \\ \end{align}\] |
In this model, we use \[\mathit{adjHeight}_i = \begin{cases} \mathit{height} & \text{if $\mathit{select}_i=1$} \\ 0 & \text{if $\mathit{select}_i=0$}\end{cases}\]
The solution of the small data set for this model looks like:
---- 50 VARIABLE select.L bar is selected i3 1.000, i4 1.000, i5 1.000 ---- 50 VARIABLE start.L first selected bar i3 1.000 ---- 50 VARIABLE adjHeight.L adjusted height: heigth*selected(i) i3 4.000, i4 4.000, i5 4.000 ---- 50 VARIABLE height.L = 4.000 area height VARIABLE area.L = 12.000
Of course, we also ran it with the large, random data set. The solution looks like:
---- 49 VARIABLE select.L bar is selected i10 1.000, i11 1.000, i12 1.000, i13 1.000, i14 1.000, i15 1.000, i16 1.000, i17 1.000 i18 1.000, i19 1.000, i20 1.000, i21 1.000, i22 1.000, i23 1.000, i24 1.000, i25 1.000 i26 1.000, i27 1.000, i28 1.000, i29 1.000, i30 1.000, i31 1.000, i32 1.000, i33 1.000 i34 1.000, i35 1.000, i36 1.000, i37 1.000, i38 1.000, i39 1.000, i40 1.000, i41 1.000 i42 1.000, i43 1.000 ---- 49 VARIABLE start.L first selected bar i10 1.000 ---- 49 VARIABLE adjHeight.L adjusted height: heigth*selected(i) i10 11.049, i11 11.049, i12 11.049, i13 11.049, i14 11.049, i15 11.049, i16 11.049, i17 11.049 i18 11.049, i19 11.049, i20 11.049, i21 11.049, i22 11.049, i23 11.049, i24 11.049, i25 11.049 i26 11.049, i27 11.049, i28 11.049, i29 11.049, i30 11.049, i31 11.049, i32 11.049, i33 11.049 i34 11.049, i35 11.049, i36 11.049, i37 11.049, i38 11.049, i39 11.049, i40 11.049, i41 11.049 i42 11.049, i43 11.049 ---- 49 VARIABLE height.L = 11.049 area height VARIABLE area.L = 375.674
To compare the performance, let's add the statistics to our table:
MIQP original formulation | MIQP with ordering constraint | Linear MIP | |
---|---|---|---|
Objective | 375.674 | 375.674 | 375.674 |
Time (s) | 210 | 24 | 0.8 |
Nodes | 95,830 | 6,040 | 0 |
Iterations | 2,190,751 | 153,255 | 589 |
Cplex was able to solve the MIP model to optimality during pre-processing, so no branch-and-bound nodes were needed.
Conclusion
The Largest Rectangular Area in a Histogram is easy to explain. Writing down the corresponding optimization model is a bit more difficult. In addition, different formulations can lead to dramatic differences in performance. Clearly the linearized model is the winner.
References
- Largest rectangular area in a histogram, set 1, https://www.geeksforgeeks.org/largest-rectangular-area-in-a-histogram-set-1/
- Largest rectangular area in a histogram, set 2, https://www.geeksforgeeks.org/largest-rectangle-under-histogram/
- Multiplication of a continuous and a binary variable, http://yetanothermathprogrammingconsultant.blogspot.com/2008/05/multiplication-of-continuous-and-binary.html
No comments:
Post a Comment