Consider a histogram or bar chart. The problem is to find the largest rectangular area covered by the histogram. A picture will help:
|
Largest rectangle in a histogram
|
There are a number of algorithms to attack this problem, some of them very efficient, e.g. \(O(n)\). [1,2]
Here we try to find out if we can formulate this as an optimization model.
A small data set
Here is a small data set. It corresponds to the picture above.
---- 13 SET i bars
i1, i2, i3, i4, i5, i6, i7
---- 13 PARAMETER hist histogram data
i1 6.000, i2 2.000, i3 5.000, i4 4.000, i5 5.000, i6 1.000, i7 6.000
---- 13 PARAMETER maxhist = 6.000 heighest bar
A quadratic model
We start with our central binary variable \[\mathit{select}_i = \begin{cases} 1 & \text{if bar $i$ is selected to be part of the rectangle} \\ 0 & \text{otherwise}\end{cases}\] In addition, we want to know if an unselected bar is left or right of the selected bars. For this we use a binary variable \[\mathit{lr}_i = \begin{cases} 0 & \text{if unselected bar $i$ is to the left of the selected bars} \\ 1& \text{if unselected bar $i$ is to the right of the selected bars} \\ \text{unrestricted} & \text{if $\mathit{select}_i=1$}\end{cases}\] For the above picture, this looks like:
Note that the variable \(\mathit{lr}_i\) can be either 0 or 1 when the corresponding bar is selected.
The variables to describe the rcctangular area are: \(\mathit{start}\), \(\mathit{width}\) and \(\mathit{height}\).
The following implications can be formed. If bar \(i\) is not selected and to the left, then \(\mathit{start}\gt i\): \[\mathit{select}_i=0 \>{\bf{and}}\> \mathit{lr}_i=0 \Rightarrow \mathit{start}\ge i+1\] Similarly, if bar \(i\) is not selected and to the right, then \(\mathit{start}+\mathit{width}-1 \lt i\). Formally: \[\mathit{select}_i=0 \>{\bf{and}}\> \mathit{lr}_i=1 \Rightarrow \mathit{start}+\mathit{width}-1\le i-1\] Furthermore, we also have the implication that limits the height: if bar \(i\) is selected, then \(\mathit{height}\le \mathit{hist}_i\). These implications can be implemented as big-M constraints. Finally, we can link \(\mathit{select}_i\) to the width, by saying: \[\sum_i \mathit{select}_i = \mathit{width}\] The complete model can look like:
Quadratic Model |
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\color{darkred}{\mathit{width}}\cdot\color{darkred}{\mathit{height}}\\
& \color{darkred}{\mathit{start}} \ge\color{darkblue}i+1-\color{darkblue}n\cdot \color{darkred}{\mathit{select}}_i -\color{darkblue}n\cdot\color{darkred}{\mathit{lr}}_i&& \forall i \\
& \color{darkred}{\mathit{start}}+\color{darkred}{\mathit{width}} \le \color{darkblue} i+\color{darkblue}n\cdot \color{darkred}{\mathit{select}}_i+\color{darkblue}n\cdot (1-\color{darkred}{\mathit{lr}}_i) && \forall i \\
& \sum_i \color{darkred}{\mathit{select}}_i = \color{darkred}{\mathit{width}} \\
& \color{darkred}{\mathit{height}} \le \color{darkblue} {\mathit{hist}}_i + (\color{darkblue}{\mathit{maxhist}} - \color{darkblue} {\mathit{hist}}_i) \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\
& \color{darkred}{\mathit{start}}+\color{darkred}{\mathit{width}}-1 \le \color{darkblue}n \\
& \color{darkred}{\mathit{select}}_i \in \{0,1\} \\
& \color{darkred}{\mathit{lr}}_i \in \{0,1\} \\
& \color{darkred}{\mathit{start}} \in [1,\color{darkblue}n]\\
& \color{darkred}{\mathit{width}} \in [1,\color{darkblue}n]\\
& \color{darkred}{\mathit{height}} \ge 0\\
& \color{darkblue}n = {\bf{card}}(i)
\end{align}\] |
This is a non-convex MIQP (Mixed Integer Quadratic Programming) model and can be solved with solvers like Baron, Couenne, Antigone, Cplex, and Gurobi. Note that Cplex and Gurobi need a special option to allow these models to be solved: optimalityTarget=3 for Cplex and nonConvex=2 for Gurobi.
Results
After running the model we see:
---- 68 VARIABLE select.L bar is selected
i3 1.000, i4 1.000, i5 1.000
---- 68 VARIABLE lr.L unselected bars are left (lr=0) or right (lr=1) of selection
i5 1.000, i6 1.000, i7 1.000
---- 68 VARIABLE start.L = 3.000 first selected bar
VARIABLE width.L = 3.000 number of selected bars
VARIABLE height.L = 4.000 shortest selected bar
VARIABLE area.L = 12.000 width*height
Note that zero values are not printed.
A larger data set
To test if this formulation work with a larger data set, I generated some random data:
---- 12 SET i bars
i1 , i2 , i3 , i4 , i5 , i6 , i7 , i8 , i9 , i10 , i11 , i12 , i13
i14 , i15 , i16 , i17 , i18 , i19 , i20 , i21 , i22 , i23 , i24 , i25 , i26
i27 , i28 , i29 , i30 , i31 , i32 , i33 , i34 , i35 , i36 , i37 , i38 , i39
i40 , i41 , i42 , i43 , i44 , i45 , i46 , i47 , i48 , i49 , i50 , i51 , i52
i53 , i54 , i55 , i56 , i57 , i58 , i59 , i60 , i61 , i62 , i63 , i64 , i65
i66 , i67 , i68 , i69 , i70 , i71 , i72 , i73 , i74 , i75 , i76 , i77 , i78
i79 , i80 , i81 , i82 , i83 , i84 , i85 , i86 , i87 , i88 , i89 , i90 , i91
i92 , i93 , i94 , i95 , i96 , i97 , i98 , i99 , i100
---- 12 PARAMETER hist histogram data
i1 17.175, i2 84.327, i3 55.038, i4 30.114, i5 29.221, i6 22.405, i7 34.983
i8 85.627, i9 6.711, i10 50.021, i11 99.812, i12 57.873, i13 99.113, i14 76.225
i15 13.069, i16 63.972, i17 15.952, i18 25.008, i19 66.893, i20 43.536, i21 35.970
i22 35.144, i23 13.149, i24 15.010, i25 58.911, i26 83.089, i27 23.082, i28 66.573
i29 77.586, i30 30.366, i31 11.049, i32 50.238, i33 16.017, i34 87.246, i35 26.511
i36 28.581, i37 59.396, i38 72.272, i39 62.825, i40 46.380, i41 41.331, i42 11.770
i43 31.421, i44 4.655, i45 33.855, i46 18.210, i47 64.573, i48 56.075, i49 76.996
i50 29.781, i51 66.111, i52 75.582, i53 62.745, i54 28.386, i55 8.642, i56 10.251
i57 64.125, i58 54.531, i59 3.152, i60 79.236, i61 7.277, i62 17.566, i63 52.563
i64 75.021, i65 17.812, i66 3.414, i67 58.513, i68 62.123, i69 38.936, i70 35.871
i71 24.303, i72 24.642, i73 13.050, i74 93.345, i75 37.994, i76 78.340, i77 30.003
i78 12.548, i79 74.887, i80 6.923, i81 20.202, i82 0.507, i83 26.961, i84 49.985
i85 15.129, i86 17.417, i87 33.064, i88 31.691, i89 32.209, i90 96.398, i91 99.360
i92 36.990, i93 37.289, i94 77.198, i95 39.668, i96 91.310, i97 11.958, i98 73.548
i99 5.542, i100 57.630
---- 12 PARAMETER maxhist = 99.812 heighest bar
When we solve this we see the following results:
---- 67 VARIABLE select.L selection of bar
i10 1.000, i11 1.000, i12 1.000, i13 1.000, i14 1.000, i15 1.000, i16 1.000, i17 1.000
i18 1.000, i19 1.000, i20 1.000, i21 1.000, i22 1.000, i23 1.000, i24 1.000, i25 1.000
i26 1.000, i27 1.000, i28 1.000, i29 1.000, i30 1.000, i31 1.000, i32 1.000, i33 1.000
i34 1.000, i35 1.000, i36 1.000, i37 1.000, i38 1.000, i39 1.000, i40 1.000, i41 1.000
i42 1.000, i43 1.000
---- 67 VARIABLE lr.L unselected bars are left (lr=0) or right (lr=1) of selection
i10 1.000, i11 1.000, i13 1.000, i14 1.000, i16 1.000, i19 1.000, i24 1.000, i30 1.000
i35 1.000, i36 1.000, i37 1.000, i40 1.000, i42 1.000, i44 1.000, i45 1.000, i46 1.000
i47 1.000, i48 1.000, i49 1.000, i50 1.000, i51 1.000, i52 1.000, i53 1.000, i54 1.000
i55 1.000, i56 1.000, i57 1.000, i58 1.000, i59 1.000, i60 1.000, i61 1.000, i62 1.000
i63 1.000, i64 1.000, i65 1.000, i66 1.000, i67 1.000, i68 1.000, i69 1.000, i70 1.000
i71 1.000, i72 1.000, i73 1.000, i74 1.000, i75 1.000, i76 1.000, i77 1.000, i78 1.000
i79 1.000, i80 1.000, i81 1.000, i82 1.000, i83 1.000, i84 1.000, i85 1.000, i86 1.000
i87 1.000, i88 1.000, i89 1.000, i90 1.000, i91 1.000, i92 1.000, i93 1.000, i94 1.000
i95 1.000, i96 1.000, i97 1.000, i98 1.000, i99 1.000, i100 1.000
---- 67 VARIABLE start.L = 10.000 first selected bar
VARIABLE width.L = 34.000 number of selected bars
VARIABLE height.L = 11.049 shortest selected bar
VARIABLE area.L = 375.674 width*height
Cplex needed 200 seconds for this. We can help Cplex a bit by adding the ordering constraint: \[\mathit{lr}_i \ge \mathit{lr}_{i-1}\] This has a significant impact:
| original formulation | with ordering constraint |
Objective | 375.674 | 375.674 |
Time (s) | 210 | 24 |
Nodes | 95,830 | 6,040 |
Iterations | 2,190,751 | 153,255 |
A simpler model
There is a simpler approach. Instead of a scalar variable \(\mathit{start}\) we use an indexed variable, indicating where we switch from \(\mathit{select}_{i-1}=0\) to \(\mathit{select}_{i}=1\). In our picture:
We want contiguous selected bars, or in other words: only a single start variable can be 1. This can be modeled as \[\begin{align} & \mathit{start}_i \ge \mathit{select}_i-\mathit{select}_{i-1} \\ &\sum_i \mathit{start}_i \le 1 \end{align}\] The first inequality can be interpreted as the following implication: \[\mathit{select}_{i-1}=0 \>{\bf{and}}\>\mathit{select}_{i}=1 \Rightarrow \mathit{start}_i=1 \] Note, that \(\mathit{start}_i\) is unrestricted in all other cases. This formulation is not uncommon in machine scheduling models where we want to limit the number start-ups.
With this we can state the following quadratic model:
Quadratic Model |
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\sum_i \color{darkred}{\mathit{selected}}_i \cdot \color{darkred}{\mathit{height}}\\
& \color{darkred}{\mathit{start}}_i \ge \color{darkred}{\mathit{select}}_i - \color{darkred}{\mathit{select}}_{i-1} && \forall i \\
& \sum_i \color{darkred}{\mathit{start}}_i \le 1 \\
& \color{darkred}{\mathit{select}}_i \in \{0,1\} \\
& \color{darkred}{\mathit{height}} \le \color{darkblue} {\mathit{hist}}_i + (\color{darkblue}{\mathit{maxhist}} - \color{darkblue} {\mathit{hist}}_i) \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\
& \color{darkred}{\mathit{start}}_i \in \{0,1\} \\
& \color{darkred}{\mathit{height}} \ge 0\\
\end{align}\] |
This model can be linearized as we multiply a binary variable with a continuous variable [3]. The multiplication \[\begin{align} &y = \delta \cdot x \\ & \delta \in \{0,1\} \\ & x \in [0,U] \end{align}\] can be linearized as: \[\begin{align} & y \le U \cdot \delta\\ & y \le x \\& y \ge x - U \cdot (1-\delta) \\ & \delta \in \{0,1\} \\ & x,y \in [0,U] \end{align}\]
After a little bit of algebra (we have the complication that the variable \(\mathit{height}\) can exceed \(\mathit{hist}_i\) for non-selected bars), we arrive at the following linear model:
Linear Model |
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\sum_i \color{darkred}{\mathit{adjHeight}}_i \\
& \color{darkred}{\mathit{start}}_i \ge \color{darkred}{\mathit{select}}_i - \color{darkred}{\mathit{select}}_{i-1} && \forall i \\
& \sum_i \color{darkred}{\mathit{start}}_i \le 1 \\
& \color{darkred}{\mathit{adjHeight}}_i \le \color{darkred}{\mathit{select}}_i \cdot \color{darkblue} {\mathit{hist}}_i && \forall i \\
& \color{darkred}{\mathit{adjHeight}}_i \le \color{darkred}{\mathit{height}} && \forall i \\
& \color{darkred}{\mathit{adjHeight}}_i \ge \color{darkred}{\mathit{height}} - \color{darkblue}{\mathit{maxhist}} \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\
& \color{darkred}{\mathit{select}}_i \in \{0,1\} \\
& \color{darkred}{\mathit{start}}_i \in \{0,1\} \\
& \color{darkred}{\mathit{height}} \ge 0\\
& \color{darkred}{\mathit{adjHeight}}_i \ge 0 \\
\end{align}\] |
In this model, we use \[\mathit{adjHeight}_i = \begin{cases} \mathit{height} & \text{if $\mathit{select}_i=1$} \\ 0 & \text{if $\mathit{select}_i=0$}\end{cases}\]
The solution of the small data set for this model looks like:
---- 50 VARIABLE select.L bar is selected
i3 1.000, i4 1.000, i5 1.000
---- 50 VARIABLE start.L first selected bar
i3 1.000
---- 50 VARIABLE adjHeight.L adjusted height: heigth*selected(i)
i3 4.000, i4 4.000, i5 4.000
---- 50 VARIABLE height.L = 4.000 area height
VARIABLE area.L = 12.000
Of course, we also ran it with the large, random data set. The solution looks like:
---- 49 VARIABLE select.L bar is selected
i10 1.000, i11 1.000, i12 1.000, i13 1.000, i14 1.000, i15 1.000, i16 1.000, i17 1.000
i18 1.000, i19 1.000, i20 1.000, i21 1.000, i22 1.000, i23 1.000, i24 1.000, i25 1.000
i26 1.000, i27 1.000, i28 1.000, i29 1.000, i30 1.000, i31 1.000, i32 1.000, i33 1.000
i34 1.000, i35 1.000, i36 1.000, i37 1.000, i38 1.000, i39 1.000, i40 1.000, i41 1.000
i42 1.000, i43 1.000
---- 49 VARIABLE start.L first selected bar
i10 1.000
---- 49 VARIABLE adjHeight.L adjusted height: heigth*selected(i)
i10 11.049, i11 11.049, i12 11.049, i13 11.049, i14 11.049, i15 11.049, i16 11.049, i17 11.049
i18 11.049, i19 11.049, i20 11.049, i21 11.049, i22 11.049, i23 11.049, i24 11.049, i25 11.049
i26 11.049, i27 11.049, i28 11.049, i29 11.049, i30 11.049, i31 11.049, i32 11.049, i33 11.049
i34 11.049, i35 11.049, i36 11.049, i37 11.049, i38 11.049, i39 11.049, i40 11.049, i41 11.049
i42 11.049, i43 11.049
---- 49 VARIABLE height.L = 11.049 area height
VARIABLE area.L = 375.674
To compare the performance, let's add the statistics to our table:
| MIQP original formulation | MIQP with ordering constraint | Linear MIP |
Objective | 375.674 | 375.674 | 375.674 |
Time (s) | 210 | 24 | 0.8 |
Nodes | 95,830 | 6,040 | 0 |
Iterations | 2,190,751 | 153,255 | 589 |
Cplex was able to solve the MIP model to optimality during pre-processing, so no branch-and-bound nodes were needed.
Conclusion
The Largest Rectangular Area in a Histogram is easy to explain. Writing down the corresponding optimization model is a bit more difficult. In addition, different formulations can lead to dramatic differences in performance. Clearly the linearized model is the winner.
References