Tuesday, August 11, 2020

Largest Rectangular Area in a Histogram

Consider a histogram or bar chart. The problem is to find the largest rectangular area covered by the histogram. A picture will help:

Largest rectangle in a histogram

There are a number of algorithms to attack this problem, some of them very efficient, e.g. \(O(n)\). [1,2]

Here we try to find out if we can formulate this as an optimization model.

A small data set


Here is a small data set. It corresponds to the picture above.


----     13 SET i  bars

i1,    i2,    i3,    i4,    i5,    i6,    i7


----     13 PARAMETER hist  histogram data

i1 6.000,    i2 2.000,    i3 5.000,    i4 4.000,    i5 5.000,    i6 1.000,    i7 6.000


----     13 PARAMETER maxhist              =        6.000  heighest bar


A quadratic model


We start with our central binary variable \[\mathit{select}_i = \begin{cases} 1 & \text{if bar $i$ is selected to be part of the rectangle} \\ 0 & \text{otherwise}\end{cases}\] In addition, we want to know if an unselected bar is left or right of the selected bars. For this we use a binary variable \[\mathit{lr}_i = \begin{cases} 0 & \text{if unselected bar $i$ is to the left of the selected bars} \\ 1& \text{if unselected bar $i$ is to the right of the selected bars} \\ \text{unrestricted} & \text{if $\mathit{select}_i=1$}\end{cases}\] For the above picture, this looks like:


Note that the variable \(\mathit{lr}_i\) can be either 0 or 1 when the corresponding bar is selected.

The variables to describe the rcctangular area are: \(\mathit{start}\), \(\mathit{width}\) and \(\mathit{height}\). 

The following implications can be formed. If bar \(i\) is not selected and to the left, then \(\mathit{start}\gt i\): \[\mathit{select}_i=0  \>{\bf{and}}\> \mathit{lr}_i=0 \Rightarrow  \mathit{start}\ge i+1\] Similarly, if bar \(i\) is not selected and to the right, then \(\mathit{start}+\mathit{width}-1 \lt i\). Formally: \[\mathit{select}_i=0  \>{\bf{and}}\> \mathit{lr}_i=1 \Rightarrow  \mathit{start}+\mathit{width}-1\le i-1\] Furthermore, we also have the implication that limits the height: if bar \(i\) is selected, then \(\mathit{height}\le \mathit{hist}_i\). These implications can be implemented as big-M constraints. Finally, we can link \(\mathit{select}_i\) to the width, by saying: \[\sum_i \mathit{select}_i = \mathit{width}\] The complete model can look like:

Quadratic Model
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\color{darkred}{\mathit{width}}\cdot\color{darkred}{\mathit{height}}\\ & \color{darkred}{\mathit{start}} \ge\color{darkblue}i+1-\color{darkblue}n\cdot \color{darkred}{\mathit{select}}_i -\color{darkblue}n\cdot\color{darkred}{\mathit{lr}}_i&& \forall i \\ & \color{darkred}{\mathit{start}}+\color{darkred}{\mathit{width}} \le \color{darkblue} i+\color{darkblue}n\cdot \color{darkred}{\mathit{select}}_i+\color{darkblue}n\cdot (1-\color{darkred}{\mathit{lr}}_i) && \forall i \\ & \sum_i \color{darkred}{\mathit{select}}_i = \color{darkred}{\mathit{width}} \\ & \color{darkred}{\mathit{height}} \le \color{darkblue} {\mathit{hist}}_i + (\color{darkblue}{\mathit{maxhist}} - \color{darkblue} {\mathit{hist}}_i) \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\ & \color{darkred}{\mathit{start}}+\color{darkred}{\mathit{width}}-1 \le \color{darkblue}n \\ & \color{darkred}{\mathit{select}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{lr}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{start}} \in [1,\color{darkblue}n]\\ & \color{darkred}{\mathit{width}} \in [1,\color{darkblue}n]\\ & \color{darkred}{\mathit{height}} \ge 0\\ & \color{darkblue}n = {\bf{card}}(i) \end{align}\]

This is a non-convex MIQP (Mixed Integer Quadratic Programming) model and can be solved with solvers like Baron, Couenne, Antigone, Cplex, and Gurobi. Note that Cplex and Gurobi need a special option to allow these models to be solved: optimalityTarget=3 for Cplex and nonConvex=2 for Gurobi.

Results


After running the model we see:


----     68 VARIABLE select.L  bar is selected

i3 1.000,    i4 1.000,    i5 1.000


----     68 VARIABLE lr.L   unselected bars are left (lr=0) or right (lr=1) of selection

i5 1.000,    i6 1.000,    i7 1.000


----     68 VARIABLE start.L               =        3.000  first selected bar
            VARIABLE width.L               =        3.000  number of selected bars
            VARIABLE height.L              =        4.000  shortest selected bar
            VARIABLE area.L                =       12.000  width*height

Note that zero values are not printed.

A larger data set


To test if this formulation work with a larger data set, I generated some random data:

----     12 SET i  bars

i1  ,    i2  ,    i3  ,    i4  ,    i5  ,    i6  ,    i7  ,    i8  ,    i9  ,    i10 ,    i11 ,    i12 ,    i13 
i14 ,    i15 ,    i16 ,    i17 ,    i18 ,    i19 ,    i20 ,    i21 ,    i22 ,    i23 ,    i24 ,    i25 ,    i26 
i27 ,    i28 ,    i29 ,    i30 ,    i31 ,    i32 ,    i33 ,    i34 ,    i35 ,    i36 ,    i37 ,    i38 ,    i39 
i40 ,    i41 ,    i42 ,    i43 ,    i44 ,    i45 ,    i46 ,    i47 ,    i48 ,    i49 ,    i50 ,    i51 ,    i52 
i53 ,    i54 ,    i55 ,    i56 ,    i57 ,    i58 ,    i59 ,    i60 ,    i61 ,    i62 ,    i63 ,    i64 ,    i65 
i66 ,    i67 ,    i68 ,    i69 ,    i70 ,    i71 ,    i72 ,    i73 ,    i74 ,    i75 ,    i76 ,    i77 ,    i78 
i79 ,    i80 ,    i81 ,    i82 ,    i83 ,    i84 ,    i85 ,    i86 ,    i87 ,    i88 ,    i89 ,    i90 ,    i91 
i92 ,    i93 ,    i94 ,    i95 ,    i96 ,    i97 ,    i98 ,    i99 ,    i100


----     12 PARAMETER hist  histogram data

i1   17.175,    i2   84.327,    i3   55.038,    i4   30.114,    i5   29.221,    i6   22.405,    i7   34.983
i8   85.627,    i9    6.711,    i10  50.021,    i11  99.812,    i12  57.873,    i13  99.113,    i14  76.225
i15  13.069,    i16  63.972,    i17  15.952,    i18  25.008,    i19  66.893,    i20  43.536,    i21  35.970
i22  35.144,    i23  13.149,    i24  15.010,    i25  58.911,    i26  83.089,    i27  23.082,    i28  66.573
i29  77.586,    i30  30.366,    i31  11.049,    i32  50.238,    i33  16.017,    i34  87.246,    i35  26.511
i36  28.581,    i37  59.396,    i38  72.272,    i39  62.825,    i40  46.380,    i41  41.331,    i42  11.770
i43  31.421,    i44   4.655,    i45  33.855,    i46  18.210,    i47  64.573,    i48  56.075,    i49  76.996
i50  29.781,    i51  66.111,    i52  75.582,    i53  62.745,    i54  28.386,    i55   8.642,    i56  10.251
i57  64.125,    i58  54.531,    i59   3.152,    i60  79.236,    i61   7.277,    i62  17.566,    i63  52.563
i64  75.021,    i65  17.812,    i66   3.414,    i67  58.513,    i68  62.123,    i69  38.936,    i70  35.871
i71  24.303,    i72  24.642,    i73  13.050,    i74  93.345,    i75  37.994,    i76  78.340,    i77  30.003
i78  12.548,    i79  74.887,    i80   6.923,    i81  20.202,    i82   0.507,    i83  26.961,    i84  49.985
i85  15.129,    i86  17.417,    i87  33.064,    i88  31.691,    i89  32.209,    i90  96.398,    i91  99.360
i92  36.990,    i93  37.289,    i94  77.198,    i95  39.668,    i96  91.310,    i97  11.958,    i98  73.548
i99   5.542,    i100 57.630


----     12 PARAMETER maxhist              =       99.812  heighest bar


When we solve this we see the following results:


----     67 VARIABLE select.L  selection of bar

i10 1.000,    i11 1.000,    i12 1.000,    i13 1.000,    i14 1.000,    i15 1.000,    i16 1.000,    i17 1.000
i18 1.000,    i19 1.000,    i20 1.000,    i21 1.000,    i22 1.000,    i23 1.000,    i24 1.000,    i25 1.000
i26 1.000,    i27 1.000,    i28 1.000,    i29 1.000,    i30 1.000,    i31 1.000,    i32 1.000,    i33 1.000
i34 1.000,    i35 1.000,    i36 1.000,    i37 1.000,    i38 1.000,    i39 1.000,    i40 1.000,    i41 1.000
i42 1.000,    i43 1.000


----     67 VARIABLE lr.L   unselected bars are left (lr=0) or right (lr=1) of selection

i10  1.000,    i11  1.000,    i13  1.000,    i14  1.000,    i16  1.000,    i19  1.000,    i24  1.000,    i30  1.000
i35  1.000,    i36  1.000,    i37  1.000,    i40  1.000,    i42  1.000,    i44  1.000,    i45  1.000,    i46  1.000
i47  1.000,    i48  1.000,    i49  1.000,    i50  1.000,    i51  1.000,    i52  1.000,    i53  1.000,    i54  1.000
i55  1.000,    i56  1.000,    i57  1.000,    i58  1.000,    i59  1.000,    i60  1.000,    i61  1.000,    i62  1.000
i63  1.000,    i64  1.000,    i65  1.000,    i66  1.000,    i67  1.000,    i68  1.000,    i69  1.000,    i70  1.000
i71  1.000,    i72  1.000,    i73  1.000,    i74  1.000,    i75  1.000,    i76  1.000,    i77  1.000,    i78  1.000
i79  1.000,    i80  1.000,    i81  1.000,    i82  1.000,    i83  1.000,    i84  1.000,    i85  1.000,    i86  1.000
i87  1.000,    i88  1.000,    i89  1.000,    i90  1.000,    i91  1.000,    i92  1.000,    i93  1.000,    i94  1.000
i95  1.000,    i96  1.000,    i97  1.000,    i98  1.000,    i99  1.000,    i100 1.000


----     67 VARIABLE start.L               =       10.000  first selected bar
            VARIABLE width.L               =       34.000  number of selected bars
            VARIABLE height.L              =       11.049  shortest selected bar
            VARIABLE area.L                =      375.674  width*height


Cplex needed 200 seconds for this. We can help Cplex a bit by adding the ordering constraint: \[\mathit{lr}_i \ge \mathit{lr}_{i-1}\] This has a significant impact:


original formulationwith ordering constraint
Objective375.674375.674
Time (s)21024
Nodes95,8306,040
Iterations2,190,751153,255


A simpler model


There is a simpler approach. Instead of a scalar variable \(\mathit{start}\) we use an indexed variable, indicating where we switch from \(\mathit{select}_{i-1}=0\) to \(\mathit{select}_{i}=1\). In our picture:

We want contiguous selected bars, or in other words: only a single start variable can be 1. This can be modeled as \[\begin{align} & \mathit{start}_i \ge \mathit{select}_i-\mathit{select}_{i-1} \\ &\sum_i \mathit{start}_i \le 1 \end{align}\] The first inequality can be interpreted as the following implication: \[\mathit{select}_{i-1}=0 \>{\bf{and}}\>\mathit{select}_{i}=1 \Rightarrow  \mathit{start}_i=1 \] Note, that \(\mathit{start}_i\) is unrestricted in all other cases. This formulation is not uncommon in machine scheduling models where we want to limit the number start-ups. 

With this we can state the following quadratic model:

Quadratic Model
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\sum_i \color{darkred}{\mathit{selected}}_i \cdot \color{darkred}{\mathit{height}}\\ & \color{darkred}{\mathit{start}}_i \ge \color{darkred}{\mathit{select}}_i - \color{darkred}{\mathit{select}}_{i-1} && \forall i \\ & \sum_i \color{darkred}{\mathit{start}}_i \le 1 \\ & \color{darkred}{\mathit{select}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{height}} \le \color{darkblue} {\mathit{hist}}_i + (\color{darkblue}{\mathit{maxhist}} - \color{darkblue} {\mathit{hist}}_i) \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\ & \color{darkred}{\mathit{start}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{height}} \ge 0\\ \end{align}\]

This model can be linearized as we multiply a binary variable with a continuous variable [3]. The multiplication \[\begin{align} &y = \delta \cdot x \\ & \delta \in \{0,1\} \\ & x \in [0,U] \end{align}\] can be linearized as: \[\begin{align} & y \le  U \cdot \delta\\ & y \le x \\& y \ge x - U \cdot (1-\delta) \\ &  \delta \in \{0,1\} \\ & x,y \in [0,U] \end{align}\]

After a little bit of algebra (we have the complication that the variable \(\mathit{height}\) can exceed \(\mathit{hist}_i\) for non-selected bars), we arrive at the following linear model:

 
Linear Model
\[\begin{align}\max&\>\color{darkred}{\mathit{area}}=\sum_i \color{darkred}{\mathit{adjHeight}}_i \\ & \color{darkred}{\mathit{start}}_i \ge \color{darkred}{\mathit{select}}_i - \color{darkred}{\mathit{select}}_{i-1} && \forall i \\ & \sum_i \color{darkred}{\mathit{start}}_i \le 1 \\ & \color{darkred}{\mathit{adjHeight}}_i \le \color{darkred}{\mathit{select}}_i \cdot \color{darkblue} {\mathit{hist}}_i && \forall i \\ & \color{darkred}{\mathit{adjHeight}}_i \le \color{darkred}{\mathit{height}} && \forall i \\ & \color{darkred}{\mathit{adjHeight}}_i \ge \color{darkred}{\mathit{height}} - \color{darkblue}{\mathit{maxhist}} \cdot (1- \color{darkred}{\mathit{select}}_i) && \forall i \\ & \color{darkred}{\mathit{select}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{start}}_i \in \{0,1\} \\ & \color{darkred}{\mathit{height}} \ge 0\\ & \color{darkred}{\mathit{adjHeight}}_i \ge 0 \\ \end{align}\]


In this model, we use \[\mathit{adjHeight}_i = \begin{cases} \mathit{height} & \text{if $\mathit{select}_i=1$} \\ 0 & \text{if $\mathit{select}_i=0$}\end{cases}\]

The solution of the small data set for this model looks like: 
 

----     50 VARIABLE select.L  bar is selected

i3 1.000,    i4 1.000,    i5 1.000


----     50 VARIABLE start.L  first selected bar

i3 1.000


----     50 VARIABLE adjHeight.L  adjusted height: heigth*selected(i)

i3 4.000,    i4 4.000,    i5 4.000


----     50 VARIABLE height.L              =        4.000  area height
            VARIABLE area.L                =       12.000  


Of course, we also ran it with the large, random data set. The solution looks like: 
 

----     49 VARIABLE select.L  bar is selected

i10 1.000,    i11 1.000,    i12 1.000,    i13 1.000,    i14 1.000,    i15 1.000,    i16 1.000,    i17 1.000
i18 1.000,    i19 1.000,    i20 1.000,    i21 1.000,    i22 1.000,    i23 1.000,    i24 1.000,    i25 1.000
i26 1.000,    i27 1.000,    i28 1.000,    i29 1.000,    i30 1.000,    i31 1.000,    i32 1.000,    i33 1.000
i34 1.000,    i35 1.000,    i36 1.000,    i37 1.000,    i38 1.000,    i39 1.000,    i40 1.000,    i41 1.000
i42 1.000,    i43 1.000


----     49 VARIABLE start.L  first selected bar

i10 1.000


----     49 VARIABLE adjHeight.L  adjusted height: heigth*selected(i)

i10 11.049,    i11 11.049,    i12 11.049,    i13 11.049,    i14 11.049,    i15 11.049,    i16 11.049,    i17 11.049
i18 11.049,    i19 11.049,    i20 11.049,    i21 11.049,    i22 11.049,    i23 11.049,    i24 11.049,    i25 11.049
i26 11.049,    i27 11.049,    i28 11.049,    i29 11.049,    i30 11.049,    i31 11.049,    i32 11.049,    i33 11.049
i34 11.049,    i35 11.049,    i36 11.049,    i37 11.049,    i38 11.049,    i39 11.049,    i40 11.049,    i41 11.049
i42 11.049,    i43 11.049


----     49 VARIABLE height.L              =       11.049  area height
            VARIABLE area.L                =      375.674  


To compare the performance, let's add the statistics to our table:

MIQP original formulationMIQP with ordering constraintLinear MIP
Objective375.674375.674375.674
Time (s)210240.8
Nodes95,8306,0400
Iterations2,190,751153,255589


Cplex was able to solve the MIP model to optimality during pre-processing, so no branch-and-bound nodes were needed.

Conclusion


The Largest Rectangular Area in a Histogram is easy to explain. Writing down the corresponding optimization model is a  bit more difficult. In addition, different formulations can lead to dramatic differences in performance. Clearly the linearized model is the winner.

References


No comments:

Post a Comment