Consider the following problem, original from (1):

We have a population of items with values having a mean \(\mu\). Furthermore we have a number of subsets of this population. Each subset \(i\) has a known count \(n_i\) and mean \(\mu_i\). It can be assumed the subsets do not overlap: the pairwise intersection between two subsets is always empty. We want to select \(N\) subsets such that the total average of all selected items is as close to \(\mu\) as possible.

OK, let’s see. Let:

\[\delta_i = \begin{cases}1 & \text{if subset $i$ is selected}\\ 0 & \text{otherwise}\end{cases}\] |

We could model the problem as:

\[\boxed{\begin{align} \min \> & \left| \mu – \frac{\sum_i n_i \mu_i \delta_i}{\sum_i n_i \delta_i} \right|\\ & \sum_i \delta_i = N \\ & \delta_i \in \{0,1\} \end{align}} \] |

where \(|.|\) indicates `**absolute value**`.

The objective is somewhat complicated. The ratio is the average of the items in the selected subsets. We can see this as follows. The average is the sum of the values of the items in the selected subsets divided by the number if selected items. Of course, the sum of values in a subset \(i\) is \(n_i \mu_i\) which explains the numerator in the fraction.

This model is quite nonlinear, but there is a way to linearize this. The first thing we can do is introduce a variable \(y\) defined by

\[y=\frac{\sum_i n_i \mu_i \delta_i}{\sum_i n_i \delta_i}\] |

and reformulate as:

\[ \sum_i n_i (y \delta_i ) = \sum_i n_i \mu_i \delta_i\] |

This is nonlinear because of the multiplication \(y\cdot\delta_i\). However, a multiplication of a binary variable and a continuous variable can be linearized with a little bit of effort (2). Let’s introduce a new variable \(x_i = y \cdot \delta_i\), then we can form the linear inequalities:

\[\begin{align} &y^{LO} \cdot \delta_i \le x_i \le y^{UP} \cdot \delta_i\\ &y-y^{UP}\cdot (1-\delta_i) \le x_i \le y-y^{LO}\cdot (1-\delta_i) \end{align}\] |

Here \(y^{LO},y^{UP}\) is the lower- and upper-bound on \(y\). We can use:

\[\begin{align} &y^{LO} := \min_i \mu_i\\ &y^{UP} := \max_i \mu_i \end{align}\] |

With this we can rewrite the condition \(\sum_i n_i (y \delta_i) = \sum_i n_i \mu_i \delta_i\) as:

\[\sum_i n_i x_i = \sum_i n_i \mu_i \delta_i\] |

The absolute value can be handled in a standard way, e.g.:

\[\begin{align} \min\>& z\\ &-z \le \mu – y \le z \end{align}\] |

Now we have all the tools to write the complete linear model:

\[\boxed{\begin{align} \min\>& z\\ &-z \le \mu – y \le z \\ &y^{LO} \cdot \delta_i \le x_i \le y^{UP} \cdot \delta_i\\ &y-y^{UP}\cdot (1-\delta_i) \le x_i \le y-y^{LO}\cdot (1-\delta_i) \\ &\sum_i n_i x_i = \sum_i n_i \mu_i \delta_i\\ & \sum_i \delta_i = N \\ & \delta_i \in \{0,1\} \\ & z \ge 0 \\ & y^{LO} \le y \le y^{UP} \\ & x_i \>\text{free} \end{align}}\] |

What if the subsets can overlap? That means a data element \(a_k\) can be a member of multiple subsets. If we want it to be counted just once, we need to consider the data elements themselves. This implies the need to introduce a binary variable that indicates whether \(a_k\) is in any of the selected subsets. I believe the high-level model can look like:

\[\boxed{\begin{align} \min\> & \left| \mu – \frac{\sum_k a_k \alpha_k}{\sum_k \alpha_k} \right|\\ & \alpha_k = \prod_i s_{i,k} \delta_i\\ & \sum_i \delta_i = N \\ & \delta_i, \alpha_k \in \{0,1\} \end{align}}\] |

Here we have data: \(s_{i,k}=1\) if value \(a_k\) is in subset \(i\). This can be linearized in a similar fashion.

##### References

- http://stackoverflow.com/questions/41879502/find-subset-with-similar-mean-as-full-set
- Multiplication of a continuous and a binary variable, http://yetanothermathprogrammingconsultant.blogspot.com/2008/05/multiplication-of-continuous-and-binary.html