Consider the following problem, original from (1):
We have a population of items with values having a mean μ. Furthermore we have a number of subsets of this population. Each subset i has a known count ni and mean μi. It can be assumed the subsets do not overlap: the pairwise intersection between two subsets is always empty. We want to select N subsets such that the total average of all selected items is as close to μ as possible.
OK, let’s see. Let:
δi={1if subset i is selected0otherwise |
We could model the problem as:
min|μ–∑iniμiδi∑iniδi|∑iδi=Nδi∈{0,1} |
where |.| indicates `absolute value`.
The objective is somewhat complicated. The ratio is the average of the items in the selected subsets. We can see this as follows. The average is the sum of the values of the items in the selected subsets divided by the number if selected items. Of course, the sum of values in a subset i is niμi which explains the numerator in the fraction.
This model is quite nonlinear, but there is a way to linearize this. The first thing we can do is introduce a variable y defined by
y=∑iniμiδi∑iniδi |
and reformulate as:
∑ini(yδi)=∑iniμiδi |
This is nonlinear because of the multiplication y⋅δi. However, a multiplication of a binary variable and a continuous variable can be linearized with a little bit of effort (2). Let’s introduce a new variable xi=y⋅δi, then we can form the linear inequalities:
yLO⋅δi≤xi≤yUP⋅δiy−yUP⋅(1−δi)≤xi≤y−yLO⋅(1−δi) |
Here yLO,yUP is the lower- and upper-bound on y. We can use:
yLO:=miniμiyUP:=maxiμi |
With this we can rewrite the condition ∑ini(yδi)=∑iniμiδi as:
∑inixi=∑iniμiδi |
The absolute value can be handled in a standard way, e.g.:
minz−z≤μ–y≤z |
Now we have all the tools to write the complete linear model:
minz−z≤μ–y≤zyLO⋅δi≤xi≤yUP⋅δiy−yUP⋅(1−δi)≤xi≤y−yLO⋅(1−δi)∑inixi=∑iniμiδi∑iδi=Nδi∈{0,1}z≥0yLO≤y≤yUPxifree |
What if the subsets can overlap? That means a data element ak can be a member of multiple subsets. If we want it to be counted just once, we need to consider the data elements themselves. This implies the need to introduce a binary variable that indicates whether ak is in any of the selected subsets. I believe the high-level model can look like:
min|μ–∑kakαk∑kαk|αk=∏isi,kδi∑iδi=Nδi,αk∈{0,1} |
Here we have data: si,k=1 if value ak is in subset i. This can be linearized in a similar fashion.
References
- http://stackoverflow.com/questions/41879502/find-subset-with-similar-mean-as-full-set
- Multiplication of a continuous and a binary variable, http://yetanothermathprogrammingconsultant.blogspot.com/2008/05/multiplication-of-continuous-and-binary.html
No comments:
Post a Comment