Problem
The problem is easy to describe: given the coordinates of points \(p_i\), find the central point \(x\) that minimizes the sum of the distances from \(x\) to \(p_i\) [1]. This point is sometimes called the geometric median [2].
Data
We just use some random data on the unit square \([0,1]\times[0,1]\):
The last entry labeled mean is just the average of the \(x\)- and \(y\)-coordinates.---- 13 PARAMETER p points (randomly generated) x y point1 0.17174713 0.84326671 point2 0.55037536 0.30113790 point3 0.29221212 0.22405287 point4 0.34983050 0.85627035 point5 0.06711372 0.50021067 point6 0.99811763 0.57873338 point7 0.99113304 0.76225047 point8 0.13069248 0.63971876 point9 0.15951786 0.25008053 point10 0.66892861 0.43535638 point11 0.35970027 0.35144137 point12 0.13149159 0.15010179 point13 0.58911365 0.83089281 point14 0.23081574 0.66573446 point15 0.77585761 0.30365848 point16 0.11049229 0.50238487 point17 0.16017276 0.87246231 point18 0.26511455 0.28581432 point19 0.59395592 0.72271907 point20 0.62824868 0.46379787 point21 0.41330699 0.11769536 point22 0.31421227 0.04655151 point23 0.33855027 0.18209959 point24 0.64572713 0.56074555 point25 0.76996172 0.29780586 point26 0.66110626 0.75582167 point27 0.62744750 0.28386420 point28 0.08642462 0.10251467 point29 0.64125115 0.54530950 point30 0.03152485 0.79236064 point31 0.07276700 0.17566105 point32 0.52563261 0.75020767 point33 0.17812371 0.03414099 point34 0.58513117 0.62122998 point35 0.38936190 0.35871415 point36 0.24303462 0.24642154 point37 0.13050280 0.93344972 point38 0.37993791 0.78340046 point39 0.30003426 0.12548322 point40 0.74887411 0.06923246 point41 0.20201556 0.00506586 point42 0.26961305 0.49985148 point43 0.15128587 0.17416945 point44 0.33063773 0.31690605 point45 0.32208696 0.96397664 point46 0.99360221 0.36990306 point47 0.37288857 0.77197833 point48 0.39668414 0.91309632 point49 0.11957773 0.73547889 point50 0.05541847 0.57629980 mean 0.39042709 0.47299042
NLP Model
A straightforward non-linear programming model can look like:
Unconstrained NLP Model |
---|
\[\min \sum_i \sqrt{\sum_c (\color{darkred}x_c-\color{darkblue}p_{i,c})^2 }\] |
We use \(c = \{x,y\}\), i.e. we have \(x\) and \(y\)-coordinates. Note that we use \(x\) in two different contexts: element of set \(c\), being the \(x\)-coordinate, and the decision variable \(x_c\).
We can use the mean as a very good starting point to help the NLP solver. I.e. \[x_c := \frac{\displaystyle\sum_{i=1}^n p_{i,c}}{n}\]
The picture below shows why the mean is such a good starting point:
Optimal center point is close to mean point |
The numeric values are here:
---- 45 PARAMETER results x(center) vs mean x y sumdist mean 0.39042709 0.47299042 18.02414861 x 0.37048298 0.43857572 17.96891985
The sumdist column shows the objective values for these two points.
This is an easy NLP problem. Most NLP solvers just need a few iterations. With a system like GAMS or AMPL we get exact gradients automatically. That is much preferable to finite differences which seems the prevalent method people use in an R or Python environment.
It was pointed out to me that the problem is not smooth: the gradient of \(\sqrt{x}\) is not defined at zero, This can only happen when our point \(x\) exactly coincides with a data point (i.e. all coordinates are exactly the same). We can protect against this by using: \[\min \sum_i \sqrt{\varepsilon+\sum_c (x_c-p_{i,c})^2 }\] for a very small constant \(\varepsilon\gt 0\).
Cone programming I
The above problem can also be written as a cone programming problem. This will allow us to use a different class of solvers to work on this problem. Here we use CVXPY [3] to express the model. The Python code can look like:
import cvxpy as cp x = cp.Variable(2) # center point obj = cp.Minimize( cp.sum( [ cp.norm(x-p[i,:]) for i in range(N) ] ) ) prob = cp.Problem(obj) objval = prob.solve(solver=cp.SCS, verbose=True)
This is very much a straight translation of our unconstrained NLP model. Although we only declared two \(x\) variables, behind the scenes the model is blown up to a rather large one. We can see from the log:
---------------------------------------------------------------------------- SCS v2.1.1 - Splitting Conic Solver (c) Brendan O'Donoghue, Stanford University, 2012 ---------------------------------------------------------------------------- Lin-sys: sparse-direct, nnz in A = 150 eps = 1.00e-04, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00 acceleration_lookback = 0, rho_x = 1.00e-03 Variables n = 52, constraints m = 150 Cones: soc vars: 150, soc blks: 50
The generated SOCP (second order cone programming) model is larger, but also very sparse. The solver has no problem solving it very quickly. With SOCP solvers we usually don't worry about an initial point like we used in the NLP model.
Cone programming II
If we don't use a modeling tool that can do these transformations automatically, we can use a DIY approach. Second-order cone constraints can be stated as: \[||A_i x+b_i||_2 \le c_i^Tx + d_i \>\>\forall i\] This would imply that we can write our model as:
SOCP Model Attempt |
---|
\[\begin{align} \min & \sum_i \color{darkred} d_i \\ & \color{darkred} d_i^2 \ge \sum_c (\color{darkred}x_c-\color{darkblue}p_{i,c})^2 && \forall i \\ &\color{darkred} d_i \ge 0, \color{darkred} x_c \text{ free}\end{align} \] |
Unfortunately this will yield a message like: CPLEX Error 5002: 'e(point1)' is not convex or Constraint 'e(point1)'(0) is not convex. Q should be positive semidefinite in a constraint with finite upper bound. We can repair this as follows:
This now solves quickly. We can now understand that CVXPY did quite a few steps before passing the model to the solver. As argued in [4], it is much better if the modeling system takes care of these reformulations. Some of them are not immediate obvious, and hand-crafted reformulations can be error-prone.
The NLP model has just 2 variables \(x_c\). This SOCP model has 152 variables. Using \(n=50\) data points, we added \(3n\) variables. When looking at the number of constraints we see a similar thing. The NLP model has no constraints, but this SOCP model has \(3n=150\) constraints (\(n\) of them cones).
Repaired SOCP Model |
---|
\[\begin{align} \min & \sum_i \color{darkred} d_i \\ & \color{darkred} d_i^2 \ge \sum_c \color{darkred}y_{i,c}^2 && \forall i \\ & \color{darkred} y_{i,c} = \color{darkred} x_c -\color{darkblue}p_{i,c} && \forall i,c \\ &\color{darkred} d_i \ge 0, \color{darkred} x_c \text{ free}, \color{darkred} y_{i,c} \text{ free}\end{align} \] |
This now solves quickly. We can now understand that CVXPY did quite a few steps before passing the model to the solver. As argued in [4], it is much better if the modeling system takes care of these reformulations. Some of them are not immediate obvious, and hand-crafted reformulations can be error-prone.
The NLP model has just 2 variables \(x_c\). This SOCP model has 152 variables. Using \(n=50\) data points, we added \(3n\) variables. When looking at the number of constraints we see a similar thing. The NLP model has no constraints, but this SOCP model has \(3n=150\) constraints (\(n\) of them cones).
Conclusion
The min sum distance problem has a simple NLP formulation which can be improved by using a good initial point. It can also be formulated as a SOCP problem. Using high-level modeling tools this is not difficult. Without automatic reformulations things become a bit less obvious.
References
- The point that minimizes the sum of euclidean distances to a set of n points, https://stackoverflow.com/questions/57277247/the-point-that-minimizes-the-sum-of-euclidean-distances-to-a-set-of-n-points
- Geometric median, https://en.wikipedia.org/wiki/Geometric_median
- https://www.cvxpy.org/
- Victor Zverovich, Robert Fourer, Automatic Reformulation of Second-Order Cone Programming Problems, https://ampl.com/MEETINGS/TALKS/2015_01_Richmond_2E.2.pdf
No comments:
Post a Comment