Thursday, March 26, 2015

K-means clustering formulated as MIQCP optimization problem

It is instructive to formulate the k-means clustering problem as a Mixed-Integer Quadratically Constrained Problem (MIQCP) in GAMS. It looks trivial at first sight but in fact requires a little bit of attention.

We start with a small data set. E.g. 3 clusters and 50 data points in two dimensions (x and y). I.e. our sets look like:

image

We generate some random clusters as follows:

image

Should be an easy problem:

image

The clustering model could look like:

image

We solver here for two things simultaneously:

  1. The location of the clusters (continuous variables c)
  2. The assignment of points to clusters (binary variables x)

The equation distances computes all (squared) distances between the points and the clusters. We multiply by x so that we only sum these distances when a point is assigned to a cluster (in that case x=1). Finally we make sure each point is assigned to exactly one cluster.

Unfortunately this model is non-convex, so we can solve this only with global solvers such as Baron or GloMiQo.

Convexification

We can make the model convex as follows:

image

We use a big-M construct where the upper bound of d forms a reasonable tight value for M. Here we have that d(i,k)=0 if point i is not assigned to cluster k (note that d≥0). The objective has become linear in this case.

Now we can solve the problem with commercial solvers like Cplex and Gurobi. The results (with the location of the clusters from variable c) are:

image

Performance

For larger data sets it turns out the performance is not very good. E.g. with 100 points and 4 clusters we have 400 binary variables and it takes hours or days to prove optimality. Both Cplex and Gurobi are not able to prove optimality even when we load an optimal solution using MIPSTART. Xpress has even more troubles, and returns bad solutions. We see more often that MIQP/MIQCP solvers are not as robust and fast as (linear) MIP solvers. It is too bad the MIQCP solvers are doing so poorly on this problem.

Using a solver would have opened up possibly interesting extensions like additional constraints on the clustering, such as “at least n points in each cluster”.

Bug in Gurobi

We were able to confuse Gurobi by writing the model as:

image

It now accepts the model (although it is non-convex) and produces a bogus solution: all points are assigned to cluster 1. Gurobi should have refused to solve the problem (with some message about Q matrix not PSD).