It is instructive to formulate the k-means clustering problem as a Mixed-Integer Quadratically Constrained Problem (MIQCP) in GAMS. It looks trivial at first sight but in fact requires a little bit of attention.
We start with a small data set. E.g. 3 clusters and 50 data points in two dimensions (x and y). I.e. our sets look like:
We generate some random clusters as follows:
Should be an easy problem:
The clustering model could look like:
We solver here for two things simultaneously:
- The location of the clusters (continuous variables c)
- The assignment of points to clusters (binary variables x)
The equation distances computes all (squared) distances between the points and the clusters. We multiply by x so that we only sum these distances when a point is assigned to a cluster (in that case x=1). Finally we make sure each point is assigned to exactly one cluster.
Unfortunately this model is non-convex, so we can solve this only with global solvers such as Baron or GloMiQo.
We can make the model convex as follows:
We use a big-M construct where the upper bound of d forms a reasonable tight value for M. Here we have that d(i,k)=0 if point i is not assigned to cluster k (note that d≥0). The objective has become linear in this case.
Now we can solve the problem with commercial solvers like Cplex and Gurobi. The results (with the location of the clusters from variable c) are:
For larger data sets it turns out the performance is not very good. E.g. with 100 points and 4 clusters we have 400 binary variables and it takes hours or days to prove optimality. Both Cplex and Gurobi are not able to prove optimality even when we load an optimal solution using MIPSTART. Xpress has even more troubles, and returns bad solutions. We see more often that MIQP/MIQCP solvers are not as robust and fast as (linear) MIP solvers. It is too bad the MIQCP solvers are doing so poorly on this problem.
Using a solver would have opened up possibly interesting extensions like additional constraints on the clustering, such as “at least n points in each cluster”.
Bug in Gurobi
We were able to confuse Gurobi by writing the model as:
It now accepts the model (although it is non-convex) and produces a bogus solution: all points are assigned to cluster 1. Gurobi should have refused to solve the problem (with some message about Q matrix not PSD).