Yet Another Math Programming Consultant: k-means clustering heuristic in GAMS and the XOR operator

In http://yetanothermathprogrammingconsultant.blogspot.com/2015/03/k-means-clustering-formulated-as-miqcp.html we were quite underwhelmed by the performance of MIQCPs (Mixed Integer Quadratically Constrained Problems).

A standard heuristic is easily formulated in GAMS:

$ontext

K-means clustering heuristic

$offtext

option seed=101;

sets
   k 'clusters' /k1*k4/
   i 'data points' /i1*i100/
   ik(i,k) 'assignment of points to clusters'
   xy 'coordinates of points' /x,y/
;
parameter
   m(k,xy) 'random clusters to generate data points'
   p(i,*) 'data points'
   numk    'number of clusters'
   cluster(i)    'cluster number'
;

numk = card(k);
alias(ii,i);

*------------------------------------------------
* generate random data
* points are around some clusters
*------------------------------------------------
m(k,xy) = uniform(0,4);
cluster(i) = uniformint(1,numk);
ik(i,k) = cluster(i)=ord(k);
p(i,xy) = sum(ik(i,k),m(k,xy)) + uniform(0,1);
display m,p;

sets
ik(i,k) 'assignment of points to clusters'
ik_prev(i,k) 'previous assignment of points to clusters'
ik_diff
trial   'number of trials' /trial1*trial15/
iter    'max number of iterations' /iter1*iter20/
;
parameters
n(k)       'number of points assigned to cluster'
c(k,xy)    'centroids'
notconverged '0=converged,else not converged'
d(i,k)     'squared distance'
dclose(i) 'distance closest cluster'
trace(trial,iter) 'reporting'
;

loop(trial,

*------------------------------------------------
* Step 1
* Random assignment
*------------------------------------------------

      cluster(i) = uniformint(1,numk);
      ik(i,k) = cluster(i)=ord(k);

      notConverged = 1;
      loop(iter$notConverged,

*------------------------------------------------
* Step 2
* Calculate centroids
*------------------------------------------------
          n(k) = sum(ik(i,k), 1);
          c(k,xy)$n(k) = sum(ik(i,k), p(i,xy))/n(k);

*------------------------------------------------
* Step 3
* Re-assign points
*------------------------------------------------

          ik_prev(i,k) = ik(i,k);
          d(i,k) = sum(xy, sqr(p(i,xy)-c(k,xy)));
          dclose(i) = smin(k, d(i,k));
          ik(i,k) = yes$(dclose(i) = d(i,k));

*------------------------------------------------
* Step 4
* Check convergence
*------------------------------------------------

          ik_diff(i,k) = ik(i,k) xor ik_prev(i,k);
          notConverged = card(ik_diff);

          trace(trial,iter) = sum(ik(i,k),d(i,k));
      );
);

display trace;

The results show it is important to use different starting configurations (i.e. multiple trials):

---- 100 PARAMETER trace reporting

iter1 iter2 iter3 iter4 iter5 iter6

trial1      366.178      28.135      14.566
trial2      271.458      26.741      14.566
trial3      316.975      23.912      14.566
trial4      299.522      24.511      14.566
trial5      346.148      77.747      14.566
trial6      313.978      64.730      14.865      14.566
trial7      310.735      27.412      14.566
trial8      330.829     213.308     184.504     102.191      76.611      74.210
trial9      356.897      90.286      16.112      14.566
trial10     345.086      82.716      76.154      76.146
trial11     354.545     169.648      75.626      74.350      71.929      67.385
trial12     310.257      52.175      18.800      14.566
trial13     289.293      23.505      14.566
trial14     337.024      20.783      14.566
trial15     336.969      28.747      14.566

+ iter7 iter8 iter9 iter10

trial8 67.933 42.621 14.881 14.566
trial11 59.815 59.591

The consensus is that 14.566 is the optimal sum of the squared distances. Note that this was an extremely easy problem:

Notes:

We use an xor operator to find the difference between if the sets ik_prev and ik. If there is no difference card(ik_diff)=0 and thus notConverged=0. See below for a small example illustrating the xor.
The assignment
ik(i,k) = yes$(dclose(i) = d(i,k));
is incorrect if point i is exactly between two clusters. In that case we assign the point to two clusters. We can fix this by a more complicated loop:
ik(i,k) = no;
loop((i,k)$(dclose(i) = d(i,k)),
ik(i,k)$(sum(ik(i,kk),1)=0) = yes;
);
The construct
cluster(i) = uniformint(1,numk); ik(i,k) = cluster(i)=ord(k);
cannot be simplified to
ik(i,k) = uniformint(1,numk)=ord(k);.
The latter version is incorrect.
The assignment
c(k,xy)$n(k) = sum(ik(i,k), p(i,xy))/n(k);
will keep a cluster with zero points at its current location.

The XOR operator

This fragment illustrates a few different ways to calculate the difference between two sets:

set
   i /a*z/
   j(i) /a,b/
   k(i) / b,c/
   diff(i) 'difference between sets: card(diff)=0 if equal'
;

diff(i) = (j(i)-k(i)) + (k(i)-j(i));
display diff;

diff(i) = j(i) xor k(i);
display diff;

diff(i) = 1$j(i) <> 1$k(i);
display diff;

---- 9 SET diff difference between sets: card(diff)=0 if equal

a, c

---- 12 SET diff difference between sets: card(diff)=0 if equal

a, c

---- 15 SET diff difference between sets: card(diff)=0 if equal

a, c

My preferred syntax diff(i)=j(i)<>k(i) is not allowed. In my opinion, notation should really try to help readability, and not show off mathematical prowess.

Yet Another Math Programming Consultant

Tuesday, March 31, 2015

k-means clustering heuristic in GAMS and the XOR operator

The XOR operator

No comments:

Post a Comment