Tuesday, March 31, 2015

k-means clustering heuristic in GAMS and the XOR operator

In http://yetanothermathprogrammingconsultant.blogspot.com/2015/03/k-means-clustering-formulated-as-miqcp.html we were quite underwhelmed by the performance of MIQCPs (Mixed Integer Quadratically Constrained Problems).

A standard heuristic is easily formulated in GAMS:

$ontext

 
K-means clustering heuristic

$offtext


option seed=101;

sets
   k 
'clusters' /k1*k4/
   i 
'data points' /i1*i100/
   ik(i,k)
'assignment of points to clusters'
   xy
'coordinates of points' /x,y/
;
parameter
   m(k,xy)
'random clusters to generate data points'
   p(i,*) 
'data points'
   numk   
'number of clusters'
   cluster(i)   
'cluster number'
;

numk =
card(k);
alias(ii,i);

*------------------------------------------------
* generate random data
* points are around some clusters
*------------------------------------------------
m(k,xy) = uniform(0,4);
cluster(i) = uniformint(1,numk);
ik(i,k) = cluster(i)=
ord(k);
p(i,xy) =
sum(ik(i,k),m(k,xy)) + uniform(0,1);
display m,p;

sets
  ik(i,k)
'assignment of points to clusters'
  ik_prev(i,k)
'previous assignment of points to clusters'
  ik_diff
  trial  
'number of trials' /trial1*trial15/
  iter   
'max number of iterations' /iter1*iter20/
;
parameters
  n(k)      
'number of points assigned to cluster'
  c(k,xy)   
'centroids'
  notconverged 
'0=converged,else not converged'
  d(i,k)    
'squared distance'
  dclose(i) 
'distance closest cluster'
  trace(trial,iter)
'reporting'
;

loop(trial,

*------------------------------------------------
* Step 1
* Random assignment
*------------------------------------------------

      cluster(i) = uniformint(1,numk);
      ik(i,k) = cluster(i)=
ord(k);

      notConverged = 1;
     
loop(iter$notConverged,

*------------------------------------------------
* Step 2
* Calculate centroids
*------------------------------------------------
          n(k) =
sum(ik(i,k), 1);
          c(k,xy)$n(k) =
sum(ik(i,k), p(i,xy))/n(k);

*------------------------------------------------
* Step 3
* Re-assign points
*------------------------------------------------

          ik_prev(i,k) = ik(i,k);
          d(i,k) =
sum(xy, sqr(p(i,xy)-c(k,xy)));
          dclose(i) =
smin(k, d(i,k));
          ik(i,k) =
yes$(dclose(i) = d(i,k));

*------------------------------------------------
* Step 4
* Check convergence
*------------------------------------------------

          ik_diff(i,k) = ik(i,k)
xor ik_prev(i,k);
          notConverged =
card(ik_diff);

          trace(trial,iter) =
sum(ik(i,k),d(i,k));
      );
);

display trace;

The results show it is important to use different starting configurations (i.e. multiple trials):

----    100 PARAMETER trace  reporting

              iter1       iter2       iter3       iter4       iter5       iter6

trial1      366.178      28.135      14.566
trial2      271.458      26.741      14.566
trial3      316.975      23.912      14.566
trial4      299.522      24.511      14.566
trial5      346.148      77.747      14.566
trial6      313.978      64.730      14.865      14.566
trial7      310.735      27.412      14.566
trial8      330.829     213.308     184.504     102.191      76.611      74.210
trial9      356.897      90.286      16.112      14.566
trial10     345.086      82.716      76.154      76.146
trial11     354.545     169.648      75.626      74.350      71.929      67.385
trial12     310.257      52.175      18.800      14.566
trial13     289.293      23.505      14.566
trial14     337.024      20.783      14.566
trial15     336.969      28.747      14.566

      +       iter7       iter8       iter9      iter10

trial8       67.933      42.621      14.881      14.566
trial11      59.815      59.591

The consensus is that 14.566 is the optimal sum of the squared distances. Note that this was an extremely easy problem:

image

Notes:

  1. We use an xor operator to find the difference between if the sets ik_prev and ik. If there is no difference card(ik_diff)=0 and thus notConverged=0. See below for a small example illustrating the xor.
  2. The assignment
    ik(i,k) = yes$(dclose(i) = d(i,k));
    is incorrect if point i is exactly between two clusters. In that case we assign the point to two clusters. We can fix this by a more complicated loop:
    ik(i,k) = no;
    loop((i,k)$(dclose(i) = d(i,k)),
      ik(i,k)$(
    sum(ik(i,kk),1)=0) = yes;
    );

  3. The construct
    cluster(i) = uniformint(1,numk); ik(i,k) = cluster(i)=ord(k);
    cannot be simplified to 
    ik(i,k) = uniformint(1,numk)=ord(k);
    The latter version is incorrect.
  4. The assignment
    c(k,xy)$n(k) = sum(ik(i,k), p(i,xy))/n(k); 
    will keep a cluster with zero points at its current location.
The XOR operator

This fragment illustrates a few different ways to calculate the difference between two sets:

set
   i
/a*z/
   j(i)
/a,b/
   k(i)
/  b,c/
   diff(i)
'difference between sets: card(diff)=0 if equal'
;

diff(i) = (j(i)-k(i)) + (k(i)-j(i));
display diff;

diff(i) = j(i)
xor k(i);
display diff;

diff(i) = 1$j(i) <> 1$k(i);
display diff;

----      9 SET diff  difference between sets: card(diff)=0 if equal

a,    c

----     12 SET diff  difference between sets: card(diff)=0 if equal

a,    c

----     15 SET diff  difference between sets: card(diff)=0 if equal

a,    c


My preferred syntax diff(i)=j(i)<>k(i) is not allowed. In my opinion, notation should really try to help readability, and not show off mathematical prowess.