In http://yetanothermathprogrammingconsultant.blogspot.com/2015/03/k-means-clustering-formulated-as-miqcp.html we were quite underwhelmed by the performance of MIQCPs (Mixed Integer Quadratically Constrained Problems).

A standard heuristic is easily formulated in GAMS:

$ontext $offtext option seed=101; sets k 'clusters' /k1*k4/ i 'data points' /i1*i100/ ik(i,k) 'assignment of points to clusters' xy 'coordinates of points' /x,y/ ; parameter m(k,xy) 'random clusters to generate data points' p(i,*) 'data points' numk 'number of clusters' cluster(i) 'cluster number' ; numk = card(k); alias(ii,i); *------------------------------------------------ * generate random data * points are around some clusters *------------------------------------------------ m(k,xy) = uniform(0,4); cluster(i) = uniformint(1,numk); ik(i,k) = cluster(i)= ord(k); p(i,xy) = sum(ik(i,k),m(k,xy)) + uniform(0,1); display m,p; sets ik(i,k) 'assignment of points to clusters' ik_prev(i,k) 'previous assignment of points to clusters' ik_diff trial 'number of trials' /trial1*trial15/ iter 'max number of iterations' /iter1*iter20/ ; parameters n(k) 'number of points assigned to cluster' c(k,xy) 'centroids' notconverged '0=converged,else not converged' d(i,k) 'squared distance' dclose(i) 'distance closest cluster' trace(trial,iter) 'reporting' ; loop(trial, *------------------------------------------------ * Step 1 * Random assignment *------------------------------------------------ cluster(i) = uniformint(1,numk); ik(i,k) = cluster(i)= ord(k); notConverged = 1; loop(iter$notConverged, *------------------------------------------------ * Step 2 * Calculate centroids *------------------------------------------------ n(k) = sum(ik(i,k), 1); c(k,xy)$n(k) = sum(ik(i,k), p(i,xy))/n(k); *------------------------------------------------ * Step 3 * Re-assign points *------------------------------------------------ ik_prev(i,k) = ik(i,k); d(i,k) = sum(xy, sqr(p(i,xy)-c(k,xy))); dclose(i) = smin(k, d(i,k)); ik(i,k) = yes$(dclose(i) = d(i,k)); *------------------------------------------------ * Step 4 * Check convergence *------------------------------------------------ ik_diff(i,k) = ik(i,k) xor ik_prev(i,k); notConverged = card(ik_diff); trace(trial,iter) = sum(ik(i,k),d(i,k)); ); ); display trace; |

The results show it is important to use different starting configurations (i.e. multiple trials):

---- 100 PARAMETER trace reporting iter1 iter2 iter3 iter4 iter5 iter6 trial1 366.178 28.135 14.566 + iter7 iter8 iter9 iter10 trial8 67.933 42.621 14.881 14.566 |

The consensus is that 14.566 is the optimal sum of the squared distances. Note that this was an extremely easy problem:

Notes:

- We use an
**xor**operator to find the difference between if the sets**ik_prev**and**ik**. If there is no difference**card**(**ik_diff**)=0 and thus**notConverged**=0. See below for a small example illustrating the**xor**. - The assignment

ik(i,k) =**yes**$(dclose(i) = d(i,k));

is incorrect if point**i**is exactly between two clusters. In that case we assign the point to two clusters. We can fix this by a more complicated loop:

ik(i,k) =**no**;**loop**((i,k)$(dclose(i) = d(i,k)),

ik(i,k)$(**sum**(ik(i,kk),1)=0) =**yes**;

);

- The construct

cluster(i) = uniformint(1,numk); ik(i,k) = cluster(i)=**ord**(k);

cannot be simplified to

ik(i,k) = uniformint(1,numk)=**ord**(k);.

The latter version is incorrect. - The assignment

c(k,xy)$n(k) =**sum**(ik(i,k), p(i,xy))/n(k);

will keep a cluster with zero points at its current location.

##### The XOR operator

This fragment illustrates a few different ways to calculate the difference between two sets:

set i /a*z/ j(i) /a,b/ k(i) / b,c/ diff(i) 'difference between sets: card(diff)=0 if equal' ; diff(i) = (j(i)-k(i)) + (k(i)-j(i)); display diff; diff(i) = j(i) xor k(i); display diff; diff(i) = 1$j(i) <> 1$k(i); display diff; |

---- 9 SET diff difference between sets: card(diff)=0 if equal a, c ---- 12 SET diff difference between sets: card(diff)=0 if equal a, c ---- 15 SET diff difference between sets: card(diff)=0 if equal a, c |

My preferred syntax

**diff(i)=j(i)<>k(i)**is not allowed. In my opinion, notation should really try to help readability, and not show off mathematical prowess.