Monday, March 1, 2010

Data Clustering Algorithm - K-means

K-means is a very basic but still very powerful clustering algorithm. It aims to partition observations into k clusters in which each observation belongs to the cluster with the nearest mean (centroid). It is probably the most widely used clustering algorithm among machine learning techniques.

Below, we walk through the major steps of a simple K-means implementation and some of their details.

Step 1: Initialization of K centroids

The centroid is an important concept here: it is the center of one cluster. K-means is an iterative method. The centroids are updated through the iterations by computing the means of all datapoints currently assigned to each cluster. However, the initialization of the centroids is tricky and has a great impact on the clustering result. By default, the centroids are selected randomly from the datapoints in the given dataset, but various improvements are available. Random, Forgy, MacQueen and Kaufman are four good examples. Discussions about them are available here.
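As a minimal sketch, the default random initialization (often called the Forgy method) takes only two MATLAB lines; dataset and k are the same names used in the full implementation at the end of this post:

% Forgy-style initialization: pick k distinct datapoints as centroids
% (dataset is assumed to hold one datapoint per row)
perm = randperm(size(dataset, 1));   % shuffle the row indices
centroids = dataset(perm(1:k), :);   % take the first k shuffled rows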

Step 2: Assignment of datapoints to nearest centroids

In this step, we need to compute the distance from each datapoint to each of the K centroids. Based on this distance information, we assign each datapoint to its "nearest" centroid. The word "nearest" is quoted because another tricky choice arises: what distance measure are we going to use? Again, the default is the Euclidean distance, but the choice should depend on the situation. New distance measures, such as walk-based or ontology-based ones, keep emerging for different needs.
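As a rough sketch of this step (assuming dataset and centroids are matrices with one datapoint or centroid per row, as in the implementation below), the assignment can be done like this:

% Assignment step: label each datapoint with its nearest centroid,
% using plain Euclidean distance
num = size(dataset, 1);
k = size(centroids, 1);
distTable = zeros(num, k);
for j = 1 : k
    diff = dataset - repmat(centroids(j, :), num, 1);
    distTable(:, j) = sqrt(sum(diff .^ 2, 2));
end
[~, labels] = min(distTable, [], 2);   % index of the nearest centroid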

Step 3: Update the centroids

After we have obtained this distance information (we can save it in a hash table or a similar structure), we need to update the centroids. Each new centroid is computed by taking the mean of all datapoints in its "current" cluster. The word "current" is quoted because the clusters change from iteration to iteration.
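In MATLAB, this step reduces to a per-cluster mean; the sketch below assumes labels comes from the assignment step above, and guarding against empty clusters is a design choice rather than the only way to handle them:

% Update step: each new centroid is the mean of its current members
for i = 1 : k
    members = dataset(labels == i, :);
    if ~isempty(members)
        centroids(i, :) = mean(members, 1);
    end
end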

All we need to do now is to iterate these steps until the algorithm converges. Yes, it will, usually. The convergence properties are discussed here.
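A common stopping test (one reasonable choice, not the only one) is to quit when no centroid moves more than a small threshold, with a maximum iteration count as a safety net; the names below match the implementation that follows:

% Convergence check: compare the new centroids against the saved copy
moved = max(max(abs(centroids - tempCentroids)));
if moved < threshold || iterCount > maxIter
    notConverge = 0;   % leave the main loop
end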

I have attached a MATLAB version of the K-means algorithm here. It is a very simple implementation: random initialization and plain Euclidean distance are adopted. It is only meant to give you a general idea of what this algorithm does.

-----------------K-means Algorithm - Matlab Implementation------------------


function [labels] = KMeansCluster(dataset, k)
% This is a MATLAB version of a simple K-means algorithm
% 'dataset' is the data ready for clustering, one datapoint per row
% 'k' is the number of clusters

% Dimension of each datapoint
dim = size(dataset, 2);
% Number of datapoints in the given dataset
num = size(dataset, 1);
% Initialize labels for datapoints
labels = zeros(num, 1);
% Initialize centroids for this dataset
centroids = zeros(k, dim);

% Initialize k centroids
% Default method - randomly choose k distinct datapoints
perm = randperm(num);
centIDs = perm(1:k);
for i = 1 : k
    centroids(i, :) = dataset(centIDs(i), :);
end

% K-means begins
notConverge = 1;
threshold = 0.001;
maxIter = 1000;
iterCount = 0;

while notConverge
    % Compute distances from each datapoint to these k centroids, and
    % assign each datapoint a cluster label
    % Create a table to save all intermediate distance information
    distTable = zeros(num, k);
    for i = 1 : num
        data = dataset(i, :);
        for j = 1 : k
            % Use simple Euclidean distance
            distTable(i, j) = pdist([data; centroids(j, :)], 'euclidean');
        end
    end
    % Assignment step
    [~, labels] = min(distTable, [], 2);

    % Update step: compute new centroids
    % Save previous centroids for convergence checking
    tempCentroids = centroids;
    for i = 1 : k
        total = zeros(1, dim);
        count = 0;
        for j = 1 : num
            if labels(j) == i
                total = total + dataset(j, :);
                count = count + 1;
            end
        end
        if count > 0
            % New centroid is the mean of its current members
            centroids(i, :) = total / count;
        end
    end

    % Check convergence: stop when no centroid moves more than the threshold
    if max(max(abs(centroids - tempCentroids))) < threshold
        notConverge = 0;
    end
    iterCount = iterCount + 1;
    if iterCount > maxIter
        break;
    end
end
end
