Clustering

Problem Description

Clustering problems are described in Section 8.3 of .

Test Data

• 3gauss.mat - 200 points from a mixture of 3 Gaussian distributions.
• banana.mat - 200 points from a banana data set.
• spiral.mat - 200 points from a spiral data set.
• datasets.mat - Larger data sets of the above 3 types.

Program Descriptions

• NCE.m
This program finds a set of cluster means via the CE method with (independent) Normal updating.

Usage:
Call the program from MATLAB, with the following syntax:

[ mu , count , score ] = NCE( N , g , alpha , k, data , modif , drplot , c , sigma_0 )

Example: [ mu , count , score ] = NCE( 1000 , 10 , 0.7 , 5 , DATA , 0 , 1 )

 Inputs: N - Number of samples to generate each iteration g - Number of these samples to use to update parameters alpha - Smoothing parameter k - Number of cluster means to find data - The data we are trying to fit means to (Should be n x d, where there are n points, and d dimensions) modif - If 1, use modified smoothing, otherwise use standard smoothing drplot - If 1, draws the cluster means and the data (for 2-dimensions) c - Optional starting centroids sigma_0 - Optional starting standard deviation Outputs: mu - The centroids found via the CE method, using Normal updating with the parameter set count - The number of iterations taken score - The final score of these centroids
• genNCE.m
This program is used internally to generate the cluster means.

• scoreNCE.m
This program is used internally to evaluate the performance of a particular set of means against the data.

• MCE.m
This program finds a set of cluster means via the CE, looking at clustering as a ``mincut'' type problem.

Usage:
Call the program from MATLAB, with the following syntax:

x = MCE( N , rho , alpha , k , data )

Example: x = MCE( 2000 , 0.01 , 0.6 , 3 , DATA )

 Inputs: N - Number of samples to generate each iteration rho - The fraction of samples used to update the probabilities alpha - Smoothing parameter k - Number of clusters to assign points to data - The data we are trying to assign to clusters (Should be n x d, where there are n points, and d dimensions) Outputs: x - The best found assignment of the data points
• genMCE.m
This program is used internally to assign data points to clusters.

• MCEJ.m
This is slight modification of the above program, using the ``injection'' idea (due to Zdravko Botev). It generally produces superior results to the unmodified MCE method.

Usage:
Call the program from MATLAB, with the following syntax:

x = MCEJ( N , rho , alpha , k , data )

Example: x = MCEJ( 1400 , 0.03 , 0.75 , 4 , DATA )

 Inputs: N - Number of samples to generate each iteration rho - The fraction of samples used to update the probabilities alpha - Smoothing parameter k - Number of clusters to assign points to data - The data we are trying to assign to clusters (Should be n x d, where there are n points, and d dimensions) Outputs: x - The best found assignment of the data points
• scoreMCE.m
This program is used internally to evaluate the performance of a particular assignment against the data.

• clNCE.m
This is a small program which labels the points in a dataset according to a set of cluster means.

Usage:
Call the program from MATLAB, with the following syntax:

x = clNCE( c , data , k )

Example: x = clNCE( mu , DATA , 5 )

 Inputs: c - A set of cluster means data - The data we are trying to assign to clusters (Should be n x d, where there are n points, and d dimensions) k - Number of clusters to assign points to Outputs: x - The assignment of the data points to clusters
• cMCE.m
This is a small program which calculates cluster means from a given cluster labelling.

Usage:
Call the program from MATLAB, with the following syntax:

c = cMCE( x , y , k )

Example: mu = cMCE( x , data , 6 )

 Inputs: x - A labelling of data points data - The data we are trying to fit means to (Should be n x d, where there are n points, and d dimensions) k - Number of cluster means Outputs: x - The cluster means calculated for this labelling of data points

Bibliography

1. D. P. Kroese and R. Y. Rubinstein.
The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning.
Springer, 2004.