Clustering

Problem Description

Clustering problems are described in Section 8.3 of [1].

Test Data

  • 3gauss.mat - 200 points from a mixture of 3 Gaussian distributions.
  • banana.mat - 200 points from a banana data set.
  • spiral.mat - 200 points from a spiral data set.
  • datasets.mat - Larger data sets of the above 3 types.
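
The variable names stored inside these .mat files are not listed here; a minimal way to inspect one of them and pull out the data matrix (the assumption below that the first stored variable is the n x d data matrix should be checked against the output of whos) is:

    % Inspect the file, then load its contents into a struct.
    whos('-file', '3gauss.mat')
    S = load('3gauss.mat');
    f = fieldnames(S);
    DATA = S.(f{1});                 % assumption: first stored variable is the n x d data matrix
    plot(DATA(:,1), DATA(:,2), '.')  % quick visual check for the 2-dimensional sets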

Program Descriptions

  • NCE.m
    This program finds a set of cluster means via the CE method with (independent) Normal updating; a worked sketch combining it with clNCE.m follows this list.

    Usage:
    Call the program from MATLAB, with the following syntax:

    [ mu , count , score ] = NCE( N , g , alpha , k , data , modif , drplot , c , sigma_0 )

    Example: [ mu , count , score ] = NCE( 1000 , 10 , 0.7 , 5 , DATA , 0 , 1 )

    Inputs:   
     N - Number of samples to generate each iteration
     g - Number of these samples to use to update parameters
     alpha - Smoothing parameter
     k - Number of cluster means to find
     data - The data we are trying to fit means to (should be n x d, where n is the number of points and d the number of dimensions)
     modif - If 1, use modified smoothing, otherwise use standard smoothing
     drplot - If 1, draws the cluster means and the data (for 2-dimensional data)
     c - Optional starting centroids
     sigma_0 - Optional starting standard deviation
    Outputs:   
     mu - The centroids found via the CE method, using Normal updating with the given parameters
     count - The number of iterations taken
     score - The final score of these centroids
  • genNCE.m
    This program is used internally to generate the cluster means.

  • scoreNCE.m
    This program is used internally to evaluate the performance of a particular set of means against the data.

  • MCE.m
    This program clusters the data via the CE method, treating clustering as a ``mincut''-type problem.

    Usage:
    Call the program from MATLAB, with the following syntax:

    x = MCE( N , rho , alpha , k , data )

    Example: x = MCE( 2000 , 0.01 , 0.6 , 3 , DATA )

    Inputs:   
     N - Number of samples to generate each iteration
     rho - The fraction of samples used to update the probabilities
     alpha - Smoothing parameter
     k - Number of clusters to assign points to
     data - The data we are trying to assign to clusters (should be n x d, where n is the number of points and d the number of dimensions)
    Outputs:   
     x - The best found assignment of the data points
  • genMCE.m
    This program is used internally to assign data points to clusters.

  • MCEJ.m
    This is a slight modification of the above program, using the ``injection'' idea (due to Zdravko Botev). It generally produces better results than the unmodified MCE method; a sketch combining it with cMCE.m follows this list.

    Usage:
    Call the program from MATLAB, with the following syntax:

    x = MCEJ( N , rho , alpha , k , data )

    Example: x = MCEJ( 1400 , 0.03 , 0.75 , 4 , DATA )

    Inputs:   
     N - Number of samples to generate each iteration
     rho - The fraction of samples used to update the probabilities
     alpha - Smoothing parameter
     k - Number of clusters to assign points to
     data - The data we are trying to assign to clusters (should be n x d, where n is the number of points and d the number of dimensions)
    Outputs:   
     x - The best found assignment of the data points
  • scoreMCE.m
    This program is used internally to evaluate the performance of a particular assignment against the data.

  • clNCE.m
    This is a small program which labels the points in a dataset according to a set of cluster means.

    Usage:
    Call the program from MATLAB, with the following syntax:

    x = clNCE( c , data , k )

    Example: x = clNCE( mu , DATA , 5 )

    Inputs:   
     c - A set of cluster means
     data - The data we are trying to assign to clusters (should be n x d, where n is the number of points and d the number of dimensions)
     k - Number of clusters to assign points to
    Outputs:   
     x - The assignment of the data points to clusters
  • cMCE.m
    This is a small program which calculates cluster means from a given cluster labelling.

    Usage:
    Call the program from MATLAB, with the following syntax:

    c = cMCE( x , data , k )

    Example: mu = cMCE( x , data , 6 )

    Inputs:   
     x - A labelling of data points
     data - The data we are trying to fit means to (should be n x d, where n is the number of points and d the number of dimensions)
     k - Number of cluster means
    Outputs:   
     c - The cluster means calculated for this labelling of data points
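
As a rough end-to-end sketch of the centroid-based approach, NCE.m and clNCE.m can be combined as follows (the variable name stored in 3gauss.mat is an assumption that should be checked as in the Test Data section, and the parameter values are illustrative only):

    % Find 3 cluster means for a 2-dimensional test set via CE with Normal updating,
    % then label every data point according to those means.
    S = load('3gauss.mat');
    f = fieldnames(S);
    DATA = S.(f{1});                          % assumed n x d data matrix
    [ mu , count , score ] = NCE( 1000 , 10 , 0.7 , 3 , DATA , 0 , 1 );
    x = clNCE( mu , DATA , 3 );               % cluster label for each point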

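Similarly, a sketch of the assignment-based approach, using MCEJ.m and then recovering the corresponding means with cMCE.m (again assuming DATA holds an n x d data matrix, loaded as in the previous sketch; parameter values are illustrative only):

    % Assign the points to 3 clusters directly via CE with injection,
    % then compute the cluster means implied by that assignment.
    x = MCEJ( 1400 , 0.03 , 0.75 , 3 , DATA );
    mu = cMCE( x , DATA , 3 );
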
Bibliography

  1. D. P. Kroese and R. Y. Rubinstein.
    The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning.
    Springer, 2004.