R Code for SimClust: Algorithm described in "Simultaneous Gene Clustering and Subset Selection for Sample Classification via MDL"

About the code:

R code - not optimized for speed
Required libraries: MASS, leaps, cluster
Assist functions
Source in the functions in the files below: bunchoffcns.q, mdlfcns.q, testerrorfcns.q, Clusterfcns.q, Selectionfcns.q.

The main program is SimClust.q for multiclass data and SimClust2.q for binary data.
Input (same for both SimClust and SimClust2):
Required:
data.resp - the response vector (label range 0 to C-1, where C is the number of classes)

data.std - the data matrix (rows=samples, columns=genes)

classabund - a vector c(x,y,z,...) of class abundances, number of observations in each class

nc - number of classes

nbrgenes - number of genes to cluster. In the paper we used 200

Clset - vector of number of clusters to be searched, ex c(seq(3,12))

Optional:
nmax - max size model, default 20

nb - number of models retained from leaps for each optimal scoring model, default=50

frac - fraction of data set for training, default 3-fold crossvalidation

nbriter - number of internal iterations, seed of kmeans

  • bunchoffcns.qSome basic functions, cross validation, permuations, sorting vectors for clustering, nearestneighbor etc
  • mdlfcns.qFunctions for mdl, aic, bic selection
  • testerrorfcns.qFunctions for prediction (LDA, DLDA)
  • Clusterfcns.qCluster functions for variables
  • Selectionfcns.qSelecting individual variables and clusters of variables
  • SimClust2.qThe main cross validation program, binary
  • SimClust.qThe main cross validation program for multiclass
  • Calling the functions:
    example:

    simclust<-SimClust(lander.resp2,lander.data.std,classabund,3,200,seq(3,10),B=2)

    Summarizing the output:

    Boxplots of testerror rates:

    TE<-as.data.frame(simclust$testerrmLDA)

    names(TE)<-c("clustMDL","clustAIC","clustBIC","clustAICc","indMDL","indAIC","indBIC","indAICc")

    boxplot(TE,ylim=c(0,1))

    Back to : Rebecka

    04/18/03