R Code for SimClust: Algorithm described in "Simultaneous Gene Clustering and Subset Selection for Sample Classification via MDL"
About the code:
- R code - not optimized for speed
- Required libraries: MASS, leaps, cluster
- Assist functions
- Source in the functions in the files below: bunchoffcns.q, mdlfcns.q, testerrorfcns.q, Clusterfcns.q, Selectionfcns.q.
- The main program is SimClust.q for multiclass data and SimClust2.q for binary data.
- Input (same for both SimClust and SimClust2):
- Required:
- data.resp - the response vector (label range 0 to C-1, where C is the number of classes)
- data.std - the data matrix (rows=samples, columns=genes)
- classabund - a vector c(x,y,z,...) of class abundances, number of observations in each class
- nc - number of classes
- nbrgenes - number of genes to cluster. In the paper we used 200
- Clset - vector of number of clusters to be searched, ex c(seq(3,12))
- Optional:
- nmax - max size model, default 20
- nb - number of models retained from leaps for each optimal scoring model, default=50
- frac - fraction of data set for training, default 3-fold crossvalidation
- nbriter - number of internal iterations, seed of kmeans
- bunchoffcns.qSome basic functions, cross validation, permuations, sorting vectors for clustering, nearestneighbor etc
- mdlfcns.qFunctions for mdl, aic, bic selection
- testerrorfcns.qFunctions for prediction (LDA, DLDA)
- Clusterfcns.qCluster functions for variables
- Selectionfcns.qSelecting individual variables and clusters of variables
- SimClust2.qThe main cross validation program, binary
- SimClust.qThe main cross validation program for multiclass
- Calling the functions:
- example:
- simclust<-SimClust(lander.resp2,lander.data.std,classabund,3,200,seq(3,10),B=2)
- Summarizing the output:
- Boxplots of testerror rates:
- TE<-as.data.frame(simclust$testerrmLDA)
- names(TE)<-c("clustMDL","clustAIC","clustBIC","clustAICc","indMDL","indAIC","indBIC","indAICc")
- boxplot(TE,ylim=c(0,1))
Back to : Rebecka
04/18/03