Clustering and Model Selection
 
Rebecka Jornsten, PhD.
Assistant Professor,
Department of Statistics
Rutgers University
110 Frelinghuysen Road,
Piscataway, NJ 08854, USA

Email: rebecka@stat.rutgers.edu
Phone: (732) 445-3145
FAX: (732) 445-3428
Office: 451 Hill Center


  • The goal of statistical analysis of high-dimensional and complex biological data structures is to summarize the data efficiently and to help researchers interpret experiments more easily. My research in this area centers on developing clustering methodology that allows for direct, objective, and efficient interpretation of biological data. I am particularly interested in developing statistical methodologies that incorporate model selection and sparsity into clustering.

  • Simultaneous subset selection via rate-distortion theory: Clustering and subset selection are not separable problems. If the clustering model changes, then genes (data objects) may change cluster membership, and vice versa. However, an all-subset selection search is intractable in clustering, since the complexity of this task grows combinatorially with the number of clusters and data dimensions. In a recent paper, we appeal to results from rate-distortion theory to turn the combinatorial task into a simple line search.
    The simultaneous selection approach is not only on par with the full combinatorial approach in terms of selection accuracy, but also requires only a fraction of the computational effort.
    In addition, we can generalize this methodology to multiple testing and subset modeling of differential expression patterns. Our approach provides lists of significantly differentially expressed genes, as well as a sparse representation of their differential expression patterns. We can show that this approach gains detection power and yields easy-to-interpret gene-specific subset models.

  • Multi-level mixture models: Most clustering methods require the choice of a distance metric, e.g. 1 - correlation or Euclidean distance. However, in some applications it is not easy to choose a metric. For example, in gene expression studies, the shape of differential expression across experimental conditions is the main focus, but the scale of expression changes is clearly also biologically relevant.
    I am currently working on a multi-level framework that allows for clustering using multiple distance metrics simultaneously.
    Clusters may share a shape while differing in scale and offset. This constitutes a substantial saving in terms of model parameters, and we can detect more distinct cluster shapes and directly interpret the relationship between clusters.
    The multi-level mixture models can also be extended to multi-factor experiments. In a recent paper, we introduce a profile-EM algorithm to fit models where some clusters share a common shape for selected levels of the experimental factor of interest.
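
The metric-choice problem described above can be illustrated with a small sketch. This is not code from the papers: the function names and the toy expression profiles are hypothetical, and the example only shows how Euclidean distance and 1 - correlation can disagree when two profiles share a shape but differ in scale and offset.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def one_minus_correlation(x, y):
    """1 - Pearson correlation; ignores differences in scale and offset."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)

# Hypothetical toy profiles across four experimental conditions.
base = [0.0, 1.0, 3.0, 2.0]
scaled = [2.0 + 3.0 * v for v in base]  # same shape, different scale and offset
other = [1.0, 2.0, 1.0, 2.0]            # similar magnitude, different shape

for name, profile in [("scaled", scaled), ("other", other)]:
    print(name,
          "euclidean =", round(euclidean(base, profile), 3),
          "1 - corr =", round(one_minus_correlation(base, profile), 3))
```

In this toy case, 1 - correlation treats `scaled` as essentially identical to `base`, while Euclidean distance ranks `other` as the closer profile. Neither metric alone captures both shape and scale, which is the situation a framework with multiple metrics is meant to address.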

  • Clustering based on the L1 Data Depth: The concept of Data Depth generalizes the median and the rank-order to a multivariate setting. Together with Yehuda Vardi and Cunhui Zhang, I developed a cluster validation tool, the "Relative Data Depth (ReD)", as a robust alternative to the popular silhouette width. ReD is less sensitive to the relative scale of clusters than the silhouette.
    It is common to separate the tasks of clustering and cluster validation. In this paper, I propose a clustering method based directly on the cluster validation index ReD.
    This method, DDclust, is fully non-parametric, robust, and allows for clusters of varying size and shape. I am currently working on extending data depth based clustering methodology to functional data.
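
As a rough sketch of the quantities mentioned above, the following Python code computes an L1 data depth and a relative depth for a single observation. It assumes the unit-weight form of the L1 depth as I recall it from Vardi and Zhang (one minus the norm of the average unit vector pointing from the data to the point, with a correction for ties), and it assumes that the between-cluster depth is taken as the largest depth over the competing clusters. The function names and toy clusters are hypothetical, and this is not the DDclust implementation.

```python
import math

def l1_depth(z, points):
    """L1 data depth of z relative to `points`, unit weights (assumed form)."""
    n = len(points)
    total = [0.0] * len(z)
    ties = 0
    for x in points:
        diff = [a - b for a, b in zip(z, x)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            ties += 1  # z coincides with a data point
            continue
        for j, d in enumerate(diff):
            total[j] += d / norm
    mean_norm = math.sqrt(sum(t * t for t in total)) / n
    # Depth is close to 1 near the multivariate median, close to 0 far outside.
    return 1.0 - max(0.0, mean_norm - ties / n)

def relative_depth(z, own_cluster, other_clusters):
    """Within-cluster depth minus depth in the nearest competing cluster.

    Assumption: "nearest" means the competing cluster giving z the largest depth.
    """
    within = l1_depth(z, own_cluster)
    between = max(l1_depth(z, cluster) for cluster in other_clusters)
    return within - between

# Hypothetical toy clusters in two dimensions.
cluster_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cluster_b = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

print(l1_depth((0.5, 0.5), cluster_a))                   # center of cluster_a
print(relative_depth((0.0, 0.0), cluster_a, [cluster_b]))
```

A positive relative depth indicates that the observation sits deeper in its own cluster than in any competing cluster. Because the depth depends only on directions to the data points, it does not depend on the overall scale of a cluster.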

  • PCLUST - Clustering based on p-values
    Most clustering methods require the selection of a distance metric, and it is sometimes easier to formulate these choices in terms of a statistical test. For example, we may view similarity as a function of (i) mean and variance, (ii) mean only, or (iii) profile shape.
    Jun Li, Regina Liu, and I propose a clustering method based on pairwise p-values. This method is very flexible in terms of the metrics we can consider, and it easily allows for "anchoring" clusters to selected profiles.
  • Subset model selection in clustering
  • Clustering with multiple distance metrics - mixture models with profile transformations. Submitted to Biometrics.
  • MIXL, Multi-level mixture modeling. BIRS presentation, July 2006.
  • MIXL, Multi-level mixture modeling. Under revision for Biostatistics.
    Joint work with Sunduz Keles.
  • Simultaneous subset selection via rate-distortion theory. NYU seminar, Nov 2006.
  • Simultaneous subset selection via rate-distortion theory. Under review at JCGS.
  • Incorporating multiple distance metrics into model-based clustering via multi-level mixture models. Coming soon.
  • Multi-level mixture modeling with subset selection, with applications to clustering of gene expression data. ebCTC presentation.
  • Clustering based on P-values

  • PClust. Statistics seminar, Penn State, Jan 2005. Joint work with Jun Li and Regina Liu.
  • Data depth based clustering and classification

  • DDclust. Clustering and Classification based on the L1 data depth,
    Rebecka Jornsten
    Journal of Multivariate Analysis, Volume 90, Issue 1, July 2004, Pages 67-89

  • Clustering based on Data Depth, Statistics seminar, BU
  • A Robust Clustering Method and Visualization Tool Based on Data Depth,
    Rebecka Jornsten, Yehuda Vardi and Cunhui Zhang
    Statistical Data Analysis Based on the L1-Norm and Related Methods. Birkhäuser 2002, Statistics for Industry and Technology. Y. Dodge, editor.
  • Simultaneous clustering and subset selection via MDL

  • Simultaneous Gene Clustering and Subset Selection for Classification via MDL. Rebecka Jornsten and Bin Yu. Bioinformatics, 2003, 19: 1100-1109. R code is available here.

  • Simultaneous clustering and subset selection via MDL, Advances in Minimum Description Length: Theory and Applications, MIT Press (2004). P. Grünwald, I. J. Myung, and M. Pitt, editors.
  • Visit The Hart Lab at Rutgers University.
  • Visit ebCTC, the Environmental Bioinformatics and Computational Toxicology Center. This consortium of UMDNJ, Rutgers University and Princeton University is funded by USEPA STAR Grant number GAD R 832721-010.