Clustering and Model Selection
Microarray analysis - imputation, significance analysis, clustering, and compression.
Data depth - clustering, validation and functional analysis
Statistical modeling of dendritic branching
I am interested in developing new statistical methodology for Clustering and Model Selection. These days, data often have a complex structure, and the clustering techniques we apply should reflect this. Recently, I have focused on Multi-level mixture models, which can incorporate multiple distance metrics into clustering simultaneously, and be used to analyze multi-factor experiments. In addition, I am working on subset selection problems in clustering. Some cluster profiles have a simple structure, and by using a full parameterization of these clusters we are "wasting parameters". Simultaneous subset selection via rate-distortion theory lets us select sparse representations for all clusters. This makes interpretation of the clusters more objective, and allows us to detect more distinct clusters.
I am currently working on extending the multi-level models and simultaneous selection to regularized estimation schemes, and more complex experimental designs.
My research is often motivated by my collaborative projects. I have a long-standing collaboration with the Hart Lab, W.M. Keck Center for Collaborative Neuroscience. The Hart lab is particularly interested in the genomic regulation of neuronal development.
I am also involved in research on the molecular regulation of neuron formation, and the Statistical Analysis of Dendritic Branching. My collaborators at the Firestein Lab, Department of Molecular and Cell Biology, have identified many important intrinsic and extrinsic molecular regulators of dendritic branching.
I am working on the development of computational tools for the Analysis of Genomics data, including missing value imputation and significance analysis of microarray data.
I am involved in several projects related to the notion of Data Depth. Data depth allows for the generalization of the median and rank-order to a multivariate setting.
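To make the idea of a multivariate median concrete, here is a minimal sketch of one common depth function, the spatial (L1-type) depth, written in plain Python. It is an illustration only and not necessarily the depth variant used in the projects described here; the function names `spatial_depth` and `deepest_point` and the toy point cloud are assumptions introduced for this example. The depth of a point is one minus the length of the average unit vector pointing at it from the data, so central points score near 1 and outlying points score near 0.

```python
import math


def spatial_depth(x, points):
    """Spatial (L1-type) depth of x relative to a point cloud.

    Computed as 1 - ||average of unit vectors (x - p) / ||x - p||||.
    Hypothetical helper for illustration.
    """
    dim = len(x)
    total = [0.0] * dim
    for p in points:
        diff = [xj - pj for xj, pj in zip(x, p)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # a point coinciding with x contributes a zero vector
        for j in range(dim):
            total[j] += diff[j] / norm
    n = len(points)
    mean_norm = math.sqrt(sum((t / n) ** 2 for t in total))
    return 1.0 - mean_norm


def deepest_point(points):
    """Sample point with maximal depth: a multivariate analogue of the median."""
    return max(points, key=lambda p: spatial_depth(p, points))


# Toy 2-D cloud (made-up values): a cross around the origin plus one outlier.
cloud = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (10.0, 10.0)]

center = deepest_point(cloud)

# Sorting by decreasing depth gives a center-outward rank order.
ranked = sorted(cloud, key=lambda p: spatial_depth(p, cloud), reverse=True)
```

Ordering the observations by depth, as in `ranked`, is what stands in for the usual rank-order when the data have more than one dimension.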
I am currently working on robust analysis of functional data via data depth, applicable to the analysis of time-course gene expression data and Sholl profiles (the number of dendrites as a function of distance to the cell body).
More details can be found at the following project pages:
- Clustering and Model Selection: Multi-level mixture modeling, Simultaneous subset selection via rate-distortion theory, Clustering based on p-values, Clustering via the L1 data depth, Simultaneous clustering and subset selection for classification via MDL.
- Microarray analysis - imputation, compression and applications. LinCmb - adaptive missing value imputation, Meta-data based imputation, Compression of Microarray Images, Genomic regulation of neuronal development.
- Statistical analysis of dendritic branching. Regulatory proteins: Snapin, Cypin and PSD-95, Analysis of heterogeneous cell populations via mixtures of generalized linear models.
- Data Depth, Multi-terminal Estimation, and Applications. Functional analysis via extensions of the band depth, Data depth based clustering, "Comprestimation", Analysis of EEG data.
Online course material.
- Time Series, Spr07, Tue 6.40-9.30, SEC 208
This applied course in time series analysis covers both time and frequency domain methods. Your final grade is based, in equal parts, on your performance in the lab exercises, take-home final, and individual project.
- Stat687, Spr07, Th12-2, Hill 552
Presentation and Criticism. In this course, students are asked to present an overview of a current topic in statistics. You are expected to give two short talks and to participate actively in the in-class discussions.
- Stat 563-Regression, Fall06, M6.40-9.30, 552 Hill.
Regression is one of the most versatile tools in statistics. In this class, you will apply linear, non-linear and regularized regression methods to several diverse data sets. Students are expected to complete lab exercises and hand in a final data analysis project. Don't miss the end-of-semester poster session.
- Stat 586, Data Interpretation - part I, Fall05, W6.20-9, 552 Hill
In this class we explore linear models, generalized linear models, classification and clustering. We also cover cross-validation, model averaging, L1-regularization, and bagging. You will be expected to learn a statistical computing package, and conduct a final data analysis project. If you are not familiar with statistical computing, this class offers an opportunity to learn about the statistical software package R.
- Stat 401, Basic Statistics for research, Fall04, W2F5, MU215.
In this class we offer an introduction to the statistical analysis of different types of data: sample surveys and experimental data. The emphasis in this class is on understanding the methods and interpreting the outcome of statistical analyses.
- Statistical Methods In Bioinformatics, Spring03, M-W 9.50-11.10, Serin 385E.
This class is taught to a diverse body of students, including statisticians, biologists and computer scientists. We review papers together, and you are expected to actively participate in the in-class discussions.
I was an instructor at the NIDDK and NIAMS short courses on microarray data analysis in 2006. Lectures from these workshops can be found here.
In collaboration with the Hart lab, we offer short courses on microarray data analysis here at Rutgers. Please check the Hart lab web page for upcoming events.
Biography: I received my Ph.D. in Statistics from UC Berkeley under the supervision of Bin Yu in Dec 2001. For more information, please view my current CV: pdf/word.
R Codes
Links
Genomics.
Nature genetics
Genome Research
Bioinformatics journal
Wentian Li's list of microarray papers and pre-prints, North Shore LIJ Research Institute
Sunduz Keles Statistics, Biostatistics and Medical Informatics, University of Wisconsin
Sandrine Dudoit, Biostatistics, UC Berkeley
Bioinformatics Hotlist, maintained by Stephen D. Scott, University of Nebraska
Statistics.
The American Statistical Association
The Institute of Mathematical Statistics
StatLib
Rutgers.
Visit The Hart Lab at Rutgers University.
Visit The Firestein Lab at Rutgers University.
Visit ebCTC, environmental bioinformatics and Computational Toxicology Center. This Consortium of UMDNJ, Rutgers University and Princeton University is funded by USEPA STAR Grant number GAD R 832721-010.