Technical Reports

2006-001: Differential expression in DNA microarray and protein array experiment, by D. Amaratunga and J. Cabrera [9/13/06]

A typical experiment in genomics compares two groups of microarray or protein array data in order to find differentially expressed genes or proteins that might be involved in some biological process. In this paper we propose a model with minimal distributional assumptions, and a conditional t procedure for analyzing such data.

2006-002: Mixture models with multiple levels, with application to the analysis of multi-factor gene expression data, by Rebecka Jörnsten and Sündüz Keleş [11/19/06]

Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we can gain new insights into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated.

2006-003: Model-selection consistency of the LASSO in high-dimensional linear regression, by Cun-Hui Zhang and Jian Huang [11/29/06]

Meinshausen and Bühlmann (2004) showed that, for neighborhood selection in Gaussian graphical models, under a neighborhood stability condition, the LASSO is consistent even when the number of variables is of greater order than the sample size. Zhao and Yu (2006) formalized the neighborhood stability condition in the context of linear regression as a strong irrepresentable condition. They showed that under this condition, the LASSO selects exactly the set of non-zero regression coefficients, provided that these coefficients are bounded away from zero at a certain rate. In this paper, the regression coefficients outside an ideal model are assumed to be small but not necessarily zero. Under a partial Riesz condition on the correlation of design variables, we prove that the LASSO selects a model of the right order of dimensionality, controls the bias of the selected model at a level determined by the contributions of small regression coefficients and threshold bias, and selects all coefficients of greater order than the bias of the selected model. An interesting aspect of our results is that the logarithm of the number of variables can be of the same order as the sample size for certain random dependent designs.
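
The selection behavior described in this abstract can be illustrated with a toy calculation. The sketch below is not taken from the report: it assumes the common objective (1/(2n))·||y − Xβ||² + λ·||β||₁, solves it by plain coordinate descent, and uses a small made-up orthogonal design so the result can be checked by hand. The names `soft_threshold` and `lasso_cd`, the data, and the value of λ are all illustrative assumptions.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1 (assumed scaling)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0   # correlation of column j with the partial residual
            norm = 0.0  # squared norm of column j
            for i in range(n):
                fit_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Hypothetical orthogonal design: columns are mutually orthogonal with squared norm n = 4.
X = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0],
    [1.0, 1.0, -1.0],
    [1.0, -1.0, -1.0],
]
# True coefficients (3, 0.1, 0): one large, one small but non-zero, one zero.
y = [3.1, 2.9, 3.1, 2.9]

beta = lasso_cd(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(beta, selected)
```

In this toy setting the large coefficient is selected but shrunk by λ (the threshold bias), while the small non-zero coefficient falls below the threshold and is left out of the selected model, which is the regime the abstract describes.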