Seminar: Wanjie Wang

Statistics Seminar Series
February 2, 2016
All Day
209 W. Eighteenth Ave. (EA), Room 170

Title

Important Features PCA (IF-PCA) for Large-Scale Inference, with Applications in Gene Microarrays

Speaker

Wanjie Wang, University of Pennsylvania

Abstract

Identification of sample labels is a major problem in statistics with many applications. In the Big Data era, it faces two main challenges: (1) the number of features is much larger than the sample size, and (2) the signals are sparse and weak, masked by a large amount of noise.
 
We propose a new tuning-free clustering procedure for large-scale data, Important Features PCA (IF-PCA). IF-PCA consists of a feature selection step, a PCA step, and a k-means step. The first two steps reduce the data dimensions recursively, while the main information is preserved. As a consequence, IF-PCA is fast and accurate, producing competitive performance in application to 10 gene microarray data sets.
 
We also propose a model that can capture the rarity and weakness of the signal. Under this model, the statistical limits for the clustering problem and for IF-PCA have been found.
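
The three steps named in the abstract (feature selection, PCA, k-means) can be illustrated with a minimal sketch. The abstract does not say how features are scored or how the selection threshold is set, so the sketch makes its own assumptions: features are ranked by a Kolmogorov-Smirnov distance from the standard normal, and a fixed number `n_keep` of them is retained. That fixed count means the sketch is not tuning-free, unlike the procedure described. All function names, parameters, and the toy data are hypothetical.

```python
import math
import numpy as np


def ks_scores(X):
    """Kolmogorov-Smirnov distance of each standardized column from N(0, 1).

    Assumption: this scoring rule is a stand-in; the abstract names no statistic.
    """
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    Z_sorted = np.sort(Z, axis=0)
    normal_cdf = 0.5 * (1.0 + np.vectorize(math.erf)(Z_sorted / math.sqrt(2.0)))
    upper = np.arange(1, n + 1)[:, None] / n  # empirical CDF just after each point
    lower = np.arange(0, n)[:, None] / n      # empirical CDF just before each point
    gap = np.maximum(np.abs(upper - normal_cdf), np.abs(normal_cdf - lower))
    return gap.max(axis=0)


def kmeans(Y, k, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [Y[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    labels = None
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = Y[labels == j].mean(axis=0)
    return labels


def if_pca_sketch(X, k, n_keep):
    """Feature selection, then PCA, then k-means on the leading k - 1 scores."""
    # Step 1: keep the n_keep features that look least Gaussian.
    keep = np.argsort(ks_scores(X))[-n_keep:]
    Z = X[:, keep]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    # Step 2: PCA via SVD; the left singular vectors give the sample scores.
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    # Step 3: cluster the samples in the reduced space.
    return kmeans(U[:, : k - 1], k), keep


def make_toy_data(seed=0, n=200, p=500, n_signal=20, shift=3.0):
    """Two equal-sized classes that differ only in the first n_signal features."""
    rng = np.random.default_rng(seed)
    truth = np.repeat([0, 1], n // 2)
    X = rng.standard_normal((n, p))
    X[:, :n_signal] += shift * (2 * truth[:, None] - 1)
    return X, truth


if __name__ == "__main__":
    X, truth = make_toy_data()
    labels, keep = if_pca_sketch(X, k=2, n_keep=20)
    # Cluster labels are arbitrary, so score against both labelings.
    accuracy = max(np.mean(labels == truth), np.mean(labels != truth))
    print("clustering accuracy:", accuracy)
```

The toy signal here is deliberately strong so that the pipeline is easy to follow; it does not represent the rare and weak regime that the abstract addresses.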