Tuesday, February 2, 2016 - 3:00pm
209 W. Eighteenth Ave. (EA), Room 170
Important Features PCA (IF-PCA) for Large-Scale Inference, with Applications in Gene Microarrays
Wanjie Wang, University of Pennsylvania
Identification of sample labels is a major problem in statistics with many applications. In the Big Data era, it faces two main challenges: (1) the number of features is much larger than the sample size, and (2) the signals are sparse and weak, masked by a large amount of noise.
We propose a new tuning-free clustering procedure for large-scale data, Important Features PCA (IF-PCA). IF-PCA consists of a feature selection step, a PCA step, and a k-means step. The first two steps successively reduce the dimension of the data while preserving the main information. As a consequence, IF-PCA is fast and accurate, and it performs competitively on 10 gene microarray data sets.
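The three steps named above (feature selection, PCA, k-means) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the speaker's implementation. The abstract does not say how features are ranked or how many are kept, so the sketch ranks features by a Kolmogorov-Smirnov distance from the standard normal, keeps a user-supplied number n_keep of them (whereas the procedure in the talk is described as tuning-free), and runs k-means on the leading n_clusters - 1 left singular vectors. The names if_pca_sketch, ks_scores, kmeans, and n_keep are hypothetical.

```python
from math import erf, sqrt

import numpy as np


def ks_scores(Z):
    """Kolmogorov-Smirnov distance of each standardized column from N(0, 1).

    Assumption: this score stands in for the feature-ranking statistic,
    which the abstract does not specify.
    """
    n = Z.shape[0]
    Z_sorted = np.sort(Z, axis=0)
    normal_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))
    F = normal_cdf(Z_sorted)
    # Compare the empirical CDF (a step function) with the normal CDF
    # just after and just before each jump.
    upper = (np.arange(1, n + 1)[:, None] / n - F).max(axis=0)
    lower = (F - np.arange(0, n)[:, None] / n).max(axis=0)
    return np.maximum(upper, lower)


def kmeans(points, k, rng, n_init=5, n_iter=100):
    """Plain Lloyd's algorithm with a few random restarts."""
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Keep the old center if a cluster happens to become empty.
            new_centers = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Recompute assignments and within-cluster sum of squares for the final centers.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        inertia = dist[np.arange(len(points)), labels].sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def if_pca_sketch(X, n_clusters, n_keep, seed=0):
    """Feature selection, then PCA, then k-means (illustrative only).

    X is an (n_samples, n_features) array. n_keep is a hypothetical
    parameter: the number of top-ranked features to retain.
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    Z = (X - X.mean(axis=0)) / std

    # Step 1: keep the n_keep features that look least like pure noise.
    keep = np.argsort(ks_scores(Z))[-n_keep:]

    # Step 2: PCA via SVD on the selected features (assumed: K - 1 components).
    U, _, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    embedding = U[:, : n_clusters - 1]

    # Step 3: k-means on the low-dimensional embedding.
    labels = kmeans(embedding, n_clusters, np.random.default_rng(seed))
    return labels, keep
```

Calling if_pca_sketch(X, n_clusters=2, n_keep=30) returns one cluster label per sample together with the indices of the retained features. Selecting features before the SVD is the point of the design: when only a few features carry signal, the principal components are computed from a much smaller and less noisy matrix.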
We also propose a model that captures the rarity and weakness of the signals. Under this model, the statistical limits for the clustering problem and for IF-PCA have been found.