Seminar: Wanjie Wang

June 26, 2017
Tuesday, February 2, 2016 - 3:00pm
209 W. Eighteenth Ave. (EA), Room 170
Statistics Seminar Series

Title

Important Features PCA (IF-PCA) for Large-Scale Inference, with Applications in Gene Microarrays

Speaker

Wanjie Wang, University of Pennsylvania

Abstract

Identification of sample labels is a major problem in statistics with many applications. In the Big Data era, it faces two main challenges: (1) the number of features is much larger than the sample size, and (2) the signals are sparse and weak, masked by a large amount of noise.
 
We propose a new tuning-free clustering procedure for large-scale data, Important Features PCA (IF-PCA). IF-PCA consists of a feature selection step, a PCA step, and a k-means step. The first two steps reduce the data dimensions recursively, while the main information is preserved. As a consequence, IF-PCA is fast and accurate, producing competitive performance in application to 10 gene microarray data sets.
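
The three-step structure described above (feature selection, then PCA, then k-means) can be illustrated with a minimal sketch. This is not the speaker's implementation. The abstract does not say which selection statistic is used or how the threshold is chosen, so the sketch assumes a Kolmogorov-Smirnov-type score against the normal distribution and a hand-picked number of kept features (the hypothetical `keep` parameter), whereas the procedure in the talk is described as tuning-free. It also assumes only two clusters, so a single principal component suffices, and it runs on toy data with deliberately strong signals. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, pstdev


def ks_score(column):
    """KS distance between the standardized column and N(0, 1) (assumed score)."""
    n = len(column)
    mu, sd = mean(column), pstdev(column)
    z = sorted((v - mu) / sd for v in column)
    cdf = NormalDist().cdf
    return max(max((i + 1) / n - cdf(v), cdf(v) - i / n) for i, v in enumerate(z))


def select_features(X, keep):
    """Step 1: indices of the `keep` columns with the largest scores."""
    p = len(X[0])
    scores = [ks_score([row[j] for row in X]) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:keep]


def leading_pc_scores(X, iters=200):
    """Step 2: project centered rows onto the first principal direction."""
    n, p = len(X), len(X[0])
    col_means = [mean(row[j] for row in X) for j in range(p)]
    C = [[row[j] - col_means[j] for j in range(p)] for row in X]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):  # power iteration on C^T C
        proj = [sum(r[j] * v[j] for j in range(p)) for r in C]
        w = [sum(C[i][j] * proj[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(p)) for r in C]


def two_means_1d(scores, iters=50):
    """Step 3: k-means with k = 2 on one-dimensional scores."""
    c0, c1 = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iters):
        labels = [0 if abs(s - c0) <= abs(s - c1) else 1 for s in scores]
        g0 = [s for s, lab in zip(scores, labels) if lab == 0]
        g1 = [s for s, lab in zip(scores, labels) if lab == 1]
        if not g0 or not g1:  # degenerate case: all scores identical
            break
        c0, c1 = mean(g0), mean(g1)
    return labels


def cluster_two_groups(X, keep):
    """Run the three steps in order on a list-of-rows data matrix."""
    idx = select_features(X, keep)
    reduced = [[row[j] for j in idx] for row in X]
    return two_means_1d(leading_pc_scores(reduced))


# Toy data (an assumption, not from the talk): 100 samples, 60 features,
# of which the first 10 carry a strong group-dependent mean shift.
random.seed(0)
n, p, n_signal, shift = 100, 60, 10, 3.0
truth = [i % 2 for i in range(n)]
X = [[random.gauss(0.0, 1.0) + (shift * (2 * t - 1) if j < n_signal else 0.0)
      for j in range(p)] for t in truth]

labels = cluster_two_groups(X, keep=10)
matches = sum(a == b for a, b in zip(labels, truth))
agreement = max(matches, n - matches) / n  # labels are defined only up to a swap
print(f"agreement with true labels: {agreement:.2f}")
```

The selection step shrinks the number of columns before the principal component is computed, so the later steps work on a much smaller matrix. In the sparse, weak-signal setting the abstract describes, a fixed `keep` would be a poor substitute for a data-driven threshold.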
 
We also propose a model that captures the rarity and weakness of the signals. Under this model, the statistical limits for the clustering problem and for IF-PCA have been found.