
Title
High Dimension, Low Sample Size Data Analysis
Speaker
Jeongyoun Ahn, University of North Carolina
Abstract
Most of the literature on the statistical analysis of High Dimension, Low Sample Size (HDLSS) data deals with situations where both the dimension d and the sample size n go to infinity together. In this talk the case where d tends to infinity while n remains fixed is examined. We show that the sample covariance matrix behaves as if the underlying distribution were spherical when d is much larger than n. This result plays a key role in extending to more general settings the asymptotic geometric representation of HDLSS data, which says that the randomness of the data lies only in random rotations of a regular n-simplex. The classification problem with HDLSS data is also considered in this presentation. There exists a one-dimensional direction in the data space (i.e., the n-dimensional subspace generated by the data vectors) such that the projected data take only two distinct values. This direction is uniquely determined in the data space and lies within the affine span of the data. Its formula is similar to that of Fisher's linear discriminant direction, and the two are shown to be equivalent in non-HDLSS cases.
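The piling phenomenon above can be sketched numerically. The construction below (projecting the class-mean difference onto the orthogonal complement of the within-class deviations, inside the span of the data) is one way to realize such a direction; the dimensions, class sizes, and distributions are illustrative assumptions, not details from the talk.

```python
import numpy as np

# HDLSS toy data: dimension d far exceeds the total sample size n = 10.
rng = np.random.default_rng(0)
d, n_per_class = 200, 5

X1 = rng.normal(size=(n_per_class, d))          # class 1
X2 = rng.normal(size=(n_per_class, d)) + 1.0    # class 2, shifted mean

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
diff = m1 - m2

# Within-class deviations (rank at most n - 2).
W = np.vstack([X1 - m1, X2 - m2])

# Remove from `diff` its component in the row space of W; the residual is
# orthogonal to every within-class deviation, so each class projects onto
# a single value ("data piling").
coef, *_ = np.linalg.lstsq(W.T, diff, rcond=None)
w = diff - W.T @ coef

proj1, proj2 = X1 @ w, X2 @ w
print(proj1.std(), proj2.std())        # both ~0: each class piles on one point
print(proj1.mean() - proj2.mean())     # positive gap between the two piles
```

Because w lies in the span of the data and annihilates all within-class deviations, every observation in a class shares the same projection, its class mean's projection, which is the two-distinct-values property the abstract describes.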
In the second part of the talk the bandwidth selection problem in the kernel method is considered. The usual cross-validation method is observed to be sensitive to sampling variation and computationally expensive. A new method is proposed, based on a geometric understanding of kernel-based classification: a nonlinear classification is in fact a linear one in the embedded feature space. The bandwidth that makes this linear classification task the easiest is chosen. This method is empirically shown to be robust to sampling variation and to require much less computing time than cross validation.
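The idea of scoring bandwidths by how easy they make the classification, rather than by cross-validation, can be illustrated with a simple proxy. The separability score below (mean within-class minus mean between-class Gaussian-kernel similarity) is a hypothetical stand-in chosen for this sketch, not the speaker's actual criterion, and the data are synthetic.

```python
import numpy as np

# Two synthetic 2-D classes with shifted means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)),
               rng.normal(size=(20, 2)) + 2.0])
y = np.array([0] * 20 + [1] * 20)

def separability(X, y, sigma):
    """Within-class minus between-class mean Gaussian-kernel similarity."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    same = y[:, None] == y[None, :]
    between = ~same
    np.fill_diagonal(same, False)      # drop self-similarity from within-class
    return K[same].mean() - K[between].mean()

# Score a grid of bandwidths once each -- no resampling loop as in
# cross-validation -- and keep the one with the best separation.
sigmas = np.logspace(-2, 2, 30)
scores = [separability(X, y, s) for s in sigmas]
best = sigmas[int(np.argmax(scores))]
print(best, max(scores))
```

Note that each candidate bandwidth is evaluated with a single pass over the kernel matrix, which is where the computational savings over repeated cross-validation fits would come from.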
Meet the speaker in Room 212 Cockins Hall at 4:30 p.m. Refreshments will be served.