Seminar: Jeongyoun Ahn

Statistics Seminar
February 23, 2006
All Day
209 W. Eighteenth Ave. (EA), Room 170

Title

High Dimension, Low Sample Size Data Analysis

Speaker

Jeongyoun Ahn, University of North Carolina

Abstract

Most of the literature on the statistical analysis of High Dimension, Low Sample Size (HDLSS) data deals with situations where both the dimension d and the sample size n tend to infinity together. In this talk the case where d tends to infinity while n is fixed is examined. We show that the sample covariance matrix behaves as if the underlying distribution were spherical when d is much larger than n. This result plays a key role in extending to more general settings the asymptotic geometric representation of HDLSS data, which says that the randomness of the data lies only in random rotations of a regular n-simplex. The classification problem with HDLSS data is also considered in this presentation. There exists a one-dimensional direction in the data space (i.e., the n-dimensional subspace generated by the data vectors) such that the projected data take only two distinct values. This direction is uniquely determined in the data space and lies within the affine set of the data. It has a formula similar to Fisher's linear discriminant direction and is shown to be equivalent to it in non-HDLSS cases.
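
The abstract does not give the formula for this direction; as a minimal sketch under stated assumptions, the snippet below uses a Fisher-like form, applying the Moore-Penrose pseudoinverse of the total sample covariance to the difference of class means. The toy Gaussian data, class sizes, and the choice of the total (rather than within-class) covariance are illustrative assumptions, not the speaker's exact construction; the point is to watch each class's projections collapse toward a single value.

```python
# Illustrative sketch only: the talk's exact direction is not given in the
# abstract. We assume a Fisher-like form that replaces the (singular) sample
# covariance inverse with the Moore-Penrose pseudoinverse of the total
# sample covariance, which in HDLSS settings piles each class's projections
# onto (nearly) a single point.
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class = 1000, 10          # HDLSS: dimension >> sample size

# Two toy Gaussian classes with a modest mean shift (assumed data).
X1 = rng.standard_normal((n_per_class, d))
X2 = rng.standard_normal((n_per_class, d)) + 0.5
X = np.vstack([X1, X2])

# Total sample covariance: rank at most n - 1, hence the pseudoinverse below.
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / (X.shape[0] - 1)

# Fisher-like direction with the pseudoinverse standing in for the inverse.
v = np.linalg.pinv(Sigma) @ (X1.mean(axis=0) - X2.mean(axis=0))
v /= np.linalg.norm(v)

# Project the data: each class collapses toward one value ("data piling").
proj1, proj2 = X1 @ v, X2 @ v
print("class 1 projections: mean", proj1.mean(), "spread", proj1.std())
print("class 2 projections: mean", proj2.mean(), "spread", proj2.std())
```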

In the second part of the talk the bandwidth selection problem for kernel methods is considered. The usual cross-validation method is observed to be sensitive to sampling variation and computationally expensive. A new method is proposed, based on a geometrical understanding of kernel-based classification: a nonlinear classification is actually a linear one in the embedded feature space. The bandwidth that makes this linear classification task easiest is chosen. This method is empirically shown to be robust to sampling variation and to require much less computing time.
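
The geometric criterion itself is not specified in the abstract, so the sketch below uses a hypothetical stand-in: each candidate bandwidth is scored by a simple separability statistic computed from the Gaussian kernel matrix (mean within-class similarity minus mean between-class similarity), and the maximizing bandwidth is chosen in a single pass, in contrast to the repeated refits that cross-validation requires. The scoring function, toy data, and bandwidth grid are all assumptions for illustration.

```python
# Hypothetical stand-in for the geometric bandwidth criterion described in
# the talk; the abstract does not give the actual formula. Each bandwidth is
# scored by how separable the two classes look in the kernel matrix.
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def separability(X1, X2, sigma):
    """Assumed score: mean within-class similarity minus between-class similarity."""
    within = 0.5 * (gaussian_kernel(X1, X1, sigma).mean()
                    + gaussian_kernel(X2, X2, sigma).mean())
    between = gaussian_kernel(X1, X2, sigma).mean()
    return within - between

rng = np.random.default_rng(1)
X1 = rng.standard_normal((30, 5))        # toy class 1
X2 = rng.standard_normal((30, 5)) + 1.0  # toy class 2, shifted mean

# Single pass over candidate bandwidths -- no model refitting, unlike
# cross-validation, which retrains the classifier for every fold and value.
sigmas = np.logspace(-1, 1, 25)
scores = [separability(X1, X2, s) for s in sigmas]
best = sigmas[int(np.argmax(scores))]
print(f"selected bandwidth (by assumed separability score): {best:.3f}")
```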

Meet the speaker in Room 212 Cockins Hall at 4:30 p.m. Refreshments will be served.