
Seminar: Ping Li

Statistics Seminar
October 23, 2014
All Day
209 W. Eighteenth Ave. (EA), Room 170

Title

BigData: Hashing Algorithms for Large-Scale Search and Learning

Speaker

Ping Li, Rutgers University

Abstract

The talk will begin with an interesting story about the Cauchy distribution. Consider two data vectors, u and v, and a vector R of i.i.d. Cauchy variables. Pr{sgn(<u,R>) = sgn(<v,R>)} is essentially a monotonic function of the chi-square similarity (a nonlinear kernel) between u and v. This observation leads to useful big-data (LINEAR) algorithms for building large-scale statistical models and for searching for near neighbors, in terms of the chi-square similarity (kernel). The chi-square similarity is known for its superb performance on data generated from histograms (e.g., in computer vision and NLP).
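To make the collision-probability claim concrete, here is a minimal numerical sketch (not taken from the talk itself). It assumes the chi-square similarity between nonnegative vectors, normalized to sum to one, is rho = sum_i 2*u_i*v_i/(u_i + v_i), and compares the empirical sign-collision rate under Cauchy projections with the heuristic approximation 1 - arccos(rho)/pi reported in the sign-Cauchy-projection literature; vector sizes and sparsity below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def chi2_similarity(u, v):
    """Chi-square similarity between two nonnegative vectors
    (normalized here to sum to one): sum_i 2*u_i*v_i / (u_i + v_i)."""
    u = u / u.sum()
    v = v / v.sum()
    mask = (u + v) > 0                      # avoid 0/0 on coordinates absent from both
    return np.sum(2.0 * u[mask] * v[mask] / (u[mask] + v[mask]))

def sign_cauchy_collision_rate(u, v, k=10_000):
    """Empirical Pr{sgn(<u,R>) = sgn(<v,R>)} over k i.i.d. standard
    Cauchy projection vectors R."""
    d = len(u)
    R = rng.standard_cauchy(size=(k, d))    # k random projection directions
    return np.mean(np.sign(R @ u) == np.sign(R @ v))

# Two sparse, nonnegative "histogram-like" vectors.
u = rng.random(100) * (rng.random(100) < 0.3)
v = rng.random(100) * (rng.random(100) < 0.3)

rho = chi2_similarity(u, v)
p_hat = sign_cauchy_collision_rate(u, v)
p_approx = 1.0 - np.arccos(rho) / np.pi     # heuristic approximation of the collision probability
print(f"chi-square similarity: {rho:.3f}")
print(f"empirical collision rate: {p_hat:.3f}   approx 1 - acos(rho)/pi: {p_approx:.3f}")
```

Because the sign of a projection is scale-invariant, only the direction of u and v matters for the collision rate, which is why the hashed signs can stand in for the (nonlinear) chi-square kernel inside linear algorithms.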
 
Modern applications of internet search and machine learning routinely encounter datasets with (hundreds of) billions of examples in billions or even billion-squared dimensions (e.g., documents represented by high-order n-grams). Developing novel algorithms for efficient search and machine learning has become an active area of research. Hashing can be very useful in many scenarios, for example:
 
  1. Some devices have only limited computing, storage, or power resources;
  2. To achieve higher accuracy, we may want to explicitly consider pairwise or three-way interactions (or high-order n-grams) in linear models (see the feature-hashing sketch after this list);
  3. We hope to reduce the complexity of learning models (e.g., deep nets) by hashing the inputs or hashing the outputs;
  4. Hashing is an effective way of indexing (and space partitioning), which allows efficient sub-linear time near-neighbor search;
  5. Perhaps surprisingly, our newest research shows that, if designed carefully, hashing (which naturally leads to linear algorithms) can also capture nonlinear effects (e.g., nonlinear kernels). Examples of such kernels include the resemblance, chi-square, and CoRE kernels.

The talk will cover a variety of hashing algorithms, including sign Cauchy projections, b-bit minwise hashing, one permutation hashing, and densified one permutation hashing; a minimal sketch of the classic minwise-hashing idea also appears below.
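For scenario 2, a common way to fold explicit pairwise interactions into a fixed-dimensional linear model is the hashing trick, shown here as a standard illustration rather than as the specific method of the talk. The sketch hashes unigrams plus all pairwise interaction features into a 2^20-dimensional sparse vector; the feature dimension is an arbitrary choice, and Python's built-in hash stands in for a stable hash such as MurmurHash.

```python
from itertools import combinations

def hashed_interaction_features(tokens, dim=2**20, seed=0):
    """Hashing trick: map unigrams and explicit pairwise interactions into a
    fixed-size sparse vector usable by any linear model, without storing a
    vocabulary. Illustrative sketch; a real system would use a stable hash
    (e.g., MurmurHash) instead of Python's process-randomized hash()."""
    feats = list(tokens) + [f"{a}^{b}" for a, b in combinations(sorted(set(tokens)), 2)]
    vec = {}
    for f in feats:
        idx = hash((seed, "idx", f)) % dim                          # bucket index
        sign = 1.0 if hash((seed, "sign", f)) % 2 == 0 else -1.0    # signed hashing reduces collision bias
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec  # sparse {index: value} representation in a dim-dimensional space

x = hashed_interaction_features(["big", "data", "hashing", "search"])
print(len(x), "nonzero coordinates out of", 2**20)
```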
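Among the algorithms mentioned above, the simplest to sketch is classic minwise hashing for the resemblance (Jaccard) kernel; b-bit minwise hashing and (densified) one permutation hashing refine this idea and are not shown here. The hash family, prime modulus, and number of permutations below are illustrative assumptions, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
P = 2_147_483_647          # Mersenne prime 2^31 - 1, modulus for the hash family
NUM_PERM = 256             # number of (simulated) permutations
A = rng.integers(1, P, size=NUM_PERM, dtype=np.int64)
B = rng.integers(0, P, size=NUM_PERM, dtype=np.int64)

def minhash_signature(item_ids):
    """Classic minwise hashing: for each hash h_j(x) = (A_j*x + B_j) mod P,
    keep the minimum hash value over the set's items."""
    x = np.fromiter(item_ids, dtype=np.int64)
    hashes = (A[:, None] * x[None, :] + B[:, None]) % P   # shape (NUM_PERM, |set|)
    return hashes.min(axis=1)

def estimated_resemblance(sig1, sig2):
    """The fraction of matching minwise hashes estimates the resemblance
    (Jaccard similarity) of the underlying sets."""
    return float(np.mean(sig1 == sig2))

# Two sets of "n-gram ids" with substantial overlap.
s1, s2 = set(range(0, 800)), set(range(200, 1000))
true_jaccard = len(s1 & s2) / len(s1 | s2)
est = estimated_resemblance(minhash_signature(s1), minhash_signature(s2))
print(f"true resemblance: {true_jaccard:.3f}   minhash estimate: {est:.3f}")
```

Storing only the signatures (here 256 integers per set, or a few bits of each under b-bit minwise hashing) is what makes similarity search and kernel-style learning feasible at the scales described in the abstract.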