Modeling Node Popularity in Networks and Bootstrap Methods for Massive Data
Srijan Sengupta, University of Illinois at Urbana-Champaign
In this talk I will present recent work in two emerging areas of statistics, namely networks and big data.
Network data analysis is a rapidly growing research field in statistics, with a strong emphasis on the study of community structure using blockmodels. A network feature that is closely associated with community structure is the popularity of nodes in different communities. Neither the classical stochastic blockmodel nor its degree-corrected extension can satisfactorily capture the dynamics of node popularity. I will propose a popularity-adjusted blockmodel for flexible modeling of node popularity. I will establish consistency of likelihood modularity for community detection under the proposed model, and illustrate the improved empirical insights that can be gained through this methodology by analyzing the political blogs network and the British MP network.
The bootstrap is a popular and powerful method for assessing the precision of estimators and inferential methods. However, for massive datasets, which are increasingly prevalent, the bootstrap becomes prohibitively expensive computationally, even on modern computing platforms. Building on the Bag of Little Bootstraps, or BLB (Kleiner et al., 2014), and the idea of the fast double bootstrap, I will propose a fast resampling method, the subsampled double bootstrap (SDB), for both independent data and time series data. The SDB is consistent under mild conditions in both the independent and dependent cases. Methodologically, SDB is superior to BLB in terms of speed, sample coverage, and automatic implementation for a given time budget. Its advantage relative to BLB and the bootstrap is also demonstrated in simulations and a data illustration.