
Dissertation Defense: Haozhen Yu
Title: Local Assessment of Model Sensitivity and Case Influence in Regression and Beyond
Abstract: The sensitivity of a model to data perturbations is key to model diagnostics and to understanding model stability and complexity. Historically, case-deletion statistics, such as Cook's distance, have primarily been considered for sensitivity analysis in linear regression, where the notions of leverage and residual are central to understanding the influence of individual data points. However, systematic methods for quantifying model sensitivity to data perturbations across various modeling procedures have been lacking.
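For concreteness, Cook's distance combines exactly these two ingredients. With leverage $h_{ii}$ (the $i$th diagonal of the hat matrix), residual $r_i = y_i - x_i^\top \hat\beta$, $p$ parameters, and residual variance estimate $\hat\sigma^2$, it can be written as
\[
D_i \;=\; \frac{(\hat\beta_{(i)} - \hat\beta)^\top X^\top X \,(\hat\beta_{(i)} - \hat\beta)}{p\,\hat\sigma^2}
\;=\; \frac{r_i^2}{p\,\hat\sigma^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2},
\]
where $\hat\beta_{(i)}$ denotes the least-squares fit with case $i$ deleted.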
This dissertation bridges this gap by investigating two complementary approaches to quantifying model changes, local assessment and case-deletion statistics, and by interpreting their implications. The local influence approach examines model changes caused by infinitesimal data perturbations, requiring only the full-data solution and thus avoiding additional model fitting. In contrast, case-deletion statistics measure the difference between the leave-one-out (LOO) and full-data solutions.
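In generic notation (an illustrative sketch, not necessarily the dissertation's exact formulation), a case-deletion statistic compares the LOO fit $\hat\theta_{(i)}$ with the full-data fit $\hat\theta$ through some discrepancy
\[
T_i \;=\; d\big(\hat\theta_{(i)},\, \hat\theta\big),
\]
of which Cook's distance above is the special case where $d$ is a scaled quadratic form for least squares.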
The first part of the dissertation focuses on local assessment. When case-deletion statistics are computationally expensive, infinitesimal case-weight perturbation provides a practical alternative for evaluating model sensitivity. This local influence analysis reveals notable commonalities in the form of case influence across different methods, allowing us to generalize the concepts of leverage and residual far beyond linear regression. At the same time, the results reveal method-dependent differences in the mode of case influence. Through the lens of local influence, we provide a generalized, unifying perspective on case sensitivity across modeling procedures, including regularized regression, large margin classification, generalized linear models, and quantile regression.
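As a sketch of the case-weight perturbation idea in generic notation (our symbols; the dissertation's exact formulation may differ), attach a weight $w_i$ to the loss of case $i$ in a penalized empirical risk and differentiate the solution at the full-data weight:
\[
\hat\theta(w_i) \;=\; \arg\min_{\theta} \Big\{ \sum_{j \neq i} L(z_j;\theta) \;+\; w_i\, L(z_i;\theta) \;+\; \lambda\, P(\theta) \Big\},
\qquad
\dot\theta_i \;=\; \frac{\partial \hat\theta(w_i)}{\partial w_i}\bigg|_{w_i = 1}.
\]
Setting $w_i = 0$ recovers the LOO solution, so $\dot\theta_i$ serves as a refit-free local approximation to the case-deletion effect. When the objective is smooth, differentiating the stationarity condition gives the classical influence-function identity $\dot\theta_i = -H_{\hat\theta}^{-1}\, \nabla_{\theta} L(z_i;\hat\theta)$, where $H_{\hat\theta}$ is the Hessian of the full objective at $\hat\theta$.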
The second part of the dissertation focuses on case-deletion statistics. Although computing these statistics is typically computationally intensive, previous work on quantile regression and the support vector machine (SVM) has produced efficient path-following algorithms for computing the LOO solution. Using these algorithms, we examine case-deletion statistics in both modeling procedures. For quantile regression, despite the lack of a closed-form expression for the case-deletion statistics, influential observations exhibit either large generalized leverage or large residuals, consistent with the diagnostic principles underlying Cook's distance in mean regression. For classification, inspired by Cook's distance, we propose a threshold-based approach to identifying influential observations in the SVM and demonstrate its effectiveness through empirical studies.
Advisor: Yoonkyung Lee