Title: Statistical inference for some problems in network analysis
Co-Chairs: Associate Professor Liza Levina, Associate Professor Ji Zhu
Cognate Member: Professor Mark Newman
Members: Assistant Professor Xuanlong Nguyen
Abstract: Recent advances in computing and measurement technologies have led to an explosion in the amount of data that are being collected in all areas of application. Much of these data have network or graph structures, and they are common in diverse engineering and scientific areas, such as biology, computer science, electrical engineering, economics, sociology and so on. This dissertation makes three contributions to inference problems in statistical network analysis. The first two parts of the dissertation focus on community analysis. In the first part, we establish general theory for checking consistency of community detection under the degree-corrected block model, a generalization of standard stochastic block model, which allows variation in node degrees within a community, thus accommodating hub nodes. We compare several community detection criteria under both the standard and the degree-corrected block models. We show which criteria are consistent under which models and constraints, as well as compare their relative performance in practice. The second part proposes a new framework for community identification. Most community detection methods focus on partitioning the entire network into communities, with the expectation of many ties within communities and few ties between. However, many networks contain nodes that do not fit in with any of the communities, and forcing every node into a community can distort results. We propose a framework that extracts one community at a time, allowing for weakly connected nodes. The proposed extraction criterion performs well on simulated and real networks. For the case of the block model, we establish asymptotic consistency of estimated node labels and propose a hypothesis test for determining the number of communities. The third part makes a contribution to the link prediction problem. In many applications, notably in genetics, a partially observed network may not contain any negative examples of absent edges, which creates a difficulty for many existing supervised learning approaches. We develop a new method which treats the observed network as a sample of the true network with different sampling rates for positive and negative examples, where the sampling rate for negative examples is allowed to be 0. We obtain a relative ranking of potential links by their probabilities, utilizing information on node covariates as well as on network topology. Empirically, the method performs well under many settings, including when the observed network is sparse and when the sampling rate is low even for positive examples. We also apply the method to a protein-protein interaction network and a school friendship network.