Huang Lin of the Department of Biostatistics defends his dissertation on "Statistical Theory and Methodology for the Analysis of Microbial Compositions, with Applications".
Committee Chairperson: Shyamal D. Peddada, PhD, Department of Biostatistics
Ying Ding, PhD, Department of Biostatistics
Jeanine Buchanich, PhD, MPH, MEd, Department of Biostatistics
Hong Wang, PhD, Department of Biostatistics
Matthew Rogers, PhD, Department of Surgery
Graduate faculty of the University and all other interested parties are invited to attend via Zoom: https://pitt.zoom.us/j/610330510
The human body is estimated to have more than 100 times as many microbial genes as human genes. Accordingly, researchers are increasingly finding associations between the microbiome and human diseases such as obesity, inflammatory bowel disease, and HIV. Determining which microbes differ in abundance between environments, known as differential abundance (DA) analysis, is a challenging and important problem that has received considerable interest during the past decade. It is well documented in the literature that observed microbiome data (the OTU/SV table) are relative abundances with an excess of zeros. Since relative abundances sum to a constant, these data are necessarily compositional. Hence conventional methods of analysis are not appropriate, as they inflate the false discovery rate (FDR). Analysis of Composition of Microbiomes (ANCOM) was the first DA method that honored the compositional structure of microbiome data and allowed statistical inference regarding absolute abundances using relative abundance data. While ANCOM controls FDR under some reasonable conditions, it can be computationally intensive, and it does not provide p-values for individual taxa. To overcome these difficulties, in this dissertation we develop a general regression-based framework, called the Analysis of Composition of Microbiomes with Bias Correction (ANCOM-BC), that can be used to address a broad collection of problems encountered by researchers. First, it introduces a novel normalization procedure that asymptotically eliminates the bias due to differential sampling fractions across samples. Second, it performs DA analysis while controlling FDR as well as ANCOM does and maintaining high power when comparing two or more ecosystems. Third, it is applicable to time-course and other multi-group studies while controlling the mixed-directional FDR (mdFDR).
The framework is general enough to accommodate a variety of study designs including repeated measures and longitudinal studies. Lastly, the framework allows researchers to investigate distance correlations and develop networks among microbiomes within and between samples.
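The compositional effect described above can be illustrated with a minimal sketch. The numbers below are hypothetical and are not taken from the dissertation; the sketch only shows why relative abundances can mislead a naive comparison, and it is not an implementation of ANCOM or ANCOM-BC.

```python
# Hypothetical absolute abundances for three taxa in two ecosystems.
# By construction, only taxon "A" truly changes (it quadruples).
absolute = {
    "ecosystem_1": {"A": 100.0, "B": 100.0, "C": 100.0},
    "ecosystem_2": {"A": 400.0, "B": 100.0, "C": 100.0},
}


def relative_abundance(counts):
    """Divide each taxon by the sample total so the values sum to 1."""
    total = sum(counts.values())
    return {taxon: value / total for taxon, value in counts.items()}


def fold_changes(table):
    """Ratio of ecosystem_2 to ecosystem_1 for each taxon."""
    first, second = table["ecosystem_1"], table["ecosystem_2"]
    return {taxon: second[taxon] / first[taxon] for taxon in first}


# What a sequencing experiment would typically observe: proportions only.
relative = {name: relative_abundance(c) for name, c in absolute.items()}

absolute_fc = fold_changes(absolute)
relative_fc = fold_changes(relative)

for taxon in sorted(absolute_fc):
    print(
        taxon,
        "absolute fold change:", round(absolute_fc[taxon], 3),
        "relative fold change:", round(relative_fc[taxon], 3),
    )
```

In this toy example taxa B and C are unchanged in absolute terms, yet their relative abundances are halved because the proportions must sum to one. A method that tests relative abundances directly would flag them as differentially abundant, which is the kind of false discovery the abstract refers to.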
This dissertation is organized as follows. In the first part, we introduce an offset-based regression model motivated by the general structure of microbiome data. We develop an EM-algorithm-based methodology to estimate the parameters of the model. Using these estimators, a pivotal statistic is constructed for performing DA analysis of each taxon while controlling the overall FDR. We demonstrate analytically that all existing methods, except for ANCOM, fail to control FDR. In the second part, we extend the methodology to perform DA analysis when there are more than two experimental groups. We introduce methods for a variety of alternative hypotheses. For example, we develop union-intersection type tests for multiple ecosystems while controlling the mixed-directional FDR (mdFDR), as well as tests for patterns of absolute abundance over ordered ecosystems. Ordered ecosystems arise in dose-response studies, time-course experiments, disease stages, etc. In many applications, researchers are interested in studying dependence among microbes within an ecosystem or across ecosystems (e.g., gut and oral microbiomes). The standard notion of correlation is not appropriate for these data. Although the Dirichlet-Multinomial model is widely (and wrongly) used, it forces a correlation structure that does not hold for microbiome or other similar count data. In the third part of this dissertation, we plan to develop distance-correlation-based methods for the simplex to address such problems. The methods developed in this dissertation will be applied to real data made available to us by our collaborators.