Biostatistics Research Day

An annual departmental event that showcases student research and promotes interdisciplinary research among graduate students and faculty. This is a fantastic opportunity for students to gain experience presenting their research on statistical methods or applications, receiving feedback, and building connections with faculty and alumni.

2024 Event Details

Thursday, February 22
1:30-5:30 p.m.

Pitt Public Health, G23 and the Commons

Keynote Speaker
Amanda Golbeck
1:30-2:30, G23

Poster Presentations
2:30-3:45, Commons

Oral Presentations
3:45-5:30, G23

2023 Presentations and Abstracts 

Posters

Wenjia Wang - Accurate and Ultra-efficient p-value Calculation for Higher Criticism Tests

In modern data analysis for detecting rare and weak signals, the higher criticism (HC) test and its variations have been effective group-testing methods, but computational accuracy and speed have long been an issue when the number of p-values combined (K) and/or the number of repeated tests (N) is large. To this end, we propose refined computing strategies for higher criticism and four of its variations. Specifically, we first propose a cross-entropy-based importance sampling (ISCE) method to benchmark all existing methods and develop a modified SetTest (MST) analytic approach. We then propose an ultra-fast interpolation (UFI) computation method, independent of K, based on our pre-calculated statistical tables. Finally, by integrating the methods above, we construct a computation strategy and an R package “HCp” for accurate and ultra-fast p-value computation for virtually any K and for small p-values in HC tests. Extensive simulations are implemented to benchmark the accuracy and speed of the proposed methods. By applying the method to a COVID-19 disease surveillance example of spatio-temporal patient cluster detection, we confirm its viability for such large-scale inferences.
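
As background for the computational burden this abstract targets, the sketch below (a plain Monte Carlo baseline in Python, not the authors' HCp package or its ISCE/MST/UFI strategies) computes the HC statistic for K p-values and estimates its p-value by brute-force simulation under the global null; this is exactly the approach that becomes infeasible when K and N are large:

    import numpy as np

    def hc_stat(pvals):
        # Higher criticism statistic; some variants restrict the maximization
        # range (e.g., to p-values above 1/K) for numerical stability.
        p = np.sort(pvals)
        K = p.size
        i = np.arange(1, K + 1)
        return np.max(np.sqrt(K) * (i / K - p) / np.sqrt(p * (1 - p)))

    rng = np.random.default_rng(1)
    K = 100
    observed = hc_stat(rng.uniform(size=K))  # stand-in for real p-values
    # Under H0 the K p-values are i.i.d. Uniform(0,1): simulate HC's null law
    null = np.array([hc_stat(rng.uniform(size=K)) for _ in range(10_000)])
    p_value = (1 + np.sum(null >= observed)) / (1 + null.size)
    print(p_value)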

Michelle Sun - The epigenetic determinant of time-restricted feeding and health span is linked to the 12h ultradian oscillator

A distinct 12-hour clock exists in addition to the 24-hour circadian clock to coordinate metabolic and stress rhythms. These 12-hour rhythms are established by an XBP1s-dependent 12-hour oscillator. We have evidence to suggest that RBBP5 is the common epigenetic regulator of the 12-hour oscillator and diverse stress responses. It is known that knockdown of RBBP5 blunts transcriptional responses to stress and that RBBP5 expression is downregulated in humans with fatty liver. Yet here we show that RBBP5 heterozygous knockout mice are paradoxically healthier. Heterozygous mice exhibit improved glucose tolerance and insulin sensitivity. They are leaner, with greater energy expenditure, despite no significant differences in activity and food intake compared to WT mice. Heterozygous mice also demonstrate self-imposed time-restricted feeding, with food intake reduced during the rest phase and increased during the activity phase. These findings suggest that RBBP5 has a hormetic and systemic effect on metabolism and energy intake.

Benjamin Melchior Kacso Panny - Generalized linear mixed-effects models of two-step task performance in people with depression in an RCT involving ketamine infusion and automated self-association training

Background: GLMMs can be used to study human decision-making in a reward-learning task (the two-step task) by separating “Model-Free” (MF) and “Model-Based” (MB) influences on task performance. The impact of certain interventions on such influences in the context of depression is unknown. Methods: Task data were collected from a sample of depressed patients (n = 104) in an RCT involving a single ketamine infusion and automated self-association training. Task data were collected at two to five visits per participant. GLMMs were fit to the data using REML to determine MF and MB influences on behavior in the sample, according to intervention arm. Results: Results reveal evidence of model-free and model-based behavior patterns in our sample, while a single ketamine infusion does not appear to affect either model-free or model-based influences on behavior. Similar further analyses will be performed to test for the effect of automated self-association training, alone and in combination with a ketamine infusion, on these MF and MB influences. Further analyses will also test whether changes in depression symptom severity over time affect MF and MB influences on task performance. Conclusion: Model-free and model-based behavior patterns are present in a sample of people with depression. Further work is needed to determine the relationship between these behavior patterns and illness outcomes, and how changes in illness severity relate to these behavior patterns.
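
For readers unfamiliar with the two-step task analysis: the standard approach regresses whether a participant repeats ("stays with") their previous first-stage choice on the previous trial's reward and transition type. A minimal fixed-effects sketch on simulated trials (not the study's GLMM, which adds per-participant random effects and intervention-arm terms) shows how MF and MB influences are read off the coefficients:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 2000  # one row per simulated trial
    reward = rng.choice([-0.5, 0.5], n)   # previous trial rewarded or not
    common = rng.choice([-0.5, 0.5], n)   # previous transition common vs. rare
    # simulate stays with both an MF effect (reward) and an MB effect
    # (reward x transition interaction)
    logit = 0.5 + 1.0 * reward + 0.8 * reward * common
    stay = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))

    X = sm.add_constant(np.column_stack([reward, common, reward * common]))
    fit = sm.Logit(stay.astype(float), X).fit(disp=0)
    # coefficient on reward ~ MF influence; on reward*common ~ MB influence
    print(fit.params)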

Runjia Li - Estimation of Conditional Average Treatment Effects for Time-to-event Outcomes with Competing Risks
Numerous statistical theories and methods have been proposed for estimating causal treatment effects in observational studies. The majority of approaches can be categorized into outcome-based modeling, treatment-based modeling, and modeling of both outcome and treatment with a doubly robust feature. Currently, most methods with a doubly robust feature do not address treatment-effect heterogeneity, specifically the estimation of personalized treatment effects for time-to-event outcomes with competing risks. We developed a framework for estimating conditional causal average treatment effects, defined as the risk difference of cumulative incidence functions given a subject’s characteristics. Our method integrates targeted maximum likelihood estimation with various algorithms for outcome modeling and propensity score modeling. In extensive simulation studies, our method outperformed competing methods, even in scenarios where the outcome model was mis-specified. Application of our method is illustrated in a study of treatment effects for sepsis patients admitted to intensive care units.
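
To make the estimand concrete, the sketch below is a naive plug-in (not the proposed TMLE-based estimator) that computes the risk difference of cumulative incidence functions at a horizon t0 within covariate subgroups, on simulated data with complete follow-up (no censoring):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5000
    x = rng.integers(0, 2, n)           # one binary covariate defining subgroups
    a = rng.integers(0, 2, n)           # treatment (as-if randomized here)
    # treatment delays events, so the cause-1 risk difference comes out negative
    time = rng.exponential(scale=np.where(a == 1, 12, 8), size=n)
    cause = rng.choice([1, 2], size=n)  # cause 1 = event of interest, 2 = competing

    def cif(t, times, causes, k=1):
        # empirical cumulative incidence of cause k at time t (no censoring)
        return np.mean((times <= t) & (causes == k))

    t0 = 10.0
    for g in (0, 1):
        m = x == g
        rd = cif(t0, time[m & (a == 1)], cause[m & (a == 1)]) \
             - cif(t0, time[m & (a == 0)], cause[m & (a == 0)])
        print(f"subgroup x={g}: CIF risk difference at t={t0}: {rd:+.3f}")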

Manqi Cai - MODE: Multi-Omics Cellular Deconvolution with Single-Cell RNA-Seq and DNAm References
Cellular deconvolution has been widely used to infer the cellular composition of bulk gene expression data, largely facilitated by the popularity of single-cell RNA-seq. Recently, whole-genome sequencing of single-cell DNA methylation (scDNAm) has been emerging and creates new possibilities for deconvolving bulk DNAm data, especially in solid tissues that lack cell references. Multiple studies have found that multi-omics data share similar cell-type markers. When multi-omics data are collected from the same tissue samples, a joint deconvolution will provide more accurate estimates of the unified underlying cellular fractions than deconvolving each omics data type separately. To achieve this goal, we develop MODE (Multi-Omics DEconvolution), a novel deconvolution framework that utilizes both single-cell DNAm and RNA-seq references to accurately estimate cellular fractions from bulk omics data. With ultrahigh-dimensional and sparse scDNAm data, MODE accounts for the spatial dependence of nearby DNAm sites and builds a precise signature matrix. Real-data benchmarking shows that MODE improves cellular fraction estimates by jointly deconvolving multi-omics data collected from the same samples.
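
MODE itself is more involved (it builds a signature matrix that accounts for spatial dependence among DNAm sites), but the core idea of joint deconvolution, namely a single shared fraction vector explaining both omics layers, can be sketched by stacking the two reference signature matrices and solving one non-negative least-squares problem. All matrices below are simulated placeholders:

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(3)
    n_cell = 4
    S_rna = rng.gamma(2.0, size=(500, n_cell))   # genes x cell types (scRNA-seq)
    S_dnam = rng.uniform(size=(800, n_cell))     # CpG sites x cell types (scDNAm)
    f_true = np.array([0.4, 0.3, 0.2, 0.1])      # shared underlying fractions
    y_rna = S_rna @ f_true + rng.normal(0, 0.10, 500)
    y_dnam = S_dnam @ f_true + rng.normal(0, 0.01, 800)

    # rescale each block so neither omics layer dominates the joint fit
    def scale(A, y):
        s = np.linalg.norm(y)
        return A / s, y / s

    A1, b1 = scale(S_rna, y_rna)
    A2, b2 = scale(S_dnam, y_dnam)
    f, _ = nnls(np.vstack([A1, A2]), np.concatenate([b1, b2]))
    f /= f.sum()                                 # fractions sum to one
    print(np.round(f, 3))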

Jinhong Li - Fusion Learning for Causal Inference Using Distributed Data
When estimating the conditional average treatment effect (CATE) in data-driven precision medicine, data integration is often undertaken to achieve a larger sample size and better statistical efficiency. Since data from different sources may be inherently heterogeneous, challenges arise in how to optimally and securely combine information while accounting for between-study heterogeneity. Meanwhile, privacy concerns may prevent the sharing of individual-participant data. In this paper, we propose a two-step doubly robust algorithm: we first estimate the marginal effects and propensity scores in each distributed dataset to formulate an objective function for estimating the CATE; we then aggregate study-level model estimates to optimize the objective function. Specifically, in the second step, we leverage the fused lasso to capture heterogeneity among data sources and confidence distributions to preserve data privacy and avoid the sharing of individual-participant data. The performance of this approach is evaluated by simulation studies and a real-world study of the causal effect of oxygen saturation targets on survival rates for multi-hospital ICU patients.
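
As a toy illustration of the fusion idea in the second step (not the paper's algorithm; all numbers are invented), a fused-lasso penalty on pairwise differences of study-level estimates shrinks similar studies toward a shared value while letting an outlying study stay apart. This sketch uses cvxpy:

    import cvxpy as cp
    import numpy as np

    theta_hat = np.array([0.80, 0.90, 0.85, 0.20])  # study-level estimates
    se = np.array([0.10, 0.12, 0.10, 0.10])         # their standard errors
    K = theta_hat.size
    lam = 0.5

    theta = cp.Variable(K)
    fit = cp.sum_squares(cp.multiply(1 / se, theta - theta_hat))  # data fidelity
    fuse = sum(cp.abs(theta[i] - theta[j])                        # fusion penalty
               for i in range(K) for j in range(i + 1, K))
    cp.Problem(cp.Minimize(fit + lam * fuse)).solve()
    print(np.round(theta.value, 3))  # studies 1-3 fuse; study 4 stays apart

Only the study-level estimates and standard errors enter the objective, which is consistent with the privacy motivation: no individual-participant data are shared.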

Penghui Huang - Accurate estimation of rare cell type fractions from tissue omics data via hierarchical deconvolution
Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions both to deconfound differential expression analyses and to infer cell type-specific differential expression. While experimentally counting cells is not feasible in most tissues and studies, in-silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulty estimating highly correlated or rare cell subtypes. To address this challenge, we propose Hierarchical Deconvolution (HiDecon), which uses single-cell RNA sequencing references and a hierarchical cell type tree, modeling the similarities among cell types and cell differentiation relationships, to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed upwards and downwards along the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real-data applications with measured ground-truth fractions, we demonstrate that HiDecon significantly outperforms existing methods and accurately estimates cellular fractions.
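
HiDecon couples estimates across all tree layers simultaneously; the much cruder two-pass stand-in below conveys only the top-down intuition: deconvolve coarse parent types first, then split each parent's fraction among its children with a second NNLS rescaled to the parent total. The tree and matrices are toy placeholders:

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(4)
    # toy tree: parent 0 -> children {0, 1}; parent 1 -> children {2, 3}
    children = {0: [0, 1], 1: [2, 3]}
    S_child = rng.gamma(2.0, size=(400, 4))      # genes x child cell types
    S_parent = np.column_stack([S_child[:, c].mean(axis=1)
                                for c in children.values()])
    f_true = np.array([0.35, 0.05, 0.45, 0.15])  # child 1 is a rare subtype
    y = S_child @ f_true + rng.normal(0, 0.05, 400)

    f_par, _ = nnls(S_parent, y)
    f_par /= f_par.sum()                         # coarse (parent) fractions
    f_child = np.zeros(4)
    for p, c in children.items():
        w, _ = nnls(S_child[:, c], y)            # split within each parent
        f_child[c] = w / w.sum() * f_par[p]      # children sum to their parent
    print(np.round(f_child, 3))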

Jiaqian Liu - Prediction of Asthma Exacerbations in Children Using Mixed Effect Random Forests on EHR data
Asthma is the most common multifactorial chronic disease among children. Identifying children at high risk of severe asthma outcomes is essential in clinical practice. Studies have utilized machine learning methods to enhance the prediction of asthma occurrence or progression using electronic health record (EHR) data. However, these studies often neglected the clustered and correlated nature of EHR data (e.g., multiple visits by the same individual). To address this issue, this study applied and evaluated the Mixed Effect Random Forests method, which accounts for the clustered structure by incorporating random effects, in constructing prediction models for binary outcomes. We applied the method to a real-world asthma EHR dataset from the Pittsburgh Children’s Hospital to predict severe asthma exacerbations.
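
The alternating logic of mixed-effect random forests can be sketched in a few lines for a continuous outcome; the study's binary-outcome version works on the link scale instead. Everything below is a simplified random-intercept stand-in using scikit-learn, not the study's implementation:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(5)
    n_groups, per = 50, 20
    g = np.repeat(np.arange(n_groups), per)          # cluster id (e.g., patient)
    X = rng.normal(size=(g.size, 3))
    b_true = rng.normal(0, 1.0, n_groups)            # true random intercepts
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + b_true[g] + rng.normal(0, 0.5, g.size)

    b = np.zeros(n_groups)                           # estimated random intercepts
    tau2 = 1.0                                       # random-intercept variance
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    for _ in range(10):                              # EM-style alternation
        rf.fit(X, y - b[g])                          # 1) forest on adjusted outcome
        resid = y - rf.predict(X)
        sigma2 = resid.var()
        for j in range(n_groups):                    # 2) shrunken group means
            r = resid[g == j]                        #    approximate the BLUPs
            b[j] = r.sum() / (r.size + sigma2 / tau2)
        tau2 = max(b.var(), 1e-3)

    # prediction for a known cluster j: rf.predict(X_new) + b[j]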

Xue Yang - Statistical Inference for Response-Adaptive Randomization Designs: A Comparative Study
Response-adaptive randomization (RAR) aims to improve patient benefit by skewing allocation toward more efficacious treatments while maintaining the validity of treatment comparisons. While numerous RAR designs have been developed in the literature, the sample average treatment effect (ATE) remains the basis of inference for the causal treatment effect. In this article, we propose alternative estimators of the causal treatment effect for RAR designs based on inverse-probability weighting (IPW) and compare them to the sample ATE analytically in terms of bias, variance, type I error, and statistical power. We conducted extensive simulation studies to assess the operating characteristics of these estimators. Results show that when implemented and analyzed correctly, RAR can treat a substantial proportion of patients with the better treatment while minimally sacrificing power compared to a standard allocation design. The analytical and simulation results also indicate that the IPW-based estimators are consistent, with smaller bias, and are more efficient than the sample ATE. Moreover, the methods based on the log relative risk and log odds ratio control the type I error better than the sample ATE.
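
The IPW construction contrasted with the sample ATE takes only a few lines of numpy; the allocation probabilities below are simulated placeholders rather than an actual RAR design, where they would evolve with the accruing responses:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 4000
    e = np.clip(rng.beta(5, 5, n), 0.1, 0.9)   # each patient's allocation prob
    a = rng.uniform(size=n) < e                # treatment assignment
    y = rng.normal(0.3 * a, 1.0)               # treatment shifts the mean by 0.3

    ate_sample = y[a].mean() - y[~a].mean()    # sample ATE (difference in means)
    ate_ipw = np.mean(a * y / e) - np.mean((~a) * y / (1 - e))
    print(round(ate_sample, 3), round(ate_ipw, 3))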

Molin Yue - Multi-Omics Cell Type Deconvolution with Multi-modal Learning
Downstream analysis of omics data at the bulk level is usually confounded by cellular heterogeneity. Various algorithms have been developed to deconvolute bulk samples with a single data modality, either gene expression or DNA methylation (DNAm), but rarely with multi-omics data. Motivated by the fact that cell-type populations are biologically shared by both gene expression and DNAm measured within the same sample, we developed a supervised, reference-free deep-learning algorithm using intermediate fusion that can robustly deconvolute bulk samples using one or multiple omics modalities. We tested this algorithm on large-scale white blood cell data and found that it achieved high Spearman correlations of 0.86 (neutrophils), 0.84 (lymphocytes), 0.83 (monocytes), 0.94 (eosinophils), and 0.70 (basophils) on the testing dataset, compared to other popular deconvolution algorithms whose correlations were no greater than 0.8. Even with one modality (e.g., DNAm) missing from the testing data, the algorithm still achieved 0.81, 0.78, 0.73, 0.94, and 0.28, demonstrating the improved accuracy of our models when utilizing both modalities and their robustness to a missing modality.
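
Intermediate fusion means each modality gets its own encoder and the embeddings are combined before the prediction head. The PyTorch sketch below is a hypothetical minimal architecture, not the authors' network; the training loss and missing-modality handling are omitted:

    import torch
    import torch.nn as nn

    class FusionDeconv(nn.Module):
        # encode each omics modality, concatenate the embeddings, and
        # predict cell-type fractions on the probability simplex
        def __init__(self, d_expr, d_dnam, n_cell, d_hid=64):
            super().__init__()
            self.enc_expr = nn.Sequential(nn.Linear(d_expr, d_hid), nn.ReLU())
            self.enc_dnam = nn.Sequential(nn.Linear(d_dnam, d_hid), nn.ReLU())
            self.head = nn.Linear(2 * d_hid, n_cell)

        def forward(self, x_expr, x_dnam):
            z = torch.cat([self.enc_expr(x_expr), self.enc_dnam(x_dnam)], dim=-1)
            return torch.softmax(self.head(z), dim=-1)  # fractions sum to one

    model = FusionDeconv(d_expr=500, d_dnam=800, n_cell=5)
    fracs = model(torch.randn(8, 500), torch.randn(8, 800))  # 8 bulk samples

Training would minimize, e.g., MSE against known fractions in the reference data; robustness to a missing modality could come from randomly dropping an encoder's input during training.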

Hung-Ching Chang - PECMAN: Partial Sum statistics for High-Dimensional Causal Mediation Analysis with Multi-omics Applications
Causal mediation analysis of high-dimensional data is challenging due to the curse of dimensionality. To reduce noise, many methods apply filtering strategies to the high-dimensional mediator, such as penalized regression and screening tests. However, most of these methods do not give consideration to the cross-world assumption, which may result in biased estimation. Recent studies have developed a series of orthogonal transformation methods to satisfy this assumption; however, orthogonal transformation can result in poor interpretability. In this study, we develop PECMAN, an interpretable high-dimensional causal mediation analysis method. We highlight two types of sparsity, in the exposure-mediator relationship and the mediator-outcome relationship, and further show that filtering strategies based on the mediator-outcome relationship satisfy the cross-world assumption under a proper sparsity assumption. Extensive simulation experiments indicate that our method has greater statistical power for detecting mediation effects at various sparsity levels. Finally, we demonstrate our method’s advantages through an application to the COPDGene dataset. We estimate the mediation effect of each patch and find that the inferior lobes contribute more mediation effects than the superior lobes.
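
The screen-then-estimate logic can be illustrated with the classic product-of-coefficients decomposition per mediator, screening on the mediator-outcome relationship as the abstract describes. This toy sketch ignores PECMAN's partial-sum statistics and multiple-testing refinements:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n, p = 300, 50
    x = rng.normal(size=n)                     # exposure
    A = np.zeros(p); A[:5] = 0.8               # only 5 true mediators
    M = np.outer(x, A) + rng.normal(size=(n, p))
    beta = np.zeros(p); beta[:5] = 0.5
    y = M @ beta + 0.2 * x + rng.normal(size=n)

    effects = {}
    for j in range(p):
        fit_y = sm.OLS(y, sm.add_constant(np.column_stack([x, M[:, j]]))).fit()
        if fit_y.pvalues[2] < 0.05 / p:        # Bonferroni screen on b_j
            fit_m = sm.OLS(M[:, j], sm.add_constant(x)).fit()
            effects[j] = fit_m.params[1] * fit_y.params[2]  # a_j * b_j
    print(effects)                             # true products are 0.8*0.5 = 0.4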

Gehui Zhang - Bayesian inference on Stochastic Volatility Models with Informative Missingness
Mobile devices that collect activity and emotional data in one’s natural environment are becoming an increasingly popular and powerful tool for clinicians and researchers. Volatility, a measure of dispersion or variation, is an interpretable and predictive metric for psychosocial and behavioral data. Existing methods for estimating stochastic volatility, which were formulated primarily in the context of financial data, assume missing at random (MAR), which is not consistent with the observational nature of data from mobile devices. The main focus of this talk is estimating stochastic volatility while accounting for informative missingness. We developed an imputation method based on Tukey’s representation (or the simplified selection model) under the stochastic volatility model with a missing not at random (MNAR) mechanism. We incorporated this novel imputation approach into a Particle Gibbs with Ancestor Sampling (PGAS) method to provide an efficient way of conducting inference on stochastic volatility with informative missingness. The performance of the method is illustrated through simulation studies and in the analysis of multi-modal ecological momentary assessment (EMA) data from a study of suicidal ideation and behavior in young adults with a history of suicidal thoughts and behaviors.
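
PGAS with MNAR imputation is well beyond a few lines, but its basic building block, a bootstrap particle filter for the standard stochastic volatility model, can be sketched under simplifying assumptions (fully observed data, known parameters):

    import numpy as np

    rng = np.random.default_rng(8)
    T, mu, phi, sig = 300, -1.0, 0.95, 0.3
    # simulate log-volatility h_t (AR(1)) and observations y_t ~ N(0, exp(h_t))
    h = np.empty(T)
    h[0] = mu + sig / np.sqrt(1 - phi ** 2) * rng.standard_normal()
    for t in range(1, T):
        h[t] = mu + phi * (h[t - 1] - mu) + sig * rng.standard_normal()
    y = np.exp(h / 2) * rng.standard_normal(T)

    # bootstrap particle filter: filtered estimate of the latent volatility path
    N = 2000
    part = mu + sig / np.sqrt(1 - phi ** 2) * rng.standard_normal(N)
    h_filt = np.empty(T)
    for t in range(T):
        if t > 0:
            part = mu + phi * (part - mu) + sig * rng.standard_normal(N)
        logw = -0.5 * (part + y[t] ** 2 * np.exp(-part))  # log N(y_t; 0, e^h)
        w = np.exp(logw - logw.max()); w /= w.sum()
        h_filt[t] = w @ part
        part = rng.choice(part, size=N, p=w)              # multinomial resampling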

Xiangning Xue - DiffCircaPipeline: A framework for multifaceted characterization of differential rhythmicity
Circadian oscillations of gene expression regulate daily physiological processes, and their disruption is linked to many diseases. Circadian rhythms can be disrupted in a variety of ways, including differential phase, amplitude, and rhythm fitness. Although many differential circadian biomarker detection methods have been proposed, a workflow for the systematic detection of multifaceted differential circadian characteristics with accurate false-positive control is not currently available. We propose a comprehensive and interactive pipeline to capture the multifaceted characteristics of differentially rhythmic biomarkers. Analysis outputs are accompanied by informative visualization and interactive exploration. The workflow is demonstrated in multiple case studies and is extensible to general omics applications.
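
Differential rhythmicity analyses typically start from per-group cosinor fits; the sketch below recovers amplitude and peak phase for one simulated group. A differential analysis would fit this per group and test for differences in phase, amplitude, or rhythm fit:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(9)
    period = 24.0
    t = rng.uniform(0, 24, 60)                   # sampling times (hours)
    y = 5 + 2 * np.cos(2 * np.pi * (t - 8) / period) + rng.normal(0, 0.5, 60)

    # cosinor model: y = m + a*cos(2*pi*t/24) + b*sin(2*pi*t/24)
    X = sm.add_constant(np.column_stack([np.cos(2 * np.pi * t / period),
                                         np.sin(2 * np.pi * t / period)]))
    m, a, b = sm.OLS(y, X).fit().params
    amplitude = np.hypot(a, b)                   # rhythm amplitude
    phase = (period / (2 * np.pi)) * np.arctan2(b, a) % period  # peak time
    print(round(amplitude, 2), round(phase, 2))  # approx. 2 and 8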

Lang Zeng - Dynamic Prediction using Time-Dependent Cox Survival Neural Network
The goal of dynamic prediction is to provide individualized risk predictions over time that can be updated as new data become available. Motivated by establishing a dynamic prediction model for the progressive eye disease age-related macular degeneration (AMD), we propose a time-dependent Cox model-based survival neural network to predict its progression on a continuous time scale using longitudinal fundus images. We evaluate and compare our proposed method with joint modeling and landmarking approaches through comprehensive simulations using two time-dependent accuracy metrics, the Brier score and dynamic AUC. We applied the proposed approach to a large AMD study, the Age-Related Eye Disease Study (AREDS), in which more than 50,000 fundus images were captured over a period of 12 years for more than 4,000 participants. We built a dynamic prediction model to predict AMD progression risk over time using longitudinal images together with demographic information. Our approach achieves satisfactory prediction performance in both simulation studies and real data analysis.
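
The classical model the network extends is the time-dependent Cox model, which consumes long-format data with one row per subject-interval. A minimal lifelines sketch, where the column names and the image-derived severity score are illustrative:

    import pandas as pd
    from lifelines import CoxTimeVaryingFitter

    # one row per (id, interval); covariates may change across intervals
    df = pd.DataFrame({
        "id":    [1, 1, 1, 2, 2],
        "start": [0, 2, 4, 0, 3],
        "stop":  [2, 4, 6, 3, 5],
        "event": [0, 0, 1, 0, 0],
        "age":   [60, 60, 60, 70, 70],
        "score": [0.2, 0.5, 0.9, 0.1, 0.15],  # e.g., image-derived severity
    })
    ctv = CoxTimeVaryingFitter(penalizer=0.1)
    ctv.fit(df, id_col="id", event_col="event",
            start_col="start", stop_col="stop")
    ctv.print_summary()

The proposed approach replaces this model's linear predictor with a neural network fed by longitudinal image features.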

Xueping Zhou - Genome Hierarchy Grouping Structure and Correlation Guided Feature Selection for Multivariate Outcome Prediction
Developing efficient feature selection and accurate prediction algorithms for multivariate phenotypes is a major and often difficult task in analyzing omics data. Many machine-learning methods have been proposed to perform such tasks for univariate outcomes, including top-performing penalized regression-based approaches. In this paper, we propose a novel supervised learning algorithm to perform feature selection and multivariate outcome prediction for data with potentially high-dimensional predictors and responses. The method incorporates known genome hierarchy grouping and correlation structures into feature selection, regression coefficient estimation, and outcome prediction under a penalized multivariate multiple linear regression model. Extensive simulations show its superior performance over competing methods. We apply the proposed method to an in-house multi-omics asthma dataset in two applications. In the first, a cell type deconvolution study, the proposed method achieves better cell type fraction prediction using bulk gene expression data. In the second, an association study between multivariate gene expression and high-dimensional DNA methylation data, the proposed method reveals novel association signals, providing more biological insight into how CpG sites regulate gene expression.
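
The base ingredient of such penalized multivariate regressions, selecting each predictor jointly across all outcomes, is what scikit-learn's MultiTaskLasso implements; the proposed method additionally exploits genome-hierarchy grouping and correlation structure, which this sketch omits:

    import numpy as np
    from sklearn.linear_model import MultiTaskLasso

    rng = np.random.default_rng(10)
    n, p, q = 200, 100, 5                     # samples, predictors, outcomes
    X = rng.normal(size=(n, p))
    B = np.zeros((p, q))
    B[:4] = rng.normal(size=(4, q))           # 4 features shared by all outcomes
    Y = X @ B + rng.normal(0, 0.5, size=(n, q))

    # rows of the coefficient matrix are zeroed together, so a feature is
    # kept or dropped for all outcomes at once
    model = MultiTaskLasso(alpha=0.2).fit(X, Y)
    selected = np.flatnonzero(np.linalg.norm(model.coef_, axis=0) > 1e-8)
    print(selected)                           # ideally features 0-3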

Jian Zou - Transcriptomic Congruence and Selection of Representative Cancer Models Towards Precision Medicine
Cancer models are instrumental as substitutes for human studies and expedite basic, translational, and clinical cancer research. For a given cancer subtype, a wide selection of models, such as cell lines, patient-derived xenografts, tumoroids, and genetically modified murine models, is often available to researchers. However, how to quantify their congruence with human tumors and select the most appropriate cancer model remains a largely unsolved issue. Here, we develop Congruence Analysis and Selection of CAncer Models (CASCAM), a statistical and machine learning framework for authenticating and selecting the most representative cancer models in pathway-specific and drug-relevant contexts using transcriptomic data. CASCAM offers harmonization between tumor and cancer model omics data, interpretable machine learning for congruence quantification, mechanistic investigation, and pathway-based topological visualization to inform the final cancer model selection. The workflow is presented using the breast cancer invasive lobular carcinoma (ILC) subtype, while the method is generalizable to any cancer subtype for precision medicine development.

Talks

Na Bo - Estimating heterogeneous survival treatment effects under the counterfactual framework
Estimating heterogeneous treatment effects plays a central role in personalized medicine, as it provides critical information for tailoring existing therapies to each patient to achieve optimal treatment. Recently, meta-learning approaches have received considerable attention for estimating the conditional average treatment effect (CATE) using multi-step algorithms coupled with flexible machine learning methods. In this project, we provide a meta-learning framework to estimate the CATE for survival outcomes. We consider several pseudo-CATE regression approaches along with popular machine learning methods such as random survival forests, Cox-Lasso, and survival neural networks. We address the advantages and challenges of applying these methods to survival outcomes through comprehensive simulations and provide guidelines for applying them in different situations. Finally, we demonstrate the methods by analyzing a large randomized clinical trial, the AREDS study of the eye disease age-related macular degeneration, to estimate the CATE and make individualized treatment recommendations.
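
The simplest member of the meta-learner family, a T-learner with arm-specific survival models, can be sketched with lifelines; the pseudo-CATE regression approaches in the abstract are more elaborate, and all data below are simulated:

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(13)
    n = 1000
    x = rng.normal(size=n)
    a = rng.integers(0, 2, n)
    t = rng.exponential(np.exp(0.5 * a * (x > 0)))  # treatment helps if x > 0
    e = (rng.uniform(size=n) < 0.8).astype(int)     # ~20% censored
    df = pd.DataFrame({"x": x, "time": t, "event": e, "a": a})

    # T-learner: one survival model per arm, contrast predicted survival at t0
    fit1 = CoxPHFitter().fit(df[df.a == 1][["x", "time", "event"]],
                             duration_col="time", event_col="event")
    fit0 = CoxPHFitter().fit(df[df.a == 0][["x", "time", "event"]],
                             duration_col="time", event_col="event")
    grid = pd.DataFrame({"x": np.linspace(-2, 2, 5)})
    t0 = 1.0
    cate = (fit1.predict_survival_function(grid, times=[t0]).values
            - fit0.predict_survival_function(grid, times=[t0]).values)
    print(np.round(cate, 3))   # survival gain from treatment at each x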

Jiahe Li - An approximated Expectation-Maximization (EM) Algorithm for integrative analysis of datasets with nonresponse
Missing data are pervasive in clinical trials and public health studies. Standard statistical methods, such as likelihood-based methods and estimating equation-based methods, often require unverifiable assumptions and modelling of the missing-data mechanism. Misspecification of the missing-data model often leads to biased estimates and wrong conclusions. For integrative analysis of data from multiple studies, the missing-data issue is exacerbated by the fact that the missing-data process varies from study to study. Modelling study-specific missing-data mechanisms under a unified framework is prohibitive. Here we propose an approximated expectation-maximization (AEM) algorithm for the integrative regression analysis of data with nonresponse, where the datasets are assumed to follow the same regression model and to be independent of each other. Each dataset may suffer from an arbitrary missing-data mechanism. With a consistent initial estimator from either a prior study or a complete dataset, we devised an AEM algorithm to approximate the sufficient statistics from empirical conditional regression estimates. Without modelling the missing-data mechanisms, the AEM algorithm yields a more efficient estimator. Simulation studies will be used to illustrate the efficiency gain from the initial estimator under various settings.
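
For context, the vanilla EM algorithm that AEM approximates alternates between imputing sufficient statistics and re-estimating parameters. A self-contained sketch for a bivariate normal with one partially missing coordinate, a MAR toy case unlike the arbitrary mechanisms AEM targets:

    import numpy as np

    rng = np.random.default_rng(11)
    n = 500
    mu_true = np.array([1.0, 2.0])
    cov_true = np.array([[1.0, 0.6], [0.6, 1.0]])
    Z = rng.multivariate_normal(mu_true, cov_true, n)
    miss = rng.uniform(size=n) < 0.3
    Z[miss, 1] = np.nan                     # second coordinate missing

    # E-step: impute E[z2 | z1] and its conditional variance; M-step: update
    mu = np.nanmean(Z, axis=0)
    S = np.diag([np.nanvar(Z[:, 0]), np.nanvar(Z[:, 1])])
    for _ in range(50):
        beta = S[0, 1] / S[0, 0]
        z2 = np.where(miss, mu[1] + beta * (Z[:, 0] - mu[0]), Z[:, 1])
        v2 = np.where(miss, S[1, 1] - beta * S[0, 1], 0.0)
        mu = np.array([Z[:, 0].mean(), z2.mean()])
        d1, d2 = Z[:, 0] - mu[0], z2 - mu[1]
        S = np.array([[np.mean(d1 * d1), np.mean(d1 * d2)],
                      [np.mean(d1 * d2), np.mean(d2 * d2 + v2)]])
    print(np.round(mu, 2), np.round(S, 2))  # close to the true mean/covariance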

Zhiyu Sui - Robust Transfer Learning of Individualized Treatment Rules
Causality-based individualized treatment rules (ITRs) are a stepping stone to precision medicine. To ensure unconfoundedness, ITRs are ideally derived from randomized experimental data, but the use cases of ITRs in real-world data extend far beyond these controlled settings. It is of great interest to transfer knowledge learned from experimental data to real-world data, but hurdles remain. In this paper, we address two challenges in the transfer learning of ITRs. 1) In well-designed experiments, granular information crucial to decision making can be thoroughly collected; however, part of it may not be accessible in real-world decision making. 2) Experimental data with strict inclusion criteria reflect a population that may be very different from the real-world population, leading to suboptimal ITRs. We propose a unified weighting scheme to learn a calibrated and robust ITR that simultaneously addresses the issues of covariate shift and missing covariates during prospective deployment, with a quantile-based approach to ensure worst-case safety under the uncertainty due to unavailable covariates. The performance of this method is evaluated in simulations and real-data applications.
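
The paper's unified weighting additionally handles missing covariates with a quantile-based worst-case guarantee; the covariate-shift half alone can be sketched with a density-ratio weight obtained from a domain classifier, on simulated data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(12)
    # experimental (source) and real-world (target) covariates differ
    x_exp = rng.normal(0.0, 1.0, (1000, 2))
    x_rw = rng.normal(0.5, 1.0, (1000, 2))

    # density-ratio weights w(x) ~ p_target(x) / p_source(x)
    clf = LogisticRegression().fit(np.vstack([x_exp, x_rw]),
                                   np.r_[np.zeros(1000), np.ones(1000)])
    pr = clf.predict_proba(x_exp)[:, 1]
    w = pr / (1 - pr)

    # weighted value of a candidate ITR d(x) on randomized data (P(A=1) = 0.5)
    a = rng.integers(0, 2, 1000)
    y = rng.normal(0.5 * a * (x_exp[:, 0] > 0), 1.0)   # toy outcome model
    d = (x_exp[:, 0] > 0).astype(int)                  # candidate rule
    match = a == d
    value = np.sum(w * match * y / 0.5) / np.sum(w * match / 0.5)
    print(round(value, 3))  # estimated mean outcome if the target followed d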