Previous Event Details
An annual day of exploration and collaboration engaging students, faculty and staff in biostatistics and health data science, this event features student presentations and a keynote speaker. It is a fantastic opportunity for students to gain experience presenting their research on statistical methods or applications, receive feedback, and build connections with faculty and alumni.
2025 Event Details
Wednesday, March 12
12-4 p.m.
Keynote
Artificially Intelligent Geospatial Systems: A Case Study in Spatial Energetics for Mobile Health Data
Presented by Sudipto Banerjee, senior associate dean for academic programs, professor of biostatistics and statistics, and former chair of the Department of Biostatistics, Fielding School of Public Health, UCLA.
Schedule
Oral Presentations | 12-1:30 p.m. | 3785 Scaife Hall
Keynote Talk (GR eligible) | 1:30-2:30 p.m. | 3785 Scaife Hall
Poster Presentations | 3-4 p.m. | First Floor Commons, Public Health
Oral Presentations
PhD Students
A Bayesian Outcome-Guided Clustering Framework via Consensus Clustering for Molecular Disease Subtyping
FLoRI: Federated Learning of Robust Individualized Decision Rules with Application to Heterogeneous Multi-Hospital Sepsis Populations
Heterogeneous Causal Mediation Analysis Using Bayesian Additive Regression Trees
Abstracts
Molecular disease subtyping has become an effective approach for dissecting a heterogeneous patient population into homogeneous subgroups, a key step toward precision medicine. In modern omics studies, conventional unsupervised clustering on high-dimensional molecular features may not capture clinically meaningful disease subtypes. Since Bayesian consensus clustering enjoys high flexibility in heterogeneous data integration, this motivates us to develop a multivariate-outcome-guided Bayesian consensus clustering model to uncover disease subtypes that are jointly associated with multiple, multi-type clinical outcomes. The proposed model allows separate clustering assignments based on outcomes and on omics features but requires them to adhere to an overall master clustering, which represents the intrinsic disease subtyping. Thus, compared with existing outcome-guided models, our fully model-based method allows guidance from multiple outcomes, avoids parameter tuning, and tolerates slight discrepancies among the data sources. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate the advantages of multiple-outcome guidance and the superior performance of our model.
Sepsis is a life-threatening condition affecting millions of individuals in the US each year. The complexity of sepsis clinical management makes individualized treatment approaches desirable. The University of Pittsburgh Medical Center (UPMC) has collected electronic health records data of sepsis patients from multiple hospitals. The goal of this study is to derive individualized decision rules (IDRs) that could be safely applied to and uniformly improve decision-making across hospitals in the UPMC Health System by only using a subset of hospitals for training. Traditional approaches assume that data are sampled from a single population of interest. With multiple hospitals that vary in patient populations, treatments, and provider teams, an IDR that is successful in one hospital may not be as effective in another, and the performance achieved by a globally optimal IDR may vary greatly across hospitals, preventing it from being safely applied to unseen hospitals. To address these challenges, as well as the practical restriction of data sharing across hospitals, we introduce a new objective function and a federated learning algorithm for learning IDRs that are robust to distributional uncertainty from heterogeneous data. The proposed framework uses a conditional maximin objective to enhance individual outcomes across hospitals, ensuring robustness against hospital-level variations. Compared to the traditional approach, the proposed method enhances the survival rate by 10 percentage points among patients who may experience extreme adverse outcomes across hospitals. Additionally, it increases the overall survival rate by 2-3 percentage points when the learned IDR is applied to unseen hospital populations.
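For readers less familiar with maximin-style criteria, a simplified (unconditional) version of such an objective can be written as follows; this is a generic illustration of the idea rather than the exact conditional criterion used in the work above:

\[
d^{*} \;=\; \arg\max_{d \in \mathcal{D}} \; \min_{h \in \mathcal{H}} \; \mathbb{E}_{h}\!\left[\, Y\{d(X)\} \,\right],
\]

where \(\mathcal{H}\) indexes hospitals, \(X\) denotes patient covariates, and \(Y\{d(X)\}\) is the potential outcome when treatment is assigned according to rule \(d\); the rule is chosen to maximize the worst-case expected outcome across hospitals.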
Causal mediation analysis provides insights into the mechanisms through which treatments affect outcomes. While mediation effects often vary across individuals, most existing methods focus solely on population-average effects, overlooking individual-level heterogeneity. To address this limitation, we propose a Bayesian regression tree ensemble method that flexibly models non-linear relationships and captures treatment-by-mediator interactions in the mediation process. Using hierarchical posterior sampling, our approach provides credible intervals with nominal coverage rates for testing heterogeneous mediation effects. Additionally, we leverage regression tree summaries to identify subgroups with distinct mediation effects and employ SHapley Additive exPlanation (SHAP) values to highlight key moderators and their influence on the mediation process. Comprehensive simulations demonstrate the method’s accuracy in estimating and inferring heterogeneous mediation effects. Finally, we apply our method to investigate the heterogeneous mediation of Alzheimer’s disease pathology burden on the effect of apolipoprotein E (APOE) genotype on late-life cognition.
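As background for the abstract above, the standard potential-outcomes definitions of the conditional natural indirect and direct effects that heterogeneous mediation methods target are, in generic notation:

\[
\mathrm{NIE}(x) = \mathbb{E}\!\left[ Y\{1, M(1)\} - Y\{1, M(0)\} \mid X = x \right], \qquad
\mathrm{NDE}(x) = \mathbb{E}\!\left[ Y\{1, M(0)\} - Y\{0, M(0)\} \mid X = x \right],
\]

where \(Y(a, m)\) is the potential outcome under treatment \(a\) and mediator value \(m\), and \(M(a)\) is the potential mediator under treatment \(a\); their sum is the conditional total effect.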
Poster Presentations
PhD Students
Stochastic Gradient Descent for Cox Model
Flexible Bayesian Mixture Models with Dynamic Information Borrowing for Complex Endpoints in Adaptive Critical Care Trials Design
Efficient and accurate p-value calculation for Adaptive Rank Truncated Product Test
Artifact of detecting biomarkers associated with sequencing depth in RNA-Seq
The association between amyloid and physical activity in a racially diverse cohort of older adults
The Application of a Bayesian Finite Mixture Model Approach for Clustering Correlated Mixed-type Variables
From Genes to Clocks: A Bayesian Framework to Investigate Cross-species Congruence of Circadian Rhythms
Abstracts
Optimizing Cox regression and Cox neural networks (Cox-NN) presents significant computational challenges in large-scale studies. Stochastic gradient descent (SGD), known for its scalability, has recently been adapted to optimize Cox models, where parameters are updated using mini-batches of data. In this work, we demonstrate that SGD for Cox models targets the average mini-batch log-partial likelihood. We define the maximizer of this objective as the mini-batch-based maximum partial likelihood estimator (MB-MPLE) for the Cox model. We establish that the MB-MPLE for the Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression, we further prove the root-n consistency and asymptotic normality of the MB-MPLE, with asymptotic variance depending on the batch size. This provides the statistical foundation for optimizing Cox models with SGD when the optimization error is small. Additionally, we examine the impact of batch size on Cox-NN training and on the asymptotic efficiency of the MB-MPLE in Cox regression. These findings are validated by extensive numerical experiments and provide guidance for selecting batch sizes in SGD-based Cox model optimization. Finally, we demonstrate the effectiveness of SGD in a real-world application where full gradient descent is infeasible due to the large scale of the data.
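As a point of reference, the full-sample Cox log-partial likelihood and the mini-batch analogue that the abstract refers to can be sketched as follows (generic notation, assumed for illustration):

\[
\ell(\beta) = \sum_{i:\,\delta_i = 1} \left[ x_i^{\top}\beta - \log \sum_{j:\, t_j \ge t_i} \exp\!\left(x_j^{\top}\beta\right) \right],
\qquad
\ell_{B}(\beta) = \sum_{i \in B:\, \delta_i = 1} \left[ x_i^{\top}\beta - \log \sum_{j \in B:\, t_j \ge t_i} \exp\!\left(x_j^{\top}\beta\right) \right],
\]

where \(\delta_i\) is the event indicator and \(t_i\) the observed time. SGD with mini-batches targets the expectation of \(\ell_{B}(\beta)\) over random batches \(B\) rather than \(\ell(\beta)\) itself, which is what the MB-MPLE formalizes.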
Background: Organ-support-free days (OSFD) within 28 days is a composite ICU endpoint with discrete count data (0–27) and a degenerate mass at 28 due to censoring. Its varying skewness and dispersion complicate statistical modeling. Existing covariate-adjusted response-adaptive (CARA) trial designs estimate treatment effects and allocate treatment independently within subgroups, missing opportunities to borrow information when effects are similar. We propose a Bayesian approach that enables dynamic information borrowing, improving estimation accuracy and adaptive allocation while accommodating the complex distribution of OSFD. Methods: We propose a Bayesian finite mixture model that enables adaptive information borrowing across subgroups to improve treatment-effect estimation and allocation in a CARA randomization trial. Unlike commonly used CARA methods, which estimate treatment effects and allocate treatments using only within-subgroup data at each interim analysis, our approach leverages similarities across subgroups to enhance estimation efficiency. Specifically, we model non-censored counts using a Beta-Binomial distribution and censoring using a Binomial component. Dirichlet process mixture models (DPMMs) dynamically cluster subgroups based on posterior similarity, allowing treatment-effect estimation to borrow strength from similar subgroups while preventing inappropriate borrowing. These estimates guide adaptive patient allocation based on posterior superiority probabilities, with early stopping rules and Type I error controlled via alpha-spending adjustments. Results: Simulation studies demonstrate that our method significantly reduces the mean squared error of treatment-effect estimates compared to independent subgroup analyses. In scenarios with heterogeneous treatment effects, it effectively identifies partial exchangeability, allowing adaptive information borrowing to increase the effective sample size and improve statistical power. The dynamic borrowing mechanism optimally balances robustness and efficiency, leading to more accurate estimation. Our framework facilitates earlier stopping decisions and enables more ethical and efficient treatment allocation. Conclusion: Our Bayesian adaptive design not only accounts for the complex distributional features of OSFD outcomes but also introduces an innovative information-borrowing framework that improves estimation accuracy and trial efficiency. By leveraging similarities across subgroups, this method provides a practical and ethical solution for optimizing treatment allocation and decision-making in critical care trials with composite endpoints of mortality and morbidity.
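One plausible reading of the outcome model described above, written in illustrative notation that may differ from the authors' exact parameterization, is a two-part mixture with a point mass at the censoring value:

\[
\Pr(Y_i = 28) = \pi, \qquad Y_i \mid Y_i < 28 \;\sim\; \text{Beta-Binomial}(27, \alpha, \beta),
\]

so a Binomial (Bernoulli) component governs whether a patient reaches the censored value of 28 days, and the Beta-Binomial component captures the skewed, overdispersed counts from 0 to 27.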
In large-scale omics data analysis, detecting true signals often requires effective p-value combination methods. The Adaptive Rank Truncated Product (ARTP) method is an effective way to identify sparse signals by focusing on the top-ranking associations. ARTP is a two-layer approach: the first layer uses the Rank Truncated Product (RTP) to combine the top most significant signals, and the second layer adaptively optimizes the number of top signals by minimizing the resulting p-value. Despite its nice theoretical properties and interpretability in practice, existing computational methods for ARTP are infeasible in large-scale applications when the number of p-values combined is large or the required significance level is small due to multiple comparisons. In this project, we develop a computationally efficient framework to implement ARTP accurately. We first apply an unbiased cross-entropy-based importance sampling method (ISCE) to benchmark an existing integration-based approximation for the inner RTP layer, and then apply ISCE to compute the second layer efficiently. Finally, to handle large-scale applications such as genomics data, we propose an ultra-fast approach with linear computational order based on pre-calculated statistical tables and cubic spline interpolation. With extensive simulations, we demonstrate that our package can combine up to 2,000 p-values with target significance levels down to 10^(-15).
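To make the two-layer construction concrete, the sketch below implements a naive Monte Carlo version of ARTP in Python: the first layer converts Rank Truncated Product statistics at several truncation points into p-values, and the second layer calibrates their minimum. This brute-force illustration is exactly what becomes infeasible at the scales described above; it is not the importance-sampling or interpolation scheme developed in the project.

```python
import numpy as np

def rtp_stat(pvals, k):
    """Rank Truncated Product statistic: sum of logs of the k smallest p-values."""
    return np.sum(np.log(np.sort(pvals)[:k]))

def artp_pvalue(pvals, ks=(1, 5, 10, 20), n_sim=20000, seed=0):
    """Naive two-layer ARTP p-value via Monte Carlo under the global null.

    Layer 1: convert each RTP statistic into a p-value by comparison with
             statistics computed from simulated uniform p-values.
    Layer 2: take the minimum layer-1 p-value over the candidate truncation
             points and calibrate it against the same null simulations.
    """
    rng = np.random.default_rng(seed)
    K = len(pvals)
    ks = [k for k in ks if k <= K]

    # Null distribution of RTP statistics for each truncation point.
    null_p = rng.uniform(size=(n_sim, K))
    null_stats = np.stack([np.sum(np.log(np.sort(null_p, axis=1)[:, :k]), axis=1)
                           for k in ks], axis=1)                   # (n_sim, len(ks))

    # Layer 1: per-k p-values (a smaller statistic means stronger evidence).
    obs_stats = np.array([rtp_stat(pvals, k) for k in ks])
    layer1_obs = (null_stats <= obs_stats).mean(axis=0)
    ranks = null_stats.argsort(axis=0).argsort(axis=0)             # ranks within each column
    layer1_null = (ranks + 1) / n_sim                              # per-k null p-values

    # Layer 2: calibrate the adaptive minimum over truncation points.
    obs_min = layer1_obs.min()
    null_min = layer1_null.min(axis=1)
    return (null_min <= obs_min).mean()

# Example: 100 p-values with a handful of strong signals.
rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-4, size=5)])
print(artp_pvalue(p))
```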
Bulk RNA-seq data exhibit significant sample-to-sample variation due to technical biases such as library size, which can confound biological heterogeneity with technical effects. An essential step before analyzing gene expression under different biological conditions is therefore normalization, in which raw data are adjusted to eliminate systematic experimental bias and technical variation. Multiple popular normalization methods have been proposed and widely used, including counts per million (CPM), the median-of-ratios method implemented in DESeq2, and the Trimmed Mean of M-values (TMM) method in edgeR. However, in a motivating in-house RNA-seq dataset of human post-mortem striatum, we found that the normalized expression levels of a large proportion of genes remained correlated with library size. We confirmed that this problem is widespread by systematically examining 159 GEO datasets and 24 TCGA datasets. To address it, we present a new modeling framework for normalization: a negative binomial regression in which the relationship between observed expression counts and library size is modeled with a gene-specific power term. We show through simulation and real datasets that the proposed normalization successfully removes the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. In particular, the proposed normalization scheme improves the detection accuracy of differentially expressed genes and circadian biomarkers.
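A hedged sketch of what a gene-specific power adjustment for library size could look like (illustrative notation; the exact model in the work above may differ):

\[
Y_{gi} \sim \text{NegBin}(\mu_{gi}, \phi_g), \qquad \log \mu_{gi} = \beta_{g0} + \gamma_g \log L_i,
\]

where \(Y_{gi}\) is the observed count for gene \(g\) in sample \(i\), \(L_i\) is the library size, and \(\gamma_g\) is the gene-specific power term; \(\gamma_g = 1\) recovers ordinary proportional library-size scaling, while estimated departures from 1 capture the residual dependence on sequencing depth described above.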
Introduction: Physical activity (PA) is proposed to be an effect modifier of Alzheimer's disease (AD) risk; however, it remains unclear whether this modification results from a direct impact on AD pathophysiology or from indirect mechanisms, such as increased cognitive reserve or reduced cardiometabolic risk. In the present study of racially diverse older adults, we explored the association between PA and amyloid deposition measured by [11C] Pittsburgh Compound B PET (PiB). Methods: This study included 238 participants from two cohorts: Heart SCORE (HS, n=115) and the Human Connectome Project (HCP, n=123). Participants received PiB imaging on two different scanners (Siemens HR+/Biograph mCT) to assess amyloid-β (Aβ) deposition. PA data were collected using wrist-worn Actiwatch devices for HCP (processed with ActiLife) and ActiGraph GT9X for HS (analyzed with the GGIR R package). PA metrics included daily step count, light physical activity, moderate activity, moderate-to-vigorous physical activity (MVPA), and inactivity minutes. To address scanner-related variability, PA variables were harmonized using the ComBat method, and Aβ deposition SUVRs were adjusted using Transfer Learning ComBat (TL-ComBat). Spearman's correlation analyses examined associations between Aβ deposition and PA outcomes, with bootstrapped 95% confidence intervals. Multinomial logistic regression was applied to evaluate Aβ deposition across brain regions, categorizing SUVR values into low, medium, and high groups based on the 1/3 and 2/3 quantiles. Models were adjusted for age, sex, ApoE4 status, race, and education. Results: Spearman's correlation analyses showed that higher levels of PA were significantly associated with lower Aβ deposition across several brain regions (p < 0.05 for all models). In multinomial logistic regression, each one-unit increase in moderate activity time (64.5 minutes) and MVPA (64.6 minutes) was associated with a 47.6% higher likelihood of being in the low Aβ group versus the high Aβ group and a 44.9% higher likelihood of being in the medium Aβ group versus the high Aβ group. Comparison of the ComBat and TL-ComBat methods demonstrated that TL-ComBat more effectively reduced scanner-related variability in SUVR values. Anderson-Darling tests confirmed improved harmonization, with the distributions of SUVRs between scanners showing greater alignment after TL-ComBat adjustment. Conclusion: This study provides evidence that higher PA levels, including daily steps and MVPA, are associated with lower Aβ deposition in a racially diverse cohort. Given the skewness observed in the brain imaging data in this study, applying TL-ComBat proved more effective at reducing scanner variability than applying ComBat to the entire dataset.
Existing mixed-type data clustering methods often assume local independence without accounting for cluster-specific dependence structures, and they fail to provide a quantitative measure for variable selection or to handle censored biomarkers. To address these limitations, we propose a Bayesian finite mixture model (BFMM) that integrates three distinct cluster-specific covariance structures for mixed-type data, spike-and-slab priors to quantify variable importance, and a specialized sampling step to impute censored biomarkers. Through extensive simulations, we demonstrate that BFMM outperforms existing methods in clustering accuracy, especially when censoring is introduced. We further applied the proposed BFMM clustering method to two real-world clinical datasets and performed post hoc analyses: (1) the SENECA study, in which participants met Sepsis-3 criteria within 6 hours of hospital presentation; and (2) the EDEN study, in which participants were within 48 hours of developing acute lung injury (ALI). BFMM identifies clinically meaningful sepsis and ALI subtype clusters, addresses censored biomarkers, and highlights key variables driving cluster assignment. These findings underscore the potential of BFMM as a powerful and interpretable tool for analyzing complex biomedical data, with significant implications for precision medicine and targeted interventions.
Circadian rhythms orchestrate fundamental physiological and behavioral processes, yet their regulation varies across species and tissues. While model organisms are widely used in circadian research, their molecular congruence with humans remains debated, as species-specific behaviors, such as nocturnality, may misalign rhythmic processes. Existing methods, such as JTK_CYCLE, detect rhythmicity and estimate phase but lack uncertainty quantification, limiting confidence in peak-time inference and rhythmicity congruence analysis. Moreover, threshold-based approaches often impose binary classifications, where different cutoffs yield inconsistent conclusions and obscure the continuous spectrum of rhythmic alignment. To overcome these limitations, we introduce a Bayesian inference framework for rhythmicity congruence analysis, providing a threshold-free approach to quantify rhythmic concordance while identifying systematic phase shifts at the species or tissue level. Additionally, we enable statistical detection of phase-shifting genes with uncertainty quantification, offering insights into key molecular drivers of circadian misalignment. To facilitate hypothesis generation, we provide visualization tools for pathway-level congruence using topological module detection and network-based analyses. We will apply this framework to human GTEx data, a large-scale baboon study with 25 tissues in common with GTEx, and multiple mouse studies. This scalable, inference-driven approach advances comparative circadian research by refining our understanding of conserved and species-specific regulation and informing translational applications of animal models.
MS Students
Unraveling Circadian Gene Patterns: A Multi-Tissue Study in Baboons and Humans
Integrating transcriptomic data and patient outcomes to identify clinically relevant breast cancer subtypes with semi-supervised clustering
Using machine learning to predict medication therapy problems among patients with chronic kidney disease
Abstracts
Circadian-related biomarkers and pathways play a crucial role in regulating biological rhythms in the human body. Humans are naturally active during the day and rest at night, and disruptions in circadian gene function have been associated with various disorders, including insomnia and Alzheimer's disease. Given their importance, extensive research on circadian genes is essential, both through postmortem human studies and through invasive animal experiments. While rodents, particularly mice, are commonly used in circadian research, their genetic differences from humans limit direct translational insights. In this thesis, we focus on primates, specifically baboons, to investigate circadian gene regulation. We reproduce findings from previous baboon studies and compare the performance of MetaCycle and cosinor analysis in detecting rhythmic gene expression. To integrate results across multiple (up to 40) tissues, we apply meta-analysis techniques, including Fisher's and Adaptive Weighted Fisher's (AW-Fisher) methods, to baboon data and extend these methods to human datasets for cross-species comparisons. Additionally, we employ Gaussian mixture models (GMMs) to cluster genes based on their phase distributions, aiming to uncover clustered patterns across tissues in core clock gene regulation. For future research, we plan to incorporate Bayesian methods and circular statistics to refine phase-shift analysis. Specifically, we propose using random-effects models to identify potential factors influencing phase variation in both baboons and humans. If successful, our research will provide a valuable pan-tissue reference for studying circadian genes and related disorders in primates, offering a more biologically relevant model than rodents.
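For reference, Fisher's method combines per-tissue p-values \(p_1, \ldots, p_T\) into a single statistic that follows a chi-squared distribution under the joint null:

\[
X_{\mathrm{Fisher}} = -2 \sum_{t=1}^{T} \log p_t \;\sim\; \chi^{2}_{2T} \quad \text{under } H_0.
\]

AW-Fisher instead searches over binary weights \(w \in \{0,1\}^{T}\), choosing the weight vector whose weighted statistic \(-2\sum_t w_t \log p_t\) yields the smallest p-value (with final significance calibrated to account for the adaptive search), which also indicates which tissues contribute to the signal.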
Within several types of breast cancer, there exists large heterogeneity in survival outcomes despite patients exhibiting similarities under more traditional disease subtypes. Further disease subtyping, leveraging patients' molecular profiles to cluster samples by significant gene markers, has been proposed to better understand this trend and identify potential treatment targets. However, many such methods are unsupervised, yielding disease subtypes that show little to no clinical relevance. Here, I explore and apply a semi-supervised clustering method that incorporates information on survival outcomes, patient characteristics, and transcriptomic data from three cohorts. First, genes are pre-selected based on whether they exhibit a significant association with overall survival. The expression levels of the pre-selected genes are then used to cluster patients using the sparse K-means method. Differences in survival time between clusters are evaluated to assess how effectively the clustering separates patients. This method was applied to identify potential subtypes within two types of breast cancer with particularly variable survival across patients, invasive lobular carcinoma (ILC) and triple-negative breast cancer (TNBC). Clinical information and gene expression data from microarray/RNA-seq from three large, publicly available breast cancer cohorts were used for this analysis. Results show that semi-supervised clustering does indeed provide subtypes with more significantly different survival outcomes than unsupervised clustering, though heterogeneity in the significant gene sets identified in each of the three cohorts warrants further investigation.
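The sketch below, using hypothetical column names and plain K-means as a simplified stand-in for sparse K-means, illustrates the two-stage idea on simulated data: univariate Cox screening of genes against overall survival, clustering on the retained genes, and a log-rank comparison of the resulting clusters. It is an illustration of the general workflow, not the analysis code used in this work.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import multivariate_logrank_test
from sklearn.cluster import KMeans

def survival_guided_clusters(expr, time, event, alpha=0.05, n_clusters=3, seed=0):
    """Toy semi-supervised clustering: screen genes by univariate Cox association
    with overall survival, then cluster patients on the selected genes."""
    selected = []
    for gene in expr.columns:
        df = pd.DataFrame({"x": expr[gene], "time": time, "event": event})
        cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
        if cph.summary.loc["x", "p"] < alpha:
            selected.append(gene)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(
        expr[selected].values)
    # Log-rank test for survival differences across the resulting clusters.
    test = multivariate_logrank_test(time, labels, event)
    return selected, labels, test.p_value

# Simulated example: 200 patients, 50 genes, two genes linked to survival.
rng = np.random.default_rng(2)
expr = pd.DataFrame(rng.normal(size=(200, 50)),
                    columns=[f"gene{i}" for i in range(50)])
risk = expr["gene0"] + expr["gene1"]
time = rng.exponential(scale=np.exp(-risk))
event = (rng.uniform(size=200) < 0.8).astype(int)
genes, labels, p = survival_guided_clusters(expr, time, pd.Series(event))
print(len(genes), p)
```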
Introduction: Patients with chronic kidney disease (CKD) are at risk of medication therapy problems (MTPs) due to high comorbidity and medication burden. Using data from the Kidney Coordinated HeAlth Management Partnership (Kidney CHAMP) trial, we used machine learning to build a predictive model identifying patients with CKD at high risk of MTPs in the primary care setting. Methods: We used baseline data from patients enrolled in the intervention arm of the Kidney CHAMP trial, completed May 2019 to July 2022, which tested a population health management strategy, including medication management, for improving CKD care. The dataset was divided into 80% training and 20% testing subsets. The area under the ROC curve (AUROC) was used to assess classification accuracy in distinguishing between patients with and without MTPs. Eight candidate models were considered, and the top three performing models (Random Forest, Support Vector Machines, and Gradient Boosting), based on cross-validated AUROC on the training data, underwent further refinement. The model with the highest AUROC in the testing set, while considering the bias/variance trade-off, was selected as the best-performing model. SHapley Additive exPlanations (SHAP) was then used with the best-performing model to evaluate the contribution of each predictor to the final risk score. Results: Among 730 patients who received a medication review at baseline, 566 (77.5%) had at least one MTP. Key demographics were mean age 74 years, 55% female, 92% White, and 64% with diabetes; the mean number of medications at baseline was 5.8. The Random Forest model had the best performance on the testing set, with AUROC 0.72, sensitivity 0.80, and specificity 0.64. The five most influential variables, ranked in descending order of importance for predicting MTPs, were diabetes status (yes/no), hemoglobin A1C (HbA1C), urine albumin-to-creatinine ratio (UACR), systolic blood pressure, and age. Conclusion: In outpatient primary care, a machine learning-based MTP risk calculator that uses routinely available clinical data can identify patients with moderate- to high-risk CKD who are at high risk of developing MTPs.
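A generic scikit-learn workflow mirroring the model-selection strategy described above (synthetic data, default hyperparameters, and the SHAP step omitted) might look like this sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the baseline data: 730 patients, ~77.5% with an MTP.
X, y = make_classification(n_samples=730, n_features=20,
                           weights=[0.225, 0.775], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "svm": SVC(probability=True, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    # Cross-validated AUROC on the training split, then held-out test AUROC.
    cv_auc = cross_val_score(model, X_tr, y_tr, cv=5, scoring="roc_auc").mean()
    model.fit(X_tr, y_tr)
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: CV AUROC={cv_auc:.2f}, test AUROC={test_auc:.2f}")
# SHAP values for the chosen model could then be computed with the shap package.
```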
Keynote
Women and Diversity in Statistics and Data Science: What Have We Done, and What Can We Do?
Abstract
It has been 55 years since statistician Elizabeth L. Scott formulated three important research questions about women in academe: Why are there so few women on the faculty, why are so few working toward and obtaining their PhDs, and why are women paid less than men with equivalent experience and productivity? Scott was a pioneer in this work, and she researched and spoke on these topics over the last 20 years of her career. My own efforts began about 15 years ago. In this talk, I take stock of some of the things that have been accomplished over these recent years for women and diversity in statistics and data science. I also offer some ideas for what we can do to recruit and retain a diverse faculty and student body.
Student Presentations - Oral
Predicting Pediatric Asthma Severe Outcomes via Machine Learning Methods using EHR Data with Repeated Clinic Visits
Jiaqian Liu
Abstract
Asthma is the most common multifactorial chronic disease among children. Identifying children at high risk of severe asthma outcomes, such as emergency department (ED) visits and hospitalizations, is essential in clinical practice. Studies have utilized machine learning methods to enhance the prediction of pediatric asthma occurrence or progression using electronic health record (EHR) data. However, these studies often neglected the correlated nature of EHR data (e.g., repeated clinic visits by the same patients). To address this issue, this research applied and evaluated mixed-effects machine learning methods, including random forests with random effects and neural networks with random effects. These methods account for the correlations by incorporating random effects into the development of prediction models for both binary and continuous outcomes. They were applied to real-world asthma EHR data from the Children's Hospital of Pittsburgh, focusing on predicting ED visits due to severe asthma exacerbations and length of stay (LOS) when hospitalized. Moreover, we characterized the importance of predictors using the kernel SHAP metric and identified vulnerable patient groups at high risk of more frequent ED visits or longer LOS.
COMPOSITE: Compound Poisson Model-Based Single-Cell Multiplet Detection Method
Haoran Hu
Abstract
In this presentation, I will discuss COMPOSITE (COMpound POiSson multIplet deTEction model), a novel statistical model targeting the issue of multiplets in droplet-based single-cell sequencing. Multiplets, resulting from the encapsulation of multiple cells in one droplet, lead to inaccurate cell type annotations and obscure biological insights, particularly in single-cell multiomics. By modeling the recorded expression levels, often elevated in multiplets due to stable features like housekeeping genes, through a compound Poisson distribution, COMPOSITE adeptly infers the multiplet status. I will showcase the significant effectiveness of COMPOSITE in enhancing the accuracy of single-cell data analysis.
Stochastic Volatility with Informative Missingness
Gehui Zhang
Abstract
Stochastic volatility models, treating time series variance as a stochastic process, have proven to be an important tool for analyzing dynamic variability. Current methods for stochastic volatility models are limited by the assumption of missing at random. With advancements in technology facilitating dynamic self-response data collection for which missing data are inherently informative, this limitation in statistical methodology hampers scientific advancement. This article aims to pioneer statistical methodology for stochastic volatility with data that are missing not at random. It introduces a novel imputation method based on Tukey’s representation, utilizing the Markovian nature of stochastic volatility models to overcome unidentifiability often faced when modeling informative missingness. This imputation method is combined with a new conditional particle filtering with ancestral sampling procedure that accounts for variability in imputation to formulate a particle Gibbs sampling scheme. The performance of the method is illustrated through simulations and analyzing mobile phone self-reported mood from an individual being monitored after an unsuccessful suicide attempt.
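For readers unfamiliar with the model class, a canonical discrete-time stochastic volatility specification (the general form; the exact model used in this work may differ) is:

\[
y_t = \exp(h_t / 2)\,\varepsilon_t, \qquad h_t = \mu + \phi\,(h_{t-1} - \mu) + \sigma\,\eta_t, \qquad \varepsilon_t, \eta_t \overset{\text{iid}}{\sim} N(0, 1),
\]

where \(y_t\) is the observed series (for example, a centered mood rating) and the latent log-variance \(h_t\) follows an autoregressive process; the Markovian structure of \(h_t\) is what the imputation method exploits.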
2023 Presentations and Abstracts
Posters
Wenjia Wang - Accurate and Ultra-efficient p-value Calculation for Higher Criticism Tests
In modern data analysis for detecting rare and weak signals, the higher criticism (HC) test and its variations have been effective group-testing methods, but computational accuracy and speed have long been an issue when the number of p-values combined (K) and/or the number of repeated tests (N) is large. To this end, we propose refined computing strategies for higher criticism and four of its variations. Specifically, we first propose a cross-entropy-based importance sampling (ISCE) method to benchmark all existing methods and develop a modified SetTest (MST) analytic approach. We then propose an ultra-fast interpolation (UFI) computation method, independent of K, based on pre-calculated statistical tables. Finally, by integrating the methods above, we construct a computation strategy and an R package, "HCp", for accurate and ultra-fast p-value computation for virtually any K and small p-values in HC tests. Extensive simulations are implemented to benchmark the accuracy and speed of the proposed methods. By applying the approach to a COVID-19 disease surveillance example for spatio-temporal patient cluster detection, we confirm the viability of the proposed method for such large-scale inferences.
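For reference, the higher criticism statistic for ordered p-values \(p_{(1)} \le \cdots \le p_{(K)}\) takes the familiar form

\[
\mathrm{HC}_K = \max_{1 \le i \le \alpha_0 K} \sqrt{K}\,\frac{i/K - p_{(i)}}{\sqrt{p_{(i)}\,(1 - p_{(i)})}},
\]

with \(\alpha_0\) (for example, 1/2) restricting the range of the maximization; the computational difficulty addressed above lies in evaluating extreme tail probabilities of this maximum for large \(K\) and very small target p-values.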
Michelle Sun - The epigenetic determinant of time-restricted feeding and health span is linked to the 12h ultradian oscillator
A distinct 12-hour clock exists in addition to the 24-hour circadian clock to coordinate metabolic and stress rhythms. These 12h rhythms are established by an XBP1s-dependent 12h oscillator. We have evidence suggesting that RBBP5 is the common epigenetic regulator of the 12h oscillator and diverse stress responses. It is known that knockdown of RBBP5 blunts transcriptional responses to stress and that RBBP5 expression is downregulated in humans with fatty liver. Yet, here we show that RBBP5 heterozygous knockout mice are paradoxically healthier. Heterozygous mice exhibit improved glucose tolerance and insulin sensitivity. They are leaner with greater energy expenditure, despite no significant differences in activity and food intake compared to WT mice. Heterozygous mice also demonstrate self-imposed time-restricted feeding, with food intake reduced during the rest phase and increased during the activity phase. These findings suggest that RBBP5 has a hormetic and systemic effect on metabolism and energy intake.
Benjamin Melchior Kacso Panny - Generalized linear mixed-effects models of two-step task performance in people with depression in an RCT involving ketamine infusion and automated self-association training
Background: GLMMs can be used to study human decision-making in a reward learning task (the two-step task) by separating “Model-Free” (MF) and “Model-Based” (MB) influences on task performance. The impact of certain interventions on such influences in the context of depression is unknown. Methods: Task data were collected in a sample of depressed patients (n = 104) in an RCT involving a single ketamine infusion and automated self-association training. Task data were collected at 2 to 5 visits for each participant. GLMMs were fit to the data using REML to determine MF and MB influences on behavior in the sample, according to intervention arm. Results: Results reveal evidence of model-free and model-based behavior patterns in our sample, while a single ketamine infusion does not appear to affect either model-free or model-based influences on behavior. Similarly, further analyses will be performed to test for the effect of automated self-association training alone and in combination with a ketamine infusion on these MF and MB influences. Further analyses will also test whether changes in depression symptom severity over time affect MF and MB influences on task performance. Conclusion: Model-free and model-based behavior patterns are present in a sample of people with depression. Further work is needed to determine the relationship between these behavior patterns and illness outcomes, and how changes in illness severity relate to these behavior patterns.
Runjia Li - Estimation of Conditional Average Treatment Effects for Time-to-event Outcomes with Competing Risks
Numerous statistical theories and methods have been proposed for estimating causal treatment effects in observational studies. The majority of approaches can be categorized into outcome-based modeling, treatment-based modeling, and modeling of both outcome and treatment with a doubly robust feature. Currently, most doubly robust methods do not address treatment-effect heterogeneity, specifically the estimation of personalized treatment effects for time-to-event outcomes with competing risks. We developed a framework for estimating conditional average treatment effects defined as the risk difference of cumulative incidence functions given a subject's characteristics. Our method integrates targeted maximum likelihood estimation with various algorithms for outcome modeling and propensity score modeling. In extensive simulation studies, our method outperformed others, even in scenarios where the outcome model was mis-specified. Application of our method is illustrated in a study of treatment effects for sepsis patients admitted to intensive care units.
Manqi Cai - MODE: Multi-Omics Cellular Deconvolution with Single-Cell RNA-Seq and DNAm References
Cellular deconvolution has been widely used to infer the cellular composition of bulk gene expression data, largely facilitated by the popularity of single-cell RNA-seq. Recently, whole-genome sequencing of single-cell DNA methylation (scDNAm) has been emerging and creates new possibilities for deconvolving bulk DNAm data, especially in solid tissues that lack cell references. Multiple studies have found that multi-omics data share similar cell-type markers. When multi-omics data are collected from the same tissue samples, a joint deconvolution will provide more accurate estimates of the unified underlying cellular fractions than deconvolving each omics data type separately. To achieve this goal, we develop MODE (Multi-Omics DEconvolution), a novel deconvolution framework that utilizes both single-cell DNAm and RNA-seq references to accurately estimate cellular fractions from bulk omics data. With ultrahigh-dimensional and sparse scDNAm data, MODE considers the spatial dependence of nearby DNAm sites and builds a precise signature matrix. Real-data benchmarking shows that MODE improves cellular fraction estimates by jointly deconvolving multi-omics data collected from the same samples.
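For orientation, reference-based deconvolution in its simplest form solves a non-negative least-squares problem per bulk sample, as in the toy sketch below; MODE builds on this general idea with joint scRNA-seq and scDNAm references and spatial dependence among DNAm sites, none of which is shown here.

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(bulk, signature):
    """Estimate cell-type fractions by non-negative least squares:
    bulk ≈ signature @ fractions, fractions >= 0, then renormalize to sum to 1.
    (A generic reference-based baseline, not the MODE model.)"""
    coef, _ = nnls(signature, bulk)
    return coef / coef.sum()

# Toy example: 300 marker features, 4 cell types.
rng = np.random.default_rng(0)
signature = rng.gamma(shape=2.0, size=(300, 4))        # feature-by-cell-type reference
true_frac = np.array([0.5, 0.3, 0.15, 0.05])
bulk = signature @ true_frac + rng.normal(scale=0.05, size=300)
print(np.round(deconvolve(bulk, signature), 3))        # close to true_frac
```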
Jinhong Li - Fusion Learning for Causal Inference Using Distributed Data
When estimating the conditional average treatment effect (CATE) in data-driven precision medicine, data integration is often undertaken to achieve a larger sample size and better statistical efficiency. Since data from different sources may be inherently heterogeneous, challenges arise in how to optimally and securely combine information while accounting for between-study heterogeneity. Meanwhile, privacy concerns may prevent the sharing of individual-participant data. In this paper, we propose a two-step doubly robust algorithm: we first estimate the marginal effects and propensity scores in each distributed dataset to formulate an objective function for estimating the CATE; we then aggregate study-level model estimates to optimize the objective function. Specifically, in the second step, we leverage the fused lasso to capture heterogeneity among data sources and confidence distributions to preserve data privacy and avoid the sharing of individual-participant data. The performance of this approach is evaluated in simulation studies and a real-world study of the causal effect of oxygen saturation targets on survival rates for multi-hospital ICU patients.
Penghui Huang - Accurate estimation of rare cell type fractions from tissue omics data via hierarchical deconvolution
Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions to both deconfound differential expression analyses and infer cell type-specific differential expression. While experimentally counting cells is not feasible in most tissues and studies, in-silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulties estimating highly correlated or rare cell subtypes. To address this challenge, we propose Hierarchical Deconvolution (HiDecon) that uses single-cell RNA sequencing references and a hierarchical cell type tree, which models the similarities among cell types and cell differentiation relationships, to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed upwards and downwards along the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real data applications with ground truth of measured fractions, we demonstrate that HiDecon significantly outperforms existing methods and accurately estimates cellular fractions.
Jiaqian Liu - Prediction of Asthma Exacerbations in Children Using Mixed Effect Random Forests on EHR data
Asthma is the most common multifactorial chronic disease among children. Identifying children at high risk of severe asthma outcomes is essential in clinical practice. Studies have utilized machine learning methods to enhance the prediction of asthma occurrence or progression using electronic health record (EHR) data. However, these studies often neglected the clustered and correlated nature of EHR data (e.g., multiple visits by the same individual). To address this issue, this study applied and evaluated the Mixed Effect Random Forests method, which takes the clustered structure into account by incorporating random effects when constructing prediction models for binary outcomes. We applied the method to a real-world asthma EHR dataset from the Children's Hospital of Pittsburgh for predicting severe asthma exacerbations.
Xue Yang - Statistical Inference for Response-Adaptive Randomization Designs: A Comparative Study
Response-adaptive randomization (RAR) aims to improve patient benefit by skewing allocation toward more efficacious treatments while maintaining the validity of treatment comparisons. While numerous RAR designs have been developed in the literature, the sample average treatment effect (ATE) remains the basis of inference for the causal treatment effect. In this article, we propose alternative estimators of the causal treatment effect for RAR designs based on inverse probability weighting (IPW) and compare them to the sample ATE analytically in terms of bias, variance, type I error, and statistical power. We conducted extensive simulation studies to assess the operating characteristics of these estimators. Results show that, when implemented and analyzed correctly, RAR can treat a substantial proportion of patients with the better treatment while minimally sacrificing power compared to a standard allocation design. The analytical and simulation results also indicate that the IPW-based estimators are consistent, have smaller bias, and are more efficient than the sample ATE. Moreover, the methods based on the log relative risk and log odds ratio control the type I error better than the sample ATE.
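For a binary treatment, the IPW-type estimator referred to above has the familiar generic form (the estimators studied in this work may modify it to account for the adaptive allocation):

\[
\hat{\tau}_{\mathrm{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{A_i Y_i}{\hat{\pi}_i} - \frac{(1 - A_i)\,Y_i}{1 - \hat{\pi}_i} \right],
\]

where \(A_i\) is the treatment indicator, \(Y_i\) the outcome, and \(\hat{\pi}_i\) the allocation probability in force when patient \(i\) was randomized, which changes over the course of an RAR trial.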
Molin Yue - Multi-Omics Cell Type Deconvolution with Multi-modal Learning
Downstream analysis of omics data at the bulk level is usually confounded by cellular heterogeneity. Various algorithms have been developed to deconvolute bulk samples with a single data modality, either gene expression or DNA methylation (DNAm), but rarely with multi-omics data. Motivated by the fact that cell-type populations are biologically shared by both gene expression and DNAm measured within the same sample, we developed a supervised, reference-free deep learning algorithm using intermediate fusion that can robustly deconvolute bulk samples using one or multiple omics modalities. We tested this algorithm on a large-scale white blood cell dataset and found that it achieved high Spearman correlations of 0.86 (neutrophils), 0.84 (lymphocytes), 0.83 (monocytes), 0.94 (eosinophils), and 0.70 (basophils) on the testing dataset, compared to other popular deconvolution algorithms whose correlations were no greater than 0.8. Even with one missing modality (e.g., DNAm) in the testing data, the algorithm still achieved correlations of 0.81, 0.78, 0.73, 0.94, and 0.28, demonstrating the improved accuracy of our models when utilizing both modalities and robustness to a missing modality.
Hung-Ching Chang - PECMAN: Partial Sum statistics for High-Dimensional Causal Mediation Analysis with Multi-omics Applications
Causal mediation analysis of high-dimensional data is challenging due to the curse of dimensionality. To reduce noise, many methods apply filtering strategies to the high-dimensional mediator, such as penalized regression and screening tests. However, most of these methods do not give consideration to the cross-world assumption, which may result in biased estimation. Recent studies have developed a series of orthogonal transformation methods to satisfy this assumption, but the orthogonal transformation can result in poor interpretability. In this study, we develop PECMAN, an interpretable high-dimensional causal mediation analysis method. We highlight two types of sparsity, in the exposure-mediator relationship and in the mediator-outcome relationship, and further show that filtering strategies on the mediator-outcome relationship satisfy the cross-world assumption under a proper sparsity assumption. Extensive simulation experiments indicate that our method has greater statistical power for detecting mediation effects at various sparsity levels. Finally, we demonstrate our method's advantages through an application to the COPDGene dataset, where we estimate the mediation effect of each patch and find that the inferior lobes contribute more mediation effect than the superior lobes.
Gehui Zhang - Bayesian inference on Stochastic Volatility Models with Informative Missingness
Mobile devices that collect activity and emotional data in one's natural environment are becoming an increasingly popular and powerful tool for clinicians and researchers. Volatility, a measure of dispersion or variation, is an interpretable and predictive metric for psychosocial and behavioral data. Existing methods for estimating stochastic volatility, which were formulated primarily in the context of financial data, assume missingness at random (MAR), which is not consistent with the observational nature of data from mobile devices. The main focus of this talk is estimating stochastic volatility while accounting for informative missingness. We developed an imputation method based on Tukey's representation (or the simplified selection model) under the stochastic volatility model with a missing-not-at-random (MNAR) mechanism. We incorporated the novel imputation approach into a Particle Gibbs with Ancestor Sampling (PGAS) method to provide an efficient approach for conducting inference on stochastic volatility with informative missingness. The performance of the method is illustrated through simulation studies and in the analysis of multi-modal ecological momentary assessment (EMA) data from a study of suicidal ideation and behavior in young adults with a history of suicidal thoughts and behaviors.
Xiangning Xue - DiffCircaPipeline: A framework for multifaceted characterization of differential rhythmicity
Circadian oscillations of gene expression regulate daily physiological processes, and their disruption is linked to many diseases. Circadian rhythms can be disrupted in a variety of ways, including differential phase, amplitude, and rhythm fitness. Although many differential circadian biomarker detection methods have been proposed, a workflow for systematic detection of multifaceted differential circadian characteristics with accurate false positive control has not been available. We propose a comprehensive and interactive pipeline to capture the multifaceted characteristics of differentially rhythmic biomarkers. Analysis outputs are accompanied by informative visualization and interactive exploration. The workflow is demonstrated in multiple case studies and is extensible to general omics applications.
Lang Zeng - Dynamic Prediction using Time-Dependent Cox Survival Neural Network
The target of dynamic prediction is to provide individualized risk predictions over time which can be updated as new data become available. Motivated by establishing a dynamic prediction model for the progressive eye disease, age-related macular degeneration (AMD), we proposed a time-dependent Cox model-based survival neural network to predict its progression on a continuous time scale using longitudinal fundus images. We evaluate and compare our proposed method with joint modeling and landmarking approaches through comprehensive simulations using two time-dependent accuracy metrics, the Brier Score and dynamic AUC. We applied the proposed approach to a large AMD study, the Age-Related Eye Disease Study (AREDS), in which more than 50,000 fundus images were captured over a period of 12 years for more than 4000 participants. We built a dynamic prediction model to predict the AMD progression risk over time using longitudinal images together with demographic information. Our approach achieves satisfactory prediction performance in both simulation studies and real data analysis.
Xueping Zhou - Genome Hierarchy Grouping Structure and Correlation Guided Feature Selection for Multivariate Outcome Prediction
Developing efficient feature selection and accurate prediction algorithms for multivariate phenotypes is a major and often difficult task in analyzing omics data. Many machine-learning methods have been proposed to perform such tasks for a univariate outcome, including top-performing penalized regression-based approaches. In this paper, we propose a novel supervised learning algorithm to perform feature selection and multivariate outcome prediction for data with potentially high-dimensional predictors and responses. The method incorporates known genome hierarchy grouping and correlation structures into feature selection, regression coefficient estimation, and outcome prediction under a penalized multivariate multiple linear regression model. Extensive simulations show its superior performance over competing methods. We apply the proposed method to an in-house multi-omics asthma dataset in two applications. In the first, a cell type deconvolution study, the proposed method achieves better cell-type fraction prediction using bulk gene expression data. In the second, an association study between multivariate gene expression and high-dimensional DNA methylation data, the proposed method reveals novel association signals, providing more biological insight into how CpG sites regulate gene expression.
Jian Zou - Transcriptomic Congruence and Selection of Representative Cancer Models Towards Precision Medicine
Cancer models are instrumental as substitutes for human studies and for expediting basic, translational, and clinical cancer research. For a given cancer subtype, a wide selection of models, such as cell lines, patient-derived xenografts, tumoroids, and genetically modified murine models, is often available to researchers. However, how to quantify their congruence to human tumors and select the most appropriate cancer model is a largely unsolved issue. Here, we develop Congruence Analysis and Selection of CAncer Models (CASCAM), a statistical and machine learning framework for authenticating and selecting the most representative cancer models in pathway-specific and drug-relevant contexts using transcriptomic data. CASCAM offers harmonization between tumor and cancer model omics data, interpretable machine learning for congruence quantification, mechanistic investigation, and pathway-based topological visualization to determine the final cancer model selection. The workflow is presented using the breast cancer invasive lobular carcinoma (ILC) subtype, while the method is generalizable to any cancer subtype for precision medicine development.
Talks
Na Bo - Estimating heterogeneous survival treatment effects under the counterfactual framework
Estimating heterogeneous treatment effects plays a central role in personalized medicine, as it provides critical information for tailoring existing therapies so that each patient receives the optimal treatment. Recently, meta-learning approaches have received a lot of attention for estimating the conditional average treatment effect (CATE) using multi-step algorithms coupled with flexible machine learning methods. In this project, we provide a meta-learning framework to estimate the CATE for survival outcomes. We consider several pseudo-CATE regression approaches along with popular machine learning methods such as random survival forests, Cox-Lasso, and survival neural networks. We discuss the advantages and challenges of applying these methods to survival outcomes through comprehensive simulations and provide guidelines for applying them in different situations. Finally, we demonstrate the methods by analyzing a large randomized clinical trial, the AREDS study of the eye disease age-related macular degeneration, to estimate the CATE and make individualized treatment recommendations.
Jiahe Li - An approximated Expectation-Maximization (EM) Algorithm for integrative analysis of datasets with nonresponse
Missing data are pervasive in clinical trials and public health studies. Standard statistical methods, such as likelihood-based methods and estimating equation-based methods, often require unverifiable assumptions and modelling of the missing-data mechanism. Misspecification of the missing-data model often leads to biased estimates and wrong conclusions. For integrative analysis of data from multiple studies, the missing-data issue is exacerbated because the missing-data process varies from study to study, and modelling study-specific missing-data mechanisms under a unified framework is prohibitive. Here we propose an approximated expectation-maximization (AEM) algorithm for the integrative regression analysis of data with nonresponse, where the datasets are assumed to follow the same regression model and to be independent of each other. Each dataset may suffer from an arbitrary missing-data mechanism. With a consistent initial estimator from either a prior study or a complete dataset, we devise an AEM algorithm that approximates the sufficient statistics from empirical conditional regression estimates. Without modelling the missing-data mechanisms, the AEM algorithm yields a more efficient estimator. Simulation studies will be used to illustrate the efficiency gain from the initial estimator under various settings.
Zhiyu Sui - Robust Transfer Learning of Individualized Treatment Rules
Causality-based individualized treatment rules (ITRs) are a stepping stone to precision medicine. To ensure unconfoundedness, ITRs are ideally derived from randomized experimental data, but the use cases of ITRs in real-world data extend far beyond these controlled settings. It is of great interest to transfer knowledge learned from experimental data to real-world data, but hurdles remain. In this paper, we address two challenges in the transfer learning of ITRs. 1) In well-designed experiments, granular information crucial to decision making can be thoroughly collected; however, part of this information may not be accessible in real-world decision-making. 2) Experimental data with strict inclusion criteria reflect a population distribution that may be very different from the real-world population, leading to suboptimal ITRs. We propose a unified weighting scheme to learn a calibrated and robust ITR that simultaneously addresses the issues of covariate shift and missing covariates during prospective deployment, with a quantile-based approach to ensure worst-case safety under the uncertainty due to unavailable covariates. The performance of this method is evaluated in simulations and real-data applications.