Biostatistics guest speaker Lu Tang, University of Michigan, will present "Fusion Learning in Integrative Data Analysis."
Pooling data sets from multiple studies is often undertaken in practice to achieve larger sample sizes and greater statistical power. A major analytic challenge in data integration is heterogeneity across studies in the underlying study population, study design, data collection, or data availability. Ignoring such heterogeneity in integrative data analysis may result in biased estimation and misleading inference. In this talk, I will present new machine learning methodologies to address this challenge.
1) The first part of the talk is motivated by the multiple ELEMENT cohorts of Mexican adolescents, which are used to study the association between metabolomics outcomes and in utero environmental exposure to toxic chemicals (e.g., BPA and phthalates). I will introduce new data integration analytics based on the fused LASSO that allow learning the similarities and differences of covariate effects across cohorts in the setting of generalized linear models.
2) The second part of the talk is motivated by a prospective longitudinal cohort study examining risk predictors of suicidal ideation in US medical interns. To improve the statistical power of pattern mixture models for non-ignorable missing data, I will introduce a fusion learning method that identifies and merges similar missing data patterns under the framework of generalized estimating equations (GEE) in population-averaged models for longitudinal data.
I will also discuss ongoing and future projects, including scalability issues that arise with the recent move toward distributed computing.