Biostatistics guest speaker Lucas Mentch (Statistics, University of Pittsburgh) will present "Inference and Variable Selection with Random Forests."
Abstract: Despite the success of tree-based supervised learning algorithms (bagging, boosting, random forests), these methods are often seen as prediction-only tools whereby the interpretability and intuition of traditional statistical models are sacrificed for predictive accuracy. We present an overview of recent work that suggests this black-box perspective need not be the case. We consider a general resampling scheme in which predictions are averaged across base-learners built with subsamples and demonstrate that the resulting estimator belongs to an extended class of U-statistics. As such, a central limit theorem is developed allowing for confidence intervals to accompany predictions, as well as formal hypothesis tests for variable significance and model additivity. The test statistics proposed can also be extended to produce consistent measures of variable importance. In particular, we propose to extend the typical randomized node-wise feature availability to tree-wise feature availability, allowing for hold-out variable importance measures that, unlike out-of-bag measures, are robust to correlation structures between predictors. Demonstrations will be provided on eBird citizen science data.
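For readers unfamiliar with the resampling scheme mentioned in the abstract, the following is a minimal sketch of the general idea only: fit a base learner on each of m subsamples of size k drawn without replacement, then average the predictions at a query point. This average has the form of an incomplete U-statistic. The sketch is not the speaker's implementation. The function names `fit_stump` and `subsampled_ensemble_predict`, the parameters `k`, `m`, and `seed`, and the use of a depth-1 regression tree as the base learner are all assumptions made for illustration, and the variance estimation needed for confidence intervals is not shown.

```python
import random
from statistics import fmean


def fit_stump(X, y):
    """Fit a depth-1 regression tree (illustrative stand-in for a full tree)."""
    best = None  # (sse, feature index, threshold, left mean, right mean)
    for j in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest,
        # so both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            lm, rm = fmean(left), fmean(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split exists: predict the overall mean
        mean_y = fmean(y)
        return lambda x: mean_y
    _, j, t, lm, rm = best
    return lambda x: lm if x[j] <= t else rm


def subsampled_ensemble_predict(X, y, x_query, k, m, seed=0):
    """Average m base-learner predictions, each fit on k rows sampled without replacement."""
    rng = random.Random(seed)
    preds = []
    for _ in range(m):
        idx = rng.sample(range(len(X)), k)  # subsample, not a bootstrap
        learner = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        preds.append(learner(x_query))
    return fmean(preds), preds


# Hypothetical toy data: the response depends only on the first feature.
n = 100
X = [[i / n, (i * 7 % n) / n] for i in range(n)]
y = [1.0 if row[0] > 0.5 else 0.0 for row in X]

estimate, per_tree = subsampled_ensemble_predict(X, y, [0.9, 0.5], k=20, m=50)
print(estimate)
```

Sampling without replacement (rather than bootstrapping) is what ties the averaged prediction to U-statistic theory; the individual predictions in `per_tree` are the raw material a variance estimate would be built from.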