Scheduled Colloquia

Fall 2023

All colloquia are at 2 pm, unless otherwise noted.
Titles and abstracts will be added as they become available.

September 8: Dr. Jesús D. Arroyo Relión | Texas A&M University

Title: Joint spectral clustering in multilayer networks

Abstract: Modern network datasets are often composed of multiple layers, such as different views, time-varying observations or independent sample units. These data require flexible and tractable models and methods capable of aggregating information across the networks. To that end, this talk considers the community detection problem under the multilayer degree-corrected stochastic blockmodel. We propose a spectral clustering algorithm and demonstrate that its misclustering error rate improves exponentially with multiple network realizations, even in the presence of significant layer heterogeneity. The methodology is illustrated in a case study of US airport data, where we identify meaningful community structure and trends influenced by pandemic impacts on travel. This is joint work with Joshua Agterberg and Zachary Lubberts.
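The aggregation idea in this abstract can be made concrete with a minimal sketch (my own illustration, not the speakers' algorithm): simulate several layers from a two-block stochastic blockmodel, average the adjacency matrices, and cluster by the sign of the second eigenvector. All parameter values here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 60, 8                       # nodes, layers
z = np.repeat([0, 1], n // 2)      # true community labels
P = np.array([[0.30, 0.05],
              [0.05, 0.30]])       # block connection probabilities

# Simulate L independent SBM layers and average their adjacency matrices.
A_bar = np.zeros((n, n))
for _ in range(L):
    probs = P[z][:, z]
    U = rng.random((n, n))
    A = (np.triu(U, 1) < np.triu(probs, 1)).astype(float)
    A = A + A.T                    # symmetric, no self-loops
    A_bar += A / L

# Spectral step: two communities from the sign of the second eigenvector.
vals, vecs = np.linalg.eigh(A_bar)
labels = (vecs[:, -2] > 0).astype(int)

# Agreement with the truth, up to label switching.
acc = max(np.mean(labels == z), np.mean(labels != z))
print(f"clustering agreement: {acc:.2f}")
```

Averaging the layers is what drives the exponential improvement the abstract refers to: noise cancels across independent realizations while the block structure reinforces.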

September 22: Dr. Todd Ogden | Columbia University

Location: Cabell Hall 058

Title: Functional data analysis of a compartment modeling framework with applications in dynamic PET imaging

Abstract: Compartment modeling describes the movement of substances or individuals among different states and has applications in epidemiology, pharmacokinetics, ecology, and many other areas.  Fitting such a model to data typically involves solving a system of linear differential equations and estimating the parameters upon which the functions depend.  In order for this approach to be valid, it is necessary that a number of fairly strong assumptions hold, assumptions involving various aspects of the kinetic behavior under investigation.  In many situations, such models are understood to be simplifications of the "true" kinetic process.  While in some circumstances such a simplified model may be a useful (and close) approximation to the truth, in other cases, important aspects of the kinetic behavior cannot be represented.  We present a nonparametric approach, based on principles of functional data analysis, to modeling of pharmacokinetic data.  We illustrate its use through application to data from a dynamic PET imaging study of the human brain.

September 29: Dr. Mike Baiocchi | Stanford University

Location: Cabell Hall 058

Title: Anti-labor trafficking in the Amazon: some early wins from the Stanford Human Trafficking Data Lab 

Abstract: In close partnership with Brazilian federal prosecutors, our lab developed and deployed a satellite-based detection system to identify high-probability slave-labor sites. We will cover the challenges of developing a cheap and high-throughput -- but also reliable -- image detection algorithm for use in a high-stakes, real-world setting. We will discuss our recent field testing -- which resulted in successful raids of labor camps -- and the lessons learned from that testing. Finally, we will discuss two follow-up projects: (i) running a rigorous evaluation of the algorithm's impact on the detection and interdiction of labor trafficking, and (ii) in partnership with local social workers, developing and implementing a behavioral intervention to improve the support system for survivors.

October 6: Dr. Yangfeng Ji | University of Virginia

Location: Cabell Hall 058

Title: Large Language Models: Yet Another Example on the Benefits and Risks of Data-driven Modeling

Abstract: Large language models (LLMs) have drawn significant attention from the AI research community, both for building new models and for improving their performance. In addition, the popularity of LLM-based applications (e.g., ChatGPT and Bard) has motivated the exploration of new applications in various domains, such as education and medicine. However, recent work shows that the limitations of traditional data-driven modeling persist in LLMs, such as vulnerability to adversarial attacks and inconsistency under linguistic variations. To show the two sides of the same coin, this talk consists of two parts. The first part provides a high-level overview of large language models and recent research built upon LLMs; the second part demonstrates the potential risks caused by the limitations of LLMs. The talk concludes with a brief summary of future research challenges.

October 13: Dr. Peisong Han | University of Michigan

Location: Cabell Hall 058

November 3: Dr. Debamita Kundu | University of Virginia

Location: Cabell Hall 058

November 10: Dr. Jianxin Xie | University of Virginia

Location: Cabell Hall 058

November 16: Dr. Anru Zhang | Duke University

Location: Thornton Hall


Spring 2023

All colloquia are at 2 pm, unless otherwise noted.
Titles and abstracts will be added as they become available.

February 17: Dr. Hong Zhu | University of Virginia

Location: WNR 115

Title: Improving Methods for CER with Complex Observational Healthcare Data

Abstract: Large observational healthcare data, such as registry, claims, and electronic health record data, are primary research tools for comparative effectiveness research (CER). Compared to randomized clinical trials, CER studies are more representative of real-world clinical practice for the afflicted population. Nevertheless, CER using large, complex observational healthcare data presents unique methodological challenges related to semi-competing risks, confounding, and missing data. To address these challenges, Dr. Zhu has developed novel and robust analytical methods and algorithms and applied them to large observational healthcare data from different sources and with complex structures. This experience motivated a Patient-Centered Outcomes Research Institute (PCORI)-funded project on improving CER methodology. In this talk, Dr. Zhu will present her research on improving methods for the design and analysis of CER using large, complex observational healthcare data. She will also briefly discuss her work on improving methods for the design of pragmatic clinical trials.

February 24: Dr. Soutik Ghosal | University of Virginia

Location: WNR 115

Title: Importance of covariate adjustment in ROC analysis


Abstract: The receiver operating characteristic (ROC) curve is a handy graphical tool for assessing the diagnostic accuracy of biomarkers. The ROC curve further allows a summary of a biomarker's performance through the area under its curve (AUC), which serves as an overall assessment of performance and is one of the most useful diagnostic accuracy metrics. While this comprehensive evaluation is practical, the performance of a biomarker in diagnosing disease can be impacted by relevant covariate information. For example, the estimated fetal weight (EFW) is an ultrasound biomarker for predicting birthweight-related adverse outcomes such as large-for-gestational-age (LGA) or small-for-gestational-age (SGA). However, it would be implausible to assume a uniform performance of EFW across the entire maternal population. Significant covariates such as maternal BMI, maternal race, or various other maternal or neonatal risk factors could potentially impact the diagnostic accuracy of EFW. Adjusting for these significant covariates is of utmost importance, as the classification of any future data can be affected by them.

However, most well-known, conventional methods for estimating ROC curves do not take covariate information into account, and those that do assess the covariate impact only indirectly. In this talk, I will focus primarily on a particular framework for modeling ROC curves that uses the placement value (PV). The PV can be defined as the standardization of a diseased biomarker score with respect to the healthy biomarker distribution, and, interestingly, it can be shown that the CDF of the PV is the ROC curve. Several PV-based ROC methodologies have been proposed by exploiting this relationship, and this framework seamlessly takes covariate information into the model to assess its impact on diagnostic accuracy. Beyond covariates, the PV-based framework can incorporate constraints whenever necessary. In the talk, I will present some of my recent and ongoing work based on this framework and apply it to a few NICHD-conducted studies.
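The placement-value construction can be illustrated numerically. The following is a minimal sketch with simulated biomarker scores (not the speaker's covariate-adjusted model): the PV of each diseased score is its survival-function value under the healthy distribution, the ROC curve is the empirical CDF of those PVs, and the AUC follows as 1 minus the mean PV.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
healthy = rng.normal(0.0, 1.0, 2000)   # biomarker scores, healthy group
diseased = rng.normal(1.5, 1.0, 2000)  # biomarker scores, diseased group

# Empirical placement value: each diseased score standardized against
# the healthy distribution via its survival function.
pv = np.array([(healthy >= y).mean() for y in diseased])

# The ROC curve is the CDF of the placement values, ROC(t) = P(PV <= t),
# and the AUC equals P(diseased > healthy) = 1 - E[PV].
t = np.linspace(0, 1, 201)
roc = np.array([(pv <= u).mean() for u in t])
auc = 1.0 - pv.mean()

# Closed form for two normals: AUC = Phi((mu1 - mu0) / sqrt(s0^2 + s1^2)).
auc_true = 0.5 * (1 + erf((1.5 / sqrt(2.0)) / sqrt(2.0)))
print(f"empirical AUC {auc:.3f}, theoretical {auc_true:.3f}")
```

Covariate adjustment, as discussed in the talk, would replace the single healthy reference distribution with one that depends on covariates such as maternal BMI.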

March 17: Dr. Mohamad Kazem Shirani Faradonbeh | University of Georgia

Location: WNR 115

Title: What Beliefs Help Dynamic Data-Driven Decision-Making?

Abstract: The design of decision-making algorithms for dynamic environments is a fundamental problem in artificial intelligence. In many applications where the outcomes of different decisions are unknown, data-driven algorithms are required to learn those outcomes. We study this problem and propose the first set of reinforcement learning algorithms for uncertain environments that evolve as stochastic differential equations. First, we propose fast and effective algorithms for learning to control instabilities such as drug overdoses and infectious outbreaks. For these algorithms, which develop probabilistic beliefs about the unknown environment, we establish performance guarantees. Then, we proceed to the problem of learning to minimize a cost function, which captures many applications such as personalized insulin pumps for diabetic patients. We present a novel and easy-to-implement data-driven algorithm that sequentially updates its probabilistic beliefs. We then prove its efficiency and perform a regret analysis that fully characterizes the effect of uncertainty. In this way, we address the important exploration-exploitation dilemma: the algorithm successfully explores uncharted territories while simultaneously making good decisions.

March 24: Dr. Xinyi Li | Clemson University

Location: WNR 115

Title: Functional Individualized Treatment Regimes with Imaging Features

Abstract: Precision medicine seeks to discover an optimal personalized treatment plan and thereby provide informed and principled decision support, based on the characteristics of individual patients. With recent advancements in medical imaging, it is crucial to incorporate patient-specific imaging features in the study of individualized treatment regimes. We propose a novel, data-driven method to construct interpretable image features which can be incorporated, along with other features, to guide optimal treatment regimes. The proposed method treats imaging information as a realization of a stochastic process, and employs smoothing techniques in estimation. We show that the proposed estimators are consistent under mild conditions. The proposed method is applied to a dataset provided by the Alzheimer's Disease Neuroimaging Initiative.

March 31: Dr. John Stufken | George Mason University

Location: WNR 115

Title: Musings on Subdata Selection

Abstract: Data reduction or summarization methods for large datasets (full data) aim to make inferences by replacing the full data with the reduced or summarized data. Data storage and computational costs are among the primary motivations. In this presentation, data reduction will mean the selection of a subset (subdata) of the observations in the full data. While data reduction has been around for decades, its impact continues to grow, with approximately 2.5 exabytes (2.5 x 10^18 bytes) of data collected per day. We will begin by discussing an information-based method for subdata selection under the assumption that a linear regression model is adequate. A strength of this method, which is inspired by ideas from optimal design of experiments, is that it is superior to competing methods in terms of statistical performance and computational cost when the model is correct. A weakness, shared with other model-based methods, is that it can give poor results if the model is incorrect. We will therefore conclude with discussions of a method based on a more flexible model and of a model-free method.
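One way the optimal-design intuition plays out is to keep observations with extreme covariate values, which carry the most information about regression slopes. The sketch below (my own simplified illustration, with made-up sizes and parameters; not necessarily the speaker's exact algorithm) compares ordinary least squares on such subdata against a uniform random subsample of the same size.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100_000, 5, 1_000        # full-data size, covariates, subdata size
X = rng.normal(size=(n, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.normal(size=n)

# Extreme-value subdata selection: for each covariate, keep the
# observations with the r smallest and r largest values.
r = k // (2 * p)
chosen = set()
for j in range(p):
    order = np.argsort(X[:, j])
    chosen.update(order[:r])
    chosen.update(order[-r:])
idx = np.fromiter(chosen, dtype=int)

def ols(Xs, ys):
    # Least-squares slope estimates (no intercept in this toy model).
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

beta_sub = ols(X[idx], y[idx])
rand = rng.choice(n, size=len(idx), replace=False)
beta_rand = ols(X[rand], y[rand])

err_sub = np.linalg.norm(beta_sub - beta)
err_rand = np.linalg.norm(beta_rand - beta)
print(f"subdata error {err_sub:.4f}, random subsample error {err_rand:.4f}")
```

Because the selected design points sit in the tails, the information matrix of the subdata is much larger than that of a uniform subsample of equal size, which is exactly the statistical-performance advantage described in the abstract, and exactly what fails when the linear model is wrong.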

April 7: Dr. Tianhao Wang | University of Virginia

Location: WNR 115

Title: Differentially Private Machine Learning: Improvement, Byzantine Resilience, and Input Perturbation

Abstract: I will present our (ongoing) recent work on differentially private machine learning (DP-ML). First, I will present simple yet effective strategies to improve the performance of DP-stochastic gradient descent (DP-SGD), the widely adopted method for DP-ML. Then I will discuss ways to defend against Byzantine attacks in DP-SGD. Finally, I will talk about recent advances in synthetic data generation, which is another popular approach to DP-ML.
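For readers unfamiliar with DP-SGD, its core mechanics are per-example gradient clipping followed by Gaussian noise addition. Here is a minimal toy sketch on linear regression (a generic illustration of the standard DP-SGD recipe, not the speakers' improvements; all parameter values are made up, and no privacy accounting is performed).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear-regression data.
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)

def dp_sgd_step(w, Xb, yb, lr=0.1, C=1.0, sigma=0.5):
    # Per-example gradients of the squared-error loss.
    per_ex = 2 * (Xb @ w - yb)[:, None] * Xb
    # Clip each per-example gradient to L2 norm at most C.
    norms = np.linalg.norm(per_ex, axis=1, keepdims=True)
    clipped = per_ex / np.maximum(1.0, norms / C)
    # Add Gaussian noise calibrated to the clipping bound C.
    noise = sigma * C * rng.normal(size=w.shape)
    grad = clipped.mean(axis=0) + noise / len(Xb)
    return w - lr * grad

w = np.zeros(3)
for step in range(300):
    batch = rng.choice(len(X), size=50, replace=False)
    w = dp_sgd_step(w, X[batch], y[batch])
print("estimate:", np.round(w, 2))
```

Clipping bounds each example's influence on the update, and the noise masks what remains; the performance cost of both steps is what the improvement strategies mentioned in the abstract aim to reduce.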

April 21: Dr. Chao Gao | University of Chicago

Location: WNR 115

Title: Detection and Recovery of Sparse Signal Under Correlation

Abstract: We study a p-dimensional Gaussian sequence model with equicorrelated noise. In the first part of the talk, we consider detection of a signal that has at most s nonzero coordinates. Our result fully characterizes the nonasymptotic minimax separation rate as a function of the dimension p, the sparsity s, and the correlation level. Surprisingly, not only does the order of the minimax separation rate depend on s, it also varies with p-s. This new phenomenon occurs only when correlation is present. In the second part of the talk, we consider the problem of signal recovery. Unlike the detection rate, the order of the minimax estimation rate depends on p-2s, which is also a new phenomenon that occurs only with correlation. We also consider detection and recovery procedures that are adaptive to the sparsity level. While the optimal detection rate can be achieved adaptively without any cost, the optimal recovery rate can only be achieved in expectation with some additional cost.

April 28: Dr. Vivian Li | University of California, Riverside

Location: Zoom

Title: Statistical Methods for Analyzing and Comparing Single-cell Gene Expression Data

Abstract: Single-cell gene expression data provide an opportunity to characterize the molecular features of diverse cell types and states in tissue development and disease progression. However, it remains a challenge to construct a comprehensive view of single-cell transcriptomes in health and disease due to the knowledge gap in properly modeling the high-dimensional, sparse, and noisy data. In this talk, I will introduce two statistical methods we have developed for analyzing and comparing single-cell gene expression data. The first one is an integration method which enables joint analysis of single-cell samples from different biological conditions. This method can learn coordinated gene expression patterns that are common among, or specific to, different biological conditions and identify cellular types across single-cell samples. I will also discuss the applicability of our method in diverse biomedical problems. The second one is a computational method for identifying, quantifying, and comparing RNA transcripts from scRNA-seq data. Accurate and sensitive profiling of RNA transcripts is of great importance in understanding the mechanisms and consequences of gene expression regulation and can have diagnostic values in clinical settings. We propose a method to address computational questions arising from this biological problem.