Seminar Series: Yiqun Chen

February 1, 2024
3:00PM - 4:00PM
EA170

Speaker: Yiqun Chen

Title: Statistical insight for biomedical data science, with applications to single-cell
RNA-sequencing data

Abstract: My research centers around bringing statistical insights and understanding to the
practice of modern data science, and I will cover two projects related to this research vision in
this talk.


The first part of the talk is motivated by the practice of testing data-driven hypotheses. In the
biomedical sciences, it has become increasingly common to collect massive datasets without a
pre-specified research question. In this setting, a data analyst might use the data both to
generate a research question, and to test the associated null hypothesis. For example, in
single-cell RNA-sequencing analyses, researchers often first cluster the cells, and then test for
differences in the expected gene expression levels between the clusters to quantify up- or
down-regulation of genes, annotate known cell types, and identify new cell types. However, this
popular practice is invalid from a statistical perspective: once we have used the data to generate
hypotheses, standard statistical inference tools are no longer valid. To tackle this problem, I
developed a conditional selective approach to test for a difference in means between pairs of
clusters obtained via k-means clustering. The proposed approach has appropriate statistical
guarantees (e.g., selective Type I error control).
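To illustrate why testing after clustering is problematic, here is a minimal simulation sketch. It is not the conditional selective test described in the talk; it only shows the failure of the naive approach. The assumptions are ours: one-dimensional standard normal data with no true clusters, a simple two-means routine, and a Welch-type statistic compared against the normal cutoff 1.96 as an approximate 5% test. All function names are hypothetical.

```python
import random
import statistics
from math import sqrt


def two_means_1d(xs, iters=100):
    """Lloyd's algorithm with k = 2 on one-dimensional data."""
    c0, c1 = min(xs), max(xs)  # start the two centers at the extremes
    for _ in range(iters):
        mid = (c0 + c1) / 2  # points left of the midpoint go to cluster a
        a = [x for x in xs if x <= mid]
        b = [x for x in xs if x > mid]
        new0, new1 = statistics.fmean(a), statistics.fmean(b)
        if (new0, new1) == (c0, c1):
            break
        c0, c1 = new0, new1
    return a, b


def cluster_split(xs, rng):
    """Groups chosen by looking at the data (rng is unused)."""
    return two_means_1d(xs)


def random_split(xs, rng):
    """Control: groups chosen without looking at the data values."""
    ys = xs[:]
    rng.shuffle(ys)
    half = len(ys) // 2
    return ys[:half], ys[half:]


def naive_reject(a, b, cutoff=1.96):
    """Welch-type statistic with a normal cutoff (approximate 5% test)."""
    if len(a) < 2 or len(b) < 2:
        return False  # variance undefined; count as no rejection
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return abs(z) > cutoff


def rejection_rate(split, reps=300, n=60, seed=0):
    """Fraction of null datasets on which the naive test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true clusters
        a, b = split(xs, rng)
        rejections += naive_reject(a, b)
    return rejections / reps


naive_rate = rejection_rate(cluster_split)
control_rate = rejection_rate(random_split)
print(f"reject rate, groups from clustering: {naive_rate:.2f}")
print(f"reject rate, groups from random split: {control_rate:.2f}")
```

Under these assumptions, the test applied to clusters found in the same data rejects far more often than the nominal 5%, while the random-split control stays close to it. A selective approach is meant to close this gap by accounting for the fact that the clusters were estimated from the data.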


In the second part of the talk, I will consider how to leverage large language models (LLMs)
such as ChatGPT for biomedical discovery. While significant progress has been made in
customizing large language models for biomedical data, these models often require extensive
data curation and resource-intensive training. In the context of single-cell RNA-sequencing data,
I will show that we can achieve surprisingly competitive results on many downstream tasks via a
much simpler alternative: I input textual descriptions of genes into an off-the-shelf LLM, such as
ChatGPT, to obtain low-dimensional representations of the genes, or “embeddings.” I then use
these embeddings as features in downstream tasks. A similar approach enables LLM-derived
embeddings of cells. This work highlights the potential of LLMs to provide meaningful and
concise representations for biomedical data, and also raises a number of challenging statistical
questions. Addressing these questions requires bringing principled statistical thinking to the
practice of modern data science.


This talk features joint work with Lucy Gao (University of British Columbia), Daniela Witten
(University of Washington), and James Zou (Stanford University).

Note: Seminars are free and open to the public. Reception to follow.