TY - JOUR
T1 - Identifying Differentially Expressed Genes in RNA Sequencing Data with Small Labelled Samples
AU - Guo, Yin
AU - Xiao, Yanni
AU - Li, Limin
N1 - Publisher Copyright: © 2004-2012 IEEE.
PY - 2024
Y1 - 2024
AB - RNA-seq, including bulk RNA-seq and single-cell RNA-seq, is a next-generation sequencing-based RNA profiling method capable of measuring gene expression patterns with high resolution, and has gradually become an essential tool for the analysis of differential gene expression at the whole-transcriptome level. Differential gene identification is a key problem in many biological studies such as disease genetics. Two-sample location test methods are widely used in case-control studies to identify significantly differential genes. However, due to the high cost of collecting labelled data, many studies face the small sample problem: only a small amount of labelled data is available, and traditional methods often lose power. To address this issue, we propose a novel rank-based nonparametric test called the WMW-A test, based on the Wilcoxon-Mann-Whitney test, which introduces a three-sample statistic through an auxiliary sample that is either given or generated in the form of unlabelled data. By combining the case, control and auxiliary samples, we construct a three-sample WMW-A statistic based on the gap between the average ranks of the case and control samples in the combined sample. Extensive simulation experiments and real applications on different gene expression datasets, including one bulk RNA-seq dataset and two single-cell RNA-seq datasets, show that the WMW-A test could significantly improve the test power for the two-sample problem with small sample sizes, using either available or generated auxiliary data. Applications to two real small SARS-CoV-2 datasets further show the improvement of the WMW-A test for differentially expressed gene identification with small labelled samples.
KW - auxiliary sample
KW - Differentially expressed genes
KW - small sample problem
KW - two-sample independent test
KW - Wilcoxon-Mann-Whitney test
UR - https://www.scopus.com/pages/publications/85189330268
U2 - 10.1109/TCBB.2024.3382147
DO - 10.1109/TCBB.2024.3382147
M3 - Article
AN - SCOPUS:85189330268
SN - 1545-5963
VL - 21
SP - 1311
EP - 1321
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 5
ER -