OCRClassifier: integrating statistical control chart into machine learning framework for better detecting open chromatin regions

  • Xi'an Jiaotong University

Research output: Contribution to journal › Article › peer-review

Abstract

Open chromatin regions (OCRs) play a crucial role in transcriptional regulation and gene expression. In recent years, there has been growing interest in using plasma cell-free DNA (cfDNA) sequencing data to detect OCRs. By analyzing the characteristics of cfDNA fragments and their sequencing coverage, researchers can differentiate OCRs from non-OCRs. However, noise and variability in cfDNA-seq data pose challenges for the training data used in noise-tolerant, learning-based OCR estimation approaches: the training data contain numerous noisy labels that may reduce the accuracy of the results. Current methods for detecting OCRs rely on statistical features derived from typical open and closed chromatin regions to determine whether a region is an OCR or a non-OCR. Some atypical regions, however, exhibit statistical features that fall between the two categories, making it difficult to classify them definitively as either open or closed chromatin regions (CCRs). These regions should be considered partially open chromatin regions (pOCRs). In this paper, we present OCRClassifier, a novel framework that combines control charts and machine learning to address the impact of a high proportion of noisy labels in the training set and to classify chromatin open states accurately into three classes. Our method comprises two control charts. We first design a robust Hotelling T² control chart and create new run rules to identify reliable OCRs and CCRs within the initial training set. We then use only the resulting pure training set of OCRs and CCRs to create and train a sensitized T² control chart, which is designed to differentiate between the three categories of chromatin state: open, partially open, and closed. Experimental results demonstrate that, under this framework, the model not only performs well in three-class classification but also achieves higher accuracy and sensitivity in binary classification than current state-of-the-art models.
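
As a rough illustration of the kind of statistic a Hotelling T² control chart monitors, the sketch below computes T² = (x - m)ᵀ S⁻¹ (x - m) for each region's feature vector x against a reference mean m and covariance S, then sorts regions into three bands using two control limits. This is a minimal sketch under stated assumptions, not the OCRClassifier implementation: the function names, the use of two fixed limits (`warning_limit`, `action_limit`), and the reading of the three bands as three chromatin states are hypothetical, and the robust estimation and run rules described in the abstract are not reproduced here.

```python
import numpy as np

def fit_reference(X):
    """Estimate the mean and inverse covariance of a reference sample.

    X has one row per region and one column per feature.
    (Plain sample estimates; the paper's robust variant is not shown.)
    """
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return mean, np.linalg.inv(cov)

def hotelling_t2(X, mean, cov_inv):
    """Return T^2 = (x - mean)^T S^{-1} (x - mean) for each row x of X."""
    d = X - mean
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def classify(t2, warning_limit, action_limit):
    """Band each T^2 value using two hypothetical control limits.

    0: below warning_limit, 1: between the limits, 2: at or above action_limit.
    Mapping these bands to chromatin states is an assumption of this sketch.
    """
    return np.digitize(t2, [warning_limit, action_limit])

# Toy example with an identity covariance so T^2 is the squared distance.
mean = np.zeros(2)
cov_inv = np.eye(2)
regions = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
t2 = hotelling_t2(regions, mean, cov_inv)   # values 0, 2 and 25
bands = classify(t2, warning_limit=1.0, action_limit=10.0)
print(bands)  # -> [0 1 2]
```

In practice the limits would be derived from the distribution of T² on the reference sample rather than chosen by hand, and `fit_reference` would be applied to the cleaned training regions.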

Original language: English
Article number: 1400228
Journal: Frontiers in Genetics
Volume: 15
DOIs
State: Published - 2024

Keywords

  • cell-free DNA
  • machine learning approach
  • multivariate control chart
  • noisy label
  • open chromatin region
  • sequencing data analysis
