A high-order representation and classification method for transcription factor binding sites recognition in Escherichia coli

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Background: Identifying transcription factor binding sites (TFBSs) plays an important role in understanding gene regulatory processes, yet the mechanism underlying the specific binding of transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-to-sample ratio (i.e., number of dimensions/number of samples) is large, concatenating different physicochemical properties into a one-dimensional vector is not only likely to lose structural information but also poses significant challenges to recognition methods.

Materials and method: In this paper, we introduce a purely geometric representation, the tensor (also called a multidimensional array), to represent TFs using their physicochemical properties. Alongside the multidimensional array representation, we develop a tensor-based recognition method, the tensor partial least squares classifier (TPLSC). Intuitively, multidimensional arrays enable borrowing more information than one-dimensional arrays. The performance of each method is evaluated by average F-measure on 51 Escherichia coli TFs from the RegulonDB database.

Results: In our first experiment, the results show that multiple nucleotide properties yield more predictive power than dinucleotide properties. In the second experiment, our method achieves increased prediction power, roughly a 33% improvement over the best result from existing methods.

Conclusion: The representation method for TFs is an important step in TFBS recognition. We illustrate the benefits of this representation on real data through a series of experiments. This method can provide further insights into the mechanism of TF binding and be of great use in metabolic engineering applications.
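The contrast between a multidimensional array and a concatenated one-dimensional vector, and the F-measure used for evaluation, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the property names (`prop_a`, `prop_b`, `prop_c`), the placeholder property values (which are not real physicochemical measurements), and the helper functions `site_to_matrix`, `flatten`, and `f_measure`. The sketch does not implement TPLSC and is not the authors' code.

```python
from itertools import product

# All 16 dinucleotides over the DNA alphabet.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

# Hypothetical property table. The values are arbitrary placeholders,
# NOT real physicochemical measurements.
PROPERTY_NAMES = ["prop_a", "prop_b", "prop_c"]
PROPERTIES = {
    name: {dn: float(((i + 1) * (j + 1)) % 7) for j, dn in enumerate(DINUCLEOTIDES)}
    for i, name in enumerate(PROPERTY_NAMES)
}


def site_to_matrix(site):
    """Encode one site as a properties x positions array (a second-order tensor)."""
    steps = [site[k:k + 2] for k in range(len(site) - 1)]
    return [[PROPERTIES[name][s] for s in steps] for name in PROPERTY_NAMES]


def flatten(matrix):
    """Concatenate the rows into a one-dimensional vector (row structure is lost)."""
    return [value for row in matrix for value in row]


def f_measure(y_true, y_pred):
    """F-measure (harmonic mean of precision and recall) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    matrix = site_to_matrix("ACGTAC")       # 3 properties x 5 dinucleotide steps
    vector = flatten(matrix)                # 15 values, property axis no longer explicit
    sites = [site_to_matrix(s) for s in ("ACGTAC", "TTGACA")]  # sites x properties x positions
    print(len(matrix), len(matrix[0]), len(vector), len(sites))
    print(f_measure([1, 1, 0, 0], [1, 0, 1, 0]))
```

Stacking the per-site arrays gives a third-order array (sites × properties × positions), which keeps the property and position axes separate; the flattened vector carries the same numbers but no longer records which values belong to the same property or position.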

Original language: English
Pages (from-to): 16-23
Number of pages: 8
Journal: Artificial Intelligence in Medicine
Volume: 75
State: Published - 1 Jan 2017

Keywords

  • Classification
  • Computational biology
  • Machine learning
  • Partial least squares
  • Tensor
  • Transcription factor binding sites
