Estimating the length distributions of genomic micro-satellites from next generation sequencing data

  • Xuan Feng
  • Huan Hu
  • Zhongmeng Zhao
  • Xuanping Zhang
  • Jiayin Wang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Genomic micro-satellites are genomic regions that consist of short, repetitive DNA motifs. In contrast to unique genomic regions, micro-satellites exhibit high intrinsic polymorphism, which derives mainly from variability in length. Length distributions are therefore widely used to represent this polymorphism. Recent studies report that some micro-satellites show significantly altered length distributions in tumor tissue samples compared with those observed in normal samples, which has made them a hot topic in cancer genomics. Several state-of-the-art approaches have been proposed to identify length distributions from sequencing data. However, the existing approaches can only handle micro-satellites shorter than one read length, which limits research on long micro-satellite events. In this article, we propose a probabilistic approach, implemented as ELMSI, that estimates the length distributions of micro-satellites longer than one read length. The core algorithm works on a set of mapped reads. It first clusters the reads and applies a k-mer extension algorithm to detect the repeat unit and the breakpoints. It then runs an expectation maximization algorithm to approximate the true length distributions. According to the experiments, ELMSI is able to handle micro-satellites whose lengths range from shorter than one read length up to the 10 kbp scale. In a series of comparison experiments that vary the number of micro-satellite regions, the read length and the sequencing coverage, ELMSI outperforms MSIsensor in most cases.
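
To illustrate the expectation maximization step mentioned in the abstract, the following is a minimal sketch of how mixing weights over candidate micro-satellite lengths could be estimated from noisy per-read length observations. It is not the ELMSI implementation: the function name `em_length_distribution`, the `error_prob` parameter, and the noise model (an observation is off by one repeat unit with a fixed probability) are all assumptions made for this example, and the abstract does not specify the actual likelihood used.

```python
from collections import Counter


def em_length_distribution(observed, candidates, error_prob=0.1, n_iter=200):
    """Estimate weights over candidate true lengths with a basic EM loop.

    Hypothetical noise model (an assumption, not taken from the paper):
    an observation equals the true length with probability 1 - error_prob
    and is off by exactly one repeat unit (+1 or -1) with probability
    error_prob / 2 each.
    """
    def likelihood(obs, true_len):
        if obs == true_len:
            return 1.0 - error_prob
        if abs(obs - true_len) == 1:
            return error_prob / 2.0
        return 0.0

    counts = Counter(observed)
    # Start from a uniform distribution over the candidate lengths.
    weights = {c: 1.0 / len(candidates) for c in candidates}

    for _ in range(n_iter):
        # E-step: distribute each observation over the candidate true
        # lengths in proportion to weight * likelihood.
        expected = {c: 0.0 for c in candidates}
        for obs, n in counts.items():
            joint = {c: weights[c] * likelihood(obs, c) for c in candidates}
            norm = sum(joint.values())
            if norm == 0.0:
                continue  # no candidate can explain this observation
            for c in candidates:
                expected[c] += n * joint[c] / norm

        # M-step: renormalize the expected counts into new weights.
        mass = sum(expected.values())
        if mass == 0.0:
            break
        weights = {c: expected[c] / mass for c in candidates}

    return weights


if __name__ == "__main__":
    # Toy data: two alleles (10 and 20 repeat units) with a little noise.
    reads = [10] * 70 + [9] * 5 + [11] * 5 + [20] * 18 + [19] + [21]
    estimate = em_length_distribution(reads, list(range(5, 26)))
    for length, weight in sorted(estimate.items()):
        if weight > 0.01:
            print(length, round(weight, 3))
```

In this toy setting the weights concentrate on the two simulated alleles, while lengths that no observation supports drop to zero. A realistic model for micro-satellites longer than one read would need a likelihood that accounts for reads only partially covering the repeat, which this sketch does not attempt.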

Original language: English
Title of host publication: Bioinformatics and Biomedical Engineering - 6th International Work-Conference, IWBBIO 2018, Proceedings
Editors: Ignacio Rojas, Francisco Ortuno
Publisher: Springer Verlag
Pages: 461-472
Number of pages: 12
ISBN (Print): 9783319787220
DOIs
State: Published - 2018
Event: 6th International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO 2018 - Granada, Spain
Duration: 25 Apr 2018 → 27 Apr 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10813 LNBI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 6th International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO 2018
Country/Territory: Spain
City: Granada
Period: 25/04/18 → 27/04/18

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • Estimation approach
  • Genomic micro-satellite
  • Length distribution
  • Next generation sequencing data
