Be United in Actions: Taking Live Snapshots of Heterogeneous Edge-Cloud Collaborative Cluster with Low Overhead

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Failure recovery is one of the most essential problems in Internet of Things (IoT) systems, especially in critical scenarios such as traffic control and healthcare. Meanwhile, with the ever-increasing demand for IoT applications, and for latency and security reasons, more and more IoT applications are being migrated to large clusters that consist of both cloud and edge servers. However, as the scale of edge-cloud collaborative clusters continues to expand, the risk of system errors and failures also increases. The conventional snapshot/rollback method is a powerful way to address this problem and is widely used in cloud computing scenarios. When transplanted to edge-cloud collaborative clusters, which are distributed and heterogeneous by nature, however, it introduces serious network interruption and degrades guest performance. Therefore, in this article, we propose a duration-aware cluster snapshot system, named Phalanx, which can take live snapshots of edge-cloud collaborative clusters with low performance overhead. In Phalanx, we use the low-overhead precopy model and first propose a virtual machine (VM) snapshot duration prediction method that can accurately predict the snapshot duration of each individual VM. Then, based on the prediction results, we coordinate the snapshot process so that the whole cluster follows a consistency-friendly schedule, thereby solving the network interruption problem and minimizing the adverse performance impact on the guest IoT applications. We implement a prototype of Phalanx on the QEMU/KVM platform and conduct several experiments. The experimental results show that Phalanx incurs negligible network interruption and 10.68%-20.9% less performance impact than existing solutions.
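The two steps named in the abstract, per-VM duration prediction followed by a coordinated schedule, can be illustrated with a minimal sketch. This is not the Phalanx predictor or scheduler, whose details the abstract does not give. It assumes a textbook iterative precopy model (each round re-copies the memory dirtied during the previous round) and assumes that a "consistency-friendly schedule" means staggering start times so that all VMs finish together. The `VM` fields, `predict_duration`, `schedule_starts`, and the threshold values are all hypothetical names and parameters chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    memory_mb: float       # guest memory to save (hypothetical input)
    dirty_mb_per_s: float  # rate at which the guest dirties memory (hypothetical)
    write_mb_per_s: float  # rate at which snapshot data is written (hypothetical)


def predict_duration(vm, stop_threshold_mb=50.0, max_rounds=30):
    """Estimate precopy snapshot duration in seconds.

    Assumed model, not the paper's: copy all memory, then repeatedly copy
    what was dirtied in the previous round until the remainder is small
    enough for a final stop-and-copy.
    """
    remaining = vm.memory_mb
    total = 0.0
    for _ in range(max_rounds):
        if remaining <= stop_threshold_mb:
            break
        round_time = remaining / vm.write_mb_per_s
        total += round_time
        # Memory dirtied while this round ran must be copied again.
        remaining = min(vm.memory_mb, vm.dirty_mb_per_s * round_time)
    # Final stop-and-copy of whatever is left.
    return total + remaining / vm.write_mb_per_s


def schedule_starts(vms):
    """Return a start delay per VM so that all predicted snapshots end together."""
    durations = {vm.name: predict_duration(vm) for vm in vms}
    longest = max(durations.values())
    return {name: longest - d for name, d in durations.items()}


if __name__ == "__main__":
    cluster = [
        VM("edge-1", memory_mb=400, dirty_mb_per_s=10, write_mb_per_s=100),
        VM("cloud-1", memory_mb=4000, dirty_mb_per_s=50, write_mb_per_s=200),
    ]
    for name, delay in schedule_starts(cluster).items():
        print(f"{name}: start after {delay:.2f} s")
```

Under these assumptions, the VM with the longest predicted duration starts immediately and shorter snapshots are delayed, so the points at which the VMs are paused fall close together. When the dirty rate approaches or exceeds the write rate, the loop stops at `max_rounds` rather than converging, which is the usual limitation of a precopy model of this kind.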

Original language: English
Pages (from-to): 7311-7324
Number of pages: 14
Journal: IEEE Internet of Things Journal
Volume: 9
Issue number: 10
State: Published - 15 May 2022

Keywords

  • Availability
  • Edge-cloud collaborative cluster
  • Network interruption
  • Precopy
  • Snapshot
  • Virtual machine (VM)
