TY - JOUR
T1 - Be United in Actions
T2 - Taking Live Snapshots of Heterogeneous Edge-Cloud Collaborative Cluster with Low Overhead
AU - Shi, Bin
AU - Dong, Bo
AU - Zheng, Qinghua
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2022/5/15
Y1 - 2022/5/15
N2 - Failure recovery is one of the most essential problems in Internet of Things (IoT) systems, especially in crucial scenarios such as traffic control and healthcare. Meanwhile, driven by the ever-increasing demand for IoT applications and by latency and security considerations, more and more IoT applications are being migrated to large clusters that consist of both cloud and edge servers. However, as the scale of edge-cloud collaborative clusters continues to expand, the risk of system errors and failures also increases. The conventional snapshot/rollback method is a powerful way to solve this problem and is widely used in cloud computing scenarios. However, when transplanted to edge-cloud collaborative clusters, which are distributed and heterogeneous by nature, it introduces serious network interruption and degrades guest performance. To address these problems, this article proposes Phalanx, a duration-aware cluster snapshot system that can take live snapshots of edge-cloud collaborative clusters with low performance overhead. Phalanx uses the low-overhead precopy model. We first propose a virtual machine (VM) snapshot duration prediction method that can accurately predict the snapshot duration of each individual VM. Then, based on the prediction results, we coordinate the snapshot process so that the whole cluster follows a consistency-friendly schedule, thereby solving the network interruption problem and minimizing the adverse performance impact on the guest IoT applications. We implement a prototype of Phalanx on the QEMU/KVM platform and conduct several experiments. The experimental results show that Phalanx causes negligible network interruption while incurring 10.68%-20.9% less performance impact than existing solutions.
AB - Failure recovery is one of the most essential problems in Internet of Things (IoT) systems, especially in crucial scenarios such as traffic control and healthcare. Meanwhile, driven by the ever-increasing demand for IoT applications and by latency and security considerations, more and more IoT applications are being migrated to large clusters that consist of both cloud and edge servers. However, as the scale of edge-cloud collaborative clusters continues to expand, the risk of system errors and failures also increases. The conventional snapshot/rollback method is a powerful way to solve this problem and is widely used in cloud computing scenarios. However, when transplanted to edge-cloud collaborative clusters, which are distributed and heterogeneous by nature, it introduces serious network interruption and degrades guest performance. To address these problems, this article proposes Phalanx, a duration-aware cluster snapshot system that can take live snapshots of edge-cloud collaborative clusters with low performance overhead. Phalanx uses the low-overhead precopy model. We first propose a virtual machine (VM) snapshot duration prediction method that can accurately predict the snapshot duration of each individual VM. Then, based on the prediction results, we coordinate the snapshot process so that the whole cluster follows a consistency-friendly schedule, thereby solving the network interruption problem and minimizing the adverse performance impact on the guest IoT applications. We implement a prototype of Phalanx on the QEMU/KVM platform and conduct several experiments. The experimental results show that Phalanx causes negligible network interruption while incurring 10.68%-20.9% less performance impact than existing solutions.
KW - Availability
KW - Edge-cloud collaborative cluster
KW - Network interruption
KW - Precopy
KW - Snapshot
KW - Virtual machine (VM)
UR - https://www.scopus.com/pages/publications/85114735338
U2 - 10.1109/JIOT.2021.3111023
DO - 10.1109/JIOT.2021.3111023
M3 - Article
AN - SCOPUS:85114735338
SN - 2327-4662
VL - 9
SP - 7311
EP - 7324
JO - IEEE Internet of Things Journal
JF - IEEE Internet of Things Journal
IS - 10
ER -