TY - GEN
T1 - Partial Failure Resilient Memory Management System for (CXL-based) Distributed Shared Memory
AU - Zhang, Mingxing
AU - Ma, Teng
AU - Hua, Jinqi
AU - Liu, Zheng
AU - Chen, Kang
AU - Ding, Ning
AU - Du, Fan
AU - Jiang, Jinlei
AU - Ma, Tao
AU - Wu, Yongwei
N1 - Publisher Copyright: © 2023 Owner/Author(s).
PY - 2023/10/23
Y1 - 2023/10/23
N2 - The efficiency of distributed shared memory (DSM) has been greatly improved by recent hardware technologies. But, the difficulty of distributed memory management can still be a major obstacle to the democratization of DSM, especially when a partial failure of the participating clients (e.g., due to crashed processes or machines) should be tolerated. In this paper, we present CXL-SHM, an automatic distributed memory management system based on reference counting. The reference count maintenance in CXL-SHM is implemented with a special era-based non-blocking algorithm. Thus, there are no blocking synchronization, memory leak, double free, and wild pointer problems, even if some participating clients unexpectedly fail without freeing their possessed memory references. We evaluated our system on real CXL hardware with both micro-benchmarks and end-to-end applications, which demonstrate the efficiency of CXL-SHM and the simplicity/flexibility of using CXL-SHM to build efficient distributed applications.
AB - The efficiency of distributed shared memory (DSM) has been greatly improved by recent hardware technologies. But, the difficulty of distributed memory management can still be a major obstacle to the democratization of DSM, especially when a partial failure of the participating clients (e.g., due to crashed processes or machines) should be tolerated. In this paper, we present CXL-SHM, an automatic distributed memory management system based on reference counting. The reference count maintenance in CXL-SHM is implemented with a special era-based non-blocking algorithm. Thus, there are no blocking synchronization, memory leak, double free, and wild pointer problems, even if some participating clients unexpectedly fail without freeing their possessed memory references. We evaluated our system on real CXL hardware with both micro-benchmarks and end-to-end applications, which demonstrate the efficiency of CXL-SHM and the simplicity/flexibility of using CXL-SHM to build efficient distributed applications.
KW - CXL
KW - distributed shared memory
KW - non-blocking
UR - https://www.scopus.com/pages/publications/85176935866
U2 - 10.1145/3600006.3613135
DO - 10.1145/3600006.3613135
M3 - Conference contribution
AN - SCOPUS:85176935866
T3 - SOSP 2023 - Proceedings of the 29th ACM Symposium on Operating Systems Principles
SP - 658
EP - 674
BT - SOSP 2023 - Proceedings of the 29th ACM Symposium on Operating Systems Principles
PB - Association for Computing Machinery, Inc
T2 - 29th ACM Symposium on Operating Systems Principles, SOSP 2023
Y2 - 23 October 2023 through 26 October 2023
ER -