Partial Failure Resilient Memory Management System for (CXL-based) Distributed Shared Memory

  • Mingxing Zhang
  • Teng Ma
  • Jinqi Hua
  • Zheng Liu
  • Kang Chen
  • Ning Ding
  • Fan Du
  • Jinlei Jiang
  • Tao Ma
  • Yongwei Wu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

39 Scopus citations

Abstract

The efficiency of distributed shared memory (DSM) has been greatly improved by recent hardware technologies. However, the difficulty of distributed memory management can still be a major obstacle to the democratization of DSM, especially when partial failures of the participating clients (e.g., crashed processes or machines) must be tolerated. In this paper, we present CXL-SHM, an automatic distributed memory management system based on reference counting. CXL-SHM maintains reference counts with a special era-based non-blocking algorithm. As a result, it is free of blocking synchronization, memory leaks, double frees, and wild pointers, even if some participating clients unexpectedly fail without freeing the memory references they hold. We evaluated our system on real CXL hardware with both micro-benchmarks and end-to-end applications; the results demonstrate the efficiency of CXL-SHM and the simplicity and flexibility of using it to build efficient distributed applications.
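To make the failure mode concrete, here is a minimal Python sketch, not CXL-SHM's algorithm: a single-threaded simulation of reference counting in which each client keeps a table of the references it holds, so that a surviving client can release a crashed client's references instead of leaking them. The names `Pool`, `Client`, `SharedObject`, `acquire`, `release`, and `recover` are hypothetical. The sketch assumes every acquire and release is atomic and ignores both concurrency and crashes that occur partway through an operation; handling those cases without blocking is the problem the paper's era-based non-blocking algorithm targets.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity equality, so list.remove() drops the exact object
class SharedObject:
    """An object in a simulated shared memory pool (hypothetical model)."""
    name: str
    refcount: int = 0
    freed: bool = False


@dataclass(eq=False)
class Client:
    """A client process; `held` records every reference it currently owns."""
    cid: int
    held: list = field(default_factory=list)
    alive: bool = True


class Pool:
    """Reference-counted pool. Assumes each method runs atomically (no real concurrency)."""

    def __init__(self):
        self.objects = {}

    def alloc(self, client, name):
        # A new object starts with one reference, owned by the allocating client.
        obj = SharedObject(name)
        self.objects[name] = obj
        self.acquire(client, obj)
        return obj

    def acquire(self, client, obj):
        # Refuse to hand out a reference to reclaimed memory (a wild pointer).
        if obj.freed:
            raise RuntimeError("wild pointer: object already freed")
        client.held.append(obj)
        obj.refcount += 1

    def release(self, client, obj):
        # Raises ValueError if the client does not hold obj, which blocks double frees.
        client.held.remove(obj)
        obj.refcount -= 1
        if obj.refcount == 0:
            obj.freed = True
            del self.objects[obj.name]

    def recover(self, crashed):
        """Run by a survivor: release every reference a crashed client still holds."""
        crashed.alive = False
        for obj in list(crashed.held):
            self.release(crashed, obj)


if __name__ == "__main__":
    pool = Pool()
    a, b = Client(1), Client(2)
    obj = pool.alloc(a, "node")      # refcount 1, held by a
    pool.acquire(b, obj)             # refcount 2
    pool.recover(a)                  # a crashed without releasing; reclaim its reference
    print(obj.refcount, obj.freed)   # 1 False
    pool.release(b, obj)
    print(obj.refcount, obj.freed)   # 0 True
```

Without the per-client `held` table, the reference owned by the crashed client would never be decremented and the object would leak. The table is what lets recovery proceed, but keeping it consistent with the counts when a client dies mid-operation is exactly the hard part this simulation leaves out.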

Original language: English
Title of host publication: SOSP 2023 - Proceedings of the 29th ACM Symposium on Operating Systems Principles
Publisher: Association for Computing Machinery, Inc
Pages: 658-674
Number of pages: 17
ISBN (Electronic): 9798400702297
State: Published - 23 Oct 2023
Event: 29th ACM Symposium on Operating Systems Principles, SOSP 2023 - Koblenz, Germany
Duration: 23 Oct 2023 to 26 Oct 2023

Publication series

Name: SOSP 2023 - Proceedings of the 29th ACM Symposium on Operating Systems Principles

Conference

Conference: 29th ACM Symposium on Operating Systems Principles, SOSP 2023
Country/Territory: Germany
City: Koblenz
Period: 23/10/23 to 26/10/23

Keywords

  • CXL
  • distributed shared memory
  • non-blocking

