Skip to main navigation Skip to search Skip to main content

SoK: Dataset Copyright Auditing in Machine Learning Systems

  • Linkang Du
  • , Xuanru Zhou
  • , Min Chen
  • , Chusong Zhang
  • , Zhou Su
  • , Peng Cheng
  • , Jiming Chen
  • , Zhikun Zhang
  • Xi'an Jiaotong University
  • Zhejiang University
  • Vrije Universiteit Amsterdam
  • Hangzhou Dianzi University

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

4 Scopus citations

Abstract

As the implementation of machine learning (ML) systems becomes more widespread, especially with the introduction of larger ML models, we perceive a spring demand for massive data. However, it inevitably causes infringement and misuse problems with the data, such as using unauthorized online artworks or face images to train ML models. To address this problem, many efforts have been made to audit the copyright of the model training dataset. However, existing solutions vary in auditing assumptions and capabilities, making it difficult to compare their strengths and weaknesses. In addition, robustness evaluations usually consider only part of the ML pipeline and hardly reflect the performance of algorithms in real-world ML applications. Thus, it is essential to take a practical deployment perspective on the current dataset copyright auditing tools, examining their effectiveness and limitations. Concretely, we categorize dataset copyright auditing research into two prominent strands: intrusive methods and non-intrusive methods, depending on whether they require modifications to the original dataset. Then, we break down the intrusive methods into different watermark injection options and examine the non-intrusive methods using various finger-prints. To summarize our results, we offer detailed reference tables, highlight key points, and pinpoint unresolved issues in the current literature. By combining the pipeline in ML systems and analyzing previous studies, we highlight several future directions to make auditing tools more suitable for real-world copyright protection requirements.

Original languageEnglish
Title of host publicationProceedings - 46th IEEE Symposium on Security and Privacy, SP 2025
EditorsMarina Blanton, William Enck, Cristina Nita-Rotaru
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages2076-2094
Number of pages19
ISBN (Electronic)9798331522360
DOIs
StatePublished - 2025
Event46th IEEE Symposium on Security and Privacy, SP 2025 - San Francisco, United States
Duration: 12 May 202515 May 2025

Publication series

NameProceedings - IEEE Symposium on Security and Privacy
ISSN (Print)1081-6011

Conference

Conference46th IEEE Symposium on Security and Privacy, SP 2025
Country/TerritoryUnited States
CitySan Francisco
Period12/05/2515/05/25

Fingerprint

Dive into the research topics of 'SoK: Dataset Copyright Auditing in Machine Learning Systems'. Together they form a unique fingerprint.

Cite this