An Automated Monitoring and Repairing System for DNN Training

Research output: Contribution to journal › Article › peer-review

Abstract

With the widespread adoption of machine learning models, especially deep neural networks (DNNs), as an integral part of new intelligent software, tools that effectively support model engineering and debugging have received extensive attention. However, existing tools provide only limited support for the training process. They are either post-training tools that fail to detect problems in a timely manner, wasting time and resources on training buggy models, or tools that merely collect training data and still require manual analysis. In this paper, we propose AutoTrainer, an automated monitoring and repairing system for DNN training, which monitors the model training process in real time and automatically repairs eight commonly seen training problems. While a model trains, AutoTrainer watches for potential training problems; for any problem it detects, it tries to apply one of its built-in state-of-the-art solutions. Our experiments on six datasets and 701 models show that the problem detection accuracy of AutoTrainer reaches 100% without false positives. Moreover, it fixes 98.42% of all detected problems and improves model accuracy by 36.42% on average.
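
The monitor, detect, and repair cycle described above can be illustrated with a toy sketch. This is not AutoTrainer's implementation: the function name `train_with_monitor`, the symptom (a loss that rises for `window` consecutive steps or becomes non-finite), and the repair (restore the initial weight and halve the learning rate) are all assumptions chosen for illustration, applied to plain gradient descent on f(w) = w^2 rather than a DNN.

```python
import math


def train_with_monitor(lr=1.5, steps=50, window=3, max_repairs=10):
    """Minimise f(w) = w**2 by gradient descent while watching the loss.

    Hypothetical symptom: the loss rises for `window` consecutive steps,
    or becomes NaN/inf. Hypothetical repair: restore the initial weight
    and halve the learning rate. Returns (final_loss, final_lr, repairs).
    """
    w_init = 1.0
    w = w_init
    prev_loss = w * w
    rising = 0   # consecutive steps on which the loss went up
    repairs = 0  # number of repairs applied so far

    for _ in range(steps):
        # One training step: the gradient of w**2 is 2w.
        w -= lr * 2.0 * w
        loss = w * w

        # Monitor: track consecutive loss increases.
        broken = not math.isfinite(loss)
        if broken or loss > prev_loss:
            rising += 1
        else:
            rising = 0
        prev_loss = loss

        # Detect and repair: roll back and retry with a smaller step size.
        if (broken or rising >= window) and repairs < max_repairs:
            lr *= 0.5
            w = w_init
            prev_loss = w * w
            rising = 0
            repairs += 1

    return w * w, lr, repairs


if __name__ == "__main__":
    final_loss, final_lr, repairs = train_with_monitor()
    print(f"loss={final_loss:.3e} lr={final_lr} repairs={repairs}")
```

With the default learning rate of 1.5, each update maps w to -2w, so the loss grows until the monitor intervenes; after one repair the learning rate is 0.75, each update maps w to -0.5w, and the loss shrinks toward zero. Catching the problem during training, rather than after it, is what saves the remaining steps from being spent on a diverging run.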

Original language: English
Pages (from-to): 1655-1673
Number of pages: 19
Journal: IEEE Transactions on Dependable and Secure Computing
Volume: 22
Issue number: 2
State: Published - 2025

Keywords

  • Deep learning debugging
  • Deep learning repairing
  • Deep learning training
  • Machine learning security
