Detecting Silent Data Corruptions With Deep Neural Networks

PI Name: Franck Cappello, MCS
PI Institution: Argonne National Laboratory
Project Description

Our research focuses on using deep neural networks (DNNs) to detect silent data corruptions. Specifically, once an error occurs, can we detect it immediately and, more importantly, can we still identify it after several time steps? In addition, an application normally has several state variables: how are they correlated, and can our detector exploit that correlation? We also plan to investigate several other aspects:

  • What type of DNN provides the best performance?
  • Which DNN hyper-parameters provide the best detection sensitivity?

We will apply this research to FLASH4 applications and plan to publish papers presenting the results.
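
To make the "after several time steps" question above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the project's detector and not FLASH4: the corruption is modeled as a single bit flip in a double-precision value, and the application is replaced by a toy 1D diffusion step. The names `flip_bit`, `diffuse`, and `corruption_trace` are hypothetical helpers introduced for this example.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its IEEE 754 double flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def diffuse(u, alpha=0.25):
    """One explicit diffusion step on a periodic 1D grid.

    Assumption: this toy update stands in for a real solver time step.
    """
    n = len(u)
    return [
        u[i] + alpha * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]


def max_abs_diff(a, b):
    """Largest pointwise deviation between two grids of equal length."""
    return max(abs(x - y) for x, y in zip(a, b))


def corruption_trace(n=64, steps=10, index=20, bit=50):
    """Evolve a clean and a corrupted grid side by side.

    Returns the maximum deviation between the two grids at step 0
    (right after injection) and after each subsequent time step.
    """
    clean = [math.sin(2.0 * math.pi * i / n) for i in range(n)]
    corrupted = list(clean)
    corrupted[index] = flip_bit(corrupted[index], bit)  # inject the error

    trace = [max_abs_diff(clean, corrupted)]
    for _ in range(steps):
        clean = diffuse(clean)
        corrupted = diffuse(corrupted)
        trace.append(max_abs_diff(clean, corrupted))
    return trace


if __name__ == "__main__":
    for step, deviation in enumerate(corruption_trace()):
        print(f"step {step:2d}: max deviation = {deviation:.6e}")
```

In this toy setting the corrupted value is smoothed into its neighbors, so the peak deviation shrinks with every step. That is one reason detection several time steps after the error may be harder than immediate detection; whether real applications behave this way depends on the solver and the variable affected.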

Testbed

NVIDIA DGX, Volta GPUs