CPU overheating characterization in HPC systems: a case study - ERODS
Conference paper, Year: 2018

CPU overheating characterization in HPC systems: a case study

French title: Caractérisation des surchauffes CPU dans les systèmes HPC : un cas d'étude

Abstract

As supercomputers grow in size, the number of abnormal events also increases. Some of these events may lead to an application failure; others may simply reduce system efficiency. CPU overheating is one such event: when a CPU overheats, it reduces its frequency, which decreases system efficiency. This paper studies the problem of CPU overheating in supercomputers. In the first part, we analyze data collected over one year on a supercomputer of the TOP500 list to understand under which conditions CPU overheating occurs. Our analysis shows that overheating events are due to some specific applications. In the second part, we evaluate the impact of such overheating events on the performance of MPI applications. Using 6 representative HPC benchmarks, we show that, for a majority of the applications, a frequency drop on one CPU impacts the execution time of distributed runs proportionally to the duration and to the extent of the frequency drop.
Main file
CPU_Overheating_Characterization_in_HPC_Systems:A_Case_Study.pdf (332.88 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01949708, version 1 (10-12-2018)

Identifiers

  • HAL Id: hal-01949708, version 1

Cite

Marc Platini, Thomas Ropars, Benoit Pelletier, Noël de Palma. CPU overheating characterization in HPC systems: a case study. Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop, Nov 2018, Dallas, United States. ⟨hal-01949708⟩
80 views
1082 downloads
