PoS - Proceedings of Science
Volume 415 - International Symposium on Grids & Clouds 2022 (ISGC2022) - Network, Security, Infrastructure & Operations Session
Operation and maintenance analysis platform at IHEP
Q. Hu*, L. Wang, W. Zheng and X. Jiang
Published on: September 28, 2022
As computing facilities grow in scale and computing environments become more complex, operating and maintaining large-scale computing clusters becomes increasingly difficult. Operation and maintenance methods based solely on configuration management automation cannot resolve the various service failures in computing clusters quickly and effectively. There is an urgent need to adopt emerging technologies that gather comprehensive cluster operation and maintenance information, integrate monitoring data from multiple heterogeneous sources, and analyze anomalous patterns in that data. The results of this analysis make it possible to locate the root cause of a service failure and help the cluster restore service quickly. To provide a more stable cluster operating environment, IHEPCC combined big data technology and data analysis index tools to design and implement the operation and maintenance analysis toolkit (OMAT) as an open framework, which includes data collection, correlation analysis, anomaly detection, alerting, and other functions. This report introduces the architecture, processing capabilities, and some key functions of OMAT. Following the processing flow of the monitoring data, it describes how the system implements data collection, data processing, data storage, and data visualization. The OMAT platform is currently deployed on multiple cross-regional computing clusters, including IHEP, covering about 5000 nodes. The collected information includes node status, storage performance, network traffic, user operations, account security, power environment, and other operation and maintenance indicators, which help ensure the computing clusters’ performance and stable operation.
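To make the collect, detect, and alert flow described above more concrete, the following is a minimal Python sketch of one possible anomaly-detection step. It is not OMAT's implementation: the abstract does not say which detection method OMAT uses, so a rolling z-score is assumed here purely for illustration, and the names `Sample`, `RollingZScoreDetector`, and `detect`, the node and metric labels, and the window and threshold values are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Sample:
    """One monitoring reading (all field values here are hypothetical)."""
    node: str      # e.g. "n1"
    metric: str    # e.g. "load1"
    value: float


class RollingZScoreDetector:
    """Flags a reading that deviates strongly from its recent history.

    Assumption: a rolling z-score stands in for whatever detection
    method the real platform uses.
    """

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.window = window            # readings kept per (node, metric)
        self.threshold = threshold      # z-score above which we alert
        self.min_samples = min_samples  # history needed before judging
        self.history = {}

    def observe(self, sample):
        """Return an alert string for an anomalous sample, else None."""
        key = (sample.node, sample.metric)
        past = self.history.setdefault(key, deque(maxlen=self.window))
        alert = None
        if len(past) >= self.min_samples:
            mu = mean(past)
            sigma = pstdev(past)
            # A flat history (sigma == 0) gives no scale to judge against.
            if sigma > 0 and abs(sample.value - mu) / sigma > self.threshold:
                alert = (f"ALERT {sample.node} {sample.metric}="
                         f"{sample.value} (window mean {mu:.2f})")
        past.append(sample.value)
        return alert


def detect(samples, **kwargs):
    """Run the detector over an iterable of samples and collect alerts."""
    detector = RollingZScoreDetector(**kwargs)
    return [a for a in map(detector.observe, samples) if a is not None]


if __name__ == "__main__":
    readings = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 9.0]
    for alert in detect(Sample("n1", "load1", v) for v in readings):
        print(alert)
```

In a deployment of the kind the abstract describes, the samples would come from the data collection layer and the alerts would feed the alerting function; here both ends are reduced to an in-memory list so the sketch stays self-contained.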
DOI: https://doi.org/10.22323/1.415.0011

Open Access
Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.