Operation and maintenance analysis platform at IHEP
September 28, 2022
As computing facilities grow in scale and computing environments become more complex, operating and maintaining large-scale computing clusters becomes increasingly difficult. Operation and maintenance methods based solely on configuration management automation cannot resolve the various service failures in computing clusters quickly and effectively. Emerging technologies are urgently needed to obtain comprehensive operation and maintenance information, integrate monitoring data from multiple heterogeneous sources, and analyze anomalous patterns in that data. The results of this analysis make it possible to locate the root cause of a service failure and help computing clusters restore services quickly. To provide a more stable cluster operating environment, IHEPCC combined big data technology with data analysis and indexing tools to design and implement the operation and maintenance analysis toolkit (OMAT) as an open framework whose functions include data collection, correlation analysis, anomaly detection, and alerting. This report introduces the architecture, processing capabilities, and some key functions of OMAT. Following the processing flow of monitoring data, it describes how the system implements data collection, data processing, data storage, and data visualization. The OMAT platform is currently applied to multiple cross-regional computing clusters including IHEP, covering about 5000 nodes. The collected information includes node status, storage performance, network traffic, user operations, account security, power environment, and other operation and maintenance indicators, which help ensure the performance and stable operation of the computing clusters.
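The abstract does not say which anomaly detection method OMAT uses. As a purely illustrative sketch of what flagging anomalous monitoring samples can look like, the Python example below applies a trailing-window z-score to a single metric. The function name `detect_anomalies`, the `window` and `threshold` parameters, and the sample `cpu_load` values are all hypothetical and are not taken from OMAT.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations.

    Illustrative only: not the method used by OMAT.
    """
    history = deque(maxlen=window)  # trailing window of recent samples
    anomalies = []
    for i, value in enumerate(samples):
        # Only score a sample once the trailing window is full.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip flat windows (sigma == 0) to avoid dividing the
            # deviation by a zero spread.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical load samples for one node, with a single spike at index 6.
cpu_load = [1.0, 1.1, 0.9, 1.0, 1.2, 1.1, 9.5, 1.0, 1.1]
print(detect_anomalies(cpu_load))  # prints [6]
```

In this sketch the spike itself enters the trailing window after it is scored, which temporarily inflates the spread and makes the detector less sensitive for the next few samples. A production system would likely use a more robust statistic or exclude flagged samples from the window.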