Conventional computing resource management systems use a system model to describe resources and a scheduler to control their allocation; computing resources are divided into isolated partitions, each providing computing services for a different experiment. To improve resource utilization and reduce the complexity of deploying differentiated operating environments, our computing resources are configured with a single running environment that supports the jobs of all HTC (high-throughput computing) experiments. To prevent experiments that own few computing resources from occupying a large amount of extra resources for a long time, we configured a running-job quota for each experiment to ensure fairness. Conventional computing resource management systems do not adapt well to the ever-expanding scale of resources or to complex scheduling strategies.
Faced with these problems, we developed and implemented a new framework based on a device management database and Open Maintain Analysis Tools (OMAT). It offers a flexible and general approach to managing resources in a complex environment and significantly reduces manual intervention. Novel aspects of the framework include a flexible method for configuring the relationships among devices, services, and experiments; alarm policies that quickly detect unallocated computing resources; and an operationally practical way to quickly generate a scheduling policy and put it into effect. The framework is robust, flexible, and scalable, and it can evolve with changes in resources and experiments.
The framework was designed to solve real problems encountered in the deployment of HTCondor, a high-throughput computing scheduling system, at the Institute of High Energy Physics (IHEP) of the Chinese Academy of Sciences.
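
As a purely illustrative sketch of the ideas summarized above (a device-service-experiment relationship, detection of unallocated resources, and derivation of per-experiment running-job quotas), the following Python fragment may help. It is not the framework's implementation: the `Device` record, its fields, and the rule that an experiment's quota equals the cores of the devices it owns are all assumptions made for this example, and no OMAT or HTCondor interface is used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Device:
    """Hypothetical device record; field names are assumptions for this sketch."""
    name: str
    cores: int
    services: Set[str] = field(default_factory=set)  # services the device provides
    experiment: Optional[str] = None                  # experiment that owns the device


def unallocated(devices: List[Device]) -> List[str]:
    """Return names of devices with no service or no owning experiment.

    An alarm policy could report these so that idle resources are noticed.
    """
    return [d.name for d in devices if not d.services or d.experiment is None]


def running_job_quotas(devices: List[Device], service: str) -> Dict[str, int]:
    """Derive a running-job quota per experiment for one service.

    Assumed rule: the quota is the total core count of the devices that the
    experiment owns and that provide the given service.
    """
    quotas: Dict[str, int] = {}
    for d in devices:
        if service in d.services and d.experiment is not None:
            quotas[d.experiment] = quotas.get(d.experiment, 0) + d.cores
    return quotas
```

In this sketch, changing a device's owner or services in the record is enough to change both the alarm output and the generated quotas, which mirrors the goal of reducing manual intervention when resources or experiments change.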