Sky Workload Manager (SWM or SkyWM) is middleware for dynamic high-performance computing (HPC) workload management. It was designed with scalability, flexibility, and fault tolerance in mind.
• Multi-Cloud Integration allows resources to be rented automatically in the most favourable cloud. Based on rules defined by the system administrator, SWM automatically selects the cloud provider to be used for job offloading. A job can thus be executed on, for example, the least expensive or the most powerful cloud resources.
• Job Relocation leads to better resource utilization. Job relocation, which is transparent to the user, is a key feature of SWM. Besides improving the utilization of on-premise resources, SWM can automatically relocate a job to a cloud and return the results to the user. Together with the job, SWM can also relocate its data, container image, and job metrics.
• Smart Data Scheduling. Data transfers are coordinated by advanced scheduling algorithms. Different types of data can automatically be delivered to and from execution nodes or remote clusters.
• Enhanced Security. SWM is designed to keep user data safe. Only encrypted connections are used to transfer information over the network. Each user authenticates with a unique digital certificate. All data transfers between on-premise clusters and clouds are encrypted.
• Extreme Scalability. SWM scales almost infinitely. Communication is encapsulated within node partitions. Communication between nodes of different partitions is established through the manager nodes. All communication between nodes is asynchronous.
• Advanced Fault Tolerance. SWM services consist of many small modules that can be reloaded automatically and independently of each other after a failure. This mechanism is the building block of node-level fault tolerance. A second type, cluster-level fault tolerance, is based on several advanced algorithms that increase resistance to network partitioning and other cluster-wide failures.
• Containerization simplifies job management and isolates processes. SWM can run jobs in containers and copy the container images to the execution site. Besides the usual advantages of container technology, this mechanism also allows a job's environment to be moved between clusters or clouds.
• Accounting and Reporting enable better usage control and budget optimization. SWM supports nested accounts, each of which can have its own administrator and a separate or shared budget. A parent account's budget can be shared among its children. Reports can be generated from resource usage, which is tracked on both on-premise clusters and clouds.
• Power Management. SWM detects unused nodes and can power them off. When new jobs arrive, the stopped nodes can be started again automatically.
• Live Updates reduce cluster downtime. Minor updates can be applied without stopping the daemons. The technologies underlying the software allow old parts (modules) of a running application to be replaced with new versions.
• Built-in Simulation Mode. A system administrator can deploy SWM on a single machine and start it in a mode that simulates the whole distributed system. This makes it possible to try, learn, and test our product before deploying it on a production cluster.
• Friendly User Interface. Drawing on broad experience of deploying and working with classic HPC workload managers, our experts have developed an intuitive and powerful user interface that makes the workload manager easy to install, configure, learn, and use.
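
The rule-based choice of a cloud provider mentioned in the first feature above can be illustrated with a minimal Python sketch. This is not SWM's actual rule format or API: the `CloudOffer` fields, the provider names, the prices, the rule names, and the `select_provider` function are all hypothetical and serve only to show how an administrator-defined rule could pick the least expensive or the most powerful offer.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CloudOffer:
    """Hypothetical description of what one cloud provider offers for a job."""
    provider: str
    price_per_node_hour: float  # illustrative currency units
    cores_per_node: int


# A rule maps an offer to a sort key; the offer with the smallest key wins.
Rule = Callable[[CloudOffer], float]

# Two example administrator rules (hypothetical names, not SWM settings).
RULES: dict[str, Rule] = {
    "least_expensive": lambda offer: offer.price_per_node_hour,
    # Negate the core count so that more cores means a smaller key.
    "most_powerful": lambda offer: -offer.cores_per_node,
}


def select_provider(offers: Sequence[CloudOffer], rule_name: str) -> Optional[CloudOffer]:
    """Return the best offer under the named rule, or None if there are no offers."""
    rule = RULES[rule_name]
    return min(offers, key=rule, default=None)


# Made-up offers used only for the demonstration below.
OFFERS = [
    CloudOffer("cloud-a", price_per_node_hour=1.20, cores_per_node=32),
    CloudOffer("cloud-b", price_per_node_hour=0.90, cores_per_node=16),
    CloudOffer("cloud-c", price_per_node_hour=2.50, cores_per_node=64),
]

if __name__ == "__main__":
    print(select_provider(OFFERS, "least_expensive").provider)
    print(select_provider(OFFERS, "most_powerful").provider)
```

Expressing each rule as a key function keeps the selection logic in one place: adding a new policy in this sketch means adding one entry to `RULES`, without touching `select_provider`.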