OpenStack Cluster Monitoring System Based on Machine Learning Algorithms
Canonical is a UK-based, privately held computer software company founded to market commercial support and related services for Ubuntu and related projects. Principally, these are free and open-source software (FOSS) or tools designed to improve collaboration between free software developers and contributors.
The main purpose of the PoC Cluster Monitoring System project was to demonstrate the implementation of Machine Learning algorithms using OpenStack technologies.
Monitoring the performance of clusters in real time can become a routine, yet far from trivial, task for an administrator. Sometimes it is not obvious which part of a cluster has failed. The entire process, from discovering an error to fixing it, can take hours. As a result, technical problems can affect the business dramatically.
Implementing Machine Learning algorithms to discover errors in the cluster makes it possible to save time by drawing the administrator’s attention to the problem as soon as it happens. System faults can even be predicted in advance.
The main challenge in this project was to discover errors in real time. The DataArt team also had to deal with a large amount of data and a complicated layer structure.
Meeting the Challenge
DataArt was chosen as a trusted development partner with strong experience in building Big Data and IoT-based solutions.
The main problems were the huge amount of data, the complicated layer structure and the wide array of file formats. To handle them, the DataArt team decided to train a classifier on a model in which each feature was a word from a test sample, with a value equal to the number of times that keyword appears in the text.
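To make the keyword-count representation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's code: the keyword list, the tokenization rule and the sample message are all hypothetical, since the case study does not list the words that were actually used.

```python
import re
from collections import Counter

# Hypothetical keyword vocabulary; the case study does not list the real one.
KEYWORDS = ["error", "timeout", "failed", "refused"]

def keyword_counts(text, keywords=KEYWORDS):
    """Return one occurrence count per keyword, in vocabulary order."""
    # Assumed tokenization: lowercase the text and keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in keywords]

# Made-up log message used only to show the shape of the feature vector.
sample = "ERROR: connection refused; retry failed, second retry failed"
features = keyword_counts(sample)
print(features)
```

Each log text becomes a fixed-length vector of counts regardless of its original file format, which is one plausible reason such a representation suits logs that arrive in many formats.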
For the model, our team used a pruned decision tree from the WEKA library. To locate suspected errors quickly, the IoT team wrote an Apache Spark Streaming job, which listens to the stream of log messages, processes them, and performs a real-time NLP analysis of each log file. If an error has occurred, the end user receives an alert with a description of the potential problem.
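The listen, classify and alert flow described above can be sketched in plain Python. This is only a stand-in: it uses neither Apache Spark Streaming nor WEKA, the stream is an ordinary list, and the `classify` function is a hand-written tree with made-up thresholds and labels in place of the trained, pruned decision tree. All names here are hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword vocabulary (not taken from the project).
KEYWORDS = ["error", "timeout", "failed"]

def featurize(message):
    """Count how often each keyword occurs in one log message."""
    counts = Counter(re.findall(r"[a-z]+", message.lower()))
    return {word: counts[word] for word in KEYWORDS}

def classify(features):
    """Hand-written stand-in for a trained decision tree.

    The splits and labels are invented for illustration; a real model
    would learn them from labeled log data.
    """
    if features["error"] >= 1:
        if features["timeout"] >= 1:
            return "network"
        return "service"
    if features["failed"] >= 2:
        return "service"
    return None  # nothing suspicious found

def monitor(stream):
    """Yield an alert string for every message the classifier flags."""
    for message in stream:
        label = classify(featurize(message))
        if label is not None:
            yield f"ALERT [{label}]: {message}"

# A list stands in for the live stream of log messages.
log_stream = [
    "INFO instance started",
    "ERROR request timeout contacting scheduler",
    "WARN retry failed, retry failed again",
]
alerts = list(monitor(log_stream))
for alert in alerts:
    print(alert)
```

Because `monitor` is a generator, each message is handled as it arrives rather than after the whole log has been collected, which mirrors the real-time intent of the streaming job.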
The DataArt team developed an intricate solution aimed at flagging potential failures as quickly as possible, essentially as they occur. As a result, the client has much more time to fix issues.
OpenStack technologies enable companies to be more innovative and develop business models faster. A simple cluster failure for a company that strongly depends on its infrastructure can turn out to be disastrous.
The PoC project demonstrates the possibility of using Machine Learning combined with OpenStack for monitoring purposes to save companies from experiencing unnecessary problems. More importantly, the PoC demonstrates that OpenStack can be used to address a wide range of issues with Machine Learning algorithms.