Energy Consumption Prediction Model Using Spark MLlib
The client is a manufacturer of heating and industrial systems.
The client was working on an IoT prototype for their boilers, integrating well-known industrial hardware with the Internet. Sensors on each device send notifications to the server, which saves the raw binary values to a Cassandra database. The prototype consisted of two parts:
create a Spark job for converting data into a readable format;
develop a machine-learning algorithm for an energy-consumption prediction model based on the history of boiler usage coupled with a weather forecast.
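The first part, converting raw binary sensor values into a readable format, could be sketched as follows. The actual payload layout is not described in this case study, so the structure below (an 8-byte timestamp followed by a 64-bit reading) is purely an assumption for illustration:

```scala
import java.nio.ByteBuffer

// Hypothetical payload layout (an assumption; the real sensor format is not
// public): an 8-byte epoch-millis timestamp followed by a 64-bit IEEE-754
// reading, big-endian.
case class Reading(timestampMs: Long, value: Double)

def decodeReading(payload: Array[Byte]): Reading = {
  val buf = ByteBuffer.wrap(payload) // ByteBuffer is big-endian by default
  Reading(buf.getLong(), buf.getDouble())
}

// Encoding counterpart, used here only to round-trip the decoder.
def encodeReading(r: Reading): Array[Byte] = {
  val buf = ByteBuffer.allocate(16)
  buf.putLong(r.timestampMs)
  buf.putDouble(r.value)
  buf.array()
}
```

In a Spark job, a function like `decodeReading` would be mapped over the rows read from Cassandra before writing them back in human-readable form.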
The client wanted to use Apache Zeppelin for visualization and a 3-node cluster for Spark and Cassandra.
DataArt was chosen as a trusted development partner with strong experience in building Big Data and IoT-based solutions.
The DataArt team was responsible for reviewing and improving cluster configuration for better performance, investigating and fixing issues with the Cassandra data schema, and developing Spark jobs for both parts of the prototype.
Our team identified several major issues during the knowledge transfer:
Spark and Cassandra clusters were set up on the same machines and configured in a way that caused performance and networking issues;
The Cassandra data schema wasn’t optimized for the client’s queries and needed secondary indexes as a minimum acceptance criterion;
The client didn’t have any experience with Apache Zeppelin.
Our team proposed the necessary changes to the Spark and Cassandra cluster configuration and to the data schema for improved performance, and provided the client with detailed instructions for using Apache Zeppelin.
Our solution was written on top of an open-source distributed computing framework, Apache Spark, using the Scala programming language to create a scalable architecture capable of processing large volumes of data. The machine-learning algorithm was implemented with the Spark MLlib library. After a linear regression model was calculated for each boiler, its coefficients were saved to a Cassandra table and used to visualize energy consumption in the client’s mobile application. To make the development process faster and simpler, we used Zeppelin notebooks for demos, visualizations, and tests.
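To illustrate the underlying math, the closed-form ordinary least squares fit for a single predictor (e.g. consumption against forecast temperature) can be sketched in plain Scala. This is a sketch of what linear regression computes, not the production MLlib code, which fits multiple features in a distributed fashion:

```scala
// Closed-form ordinary least squares for one predictor:
// consumption ≈ intercept + slope * temperature.
def fitLine(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
  val n = xs.length.toDouble
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // slope = covariance(x, y) / variance(x)
  val slope =
    xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
      xs.map(x => (x - meanX) * (x - meanX)).sum
  val intercept = meanY - slope * meanX
  (intercept, slope)
}
```

In the actual solution, MLlib's regression API would be trained on each boiler's usage history joined with weather data, producing the coefficient vector stored in Cassandra.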
The solution developed by DataArt was based on an open-source stack, with the main goal of demonstrating how industrial devices can be used as part of an IoT ecosystem. As a result, we:
Designed a Cassandra data model that best suits the client’s business requirements;
Helped with Spark and Cassandra cluster configuration and made changes to the architecture based on our Big Data experience;
Developed a Spark job for migrating binary Cassandra data to a human-readable format;
Created a linear regression model for predicting energy consumption and implemented the calculation in a separate Spark job (using Spark MLlib).
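Once the per-boiler coefficients are stored in Cassandra, applying the model is just an intercept plus a dot product, which is cheap enough to run in the mobile application's backend. A hypothetical sketch (the row shape and feature set are assumptions for illustration):

```scala
// A per-boiler model row as it might be stored in Cassandra (field names and
// feature set are assumptions, not the client's actual schema).
case class BoilerModel(boilerId: String, intercept: Double, weights: Vector[Double])

// Prediction is the intercept plus the dot product of weights and features
// (e.g. forecast temperature, hour of day).
def predictConsumption(model: BoilerModel, features: Vector[Double]): Double = {
  require(features.length == model.weights.length, "feature/weight size mismatch")
  model.intercept + model.weights.zip(features).map { case (w, f) => w * f }.sum
}
```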
Technologies: Apache Spark, Spark SQL, Spark MLlib, Hive, Apache Cassandra, Scala.