A Machine Learning Approach for Anaerobic Reactor Performance Prediction Using Long Short-Term Memory Recurrent Neural Network

Predictive models are important to help manage high-value assets and to ensure optimal and safe operations. Recently, advanced machine learning algorithms have been applied to solve practical and complex problems, and are of significant interest due to their ability to adaptively 'learn' in response to changing environments. This paper reports on the data preparation strategies and the development and predictive capability of a Long Short-Term Memory (LSTM) recurrent neural network model for the anaerobic reactors employed at Melbourne Water's Western Treatment Plant for sewage treatment with biogas harvesting. The results show faster training and higher accuracy in predicting biogas production when the historical data, which include significant outliers, are pre-processed with z-score standardisation rather than with min-max normalisation. Furthermore, a model trained with a reduced number of input variables, selected using a feature selection technique based on Pearson's correlation coefficient, is found to yield good performance given sufficient training data. The overall best-performing model combines the reduced input variables with z-score-standardised data. This initial study provides a useful guide for implementing machine learning techniques to develop smarter structures and asset management towards Industry 4.0 concepts.


Introduction
Melbourne Water's self-powered Western Treatment Plant (WTP) at Werribee, Victoria, Australia [1], provides essential sewage treatment and, under normal operating conditions, its biological digestion process produces 65,000 m³ of methane-rich biogas per day. This biogas is harvested and used to generate 7 MW of renewable electrical power, worth $8 million (AUD) per year. The anaerobic treatment lagoons are covered by 2-mm-thick high-density polyethylene (HDPE) floating sheets, approximately 450 m x 170 m, to capture the odorous and greenhouse gases produced by the bacteria in the sewage under the covers. Untreated raw sewage content, such as fats, oil, floating solids and buoyed sludge or other fibrous material, may be carried to the surface of the lagoon by small bubbles, potentially forming a solid mass called 'scum'. The scum in the anaerobic lagoon is a mixture of floating solids, undigested sludge and gases trapped during anaerobic digestion. This scum can accumulate under the covers, amassing into a large iceberg-like body called a 'scumberg'. The formation of scumbergs has the potential to adversely affect the structural integrity of the HDPE floating covers and block the pathway of biogas, thereby degrading the WTP performance. Due to the complex nature of scum accumulation and scumberg formation, it is very difficult to develop analytical or rule-based computing models for forecasting their effects on the performance of the anaerobic reactors and the structural integrity of the HDPE floating covers. Consequently, reliable performance data can only be obtained from actual measurements at the sewage processing plant, as the processes cannot reliably be scaled down to a laboratory-sized simulation. Therefore, there is a need for 'smart' real-time monitoring and diagnostic-prognostic modelling capability for the operation and management of this high-value asset, based on current and/or historical operational data.
In many practical cases, physical systems for which explicit rules are either unknown or too difficult to determine cannot be accurately modelled using traditional computing methods. In the last few decades, machine learning (ML) techniques have been extensively employed in engineering applications, including structural engineering [2][3][4], water reservoir operations [5,6], and structural health monitoring [7][8][9]. Artificial neural networks (ANNs) are a class of machine learning algorithms that mimic the information processing and knowledge acquisition of the human brain. ANNs are highly desirable due to their ability to model nonlinearity and to predict accurately even from noisy and incomplete data of a real-world system. Furthermore, they can adaptively update their models in response to environments that change over time [10,11], which makes ANNs an ideal candidate for solving complex engineering problems. However, ANNs are unable to provide justifications for their solutions and can be unpredictably inaccurate when extrapolating to problems outside the network's training domain.
A recurrent neural network (RNN) is a special type of ANN for sequential data modelling that retains some past information. However, traditional RNNs suffer from vanishing/exploding gradients and lack long-term memory [12]. The Long Short-Term Memory (LSTM) network was first proposed by Hochreiter and Schmidhuber [13] to overcome these limitations. An LSTM network can effectively learn long-term dependencies between time steps and is superior to most RNN prediction methods [13]. In short, the difference between a traditional RNN and an LSTM lies in the internal operation of the recurrent cell. In a traditional RNN, only one internal state exists and it is recomputed at every time step, whereas an LSTM cell has an additional self-connected memory cell state in which information can be stored; these memory cells are managed by cell gates, allowing long-term dependencies to be learnt [13]. There is significant interest in using the LSTM network architecture for time-series forecasting in various engineering applications, including wind power, solar power and electric load [14][15][16][17]. This architecture can also be expected to be highly advantageous for producing data-driven and adaptive ANN-based predictive models for optimal biogas harvesting while ensuring the structural integrity of the floating covers.
In this paper, an LSTM network architecture, which comprises one sequence input layer followed by one LSTM layer, a fully-connected layer (consisting of one neuron) and an output layer, is developed and demonstrated to predict the biogas production of the WTP anaerobic reactor. The investigation includes a parametric study to design the LSTM network topology, covering data pre-processing, the number of hidden units in the LSTM layer and the training parameters (refer to Figure 1). A historical data sample from the WTP is partitioned to train the LSTM network's input and recurrent weights and biases and to evaluate the trained models' performance. In this study, MATLAB R2020a was used to develop the LSTM network. This study reports on the data preparation of the real-life dataset of anaerobic reactor and environment readings and the development of the LSTM network architecture for biogas production prediction.
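As an illustration, a minimal sketch of this topology in MATLAB (Deep Learning Toolbox) is given below; the layer sizes are placeholders rather than values taken from the study.

% Minimal sketch of the LSTM topology described above (MATLAB R2020a,
% Deep Learning Toolbox). numFeatures depends on whether the full or
% reduced input set is used; numHiddenUnits is tuned in the parametric study.
numFeatures    = 13;  % assumed: 14 recorded variables minus the biogas output
numHiddenUnits = 10;  % starting value; tuned per model
layers = [
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits, 'OutputMode', 'sequence')
    fullyConnectedLayer(1)   % one neuron, as in the architecture above
    regressionLayer];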

Data Preparation
A real-world historical dataset of the 25W anaerobic reactor at WTP, consisting of 14 variables with 365 daily readings from November 2018 to October 2019 collected by Melbourne Water (shown in Table 1), was used to train and develop the LSTM prediction models. The output variable is the biogas production, whereas the remaining variables are treated as inputs for the prediction. Data pre-processing is an important stage in designing ML models, which primarily entails transforming raw data into a clean and usable format. Data normalisation/standardisation is a standard pre-processing technique that brings the values in a dataset onto a common scale without misrepresenting the differences in the ranges of the variables. For RNN/LSTM models, this is a crucial procedure to ensure stability and improve network training and performance [17]. In this study, two common data scaling techniques are applied and investigated: z-score standardisation (standardised variables), which rescales the data to zero mean and unit standard deviation, and min-max normalisation (normalised variables), which linearly transforms the data into the range [-1, 1] [18]. It is normally advantageous to reduce the number of inputs by removing redundant variables, which speeds up the learning algorithm and mitigates overfitting to a degree. A feature selection technique aims to reduce the number of features (inputs), here by using a correlation-based method to filter out correlated variables. In this study, Pearson's correlation coefficient is employed to measure the association between the input variables (refer to Figure 2). Based on the correlation matrix, seven input variables were selected for the reduced-input models.
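As an illustrative sketch under the same MATLAB setting, the two scaling schemes and a correlation-based filter might be implemented as follows; the matrix name X and the 0.9 correlation threshold are assumptions, not values reported in the study.

% Both scaling schemes, assuming the 365-by-14 dataset is loaded into a
% numeric matrix X (rows = daily readings, columns = variables).
Xz = normalize(X, 'zscore');         % zero mean, unit standard deviation
Xn = normalize(X, 'range', [-1 1]);  % linear rescaling into [-1, 1]

% Pearson's correlation between variables; strongly correlated input pairs
% are candidates for removal in the reduced-input models. The 0.9 threshold
% is illustrative only.
R = corrcoef(Xz);                      % 14-by-14 correlation matrix
[i, j] = find(triu(abs(R) > 0.9, 1));  % indices of strongly correlated pairs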

Model Validation
To ensure generalisation beyond the training data, it is essential that part of the dataset is preserved as a test set, unseen by the ML model during training. Otherwise, the model is likely to overfit, yielding high accuracy on the training data but failing to generalise, thereby resulting in poor predictive performance on new data. To evaluate trained ML models, the k-fold cross-validation technique is a common resampling procedure that randomly splits the dataset into k groups, then trains the model on all groups except one that is reserved for testing [19]. Chronologically ordered cross-validation, where the training sets consist of observations that occur prior to those forming the testing set, is more suitable for time-series and sequential data modelling [20]. This study employed an expanding-window (also known as forward-chaining) cross-validation with the dataset partitioned into six nearly evenly sized blocks (approximately 60 data points each), equating to a 5-split (iteration) procedure; refer to Figure 3. The average error values over the 5 splits were used to evaluate the models' overall performance.

Figure 3: Expanding window 5-split time-series cross-validation.
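A minimal sketch of this split scheme, assuming 365 chronologically indexed daily observations, is given below; the block edges are computed here rather than taken from the study.

% Expanding-window (forward-chaining) 5-split cross-validation: the data are
% divided into six consecutive, nearly equal blocks; each split trains on all
% earlier blocks and tests on the next one.
nObs  = 365;
edges = round(linspace(0, nObs, 7));  % boundaries of six ~60-point blocks
for split = 1:5
    trainIdx = 1 : edges(split + 1);                     % all blocks up to the split
    testIdx  = edges(split + 1) + 1 : edges(split + 2);  % the following block
    % ... train the LSTM on trainIdx, evaluate on testIdx,
    %     and accumulate MSE, MAE and R^2 for averaging over the 5 splits ...
end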
The performances of the LSTM models were compared and assessed based on three statistical error measures: the mean square error (MSE), the mean absolute error (MAE) [21] and the coefficient of determination (R²). It should be noted that the error measures are calculated on the standardised and normalised output variables. The training of the neural network used the MSE as the loss function and stochastic gradient descent with momentum as the optimisation algorithm, which introduces additional hyperparameters, the learning rate and the momentum parameter [22], to overcome the slow or non-convergence problems encountered with traditional gradient descent methods.
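For reference, the standard definitions of these measures, with y_i the observed (scaled) biogas production, ŷ_i the corresponding prediction, ȳ the mean of the observations and n the number of test points, are:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]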
In this study, four LSTM models were developed: using standardised data with full and reduced input variables, denoted the SFIV and SRIV models, respectively, and using normalised data with full and reduced input variables, denoted the NFIV and NRIV models, respectively. The study first investigated the optimal training parameters and the number of hidden units of the single LSTM layer for each model. The performances of the best-performing models were then compared based on the error measures.

Results and Discussion
Firstly, the optimum number of epochs (the number of times the whole dataset is passed through the network) was determined by evaluating the average MSE, MAE and R² over the 5 splits. In this investigation, a 10-hidden-unit LSTM layer was considered, with the number of epochs varied from 10 to 2000. The initial learning rate was 0.01, the learning rate schedule was set to piecewise, where the learning rate was reduced by a factor of 0.2 after half of the epochs had passed, and the gradient threshold (clipping) was 1. The optimal number of epochs for both the SFIV and SRIV models is 50, and for the NFIV and NRIV models 100 and 200, respectively; refer to Table 2. Training below or beyond these optimal epoch values is likely to underfit or overfit the models. It should be noted that inspection of the learning curves (loss over epochs/iterations) is necessary to monitor the behaviour of the model and ensure convergence.

Based on the error measures, the optimal number of hidden units in the single LSTM layer is 11 and 6 for the SFIV and SRIV models, respectively, and 6 and 9 for the NFIV and NRIV models, respectively, as indicated in Table 3. The coefficients of determination were compared to evaluate the models with the differently pre-processed data. The NFIV and NRIV models performed relatively poorly compared with those trained on data pre-processed with z-score standardisation, as shown in Tables 3 and 4 and evident in Figures 4 and 5. This is because extreme outliers exist in the historical data and are retained during the learning process. In practice, data standardisation is more robust in handling outliers/extremities or non-uniformly distributed data than min-max normalisation. Furthermore, the models with normalised variables took relatively longer to train.

The best LSTM model is the SRIV model, and the other best-performing models can also generalise and predict with good accuracy after the last split (refer to Table 4). The substantial improvement in the models' performance on the last split suggests that most of the dependencies and key features can be learnt from the first four data partitions.
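Under the stated settings, the training configuration for one such run could be sketched in MATLAB as follows; maxEpochs stands for the value being tuned, and the momentum value of 0.9 (MATLAB's default) and the 'never' shuffle option are assumptions rather than reported settings.

% Training configuration matching the settings reported above: SGD with
% momentum, initial learning rate 0.01, piecewise schedule dropping the rate
% by a factor of 0.2 after half the epochs, and gradient clipping at 1.
maxEpochs = 50;  % e.g. the optimum found for the SFIV/SRIV models
options = trainingOptions('sgdm', ...
    'Momentum',            0.9, ...       % assumed (MATLAB default)
    'MaxEpochs',           maxEpochs, ...
    'InitialLearnRate',    0.01, ...
    'LearnRateSchedule',   'piecewise', ...
    'LearnRateDropFactor', 0.2, ...
    'LearnRateDropPeriod', round(maxEpochs/2), ...
    'GradientThreshold',   1, ...
    'Shuffle',             'never');      % preserve chronological order

% XTrain: numFeatures-by-numTimeSteps matrix; YTrain: 1-by-numTimeSteps vector
net = trainNetwork(XTrain, YTrain, layers, options);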

Conclusion
This study has reported on the development of an ANN architecture for predicting the performance of the WTP anaerobic reactor. With the currently available dataset, the findings have shown that an LSTM network can be utilised to predict biogas production. The prediction model with data standardisation yields higher accuracy and outperforms min-max normalisation on the WTP historical dataset, which includes significant outliers. The LSTM model trained on standardised data with reduced input variables yields the best average performance over all splits. It is also shown that the LSTM prediction model with a reduced number of input variables, selected via Pearson's correlation coefficient, can achieve good accuracy given sufficient training data. Ongoing studies on the data preparation and the development of ML algorithms and architectures for WTP performance forecasting, as well as the monitoring of the floating covers' structural integrity, are underway to integrate AI-enabled approaches into the future management and operation of this critical asset.