Recurrent Neural Network Training using ABC Algorithm for Traffic Volume Prediction

This study evaluates the use of the Artificial Bee Colony (ABC) algorithm to optimize the Recurrent Neural Network (RNN) that is used to analyze traffic volume. Related studies have shown that Deep Neural Networks are superseding the Shallow Neural Networks especially in terms of performance. Here we show that using the ABC algorithm in training the Recurrent Neural Network yields better results, compared to several other algorithms that are based on statistical or heuristic techniques that were preferred in earlier studies. The ABC algorithm is an example of swarm intelligence algorithms which are inspired by nature. Therefore, this study evaluates the performance of the RNN trained using the ABC algorithm for the purpose of forecasting. The performance metric used in this study is the Mean Squared Error (MSE) and ultimately, the outcome of the study may be generalized and extended to suit other domains.


Introduction
The Artificial Bee Colony (ABC) algorithm is based on the intelligent foraging behavior of the honey-bee swarm, which makes it suitable for optimization problems [14]. In his proposal of the ABC algorithm, Karaboga aimed to solve multi-dimensional and multi-modal optimization problems [12]. A function is considered to be multi-modal if it has several local optima, and multi-dimensional if it depends on more than one variable. When the local optima are distributed randomly across such a search space, finding the optimal solution becomes considerably harder. The ABC algorithm has been applied to many kinds of real-world problems, such as the leaf-constrained minimum spanning tree problem, the flow shop scheduling problem, the inverse analysis problem and the radial distribution system network reconfiguration problem, among others [21], [29].
Basturk and Karaboga [1] evaluated the ABC algorithm based on five multi-dimensional benchmark functions: sphere function, Rosenbrock Valley, Griewank function, Rastrigin function and Step function. The results obtained show that the ABC algorithm is quite robust for multi-modal problems, since it has multi-agents that work independently and in parallel. This is also echoed by the results they obtained after comparing the performance of the ABC with that of the Particle Swarm Optimization algorithm, Particle Swarm Inspired Evolutionary Algorithm and Genetic Algorithm [14].
Karaboga et al. [17] used the ABC algorithm to train Feed-Forward Artificial Neural Networks with the aim of overcoming drawbacks such as getting stuck in local minima and computational complexity. They found that the algorithm had good exploration and exploitation capabilities, especially in searching for the optimal weight set, which is crucial in training Neural Networks. In this case, exploration refers to the ability to examine numerous unknown regions of the search space in order to discover the global optimum, and exploitation refers to the ability to use knowledge of preceding good solutions to find improved solutions.
The data used in this study in the evaluation of the optimized neural network represents the vehicle count at specific junctions of select motorways in the whole of Britain. However, the optimized neural network can be trained for any other road network whose data is available.
The rest of this paper is organized as follows: Section 2 gives an overview of swarm intelligence, and Section 3 explains the fundamental concept of the ABC algorithm. Section 4 looks at the implementation of the ABC algorithm in optimizing the Recurrent Neural Network. Section 5 presents the experiments and results. Finally, Section 6 summarizes the findings of this paper.

Swarm intelligence
Swarm intelligence refers to the collective intelligence exhibited by the collaborative behavior of social insect colonies or animal societies in pursuit of a defined purpose. The collaborating entities form a swarm, which is alternatively defined as a set of agents that act on their environment with the aim of solving a distributed problem [23]. These entities work together towards a common goal, thus increasing their chances of finding the best or optimal solution to the task at hand. In so doing, they enhance the exploration and exploitation of their environment. Furthermore, this process breaks the problem down into smaller and simpler tasks that are easily solved by sub-groups, whose solutions are aggregated to formulate the overall solution. The time needed to find a solution can therefore decrease as more agents are involved, in part because some of these smaller tasks can be solved concurrently. Dedicating such agents to a single, simplified and well-defined task also reduces the errors that may occur when a single agent is tasked with the whole problem. The collective effort is therefore useful in cases where a problem can be compartmentalized into smaller manageable tasks.
Examples of swarm intelligence algorithms include Artificial Bee Colony, Ant Colony Optimization, Particle Swarm Optimization, Immune Algorithm, Bacterial Foraging Optimization, Cat Swarm Optimization, Cuckoo Search Algorithm, Firefly Algorithm and Gravitational Search Algorithm, among others [15], [23]. These algorithms reflect the variety of swarms found in nature and their varied levels of intelligence, but self-organization and division of labor are key features they all share.

Artificial bee colony algorithm
The ABC algorithm is a swarm-based algorithm presented by Karaboga [12]. This algorithm is inspired by the intelligent search behavior of honeybees, known for their systematic collection of nectar that they process into honey. Nectar (food) is collected from flowers located in the neighboring fields (food sources) away from their hives. The bees communicate with each other by means of a waggle dance so as to share information about the quality of food sources. The information shared among the colony members includes the location of the food source and its proximity to the hive, the quality of the food source and the quantity of food. This information largely governs the foraging range, enabling the swarm to direct its efforts to the best food source. Their mutual dependence rests on their distinct but partially evolving roles that adapt to the needs of the colony. The needs of the colony, decentralized decision-making, the age of the bees and their physical structure serve as controls for their social life. Therefore, self-organization, autonomy, distributed functioning and division of labor constitute the swarm's ability to solve distributed problems as a unit and adapt to any environment [23], [24], [27].
The intelligence exhibited by the collective behaviour of swarms via local interactions may be characterized by four distinctive features. The first is positive feedback, which refers to the creation of convenient structures such as recruitment and reinforcement. The second is negative feedback, which counterbalances the positive feedback in order to stabilize the collective pattern and avoid saturation. The third is fluctuations, which are variations in the form of errors and random task switching among swarm individuals that stimulate creativity and the discovery of new structures. The last is multiple interactions, which refer to the relationship and cooperation between the various agents in the swarm that result in the overall development [17], [18].
The honeybee forage selection model is based on three components: food sources (alternative solutions), employed foragers (active solution seekers) and unemployed foragers (passive solution seekers) made up of onlookers and scouts. In addition, two leading modes of the behavior are expressed: recruitment to a food source and abandonment of a food source. Thus, the position of a food source represents a potential solution to the optimization problem and the quantity of a food source corresponds to the calculated fitness value of the associated solution [12], [13], [14], [26].
In essence, food sources signify the profitability of the proposed solution in terms of the complexity involved in attaining it. This complexity is evaluated based on proximity, ease of extraction and energy concentration, and is expressed as a probability value. Employed foragers are associated with a particular food source, that is, a solution they are working on, whereas the unemployed foragers are looking for potential food sources to exploit, that is, alternative solutions. Thus, the scouts find alternative food sources while the onlookers establish viable solutions from the information given to them by the employed foragers through the waggle dance.
At the beginning, the number of employed bees is equal to the number of available food sources. An employed bee turns into a scout when its food source fails to improve within a predetermined limit of foraging attempts, at which point exploitation of that source ceases. Thus, the employed and onlooker bees usually perform the exploitation whereas the scouts perform the exploration of the search space. This process of foraging can be viewed as a complex problem broken down into many parts, and the ultimate task is to find a viable solution since there are many ways of reaching the goal [9], [18], [23]. Let us examine figure 1, as illustrated by Karaboga [12], for a better understanding of this foraging behaviour.
Figure 1: The honeybee nectar foraging behavior [12].
In figure 1 above, there are two discovered food sources: A and B. Any potential forager always starts as an unemployed forager and has no knowledge of the food sources around the nest. This limits the options for such a bee to the following:
i. To become a scout and instinctively start searching around the nest for food (S).
ii. To become a recruit after watching the waggle dances for the available food sources (R).
This bee then evaluates the available food sources, memorizes a food source location and immediately starts exploiting it, thus becoming an employed forager. The foraging bee takes a load of nectar from the source and unloads it to a food store back in the hive, after which it takes on one of the three roles below:
i. It recruits other bees (onlookers) and returns to the same food source (EF1).
ii. It continues to forage at the same food source without recruiting other bees (EF2).
iii. It becomes an uncommitted follower after abandoning the food source (UF).
This behaviour forms the basis of the ABC algorithm, which is separated into five distinct phases: the Initialization phase, Employed bee phase, Probabilistic selection phase, Onlooker bee phase and Scout bee phase [12], [23]:
i. Initialization Phase
The Food Source locations are randomly initialized within the search space using equation (1) below.
$$x_{ij} = x_j^{\min} + \mathrm{rand}(0, 1)\,\left(x_j^{\max} - x_j^{\min}\right) \tag{1}$$
where i = 1, 2, …, SN and SN indicates the number of Food Sources (equal to half of the bee colony); j = 1, 2, …, D and D is the dimension of the problem; $x_{ij}$ represents the parameter for the i-th employed bee on the j-th dimension (each employed bee is tied to one Food Source); $x_j^{\max}$ and $x_j^{\min}$ are the upper and lower bounds of $x_{ij}$.
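The initialization in equation (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `init_food_sources`, the bounds and the seed are all hypothetical.

```python
import random

def init_food_sources(sn, lower, upper, rng=None):
    """Place SN food sources uniformly at random inside [lower_j, upper_j] per dimension."""
    if rng is None:
        rng = random.Random()
    dim = len(lower)
    return [
        # Equation (1): x_ij = x_j_min + rand(0, 1) * (x_j_max - x_j_min)
        [lower[j] + rng.random() * (upper[j] - lower[j]) for j in range(dim)]
        for _ in range(sn)
    ]

# Hypothetical 3-dimensional search space with 5 food sources.
lower_bounds = [-1.0, 0.0, 10.0]
upper_bounds = [1.0, 2.0, 20.0]
sources = init_food_sources(5, lower_bounds, upper_bounds, rng=random.Random(0))
```

Each inner list is one candidate solution; when the algorithm trains a neural network, such a list would hold one full set of weights and biases.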
ii. Employed Bee Phase
Every Employed Bee is assigned to the Food Source generated by equation (2) below for further exploitation.
$$v_{ij} = x_{ij} + \phi_{ij}\,\left(x_{ij} - x_{kj}\right) \tag{2}$$
where k is a neighbor of i, i ≠ k; $\phi_{ij}$ is a random number in the range [−1, 1] that controls the production of neighbor solutions around $x_{ij}$; $v_{ij}$ is the new solution for $x_{ij}$. The value of the new Food Source is measured using a fitness value calculated by equation (3) below.
$$fit_i = \begin{cases} \dfrac{1}{1 + f_i}, & f_i \ge 0 \\ 1 + \mathrm{abs}(f_i), & f_i < 0 \end{cases} \tag{3}$$
where $f_i$ is the objective function value associated with each Food Source, abs denotes its absolute value, and $fit_i$ is the fitness value.
The two food sources $x_{ij}$ (Original Food Source) and $v_{ij}$ (New Food Source) are compared and the better one is kept based on a greedy selection of their fitness values.
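One employed-bee move (neighbor generation, fitness evaluation and greedy selection) might look like the sketch below. It assumes the common ABC convention of perturbing a single randomly chosen dimension per move, which the text does not state explicitly; the `sphere` objective and all function names are illustrative only.

```python
import random

def fitness(f_value):
    """Equation (3): map an objective value to a fitness value (higher is better)."""
    return 1.0 / (1.0 + f_value) if f_value >= 0 else 1.0 + abs(f_value)

def employed_bee_step(sources, i, objective, rng):
    """Try one neighbour of source i (equation (2)) and keep the fitter of the two."""
    k = rng.choice([idx for idx in range(len(sources)) if idx != i])  # neighbour, k != i
    j = rng.randrange(len(sources[i]))   # assumption: perturb one dimension per move
    phi = rng.uniform(-1.0, 1.0)
    candidate = list(sources[i])
    candidate[j] = sources[i][j] + phi * (sources[i][j] - sources[k][j])
    improved = fitness(objective(candidate)) > fitness(objective(sources[i]))
    if improved:                          # greedy selection
        sources[i] = candidate
    return improved

def sphere(x):
    """Toy objective used only for this illustration."""
    return sum(v * v for v in x)

demo_sources = [[1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
employed_bee_step(demo_sources, 0, sphere, random.Random(1))
```

Because the candidate replaces the original only when its fitness is strictly higher, the objective value stored at a food source never gets worse.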
iii. Probabilistic Selection Phase
Then, a probability value for each Food Source is calculated using equation (4), which the Onlooker Bees use when they evaluate the viability of a Food Source amongst the available options.
$$p_i = \frac{fit_i}{\sum_{n=1}^{SN} fit_n} \tag{4}$$
where $fit_i$ is the fitness value of the i-th solution and $p_i$ is the selection probability of the i-th solution.
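As a small worked example of equation (4), the sketch below normalizes a list of fitness values into selection probabilities and then draws one food source. Roulette-wheel sampling is assumed here as the selection mechanism; the text only states that fitter sources are more likely to be chosen. The fitness values are made up for illustration.

```python
import random

def selection_probabilities(fitness_values):
    """Equation (4): p_i = fit_i / sum of all fit_n."""
    total = sum(fitness_values)
    return [fit / total for fit in fitness_values]

def roulette_pick(probabilities, rng):
    """Pick an index with probability proportional to its entry (assumed mechanism)."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if r < cumulative:
            return index
    return len(probabilities) - 1  # guard against floating-point round-off

probs = selection_probabilities([1.0, 0.5, 0.25, 0.25])
chosen = roulette_pick(probs, random.Random(0))
```

With these four fitness values the probabilities come out as 0.5, 0.25, 0.125 and 0.125, so the first food source attracts about half of the onlookers.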
iv. Onlooker Bee Phase
The Employed Bees advertise the viability of their Food Sources to the Onlooker Bees, which select a Food Source to exploit based on the fitness and probability values associated with it; that is, the higher the fitness, the higher the probability. The Food Sources that are picked are further exploited using equation (2), and the fitness values of the resulting solutions are calculated using equation (3). Once again, a greedy selection is performed between the original and new Food Sources, as in the Employed Bee Phase, so that the better solution is kept.
v. Scout Bee Phase
The Employed Bee of a Food Source that does not yield better results over time becomes a Scout Bee, and the Food Source is abandoned. A new Food Source is then generated randomly in the search space using equation (1). Subsequently, the Employed bee phase, Probabilistic selection phase, Onlooker bee phase and Scout bee phase are repeated until the termination criterion is satisfied. The best food source found is returned as the output. Note that the steps of the algorithm presented in Section 4 are more elaborate than this summary [12], [13], [15], [18].
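The abandonment rule of the scout bee phase can be sketched with a per-source trial counter. This is an illustrative sketch only: the counter list `trials`, the threshold name `limit` and the choice to replace every exhausted source in one pass (many ABC variants replace at most one per cycle) are assumptions.

```python
import random

def scout_phase(sources, trials, limit, lower, upper, rng):
    """Replace each source whose failed-attempt counter exceeds `limit` with a random one."""
    replaced = []
    for i, count in enumerate(trials):
        if count > limit:
            # Re-initialize inside the bounds, in the same way as equation (1).
            sources[i] = [lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]
            trials[i] = 0
            replaced.append(i)
    return replaced

scout_sources = [[0.5, 0.5], [0.1, 0.1]]
scout_trials = [3, 11]   # hypothetical counts of consecutive failed improvements
abandoned = scout_phase(scout_sources, scout_trials, limit=10,
                        lower=[-1.0, -1.0], upper=[1.0, 1.0],
                        rng=random.Random(0))
```

Here only the second source has exceeded the limit, so it alone is re-initialized and its counter is reset.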

RNN training using ABC algorithm
Artificial Neural Networks are based on the simulated network of biological neurons in which neurons are the essential computational units [22]. Hence, the underlying concept is to train a mathematical model so that it can reproduce some physical phenomena or make some predictions. The model is presented with training samples that are the actual outputs of the studied system corresponding to the actual inputs of the problem. Later, the error obtained between the actual and the predicted value serves as the metric for measuring the performance of the algorithm in terms of prediction [5]. Artificial Neural Networks can broadly be categorized into Shallow Neural Network and Deep Neural Network techniques. Shallow Neural Networks generally have only one hidden layer as opposed to Deep Neural Networks which have several levels of hidden layers. Therefore, Deep Neural Networks utilize functions whose complexity is of a higher magnitude contrary to Shallow Neural Networks, given that all resources remain constant [3].
Shallow Neural Network (SNN) techniques contain fewer than two layers of nonlinear feature transformations. Examples of SNN techniques are Conditional Random Fields (CRFs), Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), Maximum Entropy (MaxEnt) models, Logistic Regression, Kernel Regression, and Multi-Layer Perceptrons (MLPs) with a single hidden layer, including Extreme Learning Machines (ELMs). SNN techniques solve well-constrained problems effectively, but their limited modeling and representational power poses a challenge when dealing with complicated real-world applications. A well-constrained problem is one for which a function is to be minimized or maximized with respect to well-defined constraints [3], [6].
So, the basic concept behind Artificial Neural Networks owes to their imitation of biological neurons as shown in figure 2 which is an elementary neuron with several inputs and one output. Here, each input x is fed to the next layer, in our case an output layer y, with an appropriate weight w. The sum of the weighted inputs and the bias forms the input to the transfer function f. The bias is a threshold that represents the minimum level that a neuron needs for activating and is represented by b.
Neurons can use any differentiable transfer function f to generate their output. In multi-layer networks, the input values are applied to the inputs of the first layer, the signals are allowed to propagate through the network, and the output values are then read. The output of the i-th node can be described by equation (5) below [25], [28].
$$y_i = f_i\left(\sum_{j=1}^{n} w_{ij}\,x_j + b_i\right) \tag{5}$$
where $y_i$ is the output of the node; $x_j$ is the j-th input to the node; $w_{ij}$ is the connection weight between the node and input $x_j$; $b_i$ is the threshold (or bias) of the node; $f_i$ is the node transfer function.
Multilayer networks often use the sigmoid transfer function, which generates outputs between 0 and 1 as the neuron's net input goes from negative to positive infinity. It is used for models that have to predict a probability as an output, and it is suitable because probabilities lie in the range 0 to 1. Sigmoid output neurons are often used for pattern recognition, clustering and prediction problems.
The information from a layer to the next one is transmitted by means of the activation function, represented in equation (6). The activation function relies on the weighted sum and bias to make a calculation on whether a neuron will be activated or not, thus introducing non-linearity to the network. This non-linear transformation performed on the inputs and sent through the network enables it to learn and perform complex tasks.
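The computation performed by a single neuron (the weighted sum of the inputs plus the bias, passed through a transfer function) can be sketched as follows. The logistic form of the sigmoid is assumed here because the text names the sigmoid without giving its formula; the function names and the example values are illustrative.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def neuron_output(inputs, weights, bias, transfer=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the transfer function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(z)

# Net input is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
y = neuron_output([1.0, 2.0], [0.5, -0.25], bias=0.0)
```

Any other differentiable transfer function can be supplied through the `transfer` argument.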
The main goal is to minimize the cost function by optimizing the network weights. The fundamental idea of this optimization approach is to individually interpret and change the weight values. Also, note that dynamic environments present a relatively higher network complexity, which suggests the need for Deep Neural Networks. The data presented to the network therefore has to be split into three sets: the training set, the validation set and the testing set. This facilitates the training, verification and evaluation of the network's performance. Furthermore, the complexity of the challenge is represented by the Mean Squared Error (MSE) in equation (7). The MSE, obtained by comparing the target output against the predicted output, can also inform the choice of the number of hidden layers. The optimization of the network is achieved by minimizing the MSE, which is essentially a network error function. The training algorithm is then used to find the optimal weights with which the Neural Network is initialized. In this case, the ABC algorithm is used to find the weights that enable the network connections to make accurate decisions. The algorithm uses a cost function as a measure of progress in determining the right weights [19], [25].
$$E(w(t)) = \frac{1}{n}\sum_{j=1}^{n}\sum_{k=1}^{K}\left(d_k - o_k\right)^2 \tag{7}$$
where $E(w(t))$ is the error at the t-th iteration; $w(t)$, the weights in the connections at the t-th iteration; $d_k$ and $o_k$ represent the desired and the actual values of the k-th output node; $K$ is the number of output nodes; $n$ is the number of inputs.
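The error measure described above can be computed as in this short sketch. It treats the data as a list of input patterns, each with one value per output node; the function name and the sample numbers are illustrative and are not taken from the traffic dataset.

```python
def mean_squared_error(desired, actual):
    """Sum squared output-node errors over all patterns, then divide by the pattern count.

    desired, actual: lists of patterns, each pattern a list with one value per output node.
    """
    n = len(desired)
    total = 0.0
    for d_pattern, o_pattern in zip(desired, actual):
        total += sum((d - o) ** 2 for d, o in zip(d_pattern, o_pattern))
    return total / n

# Three patterns, one output node each: errors are 0.0, -0.5 and 1.0.
mse = mean_squared_error([[1.0], [2.0], [3.0]], [[1.0], [2.5], [2.0]])
```

For this toy data the result is (0 + 0.25 + 1) / 3, roughly 0.417; this is the quantity a training algorithm would try to drive down.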
A Recurrent Neural Network (RNN) is an extension of the conventional feed-forward neural network described above. The major difference is that RNNs have cyclic connections, which make them reliable for modeling time-series data in dynamic environments. This means that at any given time step the output is related to the present input and the inputs at previous timestamps. We therefore build on the concept of the elementary neuron above in relation to the RNN. Here, we have the input sequence denoted by x = (x1, x2, ..., xT), the hidden vector sequence denoted by h = (h1, h2, ..., hT) and the output vector sequence denoted by y = (y1, y2, ..., yT). Usually the RNN calculates the hidden vector sequence h using equation (8) and the output vector sequence y using equation (9) with t = 1 to T [20]:
$$h_t = \mathcal{H}\left(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h\right) \tag{8}$$
$$y_t = W_{hy}\,h_t + b_y \tag{9}$$
where $\mathcal{H}$ is the activation function; W is a weight matrix; b is the bias term. However, the Long Short-Term Memory (LSTM) architecture is preferable because it resolves the underlying vanishing and exploding gradient problems of the traditional RNN. The LSTM-RNN uses three gates that form a cell, which solves the problems mentioned above and makes the network robust. Thus, the LSTM cell replaces the recurrent hidden cell in equation (8) above. The equations to compute the values for the three gates are described below [11], [20].
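The recurrence of the plain RNN (before the LSTM cell is substituted) can be sketched in pure Python for a scalar input and a scalar output. The use of tanh as the hidden activation, the weight names and the example values are assumptions made for illustration only.

```python
import math

def rnn_forward(xs, w_xh, w_hh, b_h, w_hy, b_y):
    """Run a simple RNN over the sequence xs and return the outputs y_1..y_T.

    w_xh: input-to-hidden weights (one per hidden unit)
    w_hh: hidden-to-hidden weight matrix (list of rows)
    b_h:  hidden biases; w_hy: hidden-to-output weights; b_y: output bias
    """
    size = len(b_h)
    hidden = [0.0] * size                      # h_0 is assumed to be all zeros
    outputs = []
    for x in xs:
        # Hidden update: new state depends on the current input and the previous state.
        hidden = [
            math.tanh(w_xh[i] * x
                      + sum(w_hh[i][j] * hidden[j] for j in range(size))
                      + b_h[i])
            for i in range(size)
        ]
        # Output: a linear read-out of the hidden state.
        outputs.append(sum(w_hy[i] * hidden[i] for i in range(size)) + b_y)
    return outputs

# One hidden unit with no recurrence: the outputs are tanh(0.0) and tanh(1.0).
ys = rnn_forward([0.0, 1.0], w_xh=[1.0], w_hh=[[0.0]], b_h=[0.0], w_hy=[1.0], b_y=0.0)
```

Setting a non-zero entry in `w_hh` is what lets earlier inputs influence later outputs.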
In the LSTM-RNN, the input gate i, the forget gate g, and the output gate o control the information flow. The input gate decides the ratio of the input that has an effect when calculating the cell state, c. The forget gate calculates the ratio of the previous memory $h_{t-1}$ using equation (11) and decides whether to pass it onwards or not. The result obtained is used for determining the cell state in equation (12). The output gate, which is based on equation (13), determines whether to pass on the output of the memory cell or not. This process, as represented by the ratios from the three gates, is denoted by equation (14) and is also depicted diagrammatically in figure 3 [20].
The optimization process for the deep neural network using the ABC algorithm [10], [12], [19], [25] uses the following notation: $fit_i$ is the fitness value of $x_i$; $v_i$ indicates a neighbor solution of $x_i$; $p_i$ is the probability value of $x_i$; $MCN$ is the maximum cycle number in the algorithm. Remember that at the beginning, one half of the colony consists of onlooker bees and the second half consists of the employed bees, whose number equals the number of food sources (viable solutions), and any employed bee whose food source has been exhausted becomes a scout bee. The algorithm therefore starts by generating a randomly distributed initial population of $SN$ solutions (food source positions), where $SN$ denotes the size of the population. Each solution $x_i$ (i = 1, 2, ..., SN) is a D-dimensional vector, D being the number of optimization parameters. After initialization, the population of solutions is subjected to repeated cycles, C = 1, 2, ..., MCN, of the search process until a termination criterion is met. Each cycle of the search consists of three steps: engaging the employed bees with their food sources and evaluating their viability; sharing the viability information of the food sources with the onlookers, which select a food source and again assess its viability; and determining the scout bees and sending them out randomly to explore new food sources.
An employed bee modifies the solution in its memory depending on the probability and fitness tests. This generates weights that minimize the cost function, and with each cycle the RNN is trained with varying parameters using the ABC algorithm until optimal conditions are met [10], [18], [19].
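The search cycle described above can be sketched end to end on a toy problem. The sketch below is not the authors' MATLAB implementation: it trains a one-unit plain RNN (not an LSTM) on a synthetic sine series, and the names `abc_minimize`, `rnn_mse`, the colony size, the `limit`, the cycle count and the clipping of weights to a fixed bound are all assumptions chosen to keep the example short.

```python
import math
import random

def rnn_mse(weights, series):
    """One-step-ahead MSE of a one-unit RNN; weights = [w_x, w_h, b_h, w_y, b_y]."""
    w_x, w_h, b_h, w_y, b_y = weights
    h, total = 0.0, 0.0
    for x, target in zip(series[:-1], series[1:]):
        h = math.tanh(w_x * x + w_h * h + b_h)
        total += (target - (w_y * h + b_y)) ** 2
    return total / (len(series) - 1)

def fitness(cost):
    return 1.0 / (1.0 + cost) if cost >= 0 else 1.0 + abs(cost)

def abc_minimize(objective, dim, sn=10, limit=20, max_cycles=200, bound=2.0, seed=0):
    rng = random.Random(seed)

    def new_source():
        return [rng.uniform(-bound, bound) for _ in range(dim)]

    sources = [new_source() for _ in range(sn)]     # initial food sources
    costs = [objective(s) for s in sources]
    trials = [0] * sn                               # failed attempts per source
    start = min(range(sn), key=costs.__getitem__)
    best, best_cost = list(sources[start]), costs[start]

    def try_neighbour(i):
        k = rng.choice([m for m in range(sn) if m != i])
        j = rng.randrange(dim)
        cand = list(sources[i])
        cand[j] += rng.uniform(-1.0, 1.0) * (sources[i][j] - sources[k][j])
        cand[j] = max(-bound, min(bound, cand[j]))  # keep the weight inside the bounds
        cost = objective(cand)
        if fitness(cost) > fitness(costs[i]):       # greedy selection
            sources[i], costs[i], trials[i] = cand, cost, 0
        else:
            trials[i] += 1

    for _ in range(max_cycles):
        for i in range(sn):                         # employed bees
            try_neighbour(i)
        fits = [fitness(c) for c in costs]
        for _ in range(sn):                         # onlookers, fitness-proportional choice
            try_neighbour(rng.choices(range(sn), weights=fits)[0])
        leader = min(range(sn), key=costs.__getitem__)
        if costs[leader] < best_cost:               # memorize the best solution so far
            best, best_cost = list(sources[leader]), costs[leader]
        worst = max(range(sn), key=trials.__getitem__)
        if trials[worst] > limit:                   # scout replaces an exhausted source
            sources[worst] = new_source()
            costs[worst] = objective(sources[worst])
            trials[worst] = 0
    return best, best_cost

series = [math.sin(0.3 * t) for t in range(40)]     # synthetic stand-in for traffic counts
best_weights, best_error = abc_minimize(lambda w: rnn_mse(w, series), dim=5)
```

Each food source is a full weight vector for the network and the MSE plays the role of the objective, so no gradients are needed; the cost is one forward pass over the series per candidate.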

Experiments and results
During the training phase, the Recurrent Neural Network is presented with a set of training data from the dataset, and the input weights are adjusted using the ABC algorithm as the learning algorithm. The dataset can be acquired from the Road Traffic Statistics website for Great Britain [7]. The purpose of the weight adjustment is to enable the RNN to learn so that it adapts to the given training data [10]. The dataset also has to be split in a suitable ratio to enable the training of the network and the validation and testing of the results obtained. Thereafter, the performance of the network is evaluated based on the Mean Squared Error (MSE) between the desired output and the actual output, thus testing the validity of the network in terms of its prediction efficiency. Figure 4 depicts the performance graph obtained on execution of the algorithm in MATLAB [2] and represents the best validation performance of the network. Over several runs of the algorithm, the MSE obtained was 1.1232e3. This value was obtained at epoch 9. Beyond that point the error would be expected to rise gradually due to overfitting, but in this case it settles at a roughly constant level. In other experiments, the MSE of the RNN before it was optimized was 3.853e3 [4]. The difference between the two MSEs indicates that the ABC algorithm is effective as an optimizer. Generally, lower MSEs translate to higher accuracy. Graphically, this is seen in the regression plots in figure 5, which shows the respective regression values of the three sets of the dataset. Splitting the dataset supports early stopping of the network, which helps it to generalize. The three sets of data all obtain a regression value greater than 0.9, and the aggregate regression value is 0.93625, which is close to 1.
This shows a strong relationship between the desired outputs and the obtained outputs, which indicates high accuracy in the network's ability to forecast. Furthermore, there is a high cross-correlation between the input data and the error time-series as depicted in the graph in figure 6 below.
Figure 6: Correlation between the input and the output error.
Figure 6 above means that the network is able to model the predictive characteristics of the time-series lag, which is the difference between the expected and the actual values. This correlation is depicted in the figure, and the values fall within the acceptable confidence limits shown by the dotted red line. This is further exemplified in the time series plot of figure 7, which shows the desired output values plotted against the actual values obtained by the RNN optimized by the ABC algorithm. This time-series graph shows the level of accuracy that can be obtained during prediction with a well-trained RNN. The efficacy of the ABC algorithm is also visible in the graph despite one incorrectly predicted value. The other values fall within the confidence limits, and as such, with further training and fine-tuning of the parameters, the RNN can produce reliable results. This means that the generalized model can produce accurate predictions.
Table 1: The values of the RNN after optimization.
The results in table 1 above reflect the MSEs obtained during the training, validation and testing phases of the experiment. The optimal MSE for the training phase was 359.8409, the validation phase had an optimal MSE of 569.0172 and the optimal MSE at the test phase was 673.4512. These values show that the error rate reduced gradually with an increase in the number of hidden layers. Other experiments have been performed using other algorithms for optimization of the deep neural networks. These algorithms include the Levenberg-Marquardt Backpropagation algorithm, which had an optimal MSE of 360.2578 at the training phase, the Scaled Conjugate Gradient Backpropagation algorithm, which had a least MSE of 480.9656 at the validation phase, and the Resilient Backpropagation algorithm, which had an MSE of 467.9015 at the testing phase [4].
The ABC-trained RNN has peak performance when the hidden layer size is between 40 and 80, given an input vector size of 500. In comparison to the aforementioned training algorithms, the ABC algorithm surpasses the other training algorithms in similar conditions.

Conclusion
It is evident from the results that the ABC algorithm outperforms the backpropagation algorithms. However, the parameter settings for the algorithm need to be refined for the model to be generalized. Moreover, different architectures of other deep neural networks can be implemented especially in distributed computing environments so as to sustain a greater number of the hidden layers or even produce a sustainable hybrid thereof. Furthermore, deep neural networks need to be optimized so as to enhance the practicability of a model that yields reliable forecasting in dynamic environments.