Improvement of the Deep Forest Classifier by a Set of Neural Networks

A Neural Random Forest (NeuRF) and a Neural Deep Forest (NeuDF), classification algorithms that combine an ensemble of decision trees with neural networks, are proposed in the paper. The main idea underlying NeuRF is to combine the class probability distributions produced by decision trees by means of a set of neural networks with shared parameters. The networks are trained in accordance with a loss function which measures the classification error. Every neural network can be viewed as a non-linear function of probabilities of a class. NeuDF is a modification of the Deep Forest, or gcForest, proposed by Zhou and Feng, in which the random forests are replaced by NeuRFs. The numerical experiments show that NeuDF outperforms gcForest and that NeuRF is comparable with the random forest.


Introduction
In spite of the intensive development of a huge number of various modern classification models, including the deep learning models, the ensemble methodology remains one of the most efficient approaches for solving machine learning problems. The ensemble learning models are based on constructing multiple classifiers for training data and on aggregating their corresponding predictions in accordance with a certain rule. The final ensemble classifier is represented as a weighted average of outputs of the base or weak classifiers. The weight of each classifier can be viewed as its contribution to the final decision. Several approaches use some functions that combine the outputs from all base classifiers instead of weighted averages. From a statistical point of view, one of the ideas underlying the improvement of the classifier performance by means of the ensemble combinations is based on reduction of variance of the classification error [11]. This occurs because the usual effect of ensemble averaging is the reduction of the variance of a set of classifiers.
Three main techniques of combining the classifiers can be pointed out [44]: bagging, stacking and boosting. Bagging [4] aims to improve accuracy by combining multiple classifiers. One of the most powerful bagging methods is the random forest (RF) method [5], which uses a large number of individual decision trees in order to combine their predictions. Another technique for achieving the highest generalization accuracy in the framework of ensemble-based methods is stacking [41]. This technique is used to combine various classifiers by means of a meta-learner that takes into account which classifiers are reliable and which are not. The best known ensemble-based technique is boosting, which improves the performance of weak classifiers by combining them into a single strong classifier. Both boosting and bagging use voting for combining the classifiers. However, the voting mechanism is implemented differently. In particular, examples in bagging are chosen with equal probabilities, whereas boosting chooses the examples with probabilities that are proportional to their weights [32].
There are several review papers devoted to various approaches based on the combination of classifiers. A detailed analysis of many ensemble-based methods can be found in a review by Ferreira and Figueiredo [14]. The review compares a large number of modifications of boosting algorithms. One of the first books thoroughly studying combination rules for improving classification performance was written by Kuncheva [23]. Further reviews of ensemble-based methods are given by Polikar [30] and by Wozniak et al. [42]. A comprehensive analysis of combination algorithms and their application to machine learning tasks such as classification, regression and clustering can also be found in a review paper written by Rokach [32]. We also have to point out other recent reviews [13,19,31,43]. A detailed description and an exhaustive analysis of most ensemble-based models are given in Zhou's book [44].
One of the most widely used ensemble-based methods, which exhibits extremely high performance, is the RF [4]. It is a classifier consisting of a collection of randomized decision trees. According to the main algorithms for constructing the RF, a certain number of training elements and features are drawn at random with replacement from a training set in order to build every decision tree in the forest. The RF models have been successfully used in various practical problems. Many RF applications and properties of RFs have been reviewed in detail by many authors [2,9,15,27,33].
An interesting new ensemble-based method which can be viewed as a combination of several ensemble-based methods, including the RF and stacking, is proposed by Zhou and Feng [45] and called the Deep Forest (DF) or gcForest. Its structure consists of layers similar to a multilayer neural network structure, but each layer in gcForest contains many RFs instead of neurons. gcForest can be regarded as a multi-layer ensemble of decision tree ensembles. As pointed out by Zhou and Feng [45], gcForest is much easier to train and can work well when there are only small-scale training data, in contrast to deep neural networks, which require great effort in hyperparameter tuning and large-scale training data. Numerous numerical experiments provided by Zhou and Feng [45] illustrated that gcForest outperforms many well-known methods or is at least comparable with them.
Advantages of gcForest motivate us to modify it in order to improve its classification capability. Some improvements have been proposed by Utkin and Ryabinin [37,38,39]. In particular, modifications of the DF for solving the weakly supervised and fully supervised metric learning problems were proposed in [39] and [37], respectively. A transfer learning model using the DF was presented in [38]. The main idea underlying the proposed modifications is to assign weights to decision trees in every RF in order to minimize the corresponding loss functions which depend on the problem solved. The weights are used to replace the standard averaging of the class probabilities for every instance and every decision tree with the weighted average. The weights are regarded as training parameters which can be computed by solving the constrained quadratic optimization problems.
By introducing the tree weights, we simultaneously try to overcome another shortcoming of gcForest. It cannot be fully considered as an alternative to deep neural networks due to its uncontrollability in the sense of defining a goal in tasks different from the standard classification. One of the advantages of neural networks is the flexibility of specifying the error or loss function depending on the data processing task or a specific application. The loss function in the standard classification problem is determined by the difference between a true class label of a training set element and a label computed by means of the forward propagation. The Euclidean distance between the input and output of the network is used in autoencoders. Various types of distances between the probability distributions of the source and target data are used in transfer learning problems. The variety of error functions allows solving a lot of machine learning problems by specifying the required loss function. Therefore, another aim of the modifications is to modify gcForest in order to use different loss functions. We have to point out that the idea of weighting in RFs is also not new. Most weighting RF methods use weights of classes to deal with imbalanced datasets, for example, [10]. At the same time, there are a lot of publications devoted to more complex weight assignments to every tree. In particular, Li et al. [25] propose to assign weights to decision trees according to their classification ability. A similar approach for weighting decision trees is presented by Kim et al. [20]. An interesting study of weighted voting methods in RFs is also given in [34]. The main difference of these methods from the proposed approach is that all the methods use some measures of the classification quality in order to assign the weights. Moreover, these measures are obtained on the basis of testing data.
To the best of our knowledge, there are no methods which consider the weights as training parameters. The proposed approach allows us to select a weighting assignment scheme in a flexible way by using different loss functions for optimization.
The approach using weights of decision trees for computing a target probability class vector for every RF has been shown to outperform gcForest. However, it has some shortcomings. First, the number of weights strongly depends on the number of decision trees in every RF. On the one hand, we increase the number of trees in order to increase the classification accuracy, but the large number of decision trees leads to the same large number of weights. As a result, the number of training parameters is increased and the model may overfit. On the other hand, a reduction of the number of decision trees may lead to a reduction of the classification accuracy. Second, the weighted average used for computing the RF probability class vector is a linear function of the weights. This fact significantly restricts the set of possible solutions and may make the classifier worse.
In order to overcome the above difficulties, we propose to use a neural network of a special form for computing the probability class vectors. The neural network plays the role of a non-linear analog of the linear function of weights. Of course, we no longer have the weights of decision trees in the explicit form. But we get a function which combines the probabilities of every class at the leaf nodes in order to obtain the RF probability class vector. In other words, the neural network plays the role of a non-linear function of weights. It should be noted that the proposed neural network is not standard because we have to identically process the probabilities of every class. This implies that if the number of classes is C, then we construct C identical neural networks with shared parameters. In particular, if the training data have two classes, then the obtained neural network is very similar to the Siamese neural network [6], which has been widely used in many applications (see, for example, [1,8,17]). Outputs of all identical networks for every training instance form the corresponding probability class vector. In fact, the neural networks can be viewed as a non-linear alternative to the weighted sum of probabilities. In particular, this approach coincides with the approach using the weighted averages when activation functions of all units in the neural networks are linear. The proposed combinations of the neural network with the RF and the DF are called NeuRF and NeuDF, respectively.
It should be noted that the idea to jointly use RFs and neural networks is not new. An interesting approach for constructing a denoising RF was proposed by Hibino et al. [16]. Another combination of the RF with the neural network was presented by Kontschieder et al. [21] where an ensemble of random trees is restructured as a collection of random neural networks, which exhibits better generalization performance. The authors of [21] introduced a soft differentiable decision function at the split nodes and a global loss function defined on a tree. Following this approach, several similar models were proposed in [3,18,35,36,40,46]. Maji et al. [28] used a deep neural network for unsupervised learning followed by supervised learning of the deep neural network response using a RF.
In contrast to the above combinations of neural networks and RFs, in the presented paper, we incorporate the neural networks into the DF in order to correct and to control the class vectors at outputs of RFs. Our experiments demonstrate that NeuRF and NeuDF are competitive on many publicly available datasets.

A short introduction to deep forests
One of the important peculiarities of gcForest is its cascade structure proposed by Zhou and Feng [45]. Every cascade is represented as an ensemble of decision tree forests. The cascade structure is a part of a total gcForest structure. It implements the idea of representation learning by means of the layer-by-layer processing of raw features. Each level of the cascade structure receives feature information processed by its preceding level, and outputs its processing result to the next level. The architecture of the cascade proposed by Zhou and Feng [45] is shown in Fig. 1. It can be seen from the figure that each level of the cascade consists of several RFs which generate 3-dimensional class vectors that are concatenated with each other and with the original input. It should be noted that this structure of forests can be modified in order to improve the gcForest for a certain application. After the last level, we have the feature representation of the input feature vector, which can be classified in order to get the final prediction. The gcForest representational learning ability is enhanced by applying the second part of gcForest, the so-called multi-grained scanning. The multi-grained scanning structure uses sliding windows to scan the raw features. Its output is a set of feature vectors produced by sliding windows of multiple sizes. We mainly pay attention to the first part of gcForest because our modification relates to the RFs. Given an instance, each forest produces an estimate of a class distribution by counting the percentage of different classes of examples at the leaf node into which the concerned instance falls, and then averaging across all trees in the same forest, as is schematically shown in Fig. 2. The class distribution forms a class vector, which is then concatenated with the original vector to be input to the next level of the cascade.
The usage of the class vector as a result of the RF classification is very similar to the idea underlying the stacking algorithm [41], which trains the first-level learners using the original training dataset. Then the stacking algorithm generates a new dataset for training the second-level learner (meta-learner) such that the outputs of the first-level learners are regarded as input features for the second-level learner while the original labels are still regarded as labels of the new training data. In contrast to the standard stacking algorithm, gcForest simultaneously uses the original vector and the class vectors (meta-learners) at the next level of the cascade by means of their concatenation. This implies that the feature vector is enlarged after every cascade level. After the last level, we have the feature representation of the input feature vector, which can be classified in order to get the final prediction. Zhou and Feng [45] propose to use different forests at every level in order to provide the diversity which is an important requirement for the RF construction.
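The way a single cascade level turns leaf statistics into augmented features can be sketched in a few lines of Python. This is only an illustrative sketch, not the gcForest implementation: the function names, the toy leaf contents and the 0-based class labels are assumptions made for the example.

```python
def leaf_class_distribution(leaf_labels, n_classes):
    """Fraction of training examples of each class in one leaf (classes are 0-based)."""
    counts = [0] * n_classes
    for label in leaf_labels:
        counts[label] += 1
    total = len(leaf_labels)
    return [count / total for count in counts]


def forest_class_vector(tree_distributions):
    """Average the per-tree class distributions of one forest."""
    n_trees = len(tree_distributions)
    n_classes = len(tree_distributions[0])
    return [
        sum(dist[c] for dist in tree_distributions) / n_trees
        for c in range(n_classes)
    ]


def next_level_input(x, forest_vectors):
    """Concatenate the original features with the class vectors of all forests."""
    augmented = list(x)
    for vector in forest_vectors:
        augmented.extend(vector)
    return augmented


# Hypothetical leaves reached by one instance: two forests with two trees each.
forest_a = [leaf_class_distribution([0, 0, 1, 2], 3),
            leaf_class_distribution([1, 1], 3)]
forest_b = [leaf_class_distribution([2, 2, 2, 0], 3),
            leaf_class_distribution([0], 3)]
class_vectors = [forest_class_vector(forest_a), forest_class_vector(forest_b)]
x_next = next_level_input([5.1, 3.5], class_vectors)
print(len(x_next))  # 2 original features + 2 forests * 3 classes = 8
```

The length of the augmented vector grows by the number of forests times the number of classes at every level, which is the enlargement of the feature vector discussed in this section.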
It is interesting to note that the same architecture of the cascade forest was proposed by Miller et al. [29]. This architecture differs from gcForest in using only class vectors at the next cascade levels without concatenation with the original vector. Miller et al. [29] illustrated by numerical experiments that their approach is comparable to the approach [45]. We have to point out that the cascade structure with neural networks without backpropagation instead of forests was proposed by Hettinger et al. [7].

Weighted averages in forests
One of the ways to improve gcForest is to assign weights to decision trees in every RF. The weights aim to correct the original averaging of class probability distributions over all decision trees in accordance with a predefined objective function. In the standard classification problem, the objective function is the error function or the difference between class labels of training instances and values of the forest class probability distributions. In the metric learning problem, the objective function is the distance between similar and dissimilar instances. Different machine learning problems define the corresponding objective function and the corresponding weights of decision trees.
Our aim is to briefly consider the idea of the weighted average in order to propose the neural networks for processing the class probability distributions. Therefore, we will consider the standard classification problem for simplicity. The classification problem can be formally written as follows. Given $n$ training data (examples, instances, patterns) $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, in which $x_i \in \mathbb{R}^m$ represents a feature vector involving $m$ features and $y_i \in \{1, \ldots, C\}$ represents the class of the associated instance, the task of classification is to construct an accurate classifier $c : \mathbb{R}^m \to \{1, \ldots, C\}$ that maximizes the probability that $c(x_i) = y_i$ for $i = 1, \ldots, n$.
A decision tree in every forest produces an estimate of the class probability distribution $p = (p_1, \ldots, p_C)$. Suppose that all RFs have the same number $T$ of decision trees, every cascade level contains $M$ RFs, and the number of cascade levels is $Q$. The objective function for computing optimal weights is defined as the Euclidean distance between the class vector and a vector whose element with index $y_i$ is 1 and whose other elements are 0. According to [45], the class distribution forms a class vector which is then concatenated with the original vector to be input to the next level of the cascade. Suppose the original vector is $x_i$, and $p_{i,c}^{(t,k,q)}$ is the probability of class $c$ for an instance $x_i$ produced by the $t$-th tree from the $k$-th forest at the cascade level $q$. Since we consider a single RF at some cascade level, we omit the indices $k$ and $q$ corresponding to the forest and the level, respectively. Let us also introduce the notation
$$p_{i,c} = \left(p_{i,c}^{(1)}, \ldots, p_{i,c}^{(T)}\right), \qquad w = (w_1, \ldots, w_T).$$
Here $w_t$ is the weight of the $t$-th tree in the considered forest. Suppose that $\mathbf{1}$ is a vector having $T$ unit elements. Then the $c$-th element $v_{i,c}$ of the class vector produced by the considered forest for the instance $x_i$ is determined in gcForest as
$$v_{i,c} = \frac{1}{T}\, p_{i,c} \cdot \mathbf{1} = \frac{1}{T} \sum_{t=1}^{T} p_{i,c}^{(t)}.$$
The weighted average of the class probability distributions leads to the following class vectors:
$$v_{i,c} = p_{i,c} \cdot w = \sum_{t=1}^{T} p_{i,c}^{(t)} w_t.$$
It follows from the above that gcForest is a special case of the weighting scheme when all weights are $1/T$. An illustration of the weighted averaging is shown in Fig. 3, where we partly modify a picture from [45] in order to show how elements of the class vector are derived as a simple weighted sum. One can see from Fig. 3 that the augmented features $v_{i,c}$, $c = 1, \ldots, C$, corresponding to the $q$-th forest are obtained as weighted sums, i.e., there hold
$$v_{i,1} = 0.4 w_1 + 0.2 w_2 + 1.0 w_3 + 0.0 w_4,$$
$$v_{i,2} = 0.4 w_1 + 0.5 w_2 + 0.0 w_3 + 0.0 w_4,$$
$$v_{i,3} = 0.2 w_1 + 0.3 w_2 + 0.0 w_3 + 1.0 w_4.$$
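For orientation, the weight-fitting problem described above can be written out in one line. The objective follows the description in the text (squared Euclidean distance between the class vector and the indicator vector of the true label); the constraints that the weights are non-negative and sum to one are an assumption made here for illustration and are not stated in this section.

```latex
\min_{w}\ \sum_{i=1}^{n}\sum_{c=1}^{C}
  \bigl( p_{i,c}\cdot w - \mathbb{1}[y_i = c] \bigr)^{2}
\quad \text{subject to} \quad
\sum_{t=1}^{T} w_t = 1, \qquad w_t \ge 0,\ \ t = 1,\ldots,T .
```

Under these assumed constraints the problem is a quadratic program in the $T$ weights, which is why the number of trees directly determines the number of training parameters.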
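The weighted sums for the example of Fig. 3 can be checked numerically. The sketch below uses the tree probabilities read off the equations above; the helper name `weighted_class_vector` and the non-uniform weights are hypothetical and chosen only for illustration.

```python
# Class distributions of the four trees in the example (rows: trees, columns: classes).
tree_probs = [
    [0.4, 0.4, 0.2],
    [0.2, 0.5, 0.3],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]


def weighted_class_vector(tree_probs, weights):
    """Return v_c = sum_t w_t * p_c^(t) for every class c."""
    n_classes = len(tree_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, tree_probs))
        for c in range(n_classes)
    ]


# gcForest corresponds to uniform weights 1/T.
uniform = [1 / len(tree_probs)] * len(tree_probs)
v_gcforest = weighted_class_vector(tree_probs, uniform)

# Hypothetical non-uniform weights that sum to one.
custom = [0.1, 0.2, 0.3, 0.4]
v_weighted = weighted_class_vector(tree_probs, custom)

print([round(v, 3) for v in v_gcforest])  # [0.4, 0.225, 0.375]
print([round(v, 3) for v in v_weighted])  # [0.38, 0.14, 0.48]
```

In both cases the elements of the class vector still sum to one, because every tree distribution sums to one and the weights sum to one.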
It has been mentioned that the use of the weighted averaging significantly improves the DF and allows us to solve various machine learning problems by controlling the objective function for computing optimal weights [37,38]. However, we need a more complex function of the class probability distributions sometimes in order to get superior results. This function can be implemented by means of neural networks which will be considered in the next section.

Neural networks as a function of class probabilities
Let us return to the weighted averaging. The value $v_{i,c}$ can be represented as a function $f$ of probabilities $p_{i,c}$, i.e., $v_{i,c} = f(p_{i,c})$. It is important to point out that the function $f$ does not depend on the class $c$, i.e., it is identical for all classes. Suppose now that the function $f$ is not linear and is implemented by using the neural network. This implies that, for every class, we have to identically transform the vector $p_{i,c}$ in order to get the vector $v_i$ for every forest. It can be done by using $C$ identical neural networks with shared parameters. The input of the $c$-th network is the vector $p_{i,c}$ of the length $T$. The output of the $c$-th network is expected to be 1 if the class label of the $i$-th instance coincides with the number of the network, i.e., if the condition $y_i = c$ is valid; otherwise the output is expected to be 0. The networks are trained on the basis of sets of vectors $p_{i,c}$ obtained for every training example $(x_i, y_i)$, $i = 1, \ldots, n$. The condition for training is that parameters of all networks have to be identical, i.e., the networks are implemented with shared parameters. This implies that all networks are trained simultaneously. Fig. 4 illustrates the use of identical neural networks with shared parameters for computing the class vectors. It can be seen from the picture that the input vector for the first neural network consists of the first class probabilities of the class probability distributions produced by all trees, i.e., it is the vector (0.4, 0.2, 1.0, 0.0). The input vector for the second neural network consists of probabilities of the second class, i.e., it is the vector (0.4, 0.5, 0, 0). The same can be written for the third network input vector. In other words, the $k$-th network uses all probabilities of the $k$-th class. In the case of two classes, we have the standard Siamese neural network [6].
It should be noted that one network, say the last one, is superfluous because the $C$-th element of the vector $v_i$ can be obtained from its other elements under the condition that the sum of all probabilities should be equal to 1. However, we use it in order to compensate for a possible bias of probabilities.
The complete algorithm for training NeuDF is given as Algorithm 1.
Algorithm 1. Training NeuDF
for q = 1, ..., Q do
    for k = 1, ..., M_q do
        Train all trees from the k-th forest at the q-th level in accordance with the gcForest algorithm [45]
        For every x_i, compute C vectors of probabilities p_{i,c}, c = 1, ..., C
        Train C neural networks from the k-th forest at the q-th level
        For every x_i, compute v_i by using the trained neural networks for the k-th forest
    end for
    The concatenated vector x_i is used for the next level
end for

Having the trained NeuDF, we can make a decision about the class of a new example $x$. By using the trained decision trees and the neural networks, the vector $x$ is augmented at each level. Finally, we get the vector $v_i$ of augmented features after the $Q$-th level of the forest cascade corresponding to the original example $x$. The example $x$ belongs to the class $c$ if the sum of the $c$-th elements of all vectors $v_i$ obtained for all RFs and all cascade levels (the total number of vectors is $\sum_{q=1}^{Q} M_q$) is maximal.
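The idea of applying one and the same network to the probabilities of every class can be sketched as follows. This is a minimal pure-Python illustration under several assumptions: a single hidden layer of three sigmoid units, random untrained parameters, 0-based class labels, and hypothetical function names. It shows only the forward pass and the construction of training pairs, not the training procedure itself.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden, seed=0):
    """Random parameters of one small network; the same dict is reused for every class."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(params, p_c):
    """Map the T probabilities of one class to a single output in (0, 1)."""
    hidden = [
        sigmoid(sum(w * p for w, p in zip(row, p_c)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    return sigmoid(sum(w * h for w, h in zip(params["W2"], hidden)) + params["b2"])


def class_columns(tree_probs):
    """Regroup per-tree distributions into one length-T vector per class."""
    n_classes = len(tree_probs[0])
    return [[probs[c] for probs in tree_probs] for c in range(n_classes)]


def class_vector(params, tree_probs):
    """Apply the same network (shared parameters) to every class column."""
    return [forward(params, column) for column in class_columns(tree_probs)]


def training_pairs(tree_probs, label):
    """(input, target) pairs: the target is 1 for the true class and 0 otherwise."""
    return [
        (column, 1.0 if c == label else 0.0)
        for c, column in enumerate(class_columns(tree_probs))
    ]


# Tree distributions from the example of Fig. 4 (rows: trees, columns: classes).
tree_probs = [
    [0.4, 0.4, 0.2],
    [0.2, 0.5, 0.3],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
params = init_network(n_inputs=4, n_hidden=3)
v = class_vector(params, tree_probs)
pairs = training_pairs(tree_probs, label=0)  # the instance belongs to the first class
```

Every training instance thus contributes C input-target pairs to the same set of parameters, which is what sharing the parameters amounts to in practice. The outputs of an untrained network need not sum to one.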
The preliminary numerical experiments show that the proposed combination of the RFs and the neural networks may lead to overfitting. This is caused by a large number of parameters of the neural networks when the number of decision trees is also large, because the number of trees defines the length of the input vector for the neural networks. We have the following contradiction. On the one hand, we try to increase the number of trees in a RF in order to get better results. On the other hand, we have to use in this case a large neural network with many parameters (weights), which may lead to overfitting on a small training dataset. In order to overcome this difficulty, we propose to use small neural networks with an input vector of dimensionality $s$. Here $s$ is a tuning parameter. To this end, all trees are united into $s$ groups. The class probability distribution for every group is determined by averaging all class probability distributions in the group.
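The reduction of the network input from T trees to s groups can be sketched as below. The text does not specify how trees are assigned to groups, so splitting them into consecutive blocks of nearly equal size is an assumption of this sketch, as is the function name `group_average`.

```python
def group_average(tree_probs, s):
    """Split the trees into s consecutive groups and average the class
    distributions inside every group (the grouping rule is an assumption)."""
    n_trees = len(tree_probs)
    if not 1 <= s <= n_trees:
        raise ValueError("s must be between 1 and the number of trees")
    n_classes = len(tree_probs[0])
    grouped = []
    for g in range(s):
        start = g * n_trees // s
        stop = (g + 1) * n_trees // s
        members = tree_probs[start:stop]
        grouped.append([
            sum(m[c] for m in members) / len(members)
            for c in range(n_classes)
        ])
    return grouped


# Hypothetical distributions of six trees for a two-class problem.
tree_probs = [
    [1.0, 0.0], [0.5, 0.5],
    [0.0, 1.0], [0.5, 0.5],
    [0.25, 0.75], [0.75, 0.25],
]
grouped = group_average(tree_probs, s=3)
# Input of the network for the first class: one probability per group.
first_class_input = [group[0] for group in grouped]
print(grouped)            # [[0.75, 0.25], [0.25, 0.75], [0.5, 0.5]]
print(first_class_input)  # [0.75, 0.25, 0.5]
```

With this reduction the size of the network input, and hence the number of network parameters, depends on s rather than on the number of trees.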

Numerical experiments
In order to illustrate NeuRF and NeuDF, we compare them with the gcForest. NeuDF has the same cascade structure as the standard gcForest described in [45]. Each level of the cascade structure consists of 10 RFs. In NeuDF, we do not use the Multi-Grained Scanning part. Three-fold cross-validation is used for the class vector generation. The number of cascade levels is 4.
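The cross-validated generation of class vectors mentioned above can be sketched generically: the class vector of every training instance is produced by a model that has not seen this instance. The interleaved fold assignment and the placeholder learner (class frequencies of the training part, for two classes) are assumptions of this sketch; in the experiments the learner is a forest.

```python
def fold_indices(n_examples, n_folds=3):
    """Interleaved fold assignment (an assumption; any partition would do)."""
    return [list(range(k, n_examples, n_folds)) for k in range(n_folds)]


def out_of_fold_class_vectors(X, y, fit, predict_proba, n_folds=3):
    """For every instance, compute its class vector with a model trained
    on the folds that do not contain this instance."""
    vectors = [None] * len(X)
    for held_out in fold_indices(len(X), n_folds):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            vectors[i] = predict_proba(model, X[i])
    return vectors


# Placeholder learner for two classes: predicts the class frequencies of its training part.
def fit_prior(X_train, y_train):
    return [y_train.count(c) / len(y_train) for c in range(2)]


def predict_prior(model, x):
    return list(model)


X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 0, 1, 1]
vectors = out_of_fold_class_vectors(X, y, fit_prior, predict_prior)
print(vectors[0])  # [0.5, 0.5]
print(vectors[1])  # [0.75, 0.25]
```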
NeuRF and NeuDF are built on a Python implementation of gcForest, available at https://github.com/leopiney/deep-forest, which is used to implement the procedure for computing optimal weights of trees and the corresponding class vectors. The accuracy measure A used in numerical experiments is the proportion of correctly classified cases on a sample of data. To evaluate the average accuracy, we perform a cross-validation with 100 repetitions, where in each run we randomly select $N$ training data and $N_{\text{test}} = 3N/4$ test data.
The neural network in most numerical experiments consists of two hidden layers (four layers in total). The number of neurons in the first hidden layer is 10% larger than the number of units in the input layer. For example, if the input vector consists of 100 features, then the first hidden layer contains 110 neurons. The second hidden layer is 10% smaller than the input layer, that is, it consists of 90 neurons. However, we also investigate how the accuracy measures depend on the number of hidden layers in the neural network. The activation function is the sigmoid. The neural network is trained for 50 epochs. The value of the tuning parameter s is set to 4. Some numerical experiments illustrate the dependence of the classification accuracy on the parameter s. The number of decision trees in every RF is set to 1000. However, we also study how the number of trees impacts the classification accuracy.
First, we compare NeuRF and NeuDF with the RF and gcForest, respectively, by using some public datasets from the UCI Machine Learning Repository [26]. Table 1 gives a brief description of these datasets; more detailed information can be found in the corresponding data resources. Table 1 shows the number of features m for the corresponding dataset, the number of examples n and the number of classes C. Different values for the regularization hyper-parameter λ have been tested, choosing those leading to the best results.
We also investigate the proposed models by using the well-known datasets MNIST and CIFAR-10. The MNIST dataset is a commonly used large database of 28 × 28 pixel handwritten digit images [24]. It has a training set of 60,000 examples and a test set of 10,000 examples. The digits are size-normalized and centered in a fixed-size image. The dataset is available at http://yann.lecun.com/exdb/mnist/. The CIFAR-10 data set consists of 32 × 32 color images drawn from 10 categories. It consists of 50,000 training and 10,000 test images. It was collected by Krizhevsky et al. [22]. The data set is available at https://www.cs.toronto.edu/~kriz/cifar.html.
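The evaluation protocol described above (repeated random selection of N training and 3N/4 test examples, accuracy averaged over the runs) can be sketched as follows. The assumption that the training and test parts are disjoint, the integer rounding of 3N/4, and the placeholder majority-class classifier are choices made for this sketch only.

```python
import random


def accuracy(y_true, y_pred):
    """Proportion of correctly classified cases."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def random_split(n_examples, n_train, rng):
    """Draw n_train training indices and 3 * n_train / 4 disjoint test indices."""
    n_test = 3 * n_train // 4
    if n_train + n_test > n_examples:
        raise ValueError("not enough examples for the requested split")
    order = list(range(n_examples))
    rng.shuffle(order)
    return order[:n_train], order[n_train:n_train + n_test]


def mean_accuracy(X, y, train_and_predict, n_train, repetitions=100, seed=0):
    """Average the accuracy over repeated random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        train_idx, test_idx = random_split(len(X), n_train, rng)
        y_pred = train_and_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        scores.append(accuracy([y[i] for i in test_idx], y_pred))
    return sum(scores) / len(scores)


# Placeholder classifier: always predicts the majority class of the training part.
def majority_classifier(X_train, y_train, X_test):
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_test)


X = [[float(i)] for i in range(20)]
y = [0] * 14 + [1] * 6
score = mean_accuracy(X, y, majority_classifier, n_train=8)
```

Any classifier with the same call signature, for example a wrapper around a forest, could be passed instead of the placeholder.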
Numerical results of comparison of the RF and NeuRF are shown in Table 2, where the first column contains abbreviations of the tested data sets, the second column contains the accuracy measures of the RF, the third column contains the accuracy measures of NeuRF, and the fourth column represents the difference between the accuracy measures of NeuRF and the RF. It can be seen from Table 2 that the proposed NeuRF outperforms the RF for most considered data sets. However, we have to point out that this outperformance is not significant. In order to formally compare the proposed NeuRF with the RF, we apply the t-test which has been proposed and described by Demsar [12] for testing whether the average difference in the performance of two classifiers is significantly different from zero. Since we use the differences between the accuracy measures of NeuRF and the RF (see Table 2), we compare them with 0. The t statistic in this case is distributed according to the Student distribution with 16 − 1 degrees of freedom. The results of computing the t statistic for the differences are the p-value, denoted as p, and the 95% confidence interval for the mean 0.198, which are p = 0.036 and [0.0139, 0.3823], respectively. The t-test demonstrates that NeuRF outperforms the RF, but the p-value is very close to the bound (0.05) of accepting [...] Table 3. It can be seen from Table 3 that the proposed NeuDF outperforms the DF for all considered data sets. Moreover, the results of computing the t statistic for the differences between NeuDF and the DF (see Table 3) are the 95% confidence interval [...] Let us also formally compare the RF and NeuDF as the extreme cases among the considered models. By computing the t statistic for the differences between [...] Table 4. One can see from Table 4 that NeuRF and NeuDF clearly outperform the RF and the DF, respectively.
Another question is how the accuracy measures of NeuRF and NeuDF depend on the number s of decision tree groups, i.e., on the tuning parameter s. Fig. 5 illustrates these dependences for the Ecoli dataset by using NeuRF (the left plot) and NeuDF (the right plot). It can be seen from the obtained results that there is an optimal value of s which provides the largest accuracy. This value is 4, and it is the same for NeuRF and for NeuDF. The same results are obtained for the MNIST dataset (see Fig. 6). It is interesting to note that the optimal values of s coincide for the Ecoli and MNIST datasets. However, this is just a coincidence. If we perform the same numerical experiments, for example, with the Yeast dataset, then we get the optimal value s = 6.
We also investigate how the number of hidden layers h in every neural network impacts the accuracy measures. The corresponding curves are shown in Figs. 7-8. Here we again have an optimal value of h, which provides the largest accuracy. It is interesting to note that increasing the number of hidden layers does not improve the results. Moreover, this increase makes the results worse. It can be explained by the overfitting effect when a lot of training parameters of the modified RF (weights of trees) are replaced by a lot of connection weights of the neural network. Finally, we investigate how the accuracy measures depend on the number T of decision trees in every RF. Figs. 9-10 clearly show that the accuracy measures increase with T, but the computational complexity also increases in this case.
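The t-test on accuracy differences used above can be reproduced with the standard library. The sketch below computes the t statistic and a confidence interval for the mean difference; the differences in the example are hypothetical and are not taken from the tables of the paper, and the critical value of the Student distribution is passed in by hand (about 2.131 for 15 degrees of freedom and about 3.182 for 3 degrees of freedom at the two-sided 95% level).

```python
import math
import statistics


def paired_t_test(differences, t_critical):
    """t statistic and confidence interval for the mean of paired differences.

    t_critical is the two-sided critical value of the Student distribution
    with len(differences) - 1 degrees of freedom.
    """
    n = len(differences)
    mean = statistics.mean(differences)
    std_err = statistics.stdev(differences) / math.sqrt(n)
    t_stat = mean / std_err
    half_width = t_critical * std_err
    return t_stat, (mean - half_width, mean + half_width)


# Hypothetical accuracy differences on four data sets (3 degrees of freedom).
diffs = [0.5, -0.1, 0.3, 0.2]
t_stat, interval = paired_t_test(diffs, t_critical=3.182)
print(round(t_stat, 3))  # 1.8
```

If the confidence interval does not contain zero, the average difference is regarded as significantly different from zero at the chosen level.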

Conclusion
New classification models based on a combination of the DF and the neural network have been presented in the paper. The main idea underlying these models is to improve RFs and the DF by combining the class probability distributions produced by decision trees for every training example by using a series of identical shallow neural networks with shared weights.
The proposed models have a number of advantages. First of all, we replace a simple rule for the class probability distribution combination (averaging) by a more complex function implemented by the neural network, which aims to minimize a classification loss function. Second, the neural network allows us to simply use various loss functions for computing the optimal RF class probability distributions. This provides an opportunity to solve tasks different from the standard classification, for example, transfer learning. Moreover, by applying the proposed models, we can modify the stacking algorithm used in the DF by extending the set of augmented features with some new functions of the tree class probability vectors. The investigation of new augmented features is a very interesting problem which can be viewed as a direction for further research.
It should be noted that the proposed models have not demonstrated a significant improvement when they were applied to a separate RF. A small increase of the accuracy measures for many datasets in this case is compensated by additional computations because of the neural network training. However, numerical experiments have illustrated that the proposed combinations may be very effective for the DF because it forms the appropriate augmented features in the stacking algorithm. That is why we have considered modifications of RFs as well as the DF in the paper.
The neural networks in the proposed models are trained by using a training part of datasets. At the same time, a direction for further research is to change the neural network learning strategy. For example, the networks may learn by using testing data or a combination of training and testing data. These changes may lead to better results.