Feature Augmentation based Hybrid Collaborative Filtering using Tree Boosted Ensemble

Demand for recommendation systems is currently on the rise due to the huge volume of information available online and the inability of users to manually filter the data they require. This paper proposes a feature augmentation based hybrid collaborative filtering model using a Tree Boosted Ensemble (TBE) for prediction. The proposed TBE recommender is formulated in two phases. The first phase creates a category based training matrix using similar user profiles, while the second phase employs a boosted tree based model to predict ratings for the items. Threshold based filtering is finally applied to obtain precise recommendations for the user. Experiments were conducted with the MovieLens dataset, and performance was measured in terms of Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The proposed model was observed to exhibit an average MAE of 0.64 and an average RMSE of 0.77, with a variation of ±0.1. Comparisons with state-of-the-art models indicate that the proposed TBE model reduces MAE by 6% to 14% and RMSE by ~0.2.


Introduction
The information explosion has led to a huge amount of data being generated online. However, human intellect and perception levels remain constant, making it difficult for users to process all the information available to them [1]. This has led to the formulation of prescription based models that provide automatic recommendations to users. Such automated recommendations help users to a large extent by categorizing the information and surfacing the most significant items so that users do not miss them. Systems enabling such automated categorization and filtering are called recommender systems.
A recommender, or recommendation system, is a specialized information filtering model that performs predictions based on the preference a user expresses for an item. The preference levels are measured using ratings provided by the user to the item or to similar items [2]. User ratings are analyzed, and items similar to the best rated items are recommended to the user. Recommender systems are not limited to predicting products alone. Due to high online usage levels, such systems have become very popular and are currently used to recommend books [3], music, news, research articles and even search queries, jokes and restaurants. Some of the most popular recommendation systems are the music recommenders of Last.fm and Pandora Radio. Interest in recommendation systems was sparked by the Netflix challenge, which offered a prize of $1 million for improving Netflix's model by 10% [4].
Recommendation systems can be designed using three major approaches [5], namely collaborative filtering, content based filtering and hybrid recommenders. Collaborative filtering models [6] analyze users' behaviors and preferences to provide predictions. A major advantage of such models is that they are based on available and machine analyzable content. This makes their recommendations more accurate and relatable. However, they suffer from issues like data sparsity, data unavailability (cold start) [7] and data volume [8]. Content based recommenders [9] are based on items that model the user's profile. Predictions are based on the created profiles. Hybrid recommenders [10] are a combination of collaborative and content based recommenders.
This paper proposes a feature augmentation based collaborative filtering mechanism to predict items preferred by users. It uses a model based recommendation approach, where prediction is modelled as a regression problem. Recommendations are usually fine-tuned to users. Hence predictions for one user pertain to that user only. The proposed collaborative filtering architecture is modelled in two phases. The first phase deals with identifying the current user's interests and forming their profile, finding users similar to the current profile and identifying the item vectors pertaining to similar users. Most recommendation systems stop at this level to provide recommendations. The proposed approach moves further by building a training matrix from the item vectors that is passed to the next phase. The second phase uses a boosted tree based ensemble to create a prediction model that is used for the final predictions. Experiments and comparisons indicate the high effectiveness of the proposed model as it exhibits a considerable reduction in the Mean Absolute Error (MAE) and the Root-Mean-Square Error (RMSE) in comparison to existing state-of-the-art models.
The remainder of the paper is organized as follows: section 2 presents the literature review, section 3 presents the problem formulation, section 4 presents a detailed description of the proposed ensemble model, section 5 presents the experimental results and section 6 concludes the work.

Literature review
This section discusses some of the recent contributions in the domain of recommendation systems.
An artificial neural network based recommendation system that uses content-based modelling for predictions was proposed by Paradarami et al. in [11]. This work performs model based predictions by utilizing user reviews. Model based predictions are hybridized with an ANN to perform enhanced predictions. A similar hybridized recommendation model specifically for e-learning environments was proposed by Chen et al. in [12]. Several hybrid versions of recommenders are currently on the rise, such as the user specific hybrid recommender for offline optimization by Dooms et al. [13], an augmented matrix based hybrid system by Wu et al. [14], a latent factor based recommendation system by Zheng et al. [15] and a linear regression based collaborative recommendation model by Ge et al. [30].
The use of metaheuristics for recommendations has been on the rise due to the increase in data volume. A cuckoo search based collaborative filtering model was proposed by Katarya et al. in [16]. This model uses k-means clustering for user grouping and cuckoo search for the process of prediction. Other metaheuristic or evolutionary algorithm based recommendation systems include the memetic algorithm and genetic algorithm based recommender system by Banati et al. [17], a PSO based recommender system [18] and fuzzy ant based recommenders by Nadi et al. [19].
A weighting strategy based recommender that performs genre based clustering was proposed by Fremal et al. [20]. This method analyzes twelve weighting strategies in terms of MAE and RMSE to obtain the best weighting model for effective recommendations. A similar multiple clustering based recommendation model was proposed by Ma et al. in [21]. A trust and similarity based recommender for leveraging multiviews was proposed by Guo et al. in [22]. A prediction system to recommend complementary products was proposed by McAuley et al. in [23]. This model concentrates on identifying substitutable versions of customers' interests. A coordinate based recommendation system, SCoR, was proposed by Papadakis et al. [32]. This model uses a combination of matrix factorization and collaborative filtering to improve the prediction process. A user perception based model was proposed by Chen et al. [33]. This is a critiquing based model that considers the user's perception of products for the prediction process.
Although several models for recommendations are available, most of them follow the regular filtering mechanisms, resulting in huge data for processing, thereby increasing the computational complexity to a large extent.

Problem formulation
The collaborative filtering model has been formulated as a prediction problem, where the proposed Tree Boosted Ensemble (TBE) model predicts the probable ratings that will be provided by the user for a particular item.
Let $CL$ be the set of customers, where $CL = \{C_1, C_2, C_3, \ldots, C_m\}$, and let $PC$ be the purchase list of a customer, in which the purchased item $i_x$ and the corresponding rating $r_x$ given to the item are the mandatory components. All $n$ available items are contained in the item list $I = \{i_1, i_2, i_3, \ldots, i_n\}$. The ratings are real numbers in the interval $[r_{min}, r_{max}]$, where $r_{min}$ and $r_{max}$ are defined by the domain.
The problem is to predict a set of items from $I$ for a customer $C$ such that the customer would have a high probability of purchasing them, subject to the constraints given in eq 1.

Collaborative filtering using tree based boosted ensemble
Collaborative filtering is the process of predicting a user's interests based on their past behaviors. This paper proposes a model based collaborative filtering approach using a boosted ensemble. Algorithm for the proposed collaborative filtering architecture is given below. The proposed collaborative filtering architecture has been modelled in two major phases, namely profile induced item matrix creation using feature augmentation, and ensemble based predictions. The first phase collects, filters and integrates data corresponding to the user for whom the recommendation is to be made. The second phase creates a boosted ensemble, trains it using the created training data and provides predictions.

Profile induced item matrix creation using feature augmentation
Recommendation systems are built for heterogeneous users. Every user's requirements are distinct; however, there also exist slight similarities with other users in the system [24]. Hence it is important to identify and build an appropriate profile for the current user for the model to be trained upon. The effectiveness of predictions depends entirely on the quality of the training data built in this phase. The process of building the user's profile is performed using Feature Augmentation. Feature Augmentation is the process of computing a set of features to be passed to the subsequent phase for evaluation.
The initial phase of this process is to generate an item list from the purchase history of the customer under analysis. This is given by
$$I_C = \{\, i_x \mid i_x \in PC \,\}$$
where $i_x$ is an item from the set of all items purchased by the customer.
The ratings pertaining to the item list $I_C$ are integrated to obtain the user's preferences.
The mere fact that the customer has purchased an item does not guarantee the person's affinity towards the product. Hence affinity levels for the product are obtained by integrating the ratings. The next step is to identify users with interests similar to the current user C. Identifying similarities begins by identifying the commonalities existing between C, the current user, paired with each of the other existing users (NC). This is given by
$$I_{Common} = I_C \cap I_{NC}$$
where $I_C$ and $I_{NC}$ correspond to the items purchased by customers C and NC.
The common items identified in $I_{Common}$ are integrated with their corresponding ratings from C and NC to create the ratings matrix, which is given by
$$R = \{\, (i_x, r_c, r_{nc}) \mid i_x \in I_{Common} \,\}$$
where $r_c$ is the rating given to product $i_x$ by customer C and $r_{nc}$ is the rating given to product $i_x$ by customer NC.
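The common-item step can be sketched in a few lines. This is a minimal illustration, assuming each purchase list is held as a dictionary from item identifier to rating; the names `purchases_c`, `purchases_nc` and `common_rating_pairs` and the sample ratings are hypothetical and do not come from the paper.

```python
# Hypothetical purchase lists: item id -> rating given by that customer.
purchases_c = {"i1": 5.0, "i2": 3.0, "i3": 4.0, "i7": 2.0}
purchases_nc = {"i2": 4.0, "i3": 5.0, "i7": 1.0, "i9": 3.0}


def common_rating_pairs(pc, pnc):
    """Return {item: (r_c, r_nc)} for the items rated by both customers."""
    common = sorted(set(pc) & set(pnc))  # I_Common = I_C intersect I_NC
    return {item: (pc[item], pnc[item]) for item in common}


pairs = common_rating_pairs(purchases_c, purchases_nc)
print(pairs)
```

The two rating vectors used in the similarity computation are then simply the first and second components of these pairs, taken in the same item order.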
Correlation of the ratings $r_c$ and $r_{nc}$ between the current customer C and every other customer NC is determined to identify the similarity level between the two customers. Similarities are identified between two rating vectors, and a similarity identification model is used for the process [25]. Some of the common similarity measures are Euclidean distance, Minkowski distance and Pearson correlation [2]. Distance based measures require the input vectors to be standardized prior to operation. A major advantage of a correlation based measure is that it is the cosine similarity of the mean-centred rating vectors, so centring and scaling are part of the measure itself and standardized input values are not required. This avoids the additional overhead of standardizing the input data. Hence the proposed TBE model uses Pearson correlation as the similarity measure. The TBE model uses all the items identified as common (the entire population) to obtain the similarity, which is given by
$$\rho_{C,NC} = \frac{\operatorname{cov}(R_C, R_{NC})}{\sigma_{R_C}\,\sigma_{R_{NC}}}$$
where $R_C$ and $R_{NC}$ are the rating vectors corresponding to C and NC, the numerator is the covariance of $R_C$ and $R_{NC}$, and the denominator is the product of the standard deviations of $R_C$ and $R_{NC}$.
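The similarity computation can be illustrated as follows. This is a sketch only, assuming the two rating vectors are aligned lists over the common items and using population (not sample) covariance and standard deviation, in line with the "entire population" wording above. Returning 0.0 for a constant rating vector is an assumed convention, since the correlation is undefined in that case; the function name `pearson` is hypothetical.

```python
import math


def pearson(rc, rnc):
    """Population Pearson correlation of two equal-length rating vectors.

    Returns 0.0 when either vector has zero variance (assumed convention).
    """
    n = len(rc)
    if n == 0 or n != len(rnc):
        raise ValueError("rating vectors must be non-empty and of equal length")
    mean_c = sum(rc) / n
    mean_nc = sum(rnc) / n
    # Numerator: covariance of the two rating vectors.
    cov = sum((a - mean_c) * (b - mean_nc) for a, b in zip(rc, rnc)) / n
    # Denominator: product of the two standard deviations.
    sd_c = math.sqrt(sum((a - mean_c) ** 2 for a in rc) / n)
    sd_nc = math.sqrt(sum((b - mean_nc) ** 2 for b in rnc) / n)
    if sd_c == 0 or sd_nc == 0:
        return 0.0
    return cov / (sd_c * sd_nc)


rho = pearson([3.0, 4.0, 2.0], [4.0, 5.0, 1.0])
print(round(rho, 4))
```

Because the measure centres and scales both vectors, `pearson([1, 2, 3], [10, 20, 30])` evaluates to 1 up to rounding, which is the scale invariance the paragraph above relies on.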
The final item set is obtained by filtering item data satisfying the similarity threshold ($\rho_{Thresh}$). The similarity threshold is domain and data dependent. The proposed TBE model sets a similarity threshold of 0.5 for analysis. Items corresponding to users who satisfy the threshold are considered for building the training matrix. The selection criterion for items is given by
$$I_{Selected} = \{\, i_x \in I_{NC} \mid \rho_{C,NC} \ge \rho_{Thresh} \,\}$$
Item categorization plays a vital role in determining the details pertaining to items. Training data for TBE is constructed with the item categorization details, rather than the actual items. Broader and highly specific categorizations tend to provide more accurate results. The proposed model also deals with integrating items falling under multiple categories. Categories pertaining to the items under $I_{Selected}$ are obtained. Item categorizations tend to be nominal rather than numeric. Hence they are converted with 1-of-n encoding to obtain the numeric training data matrix (T) for training the ensemble model.
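The filtering and encoding steps can be sketched as below. This is an illustration under stated assumptions: the threshold is treated as inclusive, an item with several categories gets a 1 in each matching column, and the rating given by the similar user becomes the label for the row. The data structures and the name `build_training_matrix` are hypothetical.

```python
RHO_THRESH = 0.5  # similarity threshold stated in the text

# Hypothetical inputs: similarity of each other user to C, the ratings
# those users gave, and the genre(s) of each item.
similarity = {"u1": 0.82, "u2": 0.31, "u3": 0.50}
ratings = {
    "u1": {"i2": 4.0, "i9": 5.0},
    "u2": {"i4": 2.0},
    "u3": {"i9": 3.0},
}
genres = {"i2": ["Comedy"], "i4": ["Drama"], "i9": ["Action", "Comedy"]}


def build_training_matrix(similarity, ratings, genres, thresh=RHO_THRESH):
    """Return (columns, T, y): 1-of-n genre rows and their rating labels."""
    # Keep only users whose similarity satisfies the threshold (assumed >=).
    selected = [u for u, rho in similarity.items() if rho >= thresh]
    # One column per genre that occurs among the selected users' items.
    columns = sorted({g for u in selected for i in ratings[u] for g in genres[i]})
    T, y = [], []
    for u in selected:
        for item, rating in ratings[u].items():
            T.append([1 if g in genres[item] else 0 for g in columns])
            y.append(rating)
    return columns, T, y


columns, T, y = build_training_matrix(similarity, ratings, genres)
print(columns, T, y)
```

Note that the same item can appear in several rows with different labels when more than one similar user has rated it, which is how this sketch keeps every similar user's opinion in the training data.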

Ensemble based Predictions
Model based recommenders utilize a machine learning model to predict recommendations for a user. The proposed TBE model builds a boosted tree ensemble for prediction.
Ensemble modelling [26] is the process of incorporating multiple models for prediction, rather than relying on the results of a single model. Boosting is a machine learning ensemble technique that aims to reduce bias and variance in the prediction system in order to provide an effective prediction model. It is a supervised learning approach that operates by combining a set of weak learners into a single strong learner. This work uses decision trees to build the model for recommendations based on the training data [27].
The proposed boosting model operates by iteratively training the algorithm based on the resultant errors from previous iterations.
Let $DT(x)$ be the base decision tree used for training. The process of prediction is given by
$$y' = DT(x)$$
where $y'$ is the prediction given by the decision tree model DT. However, since DT is a weak learner, its predictions will contain errors $e$, which can be given by
$$e = y - y'$$
where $y$ is the actual solution and $y'$ is the predicted solution.
The next level prediction model is built by integrating the error component $e$ into the prediction model, which yields a new prediction $y''$. Similarly, the next level error is given by $e' = y - y''$. The next round of model training incorporates $e'$ into the training process. This is performed iteratively until the error reaches an acceptable threshold.
Training data for the recommendation problem is modelled as a category based training matrix constructed from the item ratings obtained from similar users. The rating corresponding to each item vector is incorporated as the class label for the training matrix (T). The training matrix is passed to the boosted decision tree and the trained model is obtained. The process of training matrix creation and prediction is repeated for each user individually on every recommendation request, to obtain results pertaining to that individual user.
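One common way to realize this iterative scheme is additive residual fitting, where each new weak learner is trained on the errors left by the learners before it. The sketch below follows that reading and is not necessarily the exact update rule of the proposed model. It assumes binary (1-of-n) features, uses depth-1 regression trees as the weak learner, and stops at the 0.001 error limit or after a fixed number of rounds; `fit_stump`, `boost`, `max_rounds` and the sample data are hypothetical.

```python
def fit_stump(X, targets):
    """Fit a depth-1 regression tree on binary features.

    Picks the feature whose split minimises squared error and predicts
    the mean target on each side of the split.
    """
    best = None
    for j in range(len(X[0])):
        left = [t for row, t in zip(X, targets) if row[j] == 0]
        right = [t for row, t in zip(X, targets) if row[j] == 1]
        mean_l = sum(left) / len(left) if left else 0.0
        mean_r = sum(right) / len(right) if right else 0.0
        sse = (sum((t - mean_l) ** 2 for t in left)
               + sum((t - mean_r) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, j, mean_l, mean_r)
    _, j, mean_l, mean_r = best
    return lambda row: mean_r if row[j] == 1 else mean_l


def boost(X, y, max_rounds=50, tol=0.001):
    """Additive boosting: each new stump is fitted to the current residuals."""
    stumps = []
    predictions = [0.0] * len(y)
    for _ in range(max_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]  # e = y - y'
        if max(abs(r) for r in residuals) <= tol:
            break  # error has reached the acceptable threshold
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [pi + stump(row) for pi, row in zip(predictions, X)]
    return lambda row: sum(s(row) for s in stumps)


# Hypothetical genre-encoded rows and rating labels.
X = [[0, 1], [1, 1], [1, 0], [0, 0]]
y = [4.0, 5.0, 3.0, 2.0]
model = boost(X, y)
print([model(row) for row in X])
```

In practice a library implementation of boosted trees would replace the hand-written stump, but the loop shows where the error term enters each training round.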
Test data is obtained by considering the items not contained in the purchase list of customer C. The formulation of test data is given by
$$I_{Test} = I \setminus I_C$$
Categories corresponding to $I_{Test}$ are obtained and 1-of-n encoding is applied to obtain the test matrix. TBE is formulated as a regression model. Hence an error level of 0.001 is set as the acceptable error limit for the TBE model.
The results provide probable user ratings for each item. Recommendations can be provided by sorting the results in decreasing order and providing the top n rated products as probable recommendations.
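The final ranking step can be sketched as follows. This is a minimal illustration, assuming predicted ratings are available through a callable and that the threshold based filtering mentioned in the abstract is a minimum predicted rating; the cut-off of 3.5, the function name `recommend` and the sample values are hypothetical.

```python
def recommend(all_items, purchased, predict_rating, n=3, min_rating=3.5):
    """Rank unseen items by predicted rating; keep the top n above a threshold."""
    # Test items are those the customer has not purchased: I_Test = I \ I_C.
    candidates = [item for item in all_items if item not in purchased]
    scored = [(item, predict_rating(item)) for item in candidates]
    kept = [(item, r) for item, r in scored if r >= min_rating]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # decreasing order
    return kept[:n]


# Hypothetical predicted ratings for the items the customer has not seen.
predicted = {"i4": 4.2, "i5": 2.9, "i6": 4.8, "i8": 3.6}
top = recommend(["i1", "i4", "i5", "i6", "i8"], {"i1"}, predicted.get, n=2)
print(top)
```

Filtering before truncating means fewer than n items may be returned when few candidates clear the threshold, which favors precision over list length.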

Experimental results
The proposed TBE model is implemented using Python and uses MovieLens data [28,29] for analysis. MovieLens is a benchmark dataset used to validate recommendation systems. The dataset containing 1 million reviews is considered for evaluation. It contains details pertaining to 6040 users and provides reviews for 3952 movies. Ratings are provided on a 5 point scale. The dataset also contains categorizations of movies in terms of genres. A single movie sometimes belongs to multiple genres, providing scope for multiple options. The proposed model operates by considering movies as items and genres as the categorization parameters.
Recommendation models are usually measured in terms of Root-Mean-Square Error (RMSE) and Mean Absolute Error (MAE) [13,31].
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - y_i' \right|$$
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - y_i' \right)^2}$$
where $y_i$ and $y_i'$ are the actual and the predicted ratings for the $N$ test reviews. MAE measures the effectiveness of the predictions. Smaller MAE values indicate better predictions. RMSE depicts the stability of the predictions, in other words, the prediction variance.
Low MAE values represent a good predictor, while high RMSE values indicate high variability in predictions.
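Both metrics are straightforward to compute from paired lists of actual and predicted ratings. The following is a small sketch with hypothetical sample ratings; the function names `mae` and `rmse` are illustrative.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error over paired actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-Mean-Square Error over paired actual and predicted ratings."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Hypothetical ratings on a 5 point scale.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(mae(actual, predicted), rmse(actual, predicted))
```

RMSE is never smaller than MAE on the same data, and the gap between the two grows when a few predictions are far off, which is why it serves as the variability indicator here.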
Performance of the proposed model is measured in terms of RMSE and MAE. Scalability of the model is measured by sampling the data from 100K reviews, moving up to 1 Million reviews. Exhibited results were attained by performing 1000 iterations and identifying the mean of the obtained predictions.
Mean Absolute Error (MAE) corresponding to datasets of various sizes is shown in figure 1. It can be observed that the proposed TBE model exhibits similar MAE values with low fluctuations irrespective of the data size. The best MAE was observed to be 0.51, and the average MAE was observed to be 0.64, with fluctuation levels of ±0.1. This exhibits the stability of the proposed model irrespective of the data size. The best RMSE was observed to be 0.62, and the average RMSE was observed to be 0.77, with fluctuation levels of ±0.1. This exhibits the low variability in the prediction levels of TBE.
The enhanced performance of the proposed model is attributed to two major factors: feature augmentation and the tree based boosted ensemble. Feature augmentation is based on the user's profile. Hence the input data contains several attributes depicting the user's profile. This results in the model being highly fine-tuned towards the user's requirements, which enables better predictions. Further, the use of the boosting model ensures that every wrong prediction increases the weight of the corresponding instance. This enables even rarely found instances to have a significant impact on the final prediction process. These factors enable enhanced results in the proposed model.
The proposed model is compared with the weighting strategy based models (SW I, MLR and CM II) proposed by Fremal et al. [20] and the K-Means and Cuckoo Search based model proposed by Katarya et al. [16]. Both these works are recent and also consider the MovieLens data for their prediction process. Hence this work considers the two works for comparison.
A comparison of the MAE values of the proposed model with SW I, MLR, CM II and K-Means Cuckoo is shown in figure 3. It can be observed that the proposed TBE model exhibits the lowest MAE levels, with reductions in the range of 6% to 14%.

Conclusion
Recommendations have become one of the major requirements in the current information rich world. However, the voluminous data available to recommendation engines poses a huge challenge. This paper proposes a feature augmentation based tree boosted ensemble model, TBE, for recommendations. TBE, being an iterative model, uses a weak learner; hence the computational complexity of TBE was observed to be very low. Further, the repeated data filtering process in the first phase reduces the data to a large extent. This further reduces the computational complexity, speeding up the prediction process to a considerable extent. This aids in handling large datasets effectively, indicating enhanced scalability levels.
The major advantage of TBE is that it uses the available data for the current user and for users similar to the current user. Hence TBE has the capability to identify even hidden patterns in the available data. Further, this process of prediction addresses the data sparsity issue that affects collaborative filtering approaches. A limitation of the proposed model is that the cold start problem has not been handled. Future extensions of the proposed model will incorporate users' demographic data, which can enable solving the cold start issue.