Clustering of variables for enhanced interpretability of predictive models

A new strategy is proposed for building easy-to-interpret predictive models in the context of high-dimensional datasets with a large number of highly correlated explanatory variables. The strategy is based on a first step of variable clustering using the CLustering of Variables around Latent Variables (CLV) method. The hierarchical clustering dendrogram is then explored in order to sequentially select the explanatory variables in a group-wise fashion. For the model fitting itself, the dendrogram is used as the base-learner in an L2-boosting procedure. The proposed approach, named lmCLV, is illustrated on a toy-simulated example, in which the clusters and the predictive equation are known, and on a real case study dealing with the authentication of orange juices based on 1H-NMR spectroscopic analysis. In both illustrative examples, the procedure was shown to have predictive efficiency similar to that of other methods, with additional interpretability capacity. It is available in the R package ClustVarLV.


Introduction
In the context of high-dimensional data with a large number of variables, p, and a small number of observations, n, such as microarray, metabolomic and volatolomic data (among the large variety of -omic data collected using high-throughput fingerprinting technologies), the interpretability of predictive models is a major concern. This goal has been addressed with approaches designed to constrain Principal Component loadings to be shrunk towards zero (Jolliffe et al., 2003; Chipman and Gu, 2005; Cox and Arnold, 2018), or by first performing a cluster analysis in order to construct interpretable components (Enki et al., 2013; Bühlmann et al., 2013).
Our proposal is also to first perform a clustering of the explanatory variables and then to fit a linear model between a response variable Y and the p explanatory variables X_j (j = 1, ..., p) in a group-wise fashion. The algorithm adopted herein is easy to implement and is oriented towards the interpretability of the latent components sequentially introduced into the model.
In the case study considered herein, each selected latent component can be assumed to be associated with a compound from its chemical spectrum in NMR (Nuclear Magnetic Resonance). With metabolomic data, it is possible to imagine that latent components associated with subsets of metabolites may be related to specific biological pathways. Within the context of DNA microarray data, similar studies have been undertaken (Hastie et al., 2001; Park et al., 2007) in which hierarchical cluster analysis makes it possible to identify "supergenes", obtained by averaging genes within the clusters. These supergenes are thereafter used to fit regression models, thereby attaining concise interpretation and accuracy. One of the main differences in this study, compared with previous research works (Hastie et al., 2001; Park et al., 2007), is that the representatives of the clusters of variables, or latent variables, are not necessarily averages of the observed variables, but can be components that best reflect the variability within each cluster of variables. The second difference lies in the procedure adopted for the progressive construction of the linear regression model.
The clustering of the explanatory variables is based herein on the CLV (CLustering of Variables around Latent Variables) approach (Vigneau and Qannari, 2003; Vigneau and Chen, 2016), implemented in the ClustVarLV R package (Vigneau et al., 2015). In summary, the CLV approach consists of grouping highly correlated variables into clusters (or groups of variables), while exhibiting within each cluster a latent variable (or latent component) representative of that cluster. Each latent component is defined as a linear combination of only the variables belonging to the corresponding cluster. From this point of view, CLV components are sparse components in the space of observations, and aim to best reflect the variance-covariance structure of the explanatory variables.
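As an indication of what this clustering step looks like in practice, here is a minimal sketch using the ClustVarLV package; X stands for any n × p numeric matrix of explanatory variables, and the function and argument names follow our reading of the package documentation (they may differ slightly across versions).

```r
library(ClustVarLV)

# CLV hierarchical clustering of the columns of X into directional groups:
# highly correlated variables, whatever the sign, end up in the same cluster.
res.clv <- CLV(X, method = "directional", sX = TRUE)

plot(res.clv)                            # dendrogram of the variables

groups <- get_partition(res.clv, K = 5)  # cluster membership for K = 5 clusters
comps  <- get_comp(res.clv, K = 5)       # n x 5 matrix of the latent components
```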
Using the same approach, but defining a slightly different criterion, Enki et al. (2013) also proposed a clustering of variables method for producing interpretable principal components. Figueiredo and Gomes (2015) investigated an approach for clustering of variables, based on the identification of a mixture with bipolar Watson components defined on the hypersphere.
Herein, the central feature of the clustering procedure is a bottom-up aggregation of the explanatory variables based on the CLV criterion.
The dendrogram obtained is then explored in order to identify and select the groups' latent components that are predictive of the response variable Y. In contrast with a previous study, hierarchical clustering, which is the most time-consuming step, is performed only once, and the model fitting stage has been modified accordingly. More precisely, an L2-boosting procedure, for which the base-learner is the CLV hierarchical algorithm, has been considered. It consists of iteratively selecting a group of explanatory variables. The residuals of the response variable are then regressed on the latent component associated with the selected group, and the predicted response is updated with a shrunken version of this local predictor.
The proposed methodology, combining variable clustering and iterative linear model fitting, designated as lmCLV, is described in Section 2. Section 3 includes a simple simulated dataset in order to illustrate the behavior of the procedure in a known context, as well as a real case study dealing with the authentication of orange juice based on 1H-NMR spectroscopy.
We consider contexts where p > n or p ≫ n and where the explanatory variables may be arranged into groups of highly correlated variables.
Both the explanatory variables matrix and the response vector are assumed to be column-centered. In addition, the user may choose whether or not to standardize the variables to unit variance.

Outline of lmCLV
lmCLV combines two main methods: 1. The CLV method, which is performed first. The similarities between the variables, herein evaluated on the basis of their covariance or correlation coefficients, are depicted by a tree diagram, which is used as the base-learner for the model fitting method. The clustering of variables using the CLV method is detailed in Sect. 2.3.
2. The L2-boosting procedure, which provides an efficient iterative model fitting method.
The outline of the L2-boosting algorithm is given in Sect. 2.4, and the way the CLV base-learner is used in the course of this procedure is explained in Sect. 2.5.

Clustering of the variables
The clustering of the explanatory variables using the CLV method for directional groups (Vigneau and Qannari, 2003;Vigneau et al., 2015) aims to both identify clusters of variables and define a latent component associated with each cluster.
For a given number of clusters, K, the aim of the CLV method is to seek a partition of the variables into K clusters, each associated with one latent variable, so as to maximize the internal coherence within the clusters. When the agreement between the variables is assessed using their covariance or correlation coefficients, regardless of the sign of these coefficients, the clusters sought are named directional groups. In this case, the aim is to define the partition, G_K, and the matrix of the groups' latent variables, C_K, that achieve the optimal value T* of the CLV criterion:

T = \sum_{k=1}^{K} \sum_{j=1}^{p} \delta_{kj} \, \mathrm{cov}^2(x_j, c_k) \qquad (2)

under the constraints ||c_k|| = 1 (k = 1, ..., K).
In eq. (2), δ_kj = 1 if the j-th variable belongs to the group G_k, and δ_kj = 0 otherwise. In other terms, (δ_kj) is the generic term of a binary matrix Δ encoding the group membership of the p variables; Δ has only one nonzero element per row.
It is easy to show (Vigneau and Qannari, 2003) that when the partition G_K is fixed, the latent variable of a cluster G_k (k = 1, ..., K) is the first standardized principal component of the data matrix formed by the variables belonging to this cluster. The latent variable c_k associated with the cluster G_k can therefore be expressed as a linear combination of the variables of this group:

c_k = \sum_{j \in G_k} a_{kj} \, x_j \qquad (3)

For various numbers of clusters, from K = p, in which each variable forms a cluster by itself, to K = 1, where a single cluster contains all the variables, an agglomerative hierarchical clustering algorithm is also available, with respect to the CLV criterion and the aggregating rules detailed in Vigneau and Qannari (2003). The strategy proposed herein consists of first constructing the whole hierarchy of the p explanatory variables, and then repeatedly exploring the resulting dendrogram.
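For illustration, a minimal base-R sketch of eq. (3): given the (centered) variables of one cluster, the latent component is obtained from the first eigenvector of their cross-product matrix, then normalized to satisfy the unit-norm constraint of eq. (2). This is a didactic sketch, not the ClustVarLV implementation.

```r
# Latent component of a cluster (eq. 3): first principal component of the
# variables belonging to the cluster, normalized so that ||ck|| = 1.
# Xk: n x pk matrix of the centered variables of cluster Gk (assumed input).
latent_component <- function(Xk) {
  eig <- eigen(crossprod(Xk), symmetric = TRUE)
  a   <- eig$vectors[, 1]        # loadings a_kj of the cluster's variables
  ck  <- drop(Xk %*% a)          # linear combination of the group's variables
  ck / sqrt(sum(ck^2))           # unit-norm latent component
}
```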

L2-boosting procedure for model fitting
The second feature of the strategy is a boosting procedure, which can also be viewed as a functional gradient descent (FGD) algorithm (Efron and Hastie, 2016; Bühlmann and Hothorn, 2007), performed on the residuals of an iteratively updated shrunken linear model for the prediction of the response variable y. The outline of the algorithm is as follows:
1. Set m = 0, and initialize ŷ^(0), for instance by choosing ŷ^(0) = ȳ 1_n, with 1_n a vector of ones of length n, and ȳ = (1/n) Σ_{i=1}^{n} y_i;
2. Increase m by 1, and compute the residuals e^(m) = y − ŷ^(m−1);
3. Apply the base-learner procedure to the current residuals, in order to identify the "best" CLV latent component, denoted c*^(m). This base-learner procedure is described in Sect. 2.5;
4. Regress e^(m) on c*^(m) by ordinary least squares, and update the prediction with a shrunken version of this local predictor: ŷ^(m) = ŷ^(m−1) + ν b^(m) c*^(m), where b^(m) is the OLS coefficient and 0 < ν ≤ 1 is the shrinkage parameter;
5. Repeat steps 2 to 4 until m = M.
This procedure depends on two parameters: the stopping iteration parameter, M, and the shrinkage parameter, ν, which can be determined via cross-validation or another information criterion, as previously suggested in Bühlmann and Hothorn (2007).
In practice, because our base-learner procedure returns one-dimensional components associated with sequential group-wise selections of the explanatory variables, we have often observed (as will be shown in Sect. 3) that a relatively high value of ν (greater than 0.5) generally performs better. Moreover, as the predictive ability of the model appeared to be relatively stable, large values of M (say 50 or more) are not necessarily useful.
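The loop below is a schematic R sketch of steps 1 to 5 above, under the assumption that a function base_learner(e) implements the selection of Sect. 2.5 and returns a latent component for the current residuals; it is an illustration, not the package implementation.

```r
# Generic L2-boosting loop (Sect. 2.4). `base_learner` is assumed to map the
# current residuals to the selected CLV latent component c*(m).
l2_boost <- function(y, base_learner, nu = 0.5, M = 50) {
  yhat <- rep(mean(y), length(y))      # step 1: initialize with the mean of y
  for (m in seq_len(M)) {              # step 5: iterate until m = M
    e  <- y - yhat                     # step 2: current residuals
    ck <- base_learner(e)              # step 3: "best" CLV latent component
    b  <- sum(ck * e) / sum(ck^2)      # OLS regression of e on ck
    yhat <- yhat + nu * b * ck         # step 4: shrunken update
  }
  yhat
}
```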

Base-learner procedure
The third step of the algorithm presented in Sect.2.4 is the core of the proposed strategy.
At each iteration m of the algorithm, the aim is to select a cluster of explanatory variables and its associated representative, i.e. the group's latent component c*^(m). This choice is guided by the CLV hierarchical clustering of the matrix X = [x_1 | ... | x_p] of the p explanatory variables (Sect. 2.3).
• For each size of partition (from p clusters to one cluster), we first identify the CLV latent component which has the largest correlation (in absolute value) with the current residuals e^(m). For q = 1, ..., p, that is, for each partition G_{p−q+1}, we define:

c^*_q = \arg\max_{c_k \in C_{p-q+1}} \left| \mathrm{cor}(c_k, e^{(m)}) \right|

• The next step consists of choosing a specific level q between 1 and p, that is, a latent component c*_q and its associated group of predictors G*_q, in such a way that G*_q is as large as possible while satisfying a unidimensionality criterion. The unidimensionality condition is assessed herein using the modified Kaiser-Guttman (KG) rule (Karlis et al., 2003).
If we denote this specific level by m:

c^{*(m)} = c^*_{q_m}, \quad q_m = \arg\max_{q} \left\{ |G^*_q| \;:\; \lambda_1 > L \ \text{and}\ \lambda_2 \le L \right\}

where |G*_q| denotes the cardinality of G*_q, λ_1 and λ_2 are respectively the first and second largest eigenvalues of the correlation matrix of the explanatory variables belonging to G*_q, and the threshold L is defined according to Karlis et al. (2003) by:

L = 1 + 2 \sqrt{\frac{|G^*_q| - 1}{n - 1}}

The latent component c*^(m) and the group of explanatory variables G*_m are returned to the main algorithm (Sect. 2.4), which continues at step 4.
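A small sketch of the unidimensionality check, under our reading of the modified KG rule: a candidate group is accepted when only the first eigenvalue of its correlation matrix exceeds the threshold L.

```r
# Modified Kaiser-Guttman rule (Karlis et al., 2003) for one candidate group.
# Xg: n x |G| matrix of the explanatory variables belonging to the group.
is_unidimensional <- function(Xg) {
  n <- nrow(Xg); pg <- ncol(Xg)
  if (pg < 2) return(TRUE)                  # a singleton is trivially unidimensional
  lambda <- eigen(cor(Xg), symmetric = TRUE, only.values = TRUE)$values
  L <- 1 + 2 * sqrt((pg - 1) / (n - 1))     # threshold L
  lambda[1] > L && lambda[2] <= L           # only the first eigenvalue exceeds L
}
```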
Finally, it can be noticed that, at each step, the base-learner returns a latent component which is itself a linear combination of a subset of the explanatory variables (eq. 3). It is therefore easy to reformulate the prediction function as a linear combination of the p predictors, and to obtain an estimate of the coefficient vector β (eq. 1).
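If the base-learner also returns the loadings of eq. (3) and the indices of the selected variables, the implied coefficient vector β can be accumulated during the boosting loop. A hedged sketch (the list names ck, a and idx are our own conventions, not those of the package):

```r
# Variant of the previous loop that also accumulates beta (eq. 1).
# `base_learner(e)` is assumed to return list(ck = latent component,
# a = loadings of eq. (3), idx = column indices of the selected group).
l2_boost_beta <- function(y, X, base_learner, nu = 0.5, M = 50) {
  yhat <- rep(mean(y), length(y))
  beta <- numeric(ncol(X))
  for (m in seq_len(M)) {
    e   <- y - yhat
    sel <- base_learner(e)
    b   <- sum(sel$ck * e) / sum(sel$ck^2)
    yhat <- yhat + nu * b * sel$ck
    beta[sel$idx] <- beta[sel$idx] + nu * b * sel$a  # group-wise update of beta
  }
  list(beta = beta, yhat = yhat)
}
```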

A toy-simulated example
The lmCLV procedure is illustrated in this section using a simple example based on a simulated model, as in Chen and Vigneau (2016). We considered p = 70 explanatory variables measured on n = 100 observations. These variables were assumed to be structured into five groups (G_1 to G_5) of various sizes: G_1 was the largest group, consisting of 35 variables, G_2 was the smallest group, with 5 variables, whereas G_3, G_4 and G_5 comprised 10 variables each.
Each group of variables was generated around a prototypical variable. The five prototypes were centered and standardized random variables with a known covariance structure. In practice, n realizations of a vector (Z_1, ..., Z_5)^t were generated from a centered multivariate normal distribution with a given covariance matrix Σ. Let Z denote the n × 5 matrix of generated prototypes. The variables of each group were then simulated by adding a noise term drawn from a multivariate normal distribution N(0_n, I_n), where 0_n represents the n-dimensional null vector and I_n the n × n identity matrix:

x_j = \omega_j \sum_{k=1}^{5} \Lambda_{jk} \, z_k + \varepsilon_j, \quad \varepsilon_j \sim N(0_n, I_n) \qquad (7)

where Λ is a p × 5 binary matrix defining the allocation of the explanatory variables to the 5 groups. The column-wise sums of Λ were [35, 5, 10, 10, 10]. In eq. (7), ω_j ∈ {+1, −1} was used to randomly create positive or negative correlations between each simulated variable x_j in a group and the associated prototypical variable, in order to generate directional groups of variables.
Finally, the response variable y (∈ R^n) was generated as a linear combination of the prototypical variables, mainly weighted on the prototype of the smallest group, plus an error term ε drawn from a multivariate normal distribution N(0_n, I_n).
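The simulation can be reproduced along the following lines. Since neither Σ nor the regression weights of the generating model are reproduced in the text, the values of Sigma and beta below are placeholders, not those of Chen and Vigneau (2016).

```r
set.seed(1)
n <- 100; p <- 70
sizes <- c(35, 5, 10, 10, 10)                   # |G1|, ..., |G5|
Sigma <- diag(5)                                # placeholder covariance of the prototypes
Z <- MASS::mvrnorm(n, mu = rep(0, 5), Sigma = Sigma)
Z <- scale(Z)                                   # centered and standardized prototypes
Lambda <- matrix(0, p, 5)                       # p x 5 binary allocation matrix
Lambda[cbind(1:p, rep(1:5, times = sizes))] <- 1
omega <- sample(c(-1, 1), p, replace = TRUE)    # random signs -> directional groups
E <- matrix(rnorm(n * p), n, p)                 # noise N(0n, In) of eq. (7)
X <- sweep(Z %*% t(Lambda), 2, omega, `*`) + E  # eq. (7)
beta <- c(0, 3, 0.5, 0, 0)                      # placeholder weights, mainly on G2
y <- drop(Z %*% beta) + rnorm(n)                # linear model with eps ~ N(0n, In)
```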
The response y was thus, by construction, mainly related to the variables of the smallest group, G_2. With the retained values of ν and M, the lmCLV procedure achieved a RMSE_CV value of 2.45, which is similar to the optimal values observed with the other methods. With these parameters, 23 variables were selected: seven from G_1, the five variables of G_2, six variables from G_3, and two variables from each of the last two groups.

Orange juice authentication case study
In this case study, a 1H NMR spectroscopic profiling approach was investigated to discriminate between authentic and adulterated juices (Vigneau and Thomas, 2012). We considered the adulteration of orange juice (Citrus sinensis) with clementine juice (Citrus reticulata). Supplementation or substitution with a cheaper or easier-to-find similar fruit is one of the types of fraud conducted within the fruit juice industry. For the experiment, twenty pure orange juices and ten pure clementine juices were selected from the Eurofins database; they are deemed to be representative of the variability of the fruit juices available on the market. From these juices, 120 blends were prepared by mixing one of the twenty orange juices with one of the ten clementine juices in known proportions, the proportion of clementine juice in a mix being 10%, 20%, ..., or 60%. The experimental design is described in more detail in Vigneau and Thomas (2012). The database was completed with the twenty authentic orange juices, for a total of 140 juice samples, and these were analyzed on a 1H Nuclear Magnetic Resonance (NMR) spectroscopy platform.
The NMR variables are associated with chemical shifts, given in ppm. In the following, two spectral regions were considered simultaneously: the region from 6 to 9 ppm and the region from 0.5 to 2.3 ppm. The first region mostly includes chemical shifts associated with aromatic compounds, and the second includes, among others, amino-acid-specific spectral shifts. Between these regions lie the signals typical of sugars in the 1H NMR spectra of orange juice (Rinke et al., 2007), which cannot discriminate between Citrus sinensis and Citrus reticulata juices. After a preprocessing binning step, 300 chemical shift variables were retained between 6 and 9 ppm, and 180 variables between 0.5 and 2.3 ppm.
The data matrix is therefore composed of n = 140 observations and p = 480 variables. The log-transformed data are available in the R package ClustVarLV (dataset authen_NMR). The means of the log-transformed signals for both spectral regions are shown in Figure 3.
In the following, Pareto scaling was applied to each variable. This scaling, which consists of dividing each (centered) variable by the square root of its standard deviation, was shown in this case study (Vigneau and Thomas, 2012) to be preferable to the usual scaling by the standard deviation. Moreover, a split of the observations into ten segments was defined for cross-validation purposes, using a proportional stratified allocation rule so that each segment contained two observations of each of the seven experimental levels, ranging from 0 to 60% of co-fruit added to pure orange juices. The predictive performance was monitored from the first iteration up to the predefined maximum number of iterations, M; the same applies to the β coefficients of each of the (pre-processed) explanatory variables belonging to a selected group.
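As an illustration, the Pareto scaling and a stratified fold allocation can be written in a few lines of base R; level is assumed to be a factor giving the adulteration level (0 to 60%) of each of the 140 samples.

```r
# Pareto scaling: each centered variable is divided by the square root of
# its standard deviation (and not by the standard deviation itself).
pareto_scale <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  sweep(Xc, 2, sqrt(apply(Xc, 2, sd)), `/`)
}

# Stratified 10-fold allocation: each level (20 samples) contributes two
# observations to each of the ten CV segments.
make_folds <- function(level) {
  folds <- integer(length(level))
  for (lev in unique(level)) {
    idx <- which(level == lev)
    folds[idx] <- sample(rep(1:10, length.out = length(idx)))
  }
  folds
}
```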
Finally, a group importance criterion has been introduced. The importance of a group is assessed as the sum, over all its occurrences, of the decrease in residual variance achieved by introducing the shrunken OLS estimate of the associated latent component.
In our case study, using the whole dataset, with lmCLV parameters ν = 0.5 and M = 25, the group importance estimates are depicted in Figure 5, in which group importance values are expressed relative to the total variance of the response variable y. This reveals the presence of a group of spectral variables (G1) of high importance, as well as two other important groups (G2 and G3). Each group is numbered according to its first occurrence, which corresponds rather well with its importance in the model.
The first group involved nine spectral variables, at 7.52, 7.51, 7.50, 6.77, 2.10, 2.08, 2.07, 2.06 and 2.04 ppm. The spectral range between 7.50 and 7.52 ppm, combined with the signal at 6.77 ppm and the signals around 2 ppm, is particularly interesting and could be associated with 4-amino-3-methylbenzoic acid (see for instance the Spectral Database for Organic Compounds (SDBSWeb, 2020)). This information is quite stable: the spectral variables at 7.52, 7.51, 6.77 and 2.08 ppm were selected in the first group at each of the 10 iterations of the CV procedure. The same variables had also been noted in Vigneau and Thomas (2012), but their presumed association with the same compound could not be highlighted so clearly. The second group of spectral variables consisted of ten variables (7.15, 7.10-7.09, 1.10-1.05 and 0.96 ppm). The CV procedure showed that the range between 1.09 and 1.05 ppm and the peak at 7.10-7.09 ppm were simultaneously retrieved 7 times out of 10 as the second or third group. Lastly, the third group consisted of nine spectral variables, notably a range between 1.65 and 1.60 ppm. The relationship between these first three group components and the proportion of clementine juice in a mix is shown in Figure 6.
Elastic-Net (E-Net) was adjusted using the R package glmnet, considering the Gaussian family model with the mixing parameter α fixed to 0.5. The parameter λ was chosen as the mean of the optimal values determined for each CV fold. For the L2-boosting approach (L2-boost), in the R package mboost, values of the parameter ν from 0 (0.01 precisely) to 1, in steps of 0.1, with a maximal number of 1000 iterations, were explored. Random Forest was considered because this machine learning approach often proves effective in high-dimensional classification or regression problems and provides an interesting ranking of the variables by importance.
The R package randomForest was used with the parameter ntree = 5000, and default values for the two other parameters, mtry (i.e. 160) and nodesize (i.e. 5). As shown in Table 1, the expected prediction performance of lmCLV is similar to those of Sparse PLS Regression, Elastic-Net and the L2-boosting approach. Quite surprisingly, in this case study, Random Forest did not perform very well: the prediction of the percentage of co-fruit added was overestimated or underestimated at the extremes of the experimental scale, i.e. for 0-10% or 50-60%.
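For reference, here is a hedged sketch of how the benchmark models can be fitted in R with the packages named above; only the parameter values stated in the text are reproduced, and the remaining arguments are left at their defaults.

```r
library(glmnet)
library(mboost)
library(randomForest)

# Elastic-Net: Gaussian family, mixing parameter alpha = 0.5;
# lambda selected by cross-validation.
enet <- cv.glmnet(X, y, family = "gaussian", alpha = 0.5)

# Componentwise L2-boosting, e.g. with nu = 0.1 and up to 1000 iterations.
l2b <- glmboost(x = X, y = y,
                control = boost_control(mstop = 1000, nu = 0.1))

# Random Forest with ntree = 5000; mtry and nodesize kept at their defaults.
rf <- randomForest(x = X, y = y, ntree = 5000)
```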
The number of variables involved in the model fitted on the whole dataset, using the predetermined model parameters, is indicated in the last column of Table 1. As shown in Figure 7, while Sparse PLS Regression tends to identify fewer but larger spectral ranges, the L2-boosting approach retained a relatively small number of variables associated with narrow ranges. For lmCLV, 20 groups and their latent variables were identified. Since several small spectral ranges were merged within the same group, the total number of spectral variables involved was rather high. However, the order of extraction and the grouping effect, which are specific to lmCLV, cannot be seen in Figure 7. Globally, 19 spectral variables were retained by all five methods considered herein. Among them, we systematically found the variables at 7.51-7.52 ppm, but not the other variables belonging to the first group extracted with lmCLV. The variables between 7.09 and 7.12 ppm, as well as the variable at 1.07 ppm, are also present, as in the second group extracted with lmCLV. We finally noted that the areas at 6.96-6.98 ppm, 1.56-1.57 ppm and 0.86-0.88 ppm were selected by all five methods.

Figure 7: Location of the spectral variables involved in the models according to the method considered, in the orange juice authentication case study.

Conclusion
In this study, we introduced a strategy for linear model fitting based on the hypothesis that high-dimensional datasets often include highly correlated variables having similar effects on the response variable. This is specifically the case for modern scanning instruments, such as those used in spectroscopy (infrared, near-infrared, Raman, NMR, ...), which are able to collect a large number of sequential spectral variables, several of them being representative of a single feature or signal. Omic data, and specifically metabolomic data, contain a large number of measured elements that belong to the same metabolic pathways and that constitute the biological information being sought. The basis of the approach is therefore to cluster the explanatory variables first, as other authors have already proposed (Bühlmann et al., 2013; Enki et al., 2013). In lmCLV, the clustering stage is based on the CLV method and consists of identifying unidimensional latent components which represent clusters of variables.
These latent components play the role of the predictors in the second stage which results in introducing the explanatory variables in a group-wise fashion.
In contrast with Bühlmann et al. (2013) for the CRL (Cluster Representative Lasso) approach, and with Hastie et al. (2001) or Park et al. (2007), in which each cluster representative is defined as the mean variable of a group, the CLV latent components are defined at the same time as the clusters and are derived from the eigendecomposition of the within-cluster covariance matrix. In addition, for the construction of the regression model itself, we have adopted an L2-boosting approach rather than a Lasso approach. This makes it possible to build a simple and efficient algorithm in which the CLV dendrogram constitutes the base-learner.
Compared with our previous study, the algorithm proposed here requires far fewer computing resources. In the previous study, the clustering stage was performed on the data matrix combining the residuals of the response variable and the explanatory variables, and was repeated at each iteration; yet the clustering of several hundred variables accounts for the largest part of the computation time. Using the new version of lmCLV, for the case study on orange juice authentication (Sect. 3.2), which involved 480 explanatory variables, the whole procedure (ν = 0.5, M = 100) took 1 min 48 sec on one 3.4 GHz processor, including 1 min 25 sec for the clustering stage alone. For the implementation of this algorithm, a function lmCLV will be available in version 2.0.2 of the R package ClustVarLV.
In both examples presented in this study, lmCLV was shown to have similar predictive efficiency to other methods, and importantly, provided additional interpretability capacity.
The use of latent components that are easy to interpret, because each of them is associated with a group of collinear variables, is consistent with the development of modern simplifying approaches to modelling. Moreover, compared with Lasso-based methodologies, dimensionality reduction based on variable clustering makes it easier to identify directions of interest for prediction and makes it possible to highlight functional links between explanatory variables where they exist.