K-CAE: Image Classification Using Convolutional AutoEncoder Pre-Training and K-means Clustering

The work presented in this paper falls within the general framework of classification using deep learning and, more precisely, convolutional autoencoders, which offer an alternative way of processing high-dimensional data to facilitate its classification. We propose incorporating convolutional autoencoders as a general unsupervised dimensionality reduction method that creates robust, compressed feature representations, which are easier to store and to transmit to the classification stage, in order to improve K-means performance on image classification tasks. Experimental results on three image databases, MNIST, Fashion-MNIST, and CIFAR-10, show that the proposed method significantly outperforms deep clustering models in terms of clustering quality.


Introduction
Computers are a very important part of society, but there are still many things that humans do better, despite their limited storage and computational capacity. Probably one of the most intriguing areas of study is learning, which can be described in many ways, including acquiring new knowledge, improving existing knowledge, representing knowledge, organizing knowledge, and discovering facts through experiments. In addition, the continuous growth of data volume drives the improvement of techniques that seek the implicit knowledge in these data [1]. Machine learning (ML) is the application of underlying computational methods to experience-based decision-making. It is a very important part of artificial intelligence and should be one of the main characteristics of intelligent systems. By learning, we can exploit and build models of reality based on experience, either by creating a model from scratch or by updating a partially built model. The goals of machine learning are to provide greater solution accuracy, greater problem coverage, greater economy in obtaining solutions, and greater simplicity in representing knowledge. Machine learning tasks are divided into three types: supervised, unsupervised, and reinforcement learning.
Unsupervised learning or clustering, also known as segmentation, is the grouping of data into homogeneous classes: a cloud of points in any space is represented as a set of groups called clusters. Its goal is to organize a collection of data, examples, and points into clusters (sets) that satisfy the following statement: points within the same cluster are more similar and closer to one another than points in different clusters. Today, clustering is a basic and essential preprocessing step for many real-world applications [2]. For example, it can be used to assist in document analysis, marketing, sales, etc. Specifically, a clustering algorithm can group data according to various similarity measures and clustering patterns to find useful and relevant information for the application. For the samples to be properly allocated to different clusters, meaningful feature values of the samples must be obtained first. However, in practical applications, the data we obtain is usually large and noisy, which makes clustering a difficult task. For example, in the MNIST dataset, each handwritten digit input image has 784 pixels. Although we know that some pixels (such as pixels in the corners of the image) may not be as useful as others (such as pixels in the center of the image), it is difficult to distinguish between them manually when clustering and thus to reduce the dimensions and number of features. Traditional clustering algorithms attain only limited performance as the dimensionality increases. Working with a high-level representation offers beneficial components that make the clustering process feasible. Representative features with compact clusters are much more useful because there is no supervision to provide information about category labels. Unsupervised models for representation learning include convolutional autoencoders (CAEs). They map inputs into a new representation space, allowing the encoding process to provide useful features. The encoding part projects the data into a collection of feature spaces, from which the decoding part reconstructs the original data [3].
In this study, we introduce a K-means clustering method integrated within a CAE framework that aims to simultaneously learn feature representation and cluster assignment. In contrast to traditional clustering techniques, our approach uses deep neural network representation learning to identify compact and representative latent feature spaces for subsequent classification and recognition. We train our model in an end-to-end approach with fixed parameters without any pre-training or fine-tuning techniques, enabling a faster training process. The majority of existing approaches essentially rely on pretraining the parameters using varied values.
The main contributions of this paper are:
• A survey of the literature on embedding deep learning and clustering (Section 2).
• A clarification of the concepts used in the literature, in particular the deep convolutional autoencoder with embedded clustering (Section 3).
• The proposition of the Convolutional Autoencoder (CAE), a simple but more general representation learning framework, which allows us to reduce the dimension of the database (inputs) and generate a feature vector (minimum dimensional data) before performing the clustering phase with the K-means algorithm (second part) to obtain better results (Section 4).
• The experiments that we conducted and the results obtained by our algorithm (Section 5).
• A summary of lessons learned and a reflection on future research work (Section 6).

Related work
Several algorithms for the incorporation of CAE in the K-means exist in the literature. In this section, we outline the best-known and most recent ones. All of these algorithms have shown good results in the last few years. However, none of them can be said to be the best, as they all depend on the input parameters and on the application domain.
[3] proposed an approach for clustering that is integrated with a deep convolutional autoencoder (DCAE). Their method simultaneously learns feature representations and cluster assignments through DCAEs, in contrast to conventional clustering approaches. Since DCAEs fully use the capabilities of convolutional neural networks, they are effective for image processing. They use objective functions for clustering and reconstruction. To achieve consistent performance in clustering, all data points are iteratively allocated to their new matching cluster centers during the optimization process. The experimental results on the MNIST dataset demonstrate that, in terms of clustering quality, the proposed method significantly outperforms deep clustering algorithms. We adopt a similar embedding with a convolutional autoencoder based on the k-means clustering algorithm.
[4] presented a novel autoencoder network-based clustering approach. They achieved a stable and compact representation that is better suited for clustering by carefully considering the constraint on the distance between data and cluster centers. They believe that this is the first attempt at creating an autoencoder for clustering. The data can be well partitioned in the transformed space since this deep architecture can learn a powerful nonlinear mapping. The usefulness of the proposed approach has also been shown by the experimental results. Some facts are still inconsistent, however. This problem might be solved by maximizing the difference between cluster centers in the code layer. By contrast, we use a CAE rather than a plain AE in order to cover more of the data representation.
For the purpose of clustering graphs, the authors of [5] proposed the GraphEncoder approach. They start by feeding the normalized graph similarity matrix into a deep neural network (DNN) that uses a sparse autoencoder as its basic building block. The best non-linear graph representations that can rebuild the input matrix and achieve the necessary sparsity attributes are then sought through a greedy layer-wise pretraining approach. The clustering results are obtained by running k-means on the sparse encoding output by the last layer after stacking many layers of sparse autoencoders. Our algorithm proceeds in a similar way: we apply a convolutional autoencoder in the first part and k-means in the second part for clustering.
Recently, in [6], the authors developed a modified deep learning strategy for lung cancer diagnosis that incorporated convolutional neural networks (CNN) with Kernel K-Means clustering. The proposed CNN architecture was used to analyze all of the data in the first step. The kernel k-means clustering approach obtains the attended neurons of the feature map for each image resulting from the convolutional layers in CNN. The centroid of each cluster is then determined using this procedure, which determines the prediction class of each data point in the validation set. Several k values were used in k-fold cross-validation to measure the performance of their suggested strategy.
By combining the K-means clustering technique with deep learning, the authors of [7] present a new idea for image classification. Because there are significant changes in the foreground and background of input images, they use the K-means clustering algorithm for image preprocessing. The accuracy of image preprocessing using K-means clustering can be improved. They use a two-dimensional deep convolutional neural network to categorize images in the BabyAIImage and Question datasets into multiple classes based on shape, color, size, and location. The researchers' goal in this work is to use the deep learning algorithm to create a system that targets children's visual abilities such as visual acuity, tracking, color perception, depth perception, and object recognition. This differs from our approach in that their method uses K-means for image preprocessing.
[8] proposed DeepCluster, a clustering algorithm that learns both the parameters of a neural network and the cluster assignments of the generated features. DeepCluster uses a typical clustering technique, k-means, to iteratively group the features and uses the subsequent assignments as supervision to update the network's weights. They use DeepCluster to train convolutional neural networks unsupervised on big datasets.
[9] proposed a SARS-CoV-2 population structure based on a convolutional autoencoder (CAE) trained with numerical feature vectors mapped from coronavirus spike peptide sequences to help predict future infection risks. They started by transferring input sequences of spike proteins and reducing their dimensionality into relevant numerical feature vectors suited for clustering. Then, they used principal component analysis (PCA) to produce a projection of the dataset before fitting a model. Using PCA can reduce the input numerical representation of each sample before applying CAE. The proposed method beats K-means and hierarchical clustering. The results show that, in comparison to virus isolates that are more widely distributed, cluster strains provide improved knowledge of the unknown population lineages.
In [10], the FDG PET/CT image features of multiple myeloma (MM) patients were extracted by a convolutional autoencoder. Both supervised and unsupervised clustering of the extracted features allowed significant and independent predictions of worse progression-free survival (PFS). The obtained results support the usefulness of AI algorithm-based cluster analyses of FDG PET/CT images for risk stratification of patients with MM.
We would like to address the following points from this brief and selective study of the incorporation of CAE in the K-means:
• CNN remains the most used model in brain tumor classification. We found that the architecture based on the Bayesian capsule neural network gave the lowest value of accuracy (74.4%).
• The preprocessing of the images improves their accuracy. Testing with 10-fold cross-validation improves the results too.
• Most of the previous work has been evaluated using the precision, recall, and F1 score metrics on the data set for better performance evaluation, which is essential to measure the model generalization on the test data.

Background
Our proposed approach comprises two modules: dimension reduction and classification. The dimension reduction is realized using CAE. The classification is carried out using K-means. In this section, we briefly review the main concepts of CAE and K-means.

Convolutional neural networks (CNN)
CNNs are widely used in image-based learning applications. Thanks to their automatic feature extraction, CNNs can extract useful information from training samples. CNNs are usually built from several convolutional, pooling, and fully connected layers. To extract features, the input is convolved with convolutional kernels, as shown in Figure 1. The pooling layer downsamples the feature map while retaining its most salient responses, which reduces the network's computational complexity. In a CNN, as the number of layers increases, the spatial size of the pooled feature maps typically falls. Max pooling and average pooling are two of the most used forms of pooling layers [11].
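To make the three operations concrete, the following is a minimal pure-Python sketch of a convolution, a ReLU activation, and a 2×2 max pooling applied to a toy image. The 6×6 image and the 3×3 vertical-edge kernel are hypothetical values chosen for illustration; they are not taken from the models in this paper.

```python
def conv2d_valid(image, kernel):
    """Valid (no padding) 2D cross-correlation, as computed by a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 6x6 image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hypothetical 3x3 kernel that responds to vertical edges.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

features = relu(conv2d_valid(image, kernel))  # 4x4 feature map
pooled = max_pool2d(features)                 # 2x2 map after pooling
```

In this toy case the 4×4 feature map is reduced to 2×2, so the layers that follow process a quarter of the values.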

AutoEncoder (AE)
Because they do not require labeled training data, AEs fit into the category of unsupervised learning. An AE creates a compressed latent space representation of the input data and then decompresses it to reconstruct the data. In the compression step, AEs carry out dimensionality reduction, which is similar to principal component analysis (PCA); however, unlike PCA, which uses a linear transformation, AEs use deep neural networks to learn a non-linear transformation.
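In commonly used notation, this can be written as follows. The symbols (encoder $f_\theta$, decoder $g_\phi$, latent code $z$, and $n$ training samples $x_i$) are assumptions introduced here for illustration and do not appear in the original text; the mean squared reconstruction error is one common choice of loss among several.

```latex
z = f_\theta(x), \qquad \hat{x} = g_\phi(z), \qquad
\mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n}
  \left\lVert x_i - g_\phi\bigl(f_\theta(x_i)\bigr) \right\rVert_2^2
```

When $f_\theta$ and $g_\phi$ are both linear and the loss is the squared error above, the learned code spans the same subspace as the leading principal components, which is the sense in which an AE generalizes PCA.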

Convolutional AutoEncoder (CAE)
Convolutional AutoEncoders are unsupervised dimensionality reduction models composed of convolutional layers capable of creating compressed image representations.
In general, CAEs are used to extract robust features, reduce and compress the size of the input dimension, and remove the noise while simultaneously preserving all necessary information.
The use of convolutional layers is the main difference between CAE and traditional AE. It is important to note that these layers are distinguished by their desirable capability of extracting knowledge and learning internal representations of image data.
More specifically, as shown in Figure 3, CAEs are composed of two CNN models, the encoder and the decoder. The encoder's principal function is to convert the initial input image into a latent representation with reduced dimensionality. The decoder, on the other hand, is responsible for rebuilding the image from the compressed latent representation and producing an output image that is as similar to the original as possible.

K-means clustering algorithm
The most frequently used static clustering method in scientific and industrial applications is the k-means algorithm [12]. It is a clustering approach that divides n observations into k clusters, with each observation assigned to the cluster whose mean (centroid) is nearest.
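A minimal pure-Python sketch of the standard k-means iteration (Lloyd's algorithm) is given below for illustration. The function names, the random initialization from data points, and the toy two-dimensional points are assumptions made for this sketch; it is not the implementation used in the experiments.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # assignments can no longer change
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

The cluster indices returned are arbitrary (cluster 0 is not tied to any particular class), so only the grouping itself is meaningful.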

The proposed K-CAE classification algorithm
To overcome the limitations of the data representation and the high dimensionality of the dataset and feature extraction, we have developed in this work an embedding of the CAE and K-means (K-CAE). The objective of this proposed approach is the application of deep learning to learn models that transform the input data into more user-friendly representations and reduce the dimensions for classification. The CAE is based on a set of successive transformations that amplify the features of the input data that discriminate between them and attenuate their variations.
The proposed K-CAE architecture is shown in Figure 5. The initial training dataset is used to train a CAE. The decoder component is eliminated once the CAE has completed its training process, and the encoder is employed to reduce the original high-dimensional image dataset into a compressed image dataset. Finally, the compressed image dataset produced by the CAE's encoder is used to feed and train a K-means clustering model.
The CIFAR-10 dataset consists of 60000 32×32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

K-CAE for image recognition
The goal is to train the CAE to find features, but here we use the encoder part to compress the initial high-dimensional image dataset into a compressed image dataset.
Deconvolutional layer of size 64×32×32.
After CAE3 has been trained, the decoder components (items 8 to 13 in the list above) can be removed, and the CAE can then be used to initialize unsupervised K-means. The softmax activation function is applied.

Performance evaluation
In this subsection, we validate the efficiency and robustness of the proposed approach by performing comprehensive experimental simulations. The measurement of quality is based on the well-known and widely used evaluation metrics: accuracy (Acc), precision, recall, and F1-score. These parameters can be calculated using Equations (1) to (4):
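The equations themselves are not reproduced in this text. The standard definitions of these four metrics are given below, under the assumption that they correspond to Equations (1) to (4) in this order; TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives for a given class.

```latex
\begin{align}
\mathrm{Acc} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \\
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{2} \\
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{3} \\
\mathrm{F1} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}
\end{align}
```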

Results
The classification process consists of two steps: the first one performs the dimension reduction with CAE, and the second stage represents the decision-making process for classification using K-means.
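Since K-means returns arbitrary cluster indices rather than class labels, some rule is needed to turn clusters into class predictions before an accuracy can be computed. The text does not state which rule is used; the sketch below shows one common possibility, a majority vote on labeled data, purely as an assumption. The function names and the toy values are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, true_labels):
    """Give each cluster the most frequent true label among its members."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in votes.items()}


def clustering_accuracy(cluster_ids, true_labels, mapping):
    """Fraction of samples whose mapped cluster label equals the true label."""
    predicted = [mapping[c] for c in cluster_ids]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)


# Hypothetical toy data: K-means cluster indices and ground-truth classes.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes = [7, 7, 3, 3, 3, 1, 1, 1]

mapping = map_clusters_to_labels(clusters, classes)
accuracy = clustering_accuracy(clusters, classes, mapping)  # 7 of 8 correct
```

In practice the mapping would be built on the training split and then reused unchanged on the test split, so that test labels play no part in the assignment.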
We train the CAE on the three datasets and obtain the corresponding models (as shown in Figures 6 and 7).
Obtained results for models 1 and 2:
In the first part, the model generated by CAE and presented in Figure 6 is composed of seven convolution layers, three max pooling layers, and two fully connected layers.
The input image is of size 28×28. It goes first to the convolution layer, which is composed of 14 filters. Each of our convolution layers is followed by the ReLU activation function, which forces the neurons to return non-negative values.
The output of CAE1 and CAE2 is a reduced-size image of 4×4.
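The 4×4 output size is consistent with three pooling layers applied to a 28×28 input, as the short calculation below illustrates. It assumes 2×2 pooling with stride 2 and "same" padding (output size rounded up), and convolutions that preserve the spatial size; these settings are not stated in the text and are assumptions of this sketch.

```python
import math


def pooled_size(size, pool=2, stride=2, same_padding=True):
    """Spatial size of a feature map after one pooling layer."""
    if same_padding:
        return math.ceil(size / stride)   # border is padded, size rounds up
    return (size - pool) // stride + 1    # incomplete windows are dropped


def encoder_sizes(input_size, n_pool_layers):
    """Sizes after each pooling layer, assuming size-preserving convolutions."""
    sizes = [input_size]
    for _ in range(n_pool_layers):
        sizes.append(pooled_size(sizes[-1]))
    return sizes


sizes = encoder_sizes(28, 3)  # [28, 14, 7, 4]
```

Without padding, the last pooling step would give 3×3 instead of 4×4, which is why the padding assumption matters here.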
In the training process of CAE1 and CAE2, the data is divided into training and test sets: 60000 samples were used to train the CAE model, and the remaining 10000 samples were used for testing purposes to calculate the accuracy and error rates (as described above). We obtain an accuracy of 81.44% after training the CAE1 and CAE2 models for 50 epochs. This result is still modest.

To improve this result:
In the second part, we apply the K-means clustering algorithm (see Figure 8) to the output of the encoder, that is, the feature vector resulting from the previous step, to determine which of the images are similar to each other and group them into one of the 10 classes. The division of the database into training and test sets remains the same. We obtain an accuracy of 96.22% after training the CAE1 and CAE2 models for 50 epochs.
Obtained results for model 3:
In the training process of CAE3, the data is divided into training and test sets: 50000 samples were used to train the CAE model, and the remaining 10000 samples were used for testing purposes to calculate the accuracy and error rates (as described above). We obtain an accuracy of 63.95% after training the CAE3 model for 50 epochs. This result is still modest.

To improve this result:
Table 2: The results of the K-CAE (MNIST database).
In the second part, we apply the K-means clustering algorithm (see Figure 9) to the output of the encoder, that is, the feature vector resulting from the previous step, to determine which of the images are similar to each other and group them into one of the 10 classes. The division of the database into training and test sets remains the same. We obtain an accuracy of 76.48% after training the CAE3 model for 50 epochs. We observed (see Figures 8 and 9) that the training and test accuracy increases with the number of epochs (up to 50), while the training and test error (loss) decreases. The results are good, and the models learn more information as training proceeds.

Discussion
In this study, we have shown that the incorporation of convolutional autoencoders as an image preprocessing technique (dimension reduction) could improve the performance of K-means models, leading to robust and accurate results. Therefore, it can be considered a promising tool for high-dimensional and noisy dataset applications.
Table 2 summarizes the performance of the proposed approach on the MNIST dataset for the 10 classes in terms of the evaluation measures.
• We noticed that the first and second models (applied to the MNIST and Fashion-MNIST databases) gave the same results, which are good compared to those of the third model (applied to the CIFAR-10 database), and this is due to the number of convolution layers and the number of intermediate pooling layers. As this number increases, the performance increases.
• According to our study, the results obtained by applying CAE alone are close to, and sometimes better than, the results obtained after integrating the k-means clustering method into the deep classification by CAE.
We demonstrate the effectiveness of our K-CAE algorithm mainly by comparing it, in terms of accuracy, with the deep convolutional embedded clustering method (DCAE) [3] and the AutoEncoder Clustering (AEC) [4] algorithm, and we evaluated it on the MNIST dataset.
To validate the performance of our proposed method, we compared it with five baseline methods: K-means, CAE, AEC [4], DCAE [3], and SARS-CoV-2 [9]. The results are summarized in Table 4. Our proposed method outperforms the baseline methods by a significant margin in accuracy (96.22%). In particular, the proposed method outperforms the second-place method, which also uses the CAE approach with a joint clustering loss, by 3.52%. Sometimes, the dimensionality of the input data is very high, and classical learning algorithms cannot provide good performance. To overcome this problem, deep learning algorithms, such as convolutional neural networks based on the multilayer perceptron, can reduce the dimensionality of the data.

Conclusion and perspectives
In this work, we proposed the incorporation of convolutional autoencoders as a general unsupervised learning data dimension reduction method for creating robust and compressed feature representations to improve K-means performance on image classification tasks.
The results presented in this paper show that deep learning methods can be effectively employed for image classification. Our results show that CAEs are capable of extracting meaningful information from digits by dimension reduction, and when combined with the K-means clustering algorithm, we were able to significantly improve classification accuracy.
Our work opens the way to many perspectives that can be explored in the future, among which we can cite:
• We will also use other supervised deep learning algorithms, such as Convolutional Neural Networks (CNN), etc. [13];
• With the use of large data sets, we will introduce the notion of incrementality into the database provided to the autoencoder [14];
• This architecture can also be used in certain application domains, such as handwriting recognition, with very large datasets.

For unsupervised training of CAE, we use three image datasets (MNIST handwritten digit images; Fashion-MNIST, Zalando's article images; and CIFAR-10 color images) from the UCI machine repository to validate the accuracy and efficiency of our proposed approach. The MNIST database contains a training set of 60000 examples and a test set of 10000 examples, each a 28×28-pixel grayscale image. Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70000 fashion products from 10 categories, with 7000 images per category. The training set has 60000 images and the test set has 10000 images. Fashion-MNIST shares the same image size, data format, and structure of training and testing split with the original MNIST.

Figure 8: Accuracy and error rates obtained from K-CAE (MNIST and Fashion-MNIST databases).

Table 1 lists several state-of-the-art studies that incorporate CAE into the K-means. The objective of each study is described in the table, along with the classifier model employed, the dataset used, and the performance of the results.

Table 1: Summary of state-of-the-art references.

Table 3: Comparison of the results of our proposed three models.

Table 3 compares the results obtained by our three proposed models according to the following criteria: number of epochs, CAE architecture (number of convolution layers, number of pooling layers, and number of fully connected layers), accuracy rate, and error rate.

Table 4: The results of the evaluation parameters of six different approaches.