Entropy-Guided Assessment of Image Retrieval Systems: Advancing Grouped Precision as an Evaluation Measure for Relevant Retrievals

The performance evaluation of Content-Based Image Retrieval (CBIR) systems can be considered a challenging and pressing problem, even for expert users, given the large number of CBIR systems proposed in the literature and applied to different image databases. The automatic measures widely used to assess CBIR systems are inspired by the general Text Retrieval (TR) domain, such as the precision and recall metrics. This paper proposes a new quantitative measure adapted to the CBIR-specific property of relevant-image grouping, based on the entropy of the returned relevant images. The proposed performance measure is easy to understand and to implement. The good discriminating power of the proposed measure is shown through a comparative study with existing and well-known CBIR evaluation measures.


Introduction
The aim of Content-Based Image Retrieval (CBIR) systems [1], [2] is to rank the most similar images in the database given a user query, based on image content rather than textual annotations or metadata. A typical example is when the CBIR system returns the relevant images from the database in response to the user's image query. Query by image content is an extremely active discipline; a large number of systems in different application areas have been designed over the last twenty years. In [5], the authors report a tremendous growth in publications on this topic covering many disciplines such as medicine, botany, face recognition, fingerprint identification, and place recognition. CBIR systems are based on automatic extraction of low-level image features, such as color, gray shades, and texture, rather than on manual keyword annotation [3], [4]. The evaluation of CBIR systems is based on benchmarking and performance metrics. The goal of a benchmark is to compare different systems on a common set of test image databases. Exhaustive surveys on this topic can be found, for example, in [19], [21], and [20].
A main problem in the field of CBIR evaluation is the lack of a common performance measure that allows a quantitative and objective comparison of visual retrieval systems. The most used measures describe the number and/or the rank of relevant images within a returned list; Müller [8] and van Rijsbergen [16] present good summaries. New measures dedicated to CBIR system evaluation have been proposed in the last few years. In [18], the authors proposed a measure called Mean Normalized Retrieval Order (MNRO), which uses the sigmoid Gompertz function to overcome the weaknesses of Mean Average Precision (MAP) and Average Normalized Modified Retrieval Rank (ANMRR) [17].
The density of the returned relevant results is important and compatible with human visual evaluation. The common evaluation measures cannot capture the grouping property of the returned relevant images. In other words, the interrelation between relevant images is missed, although it is important for fast exploration of the result by visual inspection. For example, assuming a window size of 10, a system returning 100 images with 10 relevant images in one window is better than a system returning the same results with one relevant image per window. Furthermore, we extend the evaluation scale to achieve a better discriminating power, whereby two systems having the same precision value can be evaluated differently.
The rest of the paper is organized as follows. Section 2 provides an overview of the most used measures for information retrieval evaluation, and section 3 describes the limitations of the standard measures, especially for image retrieval. Section 4 provides an outline of the proposed entropy-based measure. Section 5 provides the experimental results and discussion. Finally, section 6 draws the conclusions.

Measuring information retrieval: quantitative assessment
Quantitative evaluation measures in information retrieval (IR) are designed to fulfill specific criteria, including their correlation with user satisfaction, their ability to discriminate among retrieval results, and their ease of interpretation and implementation. These measures serve as valuable tools in assessing the performance and effectiveness of information retrieval systems.
The most widely used evaluation measures in IR are derived from the fundamental concepts of recall and precision. Recall represents the ability of a retrieval system to retrieve all relevant documents from a given dataset, while precision measures the proportion of retrieved documents that are truly relevant. These measures provide valuable insights into the accuracy and completeness of the retrieval results, enabling researchers and practitioners to assess and compare different systems or algorithms.
However, there are also alternative evaluation measures based on utility theory. These measures, as described in works such as [9, 10, 11], focus on measuring the worth or value of the retrieval output to the user. Utility-based measures take into consideration the utility or benefit that users derive from the retrieved documents, providing a different perspective on the quality of the retrieval system's output.
Utility-based measures are particularly useful in evaluating set-based retrieval output, as observed in tasks like the TREC filtering task [12]. By considering the worth of the retrieved documents to the user, these measures capture the relevance and usefulness of the retrieved set as a whole, rather than treating each document independently.
In a comprehensive evaluation scenario, an effective performance measure should adhere to the following criteria:

- Relevance of Retrieved Images: The measure should consider the number of relevant images returned by the system. It is essential that the retrieved images are indeed relevant to the user's query. This criterion ensures that the system accurately identifies and retrieves the desired content.

- Retrieval of Relevant Images: The measure should also take into account the size of the returned list. It is crucial that all relevant images are successfully retrieved by the system. A good performance measure should strive for high recall, aiming to retrieve as many relevant images as possible.

- Ranking of Relevant Images: The rank of the relevant images within the returned list is another important factor. The measure should prioritize placing the most relevant images at the top of the list. A higher-ranked position indicates better performance, as it facilitates quick and efficient access to the most relevant content.

- Interrelations among Returned Relevant Images: The measure should consider the interrelations between the returned relevant images. Ideally, the relevant images should be grouped together rather than scattered throughout the list. This criterion ensures that the retrieval system provides coherent and meaningful results, enhancing the user's browsing experience.
By incorporating these criteria into the performance measure, researchers and practitioners can gain a comprehensive understanding of the system's effectiveness in information retrieval tasks. It allows for a holistic evaluation, considering relevance, retrieval completeness, ranking quality, and the overall organization of the retrieved content.

Mean average precision
Mean Average Precision (MAP) has been a popular evaluation metric in the field of Text Retrieval since its introduction in the Text Retrieval Conferences (TREC) starting from TREC 3 in 1994 [6]. Over the years, it has gained widespread adoption among researchers as a reliable measure for assessing the performance of their retrieval systems [7].
The MAP metric provides a comprehensive assessment by considering both precision and the ordering of relevant documents in the retrieval results. It calculates the average precision for each query and then takes the mean of these average precision values. The average precision of a single query can be written as AP = (1/R) * Σ_{i=1}^{R} (i / r_i), and MAP is the mean of AP over all queries. Here, R represents the total number of relevant documents in the entire collection for a specific information query, and r_i denotes the ranking position of the ith relevant document in the retrieved list.
The MAP metric takes into account the ranking position of each relevant document. It assigns higher importance to relevant documents appearing at the top of the retrieved list. The formula calculates the precision at each position and then averages these precision values over all the relevant documents, providing a single numerical value that represents the overall performance of the retrieval system.
By utilizing MAP, researchers can evaluate the effectiveness of their retrieval systems by considering both the accuracy of the results (precision) and the completeness of the results (recall). It enables the comparison of different systems and the measurement of improvements made over time or across different experiments.
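As a minimal sketch of this computation, assuming the standard TREC-style definition AP = (1/R) Σ i/r_i with 1-based ranks (the function names below are illustrative, not from the paper):

```python
def average_precision(relevant_ranks, num_relevant):
    """AP = (1/R) * sum_i (i / r_i), where r_i is the 1-based rank of
    the i-th relevant document in the returned list and R = num_relevant."""
    ranks = sorted(relevant_ranks)
    return sum(i / r for i, r in enumerate(ranks, start=1)) / num_relevant

def mean_average_precision(per_query):
    """MAP: mean of AP over all queries; per_query is a list of
    (relevant_ranks, num_relevant) pairs, one per query."""
    return sum(average_precision(rr, n) for rr, n in per_query) / len(per_query)

# A perfect ranking (relevant docs at ranks 1..3) gives AP = 1.0.
print(average_precision([1, 2, 3], 3))          # 1.0
# Relevant docs pushed down to ranks 1, 3, 5 lower the score:
print(average_precision([1, 3, 5], 3))          # (1 + 2/3 + 3/5) / 3
```

Because each term i/r_i is the precision at the rank of the i-th relevant document, relevant documents appearing late in the list contribute less, which matches the rank sensitivity described above.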

R-precision
The concept of R-precision provides a valuable insight into the performance of an information retrieval system by focusing on the precision achieved after retrieving a specific number, R, of relevant images for a given query. In other words, R-precision measures the precision of the retrieval results up to a certain rank.
When R is equal to the total number of relevant images for the query, reaching an R-precision of 1.0 signifies an ideal scenario with perfect relevance ranking and perfect recall. It implies that all the relevant images in the collection have been retrieved within the top R positions, ensuring a complete and accurate representation of the query's intended information.
An R-precision value less than 1.0 indicates that not all the relevant images have been retrieved within the first R positions. This could be due to the presence of irrelevant or less relevant images in higher ranks, affecting the precision achieved. As the R-precision approaches 1.0, it signifies an improvement in the retrieval system's performance, as a larger proportion of the relevant images are appearing earlier in the retrieved list.
Evaluating the R-precision allows researchers and practitioners to assess the effectiveness and efficiency of their retrieval systems by examining how well the system ranks and retrieves relevant images at different points. It complements other evaluation measures like precision at different ranks, average precision, or mean average precision, providing a more granular understanding of the retrieval system's performance in the early stages of retrieval.
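The definition above can be sketched as follows (a minimal illustration assuming the standard definition, with hypothetical identifiers):

```python
def r_precision(retrieved, relevant_set):
    """Precision over the first R results, where R = |relevant_set|."""
    r = len(relevant_set)
    top_r = retrieved[:r]
    return sum(1 for doc in top_r if doc in relevant_set) / r

relevant = {"a", "b", "c"}
# Perfect top-R ranking: all 3 relevant items in the first 3 positions.
assert r_precision(["a", "b", "c", "x"], relevant) == 1.0
# One irrelevant item inside the top-3 window lowers the score to 2/3.
assert abs(r_precision(["a", "x", "b", "c"], relevant) - 2 / 3) < 1e-12
```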

Precision and recall
The standard metrics used for evaluating the performance of information retrieval systems are precision and recall [15, 16]. Precision measures the proportion of relevant documents retrieved by the system out of all the documents that were returned. It provides an indication of the accuracy and relevance of the retrieval results. A high precision indicates that a large percentage of the retrieved documents are indeed relevant to the user's query.
On the other hand, recall measures the proportion of relevant documents that were retrieved out of all the relevant documents in the collection. It captures the system's ability to retrieve all relevant documents and reflects its completeness. A high recall suggests that a significant portion of the relevant documents has been successfully retrieved.
Precision and recall are complementary metrics that help assess different aspects of retrieval system performance. While precision emphasizes the quality of the retrieved results, recall emphasizes the system's ability to capture all relevant information. The balance between precision and recall depends on the specific requirements and goals of the information retrieval task.
By evaluating precision and recall, researchers and practitioners can gain insights into the effectiveness and efficiency of their information retrieval systems. These metrics allow for comparisons between different retrieval algorithms or system configurations, aiding in the optimization and enhancement of retrieval performance.

P = r(N) / N
In this context, the variable r(N) denotes the count of relevant images retrieved, whereas N represents the size of the retrieved list. Precision is a straightforward evaluation measure that is often favored due to its ease of implementation. However, it does not take into account the specific rank positions of the relevant elements, making it less sensitive to their order in the retrieval results.
Recall is defined as the proportion of relevant documents retrieved out of all the relevant documents present in the database (Rel): R = r(N) / Rel. Ideally, a retrieval system should aim for high values of both the precision (P) and recall (R) metrics. Rather than relying on individual measures of precision or recall, it is common to use a joint precision-recall (PR) graph to provide a comprehensive description of the system's performance [3]. The PR graph visually illustrates the trade-off between precision and recall at various thresholds or rankings.
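A minimal sketch of the two definitions, P = r(N)/N and R = r(N)/Rel (function and variable names are illustrative):

```python
def precision(num_relevant_retrieved, list_size):
    """P = r(N) / N: fraction of the returned list that is relevant."""
    return num_relevant_retrieved / list_size

def recall(num_relevant_retrieved, total_relevant):
    """R = r(N) / Rel: fraction of all relevant items that were returned."""
    return num_relevant_retrieved / total_relevant

# Example: 10 relevant images in a returned list of 40,
# out of 25 relevant images present in the database.
assert precision(10, 40) == 0.25
assert recall(10, 25) == 0.4
```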
However, one limitation of the PR graph is that its interpretation can be influenced by the number of relevant images associated with a particular query [8]. The shape and characteristics of the PR curve may vary depending on the specific query and the number of relevant images present. This means that comparing PR graphs across different queries or datasets may not always provide a fair or meaningful comparison.
Despite this drawback, the PR graph remains a valuable tool for evaluating retrieval system performance. It allows researchers and practitioners to analyze the trade-off between precision and recall, make informed decisions about system parameters or algorithms, and understand the system's behavior at different retrieval thresholds. By considering the PR graph alongside other evaluation metrics, researchers can gain deeper insights into the strengths and weaknesses of their retrieval systems.

Recall-precision graph
A recall-precision graph is a graphical representation that illustrates the trade-off between recall and precision for a given information retrieval system or algorithm. Recall measures the completeness of the results returned by the system: it represents the proportion of relevant documents retrieved out of all the relevant documents in the collection, and higher recall indicates that more relevant documents are being retrieved. Precision, on the other hand, measures the accuracy of the retrieved results: it represents the proportion of relevant documents among all the documents retrieved, and higher precision indicates that a higher percentage of the retrieved documents are relevant. In a recall-precision graph, recall is typically plotted on the x-axis, while precision is plotted on the y-axis. The graph shows how the precision changes as the recall increases, and the resulting curve can provide insights into the effectiveness of an information retrieval system. Ideally, a retrieval system should achieve high precision and high recall simultaneously. However, in practice, there is often a trade-off between the two measures. The recall-precision graph helps to visualize this trade-off and assists in finding the optimal balance based on specific retrieval system requirements.
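The points of such a graph can be sketched by computing a (recall, precision) pair after each rank of a binary relevance list (illustrative code, not the paper's implementation):

```python
def pr_points(results):
    """(recall, precision) after each rank of a binary relevance list,
    where results[i] is 1 if the result at rank i+1 is relevant."""
    total_rel = sum(results)
    points, hits = [], 0
    for n, rel in enumerate(results, start=1):
        hits += rel
        points.append((hits / total_rel, hits / n))
    return points

# Two relevant items at ranks 1 and 4 out of 4 returned results:
for rec, prec in pr_points([1, 0, 0, 1]):
    print(f"recall={rec:.2f}  precision={prec:.2f}")
```

Plotting these pairs with recall on the x-axis and precision on the y-axis yields the recall-precision curve; note how precision drops between the two relevant hits and recovers at rank 4.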

Entropy based measures
Entropy-based measures derived from the field of information theory play a significant role in the validation and evaluation of clustering algorithms. These measures provide valuable insights into the quality and effectiveness of clustering results. Among the various entropy-based measures, two popular ones commonly used are Entropy and Purity, proposed by Zhao and Karypis [13], and the V-measure proposed by Rosenberg and Hirschberg [14].
The concept of entropy, borrowed from information theory, provides a quantitative measure of the uncertainty or disorder within a cluster. It assesses how well the cluster's members are distributed across different classes or categories. Lower entropy indicates a higher degree of purity and cohesion within the cluster, suggesting that the members of the cluster predominantly belong to the same class.
Purity, on the other hand, measures the homogeneity of a cluster in terms of class labels. It evaluates how well the cluster assignments align with the true class labels of the data points. A high purity score signifies that the cluster contains predominantly instances from a single class, indicating a more accurate and reliable clustering result.
The V-measure combines both entropy and purity to provide a balanced evaluation metric for clustering. It captures the trade-off between homogeneity and completeness of a clustering solution. The V-measure is particularly useful when dealing with imbalanced datasets, where some classes have a significantly larger number of instances than others.
By employing entropy-based measures such as Entropy, Purity, and the V-measure, researchers and practitioners can objectively assess the quality and coherence of clustering results. These measures help in comparing and selecting appropriate clustering algorithms, fine-tuning parameters, and optimizing the clustering process to obtain meaningful and accurate clusters.
In the given context, the variables can be defined as follows: N represents the total number of data elements, C denotes the number of standard partitions, K signifies the total number of clusters, k_i refers to the size of cluster i, and A_ij indicates the count of elements in partition j that are assigned to cluster i.
The calculation of the V-measure involves assessing the homogeneity and completeness of a clustering solution. These evaluations rely on entropy measures such as H(C) and H(K), as well as conditional entropies, including H(C|K) and H(K|C).
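A minimal sketch of the V-measure from these entropy definitions, assuming the Rosenberg-Hirschberg formulation h = 1 − H(C|K)/H(C), c = 1 − H(K|C)/H(K), V = 2hc/(h + c), with natural-log entropy; function names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given): entropy of `labels` within each group of `given`."""
    n = len(labels)
    h = 0.0
    for g in set(given):
        sub = [l for l, gv in zip(labels, given) if gv == g]
        h += len(sub) / n * entropy(sub)
    return h

def v_measure(classes, clusters):
    """V = 2hc / (h + c), with the conventional 1.0 fallback when an
    entropy is zero (single class or single cluster)."""
    hc, hk = entropy(classes), entropy(clusters)
    h = 1.0 if hc == 0 else 1 - conditional_entropy(classes, clusters) / hc
    c = 1.0 if hk == 0 else 1 - conditional_entropy(clusters, classes) / hk
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)

# A clustering that exactly matches the true classes scores 1.0:
assert v_measure([0, 0, 1, 1], ["a", "a", "b", "b"]) == 1.0
```

Because both h and c must be high for V to be high, the measure penalizes clusters that mix classes (low homogeneity) as well as classes split across clusters (low completeness).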

Shortcomings of conventional quantitative metrics in evaluation
The standard quantitative metrics fail to consider certain crucial factors that are vital for a comprehensive quantitative evaluation of content-based image retrieval systems. Firstly, they overlook the significance of a high density of relevant results, where relevant images are clustered together within a small or large collection area in the retrieved window. This characteristic is not adequately captured by existing evaluation metrics, which can be attributed to their origins as general information retrieval (IR) measures. Secondly, the discriminating power of a quantitative evaluation metric is often overlooked. This raises an important question: if two retrieval results have the same precision value, does it imply that they are similar? In other words, can we evaluate their corresponding systems as identical?
Considering these points from our perspective, it becomes evident that the existing evaluation metrics might not fully address the nuances and complexities of content-based image retrieval. There is a need for more refined metrics that take into account factors such as the clustering of relevant results and the ability to differentiate between retrieval outcomes with similar precision values. By developing and incorporating such metrics, we can improve the accuracy and effectiveness of quantitative evaluations in content-based image retrieval systems.
Table 1 shows the degree to which the above-cited properties are respected by the different measures used in this study.
In the following subsections, we discuss in detail these points, which must be verified by our proposed CBIR evaluation measure.

Relevant results density
Verification of the pertinent results in the case of image retrieval is quite different from that of general information retrieval, from which the common evaluation measures are inspired. In the case of textual search results, the verification of pertinent results among the returned list must be done in a sequential manner, from the first result to the last one [22]. However, the visual verification of pertinent images is by nature very fast, and guided by the location and the grouping of relevant images. An evaluation process starts with an inherent transformation of the returned results into a binary list containing relevant and irrelevant items. Figure 1 displays some user query results (sad and happy emojis) [22]. Even though the first result list is more precise (27.77%) than the second one (precision = 22.22%), the presentation of the returns in the first list is difficult for the user to evaluate and verify. However, when the findings are gathered together, even with a lower precision rate, the results are considerably easier for the user to evaluate. Additionally, it should be noted that, in contrast to the first list, the relevant images in the second list are located near the bottom of the 2D list. The results in this example are binary (either a sad or a happy emoji); in real situations, however, the scenario is far more complex and has more than two potential outcomes. Another problem is what we call situation search: asking a system to return only one image from a database containing many relevant images almost always yields full precision. In the next two subsections we study the effects of relevant images on the evaluation process when two systems have the same precision rate.
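As a rough illustration of this grouping effect (not the paper's proposed measure), the following sketch counts relevant hits per window for two hypothetical result lists with identical precision; the window size and list contents are illustrative assumptions based on the example in the introduction:

```python
WINDOW = 10  # hypothetical window size

def relevant_per_window(results, window=WINDOW):
    """Count relevant items (1s) in each consecutive window of the list."""
    return [sum(results[i:i + window]) for i in range(0, len(results), window)]

# System A: all 10 relevant images grouped in the first window of 100 results.
system_a = [1] * 10 + [0] * 90
# System B: one relevant image per window, scattered across 100 results.
system_b = ([1] + [0] * 9) * 10

# Both systems have the same precision (10/100 = 0.1)...
assert sum(system_a) == sum(system_b) == 10
# ...but very different groupings, which precision alone cannot distinguish:
print(relevant_per_window(system_a))  # [10, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(relevant_per_window(system_b))  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```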

Comparing results having a same full precision
An ideal CBIR system provides perfect image retrieval results, in which each image query returns a list of relevant results with no prior knowledge about its size; the size varies from one input image to another. Therefore, the returned list equals the relevant list of a given query in the database. Let P(R, N) be the precision of a retrieval result, where R represents the number of relevant images and N represents the size of the returned list. We distinguish two evaluation cases regarding the number of relevant images contained in the database:

- The effectiveness of a system when the database contains few relevant images. In this situation of full precision (P = 1), the cost of a precision mistake increases as R decreases: a minor error in retrieving R does not affect all systems' precision equally. Hence, we define the relevant error R_ERR as the minimum precision mistake made by a given system when R is reduced.

- The effectiveness of a system when the database contains many relevant images. With a large number of relevant images in the database, it is much more challenging to obtain full precision from a large returned list than from a small one. Hence, we define the retrieval error N_ERR as the maximum non-zero precision mistake made by a system.

The best retrieval situation is a system that returns one and only one exact image. This system carries the highest risk, since precision takes a binary value (0 or 1). The next best system is one whose final returned list is larger; in that case, the risk of obtaining no relevant image is higher than with a smaller returned list. We define the precision error P_ERR as the minimum of the two errors, as shown in figure 2, where the intersection of the minimum errors is depicted.

Comparing results having a same precision (P < 1)
Two results having the same precision value, P_i(R_i, N_i) = P_j(R_j, N_j), can be evaluated differently when R_i ≠ R_j. System i is better than system j when R_i < R_j, because the minimum size of the returned list needed by system i to return the same number of relevant images R_j is smaller; in that case, its precision becomes correspondingly higher.

An entropy based development for visual retrieval systems
It is important and useful for a user to see the images that he needs arranged together in the same part of the returned list. The main idea of the proposed measure is to evaluate retrieval systems according to the degree to which relevant images are grouped, which is very practical from a visual retrieval perspective.
The proposed Entropy Grouped Relevant images measure (EGR) is defined using the following notation: N represents the number of returned images, and R is the number of relevant images in the returned list. K and C are the sets of detected clusters and standard partitions, respectively. A_ij is the number of elements that are members of cluster i and partition j of the same class. Figure 3 shows an example of returned results composed of clusters (a) and partitions (b).

Experiments
In order to evaluate the proposed measure, we compare it with other precision-based measures, including the standard precision P, MAP, R-precision, and RBP [23] measures. The comparison process is built around two tests: comparison based on a fixed size of the returned list, and comparison based on different sizes.
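For reference, the RBP baseline can be sketched as follows, assuming the Moffat-Zobel definition RBP = (1 − p) Σ_i r_i p^(i−1), where r_i is the binary relevance at rank i and p is a user-chosen persistence parameter (the value 0.8 below is illustrative):

```python
def rbp(results, p=0.8):
    """Rank-Biased Precision: (1 - p) * sum_i r_i * p**(i-1), where
    results[i] is 1 if the result at rank i+1 is relevant."""
    return (1 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(results, 1))

# Relevant results early in the list score higher than the same
# relevant results placed late in the list:
assert rbp([1, 1, 0, 0]) > rbp([0, 0, 1, 1])
```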

Best ranks on a fixed returned list
As can be seen in table 2, the results are ordered and ranked according to the different measures used in this comparative study. The top five results are well ranked by the EGR and RBP measures; the best five results are highlighted in bold. Their ranks correspond well to the user ranking and to the real positions of these results. The other measures ranked them within the first three ranks. The discriminating power of the proposed measure and of the RBP measure appears in the ranking of the five best results in the five best ranks, whereas precision (P), for example, ranks 52 results in the five best places (which corresponds to 55% of all results).

Worst ranks on a fixed returned list
The superiority of the proposed entropy-based measure over the other measures in interpreting the ranking of the worst results appears in table 3. The verified worst results appear individually at each rank under the EGR measure, whereas they appear together with other results in the case of the precision measure (P). The other measures (i.e., the MAP, R-precision, and RBP measures) cannot place these results in the worst positions.

Comparison based on different sizes of a returned list
The first comparison is built around the effectiveness of the proposed measure in evaluating systems that return lists of different sizes in which all images are relevant.

As can be seen from figure 4, the best system is when R = N = 1, i.e., target search. The next best systems are ordered according to the largest sizes of their returned lists. Such results correspond well to P_ERROR, depicted in figure 2. Table 4 summarizes the five evaluation measures when R = N.

Some returned images are relevant (R ≤ N)
The first remark that can be made from table 4 is that when the results are ordered according to their EGR values, they correspond better to the human ordering than with the other measures.
This is an attempt to compare results even though they have different natures (different sizes and different numbers of relevant images returned). EGR values are very close when their corresponding results are perceptually very close. Conversely, they are very different when their corresponding results are very distinct.

Conclusions
We have proposed a new evaluation measure to assess image retrieval systems. The proposed metric is compatible and conforms with human visual evaluation. In addition to the number and the rank of the relevant images in the returned list, the proposed measure can capture and reward the presence of relevant images in a close area of the returned list. Based on the entropy of the grouping of pertinent images, the proposed measure presents a high discriminating power in several retrieval cases that the current measures evaluate as equivalent. This allows us to use the proposed CBIR evaluator as a scale rather than a mere evaluation metric. Further investigations and experiments should be conducted, encompassing diverse situations and scenarios, to establish a robust and reliable performance measure for the proposed metric in the field of image retrieval. Additionally, its applicability to other domains, such as image quality assessment and data clustering, should be explored.

Figure 1 :
Figure 1: Example of two returned lists: dispersed results and grouped results.

Figure 2 :
Figure 2: P_ERROR according to minimum error rates; P_ERROR is the same as N_ERROR except when N=1.

Table 1 :
A summary of the measures used in this study with respect to the properties of number, ranking, and grouping.

Table 2 :
Some selected results of the five best ranks according to five evaluation metrics when the returned list size is N=12.

Table 3 :
Some chosen results of the five last ranks according to five evaluation metrics when the returned list size is N=12.

Table 4 :
The best full precision results arranged by EGR measure.