\documentclass[11pt,twoside]{article}
\usepackage{informat}
\usepackage{epsfig}
\usepackage{url}
\usepackage{amsmath,amssymb,multirow,array}
\usepackage{tabularx}
\usepackage{subcaption}
\usepackage{graphics}
\usepackage{graphicx}
\usepackage{amsfonts}%
\usepackage{mathrsfs}%
\usepackage[title]{appendix}%
\usepackage{xcolor}%
\usepackage{textcomp}%
\usepackage{manyfoot}%
\usepackage{booktabs}%
\usepackage{algorithm}%
\usepackage{algorithmicx}%
\usepackage{algpseudocode}%
\usepackage{listings}%
\usepackage{geometry}
\usepackage[figuresright]{rotating}     
\usepackage{lipsum}
 \usepackage{amsthm}
 \usepackage{array,multirow,makecell}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcolumntype{R}[1]{>{\raggedleft\arraybackslash }b{#1}}
\newcolumntype{L}[1]{>{\raggedright\arraybackslash }b{#1}}
\newcolumntype{C}[1]{>{\centering\arraybackslash }b{#1}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\code}[1]{\mbox{\sc #1}}
\def\CfCi{\code{CfCi}}
\def\CfMi{\code{CfMi}}
\def\NAFCP{\code{NAFCP}}
\def\NECLAT{\code{NECLAT}}
\def\GrAFCI+{\code{GrAFCI+}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\theoremstyle{thmstyleone}%
\newtheorem{theorem}{Theorem}%  meant for continuous numbers
\newtheorem{proposition}[theorem]{Proposition}% 
%\theoremstyle{thmstyletwo}%
\newtheorem{example}{Example}%
\newtheorem{remark}{Remark}%

%\theoremstyle{thmstylethree}%
\newtheorem{definition}{Definition}

\makeatletter
\def\@opargbegintheorem#1#2#3{\trivlist
   \item[]{\bfseries #1\ #2\ (#3)} \itshape}
\makeatother


%%%%%%%%%%%%%%%%%%%%%%%%
\newtheorem{property}{Property}
%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{unsrt}

\begin{document}
  \title{Closed Itemset Mining: A graph theory perspective}
   \author{Fatima Zohra Lebbah \\
Higher School of Electrical and Energetic Engineering of Oran (ESG2E), Vicinal Road N9, Oran, 31000, Algeria\\
Laboratory of Research in Computer Science-SBA, \\
Computational Intelligence and Soft Computing Team (CISCO), Higher School of Computer Science, BP 73, Bureau de poste EL WIAM, Sidi-Belabbes, 22016, Algeria\\
fz\_lebbah@yahoo.fr\\
fz.lebbah@esgee-oran.dz\\   
      \AndAuthor
      Moussa Larbani \\
School of Mathematics and Statistics, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, K1S 5B6, Canada\\
National Higher School of Statistics and Applied Economics (ENSSEA), Pôle Universitaire de Koléa, Tipaza, Algiers, 42003, Algeria\\
larbani61@hotmail.com\\
 \AndAuthor
Abdellatif Rahmoun \\
Laboratory of Research in Computer Science-SBA, \\
Computational Intelligence and Soft Computing Team (CISCO), Higher School of Computer Science, BP 73, Bureau de poste EL WIAM, Sidi-Belabbes, 22016, Algeria\\
a.rahmoun@esi-sba.dz
      }
  \titleodd{Closed Itemset Mining: A graph theory perspective}
  \authoreven{FZ. Lebbah et al.}
  \keywords{dataset, closed frequent itemset, labeled graph, maximal clique}
%  \received{June 24, 2013}

  \abstract{Data Mining is a field that focuses on extracting and analyzing usable data from large databases. This paper specifically concentrates on the problem of finding closed frequent itemsets, which is extensively studied in the field. Previous techniques based on graph theory have used tree structures and recursive algorithms, which have limitations. In this paper, we propose a scalable modeling approach that represents a transaction dataset using an undirected and labeled graph. The labels on the edges are computed and assigned in a clever manner. We also introduce a polynomial and exact algorithm based on the clique notion in graph theory to compute all the closed frequent itemsets. Our initial testing results demonstrate the efficiency of our algorithm in terms of CPU-time and memory usage compared to recent methods in the literature. Additionally, our graph model can be easily extended when the dataset is updated. We utilize linear structures with boolean values to implement our graph and employ an exact algorithm with polynomial complexity. These aspects of our approach provide strong foundations for investigating more challenging issues related to the problem at hand.}
  \abstractSi{}

  \maketitle

\section{Introduction}\label{intro}
Nowadays, Data Mining is the science of extracting usable knowledge from data: it is the process of finding patterns and regularities in large amounts of data. However, obtaining the desired and specific information poses a significant challenge for many researchers, which has led to the development of various competing models and techniques \cite{Raedt-10, HanKamber-11, Aggarwal-15, fournier-seq-17, fournier-freq-17}. 

Itemset Mining is an appealing field within the realm of Data Mining, primarily because of the need to manage large-scale data. As the size of data increases, traditional methods and techniques used to address Itemsets’ problems tend to become slower and less efficient. Researchers are currently focused on discovering and proposing new techniques and algorithms to model and solve problems related to Itemset Mining, including computing frequent, closed, and maximal itemsets.

The process of extracting frequent sets of items bought by customers, known as finding frequent itemsets in a transaction database \cite{Agrawal-93,HanKamber-11}, has been addressed by various techniques, including Constraint Programming \cite{Raedt-08,Tias-11,Leung-09,Belaid-19,Mazouri-2019} and Graph Theory \cite{Chi-04,Schelegel-2011,Tiwari-2010,Gouda-2001,Alzoubi-2015}.

Graph theory-based approaches have already been proposed to find frequent itemsets. In this article, we present a new approach to the problem of Closed Frequent Itemsets ($CFI$s): we show that a transaction database can be effectively modeled through an undirected and labeled graph, using the concept of a clique.

In this article, we introduce a cliques-based algorithm for finding $CFI$s, which we refer to as {\CfCi}. This approach uses linear structures and a polynomial algorithm. Specifically, we show how our graph model enables the step-by-step computation of all the $CFI$s.

The article is structured as follows:

Section 2 introduces the basic definitions of the $CFI$s mining problem, along with a review of related works.

Section 3 presents graph theory notions used to transform a transaction database into an undirected and labeled graph. 

Section 4 describes the equivalence between (closed) frequent itemsets and (maximal) cliques, and how to deduce a $CFI$ from a maximal clique.

Section 5 presents our {\CfCi} algorithm and its auxiliary functions. The complexity analysis and a worked example are also given in this section.

In Section 6, we compare our algorithm {\CfCi} to the latest existing techniques in the literature to highlight its contribution. This section also includes the reporting of experimental results on various datasets and their analysis. 

In Section 7, we discuss, analyze, and interpret our findings.

Finally, Section 8 concludes the paper.

\section{Problem Definition and Related Works}\label{sec-probdef}
In this section, we begin by introducing the problem using an example of a market basket. Then, we provide basic concepts and definitions.

For instance, let's consider the market baskets illustrated in Figure \ref{fig0}, where we analyze five shopping baskets. The aim is to study customer shopping habits and identify associations and correlations between different items. In this paper, we focus on computing itemsets that are frequently purchased together by customers. Specifically, we address the problem of $CFI$s.

\begin{figure}[h]
\centerline{\includegraphics[scale = 0.25]{eps-figures/shopp-analyzis.eps}}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
\caption{Market baskets analysis}\label{fig0}
\end{figure}

Table \ref{tbl0} represents the market baskets as transactions and items. Each row corresponds to a customer transaction, and each column represents an item. The assigned letters $m$, $b$, $e$, $j$, and $s$ represent specific items: milk, bread, egg, jam, and sugar, respectively. 

According to the table, the itemset $\left\lbrace m,j\right\rbrace $ has a frequency of three, showing that three customers purchase both milk and jam. Similarly, the itemset $\left\lbrace b,e\right\rbrace $ has a frequency of two, showing that two customers purchase both bread and egg.

\begin{table}[h]
\centering
\caption{The market baskets of Figure \ref{fig0} represented as an itemset database (left) and in binary matrix notation (right).}\label{tbl0}
\setlength\tabcolsep{0.2em}
\begin{tabular}{c c c}
\begin{tabular}{ll}\hline
$\mathcal{T}$ & $\quad Itemset$ \\ \hline
$t_1$ & $\quad \left\lbrace m,b,e,j \right\rbrace $\\
$t_2$ & $\quad \left\lbrace m,s,e,j \right\rbrace $\\
$t_3$ & $\quad \left\lbrace b,e\right\rbrace $\\
$t_4$ & $\quad \left\lbrace m,j \right\rbrace $\\\hline
\end{tabular}
& $\qquad$ &
\begin{tabular}{lccccc}\hline
 & $m$ & $s$ & $b$ & $e$ & $j$ \\ \hline
$t_1$ & 1 & 0 & 1 & 1 & 1 \\
$t_2$ & 1 & 1 & 0 & 1 & 1\\
$t_3$ & 0 & 0 & 1 & 1 & 0\\
$t_4$ & 1 & 0 & 0 & 0 & 1\\\hline
\end{tabular}
\end{tabular}
\end{table}

Therefore, an itemset database problem is defined by a set of items $\mathcal{I} = \left\lbrace m,s,b,e,j \right\rbrace $ and a set of transactions $\mathcal{T}=\left\lbrace t_1,t_2,t_3,t_4 \right\rbrace $. The relationship between these items through the transactions is represented by a boolean matrix $\mathcal{D}(n\times m)$, which is called the transaction database (see Definition \ref{def-tr-mat}), where $n=\vert \mathcal{T}\vert$ is the number of transactions and $m=\vert \mathcal{I}\vert$ is the number of items.

\begin{definition}[transaction database]\label{def-tr-mat}
Let $\mathcal{T}=\left\lbrace t_1, t_2,\cdots t_n \right\rbrace$ be the set of transactions, and $\mathcal{I}=\left\lbrace i_1, i_2,\cdots i_m \right\rbrace$ the set of items. A transaction database is defined through the boolean database matrix $\mathcal{D}(n\times m)$:
\begin{equation}
\mathcal{D}_{j,k}=\left\lbrace 
\begin{array}{ll}
1 & \text{if the item } i_k \text{ occurs in the}\\
 &  \text{transaction } t_j\\
0 & \text{else}\\
\end{array}
\right. 
\end{equation}
where: \begin{description}
\item $n=\vert \mathcal{T}\vert$ is the number of transactions,
\item $m=\vert \mathcal{I}\vert$ is the number of items.
\end{description}
\end{definition}

Based on Figure \ref{fig0} and Table \ref{tbl0}, the analyzed shopping results in the following transaction database $\mathcal{D}(4\times 5)$ as shown in Matrix \ref{tbl1}.

\begin{equation}\label{tbl1}
\mathcal{D}=\left[ 
\begin{array}{ccccc}
1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 \\ 
0 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 \\  
\end{array}
\right] 
\end{equation}
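As a concrete illustration, the boolean matrix $\mathcal{D}$ can be derived from the raw baskets in a few lines of Python. This is a minimal sketch; the function name \texttt{build\_matrix} and the fixed item order $m,s,b,e,j$ are our own illustrative choices, not part of the paper's implementation.

```python
# Build the boolean transaction matrix D (rows = transactions, columns = items)
# from the raw market baskets of Table 1. Item order follows the table: m, s, b, e, j.
ITEMS = ["m", "s", "b", "e", "j"]

def build_matrix(transactions, items=ITEMS):
    """Return D with D[p][k] = 1 iff item items[k] occurs in transaction p."""
    return [[1 if item in t else 0 for item in items] for t in transactions]

baskets = [{"m", "b", "e", "j"},
           {"m", "s", "e", "j"},
           {"b", "e"},
           {"m", "j"}]
D = build_matrix(baskets)
# D reproduces Matrix (2): [[1,0,1,1,1], [1,1,0,1,1], [0,0,1,1,0], [1,0,0,0,1]]
```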

\subsection{Frequent Itemset Mining}
Before delving into the topic of Closed Frequent Itemsets, it is important to introduce the concept of support, as given in Definition \ref{def-supp}.

\begin{definition}[support measure]\label{def-supp}
Let $\mathcal{D}$ be a transaction database, $\mathcal{I}$ be the set of items, and $\mathcal{T}$ be the set of transactions. The support measure of an itemset $I\subseteq \mathcal{I}$ is denoted as $Supp(I)$ and defined as:
\begin{equation}
Supp(I)=\sum_{t\in \mathcal{T}}\prod_{i\in I}\mathcal{D}_{t,i}
\end{equation}
$Supp(I)$ is the number of transactions containing all the items belonging to $I$.
\end{definition}

\begin{definition}[frequent itemset mining]\label{def-freqitem}
Let $\mathcal{D}$ be a transaction database and $\theta>0$ be the minimum support threshold. An itemset $I$ is frequent if its support satisfies $Supp(I)\geq \theta$; otherwise ($Supp(I)< \theta$), it is infrequent.
\end{definition}

For example, let's consider the transaction database presented in Matrix \ref{tbl1}. The support values for $I_1=\left\lbrace i_2 \right\rbrace $, $I_2=\left\lbrace i_3, i_4 \right\rbrace $ and $I_3=\left\lbrace i_1, i_5 \right\rbrace $ are $1$, $2$ and $3$, respectively.
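The support computation of Definition \ref{def-supp} can be sketched as follows (the helper name \texttt{supp} is ours; items are 0-based in the code, so $i_2$ becomes column $1$):

```python
# Support of an itemset per Definition 2: the number of transactions
# whose row contains a 1 in every column of the itemset.
D = [[1, 0, 1, 1, 1],
     [1, 1, 0, 1, 1],
     [0, 0, 1, 1, 0],
     [1, 0, 0, 0, 1]]

def supp(itemset, D):
    """Supp(I) = sum over transactions t of prod_{i in I} D[t][i] (0-based columns)."""
    return sum(all(row[i] for i in itemset) for row in D)

# I1 = {i2}, I2 = {i3, i4}, I3 = {i1, i5} in the text (1-based)
# become {1}, {2, 3}, {0, 4} here, with supports 1, 2 and 3.
```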

\begin{property}[downward closure]\label{pr-dcp} 
Every subset of a frequent itemset is also frequent. In other words,
\begin{equation}\label{eq-apr}
I \textrm{ is frequent }\Rightarrow \forall I_i \subset I : I_i \textrm{ is frequent} 
\end{equation}

or equivalently:

\begin{equation}\label{eq-napr}
I \textrm{ is infrequent }\Rightarrow \forall I_i \supset I : I_i \textrm{ is infrequent} 
\end{equation}
\end{property}

Property \ref{pr-dcp} \cite{Aggarwal-15}, also known as the \textit{Apriori Principle}, is related to Property \ref{pr-mnt} and Property \ref{pr-anti-mnt} \cite{Fournier-19,Jabbour-2013}, which introduce the characteristics of monotonicity and anti-monotonicity of supports and the inclusion of frequent itemsets, respectively.
  
\begin{property}[anti-monotonicity]\label{pr-anti-mnt}\cite{agrawal-94} Let $I_1$ and $I_2$ be two itemsets such that $I_1 \subseteq I_2$. It follows that $Supp(I_1) \geq Supp(I_2)$.
\begin{equation}
\forall I_i,I_j\subseteq \mathcal{I}, I_i \subseteq I_j :  Supp(I_i) \geq Supp(I_j)
\end{equation}
\end{property}

\begin{property}[monotonicity]\label{pr-mnt} 
Let $I_1$ and $I_2$ be two itemsets such that $I_1 \subseteq I_2$. If $Supp(I_2) \geq \theta$ then $Supp(I_1) \geq \theta$.
\begin{equation}
\begin{array}{l}
\forall I_i,I_j \subseteq \mathcal{I}, I_i \subseteq I_j :  Supp(I_j) \geq \theta \Rightarrow \\ 
Supp(I_i) \geq \theta\\
\end{array}
\end{equation}
\end{property}

In other words, anti-monotonicity (Property \ref{pr-anti-mnt}) implies that the subsets of a frequent itemset are also frequent, while the support threshold propagates downward through inclusion, as stated in Property \ref{pr-mnt}.
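Both properties can be checked by brute force on the running example. The sketch below (our own illustration, not part of {\CfCi}) enumerates every pair of nested itemsets and verifies the support inequality of Property \ref{pr-anti-mnt}:

```python
# Brute-force sanity check of anti-monotonicity (Property 2):
# for every pair of itemsets I1 <= I2, Supp(I1) >= Supp(I2).
from itertools import combinations

D = [[1, 0, 1, 1, 1],
     [1, 1, 0, 1, 1],
     [0, 0, 1, 1, 0],
     [1, 0, 0, 0, 1]]

def supp(itemset, D):
    return sum(all(row[i] for i in itemset) for row in D)

items = range(len(D[0]))
# All 2^5 - 1 non-empty itemsets of the example database.
itemsets = [set(c) for r in range(1, len(D[0]) + 1) for c in combinations(items, r)]
ok = all(supp(I1, D) >= supp(I2, D)
         for I1 in itemsets for I2 in itemsets if I1 <= I2)
# ok is True on this database, as the property guarantees.
```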

\subsection{Closed Itemset Mining}
A closed itemset, denoted as $I_c$, is a frequent itemset whose support $Supp(I_c)$ is at least the fixed minimum support threshold $\theta$, and whose items all appear together in a set of transactions $T_{I_c}$ of size $Supp(I_c)$ (see Definition \ref{def-closeditem}).

\begin{definition}[closed itemset]\label{def-closeditem}
Let $I_c$ be a frequent itemset and $Supp(I_c)$ be its support. The itemset $I_c$ is closed if and only if it is not included in a larger frequent itemset $I_i$ with the same support, $Supp(I_c)=Supp(I_i)$, i.e., an itemset that appears in exactly the same transactions as $I_c$.

\begin{equation}
\begin{array}{l}
I_c\textrm{ is closed }\Leftrightarrow \nexists I_i \supset I_c: I_i\textrm{ is frequent } \wedge \\
Supp(I_c)=Supp(I_i) \wedge T_{I_i} = T_{I_c}.\\
\end{array}
\end{equation}
where $T_{I_c}$ and $T_{I_i}$ are the sets of the transactions which contain all the elements of ${I_c}$ and ${I_i}$, respectively.  
\end{definition}

\begin{figure*}[h!]
\centerline{\includegraphics[scale = 0.5]{eps-figures/Association-analyzis-part.eps}}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
\caption{\small{The search space of frequent itemset mining for the database of Table \ref{tbl0} and $\theta = 2$}}\label{fig-Ass-An}
\end{figure*}

Note that our aim is to identify all the closed itemsets. Figure \ref{fig-Ass-An} depicts the first part of the Hasse diagram for the database represented in Matrix \ref{tbl1}. An edge is drawn from the itemset $I_1$ to the itemset $I_2$ if and only if $I_1\subset I_2$ and $\vert I_2\vert = \vert I_1\vert + 1$. In this figure, the blue ellipses represent the frequent itemsets, while the green ellipses represent the closed itemsets. The support of each closed itemset is mentioned on the left side of the corresponding ellipse.

For instance, as shown in Figure \ref{fig-Ass-An}, if we set the minimum support threshold $\theta$ to $2$, the support of $\left\lbrace e \right\rbrace $ is $3$, while the support of its superset $\left\lbrace b,e \right\rbrace $ is $2$; the support of $\left\lbrace m,j \right\rbrace $ is likewise $3$. This illustrates the anti-monotonicity property (Property \ref{pr-anti-mnt}): $\left\lbrace e \right\rbrace \subset\left\lbrace b,e \right\rbrace$, and their supports are $3$ and $2$, respectively.
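For small instances, the closed frequent itemsets of Definition \ref{def-closeditem} can be enumerated by brute force. The following sketch (an illustrative baseline with our own helper names, not the {\CfCi} algorithm proposed later) recovers the closed itemsets of the running example for $\theta=2$:

```python
# Enumerate closed frequent itemsets by brute force (Definition 4):
# a frequent itemset is closed iff no proper superset has the same support.
from itertools import combinations

ITEMS = ["m", "s", "b", "e", "j"]
D = [[1, 0, 1, 1, 1],
     [1, 1, 0, 1, 1],
     [0, 0, 1, 1, 0],
     [1, 0, 0, 0, 1]]

def supp(itemset, D):
    return sum(all(row[i] for i in itemset) for row in D)

def closed_frequent(D, theta):
    n_items = len(D[0])
    universe = range(n_items)
    frequent = [frozenset(c)
                for r in range(1, n_items + 1)
                for c in combinations(universe, r)
                if supp(c, D) >= theta]
    return {I for I in frequent
            if not any(I < J and supp(I, D) == supp(J, D) for J in frequent)}

cfis = closed_frequent(D, theta=2)
named = {frozenset(ITEMS[i] for i in I) for I in cfis}
# named == { {e}, {m,j}, {b,e}, {m,e,j} }
```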

\subsection{Related Works} \label{ssec-rw}
So far, several algorithms have been developed for mining Closed Frequent Itemsets, including {\code{AprioriClose}} \cite{pasquier-99}, {\code{AprioriTID Close}} \cite{agrawal-94}, {\code{LCM}} \cite{Takeaki-2004}, {\code{CHARM}} \cite{zaki-05}, {\code{FP-Close}} \cite{Grahne-2005}, {\NAFCP} \cite{Tuong-15}, {\NECLAT} \cite{aryabarzan-21} and {\GrAFCI+} \cite{Ledmi-2021}. However, most of these methods rely on tree structures and recursive algorithms.

One primary challenge in the field of data mining is to efficiently find all the closed itemsets while optimizing CPU time and memory usage.

To address this challenge, we propose a graph model based on a linear structure that minimizes memory usage. Additionally, we introduce the exact algorithm {\CfCi} to find all the closed frequent itemsets.

To emphasize the contribution of our proposed approach in this work, we will compare the results of the {\CfCi} algorithm with the latest algorithms, namely {\NAFCP}, {\NECLAT} and {\GrAFCI+}. This comparison will be presented in Section \ref{sec-exp}. We consider not only the novelty of the chosen algorithm but also the availability of corresponding applications and the detailed aspect of the results.

Table \ref{tbComp} presents the key characteristics of the algorithms that are considered in the experiments section.

\begin{table}[h]
\centering\scriptsize
\caption{\small{Characteristics of the tested algorithms and their used structures}}\label{tbComp}
\setlength\tabcolsep{0.2em}
\begin{tabular}{l|p{1.8cm}|p{1.7cm}|p{1.7cm}}
\cline{2-4}
\multicolumn{1}{c|}{ }& Used Structure & Used Algorithm & Proposed Approach \\\hline
{\CfCi} & linear structure & iterative & exact\\ \hline
{\NAFCP} & tree structure & recursive & approximate\\ \hline
{\NECLAT} & tree structure & recursive & approximate\\ \hline
{\GrAFCI+} & tree structure & recursive & approximate\\ \hline
\end{tabular}
\end{table}
  
\section{Itemset Mining VS Undirected Graph}\label{sec-item-graph}
Firstly, we propose an undirected and labeled graph to represent a dataset. This model is based on the idea that connecting the items that appear in the same transaction forms a clique, i.e., a sub-graph in which every pair of distinct vertices is adjacent. Therefore, each transaction is represented by a clique. However, since a bond may be shared by multiple cliques, we introduce a labeling approach (see Definition \ref{def-trs-lab}) that ensures a bond shared by different cliques is accurately represented in the graph model.

In our graph model, we assign a binary value, known as a sub-label, to each transaction. Specifically, as defined in Definition \ref{def-trs-sblab}, the bond $(j,k)$ that represents the occurrence of the items $i_j$ and $i_k$ in the same transaction $t_p$ is characterized by the sub-label $\mathcal{L}^p(j,k)=10^{n-p}$, where $n$ is the number of transactions.

\begin{definition}[sub-label]\label{def-trs-sblab}
Consider a database $\mathcal{D}(n\times m)$ and a transaction $t_p\in \mathcal{T}$. The occurrence of items $i_j$ and $i_k$ in the same transaction $t_p$, represented by $\mathcal{D}\left[ p, j\right]=1$ and $\mathcal{D}\left[ p, k\right]=1$, is modeled by the bond $(j, k)$ labeled with the binary value $\mathcal{L}^p(j,k)=10^{n-p}$. This sub-label is a part of the overall label $\mathcal{L}(j,k)$ that represents the relationship between the items $i_j$ and $i_k$.
\end{definition}

Each time the pair of items $i_j$ and $i_k$ appears in a transaction, a new sub-label is added to the label of the bond $(i_j, i_k)$. In other words, the label of the bond $(i_j, i_k)$, denoted by $\mathcal{L}(i_j,i_k)$, is the sum of all the sub-labels associated with that bond, as expressed by Equation \ref{eq-label} in Definition \ref{def-label}.

\begin{definition}[bond label]\label{def-label}
Let $\mathcal{G}=\langle X,U,\mathcal{L} \rangle$ be the graph model of a transaction database. The label of the bond $(i,j)\in U$ is given by:
\begin{equation}\label{eq-label}
\mathcal{L}(i,j)=\sum_{t_p\in \mathcal{T}}\mathcal{L}^p(i,j)
\end{equation}
\end{definition}

In Definition \ref{def-size-lab}, we introduce the notion of label size, denoted by $HW(\mathcal{L}(i,j))$, which is the number of ones, i.e., of sub-labels, included in $\mathcal{L}(i,j)$.

\begin{definition}[label size]\label{def-size-lab}
Let $\mathcal{G}=\langle X,U,\mathcal{L} \rangle$ be the graph model of a transaction database. The size of the label $\mathcal{L}(i,j)$, denoted as $HW(\mathcal{L}(i,j))$, is the Hamming weight of $\mathcal{L}(i,j)$, i.e., the number of included sub-labels.
\end{definition}
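Since a label is a binary word, its size is simply a population count. A one-line sketch, with labels kept as digit strings such as \texttt{"110"} (our own representation choice, matching the matrices shown below):

```python
# Label size HW (Definition 7): the number of ones in the binary word
# of a bond label, here stored as a string of 0/1 digits.
def hw(label):
    """Hamming weight of a label such as "110"."""
    return label.count("1")

# hw("110") == 2, hw("111") == 3, hw("000") == 0
```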

Definition \ref{def-trs-cl} formalizes the idea that the items appearing in a transaction $t_p$ are represented by a clique. To differentiate one transaction, or clique, from another, we assign the corresponding sub-label $\mathcal{L}^p(i,j)$ to each bond $(i,j)$.

\begin{definition}[transaction VS clique]\label{def-trs-cl}
Let $\mathcal{D}(n\times m)$ be a transaction database defined on the set $\mathcal{T}$ of transactions and the set $\mathcal{I}$ of items. Suppose that the transaction $t_p\in \mathcal{T}$ includes all the items of the itemset $I\subseteq \mathcal{I}$. The transaction $t_p$ is represented by the clique $\mathcal{C}_p$, where:
\begin{equation}
\mathcal{C}_p=\left\lbrace i_j,i_k\in I : \mathcal{L}^p(i_j,i_k)=10^{n-p}\right\rbrace 
\end{equation}
\end{definition}

In Definition \ref{def-trs-lab}, we show how each transaction and its included items are modeled by a labeled clique in the corresponding undirected and labeled graph, called the dataset graph.

\begin{definition}[\small{dataset VS Labeled graph}]\label{def-trs-lab}
Let $\mathcal{D}(n\times m)$ be a transaction database, where $n=\vert\mathcal{T}\vert$ represents the number of transactions and $m=\vert\mathcal{I}\vert$ represents the number of items. We associate with $\mathcal{D}$ the undirected and labeled graph $\mathcal{G}=\left\langle X,U,\mathcal{L}\right\rangle $ where:
\begin{description}
\item $X$: set of vertices that represents the set of items, with $$\vert X\vert=\vert \mathcal{I}\vert=m$$
\item $U$: set of bonds which are defined as follows:
\begin{equation}
\begin{split}
& \left\lbrace \begin{array}{ll}
(i_j,i_k)\in U & \textrm{ if  } \exists t_p\in \mathcal{T}:\mathcal{D}\left[p,j \right]=1\\
& \wedge \mathcal{D}\left[p,k \right]=1\\
(i_j,i_k)\notin U & \textrm{else}\\
\end{array}
\right.\\
&\mbox{and}\\
&\vert U\vert = m+\sum_{i=1..n}\sum_{j=1..m}\mathcal{D}\left[ i,j \right]-n
\end{split}
\end{equation}
\item $\mathcal{L}_{ij}$: binary value assigned to the bond $(i,j)\in U$, which is defined as follows:
\begin{equation}\label{labcomp}
\mathcal{L}_{i_j,i_k}=\sum_{p=1}^{n}\mathcal{D}\left[ p,j\right]*\mathcal{D}\left[ p,k\right]*10^{n-p}
\end{equation}
\end{description}
\end{definition}


For instance, let's consider the itemset mining problem \cite{Tias-11} described below:
\begin{center}
$\begin{array}{c|c|c|c|c|}
 & i_1 & i_2 & i_3 & i_4 \\ \hline  
t_1 & 1 & 0 & 1 & 1 \\ \hline
t_2 & 1 & 1 & 0 & 1 \\ \hline
t_3 & 0 & 0 & 1 & 1 \\ \hline
\end{array}$
\end{center}

In Figures \ref{fig1cl1}, \ref{fig1cl2}, and \ref{fig1cl3}, we depict the cliques generated from the transactions $t_1$, $t_2$, and $t_3$, respectively.
  

\begin{figure}[h]\centering
\begin{subfigure}{0.3\textwidth}
    \includegraphics[width=0.8\textwidth]{eps-figures/example1clique1.eps}
    \caption{$t_1$ VS $C_1$}\label{fig1cl1}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
    \includegraphics[width=0.8\textwidth]{eps-figures/example1clique2.eps}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
	\caption{$t_2$ VS $C_2$}\label{fig1cl2}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
    \includegraphics[width=0.8\textwidth]{eps-figures/example1-clique3.eps}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
	\caption{$t_3$ VS $C_3$}\label{fig1cl3}
\end{subfigure}
\caption{Transactions VS Cliques}\label{ex-cliques}
\end{figure}

Therefore, our graph model $\mathcal{G}=\left\langle X,U,\mathcal{L}\right\rangle $ is formed by combining all the cliques extracted from $\mathcal{D}$, and the corresponding adjacency matrix $\mathcal{A}$ will be a binary symmetric matrix, as defined in Definition \ref{def-adj-mat}.

\begin{definition}[items adjacency matrix]\label{def-adj-mat}
Let $\mathcal{G}=\left\langle X,U, \mathcal{L}\right\rangle$ be the graph modeling the database $\mathcal{D}(n\times m)$. The corresponding adjacency matrix $\mathcal{A}$ to $\mathcal{G}$ is defined as follows: 
\begin{equation}
a_{jk}=\left\lbrace \begin{array}{l}
\mathcal{L}(i_j,i_k) \textrm{ if   }(i_j,i_k)\in U\\
0 \textrm{ else}\\
\end{array}
\right.
\end{equation}
where:

$\mathcal{L}(i_j,i_k)=\sum_{p=1}^n \mathcal{L}^p(i_j,i_k)$
\end{definition}

We illustrate through Figures \ref{fig-sub01}, \ref{fig-sub02} and \ref{fig-sub03} the sub-graphs $\mathcal{G}^1=\left\langle X,U,\mathcal{L}^1\right\rangle $, $\mathcal{G}^2=\left\langle X,U,\mathcal{L}^2\right\rangle $ and $\mathcal{G}^3=\left\langle X,U,\mathcal{L}^3\right\rangle $, which model the transactions $t_1$, $t_2$ and $t_3$, respectively. On the right of each sub-graph $\mathcal{G}^p$, we give the corresponding adjacency matrix $\mathcal{L}^p$ containing the bond sub-labels.
 
\begin{center}
\begin{minipage}{0.4\textwidth}
\begin{figure}[H]
\centerline{\includegraphics[scale = 0.3]{eps-figures/sublabel01.eps}}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
\caption{$\mathcal{G}^1=\left\langle X,U,\mathcal{L}^1\right\rangle$ models the transaction $t_1$}\label{fig-sub01}
\end{figure}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{center}
$\mathcal{L}^1=\left[ 
\begin{tabular}{cccc}
100 & 000 & 100 & 100 \\
000 & 000 & 000 & 000 \\
100 & 000 & 100 & 100 \\
100 & 000 & 100 & 100 \\
\end{tabular}
\right] $
\end{center}
\end{minipage}
\end{center}

\begin{center}
\begin{minipage}{0.4\textwidth}
\begin{figure}[H]
\centerline{\includegraphics[scale = 0.3]{eps-figures/sublabel02.eps}}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
\caption{$\mathcal{G}^2=\left\langle X,U,\mathcal{L}^2\right\rangle $ models the transaction $t_2$}\label{fig-sub02}
\end{figure}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{center}
$\mathcal{L}^2=\left[ 
\begin{tabular}{cccc}
010 & 010 & 000 & 010 \\
010 & 010 & 000 & 010 \\
000 & 000 & 000 & 000 \\
010 & 010 & 000 & 010 \\
\end{tabular}
\right] $
\end{center}
\end{minipage}
\end{center}

\begin{center}
\begin{minipage}{0.4\textwidth}
\begin{figure}[H]
\centerline{\includegraphics[scale = 0.3]{eps-figures/sublabel03.eps}}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
\caption{$\mathcal{G}^3=\left\langle X,U,\mathcal{L}^3\right\rangle $ models the transaction $t_3$}\label{fig-sub03}
\end{figure}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{center}
$\mathcal{L}^3=\left[ 
\begin{tabular}{cccc}
000 & 000 & 000 & 000 \\
000 & 000 & 000 & 000 \\
000 & 000 & 001 & 001 \\
000 & 000 & 001 & 001 \\
\end{tabular}
\right] $
\end{center}
\end{minipage}
\end{center}

As illustrated in Figure \ref{fig1}, according to Definition \ref{def-adj-mat}, the combination of the sub-graphs $\mathcal{G}^1$, $\mathcal{G}^2$ and $\mathcal{G}^3$, provides the labeled and undirected graph $\mathcal{G}=\left\langle X,U,\mathcal{L}\right\rangle $. The corresponding adjacency matrix $\mathcal{A}=\mathcal{L}^1+\mathcal{L}^2+\mathcal{L}^3$ is given on the right side of the figure.

\begin{center}
\begin{minipage}{0.4\textwidth}
\begin{figure}[H]
\centerline{\includegraphics[scale = 0.3]{eps-figures/example10.eps}}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
\caption{$\mathcal{G}=\left\langle X,U,\mathcal{L}\right\rangle $ models the database transactions $t_1$, $t_2$ and $t_3$}\label{fig1}
\end{figure}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{center}
$\mathcal{A}_\mathcal{D}=\left[ 
\begin{tabular}{cccc}
110 & 010 & 100 & 110 \\
010 & 010 & 000 & 010 \\
100 & 000 & 101 & 101 \\
110 & 010 & 101 & 111 \\
\end{tabular}
\right] $
\end{center}
\end{minipage}
\end{center}
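The whole adjacency matrix $\mathcal{A}$ can be computed directly from Equation \ref{labcomp}. In the sketch below (our own illustration, not the paper's implementation), each label is stored, as in the matrices above, as a decimal number whose digits form the binary word of sub-labels; with 0-based indices, transaction $t_{p+1}$ contributes $10^{n-(p+1)}$, and leading zeros are simply not shown in the integer form ($010$ becomes $10$):

```python
# Labeled adjacency matrix A (Definition 9 / Equation labcomp) for the
# 4-item, 3-transaction example. Labels are decimal numbers whose digits
# form the binary word of sub-labels; loops (j, j) are included.
D = [[1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 0, 1, 1]]

def adjacency(D):
    n, m = len(D), len(D[0])
    # 0-based p: transaction t_{p+1} contributes 10**(n-(p+1)) to its clique.
    return [[sum(D[p][j] * D[p][k] * 10 ** (n - p - 1) for p in range(n))
             for k in range(m)] for j in range(m)]

A = adjacency(D)
# A matches the matrix shown above: e.g. A[0][0] == 110 and A[3][3] == 111.
```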


\section{Frequent Itemset VS Clique} \label{sec-fc-clique}
In this section, we present the utilization of the maximal clique principle in our approach to find the closed itemsets. 

The concept of a frequent itemset is characterized in Definition \ref{def-fr-lab}, using the concept of a clique in graph theory. In other words, a clique in the dataset's corresponding graph $\mathcal{G}=\left\langle X,U,\mathcal{L}\right\rangle $ models a frequent itemset, where the contained vertices correspond to items.
   
\begin{definition}[frequent itemset VS Clique]\label{def-fr-lab}
Let's consider the adjacency matrix $\mathcal{A}(n\times n)$, the set of items $\mathcal{I}$, the set of transactions $\mathcal{T}$ and the min-support $\theta$. An itemset $I$ is frequent if and only if its items are vertices of a clique $\mathcal{C}$ in $\mathcal{G}$, such that all the labels of the corresponding bonds share at least $\theta$ sub-labels.
\begin{equation}
\left\vert \bigcap_{i,j\in I}\mathcal{SL}(i,j)\right\vert \geqslant \theta
\end{equation}
where $\mathcal{SL}(i,j)$ is the set of the sub-labels whose sum equals $\mathcal{A}\left[ i,j\right] $
\end{definition}
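Definition \ref{def-fr-lab} suggests a direct bitmask reading: encode the sub-label of transaction $t_p$ as bit $n-p$, so that intersecting sub-label sets becomes a bitwise AND. The sketch below (helper names are ours) tests an itemset against the min-support $\theta$ on the 3-transaction example:

```python
# Frequency test via shared sub-labels (Definition 10): an itemset is
# frequent iff the labels of all its bonds (loops included) share at
# least theta sub-labels. Sub-label of t_p = bit (n - p), so the
# intersection of sub-label sets is a bitwise AND.
D = [[1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 0, 1, 1]]

def label_mask(j, k, D):
    """Bitmask version of the bond label L(i_j, i_k) (0-based items)."""
    n = len(D)
    return sum((D[p][j] & D[p][k]) << (n - p - 1) for p in range(n))

def is_frequent(itemset, D, theta):
    shared = -1  # all bits set: neutral element of bitwise AND
    for a in itemset:
        for b in itemset:
            shared &= label_mask(a, b, D)
    shared &= (1 << len(D)) - 1  # keep only the n transaction bits
    return bin(shared).count("1") >= theta

# {i1, i4} (0-based {0, 3}) shares the sub-labels of t_1 and t_2 -> frequent for theta = 2;
# {i2, i3} (0-based {1, 2}) shares none -> infrequent.
```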

To compute the $CFI$s, the infrequent items (see Property \ref{pr-dcp}) and empty transactions should be removed. For instance, consider the dataset given in Figure \ref{bef-rem}. 

\begin{figure}%
  \centering
  \subfloat[][]{
  \begin{tabular}{@{}l|lllll@{}}
 & $i_1$ & $i_2$ & $i_3$ & $i_4$ & $i_5$\\ 
\midrule
$t_1$ & 1 & 0 & 1 & 1 & 0 \\ 
$t_2$ & 1 & 1 & 0 & 1 & 0 \\ 
$t_3$ & 0 & 0 & 1 & 1 & 0 \\
$t_4$ & 0 & 0 & 0 & 0 & 1 \\
%\botrule
\end{tabular}}%
  \qquad
  \subfloat[][]{
$\mathcal{D}=\left[ \begin{array}{ccccc}
1 & 0 & 1 & 1 & 0\\
1 & 1 & 0 & 1 & 0\\
0 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 0 & 1\\
\end{array}\right] $
}
\caption{Example of itemset mining problem including $5$ items and $4$ transactions}\label{bef-rem}%
\end{figure}

As illustrated in Figure \ref{after-rem}, if the min-support is set to $\theta=2$, we notice that $i_2$ and $i_5$ are infrequent and should be removed. Thus, the dataset $\mathcal{D}$ is replaced by $\mathcal{D'}$ (see sub-Figure \ref{rem-a}), which still includes the now-empty transaction $t_4$. Then, $t_4$ is deleted to obtain $\mathcal{D''}$ (see sub-Figure \ref{rem-b}).

\begin{figure}[h!]%
  \centering
  \subfloat[][]{
 \begin{tabular}{ccc}
$\begin{array}{c|c|c|c|c|}
 & i_1 & i_3 & i_4 \\ \hline  
t_1 & 1 & 1 & 1 \\ \hline
t_2 & 1 & 0 & 1 \\ \hline
t_3 & 0 & 1 & 1 \\ \hline
t_4 & 0 & 0 & 0 \\ \hline
\end{array}$
& $\quad$ &
$\mathcal{D'}=\left[ \begin{array}{ccc}
1 & 1 & 1 \\
1 & 0 & 1 \\
0 & 1 & 1 \\
0 & 0 & 0 \\
\end{array}\right] $
\end{tabular}
\label{rem-a}}
  \qquad
  \subfloat[][]{
\begin{tabular}{ccc}
$\begin{array}{c|c|c|c|c|}
 & i_1 & i_3 & i_4 \\ \hline  
t_1 & 1 & 1 & 1 \\ \hline
t_2 & 1 & 0 & 1 \\ \hline
t_3 & 0 & 1 & 1 \\ \hline
\end{array}$
& $\quad$ &
$\mathcal{D''}=\left[ \begin{array}{ccc}
1 & 1 & 1 \\
1 & 0 & 1 \\
0 & 1 & 1 \\
\end{array}\right] $
\end{tabular}
\label{rem-b}}
  \qquad
  \subfloat[][]{
  \begin{tabular}{c}
  \includegraphics[scale = 0.3]{eps-figures/example10-clean1.eps}\\
  \small{$\mathcal{G}_\mathcal{D''}=\langle X_\mathcal{D''},U_\mathcal{D''},\mathcal{L}_\mathcal{D}\rangle$}
  \end{tabular}
\label{rem-c}}
  \caption{Removing infrequent items and empty transactions}\label{after-rem}
\end{figure}

The graph corresponding to $\mathcal{D''}$, called $\mathcal{G''}=\langle X'',U'',\mathcal{L}\rangle$ (see sub-Figure \ref{rem-c}), will be the input of our proposed algorithms and will be denoted $\mathcal{G}=\langle X,U,\mathcal{L}\rangle$ in the sequel.
 
Every pair $i_j,i_k \in I_c$, where $I_c$ is a closed itemset, is frequent, as stated in Definition \ref{def-closeditem}. Therefore, if we project this concept onto the graph model $\mathcal{G}=\langle X,U,\mathcal{L}\rangle$, the label of each bond $(i, j)\in U$ must contain at least $\theta$ ones. Stated differently, every bond in our graph needs to be consistent (Definition \ref{def-cons-bnd}). 
 
\begin{definition}[consistent bond]\label{def-cons-bnd}
Let $\mathcal{G}=\langle X,U,\mathcal{L}\rangle$ be the graph model of the dataset $\mathcal{D''}$. Suppose that $\mathcal{L}(i,j)$ is the label corresponding to the bond $(i,j)$, and that $HW(\mathcal{L}(i,j))$ is the number of ones in $\mathcal{L}(i,j)$ (its Hamming weight). If $HW(\mathcal{L}(i,j))\geqslant\theta$, then $(i,j)$ is consistent; otherwise, $(i,j)$ is inconsistent.
\end{definition}
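Definition \ref{def-cons-bnd} reduces to a Hamming-weight test on the bond's label; a minimal Python sketch (assuming, for illustration, that labels are stored as 0/1 vectors):

```python
# Consistency test of a bond's label (our sketch of Definition 
# "consistent bond"): HW is the number of ones in the label.
def hamming_weight(label):
    return sum(label)

def is_consistent(label, theta):
    return hamming_weight(label) >= theta

# Bond (i1, i4) of the running example carries label 110:
assert is_consistent([1, 1, 0], theta=2)      # HW = 2 >= theta
assert not is_consistent([1, 0, 0], theta=2)  # HW = 1 <  theta
```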

Consequently, as stated in Definition \ref{def-cons-gr}, if all the bonds of the graph are consistent, the graph is consistent.

\begin{definition}[Consistent data mining graph]\label{def-cons-gr}
Let $\mathcal{G}_c=\langle X_c,U_c,\mathcal{L}\rangle$ be the dataset graph. $\mathcal{G}_c$ is consistent if and only if all its bonds are consistent:
\begin{equation}
\begin{array}{l}
\mathcal{G}_c=\langle X_c,U_c,\mathcal{L}\rangle \mbox{ is consistent }\Leftrightarrow \forall (i,j)\in U_c :\\
 (i,j) \mbox{ is consistent.}\\
\end{array}
\end{equation}
\end{definition}

Thus, the input graph of our algorithm should be a consistent graph whose inconsistent bonds are omitted. As illustrated in Figure \ref{fig1-clean2}, $\mathcal{G}_c=\langle X_c,U_c,\mathcal{L}_c\rangle$ is the consistent version of $\mathcal{G}$ (see Figure \ref{fig1}). The corresponding adjacency matrix $\mathcal{A}(3\times 3)$ is shown on the right side of the figure. 

\begin{center}
\begin{minipage}{0.3\textwidth}
\begin{figure}[H]
\centerline{\includegraphics[scale = 0.3]{eps-figures/example10-clean2.eps}}%\epsfxsize=3in \epsffile{cdo-fig2.ps}}
\caption{The consistent graph $\mathcal{G}_c=\langle X_c,U_c,\mathcal{L}_c\rangle$ (see sub-Figure \ref{rem-c})}\label{fig1-clean2}
\end{figure}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{center}
$\mathcal{A}=\left[ 
\begin{tabular}{ccc}
110 & 0 & 110 \\
0 & 101 & 101 \\
110 & 101 & 111 \\
\end{tabular}
\right] $
\end{center}
\end{minipage}
\end{center}
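The adjacency matrix above can be reproduced under one plausible reading of the construction (our assumption, consistent with the example, not the authors' code): the label of bond $(i,j)$ is the component-wise AND of columns $i$ and $j$ of $\mathcal{D''}$, and inconsistent bonds are replaced by $0$:

```python
# Rebuild the labelled adjacency matrix A from the cleaned dataset
# (sketch under the stated assumption about how labels are formed).
def adjacency(D, theta):
    n = len(D[0])
    cols = [[row[j] for row in D] for j in range(n)]
    A = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lab = [a & b for a, b in zip(cols[i], cols[j])]
            # keep the bond only if it is consistent (HW >= theta)
            A[i][j] = lab if sum(lab) >= theta else 0
    return A

D2 = [[1, 1, 1], [1, 0, 1], [0, 1, 1]]   # the dataset D'' of the example
A = adjacency(D2, theta=2)
# A[0][2] == [1, 1, 0] (110), A[0][1] == 0, A[2][2] == [1, 1, 1] (111)
```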

\section{{\CfCi} Exact Algorithm For Complete Enumeration}\label{cfci-sec}
%\subsection{{\CfCi} algorithm}
In Definition \ref{def-gr-cl}, we introduce the closed itemset notion in terms of graph theory. A $CFI$ corresponds to a maximal clique (see Definition \ref{def-max-cli}) whose bonds share at least $\theta$ sub-labels.

\begin{definition}[maximal clique]\label{def-max-cli}
Let $\mathcal{G}=\left\langle X,U\right\rangle $ be an undirected graph. A clique $C$ is maximal if it cannot be extended into a larger clique; in other words, $C$ is not a subset of a larger clique. 
\end{definition}

\begin{definition}[closed itemset VS dataset graph]\label{def-gr-cl}
Let $\mathcal{A}(n\times n)$ be the adjacency matrix of the consistent graph $\mathcal{G}=\left\langle X,U,\mathcal{L}\right\rangle $. An itemset $I$ is closed if its items are the vertices of a maximal clique $\mathcal{CQ}$ such that all the labels of the corresponding bonds share at least $\theta$ sub-labels.
\end{definition}

In graph theory, finding all maximal cliques is a hard problem: their number can be exponential in the graph's size, so no polynomial enumeration algorithm exists in general. However, in this paper, we propose Algorithm \ref{algo-Freq-Items}, called {\CfCi}. This algorithm, based on the bonds' sub-labels, is exact and has polynomial complexity.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
Consider, for example, the graph given in Figure \ref{fig1-clean2}. The corresponding structures are given below.

\begin{center}
\begin{minipage}{0.23\textwidth}
$\mathcal{D}=\left[ 
\begin{tabular}{ccc}
1 & 1 & 1 \\
1 & 0 & 1 \\
0 & 1 & 1 \\
\end{tabular}
\right] $
\end{minipage}
\begin{minipage}{0.23\textwidth}
\begin{center}
$\mathcal{A}=\left[ 
\begin{tabular}{ccc}
110 & 0 & 110 \\
0 & 101 & 101 \\
110 & 101 & 111 \\
\end{tabular}
\right] $
\end{center}
\end{minipage}

\begin{minipage}{0.25\textwidth}
\begin{center}
$Labs=\left[  110,101,111 \right]   $ 

$supp=\left[ 2,2,3 \right]$

$IndVect=\left[ 
\begin{tabular}{l}
1 \\
2 \\
3 \\
\end{tabular}
\right] $
\end{center}
\end{minipage}
\end{center} 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Algorithm \ref{algo-FCIB-Items}, having the polynomial complexity $O(\frac{nbcons\times (nbcons-1)}{2})$ (see Proposition \ref{prp2}), provides the $CFI$s corresponding to the initial labels that belong to $Labs$.

\begin{algorithm*}[h!]
\caption{BasicCFI($M$)}
\label{algo-FCIB-Items}
\begin{algorithmic}[1]
\Require{$\theta$: threshold, $M$: adjacency matrix of the cleaned graph, $nbcons$: number of consistent vertices; $ConsItem$: vector of consistent vertices;}
\State $VertexSet\leftarrow \emptyset$;
\For {$i=0..nbcons$} \Comment{the vertices are covered, according to their labels,\\ from the smallest to the biggest one}
\State  $VertexSet \leftarrow VertexSet \cup \left\lbrace i \right\rbrace$; 
\State $IndVect\leftarrow IndVect \cup \left\lbrace i \right\rbrace$;$SuppVect\leftarrow SuppVect \cup \left\lbrace supp\left[ i \right]  \right\rbrace$
%\While {$v \in succ[i]$}
\For {$j=i+1..nbcons$}
\If {$DistLabs(i,j)=supp\left[ j\right] $}\Comment{return the distance between the\\ labels $i$ and $j$}
\State $VertexSet\leftarrow VertexSet \cup \left\lbrace j \right\rbrace$
\EndIf
\EndFor
\If {$VertexSet\neq \emptyset$}
\State $FCI\leftarrow FCI \cup \left\lbrace VertexSet\right\rbrace $
\EndIf
\State $VertexSet \leftarrow \emptyset$
\EndFor
\State \Return $FCI$
\end{algorithmic}
\end{algorithm*}
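The following runnable sketch mirrors $BasicCFI$ under our reading of its $DistLabs$ test (labels kept as transaction bitmasks; vertex $j$ joins the clique seeded by $i$ exactly when every transaction containing $i$ also contains $j$). It is an interpretation for illustration, not the authors' code:

```python
# DistLabs(a, b): number of shared sub-labels, i.e. the Hamming weight
# of the bitwise AND of the two transaction bitmasks.
def dist_labs(a, b):
    return bin(a & b).count("1")

def basic_cfi(labs, supp, theta):
    """labs[i]: transaction bitmask of vertex i; supp[i] = HW(labs[i])."""
    fci = []
    for i in range(len(labs)):
        if supp[i] < theta:
            continue  # only consistent vertices seed a clique
        vertex_set = {i}
        for j in range(len(labs)):
            # j joins the clique of i when the shared sub-labels cover
            # the whole label of i (our reading of the DistLabs test).
            if j != i and dist_labs(labs[i], labs[j]) == supp[i]:
                vertex_set.add(j)
        fci.append(frozenset(vertex_set))
    return fci

# Running example: Labs = [110, 101, 111] read as bitmasks, theta = 2.
labs = [0b110, 0b101, 0b111]
cfis = basic_cfi(labs, [2, 2, 3], theta=2)
# yields the cliques {i1, i4}, {i3, i4} and {i4}
```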

\begin{proposition}\label{prp2} Algorithm \ref{algo-FCIB-Items} has the worst case complexity $\approx O(\frac{nbcons^2-nbcons}{2})$, where $nbcons$ is the number of the consistent vertices.
\end{proposition}
\begin{proof}\label{proof2}
In this algorithm, we have two nested loops. The outer loop treats at most $nbcons$ vertices, and the inner one processes, on average, $\frac{nbcons-1}{2}$ vertices. Thus, the complexity of $BasicCFI$ is $\approx O(\frac{nbcons^2-nbcons}{2})$.
\end{proof}

Let's consider, for example, the graph $\mathcal{G}=\left\langle X,U,\mathcal{L} \right\rangle$ illustrated in Figure \ref{fig-total}. 

\begin{figure}[h!]\centering
   \includegraphics[scale = 0.3]{eps-figures/graph-theta1.eps}\caption{\small{$\mathcal{G}=\left\langle X,U,\mathcal{L} \right\rangle$ models the dataset given in Table \ref{tbl0}}}\label{fig-total}
\end{figure}

We suppose that $\theta=1$, which makes the graph consistent (see Definition \ref{def-cons-gr}) and thus a valid input for our algorithm.

Figure \ref{max-Bcl} shows the different steps of Algorithm \ref{algo-FCIB-Items} applied to the graph. The sub-figures contain green, blue and red arrows, indicating respectively that the intersection of the first label with the second one yields a label equal to the first, a label with support $\geqslant \theta$, or a label with support $< \theta$.

\begin{figure}[h!]\centering
\begin{subfigure}{0.5\textwidth}\centering
   \includegraphics[scale = 0.4]{eps-figures/example-basicCFI-1.eps}\caption{\small{$I_1=\left\lbrace i_5\right\rbrace $}}\label{fig-Bcl1}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}\centering
    \includegraphics[scale = 0.4]{eps-figures/example-basicCFI-2.eps}
\caption{\small{$I_2=\left\lbrace i_2,i_1,i_4\right\rbrace $}}\label{fig-Bcl2}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}\centering
    \includegraphics[scale = 0.4]{eps-figures/example-basicCFI-3.eps}
\caption{\small{$I_3=\left\lbrace i_3,i_4\right\rbrace $}}\label{fig-Bcl3}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}\centering
    \includegraphics[scale = 0.4]{eps-figures/example-basicCFI-4.eps}
\caption{\small{$I_4=\left\lbrace i_1,i_4\right\rbrace $}}\label{fig-Bcl4}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}\centering
    \includegraphics[scale = 0.4]{eps-figures/example-basicCFI-5.eps}
\caption{\small{$I_5=\left\lbrace i_4\right\rbrace $}}\label{fig-Bcl5}
\end{subfigure}
\caption{Generated $CFI$s by Algorithm \ref{algo-FCIB-Items}. $IndVect$ and $SuppVect$ contain the vertices that are at the origin of the corresponding labels and their supports, respectively.}\label{max-Bcl}
\end{figure}

Algorithm \ref{algo-Freq-Items}, called {\CfCi}, is the proposed algorithm, whose polynomial complexity is $\approx O(\frac{l^3-l^2}{2}(v-1))$, where $l$ is the number of labels and $v$ is the number of vertices (see Proposition \ref{prp1}). It provides all the $CFI$s of the tackled dataset, since the second loop of the algorithm computes all the possible label intersections. In other words, each newly generated label whose $supp\geqslant \theta$ leads to a new $CFI$. 

\begin{algorithm*}[h!]
\caption{\CfCi()}
\label{algo-Freq-Items}
\begin{algorithmic}[1]
\Require{$\theta$: threshold, $M$: adjacency matrix of the cleaned graph, $nbcons$: number of consistent vertices; $ConsItem$: vector of consistent vertices; $Labs$: set of the generated labels by $BasicCFI$ algorithm}
\State $beg\leftarrow 0$; $end\leftarrow \vert Labs\vert$; $e\leftarrow \vert Labs\vert$ 
\For{$k=0\cdots \vert Labs\vert$}
\State $bg=beg[k]$; $ed=end[k]$
\For {$i=bg\cdots ed$}
\State $b\leftarrow 0$
\For{$j=i+1\cdots ed$}
\State $cl\leftarrow 0$;$dst\leftarrow DistLabs(Labs[i],Labs[j])$
\If {$dst\neq SuppVect[i] \wedge dst \geq \theta$}  
\State $cl=comp2Labs(Labs[i],Labs[j])$\Comment{return to $cl$ the intersection of the labels $i$ and $j$}
\If {$cl\notin Labs$}
\State $Labs\leftarrow Labs \cup \{cl\}$; $IndVect\leftarrow IndVect\cup \{i\}$
\State $SuppVect\leftarrow SuppVect \cup \{dst\}$
\State $VertexSet\leftarrow NewClique(\vert Labs\vert-1)$
\If {$VertexSet\neq \emptyset$}
\State $VertexSet\leftarrow VertexSet\cup \{ind\}$;$FCI\leftarrow FCI\cup \{VertexSet\}$
\State $VertexSet\leftarrow \emptyset$			
\EndIf
\State $b\leftarrow b+1$
\EndIf
\EndIf
\EndFor
\If {$b\geq 1$} 
\State $beg\leftarrow beg + e$; $end\leftarrow end + {ed +b}$; $e\leftarrow e+b$
\EndIf
\EndFor
\EndFor
\end{algorithmic}
\end{algorithm*}
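The label-generation core of {\CfCi} (the intersection step) can be sketched as follows, under our reading: every pairwise intersection of known labels that is new and has support at least $\theta$ is added as a label and may induce a further $CFI$. The helper names mirror the pseudocode ($comp2Labs$, $DistLabs$), but the code is only an illustrative assumption, not the authors' implementation:

```python
# Iteratively generate all new consistent labels by pairwise
# intersection (sketch of the label-generation core of CfCi).
def cfci_labels(labs, theta):
    """labs: initial vertex labels as transaction bitmasks."""
    seen = set(labs)
    work = list(labs)
    k = 0
    while k < len(work):
        for j in range(len(work)):
            cl = work[k] & work[j]         # comp2Labs: label intersection
            dst = bin(cl).count("1")       # DistLabs: shared sub-labels
            if dst >= theta and cl not in seen:
                seen.add(cl)               # a new consistent label ...
                work.append(cl)            # ... may generate a new CFI
        k += 1
    return work

# With theta = 1, labels 110 and 101 generate the new label 100:
labs = cfci_labels([0b110, 0b101, 0b111], theta=1)
# labs now also contains 0b100
```

The loop terminates because the set of possible labels (subsets of transactions) is finite and each iteration only adds labels never seen before.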
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

To find a new maximal clique, our algorithm calls the function $NewClique$ (see Algorithm \ref{algo-NewClique}). $NewClique(r)$ computes the vertices incident to $IndVect\left[ r\right] $ that share at least $\theta$ sub-labels contained in $Labs\left[ r \right]$.

\begin{algorithm}[h!]
\caption{$NewClique$}
\label{algo-NewClique}
\begin{algorithmic}[1]
\Function{$NewClique$}{$r$}
\State $VertexSet\leftarrow\emptyset$
\State $ind\leftarrow IndVect[r]$
\While {$v\in succ[ind]$}
\State $dst=DistLabs(r,v)$   				
\If {$dst=SuppVect[r]$} 
\State $VertexSet\leftarrow VertexSet\cup \{v\}$
\EndIf					        								        		
\EndWhile
\State \Return $VertexSet$
\EndFunction
\end{algorithmic}
\end{algorithm}

\begin{proposition}\label{prp1} Algorithm \ref{algo-Freq-Items} has the worst case complexity $\approx O(\frac{l^3-l^2}{2}(v-1))$, where $l$ is the number of labels and $v$ is the number of vertices.
\end{proposition}
\begin{proof}\label{proof1}
In this algorithm, we have three nested loops. Let $l$ be the number of labels and $v$ the number of vertices. At most, the outer loop treats $l$ labels, and the first and second inner loops process $l$ and $\frac{l-1}{2}$ iterations, respectively. In addition, at most $v-1$ successors are handled in the function $NewClique$. Thus, the {\CfCi}'s complexity is $\approx O(\frac{l^3-l^2}{2} \times (v-1))$. 
\end{proof}
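The bound of Proposition \ref{prp1} can also be read directly off the loop structure, writing one factor per nesting level (a restatement of the count above, under the same worst-case assumptions):
\[
\underbrace{l}_{k\text{-loop}}\;\times\;\underbrace{l}_{i\text{-loop}}\;\times\;\underbrace{\tfrac{l-1}{2}}_{j\text{-loop}}\;\times\;\underbrace{(v-1)}_{NewClique}\;=\;\frac{l^{3}-l^{2}}{2}\,(v-1).
\]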
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In Figure \ref{max-cll}, we illustrate the evolution of the used structures: $Labs$, $FCI$, $IndVect$ and $SuppVect$. As given in the proof of Proposition \ref{prp1}, Algorithm \ref{algo-Freq-Items} executes $nbcons$ outer loops, where $nbcons$ is the number of consistent vertices. Thus, since $nbcons=5$, our algorithm executes $5$ outer loops. Through sub-Figures \ref{fig-cl11}, \ref{fig-cl22}, \ref{fig-cl33}, \ref{fig-cl44} and \ref{fig-cl55}, we illustrate the details of the different loops. We use a green arrow to express that the generated label is new and consistent, a blue one to show that the generated label already exists, and a red one when the support of the generated label is less than $\theta$. 

\begin{figure}[h!]\centering
\begin{subfigure}{0.5\textwidth}\centering
   \includegraphics[scale = 0.4]{eps-figures/example-CFI-1.eps}\caption{\small{No new label is generated.}}\label{fig-cl11}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}\centering
    \includegraphics[scale = 0.4]{eps-figures/example-CFI-2.eps}
\caption{\small{No new label is generated.}}\label{fig-cl22}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}\centering
    \includegraphics[scale = 0.4]{eps-figures/example-CFI-3.eps}
\caption{\small{$I_6=\left\lbrace i_3,i_1,i_4\right\rbrace $ is generated.}}\label{fig-cl33}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}\centering
    \includegraphics[scale = 0.4]{eps-figures/example-CFI-31.eps}
\caption{\small{No new label is generated.}}\label{fig-cl44}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}\centering
    \includegraphics[scale = 0.4]{eps-figures/example-CFI-4.eps}
\caption{\small{No new label is generated.}}\label{fig-cl55}
\end{subfigure}
\caption{Generated $CFI$s by Algorithm {\CfCi}. $IndVect$ and $SuppVect$ contain the vertices that are at the origin of the corresponding labels and their supports, respectively.}\label{max-cll}
\end{figure}
      
Thus, Algorithm \ref{algo-Freq-Items}, applied to the graph, provides six maximal cliques as given in Figure \ref{max-cl}. Each clique $\mathcal{C}_i$ induces a closed itemset $I_i$. 

The maximal cliques resulting from our algorithm, $\mathcal{C}_1$, $\mathcal{C}_2$, $\mathcal{C}_3$, $\mathcal{C}_4$, $\mathcal{C}_5$ and $\mathcal{C}_6$, are illustrated via sub-Figures \ref{fig-cl1}, \ref{fig-cl2}, \ref{fig-cl3}, \ref{fig-cl4}, \ref{fig-cl5} and \ref{fig-cl6}, respectively.
  
\begin{figure}[H]\centering
\begin{subfigure}{0.4\textwidth}\centering
   \includegraphics[scale = 0.3]{eps-figures/graph-theta1-clique1.eps}
    \caption{\small{$\mathcal{C}_1$ modeling the $CFI$ $I_1=\left\lbrace i_5\right\rbrace $}}\label{fig-cl1}
\end{subfigure}
\begin{subfigure}{0.4\textwidth}\centering
    \includegraphics[scale = 0.3]{eps-figures/graph-theta1-clique2.eps}
\caption{\small{$\mathcal{C}_2$ modeling the $CFI$ $I_2=\left\lbrace i_2,i_1,i_4\right\rbrace $}}\label{fig-cl2}
\end{subfigure}
\begin{subfigure}{0.4\textwidth}\centering
    \includegraphics[scale = 0.3]{eps-figures/graph-theta1-clique3.eps}
\caption{\small{$\mathcal{C}_3$ modeling the $CFI$ $I_3=\left\lbrace i_3,i_4\right\rbrace $}}\label{fig-cl3}
\end{subfigure}
\begin{subfigure}{0.4\textwidth}\centering
    \includegraphics[scale = 0.3]{eps-figures/graph-theta1-clique4.eps}
\caption{\small{$\mathcal{C}_4$ modeling the $CFI$ $I_4=\left\lbrace i_1,i_4\right\rbrace $}}\label{fig-cl4}
\end{subfigure}
\begin{subfigure}{0.4\textwidth}\centering
    \includegraphics[scale = 0.3]{eps-figures/graph-theta1-clique5.eps}
\caption{\small{$\mathcal{C}_5$ modeling the $CFI$ $I_5=\left\lbrace i_4\right\rbrace $}}\label{fig-cl5}
\end{subfigure}
\begin{subfigure}{0.4\textwidth}\centering
    \includegraphics[scale = 0.3]{eps-figures/graph-theta1-clique6.eps}
\caption{\small{$\mathcal{C}_6$ modeling the $CFI$ $I_6=\left\lbrace i_3,i_1,i_4\right\rbrace $}}\label{fig-cl6}
\end{subfigure}
\caption{\small{Maximal cliques VS $CFI$s}}\label{max-cl}
\end{figure}


\section{Experiments}\label{sec-exp}
As stated in Section \ref{ssec-rw}, besides {\CfCi}, we consider in our experiments the algorithms {\NAFCP} \cite{Tuong-15}, {\NECLAT} \cite{aryabarzan-21} and {\GrAFCI+} \cite{Ledmi-2021}. In addition, to obtain credible results, we chose the datasets according to their varied densities. More precisely, the chosen datasets, Mushroom, Lymph and Soybean\footnote{The tested datasets are downloaded from \url{https://dtai.cs.kuleuven.be/CP4IM/datasets/}}, have densities of $18\%$, $40\%$ and $32\%$, respectively (see Table \ref{tbl-ClItm}). 

In Table \ref{tbl-ClItm}, we introduce the characteristics of each tested dataset $DName(nbItem,nbTr,dns)$, where $DName$ is the dataset's name, and $nbItem$, $nbTr$ and $dns$ are, respectively, the corresponding number of items, number of transactions, and density. The information given in the table for each dataset is:
\begin{itemize}
\item $freq$: the frequency adopted by the algorithm,
\item $\theta$: the corresponding support to $freq$,
\item $dns$: the density of the dataset according to $freq$,
\item $nvr$: the vertices' number of the consistent dataset's graph,
\item $nbd$: the bonds' number of the consistent dataset's graph.
\end{itemize}  
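As an observation on Table \ref{tbl-ClItm} (ours, not stated by the authors), the listed $\theta$ values are consistent with rounding $freq \times nbTr$ to the nearest integer:

```python
# Observation: theta in the characteristics table matches
# round(freq * nbTr). This is our reading of the table, not a
# formula given by the authors.
def theta_of(freq, nb_tr):
    return round(freq * nb_tr)

# Mushroom has 8124 transactions, Soybean 630:
assert theta_of(0.05, 8124) == 406
assert theta_of(0.80, 8124) == 6499
assert theta_of(0.05, 630) == 32
assert theta_of(0.25, 630) == 158
```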
    
\begin{table}[h]
\caption{Characteristics of the tested datasets: Mushroom, Lymph and Soybean.}
\label{tbl-ClItm}
\begin{center}
%\begin{minipage}{.48\textwidth}
{\renewcommand{\arraystretch}{1}
\begin{tabular}{l|lr|lr} \hline
\multicolumn{5}{c}{Mushroom(119,8124,18\%)}\\\hline
$freq$ & {\code{$\theta$}} & $dns$ & $nvr$ & $nbd$ \\\hline
$5\%$ & 406 & 30\% & 67 & 1168\\ 
$25\%$ & 2031 & 51\% & 31 & 229 \\
$50\%$ & 4062 & 74\% & 12 & 52 \\
$80\%$ & 6499 & 94\% & 5 & 14 \\\hline
\end{tabular}}
{\renewcommand{\arraystretch}{1}
\begin{tabular}{l|lr|lr} \hline
\multicolumn{5}{c}{Lymph(68,148,40\%)}\\\hline
$freq$ & {\code{$\theta$}} & $dns$  & $nvr$ & $nbd$ \\  
\hline
$5\%$ & 7 & 45\% & 59 & 1369 \\ 
$25\%$ & 37 & 60\% & 40 & 549 \\
$50\%$ & 74 & 76\% & 24 & 193 \\
$80\%$ & 118 & 91\% & 11 & 58 \\\hline
\end{tabular}}
{\renewcommand{\arraystretch}{1}
\begin{tabular}{l|lr|lr} \hline
\multicolumn{5}{c}{Soybean(50,630,32\%)}\\\hline
$freq$ & {\code{$\theta$}} & $dns$ & $nvr$ & $nbd$ \\  
\hline
$5\%$ & 32 & 36\% & 44 & 601 \\ 
$25\%$ & 158 & 56\% & 23 & 157 \\
$50\%$ & 315 & 75\% & 12 & 55 \\
$80\%$ & 504 & 93\% & 5 & 13 \\\hline
\end{tabular}}
\end{center}
\end{table}

The results provided by the tested algorithms are analyzed regarding the CPU-time, the number of computed $CFI$s (called $ncl$) and the memory usage (called $Mem$).

The used computer has the following configuration: Linux 20.04.1 on a notebook with an Intel Core i7 and 8~GB of RAM. The algorithms are implemented in Java.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In sub-Figures \ref{cfi-mush-cpu}, \ref{cfi-mush-ncl} and \ref{cfi-mush-mem}, we present the behavior of {\CfCi}, {\NAFCP}, {\NECLAT} and {\GrAFCI+} applied to Mushroom, in terms of CPU-time, $ncl$ (the number of closed itemsets) and $Mem$ (the memory usage), respectively. 

\begin{figure}[h!]\centering
\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/cpu-mushroom.eps}
    \caption{CPU-time}\label{cfi-mush-cpu}
\end{subfigure}
\\

\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/ncl-mushroom.eps}
    \caption{Number of $CFI$s}\label{cfi-mush-ncl}
\end{subfigure}
\\

\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/mem-mushroom.eps}
    \caption{Memory usage}\label{cfi-mush-mem}
\end{subfigure}
\caption{MUSHROOM dataset}\label{cfi-mush}
\end{figure}

We notice, via sub-Figure \ref{cfi-mush-cpu}, that the applied methods have approximately the same CPU-time down to the frequency of $25\%$, where {\CfCi} and {\GrAFCI+} become slower. At a frequency of $5\%$, {\CfCi} and {\NECLAT} are the fastest ones, whereas {\NAFCP} is by far the slowest one.
%%
Sub-Figure \ref{cfi-mush-ncl} shows that all the methods, at frequencies of $80\%$, $50\%$ and $25\%$, provide the same number of $CFI$s, whereas, at a frequency of $5\%$, our algorithm provides more $CFI$s than {\NECLAT} and {\GrAFCI+}, and fewer than {\NAFCP}.
%%

In terms of memory usage, sub-Figure \ref{cfi-mush-mem} highlights the efficiency of our algorithm compared to the others.

In Figure \ref{cfi-lymph}, we introduce three diagrams that express the behavior of the adopted methods, applied to the Lymph dataset, in terms of CPU-time, number of $CFI$s and memory usage.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[h!]\centering
\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/cpu-lymph.eps}
    \caption{CPU-time}\label{cfi-lymph-cpu}
\end{subfigure}
\\

\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/ncl-lymph.eps}
    \caption{Number of $CFI$s}\label{cfi-lymph-ncl}
\end{subfigure}
\\

\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/mem-lymph.eps}
    \caption{Memory usage}\label{cfi-lymph-mem}
\end{subfigure}
\caption{LYMPH dataset}\label{cfi-lymph}
\end{figure}

As shown in sub-Figure \ref{cfi-lymph-cpu}, at frequencies $80\%$ and $50\%$, all the methods run at a comparable speed, whereas {\CfCi} becomes faster at frequency $25\%$. In addition, we notice that at frequency $5\%$, {\NAFCP} becomes the slowest, {\NECLAT} the fastest, and {\CfCi} comes close to {\NECLAT}.
%%

Sub-Figure \ref{cfi-lymph-ncl} shows, in terms of the number of $CFI$s, that at frequencies $80\%$, $50\%$ and $25\%$, all the methods provide the same results, except {\GrAFCI+}, which gives more $CFI$s at frequency $80\%$. At frequency $5\%$, the number of provided $CFI$s differs from algorithm to algorithm: our algorithm provides more $CFI$s than {\NAFCP} and {\NECLAT}, and fewer than {\GrAFCI+}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Figure \ref{cfi-soy} contains Soybean's diagrams, which show the behavior of the tested methods in terms of CPU-time, number of $CFI$s and memory usage.

\begin{figure}[h!]\centering
\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/cpu-soybean.eps}
    \caption{CPU-time}\label{cfi-soy-cpu}
\end{subfigure}
\\

\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/ncl-soybean.eps}
    \caption{Number of $CFI$s}\label{cfi-soy-ncl}
\end{subfigure}
\\

\begin{subfigure}{0.4\textwidth}
    \includegraphics[width=1\textwidth]{eps-curves/mem-soybean.eps}
    \caption{Memory usage}\label{cfi-soy-mem}
\end{subfigure}
\caption{SOYBEAN dataset}\label{cfi-soy}
\end{figure}

We notice through sub-Figure \ref{cfi-soy-cpu} that {\CfCi} and {\NECLAT} start as the fastest algorithms at $freq = 80\%$; then {\NECLAT} remains the most efficient. On the other hand, as the frequency decreases, {\CfCi} and {\GrAFCI+} slow down.

Regarding the number of $CFI$s, all the methods provide the same number of closed frequent itemsets at each frequency, except {\GrAFCI+}, which gives one more $CFI$.

Sub-Figure \ref{cfi-soy-mem} shows the efficiency of our algorithm in terms of memory usage, contrary to {\GrAFCI+} and {\NECLAT}, which are considerably more costly.

%%%%%%%%%%%%%%%%%%
As stated in Section \ref{cfci-sec}, our algorithm is a complete approach designed to provide the exact number of $CFI$s. We observed through the diagrams given above that {\NAFCP}, {\GrAFCI+} and {\NECLAT} may yield fewer or more $CFI$s in certain situations. In addition, our algorithm is the most efficient in terms of memory usage for all datasets, except Lymph with $freq = 5\%$. Thus, {\CfCi} achieves average CPU-time results and the best results in terms of the number of $CFI$s and memory usage.

We should point out the case where {\CfCi}, applied to Lymph, provides its worst results in terms of memory usage. This can be explained by the size of the corresponding graph: according to Table \ref{tbl-ClItm}, the graph modeling the Lymph dataset at this frequency contains the largest number of bonds, namely 1369 bonds over 59 vertices, compared to the other datasets.
%%%%%%%%%%%%%%%%%%
\section{Discussion}\label{sec-disc}
There are advantages and disadvantages to each of the previous data mining techniques. Most of them are based on tree structures (see Table \ref{tbComp}), namely NList, Gr-tree and FP-tree in {\NAFCP}, {\GrAFCI+} and {\NECLAT}, respectively. This is because of the recursive relationship indirectly expressed in Property 1.

In addition, the authors of the tested techniques proposed approximate approaches and employed recursive algorithms to construct and explore the corresponding tree structures. 

Using recursive structures and approximate algorithms in most SOTA approaches means that memory usage is costly and that finding all closed itemsets is not guaranteed. In addition, updating tree structures when the corresponding dataset is modified is generally a complex task.

In this article, we developed a new Boolean and linear structure that can easily be updated when necessary, together with iterative algorithms that optimize memory usage. Since our algorithm {\CfCi} is exact and polynomial, it provides all the $CFI$s and shows good CPU-time.

In certain instances, when the frequency is $5\%$, {\CfCi} slows down compared to {\NAFCP}, {\GrAFCI+} and {\NECLAT}. This is because of the step that verifies whether the computed $CFI$ already belongs to the current set of $CFI$s.

Most of the time, the number of $CFI$s ($ncl$) provided by our algorithm is the same as that of the other techniques. Sometimes, we get fewer or more $CFI$s, particularly when $freq = 5\%$. Our algorithm computes all the intersections between the bonds' labels while preventing duplicated $CFI$s. Thus, we can conclude the incompleteness of the algorithms {\NAFCP}, {\NECLAT} and {\GrAFCI+} compared to {\CfCi}.

Our approach is the best choice for optimizing memory usage, except in the Lymph dataset when $freq = 5\%$.

\section{Conclusion and Future Works}\label{sec-conclu}
This paper has introduced new modeling and solving approaches to find all the $CFI$s. We have presented an efficient graph modeling approach based on the clique notion in graph theory. Implementing our model with a linear structure makes our graph model more flexible to update when transactions or items are added or removed. Our proposed model is based on the fact that a frequent itemset corresponds to a clique in an undirected graph. Thus, {\CfCi}, a new algorithm based on the maximal-clique principle, has been introduced to tackle the closed itemset mining problem.

Our first experiments have shown the efficiency of {\CfCi} in finding all the $CFI$s compared to recent algorithms from the literature. This is because the proposed algorithm uses Boolean structures and is conceived to find all the $CFI$s through systematic searches. Thus, our algorithm is more efficient than the tested methods in terms of the number of $CFI$s and memory usage, and competitive in terms of CPU-time.

To summarize, the structure used in our approach is linear, and the proposed algorithm is iterative, exact, and polynomial. That is the origin of the promising results exposed and analyzed in this article.

However, there are some cases where our approach is less efficient than the newest methods, particularly in terms of CPU time. This is because of the {\CfCi}'s step, which tackles the duplicate $CFI$s cases.

As with any research work, our approach can be followed by a new version that inherits its advantages and avoids its disadvantages. Exploring opportunities for the practical implementation of the proposed algorithm will serve as the future direction of our research. We also consider adapting our dataset graph model and the {\CfCi} algorithm to tackle classification and clustering problems.

\bibliography{itemset-mining}
\end{document}

