Automated Self-Learning Chatbot Initially Built as a FAQs Database Information Retrieval System: Multi-level and Intelligent Universal Virtual Front-Office Implementing Neural Network

The method proposed in this paper is based on a dynamic information system able to implement a universal multi-level virtual front-office built from FAQs and self-learning chatbot systems.


Introduction
Robots and artificial intelligence appeared long ago, in the design of "Leonardo's mechanical knight" [1] and even earlier, in the 12th century, in the work of Al-Jazari [2]. Artificial Intelligence (AI) as a discipline began eight centuries later with the conceptualizations of Alan Turing (Turing's test) [3]. The merging of robots and artificial intelligence is much more recent and opens perspectives with very strong potential, together with related concerns. Virtual assistants such as the "chatbot" (also known as a talkbot, chatterbot, Bot, IM bot, interactive agent, or Artificial Conversational Entity) can converse like real personal assistants; their main current use is on smartphones [4], to retrieve information useful for everyday life, set an alarm or an appointment in the agenda, send mail or text messages, find places, or browse and launch apps. Just as robots enable mechanical automation, chatbots implementing AI enable the automation of information.
A chatbot is a computer program designed to simulate conversation with human users, especially over the Internet; it acts as a human-computer interface created to facilitate communication between human and computer, understanding natural language questions and replying with actual answers. Although the chatbot is a current hot topic, it has been an object of study for the past fifty years. The idea of chatbot systems originated at M.I.T. in 1966, where professor Joseph Weizenbaum implemented the ELIZA chatbot to emulate a psychotherapist [5]. After ELIZA many chatbots have been developed, for example to simulate the interaction with different personalities [6], to work together with web-based search engines (AskJeeves) [7], and as open-source initiatives like ALICE [8] [9], which implements artificial intelligence applications through AIML (Artificial Intelligence Markup Language) [8]. AIML is a widely adopted standard for creating chatbots and mobile virtual assistants like ALICE [8], Mitsuku [10], English Tutor [11], The Professor [12] and many more. Over the years, chatbots have become sophisticated tools, able to hold natural conversations and to satisfy users through the quick support they can provide. Although a chatbot cannot handle all customer queries, it can deal with many of the routine queries that activate service requests.
The knowledge bases of existing chatbots are mostly built manually [13]; they therefore take a long time to build and are difficult to adapt to new domains. Several studies address the extraction of knowledge from different types of data sets [14] [15] [16], but these approaches exploit the characteristics of their own domains and are therefore only suitable for their specific tasks, which limits the possibility of turning their methods directly into a general knowledge extraction approach.
Nowadays tens of thousands of chatbots are available online, over 30K on the Messenger platform alone [17], and their number is growing rapidly thanks to the ease of implementation and distribution provided by services such as Facebook Messenger, Telegram, etc. Users around the world log into messaging apps not only to chat with friends but also to connect with brands, browse merchandise, and watch content (in mid-2014 the number of messages exchanged on the four main instant messaging platforms exceeded the number of messages exchanged on the four main social networks [18]).
The approach described in this paper involves the implementation of a smart virtual front-office model that exploits the synergy between a list of FAQs and a chatbot [19], [20]. The main features of this model are:
• the ability to self-learn its knowledge base from an archive of questions and answers (FAQs) organized in a static tree structure;
• the use of machine learning algorithms to dynamically generate its own knowledge base;
• the speeding up of the management of most user requests;
• the possibility of acquiring different kinds of feedback, improving the efficiency of the system so that the available data can easily be reshaped to support users more effectively;
• full control over the management activities of the system itself.
Microsoft today offers a service that loads the knowledge base (KB) of a chatbot by copying and pasting FAQ phrases [21]. Recently some researchers implemented long short-term memory (LSTM) neural networks [22], which automatically generate responses to user requests on social media. Other authors highlighted that a chatbot does not understand colloquial usage [23], and cannot yet simulate the full range of intelligent human conversation [24].
With respect to the state of the art, the proposed research makes the following claims:
- automatic loading and construction of the knowledge base of the chatbot system (automatic generation of the AIML files that make up the KB through automatic data entry);
- automatic updating of the training dataset, progressively reducing the probabilistic error of request recognition (see the self-learning multi-level model, which enables automatic storing of the training dataset, described in the next section);
- gradual overcoming of the barrier posed by training the model on colloquial usage forms.

Case study
The proposed case study consists of improving customer support through a website by exploiting the FAQs archive available to its users, which is also used for the automated creation of a chatbot. Having a chatbot manage requests for assistance reduces the workload of the human operators dedicated to customer support and keeps the assistance service operational even when no operator is available.
If the system does not find a response to a user question, it can redirect the user to a human operator, or store the user's question and data in order to send this information to the first available operator.
Figure 1 shows a functional scheme of the whole system.
The multi-level access to the different modules of the system allows the optimization of available resources and, at the same time, a better user experience:
• according to the sections consulted by the user, a series of relevant FAQs, organized in a tree structure, is proposed (level 1);
• when the user asks to contact an operator for further information, he is put in touch with the chatbot which, according to the user's requests and related information, proposes the most relevant answers (level 2);
• if the chatbot does not find a response for the user, it puts him in contact with a specific human operator, selected by the system from a list of available operators, on one of the available channels (e.g. e-mail, chat, telephone; level 3), thus optimizing the response time; the response provided by the operator is stored by the system and exploited by a self-learning algorithm to extend the knowledge base of the system.
Operators are able to improve the knowledge base and the overall performance of the system by taking advantage of:
• the archive of questions to which the chatbot did not respond (the training dataset is gradually enriched by adding new FAQs formulated through the three response levels of Fig. 1);
• a series of tools, tests and indications provided by the system in a fully automatic way, which help the operators optimize the knowledge base (operators are guided by a platform that allows an automatic update of the training dataset).
The system's interactions are sketched in the diagram of Figure 2.

System design
The system has been designed in a modular way to allow fast integration into any web-based platform, and it is not bound to a particular environment. It can be logically divided into three macro-modules:
• FAQ: management by authenticated users (operators and administrators), and display to unauthenticated users, of FAQs and Glossary items;
• Chatbot: management and use of the "Chatbot" application, which allows interaction between users and the system by simulating communication in natural language;
• Back-office: management of the data, and of the consequent information, that the system makes available to users; system supervision; statistics; system optimization.
These three macro-modules are detailed in the use case diagram of Figure 4, which explains the main actions performed; this diagram shows the actors of the system:
• the end user, who accesses the FAQ and chatbot system to find answers to his questions;
• the operator of the system, who has access to the back-office interfaces for managing the information that the system makes available to the users;
• the system administrator, who has access to the back-office interfaces to supervise the system's activities.
The database acts as an aggregator of the data coming from the system modules and provides the information needed for data processing. In this way the logic operations are decoupled from data administration, enabling a smart management of all the underlying functionalities.
During a user's visit, the system stores the user's preferences and the sections visited, so that this information can be used to better evaluate which answers to generate for the user's questions.
Each component of the system cooperates synergistically with the others in order to improve the user experience; at the same time, this integration makes it possible to update the system's knowledge base through simple operator tools, which exploit an archive of the questions the system has not answered and provide useful indications on how to improve the knowledge base for better performance.
The diagram of Figure 5 shows the interaction between the system components.
The autoresponder with self-learning consists of two modules: an artificial intelligence (A.I.) module and a chatbot module; the A.I. module generates the answers to the user questions and trains the knowledge model. Both modules are connected to the database, but only the chatbot module modifies it, by inserting new data useful for statistical purposes and for the optimization of the knowledge base. During the design of the artificial intelligence part of this module, the use of machine learning algorithms based on neural networks was compared with the use of AIML, in order to validate the proposed final solution. The chatbot module also manages user questions and system-generated responses, making them available to administrators for control purposes and for generating statistics on system usage. An algorithm has been designed to read the knowledge base available in the database and to transform these data into a format that can be exploited by neural networks able to generate AIML patterns. The database module handles the exchange of data with the database server, recycling and caching queries in order to optimize performance; this module provides a single interface for all the modules of the system connected to the database.
The FAQ module allows the complete management of the FAQs and glossary entries; this module updates the system usage statistics regarding the operators' activities.
The operator module manages all the actions concerning the operators of the system: it handles the authentication of each operator or administrator, the modification of his personal data and, for users authenticated as administrators, the management of other accounts through the administrator module; the operator and administrator modules write to the database all the actions performed by a user, for statistical purposes.

Data model
The entity-relationship model of Figure 6 highlights the data necessary for each system entity and how the entities are associated with each other. This simplified diagram gives a clear description of the main characteristics of the database architecture.
From Figure 6 the following data structure can be deduced:
• FAQs are grouped into categories, which can themselves be grouped into parent categories, creating a tree structure that simplifies the selection and use of FAQ data;
• every user can have different roles in the system (tables: User, UserRole and Role);
• there may be several different answers to a question and several different definitions for a glossary entry (tables: FAQAlternativeQuestion and GlossaryAlternativeDefinition);
• words without semantic content are stored in a specific table (DummyWords);
• all the users' questions that the chatbot has not answered are traced (table: NotAnsweredFAQ); the operators can use these questions to create new FAQs or to add new answers to existing FAQs;
• all the chatbot answers and the corresponding user questions are traced (table: AutoresponderLog);
• all the actions performed by the operators on the system are traced (table: OperatorActionsLog) for statistical and control purposes.
The dynamic training dataset of the chatbot system (initial knowledge base) is built from the data model described above: the dataset is constructed from a starting database containing the glossary, from the FAQs database, and from the DummyWords list (words to ignore).
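The way a training dataset can follow from this data model may be sketched in Python. This is only an illustration under stated assumptions, not the authors' schema: the `FAQ` and `Category` classes, their field names and the `build_training_pairs` helper are hypothetical; only the tree of categories and the alternative questions come from the description above. Glossary entries could be flattened in the same way.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FAQ:
    question: str
    answer: str
    # Alternative formulations of the same question (hypothetical field).
    alternatives: List[str] = field(default_factory=list)


@dataclass
class Category:
    name: str
    faqs: List[FAQ] = field(default_factory=list)
    # Child categories form the static tree structure of the FAQ archive.
    children: List["Category"] = field(default_factory=list)


def build_training_pairs(root: Category) -> List[Tuple[str, str]]:
    """Walk the category tree and return (question, answer) pairs."""
    pairs: List[Tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        for faq in node.faqs:
            for text in [faq.question, *faq.alternatives]:
                pairs.append((text, faq.answer))
        stack.extend(node.children)
    return pairs


# Hypothetical toy archive with one nested category.
billing = Category("billing", faqs=[
    FAQ("How can I pay my invoice?",
        "You can pay by card or bank transfer.",
        alternatives=["Which payment methods are accepted?"]),
])
root = Category("root", children=[billing])
pairs = build_training_pairs(root)
```

Each alternative question is mapped to the answer of its parent FAQ, so a set of questions can refer to a single answer.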

Self-learning algorithm
Upon receipt of a natural language question asked by a user, the first operation to be performed is the elimination of all words without semantic meaning in the context in which the chatbot operates; this operation is performed through a comparison with a list of words, characters and symbols to be ignored. Once the words without semantic meaning have been removed from the original question, the list of remaining keywords is passed to the A.I. engine, thus activating the answer searching process.
If the A.I. engine finds a response to the keyword sequence, this is sent to the user; otherwise the question is stored to be subsequently verified by a human operator. According to the last user choices and the context in which the chat has been opened, the question is passed to a specific operator.
When an operator answers a user question, the response provided by the human operator is stored and used to regenerate the knowledge base.
This algorithm is depicted in the diagram of Figure 7.
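The flow just described can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the names `Reply`, `handle_question` and `toy_engine` are hypothetical, the A.I. engine is replaced by a simple keyword-set lookup, and routing to a specific operator is reduced to appending the question to a list.

```python
from typing import Callable, List, NamedTuple, Set


class Reply(NamedTuple):
    text: str    # natural language phrase
    valid: bool  # True when the engine found an answer


def handle_question(question: str,
                    dummy_words: Set[str],
                    engine: Callable[[List[str]], Reply],
                    unanswered: List[str]) -> Reply:
    """Filter dummy words, query the engine, store the question if unanswered."""
    keywords = [w for w in question.lower().split() if w not in dummy_words]
    reply = engine(keywords)
    if not reply.valid:
        # Kept for a human operator, whose answer later extends the knowledge base.
        unanswered.append(question)
    return reply


# Hypothetical stand-in for the A.I. engine: a keyword-set lookup.
KNOWLEDGE = {frozenset({"reset", "password"}): "Use the 'Forgot password' link."}


def toy_engine(keywords: List[str]) -> Reply:
    answer = KNOWLEDGE.get(frozenset(keywords))
    return Reply(answer or "", answer is not None)


pending: List[str] = []
dummy = {"how", "do", "i", "my", "the", "can"}
ok = handle_question("How do I reset my password", dummy, toy_engine, pending)
miss = handle_question("Can I change the invoice address", dummy, toy_engine, pending)
```

In this sketch the second question finds no answer, so it ends up in `pending`, the stand-in for the archive of questions awaiting an operator.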
The second action of the previous flow-chart, "process tags extracted from the application through A.I. engine", relies on two A.I. pre-processing algorithms: one for the automatic production of standard AIML patterns, and one for setting the right weights of the neural network. The pre-processing algorithms take the user question and generate the outputs fed to the respective A.I. engine. The replies of these engines have the same data type (a natural language phrase and a Boolean value indicating response validity).
The AIML A.I. engine is based on an open-source library compatible with the AIML 1.0.1 standard. The artificial neural network (ANN) is a fully connected neural network used as a universal function approximator: once trained, it behaves as a classifier of questions.

Automated AIML patterns production
The automatic production of AIML patterns from the system knowledge base (FAQs, glossary entries, dummy words) is managed by an iterative algorithm that extracts keywords from the input text and uses them to construct AIML patterns (see Figure 8). All the generated AIML patterns are stored in a file ready to be processed by the AIML engine.
For the automatic reconstruction of the AIML files, all the questions (FAQs, glossary entries and the alternative questions of both) are analyzed in a loop that, for each question:
1. normalizes the question in the archive (removes words with no semantic meaning and unwanted characters such as punctuation, dashes, etc.);
2. creates an AIML pattern on the normalized question;
   o for each group of 2 and 3 words taken from the words of the normalized alternative question:
      ▪ creates a SRAI pattern on the AIML pattern of the original question;
      ▪ verifies, based on the AIML pattern cache, that the newly created SRAI pattern has not already been created.
This process is started automatically when an operator changes the knowledge base of the system. The automatic creation of AIML patterns can be disabled in order to allow massive changes to the knowledge base, and re-enabled at the end of all changes: in this way the operators avoid unnecessary reconstructions of the AIML patterns until all the necessary changes have been made to the knowledge base.
The diagram in Figure 9 shows the steps for the creation of each SRAI pattern related to each original question and to all the alternative questions generated from the original question.
Comparing Figure 9 with Figure 8, the only step that needs explanation is the "Create keywords permutations" action, whose task is to create the SRAI patterns as different combinations of the 2 or 3 selected keywords, using the wildcard character [*] provided by the AIML standard.
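The pattern generation described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the order-preserving choice of keyword groups and the four wildcard placements per group are assumptions made for illustration; only the `<category>`, `<pattern>`, `<template>` and `<srai>` elements and the `*` wildcard come from the AIML standard.

```python
from itertools import combinations
from typing import Iterable, List, Set
from xml.sax.saxutils import escape


def normalize(text: str, dummy_words: Set[str]) -> List[str]:
    """Upper-case, strip punctuation and drop words without semantic meaning."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text)
    return [w for w in cleaned.upper().split() if w.lower() not in dummy_words]


def category(pattern: str, template: str) -> str:
    return (f"<category><pattern>{pattern}</pattern>"
            f"<template>{template}</template></category>")


def build_aiml(question: str, answer: str, alternatives: Iterable[str],
               dummy_words: Set[str]) -> List[str]:
    main = " ".join(normalize(question, dummy_words))
    categories = [category(main, escape(answer))]
    seen = {main}  # pattern cache: avoids emitting the same pattern twice
    for alt in alternatives:
        words = normalize(alt, dummy_words)
        for size in (2, 3):
            for group in combinations(words, size):
                # Wildcard '*' between keywords, optionally before and after.
                core = " * ".join(group)
                for pattern in (core, f"* {core}", f"{core} *", f"* {core} *"):
                    if pattern not in seen:
                        seen.add(pattern)
                        categories.append(
                            category(pattern, f"<srai>{main}</srai>"))
    return categories


cats = build_aiml("How can I reset my password?",
                  "Use the 'Forgot password' link.",
                  ["I forgot my password, how do I recover it?"],
                  {"how", "can", "i", "my", "do", "it"})
aiml_file = '<aiml version="1.0.1">\n' + "\n".join(cats) + "\n</aiml>"
```

Every SRAI category redirects to the pattern of the original question, so the answer text is stored only once; the `seen` set plays the role of the AIML pattern cache.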

AIML optimizer
In order to find the best combination of patterns for the construction of the AIML file that will manage the chatbot, an automatic algorithm has been implemented. Through it, the chatbot:
• is able to find the best combination of keywords / patterns among all those that can be created;
• can find any redundant questions, which lead to different answers while being syntactically similar (these questions are reported to the operators so that they can verify and modify them in line with an appropriate knowledge base).
This algorithm is called for each editing process in the knowledge base. The AIML optimizer algorithm is described in the flow-chart diagram of Figure 10, which highlights the optimization procedure that aims to remove some of the keywords obtained from the analyzed question, thus activating the system response.
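One possible reading of this optimization is sketched below in Python. It is an assumption-laden illustration, not the procedure of Figure 10: the exhaustive subset search, the `minimal_keywords` name and the representation of each question as a keyword set are hypothetical. The idea is to look for the smallest keyword subset that still identifies one answer, and to flag as redundant the questions for which no such subset exists.

```python
from itertools import combinations
from typing import FrozenSet, List, Optional, Tuple

# Each archive entry: (keywords of a normalized question, answer identifier).
Archive = List[Tuple[FrozenSet[str], str]]


def minimal_keywords(target: FrozenSet[str], answer: str,
                     archive: Archive) -> Optional[Tuple[str, ...]]:
    """Smallest keyword subset not shared with a question that has another answer."""
    for size in range(1, len(target) + 1):
        for subset in combinations(sorted(target), size):
            clash = any(set(subset) <= other and other_answer != answer
                        for other, other_answer in archive)
            if not clash:
                return subset
    return None  # redundant: even the full keyword set is ambiguous


archive: Archive = [
    (frozenset({"reset", "password"}), "A1"),
    (frozenset({"change", "password"}), "A2"),
    (frozenset({"password", "change"}), "A3"),
]
report = [minimal_keywords(q, a, archive) for q, a in archive]
```

Here the first question is identified by the single keyword "reset", while the last two share the same keywords but point to different answers, so both would be reported to the operators.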

AIML vs neural network architecture
An AIML module is based on the recognition of precise textual patterns. Through a defined logic in the definition of patterns, AIML allows the construction of recursive patterns and the use of wildcards that stand for one or more words, making AIML a standard ready to simulate a conversation in natural language.
Although the process of creating AIML patterns can be automated, the final result is a series of static patterns, which will respond to the inputs according to precise rules and structures.
Better results can be achieved using an artificial neural network instead of an AIML-based system, because:
• there is a better correspondence between the questions put to the system and the answers it provides (depending on the implemented artificial neural network architecture, it is possible to construct complex maps between the input and the output text generated by the system, going beyond the limits and the static nature of AIML patterns and handling specific or difficult cases);
• the execution speed is greater, especially with large archives of texts to be managed (as the number of texts to be managed increases, the number of AIML patterns necessary to manage all the possible cases increases, whereas an artificial neural network does not need new neurons or new layers in order to manage new texts; moreover, the recognition of AIML patterns occurs through the direct comparison of text strings, which is a very slow operation, especially when compared to the additions and multiplications used in neural networks, operations at which electronic computers excel and which lend themselves to parallel computing).
Figure 10 illustrates the flowchart of the AIML optimizer.

Automated neural network
The automatic creation of the neural network has been implemented through algorithms able to process the knowledge base of the system without human intervention. The main parameters of the neural network, such as the number of neurons in the hidden layers, are set automatically on the basis of the available dataset, so that the solution adapts to any new dataset suitable for the system implementation.
The technique called "one-hot encoding" [25] (discussed in the following paragraph) has been adopted to make the text of the questions usable by the underlying neural network. Mathematically, one-hot encoding generates a balanced matrix, which is easy to handle in the complex computations inside the algorithms. One-hot encoding is typically used to encode categorical features: a one-hot is a group of bits in which the only allowed combinations of values are those with a single high bit (1) and all the others low (0) [25].

3.2.4.1 One-hot encoding
The words found in the questions archive are used to create a dictionary, in which each word corresponds to an index. To transform any new sentence into a binary vector, a vector is created with a length equal to the number of words in the dictionary and with all elements equal to zero; the elements whose index corresponds to a word of the sentence are then set to one. An example of the one-hot encoding approach is shown below.
For example, taking the following three sentences:
[ today is a beautiful day ]
[ she is a beautiful girl ]
[ that girl has a red car ]
the following dictionary of 11 words is created:
[ today, is, a, beautiful, day, she, girl, that, has, car, red ]
A new sentence like:
[ your new red car is simply beautiful ]
would be converted to the following binary vector:
[ 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1 ]
which, turned back into text, corresponds to the four weighted keywords:
[ is beautiful car red ]
The semantic meaning of the sentence is lost in the transformation, but a fully connected ANN processing this input can still find a combination of its weight values that matches the best answer to each question, ignoring the order in which the words of the sentence are placed: the combination of words is processed by taking into account only the maximum weight of their occurrence in a sentence.
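The encoding of the example can be reproduced with the following Python sketch. The helper names `build_dictionary` and `encode` are hypothetical; the dictionary is built in order of first occurrence, so "red" precedes "car" here, which does not change the resulting vector for this sentence.

```python
from typing import Dict, List


def build_dictionary(sentences: List[str]) -> Dict[str, int]:
    """Assign an index to each distinct word, in order of first occurrence."""
    index: Dict[str, int] = {}
    for sentence in sentences:
        for word in sentence.split():
            index.setdefault(word, len(index))
    return index


def encode(sentence: str, index: Dict[str, int]) -> List[int]:
    """Binary vector with a 1 for every dictionary word found in the sentence."""
    vector = [0] * len(index)
    for word in sentence.split():
        if word in index:  # words outside the dictionary are ignored
            vector[index[word]] = 1
    return vector


corpus = ["today is a beautiful day",
          "she is a beautiful girl",
          "that girl has a red car"]
dictionary = build_dictionary(corpus)
vector = encode("your new red car is simply beautiful", dictionary)
kept = [word for word, i in dictionary.items() if vector[i]]
```

Words that are not in the dictionary, such as "your", "new" and "simply", leave no trace in the vector, and the word order of the sentence is not preserved.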
Another way is to create a binary vector for each word of the input question, thus building a binary matrix that preserves the order in which the single words follow each other in the original question.
To handle sentences composed of different numbers of words, the zero-padding technique is used: null vectors are added, one for each missing word, up to a value N, where N is the number of words managed; if a sentence has more words than the limit N, only the first N words are considered for the data processing.
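A minimal Python sketch of this per-word encoding with zero-padding follows. The function name `encode_sequence`, the toy dictionary and the choice of a null vector for words outside the dictionary are assumptions made for illustration.

```python
from typing import Dict, List


def encode_sequence(sentence: str, index: Dict[str, int],
                    max_words: int) -> List[List[int]]:
    """One binary row per word, truncated or zero-padded to max_words rows."""
    rows: List[List[int]] = []
    for word in sentence.split()[:max_words]:  # keep only the first N words
        row = [0] * len(index)
        if word in index:
            row[index[word]] = 1
        rows.append(row)
    # Zero-padding: add null vectors until the matrix has N rows.
    rows.extend([0] * len(index) for _ in range(max_words - len(rows)))
    return rows


index = {"that": 0, "girl": 1, "has": 2, "a": 3, "red": 4, "car": 5}
short = encode_sequence("red car", index, max_words=4)
long_ = encode_sequence("that girl has a red car", index, max_words=4)
```

Both matrices have exactly four rows: the short sentence is padded with two null vectors, while the long one is cut after its fourth word.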

3.2.4.2 Neural network based chatbot
Our model applies a normalization function to the texts to be processed. This function removes the words that have no semantic meaning and are not useful for the aims of the project. All input texts are cleaned of articles, adverbs, prepositions, special characters, accented letters, etc., so that the input text is analyzed exclusively through the keywords useful for tracing the most appropriate answer to the input question.
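Such a normalization function might look like the Python sketch below. It is an illustration under assumptions: the `normalize` name and the small `STOP_WORDS` set are hypothetical, and accented letters are reduced to their base letter rather than removed, which is only one possible interpretation of the cleaning step.

```python
import re
import unicodedata
from typing import List, Set


def normalize(text: str, stop_words: Set[str]) -> List[str]:
    """Return the keywords of a text after removing accents, symbols and stop words."""
    # Decompose accented letters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace anything that is not a lowercase letter, digit or whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", plain.lower())
    return [w for w in cleaned.split() if w not in stop_words]


# Hypothetical list standing in for the DummyWords table.
STOP_WORDS = {"the", "a", "is", "of", "my", "what", "already"}
keywords = normalize("What is the résumé of my order #42, already?", STOP_WORDS)
```

In a real deployment the stop-word list would be read from the database instead of being hard-coded.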
Figure 11 illustrates the diagram of the model used.
The learning phase (upper part of Figure 11) is started at each database change made by an operator updating the questions/answers knowledge base. The duration of the learning phase depends on the number of questions to be analyzed, but it is shorter than that of the equivalent algorithm for the creation of AIML patterns.
A further advantage in execution time is observed when the neural network is used to answer new questions.
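The learning and answering phases can be illustrated with a deliberately small, dependency-free Python sketch. It is not the Keras model used in the project: it assumes a single fully connected softmax layer without hidden layers, plain gradient descent, and a hypothetical confidence threshold of 0.6 below which the reply is marked as not valid. All names and the toy FAQ are assumptions.

```python
import math
from typing import Dict, List, Tuple


def bag_of_words(text: str, index: Dict[str, int]) -> List[float]:
    vec = [0.0] * len(index)
    for w in text.lower().split():
        if w in index:
            vec[index[w]] = 1.0
    return vec


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples: List[Tuple[List[float], int]], n_in: int, n_out: int,
          epochs: int = 200, lr: float = 0.5) -> List[List[float]]:
    # One fully connected layer; weights[k][j] links input word j to answer k.
    weights = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(epochs):
        for x, label in samples:
            probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])
            for k, row in enumerate(weights):
                grad = probs[k] - (1.0 if k == label else 0.0)
                for j, xi in enumerate(x):
                    row[j] -= lr * grad * xi
    return weights


def answer(text: str, index: Dict[str, int], weights: List[List[float]],
           answers: List[str], threshold: float = 0.6) -> Tuple[str, bool]:
    x = bag_of_words(text, index)
    probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])
    best = max(range(len(probs)), key=probs.__getitem__)
    valid = probs[best] >= threshold
    return (answers[best] if valid else "", valid)


faq = [("reset password", "Use the 'Forgot password' link."),
       ("track order", "Open 'My orders' to follow the shipment."),
       ("cancel order", "Orders can be cancelled before shipping.")]
answers = [a for _, a in faq]
index: Dict[str, int] = {}
for q, _ in faq:
    for w in q.split():
        index.setdefault(w, len(index))
samples = [(bag_of_words(q, index), k) for k, (q, _) in enumerate(faq)]
weights = train(samples, len(index), len(answers))  # learning phase
reply = answer("please reset password", index, weights, answers)
unknown = answer("opening hours", index, weights, answers)
```

The input and output sizes are derived from the dataset (dictionary size and number of answers), and a question with no known keyword yields a reply flagged as not valid, which would be handed over to an operator.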

3.2.4.3 Neural network model
The ANN model used in the project is reported in the following Python script based on the Keras library:

• NumPy - a package for scientific computing;
• PyAIML - an interpreter package for AIML.
Some features of the system were developed using Jupyter Notebook on an Anaconda distribution; this development method makes it possible to quickly test, and thus validate, some modules of the system before inserting them as components of the final prototype system (preliminary test phase).
Eclipse Oxygen was used as IDE, with the PyDev plugin installed to support the development of Django modules.
Some useful functions are adopted to test the system and to optimize the underlying dataset. With these functions it is possible to evaluate the efficiency of the AIML chatbot system versus the neural network chatbot system without a huge dataset of new questions to be submitted to the system. The most relevant function for the optimization of the dataset used in the learning stage of the A.I. module is a function that adds random noise to the questions used for training and uses the newly produced questions to verify the effectiveness of the A.I. algorithm. This function is based on the optimizer algorithm described in 3.2.2; the only difference is that the ANN is used instead of the AIML engine for the evaluation of the chatbot performance.
By comparing the answers generated by the system on all the questions in the archive, and by modifying these questions in different ways, it is possible to construct a metric that indicates the degree of redundancy of the questions stored in the archive. Starting from the results provided by this function, it is possible to create visual interfaces for operators that highlight any questions to be modified. These interfaces are useful for building a more appropriate knowledge base.
The following table shows the average results obtained by applying the described algorithm to 10 datasets of questions and answers concerning 5 different application domains; each dataset includes 180÷300 questions and answers, where a set of questions can refer to a single answer. In the table, "all kept keywords" refers to the results on the original dataset, "none kept keywords" refers to the results obtained by skipping every keyword (the system has no knowledge base), "50% kept keywords" refers to the results obtained by skipping one keyword out of every two, "66.6% kept keywords" refers to the results obtained by skipping one keyword out of every three, and so on.
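The keyword-skipping scheme can be sketched in Python as follows. The function name `skip_every` is hypothetical, and the choice of dropping the last keyword of each group of n is an assumption, since the text does not say which keyword of each group is skipped.

```python
from typing import List


def skip_every(keywords: List[str], n: int) -> List[str]:
    """Drop one keyword out of every n (n=2 keeps 50%, n=3 keeps about 66.6%)."""
    return [w for i, w in enumerate(keywords, start=1) if i % n != 0]


words = ["reset", "account", "password", "email", "link", "expired"]
half = skip_every(words, 2)        # one keyword skipped every two
two_thirds = skip_every(words, 3)  # one keyword skipped every three
none_kept = skip_every(words, 1)   # every keyword skipped
```

Submitting the reduced keyword lists to the chatbot and comparing its replies with the expected answers gives a degraded test set without writing any new question by hand.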

By analyzing the questions to which the system responded with an incorrect answer, it is possible to correct those questions, exploiting the keywords highlighted as ambiguous to replace them with similar questions based on different keywords. The analysis of the unanswered questions gives the operators an indication of the topics on which new questions and / or answers should be generated, in order to enrich the system knowledge base in an optimized way. The accuracy index provides a useful value for comparing the system's performance before and after any substantial change to the knowledge base, so that any decay in final performance can be reported to the operators.
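The three indicators discussed here (correct answers, mistaken answers, missing replies) can be computed as in the Python sketch below. The `evaluate` function, the toy engine and the definition of accuracy as the share of correct answers over all questions are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple


def evaluate(engine: Callable[[str], Optional[str]],
             test_set: List[Tuple[str, str]]) -> Dict[str, float]:
    """Share of correct answers, mistaken answers and missing replies."""
    correct = wrong = missing = 0
    for question, expected in test_set:
        got = engine(question)
        if got is None:
            missing += 1   # the chatbot gave no reply
        elif got == expected:
            correct += 1
        else:
            wrong += 1     # a reply was given, but not the expected one
    total = len(test_set)
    return {"accuracy": correct / total,
            "mistaken": wrong / total,
            "no_reply": missing / total}


# Hypothetical engine: exact lookup in a tiny table.
TABLE = {"reset password": "A1", "track order": "A2", "cancel order": "A2"}
test_set = [("reset password", "A1"), ("track order", "A2"),
            ("cancel order", "A3"), ("opening hours", "A4")]
scores = evaluate(TABLE.get, test_set)
```

Running the same evaluation before and after a change to the knowledge base gives the comparison of accuracy values mentioned above.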

Results
The approach described in this document makes it possible to obtain a highly automated management system for front-office services, applicable in any context characterized by a knowledge base consisting only of an archive of questions and answers, such as a FAQs archive. The main advantage of this approach is the high automation of the different processes underlying the system, which allows:
• an easier management of front-office procedures, by implementing an efficient multi-level architecture able to automate the self-learning training process;
• an easier data entry procedure interfaced with a portal framework (the user adds requests online in a web page, thus automatically enriching the training dataset and eliminating other manual steps in the dataset construction);
• a lower workload for operators, enabled by the chatbot that replaces operators for most users' questions (once the training dataset is efficient, the operators could theoretically be replaced entirely by the virtual assistant);
• applicability in any context, independent of the technologies used (the self-learning process could be applied in different applications where users require responses);
• the usage of the chatbot system also for colloquial usage forms.
The comparison between AIML and ANN for the automatic construction of a chatbot from the same input datasets made it possible to verify the better performance of the second approach. Using the metrics described in the previous paragraph, the ANN showed a better accuracy, a smaller number of wrong answers and a smaller number of missed answers compared to AIML, as shown in the following table:

                                 Accuracy   Mistaken answers   No reply
AIML                             85%        5%                 10%
Fully connected neural network   99%        0%                 1%

An improvement to the proposed solution may be provided by the use of a recurrent ANN, in order to exploit the memory of the neural network to better associate the responses available in the system with the questions of its users, thus recognizing the order in which the keywords are arranged in the sequences created from a user's question.

Conclusion
The authors proposed in this work an innovative self-learning multi-level virtual front office, obtained by improving a structured FAQ procedure together with an innovative chatbot system. The proposed model is suitable for industrial applications requiring the optimization of human resources activities. The goal of the self-learning model is to totally eliminate the human responses represented by level 3 of the model of Fig. 1, by constructing the training dataset dynamically and automatically. The model could be applied to Big Data and Machine to Machine (M2M) systems [29]. The experimental dataset is reported in [30].

Figure 3 - The interactions between the Chatbot and the artificial intelligence engine

Figure 11 - Neural network learning and using model