Evaluating Public Sentiments of Covid-19 Vaccine Tweets Using Machine Learning Techniques

S. K. Akpatsa et al.


Introduction
The Covid-19 outbreak in late 2019 has led to tens of millions of confirmed cases and millions of deaths worldwide. The economic and social disruption of the pandemic has altered the way of life for many around the globe as public health protocols such as social distancing, wearing of masks, and travel restrictions were introduced. Although adherence to these protocols has effectively controlled the spread of the virus, there has been a global effort to develop a Covid vaccine to fight the virus head-on and build immunity against it. Studies have indicated that at least 70% of the world's population needs to be vaccinated to achieve herd immunity [1]. Consequently, some major pharmaceutical companies and research institutions across the globe have announced the development of Covid vaccines that promise to help ease restrictions and return the world to pre-pandemic routines.
While these developments inspire hope and optimism, other obstacles threaten the fight to eradicate the deadly virus. A significant proportion of people are unsure of the safety of the Covid vaccine, and skepticism on social media has left the vaccine rollout exercise facing fear, hesitancy, and opposition [2]. Meanwhile, public opinion and support for the Covid-19 vaccine are essential, as they may affect whether vaccinations can be administered to populations large enough to achieve herd immunity.
As the Covid-19 pandemic continues to spread globally, Covid-related issues have received increasing attention from the research community. Although some studies have highlighted the socio-economic impact of the pandemic [3]-[9], analysis of Covid-19 vaccine-related issues is rare. As the social media platform of choice for many, Twitter plays a vital role in disseminating health information in the fight against Covid-19. There is an urgent need to analyze how issues related to the Covid-19 vaccine have been discussed on Twitter to better understand public perceptions, concerns, and issues that may affect people's willingness to get vaccinated. Besides, identifying popular themes in tweets related to Covid-19 vaccines can play a vital role in guiding vaccine education and communication. This study examines general sentiments and opinions related to the Covid-19 vaccine rollout program by analyzing English tweets collected between January 21, 2021, and January 31, 2021. The study also identifies algorithms with suitable metrics to evaluate the performance of supervised Machine Learning classifiers on the Covid-19 vaccine tweets. The findings will be useful in assisting governments and other public health policymakers to understand trends in social media data related to the Covid-19 vaccine and make timely adjustments to vaccine education to boost public confidence in the vaccination exercise.

Related works
Studies on Twitter sentiment analysis provide valuable insights into real-world events and people's perceptions of these events. A review of existing literature indicates that various studies related to Twitter sentiment analysis use popular machine learning algorithms such as Logistic Regression, Support Vector Machines, and Naive Bayes to predict sentiment from tweets [10]-[14]. These algorithms generally achieve good accuracy with fewer computational resources and are regarded as the baseline learning methods in sentiment analysis of Twitter data [15].
Recent studies demonstrate that deep learning models allow sentiment analysis systems to capture complex linguistic features and read context within a text, achieving better accuracy and performance [16]. Researchers have used these techniques to analyze unstructured data from social media posts such as Twitter [17], [18]. A related study implemented a quantum-inspired sentiment representation framework that can model both the semantic and the sentiment information of subjective natural language text [19]. Experimental results demonstrate the effectiveness of the framework, as it significantly outperforms most state-of-the-art baselines.
Following the successful applications of these algorithms on Twitter data, an increasing number of studies have used similar approaches to analyze and understand the public response and discussions on Twitter concerning Covid-19. For example, a study analyzed the global sentiments of tweets related to Covid-19 to understand how people's sentiment in different countries has changed over time [20]. Two types of analysis were performed concerning the positive and negative sentiments, fear, and trust emotions exhibited in tweets related to Work from Home and Online Learning. The first was exploratory data analysis to provide insight into the number of daily confirmed cases. The second evaluated different deep learning methods for sentiment classification on the dataset. The results showed that the general positive sentiments towards Work from Home and Online Learning have been consistently higher than negative sentiments.
A similar analysis was performed on social media posts to increase understanding of public awareness of COVID-19 pandemic trends. The research uncovers meaningful themes of concern posted by Twitter users in English during the pandemic [21]. The analyses included frequency of keywords, sentiment analysis, and topic modeling to identify and explore discussion topics over time. The results indicate that people have a negative outlook toward COVID-19.
A related study applied machine learning techniques to investigate the psychological reactions of Twitter users to Covid-19 [22]. Several salient topics were identified and categorized into themes, including "confirmed cases," "Covid-19 related death," "early signs of the outbreak," "economic impact of the pandemic," and "preventive measures." The analysis shows that fear of the unknown nature of the coronavirus is dominant in all topics. Other successful examples of studies analyzing Twitter sentiments to determine the impact of Covid-19 on daily life are well documented [6], [7], [9], [12], [23], [24]. These studies proved essential in assisting governments in making informed choices on managing the Covid-19 pandemic situation.
However, sentiments emerging on Twitter about the Covid-19 vaccine rollout program remain less explored in the literature. Building on the related works, this study explores public reactions and discussions on Twitter concerning the Covid-19 vaccine rollout program. The performance of different machine learning classifiers on a Covid-19 vaccine Twitter dataset is also evaluated.

Research design
A purposive sampling technique was adopted in gathering Covid-19 Twitter data published between 21st January and 31st January 2021. Our data analysis is divided into two broad parts. The first part covers an exploratory analysis of tweets, data visualizations, and a description of the key characteristics of the Covid-19 vaccine Twitter data. This approach aims to present insight and help understand public reactions and discussions on Twitter concerning Covid-19 vaccine rollout programs worldwide. The second part deals with the sentiment classification of tweets using supervised machine learning algorithms. We chose a supervised machine learning approach because our data is labeled. The supervised learning technique allows us to measure the chosen classifiers' accuracy scores while performing sentiment classification.

Data collection
Twitter offers a variety of APIs to provide Twitter data access, including reading tweets and accessing user profiles. This study uses the Twitter API and a Python script to access Covid-19 vaccine Twitter comments. A query for a hashtag (#Covid19vaccine) was run daily to collect a large number of tweet samples from around the globe. The study excluded tweets written in languages other than English. Our approach to obtaining the dataset is based on its availability and accessibility and on how well the research community accepts the approach.

Data labeling
The labeling process aims to assign positive or negative labels to tweets. This study used human annotation to assign the value 1 (positive class) to text with positive sentiment and the value 0 (negative class) to text with negative sentiment. Some of the extracted tweets were duplicated, while others gave contradictory interpretations and proved difficult to label. As a result, some data points were removed from the dataset. The final dataset contains 15239 unique tweets, with 10519 labeled as positive and 4720 labeled as negative. The dataset contains two columns (text and labels). The text column contains the text to which a label applies. These texts are transformed into features used by the model during training and prediction. The label column contains either 1 or 0, representing the sentiment of the tweet being classified. An example of the tweet dataset obtained can be seen in Figure 1.
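As a quick arithmetic check on the label counts reported above, the sketch below computes the class proportions and the accuracy that a trivial rule, always predicting the majority class, would reach. The variable names are illustrative and are not taken from the study.

```python
# Label counts reported for the final dataset.
n_positive = 10519
n_negative = 4720
n_total = n_positive + n_negative  # 15239 unique tweets

positive_share = n_positive / n_total
negative_share = n_negative / n_total

# Accuracy of a trivial baseline that always predicts the majority class.
majority_baseline_accuracy = max(positive_share, negative_share)

print(f"positive: {positive_share:.2%}, negative: {negative_share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.2%}")
```

Roughly 69% of the labels are positive, so accuracy on its own can look strong even for an uninformative rule. This is one reason to read precision, recall, and f1-score alongside accuracy.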

Data preprocessing
Analyzing sentiment in tweets generally requires some fundamental cleaning and preprocessing steps to improve the quality of the dataset [25]. This includes cleaning and formatting the data before feeding it into a machine learning algorithm. Twitter datasets are often noisy, with many irregularities such as punctuation marks, symbols, @links, stop words, and other special characters that are irrelevant for sentiment analysis. The collected tweets are filtered using a Python script that preprocesses and cleans the dataset to increase precision. Noise such as white space, punctuation, hashtags, URLs, special characters, hyperlinks, and stop words was removed. Also, tokenization using n-grams was applied to segment the text data and create a new document with the set of n-grams, while lemmatization was applied to determine the base forms of words.
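The cleaning steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not the script used in the study: the function name `clean_tweet`, the tiny `STOP_WORDS` set, and the sample tweet are all hypothetical. The sketch assumes that only the `#` symbol is dropped while the hashtag word is kept, and it omits lemmatization, which would normally rely on an external NLP library.

```python
import re

# Small illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "on", "for", "my", "i"}

def clean_tweet(text):
    """Return the lowercase tokens of a tweet with common Twitter noise removed."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs and hyperlinks
    text = re.sub(r"@\w+", " ", text)                    # user mentions
    text = text.replace("#", " ")                        # hashtag symbol (word kept: assumption)
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # punctuation and special characters
    tokens = text.split()                                # also collapses extra white space
    return [t for t in tokens if t not in STOP_WORDS]

# Hypothetical tweet, for illustration only.
print(clean_tweet("Got my first dose! https://t.co/abc @NHS #Covid19vaccine"))
```

The order of the substitutions matters: URLs and mentions are removed before punctuation is stripped, otherwise fragments such as `https` or `t co` would survive as tokens.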

Feature extraction
Feature extraction is the process of extracting lexical features such as n-grams and transforming them into a feature set that is usable by a machine learning classifier. It plays a crucial role in text classification and directly influences the text classification model [26]. Term Frequency-Inverse Document Frequency (TF-IDF) is a popular feature extraction method commonly used in text classification and sentiment analysis. TF-IDF evaluates how important a word is to a document in a dataset by converting the textual representation of information into a vector space. This study applied the TF-IDF Vectorizer Python module of Scikit-learn to extract TF-IDF features. First, we used unigrams and bigrams as the feature set to represent context in the Twitter data. The features are weighted with TF-IDF and used to train the classifier. After this, the trained classifier is used to predict the test data.
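To make the TF-IDF weighting over unigrams and bigrams concrete, here is a hand-rolled sketch in plain Python. It is an illustration only; the study itself used Scikit-learn's vectorizer. The sketch assumes raw term counts, the smoothed inverse document frequency `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each document vector, which is the weighting Scikit-learn documents as its default. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) found in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Return one {term: weight} dict per document, using unigrams and bigrams."""
    term_counts = []
    for doc in docs:
        tokens = doc.lower().split()
        term_counts.append(Counter(ngrams(tokens, 1) + ngrams(tokens, 2)))

    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for counts in term_counts for term in counts)

    vectors = []
    for counts in term_counts:
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical, already-cleaned documents.
docs = ["got first dose today", "second dose side effects", "first dose done"]
vectors = tfidf(docs)
```

In this toy corpus, `dose` occurs in every document and therefore receives a lower weight than `today`, which occurs in only one. Down-weighting ubiquitous terms in this way is the purpose of the inverse document frequency factor.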

Sentiment classification using machine learning
The next step after the feature extraction is to feed the feature vectors into the machine learning classifiers to perform sentiment classification. We classified the vaccine tweets using Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), and Random Forest (RF) machine learning models, and their performances were compared. The Scikit-learn Python library, an open-source machine learning package that provides access to machine learning classification algorithms, was used. In each experiment, the training set is used to optimize and train the machine learning algorithms, while the test set is used to evaluate the performance of the models.
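The train-and-test workflow described above might look roughly like the following Scikit-learn sketch. Several details are assumptions, since the text does not specify them: the linear SVM variant (`LinearSVC`), the multinomial Naïve Bayes variant, all hyperparameters, and the 75/25 split. The eight toy tweets are placeholders and not data from the study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy placeholder data (1 = positive, 0 = negative); not from the study.
texts = [
    "got first dose feeling great", "second dose done so grateful",
    "vaccine rollout going well", "one step closer to normal",
    "worried about side effects", "vaccine made me feel sick",
    "do not trust this vaccine", "rollout is a total mess",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "NB": MultinomialNB(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracies = {}
for name, clf in classifiers.items():
    # Each pipeline fits TF-IDF (unigrams + bigrams) on the training set only.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

print(accuracies)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the TF-IDF vocabulary and weights from being influenced by the test tweets. Accuracies on a toy set this small carry no meaning; the block only shows the shape of the workflow.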

Performance evaluation of machine learning classifiers
The test data was evaluated to better understand how well the classifiers performed after training. The standard metrics used to evaluate the models include accuracy, precision, recall, and f1-score.
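All four metrics named above can be derived from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below uses the standard definitions. The function name and the confusion counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and f1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical confusion counts, for illustration only.
metrics = classification_metrics(tp=80, tn=50, fp=20, fn=10)
print(metrics)
```

The guards against empty denominators return 0.0 when a classifier never predicts the positive class or when the test set contains no positive examples. That is a convention chosen for this sketch, not something stated in the text.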

Principal result
We performed data analysis on the collected tweets to identify public sentiments, keyword associations, and social media trends related to the Covid-19 vaccine rollout program. We searched for insights using descriptive text analysis and data visualization such as word clouds and n-gram representations. Below are brief descriptions of the data analyses on the processed Twitter dataset.

Word cloud representation of tweets
A word cloud was used to visualize how words are distributed across the dataset. The most recurring words give us insight into how user sentiments about the vaccine rollout program evolved on Twitter over the study period (Figure 2). The main goal is to examine what trend can be inferred from the word frequency in our Twitter data. The word cloud shows that, along with the search word 'covid19vaccine', words such as 'vaccine', 'dose', 'first', and 'second' had many mentions. These words emphasize the awareness of the number of vaccine doses required to be fully inoculated. Names of the initially approved Covid-19 vaccines (Pfizer, Moderna, AstraZeneca) also dominated Twitter during the study period. In addition, some Twitter users highlighted the crucial roles governments, healthcare professionals, and other relevant state institutions played in the vaccination exercise, as words like 'government', 'state', 'doctor', and 'hospital' were among the most frequently used words.

N-gram representation of tweets
N-grams are sequences of n consecutive words in a textual document. To identify the most popular n-grams, we built a list of unique words in our Twitter data and counted each word's occurrences in the corpus. Since the Twitter dataset for this study is about the Covid-19 vaccine, Covid-19 related keywords such as 'covid19vaccine', 'covidvaccine', 'covid19', 'covid', and 'coronavirus' were excluded so that they do not skew our word frequency analysis. We chose uni-grams (n=1), bi-grams (n=2), and tri-grams (n=3) for further analysis to understand which words were used the most, separately and in combination, regardless of grammatical structure and semantic meaning. Figure 3 shows the most popular n-grams related to the Covid-19 vaccine tweets. These n-grams highlight how vaccine-related themes such as 'vaccine distribution,' 'vaccine administration,' and 'health engagements' dominated Twitter discussions during the study period.
From the bi-grams, phrases such as 'first dose', 'second dose', and 'receive first' imply that some people have already received their first or second dose of the vaccine. The phrase 'side effects' also gained significant attention among Twitter users over the period. This discussion signifies the fear of potential vaccine side effects that could put the vaccination program at risk. From the tri-grams, phrases such as 'get first dose', 'received second dose', and 'one step closer' indicate how well people have embraced the vaccination process. Also, phrases such as 'first consignment covishield', 'largest vaccination drive', and 'world largest vaccination' emphasize the global perspective of the fight against the Covid-19 pandemic. The phrase 'vaccine immunity duration' raises concern about how long Covid-19 vaccine-induced immunity will last.
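The n-gram counting described above can be sketched with `collections.Counter`. This is a minimal illustration with hypothetical token lists and a hypothetical helper name, `top_ngrams`. It assumes the excluded keywords are dropped before n-grams are formed, so the words on either side of a removed keyword become neighbours; the text does not say which order was used.

```python
from collections import Counter

# Search-related keywords excluded from the frequency analysis.
EXCLUDED = {"covid19vaccine", "covidvaccine", "covid19", "covid", "coronavirus"}

def top_ngrams(token_lists, n, k=3):
    """Return the k most frequent n-grams across all tokenized tweets."""
    counts = Counter()
    for tokens in token_lists:
        kept = [t for t in tokens if t not in EXCLUDED]
        counts.update(" ".join(kept[i:i + n]) for i in range(len(kept) - n + 1))
    return counts.most_common(k)

# Hypothetical tokenized tweets, for illustration only.
tweets = [
    ["got", "first", "dose", "covid19vaccine"],
    ["first", "dose", "done", "covid19"],
    ["second", "dose", "side", "effects"],
]

print(top_ngrams(tweets, n=1, k=1))
print(top_ngrams(tweets, n=2, k=1))
```

Calling the same helper with `n=3` gives tri-grams, so a single function covers all three n-gram sizes used in the analysis.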

Sentiment classification
We performed sentiment classification on the dataset with four different machine learning classifiers (LR, RF, SVM, and NB) and evaluated their performance. To ensure that the models were learning the patterns in the data and not fitting to the noise, we used the k-fold cross-validation technique to assess the classifiers. In k-fold cross-validation, the dataset is divided into k subsets and the training procedure is repeated k times. In each iteration, the model is trained on k-1 of the subsets and the resulting model is validated on the remaining subset, so that every subset serves as the validation set exactly once. All four classifiers were cross-validated five (5) times, and the experimental result is shown in Table 1.
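The fold structure of k-fold cross-validation can be made explicit with a small index-splitting sketch: in each of the k rounds, one fold is held out for validation and the other k-1 folds form the training set. The generator name `k_fold_indices` is hypothetical, and the sketch deliberately omits the shuffling and stratification that a library implementation would normally offer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Example: 10 samples, 5 folds (k = 5, as in the study).
folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Averaging the validation score over the k rounds gives a single estimate that depends less on one particular split than a single train/test partition does.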
The result shows that the SVM classifier reaches the highest accuracy of 83.74, while the Naïve Bayes classifier has the lowest accuracy of 78.90 among all the classifiers (Table 1). Similarly, the predictive accuracy of the classifiers is determined to find out which model performs best in classifying the Covid-19 vaccine tweets (Table 2). Figure 4 clearly identifies SVM as the best-fit machine learning classifier on the Covid-19 vaccine Twitter dataset.
[12], our study demonstrates that machine learning algorithms could be leveraged to study the evolving public discourse and sentiments during the Covid-19 vaccine rollout program. Some prior Covid-related studies have described themes such as mask-wearing, social distancing, regular washing of hands, and the need for a Covid vaccine as the most effective measures to stop further spread of the Covid-19 virus [4], [6]. Our study identifies a new trend of Covid-19-related discussions on Twitter during the study period. These discussions mainly focused on: (1) health information about the vaccine, (2) vaccine distribution and administration, (3) the number of vaccine doses required for immunity, and (4) questions about vaccine availability. Together with other vaccine education efforts, these themes are essential for the overall success of the vaccination program. Our n-grams were consistent with Covid-19-related studies [6], [24] that examine discussions and sentiments that emerged on Twitter and concerns about the safety measures to adopt when reopening from lockdown.

Practical implications
This study set out to examine trends that can be inferred from the targeted Twitter dataset. The study demonstrates that most of the collected tweets represent positive sentiments, which indicates the public's overall confidence in the Covid-19 vaccine rollout program. However, the n-gram representation results also suggest that a significant proportion of the public expresses negative sentiments about the Covid-19 vaccine on Twitter. There were questions about the vaccine's safety and efficacy, as potential side effects dominated discussions over the period. Additionally, there were concerns about the vaccines' long-term protection against Covid-19. As the Covid-19 vaccine rollout program continues, more efforts from governments and relevant authorities are required to answer ongoing questions regarding vaccine choices, vaccine hesitancy, vaccine side effects, and the durability of the immune response to Covid-19 vaccines.
Twitter remains a great source of information for many and can be used to explore the levels of public awareness and sentiments about Covid-19 and its related themes. During a pandemic where people may be confined to their homes, people's perceptions of the vaccine are more likely to be shaped by social media and online information [27]. Given the need for a worldwide Covid-19 vaccination program, understanding the threat of vaccine misinformation on social media and its negative influence on the general public's vaccine uptake is important. To encourage a positive vaccine attitude, it is suggested that social media firms devise schemes to promote accurate vaccine information while removing vaccine misinformation from their platforms. Besides, governments worldwide should engage their citizens with key, accurate, and timely health information regarding the vaccination program.
As evidence is still evolving, understanding the potential challenges of the worldwide Covid-19 vaccination program is crucial in helping governments and relevant institutions develop schemes that will allay citizens' apprehensions about Covid-19 inoculations.
While the machine learning classifiers perform relatively well on the dataset, our analysis was limited to a small collection of tweets written in English. Our findings may not be a true reflection of public sentiments on the vaccination program due to the risk of missing vital information available from tweets written in other languages. Future work might consider the evaluation of large-scale Covid-19 vaccine Twitter datasets using deep learning models.

Conclusion
Our work focused on examining public discourse and reactions on Twitter concerning the Covid-19 vaccine rollout program. Popular machine learning algorithms were applied to predict sentiments from the collected tweets. Most Twitter users were optimistic during the study period, although some negative sentiments threatened the overall success of the vaccine implementation program. Also, we identified a new Covid-related discussion trend that focuses on vaccine distribution and administration, the number of vaccine doses required for immunity, and other health information about the vaccine. These findings can be a useful tool to help policymakers and relevant authorities anticipate the appropriate measures that can be taken to mitigate any potential challenges to the vaccine rollout program. The pressing need to achieve herd immunity against Covid-19 requires timely reactions to address the concerns of the general public to boost trust and confidence in the vaccination program.
The metrics were calculated in terms of positives and negatives. The classification accuracy is the sum of true positives and true negatives divided by the total number of data points in the test set. Precision is the number of correctly classified positive examples divided by the total number of examples that are classified as positive. Recall is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set. The f1-score balances precision and recall and indicates how precise and robust the classifiers were. The mathematical representation of the metrics is presented below.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.

Results and discussions

Table 2: Evaluation of the ML Models.

Comparison to prior works
Our findings are consistent with studies using social media data to assess public responses to Covid-19. Compared with a study that implemented a Naïve Bayes model to analyze Twitter sentiments concerning Covid-19