Hierarchical Clustering and Dimensionality Reduction for SARS-CoV-2 Genome Analysis Across Highly Affected Nations

Venkataramanan V, Srinivasan J, Ramadevi K, Dillibabu M

Abstract


The global pandemic caused by the novel coronavirus SARS-CoV-2 has prompted extensive research into the virus's genetic diversity to support drug development and vaccination strategies. In this study, we analyze genetic similarity patterns among SARS-CoV-2 genome sequences from six severely affected nations: the USA, Italy, Spain, France, Germany, and the UK. A total of 359 complete human-host SARS-CoV-2 genome sequences, ranging from 29,538 to 29,987 base pairs in length, are encoded using a k-mer representation with k = 2 (dinucleotides) and k = 3 (trinucleotides, the length of a codon). These representations are converted into 50-dimensional feature vectors. To identify intrinsic patterns in this high-dimensional dataset, we apply agglomerative hierarchical clustering with average linkage. A Silhouette score of 0.48 and a Hopkins statistic of 0.85 indicate a moderate cluster structure and a tendency of the data to cluster. Four primary clusters are identified. Sequences from the USA, Spain, and Italy predominantly group together, suggesting shared genetic traits. To aid interpretation, we apply two dimensionality reduction techniques, Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), to project the feature vectors into 2-dimensional space. The resulting visualizations are consistent with the clustering: sequences from the USA, Spain, and Italy form a distinct, tight cluster, while sequences from France, Germany, and the UK are more dispersed. This study provides a quantitative and visual account of SARS-CoV-2 genetic diversity across heavily impacted nations. The combination of k-mer-based feature encoding, hierarchical clustering, and dimensionality reduction yields similarity patterns that may inform more targeted therapeutic and vaccine design strategies.
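
The encoding and clustering steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the 50-dimensional vectors are derived, so this example simply concatenates all 16 dinucleotide and 64 trinucleotide frequencies into an 80-dimensional vector, which is an assumption. The function names (`kmer_frequencies`, `featurize`, `average_linkage`), the Euclidean distance, and the toy sequences are likewise hypothetical, and the clustering loop is a naive version written for readability rather than speed.

```python
from itertools import product
from math import dist

ALPHABET = "ACGT"

def kmer_frequencies(seq, k):
    """Normalized counts of every overlapping k-mer over A/C/G/T, in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on very short input
    return [counts[m] / total for m in kmers]

def featurize(seq):
    """Assumed encoding: 16 dinucleotide + 64 trinucleotide frequencies (80 values)."""
    return kmer_frequencies(seq, 2) + kmer_frequencies(seq, 3)

def average_linkage(vectors, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest mean pairwise Euclidean distance."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist(vectors[i], vectors[j])
                            for i in clusters[a] for j in clusters[b])
                mean = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy sequences standing in for genomes (illustrative only, not real data).
toy = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGTTTT", "TTTTGGGGTTTG"]
features = [featurize(s) for s in toy]
clusters = average_linkage(features, 2)
print(clusters)
```

For a dataset of several hundred genomes, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would be the more practical choice than this cubic-time loop.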


DOI: https://doi.org/10.31449/inf.v49i13.7856

This work is licensed under a Creative Commons Attribution 3.0 License.