Hierarchical Clustering and Dimensionality Reduction for SARS-CoV-2 Genome Analysis Across Highly Affected Nations

Abstract

The global pandemic caused by the novel coronavirus SARS-CoV-2 has prompted extensive research into its genetic diversity to support drug development and vaccination strategies. In this study, we analyze genetic similarity patterns among SARS-CoV-2 genome sequences from six severely affected nations: the USA, Italy, Spain, France, Germany, and the UK. A total of 359 complete human-host SARS-CoV-2 genome sequences, ranging from 29,538 to 29,987 base pairs, are represented as k-mers with k = 2 (dinucleotides) and k = 3 (codons), and these representations are converted into 50-dimensional feature vectors. To identify intrinsic patterns within this high-dimensional dataset, we apply agglomerative hierarchical clustering with average linkage. A Silhouette score of 0.48 and a Hopkins statistic of 0.85 indicate moderate clustering tendency and structure. Four primary clusters are identified; sequences from the USA, Spain, and Italy predominantly group together, suggesting shared genetic traits. To aid interpretation, we apply two dimensionality reduction techniques, Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), which project the high-dimensional feature vectors into two-dimensional space. The visualizations confirm the clustering structure: sequences from the USA, Spain, and Italy form a distinct, tight cluster, while sequences from France, Germany, and the UK are more dispersed. This study provides a quantitative and visual understanding of SARS-CoV-2 genetic diversity across heavily impacted nations. The combination of k-mer-based feature encoding, hierarchical clustering, and dimensionality reduction offers insights that may inform more targeted therapeutic and vaccine design strategies.
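
The encoding and clustering steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmer_features` and `average_linkage`, the use of normalized overlapping k-mer frequencies, the Euclidean distance, and the toy sequences are all assumptions made for illustration. Counting every 2-mer and 3-mer over A/C/G/T yields 16 + 64 = 80 raw features; how the study arrives at 50-dimensional vectors is not stated in the abstract, so that step is omitted here.

```python
from itertools import product
from math import dist

def kmer_features(seq, ks=(2, 3)):
    """Normalized overlapping k-mer frequencies over A/C/G/T (hypothetical helper)."""
    seq = seq.upper()
    features = []
    for k in ks:
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if window in counts:  # skip windows containing ambiguous bases such as N
                counts[window] += 1
        total = sum(counts.values()) or 1  # avoid division by zero on very short input
        features.extend(counts[m] / total for m in kmers)
    return features

def average_linkage(vectors, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose mean pairwise Euclidean distance is smallest (assumed metric)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist(vectors[i], vectors[j])
                            for i in clusters[a] for j in clusters[b])
                mean = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy sequences (invented, not genome data): two AT-rich and two GC-rich strings.
toy = ["ATATATATATAT", "ATATATATATTA", "GCGCGCGCGCGC", "GCGCGCGCGCCG"]
vectors = [kmer_features(s) for s in toy]
labels = average_linkage(vectors, n_clusters=2)
print(labels)
```

The naive merge loop rescans all cluster pairs on every iteration, which is acceptable for a few hundred sequences but would normally be replaced by a library routine for larger datasets.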

Authors

  • Venkataramanan V, Department of Information Technology, K J Somaiya School of Engineering, Somaiya Vidyavihar University, Mumbai 400076, India
  • Srinivasan J, Department of Computer Applications, Madanapalle Institute of Technology and Science (MITS), Madanapalle 517325, Andhra Pradesh, India
  • Ramadevi K, Department of Information Technology, Panimalar Engineering College, Anna University, Chennai 600025, Tamil Nadu, India
  • Dillibabu M, Department of Information Technology, Panimalar Engineering College, Anna University, Chennai 600025, Tamil Nadu, India

DOI:

https://doi.org/10.31449/inf.v49i13.7856

Published

11/23/2025

How to Cite

Hierarchical Clustering and Dimensionality Reduction for SARS-CoV-2 Genome Analysis Across Highly Affected Nations. (2025). Informatica, 49(13). https://doi.org/10.31449/inf.v49i13.7856