Gujarati Optical Character Recognition Using Efficient Text Feature Extraction Approaches

Avani Samir Bhuva, Dhirendra Mishra

Abstract


India is the most populous country, with 22 official regional languages, and retrieving information from text in these languages is a challenging task. Approximately 62 million people worldwide speak Gujarati. This research focuses on understanding and extracting meaningful text features from the Gujarati OCR dataset, which comprises 23,100 samples generated using the TERAFONT-VARUN font and augmented with horizontal and vertical shifts and rotational transformations. The study explores three levels of text feature extraction: mid-level features using the Integrated Shape Numeric Encoding Approach (ISNEA) and Fusion of Region Geometric Features (FRGF), mid-high-level features via the One-Bit Frequency Count Approach (OBFCA), and high-level features through a deep learning-based CNN model. The extracted features are stored in a structured Gujarati Text Feature Vector Dictionary. ISNEA struggles with characters containing maatra modifiers, reaching 87.5% accuracy on standardized OCR images. OBFCA resolves the maatra issue through row-wise binary frequency computation, yielding 90.11% accuracy. FRGF significantly outperforms both ISNEA and OBFCA, with 93.5% accuracy using Eccentricity as a single feature and 92.75% using Eccentricity + Perimeter as a fused feature. Euclidean distance and cosine similarity were also used to measure the similarity between the extracted text features. Comparative analysis against existing methods confirms the superiority and robustness of the proposed approaches in Gujarati OCR feature extraction.
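As a rough illustration of the row-wise binary frequency idea and the two similarity measures named above, the following minimal Python sketch counts the 1-valued pixels in each row of a binarized glyph and compares the resulting vectors. It is an assumption-laden reading of the abstract, not the authors' implementation: the function names, the toy 3x4 glyphs, and the choice to use raw per-row counts as the feature vector are all hypothetical.

```python
import math


def row_one_counts(binary_image):
    """Count the 1-valued pixels in each row of a binary image.

    Hypothetical stand-in for a row-wise binary frequency feature;
    the paper's actual OBFCA definition may differ.
    """
    return [sum(row) for row in binary_image]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy binarized glyphs (illustrative only, not taken from the dataset).
glyph_a = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
glyph_b = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]

features_a = row_one_counts(glyph_a)
features_b = row_one_counts(glyph_b)

print(features_a, features_b)
print(euclidean_distance(features_a, features_b))
print(cosine_similarity(features_a, features_b))
```

In this sketch a smaller Euclidean distance and a cosine similarity closer to 1 both indicate that two glyphs have similar row profiles; how the paper thresholds or combines these measures is not stated in the abstract.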




DOI: https://doi.org/10.31449/inf.v49i28.8341

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.