Empirical Analysis of Dataset Size Impact on Classification Performance in Precision Agriculture Using Machine Learning Models
Abstract
This study empirically investigates the relationship between dataset size and classification performance in precision agriculture applications. Seven machine learning models (Decision Tree, Random Forest, Logistic Regression, SVM, Gaussian Naïve Bayes, KNN, and AdaBoost) were evaluated on seven agricultural datasets ranging from 100 to 4,000 samples. Performance was assessed using five metrics: accuracy, precision, recall, F1-score, and ROC-AUC. The methodology involved two phases: an initial evaluation using the complete datasets, followed by a systematic analysis of subdivided datasets to examine how performance varies with data volume. Statistical analysis using Pearson correlation coefficients revealed no significant correlation between dataset size and model performance (r = 0.12, p > 0.05). Results indicate that the Random Forest and Decision Tree models achieved the highest average performance across datasets (88.48% and 85.37% accuracy, respectively). The findings suggest that dataset quality and problem characteristics have a greater influence on classification performance than dataset size alone in precision agriculture applications.
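The correlation test mentioned in the abstract can be illustrated with a minimal sketch. It computes the Pearson coefficient between dataset size and accuracy and compares the resulting t statistic with the two-tailed 5% critical value for seven observations (5 degrees of freedom). The function names `pearson_r` and `t_statistic` and all numeric values below are assumptions made for illustration; they are placeholder data, not the study's measurements, and the authors' actual analysis code is not shown in the source.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for the null hypothesis rho = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical placeholder values (NOT the study's data): one entry per dataset.
sizes = [100, 500, 1000, 1500, 2200, 3000, 4000]
accuracies = [0.91, 0.78, 0.88, 0.80, 0.93, 0.76, 0.87]

r = pearson_r(sizes, accuracies)
t = t_statistic(r, len(sizes))

# Two-tailed 5% critical value of Student's t with 5 degrees of freedom.
T_CRIT_DF5 = 2.571
significant = abs(t) > T_CRIT_DF5

print(f"r = {r:.3f}, t = {t:.3f}, significant at 5%: {significant}")
```

With only seven datasets the test has little power, so a non-significant result of this kind indicates that no linear trend was detected rather than proving that size has no effect.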
DOI: https://doi.org/10.31449/inf.v49i23.8137
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika







