Data Mining Approach to Effort Modeling On Agile Software Projects

Hrvoje Karna, Sven Gotovac, Linda Vicković


Software production is a complex process. Accurate estimation of the effort required to build a product, regardless of its type and the methodology applied, is one of the key problems in software engineering. This study presents an approach to effort estimation on an agile software project using local data and data mining techniques, in particular the k-nearest neighbor algorithm. The process is iterative: predictive models are built from data sets collected in previously executed project cycles, and these models are then used to generate estimates for the next development cycle. The data enrichment process proved useful, as the effort predictions show a decrease in estimation error compared to estimates produced solely by the estimators. The results suggest that other organizations can build similar models from the local data at hand and in this way optimize the management of software product development.
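
The iterative scheme described above can be illustrated with a minimal sketch. It is not the study's implementation: the feature layout (an expert estimate in hours plus a complexity rating), the sample values, the unweighted Euclidean distance, and the names `knn_predict`, `walk_forward`, and `mean_abs_error` are all assumptions made for illustration. Each development cycle is predicted only from the cycles that preceded it.

```python
import math

def knn_predict(history, query, k=3):
    """Predict effort as the mean actual effort of the k nearest past work items.

    history: list of (feature_tuple, actual_effort) pairs from earlier cycles.
    query:   feature tuple of the item to estimate.
    """
    nearest = sorted(history, key=lambda row: math.dist(row[0], query))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)

def walk_forward(cycles, k=3):
    """For every cycle after the first, predict its items from all earlier cycles.

    Returns a list of (actual_effort, predicted_effort) pairs.
    """
    results = []
    for i in range(1, len(cycles)):
        history = [item for cycle in cycles[:i] for item in cycle]
        for features, actual in cycles[i]:
            results.append((actual, knn_predict(history, features, k)))
    return results

def mean_abs_error(pairs):
    """Mean absolute difference between actual and predicted effort."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)

# Hypothetical data: features are (expert_estimate_hours, complexity_rating),
# paired with the actual effort in hours. Values are invented for illustration.
cycles = [
    [((4.0, 1.0), 5.0), ((8.0, 2.0), 10.0), ((2.0, 1.0), 2.5)],
    [((6.0, 2.0), 7.0), ((3.0, 1.0), 4.0)],
    [((5.0, 1.0), 6.0)],
]

if __name__ == "__main__":
    pairs = walk_forward(cycles, k=3)
    print("model MAE:", mean_abs_error(pairs))
```

Including the expert's own estimate among the features is one possible reading of combining human estimates with project data. In practice the features would usually be scaled before distances are computed, and the model's error would be compared against the error of the expert estimates alone.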






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.