Document Type : Original Article

Authors

1 Department of Industrial Engineering, College of Engineering, University of Yazd, Yazd, Iran.

2 Department of Industrial Engineering, College of Engineering, University of Yazd, Yazd, Iran.

Abstract

Data-driven decision-making involves various types of data that must be thoroughly processed and analyzed. Data mining is a well-recognized approach for extracting such knowledge, analyzing data and transforming it into actionable insights. Among the main data mining techniques, such as classification, clustering, and association rule mining, this research focuses on classification and presents an innovative regression-based learning approach for decision tree (DT) models. DT algorithms are easy to understand and can handle different data types, including continuous, discrete, and non-numerical attributes. Despite the large number of studies that attempt to enhance the performance of DT models, a gap remains in accurately extracting knowledge from databases. This research addresses that gap by exploiting regression and the coefficient of determination (R²) within a DT. The proposed tree offers new insights into the following aspects: the split criterion, handling of continuous and discrete variables, leaf-node labeling, pruning through stopping criteria, and tree evaluation. The superiority of the proposed algorithm is demonstrated on a real-world hospital database, and a comparison with existing approaches is provided. The results show that the proposed algorithm outperforms existing methods, achieving higher accuracy and lower complexity.
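To illustrate the general idea of an R²-based split criterion, the following Python sketch scores candidate binary splits on a continuous attribute by the coefficient of determination of a piecewise-constant fit to numerically encoded class labels. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (r2_of_split, best_split), the 0/1 label encoding, and the midpoint threshold scan are all hypothetical choices.

```python
# Minimal sketch of an R^2-based split criterion for a decision tree node.
# Assumption (not taken from the paper): each candidate threshold on a
# continuous attribute is scored by the R^2 of predicting the numerically
# encoded class label with the per-branch mean.
import numpy as np

def r2_of_split(x, y, threshold):
    """R^2 of a piecewise-constant fit: predict the mean label on each side."""
    left = x <= threshold
    right = ~left
    if left.sum() == 0 or right.sum() == 0:
        return 0.0
    y_pred = np.where(left, y[left].mean(), y[right].mean())
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

def best_split(x, y):
    """Scan midpoints between neighbouring attribute values and return the
    threshold with the highest R^2 score."""
    vals = np.unique(x)                      # np.unique returns sorted values
    thresholds = (vals[:-1] + vals[1:]) / 2  # candidate cut points
    if thresholds.size == 0:
        return None, 0.0
    scores = [r2_of_split(x, y, t) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Usage example with a synthetic continuous attribute and binary class labels.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = (x > 6).astype(float)   # class labels encoded as 0/1 for the regression
t, score = best_split(x, y)
print(f"best threshold ~ {t:.2f}, R^2 = {score:.3f}")
```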

Keywords
